Hi, I forgot to mention this earlier, and I think I may have posted this question before. Anyway, is it possible to configure Stanbol to run at http://xxx.xxx.xxx.xxx:9999/testing/ instead of http://localhost:9999/ ?
Because of company policy I need to define the application URL, and it must look something like http://xxx.xxx.xxx.xxx:9999/testing/. That means (for example) that I would need http://xxx.xxx.xxx.xxx:9999/testing/enhancer/engine instead of http://localhost:9999/enhancer/engine.

Best,
Srecko

On Tue, Mar 13, 2012 at 1:29 PM, srecko joksimovic <[email protected]> wrote:
> Hi,
>
> Ok, looks like I didn't understand that. It's clear now.
>
> Thank you.
>
> Srecko
>
>
> On Tue, Mar 13, 2012 at 1:16 PM, Rupert Westenthaler
> <[email protected]> wrote:
>
>> Hi
>>
>> On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic
>> <[email protected]> wrote:
>> > Hi Rupert,
>> >
>> > and thank you for the answer. I need to read a few more things, but
>> > the answer helped me a lot.
>>
>> great!
>>
>> > If I understood well, the search is case sensitive, and if I need case
>> > insensitive search, I will have to implement application specific logic?
>> >
>>
>> Keyword searches via the content hub and Solr queries for the field
>> "text_all" are case insensitive!
>>
>> Only searches for the fields "organizations_t", "people_t" and
>> "places_t" are case sensitive. However I would consider this a bug,
>> and the comment (**) in my previous mail suggests correcting that.
>>
>> best
>> Rupert
>>
>> > Best,
>> > Srecko
>> >
>> >
>> > On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
>> > <[email protected]> wrote:
>> >>
>> >> Hi Srecko, all
>> >>
>> >> @Stanbol developers: note the (*) and (**) comments at the end of this mail.
>> >>
>> >> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
>> >> <[email protected]> wrote:
>> >> >
>> >> > Until now I have developed a few applications for annotating
>> >> > documents using Apache Stanbol. Now I need to add indexing and
>> >> > search capabilities.
>> >> >
>> >> > I tried ContentHub
>> >> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
>> >> > in the way that I started the full launcher and accessed the web
>> >> > interface. There are a few possibilities: to provide text, to upload
>> >> > a document, to provide a URI… I tried to upload a few txt documents.
>> >> > I didn’t get any extracted entities,
>> >>
>> >> The content hub shows the number of extracted enhancements. This can
>> >> easily be used as an indicator of whether the Stanbol Enhancer was
>> >> able to extract knowledge from the parsed content.
>> >>
>> >> Typical reasons for not getting the expected enhancement results are:
>> >>
>> >> 1. Unsupported content type: the current version of Apache Stanbol
>> >> uses the
>> >> [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
>> >> to process non-plain-text content parsed to the Stanbol
>> >> Enhancer/Contenthub. So everything that is covered by Apache Tika
>> >> should also work just fine with Apache Stanbol.
>> >>
>> >> 2. Unsupported language: some enhancement engines (e.g. NER, Named
>> >> Entity Recognition) only support certain languages. If the parsed
>> >> content is in another language, they will not be able to process it.
>> >> With the default configuration of Stanbol only English (and, in the
>> >> newest version, Spanish and Dutch) documents will work. Users with
>> >> custom configurations will also be able to process documents in
>> >> other languages.
>> >>
>> >> > but search (using Web View) worked fine.
>> >>
>> >> This is because the Contenthub also supports full text search over
>> >> the parsed content. (*)
>> >>
>> >> > Another step was to upload pdf
>> >> > documents and I got extracted entities grouped by People, Places and
>> >> > Concepts categories. The document was also in the list of recently
>> >> > uploaded documents, but I couldn’t find any term from that document.
>> >> >
>> >>
>> >> Based on your request I tried the following (with the default
>> >> configuration of the full launcher).
>> >> NOTE: this excludes the possibility to create your own search index
>> >> by using LDPath.
>> >>
>> >> 1) Upload some files to the content hub
>> >>
>> >> * file upload worked (some scientific papers from the local HD)
>> >> * URL upload worked (some technical blogs + comments)
>> >> * pasting text worked (some of the examples included with the enhancer)
>> >> * based on the UI I got > 100 enhancements for all tested PDFs
>> >>
>> >> 2) Test of the contenthub search
>> >>
>> >> * keyword search also worked for me
>> >>
>> >> 3) Direct Solr searches on {host}/solr/default/contenthub/ (*)
>> >>
>> >> * searches like
>> >> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
>> >> worked fine. Note that searches are case sensitive (**).
>> >> * I think the keyword search uses the "text_all" field, so queries
>> >> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}"
>> >> should return the same values as the UI of the content hub. This
>> >> field basically supports full text search.
>> >> * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
>> >> *_workinstitutions ...) were missing. I think this is expected,
>> >> because such fields require a dbpedia index with the required
>> >> fields.
>> >>
>> >> >
>> >> > I suppose that I will have to provide a stream from pdf (or any
>> >> > other kind of) documents and to index it like text? I need all the
>> >> > mentioned functionalities (index text, docs, URIs…) using a Java
>> >> > application, and I would appreciate a code example, if one is
>> >> > available, please.
>> >> >
>> >>
>> >> I think parsing of URIs is currently not possible via the RESTful
>> >> API. For using the RESTful services I would recommend the use of
>> >> the Apache HttpComponents client.
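[Editor's note: to illustrate the Solr searches mentioned in this thread, here is a minimal Java sketch that only builds the query URLs (no request is sent). The `/solr/default/contenthub/select` path and the field names come from the mails above; the host/port, class name and helper methods are illustrative assumptions, so adjust them to your launcher.]

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/**
 * Sketch: building Solr query URLs against the Contenthub's default index.
 * Only the core path and field names come from this thread; the host/port
 * and all names below are illustrative, not part of Stanbol's API.
 */
public class ContenthubQueries {

    // Adjust host and port to where your launcher actually runs.
    static final String BASE = "http://localhost:8080/solr/default/contenthub/select";

    // URL-encode the Solr query parameter value.
    private static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** Full text search: the "text_all" field is case insensitive. */
    static String fullTextQuery(String keyword) {
        return BASE + "?q=" + encode("text_all:" + keyword);
    }

    /** Field search: the "*_t" fields are currently case sensitive. */
    static String fieldQuery(String field, String value) {
        return BASE + "?q=" + encode(field + ":" + value);
    }

    public static void main(String[] args) {
        System.out.println(fullTextQuery("stanbol"));
        System.out.println(fieldQuery("organizations_t", "Stanford*"));
    }
}
```

The URLs these methods produce can then be fetched with Apache HttpClient (or plain `java.net.HttpURLConnection`) as Rupert suggests.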
>> >> Code examples on how to build requests can be found at
>> >> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> Comments intended for Stanbol developers:
>> >> -----
>> >>
>> >> (*) Normally I would expect the Solr index to only include the plain
>> >> text version of the parsed content within a field with stored=false.
>> >> However I assume that currently the index needs to store the actual
>> >> content, because it is also used to store the data. Is this correct?
>> >> If this is the case then it will get fixed with STANBOL-471 in any
>> >> case.
>> >>
>> >> I also noted that "stanbolreserved_content" currently stores the
>> >> content as parsed to the content hub but is configured as
>> >> indexed="true" and type="text_general". So in the case of a PDF file
>> >> the binary content is processed as natural language text AND is also
>> >> indexed!
>> >> So if this field is used for full text indexing (which I think is not
>> >> the case, because I think the "text_all" field is used for that) then
>> >> you need to ensure that the plain text version is used for full text
>> >> indexing. The plain text contents are available from enhanced
>> >> ContentItems by using
>> >> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
>> >> As an alternative one could also use the features introduced by
>> >> STANBOL-500 for this.
>> >> If this field is used to store the actual content, then you should
>> >> use a binary field type and deactivate indexing for this field.
>> >>
>> >> (**) All *_t fields use "string" as the field type. This means that
>> >> no tokenizer is used AND queries are case sensitive.
>> >> I do not think this is a good decision and would rather use the
>> >> already defined "text_ws" type (whitespace tokenizer, word delimiter
>> >> and lower case filter).
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> --
>> >> | Rupert Westenthaler [email protected]
>> >> | Bodenlehenstraße 11 ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >
>> >
>>
>> --
>> | Rupert Westenthaler [email protected]
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>>
>
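[Editor's note: Rupert's (**) suggestion could look roughly like the following fragment in the Contenthub's Solr schema.xml. This is a hypothetical sketch, not the actual Stanbol configuration; the analyzer chain simply mirrors his description (whitespace tokenizer, word delimiter, lower case).]

```xml
<!-- Hypothetical sketch of the (**) suggestion: analyze the "*_t" fields
     with a "text_ws"-style type instead of the untokenized, case
     sensitive "string" type. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- dynamic field matching organizations_t, people_t, places_t, ... -->
<dynamicField name="*_t" type="text_ws" indexed="true" stored="true"/>
```

With such a change, a query like q=organizations_t:stanford* would match "Stanford" regardless of casing, because both the indexed tokens and the query terms are lower-cased.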
