Hi

On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic <[email protected]> wrote:
> Hi Rupert,
>
> and thank you for the answer. I need to read a few more things, but the
> answer helped me a lot.
great!

> If I understood well, the search is case sensitive, and if I need case
> insensitive search, I will have to implement application specific logic?

Keyword searches via the Contenthub and Solr queries for the field
"text_all" are case insensitive! Only searches for the fields
"organizations_t", "people_t" and "places_t" are case sensitive. However
I would consider this a bug, and the comment (**) in my previous mail
suggests correcting it.

best
Rupert

> Best,
> Srecko
>
>
> On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
> <[email protected]> wrote:
>>
>> Hi Srecko, all
>>
>> @Stanbol developers: Note the (*) and (**) comments at the end of this mail
>>
>> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
>> <[email protected]> wrote:
>> >
>> > Until now I have developed a few applications for annotating documents
>> > using Apache Stanbol. Now I need to add indexing and search
>> > capabilities.
>> >
>> > I tried the Contenthub
>> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
>> > by starting the full launcher and accessing the web interface. There
>> > are a few possibilities: to provide text, to upload a document, to
>> > provide a URI… I tried to upload a few txt documents. I didn’t get any
>> > extracted entities,
>>
>> The Contenthub shows the number of extracted enhancements. This can
>> easily be used as an indicator of whether the Stanbol Enhancer was able
>> to extract knowledge from the parsed content.
>>
>> Typical reasons for not getting the expected enhancement results are:
>>
>> 1. Unsupported content type: The current version of Apache Stanbol
>> uses the
>> [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
>> to process non-plain-text content parsed to the Stanbol
>> Enhancer/Contenthub. So everything that is covered by Apache Tika
>> should also work just fine with Apache Stanbol.
>>
>> 2. Unsupported language: Some Enhancement Engines (e.g. NER - Named
>> Entity Recognition) only support some languages. If the parsed content
>> is in another language, they will not be able to process it. With the
>> default configuration of Stanbol only English (and in the newest
>> version Spanish and Dutch) documents will work. Users with custom
>> configurations will also be able to process documents in other
>> languages.
>>
>> > but search (using Web View) worked fine.
>>
>> This is because the Contenthub also supports full text search over the
>> parsed content. (*)
>>
>> > Another step was to upload pdf documents and I got extracted entities
>> > grouped by People, Places and Concepts categories. It was also in the
>> > list of recently uploaded documents, but I couldn’t find any term
>> > from that document.
>>
>> Based on your request I tried the following (with the default
>> configuration of the Full launcher).
>> NOTE: this excludes the possibility to create your own search index by
>> using LDPath.
>>
>> 1) Upload some files to the Contenthub
>>
>> * file upload worked (some scientific papers from the local HD)
>> * URL upload worked (some technical blogs + comments)
>> * pasting text worked (some of the examples included with the enhancer)
>> * based on the UI I got > 100 enhancements for all tested PDFs
>>
>> 2) Test of the Contenthub search
>>
>> * keyword search also worked for me
>>
>> 3) Direct Solr searches on {host}/solr/default/contenthub/ (*)
>>
>> * searches like
>> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
>> worked fine. Note that these searches are case sensitive (**)
>> * I think the keyword search uses the "text_all" field, so queries
>> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
>> return the same values as the UI of the Contenthub. This field
>> basically supports full text search.
>> * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
>> *_workinstitutions ...) were missing.
>> I think this is expected, because such fields do require a dbpedia
>> index with the required fields.
>>
>>
>> > I suppose that I will have to provide a stream from pdf (or any other
>> > kind of) documents and to index it like text? I need all mentioned
>> > functionalities (index text, docs, URIs…) using a Java application,
>> > and I would appreciate a code example, if it is available, please.
>>
>> I think parsing of URIs is currently not possible by using the RESTful
>> API. For using the RESTful services I would recommend the use of
>> the Apache HttpComponents client. Code examples on how to build
>> requests can be found at
>>
>> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
>>
>>
>> best
>> Rupert
>>
>> Comments intended for Stanbol Developers:
>> -----
>>
>> (*) Normally I would expect the Solr index to only include the plain
>> text version of the parsed content within a field with stored=false.
>> However I assume that currently the index needs to store the actual
>> content, because it is also used to store the data. Is this correct?
>> If this is the case, then it will get fixed with STANBOL-471 in any
>> case.
>>
>> I also noted that "stanbolreserved_content" currently stores the
>> content as parsed to the Contenthub but is configured as
>> indexed="true" and type="text_general". So in the case of a PDF file
>> the binary content is processed as natural language text AND is also
>> indexed!
>> So if this field is used for full text indexing (which I think is not
>> the case, because I think the "text_all" field is used for that), then
>> you need to ensure that the plain text version is used for full text
>> indexing. The plain text contents are available from enhanced
>> ContentItems by using
>>
>> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain"))
>>
>> As an alternative one could also use the features introduced by
>> STANBOL-500 for this.
>> If this field is used to store the actual content, then you should use
>> a binary field type and deactivate indexing for this field.
>>
>> (**) All *_t fields use string as the field type. This means that no
>> tokenizer is used AND queries are case sensitive. I do not think this
>> is a good decision and would rather use the already defined "text_ws"
>> type (white space tokenizer, word delimiter and lower case).
>>
>>
>> best
>> Rupert
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11             ++43-699-11108907
>> | A-5500 Bischofshofen
>
>

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
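For readers who want to reproduce the direct Solr searches from point 3)
above from a Java application, here is a minimal sketch. It uses only the
JDK's HttpURLConnection (the Apache HttpComponents tutorial linked in the
thread covers the library-based equivalent); the host, port and the
"default" index name match the full-launcher defaults discussed above, but
may differ in other setups.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SolrQueryExample {

    // Builds a select URL for the Contenthub's default Solr index.
    // The index path and field names follow the examples in the mail
    // above; other setups may use a different index name than "default".
    static String buildQueryUrl(String host, String field, String value)
            throws Exception {
        String q = URLEncoder.encode(field + ":" + value, "UTF-8");
        return host + "/solr/default/contenthub/select?q=" + q;
    }

    // Executes the GET request. This needs a running Stanbol full
    // launcher, so it is not called from main() below.
    static String query(String host, String field, String value)
            throws Exception {
        URL url = new URL(buildQueryUrl(host, field, value));
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // Remember: "organizations_t" queries are case sensitive (see the
        // (**) comment above), while "text_all" queries are not.
        System.out.println(buildQueryUrl("http://localhost:8080",
                "organizations_t", "Stanford*"));
    }
}
```

Note that the field/value pair must be URL-encoded (the ":" becomes %3A);
the "*" wildcard passes through unchanged.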

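To make the (*) and (**) suggestions above concrete, the corresponding
schema.xml changes could roughly look like the fragment below. This is an
illustrative sketch only: the actual Stanbol schema may declare these
fields differently, and the "text_ws" and "binary" type names are assumed
to be defined elsewhere in the schema.

```xml
<!-- (**) Use a tokenized, lower-cased type for the *_t fields instead of
     "string", so that queries like organizations_t:stanford* become case
     insensitive. "text_ws" is assumed to be the already defined type with
     whitespace tokenizer, word delimiter and lower case filter. -->
<dynamicField name="*_t" type="text_ws" indexed="true" stored="true"/>

<!-- (*) If stanbolreserved_content must keep the raw (possibly binary)
     content, store it without indexing it as natural language text. -->
<field name="stanbolreserved_content" type="binary"
       indexed="false" stored="true"/>
```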