Hi Rupert, all,

First of all, thanks for your feedback.

On 03/13/2012 12:46 PM, Rupert Westenthaler wrote:
> Hi Srecko, all
>
> @Stanbol developers: Note (*) and (**) comments at the end of this mail
>
> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
> <[email protected]> wrote:
>> Until now I have developed a few applications for annotating documents using
>> Apache Stanbol. Now I need to add indexing and search capabilities.
>>
>> I tried ContentHub
>> (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
>> in the way that I started the full launcher and accessed the web interface. There are
>> a few possibilities: to provide text, to upload a document, to provide a URI… I
>> tried to upload a few txt documents. I didn’t get any extracted entities,
> The Contenthub shows the number of extracted enhancements. This can
> easily be used as an indicator of whether the Stanbol Enhancer was
> able to extract knowledge from the submitted content.
>
> Typical reasons for not getting expected enhancement results are:
>
> 1. unsupported content type: The current version of Apache Stanbol
> uses the 
> [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
> to process non-plain-text content passed to the Stanbol
> Enhancer/Contenthub. So everything that is covered by Apache Tika
> should also work just fine with Apache Stanbol.
>
> 2. unsupported language: Some Enhancement Engines (e.g. NER - Named
> Entity Recognition) only support certain languages. If the submitted
> content is in another language, they will not be able to process it.
> With the default configuration of Stanbol only English (and in the
> newest version Spanish and Dutch) documents will work. Users with
> custom configurations will also be able to process documents in
> other languages.
>
>> but search (using Web View) worked fine.
> This is because the Contenthub also supports full-text search over the
> submitted content. (*)
>
>> Another step was to upload pdf
>> documents and I got extracted entities grouped by People, Places and
>> Concepts categories. It was also in the list of recently uploaded
>> documents, but I couldn’t find any term from that document.
>>
> Based on your request I tried the following (with the default
> configuration of the full launcher).
> NOTE: this excludes the possibility of creating your own search index
> using LDPath.
>
> 1) upload some files to the content hub
>
>     * file upload worked (some scientific papers from the local HD)
>     * URL upload worked (some technical blogs + comments)
>     * pasting text worked (some of the examples included for the enhancer)
>     * based on the UI I got > 100 enhancements for all tested PDFs
>
> 2) test of the contenthub search
>
>     * keyword search worked also for me
>
> 3) direct solr searches on {host}/solr/default/contenthub/ (*)
>
>     * searches like
> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
> worked fine. Note that searches are case sensitive (**)
>     * I think the keyword search uses the "text_all" field. So queries
> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
> return the same values as the UI of the Contenthub. This field
> basically supports full text search.
>     * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
> *_workinstitutions ...) were missing. I think this is expected,
> because such fields require a dbpedia index with the required
> fields.
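As a side note for other readers: building those Solr select URLs from Java is mostly a matter of URL-encoding the query. A minimal sketch (the endpoint path /solr/default/contenthub and the field names organizations_t and text_all are taken from this thread; adjust them to your actual index):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/**
 * Sketch: builds Solr select URLs like the ones shown above.
 * The endpoint path and field names are assumptions based on this
 * thread, not guaranteed by the Stanbol API.
 */
public class ContenthubQuery {

    static String solrSelectUrl(String host, String field, String value)
            throws UnsupportedEncodingException {
        // q={field}:{value}, URL-encoded so ':' and spaces survive
        String q = URLEncoder.encode(field + ":" + value, "UTF-8");
        return host + "/solr/default/contenthub/select?q=" + q;
    }

    public static void main(String[] args) throws Exception {
        // mirrors the case-sensitive wildcard search from the example above
        System.out.println(solrSelectUrl("http://localhost:8080",
                "organizations_t", "Stanford*"));
        // full text search over the text_all field, as used by the UI
        System.out.println(solrSelectUrl("http://localhost:8080",
                "text_all", "stanbol"));
    }
}
```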
>
>
>> I suppose that I will have to provide a stream from pdf (or any other kind of)
>> document and index it like text? I need all the mentioned functionalities
>> (indexing text, docs, URIs…) from a Java application and I would appreciate a
>> code example, if one is available, please.
>>
> I think submitting URIs is currently not possible via the RESTful
> API. For using the RESTful services I would recommend the Apache
> HttpComponents HttpClient. Code examples on how to build requests
> can be found at
> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
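To add a quick pointer for Srecko: the request itself is a plain HTTP POST of the content. Apache HttpClient, as Rupert recommends, is the comfortable option; the sketch below uses only the JDK's HttpURLConnection so it runs without extra dependencies. The endpoint path "/enhancer" is an assumption on my side — check the paths your launcher actually exposes:

```java
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Sketch: prepares a POST of plain text to a Stanbol endpoint.
 * The "/enhancer" path is an assumption; verify it against your
 * running launcher before use.
 */
public class EnhancerPost {

    static HttpURLConnection prepareRequest(String host) throws Exception {
        URL url = new URL(host + "/enhancer");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        // ask for the enhancement results as RDF/XML
        con.setRequestProperty("Accept", "application/rdf+xml");
        return con;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection con = prepareRequest("http://localhost:8080");
        // Actually sending the body (network I/O) would look like:
        // try (java.io.OutputStream out = con.getOutputStream()) {
        //     out.write("Paris is the capital of France.".getBytes("UTF-8"));
        // }
        // int status = con.getResponseCode();
        System.out.println(con.getRequestMethod() + " " + con.getURL());
    }
}
```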
>
>
> best
> Rupert
>
> Comments intended for Stanbol Developers:
> -----
>
> (*) Normally I would expect the Solr index to only include the plain
> text version of the submitted content within a field with stored=false.
> However I assume that currently the index needs to store the actual
> content, because it is also used to store the data. Is this correct?
> If this is the case then it will get fixed with STANBOL-471 in any case.
>
Exactly.

> I also noted that "stanbolreserved_content" currently stores the
> content as submitted to the Contenthub but is configured as
> indexed="true" and type="text_general". So in the case of a PDF file
> the binary content is processed as natural language text AND is also
> indexed!
> So if this field is used for full text indexing (which I think is not
> the case, because I think the "text_all" field is used for that) then
> you need to ensure that the plain text version is used for full text
> indexing. The plain text contents are available from enhanced
> ContentItems by using
> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
> As an alternative one could also use the features introduced by
> STANBOL-500 for this.
> If this field is used to store the actual content, then you should use
> a binary field type and deactivate indexing for this field.
Currently, all fields are copied to "text_all" and indexed within this
field. But most of these fields (such as "stanbolreserved_content") are
also indexed themselves, which is wrong, as you pointed out. Furthermore,
binary content has not been considered at all in the Solr index. We will
find shortcut solutions as soon as possible and leave the actual solution
to the implementation of STANBOL-471.
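For illustration, the shortcut could look roughly like this in schema.xml (field and type names are sketched from this thread, not from the actual schema):

```xml
<!-- sketch only: store the original content, but do not index it -->
<field name="stanbolreserved_content" type="binary" indexed="false" stored="true"/>

<!-- full text search happens on text_all, fed via copyField -->
<field name="text_all" type="text_general" indexed="true" stored="false" multiValued="true"/>
<!-- copy only the text fields, never the binary content field -->
<copyField source="*_t" dest="text_all"/>
```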

> (**) All *_t fields use string as the field type. This means that no
> tokenizer is used AND queries are case sensitive. I do not think this
> is a good decision and would rather use the already defined "text_ws"
> type (whitespace tokenizer, word delimiter and lower case)
>
>

OK, thanks for this suggestion. Indeed, it might be better to set it to
"text_general". WDYT?
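That change would only be the dynamic field declaration, roughly (sketch, the stored="true" value is my assumption):

```xml
<!-- sketch: switch the *_t dynamic fields from "string" to an analyzed type -->
<!-- an analyzed type (tokenizer + lower-casing) makes queries case insensitive -->
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
```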

Best,
Suat
