Hi Rupert, and thank you for the answer. I need to read a few more things, but the answer helped me a lot. If I understood correctly, the search is case sensitive, and if I need a case-insensitive search, I will have to implement application-specific logic?
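[Editor's note: one simple shape such application-specific logic could take, purely a client-side sketch and not a Stanbol API, is to expand a keyword into its common case variants before building the Solr query string:]

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical client-side workaround for the case-sensitive *_t fields:
// expand a keyword into its common case variants (as typed, lower case,
// UPPER CASE, Capitalized) and OR them together in one Solr query.
public class CaseVariants {

    public static String expand(String field, String keyword) {
        Set<String> variants = new LinkedHashSet<>();
        variants.add(keyword);                  // as typed
        variants.add(keyword.toLowerCase());    // all lower case
        variants.add(keyword.toUpperCase());    // all upper case
        if (!keyword.isEmpty()) {               // Capitalized
            variants.add(Character.toUpperCase(keyword.charAt(0))
                    + keyword.substring(1).toLowerCase());
        }
        StringBuilder q = new StringBuilder();
        for (String v : variants) {
            if (q.length() > 0) {
                q.append(" OR ");
            }
            q.append(field).append(':').append(v);
        }
        return q.toString();
    }

    public static void main(String[] args) {
        // e.g. organizations_t:stanford OR organizations_t:STANFORD OR ...
        System.out.println(expand("organizations_t", "stanford"));
    }
}
```

[This only covers simple case variants; fixing the field type in the Solr schema, as Rupert suggests below for the *_t fields, is the cleaner solution.]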
Best,
Srecko

On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
<[email protected]> wrote:
> Hi Srecko, all
>
> @Stanbol developers: Note the (*) and (**) comments at the end of this mail.
>
> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
> <[email protected]> wrote:
> >
> > Until now I have developed a few applications for annotating documents
> > using Apache Stanbol. Now I need to add indexing and search capabilities.
> >
> > I tried the Contenthub
> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
> > by starting the full launcher and accessing the web interface. There are
> > a few possibilities: to provide text, to upload a document, to provide a
> > URI… I tried to upload a few txt documents. I didn’t get any extracted
> > entities,
>
> The Contenthub shows the number of extracted enhancements. This can
> easily be used as an indicator of whether the Stanbol Enhancer was able
> to extract knowledge from the parsed content.
>
> Typical reasons for not getting the expected enhancement results are:
>
> 1. Unsupported content type: The current version of Apache Stanbol
> uses the [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
> to process non-plain-text content parsed to the Stanbol
> Enhancer/Contenthub. So everything that is covered by Apache Tika
> should also work just fine with Apache Stanbol.
>
> 2. Unsupported language: Some enhancement engines (e.g. NER - Named
> Entity Recognition) only support certain languages. If the parsed
> content is in another language, they will not be able to process it.
> With the default configuration of Stanbol only English (and in the
> newest version Spanish and Dutch) documents will work. Users with
> custom configurations will also be able to process documents in
> other languages.
>
> > but search (using Web View) worked fine.
>
> This is because the Contenthub also supports full text search over the
> parsed content.
> (*)
>
> > Another step was to upload pdf
> > documents and I got extracted entities grouped by People, Places and
> > Concepts categories. The document was also in the list of recently
> > uploaded documents, but I couldn’t find any term from that document.
>
> Based on your request I tried the following (with the default
> configuration of the full launcher).
> NOTE: this excludes the possibility to create your own search index by
> using LDPath.
>
> 1) upload some files to the Contenthub
>
> * file upload worked (some scientific papers from the local HD)
> * URL upload worked (some technical blogs + comments)
> * pasting text worked (some of the examples included for the enhancer)
> * based on the UI I got > 100 enhancements for all tested PDFs
>
> 2) test of the Contenthub search
>
> * keyword search also worked for me
>
> 3) direct Solr searches on {host}/solr/default/contenthub/ (*)
>
> * searches like
> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
> worked fine. Note that searches are case sensitive (**)
> * I think the keyword search uses the "text_all" field. So queries
> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
> return the same values as the UI of the Contenthub. This field
> basically supports full text search.
> * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
> *_workinstitutions ...) were missing. I think this is expected,
> because such fields require a DBpedia index that contains the
> required fields.
>
> > I suppose that I will have to provide a stream from pdf (or any other
> > kind of) documents and index it like text? I need all the mentioned
> > functionalities (index text, docs, URIs…) using a Java application,
> > and I would appreciate a code example, if one is available, please.
>
> I think parsing of URIs is currently not possible using the RESTful
> API. For using the RESTful services I would recommend the Apache
> HttpComponents HttpClient.
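[Editor's note: as a starting point, here is a minimal sketch of such a query in Java. It uses only the JDK (HttpURLConnection) to stay dependency-free; the host, core path, and field names follow the {host}/solr/default/contenthub/ URLs quoted above and are assumptions about a default full launcher, so adjust them to your installation:]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Sketch: query the Contenthub's embedded Solr index directly, following
// the {host}/solr/default/contenthub/select URLs from this thread. Host
// and core path are assumptions for a default full launcher.
public class ContenthubSolrQuery {

    /** Builds e.g. {host}/solr/default/contenthub/select?q=text_all%3Astanbol&wt=json */
    public static String buildQueryUrl(String host, String field, String keyword)
            throws Exception {
        String q = URLEncoder.encode(field + ":" + keyword, "UTF-8");
        return host + "/solr/default/contenthub/select?q=" + q + "&wt=json";
    }

    /** Issues the GET request and returns the raw Solr response body. */
    public static String fetch(String queryUrl) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(queryUrl).openConnection();
        con.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildQueryUrl("http://localhost:8080", "text_all", "stanbol"));
        // With a running Stanbol full launcher on localhost:8080:
        // System.out.println(fetch(buildQueryUrl(
        //         "http://localhost:8080", "text_all", "stanbol")));
    }
}
```

[For production use, Apache HttpClient as recommended by Rupert gives you connection pooling and proper error handling that this sketch omits.]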
> Code examples on how to build requests can be found at
> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
>
> best
> Rupert
>
> Comments intended for Stanbol developers:
> -----
>
> (*) Normally I would expect the Solr index to only include the plain
> text version of the parsed content within a field with stored=false.
> However I assume that currently the index needs to store the actual
> content, because it is also used to store the data. Is this correct?
> If this is the case then it will get fixed with STANBOL-471 in any case.
>
> I also noted that "stanbolreserved_content" currently stores the
> content as parsed to the Contenthub but is configured as
> indexed="true" and type="text_general". So in the case of a PDF file
> the binary content is processed as natural language text AND is also
> indexed!
> So if this field is used for full text indexing (which I think is not
> the case, because I think the "text_all" field is used for that) then
> you need to ensure that the plain text version is used for full text
> indexing. The plain text contents are available from enhanced
> ContentItems by using
> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
> As an alternative one could also use the features introduced by
> STANBOL-500 for this.
> If this field is used to store the actual content, then you should use
> a binary field type and deactivate indexing for this field.
>
> (**) All *_t fields use "string" as the field type. This means that no
> tokenizer is used AND queries are case sensitive. I do not think this
> is a good decision and would rather use the already defined "text_ws"
> type (whitespace tokenizer, word delimiter and lower case filter).
>
> best
> Rupert
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
