Hi,

Ok, looks like I didn't understand that. It's clear now.
Thank you,
Srecko

On Tue, Mar 13, 2012 at 1:16 PM, Rupert Westenthaler <[email protected]> wrote:
> Hi
>
> On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic
> <[email protected]> wrote:
> > Hi Rupert,
> >
> > and thank you for the answer. I need to read a few more things, but the
> > answer helped me a lot.
>
> great!
>
> > If I understood well, the search is case sensitive, and if I need case
> > insensitive search, I will have to implement application-specific logic?
>
> Keyword searches via the content hub and Solr queries for the field
> "text_all" are case insensitive!
>
> Only searches for the fields "organizations_t", "people_t" and
> "places_t" are case sensitive. However, I would consider this a bug,
> and the comment (**) in my previous mail suggests correcting that.
>
> best
> Rupert
>
> > Best,
> > Srecko
> >
> > On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
> > <[email protected]> wrote:
> >>
> >> Hi Srecko, all
> >>
> >> @Stanbol developers: note the (*) and (**) comments at the end of this mail.
> >>
> >> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
> >> <[email protected]> wrote:
> >> >
> >> > Until now I have developed a few applications for annotating documents
> >> > using Apache Stanbol. Now I need to add indexing and search capabilities.
> >> >
> >> > I tried the ContentHub
> >> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
> >> > by starting the full launcher and accessing the web interface. There
> >> > are a few possibilities: to provide text, to upload a document, to
> >> > provide a URI… I tried to upload a few txt documents. I didn't get any
> >> > extracted entities,
> >>
> >> The content hub shows the number of extracted enhancements. This can
> >> easily be used as an indicator of whether the Stanbol Enhancer was able
> >> to extract knowledge from the parsed content.
> >>
> >> Typical reasons for not getting the expected enhancement results are:
> >>
> >> 1. Unsupported content type: the current version of Apache Stanbol uses the
> >> [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
> >> to process non-plain-text content parsed to the Stanbol
> >> Enhancer/Contenthub. So everything that is covered by Apache Tika
> >> should also work just fine with Apache Stanbol.
> >>
> >> 2. Unsupported language: some enhancement engines (e.g. NER - Named
> >> Entity Recognition) only support certain languages. If the parsed
> >> content is in another language, they will not be able to process it.
> >> With the default configuration of Stanbol only English (and, in the
> >> newest version, Spanish and Dutch) documents will work. Users with
> >> custom configurations will also be able to process documents in other
> >> languages.
> >>
> >> > but search (using Web View) worked fine.
> >>
> >> This is because the Contenthub also supports full-text search over the
> >> parsed content. (*)
> >>
> >> > Another step was to upload pdf documents and I got extracted entities
> >> > grouped by People, Places and Concepts categories. It was also in the
> >> > list of recently uploaded documents, but I couldn't find any term from
> >> > that document.
> >>
> >> Based on your request I tried the following (with the default
> >> configuration of the full launcher).
> >> NOTE: this excludes the possibility to create your own search index by
> >> using LDPath.
> >> 1) Upload some files to the content hub:
> >>
> >>  * file upload worked (some scientific papers from the local HD)
> >>  * URL upload worked (some technical blogs + comments)
> >>  * pasting text worked (some of the examples included for the enhancer)
> >>  * based on the UI I got > 100 enhancements for all tested PDFs
> >>
> >> 2) Test of the contenthub search:
> >>
> >>  * keyword search also worked for me
> >>
> >> 3) Direct Solr searches on {host}/solr/default/contenthub/ (*):
> >>
> >>  * searches like
> >>    "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
> >>    worked fine. Note that such searches are case sensitive (**).
> >>  * I think the keyword search uses the "text_all" field, so queries like
> >>    "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
> >>    return the same values as the UI of the content hub. This field
> >>    basically supports full-text search.
> >>  * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
> >>    *_workinstitutions ...) were missing. I think this is expected,
> >>    because such fields require a dbpedia index that includes the
> >>    required fields.
> >>
> >> > I suppose that I will have to provide a stream from pdf (or any other
> >> > kind of) documents and index it like text? I need all the mentioned
> >> > functionalities (index text, docs, URIs…) from a Java application, and
> >> > I would appreciate a code example, if one is available, please.
> >>
> >> I think parsing of URIs is currently not possible via the RESTful API.
> >> For using the RESTful services I would recommend the use of the Apache
> >> HttpComponents client.
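As a minimal, hedged sketch of the direct Solr queries discussed above: the snippet below only builds the query URLs (plain JDK, no dependencies); actually executing the request would then be done with the HttpComponents client recommended above. The host `http://localhost:8080` and the index name `default/contenthub` are assumptions taken from the mail, not verified values.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/**
 * Sketch: builds direct Solr query URLs of the form
 * {host}/solr/default/contenthub/select?q={field}:{value}
 * as described in the mail. Host and path are illustrative assumptions.
 */
public class SolrQueryUrl {

    /** URL-encodes a "field:value" Solr query and appends it to the select handler. */
    static String buildQueryUrl(String host, String field, String value) {
        try {
            // URLEncoder keeps '*' as-is, so wildcard queries survive encoding;
            // the ':' separator becomes %3A.
            String q = URLEncoder.encode(field + ":" + value, "UTF-8");
            return host + "/solr/default/contenthub/select?q=" + q;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // hypothetical local full launcher; a case-sensitive *_t field search (see (**))
        System.out.println(buildQueryUrl("http://localhost:8080", "organizations_t", "Stanford*"));
        // full-text search on the case-insensitive "text_all" field
        System.out.println(buildQueryUrl("http://localhost:8080", "text_all", "stanbol"));
    }
}
```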
> >> Code examples on how to build requests can be found at
> >> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
> >>
> >> best
> >> Rupert
> >>
> >> Comments intended for Stanbol developers:
> >> -----
> >>
> >> (*) Normally I would expect the Solr index to include only the
> >> plain-text version of the parsed content, within a field with
> >> stored=false. However, I assume that currently the index needs to store
> >> the actual content, because it is also used to store the data. Is this
> >> correct? If this is the case, then it will get fixed with STANBOL-471
> >> in any case.
> >>
> >> I also noted that "stanbolreserved_content" currently stores the
> >> content as parsed to the content hub, but is configured as
> >> indexed="true" and type="text_general". So in the case of a PDF file
> >> the binary content is processed as natural-language text AND is also
> >> indexed!
> >> If this field is used for full-text indexing (which I think is not the
> >> case, because I think the "text_all" field is used for that), then you
> >> need to ensure that the plain-text version is used for full-text
> >> indexing. The plain-text contents are available from enhanced
> >> ContentItems by using
> >> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
> >> As an alternative one could also use the features introduced by
> >> STANBOL-500 for this.
> >> If this field is used to store the actual content, then you should use
> >> a binary field type and deactivate indexing for this field.
> >>
> >> (**) All *_t fields use "string" as their field type. This means that
> >> no tokenizer is used AND queries are case sensitive.
> >> I do not think
> >> this is a good decision and would rather use the already defined
> >> "text_ws" type (whitespace tokenizer, word delimiter and lower case).
> >>
> >> best
> >> Rupert
> >>
> >> --
> >> | Rupert Westenthaler             [email protected]
> >> | Bodenlehenstraße 11             ++43-699-11108907
> >> | A-5500 Bischofshofen
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
