Hi Rupert, and thank you for the answer. I need to read a few more things, but the answer helped me a lot. If I understood correctly, the search is case sensitive, and if I need a case-insensitive search, I will have to implement application-specific logic?
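[Editor's note: one simple shape such application-specific logic could take, purely a client-side sketch and not a Stanbol API, is to expand a keyword into its common case variants before building the Solr query string:]

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical client-side workaround for the case-sensitive *_t fields:
// expand a keyword into its common case variants (as typed, lower case,
// UPPER CASE, Capitalized) and OR them together in one Solr query.
public class CaseVariants {

    public static String expand(String field, String keyword) {
        Set<String> variants = new LinkedHashSet<>();
        variants.add(keyword);                  // as typed
        variants.add(keyword.toLowerCase());    // all lower case
        variants.add(keyword.toUpperCase());    // all upper case
        if (!keyword.isEmpty()) {               // Capitalized
            variants.add(Character.toUpperCase(keyword.charAt(0))
                    + keyword.substring(1).toLowerCase());
        }
        StringBuilder q = new StringBuilder();
        for (String v : variants) {
            if (q.length() > 0) {
                q.append(" OR ");
            }
            q.append(field).append(':').append(v);
        }
        return q.toString();
    }

    public static void main(String[] args) {
        // e.g. organizations_t:stanford OR organizations_t:STANFORD OR ...
        System.out.println(expand("organizations_t", "stanford"));
    }
}
```

[This only covers simple case variants; fixing the field type in the Solr schema, as Rupert suggests below for the *_t fields, is the cleaner solution.]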
Best,
Srecko

On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
<[email protected]> wrote:
> Hi Srecko, all
>
> @Stanbol developers: Note the (*) and (**) comments at the end of this mail.
>
> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
> <[email protected]> wrote:
> >
> > Until now I have developed a few applications for annotating documents
> > using Apache Stanbol. Now I need to add indexing and search capabilities.
> >
> > I tried the Contenthub
> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
> > by starting the full launcher and accessing the web interface. There are
> > a few possibilities: to provide text, to upload a document, to provide a
> > URI… I tried to upload a few txt documents. I didn’t get any extracted
> > entities,
>
> The Contenthub shows the number of extracted enhancements. This can
> easily be used as an indicator of whether the Stanbol Enhancer was able
> to extract knowledge from the parsed content.
>
> Typical reasons for not getting the expected enhancement results are:
>
> 1. Unsupported content type: The current version of Apache Stanbol
> uses the [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
> to process non-plain-text content parsed to the Stanbol
> Enhancer/Contenthub. So everything that is covered by Apache Tika
> should also work just fine with Apache Stanbol.
>
> 2. Unsupported language: Some enhancement engines (e.g. NER - Named
> Entity Recognition) only support certain languages. If the parsed
> content is in another language, they will not be able to process it.
> With the default configuration of Stanbol only English (and in the
> newest version Spanish and Dutch) documents will work. Users with
> custom configurations will also be able to process documents in
> other languages.
>
> > but search (using Web View) worked fine.
>
> This is because the Contenthub also supports full text search over the
> parsed content.
> (*)
>
> > Another step was to upload pdf
> > documents and I got extracted entities grouped by People, Places and
> > Concepts categories. The document was also in the list of recently
> > uploaded documents, but I couldn’t find any term from that document.
>
> Based on your request I tried the following (with the default
> configuration of the full launcher).
> NOTE: this excludes the possibility to create your own search index by
> using LDPath.
>
> 1) upload some files to the Contenthub
>
> * file upload worked (some scientific papers from the local HD)
> * URL upload worked (some technical blogs + comments)
> * pasting text worked (some of the examples included for the enhancer)
> * based on the UI I got > 100 enhancements for all tested PDFs
>
> 2) test of the Contenthub search
>
> * keyword search also worked for me
>
> 3) direct Solr searches on {host}/solr/default/contenthub/ (*)
>
> * searches like
> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
> worked fine. Note that searches are case sensitive (**)
> * I think the keyword search uses the "text_all" field. So queries
> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
> return the same values as the UI of the Contenthub. This field
> basically supports full text search.
> * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
> *_workinstitutions ...) were missing. I think this is expected,
> because such fields require a DBpedia index that contains the
> required fields.
>
> > I suppose that I will have to provide a stream from pdf (or any other
> > kind of) documents and index it like text? I need all the mentioned
> > functionalities (index text, docs, URIs…) using a Java application,
> > and I would appreciate a code example, if one is available, please.
>
> I think parsing of URIs is currently not possible using the RESTful
> API. For using the RESTful services I would recommend the Apache
> HttpComponents HttpClient.
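[Editor's note: as a starting point, here is a minimal sketch of such a query in Java. It uses only the JDK (HttpURLConnection) to stay dependency-free; the host, core path, and field names follow the {host}/solr/default/contenthub/ URLs quoted above and are assumptions about a default full launcher, so adjust them to your installation:]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Sketch: query the Contenthub's embedded Solr index directly, following
// the {host}/solr/default/contenthub/select URLs from this thread. Host
// and core path are assumptions for a default full launcher.
public class ContenthubSolrQuery {

    /** Builds e.g. {host}/solr/default/contenthub/select?q=text_all%3Astanbol&wt=json */
    public static String buildQueryUrl(String host, String field, String keyword)
            throws Exception {
        String q = URLEncoder.encode(field + ":" + keyword, "UTF-8");
        return host + "/solr/default/contenthub/select?q=" + q + "&wt=json";
    }

    /** Issues the GET request and returns the raw Solr response body. */
    public static String fetch(String queryUrl) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(queryUrl).openConnection();
        con.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildQueryUrl("http://localhost:8080", "text_all", "stanbol"));
        // With a running Stanbol full launcher on localhost:8080:
        // System.out.println(fetch(buildQueryUrl(
        //         "http://localhost:8080", "text_all", "stanbol")));
    }
}
```

[For production use, Apache HttpClient as recommended by Rupert gives you connection pooling and proper error handling that this sketch omits.]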
> Code examples on how to build requests can be found at
> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
>
> best
> Rupert
>
> Comments intended for Stanbol developers:
> -----
>
> (*) Normally I would expect the Solr index to only include the plain
> text version of the parsed content within a field with stored=false.
> However I assume that currently the index needs to store the actual
> content, because it is also used to store the data. Is this correct?
> If this is the case then it will get fixed with STANBOL-471 in any case.
>
> I also noted that "stanbolreserved_content" currently stores the
> content as parsed to the Contenthub but is configured as
> indexed="true" and type="text_general". So in the case of a PDF file
> the binary content is processed as natural language text AND is also
> indexed!
> So if this field is used for full text indexing (which I think is not
> the case, because I think the "text_all" field is used for that) then
> you need to ensure that the plain text version is used for full text
> indexing. The plain text contents are available from enhanced
> ContentItems by using
> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
> As an alternative one could also use the features introduced by
> STANBOL-500 for this.
> If this field is used to store the actual content, then you should use
> a binary field type and deactivate indexing for this field.
>
> (**) All *_t fields use "string" as the field type. This means that no
> tokenizer is used AND queries are case sensitive. I do not think this
> is a good decision and would rather use the already defined "text_ws"
> type (whitespace tokenizer, word delimiter and lower case filter).
>
> best
> Rupert
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
