Hi,

Ok, looks like I didn't understand that. It's clear now.
Thank you,
Srecko

On Tue, Mar 13, 2012 at 1:16 PM, Rupert Westenthaler <[email protected]> wrote:
> Hi
>
> On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic
> <[email protected]> wrote:
> > Hi Rupert,
> >
> > and thank you for the answer. I need to read a few more things, but the
> > answer helped me a lot.
>
> great!
>
> > If I understood well, the search is case sensitive, and if I need case
> > insensitive search, I will have to implement application-specific logic?
>
> Keyword searches via the content hub and Solr queries for the field
> "text_all" are case insensitive!
>
> Only searches for the fields "organizations_t", "people_t" and
> "places_t" are case sensitive. However, I would consider this a bug,
> and the comment (**) in my previous mail suggests correcting that.
>
> best
> Rupert
>
> > Best,
> > Srecko
> >
> > On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
> > <[email protected]> wrote:
> >>
> >> Hi Srecko, all
> >>
> >> @Stanbol developers: note the (*) and (**) comments at the end of this mail.
> >>
> >> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
> >> <[email protected]> wrote:
> >> >
> >> > Until now I have developed a few applications for annotating documents
> >> > using Apache Stanbol. Now I need to add indexing and search capabilities.
> >> >
> >> > I tried the ContentHub
> >> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
> >> > by starting the full launcher and accessing the web interface. There
> >> > are a few possibilities: to provide text, to upload a document, to
> >> > provide a URI… I tried to upload a few txt documents. I didn't get any
> >> > extracted entities,
> >>
> >> The content hub shows the number of extracted enhancements. This can
> >> easily be used as an indicator of whether the Stanbol Enhancer was able
> >> to extract knowledge from the parsed content.
> >>
> >> Typical reasons for not getting the expected enhancement results are:
> >>
> >> 1. Unsupported content type: the current version of Apache Stanbol uses the
> >> [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
> >> to process non-plain-text content parsed to the Stanbol
> >> Enhancer/Contenthub. So everything that is covered by Apache Tika
> >> should also work just fine with Apache Stanbol.
> >>
> >> 2. Unsupported language: some enhancement engines (e.g. NER - Named
> >> Entity Recognition) only support certain languages. If the parsed
> >> content is in another language, they will not be able to process it.
> >> With the default configuration of Stanbol only English (and, in the
> >> newest version, Spanish and Dutch) documents will work. Users with
> >> custom configurations will also be able to process documents in other
> >> languages.
> >>
> >> > but search (using Web View) worked fine.
> >>
> >> This is because the Contenthub also supports full-text search over the
> >> parsed content. (*)
> >>
> >> > Another step was to upload pdf documents and I got extracted entities
> >> > grouped by People, Places and Concepts categories. It was also in the
> >> > list of recently uploaded documents, but I couldn't find any term from
> >> > that document.
> >>
> >> Based on your request I tried the following (with the default
> >> configuration of the full launcher).
> >> NOTE: this excludes the possibility to create your own search index by
> >> using LDPath.
> >> 1) Upload some files to the content hub:
> >>
> >>  * file upload worked (some scientific papers from the local HD)
> >>  * URL upload worked (some technical blogs + comments)
> >>  * pasting text worked (some of the examples included for the enhancer)
> >>  * based on the UI I got > 100 enhancements for all tested PDFs
> >>
> >> 2) Test of the contenthub search:
> >>
> >>  * keyword search also worked for me
> >>
> >> 3) Direct Solr searches on {host}/solr/default/contenthub/ (*):
> >>
> >>  * searches like
> >>    "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
> >>    worked fine. Note that such searches are case sensitive (**).
> >>  * I think the keyword search uses the "text_all" field, so queries like
> >>    "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
> >>    return the same values as the UI of the content hub. This field
> >>    basically supports full-text search.
> >>  * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
> >>    *_workinstitutions ...) were missing. I think this is expected,
> >>    because such fields require a dbpedia index that includes the
> >>    required fields.
> >>
> >> > I suppose that I will have to provide a stream from pdf (or any other
> >> > kind of) documents and index it like text? I need all the mentioned
> >> > functionalities (index text, docs, URIs…) from a Java application, and
> >> > I would appreciate a code example, if one is available, please.
> >>
> >> I think parsing of URIs is currently not possible via the RESTful API.
> >> For using the RESTful services I would recommend the use of the Apache
> >> HttpComponents client.
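As a minimal, hedged sketch of the direct Solr queries discussed above: the snippet below only builds the query URLs (plain JDK, no dependencies); actually executing the request would then be done with the HttpComponents client recommended above. The host `http://localhost:8080` and the index name `default/contenthub` are assumptions taken from the mail, not verified values.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/**
 * Sketch: builds direct Solr query URLs of the form
 * {host}/solr/default/contenthub/select?q={field}:{value}
 * as described in the mail. Host and path are illustrative assumptions.
 */
public class SolrQueryUrl {

    /** URL-encodes a "field:value" Solr query and appends it to the select handler. */
    static String buildQueryUrl(String host, String field, String value) {
        try {
            // URLEncoder keeps '*' as-is, so wildcard queries survive encoding;
            // the ':' separator becomes %3A.
            String q = URLEncoder.encode(field + ":" + value, "UTF-8");
            return host + "/solr/default/contenthub/select?q=" + q;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // hypothetical local full launcher; a case-sensitive *_t field search (see (**))
        System.out.println(buildQueryUrl("http://localhost:8080", "organizations_t", "Stanford*"));
        // full-text search on the case-insensitive "text_all" field
        System.out.println(buildQueryUrl("http://localhost:8080", "text_all", "stanbol"));
    }
}
```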
> >> Code examples on how to build requests can be found at
> >> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
> >>
> >> best
> >> Rupert
> >>
> >> Comments intended for Stanbol developers:
> >> -----
> >>
> >> (*) Normally I would expect the Solr index to include only the
> >> plain-text version of the parsed content, within a field with
> >> stored=false. However, I assume that currently the index needs to store
> >> the actual content, because it is also used to store the data. Is this
> >> correct? If this is the case, then it will get fixed with STANBOL-471
> >> in any case.
> >>
> >> I also noted that "stanbolreserved_content" currently stores the
> >> content as parsed to the content hub, but is configured as
> >> indexed="true" and type="text_general". So in the case of a PDF file
> >> the binary content is processed as natural-language text AND is also
> >> indexed!
> >> If this field is used for full-text indexing (which I think is not the
> >> case, because I think the "text_all" field is used for that), then you
> >> need to ensure that the plain-text version is used for full-text
> >> indexing. The plain-text contents are available from enhanced
> >> ContentItems by using
> >> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
> >> As an alternative one could also use the features introduced by
> >> STANBOL-500 for this.
> >> If this field is used to store the actual content, then you should use
> >> a binary field type and deactivate indexing for this field.
> >>
> >> (**) All *_t fields use "string" as their field type. This means that
> >> no tokenizer is used AND queries are case sensitive.
> >> I do not think
> >> this is a good decision and would rather use the already defined
> >> "text_ws" type (whitespace tokenizer, word delimiter and lower case).
> >>
> >> best
> >> Rupert
> >>
> >> --
> >> | Rupert Westenthaler             [email protected]
> >> | Bodenlehenstraße 11             ++43-699-11108907
> >> | A-5500 Bischofshofen
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
