Hi

On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic <[email protected]> wrote:
> Hi Rupert,
>
> and thank you for the answer. I need to read a few more things, but the
> answer helped me a lot.
great!

> If I understood well, the search is case sensitive, and if I need case
> insensitive search, I will have to implement application specific logic?

Keyword searches via the Contenthub and Solr queries for the field
"text_all" are case insensitive! Only searches for the fields
"organizations_t", "people_t" and "places_t" are case sensitive. However
I would consider this a bug, and the comment (**) in my previous mail
suggests correcting it.

best
Rupert

> Best,
> Srecko
>
>
> On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
> <[email protected]> wrote:
>>
>> Hi Srecko, all
>>
>> @Stanbol developers: Note the (*) and (**) comments at the end of this mail
>>
>> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
>> <[email protected]> wrote:
>> >
>> > Until now I have developed a few applications for annotating documents
>> > using Apache Stanbol. Now I need to add indexing and search
>> > capabilities.
>> >
>> > I tried the Contenthub
>> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
>> > by starting the full launcher and accessing the web interface. There
>> > are a few possibilities: to provide text, to upload a document, to
>> > provide a URI… I tried to upload a few txt documents. I didn’t get any
>> > extracted entities,
>>
>> The Contenthub shows the number of extracted enhancements. This can
>> easily be used as an indicator of whether the Stanbol Enhancer was able
>> to extract knowledge from the parsed content.
>>
>> Typical reasons for not getting the expected enhancement results are:
>>
>> 1. Unsupported content type: The current version of Apache Stanbol
>> uses the
>> [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
>> to process non-plain-text content parsed to the Stanbol
>> Enhancer/Contenthub. So everything that is covered by Apache Tika
>> should also work just fine with Apache Stanbol.
>>
>> 2. Unsupported language: Some Enhancement Engines (e.g. NER - Named
>> Entity Recognition) only support some languages. If the parsed content
>> is in another language, they will not be able to process it. With the
>> default configuration of Stanbol only English (and in the newest
>> version Spanish and Dutch) documents will work. Users with custom
>> configurations will also be able to process documents in other
>> languages.
>>
>> > but search (using Web View) worked fine.
>>
>> This is because the Contenthub also supports full text search over the
>> parsed content. (*)
>>
>> > Another step was to upload pdf documents and I got extracted entities
>> > grouped by People, Places and Concepts categories. It was also in the
>> > list of recently uploaded documents, but I couldn’t find any term
>> > from that document.
>>
>> Based on your request I tried the following (with the default
>> configuration of the Full launcher).
>> NOTE: this excludes the possibility to create your own search index by
>> using LDPath.
>>
>> 1) Upload some files to the Contenthub
>>
>> * file upload worked (some scientific papers from the local HD)
>> * URL upload worked (some technical blogs + comments)
>> * pasting text worked (some of the examples included with the enhancer)
>> * based on the UI I got > 100 enhancements for all tested PDFs
>>
>> 2) Test of the Contenthub search
>>
>> * keyword search also worked for me
>>
>> 3) Direct Solr searches on {host}/solr/default/contenthub/ (*)
>>
>> * searches like
>> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
>> worked fine. Note that these searches are case sensitive (**)
>> * I think the keyword search uses the "text_all" field, so queries
>> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
>> return the same values as the UI of the Contenthub. This field
>> basically supports full text search.
>> * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
>> *_workinstitutions ...) were missing.
>> I think this is expected, because such fields do require a dbpedia
>> index with the required fields.
>>
>>
>> > I suppose that I will have to provide a stream from pdf (or any other
>> > kind of) documents and to index it like text? I need all mentioned
>> > functionalities (index text, docs, URIs…) using a Java application,
>> > and I would appreciate a code example, if it is available, please.
>>
>> I think parsing of URIs is currently not possible by using the RESTful
>> API. For using the RESTful services I would recommend the use of
>> the Apache HttpComponents client. Code examples on how to build
>> requests can be found at
>>
>> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
>>
>>
>> best
>> Rupert
>>
>> Comments intended for Stanbol Developers:
>> -----
>>
>> (*) Normally I would expect the Solr index to only include the plain
>> text version of the parsed content within a field with stored=false.
>> However I assume that currently the index needs to store the actual
>> content, because it is also used to store the data. Is this correct?
>> If this is the case, then it will get fixed with STANBOL-471 in any
>> case.
>>
>> I also noted that "stanbolreserved_content" currently stores the
>> content as parsed to the Contenthub but is configured as
>> indexed="true" and type="text_general". So in the case of a PDF file
>> the binary content is processed as natural language text AND is also
>> indexed!
>> So if this field is used for full text indexing (which I think is not
>> the case, because I think the "text_all" field is used for that), then
>> you need to ensure that the plain text version is used for full text
>> indexing. The plain text contents are available from enhanced
>> ContentItems by using
>>
>> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain"))
>>
>> As an alternative one could also use the features introduced by
>> STANBOL-500 for this.
>> If this field is used to store the actual content, then you should use
>> a binary field type and deactivate indexing for this field.
>>
>> (**) All *_t fields use string as the field type. This means that no
>> tokenizer is used AND queries are case sensitive. I do not think this
>> is a good decision and would rather use the already defined "text_ws"
>> type (white space tokenizer, word delimiter and lower case).
>>
>>
>> best
>> Rupert
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11             ++43-699-11108907
>> | A-5500 Bischofshofen
>
>

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
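For readers who want to reproduce the direct Solr searches from point 3)
above from a Java application, here is a minimal sketch. It uses only the
JDK's HttpURLConnection (the Apache HttpComponents tutorial linked in the
thread covers the library-based equivalent); the host, port and the
"default" index name match the full-launcher defaults discussed above, but
may differ in other setups.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SolrQueryExample {

    // Builds a select URL for the Contenthub's default Solr index.
    // The index path and field names follow the examples in the mail
    // above; other setups may use a different index name than "default".
    static String buildQueryUrl(String host, String field, String value)
            throws Exception {
        String q = URLEncoder.encode(field + ":" + value, "UTF-8");
        return host + "/solr/default/contenthub/select?q=" + q;
    }

    // Executes the GET request. This needs a running Stanbol full
    // launcher, so it is not called from main() below.
    static String query(String host, String field, String value)
            throws Exception {
        URL url = new URL(buildQueryUrl(host, field, value));
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // Remember: "organizations_t" queries are case sensitive (see the
        // (**) comment above), while "text_all" queries are not.
        System.out.println(buildQueryUrl("http://localhost:8080",
                "organizations_t", "Stanford*"));
    }
}
```

Note that the field/value pair must be URL-encoded (the ":" becomes %3A);
the "*" wildcard passes through unchanged.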

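To make the (*) and (**) suggestions above concrete, the corresponding
schema.xml changes could roughly look like the fragment below. This is an
illustrative sketch only: the actual Stanbol schema may declare these
fields differently, and the "text_ws" and "binary" type names are assumed
to be defined elsewhere in the schema.

```xml
<!-- (**) Use a tokenized, lower-cased type for the *_t fields instead of
     "string", so that queries like organizations_t:stanford* become case
     insensitive. "text_ws" is assumed to be the already defined type with
     whitespace tokenizer, word delimiter and lower case filter. -->
<dynamicField name="*_t" type="text_ws" indexed="true" stored="true"/>

<!-- (*) If stanbolreserved_content must keep the raw (possibly binary)
     content, store it without indexing it as natural language text. -->
<field name="stanbolreserved_content" type="binary"
       indexed="false" stored="true"/>
```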