Hi Srecko, all @Stanbol developers: note the (*) and (**) comments at the end of this mail.
On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic <[email protected]> wrote:
> Until now I have developed a few applications for annotating documents
> using Apache Stanbol. Now I need to add indexing and search capabilities.
>
> I tried ContentHub
> (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
> in the way that I started the full launcher and accessed the web
> interface. There are a few possibilities: to provide text, to upload a
> document, to provide a URI… I tried to upload a few txt documents. I
> didn’t get any extracted entities,

The Contenthub shows the number of extracted enhancements. This can easily
be used as an indicator of whether the Stanbol Enhancer was able to extract
knowledge from the parsed content. Typical reasons for not getting the
expected enhancement results are:

1. Unsupported content type: the current version of Apache Stanbol uses the
   TikaEngine
   (http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
   to process non-plain-text content parsed to the Stanbol
   Enhancer/Contenthub. So everything that is covered by Apache Tika should
   also work just fine with Apache Stanbol.

2. Unsupported language: some enhancement engines (e.g. NER - Named Entity
   Recognition) only support certain languages. If the parsed content is in
   another language, they will not be able to process it. With the default
   configuration of Stanbol only English (and in the newest version Spanish
   and Dutch) documents will work. Users with custom configurations will
   also be able to process documents in other languages.

> but search (using Web View) worked fine.

This is because the Contenthub also supports full text search over the
parsed content. (*)

> Another step was to upload pdf documents and I got extracted entities
> grouped by People, Places, Concepts categories. It was also in the list
> of recently uploaded documents, but I couldn’t find any term from that
> document.
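If you want to check outside of the UI what the Enhancer extracts for a
given document, you can POST the content directly to the Enhancer's RESTful
endpoint. Here is a minimal sketch using only the JDK; it assumes the
default launcher on http://localhost:8080 and the "/enhancer" endpoint (on
older launchers the path is "/engines"), and the countEnhancements() helper
is just a rough illustration that counts annotation elements in the RDF/XML
response:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EnhanceText {

    /** POSTs plain text to the Stanbol Enhancer and returns the raw RDF/XML response. */
    static String enhance(String baseUrl, String text) throws IOException {
        URL url = new URL(baseUrl + "/enhancer"); // older launchers: "/engines"
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        con.setRequestProperty("Accept", "application/rdf+xml");
        try (OutputStream out = con.getOutputStream()) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            for (String line; (line = in.readLine()) != null; ) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Very rough count: occurrences of fise TextAnnotation/EntityAnnotation markers. */
    static int countEnhancements(String rdfXml) {
        int count = 0;
        for (String marker : new String[]{"TextAnnotation", "EntityAnnotation"}) {
            for (int i = rdfXml.indexOf(marker); i >= 0; i = rdfXml.indexOf(marker, i + 1)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) { // pass the base URL to actually call a running launcher
            String rdf = enhance(args[0], "Paris is the capital of France.");
            System.out.println("enhancement markers: " + countEnhancements(rdf));
        }
    }
}
```

If the marker count stays at 0 for your txt files, the two reasons above
(content type, language) are the first things to check.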
Based on your request I tried the following (with the default configuration
of the Full launcher). NOTE: this excludes the possibility to create your
own search index by using LDPath.

1) Upload some files to the Contenthub
   * file upload worked (some scientific papers from the local HD)
   * URL upload worked (some technical blogs + comments)
   * pasting text worked (some of the examples included with the Enhancer)
   * based on the UI I got > 100 enhancements for all tested PDFs

2) Test of the Contenthub search
   * keyword search also worked for me

3) Direct Solr searches on {host}/solr/default/contenthub/ (*)
   * searches like
     "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
     worked fine. Note that searches are case sensitive (**)
   * I think the keyword search uses the "text_all" field, so queries for
     "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
     return the same values as the UI of the Contenthub. This field
     basically supports full text search.
   * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
     *_workinstitutions ...) were missing. I think this is expected,
     because such fields require a dbpedia index that provides the
     required fields.

> I suppose that I will have to provide a stream from pdf (or any other
> kind of) documents and to index it like text? I need all the mentioned
> functionalities (index text, docs, URIs…) using a Java application and I
> would appreciate a code example, if it is available, please.

I think parsing of URIs is currently not possible via the RESTful API. For
using the RESTful services I would recommend the use of the Apache
HttpComponents HttpClient. Code examples on how to build requests can be
found at
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html

best
Rupert

Comments intended for Stanbol developers:
-----

(*) Normally I would expect the SolrIndex to only include the plain text
version of the parsed content within a field with stored=false.
However I assume that currently the index needs to store the actual
content, because it is also used to store the data. Is this correct? If
this is the case then it will get fixed with STANBOL-471 in any case.

I also noted that "stanbolreserved_content" currently stores the content as
parsed to the Contenthub but is configured as indexed="true" and
type="text_general". So in the case of a PDF file the binary content is
processed as natural language text AND is also indexed! So if this field is
used for full text indexing (which I think is not the case, because I think
the "text_all" field is used for that) then you need to ensure that the
plain text version is used for full text indexing. The plain text contents
are available from enhanced ContentItems by using
ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
As an alternative one could also use the features introduced by STANBOL-500
for this. If this field is used to store the actual content, then you
should use a binary field type and deactivate indexing for this field.

(**) All *_t fields use "string" as the field type. This means that no
tokenizer is used AND queries are case sensitive. I do not think this is a
good decision and would rather use the already defined "text_ws" type
(whitespace tokenizer, word delimiter and lower case filter).

best
Rupert

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
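PS (for the (**) comment): to make it concrete, this is roughly what I mean
in schema.xml terms. A sketch only: the analyzer chain shown is my reading
of what a "text_ws" type with whitespace tokenizer, word delimiter and
lower case filter looks like; the exact attributes would need to be checked
against the actual Stanbol-managed schema.

```xml
<!-- sketch: a tokenized, lower-cased type for the *_t dynamic fields -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- and map the dynamic fields to it instead of "string" -->
<dynamicField name="*_t" type="text_ws" indexed="true" stored="true"/>
```

With that, queries like organizations_t:stanford* would match regardless of
case.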
