Hi, I forgot to mention this earlier, and I think I may have posted this question before. Anyway, is it possible to configure Stanbol to run at http://xxx.xxx.xxx.xxx:9999/testing/ instead of http://localhost:9999/ ?
Because of company policy I need to define the application URL, and it must look something like http://xxx.xxx.xxx.xxx:9999/testing/. That means (for example) that I would need http://xxx.xxx.xxx.xxx:9999/testing/enhancer/engine instead of http://localhost:9999/enhancer/engine.

Best,
Srecko

On Tue, Mar 13, 2012 at 1:29 PM, srecko joksimovic <[email protected]> wrote:
> Hi,
>
> Ok, looks like I didn't understand that. It's clear now.
>
> Thank you.
>
> Srecko
>
>
> On Tue, Mar 13, 2012 at 1:16 PM, Rupert Westenthaler
> <[email protected]> wrote:
>
>> Hi
>>
>> On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic
>> <[email protected]> wrote:
>> > Hi Rupert,
>> >
>> > and thank you for the answer. I need to read a few more things, but
>> > the answer helped me a lot.
>>
>> great!
>>
>> > If I understood well, the search is case sensitive, and if I need case
>> > insensitive search, I will have to implement application specific logic?
>> >
>>
>> Keyword searches via the content hub and Solr queries for the field
>> "text_all" are case insensitive!
>>
>> Only searches for the fields "organizations_t", "people_t" and
>> "places_t" are case sensitive. However I would consider this a bug,
>> and the comment (**) in my previous mail suggests correcting that.
>>
>> best
>> Rupert
>>
>> > Best,
>> > Srecko
>> >
>> >
>> > On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
>> > <[email protected]> wrote:
>> >>
>> >> Hi Srecko, all
>> >>
>> >> @Stanbol developers: note the (*) and (**) comments at the end of this mail.
>> >>
>> >> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
>> >> <[email protected]> wrote:
>> >> >
>> >> > Until now I have developed a few applications for annotating
>> >> > documents using Apache Stanbol. Now I need to add indexing and
>> >> > search capabilities.
>> >> >
>> >> > I tried ContentHub
>> >> > (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
>> >> > in the way that I started the full launcher and accessed the web
>> >> > interface. There are a few possibilities: to provide text, to upload
>> >> > a document, to provide a URI… I tried to upload a few txt documents.
>> >> > I didn’t get any extracted entities,
>> >>
>> >> The content hub shows the number of extracted enhancements. This can
>> >> easily be used as an indicator of whether the Stanbol Enhancer was
>> >> able to extract knowledge from the parsed content.
>> >>
>> >> Typical reasons for not getting the expected enhancement results are:
>> >>
>> >> 1. Unsupported content type: the current version of Apache Stanbol
>> >> uses the
>> >> [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
>> >> to process non-plain-text content parsed to the Stanbol
>> >> Enhancer/Contenthub. So everything that is covered by Apache Tika
>> >> should also work just fine with Apache Stanbol.
>> >>
>> >> 2. Unsupported language: some enhancement engines (e.g. NER, Named
>> >> Entity Recognition) only support certain languages. If the parsed
>> >> content is in another language, they will not be able to process it.
>> >> With the default configuration of Stanbol only English (and, in the
>> >> newest version, Spanish and Dutch) documents will work. Users with
>> >> custom configurations will also be able to process documents in
>> >> other languages.
>> >>
>> >> > but search (using Web View) worked fine.
>> >>
>> >> This is because the Contenthub also supports full text search over
>> >> the parsed content. (*)
>> >>
>> >> > Another step was to upload pdf
>> >> > documents and I got extracted entities grouped by People, Places and
>> >> > Concepts categories. The document was also in the list of recently
>> >> > uploaded documents, but I couldn’t find any term from that document.
>> >> >
>> >>
>> >> Based on your request I tried the following (with the default
>> >> configuration of the full launcher).
>> >> NOTE: this excludes the possibility to create your own search index
>> >> by using LDPath.
>> >>
>> >> 1) Upload some files to the content hub
>> >>
>> >> * file upload worked (some scientific papers from the local HD)
>> >> * URL upload worked (some technical blogs + comments)
>> >> * pasting text worked (some of the examples included with the enhancer)
>> >> * based on the UI I got > 100 enhancements for all tested PDFs
>> >>
>> >> 2) Test of the contenthub search
>> >>
>> >> * keyword search also worked for me
>> >>
>> >> 3) Direct Solr searches on {host}/solr/default/contenthub/ (*)
>> >>
>> >> * searches like
>> >> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
>> >> worked fine. Note that searches are case sensitive (**).
>> >> * I think the keyword search uses the "text_all" field, so queries
>> >> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}"
>> >> should return the same values as the UI of the content hub. This
>> >> field basically supports full text search.
>> >> * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
>> >> *_workinstitutions ...) were missing. I think this is expected,
>> >> because such fields require a dbpedia index with the required
>> >> fields.
>> >>
>> >> >
>> >> > I suppose that I will have to provide a stream from pdf (or any
>> >> > other kind of) documents and to index it like text? I need all the
>> >> > mentioned functionalities (index text, docs, URIs…) using a Java
>> >> > application, and I would appreciate a code example, if one is
>> >> > available, please.
>> >> >
>> >>
>> >> I think parsing of URIs is currently not possible via the RESTful
>> >> API. For using the RESTful services I would recommend the use of
>> >> the Apache HttpComponents client.
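[Editor's note: to illustrate the Solr searches mentioned in this thread, here is a minimal Java sketch that only builds the query URLs (no request is sent). The `/solr/default/contenthub/select` path and the field names come from the mails above; the host/port, class name and helper methods are illustrative assumptions, so adjust them to your launcher.]

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/**
 * Sketch: building Solr query URLs against the Contenthub's default index.
 * Only the core path and field names come from this thread; the host/port
 * and all names below are illustrative, not part of Stanbol's API.
 */
public class ContenthubQueries {

    // Adjust host and port to where your launcher actually runs.
    static final String BASE = "http://localhost:8080/solr/default/contenthub/select";

    // URL-encode the Solr query parameter value.
    private static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** Full text search: the "text_all" field is case insensitive. */
    static String fullTextQuery(String keyword) {
        return BASE + "?q=" + encode("text_all:" + keyword);
    }

    /** Field search: the "*_t" fields are currently case sensitive. */
    static String fieldQuery(String field, String value) {
        return BASE + "?q=" + encode(field + ":" + value);
    }

    public static void main(String[] args) {
        System.out.println(fullTextQuery("stanbol"));
        System.out.println(fieldQuery("organizations_t", "Stanford*"));
    }
}
```

The URLs these methods produce can then be fetched with Apache HttpClient (or plain `java.net.HttpURLConnection`) as Rupert suggests.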
>> >> Code examples on how to build requests can be found at
>> >> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> Comments intended for Stanbol developers:
>> >> -----
>> >>
>> >> (*) Normally I would expect the Solr index to only include the plain
>> >> text version of the parsed content within a field with stored=false.
>> >> However I assume that currently the index needs to store the actual
>> >> content, because it is also used to store the data. Is this correct?
>> >> If this is the case then it will get fixed with STANBOL-471 in any
>> >> case.
>> >>
>> >> I also noted that "stanbolreserved_content" currently stores the
>> >> content as parsed to the content hub but is configured as
>> >> indexed="true" and type="text_general". So in the case of a PDF file
>> >> the binary content is processed as natural language text AND is also
>> >> indexed!
>> >> So if this field is used for full text indexing (which I think is not
>> >> the case, because I think the "text_all" field is used for that) then
>> >> you need to ensure that the plain text version is used for full text
>> >> indexing. The plain text contents are available from enhanced
>> >> ContentItems by using
>> >> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
>> >> As an alternative one could also use the features introduced by
>> >> STANBOL-500 for this.
>> >> If this field is used to store the actual content, then you should
>> >> use a binary field type and deactivate indexing for this field.
>> >>
>> >> (**) All *_t fields use "string" as the field type. This means that
>> >> no tokenizer is used AND queries are case sensitive.
>> >> I do not think this is a good decision and would rather use the
>> >> already defined "text_ws" type (whitespace tokenizer, word delimiter
>> >> and lower case filter).
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> --
>> >> | Rupert Westenthaler [email protected]
>> >> | Bodenlehenstraße 11 ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >
>> >
>>
>> --
>> | Rupert Westenthaler [email protected]
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>>
>
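[Editor's note: Rupert's (**) suggestion could look roughly like the following fragment in the Contenthub's Solr schema.xml. This is a hypothetical sketch, not the actual Stanbol configuration; the analyzer chain simply mirrors his description (whitespace tokenizer, word delimiter, lower case).]

```xml
<!-- Hypothetical sketch of the (**) suggestion: analyze the "*_t" fields
     with a "text_ws"-style type instead of the untokenized, case
     sensitive "string" type. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- dynamic field matching organizations_t, people_t, places_t, ... -->
<dynamicField name="*_t" type="text_ws" indexed="true" stored="true"/>
```

With such a change, a query like q=organizations_t:stanford* would match "Stanford" regardless of casing, because both the indexed tokens and the query terms are lower-cased.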
