Re: Indexing and searching using Apache Stanbol

Rupert Westenthaler Tue, 13 Mar 2012 06:30:18 -0700

Let me just add one additional bit of information to that:

If you change the alias for the "Apache Stanbol Web Application" this will NOT 
affect the path for the published Solr Servers ("{host}/solr" by default).


To change this you will need to also change the configuration of the 
"SolrServerPublishingComponent" to "/{alias}/solr/" (property: 
org.apache.stanbol.commons.solr.web.dispatchfilter.prefix).

Note that older Stanbol versions also included a configuration for the 
"SolrDispatchFilterComponent" (search for "Dispatch Filter Configuration" in 
the configuration tab). If you find such a configuration you can safely remove 
it as it just duplicates the functionality provided by the above. If you do not 
remove this configuration the Solr indexes might be available with and without 
{alias}.
(Stanbol versions based on a revision < 1299616 might be affected by that).
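
To illustrate the effect of the prefix setting, here is a minimal Java sketch that builds the Solr select URL for the contenthub index with and without an alias. The class and method names (ContenthubSolrUrl, solrPrefix, selectUrl) are made up for this example and are not part of Stanbol. It assumes that the core path "default/contenthub" stays the same below the changed prefix, which you should verify against your installation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Apache Stanbol.
public class ContenthubSolrUrl {

    // Returns the value expected for the dispatch filter prefix property:
    // "/solr/" for an empty alias, "/{alias}/solr/" otherwise.
    static String solrPrefix(String alias) {
        // Strip leading and trailing slashes so "testing" and "/testing/" behave the same.
        String trimmed = alias.replaceAll("^/+|/+$", "");
        return trimmed.isEmpty() ? "/solr/" : "/" + trimmed + "/solr/";
    }

    // Builds a select URL for a "field:term" query on the contenthub index.
    // Assumption: the core is published at "default/contenthub" below the prefix.
    static String selectUrl(String host, String alias, String field, String term) {
        // URL-encode the query so that ':' and other reserved characters are safe.
        String query = URLEncoder.encode(field + ":" + term, StandardCharsets.UTF_8);
        return host + solrPrefix(alias) + "default/contenthub/select?q=" + query;
    }

    public static void main(String[] args) {
        // Default installation, no alias
        System.out.println(selectUrl("http://localhost:9999", "", "text_all", "stanbol"));
        // Web application moved to the alias "testing"
        System.out.println(selectUrl("http://localhost:9999", "testing", "text_all", "stanbol"));
    }
}
```

The resulting URL can be requested with any HTTP client. Keeping the prefix logic in one place means only the alias has to change when the web application is moved.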

best
Rupert

On 13.03.2012, at 13:58, [email protected] wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Yes, you did: {grin}
> 
> http://markmail.org/message/xdrcxwkuwgo3u65d
> 
> - ---
> A. Soroka
> Software & Systems Engineering :: Online Library Environment
> the University of Virginia Library
> 
> On Mar 13, 2012, at 8:43 AM, srecko joksimovic wrote:
> 
>> Hi,
>> 
>> I forgot to mention, but I think that I have posted this question before.
>> Anyway, is it possible to configure Stanbol to run at
>> http://xxx.xxx.xxx.xxx:9999/testing/ instead of http://localhost:9999/ ?
>> 
>> Because of the company policy I need to define application URL, and it must
>> have something similar to http://xxx.xxx.xxx.xxx:9999/testing/. That means
>> (for example) that I need to have:
>> 
>> http://xxx.xxx.xxx.xxx:9999/testing/enhancer/engine, instead of
>> http://localhost:9999/enhancer/engine.
>> 
>> Best,
>> Srecko
>> 
>> On Tue, Mar 13, 2012 at 1:29 PM, srecko joksimovic <
>> [email protected]> wrote:
>> 
>>> Hi,
>>> 
>>> Ok, looks like I didn't understand that. It's clear now.
>>> 
>>> Thank you.
>>> 
>>> Srecko
>>> 
>>> 
>>> On Tue, Mar 13, 2012 at 1:16 PM, Rupert Westenthaler <
>>> [email protected]> wrote:
>>> 
>>>> Hi
>>>> 
>>>> On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic
>>>> <[email protected]> wrote:
>>>>> Hi Rupert,
>>>>> 
>>>>> and thank you for the answer. I need to read few more things, but the
>>>> answer
>>>>> helped me a lot.
>>>> 
>>>> great!
>>>> 
>>>>> If I understood well, the search is case sensitive, and if I need case
>>>>> insensitive search, I will have to implement application specific logic?
>>>>> 
>>>> 
>>>> Keyword searches via the content hub and Solr query for the field
>>>> "text_all" are case insensitive!
>>>> 
>>>> Only searches for the fields "organizations_t", "people_t" and
>>>> "places_t" are case sensitive. However I would consider this as a bug
>>>> and the comment (**) in my previous mail suggests to correct that.
>>>> 
>>>> 
>>>> best
>>>> Rupert
>>>> 
>>>>> Best,
>>>>> Srecko
>>>>> 
>>>>> 
>>>>> On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Hi Srecko, all
>>>>>> 
>>>>>> @Stanbol developers: Note (*) and (**) comments at the end of this mail
>>>>>> 
>>>>>> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>> Until now I have developed a few applications for annotating documents
>>>>>>> using
>>>>>>> Apache Stanbol. Now I need to add indexing and search capabilities.
>>>>>>> 
>>>>>>> I tried ContentHub
>>>>>>> 
>>>>>>> (
>>>> http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
>>>>>>> in the way that I started full launcher and access web interface.
>>>> There
>>>>>>> are
>>>>>>> few possibilities: to provide text, to upload document, to provide an
>>>>>>> URI… I
>>>>>>> tried to upload a few txt documents. I didn’t get any extracted
>>>>>>> entities,
>>>>>> 
>>>>>> The content hub shows the number of extracted enhancements. This can
>>>>>> easily be used as indicator if the Stanbol Enhancer was able to
>>>>>> extract knowledge from the parsed content.
>>>>>> 
>>>>>> Typical reasons for not getting expected enhancement results are:
>>>>>> 
>>>>>> 1. unsupported content type: The current version of Apache Stanbol
>>>>>> uses the
>>>>>> [TikaEngine](
>>>> http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html
>>>> )
>>>>>> to process non-plain-text content parsed to the Stanbol
>>>>>> Enhancer/Contenthub. So everything that is covered by Apache Tika
>>>>>> should also work just fine with Apache Stanbol.
>>>>>> 
>>>>>> 2. unsupported language: Some Enhancement Engines (e.g. NER - Named
>>>>>> Entity Recognition) only support some languages. If the parsed
>>>>>> content is in another language they will not be able to process the
>>>>>> parsed content. With the default configuration of Stanbol only English
>>>>>> (and in the newest version Spanish and Dutch) documents will work.
>>>>>> Users with custom configurations will also be able to process
>>>>>> documents in other languages.
>>>>>> 
>>>>>>> but search (using Web View) worked fine.
>>>>>> 
>>>>>> This is because the Contenthub also supports full text search over the
>>>>>> parsed content. (*)
>>>>>> 
>>>>>>> Another step was to upload pdf
>>>>>>> documents and I got extracted entities grouped by People, Places
>>>>>>> Concepts
>>>>>>> categories. It was also in the list of recently uploaded documents,
>>>> but
>>>>>>> I
>>>>>>> couldn’t find any term from that document.
>>>>>>> 
>>>>>> 
>>>>>> Based on your request I tried the following (with the default
>>>>>> configuration of the Full launcher)
>>>>>> NOTE: this excludes the possibility to create your own search index by
>>>>>> using LDPath.
>>>>>> 
>>>>>> 1) upload some files to the content hub
>>>>>> 
>>>>>>  * file upload worked (some scientific papers from the local HD)
>>>>>>  * URL upload worked (some technical blogs + comments)
>>>>>>  * pasting text worked (some of the examples included for the
>>>> enhancer)
>>>>>>  * based on the UI I got > 100 enhancements for all tested PDFs
>>>>>> 
>>>>>> 2) test of the contenthub search
>>>>>> 
>>>>>>  * keyword search worked also for me
>>>>>> 
>>>>>> 3) direct solr searches on {host}/solr/default/contenthub/ (*)
>>>>>> 
>>>>>>  * searches like
>>>>>> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
>>>>>> worked fine. Note that searches are case sensitive (**)
>>>>>>  * I think the keyword search uses the "text_all" field. So queries
>>>>>> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
>>>>>> return the same values as the UI of the content hub. This field
>>>>>> basically supports full text search.
>>>>>>  * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
>>>>>> *_workinstitutions ...) were missing. I think this is expected,
>>>>>> because such fields do require a dbpedia index with the required
>>>>>> fields.
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> I suppose that I will have to provide a stream from pdf (or any other
>>>>>>> kind)
>>>>>>> documents and to index it like text? I need all mentioned
>>>>>>> functionalities
>>>>>>> (index text, docs, URIs…) using Java application and I would
>>>> appreciate
>>>>>>> a
>>>>>>> code example, if it is available, please.
>>>>>>> 
>>>>>> 
>>>>>> I think parsing of URIs is currently not possible by using the RESTful
>>>>>> API. For using the RESTful services I would recommend using
>>>>>> the Apache HttpComponents client. Code examples on how to build requests
>>>>>> can be found at
>>>>>> 
>>>>>> 
>>>> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
>>>>>> 
>>>>>> 
>>>>>> best
>>>>>> Rupert
>>>>>> 
>>>>>> Comments intended for Stanbol Developers:
>>>>>> -----
>>>>>> 
>>>>>> (*) Normally I would expect the SolrIndex to only include the plain
>>>>>> text version of the parsed content within a field with stored=false.
>>>>>> However I assume that currently the index needs to store the actual
>>>>>> content, because it is also used to store the data. Is this correct?
>>>>>> If this is the case then it will get fixed with STANBOL-471 in any
>>>>>> case.
>>>>>> 
>>>>>> I also noted that "stanbolreserved_content" currently stores the
>>>>>> content as parsed to the content hub but is configured as
>>>>>> indexed="true" and type="text_general". So in case of a PDF file the
>>>>>> binary content is processed as natural language text AND is also
>>>>>> indexed!
>>>>>> So if this field is used for full text indexing (which I think is not
>>>>>> the case, because I think the "text_all" field is used for that) then
>>>>>> you need to ensure that the plain text version is used for full text
>>>>>> indexing. The plain text contents are available from enhanced
>>>>>> ContentItems by using
>>>>>> 
>>>>>> ContentItemHelper.getBlob(contentItem,Collections.singleton("text/plain")).
>>>>>> As an alternative one could also use the features introduced by
>>>>>> STANBOL-500 for this.
>>>>>> If this field is used to store the actual content, then you should use
>>>>>> a binary field type and deactivate indexing for this field.
>>>>>> 
>>>>>> (**) All *_t fields use string as field type. This means that no
>>>>>> tokenizer is used AND queries are case sensitive. I do not think this
>>>>>> is a good decision and would rather use the already defined "text_ws"
>>>>>> type (white space tokenizer, word delimiter and lower case)
>>>>>> 
>>>>>> 
>>>>>> best
>>>>>> Rupert
>>>>>> 
>>>>>> --
>>>>>> | Rupert Westenthaler             [email protected]
>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>> | A-5500 Bischofshofen
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> | Rupert Westenthaler             [email protected]
>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> | A-5500 Bischofshofen
>>>> 
>>> 
>>> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> Comment: GPGTools - http://gpgtools.org
> 
> iQEcBAEBAgAGBQJPX0RpAAoJEATpPYSyaoIkDakH/AwYwQAr7tSGOo8k1RPRSFN2
> rEw08y14v0Pun9I83s0o83vTkENS+QOVkxmnxHJssRuFIe8OiUypAA29ZuiQ6DQk
> qcZ81AHik4Nx7gWamxVt+1LcobZ8P7/2iYkDfAoGdarU4cRhfAfRgUOb8Rha/bs3
> 0ApbZB/7gxk8YSj1OhY+xo78l4uDOHA94STYch6u/iQnhHXGDU8yQ4rxyX/EW7He
> Q3I7YVQXisxaNAgkQ/Vdgraw3ujJv45Wrv0wGCA0BWEJZRjlK4uil5/9oMogFdZY
> OoLQ3FkQPeRJdJkwStW1HscT6dv+sjZNOmkmCCFL8OC5dqXSuC8S/nDu+jLHzvA=
> =PgTe
> -----END PGP SIGNATURE-----
