Yes, you did: {grin}
http://markmail.org/message/xdrcxwkuwgo3u65d
---
A. Soroka
Software & Systems Engineering :: Online Library Environment
the University of Virginia Library
On Mar 13, 2012, at 8:43 AM, srecko joksimovic wrote:
> Hi,
>
> I forgot to mention it, but I think I have posted this question before.
> Anyway, is it possible to configure Stanbol to run at
> http://xxx.xxx.xxx.xxx:9999/testing/ instead of http://localhost:9999/ ?
>
> Because of company policy I need to define the application URL, and it must
> be something like http://xxx.xxx.xxx.xxx:9999/testing/. That means
> (for example) that I need to have:
>
> http://xxx.xxx.xxx.xxx:9999/testing/enhancer/engine, instead of
> http://localhost:9999/enhancer/engine.
>
> Best,
> Srecko
>
> On Tue, Mar 13, 2012 at 1:29 PM, srecko joksimovic <
> [email protected]> wrote:
>
>> Hi,
>>
>> Ok, looks like I didn't understand that. It's clear now.
>>
>> Thank you.
>>
>> Srecko
>>
>>
>> On Tue, Mar 13, 2012 at 1:16 PM, Rupert Westenthaler <
>> [email protected]> wrote:
>>
>>> Hi
>>>
>>> On Tue, Mar 13, 2012 at 1:05 PM, srecko joksimovic
>>> <[email protected]> wrote:
>>>> Hi Rupert,
>>>>
>>>> and thank you for the answer. I need to read a few more things, but the
>>>> answer helped me a lot.
>>>
>>> great!
>>>
>>>> If I understood correctly, the search is case sensitive, and if I need a
>>>> case-insensitive search, I will have to implement application-specific logic?
>>>>
>>>
>>> Keyword searches via the content hub and Solr queries for the field
>>> "text_all" are case insensitive!
>>>
>>> Only searches for the fields "organizations_t", "people_t" and
>>> "places_t" are case sensitive. However, I would consider this a bug;
>>> the (**) comment in my previous mail suggests correcting it.
>>>
>>>
>>> best
>>> Rupert
>>>
>>>> Best,
>>>> Srecko
>>>>
>>>>
>>>> On Tue, Mar 13, 2012 at 11:46 AM, Rupert Westenthaler
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi Srecko, all
>>>>>
>>>>> @Stanbol developers: Note (*) and (**) comments at the end of this mail
>>>>>
>>>>> On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Until now I have developed a few applications for annotating documents
>>>>>> using Apache Stanbol. Now I need to add indexing and search capabilities.
>>>>>>
>>>>>> I tried the ContentHub
>>>>>> (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
>>>>>> by starting the full launcher and accessing the web interface. There are a
>>>>>> few options: to provide text, to upload a document, to provide a URI… I
>>>>>> tried to upload a few txt documents. I didn’t get any extracted entities,
>>>>>
>>>>> The content hub shows the number of extracted enhancements. This can
>>>>> easily be used as an indicator of whether the Stanbol Enhancer was able
>>>>> to extract knowledge from the parsed content.
>>>>>
>>>>> Typical reasons for not getting expected enhancement results are:
>>>>>
>>>>> 1. unsupported content type: The current version of Apache Stanbol
>>>>> uses the
>>>>> [TikaEngine](http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
>>>>> to process non-plain-text content parsed to the Stanbol
>>>>> Enhancer/Contenthub. So everything that is covered by Apache Tika
>>>>> should also work just fine with Apache Stanbol.
>>>>>
>>>>> 2. unsupported language: Some Enhancement Engines (e.g. NER - Named
>>>>> Entity Recognition) only support certain languages. If the parsed
>>>>> content is in another language, they will not be able to process it.
>>>>> With the default configuration of Stanbol only English (and in the
>>>>> newest version Spanish and Dutch) documents will work. Users with
>>>>> custom configurations will also be able to process documents in
>>>>> other languages.
>>>>>
>>>>>> but search (using Web View) worked fine.
>>>>>
>>>>> This is because the Contenthub also supports full text search over the
>>>>> parsed content. (*)
>>>>>
>>>>>> Another step was to upload pdf
>>>>>> documents, and I got extracted entities grouped by People, Places and
>>>>>> Concepts categories. The document was also in the list of recently
>>>>>> uploaded documents, but I couldn’t find any term from it.
>>>>>>
>>>>>
>>>>> Based on your request I tried the following (with the default
>>>>> configuration of the Full launcher).
>>>>> NOTE: this excludes the possibility to create your own search index by
>>>>> using LDPath.
>>>>>
>>>>> 1) upload some files to the content hub
>>>>>
>>>>> * file upload worked (some scientific papers from the local HD)
>>>>> * URL upload worked (some technical blogs + comments)
>>>>> * pasting text worked (some of the examples included for the enhancer)
>>>>> * based on the UI I got > 100 enhancements for all tested PDFs
>>>>>
>>>>> 2) test of the contenthub search
>>>>>
>>>>> * keyword search also worked for me
>>>>>
>>>>> 3) direct solr searches on {host}/solr/default/contenthub/ (*)
>>>>>
>>>>> * searches like
>>>>> "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
>>>>> worked fine. Note that searches are case sensitive (**)
>>>>> * I think the keyword search uses the "text_all" field. So queries
>>>>> for "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
>>>>> return the same values as the UI of the content hub. This field
>>>>> basically supports full text search.
>>>>> * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
>>>>> *_workinstitutions ...) were missing. I think this is expected,
>>>>> because such fields require a dbpedia index with the required fields.
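[Editorial note: the direct Solr queries above can also be built programmatically. A minimal Java sketch of assembling such a select URL; the host value and the `SolrQueryUrl` class name are illustrative assumptions, while the `/solr/default/contenthub/` path follows the examples in the mail.]

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrQueryUrl {

    // Builds a Solr select URL against the contenthub's default index.
    // The /solr/default/contenthub/ path follows the examples above;
    // the host is an assumption (adjust to your launcher).
    static String selectUrl(String host, String field, String value) {
        String q = URLEncoder.encode(field + ":" + value, StandardCharsets.UTF_8);
        return host + "/solr/default/contenthub/select?q=" + q;
    }

    public static void main(String[] args) {
        // Roughly the query the content hub UI issues for a keyword search.
        System.out.println(selectUrl("http://localhost:8080", "text_all", "stanbol"));
    }
}
```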
>>>>>
>>>>>
>>>>>>
>>>>>> I suppose that I will have to provide a stream from pdf (or any other
>>>>>> kind of) documents and to index it like text? I need all the mentioned
>>>>>> functionalities (index text, docs, URIs…) from a Java application, and
>>>>>> I would appreciate a code example, if one is available, please.
>>>>>>
>>>>>
>>>>> I think parsing of URIs is currently not possible via the RESTful
>>>>> API. For using the RESTful services I would recommend the Apache
>>>>> HttpComponents client. Code examples on how to build requests can be
>>>>> found at
>>>>>
>>>>> http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
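[Editorial note: a minimal sketch of what such an upload request might look like. It uses the JDK's built-in `java.net.http` client (Java 11+) instead of Apache HttpClient to stay dependency-free; the `/contenthub/store` path, the host/port, and the `ContenthubUpload` class name are assumptions to verify against your Stanbol launcher.]

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class ContenthubUpload {

    // Builds (but does not send) a POST request that uploads plain text.
    // ASSUMPTION: the store endpoint lives at {baseUrl}/contenthub/store.
    static HttpRequest buildUpload(String baseUrl, String content) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/contenthub/store"))
                .header("Content-Type", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofString(content, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildUpload("http://localhost:8080",
                "Paris is the capital of France.");
        // To actually send it: HttpClient.newHttpClient().send(req, BodyHandlers.ofString())
        System.out.println(req.method() + " " + req.uri());
    }
}
```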
>>>>>
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>> Comments intended for Stanbol Developers:
>>>>> -----
>>>>>
>>>>> (*) Normally I would expect the SolrIndex to only include the plain
>>>>> text version of the parsed content within a field with stored=false.
>>>>> However I assume that currently the index needs to store the actual
>>>>> content, because it is also used to store the data. Is this correct?
>>>>> If this is the case then it will get fixed with STANBOL-471 in any
>>>>> case.
>>>>>
>>>>> I also noted that "stanbolreserved_content" currently stores the
>>>>> content as parsed to the content hub but is configured as
>>>>> indexed="true" and type="text_general". So in the case of a PDF file
>>>>> the binary content is processed as natural language text AND is also
>>>>> indexed!
>>>>> So if this field is used for full text indexing (which I think is not
>>>>> the case, because I think the "text_all" field is used for that) then
>>>>> you need to ensure that the plain text version is used for full text
>>>>> indexing. The plain text contents are available from enhanced
>>>>> ContentItems by using
>>>>> ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
>>>>> As an alternative one could also use the features introduced by
>>>>> STANBOL-500 for this.
>>>>> If this field is used to store the actual content, then you should use
>>>>> a binary field type and deactivate indexing for this field.
>>>>>
>>>>> (**) All *_t fields use "string" as the field type. This means that no
>>>>> tokenizer is used AND queries are case sensitive. I do not think this
>>>>> is a good decision and would rather use the already defined "text_ws"
>>>>> type (white space tokenizer, word delimiter and lower case).
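[Editorial note: the suggested (**) change would amount to a one-attribute edit in the contenthub's Solr schema.xml. A sketch under the assumption that *_t is declared as a dynamic field with these attributes; check the actual schema before applying.]

```xml
<!-- current: type "string" means no tokenizer, case-sensitive queries -->
<dynamicField name="*_t" type="string" indexed="true" stored="true"/>

<!-- suggested: the already defined "text_ws" type (whitespace tokenizer,
     word delimiter, lower case) makes *_t queries case insensitive -->
<dynamicField name="*_t" type="text_ws" indexed="true" stored="true"/>
```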
>>>>>
>>>>>
>>>>> --
>>>>> | Rupert Westenthaler [email protected]
>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>> | A-5500 Bischofshofen
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler [email protected]
>>> | Bodenlehenstraße 11 ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>>