Hi Srecko, all @Stanbol developers: note the (*) and (**) comments at the end of this mail.
On Mon, Mar 12, 2012 at 6:03 PM, Srecko Joksimovic <[email protected]> wrote:
> Until now I have developed a few applications for annotating documents
> using Apache Stanbol. Now I need to add indexing and search capabilities.
>
> I tried ContentHub
> (http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min)
> in the way that I started the full launcher and accessed the web
> interface. There are a few possibilities: to provide text, to upload a
> document, to provide a URI… I tried to upload a few txt documents. I
> didn’t get any extracted entities,

The Contenthub shows the number of extracted enhancements. This can easily
be used as an indicator of whether the Stanbol Enhancer was able to extract
knowledge from the parsed content. Typical reasons for not getting the
expected enhancement results are:

1. Unsupported content type: the current version of Apache Stanbol uses the
   TikaEngine
   (http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/tikaengine.html)
   to process non-plain-text content parsed to the Stanbol
   Enhancer/Contenthub. So everything that is covered by Apache Tika should
   also work just fine with Apache Stanbol.

2. Unsupported language: some enhancement engines (e.g. NER - Named Entity
   Recognition) only support certain languages. If the parsed content is in
   another language, they will not be able to process it. With the default
   configuration of Stanbol only English (and in the newest version Spanish
   and Dutch) documents will work. Users with custom configurations will
   also be able to process documents in other languages.

> but search (using Web View) worked fine.

This is because the Contenthub also supports full text search over the
parsed content. (*)

> Another step was to upload pdf documents and I got extracted entities
> grouped by People, Places, Concepts categories. It was also in the list
> of recently uploaded documents, but I couldn’t find any term from that
> document.
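If you want to check outside of the UI what the Enhancer extracts for a
given document, you can POST the content directly to the Enhancer's RESTful
endpoint. Here is a minimal sketch using only the JDK; it assumes the
default launcher on http://localhost:8080 and the "/enhancer" endpoint (on
older launchers the path is "/engines"), and the countEnhancements() helper
is just a rough illustration that counts annotation elements in the RDF/XML
response:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EnhanceText {

    /** POSTs plain text to the Stanbol Enhancer and returns the raw RDF/XML response. */
    static String enhance(String baseUrl, String text) throws IOException {
        URL url = new URL(baseUrl + "/enhancer"); // older launchers: "/engines"
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        con.setRequestProperty("Accept", "application/rdf+xml");
        try (OutputStream out = con.getOutputStream()) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            for (String line; (line = in.readLine()) != null; ) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Very rough count: occurrences of fise TextAnnotation/EntityAnnotation markers. */
    static int countEnhancements(String rdfXml) {
        int count = 0;
        for (String marker : new String[]{"TextAnnotation", "EntityAnnotation"}) {
            for (int i = rdfXml.indexOf(marker); i >= 0; i = rdfXml.indexOf(marker, i + 1)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) { // pass the base URL to actually call a running launcher
            String rdf = enhance(args[0], "Paris is the capital of France.");
            System.out.println("enhancement markers: " + countEnhancements(rdf));
        }
    }
}
```

If the marker count stays at 0 for your txt files, the two reasons above
(content type, language) are the first things to check.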
Based on your request I tried the following (with the default configuration
of the Full launcher). NOTE: this excludes the possibility to create your
own search index by using LDPath.

1) Upload some files to the Contenthub
   * file upload worked (some scientific papers from the local HD)
   * URL upload worked (some technical blogs + comments)
   * pasting text worked (some of the examples included with the Enhancer)
   * based on the UI I got > 100 enhancements for all tested PDFs

2) Test of the Contenthub search
   * keyword search also worked for me

3) Direct Solr searches on {host}/solr/default/contenthub/ (*)
   * searches like
     "{host}/solr/default/contenthub/select?q=organizations_t:Stanford*"
     worked fine. Note that searches are case sensitive (**)
   * I think the keyword search uses the "text_all" field, so queries for
     "{host}/solr/default/contenthub/select?q=text_all:{keyword}" should
     return the same values as the UI of the Contenthub. This field
     basically supports full text search.
   * all the semantic "stanbolreserved_*" fields (e.g. *_countries,
     *_workinstitutions ...) were missing. I think this is expected,
     because such fields require a dbpedia index that provides the
     required fields.

> I suppose that I will have to provide a stream from pdf (or any other
> kind of) documents and to index it like text? I need all the mentioned
> functionalities (index text, docs, URIs…) using a Java application and I
> would appreciate a code example, if it is available, please.

I think parsing of URIs is currently not possible via the RESTful API. For
using the RESTful services I would recommend the use of the Apache
HttpComponents HttpClient. Code examples on how to build requests can be
found at
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html

best
Rupert

Comments intended for Stanbol developers:
-----

(*) Normally I would expect the SolrIndex to only include the plain text
version of the parsed content within a field with stored=false.
However I assume that currently the index needs to store the actual
content, because it is also used to store the data. Is this correct? If
this is the case then it will get fixed with STANBOL-471 in any case.

I also noted that "stanbolreserved_content" currently stores the content as
parsed to the Contenthub but is configured as indexed="true" and
type="text_general". So in the case of a PDF file the binary content is
processed as natural language text AND is also indexed! So if this field is
used for full text indexing (which I think is not the case, because I think
the "text_all" field is used for that) then you need to ensure that the
plain text version is used for full text indexing. The plain text contents
are available from enhanced ContentItems by using
ContentItemHelper.getBlob(contentItem, Collections.singleton("text/plain")).
As an alternative one could also use the features introduced by STANBOL-500
for this. If this field is used to store the actual content, then you
should use a binary field type and deactivate indexing for this field.

(**) All *_t fields use "string" as the field type. This means that no
tokenizer is used AND queries are case sensitive. I do not think this is a
good decision and would rather use the already defined "text_ws" type
(whitespace tokenizer, word delimiter and lower case filter).

best
Rupert

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
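PS (for the (**) comment): to make it concrete, this is roughly what I mean
in schema.xml terms. A sketch only: the analyzer chain shown is my reading
of what a "text_ws" type with whitespace tokenizer, word delimiter and
lower case filter looks like; the exact attributes would need to be checked
against the actual Stanbol-managed schema.

```xml
<!-- sketch: a tokenized, lower-cased type for the *_t dynamic fields -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- and map the dynamic fields to it instead of "string" -->
<dynamicField name="*_t" type="text_ws" indexed="true" stored="true"/>
```

With that, queries like organizations_t:stanford* would match regardless of
case.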
