On 03/13/2012 12:46 PM, Rupert Westenthaler wrote:

Comments intended for Stanbol Developers:
-----

(*) Normally I would expect the SolrIndex to only include the plain
text version of the parsed content within a field with stored=false.
However, I assume that currently the index needs to store the actual
content, because it is also used to store the data. Is this correct?
If this is the case, then it will get fixed with STANBOL-471 anyway.

I also noted that "stanbolreserved_content" currently stores the
content as passed to the Contenthub but is configured as
indexed="true" and type="text_general". So in the case of a PDF file the
binary content is processed as natural language text AND is also
indexed!
So if this field is used for full text indexing (which I think is not
the case, because I think the "text_all" field is used for that) then
you need to ensure that the plain text version is used for full text
indexing. The plain text contents are available from enhanced
ContentItems by using
ContentItemHelper.getBlob(contentItem,Collections.singleton("text/plain")).
As an alternative one could also use the features introduced by
STANBOL-500 for this.
If this field is used to store the actual content, then you should use
a binary field type and deactivate indexing for this field.

(**) All *_t fields use string as field type. This means that no
tokenizer is used AND queries are case sensitive. I do not think this
is a good decision and would rather use the already defined "text_ws"
type (white space tokenizer, word delimiter and lower case).


best
Rupert


Hi,

I want to give some information about my last commit. It applies some changes to the default Contenthub index on Solr.

(*) "stanbolreserved_content" now indexes the text content of the document, but the content is not stored.

(*) "stanbolreserved_binarycontent" only stores the binary content; it is not indexed. STANBOL-471 will most probably remove these issues. For demo purposes, we may continue to store the non-indexed binary content.
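
As a rough illustration of the two content fields described above (this is not copied from the commit; the "binary" type name and the exact attribute values are assumptions inferred from the description), the declarations might look like this:

```xml
<!-- Sketch only: attribute values are inferred from the description,
     not taken from the committed schema.xml. -->

<!-- Assumed binary field type; solr.BinaryField is Solr's class for raw bytes. -->
<fieldType name="binary" class="solr.BinaryField"/>

<!-- Plain text of the document: searchable, but not kept in the index. -->
<field name="stanbolreserved_content" type="text_general"
       indexed="true" stored="false"/>

<!-- Original binary content: kept in the index, but not searchable. -->
<field name="stanbolreserved_binarycontent" type="binary"
       indexed="false" stored="true"/>
```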

(**) "*_t" keeps the "string" type because we want to provide faceted search on the names of extracted entities in the web GUI of Contenthub. Therefore, it is stored and indexed.

(**) "*_i" is added to the schema with the "text_ws" type. This type uses the WhitespaceTokenizerFactory, WordDelimiterFilterFactory, LowerCaseFilterFactory and RemoveDuplicatesTokenFilterFactory of Solr. So, for dynamic fields, if "x_t" exists, "x_i" also exists. This field is neither stored nor copied to "stanbolreserved_text_all". (BTW, I renamed "text_all" --> "stanbolreserved_text_all").

(**) Since "*_t" fields are copied to "stanbolreserved_text_all", the values of these fields are indexed through the "text_general" type. If you want to search on a specific field, you can use the one that ends with "_i" instead of "_t".
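
To make the dynamic field setup easier to picture, here is a minimal sketch of how it could be expressed in schema.xml. It is not the committed file: the filter attributes are left at their defaults, and the stored/multiValued settings of "stanbolreserved_text_all" are assumptions. How "x_i" gets its value is not shown here, so the sketch leaves that out.

```xml
<!-- Sketch only: inferred from the description, not the committed schema.xml. -->

<!-- Whitespace-tokenized, lower-cased text type used by the "*_i" fields. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Exact, case-sensitive values: suitable for facets on entity names. -->
<dynamicField name="*_t" type="string" indexed="true" stored="true"/>

<!-- Tokenized counterpart for field-specific search; not stored. -->
<dynamicField name="*_i" type="text_ws" indexed="true" stored="false"/>

<!-- Catch-all full text field (stored/multiValued values are assumptions). -->
<field name="stanbolreserved_text_all" type="text_general"
       indexed="true" stored="false" multiValued="true"/>

<!-- "*_t" values are also indexed through text_general via the catch-all field. -->
<copyField source="*_t" dest="stanbolreserved_text_all"/>
```

With a setup like this, a facet on a hypothetical field "person_t" would return the entity names unchanged, while a query such as person_i:obama would match regardless of case.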

Best,
Anil.
