Re: jena-text indexing fields with KeywordAnalyzer

Andy Seaborne Fri, 14 Mar 2014 02:07:24 -0700

On 14/03/14 07:29, Osma Suominen wrote:

Hi Paul!


On 14/03/14 02:51, Paul Tyson wrote:

I just tried out the jena-text indexing and query capabilities of jena
2.11. Great stuff, but the property values I indexed contain part
numbers that frequently contain hyphens. Apparently Lucene's
StandardAnalyzer tokenizes on hyphens, so my initial search results were
quite puzzling.


Yes, StandardAnalyzer is "smart" for many scenarios but not good for
everything.
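
To make the hyphen problem concrete, here is a minimal sketch in plain Java. It does not call Lucene at all: wordSplit() and keyword() are hypothetical stand-ins that only approximate what a word-splitting analyzer and a keyword-style analyzer do, and the part number is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class HyphenTokenDemo {
    // Simplified stand-in for a word-splitting analyzer (NOT Lucene code):
    // break on anything that is not a letter or digit, then lower-case.
    static List<String> wordSplit(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Simplified stand-in for a keyword-style analyzer (NOT Lucene code):
    // the whole value becomes a single, unmodified token.
    static List<String> keyword(String value) {
        List<String> tokens = new ArrayList<>();
        tokens.add(value);
        return tokens;
    }

    public static void main(String[] args) {
        String partNumber = "AB-1234-X"; // hypothetical part number
        System.out.println(wordSplit(partNumber)); // prints [ab, 1234, x]
        System.out.println(keyword(partNumber));   // prints [AB-1234-X]
    }
}
```

With the word-splitting behaviour, a search for the full part number has to match three separate tokens, which would explain puzzling results; with the keyword behaviour the value is matched only as a whole.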

However, even with the limited results, I can see that the text queries
are much faster than strstarts() or regex() filters on the same property
values. So I would like to try indexing the property values using
Lucene's KeywordAnalyzer. I think I can see in the code how this could
be easily done.


Searching using an index is typically much faster than filters, because
the text index will directly give you (at least approximately) the hits
you need, whereas a filter requires traversing through a lot more rows
and throwing most of them away.
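
The difference can be sketched in plain Java, under the assumption that the index behaves like a simple map from term to row numbers. This is an illustration only, not how jena-text or Lucene are implemented; the class, the method names and the sample values are all hypothetical, and the equality test in scan() stands in for a strstarts() or regex() filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexVsFilterDemo {
    // Build a toy inverted index: term -> rows whose value is that term.
    static Map<String, List<Integer>> buildIndex(List<String> values) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int row = 0; row < values.size(); row++) {
            index.computeIfAbsent(values.get(row), k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Index search: one map lookup returns the hits directly.
    static List<Integer> lookup(Map<String, List<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.emptyList());
    }

    // Filter: visit every row and throw away the ones that do not match.
    static List<Integer> scan(List<String> values, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int row = 0; row < values.size(); row++) {
            if (values.get(row).equals(term)) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("AB-1", "CD-2", "AB-1"); // hypothetical data
        System.out.println(lookup(buildIndex(values), "AB-1")); // prints [0, 2]
        System.out.println(scan(values, "AB-1"));               // prints [0, 2]
    }
}
```

Both calls return the same rows, but the scan touches every value on each query, while the lookup cost does not grow with the number of non-matching rows once the index has been built.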

Has anyone else encountered this problem? Have I missed some other way
to improve response time for a filtered string search, or overestimated
the possible performance improvement? (I'm new to Lucene.) Would the
developers consider an enhancement to make this option configurable in
the text assembler?

Yes - that would be a good contribution. "developers" is everyone. As
an Apache project, the PMC and committers are primarily responsible for
the community around the codebase. Of course, they might also be
significant contributors as well but the whole community are
contributors. That's how it can scale.

It's of course possible to just replace StandardAnalyzer with
KeywordAnalyzer in the code and compile your own modified jena-text.
Making it configurable would require some more work...

One way is to put the analyzer to use in the EntityDefinition, and have
it settable from the assembler.


        Andy

However, another possible solution is to switch to the Solr backend also
supported by jena-text. Then you can configure all fields exactly as you
like using Solr's schema.xml configuration file [1].

-Osma

[1] http://wiki.apache.org/solr/SchemaXml
