Re: IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider

Marcel Reutegger Mon, 13 Aug 2007 06:20:41 -0700

Ard Schrijvers wrote:

and sorry for spamming, but I just want to share my findings/impressions, and
what I am posting I am willing to implement and port to the Jackrabbit trunk
(so if you bother to read it, and are positive about it, I will implement it
:-) )


you don't have to feel sorry, your input is very welcome!

[...]

So, one part that bothers me, is multilinguality (with lang specific
stopwords, stemming, synonyms). Many customers these days want multilingual
sites, and search them accordingly. And, obviously, lucene has quite some
code for exactly this : see contrib/analyzers/src/java.

Obviously, lucene has many more analyzers, and you can easily add your own.
AFAIU, there is a single configuration place where I can define the overall
JackRabbit analyzer that is used within one workspace:

in repository.xml :

<param name="analyzer"
value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

but what I want is a per-property definable analyzer (I would give body_fr
a French analyzer, body_de a German one, and some properties I might want to
be indexed with keyword analyzers, like zipcodes). The best place for this,
IMO, is the IndexingConfiguration: then, if you do not configure it, nothing
changes for you.

So, for example, the first index rule at
http://wiki.apache.org/jackrabbit/IndexingConfiguration would change to:

<index-rule nodeType="nt:unstructured" boost="2.0">
  <property analyzer="org.apache.lucene.analysis.de.GermanAnalyzer">text_de</property>
</index-rule>

and during loading, we construct a Map of {jr-property, analyzer} (call it
propertyAnalyzerMap). Then, all we need to add is one Jackrabbit global
analyzer, that looks like:

class JRAnalyzer extends Analyzer {

    Analyzer defaultAnalyzer = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // propertyAnalyzerMap is the {jr-property, analyzer} map built during loading
        Analyzer analyzer = (Analyzer) propertyAnalyzerMap.get(fieldName);
        if (analyzer != null) {
            return analyzer.tokenStream(fieldName, reader);
        } else {
            return this.defaultAnalyzer.tokenStream(fieldName, reader);
        }
    }
}

This very same JRAnalyzer is also used for the QueryParser in
LuceneQueryBuilder, so this will work also for searching IIUC. So, WDOT? I
can implement it and send a patch, but if the community is reluctant to it, I
will have to do it for myself in a non jr code intrusive way.
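
The loading step described above (building propertyAnalyzerMap from the indexing configuration) could look roughly like the following. This is only a sketch: the analyzer attribute is the one proposed in this mail and not an existing configuration option, the class name PropertyAnalyzerConfig is made up for illustration, and the map holds analyzer class names rather than instances so that the example runs without Lucene on the classpath. A real implementation would also have to resolve namespace prefixes in the property names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Jackrabbit.
class PropertyAnalyzerConfig {

    /**
     * Returns property name -> analyzer class name for every
     * property element that carries the proposed analyzer attribute.
     */
    public static Map<String, String> load(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            Map<String, String> result = new LinkedHashMap<>();
            NodeList props = root.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                // getAttribute returns "" when the attribute is absent
                String analyzer = prop.getAttribute("analyzer");
                if (!analyzer.isEmpty()) {
                    result.put(prop.getTextContent().trim(), analyzer);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid indexing configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<index-rule nodeType=\"nt:unstructured\" boost=\"2.0\">"
                + "<property analyzer=\"org.apache.lucene.analysis.de.GermanAnalyzer\">text_de</property>"
                + "<property>title</property>"
                + "</index-rule>";
        System.out.println(load(xml));
    }
}
```

Properties without the attribute are simply left out of the map, so they would fall through to the default analyzer, which matches the "if you do not configure it, nothing changes" goal.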

This would work quite well for jcr:contains functions that operate on a property. However I'm not sure what to do with this:


//*[jcr:contains(., 'hägar')]

The node scope does not indicate which analyzer to use for the query statement. Would we just run the statement through all analyzers and combine them in an OR query?
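
To make the "all analyzers, combined with OR" idea concrete, here is a small self-contained sketch. It deliberately does not use Lucene: the analyzers are plain functions from text to terms, GERMAN_LIKE is only a crude imitation of language-specific normalization, and the class and method names are invented for this example. It shows nothing more than the shape of the approach: analyze the statement once per configured analyzer, drop duplicate results, and OR the rest.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration, not Jackrabbit or Lucene code.
class NodeScopeQuerySketch {

    // Stand-in for a standard analyzer: lower-cases and splits on whitespace.
    public static final Function<String, List<String>> STANDARD = text ->
            Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));

    // Stand-in for a language-specific analyzer: additionally folds umlauts.
    public static final Function<String, List<String>> GERMAN_LIKE = text ->
            STANDARD.apply(text).stream()
                    .map(t -> t.replace('\u00e4', 'a')
                               .replace('\u00f6', 'o')
                               .replace('\u00fc', 'u'))
                    .collect(Collectors.toList());

    /** Runs the statement through every analyzer and ORs the distinct results. */
    public static String buildOrQuery(String statement,
                                      List<Function<String, List<String>>> analyzers) {
        Set<String> clauses = new LinkedHashSet<>();
        for (Function<String, List<String>> analyzer : analyzers) {
            // identical analysis results collapse into a single clause
            clauses.add(String.join(" ", analyzer.apply(statement)));
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(buildOrQuery("H\u00e4gar", Arrays.asList(STANDARD, GERMAN_LIKE)));
    }
}
```

One consequence visible even in this toy version: the number of clauses grows with the number of distinct analyzers, and a term analyzed for one language may match content that was indexed for another.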

Example of the SynonymProvider mentioned at the top:

If my suggested changes are accepted, things like a SynonymProvider become
superfluous, and very easy to add on the fly:

suppose I want full-text searching with Dutch synonyms on the "body" property
of my nodes. This boils down to adding an analyzer for this property that
extends the DutchAnalyzer in Lucene and adds synonym functionality
(very simple example in the "Lucene in Action" book). I think it is better to
do synonyms during analyzing (as opposed to the SynonymProvider in jr trunk),
and simply use an analyzer for it. Of course, one difference would be that
with the current SynonymProvider you specifically have to define that
you do a synonym search (~term), while with an analyzer, you define which
properties should be indexed with a synonym analyzer, and searched
accordingly (without having to specify it),

well, those are actually the reasons why I implemented it the other way. If you go the analyzer way to expand synonyms you have to re-index the complete content if you want to add a single synonym. I also wanted the user to decide if synonyms should be considered. Again this would not be possible if the analyzer automatically adds synonyms.

but fortunately, with Jackrabbit both is possible ;) if one prefers to expand terms on index time, just use an appropriate analyzer and don't configure a SynonymProvider.
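
The trade-off between the two approaches can be shown with a toy inverted index. This is a sketch under heavy simplification: SynonymExpansionSketch and all of its methods are hypothetical, the synonym table is a plain in-memory map, and nothing here reflects how Jackrabbit or Lucene store terms. The expandSynonyms flag plays the role of the explicit ~term syntax.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not Jackrabbit or Lucene code.
class SynonymExpansionSketch {

    // Mutable synonym table (term -> synonyms), symmetric.
    private final Map<String, Set<String>> synonyms = new HashMap<>();
    // Toy inverted index: term -> ids of documents containing it.
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addSynonym(String term, String synonym) {
        synonyms.computeIfAbsent(term, k -> new HashSet<>()).add(synonym);
        synonyms.computeIfAbsent(synonym, k -> new HashSet<>()).add(term);
    }

    /** Indexes a document; with expandAtIndexTime, currently known synonyms are stored too. */
    public void index(int docId, String text, boolean expandAtIndexTime) {
        for (String term : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            post(term, docId);
            if (expandAtIndexTime) {
                for (String s : synonyms.getOrDefault(term, Collections.<String>emptySet())) {
                    post(s, docId);
                }
            }
        }
    }

    /** Looks up a term; with expandSynonyms, the term's synonyms are searched as well. */
    public Set<Integer> search(String term, boolean expandSynonyms) {
        Set<Integer> hits = new TreeSet<>(
                index.getOrDefault(term, Collections.<Integer>emptySet()));
        if (expandSynonyms) {
            for (String s : synonyms.getOrDefault(term, Collections.<String>emptySet())) {
                hits.addAll(index.getOrDefault(s, Collections.<Integer>emptySet()));
            }
        }
        return hits;
    }

    private void post(String term, int docId) {
        index.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
    }
}
```

With index-time expansion, a synonym added after a document was indexed is not found until that document is indexed again; with query-time expansion the same synonym takes effect immediately, but only when the caller asks for it.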

So WDOT? Again, sry for mailing so much, just trying to sell my ideas :-)


again, your ideas are very welcome.

regards
 marcel
