Ard Schrijvers wrote:
and sorry for spamming, but I just want to share my findings/impressions, and
what I am posting I am willing to implement and port to the JackRabbit trunk
(so if you bother to read it, and are positive about it, I will implement it
:-) )
you don't have to feel sorry, your input is very welcome!
[...]
So, one part that bothers me, is multilinguality (with lang specific
stopwords, stemming, synonyms). Many customers these days want multilingual
sites, and search them accordingly. And, obviously, lucene has quite some
code for exactly this : see contrib/analyzers/src/java.
Obviously, lucene has many more analyzers, and you can easily add your own.
AFAIU, there is a single configuration place where I can define the overall
JackRabbit analyzer that is used within one workspace:
in repository.xml :
<param name="analyzer"
value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
but what I want is a per-property definable analyzer (I would give body_fr
a French analyzer, body_de a German one, and some properties, like zipcodes,
I might want to be indexed with keyword analyzers). The best place for this IMO
is the IndexingConfiguration: then, if you do not configure it, nothing
changes for you.
So, for example the first index rule at
http://wiki.apache.org/jackrabbit/IndexingConfiguration would change in:
<index-rule nodeType="nt:unstructured" boost="2.0">
  <property analyzer="org.apache.lucene.analysis.de.GermanAnalyzer">text_de</property>
</index-rule>
and during loading, we construct a Map of {jr-property,analyzer} (call it
propertyAnalyzerMap). Then, all we need to add is one jackrabbit global
analyzer that looks like:
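import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

class JRAnalyzer extends Analyzer {
    // built while loading the indexing configuration: property -> analyzer
    Map propertyAnalyzerMap;
    Analyzer defaultAnalyzer = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        Analyzer analyzer = (Analyzer) propertyAnalyzerMap.get(fieldName);
        if (analyzer != null) {
            // a property-specific analyzer is configured for this field
            return analyzer.tokenStream(fieldName, reader);
        } else {
            return this.defaultAnalyzer.tokenStream(fieldName, reader);
        }
    }
}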
This very same JRAnalyzer is also used for the QueryParser in
LuceneQueryBuilder, so this will also work for searching IIUC. So, WDOT? I
can implement it and send a patch, but if the community is reluctant, I
will have to do it for myself in a way that does not intrude on the jr code.
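
As a rough illustration of the loading step described above, the sketch below reads the proposed analyzer attribute from an indexing configuration and collects a map from property name to analyzer class name. PropertyAnalyzerConfig and its parse method are hypothetical names, not Jackrabbit code, and the analyzer attribute is the proposal from this mail, not an existing configuration option. The sketch only stores class names; turning them into analyzer instances (for example via reflection) is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Jackrabbit.
class PropertyAnalyzerConfig {

    /** Maps each property name to the class name in its "analyzer" attribute. */
    static Map<String, String> parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new HashMap<>();
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                // getAttribute returns "" when the attribute is absent
                String analyzer = prop.getAttribute("analyzer");
                if (!analyzer.isEmpty()) {
                    result.put(prop.getTextContent().trim(), analyzer);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid indexing configuration", e);
        }
    }
}
```

Properties without an analyzer attribute are simply not put in the map, so they would fall through to the default analyzer, which matches the "if you do not configure it, nothing changes" goal.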
This would work quite well for jcr:contains functions that operate on a
property. However I'm not sure what to do with this:
//*[jcr:contains(., 'hägar')]
the node scope does not indicate which analyzer to use for the query statement.
Would we just run the statement through all analyzers and combine them in an OR
query?
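
To make the OR idea concrete, here is a minimal sketch of it. It assumes a strong simplification: an "analyzer" is modeled as a plain function from the query text to one normalized term, standing in for a real Lucene Analyzer (which produces a token stream). NodeScopeQuerySketch and orQuery are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; the Function stands in for a Lucene Analyzer.
class NodeScopeQuerySketch {

    /**
     * Runs the query text through every configured analyzer and joins the
     * distinct results with OR, keeping the order of first appearance.
     */
    static String orQuery(String text, List<Function<String, String>> analyzers) {
        Set<String> variants = new LinkedHashSet<>();
        for (Function<String, String> analyzer : analyzers) {
            variants.add(analyzer.apply(text));
        }
        return String.join(" OR ", variants);
    }
}
```

Analyzers that produce the same term collapse into a single clause, so the query only grows with the number of analyzers that actually disagree on the input.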
Example of the SynonymProvider mentioned at the top:
If my suggested changes are accepted, things like a SynonymProvider become
superfluous, and synonyms become very easy to add on the fly:
suppose I always want full text searching with Dutch synonyms on the "body"
property of my nodes. This boils down to adding an analyzer for this property
that extends the DutchAnalyzer in lucene and adds synonym functionality
(there is a very simple example in the "Lucene in Action" book). I think it is
better to do synonyms during analysis (as opposed to the SynonymProvider in jr
trunk), and simply use an analyzer for it. Of course, one difference is that
with the current SynonymProvider you have to state explicitly that you want a
synonym search (~term), while with an analyzer you define which properties
should be indexed with a synonym analyzer, and they are searched accordingly
(without having to specify it).
well, those are actually the reasons why I implemented it the other way. If you
go the analyzer way to expand synonyms you have to re-index the complete content
if you want to add a single synonym. I also wanted the user to decide if
synonyms should be considered. Again this would not be possible if the analyzer
automatically adds synonyms.
but fortunately, with jackrabbit both are possible ;) if one prefers to expand
terms at index time, just use an appropriate analyzer and don't configure a
SynonymProvider.
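
The trade-off can be sketched with a small query-time lookup. This is an illustration under assumptions, not Jackrabbit's SynonymProvider interface: QueryTimeSynonyms, add and expand are hypothetical names, and only the leading ~ syntax is taken from the discussion above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a query-time synonym lookup.
class QueryTimeSynonyms {
    private final Map<String, List<String>> synonyms = new HashMap<>();

    /** Registers a synonym; nothing in the index has to change for this. */
    void add(String term, String synonym) {
        synonyms.computeIfAbsent(term, k -> new ArrayList<>()).add(synonym);
    }

    /** Expands the term only when the user asked for it with a leading '~'. */
    List<String> expand(String queryTerm) {
        List<String> terms = new ArrayList<>();
        if (queryTerm.startsWith("~")) {
            String term = queryTerm.substring(1);
            terms.add(term);
            terms.addAll(synonyms.getOrDefault(term, new ArrayList<>()));
        } else {
            // no '~': the user did not ask for synonyms
            terms.add(queryTerm);
        }
        return terms;
    }
}
```

Because the expansion happens on the query side, a newly added synonym takes effect on the next search, whereas an analyzer that injects synonyms at index time only affects content indexed after the change.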
So WDOT? Again, sry for mailing so much, just trying to sell my ideas :-)
again, your ideas are very welcome.
regards
marcel