On 7/25/06, Martin Braun <[EMAIL PROTECTED]> wrote:
Hi Yonik,
>> I can't figure out what the parameters does. ;)
>
> Yes, it will fail without slop... I don't think there is a practical
> way around that.
I am trying to analyze your WordDelimiterFilter.
If I have x-men, after analyzing (with catenateAll) I get this:
Analzying "The x-men story"
de.unihd.ub.ftsearch.WordIndexAnalyzer:
[the] [x] [men] [xmen] [story]
1: [the:0->3:word]
2: [x:4->5:word]
3: [men:6->9:word] [xmen:4->9:word]
4: [story:10->15:word]
1: [the]
2: [x]
3: [men] [xmen]
4: [story]
So a Phrase search to "The xmen story" will fail. With a slop of 1 the
doc will be found.
But when generating the query I won't know when to use a slop. So adding
slops isn't a nice solution.
If you can't tolerate slop, this is a problem.
The only 100% solution that I could think of to this problem is to
re-index the entire stream (with a very large position gap inbetween)
for each variant.
"the x men story"
"the xmen story"
Problems:
1) combinatorial explosion very quickly (not practical at all)
2) messes up idfs pretty badly
Phrase slop is the easiest workaround, esp when you wanted slop anyway.
Would it be a solution, to take the concatenated synonyms to both
Positions? Or are there any drawbacks with this?
I considered that too... but it increases false matches, and it still
doesn't fix many phrase queries.
1: [the]
2: [x] [xmen]
3: [men] [xmen]
4: [story]
While "the xmen" and "xmen story" will now both match, "the xmen
story" will still fail to match.
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]