Re: dash-words

Yonik Seeley Tue, 25 Jul 2006 08:42:55 -0700

On 7/25/06, Martin Braun <[EMAIL PROTECTED]> wrote:

Hi Yonik,


>> I can't figure out what the parameters does. ;)
>
> Yes, it will fail without slop... I don't think there is a practical
> way around that.

I am trying to analyze your WordDelimiterFilter.

If I have x-men, after analyzing (with catenateAll) I get this:


 Analzying "The x-men story"
        de.unihd.ub.ftsearch.WordIndexAnalyzer:
                [the] [x] [men] [xmen] [story]


1: [the:0->3:word]
2: [x:4->5:word]
3: [men:6->9:word] [xmen:4->9:word]
4: [story:10->15:word]

1: [the]
2: [x]
3: [men] [xmen]
4: [story]


So a Phrase search to "The xmen story" will fail. With a slop of 1 the
doc will be found.

But when generating the query I won't know when to use a slop. So adding
slops isn't a nice solution.


If you can't tolerate slop, this is a problem.

The only 100% solution that I could think of to this problem is to
re-index the entire stream (with a very large position gap inbetween)
for each variant.

"the x men story"
"the xmen story"

Problems:
1) combinatorial explosion very quickly (not practical at all)
2) messes up idfs pretty badly

Phrase slop is the easiest workaround, esp when you wanted slop anyway.

Would it be a solution, to take the concatenated synonyms to both
Positions? Or are there any drawbacks with this?


I considered that too... but it increases false matches, and it still
doesn't fix many phrase queries.

1: [the]
2: [x] [xmen]
3: [men] [xmen]
4: [story]


While "the xmen" and "xmen story" will now both match, "the xmen
story" will still fail to match.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: dash-words

Reply via email to