On Thursday, November 13, 2003, at 03:22 AM, Hackl, Rene wrote:
documents contain very long strings for chemical substances, users are
interested in certain parts of the string (e.g. find all documents that
comprise "*foo*" be it "1-foo-bar" or "rab-oof-13-foonyl-naphthalene").

So you're saying you want users to be able to search for "of-13" and match that second one? Users really are demanding that?


Suggestions on improvements are always welcome! :-)

It seems like some very clever tokenization during analysis is what you're after. If you tokenized by dash (yes, you mentioned in the next message it is more complex than that, but just for this example let's simplify it to that), then the first document would have "1", "foo", "bar", and the second would have "rab", "oof", "13", "foonyl", and "naphthalene".


A PrefixQuery (not even a WildcardQuery) for "foo" would find both.
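
To make the dash example concrete, here is a minimal sketch in plain Python. It is not Lucene code: `tokenize` and `matches_prefix` are hypothetical helpers that only imitate what a dash-splitting analyzer plus a prefix match over the resulting terms would do.

```python
# Toy stand-in for the analysis step: split on dashes only.
# Real chemical names would need a more elaborate tokenizer.
def tokenize(name):
    return [t for t in name.lower().split("-") if t]


# Hypothetical prefix match: a document hits if any of its terms
# starts with the given prefix.
def matches_prefix(name, prefix):
    return any(term.startswith(prefix) for term in tokenize(name))


docs = ["1-foo-bar", "rab-oof-13-foonyl-naphthalene"]

# "foo" is a whole term in the first document and a prefix of
# "foonyl" in the second, so both documents are hits.
hits = [d for d in docs if matches_prefix(d, "foo")]
```

Under these assumptions the prefix match finds both documents without any leading wildcard, because "foo" starts a term in each of them.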

Now suppose the users want to search for "oo" and find both documents. First, I'd probably argue that this doesn't really make sense given the domain.

But keep in mind that WildcardQuery itself does support "*oo*", and it would work as expected (with the performance caveat that a leading wildcard has to be checked against every term, which hurts if the index is huge). If you want QueryParser to support a leading wildcard character, you would have to customize it yourself.
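
A rough illustration of why the leading wildcard is the expensive case, again in plain Python rather than Lucene. The names `tokenize` and `wildcard_hits` are made up for this sketch, and the standard library's `fnmatchcase` stands in for wildcard matching on a single term.

```python
from fnmatch import fnmatchcase


# Same toy dash tokenizer as in the simplified example.
def tokenize(name):
    return [t for t in name.lower().split("-") if t]


def wildcard_hits(docs, pattern):
    # With a leading "*" there is no fixed prefix to narrow the search,
    # so every term of every document has to be tested.
    return [d for d in docs
            if any(fnmatchcase(term, pattern) for term in tokenize(d))]


docs = ["1-foo-bar", "rab-oof-13-foonyl-naphthalene"]

oo_hits = wildcard_hits(docs, "*oo*")       # both documents
of13_hits = wildcard_hits(docs, "*of-13*")  # none: no term spans a dash
```

Note that once the strings are split on dashes, a pattern such as "*of-13*" can no longer match anything, because no single term contains the dash.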

Another, perhaps ridiculous, alternative is to index every substring of each piece as a token too: "f", "fo", "foo", "foon", "foony", "foonyl", "o", "oo", "oon"... and so on.
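
A small sketch of that substring idea, in plain Python with an ordinary dict playing the role of the index. `substrings` and `build_index` are hypothetical names, and the dash splitting is the same simplification as before.

```python
def substrings(token):
    # Every contiguous run of characters in the token.
    return {token[i:j]
            for i in range(len(token))
            for j in range(i + 1, len(token) + 1)}


def build_index(docs):
    # Map each substring of each dash-separated piece to the documents
    # that contain it, so a lookup is an exact match on one key.
    index = {}
    for doc in docs:
        for token in doc.lower().split("-"):
            for piece in substrings(token):
                index.setdefault(piece, set()).add(doc)
    return index


docs = ["1-foo-bar", "rab-oof-13-foonyl-naphthalene"]
index = build_index(docs)

oo_docs = index.get("oo", set())  # both documents, no wildcard needed
```

The price is index size: a piece of length n contributes up to n(n+1)/2 substrings, which is why this is only worth considering if infix searches really are required.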

Erik

