Auto Slop

Walt Stoneburner Tue, 03 Jul 2007 14:03:03 -0700

I've solved the problem, thanks to tips from Mark Miller and Ard
Schrijvers, and am simply recording it so that someone else walking
through the archives might get some benefit.


A while ago I was working on a case-sensitive version of Lucene in
which a prefix symbol indicates whether you are interested in the
token as-is or in its case-insensitive version.
(For instance, 'LET' is a real acronym, whereas 'let' is a stop word.)

I did this by writing a filter that injected two tokens for every one.
The problem was that token.setPositionIncrement(0) wasn't being
called on the duplicate token.  As a result, each duplicate took up a
position of its own, so there appeared to be intermediate tokens
between the real ones when both copies should have occupied the same
logical position.

Adding that simple call made phrases work perfectly, even with all
the modified functionality I've added.  Gotta love Lucene!
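
To make the position arithmetic concrete, here is a minimal sketch in
plain Java.  It does not use the Lucene API: Tok, CaseDupSketch,
expand and positions are hypothetical names made up for illustration,
and the lowercased duplicate merely stands in for whatever the real
filter injects.  It assumes positions are a running sum of increments
with the first token landing at position 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a Lucene token: only the text and its position increment.
final class Tok {
    final String text;
    final int positionIncrement;

    Tok(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }
}

final class CaseDupSketch {
    // For each word, emit the as-is token (increment 1) followed by a
    // lowercased duplicate carrying dupIncrement. Passing 0 models the fix;
    // passing 1 models what happens when the increment is never set to 0.
    static List<Tok> expand(List<String> words, int dupIncrement) {
        List<Tok> out = new ArrayList<>();
        for (String w : words) {
            out.add(new Tok(w, 1));
            out.add(new Tok(w.toLowerCase(Locale.ROOT), dupIncrement));
        }
        return out;
    }

    // Absolute position of each token: a running sum of increments,
    // starting at -1 so that the first token lands at position 0
    // (an assumption of this sketch).
    static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.positionIncrement;
            result.add(pos);
        }
        return result;
    }
}
```

For the words "LET" and "it", a duplicate increment of 0 yields
positions 0, 0, 1, 1, so each pair shares a position.  A duplicate
increment of 1 yields 0, 1, 2, 3, which puts "LET" and "it" two
positions apart and explains why an exact phrase stops matching.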

A helpful debugging tool would have been a way for Luke to dump the
token positions.  Clearly it can reconstruct a document from its
tokens, but it would have been extra nice to see the numerical
positions they resided at, not just the sequence of tokens.
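
As a rough illustration of the kind of dump meant here, the following
plain-Java sketch prints each term next to its computed position.  It
is not a Luke or Lucene feature: PositionDumpSketch and dump are
hypothetical names, and the two input arrays are assumed to have been
collected from the analyzer by some other means.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionDumpSketch {
    // terms[i] is a token's text and increments[i] its position increment;
    // both arrays are hypothetical inputs of equal length.
    static List<String> dump(String[] terms, int[] increments) {
        List<String> lines = new ArrayList<>();
        int pos = -1; // so that a first increment of 1 lands at position 0
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];
            lines.add(pos + "\t" + terms[i]);
        }
        return lines;
    }
}
```

Tokens that share a position show up with the same number, which makes
a missing zero increment easy to spot.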

Hope this helps someone else in the future,
-Walt Stoneburner,
http://www.wwco.com/~wls/blog/
