On 07/23/2008 at 5:09 PM, Steven A Rowe wrote:
> Karl Wettin's recently committed ShingleMatrixAnalyzer
Oops, "ShingleMatrixAnalyzer" -> "ShingleMatrixFilter".
Steve
Hi Ryan,
Well, at 100 million+ keywords, Lucene might be the right tool.
One thing that you might check out for the query side is Karl Wettin's recently
committed ShingleMatrixAnalyzer (not in any Lucene release yet - only on the
trunk).
The JUnit test class TestShingleMatrixFilter has an example of how to use it.
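In outline, the usage is something like the sketch below -- untested, and since this is trunk code the constructor arguments and the old-style TokenStream iteration assumed here may well differ from what you find:

  import java.io.StringReader;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;
  import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;

  public class ShingleSketch {
    public static void main(String[] args) throws Exception {
      // Tokenize a query string, then wrap the stream so it emits
      // shingles (token n-grams) of two and three words.
      TokenStream shingles = new ShingleMatrixFilter(
          new WhitespaceTokenizer(new StringReader("please divide this sentence")),
          2,   // minimum shingle size (assumed parameter order)
          3);  // maximum shingle size
      // Old-style TokenStream iteration; the trunk may use a newer API.
      for (Token t = shingles.next(); t != null; t = shingles.next()) {
        System.out.println(t.term());
      }
    }
  }

Matching a document against multi-word keywords then reduces to looking up each emitted shingle, instead of re-scanning the document per keyword.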
Heh, actually I'm using Perl, but I've always associated text search
with Lucene; I'm not sure whether it's the best solution or not. On the
small side there are 1.6 million keywords; on the large side there are
well over 100 million, but I might find another way to break the
searches down into smaller pieces.
Hi Ryan,
I'm not sure Lucene's the right tool for this job.
I have used regular expressions and ternary search trees in the past to do
similar things.
Is the set of keywords too large for an in-memory solution like these? If not,
consider using a tool like the Perl package Regex::PreSuf, which builds a single
compact regular expression from a list of words.
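If you end up doing this on the Java side instead, a rough equivalent (minus the prefix/suffix factoring Regex::PreSuf performs) is one alternation over quoted keywords -- the class name and sample words here are just for illustration:

  import java.util.Arrays;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class KeywordAlternation {
    // Compile one pattern that matches any keyword on a word boundary.
    static Pattern build(List<String> keywords) {
      StringBuilder sb = new StringBuilder("\\b(?:");
      for (int i = 0; i < keywords.size(); i++) {
        if (i > 0) sb.append('|');
        sb.append(Pattern.quote(keywords.get(i)));  // escape regex metacharacters
      }
      return Pattern.compile(sb.append(")\\b").toString());
    }

    public static void main(String[] args) {
      Pattern p = build(Arrays.asList("lucene", "shingle", "token"));
      Matcher m = p.matcher("a shingle is just a token n-gram");
      while (m.find()) {
        System.out.println(m.group());  // prints "shingle", then "token"
      }
    }
  }

At 1.6 million keywords a naive alternation like this gets unwieldy, which is exactly where the prefix/suffix sharing (or a ternary search tree) pays off.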
You need to invert the process. Using Lucene may not be the best option... You
need to make your document a key into an index of keywords. I've done the
same thing, but not with Lucene. You pass through the document and, for
each word (token), look it up in some index (hashtable) to find possible matches.
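Something like this sketch (the keyword set and document are made-up samples; multi-word keywords would need the shingle trick from earlier in the thread):

  import java.util.HashSet;
  import java.util.Set;

  public class InvertedScan {
    public static void main(String[] args) {
      // Load the keyword list into memory once.
      Set<String> keywords = new HashSet<String>();
      keywords.add("lucene");
      keywords.add("perl");

      // Pass through the document; each token costs one hashtable probe,
      // so the scan is linear in document length, not in keyword count.
      String document = "Heh, actually I'm using Perl, not Lucene.";
      for (String token : document.toLowerCase().split("\\W+")) {
        if (keywords.contains(token)) {
          System.out.println("matched keyword: " + token);
        }
      }
    }
  }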