[ http://issues.apache.org/jira/browse/LUCENE-537?page=all ]
Karl Wettin updated LUCENE-537:
-------------------------------
Attachment: ngram_spellcheck_karl_v3.tar
This update includes some name changes, small optimizations of logic and a fix
to the evil bug that rendered both my version and the SVN version unusable, as
mentioned in an earlier comment.
It might be worth mentioning that in my derivative of this code I cache all
suggestions in a Map. This really speeds things up, and does not consume that
much RAM.
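
A minimal sketch of that kind of cache, assuming the suggestion lookup can be
expressed as a function from a word to an array of suggestions. The class name
CachingSuggester and the "lookup" parameter are hypothetical and are not part
of the attached code or of Lucene:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical wrapper: remembers the suggestions computed for each word.
// "lookup" stands in for whatever call produces suggestions; it is an
// assumption for illustration, not a Lucene API.
final class CachingSuggester {
    private final Map<String, String[]> cache = new ConcurrentHashMap<>();
    private final Function<String, String[]> lookup;

    CachingSuggester(Function<String, String[]> lookup) {
        this.lookup = lookup;
    }

    // Returns the cached array; callers should treat it as read-only.
    String[] suggest(String word) {
        // computeIfAbsent invokes the lookup only when the word is not cached yet.
        return cache.computeIfAbsent(word, lookup);
    }

    int cachedWords() {
        return cache.size();
    }
}
```

The cache here is unbounded, which matches the description above only as long
as the number of distinct query words stays modest.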
As a side note, I feel that the "suggest only more frequent terms" option is not
satisfactory. The threshold should be a strategy, and I think there must be a
better one than what is available.
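
One way the threshold could be made pluggable, sketched under the assumption
that a decision needs only the two term frequencies and the suggestion score.
SuggestionFilter and both factory methods are hypothetical names, not something
in the attachment:

```java
// Hypothetical strategy interface; none of these names exist in Lucene.
interface SuggestionFilter {
    // Decide whether a candidate should be offered for the query word.
    boolean accept(int queryFrequency, int candidateFrequency, float score);
}

final class SuggestionFilters {
    private SuggestionFilters() {}

    // The behaviour described above: only suggest terms that are more frequent.
    static SuggestionFilter moreFrequentOnly() {
        return (queryFreq, candidateFreq, score) -> candidateFreq > queryFreq;
    }

    // An illustrative alternative: ignore frequency and require a minimum score.
    static SuggestionFilter minimumScore(float threshold) {
        return (queryFreq, candidateFreq, score) -> score >= threshold;
    }
}
```

With this shape the spell checker would take a SuggestionFilter as a parameter
instead of a boolean flag, so a better rule can be swapped in later.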
I do however think this is the final version of my changes to the ngram spell
checker. I will start working on a new suggestion scheme based on an A* search
over a Markov chain that analyses the relation between multiple words, as this
ngrammer is really only good at one word at a time. Perhaps it can be a base
for the new one. Levenshtein is more compelling to me.
> Refactor of spell check
> -----------------------
>
> Key: LUCENE-537
> URL: http://issues.apache.org/jira/browse/LUCENE-537
> Project: Lucene - Java
> Type: Improvement
> Reporter: Karl Wettin
> Attachments: lucene_spellcheck.tar.gz, ngram_spellcheck_karl_v3.tar
>
> I use the same ngram index for multiple categories, but only want to spell
> check per category. The old implementation did not support this as it used
> docFreq as controller source.
> The spell check returns suggestions with score and not just the suggested
> word.
> TokenFrequencyVector replaces the IndexReader used for docFreq.
> LuceneTokenFrequencyVector wraps an IndexReader and works just as the old
> implementation.
> LuceneQueryDictionary creates an ngram dictionary based on a query and not
> the whole index.
> MultiTokenFrequencyVector treats a number of TokenFrequencyVectors as one,
> i.e. for use when spell checking in multiple contexts.
> TokenFrequencyVectorMap is a HashMap facade. It comes with a static factory to
> create the vector based on the tokens in a specific field from a search.
> I use the TokenFrequencyVectorMap to build one vector per category and
> instantiate a MultiTokenFrequencyVector for each user query. I could probably
> save a couple of clock ticks by buffering MultiVectors rather than creating
> new ones all the time.
> Also, it seems the ngram code might not be thread safe. This also includes
> the source in CVS. I have not succeeded in proving it when testing, only in
> the live environment. Each instance of Spellchecker only suggests once. And it
> takes quite some resources to create new instances of the spellchecker as it
> is designed today. I might get back to that subject.
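
A rough sketch of how the vector types described in the issue might fit
together. The class names come from the description, but the method names
(frequency, add) and the choice to sum frequencies across vectors are
assumptions made for illustration; the attached code may differ:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed shape: the issue names this type but does not show its methods.
interface TokenFrequencyVector {
    int frequency(String token);
}

// HashMap-backed vector, e.g. one per category.
final class TokenFrequencyVectorMap implements TokenFrequencyVector {
    private final Map<String, Integer> counts = new HashMap<>();

    // Record one more occurrence of the token.
    void add(String token) {
        counts.merge(token, 1, Integer::sum);
    }

    @Override
    public int frequency(String token) {
        return counts.getOrDefault(token, 0);
    }
}

// Treats several vectors as one; summing is an assumption of this sketch.
final class MultiTokenFrequencyVector implements TokenFrequencyVector {
    private final List<TokenFrequencyVector> vectors;

    MultiTokenFrequencyVector(List<? extends TokenFrequencyVector> vectors) {
        this.vectors = new ArrayList<>(vectors);
    }

    @Override
    public int frequency(String token) {
        int sum = 0;
        for (TokenFrequencyVector v : vectors) {
            sum += v.frequency(token);
        }
        return sum;
    }
}
```

Building one TokenFrequencyVectorMap per category and combining the relevant
ones per query would then give the per-category spell check described above.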