Writing your own spellchecker to do what you propose might be difficult.  At 
issue is the fact that both the "index-based" and "file-based" spellcheckers 
are designed to work off a Lucene index and base their decisions on the 
document frequencies Lucene reports.  Both spell checkers build a separate 
Lucene index on the fly to use as a dictionary just for this purpose.

But maybe you don't need to go down that path.  If your original field is not 
being stemmed or aggressively analyzed, then you can base your spellchecker on 
the original field, and there is no need to do a <copyField> for a spell check 
index.  If you have to do a <copyField> for the dictionary due to stemming, 
etc. in the original, you may be pleasantly surprised that the overhead for 
the copyField is a lot less than you thought.  Be sure to set the field to 
stored="false", indexed="true", and omitNorms="true".  I'd recommend trying 
this before anything else, as it just might work.
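
For example, a schema.xml sketch of what I mean (the field and type names 
here are just placeholders; adapt them to your schema):

<!-- Unstemmed field used only as the spellcheck dictionary source.
     Not stored, and norms omitted, to keep the overhead down. -->
<field name="spell" type="textSpell" indexed="true" stored="false"
       omitNorms="true"/>

<!-- Copy the original (stemmed) field into the spellcheck field. -->
<copyField source="title" dest="spell"/>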

If you're worried about the size of the dictionary that gets built on the 
fly, then I would look into upgrading to Trunk/4.0 and using 
DirectSolrSpellChecker, which does not build a separate dictionary at all.  
If going to Trunk is out of the question, you might be able to have Solr 
store the dictionary index on a different disk, if disk space is the issue.

If you end up writing your own spellchecker, take a look at 
org.apache.lucene.search.spell.SpellChecker.  You'll need to write a 
"suggestSimilar" method that does what you want.  Possibly you could keep 
your terms and frequencies in a key/value hash and use that to order the 
results.  You would then need to write a wrapper for Solr, similar to 
org.apache.solr.spelling.FileBasedSpellChecker.  As I mentioned, this would 
be a lot of work, and it would take a lot of thought to make it perform 
well.
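
To make the idea concrete, here's a rough, untested sketch of what such a 
"suggestSimilar" could look like.  The class and method names are my own 
invention; it uses Lucene's LevensteinDistance and ranks by similarity 
first, then frequency.  A real implementation would need an n-gram index or 
similar so it doesn't have to scan every term:

import java.util.*;
import org.apache.lucene.search.spell.LevensteinDistance;
import org.apache.lucene.search.spell.StringDistance;

// Hypothetical sketch, not production code.
public class FrequencyAwareSuggester {

  private final Map<String,Long> freqs;  // term -> corpus frequency
  private final StringDistance distance = new LevensteinDistance();

  public FrequencyAwareSuggester(Map<String,Long> freqs) {
    this.freqs = freqs;
  }

  public List<String> suggestSimilar(String misspelled, int numSug,
                                     float accuracy) {
    final Map<String,Float> scores = new HashMap<String,Float>();
    List<String> candidates = new ArrayList<String>();
    // NOTE: brute-force scan of every term; fine for a sketch, far too
    // slow for a large dictionary.
    for (String term : freqs.keySet()) {
      float d = distance.getDistance(misspelled, term); // 1.0f == identical
      if (d >= accuracy) {
        candidates.add(term);
        scores.put(term, d);
      }
    }
    // Closest match first; break ties with the corpus frequency.
    Collections.sort(candidates, new Comparator<String>() {
      public int compare(String a, String b) {
        int byDistance = Float.compare(scores.get(b), scores.get(a));
        if (byDistance != 0) return byDistance;
        return freqs.get(b).compareTo(freqs.get(a));
      }
    });
    return candidates.subList(0, Math.min(numSug, candidates.size()));
  }
}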

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Tomasz Wegrzanowski [mailto:tomasz.wegrzanow...@gmail.com] 
Sent: Monday, November 14, 2011 10:52 PM
To: solr-user@lucene.apache.org
Subject: File based wordlists for spellchecker

Hi,

I have a very large index, and I'm trying to add a spell checker for it.
I don't want to copy all the text in the index to an extra spell field,
since that would be prohibitively big, and the index is already close to as
big as it can reasonably be. So I just want to extract word frequencies as
I index, for offline processing.
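
The extraction itself is simple; something like this against the 3.x API
(the index path is a placeholder, and docFreq() is document frequency
rather than total occurrences, which is close enough for ranking):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Sketch: dump every term and its document frequency, one pair per line.
public class DumpTermFreqs {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(
        FSDirectory.open(new File("/path/to/index")));  // placeholder path
    try {
      TermEnum terms = reader.terms();
      while (terms.next()) {
        Term t = terms.term();
        System.out.println(t.field() + "\t" + t.text() + "\t"
            + terms.docFreq());
      }
      terms.close();
    } finally {
      reader.close();
    }
  }
}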

After some filtering I get something like this (word, frequency):

a       122958495
aa      834203
aaa     175206
aaaa    22389
aaab    1522
aaai    1050
aaas    6384
aab     8109
aabb    1906
aac     35100
aacc    1692
aachen  11723

I wanted to use FileBasedSpellChecker, but it doesn't support frequencies,
so its recommendations are consistently horrible. Increasing the frequency
cutoff won't really help that much: it will still suggest less frequent
words over equally similar, more frequent ones.

What's the easiest way to get this working?
Presumably I'd need to create a separate index with just these words.
How do I get frequencies there, without actually creating 11723 records with
"aachen" in them etc.?

I can do some small Java coding if need be.
I'm already using the 3.x branch (mostly for edismax, plus some unrelated
minor patches).

Thanks,
Tomasz
