File based wordlists for spellchecker

2011-11-14 Thread Tomasz Wegrzanowski
Hi,

I have a very large index, and I'm trying to add a spellchecker to it.
I don't want to copy all the text in the index into an extra spell field,
since that would be prohibitively big (the index is already close to as big
as it can reasonably be), so I just want to extract word frequencies as I
index, for offline processing.

After some filtering I get something like this (word, frequency):

a       122958495
aa      834203
aaa     175206
aaaa    22389
aaab    1522
aaai    1050
aaas    6384
aab     8109
aabb    1906
aac     35100
aacc    1692
aachen  11723

I wanted to use FileBasedSpellChecker, but it doesn't support frequencies,
so its recommendations are consistently horrible. Increasing the frequency
cutoff won't really help that much - it will still suggest less frequent
words over equally similar but more frequent words.

What's the easiest way to get this working?
Presumably I'd need to create a separate index with just these words.
How do I get the frequencies in there without actually creating 11723
records with "aachen" in them, etc.?

I can do some small Java coding if need be.
I'm already using the 3.x branch (mostly for edismax, plus some unrelated
minor patches).

Thanks,
Tomasz


Re: File based wordlists for spellchecker

2011-11-15 Thread Tomasz Wegrzanowski
On 15 November 2011 15:55, Dyer, James  wrote:
> Writing your own spellchecker to do what you propose might be difficult.  At 
> issue is the fact that both the "index-based" and "file-based" spellcheckers 
> are designed to work off a Lucene index and use the document frequency 
> reported by Lucene to base their decisions.  Both spell checkers build a 
> separate Lucene index on the fly to use as a dictionary just for this purpose.

I'm fine with a spellchecker index; it will be small compared with
everything else.

I don't want every original record to have an extra copyField, since the
result would probably be prohibitively huge.

> But maybe you don't need to go down that path.  If your original field is not
> being stemmed or aggressively analyzed, then you can base your spellchecker on
> the original field, and there is no need to do a copyField for a spell
> check index.  If you have to do a copyField for the dictionary due to
> stemming, etc. in the original, you may be pleasantly surprised that the
> overhead for the copyField is a lot less than you thought.  Be sure to set it
> as stored=false, indexed=true and omitNorms=true.  I'd recommend trying this
> before anything else as it just might work.

My original index is stemmed and very aggressively analyzed, so a copyField
would be necessary - something like the sketch below.
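
Just to make that concrete, a minimal sketch of the copyField setup being
suggested (field and type names here are hypothetical, not from my real
schema):

  <field name="spell" type="textSpell" indexed="true" stored="false" omitNorms="true"/>
  <copyField source="body" dest="spell"/>

where textSpell would be a lightly-analyzed type (tokenize and lowercase,
no stemming), so the dictionary keeps surface forms.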

> If you're worried about the size of the dictionary that gets built on the 
> fly, then I would look into possibly upgrading to Trunk/4.0 and using 
> DirectSolrSpellChecker, which does not build a separate dictionary.  If going 
> to Trunk is out of the question, it might be possible for you to have it 
> store your dictionary to a different disk if disk space is your issue.
>
> If you end up writing your own spellchecker, take a look at 
> org.apache.lucene.search.spell.SpellChecker.  You'll need to write a 
> "suggestSimilar" method that does what you want.  Possibly you can store your 
> terms and frequencies in a key/value hash and use that to order the results.
> You then would need to write a wrapper for Solr, similar to 
> org.apache.solr.spelling.FileBasedSpellChecker.  Like I mentioned, this would 
> be a lot of work and it would take a lot of thought to make it perform well, 
> etc.

Doesn't IndexBasedSpellChecker simply extract (word, freq) pairs from the
index, put them into the spellchecking index, and forget about the original
index altogether?

If so, then I'd only need to override the index building, and could reuse
the rest.

Am I correct here, or does it actually go back to the original index?
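
If that's the case, here is a minimal sketch of the frequency-aware variant
I have in mind (Lucene 3.x spellchecker APIs; the file names, the over-fetch
count of 50, and the plain frequency sort are my own assumptions, and the
exact indexDictionary() overloads vary a bit across 3.x releases):

  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  import org.apache.lucene.search.spell.PlainTextDictionary;
  import org.apache.lucene.search.spell.SpellChecker;
  import org.apache.lucene.store.FSDirectory;

  public class FreqAwareSpell {
    public static void main(String[] args) throws IOException {
      // Load the offline (word, frequency) dump produced while indexing.
      final Map<String, Long> freqs = new HashMap<String, Long>();
      BufferedReader in = new BufferedReader(new FileReader("wordfreq.txt"));
      for (String line; (line = in.readLine()) != null; ) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length == 2) freqs.put(parts[0], Long.valueOf(parts[1]));
      }
      in.close();

      // Build the spellcheck index from the bare word list (one word per
      // line), each word exactly once - no 11723 copies of "aachen" needed.
      SpellChecker spell = new SpellChecker(FSDirectory.open(new File("spellIndex")));
      spell.indexDictionary(new PlainTextDictionary(new File("words.txt")));

      // Over-fetch candidates by string similarity, then re-rank by corpus
      // frequency. A real version would blend similarity and frequency
      // instead of sorting by frequency alone.
      String[] raw = spell.suggestSimilar("aachn", 50);
      List<String> ranked = new ArrayList<String>(Arrays.asList(raw));
      Collections.sort(ranked, new Comparator<String>() {
        public int compare(String a, String b) {
          long fa = freqs.containsKey(a) ? freqs.get(a).longValue() : 0L;
          long fb = freqs.containsKey(b) ? freqs.get(b).longValue() : 0L;
          return fa < fb ? 1 : (fa > fb ? -1 : 0);
        }
      });
      System.out.println(ranked.subList(0, Math.min(5, ranked.size())));
      spell.close();
    }
  }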


solr-user@lucene.apache.org

2011-11-21 Thread Tomasz Wegrzanowski
Hi,

I've been trying to match some phrases containing + and & (like c++,
google+, r&d, etc.), but the tokenizer gets rid of them before I can do
anything with synonym filters.

So I tried using CharFilters like this (the schema XML was eaten by the
list archive; reconstructed here from the details quoted later in the
thread):

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="\+" replacement=" plus "/>
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="&amp;" replacement=" and "/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      ...
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="\+" replacement=" plus "/>
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="&amp;" replacement=" and "/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      ...
    </analyzer>
  </fieldType>
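With those charFilters in place, my understanding of the analysis
(reconstructed by hand, not actual debug output) is that a document
containing "c++" becomes "c plus plus" before tokenization, so
StandardTokenizer emits [c] [plus] [plus] and the synonym filter finally
has real tokens to work with.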

This mostly works, but for a very small number of documents, mostly those
with a large number of pluses in them, the highlighter just crashes (and it
is the highlighter: turning it off and reissuing the query works just fine,
and if I replace the pluses with spaces and reindex, the same query also
runs fine) with an exception like this:

Nov 21, 2011 11:35:11 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1938)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:237)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:343)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:619)

Is this a known issue?

Are CharFilters even the right way to approach it?

Or should I perhaps change or subclass StandardTokenizerFactory to treat
+ and & as word characters?
I haven't looked at the StandardTokenizerFactory code yet, so I don't know
how feasible that would be.
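
One alternative I might try before touching the tokenizer itself (just a
sketch - it assumes the 3.x WordDelimiterFilterFactory already supports the
types= attribute from SOLR-2059, and the file name is made up):

  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>

with wdfftypes.txt declaring both characters alphabetic, so they survive
tokenization:

  + => ALPHA
  & => ALPHA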

Thanks,
Tomasz


solr-user@lucene.apache.org

2011-11-24 Thread Tomasz Wegrzanowski
On 22 November 2011 14:28, Jan Høydahl  wrote:
> Why do you need spaces in the replacement?
>
> Try pattern="\+" replacement="plus" - it will cause the transformed 
> charstream to contain as many tokens as the original and avoid the 
> highlighting crash.

I tried that, it still crashes.

Replacing it with a single character, including a single non-ASCII
character, doesn't cause a crash.

I'm sort of tempted to just reuse some CJK character and synonym-filter it
to mean "plus".


solr-user@lucene.apache.org

2011-11-28 Thread Tomasz Wegrzanowski
On 24 November 2011 15:18, Tomasz Wegrzanowski
 wrote:
> On 22 November 2011 14:28, Jan Høydahl  wrote:
>> Why do you need spaces in the replacement?
>>
>> Try pattern="\+" replacement="plus" - it will cause the transformed 
>> charstream to contain as many tokens as the original and avoid the 
>> highlighting crash.
>
> I tried that, it still crashes.
>
> Replacing it with a single character, including a single non-ASCII
> character, doesn't cause a crash.
>
> I'm sort of tempted to just reuse some CJK character and synonym-filter
> it to mean "plus".

In case anybody else runs into this problem, I found a solution.

The only thing that works and doesn't seem to crash Solr is CJK expansion
(the charFilter XML was eaten by the list archive; reconstructed here, it
mapped the two characters to CJK stand-ins):

  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\+" replacement="加"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="&amp;" replacement="和"/>

followed by un-CJK-ing in a synonym filter:

# General rules
加 => plus
和 => and
# And any special synonyms you want:
r and d, r 和 d => r and d, research and development
s and p, s 和 p => s and p, standard and poor's
at and t, at 和 t => at and t, american telephone and telegraph

The user never sees these CJK characters; they only exist briefly inside
the Solr pipeline to keep the tokenizer happy.

I also tried private-use Unicode characters, but the tokenizer ignores them
(presumably because StandardTokenizer only keeps characters it classifies
as letters, which private-use codepoints are not).
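
For concreteness, here is how I believe a query like "r&d" now flows
through the chain (intermediate forms reconstructed by hand, not copied
from analysis output):

  raw query:         r&d
  after charFilter:  r 和 d
  after tokenizer:   [r] [和] [d]
  after synonyms:    r and d / research and development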


Solr branches

2010-08-12 Thread Tomasz Wegrzanowski
Hi,

I'm having OOME problems with Solr. From some random browsing I'm getting
the impression that a lot of memory fixes have landed recently in Solr and
Lucene.

Could you give me a quick summary of how (un)stable the different
Lucene/Solr branches are, and how much improvement I can expect?


Re: Solr branches

2010-08-12 Thread Tomasz Wegrzanowski
On 12 August 2010 13:46, Koji Sekiguchi  wrote:
> (10/08/12 21:06), Tomasz Wegrzanowski wrote:
>>
>> Hi,
>>
>> I'm having OOME problems with Solr. From some random browsing I'm getting
>> the impression that a lot of memory fixes have landed recently in Solr
>> and Lucene.
>>
>> Could you give me a quick summary of how (un)stable the different
>> Lucene/Solr branches are, and how much improvement I can expect?
>
> Lucene/Solr have CHANGES.txt. You can refer to it to see
> how much Lucene/Solr have improved since the previous release.

This is technically true, but I'm not sufficiently familiar with the
Solr/Lucene development process to infer much about the performance and
stability of the different branches from it.