[jira] Updated: (LUCENE-2391) Spellchecker uses default IW mergefactor/ramMB settings of 300/10

2011-01-04 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2391:


Fix Version/s: 4.0, 3.1

> Spellchecker uses default IW mergefactor/ramMB settings of 300/10
> -
>
> Key: LUCENE-2391
> URL: https://issues.apache.org/jira/browse/LUCENE-2391
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/spellchecker
>Reporter: Mark Miller
>Assignee: Robert Muir
>Priority: Trivial
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2391.patch
>
>
> These settings seem odd - I'd like to investigate what makes most sense here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2391) Spellchecker uses default IW mergefactor/ramMB settings of 300/10

2010-12-22 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2391:


Attachment: LUCENE-2391.patch

Here's a patch to speed up the spellchecker build.

* I wired the default ramMB to IndexWriterConfig's default.
* I didn't touch the mergefactor for now (because the default is still to optimize),
* but I added an additional 'optimize' parameter so you can update your spellcheck index without re-optimizing.
* When updating, I changed the exists() check to work per-segment, so it performs reasonably even if the index isn't optimized.
* The exists() check now bypasses the term dictionary cache, which is pointless here and just slows it down.
* We skip the exists() logic entirely if the index is empty (I think this is the case for Solr, which completely rebuilds and doesn't do an incremental update).
* The startXXX, endXXX, and word fields can only contain one term per document, so I turned off norms, positions, and tf for these.
* The gramXXX field is unchanged; I didn't want to change spellchecker scoring in any way. In the future we could reasonably omit norms here too, since I think the field is going to be very short.
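
To illustrate the per-segment exists() check and the empty-index shortcut described in the list above, here is a minimal sketch in plain Java. It is a toy model, not the code from the patch: the class `SegmentedWordIndex` and its methods `addSegment` and `wordsToAdd` are hypothetical names, and each "segment" is just an in-memory sorted set standing in for a per-segment term dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of a multi-segment spellcheck index. Hypothetical; not Lucene code. */
final class SegmentedWordIndex {
    // One sorted word set per segment, standing in for a per-segment term dictionary.
    private final List<Set<String>> segments = new ArrayList<>();

    /** Adds a batch of words as a new segment; empty batches create no segment. */
    public void addSegment(List<String> words) {
        if (!words.isEmpty()) {
            segments.add(new TreeSet<>(words));
        }
    }

    public boolean isEmpty() {
        return segments.isEmpty();
    }

    /** Probes each segment in turn, so the index need not be merged into one segment first. */
    public boolean exists(String word) {
        for (Set<String> segment : segments) {
            if (segment.contains(word)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Returns the dictionary words that are not yet indexed.
     * Duplicates inside the dictionary itself are not filtered here.
     */
    public List<String> wordsToAdd(List<String> dictionary) {
        if (isEmpty()) {
            // Full rebuild: nothing can exist yet, so skip the exists() probes entirely.
            return new ArrayList<>(dictionary);
        }
        List<String> missing = new ArrayList<>();
        for (String word : dictionary) {
            if (!exists(word)) {
                missing.add(word);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        SegmentedWordIndex index = new SegmentedWordIndex();
        index.addSegment(Arrays.asList("apple", "banana"));
        index.addSegment(Arrays.asList("cherry"));
        System.out.println(index.wordsToAdd(Arrays.asList("apple", "durian", "cherry")));
    }
}
```

The cost of a lookup in this model grows with the number of segments, which matches the trade-off in the comment: skipping optimize makes the build cheaper but each incremental update does more work.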

{noformat}
trunk:
scratch build time: 229,803ms
index size: 214,322,200 bytes
no-op update time (updating when there are no new terms to add): 4,619ms

patch:
scratch build time: 99,214ms
index size: 177,781,273 bytes
no-op update time: 2,504ms
{noformat}
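
For readers who want the relative improvements rather than the raw figures, the numbers above can be turned into ratios with a few lines of Java. This is only arithmetic on the reported measurements; the class and method names (`SpellcheckBenchmarkRatios`, `speedup`, `reduction`) are made up for this sketch.

```java
import java.util.Locale;

/** Arithmetic on the trunk vs. patch figures reported above. Names are hypothetical. */
final class SpellcheckBenchmarkRatios {
    /** How many times faster the "after" run is than the "before" run. */
    public static double speedup(long beforeMs, long afterMs) {
        return (double) beforeMs / afterMs;
    }

    /** Fraction by which the size shrank, e.g. 0.25 for a 25% reduction. */
    public static double reduction(long beforeBytes, long afterBytes) {
        return 1.0 - (double) afterBytes / beforeBytes;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "scratch build speedup: %.2fx%n",
                speedup(229_803L, 99_214L));
        System.out.printf(Locale.ROOT, "index size reduction:  %.1f%%%n",
                100 * reduction(214_322_200L, 177_781_273L));
        System.out.printf(Locale.ROOT, "no-op update speedup:  %.2fx%n",
                speedup(4_619L, 2_504L));
    }
}
```

On these figures the patch builds from scratch about 2.3 times faster, shrinks the index by about 17%, and runs the no-op update about 1.8 times faster.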

I still left the optimize default on, but I think most users (e.g. Solr) should
set mergefactor to something a bit more reasonable and set optimize to false.
The scratch build is then much faster (60,000 ms), but the no-op update time is
heavier (e.g. 16,000 ms). Still, if you are rebuilding on every commit for
smallish updates, something like 20-30 seconds is a lot better than 100 seconds.
For now I kept the defaults as they are (optimizing every time).


> Spellchecker uses default IW mergefactor/ramMB settings of 300/10
> -
>
> Key: LUCENE-2391
> URL: https://issues.apache.org/jira/browse/LUCENE-2391
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/spellchecker
>Reporter: Mark Miller
>Priority: Trivial
> Attachments: LUCENE-2391.patch
>
>
> These settings seem odd - I'd like to investigate what makes most sense here.
