[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544540 ]

Doug Cutting commented on LUCENE-1052:
--------------------------------------

> We find a surprising number of them contain embedded encoded binary data.

A detector for this sounds very useful.  It would, e.g., substantially speed
updates of such indexes without slowing searches of them the way a divisor
does.  At Excite we evolved effective heuristics for wordness to keep our
dictionaries from exploding.  Perhaps you should look into that?  Also, since
you seem to have big indexes with noisy data, you might increase your default
term index interval.

> Our users won't accept a solution like, wait until the problem occurs and 
> then increment your termIndexDivisor. They expect our app to manage this 
> automatically.

You could look at the size of the .tii files before you open an index, and, if 
they're too large, set the divisor automatically as you see fit.
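
A minimal sketch of that idea, using only java.io.File.  The class name, the
byte budget, and the linear size-to-divisor rule are all assumptions made for
illustration; none of them come from Lucene or from the attached patches.  The
on-disk .tii size is only a proxy for the RAM the reader will use, and
segments stored in compound (.cfs) files will not expose a separate .tii file.

```java
import java.io.File;

final class TiiDivisorSketch {
    /** Sums the sizes of all *.tii files directly inside indexDir (0 if none or unreadable). */
    static long totalTiiBytes(File indexDir) {
        File[] files = indexDir.listFiles();
        if (files == null) {
            return 0L; // not a directory, or an I/O error occurred
        }
        long total = 0L;
        for (File f : files) {
            if (f.isFile() && f.getName().endsWith(".tii")) {
                total += f.length();
            }
        }
        return total;
    }

    /**
     * Hypothetical rule: the smallest divisor that brings the term index
     * under budgetBytes, assuming size shrinks linearly with the divisor.
     */
    static int chooseDivisor(long tiiBytes, long budgetBytes) {
        if (budgetBytes <= 0) {
            throw new IllegalArgumentException("budgetBytes must be positive");
        }
        if (tiiBytes <= budgetBytes) {
            return 1; // small enough: load every indexed term
        }
        long divisor = 1 + (tiiBytes - 1) / budgetBytes; // ceiling division, no overflow
        return (int) Math.min(divisor, Integer.MAX_VALUE);
    }
}
```

An application would call chooseDivisor(totalTiiBytes(dir), budget) before
opening the reader and pass the result to whatever divisor setting the reader
ends up exposing.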

> int bound = (int) 
> (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

This sounds like a fine approach.
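
To get a feel for how that expression grows, here is a small sketch that only
evaluates it.  The value of TERM_BOUNDING_MULTIPLIER is an assumption picked
for illustration (the quoted message does not give one); 128 is Lucene's
default term index interval.

```java
final class TermBoundSketch {
    // Assumed value, for illustration only; not taken from the patch.
    static final double TERM_BOUNDING_MULTIPLIER = 100.0;
    // Lucene's default termIndexInterval.
    static final int TERM_INDEX_INTERVAL = 128;

    /** Evaluates the quoted expression; grows with the square root of the doc count. */
    static int bound(int segmentNumDocs) {
        // 1.0 + ... keeps the addition in double arithmetic, so a huge
        // segmentNumDocs cannot overflow int before the square root.
        return (int) (1 + TERM_BOUNDING_MULTIPLIER
                * Math.sqrt(1.0 + segmentNumDocs) / TERM_INDEX_INTERVAL);
    }
}
```

With these assumed constants, bound(16383) is 101 and bound(1000000) is 782:
a segment roughly 61 times larger gets a bound under 8 times larger.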

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  E.g., a setting of 2 means every (2 * termIndexInterval)-th
> term is loaded into RAM.
> This is particularly useful if your index has a great many terms (e.g.,
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371
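
The sub-sampling described in the issue can be pictured with a toy sketch.
This is not the TermInfosReader code; the class and method names are
hypothetical, and the list of strings stands in for the entries already
written to the term index at every termIndexInterval-th term.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexDivisorSketch {
    /**
     * Keeps every indexDivisor-th entry of the on-disk term index, so the
     * in-RAM index effectively holds one term per
     * (indexDivisor * termIndexInterval) terms of the dictionary.
     */
    static List<String> subSample(List<String> indexedTerms, int indexDivisor) {
        if (indexDivisor < 1) {
            throw new IllegalArgumentException("indexDivisor must be >= 1");
        }
        List<String> kept = new ArrayList<String>();
        for (int i = 0; i < indexedTerms.size(); i += indexDivisor) {
            kept.add(indexedTerms.get(i));
        }
        return kept;
    }
}
```

A divisor of 1 keeps everything; a divisor of 2 keeps about half the entries,
at the cost of scanning up to twice as many terms on disk after each seek.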


