[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Chuck Williams (JIRA) Wed, 21 Nov 2007 15:56:13 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544644
 ]


Chuck Williams commented on LUCENE-1052:
----------------------------------------

> It almost feels like we should have "hooks" that are invoked at
> certain times, like when we are about to load the term infos index,
> that give the application a chance to change something...

I agree with the need for some kind of hook.  This is what TermInfosConfigurer 
is.  It calls a method whenever a SegmentReader reads an index to obtain the 
parameters (termIndexDivisor) that should be used to configure the 
TermInfosReader.

Why not make the setters/getters on SegmentIndexProperties regular non-static 
methods, and allow hook methods as well?  E.g., setTermIndexDivisor(), 
getTermIndexDivisor(), getMaxTermsCached(String segmentName, int 
segmentNumDocs, long segmentNumTerms).  Non-static methods make the defaulting 
straightforward and allow for subclassing to override hook methods. 
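
As a rough illustration of the non-static design suggested above, the 
following Java sketch shows an instance with a default divisor and an 
overridable hook.  Only the names SegmentIndexProperties, 
setTermIndexDivisor, getTermIndexDivisor, and getMaxTermsCached come from 
this comment.  The default values, the meaning given to the hook's return 
value, and the CappedSegmentIndexProperties subclass are assumptions made for 
the example and are not taken from any attached patch.

```java
// Hypothetical sketch only; not the API of any attached patch.
class SegmentIndexProperties {
    // Assumed default: 1 means every indexed term is loaded.
    private int termIndexDivisor = 1;

    public void setTermIndexDivisor(int termIndexDivisor) {
        if (termIndexDivisor < 1) {
            throw new IllegalArgumentException("termIndexDivisor must be >= 1");
        }
        this.termIndexDivisor = termIndexDivisor;
    }

    public int getTermIndexDivisor() {
        return termIndexDivisor;
    }

    // Hook method. Assumed meaning: the largest number of index terms a
    // reader may hold in RAM for this segment. The default imposes no limit.
    public int getMaxTermsCached(String segmentName, int segmentNumDocs,
                                 long segmentNumTerms) {
        return Integer.MAX_VALUE;
    }
}

// An application overrides the hook by subclassing (hypothetical subclass).
class CappedSegmentIndexProperties extends SegmentIndexProperties {
    private final int cap;

    CappedSegmentIndexProperties(int cap) {
        this.cap = cap;
    }

    @Override
    public int getMaxTermsCached(String segmentName, int segmentNumDocs,
                                 long segmentNumTerms) {
        // Never report more than the segment actually contains.
        return (int) Math.min((long) cap, segmentNumTerms);
    }
}
```

Because the defaults live in the base class instance, an application that 
only wants a different divisor calls the setter, while one that needs 
per-segment decisions overrides the hook.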

> It sounds like a detector for this would be very useful. It would, e.g.,
> substantially speed updates of such indexes, and not slow searches of them
> like a divisor does.
> At Excite we evolved effective heuristics for wordness to keep our
> dictionaries from exploding.

Yes, we are pursuing that approach as well, but we have some stringent 
requirements in our market.  E.g., we cannot filter *any* valid content, 
because searches must be guaranteed to find all matching results.  As a 
result, we cannot impose any maximum length for documents.

Any type of binary content recognizer would either need to be 100% accurate, 
which is impossible, or require human intervention to validate filtering.  For 
a human intervention approach to be viable the false positive rate must be 
tiny.  To be effective the false negative rate must be tiny.  Although invalid 
content is pretty easy for people to recognize, I've found so far that 
high-accuracy recognition rules are surprisingly subtle.

Do you by chance know of any quality work in this area?

> > int bound = (int) 
> > (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

> This sounds like a fine approach.

It seems to be working ok, but there is one issue.  Heaps' Law is based on the 
total number of tokens in the content, not the total number of documents.  
I.e., longer documents will generate more distinct terms than shorter 
documents.  For large segments the use of numDocs works ok due to statistical 
averaging, but for smaller segments there are errors.  I may loosen the bound 
somewhat on smaller segments in order to allow for their larger standard 
deviation.
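
A minimal Java sketch of the bound quoted above, together with one possible 
way to loosen it for small segments, might look like the following.  The 
constant names come from the quoted expression, but their values here are 
assumptions (128 is assumed to match Lucene's default termIndexInterval, and 
the multiplier is an arbitrary placeholder).  The loosenedBound method, its 
threshold, and its slack factor are hypothetical and only illustrate the idea 
of allowing more headroom where numDocs is a noisier estimate.

```java
// Hypothetical sketch; constant values are assumed, not taken from the patch.
class TermIndexBound {
    static final double TERM_BOUNDING_MULTIPLIER = 10000.0; // placeholder value
    static final int TERM_INDEX_INTERVAL = 128;             // assumed default

    // The bound from the quoted expression: grows with sqrt(numDocs).
    static int bound(int segmentNumDocs) {
        return (int) (1 + TERM_BOUNDING_MULTIPLIER
                * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);
    }

    // Hypothetical loosening: segments below smallSegmentDocs get the bound
    // scaled by slack (>= 1.0) to allow for their larger variance.
    static int loosenedBound(int segmentNumDocs, int smallSegmentDocs,
                             double slack) {
        int b = bound(segmentNumDocs);
        if (segmentNumDocs < smallSegmentDocs) {
            return (int) (b * slack); // the cast saturates at Integer.MAX_VALUE
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(bound(99));
        System.out.println(loosenedBound(99, 1000, 2.0));
        System.out.println(loosenedBound(1_000_000, 1000, 2.0));
    }
}
```

A single slack factor is the simplest choice; a factor that shrinks smoothly 
as the segment grows would be an equally plausible alternative.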

If Lucene indexes tracked totalTokens (with duplicates, i.e. not 
numDistinctTokens) that would be perfect, but they don't.  I don't know whether 
or not there would be other good uses for totalTokens but mention its relevance 
here in case there are.
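
For reference, the usual statement of Heaps' law and its relation to a 
numDocs-based bound can be written as below.  The symbols are introduced only 
for this illustration: V is the number of distinct terms, n the total number 
of tokens, D the number of documents, and \bar{\ell} the average number of 
tokens per document.  Treating the exponent as 1/2 is an assumption that 
matches the square root in the quoted bound.

```latex
\begin{align*}
V(n) &\approx K \, n^{\beta}, \qquad 0 < \beta < 1 \\
n \approx \bar{\ell} \, D
  \;&\Rightarrow\; V \approx K \, \bar{\ell}^{\,\beta} \, D^{\beta} \\
\beta = \tfrac{1}{2}
  \;&\Rightarrow\; V \approx \bigl(K \sqrt{\bar{\ell}}\bigr) \sqrt{D}
\end{align*}
```

Under these assumptions the multiplier in a numDocs-based bound silently 
absorbs the average document length, which is why the estimate degrades when 
a small segment's documents are unusually long or short.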


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, LUCENE-1052.patch, 
> termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371
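
The sub-sampling described in the issue summary above can be sketched in Java 
roughly as follows.  This is an illustration of the idea only, not the code in 
TermInfosReader: the class and method names are hypothetical, and the index 
entries are modeled as a plain list of strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of index sub-sampling; not Lucene's implementation.
class IndexDivisorSketch {
    // Keeps every divisor-th entry of the stored term index, starting with
    // the first, so roughly 1/divisor of the entries are held in RAM.
    static List<String> subSample(List<String> indexTerms, int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        List<String> loaded = new ArrayList<>();
        for (int i = 0; i < indexTerms.size(); i += divisor) {
            loaded.add(indexTerms.get(i));
        }
        return loaded;
    }

    // Distance, in terms, between consecutive loaded index entries. A larger
    // value means less RAM but a longer scan after each seek.
    static int effectiveInterval(int termIndexInterval, int divisor) {
        return termIndexInterval * divisor;
    }
}
```

With a divisor of 2, for example, half of the stored index entries are 
loaded and the gap between loaded entries doubles.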

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
