[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

Michael McCandless (JIRA) Thu, 10 Dec 2009 09:24:42 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788798#action_12788798
 ]


Michael McCandless commented on LUCENE-2133:
--------------------------------------------

This patch is a good step forward -- it associates the cache directly
with IndexReader, where it belongs; it cleanly decouples cache from
reader (vs the hack we have today with IndexReader.getFieldCacheKey),
so that eg cloned readers can share the same cache; it also preserves
back compat, which is quite a stunning accomplishment :)

But... there are many more things I don't like about FieldCache, that
I'm not sure (?) the patch addresses:

  * Uninversion to derive eg an int[] is horribly slow, compared to
    say loading the pre-encoded binary ints from disk, created during
    indexing.  Ie, I think, if we are going to overhaul FieldCache
    API, we should somehow make LUCENE-1231 feasible.

  * There's no pluggability to customize where the int[] comes from
    for a given field -- most you can do is provide your own IntParser
    that the uninverter uses.  EG the fact that the patch had to
    "move" FieldCacheRange/TermsFilter down, is strange -- these
    filters (and in general any future "cache consumers") should live
    in oal.search, but simply pull from a different int[] source,
    somehow.

  * Error checking of single-value-per-field (for StringIndex) is
    dangerous, today -- it's intermittent, and, it's an unchecked
    exception.  We should probably just remove the exception, or,
    maybe make it checked.  Actually I'll go open a new issue for
    this.  We should simply fix this.

  * Single-value-per-field limitation (though, that's a nice to have,
    future improvement)

  * Even accepting the single-value-per-field limitaiton, we should
    allow multiple values per field during uninversion, w/
    customizable logic about which value is kept as the single one
    (there is an issue open for this I think).  This really should be
    some sort of added extensibility to whatever class drives
    uninversion...

  * The terror of accidentally asking for the array at the top-level
    of Multi/DirReader.  I think this shouldn't even be allowed, at
    least not easily, ie Dir/MultiReader.getIndexCache should throw
    UOE.  If we really wanted to, we could provide sugar methods in
    maybe ReaderUtil to "glom N int[]'s into a new int[]".  But it
    should be named something scary :) Then we wouldn't need any
    insanity checking.

  * No control over caching policy (cannot evict things)

  * If we make field cache flexible enough, we could maybe fold norms
    & deleted docs into it (would be a separate future issue to
    actually do so...).

Some other questions about the patch:

  * Consumers of the cache API (the sort comparator,
    FieldCacheTerms/RangeFilter, and any other future users of "the
    field cache") shouldn't have to move down into fields sub-package?

  * It's a little strange that the term vectors & fields reader also
    got pulled into the cache?


> [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2133
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2133
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.9.1, 3.0
>            Reporter: Christian Kohlschütter
>         Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
> LUCENE-2133.patch, LUCENE-2133.patch
>
>
> Hi all,
> up to the current version Lucene contains a conceptual flaw, that is the 
> FieldCache. The FieldCache is a singleton which is supposed to cache certain 
> information for every IndexReader that is currently open
> The FieldCache is flawed because it is incorrect to assume that:
> 1. one IndexReader instance equals one index. In fact, there can be many 
> clones (of SegmentReader) or decorators (FilterIndexReader) which all access 
> the very same data.
> 2. the cache information remains valid for the lifetime of an IndexReader. In 
> fact, some IndexReaders may be reopen()'ed and thus they may contain 
> completely different information.
> 3. all IndexReaders need the same type of cache. In fact, because of the 
> limitations imposed by the singleton construct there was no implementation 
> other than FieldCacheImpl.
> Furthermore, FieldCacheImpl and FieldComparator are bloated by several static 
> inner-classes that could be moved to package level.
> There have been a few attempts to improve FieldCache, namely LUCENE-831, 
> LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: 
> There is a central registry for assigning Caches to IndexReader instances.
> I now propose the following:
> 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, 
> extensible cache instances ("IndexCache"). IndexCaches provide common caching 
> functionality for all IndexReaders and may be extended (for example, 
> SegmentReader would have a SegmentReaderIndexCache and store different data 
> than a regular IndexCache)
> 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. 
> IndexFieldCache is an interface just like FieldCache and may support 
> different implementations.
> 3. The IndexCache instances may be flushed/closed by the associated 
> IndexReaders whenever necessary.
> 4. Obsolete FieldCacheSanityChecker because no more "insanities" are expected 
> (or at least, they do not impact the overall performance)
> 5. Refactor FieldCacheImpl and the related classes (FieldComparator, 
> SortField) 
> I have provided an patch which takes care of all these issues. It passes all 
> JUnit tests.
> The patch is quite large, admittedly, but the change required several 
> modifications and some more to preserve backwards-compatibility. 
> Backwards-compatibility is preserved by moving some of the updated 
> functionality in the package org.apache.lucene.search.fields (field 
> comparators and parsers, SortField) while adding wrapper instances and 
> keeping old code in org.apache.lucene.search.
> In detail and besides the above mentioned improvements, the following is 
> provided:
> 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved 
> from SegmentReader to SegmentReaderIndexCache.
> 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the 
> close() method to all registered instances by calling an onClose() method 
> with the threads' instances.
> 3. Analyzer.close now may throw an IOException (this already is covered by 
> java.io.Closeable).
> 4. A change to Collector: allow IndexCache instead of IndexReader being 
> passed to setNextReader()
> 5. SortField's numeric types have been replaced by direct assignments of 
> FieldComparatorSource. This removes the "switch" statements and the 
> possibility to throw IllegalArgumentExceptions because of unsupported type 
> values.
> The following classes have been deprecated and replaced by new classes in 
> org.apache.lucene.search.fields:
> - FieldCacheRangeFilter (=> IndexFieldCacheRangeFilter)
> - FieldCacheTermsFilter (=> IndexFieldCacheTermsFilter)
> - FieldCache (=> IndexFieldCache)
> - FieldCacheImpl (=> IndexFieldCacheImpl)
> - all classes in FieldCacheImpl (=> several package-level classes)
> - all subclasses of FieldComparator (=> several package-level classes)
> Final notes:
> - The patch would be simpler if no backwards compatibility was necessary. The 
> Lucene community has to decide which classes/methods can immediately be 
> removed, which ones later, which not at all. Whenever new classes depend on 
> the old ones, an appropriate notice exists in the javadocs.
> - The patch introduces a new, deprecated class 
> IndexFieldCacheSanityChecker.java which is just there for testing purposes, 
> to show that no sanity checks are necessary any longer. This class may be 
> removed at any time.
> - I expect that the patch does not impact performance. On the contrary, as 
> the patch removes a few unnecessary checks we might even see a slight 
> speedup. No benchmarking has been done so far, though.
> - I have tried to preserve the existing functionality wherever possible and 
> to focus on the class/method structure only. We certainly may improve the 
> caches' behavior, but this out of scope for this patch.
> - The refactoring finally makes the high duplication of code visible: For all 
> supported atomic types (byte, double, float, int, long, short) three classes 
> each are required: *Cache, *Comparator and *Parser. I think that further 
> simplification might be possible (maybe using Java generics?), but I guess 
> the current patch is large enough for now.
> Cheers,
> Christian

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

Reply via email to