[
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788798#action_12788798
]
Michael McCandless commented on LUCENE-2133:
--------------------------------------------
This patch is a good step forward -- it associates the cache directly
with IndexReader, where it belongs; it cleanly decouples cache from
reader (vs the hack we have today with IndexReader.getFieldCacheKey),
so that eg cloned readers can share the same cache; it also preserves
back compat, which is quite a stunning accomplishment :)
But... there are many more things I don't like about FieldCache, that
I'm not sure (?) the patch addresses:
* Uninversion to derive eg an int[] is horribly slow, compared to
say loading the pre-encoded binary ints from disk, created during
indexing. Ie, I think, if we are going to overhaul FieldCache
API, we should somehow make LUCENE-1231 feasible.
* There's no pluggability to customize where the int[] comes from
for a given field -- most you can do is provide your own IntParser
that the uninverter uses. EG the fact that the patch had to
"move" FieldCacheRange/TermsFilter down, is strange -- these
filters (and in general any future "cache consumers") should live
in oal.search, but simply pull from a different int[] source,
somehow.
* Error checking of single-value-per-field (for StringIndex) is
dangerous, today -- it's intermittent, and, it's an unchecked
exception. We should probably just remove the exception, or,
maybe make it checked. Actually I'll go open a new issue for
this. We should simply fix this.
* Single-value-per-field limitation (though, that's a nice to have,
future improvement)
* Even accepting the single-value-per-field limitaiton, we should
allow multiple values per field during uninversion, w/
customizable logic about which value is kept as the single one
(there is an issue open for this I think). This really should be
some sort of added extensibility to whatever class drives
uninversion...
* The terror of accidentally asking for the array at the top-level
of Multi/DirReader. I think this shouldn't even be allowed, at
least not easily, ie Dir/MultiReader.getIndexCache should throw
UOE. If we really wanted to, we could provide sugar methods in
maybe ReaderUtil to "glom N int[]'s into a new int[]". But it
should be named something scary :) Then we wouldn't need any
insanity checking.
* No control over caching policy (cannot evict things)
* If we make field cache flexible enough, we could maybe fold norms
& deleted docs into it (would be a separate future issue to
actually do so...).
Some other questions about the patch:
* Consumers of the cache API (the sort comparator,
FieldCacheTerms/RangeFilter, and any other future users of "the
field cache") shouldn't have to move down into fields sub-package?
* It's a little strange that the term vectors & fields reader also
got pulled into the cache?
> [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
> -------------------------------------------------------------------------
>
> Key: LUCENE-2133
> URL: https://issues.apache.org/jira/browse/LUCENE-2133
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9.1, 3.0
> Reporter: Christian Kohlschütter
> Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch,
> LUCENE-2133.patch, LUCENE-2133.patch
>
>
> Hi all,
> up to the current version Lucene contains a conceptual flaw, that is the
> FieldCache. The FieldCache is a singleton which is supposed to cache certain
> information for every IndexReader that is currently open
> The FieldCache is flawed because it is incorrect to assume that:
> 1. one IndexReader instance equals one index. In fact, there can be many
> clones (of SegmentReader) or decorators (FilterIndexReader) which all access
> the very same data.
> 2. the cache information remains valid for the lifetime of an IndexReader. In
> fact, some IndexReaders may be reopen()'ed and thus they may contain
> completely different information.
> 3. all IndexReaders need the same type of cache. In fact, because of the
> limitations imposed by the singleton construct there was no implementation
> other than FieldCacheImpl.
> Furthermore, FieldCacheImpl and FieldComparator are bloated by several static
> inner-classes that could be moved to package level.
> There have been a few attempts to improve FieldCache, namely LUCENE-831,
> LUCENE-1579 and LUCENE-1749, but the overall situation remains the same:
> There is a central registry for assigning Caches to IndexReader instances.
> I now propose the following:
> 1. Obsolete FieldCache and FieldCacheKey and provide index-specific,
> extensible cache instances ("IndexCache"). IndexCaches provide common caching
> functionality for all IndexReaders and may be extended (for example,
> SegmentReader would have a SegmentReaderIndexCache and store different data
> than a regular IndexCache)
> 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache.
> IndexFieldCache is an interface just like FieldCache and may support
> different implementations.
> 3. The IndexCache instances may be flushed/closed by the associated
> IndexReaders whenever necessary.
> 4. Obsolete FieldCacheSanityChecker because no more "insanities" are expected
> (or at least, they do not impact the overall performance)
> 5. Refactor FieldCacheImpl and the related classes (FieldComparator,
> SortField)
> I have provided an patch which takes care of all these issues. It passes all
> JUnit tests.
> The patch is quite large, admittedly, but the change required several
> modifications and some more to preserve backwards-compatibility.
> Backwards-compatibility is preserved by moving some of the updated
> functionality in the package org.apache.lucene.search.fields (field
> comparators and parsers, SortField) while adding wrapper instances and
> keeping old code in org.apache.lucene.search.
> In detail and besides the above mentioned improvements, the following is
> provided:
> 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved
> from SegmentReader to SegmentReaderIndexCache.
> 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the
> close() method to all registered instances by calling an onClose() method
> with the threads' instances.
> 3. Analyzer.close now may throw an IOException (this already is covered by
> java.io.Closeable).
> 4. A change to Collector: allow IndexCache instead of IndexReader being
> passed to setNextReader()
> 5. SortField's numeric types have been replaced by direct assignments of
> FieldComparatorSource. This removes the "switch" statements and the
> possibility to throw IllegalArgumentExceptions because of unsupported type
> values.
> The following classes have been deprecated and replaced by new classes in
> org.apache.lucene.search.fields:
> - FieldCacheRangeFilter (=> IndexFieldCacheRangeFilter)
> - FieldCacheTermsFilter (=> IndexFieldCacheTermsFilter)
> - FieldCache (=> IndexFieldCache)
> - FieldCacheImpl (=> IndexFieldCacheImpl)
> - all classes in FieldCacheImpl (=> several package-level classes)
> - all subclasses of FieldComparator (=> several package-level classes)
> Final notes:
> - The patch would be simpler if no backwards compatibility was necessary. The
> Lucene community has to decide which classes/methods can immediately be
> removed, which ones later, which not at all. Whenever new classes depend on
> the old ones, an appropriate notice exists in the javadocs.
> - The patch introduces a new, deprecated class
> IndexFieldCacheSanityChecker.java which is just there for testing purposes,
> to show that no sanity checks are necessary any longer. This class may be
> removed at any time.
> - I expect that the patch does not impact performance. On the contrary, as
> the patch removes a few unnecessary checks we might even see a slight
> speedup. No benchmarking has been done so far, though.
> - I have tried to preserve the existing functionality wherever possible and
> to focus on the class/method structure only. We certainly may improve the
> caches' behavior, but this out of scope for this patch.
> - The refactoring finally makes the high duplication of code visible: For all
> supported atomic types (byte, double, float, int, long, short) three classes
> each are required: *Cache, *Comparator and *Parser. I think that further
> simplification might be possible (maybe using Java generics?), but I guess
> the current patch is large enough for now.
> Cheers,
> Christian
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]