[
https://issues.apache.org/jira/browse/LUCENE-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276107#comment-15276107
]
Adrien Grand commented on LUCENE-6766:
--------------------------------------
This looks great!
{quote}
it does add some increase in code complexity, which I think is OK/contained.
{quote}
Agreed. The only thing I am slightly worried about is how all optimized bulk
mergers need to opt out if a sort order is configured. I am wondering if our
base consumer classes should have two merge methods so that you would not have
to check the sort order when overriding the method for regular merges? This is
just an idea, it has drawbacks too since there would not be a single entry
point to merging anymore and we would need another method in our API, but I'm
suggesting it anyway hoping that it might give somebody a better idea. :)
bq. but this is already an enormous change so I think we really have to look
into "sort on flush" (which is hairy by itself) later, separately
+1
{code}
+// nocommit if index time sorting is in use, don't try to bulk merge ... later
we can make crazy bulk merger that looks for long runs from
+// one sub?
{code}
Maybe this one could be made a simple TODO. I think it is totally fine if index
sorting always bypasses optimized bulk mergers, at least for now? Since we are
still pulling a merge instance, it should not be too bad (no worse than merging
across different codecs)?
{code}
// nocommit in the unsorted case, this should map correctly, e.g. apply per
segment docBase
{code}
This seems to already be the case based on the code?
{code}
// nocommit isn't liveDocs redundant? docMap returns -1 for us?
{code}
+1 I think it would be easier if this part of the code only used the docMap.
{code}
// nocommit is it sub's job to skip deleted docs?
{code}
I think it is since there is no mapped doc ID for deleted docs?
{code}
// nocommit doesn't support index sorting? or sorts must be the same?
public void addIndexes(Directory... dirs) throws IOException {
{code}
Can we do like the nocommit on {{addIndexes(CodecReader...)}} suggests and just
make sure that we cannot end up with segments that have different sort orders
in the index?
{code}
// nocommit what about MergedReaderWrapper in here?
{code}
I think we should still wrap with MergedReaderWrapper? This will help stored
fields if two documents from the same block are read consecutively (which could
likely happen if the order in which docs are indexed is somehow correlated to
the index sort, like if sorting by timestamp)?
{code}
+ Sort indexSort = null;
+
// build FieldInfos and fieldToReader map:
for (final LeafReader reader : this.parallelReaders) {
+ if (indexSort == null) {
+ indexSort = reader.getIndexSort();
+ } else if (indexSort.equals(reader.getIndexSort()) == false) {
+ throw new IllegalArgumentException("cannot combine LeafReaders that
have different index sorts: saw both sort=" + indexSort + " and " +
reader.getIndexSort());
+ }
{code}
I think this is buggy since it ignores {{null}} sorts at the beginning of the
list but not at the end, so the same list of readers may or may not raise an
exception depending on the order in which readers are provided?
{code}
// nocommit does search time "do the right thing" automatically when segment is
sorted?
{code}
Agreed it should. I see you also left nocommits about moving the
early-terminating collectors from misc to core, but leveraging index sorting at
search time looks like a big task to me so maybe we should defer it to a
follow-up issue like sorting on flush?
{code}
// nocommit just do assertReaderEquals, don't use @BeforeClass, etc.?
{code}
+1!
{code}
---
trunk/lucene/misc/src/java/org/apache/lucene/search/BlockJoinComparatorSource.java
2016-02-16 11:18:34.753021816 -0500
+++
indexsort/lucene/misc/src/java/org/apache/lucene/search/BlockJoinComparatorSource.java
2016-05-06 19:17:29.893848515 -0400
@@ -20,13 +20,14 @@
+// nocommit what to do here?
{code}
Let's remove it for now and later see whether this is something that could be
added back?
{code}
+ @Override
+ public int nextDoc() {
+ try {
+ return postings.nextDoc();
+ } catch (IOException ioe) {
+ throw new RuntimeException(ioe);
+ }
+ }
{code}
Should DocIdMerger.Sub.nextDoc throw an IOException?
> Make index sorting a first-class citizen
> ----------------------------------------
>
> Key: LUCENE-6766
> URL: https://issues.apache.org/jira/browse/LUCENE-6766
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-6766.patch, LUCENE-6766.patch
>
>
> Today index sorting is a very expert feature. You need to use a custom merge
> policy, custom collectors, etc. I would like to explore making it a
> first-class citizen so that:
> - the sort order could be configured on IndexWriterConfig
> - segments would record the sort order that was used to write them
> - IndexSearcher could automatically early terminate when computing top docs
> on a sort order that is a prefix of the sort order of a segment (and if the
> user is not interested in totalHits).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]