[jira] [Commented] (LUCENE-6766) Make index sorting a first-class citizen

Adrien Grand (JIRA) Mon, 09 May 2016 01:36:13 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276107#comment-15276107
 ]


Adrien Grand commented on LUCENE-6766:
--------------------------------------

This looks great!

{quote}
it does add some increase in code complexity, which I think is OK/contained.
{quote}

Agreed. The only thing I am slightly worried about is how all optimized bulk 
mergers need to opt out if a sort order is configured. I am wondering if our 
base consumer classes should have two merge methods so that you would not have 
to check the sort order when overriding the method for regular merges? This is 
just an idea, it has drawbacks too since there would not be a single entry 
point to merging anymore and we would need another method in our API, but I'm 
suggesting it anyway hoping that it might give somebody a better idea. :)

bq. but this is already an enormous change so I think we really have to look 
into "sort on flush" (which is hairy by itself) later, separately

+1

{code}
+// nocommit if index time sorting is in use, don't try to bulk merge ... later 
we can make crazy bulk merger that looks for long runs from
+// one sub?
{code}

Maybe this one could be made a simple TODO. I think it is totally fine if index 
sorting always bypasses optimized bulk mergers, at least for now? Since we are 
still pulling a merge instance, it should not be too bad (no worse than merging 
across different codecs)?

{code}
 // nocommit in the unsorted case, this should map correctly, e.g. apply per 
segment docBase
{code}

This seems to already be the case based on the code?

{code}
// nocommit isn't liveDocs redundant?  docMap returns -1 for us?
{code}

+1 I think it would be easier if this part of the code only used the docMap.

{code}
// nocommit is it sub's job to skip deleted docs?
{code}

I think it is since there is no mapped doc ID for deleted docs?

{code}
  // nocommit doesn't support index sorting?  or sorts must be the same?
  public void addIndexes(Directory... dirs) throws IOException {
{code}

Can we do like the nocommit on {{addIndexes(CodecReader...)}} suggests and just 
make sure that we cannot end up with segments that have different sort orders 
in the index?

{code}
// nocommit what about MergedReaderWrapper in here?
{code}

I think we should still wrap with MergedReaderWrapper? This will help stored 
fields if two documents from the same block are read consecutively (which could 
likely happen if the order in which docs are indexed is somehow correlated to 
the index sort, like if sorting by timestamp)?

{code}
+    Sort indexSort = null;
+
     // build FieldInfos and fieldToReader map:
     for (final LeafReader reader : this.parallelReaders) {
+      if (indexSort == null) {
+        indexSort = reader.getIndexSort();
+      } else if (indexSort.equals(reader.getIndexSort()) == false) {
+        throw new IllegalArgumentException("cannot combine LeafReaders that 
have different index sorts: saw both sort=" + indexSort + " and " + 
reader.getIndexSort());
+      }
{code}

I think this is buggy since it ignores {{null}} sorts at the beginning of the 
list but not at the end, so the same list of readers may or may not raise an 
exception depending on the order in which readers are provided?

{code}
// nocommit does search time "do the right thing" automatically when segment is 
sorted?
{code}

Agreed it should. I see you also left nocommits about moving the 
early-terminating collectors from misc to core, but leveraging index sorting at 
search time looks like a big task to me so maybe we should defer it to a 
follow-up issue like sorting on flush?

{code}
// nocommit just do assertReaderEquals, don't use @BeforeClass, etc.?
{code}

+1!

{code}
--- 
trunk/lucene/misc/src/java/org/apache/lucene/search/BlockJoinComparatorSource.java
  2016-02-16 11:18:34.753021816 -0500
+++ 
indexsort/lucene/misc/src/java/org/apache/lucene/search/BlockJoinComparatorSource.java
      2016-05-06 19:17:29.893848515 -0400
@@ -20,13 +20,14 @@

+// nocommit what to do here?
{code}

Let's remove it for now and later see whether this is something that could be 
added back?

{code}
+    @Override
+    public int nextDoc() {
+      try {
+        return postings.nextDoc();
+      } catch (IOException ioe) {
+        throw new RuntimeException(ioe);
+      }
+    }
{code}

Should DocIdMerger.Sub.nextDoc throw an IOException?

> Make index sorting a first-class citizen
> ----------------------------------------
>
>                 Key: LUCENE-6766
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6766
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6766.patch, LUCENE-6766.patch
>
>
> Today index sorting is a very expert feature. You need to use a custom merge 
> policy, custom collectors, etc. I would like to explore making it a 
> first-class citizen so that:
>  - the sort order could be configured on IndexWriterConfig
>  - segments would record the sort order that was used to write them
>  - IndexSearcher could automatically early terminate when computing top docs 
> on a sort order that is a prefix of the sort order of a segment (and if the 
> user is not interested in totalHits).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-6766) Make index sorting a first-class citizen

Reply via email to