[
https://issues.apache.org/jira/browse/LUCENE-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397340#comment-15397340
]
Michael McCandless commented on LUCENE-7396:
--------------------------------------------
Thanks [~jpountz], I also saw similar initial speedups when running
{{IndexTaxis.java}} from luceneutil. I'll re-index all 1.2 B points
w/ N threads and compare performance shortly.
This is a very impressive change; here's my attempt at the high-level
summary:
* Improves utility apis for the select algorithm: new abstract
Selector class, with Quick/Radix/IntroSelector impls
* At flush, takes advantage of the fact that all values are already
in heap (IW's RAM buffer), more cleanly than LUCENE-7390 (should
we revert that?)
* Fix 1D BKD flushing to also use the optimized writing that we do at
merge
* Fix N dim flushing to avoid up front sort on each dimension and
instead use select algo at each recursion to find the split
point, and only do a sort at each leaf block.
* The general case (merging, N dims) still requires up front sort,
likely using offline sorter
Using select to split on each dim is faster for at least two reasons:
* Select algo in the typical case is linear time, and N * log(N)
sorting 1000 small chunks that are 1000 times smaller each is
less work: N * log(N) vs N * log(N/1000).
* The sorting in the leaf block is only in the 1 dimension
needed for the run-length compression, vs N dimensions
currently.
Maybe {{MutablePointsReader}} should expose {{public byte getByteAt(int i, int
bytePos);}}?
This would save copying a whole value just because you
want to see a specific byte.
That's very clever how you logically "concatenate" the packed docID
onto the end of the byte[] values that we are selecting in BKDWriter:
{noformat}
final int shift = bitsPerDocId - ((k - cmpBytes + 1) << 3);
return (reader.getDocID(i) >>> Math.max(0, shift)) & 0xff;
{noformat}
I hope our points test cases are stressful enough in testing this
tie break case!
On line 522 of BKDWriter.java can't {{byte[] pivot = scratch1}} be
final? And also line 1426?
The javadocs for {{RadixSelector.getFallbackSelector}} say {{sorter}}
but should be {{selector}}? Can you improve those javadacs,
e.g. fallback is used when too much recursion has happened or when the
range is tiny?
Can we remove {{RadixSelector.quickSelect}} and inline it into the one
place it's called in {{select}}?
Hmm, why are we changing {{ord}} from {{long}} to {{int}} here?:
{noformat}
// only called from assert
- private boolean valueInOrder(long ord, byte[] lastPackedValue, byte[]
packedValue, int packedValueOffset) {
- if (ord > 0 && StringHelper.compare(bytesPerDim, lastPackedValue, 0,
packedValue, packedValueOffset) > 0) {
- throw new AssertionError("values out of order: last value=" + new
BytesRef(lastPackedValue) + " current value=" + new BytesRef(packedValue,
packedValueOffset, packedBytesLength) + " ord=" + ord);
+ private boolean valueInOrder(int ord, int sortedDim, byte[] lastPackedValue,
byte[] packedValue, int packedValueOffset,
+ int[] docs, int docsOffset) {
+ int dimOffset = sortedDim * bytesPerDim;
+ if (ord > 0) {
+ int cmp = StringHelper.compare(bytesPerDim, lastPackedValue, dimOffset,
packedValue, packedValueOffset + dimOffset);
+ if (cmp > 0) {
+ throw new AssertionError("values out of order: last value=" + new
BytesRef(lastPackedValue) + " current value=" + new BytesRef(packedValue,
packedValueOffset, packedBytesLength) + " ord=" + ord);
+ }
+ if (cmp == 0 && docs[docsOffset + ord - 1] > docs[docsOffset + ord]) {
+ throw new AssertionError("docs out of order: last doc=" +
docs[docsOffset + ord - 1] + " current doc=" + docs[docsOffset + ord] + " ord="
+ ord);
+ }
}
- System.arraycopy(packedValue, packedValueOffset, lastPackedValue, 0,
bytesPerDim);
+ System.arraycopy(packedValue, packedValueOffset, lastPackedValue, 0,
packedBytesLength);
return true;
}
{noformat}
It looks like we no longer call this from {{assert}} in the merge 1D
case, except within one leaf block? Was that intentional?
> Speed up flush of 1-dimension points
> ------------------------------------
>
> Key: LUCENE-7396
> URL: https://issues.apache.org/jira/browse/LUCENE-7396
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7396.patch, LUCENE-7396.patch, LUCENE-7396.patch
>
>
> 1D points already have an optimized merge implementation which works when
> points come in order. So maybe we could make IndexWriter's PointValuesWriter
> sort before feeding the PointsFormat and somehow propagate the information to
> the PointsFormat?
> The benefit is that flushing could directly stream points to disk with little
> memory usage.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]