[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

2019-02-06 Thread GitBox
iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r254560658
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java
 ##
 @@ -194,40 +169,28 @@ protected BKDWriter(int maxDoc, Directory tempDir, String tempFileNamePrefix, in
     minPackedValue = new byte[packedIndexBytesLength];
     maxPackedValue = new byte[packedIndexBytesLength];
 
-    // If we may have more than 1+Integer.MAX_VALUE values, then we must encode ords with long (8 bytes), else we can use int (4 bytes).
-    this.longOrds = longOrds;
-
-    this.singleValuePerDoc = singleValuePerDoc;
+    // dimensional values (numDims * bytesPerDim) + docID (int)
+    bytesPerDoc = packedBytesLength + Integer.BYTES;
 
-    // dimensional values (numDims * bytesPerDim) + ord (int or long) + docID (int)
-    if (singleValuePerDoc) {
-      // Lucene only supports up to 2.1 docs, so we better not need longOrds in this case:
-      assert longOrds == false;
-      bytesPerDoc = packedBytesLength + Integer.BYTES;
-    } else if (longOrds) {
-      bytesPerDoc = packedBytesLength + Long.BYTES + Integer.BYTES;
-    } else {
-      bytesPerDoc = packedBytesLength + Integer.BYTES + Integer.BYTES;
-    }
 
     // As we recurse, we compute temporary partitions of the data, halving the
     // number of points at each recursion.  Once there are few enough points,
     // we can switch to sorting in heap instead of offline (on disk).  At any
     // time in the recursion, we hold the number of points at that level, plus
     // all recursive halves (i.e. 16 + 8 + 4 + 2) so the memory usage is 2X
     // what that level would consume, so we multiply by 0.5 to convert from
-    // bytes to points here.  Each dimension has its own sorted partition, so
-    // we must divide by numDims as wel.
+    // bytes to points here.  In addition the radix partitioning may sort on memory
+    // double of this size so we multiply by another 0.5.
 
-    maxPointsSortInHeap = (int) (0.5 * (maxMBSortInHeap * 1024 * 1024) / (bytesPerDoc * numDataDims));
+    maxPointsSortInHeap = (int) (0.25 * (maxMBSortInHeap * 1024 * 1024) / (bytesPerDoc));
 
 Review comment:
  This is due to the fact that BKDRadixSelector can build a buffer of up to `2 * maxPointsSortInHeap` points. Thinking about it, it might not be ideal (we are not using all of the available heap), but I have a follow-up issue in mind where this will be handled better, so I will keep it like that.
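
To make the budget arithmetic concrete, here is a small, self-contained sketch of the computation in the quoted line; the dimension count, bytes per dimension, and heap budget are hypothetical values, not anything from the PR.

```java
// Illustrative arithmetic only: how the 0.25 factor leaves room for both the recursive
// halves (n + n/2 + n/4 + ... < 2n) and a radix-selector buffer of up to 2 * maxPointsSortInHeap.
public class MaxPointsBudgetExample {
  public static void main(String[] args) {
    int numDataDims = 2;                      // hypothetical, e.g. a 2D point
    int bytesPerDim = Long.BYTES;             // hypothetical, e.g. long-valued dimensions
    int packedBytesLength = numDataDims * bytesPerDim;
    int bytesPerDoc = packedBytesLength + Integer.BYTES;  // dimensional values + docID, as in the diff
    double maxMBSortInHeap = 16.0;            // hypothetical heap budget in MB

    int maxPointsSortInHeap = (int) (0.25 * (maxMBSortInHeap * 1024 * 1024) / bytesPerDoc);
    System.out.println(maxPointsSortInHeap);  // ~209715 points for these inputs
  }
}
```

Note that the old formula also divided by `numDataDims`; the new one drops that divisor and instead spends the extra 0.5 factor on the selector's temporary buffer.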





[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

2019-02-06 Thread GitBox
iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r254296091
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/util/bkd/BKDRadixSelector.java
 ##
 @@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.bkd;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.FutureArrays;
+import org.apache.lucene.util.RadixSelector;
+
+
+/**
+ *
+ * Offline Radix selector for BKD tree.
+ *
+ *  @lucene.internal
+ * */
+public final class BKDRadixSelector {
+  //size of the histogram
+  private static final int HISTOGRAM_SIZE = 256;
+  //size of the online buffer: 8 KB
+  private static final int MAX_SIZE_OFFLINE_BUFFER = 1024 * 8;
+  // we store one histogram per recursion level
+  private final long[][] histogram;
+  //bytes per dimension
+  private final int bytesPerDim;
+  // number of bytes to be sorted: bytesPerDim + Integer.BYTES
+  private final int bytesSorted;
+  //data dimensions size
+  private final int packedBytesLength;
+  //flag to when we are moving to sort on heap
+  private final int maxPointsSortInHeap;
+  //reusable buffer
+  private final byte[] offlineBuffer;
+  //holder for partition points
+  private final int[] partitionBucket;
+  // scratch object to move bytes around
+  private final BytesRef bytesRef1 = new BytesRef();
+  // scratch object to move bytes around
+  private final BytesRef bytesRef2 = new BytesRef();
+  //Directory to create new Offline writer
+  private final Directory tempDir;
+  // prefix for temp files
+  private final String tempFileNamePrefix;
+
+  /**
+   * Sole constructor.
+   */
+  public BKDRadixSelector(int numDim, int bytesPerDim, int maxPointsSortInHeap, Directory tempDir, String tempFileNamePrefix) {
+    this.bytesPerDim = bytesPerDim;
+    this.packedBytesLength = numDim * bytesPerDim;
+    this.bytesSorted = bytesPerDim + Integer.BYTES;
+    this.maxPointsSortInHeap = 2 * maxPointsSortInHeap;
+    int numberOfPointsOffline = MAX_SIZE_OFFLINE_BUFFER / (packedBytesLength + Integer.BYTES);
+    this.offlineBuffer = new byte[numberOfPointsOffline * (packedBytesLength + Integer.BYTES)];
+    this.partitionBucket = new int[bytesSorted];
+    this.histogram = new long[bytesSorted][HISTOGRAM_SIZE];
+    this.bytesRef1.length = numDim * bytesPerDim;
+    this.tempDir = tempDir;
+    this.tempFileNamePrefix = tempFileNamePrefix;
+  }
+
+  /**
+   *
+   * Method to partition the input data. It returns the value of the dimension where
+   * the split happens. The method destroys the original writer.
+   *
+   */
+  public byte[] select(PointWriter points, PointWriter left, PointWriter right, long from, long to, long partitionPoint, int dim) throws IOException {
 
 Review comment:
  Of course this needs a bit of work, because the left and right `PointWriters` should then be created inside this object.
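
For readers following along, a rough sketch of the shape such a refactor could take is below; `PartitionSlices` and its fields are hypothetical names, only `PointWriter` comes from the quoted interface (so this assumes lucene-core on the classpath).

```java
// A sketch (not the PR's code) of the idea that the selector owns the creation of the
// left/right writers instead of receiving them, and returns them together with the split value.
import org.apache.lucene.util.bkd.PointWriter;

final class PartitionSlices {
  final byte[] splitValue;   // value of the split dimension at the partition point
  final PointWriter left;    // points below the partition point
  final PointWriter right;   // points at or above the partition point

  PartitionSlices(byte[] splitValue, PointWriter left, PointWriter right) {
    this.splitValue = splitValue;
    this.left = left;
    this.right = right;
  }
}

// The selector would then expose something like:
//   PartitionSlices select(PointWriter points, long from, long to, long partitionPoint, int dim)
// and decide internally whether each slice lives on heap or in temp files in the Directory.
```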





[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

2019-02-06 Thread GitBox
iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r254294954
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/util/bkd/BKDRadixSelector.java
 ##
 @@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.bkd;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.FutureArrays;
+import org.apache.lucene.util.RadixSelector;
+
+
+/**
+ *
+ * Offline Radix selector for BKD tree.
+ *
+ *  @lucene.internal
+ * */
+public final class BKDRadixSelector {
+  //size of the histogram
+  private static final int HISTOGRAM_SIZE = 256;
+  //size of the online buffer: 8 KB
+  private static final int MAX_SIZE_OFFLINE_BUFFER = 1024 * 8;
+  // we store one histogram per recursion level
+  private final long[][] histogram;
+  //bytes per dimension
+  private final int bytesPerDim;
+  // number of bytes to be sorted: bytesPerDim + Integer.BYTES
+  private final int bytesSorted;
+  //data dimensions size
+  private final int packedBytesLength;
+  //flag to when we are moving to sort on heap
+  private final int maxPointsSortInHeap;
+  //reusable buffer
+  private final byte[] offlineBuffer;
+  //holder for partition points
+  private final int[] partitionBucket;
+  // scratch object to move bytes around
+  private final BytesRef bytesRef1 = new BytesRef();
+  // scratch object to move bytes around
+  private final BytesRef bytesRef2 = new BytesRef();
+  //Directory to create new Offline writer
+  private final Directory tempDir;
+  // prefix for temp files
+  private final String tempFileNamePrefix;
+
+  /**
+   * Sole constructor.
+   */
+  public BKDRadixSelector(int numDim, int bytesPerDim, int maxPointsSortInHeap, Directory tempDir, String tempFileNamePrefix) {
+    this.bytesPerDim = bytesPerDim;
+    this.packedBytesLength = numDim * bytesPerDim;
+    this.bytesSorted = bytesPerDim + Integer.BYTES;
+    this.maxPointsSortInHeap = 2 * maxPointsSortInHeap;
+    int numberOfPointsOffline = MAX_SIZE_OFFLINE_BUFFER / (packedBytesLength + Integer.BYTES);
+    this.offlineBuffer = new byte[numberOfPointsOffline * (packedBytesLength + Integer.BYTES)];
+    this.partitionBucket = new int[bytesSorted];
+    this.histogram = new long[bytesSorted][HISTOGRAM_SIZE];
+    this.bytesRef1.length = numDim * bytesPerDim;
+    this.tempDir = tempDir;
+    this.tempFileNamePrefix = tempFileNamePrefix;
+  }
+
+  /**
+   *
+   * Method to partition the input data. It returns the value of the dimension where
+   * the split happens. The method destroys the original writer.
+   *
+   */
+  public byte[] select(PointWriter points, PointWriter left, PointWriter right, long from, long to, long partitionPoint, int dim) throws IOException {
 
 Review comment:
  At the moment no, but one of the improvements that could be made is to re-use the idea of path slices from the previous versions and keep just the first `HeapPointWriter`. Then you do not need to copy bytes between heap objects; you just adjust the start and end points for the different levels.
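
A minimal sketch of that path-slice idea, under the assumption of one shared backing writer plus offset/count bookkeeping; the `PathSlice` class and its methods below are illustrative, not code from this PR or from the previous BKDWriter.

```java
// Sketch: instead of copying points into new heap writers at every recursion level,
// keep one backing writer and describe each level as an offset + count into it.
import org.apache.lucene.util.bkd.PointWriter;

final class PathSlice {
  final PointWriter writer;  // shared backing storage (e.g. the first HeapPointWriter)
  final long start;          // first point of this slice
  final long count;          // number of points in this slice

  PathSlice(PointWriter writer, long start, long count) {
    this.writer = writer;
    this.start = start;
    this.count = count;
  }

  // Splitting a slice is just index arithmetic; no bytes are copied.
  PathSlice leftOf(long partitionPoint) {
    return new PathSlice(writer, start, partitionPoint - start);
  }

  PathSlice rightOf(long partitionPoint) {
    return new PathSlice(writer, partitionPoint, start + count - partitionPoint);
  }
}
```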





[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

2019-02-04 Thread GitBox
iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r253495525
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/util/bkd/BKDRadixSelector.java
 ##
 @@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.bkd;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.FutureArrays;
+import org.apache.lucene.util.IntroSelector;
+
+/**
+ *
+ * Offline Radix selector for BKD tree.
+ *
+ *  @lucene.internal
+ * */
+public final class BKDRadixSelector {
+  //size of the histogram
+  private static final int HISTOGRAM_SIZE = 256;
+  // we store one histogram per recursion level
+  private final long[][] histogram;
+  //bytes we are sorting
+  private final int bytesPerDim;
+  // number of bytes to be sorted: bytesPerDim + Integer.BYTES
+  private final int bytesSorted;
+  //data dimensions size
+  private final int packedBytesLength;
+  //flag to when we are moving to sort on heap
+  private final int maxPointsSortedOffHeap;
+  //reusable buffer
+  private final byte[] offlineBuffer;
+  //holder for partition points
+  private final int[] partitionBucket;
+  //holder for partition bytes
+  private final byte[] partitionBytes;
+  //re-usable on-heap selector
+  private final HeapSelector heapSelector;
+  // scratch object to move bytes around
+  private final BytesRef bytesRef1 = new BytesRef();
+  // scratch object to move bytes around
+  private final BytesRef bytesRef2 = new BytesRef();
+  //Directory to create new Offline writer
+  private final Directory tempDir;
+  // prefix for temp files
+  private final String tempFileNamePrefix;
+
+
+
+  /**
+   * Sole constructor.
+   */
+  public BKDRadixSelector(int numDim, int bytesPerDim, int maxPointsSortedOffHeap, Directory tempDir, String tempFileNamePrefix) {
+    this.bytesPerDim = bytesPerDim;
+    this.packedBytesLength = numDim * bytesPerDim;
+    this.bytesSorted = bytesPerDim + Integer.BYTES;
+    this.maxPointsSortedOffHeap = maxPointsSortedOffHeap;
+    this.offlineBuffer = new byte[maxPointsSortedOffHeap * (packedBytesLength + Integer.BYTES)];
+    this.partitionBucket = new int[bytesSorted];
+    this.partitionBytes = new byte[bytesSorted];
+    this.histogram = new long[bytesSorted][HISTOGRAM_SIZE];
+    this.bytesRef1.length = numDim * bytesPerDim;
+    this.heapSelector = new HeapSelector(numDim, bytesPerDim);
+    this.tempDir = tempDir;
+    this.tempFileNamePrefix = tempFileNamePrefix;
+  }
+
+  /**
+   * Method to partition the input data. It returns the value of the dimension where
+   * the split happens.
+   */
+  public byte[] select(PointWriter points, PointWriter left, PointWriter right, long from, long to, long partitionPoint, int dim) throws IOException {
+    checkArgs(from, to, partitionPoint);
+
+    //If we are on heap then we just select on heap
+    if (points instanceof HeapPointWriter) {
+      return heapSelect((HeapPointWriter) points, left, right, dim, Math.toIntExact(from), Math.toIntExact(to), Math.toIntExact(partitionPoint), 0);
+    }
+
+    //reset histogram
+    for (int i = 0; i < bytesSorted; i++) {
+      Arrays.fill(histogram[i], 0);
+    }
+    OfflinePointWriter offlinePointWriter = (OfflinePointWriter) points;
+
+    //find common prefix, it does already set histogram values if needed
+    int commonPrefix = findCommonPrefix(offlinePointWriter, from, to, dim);
+
+    //if all equals we just partition the data
+    if (commonPrefix == bytesSorted) {
+      return partition(offlinePointWriter, left, right, from, to, partitionPoint, dim, null, commonPrefix - 1, partitionPoint);
+    }
+    //let's rock'n'roll
+    return buildHistogramAndPartition(offlinePointWriter, null, left, right, from, to, partitionPoint, 0, commonPrefix, dim, 0, 0);
+  }
+
+  void checkArgs(long from, long to, long middle) {
+    if (middle < from) {
+      throw new IllegalArgumentException("middle must be >= from");
+    }
+    if (middle >= to) {
+      throw new I
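
The quoted `select` entry point falls back to an on-heap selection when the points already live in a `HeapPointWriter`, and otherwise drives the radix selection from a 256-bucket histogram per byte of the sort key (the split dimension's bytes followed by the docID). A self-contained sketch of that counting step follows; the class, the in-memory `byte[][]` input, and the example values are hypothetical, not the PR's offline reading loop.

```java
// Illustrative only: count how many sort keys fall into each of the 256 buckets
// for one byte position of the key.
final class ByteHistogramExample {
  static long[] buildHistogram(byte[][] keys, int bytePosition) {
    long[] histogram = new long[256];            // HISTOGRAM_SIZE in the quoted code
    for (byte[] key : keys) {
      histogram[key[bytePosition] & 0xff]++;     // unsigned bucket index
    }
    return histogram;
  }

  public static void main(String[] args) {
    byte[][] keys = { {0x01, 0x7f}, {0x01, (byte) 0x80}, {0x02, 0x00} };
    long[] h = buildHistogram(keys, 0);
    System.out.println(h[0x01] + " keys share 0x01 at byte 0");  // prints 2
  }
}
```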

[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

2019-02-04 Thread GitBox
iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r253401456
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/util/bkd/BKDRadixSelector.java
 ##
 @@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.bkd;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.FutureArrays;
+import org.apache.lucene.util.IntroSelector;
+
+/**
+ *
+ * Offline Radix selector for BKD tree.
+ *
+ *  @lucene.internal
+ * */
+public final class BKDRadixSelector {
+  //size of the histogram
+  private static final int HISTOGRAM_SIZE = 256;
+  // we store one histogram per recursion level
+  private final long[][] histogram;
+  //bytes we are sorting
+  private final int bytesPerDim;
+  // number of bytes to be sorted: bytesPerDim + Integer.BYTES
+  private final int bytesSorted;
+  //data dimensions size
+  private final int packedBytesLength;
+  //flag to when we are moving to sort on heap
+  private final int maxPointsSortedOffHeap;
+  //reusable buffer
+  private final byte[] offlineBuffer;
+  //holder for partition points
+  private final int[] partitionBucket;
+  //holder for partition bytes
+  private final byte[] partitionBytes;
+  //re-usable on-heap selector
+  private final HeapSelector heapSelector;
+  // scratch object to move bytes around
+  private final BytesRef bytesRef1 = new BytesRef();
+  // scratch object to move bytes around
+  private final BytesRef bytesRef2 = new BytesRef();
+  //Directory to create new Offline writer
+  private final Directory tempDir;
+  // prefix for temp files
+  private final String tempFileNamePrefix;
+
+
+
+  /**
+   * Sole constructor.
+   */
+  public BKDRadixSelector(int numDim, int bytesPerDim, int maxPointsSortedOffHeap, Directory tempDir, String tempFileNamePrefix) {
+    this.bytesPerDim = bytesPerDim;
+    this.packedBytesLength = numDim * bytesPerDim;
+    this.bytesSorted = bytesPerDim + Integer.BYTES;
+    this.maxPointsSortedOffHeap = maxPointsSortedOffHeap;
+    this.offlineBuffer = new byte[maxPointsSortedOffHeap * (packedBytesLength + Integer.BYTES)];
+    this.partitionBucket = new int[bytesSorted];
+    this.partitionBytes = new byte[bytesSorted];
+    this.histogram = new long[bytesSorted][HISTOGRAM_SIZE];
+    this.bytesRef1.length = numDim * bytesPerDim;
+    this.heapSelector = new HeapSelector(numDim, bytesPerDim);
+    this.tempDir = tempDir;
+    this.tempFileNamePrefix = tempFileNamePrefix;
+  }
+
+  /**
+   * Method to partition the input data. It returns the value of the dimension where
+   * the split happens.
+   */
+  public byte[] select(PointWriter points, PointWriter left, PointWriter right, long from, long to, long partitionPoint, int dim) throws IOException {
+    checkArgs(from, to, partitionPoint);
+
+    //If we are on heap then we just select on heap
+    if (points instanceof HeapPointWriter) {
+      return heapSelect((HeapPointWriter) points, left, right, dim, Math.toIntExact(from), Math.toIntExact(to), Math.toIntExact(partitionPoint), 0);
+    }
+
+    //reset histogram
+    for (int i = 0; i < bytesSorted; i++) {
+      Arrays.fill(histogram[i], 0);
+    }
+    OfflinePointWriter offlinePointWriter = (OfflinePointWriter) points;
+
+    //find common prefix, it does already set histogram values if needed
+    int commonPrefix = findCommonPrefix(offlinePointWriter, from, to, dim);
+
+    //if all equals we just partition the data
+    if (commonPrefix == bytesSorted) {
+      return partition(offlinePointWriter, left, right, from, to, partitionPoint, dim, null, commonPrefix - 1, partitionPoint);
+    }
+    //let's rock'n'roll
+    return buildHistogramAndPartition(offlinePointWriter, null, left, right, from, to, partitionPoint, 0, commonPrefix, dim, 0, 0);
+  }
+
+  void checkArgs(long from, long to, long middle) {
+    if (middle < from) {
+      throw new IllegalArgumentException("middle must be >= from");
+    }
+    if (middle >= to) {
+      throw new I

[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

2019-02-04 Thread GitBox
iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r253400942
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/util/bkd/BKDRadixSelector.java
 ##
 @@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.bkd;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.FutureArrays;
+import org.apache.lucene.util.IntroSelector;
+
+/**
+ *
+ * Offline Radix selector for BKD tree.
+ *
+ *  @lucene.internal
+ * */
+public final class BKDRadixSelector {
+  //size of the histogram
+  private static final int HISTOGRAM_SIZE = 256;
+  // we store one histogram per recursion level
+  private final long[][] histogram;
+  //bytes we are sorting
+  private final int bytesPerDim;
+  // number of bytes to be sorted: bytesPerDim + Integer.BYTES
+  private final int bytesSorted;
+  //data dimensions size
+  private final int packedBytesLength;
+  //flag to when we are moving to sort on heap
+  private final int maxPointsSortedOffHeap;
+  //reusable buffer
+  private final byte[] offlineBuffer;
+  //holder for partition points
+  private final int[] partitionBucket;
+  //holder for partition bytes
+  private final byte[] partitionBytes;
+  //re-usable on-heap selector
+  private final HeapSelector heapSelector;
+  // scratch object to move bytes around
+  private final BytesRef bytesRef1 = new BytesRef();
+  // scratch object to move bytes around
+  private final BytesRef bytesRef2 = new BytesRef();
+  //Directory to create new Offline writer
+  private final Directory tempDir;
+  // prefix for temp files
+  private final String tempFileNamePrefix;
+
+
+
+  /**
+   * Sole constructor.
+   */
+  public BKDRadixSelector(int numDim, int bytesPerDim, int maxPointsSortedOffHeap, Directory tempDir, String tempFileNamePrefix) {
+    this.bytesPerDim = bytesPerDim;
+    this.packedBytesLength = numDim * bytesPerDim;
+    this.bytesSorted = bytesPerDim + Integer.BYTES;
+    this.maxPointsSortedOffHeap = maxPointsSortedOffHeap;
+    this.offlineBuffer = new byte[maxPointsSortedOffHeap * (packedBytesLength + Integer.BYTES)];
+    this.partitionBucket = new int[bytesSorted];
+    this.partitionBytes = new byte[bytesSorted];
+    this.histogram = new long[bytesSorted][HISTOGRAM_SIZE];
+    this.bytesRef1.length = numDim * bytesPerDim;
+    this.heapSelector = new HeapSelector(numDim, bytesPerDim);
+    this.tempDir = tempDir;
+    this.tempFileNamePrefix = tempFileNamePrefix;
+  }
+
+  /**
+   * Method to partition the input data. It returns the value of the dimension where
+   * the split happens.
+   */
+  public byte[] select(PointWriter points, PointWriter left, PointWriter right, long from, long to, long partitionPoint, int dim) throws IOException {
+    checkArgs(from, to, partitionPoint);
+
+    //If we are on heap then we just select on heap
+    if (points instanceof HeapPointWriter) {
+      return heapSelect((HeapPointWriter) points, left, right, dim, Math.toIntExact(from), Math.toIntExact(to), Math.toIntExact(partitionPoint), 0);
+    }
+
+    //reset histogram
+    for (int i = 0; i < bytesSorted; i++) {
+      Arrays.fill(histogram[i], 0);
+    }
+    OfflinePointWriter offlinePointWriter = (OfflinePointWriter) points;
+
+    //find common prefix, it does already set histogram values if needed
+    int commonPrefix = findCommonPrefix(offlinePointWriter, from, to, dim);
+
+    //if all equals we just partition the data
+    if (commonPrefix == bytesSorted) {
+      return partition(offlinePointWriter, left, right, from, to, partitionPoint, dim, null, commonPrefix - 1, partitionPoint);
+    }
+    //let's rock'n'roll
+    return buildHistogramAndPartition(offlinePointWriter, null, left, right, from, to, partitionPoint, 0, commonPrefix, dim, 0, 0);
+  }
+
+  void checkArgs(long from, long to, long middle) {
+    if (middle < from) {
+      throw new IllegalArgumentException("middle must be >= from");
+    }
+    if (middle >= to) {
+      throw new I

[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

2019-02-04 Thread GitBox
iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r253400460
 
 

 ##
 File path: 
lucene/core/src/java/org/apache/lucene/util/bkd/OfflinePointReader.java
 ##
 @@ -74,55 +69,68 @@ public OfflinePointReader(Directory tempDir, String tempFileName, int packedByte
       // at another level of the BKDWriter recursion
       in = tempDir.openInput(tempFileName, IOContext.READONCE);
     }
+
     name = tempFileName;
 
     long seekFP = start * bytesPerDoc;
     in.seek(seekFP);
     countLeft = length;
-    packedValue = new byte[packedBytesLength];
-    this.longOrds = longOrds;
+    if (reusableBuffer != null) {
 
 Review comment:
  I am actually wondering whether we should make those reader constructors protected. They should always be constructed from the writers.
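
A small sketch of the pattern that suggestion points at, with hypothetical class names (`ExamplePointWriter`, `ExamplePointReader`) rather than the PR's types: the reader constructor stays non-public, and readers are only handed out by the writer.

```java
// Sketch only: a writer that is the single entry point for obtaining readers over its data.
import java.io.Closeable;
import java.io.IOException;

abstract class ExamplePointWriter implements Closeable {
  abstract void append(byte[] packedValue, int docID) throws IOException;

  // The only way callers obtain a reader over what was written.
  abstract ExamplePointReader getReader(long start, long length) throws IOException;
}

abstract class ExamplePointReader implements Closeable {
  // Protected: only writers (and subclasses) in the same package can construct readers.
  protected ExamplePointReader() {}

  abstract boolean next() throws IOException;
  abstract byte[] packedValue();
  abstract int docID();
}
```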





[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

2019-02-04 Thread GitBox
iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r253399133
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/bkd/PointWriter.java
 ##
 @@ -19,24 +19,30 @@
 
 import java.io.Closeable;
 import java.io.IOException;
-import java.util.List;
+
+import org.apache.lucene.util.BytesRef;
 
 /** Appends many points, and then at the end provides a {@link PointReader} to iterate
  *  those points.  This abstracts away whether we write to disk, or use simple arrays
  *  in heap.
  *
- *  @lucene.internal */
-public interface PointWriter extends Closeable {
-  /** Add a new point */
-  void append(byte[] packedValue, long ord, int docID) throws IOException;
+ *  @lucene.internal
+ *  */
+public interface PointWriter extends Closeable {
+  /** Add a new point from byte array*/
+  void append(byte[] packedValue, int docID) throws IOException;
 
 Review comment:
  This method is used when we spill the incoming points offline in the BKD writer. There we can wrap the incoming byte[] into a BytesRef.
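
A minimal sketch of that wrapping, assuming a `BytesRef`-based append on the writer side; the `BytesRefAppender` interface and the `appendPacked` helper are hypothetical, only `BytesRef` itself is the real Lucene class.

```java
// Sketch: a byte[] can be viewed as a BytesRef without copying, so the byte[] overload
// can simply delegate to a BytesRef-based append at the spill point.
import java.io.IOException;
import org.apache.lucene.util.BytesRef;

final class SpillExample {
  interface BytesRefAppender {
    void append(BytesRef packedValue, int docID) throws IOException;
  }

  static void appendPacked(BytesRefAppender writer, byte[] packedValue, int docID) throws IOException {
    // BytesRef here is only a view over the existing array; no bytes are copied.
    writer.append(new BytesRef(packedValue), docID);
  }
}
```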

