Github user sachouche commented on a diff in the pull request:
https://github.com/apache/drill/pull/1060#discussion_r160514755
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VLAbstractEntryReader.java ---
@@ -0,0 +1,215 @@
+/*******************************************************************************
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ ******************************************************************************/
+package org.apache.drill.exec.store.parquet.columnreaders;
+
+import java.nio.ByteBuffer;
+
+import org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.ColumnPrecisionInfo;
+import org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.PageDataInfo;
+import org.apache.drill.exec.util.MemoryUtils;
+
+/** Abstract class for sub-classes implementing several algorithms for loading a Bulk Entry */
+abstract class VLAbstractEntryReader {
+
+ /** byte buffer used for buffering page data */
+ protected final ByteBuffer buffer;
+ /** Page Data Information */
+ protected final PageDataInfo pageInfo;
+ /** expected precision type: fixed or variable length */
+ protected final ColumnPrecisionInfo columnPrecInfo;
+ /** Bulk entry */
+ protected final VLColumnBulkEntry entry;
+
+ /**
+ * CTOR.
+ * @param _buffer byte buffer for data buffering (within CPU cache)
+ * @param _pageInfo page being processed information
+ * @param _columnPrecInfo column precision information
+ * @param _entry reusable bulk entry object
+ */
+ VLAbstractEntryReader(ByteBuffer _buffer,
+ PageDataInfo _pageInfo,
+ ColumnPrecisionInfo _columnPrecInfo,
+ VLColumnBulkEntry _entry) {
+
+ this.buffer = _buffer;
+ this.pageInfo = _pageInfo;
+ this.columnPrecInfo = _columnPrecInfo;
+ this.entry = _entry;
+ }
+
+ /**
+ * @param valsToReadWithinPage maximum number of values to read within the current page
+ * @return a bulk entry object
+ */
+ abstract VLColumnBulkEntry getEntry(int valsToReadWithinPage);
+
+ /**
+ * Indicates whether to use bulk processing.
+ * @return true if bulk processing is enabled for this column
+ */
+ protected final boolean bulkProcess() {
+ return columnPrecInfo.bulkProcess;
+ }
+
+ /**
+ * Loads new data into the buffer if empty or the force flag is set.
+ *
+ * @param force flag to force loading new data into the buffer
+ * @return true if the buffer contains data to process; false if the page is exhausted
+ */
+ protected final boolean load(boolean force) {
+
+ if (!force && buffer.hasRemaining()) {
+ return true; // NOOP
+ }
+
+ // We're here either because the buffer is empty or we want to force a new load operation.
+ // In the case of force, there might be unprocessed data (still in the buffer), which is fine
+ // since the caller updates the page data buffer's offset only for the data it has consumed; this
+ // means unread data will be loaded again, but this time positioned at the beginning of the
+ // buffer. This can happen only for the last entry in the buffer, when either its length or its
+ // value is incomplete.
+ buffer.clear();
+
+ int remaining = remainingPageData();
+ int toCopy = remaining > buffer.capacity() ? buffer.capacity() : remaining;
+
+ if (toCopy == 0) {
+ return false;
+ }
+
+ pageInfo.pageData.getBytes(pageInfo.pageDataOff, buffer.array(), buffer.position(), toCopy);
--- End diff --
@parthchandra I know this is counter-intuitive, but this path makes it faster, and I have VTune profiles to prove it; let me explain:
- The logic fetches small chunks of data from main memory into the L1 cache (using 4k buffers) for processing
- This doesn't introduce any extra memory-bandwidth penalty, since in all cases the data has to move from main memory to the CPU and back to main memory (if modified)
- Note that the CPU has optimizations that skip the memory flush for private mutated data that fully fits in the cache
- This means the byte-array read/write operations are always local (they don't involve the memory bus)
- Another concern could be any extra CPU instructions we might be using; usually this is not a problem as long as, in the end, our logic avoids CPU stalls, makes better use of CPU pipelining, and uses more optimal instructions (e.g., 64-bit vs 8-bit, SIMD, etc.)
I would recommend using VTune, as it clearly showcases the above assertions.
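To make the chunked-staging idea concrete, here is a minimal, self-contained sketch of the pattern: copying data from a large source through a small (4 KiB) heap buffer in cache-sized chunks and processing each chunk locally. All names here are hypothetical illustrations; this is not the Drill implementation.

```java
import java.nio.ByteBuffer;

/**
 * Sketch of the chunked-staging pattern discussed above: data is copied
 * from a large source into a small (4 KiB) heap buffer, then processed
 * there, so the inner loop's reads stay within the L1 cache.
 * Hypothetical names; not the Drill code.
 */
public class ChunkedStagingSketch {

  private static final int STAGING_SIZE = 4096; // fits comfortably in L1 cache

  /** Sums all bytes of {@code source}, staging them 4 KiB at a time. */
  static long sumBytes(byte[] source) {
    final ByteBuffer staging = ByteBuffer.allocate(STAGING_SIZE);
    long total = 0;
    int offset = 0;

    while (offset < source.length) {
      staging.clear();
      // Copy at most one staging buffer's worth of data
      int toCopy = Math.min(source.length - offset, staging.capacity());
      staging.put(source, offset, toCopy);
      staging.flip();

      // Process the staged chunk; these reads hit the small cached array
      byte[] local = staging.array();
      for (int i = 0; i < toCopy; i++) {
        total += local[i] & 0xFF;
      }
      offset += toCopy;
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] data = new byte[10_000];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) (i % 7);
    }
    System.out.println(sumBytes(data)); // prints 29994
  }
}
```

The point of the pattern is not the extra copy itself but that the subsequent per-byte work operates on a buffer small enough to stay resident in L1, which is where a profiler like VTune would show the win.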
---