[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...
Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/781 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/781#discussion_r106795253 --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java --- @@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) { assert index >= 0; final int currentOffset = offsetVector.getAccessor().get(index); - while (data.capacity() < currentOffset + bytes.length) { -reAlloc(); - } offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length); - data.setBytes(currentOffset, bytes, 0, bytes.length); + try { +data.setBytes(currentOffset, bytes, 0, bytes.length); + } catch (IndexOutOfBoundsException e) { --- End diff -- By the way, if the suggested change blows up the scope of this fix, then the original proposal is fine; we can always adjust it later if needed when we solve the low-density batch problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/781#discussion_r106794181 --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java --- @@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) { assert index >= 0; final int currentOffset = offsetVector.getAccessor().get(index); - while (data.capacity() < currentOffset + bytes.length) { -reAlloc(); - } offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length); - data.setBytes(currentOffset, bytes, 0, bytes.length); + try { +data.setBytes(currentOffset, bytes, 0, bytes.length); + } catch (IndexOutOfBoundsException e) { --- End diff -- The Netty/Drillbuf code is complex, so this does boil down to details... Yes, I agree that Netty does the bounds checks -- if we call Netty code. Consider this code in DrillBuf: ``` @Override public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) { udle.setBytes(index + offset, src, srcIndex, length); return this; } public ByteBuf setBytes(int index, ByteBuffer src, int srcIndex, int length) { if (src.isDirect()) { checkIndex(index, length); PlatformDependent.copyMemory(PlatformDependent.directBufferAddress(src) + srcIndex, this.memoryAddress() + index, length); ... ``` We have one version that delegates to the "udle" which calls another buf which calls... all the way to a bounds check. But, we have another method which cuts out the middleman and just does a direct memory copy. Given that, we could certainly add a version that does a bounds check and copy from heap into direct memory. Just use this Netty method: ``` class PlatformDependent ... public static void copyMemory(byte[] src, int srcIndex, long dstAddr, long length) ... ``` So, something like this: ``` class Drillbuf ... public boolean setIfCapacity(int offset, byte data[], int len) { if (offset + len >= capacity()) { return false; } PlatformDependent.copyMemory(data, 0, PlatformDependent.directBufferAddress(src) + offset, len); return true; } ``` Of course, this assumes an implementation of the underlying direct memory, but, as we saw, we are already doing something similar. Would this work and perform as well as the exception-based approach? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/781#discussion_r106792968 --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java --- @@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) { assert index >= 0; final int currentOffset = offsetVector.getAccessor().get(index); - while (data.capacity() < currentOffset + bytes.length) { -reAlloc(); - } offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length); - data.setBytes(currentOffset, bytes, 0, bytes.length); + try { +data.setBytes(currentOffset, bytes, 0, bytes.length); + } catch (IndexOutOfBoundsException e) { --- End diff -- The check for a DrillBuf exceeding bounds is already being done once in Netty (which also throws the Exception). What we want to do is to avoid having to do more than one bounds check for every vector write operation. By adding the suggested check, we simply move the check from every vector setSafe method to every Drillbuf write. This would possibly impact performance in other parts of the code. I particularly changed only var len vectors to address a hotspot in the Parquet reader performance, because as you are suggesting, the right way to fix is to address the low density batch problem at the same time as the vector overflow problem. Perhaps a longer discussion may be required here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/781#discussion_r106491214 --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java --- @@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) { assert index >= 0; final int currentOffset = offsetVector.getAccessor().get(index); - while (data.capacity() < currentOffset + bytes.length) { -reAlloc(); - } offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length); - data.setBytes(currentOffset, bytes, 0, bytes.length); + try { +data.setBytes(currentOffset, bytes, 0, bytes.length); + } catch (IndexOutOfBoundsException e) { --- End diff -- While this is a clever way to avoid checks, it will lead to difficulty when debugging Drill. Intentionally throwing a common exception makes it even harder to find cases where the exception indicates an error. Let's take a step back. One of the things we need to change in Parquet is to avoid "low density batches" vectors that have very little data. Turns out one reason is tied up with the assumption that the code makes that it can tell when it has reached the end of a vector. (There are many bugs, but that is the key idea.) Vectors don't have that ability today, so the code never worked. What if we solve that problem, and yours, by changing how the DrillBuf works: ``` if ( ! data.setBytesIfCapacity(...)) { reAlloc(); data.setBytes(...) } ``` The above avoids the spurious exception *and* provides the means to manage variable-length vectors in Parquet. Note that the bounds check is still done, but only inside Drillbuf. And, of course, that same check is done with the PR code: that check is what raises the exception. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...
GitHub user parthchandra opened a pull request: https://github.com/apache/drill/pull/781 DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader Two changes in var len vectors: 1) Instead of checking to see if we need to realloc for every setSafe call, let the write fail and catch the exception. The exception, though expensive, will happen very rarely. 2) Call fillEmpties only if there are empty values to fill. This saves a bunch of CPU on every setSafe call. You can merge this pull request into a Git repository by running: $ git pull https://github.com/parthchandra/drill DRILL-5351 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/781.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #781 commit 57869496526a43351575d0f4879d2ac28fe973d4 Author: Parth ChandraDate: 2017-02-11T01:40:25Z DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---