[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...

2017-04-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/781


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...

2017-03-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/781#discussion_r106795253
  
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java 
---
@@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) {
   assert index >= 0;
 
   final int currentOffset = offsetVector.getAccessor().get(index);
-  while (data.capacity() < currentOffset + bytes.length) {
-reAlloc();
-  }
   offsetVector.getMutator().setSafe(index + 1, currentOffset + 
bytes.length);
-  data.setBytes(currentOffset, bytes, 0, bytes.length);
+  try {
+data.setBytes(currentOffset, bytes, 0, bytes.length);
+  } catch (IndexOutOfBoundsException e) {
--- End diff --

By the way, if the suggested change blows up the scope of this fix, then 
the original proposal is fine; we can always adjust it later if needed when we 
solve the low-density batch problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...

2017-03-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/781#discussion_r106794181
  
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java 
---
@@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) {
   assert index >= 0;
 
   final int currentOffset = offsetVector.getAccessor().get(index);
-  while (data.capacity() < currentOffset + bytes.length) {
-reAlloc();
-  }
   offsetVector.getMutator().setSafe(index + 1, currentOffset + 
bytes.length);
-  data.setBytes(currentOffset, bytes, 0, bytes.length);
+  try {
+data.setBytes(currentOffset, bytes, 0, bytes.length);
+  } catch (IndexOutOfBoundsException e) {
--- End diff --

The Netty/Drillbuf code is complex, so this does boil down to details... 
Yes, I agree that Netty does the bounds checks -- if we call Netty code. 
Consider this code in DrillBuf:

```
  @Override
  public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) 
{
udle.setBytes(index + offset, src, srcIndex, length);
return this;
  }

  public ByteBuf setBytes(int index, ByteBuffer src, int srcIndex, int 
length) {
if (src.isDirect()) {
  checkIndex(index, length);
  
PlatformDependent.copyMemory(PlatformDependent.directBufferAddress(src) + 
srcIndex, this.memoryAddress() + index,
  length);
  ...
```

We have one version that delegates to the "udle" which calls another buf 
which calls... all the way to a bounds check.

But, we have another method which cuts out the middleman and just does a 
direct memory copy. Given that, we could certainly add a version that does a 
bounds check and copy from heap into direct memory. Just use this Netty method:

```
class  PlatformDependent ...
public static void copyMemory(byte[] src, int srcIndex, long dstAddr, 
long length) ...
```

So, something like this:

```
class Drillbuf ...
  public boolean setIfCapacity(int offset, byte data[], int len) {
if (offset + len >= capacity()) { return false; }
PlatformDependent.copyMemory(data, 0,
  PlatformDependent.directBufferAddress(src) + offset,
  len);
   return true;
}
```

Of course, this assumes an implementation of the underlying direct memory, 
but, as we saw, we are already doing something similar.

Would this work and perform as well as the exception-based approach?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...

2017-03-18 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/781#discussion_r106792968
  
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java 
---
@@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) {
   assert index >= 0;
 
   final int currentOffset = offsetVector.getAccessor().get(index);
-  while (data.capacity() < currentOffset + bytes.length) {
-reAlloc();
-  }
   offsetVector.getMutator().setSafe(index + 1, currentOffset + 
bytes.length);
-  data.setBytes(currentOffset, bytes, 0, bytes.length);
+  try {
+data.setBytes(currentOffset, bytes, 0, bytes.length);
+  } catch (IndexOutOfBoundsException e) {
--- End diff --

The check for a DrillBuf exceeding bounds is already being done once in 
Netty (which also throws the Exception). What we want to do  is to avoid having 
to do more than one bounds check for every vector write operation. By adding 
the suggested check, we simply move the check from every vector setSafe method 
to every Drillbuf write. This would possibly impact performance in other parts 
of the code. 
I particularly changed only var len vectors to address a hotspot in the 
Parquet reader performance, because as you are suggesting, the right way to fix 
is to address the low density batch problem at the same time as the vector 
overflow problem. Perhaps a longer discussion may be required here.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...

2017-03-16 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/781#discussion_r106491214
  
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java 
---
@@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) {
   assert index >= 0;
 
   final int currentOffset = offsetVector.getAccessor().get(index);
-  while (data.capacity() < currentOffset + bytes.length) {
-reAlloc();
-  }
   offsetVector.getMutator().setSafe(index + 1, currentOffset + 
bytes.length);
-  data.setBytes(currentOffset, bytes, 0, bytes.length);
+  try {
+data.setBytes(currentOffset, bytes, 0, bytes.length);
+  } catch (IndexOutOfBoundsException e) {
--- End diff --

While this is a clever way to avoid checks, it will lead to difficulty when 
debugging Drill. Intentionally throwing a common exception makes it even harder 
to find cases where the exception indicates an error.

Let's take a step back. One of the things we need to change in Parquet is 
to avoid "low density batches" vectors that have very little data. Turns out 
one reason is tied up with the assumption that the code makes that it can tell 
when it has reached the end of a vector. (There are many bugs, but that is the 
key idea.)

Vectors don't have that ability today, so the code never worked.

What if we solve that problem, and yours, by changing how the DrillBuf 
works:

```
if ( ! data.setBytesIfCapacity(...)) {
  reAlloc();
  data.setBytes(...)
}
```

The above avoids the spurious exception *and* provides the means to manage 
variable-length vectors in Parquet.

Note that the bounds check is still done, but only inside Drillbuf. And, of 
course, that same check is done with the PR code: that check is what raises the 
exception.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...

2017-03-13 Thread parthchandra
GitHub user parthchandra opened a pull request:

https://github.com/apache/drill/pull/781

DRILL-5351: Minimize bounds checking in var len vectors for Parquet

reader

Two changes in var len vectors: 
1) Instead of checking to see if we need to realloc for every setSafe call, 
let the write fail and catch the exception. The exception, though expensive, 
will happen very rarely.
2) Call fillEmpties only if there are empty values to fill. 
This saves a bunch of CPU on every setSafe call.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/parthchandra/drill DRILL-5351

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/781.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #781


commit 57869496526a43351575d0f4879d2ac28fe973d4
Author: Parth Chandra 
Date:   2017-02-11T01:40:25Z

DRILL-5351: Minimize bounds checking in var len vectors for Parquet
reader




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---