Kafka Plugin Filter Pushdown Design Proposal

2018-03-27 Thread Abhishek Ravi
Hi All,

I'm planning to add the Filter Pushdown feature to the Kafka Plugin (
https://issues.apache.org/jira/browse/DRILL-5977). I have come up with an
initial draft capturing the implementation details for this feature.
 Kafka Plugin Filter Pushdown Design Proposal

I am new to Drill and hence wanted to share the details to get some
feedback. Please take a look and leave comments.

Regards,
Abhishek


[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1161
  
@ppadma , any partial comments that I can start to take a look at? There 
are still a number of PRs in this chain and it would be great if we could keep 
things ticking along... Thanks!


---


[GitHub] drill pull request #1179: DRILL-6254: IllegalArgumentException: the requeste...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1179#discussion_r177619465
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -237,7 +237,10 @@ protected IterOutcome doWork() {
 
   private void handleRemainder() {
 int remainingRecordCount = flattener.getFlattenField().getAccessor().getInnerValueCount() - remainderIndex;
-if (!doAlloc(remainingRecordCount)) {
+
+// remainingRecordCount can be much higher than number of rows we will have in outgoing batch.
+// Do memory allocation only for number of rows we are going to have in the batch.
+if (!doAlloc(Math.min(remainingRecordCount, flattenMemoryManager.getOutputRowCount()))) {
--- End diff --

Not related to this fix at all... If we run out of memory, we should throw 
an `OutOfMemoryException` rather than trying to set flags and handle the case. 
In particular, after the batch sizing fixes, if we can't allocate memory now, 
something is very wrong and we'll never be able to. This code may be a vestige 
of the old system where operators tried to negotiate via `OUT_OF_MEMORY` 
statuses...
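To make the suggestion concrete, here is a minimal, self-contained sketch of the fail-fast behavior described above. `doAlloc()`, the 64K-row limit, and the exception class are hypothetical stand-ins for illustration, not Drill's actual code:

```java
public class AllocSketch {

    static class OutOfMemoryException extends RuntimeException {
        OutOfMemoryException(String message) { super(message); }
    }

    // Stand-in for FlattenRecordBatch.doAlloc(): pretend the allocator
    // can satisfy at most 64K rows.
    static boolean doAlloc(int recordCount) {
        return recordCount <= 65_536;
    }

    // Allocate only for the rows the outgoing batch will actually hold,
    // and fail fast if even that cannot be satisfied, instead of setting
    // flags and handling an OUT_OF_MEMORY status downstream.
    static void allocateOutgoing(int remainingRecordCount, int outputRowCount) {
        int allocCount = Math.min(remainingRecordCount, outputRowCount);
        if (!doAlloc(allocCount)) {
            throw new OutOfMemoryException("Unable to allocate " + allocCount + " rows");
        }
    }

    public static void main(String[] args) {
        // 1M values remain in the flatten field, but the outgoing batch
        // holds only 4096 rows, so allocation is sized for 4096.
        allocateOutgoing(1_000_000, 4_096);
        System.out.println("allocated");
    }
}
```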


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617185
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
+AVG_INPUT_BATCH_SIZE,
--- End diff --

Not clear what units `SIZE` is in. Maybe `AVG_INPUT_BATCH_BYTES` (if it is 
bytes)?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617922
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/AbstractRecordBatchMemoryManager.java
 ---
@@ -29,6 +29,50 @@
   private int outgoingRowWidth;
   private RecordBatchSizer sizer;
 
+  /**
+   * operator metric stats
+   */
+  private long numIncomingBatches;
+  private long sumInputBatchSizes;
+  private long sumInputRowWidths;
+  private long totalInputRecords;
+  private long numOutgoingBatches;
+  private long sumOutputBatchSizes;
+  private long sumOutputRowWidths;
+  private long totalOutputRecords;
+
+  public long getNumIncomingBatches() {
+return numIncomingBatches;
+  }
+
+  public long getTotalInputRecords() {
+return totalInputRecords;
+  }
+
+  public long getNumOutgoingBatches() {
+return numOutgoingBatches;
+  }
+
+  public long getTotalOutputRecords() {
+return totalOutputRecords;
+  }
+
+  public long getAvgInputBatchSize() {
+return RecordBatchSizer.safeDivide(sumInputBatchSizes, numIncomingBatches);
+  }
+
+  public long getAvgInputRowWidth() {
+return RecordBatchSizer.safeDivide(sumInputRowWidths, numIncomingBatches);
--- End diff --

See below regarding the average calculation.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617262
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
+AVG_INPUT_BATCH_SIZE,
+AVG_INPUT_ROW_WIDTH,
+TOTAL_INPUT_RECORDS,
+NUM_OUTGOING_BATCHES,
--- End diff --

`OUTPUT_BATCH_COUNT` as per comments above?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617387
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -463,4 +488,20 @@ protected boolean setupNewSchema() throws SchemaChangeException {
 }
 return exprs;
   }
+
+  private void updateStats() {
+stats.setLongStat(Metric.NUM_INCOMING_BATCHES, flattenMemoryManager.getNumIncomingBatches());
+stats.setLongStat(Metric.AVG_INPUT_BATCH_SIZE, flattenMemoryManager.getAvgInputBatchSize());
+stats.setLongStat(Metric.AVG_INPUT_ROW_WIDTH, flattenMemoryManager.getAvgInputRowWidth());
+stats.setLongStat(Metric.TOTAL_INPUT_RECORDS, flattenMemoryManager.getTotalInputRecords());
+stats.setLongStat(Metric.NUM_OUTGOING_BATCHES, flattenMemoryManager.getNumOutgoingBatches());
+stats.setLongStat(Metric.AVG_OUTPUT_BATCH_SIZE, flattenMemoryManager.getAvgOutputBatchSize());
+stats.setLongStat(Metric.AVG_OUTPUT_ROW_WIDTH, flattenMemoryManager.getAvgOutputRowWidth());
+stats.setLongStat(Metric.TOTAL_OUTPUT_RECORDS, flattenMemoryManager.getTotalOutputRecords());
+  }
+
+  @Override
+  public void close() {
--- End diff --

Is `super.close()` needed?
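The concern behind this question can be sketched in a few lines: if the base class releases resources in `close()`, an override that skips `super.close()` leaks them. The class names here are hypothetical, not Drill's:

```java
public class CloseSketch {

    static class Base implements AutoCloseable {
        boolean baseClosed;
        // Releases the base class's resources.
        @Override public void close() { baseClosed = true; }
    }

    static class Child extends Base {
        boolean statsUpdated;
        @Override public void close() {
            statsUpdated = true; // e.g. updateStats() in the override
            super.close();       // without this, Base's cleanup never runs
        }
    }

    public static void main(String[] args) {
        Child c = new Child();
        c.close();
        System.out.println(c.baseClosed && c.statsUpdated); // true
    }
}
```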


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617131
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
--- End diff --

Also, maybe `INPUT_BATCH_COUNT` to be parallel with `AVG_INPUT_BATCH_SIZE`?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617860
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/AbstractRecordBatchMemoryManager.java
 ---
@@ -29,6 +29,50 @@
   private int outgoingRowWidth;
   private RecordBatchSizer sizer;
 
+  /**
+   * operator metric stats
+   */
+  private long numIncomingBatches;
+  private long sumInputBatchSizes;
+  private long sumInputRowWidths;
+  private long totalInputRecords;
+  private long numOutgoingBatches;
+  private long sumOutputBatchSizes;
+  private long sumOutputRowWidths;
+  private long totalOutputRecords;
+
+  public long getNumIncomingBatches() {
+return numIncomingBatches;
+  }
+
+  public long getTotalInputRecords() {
+return totalInputRecords;
+  }
+
+  public long getNumOutgoingBatches() {
+return numOutgoingBatches;
+  }
+
+  public long getTotalOutputRecords() {
+return totalOutputRecords;
+  }
+
+  public long getAvgInputBatchSize() {
+return RecordBatchSizer.safeDivide(sumInputBatchSizes, numIncomingBatches);
+  }
+
+  public long getAvgInputRowWidth() {
+return RecordBatchSizer.safeDivide(sumInputRowWidths, numIncomingBatches);
+  }
+
+  public long getAvgOutputBatchSize() {
+return RecordBatchSizer.safeDivide(sumOutputBatchSizes, numOutgoingBatches);
+  }
+
+  public long getAvgOutputRowWidth() {
+return RecordBatchSizer.safeDivide(sumOutputRowWidths, numOutgoingBatches);
--- End diff --

Not sure if it really matters, but this calculation is not accurate: it is an 
unweighted average, while the actual width requires a weighted average. Say I 
have one batch with a single row of 1 MB in size, and another batch of 1K rows 
of 1K each. The average row width is actually ~2K, not ~500K.
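The arithmetic behind that example, with illustrative numbers only (batch 1: a single 1 MB row; batch 2: 1024 rows of 1 KB each):

```java
public class RowWidthAverages {
    public static void main(String[] args) {
        long[] batchBytes = {1_048_576L, 1024L * 1024L}; // total bytes per batch
        long[] batchRows  = {1L, 1024L};                 // rows per batch

        // Unweighted: average the per-batch average widths.
        long width0 = batchBytes[0] / batchRows[0];      // 1_048_576 (1 MB row)
        long width1 = batchBytes[1] / batchRows[1];      // 1_024 (1 KB rows)
        long unweighted = (width0 + width1) / 2;

        // Weighted: divide total bytes across all batches by total rows.
        long weighted = (batchBytes[0] + batchBytes[1]) / (batchRows[0] + batchRows[1]);

        System.out.println(unweighted); // 524800 (~512 KB)
        System.out.println(weighted);   // 2046   (~2 KB)
    }
}
```

The weighted form is what keeping running sums of bytes and rows (rather than averaging per-batch averages) produces.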


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177616893
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
--- End diff --

Strangely, these enum names show up as labels in the web UI. Maybe 
`INCOMING_BATCH_COUNT`? The `AVG_` and `TOTAL_` names seem readable enough...


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617047
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
+AVG_INPUT_BATCH_SIZE,
+AVG_INPUT_ROW_WIDTH,
--- End diff --

`AVG_INPUT_ROW_WIDTH` to be parallel with `AVG_OUTPUT_ROW_WIDTH`?


---


[GitHub] drill issue #1105: DRILL-6125: Fix possible memory leak when query is finish...

2018-03-27 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1105
  
@arina-ielchiieva Please review.


---


[GitHub] drill pull request #1185: DRILL-6288: Upgrade org.javassist:javassist and or...

2018-03-27 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1185#discussion_r177499603
  
--- Diff: exec/jdbc-all/pom.xml ---
@@ -559,7 +559,7 @@
   This is likely due to you adding new 
dependencies to a java-exec and not updating the excludes in this module. This 
is important as it minimizes the size of the dependency of Drill application 
users.
 
 
-3200
+3300
--- End diff --

Yes.


---


[GitHub] drill issue #1189: DRILL-6282: Excluding io.dropwizard.metrics dependencies

2018-03-27 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1189
  
@vdiravka right, check the first link, the new `groupId` for 
`com.codahale.metrics` is `io.dropwizard.metrics`, so all new versions will be 
deployed using the new `groupId`. 


---


[GitHub] drill issue #1190: DRILL-5937: ExecConstants: changed comment, timeout defau...

2018-03-27 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1190
  
+1


---


[jira] [Created] (DRILL-6294) Update Calcite version to 1.16.0

2018-03-27 Thread Volodymyr Vysotskyi (JIRA)
Volodymyr Vysotskyi created DRILL-6294:
--

 Summary: Update Calcite version to 1.16.0
 Key: DRILL-6294
 URL: https://issues.apache.org/jira/browse/DRILL-6294
 Project: Apache Drill
  Issue Type: Task
Reporter: Volodymyr Vysotskyi
Assignee: Volodymyr Vysotskyi


Upgrade to Calcite version 1.16.0.


 Since the last upgrade to Calcite 1.15.0, several commits were left in the 
Drill-Calcite fork. Since no additional work was done to move those commits 
out of the fork, they will be placed on top of Calcite 1.16.0.


 Status from the last upgrade:
 
[https://docs.google.com/document/d/1Lqk9NoKQviz0YimBmov4z1pui7QjJGjDVwMa1p0emPk/edit#heading=h.i3rowg20vxv4]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill issue #1190: DRILL-5937: ExecConstants: changed comment, timeout defau...

2018-03-27 Thread pushpendra-jaiswal-90
Github user pushpendra-jaiswal-90 commented on the issue:

https://github.com/apache/drill/pull/1190
  
@arina-ielchiieva I have removed the comment.


---


[GitHub] drill pull request #1190: DRILL-5937: ExecConstants: changed comment, timeou...

2018-03-27 Thread arina-ielchiieva
Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/1190#discussion_r177385544
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -557,8 +557,7 @@ private ExecConstants() {
   public static final LongValidator CODE_GEN_EXP_IN_METHOD_SIZE_VALIDATOR 
= new LongValidator(CODE_GEN_EXP_IN_METHOD_SIZE);
 
   /**
-   * Timeout for create prepare statement request. If the request exceeds 
this timeout, then request is timed out.
-   * Default value is 10mins.
+   * Default value is 30 seconds.
--- End diff --

Maybe we should remove the comment about the default as well? The actual 
default value can be found in drill-module.conf, and if it changes, we won't 
need to modify it here.


---


[GitHub] drill issue #1189: DRILL-6282: Excluding io.dropwizard.metrics dependencies

2018-03-27 Thread vdiravka
Github user vdiravka commented on the issue:

https://github.com/apache/drill/pull/1189
  
@vrozov I understand what you mean: `com.codahale` hasn't been updated for a 
long time 
(https://mvnrepository.com/artifact/com.codahale.metrics/metrics-core), 
while `io.dropwizard.metrics` has been updated recently 
(https://mvnrepository.com/artifact/io.dropwizard.metrics/metrics-core). 

So we can switch to the latter. I will update my PR and let you know. 
Thank you. 


---


[GitHub] drill issue #1190: DRILL-5937: ExecConstants: changed comment, timeout defau...

2018-03-27 Thread pushpendra-jaiswal-90
Github user pushpendra-jaiswal-90 commented on the issue:

https://github.com/apache/drill/pull/1190
  
@arina-ielchiieva I have done the changes. Could you please review?


---


[GitHub] drill pull request #1191: DRILL-6103: lsb_release: command not found

2018-03-27 Thread kkhatua
GitHub user kkhatua opened a pull request:

https://github.com/apache/drill/pull/1191

DRILL-6103: lsb_release: command not found

Thanks to Sanel Zukan for providing a small patch that checks for the 
/etc/fedora-release path, which is more common than the lsb_release command 
on Linux distros.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kkhatua/drill DRILL-6103

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1191.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1191


commit 11a7ff214f70a353b54aef38ba51279da5b31935
Author: Kunal Khatua 
Date:   2018-03-27T06:30:31Z

DRILL-6103: lsb_release: command not found

Thanks to Sanel Zukan for providing a small patch that checks for the 
/etc/fedora-release path, which is more common than the lsb_release command 
on Linux distros.




---


[GitHub] drill issue #1191: DRILL-6103: lsb_release: command not found

2018-03-27 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1191
  
@chunhui-shi please review


---


[GitHub] drill pull request #1166: DRILL-6016 - Fix for Error reading INT96 created b...

2018-03-27 Thread rajrahul
Github user rajrahul commented on a diff in the pull request:

https://github.com/apache/drill/pull/1166#discussion_r177318780
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
 ---
@@ -797,6 +797,24 @@ public void 
testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
 }
   }
 
+  @Test
+  public void testSparkParquetBinaryAsTimeStamp_DictChange() throws 
Exception {
+try {
+  mockUtcDateTimeZone();
--- End diff --

I could see two ways of doing this within the code itself.
1. Mock and run with UTC, and compare the results in UTC, as in 
TestCastFunctions#testToDateForTimeStamp. Since TestParquetWriter already has a 
RunWith annotation, we might have to create another class and move both 
methods.
2. Run with the JVM timezone (no mocking) and compare the results after a 
'convertToLocalTimestamp', as in TestParquetWriter#testInt96TimeStampValueWidth.

Approach 2 does not use a fixed UTC timezone. Which approach do you suggest?
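The second approach can be sketched as follows: rather than mocking the JVM timezone, convert the UTC instant stored in the file to the test JVM's zone before comparing. `convertToLocalTimestamp` here is a hypothetical stand-in for the helper mentioned in the comment, built on `java.time` rather than Drill's utilities:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class TimestampZoneSketch {
    // Reinterpret a UTC instant as a wall-clock timestamp in the given zone.
    static LocalDateTime convertToLocalTimestamp(Instant utcInstant, ZoneId zone) {
        return LocalDateTime.ofInstant(utcInstant, zone);
    }

    public static void main(String[] args) {
        // Value as (conceptually) written in UTC by the producer.
        Instant stored = Instant.parse("2018-03-27T06:30:31Z");

        // Expected value in whatever zone this JVM happens to run in; the
        // comparison no longer depends on a mocked default timezone.
        LocalDateTime expected = convertToLocalTimestamp(stored, ZoneId.systemDefault());
        System.out.println(expected);

        // Sanity check: in UTC the conversion leaves the fields unchanged.
        System.out.println(convertToLocalTimestamp(stored, ZoneId.of("UTC"))); // 2018-03-27T06:30:31
    }
}
```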


---