[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344474#comment-16344474
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164634551
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -215,6 +206,7 @@ public BatchHolder() {
   MaterializedField outputField = materializedValueFields[i];
   // Create a type-specific ValueVector for this value
   vector = TypeHelper.getNewVector(outputField, allocator);
+  int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --

Why not just use TypeHelper.getSize(outputField.getType())? It seems like 
you don't need RecordBatchSizer for this. estSize in RecordBatchSizer takes 
actual memory allocation into account, i.e. it includes the overhead of 
unused vector space, and we need that value as well, for different reasons. I 
also think it is doing the right thing by returning zero when there are no rows and 
no memory allocated. 
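
For reference, a minimal sketch of the two estimates being compared (names taken from the diff above; the comments are my reading, not a definitive statement):

```java
// Type-declared width: depends only on the column's MajorType, never on allocation.
int declaredWidth = TypeHelper.getSize(outputField.getType());
// Allocation-aware width from the sizer: reflects actual vector memory, including unused space.
int observedWidth = new RecordBatchSizer.ColumnSize(vector).estSize;
```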


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344403#comment-16344403
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164624503
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -215,6 +206,7 @@ public BatchHolder() {
   MaterializedField outputField = materializedValueFields[i];
   // Create a type-specific ValueVector for this value
   vector = TypeHelper.getNewVector(outputField, allocator);
+  int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --

Agree that the "sizer" should return the size of a fixed-width column even 
if there are 0 rows. Thanks for catching and fixing this. Though, I will point 
out that if you get a 0-length batch, just throw it away and read another; no 
useful purpose is served by doing anything with an empty batch (except to pass 
it along for schema purposes if this is the first batch: the so-called "fast 
schema" that is supposed to work, but seems to not be fully implemented...)

I've found that Drill behaves poorly and erratically when trying to 
allocate zero-sized vectors. Best not to do that. If your batch size is 0, 
don't allocate vectors would be my general advice to avoid lots of grief.
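
To illustrate the suggestion, a minimal sketch of skipping empty batches (a generic pattern, not the actual HashAgg code):

```java
// Skip zero-row batches instead of sizing them or allocating vectors from them.
RecordBatch.IterOutcome outcome = incoming.next();
while (outcome == RecordBatch.IterOutcome.OK && incoming.getRecordCount() == 0) {
  // An empty batch carries nothing useful beyond its schema (handled on OK_NEW_SCHEMA).
  outcome = incoming.next();
}
```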


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344399#comment-16344399
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164623893
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -129,11 +143,16 @@ public ColumnSize(ValueVector v, String prefix) {
 // No standard size for Union type
 dataSize = v.getPayloadByteCount(valueCount);
 break;
+  case GENERIC_OBJECT:
+// We cannot provide a size for Generic Objects
--- End diff --

The `GENERIC_OBJECT` type is used any time we do a system table query: 
system tables are represented as Java objects.

There is an open question about certain aggregate functions. (See a note 
sent a week or so ago.) These aggregate functions use an `ObjectHolder` as 
their `@Workspace`. @Ben-Zvi and I discussed whether such aggregates are 
spillable. This may be an unresolved issue. 
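
For context, a bare-bones sketch of the kind of aggregate in question: one that keeps its intermediate state in an ObjectHolder workspace. The function itself is hypothetical; only the annotations and holder types are real Drill classes.

```java
import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.NullableVarCharHolder;
import org.apache.drill.exec.expr.holders.ObjectHolder;

@FunctionTemplate(name = "example_obj_agg", scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public class ExampleObjectAgg implements DrillAggFunc {
  @Param NullableVarCharHolder in;
  @Workspace ObjectHolder state;    // on-heap Java object: this is what makes spilling such aggregates unclear
  @Output NullableVarCharHolder out;

  public void setup() { state = new ObjectHolder(); }
  public void add() { /* update state.obj from `in` */ }
  public void output() { /* copy state.obj into `out` */ }
  public void reset() { state.obj = null; }
}
```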


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344397#comment-16344397
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164623527
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -232,9 +251,8 @@ else if (width > 0) {
 }
   }
 
-  public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; 
// 16 MiB
-
   private List columnSizes = new ArrayList<>();
+  private Map columnSizeMap = 
CaseInsensitiveMap.newHashMap();
--- End diff --

Drill is case-insensitive internally. The case-insensitive map is correct. 
Thanks for catching this @ilooner!

Unfortunately, record batches have no name space: they are just a 
collection of vectors. So, we could end up with columns called both "c" and 
"C". This situation will cause the column size map to end up with one entry for 
both columns, with the last writer winning.

The best solution would be to enforce name space uniqueness when creating 
vectors. The new "result set loader" does this, but I suspect other readers 
might not -- depending on the particular way that they create their vectors. 
Still, creating names that differ only in case is a bug and any code doing that 
should be fixed. 
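
A tiny sketch of the collision described above, using the same CaseInsensitiveMap as the diff (the column names are made up):

```java
Map<String, Integer> columnSizeMap = CaseInsensitiveMap.newHashMap();
columnSizeMap.put("c", 4);   // first writer
columnSizeMap.put("C", 8);   // folds to the same key: last writer wins
// columnSizeMap.size() == 1, and columnSizeMap.get("c") returns 8
```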


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5864) Selecting a non-existing field from a MapR-DB JSON table fails with NPE

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344358#comment-16344358
 ] 

ASF GitHub Bot commented on DRILL-5864:
---

Github user HanumathRao closed the pull request at:

https://github.com/apache/drill/pull/1007


> Selecting a non-existing field from a MapR-DB JSON table fails with NPE
> ---
>
> Key: DRILL-5864
> URL: https://issues.apache.org/jira/browse/DRILL-5864
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators, Storage - MapRDB
>Affects Versions: 1.12.0
>Reporter: Abhishek Girish
>Assignee: Hanumath Rao Maduri
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
> Attachments: OrderByNPE.log, OrderByNPE2.log
>
>
> Query 1
> {code}
> > select C_FIRST_NAME,C_BIRTH_COUNTRY,C_BIRTH_YEAR,C_BIRTH_MONTH,C_BIRTH_DAY 
> > from customer ORDER BY C_BIRTH_COUNTRY ASC, C_FIRST_NAME ASC LIMIT 10;
> Error: SYSTEM ERROR: NullPointerException
>   (java.lang.NullPointerException) null
> org.apache.drill.exec.record.SchemaUtil.coerceContainer():176
> 
> org.apache.drill.exec.physical.impl.xsort.managed.BufferedBatches.convertBatch():124
> org.apache.drill.exec.physical.impl.xsort.managed.BufferedBatches.add():90
> org.apache.drill.exec.physical.impl.xsort.managed.SortImpl.addBatch():265
> 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch():421
> 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load():357
> 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext():302
> org.apache.drill.exec.record.AbstractRecordBatch.next():164
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> 
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
> org.apache.drill.exec.record.AbstractRecordBatch.next():164
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
> org.apache.drill.exec.record.AbstractRecordBatch.next():164
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
> org.apache.drill.exec.record.AbstractRecordBatch.next():164
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> 
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
> org.apache.drill.exec.record.AbstractRecordBatch.next():164
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():134
> org.apache.drill.exec.record.AbstractRecordBatch.next():164
> org.apache.drill.exec.physical.impl.BaseRootExec.next():105
> 
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
> org.apache.drill.exec.physical.impl.BaseRootExec.next():95
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():234
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():227
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1595
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():227
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> java.lang.Thread.run():748 (state=,code=0)
> {code}
> Plan
> {code}
> 00-00Screen
> 00-01  Project(C_FIRST_NAME=[$0], C_BIRTH_COUNTRY=[$1], 
> C_BIRTH_YEAR=[$2], C_BIRTH_MONTH=[$3], C_BIRTH_DAY=[$4])
> 00-02SelectionVectorRemover
> 00-03  Limit(fetch=[10])
> 00-04Limit(fetch=[10])
> 00-05  SelectionVectorRemover
> 00-06Sort(sort0=[$1], sort1=[$0], dir0=[ASC], dir1=[ASC])
> 00-07  

[jira] [Updated] (DRILL-6099) Drill does not push limit past project (flatten) if it cannot be pushed into scan

2018-01-29 Thread Gautam Kumar Parai (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Kumar Parai updated DRILL-6099:
--
Labels: ready-to-commit  (was: )

[~ben-zvi] please consider this PR (https://github.com/apache/drill/pull/1096) 
for the batch commit.

> Drill does not push limit past project (flatten) if it cannot be pushed into 
> scan
> -
>
> Key: DRILL-6099
> URL: https://issues.apache.org/jira/browse/DRILL-6099
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Gautam Kumar Parai
>Assignee: Gautam Kumar Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.13.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be useful to have pushdown occur past flatten(project). Here is an 
> example to illustrate the issue:
> {{explain plan without implementation for }}{{select name, 
> flatten(categories) as category from dfs.`/tmp/t_json_20` LIMIT 1;}}
> {{DrillScreenRel}}{{  }}
> {{  DrillLimitRel(fetch=[1])}}{{    }}
> {{    DrillProjectRel(name=[$0], category=[FLATTEN($1)])}}
> {{      DrillScanRel(table=[[dfs, /tmp/t_json_20]], groupscan=[EasyGroupScan 
> [selectionRoot=maprfs:/tmp/t_json_20, numFiles=1, columns=[`name`, 
> `categories`], files=[maprfs:///tmp/t_json_20/0_0_0.json]]])}}
> = 
> Content of 0_0_0.json
> =
> {
>   "name" : "Eric Goldberg, MD",
>   "categories" : [ "Doctors", "Health & Medical" ]
> } {
>   "name" : "Pine Cone Restaurant",
>   "categories" : [ "Restaurants" ]
> } {
>   "name" : "Deforest Family Restaurant",
>   "categories" : [ "American (Traditional)", "Restaurants" ]
> } {
>   "name" : "Culver's",
>   "categories" : [ "Food", "Ice Cream & Frozen Yogurt", "Fast Food", 
> "Restaurants" ]
> } {
>   "name" : "Chang Jiang Chinese Kitchen",
>   "categories" : [ "Chinese", "Restaurants" ]
> } 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6117) Query End time is not being set

2018-01-29 Thread Kunal Khatua (JIRA)
Kunal Khatua created DRILL-6117:
---

 Summary: Query End time is not being set
 Key: DRILL-6117
 URL: https://issues.apache.org/jira/browse/DRILL-6117
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.13.0
Reporter: Kunal Khatua
Assignee: Arina Ielchiieva
 Fix For: 1.13.0


Looks like a commit related to the Query State change (DRILL-5963) introduced a bug 
where the *End* time is not being set, resulting in the UI using the current 
clock time as the profile's end time and calculating the duration incorrectly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344347#comment-16344347
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164615997
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggrSpill.java
 ---
@@ -53,59 +51,54 @@
   @Rule
   public final BaseDirTestWatcher dirTestWatcher = new 
BaseDirTestWatcher();
 
-/**
- *  A template for Hash Aggr spilling tests
- *
- * @throws Exception
- */
-private void testSpill(long maxMem, long numPartitions, long 
minBatches, int maxParallel, boolean fallback ,boolean predict,
-   String sql, long expectedRows, int cycle, int 
fromPart, int toPart) throws Exception {
-LogFixture.LogFixtureBuilder logBuilder = LogFixture.builder()
--- End diff --

I can add this back. In general I don't like leaving unused code in the 
codebase though.


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344343#comment-16344343
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164615743
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -232,9 +251,8 @@ else if (width > 0) {
 }
   }
 
-  public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; 
// 16 MiB
-
   private List columnSizes = new ArrayList<>();
+  private Map columnSizeMap = 
CaseInsensitiveMap.newHashMap();
--- End diff --

I will follow up on this on the mailing list. I was under the impression 
that Drill is case-**in**sensitive and did not support case sensitivity of 
columns in an external store like HBase or MapRDB. If Drill is case-sensitive 
then we have a bigger issue because some of the functional tests assume that it 
is case-**in**sensitive.


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344338#comment-16344338
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164615077
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -129,11 +143,16 @@ public ColumnSize(ValueVector v, String prefix) {
 // No standard size for Union type
 dataSize = v.getPayloadByteCount(valueCount);
 break;
+  case GENERIC_OBJECT:
+// We cannot provide a size for Generic Objects
--- End diff --

Execution gets here in some of the HashAgg functional tests, probably in 
the case where varchars are aggregated, since, as you explained to me, varchars 
are stored in object vectors on the heap.


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344333#comment-16344333
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164614812
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java
 ---
@@ -805,7 +803,12 @@ private IntVector allocMetadataVector(int size, int 
initialValue) {
   }
 
   @Override
-  public void setMaxVarcharSize(int size) { maxVarcharSize = size; }
+  public void setKeySizes(Map keySizes) {
+Preconditions.checkNotNull(keySizes);
+
+this.keySizes = CaseInsensitiveMap.newHashMap();
--- End diff --

It helps to avoid bugs. It is assumed that the keySizes map will never 
change once it is set; copying the map helps enforce that constraint. If we 
didn't copy the map, a caller could pass a keySizes map and later update or 
reuse it, which would cause errors. Some languages, like Scala, have immutable 
flavors of data structures for this reason.
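
For reference, a sketch of the defensive-copy pattern being described, matching the names in the diff (the generic types and the putAll call are my assumptions about the surrounding code):

```java
@Override
public void setKeySizes(Map<String, Integer> keySizes) {
  Preconditions.checkNotNull(keySizes);
  // Defensive copy: the hash table keeps its own snapshot, so later mutation
  // or reuse of the caller's map cannot change the sizes this table relies on.
  this.keySizes = CaseInsensitiveMap.newHashMap();
  this.keySizes.putAll(keySizes);
}
```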


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344325#comment-16344325
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164613714
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -956,21 +925,8 @@ private void spillAPartition(int part) {
   this.htables[part].outputKeys(currOutBatchIndex, this.outContainer, 
outStartIdxHolder.value, outNumRecordsHolder.value, numPendingOutput);
 
   // set the value count for outgoing batch value vectors
-  /* int i = 0; */
   for (VectorWrapper v : outgoing) {
 v.getValueVector().getMutator().setValueCount(numOutputRecords);
-/*
--- End diff --

If this logic is really critical for debugging, we should have utility 
methods to help with it so that others can benefit from the code. Maybe we can 
brainstorm about what methods would be helpful for debugging and start building 
a utility library. In the meantime, if you need this code for debugging, you can 
easily look into the history and copy and paste it from there.
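
As one concrete starting point for that brainstorm, a hypothetical helper (not an existing Drill utility) that dumps per-column value counts for a container:

```java
// Hypothetical debugging helper; the name and placement are illustrative only.
public static void logColumnCounts(org.slf4j.Logger logger, VectorContainer container, String label) {
  if (!logger.isDebugEnabled()) {
    return;
  }
  for (VectorWrapper<?> wrapper : container) {
    ValueVector vv = wrapper.getValueVector();
    logger.debug("{}: column {} holds {} values",
        label, vv.getField().getName(), vv.getAccessor().getValueCount());
  }
}
```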


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344317#comment-16344317
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164612920
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -516,43 +501,48 @@ private void initializeSetup(RecordBatch newIncoming) 
throws SchemaChangeExcepti
* @param incoming
*/
   private void updateEstMaxBatchSize(RecordBatch incoming) {
-if ( estMaxBatchSize > 0 ) { return; }  // no handling of a schema (or 
varchar) change
 // Use the sizer to get the input row width and the length of the 
longest varchar column
-RecordBatchSizer sizer = new RecordBatchSizer(incoming);
-logger.trace("Incoming sizer: {}",sizer);
-// An empty batch only has the schema, can not tell actual length of 
varchars
-// else use the actual varchars length, each capped at 50 (to match 
the space allocation)
-long estInputRowWidth = sizer.rowCount() == 0 ? sizer.stdRowWidth() : 
sizer.netRowWidthCap50();
+final RecordBatchSizer incomingColumnSizes = new 
RecordBatchSizer(incoming);
+final Map columnSizeMap = 
incomingColumnSizes.getColumnSizeMap();
+keySizes = CaseInsensitiveMap.newHashMap();
 
-// Get approx max (varchar) column width to get better memory 
allocation
-maxColumnWidth = Math.max(sizer.maxAvgColumnSize(), 
VARIABLE_MIN_WIDTH_VALUE_SIZE);
-maxColumnWidth = Math.min(maxColumnWidth, 
VARIABLE_MAX_WIDTH_VALUE_SIZE);
+logger.trace("Incoming sizer: {}",incomingColumnSizes);
 
-//
 // Calculate the estimated max (internal) batch (i.e. Keys batch + 
Values batch) size
 // (which is used to decide when to spill)
 // Also calculate the values batch size (used as a reserve to overcome 
an OOM)
-//
-Iterator outgoingIter = outContainer.iterator();
-int fieldId = 0;
-while (outgoingIter.hasNext()) {
-  ValueVector vv = outgoingIter.next().getValueVector();
-  MaterializedField mr = vv.getField();
-  int fieldSize = vv instanceof VariableWidthVector ? maxColumnWidth :
-  TypeHelper.getSize(mr.getType());
-  estRowWidth += fieldSize;
-  estOutputRowWidth += fieldSize;
-  if ( fieldId < numGroupByOutFields ) { fieldId++; }
-  else { estValuesRowWidth += fieldSize; }
+
+for (int columnIndex = 0; columnIndex < numGroupByOutFields; 
columnIndex++) {
+  final VectorWrapper vectorWrapper = 
outContainer.getValueVector(columnIndex);
+  final String columnName = vectorWrapper.getField().getName();
+  final int columnSize = columnSizeMap.get(columnName).estSize;
+  keySizes.put(columnName, columnSize);
+  estOutputRowWidth += columnSize;
 }
+
+long estValuesRowWidth = 0;
+
+for (int columnIndex = numGroupByOutFields; columnIndex < 
outContainer.getNumberOfColumns(); columnIndex++) {
+  VectorWrapper vectorWrapper = 
outContainer.getValueVector(columnIndex);
+  RecordBatchSizer.ColumnSize columnSize = new 
RecordBatchSizer.ColumnSize(vectorWrapper.getValueVector());
+  estOutputRowWidth += columnSize.estSize;
+  estValuesRowWidth += columnSize.estSize;
+}
+
 // multiply by the max number of rows in a batch to get the final 
estimated max size
-estMaxBatchSize = Math.max(estRowWidth, estInputRowWidth) * 
MAX_BATCH_SIZE;
+estMaxBatchSize = Math.max(estOutputRowWidth, 1) * MAX_BATCH_SIZE;
 // (When there are no aggr functions, use '1' as later code relies on 
this size being non-zero)
+// Note: estValuesBatchSize cannot be 0 because a zero value for 
estValuesBatchSize will cause reserveValueBatchMemory to have a value of 0. And 
the meaning
+// of a reserveValueBatchMemory value of 0 has multiple meanings in 
different contexts. So estValuesBatchSize has an enforced minimum value of 1, 
without this
+// estValuesBatchsize could have a value of 0 in the case were there 
are no value columns and all the columns are key columns.
 estValuesBatchSize = Math.max(estValuesRowWidth, 1) * MAX_BATCH_SIZE;
 estOutgoingAllocSize = estValuesBatchSize; // initially assume same 
size
 
-logger.trace("{} phase. Estimated internal row width: {} Values row 
width: {} batch size: {}  memory limit: {}  max column width: {}",
-
isTwoPhase?(is2ndPhase?"2nd":"1st"):"Single",estRowWidth,estValuesRowWidth,estMaxBatchSize,allocator.getLimit(),maxColumnWidth);
+if (logger.isTraceEnabled()) {
 

[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344309#comment-16344309
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164612280
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -84,13 +85,6 @@
 public abstract class HashAggTemplate implements HashAggregator {
   protected static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(HashAggregator.class);
 
-  private static final int VARIABLE_MAX_WIDTH_VALUE_SIZE = 50;
-  private static final int VARIABLE_MIN_WIDTH_VALUE_SIZE = 8;
-
-  private static final boolean EXTRA_DEBUG_1 = false;
--- End diff --

Logging frameworks were invented to solve this problem. We should just use 
the logging framework.
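
For example, the EXTRA_DEBUG_1 guards map directly onto slf4j level checks; a sketch (the message and variables are illustrative, not the actual replacement in this PR):

```java
// Old style: if (EXTRA_DEBUG_1) { logger.debug("..."); }
// With the logging framework, the guard is the configured log level itself:
if (logger.isTraceEnabled()) {
  logger.trace("BatchHolder {}: allocated {} bytes for {} values", batchIndex, allocatedBytes, valueCount);
}
```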


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344297#comment-16344297
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164609183
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -516,43 +501,48 @@ private void initializeSetup(RecordBatch newIncoming) 
throws SchemaChangeExcepti
* @param incoming
*/
   private void updateEstMaxBatchSize(RecordBatch incoming) {
-if ( estMaxBatchSize > 0 ) { return; }  // no handling of a schema (or 
varchar) change
 // Use the sizer to get the input row width and the length of the 
longest varchar column
-RecordBatchSizer sizer = new RecordBatchSizer(incoming);
-logger.trace("Incoming sizer: {}",sizer);
-// An empty batch only has the schema, can not tell actual length of 
varchars
-// else use the actual varchars length, each capped at 50 (to match 
the space allocation)
-long estInputRowWidth = sizer.rowCount() == 0 ? sizer.stdRowWidth() : 
sizer.netRowWidthCap50();
+final RecordBatchSizer incomingColumnSizes = new 
RecordBatchSizer(incoming);
+final Map columnSizeMap = 
incomingColumnSizes.getColumnSizeMap();
+keySizes = CaseInsensitiveMap.newHashMap();
 
-// Get approx max (varchar) column width to get better memory 
allocation
-maxColumnWidth = Math.max(sizer.maxAvgColumnSize(), 
VARIABLE_MIN_WIDTH_VALUE_SIZE);
-maxColumnWidth = Math.min(maxColumnWidth, 
VARIABLE_MAX_WIDTH_VALUE_SIZE);
+logger.trace("Incoming sizer: {}",incomingColumnSizes);
 
-//
 // Calculate the estimated max (internal) batch (i.e. Keys batch + 
Values batch) size
 // (which is used to decide when to spill)
 // Also calculate the values batch size (used as a reserve to overcome 
an OOM)
-//
-Iterator outgoingIter = outContainer.iterator();
-int fieldId = 0;
-while (outgoingIter.hasNext()) {
-  ValueVector vv = outgoingIter.next().getValueVector();
-  MaterializedField mr = vv.getField();
-  int fieldSize = vv instanceof VariableWidthVector ? maxColumnWidth :
-  TypeHelper.getSize(mr.getType());
-  estRowWidth += fieldSize;
-  estOutputRowWidth += fieldSize;
-  if ( fieldId < numGroupByOutFields ) { fieldId++; }
-  else { estValuesRowWidth += fieldSize; }
+
+for (int columnIndex = 0; columnIndex < numGroupByOutFields; 
columnIndex++) {
+  final VectorWrapper vectorWrapper = 
outContainer.getValueVector(columnIndex);
+  final String columnName = vectorWrapper.getField().getName();
+  final int columnSize = columnSizeMap.get(columnName).estSize;
+  keySizes.put(columnName, columnSize);
+  estOutputRowWidth += columnSize;
 }
+
+long estValuesRowWidth = 0;
+
+for (int columnIndex = numGroupByOutFields; columnIndex < 
outContainer.getNumberOfColumns(); columnIndex++) {
+  VectorWrapper vectorWrapper = 
outContainer.getValueVector(columnIndex);
+  RecordBatchSizer.ColumnSize columnSize = new 
RecordBatchSizer.ColumnSize(vectorWrapper.getValueVector());
+  estOutputRowWidth += columnSize.estSize;
--- End diff --

What is missing (maybe in the original code as well) are those extras: the 
Values batch sometimes uses those 8-byte vectors to represent nullability, and for 
some aggregations (e.g. STDDEV) there are internally two columns ("count" and 
"sum of squares", or something like that), but the outContainer has only one, 
holding the result (this is why the Values vectors in the outContainer are always 
allocated instead of passing the existing vectors - see DRILL-5588).


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344292#comment-16344292
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164571408
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -129,11 +143,16 @@ public ColumnSize(ValueVector v, String prefix) {
 // No standard size for Union type
 dataSize = v.getPayloadByteCount(valueCount);
 break;
+  case GENERIC_OBJECT:
+// We cannot provide a size for Generic Objects
--- End diff --

How can execution ever get here? Should we throw a 
UserException.unsupportedError here?
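
If failing fast is the right call, a sketch of what it might look like in the ColumnSize constructor (assuming the usual UserException builder and a logger in scope):

```java
case GENERIC_OBJECT:
  // Sketch only: fail fast rather than silently treating the column as zero-width.
  throw UserException.unsupportedError()
      .message("Cannot estimate the size of GENERIC_OBJECT column %s", v.getField().getName())
      .build(logger);
```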



> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344296#comment-16344296
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164604859
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -397,11 +384,9 @@ private void delayedSetup() {
 }
 numPartitions = BaseAllocator.nextPowerOfTwo(numPartitions); // in 
case not a power of 2
 
-if ( schema == null ) { estValuesBatchSize = estOutgoingAllocSize = 
estMaxBatchSize = 0; } // incoming was an empty batch
--- End diff --

Why was the ( schema == null ) check removed? Without it, calling 
updateEstMaxBatchSize() would throw an NPE when accessing the batch's schema 
(when calling new RecordBatchSizer(incoming)).
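
For reference, the removed guard was along these lines (reconstructed from the diff; `schema` is the operator's cached schema field):

```java
if (schema == null) {
  // Incoming was an empty, schema-less batch: keep the estimates at zero instead of
  // letting new RecordBatchSizer(incoming) dereference the missing schema.
  estValuesBatchSize = estOutgoingAllocSize = estMaxBatchSize = 0;
} else {
  updateEstMaxBatchSize(incoming);
}
```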



> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344293#comment-16344293
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164576364
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggrSpill.java
 ---
@@ -53,59 +51,54 @@
   @Rule
   public final BaseDirTestWatcher dirTestWatcher = new 
BaseDirTestWatcher();
 
-/**
- *  A template for Hash Aggr spilling tests
- *
- * @throws Exception
- */
-private void testSpill(long maxMem, long numPartitions, long 
minBatches, int maxParallel, boolean fallback ,boolean predict,
-   String sql, long expectedRows, int cycle, int 
fromPart, int toPart) throws Exception {
-LogFixture.LogFixtureBuilder logBuilder = LogFixture.builder()
--- End diff --

The logBuilder is here in case someone uses the Hash Agg spill tests and 
needs to log. It saves the trouble of putting this code back again for that 
purpose. Unfortunately, Java does not support #ifdef.
  


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344295#comment-16344295
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164600675
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java
 ---
@@ -805,7 +803,12 @@ private IntVector allocMetadataVector(int size, int 
initialValue) {
   }
 
   @Override
-  public void setMaxVarcharSize(int size) { maxVarcharSize = size; }
+  public void setKeySizes(Map keySizes) {
+Preconditions.checkNotNull(keySizes);
+
+this.keySizes = CaseInsensitiveMap.newHashMap();
--- End diff --

Minor: Why does every hash table allocate a new map and copy its values? 
Can't they all just share the keySizes map?



> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344298#comment-16344298
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164609742
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -84,13 +85,6 @@
 public abstract class HashAggTemplate implements HashAggregator {
   protected static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(HashAggregator.class);
 
-  private static final int VARIABLE_MAX_WIDTH_VALUE_SIZE = 50;
-  private static final int VARIABLE_MIN_WIDTH_VALUE_SIZE = 8;
-
-  private static final boolean EXTRA_DEBUG_1 = false;
--- End diff --

This is Aman's implementation of an #ifdef.
Basically, he wanted to preserve these debugging messages but not see 
them (there are too many) while doing some other debugging work.
 


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344294#comment-16344294
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164574746
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -232,9 +251,8 @@ else if (width > 0) {
 }
   }
 
-  public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; 
// 16 MiB
-
   private List columnSizes = new ArrayList<>();
+  private Map columnSizeMap = 
CaseInsensitiveMap.newHashMap();
--- End diff --

There may be an issue with MapRDB or HBase, which are case-sensitive. Also, 
Drill allows quoted column names (e.g., "FOO" or "foo"), which keep their case 
sensitivity. 
  One crude "fix" could be to add a user option so that, when turned on, this code 
would use an AbstractHashedMap instead.



> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344291#comment-16344291
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164549701
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java
 ---
@@ -255,7 +254,6 @@ private HashAggregator createAggregatorInternal() 
throws SchemaChangeException,
   groupByOutFieldIds[i] = container.add(vv);
 }
 
-int extraNonNullColumns = 0; // each of SUM, MAX and MIN gets an extra 
bigint column
--- End diff --

Why was this removed? Unfortunately, the code generator still generates an 
internal BigInt column to keep the "not null" status of the aggregated values 
in the cases of sum/max/min. Below is an example of generated code (returning 
a nullable varchar) - note the nonNullCount (see DRILL-5728):

```
public void outputRecordValues(int htRowIdx, int outRowIdx) throws SchemaChangeException {
    {
        NullableVarCharHolder out17;
        {
            final NullableVarCharHolder out = new NullableVarCharHolder();
            vv1.getAccessor().get((htRowIdx), work0);
            ObjectHolder value = work0;
            work4.value = vv5.getAccessor().get((htRowIdx));
            UInt1Holder init = work4;
            work8.value = vv9.getAccessor().get((htRowIdx));
            BigIntHolder nonNullCount = work8;
            DrillBuf buf = work12;

            MaxVarBytesFunctions$NullableVarCharMax_output: {
                if (nonNullCount.value > 0) {
                    out.isSet = 1;
                    org.apache.drill.exec.expr.fn.impl.DrillByteArray tmp =
                        (org.apache.drill.exec.expr.fn.impl.DrillByteArray) value.obj;
                    buf = buf.reallocIfNeeded(tmp.getLength());
                    buf.setBytes(0, tmp.getBytes(), 0, tmp.getLength());
                    out.start = 0;
                    out.end = tmp.getLength();
                    out.buffer = buf;
                } else {
                    out.isSet = 0;
                }
            }

            work0 = value;
            vv1.getMutator().setSafe((htRowIdx), work0);
            work4 = init;
            vv5.getMutator().set((htRowIdx), work4.value);
            work8 = nonNullCount;
            vv9.getMutator().set((htRowIdx), work8.value);
            work12 = buf;
            out17 = out;
        }
        if (!(out17.isSet == 0)) {
            vv18.getMutator().setSafe((outRowIdx), out17.isSet, out17.start, out17.end, out17.buffer);
        }
    }
}
```
 


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344299#comment-16344299
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164611358
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -956,21 +925,8 @@ private void spillAPartition(int part) {
   this.htables[part].outputKeys(currOutBatchIndex, this.outContainer, 
outStartIdxHolder.value, outNumRecordsHolder.value, numPendingOutput);
 
   // set the value count for outgoing batch value vectors
-  /* int i = 0; */
   for (VectorWrapper v : outgoing) {
 v.getValueVector().getMutator().setValueCount(numOutputRecords);
-/*
--- End diff --

We don't have any facility to take a peek at actual data for debugging. 
While this commented code looks ugly, it can easily be re-enabled to look at 
the data if needed (or serve as an example of how to do so elsewhere).



> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344286#comment-16344286
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164611155
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -215,6 +206,7 @@ public BatchHolder() {
   MaterializedField outputField = materializedValueFields[i];
   // Create a type-specific ValueVector for this value
   vector = TypeHelper.getNewVector(outputField, allocator);
+  int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --

The goal of this code is to preallocate enough space to hold BATCH_SIZE 
elements in the batch. The BatchHolder itself holds the aggregate values, and 
aggregate values can currently only be FixedWidth vectors or ObjectVectors. 
Since we know how much direct memory each of these types will consume, we can 
use that knowledge in our column size estimate and preallocate the correct 
amount of space.

**Note:** earlier, the RecordBatchSizer would return an estSize of 0 for empty 
FixedWidth value vectors. For example, an empty IntVector would return an 
estSize of 0. This was incorrect behavior, so I updated the RecordBatchSizer to 
return the correct size of a FixedWidth vector even when the value vector is 
empty. Please see the changes in the RecordBatchSizer and ValueVector templates 
for more details.
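
A rough sketch of the sizing math described above (the names are stand-ins, not the exact BatchHolder code):

```java
// Per-value width; with this fix it is non-zero even for an empty fixed-width vector.
int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
int batchSize = 65536;                                   // e.g. HashTable.BATCH_SIZE
long bytesToPreallocate = (long) columnSize * batchSize; // direct memory needed for BATCH_SIZE aggregate values
```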


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344206#comment-16344206
 ] 

ASF GitHub Bot commented on DRILL-5846:
---

Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164600674
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java ---
@@ -0,0 +1,186 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+import java.lang.reflect.Field;
+import java.nio.ByteOrder;
+
+import sun.misc.Unsafe;
+
+/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */
+@SuppressWarnings("restriction")
+public final class MemoryUtils {
+
+  // Ensure this is little-endian hardware
+  static {
+if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
+  throw new IllegalStateException("Drill only runs on LittleEndian systems.");
+}
+  }
+
+  /** Java's unsafe object */
+  private static Unsafe UNSAFE;
+
+  static {
+try {
+  Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
+  theUnsafe.setAccessible(true);
+  UNSAFE = (Unsafe) theUnsafe.get(null);
+}
+catch (Exception e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  /** Byte arrays offset */
+  private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
+
+  /** Number of bytes in a long */
+  public static final int LONG_NUM_BYTES  = 8;
+  /** Number of bytes in an int */
+  public static final int INT_NUM_BYTES   = 4;
+  /** Number of bytes in a short */
+  public static final int SHORT_NUM_BYTES = 2;
+
+//----------------------------------------------------------------------------
+// APIs
+//----------------------------------------------------------------------------
+
+  /**
+   * @param data source byte array
+   * @param index index within the byte array
+   * @return short value starting at data+index
+   */
+  public static short getShort(byte[] data, int index) {
--- End diff --

A sanity check that can be turned on and off is required; otherwise, 
there is no difference between Netty's `PlatformDependent` and `MemoryUtils`, and 
it would be better to have a unified way, independent of Java assertions, to 
turn bounds checking on or off.

My assumption is that Drill either reads from DrillBuf or writes to 
DrillBuf, as data is not supposed to be on the heap, so DrillBuf looks like a 
natural choice for this functionality.
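
One way to get a toggle that is independent of Java assertions, sketched with an assumed system property (the property name is not an existing Drill option):

```java
// Sketch only: "drill.memory.bounds.check" is an assumed property name.
private static final boolean BOUNDS_CHECKING = Boolean.getBoolean("drill.memory.bounds.check");

public static short getShort(byte[] data, int index) {
  if (BOUNDS_CHECKING && (index < 0 || index + SHORT_NUM_BYTES > data.length)) {
    throw new IndexOutOfBoundsException(
        String.format("getShort at index %d, array length %d", index, data.length));
  }
  return UNSAFE.getShort(data, BYTE_ARRAY_OFFSET + index);
}
```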




> Improve Parquet Reader Performance for Flat Data types 
> ---
>
> Key: DRILL-5846
> URL: https://issues.apache.org/jira/browse/DRILL-5846
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.11.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: performance
> Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance as several users reported that 
> Parquet parsing represents the lion share of the overall query execution. It 
> tracks Flat Data types only as Nested DTs might involve functional and 
> processing enhancements (e.g., a nested column can be seen as a Document; 
> user might want to perform operations scoped at the document level that is no 
> need to span all rows). Another JIRA will be created to handle the nested 
> columns use-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344203#comment-16344203
 ] 

ASF GitHub Bot commented on DRILL-6032:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164599681
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -215,6 +206,7 @@ public BatchHolder() {
   MaterializedField outputField = materializedValueFields[i];
   // Create a type-specific ValueVector for this value
   vector = TypeHelper.getNewVector(outputField, allocator);
+  int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --

I wonder what is wrong if estSize is 0 when there is no data. If there is 
no data for a column, why would we want to add its value width to the outgoing 
row width? 


> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSize to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6099) Drill does not push limit past project (flatten) if it cannot be pushed into scan

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344166#comment-16344166
 ] 

ASF GitHub Bot commented on DRILL-6099:
---

Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r164593749
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java
 ---
@@ -121,4 +132,50 @@ protected void doOnMatch(RelOptRuleCall call, 
DrillLimitRel limitRel, DrillScanR
 }
 
   }
+
+  private static boolean isProjectFlatten(RelNode project) {
--- End diff --

Done.


> Drill does not push limit past project (flatten) if it cannot be pushed into 
> scan
> -
>
> Key: DRILL-6099
> URL: https://issues.apache.org/jira/browse/DRILL-6099
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Gautam Kumar Parai
>Assignee: Gautam Kumar Parai
>Priority: Major
> Fix For: 1.13.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be useful to have pushdown occur past flatten(project). Here is an 
> example to illustrate the issue:
> {{explain plan without implementation for }}{{select name, 
> flatten(categories) as category from dfs.`/tmp/t_json_20` LIMIT 1;}}
> {{DrillScreenRel}}{{  }}
> {{  DrillLimitRel(fetch=[1])}}{{    }}
> {{    DrillProjectRel(name=[$0], category=[FLATTEN($1)])}}
> {{      DrillScanRel(table=[[dfs, /tmp/t_json_20]], groupscan=[EasyGroupScan 
> [selectionRoot=maprfs:/tmp/t_json_20, numFiles=1, columns=[`name`, 
> `categories`], files=[maprfs:///tmp/t_json_20/0_0_0.json]]])}}
> = 
> Content of 0_0_0.json
> =
> {
>   "name" : "Eric Goldberg, MD",
>   "categories" : [ "Doctors", "Health & Medical" ]
> } {
>   "name" : "Pine Cone Restaurant",
>   "categories" : [ "Restaurants" ]
> } {
>   "name" : "Deforest Family Restaurant",
>   "categories" : [ "American (Traditional)", "Restaurants" ]
> } {
>   "name" : "Culver's",
>   "categories" : [ "Food", "Ice Cream & Frozen Yogurt", "Fast Food", 
> "Restaurants" ]
> } {
>   "name" : "Chang Jiang Chinese Kitchen",
>   "categories" : [ "Chinese", "Restaurants" ]
> } 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6071) Limit batch size for flatten operator

2018-01-29 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6071:
-
Reviewer: Paul Rogers

> Limit batch size for flatten operator
> -
>
> Key: DRILL-6071
> URL: https://issues.apache.org/jira/browse/DRILL-6071
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.12.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.13.0
>
>
> Flatten currently uses an adaptive algorithm to control the outgoing batch 
> size. 
>  While processing the input batch, it adjusts the number of records in the 
> outgoing batch based on memory usage so far. Once memory usage exceeds the 
> configured limit for a batch, the algorithm becomes more proactive and 
> adjusts the limit halfway through and at the end of every batch. All this 
> periodic checking of memory usage is unnecessary overhead and impacts 
> performance. Also, we will know only after the fact.
> Instead, figure out how many rows should be in the outgoing batch from the 
> incoming batch.
>  The way to do that is to figure out the average row size of the outgoing 
> batch and, based on that, how many rows fit in a given amount of memory. 
> Value vectors provide the necessary information to figure this out.
> The row count in the output batch should be decided based on memory (with a 
> minimum of 1 and a maximum of 64K rows) and not hard-coded (to 4K) in the 
> code. The memory for the output batch should be a configurable system option.
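For illustration, a minimal sketch of the proposed calculation (the class and field names are hypothetical, not the actual flatten operator code):

{code:java}
/**
 * Hypothetical sketch: derive the outgoing row count from the average
 * outgoing row width and a configurable per-batch memory limit, clamped
 * to the [1, 64K] row range described above.
 */
public class OutgoingBatchSizer {
  private static final int MAX_ROWS = 64 * 1024;   // 64K row ceiling per batch

  public static int rowCountFor(long batchMemoryLimitBytes, int avgRowWidthBytes) {
    if (avgRowWidthBytes <= 0) {
      return MAX_ROWS;                             // no width information yet
    }
    long rows = batchMemoryLimitBytes / avgRowWidthBytes;
    return (int) Math.max(1, Math.min(rows, MAX_ROWS));
  }
}
{code}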



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6071) Limit batch size for flatten operator

2018-01-29 Thread Pritesh Maker (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pritesh Maker updated DRILL-6071:
-
Issue Type: Improvement  (was: Bug)

> Limit batch size for flatten operator
> -
>
> Key: DRILL-6071
> URL: https://issues.apache.org/jira/browse/DRILL-6071
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Flow
>Affects Versions: 1.12.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.13.0
>
>
> Flatten currently uses an adaptive algorithm to control the outgoing batch 
> size. 
>  While processing the input batch, it adjusts the number of records in the 
> outgoing batch based on memory usage so far. Once memory usage exceeds the 
> configured limit for a batch, the algorithm becomes more proactive and 
> adjusts the limit halfway through and at the end of every batch. All this 
> periodic checking of memory usage is unnecessary overhead and impacts 
> performance. Also, we will know only after the fact.
> Instead, figure out how many rows should be in the outgoing batch from the 
> incoming batch.
>  The way to do that is to figure out the average row size of the outgoing 
> batch and, based on that, how many rows fit in a given amount of memory. 
> Value vectors provide the necessary information to figure this out.
> The row count in the output batch should be decided based on memory (with a 
> minimum of 1 and a maximum of 64K rows) and not hard-coded (to 4K) in the 
> code. The memory for the output batch should be a configurable system option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6099) Drill does not push limit past project (flatten) if it cannot be pushed into scan

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344124#comment-16344124
 ] 

ASF GitHub Bot commented on DRILL-6099:
---

Github user chunhui-shi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r164585670
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java
 ---
@@ -121,4 +132,50 @@ protected void doOnMatch(RelOptRuleCall call, 
DrillLimitRel limitRel, DrillScanR
 }
 
   }
+
+  private static boolean isProjectFlatten(RelNode project) {
--- End diff --

I think it might be more general to name the functions schemaUnknown (for 
convert_fromJson) and rowCountUnknown (for flatten), so that if in the future 
we have other functions that fall into these two categories, we can easily add 
them. What do you think?
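For illustration only, a sketch of what the suggested naming could look like over a Calcite Project (this is not the code from the pull request; the function-name matching shown here is an assumption):

{code:java}
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.rex.RexNode;

final class ProjectPredicates {
  /** True if the project contains a function whose output schema is unknown
      at planning time (e.g. convert_fromJson). */
  static boolean schemaUnknown(Project project) {
    return containsFunction(project, "convert_fromjson");
  }

  /** True if the project contains a function whose output row count is
      unknown at planning time (e.g. flatten). */
  static boolean rowCountUnknown(Project project) {
    return containsFunction(project, "flatten");
  }

  private static boolean containsFunction(Project project, String name) {
    for (RexNode expr : project.getProjects()) {
      if (expr instanceof RexCall
          && ((RexCall) expr).getOperator().getName().equalsIgnoreCase(name)) {
        return true;
      }
    }
    return false;
  }
}
{code}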


> Drill does not push limit past project (flatten) if it cannot be pushed into 
> scan
> -
>
> Key: DRILL-6099
> URL: https://issues.apache.org/jira/browse/DRILL-6099
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Gautam Kumar Parai
>Assignee: Gautam Kumar Parai
>Priority: Major
> Fix For: 1.13.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be useful to have pushdown occur past flatten(project). Here is an 
> example to illustrate the issue:
> {{explain plan without implementation for }}{{select name, 
> flatten(categories) as category from dfs.`/tmp/t_json_20` LIMIT 1;}}
> {{DrillScreenRel}}{{  }}
> {{  DrillLimitRel(fetch=[1])}}{{    }}
> {{    DrillProjectRel(name=[$0], category=[FLATTEN($1)])}}
> {{      DrillScanRel(table=[[dfs, /tmp/t_json_20]], groupscan=[EasyGroupScan 
> [selectionRoot=maprfs:/tmp/t_json_20, numFiles=1, columns=[`name`, 
> `categories`], files=[maprfs:///tmp/t_json_20/0_0_0.json]]])}}
> = 
> Content of 0_0_0.json
> =
> {
>   "name" : "Eric Goldberg, MD",
>   "categories" : [ "Doctors", "Health & Medical" ]
> } {
>   "name" : "Pine Cone Restaurant",
>   "categories" : [ "Restaurants" ]
> } {
>   "name" : "Deforest Family Restaurant",
>   "categories" : [ "American (Traditional)", "Restaurants" ]
> } {
>   "name" : "Culver's",
>   "categories" : [ "Food", "Ice Cream & Frozen Yogurt", "Fast Food", 
> "Restaurants" ]
> } {
>   "name" : "Chang Jiang Chinese Kitchen",
>   "categories" : [ "Chinese", "Restaurants" ]
> } 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344039#comment-16344039
 ] 

ASF GitHub Bot commented on DRILL-6106:
---

Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1099
  
Travis CI failed due to a known issue, 
`TestSortImpl.testLargeBatch:502->runJumboBatchTest:475->runLargeSortTest:454 
Value of 1:0 expected:<0> but was:<1>`, which is unrelated to the 
SSLConfigClient.java changes.


> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.
> 
>
> Key: DRILL-6106
> URL: https://issues.apache.org/jira/browse/DRILL-6106
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Reudismam Rolim de Sousa
>Priority: Minor
>
> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343960#comment-16343960
 ] 

ASF GitHub Bot commented on DRILL-6106:
---

Github user reudismam commented on the issue:

https://github.com/apache/drill/pull/1099
  
I reverted the edit because the code does not pass Travis CI.


> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.
> 
>
> Key: DRILL-6106
> URL: https://issues.apache.org/jira/browse/DRILL-6106
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Reudismam Rolim de Sousa
>Priority: Minor
>
> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343946#comment-16343946
 ] 

ASF GitHub Bot commented on DRILL-6106:
---

Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1099
  
What was the reason to revert the change in SSLConfigClient.java?


> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.
> 
>
> Key: DRILL-6106
> URL: https://issues.apache.org/jira/browse/DRILL-6106
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Reudismam Rolim de Sousa
>Priority: Minor
>
> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6116) MAXDIR - Unsupported operation exception // partition explorer interface?

2018-01-29 Thread John Humphreys (JIRA)
John Humphreys created DRILL-6116:
-

 Summary: MAXDIR - Unsupported operation exception // partition 
explorer interface?
 Key: DRILL-6116
 URL: https://issues.apache.org/jira/browse/DRILL-6116
 Project: Apache Drill
  Issue Type: Bug
  Components: Client - JDBC
Affects Versions: 1.10.0
Reporter: John Humphreys


Using MAXDIR in Drill seems to cause errors when trying to apply other standard 
WHERE clause filters. Am I doing something wrong here?

**This Query Fails**

SELECT epoch_hour, entity, `1_average`, `1_95p`
 FROM dfs.`/path/drill-hour-final/`
 where
 dir0 = '2018' and dir1 = '01' and dir2 = '27'
 and dir3 = MAXDIR('dfs', concat('/path/drill-hour-final/',
 dir0, '/', dir1, '/', dir2, '/')) 
 and entity = 1784928

**With This Error**

> SYSTEM ERROR: UnsupportedOperationException: The partition explorer
> interface can only be used in functions that can be evaluated at
> planning time. Make sure that the planner.enable_constant_folding
> configuration option is set to true.
> 
> Fragment 1:7
> 
> [Error Id: d62c8227-db0e-4d31-9443-e570193d1010 on
> psclxcpdevsys13.nomura.com:31010]

**Wrapping it Makes it Work Without Error**

select * from (
 SELECT epoch_hour, entity, `1_average`, `1_95p`
 FROM dfs.`/path/drill-hour-final/`
 where
 dir0 = '2018' and dir1 = '01' and dir2 = '27'
 and dir3 = MAXDIR('dfs', concat('/path/drill-hour-final/',
 dir0, '/', dir1, '/', dir2, '/'))
 ) where entity = 1784928

epoch_hour entity 1_average 1_95p
 1517086800 1784928 5.3347222 8.82857142857143
 1517014800 1784928 25.944984717977217 40.37453087409783



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6115) SingleMergeExchange is not scaling up when many minor fragments are allocated for a query.

2018-01-29 Thread Hanumath Rao Maduri (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanumath Rao Maduri updated DRILL-6115:
---
Attachment: Enhancing Drill to multiplex ordered merge exchanges.docx

> SingleMergeExchange is not scaling up when many minor fragments are allocated 
> for a query.
> --
>
> Key: DRILL-6115
> URL: https://issues.apache.org/jira/browse/DRILL-6115
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.12.0
>Reporter: Hanumath Rao Maduri
>Assignee: Hanumath Rao Maduri
>Priority: Major
> Attachments: Enhancing Drill to multiplex ordered merge exchanges.docx
>
>
> SingleMergeExchange is created when a global order is required in the output. 
> The following query produces the SingleMergeExchange.
> {code:java}
> 0: jdbc:drill:zk=local> explain plan for select L_LINENUMBER from 
> dfs.`/drill/tables/lineitem` order by L_LINENUMBER;
> +--+--+
> | text | json |
> +--+--+
> | 00-00 Screen
> 00-01 Project(L_LINENUMBER=[$0])
> 00-02 SingleMergeExchange(sort0=[0])
> 01-01 SelectionVectorRemover
> 01-02 Sort(sort0=[$0], dir0=[ASC])
> 01-03 HashToRandomExchange(dist0=[[$0]])
> 02-01 Scan(table=[[dfs, /drill/tables/lineitem]], 
> groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec 
> [tableName=maprfs:///drill/tables/lineitem, condition=null], 
> columns=[`L_LINENUMBER`], maxwidth=15]])
> {code}
> On a 10-node cluster, if the table is huge, Drill can spawn many minor 
> fragments, which are all merged on a single node with one merge receiver. 
> Doing so creates a lot of memory pressure on the receiver node and also an 
> execution bottleneck. To address this issue, the merge receiver should be a 
> multiphase merge receiver. 
> Ideally, for a large cluster one can introduce tree merges so that merging 
> can be done in parallel. But as a first step it is better to use the existing 
> infrastructure for multiplexing operators to generate an OrderedMux, so that 
> all the minor fragments pertaining to one Drillbit are merged and the merged 
> data is sent across to the receiver operator.
> For example, on a 10-node cluster where each node processes 14 minor 
> fragments, the current version of the code merges 140 minor fragments at the 
> receiver. The proposed version has two levels of merges: a 14-way merge in 
> each Drillbit (done in parallel), and the 10 resulting minor fragments are 
> merged at the receiver node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6115) SingleMergeExchange is not scaling up when many minor fragments are allocated for a query.

2018-01-29 Thread Hanumath Rao Maduri (JIRA)
Hanumath Rao Maduri created DRILL-6115:
--

 Summary: SingleMergeExchange is not scaling up when many minor 
fragments are allocated for a query.
 Key: DRILL-6115
 URL: https://issues.apache.org/jira/browse/DRILL-6115
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 1.12.0
Reporter: Hanumath Rao Maduri
Assignee: Hanumath Rao Maduri
 Attachments: Enhancing Drill to multiplex ordered merge exchanges.docx

SingleMergeExchange is created when a global order is required in the output. 
The following query produces the SingleMergeExchange.
{code:java}
0: jdbc:drill:zk=local> explain plan for select L_LINENUMBER from 
dfs.`/drill/tables/lineitem` order by L_LINENUMBER;
+--+--+
| text | json |
+--+--+
| 00-00 Screen
00-01 Project(L_LINENUMBER=[$0])
00-02 SingleMergeExchange(sort0=[0])
01-01 SelectionVectorRemover
01-02 Sort(sort0=[$0], dir0=[ASC])
01-03 HashToRandomExchange(dist0=[[$0]])
02-01 Scan(table=[[dfs, /drill/tables/lineitem]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///drill/tables/lineitem, 
condition=null], columns=[`L_LINENUMBER`], maxwidth=15]])
{code}

On a 10-node cluster, if the table is huge, Drill can spawn many minor 
fragments, which are all merged on a single node with one merge receiver. Doing 
so creates a lot of memory pressure on the receiver node and also an execution 
bottleneck. To address this issue, the merge receiver should be a multiphase 
merge receiver. 

Ideally, for a large cluster one can introduce tree merges so that merging can 
be done in parallel. But as a first step it is better to use the existing 
infrastructure for multiplexing operators to generate an OrderedMux, so that all 
the minor fragments pertaining to one Drillbit are merged and the merged data is 
sent across to the receiver operator.

For example, consider a 10-node cluster where each node processes 14 minor 
fragments.

The current version of the code merges 140 minor fragments at the receiver. 
The proposed version has two levels of merges: a 14-way merge in each Drillbit, 
done in parallel, 
and the 10 resulting minor fragments are merged at the receiver node.
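To make the fan-in arithmetic concrete, a small illustrative calculation using the numbers above (not Drill code, just the arithmetic):

{code:java}
public class MergeFanIn {
  public static void main(String[] args) {
    int nodes = 10;
    int fragmentsPerNode = 14;

    // Current plan: the single receiver merges every minor fragment directly.
    int currentReceiverFanIn = nodes * fragmentsPerNode;   // 140 streams

    // Proposed plan: each Drillbit first merges its own fragments in parallel,
    // and the receiver only merges one pre-merged stream per node.
    int perNodeFanIn = fragmentsPerNode;                   // 14-way merge per Drillbit
    int receiverFanIn = nodes;                             // 10-way merge at the receiver

    System.out.println("current receiver fan-in : " + currentReceiverFanIn);
    System.out.println("proposed per-node fan-in: " + perNodeFanIn);
    System.out.println("proposed receiver fan-in: " + receiverFanIn);
  }
}
{code}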





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343842#comment-16343842
 ] 

ASF GitHub Bot commented on DRILL-5846:
---

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164532290
  
--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
@@ -703,7 +703,18 @@ protected void _setLong(int index, long value) {
 
   @Override
   public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int 
length) {
-udle.getBytes(index + offset, dst, dstIndex, length);
+final int BULK_COPY_THR = 1024;
--- End diff --

You are right, I'll put more information on this optimization:
- During code profiling, I noticed that getBytes() doesn't perform well 
when called with a small length (less than 1 KB)
- Its throughput improves as the length increases

Analysis:
- Java exposes two intrinsics for writing to direct memory: putByte and 
copyMemory
- The JVM is able to inline the memory access (no function call) for putByte 
- copyMemory is a bulk API that internally invokes libc memcpy 
(requires a function call)
- The rationale is that we are willing to incur a function call only if the 
associated processing outweighs the overhead; this is almost never the 
case for small memory accesses. 
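A minimal sketch of the threshold idea, illustrated here with heap arrays for simplicity (the constant mirrors the diff above, but the body is an assumption, not the exact patch):

{code:java}
public final class CopyThresholdSketch {
  private static final int BULK_COPY_THR = 1024;   // switch-over point, ~1 KB

  /** Copy length bytes from src to dst: byte by byte below the threshold
      (inlined by the JVM, no function call), bulk copy above it. */
  public static void copy(byte[] src, int srcIndex, byte[] dst, int dstIndex, int length) {
    if (length < BULK_COPY_THR) {
      for (int i = 0; i < length; i++) {
        dst[dstIndex + i] = src[srcIndex + i];     // per-byte store, inlined
      }
    } else {
      // Bulk copy: worth the function-call overhead only for larger lengths.
      System.arraycopy(src, srcIndex, dst, dstIndex, length);
    }
  }
}
{code}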


> Improve Parquet Reader Performance for Flat Data types 
> ---
>
> Key: DRILL-5846
> URL: https://issues.apache.org/jira/browse/DRILL-5846
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.11.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: performance
> Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance as several users reported that 
> Parquet parsing represents the lion's share of the overall query execution. It 
> tracks Flat Data types only, as Nested DTs might involve functional and 
> processing enhancements (e.g., a nested column can be seen as a Document; the 
> user might want to perform operations scoped at the document level that do 
> not need to span all rows). Another JIRA will be created to handle the nested 
> columns use-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343798#comment-16343798
 ] 

ASF GitHub Bot commented on DRILL-5846:
---

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164525752
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java ---
@@ -0,0 +1,186 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+import java.lang.reflect.Field;
+import java.nio.ByteOrder;
+
+import sun.misc.Unsafe;
+
+/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned 
platforms */
+@SuppressWarnings("restriction")
+public final class MemoryUtils {
+
+  // Ensure this is a little-endian hardware */
+  static {
+if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
+  throw new IllegalStateException("Drill only runs on LittleEndian 
systems.");
+}
+  }
+
+  /** Java's unsafe object */
+  private static Unsafe UNSAFE;
+
+  static {
+try {
+  Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
+  theUnsafe.setAccessible(true);
+  UNSAFE = (Unsafe) theUnsafe.get(null);
+}
+catch (Exception e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  /** Byte arrays offset */
+  private static final long BYTE_ARRAY_OFFSET = 
UNSAFE.arrayBaseOffset(byte[].class);
+
+  /** Number of bytes in a long */
+  public static final int LONG_NUM_BYTES  = 8;
+  /** Number of bytes in an int */
+  public static final int INT_NUM_BYTES   = 4;
+  /** Number of bytes in a short */
+  public static final int SHORT_NUM_BYTES = 2;
+

+  //--------------------------------------------------------------------------
+  // APIs
+  //--------------------------------------------------------------------------
+
+  /**
+   * @param data source byte array
+   * @param index index within the byte array
+   * @return short value starting at data+index
+   */
+  public static short getShort(byte[] data, int index) {
--- End diff --

Vlad,

Can you please clarify your point? Are you suggesting that every memory 
access has to go through DrillBuf even if the data is loaded in the Java heap 
and the developer is performing an optimization? If this is your goal, then I 
would think the danger is that the DrillBuf class will be cluttered and its 
contracts hard to understand:
- Notice that DrillBuf performs expensive sanity checks (boundary and 
reference count checks); this is due to Drill memory optimizations for 
minimizing the overall memory usage.
- When the code is processing already loaded data, these checks are no longer 
applicable, as this data is private.
- I could have easily moved the new APIs there, but I strongly felt this was 
inappropriate. 


> Improve Parquet Reader Performance for Flat Data types 
> ---
>
> Key: DRILL-5846
> URL: https://issues.apache.org/jira/browse/DRILL-5846
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.11.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: performance
> Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance as several users reported that 
> Parquet parsing represents the lion's share of the overall query execution. It 
> tracks Flat Data types only, as Nested DTs might involve functional and 
> processing enhancements (e.g., a nested column can be seen as a Document; the 
> user might want to perform operations scoped at the document level that do 
> not need to span all rows). Another JIRA 

[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343779#comment-16343779
 ] 

ASF GitHub Bot commented on DRILL-5846:
---

Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164520611
  
--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
@@ -703,7 +703,18 @@ protected void _setLong(int index, long value) {
 
   @Override
   public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int 
length) {
-udle.getBytes(index + offset, dst, dstIndex, length);
+final int BULK_COPY_THR = 1024;
--- End diff --

The optimization is not obvious (and is counterintuitive) without more 
details.


> Improve Parquet Reader Performance for Flat Data types 
> ---
>
> Key: DRILL-5846
> URL: https://issues.apache.org/jira/browse/DRILL-5846
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.11.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: performance
> Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance as several users reported that 
> Parquet parsing represents the lion's share of the overall query execution. It 
> tracks Flat Data types only, as Nested DTs might involve functional and 
> processing enhancements (e.g., a nested column can be seen as a Document; the 
> user might want to perform operations scoped at the document level that do 
> not need to span all rows). Another JIRA will be created to handle the nested 
> columns use-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343770#comment-16343770
 ] 

ASF GitHub Bot commented on DRILL-5846:
---

Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164519147
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java ---
@@ -0,0 +1,186 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+import java.lang.reflect.Field;
+import java.nio.ByteOrder;
+
+import sun.misc.Unsafe;
+
+/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned 
platforms */
+@SuppressWarnings("restriction")
+public final class MemoryUtils {
+
+  // Ensure this is a little-endian hardware */
+  static {
+if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
+  throw new IllegalStateException("Drill only runs on LittleEndian 
systems.");
+}
+  }
+
+  /** Java's unsafe object */
+  private static Unsafe UNSAFE;
+
+  static {
+try {
+  Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
+  theUnsafe.setAccessible(true);
+  UNSAFE = (Unsafe) theUnsafe.get(null);
+}
+catch (Exception e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  /** Byte arrays offset */
+  private static final long BYTE_ARRAY_OFFSET = 
UNSAFE.arrayBaseOffset(byte[].class);
+
+  /** Number of bytes in a long */
+  public static final int LONG_NUM_BYTES  = 8;
+  /** Number of bytes in an int */
+  public static final int INT_NUM_BYTES   = 4;
+  /** Number of bytes in a short */
+  public static final int SHORT_NUM_BYTES = 2;
+

+  //--------------------------------------------------------------------------
+  // APIs
+  //--------------------------------------------------------------------------
+
+  /**
+   * @param data source byte array
+   * @param index index within the byte array
+   * @return short value starting at data+index
+   */
+  public static short getShort(byte[] data, int index) {
--- End diff --

@sachouche It is not clear why using MemoryUtils would make a difference for 
HotSpot in terms of cache use or the 8-bit instruction set; HotSpot is used for 
both MemoryUtils and DrillBuf.


> Improve Parquet Reader Performance for Flat Data types 
> ---
>
> Key: DRILL-5846
> URL: https://issues.apache.org/jira/browse/DRILL-5846
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.11.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: performance
> Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance as several users reported that 
> Parquet parsing represents the lion's share of the overall query execution. It 
> tracks Flat Data types only, as Nested DTs might involve functional and 
> processing enhancements (e.g., a nested column can be seen as a Document; the 
> user might want to perform operations scoped at the document level that do 
> not need to span all rows). Another JIRA will be created to handle the nested 
> columns use-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg

2018-01-29 Thread Timothy Farkas (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Farkas updated DRILL-6032:
--
Reviewer: Boaz Ben-Zvi

> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
> Fix For: 1.13.0
>
>
> We need to use the RecordBatchSizer to estimate the size of columns in the 
> Partition batches created by HashAgg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343704#comment-16343704
 ] 

ASF GitHub Bot commented on DRILL-5846:
---

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164509658
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java ---
@@ -0,0 +1,186 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+import java.lang.reflect.Field;
+import java.nio.ByteOrder;
+
+import sun.misc.Unsafe;
+
+/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned 
platforms */
+@SuppressWarnings("restriction")
+public final class MemoryUtils {
+
+  // Ensure this is a little-endian hardware */
+  static {
+if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
+  throw new IllegalStateException("Drill only runs on LittleEndian 
systems.");
+}
+  }
+
+  /** Java's unsafe object */
+  private static Unsafe UNSAFE;
+
+  static {
+try {
+  Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
+  theUnsafe.setAccessible(true);
+  UNSAFE = (Unsafe) theUnsafe.get(null);
+}
+catch (Exception e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  /** Byte arrays offset */
+  private static final long BYTE_ARRAY_OFFSET = 
UNSAFE.arrayBaseOffset(byte[].class);
+
+  /** Number of bytes in a long */
+  public static final int LONG_NUM_BYTES  = 8;
+  /** Number of bytes in an int */
+  public static final int INT_NUM_BYTES   = 4;
+  /** Number of bytes in a short */
+  public static final int SHORT_NUM_BYTES = 2;
+

+  //--------------------------------------------------------------------------
+  // APIs
+  //--------------------------------------------------------------------------
+
+  /**
+   * @param data source byte array
+   * @param index index within the byte array
+   * @return short value starting at data+index
+   */
+  public static short getShort(byte[] data, int index) {
--- End diff --

Vlad,

- The reader has loaded data from direct memory into the CPU level-1 cache 
through DrillBuf
- The data is stored (4 KB at a time) in a Java byte[]
- The only reason for using MemoryUtils is to perform 64-bit processing (I 
have previously indicated that many other processing-intensive applications 
such as Snappy and Hadoop Comparison use the same pattern)
- I only added this logic after noticing that HotSpot was generating 8-bit 
instructions 

Again, the intent is not to access data through DrillBuf (the data has already 
been loaded into the CPU cache); this is mainly an optimization step.  
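For context, a hedged sketch of the kind of word-at-a-time accessor being discussed, mirroring the getShort signature in the diff. The bodies are an assumption, not necessarily the exact implementation; they rely on the UNSAFE and BYTE_ARRAY_OFFSET fields defined in the class above and are valid only on little-endian, unaligned-access hardware.

{code:java}
/** Assumed shape of the 64-bit accessors over a heap byte[]. */
public static long getLong(byte[] data, int index) {
  // Reads 8 bytes in one access instead of assembling them byte by byte.
  return UNSAFE.getLong(data, BYTE_ARRAY_OFFSET + index);
}

public static void putLong(byte[] data, int index, long value) {
  UNSAFE.putLong(data, BYTE_ARRAY_OFFSET + index, value);
}
{code}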


> Improve Parquet Reader Performance for Flat Data types 
> ---
>
> Key: DRILL-5846
> URL: https://issues.apache.org/jira/browse/DRILL-5846
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.11.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: performance
> Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance as several users reported that 
> Parquet parsing represents the lion share of the overall query execution. It 
> tracks Flat Data types only as Nested DTs might involve functional and 
> processing enhancements (e.g., a nested column can be seen as a Document; 
> user might want to perform operations scoped at the document level that is no 
> need to span all rows). Another JIRA will be created to handle the nested 
> columns use-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343688#comment-16343688
 ] 

ASF GitHub Bot commented on DRILL-5846:
---

Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164506523
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java ---
@@ -0,0 +1,186 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+import java.lang.reflect.Field;
+import java.nio.ByteOrder;
+
+import sun.misc.Unsafe;
+
+/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned 
platforms */
+@SuppressWarnings("restriction")
+public final class MemoryUtils {
+
+  // Ensure this is a little-endian hardware */
+  static {
+if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
+  throw new IllegalStateException("Drill only runs on LittleEndian 
systems.");
+}
+  }
+
+  /** Java's unsafe object */
+  private static Unsafe UNSAFE;
+
+  static {
+try {
+  Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
+  theUnsafe.setAccessible(true);
+  UNSAFE = (Unsafe) theUnsafe.get(null);
+}
+catch (Exception e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  /** Byte arrays offset */
+  private static final long BYTE_ARRAY_OFFSET = 
UNSAFE.arrayBaseOffset(byte[].class);
+
+  /** Number of bytes in a long */
+  public static final int LONG_NUM_BYTES  = 8;
+  /** Number of bytes in an int */
+  public static final int INT_NUM_BYTES   = 4;
+  /** Number of bytes in a short */
+  public static final int SHORT_NUM_BYTES = 2;
+

+  //--------------------------------------------------------------------------
+  // APIs
+  //--------------------------------------------------------------------------
+
+  /**
+   * @param data source byte array
+   * @param index index within the byte array
+   * @return short value starting at data+index
+   */
+  public static short getShort(byte[] data, int index) {
--- End diff --

Unless there is a significant performance benefit in introducing two ways 
to access direct buffers, Drill should have a unified way to work with 
off-heap memory, preferably using DrillBuf. With bounds checking disabled by 
default, I don't see why MemoryUtils would outperform DrillBuf.


> Improve Parquet Reader Performance for Flat Data types 
> ---
>
> Key: DRILL-5846
> URL: https://issues.apache.org/jira/browse/DRILL-5846
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.11.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>  Labels: performance
> Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to 
> further improve the Parquet Reader performance as several users reported that 
> Parquet parsing represents the lion's share of the overall query execution. It 
> tracks Flat Data types only, as Nested DTs might involve functional and 
> processing enhancements (e.g., a nested column can be seen as a Document; the 
> user might want to perform operations scoped at the document level that do 
> not need to span all rows). Another JIRA will be created to handle the nested 
> columns use-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343662#comment-16343662
 ] 

ASF GitHub Bot commented on DRILL-6106:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1099
  
@reudismam you can leave `valueOf` for Double and Float for consistency. 
There are two other comments @vrozov has left. Could you please address them 
as well? They are not directly related to your changes but will make the Drill 
code better. Also, please squash the commits into one.
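For reference, a small example of the pattern this JIRA targets; the Integer cache behaviour below is standard Java, while keeping valueOf for Double and Float is, as noted, only a consistency choice:

{code:java}
public class ValueOfExample {
  public static void main(String[] args) {
    // valueOf reuses cached instances for small values (Integer caches -128..127),
    // while the constructor always allocates a new object.
    Integer cached = Integer.valueOf(42);                 // preferred
    Integer fresh = new Integer(42);                      // always a new allocation

    System.out.println(cached == Integer.valueOf(42));    // true: same cached instance
    System.out.println(fresh == new Integer(42));         // false: distinct objects

    // Double and Float have no value cache, so valueOf there is purely stylistic.
    Double d = Double.valueOf(1.5);
    System.out.println(d);
  }
}
{code}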


> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.
> 
>
> Key: DRILL-6106
> URL: https://issues.apache.org/jira/browse/DRILL-6106
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Reudismam Rolim de Sousa
>Priority: Minor
>
> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343652#comment-16343652
 ] 

ASF GitHub Bot commented on DRILL-6106:
---

Github user reudismam commented on the issue:

https://github.com/apache/drill/pull/1099
  
Conflicts resolved. Do you want me to revert the edits to Double and Float? I 
think keeping them is fine for consistency. 


> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.
> 
>
> Key: DRILL-6106
> URL: https://issues.apache.org/jira/browse/DRILL-6106
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Reudismam Rolim de Sousa
>Priority: Minor
>
> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343613#comment-16343613
 ] 

ASF GitHub Bot commented on DRILL-6080:
---

Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1090
  
+1


> Sort incorrectly limits batch size to 65535 records rather than 65536
> -
>
> Key: DRILL-6080
> URL: https://issues.apache.org/jira/browse/DRILL-6080
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.13.0
>
>
> Drill places an upper limit of 64K on the number of rows in a batch. That is 
> 65,536 decimal. When we index records, the indexes run from 0 to 64K-1, or 0 
> to 65,535.
> The sort code incorrectly uses {{Character.MAX_VALUE}} (65,535) as the 
> maximum row count. So, if an incoming batch uses the full 64K size, sort ends 
> up splitting batches unnecessarily.
> The fix is to instead use the correct constant `ValueVector.MAX_ROW_COUNT`.
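A small illustration of the off-by-one; the value of ValueVector.MAX_ROW_COUNT is assumed here to be 1 << 16, per the 64K limit described above:

{code:java}
public class RowCountLimits {
  public static void main(String[] args) {
    int wrongLimit = Character.MAX_VALUE;   // 65,535: the largest char value
    int correctLimit = 1 << 16;             // 65,536: assumed ValueVector.MAX_ROW_COUNT

    System.out.println("Character.MAX_VALUE = " + wrongLimit);
    System.out.println("64K row limit       = " + correctLimit);

    // A full 64K-row incoming batch exceeds the wrong limit by exactly one row,
    // so the sort splits it unnecessarily.
    int incomingRows = 1 << 16;
    System.out.println("splits with wrong limit?   " + (incomingRows > wrongLimit));   // true
    System.out.println("splits with correct limit? " + (incomingRows > correctLimit)); // false
  }
}
{code}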



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343354#comment-16343354
 ] 

ASF GitHub Bot commented on DRILL-6106:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1099
  
@reudismam could you please address the code review comments so the changes can 
be pushed into master? Also, please resolve the merge conflict.


> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.
> 
>
> Key: DRILL-6106
> URL: https://issues.apache.org/jira/browse/DRILL-6106
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Reudismam Rolim de Sousa
>Priority: Minor
>
> Use valueOf method instead of constructor since valueOf has a higher 
> performance by caching frequently requested values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6090) While connecting to drill-bits using JDBC Driver through Zookeeper, a lot of "Curator-Framework-0" threads are created if connection to drill-bit is not successful(no d

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343353#comment-16343353
 ] 

ASF GitHub Bot commented on DRILL-6090:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1094
  
@milindt could you please address the code review comments so the changes can 
be pushed into master?


> While connecting to drill-bits using JDBC Driver through Zookeeper, a lot of 
> "Curator-Framework-0" threads are created if connection to drill-bit is not 
> successful(no drill-bits are up/reachable)
> ---
>
> Key: DRILL-6090
> URL: https://issues.apache.org/jira/browse/DRILL-6090
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - JDBC
>Affects Versions: 1.12.0
> Environment: Centos 65, Java 7, Drill JDBC 1.12.0
>Reporter: Milind Takawale
>Assignee: Milind Takawale
>Priority: Major
> Fix For: 1.13.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I am using Drill JDBC driver 1.12.0 to connect to MapR-DB. I am finding the 
> available drill-bits using Zookeeper. When drill-bits are not up or not 
> reachable, the connection fails with the exception: "Failure in connecting to 
> Drill: oadd.org.apache.drill.exec.rpc.RpcException: Failure setting up ZK for 
> client", which is expected, but the number of threads created by 
> ZKClusterCoordinator just keeps on increasing.
> Steps to reproduce the issue
>  # Set up a connection with a drill-bit using Apache Drill JDBC driver 1.12.0 
> through Zookeeper hosts (port 5181)
>  # Now stop the drill-bit services or block the drill-bit IPs using iptables 
> rules
>  # Truncate the catalina logs
>  # Try to connect to the drill-bit / hit a code path that requires a connection 
> to drill-bits.
>  # Take a thread dump using kill -QUIT 
>  # grep -c "Curator-Framework-0" catalina.out
> Observe that the Curator framework threads just keep on accumulating
> RCA:
>  # ZKClusterCoordinator creates Curator threads in its constructor
>  # ZKClusterCoordinator is instantiated by DrillClient.connect
>  # DrillClient.connect is called in the DrillConnectionImpl constructor
> Fix:
> Call DrillConnectionImpl.cleanup() from all the catch blocks in the 
> DrillConnectionImpl constructor.
>  
>  
>  
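A hypothetical illustration of the described fix pattern (the class and field names below are stand-ins, not the actual DrillConnectionImpl code): every catch block in the connection constructor calls cleanup before rethrowing, so the coordinator's Curator threads are released.

{code:java}
/** Stand-in for a connection whose constructor may fail after it has already
    created a coordinator (the source of the Curator-Framework-0 threads). */
public final class ConnectionSketch implements AutoCloseable {
  private final AutoCloseable coordinator;   // stands in for ZKClusterCoordinator

  public ConnectionSketch(AutoCloseable coordinator, Runnable connectAction) throws Exception {
    this.coordinator = coordinator;
    try {
      connectAction.run();                   // stands in for DrillClient.connect(...)
    } catch (RuntimeException e) {
      close();                               // the fix: clean up before rethrowing
      throw e;
    }
  }

  @Override
  public void close() throws Exception {
    coordinator.close();                     // releases the Curator framework threads
  }
}
{code}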



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)