[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344474#comment-16344474 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164634551 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- Why not just use TypeHelper.getSize(outputField.getType()) ? It seems like you don't need RecordBatchSizer for this. estSize in RecordBatchSizer is taking actual memory allocation into account i.e. it includes the over head of unused vector space and we need that value as well for different reasons. I also think it is doing right thing by returning zero when there are no rows and no memory allocated. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344403#comment-16344403 ] ASF GitHub Bot commented on DRILL-6032: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164624503 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- Agree that the "sizer" should return the size of a fixed-width column even if there are 0 rows. Thanks for catching and fixing this. Though, I will point out that if you get a 0-length batch, just throw it away and read another; no useful purpose is served by doing anything with an empty batch (except to pass it along for schema purposes if this is the first batch: the so-called "fast schema" that is supposed to work, but seems to not be fully implemented...) I've found that Drill behaves poorly and erratically when trying to allocate zero-sized vectors. Best not to do that. If your batch size is 0, don't allocate vectors would be my general advice to avoid lots of grief. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344399#comment-16344399 ] ASF GitHub Bot commented on DRILL-6032: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164623893 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -129,11 +143,16 @@ public ColumnSize(ValueVector v, String prefix) { // No standard size for Union type dataSize = v.getPayloadByteCount(valueCount); break; + case GENERIC_OBJECT: +// We cannot provide a size for Generic Objects --- End diff -- The `GENERIC_OBJECT` type is used any time we do a system table query: system tables are represented as Java objects. There is an open question about certain aggregate functions. (See a note sent a week or so ago.) These aggregate functions use an `ObjectHolder` as their `\@Workspace`. @Ben-Zvi and I discussed whether such aggregates are spillable. This may be an unresolved issue. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344397#comment-16344397 ] ASF GitHub Bot commented on DRILL-6032: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164623527 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -232,9 +251,8 @@ else if (width > 0) { } } - public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; // 16 MiB - private List columnSizes = new ArrayList<>(); + private MapcolumnSizeMap = CaseInsensitiveMap.newHashMap(); --- End diff -- Drill is case insensitive internally. The case insensitive map is correct. Thanks for catching this @ilooner! Unfortunately, record batches have no name space: they are just a collection of vectors. So, we could end up with columns called both "c" and "C". This situation will case the column size map to end up with one entry for both columns, with the last writer winning. The best solution would be to enforce name space uniqueness when creating vectors. The new "result set loader" does this, but I suspect other readers might not -- depending on the particular way that they create their vectors. Still, creating names that differ only in case is a bug and any code doing that should be fixed. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5864) Selecting a non-existing field from a MapR-DB JSON table fails with NPE
[ https://issues.apache.org/jira/browse/DRILL-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344358#comment-16344358 ] ASF GitHub Bot commented on DRILL-5864: --- Github user HanumathRao closed the pull request at: https://github.com/apache/drill/pull/1007 > Selecting a non-existing field from a MapR-DB JSON table fails with NPE > --- > > Key: DRILL-5864 > URL: https://issues.apache.org/jira/browse/DRILL-5864 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators, Storage - MapRDB >Affects Versions: 1.12.0 >Reporter: Abhishek Girish >Assignee: Hanumath Rao Maduri >Priority: Major > Labels: ready-to-commit > Fix For: 1.12.0 > > Attachments: OrderByNPE.log, OrderByNPE2.log > > > Query 1 > {code} > > select C_FIRST_NAME,C_BIRTH_COUNTRY,C_BIRTH_YEAR,C_BIRTH_MONTH,C_BIRTH_DAY > > from customer ORDER BY C_BIRTH_COUNTRY ASC, C_FIRST_NAME ASC LIMIT 10; > Error: SYSTEM ERROR: NullPointerException > (java.lang.NullPointerException) null > org.apache.drill.exec.record.SchemaUtil.coerceContainer():176 > > org.apache.drill.exec.physical.impl.xsort.managed.BufferedBatches.convertBatch():124 > org.apache.drill.exec.physical.impl.xsort.managed.BufferedBatches.add():90 > org.apache.drill.exec.physical.impl.xsort.managed.SortImpl.addBatch():265 > > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch():421 > > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load():357 > > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext():302 > org.apache.drill.exec.record.AbstractRecordBatch.next():164 > org.apache.drill.exec.record.AbstractRecordBatch.next():119 > org.apache.drill.exec.record.AbstractRecordBatch.next():109 > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 > > org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93 > org.apache.drill.exec.record.AbstractRecordBatch.next():164 > org.apache.drill.exec.record.AbstractRecordBatch.next():119 > org.apache.drill.exec.record.AbstractRecordBatch.next():109 > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 > org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115 > org.apache.drill.exec.record.AbstractRecordBatch.next():164 > org.apache.drill.exec.record.AbstractRecordBatch.next():119 > org.apache.drill.exec.record.AbstractRecordBatch.next():109 > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 > org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115 > org.apache.drill.exec.record.AbstractRecordBatch.next():164 > org.apache.drill.exec.record.AbstractRecordBatch.next():119 > org.apache.drill.exec.record.AbstractRecordBatch.next():109 > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 > > org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93 > org.apache.drill.exec.record.AbstractRecordBatch.next():164 > org.apache.drill.exec.record.AbstractRecordBatch.next():119 > org.apache.drill.exec.record.AbstractRecordBatch.next():109 > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 > > org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():134 > org.apache.drill.exec.record.AbstractRecordBatch.next():164 > org.apache.drill.exec.physical.impl.BaseRootExec.next():105 > > org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81 > org.apache.drill.exec.physical.impl.BaseRootExec.next():95 > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():234 > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():227 > java.security.AccessController.doPrivileged():-2 > javax.security.auth.Subject.doAs():422 > org.apache.hadoop.security.UserGroupInformation.doAs():1595 > org.apache.drill.exec.work.fragment.FragmentExecutor.run():227 > org.apache.drill.common.SelfCleaningRunnable.run():38 > java.util.concurrent.ThreadPoolExecutor.runWorker():1149 > java.util.concurrent.ThreadPoolExecutor$Worker.run():624 > java.lang.Thread.run():748 (state=,code=0) > {code} > Plan > {code} > 00-00Screen > 00-01 Project(C_FIRST_NAME=[$0], C_BIRTH_COUNTRY=[$1], > C_BIRTH_YEAR=[$2], C_BIRTH_MONTH=[$3], C_BIRTH_DAY=[$4]) > 00-02SelectionVectorRemover > 00-03 Limit(fetch=[10]) > 00-04Limit(fetch=[10]) > 00-05 SelectionVectorRemover > 00-06Sort(sort0=[$1], sort1=[$0], dir0=[ASC], dir1=[ASC]) > 00-07
[jira] [Updated] (DRILL-6099) Drill does not push limit past project (flatten) if it cannot be pushed into scan
[ https://issues.apache.org/jira/browse/DRILL-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gautam Kumar Parai updated DRILL-6099: -- Labels: ready-to-commit (was: ) [~ben-zvi] please consider this PR (https://github.com/apache/drill/pull/1096) for the batch commit. > Drill does not push limit past project (flatten) if it cannot be pushed into > scan > - > > Key: DRILL-6099 > URL: https://issues.apache.org/jira/browse/DRILL-6099 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: Gautam Kumar Parai >Assignee: Gautam Kumar Parai >Priority: Major > Labels: ready-to-commit > Fix For: 1.13.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > It would be useful to have pushdown occur past flatten(project). Here is an > example to illustrate the issue: > {{explain plan without implementation for }}{{select name, > flatten(categories) as category from dfs.`/tmp/t_json_20` LIMIT 1;}} > {{DrillScreenRel}}{{ }} > {{ DrillLimitRel(fetch=[1])}}{{ }} > {{ DrillProjectRel(name=[$0], category=[FLATTEN($1)])}} > {{ DrillScanRel(table=[[dfs, /tmp/t_json_20]], groupscan=[EasyGroupScan > [selectionRoot=maprfs:/tmp/t_json_20, numFiles=1, columns=[`name`, > `categories`], files=[maprfs:///tmp/t_json_20/0_0_0.json]]])}} > = > Content of 0_0_0.json > = > { > "name" : "Eric Goldberg, MD", > "categories" : [ "Doctors", "Health & Medical" ] > } { > "name" : "Pine Cone Restaurant", > "categories" : [ "Restaurants" ] > } { > "name" : "Deforest Family Restaurant", > "categories" : [ "American (Traditional)", "Restaurants" ] > } { > "name" : "Culver's", > "categories" : [ "Food", "Ice Cream & Frozen Yogurt", "Fast Food", > "Restaurants" ] > } { > "name" : "Chang Jiang Chinese Kitchen", > "categories" : [ "Chinese", "Restaurants" ] > } > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6117) Query End time is not being set
Kunal Khatua created DRILL-6117: --- Summary: Query End time is not being set Key: DRILL-6117 URL: https://issues.apache.org/jira/browse/DRILL-6117 Project: Apache Drill Issue Type: Bug Affects Versions: 1.13.0 Reporter: Kunal Khatua Assignee: Arina Ielchiieva Fix For: 1.13.0 Looks like a commit related to Query State change (DRILL-5963) introduced a bug where the *End* time is is not being set, resulting in the UI using the current clock time as the profile's end time and the duration is incorrectly calculated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344347#comment-16344347 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164615997 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggrSpill.java --- @@ -53,59 +51,54 @@ @Rule public final BaseDirTestWatcher dirTestWatcher = new BaseDirTestWatcher(); -/** - * A template for Hash Aggr spilling tests - * - * @throws Exception - */ -private void testSpill(long maxMem, long numPartitions, long minBatches, int maxParallel, boolean fallback ,boolean predict, - String sql, long expectedRows, int cycle, int fromPart, int toPart) throws Exception { -LogFixture.LogFixtureBuilder logBuilder = LogFixture.builder() --- End diff -- I can add this back. In general I don't like leaving unused code in the codebase though. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344343#comment-16344343 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164615743 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -232,9 +251,8 @@ else if (width > 0) { } } - public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; // 16 MiB - private List columnSizes = new ArrayList<>(); + private MapcolumnSizeMap = CaseInsensitiveMap.newHashMap(); --- End diff -- I will follow up on this on the mailing list. I was under the impression that Drill is case-**in**sensitive and did not support case sensitivity of columns in an external store like HBase or MapRDB. If Drill is case-sensitive then we have a bigger issue because some of the functional tests assume that it is case-**in**sensitive. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344338#comment-16344338 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164615077 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -129,11 +143,16 @@ public ColumnSize(ValueVector v, String prefix) { // No standard size for Union type dataSize = v.getPayloadByteCount(valueCount); break; + case GENERIC_OBJECT: +// We cannot provide a size for Generic Objects --- End diff -- Execution gets here in some of the HashAgg functional tests. Probably in the case where varchars are aggregated, since as you explained to me varchars are stored in object vectors on heap. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344333#comment-16344333 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164614812 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java --- @@ -805,7 +803,12 @@ private IntVector allocMetadataVector(int size, int initialValue) { } @Override - public void setMaxVarcharSize(int size) { maxVarcharSize = size; } + public void setKeySizes(MapkeySizes) { +Preconditions.checkNotNull(keySizes); + +this.keySizes = CaseInsensitiveMap.newHashMap(); --- End diff -- It helps to avoid bugs. It is assumed that the keySizes map will never change once it is set, copying the map helps enforce that constraint. If we don't copy the map and a user calls this method and passes a keySizes map and then later updates the keySizes map or reuses it, errors would occur. Some languages like Scala have immutable flavors of data structures for this reason. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344325#comment-16344325 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164613714 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -956,21 +925,8 @@ private void spillAPartition(int part) { this.htables[part].outputKeys(currOutBatchIndex, this.outContainer, outStartIdxHolder.value, outNumRecordsHolder.value, numPendingOutput); // set the value count for outgoing batch value vectors - /* int i = 0; */ for (VectorWrapper v : outgoing) { v.getValueVector().getMutator().setValueCount(numOutputRecords); -/* --- End diff -- If this logic is really critical for debugging we should have utility methods to help with it so that others can benefit from the code. Maybe we can brainstorm about what methods would be helpful for debugging and start building a utility library. In the meantime if you need this code to debug you can easily look into the history and copy and paste it from there. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344317#comment-16344317 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164612920 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -516,43 +501,48 @@ private void initializeSetup(RecordBatch newIncoming) throws SchemaChangeExcepti * @param incoming */ private void updateEstMaxBatchSize(RecordBatch incoming) { -if ( estMaxBatchSize > 0 ) { return; } // no handling of a schema (or varchar) change // Use the sizer to get the input row width and the length of the longest varchar column -RecordBatchSizer sizer = new RecordBatchSizer(incoming); -logger.trace("Incoming sizer: {}",sizer); -// An empty batch only has the schema, can not tell actual length of varchars -// else use the actual varchars length, each capped at 50 (to match the space allocation) -long estInputRowWidth = sizer.rowCount() == 0 ? sizer.stdRowWidth() : sizer.netRowWidthCap50(); +final RecordBatchSizer incomingColumnSizes = new RecordBatchSizer(incoming); +final MapcolumnSizeMap = incomingColumnSizes.getColumnSizeMap(); +keySizes = CaseInsensitiveMap.newHashMap(); -// Get approx max (varchar) column width to get better memory allocation -maxColumnWidth = Math.max(sizer.maxAvgColumnSize(), VARIABLE_MIN_WIDTH_VALUE_SIZE); -maxColumnWidth = Math.min(maxColumnWidth, VARIABLE_MAX_WIDTH_VALUE_SIZE); +logger.trace("Incoming sizer: {}",incomingColumnSizes); -// // Calculate the estimated max (internal) batch (i.e. Keys batch + Values batch) size // (which is used to decide when to spill) // Also calculate the values batch size (used as a reserve to overcome an OOM) -// -Iterator outgoingIter = outContainer.iterator(); -int fieldId = 0; -while (outgoingIter.hasNext()) { - ValueVector vv = outgoingIter.next().getValueVector(); - MaterializedField mr = vv.getField(); - int fieldSize = vv instanceof VariableWidthVector ? maxColumnWidth : - TypeHelper.getSize(mr.getType()); - estRowWidth += fieldSize; - estOutputRowWidth += fieldSize; - if ( fieldId < numGroupByOutFields ) { fieldId++; } - else { estValuesRowWidth += fieldSize; } + +for (int columnIndex = 0; columnIndex < numGroupByOutFields; columnIndex++) { + final VectorWrapper vectorWrapper = outContainer.getValueVector(columnIndex); + final String columnName = vectorWrapper.getField().getName(); + final int columnSize = columnSizeMap.get(columnName).estSize; + keySizes.put(columnName, columnSize); + estOutputRowWidth += columnSize; } + +long estValuesRowWidth = 0; + +for (int columnIndex = numGroupByOutFields; columnIndex < outContainer.getNumberOfColumns(); columnIndex++) { + VectorWrapper vectorWrapper = outContainer.getValueVector(columnIndex); + RecordBatchSizer.ColumnSize columnSize = new RecordBatchSizer.ColumnSize(vectorWrapper.getValueVector()); + estOutputRowWidth += columnSize.estSize; + estValuesRowWidth += columnSize.estSize; +} + // multiply by the max number of rows in a batch to get the final estimated max size -estMaxBatchSize = Math.max(estRowWidth, estInputRowWidth) * MAX_BATCH_SIZE; +estMaxBatchSize = Math.max(estOutputRowWidth, 1) * MAX_BATCH_SIZE; // (When there are no aggr functions, use '1' as later code relies on this size being non-zero) +// Note: estValuesBatchSize cannot be 0 because a zero value for estValuesBatchSize will cause reserveValueBatchMemory to have a value of 0. And the meaning +// of a reserveValueBatchMemory value of 0 has multiple meanings in different contexts. So estValuesBatchSize has an enforced minimum value of 1, without this +// estValuesBatchsize could have a value of 0 in the case were there are no value columns and all the columns are key columns. estValuesBatchSize = Math.max(estValuesRowWidth, 1) * MAX_BATCH_SIZE; estOutgoingAllocSize = estValuesBatchSize; // initially assume same size -logger.trace("{} phase. Estimated internal row width: {} Values row width: {} batch size: {} memory limit: {} max column width: {}", - isTwoPhase?(is2ndPhase?"2nd":"1st"):"Single",estRowWidth,estValuesRowWidth,estMaxBatchSize,allocator.getLimit(),maxColumnWidth); +if (logger.isTraceEnabled()) {
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344309#comment-16344309 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164612280 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -84,13 +85,6 @@ public abstract class HashAggTemplate implements HashAggregator { protected static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HashAggregator.class); - private static final int VARIABLE_MAX_WIDTH_VALUE_SIZE = 50; - private static final int VARIABLE_MIN_WIDTH_VALUE_SIZE = 8; - - private static final boolean EXTRA_DEBUG_1 = false; --- End diff -- Logging frameworks were invented to solve this problem. We should just use the logging framework. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344297#comment-16344297 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164609183 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -516,43 +501,48 @@ private void initializeSetup(RecordBatch newIncoming) throws SchemaChangeExcepti * @param incoming */ private void updateEstMaxBatchSize(RecordBatch incoming) { -if ( estMaxBatchSize > 0 ) { return; } // no handling of a schema (or varchar) change // Use the sizer to get the input row width and the length of the longest varchar column -RecordBatchSizer sizer = new RecordBatchSizer(incoming); -logger.trace("Incoming sizer: {}",sizer); -// An empty batch only has the schema, can not tell actual length of varchars -// else use the actual varchars length, each capped at 50 (to match the space allocation) -long estInputRowWidth = sizer.rowCount() == 0 ? sizer.stdRowWidth() : sizer.netRowWidthCap50(); +final RecordBatchSizer incomingColumnSizes = new RecordBatchSizer(incoming); +final MapcolumnSizeMap = incomingColumnSizes.getColumnSizeMap(); +keySizes = CaseInsensitiveMap.newHashMap(); -// Get approx max (varchar) column width to get better memory allocation -maxColumnWidth = Math.max(sizer.maxAvgColumnSize(), VARIABLE_MIN_WIDTH_VALUE_SIZE); -maxColumnWidth = Math.min(maxColumnWidth, VARIABLE_MAX_WIDTH_VALUE_SIZE); +logger.trace("Incoming sizer: {}",incomingColumnSizes); -// // Calculate the estimated max (internal) batch (i.e. Keys batch + Values batch) size // (which is used to decide when to spill) // Also calculate the values batch size (used as a reserve to overcome an OOM) -// -Iterator outgoingIter = outContainer.iterator(); -int fieldId = 0; -while (outgoingIter.hasNext()) { - ValueVector vv = outgoingIter.next().getValueVector(); - MaterializedField mr = vv.getField(); - int fieldSize = vv instanceof VariableWidthVector ? maxColumnWidth : - TypeHelper.getSize(mr.getType()); - estRowWidth += fieldSize; - estOutputRowWidth += fieldSize; - if ( fieldId < numGroupByOutFields ) { fieldId++; } - else { estValuesRowWidth += fieldSize; } + +for (int columnIndex = 0; columnIndex < numGroupByOutFields; columnIndex++) { + final VectorWrapper vectorWrapper = outContainer.getValueVector(columnIndex); + final String columnName = vectorWrapper.getField().getName(); + final int columnSize = columnSizeMap.get(columnName).estSize; + keySizes.put(columnName, columnSize); + estOutputRowWidth += columnSize; } + +long estValuesRowWidth = 0; + +for (int columnIndex = numGroupByOutFields; columnIndex < outContainer.getNumberOfColumns(); columnIndex++) { + VectorWrapper vectorWrapper = outContainer.getValueVector(columnIndex); + RecordBatchSizer.ColumnSize columnSize = new RecordBatchSizer.ColumnSize(vectorWrapper.getValueVector()); + estOutputRowWidth += columnSize.estSize; --- End diff -- What is missing (maybe in the original code as well) is those extras: The Values batch sometimes uses those 8 byte vectors to represent nullable, and for some aggregations (e.g. STDDEV), internally there are two columns ("count" and "sum of squares", or something like that), but the outContainer has only one - with the result (this is why the Values vectors in the outContainer are always allocated, instead of passing the existing vectors - see DRILL-5588 ) > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344292#comment-16344292 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164571408 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -129,11 +143,16 @@ public ColumnSize(ValueVector v, String prefix) { // No standard size for Union type dataSize = v.getPayloadByteCount(valueCount); break; + case GENERIC_OBJECT: +// We cannot provide a size for Generic Objects --- End diff -- How can execution ever get here ? Should we throw a UserException.unsupportedError here ? > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344296#comment-16344296 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164604859 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -397,11 +384,9 @@ private void delayedSetup() { } numPartitions = BaseAllocator.nextPowerOfTwo(numPartitions); // in case not a power of 2 -if ( schema == null ) { estValuesBatchSize = estOutgoingAllocSize = estMaxBatchSize = 0; } // incoming was an empty batch --- End diff -- Why was the ( schema == null ) check removed ? Without it, calling updateEstMaxBatchSize() would return an NPE when accessing the batch's schema (when calling RecordBatchSizer(incoming) ). > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344293#comment-16344293 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164576364 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggrSpill.java --- @@ -53,59 +51,54 @@ @Rule public final BaseDirTestWatcher dirTestWatcher = new BaseDirTestWatcher(); -/** - * A template for Hash Aggr spilling tests - * - * @throws Exception - */ -private void testSpill(long maxMem, long numPartitions, long minBatches, int maxParallel, boolean fallback ,boolean predict, - String sql, long expectedRows, int cycle, int fromPart, int toPart) throws Exception { -LogFixture.LogFixtureBuilder logBuilder = LogFixture.builder() --- End diff -- The logBuilder is here in case someone uses the Hash Agg spill tests and needs to log. It saves the trouble to put this code back again for that purpose. Unfortunately Java does not support #ifdef .. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344295#comment-16344295 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164600675 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java --- @@ -805,7 +803,12 @@ private IntVector allocMetadataVector(int size, int initialValue) { } @Override - public void setMaxVarcharSize(int size) { maxVarcharSize = size; } + public void setKeySizes(MapkeySizes) { +Preconditions.checkNotNull(keySizes); + +this.keySizes = CaseInsensitiveMap.newHashMap(); --- End diff -- Minor: Why does every hash table allocate a new map and copy its values ? Can't all just share the keySizes map ? > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344298#comment-16344298 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164609742 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -84,13 +85,6 @@ public abstract class HashAggTemplate implements HashAggregator { protected static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HashAggregator.class); - private static final int VARIABLE_MAX_WIDTH_VALUE_SIZE = 50; - private static final int VARIABLE_MIN_WIDTH_VALUE_SIZE = 8; - - private static final boolean EXTRA_DEBUG_1 = false; --- End diff -- This is Aman's implementation of an #ifdef . Basically he wanted to preserve these debugging messages, but not seeing them (there are too many) while doing some other debugging work. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344294#comment-16344294 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164574746 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -232,9 +251,8 @@ else if (width > 0) { } } - public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; // 16 MiB - private List columnSizes = new ArrayList<>(); + private MapcolumnSizeMap = CaseInsensitiveMap.newHashMap(); --- End diff -- There may be an issue with MapRDB or HBase which are case sensitive. Also Drill allows for quoted column names (e.g., "FOO" or "foo") which keep the case sensitivity. One crude "fix" can be to add a user option, so when turned on, this code would use an AbstractHashedMap instead. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344291#comment-16344291 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164549701 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java --- @@ -255,7 +254,6 @@ private HashAggregator createAggregatorInternal() throws SchemaChangeException, groupByOutFieldIds[i] = container.add(vv); } -int extraNonNullColumns = 0; // each of SUM, MAX and MIN gets an extra bigint column --- End diff -- Why was this removed ? Unfortunately the code generator still generates an internal Bigint column to keep the "not null" status of the Aggregated Values, in the cases of sum/max/min. Below is an example of a generated code (returning a nullable varchar) - note the nonNullCount (see DRILL-5728): ``` public void outputRecordValues(int htRowIdx, int outRowIdx) throws SchemaChangeException { { NullableVarCharHolder out17; { final NullableVarCharHolder out = new NullableVarCharHolder(); vv1 .getAccessor().get((htRowIdx), work0); ObjectHolder value = work0; work4 .value = vv5 .getAccessor().get((htRowIdx)); UInt1Holder init = work4; work8 .value = vv9 .getAccessor().get((htRowIdx)); BigIntHolder nonNullCount = work8; DrillBuf buf = work12; MaxVarBytesFunctions$NullableVarCharMax_output: { if (nonNullCount.value > 0) { out.isSet = 1; org.apache.drill.exec.expr.fn.impl.DrillByteArray tmp = (org.apache.drill.exec.expr.fn.impl.DrillByteArray) value.obj; buf = buf.reallocIfNeeded(tmp.getLength()); buf.setBytes(0, tmp.getBytes(), 0, tmp.getLength()); out.start = 0; out.end = tmp.getLength(); out.buffer = buf; } else { out.isSet = 0; } } work0 = value; vv1 .getMutator().setSafe((htRowIdx), work0); work4 = init; vv5 .getMutator().set((htRowIdx), work4 .value); work8 = nonNullCount; vv9 .getMutator().set((htRowIdx), work8 .value); work12 = buf; out17 = out; } if (!(out17 .isSet == 0)) { vv18 .getMutator().setSafe((outRowIdx), out17 .isSet, out17 .start, out17 .end, out17 .buffer); } } } ``` > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344299#comment-16344299 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164611358 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -956,21 +925,8 @@ private void spillAPartition(int part) { this.htables[part].outputKeys(currOutBatchIndex, this.outContainer, outStartIdxHolder.value, outNumRecordsHolder.value, numPendingOutput); // set the value count for outgoing batch value vectors - /* int i = 0; */ for (VectorWrapper v : outgoing) { v.getValueVector().getMutator().setValueCount(numOutputRecords); -/* --- End diff -- We don't have any facility to take a peek at actual data for debugging. While this commented code looks ugly, it allows for easy enabling to look at data, if needed (or an example of how to do so, for elsewhere). > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344286#comment-16344286 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164611155 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- The goal of this code is to preallocate enough space to hold BATCH_SIZE number of elements in the batch. The BatchHolder itself is used to hold the aggregate values and aggregate values can currently only be FixedWidth vectors or ObjectVectors. Since we know how much direct memory each of these types will consume, we can use that knowledge in our column size estimate and preallocate the correct amount of space. **Note:** earlier the RecordBatchSizer would return an estSize of 0 for an empty FixedWidth value vectors. For example and empty IntVector would return an estSize of 0. This was incorrect behavior so I updated the RecordBatchSizer to return the correct size of a FixedWidth vector even when the value vector is empty. Please see the changes in the RecordBatchSizer and ValueVector templates for more details. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types
[ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344206#comment-16344206 ] ASF GitHub Bot commented on DRILL-5846: --- Github user vrozov commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164600674 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- The sanity check that can be turned on and off is required, otherwise, there is no difference between Netty `PlatformDependend` and `MemoryUtils` and it will be better to have a unified way independent from Java assertions to turn bounds checking on or off. My assumption is that Drill either reads from DrillBuf or writes to DrillBuf as data is not supposed to be in the heap, so DrillBuf looks like a natural choice for the functionality. > Improve Parquet Reader Performance for Flat Data types > --- > > Key: DRILL-5846 > URL: https://issues.apache.org/jira/browse/DRILL-5846 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.11.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Major > Labels: performance > Fix For: 1.13.0 > > > The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to > further improve the Parquet Reader performance as several users reported that > Parquet parsing represents the lion share of the overall query execution. It > tracks Flat Data types only as Nested DTs might involve functional and > processing enhancements (e.g., a nested column can be seen as a Document; > user might want to perform operations scoped at the document level that is no > need to span all rows). Another JIRA will be created to handle the nested > columns use-case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344203#comment-16344203 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164599681 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- I wonder what is wrong if estSize is 0 when there is no data. If there is no data for a column, why would we want to add it's value width to outgoing row width ? > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6099) Drill does not push limit past project (flatten) if it cannot be pushed into scan
[ https://issues.apache.org/jira/browse/DRILL-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344166#comment-16344166 ] ASF GitHub Bot commented on DRILL-6099: --- Github user gparai commented on a diff in the pull request: https://github.com/apache/drill/pull/1096#discussion_r164593749 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java --- @@ -121,4 +132,50 @@ protected void doOnMatch(RelOptRuleCall call, DrillLimitRel limitRel, DrillScanR } } + + private static boolean isProjectFlatten(RelNode project) { --- End diff -- Done. > Drill does not push limit past project (flatten) if it cannot be pushed into > scan > - > > Key: DRILL-6099 > URL: https://issues.apache.org/jira/browse/DRILL-6099 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: Gautam Kumar Parai >Assignee: Gautam Kumar Parai >Priority: Major > Fix For: 1.13.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > It would be useful to have pushdown occur past flatten(project). Here is an > example to illustrate the issue: > {{explain plan without implementation for }}{{select name, > flatten(categories) as category from dfs.`/tmp/t_json_20` LIMIT 1;}} > {{DrillScreenRel}}{{ }} > {{ DrillLimitRel(fetch=[1])}}{{ }} > {{ DrillProjectRel(name=[$0], category=[FLATTEN($1)])}} > {{ DrillScanRel(table=[[dfs, /tmp/t_json_20]], groupscan=[EasyGroupScan > [selectionRoot=maprfs:/tmp/t_json_20, numFiles=1, columns=[`name`, > `categories`], files=[maprfs:///tmp/t_json_20/0_0_0.json]]])}} > = > Content of 0_0_0.json > = > { > "name" : "Eric Goldberg, MD", > "categories" : [ "Doctors", "Health & Medical" ] > } { > "name" : "Pine Cone Restaurant", > "categories" : [ "Restaurants" ] > } { > "name" : "Deforest Family Restaurant", > "categories" : [ "American (Traditional)", "Restaurants" ] > } { > "name" : "Culver's", > "categories" : [ "Food", "Ice Cream & Frozen Yogurt", "Fast Food", > "Restaurants" ] > } { > "name" : "Chang Jiang Chinese Kitchen", > "categories" : [ "Chinese", "Restaurants" ] > } > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6071) Limit batch size for flatten operator
[ https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pritesh Maker updated DRILL-6071: - Reviewer: Paul Rogers > Limit batch size for flatten operator > - > > Key: DRILL-6071 > URL: https://issues.apache.org/jira/browse/DRILL-6071 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Flow >Affects Versions: 1.12.0 >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Major > Labels: ready-to-commit > Fix For: 1.13.0 > > > flatten currently uses an adaptive algorithm to control the outgoing batch > size. > While processing the input batch, it adjusts the number of records in > outgoing batch based on memory usage so far. Once memory usage exceeds the > configured limit for a batch, the algorithm becomes more proactive and > adjusts the limit half way through and end of every batch. All this periodic > checking of memory usage is unnecessary overhead and impacts performance. > Also, we will know only after the fact. > Instead, figure out how many rows should be there in the outgoing batch from > incoming batch. > The way to do that would be to figure out average row size of the outgoing > batch and based on that figure out how many rows can be there for a given > amount of memory. value vectors provide us the necessary information to be > able to figure this out. > Row count in output batch should be decided based on memory (with min 1 and > max 64k rows) and not hard coded (to 4K) in code. Memory for output batch > should be configurable system option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6071) Limit batch size for flatten operator
[ https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pritesh Maker updated DRILL-6071: - Issue Type: Improvement (was: Bug) > Limit batch size for flatten operator > - > > Key: DRILL-6071 > URL: https://issues.apache.org/jira/browse/DRILL-6071 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Affects Versions: 1.12.0 >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Major > Labels: ready-to-commit > Fix For: 1.13.0 > > > flatten currently uses an adaptive algorithm to control the outgoing batch > size. > While processing the input batch, it adjusts the number of records in > outgoing batch based on memory usage so far. Once memory usage exceeds the > configured limit for a batch, the algorithm becomes more proactive and > adjusts the limit half way through and end of every batch. All this periodic > checking of memory usage is unnecessary overhead and impacts performance. > Also, we will know only after the fact. > Instead, figure out how many rows should be there in the outgoing batch from > incoming batch. > The way to do that would be to figure out average row size of the outgoing > batch and based on that figure out how many rows can be there for a given > amount of memory. value vectors provide us the necessary information to be > able to figure this out. > Row count in output batch should be decided based on memory (with min 1 and > max 64k rows) and not hard coded (to 4K) in code. Memory for output batch > should be configurable system option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6099) Drill does not push limit past project (flatten) if it cannot be pushed into scan
[ https://issues.apache.org/jira/browse/DRILL-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344124#comment-16344124 ] ASF GitHub Bot commented on DRILL-6099: --- Github user chunhui-shi commented on a diff in the pull request: https://github.com/apache/drill/pull/1096#discussion_r164585670 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java --- @@ -121,4 +132,50 @@ protected void doOnMatch(RelOptRuleCall call, DrillLimitRel limitRel, DrillScanR } } + + private static boolean isProjectFlatten(RelNode project) { --- End diff -- I think it might be more general to name the functions to schemaUnknown(for conert_fromJson), rowCountUnknown(for flatten), so if in future we have some other functions fall in these two categories, we could easily add these functions to the categories. What do you think? > Drill does not push limit past project (flatten) if it cannot be pushed into > scan > - > > Key: DRILL-6099 > URL: https://issues.apache.org/jira/browse/DRILL-6099 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: Gautam Kumar Parai >Assignee: Gautam Kumar Parai >Priority: Major > Fix For: 1.13.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > It would be useful to have pushdown occur past flatten(project). Here is an > example to illustrate the issue: > {{explain plan without implementation for }}{{select name, > flatten(categories) as category from dfs.`/tmp/t_json_20` LIMIT 1;}} > {{DrillScreenRel}}{{ }} > {{ DrillLimitRel(fetch=[1])}}{{ }} > {{ DrillProjectRel(name=[$0], category=[FLATTEN($1)])}} > {{ DrillScanRel(table=[[dfs, /tmp/t_json_20]], groupscan=[EasyGroupScan > [selectionRoot=maprfs:/tmp/t_json_20, numFiles=1, columns=[`name`, > `categories`], files=[maprfs:///tmp/t_json_20/0_0_0.json]]])}} > = > Content of 0_0_0.json > = > { > "name" : "Eric Goldberg, MD", > "categories" : [ "Doctors", "Health & Medical" ] > } { > "name" : "Pine Cone Restaurant", > "categories" : [ "Restaurants" ] > } { > "name" : "Deforest Family Restaurant", > "categories" : [ "American (Traditional)", "Restaurants" ] > } { > "name" : "Culver's", > "categories" : [ "Food", "Ice Cream & Frozen Yogurt", "Fast Food", > "Restaurants" ] > } { > "name" : "Chang Jiang Chinese Kitchen", > "categories" : [ "Chinese", "Restaurants" ] > } > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344039#comment-16344039 ] ASF GitHub Bot commented on DRILL-6106: --- Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1099 Travis CI failed due to a known issue `TestSortImpl.testLargeBatch:502->runJumboBatchTest:475->runLargeSortTest:454 Value of 1:0 expected:<0> but was:<1>` unrelated to the SSLConfigClient.java changes. > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Reporter: Reudismam Rolim de Sousa >Priority: Minor > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343960#comment-16343960 ] ASF GitHub Bot commented on DRILL-6106: --- Github user reudismam commented on the issue: https://github.com/apache/drill/pull/1099 I reverted the edit because the code does not pass in Travis CI. > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Reporter: Reudismam Rolim de Sousa >Priority: Minor > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343946#comment-16343946 ] ASF GitHub Bot commented on DRILL-6106: --- Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1099 What was the reason to revert the change in SSLConfigClient.java? > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Reporter: Reudismam Rolim de Sousa >Priority: Minor > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6116) MAXDIR - Unsupported operation exception // partition explorer interface?
John Humphreys created DRILL-6116: - Summary: MAXDIR - Unsupported operation exception // partition explorer interface? Key: DRILL-6116 URL: https://issues.apache.org/jira/browse/DRILL-6116 Project: Apache Drill Issue Type: Bug Components: Client - JDBC Affects Versions: 1.10.0 Reporter: John Humphreys Using MAXDIR in drill seems to cause errors when trying to apply other standard where clause filters. Am I doing something wrong here? **This Query Fails** SELECT epoch_hour, entity, `1_average`, `1_95p` FROM dfs.`/path/drill-hour-final/` where dir0 = '2018' and dir1 = '01' and dir2 = '27' and dir3 = MAXDIR('dfs', concat('/path/drill-hour-final/', dir0, '/', dir1, '/', dir2, '/')) and entity = 1784928 **With This Error** > SYSTEM ERROR: UnsupportedOperationException: The partition explorer > interface can only be used in functions that can be evaluated at > planning time. Make sure that the planner.enable_constant_folding > configuration option is set to true. > > Fragment 1:7 > > [Error Id: d62c8227-db0e-4d31-9443-e570193d1010 on > psclxcpdevsys13.nomura.com:31010] **Wrapping it Makes it Work Without Error** select * from ( SELECT epoch_hour, entity, `1_average`, `1_95p` FROM dfs.`/path/drill-hour-final/` where dir0 = '2018' and dir1 = '01' and dir2 = '27' and dir3 = MAXDIR('dfs', concat('/path/drill-hour-final/', dir0, '/', dir1, '/', dir2, '/')) ) where entity = 1784928 epoch_hour entity 1_average 1_95p 1517086800 1784928 5.3347222 8.82857142857143 1517014800 1784928 25.944984717977217 40.37453087409783 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6115) SingleMergeExchange is not scaling up when many minor fragments are allocated for a query.
[ https://issues.apache.org/jira/browse/DRILL-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanumath Rao Maduri updated DRILL-6115: --- Attachment: Enhancing Drill to multiplex ordered merge exchanges.docx > SingleMergeExchange is not scaling up when many minor fragments are allocated > for a query. > -- > > Key: DRILL-6115 > URL: https://issues.apache.org/jira/browse/DRILL-6115 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators >Affects Versions: 1.12.0 >Reporter: Hanumath Rao Maduri >Assignee: Hanumath Rao Maduri >Priority: Major > Attachments: Enhancing Drill to multiplex ordered merge exchanges.docx > > > SingleMergeExchange is created when a global order is required in the output. > The following query produces the SingleMergeExchange. > {code:java} > 0: jdbc:drill:zk=local> explain plan for select L_LINENUMBER from > dfs.`/drill/tables/lineitem` order by L_LINENUMBER; > +--+--+ > | text | json | > +--+--+ > | 00-00 Screen > 00-01 Project(L_LINENUMBER=[$0]) > 00-02 SingleMergeExchange(sort0=[0]) > 01-01 SelectionVectorRemover > 01-02 Sort(sort0=[$0], dir0=[ASC]) > 01-03 HashToRandomExchange(dist0=[[$0]]) > 02-01 Scan(table=[[dfs, /drill/tables/lineitem]], > groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec > [tableName=maprfs:///drill/tables/lineitem, condition=null], > columns=[`L_LINENUMBER`], maxwidth=15]]) > {code} > On a 10 node cluster if the table is huge then DRILL can spawn many minor > fragments which are all merged on a single node with one merge receiver. > Doing so will create lot of memory pressure on the receiver node and also > execution bottleneck. To address this issue, merge receiver should be > multiphase merge receiver. > Ideally for large cluster one can introduce tree merges so that merging can > be done parallel. But as a first step I think it is better to use the > existing infrastructure for multiplexing operators to generate an OrderedMux > so that all the minor fragments pertaining to one DRILLBIT should be merged > and the merged data can be sent across to the receiver operator. > On a 10 node cluster if each node processes 14 minor fragments. > Current version of code merges 140 minor fragments > the proposed version has two level merges 1 - 14 merge in each drillbit which > is parallel > and 10 minorfragments are merged at the receiver node. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6115) SingleMergeExchange is not scaling up when many minor fragments are allocated for a query.
Hanumath Rao Maduri created DRILL-6115: -- Summary: SingleMergeExchange is not scaling up when many minor fragments are allocated for a query. Key: DRILL-6115 URL: https://issues.apache.org/jira/browse/DRILL-6115 Project: Apache Drill Issue Type: Bug Components: Execution - Relational Operators Affects Versions: 1.12.0 Reporter: Hanumath Rao Maduri Assignee: Hanumath Rao Maduri Attachments: Enhancing Drill to multiplex ordered merge exchanges.docx SingleMergeExchange is created when a global order is required in the output. The following query produces the SingleMergeExchange. {code:java} 0: jdbc:drill:zk=local> explain plan for select L_LINENUMBER from dfs.`/drill/tables/lineitem` order by L_LINENUMBER; +--+--+ | text | json | +--+--+ | 00-00 Screen 00-01 Project(L_LINENUMBER=[$0]) 00-02 SingleMergeExchange(sort0=[0]) 01-01 SelectionVectorRemover 01-02 Sort(sort0=[$0], dir0=[ASC]) 01-03 HashToRandomExchange(dist0=[[$0]]) 02-01 Scan(table=[[dfs, /drill/tables/lineitem]], groupscan=[JsonTableGroupScan [ScanSpec=JsonScanSpec [tableName=maprfs:///drill/tables/lineitem, condition=null], columns=[`L_LINENUMBER`], maxwidth=15]]) {code} On a 10 node cluster if the table is huge then DRILL can spawn many minor fragments which are all merged on a single node with one merge receiver. Doing so will create lot of memory pressure on the receiver node and also execution bottleneck. To address this issue, merge receiver should be multiphase merge receiver. Ideally for large cluster one can introduce tree merges so that merging can be done parallel. But as a first step I think it is better to use the existing infrastructure for multiplexing operators to generate an OrderedMux so that all the minor fragments pertaining to one DRILLBIT should be merged and the merged data can be sent across to the receiver operator. On a 10 node cluster if each node processes 14 minor fragments. Current version of code merges 140 minor fragments the proposed version has two level merges 1 - 14 merge in each drillbit which is parallel and 10 minorfragments are merged at the receiver node. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types
[ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343842#comment-16343842 ] ASF GitHub Bot commented on DRILL-5846: --- Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164532290 --- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java --- @@ -703,7 +703,18 @@ protected void _setLong(int index, long value) { @Override public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) { -udle.getBytes(index + offset, dst, dstIndex, length); +final int BULK_COPY_THR = 1024; --- End diff -- You are right, I'll put more information on this optimization: - During code profiling, I have noticed that getBytes() doesn't perform well when called with small length (lower than 1k) - Its throughput improves as the length increases Analysis : - Java exposes two intrinsics for writing to direct memory: putByte and copyMemory - The JVM is able to inline memory access (no function call) for putByte - copyMemory is a bulk API and this internally invokes libc memcpy (requires function call) - The rational is that we are willing to incur a function call if the associated processing is larger than the overhead; this is almost never the case for small memory accesses. > Improve Parquet Reader Performance for Flat Data types > --- > > Key: DRILL-5846 > URL: https://issues.apache.org/jira/browse/DRILL-5846 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.11.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Major > Labels: performance > Fix For: 1.13.0 > > > The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to > further improve the Parquet Reader performance as several users reported that > Parquet parsing represents the lion share of the overall query execution. It > tracks Flat Data types only as Nested DTs might involve functional and > processing enhancements (e.g., a nested column can be seen as a Document; > user might want to perform operations scoped at the document level that is no > need to span all rows). Another JIRA will be created to handle the nested > columns use-case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types
[ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343798#comment-16343798 ] ASF GitHub Bot commented on DRILL-5846: --- Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164525752 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Vlad, Can you please clarify your point? Are you suggesting that every memory access has to be via DrillBuf even if the data is loaded in Java heap and the developer is performing an optimization. If this is your goal, then I would think the danger is that the DrillBuf class will be cluttered and contracts hard to understand: - Notice that DrillBuf performs expensive sanity checks (boundary and reference count checks); this is due to Drill memory optimizations for minimizing the overall memory usage. - When the code is processing already load data, these checks are not applicable anymore as this data is private. - I could have easily moved the new APIs there but I strongly felt this was inappropriate.. > Improve Parquet Reader Performance for Flat Data types > --- > > Key: DRILL-5846 > URL: https://issues.apache.org/jira/browse/DRILL-5846 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.11.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Major > Labels: performance > Fix For: 1.13.0 > > > The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to > further improve the Parquet Reader performance as several users reported that > Parquet parsing represents the lion share of the overall query execution. It > tracks Flat Data types only as Nested DTs might involve functional and > processing enhancements (e.g., a nested column can be seen as a Document; > user might want to perform operations scoped at the document level that is no > need to span all rows). Another JIRA
[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types
[ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343779#comment-16343779 ] ASF GitHub Bot commented on DRILL-5846: --- Github user vrozov commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164520611 --- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java --- @@ -703,7 +703,18 @@ protected void _setLong(int index, long value) { @Override public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) { -udle.getBytes(index + offset, dst, dstIndex, length); +final int BULK_COPY_THR = 1024; --- End diff -- The optimization is not obvious (and is counterintuitive) without more details. > Improve Parquet Reader Performance for Flat Data types > --- > > Key: DRILL-5846 > URL: https://issues.apache.org/jira/browse/DRILL-5846 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.11.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Major > Labels: performance > Fix For: 1.13.0 > > > The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to > further improve the Parquet Reader performance as several users reported that > Parquet parsing represents the lion share of the overall query execution. It > tracks Flat Data types only as Nested DTs might involve functional and > processing enhancements (e.g., a nested column can be seen as a Document; > user might want to perform operations scoped at the document level that is no > need to span all rows). Another JIRA will be created to handle the nested > columns use-case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types
[ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343770#comment-16343770 ] ASF GitHub Bot commented on DRILL-5846: --- Github user vrozov commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164519147 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- @sachouche It is not clear why using MemoryUtils will make difference for hotspot in using cache or 8bit instruction set. It will be used for both MemoryUtils and DrillBuf. > Improve Parquet Reader Performance for Flat Data types > --- > > Key: DRILL-5846 > URL: https://issues.apache.org/jira/browse/DRILL-5846 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.11.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Major > Labels: performance > Fix For: 1.13.0 > > > The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to > further improve the Parquet Reader performance as several users reported that > Parquet parsing represents the lion share of the overall query execution. It > tracks Flat Data types only as Nested DTs might involve functional and > processing enhancements (e.g., a nested column can be seen as a Document; > user might want to perform operations scoped at the document level that is no > need to span all rows). Another JIRA will be created to handle the nested > columns use-case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Farkas updated DRILL-6032: -- Reviewer: Boaz Ben-Zvi > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types
[ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343704#comment-16343704 ] ASF GitHub Bot commented on DRILL-5846: --- Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164509658 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Vlad, - The reader has loaded data from direct memory into CPU level-1 cache through DrillBuf - The data is stored (4k at a time) in a Java byte[] - The only reason for using MemoryUtil is to perform 64bit processing (I have previously indicated that many other processing intensive applications such as Snappy and Hadoop Comparison use the same pattern) - I solely added this logic after noticing the Hotspot was generating 8bit instructions Again, the intent is not to access data from DrillBuf (the data has been already loaded into CPU cache); this is mainly an optimization step. > Improve Parquet Reader Performance for Flat Data types > --- > > Key: DRILL-5846 > URL: https://issues.apache.org/jira/browse/DRILL-5846 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.11.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Major > Labels: performance > Fix For: 1.13.0 > > > The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to > further improve the Parquet Reader performance as several users reported that > Parquet parsing represents the lion share of the overall query execution. It > tracks Flat Data types only as Nested DTs might involve functional and > processing enhancements (e.g., a nested column can be seen as a Document; > user might want to perform operations scoped at the document level that is no > need to span all rows). Another JIRA will be created to handle the nested > columns use-case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types
[ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343688#comment-16343688 ] ASF GitHub Bot commented on DRILL-5846: --- Github user vrozov commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164506523 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Unless there is a significant performance benefit in introducing two ways to access direct buffers, Drill should have a unified way to work with the off-heap memory preferably using DrillBuf. With bounds checking disabled by default, I don't see why MemoryUtils will outperform DrillBuf. > Improve Parquet Reader Performance for Flat Data types > --- > > Key: DRILL-5846 > URL: https://issues.apache.org/jira/browse/DRILL-5846 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.11.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Major > Labels: performance > Fix For: 1.13.0 > > > The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to > further improve the Parquet Reader performance as several users reported that > Parquet parsing represents the lion share of the overall query execution. It > tracks Flat Data types only as Nested DTs might involve functional and > processing enhancements (e.g., a nested column can be seen as a Document; > user might want to perform operations scoped at the document level that is no > need to span all rows). Another JIRA will be created to handle the nested > columns use-case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343662#comment-16343662 ] ASF GitHub Bot commented on DRILL-6106: --- Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1099 @reudismam you can leave `valueOf` for Double and Float for consistency. There are two other comments @vrozov has left. Could you please address hem as well? They are not directly related to your changes but will make Drill code better. Also please squash commits into one. > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Reporter: Reudismam Rolim de Sousa >Priority: Minor > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343652#comment-16343652 ] ASF GitHub Bot commented on DRILL-6106: --- Github user reudismam commented on the issue: https://github.com/apache/drill/pull/1099 Conflicts resolved. Do you want to revert the edits to Double and Float? I think it is ok for consistency. > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Reporter: Reudismam Rolim de Sousa >Priority: Minor > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343613#comment-16343613 ] ASF GitHub Bot commented on DRILL-6080: --- Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1090 +1 > Sort incorrectly limits batch size to 65535 records rather than 65536 > - > > Key: DRILL-6080 > URL: https://issues.apache.org/jira/browse/DRILL-6080 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.13.0 > > > Drill places an upper limit on the number of rows in a batch of 64K. That is > 65,536 decimal. When we index records, the indexes run from 0 to 64K-1 or 0 > to 65,535. > The sort code incorrectly uses {{Character.MAX_VALUE}} as the maximum row > count. So, if an incoming batch uses the full 64K size, sort ends up > splitting batches unnecessarily. > The fix is to instead use the correct constant `ValueVector.MAX_ROW_COUNT`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343354#comment-16343354 ] ASF GitHub Bot commented on DRILL-6106: --- Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1099 @reudismam could you please address code review comments so changes can be pushed into master? Also please resolve merge conflict. > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Reporter: Reudismam Rolim de Sousa >Priority: Minor > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6090) While connecting to drill-bits using JDBC Driver through Zookeeper, a lot of "Curator-Framework-0" threads are created if connection to drill-bit is not successful(no d
[ https://issues.apache.org/jira/browse/DRILL-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343353#comment-16343353 ] ASF GitHub Bot commented on DRILL-6090: --- Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1094 @milindt could you please address code review comments so changes can be pushed into master? > While connecting to drill-bits using JDBC Driver through Zookeeper, a lot of > "Curator-Framework-0" threads are created if connection to drill-bit is not > successful(no drill-bits are up/reachable) > --- > > Key: DRILL-6090 > URL: https://issues.apache.org/jira/browse/DRILL-6090 > Project: Apache Drill > Issue Type: Bug > Components: Client - JDBC >Affects Versions: 1.12.0 > Environment: Centos 65, Java 7, Drill JDBC 1.12.0 >Reporter: Milind Takawale >Assignee: Milind Takawale >Priority: Major > Fix For: 1.13.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > I am using Drill JDBC driver 1.12.0 to connect to MapR-DB. I am finding the > available drill-bits using Zookeepers. When drill-bits are not up or not > reachable, the connection is failed with exception: "Failure in connecting to > Drill: oadd.org.apache.drill.exec.rpc.RpcException: Failure setting up ZK for > client", which is expected, but number of threads created by > ZKClusterCoordinator just keeps on increasing. > Steps to reproduce the issue > # Setup a connection with a drill-bit using Apache Drill JDBC driver 1.12.0 > through Zookeeper hosts(port 5181) > # Now stop the drill-bit services or block the drill-bit IPs using iptable > rules > # Truncate catalina logs > # Try to connect to the drill-bit/hit a code path that requires connection > to drill-bits. > # Take thread dump using kill -QUIT > # grep -c "Curator-Framework-0" catalina.out > Observe that the curator framework thread just keep on accumulating > RCA: > # ZKClusterCoordinator creates curator threads in the constructor > # ZKClusterCoordinator is instantiated by DrillClient.connect > # DrillClient.connect is called in DrillConnectionImpl constructor > Fix: > Call DrillConnectionImpl .cleanup() from all the catch blocks in the > DrillConnectionImpl constructor. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)