GitHub user ilooner commented on a diff in the pull request:
https://github.com/apache/drill/pull/1101#discussion_r165516121
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
---
@@ -215,6 +206,7 @@ public BatchHolder() {
MaterializedField outputField = materializedValueFields[i];
// Create a type-specific ValueVector for this value
vector = TypeHelper.getNewVector(outputField, allocator);
+ int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff ---
Thanks for the info, @paul-rogers. Since we want the RecordBatchSizer to
be a one-stop shop for size-related information, I propose the following to
address @ppadma's concerns:
- Remove the changes to estSize that I introduced in the RecordBatchSizer.
- Add a field called knownSize to the RecordBatchSizer. This field will
hold the known column width for fixed-width vectors and will be -1 for
variable-width columns such as varchars and repeated columns. Then add a
getKnownSize() method on the column size that returns the known size when
available and throws an exception when the column is not fixed-width. Callers
should therefore use this method only when they know they are dealing with
fixed-width columns, as is the case with HashAgg for now (this will probably
change when strings are stored in varchars instead of objects). The rest of
RecordBatchSizer will remain as is.
- Add a constant called FRAGMENTATION_FACTOR (1.33) that callers can
multiply their hand-crafted batch sizes by when they are estimating batch
sizes from scratch. In HashAgg, I would multiply the custom batch size I
compute by this factor.
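To make the second point concrete, here is a minimal sketch of the proposed knownSize/getKnownSize() contract. The class name, constructor, and sentinel are illustrative assumptions for this comment, not Drill's actual RecordBatchSizer.ColumnSize code:

```java
// Illustrative sketch of the proposed knownSize/getKnownSize() contract.
// Names and structure are assumptions, not Drill's actual code.
public class ColumnSizeSketch {
  // Sentinel for variable-width columns (varchars, repeated columns).
  public static final int UNKNOWN_SIZE = -1;

  public final int knownSize;

  public ColumnSizeSketch(boolean fixedWidth, int widthInBytes) {
    this.knownSize = fixedWidth ? widthInBytes : UNKNOWN_SIZE;
  }

  // Returns the fixed column width in bytes; throws if the column is
  // variable-width, so callers must know they hold a fixed-width column.
  public int getKnownSize() {
    if (knownSize == UNKNOWN_SIZE) {
      throw new IllegalStateException("Column is not fixed-width");
    }
    return knownSize;
  }
}
```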
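And a sketch of how the proposed FRAGMENTATION_FACTOR might be applied to a from-scratch batch-size estimate. Only the constant name and its 1.33 value come from the proposal; the helper method is hypothetical:

```java
// Illustrative use of the proposed FRAGMENTATION_FACTOR; the helper
// method is hypothetical, not Drill's actual HashAgg code.
public class BatchSizeEstimateSketch {
  // Padding factor for batch sizes estimated from scratch, to absorb
  // allocator overhead such as power-of-two vector rounding.
  public static final double FRAGMENTATION_FACTOR = 1.33;

  // Pads a hand-computed batch-size estimate by the fragmentation factor.
  public static long padForFragmentation(long estimatedBatchSize) {
    return (long) Math.ceil(estimatedBatchSize * FRAGMENTATION_FACTOR);
  }
}
```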
Thoughts?
---