[
https://issues.apache.org/jira/browse/DRILL-6688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592349#comment-16592349
]
ASF GitHub Bot commented on DRILL-6688:
---------------------------------------
bitblender commented on a change in pull request #1442: DRILL-6688 Data batches
for Project operator exceed the maximum specified
URL: https://github.com/apache/drill/pull/1442#discussion_r212781844
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/OutputWidthVisitor.java
##########
@@ -270,8 +270,10 @@ public OutputWidthExpression visitIfElseWidthExpr(IfElseWidthExpr ifElseWidthExp
   private OutputWidthExpression getFixedLenExpr(MajorType majorType) {
     MajorType type = majorType;
     if (Types.isFixedWidthType(type)) {
-      int fixedWidth = ProjectMemoryManager.getWidthOfFixedWidthType(type);
-      return new OutputWidthExpression.FixedLenExpr(fixedWidth);
+      // Use only the width of the data. Metadata width will be accounted for at the end
+      // This is to avoid using metadata size in intermediate calculations
+      int fixedDataWidth = ProjectMemoryManager.getDataWidthOfFixedWidthType(type);
Review comment:
The MinorType.NULL check is required to handle a function that is called with
a null argument. This can happen, for instance, with convert_toJSON, as in
TestComplexTypeReader.testNonExistentFieldConverting(), which tries to convert
a nonexistent field.
See the object graph below.
this = {DrillFuncHolderExpr@8281}
  holder = {DrillSimpleFuncHolder@8279} "DrillSimpleFuncHolder [functionNames=[convert_toJSON, convert_toSIMPLEJSON], returnType=MajorType[minor_type: VARBINARY mode: REQUIRED], nullHandling=NULL_IF_NULL, parameters=[ValueReference [type=MajorType[minor_type: LATE mode: REQUIRED], name=input]]]"
  majorType = {TypeProtos$MajorType@8317} "minor_type: VARBINARY\nmode: OPTIONAL\n"
  interpreter = null
  args = {SingletonImmutableList@8315} size = 1
    0 = {NullExpression@8285}
      t = {TypeProtos$MajorType@8287} "minor_type: NULL\nmode: OPTIONAL\n"
  nameUsed = "convert_tojson"
  fieldReference = null
  pos = {ExpressionPosition@8322} "org.apache.drill.common.expression.ExpressionPosition@748c42c3[charIndex = -1, expression = --UNKNOWN EXPRESSION--]"
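As a rough illustration of the point above, here is a minimal standalone sketch of a width lookup that guards against a NULL-typed argument before consulting the fixed-width table. It is not Drill's code: the MinorType enum is a cut-down stand-in for TypeProtos.MinorType, the width table and the fixedDataWidth helper are hypothetical, and treating NULL as 0 bytes of data is an assumption made only for this example.

```java
import java.util.EnumMap;
import java.util.Map;

public class FixedLenWidthSketch {
    // Hypothetical, cut-down stand-in for Drill's TypeProtos.MinorType.
    enum MinorType { NULL, INT, BIGINT, FLOAT8, VARCHAR }

    // Data-only widths in bytes. Metadata (bits or offset vectors) is
    // deliberately left out, mirroring the "data width only" idea in the diff.
    private static final Map<MinorType, Integer> DATA_WIDTH =
            new EnumMap<>(MinorType.class);
    static {
        DATA_WIDTH.put(MinorType.INT, 4);
        DATA_WIDTH.put(MinorType.BIGINT, 8);
        DATA_WIDTH.put(MinorType.FLOAT8, 8);
    }

    /**
     * Returns the data width for a fixed-width type, 0 for NULL (an
     * assumption of this sketch), or -1 when the type is not fixed width.
     */
    static int fixedDataWidth(MinorType type) {
        if (type == MinorType.NULL) {
            // A NullExpression argument carries no data of its own.
            return 0;
        }
        Integer width = DATA_WIDTH.get(type);
        return width == null ? -1 : width;
    }

    public static void main(String[] args) {
        for (MinorType type : MinorType.values()) {
            System.out.println(type + " -> " + fixedDataWidth(type));
        }
    }
}
```

Without the NULL branch, a NULL-typed argument such as the NullExpression in the object graph would fall through to the "not fixed width" path in this sketch.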
----------------------------------------------------------------
> Data batches for Project operator exceed the maximum specified
> --------------------------------------------------------------
>
> Key: DRILL-6688
> URL: https://issues.apache.org/jira/browse/DRILL-6688
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Affects Versions: 1.14.0
> Reporter: Robert Hou
> Assignee: Karthikeyan Manivannan
> Priority: Major
> Fix For: 1.15.0
>
>
> I ran this query:
> alter session set `drill.exec.memory.operator.project.output_batch_size` = 131072;
> alter session set `planner.width.max_per_node` = 1;
> alter session set `planner.width.max_per_query` = 1;
> select
> chr(101) CharacterValuea,
> chr(102) CharacterValueb,
> chr(103) CharacterValuec,
> chr(104) CharacterValued,
> chr(105) CharacterValuee
> from dfs.`/drill/testdata/batch_memory/character5_1MB.parquet`;
> The output has 1024 identical lines:
> e f g h i
> There is one incoming batch:
> 2018-08-09 15:50:14,794 [24933ad8-a5e2-73f1-90dd-947fc2938e54:frag:0:0] DEBUG o.a.d.e.p.i.p.ProjectMemoryManager - BATCH_STATS, incoming: Batch size:
> { Records: 60000, Total size: 0, Data size: 300000, Gross row width: 0, Net row width: 5, Density: 0% }
> Batch schema & sizes:
> { `_DEFAULT_COL_TO_READ_`(type: OPTIONAL INT, count: 60000, Per entry: std data size: 4, std net size: 5, actual data size: 4, actual net size: 5 Totals: data size: 240000, net size: 300000) }
> }
> There are four outgoing batches. All are too large. The first three look like
> this:
> 2018-08-09 15:50:14,799 [24933ad8-a5e2-73f1-90dd-947fc2938e54:frag:0:0] DEBUG o.a.d.e.p.i.p.ProjectRecordBatch - BATCH_STATS, outgoing: Batch size:
> { Records: 16383, Total size: 0, Data size: 409575, Gross row width: 0, Net row width: 25, Density: 0% }
> Batch schema & sizes:
> { CharacterValuea(type: REQUIRED VARCHAR, count: 16383, Per entry: std data size: 50, std net size: 54, actual data size: 1, actual net size: 5 Totals: data size: 16383, net size: 81915) }
> CharacterValueb(type: REQUIRED VARCHAR, count: 16383, Per entry: std data size: 50, std net size: 54, actual data size: 1, actual net size: 5 Totals: data size: 16383, net size: 81915) }
> CharacterValuec(type: REQUIRED VARCHAR, count: 16383, Per entry: std data size: 50, std net size: 54, actual data size: 1, actual net size: 5 Totals: data size: 16383, net size: 81915) }
> CharacterValued(type: REQUIRED VARCHAR, count: 16383, Per entry: std data size: 50, std net size: 54, actual data size: 1, actual net size: 5 Totals: data size: 16383, net size: 81915) }
> CharacterValuee(type: REQUIRED VARCHAR, count: 16383, Per entry: std data size: 50, std net size: 54, actual data size: 1, actual net size: 5 Totals: data size: 16383, net size: 81915) }
> }
> The last batch is smaller because it has the remaining records.
> The data size (409575) exceeds the maximum batch size (131072).
> character415.q
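
The numbers in the logs above can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the reported figures: the class and method names are hypothetical, and it does not reproduce how ProjectMemoryManager actually chooses a row count, which is not shown in this message.

```java
public class BatchSizeCheck {
    // Bytes occupied by a batch, given its record count and net row width.
    static long batchBytes(int records, int netRowWidth) {
        return (long) records * netRowWidth;
    }

    // Largest row count whose net size still fits within the byte limit.
    static int maxRowsWithinLimit(int limitBytes, int netRowWidth) {
        return limitBytes / netRowWidth;
    }

    // Number of batches needed to emit totalRecords at rowsPerBatch each.
    static int batchCount(int totalRecords, int rowsPerBatch) {
        return (totalRecords + rowsPerBatch - 1) / rowsPerBatch;
    }

    public static void main(String[] args) {
        int limit = 131072;        // output_batch_size set in the session
        int rowsPerBatch = 16383;  // "Records" in the outgoing batch log
        int netRowWidth = 25;      // "Net row width" in the outgoing batch log
        int incoming = 60000;      // "Records" in the incoming batch log

        // 16383 * 25 = 409575, the logged "Data size", well above 131072.
        System.out.println(batchBytes(rowsPerBatch, netRowWidth));
        // At 25 bytes per row, at most 5242 rows fit within 131072 bytes.
        System.out.println(maxRowsWithinLimit(limit, netRowWidth));
        // 60000 rows at 16383 per batch gives 4 batches, the last one partial.
        System.out.println(batchCount(incoming, rowsPerBatch));
    }
}
```

The three full batches plus one partial batch agree with the "four outgoing batches" reported in the description.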