[ https://issues.apache.org/jira/browse/DRILL-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453305#comment-16453305 ]
ASF GitHub Bot commented on DRILL-6307:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1228#discussion_r184244877

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ---
    @@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
        */
       private int netRowWidth;
       private int netRowWidthCap50;
    +
    +  /**
    +   * actual row size if input is not empty. Otherwise, standard size.
    +   */
    +  private int rowAllocSize;
    --- End diff --

    Unless I'm missing something, we can't move forward on a join if one side is empty: we won't know if we have the rows we need.

    Consider a merge join (simplest). The left gets some data, but the right is empty. We can't proceed unless the right has hit EOF. Otherwise, we don't know if we have a match or not for the first left row. We need to read another right batch and keep going until we either hit EOF (no matching rows) or get some data.

    Once we have some data, we can go row-by-row to see if we have a left-only, right-only, or matching set of rows. If we get to EOF on either side, we know that there are no matches for the other side. What we do in the no-match case depends on whether we are doing a LEFT OUTER, RIGHT OUTER or INNER join.

    The point is, we can't make progress until we get that non-empty right batch (in this example). So, there is no reason to allocate space based on an empty batch (unless the entire input is empty) because we'll need to find a non-empty (or EOF) batch anyway.
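To make the argument above concrete, here is a minimal sketch of the idea: keep reading one side until a non-empty batch or EOF turns up, and only then choose the width used for allocation. Everything in it is hypothetical and for illustration only. `EmptyBatchSketch`, `Batch`, `nextNonEmpty`, `rowAllocSize` and the `STD_ROW_WIDTH` value are assumptions, not Drill classes or APIs, and a real operator would work with Drill's batch iteration outcomes rather than a plain `Iterator`.

```java
import java.util.Iterator;

/** Hypothetical sketch; none of these names come from Drill. */
final class EmptyBatchSketch {

  /** Assumed fallback row width (bytes), used only when the entire input is empty. */
  public static final int STD_ROW_WIDTH = 50;

  /** Stand-in for an incoming batch: just a row count and a measured row width. */
  public static final class Batch {
    public final int rowCount;
    public final int rowWidth;

    public Batch(int rowCount, int rowWidth) {
      this.rowCount = rowCount;
      this.rowWidth = rowWidth;
    }
  }

  /**
   * Skips empty batches. Returns the first batch that has rows,
   * or null if the input reaches EOF without producing any rows.
   */
  public static Batch nextNonEmpty(Iterator<Batch> input) {
    while (input.hasNext()) {
      Batch b = input.next();
      if (b.rowCount > 0) {
        return b;
      }
    }
    return null; // EOF: the whole input was empty
  }

  /**
   * Width to allocate with: the measured width when data was seen,
   * the standard width only when the whole input turned out to be empty.
   */
  public static int rowAllocSize(Batch firstNonEmpty) {
    return firstNonEmpty == null ? STD_ROW_WIDTH : firstNonEmpty.rowWidth;
  }
}
```

With a right side that delivers two empty batches followed by a 100-row batch of width 24, `nextNonEmpty` returns the third batch and `rowAllocSize` gives 24. With a right side that only ever delivers empty batches, `nextNonEmpty` returns null at EOF and `rowAllocSize` falls back to `STD_ROW_WIDTH`.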
> Handle empty batches in record batch sizer correctly
> ----------------------------------------------------
>
>                 Key: DRILL-6307
>                 URL: https://issues.apache.org/jira/browse/DRILL-6307
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.13.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.14.0
>
>
> When we get an empty batch, the record batch sizer calculates the row width as zero. In
> that case, we do not do accounting and memory allocation correctly for
> outgoing batches.
> For example, in merge join, for a left outer join, if the right side batch is
> empty, we still have to include the right side columns as null in the outgoing
> batch.
> Say the first batch is empty. Then, for outgoing, we allocate empty vectors with
> zero capacity. When we read the next batch with data, we will end up going
> through the realloc loop. If we use a right side row width of 0 in the outgoing row
> width calculation, the number of rows we calculate will be higher, and later,
> when we get a non-empty batch, we might exceed the memory limits.
> One possible workaround/solution: allocate memory based on the standard size for an
> empty input batch. Use the allocation width as the width of the batch in the number
> of rows calculation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)