[ 
https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704311#comment-14704311
 ] 

Cheng Hao commented on SPARK-9357:
----------------------------------

JoinedRow does increase the overhead by adding layer of indirection, however, 
it is a trade-off, as copying the 2 non-continue pieces of memory together is 
also causes performance issue, particularly the case I listed above, only a few 
records really need by the downstream operators(writing to files) after the 
filtering.

The n-ary JoinedRow will be really helpful in case like the sequential joins. 
For example:
{code} SELECT* FROM a join b on a.key=b.key join c on a.key=c.key and 
a.col1>b.col1 and b.col2<c.col3 {code}

As we know the join keys is exactly the same, and the join operators will 
compute the result in the same stage, and the intermediate row probably like 
below, and the intermediate row will be sent to operator Filter for computing 
the (a.col1>b.col1 and b.col2<c.col3);
{noformat}
          JoinedRow(a,b,c)
              /      \
JoinedRow(a, b)    row(c)
        /   \
row(a)    row(b)
{noformat}

I am thinking if it's be more helpful if we can codegen a row layer for 
supporting the n-ary joined rows, it's definitely save memory copyings, as we 
never copy even a single byte at all.

> Remove JoinedRow
> ----------------
>
>                 Key: SPARK-9357
>                 URL: https://issues.apache.org/jira/browse/SPARK-9357
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>            Reporter: Reynold Xin
>
> JoinedRow was introduced to join two rows together, in aggregation (join key 
> and value), joins (left, right), window functions, etc.
> It aims to reduce the amount of data copied, but incurs branches when the row 
> is actually read. Given all the fields will be read almost all the time 
> (otherwise they get pruned out by the optimizer), branch predictor cannot do 
> anything about those branches.
> I think a better way is just to remove this thing, and materializes the row 
> data directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to