[jira] [Commented] (PIG-2163) Improve nested cross to stream one relation

Zhijie Shen (JIRA) Mon, 29 Aug 2011 00:22:33 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092675#comment-13092675
 ]


Zhijie Shen commented on PIG-2163:
----------------------------------

Hi Daniel,

If I understand your suggestion correctly, you mean that when crossing n 
relations, the first n-1 relations are stored temporarily in n-1 bags, and the 
last relation emits its tuples one at a time (through getNext()), each of 
which is crossed with the stored bags.

However, the problem is that the tuples in the last relation will be iterated 
not once but k1*k2*...*k(n-1) times, where ki is the number of tuples in the 
i-th relation. For example, if there are three relations:

bag1: {(a, 1)}

bag2: {(a, x), (a, y)}
1st     ^
2nd             ^

bag3: {(a, true), (a, false)}

bag3 will be iterated twice: first to cross with (a, x) and then to cross 
with (a, y).
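
To make the count concrete, here is a small Python sketch of the nested-loop 
scheme described above. It is not Pig code: nested_cross and its passes 
counter are hypothetical names used only for illustration, and the bags are 
plain Python lists.

```python
def nested_cross(stored_bags, last_relation):
    """Cross the stored bags with last_relation, rescanning it per prefix.

    Returns the result tuples and the number of full scans of last_relation.
    """
    results = []
    passes = 0  # how many times last_relation is scanned from start to end

    def expand(prefix, remaining):
        nonlocal passes
        if not remaining:
            # One combination of tuples from the stored bags is fixed;
            # scan the whole last relation to pair it with every tuple.
            passes += 1
            for t in last_relation:
                results.append(prefix + (t,))
            return
        for t in remaining[0]:
            expand(prefix + (t,), remaining[1:])

    expand((), stored_bags)
    return results, passes


bag1 = [("a", 1)]
bag2 = [("a", "x"), ("a", "y")]
bag3 = [("a", True), ("a", False)]

rows, passes = nested_cross([bag1, bag2], bag3)
print(passes)     # k1 * k2 = 1 * 2 scans of bag3
print(len(rows))  # k1 * k2 * k3 result tuples
```

With the three bags from the example, bag3 is scanned once for each 
combination of tuples from bag1 and bag2.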

On the other hand, getNext() can only go through the last relation once. Hence 
I think the n bags are inevitable. What do you think about this? Correct me if 
I'm wrong.
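
As a rough analogy for the single-pass constraint (an assumption for 
illustration, not Pig's actual operator API), a Python generator behaves like 
a getNext()-style source: once it has been read through, a second scan yields 
nothing. The name get_next_stream is hypothetical.

```python
def get_next_stream(tuples):
    """Hypothetical stand-in for a getNext()-style source: one pass only."""
    for t in tuples:
        yield t


stream = get_next_stream([("a", True), ("a", False)])

first_pass = list(stream)   # consumes every tuple in the stream
second_pass = list(stream)  # the stream is exhausted, so this is empty

print(first_pass)
print(second_pass)
```

A relation that has to be scanned more than once therefore needs to be 
buffered somewhere before the second scan.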

By the way, this issue reminds me of a problem: computing the cross product is 
expensive, especially when the number of relations is large. I'm not a 
database specialist. Does anybody know of smarter algorithms that reduce the 
number of scans over the relations?


> Improve nested cross to stream one relation
> -------------------------------------------
>
>                 Key: PIG-2163
>                 URL: https://issues.apache.org/jira/browse/PIG-2163
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.10
>            Reporter: Daniel Dai
>            Assignee: Zhijie Shen
>             Fix For: 0.10
>
>
> PIG-1916 added nested cross support for Pig. One optimization is that, 
> instead of materializing all bags before producing the result, we can 
> stream one of the inputs to save memory.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
