[ https://issues.apache.org/jira/browse/PIG-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092675#comment-13092675 ]
Zhijie Shen commented on PIG-2163:
----------------------------------

Hi Daniel,

If I understand your suggestion correctly, you mean that when crossing n relations, the first n-1 relations are stored temporarily in n-1 bags, and the last relation emits its tuples one at a time (through getNext()), each of which is crossed with the stored bags.

However, the problem is that the tuples in the last relation will be iterated not once but k1*k2*...*k(n-1) times, where ki is the number of tuples in the i-th relation. For example, if there are three relations:

bag1: {(a, 1)}
bag2: {(a, x), (a, y)}
       1st ^   2nd ^
bag3: {(a, true), (a, false)}

bag3 will be iterated twice: first to cross with (a, x) and second to cross with (a, y). On the other hand, getNext() can only go through the last relation once. Hence I think the n bags are inevitable. What do you think? Correct me if I'm wrong.

By the way, this issue reminds me that computing a cross product is expensive, especially when the number of relations is large. I'm not a database specialist. Does anybody know of smarter algorithms that reduce the number of scans over the relations?

> Improve nested cross to stream one relation
> -------------------------------------------
>
>                 Key: PIG-2163
>                 URL: https://issues.apache.org/jira/browse/PIG-2163
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.10
>            Reporter: Daniel Dai
>            Assignee: Zhijie Shen
>             Fix For: 0.10
>
>
> PIG-1916 added nested cross support for PIG. One optimization is instead of
> materialize all bags before producing result, we can stream one of the input
> to save on memory.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
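
The iteration count described in the comment can be checked with a small sketch. This is plain Python, not Pig code: the function name cross_with_last_innermost and the list-of-tuples representation of bags are assumptions made only for illustration, and the sketch assumes the loop order the comment describes, with the last relation in the innermost loop.

```python
from itertools import product

# The three example relations from the comment, modelled as lists of tuples.
bag1 = [("a", 1)]
bag2 = [("a", "x"), ("a", "y")]
bag3 = [("a", True), ("a", False)]


def cross_with_last_innermost(outer_bags, last_bag):
    """Cross product with the last relation as the innermost loop.

    Returns the result tuples and the number of full passes over last_bag.
    """
    results = []
    passes_over_last = 0
    # One combination per element of the cross product of the outer bags.
    for prefix in product(*outer_bags):
        passes_over_last += 1  # last_bag is scanned once per combination
        for t in last_bag:
            results.append(prefix + (t,))
    return results, passes_over_last


results, passes = cross_with_last_innermost([bag1, bag2], bag3)
print(passes)        # full passes over bag3
print(len(results))  # size of the cross product
```

With this loop order the number of passes over the last relation equals the product of the sizes of the other relations (here 1 * 2), which is the k1*k2*...*k(n-1) figure from the comment.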