Cheolsoo Park created PIG-3395:
----------------------------------

             Summary: Large filter expression makes Pig hang
                 Key: PIG-3395
                 URL: https://issues.apache.org/jira/browse/PIG-3395
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Cheolsoo Park
            Assignee: Cheolsoo Park
             Fix For: 0.12


Currently, partition filter push-down is quite costly. For example, a filter with
many nested or/and expressions makes Pig hang:
{code}
base = load '<partitioned table>' using MyStorage();
filt = filter base by
(dateint == 20130719 and batchid == 'merged_1' and hour IN (19,20,21,22,23))
or
(dateint == 20130720 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8))
or
(dateint == 20130720 and batchid == 'merged_2' and hour == 7)
or
(dateint == 20130720 and batchid == 'merged_1' and hour IN (9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
or
(dateint == 20130721 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
or
(dateint == 20130722 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16));
dump filt;
{code}
Note that the IN operator is converted to nested ORs by the Pig parser.
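
To make the effect of that conversion concrete, here is a minimal Java sketch of expanding an IN list into a chain of binary ORs. The types Expr, Eq, and Or and the method expandIn are hypothetical and are not Pig's classes. The left-nested shape is also an assumption, since this report does not say which way the parser nests.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: Expr, Eq, and Or are hypothetical types,
// not classes from Pig's logical expression plan.
class InExpansion {
    interface Expr {}

    static final class Eq implements Expr {
        final String column;
        final int value;
        Eq(String column, int value) { this.column = column; this.value = value; }
    }

    static final class Or implements Expr {
        final Expr left;
        final Expr right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    // Rewrites `column IN (v0, v1, ..., vn)` as a left-nested chain of binary ORs:
    // ((column == v0 or column == v1) or column == v2) ...
    // The nesting direction is an assumption made for this sketch.
    static Expr expandIn(String column, List<Integer> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("IN list must not be empty");
        }
        Expr result = new Eq(column, values.get(0));
        for (int i = 1; i < values.size(); i++) {
            result = new Or(result, new Eq(column, values.get(i)));
        }
        return result;
    }

    // Height of the tree: every nested OR adds one level for a recursive visitor.
    static int depth(Expr e) {
        if (e instanceof Or) {
            Or or = (Or) e;
            return 1 + Math.max(depth(or.left), depth(or.right));
        }
        return 1;
    }

    // Total number of nodes (OR nodes plus equality leaves).
    static int countNodes(Expr e) {
        if (e instanceof Or) {
            Or or = (Or) e;
            return 1 + countNodes(or.left) + countNodes(or.right);
        }
        return 1;
    }

    public static void main(String[] args) {
        List<Integer> hours = new ArrayList<>();
        for (int h = 0; h < 24; h++) {
            hours.add(h);
        }
        Expr expanded = expandIn("hour", hours);
        System.out.println("depth = " + depth(expanded));      // depth = 24
        System.out.println("nodes = " + countNodes(expanded)); // nodes = 47
    }
}
```

Under these assumptions, a single 24-value IN list alone becomes a tree 24 levels deep, before it is combined with the surrounding and/or clauses.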

Looking at the thread dump, I found that it creates almost 60 stack frames, which
puts a heavy load on the JVM. (I will attach the full stack trace.)
{code}
<repeated ...>
at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:237)
at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:214)
at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:211)
at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:108)
{code}
Although the filter expression can be simplified, it seems possible to make 
PColFilterExtractor more efficient.
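
As one possible way to simplify the filter by hand, clauses that share the same dateint and batchid can be merged by taking the union of their hour lists. The Java sketch below does this for the six clauses in the example. ClauseMerger and its methods are hypothetical helpers written for illustration and are not part of Pig.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Illustrative sketch only: merges OR'ed clauses of the form
// (dateint == D and batchid == B and hour IN (...)) that share the same D and B.
class ClauseMerger {
    // Key is "dateint/batchid"; value is the union of all hours listed for that key.
    private final Map<String, TreeSet<Integer>> hoursByPartition = new LinkedHashMap<>();

    // Records one clause whose hour list is the inclusive range fromHour..toHour.
    void addClause(int dateint, String batchid, int fromHour, int toHour) {
        TreeSet<Integer> hours =
                hoursByPartition.computeIfAbsent(dateint + "/" + batchid, k -> new TreeSet<>());
        for (int h = fromHour; h <= toHour; h++) {
            hours.add(h);
        }
    }

    Map<String, TreeSet<Integer>> merged() {
        return hoursByPartition;
    }

    // The six clauses from the filter in this report.
    static ClauseMerger fromIssueExample() {
        ClauseMerger m = new ClauseMerger();
        m.addClause(20130719, "merged_1", 19, 23);
        m.addClause(20130720, "merged_1", 0, 8);
        m.addClause(20130720, "merged_2", 7, 7);
        m.addClause(20130720, "merged_1", 9, 23);
        m.addClause(20130721, "merged_1", 0, 23);
        m.addClause(20130722, "merged_1", 0, 16);
        return m;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, TreeSet<Integer>> e : fromIssueExample().merged().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " hour(s)");
        }
    }
}
```

With this grouping the six clauses collapse to five, and the two 20130720/merged_1 clauses together cover hours 0 through 23. If hour only ever takes the values 0 to 23 (an assumption about the table, not something stated here), the hour test could be dropped entirely for that group and for 20130721/merged_1.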

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
