Cheolsoo Park created PIG-3395:
----------------------------------

             Summary: Large filter expression makes Pig hang
                 Key: PIG-3395
                 URL: https://issues.apache.org/jira/browse/PIG-3395
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Cheolsoo Park
            Assignee: Cheolsoo Park
             Fix For: 0.12

Currently, partition filter push down is quite costly. For example, if the filter contains many nested or/and expressions, Pig hangs:

{code}
base = load '<partitioned table>' using MyStorage();
filt = filter base by
    (dateint == 20130719 and batchid == 'merged_1' and hour IN (19,20,21,22,23)) or
    (dateint == 20130720 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8)) or
    (dateint == 20130720 and batchid == 'merged_2' and hour == 7) or
    (dateint == 20130720 and batchid == 'merged_1' and hour IN (9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)) or
    (dateint == 20130721 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)) or
    (dateint == 20130722 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16));
dump filt;
{code}

Note that the IN operator is converted to nested ORs by the Pig parser.

Looking at the thread dump, I found that it creates almost 60 stack frames and makes the JVM suffer. (I will attach the full stack trace.)

{code}
<repeated ...>
    at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
    at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:237)
    at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
    at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:214)
    at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
    at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:211)
    at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:108)
{code}

Although the filter expression can be simplified, it seems possible to make PColFilterExtractor more efficient.
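
To give a rough sense of how much the IN-to-OR conversion enlarges the expression tree, here is a minimal Python sketch. It is not Pig code and does not reproduce PColFilterExtractor. The tuple-based node layout, the helper names (eq, expand_in, and_all, depth, count_nodes), and the assumption that the ORs are nested as a left-deep chain are all illustrative; the shape the Pig parser actually produces may differ.

```python
from functools import reduce


def eq(column, value):
    """Leaf node representing `column == value`."""
    return ("==", column, value)


def expand_in(column, values):
    """Rewrite `column IN (v1, v2, ...)` as a chain of binary ORs.

    Assumption: the chain is left-deep, i.e. ((v1 or v2) or v3) or ...
    """
    first, rest = values[0], values[1:]
    return reduce(lambda left, v: ("or", left, eq(column, v)), rest, eq(column, first))


def and_all(*terms):
    """Combine terms with binary ANDs, also left-deep."""
    return reduce(lambda left, right: ("and", left, right), terms)


def depth(node):
    """Number of nodes on the longest root-to-leaf path."""
    if node[0] in ("and", "or"):
        return 1 + max(depth(node[1]), depth(node[2]))
    return 1


def count_nodes(node):
    """Total number of nodes (operators plus leaves) in the tree."""
    if node[0] in ("and", "or"):
        return 1 + count_nodes(node[1]) + count_nodes(node[2])
    return 1


# The largest disjunct from the report: hour IN (0, ..., 23).
in_expr = expand_in("hour", list(range(24)))
disjunct = and_all(eq("dateint", 20130721), eq("batchid", "merged_1"), in_expr)

print("IN subtree:", count_nodes(in_expr), "nodes, depth", depth(in_expr))
print("disjunct:  ", count_nodes(disjunct), "nodes, depth", depth(disjunct))
```

Under these assumptions, a 24-value IN alone becomes 24 equality leaves joined by 23 binary ORs, so a visitor that recurses once per operator has to descend through a long chain for every such disjunct.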