[jira] Commented: (PIG-849) Local engine loses records in splits
[ https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720306#action_12720306 ]

Gunther Hagleitner commented on PIG-849:
----------------------------------------

Same errors as before. Ran manually and passed. The issue with the automated patch testing still seems to be there.

> Local engine loses records in splits
> -------------------------------------
>
>          Key: PIG-849
>          URL: https://issues.apache.org/jira/browse/PIG-849
>      Project: Pig
>   Issue Type: Bug
>     Reporter: Gunther Hagleitner
>  Attachments: local_engine.patch, local_engine.patch
>
> When there is a split in the physical plan, records can be dropped in certain circumstances.
> The local split operator puts all records in a databag and turns over iterators to the POSplitOutput operators. The problem is that the local split also adds STATUS_NULL records to the bag. That will cause the databag's iterator to prematurely return false on the hasNext call (so a STATUS_NULL becomes a STATUS_EOP in the split output operators).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-849) Local engine loses records in splits
[ https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719968#action_12719968 ]

Gunther Hagleitner commented on PIG-849:
----------------------------------------

The new patch adds a unit test; otherwise it's the same.
[jira] Updated: (PIG-849) Local engine loses records in splits
[ https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-849:
-----------------------------------

    Attachment: local_engine.patch
[jira] Updated: (PIG-849) Local engine loses records in splits
[ https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-849:
-----------------------------------

    Status: Patch Available (was: Open)
[jira] Updated: (PIG-849) Local engine loses records in splits
[ https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-849:
-----------------------------------

    Status: Patch Available (was: Open)
[jira] Created: (PIG-849) Local engine loses records in splits
Local engine loses records in splits
------------------------------------

         Key: PIG-849
         URL: https://issues.apache.org/jira/browse/PIG-849
     Project: Pig
  Issue Type: Bug
    Reporter: Gunther Hagleitner
 Attachments: local_engine.patch

When there is a split in the physical plan, records can be dropped in certain circumstances.

The local split operator puts all records in a databag and turns over iterators to the POSplitOutput operators. The problem is that the local split also adds STATUS_NULL records to the bag. That will cause the databag's iterator to prematurely return false on the hasNext call (so a STATUS_NULL becomes a STATUS_EOP in the split output operators).
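The failure mode in the description can be illustrated with a small, self-contained Java sketch (made-up names; this is not the actual POSplit/POSplitOutput code): a bag that interleaves null status markers with real records makes a naive consumer stop at the first marker, silently dropping everything after it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SplitBagSketch {
    // Buggy behavior: a null entry (standing in for a STATUS_NULL record)
    // is treated as end-of-input, so records after it are lost.
    static List<String> drainNaive(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            String rec = it.next();
            if (rec == null) break;   // bug: null marker mistaken for end-of-plan
            out.add(rec);
        }
        return out;
    }

    // Correct behavior: skip null markers and keep consuming.
    static List<String> drainCorrect(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            String rec = it.next();
            if (rec == null) continue; // null marker is not end of input
            out.add(rec);
        }
        return out;
    }

    public static void main(String[] args) {
        // The "bag" holds real records interleaved with a null status marker.
        List<String> bag = new ArrayList<>();
        bag.add("r1");
        bag.add(null);  // stands in for a STATUS_NULL record
        bag.add("r2");

        System.out.println(drainNaive(bag.iterator()));   // [r1]  -- r2 is lost
        System.out.println(drainCorrect(bag.iterator())); // [r1, r2]
    }
}
```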
[jira] Updated: (PIG-849) Local engine loses records in splits
[ https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-849:
-----------------------------------

    Attachment: local_engine.patch
[jira] Updated: (PIG-839) incorrect return codes on failure when using -f or -e flags
[ https://issues.apache.org/jira/browse/PIG-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-839:
-----------------------------------

    Status: Patch Available (was: Open)

> incorrect return codes on failure when using -f or -e flags
> -----------------------------------------------------------
>
>          Key: PIG-839
>          URL: https://issues.apache.org/jira/browse/PIG-839
>      Project: Pig
>   Issue Type: Bug
>     Reporter: Gunther Hagleitner
>     Assignee: Gunther Hagleitner
>  Attachments: fix_return_code.patch
>
> To repro: pig -e "a = load '' ; b = stream a through \`false\` ; store b into '';"
> Both the -e and -f flags do not return the right code upon exit. Running the script w/o using -f works fine.
[jira] Updated: (PIG-839) incorrect return codes on failure when using -f or -e flags
[ https://issues.apache.org/jira/browse/PIG-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-839:
-----------------------------------

    Attachment: fix_return_code.patch
[jira] Created: (PIG-839) incorrect return codes on failure when using -f or -e flags
incorrect return codes on failure when using -f or -e flags
-----------------------------------------------------------

         Key: PIG-839
         URL: https://issues.apache.org/jira/browse/PIG-839
     Project: Pig
  Issue Type: Bug
    Reporter: Gunther Hagleitner
    Assignee: Gunther Hagleitner

To repro: pig -e "rmf keep99; a = load '' ; b = stream a through \`false\` ; store b into '';"

Both the -e and -f flags do not return the right code upon exit. Running the script w/o using -f works fine.
[jira] Updated: (PIG-839) incorrect return codes on failure when using -f or -e flags
[ https://issues.apache.org/jira/browse/PIG-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-839:
-----------------------------------

    Description:
To repro: pig -e "a = load '' ; b = stream a through \`false\` ; store b into '';"
Both the -e and -f flags do not return the right code upon exit. Running the script w/o using -f works fine.

  was:
To repro: pig -e "rmf keep99; a = load '' ; b = stream a through \`false\` ; store b into '';"
Both the -e and -f flags do not return the right code upon exit. Running the script w/o using -f works fine.
[jira] Created: (PIG-818) Explain doesn't handle PODemux properly
Explain doesn't handle PODemux properly
---------------------------------------

         Key: PIG-818
         URL: https://issues.apache.org/jira/browse/PIG-818
     Project: Pig
  Issue Type: Bug
    Reporter: Gunther Hagleitner
    Assignee: Gunther Hagleitner
 Attachments: explain.patch

The PODemux operator has nested plans but they are not expanded in the -dot version of explain.

Also, both split and demux are displayed as clusters of nodes, but it really makes more sense to just show them as multi-output operators.
[jira] Updated: (PIG-818) Explain doesn't handle PODemux properly
[ https://issues.apache.org/jira/browse/PIG-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-818:
-----------------------------------

    Status: Patch Available (was: Open)
[jira] Updated: (PIG-818) Explain doesn't handle PODemux properly
[ https://issues.apache.org/jira/browse/PIG-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-818:
-----------------------------------

    Attachment: explain.patch
[jira] Updated: (PIG-811) Globs with "?" in the pattern are broken in local mode
[ https://issues.apache.org/jira/browse/PIG-811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-811:
-----------------------------------

    Status: Patch Available (was: Open)

> Globs with "?" in the pattern are broken in local mode
> ------------------------------------------------------
>
>              Key: PIG-811
>              URL: https://issues.apache.org/jira/browse/PIG-811
>          Project: Pig
>       Issue Type: Bug
> Affects Versions: 0.3.0
>         Reporter: Olga Natkovich
>         Assignee: Gunther Hagleitner
>          Fix For: 0.3.0
>      Attachments: local_engine_glob.patch
>
> Script:
> a = load 'studenttab10?';
> dump a;
> Actual file name: studenttab10k
> Stack trace:
> ERROR 2081: Unable to setup the load function.
> org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:128)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
>     at org.apache.pig.backend.local.executionengine.LocalPigLauncher.runPipeline(LocalPigLauncher.java:129)
>     at org.apache.pig.backend.local.executionengine.LocalPigLauncher.launchPig(LocalPigLauncher.java:102)
>     at org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:163)
>     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:763)
>     at org.apache.pig.PigServer.execute(PigServer.java:756)
>     at org.apache.pig.PigServer.access$100(PigServer.java:88)
>     at org.apache.pig.PigServer$Graph.execute(PigServer.java:923)
>     at org.apache.pig.PigServer.executeBatch(PigServer.java:242)
>     at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:110)
>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:151)
>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:123)
>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
>     at org.apache.pig.Main.main(Main.java:372)
> Caused by: java.io.IOException: file:/home/y/share/pigtest/local/data/singlefile/studenttab10 does not exist
>     at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:188)
>     at org.apache.pig.impl.io.FileLocalizer.openLFSFile(FileLocalizer.java:244)
>     at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:299)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:96)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:124)
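A minimal sketch of what correct "?" handling looks like (hypothetical helper names; this is not the actual FileLocalizer code): the glob is translated to a regex and used to filter candidate file names, instead of the pattern being opened verbatim as a path (which produces the "studenttab10 does not exist" failure above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class GlobSketch {
    // Translate a simple glob ('?' = exactly one char, '*' = any run) to a regex.
    static Pattern globToRegex(String glob) {
        StringBuilder re = new StringBuilder();
        for (char c : glob.toCharArray()) {
            switch (c) {
                case '?': re.append('.');  break; // exactly one character
                case '*': re.append(".*"); break; // any run of characters
                default:  re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(re.toString());
    }

    // Keep only the candidate file names that the glob matches.
    static List<String> expand(String glob, List<String> candidates) {
        Pattern p = globToRegex(glob);
        List<String> out = new ArrayList<>();
        for (String name : candidates) {
            if (p.matcher(name).matches()) out.add(name);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> files = List.of("studenttab10k", "studenttab9k");
        // 'studenttab10?' should match 'studenttab10k' (one trailing char),
        // not be opened literally as a file named 'studenttab10'.
        System.out.println(expand("studenttab10?", files)); // [studenttab10k]
    }
}
```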
[jira] Updated: (PIG-811) Globs with "?" in the pattern are broken in local mode
[ https://issues.apache.org/jira/browse/PIG-811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-811:
-----------------------------------

    Attachment: local_engine_glob.patch

This patch should fix the problem. Globs are working again in local engine mode.
[jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code
[ https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709603#action_12709603 ]

Gunther Hagleitner commented on PIG-777:
----------------------------------------

There is no new code. I just fixed an indentation issue in addition to the new log message.

> Code refactoring: Create optimization out of store/load post processing code
> ----------------------------------------------------------------------------
>
>          Key: PIG-777
>          URL: https://issues.apache.org/jira/browse/PIG-777
>      Project: Pig
>   Issue Type: Improvement
>     Reporter: Gunther Hagleitner
>  Attachments: log_message.patch
>
> The postProcessing method in the pig server checks whether a logical graph contains stores to and loads from the same location. If so, it will either connect the store and load, or optimize by throwing out the load and connecting the store predecessor with the successor of the load.
> Ideally the introduction of the store and load connection should happen in the query compiler, while the optimization should then happen in a separate optimizer step as part of the optimizer framework.
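The optimization described in the quoted issue can be sketched on a toy operator graph (all names here are hypothetical; the real logical-plan classes are more involved): when a store and a subsequent load share a location, the load is dropped and the store's predecessor is connected directly to the load's successor.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StoreLoadSketch {
    // Toy plan representation: operator name -> list of successor names.
    static Map<String, List<String>> optimize(Map<String, List<String>> plan,
                                              String store, String load) {
        // Find the store's predecessor.
        String pred = null;
        for (Map.Entry<String, List<String>> e : plan.entrySet()) {
            if (e.getValue().contains(store)) pred = e.getKey();
        }
        List<String> loadSuccs = plan.getOrDefault(load, List.of());

        // Throw out the load; connect the store's predecessor
        // to the load's successors (the store itself stays).
        Map<String, List<String>> out = new LinkedHashMap<>(plan);
        out.remove(load);
        List<String> succs = new ArrayList<>(out.get(pred));
        succs.addAll(loadSuccs);
        out.put(pred, succs);
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        plan.put("foreach", List.of("store:bla"));  // store f into 'bla'
        plan.put("load:bla", List.of("order"));     // f1 = load 'bla'
        plan.put("order", List.of());

        // After the rewrite, 'foreach' feeds both the store and 'order',
        // and 'load:bla' is gone from the plan.
        System.out.println(optimize(plan, "store:bla", "load:bla"));
    }
}
```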
[jira] Updated: (PIG-810) Scripts failing with NPE
[ https://issues.apache.org/jira/browse/PIG-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-810:
-----------------------------------

    Attachment: null_pointer.patch

Ran into the same issue. I have a similar fix, but I also added a unit test, in case you're interested.

> Scripts failing with NPE
> ------------------------
>
>              Key: PIG-810
>              URL: https://issues.apache.org/jira/browse/PIG-810
>          Project: Pig
>       Issue Type: Bug
> Affects Versions: 0.3.0
>         Reporter: Alan Gates
>         Assignee: Alan Gates
>          Fix For: 0.3.0
>      Attachments: null_pointer.patch, PIG-810.patch
>
> Scripts such as:
> {code}
> a = load 'nosuchfile';
> b = store a into 'bla';
> {code}
> are failing with
> {code}
> ERROR 2043: Unexpected error during execution.
> org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution.
>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:275)
>     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:757)
>     at org.apache.pig.PigServer.execute(PigServer.java:750)
>     at org.apache.pig.PigServer.access$100(PigServer.java:88)
>     at org.apache.pig.PigServer$Graph.execute(PigServer.java:917)
>     at org.apache.pig.PigServer.executeBatch(PigServer.java:242)
>     at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:110)
>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:151)
>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:123)
>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
>     at org.apache.pig.Main.main(Main.java:372)
> Caused by: java.lang.NullPointerException
>     at org.apache.pig.tools.pigstats.PigStats.accumulateMRStats(PigStats.java:175)
>     at org.apache.pig.tools.pigstats.PigStats.accumulateStats(PigStats.java:94)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:148)
>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:262)
>     ... 10 more
> {code}
[jira] Updated: (PIG-777) Code refactoring: Create optimization out of store/load post processing code
[ https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-777:
-----------------------------------

    Status: Patch Available (was: Open)
[jira] Updated: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-781:
-----------------------------------

    Attachment: partial_failure.patch

Fixing the findbugs warning.

> Error reporting for failed MR jobs
> ----------------------------------
>
>          Key: PIG-781
>          URL: https://issues.apache.org/jira/browse/PIG-781
>      Project: Pig
>   Issue Type: Improvement
>     Reporter: Gunther Hagleitner
>  Attachments: partial_failure.patch, partial_failure.patch, partial_failure.patch, partial_failure.patch
>
> If we have multiple MR jobs to run and some of them fail, the behavior of the system is to not stop on the first failure but to keep going. That way jobs that do not depend on the failed job might still succeed.
> The question is how best to report this scenario to a user. How do we tell which jobs failed and which didn't?
> One way could be to tie jobs to stores and report which store locations won't have data and which ones do.
[jira] Updated: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-781:
-----------------------------------

    Attachment: partial_failure.patch

The latest patch is against the latest code base. It also includes the test with the "done" file. Finally, I was wrong about the log files: it's already the case that all the errors are logged into the same pig file.
[jira] Updated: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-781:
-----------------------------------

    Status: Patch Available (was: Open)
[jira] Updated: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-781:
-----------------------------------

    Attachment: partial_failure.patch

The new patch does the same as above (report on the failed and succeeded jobs), but also:

* Returns a list of exec jobs, one for each store, so that embedded programs can iterate through results and determine successes and failures.
* Adds a flag "-F" or "-stop_on_failure" that causes an exception on the first failure, which will cause the processing to stop.
* Returns 2 when all jobs fail or when the stop_on_failure flag is specified. Returns 3 if some jobs passed and others failed.
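The exit-code scheme listed above can be condensed into a small sketch (a hypothetical helper, not the actual patch code): 0 when everything succeeds, 2 when all jobs fail or -stop_on_failure trips, 3 for a mix of successes and failures.

```java
public class ReturnCodeSketch {
    // Map job outcomes to a process exit code per the scheme in the comment:
    // 0 = all succeeded, 2 = all failed (or stop_on_failure tripped),
    // 3 = mixed success and failure.
    static int exitCode(int succeeded, int failed, boolean stopOnFailure) {
        if (failed == 0) return 0;
        if (succeeded == 0 || stopOnFailure) return 2;
        return 3;
    }

    public static void main(String[] args) {
        System.out.println(exitCode(3, 0, false)); // 0: everything stored
        System.out.println(exitCode(0, 2, false)); // 2: all jobs failed
        System.out.println(exitCode(2, 1, false)); // 3: partial failure
        System.out.println(exitCode(2, 1, true));  // 2: -stop_on_failure set
    }
}
```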
[jira] Issue Comment Edited: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705951#action_12705951 ]

Gunther Hagleitner edited comment on PIG-781 at 5/5/09 1:19 AM:
---------------------------------------------------------------

This fix associates stores with MR jobs. At the end of the execution it will print out which stores have passed and which ones have failed. Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: "/user/hagleitn/baz"
Successfully stored result in: "/user/hagleitn/bar"
Successfully stored result in: "/user/hagleitn/foo"
Some jobs have failed!
{noformat}

  was (Author: hagleitn):
This fix associates stores with MR jobs. At the end of the execution it will print out which stores have passed and which ones have failed. Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: "hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/baz"
Successfully stored result in: "hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/bar"
Successfully stored result in: "hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/foo"
Some jobs have failed!
{noformat}
[jira] Updated: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-781:
-----------------------------------

    Attachment: partial_failure.patch

This fix associates stores with MR jobs. At the end of the execution it will print out which stores have passed and which ones have failed. Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: "hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/baz"
Successfully stored result in: "hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/bar"
Successfully stored result in: "hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/foo"
Some jobs have failed!
{noformat}
[jira] Updated: (PIG-789) coupling load and store in script no longer works
[ https://issues.apache.org/jira/browse/PIG-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-789: --- Attachment: dump_bug.patch Both dump (openIterator) and illustrate (getExamples) show this problem. dump_bug.patch contains a fix; the patch is for trunk. > coupling load and store in script no longer works > - > > Key: PIG-789 > URL: https://issues.apache.org/jira/browse/PIG-789 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.3.0 >Reporter: Alan Gates >Assignee: Gunther Hagleitner > Attachments: dump_bug.patch > > > Many users' pig scripts do something like this: > a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > c = filter a by age > 500; > e = group c by (name, age); > f = foreach e generate group, COUNT($1); > store f into 'bla'; > f1 = load 'bla'; > g = order f1 by $1; > dump g; > With the inclusion of the multi-query phase2 patch this appears to no longer > work. You get an error: > 2009-04-28 18:24:50,776 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2100: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/gates/bla does not exist. > We shouldn't be checking for bla's existence here because it will be created > eventually by the script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function
[ https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704011#action_12704011 ] Gunther Hagleitner commented on PIG-759: Looking at the code, it seems you can already specify columns in the load statement: {noformat} table = load 'hbase://foo' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('bar:c bar:d') as (c:int, d:int); {noformat} Is the suggestion to change the syntax of that? Or did I misunderstand the code? > HBaseStorage scheme for Load/Slice function > --- > > Key: PIG-759 > URL: https://issues.apache.org/jira/browse/PIG-759 > Project: Pig > Issue Type: Bug >Reporter: Gunther Hagleitner > > We would like to change the HBaseStorage function to use a scheme when > loading a table in pig. The scheme we are thinking of is: "hbase". So in > order to load an hbase table in a pig script the statement should read: > {noformat} > table = load 'hbase://' using HBaseStorage(); > {noformat} > If the scheme is omitted pig would assume the tablename to be an hdfs path > and the storage function would use the last component of the path as a table > name and output a warning. > For details on why see jira issue: PIG-758 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
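For illustration, the space-separated column argument shown above ('bar:c bar:d') could be split into (column family, qualifier) pairs along these lines. This is a hedged sketch; HBaseStorage's real parsing logic may differ:

```python
# Sketch: turn a space-separated HBase column spec like 'bar:c bar:d'
# into (family, qualifier) tuples.

def parse_columns(spec):
    pairs = []
    for col in spec.split():
        family, _, qualifier = col.partition(":")  # split on first ':'
        pairs.append((family, qualifier))
    return pairs
```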
[jira] Updated: (PIG-777) Code refactoring: Create optimization out of store/load post processing code
[ https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-777: --- Attachment: log_message.patch log_message.patch adds the message "Removing unnecessary load operation ..." when we remove the load from the logical plan. > Code refactoring: Create optimization out of store/load post processing code > > > Key: PIG-777 > URL: https://issues.apache.org/jira/browse/PIG-777 > Project: Pig > Issue Type: Improvement >Reporter: Gunther Hagleitner > Attachments: log_message.patch > > > The postProcessing method in the pig server checks whether a logical graph > contains stores to and loads from the same location. If so, it will either > connect the store and load, or optimize by throwing out the load and > connecting the store predecessor with the successor of the load. > Ideally the introduction of the store and load connection should happen in > the query compiler, while the optimization should then happen in a separate > optimizer step as part of the optimizer framework. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code
[ https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704002#action_12704002 ] Gunther Hagleitner commented on PIG-777: David, Per PIG-627 the first example you gave will result in a single map reduce job that is going to process both store operations. No duplication of steps A thru D. So, yes, you shouldn't need to introduce "D = load". Also PIG-627 introduced an optimization that will throw the "D = load" out - basically transforming your second example into the first. This bug is mostly about the way the optimization is written. Some code should be moved around to align it with the optimization framework. Adding a log message when this happens is a good idea though. Let me add that. > Code refactoring: Create optimization out of store/load post processing code > > > Key: PIG-777 > URL: https://issues.apache.org/jira/browse/PIG-777 > Project: Pig > Issue Type: Improvement >Reporter: Gunther Hagleitner > > The postProcessing method in the pig server checks whether a logical graph > contains stores to and loads from the same location. If so, it will either > connect the store and load, or optimize by throwing out the load and > connecting the store predecessor with the successor of the load. > Ideally the introduction of the store and load connection should happen in > the query compiler, while the optimization should then happen in a separate > optimizer step as part of the optimizer framework. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
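The load-elimination rewrite discussed in this issue can be sketched on a toy plan. This uses plain adjacency lists rather than Pig's LogicalPlan classes, and the helper name is hypothetical:

```python
# Sketch: when a store and a later load hit the same location, throw out the
# load and connect the store's predecessor to the load's successors, so the
# whole pipeline can run as one plan.

def remove_redundant_load(edges, store, load):
    """edges: dict node -> list of successor nodes; returns a rewritten copy."""
    preds = [n for n, succs in edges.items() if store in succs]
    # Drop the load node and any edges pointing at it.
    new_edges = {n: [s for s in succs if s != load]
                 for n, succs in edges.items() if n != load}
    # Wire each store predecessor to the load's former successors.
    for p in preds:
        for s in edges.get(load, []):
            if s not in new_edges[p]:
                new_edges[p] = new_edges[p] + [s]
    return new_edges
```

For the PIG-789 style script, `F -> store('bla')` plus `load('bla') -> order` becomes `F -> {store, order}` with the load gone.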
[jira] Updated: (PIG-652) Need to give user control of OutputFormat
[ https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-652: --- Attachment: PIG-652-v5.patch v5 patch includes the stuff for multiquery. > Need to give user control of OutputFormat > - > > Key: PIG-652 > URL: https://issues.apache.org/jira/browse/PIG-652 > Project: Pig > Issue Type: New Feature > Components: impl >Affects Versions: 0.2.0 >Reporter: Alan Gates >Assignee: Pradeep Kamath > Attachments: PIG-652-v2.patch, PIG-652-v3.patch, PIG-652-v4.patch, > PIG-652-v5.patch, PIG-652.patch > > > Pig currently allows users some control over InputFormat via the Slicer and > Slice interfaces. It does not allow any control over OutputFormat and > RecordWriter interfaces. It just allows the user to implement a storage > function that controls how the data is serialized. For hadoop tables, we > will need to allow custom OutputFormats that prepare output information and > objects needed by a Table store function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-781) Error reporting for failed MR jobs
Error reporting for failed MR jobs -- Key: PIG-781 URL: https://issues.apache.org/jira/browse/PIG-781 Project: Pig Issue Type: Improvement Reporter: Gunther Hagleitner If we have multiple MR jobs to run and some of them fail, the behavior of the system is not to stop on the first failure but to keep going. That way jobs that do not depend on the failed job might still succeed. The question is how best to report this scenario to a user. How do we tell which jobs failed and which didn't? One way could be to tie jobs to stores and report which store locations won't have data and which ones do. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-780) Code refactoring: PlanPrinters
Code refactoring: PlanPrinters -- Key: PIG-780 URL: https://issues.apache.org/jira/browse/PIG-780 Project: Pig Issue Type: Improvement Reporter: Gunther Hagleitner Priority: Minor There seems to be quite a bit of duplicated code/functionality with all the PlanPrinters in the system. It would make things easier if that were consolidated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-779) Warning from javacc
Warning from javacc --- Key: PIG-779 URL: https://issues.apache.org/jira/browse/PIG-779 Project: Pig Issue Type: Improvement Reporter: Gunther Hagleitner This warning needs fixing: Reading from file .../src/org/apache/pig/tools/pigscript/parser/PigScriptParser.jj . . . [javacc] Warning: Choice conflict in (...)* construct at line 560, column 9. [javacc] Expansion nested within construct and expansion following construct [javacc] have common prefixes, one of which is: "-param" [javacc] Consider using a lookahead of 2 or more for nested expansion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
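To illustrate why javacc suggests a lookahead of 2 here: when the body of a `(...)*` loop and the expansion that follows the loop both start with the same token ("-param"), one token of lookahead cannot decide whether to iterate again; peeking one token further can. A toy Python sketch with a made-up grammar, not the actual PigScriptParser rules:

```python
# Made-up grammar to show the conflict:
#   loop body:            "-param" NAME=VALUE      (repeatable)
#   following expansion:  "-param" FILENAME        (optional)
# Both start with "-param", so the parser peeks at the SECOND token
# (lookahead of 2) to decide whether to stay in the loop.

def parse(tokens):
    i, settings, param_file = 0, [], None
    while i + 1 < len(tokens) and tokens[i] == "-param" and "=" in tokens[i + 1]:
        settings.append(tokens[i + 1])   # second token has '=': loop body
        i += 2
    if i + 1 < len(tokens) and tokens[i] == "-param":
        param_file = tokens[i + 1]       # same prefix, different continuation
        i += 2
    return settings, param_file
```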
[jira] Created: (PIG-778) ReversibleLoadStore Semantics
ReversibleLoadStore Semantics - Key: PIG-778 URL: https://issues.apache.org/jira/browse/PIG-778 Project: Pig Issue Type: Improvement Reporter: Gunther Hagleitner The question about how to use the ReversibleLoadStore function came up in two scenarios recently: a) Can we generate a load operator from a store by simply taking the same store function string, if the store function is a ReversibleLoadStore function? I would like to use that to remove unnecessary compiler-generated stores, if we can change the dependent load operators to load from a different store. b) Is it sufficient to check whether a pair of store and load operations on the same location is reversible to know whether we can eliminate it without changing the data? This is done in the pig server for logical plans right now. If I go by PigStorage, then the answer to (a) is yes. The answer to (b) is no: we also need to check that both load and store use the same parameter to the reversible function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
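Combining (a) and (b), the eliminability check could look roughly like this. The tuples are illustrative stand-ins, not Pig's FuncSpec/operator classes:

```python
# Sketch of check (b): a store/load pair on the same location may be
# eliminated without changing the data only if the function is reversible
# AND both sides use the same function with the same constructor arguments.

def can_eliminate(store, load, reversible_funcs):
    """store/load: (location, func_name, args) tuples."""
    s_loc, s_func, s_args = store
    l_loc, l_func, l_args = load
    return (s_loc == l_loc
            and s_func == l_func
            and s_func in reversible_funcs
            and s_args == l_args)        # same parameters, per point (b)
```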
[jira] Created: (PIG-777) Code refactoring: Create optimization out of store/load post processing code
Code refactoring: Create optimization out of store/load post processing code Key: PIG-777 URL: https://issues.apache.org/jira/browse/PIG-777 Project: Pig Issue Type: Improvement Reporter: Gunther Hagleitner The postProcessing method in the pig server checks whether a logical graph contains stores to and loads from the same location. If so, it will either connect the store and load, or optimize by throwing out the load and connecting the store predecessor with the successor of the load. Ideally the introduction of the store and load connection should happen in the query compiler, while the optimization should then happen in a separate optimizer step as part of the optimizer framework. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-776) Code refactoring: Move "moveResults" code from JobControlCompiler to MapReduceLauncher
Code refactoring: Move "moveResults" code from JobControlCompiler to MapReduceLauncher -- Key: PIG-776 URL: https://issues.apache.org/jira/browse/PIG-776 Project: Pig Issue Type: Improvement Reporter: Gunther Hagleitner Priority: Minor It makes more sense for the moveResults code to live in the launcher rather than the compiler. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: error_handling_0416.patch Fixed some issues with the error handling patch (0415): * Duplicated error code 2129 * Unclear string "splitter" * Added native exception message to error msg in store operator. > PERFORMANCE: multi-query optimization > - > > Key: PIG-627 > URL: https://issues.apache.org/jira/browse/PIG-627 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: doc-fix.patch, error_handling_0415.patch, > error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, > merge-041409.patch, merge_741727_HEAD__0324.patch, > merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, > multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, > multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, > multiquery_0306.patch, multiquery_explain_fix.patch, > non_reversible_store_load_dependencies.patch, > non_reversible_store_load_dependencies_2.patch, > noop_filter_absolute_path_flag.patch, > noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch > > > Currently, if your Pig script contains multiple stores and some shared > computation, Pig will execute several independent queries. For instance: > A = load 'data' as (a, b, c); > B = filter A by a > 5; > store B into 'output1'; > C = group B by b; > store C into 'output2'; > This script will result in map-only job that generated output1 followed by a > map-reduce job that generated output2. As the resuld data is read, parsed and > filetered twice which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
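The shared-computation example quoted in the description can be mimicked in a single pass, which is what the multi-query optimization achieves: the input is read and filtered once, feeding both stores. An illustrative sketch, not Pig's execution code:

```python
# One pass over 'data' producing both outputs of the quoted script:
#   B = filter A by a > 5;  store B into 'output1';
#   C = group B by b;       store C into 'output2';

def run_once(records):
    output1, groups = [], {}
    for a, b, c in records:                         # single read of 'data'
        if a > 5:                                   # B = filter A by a > 5
            output1.append((a, b, c))               # store B into 'output1'
            groups.setdefault(b, []).append((a, b, c))  # C = group B by b
    return output1, groups                          # store C into 'output2'
```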
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: error_handling_0415.patch This patch contains: * Error codes/msg * Javadoc changes * fix the merge error in parser ("aliases" cmd) * updated golden files > PERFORMANCE: multi-query optimization > - > > Key: PIG-627 > URL: https://issues.apache.org/jira/browse/PIG-627 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: doc-fix.patch, error_handling_0415.patch, > file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, > merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, > merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, > multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, > multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, > multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, > non_reversible_store_load_dependencies_2.patch, > noop_filter_absolute_path_flag.patch, > noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch > > > Currently, if your Pig script contains multiple stores and some shared > computation, Pig will execute several independent queries. For instance: > A = load 'data' as (a, b, c); > B = filter A by a > 5; > store B into 'output1'; > C = group B by b; > store C into 'output2'; > This script will result in map-only job that generated output1 followed by a > map-reduce job that generated output2. As the resuld data is read, parsed and > filetered twice which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: doc-fix.patch javadoc changes only. doc-fix.patch contains "fixes" to silence javadoc warnings. > PERFORMANCE: multi-query optimization > - > > Key: PIG-627 > URL: https://issues.apache.org/jira/browse/PIG-627 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: doc-fix.patch, file_cmds-0305.patch, > fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, > merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, > multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, > multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, > multiquery_0306.patch, multiquery_explain_fix.patch, > non_reversible_store_load_dependencies.patch, > non_reversible_store_load_dependencies_2.patch, > noop_filter_absolute_path_flag.patch, > noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch > > > Currently, if your Pig script contains multiple stores and some shared > computation, Pig will execute several independent queries. For instance: > A = load 'data' as (a, b, c); > B = filter A by a > 5; > store B into 'output1'; > C = group B by b; > store C into 'output2'; > This script will result in map-only job that generated output1 followed by a > map-reduce job that generated output2. As the resuld data is read, parsed and > filetered twice which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: merge-041409.patch merge-041409.patch contains the latest merge from trunk to branch. > PERFORMANCE: multi-query optimization > - > > Key: PIG-627 > URL: https://issues.apache.org/jira/browse/PIG-627 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: file_cmds-0305.patch, fix_store_prob.patch, > merge-041409.patch, merge_741727_HEAD__0324.patch, > merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, > multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, > multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, > multiquery_0306.patch, multiquery_explain_fix.patch, > non_reversible_store_load_dependencies.patch, > non_reversible_store_load_dependencies_2.patch, > noop_filter_absolute_path_flag.patch, > noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch > > > Currently, if your Pig script contains multiple stores and some shared > computation, Pig will execute several independent queries. For instance: > A = load 'data' as (a, b, c); > B = filter A by a > 5; > store B into 'output1'; > C = group B by b; > store C into 'output2'; > This script will result in map-only job that generated output1 followed by a > map-reduce job that generated output2. As the resuld data is read, parsed and > filetered twice which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: streaming-fix.patch Some fixes in the patch "streaming-fix.patch": * The split operator wasn't always playing nicely with the way we run the pipeline one extra time in the mapper's or reducer's close function if there's a stream operator present * Moved the MR optimizer that sets "stream in map" and "stream in reduce" to the end of the queue. * PhyPlanVisitor forgets to pop some walkers it pushed on the stack. That can result in the NoopFilterRemoval stage failing, because it's looking in the wrong plan. * Setting the jobname by default to the scriptname came in through the last merge, but didn't work anymore > PERFORMANCE: multi-query optimization > - > > Key: PIG-627 > URL: https://issues.apache.org/jira/browse/PIG-627 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: file_cmds-0305.patch, fix_store_prob.patch, > merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, > merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, > multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, > multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, > multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, > non_reversible_store_load_dependencies_2.patch, > noop_filter_absolute_path_flag.patch, > noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch > > > Currently, if your Pig script contains multiple stores and some shared > computation, Pig will execute several independent queries. For instance: > A = load 'data' as (a, b, c); > B = filter A by a > 5; > store B into 'output1'; > C = group B by b; > store C into 'output2'; > This script will result in map-only job that generated output1 followed by a > map-reduce job that generated output2. 
As a result, the data is read, parsed and > filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-759) HBaseStorage scheme for Load/Slice function
HBaseStorage scheme for Load/Slice function --- Key: PIG-759 URL: https://issues.apache.org/jira/browse/PIG-759 Project: Pig Issue Type: Bug Reporter: Gunther Hagleitner We would like to change the HBaseStorage function to use a scheme when loading a table in pig. The scheme we are thinking of is: "hbase". So in order to load an hbase table in a pig script the statement should read: {noformat} table = load 'hbase://' using HBaseStorage(); {noformat} If the scheme is omitted pig would assume the tablename to be an hdfs path and the storage function would use the last component of the path as a table name and output a warning. For details on why see jira issue: PIG-758 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-757) Using schemes in load and store paths
[ https://issues.apache.org/jira/browse/PIG-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner resolved PIG-757. Resolution: Duplicate > Using schemes in load and store paths > - > > Key: PIG-757 > URL: https://issues.apache.org/jira/browse/PIG-757 > Project: Pig > Issue Type: Bug >Reporter: Gunther Hagleitner > > As part of the multiquery optimization work there's a need to use absolute > paths for load and store operations (because the current directory changes > during the execution of the script). In order to do so, the suggestion is to > change the semantics of the location/filename string used in LoadFunc and > Slicer/Slice. > The proposed change is: >* Load locations without a scheme part are expected to be hdfs (mapreduce > mode) or local (local mode) paths >* Any hdfs or local path will be translated to a fully qualified absolute > path before it is handed to either a LoadFunc or Slicer >* Any scheme other than file or hdfs will result in the load path be > passed through to the LoadFunc or Slicer without any modification. > Example: > If you have a LoadFunc that reads from a database, right now the following > could be used: > {{{ > a = load 'table' using DBLoader(); > }}} > With the proposed changes table would be translated into an hdfs path though > ("hdfs:///table"). Probably not what the loader wants to see. So in order > to make this work one would use: > {{{ > a = load 'sql://table' using DBLoader(); > }}} > Now the DBLoader would see the unchanged string "sql://table". And pig will > not use the string as an hdfs location. > This is an incompatible change but it's hopefully few existing > Slicers/Loaders that are affected. This behavior is part of the multiquery > work and can be turned off (reverted back) by using the "no_multiquery" flag. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-758) Converting load/store locations into fully qualified absolute paths
[ https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-758: --- Description: As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is: * Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths * Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer * Any scheme other than "file" or "hdfs" will result in the load path to be passed through to the LoadFunc or Slicer without any modification. Example: If you have a LoadFunc that reads from a database, in the current system the following could be used: {noformat} a = load 'table' using DBLoader(); {noformat} With the proposed changes table would be translated into an hdfs path though ("hdfs:///table"). Probably not what the DBLoader would want to see. In order to make it work one could use: {noformat} a = load 'sql://table' using DBLoader(); {noformat} Now the DBLoader would see the unchanged string "sql://table". This is an incompatible change, but hopefully not affecting many existing Loaders/Slicers. Since this is needed with the multiquery feature, the behavior can be reverted back by using the "no_multiquery" pig flag. was: As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. 
The proposed change is: * Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths * Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer * Any scheme other than "file" or "hdfs" will result in the load path to be passed through to the LoadFunc or Slicer without any modification. Example: If you have a LoadFunc that reads from a database, in the current system the following could be used: {code} a = load 'table' using DBLoader(); {code} With the proposed changes table would be translated into an hdfs path though ("hdfs:///table"). Probably not what the DBLoader would want to see. In order to make it work one could use: {code} a = load 'sql://table' using DBLoader(); {code} Now the DBLoader would see the unchanged string "sql://table". This is an incompatible change, but hopefully not affecting many existing Loaders/Slicers. Since this is needed with the multiquery feature, the behavior can be reverted back by using the "no_multiquery" pig flag. > Converting load/store locations into fully qualified absolute paths > --- > > Key: PIG-758 > URL: https://issues.apache.org/jira/browse/PIG-758 > Project: Pig > Issue Type: Bug >Reporter: Gunther Hagleitner > > As part of the multiquery optimization work there is a need to use absolute > paths for load and store operations (because the current directory changes > during the execution of the script). In order to do so, we are suggesting a > change to the semantics of the location/filename string used in LoadFunc and > Slicer/Slice. 
> The proposed change is: >* Load locations without a scheme part are expected to be hdfs (mapreduce > mode) or local (local mode) paths >* Any hdfs or local path will be translated to a fully qualified absolute > path before it is handed to either a LoadFunc or Slicer >* Any scheme other than "file" or "hdfs" will result in the load path to > be passed through to the LoadFunc or Slicer without any modification. > Example: > If you have a LoadFunc that reads from a database, in the current system the > following could be used: > {noformat} > a = load 'table' using DBLoader(); > {noformat} > With the proposed changes table would be translated into an hdfs path though > ("hdfs:///table"). Probably not what the DBLoader would want to see. In > order to make it work one could use: > {noformat} > a = load 'sql://table' using DBLoader(); > {noformat} > Now the DBLoader would see the unchanged string "sql://table". > This is an incompatible change, but hopefully not affecting many existing > Loaders/Slicers. Since this is needed with the multiquery feature, the > behavior can be reverted back by using the "no_multiquery" pig flag. -- This message is automatically generated by J
[jira] Updated: (PIG-758) Converting load/store locations into fully qualified absolute paths
[ https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-758: --- Description: As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). To do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is:
* Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths
* Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer
* Any scheme other than "file" or "hdfs" will result in the load path being passed through to the LoadFunc or Slicer without any modification.
Example: If you have a LoadFunc that reads from a database, in the current system the following could be used:
{code}
a = load 'table' using DBLoader();
{code}
With the proposed changes, 'table' would be translated into an hdfs path ("hdfs:///table"), which is probably not what the DBLoader would want to see. To make it work one could use:
{code}
a = load 'sql://table' using DBLoader();
{code}
Now the DBLoader would see the unchanged string "sql://table". This is an incompatible change, but hopefully one that affects few existing Loaders/Slicers. Since this is needed by the multiquery feature, the behavior can be reverted by using the "no_multiquery" pig flag.
> Converting load/store locations into fully qualified absolute paths
> Key: PIG-758
> URL: https://issues.apache.org/jira/browse/PIG-758
> Project: Pig
> Issue Type: Bug
> Reporter: Gunther Hagleitner
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-758) Converting load/store locations into fully qualified absolute paths
Converting load/store locations into fully qualified absolute paths
Key: PIG-758
URL: https://issues.apache.org/jira/browse/PIG-758
Project: Pig
Issue Type: Bug
Reporter: Gunther Hagleitner
(Description as in the PIG-758 update above.)
[jira] Created: (PIG-757) Using schemes in load and store paths
Using schemes in load and store paths
Key: PIG-757
URL: https://issues.apache.org/jira/browse/PIG-757
Project: Pig
Issue Type: Bug
Reporter: Gunther Hagleitner
As part of the multiquery optimization work there's a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). To do so, the suggestion is to change the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is:
* Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths
* Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer
* Any scheme other than file or hdfs will result in the load path being passed through to the LoadFunc or Slicer without any modification.
Example: If you have a LoadFunc that reads from a database, right now the following could be used:
{{{
a = load 'table' using DBLoader();
}}}
With the proposed changes, 'table' would be translated into an hdfs path ("hdfs:///table"), which is probably not what the loader wants to see. So to make this work one would use:
{{{
a = load 'sql://table' using DBLoader();
}}}
Now the DBLoader would see the unchanged string "sql://table", and pig will not use the string as an hdfs location. This is an incompatible change, but hopefully few existing Slicers/Loaders are affected. This behavior is part of the multiquery work and can be turned off (reverted) by using the "no_multiquery" flag.
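The resolution rule proposed in PIG-757/PIG-758 can be sketched roughly as follows. This is a hypothetical Python illustration, not Pig's actual Java implementation; the function name, the cwd handling, and the defaults are invented for the sketch, and hdfs://namenode/... authorities are ignored:

```python
from urllib.parse import urlparse

def resolve_load_location(location, cwd="/user/pig", mode="mapreduce"):
    """Sketch of the proposed rule: paths with no scheme (or a file/hdfs
    scheme) are made fully qualified and absolute; any other scheme is
    passed through to the LoadFunc/Slicer untouched."""
    scheme = urlparse(location).scheme
    if scheme not in ("", "file", "hdfs"):
        return location  # e.g. 'sql://table' reaches DBLoader unchanged
    default = "hdfs" if mode == "mapreduce" else "file"
    path = location
    if scheme:
        # strip an explicit file:// or hdfs:// prefix (authorities ignored)
        path = location.split("://", 1)[1]
        default = scheme
    if not path.startswith("/"):
        # qualify relative paths against the current working directory
        path = cwd.rstrip("/") + "/" + path
    return "%s://%s" % (default, path)
```

Under this sketch, `resolve_load_location("table", cwd="/")` yields the "hdfs:///table" form described above, while "sql://table" is returned unmodified.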
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: merge_trunk_to_branch.patch Merge latest trunk changes to branch
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch
>
> Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result the data is read, parsed and filtered twice, which is unnecessary and costly.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: non_reversible_store_load_dependencies_2.patch Same as above plus:
* Fix for explain when a script has execution points inside, like:
{{{
a = load ...
...
store a
exec;
b = load ...
...
}}}
This will run explain once for each execution block.
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: non_reversible_store_load_dependencies.patch This patch takes care of two things:
* Cases where a script has a store followed by a load, and the Load/StoreFunc is either not reversible or they are different functions.
* PlanSetter for physical plans in the JobControlCompiler (right now only the outermost plan's elements are set)
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: noop_filter_absolute_path_flag_0401.patch This one is the same as before, but:
* Added some comments
* Reversed the multiquery flag (on by default)
* HBase stuff works without the "hbase://" scheme but will print a warning
* Fixed a problem in NoopStoreRemover
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: noop_filter_absolute_path_flag.patch This patch contains three items:
- Removes the noop stores as described above
- Makes load and store paths absolute and canonical
- Introduces a flag that turns multiquery on and off (default is off)
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, noop_filter_absolute_path_flag.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: fix_store_prob.patch This patch addresses an issue with the way we deal with scripts that do:
{{{
...
store a into 'foo';
a = load 'foo';
...
}}}
In the logical plan this will end up as a split, with one branch storing into 'foo' and the other continuing the processing after the load. The actual load is removed. This works well but has an unfortunate side effect: if the store/load mark the boundary between two map-reduce jobs, the MRCompiler has to insert a tmp store-load bridge, which means that we now end up with two stores. This fix detects this case in the optimizing phase after the compilation. It removes the unnecessary store and loads from the other one.
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch
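The double-store cleanup described in fix_store_prob can be sketched as follows. This is a hypothetical illustration, not Pig's MRCompiler code: the list-of-tuples plan model, the function name, and the assumption that the compiler-inserted tmp store is the second store in the job are all invented for the sketch:

```python
def fix_double_store(jobs):
    """Sketch: when a job ends with both a user store and a
    compiler-inserted tmp store-load bridge, and the next job loads the
    tmp path, drop the tmp store and point the load at the user path.
    Each job is modeled as a list of (op, path) tuples."""
    for a, b in zip(jobs, jobs[1:]):
        stores = [p for op, p in a if op == "store"]
        # assumption: the tmp bridge store was appended after the user store
        if len(stores) == 2 and ("load", stores[1]) in b:
            user, tmp = stores
            a.remove(("store", tmp))                      # drop the extra store
            b[b.index(("load", tmp))] = ("load", user)    # load from 'foo' instead
    return jobs
```

Run on a two-job plan where job 1 stores into both 'foo' and a tmp path, the sketch leaves only the store into 'foo' and rewires job 2 to load from it.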
[jira] Commented: (PIG-726) Stop printing scope as part of Operator.toString()
[ https://issues.apache.org/jira/browse/PIG-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689268#action_12689268 ] Gunther Hagleitner commented on PIG-726: I've made a simpler change that had similar effects in the multiquery branch. I basically set the scope to an integer (the scope is not really used right now, as I understand it; it's a leftover from the times when pig was designed as a standalone server). That way each operator will say: ForEach 1-4 (or 2-4, depending on how many instances of the pig server you have in your jvm). The alternative is to change all the logical operators' name() functions. They look something like:
{{{
return mKey.scope + "-" + mKey.id;
}}}
For physical operators we could get away with the proposed change to the key's toString() function. That seems more painful.
> Stop printing scope as part of Operator.toString()
> Key: PIG-726
> URL: https://issues.apache.org/jira/browse/PIG-726
> Project: Pig
> Issue Type: Improvement
> Reporter: Thejas M Nair
>
> When an operator is printed in pig, it prints a string with the user name and date at which the grunt shell was started. This information is not useful and makes the output very verbose. For example, a line in explain looks like:
> ForEach tejas-Thu Mar 19 11:25:23 PDT 2009-4 Schema: {themap: map[ ]} Type: bag
> I am proposing that it should change to:
> ForEach (id:4) Schema: {themap: map[ ]} Type: bag
> That string comes from the scope in the OperatorKey class. We don't make use of it anywhere, so we should stop printing it. The change is only in OperatorKey.toString().
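The variant described in the comment above (integer scope, so names read like "ForEach 1-4") can be mirrored in a short sketch. This is a hypothetical Python mirror of the Java pattern, not Pig's actual OperatorKey class; the helper function is invented for illustration:

```python
class OperatorKey:
    """Hypothetical mirror of Pig's OperatorKey. The 'scope' used to be a
    user-name-plus-timestamp string; the comment's proposal replaces it
    with a small integer per PigServer instance."""
    def __init__(self, scope, op_id):
        self.scope = scope
        self.id = op_id

    def __str__(self):
        # mirrors the Java: return mKey.scope + "-" + mKey.id;
        return "%s-%s" % (self.scope, self.id)

def operator_name(kind, key):
    """Invented helper: what an operator's name() would print."""
    return "%s %s" % (kind, key)
```

With an integer scope, `operator_name("ForEach", OperatorKey(1, 4))` gives the compact "ForEach 1-4" instead of the verbose user-and-date form.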
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: merge_741727_HEAD__0324_2.patch Seems like the last merge patch didn't correctly contain the entire new TestFinish.java file. Well, this one does.
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: merge_741727_HEAD__0324.patch Merge of trunk (741727:HEAD) into multiquery branch. Aka merge from hell :-) I ran all unit tests, the multiquery tests and the nightly tests and everything looks fine (no errors).
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, merge_741727_HEAD__0324.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: multiquery_explain_fix.patch Fixes three issues with explain:
a) Ceci n'est pas un bug. Splits in interactive mode still need this branch.
b) explain needs to discard the batch iff it was loading a script
c) Split is now a nested operator (and explain needs to know)
This patch doesn't have any overlapping files with Richard's last patch.
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679500#action_12679500 ] Gunther Hagleitner commented on PIG-627: Oh. I also took out the restriction of the openIterator in batch mode. That was no longer needed.
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Fix For: types_branch
> Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: file_cmds-0305.patch This patch is for the multi query branch again. It mostly fixes the problem with certain commands in a script that require immediate execution (in batch mode). So if you do stuff like:
{{{
...
store a into 'tmp_foo';
...
rm tmp_foo
...
}}}
the rm will trigger execution, and the file will be there for you to delete, copyToLocal, move, etc. You can also use the "exec" statement without params in a script now, to force execution of what we've seen so far. This patch also contains a minor fix to the computation of progress in MR jobs (which I screwed up in the last patch).
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Fix For: types_branch
> Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: multi-store-0304.patch Same as the other one except:
- Documented the createStoreFunction method some more.
- Removed unnecessary fields in the path parsing
- Moved the tear down of stores below the extra streaming run (in PigMapBase's and PigMapReduce's close functions)
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Fix For: types_branch
> Attachments: multi-store-0303.patch, multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: multi-store-0303.patch This patch introduces the functionality to support multiple stores in a single MR job. It's for the multiquery branch and is needed to unblock concurrent dev on the split operator. There aren't enough unit tests in this patch yet; they will be provided once the split operator can use multi stores (right now nothing actually uses these stores, so testing is difficult). In order to test the patch, I temporarily turned multi store on for all queries (even if they only have one store) and then ran all the unit tests. All tests passed.
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Fix For: types_branch
> Attachments: multi-store-0303.patch, multiquery_0223.patch, multiquery_0224.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: multiquery_0224.patch This patch includes the multiquery unit test cases.
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Fix For: types_branch
> Attachments: multiquery_0223.patch, multiquery_0224.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-627: --- Attachment: multiquery_0223.patch This is for the multiquery branch. It's phase 1. It contains a lot of infrastructural work to be able to look at entire scripts during evaluation (batch mode). It will look at a script plan and insert splits whenever there is a shared sequence of operations. The split execution is still the same as it was before (load-store bridge).
> PERFORMANCE: multi-query optimization
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Fix For: types_branch
> Attachments: multiquery_0223.patch
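The phase-1 idea of inserting a split wherever stores share a sequence of operations can be sketched as below. This is a hypothetical illustration, not Pig's plan code: modeling each pipeline as the flat list of operations feeding one store, and the function names, are invented for the sketch:

```python
def common_prefix_len(p1, p2):
    """Length of the shared leading run of operations in two pipelines."""
    n = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        n += 1
    return n

def insert_split(pipelines):
    """Sketch of multiquery phase 1: compute the operations shared by
    every pipeline once, end the shared part with a 'split', and keep
    only the distinct tails as branches."""
    shared = pipelines[0]
    for p in pipelines[1:]:
        shared = shared[:common_prefix_len(shared, p)]
    branches = [p[len(shared):] for p in pipelines]
    return shared + ["split"], branches
```

Applied to the PIG-627 example (load + filter shared by both stores), the load and filter run once and the split fans out to the two store branches.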
[jira] Updated: (PIG-574) run command for grunt
[ https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-574: --- Attachment: run_command_params_021109.patch Good point. I felt it was a little strange to specify "-param" on the grunt shell, but it is easier to remember if you're already using it outside the shell. So, this patch does the same as the last one, but the syntax is:
run myscript.pig -param LIMIT=5 -param FILE=/foo/bar.txt -param_file myparams.ppf
> run command for grunt
> Key: PIG-574
> URL: https://issues.apache.org/jira/browse/PIG-574
> Project: Pig
> Issue Type: New Feature
> Components: grunt
> Reporter: David Ciemiewicz
> Priority: Minor
> Attachments: run_command.patch, run_command_params.patch, run_command_params_021109.patch
>
> This is a request for a "run file" command in grunt which will read a script from the local file system and execute the script interactively while in the grunt shell.
> One of the things that slows down iterative development of large, complicated Pig scripts that must operate on hadoop fs data is that the edit, run, debug cycle is slow, because I must wait to allocate a Hadoop-on-Demand (hod) cluster for each iteration. I would prefer not to preallocate a cluster of nodes (though I could).
> Instead, I'd like to have one window open and edit my Pig script using vim or emacs, write it, and then type "run myscript.pig" at the grunt shell until I get things right.
> I'm used to doing similar things with Oracle, MySQL, and R.
[jira] Commented: (PIG-574) run command for grunt
[ https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672570#action_12672570 ] Gunther Hagleitner commented on PIG-574:

Oh, I also ran the unit tests. They pass.
[jira] Updated: (PIG-574) run command for grunt
[ https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-574:
---
Attachment: run_command_params.patch

Thanks for reviewing the patch! I tried to address the 3 issues you pointed out:

1) You can now specify parameters and param files in both the exec and run commands:

grunt> run myscript.pig using param_file myparams.ppf

or:

grunt> run myscript.pig using param LIMIT=5 param_file myparams.ppf

The syntax mimics what you can do on the command line when executing a script, without the "-"s.

2) The script lines are now added to the command history in interactive mode.

3) The double grunt... That's actually harder to fix than I thought, but I added a newline, so it won't say:

grunt> grunt>

but:

grunt>
grunt>

Let's just tell everyone that that's because they have extra newlines in their scripts. Maybe they won't find out. ;-)
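[Editorial note] For reference, a parameter file like myparams.ppf in the command above is, in Pig's parameter-substitution format, a plain list of name=value lines; the specific names and values below are assumptions for illustration:

```
# myparams.ppf -- one parameter per line; lines starting with # are comments
LIMIT=5
FILE=/foo/bar.txt
```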
[jira] Updated: (PIG-574) run command for grunt
[ https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-574:
---
Status: Patch Available (was: Open)
[jira] Updated: (PIG-574) run command for grunt
[ https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-574:
---
Attachment: run_command.patch

Introduces the run and exec commands.