[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (18 issues)
Subscriber: pigdaily

Key         Summary
PIG-3431    Return more information for parsing related exceptions.
            https://issues.apache.org/jira/browse/PIG-3431
PIG-3430    Add xml format for explaining MapReduce Plan.
            https://issues.apache.org/jira/browse/PIG-3430
PIG-3426    Add support for removing s3 files
            https://issues.apache.org/jira/browse/PIG-3426
PIG-3419    Pluggable Execution Engine
            https://issues.apache.org/jira/browse/PIG-3419
PIG-3379    Alias reuse in nested foreach causes PIG script to fail
            https://issues.apache.org/jira/browse/PIG-3379
PIG-3374    CASE and IN fail when expression includes dereferencing operator
            https://issues.apache.org/jira/browse/PIG-3374
PIG-3349    Document ToString(Datetime, String) UDF
            https://issues.apache.org/jira/browse/PIG-3349
PIG-3346    New property that controls the number of combined splits
            https://issues.apache.org/jira/browse/PIG-3346
PIG-        Fix remaining Windows core unit test failures
            https://issues.apache.org/jira/browse/PIG-
PIG-3325    Adding a tuple to a bag is slow
            https://issues.apache.org/jira/browse/PIG-3325
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3292    Logical plan invalid state: duplicate uid in schema during self-join to get cross product
            https://issues.apache.org/jira/browse/PIG-3292
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3168    TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
            https://issues.apache.org/jira/browse/PIG-3168
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3048    Add mapreduce workflow information to job configuration
            https://issues.apache.org/jira/browse/PIG-3048
PIG-3021    Split results missing records when there is null values in the column comparison
            https://issues.apache.org/jira/browse/PIG-3021

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated PIG-3385:
--
    Attachment: pig-3385-v01.patch

I wonder whether a custom partitioner ever worked for distinct. It looks like the partitioner info is passed through POGlobalRearrange, but "distinct" doesn't use it.

I am uploading an initial patch that just passes that info through PODistinct. This is my first time touching the backend code, so I would appreciate it if someone could take a look. I'll upload a test case next.

> DISTINCT no longer uses custom partitioner
> --
>
>                 Key: PIG-3385
>                 URL: https://issues.apache.org/jira/browse/PIG-3385
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>            Reporter: Will Oberman
>            Priority: Minor
>         Attachments: pig-3385-v01.patch
>
>
> From u...@pig.apache.org: It looks like an optimization was put in to make
> distinct use a special partitioner which prevents the user from setting the
> partitioner.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
[ https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745487#comment-13745487 ]

Rohini Palaniswamy commented on PIG-3168:
--

+1

> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
> --
>
>                 Key: PIG-3168
>                 URL: https://issues.apache.org/jira/browse/PIG-3168
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: PIG-3168-2.patch, PIG-3168.patch
>
>
> PIG-2994 made explain with no alias be equivalent to explain on the previous
> alias. This breaks
> TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge because the
> previous alias is an auto-generated alias not a user-defined alias.
> The following fixes the test:
> {code}
>  "I = GROUP F2 BY (f7, f8);" +
>  "STORE I into 'foo4' using BinStorage();" +
> -"explain;";
> +"explain I;";
> {code}
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745448#comment-13745448 ]

Cheolsoo Park commented on PIG-3419:
--

[~julienledem],
{quote}
Do we really throw Exception?
{quote}
No, we don't throw Exception to the end user. But currently, PigServer catches them all in a single catch block and sorts them out using instanceof calls (see below). Probably we should make ExecutionEngine throw FrontendException (FEE), ExecException (EE), and IOException (IOE), and replace the instanceof calls with catch blocks in PigServer.
{code}
try {
    stats = pigContext.getExecutionEngine().launchPig(lp, jobName, pigContext);
} catch (Exception e) {
    // There are a lot of exceptions thrown by the launcher. If this
    // is an ExecException, just let it through. Else wrap it.
    if (e instanceof ExecException) {
        throw (ExecException)e;
    } else if (e instanceof FrontendException) {
        throw (FrontendException)e;
    } else {
        int errCode = 2043;
        String msg = "Unexpected error during execution.";
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
}
{code}

> Pluggable Execution Engine
> --
>
>                 Key: PIG-3419
>                 URL: https://issues.apache.org/jira/browse/PIG-3419
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.12
>            Reporter: Achal Soni
>            Assignee: Achal Soni
>            Priority: Minor
>         Attachments: execengine.patch, mapreduce_execengine.patch,
> stats_scriptstate.patch, test_suite.patch
>
>
> In an effort to adapt Pig to work using Apache Tez
> (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for
> a cleaner ExecutionEngine abstraction than existed before. The changes are
> not that major as Pig was already relatively abstracted out between the
> frontend and backend. The changes in the attached commit are essentially the
> barebones changes -- I tried to not change the structure of Pig's different
> components too much. I think it will be interesting to see in the future how
> we can refactor more areas of Pig to really honor this abstraction between
> the frontend and backend.
> Some of the changes were to reinstate an ExecutionEngine interface to tie
> together the front end and backend, to make the changes in Pig to delegate
> to the EE when necessary, and to create an MRExecutionEngine that implements
> this interface. Other work included changing ExecType to cycle through the
> ExecutionEngines on the classpath and select the appropriate one (this is
> done using Java ServiceLoader, exactly how MapReduce does for choosing the
> framework to use between local and distributed mode). Also I tried to make
> ScriptState, JobStats, and PigStats as abstract as possible in their current
> state. I think in the future some work will need to be done here to perhaps
> re-evaluate the usage of ScriptState and the responsibilities of the
> different statistics classes. I haven't touched the PPNL, but I think more
> abstraction is needed here, perhaps in a separate patch.
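
As a rough illustration of the suggestion in the comment above (replacing the instanceof chain with dedicated catch blocks), here is a minimal sketch. It is not code from the patch: FrontendException and ExecException below are simplified stand-ins, and Engine and run are hypothetical names. In Pig itself these exception classes may be related by inheritance, in which case they could not share one multi-catch clause and the more specific type would need its own catch block placed first.

```java
import java.io.IOException;

// Sketch only: the exception classes are simplified stand-ins for Pig's
// FrontendException and ExecException, and Engine is a hypothetical interface.
class LaunchErrorDemo {
    static class FrontendException extends Exception {
        FrontendException(String msg) { super(msg); }
    }

    static class ExecException extends Exception {
        ExecException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Hypothetical engine interface that declares only the narrowed checked exceptions.
    interface Engine {
        String launch(String jobName) throws FrontendException, ExecException, IOException;
    }

    // Caller side: the two known exception types pass through unchanged, while
    // IOException and runtime failures are wrapped in an ExecException.
    static String run(Engine engine, String jobName) throws FrontendException, ExecException {
        try {
            return engine.launch(jobName);
        } catch (FrontendException | ExecException e) {
            throw e;
        } catch (IOException | RuntimeException e) {
            throw new ExecException("Unexpected error during execution.", e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(n -> "stats for " + n, "job1"));
        try {
            run(n -> { throw new IOException("disk"); }, "job2");
        } catch (ExecException e) {
            System.out.println("wrapped: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

The compiler then checks the pass-through cases, and only genuinely unexpected failures reach the wrapping branch.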
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745398#comment-13745398 ]

Julien Le Dem commented on PIG-3419:
--

[~cheolsoo]
1. Do we really throw Exception? If yes, then let's just throw that. If not, then let's instead have FrontEndException, ExecException, IOException, i.e. let's remove the exceptions that are already included by the highest exception level.
2. Agreed with you. I would expect the execution engine to handle the Properties internally and the signature of this method to be:
{noformat}
public void setProperty(String property, String value);
{noformat}
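
To make the setProperty discussion in this thread concrete, here is a minimal sketch of an engine that owns its Properties and refreshes derived state inside setProperty. SketchEngine and its fields are hypothetical names, not Pig's MRExecutionEngine, and the body of recomputeProperties is only a placeholder for whatever the real engine would recompute.

```java
import java.util.Properties;

// Sketch only: SketchEngine is a hypothetical class, not Pig's MRExecutionEngine.
class SetPropertyDemo {
    static class SketchEngine {
        // The engine owns its configuration, so callers never pass a Properties object.
        private final Properties properties = new Properties();
        private int recomputeCount = 0;

        // Signature proposed in the thread: key and value only.
        public void setProperty(String property, String value) {
            properties.setProperty(property, value);
            recomputeProperties();
        }

        // Placeholder for refreshing whatever derived configuration the engine keeps.
        private void recomputeProperties() {
            recomputeCount++;
        }

        public String getProperty(String property) {
            return properties.getProperty(property);
        }

        public int recomputeCount() {
            return recomputeCount;
        }
    }

    public static void main(String[] args) {
        SketchEngine engine = new SketchEngine();
        engine.setProperty("example.key", "example-value");
        System.out.println(engine.getProperty("example.key"));
    }
}
```

Keeping the Properties private means the recompute step cannot be skipped by a caller that mutates the object directly.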
[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
[ https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3168:
--
    Status: Patch Available (was: Reopened)

TestMultiQueryBasic passes.
[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
[ https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3168:
--
    Attachment: (was: PIG-3618-2.patch)
[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
[ https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3168:
--
    Attachment: PIG-3168-2.patch
[jira] [Updated] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
[ https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3168:
--
    Attachment: PIG-3618-2.patch

So here is what it does now:
* In interactive mode, explain with no alias == explain on the last relation.
* In batch mode, explain with no alias == explain on the entire script.

Let me know if this is not acceptable.
[jira] [Reopened] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
[ https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park reopened PIG-3168:
--

Thanks Rohini. I will post a patch that reverts it to the old behavior in batch mode.
[jira] [Resolved] (PIG-3432) typo in log message in SchemaTupleFrontend
[ https://issues.apache.org/jira/browse/PIG-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park resolved PIG-3432.
--
       Resolution: Fixed
    Fix Version/s: 0.12
         Assignee: oleksii iepishkin

Committed to trunk. Thank you Oleksii!

> typo in log message in SchemaTupleFrontend
> --
>
>                 Key: PIG-3432
>                 URL: https://issues.apache.org/jira/browse/PIG-3432
>             Project: Pig
>          Issue Type: Bug
>            Reporter: oleksii iepishkin
>            Assignee: oleksii iepishkin
>             Fix For: 0.12
>
>         Attachments: PIG-3432.patch
>
>
> https://github.com/apache/pig/pull/11.patch
Re: Slow Group By operator
Hi Benjamin,

Can you describe which step of the group by is slow: the mapper side or the reducer side?

What does your query look like? Can you share it? Do you call any algebraic UDF after the group by? I am wondering whether the combiner matters in your test.

Thanks,
Cheolsoo

On Tue, Aug 20, 2013 at 2:27 AM, Benjamin Jakobus wrote:

> Hi all,
>
> After benchmarking Hive and Pig, I found that the Group By operator in Pig
> is drastically slower than Hive's. I was wondering whether anybody has
> experienced the same? And whether people may have any tips for improving
> the performance of this operation? (Adding a DISTINCT as suggested by an
> earlier post on here doesn't help. I am currently re-running the benchmark
> with LZO compression enabled).
>
> Regards,
> Ben
>
[jira] [Commented] (PIG-3168) TestMultiQueryBasic.testMultiQueryWithSplitInMapAndMultiMerge fails in trunk
[ https://issues.apache.org/jira/browse/PIG-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745240#comment-13745240 ]

Rohini Palaniswamy commented on PIG-3168:
--

bq. PIG-2994 made explain with no alias be equivalent to explain on the previous alias.

Instead of fixing the test, shouldn't we revert explain with no alias to its older behavior of explaining the whole script? The new behavior arguably breaks backward compatibility.
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745227#comment-13745227 ]

Cheolsoo Park commented on PIG-3419:
--

I have two more comments on ExecutionEngine and MRExecutionEngine:

# Can you simplify the checked exceptions in the ExecutionEngine interface? For example,
From:
{code}
public PigStats launchPig(LogicalPlan lp, String grpName, PigContext pc)
        throws PlanException, VisitorException, IOException, ExecException,
        JobCreationException, FrontendException, Exception;
{code}
To:
{code}
public PigStats launchPig(LogicalPlan lp, String grpName, PigContext pc) throws Exception;
{code}
Looks like there's no point in throwing them again in ExecutionEngine because they will be caught as Exception in PigServer anyway. If needed, we should take specific actions per exception in ExecutionEngine.
# As for the setProperty method in ExecutionEngine, do we need to pass a Properties object? Can we construct one from the given key/value pair and call recomputeProperties() internally?
{code}
public void setProperty(Properties properties, String property, String value);
{code}
Also, the setProperty method in MRExecutionEngine looks like it is mostly a duplicate of recomputeProperties(). Can you just reuse recomputeProperties()?

Julien said you're working on a new patch. It would be nice if you could incorporate these (if you agree with me, of course). Thanks a lot!
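
The PIG-3419 description says ExecType cycles through the ExecutionEngines on the classpath using Java's ServiceLoader. The sketch below shows that general discovery pattern only; EngineProvider, accepts, and select are hypothetical names and are not taken from the patch. ServiceLoader.load finds implementations listed in provider-configuration files under META-INF/services, and none are registered in this standalone example, so a hand-built list stands in for discovered providers.

```java
import java.util.Arrays;
import java.util.ServiceLoader;

// Sketch only: EngineProvider and the helpers below are hypothetical names,
// not classes from the PIG-3419 patch.
class EngineLookupDemo {
    // Hypothetical service interface that each execution engine would implement.
    interface EngineProvider {
        boolean accepts(String execType);
        String name();
    }

    // Return the name of the first provider that accepts the requested type,
    // or null when none does.
    static String select(Iterable<EngineProvider> providers, String execType) {
        for (EngineProvider provider : providers) {
            if (provider.accepts(execType)) {
                return provider.name();
            }
        }
        return null;
    }

    // Small helper that builds a provider for the demo.
    static EngineProvider provider(String type, String name) {
        return new EngineProvider() {
            public boolean accepts(String execType) { return type.equalsIgnoreCase(execType); }
            public String name() { return name; }
        };
    }

    public static void main(String[] args) {
        // Classpath discovery; no providers are registered in this sketch.
        ServiceLoader<EngineProvider> discovered = ServiceLoader.load(EngineProvider.class);
        System.out.println("discovered: " + select(discovered, "mapreduce"));

        // Hand-built list standing in for discovered providers.
        Iterable<EngineProvider> fake = Arrays.asList(
                provider("local", "LocalEngine"), provider("mapreduce", "MREngine"));
        System.out.println("selected: " + select(fake, "mapreduce"));
    }
}
```

Taking an Iterable in select keeps the selection logic testable without any provider files on the classpath.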
[jira] [Updated] (PIG-3432) typo in log message in SchemaTupleFrontend
[ https://issues.apache.org/jira/browse/PIG-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

oleksii iepishkin updated PIG-3432:
--
    Attachment: PIG-3432.patch
[jira] [Commented] (PIG-3432) typo in log message in SchemaTupleFrontend
[ https://issues.apache.org/jira/browse/PIG-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745183#comment-13745183 ]

oleksii iepishkin commented on PIG-3432:
--

I've attached the patch to this ticket.

Just in case, here is how the patch was created:
{code}
git clone g...@github.com:apache/pig.git
cd pig
git checkout trunk

# merge pull request
curl https://github.com/apache/pig/pull/11.patch | git am

# create patch file for apache (I wish it was easier for a simple typo fix)
git reset HEAD~
git diff --no-prefix > PIG-3432.patch
{code}
[jira] [Commented] (PIG-3432) typo in log message in SchemaTupleFrontend
[ https://issues.apache.org/jira/browse/PIG-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745112#comment-13745112 ]

Cheolsoo Park commented on PIG-3432:
--

[~epishkin], thank you for the contribution. Unfortunately, we cannot merge your pull request on github since it's just a read-only mirror. Do you mind uploading your patch to this jira?

Please see here:
https://cwiki.apache.org/confluence/display/PIG/HowToContribute
Slow Group By operator
Hi all,

After benchmarking Hive and Pig, I found that the Group By operator in Pig is drastically slower than Hive's. I was wondering whether anybody has experienced the same? And whether people may have any tips for improving the performance of this operation? (Adding a DISTINCT as suggested by an earlier post on here doesn't help. I am currently re-running the benchmark with LZO compression enabled).

Regards,
Ben
pig pull request: fixed a typo in a log message
GitHub user epishkin opened a pull request:

    https://github.com/apache/pig/pull/11

    fixed a typo in a log message

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/epishkin/pig patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/pig/pull/11.patch
[jira] [Commented] (PIG-3429) Reduce Pig memory footprint using specialized tuple classes (complementary to SchemaTuple)
[ https://issues.apache.org/jira/browse/PIG-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745007#comment-13745007 ]

Jonathan Packer commented on PIG-3429:
--

Hi, so the current patch now seems to pass every unit test except the ones that use tuple's append() method, which breaks. I have an idea for fixing this, but wanted to wait for feedback to make sure I'm going in the right direction. I know this is changing some important classes, but I think the memory improvements could especially help make Pig local mode more viable for general-purpose use, as memory is more of an issue on laptops than on clusters.

My idea for fixing append() is to give the specialized tuple impls an extra field "Tuple promotedTuple". This is null by default, so it only adds 8 bytes of overhead (still much cheaper than the ArrayList when it is unused). If someone needs to append to the specialized tuple, the existing fields are copied into a new default tuple in the "promotedTuple" field, and that is then used by proxy. So there is a small overhead vs. the default when append is used, but in most cases, where append is not used, you retain the memory savings of the specialized tuples. Does this seem like a workable idea?

> Reduce Pig memory footprint using specialized tuple classes (complementary to
> SchemaTuple)
> --
>
>                 Key: PIG-3429
>                 URL: https://issues.apache.org/jira/browse/PIG-3429
>             Project: Pig
>          Issue Type: Improvement
>          Components: data
>    Affects Versions: 0.12
>            Reporter: Jonathan Packer
>            Assignee: Jonathan Packer
>         Attachments: PIG-3429-v1.diff, PIG-3429-v2.diff
>
>
> Pig's default tuple implementation is very memory inefficient for small
> tuples, as the minimum size of an empty tuple is 96 bytes. This leads to bags
> being spilled more often than they need to. SchemaTuple addresses this, but
> is not fully integrated into the PhysicalPlan pipeline (and seems like it
> would be difficult to do so). Furthermore, it is likely that almost all UDFs
> do not use SchemaTuple.
> This patch therefore provides some basic optimizations to reduce memory
> footprint of tuples by having BinSedesTupleFactory construct specialized
> tuple implementations in certain circumstances. This way, anything using
> BinSedesTupleFactory will reap the benefits, and since SchemaTuple uses a
> different factory, it will not be interfered with.
> There is a long description below, because this patch might break stuff. I
> tried to think through possible implementation hazards which I will list.
> The specialized tuple implementations are as follows:
> EmptyTuple          // no fields, just an object header = 8 bytes
> NullWrapperTuple    // wraps a single null field, 8 bytes
> CountingTuple       // replaces (1L) as initial output of COUNT, 8 bytes
> IntegerWrapperTuple // these all wrap a single primitive field
> LongWrapperTuple    // object header + rounded primitive size = 16 bytes
> FloatWrapperTuple
> DoubleWrapperTuple
> BinSedesTuple2      // these are pair/triples of fields with no ArrayList
> BinSedesTuple3      // 16/24 bytes of overhead as opposed to 80 from ArrayList
> The memory savings are greatest for the algebraic math functions COUNT, SUM,
> etc. For example, the size of an intermediate tuple for COUNT should go from
> 112 bytes to 8 bytes. The size of an intermediate tuple from SUM should go
> from 112 bytes to 16 bytes.
> I haven't finished running the full unit-tests, but TestAlgebraicEval passes
> so I'm hopeful it will be manageable to debug.
> The three concerns that I have are:
> 1) Since TupleFactory now sometimes outputs non-appendable tuples, the
> isFixedSize() method had to be removed. A file search didn't show it being
> used anywhere though. I think appending to tuples instead of finding out the
> requisite size ahead of time is bad practice as well (I changed POForeach to
> do the latter so it can take advantage of the special tuple impls).
> 2) Also since TupleFactory now has multiple tuple types, the tupleClass()
> method gets tricky. I made a superclass GenericBinSedesTuple that all the
> specialized classes inherit from, and it seems to work, but I'm not sure what
> the implications of this are. It breaks the inheritance tree of AbstractTuple
> <-- DefaultTuple <-- BinSedesTuple, so now "DefaultBinSedesTuple" inherits
> directly from GenericBinSedesTuple and DefaultTuple is left unused. In the
> patch, all the stuff for DefaultBinSedesTuple is just copied over from the
> old DefaultTuple.
> 3) I tried to be careful not to break BinInterSedesTupleRawComparator, but
> this will need verification.
> Finally,
> 4) For my personal use cases, I'd like to make custom tuple implementations