[jira] Updated: (PIG-1289) PIG Join fails while doing a filter on joined data
[ https://issues.apache.org/jira/browse/PIG-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1289:
----------------------------
    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

Unit test failure was due to a port conflict. Manual test successful. Patch committed.

PIG Join fails while doing a filter on joined data
--------------------------------------------------

Key: PIG-1289
URL: https://issues.apache.org/jira/browse/PIG-1289
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Karim Saadah
Assignee: Daniel Dai
Priority: Minor
Fix For: 0.7.0
Attachments: PIG-1289-1.patch, PIG-1289-2.patch

PIG Join fails while doing a filter on joined data. Here are the steps to reproduce it:

-bash-3.1$ pig -latest -x local
grunt> a = load 'first.dat' using PigStorage('\u0001') as (f1:int, f2:chararray);
grunt> DUMP a;
(1,A)
(2,B)
(3,C)
(4,D)
grunt> b = load 'second.dat' using PigStorage() as (f3:chararray);
grunt> DUMP b;
(A)
(D)
(E)
grunt> c = join a by f2 LEFT OUTER, b by f3;
grunt> DUMP c;
(1,A,A)
(2,B,)
(3,C,)
(4,D,D)
grunt> describe c;
c: {a::f1: int,a::f2: chararray,b::f3: chararray}
grunt> d = filter c by (f3 is null or f3 == '');
grunt> dump d;
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for b
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for b
2010-03-03 15:00:37,129 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for a
2010-03-03 15:00:37,130 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for a
2010-03-03 15:00:37,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias d

This one fails too:

grunt> d = filter c by (b::f3 is null or b::f3 == '');

And this one does not return the expected results:

grunt> d = foreach c generate f1 as f1, f2 as f2, f3 as f3;
grunt> e = filter d by (f3 is null or f3 == '');
grunt> DUMP e;
(1,A,)
(2,B,)
(3,C,)
(4,D,)

while the expected result is

(2,B,)
(3,C,)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
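For reference, the semantics the report relies on can be sketched outside Pig. The plain-Java sketch below (hypothetical helper names, no Pig classes involved) mimics `join a by f2 LEFT OUTER, b by f3` followed by a `f3 is null` filter: every left-side row survives the join, the right side is null when unmatched, and the filter should keep only the unmatched rows, i.e. (2,B) and (3,C).

```java
import java.util.*;

public class LeftOuterNullFilter {
    // Mimics: c = join a by f2 LEFT OUTER, b by f3; d = filter c by f3 is null;
    // Returns the left-side rows that found no match on the right.
    public static List<String> unmatchedLeftRows(Map<Integer, String> a, Set<String> b) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> row : a.entrySet()) {
            // LEFT OUTER: every row of a survives; the b side is null when there is no match
            boolean matched = b.contains(row.getValue());
            if (!matched) { // the "f3 is null" filter keeps only unmatched rows
                out.add(row.getKey() + "," + row.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> a = new LinkedHashMap<>();
        a.put(1, "A"); a.put(2, "B"); a.put(3, "C"); a.put(4, "D");
        Set<String> b = new HashSet<>(Arrays.asList("A", "D", "E"));
        System.out.println(unmatchedLeftRows(a, b)); // expected: [2,B, 3,C]
    }
}
```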
[jira] Commented: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847260#action_12847260 ]

Hadoop QA commented on PIG-1258:
--------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12438944/PIG-1258.patch
against trunk revision 925034.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/244/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/244/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/244/console

This message is automatically generated.

[zebra] Number of sorted input splits is unusually high
-------------------------------------------------------

Key: PIG-1258
URL: https://issues.apache.org/jira/browse/PIG-1258
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Yan Zhou
Attachments: PIG-1258.patch

The number of sorted input splits is unusually high if the projections are on multiple column groups, a union of tables, or column group(s) that hold many small tfiles. In one test, the number is about 100 times bigger than that from unsorted input splits on the same input tables.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1117:
-----------------------------------
    Attachment: PIG-1117-0.7.0-reviewed.patch

Minor review changes, all superficial.
- changed the spacing to conform to project conventions
- added spaces before/after curly braces where I saw them missing
- fixed spelling and occasional references to HiveRCLoader in the docs (you've renamed it to HiveColumnarLoader)
- minor tweak to get rid of one remaining deprecation warning in the RecordReader

Tests pass on my machine. Gerrit, if you are ok with these changes, I will commit.

Pig reading hive columnar rc tables
-----------------------------------

Key: PIG-1117
URL: https://issues.apache.org/jira/browse/PIG-1117
Project: Pig
Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Fix For: 0.7.0
Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, PIG-1117-0.7.0-new.patch, PIG-1117-0.7.0-reviewed.patch, PIG-1117.patch, PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch

I've coded a LoadFunc implementation that can read from Hive Columnar RC tables. This is needed for a project that I'm working on because all our data is stored using the Hive thrift-serialized Columnar RC format. I have looked at the piggybank but did not find any implementation that could do this. We've been running it on our cluster for the last week and have worked out most bugs. There are still some improvements to be done, like setting the number of mappers based on date partitioning. It's been optimized to read only specific columns, and it can churn through a data set almost 8 times faster with this improvement because not all column data is read. I would like to contribute the class to the piggybank; can you guide me in what I need to do? I've used Hive-specific classes to implement this; is it possible to add this to the piggybank build ivy for automatic download of the dependencies?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1117:
-----------------------------------
    Attachment: PIG-1117-0.7.0-reviewed.patch

Attaching again -- forgot to click the license check box. Which reminded me to check for Apache license headers in the new files, and it turns out they were missing -- so I added them. Assuming that's ok, since Gerrit granted license for the patches when he attached them to the Jira.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1258:
--------------------------
    Status: Open (was: Patch Available)

The test report page with the claimed failures of some core tests is not available on the web. Will resubmit.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1258:
--------------------------
    Status: Patch Available (was: Open)

Resubmit so Hudson will rerun.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1117:
-----------------------------------
    Resolution: Fixed
    Status: Resolved (was: Patch Available)

Patch committed. Thanks for this contribution, Gerrit! This will really help people who are working with both Hive and Pig. Now we just need a Zebra SerDe... :-)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1298:
------------------------------
    Status: Patch Available (was: Open)

Restore file traversal behavior to Pig loaders
----------------------------------------------

Key: PIG-1298
URL: https://issues.apache.org/jira/browse/PIG-1298
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.7.0
Attachments: PIG-1298.patch, PIG-1298_1.patch

Given a location, a Pig loader is expected to recursively load all the files under that location (i.e., all the files returned by the ls -R command). However, after the transition to the Hadoop 20 API, only the files returned by the ls command are loaded.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
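The "ls -R" contract the issue describes is plain recursive descent. The sketch below is illustrative only: Pig's real loaders walk HDFS through Hadoop's FileSystem API, not java.io.File, and none of these names are Pig's. It shows the difference between listing only the top level (the post-Hadoop-20 behavior being fixed) and descending into subdirectories.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class RecursiveListing {
    // Collects all regular files under dir, descending into
    // subdirectories -- the "ls -R" behavior, as opposed to a flat "ls".
    public static List<File> listRecursively(File dir) {
        List<File> files = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) return files; // not a directory, or unreadable
        for (File child : children) {
            if (child.isDirectory()) {
                files.addAll(listRecursively(child)); // recurse, like ls -R
            } else {
                files.add(child);
            }
        }
        return files;
    }
}
```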
[jira] Updated: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1298:
------------------------------
    Status: Open (was: Patch Available)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1298:
------------------------------
    Attachment: PIG-1298_1.patch

Fix release audit issue.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847455#action_12847455 ]

Gerrit Jansen van Vuuren commented on PIG-1117:
-----------------------------------------------

:) Yep, I might just start on a Zebra SerDe for Hive; then we can have complete Hive-Pig harmony.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-915) Load row names in HBase loader
[ https://issues.apache.org/jira/browse/PIG-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847467#action_12847467 ]

Olga Natkovich commented on PIG-915:
------------------------------------

Jeff, are you still planning to get this patch in for 0.7.0? We are planning to branch on Monday and need to get it in before that. Otherwise, we can postpone it till the 0.8.0 release.

Load row names in HBase loader
------------------------------

Key: PIG-915
URL: https://issues.apache.org/jira/browse/PIG-915
Project: Pig
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Alex Newman
Assignee: Jeff Zhang
Priority: Minor
Fix For: 0.7.0
Attachments: Pig_915.Patch

Currently there is no way to get the row names when doing a query from HBase; we should probably remedy this, as important data may be stored there.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1182) Pig reference manual does not mention syntax for comments
[ https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1182: Fix Version/s: (was: 0.7.0) Assignee: (was: Corinne Chandel) Pig reference manual does not mention syntax for comments - Key: PIG-1182 URL: https://issues.apache.org/jira/browse/PIG-1182 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.5.0 Reporter: David Ciemiewicz The Pig 0.5.0 reference manual does not mention how to write comments in your pig code using -- (two dashes). http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html Also, does /* */ also work? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
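To the question the report asks: recent Pig releases accept both comment styles. A small Pig Latin sketch (worth confirming against the reference manual for the release in question):

{code}
-- a single-line comment: everything after the double dash is ignored
A = LOAD 'input.txt' AS (f1:int);  -- trailing comments work too
/* a C-style block comment,
   which may span multiple lines */
DUMP A;
{code}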
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847470#action_12847470 ] Olga Natkovich commented on PIG-1205: - Jeff, are you still planning to get this into Pig 0.7.0 by Monday or should we move this to Pig 0.8.0? Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.7.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847473#action_12847473 ] Dmitriy V. Ryaboy commented on PIG-1205: fwiw -- I have an implementation for 0.6 that does most of what I outlined above; could probably port to 0.7 and make apache-friendly within the next couple of weeks. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.7.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1238) Dump does not respect the schema
[ https://issues.apache.org/jira/browse/PIG-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847474#action_12847474 ]

Richard Ding commented on PIG-1238:
-----------------------------------

Hi Ankur, I ran the following script

{code}
A = LOAD '1.txt' USING PigStorage();
B = FOREACH A GENERATE ['a'#'12'] as b:map[], ['b'#['c'#'12']] as mapFields;
C = FOREACH B GENERATE (CHARARRAY) mapFields#'b'#'c' AS f1, RANDOM() AS f2;
D = ORDER C BY f2 PARALLEL 10;
E = LIMIT D 20;
F = FOREACH E GENERATE f1;
dump F;
{code}

and it returns the correct result. Can you sync again with the trunk and let me know if the problem still exists?

Dump does not respect the schema
--------------------------------

Key: PIG-1238
URL: https://issues.apache.org/jira/browse/PIG-1238
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
Assignee: Richard Ding
Fix For: 0.7.0
Attachments: PIG-1238.patch

For complex data types and certain sequences of operations, dump produces results with a non-existent field in the relation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Begin a discussion about Pig as a top level project
You have probably heard by now that there is a discussion going on in the Hadoop PMC as to whether a number of the subprojects (HBase, Avro, ZooKeeper, Hive, and Pig) should move out from under the Hadoop umbrella and become top-level Apache projects (TLPs). This discussion has picked up recently since the Apache board has clearly communicated to the Hadoop PMC that it is concerned that Hadoop is acting as an umbrella project with many disjoint subprojects underneath it. They are concerned that this gives Apache little insight into the health and happenings of the subproject communities, which in turn means Apache cannot properly mentor those communities.

The purpose of this email is to start a discussion within the Pig community about this topic. Let me cover first what becoming a TLP would mean for Pig, and then I'll go into what options I think we as a community have.

Becoming a TLP would mean that Pig would itself have a PMC that would report directly to the Apache board. Who would be on the PMC would be something we as a community would need to decide. Common options would be to say all active committers are on the PMC, or all active committers who have been a committer for at least a year. We would also need to elect a chair of the PMC. This lucky person would have no additional power, but would have the additional responsibility of writing quarterly reports on Pig's status for Apache board meetings, as well as coordinating with Apache to get accounts for new committers, etc. For more information see http://www.apache.org/foundation/how-it-works.html#roles

Becoming a TLP would not mean that we are ostracized from the Hadoop community. We would continue to be invited to Hadoop Summits, HUGs, etc. Since all Pig developers and users are by definition Hadoop users, we would continue to be a strong presence in the Hadoop community.

I see three ways that we as a community can respond to this:

1) Say yes, we want to be a TLP now.

2) Say yes, we want to be a TLP, but not yet. We feel we need more time to mature. If we choose this option we need to be able to clearly articulate how much time we need and what we hope to see change in that time.

3) Say no, we feel the benefits of staying with Hadoop outweigh the drawbacks of being a disjoint subproject. If we choose this, we need to be able to say exactly what those benefits are and why we feel they will be compromised by leaving the Hadoop project.

There may be other options that I haven't thought of. Please feel free to suggest any you think of.

Questions? Thoughts? Let the discussion begin.

Alan.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847476#action_12847476 ] Olga Natkovich commented on PIG-1205: - Sounds good. Then I will mark it for inclusion in 0.8.0. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1205: Fix Version/s: (was: 0.7.0) 0.8.0 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1285) Allow SingleTupleBag to be serialized
[ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847481#action_12847481 ]

Olga Natkovich commented on PIG-1285:
-------------------------------------

Dmitriy, are you still planning to get this in before Monday or should we move it to Pig 0.8.0?

Allow SingleTupleBag to be serialized
-------------------------------------

Key: PIG-1285
URL: https://issues.apache.org/jira/browse/PIG-1285
Project: Pig
Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.7.0
Attachments: PIG-1285.patch

Currently, Pig uses a SingleTupleBag for efficiency when a full-blown spillable bag implementation is not needed in the Combiner optimization. Unfortunately this can create problems. The Initial.exec() code below fails at run-time with the message that a SingleTupleBag cannot be serialized:

{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            resTuple.set(i, in.get(i));
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}

The code below can fix the problem in the UDF, but it seems like something that should be handled transparently, not requiring UDF authors to know about SingleTupleBags.

{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    /*
     * Unfortunately SingleTupleBags are not serializable. We cache whether a given index contains a bag
     * in the map below, and copy all bags into DefaultBags before returning to avoid serialization exceptions.
     */
    Map<Integer, Boolean> isBagAtIndex = Maps.newHashMap();
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            Object obj = in.get(i);
            if (!isBagAtIndex.containsKey(i)) {
                isBagAtIndex.put(i, obj instanceof SingleTupleBag);
            }
            if (isBagAtIndex.get(i)) {
                DataBag newBag = bagFactory_.newDefaultBag();
                newBag.addAll((DataBag) obj);
                obj = newBag;
            }
            resTuple.set(i, obj);
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
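A serialization fix of the kind the UDF workaround above sidesteps -- letting the single-tuple bag write and read its own fields -- can be sketched in isolation. This is illustrative only: the String payload stands in for a Pig Tuple, and none of these class or method names are Pig's.

```java
import java.io.*;

public class SingleElementBag {
    private final String element; // stand-in for the single Tuple

    public SingleElementBag(String element) { this.element = element; }

    public String get() { return element; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(1L);     // bag size header: always exactly one element
        out.writeUTF(element);
    }

    public static SingleElementBag readFields(DataInput in) throws IOException {
        long size = in.readLong();
        if (size != 1L) throw new IOException("expected exactly one element, got " + size);
        return new SingleElementBag(in.readUTF());
    }

    // Serialize to bytes and deserialize back, as a shuffle boundary would.
    public static SingleElementBag roundTrip(SingleElementBag bag) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            bag.write(new DataOutputStream(bytes));
            return readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```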
[jira] Assigned: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
[ https://issues.apache.org/jira/browse/PIG-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1308:
-----------------------------------
    Assignee: Pradeep Kamath

Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]

Key: PIG-1308
URL: https://issues.apache.org/jira/browse/PIG-1308
Project: Pig
Issue Type: Bug
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
Fix For: 0.7.0

A simple script fails to read files from BinStorage() and fails to submit jobs to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.

{code}
data = load 'binstoragesample' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we found the following error:

2010-03-18 22:31:22,296 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:01,574 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:43,276 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:33:21,743 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:34:02,004 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:34:43,442 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:35:25,907 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:36:07,402 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:36:48,596 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:37:28,014 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:38:04,823 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:38:38,981 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:39:12,220 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2

The stack trace revealed:

at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:115)
at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
at org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
at org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
at org.apache.pig.PigServer.compileLp(PigServer.java:883)
at org.apache.pig.PigServer.store(PigServer.java:564)

The BinStorage data was generated from 2 datasets using limit and union:

{code}
Large1 = load 'input1' using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'binstoragesample' using BinStorage();
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1285) Allow SingleTupleBag to be serialized
[ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847507#action_12847507 ] Dmitriy V. Ryaboy commented on PIG-1285: Yeah I'll post it over the weekend. Just to make sure -- Pradeep, you would be ok then if I just copied the writeFields and readFields out of DefaultAbstractBag into SingleTupleBag? Allow SingleTupleBag to be serialized - Key: PIG-1285 URL: https://issues.apache.org/jira/browse/PIG-1285 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.7.0 Attachments: PIG-1285.patch Currently, Pig uses a SingleTupleBag for efficiency when a full-blown spillable bag implementation is not needed in the Combiner optimization. Unfortunately this can create problems. The below Initial.exec() code fails at run-time with the message that a SingleTupleBag cannot be serialized:
{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            resTuple.set(i, in.get(i));
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}
The code below can fix the problem in the UDF, but it seems like something that should be handled transparently, not requiring UDF authors to know about SingleTupleBags.
{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    /*
     * Unfortunately SingleTupleBags are not serializable. We cache whether a given index contains a bag
     * in the map below, and copy all bags into DefaultBags before returning to avoid serialization exceptions.
     */
    Map<Integer, Boolean> isBagAtIndex = Maps.newHashMap();
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            Object obj = in.get(i);
            if (!isBagAtIndex.containsKey(i)) {
                isBagAtIndex.put(i, obj instanceof SingleTupleBag);
            }
            if (isBagAtIndex.get(i)) {
                DataBag newBag = bagFactory_.newDefaultBag();
                newBag.addAll((DataBag) obj);
                obj = newBag;
            }
            resTuple.set(i, obj);
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
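As a plain-Java illustration of why the workaround above helps: copying the contents of a non-serializable container into an ArrayList yields something Java serialization accepts. `SingleItemBag` below is a hypothetical stand-in, not Pig's actual SingleTupleBag or DataBag classes.

```java
import java.io.*;
import java.util.*;

// Hypothetical stand-in for Pig's SingleTupleBag: iterable but not Serializable.
class SingleItemBag implements Iterable<String> {
    private final String item;
    SingleItemBag(String item) { this.item = item; }
    public Iterator<String> iterator() {
        return Collections.singletonList(item).iterator();
    }
}

public class CopyBeforeSerialize {
    // Copy a possibly non-serializable bag into a serializable container
    // before handing it to Java serialization.
    static ArrayList<String> toSerializable(Iterable<String> bag) {
        ArrayList<String> copy = new ArrayList<>();
        for (String s : bag) copy.add(s);
        return copy;
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> safe = toSerializable(new SingleItemBag("hello"));
        // Round-trip through Java serialization to show the copy survives;
        // serializing the SingleItemBag itself would throw NotSerializableException.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(safe);
        oos.close();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        @SuppressWarnings("unchecked")
        ArrayList<String> back = (ArrayList<String>) in.readObject();
        System.out.println(back.get(0));
    }
}
```

The fix proposed in the comment (serializing SingleTupleBag directly) removes the need for this per-UDF copy.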
[jira] Created: (PIG-1309) Map-side Cogroup
Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-1309: -- Attachment: mapsideCogrp.patch Preliminary patch to discuss the approach. Not ready for inclusion yet. Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1253) [zebra] make map/reduce test cases run on real cluster
[ https://issues.apache.org/jira/browse/PIG-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847521#action_12847521 ] Yan Zhou commented on PIG-1253: --- +1 on PIG-1253-0.6.patch that is committed to the 0.6 branch. [zebra] make map/reduce test cases run on real cluster -- Key: PIG-1253 URL: https://issues.apache.org/jira/browse/PIG-1253 Project: Pig Issue Type: Task Affects Versions: 0.6.0 Reporter: Chao Wang Assignee: Chao Wang Fix For: 0.7.0 Attachments: PIG-1253-0.6.patch, PIG-1253.patch, PIG-1253.patch The goal of this task is to make map/reduce test cases run on real cluster. Currently map/reduce test cases are mostly tested under local mode. When running on real cluster, all involved jars have to be manually deployed in advance which is not desired. The major change here is to support -libjars option to be able to ship user jars to backend automatically. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847560#action_12847560 ] Gaurav Jain commented on PIG-1258: -- +1 [zebra] Number of sorted input splits is unusually high --- Key: PIG-1258 URL: https://issues.apache.org/jira/browse/PIG-1258 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Yan Zhou Attachments: PIG-1258.patch Number of sorted input splits is unusually high if the projections are on multiple column groups, or a union of tables, or column group(s) that hold many small tfiles. In one test, the number is about 100 times bigger than that from unsorted input splits on the same input tables. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1307) when we spill the DefaultDataBag we are not setting the sized changed flag to be true.
[ https://issues.apache.org/jira/browse/PIG-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847587#action_12847587 ] Daniel Dai commented on PIG-1307: - Hi, Ben, Is the patch ready? Do you need help to add some test cases? when we spill the DefaultDataBag we are not setting the sized changed flag to be true. -- Key: PIG-1307 URL: https://issues.apache.org/jira/browse/PIG-1307 Project: Pig Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 0.7.0 Attachments: PIG-1307.patch pig uses a size changed flag to indicate when we should recalculate the memory footprint of the bag. the setting of this flag is sprinkled throughout the code. unfortunately, it is missing in DefaultDataBag.spill(). there may be other cases as well. the problem with this case is that when the low memory threshold kicks in, bags are spilled until the desired amount of memory is freed. since the flag is not being reset subsequent calls to the threshold events will retrigger the spill() and think more memory was freed even though nothing was actually spilled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847589#action_12847589 ] Alan Gates commented on PIG-1309: - Here's a write up on the design behind this: http://wiki.apache.org/pig/MapSideCogroup Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847594#action_12847594 ] Allen Wittenauer commented on PIG-794: -- What is the latest on getting Avro support in pig? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1310) ISO Date UDFs: Conversion, Rounding and Date Math
ISO Date UDFs: Conversion, Rounding and Date Math - Key: PIG-1310 URL: https://issues.apache.org/jira/browse/PIG-1310 Project: Pig Issue Type: New Feature Components: impl Reporter: Russell Jurney Fix For: 0.7.0 I've written UDFs to handle loading unix times, datemonth values and ISO 8601 formatted date strings, and working with them as ISO datetimes using jodatime. The working code is here: http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/ It needs to be documented and tests added, and a couple UDFs are missing, but these work if you REGISTER the jodatime jar in your script. Hopefully I can get this stuff in piggybank before someone else writes it this time :) The rounding also may not be performant, but the code works. Ultimately I'd also like to enable support for ISO 8601 durations. Someone slap me if this isn't done soon, it is not much work and this should help everyone working with time series. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
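For reference, the conversions the issue describes (unix time to an ISO 8601 string, and rounding down to a coarser unit) can be sketched with the JDK's java.time instead of the joda-time jar the linked code registers. The method names below are illustrative, not the ones in the linked repository.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAdjusters;

public class IsoDateDemo {
    // Convert a unix timestamp (seconds) to an ISO 8601 string in UTC.
    static String toIso(long unixSeconds) {
        return Instant.ofEpochSecond(unixSeconds)
                      .atZone(ZoneOffset.UTC)
                      .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    // Round a unix timestamp down to the first day of its month (UTC),
    // one example of the "rounding" the UDFs perform.
    static String roundToMonth(long unixSeconds) {
        return Instant.ofEpochSecond(unixSeconds)
                      .atZone(ZoneOffset.UTC)
                      .with(TemporalAdjusters.firstDayOfMonth())
                      .toLocalDate()
                      .toString();
    }

    public static void main(String[] args) {
        long ts = 1269253800L; // 2010-03-22 10:30:00 UTC
        System.out.println(toIso(ts));
        System.out.println(roundToMonth(ts));
    }
}
```

In a real Pig UDF these helpers would sit inside an EvalFunc and pull the timestamp out of the input Tuple.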
[jira] Updated: (PIG-1310) ISO Date UDFs: Conversion, Rounding and Date Math
[ https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1310: Fix Version/s: (was: 0.7.0) 0.8.0 I think this will be very useful for many users! We are freezing 0.7.0 on Monday so moving this to 0.8.0 release. ISO Date UDFs: Conversion, Rounding and Date Math - Key: PIG-1310 URL: https://issues.apache.org/jira/browse/PIG-1310 Project: Pig Issue Type: New Feature Components: impl Reporter: Russell Jurney Fix For: 0.8.0 Original Estimate: 168h Remaining Estimate: 168h I've written UDFs to handle loading unix times, datemonth values and ISO 8601 formatted date strings, and working with them as ISO datetimes using jodatime. The working code is here: http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/ It needs to be documented and tests added, and a couple UDFs are missing, but these work if you REGISTER the jodatime jar in your script. Hopefully I can get this stuff in piggybank before someone else writes it this time :) The rounding also may not be performant, but the code works. Ultimately I'd also like to enable support for ISO 8601 durations. Someone slap me if this isn't done soon, it is not much work and this should help everyone working with time series. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847600#action_12847600 ] Russell Jurney commented on PIG-1150: - Yes, this sounds like the thing to do :) On Tue, Mar 16, 2010 at 5:29 PM, Dmitriy V. Ryaboy (JIRA) VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Fix For: 0.7.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
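The parallel variance algorithm the issue links combines per-partition (count, sum, sum-of-squares) triples, which is what makes the UDF Algebraic: partials from each map task can be merged in any order. A minimal plain-Java sketch (not the attached var.patch; population variance, for brevity):

```java
public class ParallelVariance {
    // Partial state for one partition: count, sum, and sum of squares.
    static final class Partial {
        long n; double sum; double sumSq;
        void add(double x) { n++; sum += x; sumSq += x * x; }
        // Combine two partials -- the associative "combiner" step.
        Partial merge(Partial o) {
            Partial m = new Partial();
            m.n = n + o.n; m.sum = sum + o.sum; m.sumSq = sumSq + o.sumSq;
            return m;
        }
        // Population variance: E[X^2] - (E[X])^2.
        double variance() {
            double mean = sum / n;
            return sumSq / n - mean * mean;
        }
    }

    public static void main(String[] args) {
        // Two "partitions" of the data 1..5.
        double[] left = {1, 2, 3}, right = {4, 5};
        Partial a = new Partial(), b = new Partial();
        for (double x : left) a.add(x);
        for (double x : right) b.add(x);
        System.out.println(a.merge(b).variance());  // variance of 1..5 is 2.0
    }
}
```

Taking Math.sqrt of the result gives the standard deviation mentioned in the description.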
[jira] Commented: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847604#action_12847604 ] Alan Gates commented on PIG-1309: - Comments: A liberal dose of comments would help greatly in understanding what the various helper methods are doing. You use LocalRearrange to split the keys and values. What's the overhead of that? Would it be more efficient to factor the key splitting code out of LR and share it between LR and here? I don't understand the need for pullTuplesFromSideLoaders(). In setup() you put one tuple from each input into the heap. Then you pull from the heap until you see a key change. But I don't understand the next step. At key change you call pullTuplesFromSideLoaders(). But if you've been adding into the heap as you pull tuples, there's no need to pull anything from the side loaders at this point. All you should need to do is package up the bags you've built and return them as your tuple. Also, it appears you're using pullTuplesFromSideLoaders() to fill the heap. You shouldn't be pulling all tuples for a current key from side loaders, as you're likely to miss tuples with keys that are in the side loaders but not in the main loader. The algorithm should be that as you pull a tuple from the heap, you place the next tuple from that same stream into the heap. The heap will guarantee that your tuples come out in order. Map-side Cogroup Key: PIG-1309 URL: https://issues.apache.org/jira/browse/PIG-1309 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Assignee: Ashutosh Chauhan Attachments: mapsideCogrp.patch In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
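The heap invariant Alan describes -- whenever you pull a tuple from the heap, push the next tuple from that same stream -- is an ordinary k-way merge. A self-contained sketch over sorted integer lists (plain Java, not Pig's tuple types; `Cursor` is a hypothetical helper standing in for a side loader):

```java
import java.util.*;

public class HeapMerge {
    // One sorted input stream plus the tuple currently at its head.
    static final class Cursor {
        final Iterator<Integer> it;
        int head;
        Cursor(Iterator<Integer> it) { this.it = it; head = it.next(); }
    }

    // k-way merge: each time a value is pulled from the heap, the next
    // value from that same stream is pushed, so output stays key-ordered
    // without ever draining a whole key from any one stream.
    static List<Integer> merge(List<List<Integer>> streams) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>(Comparator.comparingInt((Cursor c) -> c.head));
        for (List<Integer> s : streams)
            if (!s.isEmpty()) heap.add(new Cursor(s.iterator()));
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.it.hasNext()) { c.head = c.it.next(); heap.add(c); }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(
            Arrays.asList(1, 4, 7), Arrays.asList(2, 5), Arrays.asList(3, 6))));
    }
}
```

A map-side cogroup would additionally accumulate consecutive equal keys into per-input bags before emitting an output tuple.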
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847606#action_12847606 ] Alan Gates commented on PIG-794: It depends on what you mean by support. As far as Pig using Avro for serialization between Map and Reduce and MR jobs, we haven't done anything on that front lately. Last time we tested, the performance was comparable to our own BinStorage, so we weren't motivated to move yet. Now that Avro has matured a bit, maybe we should test again. As far as using Avro to store user data, with Pig 0.7 it should become quite easy to write Avro load and store functions. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847607#action_12847607 ] Jeff Hammerbacher commented on PIG-794: --- bq. Last time we tested the performance was comparable to our own BinStorage so we weren't motivated to move yet. Hey Alan, There should be benefits to using Avro besides just performance. Either way, looking forward to seeing you on the Avro lists when you decide to test again! Regards, Jeff Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1150: --- Fix Version/s: (was: 0.7.0) 0.8.0 Changed the target to 0.8 -- we won't have time to finish splitting this out by Monday. VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847613#action_12847613 ] Alan Gates commented on PIG-794: Jeff, Beyond performance what do you see as the big wins of using Avro? I'm just thinking here of moving data between MR jobs in a Pig script and between Map and Reduce phases. I see lots of advantages to users using Avro to store their data. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847614#action_12847614 ] Dmitriy V. Ryaboy commented on PIG-794: --- I'll take a crack at it. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-794: - Assignee: Dmitriy V. Ryaboy Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-1150: -- Assignee: Dmitriy V. Ryaboy VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1298) Restore file traversal behavior to Pig loaders
[ https://issues.apache.org/jira/browse/PIG-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847630#action_12847630 ] Ashutosh Chauhan commented on PIG-1298: --- +1 Restore file traversal behavior to Pig loaders -- Key: PIG-1298 URL: https://issues.apache.org/jira/browse/PIG-1298 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-1298.patch, PIG-1298_1.patch Given a location to a Pig loader, it is expected to recursively load all the files under the location (i.e., all the files returned with ls -R command). However, after the transition to using Hadoop 20 API, only files returned with ls command are loaded. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
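The intended semantics -- everything `ls -R` would return under the location, not just the top level a flat `ls` shows -- can be illustrated in plain Java with NIO. This is an analogue of the expected behavior, not the actual Pig/Hadoop traversal code in the patch.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class RecursiveListing {
    // Return every regular file under root, recursively (the `ls -R`
    // behavior), rather than only root's immediate children (`ls`).
    static List<Path> listRecursively(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small tree: top.dat at the root, nested.dat one level down.
        Path root = Files.createTempDirectory("walkdemo");
        Files.createDirectories(root.resolve("part-1"));
        Files.createFile(root.resolve("top.dat"));
        Files.createFile(root.resolve("part-1/nested.dat"));
        for (Path p : listRecursively(root))
            System.out.println(root.relativize(p));
    }
}
```

A non-recursive listing of the same tree would report only top.dat and miss the nested file, which is the regression the issue describes.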
[jira] Commented: (PIG-1307) when we spill the DefaultDataBag we are not setting the sized changed flag to be true.
[ https://issues.apache.org/jira/browse/PIG-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847631#action_12847631 ] Benjamin Reed commented on PIG-1307: I don't have test cases, since they would be orders of magnitude more difficult to write than the patch and may not reproduce the problem across different machine configurations. when we spill the DefaultDataBag we are not setting the sized changed flag to be true. -- Key: PIG-1307 URL: https://issues.apache.org/jira/browse/PIG-1307 Project: Pig Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 0.7.0 Attachments: PIG-1307.patch pig uses a size changed flag to indicate when we should recalculate the memory footprint of the bag. the setting of this flag is sprinkled throughout the code. unfortunately, it is missing in DefaultDataBag.spill(). there may be other cases as well. the problem with this case is that when the low memory threshold kicks in, bags are spilled until the desired amount of memory is freed. since the flag is not being reset, subsequent calls to the threshold events will retrigger the spill() and think more memory was freed even though nothing was actually spilled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1311) Pig interfaces should be clearly classified in terms of scope and stability
[ https://issues.apache.org/jira/browse/PIG-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847632#action_12847632 ] Alan Gates commented on PIG-1311: - Hadoop has a proposal on how to approach this in HADOOP-5073. I propose we use the same nomenclature. Java interfaces would be marked via annotations (provided by Hadoop commons). For other interfaces we would need to provide version-specific documents (that is, in Forrest, not on the wiki) that detail scope and stability for each interface. Pig interfaces should be clearly classified in terms of scope and stability --- Key: PIG-1311 URL: https://issues.apache.org/jira/browse/PIG-1311 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Alan Gates Clearly marking Pig interfaces (Java interfaces but also things like config files, CLIs, Pig Latin syntax and semantics, etc.) to show scope (public/private) and stability (stable/evolving/unstable) will help users understand how to interact with Pig and developers to understand what things they can and cannot change. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
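The HADOOP-5073 approach boils down to runtime-retained marker annotations that tools and readers can inspect. A minimal sketch of the idea; Hadoop's real annotations live in the org.apache.hadoop.classification package (InterfaceAudience, InterfaceStability), so the annotation types defined here are stand-ins only.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class InterfaceClassificationDemo {
    // Stand-in audience/stability markers, retained at runtime so tools
    // (doc generators, compatibility checkers) can read them reflectively.
    @Retention(RetentionPolicy.RUNTIME) @interface Public {}
    @Retention(RetentionPolicy.RUNTIME) @interface Evolving {}

    // A public-facing but still-evolving API would be marked like this.
    @Public @Evolving
    static class LoadFuncExampleApi {}

    public static void main(String[] args) {
        Class<?> c = LoadFuncExampleApi.class;
        System.out.println("public=" + c.isAnnotationPresent(Public.class)
            + " evolving=" + c.isAnnotationPresent(Evolving.class));
    }
}
```

Non-Java interfaces (config files, CLIs, Pig Latin syntax) cannot carry annotations, which is why the comment proposes versioned documents for those instead.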
[jira] Commented: (PIG-1307) when we spill the DefaultDataBag we are not setting the sized changed flag to be true.
[ https://issues.apache.org/jira/browse/PIG-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847636#action_12847636 ] Daniel Dai commented on PIG-1307: - +1. Will commit it shortly. when we spill the DefaultDataBag we are not setting the sized changed flag to be true. -- Key: PIG-1307 URL: https://issues.apache.org/jira/browse/PIG-1307 Project: Pig Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 0.7.0 Attachments: PIG-1307.patch pig uses a size changed flag to indicate when we should recalculate the memory footprint of the bag. the setting of this flag is sprinkled throughout the code. unfortunately, it is missing in DefaultDataBag.spill(). there may be other cases as well. the problem with this case is that when the low memory threshold kicks in, bags are spilled until the desired amount of memory is freed. since the flag is not being reset subsequent calls to the threshold events will retrigger the spill() and think more memory was freed even though nothing was actually spilled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1307) when we spill the DefaultDataBag we are not setting the sized changed flag to be true.
[ https://issues.apache.org/jira/browse/PIG-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1307: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Performed a manual test, and it works. Patch committed. Thanks Ben! when we spill the DefaultDataBag we are not setting the sized changed flag to be true. -- Key: PIG-1307 URL: https://issues.apache.org/jira/browse/PIG-1307 Project: Pig Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 0.7.0 Attachments: PIG-1307.patch pig uses a size changed flag to indicate when we should recalculate the memory footprint of the bag. the setting of this flag is sprinkled throughout the code. unfortunately, it is missing in DefaultDataBag.spill(). there may be other cases as well. the problem with this case is that when the low memory threshold kicks in, bags are spilled until the desired amount of memory is freed. since the flag is not being reset subsequent calls to the threshold events will retrigger the spill() and think more memory was freed even though nothing was actually spilled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
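The fix described in PIG-1307 can be illustrated with a toy bag that caches its memory estimate behind a size-changed flag. This is a hypothetical stand-in, not Pig's actual DefaultDataBag, and the byte accounting is invented for the example; the point is the flagged line in spill().

```java
import java.util.ArrayList;
import java.util.List;

public class SpillFlagDemo {
    static class Bag {
        private final List<long[]> tuples = new ArrayList<>();
        private boolean sizeChanged = false;
        private long cachedSize = 0;

        void add(long[] t) { tuples.add(t); sizeChanged = true; }

        // The fix: spilling changes the in-memory footprint, so the flag
        // must be set here too. Without it, later getMemorySize() calls
        // keep returning the pre-spill size, and the low-memory handler
        // believes memory was freed when nothing more was spilled.
        long spill() {
            long freed = getMemorySize();
            tuples.clear();          // pretend the tuples went to disk
            sizeChanged = true;      // <-- the line the patch adds
            return freed;
        }

        long getMemorySize() {
            if (sizeChanged) {       // recalculate only when flagged
                long total = 0;
                for (long[] t : tuples) total += 16 + 8L * t.length;
                cachedSize = total;
                sizeChanged = false;
            }
            return cachedSize;
        }
    }

    public static void main(String[] args) {
        Bag bag = new Bag();
        bag.add(new long[]{1, 2, 3});
        long before = bag.getMemorySize();
        bag.spill();
        System.out.println(before + " -> " + bag.getMemorySize());
    }
}
```

With the flag line removed, the second getMemorySize() would still report the stale pre-spill estimate, which is exactly the retriggering behavior the issue describes.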
[jira] Created: (PIG-1312) Make Pig work with hadoop security
Make Pig work with hadoop security -- Key: PIG-1312 URL: https://issues.apache.org/jira/browse/PIG-1312 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 In order to make Pig work with hadoop security, we need to set mapreduce.job.credentials.binary in the JobConf before we call getSplit() in the backend. We need to change code in merge join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1312) Make Pig work with hadoop security
[ https://issues.apache.org/jira/browse/PIG-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1312: Status: Patch Available (was: Open) Make Pig work with hadoop security -- Key: PIG-1312 URL: https://issues.apache.org/jira/browse/PIG-1312 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1312-1.patch In order to make Pig work with hadoop security, we need to set mapreduce.job.credentials.binary in the JobConf before we call getSplit() in the backend. We need to change code in merge join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1312) Make Pig work with hadoop security
[ https://issues.apache.org/jira/browse/PIG-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1312: Attachment: PIG-1312-1.patch Make Pig work with hadoop security -- Key: PIG-1312 URL: https://issues.apache.org/jira/browse/PIG-1312 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0 Attachments: PIG-1312-1.patch In order to make Pig work with hadoop security, we need to set mapreduce.job.credentials.binary in the JobConf before we call getSplit() in the backend. We need to change code in merge join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.