[jira] Commented: (PIG-1285) Allow SingleTupleBag to be serialized
[ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847721#action_12847721 ]

Pradeep Kamath commented on PIG-1285:

yes

Allow SingleTupleBag to be serialized
Key: PIG-1285
URL: https://issues.apache.org/jira/browse/PIG-1285
Project: Pig
Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.7.0
Attachments: PIG-1285.patch

Currently, Pig uses a SingleTupleBag for efficiency in the Combiner optimization, when a full-blown spillable bag implementation is not needed. Unfortunately this can create problems. The Initial.exec() code below fails at run time with the message that a SingleTupleBag cannot be serialized:

{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            resTuple.set(i, in.get(i));
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}

The code below can fix the problem in the UDF, but it seems like something that should be handled transparently, not requiring UDF authors to know about SingleTupleBags.

{code}
@Override
public Tuple exec(Tuple in) throws IOException {
    // single record. just copy.
    if (in == null) return null;
    /*
     * Unfortunately SingleTupleBags are not serializable. We cache whether a given index contains a bag
     * in the map below, and copy all bags into DefaultBags before returning to avoid serialization exceptions.
     */
    Map<Integer, Boolean> isBagAtIndex = Maps.newHashMap();
    try {
        Tuple resTuple = tupleFactory_.newTuple(in.size());
        for (int i = 0; i < in.size(); i++) {
            Object obj = in.get(i);
            if (!isBagAtIndex.containsKey(i)) {
                isBagAtIndex.put(i, obj instanceof SingleTupleBag);
            }
            if (isBagAtIndex.get(i)) {
                DataBag newBag = bagFactory_.newDefaultBag();
                newBag.addAll((DataBag) obj);
                obj = newBag;
            }
            resTuple.set(i, obj);
        }
        return resTuple;
    } catch (IOException e) {
        log.warn(e);
        return null;
    }
}
{code}

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
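For readers without a Pig build at hand, the defensive-copy pattern in the workaround above can be reduced to a self-contained sketch. The bag classes below are hypothetical stand-ins for SingleTupleBag and DefaultBag (the real ones live in pig.jar); the point is only the pattern itself: detect a non-serializable container with instanceof and copy its contents into a serializable one before handing the tuple to the serializer.

```java
import java.io.*;
import java.util.*;

public class DefensiveCopyDemo {
    // Stand-in for SingleTupleBag: iterable but NOT Serializable.
    static class NonSerializableBag implements Iterable<String> {
        private final List<String> items = new ArrayList<>();
        void add(String s) { items.add(s); }
        public Iterator<String> iterator() { return items.iterator(); }
    }

    // Stand-in for DefaultBag: same contents, but Serializable.
    static class SerializableBag implements Iterable<String>, Serializable {
        private final List<String> items = new ArrayList<>();
        void addAll(Iterable<String> src) { for (String s : src) items.add(s); }
        public Iterator<String> iterator() { return items.iterator(); }
    }

    // Mirrors the UDF workaround: copy suspect fields before serialization.
    static Object copyIfNeeded(Object field) {
        if (field instanceof NonSerializableBag) {
            SerializableBag copy = new SerializableBag();
            copy.addAll((NonSerializableBag) field);
            return copy;
        }
        return field;
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        NonSerializableBag bag = new NonSerializableBag();
        bag.add("tuple-1");

        boolean failed = false;
        try { serialize(bag); } catch (NotSerializableException e) { failed = true; }
        System.out.println("direct serialization failed: " + failed);

        byte[] bytes = serialize(copyIfNeeded(bag));
        System.out.println("copied bag serialized to " + bytes.length + " bytes");
    }
}
```

As the JIRA comment argues, this copy could be done transparently by the framework rather than in every UDF; the sketch just shows why the workaround unblocks serialization.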
[jira] Updated: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
[ https://issues.apache.org/jira/browse/PIG-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1308:
Attachment: PIG-1308.patch

The root cause of the issue is that OpLimitOptimizer has a relaxed check() implementation which only checks whether the node matched by RuleMatcher is a LOLimit, which is true any time there is a LOLimit in the plan. As a result, the optimizer runs 500 iterations (the current maximum) of all rules, since OpLimitOptimizer always matches. The attached patch fixes the issue by tightening OpLimitOptimizer.check() to return false in cases where the LOLimit cannot be pushed up.

Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
Key: PIG-1308
URL: https://issues.apache.org/jira/browse/PIG-1308
Project: Pig
Issue Type: Bug
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
Fix For: 0.7.0
Attachments: PIG-1308.patch

A simple script fails to read files from BinStorage() and fails to submit jobs to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.

{code}
data = load 'binstoragesample' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we found the following in the client log:

2010-03-18 22:31:22,296 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:01,574 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:43,276 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
(the same message repeats roughly every 40 seconds, through 2010-03-18 22:39:12,220)

Stack trace revealed:

at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:115)
at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
at org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
at org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
at org.apache.pig.PigServer.compileLp(PigServer.java:883)
at org.apache.pig.PigServer.store(PigServer.java:564)

The BinStorage data was generated from 2 datasets using limit and union:

{code}
Large1 = load 'input1' using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'binstoragesample' using BinStorage();
{code}

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
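The root-cause analysis above (a check() that matches whenever a LOLimit exists anywhere in the plan, so the optimizer spins to its 500-iteration cap) can be sketched as a toy rule loop. The names and structure below are illustrative only, not Pig's actual optimizer API; only the 500-iteration cap comes from the report.

```java
import java.util.function.BooleanSupplier;

public class RuleLoopDemo {
    static final int MAX_ITERATIONS = 500; // the optimizer cap cited in the report

    /**
     * Repeatedly applies a rule while its check() says it matches.
     * Returns the number of iterations actually run.
     */
    static int optimize(BooleanSupplier check, Runnable transform) {
        int i = 0;
        while (i < MAX_ITERATIONS && check.getAsBoolean()) {
            transform.run();
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // Relaxed check: "is there a LOLimit anywhere?" is always true here,
        // so the loop runs until the cap.
        System.out.println("relaxed: " + optimize(() -> true, () -> {}));

        // Tightened check: also asks "can the limit still be pushed up?"
        // Here the push is possible exactly once, so the loop terminates early.
        boolean[] canPush = { true };
        System.out.println("tightened: " + optimize(() -> canPush[0], () -> canPush[0] = false));
    }
}
```

The patch described above does the analogous thing: OpLimitOptimizer.check() returns false once the LOLimit can no longer be pushed up, so the rule stops matching and the optimizer converges.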
[jira] Updated: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
[ https://issues.apache.org/jira/browse/PIG-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1308:
Status: Patch Available (was: Open)

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
Pig-trunk build 713
The Pig-trunk build was stuck on the h7 machine for more than a day; I've killed the stuck job and restarted the build on Hudson. http://hudson.zones.apache.org/hudson/view/Pig/job/Pig-trunk/713/ -Giri
[jira] Commented: (PIG-1282) [zebra] make Zebra's pig test cases run on real cluster
[ https://issues.apache.org/jira/browse/PIG-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847859#action_12847859 ]

Hadoop QA commented on PIG-1282:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12439224/PIG-1282.patch
against trunk revision 925513.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 254 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 523 release audit warnings (more than the trunk's current 522 warnings).
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/257/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/257/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/257/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/257/console

This message is automatically generated.

[zebra] make Zebra's pig test cases run on real cluster
Key: PIG-1282
URL: https://issues.apache.org/jira/browse/PIG-1282
Project: Pig
Issue Type: Task
Affects Versions: 0.6.0
Reporter: Chao Wang
Assignee: Chao Wang
Fix For: 0.7.0
Attachments: PIG-1282.patch, PIG-1282.patch, PIG-1282.patch

The goal of this task is to make Zebra's pig test cases run on a real cluster. Currently Zebra's pig test cases are mostly tested using MiniCluster; we want to use a real Hadoop cluster to test them.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
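One common way to let the same test suite target either MiniCluster or a real cluster is a property switch consulted at test setup. The property name `test.cluster` below is hypothetical, not necessarily what Zebra's patch uses; the sketch only shows the selection logic.

```java
import java.util.Properties;

public class ClusterModeDemo {
    /**
     * Decides whether tests should run against a real cluster.
     * Defaults to MiniCluster when the (hypothetical) property is absent.
     */
    static boolean useRealCluster(Properties props) {
        return "real".equalsIgnoreCase(props.getProperty("test.cluster", "mini"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("default uses real cluster? " + useRealCluster(props));
        props.setProperty("test.cluster", "real");
        System.out.println("with flag uses real cluster? " + useRealCluster(props));
    }
}
```

In a real suite the same decision would pick between starting a MiniCluster and pointing the Hadoop Configuration at an existing cluster's NameNode/JobTracker.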
[jira] Updated: (PIG-1285) Allow SingleTupleBag to be serialized
[ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1285:
Status: Open (was: Patch Available)

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1285) Allow SingleTupleBag to be serialized
[ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1285:
Attachment: PIG-1285.2.patch

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1285) Allow SingleTupleBag to be serialized
[ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1285:
Status: Patch Available (was: Open)

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1310) ISO Date UDFs: Conversion, Rounding and Date Math
[ https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847874#action_12847874 ]

Russell Jurney commented on PIG-1310:

I'm thinking it would be good if DateTime were a Pig primitive. Can someone give me an idea of how much work it is to add a Pig primitive, and whether this patch would be accepted for 0.8?

ISO Date UDFs: Conversion, Rounding and Date Math
Key: PIG-1310
URL: https://issues.apache.org/jira/browse/PIG-1310
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Russell Jurney
Fix For: 0.8.0
Original Estimate: 168h
Remaining Estimate: 168h

I've written UDFs to handle loading unix times, datemonth values and ISO 8601 formatted date strings, and working with them as ISO datetimes using Joda-Time. The working code is here: http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/ It needs to be documented and tests added, and a couple of UDFs are missing, but these work if you REGISTER the Joda-Time jar in your script. Hopefully I can get this stuff into piggybank before someone else writes it this time :) The rounding also may not be performant, but the code works. Ultimately I'd also like to enable support for ISO 8601 durations. Someone slap me if this isn't done soon; it is not much work, and it should help everyone working with time series.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
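The conversion and rounding the issue describes (unix time to ISO 8601, rounding to a boundary) can be sketched without Joda-Time using only java.text and java.util. The linked UDFs use Joda-Time, so this is just an illustration of the conversions themselves; the method names are hypothetical, not the UDF names in the repository.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class IsoDateDemo {
    /** Formats a unix time (seconds since the epoch) as an ISO 8601 UTC string. */
    static String unixToIso(long unixSeconds) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(unixSeconds * 1000L));
    }

    /** Rounds a nonnegative unix time down to the start of its UTC day. */
    static long roundToDay(long unixSeconds) {
        return unixSeconds - (unixSeconds % 86400L); // 86400 seconds per day
    }

    public static void main(String[] args) {
        System.out.println(unixToIso(0L));                 // 1970-01-01T00:00:00Z
        // 90061s = 1 day, 1 hour, 1 minute, 1 second past the epoch
        System.out.println(unixToIso(roundToDay(90061L))); // 1970-01-02T00:00:00Z
    }
}
```

Joda-Time (and later java.time) handles the awkward parts this sketch ignores: time zones other than UTC, millisecond precision, and ISO 8601 durations, which is presumably why the UDFs depend on it.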
[jira] Commented: (PIG-1258) [zebra] Number of sorted input splits is unusually high
[ https://issues.apache.org/jira/browse/PIG-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847875#action_12847875 ]

Yan Zhou commented on PIG-1258:

Hudson's rerun appears to be hanging. Here is the result from my private run:

[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 9 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

[zebra] Number of sorted input splits is unusually high
Key: PIG-1258
URL: https://issues.apache.org/jira/browse/PIG-1258
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Yan Zhou
Attachments: PIG-1258.patch

The number of sorted input splits is unusually high if the projections are on multiple column groups, a union of tables, or column group(s) that hold many small tfiles. In one test, the number is about 100 times bigger than that from unsorted input splits on the same input tables.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.