[jira] [Created] (PIG-2894) [Piggybank] HadoopJobHistoryLoader for hadoop 0.20.205+
Aniket Mokashi created PIG-2894:
---

Summary: [Piggybank] HadoopJobHistoryLoader for hadoop 0.20.205+
Key: PIG-2894
URL: https://issues.apache.org/jira/browse/PIG-2894
Project: Pig
Issue Type: Bug
Components: piggybank
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi

With https://issues.apache.org/jira/browse/MAPREDUCE-323, Hadoop moves job history files to the done directory. As a result, the current HadoopJobHistoryLoader can no longer be used. We need to fix this to make it more useful.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-2895) jodatime jar missing in pig-withouthadoop.jar
Thejas M Nair created PIG-2895:
---

Summary: jodatime jar missing in pig-withouthadoop.jar
Key: PIG-2895
URL: https://issues.apache.org/jira/browse/PIG-2895
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.11

jodatime jar is missing in pig-withouthadoop.jar. When an external hadoop.jar is used, pig will fail with class not found error.
[jira] [Updated] (PIG-2895) jodatime jar missing in pig-withouthadoop.jar
[ https://issues.apache.org/jira/browse/PIG-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-2895:
---

Attachment: PIG-2895.1.patch

jodatime jar missing in pig-withouthadoop.jar
Key: PIG-2895
URL: https://issues.apache.org/jira/browse/PIG-2895
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.11
Attachments: PIG-2895.1.patch

jodatime jar is missing in pig-withouthadoop.jar. When an external hadoop.jar is used, pig will fail with class not found error.
[jira] [Commented] (PIG-2886) Add Scan TimeRange to HBaseStorage
[ https://issues.apache.org/jira/browse/PIG-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443559#comment-13443559 ]

Ted Malaska commented on PIG-2886:
---

Thanks Bill,

I tried running TestHBaseStorage and it freezes on SetUp.

    public void setUp() throws Exception {
        // This is needed by Pig
        cluster = MiniCluster.buildCluster();
        conf = cluster.getConfiguration();
        util = new HBaseTestingUtility(conf);
        util.startMiniZKCluster();
        util.startMiniHBaseCluster(1, 1);
    }

Just wondering if you know what I'm missing to make this work. Hopefully I will get time in the next couple of days to research this.

Add Scan TimeRange to HBaseStorage
Key: PIG-2886
URL: https://issues.apache.org/jira/browse/PIG-2886
Project: Pig
Issue Type: Bug
Reporter: Ted Malaska
Priority: Minor
Attachments: PIG-2886-0.patch, PIG-2886-1.patch

I have a client that wants to use pig. They are using MR now. They can't use PIG right now because they only want to fetch the last day's worth of data in HBase. A filter with time range would require reading all the HStore files. If we hold major compaction until after the fetch and use Scan Time Range we only need to read very little in compression.
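Illustrative sketch for the "last day's worth of data" case described above; it is not taken from the attached patches. It only computes the timestamp window in plain Java. The class and method names are hypothetical, and the assumption is that the two values would be handed to HBase's Scan.setTimeRange(minStamp, maxStamp), which treats the range as min-inclusive and max-exclusive.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: computes the [min, max) timestamp window, in epoch
// milliseconds, that covers the 24 hours ending at a given instant.
final class LastDayRange {

    // Returns {minStamp, maxStamp}. Assumption: the caller passes these to
    // something like HBase's Scan.setTimeRange(minStamp, maxStamp).
    static long[] lastDay(long nowMillis) {
        long minStamp = nowMillis - TimeUnit.DAYS.toMillis(1);
        return new long[] { minStamp, nowMillis };
    }

    public static void main(String[] args) {
        long[] range = lastDay(System.currentTimeMillis());
        System.out.println("minStamp=" + range[0] + " maxStamp=" + range[1]);
    }
}
```

Keeping the window computation separate from the scan setup makes it easy to test without a running cluster.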
[jira] [Commented] (PIG-1314) Add DateTime Support to Pig
[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443614#comment-13443614 ]

Julien Le Dem commented on PIG-1314:
---

Hi Thejas,

this commit added JobControlCompiler.java.orig which I suspect is not what you intended.
http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java.orig?view=log&pathrev=1376800
Could you double check?

Thanks, Julien

Add DateTime Support to Pig
Key: PIG-1314
URL: https://issues.apache.org/jira/browse/PIG-1314
Project: Pig
Issue Type: Bug
Components: data
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Zhijie Shen
Labels: gsoc2012
Attachments: joda_vs_builtin.zip, PIG-1314-1.patch, PIG-1314-2.patch, PIG-1314-3.patch, PIG-1314-4.patch, PIG-1314-5.patch, PIG-1314-6.patch, PIG-1314-7.patch
Original Estimate: 672h
Remaining Estimate: 672h

Hadoop/Pig are primarily used to parse log data, and most logs have a timestamp component. Therefore Pig should support dates as a primitive. Can someone familiar with adding types to pig comment on how hard this is? We're looking at doing this, rather than use UDFs. Is this a patch that would be accepted?

This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
[jira] [Commented] (PIG-1314) Add DateTime Support to Pig
[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443630#comment-13443630 ]

Thejas M Nair commented on PIG-1314:
---

Yes, that was not intentional. Deleted JobControlCompiler.java.orig in svn.

Add DateTime Support to Pig
Key: PIG-1314
URL: https://issues.apache.org/jira/browse/PIG-1314
Project: Pig
Issue Type: Bug
Components: data
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Zhijie Shen
Labels: gsoc2012
Attachments: joda_vs_builtin.zip, PIG-1314-1.patch, PIG-1314-2.patch, PIG-1314-3.patch, PIG-1314-4.patch, PIG-1314-5.patch, PIG-1314-6.patch, PIG-1314-7.patch
Original Estimate: 672h
Remaining Estimate: 672h

Hadoop/Pig are primarily used to parse log data, and most logs have a timestamp component. Therefore Pig should support dates as a primitive. Can someone familiar with adding types to pig comment on how hard this is? We're looking at doing this, rather than use UDFs. Is this a patch that would be accepted?

This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
[jira] [Commented] (PIG-2819) ObjectSerializer should support classloader
[ https://issues.apache.org/jira/browse/PIG-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443632#comment-13443632 ]

Aniket Mokashi commented on PIG-2819:
---

I discussed this briefly with Julien during the hackathon. This is useful for an HCatLoader-like use case (deserializing InputJobInfo). Do you guys have a patch for this?

ObjectSerializer should support classloader
Key: PIG-2819
URL: https://issues.apache.org/jira/browse/PIG-2819
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Raghu Angadi

{{ObjectSerializer}} is pretty useful and could be used by UDFs and other user code. Currently its limitation is that the class being deserialized must be visible to the root class loader (i.e., it must be part of CLASSPATH on the front end). The registered jars are not visible. This is because the {{java.io.ObjectInputStream}} used to deserialize is from the root classloader.

ObjectSerializer should support another method {{deserialize(str, ClassLoader)}}.
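Illustrative sketch of the general technique behind this request; it is not Pig's ObjectSerializer and not a proposed patch. It shows one common way to make java.io.ObjectInputStream resolve classes through a caller-supplied ClassLoader by overriding resolveClass. The class and method names are hypothetical, and the sketch works on byte arrays rather than the string form mentioned in the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical helper, named for illustration only.
final class LoaderAwareSerialization {

    // ObjectInputStream that asks the supplied ClassLoader first.
    private static final class LoaderAwareInputStream extends ObjectInputStream {
        private final ClassLoader loader;

        LoaderAwareInputStream(InputStream in, ClassLoader loader) throws IOException {
            super(in);
            this.loader = loader;
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            try {
                // Look the class up in the caller's loader (e.g. one that
                // can see registered jars) without initializing it.
                return Class.forName(desc.getName(), false, loader);
            } catch (ClassNotFoundException e) {
                // Fall back to the default lookup, which also handles
                // primitive type names such as "int".
                return super.resolveClass(desc);
            }
        }
    }

    static byte[] serialize(Object obj) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object deserialize(byte[] data, ClassLoader loader) {
        try (ObjectInputStream in =
                 new LoaderAwareInputStream(new ByteArrayInputStream(data), loader)) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> original = new ArrayList<>(Arrays.asList("a", "b"));
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        Object copy = deserialize(serialize(original), loader);
        System.out.println(original.equals(copy));
    }
}
```

The fallback to super.resolveClass keeps the default behavior for anything the supplied loader cannot find, so existing callers would not be affected by passing a more capable loader.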
[jira] [Commented] (PIG-2886) Add Scan TimeRange to HBaseStorage
[ https://issues.apache.org/jira/browse/PIG-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443636#comment-13443636 ]

Cheolsoo Park commented on PIG-2886:
---

Hi Ted,

Regarding TestHBaseStorage, does it hang in hadoop 20 or 23? I assume that you're not setting -Dhadoopversion, so you are using hadoop 20 by default.

In hadoop 20, TestHBaseStorage passes for me with your patch. I.e. ant clean test -Dtestcase=TestHBaseStorage -Dhadoopversion=20 passes.
{code}
[junit] Running org.apache.pig.test.TestHBaseStorage
[junit] Tests run: 23, Failures: 0, Errors: 0, Time elapsed: 131.728 sec
{code}
If it doesn't pass for you, it is probably an environment issue (e.g. did you set umask 0022?).

However, it does time out in hadoop 23, and I believe that is expected since the hbase jar from the maven repository is not binary compatible with hadoop 23. I.e. ant clean test -Dtestcase=TestHBaseStorage -Dhadoopversion=23 fails with a timeout error, and the following error can be found in the test log (build/test/logs/TEST-org.apache.pig.test.TestHBaseStorage.txt):
{code}
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.protocol.FSConstants$SafeModeAction
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 7 more
{code}
I ran into the same issue while bumping hbase to 0.94, but it seems to apply to 0.90 (the current version in trunk) as well. Please see HBASE-5680 for more details.

Anyone, please correct me if I am wrong about TestHBaseStorage in hadoop 23.

Thanks!

Add Scan TimeRange to HBaseStorage
Key: PIG-2886
URL: https://issues.apache.org/jira/browse/PIG-2886
Project: Pig
Issue Type: Bug
Reporter: Ted Malaska
Priority: Minor
Attachments: PIG-2886-0.patch, PIG-2886-1.patch

I have a client that wants to use pig. They are using MR now. They can't use PIG right now because they only want to fetch the last day's worth of data in HBase. A filter with time range would require reading all the HStore files. If we hold major compaction until after the fetch and use Scan Time Range we only need to read very little in compression.
[jira] [Commented] (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443639#comment-13443639 ]

Aniket Mokashi commented on PIG-1483:
---

Opened https://issues.apache.org/jira/browse/PIG-2894.

[piggybank] Add HadoopJobHistoryLoader to the piggybank
Key: PIG-1483
URL: https://issues.apache.org/jira/browse/PIG-1483
Project: Pig
Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1483_1.patch, PIG-1483.patch

PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects.

The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage:

*Find all the jobs grouped by script and user:*
{code}
a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job;
c = filter b by not (id is null);
d = group c by (id, user);
e = foreach d generate flatten(group), c.job;
dump e;
{code}
A couple more examples:

*Find scripts that use only the default parallelism:*
{code}
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
c = group b by (id, user, script_name) parallel 10;
d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
e = filter d by max_reduces == 1;
dump e;
{code}
*Find the running time of each script (in seconds):*
{code}
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
c = group b by (id, user, script_name);
d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000;
dump d;
{code}
[jira] [Commented] (PIG-2886) Add Scan TimeRange to HBaseStorage
[ https://issues.apache.org/jira/browse/PIG-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443640#comment-13443640 ]

Ted Malaska commented on PIG-2886:
---

Great, thanks. Got it.

I was first running it on my local machine (no Hadoop) and it would freeze. Then I tried it on CDH4 and it didn't work either. I will try it on CDH3 tonight.

By the way, do you see anything else in the code I should add or clean up? I should have time to work on it tonight.

Ted Malaska

Add Scan TimeRange to HBaseStorage
Key: PIG-2886
URL: https://issues.apache.org/jira/browse/PIG-2886
Project: Pig
Issue Type: Bug
Reporter: Ted Malaska
Priority: Minor
Attachments: PIG-2886-0.patch, PIG-2886-1.patch

I have a client that wants to use pig. They are using MR now. They can't use PIG right now because they only want to fetch the last day's worth of data in HBase. A filter with time range would require reading all the HStore files. If we hold major compaction until after the fetch and use Scan Time Range we only need to read very little in compression.
[jira] [Commented] (PIG-2893) fix DBStorage compile issue
[ https://issues.apache.org/jira/browse/PIG-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443785#comment-13443785 ]

Alan Gates commented on PIG-2893:
---

+1, patch looks good.

fix DBStorage compile issue
Key: PIG-2893
URL: https://issues.apache.org/jira/browse/PIG-2893
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Attachments: PIG-2893.1.patch

DBStorage does not compile after the datetime patch was committed. The joda datetime was passed as argument to java.sql.PreparedStatement.setDate() instead of java.sql.Date .
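Illustrative sketch of the type mismatch described above; it is not the content of PIG-2893.1.patch. PreparedStatement.setDate() takes a java.sql.Date, so a java.util.Date has to be converted first, typically through its epoch milliseconds. The class and method names are hypothetical, and a plain java.util.Date stands in for the value that a Joda DateTime would yield.

```java
import java.sql.Date;

// Hypothetical helper, named for illustration only.
final class SqlDateConversion {

    // Converts a java.util.Date into the java.sql.Date type that
    // PreparedStatement.setDate(int, java.sql.Date) expects.
    static Date toSqlDate(java.util.Date utilDate) {
        return new Date(utilDate.getTime());
    }

    public static void main(String[] args) {
        java.util.Date now = new java.util.Date();
        Date sqlDate = toSqlDate(now);
        // Both objects carry the same epoch-millisecond value.
        System.out.println(sqlDate.getTime() == now.getTime());
    }
}
```

With a real statement the call would then look like ps.setDate(position, toSqlDate(value)); whether DATE or TIMESTAMP is the better column mapping depends on how much time-of-day precision has to survive.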
[jira] [Commented] (PIG-2892) piggybank build failing on trunk
[ https://issues.apache.org/jira/browse/PIG-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443786#comment-13443786 ]

Alan Gates commented on PIG-2892:
---

Thejas filed a separate issue for this, PIG-2893. He's also posted a patch over on that JIRA. It looked like it handled the date a little differently. I'm not sure which is the right solution, you should work with Thejas to figure out which is the right one. If it's ok with you I'll mark this one as a duplicate.

piggybank build failing on trunk
Key: PIG-2892
URL: https://issues.apache.org/jira/browse/PIG-2892
Project: Pig
Issue Type: Bug
Components: piggybank
Reporter: Alan Gates
Assignee: Cheolsoo Park
Priority: Critical
Attachments: PIG-2892.patch

When I try to build Piggybank I get:
{code}
[javac] /grid/0/hortonal/src/pig/top/trunk/contrib/piggybank/java/build.xml:92: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 159 source files to /grid/0/hortonal/src/pig/top/trunk/contrib/piggybank/java/build/classes
[javac] /grid/0/hortonal/src/pig/top/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java:121: cannot find symbol
[javac] symbol : method setDate(int,java.util.Date)
[javac] location: interface java.sql.PreparedStatement
[javac] ps.setDate(sqlPos, ((DateTime) field).toDate());
{code}
[jira] [Commented] (PIG-2892) piggybank build failing on trunk
[ https://issues.apache.org/jira/browse/PIG-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443787#comment-13443787 ]

Cheolsoo Park commented on PIG-2892:
---

Hi Alan,

I looked at PIG-2893, and his patch seems good to me. In addition, he updated the test case. Please go ahead and close this as a duplicate.

Thanks!

piggybank build failing on trunk
Key: PIG-2892
URL: https://issues.apache.org/jira/browse/PIG-2892
Project: Pig
Issue Type: Bug
Components: piggybank
Reporter: Alan Gates
Assignee: Cheolsoo Park
Priority: Critical
Attachments: PIG-2892.patch

When I try to build Piggybank I get:
{code}
[javac] /grid/0/hortonal/src/pig/top/trunk/contrib/piggybank/java/build.xml:92: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 159 source files to /grid/0/hortonal/src/pig/top/trunk/contrib/piggybank/java/build/classes
[javac] /grid/0/hortonal/src/pig/top/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java:121: cannot find symbol
[javac] symbol : method setDate(int,java.util.Date)
[javac] location: interface java.sql.PreparedStatement
[javac] ps.setDate(sqlPos, ((DateTime) field).toDate());
{code}
[jira] [Updated] (PIG-2888) Improve performance of POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2888:
---

Attachment: partialagg_patch_5.patch

Improve performance of POPartialAgg
Key: PIG-2888
URL: https://issues.apache.org/jira/browse/PIG-2888
Project: Pig
Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Attachments: partialagg_patch_1.patch, partialagg_patch_2.patch, partialagg_patch_3.patch, partialagg_patch_4.patch, partialagg_patch_5.patch

During performance testing, we found that POPartialAgg can cause performance degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't well suited to the operator's assumptions. Changing the implementation to a more flexible hash-based model can provide significant performance improvements.
[jira] [Commented] (PIG-2888) Improve performance of POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443793#comment-13443793 ]

Dmitriy V. Ryaboy commented on PIG-2888:
---

bq. There's a pig.exec.nocombiner that was not replaced by a constant.
Fixed.

bq. It would be nice to have a consistent way of getting booleans (and floats) from the conf
Feels like scope creep... maybe in another ticket? I don't want to get into how to design that around Properties, Configurations, and PigConfigurations.

bq. some of the class description was still applicable
Added better docs.

bq. what is the reason for this particular value?
Bad math :). Fixed the math and added an explanation of how I got there.

bq. Don't you want a visitor to just list them all once and set the count? That way you would not have to worry about keeping a reference on them.
I could do that, but this feels much cleaner -- no visitors, no serialization, no changes to the MRCompiler/JCCompiler, very self-contained, and works at runtime instead of having to be preset by the planner.

bq. +0.5 so that it is never 0 ? Math.min(1, ...) is more readable.
No, +0.5 so that it's a round() instead of floor()

bq. LOG.info() should be wrapped in if (LOG.isInfoEnabled()) { ... } for perf
Done for places where it matters (functions invoked more than once and messages where args are not constants)

bq. in aggregateSecondLevel() can't the processedInputMap be reused?
No -- aggregate() adds to the list of tuples in the target map, we want to overwrite in this case.

bq. in getMinOutputReductionFromProp(), if minReduction = 0 it should throw an exception.
Added a log message instead.

Improve performance of POPartialAgg
Key: PIG-2888
URL: https://issues.apache.org/jira/browse/PIG-2888
Project: Pig
Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Attachments: partialagg_patch_1.patch, partialagg_patch_2.patch, partialagg_patch_3.patch, partialagg_patch_4.patch, partialagg_patch_5.patch

During performance testing, we found that POPartialAgg can cause performance degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't well suited to the operator's assumptions. Changing the implementation to a more flexible hash-based model can provide significant performance improvements.
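Illustrative sketch of the "+0.5 so that it's a round() instead of floor()" point in the comment above; it is not code from the POPartialAgg patch. For non-negative values, casting a double to int truncates, and adding 0.5 before the cast rounds to the nearest integer instead. The class and method names are hypothetical.

```java
// Hypothetical helper, named for illustration only.
final class RoundVsFloor {

    // Plain cast: drops the fractional part (floor for non-negative input).
    static int truncate(double x) {
        return (int) x;
    }

    // Adding 0.5 first turns the truncating cast into round-to-nearest
    // for non-negative input.
    static int roundHalfUp(double x) {
        return (int) (x + 0.5);
    }

    public static void main(String[] args) {
        double[] samples = { 0.3, 2.4, 2.5, 2.9 };
        for (double x : samples) {
            System.out.println(x + " truncate=" + truncate(x)
                    + " roundHalfUp=" + roundHalfUp(x));
        }
    }
}
```

Note that rounding alone does not guarantee a non-zero result: an input below 0.5 still yields 0, so a lower bound would need a separate clamp such as Math.max(1, ...).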
[jira] [Commented] (PIG-2895) jodatime jar missing in pig-withouthadoop.jar
[ https://issues.apache.org/jira/browse/PIG-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443811#comment-13443811 ]

Alan Gates commented on PIG-2895:
---

When I run the e2e tests I am still seeing an error, even once this patch is applied.

jodatime jar missing in pig-withouthadoop.jar
Key: PIG-2895
URL: https://issues.apache.org/jira/browse/PIG-2895
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.11
Attachments: PIG-2895.1.patch

jodatime jar is missing in pig-withouthadoop.jar. When an external hadoop.jar is used, pig will fail with class not found error.