[jira] [Commented] (PIG-2673) Allow Merge join to follow an ORDER statement
[ https://issues.apache.org/jira/browse/PIG-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401514#comment-13401514 ]

Julien Le Dem commented on PIG-2673:
------------------------------------

LGTM. +1

Allow Merge join to follow an ORDER statement
---------------------------------------------

Key: PIG-2673
URL: https://issues.apache.org/jira/browse/PIG-2673
Project: Pig
Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Attachments: PIG-2673.2.patch, PIG-2673_0.patch, PIG-2673_1.patch, PIG-2673_1_noprefix.patch, PIG-2673_1_noprefix_now_with_merge.patch

Currently, we insist that data for a merge join must come from an OrderedLoadFunc. We can relax this condition and allow explicit ordering operations to precede a MergeJoin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2763) Groovy UDFs
[ https://issues.apache.org/jira/browse/PIG-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401519#comment-13401519 ]

Julien Le Dem commented on PIG-2763:
------------------------------------

Could you create a review at https://reviews.apache.org ? Thanks

Groovy UDFs
-----------

Key: PIG-2763
URL: https://issues.apache.org/jira/browse/PIG-2763
Project: Pig
Issue Type: Improvement
Reporter: Julien Le Dem
Assignee: Mathias Herberts
Attachments: PIG-2763.patch
[jira] [Commented] (PIG-2742) Rank Operator Syntax
[ https://issues.apache.org/jira/browse/PIG-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401537#comment-13401537 ]

Alan Gates commented on PIG-2742:
---------------------------------

Your suggested text in CHANGES would be fine. Or you can omit the part about the subtask; we don't always note those in CHANGES.

Rank Operator Syntax
--------------------

Key: PIG-2742
URL: https://issues.apache.org/jira/browse/PIG-2742
Project: Pig
Issue Type: Sub-task
Components: build
Affects Versions: 0.10.0
Reporter: Allan Avendaño
Assignee: Allan Avendaño
Attachments: PIG-2742

The proposed syntax is:

RANK alias (BY (col_ref|col_range)+)?

This now runs with the attached patch, which contains the code implemented so far and the corresponding tests.

A small update to the syntax:

RANK alias (BY (col_ref|col_range)+)? DENSE

I appended DENSE for the dense rank implementation. Looking forward to reading your comments.
[jira] [Updated] (PIG-2673) Allow Merge join to follow an ORDER statement
[ https://issues.apache.org/jira/browse/PIG-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2673:
----------------------------------

Resolution: Fixed
Fix Version/s: 0.11
Status: Resolved (was: Patch Available)

Committed to 0.11 (trunk)

Allow Merge join to follow an ORDER statement
---------------------------------------------

Key: PIG-2673
URL: https://issues.apache.org/jira/browse/PIG-2673
Project: Pig
Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.11
Attachments: PIG-2673.2.patch, PIG-2673_0.patch, PIG-2673_1.patch, PIG-2673_1_noprefix.patch, PIG-2673_1_noprefix_now_with_merge.patch

Currently, we insist that data for a merge join must come from an OrderedLoadFunc. We can relax this condition and allow explicit ordering operations to precede a MergeJoin.
[jira] [Commented] (PIG-2763) Groovy UDFs
[ https://issues.apache.org/jira/browse/PIG-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401548#comment-13401548 ]

Jonathan Coveney commented on PIG-2763:
---------------------------------------

Mathias,

You made the review private. Can you please add me?

Thanks!
Jon

Groovy UDFs
-----------

Key: PIG-2763
URL: https://issues.apache.org/jira/browse/PIG-2763
Project: Pig
Issue Type: Improvement
Reporter: Julien Le Dem
Assignee: Mathias Herberts
Attachments: PIG-2763.patch
[jira] [Commented] (PIG-2763) Groovy UDFs
[ https://issues.apache.org/jira/browse/PIG-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401549#comment-13401549 ]

Jonathan Coveney commented on PIG-2763:
---------------------------------------

PS Awesome contribution :)

Groovy UDFs
-----------

Key: PIG-2763
URL: https://issues.apache.org/jira/browse/PIG-2763
Project: Pig
Issue Type: Improvement
Reporter: Julien Le Dem
Assignee: Mathias Herberts
Attachments: PIG-2763.patch
Review Request: PIG-2763 - Groovy UDFs
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5591/
-----------------------------------------------------------

Review request for pig, Julien Le Dem and Jonathan Coveney.

Description
-------

Adds support for Groovy UDFs in Pig.

Diffs
-----

  /trunk/ivy.xml 1353307
  /trunk/ivy/libraries.properties 1353307
  /trunk/src/org/apache/pig/scripting/ScriptEngine.java 1353307
  /trunk/src/org/apache/pig/scripting/groovy/AccumulatorAccumulate.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AccumulatorCleanup.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AccumulatorGetValue.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AlgebraicFinal.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AlgebraicInitial.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AlgebraicIntermed.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyAccumulatorEvalFunc.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyAlgebraicEvalFunc.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyEvalFunc.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyEvalFuncObject.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java PRE-CREATION
  /trunk/test/org/apache/pig/test/TestUDFGroovy.java PRE-CREATION
  /trunk/test/unit-tests 1353307

Diff: https://reviews.apache.org/r/5591/diff/

Testing
-------

Thanks,

Mathias Herberts
[jira] [Commented] (PIG-2763) Groovy UDFs
[ https://issues.apache.org/jira/browse/PIG-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401562#comment-13401562 ]

Mathias Herberts commented on PIG-2763:
---------------------------------------

Ooops my mistake, I forgot to publish the review. Corrected.

Groovy UDFs
-----------

Key: PIG-2763
URL: https://issues.apache.org/jira/browse/PIG-2763
Project: Pig
Issue Type: Improvement
Reporter: Julien Le Dem
Assignee: Mathias Herberts
Attachments: PIG-2763.patch
[jira] [Commented] (PIG-1314) Add DateTime Support to Pig
[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401613#comment-13401613 ]

Thejas M Nair commented on PIG-1314:
------------------------------------

bq. As far as I know, either Java builtin Date or Joda DateTime uses millisecond-shift (stored in a long integer variable) from the midnight UTC, which is not exactly the Unix time.

Yes, as you noted, the difference is that a Unix timestamp can store up to +/- 292 billion years, while Joda DateTime supports only +/- 292 million years. Which should be sufficient for most practical purposes! :)

bq. The time zone only determines the ISO time string,

It also affects the field values (getDayOfWeek(), getHourOfDay(), etc.). In your data, you can have dates belonging to different timezones, and users might want to retain that information. An example of a use case where the timezone also needs to be stored: if you want to analyze how many people come to a global website during their morning hours, you want .getHourOfDay() to return the hour as per the local timezone.

We need an efficient way to serialize the timezone along with the long. Can you propose something? (Maybe just make it efficient for the 256 most 'popular' timezones and store it in a byte. And not have the byte for UTC. For other timezones, add a timezone string?)

bq. When we need to convert the DateTime object to Unix time string, we may use the default time zone of the Pig environment

If the date field has the timezone value in it, we don't have to rely on the default time zone to convert to a Unix timestamp (assuming that is what you meant by 'unix time *string*'). But for UDFs like DateTime ToDate(String s), where the timezone might not be specified, we need a default timezone. I think we should use the default timezone of the Pig client machine. Using the default time zone of each task tracker node can lead to a debugging nightmare if one of the nodes happens to have a different timezone.
We should allow the user to set a default timezone using a Pig property.

bq. We probably need one more UDF String ToString(DateTime d, String format, String timezone)

Having the timezone argument in this call is necessary only if the user wants to print the time for a different timezone. This is useful, but not mandatory.

bq. Since the ISO duration is non-negative (Please correct me if I'm wrong), we need to SubstractDuration as well.

Yes, you are right. I could not find any references to negative values in ISO duration. Let's add SubstractDuration.

Trivia from Wikipedia: a 64-bit Unix timestamp, in the negative direction, goes back more than twenty times the age of the universe.

Add DateTime Support to Pig
---------------------------

Key: PIG-1314
URL: https://issues.apache.org/jira/browse/PIG-1314
Project: Pig
Issue Type: Bug
Components: data
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Zhijie Shen
Labels: gsoc2012
Attachments: PIG-1314-1.patch, PIG-1314-2.patch, joda_vs_builtin.zip
Original Estimate: 672h
Remaining Estimate: 672h

Hadoop/Pig are primarily used to parse log data, and most logs have a timestamp component. Therefore Pig should support dates as a primitive. Can someone familiar with adding types to Pig comment on how hard this is? We're looking at doing this, rather than using UDFs. Is this a patch that would be accepted?

This is a candidate project for Google Summer of Code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
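Editor's sketch of the serialization idea floated in the comment above: a long of epoch milliseconds plus a compact timezone encoding. Everything here is an assumption made for illustration, not the format Pig adopted: the leading tag byte, the three-entry POPULAR_ZONES table (the comment suggests up to 256 entries), and the class and method names are all hypothetical. It also uses java.time from the JDK in place of Joda-Time so that the block runs without extra libraries.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wire format: [tag][zone info, if any][8-byte epoch millis]. */
class DateTimeCodecSketch {
    // Illustrative stand-in for a table of "popular" zones (up to 256 entries fit in one byte).
    static final List<String> POPULAR_ZONES =
            Arrays.asList("America/Los_Angeles", "Europe/Paris", "Asia/Tokyo");

    static final int TAG_UTC = 0;      // no zone bytes at all
    static final int TAG_INDEXED = 1;  // one byte: index into POPULAR_ZONES
    static final int TAG_NAMED = 2;    // two-byte length + UTF-8 zone ID

    static byte[] encode(ZonedDateTime dt) {
        String zone = dt.getZone().getId();
        long millis = dt.toInstant().toEpochMilli();
        int index = POPULAR_ZONES.indexOf(zone);
        if (zone.equals("UTC") || zone.equals("Z")) {
            return ByteBuffer.allocate(9).put((byte) TAG_UTC).putLong(millis).array();
        }
        if (index >= 0) {
            return ByteBuffer.allocate(10)
                    .put((byte) TAG_INDEXED).put((byte) index).putLong(millis).array();
        }
        byte[] id = zone.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(11 + id.length)
                .put((byte) TAG_NAMED).putShort((short) id.length).put(id).putLong(millis).array();
    }

    static ZonedDateTime decode(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        int tag = in.get();
        ZoneId zone;
        if (tag == TAG_UTC) {
            zone = ZoneId.of("UTC");
        } else if (tag == TAG_INDEXED) {
            zone = ZoneId.of(POPULAR_ZONES.get(in.get() & 0xFF));
        } else {
            byte[] id = new byte[in.getShort()];
            in.get(id);
            zone = ZoneId.of(new String(id, StandardCharsets.UTF_8));
        }
        return Instant.ofEpochMilli(in.getLong()).atZone(zone);
    }

    public static void main(String[] args) {
        Instant visit = Instant.parse("2012-06-26T07:30:00Z");
        for (String id : new String[] {"UTC", "Europe/Paris", "Asia/Tokyo", "Australia/Sydney"}) {
            byte[] bytes = encode(visit.atZone(ZoneId.of(id)));
            ZonedDateTime back = decode(bytes);
            // Same instant, but the local hour depends on the stored zone.
            System.out.println(id + ": " + bytes.length + " bytes, local hour " + back.getHour());
        }
    }
}
```

The same instant decodes to hour 9 in Europe/Paris and hour 16 in Asia/Tokyo, which is the "morning hours of a global website" point made in the comment: dropping the zone would lose that information. Whether a tag byte, a type marker, or some other discriminator is used is a design choice the thread leaves open.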
[jira] [Updated] (PIG-2697) pretty print schema
[ https://issues.apache.org/jira/browse/PIG-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated PIG-2697:
-----------------------------

Attachment: PIG-2697.patch

Saw Dmitriy's comment late. Added this property to pig.properties in the updated patch. Also made the string 'static final'.

pretty print schema
-------------------

Key: PIG-2697
URL: https://issues.apache.org/jira/browse/PIG-2697
Project: Pig
Issue Type: Improvement
Components: grunt
Reporter: Raghu Angadi
Assignee: Raghu Angadi
Attachments: PIG-2697.patch, PIG-2697.patch

Currently 'describe' dumps the schema in one line. If you have a long or complicated schema, it is pretty much impossible to figure out how the schema looks or what the fields are. Will provide an example below.
[jira] [Updated] (PIG-2697) pretty print schema
[ https://issues.apache.org/jira/browse/PIG-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated PIG-2697:
-----------------------------

Fix Version/s: 0.11

pretty print schema
-------------------

Key: PIG-2697
URL: https://issues.apache.org/jira/browse/PIG-2697
Project: Pig
Issue Type: Improvement
Components: grunt
Reporter: Raghu Angadi
Assignee: Raghu Angadi
Fix For: 0.11
Attachments: PIG-2697.patch, PIG-2697.patch

Currently 'describe' dumps the schema in one line. If you have a long or complicated schema, it is pretty much impossible to figure out how the schema looks or what the fields are. Will provide an example below.
[jira] [Updated] (PIG-2761) With hadoop23 importing modules inside python script does not work
[ https://issues.apache.org/jira/browse/PIG-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2761:
---------------------------

Resolution: Fixed
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)

+1

Patch committed to 0.10 branch/trunk. Thanks Rohini!

With hadoop23 importing modules inside python script does not work
------------------------------------------------------------------

Key: PIG-2761
URL: https://issues.apache.org/jira/browse/PIG-2761
Project: Pig
Issue Type: Bug
Affects Versions: 0.10.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
Fix For: 0.11, 0.10.1
Attachments: PIG-2761-branch10_1.patch, PIG-2761-initial.patch, PIG-2761-trunk.patch, PIG-2761.patch

Because unjar has been removed from Hadoop 23, registering scripts has an issue. PIG-2745 addresses the issue of registering scripts with Pig. But if the registered py script imports other modules, it does not work.

Steps to reproduce the issue are in https://issues.apache.org/jira/browse/PIG-2745?focusedCommentId=13396965&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13396965
[jira] [Commented] (PIG-2746) Pig doesn't detect all forms of compression extensions properly
[ https://issues.apache.org/jira/browse/PIG-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401664#comment-13401664 ]

Harsh J commented on PIG-2746:
------------------------------

Daniel/Others,

Does the provided test suffice? Is there anything else you'd like me to address to get this in? Do let me know!

Pig doesn't detect all forms of compression extensions properly
---------------------------------------------------------------

Key: PIG-2746
URL: https://issues.apache.org/jira/browse/PIG-2746
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.1
Reporter: Harsh J
Assignee: Harsh J
Attachments: PIG-2746.patch, PIG-2746.patch, PIG-2746.patch

The PigStorage has the following snippet.

{code}
private void setCompression(Path path, Job job) {
    String location = path.getName();
    if (location.endsWith(".bz2") || location.endsWith(".bz")) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    } else if (location.endsWith(".gz")) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    } else {
        FileOutputFormat.setCompressOutput(job, false);
    }
}
{code}

This limits it to only work with STORE filenames provided as 'output.gz' or 'output.bz2', and for the rest (like LZO) one has to specify codecs and manually enable compression.

Ideally Pig can rely on Hadoop's extension-to-codec detector instead of having this ladder.
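Editor's sketch of what "extension-to-codec detection instead of a ladder" could look like. In Hadoop itself this lookup is provided by org.apache.hadoop.io.compress.CompressionCodecFactory and its getCodec(Path) method; the block below does not call Hadoop, so that it stays runnable on its own. The class name, the method name, and the contents of the table are illustrative assumptions, and this is not the code from the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative stand-in for an extension-to-codec registry; not Pig or Hadoop code. */
class CodecLookupSketch {
    // File suffix -> codec class name. Example entries only, not an exhaustive list.
    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODEC_BY_SUFFIX.put(".bz", "org.apache.hadoop.io.compress.BZip2Codec");
        CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_SUFFIX.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
    }

    /** Returns the codec registered for the file's suffix, or empty for "no compression". */
    static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"output.gz", "output.bz2", "output.lzo", "output"}) {
            System.out.println(name + " -> " + codecFor(name).orElse("(no compression)"));
        }
    }
}
```

With a registry, supporting a new codec means adding one table entry (or, in Hadoop, registering the codec in the configuration) rather than adding another else-if branch to setCompression.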
[jira] [Resolved] (PIG-2760) resources added with a relative path are added to the JobXXXX jar file under their absolute path
[ https://issues.apache.org/jira/browse/PIG-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai resolved PIG-2760.
----------------------------

Resolution: Fixed
Fix Version/s: 0.10.1, 0.11
Assignee: Rohini Palaniswamy
Hadoop Flags: Reviewed

This is fixed along with PIG-2761. Thanks folks!

resources added with a relative path are added to the Job jar file under their absolute path
--------------------------------------------------------------------------------------------

Key: PIG-2760
URL: https://issues.apache.org/jira/browse/PIG-2760
Project: Pig
Issue Type: Bug
Affects Versions: 0.10.0
Reporter: Mathias Herberts
Assignee: Rohini Palaniswamy
Fix For: 0.11, 0.10.1
Attachments: PIG-2760.patch

When registering a local resource using a relative path, the resource is added to the Job jar under its absolute path.

If a pig script contains the following:

REGISTER etc/foo;

and is executed from a directory /PATH/TO/DIR, the Job jar file will contain the following:

/PATH/TO/DIR/etc/foo

instead of

etc/foo

which was the previous behavior.
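Editor's sketch contrasting the two entry names described in the issue above. The class and both method names are hypothetical and only illustrate the difference between resolving a registered path against the working directory and keeping it as written; how the committed fix actually computes the entry name is not shown in this thread.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper; illustrates the reported behavior, not Pig's code. */
class JarEntryNameSketch {
    /** The reported behavior: the relative path is resolved against the working directory. */
    static String resolvedEntryName(Path workingDir, String registered) {
        return slashes(workingDir.resolve(registered).normalize());
    }

    /** The previous (expected) behavior: a relative path is kept as the user wrote it. */
    static String expectedEntryName(String registered) {
        return slashes(Paths.get(registered).normalize());
    }

    // Jar entries use forward slashes regardless of platform.
    private static String slashes(Path p) {
        return p.toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        Path cwd = Paths.get("/PATH/TO/DIR");
        System.out.println(resolvedEntryName(cwd, "etc/foo"));
        System.out.println(expectedEntryName("etc/foo"));
    }
}
```

The first call mirrors the /PATH/TO/DIR/etc/foo entry from the report, the second the etc/foo entry the reporter expected.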
[jira] [Commented] (PIG-1314) Add DateTime Support to Pig
[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401676#comment-13401676 ]

Russell Jurney commented on PIG-1314:
-------------------------------------

Jodatime seems to solve these problems. Serializing from a string without a timezone, it does things in a reasonable manner. Serializing things from a string with a timezone, it does things in a reasonable manner.

Are we discussing a user-facing API, or an internal storage mechanism? I'm not clear on which.

Regarding the interface, presenting integers to a user as an interface seems wrong to me. Excluding certain timezones in the name of efficiency also seems wrong to me. The point of a datetime type is to add timezones, otherwise we can simply use longs.

As an internal storage mechanism, I'm un-opinionated, so long as all timezones are retained at all times.

Add DateTime Support to Pig
---------------------------

Key: PIG-1314
URL: https://issues.apache.org/jira/browse/PIG-1314
Project: Pig
Issue Type: Bug
Components: data
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Zhijie Shen
Labels: gsoc2012
Attachments: PIG-1314-1.patch, PIG-1314-2.patch, joda_vs_builtin.zip
Original Estimate: 672h
Remaining Estimate: 672h

Hadoop/Pig are primarily used to parse log data, and most logs have a timestamp component. Therefore Pig should support dates as a primitive. Can someone familiar with adding types to pig comment on how hard this is? We're looking at doing this, rather than use UDFs. Is this a patch that would be accepted?

This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
[jira] [Commented] (PIG-2661) Pig uses an extra job for loading data in Pigmix L9
[ https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401687#comment-13401687 ]

Jie Li commented on PIG-2661:
-----------------------------

An interesting problem:

Previously, for order-by, Pig would force any preceding pipeline to finish and write to disk first, and then sample the data and sort it, so the sampler would see the same data that would be sorted.

Now we want to merge the preceding map-only pipeline into both the sampler and order-by. The sampler will sample the data before that pipeline, and pass the sample results through the pipeline to generate the partition file.

See the query:

{code}
a = load 'data' as (x,y);
b = filter a by udf(x,y);
c = foreach b generate udf(x,y);
d = order c by x;
{code}

Here a-b-c is the pipeline before order-by. Previously Pig would write c to disk first, and then the sampler would get samples from c; but now we want to avoid writing c to disk, so the sampler will load a to get samples and pass them through b and c to generate the partition file. Here b and c can be projection, filter and any other non-blocking operators.

One concern is: would the new way of sampling still capture the distribution of the data to be sorted?

||What we want||What we have now||What we'll have||
|Distribution(a-b-c)|Distribution(Sample(a-b-c))|Distribution(Sample(a)-b-c)|

It's clear that Sample will keep the original distribution, so the three distributions in the table would be equivalent.

Another concern is the performance. With the patch, the sampler will do a full scan of the table before the filter, which might be slower than before if the filter is very selective. This might be acceptable considering that the sampler parses only a small percentage of the data. Will do some benchmarks.

Pig uses an extra job for loading data in Pigmix L9
---------------------------------------------------

Key: PIG-2661
URL: https://issues.apache.org/jira/browse/PIG-2661
Project: Pig
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Jie Li
Assignee: Jie Li
Attachments: PIG-2661.0.patch, PIG-2661.1.patch

See https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155
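Editor's sketch of why Sample(a)-b-c and Sample(a-b-c) can agree, as the comment above argues. It rests on two assumptions that are not stated in the thread: the sampler makes an independent keep/drop decision per record (modelled here as a pre-drawn coin carried on each row), and b and c are per-record operators. Pig's real sampler may select records differently, so this shows the argument, not the implementation. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

/** Toy model: per-record sampling commutes with per-record (non-blocking) operators. */
class SamplePushdownSketch {
    static final double RATE = 0.1; // hypothetical sampling rate

    /** One input record plus the sampler's coin flip for that record. */
    static final class Row {
        final int x;
        final int y;
        final double coin;
        Row(int x, int y, double coin) { this.x = x; this.y = y; this.coin = coin; }
    }

    // Stand-in for "b = filter a by udf(x,y)".
    static boolean keep(Row r) { return r.y % 3 == 0; }

    // Stand-in for "c = foreach b generate udf(x,y)"; the coin travels with the record.
    static Row transform(Row r) { return new Row(r.x * 2, r.y, r.coin); }

    /** Sample(a-b-c): run the pipeline, then sample its output. */
    static List<Integer> sampleAfterPipeline(List<Row> a) {
        return a.stream()
                .filter(SamplePushdownSketch::keep)
                .map(SamplePushdownSketch::transform)
                .filter(r -> r.coin < RATE)
                .map(r -> r.x)
                .collect(Collectors.toList());
    }

    /** Sample(a)-b-c: sample the raw input, then push the samples through the pipeline. */
    static List<Integer> sampleBeforePipeline(List<Row> a) {
        return a.stream()
                .filter(r -> r.coin < RATE)
                .filter(SamplePushdownSketch::keep)
                .map(SamplePushdownSketch::transform)
                .map(r -> r.x)
                .collect(Collectors.toList());
    }

    static List<Row> generate(int n, long seed) {
        Random rnd = new Random(seed);
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add(new Row(rnd.nextInt(1000), rnd.nextInt(1000), rnd.nextDouble()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> a = generate(10_000, 42L);
        System.out.println("same samples: " + sampleAfterPipeline(a).equals(sampleBeforePipeline(a)));
    }
}
```

Under these assumptions the two orders yield exactly the same sample, because a record survives only if it passes both the coin and the filter, in either order. The sketch also makes the performance concern visible: sampleBeforePipeline flips a coin for every row of a, including rows the filter would have discarded.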
[jira] [Created] (PIG-2768) Fix org.apache.hadoop.conf.Configuration deprecation warnings for Hadoop 23
Fabian Alenius created PIG-2768:
-----------------------------------

Summary: Fix org.apache.hadoop.conf.Configuration deprecation warnings for Hadoop 23
Key: PIG-2768
URL: https://issues.apache.org/jira/browse/PIG-2768
Project: Pig
Issue Type: Improvement
Reporter: Fabian Alenius

When compiling with hadoopversion=23 and running with hadoop 23, an annoying warning is printed:

WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS

because fs.default.name is set in the configuration properties in HExecutionEngine.java even if Pig is compiled for hadoop 23.
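Editor's sketch of one way to avoid setting the deprecated key: pick the property name from the Hadoop version the build targets. The two key names come from the warning quoted above; the class, the method names, and the idea of passing the hadoopversion value as a string are assumptions for illustration and are not taken from HExecutionEngine.java or from any patch on this issue.

```java
import java.util.Properties;

/** Hypothetical helper; not Pig code. */
class DefaultFsKeySketch {
    static final String OLD_KEY = "fs.default.name"; // deprecated under Hadoop 23
    static final String NEW_KEY = "fs.defaultFS";

    /** Chooses the configuration key for the given hadoopversion build value. */
    static String defaultFsKey(String hadoopVersion) {
        return "23".equals(hadoopVersion) ? NEW_KEY : OLD_KEY;
    }

    /** Builds properties that contain only the key matching the targeted Hadoop version. */
    static Properties withDefaultFs(String hadoopVersion, String uri) {
        Properties props = new Properties();
        props.setProperty(defaultFsKey(hadoopVersion), uri);
        return props;
    }

    public static void main(String[] args) {
        // "hdfs://namenode:8020" is a placeholder URI.
        System.out.println(withDefaultFs("23", "hdfs://namenode:8020"));
        System.out.println(withDefaultFs("20", "hdfs://namenode:8020"));
    }
}
```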
Re: Review Request: PIG-2763 - Groovy UDFs
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5591/#review8628
-----------------------------------------------------------

/trunk/src/org/apache/pig/scripting/groovy/GroovyAccumulatorEvalFunc.java
https://reviews.apache.org/r/5591/#comment18274

    can you get rid of trailing whitespace? In vim: %s/\s\+$// will do it

/trunk/src/org/apache/pig/scripting/groovy/GroovyAccumulatorEvalFunc.java
https://reviews.apache.org/r/5591/#comment18265

    you can have this class extend AccumulatorEvalFunc -- it was made just for this case :)

/trunk/src/org/apache/pig/scripting/groovy/GroovyAccumulatorEvalFunc.java
https://reviews.apache.org/r/5591/#comment18266

    I don't like this. What is the source of errors?

/trunk/src/org/apache/pig/scripting/groovy/GroovyAccumulatorEvalFunc.java
https://reviews.apache.org/r/5591/#comment18273

    2 points here.

    1) It seems odd to me that you lump outputSchema with the getValue method given your annotation driven approach. Why not annotate the Groovy class instead, or, better yet, allow users to set their own method? Leading to...

    2) you could also support dynamic outputSchemas based on input schemas (jython and jruby support both do this)

/trunk/src/org/apache/pig/scripting/groovy/GroovyAlgebraicEvalFunc.java
https://reviews.apache.org/r/5591/#comment18275

    I'm so happy that someone who isn't me found this useful :)

/trunk/src/org/apache/pig/scripting/groovy/GroovyEvalFunc.java
https://reviews.apache.org/r/5591/#comment18276

    IMHO, if they have a UDF that returns null, you should detect this earlier on and throw an error.
    Same with any methods which don't accept Pig types, if you want to get fancy (JRuby did not get this fancy, but I think at least the former is important rather than returning null)

/trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java
https://reviews.apache.org/r/5591/#comment18277

    throw an UnsupportedOp exception, it shouldn't be called

/trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java
https://reviews.apache.org/r/5591/#comment18278

    In general, I'd prefer /***/ javadoc style comments when commenting in line, but this is a style nitpick

/trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java
https://reviews.apache.org/r/5591/#comment18279

    It seems weird to allow Groovy static methods as UDFs. I suppose there is no harm in it, but given that in Pig all UDFs imply that they are instantiated, it is potentially a strong departure from how people typically should think about UDFs.

/trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java
https://reviews.apache.org/r/5591/#comment18280

    See above, this is a weird special case to me...

/trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java
https://reviews.apache.org/r/5591/#comment18281

    You can also make sure that Initial and Intermed return Tuple

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18284

    I'm a big fan of having a private static final TupleFactory and BagFactory in the class. YMMV

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18282

    Is it not possible for users to create a pig Tuple that they then put Groovy objects into?

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18283

    Pig maps have to have Strings as keys.
    I suppose we don't HAVE to check that here, but it could have potentially weird results

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18287

    In the case of an int, we shouldn't have to go to/from int. Same with Long, Double, and Float.

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18285

    you should go express support of the BigInt/BigDec patch :)

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18286

    Why do you copy the byte array here? It's not like you're copying in all other cases. Is the goal buffer reuse or something?

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18288

    why not just return the boolean?

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18289

    you can just iterate directly on it without calling getall. also, you could use groovy.lang.Tuple#addAll?

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18292

    Same comment as above: Pig maps always have String keys

/trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java
https://reviews.apache.org/r/5591/#comment18293
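Editor's sketch of the String-key check raised in the review comments on GroovyUtils.java above. It uses only java.util types; the class and method names are hypothetical and do not come from the patch under review, and it leaves open whether rejecting non-String keys (as here) or converting them is the better policy.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical conversion helper; not the code from the PIG-2763 patch. */
class MapKeyCheckSketch {
    /** Copies a script-side map, rejecting any key that is not a String. */
    static Map<String, Object> toStringKeyedMap(Map<?, ?> source) {
        Map<String, Object> result = new HashMap<>();
        for (Map.Entry<?, ?> entry : source.entrySet()) {
            Object key = entry.getKey();
            if (!(key instanceof String)) {
                // Fail early instead of producing a map with surprising keys.
                throw new IllegalArgumentException(
                        "Map keys must be Strings, got: "
                                + (key == null ? "null" : key.getClass().getName()));
            }
            result.put((String) key, entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Object, Object> ok = new HashMap<>();
        ok.put("a", 1);
        System.out.println(toStringKeyedMap(ok));

        Map<Object, Object> bad = new HashMap<>();
        bad.put(42, "value");
        try {
            toStringKeyedMap(bad);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```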
Re: Review Request: PIG-2763 - Groovy UDFs
On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote:

Like this a lot! I also like that we're getting a clearer blueprint on what it takes to implement a scripting language... I think we could definitely make a better abstraction soon.

Oh and can you put a link to the JIRA on the reviewboard?

- Jonathan

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5591/#review8628
-----------------------------------------------------------

On June 26, 2012, 5:52 p.m., Mathias Herberts wrote:

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5591/
-----------------------------------------------------------

(Updated June 26, 2012, 5:52 p.m.)

Review request for pig, Julien Le Dem and Jonathan Coveney.

Description
-------

Adds support for Groovy UDFs in Pig.

Diffs
-----

  /trunk/ivy.xml 1353307
  /trunk/ivy/libraries.properties 1353307
  /trunk/src/org/apache/pig/scripting/ScriptEngine.java 1353307
  /trunk/src/org/apache/pig/scripting/groovy/AccumulatorAccumulate.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AccumulatorCleanup.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AccumulatorGetValue.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AlgebraicFinal.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AlgebraicInitial.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/AlgebraicIntermed.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyAccumulatorEvalFunc.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyAlgebraicEvalFunc.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyEvalFunc.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyEvalFuncObject.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java PRE-CREATION
  /trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java PRE-CREATION
  /trunk/test/org/apache/pig/test/TestUDFGroovy.java PRE-CREATION
  /trunk/test/unit-tests 1353307

Diff: https://reviews.apache.org/r/5591/diff/

Testing
-------

Thanks,

Mathias Herberts
Build failed in Jenkins: Pig-trunk #1265
See https://builds.apache.org/job/Pig-trunk/1265/changes

Changes:

[daijy] Adding missing test TestJobStats.java from PIG-2696

[daijy] PIG-2761: With hadoop23 importing modules inside python script does not work

[dvryaboy] PIG-2673: Allow Merge join to follow an ORDER statement

--
[...truncated 3832 lines...]
[exec] Fetching plugins descriptor: http://forrest.apache.org/plugins/whiteboard-plugins.xml
[exec] Getting: http://forrest.apache.org/plugins/whiteboard-plugins.xml
[exec] To: https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp/plugins-2.xml
[exec] local file date : Tue Feb 01 02:18:42 UTC 2011
[exec] ..
[exec] last modified = Fri Jun 10 08:37:02 UTC 2011
[exec] Plugin list loaded from http://forrest.apache.org/plugins/plugins.xml.
[exec] Plugin list loaded from http://forrest.apache.org/plugins/whiteboard-plugins.xml.
[exec]
[exec] init-plugins:
[exec] Created dir: https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/webapp/conf
[exec] Copying 1 file to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp
[exec] Copying 1 file to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp
[exec] Copying 1 file to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp
[exec] Copying 1 file to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp
[exec] Copying 1 file to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp
[exec]
[exec] --
[exec] Installing plugin: org.apache.forrest.plugin.output.pdf
[exec] --
[exec]
[exec]
[exec] check-plugin:
[exec] org.apache.forrest.plugin.output.pdf is available in the build dir. Trying to update it...
[exec]
[exec] init-props:
[exec]
[exec] echo-settings-condition:
[exec]
[exec] echo-settings:
[exec]
[exec] init-proxy:
[exec]
[exec] fetch-plugins-descriptors:
[exec]
[exec] fetch-plugin:
[exec] Trying to find the description of org.apache.forrest.plugin.output.pdf in the different descriptor files
[exec] Using the descriptor file https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp/plugins-1.xml...
[exec] Processing https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp/plugins-1.xml to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp/pluginlist2fetchbuild.xml
[exec] Loading stylesheet /home/jenkins/tools/forrest/latest/main/var/pluginlist2fetch.xsl
[exec]
[exec] fetch-local-unversioned-plugin:
[exec]
[exec] get-local:
[exec] Trying to locally get org.apache.forrest.plugin.output.pdf
[exec] Looking in local /home/jenkins/tools/forrest/latest/plugins
[exec] Found !
[exec]
[exec] init-build-compiler:
[exec]
[exec] echo-init:
[exec]
[exec] init:
[exec]
[exec] compile:
[exec]
[exec] jar:
[exec]
[exec] local-deploy:
[exec] Locally deploying org.apache.forrest.plugin.output.pdf
[exec]
[exec] build:
[exec] Plugin org.apache.forrest.plugin.output.pdf deployed ! Ready to configure
[exec]
[exec] fetch-remote-unversioned-plugin-version-forrest:
[exec]
[exec] fetch-remote-unversioned-plugin-unversion-forrest:
[exec]
[exec] has-been-downloaded:
[exec]
[exec] downloaded-message:
[exec]
[exec] uptodate-message:
[exec]
[exec] not-found-message:
[exec] Fetch-plugin Ok, installing !
[exec]
[exec] unpack-plugin:
[exec]
[exec] install-plugin:
[exec]
[exec] configure-plugin:
[exec]
[exec] configure-output-plugin:
[exec] Mounting output plugin: org.apache.forrest.plugin.output.pdf
[exec] Processing https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp/output.xmap to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp/output.xmap.new
[exec] Loading stylesheet /home/jenkins/tools/forrest/latest/main/var/pluginMountSnippet.xsl
[exec] Moving 1 file to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp
[exec]
[exec] configure-plugin-locationmap:
[exec] Mounting plugin locationmap for org.apache.forrest.plugin.output.pdf
[exec] Processing https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp/locationmap.xml to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp/locationmap.xml.new
[exec] Loading stylesheet /home/jenkins/tools/forrest/latest/main/var/pluginLmMountSnippet.xsl
[exec] Moving 1 file to https://builds.apache.org/job/Pig-trunk/ws/trunk/src/docs/build/tmp
Re: Review Request: PIG-2763 - Groovy UDFs
On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyAccumulatorEvalFunc.java, line 80 https://reviews.apache.org/r/5591/diff/1/?file=116555#file116555line80 2 points here. 1) It seems odd to me that you lump outputSchema with the getValue method given your annotation driven approach. Why not annotate the Groovy class instead, or, better yet, allow users to set their own method? Leading to... 2) you could also support dynamic outputSchemas based on input schemas (jython and jruby support both do this) Annotating the Groovy Class would mean that we have a single UDF per class as is the case in Java. It seems to me it is more practical to see several UDFs in a single Groovy class, thus making the class more of a UDF library container than a single UDF container. Dynamic outputschemas have been added via an OutputSchemaFunction annotation, this will be reflected in the next iteration of the patch. On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyEvalFunc.java, line 129 https://reviews.apache.org/r/5591/diff/1/?file=116557#file116557line129 IMHO, if they have a UDF that returns null, you should detect this earlier on and throw an error. Same with any methods which don't accept Pig types, if you want to get fancy (JRuby did not get this fancy, but I think at least the former is important rather than returning null) This is done because the GroovyEvalFunc wrapper is used for Accumulator UDFs when calling accumulate/cleanup which are 'void' methods. Not supporting 'void' methods in GroovyEvalFunc would force to add a GroovyVoidEvalFunc class just for the Accumulator case. 
On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java, line 88 https://reviews.apache.org/r/5591/diff/1/?file=116559#file116559line88 In general, I'd prefer /***/ javadoc style comments when commenting in line, but this is a style nitpick. I always use // for in line comments, this way I can comment out a block of code spanning multiple lines by using /* ... */ On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java, line 195 https://reviews.apache.org/r/5591/diff/1/?file=116559#file116559line195 It seems weird to allow Groovy static methods as UDFs. I suppose there is no harm in it, but given that in Pig all UDFs imply that they are instantiated, it proposes a potentially strong departure from how people typically should think about UDFs. As stated earlier, a Groovy class should really be seen as a container for multiple UDFs, not as containing a single one. Non-static methods are needed for Accumulator UDFs, all other UDFs maintain no state, thus the use of static methods. I guess non-static methods could be supported too. On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java, line 200 https://reviews.apache.org/r/5591/diff/1/?file=116559#file116559line200 See above, this is a weird special case to me... methods annotated with @AccumulatorGetValue need to have an OutputSchema defined, but since they are part of a trio of methods used to implement the Accumulator, they should not be exposed directly. On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java, line 96 https://reviews.apache.org/r/5591/diff/1/?file=116560#file116560line96 Is it not possible for users to create a pig Tuple that they then put Groovy objects into? 
They could, but this is strongly discouraged, the use case is to create Pig's Tuple or DataBag and populate them with Groovy objects converted by GroovyUtils.groovyToPig. The ability to create Pig's DataBag from Groovy is to benefit from the spill to disk nature of those. The support of Pig's Tuple is simply to be coherent. On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java, line 95 https://reviews.apache.org/r/5591/diff/1/?file=116560#file116560line95 I'm a big fan of having a private static final TupleFactory and BagFactory in the class. YMMV Ok, added. On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java, line 149 https://reviews.apache.org/r/5591/diff/1/?file=116560#file116560line149 you should go express support of the BigInt/BigDec patch :) I already did! On June 26, 2012, 10:14 p.m., Jonathan Coveney wrote: /trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java, line 160 https://reviews.apache.org/r/5591/diff/1/?file=116560#file116560line160 Why do you copy the byte array here? It's not like you're copying in all other cases. Is
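To illustrate the "class as a UDF library container" idea from the review replies above, here is a minimal sketch in plain Java. It is not code from the patch: SketchOutputSchema, SketchUdfLibrary and UdfScanner are hypothetical names, and the real engine works on Groovy scripts rather than compiled Java classes. It only shows how annotated static methods in a single class could each be discovered as a separate function.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an output-schema annotation; not Pig's own class.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SketchOutputSchema {
    String value();
}

// One class holding several stateless functions, each exposed as its own UDF.
final class SketchUdfLibrary {
    @SketchOutputSchema("len:int")
    public static int length(String s) {
        return s.length();
    }

    @SketchOutputSchema("upper:chararray")
    public static String upper(String s) {
        return s.toUpperCase();
    }

    // Not annotated, so the scanner below does not expose it.
    public static void helper() {
    }
}

final class UdfScanner {
    // Maps method name to declared schema for every annotated static method.
    static Map<String, String> discover(Class<?> library) {
        Map<String, String> found = new TreeMap<>();
        for (Method m : library.getDeclaredMethods()) {
            SketchOutputSchema schema = m.getAnnotation(SketchOutputSchema.class);
            if (schema != null && Modifier.isStatic(m.getModifiers())) {
                found.put(m.getName(), schema.value());
            }
        }
        return found;
    }
}
```

Calling UdfScanner.discover(SketchUdfLibrary.class) yields one entry per annotated method, which is the sense in which one class can carry several UDFs. Accumulator-style functions would need instance methods, as noted in the reply, and are left out of this sketch.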
[jira] [Updated] (PIG-2767) Pig creates wrong schema after dereferencing nested tuple fields
[ https://issues.apache.org/jira/browse/PIG-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2767: Fix Version/s: 0.11 Pig creates wrong schema after dereferencing nested tuple fields Key: PIG-2767 URL: https://issues.apache.org/jira/browse/PIG-2767 Project: Pig Issue Type: Bug Components: parser Affects Versions: 0.10.0 Environment: Amazon EMR, patched to use Pig 0.10.0 Reporter: Jonathan Packer Assignee: Daniel Dai Fix For: 0.11 Attachments: test_data.txt The following script fails: data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int); nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple; dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3); DESCRIBE dereferenced; uses_dereferenced = FOREACH dereferenced GENERATE nested_tuple.f3; DESCRIBE uses_dereferenced; The schema of dereferenced should be {f1: int, nested_tuple: (f2: int, f3: int)}. DESCRIBE reports {f1: int, f2: int} instead. When DUMP is used, however, the data actually has the correct structure, e.g. (1,(2,3)) (5,(6,7)) ... This is not just a problem with DESCRIBE. Because the schema is incorrect, the reference to nested_tuple in the uses_dereferenced statement is considered invalid, and the script fails to run. The error is: Invalid field projection. Projected field [nested_tuple] does not exist in schema: f1:int,f2:int.
[jira] [Assigned] (PIG-2767) Pig creates wrong schema after dereferencing nested tuple fields
[ https://issues.apache.org/jira/browse/PIG-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai reassigned PIG-2767: --- Assignee: Daniel Dai Pig creates wrong schema after dereferencing nested tuple fields Key: PIG-2767 URL: https://issues.apache.org/jira/browse/PIG-2767 Project: Pig Issue Type: Bug Components: parser Affects Versions: 0.10.0 Environment: Amazon EMR, patched to use Pig 0.10.0 Reporter: Jonathan Packer Assignee: Daniel Dai Fix For: 0.11 Attachments: test_data.txt The following script fails: data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3: int, f4: int); nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple; dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3); DESCRIBE dereferenced; uses_dereferenced = FOREACH dereferenced GENERATE nested_tuple.f3; DESCRIBE uses_dereferenced; The schema of dereferenced should be {f1: int, nested_tuple: (f2: int, f3: int)}. DESCRIBE reports {f1: int, f2: int} instead. When DUMP is used, however, the data actually has the correct structure, e.g. (1,(2,3)) (5,(6,7)) ... This is not just a problem with DESCRIBE. Because the schema is incorrect, the reference to nested_tuple in the uses_dereferenced statement is considered invalid, and the script fails to run. The error is: Invalid field projection. Projected field [nested_tuple] does not exist in schema: f1:int,f2:int.
[jira] [Commented] (PIG-2697) pretty print schema
[ https://issues.apache.org/jira/browse/PIG-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401773#comment-13401773 ] Jonathan Coveney commented on PIG-2697: --- +1. Assuming it passes ant test-commit (#berigorousgetitright :P), I'll commit. pretty print schema --- Key: PIG-2697 URL: https://issues.apache.org/jira/browse/PIG-2697 Project: Pig Issue Type: Improvement Components: grunt Reporter: Raghu Angadi Assignee: Raghu Angadi Fix For: 0.11 Attachments: PIG-2697.patch, PIG-2697.patch Currently 'describe' dumps the schema in one line. If you have a long or complicated schema, it is pretty much impossible to figure out how the schema looks or what the fields are. I will provide an example below.
[jira] [Updated] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Li updated PIG-2769: Attachment: case1.tar example code and data file a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.10.0 Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) Reporter: Dan Li Fix For: 0.10.0 Attachments: case1.tar We found that the following simple logic causes a very long compile time on Pig 0.10.0, while with Pig 0.8.1 everything is fine. A = load 'A.txt' using PigStorage() AS (m: int); B = FOREACH A { days_str = (chararray) (m == 1 ? 31: (m == 2 ? 28: (m == 3 ? 31: (m == 4 ? 30: (m == 5 ? 31: (m == 6 ? 30: (m == 7 ? 31: (m == 8 ? 31: (m == 9 ? 30: (m == 10 ? 31: (m == 11 ? 30:31))))))))))); GENERATE days_str as days_str; } store B into 'B'; Here is a simple example input file, A.txt: 1 2 3 The Pig version we used in the test: Apache Pig version 0.10.0-SNAPSHOT (rexported)
[jira] [Created] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
Dan Li created PIG-2769: --- Summary: a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.10.0 Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) Reporter: Dan Li Fix For: 0.10.0 Attachments: case1.tar We found that the following simple logic causes a very long compile time on Pig 0.10.0, while with Pig 0.8.1 everything is fine. A = load 'A.txt' using PigStorage() AS (m: int); B = FOREACH A { days_str = (chararray) (m == 1 ? 31: (m == 2 ? 28: (m == 3 ? 31: (m == 4 ? 30: (m == 5 ? 31: (m == 6 ? 30: (m == 7 ? 31: (m == 8 ? 31: (m == 9 ? 30: (m == 10 ? 31: (m == 11 ? 30:31))))))))))); GENERATE days_str as days_str; } store B into 'B'; Here is a simple example input file, A.txt: 1 2 3 The Pig version we used in the test: Apache Pig version 0.10.0-SNAPSHOT (rexported)
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401785#comment-13401785 ] Dan Li commented on PIG-2769: - It's worth pointing out that Pig 0.9.2 also runs quickly; we only see the degradation with Pig 0.10.0. The performance degradation seems to have a knee: 4 or 5 conditionals work as expected, but as presented the script takes about 6 minutes at the Grunt prompt after hitting enter, before any Hadoop execution. -Clay a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.10.0 Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) Reporter: Dan Li Fix For: 0.10.0 Attachments: case1.tar We found that the following simple logic causes a very long compile time on Pig 0.10.0, while with Pig 0.8.1 everything is fine. A = load 'A.txt' using PigStorage() AS (m: int); B = FOREACH A { days_str = (chararray) (m == 1 ? 31: (m == 2 ? 28: (m == 3 ? 31: (m == 4 ? 30: (m == 5 ? 31: (m == 6 ? 30: (m == 7 ? 31: (m == 8 ? 31: (m == 9 ? 30: (m == 10 ? 31: (m == 11 ? 30:31))))))))))); GENERATE days_str as days_str; } store B into 'B'; Here is a simple example input file, A.txt: 1 2 3 The Pig version we used in the test: Apache Pig version 0.10.0-SNAPSHOT (rexported)
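For readers trying to follow what the nested conditional in the PIG-2769 script computes, the sketch below writes the same month-to-days mapping as a table lookup in Java. It is illustrative only: DaysInMonth and daysStr are made-up names, nobody in the thread proposed this, and it says nothing about why the 0.10.0 parser is slow.

```java
// Illustrative only: the mapping encoded by the nested conditional in the
// PIG-2769 script, written as an array lookup.
final class DaysInMonth {
    // Index 0 is unused so that month numbers 1..12 index directly.
    private static final int[] DAYS =
        {0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

    // Mirrors the script: m = 1..11 is matched explicitly and any other
    // value falls through to the final branch, 31. The result is a string,
    // matching the (chararray) cast in the script.
    static String daysStr(int m) {
        int days = (m >= 1 && m <= 11) ? DAYS[m] : 31;
        return Integer.toString(days);
    }
}
```

With the example input file (1, 2, 3), daysStr gives "31", "28" and "31".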
Re: Review Request: PIG-2763 - Groovy UDFs
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/5591/ --- (Updated June 26, 2012, 11:26 p.m.) Review request for pig, Julien Le Dem and Jonathan Coveney. Changes --- Added ref to PIG-2763 Description --- Adds support for Groovy UDFs in Pig. This addresses bug PIG-2763. https://issues.apache.org/jira/browse/PIG-2763 Diffs - /trunk/ivy.xml 1353307 /trunk/ivy/libraries.properties 1353307 /trunk/src/org/apache/pig/scripting/ScriptEngine.java 1354285 /trunk/src/org/apache/pig/scripting/groovy/AccumulatorAccumulate.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/AccumulatorCleanup.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/AccumulatorGetValue.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/AlgebraicFinal.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/AlgebraicInitial.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/AlgebraicIntermed.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/GroovyAccumulatorEvalFunc.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/GroovyAlgebraicEvalFunc.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/GroovyEvalFunc.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/GroovyEvalFuncObject.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/GroovyScriptEngine.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/GroovyUtils.java PRE-CREATION /trunk/src/org/apache/pig/scripting/groovy/OutputSchemaFunction.java PRE-CREATION /trunk/test/org/apache/pig/test/TestUDFGroovy.java PRE-CREATION /trunk/test/unit-tests 1353307 Diff: https://reviews.apache.org/r/5591/diff/ Testing --- Thanks, Mathias Herberts
[jira] [Resolved] (PIG-2697) pretty print schema
[ https://issues.apache.org/jira/browse/PIG-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Coveney resolved PIG-2697. --- Resolution: Fixed pretty print schema --- Key: PIG-2697 URL: https://issues.apache.org/jira/browse/PIG-2697 Project: Pig Issue Type: Improvement Components: grunt Reporter: Raghu Angadi Assignee: Raghu Angadi Fix For: 0.11 Attachments: PIG-2697.patch, PIG-2697.patch Currently 'describe' dumps the schema in one line. If you have a long or complicated schema, it is pretty much impossible to figure out how the schema looks or what the fields are. I will provide an example below.
[jira] [Commented] (PIG-2697) pretty print schema
[ https://issues.apache.org/jira/browse/PIG-2697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401799#comment-13401799 ] Jonathan Coveney commented on PIG-2697: --- It's in! pretty print schema --- Key: PIG-2697 URL: https://issues.apache.org/jira/browse/PIG-2697 Project: Pig Issue Type: Improvement Components: grunt Reporter: Raghu Angadi Assignee: Raghu Angadi Fix For: 0.11 Attachments: PIG-2697.patch, PIG-2697.patch Currently 'describe' dumps the schema in one line. If you have a long or complicated schema, it is pretty much impossible to figure out how the schema looks or what the fields are. I will provide an example below.
[jira] [Updated] (PIG-2770) Allow easy inclusion of custom build targets
[ https://issues.apache.org/jira/browse/PIG-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Le Dem updated PIG-2770: --- Attachment: PIG-2770.patch Allow easy inclusion of custom build targets Key: PIG-2770 URL: https://issues.apache.org/jira/browse/PIG-2770 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Attachments: PIG-2770.patch By adding a line to build.xml, we allow users to easily customize the build: <import file="./build-site.xml" optional="true"/>
[jira] [Updated] (PIG-2770) Allow easy inclusion of custom build targets
[ https://issues.apache.org/jira/browse/PIG-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Le Dem updated PIG-2770: --- Patch Info: Patch Available Allow easy inclusion of custom build targets Key: PIG-2770 URL: https://issues.apache.org/jira/browse/PIG-2770 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Attachments: PIG-2770.patch By adding a line to build.xml, we allow users to easily customize the build: <import file="./build-site.xml" optional="true"/>
[jira] [Created] (PIG-2770) Allow easy inclusion of custom build targets
Julien Le Dem created PIG-2770: -- Summary: Allow easy inclusion of custom build targets Key: PIG-2770 URL: https://issues.apache.org/jira/browse/PIG-2770 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Attachments: PIG-2770.patch By adding a line to build.xml, we allow users to easily customize the build: <import file="./build-site.xml" optional="true"/>
[jira] [Resolved] (PIG-2770) Allow easy inclusion of custom build targets
[ https://issues.apache.org/jira/browse/PIG-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Le Dem resolved PIG-2770. Resolution: Fixed Fix Version/s: 0.11 Assignee: Julien Le Dem Allow easy inclusion of custom build targets Key: PIG-2770 URL: https://issues.apache.org/jira/browse/PIG-2770 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Assignee: Julien Le Dem Fix For: 0.11 Attachments: PIG-2770.patch By adding a line to build.xml, we allow users to easily customize the build: <import file="./build-site.xml" optional="true"/>
[jira] [Commented] (PIG-2661) Pig uses an extra job for loading data in Pigmix L9
[ https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401828#comment-13401828 ] Jie Li commented on PIG-2661: - Some benchmark results using 1GB TPCH data (lineitem):
||query||trunk||this patch||
||load-orderby-store| 1m41s (load) + 53s (sample) + 3m11s (orderby) | 38s (sample) + 3m27s (orderby)|
||load-orderby-filter-store| 41s (load) + 32s (sample) + 35s (orderby) | 38s (sample) + 50s (orderby) |
Note the filter is very selective, but we didn't see a slowdown of the sample job. The slight slowdown of the orderby job might result from different serialization. In both queries, we save one entire load job. But another issue just came to mind: though the distribution won't change, the number of samples might change after the pipeline. If the pipeline decreases #records, as with filter/limit/sample, then we'll have fewer samples at the end, but we also have a smaller order-by which doesn't need many samples. If the pipeline increases #records, as with flatten/stream, then we may end up with many samples at the end, which is likely to perform poorly. Therefore let's just disable the sample optimization if we find these exploding pipeline operators. (What else besides flatten/stream?) Pig uses an extra job for loading data in Pigmix L9 --- Key: PIG-2661 URL: https://issues.apache.org/jira/browse/PIG-2661 Project: Pig Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Jie Li Assignee: Jie Li Attachments: PIG-2661.0.patch, PIG-2661.1.patch See https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155
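The rule proposed in the comment above (skip the sample optimization when the pipeline contains a record-multiplying operator) can be sketched as a simple check. This is a hedged illustration, not the patch: SampleOptimizationCheck and canReuseSamples are hypothetical names, and operators are represented as plain strings, whereas the real change would inspect Pig's plan operators.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SampleOptimizationCheck {
    // Operators that can increase the number of records, per the comment:
    // flatten and stream. The thread leaves open whether there are others.
    private static final Set<String> EXPANDING =
        new HashSet<>(Arrays.asList("flatten", "stream"));

    // Returns true when no operator in the pipeline can multiply records,
    // so sampling before the pipeline should not produce too many samples.
    static boolean canReuseSamples(List<String> pipelineOps) {
        for (String op : pipelineOps) {
            if (EXPANDING.contains(op.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```

Under this sketch a load, filter, order pipeline keeps the optimization, while any pipeline that includes flatten or stream falls back to the separate load job.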
[jira] [Commented] (PIG-2766) Pig-HCat Usability
[ https://issues.apache.org/jira/browse/PIG-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401840#comment-13401840 ] Daniel Dai commented on PIG-2766: - A couple of comments: 1. We should not hard-code the version; instead, find those jars using a prefix. 2. Why put $additionalJars in the middle? Putting it at the end seems more intuitive. 3. In the change [[ $f = -secretDebugCmd ]], is anything wrong with the original syntax? Pig-HCat Usability -- Key: PIG-2766 URL: https://issues.apache.org/jira/browse/PIG-2766 Project: Pig Issue Type: Bug Components: grunt, tools Affects Versions: 0.10.0 Reporter: Vikram Dixit K Assignee: Vikram Dixit K Fix For: 0.10.0 Attachments: PIG-2766.patch, PIG-2766_2.patch Currently, to use HCat from Pig (via HCatLoader/HCatStorer), users need to register a bunch of jars and set a couple of configuration parameters. For a novice user, it is non-trivial to find all the relevant jars and config params. We should have better integration between Pig and HCat by pre-configuring Pig to load all these jars and configs.
[jira] [Commented] (PIG-1314) Add DateTime Support to Pig
[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401884#comment-13401884 ] Zhijie Shen commented on PIG-1314: -- Hi Thejas and Russell, I'll do serialization for timezone as well. {quote} I think converting the string timezone (location name) to UTC offset in minutes, is one possibility. {quote} In my opinion, this kind of compression is lossy. Several time zones may share the same UTC offset, so when the reverse operation is performed, it will be unknown which time zone the UTC offset should be converted back to. {quote} We need an efficient way to serialize timezone along with the long. Can you propose something ? (Maybe, just make it efficient for 256 most 'popular' timezones and store it a byte. And not have the byte for UTC. For other timezones, add a timezone string ?) {quote} The time zone classes in both the builtin library and Joda have a getAvailableIDs function, which returns all the available time zone strings. On my machine, I got 616 from the builtin time zone class and 558 from the Joda one. We could probably have a one-to-one mapping between the time zone strings and integer ids stored in short variables. However, the word "available" in getAvailableIDs sounds tricky: I'm not sure whether getAvailableIDs returns the same time zone list on all machines or is machine-dependent. Add DateTime Support to Pig --- Key: PIG-1314 URL: https://issues.apache.org/jira/browse/PIG-1314 Project: Pig Issue Type: Bug Components: data Affects Versions: 0.7.0 Reporter: Russell Jurney Assignee: Zhijie Shen Labels: gsoc2012 Attachments: PIG-1314-1.patch, PIG-1314-2.patch, joda_vs_builtin.zip Original Estimate: 672h Remaining Estimate: 672h Hadoop/Pig are primarily used to parse log data, and most logs have a timestamp component. Therefore Pig should support dates as a primitive. Can someone familiar with adding types to pig comment on how hard this is? We're looking at doing this, rather than use UDFs. Is this a patch that would be accepted? This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
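The one-to-one mapping between time zone strings and short ids suggested in the comment above could look roughly like the following. This is a sketch under assumptions, not the patch's serialization format: ZoneIdCodec is a hypothetical class, the 10-byte layout (8-byte millis plus 2-byte id) is chosen only for illustration, and the id table is passed in by the caller, for example from java.util.TimeZone.getAvailableIDs().

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class ZoneIdCodec {
    // Sorted copy of the zone ids; the position in this array is the short id.
    private final String[] ids;

    ZoneIdCodec(String[] availableIds) {
        ids = availableIds.clone();
        Arrays.sort(ids);
        if (ids.length > Short.MAX_VALUE) {
            throw new IllegalArgumentException("too many zone ids for a short");
        }
    }

    // Encodes the instant as 8 bytes of millis followed by a 2-byte zone id.
    byte[] write(long millis, String zoneId) {
        int index = Arrays.binarySearch(ids, zoneId);
        if (index < 0) {
            throw new IllegalArgumentException("unknown zone: " + zoneId);
        }
        return ByteBuffer.allocate(Long.BYTES + Short.BYTES)
                .putLong(millis)
                .putShort((short) index)
                .array();
    }

    long readMillis(byte[] data) {
        return ByteBuffer.wrap(data).getLong();
    }

    // Recovers the original zone string, which an offset-only encoding cannot.
    String readZone(byte[] data) {
        return ids[ByteBuffer.wrap(data).getShort(Long.BYTES)];
    }
}
```

The mapping is only stable if writer and reader build the codec from the same id list, which is exactly the machine-dependence concern raised in the comment.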
[jira] [Updated] (PIG-2661) Pig uses an extra job for loading data in Pigmix L9
[ https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Li updated PIG-2661: Attachment: PIG-2661.2.patch Attached the patch that disables sample optimization if there is flatten/stream. Pig uses an extra job for loading data in Pigmix L9 --- Key: PIG-2661 URL: https://issues.apache.org/jira/browse/PIG-2661 Project: Pig Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Jie Li Assignee: Jie Li Attachments: PIG-2661.0.patch, PIG-2661.1.patch, PIG-2661.2.patch See https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155
[jira] [Commented] (PIG-1314) Add DateTime Support to Pig
[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401937#comment-13401937 ] Thejas M Nair commented on PIG-1314: bq. Several time zones may share the same UTC offset, such that when the reverse operation is to do, it will be unknown which timezone the UTC offset should be converted to. Yes, it will be lossy, but the part that is important for date calculations is preserved. The ISO spec only has an offset for the time zone. I don't think we have to allow a datetime field to be used for storing location information. Does JodaTime preserve the location string? bq. I'm not sure whether getAvailableIDs returns the same time zone list on all machines or is machine-dependent. It depends on the release/jar (http://joda-time.sourceforge.net/tz_update.html). As Pig will be shipping this jar to the nodes, it is OK to assume that it will be the same across all nodes for a query. So it is safe to rely on the id for intermediate serialization. But won't JodaTime support a time zone outside this list if the user specifies a date using the UTC offset format? Add DateTime Support to Pig --- Key: PIG-1314 URL: https://issues.apache.org/jira/browse/PIG-1314 Project: Pig Issue Type: Bug Components: data Affects Versions: 0.7.0 Reporter: Russell Jurney Assignee: Zhijie Shen Labels: gsoc2012 Attachments: PIG-1314-1.patch, PIG-1314-2.patch, joda_vs_builtin.zip Original Estimate: 672h Remaining Estimate: 672h Hadoop/Pig are primarily used to parse log data, and most logs have a timestamp component. Therefore Pig should support dates as a primitive. Can someone familiar with adding types to pig comment on how hard this is? We're looking at doing this, rather than use UDFs. Is this a patch that would be accepted? This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
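The trade-off discussed in the comment above (an offset-only encoding keeps what date arithmetic needs but loses the location name) can be seen with a small sketch. It uses the builtin java.util.TimeZone class rather than Joda-Time, and OffsetEncoding and offsetMinutes are hypothetical names introduced for illustration.

```java
import java.util.TimeZone;

final class OffsetEncoding {
    // UTC offset, in minutes, that the named zone observes at the given
    // instant (daylight saving included). Note that TimeZone.getTimeZone
    // silently falls back to GMT for ids it does not recognise.
    static int offsetMinutes(String zoneId, long millis) {
        return TimeZone.getTimeZone(zoneId).getOffset(millis) / 60000;
    }
}
```

For an instant in late June 2012, Europe/Paris and Europe/Berlin both encode to the same offset, so the original zone name cannot be recovered from the offset alone, while the offset itself is still enough to render the local time.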