[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (37 issues)
Subscriber: pigdaily

Key         Summary
PIG-3247    Piggybank functions to mimic OVER clause in SQL
            https://issues.apache.org/jira/browse/PIG-3247
PIG-3244    Make PIG_HOME configurable
            https://issues.apache.org/jira/browse/PIG-3244
PIG-3238    Pig current releases lack a UDF Stuff(). This UDF deletes a specified length of characters and inserts another set of characters at a specified starting point.
            https://issues.apache.org/jira/browse/PIG-3238
PIG-3237    Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a string containing substrings separated by "," characters) consisting of the strings that have the corresponding bit in the first argument
            https://issues.apache.org/jira/browse/PIG-3237
PIG-3235    Enable DEBUG log messages in unit tests by default
            https://issues.apache.org/jira/browse/PIG-3235
PIG-3233    Deploy a Piggybank Jar
            https://issues.apache.org/jira/browse/PIG-3233
PIG-3215    [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
            https://issues.apache.org/jira/browse/PIG-3215
PIG-3210    Pig fails to start when it cannot write log to log files
            https://issues.apache.org/jira/browse/PIG-3210
PIG-3208    [zebra] TFile should not set io.compression.codec.lzo.buffersize
            https://issues.apache.org/jira/browse/PIG-3208
PIG-3205    Passing arguments to python script does not work with -f option
            https://issues.apache.org/jira/browse/PIG-3205
PIG-3198    Let users use any function from PigType -> PigType as if it were builtlin
            https://issues.apache.org/jira/browse/PIG-3198
PIG-3194    Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
            https://issues.apache.org/jira/browse/PIG-3194
PIG-3190    Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
            https://issues.apache.org/jira/browse/PIG-3190
PIG-3183    rm or rmf commands should respect globbing/regex of path
            https://issues.apache.org/jira/browse/PIG-3183
PIG-3172    Partition filter push down does not happen when there is a non partition key map column filter
            https://issues.apache.org/jira/browse/PIG-3172
PIG-3166    Update eclipse .classpath according to ivy library.properties
            https://issues.apache.org/jira/browse/PIG-3166
PIG-3164    Pig current releases lack a UDF endsWith.This UDF tests if a given string ends with the specified suffix.
            https://issues.apache.org/jira/browse/PIG-3164
PIG-3141    Giving CSVExcelStorage an option to handle header rows
            https://issues.apache.org/jira/browse/PIG-3141
PIG-3123    Simplify Logical Plans By Removing Unneccessary Identity Projections
            https://issues.apache.org/jira/browse/PIG-3123
PIG-3122    Operators should not implicitly become reserved keywords
            https://issues.apache.org/jira/browse/PIG-3122
PIG-3114    Duplicated macro name error when using pigunit
            https://issues.apache.org/jira/browse/PIG-3114
PIG-3105    Fix TestJobSubmission unit test failure.
            https://issues.apache.org/jira/browse/PIG-3105
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3077    TestMultiQueryLocal should not write in /tmp
            https://issues.apache.org/jira/browse/PIG-3077
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2643    Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc
            https://issues.apache.org/jira/browse/PIG-2643
PIG-2641    Create toJSON function for all complex types: tuples, bags and maps
            https://issues.apache.org/jira/browse/PIG-2641
PIG-2591    Unit
[jira] [Updated] (PIG-3247) Piggybank functions to mimic OVER clause in SQL
[ https://issues.apache.org/jira/browse/PIG-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-3247:
    Fix Version/s: 0.12
    Status: Patch Available (was: Open)

> Piggybank functions to mimic OVER clause in SQL
> ---
>
> Key: PIG-3247
> URL: https://issues.apache.org/jira/browse/PIG-3247
> Project: Pig
> Issue Type: New Feature
> Components: piggybank
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.12
>
> Attachments: Over.patch
>
>
> In order to test Hive I have written some UDFs to mimic the behavior of SQL's
> OVER clause. I thought they would be useful to share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3247) Piggybank functions to mimic OVER clause in SQL
[ https://issues.apache.org/jira/browse/PIG-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-3247:
    Attachment: Over.patch

> Piggybank functions to mimic OVER clause in SQL
> ---
>
> Key: PIG-3247
> URL: https://issues.apache.org/jira/browse/PIG-3247
> Project: Pig
> Issue Type: New Feature
> Components: piggybank
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: Over.patch
>
>
> In order to test Hive I have written some UDFs to mimic the behavior of SQL's
> OVER clause. I thought they would be useful to share.
[jira] [Commented] (PIG-3247) Piggybank functions to mimic OVER clause in SQL
[ https://issues.apache.org/jira/browse/PIG-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601801#comment-13601801 ]

Alan Gates commented on PIG-3247:
-

Basic OVER functionality can be accomplished in Pig using GROUP BY and FOREACH FLATTEN. For example:
{code}
select s, min(i) over (partition by s) from T
{code}
is done in Pig as:
{code}
A = load 'T';
B = group A by s;
C = foreach B generate flatten(A), MIN(A.i) as min;
D = foreach C generate A::s, min;
{code}
But as soon as a windowing clause is added this no longer works, because the function needs to be called once for each row in the bag and only a subset of the bag should be passed to the function on each call.

To address this I've added two new functions:

Stitch - Given multiple bags, this stitches them together row by row. So if you have two bags:
{code}
bag A: { (1, 2), (3, 4) }
bag B: { (a, b), (c, d) }
{code}
Then Stitch(A, B) will return
{code}
{ (1, 2, a, b), (3, 4, c, d) }
{code}

Over - Implements the standard SQL windowing and analytic functions, including: rank, dense_rank, cume_dist, percent_rank, ntile, first_value, last_value, lead, and lag.

Together these can be used to do windowing and analytic functions in Pig.

Pig already has rank and dense_rank, and this is in no way meant to replace them. This is meant to mimic exactly the SQL functionality. Also, these functions make no allowance for large sets that don't fit in memory on a single reducer.

> Piggybank functions to mimic OVER clause in SQL
> ---
>
> Key: PIG-3247
> URL: https://issues.apache.org/jira/browse/PIG-3247
> Project: Pig
> Issue Type: New Feature
> Components: piggybank
> Reporter: Alan Gates
> Assignee: Alan Gates
>
> In order to test Hive I have written some UDFs to mimic the behavior of SQL's
> OVER clause. I thought they would be useful to share.
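The row-by-row behaviour described for Stitch, and the reason a windowing clause needs one function call per row over a subset of the bag, can be illustrated with a small Java sketch. This is not the code in Over.patch: the class name StitchSketch, the methods stitch and windowedMin, and the use of nested lists to stand in for Pig bags and tuples are all assumptions made for illustration. Stopping at the shortest bag is also an assumption, since the comment does not say what happens when bags differ in length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: nested lists stand in for Pig bags of tuples.
public class StitchSketch {

    // Row k of the result is row k of every input bag, concatenated in order.
    // Assumption: the result is as long as the shortest input bag.
    @SafeVarargs
    public static List<List<Object>> stitch(List<List<Object>>... bags) {
        List<List<Object>> out = new ArrayList<>();
        if (bags.length == 0) {
            return out;
        }
        int rows = Integer.MAX_VALUE;
        for (List<List<Object>> bag : bags) {
            rows = Math.min(rows, bag.size());
        }
        for (int r = 0; r < rows; r++) {
            List<Object> joined = new ArrayList<>();
            for (List<List<Object>> bag : bags) {
                joined.addAll(bag.get(r));
            }
            out.add(joined);
        }
        return out;
    }

    // A windowed MIN: for each row, only the rows from (i - preceding) to
    // (i + following) are considered, so the aggregate is evaluated once per row.
    public static List<Integer> windowedMin(List<Integer> values, int preceding, int following) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            int from = Math.max(0, i - preceding);
            int to = Math.min(values.size() - 1, i + following);
            int min = values.get(from);
            for (int j = from + 1; j <= to; j++) {
                min = Math.min(min, values.get(j));
            }
            out.add(min);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> a = Arrays.asList(
                Arrays.<Object>asList(1, 2), Arrays.<Object>asList(3, 4));
        List<List<Object>> b = Arrays.asList(
                Arrays.<Object>asList("a", "b"), Arrays.<Object>asList("c", "d"));
        System.out.println(stitch(a, b));                          // [[1, 2, a, b], [3, 4, c, d]]
        System.out.println(windowedMin(Arrays.asList(5, 3, 8, 1), 1, 0)); // [5, 3, 3, 1]
    }
}
```

The two pieces fit together in the way the comment suggests: a per-row window function produces one output value per input row, and a row-by-row stitch puts those values back next to the rows they were computed for.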
[jira] [Created] (PIG-3247) Piggybank functions to mimic OVER clause in SQL
Alan Gates created PIG-3247:
---

Summary: Piggybank functions to mimic OVER clause in SQL
Key: PIG-3247
URL: https://issues.apache.org/jira/browse/PIG-3247
Project: Pig
Issue Type: New Feature
Components: piggybank
Reporter: Alan Gates
Assignee: Alan Gates

In order to test Hive I have written some UDFs to mimic the behavior of SQL's OVER clause. I thought they would be useful to share.
[jira] [Created] (PIG-3246) not possible to use remote filesystems (S3) in a pig script
Moritz Moeller created PIG-3246:
---

Summary: not possible to use remote filesystems (S3) in a pig script
Key: PIG-3246
URL: https://issues.apache.org/jira/browse/PIG-3246
Project: Pig
Issue Type: Bug
Environment: Apache Pig version 0.10.0-cdh4.2.0 (rexported)
Hadoop 2.0.0-cdh4.2.0
Reporter: Moritz Moeller

My Hadoop cluster is configured using hdfs://namenode/; both hdfs dfs and Pig scripts work fine.

Now I want to read data from S3. hdfs dfs -ls s3n://mybucket/file.csv works fine.

A Pig script doing LOAD 's3n://mybucket/test.csv', however, fails. It looks as if Pig is performing the LOAD request using an HDFS FileSystem.

I wasn't sure whether to mark this as a bug or an improvement, as I do not know if this should be possible. But as it is a basic feature of Hadoop, I guess it should work in Pig, too.

org.apache.pig.backend.executionengine.ExecException: ERROR 2118: java.net.UnknownHostException: mybucket
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:452)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:469)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:366)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1215)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1215)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:336)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:233)
    at java.lang.Thread.run(Thread.java:722)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:257)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: sdfa
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
    at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:295)
    at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:247)
    at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:468)
    at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:452)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:205)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:269)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
    ... 13 more
Caused by: java.net.UnknownHostException: mybucket
    ... 25 more
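The UnknownHostException names the bucket, which is consistent with the authority part of the s3n location being handled as a network host somewhere in the token-collection path shown in the trace. The sketch below does not reproduce Pig's or Hadoop's code; it only uses java.net.URI from the JDK to show how the two locations from the report split into scheme, host and path. The class name UriPartsSketch and the method describe are hypothetical.

```java
import java.net.URI;

// Illustrative only: shows how a location string splits into URI components.
public class UriPartsSketch {

    // Returns the scheme, host (authority) and path of a location string.
    public static String describe(String location) {
        URI uri = URI.create(location);
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // The bucket name sits where a host name would sit in an hdfs:// URI.
        System.out.println(describe("s3n://mybucket/test.csv")); // scheme=s3n host=mybucket path=/test.csv
        System.out.println(describe("hdfs://namenode/"));        // scheme=hdfs host=namenode path=/
    }
}
```

Any code that resolves the host component through DNS without first checking the scheme would fail on the first location in the way the trace shows; whether that is what happens here is for the issue to establish.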
[jira] [Commented] (PIG-3077) TestMultiQueryLocal should not write in /tmp
[ https://issues.apache.org/jira/browse/PIG-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601720#comment-13601720 ]

Prashant Kommireddi commented on PIG-3077:
--

Hi [~dreambird], thanks for working on this. I have a comment:
{code}
String tdir = System.getProperty("user.dir") + "/build/test/tmp/";
{code}
user.dir would not be required here. Setting tdir to "build/test/tmp" should work.

> TestMultiQueryLocal should not write in /tmp
> 
>
> Key: PIG-3077
> URL: https://issues.apache.org/jira/browse/PIG-3077
> Project: Pig
> Issue Type: Test
> Reporter: Julien Le Dem
> Assignee: Johnny Zhang
> Attachments: PIG-3077.patch.txt
>
>
> temporary files from tests should be under build/test so that they are
> cleaned by "ant clean"
> Currently two test suites running on the same machine step on each other and
> create flaky tests results
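The suggestion to drop user.dir rests on how java.io.File resolves relative paths: a relative path is resolved against the directory named by the user.dir system property. A minimal sketch of that equivalence follows. The class name RelativeTmpDirSketch and the method sameLocation are hypothetical, and the sketch covers java.io.File only; whether paths handed to Pig or Hadoop classes resolve the same way is not something it demonstrates.

```java
import java.io.File;

// Illustrative only: compares the explicit and the relative form of the temp dir.
public class RelativeTmpDirSketch {

    // True when both spellings name the same absolute location.
    public static boolean sameLocation() {
        String explicit = new File(System.getProperty("user.dir") + "/build/test/tmp/")
                .getAbsolutePath();
        // A relative java.io.File is resolved against user.dir by getAbsolutePath().
        String relative = new File("build/test/tmp").getAbsolutePath();
        return explicit.equals(relative);
    }

    public static void main(String[] args) {
        System.out.println(sameLocation());
    }
}
```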
[jira] [Updated] (PIG-3077) TestMultiQueryLocal should not write in /tmp
[ https://issues.apache.org/jira/browse/PIG-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang updated PIG-3077:
--
    Attachment: PIG-3077.patch.txt

This patch respects 'pig.temp.dir' first, in case we want to put temp files in another location. Otherwise it sets the location to System.getProperty("user.dir") + "/build/test/tmp/". It uses the FileLocalizer to get the temp dir from PigContext, and then replaces every /tmp in the test code. I verified it, and TestMultiQueryLocal passes for me.

> TestMultiQueryLocal should not write in /tmp
> 
>
> Key: PIG-3077
> URL: https://issues.apache.org/jira/browse/PIG-3077
> Project: Pig
> Issue Type: Test
> Reporter: Julien Le Dem
> Attachments: PIG-3077.patch.txt
>
>
> temporary files from tests should be under build/test so that they are
> cleaned by "ant clean"
> Currently two test suites running on the same machine step on each other and
> create flaky tests results
[jira] [Updated] (PIG-3077) TestMultiQueryLocal should not write in /tmp
[ https://issues.apache.org/jira/browse/PIG-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang updated PIG-3077:
--
    Status: Patch Available (was: Open)

> TestMultiQueryLocal should not write in /tmp
> 
>
> Key: PIG-3077
> URL: https://issues.apache.org/jira/browse/PIG-3077
> Project: Pig
> Issue Type: Test
> Reporter: Julien Le Dem
> Assignee: Johnny Zhang
> Attachments: PIG-3077.patch.txt
>
>
> temporary files from tests should be under build/test so that they are
> cleaned by "ant clean"
> Currently two test suites running on the same machine step on each other and
> create flaky tests results
[jira] [Assigned] (PIG-3077) TestMultiQueryLocal should not write in /tmp
[ https://issues.apache.org/jira/browse/PIG-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang reassigned PIG-3077:
-
    Assignee: Johnny Zhang

> TestMultiQueryLocal should not write in /tmp
> 
>
> Key: PIG-3077
> URL: https://issues.apache.org/jira/browse/PIG-3077
> Project: Pig
> Issue Type: Test
> Reporter: Julien Le Dem
> Assignee: Johnny Zhang
> Attachments: PIG-3077.patch.txt
>
>
> temporary files from tests should be under build/test so that they are
> cleaned by "ant clean"
> Currently two test suites running on the same machine step on each other and
> create flaky tests results
[jira] [Updated] (PIG-3239) Unable to return multiple values from a macro using SPLIT
[ https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3239:
---
    Status: Open (was: Patch Available)

[~dreambird], thank you for the fix. I think your fix is correct. Can you please add a unit test case for this? TestMacroExpansion.java has splitTest, but that doesn't cover OTHERWISE. You might want to expand that test case, or add a new test case. Thanks!

> Unable to return multiple values from a macro using SPLIT
> -
>
> Key: PIG-3239
> URL: https://issues.apache.org/jira/browse/PIG-3239
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: Apache Pig version 0.10.0-cdh4.2.0 (rexported)
> compiled Feb 15 2013, 12:19:17
> Linux 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64
> x86_64 x86_64 GNU/Linux
> Reporter: Luis Belloch
> Assignee: Johnny Zhang
> Priority: Minor
> Attachments: PIG-3239.patch.txt
>
>
> Hi, I'm unable to return multiple values from a macro when values come from a
> SPLIT. Here is a small example:
> {code}
> DEFINE my_macro(seq) RETURNS valid, invalid {
>     added = FOREACH $seq GENERATE $0 * 2, $1;
>     SPLIT added INTO $valid IF $1 == true, $invalid OTHERWISE;
> }
> data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
> P, Q = my_macro(data);
> DUMP P;
> DUMP Q;
> {code}
> Pig is unable to recognize the {{OTHERWISE}} side. Error is: {{ERROR
> org.apache.pig.tools.grunt.Grunt - ERROR 1200: Invalid
> macro definition: . Reason: Macro 'my_macro' missing return alias: invalid}}
> A simple workaround is to force {{$invalid}} to be returned as a {{FOREACH}}
> result:
> {code}
> SPLIT added INTO $valid IF $1 == true, tmp_invalid OTHERWISE;
> $invalid = FOREACH tmp_invalid GENERATE *;
> {code}
> Samples and logs attached to the issue.
[jira] [Updated] (PIG-3239) Unable to return multiple values from a macro using SPLIT
[ https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang updated PIG-3239:
--
    Status: Patch Available (was: Open)

> Unable to return multiple values from a macro using SPLIT
> -
>
> Key: PIG-3239
> URL: https://issues.apache.org/jira/browse/PIG-3239
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: Apache Pig version 0.10.0-cdh4.2.0 (rexported)
> compiled Feb 15 2013, 12:19:17
> Linux 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64
> x86_64 x86_64 GNU/Linux
> Reporter: Luis Belloch
> Assignee: Johnny Zhang
> Priority: Minor
> Attachments: PIG-3239.patch.txt
>
>
> Hi, I'm unable to return multiple values from a macro when values come from a
> SPLIT. Here is a small example:
> {code}
> DEFINE my_macro(seq) RETURNS valid, invalid {
>     added = FOREACH $seq GENERATE $0 * 2, $1;
>     SPLIT added INTO $valid IF $1 == true, $invalid OTHERWISE;
> }
> data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
> P, Q = my_macro(data);
> DUMP P;
> DUMP Q;
> {code}
> Pig is unable to recognize the {{OTHERWISE}} side. Error is: {{ERROR
> org.apache.pig.tools.grunt.Grunt - ERROR 1200: Invalid
> macro definition: . Reason: Macro 'my_macro' missing return alias: invalid}}
> A simple workaround is to force {{$invalid}} to be returned as a {{FOREACH}}
> result:
> {code}
> SPLIT added INTO $valid IF $1 == true, tmp_invalid OTHERWISE;
> $invalid = FOREACH tmp_invalid GENERATE *;
> {code}
> Samples and logs attached to the issue.
[jira] [Commented] (PIG-3239) Unable to return multiple values from a macro using SPLIT
[ https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600949#comment-13600949 ]

Luis Belloch commented on PIG-3239:
---

Thanks! We'll test it internally.

> Unable to return multiple values from a macro using SPLIT
> -
>
> Key: PIG-3239
> URL: https://issues.apache.org/jira/browse/PIG-3239
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: Apache Pig version 0.10.0-cdh4.2.0 (rexported)
> compiled Feb 15 2013, 12:19:17
> Linux 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64
> x86_64 x86_64 GNU/Linux
> Reporter: Luis Belloch
> Assignee: Johnny Zhang
> Priority: Minor
> Attachments: PIG-3239.patch.txt
>
>
> Hi, I'm unable to return multiple values from a macro when values come from a
> SPLIT. Here is a small example:
> {code}
> DEFINE my_macro(seq) RETURNS valid, invalid {
>     added = FOREACH $seq GENERATE $0 * 2, $1;
>     SPLIT added INTO $valid IF $1 == true, $invalid OTHERWISE;
> }
> data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
> P, Q = my_macro(data);
> DUMP P;
> DUMP Q;
> {code}
> Pig is unable to recognize the {{OTHERWISE}} side. Error is: {{ERROR
> org.apache.pig.tools.grunt.Grunt - ERROR 1200: Invalid
> macro definition: . Reason: Macro 'my_macro' missing return alias: invalid}}
> A simple workaround is to force {{$invalid}} to be returned as a {{FOREACH}}
> result:
> {code}
> SPLIT added INTO $valid IF $1 == true, tmp_invalid OTHERWISE;
> $invalid = FOREACH tmp_invalid GENERATE *;
> {code}
> Samples and logs attached to the issue.