[jira] [Commented] (PIG-3223) AvroStorage does not handle comma separated input paths
[ https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611505#comment-13611505 ]

Prashant Kommireddi commented on PIG-3223:
------------------------------------------

Thanks [~dreambird]. I have a question regarding the current approach - why isn't the globbing implemented in PigAvroInputFormat? Overriding listStatus(JobContext job) should be cleaner, unless I am missing something very specific to Avro?

> AvroStorage does not handle comma separated input paths
> -------------------------------------------------------
>
> Key: PIG-3223
> URL: https://issues.apache.org/jira/browse/PIG-3223
> Project: Pig
> Issue Type: Bug
> Components: piggybank
> Affects Versions: 0.10.0, 0.11
> Reporter: Michael Kramer
> Assignee: Johnny Zhang
> Attachments: AvroStorage.patch, AvroStorage.patch-2, AvroStorageUtils.patch, AvroStorageUtils.patch-2, PIG-3223.patch.txt, PIG-3223.patch.txt
>
> In pig 0.11, a patch was issued to AvroStorage to support globs and comma separated input paths (PIG-2492). While this function works fine for glob-formatted input paths, it fails when issued a standard comma separated list of paths. fs.globStatus does not seem to be able to parse out such a list, and a java.net.URISyntaxException is thrown when toURI is called on the path.
> I have a working fix for this, but it's extremely ugly (basically checking if the string of input paths is globbed, otherwise splitting on ","). I'm sure there's a more elegant solution. I'd be happy to post the relevant methods and "fixes" if necessary.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (34 issues)
Subscriber: pigdaily

Key         Summary
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3247    Piggybank functions to mimic OVER clause in SQL
            https://issues.apache.org/jira/browse/PIG-3247
PIG-3238    Pig current releases lack a UDF Stuff(). This UDF deletes a specified length of characters and inserts another set of characters at a specified starting point.
            https://issues.apache.org/jira/browse/PIG-3238
PIG-3237    Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a string containing substrings separated by "," characters) consisting of the strings that have the corresponding bit in the first argument
            https://issues.apache.org/jira/browse/PIG-3237
PIG-3223    AvroStorage does not handle comma separated input paths
            https://issues.apache.org/jira/browse/PIG-3223
PIG-3215    [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
            https://issues.apache.org/jira/browse/PIG-3215
PIG-3210    Pig fails to start when it cannot write log to log files
            https://issues.apache.org/jira/browse/PIG-3210
PIG-3198    Let users use any function from PigType -> PigType as if it were builtlin
            https://issues.apache.org/jira/browse/PIG-3198
PIG-3193    Fix "ant docs" warnings
            https://issues.apache.org/jira/browse/PIG-3193
PIG-3190    Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
            https://issues.apache.org/jira/browse/PIG-3190
PIG-3183    rm or rmf commands should respect globbing/regex of path
            https://issues.apache.org/jira/browse/PIG-3183
PIG-3173    Partition filter push down does not happen partition keys condition include a AND and OR construct
            https://issues.apache.org/jira/browse/PIG-3173
PIG-3166    Update eclipse .classpath according to ivy library.properties
            https://issues.apache.org/jira/browse/PIG-3166
PIG-3164    Pig current releases lack a UDF endsWith. This UDF tests if a given string ends with the specified suffix.
            https://issues.apache.org/jira/browse/PIG-3164
PIG-3141    Giving CSVExcelStorage an option to handle header rows
            https://issues.apache.org/jira/browse/PIG-3141
PIG-3123    Simplify Logical Plans By Removing Unneccessary Identity Projections
            https://issues.apache.org/jira/browse/PIG-3123
PIG-3122    Operators should not implicitly become reserved keywords
            https://issues.apache.org/jira/browse/PIG-3122
PIG-3114    Duplicated macro name error when using pigunit
            https://issues.apache.org/jira/browse/PIG-3114
PIG-3105    Fix TestJobSubmission unit test failure.
            https://issues.apache.org/jira/browse/PIG-3105
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2873    Converting bin/pig shell script to python
            https://issues.apache.org/jira/browse/PIG-2873
PIG-2643    Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc
            https://issues.apache.org/jira/browse/PIG-2643
PIG-2641    Create toJSON function for all complex types: tuples, bags and maps
            https://issues.apache.org/jira/browse/PIG-2641
PIG-2591    Unit tests should not write to /tmp but respect java.io.tmpdir
            https://issues.apache.org/jira/browse/PIG-2591
PIG-1914    Support load/store JSON data in Pig
            https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611493#comment-13611493 ]

Daniel Dai commented on PIG-2586:
---------------------------------

Probably not for end user, but for us to figure out what's wrong with a script, this is useful.

> A better plan/data flow visualizer
> ----------------------------------
>
> Key: PIG-2586
> URL: https://issues.apache.org/jira/browse/PIG-2586
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Daniel Dai
> Labels: gsoc2013
>
> Pig supports a dot graph style plan to visualize the logical/physical/mapreduce plan (explain with -dot option, see http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). However, dot graph takes extra step to generate the plan graph and the quality of the output is not good. It's better we can implement a better visualizer for Pig. It should:
> 1. show operator type and alias
> 2. turn on/off output schema
> 3. dive into foreach inner plan on demand
> 4. provide a way to show operator source code, eg, tooltip of an operator (plan don't currently have this information, but you can assume this is in place)
> 5. besides visualize logical/physical/mapreduce plan, visualize the script itself is also useful
> 6. may rely on some java graphic library such as Swing
> This is a candidate project for Google summer of code 2013. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2013
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611490#comment-13611490 ]

Dmitriy V. Ryaboy commented on PIG-2586:
----------------------------------------

Hm I guess we can add logical plan if we want -- just need to feed it to the PPNL somehow. Ambrose is pretty separate from Pig specifics, if you give it a dag, it'll draw it.

Do people use the logical plan to diagnose issues? I don't think I have had to do that yet.
[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths
[ https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang updated PIG-3223:
------------------------------

Status: Patch Available (was: Open)
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611481#comment-13611481 ]

Daniel Dai commented on PIG-2586:
---------------------------------

But no logical plan, right?
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611478#comment-13611478 ]

Dmitriy V. Ryaboy commented on PIG-2586:
----------------------------------------

It does with the linked patch (it also visualizes the MR plan, without details of what's happening inside the map or reduce stage, without the patch).
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611477#comment-13611477 ]

Daniel Dai commented on PIG-2586:
---------------------------------

The goal is to visualize the plan (logical/mapreduce plan) rather than jobs. Does Ambrose have that?
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611472#comment-13611472 ]

Dmitriy V. Ryaboy commented on PIG-2586:
----------------------------------------

Do we need this given Ambrose (and from what I hear, Ambari)? What is the difference between what this proposes and what Ambrose does?
https://github.com/twitter/ambrose

There is an Ambrose patch to add inner plans, too:
https://github.com/twitter/ambrose/issues/62
[jira] [Commented] (PIG-3258) Patch to allow MultiStorage to use more than one index to generate output tree
[ https://issues.apache.org/jira/browse/PIG-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611467#comment-13611467 ]

Dmitriy V. Ryaboy commented on PIG-3258:
----------------------------------------

Please generate the patch against the project root.

> Patch to allow MultiStorage to use more than one index to generate output tree
> ------------------------------------------------------------------------------
>
> Key: PIG-3258
> URL: https://issues.apache.org/jira/browse/PIG-3258
> Project: Pig
> Issue Type: Improvement
> Components: piggybank
> Reporter: Joel Fouse
> Priority: Minor
> Labels: piggybank
>
> I have made a patch to enable MultiStorage to handle multiple tuple indexes, rather than only one, for generating the output directory structure. Before I submit it, though, I need to know if I should generate the patch from /contrib/piggybank/java where I've been compiling and unit testing, or back at the project root.
[jira] [Updated] (PIG-2873) Converting bin/pig shell script to python
[ https://issues.apache.org/jira/browse/PIG-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikram Dixit K updated PIG-2873:
-------------------------------

Status: Patch Available (was: Open)

> Converting bin/pig shell script to python
> -----------------------------------------
>
> Key: PIG-2873
> URL: https://issues.apache.org/jira/browse/PIG-2873
> Project: Pig
> Issue Type: Bug
> Components: tools
> Affects Versions: 0.10.0
> Reporter: Vikram Dixit K
> Assignee: Vikram Dixit K
> Priority: Minor
> Attachments: PIG-2873_2.patch, PIG-2873_3.patch, PIG-2873_4.patch, PIG-2873.patch
>
> Converted the shell script in a platform independent way in python. Should work with version 2.7.x
[jira] [Updated] (PIG-2873) Converting bin/pig shell script to python
[ https://issues.apache.org/jira/browse/PIG-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikram Dixit K updated PIG-2873:
-------------------------------

Attachment: PIG-2873_4.patch

Integrated the python script with the e2e tests. While running the test-e2e target we can use the python script to run the tests by using the flag
{noformat}
-Dharness.use.python=true
e.g.
ant -Dharness.old.pig=/grid/0/pig/old_pig/ -Dharness.cluster.conf=/usr/lib/hadoop/conf/ -Dharness.cluster.bin=/usr/lib/hadoop/bin/hadoop -Dharness.use.python=true test-e2e
{noformat}
[jira] [Created] (PIG-3258) Patch to allow MultiStorage to use more than one index to generate output tree
Joel Fouse created PIG-3258:
----------------------------

Summary: Patch to allow MultiStorage to use more than one index to generate output tree
Key: PIG-3258
URL: https://issues.apache.org/jira/browse/PIG-3258
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joel Fouse
Priority: Minor

I have made a patch to enable MultiStorage to handle multiple tuple indexes, rather than only one, for generating the output directory structure. Before I submit it, though, I need to know if I should generate the patch from /contrib/piggybank/java where I've been compiling and unit testing, or back at the project root.
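
For readers unfamiliar with the idea, the following is only a rough sketch of what "more than one index to generate output tree" could mean: each selected tuple field contributes one directory level. The class OutputTreeSketch, the method outputDir, and the sample values are hypothetical; this is not MultiStorage's code and not the patch discussed in this issue.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not MultiStorage and not the proposed patch. */
final class OutputTreeSketch {

    private OutputTreeSketch() {}

    /**
     * Builds an output directory by appending one path segment per selected
     * tuple field, in the order the indexes are given.
     */
    static String outputDir(String root, List<Object> tuple, int... indexes) {
        StringBuilder path = new StringBuilder(root);
        for (int index : indexes) {
            path.append('/').append(String.valueOf(tuple.get(index)));
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // A made-up tuple: (country, date, count).
        List<Object> tuple = Arrays.<Object>asList("us", "2013-03-21", 42);
        // One index gives a single level; two indexes give a two-level tree.
        System.out.println(outputDir("/out", tuple, 0));
        System.out.println(outputDir("/out", tuple, 0, 1));
    }
}
```

With a single index this degenerates to one directory level per key, which is the behaviour the issue describes as the current one.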
[jira] [Commented] (PIG-3170) Pig keeps static references to Hadoop's Context after end of task
[ https://issues.apache.org/jira/browse/PIG-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611043#comment-13611043 ]

Johnny Zhang commented on PIG-3170:
-----------------------------------

The patch 'PIG-3170.patch.txt' introduces many regressions in the unit tests; we may have to look further into the issue itself. Thanks.

> Pig keeps static references to Hadoop's Context after end of task
> -----------------------------------------------------------------
>
> Key: PIG-3170
> URL: https://issues.apache.org/jira/browse/PIG-3170
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.10.0
> Reporter: Clément Stenac
> Priority: Minor
> Attachments: PIG-3170.patch.txt, pig-staticreferences-to-context.diff
>
> Through the PigStatusReporter, and the ProgressableReporter, when a Pig MR task is done, static references are kept to Hadoop's Context object.
> Additionally, the PigCombiner also keeps a static reference, apparently without using it.
> When the JVM is reused between MR tasks, it can cause large memory overconsumption, with a peak during the creation of the next task, because while MR is creating the next task (in MapTask. for example), we have both contexts (with their associated buffers) allocated at once.
> This problem is especially important when using a Combiner, because the ReduceContext of a Combiner contains references to large sort buffers.
> The specifics of our case were:
> * 20 GB input data, divided in 85 map tasks
> * Very simple Pig script: LOAD A, FILTER A, GROUP A, FOREACH group generate MAX(field), STORE
> * MapR distribution, which automatically computes Xmx for mappers at 800MB
> * At the end of the first task, the ReduceContext contains more than 400MB of byte[]
> * Systematic OOM in MapTask. on subsequent VM reuse
> * At least -Xmx1200m was required to get the job to complete
> * With attached patch, -Xmx600m is enough
> While a workaround by increasing Xmx is possible, I think the large overconsumption and the complexity of debugging the issue (because the OOM actually happens at the very beginning of the task, before the first byte of data has been processed) warrants fixing it.
> The attached patch makes sure that PigStatusReporter and ProgressableReporter drop their reference to the Context in the cleanup phases of the task.
> No new test is included because I don't really think it's possible to write a unit test, the issue being not "binary"
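
The pattern the issue describes (a statically reachable reporter that holds a task context, and a cleanup step that drops it) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions: ReporterSketch, setContext, clearContext, and hasContext are hypothetical names, a plain Object stands in for Hadoop's Context, and none of this is taken from PigStatusReporter or the attached patch.

```java
/** Hypothetical stand-in for a reporter singleton; not Pig's PigStatusReporter. */
final class ReporterSketch {

    // The singleton lives for the whole JVM, so anything it references
    // stays reachable across tasks when the JVM is reused.
    private static final ReporterSketch INSTANCE = new ReporterSketch();

    // Stands in for the task context, which may reference large buffers.
    private Object context;

    private ReporterSketch() {}

    static ReporterSketch getInstance() {
        return INSTANCE;
    }

    void setContext(Object context) {
        this.context = context;
    }

    boolean hasContext() {
        return context != null;
    }

    /** Intended to be called from a task's cleanup phase so the old context can be collected. */
    void clearContext() {
        this.context = null;
    }

    public static void main(String[] args) {
        ReporterSketch reporter = ReporterSketch.getInstance();
        reporter.setContext(new byte[1024]); // pretend this is a task context
        System.out.println("during task: " + reporter.hasContext());
        reporter.clearContext();             // what a cleanup hook would do
        System.out.println("after cleanup: " + reporter.hasContext());
    }
}
```

Without the clearContext call, the previous task's context would remain reachable through the static instance while the next task allocates its own, which is the double allocation the report attributes the OOM to.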
[jira] [Assigned] (PIG-3250) Pig dryrun generates wrong output in .expanded file for 'SPLIT....OTHERWISE...' command
[ https://issues.apache.org/jira/browse/PIG-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang reassigned PIG-3250:
--------------------------------

Assignee: Johnny Zhang

> Pig dryrun generates wrong output in .expanded file for 'SPLIT....OTHERWISE...' command
> ---------------------------------------------------------------------------------------
>
> Key: PIG-3250
> URL: https://issues.apache.org/jira/browse/PIG-3250
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.12
> Reporter: Johnny Zhang
> Assignee: Johnny Zhang
>
> Steps to reproduce:
> 1. input file 'users'
> {noformat}
> 1
> 2
> 3
> 4
> 5
> {noformat}
> 2. pig script split.pig
> {noformat}
> define group_and_count (A,key) returns B {
>     SPLIT $A INTO $B IF $key<7, Y IF $key==5, Z OTHERWISE;
> }
> alpha = load '/var/lib/jenkins/users' as (f1:int);
> gamma = group_and_count (alpha, f1);
> store gamma into '/var/lib/jenkins/byuser';
> {noformat}
> 3. run command
> {noformat}
> pig -x local -r split.pig
> {noformat}
> 4. the content of split.pig.expanded
> {noformat}
> alpha = load '/var/lib/jenkins/users' as f1:int;
> SPLIT alpha INTO gamma IF f1 < 7, macro_group_and_count_Y_0 IF f1 == 5OTHERWISE macro_group_and_count_Z_0;
> store gamma INTO '/var/lib/jenkins/byuser';
> {noformat}
> The line "f1 == 5OTHERWISE macro_group_and_count_Z_0;" is wrong; it should be "f1 == 5, macro_group_and_count_Z_0 OTHERWISE"
[jira] [Commented] (PIG-3223) AvroStorage does not handle comma separated input paths
[ https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611027#comment-13611027 ]

Johnny Zhang commented on PIG-3223:
-----------------------------------

The latest patch makes AvroStorage work for comma separated input. The patch also adds two test cases for the inputs below:
{code}
final private String testCommaSeparated1 = getInputFile("test_dir1/test_glob1.avro,test_dir1/test_glob2.avro,test_dir1/test_glob3.avro");
final private String testCommaSeparated2 = getInputFile("test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro");
{code}
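
As an illustration of the general idea discussed in this issue (split the location string into individual entries first, then treat each entry as a plain path or a glob), here is a minimal sketch. It is not the code from the attached patch. The class CommaSeparatedPaths and its methods are hypothetical, and skipping commas inside {...} is an assumption made so that glob groups such as {a,b} are not broken apart. Glob expansion itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of AvroStorage or the attached patch. */
final class CommaSeparatedPaths {

    private CommaSeparatedPaths() {}

    /**
     * Splits a location string on commas that are outside {...} groups,
     * trims surrounding whitespace, and drops empty entries.
     */
    static List<String> split(String location) {
        List<String> paths = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int braceDepth = 0;
        for (char c : location.toCharArray()) {
            if (c == '{') {
                braceDepth++;
            } else if (c == '}' && braceDepth > 0) {
                braceDepth--;
            }
            if (c == ',' && braceDepth == 0) {
                addIfNotEmpty(paths, current);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addIfNotEmpty(paths, current);
        return paths;
    }

    private static void addIfNotEmpty(List<String> paths, StringBuilder buffer) {
        String path = buffer.toString().trim();
        if (!path.isEmpty()) {
            paths.add(path);
        }
    }

    public static void main(String[] args) {
        // The second test input quoted in the comment above.
        System.out.println(split("test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro"));
        // A made-up input with a glob group that must stay in one piece.
        System.out.println(split("data/{a,b}/*.avro,other.avro"));
    }
}
```

Each returned entry could then be passed on its own to whatever glob expansion the loader uses, so a single path string never contains a top-level comma.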
[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths
[ https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang updated PIG-3223:
------------------------------

Attachment: PIG-3223.patch.txt
[jira] [Updated] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-3257:
---------------------------

Status: Patch Available (was: Open)

A simple UDF that calls Java's UUID.randomUUID() function. I believe this could be done with a combination of the piggybank ToString function and using StringInvoker for UUID.randomUUID, but this seems like a useful and simple enough thing to just build in.

> Add unique identifier UDF
> -------------------------
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.12
> Attachments: PIG-3257.patch
>
> It would be good to have a Pig function to generate unique identifiers.
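
For reference, the core of such a function is small. The sketch below only shows the java.util.UUID call and deliberately leaves out the Pig UDF wrapper so that it compiles without Pig on the classpath. The class UniqueIdSketch and the method nextId are hypothetical names and are not taken from PIG-3257.patch.

```java
import java.util.UUID;

/** Hypothetical stand-in for the body of a unique-identifier UDF; not the attached patch. */
final class UniqueIdSketch {

    private UniqueIdSketch() {}

    /** Returns a random (version 4) UUID in its canonical 36-character string form. */
    static String nextId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Two calls are expected to yield different identifiers.
        System.out.println(nextId());
        System.out.println(nextId());
    }
}
```

Because the value is random rather than derived from the input, re-running the same script would presumably assign different identifiers to the same rows, which may matter for jobs that are retried.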
[jira] [Updated] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-3257:
---------------------------

Attachment: PIG-3257.patch
[jira] [Created] (PIG-3257) Add unique identifier UDF
Alan Gates created PIG-3257:
----------------------------

Summary: Add unique identifier UDF
Key: PIG-3257
URL: https://issues.apache.org/jira/browse/PIG-3257
Project: Pig
Issue Type: Improvement
Components: internal-udfs
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: 0.12

It would be good to have a Pig function to generate unique identifiers.
Re: Put a "Google summer of code 2013" cwiki page
This is a little different than how we've done such things before, but how about a project to get Pig to run on Spark (aka, Spork)?

The Twitter pig folks have some code we'd love to share that got us half-way there, it was looking pretty promising (if anyone is curious, it's the "spork" branch on my github fork of pig: https://github.com/dvryaboy/pig )

D

On Thu, Mar 21, 2013 at 2:05 PM, Prasanth J wrote:

> One more idea for GSoC project.
>
> YSmart uses correlation between multiple MR jobs to reduce the number of MR jobs generated. I remember Dmitriy bringing this up early. The techniques specified in this paper (Input, Job Flow, Transit correlations) has been patched into Hive. If Pig doesn't use these optimizations then I think it will be good to have them in Pig as well.
>
> Here is the link to the paper
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
>
> I think this can be a good candidate project for GSoC.
>
> Thanks
> -- Prasanth
>
> On Mar 21, 2013, at 3:51 PM, Olga Natkovich wrote:
>
> > +1 on that
> >
> > From: Russell Jurney
> > To: "dev@pig.apache.org"
> > Sent: Thursday, March 21, 2013 11:54 AM
> > Subject: Re: Put a "Google summer of code 2013" cwiki page
> >
> > Make Grunt use Antlr - high priority one for me. Once Grunt uses Antlr, macros will flourish.
> >
> > On Wed, Mar 20, 2013 at 6:25 PM, Daniel Dai wrote:
> >
> >> https://cwiki.apache.org/confluence/display/PIG/GSoc2013
> >>
> >> Feel free to add more project which could fit in the timeline of a student summer project.
> >>
> >> I remember there are several projects we discussed in our last meetup:
> >> * Allow Pig use Hive UDFs, Alan, do we have a ticket for that?
> >> * A general framework for Pig performance test, Rohini, do we have a ticket?
> >>
> >> Thanks,
> >> Daniel
> >
> > --
> > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
[VOTE] Release Pig 0.11.1 (candidate 0)
Hi,

I have created a candidate build for Pig 0.11.1. This is a maintenance release of Pig 0.11.

Keys used to sign the release are available at:
http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup

Please download, test, and try it out:
http://people.apache.org/~billgraham/pig-0.11.1-candidate-0/

Should we release this? Vote closes on next Thursday EOD, Mar 28th.

Thanks,
Bill