[jira] [Commented] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-03-22 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611505#comment-13611505
 ] 

Prashant Kommireddi commented on PIG-3223:
--

Thanks [~dreambird]. 

I have a question regarding the current approach - why isn't the globbing 
implemented in PigAvroInputFormat? Overriding listStatus(JobContext job) should 
be cleaner, unless I am missing something very specific to Avro?

> AvroStorage does not handle comma separated input paths
> ---
>
> Key: PIG-3223
> URL: https://issues.apache.org/jira/browse/PIG-3223
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0, 0.11
>Reporter: Michael Kramer
>Assignee: Johnny Zhang
> Attachments: AvroStorage.patch, AvroStorage.patch-2, 
> AvroStorageUtils.patch, AvroStorageUtils.patch-2, PIG-3223.patch.txt, 
> PIG-3223.patch.txt
>
>
> In pig 0.11, a patch was issued to AvroStorage to support globs and comma 
> separated input paths (PIG-2492).  While this function works fine for 
> glob-formatted input paths, it fails when issued a standard comma separated 
> list of paths.  fs.globStatus does not seem to be able to parse out such a 
> list, and a java.net.URISyntaxException is thrown when toURI is called on the 
> path.  
> I have a working fix for this, but it's extremely ugly (basically checking if 
> the string of input paths is globbed, otherwise splitting on ",").  I'm sure 
> there's a more elegant solution.  I'd be happy to post the relevant methods 
> and "fixes" if necessary.
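The "split on comma" workaround described above has one subtlety worth illustrating: Hadoop glob patterns may themselves contain commas inside a `{a,b}` alternation, so a naive `String.split(",")` can cut a pattern in half. The following is only a minimal sketch using the Java standard library, not the code from any attached patch; the class name `CommaPathSplitter`, its method names, and the choice to trim whitespace around each path are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Pig or piggybank): splits a
// comma-separated list of input paths, ignoring commas that sit inside
// a {a,b} glob alternation.
public class CommaPathSplitter {

    public static List<String> split(String commaSeparatedPaths) {
        List<String> paths = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int braceDepth = 0; // > 0 while inside a {...} alternation

        for (int i = 0; i < commaSeparatedPaths.length(); i++) {
            char c = commaSeparatedPaths.charAt(i);
            if (c == '{') {
                braceDepth++;
            } else if (c == '}' && braceDepth > 0) {
                braceDepth--;
            }
            if (c == ',' && braceDepth == 0) {
                // A top-level comma ends the current path.
                addIfNotEmpty(paths, current);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addIfNotEmpty(paths, current);
        return paths;
    }

    // Assumption: surrounding whitespace is not significant, so it is trimmed.
    private static void addIfNotEmpty(List<String> paths, StringBuilder sb) {
        String path = sb.toString().trim();
        if (!path.isEmpty()) {
            paths.add(path);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("dir1/a.avro,dir1/b.avro"));
        System.out.println(split("logs/{2013,2014}/*.avro, other/part.avro"));
    }
}
```

Each resulting entry could then be globbed on its own (for example with fs.globStatus), which avoids handing the whole comma-separated string to a single glob call.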

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Subscription: PIG patch available

2013-03-22 Thread jira
Issue Subscription
Filter: PIG patch available (34 issues)

Subscriber: pigdaily

Key       Summary
PIG-3257  Add unique identifier UDF
https://issues.apache.org/jira/browse/PIG-3257
PIG-3247  Piggybank functions to mimic OVER clause in SQL
https://issues.apache.org/jira/browse/PIG-3247
PIG-3238  Pig current releases lack a UDF Stuff(). This UDF deletes a specified length of characters and inserts another set of characters at a specified starting point.
https://issues.apache.org/jira/browse/PIG-3238
PIG-3237  Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a string containing substrings separated by "," characters) consisting of the strings that have the corresponding bit in the first argument
https://issues.apache.org/jira/browse/PIG-3237
PIG-3223  AvroStorage does not handle comma separated input paths
https://issues.apache.org/jira/browse/PIG-3223
PIG-3215  [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
https://issues.apache.org/jira/browse/PIG-3215
PIG-3210  Pig fails to start when it cannot write log to log files
https://issues.apache.org/jira/browse/PIG-3210
PIG-3198  Let users use any function from PigType -> PigType as if it were builtlin
https://issues.apache.org/jira/browse/PIG-3198
PIG-3193  Fix "ant docs" warnings
https://issues.apache.org/jira/browse/PIG-3193
PIG-3190  Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
https://issues.apache.org/jira/browse/PIG-3190
PIG-3183  rm or rmf commands should respect globbing/regex of path
https://issues.apache.org/jira/browse/PIG-3183
PIG-3173  Partition filter push down does not happen partition keys condition include a AND and OR construct
https://issues.apache.org/jira/browse/PIG-3173
PIG-3166  Update eclipse .classpath according to ivy library.properties
https://issues.apache.org/jira/browse/PIG-3166
PIG-3164  Pig current releases lack a UDF endsWith.This UDF tests if a given string ends with the specified suffix.
https://issues.apache.org/jira/browse/PIG-3164
PIG-3141  Giving CSVExcelStorage an option to handle header rows
https://issues.apache.org/jira/browse/PIG-3141
PIG-3123  Simplify Logical Plans By Removing Unneccessary Identity Projections
https://issues.apache.org/jira/browse/PIG-3123
PIG-3122  Operators should not implicitly become reserved keywords
https://issues.apache.org/jira/browse/PIG-3122
PIG-3114  Duplicated macro name error when using pigunit
https://issues.apache.org/jira/browse/PIG-3114
PIG-3105  Fix TestJobSubmission unit test failure.
https://issues.apache.org/jira/browse/PIG-3105
PIG-3088  Add a builtin udf which removes prefixes
https://issues.apache.org/jira/browse/PIG-3088
PIG-3069  Native Windows Compatibility for Pig E2E Tests and Harness
https://issues.apache.org/jira/browse/PIG-3069
PIG-3028  testGrunt dev test needs some command filters to run correctly without cygwin
https://issues.apache.org/jira/browse/PIG-3028
PIG-3027  pigTest unit test needs a newline filter for comparisons of golden multi-line
https://issues.apache.org/jira/browse/PIG-3027
PIG-3026  Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
https://issues.apache.org/jira/browse/PIG-3026
PIG-3024  TestEmptyInputDir unit test - hadoop version detection logic is brittle
https://issues.apache.org/jira/browse/PIG-3024
PIG-3015  Rewrite of AvroStorage
https://issues.apache.org/jira/browse/PIG-3015
PIG-3010  Allow UDF's to flatten themselves
https://issues.apache.org/jira/browse/PIG-3010
PIG-2959  Add a pig.cmd for Pig to run under Windows
https://issues.apache.org/jira/browse/PIG-2959
PIG-2955  Fix bunch of Pig e2e tests on Windows
https://issues.apache.org/jira/browse/PIG-2955
PIG-2873  Converting bin/pig shell script to python
https://issues.apache.org/jira/browse/PIG-2873
PIG-2643  Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc
https://issues.apache.org/jira/browse/PIG-2643
PIG-2641  Create toJSON function for all complex types: tuples, bags and maps
https://issues.apache.org/jira/browse/PIG-2641
PIG-2591  Unit tests should not write to /tmp but respect java.io.tmpdir
https://issues.apache.org/jira/browse/PIG-2591
PIG-1914  Support load/store JSON data in Pig
https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Commented] (PIG-2586) A better plan/data flow visualizer

2013-03-22 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611493#comment-13611493
 ] 

Daniel Dai commented on PIG-2586:
-

Probably not for end users, but it is useful for us when figuring out what's 
wrong with a script.

> A better plan/data flow visualizer
> --
>
> Key: PIG-2586
> URL: https://issues.apache.org/jira/browse/PIG-2586
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>  Labels: gsoc2013
>
> Pig supports a dot-graph-style plan to visualize the 
> logical/physical/mapreduce plan (explain with the -dot option, see 
> http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). 
> However, the dot graph takes an extra step to generate, and the quality of 
> the output is not good. It would be better to implement a dedicated 
> visualizer for Pig. It should:
> 1. show operator type and alias
> 2. turn the output schema on/off
> 3. dive into the foreach inner plan on demand
> 4. provide a way to show operator source code, e.g., as a tooltip on an 
> operator (the plan doesn't currently have this information, but you can 
> assume it is in place)
> 5. visualize the script itself, in addition to the 
> logical/physical/mapreduce plan
> 6. may rely on a Java graphics library such as Swing
> This is a candidate project for Google Summer of Code 2013. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2013



[jira] [Commented] (PIG-2586) A better plan/data flow visualizer

2013-03-22 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611490#comment-13611490
 ] 

Dmitriy V. Ryaboy commented on PIG-2586:


Hm, I guess we can add the logical plan if we want; we just need to feed it to 
the PPNL (PigProgressNotificationListener) somehow. Ambrose is pretty separate 
from Pig specifics: if you give it a DAG, it'll draw it.

Do people use the logical plan to diagnose issues? I don't think I have had to 
do that yet.



[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-03-22 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang updated PIG-3223:
--

Status: Patch Available  (was: Open)



[jira] [Commented] (PIG-2586) A better plan/data flow visualizer

2013-03-22 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611481#comment-13611481
 ] 

Daniel Dai commented on PIG-2586:
-

But no logical plan, right?



[jira] [Commented] (PIG-2586) A better plan/data flow visualizer

2013-03-22 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611478#comment-13611478
 ] 

Dmitriy V. Ryaboy commented on PIG-2586:


It does with the linked patch (it also visualizes the MR plan, without details 
of what's happening inside the map or reduce stage, without the patch).



[jira] [Commented] (PIG-2586) A better plan/data flow visualizer

2013-03-22 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611477#comment-13611477
 ] 

Daniel Dai commented on PIG-2586:
-

The goal is to visualize the plan (logical/mapreduce) rather than the jobs. 
Does Ambrose have that?



[jira] [Commented] (PIG-2586) A better plan/data flow visualizer

2013-03-22 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611472#comment-13611472
 ] 

Dmitriy V. Ryaboy commented on PIG-2586:


Do we need this given Ambrose (and from what I hear, Ambari)?

What is the difference between what this proposes and what Ambrose does?

https://github.com/twitter/ambrose

There is an Ambrose patch to add inner plans, too:
https://github.com/twitter/ambrose/issues/62



[jira] [Commented] (PIG-3258) Patch to allow MultiStorage to use more than one index to generate output tree

2013-03-22 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611467#comment-13611467
 ] 

Dmitriy V. Ryaboy commented on PIG-3258:


Please generate the patch against the project root.

> Patch to allow MultiStorage to use more than one index to generate output tree
> --
>
> Key: PIG-3258
> URL: https://issues.apache.org/jira/browse/PIG-3258
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joel Fouse
>Priority: Minor
>  Labels: piggybank
>
> I have made a patch to enable MultiStorage to handle multiple tuple indexes, 
> rather than only one, for generating the output directory structure.  Before 
> I submit it, though, I need to know if I should generate the patch from 
> /contrib/piggybank/java where I've been compiling and unit testing, or back 
> at the project root.



[jira] [Updated] (PIG-2873) Converting bin/pig shell script to python

2013-03-22 Thread Vikram Dixit K (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikram Dixit K updated PIG-2873:


Status: Patch Available  (was: Open)

> Converting bin/pig shell script to python
> -
>
> Key: PIG-2873
> URL: https://issues.apache.org/jira/browse/PIG-2873
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.10.0
>Reporter: Vikram Dixit K
>Assignee: Vikram Dixit K
>Priority: Minor
> Attachments: PIG-2873_2.patch, PIG-2873_3.patch, PIG-2873_4.patch, 
> PIG-2873.patch
>
>
> Converted the shell script to Python in a platform-independent way. Should 
> work with Python 2.7.x



[jira] [Updated] (PIG-2873) Converting bin/pig shell script to python

2013-03-22 Thread Vikram Dixit K (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikram Dixit K updated PIG-2873:


Attachment: PIG-2873_4.patch

Integrated the Python script with the e2e tests. When running the test-e2e 
target, we can use the Python script to run the tests by passing the flag

{noformat}
-Dharness.use.python=true

e.g. ant -Dharness.old.pig=/grid/0/pig/old_pig/ -Dharness.cluster.conf=/usr/lib/hadoop/conf/ -Dharness.cluster.bin=/usr/lib/hadoop/bin/hadoop -Dharness.use.python=true test-e2e

{noformat}





[jira] [Created] (PIG-3258) Patch to allow MultiStorage to use more than one index to generate output tree

2013-03-22 Thread Joel Fouse (JIRA)
Joel Fouse created PIG-3258:
---

 Summary: Patch to allow MultiStorage to use more than one index to 
generate output tree
 Key: PIG-3258
 URL: https://issues.apache.org/jira/browse/PIG-3258
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joel Fouse
Priority: Minor


I have made a patch to enable MultiStorage to handle multiple tuple indexes, 
rather than only one, for generating the output directory structure.  Before I 
submit it, though, I need to know if I should generate the patch from 
/contrib/piggybank/java where I've been compiling and unit testing, or back at 
the project root.



[jira] [Commented] (PIG-3170) Pig keeps static references to Hadoop's Context after end of task

2013-03-22 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611043#comment-13611043
 ] 

Johnny Zhang commented on PIG-3170:
---

The patch 'PIG-3170.patch.txt' causes many regressions in the unit tests; we 
may have to look further into the issue itself. Thanks.

> Pig keeps static references to Hadoop's Context after end of task
> -
>
> Key: PIG-3170
> URL: https://issues.apache.org/jira/browse/PIG-3170
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.10.0
>Reporter: Clément Stenac
>Priority: Minor
> Attachments: PIG-3170.patch.txt, pig-staticreferences-to-context.diff
>
>
> Through the PigStatusReporter, and the ProgressableReporter, when a Pig MR 
> task is done, static references are kept to Hadoop's Context object.
> Additionally, the PigCombiner also keeps a static reference, apparently 
> without using it.
> When the JVM is reused between MR tasks, it can cause large memory 
> overconsumption, with a peak during the creation of the next task, because 
> while MR is creating the next task (in MapTask. for example), we have 
> both contexts (with  their associated buffers) allocated at once.
> This problem is especially important when using a Combiner, because the 
> ReduceContext of a Combiner contains references to large sort buffers.
> The specifics of our case were:
> * 20 GB input data, divided in 85 map tasks
> * Very simple Pig script: LOAD A, FILTER A, GROUP A, FOREACH group generate 
> MAX(field), STORE  
> * MapR distribution, which automatically computes Xmx for mappers at 800MB
> * At the end of the first task, the ReduceContext contains more than 400MB of 
> byte[]
> * Systematic OOM in MapTask. on subsequent VM reuse
> * At least -Xmx1200m was required to get the job to complete
> * With attached patch, -Xmx600m is enough
> While a workaround by increasing Xmx is possible, I think the large 
> overconsumption and the complexity of debugging the issue (because the OOM 
> actually happens at the very beginning of the task, before the first byte of 
> data has been processed) warrants fixing it.
> The attached patch makes sure that PigStatusReporter and ProgressableReporter 
> drop their reference to the Context in the cleanup phases of the task.
> No new test is included because I don't really think it's possible to write a 
> unit test, the issue being not "binary"
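The mechanism in the description above (a static holder that keeps the finished task's context reachable until the next task overwrites it) can be illustrated with a small stand-alone sketch. This is not the attached patch and does not use Hadoop or Pig classes: `TaskContext`, `StatusReporter`, and `runTask` are hypothetical stand-ins, and the only point shown is clearing the reference in a cleanup step so the large buffer becomes collectable before the next task allocates its own.

```java
// Illustrative sketch only; all types here are hypothetical stand-ins.
public class StaticContextSketch {

    // Stand-in for a task context that owns a large sort buffer.
    public static final class TaskContext {
        final byte[] sortBuffer;

        TaskContext(int bufferBytes) {
            this.sortBuffer = new byte[bufferBytes];
        }
    }

    // Stand-in for a singleton reporter that holds the current context.
    public static final class StatusReporter {
        private static final StatusReporter INSTANCE = new StatusReporter();
        private TaskContext context;

        private StatusReporter() {
        }

        public static StatusReporter getInstance() {
            return INSTANCE;
        }

        public void setContext(TaskContext context) {
            this.context = context;
        }

        public boolean holdsContext() {
            return context != null;
        }
    }

    public static void runTask(int bufferBytes) {
        StatusReporter reporter = StatusReporter.getInstance();
        reporter.setContext(new TaskContext(bufferBytes));
        try {
            // ... the task's real work would happen here ...
        } finally {
            // Cleanup: drop the reference so the buffer is no longer
            // reachable through the static singleton when the JVM is reused.
            reporter.setContext(null);
        }
    }

    public static void main(String[] args) {
        runTask(1 << 20);
        System.out.println("holds context after task: "
                + StatusReporter.getInstance().holdsContext());
    }
}
```

Without the `finally` block, the singleton would still reference the previous context while the next task is being set up, which is the double-allocation window the report describes.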



[jira] [Assigned] (PIG-3250) Pig dryrun generates wrong output in .expanded file for 'SPLIT....OTHERWISE...' command

2013-03-22 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang reassigned PIG-3250:
-

Assignee: Johnny Zhang

> Pig dryrun generates wrong output in .expanded file for 
> 'SPLIT....OTHERWISE...' command
> ---
>
> Key: PIG-3250
> URL: https://issues.apache.org/jira/browse/PIG-3250
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12
>Reporter: Johnny Zhang
>Assignee: Johnny Zhang
>
> Steps to reproduce:
> 1. input file 'users'
> {noformat}
> 1
> 2
> 3
> 4
> 5
> {noformat}
> 2. pig script split.pig
> {noformat}
> define group_and_count (A,key) returns B {
> SPLIT $A INTO $B IF $key<7, Y IF $key==5, Z OTHERWISE;
> }
> alpha = load '/var/lib/jenkins/users' as (f1:int);
> gamma = group_and_count (alpha, f1);
> store gamma into '/var/lib/jenkins/byuser';
> {noformat}
> 3. run command
> {noformat}
> pig -x local -r split.pig
> {noformat}
> 4. the content of split.pig.expanded
> {noformat}
> alpha = load '/var/lib/jenkins/users' as f1:int;
> SPLIT alpha INTO gamma IF f1 < 7, macro_group_and_count_Y_0 IF f1 == 
> 5OTHERWISE macro_group_and_count_Z_0;
> store gamma INTO '/var/lib/jenkins/byuser';
> {noformat}
> the line "f1 == 5OTHERWISE macro_group_and_count_Z_0;" is wrong, it 
> should be "f1 == 5, macro_group_and_count_Z_0 OTHERWISE"
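The expected output quoted above follows a simple rule: every branch, including the OTHERWISE branch, is separated by ", ", and the OTHERWISE keyword comes after its alias. A minimal sketch of that rendering rule is below. It is not Pig's macro-expansion code; `SplitStatementRenderer` and its parameters are hypothetical names used only to show the formatting.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical renderer (not Pig's implementation) for a SPLIT statement.
public class SplitStatementRenderer {

    // branches maps each output alias to its condition, in declaration order.
    // otherwiseAlias may be null when there is no OTHERWISE branch.
    public static String render(String inputAlias,
                                Map<String, String> branches,
                                String otherwiseAlias) {
        StringJoiner parts = new StringJoiner(", ");
        for (Map.Entry<String, String> branch : branches.entrySet()) {
            parts.add(branch.getKey() + " IF " + branch.getValue());
        }
        if (otherwiseAlias != null) {
            // The alias precedes the keyword, and the joiner supplies the comma.
            parts.add(otherwiseAlias + " OTHERWISE");
        }
        return "SPLIT " + inputAlias + " INTO " + parts + ";";
    }

    public static void main(String[] args) {
        Map<String, String> branches = new LinkedHashMap<>();
        branches.put("gamma", "f1 < 7");
        branches.put("macro_group_and_count_Y_0", "f1 == 5");
        System.out.println(render("alpha", branches, "macro_group_and_count_Z_0"));
    }
}
```

Treating the OTHERWISE branch as one more element of the joined list, rather than appending it by hand, makes the missing-comma form shown in the bug report impossible to produce.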



[jira] [Commented] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-03-22 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611027#comment-13611027
 ] 

Johnny Zhang commented on PIG-3223:
---

The latest patch makes AvroStorage work for comma-separated input. The patch 
also adds two test cases for the inputs below:
{code}
private final String testCommaSeparated1 = getInputFile("test_dir1/test_glob1.avro,test_dir1/test_glob2.avro,test_dir1/test_glob3.avro");
private final String testCommaSeparated2 = getInputFile("test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro");
{code}



[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-03-22 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang updated PIG-3223:
--

Attachment: PIG-3223.patch.txt



[jira] [Updated] (PIG-3257) Add unique identifier UDF

2013-03-22 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3257:


Status: Patch Available  (was: Open)

A simple UDF that calls Java's UUID.randomUUID() function.  I believe this 
could be done with a combination of the piggybank ToString function and using 
StringInvoker for UUID.randomUUID, but this seems like a useful and simple 
enough thing to just build in.
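
As a rough sketch of what such a UDF boils down to, the core is a single call to java.util.UUID. The class and method names below are hypothetical and are not taken from PIG-3257.patch; in Pig the same call would presumably sit inside the exec method of an EvalFunc<String>, which is omitted here so the example runs without Pig on the classpath.

```java
import java.util.UUID;

// Hypothetical stand-in for the body of the UDF; not taken from PIG-3257.patch.
final class UniqueIdSketch {

    // Returns a random (version 4) UUID rendered as a 36-character string.
    static String newUniqueId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Each call prints a different identifier.
        System.out.println(newUniqueId());
    }
}
```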

> Add unique identifier UDF
> -
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.12
>
> Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.



[jira] [Updated] (PIG-3257) Add unique identifier UDF

2013-03-22 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3257:


Attachment: PIG-3257.patch

> Add unique identifier UDF
> -
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.12
>
> Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.



[jira] [Created] (PIG-3257) Add unique identifier UDF

2013-03-22 Thread Alan Gates (JIRA)

Alan Gates created PIG-3257:
---

 Summary: Add unique identifier UDF
 Key: PIG-3257
 URL: https://issues.apache.org/jira/browse/PIG-3257
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12


It would be good to have a Pig function to generate unique identifiers.



Re: Put a "Google summer of code 2013" cwiki page

2013-03-22 Thread Dmitriy Ryaboy
This is a little different from how we've done such things before, but how
about a project to get Pig to run on Spark (aka Spork)? The Twitter Pig
folks have some code we'd love to share that got us halfway there, and it was
looking pretty promising (if anyone is curious, it's the "spork" branch on
my GitHub fork of Pig: https://github.com/dvryaboy/pig ).

D

On Thu, Mar 21, 2013 at 2:05 PM, Prasanth J wrote:

> One more idea for GSoC project.
>
> YSmart uses correlations between multiple MR jobs to reduce the number of
> MR jobs generated. I remember Dmitriy bringing this up earlier. The
> techniques specified in this paper (Input, Job Flow, Transit correlations)
> have been patched into Hive. If Pig doesn't use these optimizations, then I
> think it would be good to have them in Pig as well.
>
> Here is the link to the paper
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
>
> I think this can be a good candidate project for GSoC.
>
> Thanks
> -- Prasanth
>
> On Mar 21, 2013, at 3:51 PM, Olga Natkovich wrote:
>
> > +1 on that
> >
> >
> > 
> > From: Russell Jurney 
> > To: "dev@pig.apache.org" 
> > Sent: Thursday, March 21, 2013 11:54 AM
> > Subject: Re: Put a "Google summer of code 2013" cwiki page
> >
> > Make Grunt use Antlr - high priority one for me. Once Grunt uses Antlr,
> > macros will flourish.
> >
> >
> > On Wed, Mar 20, 2013 at 6:25 PM, Daniel Dai wrote:
> >
> >> https://cwiki.apache.org/confluence/display/PIG/GSoc2013
> >>
> >> Feel free to add more projects that could fit in the timeline of a
> >> student summer project.
> >>
> >> I remember there are several projects we discussed in our last meetup:
> >> * Allow Pig to use Hive UDFs. Alan, do we have a ticket for that?
> >> * A general framework for Pig performance tests. Rohini, do we have a
> >> ticket?
> >>
> >> Thanks,
> >> Daniel
> >>
> >
> >
> >
> > --
> > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>
>


[VOTE] Release Pig 0.11.1 (candidate 0)

2013-03-22 Thread Bill Graham
Hi,

I have created a candidate build for Pig 0.11.1. This is a maintenance release
of Pig 0.11.

Keys used to sign the release are available at:
http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup

Please download, test, and try it out:
http://people.apache.org/~billgraham/pig-0.11.1-candidate-0/

Should we release this? The vote closes next Thursday, Mar 28th, EOD.

Thanks,
Bill