[jira] [Commented] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended

2013-03-26 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614964#comment-13614964
 ] 

Harsh J commented on PIG-3261:
--

I agree on both points from my own experience, though others have probably 
seen more users than I have.

I'm not very active on the user lists either, but I have been a long-time 
subscriber, and searching them shows PIG_CLASSPATH only ever being used to add 
UDFs and libraries. Users with the other intention (i.e. those who want to 
override what Pig auto-discovers) therefore wouldn't mind this behavior change 
either. Of course, this data set does not represent all users :)

> User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not 
> appended
> ---
>
> Key: PIG-3261
> URL: https://issues.apache.org/jira/browse/PIG-3261
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.10.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: PIG-3261.patch, PIG-3261.patch
>
>
> Currently we are doing this wrong:
> {code}
> if [ "$PIG_CLASSPATH" != "" ]; then
> CLASSPATH=${CLASSPATH}:${PIG_CLASSPATH}
> {code}
> This means that anything added to CLASSPATH until that point will never be 
> able to get overridden by a user set environment, which is wrong behavior. 
> Hadoop libs for example are added to CLASSPATH, before this extension is 
> called in bin/pig.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions

2013-03-26 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614940#comment-13614940
 ] 

Prashant Kommireddi commented on PIG-3259:
--

{quote} By counting the number of times an exception has so far been thrown by 
.valueOf() {quote}
I see what you mean. That could be an approach, though the heuristic for 
determining the threshold could be tricky. 

{quote}I wonder if there are good libraries that we can use for the sanity 
checks, as the decimal check seems a bit more complicated{quote}
I will look into whether any such libraries are available. The javadoc you 
pointed to earlier describes a way to check for a Double, but it could be more 
expensive than we want: 
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#valueOf%28java.lang.String%29
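
For illustration, a cheap shape check along these lines could reject input such 
as 1234abcd before Double.valueOf(String) is ever called. This is only a sketch: 
the class DecimalShape and its methods are hypothetical names, not existing Pig 
code, and the check is deliberately stricter than Double.valueOf (no exponents, 
surrounding whitespace, NaN or hex forms), which is an assumption about what 
should count as a decimal number here.

```java
// Hypothetical helper, not part of Pig: a cheap pre-check for decimal-looking input.
final class DecimalShape {
    private DecimalShape() {}

    /** True for an optional sign, at least one digit, and at most one '.', e.g. "1234.56". */
    static boolean looksLikeDecimal(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        // Skip a single leading sign, if present.
        int i = (s.charAt(0) == '-' || s.charAt(0) == '+') ? 1 : 0;
        boolean seenDigit = false;
        boolean seenDot = false;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                seenDigit = true;
            } else if (c == '.' && !seenDot) {
                seenDot = true;
            } else {
                return false; // any other character, e.g. the 'a' in "1234abcd"
            }
        }
        return seenDigit;
    }

    /** Returns null for non-decimal input without calling Double.valueOf at all. */
    static Long toLongOrNull(String s) {
        if (!looksLikeDecimal(s)) {
            return null;
        }
        // Truncates toward zero, so "1234.56" becomes 1234.
        return Double.valueOf(s).longValue();
    }
}
```

With this sketch, toLongOrNull("1234abcd") returns null without a parse attempt, 
while toLongOrNull("1234.56") still goes through Double.valueOf and truncates to 
1234.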
 

> Optimize byte to Long/Integer conversions
> -
>
> Key: PIG-3259
> URL: https://issues.apache.org/jira/browse/PIG-3259
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.11.1
>Reporter: Prashant Kommireddi
>Assignee: Prashant Kommireddi
> Fix For: 0.12
>
> Attachments: byteToLong.xlsx
>
>
> These conversions could perform better. If the input is not numeric 
> (1234abcd), the code still calls Double.valueOf(String) before finally 
> returning null. Any script that inadvertently (user's mistake or not) tries 
> to cast a non-numeric column to int or long would result in many wasteful 
> calls. 
> We can avoid this by calling Double.valueOf(String) only when the input looks 
> like a decimal number (1234.56), and returning null otherwise without even 
> trying the conversion.



[jira] [Commented] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended

2013-03-26 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614934#comment-13614934
 ] 

Prashant Kommireddi commented on PIG-3261:
--

{quote}
IIRC the reason was to not have them step over the shipped library jars 
unintentionally with a simple HADOOP_CLASSPATH being set
{quote}

Based on that, I feel that keeping it simple and not having a toggle is better 
for the following reasons:

# Pig does not have an env file like Hadoop does for specifying CLASSPATH. Most 
likely this would be set by the user, would be intentional, and would not be 
picked up from any of Pig's env files.
# Having a toggle for this seems like an additional step towards the same 
purpose. 

What do you think [~qwertymaniac]? It would be nice to have some others weigh 
in on this. I am leaning more towards your initial patch, though I am not 
opposed to the latest patch either.


> User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not 
> appended
> ---
>
> Key: PIG-3261
> URL: https://issues.apache.org/jira/browse/PIG-3261
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.10.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: PIG-3261.patch, PIG-3261.patch
>
>
> Currently we are doing this wrong:
> {code}
> if [ "$PIG_CLASSPATH" != "" ]; then
> CLASSPATH=${CLASSPATH}:${PIG_CLASSPATH}
> {code}
> This means that anything added to CLASSPATH until that point will never be 
> able to get overridden by a user set environment, which is wrong behavior. 
> Hadoop libs for example are added to CLASSPATH, before this extension is 
> called in bin/pig.



[jira] [Updated] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended

2013-03-26 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated PIG-3261:
-

Status: Patch Available  (was: Open)

> User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not 
> appended
> ---
>
> Key: PIG-3261
> URL: https://issues.apache.org/jira/browse/PIG-3261
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.10.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: PIG-3261.patch, PIG-3261.patch
>
>
> Currently we are doing this wrong:
> {code}
> if [ "$PIG_CLASSPATH" != "" ]; then
> CLASSPATH=${CLASSPATH}:${PIG_CLASSPATH}
> {code}
> This means that anything added to CLASSPATH until that point will never be 
> able to get overridden by a user set environment, which is wrong behavior. 
> Hadoop libs for example are added to CLASSPATH, before this extension is 
> called in bin/pig.



[jira] [Updated] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended

2013-03-26 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated PIG-3261:
-

Attachment: PIG-3261.patch

Patch revised. Added an env-opt toggler PIG_USER_CLASSPATH_FIRST that preserves 
today's behavior if unset (default).

Testing:

Export:
{{export PIG_CLASSPATH=Foo}}

Default behavior:
{code}
bash -x bin/pig
…
CLASSPATH=/Users/harshchouraria/Work/installs/pig/conf:/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/lib/tools.jar:Foo
…
{code}

Set toggle:

{{export PIG_USER_CLASSPATH_FIRST=true}}

{code}
bash -x bin/pig
…
CLASSPATH=Foo:/Users/harshchouraria/Work/installs/pig/conf:/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/lib/tools.jar
…
{code}

Disable toggle:

{{export PIG_USER_CLASSPATH_FIRST=}}

{code}
bash -x bin/pig
…
CLASSPATH=/Users/harshchouraria/Work/installs/pig/conf:/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/lib/tools.jar:Foo
…
{code}

Unset toggle:

{{unset PIG_USER_CLASSPATH_FIRST}}

{code}
bash -x bin/pig
…
CLASSPATH=/Users/harshchouraria/Work/installs/pig/conf:/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/lib/tools.jar:Foo
…
{code}

> User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not 
> appended
> ---
>
> Key: PIG-3261
> URL: https://issues.apache.org/jira/browse/PIG-3261
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.10.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: PIG-3261.patch, PIG-3261.patch
>
>
> Currently we are doing this wrong:
> {code}
> if [ "$PIG_CLASSPATH" != "" ]; then
> CLASSPATH=${CLASSPATH}:${PIG_CLASSPATH}
> {code}
> This means that anything added to CLASSPATH until that point will never be 
> able to get overridden by a user set environment, which is wrong behavior. 
> Hadoop libs for example are added to CLASSPATH, before this extension is 
> called in bin/pig.



[jira] Subscription: PIG patch available

2013-03-26 Thread jira
Issue Subscription
Filter: PIG patch available (34 issues)

Subscriber: pigdaily

Key         Summary
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3247    Piggybank functions to mimic OVER clause in SQL
            https://issues.apache.org/jira/browse/PIG-3247
PIG-3238    Pig current releases lack a UDF Stuff(). This UDF deletes a specified length of characters and inserts another set of characters at a specified starting point.
            https://issues.apache.org/jira/browse/PIG-3238
PIG-3237    Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a string containing substrings separated by "," characters) consisting of the strings that have the corresponding bit in the first argument
            https://issues.apache.org/jira/browse/PIG-3237
PIG-3223    AvroStorage does not handle comma separated input paths
            https://issues.apache.org/jira/browse/PIG-3223
PIG-3215    [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
            https://issues.apache.org/jira/browse/PIG-3215
PIG-3210    Pig fails to start when it cannot write log to log files
            https://issues.apache.org/jira/browse/PIG-3210
PIG-3198    Let users use any function from PigType -> PigType as if it were builtlin
            https://issues.apache.org/jira/browse/PIG-3198
PIG-3193    Fix "ant docs" warnings
            https://issues.apache.org/jira/browse/PIG-3193
PIG-3190    Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
            https://issues.apache.org/jira/browse/PIG-3190
PIG-3183    rm or rmf commands should respect globbing/regex of path
            https://issues.apache.org/jira/browse/PIG-3183
PIG-3173    Partition filter push down does not happen partition keys condition include a AND and OR construct
            https://issues.apache.org/jira/browse/PIG-3173
PIG-3166    Update eclipse .classpath according to ivy library.properties
            https://issues.apache.org/jira/browse/PIG-3166
PIG-3164    Pig current releases lack a UDF endsWith.This UDF tests if a given string ends with the specified suffix.
            https://issues.apache.org/jira/browse/PIG-3164
PIG-3123    Simplify Logical Plans By Removing Unneccessary Identity Projections
            https://issues.apache.org/jira/browse/PIG-3123
PIG-3122    Operators should not implicitly become reserved keywords
            https://issues.apache.org/jira/browse/PIG-3122
PIG-3114    Duplicated macro name error when using pigunit
            https://issues.apache.org/jira/browse/PIG-3114
PIG-3105    Fix TestJobSubmission unit test failure.
            https://issues.apache.org/jira/browse/PIG-3105
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2873    Converting bin/pig shell script to python
            https://issues.apache.org/jira/browse/PIG-2873
PIG-2643    Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc
            https://issues.apache.org/jira/browse/PIG-2643
PIG-2641    Create toJSON function for all complex types: tuples, bags and maps
            https://issues.apache.org/jira/browse/PIG-2641
PIG-2591    Unit tests should not write to /tmp but respect java.io.tmpdir
            https://issues.apache.org/jira/browse/PIG-2591
PIG-2244    Macros cannot be passed relation names
            https://issues.apache.org/jira/browse/PIG-2244
PIG-1914    Support load/store JSON data in Pig
            https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Commented] (PIG-3049) Cannot sort on a bag in nested foreach

2013-03-26 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614749#comment-13614749
 ] 

Johnny Zhang commented on PIG-3049:
---

[~daijy], thanks for the comments, and sorry about the late reply. I think it 
has the same root cause as PIG-2265; I left comments there explaining our 
findings so far. I don't have a patch ready yet, but yes, I am still looking 
for a fix.

> Cannot sort on a bag in nested foreach
> --
>
> Key: PIG-3049
> URL: https://issues.apache.org/jira/browse/PIG-3049
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12
>Reporter: Jonathan Coveney
>Assignee: Johnny Zhang
> Fix For: 0.12
>
>
> The following script fails.
> {code}
> a = load 'words_and_numbers' as (word:chararray, number:int);
> b = foreach (group a by number) {
>   a_bag = a.word;
>   ord = order a_bag by word;
>   generate group, ord;
> }
> dump b;
> {code}
> On this data:
> {code}
> $ cat words_and_numbers   
>
> hey   1
> hey   2
> you   3
> you   4
> I 5
> could 6
> {code}
> it throws the following error:
> {code}
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.pig.data.Tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:469)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:160)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:384)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:333)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
> {code}
> Is this a supported feature of Pig? Seems reasonable, just seems like 
> something weird is going on.



[jira] [Commented] (PIG-2265) Test case TestSecondarySort failure

2013-03-26 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614734#comment-13614734
 ] 

Johnny Zhang commented on PIG-2265:
---

Currently the test is disabled in trunk.

I enabled it and can reproduce the issue. I think it has the same root cause as 
PIG-3049. [~cheolsoo] helped me debug this issue a while back and explained the 
idea to me.

The reason seems to be that when secondary sort is enabled, the code needs to 
inform POProject.java to process the secondary sort key properly, to avoid 
casting the content of the tuple to a Tuple at POProject.java line 481:
{code}
res.result = (Tuple)ret;
{code}

The fix should be something like changing POProject.java line 422 from
{code}
ret = inpValue.get(columns.get(0));
{code}

to
{code}
if (secondarySort) {
    ret = inpValue;
} else {
    ret = inpValue.get(columns.get(0));
}
{code}

It is not clear to me whether this is the right guess, and I don't yet know how 
to get the boolean value secondarySort in POProject.java.

> Test case TestSecondarySort failure
> ---
>
> Key: PIG-2265
> URL: https://issues.apache.org/jira/browse/PIG-2265
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Shengjun Xin
>
> Error message:
> Testcase: testNestedSortEndToEnd3 took 53.076 sec
>   Caused an ERROR
> Unable to open iterator for alias E. Backend error : 
> org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
> open iterator for alias E. Backend error : org.apache.pig.data.DataByteArray 
> cannot be cast to org.apache.pig.data.Tuple
>   at org.apache.pig.PigServer.openIterator(PigServer.java:742)
>   at 
> org.apache.pig.test.TestSecondarySort.testNestedSortEndToEnd3(TestSecondarySort.java:550)
> Caused by: java.lang.ClassCastException: org.apache.pig.data.DataByteArray 
> cannot be cast to org.apache.pig.data.Tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:357)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)



[jira] [Commented] (PIG-3221) Bootstrap sampling

2013-03-26 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614618#comment-13614618
 ] 

Gianmarco De Francisci Morales commented on PIG-3221:
-

Here is an example: http://hortonworks.com/blog/bootstrap-sampling-with-apache-pig

> Bootstrap sampling
> --
>
> Key: PIG-3221
> URL: https://issues.apache.org/jira/browse/PIG-3221
> Project: Pig
>  Issue Type: New Feature
>Reporter: Gianmarco De Francisci Morales
>  Labels: gsoc2013
>
> Implement a bootstrap sampling option ( 
> http://en.wikipedia.org/wiki/Bootstrap_(statistics) ) in Pig's SAMPLE 
> operator.



[jira] [Commented] (PIG-3225) Stratified sampling

2013-03-26 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614617#comment-13614617
 ] 

Gianmarco De Francisci Morales commented on PIG-3225:
-

Hi Dishara,
Happy to see your interest.
While we haven't discussed this in detail with the rest of the committers, my 
personal view on this project is that it should be combined with the one on 
bootstrap sampling (PIG-3221) to be worthy of GSoC.

Regarding the sampling, this part of the project requires designing and 
changing the parser to recognize a new part of the syntax for the SAMPLE 
operator (to specify the strata), and implementing the logical and physical 
operators connected to it.

> Stratified sampling
> ---
>
> Key: PIG-3225
> URL: https://issues.apache.org/jira/browse/PIG-3225
> Project: Pig
>  Issue Type: New Feature
>Reporter: Gianmarco De Francisci Morales
>  Labels: gsoc2013
>
> Implement a stratified sampling option ( 
> http://en.wikipedia.org/wiki/Stratified_sampling ) in Pig's SAMPLE operator.



Re: Put a "Google summer of code 2013" cwiki page

2013-03-26 Thread Johnny Zhang
I have another idea for a GSoC project: running the unit tests in parallel. I
think several people mentioned this at the last Pig meetup. The objective is
to enable us to run the whole unit test suite before committing any patch. The
fix should include two parts:

(1) unit tests must not interfere with each other (e.g. move the test dir from
/tmp to build/test/tmp so a test doesn't delete another test's dir)
(2) we need to make sure Pig is thread safe
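
For part (1), one possible shape is a small helper that gives every test its
own uniquely named directory under a shared base such as build/test/tmp. This
is a sketch only: TestDirs and newTestDir are hypothetical names, not part of
Pig's test code, and it assumes Java 8 for UncheckedIOException.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Pig's test harness.
final class TestDirs {
    private TestDirs() {}

    /** Creates a fresh, uniquely named directory for one test under the given base. */
    static Path newTestDir(Path base, String testName) {
        try {
            // Make sure the shared base (e.g. build/test/tmp) exists.
            Files.createDirectories(base);
            // createTempDirectory adds a generated suffix to the prefix, so two
            // tests (or two runs of the same test) do not get the same directory.
            return Files.createTempDirectory(base, testName + "-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test would then call something like
TestDirs.newTestDir(Paths.get("build", "test", "tmp"), "TestGrunt") in its
setup and delete only that directory in its teardown.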

Johnny


On Fri, Mar 22, 2013 at 10:04 AM, Dmitriy Ryaboy  wrote:

> This is a little different than how we've done such things before, but how
> about a project to get Pig to run on Spark (aka, Spork)? The Twitter pig
> folks have some code we'd love to share that got us half-way there, it was
> looking pretty promising (if anyone is curious, it's the "spork" branch on
> my github fork of pig: https://github.com/dvryaboy/pig )
>
> D
>
> On Thu, Mar 21, 2013 at 2:05 PM, Prasanth J wrote:
>
> > One more idea for GSoC project.
> >
> > YSmart uses correlation between multiple MR jobs to reduce the number of
> > MR jobs generated. I remember Dmitriy bringing this up earlier. The
> > techniques specified in this paper (Input, Job Flow, Transit correlations)
> > have been patched into Hive. If Pig doesn't use these optimizations then I
> > think it will be good to have them in Pig as well.
> >
> > Here is the link to the paper
> >
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> >
> > I think this can be a good candidate project for GSoC.
> >
> > Thanks
> > -- Prasanth
> >
> > On Mar 21, 2013, at 3:51 PM, Olga Natkovich wrote:
> >
> > > +1 on that
> > >
> > >
> > > 
> > > From: Russell Jurney 
> > > To: "dev@pig.apache.org" 
> > > Sent: Thursday, March 21, 2013 11:54 AM
> > > Subject: Re: Put a "Google summer of code 2013" cwiki page
> > >
> > > Make Grunt use Antlr - high priority one for me. Once Grunt uses Antlr,
> > > macros will flourish.
> > >
> > >
> > > On Wed, Mar 20, 2013 at 6:25 PM, Daniel Dai wrote:
> > >
> > >> https://cwiki.apache.org/confluence/display/PIG/GSoc2013
> > >>
> > >> Feel free to add more projects which could fit in the timeline of a
> > >> student summer project.
> > >>
> > >> I remember there are several projects we discussed in our last meetup:
> > >> * Allow Pig to use Hive UDFs. Alan, do we have a ticket for that?
> > >> * A general framework for Pig performance tests. Rohini, do we have a
> > >> ticket?
> > >>
> > >> Thanks,
> > >> Daniel
> > >>
> > >
> > >
> > >
> > > --
> > > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> > datasyndrome.com
> >
> >
>


[jira] [Updated] (PIG-2244) Macros cannot be passed relation names

2013-03-26 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang updated PIG-2244:
--

Status: Patch Available  (was: Open)

[~alangates], I fixed the antlr grammar file so that it is able to expand a 
quoted relation in a macro definition to a relation in the .expanded file.

I added another test case, testQuotedRelation(), to verify this case.

I ran all the test cases in TestMacroExpansion and they pass for me.

> Macros cannot be passed relation names
> --
>
> Key: PIG-2244
> URL: https://issues.apache.org/jira/browse/PIG-2244
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9.0
>Reporter: Alan Gates
>Priority: Minor
> Attachments: PIG-2244.patch.txt
>
>
> If an alias is passed quoted, it gets expanded as if it were an alias in the 
> macro, which leads to a very strange error message.



[jira] [Assigned] (PIG-2244) Macros cannot be passed relation names

2013-03-26 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang reassigned PIG-2244:
-

Assignee: Johnny Zhang

> Macros cannot be passed relation names
> --
>
> Key: PIG-2244
> URL: https://issues.apache.org/jira/browse/PIG-2244
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9.0
>Reporter: Alan Gates
>Assignee: Johnny Zhang
>Priority: Minor
> Attachments: PIG-2244.patch.txt
>
>
> If an alias is passed quoted, it gets expanded as if it were an alias in the 
> macro, which leads to a very strange error message.



[jira] [Updated] (PIG-2244) Macros cannot be passed relation names

2013-03-26 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang updated PIG-2244:
--

Attachment: PIG-2244.patch.txt

> Macros cannot be passed relation names
> --
>
> Key: PIG-2244
> URL: https://issues.apache.org/jira/browse/PIG-2244
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9.0
>Reporter: Alan Gates
>Priority: Minor
> Attachments: PIG-2244.patch.txt
>
>
> If an alias is passed quoted, it gets expanded as if it were an alias in the 
> macro, which leads to a very strange error message.



[jira] [Updated] (PIG-3247) Piggybank functions to mimic OVER clause in SQL

2013-03-26 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3247:


Attachment: Over.2.patch

A new version of the patch that fixes an error in the percent_rank calculation 
and adds the ability to specify the return type of the Over function.

> Piggybank functions to mimic OVER clause in SQL
> ---
>
> Key: PIG-3247
> URL: https://issues.apache.org/jira/browse/PIG-3247
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Reporter: Alan Gates
>Assignee: Alan Gates
> Fix For: 0.12
>
> Attachments: Over.2.patch, Over.patch
>
>
> In order to test Hive I have written some UDFs to mimic the behavior of SQL's 
> OVER clause.  I thought they would be useful to share.



[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions

2013-03-26 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614137#comment-13614137
 ] 

Thejas M Nair commented on PIG-3259:


bq.  How do we determine the number of non-numbers without making calls to 
sanityCheck..()?
By counting the number of times an exception has so far been thrown by 
.valueOf(). Once a threshold has been crossed, we can introduce the sanity 
check for each new value. This will put a limit on worst ('incorrect') case 
performance without degrading the 'correct' case performance by much. 

I wonder if there are good libraries that we can use for the sanity checks, as 
the decimal check seems a bit more complicated.
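
A rough sketch of the counting idea, under stated assumptions: 
AdaptiveLongParser, failureThreshold and looksNumeric are hypothetical names 
rather than Pig code, and the pre-check is a loose character filter that is 
stricter than Double.valueOf (it rejects exponent forms, for example). The 
sanity check is only paid for once more than failureThreshold parse failures 
have been seen.

```java
// Hypothetical sketch, not Pig's actual code: turn on a pre-check only after
// parsing has failed more than a threshold number of times.
final class AdaptiveLongParser {
    private final int failureThreshold; // assumed tunable; picking it is the tricky part
    private int failures = 0;

    AdaptiveLongParser(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** Returns the value truncated to a Long, or null if the input is not numeric. */
    Long parse(String s) {
        if (s == null) {
            return null;
        }
        // After enough failures, reject obviously non-numeric input up front
        // instead of paying for another exception.
        if (failures > failureThreshold && !looksNumeric(s)) {
            return null;
        }
        try {
            return Double.valueOf(s).longValue();
        } catch (NumberFormatException e) {
            failures++;
            return null;
        }
    }

    /** Number of NumberFormatExceptions caught so far. */
    int failuresSoFar() {
        return failures;
    }

    // Loose filter: only digits, signs and '.' allowed, with at least one digit.
    private static boolean looksNumeric(String s) {
        boolean seenDigit = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                seenDigit = true;
            } else if (c != '.' && c != '-' && c != '+') {
                return false;
            }
        }
        return seenDigit;
    }
}
```

With a threshold of 2, the first three non-numeric values each cost an 
exception; later ones are rejected by the filter without one, while numeric 
input keeps going straight to valueOf().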

> Optimize byte to Long/Integer conversions
> -
>
> Key: PIG-3259
> URL: https://issues.apache.org/jira/browse/PIG-3259
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.11.1
>Reporter: Prashant Kommireddi
>Assignee: Prashant Kommireddi
> Fix For: 0.12
>
> Attachments: byteToLong.xlsx
>
>
> These conversions could perform better. If the input is not numeric 
> (1234abcd), the code still calls Double.valueOf(String) before finally 
> returning null. Any script that inadvertently (user's mistake or not) tries 
> to cast a non-numeric column to int or long would result in many wasteful 
> calls. 
> We can avoid this by calling Double.valueOf(String) only when the input looks 
> like a decimal number (1234.56), and returning null otherwise without even 
> trying the conversion.



[jira] [Updated] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended

2013-03-26 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated PIG-3261:
-

Status: Open  (was: Patch Available)

> User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not 
> appended
> ---
>
> Key: PIG-3261
> URL: https://issues.apache.org/jira/browse/PIG-3261
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.10.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: PIG-3261.patch
>
>
> Currently we are doing this wrong:
> {code}
> if [ "$PIG_CLASSPATH" != "" ]; then
> CLASSPATH=${CLASSPATH}:${PIG_CLASSPATH}
> {code}
> This means that anything added to CLASSPATH until that point will never be 
> able to get overridden by a user set environment, which is wrong behavior. 
> Hadoop libs for example are added to CLASSPATH, before this extension is 
> called in bin/pig.



[jira] [Commented] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended

2013-03-26 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613705#comment-13613705
 ] 

Harsh J commented on PIG-3261:
--

IIRC the reason was to keep them from unintentionally stepping over the 
shipped library jars when a simple HADOOP_CLASSPATH is set. I guess we can add 
a toggle instead of changing the behavior, which would be safer. I'll update 
the patch.
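
A rough sketch of what such a toggle might look like. This is not taken from
the patch: the variable name PIG_USER_CLASSPATH_FIRST and the helper function
merge_pig_classpath are hypothetical, loosely modeled on Hadoop's
HADOOP_USER_CLASSPATH_FIRST. With the toggle unset, the current append
behavior is kept.

```shell
#!/bin/sh
# Sketch only: PIG_USER_CLASSPATH_FIRST and merge_pig_classpath are
# hypothetical names used for illustration, not part of bin/pig.

# merge_pig_classpath BASE USER FIRST
# Prints BASE combined with USER. USER goes first only when FIRST is non-empty.
merge_pig_classpath() {
    base=$1
    user=$2
    user_first=$3

    if [ -z "$user" ]; then
        printf '%s\n' "$base"              # nothing to add
    elif [ -z "$base" ]; then
        printf '%s\n' "$user"              # avoid a stray leading/trailing colon
    elif [ -n "$user_first" ]; then
        printf '%s\n' "${user}:${base}"    # opt-in: user entries take precedence
    else
        printf '%s\n' "${base}:${user}"    # default: existing append behavior
    fi
}

# How a launcher script might call it (illustrative only):
CLASSPATH=$(merge_pig_classpath "${CLASSPATH:-}" "${PIG_CLASSPATH:-}" "${PIG_USER_CLASSPATH_FIRST:-}")
```

The trade-off is the one discussed in this thread: the default stays safe for
existing users, but anyone who wants an override has to set two variables.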




[jira] [Commented] (PIG-2988) start deploying pigunit maven artifact part of Pig release process

2013-03-26 Thread Ioan Eugen Stan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613645#comment-13613645
 ] 

Ioan Eugen Stan commented on PIG-2988:
--

Great! I was just about to report this issue. 

> start deploying pigunit maven artifact part of Pig release process
> --
>
> Key: PIG-2988
> URL: https://issues.apache.org/jira/browse/PIG-2988
> Project: Pig
> Issue Type: New Feature
> Components: build
> Affects Versions: 0.11, 0.10.1
> Reporter: Johnny Zhang
> Assignee: Nick White
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-2988.0-branch11.patch, PIG-2988.0.patch
>
>
> Right now the Pig project doesn't publish a pigunit Maven artifact, so things like
> {noformat}
> <dependency>
>   <groupId>org.apache.pig</groupId>
>   <artifactId>pigunit</artifactId>
>   <version>0.10.0</version>
> </dependency>
> {noformat}
> don't work. Can we start deploying pigunit Maven artifacts as part of the 
> release process? Thanks.



[jira] [Updated] (PIG-3199) Expose LogicalPlan via PigServer API

2013-03-26 Thread Prashant Kommireddi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Kommireddi updated PIG-3199:
-

Patch Info:   (was: Patch Available)

> Expose LogicalPlan via PigServer API
> 
>
> Key: PIG-3199
> URL: https://issues.apache.org/jira/browse/PIG-3199
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.10.0
> Reporter: Prashant Kommireddi
> Assignee: Prashant Kommireddi
> Fix For: 0.12
>
> Attachments: PIG-3199.patch
>
>
> LogicalPlan could be exposed to users so that they can run validations 
> against it. For example, one could get Load/Store paths or other operators 
> and check whether the I/O paths are valid.



[jira] [Commented] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended

2013-03-26 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613616#comment-13613616
 ] 

Prashant Kommireddi commented on PIG-3261:
--

I am actually happy with this patch. Looking through Hadoop JIRAs, 
documentation and the comments in the bin/hadoop script, I could not work out 
why the property HADOOP_USER_CLASSPATH_FIRST exists. I just want to make sure 
we don't miss a legitimate reason for it here; otherwise, PIG_CLASSPATH is 
generally set when a user has specific custom jar/classpath requirements. Like 
you said, I don't think a user would set PIG_CLASSPATH but want the default 
CLASSPATH to take precedence.
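
To make the precedence point concrete: the JVM searches classpath entries in
order and uses the first match, so whichever entry comes first wins. The
sketch below only mimics that rule with plain directories and files; the
function name first_match and the directory names are made up for
illustration.

```shell
#!/bin/sh
# Sketch only: mimics "first match wins" on a colon-separated search path
# using plain directories and files instead of jars and classes.

# first_match SEARCH_PATH NAME
# Prints the first directory in SEARCH_PATH that contains a file called NAME.
# Returns non-zero if no directory contains it.
first_match() {
    search_path=$1
    name=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        if [ -e "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Two directories that both provide the same "class".
tmp=$(mktemp -d)
mkdir "$tmp/shipped" "$tmp/user"
touch "$tmp/shipped/Foo.class" "$tmp/user/Foo.class"

first_match "$tmp/shipped:$tmp/user" Foo.class   # user appended: shipped copy wins
first_match "$tmp/user:$tmp/shipped" Foo.class   # user prepended: user copy wins

rm -rf "$tmp"
```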




[jira] [Commented] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended

2013-03-26 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613608#comment-13613608
 ] 

Harsh J commented on PIG-3261:
--

We can make it configurable and document that, I guess; but it's kinda odd to 
have to flip two toggles to get an override done. In most cases where an 
override is required, users are aware of the overriding, so the secondary 
toggle seems a tad unnecessary.

If you prefer that strongly, I'll send in another patch - let me know :)

An alternate fix would be to simply do the PIG_CLASSPATH addition before 
anything else is added to CLASSPATH, but this kind of position-in-code fix is 
harder to maintain over time.
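
For comparison, a small sketch of the two approaches. The jar names and
function names are placeholders, not what bin/pig really adds. Both produce
the same ordering today; the difference is that the first one only stays
correct as long as nobody adds another CLASSPATH line above the seeding step.

```shell
#!/bin/sh
# Sketch only: hadoop-core.jar and pig.jar stand in for whatever the
# launcher script adds; the function names are hypothetical.

# Option 1: seed the classpath with the user entries before anything else.
classpath_seed_first() {
    user=$1
    cp=$user
    for jar in hadoop-core.jar pig.jar; do
        cp=${cp:+$cp:}$jar       # append, adding a colon only when cp is non-empty
    done
    printf '%s\n' "$cp"
}

# Option 2: build the classpath as before, then prepend the user entries last.
classpath_prepend_last() {
    user=$1
    cp=
    for jar in hadoop-core.jar pig.jar; do
        cp=${cp:+$cp:}$jar
    done
    printf '%s\n' "${user:+$user:}$cp"
}

classpath_seed_first "my-udfs.jar"      # prints my-udfs.jar:hadoop-core.jar:pig.jar
classpath_prepend_last "my-udfs.jar"    # prints my-udfs.jar:hadoop-core.jar:pig.jar
```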

