[jira] Subscription: PIG patch available

2013-05-03 Thread jira
Issue Subscription
Filter: PIG patch available (23 issues)

Subscriber: pigdaily

Key         Summary
PIG-3297    Avro files with stringType set to String cannot be read by the AvroStorage LoadFunc
            https://issues.apache.org/jira/browse/PIG-3297
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3291    TestExampleGenerator fails on Windows because of lack of file name escaping
            https://issues.apache.org/jira/browse/PIG-3291
PIG-3285    Jobs using HBaseStorage fail to ship dependency jars
            https://issues.apache.org/jira/browse/PIG-3285
PIG-3258    Patch to allow MultiStorage to use more than one index to generate output tree
            https://issues.apache.org/jira/browse/PIG-3258
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3247    Piggybank functions to mimic OVER clause in SQL
            https://issues.apache.org/jira/browse/PIG-3247
PIG-3223    AvroStorage does not handle comma separated input paths
            https://issues.apache.org/jira/browse/PIG-3223
PIG-3210    Pig fails to start when it cannot write log to log files
            https://issues.apache.org/jira/browse/PIG-3210
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3166    Update eclipse .classpath according to ivy library.properties
            https://issues.apache.org/jira/browse/PIG-3166
PIG-3123    Simplify Logical Plans By Removing Unneccessary Identity Projections
            https://issues.apache.org/jira/browse/PIG-3123
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3025    TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
            https://issues.apache.org/jira/browse/PIG-3025
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2248    Pig parser does not detect when a macro name masks a UDF name
            https://issues.apache.org/jira/browse/PIG-2248
PIG-2244    Macros cannot be passed relation names
            https://issues.apache.org/jira/browse/PIG-2244
PIG-1914    Support load/store JSON data in Pig
            https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Updated] (PIG-3311) add pig-withouthadoop-h2 to mvn-jar

2013-05-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3311:
---

Patch Info: Patch Available

> add pig-withouthadoop-h2 to mvn-jar
> ---
>
> Key: PIG-3311
> URL: https://issues.apache.org/jira/browse/PIG-3311
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Attachments: PIG-3311.patch
>
>
> mvn-jar currently creates pig-version.jar and pig-version-h2.jar
> I'm adding pig-version-withouthadoop.jar and pig-version-withouthadoop-h2.jar 
> that are needed to run pig from the command line.
> This will allow a dual-version package.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3311) add pig-withouthadoop-h2 to mvn-jar

2013-05-03 Thread Bill Graham (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648931#comment-13648931
 ] 

Bill Graham commented on PIG-3311:
--

+1

> add pig-withouthadoop-h2 to mvn-jar
> ---
>
> Key: PIG-3311
> URL: https://issues.apache.org/jira/browse/PIG-3311
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Attachments: PIG-3311.patch
>
>
> mvn-jar currently creates pig-version.jar and pig-version-h2.jar
> I'm adding pig-version-withouthadoop.jar and pig-version-withouthadoop-h2.jar 
> that are needed to run pig from the command line.
> This will allow a dual-version package.



Re: A major addition to Pig. Working with spatial data

2013-05-03 Thread Daniel Dai
I am not sure how other Apache projects deal with this. It seems Solr also
has some connector to JTS?

Thanks,
Daniel


On Thu, May 2, 2013 at 11:59 AM, Ahmed Eldawy  wrote:

> Thanks Alan for your interest. It's too bad that an open source licensing
> issue is holding me back from doing some open source work. I understand the
> issue and your workarounds make sense. However, as I mentioned in the
> beginning, I don't want to have my own branch of Pig because it makes my
> extension less portable. I'll think of another way to do it. I'll ask vivid
> solutions if they can double license their code although I think the answer
> will be no. I'll also think of a way to ship my extension as a set of jar
> files without the need to change the core of Pig. This way, it can be
> easily ported to newer versions of Pig.
>
> Thanks
> Ahmed
>
> Best regards,
> Ahmed Eldawy
>
>
> On Thu, May 2, 2013 at 12:33 PM, Alan Gates  wrote:
>
> > I know this is frustrating, but the different licenses do have different
> > requirements that make it so that Apache can't ship GPL code.  A legal
> > explanation is at http://www.apache.org/licenses/GPL-compatibility.html
> > For additional info on the LGPL specific questions see
> > http://www.apache.org/legal/3party.html
> >
> > As far as pulling it in via ivy, the issue isn't so much where the code
> > lives as much as what code we are requiring to make Pig work.  If
> > something that is [L]GPL is required for Pig it violates Apache rules as
> > outlined above.  It also would be a show stopper for a lot of companies
> > that redistribute Pig and that are allergic to GPL software.
> >
> > So, as I said before, if you wanted to continue with that library and
> > they are not willing to relicense it then it would have to be bolted on
> > after Apache Pig is built.  Nothing stops you from doing this by
> > downloading Apache Pig, adding this library and your code, and
> > redistributing, though it wouldn't then be open to all Pig users.
> >
> > Alan.
> >
> > On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
> >
> > > Thanks for your response. I was never good at differentiating all those
> > > open source licenses. I mean what is the point making open source
> > > licenses if it blocks me from using a library in an open source project.
> > > Any way, I'm not going into debate here. Just one question, if we use
> > > JTS as a library (jar file) without adding the code in Pig, is it still
> > > a violation? We'll use ivy, for example, to download the jar file when
> > > compiling.
> > > On May 1, 2013 7:50 PM, "Alan Gates"  wrote:
> > >
> > >> Passing on the technical details for a moment, I see a licensing
> > >> issue. JTS is licensed under LGPL.  Apache projects cannot contain or
> > >> ship [L]GPL.  Apache does not meet the requirements of GPL and thus we
> > >> cannot repackage their code. If you wanted to go forward using that
> > >> class this would have to be packaged as an add on that was downloaded
> > >> separately and not from Apache.  Another option is to work with the JTS
> > >> community and see if they are willing to dual license their code under
> > >> BSD or Apache license so that Pig could include it.  If neither of
> > >> those are an option you would need to come up with a new class to
> > >> contain your spatial data.
> > >>
> > >> Alan.
> > >>
> > >> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
> > >>
> > >>> Hi all,
> > >>> First, sorry for the long email. I wanted to put all my thoughts here
> > >>> and get your feedback.
> > >>> I'm proposing a major addition to Pig that will greatly increase its
> > >>> functionality and user base. It is simply to add spatial support to
> > >>> the language and the framework. I've already started working on that
> > >>> but I don't want it to be just another branch. I want it, eventually,
> > >>> to be merged with the trunk of Apache Pig. So, I'm sending this email
> > >>> mainly to reach out the main contributors of Pig to see the
> > >>> feasibility of this.
> > >>> This addition is a part of a big project we have been working on in
> > >>> University of Minnesota; the project is called Spatial Hadoop.
> > >>> http://spatialhadoop.cs.umn.edu. It's about building a MapReduce
> > >>> framework (Hadoop) that is capable of maintaining and analyzing
> > >>> spatial data efficiently. I'm the main guy behind that project and
> > >>> since we released its first version, we received very encouraging
> > >>> responses from different groups in the research and industrial
> > >>> community. I'm sure the addition we want to make to Pig Latin will be
> > >>> widely accepted by the people in the spatial community.
> > >>> I'm proposing a plan here while we're still in the early phases of
> > >>> this task to be able to discuss it with the main contributors and see
> > >>> its feasibility. First of all, I think that we need to change the core
> > >>> of Pig to be able

[jira] [Updated] (PIG-3312) Pig duplicates avro records

2013-05-03 Thread Hans Uhlig (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Uhlig updated PIG-3312:


Attachment: twitter.json
twitter.avsc
twitter.avro

> Pig duplicates avro records
> ---
>
> Key: PIG-3312
> URL: https://issues.apache.org/jira/browse/PIG-3312
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Hans Uhlig
> Attachments: twitter.avro, twitter.avsc, twitter.json
>
>
> Pig will report avro records twice.
> To Reproduce:
> * Place attached files on hdfs
> * run pig
> > register lib/piggybank.jar
> > register lib/avro-1.7.4.jar
> > register lib/json-simple-1.1.jar
> > register lib/jackson-mapper-asl-1.6.0.jar
> > register lib/jackson-core-asl-1.6.0.jar
> > user_data= LOAD 'twitter.avro' using 
> > org.apache.pig.piggybank.storage.avro.AvroStorage();
> > dump user_data;
> Result: 
> (miguno,Rock: Nerf paper, scissors is fine.,1366150681)
> (BlizzardCS,Works as intended. Terran is IMBA.,1366154481)
> (Test1,One Tweet,1366154490)
> (miguno,Rock: Nerf paper, scissors is fine.,1366150681)
> (BlizzardCS,Works as intended. Terran is IMBA.,1366154481)
> (Test1,One Tweet,1366154490)



[jira] [Created] (PIG-3312) Pig duplicates avro records

2013-05-03 Thread Hans Uhlig (JIRA)
Hans Uhlig created PIG-3312:
---

 Summary: Pig duplicates avro records
 Key: PIG-3312
 URL: https://issues.apache.org/jira/browse/PIG-3312
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
Reporter: Hans Uhlig


Pig will report avro records twice.

To Reproduce:

* Place attached files on hdfs
* run pig
> register lib/piggybank.jar
> register lib/avro-1.7.4.jar
> register lib/json-simple-1.1.jar
> register lib/jackson-mapper-asl-1.6.0.jar
> register lib/jackson-core-asl-1.6.0.jar
> user_data= LOAD 'twitter.avro' using 
> org.apache.pig.piggybank.storage.avro.AvroStorage();
> dump user_data;

Result: 
(miguno,Rock: Nerf paper, scissors is fine.,1366150681)
(BlizzardCS,Works as intended. Terran is IMBA.,1366154481)
(Test1,One Tweet,1366154490)
(miguno,Rock: Nerf paper, scissors is fine.,1366150681)
(BlizzardCS,Works as intended. Terran is IMBA.,1366154481)
(Test1,One Tweet,1366154490)







[jira] [Commented] (PIG-2248) Pig parser does not detect when a macro name masks a UDF name

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648866#comment-13648866
 ] 

Daniel Dai commented on PIG-2248:
-

That will be great! Thanks

> Pig parser does not detect when a macro name masks a UDF name
> -
>
> Key: PIG-2248
> URL: https://issues.apache.org/jira/browse/PIG-2248
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9.0
>Reporter: Alan Gates
>Assignee: Johnny Zhang
>Priority: Minor
> Attachments: PIG-2248.patch.txt, PIG-2248.patch.txt, 
> PIG-2248.patch.txt, PIG-2248.patch.txt
>
>
> Pig accepts a macro like:
> {code}
> define COUNT(in_relation, min_gpa) returns c {
>     b = filter $in_relation by gpa >= $min_gpa;
>     $c = foreach b generate age, name;
> }
> {code}
> This should produce a warning that it is masking a UDF.



[jira] [Commented] (PIG-2248) Pig parser does not detect when a macro name masks a UDF name

2013-05-03 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648863#comment-13648863
 ] 

Johnny Zhang commented on PIG-2248:
---

Daniel, this is a good idea and it addresses the bigger picture. The three steps I am thinking of:
1. decide the symbol resolving order
2. check that masking works for the designed order
3. document the usage.
What do you think? Maybe we should put the logic of this patch into some common 
place and extend it to all symbol resolving checks.

> Pig parser does not detect when a macro name masks a UDF name
> -
>
> Key: PIG-2248
> URL: https://issues.apache.org/jira/browse/PIG-2248
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9.0
>Reporter: Alan Gates
>Assignee: Johnny Zhang
>Priority: Minor
> Attachments: PIG-2248.patch.txt, PIG-2248.patch.txt, 
> PIG-2248.patch.txt, PIG-2248.patch.txt
>
>
> Pig accepts a macro like:
> {code}
> define COUNT(in_relation, min_gpa) returns c {
>     b = filter $in_relation by gpa >= $min_gpa;
>     $c = foreach b generate age, name;
> }
> {code}
> This should produce a warning that it is masking a UDF.



Re: Function To Compute Product of Values in Bag

2013-05-03 Thread Julien Le Dem
As for the PRODUCT, I don't see why it could not be added to builtin.
It is a very generic and dependency-free function.


On Fri, May 3, 2013 at 1:36 PM, Sergey Goder  wrote:

> Thanks for the tip about numerical accuracy issues and the elegant solution
> exploiting log/exp. It is very much appreciated.
>
> Sergey
>
>
> On Fri, May 3, 2013 at 11:42 AM, Kai Londenberg <
> kai.londenb...@googlemail.com> wrote:
>
> > Hi,
> >
> > Just a hint: It's usually better to work with log probabilities and sum
> > over them, than to work with raw probabilities and to use
> > multiplication. You might easily run into numerical accuracy issues
> > otherwise.
> >
> > i.e. exploit this fact:
> >
> > product(x1, ..., xn) = exp(sum(log(x1), ..., log(xn)))
> >
> > best,
> >
> > Kai Londenberg
> >
> > 2013/5/3 Sergey Goder :
> > > I'm creating a multinomial naive bayes classifier using pig and need to
> > > compute the product of probabilities. There are an arbitrary number of
> > > values in the bag so I would like to be able to use a function similar
> to
> > > the builtin SUM to do this. I looked through the source code and found
> > that
> > > with some really simple changes to SUM.java I can create a PROD.java
> > > function. I included it in my piggybank and have been using it
> > successfully.
> > >
> > > I was curious what the community thought about including this function
> > as a
> > > builtin function in a future release? Or would it make more sense to
> keep
> > > this function as a udf in a piggybank.
> > >
> > > Thanks,
> > > Sergey
> >
>
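
The log/exp identity quoted above can be sketched in Python as follows. This is only an illustration of the numerical point, not the PROD.java function discussed in the thread; the function names are hypothetical, and the inputs are assumed to be non-negative probabilities.

```python
import math


def log_space_product(probabilities):
    """Compute product(x1, ..., xn) as exp(sum(log(x1), ..., log(xn)))."""
    values = list(probabilities)
    if any(p < 0 for p in values):
        raise ValueError("probabilities must be non-negative")
    if any(p == 0 for p in values):
        # log(0) is undefined, and the product is exactly zero anyway.
        return 0.0
    return math.exp(math.fsum(math.log(p) for p in values))


def log_likelihood(probabilities):
    """Stay in log space, which avoids underflow for long products."""
    return math.fsum(math.log(p) for p in probabilities)
```

For a classifier it is usually enough to compare `log_likelihood` values directly, since converting back with `exp` can still underflow when the true product is smaller than the smallest representable float.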


Pig package supporting both hadoop 1 and 2

2013-05-03 Thread Julien Le Dem
Hi Pig developers,
I'm looking into having a Pig package that works for both Hadoop 1.0 and Hadoop 
2.0.
That means having both pig*.jar and pig*-h2.jar in the package and choosing the 
right one dynamically.
In particular I created this JIRA as a first step: 
https://issues.apache.org/jira/browse/PIG-3311
I'm curious to know how others do it.
Yahoo! for example? (Rohini?)
Thanks,
Julien
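
One way the dynamic choice could look is sketched below in Python. The jar names follow the pig-version-withouthadoop.jar and pig-version-withouthadoop-h2.jar pattern from PIG-3311; the function name, the way the Hadoop version string is obtained, and the rule "major version 2 or higher means -h2" are assumptions for illustration. In particular, the sketch does not handle Hadoop 0.23.x.

```python
def pick_pig_jar(hadoop_version, pig_version):
    """Return the jar name to put on the classpath (hypothetical helper).

    hadoop_version is a string such as "1.0.4" or "2.0.4-alpha".
    Assumption: only the major version decides which jar is used.
    """
    major = int(hadoop_version.split(".")[0])
    suffix = "-h2" if major >= 2 else ""
    return "pig-{0}-withouthadoop{1}.jar".format(pig_version, suffix)
```

A launcher script would call this once at startup, after detecting the installed Hadoop version, and add the returned jar to the classpath.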



[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-05-03 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang updated PIG-3223:
--

Attachment: PIG-3223.branch-0.11.patch.txt

Rohini, thanks for your +1 on RB. Here is the patch for branch 0.11. Could 
you please help commit the patch (to trunk and 0.11)? Much appreciated.

> AvroStorage does not handle comma separated input paths
> ---
>
> Key: PIG-3223
> URL: https://issues.apache.org/jira/browse/PIG-3223
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0, 0.11
>Reporter: Michael Kramer
>Assignee: Johnny Zhang
> Attachments: AvroStorage.patch, AvroStorage.patch-2, 
> AvroStorageUtils.patch, AvroStorageUtils.patch-2, 
> PIG-3223.branch-0.11.patch.txt, PIG-3223.patch.txt, PIG-3223.patch.txt, 
> PIG-3223.patch.txt, PIG-3223.patch.txt, PIG-3223.patch.txt
>
>
> In pig 0.11, a patch was issued to AvroStorage to support globs and comma 
> separated input paths (PIG-2492).  While this function works fine for 
> glob-formatted input paths, it fails when issued a standard comma separated 
> list of paths.  fs.globStatus does not seem to be able to parse out such a 
> list, and a java.net.URISyntaxException is thrown when toURI is called on the 
> path.  
> I have a working fix for this, but it's extremely ugly (basically checking if 
> the string of input paths is globbed, otherwise splitting on ",").  I'm sure 
> there's a more elegant solution.  I'd be happy to post the relevant methods 
> and "fixes" if necessary.
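
The workaround the reporter describes (treat the string as a glob if it contains glob characters, otherwise split on ",") can be sketched in Python as below. This illustrates the idea only and is not the code from the attached patches; the function name and the set of glob characters are assumptions.

```python
# Characters assumed to mark a glob pattern. Braces are included because
# a pattern such as {a,b} contains a comma that must not be split on.
GLOB_CHARS = set("*?[]{}")


def expand_input_paths(location):
    """Return the list of path strings to hand to the file system."""
    if any(ch in GLOB_CHARS for ch in location):
        # Leave globbed input alone; glob expansion happens later.
        return [location]
    # Plain comma separated list: split and drop empty entries.
    return [part.strip() for part in location.split(",") if part.strip()]
```

A string that mixes both forms, such as a glob followed by a comma and a second path, is returned unsplit by this sketch, which is one reason the reporter calls the approach ugly.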



[jira] [Created] (PIG-3311) add pig-withouthadoop-h2 to mvn-jar

2013-05-03 Thread Julien Le Dem (JIRA)
Julien Le Dem created PIG-3311:
--

 Summary: add pig-withouthadoop-h2 to mvn-jar
 Key: PIG-3311
 URL: https://issues.apache.org/jira/browse/PIG-3311
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Julien Le Dem
Assignee: Julien Le Dem
 Attachments: PIG-3311.patch

mvn-jar currently creates pig-version.jar and pig-version-h2.jar
I'm adding pig-version-withouthadoop.jar and pig-version-withouthadoop-h2.jar 
that are needed to run pig from the command line.
This will allow a dual-version package.



[jira] [Updated] (PIG-3311) add pig-withouthadoop-h2 to mvn-jar

2013-05-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3311:
---

Attachment: PIG-3311.patch

PIG-3311.patch adds -withouthadoop to the mvn-jar target

> add pig-withouthadoop-h2 to mvn-jar
> ---
>
> Key: PIG-3311
> URL: https://issues.apache.org/jira/browse/PIG-3311
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Attachments: PIG-3311.patch
>
>
> mvn-jar currently creates pig-version.jar and pig-version-h2.jar
> I'm adding pig-version-withouthadoop.jar and pig-version-withouthadoop-h2.jar 
> that are needed to run pig from the command line.
> This will allow a dual-version package.



[jira] [Commented] (PIG-2248) Pig parser does not detect when a macro name masks a UDF name

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648841#comment-13648841
 ] 

Daniel Dai commented on PIG-2248:
-

What I mean is something like the following (for illustration only, it may not be accurate):
keywords > default > define > MACRO > UDF

If we have a name conflict, the higher side wins, e.g. macro "COUNT" will mask 
UDF "COUNT".
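
A resolution order of this kind can be sketched in Python as follows. The order is copied from the illustration in the comment and, as stated there, may not match what Pig actually does; the function and variable names are hypothetical.

```python
# Illustrative precedence only, highest first. Pig's real order may differ.
RESOLUTION_ORDER = ["keyword", "default", "define", "macro", "udf"]


def resolve(name, symbols):
    """Resolve a name against the symbol kinds that define it.

    symbols maps a kind (e.g. "macro") to the set of names of that kind.
    Returns (winning_kind, masked_kinds); (None, []) if the name is unknown.
    """
    kinds = [kind for kind in RESOLUTION_ORDER if name in symbols.get(kind, ())]
    if not kinds:
        return None, []
    return kinds[0], kinds[1:]
```

With `symbols = {"macro": {"COUNT"}, "udf": {"COUNT", "SUM"}}`, resolving "COUNT" picks the macro and reports the UDF as masked, which is the case where a parser could emit the warning this issue asks for.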

> Pig parser does not detect when a macro name masks a UDF name
> -
>
> Key: PIG-2248
> URL: https://issues.apache.org/jira/browse/PIG-2248
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9.0
>Reporter: Alan Gates
>Assignee: Johnny Zhang
>Priority: Minor
> Attachments: PIG-2248.patch.txt, PIG-2248.patch.txt, 
> PIG-2248.patch.txt, PIG-2248.patch.txt
>
>
> Pig accepts a macro like:
> {code}
> define COUNT(in_relation, min_gpa) returns c {
>     b = filter $in_relation by gpa >= $min_gpa;
>     $c = foreach b generate age, name;
> }
> {code}
> This should produce a warning that it is masking a UDF.



[jira] [Updated] (PIG-2873) Converting bin/pig shell script to python

2013-05-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2873:


   Resolution: Fixed
Fix Version/s: 0.12
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. pig.py needs more tests, though; that could be a new 
Jira. Thanks Vikram!

> Converting bin/pig shell script to python
> -
>
> Key: PIG-2873
> URL: https://issues.apache.org/jira/browse/PIG-2873
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.10.0
>Reporter: Vikram Dixit K
>Assignee: Vikram Dixit K
>Priority: Minor
> Fix For: 0.12
>
> Attachments: PIG-2873_2.patch, PIG-2873_3.patch, PIG-2873_4.patch, 
> PIG-2873.patch
>
>
> Converted the shell script in a platform independent way in python. Should 
> work with version 2.7.x



[jira] [Commented] (PIG-2248) Pig parser does not detect when a macro name masks a UDF name

2013-05-03 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648838#comment-13648838
 ] 

Johnny Zhang commented on PIG-2248:
---

Daniel, can you explain the term "symbol resolving order"? Does that mean we 
set a different level of resolving order for each kind of symbol, and that a 
lower-level symbol must respect a higher-level symbol (respect meaning it cannot mask it)?

> Pig parser does not detect when a macro name masks a UDF name
> -
>
> Key: PIG-2248
> URL: https://issues.apache.org/jira/browse/PIG-2248
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9.0
>Reporter: Alan Gates
>Assignee: Johnny Zhang
>Priority: Minor
> Attachments: PIG-2248.patch.txt, PIG-2248.patch.txt, 
> PIG-2248.patch.txt, PIG-2248.patch.txt
>
>
> Pig accepts a macro like:
> {code}
> define COUNT(in_relation, min_gpa) returns c {
>     b = filter $in_relation by gpa >= $min_gpa;
>     $c = foreach b generate age, name;
> }
> {code}
> This should produce a warning that it is masking a UDF.



Re: CHANGES.txt in trunk

2013-05-03 Thread Alan Gates
What do you mean by remove?  They should still be in the file.  They may need to be 
relocated under the 0.11 section.  But the trunk CHANGES file should include 
all changes that are on trunk.

Alan.

On May 3, 2013, at 1:34 PM, Rohini Palaniswamy wrote:

> Hi,
>   I see lot of patches that went into 0.11 are under trunk in the
> CHANGES.txt. Should we sync the file with the CHANGES.txt in branch-0.11
> and remove those jiras from trunk that went into 0.11? What is the usual
> process of updating CHANGES.txt when a jira is checked both into a branch
> and also trunk?
> 
> Regards,
> Rohini



[jira] [Commented] (PIG-2248) Pig parser does not detect when a macro name masks a UDF name

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648831#comment-13648831
 ] 

Daniel Dai commented on PIG-2248:
-

Actually, I feel the check in the patch is very involved. Now we check UDFs 
(after going through all the classloader tricks), but how about default/define? How 
about keywords? We would need to exhaustively check everything and make sure there is 
no conflict. On the other hand, how about defining the order in which Pig resolves a 
symbol, and documenting it clearly?

At this moment I would prefer documenting the symbol resolving order, unless a 
cleaner solution is proposed.

> Pig parser does not detect when a macro name masks a UDF name
> -
>
> Key: PIG-2248
> URL: https://issues.apache.org/jira/browse/PIG-2248
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9.0
>Reporter: Alan Gates
>Assignee: Johnny Zhang
>Priority: Minor
> Attachments: PIG-2248.patch.txt, PIG-2248.patch.txt, 
> PIG-2248.patch.txt, PIG-2248.patch.txt
>
>
> Pig accepts a macro like:
> {code}
> define COUNT(in_relation, min_gpa) returns c {
>     b = filter $in_relation by gpa >= $min_gpa;
>     $c = foreach b generate age, name;
> }
> {code}
> This should produce a warning that it is masking a UDF.



[jira] [Commented] (PIG-3287) MultiQueryOptimizer can prevent CombinerOptimizer from working

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648815#comment-13648815
 ] 

Daniel Dai commented on PIG-3287:
-

Yes, we weight multiquery over combiner, though it is not optimal for some 
cases. Please use "-M" to disable multiquery. Multiquery might be improved to 
allow the combiner, but that is a new feature and involves non-trivial work.

> MultiQueryOptimizer can prevent CombinerOptimizer from working
> --
>
> Key: PIG-3287
> URL: https://issues.apache.org/jira/browse/PIG-3287
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Christon DeWan
>
> The CombinerOptimizer does not operate on the script below. As a result, all 
> work is done in the reducer(s), killing performance. Removing one STORE or 
> refactoring the query to use a single FOREACH after the group allows the 
> CombinerOptimizer to work.
> {noformat}
> %declare DUMMY `bash -c '(for (( i=0; \$i < 10; i++ )); do echo \$i 5; done) 
> | hadoop fs -put - /tmp/test_data.tsv; true'`
> s = LOAD '/tmp/test_data.tsv' USING PigStorage(' ') AS (n:long, g:long);
> grouped = GROUP s BY g;
> counted = FOREACH grouped GENERATE flatten($0), COUNT_STAR($1);
> STORE counted INTO '/tmp/test_count';
> summed = FOREACH grouped GENERATE flatten($0), SUM($1.n);
> STORE summed INTO '/tmp/test_sum';
> FS -rmr /tmp/test_{data.tsv,count,sum}
> {noformat}



[jira] [Commented] (PIG-3293) Casting fails after Union from two data sources&loaders

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648805#comment-13648805
 ] 

Daniel Dai commented on PIG-3293:
-

The "caster" in D's POCast must be null. Can you attach MyLoader?

> Casting fails after Union from two data sources&loaders
> ---
>
> Key: PIG-3293
> URL: https://issues.apache.org/jira/browse/PIG-3293
> Project: Pig
>  Issue Type: Bug
>Reporter: Koji Noguchi
>Priority: Minor
>
> Script similar to 
> {noformat}
> A = load 'data1' using MyLoader() as (a:bytearray);
> B = load 'data2' as (a:bytearray);
> C = union onschema A,B;
> D = foreach C generate (chararray)a;
> Store D into './out';
> {noformat}
> fails with 
>java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: 
> ERROR 1075: Received a bytearray from the UDF. Cannot determine how to 
> convert the bytearray to string.
> Both MyLoader and PigStorage use the default Utf8StorageConverter.



[jira] [Commented] (PIG-1824) Support import modules in Jython UDF

2013-05-03 Thread Martin Gerlach (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648793#comment-13648793
 ] 

Martin Gerlach commented on PIG-1824:
-

Doesn't work for me, either (with codecs module). Pig version is 0.10.0-cdh4.1.2


> Support import modules in Jython UDF
> 
>
> Key: PIG-1824
> URL: https://issues.apache.org/jira/browse/PIG-1824
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Richard Ding
>Assignee: Woody Anderson
> Fix For: 0.10.0
>
> Attachments: 1824a.patch, 1824b.patch, 1824c.patch, 1824d.patch, 
> 1824_final.patch, 1824.patch, 1824x.patch, 
> TEST-org.apache.pig.test.TestGrunt.txt, 
> TEST-org.apache.pig.test.TestScriptLanguage.txt, 
> TEST-org.apache.pig.test.TestScriptUDF.txt
>
>
> Currently, Jython UDF script doesn't support Jython import statement as in 
> the following example:
> {code}
> #!/usr/bin/python
> import re
> @outputSchema("word:chararray")
> def resplit(content, regex, index):
>     return re.compile(regex).split(content)[index]
> {code}
> Can Pig automatically locate the Jython module file and ship it to the 
> backend? Or should we add a ship clause to let user explicitly specify the 
> module to ship? 
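
To try the function from the description outside Pig, something like the following Python sketch may help. It assumes that Pig supplies the outputSchema decorator when it loads a Jython script; the stand-in defined here is hypothetical and exists only so the file runs under a plain interpreter.

```python
import re


def outputSchema(schema):
    """Hypothetical stand-in for the decorator Pig provides at load time."""
    def decorator(func):
        # Record the schema string so it can be inspected in a local test.
        func.outputSchema = schema
        return func
    return decorator


@outputSchema("word:chararray")
def resplit(content, regex, index):
    # Split content on the regular expression and return one piece.
    return re.compile(regex).split(content)[index]
```

The stand-in should be removed before the script is registered with Pig, so that it does not shadow the real decorator. This only exercises the function logic locally and says nothing about whether the imported module is shipped to the backend, which is the question raised in this issue.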



Re: CHANGES.txt in trunk

2013-05-03 Thread Rohini Palaniswamy
I will put up the patch, Daniel.

Thanks,
Rohini


On Fri, May 3, 2013 at 1:38 PM, Daniel Dai  wrote:

> Sure, I used to clean this up before release, but not strictly follow this
> rule. Patch welcome.
>
> Thanks,
> Daniel
>
>
> On Fri, May 3, 2013 at 1:34 PM, Rohini Palaniswamy
> wrote:
>
> > Hi,
> >I see lot of patches that went into 0.11 are under trunk in the
> > CHANGES.txt. Should we sync the file with the CHANGES.txt in branch-0.11
> > and remove those jiras from trunk that went into 0.11? What is the usual
> > process of updating CHANGES.txt when a jira is checked both into a branch
> > and also trunk?
> >
> > Regards,
> > Rohini
> >
>


[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-05-03 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648788#comment-13648788
 ] 

Koji Noguchi commented on PIG-3251:
---

bq. FYI, couple of tests from TestBZip are failing after applying my patch. 
Looking.

3 tests failed.  
{noformat}
Testcase: testBZ2Concatenation took 38.266 sec
  FAILED
Expected exception: java.io.IOException
junit.framework.AssertionFailedError: Expected exception: java.io.IOException

Testcase: testBlockHeaderEndingWithCR took 49.539 sec
  FAILED
expected:<82094> but was:<82093>
junit.framework.AssertionFailedError: expected:<82094> but was:<82093>
  at org.apache.pig.test.TestBZip.testCount(TestBZip.java:256)
  at org.apache.pig.test.TestBZip.testBlockHeaderEndingWithCR(TestBZip.java:112)

Testcase: testBlockHeaderEndingAtSplitNotByteAligned took 48.996 sec
  FAILED
expected:<74999> but was:<101591>
junit.framework.AssertionFailedError: expected:<74999> but was:<101591>
  at org.apache.pig.test.TestBZip.testCount(TestBZip.java:256)
  at org.apache.pig.test.TestBZip.testBlockHeaderEndingAtSplitNotByteAligned(TestBZip.java:88)
{noformat}

The "testBZ2Concatenation" failure is expected, since Hadoop's bzip2 codec 
handles concatenated bzip files (whereas Pig's TestBZip checks that reading 
them reliably fails).
The other two are worrisome to me.  I am asking my colleague to check.  It'll 
take some time.  Depending on what we find, we may need to change the condition 
for using Hadoop's bzip codec.
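
To make "concatenated bzip files" concrete, here is a minimal sketch. It uses 
Python's standard bz2 module purely as a stand-in; it is an assumption-laden 
illustration and says nothing about how Hadoop's codec or Pig's TestBZip are 
actually implemented.

```python
import bz2

# Two independently compressed bzip2 streams glued together,
# as produced by something like `cat a.bz2 b.bz2 > both.bz2`.
part_a = bz2.compress(b"first\n")
part_b = bz2.compress(b"second\n")
concatenated = part_a + part_b

# A multi-stream-aware reader returns the data of both streams
# (bz2.decompress has supported multiple streams since Python 3.3).
whole = bz2.decompress(concatenated)

# A single-stream reader stops at the end of the first stream and
# leaves the remaining bytes untouched in `unused_data`.
single = bz2.BZ2Decompressor()
first_only = single.decompress(concatenated)
leftover = single.unused_data
```

A reader of the second kind sees only part of the input, which is the sort of 
difference a test like testBZ2Concatenation would be sensitive to.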


> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's 
> Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream) 
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and 
> Text was 160MBytes.  
> We can probably eliminate one of them.



Re: CHANGES.txt in trunk

2013-05-03 Thread Daniel Dai
Sure. I used to clean this up before a release, but I did not strictly follow
this rule. Patches are welcome.

Thanks,
Daniel


On Fri, May 3, 2013 at 1:34 PM, Rohini Palaniswamy
wrote:

> Hi,
>I see lot of patches that went into 0.11 are under trunk in the
> CHANGES.txt. Should we sync the file with the CHANGES.txt in branch-0.11
> and remove those jiras from trunk that went into 0.11? What is the usual
> process of updating CHANGES.txt when a jira is checked both into a branch
> and also trunk?
>
> Regards,
> Rohini
>


Re: Function To Compute Product of Values in Bag

2013-05-03 Thread Sergey Goder
Thanks for the tip about numerical accuracy issues and the elegant solution
exploiting log/exp. It is very much appreciated.

Sergey


On Fri, May 3, 2013 at 11:42 AM, Kai Londenberg <
kai.londenb...@googlemail.com> wrote:

> Hi,
>
> Just a hint: It's usually better to work with log probabilities and sum
> over them, than to work with raw probabilities and to use
> multiplication. You might easily run into numerical accuracy issues
> otherwise.
>
> i.e. exploit this fact:
>
> product(x1, ..., xn) = exp(sum(log(x1), ..., log(xn)))
>
> best,
>
> Kai Londenberg
>
> 2013/5/3 Sergey Goder :
> > I'm creating a multinomial naive bayes classifier using pig and need to
> > compute the product of probabilities. There are an arbitrary number of
> > values in the bag so I would like to be able to use a function similar to
> > the builtin SUM to do this. I looked through the source code and found
> that
> > with some really simple changes to SUM.java I can create a PROD.java
> > function. I included it in my piggybank and have been using it
> successfully.
> >
> > I was curious what the community thought about including this function
> as a
> > builtin function in a future release? Or would it make more sense to keep
> > this function as a udf in a piggybank.
> >
> > Thanks,
> > Sergey
>


CHANGES.txt in trunk

2013-05-03 Thread Rohini Palaniswamy
Hi,
   I see that a lot of patches that went into 0.11 are listed under trunk in
CHANGES.txt. Should we sync the file with the CHANGES.txt in branch-0.11
and remove from trunk those jiras that went into 0.11? What is the usual
process for updating CHANGES.txt when a jira is checked into both a branch
and trunk?

Regards,
Rohini


[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648776#comment-13648776
 ] 

Daniel Dai commented on PIG-3251:
-

bq. Or, are you suggesting I create two silly wrappers instead of one? 
No need, if there is no easy way then forget about it.

> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's 
> Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream) 
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and 
> Text was 160MBytes.  
> We can probably eliminate one of them.



[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-05-03 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648745#comment-13648745
 ] 

Koji Noguchi commented on PIG-3251:
---

FYI, a couple of tests from TestBZip are failing after applying my patch.  
Looking into it.

> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's 
> Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream) 
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and 
> Text was 160MBytes.  
> We can probably eliminate one of them.



Re: Review Request: PIG-3223 AvroStorage does not handle comma separated input paths

2013-05-03 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10351/#review20138
---

Ship it!


Thanks Johnny. Looks good.

- Rohini Palaniswamy


On May 3, 2013, 7:30 p.m., Johnny Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10351/
> ---
> 
> (Updated May 3, 2013, 7:30 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> ---
> 
> we want to support comma separated input paths in AvroStorage, for example
> "test_dir1/test_glob1.avro,test_dir1/test_glob2.avro,test_dir1/test_glob3.avro"
> "test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro"
> 
> 
> This addresses bug PIG-3223.
> https://issues.apache.org/jira/browse/PIG-3223
> 
> 
> Diffs
> -
> 
>   
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java
>  0ac0225 
>   
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  bd7a6d2 
> 
> Diff: https://reviews.apache.org/r/10351/diff/
> 
> 
> Testing
> ---
> 
> added two more test cases in TestAvroStorage.java and they all pass
> 
> 
> Thanks,
> 
> Johnny Zhang
> 
>



[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-05-03 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3251:
--

Attachment: pig-3251-trunk-v05.patch

Thanks Daniel.

bq. is the patch ready?
Ah, I forgot to flag it as patch available.

bq. can we just cache splittable?
Makes complete sense.  Changing.

bq. Is it possible to wrap a codec deal with both bz2/bz? 
As far as I understand, Hadoop has a 1-to-1 mapping between codec and extension. 
I don't know of a way to map multiple extensions to one codec.
Or, are you suggesting I create two silly wrappers instead of one? 



> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's 
> Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream) 
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and 
> Text was 160MBytes.  
> We can probably eliminate one of them.



[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-05-03 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang updated PIG-3223:
--

Attachment: PIG-3223.patch.txt

The latest patch addresses Rohini's comments on Review Board.

> AvroStorage does not handle comma separated input paths
> ---
>
> Key: PIG-3223
> URL: https://issues.apache.org/jira/browse/PIG-3223
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0, 0.11
>Reporter: Michael Kramer
>Assignee: Johnny Zhang
> Attachments: AvroStorage.patch, AvroStorage.patch-2, 
> AvroStorageUtils.patch, AvroStorageUtils.patch-2, PIG-3223.patch.txt, 
> PIG-3223.patch.txt, PIG-3223.patch.txt, PIG-3223.patch.txt, PIG-3223.patch.txt
>
>
> In pig 0.11, a patch was issued to AvroStorage to support globs and comma 
> separated input paths (PIG-2492).  While this function works fine for 
> glob-formatted input paths, it fails when issued a standard comma separated 
> list of paths.  fs.globStatus does not seem to be able to parse out such a 
> list, and a java.net.URISyntaxException is thrown when toURI is called on the 
> path.  
> I have a working fix for this, but it's extremely ugly (basically checking if 
> the string of input paths is globbed, otherwise splitting on ",").  I'm sure 
> there's a more elegant solution.  I'd be happy to post the relevant methods 
> and "fixes" if necessary.
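
The workaround described above (split on "," and glob only the parts that 
contain glob characters) could be sketched roughly as follows. This is a Python 
illustration only, not the attached patch: the name expand_input_paths, the 
trimming of whitespace around commas, and the use of the local filesystem's 
glob module instead of fs.globStatus are all assumptions. Commas inside brace 
patterns such as {a,b} are not handled.

```python
import glob

# Characters that mark a path component as a glob pattern (assumption:
# brace patterns are ignored in this sketch).
GLOB_CHARS = set("*?[")


def expand_input_paths(spec):
    """Turn a comma separated path list into a flat list of paths.

    Parts containing glob characters are expanded against the local
    filesystem; plain parts are passed through unchanged.
    """
    paths = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray or trailing commas
        if GLOB_CHARS.intersection(part):
            paths.extend(sorted(glob.glob(part)))
        else:
            paths.append(part)
    return paths
```

Splitting first and globbing each part afterwards lets mixed inputs such as 
"test_dir1/*, test_dir2/test_glob4.avro" go through a single code path.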



Re: Review Request: PIG-3223 AvroStorage does not handle comma separated input paths

2013-05-03 Thread Johnny Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10351/
---

(Updated May 3, 2013, 7:30 p.m.)


Review request for pig.


Changes
---

Latest patch, which addresses Rohini's comments. Thanks.


Description
---

we want to support comma separated input paths in AvroStorage, for example
"test_dir1/test_glob1.avro,test_dir1/test_glob2.avro,test_dir1/test_glob3.avro"
"test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro"


This addresses bug PIG-3223.
https://issues.apache.org/jira/browse/PIG-3223


Diffs (updated)
-

  
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java
 0ac0225 
  
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
 bd7a6d2 

Diff: https://reviews.apache.org/r/10351/diff/


Testing
---

added two more test cases in TestAvroStorage.java and they all pass


Thanks,

Johnny Zhang



Re: Review Request: PIG-3223 AvroStorage does not handle comma separated input paths

2013-05-03 Thread Johnny Zhang


> On May 3, 2013, 6:53 p.m., Rohini Palaniswamy wrote:
> >

Thanks for the comments, Rohini! I appreciate it. I will post the revised patch 
very soon.


- Johnny


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10351/#review20127
---


On May 3, 2013, 12:33 a.m., Johnny Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10351/
> ---
> 
> (Updated May 3, 2013, 12:33 a.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> ---
> 
> we want to support comma separated input paths in AvroStorage, for example
> "test_dir1/test_glob1.avro,test_dir1/test_glob2.avro,test_dir1/test_glob3.avro"
> "test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro"
> 
> 
> This addresses bug PIG-3223.
> https://issues.apache.org/jira/browse/PIG-3223
> 
> 
> Diffs
> -
> 
>   
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java
>  0ac0225 
>   
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  bd7a6d2 
> 
> Diff: https://reviews.apache.org/r/10351/diff/
> 
> 
> Testing
> ---
> 
> added two more test cases in TestAvroStorage.java and they all pass
> 
> 
> Thanks,
> 
> Johnny Zhang
> 
>



Re: Review Request: PIG-3223 AvroStorage does not handle comma separated input paths

2013-05-03 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10351/#review20127
---



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java


Can you move these comments inside the method, or remove them, since they can 
be seen from the code? They should not be part of the Javadoc.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java


for (FileStatus file : matchedFiles) {
  getAllSubDirsInternal(file.getPath(), conf, paths, fs);
}



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java


private



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java


 getAllSubDirsInternal(sub.getPath(), conf, paths, fs)

Doing fs.listStatus(file.getPath()) twice is redundant


- Rohini Palaniswamy


On May 3, 2013, 12:33 a.m., Johnny Zhang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10351/
> ---
> 
> (Updated May 3, 2013, 12:33 a.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> ---
> 
> we want to support comma separated input paths in AvroStorage, for example
> "test_dir1/test_glob1.avro,test_dir1/test_glob2.avro,test_dir1/test_glob3.avro"
> "test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro"
> 
> 
> This addresses bug PIG-3223.
> https://issues.apache.org/jira/browse/PIG-3223
> 
> 
> Diffs
> -
> 
>   
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java
>  0ac0225 
>   
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  bd7a6d2 
> 
> Diff: https://reviews.apache.org/r/10351/diff/
> 
> 
> Testing
> ---
> 
> added two more test cases in TestAvroStorage.java and they all pass
> 
> 
> Thanks,
> 
> Johnny Zhang
> 
>



[jira] [Updated] (PIG-3309) TestJsonLoaderStorage fails with IBM JDK 6/7

2013-05-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3309:


   Resolution: Fixed
Fix Version/s: (was: 0.11.2)
   0.12
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Leonardo!

> TestJsonLoaderStorage fails with IBM JDK 6/7
> 
>
> Key: PIG-3309
> URL: https://issues.apache.org/jira/browse/PIG-3309
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.11.1
> Environment: IBM J9 VM 1.6.0 SR13 FP1; IBM J9 VM 1.7.0 SR4 FP1
>Reporter: Leonardo Rangel Augusto
>Assignee: Leonardo Rangel Augusto
>Priority: Minor
>  Labels: patch, test
> Fix For: 0.12
>
> Attachments: PIG-TestJsonStorageLoader.patch
>
>
> TestJsonLoaderStorage fails due to small differences in the way HashMaps are 
> printed out. The HashMap specification 
> (http://docs.oracle.com/javase/1.5.0/docs/api/java/util/HashMap.html) 
> mentions that "This class makes no guarantees as to the order of the map; in 
> particular, it does not guarantee that the order will remain constant over 
> time.", so PIG testcases should not rely on the order in which the HashMap 
> items are printed out.
> testJsonLoaderStorage1 explicitly does this comparison:
> Testcase: testJsonLoaderStorage1 took 2.25 sec
> FAILED
> expected:<...3":"c","key2":"b","key1":"a"}}
> {"a0":2,"a1":[{"a10":6,"a11":"cat"},{"a10":7,"a11":"dog"},{"a10":8,"a11":"pig"}],"a2":{"a20":2.3,"a21":"moon"},"a3":{"key4":"value4","key1":"value1...>
>  but was:<...1":"a","key2":"b","key3":"c"}}
> {"a0":2,"a1":[{"a10":6,"a11":"cat"},{"a10":7,"a11":"dog"},{"a10":8,"a11":"pig"}],"a2":{"a20":2.3,"a21":"moon"},"a3":{"key1":"value1","key4":"value4...>
> junit.framework.ComparisonFailure: expected:<...3":"c","key2":"b","key1":"a"}}
> {"a0":2,"a1":[{"a10":6,"a11":"cat"},{"a10":7,"a11":"dog"},{"a10":8,"a11":"pig"}],"a2":{"a20":2.3,"a21":"moon"},"a3":{"key4":"value4","key1":"value1...>
>  but was:<...1":"a","key2":"b","key3":"c"}}
> {"a0":2,"a1":[{"a10":6,"a11":"cat"},{"a10":7,"a11":"dog"},{"a10":8,"a11":"pig"}],"a2":{"a20":2.3,"a21":"moon"},"a3":{"key1":"value1","key4":"value4...>
> at org.apache.pig.test.TestJsonLoaderStorage.testJsonLoaderStorage1(TestJsonLoaderStorage.java:63)
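
One way to avoid depending on map ordering is to compare parsed structures 
rather than raw strings. The sketch below is in Python and only illustrates the 
idea; it is not the attached patch, and the two literals are shortened stand-ins 
for the expected/actual values in the failure above.

```python
import json

# Same key/value pairs, serialized in two different orders.
expected = '{"a3":{"key3":"c","key2":"b","key1":"a"}}'
actual = '{"a3":{"key1":"a","key2":"b","key3":"c"}}'

# A plain string comparison depends on iteration order and fails here.
strings_equal = expected == actual

# Comparing the parsed objects ignores key order inside maps.
parsed_equal = json.loads(expected) == json.loads(actual)
```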



[jira] [Updated] (PIG-3309) TestJsonLoaderStorage fails with IBM JDK 6/7

2013-05-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3309:


Assignee: Leonardo Rangel Augusto

> TestJsonLoaderStorage fails with IBM JDK 6/7
> 
>
> Key: PIG-3309
> URL: https://issues.apache.org/jira/browse/PIG-3309
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.11.1
> Environment: IBM J9 VM 1.6.0 SR13 FP1; IBM J9 VM 1.7.0 SR4 FP1
>Reporter: Leonardo Rangel Augusto
>Assignee: Leonardo Rangel Augusto
>Priority: Minor
>  Labels: patch, test
> Fix For: 0.11.2
>
> Attachments: PIG-TestJsonStorageLoader.patch
>
>
> TestJsonLoaderStorage fails due to small differences in the way HashMaps are 
> printed out. The HashMap specification 
> (http://docs.oracle.com/javase/1.5.0/docs/api/java/util/HashMap.html) 
> mentions that "This class makes no guarantees as to the order of the map; in 
> particular, it does not guarantee that the order will remain constant over 
> time.", so PIG testcases should not rely on the order in which the HashMap 
> items are printed out.
> testJsonLoaderStorage1 explicitly does this comparison:
> Testcase: testJsonLoaderStorage1 took 2.25 sec
> FAILED
> expected:<...3":"c","key2":"b","key1":"a"}}
> {"a0":2,"a1":[{"a10":6,"a11":"cat"},{"a10":7,"a11":"dog"},{"a10":8,"a11":"pig"}],"a2":{"a20":2.3,"a21":"moon"},"a3":{"key4":"value4","key1":"value1...>
>  but was:<...1":"a","key2":"b","key3":"c"}}
> {"a0":2,"a1":[{"a10":6,"a11":"cat"},{"a10":7,"a11":"dog"},{"a10":8,"a11":"pig"}],"a2":{"a20":2.3,"a21":"moon"},"a3":{"key1":"value1","key4":"value4...>
> junit.framework.ComparisonFailure: expected:<...3":"c","key2":"b","key1":"a"}}
> {"a0":2,"a1":[{"a10":6,"a11":"cat"},{"a10":7,"a11":"dog"},{"a10":8,"a11":"pig"}],"a2":{"a20":2.3,"a21":"moon"},"a3":{"key4":"value4","key1":"value1...>
>  but was:<...1":"a","key2":"b","key3":"c"}}
> {"a0":2,"a1":[{"a10":6,"a11":"cat"},{"a10":7,"a11":"dog"},{"a10":8,"a11":"pig"}],"a2":{"a20":2.3,"a21":"moon"},"a3":{"key1":"value1","key4":"value4...>
> at org.apache.pig.test.TestJsonLoaderStorage.testJsonLoaderStorage1(TestJsonLoaderStorage.java:63)



Re: Function To Compute Product of Values in Bag

2013-05-03 Thread Kai Londenberg
Hi,

Just a hint: It's usually better to work with log probabilities and sum
over them, than to work with raw probabilities and to use
multiplication. You might easily run into numerical accuracy issues
otherwise.

i.e. exploit this fact:

product(x1, ..., xn) = exp(sum(log(x1), ..., log(xn)))
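
A small sketch of why this matters, in Python rather than as a Pig UDF. The 
function names and the sample values are made up for illustration, and the 
identity only holds for strictly positive inputs, so zeros would need separate 
handling.

```python
import math


def log_product(values):
    """Return log(x1 * ... * xn) as a sum of logs (all values must be > 0)."""
    return sum(math.log(x) for x in values)


def product_via_logs(values):
    """Return x1 * ... * xn computed as exp(sum(log(xi)))."""
    return math.exp(log_product(values))


# 100 probabilities of 1e-5 each: the true product is 1e-500, which is
# smaller than the smallest positive double, so plain multiplication
# underflows to 0.0.
probs = [1e-5] * 100
naive = 1.0
for p in probs:
    naive *= p

# The log-domain score stays finite and can be compared across classes.
log_score = log_product(probs)
```

For a classifier it is usually enough to compare the log scores directly, 
without ever calling exp on them.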

best,

Kai Londenberg

2013/5/3 Sergey Goder :
> I'm creating a multinomial naive bayes classifier using pig and need to
> compute the product of probabilities. There are an arbitrary number of
> values in the bag so I would like to be able to use a function similar to
> the builtin SUM to do this. I looked through the source code and found that
> with some really simple changes to SUM.java I can create a PROD.java
> function. I included it in my piggybank and have been using it successfully.
>
> I was curious what the community thought about including this function as a
> builtin function in a future release? Or would it make more sense to keep
> this function as a udf in a piggybank.
>
> Thanks,
> Sergey


[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648664#comment-13648664
 ] 

Daniel Dai commented on PIG-3251:
-

Hi, [~knoguchi], is the patch ready? Some comments for pig-3251-trunk-v04.patch:
1. Caching the job in PigStorage seems confusing; can we just cache splittable?
2. Is it possible to wrap a codec that deals with both bz2 and bz? I just feel it 
might be less confusing than wrapping bz2 alone and using PigTextInputFormat for bz.

> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's 
> Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream) 
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and 
> Text was 160MBytes.  
> We can probably eliminate one of them.



Function To Compute Product of Values in Bag

2013-05-03 Thread Sergey Goder
I'm creating a multinomial naive bayes classifier using pig and need to
compute the product of probabilities. There are an arbitrary number of
values in the bag so I would like to be able to use a function similar to
the builtin SUM to do this. I looked through the source code and found that
with some really simple changes to SUM.java I can create a PROD.java
function. I included it in my piggybank and have been using it successfully.

I was curious what the community thought about including this function as a
builtin function in a future release? Or would it make more sense to keep
this function as a udf in a piggybank?

Thanks,
Sergey


[jira] [Updated] (PIG-3307) Refactor physical operators to remove methods parameters that are always null

2013-05-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3307:
---

Attachment: PIG-3307_2.patch

PIG-3307_2.patch removes the unused parameter in getNext(\*)


> Refactor physical operators to remove methods parameters that are always null
> -
>
> Key: PIG-3307
> URL: https://issues.apache.org/jira/browse/PIG-3307
> Project: Pig
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Attachments: PIG-3307_0.patch, PIG-3307_1.patch, PIG-3307_2.patch
>
>
> The physical operators are sometimes overly complex. I'm trying to clean up 
> some unnecessary code.
> In particular, there is an array of getNext(*T* v) methods where the value v 
> does not seem to have any importance and is just used to pick the correct method.
> I have started a refactoring for a more readable getNext*T*().



[jira] [Commented] (PIG-3307) Refactor physical operators to remove methods parameters that are always null

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648594#comment-13648594
 ] 

Daniel Dai commented on PIG-3307:
-

Any performance implication for this change?

> Refactor physical operators to remove methods parameters that are always null
> -
>
> Key: PIG-3307
> URL: https://issues.apache.org/jira/browse/PIG-3307
> Project: Pig
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Attachments: PIG-3307_0.patch, PIG-3307_1.patch
>
>
> The physical operators are sometimes overly complex. I'm trying to clean up 
> some unnecessary code.
> In particular, there is an array of getNext(*T* v) methods where the value v 
> does not seem to have any importance and is just used to pick the correct method.
> I have started a refactoring for a more readable getNext*T*().



[jira] [Commented] (PIG-2586) A better plan/data flow visualizer

2013-05-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648591#comment-13648591
 ] 

Daniel Dai commented on PIG-2586:
-

Looks good. In the schedule, can we get some sample pages first before actual 
implementation?

> A better plan/data flow visualizer
> --
>
> Key: PIG-2586
> URL: https://issues.apache.org/jira/browse/PIG-2586
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>  Labels: gsoc2013
>
> Pig supports a dot graph style plan to visualize the 
> logical/physical/mapreduce plan (explain with -dot option, see 
> http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). 
> However, the dot graph takes an extra step to generate, and the quality of 
> the output is not good. It would be better to implement a better 
> visualizer for Pig. It should:
> 1. show operator type and alias
> 2. turn on/off output schema
> 3. dive into foreach inner plan on demand
> 4. provide a way to show operator source code, e.g., a tooltip on an operator 
> (plans don't currently have this information, but you can assume it is in 
> place)
> 5. besides visualizing the logical/physical/mapreduce plan, visualizing the 
> script itself is also useful
> 6. may rely on some Java graphics library such as Swing
> This is a candidate project for Google summer of code 2013. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2013



[jira] [Updated] (PIG-3308) Storing data in hive columnar rc format

2013-05-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3308:


   Resolution: Fixed
Fix Version/s: (was: 0.10.1)
   0.12
 Assignee: Marcin Czech
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Piggybank tests pass. Committed to trunk. Thanks Marcin!

> Storing data in hive columnar rc format
> ---
>
> Key: PIG-3308
> URL: https://issues.apache.org/jira/browse/PIG-3308
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Affects Versions: 0.10.1
>Reporter: Marcin Czech
>Assignee: Marcin Czech
>  Labels: patch
> Fix For: 0.12
>
> Attachments: PIG-3308.patch
>
>
> I've coded HiveColumnarStorage, which can store Pig structures as Hive 
> columnar RC tables. The code is based on Elephant-bird's RCFilePigStorage. The 
> difference is that data is stored in a Hive-friendly format, so the files can 
> be read from Hive. 
> Example Pig schema:
> {code}
> f1:tuple (f11: chararray,f12: chararray),f2:map[]
> {code}
> Hive schema:
> {code}
> CREATE TABLE sample_table (f1 struct, f2 
> array>)
> PARTITIONED BY (p string) 
> STORED AS RCFILE 
> {code}
> or as a:
> {code}
> CREATE TABLE sample_table (f1 struct, f2 MAP 
> )
> PARTITIONED BY (p string) 
> STORED AS RCFILE 
> {code}



[jira] [Updated] (PIG-3308) Storing data in hive columnar rc format

2013-05-03 Thread Marcin Czech (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Czech updated PIG-3308:
--

Attachment: PIG-3308.patch

Yes, my mistake. The problem was only in the test file; it should be fine now. 
The code is compatible with Hadoop 0.20 and 0.23.

> Storing data in hive columnar rc format
> ---
>
> Key: PIG-3308
> URL: https://issues.apache.org/jira/browse/PIG-3308
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Affects Versions: 0.10.1
>Reporter: Marcin Czech
>  Labels: patch
> Fix For: 0.10.1
>
> Attachments: PIG-3308.patch
>
>



[jira] [Updated] (PIG-3308) Storing data in hive columnar rc format

2013-05-03 Thread Marcin Czech (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Czech updated PIG-3308:
--

Attachment: (was: PIG-3308.patch)

> Storing data in hive columnar rc format
> ---
>
> Key: PIG-3308
> URL: https://issues.apache.org/jira/browse/PIG-3308
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Affects Versions: 0.10.1
>Reporter: Marcin Czech
>  Labels: patch
> Fix For: 0.10.1
>
>



[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-05-03 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648464#comment-13648464
 ] 

Koji Noguchi commented on PIG-3251:
---

bq. Using hadoop's bzip codec on 0.23/2.0 would have an additional benefit of 
having native codec. (HADOOP-8462)

I learned that the native bzip codec does not yet support splitting (it falls 
back to the Java version for splits).

> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat consumes memory in both 
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and the actual Text that is returned as the line.
> For example, with one 160 MByte record, the buffer was 268 MBytes and 
> the Text was 160 MBytes.
> We can probably eliminate one of them.
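
The 268 MByte figure is consistent with a buffer that grows by doubling. The sketch below is only a model of that behaviour, not the Pig code: it assumes the buffer starts at 32 bytes (the usual ByteArrayOutputStream default) and doubles until the record fits. The function name doubled_capacity is made up for this illustration.

```python
def doubled_capacity(record_bytes, initial_capacity=32):
    """Smallest capacity reachable by repeated doubling that holds the record."""
    capacity = initial_capacity
    while capacity < record_bytes:
        capacity *= 2
    return capacity


record = 160 * 1000 * 1000                 # one 160 MByte record, as in the report
buffer_capacity = doubled_capacity(record)

print(buffer_capacity)                     # 268435456, i.e. 2**28 bytes
print(buffer_capacity + record)            # bytes held while buffer and Text coexist
```

Under these assumptions the buffer alone is about 1.7 times the record size, and the copy into Text adds the record size again, which is why removing either copy would noticeably lower the peak.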



[jira] [Updated] (PIG-3310) ImplicitSplitInserter does not generate new uids for nested schema fields, leading to miscomputations

2013-05-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PIG-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clément Stenac updated PIG-3310:


Attachment: generate-uid-for-nested-fields.patch

> ImplicitSplitInserter does not generate new uids for nested schema fields, 
> leading to miscomputations
> -
>
> Key: PIG-3310
> URL: https://issues.apache.org/jira/browse/PIG-3310
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
> Environment: Reproduced on 0.10.1, 0.11.1 and trunk
>Reporter: Clément Stenac
> Attachments: generate-uid-for-nested-fields.patch
>
>


[jira] [Created] (PIG-3310) ImplicitSplitInserter does not generate new uids for nested schema fields, leading to miscomputations

2013-05-03 Thread JIRA
Clément Stenac created PIG-3310:
---

 Summary: ImplicitSplitInserter does not generate new uids for 
nested schema fields, leading to miscomputations
 Key: PIG-3310
 URL: https://issues.apache.org/jira/browse/PIG-3310
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
 Environment: Reproduced on 0.10.1, 0.11.1 and trunk
Reporter: Clément Stenac


Hi,

Consider the following example:

{code}
inp = LOAD '$INPUT' AS (memberId:long, shopId:long, score:int);

tuplified = FOREACH inp GENERATE (memberId, shopId) AS tuplify, score;

D1 = FOREACH tuplified GENERATE tuplify.memberId AS memberId, tuplify.shopId AS shopId, score AS score;
D2 = FOREACH tuplified GENERATE tuplify.memberId AS memberId, tuplify.shopId AS shopId, score AS score;

J = JOIN D1 BY shopId, D2 BY shopId;
K = FOREACH J GENERATE D1::memberId AS member_id1, D2::memberId AS member_id2, D1::shopId AS shop;

EXPLAIN K;
DUMP K;
{code}

It looks a bit odd written this way, but it provides a minimal reproduction 
case (in the real case, the "tuplified" step came from a multi-key grouping).

On input data:
{code}
1   1001   101
1   1002   103
1   1003   102
1   1004   102
2   1005   101
2   1003   101
2   1002   123
3   1042   101
3   1005   101
3   1002   133
{code}

This gives incorrect output such as:
{code}
(1,1001,1001)
(1,1002,1002)
(1,1002,1002)
(1,1002,1002)
{code}
The second column should be a member id (1,2,3,4,5), not a shop id.
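
For comparison, the join can be emulated outside Pig. This is only a sketch: it assumes the sample rows are three tab-separated columns (memberId, shopId, score), and the helper name self_join_on_shop is made up for the illustration.

```python
from collections import defaultdict

# Sample rows as (memberId, shopId, score); the column split is an assumption.
rows = [
    (1, 1001, 101), (1, 1002, 103), (1, 1003, 102), (1, 1004, 102),
    (2, 1005, 101), (2, 1003, 101), (2, 1002, 123),
    (3, 1042, 101), (3, 1005, 101), (3, 1002, 133),
]


def self_join_on_shop(rows):
    """Return (member_id1, member_id2, shop) for every pair sharing a shopId."""
    by_shop = defaultdict(list)
    for member_id, shop_id, _score in rows:
        by_shop[shop_id].append(member_id)
    return sorted(
        (m1, m2, shop)
        for shop, members in by_shop.items()
        for m1 in members
        for m2 in members
    )


expected = self_join_on_shop(rows)
for row in expected:
    print(row)
```

In this emulation the second column only ever holds member ids, for example (1, 2, 1002) and (1, 3, 1002) for shop 1002, whereas the Pig output above repeats the shop id there.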

In the original case, there was a FILTER (member_id1 < member_id2) after K, and 
the computation failed because the PushUpFilter optimization mistakenly moved 
the LOFilter operation before the join, to a place where it tried to operate on 
a tuple and failed.

My understanding of the issue is that when the ImplicitSplitInserter creates 
the LOSplitOutputs, it will correctly reset the schema, and the LOSplitOutput 
will regenerate uids for the fields of D1 and D2 ... but will not do that on 
the tuple members.

The logical plan after the ImplicitSplitInserter will look like this (simplified):

{code}
|---D1: (Name: LOForEach Schema: memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[127]ColumnPrune:OutputUids=[125, 124]
    |---tuplified: (Name: LOSplitOutput Schema: tuplify#127:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[127]
        |---tuplified: (Name: LOSplit Schema: tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123]
|---D2: (Name: LOForEach Schema: memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[130]ColumnPrune:OutputUids=[125, 124]
    |---tuplified: (Name: LOSplitOutput Schema: tuplify#130:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[130]
        |---tuplified: (Name: LOSplit Schema: tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123]
{code}

tuplified correctly gets a new uid (127 and 130) but the members of the tuple 
don't. When they get reprojected, both branches have the same uid and the join 
looks like:
{code}
|---J: (Name: LOJoin(HASH) Schema: D1::memberId#124:long,D1::shopId#125:long,D2::memberId#139:long,D2::shopId#132:long)ColumnPrune:InputUids=[125, 124, 132]ColumnPrune:OutputUids=[125, 124, 132]
|   |
|   shopId:(Name: Project Type: long Uid: 125 Input: 0 Column: 1)
|   |
|   shopId:(Name: Project Type: long Uid: 125 Input: 1 Column: 1)
{code}

If, for example, we project "memberId+0" instead of reprojecting "memberId", a 
new node is created, and ultimately the two branches of the join correctly get 
separate uids.

My understanding is that LOSplitOutput.getSchema() should recurse on nested 
schema fields. However, I only have a light understanding of all of the logical 
plan handling, so I may be completely wrong.
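
The idea of recursing into nested schema fields can be sketched with a toy model. Nothing below is Pig code: the Field class, the regenerate helper and the uid counter starting at 200 are all assumptions made for the illustration, and only the uids 123, 124 and 125 come from the plan above.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional

_uid_counter = itertools.count(200)  # arbitrary start for fresh uids (assumption)


@dataclass
class Field:
    name: str
    uid: int
    children: Optional[List["Field"]] = None  # set only for tuple-typed fields


def regenerate(field, recurse):
    """Copy a field with a fresh uid; refresh nested fields only if recurse is set."""
    children = field.children
    if children is not None and recurse:
        children = [regenerate(child, recurse) for child in children]
    return Field(field.name, next(_uid_counter), children)


def all_uids(field):
    """Collect the uid of a field and of everything nested inside it."""
    uids = {field.uid}
    for child in field.children or []:
        uids |= all_uids(child)
    return uids


tuplify = Field("tuplify", 123, [Field("memberId", 124), Field("shopId", 125)])

# Top-level only: both split outputs keep the nested uids 124 and 125.
shallow_a, shallow_b = regenerate(tuplify, False), regenerate(tuplify, False)
print(sorted(all_uids(shallow_a) & all_uids(shallow_b)))  # [124, 125]

# Recursive: the two split outputs no longer share any uid.
deep_a, deep_b = regenerate(tuplify, True), regenerate(tuplify, True)
print(sorted(all_uids(deep_a) & all_uids(deep_b)))        # []
```

In this model the shared uids in the shallow case are exactly what lets both join inputs resolve to the same column once the tuple members are reprojected.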

Attached is a draft patch and a test reproducing the issue. Unfortunately, I 
haven't been able to run all unit tests with the "fix" (I get some weird hangs).

I'd be grateful if you could tell me whether this looks like completely the 
wrong way to fix the issue.
