[jira] [Updated] (PIG-2397) Running TPC-H Benchmark on Pig

2012-12-07 Thread Jie Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Li updated PIG-2397:


Summary: Running TPC-H Benchmark on Pig  (was: Running TPC-H on Pig)

> Running TPC-H Benchmark on Pig
> --
>
> Key: PIG-2397
> URL: https://issues.apache.org/jira/browse/PIG-2397
> Project: Pig
>  Issue Type: Task
>Reporter: Jie Li
> Attachments: pig_tpch.ppt, TPC-H_on_Pig.tgz
>
>
> For a class project we developed a whole set of Pig scripts for TPC-H. Our 
> goals are:
> 1) identifying the bottlenecks of Pig's performance especially of its 
> relational operators,
> 2) studying how to write efficient scripts by making full use of Pig Latin's 
> features,
> 3) comparing with Hive's TPC-H results for verifying both 1) and 2).
> We will update the JIRA with our scripts, results and analysis soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Subscription: PIG patch available

2012-12-07 Thread jira
Issue Subscription
Filter: PIG patch available (35 issues)

Subscriber: pigdaily

Key       Summary
PIG-3078  Make a UDF that, given a string, returns just the columns prefixed by that string
          https://issues.apache.org/jira/browse/PIG-3078
PIG-3075  Allow AvroStorage STORE Operations To Use Schema Specified By URI
          https://issues.apache.org/jira/browse/PIG-3075
PIG-3073  POUserFunc creating log spam for large scripts
          https://issues.apache.org/jira/browse/PIG-3073
PIG-3069  Native Windows Compatibility for Pig E2E Tests and Harness
          https://issues.apache.org/jira/browse/PIG-3069
PIG-3067  HBaseStorage should be split up to become more managable
          https://issues.apache.org/jira/browse/PIG-3067
PIG-3066  Fix TestPigRunner in trunk
          https://issues.apache.org/jira/browse/PIG-3066
PIG-3057  make readField protected to be able to override it if we extend PigStorage
          https://issues.apache.org/jira/browse/PIG-3057
PIG-3051  java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
          https://issues.apache.org/jira/browse/PIG-3051
PIG-3033  test-patch failed with javadoc warnings
          https://issues.apache.org/jira/browse/PIG-3033
PIG-3029  TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
          https://issues.apache.org/jira/browse/PIG-3029
PIG-3028  testGrunt dev test needs some command filters to run correctly without cygwin
          https://issues.apache.org/jira/browse/PIG-3028
PIG-3027  pigTest unit test needs a newline filter for comparisons of golden multi-line
          https://issues.apache.org/jira/browse/PIG-3027
PIG-3026  Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
          https://issues.apache.org/jira/browse/PIG-3026
PIG-3025  TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
          https://issues.apache.org/jira/browse/PIG-3025
PIG-3024  TestEmptyInputDir unit test - hadoop version detection logic is brittle
          https://issues.apache.org/jira/browse/PIG-3024
PIG-3015  Rewrite of AvroStorage
          https://issues.apache.org/jira/browse/PIG-3015
PIG-3010  Allow UDF's to flatten themselves
          https://issues.apache.org/jira/browse/PIG-3010
PIG-2959  Add a pig.cmd for Pig to run under Windows
          https://issues.apache.org/jira/browse/PIG-2959
PIG-2957  TetsScriptUDF fail due to volume prefix in jar
          https://issues.apache.org/jira/browse/PIG-2957
PIG-2956  Invalid cache specification for some streaming statement
          https://issues.apache.org/jira/browse/PIG-2956
PIG-2955  Fix bunch of Pig e2e tests on Windows
          https://issues.apache.org/jira/browse/PIG-2955
PIG-2878  Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitive.
          https://issues.apache.org/jira/browse/PIG-2878
PIG-2873  Converting bin/pig shell script to python
          https://issues.apache.org/jira/browse/PIG-2873
PIG-2834  MultiStorage requires unused constructor argument
          https://issues.apache.org/jira/browse/PIG-2834
PIG-2824  Pushing checking number of fields into LoadFunc
          https://issues.apache.org/jira/browse/PIG-2824
PIG-2661  Pig uses an extra job for loading data in Pigmix L9
          https://issues.apache.org/jira/browse/PIG-2661
PIG-2645  PigSplit does not handle the case where SerializationFactory returns null
          https://issues.apache.org/jira/browse/PIG-2645
PIG-2614  AvroStorage crashes on LOADING a single bad error
          https://issues.apache.org/jira/browse/PIG-2614
PIG-2507  Semicolon in paramenters for UDF results in parsing error
          https://issues.apache.org/jira/browse/PIG-2507
PIG-2433  Jython import module not working if module path is in classpath
          https://issues.apache.org/jira/browse/PIG-2433
PIG-2417  Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
          https://issues.apache.org/jira/browse/PIG-2417
PIG-2362  Rework Ant build.xml to use macrodef instead of antcall
          https://issues.apache.org/jira/browse/PIG-2362
PIG-2312  NPE when relation and column share the same name and used in Nested Foreach
          https://issues.apache.org/jira/browse/PIG-2312
PIG-1942  script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
          https://issues.apache.org/jira/browse/PIG-1942
PIG-1237  Piggybank MutliStorage - specify field to write in output
          https://issues.apache.org/jira/browse/PIG-1237

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterS

[jira] [Commented] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-07 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526925#comment-13526925
 ] 

Dmitriy V. Ryaboy commented on PIG-3020:


Are the manifest changes related?

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Julien Le Dem
> Attachments: PIG-3020.patch
>
>
> The following validates OK with pig 0.9 and fails with the following error 
> in 0.11 (and I suspect 0.10)
> pig -c debug2.pig
> Script: debug2.pig
> {noformat}
> A = LOAD 'foo' AS (group:tuple(uid, dst_id), uids_with_recs:bag{} , uids_with_flock:bag{});
> edges_both = FILTER A BY NOT IsEmpty(uids_with_recs) AND NOT IsEmpty(uids_with_flock);
> edges_both = FOREACH edges_both GENERATE
>     group.uid AS src_id,
>     group.dst_id AS dst_id;
> both_counts = GROUP edges_both BY src_id;
> both_counts = FOREACH both_counts GENERATE
>     group AS src_id, SIZE(edges_both) AS size_both;
> edges_bq = FILTER A BY NOT IsEmpty(uids_with_recs);
> edges_bq = FOREACH edges_bq GENERATE
>     group.uid AS src_id,
>     group.dst_id AS dst_id;
> bq_counts = GROUP edges_bq BY src_id;
> bq_counts = FOREACH bq_counts GENERATE
>     group AS src_id, SIZE(edges_bq) AS size_bq;
> per_user_set_sizes = JOIN bq_counts BY src_id LEFT OUTER, both_counts BY src_id;
> store per_user_set_sizes into  'foo';
> {noformat}
> Error:
> {noformat}
> ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to explain alias null
>   at org.apache.pig.PigServer.explain(PigServer.java:999)
>   at org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:398)
>   at org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:330)
>   at org.apache.pig.tools.grunt.Grunt.checkScript(Grunt.java:98)
>   at org.apache.pig.Main.run(Main.java:600)
>   at org.apache.pig.Main.main(Main.java:154)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
>   at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
>   at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:277)
>   at org.apache.pig.PigServer.compilePp(PigServer.java:1322)
>   at org.apache.pig.PigServer.explain(PigServer.java:984)
>   ... 10 more
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
>   at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:232)
>   at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:105)
>   at org.apache.pig.newplan.logical.relational.LOJoin.accept(LOJoin.java:171)
>   at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>   at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>   ... 13 more
> {noformat}
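
To make the ERROR 2270 message above concrete, here is a minimal Python sketch that parses the schema string from the error and reports which uid is shared by more than one field. It is only an illustration of the kind of check the message describes, not Pig's SchemaResetter code; the function name `duplicate_uids` is hypothetical, and the parser assumes a flat schema of `alias::name#uid:type` fields (no nested tuples or bags).

```python
from collections import defaultdict


def duplicate_uids(schema: str) -> dict:
    """Return {uid: [field names]} for every uid that appears more than once.

    Hypothetical helper: assumes a flat, comma-separated schema string whose
    fields look like "alias::name#uid:type", as printed in the error above.
    """
    fields_by_uid = defaultdict(list)
    for field in schema.split(","):
        # Split off the trailing ":type", then the "#uid" before it.
        name_and_uid, _type = field.rsplit(":", 1)
        name, uid = name_and_uid.rsplit("#", 1)
        fields_by_uid[int(uid)].append(name)
    return {uid: names for uid, names in fields_by_uid.items() if len(names) > 1}


# Schema string copied from the ERROR 2270 message.
schema = (
    "bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,"
    "both_counts::src_id#417:bytearray,both_counts::size_both#480:long"
)
print(duplicate_uids(schema))
```

Both `src_id` fields carry uid 417, which is consistent with the two sides of the join being derived from the same LOAD statement.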



Re: Our release process

2012-12-07 Thread Julien Le Dem
Here are my criteria for inclusion in a release branch:
 - No new features. Only bug fixes.
 - The criterion is more about stability than priority. The person/group
asking for it has a good reason for wanting it in the branch. If committers
think the patch is reasonable and won't make the branch unstable then we
should check it in. If it breaks something anyway, we revert it.

For what it's worth we (at Twitter) maintain an internal branch where we
add patches we need, and I would suggest that anybody who wants to be able to
make emergency fixes to their own deployment do the same. We do keep
that branch as close to Apache as we can, but it has a few patches that are
in trunk only and do not satisfy the no-new-feature criterion.

What does the PMC think?

Julien




On Tue, Dec 4, 2012 at 12:46 PM, Olga Natkovich wrote:

> I am ok with tests running nightly and reverting patches that cause
> failures. We used to have that. Does anybody know what happened? Is anybody
> volunteering to make it work again?
>
> I would like to see specific criteria for what goes into the branch be
> published (rather than case-by-case). This way each team can decide if the
> criteria are stringent enough or if they need to run a private branch.
>
> Olga
>
>   --
> *From:* Santhosh M S 
> *To:* Julien Le Dem ; "dev@pig.apache.org" <
> dev@pig.apache.org>
> *Cc:* "billgra...@gmail.com" 
> *Sent:* Friday, November 30, 2012 11:46 PM
>
> *Subject:* Re: Our release process
>
> Hi Julien,
>
> You are making most of the points that I did on this thread (CI for e2e,
> not burdening clean e2e prior to every commit for a release branch). The
> only point on which there is no clear agreement is the definition of a bug
> that can be included in a previously released branch. I am fine with a case
> by case inclusion.
>
> Hi Olga,
>
> Are you fine with Julien's proposal as it stands: bugs that are included
> will be determined at the time of inclusion instead of doing it now?
>
> Santhosh
>
>
> 
> From: Julien Le Dem 
> To: dev@pig.apache.org; Santhosh M S 
> Cc: "billgra...@gmail.com" 
> Sent: Friday, November 30, 2012 5:37 PM
> Subject: Re: Our release process
>
> Proposed criteria:
> - it makes the tests fail. targets test-commit + test + e2e tests
> - a critical bug is reported in a short time frame (definition of
> critical not needed as it is rare and can be decided on a case by case
> basis)
>
> That raises another question: what are the existing CI servers running
> the tests?
> - the Apache CI runs test-commit and test (is it more stable now?)
> and not e2e. It would be great if it did.
> - we have a Jenkins build at Twitter where we run test-commit and
> test, we could not run e2e easily in our environment.
> - I understand there's a Yahoo/Hortonworks build (test-commit + test + e2e
> ???)
>
> Whenever those builds fail we should open or reopen JIRAs and fix it.
>
> The time it takes to run the full test suite makes it impractical to
> run on a desktop/laptop.
>
> For the release Pig-0.11.0 we need to get this list of JIRAs down to 0
> and publish the jar.
>
> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+PIG+AND+fixVersion+%3D+%220.11%22+AND+resolution+%3D+Unresolved+ORDER+BY+updated+DESC%2C+due+ASC%2C+priority+DESC
>
> Julien
>
> On Thu, Nov 29, 2012 at 11:16 PM, Santhosh M S wrote:
> > Looks like everyone is interested in having frequent releases - I don't
> > see anyone disagreeing with that.
> >
> > Regarding "If a patch makes the release branch unstable, we revert it" -
> > what are the criteria? If we can't decide on the criteria on this thread
> > (already pretty long) then let's get the release trains going. We can
> > revisit the criteria for inclusion of bug fixes when that happens.
> >
> > Santhosh
> >
> >
> > 
> >  From: Julien Le Dem 
> > To: dev@pig.apache.org; Santhosh M S 
> > Cc: "billgra...@gmail.com" 
> > Sent: Thursday, November 29, 2012 9:45 AM
> > Subject: Re: Our release process
> >
> > The release branch receives only bug fixes. Patch level releases (3rd
> > version number) are issued out of the release branch and introduce
> > only bug fixes and no new features.
> > Deciding whether a patch is applied to the release branch is based on
> > preserving stability (as Bill said). If a patch makes the release
> > branch unstable, we revert it.
> > New features are added to trunk where new major and minor releases will
> > happen.
> > If we need a new feature out then we make a new minor release.
> > Doing frequent releases is the industry standard and will resolve
> > conflicts around what should go in a release branch.
> >
> > Making a new release is currently painful *because* we wait so long in
> > between two releases. Let's fix that.
> >
> > Julien
> >
> > On Wed, Nov 28, 2012 at 10:09 PM, Santhosh M S wrote:
> >> Since releasing a major version once a month is aggressive and w

[jira] [Commented] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-07 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526864#comment-13526864
 ] 

Jonathan Coveney commented on PIG-3020:
---

This looks good to me, though I wonder if there is anyone who knows this code 
better who can take a look.

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Julien Le Dem
> Attachments: PIG-3020.patch
>
>



[jira] [Resolved] (PIG-3084) Improve exceptions messages in POPackage

2012-12-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PIG-3084.


   Resolution: Fixed
Fix Version/s: 0.12

> Improve exceptions messages in POPackage
> 
>
> Key: PIG-3084
> URL: https://issues.apache.org/jira/browse/PIG-3084
> Project: Pig
>  Issue Type: Bug
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 0.12
>
> Attachments: PIG-3084_1.patch, PIG-3084.patch
>
>




[jira] [Updated] (PIG-3084) Improve exceptions messages in POPackage

2012-12-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3084:
---

Attachment: PIG-3084_1.patch

PIG-3084_1.patch: same patch with whitespace adjusted

> Improve exceptions messages in POPackage
> 
>
> Key: PIG-3084
> URL: https://issues.apache.org/jira/browse/PIG-3084
> Project: Pig
>  Issue Type: Bug
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Attachments: PIG-3084_1.patch, PIG-3084.patch
>
>




[jira] [Commented] (PIG-3084) Improve exceptions messages in POPackage

2012-12-07 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526848#comment-13526848
 ] 

Jonathan Coveney commented on PIG-3084:
---

+1

> Improve exceptions messages in POPackage
> 
>
> Key: PIG-3084
> URL: https://issues.apache.org/jira/browse/PIG-3084
> Project: Pig
>  Issue Type: Bug
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Attachments: PIG-3084.patch
>
>




[jira] [Updated] (PIG-3084) Improve exceptions messages in POPackage

2012-12-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3084:
---

Attachment: PIG-3084.patch

better exception in PIG-3084.patch

> Improve exceptions messages in POPackage
> 
>
> Key: PIG-3084
> URL: https://issues.apache.org/jira/browse/PIG-3084
> Project: Pig
>  Issue Type: Bug
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Attachments: PIG-3084.patch
>
>




[jira] [Created] (PIG-3084) Improve exceptions messages in POPackage

2012-12-07 Thread Julien Le Dem (JIRA)
Julien Le Dem created PIG-3084:
--

 Summary: Improve exceptions messages in POPackage
 Key: PIG-3084
 URL: https://issues.apache.org/jira/browse/PIG-3084
 Project: Pig
  Issue Type: Bug
Reporter: Julien Le Dem
Assignee: Julien Le Dem






[jira] [Updated] (PIG-3078) Make a UDF that, given a string, returns just the columns prefixed by that string

2012-12-07 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3078:
--

Assignee: Jonathan Coveney
  Status: Patch Available  (was: Open)

> Make a UDF that, given a string, returns just the columns prefixed by that 
> string
> -
>
> Key: PIG-3078
> URL: https://issues.apache.org/jira/browse/PIG-3078
> Project: Pig
>  Issue Type: Bug
>Reporter: Jonathan Coveney
>Assignee: Jonathan Coveney
> Fix For: 0.12
>
> Attachments: PIG-3078-0.patch
>
>
> This comes up fairly often, usually as the result of a join. Given that the 
> resulting schema has the relation name prepended to each column name, a UDF 
> of the following form could give just the columns from the desired relation:
> Pluck('relation_name', *)
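
As a rough illustration of the behaviour described above, the following Python sketch keeps only the values whose field name starts with a given relation prefix. It models the idea only and is not the code in PIG-3078-0.patch; the function `pluck` and its arguments are hypothetical, and field names are assumed to use the `relation::column` form that a join produces.

```python
def pluck(prefix, field_names, row):
    """Keep only the values whose field name starts with "<prefix>::".

    Hypothetical model of the proposed UDF: `field_names` and `row` are
    parallel sequences describing one joined tuple.
    """
    wanted = prefix + "::"
    return tuple(
        value for name, value in zip(field_names, row) if name.startswith(wanted)
    )


# A joined tuple with columns from relations "a" and "b".
names = ["a::x", "a::y", "b::x", "b::y"]
row = (1, "foo", 1, "bar")
print(pluck("a", names, row))
```

Matching on `prefix + "::"` rather than on the bare prefix avoids accidentally selecting columns from a relation whose name merely begins with the same characters.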



[jira] [Updated] (PIG-3078) Make a UDF that, given a string, returns just the columns prefixed by that string

2012-12-07 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3078:
--

Attachment: PIG-3078-0.patch

> Make a UDF that, given a string, returns just the columns prefixed by that 
> string
> -
>
> Key: PIG-3078
> URL: https://issues.apache.org/jira/browse/PIG-3078
> Project: Pig
>  Issue Type: Bug
>Reporter: Jonathan Coveney
> Fix For: 0.12
>
> Attachments: PIG-3078-0.patch
>
>
> This comes up fairly often, usually as the result of a join. Given that the 
> resulting schema has the relation name prepended to each column name, a UDF 
> of the following form could give just the columns from the desired relation:
> Pluck('relation_name', *)



[jira] [Updated] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3020:
---

Attachment: PIG-3020.patch

PIG-3020.patch fixes the issue


> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Julien Le Dem
> Attachments: PIG-3020.patch
>
>



[jira] [Created] (PIG-3083) Introduce new syntax that let's you project just the columns that come from a given :: prefix

2012-12-07 Thread Jonathan Coveney (JIRA)
Jonathan Coveney created PIG-3083:
-

 Summary: Introduce new syntax that let's you project just the 
columns that come from a given :: prefix
 Key: PIG-3083
 URL: https://issues.apache.org/jira/browse/PIG-3083
 Project: Pig
  Issue Type: Bug
Reporter: Jonathan Coveney
 Fix For: 0.12


This is basically a more refined approach than PIG-3078, but it is also more 
work. That JIRA is more of a stopgap until we do something like this.

The idea would be to support something like the following:

a = load 'a' as (x,y,z);
b = load 'b'  as (x,y,z);
c = join a by x, b by x;
d = foreach c generate a::*;

Obviously this is useful for any case where you have relations with columns 
with various prefixes.
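
A small Python sketch of how such a projection could be resolved at plan time, under stated assumptions: the joined schema names every field `relation::column`, and `a::*` expands to the fields carrying that prefix, in schema order. The helpers `join_schema` and `expand_prefix_star` are hypothetical and do not correspond to anything in Pig's parser.

```python
def join_schema(relations):
    """Build the field names of a join: every column gets "<alias>::" prepended.

    `relations` maps a relation alias to its list of column names.
    """
    return [f"{alias}::{col}" for alias, cols in relations.items() for col in cols]


def expand_prefix_star(projection, schema):
    """Expand a projection such as "a::*" into the matching field names.

    Anything that does not end in "::*" is returned unchanged as a single field.
    """
    if not projection.endswith("::*"):
        return [projection]
    prefix = projection[:-1]  # "a::*" -> "a::"
    return [name for name in schema if name.startswith(prefix)]


# Mirrors the script above: c = join a by x, b by x; d = foreach c generate a::*;
c = join_schema({"a": ["x", "y", "z"], "b": ["x", "y", "z"]})
print(expand_prefix_star("a::*", c))
```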



[jira] [Commented] (PIG-3044) Trigger POPartialAgg compaction under GC pressure

2012-12-07 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526600#comment-13526600
 ] 

Thejas M Nair commented on PIG-3044:


bq. I would even say we remove the % memory budget as the Spillable mechanism 
is more reliable and much simpler.

The % memory budget was introduced for SelfSpillBag because the 
Spillable mechanism didn't always work well: the cleanup was often 
triggered too late. So I think it is better to use the Spillable mechanism here to 
spill earlier if necessary, as the patch is doing.


> Trigger POPartialAgg compaction under GC pressure
> -
>
> Key: PIG-3044
> URL: https://issues.apache.org/jira/browse/PIG-3044
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0, 0.11, 0.10.1
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.11, 0.12
>
> Attachments: PIG-3044.2.diff, PIG-3404.diff
>
>
> If partial aggregation is turned on in Pig 0.10 and 0.11, 20% (by default) of 
> the available heap can be consumed by the POPartialAgg operator. This can 
> cause memory issues for jobs that already use all, or nearly all, of the heap.
> If we make POPartialAgg "spillable" (trigger compaction when memory reduction 
> is required), we would be much nicer to high-memory jobs.
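
The idea of compacting on a memory-pressure signal can be sketched as follows. 
This is not the PIG-3044 patch and not POPartialAgg: CompactingSumBuffer and 
onMemoryPressure are hypothetical names, the aggregation is assumed to be a 
simple sum, and the memory manager that would invoke the callback is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a partial-aggregation buffer; not Pig's POPartialAgg.
final class CompactingSumBuffer {
    // Raw values buffered per key until they are compacted.
    private final Map<String, List<Long>> pending = new HashMap<>();

    void add(String key, long value) {
        pending.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Number of values currently held across all keys.
    int bufferedValues() {
        int count = 0;
        for (List<Long> values : pending.values()) {
            count += values.size();
        }
        return count;
    }

    // Assumed to be called by a memory manager that wants memory back.
    // Collapses each key's values into one partial sum and returns how many
    // buffered values were released.
    int onMemoryPressure() {
        int before = bufferedValues();
        for (Map.Entry<String, List<Long>> entry : pending.entrySet()) {
            List<Long> compacted = new ArrayList<>();
            compacted.add(sum(entry.getValue()));
            entry.setValue(compacted);
        }
        return before - bufferedValues();
    }

    // Partial sum for a key, or 0 if the key has not been seen.
    long partialSum(String key) {
        List<Long> values = pending.get(key);
        return values == null ? 0L : sum(values);
    }

    private static long sum(List<Long> values) {
        long total = 0L;
        for (long v : values) {
            total += v;
        }
        return total;
    }
}
```

Compaction leaves the per-key results unchanged and only shrinks what is held 
in memory, which is why it can be triggered at any point the heap is tight.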



[jira] [Updated] (PIG-2878) Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitiv

2012-12-07 Thread Shami B (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shami B updated PIG-2878:
-

Attachment: PIG-2878.patch

Please find the attached equalIgnoreCase UDF.

> Pig current releases lack a UDF equalIgnoreCase.This function returns a 
> Boolean value indicating whether string left is equal to string right. This 
> check is case insensitive.
> --
>
> Key: PIG-2878
> URL: https://issues.apache.org/jira/browse/PIG-2878
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0
>Reporter: Arjun K R
>  Labels: features
> Attachments: PIG-2878.patch
>
>
> Current Pig releases lack an equalIgnoreCase UDF. This function returns a 
> Boolean value indicating whether the left string is equal to the right 
> string. The check is case-insensitive.
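
The comparison itself can be sketched in plain Java. This is not the attached 
patch: EqualsIgnoreCaseSketch and compare are hypothetical names, the wrapping 
into a Pig EvalFunc is omitted, and returning null when either input is null 
is an assumption about the desired behaviour.

```java
// Hypothetical helper showing only the comparison logic; a real Pig UDF
// would wrap this in an EvalFunc, which is omitted here.
final class EqualsIgnoreCaseSketch {
    private EqualsIgnoreCaseSketch() {}

    // Returns null if either input is null (assumed convention), otherwise
    // whether the two strings are equal ignoring case.
    static Boolean compare(String left, String right) {
        if (left == null || right == null) {
            return null;
        }
        return left.equalsIgnoreCase(right);
    }
}
```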



[jira] [Updated] (PIG-2878) Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitiv

2012-12-07 Thread Shami B (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shami B updated PIG-2878:
-

Status: Patch Available  (was: Open)

> Pig current releases lack a UDF equalIgnoreCase.This function returns a 
> Boolean value indicating whether string left is equal to string right. This 
> check is case insensitive.
> --
>
> Key: PIG-2878
> URL: https://issues.apache.org/jira/browse/PIG-2878
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0
>Reporter: Arjun K R
>  Labels: features
> Attachments: PIG-2878.patch
>
>
> Current Pig releases lack an equalIgnoreCase UDF. This function returns a 
> Boolean value indicating whether the left string is equal to the right 
> string. The check is case-insensitive.



[jira] [Commented] (PIG-2645) PigSplit does not handle the case where SerializationFactory returns null

2012-12-07 Thread Shami B (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526384#comment-13526384
 ] 

Shami B commented on PIG-2645:
--

Please find the attached patch after incorporating the review comments.

> PigSplit does not handle the case where SerializationFactory returns null
> -
>
> Key: PIG-2645
> URL: https://issues.apache.org/jira/browse/PIG-2645
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.10.0
>Reporter: Alex Levenson
>  Labels: patch
> Attachments: patch_2645.patch, PIG-2645.patch
>
>
> In PigSplit.java, line 254:
> {code}
> SerializationFactory sf = new SerializationFactory(conf);
> Serializer s = sf.getSerializer(wrappedSplits[0].getClass());
> s.open((OutputStream) os);
> {code}
> sf.getSerializer returns null when it cannot find a serializer for a given 
> object. Instead of handling this properly, an NPE is thrown when s.open() is 
> called.
> This is easy to encounter when creating a custom InputSplit from the 
> mapreduce package, which is an abstract class that DOES NOT implement Writable.
> However, it's easy to miss because InputSplit from the mapred package is an 
> interface that extends Writable, and InputSplits often both extend the new 
> abstract class and implement the old interface (thereby becoming Writable).
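
One way to handle the null case is to fail early with a descriptive exception 
instead of letting s.open() throw the NPE. The sketch below is not the 
attached patch: NullSerializerGuard, SerializerLookup and requireSerializer 
are hypothetical names, and the two nested interfaces are minimal stand-ins 
for Hadoop's Serializer and SerializationFactory so that the example is 
self-contained.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical guard around the serializer lookup; not code from PigSplit.
final class NullSerializerGuard {
    private NullSerializerGuard() {}

    // Minimal stand-in for Hadoop's Serializer (assumption for this sketch).
    interface Serializer {
        void open(OutputStream out) throws IOException;
    }

    // Minimal stand-in for SerializationFactory.getSerializer (assumption).
    interface SerializerLookup {
        Serializer getSerializer(Class<?> type);
    }

    // Returns the serializer, or throws an IOException naming the class
    // when the lookup returns null.
    static Serializer requireSerializer(SerializerLookup lookup, Class<?> splitClass)
            throws IOException {
        Serializer serializer = lookup.getSerializer(splitClass);
        if (serializer == null) {
            throw new IOException("Could not find a serializer for "
                    + splitClass.getName()
                    + "; does the split class implement Writable?");
        }
        return serializer;
    }
}
```

With a guard like this, a custom InputSplit that is not Writable would produce 
an error message naming the offending class rather than a bare NPE.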



[jira] [Updated] (PIG-2645) PigSplit does not handle the case where SerializationFactory returns null

2012-12-07 Thread Shami B (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shami B updated PIG-2645:
-

Attachment: patch_2645.patch

Please find the patch with the review comments incorporated.

> PigSplit does not handle the case where SerializationFactory returns null
> -
>
> Key: PIG-2645
> URL: https://issues.apache.org/jira/browse/PIG-2645
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.10.0
>Reporter: Alex Levenson
>  Labels: patch
> Attachments: patch_2645.patch, PIG-2645.patch
>
>
> In PigSplit.java, line 254:
> {code}
> SerializationFactory sf = new SerializationFactory(conf);
> Serializer s = sf.getSerializer(wrappedSplits[0].getClass());
> s.open((OutputStream) os);
> {code}
> sf.getSerializer returns null when it cannot find a serializer for a given 
> object. Instead of handling this properly, an NPE is thrown when s.open() is 
> called.
> This is easy to encounter when creating a custom InputSplit from the 
> mapreduce package, which is an abstract class that DOES NOT implement Writable.
> However, it's easy to miss because InputSplit from the mapred package is an 
> interface that extends Writable, and InputSplits often both extend the new 
> abstract class and implement the old interface (thereby becoming Writable).



Build failed in Jenkins: Pig-trunk #1372

2012-12-07 Thread Apache Jenkins Server
See 

Changes:

[dvryaboy] PIG-3044: Trigger POPartialAgg compaction under GC pressure

--
[...truncated 37006 lines...]
[junit] at 
org.apache.pig.test.MiniGenericCluster.shutdownMiniDfsAndMrClusters(MiniGenericCluster.java:77)
[junit] at 
org.apache.pig.test.MiniGenericCluster.shutDown(MiniGenericCluster.java:68)
[junit] at 
org.apache.pig.test.TestStore.oneTimeTearDown(TestStore.java:138)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
[junit] at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
[junit] at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
[junit] at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
[junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
[junit] at 
junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:38)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768)
[junit] 12/12/07 10:35:38 WARN datanode.FSDatasetAsyncDiskService: 
AsyncDiskService has already shut down.
[junit] Shutting down DataNode 2
[junit] 12/12/07 10:35:38 INFO mortbay.log: Stopped 
SelectChannelConnector@localhost:0
[junit] 12/12/07 10:35:38 INFO ipc.Server: Stopping server on 51280
[junit] 12/12/07 10:35:38 INFO ipc.Server: IPC Server handler 0 on 51280: 
exiting
[junit] 12/12/07 10:35:38 INFO ipc.Server: IPC Server handler 2 on 51280: 
exiting
[junit] 12/12/07 10:35:38 INFO ipc.Server: Stopping IPC Server listener on 
51280
[junit] 12/12/07 10:35:38 INFO ipc.Server: Stopping IPC Server Responder
[junit] 12/12/07 10:35:38 INFO ipc.Server: IPC Server handler 1 on 51280: 
exiting
[junit] 12/12/07 10:35:38 INFO metrics.RpcInstrumentation: shut down
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: Waiting for threadgroup 
to exit, active threads is 1
[junit] 12/12/07 10:35:38 WARN datanode.DataNode: 
DatanodeRegistration(127.0.0.1:44102, 
storageID=DS-1221353071-67.195.138.20-44102-1354876091518, infoPort=36499, 
ipcPort=51280):DataXceiveServer:java.nio.channels.AsynchronousCloseException
[junit] at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:185)
[junit] at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:159)
[junit] at 
sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:84)
[junit] at 
org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:131)
[junit] at java.lang.Thread.run(Thread.java:662)
[junit] 
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: Exiting DataXceiveServer
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: Scheduling block 
blk_-8881638348520421667_1184 file 
build/test/data/dfs/data/data4/current/blk_-8881638348520421667 for deletion
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: Scheduling block 
blk_6056207649161818773_1186 file 
build/test/data/dfs/data/data4/current/blk_6056207649161818773 for deletion
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: Deleted block 
blk_-8881638348520421667_1184 at file 
build/test/data/dfs/data/data4/current/blk_-8881638348520421667
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: Scheduling block 
blk_7998852423316239706_1185 file 
build/test/data/dfs/data/data3/current/blk_7998852423316239706 for deletion
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: Deleted block 
blk_6056207649161818773_1186 at file 
build/test/data/dfs/data/data4/current/blk_6056207649161818773
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: Deleted block 
blk_7998852423316239706_1185 at file 
build/test/data/dfs/data/data3/current/blk_7998852423316239706
[junit] 12/12/07 10:35:38 INFO datanode.DataBlockScanner: Exiting 
DataBlockScanner thread.
[junit] 12/12/07 10:35:38 INFO datanode.DataNode: 
DatanodeRegistration(127.0.0.1:44102, 
storageID=DS-1221353071-67.195.138.20-44102-1354876091518, infoPort=36499, 
ipcPort=51280):Finishing DataNode in: 
FSDataset{dirpath='