[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

2012-12-06 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526129#comment-13526129
 ] 

Russell Jurney commented on PIG-2614:
-

See PIG-3059

> AvroStorage crashes on LOADING a single bad error
> -
>
> Key: PIG-2614
> URL: https://issues.apache.org/jira/browse/PIG-2614
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0, 0.11
>Reporter: Russell Jurney
>Assignee: Jonathan Coveney
>  Labels: avro, avrostorage, bad, book, cutting, doug, for, my, 
> pig, sadism
> Fix For: 0.11, 0.10.1
>
> Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, 
> test_avro_files.tar.gz
>
>
> AvroStorage dies when a single bad record exists, such as one with missing 
> fields.  This is very bad on 'big data,' where bad records are inevitable.  
> See discussion at 
> http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
>  for more theory.
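
One way to read the request above is that a loader could skip records that fail to parse and give up only once the share of bad records crosses a limit. The sketch below is illustrative Python, not AvroStorage code: `load_tolerant`, `parse_user`, and `max_bad_ratio` are hypothetical names, and the 1% default is an arbitrary assumption.

```python
def load_tolerant(raw_records, parse, max_bad_ratio=0.01):
    """Parse records, skipping bad ones unless too many fail.

    Hypothetical helper: `parse` is any callable that raises ValueError
    (or KeyError for a missing field) on a bad record.
    """
    good, bad, total = [], 0, 0
    for raw in raw_records:
        total += 1
        try:
            good.append(parse(raw))
        except (ValueError, KeyError):
            bad += 1  # count the bad record and keep going
    if total and bad / total > max_bad_ratio:
        raise RuntimeError(f"{bad} of {total} records were bad")
    return good, bad


def parse_user(raw):
    # Toy parser: a record is valid only if it has both fields.
    return {"name": raw["name"], "age": int(raw["age"])}


rows = [{"name": "a", "age": "1"}, {"name": "b"}, {"name": "c", "age": "3"}]
good, bad = load_tolerant(rows, parse_user, max_bad_ratio=0.5)
print(len(good), bad)
```

With the default limit the same three rows would raise, since one bad record out of three is well above 1%.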

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-06 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526132#comment-13526132
 ] 

Russell Jurney commented on PIG-3015:
-

See PIG-3059

> Rewrite of AvroStorage
> --
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joseph Adler
>Assignee: Joseph Adler
> Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni (as 
> TrevniStorage).
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-06 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015.patch

Added test cases for TrevniStorage (and made sure the test cases all pass)

> Rewrite of AvroStorage
> --
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joseph Adler
>Assignee: Joseph Adler
> Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-06 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Description: 
The current AvroStorage implementation has a lot of issues: it requires old 
versions of Avro, it copies data much more than needed, and it's verbose and 
complicated. (One pet peeve of mine is that old versions of Avro don't support 
Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new 
implementation is significantly faster, and the code is a lot simpler. 
Rewriting AvroStorage also enabled me to implement support for Trevni (as 
TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best 
way to contribute the changes back to Apache.

  was:
The current AvroStorage implementation has a lot of issues: it requires old 
versions of Avro, it copies data much more than needed, and it's verbose and 
complicated. (One pet peeve of mine is that old versions of Avro don't support 
Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new 
implementation is significantly faster, and the code is a lot simpler. 
Rewriting AvroStorage also enabled me to implement support for Trevni.

I'm opening this ticket to facilitate discussion while I figure out the best 
way to contribute the changes back to Apache.


> Rewrite of AvroStorage
> --
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joseph Adler
>Assignee: Joseph Adler
> Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni (as 
> TrevniStorage).
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-06 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015.patch)

> Rewrite of AvroStorage
> --
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joseph Adler
>Assignee: Joseph Adler
> Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.



[jira] Subscription: PIG patch available

2012-12-06 Thread jira
Issue Subscription
Filter: PIG patch available (33 issues)

Subscriber: pigdaily

Key         Summary
PIG-3075    Allow AvroStorage STORE Operations To Use Schema Specified By URI
            https://issues.apache.org/jira/browse/PIG-3075
PIG-3073    POUserFunc creating log spam for large scripts
            https://issues.apache.org/jira/browse/PIG-3073
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3067    HBaseStorage should be split up to become more managable
            https://issues.apache.org/jira/browse/PIG-3067
PIG-3066    Fix TestPigRunner in trunk
            https://issues.apache.org/jira/browse/PIG-3066
PIG-3057    make readField protected to be able to override it if we extend PigStorage
            https://issues.apache.org/jira/browse/PIG-3057
PIG-3051    java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
            https://issues.apache.org/jira/browse/PIG-3051
PIG-3033    test-patch failed with javadoc warnings
            https://issues.apache.org/jira/browse/PIG-3033
PIG-3029    TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
            https://issues.apache.org/jira/browse/PIG-3029
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3025    TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
            https://issues.apache.org/jira/browse/PIG-3025
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2957    TetsScriptUDF fail due to volume prefix in jar
            https://issues.apache.org/jira/browse/PIG-2957
PIG-2956    Invalid cache specification for some streaming statement
            https://issues.apache.org/jira/browse/PIG-2956
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2873    Converting bin/pig shell script to python
            https://issues.apache.org/jira/browse/PIG-2873
PIG-2834    MultiStorage requires unused constructor argument
            https://issues.apache.org/jira/browse/PIG-2834
PIG-2824    Pushing checking number of fields into LoadFunc
            https://issues.apache.org/jira/browse/PIG-2824
PIG-2661    Pig uses an extra job for loading data in Pigmix L9
            https://issues.apache.org/jira/browse/PIG-2661
PIG-2645    PigSplit does not handle the case where SerializationFactory returns null
            https://issues.apache.org/jira/browse/PIG-2645
PIG-2614    AvroStorage crashes on LOADING a single bad error
            https://issues.apache.org/jira/browse/PIG-2614
PIG-2507    Semicolon in paramenters for UDF results in parsing error
            https://issues.apache.org/jira/browse/PIG-2507
PIG-2433    Jython import module not working if module path is in classpath
            https://issues.apache.org/jira/browse/PIG-2433
PIG-2417    Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
            https://issues.apache.org/jira/browse/PIG-2417
PIG-2362    Rework Ant build.xml to use macrodef instead of antcall
            https://issues.apache.org/jira/browse/PIG-2362
PIG-2312    NPE when relation and column share the same name and used in Nested Foreach
            https://issues.apache.org/jira/browse/PIG-2312
PIG-1942    script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
            https://issues.apache.org/jira/browse/PIG-1942
PIG-1237    Piggybank MutliStorage - specify field to write in output
            https://issues.apache.org/jira/browse/PIG-1237

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Resolved] (PIG-3044) Trigger POPartialAgg compaction under GC pressure

2012-12-06 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy resolved PIG-3044.


   Resolution: Fixed
Fix Version/s: 0.11
 Release Note: When pig.exec.mapPartAgg is enabled, the memory reserved for 
in-memory aggregation was previously roughly 1/3 the value of 
pig.cachedbag.memusage per POPartialAgg operator. Reserving that much RAM could 
cause problems for jobs that required a lot of memory. POPartialAgg is now 
spillable and, regardless of the setting of pig.cachedbag.memusage, will shrink 
when memory is needed.

> Trigger POPartialAgg compaction under GC pressure
> -
>
> Key: PIG-3044
> URL: https://issues.apache.org/jira/browse/PIG-3044
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0, 0.11, 0.10.1
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.11, 0.12
>
> Attachments: PIG-3044.2.diff, PIG-3404.diff
>
>
> If partial aggregation is turned on in Pig 0.10 and 0.11, 20% (by default) of the 
> available heap can be consumed by the POPartialAgg operator. This can cause 
> memory issues for jobs that use all, or nearly all, of the heap already.
> If we make POPartialAgg "spillable" (trigger compaction when memory reduction 
> is required), we would be much nicer to high-memory jobs.
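
To make the "spillable" idea concrete, here is a toy Python sketch of a partial aggregator that releases its memory when asked. It is a simplification under stated assumptions: `PartialSum`, `emit`, and `spill` are hypothetical names, the sketch simply flushes its partial results downstream, and it does not claim to match how POPartialAgg compacts its buffers.

```python
class PartialSum:
    """Toy map-side partial aggregation that can give memory back on request."""

    def __init__(self, emit):
        self.emit = emit      # callback receiving (key, partial_sum)
        self.partials = {}

    def add(self, key, value):
        self.partials[key] = self.partials.get(key, 0) + value

    def spill(self):
        """Flush all partial results downstream and free the map.

        Returns the number of entries released.
        """
        released = len(self.partials)
        for key, total in self.partials.items():
            self.emit(key, total)
        self.partials.clear()
        return released


out = []
agg = PartialSum(lambda k, v: out.append((k, v)))
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    agg.add(key, value)
agg.spill()            # e.g. called by a memory manager under pressure
agg.add("a", 5)
agg.spill()            # final flush at end of input
print(out)
```

Because the downstream step sums partial results again, flushing early costs some aggregation efficiency but never changes the final totals.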



[jira] [Updated] (PIG-3044) Trigger POPartialAgg compaction under GC pressure

2012-12-06 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-3044:
---

Attachment: PIG-3044.2.diff

Attached a version with Julien's comments incorporated. Will commit to 0.11 and 
trunk.

> Trigger POPartialAgg compaction under GC pressure
> -
>
> Key: PIG-3044
> URL: https://issues.apache.org/jira/browse/PIG-3044
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.10.0, 0.11, 0.10.1
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.12
>
> Attachments: PIG-3044.2.diff, PIG-3404.diff
>
>
> If partial aggregation is turned on in Pig 0.10 and 0.11, 20% (by default) of the 
> available heap can be consumed by the POPartialAgg operator. This can cause 
> memory issues for jobs that use all, or nearly all, of the heap already.
> If we make POPartialAgg "spillable" (trigger compaction when memory reduction 
> is required), we would be much nicer to high-memory jobs.



[jira] [Resolved] (PIG-2815) class loader management in PigContext

2012-12-06 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park resolved PIG-2815.


Resolution: Fixed

Closed.

> class loader management in PigContext
> -
>
> Key: PIG-2815
> URL: https://issues.apache.org/jira/browse/PIG-2815
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
> Fix For: 0.11
>
> Attachments: PIG-2815-branch-0.9.patch, PIG-2815-branch-0.9.patch, 
> PIG-2815.patch, PIG-2815.patch
>
>
> The way {{PigContext.classloader}} and resolveClassName() are managed can 
> lead to strange class loading issues, especially when not all {{register}} 
> statements are at the top (example in the first comment).
> Two factors contribute to this, sometimes one of them alone and sometimes 
> both together:
>  # A new classloader (CL) is created after registering each jar.
> ** But the new CL's parent is the root CL rather than the previous CL, 
> effectively throwing the previous CL away.
>  # resolveClassName() caches classes based on just the name.
> ** A class is not defined by name alone. Classes loaded by two different 
> unrelated CLs are different objects even if both extract the class from the 
> same physical jar file.
> ** Because of (1), the cached class is not necessarily the same as the class 
> that would be loaded based on the 'current' CL.
> Having different class objects for the same class has many subtle side 
> effects, e.g. there would be two instances of static variables. 
> I think both should be fixed, though fixing one of them might be good 
> enough in many cases. I will add a patch.
>



[jira] [Commented] (PIG-3076) make TestScalarAliases more reliable

2012-12-06 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525893#comment-13525893
 ] 

Jonathan Coveney commented on PIG-3076:
---

Good stuff. You know I am all about making the tests better. That said, I feel 
that this path

pigServer.registerQuery("A = LOAD 'build/test/tmp/table_testScalarAliasesBatch' as (a0: long, a1: double);");

should at least be in a local variable, since it is used a couple of times and 
is potentially fragile.

Otherwise +1

> make TestScalarAliases more reliable
> 
>
> Key: PIG-3076
> URL: https://issues.apache.org/jira/browse/PIG-3076
> Project: Pig
>  Issue Type: Test
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 0.11, 0.12
>
> Attachments: PIG-3076.patch
>
>
> Currently, this test writes in the root directory so its output is not 
> deleted by ant clean.
> Also it deletes its output at the end instead of the beginning.
> The consequence is that if the test fails once then it will keep failing until 
> the directory is manually cleaned up (not good for CI).



[jira] [Created] (PIG-3082) outputSchema of a UDF allows two usages when describing a Tuple schema

2012-12-06 Thread Julien Le Dem (JIRA)
Julien Le Dem created PIG-3082:
--

 Summary: outputSchema of a UDF allows two usages when describing a 
Tuple schema
 Key: PIG-3082
 URL: https://issues.apache.org/jira/browse/PIG-3082
 Project: Pig
  Issue Type: Bug
Reporter: Julien Le Dem


When defining an EvalFunc that returns a Tuple there are two ways you can 
implement outputSchema().
- The right way: return a schema that contains one Field that contains the type 
and schema of the return type of the UDF.
- The unreliable way: return a schema that contains more than one field and it 
will be understood as a tuple schema even though there is no type (which is in 
the Field class) to specify that. This is particularly deceptive when the output 
schema is derived from the input schema and the output Tuple sometimes 
contains only one field. In such cases Pig understands the output schema as a 
tuple only if there is more than one field, so sometimes it works and sometimes 
it does not.

We should at least issue a warning (for backward compatibility), if not throw 
an exception outright, when the output schema contains more than one Field.



[jira] [Commented] (PIG-2204) Allow passing arguments to custom Partitioners

2012-12-06 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525819#comment-13525819
 ] 

Julien Le Dem commented on PIG-2204:


Maybe just update the doc to mention this?

> Allow passing arguments to custom Partitioners
> --
>
> Key: PIG-2204
> URL: https://issues.apache.org/jira/browse/PIG-2204
> Project: Pig
>  Issue Type: Improvement
>Reporter: Dmitriy V. Ryaboy
>
> Currently, this works:
> {code}
> y = group x by $0 partition by MyPartitioner PARALLEL 2;
> {code}
> However, passing an argument to the partitioner constructor does not work, 
> and dies with a misleading error:
> {code}
> y = group x by $0 partition by MyPartitioner(0) PARALLEL 2;
> 2011-08-03 22:53:23,074 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1000: Error during parsing. Encountered " "(" "( "" at line 1, column 91.
> Was expecting one of:
> "parallel" ...
> ";" ...
> "." ...
> "$" ...
> {code}



[jira] [Commented] (PIG-3043) Modify the UrlClassloader in PigContext so that classes from the same classloader are used first instead of the parent

2012-12-06 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525813#comment-13525813
 ] 

Julien Le Dem commented on PIG-3043:


It looks like we could borrow: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ApplicationClassLoader.java
But as this has to work with Hadoop 20 we would have to duplicate this class or 
have a similar one.

> Modify the UrlClassloader in PigContext so that classes from the same 
> classloader are used first instead of the parent
> --
>
> Key: PIG-3043
> URL: https://issues.apache.org/jira/browse/PIG-3043
> Project: Pig
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>
> This behavior would be similar to what application servers do (Tomcat, Jetty, 
> ...) and would allow classes from registered jars to use their own version of 
> a class. It also avoids problems where adding a jar to Pig breaks libraries 
> that make use of dynamic class lookup.
> Example of a common pattern that is regularly broken by the current mechanism:
> register lib.jar
> register my.jar
> define blah as my.UDF('my.Implementation')
> my.UDF is in my.jar and uses classes in lib.jar that use Class.forName() to 
> resolve my.Implementation. It works fine until lib.jar is added as a 
> dependency of Pig or to the PIG_CLASSPATH. Then classes in lib.jar do not see 
> the classes in registered jars.
> I think that overriding loadClass(String name, boolean resolve) would allow 
> doing that.
> We should make an exception for anything in org.apache.pig just like 
> servlet.jar is excluded in app servers.



[jira] [Commented] (PIG-2815) class loader management in PigContext

2012-12-06 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525809#comment-13525809
 ] 

Julien Le Dem commented on PIG-2815:


0.11/trunk sounds good to me


> class loader management in PigContext
> -
>
> Key: PIG-2815
> URL: https://issues.apache.org/jira/browse/PIG-2815
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
> Fix For: 0.11
>
> Attachments: PIG-2815-branch-0.9.patch, PIG-2815-branch-0.9.patch, 
> PIG-2815.patch, PIG-2815.patch
>
>
> The way {{PigContext.classloader}} and resolveClassName() are managed can 
> lead to strange class loading issues, especially when not all {{register}} 
> statements are at the top (example in the first comment).
> Two factors contribute to this, sometimes one of them alone and sometimes 
> both together:
>  # A new classloader (CL) is created after registering each jar.
> ** But the new CL's parent is the root CL rather than the previous CL, 
> effectively throwing the previous CL away.
>  # resolveClassName() caches classes based on just the name.
> ** A class is not defined by name alone. Classes loaded by two different 
> unrelated CLs are different objects even if both extract the class from the 
> same physical jar file.
> ** Because of (1), the cached class is not necessarily the same as the class 
> that would be loaded based on the 'current' CL.
> Having different class objects for the same class has many subtle side 
> effects, e.g. there would be two instances of static variables. 
> I think both should be fixed, though fixing one of them might be good 
> enough in many cases. I will add a patch.
>



[jira] [Commented] (PIG-3047) Check the size of a relation before adding it to distributed cache in Replicated join

2012-12-06 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525807#comment-13525807
 ] 

Julien Le Dem commented on PIG-3047:


sounds good to me

> Check the size of a relation before adding it to distributed cache in 
> Replicated join
> -
>
> Key: PIG-3047
> URL: https://issues.apache.org/jira/browse/PIG-3047
> Project: Pig
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>
> Right now if someone makes a mistake and puts the large relation last, Pig 
> will copy a huge file into distributed cache and it will take a long time 
> before the job eventually fails. It would be better to check before copying 
> the relation that it is of reasonable size.
> <1 GB ?
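
The check described above amounts to comparing a size against a limit before any copy starts. A minimal Python sketch follows, assuming a local file stands in for the relation; `check_replicated_size` and `ONE_GB` are hypothetical names, and the 1 GB default only mirrors the tentative figure in the description.

```python
import os
import tempfile

ONE_GB = 1024 ** 3  # the "<1 GB ?" limit floated in the description


def check_replicated_size(path, max_bytes=ONE_GB):
    """Raise before any copy if the file is too big to replicate.

    Hypothetical helper; it looks at a local file for simplicity,
    whereas a real check would ask the job's file system.
    """
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(
            f"{path} is {size} bytes; limit for a replicated relation is {max_bytes}"
        )
    return size


with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 2048)
    small = f.name

print(check_replicated_size(small))            # passes the default limit
try:
    check_replicated_size(small, max_bytes=1024)
except ValueError as e:
    print("rejected:", e)
os.remove(small)
```

Failing at this point costs one metadata lookup, compared with copying the whole relation and failing much later.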



[jira] [Updated] (PIG-2599) Mavenize Pig

2012-12-06 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-2599:
---

  Description: 
Switch Pig build system from ant to maven.



  was:
Switch Pig build system from ant to maven.

This is a candidate project for Google summer of code 2012. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2012

Fix Version/s: 0.12
   Labels:   (was: gsoc2012)

> Mavenize Pig
> 
>
> Key: PIG-2599
> URL: https://issues.apache.org/jira/browse/PIG-2599
> Project: Pig
>  Issue Type: New Feature
>  Components: build
>Reporter: Daniel Dai
> Fix For: 0.12
>
> Attachments: maven-pig.1.zip
>
>
> Switch Pig build system from ant to maven.



[jira] [Commented] (PIG-2599) Mavenize Pig

2012-12-06 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525797#comment-13525797
 ] 

Julien Le Dem commented on PIG-2599:


Hey, that sounds like a great start.
Usually attaching a patch to the JIRA and optionally posting a review to 
https://reviews.apache.org/dashboard/ is how a review is done.
If you need help you can ping the pig-dev mailing list or comment on the JIRA.
A shell script sounds good to me.
Is it intended to be a one-time move that is then checked in to replace the 
current layout?
Why do you need jdo to be installed in your local maven repo? Isn't maven going 
to do it?
Could you provide a short description of each folder? Not all of them are clear 
to me. 
Do you deal with Hadoop 20 vs 23?
I think zebra has had issues for a while. I'm not sure what the status of this 
is right now. Maybe Olga knows.
Fixing checkstyle and findbugs later sounds OK to me. It should be relatively 
easy to do.
What about the shim layer?

Anyway, thanks for looking into this.

> Mavenize Pig
> 
>
> Key: PIG-2599
> URL: https://issues.apache.org/jira/browse/PIG-2599
> Project: Pig
>  Issue Type: New Feature
>  Components: build
>Reporter: Daniel Dai
>  Labels: gsoc2012
> Attachments: maven-pig.1.zip
>
>
> Switch Pig build system from ant to maven.
> This is a candidate project for Google summer of code 2012. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012



Re: Reducer estimation

2012-12-06 Thread Prashant Kommireddi
You could store stats for the last X runs and take an average for the next run.
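
A rough Python sketch of that idea, assuming one small stats file is written per run so that nothing has to be updated in place. All names here (`save_run_stats`, `average_recent_reducers`, the JSON layout, the fixed-width run stamp) are hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def save_run_stats(stats_dir, script_id, run_stamp, reducers):
    """Write one small file per run, so no file is ever updated in place."""
    path = os.path.join(stats_dir, f"{script_id}_{run_stamp}.json")
    with open(path, "w") as f:
        json.dump({"reducers": reducers}, f)


def average_recent_reducers(stats_dir, script_id, last_x):
    """Average the reducer counts of the newest `last_x` runs, or None.

    Assumes last_x >= 1 and fixed-width run stamps, so that sorting the
    file names also sorts the runs by time.
    """
    names = sorted(
        n for n in os.listdir(stats_dir) if n.startswith(script_id + "_")
    )
    recent = names[-last_x:]
    if not recent:
        return None  # no history: caller falls back to the default estimate
    counts = []
    for name in recent:
        with open(os.path.join(stats_dir, name)) as f:
            counts.append(json.load(f)["reducers"])
    return round(sum(counts) / len(counts))


stats_dir = tempfile.mkdtemp()
for stamp, reducers in [("20121201", 40), ("20121202", 10), ("20121203", 20)]:
    save_run_stats(stats_dir, "demo_script", stamp, reducers)
print(average_recent_reducers(stats_dir, "demo_script", last_x=2))
```

Old files can be deleted independently of the reads, which keeps retention simple.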

Sent from my iPhone

On Dec 6, 2012, at 8:09 PM, Dmitriy Ryaboy  wrote:

> How would flat files work? The data needs to be updated by every pig run.
>
> On Dec 3, 2012, at 11:10 PM, Prashant Kommireddi  wrote:
>
>> Awesome! It would be good to have a flat-file based impl as there will
>> probably be a lot of Pig users not having an HBase instance set up for
>> stats persistence. Let me know if I can help in any way.
>>
>> Is there a timeframe you are looking at for open-sourcing this?
>>
>>
>> On Dec 4, 2012, at 12:32 PM, Bill Graham  wrote:
>>
>>> We do basically what you're describing. Each of our scripts has a logical
>>> name which defines the workflow. For each job in the workflow we persist
>>> the job stats, counters and conf in HBase via an implementation of
>>> PigProgressNotificationListener. We can then correlate jobs in a run of the
>>> workflow together based on the pig.script.start.time and pig.job.start time
>>> properties. We use the logical plan script signature to determine whether the
>>> script version has changed.
>>>
>>> During job execution we query the service in an impl of PigReducerEstimator
>>> for matching workflows.
>>>
>>> One simple estimation algo is to multiply Pig's default estimated reducers
>>> (which are based on mapper input bytes) by the ratio of mapper output bytes
>>> over mapper input bytes of previous runs. The same could also be done with
>>> slot time, but we haven't tried that yet.
>>>
>>> We plan to open source parts of this at some point.
>>>
>>>
>>> On Mon, Dec 3, 2012 at 10:32 PM, Prashant Kommireddi 
>>> wrote:
>>>
>>>> I have been thinking about using Pig statistics for # reducers estimation.
>>>> Though the current heuristic approach
>>>> works fine, it requires an admin or the programmer to understand what the
>>>> best number would be for the job.
>>>> We are seeing a large number of jobs over-utilizing resources, and there is
>>>> obviously no default number that works well
>>>> for all kinds of pig scripts. A few non-technical users find it difficult
>>>> to estimate the best # for their jobs.
>>>> It would be great if we can use stats from previous runs of a job to set
>>>> the number
>>>> of reducers for future runs.
>>>>
>>>> This would be a nice feature for jobs running in production, where the job
>>>> or the dataset size does not fluctuate
>>>> a huge deal.
>>>>
>>>>
>>>> 1. Set a config param in the script
>>>>    - set script.unique.id prashant.222111.demo_script
>>>> 2. If the above is not set, we fallback on the current implementation
>>>> 3. If the above is set
>>>>    - At the end of the job, persist PigStats (namely Reduce Shuffle
>>>>      Bytes) to FS (hdfs, local, s3). This would be
>>>>      "${script.unique.id}_MMDDHHmmss"
>>>>    - lets call this stats_on_hdfs
>>>>    - Read "stats_on_hdfs" for previous runs, and based on the number of
>>>>      such stats to read (based on
>>>>      script.reducer.estimation.num.stats) calculate
>>>>      an average number of reducers for the current run.
>>>>    - If no stats_on_hdfs exists, we fallback on current implementation
>>>>
>>>> It will be advised to not keep the retention of stats too long, and Pig can
>>>> make sure to clear up old files that are not required.
>>>>
>>>> What do you guys think?
>>>>
>>>> -Prashant

>>>
>>>
>>>
>>> --
>>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>>> billgra...@gmail.com going forward.*
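
The "simple estimation algo" quoted above (scale Pig's default reducer estimate by the ratio of mapper output bytes to mapper input bytes seen in previous runs) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation described in the thread: the function name, the (input_bytes, output_bytes) tuple layout, the pooled-ratio choice and the max_reducers cap are all hypothetical.

```python
import math


def estimate_reducers(default_estimate, previous_runs, max_reducers=999):
    """Scale a default reducer estimate by the historical output/input ratio.

    previous_runs is a sequence of (mapper_input_bytes, mapper_output_bytes)
    pairs from earlier runs of the same workflow (hypothetical layout).
    max_reducers is an assumed upper bound, not a value from the thread.
    """
    runs = list(previous_runs)
    total_in = sum(in_bytes for in_bytes, _ in runs)
    total_out = sum(out_bytes for _, out_bytes in runs)
    if total_in <= 0:
        # No usable history: keep the default estimate unchanged.
        return default_estimate
    # Pooled ratio over all previous runs (one of several reasonable choices).
    ratio = total_out / total_in
    scaled = math.ceil(default_estimate * ratio)
    # Always ask for at least one reducer and never exceed the cap.
    return max(1, min(max_reducers, scaled))
```

Pooling the bytes across runs weights large runs more heavily than averaging the per-run ratios would; which of the two matches the approach described above is not stated in the thread.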


Re: Reducer estimation

2012-12-06 Thread Dmitriy Ryaboy
How would flat files work? The data needs to be updated by every pig run. 

On Dec 3, 2012, at 11:10 PM, Prashant Kommireddi  wrote:

> Awesome! It would be good to have a flat-file based impl as there will
> probably be a lot of Pig users not having an HBase instance set up for
> stats persistence. Let me know if I can help in any way.
> 
> Is there a timeframe you are looking at for open-sourcing this?
> 
> 
> On Dec 4, 2012, at 12:32 PM, Bill Graham  wrote:
> 
>> We do basically what you're describing. Each of our scripts has a logical
>> name which defines the workflow. For each job in the workflow we persist
>> the job stats, counters and conf in HBase via an implementation of
>> PigProgressNotificationListener. We can then correlate jobs in a run of the
>> workflow together based on the pig.script.start.time and pig.job.start time
>> properties. We use the logical plan script signature to determine whether
>> the script version has changed.
>> 
>> During job execution we query the service in an impl of PigReducerEstimator
>> for matching workflows.
>> 
>> One simple estimation algo is to multiply Pig's default estimated reducers
>> (which are based on mapper input bytes) by the ratio of mapper output bytes
>> over mapper input bytes of previous runs. The same could also be done with
>> slot time, but we haven't tried that yet.
>> 
>> We plan to open source parts of this at some point.
>> 
>> 
>> On Mon, Dec 3, 2012 at 10:32 PM, Prashant Kommireddi 
>> wrote:
>> 
>>> I have been thinking about using Pig statistics for # reducers estimation.
>>> Though the current heuristic approach
>>> works fine, it requires an admin or the programmer to understand what the
>>> best number would be for the job.
>>> We are seeing a large number of jobs over-utilizing resources, and there is
>>> obviously no default number that works well
>>> for all kinds of pig scripts. A few non-technical users find it difficult
>>> to estimate the best # for their jobs.
>>> It would be great if we can use stats from previous runs of a job to set
>>> the number
>>> of reducers for future runs.
>>> 
>>> This would be a nice feature for jobs running in production, where the job
>>> or the dataset size does not fluctuate
>>> a great deal.
>>> 
>>> 
>>>  1. Set a config param in the script
>>>     - set script.unique.id prashant.222111.demo_script
>>>  2. If the above is not set, we fall back on the current implementation
>>>  3. If the above is set
>>>     - At the end of the job, persist PigStats (namely Reduce Shuffle
>>>       Bytes) to FS (hdfs, local, s3). This would be
>>>       "${script.unique.id}_MMDDHHmmss"
>>>     - let's call this stats_on_hdfs
>>>     - Read "stats_on_hdfs" for previous runs, and based on the number of
>>>       such stats to read (based on
>>>       script.reducer.estimation.num.stats) calculate
>>>       an average number of reducers for the current run.
>>>     - If no stats_on_hdfs exists, we fall back on current implementation
>>> 
>>> It would be advisable not to retain stats for too long, and Pig can
>>> make sure to clear up old files that are not required.
>>> 
>>> What do you guys think?
>>> 
>>> -Prashant
>>> 
>> 
>> 
>> 
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> billgra...@gmail.com going forward.*
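
On the flat-file question, one possible reading of the quoted proposal is that each run writes its own new stats file rather than updating a shared one, and the estimator averages the most recent few. The sketch below illustrates that reading only. It uses a local directory in place of HDFS, and the function names, the one-number file contents, the fully sortable timestamp and the BYTES_PER_REDUCER target are all assumptions made for the example, not anything specified in the thread.

```python
import math
import os

# Assumed target amount of shuffle data per reducer (hypothetical value).
BYTES_PER_REDUCER = 1_000_000_000


def persist_stats(stats_dir, script_id, timestamp, reduce_shuffle_bytes):
    """Write one new file per run, named <script_id>_<timestamp>.

    timestamp is a digit string that sorts chronologically
    (e.g. YYYYMMDDHHmmss); existing files are never modified.
    """
    os.makedirs(stats_dir, exist_ok=True)
    path = os.path.join(stats_dir, f"{script_id}_{timestamp}")
    with open(path, "w") as f:
        f.write(str(reduce_shuffle_bytes))
    return path


def estimate_from_history(stats_dir, script_id, num_stats):
    """Average the last num_stats runs; None means 'use the default estimate'."""
    if num_stats <= 0 or not os.path.isdir(stats_dir):
        return None
    prefix = script_id + "_"
    # Keep only files for this script whose suffix is a pure digit timestamp.
    names = sorted(
        name for name in os.listdir(stats_dir)
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    )
    recent = names[-num_stats:]
    if not recent:
        return None
    total = 0
    for name in recent:
        with open(os.path.join(stats_dir, name)) as f:
            total += int(f.read().strip())
    average_bytes = total / len(recent)
    return max(1, math.ceil(average_bytes / BYTES_PER_REDUCER))
```

Because every run only adds a file, concurrent runs do not contend on a shared record; the cost is that old files accumulate and need the cleanup mentioned in the proposal.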


Jenkins build is back to normal : Pig-trunk #1371

2012-12-06 Thread Apache Jenkins Server
See