[jira] Subscription: PIG patch available

2014-10-20 Thread jira
Issue Subscription
Filter: PIG patch available (20 issues)

Subscriber: pigdaily

Key  Summary
PIG-4241  Auto local mode mistakenly converts large jobs to local mode when used with Hive tables
https://issues.apache.org/jira/browse/PIG-4241
PIG-4239  pig.output.lazy does not work in spark mode
https://issues.apache.org/jira/browse/PIG-4239
PIG-4224  Upload Tez payload history string to timeline server
https://issues.apache.org/jira/browse/PIG-4224
PIG-4160  -forcelocaljars / -j flag when using a remote url for a script
https://issues.apache.org/jira/browse/PIG-4160
PIG-4111  Make Pig compile with avro-1.7.7
https://issues.apache.org/jira/browse/PIG-4111
PIG-4103  Fix TestRegisteredJarVisibility (after PIG-4083)
https://issues.apache.org/jira/browse/PIG-4103
PIG-4084  Port TestPigRunner to Tez
https://issues.apache.org/jira/browse/PIG-4084
PIG-4066  An optimization for ROLLUP operation in Pig
https://issues.apache.org/jira/browse/PIG-4066
PIG-4004  Upgrade the Pigmix queries from the (old) mapred API to mapreduce
https://issues.apache.org/jira/browse/PIG-4004
PIG-4002  Disable combiner when map-side aggregation is used
https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
https://issues.apache.org/jira/browse/PIG-3873
PIG-3866  Create ThreadLocal classloader per PigContext
https://issues.apache.org/jira/browse/PIG-3866
PIG-3861  duplicate jars get added to distributed cache
https://issues.apache.org/jira/browse/PIG-3861
PIG-3668  COR built-in function when at least one of the coefficient values is NaN
https://issues.apache.org/jira/browse/PIG-3668
PIG-3635  Fix e2e tests for Hadoop 2.X on Windows
https://issues.apache.org/jira/browse/PIG-3635
PIG-3587  add functionality for rolling over dates
https://issues.apache.org/jira/browse/PIG-3587
PIG-3441  Allow Pig to use default resources from Configuration objects
https://issues.apache.org/jira/browse/PIG-3441

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] [Commented] (PIG-4232) UDFContext is not initialized in executors when running on Spark cluster

2014-10-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176692#comment-14176692
 ] 

liyunzhang_intel commented on PIG-4232:
---

Scripting.pig
register '/home/zly/prj/oss/pig/bin/libexec/python/scriptingudf.py' using jython as myfuncs;
a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach a generate age;
explain b;
store b into '/user/pig/out/root-1412926432-nightly.conf/Scripting_1.out';
*Scripting.pig succeeds in spark mode*


Scripting.udf.pig
register '/home/zly/prj/oss/pig/bin/libexec/python/scriptingudf.py' using jython as myfuncs;
a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach a generate myfuncs.square(age);
explain b;
store b into '/user/pig/out/root-1412926432-nightly.conf/Scripting_1.out';
*this script fails in spark mode*

After debugging, I found that UDFContext is not initialized in Spark executors.
In 
https://github.com/apache/pig/blob/spark/src/org/apache/pig/builtin/PigStorage.java#L246-L247
{code}
Properties p = UDFContext.getUDFContext().getUDFProperties(this.getClass());
mRequiredColumns = (boolean[]) ObjectSerializer.deserialize(p.getProperty(signature));
{code}
When executing Scripting.pig, 
UDFContext.getUDFContext().getUDFProperties(this.getClass()) returns a Properties 
object that contains the info about PigStorage. After deserialization, the variable 
mRequiredColumns is correctly initialized. 
When executing Scripting.udf.pig, 
UDFContext.getUDFContext().getUDFProperties(this.getClass()) returns a Properties 
object that does not contain the info about PigStorage. After deserialization, the 
variable mRequiredColumns is null.

*Where is UDFContext set up?*
{code}
LoadConverter#convert
- SparkUtil#newJobConf
public static JobConf newJobConf(PigContext pigContext) throws IOException {
    JobConf jobConf = new JobConf(
            ConfigurationUtil.toConfiguration(pigContext.getProperties()));
    jobConf.set("pig.pigContext", ObjectSerializer.serialize(pigContext));
    // serialize all udf info (including PigStorage info) to jobConf
    UDFContext.getUDFContext().serialize(jobConf);
    jobConf.set("udf.import.list",
            ObjectSerializer.serialize(PigContext.getPackageImportList()));
    return jobConf;
}

PigInputFormat#passLoadSignature
- MapRedUtil.setupUDFContext(conf);
- UDFContext.setUDFContext
public static void setupUDFContext(Configuration job) throws IOException {
    UDFContext udfc = UDFContext.getUDFContext();
    udfc.addJobConf(job);
    // don't deserialize in front-end
    if (udfc.isUDFConfEmpty()) {
        udfc.deserialize(); // UDFContext deserializes from jobConf
    }
}
{code}
In LoadConverter#convert, all udf info (including PigStorage) is first serialized to 
jobConf. Then, in PigInputFormat#passLoadSignature, UDFContext deserializes it from 
jobConf.
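The round trip can be sketched in plain Java (a HashMap stands in for JobConf, and Base64-encoded Java serialization stands in for ObjectSerializer; the key name "pig.udf.context" is made up for illustration):

```java
import java.io.*;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class UdfConfRoundTrip {
    // Serialize an object into a Base64 string that can live in a conf value
    static String serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Reverse the encoding on the other side
    static Object deserialize(String s) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(s);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // "Front end": put UDF properties into the conf, like UDFContext.serialize(jobConf)
        Properties udfProps = new Properties();
        udfProps.setProperty("some.udf.key", "some value");
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put("pig.udf.context", serialize(udfProps));

        // "Executor": rebuild the properties from the conf, like udfc.deserialize()
        Properties restored = (Properties) deserialize(jobConf.get("pig.udf.context"));
        System.out.println(restored.getProperty("some.udf.key"));
    }
}
```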

*Why are serialization and deserialization needed?*
UDFContext#tss is a ThreadLocal variable, so its value is different in different 
threads. LoadConverter#convert is executed in Thread-A while 
PigInputFormat#passLoadSignature is executed in Thread-B, because Spark sends its 
tasks to executors for execution; PigInputFormat#passLoadSignature actually runs 
inside a spark executor (Thread-B). Serialization and deserialization are used to 
initialize the UDFContext's value in the other thread.
{code}
UDFContext.java
private static ThreadLocal<UDFContext> tss = new ThreadLocal<UDFContext>() {
    @Override
    public UDFContext initialValue() {
        return new UDFContext();
    }
};

public static UDFContext getUDFContext() {
    UDFContext res = tss.get();
    return res;
}
{code}
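The per-thread behaviour of ThreadLocal can be demonstrated with a self-contained sketch (illustrative code, not taken from Pig):

```java
public class ThreadLocalDemo {
    // Each thread that calls get() receives its own instance, as UDFContext#tss does
    private static final ThreadLocal<Object> tss = ThreadLocal.withInitial(Object::new);

    static boolean valuesDiffer() throws InterruptedException {
        Object mine = tss.get();                 // value seen by Thread-A (this thread)
        Object[] theirs = new Object[1];
        Thread t = new Thread(() -> theirs[0] = tss.get()); // value seen by Thread-B
        t.start();
        t.join();
        return mine != theirs[0];
    }

    public static void main(String[] args) throws InterruptedException {
        // prints "true": the two threads observed different instances
        System.out.println(valuesDiffer());
    }
}
```

This is exactly why state built up in one thread must be carried over explicitly (via the jobConf) rather than relying on the ThreadLocal.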

*Why is there a difference between the cases with and without a udf?*
With a udf:
Before PigInputFormat#passLoadSignature is executed, POUserFunc#setFuncInputSchema 
is executed. In POUserFunc#setFuncInputSchema, an entry is put into 
UDFContext#udfConfs (the key is POUserFunc, the value is an empty Properties 
object). When PigInputFormat#passLoadSignature is executed, the condition for 
deserializing the UDFContext (udfc.isUDFConfEmpty()) is therefore not met, and 
deserialization is not executed.
{code}
POUserFunc#readObject
- POUserFunc#instantiateFunc(FuncSpec)
- POUserFunc#setFuncInputSchema(String)
public void setFuncInputSchema(String signature) {
    Properties props = UDFContext.getUDFContext().getUDFProperties(func.getClass());
    Schema tmpS = (Schema) props.get("pig.evalfunc.inputschema." + signature);
    if (tmpS != null) {
        this.func.setInputSchema(tmpS);
    }
}
- UDFContext.java
public Properties getUDFProperties(Class c) {
    UDFContextKey k = generateKey(c, null);
    Properties p = udfConfs.get(k);
    if (p == null) {
        p = new Properties(); // an empty Properties object
        udfConfs.put(k, p);
    }
    return p;
}
{code}
[jira] [Updated] (PIG-4232) UDFContext is not initialized in executors when running on Spark cluster

2014-10-20 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4232:
--
Attachment: PIG-4232.patch

 UDFContext is not initialized in executors when running on Spark cluster
 

 Key: PIG-4232
 URL: https://issues.apache.org/jira/browse/PIG-4232
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: Praveen Rachabattuni
Assignee: liyunzhang_intel
 Attachments: PIG-4232.patch


 UDFContext is used in a lot of features across the Pig code base. For example, it's 
 used in PigStorage to pass column information between the frontend and the 
 backend code. 
 https://github.com/apache/pig/blob/spark/src/org/apache/pig/builtin/PigStorage.java#L246-L247



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4237) Error when there is a bag inside an RDD

2014-10-20 Thread Praveen Rachabattuni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Rachabattuni resolved PIG-4237.
---
Resolution: Fixed

 Error when there is a bag inside an RDD
 ---

 Key: PIG-4237
 URL: https://issues.apache.org/jira/browse/PIG-4237
 Project: Pig
  Issue Type: Bug
  Components: spark
Reporter: Carlos Balduz
Assignee: Carlos Balduz
Priority: Critical
  Labels: spork
 Attachments: PIG-4237-1.diff


 Bags cannot be sent to an RDD, as doing so produces a SelfSpillBag$MemoryLimits not 
 Serializable exception. This results in an error for almost every operation 
 performed after grouping tuples.
 This error is fixed by making the protected MemoryLimit memLimit attribute 
 inside org.apache.pig.data.SelfSpillBag transient, but I do not know the 
 impact of this change.
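Why marking the field transient avoids the exception can be illustrated with a minimal example (the classes below are stand-ins, not the actual SelfSpillBag):

```java
import java.io.*;

public class TransientDemo {
    // A member type that is not Serializable, like SelfSpillBag$MemoryLimits
    static class MemoryLimitsLike { }

    static class BagLike implements Serializable {
        // transient excludes the field from serialization, so the
        // non-serializable member no longer triggers NotSerializableException
        transient MemoryLimitsLike memLimit = new MemoryLimitsLike();
        int size = 3;
    }

    static BagLike roundTrip(BagLike bag) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(bag); // would throw here if memLimit were not transient
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (BagLike) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        BagLike copy = roundTrip(new BagLike());
        System.out.println(copy.size);             // 3: normal fields survive
        System.out.println(copy.memLimit == null); // true: transient fields come back null
    }
}
```

The trade-off hinted at in the report is visible here: after deserialization the transient field is null, so the receiving side must be able to cope with (or rebuild) the missing state.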





[jira] [Updated] (PIG-4237) Error when there is a bag inside an RDD

2014-10-20 Thread Praveen Rachabattuni (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Rachabattuni updated PIG-4237:
--
Patch Info: Patch Available

 Error when there is a bag inside an RDD
 ---

 Key: PIG-4237
 URL: https://issues.apache.org/jira/browse/PIG-4237
 Project: Pig
  Issue Type: Bug
  Components: spark
Reporter: Carlos Balduz
Assignee: Carlos Balduz
Priority: Critical
  Labels: spork
 Attachments: PIG-4237-1.diff


 Bags cannot be sent to an RDD, as doing so produces a SelfSpillBag$MemoryLimits not 
 Serializable exception. This results in an error for almost every operation 
 performed after grouping tuples.
 This error is fixed by making the protected MemoryLimit memLimit attribute 
 inside org.apache.pig.data.SelfSpillBag transient, but I do not know the 
 impact of this change.





[jira] [Commented] (PIG-4237) Error when there is a bag inside an RDD

2014-10-20 Thread Praveen Rachabattuni (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176816#comment-14176816
 ] 

Praveen Rachabattuni commented on PIG-4237:
---

Committed this to Spark branch.

 Error when there is a bag inside an RDD
 ---

 Key: PIG-4237
 URL: https://issues.apache.org/jira/browse/PIG-4237
 Project: Pig
  Issue Type: Bug
  Components: spark
Reporter: Carlos Balduz
Assignee: Carlos Balduz
Priority: Critical
  Labels: spork
 Attachments: PIG-4237-1.diff


 Bags cannot be sent to an RDD, as doing so produces a SelfSpillBag$MemoryLimits not 
 Serializable exception. This results in an error for almost every operation 
 performed after grouping tuples.
 This error is fixed by making the protected MemoryLimit memLimit attribute 
 inside org.apache.pig.data.SelfSpillBag transient, but I do not know the 
 impact of this change.





[jira] [Commented] (PIG-3979) group all performance, garbage collection, and incremental aggregation

2014-10-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177063#comment-14177063
 ] 

Rohini Palaniswamy commented on PIG-3979:
-

[~ddreyfus],

bq. For counting spilled bytes, I rely on the counters.
   The problem is that for killed or failed tasks the counters are discarded, and 
the log messages are very useful then.
 
bq. I understand how the extra GC solved PIG-3148. 
   There were two System.gc() calls in the SpillableMemoryManager. There was always 
a System.gc() invoked after all the spill() calls, which had been there from the 
beginning. Then there is an extra GC that was introduced to do a System.gc() 
before the spill() call if the object size exceeds a certain threshold. You have 
removed both System.gc() calls, which I believe will lead to a lot of 
regressions. It would be nice to add another API to confirm before invoking the 
extra GC, but that would cause backward incompatibility unless we move to Java 8. 
For now we can hack around it by doing an instanceof POPartialAgg check and not 
invoking the extra GC. I am posting a patch that makes the spill in POPartialAgg 
synchronous. Can you test if it fixes your problem? 
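The instanceof hack mentioned above might look roughly like this inside the memory handler (Spillable and POPartialAggLike below are simplified stand-ins, not the real Pig classes or the actual SpillableMemoryManager logic):

```java
public class ExtraGcGuard {
    // Stand-in for org.apache.pig.impl.util.Spillable
    interface Spillable { long spill(); }

    // Stand-in for POPartialAgg, whose spill runs asynchronously
    static class POPartialAggLike implements Spillable {
        public long spill() { return 0; }
    }

    static boolean shouldRunExtraGc(Spillable s, long estimatedSize, long threshold) {
        // Skip the extra System.gc() for partial aggregation: its asynchronous
        // spill would not have freed any memory yet, so the GC would be wasted.
        if (s instanceof POPartialAggLike) {
            return false;
        }
        return estimatedSize > threshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldRunExtraGc(new POPartialAggLike(), 100, 10)); // false
        System.out.println(shouldRunExtraGc(() -> 0L, 100, 10));               // true
    }
}
```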

 group all performance, garbage collection, and incremental aggregation
 --

 Key: PIG-3979
 URL: https://issues.apache.org/jira/browse/PIG-3979
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.12.0, 0.11.1
Reporter: David Dreyfus
Assignee: David Dreyfus
 Fix For: 0.14.0

 Attachments: PIG-3979-3.patch, PIG-3979-4.patch, PIG-3979-v1.patch, 
 POPartialAgg.java.patch, SpillableMemoryManager.java.patch


 I have a PIG statement similar to:
 summary = foreach (group data ALL) generate 
 COUNT(data.col1), SUM(data.col2), SUM(data.col2)
 , Moments(col3)
 , Moments(data.col4)
 There are a couple of hundred columns.
 I set the following:
 SET pig.exec.mapPartAgg true;
 SET pig.exec.mapPartAgg.minReduction 3;
 SET pig.cachedbag.memusage 0.05;
 I found that when I ran this on a JVM with insufficient memory, the process 
 eventually timed out because of an infinite garbage collection loop.
 The problem was invariant to the memusage setting.
 I solved the problem by making changes to:
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperator.POPartialAgg.java
 Rather than reading in 1 records to establish an estimate of the 
 reduction, I make an estimate after reading in enough tuples to fill 
 pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
 I also made a change to guarantee that at least one record is allowed in second-tier 
 storage. In the current implementation, if the reduction is very high (1000:1), 
 space in second-tier storage is zero.
 With these changes, I can summarize large data sets with small JVMs. I also 
 find that setting pig.cachedbag.memusage to a small number such as 0.05 
 results in much better garbage collection performance without reducing 
 throughput. I suppose tuning GC would also solve a problem with excessive 
 garbage collection.
 The performance is sweet. 
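The proposed estimation trigger can be sketched as follows (names are illustrative; the real change lives in POPartialAgg):

```java
public class EstimateTrigger {
    // Start estimating the reduction once buffered tuples fill
    // pig.cachedbag.memusage percent of the JVM's max heap.
    static long estimationThresholdBytes(double memusage, long maxMemoryBytes) {
        return (long) (memusage * maxMemoryBytes);
    }

    static boolean shouldEstimate(long bufferedBytes, double memusage, long maxMemoryBytes) {
        return bufferedBytes >= estimationThresholdBytes(memusage, maxMemoryBytes);
    }

    public static void main(String[] args) {
        long maxMem = Runtime.getRuntime().maxMemory();
        // With pig.cachedbag.memusage = 0.05, buffer 5% of the heap before estimating
        System.out.println(estimationThresholdBytes(0.05, maxMem) + " bytes before estimating");
    }
}
```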





[jira] [Updated] (PIG-3979) group all performance, garbage collection, and incremental aggregation

2014-10-20 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3979:

Attachment: PIG-3979-synchronous-spill.patch

 Attached an initial version of the patch, which does not do the extra GC for 
POPartialAgg and makes the spill synchronous so that the second invokeGC 
actually frees memory. Can you check whether that works for you? It still needs 
fixes to TestPOPartialAgg and modification of existing tests or addition of new 
tests for the behaviour change.

 group all performance, garbage collection, and incremental aggregation
 --

 Key: PIG-3979
 URL: https://issues.apache.org/jira/browse/PIG-3979
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.12.0, 0.11.1
Reporter: David Dreyfus
Assignee: David Dreyfus
 Fix For: 0.14.0

 Attachments: PIG-3979-3.patch, PIG-3979-4.patch, 
 PIG-3979-synchronous-spill.patch, PIG-3979-v1.patch, POPartialAgg.java.patch, 
 SpillableMemoryManager.java.patch


 I have a PIG statement similar to:
 summary = foreach (group data ALL) generate 
 COUNT(data.col1), SUM(data.col2), SUM(data.col2)
 , Moments(col3)
 , Moments(data.col4)
 There are a couple of hundred columns.
 I set the following:
 SET pig.exec.mapPartAgg true;
 SET pig.exec.mapPartAgg.minReduction 3;
 SET pig.cachedbag.memusage 0.05;
 I found that when I ran this on a JVM with insufficient memory, the process 
 eventually timed out because of an infinite garbage collection loop.
 The problem was invariant to the memusage setting.
 I solved the problem by making changes to:
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperator.POPartialAgg.java
 Rather than reading in 1 records to establish an estimate of the 
 reduction, I make an estimate after reading in enough tuples to fill 
 pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
 I also made a change to guarantee that at least one record is allowed in second-tier 
 storage. In the current implementation, if the reduction is very high (1000:1), 
 space in second-tier storage is zero.
 With these changes, I can summarize large data sets with small JVMs. I also 
 find that setting pig.cachedbag.memusage to a small number such as 0.05 
 results in much better garbage collection performance without reducing 
 throughput. I suppose tuning GC would also solve a problem with excessive 
 garbage collection.
 The performance is sweet. 





[jira] [Created] (PIG-4243) TestAlgebraicEval fails in spark mode

2014-10-20 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-4243:
-

 Summary: TestAlgebraicEval fails in spark mode
 Key: PIG-4243
 URL: https://issues.apache.org/jira/browse/PIG-4243
 Project: Pig
  Issue Type: Bug
  Components: spark
Reporter: liyunzhang_intel


1. Build spark and pig env according to PIG-4168
2. add TestAlgebraicEval to $PIG_HOME/test/spark-tests
cat  $PIG_HOME/test/spark-tests
**/TestAlgebraicEval
3. run unit test TestAlgebraicEval
ant test-spark
4. the unit test fails
error log is attached





[jira] [Updated] (PIG-4243) TestStore fails in spark mode

2014-10-20 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4243:
--
Summary: TestStore fails in spark mode  (was: TestAlgebraicEval fails in 
spark mode)

 TestStore fails in spark mode
 -

 Key: PIG-4243
 URL: https://issues.apache.org/jira/browse/PIG-4243
 Project: Pig
  Issue Type: Bug
  Components: spark
Reporter: liyunzhang_intel

 1. Build spark and pig env according to PIG-4168
 2. add TestAlgebraicEval to $PIG_HOME/test/spark-tests
 cat  $PIG_HOME/test/spark-tests
 **/TestAlgebraicEval
 3. run unit test TestAlgebraicEval
 ant test-spark
 4. the unit test fails
 error log is attached





[jira] [Updated] (PIG-4243) TestStore fails in spark mode

2014-10-20 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4243:
--
Attachment: TEST-org.apache.pig.test.TestStore.txt

 TestStore fails in spark mode
 -

 Key: PIG-4243
 URL: https://issues.apache.org/jira/browse/PIG-4243
 Project: Pig
  Issue Type: Bug
  Components: spark
Reporter: liyunzhang_intel
 Attachments: TEST-org.apache.pig.test.TestStore.txt


 1. Build spark and pig env according to PIG-4168
 2. add TestStore to $PIG_HOME/test/spark-tests
 cat  $PIG_HOME/test/spark-tests
 **/TestStore
 3. run unit test TestStore
 ant test-spark
 4. the unit test fails
 error log is attached





[jira] [Updated] (PIG-4243) TestStore fails in spark mode

2014-10-20 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4243:
--
Description: 
1. Build spark and pig env according to PIG-4168
2. add TestStore to $PIG_HOME/test/spark-tests
cat  $PIG_HOME/test/spark-tests
**/TestStore
3. run unit test TestStore
ant test-spark
4. the unit test fails
error log is attached

  was:
1. Build spark and pig env according to PIG-4168
2. add TestAlgebraicEval to $PIG_HOME/test/spark-tests
cat  $PIG_HOME/test/spark-tests
**/TestAlgebraicEval
3. run unit test TestAlgebraicEval
ant test-spark
4. the unit test fails
error log is attached


 TestStore fails in spark mode
 -

 Key: PIG-4243
 URL: https://issues.apache.org/jira/browse/PIG-4243
 Project: Pig
  Issue Type: Bug
  Components: spark
Reporter: liyunzhang_intel
 Attachments: TEST-org.apache.pig.test.TestStore.txt


 1. Build spark and pig env according to PIG-4168
 2. add TestStore to $PIG_HOME/test/spark-tests
 cat  $PIG_HOME/test/spark-tests
 **/TestStore
 3. run unit test TestStore
 ant test-spark
 4. the unit test fails
 error log is attached





[jira] [Updated] (PIG-4168) Initial implementation of unit tests for Pig on Spark

2014-10-20 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4168:
--
Attachment: PIG-4168_5.patch

Updated the patch: PIG-4168_5.patch. Please use this one for review.

 Initial implementation of unit tests for Pig on Spark
 -

 Key: PIG-4168
 URL: https://issues.apache.org/jira/browse/PIG-4168
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: Praveen Rachabattuni
Assignee: liyunzhang_intel
 Attachments: PIG-4168.patch, PIG-4168_1.patch, PIG-4168_2.patch, 
 PIG-4168_3.patch, PIG-4168_4.patch, PIG-4168_5.patch


 1. ant clean jar; pig-0.14.0-SNAPSHOT-core-h1.jar will be generated by the 
 command
 2. export SPARK_PIG_JAR=$PIG_HOME/pig-0.14.0-SNAPSHOT-core-h1.jar 
 3. build the hadoop1 and spark env; spark runs in local mode
   jps:
   11647 Master #spark master runs
   6457 DataNode #hadoop datanode runs
   22374 Jps
   11705 Worker #spark worker runs
   27009 JobTracker #hadoop jobtracker runs
   26602 NameNode #hadoop namenode runs
   29486 org.eclipse.equinox.launcher_1.3.0.v20120522-1813.jar
   19692 Main
  
 4. ant test-spark


