[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (20 issues)
Subscriber: pigdaily

Key         Summary
PIG-4241    Auto local mode mistakenly converts large jobs to local mode when using with Hive tables
            https://issues.apache.org/jira/browse/PIG-4241
PIG-4239    pig.output.lazy not works in spark mode
            https://issues.apache.org/jira/browse/PIG-4239
PIG-4224    Upload Tez payload history string to timeline server
            https://issues.apache.org/jira/browse/PIG-4224
PIG-4160    -forcelocaljars / -j flag when using a remote url for a script
            https://issues.apache.org/jira/browse/PIG-4160
PIG-4111    Make Pig compiles with avro-1.7.7
            https://issues.apache.org/jira/browse/PIG-4111
PIG-4103    Fix TestRegisteredJarVisibility (after PIG-4083)
            https://issues.apache.org/jira/browse/PIG-4103
PIG-4084    Port TestPigRunner to Tez
            https://issues.apache.org/jira/browse/PIG-4084
PIG-4066    An optimization for ROLLUP operation in Pig
            https://issues.apache.org/jira/browse/PIG-4066
PIG-4004    Upgrade the Pigmix queries from the (old) mapred API to mapreduce
            https://issues.apache.org/jira/browse/PIG-4004
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3861    duplicate jars get added to distributed cache
            https://issues.apache.org/jira/browse/PIG-3861
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3635    Fix e2e tests for Hadoop 2.X on Windows
            https://issues.apache.org/jira/browse/PIG-3635
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384
[jira] [Commented] (PIG-4232) UDFContext is not initialized in executors when running on Spark cluster
[ https://issues.apache.org/jira/browse/PIG-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176692#comment-14176692 ]

liyunzhang_intel commented on PIG-4232:
---------------------------------------

Scripting.pig:
{code}
register '/home/zly/prj/oss/pig/bin/libexec/python/scriptingudf.py' using jython as myfuncs;
a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach a generate age;
explain b;
store b into '/user/pig/out/root-1412926432-nightly.conf/Scripting_1.out';
{code}
*Scripting.pig succeeds in spark mode.*

Scripting.udf.pig:
{code}
register '/home/zly/prj/oss/pig/bin/libexec/python/scriptingudf.py' using jython as myfuncs;
a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach a generate myfuncs.square(age);
explain b;
store b into '/user/pig/out/root-1412926432-nightly.conf/Scripting_1.out';
{code}
*This script fails in spark mode.*

After debugging, I found that UDFContext is not initialized in the Spark executors. In https://github.com/apache/pig/blob/spark/src/org/apache/pig/builtin/PigStorage.java#L246-L247:
{code}
Properties p = UDFContext.getUDFContext().getUDFProperties(this.getClass());
mRequiredColumns = (boolean[]) ObjectSerializer.deserialize(p.getProperty(signature));
{code}
When executing Scripting.pig, UDFContext.getUDFContext().getUDFProperties(this.getClass()) returns a Properties object that contains the PigStorage info, and after deserialization the variable mRequiredColumns is correctly initialized. When executing Scripting.udf.pig, it returns a Properties object that does not contain the PigStorage info, and after deserialization mRequiredColumns is null.
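The failure mode can be reproduced outside Pig with a minimal sketch of ThreadLocal isolation. FakeUDFContext and the map standing in for the jobConf are illustrative assumptions, not the real org.apache.pig.impl.util.UDFContext API: state set on the driver thread is invisible on a separate executor thread until it is explicitly shipped through a shared structure and rebuilt there.

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class ThreadLocalContextDemo {
    // Stands in for UDFContext: one independent instance per thread.
    static class FakeUDFContext {
        final Properties udfConf = new Properties();
    }
    static final ThreadLocal<FakeUDFContext> tss =
            ThreadLocal.withInitial(FakeUDFContext::new);

    /** Returns {value seen before the rebuild, value seen after}. */
    static String[] run() throws InterruptedException {
        // Stands in for the serialized jobConf shipped to executors.
        ConcurrentHashMap<String, String> jobConf = new ConcurrentHashMap<>();

        // "Driver" thread: populate its context and serialize into the conf.
        tss.get().udfConf.setProperty("pig.storage.columns", "0,2");
        jobConf.put("udf.conf", tss.get().udfConf.getProperty("pig.storage.columns"));

        String[] seen = new String[2];
        // "Executor" thread: its ThreadLocal copy starts empty...
        Thread executor = new Thread(() -> {
            Properties p = tss.get().udfConf;
            seen[0] = p.getProperty("pig.storage.columns"); // null here
            // ...until it is rebuilt from the shared conf.
            p.setProperty("pig.storage.columns", jobConf.get("udf.conf"));
            seen[1] = p.getProperty("pig.storage.columns"); // "0,2" now
        });
        executor.start();
        executor.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] seen = run();
        System.out.println("executor before rebuild: " + seen[0]);
        System.out.println("executor after rebuild: " + seen[1]);
    }
}
```

This is exactly why LoadConverter has to serialize the UDF info into the jobConf and the executor side has to deserialize it, as the comment below explains.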
*Where is UDFContext set up?*
{code}
LoadConverter#convert
  -> SparkUtil#newJobConf

public static JobConf newJobConf(PigContext pigContext) throws IOException {
    JobConf jobConf = new JobConf(
            ConfigurationUtil.toConfiguration(pigContext.getProperties()));
    jobConf.set("pig.pigContext", ObjectSerializer.serialize(pigContext));
    UDFContext.getUDFContext().serialize(jobConf); // serialize all udf info (including PigStorage info) to jobConf
    jobConf.set("udf.import.list", ObjectSerializer.serialize(PigContext.getPackageImportList()));
    return jobConf;
}

PigInputFormat#passLoadSignature
  -> MapRedUtil.setupUDFContext(conf)
  -> UDFContext.setUDFContext

public static void setupUDFContext(Configuration job) throws IOException {
    UDFContext udfc = UDFContext.getUDFContext();
    udfc.addJobConf(job);
    // don't deserialize in front-end
    if (udfc.isUDFConfEmpty()) {
        udfc.deserialize(); // UDFContext deserializes from jobConf.
    }
}
{code}
LoadConverter#convert first serializes all UDF info (including the PigStorage info) into the jobConf. Then, in PigInputFormat#passLoadSignature, UDFContext deserializes it from the jobConf.

*Why are serialization and deserialization needed?*
UDFContext#tss is a ThreadLocal variable, so its value differs between threads. LoadConverter#convert is executed in Thread-A, while PigInputFormat#passLoadSignature is executed in Thread-B, because Spark sends its tasks to executors: PigInputFormat#passLoadSignature actually runs in a Spark executor (Thread-B). Serialization and deserialization are used to initialize the UDFContext value across threads.
{code}
// UDFContext.java
private static ThreadLocal<UDFContext> tss = new ThreadLocal<UDFContext>() {
    @Override
    public UDFContext initialValue() {
        return new UDFContext();
    }
};

public static UDFContext getUDFContext() {
    return tss.get();
}
{code}

*Why does the behavior differ with and without a UDF?*
With a UDF: before PigInputFormat#passLoadSignature is executed, POUserFunc#setFuncInputSchema is executed.
In POUserFunc#setFuncInputSchema, an entry is put into UDFContext#udfConfs (the key is POUserFunc, the value is an empty Properties object). When PigInputFormat#passLoadSignature is then executed, the condition for deserializing the UDFContext (udfc.isUDFConfEmpty()) is not met, so deserialization is skipped.
{code}
POUserFunc#readObject
  -> POUserFunc#instantiateFunc(FuncSpec)
  -> POUserFunc#setFuncInputSchema(String)

public void setFuncInputSchema(String signature) {
    Properties props = UDFContext.getUDFContext().getUDFProperties(func.getClass());
    Schema tmpS = (Schema) props.get("pig.evalfunc.inputschema." + signature);
    if (tmpS != null) {
        this.func.setInputSchema(tmpS);
    }
}

// UDFContext.java
public Properties getUDFProperties(Class c) {
    UDFContextKey k = generateKey(c, null);
    Properties p = udfConfs.get(k);
    if (p == null) {
        p = new Properties(); // an empty Properties object
        udfConfs.put(k, p);
    }
    return p;
}
{code}
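The guard described above can be illustrated with a standalone sketch. EmptyGuardDemo and its string keys are hypothetical stand-ins, not Pig's actual classes: once any entry, even an empty one, is present in the UDF map, the isUDFConfEmpty() check blocks deserialization, so PigStorage's serialized properties never arrive.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class EmptyGuardDemo {
    final Map<String, Properties> udfConfs = new HashMap<>();

    boolean isUDFConfEmpty() {
        return udfConfs.isEmpty();
    }

    // Mirrors the guard in setupUDFContext: only deserialize when empty.
    void deserialize(Map<String, Properties> serialized) {
        if (isUDFConfEmpty()) {
            udfConfs.putAll(serialized);
        }
    }

    public static void main(String[] args) {
        Map<String, Properties> fromJobConf = new HashMap<>();
        Properties storage = new Properties();
        storage.setProperty("requiredColumns", "0,1");
        fromJobConf.put("PigStorage", storage);

        // Without a UDF: the map is empty, so deserialization runs.
        EmptyGuardDemo noUdf = new EmptyGuardDemo();
        noUdf.deserialize(fromJobConf);
        System.out.println("no udf -> " + noUdf.udfConfs.get("PigStorage"));

        // With a UDF: an empty POUserFunc entry was added first (by the
        // equivalent of setFuncInputSchema), blocking deserialization.
        EmptyGuardDemo withUdf = new EmptyGuardDemo();
        withUdf.udfConfs.put("POUserFunc", new Properties());
        withUdf.deserialize(fromJobConf);
        System.out.println("with udf -> " + withUdf.udfConfs.get("PigStorage"));
    }
}
```

In the no-UDF case the PigStorage entry is restored; in the with-UDF case the lookup returns null, matching the null mRequiredColumns observed in the Scripting.udf.pig run.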
[jira] [Updated] (PIG-4232) UDFContext is not initialized in executors when running on Spark cluster
[ https://issues.apache.org/jira/browse/PIG-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4232:
----------------------------------
    Attachment: PIG-4232.patch

UDFContext is not initialized in executors when running on Spark cluster
------------------------------------------------------------------------
                 Key: PIG-4232
                 URL: https://issues.apache.org/jira/browse/PIG-4232
             Project: Pig
          Issue Type: Sub-task
          Components: spark
            Reporter: Praveen Rachabattuni
            Assignee: liyunzhang_intel
         Attachments: PIG-4232.patch

UDFContext is used in a lot of features across the Pig code base. For example, it's used in PigStorage to pass column information between the frontend and the backend code. https://github.com/apache/pig/blob/spark/src/org/apache/pig/builtin/PigStorage.java#L246-L247

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (PIG-4237) Error when there is a bag inside an RDD
[ https://issues.apache.org/jira/browse/PIG-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Praveen Rachabattuni resolved PIG-4237.
---------------------------------------
    Resolution: Fixed

Error when there is a bag inside an RDD
---------------------------------------
                 Key: PIG-4237
                 URL: https://issues.apache.org/jira/browse/PIG-4237
             Project: Pig
          Issue Type: Bug
          Components: spark
            Reporter: Carlos Balduz
            Assignee: Carlos Balduz
            Priority: Critical
              Labels: spork
         Attachments: PIG-4237-1.diff

Bags cannot be sent to an RDD, as doing so produces a "SelfSpillBag$MemoryLimits not Serializable" exception. This results in an error for almost every operation performed after grouping tuples. The error is fixed by making the protected MemoryLimits memLimit attribute inside org.apache.pig.data.SelfSpillBag transient, but I do not know the impact of this change.
[jira] [Updated] (PIG-4237) Error when there is a bag inside an RDD
[ https://issues.apache.org/jira/browse/PIG-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Praveen Rachabattuni updated PIG-4237:
--------------------------------------
    Patch Info: Patch Available
[jira] [Commented] (PIG-4237) Error when there is a bag inside an RDD
[ https://issues.apache.org/jira/browse/PIG-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176816#comment-14176816 ]

Praveen Rachabattuni commented on PIG-4237:
-------------------------------------------

Committed this to the Spark branch.
[jira] [Commented] (PIG-3979) group all performance, garbage collection, and incremental aggregation
[ https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177063#comment-14177063 ]

Rohini Palaniswamy commented on PIG-3979:
-----------------------------------------

[~ddreyfus],

bq. For counting spilled bytes, I rely on the counters.
The problem is that for killed or failed tasks the counters are discarded, and the log messages are very useful then.

bq. I understand how the extra GC solved PIG-3148.
There were two System.gc() calls in the SpillableMemoryManager. There was always a System.gc() invoked after all the spill() calls, which had been there from the beginning. Then there is an extraGC that was introduced to do a System.gc() before the spill() call if the object size exceeds a certain threshold. You have removed both System.gc() calls, which I believe will lead to a lot of regressions. It would be nice to add another API to confirm before invoking the extra GC, but that would cause backward incompatibility unless we move to Java 8. For now we can hack around it by doing instanceof POPartialAgg and not invoking extraGC. I am posting a patch that makes the spill in POPartialAgg synchronous. Can you test if it fixes your problem?

group all performance, garbage collection, and incremental aggregation
----------------------------------------------------------------------
                 Key: PIG-3979
                 URL: https://issues.apache.org/jira/browse/PIG-3979
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.12.0, 0.11.1
            Reporter: David Dreyfus
            Assignee: David Dreyfus
             Fix For: 0.14.0
         Attachments: PIG-3979-3.patch, PIG-3979-4.patch, PIG-3979-v1.patch, POPartialAgg.java.patch, SpillableMemoryManager.java.patch

I have a Pig statement similar to:
summary = foreach (group data ALL) generate COUNT(data.col1), SUM(data.col2), SUM(data.col2), Moments(col3), Moments(data.col4)
There are a couple of hundred columns.

I set the following:
SET pig.exec.mapPartAgg true;
SET pig.exec.mapPartAgg.minReduction 3;
SET pig.cachedbag.memusage 0.05;

I found that when I ran this on a JVM with insufficient memory, the process eventually timed out because of an infinite garbage collection loop. The problem was invariant to the memusage setting. I solved the problem by making changes to org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.java. Rather than reading in a fixed number of records to establish an estimate of the reduction, I make an estimate after reading in enough tuples to fill pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory(). I also made a change to guarantee at least one record allowed in second-tier storage. In the current implementation, if the reduction is very high (1000:1), space in second-tier storage is zero. With these changes, I can summarize large data sets with small JVMs. I also find that setting pig.cachedbag.memusage to a small number such as 0.05 results in much better garbage collection performance without reducing throughput. I suppose tuning GC would also solve the problem of excessive garbage collection. The performance is sweet.
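The sizing idea in the description above can be sketched as follows. The class and method names are illustrative assumptions, not POPartialAgg's actual internals: sample until the cached tuples would fill the pig.cachedbag.memusage fraction of the heap, then compare the raw-to-aggregated ratio against pig.exec.mapPartAgg.minReduction to decide whether map-side aggregation pays off.

```java
public class ReductionEstimator {
    // Budget the sample by heap fraction instead of a fixed record count.
    static long sampleBudgetBytes(double memusageFraction) {
        return (long) (Runtime.getRuntime().maxMemory() * memusageFraction);
    }

    // Keep map-side aggregation only if it shrinks the data enough.
    static boolean keepMapSideAgg(long rawTuples, long aggregatedTuples, int minReduction) {
        return rawTuples >= aggregatedTuples * (long) minReduction;
    }

    public static void main(String[] args) {
        // SET pig.cachedbag.memusage 0.05 -> sample roughly 5% of the heap.
        System.out.println("sample budget: " + sampleBudgetBytes(0.05) + " bytes");
        // SET pig.exec.mapPartAgg.minReduction 3
        System.out.println(keepMapSideAgg(30000, 100, 3));  // 300:1 reduction, keep it
        System.out.println(keepMapSideAgg(2000, 1000, 3));  // only 2:1, disable it
    }
}
```

Budgeting the sample by heap fraction is what keeps a small JVM from overfilling itself before the reduction estimate exists, which is the infinite-GC scenario the reporter hit.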
[jira] [Updated] (PIG-3979) group all performance, garbage collection, and incremental aggregation
[ https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3979:
------------------------------------
    Attachment: PIG-3979-synchronous-spill.patch

Attached an initial version of the patch, which does not do extraGC for POPartialAgg and makes the spill synchronous so that the second invokeGC actually frees memory. Can you check if it works for you? It still needs a fix for TestPOPartialAgg, plus modification of existing tests or addition of new tests for the behaviour change.
[jira] [Created] (PIG-4243) TestAlgebraicEval fails in spark mode
liyunzhang_intel created PIG-4243:
----------------------------------
             Summary: TestAlgebraicEval fails in spark mode
                 Key: PIG-4243
                 URL: https://issues.apache.org/jira/browse/PIG-4243
             Project: Pig
          Issue Type: Bug
          Components: spark
            Reporter: liyunzhang_intel

1. Build the spark and pig env according to PIG-4168
2. Add TestAlgebraicEval to $PIG_HOME/test/spark-tests:
   cat $PIG_HOME/test/spark-tests
   **/TestAlgebraicEval
3. Run the unit test TestAlgebraicEval:
   ant test-spark
4. The unit test fails; the error log is attached
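The repro steps above, sketched as a shell session. PIG_HOME is assumed to point at a checkout of the Pig spark branch prepared per PIG-4168; the /tmp/pig default below is illustrative only.

```shell
# Illustrative default; in a real run PIG_HOME points at your Pig checkout.
PIG_HOME=/tmp/pig
mkdir -p "$PIG_HOME/test"

# Step 2: list the test class in the spark-tests file.
echo '**/TestAlgebraicEval' >> "$PIG_HOME/test/spark-tests"
cat "$PIG_HOME/test/spark-tests"

# Steps 3-4: run only the listed tests (this is where the failure shows up).
echo 'next: (cd "$PIG_HOME" && ant test-spark)'
```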
[jira] [Updated] (PIG-4243) TestStore fails in spark mode
[ https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4243:
----------------------------------
    Summary: TestStore fails in spark mode  (was: TestAlgebraicEval fails in spark mode)
[jira] [Updated] (PIG-4243) TestStore fails in spark mode
[ https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4243:
----------------------------------
    Attachment: TEST-org.apache.pig.test.TestStore.txt
[jira] [Updated] (PIG-4243) TestStore fails in spark mode
[ https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4243:
----------------------------------
    Description:
1. Build the spark and pig env according to PIG-4168
2. Add TestStore to $PIG_HOME/test/spark-tests:
   cat $PIG_HOME/test/spark-tests
   **/TestStore
3. Run the unit test TestStore:
   ant test-spark
4. The unit test fails; the error log is attached

  was:
1. Build the spark and pig env according to PIG-4168
2. Add TestAlgebraicEval to $PIG_HOME/test/spark-tests:
   cat $PIG_HOME/test/spark-tests
   **/TestAlgebraicEval
3. Run the unit test TestAlgebraicEval:
   ant test-spark
4. The unit test fails; the error log is attached
[jira] [Updated] (PIG-4168) Initial implementation of unit tests for Pig on Spark
[ https://issues.apache.org/jira/browse/PIG-4168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4168:
----------------------------------
    Attachment: PIG-4168_5.patch

Uploaded a new patch, PIG-4168_5.patch; please use this one for review.

Initial implementation of unit tests for Pig on Spark
-----------------------------------------------------
                 Key: PIG-4168
                 URL: https://issues.apache.org/jira/browse/PIG-4168
             Project: Pig
          Issue Type: Sub-task
          Components: spark
            Reporter: Praveen Rachabattuni
            Assignee: liyunzhang_intel
         Attachments: PIG-4168.patch, PIG-4168_1.patch, PIG-4168_2.patch, PIG-4168_3.patch, PIG-4168_4.patch, PIG-4168_5.patch

1. ant clean jar; pig-0.14.0-SNAPSHOT-core-h1.jar will be generated by the command
2. export SPARK_PIG_JAR=$PIG_HOME/pig-0.14.0-SNAPSHOT-core-h1.jar
3. Build the hadoop1 and spark env; spark runs in local mode. jps:
   11647 Master       # spark master runs
   6457  DataNode     # hadoop datanode runs
   22374 Jps
   11705 Worker       # spark worker runs
   27009 JobTracker   # hadoop jobtracker runs
   26602 NameNode     # hadoop namenode runs
   29486 org.eclipse.equinox.launcher_1.3.0.v20120522-1813.jar
   19692 Main
4. ant test-spark
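A small hedged helper for step 3 above: check that the daemons shown in the jps listing are actually up before running ant test-spark. The process names are as printed by jps on a typical hadoop1 + Spark standalone setup; jps ships with the JDK and may be absent on JRE-only machines, in which case everything reports MISSING.

```shell
check_daemons() {
  # Capture jps output once; tolerate jps being unavailable.
  jps_out="$(jps 2>/dev/null || true)"
  for d in Master Worker NameNode DataNode JobTracker; do
    if printf '%s\n' "$jps_out" | grep -qw "$d"; then
      echo "$d: running"
    else
      echo "$d: MISSING"
    fi
  done
}
check_daemons
```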