Re: Review Request 45667: Support Pig On Spark

2016-10-26 Thread Pallavi Rao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/45667/
---

(Updated Oct. 27, 2016, 3:17 a.m.)


Review request for pig, Daniel Dai and Rohini Palaniswamy.


Bugs: PIG-4059 and PIG-4854
https://issues.apache.org/jira/browse/PIG-4059
https://issues.apache.org/jira/browse/PIG-4854


Repository: pig-git


Description
---

The patch contains all the work done in the spark branch, so far.


Diffs (updated)
-

  bin/pig 81f1426 
  build.xml 99ba1f4 
  ivy.xml dd9878e 
  ivy/libraries.properties 3a819a5 
  shims/test/hadoop20/org/apache/pig/test/SparkMiniCluster.java PRE-CREATION 
  shims/test/hadoop23/org/apache/pig/test/SparkMiniCluster.java PRE-CREATION 
  shims/test/hadoop23/org/apache/pig/test/TezMiniCluster.java 792a1bd 
  shims/test/hadoop23/org/apache/pig/test/YarnMiniCluster.java PRE-CREATION 
  src/META-INF/services/org.apache.pig.ExecType 5c034c8 
  src/docs/src/documentation/content/xdocs/start.xml 36f9952 
  
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java 1ff1abd 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java ecf780c 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhysicalPlan.java 2376d03 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCollectedGroup.java bcbfe2b 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POFRJoin.java d80951a 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java 21b75f1 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POGlobalRearrange.java 52cfb73 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POMergeJoin.java 13f70c0 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POSort.java c3a82c3 
  src/org/apache/pig/backend/hadoop/executionengine/spark/JobGraphBuilder.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/JobMetricsListener.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/KryoSerializer.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/MapReducePartitionerWrapper.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/SparkEngineConf.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/SparkExecType.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/SparkExecutionEngine.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLocalExecType.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/SparkUtil.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/UDFJarsFinder.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CollectedGroupConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CounterConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/DistinctConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FRJoinConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FilterConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ForEachConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/GlobalRearrangeConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/IndexedKey.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/IteratorTransform.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LimitConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LoadConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LocalRearrangeConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/MergeCogroupConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/MergeJoinConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/OutputConsumerIterator.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PackageConverter.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PigSecondaryKeyComparatorSpark.java PRE-CREATION 
  

[jira] [Commented] (PIG-4920) Fail to use Javascript UDF in spark yarn client mode

2016-10-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610287#comment-15610287
 ] 

liyunzhang_intel commented on PIG-4920:
---

[~mohitsabharwal] and [~xuefuz]: thanks for your review and checkin!

> Fail to use Javascript UDF in spark yarn client mode
> 
>
> Key: PIG-4920
> URL: https://issues.apache.org/jira/browse/PIG-4920
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4920.patch, PIG-4920_2.patch, PIG-4920_3.patch, 
> PIG-4920_4.patch, PIG-4920_5.patch, PIG-4920_6.patch
>
>
> udf.pig 
> {code}
> register '/home/zly/prj/oss/merge.pig/pig/bin/udf.js' using javascript as 
> myfuncs;
> A = load './passwd' as (a0:chararray, a1:chararray);
> B = foreach A generate myfuncs.helloworld();
> store B into './udf.out';
> {code}
> udf.js
> {code}
> helloworld.outputSchema = "word:chararray";
> function helloworld() {
> return 'Hello, World';
> }
> 
> complex.outputSchema = "word:chararray";
> function complex(word){
> return {word:word};
> }
> {code}
> Run udf.pig in spark local mode (export SPARK_MASTER="local") and it succeeds.
> Run udf.pig in spark yarn client mode (export SPARK_MASTER="yarn-client") and 
> it fails with an error message like the following:
> {noformat}
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:744)
> ... 84 more
> Caused by: java.lang.ExceptionInInitializerError
> at 
> org.apache.pig.scripting.js.JsScriptEngine.getInstance(JsScriptEngine.java:87)
> at org.apache.pig.scripting.js.JsFunction.<init>(JsFunction.java:173)
> ... 89 more
> Caused by: java.lang.IllegalStateException: could not get script path from 
> UDFContext
> at 
> org.apache.pig.scripting.js.JsScriptEngine$Holder.<clinit>(JsScriptEngine.java:69)
> ... 91 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-10-26 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609943#comment-15609943
 ] 

Daniel Dai commented on PIG-5048:
-

Thanks for the patch. Several comments:
1. I'd like to return Integer.MAX_VALUE in size(), as the name 
UnlimitedNullTuple suggests.
2. I'd rather not change the other methods if they are not causing problems.
3. Can you add a test case?
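For illustration, comment 1 might look roughly like the following. This is a hypothetical, standalone sketch of the suggested change, not the actual patch; the real UnlimitedNullTuple extends Pig's AbstractTuple and throws ExecException("Unimplemented") from size() today.

```java
// Hypothetical sketch of the suggestion in comment 1: an "unlimited" tuple
// has no fixed arity, so size() advertises the maximum instead of throwing.
// Standalone class for illustration only; the real one extends AbstractTuple.
public class UnlimitedNullTupleSketch {
    // Every position holds null, so get(i) succeeds for any index.
    public Object get(int fieldNum) {
        return null;
    }

    // As the name UnlimitedNullTuple suggests, report an effectively
    // unbounded size rather than throwing "Unimplemented".
    public int size() {
        return Integer.MAX_VALUE;
    }
}
```

With this change, callers that iterate `for (int i = 0; i < t.size(); i++)` over a projection containing only the HiveUDTF no longer hit the "Unimplemented" error.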

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048.patch
>
>
> The following script fails:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate explode(a0);
> dump B;
> {code}
> Message: Unimplemented at 
> org.apache.pig.data.UnlimitedNullTuple.size(UnlimitedNullTuple.java:31)
> If it is not the first projection, the script passes:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate a0, explode(a0);
> dump B;
> {code}
> Thanks [~nkollar] for reporting it!





[jira] [Updated] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-10-26 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-5048:

Assignee: Nandor Kollar  (was: Daniel Dai)

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048.patch
>
>
> The following script fails:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate explode(a0);
> dump B;
> {code}
> Message: Unimplemented at 
> org.apache.pig.data.UnlimitedNullTuple.size(UnlimitedNullTuple.java:31)
> If it is not the first projection, the script passes:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate a0, explode(a0);
> dump B;
> {code}
> Thanks [~nkollar] for reporting it!





[jira] [Commented] (PIG-4934) SET command does not work well with deprecated settings

2016-10-26 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609905#comment-15609905
 ] 

Daniel Dai commented on PIG-4934:
-

If you have both the old and new keys set to different values, the engine will 
take the new key, so why does it matter?
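To make the question concrete, here is a minimal sketch of why it can matter, using plain java.util.Properties with a hypothetical deprecation table standing in for Hadoop Configuration's built-in key mapping: a SET of the old key in the script never reaches the new key that the engine actually reads.

```java
import java.util.Map;
import java.util.Properties;

public class DeprecatedKeyDemo {
    // Hypothetical deprecation table standing in for Hadoop Configuration's
    // built-in mapping of old mapred.* names to new mapreduce.* names.
    static final Map<String, String> DEPRECATED =
            Map.of("mapred.job.map.memory.mb", "mapreduce.map.memory.mb");

    // Deprecation-aware set: writing the old key also updates the new one,
    // which is roughly what Hadoop's Configuration does internally.
    static void setWithDeprecation(Properties props, String key, String value) {
        props.put(key, value);
        String newKey = DEPRECATED.get(key);
        if (newKey != null) {
            props.put(newKey, value);
        }
    }

    public static void main(String[] args) {
        // Plain Properties, as in HExecutionEngine.setProperty:
        Properties plain = new Properties();
        plain.put("mapreduce.map.memory.mb", "2048");   // from mapred-site.xml
        plain.put("mapred.job.map.memory.mb", "4096");  // SET in the script
        // The engine reads the new key and never sees the script's SET:
        System.out.println(plain.getProperty("mapreduce.map.memory.mb")); // 2048

        // With deprecation-aware resolution, the script's SET takes effect:
        Properties aware = new Properties();
        aware.put("mapreduce.map.memory.mb", "2048");
        setWithDeprecation(aware, "mapred.job.map.memory.mb", "4096");
        System.out.println(aware.getProperty("mapreduce.map.memory.mb")); // 4096
    }
}
```

So with plain Properties the script's SET of the deprecated key is silently shadowed by mapred-site.xml, which is the behavior the issue describes.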

> SET command does not work well with deprecated settings
> ---
>
> Key: PIG-4934
> URL: https://issues.apache.org/jira/browse/PIG-4934
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Adam Szita
> Attachments: PIG-4934.2.patch, PIG-4934.patch
>
>
> For example: if mapred.job.map.memory.mb was specified in the script using the 
> set command and mapreduce.map.memory.mb was present in mapred-site.xml, the 
> latter takes effect. This is because of the use of Properties and not 
> Configuration.
> GruntParser.processSet() calls HExecutionEngine.setProperty, which just 
> updates pigContext.getProperties():
> {code}
> public void setProperty(String property, String value) {
> Properties properties = pigContext.getProperties();
> properties.put(property, value);
> }
> {code}





[jira] [Commented] (PIG-4798) big integer literals fail to parse

2016-10-26 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609858#comment-15609858
 ] 

Daniel Dai commented on PIG-4798:
-

This is not just a bug fix; it enables a feature that never worked before. Can 
you also update the documentation? (src/docs/src/documentation/content/xdocs/basic.xml)
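For context, the fix the issue asks for amounts to suffix-aware literal parsing. A rough, hypothetical sketch (not Pig's actual generated parser code) of recognizing the BI suffix:

```java
import java.math.BigInteger;

public class BigIntLiteralSketch {
    // Hypothetical literal parser illustrating the BI suffix; Pig's real
    // implementation lives in the generated QueryParser, not here.
    static Object parseIntegerLiteral(String token) {
        if (token.endsWith("BI")) {
            // Explicit biginteger literal, e.g. 12345678901234567890BI
            return new BigInteger(token.substring(0, token.length() - 2));
        }
        if (token.endsWith("L")) {
            // Explicit long literal
            return Long.valueOf(token.substring(0, token.length() - 1));
        }
        // Without a suffix, parsing attempts Integer first, which is exactly
        // why a bare 12345678901234567890 failed to parse before the fix.
        return Integer.valueOf(token);
    }

    public static void main(String[] args) {
        System.out.println(parseIntegerLiteral("12345678901234567890BI"));
        System.out.println(parseIntegerLiteral("42"));
    }
}
```

The suffix names (BI, L) here mirror the literal forms discussed in the issue; anything else about the parsing flow is an assumption for illustration.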

> big integer literals fail to parse
> --
>
> Key: PIG-4798
> URL: https://issues.apache.org/jira/browse/PIG-4798
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.15.0, 0.16.0
>Reporter: Savvas Savvides
>Assignee: Adam Szita
> Attachments: PIG-4798.patch
>
>
> For example:
> x < 12345678901234567890
> with x being a biginteger, Pig tries to parse the literal as an Integer. 
> x < 12345678901234567890BI
> fails because the "BI" suffix is not recognized.





[jira] [Updated] (PIG-4920) Fail to use Javascript UDF in spark yarn client mode

2016-10-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4920:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to spark branch. Thanks, Liyun!

> Fail to use Javascript UDF in spark yarn client mode
> 
>
> Key: PIG-4920
> URL: https://issues.apache.org/jira/browse/PIG-4920
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4920.patch, PIG-4920_2.patch, PIG-4920_3.patch, 
> PIG-4920_4.patch, PIG-4920_5.patch, PIG-4920_6.patch
>
>
> udf.pig 
> {code}
> register '/home/zly/prj/oss/merge.pig/pig/bin/udf.js' using javascript as 
> myfuncs;
> A = load './passwd' as (a0:chararray, a1:chararray);
> B = foreach A generate myfuncs.helloworld();
> store B into './udf.out';
> {code}
> udf.js
> {code}
> helloworld.outputSchema = "word:chararray";
> function helloworld() {
> return 'Hello, World';
> }
> 
> complex.outputSchema = "word:chararray";
> function complex(word){
> return {word:word};
> }
> {code}
> Run udf.pig in spark local mode (export SPARK_MASTER="local") and it succeeds.
> Run udf.pig in spark yarn client mode (export SPARK_MASTER="yarn-client") and 
> it fails with an error message like the following:
> {noformat}
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:744)
> ... 84 more
> Caused by: java.lang.ExceptionInInitializerError
> at 
> org.apache.pig.scripting.js.JsScriptEngine.getInstance(JsScriptEngine.java:87)
> at org.apache.pig.scripting.js.JsFunction.<init>(JsFunction.java:173)
> ... 89 more
> Caused by: java.lang.IllegalStateException: could not get script path from 
> UDFContext
> at 
> org.apache.pig.scripting.js.JsScriptEngine$Holder.<clinit>(JsScriptEngine.java:69)
> ... 91 more
> {noformat}





[jira] [Commented] (PIG-5025) Improve TestLoad.java: use own separated folder under /tmp

2016-10-26 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608615#comment-15608615
 ] 

Adam Szita commented on PIG-5025:
-

Actually I think we should go with [^PIG-5025.1.patch].
I checked this test today on Windows and it was actually passing without any 
of these touches (i.e. using /tmp). It seems this test doesn't even put 
anything into the given working directory on the local FS, and the directory 
doesn't even have to exist.
Most tests merely check whether the supplied path matches the one found in the 
Load operator after creating the query plan (see checkLoadPath).
Those that do write and read files (see those calling testLoadingMultipleFiles) 
only do so in HDFS mode, and they clean up after their work too. 

So the only problem is the case when the directory exists and contains a file 
with a : in its name.
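The failure mode can be reproduced with just java.net.URI, which is what Hadoop's Path constructor ultimately delegates to. A small sketch, assuming Path treats the text before the first colon as a URI scheme (as the stack trace in the issue suggests):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonFileNameDemo {
    // A file name like "t:2sTest.txt" is parsed as if "t" were a URI scheme,
    // leaving "2sTest.txt" as a relative path. The multi-argument URI
    // constructor rejects a relative path combined with a scheme, which is
    // the exception seen in the failing tests.
    static String describe(String scheme, String path) {
        try {
            new URI(scheme, null, path, null);
            return "ok";
        } catch (URISyntaxException e) {
            return e.getReason();
        }
    }

    public static void main(String[] args) {
        // scheme "t" + relative path -> "Relative path in absolute URI"
        System.out.println(describe("t", "2sTest.txt"));
        // no scheme -> a relative path is perfectly legal
        System.out.println(describe(null, "plainName.txt")); // ok
    }
}
```

This also shows why moving the tests into their own folder sidesteps the problem only as long as no stray colon-named file ends up there.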

> Improve TestLoad.java: use own separated folder under /tmp
> --
>
> Key: PIG-5025
> URL: https://issues.apache.org/jira/browse/PIG-5025
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Minor
> Attachments: PIG-5025.1.patch, PIG-5025.2.patch, PIG-5025.patch
>
>
> Test cases testCommaSeparatedString2 and testGlobChars may fail if, for some 
> reason, files (from any other source) in /tmp have a : (colon) in their 
> filenames. This is because HDFS doesn't support colons, since it has its own 
> URI handling. Exception below.
> I propose we separate the working dir of these tests into their own folder 
> in /tmp.
> Failed to parse: java.net.URISyntaxException: Relative path in absolute URI: 
> t:2sTest.txt
>   at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
>   at org.apache.pig.test.TestLoad.checkLoadPath(TestLoad.java:317)
>   at org.apache.pig.test.TestLoad.checkLoadPath(TestLoad.java:299)
>   at 
> org.apache.pig.test.TestLoad.testCommaSeparatedString2(TestLoad.java:189)
> Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: t:2sTest.txt
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:94)
>   at org.apache.hadoop.fs.Globber.doGlob(Globber.java:260)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:151)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1637)
>   at 
> org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:215)
>   at 
> org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:41)
>   at 
> org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:119)
>   at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:191)
>   at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:518)
>   at 
> org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
>   at 
> org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
>   at 
> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:866)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
>   at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> t:2sTest.txt
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:203)





[jira] [Updated] (PIG-4854) Merge spark branch to trunk

2016-10-26 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4854:
--
Attachment: PigOnSpark_3.patch

[~rohini], [~daijy], [~xuefuz]: uploaded PigOnSpark_3.patch.
Changes between PigOnSpark_3.patch and the previous patch:
1. Reverted the modifications to core code (PigContext.java, 
TestHBaseStorage.java) after PIG-4920, because we don't store UDFContext 
properties in PigContext any more in spark mode.

> Merge spark branch to trunk
> ---
>
> Key: PIG-4854
> URL: https://issues.apache.org/jira/browse/PIG-4854
> Project: Pig
>  Issue Type: Bug
>Reporter: Pallavi Rao
> Attachments: PIG-On-Spark.patch, PigOnSpark_3.patch
>
>
> Believe the spark branch will be shortly ready to be merged with the main 
> branch (couple of minor patches pending commit), given that we have addressed 
> most functionality gaps and have ensured the UTs are clean. There are a few 
> optimizations which we will take up once the branch is merged to trunk.
> [~xuefuz], [~rohini], [~daijy],
Hopefully, you agree that the spark branch is ready for merge. If yes, how 
would you like us to go about it? Do you want me to upload a huge patch that 
will be merged like any other patch, or do you prefer a branch merge?





Re: Review Request 45667: Support Pig On Spark

2016-10-26 Thread kelly zhang


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java, line 313
> > 
> >
> > Just change the code to use UDFContext.getUDFContext().getJobConf() 
> > which should not be null instead of getClientSystemProps(). Not sure why it 
> > is using getClientSystemProps() in the first place.
> 
> kelly zhang wrote:
> Here if we change to UDFContext.getUDFContext().getJobConf(), the problem 
> still exists.
> 
> 
> The reason we check whether UDFContext.getUDFContext().getJobConf() is 
> null is that the Spark executor first initializes all objects and only then 
> calls UDFContext.deserialize(). The HBaseStorage constructor runs before 
> UDFContext.deserialize(), so we need to verify that 
> UDFContext.getUDFContext().getJobConf() is not null; otherwise an NPE will 
> be thrown here.

Updated PigOnSpark_3.patch. After PIG-4920, we store 
UDFContext#getClientSystemProps and UDFContext#getUdfConfs in SparkEngineConf, 
so HBaseStorage is no longer modified.
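The initialization-order problem described above can be shown with a minimal stand-in for UDFContext (hypothetical classes, not the real Pig types): the constructor runs on the executor before deserialize() populates the context, so any constructor-time read must tolerate null.

```java
public class InitOrderDemo {
    // Minimal stand-in for UDFContext: jobConf is only populated once
    // deserialize() runs on the executor, after objects are constructed.
    static class FakeUdfContext {
        private Object jobConf;                 // null until deserialize()
        Object getJobConf() { return jobConf; }
        void deserialize() { jobConf = new Object(); }
    }

    // Stand-in for a constructor-time lookup like HBaseStorage's: without
    // the null check this would NPE when the executor constructs the object
    // before UDFContext.deserialize() has run.
    static String readCachingProperty(FakeUdfContext ctx) {
        if (ctx.getJobConf() == null) {
            return "default";                   // context not deserialized yet
        }
        return "from-jobconf";
    }

    public static void main(String[] args) {
        FakeUdfContext ctx = new FakeUdfContext();
        System.out.println(readCachingProperty(ctx)); // default
        ctx.deserialize();                       // happens later on the executor
        System.out.println(readCachingProperty(ctx)); // from-jobconf
    }
}
```

Storing the needed properties in SparkEngineConf, as PIG-4920 did, removes the need for this kind of guard in storage classes.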


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/impl/PigContext.java, line 924
> > 
> >
> > This can be reverted. PigContext need not be serialized to the backend. 
> > See PIG-4866
> 
> kelly zhang wrote:
> PIG-4866 is about not serializing PigContext in the configuration, while 
> here we override PigContext#writeObject and PigContext#readObject to 
> serialize and deserialize only one attribute (packageImportList) in spark 
> mode.

Updated PigOnSpark_3.patch. After PIG-4920, we store 
UDFContext#getClientSystemProps and UDFContext#getUdfConfs in SparkEngineConf, 
so PigContext is no longer modified.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestBuiltin.java, line 3255
> > 
> >
> > This testcase is broken if you have 0-0 repeating twice. It is not 
> > UniqueID anymore.
> 
> kelly zhang wrote:
> 0-0 repeating twice happens because we use the task index in UniqueID#exec:
> public String exec(Tuple input) throws IOException {
> String taskIndex = 
> PigMapReduce.sJobConfInternal.get().get(PigConstants.TASK_INDEX);
> String sequenceId = taskIndex + "-" + Long.toString(sequence);
> sequence++;
> return sequenceId;
> }
> in MR, we initialize PigConstants.TASK_INDEX in 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce#setup:
> 
> protected void setup(Context context) throws IOException, 
> InterruptedException {
>...
> context.getConfiguration().set(PigConstants.TASK_INDEX, 
> Integer.toString(context.getTaskAttemptID().getTaskID().getId()));
> ...
> }
> 
> But Spark does not provide a hook like PigGenericMapReduce.Reduce#setup 
> to initialize PigConstants.TASK_INDEX when a job starts.
> I suggest filing a new JIRA (initialize PigConstants.TASK_INDEX when a 
> Spark job starts) and skipping this unit test until that JIRA is resolved.

Updated PigOnSpark_3.patch. I have created PIG-5051 and added a comment on 
TestBuiltin#testUniqueID (the behavior in spark mode will be the same as in MR 
until PIG-5051 is fixed).


- kelly


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/45667/#review134255
---


On July 11, 2016, 4:32 a.m., Pallavi Rao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/45667/
> ---
> 
> (Updated July 11, 2016, 4:32 a.m.)
> 
> 
> Review request for pig, Daniel Dai and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-4059 and PIG-4854
> https://issues.apache.org/jira/browse/PIG-4059
> https://issues.apache.org/jira/browse/PIG-4854
> 
> 
> Repository: pig-git
> 
> 
> Description
> ---
> 
> The patch contains all the work done in the spark branch, so far.
> 
> 
> Diffs
> -
> 
>   bin/pig 81f1426 
>   build.xml 99ba1f4 
>   ivy.xml dd9878e 
>   ivy/libraries.properties 3a819a5 
>   shims/test/hadoop20/org/apache/pig/test/SparkMiniCluster.java PRE-CREATION 
>   shims/test/hadoop23/org/apache/pig/test/SparkMiniCluster.java PRE-CREATION 
>   shims/test/hadoop23/org/apache/pig/test/TezMiniCluster.java 792a1bd 
>   shims/test/hadoop23/org/apache/pig/test/YarnMiniCluster.java PRE-CREATION 
>   src/META-INF/services/org.apache.pig.ExecType 5c034c8 
>   src/docs/src/documentation/content/xdocs/start.xml 36f9952 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java
>  1ff1abd 
>   
> 

[jira] [Updated] (PIG-5051) Initialize PigConstants.TASK_INDEX in spark mode correctly

2016-10-26 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5051:
--
Description: 
in MR, we initialize PigConstants.TASK_INDEX in 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce#setup:
 
{code}
protected void setup(Context context) throws IOException, InterruptedException {
   ...
context.getConfiguration().set(PigConstants.TASK_INDEX, 
Integer.toString(context.getTaskAttemptID().getTaskID().getId()));
...
}
{code}
But Spark does not provide a hook like PigGenericMapReduce.Reduce#setup to 
initialize PigConstants.TASK_INDEX when a job starts. We need to find a way to 
initialize PigConstants.TASK_INDEX correctly.

After this jira is fixed, the behavior of TestBuiltin#testUniqueID in spark 
mode will be the same as in MR.
Currently we distinguish two cases in TestBuiltin#testUniqueID:
{code}

 @Test
public void testUniqueID() throws Exception {
 ...
if (!Util.isSparkExecType(cluster.getExecType())) {
assertEquals("0-0", iter.next().get(1));
assertEquals("0-1", iter.next().get(1));
assertEquals("0-2", iter.next().get(1));
assertEquals("0-3", iter.next().get(1));
assertEquals("0-4", iter.next().get(1));
assertEquals("1-0", iter.next().get(1));
assertEquals("1-1", iter.next().get(1));
assertEquals("1-2", iter.next().get(1));
assertEquals("1-3", iter.next().get(1));
assertEquals("1-4", iter.next().get(1));
} else {
// because we set PigConstants.TASK_INDEX as 0 in
// ForEachConverter#ForEachFunction#initializeJobConf
// UniqueID.exec() will output like 0-*
// the behavior in spark mode will be the same as in mr until PIG-5051 is fixed.
assertEquals(iter.next().get(1), "0-0");
assertEquals(iter.next().get(1), "0-1");
assertEquals(iter.next().get(1), "0-2");
assertEquals(iter.next().get(1), "0-3");
assertEquals(iter.next().get(1), "0-4");
assertEquals(iter.next().get(1), "0-0");
assertEquals(iter.next().get(1), "0-1");
assertEquals(iter.next().get(1), "0-2");
assertEquals(iter.next().get(1), "0-3");
assertEquals(iter.next().get(1), "0-4");
}
   ...
}
{code}

  was:
in MR, we initialize PigConstants.TASK_INDEX in 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce#setup:
 
{code}
protected void setup(Context context) throws IOException, InterruptedException {
   ...
context.getConfiguration().set(PigConstants.TASK_INDEX, 
Integer.toString(context.getTaskAttemptID().getTaskID().getId()));
...
}
{code}
But Spark does not provide a hook like PigGenericMapReduce.Reduce#setup to 
initialize PigConstants.TASK_INDEX when a job starts. We need to find a way to 
initialize PigConstants.TASK_INDEX correctly.


> Initialize PigConstants.TASK_INDEX in spark mode correctly
> -
>
> Key: PIG-5051
> URL: https://issues.apache.org/jira/browse/PIG-5051
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
>
> in MR, we initialize PigConstants.TASK_INDEX in 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce#setup:
>  
> {code}
> protected void setup(Context context) throws IOException, 
> InterruptedException {
>...
> context.getConfiguration().set(PigConstants.TASK_INDEX, 
> Integer.toString(context.getTaskAttemptID().getTaskID().getId()));
> ...
> }
> {code}
> But Spark does not provide a hook like PigGenericMapReduce.Reduce#setup to 
> initialize PigConstants.TASK_INDEX when a job starts. We need to find a way 
> to initialize PigConstants.TASK_INDEX correctly.
> After this jira is fixed, the behavior of TestBuiltin#testUniqueID in spark 
> mode will be the same as in MR.
> Currently we distinguish two cases in TestBuiltin#testUniqueID:
> {code}
>  @Test
> public void testUniqueID() throws Exception {
>  ...
> if (!Util.isSparkExecType(cluster.getExecType())) {
> assertEquals("0-0", iter.next().get(1));
> assertEquals("0-1", iter.next().get(1));
> assertEquals("0-2", iter.next().get(1));
> assertEquals("0-3", iter.next().get(1));
> assertEquals("0-4", iter.next().get(1));
> assertEquals("1-0", iter.next().get(1));
> assertEquals("1-1", iter.next().get(1));
> assertEquals("1-2", iter.next().get(1));
> assertEquals("1-3", iter.next().get(1));
> assertEquals("1-4", iter.next().get(1));
> } else {
> // because we set 

[jira] [Created] (PIG-5051) Initialize PigConstants.TASK_INDEX in spark mode correctly

2016-10-26 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-5051:
-

 Summary: Initialize PigConstants.TASK_INDEX in spark mode correctly
 Key: PIG-5051
 URL: https://issues.apache.org/jira/browse/PIG-5051
 Project: Pig
  Issue Type: Sub-task
Reporter: liyunzhang_intel


in MR, we initialize PigConstants.TASK_INDEX in 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce#setup:
 
{code}
protected void setup(Context context) throws IOException, InterruptedException {
   ...
context.getConfiguration().set(PigConstants.TASK_INDEX, 
Integer.toString(context.getTaskAttemptID().getTaskID().getId()));
...
}
{code}
But Spark does not provide a hook like PigGenericMapReduce.Reduce#setup to 
initialize PigConstants.TASK_INDEX when a job starts. We need to find a way to 
initialize PigConstants.TASK_INDEX correctly.
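To see why a fixed task index repeats ids, the id scheme from UniqueID#exec (quoted earlier in this digest) can be sketched standalone; idsForTask below is a hypothetical helper, not Pig code.

```java
import java.util.ArrayList;
import java.util.List;

public class UniqueIdSketch {
    // Standalone version of UniqueID's id scheme: each task keeps a
    // per-task sequence counter and prefixes it with its task index.
    static List<String> idsForTask(int taskIndex, int count) {
        List<String> ids = new ArrayList<>();
        for (long sequence = 0; sequence < count; sequence++) {
            ids.add(taskIndex + "-" + sequence);
        }
        return ids;
    }

    public static void main(String[] args) {
        // In MR, Reduce#setup gives each task its real index: no collisions.
        System.out.println(idsForTask(0, 3)); // [0-0, 0-1, 0-2]
        System.out.println(idsForTask(1, 3)); // [1-0, 1-1, 1-2]
        // Under the current Spark workaround every task reports index 0,
        // so a second task repeats the first task's ids exactly.
        System.out.println(idsForTask(0, 3).equals(idsForTask(0, 3))); // true
    }
}
```

This is exactly the duplicate "0-0, 0-1, ..." pattern TestBuiltin#testUniqueID observes in spark mode.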





[jira] [Updated] (PIG-5029) Optimize sort case when data is skewed

2016-10-26 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5029:
--
Attachment: PIG-5029_3.patch

> Optimize sort case when data is skewed
> --
>
> Key: PIG-5029
> URL: https://issues.apache.org/jira/browse/PIG-5029
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5029.patch, PIG-5029_2.patch, PIG-5029_3.patch, 
> SkewedData_L9.docx
>
>
> In PigMix L9.pig
> {code}
> register $PIGMIX_JAR
> A = load '$HDFS_ROOT/page_views' using 
> org.apache.pig.test.pigmix.udf.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp,
> estimated_revenue, page_info, page_links);
> B = order A by query_term parallel $PARALLEL;
> store B into '$PIGMIX_OUTPUT/L9out';
> {code}
> The Pig physical plan is converted to a Spark plan with the following RDD lineage:
> {code}
> [main] 2016-09-08 01:49:09,844 DEBUG converter.StoreConverter 
> (StoreConverter.java:convert(110)) - RDD lineage: (23) MapPartitionsRDD[8] at 
> map at StoreConverter.java:80 []
>  |   MapPartitionsRDD[7] at mapPartitions at SortConverter.java:58 []
>  |   ShuffledRDD[6] at sortByKey at SortConverter.java:56 []
>  +-(23) MapPartitionsRDD[3] at map at SortConverter.java:49 []
> |   MapPartitionsRDD[2] at mapPartitions at ForEachConverter.java:64 []
> |   MapPartitionsRDD[1] at map at LoadConverter.java:127 []
> |   NewHadoopRDD[0] at newAPIHadoopRDD at LoadConverter.java:102 []
> {code}
> We use 
> [sortByKey|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/SortConverter.java#L56]
>  to implement the sort feature. Although 
> [RangePartitioner|https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/Partitioner.scala#L106]
>  is used by RDD.sortByKey, and RangePartitioner samples the data to split the 
> keys into roughly equal ranges, the test result (attached document) shows that 
> one partition receives most of the keys and takes a long time to finish.
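The skew described above can be simulated outside Spark. The sketch below is a deliberately simplified, hypothetical model of range partitioning from a sample (Spark's actual RangePartitioner uses weighted reservoir sampling and is not reproduced here): since sorting requires all records with equal keys to land in the same range, a dominant key overwhelms one partition no matter how the bounds are chosen.

```python
import random
from bisect import bisect_left
from collections import Counter

def range_partition(keys, num_partitions, sample_size=100):
    """Simplified model of bounds-from-a-sample range partitioning:
    sample the keys, pick (num_partitions - 1) ordered bounds from the
    sorted sample, then route each key by binary search over the bounds."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    bounds = [sample[int(step * i)] for i in range(1, num_partitions)]
    # bisect_left gives the same index for every occurrence of a key,
    # so equal keys can never be split across partitions.
    return Counter(bisect_left(bounds, k) for k in keys)

random.seed(42)
# Skewed input: 90% of records share one key, mimicking a hot query_term in L9.
keys = ["hot"] * 9000 + ["k%d" % i for i in range(1000)]
counts = range_partition(keys, num_partitions=4)
# One partition holds at least the 9000 "hot" records; the other three
# share the remaining 1000, so equal key *ranges* give unequal record counts.
print(sorted(counts.values(), reverse=True))
```

This is why the attached test results show one straggler partition: the imbalance is inherent to partitioning by key ranges, not a flaw in the sampling.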





[jira] Subscription: PIG patch available

2016-10-26 Thread jira
Issue Subscription
Filter: PIG patch available (27 issues)

Subscriber: pigdaily

Key Summary
PIG-4926  Modify the content of start.xml for spark mode
    https://issues-test.apache.org/jira/browse/PIG-4926
PIG-4922  Deadlock between SpillableMemoryManager and InternalSortedBag$SortedDataBagIterator
    https://issues-test.apache.org/jira/browse/PIG-4922
PIG-4918  Pig on Tez cannot switch pig.temp.dir to another fs
    https://issues-test.apache.org/jira/browse/PIG-4918
PIG-4897  Scope of param substitution for run/exec commands
    https://issues-test.apache.org/jira/browse/PIG-4897
PIG-4886  Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
    https://issues-test.apache.org/jira/browse/PIG-4886
PIG-4854  Merge spark branch to trunk
    https://issues-test.apache.org/jira/browse/PIG-4854
PIG-4849  pig on tez will cause tez-ui to crash, because the content from timeline server is too long.
    https://issues-test.apache.org/jira/browse/PIG-4849
PIG-4788  the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
    https://issues-test.apache.org/jira/browse/PIG-4788
PIG-4745  DataBag should protect content of passed list of tuples
    https://issues-test.apache.org/jira/browse/PIG-4745
PIG-4684  Exception should be changed to warning when job diagnostics cannot be fetched
    https://issues-test.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in BinInterSedes
    https://issues-test.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
    https://issues-test.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
    https://issues-test.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
    https://issues-test.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
    https://issues-test.apache.org/jira/browse/PIG-4515
PIG-4323  PackageConverter hanging in Spark
    https://issues-test.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
    https://issues-test.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
    https://issues-test.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
    https://issues-test.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
    https://issues-test.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
    https://issues-test.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
    https://issues-test.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
    https://issues-test.apache.org/jira/browse/PIG-3873
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
    https://issues-test.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
    https://issues-test.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
    https://issues-test.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
    https://issues-test.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues-test.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328=12322384


[jira] Subscription: PIG patch available

2016-10-26 Thread jira
Issue Subscription
Filter: PIG patch available (35 issues)

Subscriber: pigdaily

Key Summary
PIG-5049  Cleanup e2e tests turing_jython.conf
    https://issues.apache.org/jira/browse/PIG-5049
PIG-5043  Slowstart not applied in Tez with PARALLEL clause and auto parallelism not applied for UnorderedPartitioned
    https://issues.apache.org/jira/browse/PIG-5043
PIG-5036  Remove biggish from e2e input dataset
    https://issues.apache.org/jira/browse/PIG-5036
PIG-5033  MultiQueryOptimizerTez creates bad plan with union, split and FRJoin
    https://issues.apache.org/jira/browse/PIG-5033
PIG-4926  Modify the content of start.xml for spark mode
    https://issues.apache.org/jira/browse/PIG-4926
PIG-4922  Deadlock between SpillableMemoryManager and InternalSortedBag$SortedDataBagIterator
    https://issues.apache.org/jira/browse/PIG-4922
PIG-4920  Fail to use Javascript UDF in spark yarn client mode
    https://issues.apache.org/jira/browse/PIG-4920
PIG-4918  Pig on Tez cannot switch pig.temp.dir to another fs
    https://issues.apache.org/jira/browse/PIG-4918
PIG-4897  Scope of param substitution for run/exec commands
    https://issues.apache.org/jira/browse/PIG-4897
PIG-4854  Merge spark branch to trunk
    https://issues.apache.org/jira/browse/PIG-4854
PIG-4849  pig on tez will cause tez-ui to crash, because the content from timeline server is too long.
    https://issues.apache.org/jira/browse/PIG-4849
PIG-4815  Add xml format support for 'explain' in spark engine
    https://issues.apache.org/jira/browse/PIG-4815
PIG-4798  big integer literals fail to parse
    https://issues.apache.org/jira/browse/PIG-4798
PIG-4788  the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
    https://issues.apache.org/jira/browse/PIG-4788
PIG-4750  REPLACE_MULTI should compile Pattern once and reuse it
    https://issues.apache.org/jira/browse/PIG-4750
PIG-4745  DataBag should protect content of passed list of tuples
    https://issues.apache.org/jira/browse/PIG-4745
PIG-4684  Exception should be changed to warning when job diagnostics cannot be fetched
    https://issues.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in BinInterSedes
    https://issues.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
    https://issues.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
    https://issues.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
    https://issues.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
    https://issues.apache.org/jira/browse/PIG-4515
PIG-4323  PackageConverter hanging in Spark
    https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
    https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
    https://issues.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
    https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
    https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
    https://issues.apache.org/jira/browse/PIG-3911
PIG-3891  FileBasedOutputSizeReader does not calculate size of files in sub-directories
    https://issues.apache.org/jira/browse/PIG-3891
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
    https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
    https://issues.apache.org/jira/browse/PIG-3873
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
    https://issues.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
    https://issues.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
    https://issues.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
    https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328=12322384