[jira] [Commented] (PIG-5316) Initialize mapred.task.id property for PoS jobs

2017-11-28 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269241#comment-16269241
 ] 

Xuefu Zhang commented on PIG-5316:
--

[~nkollar], sorry for the late reply.
{quote}
Though MRConfiguration is not intended for public use in Pig, should Hive use 
MRConfiguration#TASK_ID instead of referring to the taskId as a string?
{quote}
Your concern is valid. However, Hive has a lot of legacy MR1 code (more than 
just mapred.task.id), and cleaning it up would probably take a lot of effort. 
Until that happens, yes, the risk will be there.

> Initialize mapred.task.id property for PoS jobs
> ---
>
> Key: PIG-5316
> URL: https://issues.apache.org/jira/browse/PIG-5316
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Adam Szita
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5316_1.patch, PIG-5316_2.patch
>
>
> Some downstream systems may require the presence of {{mapred.task.id}} 
> property (e.g. HCatalog). This is currently not set when Pig On Spark jobs 
> are started. Let's initialise it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5316) Initialize mapred.task.id property for PoS jobs

2017-11-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259301#comment-16259301
 ] 

Xuefu Zhang commented on PIG-5316:
--

[~nkollar], while it might be just a placeholder, it's used to create scratch 
or staging directories, so I think we should follow the customary format of a 
task ID. You might want to check the Pig code where this is set. As another 
reference, you can see how Hive on Spark sets it in 
{{HivePairFlatMapFunction.java}}. 
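A sketch of composing an MR-style task attempt ID, loosely modeled on what Hive on Spark does in HivePairFlatMapFunction; the field values (timestamp, job number, partition) are illustrative assumptions, not the actual patch:

```java
class SyntheticTaskId {
    // Hedged sketch: building an MR-style task attempt ID for mapred.task.id.
    // Conventional Hadoop layout: attempt_<timestamp>_<jobNum>_<m|r>_<task>_<attempt>
    static String taskAttemptId(String jobTimestamp, int jobNum, int partitionId) {
        return String.format("attempt_%s_%04d_m_%06d_0", jobTimestamp, jobNum, partitionId);
    }

    public static void main(String[] args) {
        System.out.println(taskAttemptId("201711200000", 1, 3));
        // attempt_201711200000_0001_m_000003_0
    }
}
```

Downstream consumers such as HCatalog parse this layout, which is why a bare placeholder string is not enough.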

> Initialize mapred.task.id property for PoS jobs
> ---
>
> Key: PIG-5316
> URL: https://issues.apache.org/jira/browse/PIG-5316
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Adam Szita
>Assignee: Nandor Kollar
>
> Some downstream systems may require the presence of {{mapred.task.id}} 
> property (e.g. HCatalog). This is currently not set when Pig On Spark jobs 
> are started. Let's initialise it.





[jira] [Commented] (PIG-4059) Pig on Spark

2017-05-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029736#comment-16029736
 ] 

Xuefu Zhang commented on PIG-4059:
--

Great job! Thanks to everyone for making this happen!

> Pig on Spark
> 
>
> Key: PIG-4059
> URL: https://issues.apache.org/jira/browse/PIG-4059
> Project: Pig
>  Issue Type: New Feature
>  Components: spark
>Reporter: Rohini Palaniswamy
>Assignee: Praveen Rachabattuni
>  Labels: spork
> Fix For: spark-branch, 0.17.0
>
> Attachments: Pig-on-Spark-Design-Doc.pdf, Pig-on-Spark-Scope.pdf
>
>
> Setting up your development environment:
> 0. Download the Spark release package (currently Pig on Spark only supports 
> Spark 1.6).
> 1. Check out the Pig Spark branch.
> 2. Build Pig by running "ant jar", and "ant -Dhadoopversion=23 jar" for 
> hadoop-2.x versions.
> 3. Configure these environment variables:
> export HADOOP_USER_CLASSPATH_FIRST="true"
> "local" and "yarn-client" modes are currently supported; export the 
> SPARK_MASTER variable accordingly:
> export SPARK_MASTER=local or export SPARK_MASTER="yarn-client"
> 4. In local mode: ./pig -x spark_local xxx.pig
> In yarn-client mode: 
> export SPARK_HOME=xx; 
> export SPARK_JAR=hdfs://example.com:8020/ (the hdfs location where 
> you upload the spark-assembly*.jar)
> ./pig -x spark xxx.pig





Re: [ANNOUNCE] Welcome new Pig Committer - Adam Szita

2017-05-22 Thread Xuefu Zhang
Congratulations, Adam!

On Mon, May 22, 2017 at 10:51 AM, Rohini Palaniswamy <
rohini.adi...@gmail.com> wrote:

> Hi all,
> It is my pleasure to announce that Adam Szita has been voted in as a
> committer to Apache Pig. Please join me in congratulating Adam. Adam has
> been actively contributing to core Pig and Pig on Spark. We appreciate all
> the work he has done and are looking forward to more contributions from
> him.
>
> Welcome aboard, Adam.
>
> Regards,
> Rohini
>


[jira] [Updated] (PIG-5104) Union_15 e2e test failing on Spark

2017-03-14 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-5104:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks, Liyun!

> Union_15 e2e test failing on Spark
> --
>
> Key: PIG-5104
> URL: https://issues.apache.org/jira/browse/PIG-5104
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5104.patch, PIG-5104.zly.patch, TestUnion_15.java
>
>
> While working on PIG-4891 I noticed that Union_15 e2e test is failing on 
> Spark mode with this exception:
> Caused by: java.lang.RuntimeException: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
> error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
> parallelism hint from job conf]
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: 
> Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
> parallelism hint from job conf]
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:374)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:335)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1$1.getNextResult(ForEachConverter.java:87)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:69)
>   ... 11 more
> Caused by: java.io.IOException: Unable to get parallelism hint from job conf
>   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:66)
>   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:37)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)





[jira] [Commented] (PIG-5167) Limit_4 is failing with spark exec type

2017-03-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904298#comment-15904298
 ] 

Xuefu Zhang commented on PIG-5167:
--

Adding sorting is a quick fix; it's fine if it doesn't impact test performance 
too much. In Hive, we have the option of sorting results before comparison, 
which makes the sorting happen on the client side. However, I'm not sure 
whether that's feasible in Pig.
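Client-side sorting before comparison could be sketched as follows; this is an illustration of the idea, not Pig's actual e2e harness:

```java
import java.util.*;

class SortedCompare {
    // Sort both result sets before diffing so that nondeterministic row order
    // (as produced under Spark) does not fail the comparison.
    static boolean sameRows(List<String> actual, List<String> expected) {
        List<String> a = new ArrayList<>(actual);
        List<String> e = new ArrayList<>(expected);
        Collections.sort(a);
        Collections.sort(e);
        return a.equals(e);
    }

    public static void main(String[] args) {
        // Same rows, different order: passes once sorted.
        System.out.println(sameRows(
                Arrays.asList("bob allen\t22\t0.92", "alice carson\t66\t2.42"),
                Arrays.asList("alice carson\t66\t2.42", "bob allen\t22\t0.92")));
    }
}
```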

> Limit_4 is failing with spark exec type
> ---
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50  3.00
> < 74  2.22
> < alice carson66  2.42
> < alice quirinius 71  0.03
> < alice van buren 28  2.50
> ---
> > bob allen   0.28
> > bob allen   22  0.92
> > bob allen   25  2.54
> > bob allen   26  2.35
> > bob allen   27  2.17
> {code}





[jira] [Commented] (PIG-5133) Commit changes from last round of review on rb

2017-03-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904290#comment-15904290
 ] 

Xuefu Zhang commented on PIG-5133:
--

Committed to Spark branch. Thanks, Liyun!

> Commit changes from last round of review on rb
> --
>
> Key: PIG-5133
> URL: https://issues.apache.org/jira/browse/PIG-5133
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: changes_from_rb.patch, PIG-5133_2.patch, 
> PIG-5133_4.patch, PIG-5133_5.patch, PIG-5133.patch
>
>
> In the last round of [review|https://reviews.apache.org/r/45667/], Rohini 
> gave some comments; this patch makes changes according to her review.
> After PIG-5132 is committed, we will commit this patch, then open a new 
> review board to start the second round of review of the spark branch.





[jira] [Resolved] (PIG-5044) Create SparkCompiler#getSamplingJob in spark mode

2017-02-14 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-5044.
--
Resolution: Fixed

Committed to Spark branch. Thanks.

> Create SparkCompiler#getSamplingJob in spark mode
> -
>
> Key: PIG-5044
> URL: https://issues.apache.org/jira/browse/PIG-5044
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5044_2.patch, PIG-5044_3.patch, PIG-5044_4.patch
>
>
> Like MRCompiler#getSamplingJob, we need a similar function to sample data 
> from a file, sort the sampled data, and generate output via the UDF 
> org.apache.pig.impl.builtin.FindQuantiles.





[jira] [Resolved] (PIG-4952) Calculate the value of parallism for spark mode

2016-12-12 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4952.
--
Resolution: Fixed

Patch committed to Spark branch. Thanks, Liyun!

> Calculate the value of parallism for spark mode
> ---
>
> Key: PIG-4952
> URL: https://issues.apache.org/jira/browse/PIG-4952
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4952.patch, PIG-4952_1.patch, PIG-4952_2.patch
>
>
> Calculate the value of parallelism for spark mode, like 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter
>  does.





[jira] [Updated] (PIG-4815) Add xml format support for 'explain' in spark engine

2016-12-08 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4815:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks, Adam!

> Add xml format support for 'explain' in spark engine 
> -
>
> Key: PIG-4815
> URL: https://issues.apache.org/jira/browse/PIG-4815
> Project: Pig
>  Issue Type: Task
>  Components: spark
>Reporter: Prateek Vaishnav
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-4815.2.patch, PIG-4815.patch
>
>






[jira] [Updated] (PIG-5068) Set SPARK_REDUCERS by pig.properties not by system configuration

2016-11-28 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-5068:
-
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Set SPARK_REDUCERS by pig.properties not by system configuration
> 
>
> Key: PIG-5068
> URL: https://issues.apache.org/jira/browse/PIG-5068
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5068.patch, PIG-5068_1.patch, PIG-5068_2.patch
>
>
> In SparkUtil.java, we read SPARK_REDUCERS from the system environment:
> {code}
> public static int getParallelism(List<RDD> predecessors,
>         PhysicalOperator physicalOperator) {
>     String numReducers = System.getenv("SPARK_REDUCERS");
>     if (numReducers != null) {
>         return Integer.parseInt(numReducers);
>     }
>     int parallelism = physicalOperator.getRequestedParallelism();
>     if (parallelism <= 0) {
>         // Parallelism wasn't set in Pig, so set it to whatever Spark thinks
>         // is reasonable.
>         parallelism = predecessors.get(0).context().defaultParallelism();
>     }
>     return parallelism;
> }
> {code}
> It is better to set it by pig.properties
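A pig.properties-driven variant might look like the sketch below; the property key "spark.reducers" and the fallback order are assumptions for illustration, not the committed patch:

```java
import java.util.Properties;

class ReducerParallelism {
    // Hedged sketch of a pig.properties-based getParallelism: read the reducer
    // count from a Properties object instead of System.getenv.
    static int getParallelism(Properties pigProps, int requestedParallelism,
                              int sparkDefaultParallelism) {
        String numReducers = pigProps.getProperty("spark.reducers");
        if (numReducers != null) {
            return Integer.parseInt(numReducers);
        }
        // Fall back to the operator's requested parallelism, then Spark's default.
        return requestedParallelism > 0 ? requestedParallelism : sparkDefaultParallelism;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("spark.reducers", "8");
        System.out.println(getParallelism(props, -1, 4));            // property wins
        System.out.println(getParallelism(new Properties(), -1, 4)); // Spark's default
    }
}
```

Reading from pig.properties keeps the setting per-script and visible in job configuration, instead of depending on the shell environment of whoever launched Pig.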





[jira] [Updated] (PIG-4899) The number of records of input file is calculated wrongly in spark mode in multiquery case

2016-11-23 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4899:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Adam Szita.

>  The number of records of input file is calculated wrongly in spark mode in 
> multiquery case
> ---
>
> Key: PIG-4899
> URL: https://issues.apache.org/jira/browse/PIG-4899
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-4899.2.patch, PIG-4899.patch
>
>
> The SparkCounter that counts the records of the input file 
> (LoadConverter#ToTupleFunction#apply) will be executed multiple times in the 
> multiquery case, causing the input record count to be calculated wrongly. 
> For example:
> {code}
> #--
> # Spark Plan  
> #--
> Spark node scope-534
> Split - scope-548
> |   |
> |   
> Store(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage)
>  - scope-538
> |   |
> |   |---C: Filter[bag] - scope-495
> |   |   |
> |   |   Less Than or Equal[boolean] - scope-498
> |   |   |
> |   |   |---Project[int][1] - scope-496
> |   |   |
> |   |   |---Constant(5) - scope-497
> |   |
> |   
> Store(hdfs://localhost:48350/tmp/temp649016960/tmp804709981:org.apache.pig.impl.io.InterStorage)
>  - scope-546
> |   |
> |   |---B: Filter[bag] - scope-507
> |   |   |
> |   |   Equal To[boolean] - scope-510
> |   |   |
> |   |   |---Project[int][0] - scope-508
> |   |   |
> |   |   |---Constant(3) - scope-509
> |
> |---A: New For Each(false,false,false)[bag] - scope-491
> |   |
> |   Cast[int] - scope-483
> |   |
> |   |---Project[bytearray][0] - scope-482
> |   |
> |   Cast[int] - scope-486
> |   |
> |   |---Project[bytearray][1] - scope-485
> |   |
> |   Cast[int] - scope-489
> |   |
> |   |---Project[bytearray][2] - scope-488
> |
> |---A: 
> Load(hdfs://localhost:48350/user/root/input:org.apache.pig.builtin.PigStorage)
>  - scope-481
> Spark node scope-540
> C: 
> Store(hdfs://localhost:48350/user/root/output:org.apache.pig.builtin.PigStorage)
>  - scope-502
> |
> |---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage)
>  - scope-539
> Spark node scope-542
> D: 
> Store(hdfs://localhost:48350/user/root/output2:org.apache.pig.builtin.PigStorage)
>  - scope-533
> |
> |---D: FRJoin[tuple] - scope-525
> |   |
> |   Project[int][0] - scope-522
> |   |
> |   Project[int][0] - scope-523
> |   |
> |   Project[int][0] - scope-524
> |
> 
> |---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage)
>  - scope-541
> Spark node scope-545
> Store(hdfs://localhost:48350/tmp/temp649016960/tmp-2036144538:org.apache.pig.impl.io.InterStorage)
>  - scope-547
> |
> |---A1: New For Each(false,false,false)[bag] - scope-521
> |   |
> |   Cast[int] - scope-513
> |   |
> |   |---Project[bytearray][0] - scope-512
> |   |
> |   Cast[int] - scope-516
> |   |
> |   |---Project[bytearray][1] - scope-515
> |   |
> |   Cast[int] - scope-519
> |   |
> |   |---Project[bytearray][2] - scope-518
> |
> |---A1: 
> Load(hdfs://localhost:48350/user/root/input2:org.apache.pig.builtin.PigStorage)
>  - scope-511---
> {code}
> The PhysicalOperator (Load A) will be executed in 
> LoadConverter#ToTupleFunction#apply more times than it should be, because 
> this is a multi-query case. 





[jira] [Commented] (PIG-5052) Initialize MRConfiguration.JOB_ID in spark mode correctly

2016-11-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685729#comment-15685729
 ] 

Xuefu Zhang commented on PIG-5052:
--

 PIG-5052.3-incrementalToPatch1.patch is committed. Shall we close this ticket?

> Initialize MRConfiguration.JOB_ID in spark mode correctly
> -
>
> Key: PIG-5052
> URL: https://issues.apache.org/jira/browse/PIG-5052
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5052.2.patch, PIG-5052.3-incrementalToPatch1.patch, 
> PIG-5052.3.patch, PIG-5052.patch
>
>
> Currently, we initialize MRConfiguration.JOB_ID in SparkUtil#newJobConf; 
> we just set the value to a random string.
> {code}
> jobConf.set(MRConfiguration.JOB_ID, UUID.randomUUID().toString());
> {code}
> We need to find a Spark API to initialize it correctly.





[jira] [Updated] (PIG-5051) Initialize PigContants.TASK_INDEX in spark mode correctly

2016-11-03 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-5051:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to spark branch. Thanks, Liyun!

> Initialize PigContants.TASK_INDEX in spark mode correctly
> -
>
> Key: PIG-5051
> URL: https://issues.apache.org/jira/browse/PIG-5051
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4920_6_5051.patch, PIG-5051.patch
>
>
> In MR, we initialize PigConstants.TASK_INDEX in 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce#setup:
> {code}
> protected void setup(Context context) throws IOException, InterruptedException {
>     ...
>     context.getConfiguration().set(PigConstants.TASK_INDEX,
>             Integer.toString(context.getTaskAttemptID().getTaskID().getId()));
>     ...
> }
> {code}
> But Spark does not provide a hook like PigGenericMapReduce.Reduce#setup to 
> initialize PigConstants.TASK_INDEX when a job starts. We need to find a way 
> to initialize it correctly.
> Once this jira is fixed, the behavior of TestBuiltin#testUniqueID in spark 
> mode will be the same as in MR.
> Currently we split TestBuiltin#testUniqueID into two cases:
> {code}
>  @Test
> public void testUniqueID() throws Exception {
>  ...
> if (!Util.isSparkExecType(cluster.getExecType())) {
> assertEquals("0-0", iter.next().get(1));
> assertEquals("0-1", iter.next().get(1));
> assertEquals("0-2", iter.next().get(1));
> assertEquals("0-3", iter.next().get(1));
> assertEquals("0-4", iter.next().get(1));
> assertEquals("1-0", iter.next().get(1));
> assertEquals("1-1", iter.next().get(1));
> assertEquals("1-2", iter.next().get(1));
> assertEquals("1-3", iter.next().get(1));
> assertEquals("1-4", iter.next().get(1));
> } else {
>     // Because we set PigConstants.TASK_INDEX to 0 in
>     // ForEachConverter#ForEachFunction#initializeJobConf,
>     // UniqueID.exec() will output values like 0-*.
>     // The behavior of spark will not match mr until PIG-5051 is fixed.
> assertEquals(iter.next().get(1), "0-0");
> assertEquals(iter.next().get(1), "0-1");
> assertEquals(iter.next().get(1), "0-2");
> assertEquals(iter.next().get(1), "0-3");
> assertEquals(iter.next().get(1), "0-4");
> assertEquals(iter.next().get(1), "0-0");
> assertEquals(iter.next().get(1), "0-1");
> assertEquals(iter.next().get(1), "0-2");
> assertEquals(iter.next().get(1), "0-3");
> assertEquals(iter.next().get(1), "0-4");
> }
>...
> }
> {code}





[jira] [Updated] (PIG-5052) Initialize MRConfiguration.JOB_ID in spark mode correctly

2016-11-03 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-5052:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to spark branch. Thanks, Liyun!

> Initialize MRConfiguration.JOB_ID in spark mode correctly
> -
>
> Key: PIG-5052
> URL: https://issues.apache.org/jira/browse/PIG-5052
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5052.patch
>
>
> Currently, we initialize MRConfiguration.JOB_ID in SparkUtil#newJobConf; 
> we just set the value to a random string.
> {code}
> jobConf.set(MRConfiguration.JOB_ID, UUID.randomUUID().toString());
> {code}
> We need to find a Spark API to initialize it correctly.





[jira] [Commented] (PIG-5051) Initialize PigContants.TASK_INDEX in spark mode correctly

2016-10-27 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614166#comment-15614166
 ] 

Xuefu Zhang commented on PIG-5051:
--

Not sure why it didn't go through. I did an "svn update" and it seems it's 
there now.

> Initialize PigContants.TASK_INDEX in spark mode correctly
> -
>
> Key: PIG-5051
> URL: https://issues.apache.org/jira/browse/PIG-5051
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4920_6_5051.patch, PIG-5051.patch
>
>
> In MR, we initialize PigConstants.TASK_INDEX in 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce#setup:
> {code}
> protected void setup(Context context) throws IOException, InterruptedException {
>     ...
>     context.getConfiguration().set(PigConstants.TASK_INDEX,
>             Integer.toString(context.getTaskAttemptID().getTaskID().getId()));
>     ...
> }
> {code}
> But Spark does not provide a hook like PigGenericMapReduce.Reduce#setup to 
> initialize PigConstants.TASK_INDEX when a job starts. We need to find a way 
> to initialize it correctly.
> Once this jira is fixed, the behavior of TestBuiltin#testUniqueID in spark 
> mode will be the same as in MR.
> Currently we split TestBuiltin#testUniqueID into two cases:
> {code}
>  @Test
> public void testUniqueID() throws Exception {
>  ...
> if (!Util.isSparkExecType(cluster.getExecType())) {
> assertEquals("0-0", iter.next().get(1));
> assertEquals("0-1", iter.next().get(1));
> assertEquals("0-2", iter.next().get(1));
> assertEquals("0-3", iter.next().get(1));
> assertEquals("0-4", iter.next().get(1));
> assertEquals("1-0", iter.next().get(1));
> assertEquals("1-1", iter.next().get(1));
> assertEquals("1-2", iter.next().get(1));
> assertEquals("1-3", iter.next().get(1));
> assertEquals("1-4", iter.next().get(1));
> } else {
>     // Because we set PigConstants.TASK_INDEX to 0 in
>     // ForEachConverter#ForEachFunction#initializeJobConf,
>     // UniqueID.exec() will output values like 0-*.
>     // The behavior of spark will not match mr until PIG-5051 is fixed.
> assertEquals(iter.next().get(1), "0-0");
> assertEquals(iter.next().get(1), "0-1");
> assertEquals(iter.next().get(1), "0-2");
> assertEquals(iter.next().get(1), "0-3");
> assertEquals(iter.next().get(1), "0-4");
> assertEquals(iter.next().get(1), "0-0");
> assertEquals(iter.next().get(1), "0-1");
> assertEquals(iter.next().get(1), "0-2");
> assertEquals(iter.next().get(1), "0-3");
> assertEquals(iter.next().get(1), "0-4");
> }
>...
> }
> {code}





[jira] [Updated] (PIG-4920) Fail to use Javascript UDF in spark yarn client mode

2016-10-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4920:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to spark branch. Thanks, Liyun!

> Fail to use Javascript UDF in spark yarn client mode
> 
>
> Key: PIG-4920
> URL: https://issues.apache.org/jira/browse/PIG-4920
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4920.patch, PIG-4920_2.patch, PIG-4920_3.patch, 
> PIG-4920_4.patch, PIG-4920_5.patch, PIG-4920_6.patch
>
>
> udf.pig 
> {code}
> register '/home/zly/prj/oss/merge.pig/pig/bin/udf.js' using javascript as 
> myfuncs;
> A = load './passwd' as (a0:chararray, a1:chararray);
> B = foreach A generate myfuncs.helloworld();
> store B into './udf.out';
> {code}
> udf.js
> {code}
> helloworld.outputSchema = "word:chararray";
> function helloworld() {
> return 'Hello, World';
> }
> 
> complex.outputSchema = "word:chararray";
> function complex(word){
> return {word:word};
> }
> {code}
> Running udf.pig in spark local mode (export SPARK_MASTER="local") succeeds.
> Running udf.pig in spark yarn-client mode (export SPARK_MASTER="yarn-client") 
> fails with an error message like the following:
> {noformat}
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:744)
> ... 84 more
> Caused by: java.lang.ExceptionInInitializerError
> at 
> org.apache.pig.scripting.js.JsScriptEngine.getInstance(JsScriptEngine.java:87)
> at org.apache.pig.scripting.js.JsFunction.<init>(JsFunction.java:173)
> ... 89 more
> Caused by: java.lang.IllegalStateException: could not get script path from 
> UDFContext
> at 
> org.apache.pig.scripting.js.JsScriptEngine$Holder.<clinit>(JsScriptEngine.java:69)
> ... 91 more
> {noformat}





[jira] [Resolved] (PIG-4969) Optimize combine case for spark mode

2016-09-13 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4969.
--
   Resolution: Fixed
Fix Version/s: spark-branch

Committed to Spark branch. Thanks, Liyun!

> Optimize combine case for spark mode
> 
>
> Key: PIG-4969
> URL: https://issues.apache.org/jira/browse/PIG-4969
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4969_2.patch, PIG-4969_3.patch
>
>
> In our test of the 1 TB pigmix benchmark, the combine case runs slower in 
> spark mode:
> ||Script||MR||Spark||
> |L_1|8089|10064|
> L1.pig:
> {code}
> register pigperf.jar;
> A = load '/user/pig/tests/data/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp,
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)action as action, (map[])page_info as 
> page_info,
> flatten((bag{tuple(map[])})page_links) as page_links;
> C = foreach B generate user,
> (action == 1 ? page_info#'a' : page_links#'b') as header;
> D = group C by user parallel 40;
> E = foreach D generate group, COUNT(C) as cnt;
> store E into 'L1out';
> {code}
> Then spark plan
> {code}
> [exec] #--
>  [exec] # Spark Plan  
>  [exec] #--
>  [exec] 
>  [exec] Spark node scope-38
>  [exec] E: 
> Store(hdfs://bdpe81:8020/user/root/output/pig/L1out:org.apache.pig.builtin.PigStorage)
>  - scope-37
>  [exec] |
>  [exec] |---E: New For Each(false,false)[tuple] - scope-42
>  [exec] |   |
>  [exec] |   Project[bytearray][0] - scope-39
>  [exec] |   |
>  [exec] |   Project[bag][1] - scope-40
>  [exec] |   
>  [exec] |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - 
> scope-41
>  [exec] |   |
>  [exec] |   |---Project[bag][1] - scope-57
>  [exec] |
>  [exec] |---Reduce By(false,false)[tuple] - scope-47
>  [exec] |   |
>  [exec] |   Project[bytearray][0] - scope-48
>  [exec] |   |
>  [exec] |   
> POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - scope-49
>  [exec] |   |
>  [exec] |   |---Project[bag][1] - scope-50
>  [exec] |
>  [exec] |---D: Local Rearrange[tuple]{bytearray}(false) - scope-53
>  [exec] |   |
>  [exec] |   Project[bytearray][0] - scope-55
>  [exec] |
>  [exec] |---E: New For Each(false,false)[bag] - scope-43
>  [exec] |   |
>  [exec] |   Project[bytearray][0] - scope-44
>  [exec] |   |
>  [exec] |   
> POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-45
>  [exec] |   |
>  [exec] |   |---Project[bag][1] - scope-46
>  [exec] |
>  [exec] |---Pre Combiner Local Rearrange[tuple]{Unknown} 
> - scope-56
>  [exec] |
>  [exec] |---C: New For Each(false,false)[bag] - 
> scope-26
>  [exec] |   |
>  [exec] |   Project[bytearray][0] - scope-13
>  [exec] |   |
>  [exec] |   POBinCond[bytearray] - scope-22
>  [exec] |   |
>  [exec] |   |---Equal To[boolean] - scope-17
>  [exec] |   |   |
>  [exec] |   |   |---Project[int][1] - scope-15
>  [exec] |   |   |
>  [exec] |   |   |---Constant(1) - scope-16
>  [exec] |   |
>  [exec] |   |---POMapLookUp[bytearray] - scope-19
>  [exec] |   |   |
>  [exec] |   |   |---Project[map][2] - scope-18
>  [exec] |   |
>  [exec] |   |---POMapLookUp[bytearray] - scope-21
>  [exec] |   |
>  [exec]
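The Reduce By / COUNT$Initial / COUNT$Intermediate / COUNT$Final operators in the plan above form a combiner pipeline; the following plain-Java sketch is an analogy for how partial counts are merged after the shuffle, not Pig's implementation:

```java
import java.util.*;

class CombinerSketch {
    // Per-partition partial counts stand in for COUNT$Initial/Intermediate
    // ("Reduce By"); the final merge stands in for COUNT$Final after the shuffle.
    static Map<String, Long> countByUser(List<List<String>> partitions) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (List<String> partition : partitions) {
            Map<String, Long> partial = new HashMap<>();
            for (String user : partition) {
                partial.merge(user, 1L, Long::sum);  // Initial + Intermediate
            }
            partials.add(partial);
        }
        Map<String, Long> finalCounts = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((user, cnt) ->
                    finalCounts.merge(user, cnt, Long::sum));  // Final
        }
        return finalCounts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countByUser(Arrays.asList(
                Arrays.asList("u1", "u2", "u1"),
                Arrays.asList("u2")));
        System.out.println(counts.get("u1") + " " + counts.get("u2"));  // 2 2
    }
}
```

The per-partition aggregation shrinks the shuffle payload, which is exactly what the combine optimization buys when it beats shipping raw records.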

[jira] [Comment Edited] (PIG-5024) add a physical operator to broadcast small RDDs

2016-09-08 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475913#comment-15475913
 ] 

Xuefu Zhang edited comment on PIG-5024 at 9/9/16 4:51 AM:
--

Committed to Spark branch. Thanks, Xianda.


was (Author: xuefuz):
Committed to Spark branch. Thanks.

> add a physical operator to broadcast small RDDs
> ---
>
> Key: PIG-5024
> URL: https://issues.apache.org/jira/browse/PIG-5024
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-5024.patch, PIG-5024_2.patch, PIG-5024_3.patch, 
> PIG-5024_4.patch, PIG-5024_5.patch, PIG-5024_6.patch
>
>
> Currently, when optimizing some kinds of JOIN, the indexed or sampling files 
> are saved into HDFS. By setting their replication factor to a larger number, 
> they serve as a distributed cache.
> Spark's broadcast mechanism is suitable for this. It seems that we can add a 
> physical operator to broadcast small RDDs.
> This will benefit the optimization of some specialized Joins, such as Skewed 
> Join, Replicated Join and so on. 
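As a rough illustration (plain Python, not the Spark API; all names are made up), the point of broadcasting is that the serialized small dataset is deserialized once per worker and then reused by every task on that worker, instead of being re-read from HDFS by each task:

```python
import pickle

ship_count = 0

class Worker:
    """Stand-in for one worker JVM; caches the broadcast value after first use."""
    def __init__(self):
        self.cache = None
    def get_broadcast(self, blob):
        global ship_count
        if self.cache is None:
            ship_count += 1          # deserialization cost paid once per worker
            self.cache = pickle.loads(blob)
        return self.cache            # every later task on this worker reuses it

small_relation = pickle.dumps({"exchange": "NYSE"})  # small dataset, serialized once
worker = Worker()
tasks = [worker.get_broadcast(small_relation) for _ in range(5)]  # 5 tasks, 1 worker
print(ship_count)  # 1
```

This mirrors why a broadcast beats the replicated-HDFS-file approach for the sampled/indexed files used by skewed and replicated joins.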



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-5024) add a physical operator to broadcast small RDDs

2016-09-08 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-5024.
--
Resolution: Fixed

Committed to Spark branch. Thanks.

> add a physical operator to broadcast small RDDs
> ---
>
> Key: PIG-5024
> URL: https://issues.apache.org/jira/browse/PIG-5024
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-5024.patch, PIG-5024_2.patch, PIG-5024_3.patch, 
> PIG-5024_4.patch, PIG-5024_5.patch, PIG-5024_6.patch
>
>
> Currently, when optimizing some kinds of JOIN, the indexed or sampling files 
> are saved into HDFS. By setting their replication factor to a larger number, 
> they serve as a distributed cache.
> Spark's broadcast mechanism is suitable for this. It seems that we can add a 
> physical operator to broadcast small RDDs.
> This will benefit the optimization of some specialized Joins, such as Skewed 
> Join, Replicated Join and so on. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4870) Enable MergeJoin testcase in TestCollectedGroup for spark engine

2016-09-06 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4870.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda!

> Enable MergeJoin testcase in TestCollectedGroup for spark engine
> 
>
> Key: PIG-4870
> URL: https://issues.apache.org/jira/browse/PIG-4870
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4870.patch
>
>
> TestCollectedGroup.testMapsideGroupWithMergeJoin was disabled( PIG-4781).
> When MergeJoin (PIG-4810) is ready,  we can enable the UT case for spark 
> engine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4970) Remove the deserialize and serialization of JobConf in code for spark mode

2016-08-24 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4970.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Remove the deserialize and serialization of JobConf in code for spark mode
> --
>
> Key: PIG-4970
> URL: https://issues.apache.org/jira/browse/PIG-4970
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4970.patch, PIG-4970_2.patch, PIG-4970_3.patch, 
> PIG-4970_4.patch
>
>
> Now we use KryoSerializer to serialize the jobConf in 
> [SparkLauncher|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java#L191].
>  Then we deserialize it in 
> [ForEachConverter|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ForEachConverter.java#L83],
>   
> [StreamConverter|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/StreamConverter.java#L70].
> We deserialize and serialize the jobConf in order to make the jobConf 
> available in the Spark executor threads.
> We can refactor it in the following ways:
> 1. Let Spark broadcast the jobConf in 
> [sparkContext.newAPIHadoopRDD|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LoadConverter.java#L102].
>  Here we do not create a new jobConf loaded from PigContext properties, but 
> directly use the jobConf from SparkLauncher.
> 2. Get the jobConf in 
> [org.apache.pig.backend.hadoop.executionengine.spark.running.PigInputFormatSpark#createRecordReader|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/running/PigInputFormatSpark.java#L42]
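A plain-Python sketch (pickle standing in for Kryo; the property names are illustrative) of the serialize-once / deserialize-in-executor pattern being discussed:

```python
import pickle

# "Driver" side: the job configuration is serialized once, e.g. in SparkLauncher.
driver_conf = {"mapred.task.id": "attempt_0001_m_000000_0", "pig.exec.type": "spark"}
serialized = pickle.dumps(driver_conf)

def executor_task(blob):
    # "Executor" side: rebuild the configuration in the executor thread,
    # rather than re-serializing it separately in every converter.
    conf = pickle.loads(blob)
    return conf["pig.exec.type"]

print(executor_task(serialized))  # spark
```

Broadcasting `serialized` once (as the refactoring above suggests) avoids repeating this dance per converter.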



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [ANNOUNCE] Congratulations to our new PMC member Koji Noguchi

2016-08-05 Thread Xuefu Zhang
Congratulations, Koji!

On Fri, Aug 5, 2016 at 4:28 PM, Daniel Dai  wrote:

> Please welcome Koji Noguchi as our latest Pig PMC member.
>
> Congrats Koji!
>


[jira] [Resolved] (PIG-4553) Implement secondary sort using one shuffle

2016-07-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4553.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!


> Implement secondary sort using one shuffle
> --
>
> Key: PIG-4553
> URL: https://issues.apache.org/jira/browse/PIG-4553
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4553_1.patch, PIG-4553_2.patch
>
>
> Now we implement secondary key sort in GlobalRearrangeConverter#convert with 
> two shuffles: the first shuffle in repartitionAndSortWithinPartitions and the 
> second shuffle in groupBy.
> {code}
> public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
>         POGlobalRearrangeSpark physicalOperator) throws IOException {
>     if (predecessors.size() == 1) {
>         // GROUP
>         JavaPairRDD<Object, Iterable<Tuple>> prdd = null;
>         if (physicalOperator.isUseSecondaryKey()) {
>             RDD<Tuple> rdd = predecessors.get(0);
>             RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyNullValueFunction(),
>                     SparkUtil.<Tuple, Object>getTuple2Manifest());
>             JavaPairRDD<Tuple, Object> pairRDD = new JavaPairRDD<Tuple, Object>(rddPair,
>                     SparkUtil.getManifest(Tuple.class),
>                     SparkUtil.getManifest(Object.class));
>             // first sort the tuples by secondary key if useSecondaryKey is enabled
>             JavaPairRDD<Tuple, Object> sorted = pairRDD.repartitionAndSortWithinPartitions(
>                     new HashPartitioner(parallelism),
>                     new PigSecondaryKeyComparatorSpark(physicalOperator.getSecondarySortOrder())); // first shuffle
>             JavaRDD<Tuple> mapped = sorted.mapPartitions(new ToValueFunction());
>             prdd = mapped.groupBy(new GetKeyFunction(physicalOperator), parallelism); // second shuffle
>         } else {
>             JavaRDD<Tuple> jrdd = predecessors.get(0).toJavaRDD();
>             prdd = jrdd.groupBy(new GetKeyFunction(physicalOperator), parallelism);
>         }
>         JavaRDD<Tuple> jrdd2 = prdd.map(new GroupTupleFunction(physicalOperator));
>         return jrdd2.rdd();
>     }
> }
> {code}
> we can optimize it according to the code from 
> https://github.com/tresata/spark-sorted.
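The one-shuffle approach can be illustrated with a plain-Python stand-in (not the Pig/Spark code itself): shuffle once by the primary key only, then sort each partition locally by (primary, secondary), so each group comes out secondary-sorted without a second shuffle.

```python
from itertools import groupby
from operator import itemgetter

def one_shuffle_secondary_sort(records, num_partitions=2):
    # records: (primary_key, secondary_key, value) triples
    # "Shuffle" once: partition by primary key only.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(rec[0]) % num_partitions].append(rec)
    result = {}
    for part in partitions:
        # One local sort by (primary, secondary) inside the partition --
        # no second shuffle; groupby then yields secondary-sorted groups.
        part.sort(key=itemgetter(0, 1))
        for key, group in groupby(part, key=itemgetter(0)):
            result[key] = [(s, v) for _, s, v in group]
    return result

data = [('a', 3, 'x'), ('b', 1, 'y'), ('a', 1, 'z'), ('a', 2, 'w')]
print(one_shuffle_secondary_sort(data)['a'])  # [(1, 'z'), (2, 'w'), (3, 'x')]
```

This is the shape of the spark-sorted optimization: the groupBy shuffle disappears because grouping becomes a local scan over the already-partitioned, already-sorted data.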



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4941) TestRank3#testRankWithSplitInMap hangs after upgrade to spark 1.6.1

2016-07-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4941.
--
   Resolution: Fixed
Fix Version/s: spark-branch

Committed to Spark Branch. Thanks, Liyun!


> TestRank3#testRankWithSplitInMap hangs after upgrade to spark 1.6.1
> ---
>
> Key: PIG-4941
> URL: https://issues.apache.org/jira/browse/PIG-4941
> Project: Pig
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4941.patch, rank.jstack
>
>
> After upgrading spark version to 1.6.1, TestRank3#testRankWithSplitInMap 
> hangs and fails due to timeout exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4946) Remove redundant code of bin/pig in spark mode after PIG-4903 check in

2016-07-11 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4946.
--
   Resolution: Fixed
Fix Version/s: spark-branch

Committed to Spark branch. Thanks, Liyun!

> Remove redundant code of bin/pig in spark mode after PIG-4903 check in
> -
>
> Key: PIG-4946
> URL: https://issues.apache.org/jira/browse/PIG-4946
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4946.patch
>
>
> After the PIG-4903 check-in, some redundant code remains in bin/pig on the 
> spark branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4944) Reset UDFContext#jobConf in spark mode

2016-07-06 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4944.
--
   Resolution: Fixed
Fix Version/s: spark-branch

Committed to Spark branch. Thanks, Liyun!

> Reset UDFContext#jobConf in spark mode
> --
>
> Key: PIG-4944
> URL: https://issues.apache.org/jira/browse/PIG-4944
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4944.patch, PIG-4944_2.patch, 
> TestEvalPipelineLocal.mr, TestEvalPipelineLocal.spark
>
>
> The community gave some comments about the TestEvalPipelineLocal unit test:
> https://reviews.apache.org/r/45667/#comment199056
> We can call "UDFContext.getUDFContext().addJobConf(null)" to reset it somewhere 
> other than TestEvalPipelineLocal#testSetLocationCalledInFE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4797) Optimization for join/group case for spark mode

2016-07-06 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4797:
-
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Optimization for join/group case for spark mode
> ---
>
> Key: PIG-4797
> URL: https://issues.apache.org/jira/browse/PIG-4797
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: liyunzhang_intel
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: Join performance analysis.pdf, PIG-4797.patch, 
> PIG-4797_2.patch, PIG-4797_3.patch, PIG-4797_5.patch
>
>
> There is a big performance difference in join between spark and mr mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
> date:chararray, open:float, high:float, low:float,
> close:float, volume:int, adj_close:float);
> divs  = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
> date:chararray, dividends:float);
> jnd   = join daily by (exchange, symbol), divs by (exchange, symbol);
> store jnd into './join.out';
> {code}
> join.sh
> {code}
> mode=$1
> start=$(date +%s)
> ./pig -x $mode  $PIG_HOME/bin/join.pig
> end=$(date +%s)
> execution_time=$(( $end - $start ))
> echo "execution_time:"$execution_time
> {code}
> The execution time:
> || |||mr||spark||
> |join|20 sec|79 sec|
> You can download the test data NYSE_daily and NYSE_dividends in 
> https://github.com/alanfgates/programmingpig/blob/master/data/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4281) Fix TestFinish for Spark engine

2016-07-04 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4281.
--
Resolution: Fixed

Latest patch is committed to Spark branch. Thanks, Liyun!

> Fix TestFinish for Spark engine
> ---
>
> Key: PIG-4281
> URL: https://issues.apache.org/jira/browse/PIG-4281
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4281.patch, PIG-4281_2.patch, PIG-4281_3.patch, 
> TEST-org.apache.pig.test.TestFinish.txt
>
>
> error log is attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4898) Fix unit test failure after PIG-4771's patch was checked in

2016-06-01 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4898.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Fix unit test failure after PIG-4771's patch was checked in
> ---
>
> Key: PIG-4898
> URL: https://issues.apache.org/jira/browse/PIG-4898
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4898.patch
>
>
> Now in the [latest Jenkins build|https://builds.apache.org/job/Pig-spark/#328], it 
> shows that the following unit test cases fail:
>  org.apache.pig.test.TestFRJoin.testDistinctFRJoin
>  org.apache.pig.test.TestPigRunner.simpleMultiQueryTest3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4771) Implement FR Join for spark engine

2016-05-16 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4771:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Implement FR Join for spark engine
> --
>
> Key: PIG-4771
> URL: https://issues.apache.org/jira/browse/PIG-4771
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4771.patch, PIG-4771_2.patch, PIG-4771_3.patch
>
>
> We use a regular join in place of FR join in the current code base (fd31fda). 
> We need to implement FR join.
> Some info collected from 
> https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins:
> *Replicated Joins*
> Fragment replicate join is a special type of join that works well if one or 
> more relations are small enough to fit into main memory. In such cases, Pig 
> can perform a very efficient join because all of the hadoop work is done on 
> the map side. In this type of join the large relation is followed by one or 
> more small relations. The small relations must be small enough to fit into 
> main memory; if they don't, the process fails and an error is generated.
> *Usage*
> Perform a replicated join with the USING clause (see JOIN (inner) and JOIN 
> (outer)). In this example, a large relation is joined with two smaller 
> relations. Note that the large relation comes first followed by the smaller 
> relations; and, all small relations together must fit into main memory, 
> otherwise an error is generated.
> big = LOAD 'big_data' AS (b1,b2,b3);
> tiny = LOAD 'tiny_data' AS (t1,t2,t3);
> mini = LOAD 'mini_data' AS (m1,m2,m3);
> C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
> *Conditions*
> Fragment replicate joins are experimental; we don't have a strong sense of 
> how small the small relation must be to fit into memory. In our tests with a 
> simple query that involves just a JOIN, a relation of up to 100 M can be used 
> if the process overall gets 1 GB of memory. 
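A minimal plain-Python stand-in for the fragment replicate join contract described above (illustrative data; real Pig operates on tuples and bags): the small relations are indexed entirely in memory and the large relation is streamed past them on the map side.

```python
def replicated_join(big, tiny, mini):
    # Index each small relation entirely in memory (the "replicate" part).
    tiny_idx, mini_idx = {}, {}
    for k, row in tiny:
        tiny_idx.setdefault(k, []).append(row)
    for k, row in mini:
        mini_idx.setdefault(k, []).append(row)
    # The big relation is streamed; all join work happens map-side,
    # with no reduce/shuffle phase at all.
    return [(k, b, t, m)
            for k, b in big
            for t in tiny_idx.get(k, [])
            for m in mini_idx.get(k, [])]

big = [(1, 'b1'), (2, 'b2')]
tiny = [(1, 't1')]
mini = [(1, 'm1'), (1, 'm2')]
print(replicated_join(big, tiny, mini))
# [(1, 'b1', 't1', 'm1'), (1, 'b1', 't1', 'm2')]
```

The memory constraint in the quoted docs corresponds to the in-memory indexes here: if `tiny` and `mini` together do not fit, the real operator fails with an error rather than spilling.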



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators

2016-05-16 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4876.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda.

> OutputConsumeIterator can't handle the last buffered tuples for some Operators
> --
>
> Key: PIG-4876
> URL: https://issues.apache.org/jira/browse/PIG-4876
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4876.patch
>
>
> Some Operators, such as MergeCogroup, Stream, CollectedGroup etc buffer some 
> input records to constitute the result tuples. The last result tuples are 
> buffered in the operator.  These Operators need a flag to indicate the end of 
> input, so that they can flush and constitute their last tuples.
> Currently, the flag 'parentPlan.endOfAllInput' is targeted for flushing the 
> buffered tuples in MR mode.  But it does not work with OutputConsumeIterator 
> in Spark mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4893) Task deserialization time is too long for spark on yarn mode

2016-05-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282831#comment-15282831
 ] 

Xuefu Zhang commented on PIG-4893:
--

[~kellyzly], what's the value for spark.serializer? It should be set to 
org.apache.spark.serializer.KryoSerializer. Also, even with Kryo, there are 
still some optimizations that can be done, such as registering the classes that 
are serialized. Thanks.

> Task deserialization time is too long for spark on yarn mode
> 
>
> Key: PIG-4893
> URL: https://issues.apache.org/jira/browse/PIG-4893
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: time.PNG
>
>
> I found that the task deserialization time is quite long when I run any of the 
> pigmix scripts in spark on yarn mode; see the attached picture. The task 
> duration is 3s but the task deserialization takes 13s.
> My env is hadoop2.6+spark1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4846) Use pigmix to test the performance of pig on spark

2016-04-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246099#comment-15246099
 ] 

Xuefu Zhang commented on PIG-4846:
--

The basic idea is to make maximum use of the given resources (memory and CPU). 
Depending on which is scarce, we want to exhaust the scarce one first; in your 
case, memory. In general, you want at least 2G per core for Spark, and 4, 5, or 
6 cores per executor. In our case, we set 4 cores and 8G memory per executor. 
Of the executor memory, in general, 15-20% goes to memory overhead. Driver 
memory is less critical unless there is an OOM, which requires more memory; 2G 
is a good minimum.

For more details, I wrote a doc which was included in CDH5.7 for Hive on Spark. 
http://www.cloudera.com/documentation/enterprise/latest/topics/admin_hos_tuning.html.
 While that's for Hive on Spark, some of the configurations may apply to Pig as 
well.

Let me know if you have more questions.

> Use pigmix to test the performance of pig on spark
> --
>
> Key: PIG-4846
> URL: https://issues.apache.org/jira/browse/PIG-4846
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4846.patch, PIG-4846_1.patch
>
>
> We can compare the performance between mr and spark mode by pigmix.
> The introduction of pigmix is 
> https://cwiki.apache.org/confluence/display/PIG/PigMix.
> PIG-4846.patch is to make pigmix run with a specified exectype.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4846) Use pigmix to test the performance of pig on spark

2016-04-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239246#comment-15239246
 ] 

Xuefu Zhang commented on PIG-4846:
--

Your machine seems to have abundant cores but scarce memory. I suggest the 
following:

YARN configuration:
{code}
yarn.nodemanager.resource.memory-mb=56G
 yarn.nodemanager.resource.cpu-vcores=28
{code}

Spark configurations:
{code}
spark.executor.cores=4
spark.executor.memory=6.4G
spark.yarn.executor.memoryOverhead=1.6G
spark.driver.memory=2G
spark.yarn.driver.memoryOverhead=400M
spark.executor.instances=7
{code}
Please note that the numbers might need to be converted to the unit of each 
individual property. For instance, .memory takes bytes while memoryOverhead 
takes MB.
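As a quick sanity check (plain arithmetic, assuming the layout above), the suggested executor settings exactly fill the proposed YARN limits:

```python
# 7 executors x (6.4G heap + 1.6G overhead) should equal the 56G node limit,
# and 7 executors x 4 cores should equal the 28 vcores.
executors = 7
heap_gb, overhead_gb, cores = 6.4, 1.6, 4

total_mem = executors * (heap_gb + overhead_gb)
total_cores = executors * cores
overhead_ratio = overhead_gb / (heap_gb + overhead_gb)

print(total_mem)                 # 56.0 == yarn.nodemanager.resource.memory-mb
print(total_cores)               # 28   == yarn.nodemanager.resource.cpu-vcores
print(round(overhead_ratio, 2))  # 0.2  -> 20% memory overhead
```

The 20% overhead share matches the 15-20% rule of thumb for executor memory overhead.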



> Use pigmix to test the performance of pig on spark
> --
>
> Key: PIG-4846
> URL: https://issues.apache.org/jira/browse/PIG-4846
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4846.patch, PIG-4846_1.patch
>
>
> We can compare the performance between mr and spark mode by pigmix.
> The introduction of pigmix is 
> https://cwiki.apache.org/confluence/display/PIG/PigMix.
> PIG-4846.patch is to make pigmix run with a specified exectype.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4854) Merge spark branch to trunk

2016-04-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237125#comment-15237125
 ] 

Xuefu Zhang commented on PIG-4854:
--

Yes, we can still commit to Spark branch and later merge back to trunk.

> Merge spark branch to trunk
> ---
>
> Key: PIG-4854
> URL: https://issues.apache.org/jira/browse/PIG-4854
> Project: Pig
>  Issue Type: Task
>Reporter: Pallavi Rao
> Attachments: PIG-On-Spark.patch
>
>
> Believe the spark branch will be shortly ready to be merged with the main 
> branch (couple of minor patches pending commit), given that we have addressed 
> most functionality gaps and have ensured the UTs are clean. There are a few 
> optimizations which we will take up once the branch is merged to trunk.
> [~xuefuz], [~rohini], [~daijy],
> Hopefully, you agree that the spark branch is ready for merge. If yes, how 
> would you like us to go about it? Do you want me to upload a huge patch that 
> will be merged like any other patch, or do you prefer a branch merge?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4859) Need upgrade snappy-java.version to 1.1.1.3

2016-04-11 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4859:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Need upgrade snappy-java.version to 1.1.1.3
> ---
>
> Key: PIG-4859
> URL: https://issues.apache.org/jira/browse/PIG-4859
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4859.patch
>
>
> Run pig on spark in a yarn-client env as follows:
> export SPARK_MASTER="yarn-client"
> ./pig -x spark xxx.pig
> Throw error like following:
> {code}
> main] 2016-03-30 16:52:26,115 INFO  scheduler.DAGScheduler 
> (Logging.scala:logInfo(59)) - Job 0 failed: saveAsNewAPIHadoopDataset at 
> StoreConverter.java:101, took 73.980147 s
> 19895 [main] 2016-03-30 16:52:26,119 ERROR spark.JobGraphBuilder 
> (JobGraphBuilder.java:sparkOperToRDD(166)) - throw exception in 
> sparkOperToRDD:
> 19896 org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 0.0 (TID 3, zly1.sh.intel.com): java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/lang/Object;II)I
> 19897 at org.xerial.snappy.SnappyNative.uncompressedLength(Native 
> Method)
> 19898 at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:541)
> 19899 at 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:350)
> 19900 at 
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
> 19901 at 
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 19902 at 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
> 19903 at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
> 19904 at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
> 19905 at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
> 19906 at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
> 19907 at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
> 19908 at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
> 19909 at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:103)
> 19910 at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
> 19911 at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4855) Merge trunk[4] into spark branch

2016-04-02 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4855.
--
Resolution: Fixed

Merged from trunk to Spark branch.

> Merge trunk[4] into spark branch 
> -
>
> Key: PIG-4855
> URL: https://issues.apache.org/jira/browse/PIG-4855
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4855-conflicts.patch, PIG-4855_2.patch, 
> PIG-4855_3.patch
>
>
> Hopefully, a final merge from trunk to spark branch. There was a merge 
> conflict. Will upload the patch that resolves the conflict once I ensure all 
> UTs pass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4855) Merge trunk[4] into spark branch

2016-04-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1590#comment-1590
 ] 

Xuefu Zhang commented on PIG-4855:
--

I did an svn merge from trunk to the spark branch, and the merge was clean. Thus, 
I didn't use the patch here. Please let me know if I missed anything.

> Merge trunk[4] into spark branch 
> -
>
> Key: PIG-4855
> URL: https://issues.apache.org/jira/browse/PIG-4855
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4855-conflicts.patch, PIG-4855_2.patch, 
> PIG-4855_3.patch
>
>
> Hopefully, a final merge from trunk to spark branch. There was a merge 
> conflict. Will upload the patch that resolves the conflict once I ensure all 
> UTs pass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4855) Merge trunk[4] into spark branch

2016-03-30 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4855:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi/Liyun!

> Merge trunk[4] into spark branch 
> -
>
> Key: PIG-4855
> URL: https://issues.apache.org/jira/browse/PIG-4855
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4855-conflicts.patch, PIG-4855_2.patch
>
>
> Hopefully, a final merge from trunk to spark branch. There was a merge 
> conflict. Will upload the patch that resolves the conflict once I ensure all 
> UTs pass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4848) pig.noSplitCombination=true should always be set internally for a merge join

2016-03-30 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4848.
--
Resolution: Fixed

Patch committed to Spark branch. Thanks, Xianda!

> pig.noSplitCombination=true should always be set internally for a merge join
> 
>
> Key: PIG-4848
> URL: https://issues.apache.org/jira/browse/PIG-4848
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4848-2.patch, PIG-4848.patch
>
>
> In spark mode, for a merge join, the flag is NOT set as true internally. The 
> input splits will be in the order of file size. The output is out of order.
> Scenario:
> cat input1
> {code}
> 1 1
> {code}
> cat input2
> {code}
> 2 2
> {code}
> cat input3
> {code}
> 33 33
> {code}
> A = LOAD 'input*' as (a:int, b:int);
> B = LOAD 'input*' as (a:int, b:int);
> C = JOIN A BY $0, B BY $0 USING 'merge';
> DUMP C;
> expected result:
> {code}
> (1,1,1,1)
> (2,2,2,2)
> (33,33,33,33)
> {code}
> actual result:
> {code}
> (33,33,33,33)
> (1,1,1,1)
> (2,2,2,2)
> {code}
> In MR mode, the flag was set as true internally for a merge join (see 
> PIG-2773). However, it doesn't work now. The output is still out of order, 
> because the splits will be ordered again by hadoop-client. In spark mode, we 
> can solve this issue.
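A tiny plain-Python stand-in (illustrative, not Pig code) shows why split order matters: a merge join assumes each side arrives in global key order, so concatenating splits by file size instead of key order silently breaks the join.

```python
def merge_join(left_splits, right_splits):
    # A merge join streams both sides in key order, so the splits of each
    # side must be concatenated in key order, not in file-size order.
    left = [row for split in left_splits for row in split]
    right = [row for split in right_splits for row in split]
    out, j = [], 0
    for k, a in left:
        # advance the right cursor only forward -- this is what assumes order
        while j < len(right) and right[j][0] < k:
            j += 1
        i = j
        while i < len(right) and right[i][0] == k:
            out.append((k, a, right[i][1]))
            i += 1
    return out

ordered = [[(1, 1)], [(2, 2)], [(33, 33)]]   # splits concatenated in key order
by_size = [[(33, 33)], [(1, 1)], [(2, 2)]]   # splits ordered by file size
print(merge_join(ordered, ordered))  # [(1, 1, 1), (2, 2, 2), (33, 33, 33)]
print(merge_join(by_size, by_size))  # [(33, 33, 33)] -- later rows silently lost
```

With real sorted-file merge joins the symptom is the out-of-order (or truncated) output reported in the scenario, which is why pig.noSplitCombination=true must be forced internally.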



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4842) Collected group doesn't work in some cases

2016-03-30 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4842.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda!

> Collected group doesn't work in some cases
> --
>
> Key: PIG-4842
> URL: https://issues.apache.org/jira/browse/PIG-4842
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4842-2.patch, PIG-4842.patch
>
>
> Scenario:
> 1. input data:
> cat collectedgroup1
> {code}
> 1
> 1
> 2
> {code}
> 2. pig script:
> {code}
> A = LOAD 'collectedgroup1' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected';
> C = GROUP B by $0 USING 'collected';
> DUMP C;
> {code}
> The expected output:
> {code}
> (1,{(1,{(1),(1)})})
> (2,{(2,{(2)})})
> {code}
> The actual output:
> {code}
> (1,{(1,{(1),(1)})})
> (1,)
> (2,{(2,{(2)})})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4855) Merge trunk[4] into spark branch

2016-03-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219213#comment-15219213
 ] 

Xuefu Zhang commented on PIG-4855:
--

[~pallavi.rao], the patch doesn't seem to apply. Could you please check?
xuefu@peki:~/apache/svn-pig-spark$patch -p0 < 
~/Downloads/PIG-4855-conflicts.patch 
patch:  Only garbage was found in the patch input.


> Merge trunk[4] into spark branch 
> -
>
> Key: PIG-4855
> URL: https://issues.apache.org/jira/browse/PIG-4855
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4855-conflicts.patch
>
>
> Hopefully, a final merge from trunk to spark branch. There was a merge 
> conflict. Will upload the patch that resolves the conflict once I ensure all 
> UTs pass.





[jira] [Commented] (PIG-4837) TestNativeMapReduce test fix

2016-03-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213292#comment-15213292
 ] 

Xuefu Zhang commented on PIG-4837:
--

Hi [~kellyzly], I don't know how to switch the build machine. Do you have any
idea how to do that?

> TestNativeMapReduce test fix
> 
>
> Key: PIG-4837
> URL: https://issues.apache.org/jira/browse/PIG-4837
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4837.patch, build23.PNG
>
>






Re: Welcome our new Pig PMC chair Daniel Dai

2016-03-23 Thread Xuefu Zhang
Congratulations, Daniel!

On Wed, Mar 23, 2016 at 3:23 PM, Rohini Palaniswamy  wrote:

> Hi folks,
> I am very happy to announce that we elected Daniel Dai as our new Pig
> PMC Chair and it is official now.  Please join me in congratulating Daniel.
>
> Regards,
> Rohini
>


[jira] [Updated] (PIG-4838) Fix test TestBuiltin

2016-03-19 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4838:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Fix test TestBuiltin
> 
>
> Key: PIG-4838
> URL: https://issues.apache.org/jira/browse/PIG-4838
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4838.patch
>
>
> In https://builds.apache.org/job/Pig-spark/316/, the following unit tests fail:
> org.apache.pig.test.TestBuiltin.testRANDOMWithJob
> org.apache.pig.test.TestBuiltin.testUniqueID





[jira] [Commented] (PIG-4837) TestNativeMapReduce test fix

2016-03-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200990#comment-15200990
 ] 

Xuefu Zhang commented on PIG-4837:
--

Committed. Thanks, Liyun! I will keep this JIRA open for now.

> TestNativeMapReduce test fix
> 
>
> Key: PIG-4837
> URL: https://issues.apache.org/jira/browse/PIG-4837
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4837.patch, build23.PNG
>
>






[jira] [Updated] (PIG-4836) Fix TestEvalPipeline test failure

2016-03-10 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4836:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi!

> Fix TestEvalPipeline test failure
> -
>
> Key: PIG-4836
> URL: https://issues.apache.org/jira/browse/PIG-4836
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4836.patch
>
>
> There are two test failures:
> testMapUDF
> testLimit 
> testLimit will get fixed by PIG-4832. This JIRA will only fix testMapUDF.





[jira] [Updated] (PIG-4835) Fix TestPigRunner test failure

2016-03-09 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4835:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi.

> Fix TestPigRunner test failure
> --
>
> Key: PIG-4835
> URL: https://issues.apache.org/jira/browse/PIG-4835
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4835.patch
>
>






[jira] [Updated] (PIG-4827) Fix TestSample UT failure

2016-03-08 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4827:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi.

> Fix TestSample UT failure
> -
>
> Key: PIG-4827
> URL: https://issues.apache.org/jira/browse/PIG-4827
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4827-v1.patch, PIG-4827.patch
>
>






[jira] [Updated] (PIG-4829) TestLimitVariable test fix

2016-03-08 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4829:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun.

> TestLimitVariable test fix
> --
>
> Key: PIG-4829
> URL: https://issues.apache.org/jira/browse/PIG-4829
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4289.patch
>
>






[jira] [Updated] (PIG-4828) TestMultiQuery test fix

2016-03-07 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4828:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi!

> TestMultiQuery test fix
> ---
>
> Key: PIG-4828
> URL: https://issues.apache.org/jira/browse/PIG-4828
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spark
> Fix For: spark-branch
>
> Attachments: PIG-4828.patch
>
>






[jira] [Updated] (PIG-4828) TestMultiQuery test fix

2016-03-07 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4828:
-
Labels: spark  (was: spork)

> TestMultiQuery test fix
> ---
>
> Key: PIG-4828
> URL: https://issues.apache.org/jira/browse/PIG-4828
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spark
> Fix For: spark-branch
>
> Attachments: PIG-4828.patch
>
>






[jira] [Commented] (PIG-4828) TestMultiQuery test fix

2016-03-07 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184461#comment-15184461
 ] 

Xuefu Zhang commented on PIG-4828:
--

+1

> TestMultiQuery test fix
> ---
>
> Key: PIG-4828
> URL: https://issues.apache.org/jira/browse/PIG-4828
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spark
> Fix For: spark-branch
>
> Attachments: PIG-4828.patch
>
>






[jira] [Commented] (PIG-4825) Fix TestMultiQuery failure

2016-03-07 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184401#comment-15184401
 ] 

Xuefu Zhang commented on PIG-4825:
--

[~pallavi.rao], to keep the history clean, I guess it's better if we create a 
new JIRA and provide a patch for it, which reverts the original patch here plus 
the new changes. Does this sound doable? Thanks.

> Fix TestMultiQuery failure
> --
>
> Key: PIG-4825
> URL: https://issues.apache.org/jira/browse/PIG-4825
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4825-testfix.patch, PIG-4825.patch
>
>






[jira] [Updated] (PIG-4826) Add excluded-tests-spark

2016-03-06 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4826:
-
Attachment: PIG-4826-amend.patch

It looks like an additional change is needed. A trivial patch is attached and
committed to fix the build.

> Add excluded-tests-spark
> 
>
> Key: PIG-4826
> URL: https://issues.apache.org/jira/browse/PIG-4826
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4826-amend.patch, PIG-4826.patch
>
>






[jira] [Commented] (PIG-4826) Add excluded-tests-spark

2016-03-06 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182199#comment-15182199
 ] 

Xuefu Zhang commented on PIG-4826:
--

+1. Patch committed to Spark branch. Thanks, Pallavi.

> Add excluded-tests-spark
> 
>
> Key: PIG-4826
> URL: https://issues.apache.org/jira/browse/PIG-4826
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4826.patch
>
>






[jira] [Updated] (PIG-4826) Add excluded-tests-spark

2016-03-06 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4826:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Add excluded-tests-spark
> 
>
> Key: PIG-4826
> URL: https://issues.apache.org/jira/browse/PIG-4826
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4826.patch
>
>






[jira] [Updated] (PIG-4825) Fix TestMultiQuery failure

2016-03-04 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4825:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi.

> Fix TestMultiQuery failure
> --
>
> Key: PIG-4825
> URL: https://issues.apache.org/jira/browse/PIG-4825
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4825.patch
>
>






[jira] [Commented] (PIG-4825) Fix TestMultiQuery failure

2016-03-04 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180319#comment-15180319
 ] 

Xuefu Zhang commented on PIG-4825:
--

+1

> Fix TestMultiQuery failure
> --
>
> Key: PIG-4825
> URL: https://issues.apache.org/jira/browse/PIG-4825
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4825.patch
>
>






[jira] [Updated] (PIG-4823) SparkMiniCluster does not cleanup old conf files during setup

2016-03-04 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4823:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi.

> SparkMiniCluster does not cleanup old conf files during setup
> -
>
> Key: PIG-4823
> URL: https://issues.apache.org/jira/browse/PIG-4823
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4823.patch
>
>
> If some of the tests fail or get killed, new tests will use the old hadoop 
> conf files left behind by the previous tests, causing the following failure:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not 
> create staging directory. 
>   at 
> org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceInit(MiniMRYarnCluster.java:165)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.pig.test.SparkMiniCluster.setupMiniDfsAndMrClusters(SparkMiniCluster.java:93)
>   at 
> org.apache.pig.test.MiniGenericCluster.buildCluster(MiniGenericCluster.java:86)
>   at 
> org.apache.pig.test.MiniGenericCluster.buildCluster(MiniGenericCluster.java:68)
>   at 
> org.apache.pig.test.TestToolsPigServer.(TestToolsPigServer.java:42)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy16.mkdirs(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2753)
>   at org.apache.hadoop.fs.Hdfs.mkdir(Hdfs.java:311)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720)
>   at 
> org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceInit(MiniMRYarnCluster.java:163)
> Caused by: java.net.ConnectException: Connection refused
> {noformat}
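The failure pattern suggests the shape of the fix: before building a new mini cluster, delete whatever conf files a previous failed or killed run left behind. A hedged Python sketch of that cleanup step (the actual patch is Java code in SparkMiniCluster; the `*-site.xml` glob here is an assumption for illustration, not the patch's exact logic):

```python
import glob
import os

def clean_stale_conf(conf_dir):
    """Delete leftover Hadoop *-site.xml files from a prior test run so a
    new mini cluster cannot pick up a dead cluster's addresses and staging
    directories. Returns the names of the files removed."""
    removed = []
    for path in glob.glob(os.path.join(conf_dir, "*-site.xml")):
        os.remove(path)
        removed.append(os.path.basename(path))
    return sorted(removed)
```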





[jira] [Updated] (PIG-4820) Merge trunk[3] into spark branch

2016-03-04 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4820:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to spark branch. [~pallavi.rao], I'm not sure if I have messed 
up anything during the merge, but please feel free to create a separate JIRA to 
address it if I did.

> Merge trunk[3] into spark branch
> 
>
> Key: PIG-4820
> URL: https://issues.apache.org/jira/browse/PIG-4820
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4820-conflicts.patch, PIG-4820.patch
>
>






[jira] [Commented] (PIG-4820) Merge trunk[3] into spark branch

2016-03-03 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178131#comment-15178131
 ] 

Xuefu Zhang commented on PIG-4820:
--

Hi [~pallavi.rao], to preserve the history, I need to use "svn merge". For 
that, I need from you a patch that contains only the changes to resolve the 
conflicts. Is that possible? Thanks.

> Merge trunk[3] into spark branch
> 
>
> Key: PIG-4820
> URL: https://issues.apache.org/jira/browse/PIG-4820
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4820.patch
>
>






[jira] [Commented] (PIG-4823) SparkMiniCluster does not cleanup old conf files during setup

2016-03-03 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178064#comment-15178064
 ] 

Xuefu Zhang commented on PIG-4823:
--

+1

> SparkMiniCluster does not cleanup old conf files during setup
> -
>
> Key: PIG-4823
> URL: https://issues.apache.org/jira/browse/PIG-4823
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4823.patch
>
>
> If some of the tests fail or get killed, new tests will use the old hadoop 
> conf files left behind by the previous tests, causing the following failure:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not 
> create staging directory. 
>   at 
> org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceInit(MiniMRYarnCluster.java:165)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.pig.test.SparkMiniCluster.setupMiniDfsAndMrClusters(SparkMiniCluster.java:93)
>   at 
> org.apache.pig.test.MiniGenericCluster.buildCluster(MiniGenericCluster.java:86)
>   at 
> org.apache.pig.test.MiniGenericCluster.buildCluster(MiniGenericCluster.java:68)
>   at 
> org.apache.pig.test.TestToolsPigServer.(TestToolsPigServer.java:42)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy16.mkdirs(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2753)
>   at org.apache.hadoop.fs.Hdfs.mkdir(Hdfs.java:311)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720)
>   at 
> org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceInit(MiniMRYarnCluster.java:163)
> Caused by: java.net.ConnectException: Connection refused
> {noformat}





[jira] [Resolved] (PIG-4776) Enable unit test "TestOrcStoragePushdown" for spark

2016-02-29 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4776.
--
Resolution: Fixed

The patch is committed to Spark branch. Thanks, Liyun!

> Enable unit test "TestOrcStoragePushdown" for spark
> ---
>
> Key: PIG-4776
> URL: https://issues.apache.org/jira/browse/PIG-4776
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4776.patch
>
>
> In the latest Jenkins report
> (https://builds.apache.org/job/Pig-spark/292/#showFailuresLink), the
> following unit tests fail:
> org.apache.pig.builtin.TestOrcStoragePushdown.testColumnPruning
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownBigDecimal
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownTimestamp
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownChar
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownByteShort
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownFloatDouble
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownIntLongString
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownBoolean
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownVarchar
>   





[jira] [Commented] (PIG-4776) Enable unit test "TestOrcStoragePushdown" for spark

2016-02-29 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171895#comment-15171895
 ] 

Xuefu Zhang commented on PIG-4776:
--

+1

> Enable unit test "TestOrcStoragePushdown" for spark
> ---
>
> Key: PIG-4776
> URL: https://issues.apache.org/jira/browse/PIG-4776
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4776.patch
>
>
> In the latest Jenkins report
> (https://builds.apache.org/job/Pig-spark/292/#showFailuresLink), the
> following unit tests fail:
> org.apache.pig.builtin.TestOrcStoragePushdown.testColumnPruning
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownBigDecimal
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownTimestamp
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownChar
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownByteShort
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownFloatDouble
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownIntLongString
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownBoolean
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownVarchar
>   





[jira] [Resolved] (PIG-4243) Fix "TestStore" for Spark engine

2016-02-29 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4243.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Fix "TestStore" for Spark engine
> 
>
> Key: PIG-4243
> URL: https://issues.apache.org/jira/browse/PIG-4243
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4243.patch, PIG-4243_1.patch, 
> TEST-org.apache.pig.test.TestStore.txt
>
>
> 1. Build spark and pig env according to PIG-4168
> 2. add TestStore to $PIG_HOME/test/spark-tests
> cat  $PIG_HOME/test/spark-tests
> **/TestStore
> 3. run unit test TestStore
> ant test-spark
> 4. the unit test fails
> error log is attached





[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-28 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171344#comment-15171344
 ] 

Xuefu Zhang commented on PIG-4788:
--

I think letting PigSplit extend FileSplit is fine. However, I agree with 
[~pallavi.rao] that this might cause difficulty when merging. Thus, I think we 
can leave this open for now until after we merge the branch to trunk. Thoughts?

> the value BytesRead metric info always returns 0 even the length of input 
> file is not 0 in spark engine
> ---
>
> Key: PIG-4788
> URL: https://issues.apache.org/jira/browse/PIG-4788
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4788.patch
>
>
> In 
> [JobMetricsLinstener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140],
>  taskMetrics.inputMetrics().get().bytesRead() always returns 0 even the 
> length of input file is not zero.





[jira] [Commented] (PIG-4781) Fix remaining unit failure about "TestCollectedGroup" for spark engine

2016-02-24 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166760#comment-15166760
 ] 

Xuefu Zhang commented on PIG-4781:
--

+1. 

> Fix remaining unit failure about "TestCollectedGroup" for spark engine
> --
>
> Key: PIG-4781
> URL: https://issues.apache.org/jira/browse/PIG-4781
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4781.patch
>
>
> In
> https://builds.apache.org/job/Pig-spark/lastUnsuccessfulBuild/#showFailuresLink,
> the following unit test fails:
> org.apache.pig.test.TestCollectedGroup.testMapsideGroupWithMergeJoin
> This fails because currently we use a regular join to implement merge join.
> The exception is:
> {code}
> Caused by: 
> org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkCompilerException:
>  ERROR 2171: Expected one but found more then one root physical operator in 
> physical physicalPlan.
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkCompiler.visitCollectedGroup(SparkCompiler.java:512)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCollectedGroup.visit(POCollectedGroup.java:93)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkCompiler.compile(SparkCompiler.java:259)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkCompiler.compile(SparkCompiler.java:240)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkCompiler.compile(SparkCompiler.java:240)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkCompiler.compile(SparkCompiler.java:165)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.compile(SparkLauncher.java:425)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:150)
>   at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
>   at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
>   at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
>   at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
>   ... 27 more
> {code}
> After we implement Merge join, this unit test can be fixed.





Re: Welcome to our new Pig PMC member Xuefu Zhang

2016-02-24 Thread Xuefu Zhang
Thank you, Liyun! You did the hard work. I think you well deserve a
committership once we merge the branch to trunk.

--Xuefu

On Wed, Feb 24, 2016 at 5:18 PM, Zhang, Liyun <liyun.zh...@intel.com> wrote:

> Congratulations Xuefu!
>
>
> Kelly Zhang/Zhang,Liyun
> Best Regards
>
>
>
> -Original Message-
> From: Jarek Jarcec Cecho [mailto:jar...@gmail.com] On Behalf Of Jarek
> Jarcec Cecho
> Sent: Thursday, February 25, 2016 6:36 AM
> To: dev@pig.apache.org
> Cc: u...@pig.apache.org
> Subject: Re: Welcome to our new Pig PMC member Xuefu Zhang
>
> Congratulations Xuefu!
>
> Jarcec
>
> > On Feb 24, 2016, at 1:29 PM, Rohini Palaniswamy <rohini.adi...@gmail.com>
> wrote:
> >
> > It is my pleasure to announce that Xuefu Zhang is our newest addition
> > to the Pig PMC. Xuefu is a long time committer of Pig and has been
> > actively involved in driving the Pig on Spark effort for the past year.
> >
> > Please join me in congratulating Xuefu !!!
> >
> > Regards,
> > Rohini
>
>


[jira] [Updated] (PIG-4807) Fix test cases of "TestEvalPipelineLocal" test suite.

2016-02-24 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4807:
-
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Prateek!

> Fix test cases of "TestEvalPipelineLocal" test suite.
> -
>
> Key: PIG-4807
> URL: https://issues.apache.org/jira/browse/PIG-4807
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Affects Versions: spark-branch
>Reporter: prateek vaishnav
>Assignee: prateek vaishnav
> Fix For: spark-branch
>
> Attachments: diff_1, diff_2
>
>
> This JIRA is created to address the failure of the following test cases:
> org.apache.pig.test.TestEvalPipelineLocal.testSetLocationCalledInFE
> org.apache.pig.test.TestEvalPipelineLocal.testExplainInDotGraph
> org.apache.pig.test.TestEvalPipelineLocal.testSortWithUDF





[jira] [Updated] (PIG-4281) Fix TestFinish for Spark engine

2016-02-19 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4281:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Fix TestFinish for Spark engine
> ---
>
> Key: PIG-4281
> URL: https://issues.apache.org/jira/browse/PIG-4281
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4281.patch, PIG-4281_2.patch, 
> TEST-org.apache.pig.test.TestFinish.txt
>
>
> error log is attached





[jira] [Commented] (PIG-4601) Implement Merge CoGroup for Spark engine

2016-02-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152352#comment-15152352
 ] 

Xuefu Zhang commented on PIG-4601:
--

Reverted the old commit and committed the new patch. Thanks, Liyun!

> Implement Merge CoGroup for Spark engine
> 
>
> Key: PIG-4601
> URL: https://issues.apache.org/jira/browse/PIG-4601
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Affects Versions: spark-branch
>Reporter: Mohit Sabharwal
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4601_1.patch, PIG-4601_2.patch, PIG-4601_3.patch, 
> PIG-4601_4.patch
>
>
> When doing a cogroup operation, we need to do a map-reduce. The goal of merge 
> cogroup is to implement cogroup in a single stage (map only), but we need to 
> guarantee that the input data is sorted.
> There is a performance improvement when A (a big dataset) is merge-cogrouped with 
> B (a small dataset): we first generate an index file of A, then load A 
> according to the index file and load B into memory to do the cogroup. Performance 
> improves because there is no reduce-phase cost, unlike a regular cogroup.
> How to use:
> {code}
> C = cogroup A by c1, B by c1 using 'merge';
> {code}
> Here A and B are sorted.
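For intuition, the single-stage merge-cogroup idea described above can be sketched in plain Python (names and structure are invented for illustration; this is not Pig's actual implementation):

```python
from itertools import groupby
from operator import itemgetter

def merge_cogroup(a_sorted, b_small):
    """Cogroup a big, key-sorted dataset A with a small dataset B.

    A is streamed in a single pass (no reduce/shuffle phase); B is loaded
    entirely into an in-memory index, mirroring the description above.
    Keys present only in B are omitted here for brevity.
    """
    b_index = {}
    for key, value in b_small:
        b_index.setdefault(key, []).append(value)
    for key, group in groupby(a_sorted, key=itemgetter(0)):
        yield key, ([v for _, v in group], b_index.get(key, []))

result = dict(merge_cogroup([(1, "a"), (1, "b"), (2, "c")],
                            [(1, "x"), (3, "y")]))
print(result)  # {1: (['a', 'b'], ['x']), 2: (['c'], [])}
```

Because A is consumed as a sorted stream, the whole grouping fits in one map-side pass, which is exactly why the shuffle can be skipped.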





[jira] [Updated] (PIG-4601) Implement Merge CoGroup for Spark engine

2016-02-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4601:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Implement Merge CoGroup for Spark engine
> 
>
> Key: PIG-4601
> URL: https://issues.apache.org/jira/browse/PIG-4601
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Affects Versions: spark-branch
>Reporter: Mohit Sabharwal
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4601_1.patch, PIG-4601_2.patch, PIG-4601_3.patch
>
>
> When doing a cogroup operation, we need to do a map-reduce. The goal of merge 
> cogroup is to implement cogroup in a single stage (map only), but we need to 
> guarantee that the input data is sorted.
> There is a performance improvement when A (a big dataset) is merge-cogrouped with 
> B (a small dataset): we first generate an index file of A, then load A 
> according to the index file and load B into memory to do the cogroup. Performance 
> improves because there is no reduce-phase cost, unlike a regular cogroup.
> How to use:
> {code}
> C = cogroup A by c1, B by c1 using 'merge';
> {code}
> Here A and B are sorted.





[jira] [Updated] (PIG-4616) Fix UT errors of TestPigRunner in Spark mode

2016-02-14 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4616:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to Spark branch.

> Fix UT errors of TestPigRunner in Spark mode
> 
>
> Key: PIG-4616
> URL: https://issues.apache.org/jira/browse/PIG-4616
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4616.patch, PIG-4616_1.patch, PIG-4616_2.patch
>
>
> The following unit tests failed:
> org.apache.pig.test.TestPigRunner.returnCodeTest
> org.apache.pig.test.TestPigRunner.testEmptyFileCounter
> org.apache.pig.test.TestPigRunner.testDisablePigCounters2
> org.apache.pig.test.TestPigRunner.simpleTest
> org.apache.pig.test.TestPigRunner.simpleTest2
> org.apache.pig.test.TestPigRunner.MQDepJobFailedTest
> org.apache.pig.test.TestPigRunner.scriptsInDfsTest
> org.apache.pig.test.TestPigRunner.testGetHadoopCounters
> org.apache.pig.test.TestPigRunner.simpleMultiQueryTest
> org.apache.pig.test.TestPigRunner.testDuplicateCounterName
> org.apache.pig.test.TestPigRunner.testRegisterExternalJar
> org.apache.pig.test.TestPigRunner.simpleMultiQueryTest2
> org.apache.pig.test.TestPigRunner.testDuplicateCounterName2
> org.apache.pig.test.TestPigRunner.returnCodeTest2
> org.apache.pig.test.TestPigRunner.orderByTest
> org.apache.pig.test.TestPigRunner.testDisablePigCounters
> org.apache.pig.test.TestPigRunner.testLongCounterName
> org.apache.pig.test.TestPigRunner.testEmptyFileCounter2





[jira] [Updated] (PIG-4777) Enable "TestEvalPipelineLocal" for spark

2016-02-14 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4777:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Prateek!

> Enable "TestEvalPipelineLocal" for spark
> 
>
> Key: PIG-4777
> URL: https://issues.apache.org/jira/browse/PIG-4777
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: prateek vaishnav
> Fix For: spark-branch
>
> Attachments: test_patch, test_path_v2
>
>
> In the latest Jenkins 
> report (https://builds.apache.org/job/Pig-spark/lastUnsuccessfulBuild/#showFailuresLink),
>  the following unit tests fail:
> org.apache.pig.test.TestEvalPipelineLocal.testSetLocationCalledInFE
> org.apache.pig.test.TestEvalPipelineLocal.testExplainInDotGraph
> org.apache.pig.test.TestEvalPipelineLocal.testArithmeticCloning
> org.apache.pig.test.TestEvalPipelineLocal.testGroupByTuple
> org.apache.pig.test.TestEvalPipelineLocal.testNestedPlanForCloning
> org.apache.pig.test.TestEvalPipelineLocal.testExpressionReUse
> org.apache.pig.test.TestEvalPipelineLocal.testSortWithUDF





[jira] [Resolved] (PIG-4784) Enable "pig.disable.counter“ for spark engine

2016-02-05 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4784.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Enable "pig.disable.counter“ for spark engine
> -
>
> Key: PIG-4784
> URL: https://issues.apache.org/jira/browse/PIG-4784
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4784.patch, PIG-4784_2.patch
>
>
> When you set pig.disable.counter to "true" in conf/pig.properties, the 
> counters that calculate the number of input and output records are 
> disabled.
> The following unit tests are designed to verify this, but they currently fail:
> org.apache.pig.test.TestPigRunner#testDisablePigCounters
> org.apache.pig.test.TestPigRunner#testDisablePigCounters2





[jira] [Updated] (PIG-4766) Ensure GroupBy is optimized for all algebraic Operations

2016-02-04 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4766:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi!

> Ensure GroupBy is optimized for all algebraic Operations
> 
>
> Key: PIG-4766
> URL: https://issues.apache.org/jira/browse/PIG-4766
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4766-v1.patch, PIG-4766-v2.patch, PIG-4766.patch
>
>






[jira] [Updated] (PIG-4783) Refactor SparkLauncher for spark engine

2016-02-03 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4783:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to spark branch. Thanks, Liyun!

> Refactor SparkLauncher for spark engine
> ---
>
> Key: PIG-4783
> URL: https://issues.apache.org/jira/browse/PIG-4783
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4783.patch, PIG-4783_1.patch
>
>
> Currently, the code of SparkLauncher is too big. We can move the functions 
> that execute the Spark plan and collect job statistics into separate classes.





[jira] [Updated] (PIG-4709) Improve performance of GROUPBY operator on Spark

2016-01-28 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4709:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi!

> Improve performance of GROUPBY operator on Spark
> 
>
> Key: PIG-4709
> URL: https://issues.apache.org/jira/browse/PIG-4709
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4709-v1.patch, PIG-4709-v2.patch, PIG-4709-v3.patch, 
> PIG-4709.patch, TEST-org.apache.pig.test.TestCombiner.xml
>
>
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the 
> grouped data is consumed by subsequent operations to perform algebraic 
> operations, this is sub-optimal because there is a lot of shuffle traffic. 
> The Spark plan must be optimized to use reduceBy, where possible, so that a 
> combiner is used.





[jira] [Resolved] (PIG-4611) Fix remaining unit test failures about "TestHBaseStorage" in spark mode

2016-01-14 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4611.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Fix remaining unit test failures about "TestHBaseStorage" in spark mode
> ---
>
> Key: PIG-4611
> URL: https://issues.apache.org/jira/browse/PIG-4611
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4611.patch, PIG-4611_2.patch, PIG-4611_3.patch
>
>
> https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/ 
> shows the following unit test failures in TestHBaseStorage:
>  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete  
>  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1
>  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2
>  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection
>  org.apache.pig.test.TestHBaseStorage.testCollectedGroup  
>  org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans





[jira] [Resolved] (PIG-4765) Enable TestPoissonSampleLoader in spark mode

2015-12-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4765.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Enable TestPoissonSampleLoader in spark mode
> 
>
> Key: PIG-4765
> URL: https://issues.apache.org/jira/browse/PIG-4765
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4765.patch
>
>
> In 
> https://builds.apache.org/job/Pig-spark/292/testReport/junit/org.apache.pig.test/,
>  TestPoissonSampleLoader fails.





[jira] [Updated] (PIG-4675) Operators with multiple predecessors fail under multiquery optimization

2015-12-18 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4675:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Operators with multiple predecessors fail under multiquery optimization
> ---
>
> Key: PIG-4675
> URL: https://issues.apache.org/jira/browse/PIG-4675
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Peter Lin
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4675_1.patch, PIG-4675_2.patch, PIG-4675_3.patch, 
> name.txt, ssn.txt, test.pig
>
>
> We have been testing the Spark branch of Pig recently with mapr3 and Spark 1.5. It 
> turns out that if we use more than one store command in the Pig script, an 
> exception is thrown by the second store command. 
>  SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long);
>  SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, 
> name:chararray);
>  X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated';
>  R1 = limit SSN_NAME 10;
>  store R1 into '/tmp/test1_r1'; 
>  store X into '/tmp/test1_x';
> Exception Details:
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called 
> with curMem=359237, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as 
> values in memory (estimated size 111.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called 
> with curMem=473685, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored 
> as bytes in memory (estimated size 31.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB)
> 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from 
> newAPIHadoopRDD at LoadConverter.java:88
> 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach 
> (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17)
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin 
> (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22)
> 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in 
> sparkOperToRDD:
> java.lang.RuntimeException: Should have greater than1 predecessors for class 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.
>  Got : 1
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> at org.apache.pig.PigServer.execute(PigServer.java:1364)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:624)
> at org.apache.pig.Main.main(Main.java:170)





[jira] [Updated] (PIG-4746) Ensure spark can be run as PIG action in Oozie

2015-12-18 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4746:
-
Summary: Ensure spark can be run as PIG action in Oozie  (was: Ensure spork 
can be run as PIG action in Oozie)

> Ensure spark can be run as PIG action in Oozie
> --
>
> Key: PIG-4746
> URL: https://issues.apache.org/jira/browse/PIG-4746
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Srikanth Sundarrajan
> Fix For: spark-branch
>
>
> I was able to get Pig on Spark going with Oozie, but only in "local" mode. Here 
> is what I did:
> 1. Used workflow schema version uri:oozie:workflow:0.2 and passed exectype as 
> an argument.
> 2. Copied the Spark jars + the kryo jar into the workflow app lib.
> To get Pig on Spark going in yarn-client mode, a couple of enhancements need to 
> be made:
> 1. Right now, the Spark launcher reads SPARK_MASTER as an env. variable. This needs to 
> become a Pig property.
> 2. The Spark libraries need to be on the classpath of the driver in 
> yarn-client mode. This will need a fix similar to PIG-4667.





[jira] [Resolved] (PIG-4293) Enable unit test "TestNativeMapReduce" for spark

2015-12-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4293.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Enable unit test "TestNativeMapReduce" for spark
> 
>
> Key: PIG-4293
> URL: https://issues.apache.org/jira/browse/PIG-4293
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4293.patch, PIG-4293_1.patch, 
> TEST-org.apache.pig.test.TestNativeMapReduce.txt
>
>
> error log is attached





[jira] [Resolved] (PIG-4754) Fix UT failures in TestScriptLanguage

2015-12-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4754.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda.

> Fix UT failures in TestScriptLanguage
> -
>
> Key: PIG-4754
> URL: https://issues.apache.org/jira/browse/PIG-4754
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4754.patch, PIG-4754_1.patch
>
>
> org.apache.pig.test.TestScriptLanguage.runParallelTest2
> Error Message
> job should succeed
> Stacktrace
> junit.framework.AssertionFailedError: job should succeed
>   at 
> org.apache.pig.test.TestScriptLanguage.runPigRunner(TestScriptLanguage.java:96)
>   at 
> org.apache.pig.test.TestScriptLanguage.runPigRunner(TestScriptLanguage.java:105)
>   at 
> org.apache.pig.test.TestScriptLanguage.runParallelTest2(TestScriptLanguage.java:311)





Re: Review Request 40743: PIG-4709 Improve performance of GROUPBY operator on Spark

2015-11-30 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40743/#review108368
---



src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java
 (line 46)
<https://reviews.apache.org/r/40743/#comment167809>

Nit: let's not rearrange the imports, as it creates an unnecessary diff that 
might make future merges harder.



src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PigSecondaryKeyComparatorSpark.java
 (line 1)
<https://reviews.apache.org/r/40743/#comment167808>

We need license header here.



src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ReduceByConverter.java
 (line 123)
<https://reviews.apache.org/r/40743/#comment167810>

Nit: a better var name would be nice.


- Xuefu Zhang


On Nov. 27, 2015, 11:19 a.m., Pallavi Rao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40743/
> ---
> 
> (Updated Nov. 27, 2015, 11:19 a.m.)
> 
> 
> Review request for pig, Mohit Sabharwal and Xuefu Zhang.
> 
> 
> Bugs: PIG-4709
> https://issues.apache.org/jira/browse/PIG-4709
> 
> 
> Repository: pig-git
> 
> 
> Description
> ---
> 
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the 
> grouped data is consumed by subsequent operations to perform algebraic 
> operations, this is sub-optimal because there is a lot of shuffle traffic.
> The Spark plan must be optimized to use reduceBy, where possible, so that a 
> combiner is used.
> 
> Introduced a combiner optimizer that does the following:
> // Checks for algebraic operations and if they exist.
> // Replaces global rearrange (cogroup) with reduceBy as follows:
> // Input:
> // foreach (using algebraicOp)
> //   -> packager
> //  -> globalRearrange
> //  -> localRearrange
> // Output:
> // foreach (using algebraicOp.Final)
> //   -> reduceBy (uses algebraicOp.Intermediate)
> //  -> foreach (using algebraicOp.Initial)
> //  -> localRearrange
> 
> 
> Diffs
> -
> 
>   
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java
>  f8c1658 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java 
> a4dbadd 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/GlobalRearrangeConverter.java
>  5f74992 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LocalRearrangeConverter.java
>  9ce0492 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PackageConverter.java
>  cb96068 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PigSecondaryKeyComparatorSpark.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PreCombinerLocalRearrangeConverter.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ReduceByConverter.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POReduceBySpark.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/SparkCombinerOptimizer.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java
>  6b66ca1 
>   
> src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java
>  546d91e 
>   src/org/apache/pig/parser/LogicalPlanGenerator.g 99545b0 
>   test/org/apache/pig/test/TestCombiner.java df44293 
> 
> Diff: https://reviews.apache.org/r/40743/diff/
> 
> 
> Testing
> ---
> 
> The patch unblocked one UT in TestCombiner. Added another UT in the same 
> class. Also did some manual testing.
> 
> 
> Thanks,
> 
> Pallavi Rao
> 
>
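The algebraic Initial/Intermediate/Final rewrite described in the review above can be illustrated with a small Python sketch of COUNT (names are hypothetical; this is not the actual Pig/Spark code):

```python
from collections import defaultdict

# Hypothetical stand-ins for an algebraic UDF's three phases, using COUNT.
count_initial = lambda value: 1          # algebraicOp.Initial: one partial per record
count_intermediate = lambda a, b: a + b  # algebraicOp.Intermediate: merged by reduceBy
count_final = lambda partial: partial    # algebraicOp.Final: finish the aggregate

def group_count(records):
    # localRearrange + reduceBy: partials are combined per key as records
    # arrive, so only one partial per key would cross the shuffle boundary,
    # instead of every record as with a plain CoGroup.
    partials = defaultdict(int)
    for key, value in records:
        partials[key] = count_intermediate(partials[key], count_initial(value))
    return {k: count_final(p) for k, p in partials.items()}

print(group_count([("a", 10), ("b", 20), ("a", 30)]))  # {'a': 2, 'b': 1}
```

The shuffle saving comes entirely from running Intermediate before the shuffle boundary, which is what replacing globalRearrange with reduceBy achieves.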



[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-11-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032504#comment-15032504
 ] 

Xuefu Zhang commented on PIG-4709:
--

Thanks, [~pallavi.rao]. Great work! I posted a few comments, mostly cosmetic, 
on RB. This is a complex optimization, and I hope others can also take a look.

> Improve performance of GROUPBY operator on Spark
> 
>
> Key: PIG-4709
> URL: https://issues.apache.org/jira/browse/PIG-4709
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4709.patch
>
>
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the 
> grouped data is consumed by subsequent operations to perform algebraic 
> operations, this is sub-optimal because there is a lot of shuffle traffic. 
> The Spark plan must be optimized to use reduceBy, where possible, so that a 
> combiner is used.





[jira] [Commented] (PIG-4741) the value of $SPARK_DIST_CLASSPATH in pig file is invalid

2015-11-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15015199#comment-15015199
 ] 

Xuefu Zhang commented on PIG-4741:
--

[~kellyzly], your observation seems right; the "\" seems to be extraneous. 
[~sriksun], could you also take a quick look? Thanks.

> the value of $SPARK_DIST_CLASSPATH in pig file is invalid
> -
>
> Key: PIG-4741
> URL: https://issues.apache.org/jira/browse/PIG-4741
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4741.patch
>
>
> the value of 
> [$SPARK_DIST_CLASSPATH|https://github.com/apache/pig/blob/spark/bin/pig#L380] 
> in bin/pig is invalid
> {code}
> SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
> {code}
> there is no need to escape {{PWD}}. If we add the "\", the value of 
> SPARK_DIST_CLASSPATH will look like:
> {code}
>  
> ${PWD}/akka-actor_2.10-2.3.4-spark.jar:${PWD}/akka-remote_2.10-2.3.4-spark.jar
> {code}
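The effect of the extra backslash can be reproduced with a small script (bash invoked from Python; the jar name is just an example):

```python
import subprocess

def classpath_entry(escape_pwd: bool) -> str:
    """Build a SPARK_DIST_CLASSPATH entry the way bin/pig does (simplified)."""
    pwd_ref = r"\${PWD}" if escape_pwd else "${PWD}"
    script = 'f=/tmp/akka-actor_2.10-2.3.4-spark.jar; echo "%s/`basename $f`"' % pwd_ref
    return subprocess.run(["bash", "-c", script],
                          capture_output=True, text=True).stdout.strip()

# With the backslash, the shell keeps the literal text "${PWD}", which can
# never resolve to a real path; without it, PWD expands as intended.
print(classpath_entry(True))   # ${PWD}/akka-actor_2.10-2.3.4-spark.jar
print(classpath_entry(False))  # e.g. /home/user/akka-actor_2.10-2.3.4-spark.jar
```

This shows why the escaped form produces classpath entries that never match a real file.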





[jira] [Resolved] (PIG-4720) Spark related JARs are not included when importing project via IDE

2015-11-02 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4720.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda.

>  Spark related JARs are not included when importing project via IDE
> ---
>
> Key: PIG-4720
> URL: https://issues.apache.org/jira/browse/PIG-4720
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4720.patch
>
>
> It is a minor issue: Spark-related JARs are not included when importing the 
> project via an IDE.
> {code}
> $ ant -Dhadoopversion=23 eclipse-files 
> {code}
> Open the generated .classpath; the Spark-related JARs are not in the 
> classpathentry list. This is because the Spark JARs were moved to a new 
> directory (PIG-4667), but the eclipse-files target in build.xml was not updated.





[jira] [Resolved] (PIG-4661) Fix UT failures in TestPigServerLocal

2015-11-02 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4661.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda!

> Fix UT failures in TestPigServerLocal
> -
>
> Key: PIG-4661
> URL: https://issues.apache.org/jira/browse/PIG-4661
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4661.patch
>
>
> testcase 
> org.apache.pig.test.TestPigServerLocal.testSkipParseInRegisterForBatch failed 
> in spark mode





[jira] [Resolved] (PIG-4659) Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript

2015-11-02 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4659.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda!

> Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript
> --
>
> Key: PIG-4659
> URL: https://issues.apache.org/jira/browse/PIG-4659
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4659-2.patch, PIG-4659.patch
>
>
> Failed testcase: org.apache.pig.test.TestScriptLanguageJavaScript.testTC
> Error Message:
> can't evaluate main: main();
> Stacktrace
> java.lang.RuntimeException: can't evaluate main: main();
>   at 
> org.apache.pig.scripting.js.JsScriptEngine.jsEval(JsScriptEngine.java:135)
>   at 
> org.apache.pig.scripting.js.JsScriptEngine.main(JsScriptEngine.java:223)
>   at org.apache.pig.scripting.ScriptEngine.run(ScriptEngine.java:300)
>   at 
> org.apache.pig.test.TestScriptLanguageJavaScript.testTC(TestScriptLanguageJavaScript.java:149)
> Caused by: org.mozilla.javascript.EcmaError: TypeError: Cannot call method 
> "getNumberRecords" of null





[jira] [Resolved] (PIG-4655) Support InputStats in spark mode

2015-10-31 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4655.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda!

> Support InputStats in spark mode
> 
>
> Key: PIG-4655
> URL: https://issues.apache.org/jira/browse/PIG-4655
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4655-2.patch, PIG-4655-3.patch, PIG-4655-4.patch, 
> PIG-4655.patch
>
>
> Currently, InputStats is not implemented in spark mode. 
> The JUnit case TestPigRunner.testEmptyFileCounter() will fail.





[jira] [Resolved] (PIG-4634) Fix records count issues in output statistics

2015-10-30 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4634.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Xianda.

> Fix records count issues in output statistics
> -
>
> Key: PIG-4634
> URL: https://issues.apache.org/jira/browse/PIG-4634
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4634-3.patch, PIG-4634-4.patch, PIG-4634-5.patch, 
> PIG-4634-6.patch, PIG-4634.patch, PIG-4634_2.patch
>
>
> Test cases simpleTest() and simpleTest2() in TestPigRunner failed, caused by 
> the following issues:
> 1. The pig context in SparkPigStats isn't initialized.
> 2. The records count logic hasn't been implemented.
> 3. getOutputAlias(), getPigProperties(), getBytesWritten() and 
> getRecordWritten() have not been implemented.





[jira] [Updated] (PIG-4711) Tests in TestCombiner fail due to missing leveldb dependency

2015-10-27 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4711:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Reverted the old commit and committed the new patch (v1). Thanks, Pallavi.

> Tests in TestCombiner fail due to missing leveldb dependency
> 
>
> Key: PIG-4711
> URL: https://issues.apache.org/jira/browse/PIG-4711
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>Priority: Blocker
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4711-v1.patch, PIG-4711.patch
>
>
> Tests in TestCombiner use MiniYARNCluster which in turn has leveldb 
> dependencies.
> Currently, tests fail with Caused by: java.lang.ClassNotFoundException: 
> org.iq80.leveldb.DBException
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 43 more
> The leveldb dependency is included in trunk but is missing in this branch.





[jira] [Commented] (PIG-4698) Enable dynamic resource allocation/de-allocation on Yarn backends

2015-10-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975596#comment-14975596
 ] 

Xuefu Zhang commented on PIG-4698:
--

+1

> Enable dynamic resource allocation/de-allocation on Yarn backends
> -
>
> Key: PIG-4698
> URL: https://issues.apache.org/jira/browse/PIG-4698
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Affects Versions: spark-branch
>Reporter: Srikanth Sundarrajan
>Assignee: Srikanth Sundarrajan
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4698.patch
>
>
> Resource elasticity needs to be enabled on the Yarn backend so that jobs 
> can scale out and achieve better wall-clock execution times, while unused 
> resources are released back to the RM for other applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 39641: PIG-4698 Enable dynamic resource allocation/de-allocation on Yarn backends

2015-10-26 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39641/#review104134
---

Ship it!


Ship It!

- Xuefu Zhang


On Oct. 26, 2015, 7:28 a.m., Srikanth Sundarrajan wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39641/
> ---
> 
> (Updated Oct. 26, 2015, 7:28 a.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-4698
> https://issues.apache.org/jira/browse/PIG-4698
> 
> 
> Repository: pig-git
> 
> 
> Description
> ---
> 
> Resource elasticity needs to be enabled on the Yarn backend so that jobs 
> can scale out and achieve better wall-clock execution times, while unused 
> resources are released back to the RM for other applications.
> 
> 
> Diffs
> -
> 
>   src/docs/src/documentation/content/xdocs/start.xml eedd5b7 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java 
> b542013 
> 
> Diff: https://reviews.apache.org/r/39641/diff/
> 
> 
> Testing
> ---
> 
> Verified that the dynamic configuration is honoured by the Yarn system. 
> Requires the auxiliary shuffle service to be enabled at the node manager 
> and application level for this to work correctly.
> 
> 
> Thanks,
> 
> Srikanth Sundarrajan
> 
>
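The setup being exercised in the review above is Spark's standard dynamic-allocation configuration on YARN. A minimal sketch of the properties involved (the keys are Spark's and YARN's documented configuration names; the values are illustrative):

```properties
# Application-level Spark properties: enable dynamic executor allocation,
# which requires the external shuffle service so executor loss does not
# lose shuffle output.
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true

# Node-manager level (yarn-site.xml on each NodeManager): the Spark
# auxiliary shuffle service must be registered alongside MR shuffle.
#   yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService
```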



[jira] [Updated] (PIG-4711) Tests in TestCombiner fail due to missing leveldb dependency

2015-10-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4711:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Pallavi!

> Tests in TestCombiner fail due to missing leveldb dependency
> 
>
> Key: PIG-4711
> URL: https://issues.apache.org/jira/browse/PIG-4711
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4711.patch
>
>
> Tests in TestCombiner use MiniYARNCluster which in turn has leveldb 
> dependencies.
> Currently, tests fail with Caused by: java.lang.ClassNotFoundException: 
> org.iq80.leveldb.DBException
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 43 more
> The leveldb dependency is included in trunk but is missing in this branch.





[jira] [Commented] (PIG-4711) Tests in TestCombiner fail due to missing leveldb dependency

2015-10-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975628#comment-14975628
 ] 

Xuefu Zhang commented on PIG-4711:
--

+1

> Tests in TestCombiner fail due to missing leveldb dependency
> 
>
> Key: PIG-4711
> URL: https://issues.apache.org/jira/browse/PIG-4711
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4711.patch
>
>
> Tests in TestCombiner use MiniYARNCluster which in turn has leveldb 
> dependencies.
> Currently, tests fail with Caused by: java.lang.ClassNotFoundException: 
> org.iq80.leveldb.DBException
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 43 more
> The leveldb dependency is included in trunk but is missing in this branch.





[jira] [Updated] (PIG-4698) Enable dynamic resource allocation/de-allocation on Yarn backends

2015-10-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4698:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Srikanth!

> Enable dynamic resource allocation/de-allocation on Yarn backends
> -
>
> Key: PIG-4698
> URL: https://issues.apache.org/jira/browse/PIG-4698
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Affects Versions: spark-branch
>Reporter: Srikanth Sundarrajan
>Assignee: Srikanth Sundarrajan
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4698.patch
>
>
> Resource elasticity needs to be enabled on the Yarn backend so that jobs 
> can scale out and achieve better wall-clock execution times, while unused 
> resources are released back to the RM for other applications.




