[jira] [Commented] (PIG-5305) Enable yarn-client mode execution of tests in Spark (1) mode

2017-10-05 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192995#comment-16192995
 ] 

liyunzhang_intel commented on PIG-5305:
---

[~szita]: sorry for the late reply; I was out of office this week.
for the patch: +1.

> Enable yarn-client mode execution of tests in Spark (1) mode
> 
>
> Key: PIG-5305
> URL: https://issues.apache.org/jira/browse/PIG-5305
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5305.0.patch, PIG-5305.1.patch, PIG-5305.2.patch
>
>
> See parent jira (PIG-5305) for problem description





[jira] [Commented] (PIG-5305) Enable yarn-client mode execution of tests in Spark (1) mode

2017-09-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178450#comment-16178450
 ] 

liyunzhang_intel commented on PIG-5305:
---

[~szita]:
sorry for the late reply.
One thing I was confused about: the unit test TestEvalPipeline passes in tez mode without this patch, run with the command
{code}
ant -v -Dtest.junit.output.format=xml -Dtestcase=TestEvalPipeline -Dexectype=tez -Dhadoopversion=2 test-tez
{code}
code base: 7399a1c
Earlier you mentioned that some unit tests failed with {{test-tez}}. So is there something wrong with my env?
Patch looks good but please confirm this, thanks!

> Enable yarn-client mode execution of tests in Spark (1) mode
> 
>
> Key: PIG-5305
> URL: https://issues.apache.org/jira/browse/PIG-5305
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5305.0.patch, PIG-5305.1.patch, PIG-5305.2.patch
>
>
> See parent jira (PIG-5305) for problem description





[jira] [Comment Edited] (PIG-5305) Enable yarn-client mode execution of tests in Spark (1) mode

2017-09-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172741#comment-16172741
 ] 

liyunzhang_intel edited comment on PIG-5305 at 9/20/17 4:48 AM:


{quote}
I also checked, test-tez was not running properly since the Spark 2 support 
commit, because setTezEnv was clearing the excluded sources property. I fixed 
this in my latest patch as well.
{quote}
Do you mean that we added {{jar-simple}} to the dependencies of {{test-tez}} in PIG-5157, but {{setTezEnv}} resets {{src.exclude.dir}}, and this influences the {{jar}} target, which uses the property {{src.exclude.dir}}?
{code}
(build.xml snippet; the XML tags were stripped by the mail archiver)
Compiling against Spark 2
...
Compiling against Spark 1
...
{code}

If my understanding is right, why does the TestEvalPipeline unit test pass in tez mode before this jira? It seems there were no unit test failures in tez mode before this jira.



was (Author: kellyzly):
{quote}
I also checked, test-tez was not running properly since the Spark 2 support 
commit, because setTezEnv was clearing the excluded sources property. I fixed 
this in my latest patch as well.
{quote}
Do you mean that we added {{jar-simple}} to the dependencies of {{test-tez}} in PIG-5157, but {{setTezEnv}} resets {{src.exclude.dir}}, and this influences the {{jar}} target, which uses the property {{src.exclude.dir}}?
{code}
(build.xml snippet; the XML tags were stripped by the mail archiver)
Compiling against Spark 2
...
Compiling against Spark 1
...
{code}

If my understanding is right, why does the TestEvalPipeline unit test pass in tez mode before this jira?


> Enable yarn-client mode execution of tests in Spark (1) mode
> 
>
> Key: PIG-5305
> URL: https://issues.apache.org/jira/browse/PIG-5305
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5305.0.patch, PIG-5305.1.patch, PIG-5305.2.patch
>
>
> See parent jira (PIG-5305) for problem description





[jira] [Comment Edited] (PIG-5305) Enable yarn-client mode execution of tests in Spark (1) mode

2017-09-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172741#comment-16172741
 ] 

liyunzhang_intel edited comment on PIG-5305 at 9/20/17 4:47 AM:


{quote}
I also checked, test-tez was not running properly since the Spark 2 support 
commit, because setTezEnv was clearing the excluded sources property. I fixed 
this in my latest patch as well.
{quote}
Do you mean that we added {{jar-simple}} to the dependencies of {{test-tez}} in PIG-5157, but {{setTezEnv}} resets {{src.exclude.dir}}, and this influences the {{jar}} target, which uses the property {{src.exclude.dir}}?
{code}
(build.xml snippet; the XML tags were stripped by the mail archiver)
Compiling against Spark 2
...
Compiling against Spark 1
...
{code}

If my understanding is right, why does the TestEvalPipeline unit test pass in tez mode before this jira?



was (Author: kellyzly):
{quote}
I also checked, test-tez was not running properly since the Spark 2 support 
commit, because setTezEnv was clearing the excluded sources property. I fixed 
this in my latest patch as well.
{quote}
Do you mean that we added {{jar-simple}} to the dependencies of {{test-tez}} in PIG-5157, but {{setTezEnv}} resets {{src.exclude.dir}}, and this influences the {{jar}} target, which uses the property {{src.exclude.dir}}?
{code}
(build.xml snippet; the XML tags were stripped by the mail archiver)
Compiling against Spark 2
...
Compiling against Spark 1
...
{code}


> Enable yarn-client mode execution of tests in Spark (1) mode
> 
>
> Key: PIG-5305
> URL: https://issues.apache.org/jira/browse/PIG-5305
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5305.0.patch, PIG-5305.1.patch, PIG-5305.2.patch
>
>
> See parent jira (PIG-5305) for problem description





[jira] [Commented] (PIG-5305) Enable yarn-client mode execution of tests in Spark (1) mode

2017-09-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172741#comment-16172741
 ] 

liyunzhang_intel commented on PIG-5305:
---

{quote}
I also checked, test-tez was not running properly since the Spark 2 support 
commit, because setTezEnv was clearing the excluded sources property. I fixed 
this in my latest patch as well.
{quote}
Do you mean that we added {{jar-simple}} to the dependencies of {{test-tez}} in PIG-5157, but {{setTezEnv}} resets {{src.exclude.dir}}, and this influences the {{jar}} target, which uses the property {{src.exclude.dir}}?
{code}
(build.xml snippet; the XML tags were stripped by the mail archiver)
Compiling against Spark 2
...
Compiling against Spark 1
...
{code}


> Enable yarn-client mode execution of tests in Spark (1) mode
> 
>
> Key: PIG-5305
> URL: https://issues.apache.org/jira/browse/PIG-5305
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5305.0.patch, PIG-5305.1.patch, PIG-5305.2.patch
>
>
> See parent jira (PIG-5305) for problem description





[jira] [Commented] (PIG-5305) Enable yarn-client mode execution of tests in Spark (1) mode

2017-09-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170981#comment-16170981
 ] 

liyunzhang_intel commented on PIG-5305:
---

[~szita]:
1. {code}
(build.xml snippet; the XML was stripped by the mail archiver)
{code}
why is pigtest-jar needed in test-tez?
2. are there any unit test failures if SPARK_MASTER is converted from "local" to "yarn-client"?

> Enable yarn-client mode execution of tests in Spark (1) mode
> 
>
> Key: PIG-5305
> URL: https://issues.apache.org/jira/browse/PIG-5305
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5305.0.patch, PIG-5305.1.patch
>
>
> See parent jira (PIG-5305) for problem description





[jira] [Comment Edited] (PIG-5305) Enable yarn-client mode execution of tests in Spark (1) mode

2017-09-14 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167284#comment-16167284
 ] 

liyunzhang_intel edited comment on PIG-5305 at 9/15/17 3:49 AM:


[~szita]: several suggestions
1. Can we add {{pigtest-jar}} only to the {{test-spark}} target in build.xml? I guess {{test-tez}} has no need for {{pigtest-jar}}.
Meanwhile, there is no need to add {{jar-simple}} to the dependencies of {{test-tez}}, as the dependencies of {{compile-test}} already include {{jar-simple}}. If my understanding is not right, tell me.
2. Please add the comment {{added feature to re-initialize SparkContext when switching between cluster and local mode PigServers}} on the related code.

Besides, are there any unit test failures if {{SPARK_MASTER}} is converted from "local" to "yarn-client"?


was (Author: kellyzly):
[~szita]: several suggestions
1. Can we add {{pigtest-jar}} only to the {{test-spark}} target in build.xml? I guess {{test-tez}} has no need for {{pigtest-jar}}.
Meanwhile, there is no need to add {{jar-simple}} to the dependencies of {{test-tez}}, as the dependencies of {{compile-test}} already include {{jar-simple}}. If my understanding is not right, tell me.
2. Please add the comment {{added feature to re-initialize SparkContext when switching between cluster and local mode PigServers}} on the related code.

Besides, are there any unit test failures if {{SPARK_MASTER}} is converted from "local" to "yarn-client"?

> Enable yarn-client mode execution of tests in Spark (1) mode
> 
>
> Key: PIG-5305
> URL: https://issues.apache.org/jira/browse/PIG-5305
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5305.0.patch
>
>
> See parent jira (PIG-5305) for problem description





[jira] [Commented] (PIG-5277) Spark mode is writing nulls among tuples to the output

2017-08-11 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123129#comment-16123129
 ] 

liyunzhang_intel commented on PIG-5277:
---

[~szita]: let's use #2 and leave this for further investigation; first let all unit tests pass in spark mode.

> Spark mode is writing nulls among tuples to the output 
> ---
>
> Key: PIG-5277
> URL: https://issues.apache.org/jira/browse/PIG-5277
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
>
> After committing PIG-3655 a couple of Spark mode tests (e.g. 
> org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct) started 
> failing on:
> {code}
> java.lang.Error: java.io.IOException: Corrupt data file, expected tuple type 
> byte, but seen 27
>   at 
> org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:122)
>   at 
> org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct(TestEvalPipeline.java:1052)
> Caused by: java.io.IOException: Corrupt data file, expected tuple type byte, 
> but seen 27
>   at 
> org.apache.pig.impl.io.InterRecordReader.readDataOrEOF(InterRecordReader.java:158)
>   at 
> org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:194)
>   at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:79)
>   at 
> org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:238)
>   at 
> org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:218)
>   at 
> org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:115)
> {code}
> This is because InterRecordReader became much stricter after PIG-3655. Before 
> it just simply skipped these bytes thinking that they are just garbage on the 
> split beginning. Now when we expect a [proper tuple with a tuple type 
> byte|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/InterRecordReader.java#L153]
>  we see these nulls and throw an Exception.
> As I can see it this is happening because JoinGroupSparkConverter has to 
> return something even when it shouldn't.
> When the POPackage operator returns a 
> [POStatus.STATUS_NULL|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/JoinGroupSparkConverter.java#L211],
>  the converter shouldn't return a thing, but it can't do better than 
> returning a null. This then gets written out by Spark..





[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-09 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119488#comment-16119488
 ] 

liyunzhang_intel commented on PIG-5283:
---

[~szita]: thanks for your explanation. +1

> Configuration is not passed to SparkPigSplits on the backend
> 
>
> Key: PIG-5283
> URL: https://issues.apache.org/jira/browse/PIG-5283
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5283.0.patch, PIG-5283.1.patch
>
>
> When a Hadoop ObjectWritable is created during a Spark job, the instantiated 
> PigSplit (wrapped into a SparkPigSplit) is given an empty Configuration 
> instance.
> This happens 
> [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala#L44]





[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-08 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119287#comment-16119287
 ] 

liyunzhang_intel commented on PIG-5283:
---

[~szita]: I can understand why {{CommonConfigurationKeys.IO_SERIALIZATIONS_KEY}} needs to be set. Why does {{PigConfiguration.PIG_COMPRESS_INPUT_SPLITS}} need to be set in the configuration?

> Configuration is not passed to SparkPigSplits on the backend
> 
>
> Key: PIG-5283
> URL: https://issues.apache.org/jira/browse/PIG-5283
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5283.0.patch, PIG-5283.1.patch
>
>
> When a Hadoop ObjectWritable is created during a Spark job, the instantiated 
> PigSplit (wrapped into a SparkPigSplit) is given an empty Configuration 
> instance.
> This happens 
> [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala#L44]





[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-08 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118075#comment-16118075
 ] 

liyunzhang_intel commented on PIG-5283:
---

[~szita]:  
{quote}
My only question is that if we should only write those properties that are 
required for a PigSplit instead of writing the full jobConf (6-700 entries) for 
optimization.

{quote}

We don't need to initialize all the items; it is OK to initialize just a few items to make it work. Will PigInputFormatSpark#createRecordReader initialize all the items after bypassing the current issue?
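
For reference, a minimal sketch of the "write only the required properties" idea discussed above, with hypothetical names (this is not the actual PIG-5283 patch):
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

// Sketch only: serialize a whitelist of config entries alongside the split
// instead of the full jobConf (6-700 entries). The class and the key list
// below are illustrative assumptions.
class SplitConfCarrier {
    private static final String[] REQUIRED_KEYS = {
        "io.serializations",        // CommonConfigurationKeys.IO_SERIALIZATIONS_KEY
        "pig.compress.input.splits" // assumed literal for PigConfiguration.PIG_COMPRESS_INPUT_SPLITS
    };

    void write(DataOutput out, Configuration conf) throws IOException {
        out.writeInt(REQUIRED_KEYS.length);
        for (String key : REQUIRED_KEYS) {
            out.writeUTF(key);
            out.writeUTF(conf.get(key, "")); // empty string if unset
        }
    }

    void readFields(DataInput in, Configuration conf) throws IOException {
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            conf.set(in.readUTF(), in.readUTF()); // restore on the backend
        }
    }
}
{code}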

> Configuration is not passed to SparkPigSplits on the backend
> 
>
> Key: PIG-5283
> URL: https://issues.apache.org/jira/browse/PIG-5283
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5283.0.patch
>
>
> When a Hadoop ObjectWritable is created during a Spark job, the instantiated 
> PigSplit (wrapped into a SparkPigSplit) is given an empty Configuration 
> instance.
> This happens 
> [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala#L44]





[jira] [Commented] (PIG-5277) Spark mode is writing nulls among tuples to the output

2017-08-03 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113820#comment-16113820
 ] 

liyunzhang_intel commented on PIG-5277:
---

[~szita]: give me some time to investigate #4. If I still cannot solve the problem you mentioned in #4 by the end of next Friday (2017-08-11), let's directly use #2 to solve the unit test failures.

> Spark mode is writing nulls among tuples to the output 
> ---
>
> Key: PIG-5277
> URL: https://issues.apache.org/jira/browse/PIG-5277
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
>
> After committing PIG-3655 a couple of Spark mode tests (e.g. 
> org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct) started 
> failing on:
> {code}
> java.lang.Error: java.io.IOException: Corrupt data file, expected tuple type 
> byte, but seen 27
>   at 
> org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:122)
>   at 
> org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct(TestEvalPipeline.java:1052)
> Caused by: java.io.IOException: Corrupt data file, expected tuple type byte, 
> but seen 27
>   at 
> org.apache.pig.impl.io.InterRecordReader.readDataOrEOF(InterRecordReader.java:158)
>   at 
> org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:194)
>   at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:79)
>   at 
> org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:238)
>   at 
> org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:218)
>   at 
> org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:115)
> {code}
> This is because InterRecordReader became much stricter after PIG-3655. Before 
> it just simply skipped these bytes thinking that they are just garbage on the 
> split beginning. Now when we expect a [proper tuple with a tuple type 
> byte|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/InterRecordReader.java#L153]
>  we see these nulls and throw an Exception.
> As I can see it this is happening because JoinGroupSparkConverter has to 
> return something even when it shouldn't.
> When the POPackage operator returns a 
> [POStatus.STATUS_NULL|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/JoinGroupSparkConverter.java#L211],
>  the converter shouldn't return a thing, but it can't do better than 
> returning a null. This then gets written out by Spark..





[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-03 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113817#comment-16113817
 ] 

liyunzhang_intel commented on PIG-5283:
---

[~nkollar] and [~szita]: Can we set the correct value of CommonConfigurationKeys.IO_SERIALIZATIONS_KEY in Pig on Spark to avoid the problem?
If we cannot, +1 for the patch.

> Configuration is not passed to SparkPigSplits on the backend
> 
>
> Key: PIG-5283
> URL: https://issues.apache.org/jira/browse/PIG-5283
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5283.0.patch
>
>
> When a Hadoop ObjectWritable is created during a Spark job, the instantiated 
> PigSplit (wrapped into a SparkPigSplit) is given an empty Configuration 
> instance.
> This happens 
> [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala#L44]





[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-03 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112449#comment-16112449
 ] 

liyunzhang_intel commented on PIG-5283:
---

[~szita]: In 
[PigInputFormatSpark#createRecordReader|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/running/PigInputFormatSpark.java#L59], the pigSplit configuration is initialized. In my understanding, the configuration of pigSplit is initialized correctly there, so did you hit a case where pigSplit's configuration was initialized invalidly? If yes, can you provide a simple script that shows the problem?


> Configuration is not passed to SparkPigSplits on the backend
> 
>
> Key: PIG-5283
> URL: https://issues.apache.org/jira/browse/PIG-5283
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: PIG-5283.0.patch
>
>
> When a Hadoop ObjectWritable is created during a Spark job, the instantiated 
> PigSplit (wrapped into a SparkPigSplit) is given an empty Configuration 
> instance.
> This happens 
> [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala#L44]





[jira] [Commented] (PIG-5276) building "jar" should not call "clean"

2017-08-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109250#comment-16109250
 ] 

liyunzhang_intel commented on PIG-5276:
---

[~nkollar]: +1

> building "jar" should not call "clean"
> --
>
> Key: PIG-5276
> URL: https://issues.apache.org/jira/browse/PIG-5276
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Koji Noguchi
>Assignee: Nandor Kollar
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: PIG-5276_1.patch
>
>
> When adding spark 2 in PIG-5157, we started calling "clean"  from inside 
> "jar" target.  
> To me, "jar" action should be limited to archiving classes.
> For example, when I run 
> % ant javadoc
> % ant jar
> I should not see javadoc gone after the second line.





[jira] [Comment Edited] (PIG-5276) building "jar" should not call "clean"

2017-07-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106927#comment-16106927
 ] 

liyunzhang_intel edited comment on PIG-5276 at 7/31/17 7:50 AM:


[~nkollar]:
when upgrading to spark2, do we need to delete the two files you mentioned?


was (Author: kellyzly):
[~nkollar]:
when upgrading to spark2, do we need to delete these two files: {{${docs.dir}/build}} and {{${jdiff.xml.dir}/${name}_${version}.xml}}?

> building "jar" should not call "clean"
> --
>
> Key: PIG-5276
> URL: https://issues.apache.org/jira/browse/PIG-5276
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Koji Noguchi
>Priority: Minor
>
> When adding spark 2 in PIG-5157, we started calling "clean"  from inside 
> "jar" target.  
> To me, "jar" action should be limited to archiving classes.
> For example, when I run 
> % ant javadoc
> % ant jar
> I should not see javadoc gone after the second line.





[jira] [Commented] (PIG-5276) building "jar" should not call "clean"

2017-07-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106927#comment-16106927
 ] 

liyunzhang_intel commented on PIG-5276:
---

[~nkollar]:
when upgrading to spark2, do we need to delete these two files: {{${docs.dir}/build}} and {{${jdiff.xml.dir}/${name}_${version}.xml}}?

> building "jar" should not call "clean"
> --
>
> Key: PIG-5276
> URL: https://issues.apache.org/jira/browse/PIG-5276
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Koji Noguchi
>Priority: Minor
>
> When adding spark 2 in PIG-5157, we started calling "clean"  from inside 
> "jar" target.  
> To me, "jar" action should be limited to archiving classes.
> For example, when I run 
> % ant javadoc
> % ant jar
> I should not see javadoc gone after the second line.





[jira] [Commented] (PIG-5276) building "jar" should not call "clean"

2017-07-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106725#comment-16106725
 ] 

liyunzhang_intel commented on PIG-5276:
---

[~nkollar]: can we avoid calling clean before jar when upgrading to spark2 in PIG-5157?

> building "jar" should not call "clean"
> --
>
> Key: PIG-5276
> URL: https://issues.apache.org/jira/browse/PIG-5276
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Koji Noguchi
>Priority: Minor
>
> When adding spark 2 in PIG-5157, we started calling "clean"  from inside 
> "jar" target.  
> To me, "jar" action should be limited to archiving classes.
> For example, when I run 
> % ant javadoc
> % ant jar
> I should not see javadoc gone after the second line.





[jira] [Comment Edited] (PIG-3655) BinStorage and InterStorage approach to record markers is broken

2017-07-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102348#comment-16102348
 ] 

liyunzhang_intel edited comment on PIG-3655 at 7/26/17 10:01 PM:
-

[~szita]: can you provide a simple script with which I can reproduce the error? Sorry, I have not read all the comments, so maybe my understanding is not right.

{quote}
it seems like Spark is writing some NULLs after the last record
{quote}

Does this only happen in this case, or in all cases in spark mode? If only in this case, can you provide the script? Thanks!


was (Author: kellyzly):
[~szita]: can you provide a simple script with which I can reproduce the error? Sorry, I have not read all the comments, so maybe my understanding is not right.

{quote}
it seems like Spark is writing some NULLs after the last record
{quote}

Does this only happen in this case, or in all cases in spark mode?

> BinStorage and InterStorage approach to record markers is broken
> 
>
> Key: PIG-3655
> URL: https://issues.apache.org/jira/browse/PIG-3655
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 
> 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>Reporter: Jeff Plaisance
>Assignee: Adam Szita
> Fix For: 0.18.0
>
> Attachments: PIG-3655.0.patch, PIG-3655.1.patch, PIG-3655.2.patch, 
> PIG-3655.3.patch, PIG-3655.4.patch, PIG-3655.5.patch, 
> PIG-3655.sparkNulls.2.patch, PIG-3655.sparkNulls.patch
>
>
> The way that the record readers for these storage formats seek to the first 
> record in an input split is to find the byte sequence 1 2 3 110 for 
> BinStorage or 1 2 3 19-21|28-30|36-45 for InterStorage. If this sequence 
> occurs in the data for any reason (for example the integer 16909166 stored 
> big endian encodes to the byte sequence for BinStorage) other than to mark 
> the start of a tuple it can cause mysterious failures in pig jobs because the 
> record reader will try to decode garbage and fail.
> For this approach of using an unlikely sequence to mark record boundaries, it 
> is important to reduce the probability of the sequence occuring naturally in 
> the data by ensuring that your record marker is sufficiently long. Hadoop 
> SequenceFile uses 128 bits for this and randomly generates the sequence for 
> each file (selecting a fixed, predetermined value opens up the possibility of 
> a mean person intentionally sending you that value). This makes it extremely 
> unlikely that collisions will occur. In the long run I think that pig should 
> also be doing this.
> As a quick fix it might be good to save the current position in the file 
> before entering readDatum, and if an exception is thrown seek back to the 
> saved position and resume trying to find the next record marker.
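
As a quick illustration of the seek-back fix described above (and of the marker collision: 16909166 is 0x0102036E, i.e. exactly the bytes 1 2 3 110), here is a minimal sketch; the helpers are hypothetical, not the actual InterRecordReader code:
{code}
// Sketch only: remember the stream position before decoding a datum; on
// failure, rewind and resume scanning for the next record marker.
// "in" is assumed to be a seekable stream (getPos()/seek()); readDatum()
// and skipToNextRecordMarker() are hypothetical helpers.
long savedPos = in.getPos();
try {
    return readDatum(in);              // may fail on a false record marker
} catch (IOException e) {
    in.seek(savedPos);                 // rewind past the bogus marker bytes
    return skipToNextRecordMarker(in); // resume the scan from there
}
{code}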





[jira] [Commented] (PIG-3655) BinStorage and InterStorage approach to record markers is broken

2017-07-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102348#comment-16102348
 ] 

liyunzhang_intel commented on PIG-3655:
---

[~szita]: can you provide a simple script with which I can reproduce the error? Sorry, I have not read all the comments, so maybe my understanding is not right.

{quote}
it seems like Spark is writing some NULLs after the last record
{quote}

Does this only happen in this case, or in all cases in spark mode?

> BinStorage and InterStorage approach to record markers is broken
> 
>
> Key: PIG-3655
> URL: https://issues.apache.org/jira/browse/PIG-3655
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 
> 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>Reporter: Jeff Plaisance
>Assignee: Adam Szita
> Fix For: 0.18.0
>
> Attachments: PIG-3655.0.patch, PIG-3655.1.patch, PIG-3655.2.patch, 
> PIG-3655.3.patch, PIG-3655.4.patch, PIG-3655.5.patch, 
> PIG-3655.sparkNulls.2.patch, PIG-3655.sparkNulls.patch
>
>
> The way that the record readers for these storage formats seek to the first 
> record in an input split is to find the byte sequence 1 2 3 110 for 
> BinStorage or 1 2 3 19-21|28-30|36-45 for InterStorage. If this sequence 
> occurs in the data for any reason (for example the integer 16909166 stored 
> big endian encodes to the byte sequence for BinStorage) other than to mark 
> the start of a tuple it can cause mysterious failures in pig jobs because the 
> record reader will try to decode garbage and fail.
> For this approach of using an unlikely sequence to mark record boundaries, it 
> is important to reduce the probability of the sequence occuring naturally in 
> the data by ensuring that your record marker is sufficiently long. Hadoop 
> SequenceFile uses 128 bits for this and randomly generates the sequence for 
> each file (selecting a fixed, predetermined value opens up the possibility of 
> a mean person intentionally sending you that value). This makes it extremely 
> unlikely that collisions will occur. In the long run I think that pig should 
> also be doing this.
> As a quick fix it might be good to save the current position in the file 
> before entering readDatum, and if an exception is thrown seek back to the 
> saved position and resume trying to find the next record marker.





[jira] [Commented] (PIG-3655) BinStorage and InterStorage approach to record markers is broken

2017-07-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101355#comment-16101355
 ] 

liyunzhang_intel commented on PIG-3655:
---

[~szita]: Can you add a filter to remove the empty tuples from the rdd in JoinGroupSparkConverter?
from
{code}
return rdd.toJavaRDD().map(new GroupPkgFunction(pkgOp)).rdd();
{code}
to
{code}
return rdd.toJavaRDD().map(new GroupPkgFunction(pkgOp)).filter(
        new Function<Tuple, Boolean>() {
            @Override
            public Boolean call(Tuple objects) throws Exception {
                // drop the nulls emitted for POStatus.STATUS_NULL
                return objects != null;
            }
        }).rdd();
{code}

I have not fully compiled or tested the above code; it is just for reference.
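
(Side note, an assumption rather than part of the suggestion: with Java 8, the anonymous class above could presumably be shortened to a lambda such as {{.filter(t -> t != null)}}, since Spark's Java {{Function}} is a functional interface.)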

> BinStorage and InterStorage approach to record markers is broken
> 
>
> Key: PIG-3655
> URL: https://issues.apache.org/jira/browse/PIG-3655
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 
> 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>Reporter: Jeff Plaisance
>Assignee: Adam Szita
> Fix For: 0.18.0
>
> Attachments: PIG-3655.0.patch, PIG-3655.1.patch, PIG-3655.2.patch, 
> PIG-3655.3.patch, PIG-3655.4.patch, PIG-3655.5.patch, 
> PIG-3655.sparkNulls.2.patch, PIG-3655.sparkNulls.patch
>
>
> The way that the record readers for these storage formats seek to the first 
> record in an input split is to find the byte sequence 1 2 3 110 for 
> BinStorage or 1 2 3 19-21|28-30|36-45 for InterStorage. If this sequence 
> occurs in the data for any reason (for example the integer 16909166 stored 
> big endian encodes to the byte sequence for BinStorage) other than to mark 
> the start of a tuple it can cause mysterious failures in pig jobs because the 
> record reader will try to decode garbage and fail.
> For this approach of using an unlikely sequence to mark record boundaries, it 
> is important to reduce the probability of the sequence occuring naturally in 
> the data by ensuring that your record marker is sufficiently long. Hadoop 
> SequenceFile uses 128 bits for this and randomly generates the sequence for 
> each file (selecting a fixed, predetermined value opens up the possibility of 
> a mean person intentionally sending you that value). This makes it extremely 
> unlikely that collisions will occur. In the long run I think that pig should 
> also be doing this.
> As a quick fix it might be good to save the current position in the file 
> before entering readDatum, and if an exception is thrown seek back to the 
> saved position and resume trying to find the next record marker.





[jira] [Updated] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-07-24 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5246:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.3.patch, PIG-5246_4.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it





[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-07-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099380#comment-16099380
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~nkollar]: committed to trunk. Thanks to [~nkollar], [~rohini], [~jeffzhang] for reviewing.

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.3.patch, PIG-5246_4.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097845#comment-16097845
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~szita]: 
{quote}
it looks like you've missed adding an entry to CHANGES.txt upon commit. I've 
added it now: 
{quote}
Thanks for catching that.

[~szita] or [~nkollar]: please spend some time reviewing PIG-5246 if you have time, thanks!

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157_15.patch, PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-07-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094066#comment-16094066
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~nkollar]: as PIG-5157 is resolved, please help review PIG-5246_3.patch so that users can use Pig on Spark with Spark 2. Thanks!

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.3.patch, PIG-5246_4.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it





[jira] [Updated] (PIG-5157) Upgrade to Spark 2.0

2017-07-18 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5157:
--
Attachment: PIG-5157_15.patch

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157_15.patch, PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Updated] (PIG-5157) Upgrade to Spark 2.0

2017-07-18 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5157:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157_15.patch, PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092460#comment-16092460
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: committed PIG-5157_15.patch to trunk. Thanks for your development work, as upgrading to spark2 is a big feature, and thanks also to [~szita] and [~rohini] for the review.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5080) Support store alias as spark table

2017-07-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091220#comment-16091220
 ] 

liyunzhang_intel commented on PIG-5080:
---

[~jeffzhang]: thanks for your patch. What's the benefit of storing a pig alias as a spark temporary table? To use the result of a pig script in another spark engine like spark sql?
If we use dataframe to replace rdd, what's the benefit? And can you show the detailed performance improvement in some benchmark test?
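
For context, a rough sketch of what "store alias as spark temporary table" could look like with the Spark 1.x Java API; the class, method, and table name below are hypothetical, not the actual PIG-5080 patch:
{code}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.StructType;

class StoreAsTempTableSketch {
    // Sketch only: register an RDD-backed DataFrame as a temporary table so
    // another engine sharing the same SparkContext can query it.
    static void store(SQLContext sqlContext, JavaRDD<Row> rowRdd, StructType schema) {
        DataFrame df = sqlContext.createDataFrame(rowRdd, schema);
        df.registerTempTable("pig_alias_D"); // hypothetical table name
        // a co-located engine in the same JVM could then run:
        // sqlContext.sql("SELECT * FROM pig_alias_D");
    }
}
{code}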

> Support store alias as spark table
> --
>
> Key: PIG-5080
> URL: https://issues.apache.org/jira/browse/PIG-5080
> Project: Pig
>  Issue Type: New Feature
>  Components: spark
>Affects Versions: 0.16.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.17.1
>
> Attachments: PIG-5080-1.patch, PIG-5080-2.patch
>
>
> The purpose is that I'd like to take advantage of both pig and hive. 
> Pig-latin has powerful data flow expression ability which is useful for ETL 
> while hive is good at query. 
> The scenario is that I'd like to store pig alias as spark temporary table 
> (cache can be optional). And I have an another spark engine which share the 
> same SparkContext (in the same JVM) to query the table.
> Please close this ticket if it is already supported. I didn't go through all 
> the features of pig-spark.





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-17 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089381#comment-16089381
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: left a comment on review board, just a small fix! Meanwhile, please help review PIG-5246, thanks!

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-07-16 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089217#comment-16089217
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~nkollar]: the problem with the basic script is solved; it passed in yarn-client mode. Now I'm running all unit tests in local mode.

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.3.patch, PIG-5246_4.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it





[jira] [Updated] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-07-14 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5246:
--
Attachment: PIG-5246_4.patch

[~nkollar]:
changes in PIG-5246_4.patch:
{code}
CLASSPATH=${CLASSPATH}:${SPARK_HOME}/lib/spark-assembly*
{code}
to
{code}
SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
{code}

We cannot use a wildcard to locate the spark-assembly jar (presumably because the JVM only expands a bare {{*}} classpath entry, not a partial glob like {{spark-assembly*}}).
After all unit tests pass on my local jenkins, I will close PIG-5157.

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.3.patch, PIG-5246_4.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it





[jira] [Updated] (PIG-5157) Upgrade to Spark 2.0

2017-07-13 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5157:
--
Attachment: SkewedJoinInput2.txt
SkewedJoinInput1.txt

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-13 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085443#comment-16085443
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: attached SkewedJoinInput1.txt and SkewedJoinInput2.txt, which I used in testJoin.pig. Please check whether there is a similar error in your env.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-09 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079868#comment-16079868
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: sorry for the late reply.

Here is the result after solving the exception I mentioned last time, in spark1 with yarn-client mode:
{noformat}
export SPARK_JAR=hdfs://zly1.sh.intel.com:8020/user/root/spark-assembly-1.6.1-hadoop2.6.0.jar
export SPARK_HOME=$SPARK161 # download spark 1.6.1
export HADOOP_USER_CLASSPATH_FIRST="true"
$PIG_HOME/bin/pig -x spark  $PIG_HOME/bin/testJoin.pig
{noformat}

pig.properties
{noformat}
pig.sort.readonce.loadfuncs=org.apache.pig.backend.hadoop.hbase.HBaseStorage,org.apache.pig.backend.hadoop.accumulo.AccumuloStorage
spark.master=yarn-client
{noformat}

testJoin.pig
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name) parallel 10; 
store D into './testJoin.out';
{code} 

the script fails to generate a result, and the exception found in the log is:
{noformat}

[task-result-getter-0] 2017-07-10 12:16:45,667 WARN  scheduler.TaskSetManager 
(Logging.scala:logWarning(70)) - Lost task 0.0 in stage 0.0 (TID 0, 
zly1.sh.intel.com): java.lang.IllegalStateException: unread block data
at 
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{noformat}

can you verify whether the same problem occurs on your cluster in yarn-client mode (on my cluster it passed in local mode but failed in yarn-client mode)? The error looks like a datanode problem, but I verified the environment with the spark branch code and it passed, so I guess the problem is caused by the patch.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069728#comment-16069728
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: with PIG-5157_13.patch on review board,
the test result:
passes on spark1 and spark2 in local mode;
fails in yarn-client mode (with spark.master=yarn-client added in conf/pig.properties).
Exception message found in the log:
{noformat}
[shuffle-server-0] 2017-06-30 14:24:25,501 WARN  server.TransportChannelHandler 
(TransportChannelHandler.java:exceptionCaught(79)) - Exception in connection 
from /10.239.47.58:58214
java.lang.NoSuchMethodError: 
org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;
{noformat}

If you cannot reproduce it in your env, tell me as well; I will check whether it is caused by configuration or not.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062805#comment-16062805
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: with PIG-5157_11.patch, the TestGrunt unit test passes, but there is a problem in the yarn-cluster env; I will investigate more.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Comment Edited] (PIG-5157) Upgrade to Spark 2.0

2017-06-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060574#comment-16060574
 ] 

liyunzhang_intel edited comment on PIG-5157 at 6/23/17 8:29 AM:


[~nkollar]: looks good, but I met some problems when testing in local and 
yarn-client mode; give me more time to verify whether the problem is caused by the 
configuration or something else. thanks!
after applying this patch, the result of the unit test
{code}
 ant  -Dtest.junit.output.format=xml clean  -Dtestcase=TestGrunt  
-Dexectype=spark  -Dhadoopversion=2  test

{code}
the result:
{noformat}
Tests run: 67, Failures: 1, Errors: 5, Skipped: 4, Time elapsed: 138.459 sec

{noformat}

I will investigate the reason in my env but can you verify it in your env?


was (Author: kellyzly):
[~nkollar]:looks good, but met some problem when testing in local and 
yarn-client, give me more time to verify the problem is caused by the 
configuration or others. thanks!
after using this patch,  the result of unit test
{code}
 ant  -Dtest.junit.output.format=xml clean  -Dtestcase=TestGrunt  
-Dexectype=spark  -Dhadoopversion=2  test

{code}
the result:
{noformat}
Tests run: 67, Failures: 1, Errors: 5, Skipped: 4, Time elapsed: 138.459 sec

{noformat}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060574#comment-16060574
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: looks good, but I met some problems when testing in local and 
yarn-client mode; give me more time to verify whether the problem is caused by the 
configuration or something else. thanks!
after applying this patch, the result of the unit test
{code}
 ant  -Dtest.junit.output.format=xml clean  -Dtestcase=TestGrunt  
-Dexectype=spark  -Dhadoopversion=2  test

{code}
the result:
{noformat}
Tests run: 67, Failures: 1, Errors: 5, Skipped: 4, Time elapsed: 138.459 sec

{noformat}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-21 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5246:
--
Attachment: PIG-5246.3.patch

[~nkollar],[~szita],[~rohini],[~jeffzhang]: updated PIG-5246.3.patch
changes:
1. use spark-tags*.jar to verify whether the current spark is spark1 or spark2
2. a small fix: changed
{code}
SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
 CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR

{code}

to
{code}
CLASSPATH=${CLASSPATH}:${SPARK_HOME}/lib/spark-assembly*
{code}
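For reference, a minimal sketch of how such a spark-tags based check could look in 
bin/pig (the variable names here are illustrative, not necessarily the ones in the 
patch):
{code}
# Sketch: infer the Spark major version from the layout of SPARK_HOME.
# spark-tags*.jar ships only with Spark 2; Spark 1 has spark-assembly*.jar instead.
sparkversion="1"
if ls "${SPARK_HOME}"/jars/spark-tags*.jar >/dev/null 2>&1; then
    sparkversion="2"
fi
echo "Detected Spark major version: ${sparkversion}"
{code}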

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.3.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-21 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057075#comment-16057075
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: I applied the patch and tested a simple query in a yarn-client env.
build jar:
{noformat}ant clean -v -Dhadoopversion=2 jar-spark12{noformat}
testJoin.pig
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name) parallel 10; 
store D into './testJoin.out';
{code}

spark1:
export SPARK_HOME=
export 
SPARK_JAR=hdfs://:8020/user/root/spark-assembly-1.6.1-hadoop2.6.0.jar
$PIG_HOME/bin/pig -x spark -logfile $PIG_HOME/logs/pig.log testJoin.pig
error in logs/pig
{noformat}
java.lang.NoClassDefFoundError: 
org/apache/spark/scheduler/SparkListenerInterface
at 
org.apache.pig.backend.hadoop.executionengine.spark.SparkExecutionEngine.(SparkExecutionEngine.java:35)
at 
org.apache.pig.backend.hadoop.executionengine.spark.SparkExecType.getExecutionEngine(SparkExecType.java:42)
at org.apache.pig.impl.PigContext.(PigContext.java:269)
at org.apache.pig.impl.PigContext.(PigContext.java:256)
at org.apache.pig.Main.run(Main.java:389)
at org.apache.pig.Main.main(Main.java:175)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.scheduler.SparkListenerInterface
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
{noformat}

spark2 (with patch PIG-5246_2.patch)
export SPARK_HOME=
$PIG_HOME/bin/pig -x spark -logfile $PIG_HOME/logs/pig.log testJoin.pig
error in logs/pig
{noformat}
[main] 2017-06-21 14:14:05,791 ERROR spark.JobGraphBuilder 
(JobGraphBuilder.java:sparkOperToRDD(187)) - throw exception in sparkOperToRDD: 
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:763)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:762)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:762)
at 
org.apache.spark.api.java.JavaRDDLike$class.mapPartitions(JavaRDDLike.scala:166)
at 
org.apache.spark.api.java.AbstractJavaRDDLike.mapPartitions(JavaRDDLike.scala:45)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter.convert(ForEachConverter.java:64)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter.convert(ForEachConverter.java:45)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.physicalToRDD(JobGraphBuilder.java:292)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.physicalToRDD(JobGraphBuilder.java:248)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.physicalToRDD(JobGraphBuilder.java:248)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.physicalToRDD(JobGraphBuilder.java:248)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.sparkOperToRDD(JobGraphBuilder.java:182)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.visitSparkOp(JobGraphBuilder.java:112)
at 
org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkOperator.visit(SparkOperator.java:140)
at 
org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkOperator.visit(SparkOperator.java:37)
  

[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056789#comment-16056789
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~rohini]:
bq.You can cherry pick jars to include in the classpath, but it is not going to 
make much difference. If a new jar is added in a new version, then it will 
actually be a problem and pig will have to be updated to include that jar.
thanks for the explanation; that's why I mentioned in a previous comment 
"But If users use a different spark which is different from compile. Will the 
dependencies be different?"

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053644#comment-16053644
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: It builds successfully in my env when using ant clean jar-spark12, 
but give me more time to test it on spark1 and spark2.
  

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053472#comment-16053472
 ] 

liyunzhang_intel edited comment on PIG-5246 at 6/19/17 4:50 AM:


[~rohini]:
bq. I would suggest checking for presence of spark-tags*.jar which is only 
present in Spark 2. If it is not present, then assume spark 1.
thanks for the suggestion.
[~jeffzhang]:
bq. Pig don't need to load all the jars under SPARK_HOME/jars. Pig has already 
specify spark dependencies in ivy.
yes, the spark dependencies in ivy are for compiling; i can select the jars which 
pig on spark really needs from $SPARK_HOME/jars. But if users run with a spark 
that differs from the one used at compile time, will the dependencies be different?
My question is: is there a big performance impact if we append all jars under 
$SPARK_HOME/jars to the pig classpath?


was (Author: kellyzly):
[~rohini]:bq. I would suggest checking for presence of spark-tags*.jar which is 
only present in Spark 2. If it is not present, then assume spark 1.
thanks for suggestion.
[~jeffzhang]:
bq. Pig don't need to load all the jars under SPARK_HOME/jars. Pig has already 
specify spark dependencies in ivy.
yes, the spark dependencies in ivy is for compile, i can select jars which pig 
on spark really needs from $SPARK_HOME/jars. But If users use a different spark 
which is different from compile. Will the dependencies be different?
My question is: is there big performance influence if we append all jar under 
$SPARK_HOME/jars to the pig classpath?

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053472#comment-16053472
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~rohini]:
bq. I would suggest checking for presence of spark-tags*.jar which is only 
present in Spark 2. If it is not present, then assume spark 1.
thanks for the suggestion.
[~jeffzhang]:
bq. Pig don't need to load all the jars under SPARK_HOME/jars. Pig has already 
specify spark dependencies in ivy.
yes, the spark dependencies in ivy are for compiling; i can select the jars which 
pig on spark really needs from $SPARK_HOME/jars. But if users run with a spark 
that differs from the one used at compile time, will the dependencies be different?
My question is: is there a big performance impact if we append all jars under 
$SPARK_HOME/jars to the pig classpath?

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-16 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051474#comment-16051474
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~nkollar]:
  bq. For Spark 2.x do we have to add all jar under $SPARK_HOME/jars?
someone suggested adding all jars under $SPARK_HOME/jars in Hive on 
Spark ([HIVE-15302|https://issues.apache.org/jira/browse/HIVE-15302]), but it seems 
this was not accepted by [~vanzin]. However, the [Hive 
wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started]
 says that we need not append all jars under $SPARK_HOME/jars.
{noformat}
Configuring Hive
To add the Spark dependency to Hive:
Prior to Hive 2.2.0, link the spark-assembly jar to HIVE_HOME/lib.
Since Hive 2.2.0, Hive on Spark runs with Spark 2.0.0 and above, which doesn't 
have an assembly jar.
To run with YARN mode (either yarn-client or yarn-cluster), link the following 
jars to HIVE_HOME/lib.
scala-library
spark-core
spark-network-common
To run with LOCAL mode (for debugging only), link the following jars in 
addition to those above to HIVE_HOME/lib.
chill-java  chill  jackson-module-paranamer  jackson-module-scala  
jersey-container-servlet-core
jersey-server  json4s-ast  kryo-shaded  minlog  scala-xml  spark-launcher
spark-network-shuffle  spark-unsafe  xbean-asm5-shaded
{noformat}

 I don't know whether there is a performance impact if we append all jars under 
$SPARK_HOME/jars to the pig classpath.
bq.Could we avoid creating temp files? Instead of creating spark.version, would 
something like this work?
yes, this works, thanks for the suggestion.

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-15 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050164#comment-16050164
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~nkollar]: can you help review PIG-5246_2.patch?

in PIG-5246_2.patch
the following way is used to judge the spark version
{code}
+$SPARK_HOME/bin/spark-submit --version >/tmp/spark.version 2>&1
+isSpark1=`grep "version 1" /tmp/spark.version|wc -l`
+if [ "$isSpark1" -eq 0 ];then 
+  sparkversion="2"
 fi
{code}
it redirects the output of "spark-submit --version" to /tmp/spark.version (the file 
is removed later). Is there a better way to judge the version?
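One possible alternative that avoids the temp file entirely (a sketch, not taken 
from the patch; it assumes {{spark-submit --version}} prints a line containing 
"version X.Y.Z", as Spark 1.x and 2.x both do):
{code}
# Sketch: capture the version banner in a shell variable instead of a file.
versionline=$("${SPARK_HOME}"/bin/spark-submit --version 2>&1 | grep -m1 'version ')
case "$versionline" in
    *"version 1."*) sparkversion="1" ;;
    *)              sparkversion="2" ;;
esac
{code}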

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246_2.patch, 
> PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-15 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050123#comment-16050123
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: after downloading the latest patch from RB, how should I compile now?
when i use the following command 
{code}
ant clean -v -Dhadoopversion=2 jar-spark12
{code}

I got the following error
{noformat}
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.7
94937 [javac] 
/home/zly/prj/oss/pig/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim2.java:29:
 error: cannot find symbol
94938 [javac] import org.apache.spark.api.java.Optional;
94939 [javac] ^
94940 [javac]   symbol:   class Optional
94941 [javac]   location: package org.apache.spark.api.java
94942 [javac] 
/home/zly/prj/oss/pig/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim2.java:99:
 error: no interface expected here
94943 [javac] private static class JobMetricsListener extends 
SparkListener {
94944 [javac] ^
94945 [javac] 
/home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-client-1.2.4.jar(org/apache/hadoop/hbase/filter/FilterList.class):
 warning: Cannot find annotation method 'value()' in type 
'SuppressWarnings': class file for 
edu.umd.cs.findbugs.annotations.SuppressWarnings not found
94946 [javac] 
/home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-client-1.2.4.jar(org/apache/hadoop/hbase/filter/FilterList.class):
 warning: Cannot find annotation method 'justification()' in type 
'SuppressWarnings'
94947 [javac] 
/home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-common-1.2.4.jar(org/apache/hadoop/hbase/io/ImmutableBytesWritable.class):
 warning: Cannot find annotation method 'value()' in type 
'SuppressWarnings'
94948 [javac] 
/home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-common-1.2.4.jar(org/apache/hadoop/hbase/io/ImmutableBytesWritable.class):
 warning: Cannot find annotation method 'justification()' in type 
'SuppressWarnings'
94949 [javac] 
/home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-server-1.2.4.jar(org/apache/hadoop/hbase/mapreduce/TableInputFormat.class):
 warning: Cannot find annotation method 'value()' in type 
'SuppressWarnings'
94950 [javac] 
/home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-server-1.2.4.jar(org/apache/hadoop/hbase/mapreduce/TableInputFormat.class):
 warning: Cannot find annotation method 'justification()' in type 
'SuppressWarnings'
94951 [javac] 
/home/zly/prj/oss/pig/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim2.java:49:
 error:  is not abstract and does 
not override abstract method call(T) in FlatMapFunction
94952 [javac] return new FlatMapFunction() {
94953 [javac]^

{noformat}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-13 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047531#comment-16047531
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: I left some comments on the review board.
Can you update the patch to the latest code?
latest code
{noformat}
* 5c55102 - (origin/trunk, origin/HEAD) PIG-4700: Enable progress reporting for 
Tasks in Tez (satishsaley via rohini) (7 days ago) 
{noformat}  
when i download the patch from review board and apply it like the following
{code}
 patch -p0
{code}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PIG-5157) Upgrade to Spark 2.0

2017-06-13 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047531#comment-16047531
 ] 

liyunzhang_intel edited comment on PIG-5157 at 6/13/17 7:36 AM:


[~nkollar]: I made some comments on the review board.
Can you update the patch to the latest code?
latest code
{noformat}
* 5c55102 - (origin/trunk, origin/HEAD) PIG-4700: Enable progress reporting for 
Tasks in Tez (satishsaley via rohini) (7 days ago) 
{noformat}  
when i download the patch from review board and apply it like the following
{code}
 patch -p0
{code}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036470#comment-16036470
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~jeffzhang]: thanks for the suggestion
bq. Why copying the assembly jar instead of including it in the classpath of 
pig ?
sorry for the mistake in my last comment. We just include the spark-assembly*.jar 
in the [classpath of pig|https://github.com/apache/pig/blob/trunk/bin/pig#L415]

bq. And it is also weird to me not loading spark-defaults.conf as this would 
cause extra administration overhead. 
yes, i agree this can be improved by directly parsing 
$SPARK_HOME/conf/spark-defaults.conf. see PIG-5252
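A rough sketch of what such parsing could look like in bin/pig (illustrative only; 
PIG-5252 tracks the real change, and forwarding the entries through PIG_OPTS is an 
assumption of this sketch):
{code}
# Sketch: forward spark.* entries from spark-defaults.conf to Pig as -D options.
SPARK_DEFAULTS="${SPARK_HOME}/conf/spark-defaults.conf"
if [ -f "$SPARK_DEFAULTS" ]; then
    while read -r key value; do
        # only spark.* keys match, so comments and blank lines are skipped
        case "$key" in
            spark.*) PIG_OPTS="$PIG_OPTS -D$key=$value" ;;
        esac
    done < "$SPARK_DEFAULTS"
fi
{code}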





> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5252) Get properties from $SPARK_HOME/conf/spark-defaults.conf not from pig.properties

2017-06-04 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-5252:
-

 Summary: Get properties from $SPARK_HOME/conf/spark-defaults.conf 
not from pig.properties
 Key: PIG-5252
 URL: https://issues.apache.org/jira/browse/PIG-5252
 Project: Pig
  Issue Type: Bug
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036459#comment-16036459
 ] 

liyunzhang_intel edited comment on PIG-5246 at 6/5/17 1:59 AM:
---

[~rohini]: thanks for the suggestion; for spark1 vs spark2, it will be done by 
checking for spark-assembly.jar or similar in the script, so users need not 
specify the spark version.
bq. For eg: In Spark JobMetricsListener will redirect to 
JobMetricsListenerSpark1 or JobMetricsListenerSpark2. But for users it makes it 
very simple as they can use same pig installation to run against any version.
It would be convenient for users that way, but I am not sure whether there are 
conflicts if the jars of both spark1 and spark2 are in the pig classpath.
 [~zjffdu]:  
bq. Actually SPARK_ASSEMBLY_JAR is not a must-have thing for spark. 
  If SPARK_ASSEMBLY_JAR is not a must-have thing for spark1, how do we judge 
spark1 vs spark2?
bq.IMO, pig don't need to specify that, it is supposed to be set in 
spark-defaults.conf which would apply to all spark apps.
  Pig on Spark uses the spark installation and copies 
$SPARK_HOME/lib/spark-assembly*jar (spark1) or $SPARK_HOME/jars/*jar to the 
classpath of pig. But we don't read spark-defaults.conf.  We parse 
pig.properties and save the spark configuration to 
[SparkContext|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java#L584].
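For illustration, the kind of entries Pig picks up from pig.properties and hands to 
the SparkContext look like the following (the values are examples only):
{noformat}
spark.master=yarn-client
spark.executor.memory=2g
{noformat}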

 


was (Author: kellyzly):
[~rohini]: thanks for suggestion, for spark1 and spark2, it will be done by 
checking for spark-assembly.jar or other things in the script and user need not 
specify the version of spark.
bq. For eg: In Spark JobMetricsListener will redirect to 
JobMetricsListenerSpark1 or JobMetricsListenerSpark2. But for users it makes it 
very simple as they can use same pig installation to run against any version.
It will be convenient for users in that way but not sure whether there is 
conflicts if both jars of spark1 and spark2 in the pig classpath.
 [~zjffdu]:  bq. Actually SPARK_ASSEMBLY_JAR is not a must-have thing for 
spark. 
  If SPARK_ASSEMBLY_JAR is not a must-have thing for spark1, how to judge 
spark1 or spark2?
bq.IMO, pig don't need to specify that, it is supposed to be set in 
spark-defaults.conf which would apply to all spark apps.
  Pig on Spark use spark installation and will copy 
$SPARK_HOME/lib/spark-assembly*jar(spark1) and $SPARK_HOME/jars/*jar to the 
classpath of pig. But we don't read spark-defaults.conf.  We parse 
pig.properties and save the configuration about spark to 
[SparkContext|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java#L584].

 

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036459#comment-16036459
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~rohini]: thanks for the suggestion; for spark1 vs spark2, it will be done by 
checking for spark-assembly.jar or similar in the script, so users need not 
specify the spark version.
bq. For eg: In Spark JobMetricsListener will redirect to 
JobMetricsListenerSpark1 or JobMetricsListenerSpark2. But for users it makes it 
very simple as they can use same pig installation to run against any version.
It would be convenient for users that way, but I am not sure whether there are 
conflicts if the jars of both spark1 and spark2 are in the pig classpath.
 [~zjffdu]:  
bq. Actually SPARK_ASSEMBLY_JAR is not a must-have thing for spark. 
  If SPARK_ASSEMBLY_JAR is not a must-have thing for spark1, how do we judge 
spark1 vs spark2?
bq.IMO, pig don't need to specify that, it is supposed to be set in 
spark-defaults.conf which would apply to all spark apps.
  Pig on Spark uses the spark installation and copies 
$SPARK_HOME/lib/spark-assembly*jar (spark1) or $SPARK_HOME/jars/*jar to the 
classpath of pig. But we don't read spark-defaults.conf.  We parse 
pig.properties and save the spark configuration to 
[SparkContext|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java#L584].

 

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HBase9498.patch, PIG-5246.1.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PIG-5215) Merge changes from review board to spark branch

2017-06-04 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel resolved PIG-5215.
---
Resolution: Fixed

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, 
> PIG-5215.4.fixes.patch, PIG-5215.4.patch, PIG-5215.4.TestCombinerFix.patch, 
> PIG-5215.5.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from community. After the review board is close, merge these changes to spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5215) Merge changes from review board to spark branch

2017-06-04 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036445#comment-16036445
 ] 

liyunzhang_intel commented on PIG-5215:
---

[~nkollar]: thanks for the reminder, closing it

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, 
> PIG-5215.4.fixes.patch, PIG-5215.4.patch, PIG-5215.4.TestCombinerFix.patch, 
> PIG-5215.5.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from community. After the review board is close, merge these changes to spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-02 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034284#comment-16034284
 ] 

liyunzhang_intel commented on PIG-5246:
---

[~szita], [~nkollar], [~rohini] and [~jeffzhang]:
It is not very convenient to make users type {{-sparkversion 2}} when they use pig 
like the following (the default sparkversion is 1 and need not be typed)
{code}
./pig -x $mode -sparkversion 2 -log4jconf $PIG_HOME/conf/log4j.properties 
-logfile $PIG_HOME/logs/pig.log  $PIG_HOME/bin/testJoin.pig
{code} 
some options to improve this:
1. save {{sparkversion}} in a file and parse {{sparkversion}} from the file in 
bin/pig
2. judge the spark version from spark-assembly*jar: in spark1 there is a 
spark-assembly*jar in $SPARK_HOME/lib, while in spark2 there is no 
$SPARK_HOME/lib/spark-assembly*jar (a sketch of this option follows below)

Please give me your opinion, or say whether you think it is acceptable to make 
users specify {{sparkversion}} on the command line.
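A minimal sketch of option 2, assuming only that spark1 ships spark-assembly*.jar 
under $SPARK_HOME/lib and spark2 does not:
{code}
# Sketch: infer the version from the directory layout so users never
# have to pass -sparkversion by hand.
if ls "${SPARK_HOME}"/lib/spark-assembly*.jar >/dev/null 2>&1; then
    sparkversion="1"
else
    sparkversion="2"
fi
{code}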

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-5246.1.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032609#comment-16032609
 ] 

liyunzhang_intel edited comment on PIG-5246 at 6/1/17 7:57 AM:
---

[~nkollar]: in PIG-5246.1.patch, modify {{sparkversion=16}} to 
{{sparkversion=1}} and {{sparkversion=21}} to {{sparkversion=2}}


was (Author: kellyzly):
[~nkollar]: in PIG-5246.1.patch, modify {{spark16}} to {{spark1}} and 
{{spark21}} to {{spark2}}

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-5246.1.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-01 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5246:
--
Attachment: PIG-5246.1.patch

[~nkollar]: in PIG-5246.1.patch, modify {{spark16}} to {{spark1}} and 
{{spark21}} to {{spark2}}

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-5246.1.patch, PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-01 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5246:
--
Status: Patch Available  (was: Open)

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-06-01 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5246:
--
Attachment: PIG-5246.patch

[~nkollar], [~szita]: please help review.
in spark2, spark-assembly*.jar does not exist, so we need to append all jars under 
$SPARK_HOME/jars/ to the pig classpath.
{code}
+if [ "$sparkversion" == "21" ]; then
+  if [ -n "$SPARK_HOME" ]; then
+ echo "Using Spark Home: " ${SPARK_HOME}
+  for f in $SPARK_HOME/jars/*.jar; do
+   CLASSPATH=${CLASSPATH}:$f
+  done
+  fi
+ fi
{code}
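As a side note, the per-jar loop could likely be collapsed into a single entry, 
since the JVM (Java 6 and later) expands a trailing {{/*}} classpath element to 
every jar in that directory; a sketch:
{code}
# Sketch: let the JVM's classpath wildcard pick up every jar in the directory.
# (Shell variable assignments do not glob-expand, so the * reaches Java intact.)
CLASSPATH=${CLASSPATH}:${SPARK_HOME}/jars/*
{code}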

How to use it:

1. build pig with spark21
{noformat}
   ant clean -v  -Dsparkversion=21   -Dhadoopversion=2 jar
{noformat}
2. run pig with spark21
{noformat}
  ./pig -x $mode -sparkversion 21 -log4jconf $PIG_HOME/conf/log4j.properties 
-logfile $PIG_HOME/logs/pig.log  $PIG_HOME/bin/testJoin.pig
{noformat}
  

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-5246.patch
>
>
> in bin/pig.
> we copy assembly jar to pig's classpath in spark1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032514#comment-16032514
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: I have tested that we can remove JobLogger in spark16.  

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5247) Investigate stopOnFailure feature with Spark execution engine

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032292#comment-16032292
 ] 

liyunzhang_intel commented on PIG-5247:
---

[~szita]: what confuses me is that this feature is currently implemented in the 
spark engine, yet you created a jira for it.
if stopOnFailure is enabled, the remaining jobs will not be executed once an 
exception is thrown.

> Investigate stopOnFailure feature with Spark execution engine
> -
>
> Key: PIG-5247
> URL: https://issues.apache.org/jira/browse/PIG-5247
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
> Fix For: 0.18.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030847#comment-16030847
 ] 

liyunzhang_intel commented on PIG-5245:
---

[~rohini]: stop_on_failure is implemented in spark mode in 
[JobGraphBuilder.java|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/JobGraphBuilder.java#L193]


> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, PIG-5245-2.patch
>
>
>   The test is supposed to run two tests in parallel, and one when fails other 
> should be killed when stop on failure is configured. But the test is actually 
> running only job at a time and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-05-31 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-5246:
-

 Summary: Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after 
upgrading spark to 2
 Key: PIG-5246
 URL: https://issues.apache.org/jira/browse/PIG-5246
 Project: Pig
  Issue Type: Bug
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel


in bin/pig.
we copy assembly jar to pig's classpath in spark1.6.
{code}
# For spark mode:
# Please specify SPARK_HOME first so that we can locate 
$SPARK_HOME/lib/spark-assembly*.jar,
# we will add spark-assembly*.jar to the classpath.
if [ "$isSparkMode"  == "true" ]; then
if [ -z "$SPARK_HOME" ]; then
   echo "Error: SPARK_HOME is not set!"
   exit 1
fi

# Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar to 
allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need to be 
distributed each time an application runs.
if [ -z "$SPARK_JAR" ]; then
   echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
location of spark-assembly*.jar. This allows YARN to cache spark-assembly*.jar 
on nodes so that it doesn't need to be distributed each time an application 
runs."
   exit 1
fi

if [ -n "$SPARK_HOME" ]; then
echo "Using Spark Home: " ${SPARK_HOME}
SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
fi
fi

{code}
after upgrade to spark2.0, we may modify it



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030799#comment-16030799
 ] 

liyunzhang_intel edited comment on PIG-5157 at 5/31/17 7:54 AM:


[~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason for the modification is that [~rohini] suggested memory usage is high 
if we update the metric info in onTaskEnd() (suppose there are thousands of tasks).
In org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener for 
Spark 2.1, we should use code like the following.
Note: not fully tested, I cannot guarantee it is right.
{code}
  public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
      int stageId = stageCompleted.stageInfo().stageId();
      int stageAttemptId = stageCompleted.stageInfo().attemptId();
      String stageIdentifier = stageId + "_" + stageAttemptId;
      Integer jobId = stageIdToJobId.get(stageId);
      if (jobId == null) {
          LOG.warn("Cannot find job id for stage[" + stageId + "].");
      } else {
          Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
          if (jobMetrics == null) {
              jobMetrics = Maps.newHashMap();
              allJobMetrics.put(jobId, jobMetrics);
          }
          List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
          if (stageMetrics == null) {
              stageMetrics = Lists.newLinkedList();
              jobMetrics.put(stageIdentifier, stageMetrics);
          }
          stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
      }
  }

  // Intentionally empty: collecting metrics per task would use too much memory.
  public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
  }
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2
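A minimal sketch of that replacement (the event log directory below is just an 
example path):
{code}
import org.apache.spark.SparkConf;

public class EventLogSketch {
    public static SparkConf withEventLog(SparkConf conf) {
        // Instead of the removed JobLogger, let Spark write its own event log.
        return conf.set("spark.eventLog.enabled", "true")
                   .set("spark.eventLog.dir", "hdfs:///tmp/spark-events");  // example dir
    }
}
{code}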



was (Author: kellyzly):
[~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason for the modification is that [~rohini] suggested memory usage is high 
if we update the metric info in onTaskEnd() (suppose there are thousands of tasks).
In org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener for 
Spark 2.1, we should use code like the following.
Note: not fully tested, I cannot guarantee it is right.
{code}
  public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
      // if we update taskMetrics in onTaskEnd(), it consumes a lot of memory
      int stageId = stageCompleted.stageInfo().stageId();
      int stageAttemptId = stageCompleted.stageInfo().attemptId();
      String stageIdentifier = stageId + "_" + stageAttemptId;
      Integer jobId = stageIdToJobId.get(stageId);
      if (jobId == null) {
          LOG.warn("Cannot find job id for stage[" + stageId + "].");
      } else {
          Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
          if (jobMetrics == null) {
              jobMetrics = Maps.newHashMap();
              allJobMetrics.put(jobId, jobMetrics);
          }
          List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
          if (stageMetrics == null) {
              stageMetrics = Lists.newLinkedList();
              jobMetrics.put(stageIdentifier, stageMetrics);
          }
          stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
      }
  }

  public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
  }
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2


> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030799#comment-16030799
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason for the modification is that [~rohini] suggested memory usage is high 
if we update the metric info in onTaskEnd() (suppose there are thousands of tasks).
In org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener for 
Spark 2.1, we should use code like the following.
Note: not fully tested, I cannot guarantee it is right.
{code}
  public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
      // if we update taskMetrics in onTaskEnd(), it consumes a lot of memory
      int stageId = stageCompleted.stageInfo().stageId();
      int stageAttemptId = stageCompleted.stageInfo().attemptId();
      String stageIdentifier = stageId + "_" + stageAttemptId;
      Integer jobId = stageIdToJobId.get(stageId);
      if (jobId == null) {
          LOG.warn("Cannot find job id for stage[" + stageId + "].");
      } else {
          Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
          if (jobMetrics == null) {
              jobMetrics = Maps.newHashMap();
              allJobMetrics.put(jobId, jobMetrics);
          }
          List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
          if (stageMetrics == null) {
              stageMetrics = Lists.newLinkedList();
              jobMetrics.put(stageIdentifier, stageMetrics);
          }
          stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
      }
  }

  public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
  }
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2


> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-29 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16028651#comment-16028651
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: will review tomorrow, as I am out of office Monday and Tuesday.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5215) Merge changes from review board to spark branch

2017-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025908#comment-16025908
 ] 

liyunzhang_intel commented on PIG-5215:
---

[~szita]: the only possible change is PIG-5167. If [~nkollar] can fix it soon, 
we may include it in the first release; otherwise I suggest we make no other 
changes to the current code and commit the PIG-5167 change later.

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, 
> PIG-5215.4.fixes.patch, PIG-5215.4.patch, PIG-5215.4.TestCombinerFix.patch, 
> PIG-5215.5.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (PIG-5215) Merge changes from review board to spark branch

2017-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025902#comment-16025902
 ] 

liyunzhang_intel edited comment on PIG-5215 at 5/26/17 7:01 AM:


[~szita]: I checked the svn history and found that the last commit succeeded. 
If I am wrong, please tell me.
{noformat}
[root@bdpe42 pig.on.spark]# svn log|head -n 10

r1796232 | zly | 2017-05-25 22:01:13 -0400 (Thu, 25 May 2017) | 1 line

PIG-5215:Merge changes from review board to spark branch(Liyun)

{noformat}


was (Author: kellyzly):
[~szita]: sorry, the commit failed, I will commit soon.

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, 
> PIG-5215.4.fixes.patch, PIG-5215.4.patch, PIG-5215.4.TestCombinerFix.patch, 
> PIG-5215.5.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5215) Merge changes from review board to spark branch

2017-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025902#comment-16025902
 ] 

liyunzhang_intel commented on PIG-5215:
---

[~szita]: sorry, the commit failed, I will commit soon.

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, 
> PIG-5215.4.fixes.patch, PIG-5215.4.patch, PIG-5215.4.TestCombinerFix.patch, 
> PIG-5215.5.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5167) Limit_4 is failing with spark exec type

2017-05-25 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025723#comment-16025723
 ] 

liyunzhang_intel commented on PIG-5167:
---

[~nkollar]: my suggestion is:
1. add a new verify_pig_script to Limit_13
{code}
{
  'num' => 13,
  'execonly' => 'spark', # Limit_4 failed on Spark: distinct doesn't do implicit sort like it does in MR
  'pig' => q\a = load ':INPATH:/singlefile/studentnulltab10k';
             b = distinct a;
             c = limit b 100;
             store c into ':OUTPATH:';\,
  'verify_pig_script' => q\a = load ':INPATH:/singlefile/studentnulltab10k';
                           b = distinct a;
                           c = limit b 100;
                           store c into ':OUTPATH:';\,
}
{code}
This is not ideal because the script and the verify_script are the same.
2. If option 1 is not accepted, remove Limit_13 and leave PIG-5167 open.

I have tried option 1 but it failed because the verify_pig_script is executed in 
benchmark mode (mr), so the results differ between spark and mr.

> Limit_4 is failing with spark exec type
> ---
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167_2.patch, PIG-5167_3.patch, PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 
> Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50  3.00
> < 74  2.22
> < alice carson66  2.42
> < alice quirinius 71  0.03
> < alice van buren 28  2.50
> ---
> > bob allen   0.28
> > bob allen   22  0.92
> > bob allen   25  2.54
> > bob allen   26  2.35
> > bob allen   27  2.17
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5215) Merge changes from review board to spark branch

2017-05-25 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5215:
--
Attachment: PIG-5215.5.patch

[~szita]: thanks for the fix. I included PIG-5215.4.fixes.patch and 
PIG-5215.4.TestCombinerFix.patch in PIG-5215.5.patch; it is ready to commit to 
the spark branch.

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, 
> PIG-5215.4.fixes.patch, PIG-5215.4.patch, PIG-5215.4.TestCombinerFix.patch, 
> PIG-5215.5.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Reopened] (PIG-5167) Limit_4 is failing with spark exec type

2017-05-25 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reopened PIG-5167:
---

Reopening it, as Rohini suggested fixing it in a different way:
bq.Testing distinct + orderby + limit serves the same purpose as orderby + 
limit test. Can you remove orderby from this test? If distinct + limit differs 
everytime even with spark and a different verify_pig_script runs just ignore 
the test for now adding a TODO to test num 4.

> Limit_4 is failing with spark exec type
> ---
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167_2.patch, PIG-5167_3.patch, PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 
> Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50  3.00
> < 74  2.22
> < alice carson66  2.42
> < alice quirinius 71  0.03
> < alice van buren 28  2.50
> ---
> > bob allen   0.28
> > bob allen   22  0.92
> > bob allen   25  2.54
> > bob allen   26  2.35
> > bob allen   27  2.17
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5215) Merge changes from review board to spark branch

2017-05-25 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5215:
--
Attachment: PIG-5215.4.patch

[~szita]: updated with the latest PIG-5215.4.patch.

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, PIG-5215.4.patch, 
> PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5215) Merge changes from review board to spark branch

2017-05-25 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5215:
--
Attachment: (was: PIG-5215.4.patch)

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, PIG-5215.4.patch, 
> PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5215) Merge changes from review board to spark branch

2017-05-25 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5215:
--
Attachment: PIG-5215.4.patch

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, PIG-5215.4.patch, 
> PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5241) Specify the hdfs path directly to spark and avoid the unnecessary download and upload in SparkLauncher.java

2017-05-25 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-5241:
-

 Summary: Specify the hdfs path directly to spark and avoid the 
unnecessary download and upload in SparkLauncher.java
 Key: PIG-5241
 URL: https://issues.apache.org/jira/browse/PIG-5241
 Project: Pig
  Issue Type: Sub-task
Reporter: liyunzhang_intel


//TODO: Specify the hdfs path directly to spark and avoid the unnecessary 
download and upload in SparkLauncher.java
{code}
  private void cacheFiles(String cacheFiles) throws IOException {
      if (cacheFiles != null && !cacheFiles.isEmpty()) {
          File tmpFolder = Files.createTempDirectory("cache").toFile();
          tmpFolder.deleteOnExit();
          for (String file : cacheFiles.split(",")) {
              String fileName = extractFileName(file.trim());
              Path src = new Path(extractFileUrl(file.trim()));
              File tmpFile = new File(tmpFolder, fileName);
              Path tmpFilePath = new Path(tmpFile.getAbsolutePath());
              FileSystem fs = tmpFilePath.getFileSystem(jobConf);
              // TODO: Specify the hdfs path directly to spark and avoid the
              // unnecessary download and upload in SparkLauncher.java
              fs.copyToLocalFile(src, tmpFilePath);
              tmpFile.deleteOnExit();
              LOG.info(String.format("CacheFile:%s", fileName));
              addResourceToSparkJobWorkingDirectory(tmpFile, fileName,
                      ResourceType.FILE);
          }
      }
  }
{code}
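A sketch of the suggested direction (the helper below is hypothetical, but 
SparkContext.addFile does accept hdfs:// URIs and lets Spark distribute the file 
itself):
{code}
import org.apache.spark.api.java.JavaSparkContext;

public class CacheFilesSketch {
    // Hand the HDFS URI straight to Spark instead of copyToLocalFile + re-upload.
    static void cacheFiles(JavaSparkContext sparkContext, String cacheFiles) {
        if (cacheFiles != null && !cacheFiles.isEmpty()) {
            for (String file : cacheFiles.split(",")) {
                // e.g. "hdfs://namenode:8020/user/pig/udf.jar"
                sparkContext.addFile(file.trim());
            }
        }
    }
}
{code}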



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024218#comment-16024218
 ] 

liyunzhang_intel commented on PIG-5135:
---

[~szita]:
bq.I've checked this, it seems that assertEquals(30, 
inputStats.get(0).getBytes()); is fine, but assertEquals(18, 
inputStats.get(1).getBytes()); is not true, Spark returns -1 here. The plan 
generated for spark consists of 4 jobs, last one being the responsible for 
replicated join. This latter does 3 loads, and thus SparkPigStats handle this 
as -1. (Even after adding together all the bytes from all load ops in this job 
I got different result than 18.) I guess compression is also at work here on 
the tmp file part generation that further alters the number of bytes being read.
org.apache.pig.test.TestPigRunner#simpleMultiQueryTest3
{code}
#--
# Spark Plan  
#--

Spark node scope-53
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-54
|
|---A: New For Each(false,false,false)[bag] - scope-10
|   |
|   Cast[int] - scope-2
|   |
|   |---Project[bytearray][0] - scope-1
|   |
|   Cast[int] - scope-5
|   |
|   |---Project[bytearray][1] - scope-4
|   |
|   Cast[int] - scope-8
|   |
|   |---Project[bytearray][2] - scope-7
|
|---A: 
Load(hdfs://localhost:58892/user/root/input:org.apache.pig.builtin.PigStorage) 
- scope-0

Spark node scope-55
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-56
|
|---C: Filter[bag] - scope-14
|   |
|   Less Than or Equal[boolean] - scope-17
|   |
|   |---Project[int][1] - scope-15
|   |
|   |---Constant(5) - scope-16
|

|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-10

Spark node scope-57
C: 
Store(hdfs://localhost:58892/user/root/output:org.apache.pig.builtin.PigStorage)
 - scope-21
|
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-14

Spark node scope-65
D: 
Store(hdfs://localhost:58892/user/root/output2:org.apache.pig.builtin.PigStorage)
 - scope-52
|
|---D: FRJoinSpark[tuple] - scope-44
|   |
|   Project[int][0] - scope-41
|   |
|   Project[int][0] - scope-42
|   |
|   Project[int][0] - scope-43
|

|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-58
|
|---BroadcastSpark - scope-63
|   |
|   |---B: Filter[bag] - scope-26
|   |   |
|   |   Equal To[boolean] - scope-29
|   |   |
|   |   |---Project[int][0] - scope-27
|   |   |
|   |   |---Constant(3) - scope-28
|   |
|   
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-60
|
|---BroadcastSpark - scope-64
|
|---A1: New For Each(false,false,false)[bag] - scope-40
|   |
|   Cast[int] - scope-32
|   |
|   |---Project[bytearray][0] - scope-31
|   |
|   Cast[int] - scope-35
|   |
|   |---Project[bytearray][1] - scope-34
|   |
|   Cast[int] - scope-38
|   |
|   |---Project[bytearray][2] - scope-37
|
|---A1: 
Load(hdfs://localhost:58892/user/root/input2:org.apache.pig.builtin.PigStorage) 
- scope-30
{code}
 assertEquals(30, inputStats.get(0).getBytes()) is correct in spark mode, but
 assertEquals(18, inputStats.get(1).getBytes()) is wrong in spark mode, as 
there are 3 loads in {{Spark node scope-65}}.  
[{{stats.get("BytesRead")}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L93]
 returns 49 (I guess this is the sum of the 
three loads: {{input2}}, {{tmp1818797386}}, {{tmp-546700946}}). But the current 
[{{bytesRead}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L91]
 is -1 because 
[{{singleInput}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L92]
 is false.


Let's modify the code like this:
{code}
// Since Tez has only one load per job, its values are correct;
// the result of inputStats in spark mode is also correct.
if (!Util.isMapredExecType(cluster.getExecType())) {
    assertEquals(30, inputStats.get(0).getBytes());
}

// TODO PIG-5240: Fix TestPigRunner#simpleMultiQueryTest3 in spark mode for wrong inputStats
if (!Util.isMapredExecType(cluster.getExecType()) && 

[jira] [Created] (PIG-5240) Fix TestPigRunner#simpleMultiQueryTest3 in spark mode for wrong inputStats

2017-05-24 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-5240:
-

 Summary: Fix TestPigRunner#simpleMultiQueryTest3 in spark mode for 
wrong inputStats
 Key: PIG-5240
 URL: https://issues.apache.org/jira/browse/PIG-5240
 Project: Pig
  Issue Type: Sub-task
Reporter: liyunzhang_intel


In TestPigRunner#simpleMultiQueryTest3, the explain plan is:
{code}
#--
# Spark Plan  
#--

Spark node scope-53
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-54
|
|---A: New For Each(false,false,false)[bag] - scope-10
|   |
|   Cast[int] - scope-2
|   |
|   |---Project[bytearray][0] - scope-1
|   |
|   Cast[int] - scope-5
|   |
|   |---Project[bytearray][1] - scope-4
|   |
|   Cast[int] - scope-8
|   |
|   |---Project[bytearray][2] - scope-7
|
|---A: 
Load(hdfs://localhost:58892/user/root/input:org.apache.pig.builtin.PigStorage) 
- scope-0

Spark node scope-55
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-56
|
|---C: Filter[bag] - scope-14
|   |
|   Less Than or Equal[boolean] - scope-17
|   |
|   |---Project[int][1] - scope-15
|   |
|   |---Constant(5) - scope-16
|

|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-10

Spark node scope-57
C: 
Store(hdfs://localhost:58892/user/root/output:org.apache.pig.builtin.PigStorage)
 - scope-21
|
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-14

Spark node scope-65
D: 
Store(hdfs://localhost:58892/user/root/output2:org.apache.pig.builtin.PigStorage)
 - scope-52
|
|---D: FRJoinSpark[tuple] - scope-44
|   |
|   Project[int][0] - scope-41
|   |
|   Project[int][0] - scope-42
|   |
|   Project[int][0] - scope-43
|

|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-58
|
|---BroadcastSpark - scope-63
|   |
|   |---B: Filter[bag] - scope-26
|   |   |
|   |   Equal To[boolean] - scope-29
|   |   |
|   |   |---Project[int][0] - scope-27
|   |   |
|   |   |---Constant(3) - scope-28
|   |
|   
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-60
|
|---BroadcastSpark - scope-64
|
|---A1: New For Each(false,false,false)[bag] - scope-40
|   |
|   Cast[int] - scope-32
|   |
|   |---Project[bytearray][0] - scope-31
|   |
|   Cast[int] - scope-35
|   |
|   |---Project[bytearray][1] - scope-34
|   |
|   Cast[int] - scope-38
|   |
|   |---Project[bytearray][2] - scope-37
|
|---A1: 
Load(hdfs://localhost:58892/user/root/input2:org.apache.pig.builtin.PigStorage) 
- scope-30
{code}
 assertEquals(30, inputStats.get(0).getBytes()) is correct in spark mode, but
 assertEquals(18, inputStats.get(1).getBytes()) is wrong in spark mode, as 
there are 3 loads in {{Spark node scope-65}}.  
[{{stats.get("BytesRead")}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L93]
 returns 49 (I guess this is the sum of the 
three loads: {{input2}}, {{tmp1818797386}}, {{tmp-546700946}}). But the current 
[{{bytesRead}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L91]
 is -1 because 
[{{singleInput}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L92]
 is false.
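One possible direction for a fix (a sketch with made-up names, not from any 
attached patch): sum the per-load byte counts instead of reporting -1 whenever a 
job has more than one input.
{code}
import java.util.List;

public class BytesReadSketch {
    // Aggregate HDFS bytes read over all load operators of one Spark job.
    static long totalBytesRead(List<Long> bytesPerLoad) {
        long total = 0;
        for (Long bytes : bytesPerLoad) {
            if (bytes == null || bytes < 0) {
                return -1;  // any unknown input makes the total unreliable
            }
            total += bytes;
        }
        return total;
    }
}
{code}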





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5239) Investigate why there are duplicated A[3,4] in TestLocationInPhysicalPlan#test in spark mode

2017-05-24 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5239:
--
Issue Type: Sub-task  (was: Bug)
Parent: PIG-4059

> Investigate why there are duplicated A[3,4] in TestLocationInPhysicalPlan#test 
> in spark mode
> ---
>
> Key: PIG-5239
> URL: https://issues.apache.org/jira/browse/PIG-5239
> Project: Pig
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5239) Investigate why there are duplicated A[3,4] in TestLocationInPhysicalPlan#test in spark mode

2017-05-24 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-5239:
-

 Summary: Investigate why there are duplicated A[3,4] 
in TestLocationInPhysicalPlan#test in spark mode
 Key: PIG-5239
 URL: https://issues.apache.org/jira/browse/PIG-5239
 Project: Pig
  Issue Type: Bug
Reporter: liyunzhang_intel






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022446#comment-16022446
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]:
bq. the optimizations offered (project Tungsten and Catalyst optimizer) looks 
promising
If we use the Catalyst optimizer, do we still need 
{{org.apache.pig.newplan.logical.relational.LogicalPlan}} and 
{{org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan}}?
The {{Catalyst optimizer}} optimizes the Spark plan generated by Spark SQL.
bq. however it seems that it is build around Java beans
I guess the DataSet/DataFrame API provides row-based operations; see the 
[patch|https://issues.apache.org/jira/secure/attachment/12847623/PIG-5080-1.patch]
 of PIG-5080:
{code}
SparkContext context = SparkContext.getOrCreate();
SQLContext sqlContext = SQLContext.getOrCreate(context);
DataFrame df = sqlContext.table("complex_data");
Row[] rows = df.collect();
assertEquals(10, rows.length);
for (int i = 0; i < rows.length; i++) {
    assertEquals(i, rows[i].getJavaMap(0).get("key_" + i));
}
{code}

[~zjffdu]: I would appreciate it if you could give us your suggestion, as you are 
more familiar with spark.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-21 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16019070#comment-16019070
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~zjffdu] and [~rohini]: thanks for your suggestion.
[~zjffdu]: 
bq.Supporting to spark2 could be done in the next release, maybe also changing 
from the rdd api to dataframe api in the next release.
yes, we will definitely not support spark2 in the first release.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017007#comment-16017007
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~rohini],[~xuefuz],[~zjffdu]: Should we support only spark2, or both 
spark1.6 and spark2? We may need reflection to support both versions (still 
investigating). Please give us your opinion; in my view, we should not support 
spark1.6 if we upgrade to spark2.0.
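For illustration, the usual reflection trick for this kind of dual-version 
support (the class name is real in Spark 2; the rest is a sketch):
{code}
public class SparkVersionSketch {
    // org.apache.spark.sql.SparkSession exists only in Spark 2.x, so its
    // presence on the classpath tells the two major versions apart.
    static boolean isSpark2() {
        try {
            Class.forName("org.apache.spark.sql.SparkSession");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
{code}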

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5207) BugFix e2e tests fail on spark

2017-05-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016999#comment-16016999
 ] 

liyunzhang_intel commented on PIG-5207:
---

[~rohini]: could you spend some time reviewing the modification of 
PhysicalPlan.java?

> BugFix e2e tests fail on spark
> --
>
> Key: PIG-5207
> URL: https://issues.apache.org/jira/browse/PIG-5207
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5207.0.patch, PIG-5207.1.patch
>
>
> Observed ClassCastException in BugFix 1 and 2 test cases. The exception is 
> thrown from a UDF: COR.Final



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (PIG-5199) exclude jline in spark dependency

2017-05-18 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel reassigned PIG-5199:
-

Assignee: Adam Szita  (was: liyunzhang_intel)

> exclude jline in spark dependency
> -
>
> Key: PIG-5199
> URL: https://issues.apache.org/jira/browse/PIG-5199
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5199.1.patch, PIG-5199.patch
>
>
> when I was fixing PIG-5197 and ran TestGrunt, this exception was thrown:
> {code}
> [ERROR] Terminal initialization failed; falling back to unsupported$
> 4220 java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but 
> interface was expected$
> 4221 ^Iat jline.TerminalFactory.create(TerminalFactory.java:101)$
> 4222 ^Iat jline.TerminalFactory.get(TerminalFactory.java:159)$
> 4223 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:227)$
> 4224 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:219)$
> 4225 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:211)$
> 4226 ^Iat org.apache.pig.Main.run(Main.java:554)$
> 4227 ^Iat org.apache.pig.PigRunner.run(PigRunner.java:49)$
> 4228 ^Iat org.apache.pig.test.TestGrunt.testGruntUtf8(TestGrunt.java:1579)$
> 4229 ^Iat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)$
> 4230 ^Iat 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)$
> 4231 ^Iat 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)$
> 4232 ^Iat java.lang.reflect.Method.invoke(Method.java:498)$
> 4233 ^Iat 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)$
> 4234 ^Iat 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)$
> 4235 ^Iat 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)$
> 4236 ^Iat 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)$
> 4237 ^Iat 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)$
> 4238 ^Iat org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)$
> 4239 ^Iat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)$
> 4240 ^Iat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)$
> 4241 ^Iat org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)$
> 4242 ^Iat org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)$
> 4243 ^Iat org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)$
> 4244 ^Iat org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)$
> {code}
> I found this is because there are 2 jline jars with different versions:
> {code}
> find -name jline*jar
> ./build/ivy/lib/spark/jline-0.9.94.jar
> ./build/ivy/lib/Pig/jline-2.11.jar
> ./lib/spark/jline-0.9.94.jar
> ./lib/jline-2.11.jar
> {code}
> we need to exclude jline-0.9.94 from spark dependency.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5199) exclude jline in spark dependency

2017-05-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015301#comment-16015301
 ] 

liyunzhang_intel commented on PIG-5199:
---

[~szita]: one question: it seems that pig-0.17.0-SNAPSHOT.jar does not contain 
any libs from {{ivy.lib.dir}}, so why do we still need the variable 
{{core.dependencies.jar}}? Just to generate 
[pig-0.17.0-SNAPSHOT.jar|https://github.com/apache/pig/blob/spark/build.xml#L692]?
{code}




{code}


> exclude jline in spark dependency
> -
>
> Key: PIG-5199
> URL: https://issues.apache.org/jira/browse/PIG-5199
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5199.1.patch, PIG-5199.patch
>
>
> when I was fixing PIG-5197 and ran TestGrunt, this exception was thrown:
> {code}
> [ERROR] Terminal initialization failed; falling back to unsupported$
> 4220 java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but 
> interface was expected$
> 4221 ^Iat jline.TerminalFactory.create(TerminalFactory.java:101)$
> 4222 ^Iat jline.TerminalFactory.get(TerminalFactory.java:159)$
> 4223 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:227)$
> 4224 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:219)$
> 4225 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:211)$
> 4226 ^Iat org.apache.pig.Main.run(Main.java:554)$
> 4227 ^Iat org.apache.pig.PigRunner.run(PigRunner.java:49)$
> 4228 ^Iat org.apache.pig.test.TestGrunt.testGruntUtf8(TestGrunt.java:1579)$
> 4229 ^Iat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)$
> 4230 ^Iat 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)$
> 4231 ^Iat 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)$
> 4232 ^Iat java.lang.reflect.Method.invoke(Method.java:498)$
> 4233 ^Iat 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)$
> 4234 ^Iat 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)$
> 4235 ^Iat 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)$
> 4236 ^Iat 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)$
> 4237 ^Iat 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)$
> 4238 ^Iat org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)$
> 4239 ^Iat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)$
> 4240 ^Iat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)$
> 4241 ^Iat org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)$
> 4242 ^Iat org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)$
> 4243 ^Iat org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)$
> 4244 ^Iat org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)$
> {code}
> I found this is because there are 2 jline jars with different versions:
> {code}
> find -name jline*jar
> ./build/ivy/lib/spark/jline-0.9.94.jar
> ./build/ivy/lib/Pig/jline-2.11.jar
> ./lib/spark/jline-0.9.94.jar
> ./lib/jline-2.11.jar
> {code}
> we need to exclude jline-0.9.94 from spark dependency.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5199) exclude jline in spark dependency

2017-05-18 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5199:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

[~szita]: LGTM, committed to the spark branch. Thanks for the contribution.

> exclude jline in spark dependency
> -
>
> Key: PIG-5199
> URL: https://issues.apache.org/jira/browse/PIG-5199
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5199.1.patch, PIG-5199.patch
>
>
> when I was fixing PIG-5197 and ran TestGrunt, this exception was thrown:
> {code}
> [ERROR] Terminal initialization failed; falling back to unsupported$
> 4220 java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but 
> interface was expected$
> 4221 ^Iat jline.TerminalFactory.create(TerminalFactory.java:101)$
> 4222 ^Iat jline.TerminalFactory.get(TerminalFactory.java:159)$
> 4223 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:227)$
> 4224 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:219)$
> 4225 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:211)$
> 4226 ^Iat org.apache.pig.Main.run(Main.java:554)$
> 4227 ^Iat org.apache.pig.PigRunner.run(PigRunner.java:49)$
> 4228 ^Iat org.apache.pig.test.TestGrunt.testGruntUtf8(TestGrunt.java:1579)$
> 4229 ^Iat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)$
> 4230 ^Iat 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)$
> 4231 ^Iat 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)$
> 4232 ^Iat java.lang.reflect.Method.invoke(Method.java:498)$
> 4233 ^Iat 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)$
> 4234 ^Iat 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)$
> 4235 ^Iat 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)$
> 4236 ^Iat 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)$
> 4237 ^Iat 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)$
> 4238 ^Iat org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)$
> 4239 ^Iat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)$
> 4240 ^Iat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)$
> 4241 ^Iat org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)$
> 4242 ^Iat org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)$
> 4243 ^Iat org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)$
> 4244 ^Iat org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)$
> {code}
> I found this is because there are 2 jline jars with different versions:
> {code}
> find -name jline*jar
> ./build/ivy/lib/spark/jline-0.9.94.jar
> ./build/ivy/lib/Pig/jline-2.11.jar
> ./lib/spark/jline-0.9.94.jar
> ./lib/jline-2.11.jar
> {code}
> we need to exclude jline-0.9.94 from spark dependency.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5135) HDFS bytes read stats are always 0 in Spark mode

2017-05-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015285#comment-16015285
 ] 

liyunzhang_intel commented on PIG-5135:
---

[~rohini]: there is some problem with the last checkin, causing the [jenkins 
failure|https://builds.apache.org/job/Pig-spark/402/consoleFull]:
* 19463c9 - (HEAD, origin/spark, spark) PIG-5135: HDFS bytes read stats are 
always 0 in Spark mode (szita via rohini) (8 hours ago)

Let's recommit and see whether jenkins passes or not.

> HDFS bytes read stats are always 0 in Spark mode
> 
>
> Key: PIG-5135
> URL: https://issues.apache.org/jira/browse/PIG-5135
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5135.0.patch, PIG-5135.1.patch, PIG-5135.2.patch
>
>
> I discovered this while running TestOrcStoragePushdown unit test in Spark 
> mode where the test depends on the value of this stat.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Issue Comment Deleted] (PIG-5228) Orc_2 is failing with spark exec type

2017-05-15 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5228:
--
Comment: was deleted

(was: [~szita] and [~rohini]: commit to the spark branch, thanks for 
contribution and review.)

> Orc_2 is failing with spark exec type
> -
>
> Key: PIG-5228
> URL: https://issues.apache.org/jira/browse/PIG-5228
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5228.0.patch
>
>
> This test is failing due to mismatch in the actual and expected result. The 
> difference is only related to the order of entries in Pig maps such as:
> Actual:
> {code}
> [name#alice, age#18]...
> {code}
> Expected:
> {code}
> [age#18, name#alice]...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5228) Orc_2 is failing with spark exec type

2017-05-15 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010032#comment-16010032
 ] 

liyunzhang_intel commented on PIG-5228:
---

[~szita] and [~rohini]: committed to the spark branch. Thanks for the 
contribution and review.

> Orc_2 is failing with spark exec type
> -
>
> Key: PIG-5228
> URL: https://issues.apache.org/jira/browse/PIG-5228
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5228.0.patch
>
>
> This test is failing due to mismatch in the actual and expected result. The 
> difference is only related to the order of entries in Pig maps such as:
> Actual:
> {code}
> [name#alice, age#18]...
> {code}
> Expected:
> {code}
> [age#18, name#alice]...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5215) Merge changes from review board to spark branch

2017-05-07 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5215:
--
Issue Type: Bug  (was: Sub-task)
Parent: (was: PIG-4059)

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5218) Jyhton_Checkin_3 fails with spark exec type

2017-05-07 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000247#comment-16000247
 ] 

liyunzhang_intel commented on PIG-5218:
---

[~rohini]: committed to the branch. [~szita]: thanks for the contribution.

> Jyhton_Checkin_3 fails with spark exec type
> ---
>
> Key: PIG-5218
> URL: https://issues.apache.org/jira/browse/PIG-5218
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5218.0.patch, PIG-5218.1.patch
>
>
> Exception observed:
> {code}
> Caused by: java.lang.ClassCastException: 
> org.apache.commons.logging.impl.SLF4JLocationAwareLog cannot be cast to 
> org.apache.commons.logging.impl.Log4JLogger
> at 
> org.apache.hadoop.test.GenericTestUtils.setLogLevel(GenericTestUtils.java:107)
> at 
> org.apache.hadoop.fs.FileContextCreateMkdirBaseTest.<clinit>(FileContextCreateMkdirBaseTest.java:60)
> ... 29 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5199) exclude jline in spark dependency

2017-05-05 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997962#comment-15997962
 ] 

liyunzhang_intel commented on PIG-5199:
---

[~szita]:
based on the current branch code (d6bf437), if we change build.xml as follows, 
org.apache.pig.test.TestRegisteredJarVisibility.testRegisterJarOverridePigJarPackages
 will fail. {{ivy.lib.dir.spark}} determines the path of the jars Spark needs; 
currently it is {{build/ivy/lib/spark/}}, but Rohini suggested it should be 
{{build/ivy/lib/Pig/spark}}. If you have time, please help investigate why it 
fails.

{code}

{code}



> exclude jline in spark dependency
> -
>
> Key: PIG-5199
> URL: https://issues.apache.org/jira/browse/PIG-5199
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5199.patch
>
>
> when I was fixing PIG-5197 and ran TestGrunt, this exception was thrown:
> {code}
> [ERROR] Terminal initialization failed; falling back to unsupported$
> 4220 java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but 
> interface was expected$
> 4221 ^Iat jline.TerminalFactory.create(TerminalFactory.java:101)$
> 4222 ^Iat jline.TerminalFactory.get(TerminalFactory.java:159)$
> 4223 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:227)$
> 4224 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:219)$
> 4225 ^Iat jline.console.ConsoleReader.<init>(ConsoleReader.java:211)$
> 4226 ^Iat org.apache.pig.Main.run(Main.java:554)$
> 4227 ^Iat org.apache.pig.PigRunner.run(PigRunner.java:49)$
> 4228 ^Iat org.apache.pig.test.TestGrunt.testGruntUtf8(TestGrunt.java:1579)$
> 4229 ^Iat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)$
> 4230 ^Iat 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)$
> 4231 ^Iat 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)$
> 4232 ^Iat java.lang.reflect.Method.invoke(Method.java:498)$
> 4233 ^Iat 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)$
> 4234 ^Iat 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)$
> 4235 ^Iat 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)$
> 4236 ^Iat 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)$
> 4237 ^Iat 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)$
> 4238 ^Iat org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)$
> 4239 ^Iat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)$
> 4240 ^Iat 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)$
> 4241 ^Iat org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)$
> 4242 ^Iat org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)$
> 4243 ^Iat org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)$
> 4244 ^Iat org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)$
> {code}
> I found this is because there are 2 jline jars with different versions:
> {code}
> find -name jline*jar
> ./build/ivy/lib/spark/jline-0.9.94.jar
> ./build/ivy/lib/Pig/jline-2.11.jar
> ./lib/spark/jline-0.9.94.jar
> ./lib/jline-2.11.jar
> {code}
> we need to exclude jline-0.9.94 from spark dependency.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5215) Merge changes from review board to spark branch

2017-05-05 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997920#comment-15997920
 ] 

liyunzhang_intel commented on PIG-5215:
---

[~szita]: thanks for the review. Committed to the branch.

> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5215) Merge changes from review board to spark branch

2017-05-04 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5215:
--
Attachment: PIG-5215.3.patch

[~szita]: please help review PIG-5215.3.patch.
Changes:
1. removed an unnecessary runtime exception
2. code refactoring

All unit tests pass with the patch.


> Merge changes from review board to spark branch
> ---
>
> Key: PIG-5215
> URL: https://issues.apache.org/jira/browse/PIG-5215
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5215.1.patch, PIG-5215.3.patch, PIG-5215.patch
>
>
> in [review board|https://reviews.apache.org/r/57317/], there are comments 
> from the community. After the review board is closed, merge these changes to the spark 
> branch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-4854) Merge spark branch to trunk

2017-05-03 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4854:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

duplicate with PIG-5215

> Merge spark branch to trunk
> ---
>
> Key: PIG-4854
> URL: https://issues.apache.org/jira/browse/PIG-4854
> Project: Pig
>  Issue Type: Bug
>Reporter: Pallavi Rao
> Attachments: PigOnSpark_3.patch, PIG-On-Spark.patch
>
>
> I believe the spark branch will shortly be ready to be merged with the main 
> branch (a couple of minor patches are pending commit), given that we have addressed 
> most functionality gaps and have ensured the UTs are clean. There are a few 
> optimizations which we will take up once the branch is merged to trunk.
> [~xuefuz], [~rohini], [~daijy],
> Hopefully, you agree that the spark branch is ready for merge. If yes, how 
> would you like us to go about it? Do you want me to upload a huge patch that will 
> be merged like any other patch, or do you prefer a branch merge?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5228) Orc_2 is failing with spark exec type

2017-05-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15991940#comment-15991940
 ] 

liyunzhang_intel commented on PIG-5228:
---

[~szita]:
{quote}
liyunzhang_intel this still fails on my cluster, I think it may be dependent on 
HashMap implementation and thus JDK version as well.
{quote}
Does this mean that in mr mode there is a possibility that the result is
{code}
[name#calvin polk,age#75,gpa#2.67155704010308]  (bob johnson,42,1.2)
{code}

If yes, this is a problem of Pig, not Pig on Spark; otherwise, we should 
investigate why there is a difference between mr and spark.
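As a quick standalone illustration (plain Java, unrelated to Pig itself) of why 
HashMap entry order cannot be relied on:
{code}
import java.util.HashMap;
import java.util.Map;

public class MapOrderSketch {
    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("name", "alice");
        m.put("age", "18");
        // HashMap iteration order is unspecified and may differ across JDK
        // versions, so both "{name=alice, age=18}" and "{age=18, name=alice}"
        // are possible outputs.
        System.out.println(m);
    }
}
{code}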

> Orc_2 is failing with spark exec type
> -
>
> Key: PIG-5228
> URL: https://issues.apache.org/jira/browse/PIG-5228
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5228.0.patch
>
>
> This test is failing due to mismatch in the actual and expected result. The 
> difference is only related to the order of entries in Pig maps such as:
> Actual:
> {code}
> [name#alice, age#18]...
> {code}
> Expected:
> {code}
> [age#18, name#alice]...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5228) Orc_2 is failing with spark exec type

2017-04-28 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988443#comment-15988443
 ] 

liyunzhang_intel commented on PIG-5228:
---

[~szita]: Orc_2 does not fail in my env with the branch code (base 63968e3); why?

> Orc_2 is failing with spark exec type
> -
>
> Key: PIG-5228
> URL: https://issues.apache.org/jira/browse/PIG-5228
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5228.0.patch
>
>
> This test is failing due to mismatch in the actual and expected result. The 
> difference is only related to the order of entries in Pig maps such as:
> Actual:
> {code}
> [name#alice, age#18]...
> {code}
> Expected:
> {code}
> [age#18, name#alice]...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

