[ https://issues.apache.org/jira/browse/PIG-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963551#comment-15963551 ]

Nandor Kollar commented on PIG-5176:
------------------------------------

It looks like the problem occurs when the Spark file server is Netty-based. The [Netty file server|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rpc/netty/NettyStreamManager.scala#L52] rejects adding a file with the same name twice, while the [HTTP file server|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/HttpFileServer.scala#L64] has no such restriction. [~kellyzly], Spark on your cluster probably used the HTTP-based file server, which is why you didn't run into this issue, while my cluster uses the Netty-based implementation.
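To make the restriction concrete, here is a minimal, self-contained sketch (my own illustrative model, not Spark's actual code) of the check NettyStreamManager.addFile performs: files are registered by base name, so a second registration under the same name fails with the IllegalArgumentException we see in the stack trace, even if the path differs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model of the Netty file server's restriction: a file is
// keyed by its base name, and registering the same base name twice fails,
// mirroring the require(...) call in NettyStreamManager.addFile.
class NettyStyleFileRegistry {
    private final Map<String, String> files = new ConcurrentHashMap<>();

    void addFile(String path) {
        String baseName = path.substring(path.lastIndexOf('/') + 1);
        // putIfAbsent returns the previous value when the key already exists
        if (files.putIfAbsent(baseName, path) != null) {
            throw new IllegalArgumentException(
                "requirement failed: File " + baseName + " already registered.");
        }
    }
}
```

The HTTP-based file server, by contrast, simply overwrites or ignores the duplicate, which is why the same Pig script passes there.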
You can reproduce the problem in a unit test: run TestStreaming#testInputShipSpecs in yarn client mode (SPARK_MASTER=yarn-client) with the additional VM option -Dspark.rpc.useNettyFileServer=true. It should fail; when you remove the spark.rpc.useNettyFileServer VM option (which means Spark will use the HTTP-based file server, see [NettyRpcEnv.scala|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L59]), it should pass. I'm not sure how we can fix this: should we check the currently used file server implementation property in Pig, set useNettyFileServer to false in our SparkLauncher class, or document that the Netty-based file server is not supported and the HTTP-based one should be used? Liyun, what do you recommend?
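A fourth option would be to deduplicate on the Pig side. A rough sketch (a hypothetical helper, not the current SparkLauncher code, with the Consumer standing in for sparkContext::addFile): remember which base names were already shipped and skip repeats, so a script that appears both as the streaming script and in the ship() clause is only added once.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical dedup guard for SparkLauncher's resource shipping:
// track base names already passed to addFile and silently skip repeats,
// avoiding the "already registered" failure on the Netty file server.
class ShippedFileTracker {
    private final Set<String> shippedNames = new HashSet<>();

    /**
     * Invokes addFile (e.g. sparkContext::addFile) only for base names
     * not seen before; returns true if the file was actually added.
     */
    boolean addOnce(String path, Consumer<String> addFile) {
        String baseName = path.substring(path.lastIndexOf('/') + 1);
        if (!shippedNames.add(baseName)) {
            return false; // same name already shipped, skip the duplicate
        }
        addFile.accept(path);
        return true;
    }
}
```

This keeps the fix independent of which file server Spark happens to use, at the cost of silently ignoring genuinely different files that share a base name.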

> Several ComputeSpec test cases fail
> -----------------------------------
>
>                 Key: PIG-5176
>                 URL: https://issues.apache.org/jira/browse/PIG-5176
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Nandor Kollar
>             Fix For: spark-branch
>
>         Attachments: PIG-5176.patch
>
>
> Several ComputeSpec test cases failed on my cluster:
> ComputeSpec_5 - ComputeSpec_13
> These scripts have a ship() part in the define, where the ship includes the
> script file too, so we add the same file to the Spark context twice. This is
> not a problem with Hadoop, but it looks like Spark doesn't allow adding the
> same filename twice:
> {code}
> Caused by: java.lang.IllegalArgumentException: requirement failed: File PigStreamingDepend.pl already registered.
>         at scala.Predef$.require(Predef.scala:233)
>         at org.apache.spark.rpc.netty.NettyStreamManager.addFile(NettyStreamManager.scala:69)
>         at org.apache.spark.SparkContext.addFile(SparkContext.scala:1386)
>         at org.apache.spark.SparkContext.addFile(SparkContext.scala:1348)
>         at org.apache.spark.api.java.JavaSparkContext.addFile(JavaSparkContext.scala:662)
>         at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.addResourceToSparkJobWorkingDirectory(SparkLauncher.java:462)
>         at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.shipFiles(SparkLauncher.java:371)
>         at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.addFilesToSparkJob(SparkLauncher.java:357)
>         at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.uploadResources(SparkLauncher.java:235)
>         at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:222)
>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:290)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
