[
https://issues.apache.org/jira/browse/PIG-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963551#comment-15963551
]
Nandor Kollar commented on PIG-5176:
------------------------------------
It looks like the problem occurs when the Spark file server is Netty-based. The
[Netty file
server|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rpc/netty/NettyStreamManager.scala#L52]
has a restriction the HTTP one doesn't: you can't add two files with the same
name, while the [HTTP file
server|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/HttpFileServer.scala#L64]
doesn't enforce this. [~kellyzly], your cluster probably used the HTTP-based
file server, which is why you didn't experience this issue; my cluster uses the
Netty-based implementation.
You can reproduce the problem in a unit test: run
TestStreaming#testInputShipSpecs in yarn client mode (SPARK_MASTER=yarn-client)
with the additional VM option -Dspark.rpc.useNettyFileServer=true. It should
fail, but when you remove the spark.rpc.useNettyFileServer VM option (which
makes Spark fall back to the HTTP-based file server, see
[NettyRpcEnv.scala|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L59])
it should pass. I'm not sure how we can fix this: should we check in Pig which
file server implementation is in use, set useNettyFileServer to false in our
SparkLauncher class, or document that the Netty-based file server is not
supported and one should use the HTTP-based one? Liyun, what do you recommend?
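Another option would be for Pig to deduplicate ship resources by file name before calling JavaSparkContext.addFile, so the Netty file server's "already registered" check is never hit. Below is a minimal, self-contained sketch of that idea; the class and method names are hypothetical and not actual Pig code:

{code}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: track file names already shipped to the
// SparkContext and skip duplicates, instead of calling addFile twice
// for the same name (which the Netty-based file server rejects).
public class ShippedFileTracker {
    private final Set<String> shippedNames = new HashSet<>();

    // Returns true if the file has not been shipped yet and should be
    // passed to JavaSparkContext.addFile; false if its name was already
    // registered and the call must be skipped.
    public boolean shouldAddFile(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return shippedNames.add(name); // add() is false for a duplicate
    }

    public static void main(String[] args) {
        ShippedFileTracker tracker = new ShippedFileTracker();
        // First occurrence of PigStreamingDepend.pl: ship it.
        System.out.println(tracker.shouldAddFile("/tmp/PigStreamingDepend.pl"));
        // Same name from the ship() clause: skip, avoiding the
        // IllegalArgumentException from NettyStreamManager.addFile.
        System.out.println(tracker.shouldAddFile("/scripts/PigStreamingDepend.pl"));
    }
}
{code}

A guard like this in SparkLauncher#addResourceToSparkJobWorkingDirectory would make the behavior independent of which file server implementation Spark picked.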
> Several ComputeSpec test cases fail
> -----------------------------------
>
> Key: PIG-5176
> URL: https://issues.apache.org/jira/browse/PIG-5176
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Nandor Kollar
> Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5176.patch
>
>
> Several ComputeSpec test cases failed on my cluster:
> ComputeSpec_5 - ComputeSpec_13
> These scripts have a ship() part in the define, where the ship includes the
> script file too, so we add the same file to the Spark context twice. This is
> not a problem with Hadoop, but it looks like Spark doesn't like adding the
> same file name twice:
> {code}
> Caused by: java.lang.IllegalArgumentException: requirement failed: File
> PigStreamingDepend.pl already registered.
> at scala.Predef$.require(Predef.scala:233)
> at
> org.apache.spark.rpc.netty.NettyStreamManager.addFile(NettyStreamManager.scala:69)
> at org.apache.spark.SparkContext.addFile(SparkContext.scala:1386)
> at org.apache.spark.SparkContext.addFile(SparkContext.scala:1348)
> at
> org.apache.spark.api.java.JavaSparkContext.addFile(JavaSparkContext.scala:662)
> at
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.addResourceToSparkJobWorkingDirectory(SparkLauncher.java:462)
> at
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.shipFiles(SparkLauncher.java:371)
> at
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.addFilesToSparkJob(SparkLauncher.java:357)
> at
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.uploadResources(SparkLauncher.java:235)
> at
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:222)
> at
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:290)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)