[ https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320250#comment-15320250 ]
liyunzhang_intel commented on PIG-4893:
---------------------------------------

[~sriksun] and [~pallavi.rao]: In the current spark branch, task deserialization takes quite long because bin/pig appends all jars under $PIG_HOME/lib/ and $PIG_HOME/lib/Spark to SPARK_YARN_DIST_FILES; the Spark client uploads all of these jars to HDFS, and the YARN container then spends time downloading them while deserializing the task. In PIG-4903_2.patch, we no longer append all jars to SPARK_YARN_DIST_FILES in bin/pig. In PIG-4893.patch, we instead distribute only the necessary jars dynamically by calling SparkContext.addJar(), which uploads each jar to HDFS so that the YARN containers can access it later. Because both of you are familiar with this part, please help review; I will fold this into the final patch of the spark branch when merging with trunk.

SparkContext.addJar:
{code}
/**
 * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
 * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
 * filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
 */
def addJar(path: String) {
  ...
}
{code}

PIG-4893.patch is based on eab9180 in the spark branch.

> Task deserialization time is too long for spark on yarn mode
> ------------------------------------------------------------
>
>                 Key: PIG-4893
>                 URL: https://issues.apache.org/jira/browse/PIG-4893
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4893.patch, time.PNG
>
>
> I found that task deserialization takes quite long when I run any of the
> pigmix scripts in spark on yarn mode; see the attached picture. The task
> duration is 3s, but task deserialization takes 13s.
> My env is hadoop2.6+spark1.6.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
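The intent of the change can be sketched roughly as follows. This is a minimal illustration only, not the actual patch: `collect_jars`, `register_jars`, and `FakeSparkContext` are hypothetical names, and the stub stands in for the real Scala `SparkContext.addJar` quoted above, which performs the actual upload.

```python
import os

def collect_jars(pig_home):
    """Gather jar paths under $PIG_HOME/lib/ and $PIG_HOME/lib/spark
    (the jars previously shipped wholesale via SPARK_YARN_DIST_FILES)."""
    jars = []
    for sub in ("lib", os.path.join("lib", "spark")):
        d = os.path.join(pig_home, sub)
        if os.path.isdir(d):
            for name in sorted(os.listdir(d)):
                if name.endswith(".jar"):
                    jars.append(os.path.join(d, name))
    return jars

class FakeSparkContext:
    """Stub standing in for org.apache.spark.SparkContext."""
    def __init__(self):
        self.jars = []

    def addJar(self, path):
        # The real addJar uploads the jar so YARN containers can fetch it
        # later, instead of shipping everything up front.
        self.jars.append(path)

def register_jars(sc, pig_home):
    """Register each needed jar with the context via addJar."""
    for jar in collect_jars(pig_home):
        sc.addJar(jar)
    return sc.jars
```

The point of the approach is that only the jars a job actually needs get registered per-context, rather than every jar under $PIG_HOME/lib being uploaded for every job.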