[ 
https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320250#comment-15320250
 ] 

liyunzhang_intel commented on PIG-4893:
---------------------------------------

[~sriksun] and [~pallavi.rao]:
    In the current spark branch, task deserialization takes a long time 
because we append all jars under $PIG_HOME/lib/ and $PIG_HOME/lib/Spark to 
SPARK_YARN_DIST_FILES. The Spark client uploads all of these jars to HDFS, and 
the YARN containers then spend time downloading them while deserializing the 
task. In PIG-4903_2.patch, we no longer append all jars to 
SPARK_YARN_DIST_FILES in bin/pig. In PIG-4893.patch, we instead distribute only 
the necessary jars dynamically by calling SparkContext.addJar(), which uploads 
each jar to HDFS so that the YARN containers can access it later.
Because both of you are familiar with this part, please help review. I will 
include this patch in the final spark branch patch when merging with trunk.

SparkContext.addJar
{code}
  /**
   * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
   * filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
   */
  def addJar(path: String) {
...
}
{code}
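
To illustrate the idea of shipping only the necessary jars instead of everything 
under $PIG_HOME/lib/, here is a minimal Java sketch. It is not the code in 
PIG-4893.patch: the class JarDistributor, its methods, and the prefix-based 
selection rule are all hypothetical, and the addJar callback merely stands in 
for a call such as SparkContext.addJar(path).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only (not part of Pig or Spark).
final class JarDistributor {

    // True if fileName is a jar whose name starts with one of the given prefixes.
    // Prefix matching is an assumed selection rule, not the patch's actual logic.
    static boolean isRequired(String fileName, List<String> requiredPrefixes) {
        if (!fileName.endsWith(".jar")) {
            return false;
        }
        for (String prefix : requiredPrefixes) {
            if (fileName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Lists the required jars in libDir as absolute paths, sorted by file name.
    // Returns an empty list if libDir does not exist or is not a directory.
    static List<String> selectJars(File libDir, List<String> requiredPrefixes) {
        List<String> selected = new ArrayList<>();
        File[] files = libDir.listFiles();
        if (files == null) {
            return selected;
        }
        Arrays.sort(files);
        for (File f : files) {
            if (f.isFile() && isRequired(f.getName(), requiredPrefixes)) {
                selected.add(f.getAbsolutePath());
            }
        }
        return selected;
    }

    // Hands each jar path to the callback and returns how many were registered.
    // With a real Spark context the callback would forward to addJar.
    static int distribute(List<String> jarPaths, Consumer<String> addJar) {
        for (String path : jarPaths) {
            addJar.accept(path);
        }
        return jarPaths.size();
    }
}
```

With a live context, the callback could be a method reference such as 
sparkContext::addJar, assuming a variable named sparkContext is in scope. 
Keeping the selection step separate from the registration step makes the jar 
list easy to inspect or log before anything is uploaded.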

PIG-4893.patch is based on eab9180 in spark branch.

> Task deserialization time is too long for spark on yarn mode
> ------------------------------------------------------------
>
>                 Key: PIG-4893
>                 URL: https://issues.apache.org/jira/browse/PIG-4893
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4893.patch, time.PNG
>
>
> I found that the task deserialization time is rather long when I run any of 
> the PigMix scripts in Spark on YARN mode (see the attached picture): the task 
> duration is 3s, but task deserialization takes 13s.
> My environment is Hadoop 2.6 + Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
