[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303337#comment-15303337 ]
liyunzhang_intel commented on PIG-4903:
---------------------------------------

[~sriksun]: thanks for your reply. Here is my understanding of the code you provided:
1. SPARK_JARS includes all the dependency jars in $PIG_HOME/lib/ and $PIG_HOME/lib/spark/, and we need to add those jars to the classpath of Pig.
2. SPARK_YARN_DIST_FILES includes all the dependency jars that need to be shipped.
3. SPARK_DIST_CLASSPATH includes all the dependency jars the executors will later need in spark on yarn mode.

There are 2 points in the code you provided that I don't understand:
1. Why do we need to exclude spark-yarn.jar from the shipped jars? Can you explain this in detail? I'm currently investigating the Spark code to understand it.
2. I found that we only need to ship the jars under $PIG_HOME/lib/ and add spark-assembly.jar to the classpath of Pig to make it run successfully:
{code}
if [ -n "$SPARK_HOME" ]; then
    echo "Using Spark Home: " ${SPARK_HOME}
    SPARK_JARS=`ls ${SPARK_HOME}/lib/spark-assembly*`
fi

for f in $PIG_HOME/lib/*.jar; do
    SPARK_JARS=${SPARK_JARS}:$f;
    SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
{code}
It is very strange that spark-assembly.jar is automatically uploaded by this code, while only spark-yarn.jar is uploaded in PIG-4667. If spark-assembly.jar is automatically uploaded, we need not ship the jars under $PIG_HOME/lib/spark/.

> Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and
> SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>
> There are some comments about bin/pig on
> https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all the Spark dependency jars, like
> spark-network-shuffle_2.10-1.6.1.jar, to the distcache (SPARK_YARN_DIST_FILES), then
> add them to the classpath of the executors (SPARK_DIST_CLASSPATH). Actually we need
> not copy all these dependency jars to SPARK_DIST_CLASSPATH, because all these
> dependency jars are included in spark-assembly.jar, and spark-assembly.jar is
> uploaded with the spark job.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
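The accumulation pattern discussed above can be exercised in isolation. This is a minimal sketch, not the actual bin/pig script: the temp directory and jar names (pig-core.jar, spark-network-shuffle.jar) are hypothetical stand-ins for $PIG_HOME/lib; only the variable-building pattern and the trailing sed strip match the quoted snippet.

```shell
# Sketch of how the loop builds the three variables, using a throwaway
# lib directory with two illustrative (hypothetical) jar names.
PIG_LIB=$(mktemp -d)
touch "$PIG_LIB/pig-core.jar" "$PIG_LIB/spark-network-shuffle.jar"

SPARK_JARS=
SPARK_YARN_DIST_FILES=
SPARK_DIST_CLASSPATH=
for f in "$PIG_LIB"/*.jar; do
    SPARK_JARS=${SPARK_JARS}:$f                               # local (driver-side) classpath
    SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f  # shipped to the cluster
    # \${PWD} stays literal so it expands on the executor, not here
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/$(basename "$f")
done

# The first iteration leaves a leading comma; strip it the same way
# the quoted bin/pig snippet does.
SPARK_YARN_DIST_FILES=$(echo "${SPARK_YARN_DIST_FILES}" | sed 's/^,//g')
echo "$SPARK_YARN_DIST_FILES"
rm -rf "$PIG_LIB"
```

Each jar shows up three times: once as a local path, once as a file:// URI for the distcache, and once as a literal ${PWD}-relative name that only makes sense in the executor's working directory.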
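The if/else in the quoted review snippet keeps spark-assembly on the local classpath while leaving it out of the shipped set. A standalone sketch of just that branch logic follows; the jar names are hypothetical, and a portable `case` pattern stands in for the bash-only `[[ ... == ... ]]` test used in the review code.

```shell
# Sketch of the exclusion branch: spark-assembly* goes on the classpath
# only, everything else is both on the classpath and shipped.
PIG_LIB=$(mktemp -d)
touch "$PIG_LIB/spark-assembly-1.6.1.jar" "$PIG_LIB/joda-time.jar"  # hypothetical names

SPARK_JARS=
SPARK_YARN_DIST_FILES=
for f in "$PIG_LIB"/*.jar; do
    case "$f" in
        "$PIG_LIB"/spark-assembly*)
            # retained on the classpath, but not shipped
            SPARK_JARS=${SPARK_JARS}:$f ;;
        *)
            SPARK_JARS=${SPARK_JARS}:$f
            SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f ;;
    esac
done
echo "$SPARK_YARN_DIST_FILES"
rm -rf "$PIG_LIB"
```

With this branching, SPARK_YARN_DIST_FILES ends up containing joda-time.jar but no spark-assembly entry, which is exactly the shipped/retained split the comment asks about.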