[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318146#comment-15318146 ]
liyunzhang_intel commented on PIG-4903:
---------------------------------------

[~rohini] and [~sriksun]: sorry for the late reply, and thanks for your comments. To summarize the main points:

Before
* We appended all jars under $PIG_HOME/build/lib/Spark/*.jar and $PIG_HOME/lib/*.jar to the classpath to make jobs run successfully.

Now
* We require users to export SPARK_HOME so that we can locate spark-assembly*.jar, and we append spark-assembly*.jar to the classpath.
* Users can export SPARK_JAR=hdfs:xx/spark-assembly.jar to put spark-assembly*.jar in the distributed cache, which avoids uploading spark-assembly*.jar to HDFS every time a spark job starts. Question: should we require end users to do this step? If an end user does not, spark-client will automatically upload spark-assembly*.jar because spark-assembly.jar is on the classpath. [~rohini] and [~sriksun], please give your suggestions.
* For the jars under $PIG_HOME/lib/*.jar, [~rohini] suggested in PIG-4893 that we dynamically load only the needed dependency jars rather than all jars in $PIG_HOME/lib/, with the basic necessary jars being JarManager.getDefaultJars().

So the code in bin/pig is now:
{code}
################# ADDING SPARK DEPENDENCIES ##################
# Please specify SPARK_HOME first so that we can locate $SPARK_HOME/lib/spark-assembly*.jar;
# we will add spark-assembly*.jar to the classpath.
if [ -z "$SPARK_HOME" ]; then
    echo "Error: SPARK_HOME is not set!"
    exit 1
fi

# Please specify SPARK_JAR, the hdfs path of spark-assembly*.jar, to allow YARN to cache
# spark-assembly*.jar on nodes so that it doesn't need to be distributed each time an application runs.
if [ -z "$SPARK_JAR" ]; then
    echo "Error: SPARK_JAR is not set. SPARK_JAR stands for the hdfs location of spark-assembly*.jar. This allows YARN to cache spark-assembly*.jar on nodes so that it doesn't need to be distributed each time an application runs."
    exit 1
fi

if [ -n "$SPARK_HOME" ]; then
    echo "Using Spark Home: " ${SPARK_HOME}
    SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
    CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
fi
################# ADDING SPARK DEPENDENCIES ##################
{code}

[~rohini]:
bq. Can we call it SPARK_LIB_URIS instead of SPARK_JAR and allow comma separated list of multiple hdfs paths? Also support -Dspark.lib.uris so it can be passed via commandline if not as an environmental variable.

SPARK_JAR stands for the hdfs location of spark-assembly*.jar, so I think it is better not to rename it to SPARK_LIB_URIS, because this environment variable is used by the Spark code itself to locate the hdfs location of spark-assembly*.jar.
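For illustration, here is a rough sketch of the end-user workflow with these two variables. The install path, HDFS directory, and assembly file name below are only hypothetical examples (not values mandated by this patch), and I am assuming the spark engine is selected with {{pig -x spark}}:
{code}
# Hypothetical local Spark install; bin/pig looks for $SPARK_HOME/lib/spark-assembly*.jar under it
export SPARK_HOME=/opt/spark

# Upload the assembly to HDFS once (hypothetical path) so YARN can cache it across applications
hdfs dfs -mkdir -p /user/pig/share
hdfs dfs -put $SPARK_HOME/lib/spark-assembly-1.6.1-hadoop2.6.0.jar /user/pig/share/

# Point SPARK_JAR at the cached copy so the assembly is not re-uploaded on every job
export SPARK_JAR=hdfs:///user/pig/share/spark-assembly-1.6.1-hadoop2.6.0.jar

# Run a Pig script on the spark engine
pig -x spark myscript.pig
{code}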
> Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>         Attachments: PIG-4903.patch, PIG-4903_1.patch
>
> There are some comments about bin/pig on https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to the classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all spark dependency jars, like spark-network-shuffle_2.10-1.6.1.jar, to the distributed cache (SPARK_YARN_DIST_FILES) and then add them to the classpath of the executors (SPARK_DIST_CLASSPATH). Actually we need not copy all these dependency jars to SPARK_DIST_CLASSPATH, because they are already included in spark-assembly.jar and spark-assembly.jar is uploaded with the spark job.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)