[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320244#comment-15320244 ]
liyunzhang_intel commented on PIG-4903:
---------------------------------------

[~sriksun]: Only one point is left to be discussed for this jira:
* Should we require end-users to specify SPARK_HOME, so that we can locate spark-assembly*.jar and add it to the classpath? Or should we keep the original code and append all jars under $PIG_HOME/lib/Spark/ to the classpath?

My suggestion is to require end-users to specify SPARK_HOME. Since we already mandate that users export SPARK_JAR=hdfs:xxx/xxx/spark-assembly*.jar, it is acceptable for them to do the following steps:
1. upload spark-assembly*.jar to hdfs
2. export SPARK_JAR=hdfs:xxx/xxx/spark-assembly*.jar
3. export SPARK_HOME

For reference, [hive on spark|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-SparkInstallation] also requires users to specify SPARK_HOME even though they compile with maven (maybe because the spark version can differ between compile time and runtime). PIG-4903_2.patch requires end-users to specify SPARK_HOME; a rough sketch of how bin/pig could pick up the assembly jar from SPARK_HOME appears at the end of this message.

> Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>         Attachments: PIG-4903.patch, PIG-4903_1.patch, PIG-4903_2.patch
>
>
> There are some comments about bin/pig on https://reviews.apache.org/r/45667/#comment198955:
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to the classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is that spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all spark dependency jars such as spark-network-shuffle_2.10-1.6.1.jar to the distributed cache (SPARK_YARN_DIST_FILES) and then add them to the classpath of the executors (SPARK_DIST_CLASSPATH). Actually we need not copy all these dependency jars to SPARK_DIST_CLASSPATH because they are already included in spark-assembly.jar, and spark-assembly.jar is uploaded with the spark job.
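To illustrate the SPARK_HOME option discussed above, here is a minimal sketch of what the relevant part of bin/pig could look like. This is not the code in PIG-4903_2.patch; it assumes a Spark 1.x layout where the assembly jar lives under $SPARK_HOME/lib, and the error messages are placeholders:

{code}
# Minimal sketch (not the actual PIG-4903_2.patch): locate the single
# assembly jar via SPARK_HOME instead of looping over every jar in
# $PIG_HOME/lib. Assumes the Spark 1.x layout ($SPARK_HOME/lib).
if [ -z "$SPARK_HOME" ]; then
    echo "Error: SPARK_HOME is not set; it is required to run Pig on Spark" >&2
    exit 1
fi

# Pick up the assembly jar and add only it to the client classpath.
for f in "$SPARK_HOME"/lib/spark-assembly*.jar; do
    if [ -e "$f" ]; then
        SPARK_ASSEMBLY_JAR=$f
    fi
done

if [ -z "$SPARK_ASSEMBLY_JAR" ]; then
    echo "Error: no spark-assembly*.jar found under $SPARK_HOME/lib" >&2
    exit 1
fi

CLASSPATH=${CLASSPATH}:${SPARK_ASSEMBLY_JAR}
# SPARK_JAR still points at the copy of the assembly already uploaded to
# HDFS, so nothing needs to be shipped through SPARK_YARN_DIST_FILES here.
{code}

With this approach the per-jar loop over $PIG_HOME/lib and the SPARK_YARN_DIST_FILES / SPARK_DIST_CLASSPATH bookkeeping quoted in the issue description would no longer be needed, since the executors get their classes from the assembly jar referenced by SPARK_JAR.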