[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318146#comment-15318146 ]

liyunzhang_intel commented on PIG-4903:
---------------------------------------

[~rohini] and [~sriksun]: sorry for the late reply, and thanks for your comments.

To summarize some points:

Before:
* We appended all jars under $PIG_HOME/build/lib/Spark/*.jar and $PIG_HOME/lib/*.jar to the classpath to make it run successfully.

Now:
* We require users to export SPARK_HOME, so that we can locate spark-assembly*.jar under $SPARK_HOME/lib and append it to the classpath.
* We can export SPARK_JAR=hdfs:xx/spark-assembly.jar so that spark-assembly*.jar is served from the distributed cache, which avoids uploading spark-assembly*.jar to hdfs every time we start a spark job (see the sketch right after this list). Question: should we require end users to do this step? If the end user does not, spark-client will automatically upload spark-assembly*.jar because spark-assembly*.jar is on the classpath. [~rohini] and [~sriksun], please give your suggestions.
* For the jars under $PIG_HOME/lib/*.jar, in PIG-4893 [~rohini] suggested we dynamically load only the dependency jars that are actually needed rather than all jars in $PIG_HOME/lib/, and the basic necessary jars are JarManager.getDefaultJars().
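
For the SPARK_JAR point, here is a minimal usage sketch of what I have in mind (the hdfs path and assembly version below are placeholders, not a required layout): upload the assembly once, then point SPARK_JAR at it so YARN can reuse the cached copy instead of shipping it with every job.
{code}
# Placeholder path and version -- adjust to the actual cluster layout.
hadoop fs -mkdir -p /user/pig/share/lib
hadoop fs -put $SPARK_HOME/lib/spark-assembly-1.6.1-hadoop2.6.0.jar /user/pig/share/lib/
# Point SPARK_JAR at the uploaded assembly so it is not re-uploaded for every job.
export SPARK_JAR=hdfs:///user/pig/share/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
{code}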

So the code in bin/pig is now:
{code}
################# ADDING SPARK DEPENDENCIES ##################
# Please specify SPARK_HOME first so that we can locate
# $SPARK_HOME/lib/spark-assembly*.jar; we will add spark-assembly*.jar
# to the classpath.
if [ -z "$SPARK_HOME" ]; then
   echo "Error: SPARK_HOME is not set!"
   exit 1
fi

# Please specify SPARK_JAR, which is the hdfs path of spark-assembly*.jar,
# to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't
# need to be distributed each time an application runs.
if [ -z "$SPARK_JAR" ]; then
   echo "Error: SPARK_JAR is not set. SPARK_JAR stands for the hdfs location of spark-assembly*.jar. This allows YARN to cache spark-assembly*.jar on nodes so that it doesn't need to be distributed each time an application runs."
   exit 1
fi

if [ -n "$SPARK_HOME" ]; then
    echo "Using Spark Home: ${SPARK_HOME}"
    SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
    CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
fi
################# ADDING SPARK DEPENDENCIES ##################

{code}
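
As a quick usage check (the paths below are placeholders, and I am assuming the usual spark-branch invocation with exectype "spark"), running a job with both variables exported would look roughly like:
{code}
# Assumed invocation on the spark branch; adjust paths to the real cluster.
export SPARK_HOME=/opt/spark
export SPARK_JAR=hdfs:///user/pig/share/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
pig -x spark myscript.pig
{code}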

[~rohini]:
bq.Can we call it SPARK_LIB_URIS instead of SPARK_JAR and allow comma separated 
list of multiple hdfs paths? Also support -Dspark.lib.uris so it can be passed 
via commandline if not as an environmental variable.

SPARK_JAR stands for the hdfs location of spark-assembly*.jar, so I don't think it is better to rename it to SPARK_LIB_URIS: this environment variable is read by the Spark code itself to locate the hdfs location of spark-assembly*.jar, so Spark would not pick up a different name.
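
For reference, a minimal sketch of what I mean (based on my reading of the Spark 1.x YARN client; please double-check against the Spark version we build with): Spark looks up the SPARK_JAR environment variable (or the spark.yarn.jar property) to find the assembly on hdfs, so the name is fixed by Spark rather than by Pig.
{code}
# Both of these point the Spark 1.x YARN client at a cached assembly on hdfs;
# the names SPARK_JAR / spark.yarn.jar come from Spark, not from Pig.
export SPARK_JAR=hdfs:///user/pig/share/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
# or, equivalently, via a Spark property:
#   spark.yarn.jar=hdfs:///user/pig/share/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
{code}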



> Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and 
> SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>         Attachments: PIG-4903.patch, PIG-4903_1.patch
>
>
> There are some comments about bin/pig on 
> https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However, this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all spark dependency jars, like 
> spark-network-shuffle_2.10-1.6.1.jar, to the distributed cache 
> (SPARK_YARN_DIST_FILES) and then add them to the executor classpath 
> (SPARK_DIST_CLASSPATH). Actually we need not copy all these dependency jars 
> to SPARK_DIST_CLASSPATH, because they are all already included in 
> spark-assembly.jar and spark-assembly.jar is uploaded with the spark job.


