[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318643#comment-15318643 ]
Rohini Palaniswamy commented on PIG-4903:
-----------------------------------------
bq. SPARK_JAR stands for the hdfs location of spark-assembly*.jar
Sounds good; I did not know that before. It would be better to keep the same
name that the Spark code recognizes instead of inventing a new one.
bq. We force users to export SPARK_HOME, then we can locate
spark-assembly*.jar and append it to the classpath.
I am not sure whether this is required, or whether you can instead include all
jars under $PIG_HOME/lib/Spark/*.jar in the frontend classpath. I am not aware
of the version compatibility issues with Spark, so I will leave that to you and
Srikanth to decide.
bq. If the end user does not do this, spark-client will automatically upload
spark-assembly*.jar because spark-assembly.jar is in the classpath.
Please mandate that SPARK_JAR be set so that the Spark client does not upload
the assembly jar automatically.
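
A minimal sketch of how bin/pig could enforce this. The helper name
require_spark_jar is hypothetical and not part of the existing script; the only
thing taken from the discussion above is that SPARK_JAR should hold the HDFS
location of spark-assembly*.jar.

```shell
#!/usr/bin/env bash
# Sketch only: require_spark_jar is a hypothetical helper, not existing bin/pig code.
# It returns non-zero when SPARK_JAR is unset or empty, so the caller can abort
# before the Spark client falls back to uploading the assembly jar itself.
require_spark_jar() {
    if [ -z "${SPARK_JAR:-}" ]; then
        echo "ERROR: SPARK_JAR is not set; export it to the HDFS location of spark-assembly*.jar" >&2
        return 1
    fi
    return 0
}

# Assumed call site in bin/pig (left commented out so the sketch has no side effects):
# require_spark_jar || exit 1
```

The check only tests that the variable is non-empty; it does not verify that the
path exists on HDFS.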
bq. For jars under $PIG_HOME/lib/*.jar, in PIG-4893, @rohini suggested we
dynamically load only the dependency jars, not all jars in $PIG_HOME/lib/; the
basic necessary jars are those from JarManager.getDefaultJars().
In the frontend, you can add everything under lib/ to the classpath, but you
should distribute only the required jars to the backend.
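
A rough sketch of that split, under stated assumptions: both function names are
hypothetical, and the list of required jars passed to build_dist_files is assumed
to come from elsewhere (for example, from whatever JarManager.getDefaultJars()
resolves to). It is not the patch itself.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and not part of bin/pig.

# Frontend: join every jar under the given lib directory into a
# colon-separated classpath string.
build_frontend_classpath() {
    local lib_dir="$1"
    local cp=""
    local f
    for f in "$lib_dir"/*.jar; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        cp="${cp:+$cp:}$f"
    done
    printf '%s\n' "$cp"
}

# Backend: turn only the jars passed as arguments into a comma-separated
# file:// list, the same shape bin/pig uses for SPARK_YARN_DIST_FILES.
build_dist_files() {
    local dist=""
    local f
    for f in "$@"; do
        dist="${dist:+$dist,}file://$f"
    done
    printf '%s\n' "$dist"
}
```

With these, the frontend could use
CLASSPATH=${CLASSPATH}:$(build_frontend_classpath "$PIG_HOME/lib"), while only
the explicitly listed jars would end up in SPARK_YARN_DIST_FILES.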
> Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and
> SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
> Key: PIG-4903
> URL: https://issues.apache.org/jira/browse/PIG-4903
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Attachments: PIG-4903.patch, PIG-4903_1.patch
>
>
> There are some comments about bin/pig on
> https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all Spark dependency jars, such as the
> spark-network-shuffle_2.10-1.6.1 jar, to the distributed cache
> (SPARK_YARN_DIST_FILES), then add them to the executor classpath
> (SPARK_DIST_CLASSPATH). Actually, we need not copy all these dependency jars
> to SPARK_DIST_CLASSPATH, because they are all included in spark-assembly.jar,
> and spark-assembly.jar is uploaded with the Spark job.