[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315949#comment-15315949 ]

Rohini Palaniswamy commented on PIG-4903:
-----------------------------------------

Can we call it SPARK_LIB_URIS instead of SPARK_JAR and allow a comma-separated 
list of multiple hdfs paths? Also support -Dspark.lib.uris so it can be passed 
via the command line if not as an environment variable.
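For illustration, usage could then look like the sketch below. SPARK_LIB_URIS 
and spark.lib.uris are only the names suggested above, not an existing Pig 
option, and the hdfs paths are made up; this also assumes -D properties reach 
the Pig job the usual way.

{code}
# Hypothetical usage of the proposed variable (sketch only)
export SPARK_LIB_URIS=hdfs:///libs/spark-assembly-1.6.1.jar,hdfs:///libs/extra-udfs.jar
pig -x spark script.pig

# or equivalently on the command line, without the environment variable
pig -x spark -Dspark.lib.uris=hdfs:///libs/spark-assembly-1.6.1.jar script.pig
{code}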

bq. Will this mandate SPARK_HOME to be set even when the execution is not 
spark (e.g. mapreduce, tez, etc.)?
  We should not mandate it if the execution mode is not spark.
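A minimal sketch of such a guard in bin/pig (illustrative only; "exectype" is 
a stand-in for however the script detects the requested execution engine):

{code}
# Sketch: require SPARK_HOME only when spark execution is requested
if [ "$exectype" = "spark" ] && [ -z "$SPARK_HOME" ]; then
    echo "Error: SPARK_HOME must be set to run Pig on Spark" >&2
    exit 1
fi
{code}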

bq. Do we want to warn that SPARK_HOME is not set and continue execution using 
individual jars
  Though it might make things easy for users, it would be better to make it a 
strict requirement and not ship individual jars at all. Shipping 128MB+ of 
data to every task node, when the user might not even be processing that much 
input data, is really bad. Worse, the nodemanager will delete that data after 
the application completes, because the scope of that LocalResource would be 
APPLICATION rather than PRIVATE or PUBLIC. That is bad not only in terms of 
efficiency but also performance: if users launch a couple of spark jobs 
together, it can really slow all of them down, because nodemanagers have only 
~5 threads to do localization; the relevant defaults are quoted below.

{code}
  // From org.apache.hadoop.yarn.conf.YarnConfiguration:
  /** Number of threads to handle localization requests.*/
  public static final String NM_LOCALIZER_CLIENT_THREAD_COUNT =
    NM_PREFIX + "localizer.client.thread-count";
  public static final int DEFAULT_NM_LOCALIZER_CLIENT_THREAD_COUNT = 5;
  
  /** Number of threads to use for localization fetching.*/
  public static final String NM_LOCALIZER_FETCH_THREAD_COUNT = 
    NM_PREFIX + "localizer.fetch.thread-count";
  public static final int DEFAULT_NM_LOCALIZER_FETCH_THREAD_COUNT = 4;
{code}
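For reference, those constants resolve to the following yarn-site.xml 
properties (NM_PREFIX is "yarn.nodemanager."). The values shown are just the 
defaults quoted above, included to make the bottleneck concrete, not as a 
tuning recommendation:

{code}
<property>
  <name>yarn.nodemanager.localizer.client.thread-count</name>
  <value>5</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.fetch.thread-count</name>
  <value>4</value>
</property>
{code}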




> Avoid adding all spark dependency jars to SPARK_YARN_DIST_FILES and 
> SPARK_DIST_CLASSPATH
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-4903
>                 URL: https://issues.apache.org/jira/browse/PIG-4903
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>         Attachments: PIG-4903.patch, PIG-4903_1.patch
>
>
> There are some comments about bin/pig on 
> https://reviews.apache.org/r/45667/#comment198955.
> {code}
> ################# ADDING SPARK DEPENDENCIES ##################
> # Spark typically works with a single assembly file. However, this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all the spark dependency jars, like 
> spark-network-shuffle_2.10-1.6.1.jar, to the distributed cache 
> (SPARK_YARN_DIST_FILES) and then add them to the classpath of the executors 
> (SPARK_DIST_CLASSPATH). Actually we need not copy all these dependency jars 
> to SPARK_DIST_CLASSPATH, because they are already included in 
> spark-assembly.jar, and spark-assembly.jar is uploaded with the spark job.
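> A possible shape of the change implied here (a sketch of the idea only, not 
> the attached patch) would keep the jars on the local client classpath but 
> stop shipping them, since spark-assembly.jar already bundles them:
> {code}
> # Sketch: rely on spark-assembly.jar (uploaded by the spark-yarn client)
> # for the executor classpath; do not ship individual dependency jars
> for f in $PIG_HOME/lib/*.jar; do
>     SPARK_JARS=${SPARK_JARS}:$f
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> # SPARK_YARN_DIST_FILES / SPARK_DIST_CLASSPATH intentionally not populated
> {code}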



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
