[
https://issues.apache.org/jira/browse/SAMZA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049293#comment-14049293
]
Yan Fang commented on SAMZA-307:
--------------------------------
{quote}
That sounds like MapReduce's DistributedCache. I'm not terribly keen on this,
as there's a lot of potential for dependency versions to go out of sync,
causing surprising errors. There's a lot less that can go wrong with one
self-contained tgz file.
Also, it's worth considering how to handle dependencies added by the user's job
(any libraries that it depends on). Unfortunately the location of those
dependencies depends on the build tool they're using.
{quote}
To simplify the whole process, it seems we can assume that 1) users have Maven
installed; 2) users package their code in one jar file or provide the folder
that contains all the jars. The Maven assembly plugin we are using for
hello-samza should work this way. In terms of users' dependencies, since we
cannot predict what they will use, we may want to provide them with a list of
already-included dependencies, or at least some of them, such as samza-\*.jar,
hadoop-\*.jar, etc., so that users can exclude those when they compile, which
lowers the risk of handling different versions of the same jar.
{quote}
What do Tez and other YARN frameworks do?
{quote}
I checked the Tez [install/deploy|http://tez.incubator.apache.org/install.html] site;
currently it uses a similar approach to the one we use in [HDFS
deploy|http://samza.incubator.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html],
and the location can be arbitrary. Maybe [~zjshen] has a better answer for
this: which location should we use when deploying Samza to HDFS?
> Simplify YARN deploy procedure
> -------------------------------
>
> Key: SAMZA-307
> URL: https://issues.apache.org/jira/browse/SAMZA-307
> Project: Samza
> Issue Type: Improvement
> Reporter: Yan Fang
>
> Currently, we have two ways of deploying a Samza job to a YARN cluster, from
> [HDFS|https://samza.incubator.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html]
> and [HTTP|https://samza.incubator.apache.org/learn/tutorials/0.7.0/run-in-multi-node-yarn.html],
> but neither of them works out of the box. Users have to go through the tutorial,
> add dependencies, recompile, put the job package on HDFS or an HTTP server, and
> then finally run. I feel this is a little cumbersome sometimes. We may be able to
> provide a simpler way to deploy the job.
> When users have YARN and HDFS in the same cluster (such as CDH5), we can
> provide a job-submit script which does the following:
> 1. take the cluster configuration
> 2. call some Java code to upload the assembly (all the jars Samza needs, already
> compiled) and the user's job jar (which changes frequently) to HDFS; a rough
> sketch is shown below
> 3. run the job as usual.
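> A minimal sketch of what step 2 could look like, assuming a hypothetical
> helper class (not existing Samza code), the Hadoop client jars on the
> classpath, and the cluster configuration from step 1 available as the usual
> core-site.xml / hdfs-site.xml:
> {code:java}
> import java.io.IOException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Hypothetical helper for the proposed job-submit script.
> public class JobSubmitHelper {
>
>   /**
>    * Uploads the pre-built Samza assembly and the user's job jar to an HDFS
>    * staging directory, and returns the HDFS location of the assembly, which
>    * is what yarn.package.path should point at before running the job (step 3).
>    */
>   public static String uploadJobPackage(String localPackage, String localJobJar,
>       String hdfsStagingDir) throws IOException {
>     Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
>     FileSystem fs = FileSystem.get(conf);
>
>     Path stagingDir = fs.makeQualified(new Path(hdfsStagingDir));
>     fs.mkdirs(stagingDir);
>
>     // Upload the self-contained assembly (all the jars Samza needs; rarely changes).
>     Path packageOnHdfs = new Path(stagingDir, new Path(localPackage).getName());
>     fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */,
>         new Path(localPackage), packageOnHdfs);
>
>     // Upload the user's job jar separately, since only this small jar needs to
>     // be re-uploaded when the job code changes.
>     Path jobJarOnHdfs = new Path(stagingDir, new Path(localJobJar).getName());
>     fs.copyFromLocalFile(false, true, new Path(localJobJar), jobJarOnHdfs);
>
>     return packageOnHdfs.toString();
>   }
> }
> {code}
> This is only an illustration of the idea; the exact layout of the staging
> directory and how the user's jar gets onto the job's classpath would still
> need to be decided.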
> Therefore, users only need to run one command *instead of*:
> 1. going step by step through the tutorial for their first job
> 2. assembling all the code and uploading it to HDFS manually every time they
> make changes to their job.
> (Yes, I learnt it from [Spark's YARN
> deploy|http://spark.apache.org/docs/latest/running-on-yarn.html] and [their
> code|https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala]
> )
> When users only have YARN, I think they have no choice but to start the HTTP
> server as described in the tutorial.
> What do you think? Does the simplification make sense? Or will Samza have
> some difficulties (issues) if we do the deploy this way? Thank you.
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)