[
https://issues.apache.org/jira/browse/SAMZA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047973#comment-14047973
]
Yan Fang commented on SAMZA-307:
--------------------------------
{quote}
Where would that deployment script live? We'd probably need to make a binary
release, so that users can download a Samza .tar.gz file, unpack it, and run
the deployment script directly. I've opened SAMZA-311 for that.
{quote}
This script, as you described in SAMZA-311, can live under the /bin directory
in the binary release.
{quote}
The process involves several distinct steps (building a job package, uploading
it to HDFS, telling YARN to run the job). While it's useful to do all three
steps at once for projects that need it, it would be worth separating out the
stages and allowing users to run only some of them (e.g. skip the job package
building for projects that already have a suitable build process). It might
even make sense to make those steps separate JIRAs, because building a job
package is quite distinct from uploading a package to HDFS.
{quote}
The thing I am not sure about is whether Samza can run if we upload two
separate packages, one containing the Samza system jars and one containing the
user's jar, to the same HDFS folder. If this is possible, users only need to
compile/package their own jars with whatever tools they like (e.g. Eclipse
export...) and _our deploy script only takes care of uploading the
pre-packaged Samza package and the users' jars into the same folder_ and, of
course, running the job. If it is not possible, _our script needs to assemble
the Samza jars and the users' jars into one package and then upload it_
(assembling depends on external tools such as Maven or Gradle).
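To make the first option concrete, here is a minimal sketch of uploading both
packages into one HDFS folder with the Hadoop FileSystem API (not existing
Samza code; the class name, folder layout, and file names are hypothetical):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PackageUploader {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml / hdfs-site.xml from the classpath (HADOOP_CONF_DIR)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One shared folder holding both packages (layout is hypothetical)
    Path jobDir = new Path("/samza/jobs/my-job");
    fs.mkdirs(jobDir);

    // The pre-packaged Samza system jars, built once...
    fs.copyFromLocalFile(new Path("samza-dist.tar.gz"), jobDir);
    // ...and the user's own jar, rebuilt frequently, side by side
    fs.copyFromLocalFile(new Path("target/my-job.jar"), jobDir);
  }
}
{code}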
Agreed, allowing users to run only some of the steps, like bin/grid in
hello-samza, is helpful. We can put those steps into separate JIRAs as
subtasks if necessary, once we agree on exactly which steps to include in the
deploy script.
{quote}
How will the script choose the HDFS location to use?
{quote}
Can we just upload to, say, /user/username/.staging/applicationId?
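For example, a sketch of deriving that location (it mirrors the .staging
convention YARN's own clients use; the application id below is illustrative):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingLocation {
  // Resolves to /user/<username>/.staging/<applicationId> on HDFS, since
  // getHomeDirectory() returns /user/<username> by default.
  public static Path forApplication(FileSystem fs, String applicationId) {
    return new Path(fs.getHomeDirectory(), ".staging/" + applicationId);
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(forApplication(fs, "application_1400000000000_0001"));
  }
}
{code}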
{quote}
Will the job package in HDFS get cleaned up after a job is shut down?
{quote}
I think this is possible if the job is shut down gracefully, but it seems
impossible if the job is killed forcefully by the user. Since HDFS storage is
cheap, though, cleanup may not be an urgent feature.
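A sketch of the graceful case, using FileSystem.deleteOnExit (the staging path
is hypothetical). Hadoop deletes the marked path when the filesystem is closed
from its JVM shutdown hook, so a forceful kill (e.g. kill -9) skips the
cleanup, which matches the caveat above:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingCleanup {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path stagingDir = new Path(fs.getHomeDirectory(),
                               ".staging/application_1400000000000_0001");
    // Best effort: removed on normal JVM exit, leaked on a hard kill
    fs.deleteOnExit(stagingDir);
    // ... upload the packages and run the job here ...
  }
}
{code}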
{quote}
Can we support Hadoop clusters that have security enabled for HDFS?
{quote}
If we use Java to upload to HDFS, we may need to take care of security
ourselves. If we depend purely on, say, "hadoop fs -put ...", it seems fine
on a security-enabled cluster. I am not sure which approach is better; the
latter is definitely simpler, but it may have some side effects.
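A sketch of the Java approach on a Kerberos-secured cluster (the principal and
keytab path are placeholders): log in via UserGroupInformation before touching
HDFS. The "hadoop fs -put" alternative avoids this code because the shell
picks up the user's existing kinit ticket.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Usually already set in core-site.xml on a secured cluster
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "samza-user@EXAMPLE.COM", "/etc/security/keytabs/samza.keytab");

    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("my-job.tar.gz"),
                         new Path(fs.getHomeDirectory(), ".staging/my-job"));
  }
}
{code}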
> Simplify YARN deploy procedure
> -------------------------------
>
> Key: SAMZA-307
> URL: https://issues.apache.org/jira/browse/SAMZA-307
> Project: Samza
> Issue Type: Improvement
> Reporter: Yan Fang
>
> Currently, we have two ways of deploying a Samza job to a YARN cluster, from
> [HDFS|https://samza.incubator.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html]
> and
> [HTTP|https://samza.incubator.apache.org/learn/tutorials/0.7.0/run-in-multi-node-yarn.html],
> but neither of them works out of the box. Users have to go through the
> tutorial, add dependencies, recompile, put the job package on HDFS or an
> HTTP server, and then finally run it. I feel this is a little cumbersome
> sometimes. We may be able to provide a simpler way to deploy the job.
> When users have YARN and HDFS in the same cluster (such as CDH5), we can
> provide a job-submit script which does the following (a rough sketch of the
> hand-off follows the list):
> 1. take the cluster configuration
> 2. call some Java code to upload the assembly (all the jars Samza needs,
> already compiled) and the user's job jar (which changes frequently) to HDFS
> 3. run the job as usual.
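> A rough sketch of the hand-off in step 3 (the wrapper class and paths are
> hypothetical; the properties file would set yarn.package.path to the package
> uploaded in step 2):
> {code:java}
> import org.apache.samza.job.JobRunner;
>
> public class JobSubmit {
>   public static void main(String[] args) throws Exception {
>     // Equivalent to bin/run-job.sh: delegates to Samza's existing
>     // JobRunner entry point with a job config on the local filesystem.
>     JobRunner.main(new String[] {
>         "--config-factory=" +
>             "org.apache.samza.config.factories.PropertiesConfigFactory",
>         "--config-path=file:///path/to/my-job.properties"
>     });
>   }
> }
> {code}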
> Therefore, users only need to run one command *instead of*:
> 1. going step by step through the tutorial for their first job
> 2. assembling all the code and uploading it to HDFS manually every time they
> make changes to their job.
> (Yes, I learned this from [Spark's YARN
> deploy|http://spark.apache.org/docs/latest/running-on-yarn.html] and [their
> code|https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala].)
> When users only have YARN, I think they have no option but to start the HTTP
> server as described in the tutorial.
> What do you think? Does the simplification make sense? Or will Samza run
> into difficulties (issues) if we do the deploy this way? Thank you.
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)