[GitHub] samza pull request #296: SAMZA-1416 : Better logging around the exception wh...

2017-09-15 Thread dnishimura
GitHub user dnishimura opened a pull request:

https://github.com/apache/samza/pull/296

SAMZA-1416: Better logging around the exception where class loading failed 
in initializing the SystemFactory for an input/output system

Also added test coverage for the Util.getObj method.
@nickpan47 @jmakes 
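
For context, the change is in this spirit; the snippet below is only a
simplified, hypothetical sketch of a getObj-style helper (not the actual
patch, and not the real Util.getObj signature):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ReflectionUtilSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ReflectionUtilSketch.class);

  // Log which class failed to load before rethrowing, so a misconfigured
  // systems.<system-name>.samza.factory value is obvious from the job logs.
  @SuppressWarnings("unchecked")
  public static <T> T getObj(String className) {
    try {
      return (T) Class.forName(className).newInstance();
    } catch (ClassNotFoundException | InstantiationException | IllegalAccessException e) {
      LOG.error("Failed to instantiate class {}", className, e);
      throw new RuntimeException("Could not load " + className, e);
    }
  }
}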

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dnishimura/samza samza-1416

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/samza/pull/296.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #296


commit a0e70da78925c6a3fdf9f51ae3e2e34d132d0dd1
Author: Daniel Nishimura 
Date:   2017-09-15T23:07:15Z

Added better logging to the Util.getObj method when an exception is thrown.




---


Re: [VOTE] SEP-8: Add in-memory system consumer & producer

2017-09-15 Thread Bharath Kumarasubramanian
Thanks for your feedback. Answers inline


On 9/14/17, 1:23 AM, "Yi Pan"  wrote:

Hi, Bharath,

Overall looks good! I have the following comments:

i) Question on the Type of IME + data partition:

How do we enforce that the user adds an IME with the expected partition id to the
corresponding sub-collection?
For IME as the data source, we will take a collection instead of a collection 
of collections, since we already know the partition information.
I will update the wiki to make this clearer and more explicit. Let me know if 
this is acceptable.
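
For illustration only (the types and names below are made up, not the SEP-8
API): because each envelope already carries its partition, a flat collection
is enough as the data source, and any per-partition grouping can be derived
from it:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for an incoming message envelope; the real class differs.
class ImeSketch {
  final int partitionId;   // the envelope already knows its partition
  final Object message;
  ImeSketch(int partitionId, Object message) {
    this.partitionId = partitionId;
    this.message = message;
  }
}

class InMemoryInputSketch {
  // A flat Collection suffices; the "collection of collections" view is derivable.
  static Map<Integer, List<ImeSketch>> groupByPartition(Collection<ImeSketch> input) {
    Map<Integer, List<ImeSketch>> byPartition = new HashMap<>();
    for (ImeSketch ime : input) {
      byPartition.computeIfAbsent(ime.partitionId, id -> new ArrayList<>()).add(ime);
    }
    return byPartition;
  }
}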


ii) In the architecture graph, what's the difference between SSP queues and
Data source/sink? What is the layer exposed to the user (i.e., the programmer)?
SSP queues are intermediate buffers the in-memory system uses to pass 
messages; they are not exposed to the programmer.
Data source/sink refers to the handle on the input data provided by the end 
user and the output to which the system flushes data for the end user to access.
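
As a rough mental model only (not the SEP's actual types), the SSP queues can
be pictured as per-partition in-memory buffers owned by the system:

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative only: one in-memory queue per partition, internal to the system
// and never handed out to the programmer.
class SspQueuesSketch<T> {
  private final Map<Integer, Queue<T>> queuesByPartition = new ConcurrentHashMap<>();

  Queue<T> forPartition(int partitionId) {
    return queuesByPartition.computeIfAbsent(partitionId, id -> new ConcurrentLinkedQueue<>());
  }
}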


iii) Agree with the approach of using customized queues managed by the admin.
However, the reason not to use BEM is not very clear. As a matter of fact,
BEM is just one optional base class for a SystemConsumer implementation.
Not sure why we necessarily need to be limited by BEM.
I agree that BEM is just an optional helper class with a bunch of utility 
methods for implementing a SystemConsumer. Going down that route would require 
the SystemProducer implementation to hold a reference to the SystemConsumer in 
order to write into the same buffer, or a single implementation acting as both 
consumer and producer. That isn't a limitation, but it is something we sign up 
for if we take a BEM-based approach. The benefits BEM brings aren't 
justified for our use case, hence approach C.
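
To make the trade-off concrete, here is a hedged sketch of the approach C
idea with made-up names: the admin owns the queue and hands the same instance
to both sides, so the producer never needs a reference to the consumer and no
BEM subclassing is involved:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class SharedQueueSketch {
  // Write side: only needs the queue the admin handed to it.
  static class ProducerSketch {
    private final Queue<Object> queue;
    ProducerSketch(Queue<Object> queue) { this.queue = queue; }
    void send(Object message) { queue.offer(message); }
  }

  // Read side: same queue instance, no reference to the producer.
  static class ConsumerSketch {
    private final Queue<Object> queue;
    ConsumerSketch(Queue<Object> queue) { this.queue = queue; }
    Object poll() { return queue.poll(); }
  }

  public static void main(String[] args) {
    Queue<Object> adminManagedQueue = new ConcurrentLinkedQueue<>();
    ProducerSketch producer = new ProducerSketch(adminManagedQueue);
    ConsumerSketch consumer = new ConsumerSketch(adminManagedQueue);
    producer.send("hello");
    System.out.println(consumer.poll());   // prints "hello"
  }
}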

iv) In the code examples,

A) What's the difference between durable state and non-durable state in the
high-level API? I don't see any difference. Also, the SEP has clearly
described that the design is only for InMemory input/output/intermediate
streams. I noticed that you added a changelog as an input in the low-level
API, but it is not clear how this changelog is defined and why it is an input
to the application.
The changelog is supposed to be wired through the StoreDescriptor. Since 
this is not supported in V1, I will go ahead and remove the use case.
I will add a section on unsupported use cases and record these there for 
bookkeeping, so that we can revisit them in V2.

B) The code example for checkpoint is empty, and we have stated that we
won't support checkpointing in this SEP. Can we remove it?
Removed it.


Thanks!


-Yi

On Wed, Sep 6, 2017 at 2:06 PM, xinyu liu  wrote:

> +1 on the overall design. This will make testing a lot easier!
>
> Thanks,
> Xinyu
>
> On Wed, Sep 6, 2017 at 10:45 AM, Bharath Kumara Subramanian <
> codin.mart...@gmail.com> wrote:
>
> > Hi all,
> >
> > Can you please vote for SEP-8?
> > You can find the design document here
> >  action?pageId=71013043
> > >.
> >
> > Thanks,
> > Bharath
> >
>




Re: Deploying Samza Jobs Using S3 and YARN on AWS

2017-09-15 Thread Jagadish Venkatraman
Thank you Xiaochuan for your question!

You should ensure that *every machine in your cluster* has the S3 jar file
in its YARN class-path. From your error, it looks like the machine you are
running on does not have the JAR file corresponding to *S3AFileSystem*.

>> What's the right way to set this up? Should I just copy over the required
AWS jars to the Hadoop conf directory

I'd err on the side of simplicity; the *scp* route seems to address
most of your needs.

>> Should I be editing run-job.sh or run-class.sh?

You should not have to edit any of these files. Once you fix your
class-paths by copying those relevant JARs, it should just work.
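
If it helps, one optional sanity check (just a suggestion, not an official
Samza or Hadoop tool) is a tiny probe class you can compile and run with
"yarn <class>" on each node, the same way you already invoked S3AFileSystem
directly, to confirm the class is actually loadable there:

// Hypothetical one-off probe: checks whether the class backing the s3a://
// scheme can be loaded with the current class-path.
public class S3AClasspathCheck {
  public static void main(String[] args) {
    String className = "org.apache.hadoop.fs.s3a.S3AFileSystem";
    try {
      Class.forName(className);
      System.out.println("OK: " + className + " is on the classpath");
    } catch (ClassNotFoundException e) {
      System.out.println("MISSING: " + className + " is not on the classpath");
    }
  }
}

Any node that prints MISSING still needs the jar that provides S3AFileSystem
(typically hadoop-aws plus its AWS SDK dependencies) on its YARN class-path.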

Please let us know if you need more assistance.

--
Jagdish


On Fri, Sep 15, 2017 at 11:07 AM, XiaoChuan Yu  wrote:

> Hi,
>
> I'm trying to deploy a Samza job using YARN and S3 where I upload the zip
> package to S3 and point yarn.package.path to it.
> Does anyone know what setup steps are required for this?
>
> What I've tried so far is to get Hello Samza to be run this way in AWS.
>
> However I ran into the following exception:
> Exception in thread "main" java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class
> org.apache.hadoop.fs.s3a.S3AFileSystem not found
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2112)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
> FileSystem.java:2578)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> ...
>
> Running "$YARN_HOME/bin/yarn classpath" gives the following:
> /home/ec2-user/deploy/yarn/etc/hadoop
> /home/ec2-user/deploy/yarn/etc/hadoop
> /home/ec2-user/deploy/yarn/etc/hadoop
> /home/ec2-user/deploy/yarn/share/hadoop/common/lib/*
> /home/ec2-user/deploy/yarn/share/hadoop/common/*
> /home/ec2-user/deploy/yarn/share/hadoop/hdfs
> /home/ec2-user/deploy/yarn/share/hadoop/hdfs/lib/*
> /home/ec2-user/deploy/yarn/share/hadoop/hdfs/*
> /home/ec2-user/deploy/yarn/share/hadoop/yarn/lib/*
> /home/ec2-user/deploy/yarn/share/hadoop/yarn/*
> /home/ec2-user/deploy/yarn/share/hadoop/mapreduce/lib/*
> /home/ec2-user/deploy/yarn/share/hadoop/mapreduce/*
> /contrib/capacity-scheduler/*.jar
> /home/ec2-user/deploy/yarn/share/hadoop/yarn/*
> /home/ec2-user/deploy/yarn/share/hadoop/yarn/lib/*
>
> I manually copied the required AWS-related jars to
> /home/ec2-user/deploy/yarn/share/hadoop/common.
> I checked that it is loadable by running "yarn
> org.apache.hadoop.fs.s3a.S3AFileSystem" which gives the "Main method not
> found" error instead of class not found.
>
> From the console output of run-job.sh I see the following in class path:
> 1. All jars under the lib directory of the zip package
> 2. /home/ec2-user/deploy/yarn/etc/hadoop (Hadoop conf directory)
>
> The class path from run-job.sh seems to be missing the AWS-related jars
> required for S3AFileSystem.
> What's the right way to set this up?
> Should I just copy over the required AWS jars to the Hadoop conf directory
> (2.)?
> Should I be editing run-job.sh or run-class.sh?
>
> Thanks,
> Xiaochuan Yu
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

