Hi Dilli -

Kevin's response to you already added it to the public list.

Thanks,

--larry


On Sat, Apr 13, 2013 at 10:13 PM, Dilli Arumugam
<[email protected]>wrote:

> Hi Kevin,
> I subscribed to [email protected] a few days ago.
> I was not sure whether my question should be raised on the Hortonworks
> list or in public; hence, I posted it to the internal list.
> If posting to the public list adds value to Hortonworks, I could repost
> it to [email protected].
> Shall I repost it there?
> Thanks
> Dilli
>
>
> On Sat, Apr 13, 2013 at 8:13 AM, Kevin Minder
> <[email protected]>wrote:
>
> >  Hey Dilli,
> > Yes, we struggle with thinking about this all the time as well.  Here is
> > our current thinking.
> >
> >    1. We aren't positioning the gateway as the import/export mechanism
> >    for Hadoop.  The target consumer is really the data scientist who
> >    will be submitting jobs and retrieving results.  That being said, we
> >    haven't thought nearly enough about the relationship between the
> >    gateway and things like Sqoop.
> >    2. This probably goes without saying, but the gateway is a clustered
> >    service and can be scaled (within reason) to accommodate the load.
> >    That being said, it doesn't make any sense if you end up needing as
> >    many gateway instances as data nodes.
> >    3. There is a common, unexpected (at least for me) usage pattern in
> >    the field that results from a strong desire to have the Hadoop
> >    cluster fire-walled away from the enterprise.  This results in what
> >    is typically called a "gateway machine" in the DMZ.  Data workers
> >    will typically scp their files to this machine, ssh to it, and then
> >    run Hadoop CLI commands from it.  Note that this "gateway machine"
> >    has Hadoop installed and configured, including any Kerberos config
> >    if required.  In some sense the gateway is intended to allow the
> >    data worker to stay on their local desktop and perform the same
> >    operations.  So from my perspective the bottleneck issue isn't
> >    really different, and if we do our jobs well it will be easier to
> >    solve with the gateway, because it should be easier to deploy a
> >    gateway instance than to set up one of these "gateway machines".
> >
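> > Concretely, the "gateway machine" workflow in point 3 amounts to
> > something like this (hostnames, user, and paths purely illustrative):
> >
> >     scp data.csv [email protected]:/tmp/
> >     ssh [email protected]
> >     hadoop fs -put /tmp/data.csv /user/worker/data.csv
> >
> > Every byte crosses the DMZ machine either way; the Knox gateway just
> > lets the same three steps happen from the worker's own desktop.
> >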
> > Note: Please subscribe to the Apache Knox mailing lists if you haven't
> > already, and we should have these discussions there.
> > Kevin.
> >
> >
> > On 4/12/13 8:59 PM, Dilli Arumugam wrote:
> >
> > First, my understanding of the current behavior:
> >
> > Gateway behavior:
> >
> > When a client wants to upload a file using the gateway, all the file
> > content has to go through the gateway.
> > If 1000 clients each upload a 1GB file, 1TB has to go through the
> > gateway.
> >
> > WebHDFS behavior:
> >
> > When a client wants to upload a file using WebHDFS, it gets redirected
> > to a datanode.
> > The file content goes to the datanode directly without passing through
> > the NameNode.
> > If 1000 clients each upload a 1GB file, 1TB could be sent to
> > potentially 1000 different datanodes, with each datanode getting 1GB.
> >
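> > To make the redirect concrete, a WebHDFS upload is a two-step exchange
> > (hostnames and paths illustrative; 50070/50075 are the classic default
> > ports, and the redirect query string is abbreviated):
> >
> >     curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/dilli/data.csv?op=CREATE"
> >     # HTTP/1.1 307 Temporary Redirect
> >     # Location: http://datanode1:50075/webhdfs/v1/user/dilli/data.csv?op=CREATE&...
> >     curl -i -X PUT -T data.csv "http://datanode1:50075/webhdfs/v1/user/dilli/data.csv?op=CREATE&..."
> >
> > The NameNode only serves the redirect; the file bytes flow in the
> > second request, straight to the datanode.
> >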
> > If my understanding is right, the gateway could become a bottleneck
> > during file uploads/reads.
> > Am I misunderstanding things?
> >
> >  Thanks
> >  Dilli
> >
> >
> >
> >
> >
>
