Hi Kevin,
I subscribed to [email protected] a few days ago.
I was not sure whether my question should be raised on the Hortonworks list
or in public, so I posted it on the internal list.
If posting on the public list adds value to Hortonworks, I can repost it to
[email protected].
Shall I repost it there?
Thanks
Dilli


On Sat, Apr 13, 2013 at 8:13 AM, Kevin Minder
<[email protected]>wrote:

>  Hey Dilli,
> Yes, we struggle with thinking about this all the time as well.  Here is our
> current thinking.
>
>    1. We aren't positioning the gateway as the import/export mechanism
>    for Hadoop.  The target consumer is really the data scientist who will be
>    submitting jobs and retrieving results.  That being said, we haven't thought
>    nearly enough about the relationship between the gateway and things like
>    Sqoop.
>    2. This probably goes without saying, but the gateway is a clustered
>    service and can be scaled (within reason) to accommodate the load.  That
>    being said, it doesn't make any sense if you end up needing as many gateway
>    instances as data nodes.
>    3. There is a common, unexpected (at least for me) usage pattern in
>    the field that results from a strong desire to have the Hadoop cluster
>    fire-walled away from the enterprise.  This results in what is typically
>    called a "gateway machine" in the DMZ.  Data workers will typically scp
>    their files to this machine and then ssh to this machine.  Then they will
>    run Hadoop CLI commands from that machine.  Note that this "gateway machine"
>    has Hadoop installed and configured, including any Kerberos config if
>    required.  In some sense the gateway is intended to allow the data worker
>    to stay on their local desktop and perform the same operations.  So from my
>    perspective the bottleneck issue isn't really different, and if we do our
>    jobs well it will be easier to solve with the gateway, because it should be
>    easier to deploy a gateway instance than to set up one of the "gateway
>    machines".
>
> Note: Please subscribe to the Apache Knox mailing lists if you haven't
> already and we should have these discussions there.
> Kevin.
>
>
> On 4/12/13 8:59 PM, Dilli Arumugam wrote:
>
> First my understanding of the current behavior:
>
>  Gateway behavior:
>
>  When a client wants to upload a file through the gateway, all the file
> content has to go through the gateway.
> If 1,000 clients each upload a 1 GB file, 1 TB has to pass through the
> gateway.
>
>  WebHDFS Behavior:
>
>  When a client wants to upload a file using WebHDFS, it gets redirected
> to a DataNode.
> The file content goes to the DataNode directly, without passing through the
> NameNode.
> If 1,000 clients each upload a 1 GB file, the 1 TB could be spread across
> potentially 1,000 different DataNodes, with each DataNode receiving 1 GB.
>
>  If my understanding is right, the gateway could become a bottleneck during
> file uploads and reads.
> Am I misunderstanding things?
>
>  Thanks
>  Dilli
>
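For reference, the WebHDFS two-step upload flow described in the quoted message can be sketched as below. This is a rough illustration only: the hostnames, ports, file path, and the sample Location header are illustrative assumptions, not values from this thread.

```python
from urllib.parse import urlencode, urlparse

def create_url(namenode, user, path):
    """Step 1: the client PUTs to the NameNode's WebHDFS endpoint.
    The NameNode answers with a 307 redirect (no file data accepted)."""
    query = urlencode({"op": "CREATE", "user.name": user})
    return f"http://{namenode}/webhdfs/v1{path}?{query}"

def datanode_from_redirect(location):
    """Step 2 target: the client re-sends the PUT, with the file body,
    directly to the DataNode named in the 307 Location header, so the
    bytes never pass through the NameNode."""
    return urlparse(location).netloc

# Illustrative Location header a NameNode might return (assumed values):
loc = ("http://datanode7:50075/webhdfs/v1/tmp/file.bin"
       "?op=CREATE&namenoderpcaddress=namenode:8020")

print(create_url("namenode:50070", "dilli", "/tmp/file.bin"))
# -> http://namenode:50070/webhdfs/v1/tmp/file.bin?op=CREATE&user.name=dilli
print(datanode_from_redirect(loc))
# -> datanode7:50075
```

This is the load-spreading behavior Dilli describes: only the small redirect exchange touches the NameNode, while the 1 GB of content flows to whichever DataNode the redirect names.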
