Hey Dilli,
Yes, we struggle with this question all the time as well. Here is our current thinking.

1. We aren't positioning the gateway as the import/export mechanism for
   Hadoop.  The target consumer is really the data scientist who will
   be submitting jobs and retrieving results.  That being said, we
   haven't thought nearly enough about the relationship between the
   gateway and things like Sqoop.
2. This probably goes without saying, but the gateway is a clustered
   service and can be scaled (within reason) to accommodate the load.
   That said, it doesn't make sense if you end up needing as many
   gateway instances as data nodes.
3. There is a common, unexpected (at least for me) usage pattern in
   the field that results from a strong desire to have the Hadoop
   cluster firewalled away from the enterprise.  This results in what
   is typically called a "gateway machine" in the DMZ.  Data workers
   will typically scp their files to this machine, ssh to it, and then
   run Hadoop CLI commands from there.  Note that this "gateway
   machine" has Hadoop installed and configured, including any
   Kerberos config if required.  In some sense the gateway is intended
   to allow the data worker to stay on their local desktop and perform
   the same operations.  So from my perspective the bottleneck issue
   isn't really different, and if we do our jobs well it will be
   easier to solve with the gateway, because it should be easier to
   deploy a gateway instance than to set up one of the "gateway
   machines".

Note: Please subscribe to the Apache Knox mailing lists if you haven't already and we should have these discussions there.
Kevin.

On 4/12/13 8:59 PM, Dilli Arumugam wrote:
First my understanding of the current behavior:

Gateway behavior:

When a client wants to upload a file using the gateway, all the file content has to go through the gateway.
If 1000 clients each upload a 1 GB file, 1 TB has to go through the gateway.

WebHDFS Behavior:

When a client wants to upload a file using WebHDFS, it gets redirected to a DataNode. The file content goes to the DataNode directly without passing through the NameNode. If 1000 clients each upload a 1 GB file, 1 TB could be sent to as many as 1000 different DataNodes, with each DataNode receiving 1 GB.
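
For reference, the two-step flow described above can be sketched as follows. The host name, port, user, and path are hypothetical, and this is the stock WebHDFS REST pattern rather than anything gateway-specific:

```python
# Sketch of the two-step WebHDFS upload flow (hypothetical names/paths).
# Step 1: an op=CREATE request goes to the NameNode, which answers with
# a 307 redirect whose Location header points at a DataNode.
# Step 2: the file bytes are PUT to that DataNode URL directly, so the
# content never passes through the NameNode.

def webhdfs_create_url(namenode, path, user):
    """Build the step-1 CREATE URL that is sent to the NameNode."""
    return "http://%s/webhdfs/v1%s?op=CREATE&user.name=%s" % (
        namenode, path, user)

url = webhdfs_create_url("nn.example.com:50070", "/user/dilli/data.bin", "dilli")
print(url)
# An HTTP client would PUT to this URL with an empty body, read the 307
# Location header from the response, and then PUT the file content to
# the DataNode address it names.
```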

If my understanding is right, the gateway could become a bottleneck during file uploads/reads.
Am I misunderstanding things?

Thanks
Dilli
