Hi Dilli - Kevin's response to you already added it to the public list.
Thanks,
--larry

On Sat, Apr 13, 2013 at 10:13 PM, Dilli Arumugam <[email protected]> wrote:
> Hi Kevin,
> I already subscribed to [email protected] a few days back.
> I was not sure whether my question should be raised on the Hortonworks
> list or in public, hence I posted on the internal list.
> If posting on the public list adds value to Hortonworks, I could repost
> it to [email protected]. Shall I repost it there?
> Thanks
> Dilli
>
> On Sat, Apr 13, 2013 at 8:13 AM, Kevin Minder <[email protected]> wrote:
>> Hey Dilli,
>> Yes, we struggle with thinking about this all the time as well. Here is
>> our current thinking.
>>
>> 1. We aren't positioning the gateway as the import/export mechanism for
>> Hadoop. The target consumer is really the data scientist who will be
>> submitting jobs and retrieving results. That being said, we haven't
>> thought nearly enough about the relationship between the gateway and
>> things like Sqoop.
>> 2. This probably goes without saying, but the gateway is a clustered
>> service and can be scaled (within reason) to accommodate the load. That
>> being said, it doesn't make any sense if you end up needing as many
>> gateway instances as data nodes.
>> 3. There is a common, unexpected (at least for me) usage pattern in the
>> field that results from a strong desire to have the Hadoop cluster
>> fire-walled away from the enterprise. This results in what is typically
>> called a "gateway machine" in the DMZ. Data workers will typically scp
>> their files to this machine and then ssh to it. Then they will run
>> Hadoop CLI commands from that machine. Note that this "gateway machine"
>> has Hadoop installed and configured, including any Kerberos config if
>> required. In some sense the gateway is intended to allow the data
>> worker to stay on their local desktop and perform the same operations.
>> So from my perspective the bottleneck issue isn't really different, and
>> if we do our jobs well it will be easier to solve with the gateway,
>> because it should be easier to deploy a gateway instance than to set up
>> one of the "gateway machines".
>>
>> Note: Please subscribe to the Apache Knox mailing lists if you haven't
>> already, and we should have these discussions there.
>> Kevin.
>>
>> On 4/12/13 8:59 PM, Dilli Arumugam wrote:
>>> First, my understanding of the current behavior:
>>>
>>> Gateway behavior:
>>> When a client wants to upload a file using the gateway, all the file
>>> content has to go through the gateway.
>>> If 1000 clients each upload a 1 GB file, 1 TB has to go through the
>>> gateway.
>>>
>>> WebHDFS behavior:
>>> When a client wants to upload a file using WebHDFS, it gets redirected
>>> to a datanode. The file content goes to the DataNode directly without
>>> passing through the NameNode.
>>> If 1000 clients each upload a 1 GB file, 1 TB could be sent to
>>> potentially 1000 different datanodes, with each datanode getting 1 GB.
>>>
>>> If my understanding is right, the gateway could become a bottleneck
>>> during file uploads/reads. Am I misunderstanding things?
>>>
>>> Thanks
>>> Dilli
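The WebHDFS two-step upload that Dilli describes (PUT to the NameNode, 307 redirect, then the file body goes straight to a DataNode) can be sketched as follows. This is a minimal, self-contained simulation using only Python's standard library: two local HTTP servers stand in for a NameNode and a DataNode, so all hosts, ports, and paths here are illustrative, not real cluster values.

```python
import threading
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

received = {}  # records which "node" actually received the file bytes


class DataNodeHandler(BaseHTTPRequestHandler):
    """Stand-in DataNode: accepts the file body on PUT."""
    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        received["datanode"] = self.rfile.read(length)
        self.send_response(201)  # WebHDFS returns 201 Created on success
        self.end_headers()

    def log_message(self, *args):  # silence request logging
        pass


class NameNodeHandler(BaseHTTPRequestHandler):
    """Stand-in NameNode: never accepts data, only redirects."""
    def do_PUT(self):
        # Step 1: redirect the client to a DataNode with 307.
        self.send_response(307)
        self.send_header(
            "Location", f"http://127.0.0.1:{datanode_port}{self.path}")
        self.end_headers()

    def log_message(self, *args):
        pass


def serve(handler):
    """Start a server on an ephemeral port; return the port."""
    srv = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv.server_address[1]


datanode_port = serve(DataNodeHandler)
namenode_port = serve(NameNodeHandler)

# Step 1: client sends PUT with no body and does NOT follow redirects.
conn = http.client.HTTPConnection("127.0.0.1", namenode_port)
conn.request("PUT", "/webhdfs/v1/tmp/file?op=CREATE")
resp = conn.getresponse()
location = resp.getheader("Location")

# Step 2: client streams the file body to the redirect target, so the
# bytes bypass the "NameNode" entirely.
target = urlsplit(location)
conn2 = http.client.HTTPConnection(target.hostname, target.port)
path = target.path + ("?" + target.query if target.query else "")
conn2.request("PUT", path, body=b"1GB-of-data")
resp2 = conn2.getresponse()
print(resp2.status)  # 201: the DataNode, not the NameNode, got the data
```

This is the mechanism behind Dilli's point: each client's payload fans out to a different datanode, whereas a pass-through gateway would have to carry every byte itself.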
