Thanks, that’s interesting information. Use of an Edge Node sounds like a useful convention. We are software vendors, and we want to connect to any Hadoop cluster regardless of configuration. How does the Edge Node support connections to HDFS from the client? Doesn’t the HDFS FileSystem require direct connections to each DataNode? Does such an Edge Node proxy all of those connections automatically, or does our software need to be made aware of this convention somehow?
Thanks,
John

From: Rishi Yadav [mailto:ri...@infoobjects.com]
Sent: Saturday, June 07, 2014 8:20 AM
To: user@hadoop.apache.org
Subject: Re: Gathering connection information

Typically users ssh into an edge node that is co-located with the cluster. This also minimizes latency between client and cluster.

On Sat, Jun 7, 2014 at 7:12 AM, Peyman Mohajerian <mohaj...@gmail.com> wrote:

In my experience you build a node called an Edge Node, which has all the libraries and XML configuration settings needed to connect to the cluster; it just doesn't have any of the Hadoop daemons running.

On Wed, Jun 4, 2014 at 2:46 PM, John Lilley <john.lil...@redpoint.net> wrote:

We’ve found that many of the Hadoop samples assume that they are run from a cluster node, and that the connection information can be gleaned directly from a configuration object. However, we always run our client from a remote computer, and our users must manually specify the NN/RM addresses and ports. We’ve found this varies maddeningly between distros and especially on hosted virtual implementations. Getting the wrong port results in various inscrutable errors with red-herring messages about security. Is there a prescribed way to get the correct connection information more easily, like from a web API (where at least we’d only need one address and port)?

john
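
For illustration only: the edge-node convention mentioned in this thread depends on the cluster's XML client configuration files. Below is a minimal sketch of how a remote client might read the NameNode and ResourceManager addresses from copies of those files instead of asking the user to type them in. It assumes the standard property names `fs.defaultFS` (core-site.xml) and `yarn.resourcemanager.address` (yarn-site.xml). The host names, the sample XML, and the helper names `read_hadoop_properties` and `connection_info` are hypothetical. It does not handle HA nameservices, or clusters that set only `yarn.resourcemanager.hostname`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample files; a real client would read copies of the
# cluster's own core-site.xml and yarn-site.xml.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""

YARN_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rm.example.com:8032</value>
  </property>
</configuration>"""


def read_hadoop_properties(xml_text):
    """Return a {name: value} dict from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def connection_info(core_site_xml, yarn_site_xml):
    """Collect the NN/RM addresses; a value is None if the property is unset."""
    core = read_hadoop_properties(core_site_xml)
    yarn = read_hadoop_properties(yarn_site_xml)
    return {
        "namenode": core.get("fs.defaultFS"),
        "resourcemanager": yarn.get("yarn.resourcemanager.address"),
    }


if __name__ == "__main__":
    for role, address in connection_info(CORE_SITE, YARN_SITE).items():
        print(role, address)
```

This only recovers the addresses. It does not change the fact that an HDFS client still needs network access to the DataNodes.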