editing etc hosts files of a cluster
Hi,

I have a cluster set up with 3 nodes, and I'm adding hostname details (in /etc/hosts) manually on each node. That doesn't seem like an effective approach. How is this scenario handled in big clusters?

Is there any simple way to add the hostname details on all the nodes by editing a single entry/file/script?

Thanks and Regards,
Ramesh
--
View this message in context: http://www.nabble.com/editing-etc-hosts-files-of-a-cluster-tp25958579p25958579.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: editing etc hosts files of a cluster
DNS ;)

Ramesh.Ramasamy wrote:
> Is there any simple way to add the hostname details on all the nodes by editing a single entry/file/script?

--
*** The 'Last-Chance' Architect www.galatea.com (US) +1 303 731 3116 (UK) +44 20 8144 4367 ***
Re: editing etc hosts files of a cluster
A bit more specific:

At Yahoo!, we had either every server as a DNS slave or a DNS caching server.

In the case of LinkedIn, we're running Solaris, so nscd is significantly better than its Linux counterpart. However, we still seem to be blowing out the cache too much, so we'll likely switch to DNS caching servers here as well.

On 10/19/09 6:45 AM, "Last-chance Architect" wrote:
> DNS ;)
Re: editing etc hosts files of a cluster
On Mon, Oct 19, 2009 at 2:36 PM, Allen Wittenauer wrote:
> At Yahoo!, we had either every server as a DNS slave or a DNS caching server. [...]

Allen, I am interested in your post. What has caused you to run caching DNS servers on each of your nodes? Is this a Hadoop-specific problem, or a problem specific to your implementation?

My assumption here is that a Hadoop cluster of, say, 1000 nodes would repeatedly talk to the same 1000 nodes. Are you saying that nscd is inadequate to handle the size of the cache, or that nscd is not very efficient? What exactly is the reason you are running a caching DNS server on each node?

Thank you,
Edward
Re: editing etc hosts files of a cluster
On 10/19/09 11:46 AM, "Edward Capriolo" wrote:
> I am interested in your post. What has caused you to run caching DNS servers on each of your nodes? Is this a Hadoop-specific problem, or a problem specific to your implementation?

Hadoop does a -tremendous- amount of hostname lookups. If you don't have either nscd or a local DNS caching server, you are likely throwing away what could be some significant performance gains.

> My assumption here is that a Hadoop cluster of, say, 1000 nodes would repeatedly talk to the same 1000 nodes.

... and that's the catch! Every node running the DFSClient code or being called out from a map/reduce task is a potential hostname that would need to be resolved. Just think about something like distcp.

Also note that this is before we talk about monitoring, any other naming services, CNAMEs, multi-As, etc., that get built as a normal part of running an infrastructure.

> Are you saying that nscd is inadequate to handle the size of the cache, or that nscd is not very efficient? What exactly is the reason you are running a caching DNS server on each node?

In the case of Yahoo!, we had (or, at least, had the perception) that we had or were going to have jobs that did a lot of direct DNS lookups and/or accessed/referenced things outside of the local grid. Also note that a DNS caching server is going to store more information about hostnames than a simple host-to-IP service like nscd.

Hypothetical: let's say I'm building rules for a spam filter, and part of my process is to look up the MX record for a given host. nscd isn't going to help you there.

In the case of LinkedIn, the jury is still out. I suspect we don't have nscd.conf tuned correctly. Our grid isn't that big, our connections in/out are fairly small, etc. It has been one of the things on my todo list since I got hired here 2 months ago. :)

[For the record, I'm not one of those crazy people who turns off nscd because I had a bad experience with a broken version five years ago. In the case of Yahoo!, I was the crazy person who started insisting we turn it on, albeit not for hosts.]
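For reference, a minimal sketch of the kind of nscd.conf hosts-cache tuning being discussed. The values here are purely illustrative, not recommendations; directive defaults vary by distribution, so check your nscd.conf man page:

```
# /etc/nscd.conf (hosts section only) -- illustrative values
enable-cache            hosts   yes
positive-time-to-live   hosts   3600    # seconds to cache successful lookups
negative-time-to-live   hosts   20      # keep failed lookups short-lived
suggested-size          hosts   2011    # internal hash table size (a prime is customary)
keep-hot-count          hosts   1024    # entries to retain when pruning the cache
check-files             hosts   yes     # invalidate entries when /etc/hosts changes
```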
Re: editing etc hosts files of a cluster
On Mon, Oct 19, 2009 at 7:17 PM, Allen Wittenauer wrote:
> Hadoop does a -tremendous- amount of hostname lookups. If you don't have either nscd or a local DNS caching server, you are likely throwing away what could be some significant performance gains. [...]

Cool, thanks for the info.

I have found nscd to be absolutely essential in most/all situations. Whenever I would truss processes on OSes without nscd (say, FreeBSD 6.2), I would see numerous repeated stat() calls against /etc/passwd and /etc/group.

If you are doing users and groups through LDAP, nscd is super important as well. You're not going to want to make a series of lookups for each stat.

I would think the most efficient implementation would be nscd plus a local caching server in that case. nscd should be very efficient since it is reached through libraries; DNS lookups have to open sockets (overhead). However, I can see your point that nscd cannot serve other types of records.
Re: editing etc hosts files of a cluster
Most of the communication and name lookups within a cluster refer to other nodes within that same cluster. It is usually not a big deal to put all the systems from a cluster in a single hosts file and rsync it around the cluster. (Consider using prsync, which comes with pssh, http://www.theether.org/pssh/, or your favorite cluster management software.) Editing each file individually clearly doesn't scale; but editing it once and replicating it does.

Is a large hosts file less efficient than nscd or a caching DNS server for nodes within the cluster?

Thanks,

David

On 10/19/2009 8:02 PM, Edward Capriolo wrote:
> I have found nscd to be absolutely essential in most/all situations. [...] However, I can see your point that nscd cannot serve other types of records.
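As a concrete sketch of the edit-once, replicate-everywhere approach: build the canonical hosts file on one admin box, then push it with prsync. All node names, addresses, and paths below are hypothetical, and the prsync line is commented out since it needs a live cluster with pssh installed and root SSH keys in place:

```shell
# Build the canonical hosts file once, on an admin node.
cat > /tmp/cluster-hosts <<'EOF'
127.0.0.1   localhost
10.0.0.1    node1.example.com  node1
10.0.0.2    node2.example.com  node2
10.0.0.3    node3.example.com  node3
EOF

# Target list for pssh/prsync, one hostname per line.
printf 'node1\nnode2\nnode3\n' > /tmp/cluster-nodes

# Push the identical file to every node:
# prsync -h /tmp/cluster-nodes -l root /tmp/cluster-hosts /etc/hosts

# Sanity check before pushing: every cluster node has exactly one entry.
grep -c '^10\.' /tmp/cluster-hosts
```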
Re: editing etc hosts files of a cluster
Any time you deal with pushing files around, you also have to deal with the repercussions of when the file fails to get to its destination, or fails to get there in a timely manner. [Hai hadoop config files.] If you use an interface alias/VIP/multi-A/whatever to deal with namenode availability, then the host information becomes even more critical.

Rather than build something custom, I chose to use well-known, off-the-shelf software to deal with keeping host information relatively in sync: BIND.

On 10/19/09 8:09 PM, "David B. Ritch" wrote:
> It is usually not a big deal to put all the systems from a cluster in a single hosts file and rsync it around the cluster. [...] Is a large hosts file less efficient than nscd or a caching DNS server for nodes within the cluster?
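For context, a caching-only BIND resolver along these lines takes only a few lines of named.conf. This is a hedged sketch, not a drop-in config: the directory path and the upstream forwarder address (10.0.0.53) are assumptions that vary by site and distribution:

```
// /etc/named.conf -- caching-only resolver for a single node (illustrative)
options {
    directory "/var/named";
    recursion yes;                   // resolve and cache on behalf of local clients
    allow-query { localhost; };      // serve this node only; not a public resolver
    forwarders { 10.0.0.53; };       // assumed site DNS server
};
```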
Re: editing etc hosts files of a cluster
I also prefer to avoid custom software and follow standards. We use Puppet to manage our node configuration (including Hadoop config files), and adding one more file to the configuration is trivial.

I prefer not to run additional daemons on all my nodes when I can avoid it. Replicating our hosts file allows us to avoid running named on all the nodes.

David

On Tue, Oct 20, 2009 at 1:15 PM, Allen Wittenauer wrote:
> Rather than build something custom, I chose to use well-known, off-the-shelf software to deal with keeping host information relatively in sync: BIND. [...]
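A sketch of the Puppet approach described above; the module name and source path are assumptions for illustration, not the poster's actual manifest:

```puppet
# Distribute one canonical hosts file to every node under Puppet management.
file { '/etc/hosts':
  ensure => file,
  owner  => 'root',
  group  => 'root',
  mode   => '0644',
  source => 'puppet:///modules/cluster/hosts',  # hypothetical module path
}
```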
Re: editing etc hosts files of a cluster
Everything can be made to work at a small scale. As the grid grows, well...

On 10/20/09 10:32 AM, "David Ritch" wrote:
> I prefer not to run additional daemons on all my nodes when I can avoid it. Replicating our hosts file allows us to avoid running named on all the nodes. [...]
Re: editing etc hosts files of a cluster
Allen Wittenauer wrote:
> At Yahoo!, we had either every server as a DNS slave or a DNS caching server. [...] However, we still seem to be blowing out the cache too much. So we'll likely switch to DNS caching servers here as well.

The standard Hadoop scripts don't tune DNS caching in the JVM, so Hadoop doesn't notice DNS entries changing. That adds extra complexity to the DNS-lookup-failure class of bugs: the situation where the TT and forked jobs see different IP addresses for the same hosts.
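As a sketch of the tuning being alluded to: the JVM's DNS cache lifetime can be adjusted per process via the sun.net.inetaddr.ttl system properties, e.g. from hadoop-env.sh. The hadoop-env.sh placement and the TTL values here are illustrative assumptions; the right TTLs depend on how often your addresses actually change:

```shell
# Hypothetical hadoop-env.sh fragment: cache successful lookups for 60s
# and failed lookups for 10s, instead of relying on the JVM defaults.
export HADOOP_OPTS="$HADOOP_OPTS -Dsun.net.inetaddr.ttl=60 -Dsun.net.inetaddr.negative.ttl=10"
echo "$HADOOP_OPTS"
```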
Re: editing etc hosts files of a cluster
David B. Ritch wrote:
> Is a large hosts file less efficient than nscd or a caching DNS server for nodes within the cluster?

Pro:
* removes the DNS server as a SPOF
* works on clusters without DNS servers (virtual ones, for example)
* lets you set up private hostnames ("namenode", "jobtracker") that don't change
* lets you keep the cluster config under SCM

Con:
* harder to push out changes
* weird errors when your cluster is inconsistent

We could do a lot in Hadoop in detecting and reporting DNS problems; contributions here would be very welcome. They are a dog to test, though.
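To illustrate the "private hostnames that don't change" point above: a shared hosts file can pin stable role aliases to whatever machine currently holds the role, so only the IP column changes on a hardware swap. All addresses and names below are made up:

```
# /etc/hosts -- role aliases ("namenode", "jobtracker") survive hardware swaps
10.0.0.10   master1.example.com    master1    namenode
10.0.0.11   master2.example.com    master2    jobtracker
10.0.0.20   worker001.example.com  worker001
```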