[ 
https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-7431:
-----------------------------------

    Description: 
The split assignment on AbstractColumnFamilyInputFormat:247 performs a reverse 
DNS lookup of Cassandra IPs in order to preserve locality in Hadoop (task 
trackers are identified by hostnames, not IPs).

However, the reverse DNS lookup of an EC2 endpoint does not yield the EC2 
hostname of that endpoint when running from an EC2 instance, due to the use of 
InetAddress.getHostName().

In order to show this, consider the following piece of code:

{code:title=DnsResolver.java|borderStyle=solid}
import java.net.InetAddress;

public class DnsResolver {
    public static void main(String[] args) throws Exception {
        // args[0] is the IP address to look up, e.g. an EC2 public IP
        InetAddress namenodePublicAddress = InetAddress.getByName(args[0]);
        System.out.println("getHostAddress: " + namenodePublicAddress.getHostAddress());
        // getHostName() is the call that triggers the reverse DNS lookup
        System.out.println("getHostName: " + namenodePublicAddress.getHostName());
    }
}
{code}

When this code is run from my machine to perform a reverse lookup of an EC2 
IP, the output is:
{code:none}
➜  java DnsResolver 54.201.254.99
getHostAddress: 54.201.254.99
getHostName: ec2-54-201-254-99.compute-1.amazonaws.com
{code}

When this code is executed from inside an EC2 machine, the output is:
{code:none}
➜  java DnsResolver 54.201.254.99
getHostAddress: 54.201.254.99
getHostName: 54.201.254.99
{code}

However, when using Linux tools such as "host" or "dig", the EC2 hostname is 
properly resolved from the EC2 instance, so the problem appears to lie in how 
Java's InetAddress.getHostName() behaves on EC2.
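
One possible way to sidestep InetAddress.getHostName(), sketched here as an 
assumption and not as the fix adopted for this issue, is to query the PTR 
record directly through the JDK's JNDI DNS provider, which is closer to what 
"host" and "dig" do. The names ReverseDnsSketch, toArpaName and reverseLookup 
are hypothetical, the sketch handles IPv4 only, and whether it returns the EC2 
hostname from inside EC2 has not been verified here.

```java
import java.net.InetAddress;
import java.util.Hashtable;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper: reverse DNS via a direct PTR query (IPv4 only).
class ReverseDnsSketch {

    // Builds the reverse-lookup name, e.g. 54.201.254.99 -> 99.254.201.54.in-addr.arpa
    static String toArpaName(byte[] ipv4) {
        if (ipv4.length != 4) {
            throw new IllegalArgumentException("only IPv4 addresses are supported");
        }
        return (ipv4[3] & 0xFF) + "." + (ipv4[2] & 0xFF) + "."
             + (ipv4[1] & 0xFF) + "." + (ipv4[0] & 0xFF) + ".in-addr.arpa";
    }

    // Queries the PTR record using the system's configured DNS servers.
    // Falls back to the textual IP when no PTR record is returned.
    static String reverseLookup(InetAddress address) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(
                    toArpaName(address.getAddress()), new String[] {"PTR"});
            Attribute ptr = attrs.get("PTR");
            if (ptr == null) {
                return address.getHostAddress();
            }
            String host = ptr.get().toString();
            // PTR values are fully qualified and usually end with a trailing dot
            return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws Exception {
        InetAddress address = InetAddress.getByName(args[0]);
        System.out.println("reverseLookup: " + reverseLookup(address));
    }
}
```

The lookup throws a NamingException when the name cannot be resolved, so a 
caller would likely want to catch it and fall back to the IP address.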

Two consequences of this bug during AbstractColumnFamilyInputFormat split 
definition are:
1) If the Hadoop cluster is configured to use EC2 public DNS, locality is 
lost: Hadoop tries to match the CFIF split location (public IP) against the 
task tracker location (public DNS), so no matches are found.
2) If the Cassandra nodes' broadcast_address is set to public IPs, all Hadoop 
communication goes through the public IP, which incurs additional data 
transfer charges. If the public IP is instead mapped to the EC2 DNS name during 
split definition, then when the task is executed ColumnFamilyRecordReader 
resolves the public DNS name to the private IP of the instance, so there are 
no additional charges.
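
To illustrate the first consequence, the following simplified model treats 
locality matching as an exact string comparison between split locations and 
task tracker hostnames. This is an assumption made for illustration, not 
Hadoop's actual scheduler code, and LocalitySketch and isLocal are 
hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of split-to-tracker locality matching.
class LocalitySketch {

    // Returns true if any split location exactly equals a task tracker hostname.
    static boolean isLocal(String[] splitLocations, Set<String> taskTrackerHosts) {
        for (String location : splitLocations) {
            if (taskTrackerHosts.contains(location)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> trackers = new HashSet<>(
                Arrays.asList("ec2-54-201-254-99.compute-1.amazonaws.com"));

        // Split location left as a raw public IP: no tracker matches.
        System.out.println(isLocal(new String[] {"54.201.254.99"}, trackers));

        // Split location resolved to the EC2 hostname: the tracker matches.
        System.out.println(isLocal(
                new String[] {"ec2-54-201-254-99.compute-1.amazonaws.com"}, trackers));
    }
}
```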

A similar bug was filed in the WHIRR project: 
https://issues.apache.org/jira/browse/WHIRR-128

  was:
The split assignment on AbstractColumnFamilyInputFormat:247 performs a reverse 
DNS lookup of a Cassandra endpoint in order to preserve locality in Hadoop 
(task trackers are identified by hostnames, not IPs).

However, the reverse DNS lookup of an EC2 endpoint does not yield the EC2 
hostname of that endpoint when running from an EC2 instance, due to the use of 
InetAddress.getHostName().

In order to show this, consider the following piece of code:

{code:title=DnsResolver.java|borderStyle=solid}
import java.net.InetAddress;

public class DnsResolver {
    public static void main(String[] args) throws Exception {
        // args[0] is the IP address to look up, e.g. an EC2 public IP
        InetAddress namenodePublicAddress = InetAddress.getByName(args[0]);
        System.out.println("getHostAddress: " + namenodePublicAddress.getHostAddress());
        // getHostName() is the call that triggers the reverse DNS lookup
        System.out.println("getHostName: " + namenodePublicAddress.getHostName());
    }
}
{code}

When this code is run from my machine to perform a reverse lookup of an EC2 
IP, the output is:
{code:none}
➜  java DnsResolver 54.201.254.99
getHostAddress: 54.201.254.99
getHostName: ec2-54-201-254-99.compute-1.amazonaws.com
{code}

When this code is executed from inside an EC2 machine, the output is:
{code:none}
➜  java DnsResolver 54.201.254.99
getHostAddress: 54.201.254.99
getHostName: 54.201.254.99
{code}

However, when using Linux tools such as "host" or "dig", the EC2 hostname is 
properly resolved from the EC2 instance, so the problem appears to lie in how 
Java's InetAddress.getHostName() behaves on EC2.

Two consequences of this bug during AbstractColumnFamilyInputFormat split 
definition are:
1) If the Hadoop cluster is configured to use EC2 public DNS, locality is 
lost: Hadoop tries to match the CFIF split location (public IP) against the 
task tracker location (public DNS), so no matches are found.
2) If the Cassandra nodes' broadcast_address is set to public IPs, all Hadoop 
communication goes through the public IP, which incurs additional data 
transfer charges. If the public IP is instead mapped to the EC2 DNS name during 
split definition, then when the task is executed ColumnFamilyRecordReader 
resolves the public DNS name to the private IP of the instance, so there are 
no additional charges.

A similar bug was filed in the WHIRR project: 
https://issues.apache.org/jira/browse/WHIRR-128


> Hadoop integration does not perform reverse DNS lookup correctly on EC2
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-7431
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7431
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta



