Re: Hadoop DNS/topology details

2013-02-20 Thread Sage Weil
On Wed, 20 Feb 2013, Noah Watkins wrote:
 
 On Feb 19, 2013, at 4:39 PM, Sage Weil s...@inktank.com wrote:
 
  However, we do have host and rack information in the crush map, at least 
  for non-customized installations.  How about something like
  
   string ceph_get_osd_crush_location(int osd, string type);
  
  or similar.  We could call that with "host" and "rack" and get exactly 
  what we need, without making any changes to the data structures.
 
 This would then be used in conjunction with an interface:
 
  ceph_offset_to_osds(offset, vector<int> osds)
 ...
 osdmap->pg_to_acting_osds(osds)
 ...
 
 or something like this that replaces the current extent-to-sockaddr 
 interface? The proposed interface above would do the host/IP mapping, as 
 well as the topology mapping?

Yeah.  The ceph_offset_to_osds should probably also have an (optional?) 
out argument that tells you how long the extent is starting from offset 
that is on those devices.  Then you can do another call at offset+len to 
get the next segment.
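Something like this is what I have in mind (just a sketch; the signature below is 
made up for illustration, not an existing libcephfs call):

  #include <sys/types.h>   // loff_t, as used elsewhere in the thread
  #include <vector>
  using std::vector;

  // Hypothetical call: fills 'osds' with the OSDs backing the stripe at
  // 'offset' and sets 'seg_len' to how many bytes starting at 'offset'
  // live on those devices.
  extern int ceph_offset_to_osds(int fd, loff_t offset,
                                 vector<int>& osds, loff_t *seg_len);

  void walk_segments(int fd, loff_t file_size) {
    loff_t offset = 0;
    while (offset < file_size) {
      vector<int> osds;
      loff_t seg_len = 0;
      if (ceph_offset_to_osds(fd, offset, osds, &seg_len) < 0 || seg_len == 0)
        break;
      // ... hand (offset, seg_len, osds) to the caller ...
      offset += seg_len;   // the next call returns the following segment
    }
  }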

sage



Re: Hadoop DNS/topology details

2013-02-20 Thread Noah Watkins

On Feb 20, 2013, at 9:31 AM, Sage Weil s...@inktank.com wrote:

 or something like this that replaces the current extent-to-sockaddr 
 interface? The proposed interface above would do the host/IP mapping, as 
 well as the topology mapping?
 
 Yeah.  The ceph_offset_to_osds should probably also have an (optional?) 
 out argument that tells you how long the extent is starting from offset 
 that is on those devices.  Then you can do another call at offset+len to 
 get the next segment.


It'd be nice to hide the striping strategy so we don't have to reproduce it in 
the Hadoop shim, as we currently do and as is still needed with an interface that 
takes only an offset (we have to know the stripe unit to jump to the next extent). 
So, something like this might work:

  struct extent {
    loff_t offset, length;
    vector<int> osds;
  };

  ceph_get_file_extents(file, offset, length, vector<extent> extents);

Then we could re-use the Striper or something?
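For concreteness, here's roughly how the shim might consume that, combined with 
the crush-location call Sage suggested. Both functions below are the proposals 
from this thread, not existing libcephfs API, and the exact signatures are just 
a sketch:

  #include <sys/types.h>
  #include <string>
  #include <vector>
  using std::string; using std::vector;

  struct extent {                 // as proposed above
    loff_t offset, length;
    vector<int> osds;
  };

  extern int ceph_get_file_extents(int fd, loff_t offset, loff_t length,
                                   vector<extent>& extents);
  extern string ceph_get_osd_crush_location(int osd, const string& type);

  // Build Hadoop-style (host, topology path) entries for a byte range.
  void block_locations(int fd, loff_t off, loff_t len) {
    vector<extent> extents;
    ceph_get_file_extents(fd, off, len, extents);
    for (const extent& e : extents) {
      for (int osd : e.osds) {
        string host = ceph_get_osd_crush_location(osd, "host");
        string rack = ceph_get_osd_crush_location(osd, "rack");
        string topo = "/" + rack + "/" + host;   // e.g. /default-rack/host
        // ... emit (host, topo) for the segment [e.offset, e.offset + e.length) ...
      }
    }
  }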

-Noah


Re: Hadoop DNS/topology details

2013-02-19 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
 Here is the information that I've found so far regarding the operation of 
 Hadoop w.r.t. DNS/topology. There are two parts, the file system client 
 requirements, and other consumers of topology information.

 -- File System Client --

 The relevant interface between the Hadoop VFS and its underlying file system 
 is:

   FileSystem:getFileBlockLocations(File, Extent)

 which is expected to return a list of hosts (3-tuples of hostname, IP, and 
 topology path) for each block that contains any part of the specified file 
 extent. So, with 3x replication and 2 blocks, there are 2 * 3 = 6 3-tuples 
 present.
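 For illustration, the shape of what we need to hand back per replica of each 
 block (the struct and field names here are just made up):

   #include <string>

   struct BlockHost {
     std::string hostname;        // e.g. "node3"
     std::string ip;              // e.g. "10.0.0.3"
     std::string topology_path;   // e.g. "/default-rack/node3"
   };
   // With 3x replication and an extent touching 2 blocks, the result
   // holds 2 * 3 = 6 of these.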

   *** Note: HDFS sorts each list of hosts based on a distance metric applied 
 between the initiating file system client and each of the blocks in the list 
 using the HDFS cluster map. This should not affect correctness, although it's 
 possible that consumers of this list (e.g. MapReduce) may assume an ordering. 
 ***

That is just truly annoying. Is this described anywhere in their docs?
I don't think it would be hard to sort, if we had some mechanism for
doing so (crush map nearness, presumably?), but if doing it wrong is
expensive in terms of performance we'll want some sort of contract to
code to.


 The current Ceph client can produce the same list, but includes neither 
 hostname nor topology information. Currently reverse DNS is used to fill in 
 the hostname, and defaults to a flat topology in which all hosts are in a 
 single topology path: /default-rack/host.

 - Reverse DNS could be quite slow:
- 3x replication * 1 TB / 64 MB blocks = 49152 lookups
    - Caching lookups could help (see the sketch below)
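 A minimal sketch of what I mean by caching, assuming a simple in-process map in 
 front of getnameinfo() is enough:

   #include <netdb.h>
   #include <netinet/in.h>
   #include <arpa/inet.h>
   #include <string>
   #include <unordered_map>

   static std::unordered_map<std::string, std::string> dns_cache;

   std::string hostname_for_ip(const std::string& ip) {
     auto it = dns_cache.find(ip);
     if (it != dns_cache.end())
       return it->second;                  // cache hit: no DNS round trip

     sockaddr_in sa = {};
     sa.sin_family = AF_INET;
     inet_pton(AF_INET, ip.c_str(), &sa.sin_addr);

     char host[NI_MAXHOST] = {};
     if (getnameinfo((sockaddr*)&sa, sizeof(sa), host, sizeof(host),
                     nullptr, 0, NI_NAMEREQD) != 0)
       return ip;                          // fall back to the raw IP

     dns_cache[ip] = host;
     return host;
   }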

 -- Topology Information --

 Services that run on a Hadoop cluster (such as MapReduce) use hostname and 
 topology information attached to each file system block to schedule and 
 aggregate work based on various policies. These services don't have direct 
 access to the HDFS cluster map, and instead rely on a service to provide a 
 mapping:

    DNS-names/IP -> topology path mapping

 This mapping can be provided by a script/utility program that performs bulk 
 translations, or implemented directly in Java.
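 One possible shape for such a bulk translator (the rack-map file and its format 
 are assumptions for illustration): names come in as arguments, one topology 
 path per line goes out.

   #include <fstream>
   #include <iostream>
   #include <string>
   #include <unordered_map>

   int main(int argc, char** argv) {
     std::unordered_map<std::string, std::string> rack_of;
     std::ifstream map_file("rack_map.txt");   // lines: "<host-or-ip> <rack>"
     std::string host, rack;
     while (map_file >> host >> rack)
       rack_of[host] = rack;

     // One topology path per input name, unknown names fall back to a flat map.
     for (int i = 1; i < argc; ++i) {
       auto it = rack_of.find(argv[i]);
       std::cout << "/" << (it != rack_of.end() ? it->second : "default-rack")
                 << "\n";
     }
     return 0;
   }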

 -- A Possible Approach --

 1. Expand CephFS interface to return IP and hostname

Ceph doesn't store hostnames anywhere — it really can't do this. All
it has is IPs associated with OSD ID numbers. :) Adding hostnames
would be a monitor and map change, which we could do, but given the
issues we've had with hostnames in other contexts I'd really rather
not.
-Greg


Re: Hadoop DNS/topology details

2013-02-19 Thread Noah Watkins

On Feb 19, 2013, at 2:22 PM, Gregory Farnum g...@inktank.com wrote:

 On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
 
 That is just truly annoying. Is this described anywhere in their docs?

Not really. It's just there in the code--I can figure out the metric if you're 
interested. I suspect it is local node, local rack, off rack ordering, with no 
special tie breakers.
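
If it helps, this is the ordering I have in mind (an assumption about the HDFS 
behavior, not lifted from their source):

  #include <algorithm>
  #include <string>
  #include <vector>

  struct Replica { std::string host, rack; };

  // Rank each replica by distance from the reading client.
  static int distance(const Replica& r,
                      const std::string& client_host,
                      const std::string& client_rack) {
    if (r.host == client_host) return 0;   // local node
    if (r.rack == client_rack) return 1;   // local rack
    return 2;                              // off rack
  }

  void sort_replicas(std::vector<Replica>& replicas,
                     const std::string& client_host,
                     const std::string& client_rack) {
    // stable_sort keeps the original order within each distance class,
    // i.e. no special tie breakers.
    std::stable_sort(replicas.begin(), replicas.end(),
                     [&](const Replica& a, const Replica& b) {
                       return distance(a, client_host, client_rack) <
                              distance(b, client_host, client_rack);
                     });
  }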

 I don't think it would be hard to sort, if we had some mechanism for
 doing so (crush map nearness, presumably?),

Topology information from the bucket hierarchy? I think it's always some sort 
of heuristic.

 1. Expand CephFS interface to return IP and hostname
 
 Ceph doesn't store hostnames anywhere — it really can't do this. All
 it has is IPs associated with OSD ID numbers. :) Adding hostnames
 would be a monitor and map change, which we could do, but given the
 issues we've had with hostnames in other contexts I'd really rather
 not.

What is the fate of hostnames used in ceph.conf? Could that information be 
leveraged, when specified by the cluster admin?

-Noah


Re: Hadoop DNS/topology details

2013-02-19 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 4:39 PM, Sage Weil s...@inktank.com wrote:
 On Tue, 19 Feb 2013, Noah Watkins wrote:
 On Feb 19, 2013, at 2:22 PM, Gregory Farnum g...@inktank.com wrote:
  On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins jayh...@cs.ucsc.edu
  wrote:
 
  That is just truly annoying. Is this described anywhere in their docs?

 Not really. It's just there in the code--I can figure out the metric if
 you're interested. I suspect it is local node, local rack, off rack
 ordering, with no special tie breakers.

  I don't think it would be hard to sort, if we had some mechanism for
  doing so (crush map nearness, presumably?),

 Topology information from the bucket hierarchy? I think it's always some
 sort of heuristic.

  1. Expand CephFS interface to return IP and hostname
 
  Ceph doesn't store hostnames anywhere — it really can't do this. All
  it has is IPs associated with OSD ID numbers. :) Adding hostnames
  would be a monitor and map change, which we could do, but given the
  issues we've had with hostnames in other contexts I'd really rather
  not.

 What is the fate of hostnames used in ceph.conf? Could that information
 be leveraged, when specified by the cluster admin?

 Those went the way of the Dodo.

More specifically, those hostnames are used by mkcephfs (and
ceph-deploy?) for ssh'ing into the remote nodes, and they might sit in
a lot of ceph.conf's somewhere. But it's not data aggregated by the
monitors, or even used in-memory.

 However, we do have host and rack information in the crush map, at least
 for non-customized installations.  How about something like

   string ceph_get_osd_crush_location(int osd, string type);

  or similar.  We could call that with "host" and "rack" and get exactly
 what we need, without making any changes to the data structures.

That's a good workaround, but it does rely on those fields being set
up in the CRUSH map (and makes handling cases like SSD-primary setups
a lot more challenging).
-Greg