Re: Hadoop DNS/topology details
On Wed, 20 Feb 2013, Noah Watkins wrote:
> On Feb 19, 2013, at 4:39 PM, Sage Weil <s...@inktank.com> wrote:
>> However, we do have host and rack information in the crush map, at
>> least for non-customized installations. How about something like
>>
>>   string ceph_get_osd_crush_location(int osd, string type);
>>
>> or similar. We could call that with "host" and "rack" and get exactly
>> what we need, without making any changes to the data structures.
>
> This would then be used in conjunction with an interface:
>
>   ceph_offset_to_osds(offset, vector<int> osds)
>     ... osdmap->pg_to_acting_osds(osds) ...
>
> or something like this that replaces the current extent-to-sockaddr
> interface? The proposed interface above would do the host/IP mapping,
> as well as the topology mapping?

Yeah. The ceph_offset_to_osds call should probably also have an
(optional?) out argument that tells you how long the extent starting
at offset is on those devices. Then you can do another call at
offset+len to get the next segment.

sage
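To make the offset+len iteration concrete, here is a minimal C++
sketch. The name and signature are hypothetical (libcephfs has no
ceph_offset_to_osds today); it assumes the length out-argument Sage
proposes:

    #include <sys/types.h>
    #include <vector>

    // Hypothetical signature with the proposed length out-argument
    // (not current libcephfs API): fills `osds` with the acting set
    // for the extent containing `offset`, and sets `len` to the size
    // of that extent.
    int ceph_offset_to_osds(int fd, loff_t offset,
                            std::vector<int> *osds, loff_t *len);

    // Walk a file segment by segment; the caller needs no striping
    // knowledge beyond what each call reports.
    void walk_extents(int fd, loff_t file_size) {
      for (loff_t off = 0; off < file_size; ) {
        std::vector<int> osds;
        loff_t len = 0;
        if (ceph_offset_to_osds(fd, off, &osds, &len) < 0 || len <= 0)
          break;              // error, or nothing left to map
        // `osds` holds the placement for [off, off + len)
        off += len;           // "another call at offset+len"
      }
    }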
Re: Hadoop DNS/topology details
On Feb 20, 2013, at 9:31 AM, Sage Weil <s...@inktank.com> wrote:
>> or something like this that replaces the current extent-to-sockaddr
>> interface? The proposed interface above would do the host/IP mapping,
>> as well as the topology mapping?
>
> Yeah. The ceph_offset_to_osds call should probably also have an
> (optional?) out argument that tells you how long the extent starting
> at offset is on those devices. Then you can do another call at
> offset+len to get the next segment.

It'd be nice to hide the striping strategy so we don't have to
reproduce it in the Hadoop shim, as we currently do, and as would still
be needed with an interface that takes only an offset (we have to know
the stripe unit to jump to the next extent). So, something like this
might work:

  struct extent {
    loff_t offset, length;
    vector<int> osds;
  };

  ceph_get_file_extents(file, offset, length, vector<extent> extents);

Then we could re-use the Striper or something?

-Noah
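A short sketch of how the Hadoop shim might consume the proposed call,
restating Noah's struct so it is self-contained (all of it
hypothetical; none of this is in libcephfs). The point is that the
shim does no stripe math at all:

    #include <sys/types.h>
    #include <vector>

    struct ceph_mount_info;

    // Noah's proposed types and signature, restated (hypothetical):
    struct extent {
      loff_t offset;
      loff_t length;
      std::vector<int> osds;
    };

    int ceph_get_file_extents(struct ceph_mount_info *cmount, int fd,
                              loff_t offset, loff_t length,
                              std::vector<extent> *extents);

    // One call per getFileBlockLocations() request; walking the
    // results needs no knowledge of the stripe unit.
    void list_extents(struct ceph_mount_info *cmount, int fd,
                      loff_t off, loff_t len) {
      std::vector<extent> extents;
      if (ceph_get_file_extents(cmount, fd, off, len, &extents) != 0)
        return;
      loff_t covered = 0;
      for (const extent &e : extents) {
        // e.osds holds the OSDs storing [e.offset, e.offset + e.length);
        // each OSD would then be mapped to host/rack information.
        covered += e.length;
      }
      (void)covered;  // sketch only; a real shim would build locations
    }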
Re: Hadoop DNS/topology details
On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins <jayh...@cs.ucsc.edu> wrote:
> Here is the information that I've found so far regarding the operation
> of Hadoop w.r.t. DNS/topology. There are two parts: the file system
> client requirements, and other consumers of topology information.
>
> -- File System Client --
>
> The relevant interface between the Hadoop VFS and its underlying file
> system is:
>
>   FileSystem.getFileBlockLocations(File, Extent)
>
> which is expected to return a list of hosts (a 3-tuple: hostname, IP,
> topology path) for each block that contains any part of the specified
> file extent. So, with triplication and 2 blocks, there are 2 * 3 = 6
> 3-tuples present.
>
> *** Note: HDFS sorts each list of hosts based on a distance metric
> applied between the initiating file system client and each of the
> blocks in the list, using the HDFS cluster map. This should not affect
> correctness, although it's possible that consumers of this list (e.g.
> MapReduce) may assume an ordering. ***

That is just truly annoying. Is this described anywhere in their docs?
I don't think it would be hard to sort, if we had some mechanism for
doing so (crush map nearness, presumably?), but if doing it wrong is
expensive in terms of performance we'll want some sort of contract to
code to.

> The current Ceph client can produce the same list, but does not
> include hostname or topology information. Currently reverse DNS is
> used to fill in the hostname, and we default to a flat topology in
> which all hosts are in a single topology path: /default-rack/host.
>
>  - Reverse DNS could be quite slow:
>    - 3x replication * 1 TB / 64 MB blocks = 49152 lookups
>    - Caching lookups could help
>
> -- Topology Information --
>
> Services that run on a Hadoop cluster (such as MapReduce) use the
> hostname and topology information attached to each file system block
> to schedule and aggregate work based on various policies. These
> services don't have direct access to the HDFS cluster map, and instead
> rely on a service that provides a mapping:
>
>   DNS names/IPs -> topology paths
>
> This can be performed by a script/utility program that does bulk
> translations, or implemented in Java.
>
> -- A Possible Approach --
>
> 1. Expand the CephFS interface to return IP and hostname

Ceph doesn't store hostnames anywhere — it really can't do this. All it
has is IPs associated with OSD ID numbers. :) Adding hostnames would be
a monitor and map change, which we could do, but given the issues we've
had with hostnames in other contexts I'd really rather not.
-Greg
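On the "caching lookups could help" point: a minimal sketch of a
memoized reverse-DNS lookup. getnameinfo() is the standard POSIX call;
the cache around it is illustrative. With ~49k replica entries but only
as many distinct addresses as there are OSD hosts, this collapses the
work to roughly one lookup per host:

    #include <netdb.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <string>
    #include <unordered_map>

    std::string cached_reverse_dns(const sockaddr_in &sin) {
      // Memoize by IPv4 address; an empty string caches a failed
      // lookup so it isn't retried for every replica.
      static std::unordered_map<in_addr_t, std::string> cache;
      auto it = cache.find(sin.sin_addr.s_addr);
      if (it != cache.end())
        return it->second;
      char host[NI_MAXHOST] = {0};
      std::string name;
      if (getnameinfo(reinterpret_cast<const sockaddr *>(&sin),
                      sizeof(sin), host, sizeof(host),
                      nullptr, 0, NI_NAMEREQD) == 0)
        name = host;
      cache.emplace(sin.sin_addr.s_addr, name);
      return name;
    }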
Re: Hadoop DNS/topology details
On Feb 19, 2013, at 2:22 PM, Gregory Farnum <g...@inktank.com> wrote:
> On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins <jayh...@cs.ucsc.edu> wrote:
>
> That is just truly annoying. Is this described anywhere in their docs?

Not really. It's just there in the code--I can figure out the metric if
you're interested. I suspect it is local node, local rack, off rack
ordering, with no special tie breakers.

> I don't think it would be hard to sort, if we had some mechanism for
> doing so (crush map nearness, presumably?),

Topology information from the bucket hierarchy? I think it's always
some sort of heuristic.

>> 1. Expand the CephFS interface to return IP and hostname
>
> Ceph doesn't store hostnames anywhere — it really can't do this. All
> it has is IPs associated with OSD ID numbers. :) Adding hostnames
> would be a monitor and map change, which we could do, but given the
> issues we've had with hostnames in other contexts I'd really rather
> not.

What is the fate of the hostnames used in ceph.conf? Could that
information be leveraged, when specified by the cluster admin?

-Noah
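For illustration, the suspected local-node / local-rack / off-rack
ordering could be expressed as a comparator like the one below. This is
a guess at the heuristic, not HDFS's actual NetworkTopology code:

    #include <algorithm>
    #include <string>
    #include <vector>

    struct HostEntry {
      std::string hostname;
      std::string rack;   // e.g. "/default-rack"
    };

    // Rank: 0 = same node, 1 = same rack, 2 = off rack.
    static int distance_class(const HostEntry &h,
                              const std::string &client_host,
                              const std::string &client_rack) {
      if (h.hostname == client_host) return 0;
      if (h.rack == client_rack) return 1;
      return 2;
    }

    // stable_sort keeps the original order within each class, i.e.
    // "no special tie breakers".
    void sort_by_locality(std::vector<HostEntry> &hosts,
                          const std::string &client_host,
                          const std::string &client_rack) {
      std::stable_sort(hosts.begin(), hosts.end(),
                       [&](const HostEntry &a, const HostEntry &b) {
                         return distance_class(a, client_host, client_rack) <
                                distance_class(b, client_host, client_rack);
                       });
    }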
Re: Hadoop DNS/topology details
On Tue, Feb 19, 2013 at 4:39 PM, Sage Weil <s...@inktank.com> wrote:
> On Tue, 19 Feb 2013, Noah Watkins wrote:
>> On Feb 19, 2013, at 2:22 PM, Gregory Farnum <g...@inktank.com> wrote:
>>> On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins <jayh...@cs.ucsc.edu> wrote:
>>>
>>> That is just truly annoying. Is this described anywhere in their
>>> docs?
>>
>> Not really. It's just there in the code--I can figure out the metric
>> if you're interested. I suspect it is local node, local rack, off
>> rack ordering, with no special tie breakers.
>>
>>> I don't think it would be hard to sort, if we had some mechanism for
>>> doing so (crush map nearness, presumably?),
>>
>> Topology information from the bucket hierarchy? I think it's always
>> some sort of heuristic.
>>
>>>> 1. Expand the CephFS interface to return IP and hostname
>>>
>>> Ceph doesn't store hostnames anywhere — it really can't do this. All
>>> it has is IPs associated with OSD ID numbers. :) Adding hostnames
>>> would be a monitor and map change, which we could do, but given the
>>> issues we've had with hostnames in other contexts I'd really rather
>>> not.
>>
>> What is the fate of the hostnames used in ceph.conf? Could that
>> information be leveraged, when specified by the cluster admin?
>
> Those went the way of the Dodo. More specifically, those hostnames are
> used by mkcephfs (and ceph-deploy?) for ssh'ing into the remote nodes,
> and they might sit in a lot of ceph.conf's somewhere. But it's not
> data aggregated by the monitors, or even used in-memory.
>
> However, we do have host and rack information in the crush map, at
> least for non-customized installations. How about something like
>
>   string ceph_get_osd_crush_location(int osd, string type);
>
> or similar. We could call that with "host" and "rack" and get exactly
> what we need, without making any changes to the data structures.

That's a good workaround, but it does rely on those fields being set up
in the CRUSH map (and makes handling cases like SSD-primary setups a
lot more challenging).
-Greg
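If the crush-location route were taken, building a Hadoop-style
topology path per OSD might look like the sketch below, falling back to
the flat /default-rack layout Noah mentioned when the CRUSH map lacks
the relevant buckets. The signature is Sage's proposal restated, and
the empty-string behavior for missing buckets is an assumption:

    #include <string>

    // Sage's proposed call (hypothetical; not in libcephfs). Assume it
    // returns an empty string when the CRUSH map has no bucket of the
    // requested type above the OSD -- the customized-map case Greg
    // raises.
    std::string ceph_get_osd_crush_location(int osd,
                                            const std::string &type);

    // Hadoop expects paths like "/<rack>/<host>"; degrade gracefully
    // to the flat topology when the fields aren't present.
    std::string topology_path_for_osd(int osd) {
      std::string rack = ceph_get_osd_crush_location(osd, "rack");
      std::string host = ceph_get_osd_crush_location(osd, "host");
      if (rack.empty())
        rack = "default-rack";
      return "/" + rack + (host.empty() ? "" : "/" + host);
    }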