[ https://issues.apache.org/jira/browse/HADOOP-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558940#action_12558940 ]

Devaraj Das commented on HADOOP-1985:
-------------------------------------

bq. I'm worried about the time and memory performance of this. Have you run a 
sort with dfs cluster == map/reduce cluster and compared running times and job 
tracker memory size? We've already seen cases where the current pollForNewTask 
causes performance problems... 

I assume you meant findNewTask giving performance problems. To clarify (for the 
benefit of others), the JobTracker would consume more memory for two reasons:
1) The NetworkTopology is created here. This cannot be avoided, right?
2) Multiple cache levels are maintained. Currently, we maintain only one cache 
level (host to maps). This patch adds a cache at each level, with the number of 
levels currently set to two (host, rack) as a compile-time constant, to do 
efficient lookups. But the caches are just mappings from Node to references to 
objects in the NetworkTopology. Are you referring to these additional caches 
when you say memory performance may be a problem?

The running time performance is helped by the caches. It takes O(1) per level 
to find a TIP for a TaskTracker, no? The linear search for a TIP (on a cache 
miss) exists even in the current code. The only additional cost here is the 
lookup when the level is more than 1.
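
A rough method sketch of the lookup I mean (illustrative only, not verbatim 
from the patch; cachedTasks and maxLevels are placeholder names):

{code}
// Walk up from the TaskTracker's host Node one topology level at a time;
// each level's Map<Node, List<TaskInProgress>> makes the probe O(1).
private TaskInProgress findCachedTask(Node host,
    List<Map<Node, List<TaskInProgress>>> cachedTasks, int maxLevels) {
  Node node = host;                        // level 0: the host itself
  for (int level = 0; level < maxLevels && node != null; level++) {
    List<TaskInProgress> tips = cachedTasks.get(level).get(node);
    if (tips != null && !tips.isEmpty()) {
      return tips.get(0);                  // node-local or rack-local TIP
    }
    node = node.getParent();               // host -> rack -> switch -> ...
  }
  return null;  // miss at every level: fall back to the linear search
}
{code}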

I did run the sort with dfs-cluster == map/reduce-cluster and the numbers were 
very comparable. Nothing concerning there.

bq. It bothers me that the max levels is hard coded rather than configurable.

I was thinking that the most typical cases would require just two levels - 
host and rack - and that's why I made this a compile-time constant. But if it 
makes sense to make it runtime configurable, I can enable that behavior.
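
If we do make it configurable, I imagine something as simple as this (the 
property name is just a placeholder):

{code}
// Hypothetical: read the number of cache levels from the configuration
// instead of a compile-time constant; default to 2 (host, rack).
int maxLevels = conf.getInt("mapred.task.cache.levels", 2);
{code}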

bq. From a style point of view, I probably would have defined a new class 
rather than use nested java.utils containers like List<Map<Node, 
List<TaskInProgress>>>. That way if we change the representation later it won't 
be scattered through the code. In particular, I can imagine wanting to have the 
data structure be something like:
Map<String (rack name), RackInfo> and RackInfo has a Map<String (hostname), 
List<TaskInProgress> >. Or even more tree-like...

How about providing get/set APIs to the existing data structure? The data 
structure works for all cases with an arbitrary number of levels (host, rack, 
switch, datacenter, ...), since it is a list of mappings from Node to TIPs. I 
didn't want to introduce Strings in the mapping since the NetworkTopology 
provides a _Node_ abstraction for everything. If we went to Strings, we would 
have the additional steps of getting the Node from the String name (and vice 
versa), parsing strings to get to the Node, etc., all of which can be avoided 
by keying the mappings on Node.
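
Something along these lines (a hedged sketch, not the patch itself; TaskCache 
is a made-up name, and the class would live in org.apache.hadoop.mapred next 
to TaskInProgress):

{code}
import java.util.*;
import org.apache.hadoop.net.Node;

// Illustrative wrapper that hides the nested containers behind get/add
// calls, keeping Node (not String) as the key at every level.
class TaskCache {
  // index i holds the map for topology level i (0 = host, 1 = rack, ...)
  private final List<Map<Node, List<TaskInProgress>>> levels;

  TaskCache(int numLevels) {
    levels = new ArrayList<Map<Node, List<TaskInProgress>>>(numLevels);
    for (int i = 0; i < numLevels; i++) {
      levels.add(new HashMap<Node, List<TaskInProgress>>());
    }
  }

  List<TaskInProgress> get(int level, Node node) {
    return levels.get(level).get(node);
  }

  void add(int level, Node node, TaskInProgress tip) {
    List<TaskInProgress> tips = levels.get(level).get(node);
    if (tips == null) {
      tips = new ArrayList<TaskInProgress>();
      levels.get(level).put(node, tips);
    }
    tips.add(tip);
  }
}
{code}

If we later changed the representation (e.g. to the tree-like Map of RackInfo 
you describe), only this class would change.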

bq. Did you need to change the definition of findNewTask? I don't see it in the 
patch.

Yes, I changed the definition of findNewTask. In the patch, look for _Find a 
new task to run._ The diff doesn't show the line containing _findNewTask_ 
itself; it only shows the comment above it.

bq. This needs user documentation in forrest.

I have that in the 1985.v6.patch. Look for cluster_setup.xml and 
hdfs_design.xml, where I describe how the rack configuration can be set up. 
Did you mean something else?

bq. The java doc on DNSToSwitchMapping.resolve should probably mention that 
they must cache if their operation is expensive. Although there isn't a way to 
clear or update that cache, which might be a problem at some point...

Agreed regarding documenting the caching requirement. The update of the cache 
could be handled by the implementation of DNSToSwitchMapping, no? I can imagine 
a case where the implementation starts a thread that periodically contacts 
some service and updates its cache. This is transparent to clients calling 
DNSToSwitchMapping.resolve.
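
For example (purely illustrative; RackService stands in for "some service", 
and I'm assuming resolve takes a list of host names and returns the 
corresponding rack ids):

{code}
import java.util.*;
import org.apache.hadoop.net.DNSToSwitchMapping;
import org.apache.hadoop.net.NetworkTopology;

// Illustrative caching resolver: resolve() serves from an in-memory map,
// and a daemon thread periodically re-fetches the host->rack mapping.
// Callers of resolve() never see the refresh happen.
public class CachingMapping implements DNSToSwitchMapping {
  private volatile Map<String, String> cache = new HashMap<String, String>();

  public CachingMapping() {
    Thread refresher = new Thread(new Runnable() {
      public void run() {
        while (true) {
          cache = RackService.fetchAllMappings();  // hypothetical call
          try { Thread.sleep(60 * 60 * 1000L); }   // refresh hourly
          catch (InterruptedException e) { return; }
        }
      }
    });
    refresher.setDaemon(true);
    refresher.start();
  }

  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      String rack = cache.get(name);
      racks.add(rack == null ? NetworkTopology.DEFAULT_RACK : rack);
    }
    return racks;
  }
}
{code}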

bq. You don't really need the Scan example, you could use the 
GenericMRLoadGenerator with a -keepmap of 0.

Okay.

bq. In the longer term I think a configured mapping class would be useful. A 
class named
org.apache.hadoop.net.ConfiguredNodeMapping that let you set the mapping in 
your config.

In the patch this is handled by a specific implementation of 
DNSToSwitchMapping called StaticMapping, which provides an API to set up the 
mapping from host to rack id (used in test cases). But I think I should be able 
to set things in the configuration and have StaticMapping initialize itself 
with the mapping provided there. I'll look at that.
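
Roughly like this (a sketch only; the property name, the host=rack format, and 
the addNodeToRack method name are all illustrative):

{code}
// Hypothetical initialization: read "host=rack" pairs from the configuration
// and seed StaticMapping with them, instead of calling its set API by hand.
String[] pairs = conf.getStrings("net.topology.static.mapping");
if (pairs != null) {
  for (String pair : pairs) {
    int eq = pair.indexOf('=');
    StaticMapping.addNodeToRack(pair.substring(0, eq),   // host
                                pair.substring(eq + 1)); // rack id
  }
}
{code}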

> Abstract node to switch mapping into a topology service class used by 
> namenode and jobtracker
> ---------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1985
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1985
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs, mapred
>            Reporter: eric baldeschwieler
>            Assignee: Devaraj Das
>             Fix For: 0.16.0
>
>         Attachments: 1985.new.patch, 1985.v1.patch, 1985.v2.patch, 
> 1985.v3.patch, 1985.v4.patch, 1985.v5.patch, 1985.v6.patch
>
>
> In order to implement switch locality in MapReduce, we need to have switch 
> location in both the namenode and job tracker.  Currently the namenode asks 
> the data nodes for this info and they run a local script to answer this 
> question.  In our environment and others that I know of there is no reason to 
> push this to each node.  It is easier to maintain a centralized script that 
> maps node DNS names to switch strings.
> I propose that we build a new class that caches known DNS name to switch 
> mappings and invokes a loadable class or a configurable system call to 
> resolve unknown DNS to switch mappings.  We can then add this to the namenode 
> to support the current block to switch mapping needs and simplify the data 
> nodes.  We can also add this same callout to the job tracker and then 
> implement rack locality logic there without needing to change the filesystem 
> API or the split planning API.
> Not only is this the least intrusive path to building rack-local MR that I 
> can identify, it is also compatible with future infrastructures that may 
> derive topology on the fly, etc, etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
