"Full" replication is a good idea, but I suggest we file it as a new
bug/enhancement.
Actually placing a copy of a file on every node is probably rarely
the right thing to do for "full" replication. One copy per switch
would be my preferred default on our clusters (gigabit switches), and
for .JAR files sqrt(numNodes) is probably the right answer.
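Very roughly, the kind of heuristic I have in mind (all of these names
are made up for illustration; none of them are in the patch):

    // Hypothetical helpers sketching the defaults suggested above.
    public class ReplicationHeuristics {
      /** Widely-read .jar files: roughly sqrt(numNodes) copies. */
      public static int jarReplication(int numNodes) {
        return Math.max(1, (int) Math.ceil(Math.sqrt(numNodes)));
      }
      /** "Full" replication as one copy per switch, not per node. */
      public static int fullReplication(int numSwitches) {
        return Math.max(1, numSwitches);
      }
    }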
e14
On Apr 8, 2006, at 12:16 PM, Bryan Pendleton (JIRA) wrote:
[ http://issues.apache.org/jira/browse/HADOOP-51?page=comments#action_12373745 ]
Bryan Pendleton commented on HADOOP-51:
---------------------------------------
Great!
A few comments from reading the patch (I haven't tested it yet):
1) The <description> for dfs.replication.min is wrong
2) This is a wider coding-style concern: the idiom
conf.getType("config.value", defaultValue) is good for user-defined
values, but shouldn't the in-code default be skipped for things that
are already defined in hadoop-default.xml? Hard-coding a default takes
away the value of hadoop-default.xml, and it means that changing a
value there might or might not have the desired system-wide effect
(see the first sketch after this list).
3) Wouldn't it be better to log, at a severe level, replication
requests below minReplication or above maxReplication, and just clamp
the replication to the nearest bound? Replication is set per-file by
the application, while min and max are probably set by the
administrator of the hadoop cluster; throwing an IOException causes
failure where degraded performance would be preferable (see the second
sketch after this list).
4) I may be dense, but I didn't see any way to specify that
replication be "full", i.e., one copy per datanode. I got the feeling
this was something that was desired of this functionality (e.g., for
job.jar files, job configs, and lookup data used widely in a job).
Using a short means that, if we ever scale to more than 32k nodes,
there would be no way to specify this manually, and just using
Short.MAX_VALUE means getting a lot of errors about not being able to
replicate as fully as desired.
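To make (2) concrete, a minimal sketch of the two idioms (the property
name is just an example; both calls are on
org.apache.hadoop.conf.Configuration):

    import org.apache.hadoop.conf.Configuration;

    public class ConfigDefaultExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hard-coded default: if hadoop-default.xml says 1 and the code
        // says 3, editing hadoop-default.xml may not change behavior.
        int withDefault = conf.getInt("dfs.replication.min", 3);
        // Letting hadoop-default.xml supply the value keeps one source
        // of truth; a missing entry then fails loudly, not silently.
        int fromDefaults = Integer.parseInt(conf.get("dfs.replication.min"));
        System.out.println(withDefault + " " + fromDefaults);
      }
    }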
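And for (3), a minimal sketch of clamping instead of throwing
(minReplication/maxReplication are illustrative parameters here, and
java.util.logging stands in for whatever logger the code actually uses):

    import java.util.logging.Logger;

    class ReplicationClamp {
      private static final Logger LOG =
          Logger.getLogger(ReplicationClamp.class.getName());

      /** Clamp an out-of-range request to the nearest bound and log it,
       *  rather than failing the file creation with an IOException. */
      static short clamp(short requested, short minReplication,
                         short maxReplication) {
        if (requested < minReplication) {
          LOG.severe("Replication " + requested + " below minimum; using "
              + minReplication);
          return minReplication;
        }
        if (requested > maxReplication) {
          LOG.severe("Replication " + requested + " above maximum; using "
              + maxReplication);
          return maxReplication;
        }
        return requested;
      }
    }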
Otherwise, this looks like a wonderful patch!
per-file replication counts
---------------------------
Key: HADOOP-51
URL: http://issues.apache.org/jira/browse/HADOOP-51
Project: Hadoop
Type: New Feature
Components: dfs
Versions: 0.2
Reporter: Doug Cutting
Assignee: Konstantin Shvachko
Fix For: 0.2
Attachments: Replication.patch
It should be possible to specify different replication counts for
different files. Perhaps the desired replication count should be an
option when creating a new file. MapReduce should take advantage of
this feature so that job.xml and job.jar files, which are frequently
accessed by lots of machines, are more highly replicated than large
data files.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira