This only handles the problem of putting lots of files.  It doesn't deal
with putting them in parallel (several at once).
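
For reference, a rough sketch of that find/xargs approach (the local
source tree and the /myfiles target are made up, and filenames are
assumed not to contain spaces):

    # Recreate the local directory tree in dfs, then put each file
    # into the matching directory, one at a time (serial, not parallel).
    cd /local/source
    find * -type d | xargs -I{} bin/hadoop dfs -mkdir /myfiles/{}
    find * -type f | xargs -I{} bin/hadoop dfs -put {} /myfiles/{}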

This is a ticklish problem since even on a relatively small cluster, dfs
can absorb writes faster than most source storage can read.  That means
that you can swamp the source pretty easily.

When I have files on a single source machine, I just spawn multiple -put's
on sub-directories until I have sufficiently saturated the read speed of
the source (see the sketch below).  If all of the cluster members have
access to a shared file system, then you can use the (undocumented) pdist
command, but I don't like that as much.
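
A rough sketch of what I mean (the sub-directory names and the /myfiles
target are made up):

    # Spawn one -put per top-level sub-directory; each runs in the
    # background, so several copies read from the source at once.
    cd /local/source
    for d in dir1 dir2 dir3 dir4; do
        bin/hadoop dfs -put $d /myfiles/$d &
    done
    wait    # block until all of the background puts have finished

Add more (or fewer) background puts until the source disk is the
bottleneck.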

You also have to watch out if you start writing from a host in your cluster:
dfs places the first replica of each block on the local datanode, so that
one host winds up with a disproportionate share of the data.  In my case,
the source of the data is actually outside of the cluster and I get pretty
good balancing.

If you do wind up with bad balancing, the best option I have seen is to
increase the replication on individual files for 30-60 seconds and then
decrease it again.  In order to get sufficient throughput for the
rebalancing, I pipeline lots of these changes so that I have 10-100 files at
a time with higher replication.  This does tend to substantially increase
the number of files with excess replication, but that corrects itself pretty
quickly.
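
A minimal sketch of that trick (the file list, the sleep time, and the
replication factors are all assumptions; -setrep is the standard dfs
sub-command):

    # Briefly raise replication on a batch of files, wait for the new
    # replicas to land on other nodes, then drop back to normal.
    for f in `cat files-to-rebalance.txt`; do
        bin/hadoop dfs -setrep 6 $f     # raise above the normal factor
    done
    sleep 60                            # 30-60 seconds, per the above
    for f in `cat files-to-rebalance.txt`; do
        bin/hadoop dfs -setrep 3 $f     # back to the normal factor
    done

Run batches of 10-100 files at a time; the namenode trims the excess
replicas shortly after the replication is set back down.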


On 10/31/07 1:53 PM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:

> hadoop dfs -put will take a directory. If it won't work recursively,
> then you can probably bang out a bash script that will handle it using
> find(1) and xargs(1).
> 
> -- Aaron
> 
> Chris Fellows wrote:
>> Hello!
>> 
>> Quick simple question, hopefully someone out there could answer.
>> 
>> Does the hadoop dfs support putting multiple files at once?
>> 
>> The documentation says -put only works on one file. What's the best way to
>> import multiple files in multiple directories (i.e. dir1/file1 dir1/file2
>> dir2/file1 dir2/file2 etc)?
>> 
>> End goal would be to do something like:
>> 
>>     bin/hadoop dfs -put /dir*/file* /myfiles
>> 
>> And a follow-up: bin/hadoop dfs -lsr /myfiles
>> would list:
>> 
>> /myfiles/dir1/file1
>> /myfiles/dir1/file2
>> /myfiles/dir2/file1
>> /myfiles/dir2/file2
>> 
>> Thanks again for any input!!!
>> 
>> - chris
