A fairly standard approach to a problem like this is for each of the
crawlers to write its many items into a single file (one file per crawler).
This can be done with a map-reduce structure: the map input is the set of
items to crawl, and the reduce mechanism gathers like items together.  That
arrangement can give you the desired output layout, especially if the
reduces are careful about naming their outputs well.
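
For what it's worth, a bare-bones sketch of that shape with the old
org.apache.hadoop.mapred API might look something like the following.
The fetch() and logicalUnitOf() helpers are placeholders for whatever
your crawler really does; this isn't meant as a drop-in implementation.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Map input: one line per item to crawl (e.g. a URL).  The map fetches
    // the item and emits it under the key of its logical unit; the shuffle
    // then delivers every item for a unit to the same reduce, which writes
    // them out together instead of as thousands of tiny files.
    public class CrawlGather {

      public static class CrawlMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          String url = line.toString().trim();
          String unit = logicalUnitOf(url);   // placeholder: e.g. the host name
          String payload = fetch(url);        // placeholder crawl of one item
          out.collect(new Text(unit), new Text(payload));
        }
      }

      public static class GatherReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text unit, Iterator<Text> items,
                           OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          // Everything for one logical unit arrives here together, so a
          // single reduce output file holds the whole unit.
          while (items.hasNext()) {
            out.collect(unit, items.next());
          }
        }
      }

      // Stand-ins for whatever the real crawler does.
      static String logicalUnitOf(String url) { return url; }
      static String fetch(String url) { return ""; }
    }

If the output files need particular names, something along the lines of
MultipleOutputFormat (if your Hadoop version ships it) lets the reduces
route records to files named after the unit instead of the default
part-NNNNN files.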


On 11/6/07 3:12 AM, "André Martin" <[EMAIL PROTECTED]> wrote:

> Hi Ted & Hadoopers,
> I was thinking of a similar solution/optimization but I have the
> following problem:
> We have a large distributed system that consists of several
> spider/crawler nodes - pretty much like a web crawler system - and every
> node writes its gathered data directly to the DFS. So there is no real
> possibility of bundling the data while it is written to the DFS, since
> two spiders may write data for the same logical unit concurrently -
> if the DFS supported synchronized append writes, it would make our
> lives a little bit easier.
> However, our files are still organized into thousands of directories - a
> pretty large directory tree - since I need only certain branches for a
> mapred operation in order to do some data mining...
> 
> Cu on the 'net,
>                        Bye - bye,
> 
>                                   <<<<< André <<<< >>>> èrbnA >>>>>
> 
> Ted Dunning wrote:
>> If your larger run is typical of your smaller run then you have lots and
>> lots of small files.  This is going to make things slow even without the
>> overhead of a distributed computation.
>> 
>> In the sequential case, enumerating the files and the inefficient read
>> pattern will be what slows you down.  The inefficient reads come about
>> because the disk has to seek after every 100KB of input.  That is bad.
>> 
>> In the hadoop case, things are worse because opening a file takes much
>> longer than with local files.
>> 
>> The solution is for you to package your data more efficiently.  This fixes a
>> multitude of ills.  If you don't mind limiting your available parallelism a
>> little bit, you could even use tar files (tar isn't usually recommended
>> because you can't split a tar file across maps).
>> 
>> If you were to package 1000 files per bundle, you would get average file
>> sizes of 100MB instead of 100KB and your file opening overhead in the
>> parallel case would be decreased by 1000x.  Your disk read speed would be
>> much higher as well because your disks would mostly be reading contiguous
>> sectors.
>> 
>> I have a system similar to yours with lots and lots of little files (littler
>> than yours even).  With aggressive file bundling I can routinely process
>> data at a sustained rate of 100MB/s on ten really crummy storage/compute
>> nodes.  Moreover, that rate is probably not even bounded by I/O since my
>> data takes a fair bit of CPU to decrypt and parse.
>>   
> 
> 
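
To make the bundling idea from the quoted message concrete, here is a rough
sketch that packs a local directory of small crawl files into a single
SequenceFile on the DFS, keyed by the original file name.  SequenceFile is
just one convenient container; a tar file (with the splitting caveat above)
or simple concatenation would serve the same purpose.  Treat the paths and
the read-everything-into-memory approach as placeholders.

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs every file in a local directory (args[0]) into one SequenceFile
    // on the DFS (args[1]), keyed by the original file name so nothing is
    // lost when the small files are merged.
    public class Bundler {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
        try {
          // Assumes args[0] is an existing local directory of small files.
          for (File f : new File(args[0]).listFiles()) {
            byte[] data = Files.readAllBytes(f.toPath());
            writer.append(new Text(f.getName()), new BytesWritable(data));
          }
        } finally {
          writer.close();
        }
      }
    }

At roughly 1000 small files per bundle you land near the 100MB average
mentioned above, and a map task can then stream a whole bundle sequentially
instead of paying a file open and a seek for every 100KB.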
