Hi Luca,

2011/3/10 Luca Aiello <alu...@yahoo-inc.com>:

> thanks for the quick response. So in your opinion there is nothing like a 
> "hadoop embedded" tool to do this. This is what I suspected indeed.

The mapreduce model simply uses the <key> as the pivot of the
processing. In your application-specific situation you're looking for
a way to "break" that model in a smart way. So no, Hadoop doesn't do
that.

> Since the <key, value> pairs in the initial dataset are randomly partitioned 
> in the input files,
> I suppose that I can avoid the initial statistics step and do statistics "on 
> the fly".
> For example, if I have 3 keys A, B, C with probabilities 0.7, 0.2, and 0.1, 
> every mapper will receive,
> on average, an input composed of 70% A keys, 20% Bs, and 10% Cs,
> and so it will know that the <key, value> pairs to be chunked are the As.

If you know this in advance you can surely put that into the
mapper. But that simply means you've done the statistics in advance in
some way.
I fail to see how you could do this "on the fly" without prior knowledge.

As a slightly smarter way of doing something like this: have a look
at the terasort example.
There, a sample of the data is used to estimate the distribution.
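For illustration, roughly how that sampling gets wired up with the
newer mapreduce API (the sample parameters and the partition file path
below are just placeholders):

  // InputSampler and TotalOrderPartitioner live in
  // org.apache.hadoop.mapreduce.lib.partition.
  // Sample ~1% of the input (at most 10000 records from at most 10
  // splits) and write partition boundaries that give the reducers
  // evenly sized key ranges.
  InputSampler.Sampler<Text, Text> sampler =
      new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);
  TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
      new Path("/tmp/partitions"));
  InputSampler.writePartitionFile(job, sampler);
  job.setPartitionerClass(TotalOrderPartitioner.class);

In your case a sample like that could just as well be used to estimate
the per-key counts instead of the range boundaries.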

> This can introduce some errors but it should produce an output which is 
> quite uniformly distributed.
>
> Thanks again!

You're welcome.

Niels


> On Mar 10, 2011, at 12:23 PM, Niels Basjes wrote:
>
>> If I understand your problem correctly you actually need some way of
>> knowing if you need to "chop" a large set with a specific key into
>> subsets.
>> In mapreduce the map only has information about a single key at a
>> time. So you need something extra.
>>
>> One way of handling this is to start by doing a statistical run first
>> and use the output to determine which keys can be chopped.
>>
>> So first run an MR job to determine per key how many records and/or bytes
>> need to be handled.
>> In the reducer you have a "lower limit" that ensures the reducer only
>> outputs the keys that need to be chopped.
>> In the second pass you do your actual algorithm with one twist: In the
>> mapper you use the output of the first run to determine if the key
>> needs to be rewritten and in how many variants. You then randomly (!)
>> choose a variant.
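>> A rough sketch of such a first-pass reducer (assuming a pass-1 mapper
>> that emits (key, 1) per record; names and the limit are illustrative):
>>
>>   import java.io.IOException;
>>   import org.apache.hadoop.io.*;
>>   import org.apache.hadoop.mapreduce.Reducer;
>>
>>   // Emits (key, numberOfVariants) only for keys above the limit.
>>   public class KeyCountReducer
>>       extends Reducer<Text, LongWritable, Text, IntWritable> {
>>     private static final long LIMIT = 1000;  // the "lower limit"
>>
>>     @Override
>>     protected void reduce(Text key, Iterable<LongWritable> counts,
>>         Context ctx) throws IOException, InterruptedException {
>>       long total = 0;
>>       for (LongWritable c : counts) {
>>         total += c.get();
>>       }
>>       if (total > LIMIT) {
>>         int variants = (int) Math.ceil((double) total / LIMIT);
>>         ctx.write(key, new IntWritable(variants));  // e.g. B -> 10
>>       }
>>     }
>>   }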
>>
>> Example of what I mean:
>> Assume you have 100 records with key A and 10000 records with key B.
>> You determine that you want key groups of an approximate maximum of
>> 1000 records.
>> The first pass outputs that key B must be split into 10 variants 
>> (=10000/1000).
>> Then the second MAP will transform a key B into one of these randomly:
>> B-1, B-2, B-3 ...
>> A record with key A will remain unchanged.
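>>
>> A minimal sketch of that second-pass mapper (assuming text keys and
>> values; how the first-pass output is loaded is left out):
>>
>>   import java.io.IOException;
>>   import java.util.HashMap;
>>   import java.util.Map;
>>   import java.util.Random;
>>   import org.apache.hadoop.io.Text;
>>   import org.apache.hadoop.mapreduce.Mapper;
>>
>>   public class KeyChopMapper extends Mapper<Text, Text, Text, Text> {
>>     // key -> number of variants, taken from the first-pass output
>>     private final Map<String, Integer> variants =
>>         new HashMap<String, Integer>();
>>     private final Random random = new Random();
>>
>>     @Override
>>     protected void setup(Context ctx) throws IOException {
>>       // Load the first-pass output into 'variants' here, e.g. from a
>>       // file shipped with the job. Omitted for brevity.
>>     }
>>
>>     @Override
>>     protected void map(Text key, Text value, Context ctx)
>>         throws IOException, InterruptedException {
>>       Integer n = variants.get(key.toString());
>>       if (n == null) {
>>         ctx.write(key, value);                  // key A stays A
>>       } else {
>>         int variant = random.nextInt(n) + 1;    // B becomes B-1 .. B-10
>>         ctx.write(new Text(key.toString() + "-" + variant), value);
>>       }
>>     }
>>   }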
>>
>> Disadvantages:
>>   Each run will show different results.
>>   It only works if the set of keys that needs to be chopped is small
>> enough to hold in memory in the second map.
>>
>> HTH
>>
>> Niels Basjes
>>
>>
>> 2011/3/10 Luca Aiello <alu...@yahoo-inc.com>:
>>> Dear users,
>>>
>>> hope this is the right list for this question; otherwise I apologize.
>>>
>>> I'd like to have your opinion about a problem that I'm facing with the 
>>> MapReduce framework. I am writing my code in Java and running on a grid.
>>>
>>> I have a textual input structured in <key, value> pairs. My task is to 
>>> compute the cartesian product of all the values that have the same key.
>>> I can do this simply by using <key> as my map key, so that every value 
>>> with the same key ends up in the same reducer, where I can easily process 
>>> them and obtain the cartesian product.
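>>> Roughly, the reducer I have in mind looks like this (simplified; it 
>>> buffers all values of a key in memory):
>>>
>>>   import java.io.IOException;
>>>   import java.util.ArrayList;
>>>   import java.util.List;
>>>   import org.apache.hadoop.io.Text;
>>>   import org.apache.hadoop.mapreduce.Reducer;
>>>
>>>   public class CartesianReducer extends Reducer<Text, Text, Text, Text> {
>>>     @Override
>>>     protected void reduce(Text key, Iterable<Text> values, Context ctx)
>>>         throws IOException, InterruptedException {
>>>       List<String> buffer = new ArrayList<String>();
>>>       for (Text v : values) {     // the Iterable can be consumed only once
>>>         buffer.add(v.toString());
>>>       }
>>>       // Emit every pair of values that share this key.
>>>       for (String a : buffer) {
>>>         for (String b : buffer) {
>>>           ctx.write(key, new Text(a + "\t" + b));
>>>         }
>>>       }
>>>     }
>>>   }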
>>>
>>> However, my keys are not uniformly distributed; the distribution is very 
>>> broad. As a result, the output of my reducers will be very unbalanced and 
>>> I will have many small files (some KB) and a bunch of huge files (tens of 
>>> GB). A sub-optimal yet acceptable approximation of my task would be to 
>>> make the cartesian product of smaller chunks of values for very frequent 
>>> keys, so that the load is distributed evenly among reducers. I am 
>>> wondering how I can do this in the most efficient/elegant way.
>>>
>>> It appears to me that using a customized Partitioner is not the right way 
>>> to go, since records with the same key still have to be grouped together 
>>> (am I right?).
>>> The only solution that comes to mind is to split the key space 
>>> artificially inside the mapper (e.g., for a frequent key "ABC", map the 
>>> values to the reducers using keys like "ABC1", "ABC2", and so on). This 
>>> would require an additional post-processing cleanup phase to retrieve the 
>>> original keys.
>>>
>>> Do you know a better, possibly automatic way to perform this task?
>>>
>>> Thank you!
>>> Best
>>>
>>> Luca
>>>
>>>
>>
>>
>>
>> --
>> Kind regards,
>>
>> Niels Basjes
>
>



-- 
Kind regards,

Niels Basjes
