Hi Ajay Srivastava,

Thank you for your reply.

Could you please explain a little more about "Write a grouping comparator
which group records on first part of key i.e. Ki."?
I guess it is a crucial part, which could filter some pairs before they are
passed to the reducer.
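
To check my understanding, here is a minimal sketch in plain Java (no Hadoop
classes, so it runs standalone). It assumes a composite key (Ki, dataset tag);
the names CompositeKey, SORT, GROUPING and groupSorted are only mine, for
illustration. The sort comparator orders by Ki and then by tag, while the
grouping comparator compares Ki only, so all records sharing Ki would reach a
single reduce() call. In a real job I assume the grouping comparator would be a
WritableComparator subclass registered with job.setGroupingComparatorClass(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {
    // Hypothetical composite key: join key Ki plus a dataset tag
    // (1 = DATASET1, 2 = DATASET2).
    static final class CompositeKey {
        final String ki;
        final int dataset;
        CompositeKey(String ki, int dataset) { this.ki = ki; this.dataset = dataset; }
    }

    // Sort comparator: order by Ki, then by tag so DATASET1 comes first.
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing((CompositeKey k) -> k.ki).thenComparingInt(k -> k.dataset);

    // Grouping comparator: looks at Ki only and ignores the dataset tag.
    static final Comparator<CompositeKey> GROUPING =
        Comparator.comparing((CompositeKey k) -> k.ki);

    // Mimics the framework walking the sorted keys and starting a new
    // reduce() group whenever the grouping comparator reports a difference.
    static List<List<CompositeKey>> groupSorted(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<CompositeKey>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUPING.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
            new CompositeKey("K2", 2), new CompositeKey("K1", 2),
            new CompositeKey("K1", 1), new CompositeKey("K2", 1));
        for (List<CompositeKey> group : groupSorted(keys)) {
            StringBuilder line = new StringBuilder();
            for (CompositeKey k : group) {
                line.append('(').append(k.ki).append(',').append(k.dataset).append(") ");
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

If this is right, the grouping comparator itself does not drop anything; it
only decides which sorted keys share one reduce() call, and the filtering
would still happen inside the reducer.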


Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <
ajay.srivast...@guavus.com> wrote:

>  Hi Rong,
> You can use following simple method.
>
>  Let's say dataset1 has m records, and when you emit these records from
> the mapper, the keys are K1, K2, …, Km for the respective records. Also add
> an identifier to the key to mark which dataset the record was emitted from.
> So if R1 is a record in dataset1, the mapper will emit the key (K1,
> DATASET1) and the value R1.
>
>  For dataset2, which has n records, emit m records for each record, with
> keys K1, K2, …, Km and the identifier DATASET2.
> So if R1' is a record from dataset2, emit m records with key (Ki,
> DATASET2) and value R1', where i runs from 1 to m.
>
>
>  Write a grouping comparator that groups records on the first part of the
> key, i.e. Ki.
>
>  In the reducer, each call to reduce will receive one record from
> dataset1 and n records from dataset2. Compute the cartesian product, apply
> the filter, and then output the result.
>
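
If I read this step correctly, it could be sketched as below in plain Java
(not Hadoop API code). It assumes the values of one group arrive with the
dataset1 record first, followed by the dataset2 records; reduceGroup and the
keep predicate are hypothetical names, and the equal-length filter is only a
placeholder for the real comparison function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

class ReduceSketch {
    // Pairs the single dataset1 record of a group with every dataset2
    // record and keeps only the pairs accepted by the filter.
    static List<String> reduceGroup(String left, Iterable<String> rights,
                                    BiPredicate<String, String> keep) {
        List<String> out = new ArrayList<>();
        for (String right : rights) {
            if (keep.test(left, right)) {
                out.add(left + "\t" + right);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder filter: keep pairs whose lines have equal length.
        BiPredicate<String, String> sameLength = (a, b) -> a.length() == b.length();
        List<String> out =
            reduceGroup("abc", Arrays.asList("xyz", "toolong", "foo"), sameLength);
        out.forEach(System.out::println);
    }
}
```

Since the dataset2 records are only iterated once, the reducer would need to
hold just the one dataset1 record in memory per group.
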
>
>  Note -- You may not know the keys (K1, K2, …, Km) beforehand. If so, you
> need one more pass over dataset1 to identify the keys and store them for
> use with dataset2.
>
>
>  Regards,
> Ajay Srivastava
>
>
>  On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:
>
>  This is not suitable for his large dataset.
>
> --Send from my Sony mobile.
> On Apr 18, 2013 5:58 PM, "Jagat Singh" <jagatsi...@gmail.com> wrote:
>
>>  Hi,
>>
>> Can you have a look at
>>
>> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>>
>>  Thanks
>>
>>
>> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zheyi.r...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>>  I am writing to ask for ideas on computing a cartesian product in
>>> Hadoop.
>>> Specifically, I have two datasets, each of which contains 20 million
>>> lines.
>>> I want to compute the cartesian product of these two datasets, comparing
>>> lines pairwise.
>>>
>>>  Most of the comparison results can be filtered out by a function (we
>>> do not output the whole result of this cartesian product, but only a
>>> small part).
>>>
>>>  I guess one good way is to pass one block from dataset1 and another
>>> block from dataset2
>>> to a mapper, then let the mappers do the product in memory to avoid IO.
>>>
>>>  Any suggestions?
>>> Thank you very much.
>>>
>>>  Regards,
>>> Zheyi Rong
>>>
>>
>>
>
