Hi Ajay Srivastava,

Thank you for your reply.
Could you please explain a little more about "Write a grouping comparator
which groups records on the first part of the key, i.e. Ki."?

I guess it is a crucial part, which could filter some pairs before passing
them to the reducer.

Regards,
Zheyi Rong

On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <ajay.srivast...@guavus.com> wrote:

> Hi Rong,
> You can use the following simple method.
>
> Let's say dataset1 has m records, and when you emit these records from the
> mapper, the keys are K1, K2, ..., Km for each respective record. Also add an
> identifier to identify the dataset from which each record is being emitted.
> So if R1 is a record in dataset1, the mapper will emit key as (K1,
> DATASET1) and value as R1.
>
> For dataset2 having n records, emit m records for each record with keys
> K1, K2, ..., Km and identifier as DATASET2.
> So if R1' is a record from dataset2, emit m records with key as (Ki,
> DATASET2) and value R1', where i is from 1 to m.
>
> Write a grouping comparator which groups records on the first part of the
> key, i.e. Ki.
>
> In the reducer, for each iteration of reduce there will be one record from
> dataset1 and n records from dataset2. Get the cartesian product, apply the
> filter and then output.
>
> Note -- You may not know the keys (K1, K2, ..., Km) beforehand. If so, you
> need one more pass over dataset1 to identify the keys and store them for use
> with dataset2.
>
> Regards,
> Ajay Srivastava
>
>
> On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:
>
> This is not suitable for his large dataset.
>
> --Send from my Sony mobile.
> On Apr 18, 2013 5:58 PM, "Jagat Singh" <jagatsi...@gmail.com> wrote:
>
>> Hi,
>>
>> Can you have a look at
>>
>> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>>
>> Thanks
>>
>>
>> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zheyi.r...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> I am writing to kindly ask for ideas on doing a cartesian product in
>>> hadoop.
>>> Specifically, I now have two datasets, each of which contains 20 million
>>> lines.
>>> I want to do a cartesian product on these two datasets, comparing lines
>>> pairwise.
>>>
>>> The output of each comparison can be mostly filtered by a function (we
>>> do not output the whole result of this cartesian product, but only a
>>> small part).
>>>
>>> I guess one good way is to pass one block from dataset1 and another
>>> block from dataset2 to a mapper, then let the mappers do the product in
>>> memory to avoid IO.
>>>
>>> Any suggestions?
>>> Thank you very much.
>>>
>>> Regards,
>>> Zheyi Rong
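
For reference, the scheme in the quoted reply can be simulated in plain Python to see what the grouping step contributes. This is only a sketch, not Hadoop code: the function names are made up for illustration, the record index stands in for the key Ki, and `keep` is a placeholder for the filter function. The sort plays the role of the sort comparator (full composite key), and the `groupby` on the first key part plays the role of the grouping comparator. Note that the grouping step does not filter anything by itself; it only decides which keys end up in the same reduce call. In a real job the partitioner would presumably also have to partition on Ki alone, so that (Ki, DATASET1) and (Ki, DATASET2) reach the same reducer.

```python
from itertools import groupby

# Dataset tags; DATASET1 is smaller so it sorts first within each group.
DATASET1, DATASET2 = 0, 1


def map_dataset1(records):
    """Emit ((Ki, DATASET1), record), using the record index as Ki (assumption)."""
    for i, rec in enumerate(records):
        yield (i, DATASET1), rec


def map_dataset2(records, keys):
    """Emit every dataset2 record once per dataset1 key: m copies per record."""
    for rec in records:
        for k in keys:
            yield (k, DATASET2), rec


def shuffle(pairs):
    """Sort on the full composite key, then group on its first part only."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for ki, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield ki, [(tag, value) for (_, tag), value in group]


def reduce_group(tagged_values, keep):
    """One reduce call: the single dataset1 record against all dataset2 records."""
    (first_tag, left), *rest = tagged_values
    assert first_tag == DATASET1  # guaranteed by the sort order
    return [(left, right) for _, right in rest if keep(left, right)]


def filtered_cross(dataset1, dataset2, keep):
    """Run the whole simulated job and collect the pairs that pass the filter."""
    keys = range(len(dataset1))
    pairs = list(map_dataset1(dataset1)) + list(map_dataset2(dataset2, keys))
    out = []
    for _, tagged in shuffle(pairs):
        out.extend(reduce_group(tagged, keep))
    return out
```

For example, `filtered_cross(["a", "b"], ["a", "c", "b"], lambda l, r: l == r)` returns `[("a", "a"), ("b", "b")]`. The simulation also makes the cost visible: m × n intermediate records pass through the shuffle before any filtering happens.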