Re: Algorithm for cross product

2011-06-23 Thread Jason
> One elegant solution would involve an ability to restart one of the input
> splitters and replay the input data from set A several times until the mapper
> had generated all sets of the form
By replaying A several times, did you actually mean reading A set B times? Cause in this case you may

Re: Algorithm for cross product

2011-06-23 Thread Steve Lewis
The approach you suggest is similar to what I am currently doing but it requires you to size the partitions to the memory available on the reducer. This is a non-trivial task and is not necessarily guaranteed to scale. It is true that the simplest approach is to break one of the sets into sufficien

Re: Algorithm for cross product

2011-06-23 Thread John Armstrong
On Wed, 22 Jun 2011 15:16:02 -0700, Steve Lewis wrote:
> Assume I have two data sources A and B
> Assume I have an input format and can generate key values for both A and B
> I want an algorithm which will generate the cross product of all values in A
> having the key K and all values in B havin

Re: Algorithm for cross product

2011-06-22 Thread Lance Norskog
If you have scaling problems, check out the Mahout project. They are all about distributed scalable linear algebra & more.
http://mahout.apache.org
Lance
On Wed, Jun 22, 2011 at 5:13 PM, Jason wrote:
> I remember I had a similar problem.
> The way I approached it was by partitioning one of the d

Re: Algorithm for cross product

2011-06-22 Thread Jason
I remember I had a similar problem. The way I approached it was by partitioning one of the data sets. At a high level these are the steps:
Suppose you decide to partition set A.
Each partition represents a subset/range of the A keys and must be small enough for its records to fit in memory.
Each partit
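The message above is cut off, so the remaining steps are not known. As a rough sketch of the partitioning idea only, the Python below loads one bounded chunk of A into memory at a time and streams B against it. It is plain Python rather than Hadoop code, and it makes assumptions that are not in the thread: chunks are cut by record count rather than by key range, and `read_b` is a hypothetical callable that returns a fresh iterator over B each time it is called.

```python
from collections import defaultdict
from itertools import islice


def partitions(records, max_records):
    """Split an iterable of (key, value) pairs into lists of bounded size."""
    it = iter(records)
    while True:
        chunk = list(islice(it, max_records))
        if not chunk:
            return
        yield chunk


def partitioned_cross_product(records_a, read_b, max_records):
    """Yield (key, a_value, b_value) with only one chunk of A in memory.

    read_b is a hypothetical zero-argument callable returning a fresh
    iterator over B, because B is re-read once per chunk of A.
    """
    for chunk in partitions(records_a, max_records):
        # Index the current chunk of A by key.
        index = defaultdict(list)
        for key, value in chunk:
            index[key].append(value)
        # Stream B once against this chunk; B is never buffered.
        for key, b_value in read_b():
            for a_value in index.get(key, ()):
                yield key, a_value, b_value
```

The trade-off in this sketch is that B is read once per chunk of A, so a smaller `max_records` lowers memory use but increases the number of passes over B.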

Algorithm for cross product

2011-06-22 Thread Steve Lewis
Assume I have two data sources A and B.
Assume I have an input format and can generate key values for both A and B.
I want an algorithm which will generate the cross product of all values in A having the key K and all values in B having the key K.
Currently I use a mapper to generate key values for A
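To make the requested output concrete, here is a minimal sketch of the per-key cross product in plain Python, not Hadoop code. The function name and the `(key, value)` tuple layout are assumptions for illustration. It buffers all values for a key on both sides, which is the memory limitation discussed in the replies.

```python
from collections import defaultdict
from itertools import product


def cross_product_by_key(records_a, records_b):
    """Yield (key, a_value, b_value) for every A/B pair sharing a key."""
    # "Map" side: group values by key, keeping A and B values apart.
    grouped = defaultdict(lambda: ([], []))
    for key, value in records_a:
        grouped[key][0].append(value)
    for key, value in records_b:
        grouped[key][1].append(value)
    # "Reduce" side: per key, pair every A value with every B value.
    for key in sorted(grouped):
        a_values, b_values = grouped[key]
        for a_value, b_value in product(a_values, b_values):
            yield key, a_value, b_value
```

A key that appears in only one of the two sources produces no output, since one side of its product is empty.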