> One elegant solution would involve an ability to restart one of the input
> splitters and replay the input data from set A several times until the mapper
> had generated all sets of the form
By replaying A several times, did you actually mean reading set A B times?
Because in that case you may
The approach you suggest is similar to what I am currently doing but it
requires you to size the partitions to the memory available on the reducer.
This is a non-trivial task and is not guaranteed to scale. It is true that
the simplest approach is to break one of the sets into sufficiently small
partitions.
On Wed, 22 Jun 2011 15:16:02 -0700, Steve Lewis wrote:
> Assume I have two data sources A and B
> Assume I have an input format and can generate key values for both A and B
> I want an algorithm which will generate the cross product of all values in A
> having the key K and all values in B having the key K
If you have scaling problems, check out the Mahout project. They are
all about distributed scalable linear algebra & more.
http://mahout.apache.org
Lance
On Wed, Jun 22, 2011 at 5:13 PM, Jason wrote:
> I remember I had a similar problem.
> The way I approached it was by partitioning one of the data sets.
I remember I had a similar problem.
The way I approached it was by partitioning one of the data sets. At a high
level, these are the steps:
Suppose you decide to partition set A.
Each partition represents a subset/range of the A keys and must be small
enough to fit its records in memory.
Each partit
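The partitioned approach above can be sketched in plain Java (no Hadoop;
the partition count, record layout, and all names are illustrative, not
from the thread). Each A record lands in exactly one key-range partition
small enough to hold in memory, and B is streamed once per partition:

```java
import java.util.*;

public class PartitionedJoin {
    public static void main(String[] args) {
        // Toy (key, value) records standing in for the two data sources.
        List<String[]> a = List.of(new String[]{"k1", "a1"}, new String[]{"k2", "a2"},
                                   new String[]{"k1", "a3"}, new String[]{"k3", "a4"});
        List<String[]> b = List.of(new String[]{"k1", "b1"}, new String[]{"k2", "b2"});

        int partitions = 2;   // assumed: chosen so each A partition fits in memory
        int pairs = 0;
        for (int p = 0; p < partitions; p++) {
            // Load only the A records whose key falls in this partition.
            Map<String, List<String>> inMemory = new HashMap<>();
            for (String[] rec : a)
                if (Math.abs(rec[0].hashCode()) % partitions == p)
                    inMemory.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);

            // Stream B and emit one output pair per matching A value.
            for (String[] rec : b)
                pairs += inMemory.getOrDefault(rec[0], List.of()).size();
        }
        System.out.println(pairs);   // k1 yields 2 pairs, k2 yields 1: prints 3
    }
}
```

Because every A key hashes into exactly one partition, no pair is emitted
twice; the cost is re-reading B once per partition, which is the trade-off
being discussed in this thread.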
Assume I have two data sources A and B
Assume I have an input format and can generate key values for both A and B
I want an algorithm which will generate the cross product of all values in A
having the key K and all values in B having the key K.
Currently I use a mapper to generate key values for A
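For reference, the per-key cross product being asked for can be sketched in
plain Java, assuming both sources have already been grouped by key (as a
reducer would see them); the method and variable names here are illustrative:

```java
import java.util.*;

public class PerKeyCrossProduct {
    // For each key present in both sources, pair every A value with every
    // B value. Keys appearing in only one source contribute nothing.
    static Map<String, List<String[]>> crossProduct(Map<String, List<String>> a,
                                                    Map<String, List<String>> b) {
        Map<String, List<String[]>> out = new HashMap<>();
        for (String key : a.keySet()) {
            List<String> bVals = b.get(key);
            if (bVals == null) continue;
            List<String[]> pairs = new ArrayList<>();
            for (String av : a.get(key))
                for (String bv : bVals)
                    pairs.add(new String[]{av, bv});
            out.put(key, pairs);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> a = Map.of("K", List.of("a1", "a2"));
        Map<String, List<String>> b = Map.of("K", List.of("b1", "b2", "b3"));
        System.out.println(crossProduct(a, b).get("K").size());   // 2 x 3 = 6
    }
}
```

The scaling problem in the thread is exactly that both value lists for a hot
key may be too large to materialize like this, which is what motivates the
partitioning and replay proposals above.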