There are no subtle ways to deal with quadratic problems like this. They just don't scale.
Your suggestions are roughly on course. When matching 10GB against 50GB, the choice of which input to feed to the mappers depends a lot on how much you can buffer in memory and how long such a buffer takes to build.

If you can't store the entire 10GB of data in memory at once, then consider a program like this (a rough sketch of what the mapper and reducer might look like is appended after the quoted message):

a) Split the 50GB of data across as many mappers as you have, using standard methods.

b) In the mapper, emit each record several times with keys of the form (i, j), where i cycles through [0, n) and is incremented once for each record read, and j cycles through [0, m) and is incremented each time you emit a record. Choose m so that 1/m of your 10GB of data will fit in your reducer's memory. Choose n so that n x m is as large as your desired number of reducers.

c) In the reducer, you will get some key (i, j) and an iterator over a number of records. Read the j-th segment of your 10GB of data and compare each of the records that the iterator gives you against that segment. If you made n = 1 in step (b), then you will have at most m-way parallelism in this step. If n is large, however, your reducer may need to read the same segment of your 10GB of data more than once. In that case you may want to sort the records and remember which segment you have already read.

In general, though, as I mentioned, this is not a scalable process, and as your data grows it is likely to become untenable.

If you can split your data into pieces and estimate which piece each record should be matched against, then you might be able to make the process more scalable. Consider indexing techniques to do this rough targeting. For instance, if you are trying to find the closest few strings by edit distance, you might be able to use n-grams to get approximate matches via a text retrieval index. This can substantially reduce the cost of your algorithm. (A sketch of the n-gram idea is appended as well.)

On Sun, Apr 10, 2011 at 2:10 PM, oleksiy <gayduk.a.s...@mail.ru> wrote:
> ... Persistent data and input data don't have commons keys.
>
> In my cluster I have 5 data nodes.
> The app does simple match every line of input data with every line of
> persistent data.
>
> ...
> And may be there is more subtle way in hadoop to do this work?
>
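
To make steps (b) and (c) concrete, here is a minimal sketch of what the mapper and reducer could look like. It is the shape of the idea rather than tested code: the configuration keys (crossmatch.n, crossmatch.m, crossmatch.segment.dir), the assumption that the 10GB side is pre-split into HDFS files named segment-0 through segment-(m-1), the plain-text "i,j" key encoding, and the matches() placeholder are all mine.

// Sketch only: replicate each 50GB record to m reducers so that every
// record meets every segment of the 10GB side exactly once.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CrossMatch {

  // Mapper: the key (i, j) is encoded as the text "i,j" for simplicity.
  public static class ReplicateMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private int n, m;
    private long recordCount = 0;

    @Override
    protected void setup(Context context) {
      n = context.getConfiguration().getInt("crossmatch.n", 1);
      m = context.getConfiguration().getInt("crossmatch.m", 1);
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      int i = (int) (recordCount++ % n);   // spreads load across reducers
      for (int j = 0; j < m; j++) {        // one copy per segment of the 10GB side
        context.write(new Text(i + "," + j), record);
      }
    }
  }

  // Reducer: load the j-th segment of the small side, compare everything.
  public static class SegmentJoinReducer
      extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> records, Context context)
        throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      int j = Integer.parseInt(key.toString().split(",")[1]);

      // Hypothetical layout: the 10GB data pre-split into files segment-0 .. segment-(m-1)
      Path segment = new Path(conf.get("crossmatch.segment.dir"), "segment-" + j);
      FileSystem fs = segment.getFileSystem(conf);
      List<String> lines = new ArrayList<String>();
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(segment)));
      try {
        String line;
        while ((line = in.readLine()) != null) {
          lines.add(line);
        }
      } finally {
        in.close();
      }

      for (Text record : records) {
        for (String candidate : lines) {
          if (matches(record.toString(), candidate)) {
            context.write(record, new Text(candidate));
          }
        }
      }
    }

    private boolean matches(String a, String b) {
      // Placeholder for whatever "match" means in your application.
      return a.equals(b);
    }
  }
}

If n is large and one reducer ends up handling many keys with the same j, caching the most recently loaded segment instead of re-reading it for every key is an easy win.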
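
And to illustrate the n-gram idea: the sketch below uses character trigrams as a blocking key, so that only records sharing at least one trigram ever reach the same reducer and need an edit-distance comparison. The trigram size and the class name are my own choices; a real text retrieval index such as Lucene would do this job better.

// Sketch only: emit each record under every distinct trigram it contains,
// so candidate pairs are restricted to records sharing at least one trigram.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NgramBlockingMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private static final int N = 3;   // trigram size; an assumption, tune for your data

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    String s = record.toString();
    // Collect distinct trigrams so the record lands in each relevant bucket only once.
    Set<String> grams = new HashSet<String>();
    for (int k = 0; k + N <= s.length(); k++) {
      grams.add(s.substring(k, k + N));
    }
    for (String gram : grams) {
      context.write(new Text(gram), record);
    }
  }
}

In the reducer you would then run the expensive comparison only within each trigram bucket, which is normally a tiny fraction of the full cross product; tagging each record with the data set it came from (for instance with a one-character prefix) keeps you from comparing the 50GB side against itself.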