On Thu, Jun 28, 2012 at 9:29 AM, Rahul <[email protected]> wrote:
> Yes indeed this is a small PoC to get familiar with Crunch in relation to my
> problem. Basically I have the following algo at play:
> 1. Read data rows
> 2. Create custom keys for each of them, built using various attributes of
> data (this time it is just a simple hash code, but I would like to emit
> multiple key-value pairs)
> 3. Group similar data based on created Keys
> 4. Iterate over individual items in the group and do extensive comparison
> between all of them
>
> I just built an outline in the test case to see what/how can be done, can
> you advise something better ?

Thanks for the outline. In this case, your approach (copying the contents of
the incoming Iterable into a collection) should work fine, as long as the
number of elements per group is relatively small (i.e. small enough to fit
easily in the memory available to each reducer in your Hadoop cluster).

- Gabriel
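
For illustration, the four steps in the quoted outline and the buffering approach discussed in the reply can be sketched in plain Java. This is only a rough sketch and does not use the Crunch API: the class name, keyFor, groupByKey and comparePairs are all hypothetical names, the in-memory map stands in for the grouping that the framework would do, and the "comparison" is a trivial equality check standing in for the real comparison logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupCompareSketch {

    // Step 2 (hypothetical key function): a simple hash code of the
    // lower-cased row, standing in for keys built from row attributes.
    static int keyFor(String row) {
        return row.toLowerCase().hashCode();
    }

    // Step 3: group rows by key. In a real job the framework does this
    // during the shuffle; a sorted map is used here as a stand-in.
    static Map<Integer, List<String>> groupByKey(List<String> rows) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String row : rows) {
            groups.computeIfAbsent(keyFor(row), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    // Step 4: copy the incoming Iterable into a list first, because the
    // values handed to a reducer can typically be traversed only once,
    // then compare every pair. Memory use grows with the group size.
    static List<String> comparePairs(Iterable<String> group) {
        List<String> buffered = new ArrayList<>();
        for (String item : group) {
            buffered.add(item);
        }
        List<String> results = new ArrayList<>();
        for (int i = 0; i < buffered.size(); i++) {
            for (int j = i + 1; j < buffered.size(); j++) {
                String a = buffered.get(i);
                String b = buffered.get(j);
                // Placeholder for the "extensive comparison" between items.
                String verdict = a.equals(b) ? "identical" : "differs";
                results.add(a + " vs " + b + ": " + verdict);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Step 1: read data rows (hard-coded here).
        List<String> rows = Arrays.asList("Apple", "apple", "pear", "APPLE");
        for (Map.Entry<Integer, List<String>> e : groupByKey(rows).entrySet()) {
            for (String line : comparePairs(e.getValue())) {
                System.out.println(line);
            }
        }
    }
}
```

A group of n items produces n * (n - 1) / 2 comparisons and holds all n items in memory at once, which is why the approach only suits groups that are relatively small.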
