FYI, The "reusing the same iterator" semantics seems to be a common pitfall. So I have extended the DoFn#process() documentation to explain the impact imposed by the Hadoop Reducer implementation.
On Thu, Jun 28, 2012 at 1:25 PM, Josh Wills <[email protected]> wrote: > Hey Rahul, > > Re: groupByKey vs. collectValues: collectValues calls groupByKey in > the course of its operations; it's basically just a convenience method > for the kind of problem you're trying to solve (i.e., I need to > iterate over the values returned by groupByKey multiple times, so > please put them into a collection.) There are lots of other cases > where you do not need to iterate over the values multiple times, and > so Crunch (much like MapReduce) doesn't bother to keep everything > around unless you explicitly ask it to do so. > > On Thu, Jun 28, 2012 at 4:13 AM, Rahul <[email protected]> wrote: > > Hi Gabriel, > > > > I have found a way by which Crunch supports the uses case of having > repeated > > iterators but I am not completely sure of the in-outs of the same. > > Basically rather than doing a groupBy on Ptable to get back a > > PGroupedTable, I used the collectValues API to get back a > > Ptable<key,Collection<values>>. > > > > PTable<Integer, Collection<TupleN>> collectValues = > > classifiedData.collectValues(); > > PTable<String, Integer> scores = collectValues.parallelDo("compute > > pairs", > > new PTableScoreCalculator(), > Writables.tableOf(Writables.strings(), > > Writables.ints())); > > > > > > Now when I do ParalledDo on the new collection I get back a Pair, having > > keyType and ArrayList<valueType>, over which I can do things as I wish. > > > > class PTableScoreCalculator extends DoFn<Pair<Integer, > Collection<TupleN>>, > > Pair<String, Integer>> { > > public void process(Pair<Integer, Collection<TupleN>> input, > > > > Emitter<Pair<String, Integer>> emitter) { > > Iterator<TupleN> primary = input.second().iterator(); > > ..................... > > } > > > > This way I could iterate over again and again, any comments on the same. > I > > am attaching my test case for reference. > > > > BTW why are there two methods that can do the same things the groupBykey > > method and the collectValues method ? I see an Aggregation gets invoked > for > > the collection API and in the other case a lazy collection gets created. > Any > > idea on the different applications of the two. > > > > regards, > > Rahul > > > > > > > > On 28-06-2012 14:17, Gabriel Reid wrote: > > > > On Thu, Jun 28, 2012 at 9:29 AM, Rahul <[email protected]> wrote: > > > > Yes indeed this is a small PoC to get familiar with Crunch in relation > to my > > problem. Basically I have the following algo at play: > > 1. Read data rows > > 2. Create custom keys for each of them, built using various attributes of > > data (this time it is just a simple hash code, but I would like to emit > > multiple key-value pairs) > > 3. Group similar data based on created Keys > > 4. Iterate over individual items in the group and do extensive comparison > > between all of them > > > > I just built an outline in the test case to see what/how can be done, can > > you advise something better ? > > > > Thanks for the outline. In this case, your approach (with putting the > > contents of the > > incoming Iterable into a collection) should work fine, as long as > > number of elements > > per group is relatively small (i.e. easily able to fit in the memory > > available to each reducer in your Hadoop cluster). > > > > - Gabriel > > > > > > > > -- > Director of Data Science > Cloudera > Twitter: @josh_wills >
