FYI, The "reusing the same iterator" semantics seems to be a common
pitfall. So I have extended the DoFn#process() documentation to explain the
impact imposed by the Hadoop Reducer implementation.

On Thu, Jun 28, 2012 at 1:25 PM, Josh Wills <[email protected]> wrote:

> Hey Rahul,
>
> Re: groupByKey vs. collectValues: collectValues calls groupByKey in
> the course of its operations; it's basically just a convenience method
> for the kind of problem you're trying to solve (i.e., I need to
> iterate over the values returned by groupByKey multiple times, so
> please put them into a collection.) There are lots of other cases
> where you do not need to iterate over the values multiple times, and
> so Crunch (much like MapReduce) doesn't bother to keep everything
> around unless you explicitly ask it to do so.
>
> On Thu, Jun 28, 2012 at 4:13 AM, Rahul <[email protected]> wrote:
> > Hi Gabriel,
> >
> > I have found a way by which Crunch supports the uses case of having
> repeated
> > iterators but I am not completely sure of the in-outs of the same.
> > Basically rather than doing a groupBy on Ptable to get back a
> > PGroupedTable,  I used the collectValues API to get back a
> > Ptable<key,Collection<values>>.
> >
> >     PTable<Integer, Collection<TupleN>> collectValues =
> > classifiedData.collectValues();
> >     PTable<String, Integer> scores = collectValues.parallelDo("compute
> > pairs",
> >         new PTableScoreCalculator(),
> Writables.tableOf(Writables.strings(),
> > Writables.ints()));
> >
> >
> > Now when I do ParalledDo on the new collection I get back a Pair, having
> > keyType and ArrayList<valueType>, over which I can do things as I wish.
> >
> > class PTableScoreCalculator extends DoFn<Pair<Integer,
> Collection<TupleN>>,
> > Pair<String, Integer>> {
> > public void process(Pair<Integer, Collection<TupleN>> input,
> >
> >       Emitter<Pair<String, Integer>> emitter) {
> >     Iterator<TupleN> primary = input.second().iterator();
> > .....................
> > }
> >
> > This way I could iterate over again and again, any comments on the same.
> I
> > am attaching my test case for reference.
> >
> > BTW why are there two methods that can do the same things  the groupBykey
> > method and the collectValues method ? I see  an Aggregation gets invoked
> for
> > the collection API and in the other case a lazy collection gets created.
> Any
> > idea on the different applications of the two.
> >
> > regards,
> > Rahul
> >
> >
> >
> > On 28-06-2012 14:17, Gabriel Reid wrote:
> >
> > On Thu, Jun 28, 2012 at 9:29 AM, Rahul <[email protected]> wrote:
> >
> > Yes indeed this is a small PoC to get familiar with Crunch in relation
> to my
> > problem. Basically I have the following algo at play:
> > 1. Read data rows
> > 2. Create custom keys for each of them, built using various attributes of
> > data (this time it is just a simple hash code, but I would like to emit
> > multiple key-value pairs)
> > 3. Group similar data based on created Keys
> > 4. Iterate over individual items in the group and do extensive comparison
> > between all of them
> >
> > I just built an outline in the test case to see what/how can be done, can
> > you advise something better ?
> >
> > Thanks for the outline. In this case, your approach (with putting the
> > contents of the
> > incoming Iterable into a collection) should work fine, as long as
> > number of elements
> > per group is relatively small (i.e. easily able to fit in the memory
> > available to each reducer in your Hadoop cluster).
> >
> > - Gabriel
> >
> >
>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
>

Reply via email to