Re: Problem in understanding UDF COUNT

Shahab Yunus Mon, 21 Jul 2014 08:35:07 -0700

Ashish,

*d = foreach c generate COUNT(b), group;*

I interpret or visualize it as:

c is a structure holding or consisting of groups of words or items. Imagine
a list where each entry is the groupid and each groupid points to a
collection of objects/items belonging to that same groupid. We can call
this collection b. You can also imagine c as a nested map, where the key is
distinct groupids and the value is a collection of items (again, let us
call it b) belonging to one key.

So, now you want to count how many items exist for each groupid in the list
(or map) c. Recall that we are calling the group of items for each entry of
c b.

c[0]=new york points to  [1,2,3]
c[1]=philadelphia points to  [1,2,3,4]
c[2]=boston points to  [5,6,7,8,9]

So in the above example, the c list has 3 unique groupids (new york,
boston and philadelphia) and each points to its own collection of items that
we are calling b. We want to know the count for each group, which is 3, 4 &
5 for new york, philadelphia & boston respectively.
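
To make the picture concrete, here is a rough model in Python (not Pig, and
only a sketch). It assumes c can be treated as a plain dict from group id to
its bag b; the variable name counts is just for illustration.

```python
# Model of relation c after GROUP: each group id maps to its own bag b.
c = {
    "new york": [1, 2, 3],
    "philadelphia": [1, 2, 3, 4],
    "boston": [5, 6, 7, 8, 9],
}

# COUNT(b) is evaluated once per group, on that group's bag only.
counts = {group: len(b) for group, b in c.items()}
print(counts)
```

Here len(b) stands in for COUNT(b): it is applied to each group's bag
separately, never to c as a whole.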

Now coming back to the pig statement once again:
*d = foreach c generate COUNT(b), group;*

This is exactly what we are doing....
*Counting for each c (new york, philadelphia, boston in our example), how
many b's are in there (3, 4 & 5).*

The second expression in the generate clause, 'group', gives us the group
id (the key of each entry in c) alongside each count of b.
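
The four statements of the original script can be approximated in Python as
follows. This is a sketch under assumptions: the sample lines in a are made
up, str.split() only roughly imitates TOKENIZE, and len() ignores the fact
that Pig's COUNT skips tuples whose first field is null.

```python
from collections import defaultdict

# a = load ...; hypothetical stand-in for the lines of the input file.
a = ["pig hive pig", "hive pig"]

# b = foreach a generate flatten(TOKENIZE($0)) as word;
b = [word for line in a for word in line.split()]

# c = group b by word;  each key is 'group', each value is the bag 'b'.
c = defaultdict(list)
for word in b:
    c[word].append(word)

# d = foreach c generate COUNT(b), group;
d = [(len(bag), group) for group, bag in c.items()]
print(d)
```

Note how the last step unpacks each entry of c into (group, bag) before
counting, which is why COUNT receives b rather than c.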

Regards,
Shahab

On Mon, Jul 21, 2014 at 11:02 AM, <[email protected]>
wrote:

> This was hard for me to get when I started using pig, and it still annoys
> me after 1.5 years' experience with pig. In mathematics and logic,
> quantifiers (like "for each", "there exist") bind variables that occur in
> their scope:
> (for each x)(there exists y) [y > x]
>
> The (for each x) binds x in (there exists y) [y > x]
>
> But in pig the variable x in (for each x) *does not bind occurrences of x*
> in the following subexpression. IMO this is an unnecessary stumbling block
> to people learning pig, who have a background in math or logic.
>
> Here is how you can read
>         foreach c generate COUNT(b), group;
> so it makes sense:
>         c's components are "group" and (bag) b, so:
>         foreach (group, b) in c generate COUNT(b), group;
>
> I would love it if the Pig syntax were extended to allow quantifiers like
>  "foreach (group, b) in c" but I don't know how feasible that would be.
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
>
>
> -----Original Message-----
> From: Ashish Dobhal [mailto:[email protected]]
> Sent: Monday, July 21, 2014 10:34 AM
> To: [email protected]
> Subject: Re: Problem in understanding UDF COUNT
>
> Shahab, thanks.
> My doubt is why we are taking the bag b and not the bag c as the argument in
> the COUNT(b) function.
> The bag c contains the groups, not the bag b.
> Thanks.
>
>
> On Mon, Jul 21, 2014 at 6:21 PM, Shahab Yunus <[email protected]>
> wrote:
>
> > Have you seen this documentation and blog?
> > http://squarecog.wordpress.com/2010/05/11/group-operator-in-apache-pig/
> > http://pig.apache.org/docs/r0.9.2/func.html#count
> >
> > They explain this in detail.
> >
> > Regards,
> > Shahab
> >
> >
> > On Mon, Jul 21, 2014 at 8:44 AM, Ashish Dobhal
> > <[email protected]>
> > wrote:
> >
> > > a = load '/user/hue/word_count_text.txt';
> > > b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
> > > c = group b by word;
> > > d = foreach c generate COUNT(b), group;
> > >
> > > I want to know what would be the input to the udf COUNT in this
> > > case. Also, what is the meaning of b being passed as an argument?
> > >
> > > Also I am still not clear about how COUNT operates.
> > >
> > > Thanks
> > >
> > > Ashish
> > >
> >
>
