Re: Simple word count in pig..

Jacob Perkins Wed, 20 Nov 2013 04:55:37 -0800

Jamal,

You're going to want to use FLATTEN and another GROUP BY. Consider:


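-- one (id, token) row per token in each bag
flattened = foreach processed generate id, flatten(tokens) as token;
-- group by the (id, token) pair and count the rows in each group
frequency = foreach (group flattened by (id, token)) generate
                flatten(group)   as (id, token),
                COUNT(flattened) as freq;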

Of course, this will spawn another map-reduce job. However, since COUNT is
algebraic, Pig will use combiners, which drastically reduces the amount of
data sent to the reducers.
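
To make the combiner point concrete, here's a rough sketch in plain Python
(not Pig, and not how Pig is implemented internally). It runs your two sample
rows through the same flatten / group / count steps, with a map-side partial
count standing in for the combiner. The function names and the one-row-per-
mapper split are assumptions made up for the illustration.

```python
from collections import Counter

# Sample rows in the shape of `processed`: (id, bag of tokens).
processed = [
    ("foobar", ["foo", "foo", "foobar", "bar"]),
    ("foo", ["bar", "bar"]),
]

def flatten(rows):
    """Mimic `generate id, flatten(tokens)`: one (id, token) pair per token."""
    return [(doc_id, token) for doc_id, tokens in rows for token in tokens]

def combine(pairs):
    """Map-side partial count for one split (the combiner's role)."""
    return Counter(pairs)

def reduce_counts(partials):
    """Reduce side: sum the partial counts for each (id, token) key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical split: pretend each input row went to a different mapper.
splits = [flatten([row]) for row in processed]
partials = [combine(split) for split in splits]
frequency = reduce_counts(partials)

for (doc_id, token), freq in sorted(frequency.items()):
    print(doc_id, token, freq)
```

In this toy run the 6 flattened (id, token) pairs collapse to 4 partial
counts before anything reaches the reduce step. That only works because a
count can be computed from partial counts, which is what "algebraic" buys you.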

--jacob
@thedatachef

On Nov 19, 2013, at 5:45 PM, jamal sasha <jamalsha...@gmail.com> wrote:

> Hi,
> 
> I have data already processed in following form:
> 
> 
> ( id ,{ bag of words})
> So for example:
> 
> (foobar, {(foo), (foo),(foobar),(bar)})
> (foo,{(bar),(bar)})
> 
> and so on..
> describe processed gives me:
> processed: {id: chararray,tokens: {tuple_of_tokens: (token: chararray)}}
> 
> 
> Now what I want is.. also count the number of times a word appears in this
> data and output it as
> foobar, foo, 2
> foobar,foobar,1
> foobar,bar,1
> foo,bar,2
> 
> and so on...
> 
> How do I do this in pig?
