Sample pseudocode.
The idea is to group tuples by movie_id and count the size of each group's bag.
movieAlias = LOAD 'path/to/movie/files' as (
user_id:long,movie_id:long,timestamp:long);
groupedByMovie = group movieAlias by movie_id;
counted = FOREACH groupedByMovie GENERATE group as movie_id,
    COUNT(movieAlias);
Really easy, fundamental actually.
a = GROUP your_data BY (user_id, movie_id);
b = FOREACH a GENERATE
    FLATTEN(group),
    COUNT($1)
;
-----Original Message-----
From: Chengi Liu [mailto:chengi.liu...@gmail.com]
Sent: Wednesday, May 14, 2014 1:25 PM
To: user@pig.apache.org
Subject: Frequency count in pig
Hi,
My data is in format:
user_id,movie_id,timestamp
123, abc,unix_timestamp
123, def, ...
123, abc, ...
234, sda, ...
Now, I want to compute the number of times each movie is played in Pig.
So the output I am expecting is:
123,abc,2
123,def,1
234,sda,1
and
such as the following:
movie = LOAD '$input' AS (user_id:int, movie_id:chararray, timestamp:int);
movie_group = GROUP movie BY (user_id, movie_id);
movie_count = FOREACH movie_group GENERATE FLATTEN(group) AS (user_id, movie_id),
COUNT($1) AS MovieCount;
On Thu, May 15, 2014 at 4:25 AM, Chengi Liu
Hi,
I am using HCatLoader to load data from a table (existing in hive).
A = load 'rwf_data' USING org.apache.hcatalog.pig.HCatLoader();
describe A;
I got Error 1115: Table not found : ...
It is weird. Any suggestions on this? Thanks
Patcharee
You can either run hadoop fs -mv, if the job is launched from a wrapper script, or
use getmerge to merge all the part files into a single file with the name you want.
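
As a rough sketch, both options can also be run from inside the Pig script itself, since Pig passes HDFS shell commands through its fs statement. The relation name (results) and all paths below are made up for illustration, and the part file name depends on the job: typically part-r-00000 when there is a reduce phase, part-m-00000 for a map-only job.

```pig
-- Hypothetical relation and paths, for illustration only.
-- Assumes a relation named results was defined earlier in the script.
STORE results INTO 'output/tmp_results' USING PigStorage(',');

-- Use one of the two options below, not both.

-- Option 1: rename a single part file to the name you want (stays on HDFS).
fs -mv output/tmp_results/part-r-00000 output/results.csv

-- Option 2: merge all part files into one file. Note that getmerge
-- writes the merged file to the local filesystem, not to HDFS.
fs -getmerge output/tmp_results /tmp/results.csv
```

Option 1 only makes sense when the job writes exactly one part file; with several reducers, getmerge is the simpler route.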
On May 14, 2014, at 2:11 AM, Patcharee Thongtra patcharee.thong...@uni.no
wrote:
Hi,
Is it possible to store results into a file with a determined filename,
instead
Hi there,
You could do that with the help of the MultipleOutputFormat class
(http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html).
It extends FileOutputFormat and allows us to write the output data
to different output files.
*Warm regards,*
*Mohammad Tariq*