Re: Spreading data in Pig

Jacob Perkins Sun, 31 Mar 2013 11:13:37 -0700

Hi John,

The only way I can think of to do this is using the RANK operator
(available only in pig version 0.11) along with a custom udf as follows:


* RANK the users relation to result in something like:

(User1, 1)
(User2, 2)
(User3, 3)
(User4, 4)
(User5, 5)
(User6, 6)
(User7, 7)
(User8, 8)
(User9, 9)

* Use a udf that functions much like the rstats "seq" function
(http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html) that
generates a bag containing integers from 0 up to the capacity of a given
host:

(Hostb, {(0),(1)})
(Hostc, {(0),(1),(2),(3)})
(Hostd, {(0),(1),(2)})

which can then be flattened in a projection to result in:

(Hostb, 0)
(Hostb, 1)
(Hostc, 0)
(Hostc, 1)
(Hostc, 2)
(Hostc, 3)
(Hostd, 0)
(Hostd, 1)
(Hostd, 2)

(Basically reversing any aggregation that was done to produce the
capacity count in the first place...)

* Rank the exploded set of hosts to result in:

(Hostb, 1)
(Hostb, 2)
(Hostc, 3)
(Hostc, 4)
(Hostc, 5)
(Hostc, 6)
(Hostd, 7)
(Hostd, 8)
(Hostd, 9)

* You can then join the ranked hosts and the ranked users by rank and
project out fields you don't need to result in:

(Hostb, User1)
(Hostb, User2)
(Hostc, User3)
(Hostc, User4)
(Hostc, User5)
(Hostc, User6)
(Hostd, User7)
(Hostd, User8)
(Hostd, User9)

Here's some example pig code that I used that works with pig 0.11 (I
already have a Seq udf):

************

users = load 'users' as (user_id:chararray);
hosts = load 'hosts' as (host_id:chararray, capacity:int);

hosts_exploded = foreach hosts {
                   sequence = Seq(0, capacity, capacity);
                   generate
                     host_id           as host_id,
                     flatten(sequence) as num;
                 };

ranked_users = rank users;
ranked_hosts = rank hosts_exploded;

spread = foreach (join ranked_users by $0, ranked_hosts by $0) generate
host_id, user_id;

dump spread;

************


Hope that helps!

--jacob
@thedatachef

On Sun, 2013-03-31 at 12:06 -0400, John Meek wrote:
> hey all,
> 
> Can anyone let me know how I can accomplish below problem in Pig?
> 
> I have 2 data sources:
> 
> TABLE A with a list of User IDs:
> 
> User1
> User2
> User3
> User4
> User5
> User6
> User7
> User8
> User9
> 
> TABLE B with (Host name, Capacity):
> 
> Hostb 2
> Hostc 4
> Hostd 3
> 
> 
> I basically need to spread the data in table A based on Table B based on how 
> much capacity Table B has.
> 
> So end result should be a file:
> 
> User1 Hostb
> User2 Hostb
> User3 Hostc
> User4 Hostc
> User5 Hostc
> User6 Hostc
> User7 Hostd
> User8 Hostd
> User9 Hostd
> 
> The order does not matter as long as each Host gets the capacity it can take. 
> Also the SUM(TableB.Capacity) will always be COUNT(TableA.UserID) so there 
> wont be any extra or less values to plug in.
> 
> 
> thanks,
> JM
> 
>

Re: Spreading data in Pig

Reply via email to