Hi John,

The only way I can think of to do this is to use the RANK operator (available as of Pig 0.11) along with a custom udf, as follows:

* RANK the users relation to result in something like:

(User1, 1)
(User2, 2)
(User3, 3)
(User4, 4)
(User5, 5)
(User6, 6)
(User7, 7)
(User8, 8)
(User9, 9)

* Use a udf that functions much like the R "seq" function (http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html) to generate a bag containing the integers from 0 up to one less than the capacity of a given host:

(Hostb, {(0),(1)})
(Hostc, {(0),(1),(2),(3)})
(Hostd, {(0),(1),(2)})

which can then be flattened in a projection to result in:

(Hostb, 0)
(Hostb, 1)
(Hostc, 0)
(Hostc, 1)
(Hostc, 2)
(Hostc, 3)
(Hostd, 0)
(Hostd, 1)
(Hostd, 2)

(Basically reversing any aggregation that was done to produce the capacity count in the first place...)

* Rank the exploded set of hosts to result in:

(Hostb, 1)
(Hostb, 2)
(Hostc, 3)
(Hostc, 4)
(Hostc, 5)
(Hostc, 6)
(Hostd, 7)
(Hostd, 8)
(Hostd, 9)

* You can then join the ranked hosts and the ranked users by rank and project out the fields you don't need to result in:

(Hostb, User1)
(Hostb, User2)
(Hostc, User3)
(Hostc, User4)
(Hostc, User5)
(Hostc, User6)
(Hostd, User7)
(Hostd, User8)
(Hostd, User9)

Here's some example pig code that I used that works with pig 0.11 (I already have a Seq udf):

************
users = load 'users' as (user_id:chararray);
hosts = load 'hosts' as (host_id:chararray, capacity:int);

-- Seq is a custom udf (not a pig builtin); register and define it before this point
hosts_exploded = foreach hosts {
                   sequence = Seq(0, capacity, capacity);
                   generate
                     host_id           as host_id,
                     flatten(sequence) as num;
                 };

ranked_users = rank users;
ranked_hosts = rank hosts_exploded;

-- rank prepends the rank as the first field ($0), so both relations are joined on $0
spread = foreach (join ranked_users by $0, ranked_hosts by $0) generate host_id, user_id;

dump spread;
************

Hope that helps!

--jacob
@thedatachef

On Sun, 2013-03-31 at 12:06 -0400, John Meek wrote:
> hey all,
>
> Can anyone let me know how I can accomplish below problem in Pig?
>
> I have 2 data sources:
>
> TABLE A with a list of User IDs:
>
> User1
> User2
> User3
> User4
> User5
> User6
> User7
> User8
> User9
>
> TABLE B with (Host name, Capacity):
>
> Hostb 2
> Hostc 4
> Hostd 3
>
>
> I basically need to spread the data in table A based on Table B based on how
> much capacity Table B has.
>
> So end result should be a file:
>
> User1 Hostb
> User2 Hostb
> User3 Hostc
> User4 Hostc
> User5 Hostc
> User6 Hostc
> User7 Hostd
> User8 Hostd
> User9 Hostd
>
> The order does not matter as long as each Host gets the capacity it can take.
> Also the SUM(TableB.Capacity) will always be COUNT(TableA.UserID) so there
> wont be any extra or less values to plug in.
>
>
> thanks,
> JM
>
>
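
If it helps to check the logic outside of pig, the same explode / rank / join idea can be sketched in plain Python. This is only a rough model, not the pig job itself: the names seq, rank and spread are made up for illustration, seq assumes the Seq udf yields 0 through capacity - 1 (as in the bags shown in the reply), and rank assumes rows are simply numbered from 1 in input order.

```python
def seq(capacity):
    """Hypothetical stand-in for the custom Seq udf: integers 0 .. capacity - 1."""
    return list(range(capacity))


def rank(rows):
    """Rough model of RANK with no BY clause: prepend a 1-based row number."""
    return [(i,) + tuple(row) for i, row in enumerate(rows, start=1)]


def spread(users, hosts):
    """Assign each user to a host slot by joining the two ranked lists on rank."""
    # Explode (host, capacity) into one row per slot, like flatten(Seq(...)).
    hosts_exploded = [(host, n) for host, capacity in hosts for n in seq(capacity)]

    ranked_users = rank([(user,) for user in users])
    ranked_hosts = rank(hosts_exploded)

    # Inner join on the rank field, then keep only host and user.
    host_by_rank = {r: host for r, host, _num in ranked_hosts}
    return [(host_by_rank[r], user) for r, user in ranked_users if r in host_by_rank]


users = ["User%d" % i for i in range(1, 10)]
hosts = [("Hostb", 2), ("Hostc", 4), ("Hostd", 3)]

assignment = spread(users, hosts)
for host, user in assignment:
    print(host, user)
```

Because the join is on rank alone, each host receives exactly as many users as it has exploded rows, which is the capacity constraint from the original question.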
