Re: how to optimize multiple stores

2015-01-08 Thread Marco Cadetg
having to group everything. Cheers, Rodrigo 2015-01-08 7:27 GMT-02:00 Marco Cadetg ma...@zattoo.com: Hi there, I've got a big Pig script which first generates an expensive intermediate result on which I run multiple group by statements and multiple stores. Something like

how to optimize multiple stores

2015-01-08 Thread Marco Cadetg
Hi there, I've got a big Pig script which first generates an expensive intermediate result on which I run multiple group by statements and multiple stores. Something like this. Register UDFs etc A = LOAD B = LOAD C = LOAD -- do lots of transformations with A and B and C get
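
The usual way to express this shape is to let Pig's multi-query execution share the expensive intermediate across all the STOREs. A minimal sketch of that shape; the paths, schemas and the join standing in for "lots of transformations" are placeholders, not the original script:

A = LOAD '/data/a' USING PigStorage(',') AS (key:chararray, v1:long);
B = LOAD '/data/b' USING PigStorage(',') AS (key:chararray, v2:long);
-- ... the expensive transformations, reduced here to a single join ...
expensive = JOIN A BY key, B BY key;

by_key = GROUP expensive BY A::key;
out1   = FOREACH by_key GENERATE group AS key, SUM(expensive.v1) AS total_v1;

by_v2  = GROUP expensive BY v2;
out2   = FOREACH by_v2 GENERATE group AS v2, COUNT(expensive) AS cnt;

-- With multi-query execution (on by default), Pig plans both STOREs in one
-- run, so 'expensive' is computed only once instead of once per STORE.
STORE out1 INTO '/out/by_key';
STORE out2 INTO '/out/by_v2';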

Re: nested order limit by percentage of overall records

2013-03-19 Thread Marco Cadetg
to use a combination of Piggybank and custom UDFs. On Mon, Mar 18, 2013 at 5:13 PM, Marco Cadetg ma...@zattoo.com wrote: Thanks a lot Mike. This seems to be what I'm looking for ;) I'm a bit disappointed that what I wanted to achieve isn't possible without using any UDF. Cheers, -Marco

nested order limit by percentage of overall records

2013-03-18 Thread Marco Cadetg
Hi there, I would like to do something very similar to a nested foreach using order by and then limit. But I would like the limit to be relative to the total number of records. users = load 'users' as (userid:chararray, money:long, region:chararray); grouped_region = group users by region;
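
The replies above point at a Piggybank/custom-UDF combination. A minimal sketch of that shape, continuing the script from the question; TopPercent is a hypothetical UDF, not something that ships with Pig:

DEFINE TopPercent my.udfs.TopPercent('0.10');  -- hypothetical UDF that keeps the
                                               -- first 10% of an ordered bag

users          = LOAD 'users' AS (userid:chararray, money:long, region:chararray);
grouped_region = GROUP users BY region;
top_by_region  = FOREACH grouped_region {
    ordered = ORDER users BY money DESC;
    GENERATE group AS region, FLATTEN(TopPercent(ordered));
};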

Re: nested order limit by percentage of overall records

2013-03-18 Thread Marco Cadetg
inputs. You can use this to receive the top x% for any given field and then you can use that within a filter On Mon, Mar 18, 2013 at 6:23 AM, Marco Cadetg ma...@zattoo.com wrote: Hi there, I would like to do something very similar to a nested foreach using order by and then limit

reduce continuous sessions

2012-08-30 Thread Marco Cadetg
Hi there, I have some user sessions which look something like the following: id:chararray, start:long(unix timestamp), end:long(unix timestamp) xxx,1,3 xxx,4,7 yyy,1,2 yyy,5,7 zzz,6,7 zzz,7,10 I would like to combine the rows which belong to a continuous session e.g. in my example the
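
One way to sketch this in Pig: order each user's rows and hand the bag to a UDF that stitches adjacent intervals together. MergeSessions is hypothetical, since walking an ordered bag row by row is not something the built-in operators do:

DEFINE MergeSessions my.udfs.MergeSessions();  -- hypothetical UDF that concatenates
                                               -- rows whose start meets the previous end

sessions = LOAD 'sessions' USING PigStorage(',')
           AS (id:chararray, start_ts:long, end_ts:long);  -- start/end as in the post
by_id    = GROUP sessions BY id;
merged   = FOREACH by_id {
    ordered = ORDER sessions BY start_ts;
    GENERATE FLATTEN(MergeSessions(ordered));
};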

filter duplicates from a bag

2012-08-24 Thread Marco Cadetg
Hi there, What is the best way to retrieve duplicates from a bag? I basically would like to do something like the opposite of DISTINCT. A: {userid: long, foo: long, bar: long} dump A (1,2,3) (1,2,3) (1,3,2) (2,3,1) Now I would like to have a bag which contains (1,2,3) (1,2,3) Thanks, -Marco

Re: filter duplicates from a bag

2012-08-24 Thread Marco Cadetg
= foreach D generate group; Disclaimer: untested code. Cheers, -- Gianmarco On Fri, Aug 24, 2012 at 11:35 AM, Marco Cadetg ma...@zattoo.com wrote: Hi there, What is the best way to retrieve duplicates from a bag. I basically would like to do something like the opposite of DISTINCT
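
Filling in the group-and-filter idea from the reply above as a complete sketch: group on the whole tuple, keep only groups that occur more than once, then flatten them back to recover the duplicate rows themselves:

A = LOAD 'data' USING PigStorage(',') AS (userid:long, foo:long, bar:long);
B = GROUP A BY (userid, foo, bar);   -- group on the whole tuple
C = FILTER B BY COUNT(A) > 1;        -- keep only tuples that occur more than once
D = FOREACH C GENERATE FLATTEN(A);   -- for the sample input this yields (1,2,3) (1,2,3)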

join result dataset bigger than before

2012-06-26 Thread Marco Cadetg
Hi there, I'm doing a join like this: A = LOAD '/data/sessions' USING PigStorage(',') AS (userid:chararray, client_type:chararray, flag:long); A1 = GROUP A ALL; A1 = FOREACH A1 GENERATE COUNT(A); DUMP A1 (543872) B = LOAD '/data/userdb' USING PigStorage(',') AS (uid:chararray,
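
A join grows beyond its inputs when the same key appears more than once on both sides, because every match multiplies out. A quick sketch for spotting such keys in the user table before joining; the second field name is a placeholder, since the original line is cut off here:

B         = LOAD '/data/userdb' USING PigStorage(',')
            AS (uid:chararray, attr:chararray);   -- 'attr' stands in for the cut-off field
by_uid    = GROUP B BY uid;
dup_uids  = FILTER by_uid BY COUNT(B) > 1;
dup_count = FOREACH dup_uids GENERATE group AS uid, COUNT(B) AS n;
DUMP dup_count;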

exclude rows from group

2012-02-28 Thread Marco Cadetg
Hi there, I'm trying to retrieve the group of 'rich' userids which are not 'happy'. Something like: retrieve all ids which are not in the other bag's ids. Is there a better way to exclude some rows from a group? Example code: A: {userid: chararray, user_type: chararray} A: (1,rich) (1,happy)
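
One way to express "rich but not happy" without grouping everything is an anti-join: a sketch, assuming only the schema from the example above.

A         = LOAD 'users' USING PigStorage(',') AS (userid:chararray, user_type:chararray);
rich      = FILTER A BY user_type == 'rich';
happy     = FILTER A BY user_type == 'happy';
joined    = JOIN rich BY userid LEFT OUTER, happy BY userid;
only_rich = FILTER joined BY happy::userid IS NULL;   -- no 'happy' row matched
ids       = FOREACH only_rich GENERATE rich::userid AS userid;
ids       = DISTINCT ids;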

Re: exclude rows from group

2012-02-28 Thread Marco Cadetg
; --jacob @thedatachef On Tue, 2012-02-28 at 16:49 +0100, Marco Cadetg wrote: Hi there, I'm trying to retrieve the group of 'rich' userids which are not 'happy'. Something like: retrieve all ids which are not in the other bag's ids. Is there a better way to exclude some rows from a group

overwrite output

2012-01-17 Thread Marco Cadetg
Hi there, AFAICT the STORE function doesn't provide a way to overwrite the output. I guess you could use your own storage UDF to accomplish that, but is there another way of doing it? Thanks -Marco
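
Outside of writing a custom StoreFunc, the common workaround is to remove the target from inside the script before the STORE. A minimal sketch using Pig's rmf shell command; the paths are placeholders:

rmf /output/report;            -- 'remove force': succeeds even if the path does not exist
A = LOAD '/input/data';
STORE A INTO '/output/report';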

Re: creating a graph over time

2011-11-01 Thread Marco Cadetg
/visualization tool of choice... Guy On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg ma...@zattoo.com wrote: The data is not about students but about television ;) Regarding the size. The raw input data size is about 150m although when I 'explode' the timeseries it will be around

Re: creating a graph over time

2011-10-31 Thread Marco Cadetg
On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes fatal.er...@gmail.com wrote: how big is your dataset? On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg ma...@zattoo.com wrote: Thanks Bill and Norbert that seems like what I was looking for. I'm a bit worried about how much data/io

creating a graph over time

2011-10-27 Thread Marco Cadetg
I have a problem where I don't know how, or even whether, Pig is suitable to solve it. I have a schema like this: student-id,student-name,start-time,duration,course 1,marco,1319708213,500,math 2,ralf,1319708111,112,english 3,greg,1319708321,333,french 4,diva,1319708444,80,english
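
One sketch of how this could be pushed through Pig, along the lines of the "explode the timeseries" idea that comes up in this thread: a hypothetical ExpandInterval UDF emits one bucket timestamp per minute between start-time and start-time + duration, after which it is an ordinary GROUP and COUNT.

DEFINE ExpandInterval my.udfs.ExpandInterval('60');  -- hypothetical UDF, 60-second buckets

students   = LOAD 'students' USING PigStorage(',')
             AS (sid:int, name:chararray, start_ts:long, duration:long, course:chararray);
buckets    = FOREACH students GENERATE course,
             FLATTEN(ExpandInterval(start_ts, duration)) AS bucket_ts;
per_bucket = GROUP buckets BY (course, bucket_ts);
counts     = FOREACH per_bucket GENERATE FLATTEN(group) AS (course, bucket_ts),
             COUNT(buckets) AS concurrent;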

Re: creating a graph over time

2011-10-27 Thread Marco Cadetg
27, 2011, Marco Cadetg ma...@zattoo.com wrote: I have a problem where I don't know how, or even whether, Pig is suitable to solve it. I have a schema like this: student-id,student-name,start-time,duration,course 1,marco,1319708213,500,math 2,ralf,1319708111,112,english 3,greg

Re: calculate percentage

2011-10-12 Thread Marco Cadetg
; } total_iq_per_gender = GROUP A BY (gender); total_iq_per_gender = FOREACH A { GENERATE FLATTEN(group), SUM(A.iq) AS iq_per_gender; } Now I guess I could use JOIN to combine both bags(?) by gender but somehow I don't get it. Thanks -Marco On Tue, Oct 11, 2011 at 6:02 PM, Marco Cadetg ma

Re: calculate percentage

2011-10-12 Thread Marco Cadetg
FOREACH GENERATE needs to look like. On Wed, Oct 12, 2011 at 10:34 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Sure, just join your total counts with your partials on gender. D On Tue, Oct 11, 2011 at 11:58 PM, Marco Cadetg ma...@zattoo.com wrote: D'oh I just see that unfortunately my

Re: calculate percentage

2011-10-12 Thread Marco Cadetg
: (Male,Here,0.89285713) (Male,There,0.10714286) (Female,Here,0.13793103) (Female,There,0.86206895) Norbert On Wed, Oct 12, 2011 at 5:38 AM, Marco Cadetg ma...@zattoo.com wrote: Yes but I'm still not able to compute the percentage. I've joined the bags as below. A = LOAD '/data/marco

replace value of a given field

2011-10-11 Thread Marco Cadetg
Hi there, I would like to replace the value of a field based on its value. E.g.: A = LOAD 'student' USING PigStorage() AS (name:chararray); DUMP A; (John) (Mary) (Bill) (Joe) (John) Now I would like to replace all Johns with Marco. Is there a way to do this in Pig? I thought about using sth

Re: replace value of a given field

2011-10-11 Thread Marco Cadetg
' ? 'Marco' : (name == 'Sally' ? 'Anne' : name)) as name; Maybe there's a better way if you have to do lots of these at once..? See also REPLACE for replacing a substring. On 11 October 2011 08:51, Marco Cadetg ma...@zattoo.com wrote: Hi there, I would like to replace the value of a field
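
Spelled out, the bincond from the reply above looks like this for the single John-to-Marco replacement:

A = LOAD 'student' USING PigStorage() AS (name:chararray);
B = FOREACH A GENERATE (name == 'John' ? 'Marco' : name) AS name;
DUMP B;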

calculate percentage

2011-10-11 Thread Marco Cadetg
Hi there, I would need to do something like this: A = LOAD 'student' USING PigStorage() AS (name:chararray, region:chararray, iq:int); DUMP A; (John, There, 10) (Alf, There, 10) (ET, There, 10) (Mary, Here, 80) (Bill, Here, 100) (Joe, Here, 150) total_iq_per_region = GROUP A BY (region);
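
A sketch of the per-region share: sum the IQ per region, sum it overall, and divide, here using the relation-as-scalar projection of the single-row total (a JOIN or CROSS against the total, as discussed in the replies, works as well):

A         = LOAD 'student' USING PigStorage() AS (name:chararray, region:chararray, iq:int);
by_region = GROUP A BY region;
region_iq = FOREACH by_region GENERATE group AS region, SUM(A.iq) AS iq_sum;
all_iq    = GROUP A ALL;
total     = FOREACH all_iq GENERATE SUM(A.iq) AS iq_sum;
pct       = FOREACH region_iq GENERATE region,
            (double)iq_sum / (double)total.iq_sum AS pct_of_total;
DUMP pct;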

pig hadoop 0.20.204.0

2011-09-13 Thread Marco Cadetg
Hi there, I have a problem with Pig 0.9.0 and Hadoop 0.20.204: http://hadoop.apache.org/common/releases.html#5+Sep%2C+2011%3A+release+0.20.204.0+available I tried several things but I am unable to use Pig with that version of Hadoop. When using the 'withouthadoop' Pig build produced via ant

Re: pig hadoop 0.20.204.0

2011-09-13 Thread Marco Cadetg
) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) ... 26 more On Tue, Sep 13, 2011 at 11:09 AM, Marco Cadetg ma

Re: pig hadoop 0.20.204.0

2011-09-13 Thread Marco Cadetg
Sorry for spamming the list; it looks like some jars were missing from the classpath. Cheers -Marco On Tue, Sep 13, 2011 at 3:13 PM, Marco Cadetg ma...@zattoo.com wrote: I'm wondering if this is rather a hadoop configuration problem or a problem between pig and hadoop? Here is the complete