Re: wrong sort order (lexical vs numeric) in a nested foreach

2012-08-30 Thread Lauren Blau
sorry, premature email :-). relation = key1 ,key2,orderkey1,val; //schema is (chararray,int,int,chararray); groupbykey = group relation by (key1,key2); foreach groupbykey { sorted = order relation by orderkey1; generate flatten($0), MyUDF(sorted); } I notice that when the 'sorted' value

Re: Validating tuple length

2012-08-30 Thread Norbert Burger
FILTER SIZE(tuple) == 14 won't work for your use case? On Thu, Aug 30, 2012 at 3:39 PM, Sam William wrote: > Hi, > I was wondering if it is possible to validate records by checking the tuple > length. I expect every record to have 14 fields, but some records might be > corrupt. I want to filte

Validating tuple length

2012-08-30 Thread Sam William
Hi, I was wondering if it is possible to validate records by checking the tuple length. I expect every record to have 14 fields, but some records might be corrupt. I want to filter those out. I tried checking ($13 is null), but this also matches records that have a null value in the 14th field.
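
A minimal Python sketch of the distinction the question is after, not Pig code: a record with an empty 14th field still has 14 fields, while a truncated record does not. The tab delimiter and the helper name has_expected_length are assumptions made for illustration only.

```python
EXPECTED_FIELDS = 14  # field count stated in the question


def has_expected_length(line, delimiter="\t", expected=EXPECTED_FIELDS):
    """Return True when the raw line splits into exactly `expected` fields.

    An empty trailing field still counts as a field, so a record whose
    last value is missing (null) is kept, while a short record is rejected.
    """
    return len(line.split(delimiter)) == expected


complete = "\t".join(str(i) for i in range(14))              # 14 values
null_last = "\t".join([str(i) for i in range(13)] + [""])    # 14th field empty
too_short = "\t".join(str(i) for i in range(13))             # only 13 fields

valid = [r for r in (complete, null_last, too_short) if has_expected_length(r)]
```

Checking the field count rather than testing the last field for null is what separates "corrupt" from "legitimately empty" here.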

Re: Loading Map's in Pig

2012-08-30 Thread pablomar
good finding ! On Thu, Aug 30, 2012 at 1:43 PM, Cheolsoo Park wrote: > Looking at PigStorage source code, this looks like what's happening. > > When the comma ',' is the delimiter, PigStorage splits the input string as > follows: > > 151364,[id#812,pref#secondary] > > => > > 151364 > [id#812 > p

Re: Loading Map's in Pig

2012-08-30 Thread Cheolsoo Park
Looking at PigStorage source code, this looks like what's happening. When the comma ',' is the delimiter, PigStorage splits the input string as follows: 151364,[id#812,pref#secondary] => 151364 [id#812 pref#secondary] Now "[id#812" is not a map literal, so it ends up being null. Thanks, Cheol
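
The splitting described above can be imitated with a short Python sketch. This is not PigStorage's actual code; naive_split and looks_like_map_literal are hypothetical helpers that only mimic "split on every delimiter, ignoring brackets".

```python
def naive_split(line, delimiter=","):
    """Split on every occurrence of the delimiter, with no bracket awareness."""
    return line.split(delimiter)


def looks_like_map_literal(field):
    """Rough check: a map literal is wrapped in a matching pair of brackets."""
    return field.startswith("[") and field.endswith("]")


# Comma is both the field delimiter and the separator inside the map.
comma_fields = naive_split("151364,[id#812,pref#secondary]")

# With a different field delimiter (a tab here), the map stays in one piece.
tab_fields = naive_split("151364\t[id#812,pref#secondary]", delimiter="\t")
```

With the comma delimiter the second field is "[id#812", which fails the map-literal check, matching the null result reported in the thread.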

Re: Loading Map's in Pig

2012-08-30 Thread pablomar
I guess it is because you are using ',' as the separator (there is a comma between your val and the map) and then again inside the map (is that a bug?). I tried with a space to separate the int from the map (input.txt): 151364 [id#812,pref#secondary] 121211 [id#212,pref#primary] and with this script:

Re: Count of all the rows

2012-08-30 Thread Mohit Anchlia
I looked at the definition of Relation, which says: A relation is a bag (more specifically, an outer bag). If a relation is a bag, then what's the difference between a Bag and a Relation? I am getting a bit confused by the definitions. In the example below, what would be a Relation, a Tuple or a Bag? (1,2,3,4) Is 1

RE: group by clickstream

2012-08-30 Thread Steve Bernstein
Nvm, here's what I'll do, but if anyone has a better idea, please do tell. I'll STORE the bag using PigStorage(';') to delimit the chararrays, then reload it with an appropriate schema, treating the page sequences as concatenated strings, then group and count by those. I can SPLIT() out the out

RE: group by clickstream

2012-08-30 Thread Steve Bernstein
Some clarification on the below. Ignore the outer bag; I'd removed some data elements for clarity and simplicity. Basically, I'm trying to find a way to go from: {(pg),(pg),...,(pg)} to {(pg,pg,...,pg)} for an arbitrary number of "pg" tuples. SB -Original Message- From: Steve Bernst
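
The reshaping asked for here, sketched in Python rather than Pig: a bag of single-field tuples is collapsed into one tuple holding every page in order. A list of 1-tuples stands in for the bag, and flatten_bag is a hypothetical helper, not a Pig built-in.

```python
def flatten_bag(bag):
    """Collapse [(pg,), (pg,), ...] into a single tuple (pg, pg, ...)."""
    return tuple(fields[0] for fields in bag)


pages = [("home",), ("search",), ("checkout",)]  # made-up page names
sequence = flatten_bag(pages)

# Joining the pages into one string gives a value that can be grouped and counted.
sequence_key = ";".join(sequence)
```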

RE: reduce continuous sessions

2012-08-30 Thread Steve Bernstein
You might want to check out LinkedIn's DataFu contribution, particularly the "sessionize" UDF: http://sna-projects.com/datafu/javadoc/0.0.4/datafu/pig/sessions/Sessionize.html

Re: reduce continuous sessions

2012-08-30 Thread Marco Cadetg
Unfortunately it's not that simple. A = LOAD 'comb.txt' USING PigStorage(',') AS (id:chararray,start:long,end:long); B = FOREACH (GROUP A BY id) { GENERATE FLATTEN(group),MIN(A.start),MAX(A.end); } dump B (xxx,1,7) (yyy,1,7) (zzz,6,10) This is not what I want. I want only to reduce the rows / ses

Re: reduce continuous sessions

2012-08-30 Thread Prashant Kommireddi
Seems like you are looking to group by "id" and get the MIN and MAX timestamp for each group? On Thu, Aug 30, 2012 at 1:00 AM, Marco Cadetg wrote: > Hi there, > > I have some user sessions which look something like the following: > > id:chararray, start:long(unix timestamp), end:long(unix

reduce continuous sessions

2012-08-30 Thread Marco Cadetg
Hi there, I have some user sessions which look something like the following: id:chararray, start:long(unix timestamp), end:long(unix timestamp) xxx,1,3 xxx,4,7 yyy,1,2 yyy,5,7 zzz,6,7 zzz,7,10 I would like to combine the rows which belong to a continuous session, e.g. in my example the re
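
One way to read the request, sketched in Python rather than Pig: merge consecutive rows of the same id when the next start is at most max_gap after the previous end. The message is cut off before it states the rule, so the threshold max_gap=1 is an assumption chosen to fit the sample rows, and merge_sessions is a hypothetical helper.

```python
def merge_sessions(rows, max_gap=1):
    """Merge (id, start, end) rows of the same id that are contiguous.

    Two rows are treated as one session when the later start is no more
    than `max_gap` after the running end (an assumed rule, see lead-in).
    """
    merged = []
    for row_id, start, end in sorted(rows):
        if merged and merged[-1][0] == row_id and start - merged[-1][2] <= max_gap:
            prev_id, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_id, prev_start, max(prev_end, end))
        else:
            merged.append((row_id, start, end))
    return merged


rows = [
    ("xxx", 1, 3), ("xxx", 4, 7),
    ("yyy", 1, 2), ("yyy", 5, 7),
    ("zzz", 6, 7), ("zzz", 7, 10),
]
sessions = merge_sessions(rows)
```

Unlike a plain MIN/MAX per id, this keeps the two yyy rows apart because of the gap between 2 and 5.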