sorry, premature email :-).
-- schema is (chararray, int, int, chararray)
relation = LOAD 'input' AS (key1:chararray, key2:int, orderkey1:int, val:chararray);
groupbykey = GROUP relation BY (key1, key2);
result = FOREACH groupbykey {
    sorted = ORDER relation BY orderkey1;
    GENERATE FLATTEN(group), MyUDF(sorted);
}
I notice that when the 'sorted' value
Won't a FILTER with SIZE(tuple) == 14 work for your use case?
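Something along these lines (a rough sketch; the file name is made up, and it assumes the data is comma-delimited and loaded without a schema):

raw = LOAD 'records.txt' USING PigStorage(',');
-- TOTUPLE(*) repacks all fields into one tuple so SIZE can count them;
-- SIZE on a tuple returns the number of fields
valid = FILTER raw BY SIZE(TOTUPLE(*)) == 14L;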
On Thu, Aug 30, 2012 at 3:39 PM, Sam William wrote:
Hi,
I was wondering if it is possible to validate records by checking the tuple
length. I expect every record to have 14 fields, but some records might be
corrupt. I want to filter those out. I tried checking ($13 is null), but
this includes records which have a null value in the 14th field.
Good finding!
On Thu, Aug 30, 2012 at 1:43 PM, Cheolsoo Park wrote:
Looking at the PigStorage source code, this looks like what's happening.
When the comma ',' is the delimiter, PigStorage splits the input string as
follows:
151364,[id#812,pref#secondary]
=>
151364
[id#812
pref#secondary]
Now "[id#812" is not a map literal, so it ends up being null.
Thanks,
Cheol
I guess it is because you are using ',' as the separator (there is a comma
between your val and the map) and then again inside the map (is that a
bug?).
I tried with a space to separate the int from the map (input.txt):
151364 [id#812,pref#secondary]
121211 [id#212,pref#primary]
and with this script:
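A minimal sketch of such a script, assuming hypothetical field names val and m:

A = LOAD 'input.txt' USING PigStorage(' ') AS (val:int, m:map[]);
DUMP A;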
I looked at the definition of Relation, which says:
A relation is a bag (more specifically, an outer bag).
If a relation is a bag, then what's the difference between a Bag and a Relation?
I am getting a bit confused by the definitions. In the below example, what would
be the Relation, Tuple, or Bag?
(1,2,3,4)
Is 1
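For what it's worth, a hedged illustration of how the terms usually line up (file and field names made up):

A = LOAD 'data.txt' AS (a:int, b:int, c:int, d:int); -- A is a relation, i.e. an outer bag of tuples
-- each record of A, e.g. (1,2,3,4), is a tuple, and 1 is a field inside that tuple
-- a bag nested inside a tuple, e.g. {(1,2),(3,4)}, is an inner bag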
Nvm, here's what I'll do, but if anyone has a better idea, please do tell.
I'll STORE the bag using PigStorage(';') to delimit the chararrays, then reload
it with an appropriate schema, treating the page sequences as concatenated
strings, then group and count by those. I can SPLIT() out the out
Some clarification on the below. Ignore the outer bag; I'd removed some data
elements for clarity and simplicity. Basically, I'm trying to find a way to go
from:
{(pg),(pg),...,(pg)}
to
{(pg,pg,...,pg)}
For an arbitrary number of "pg" tuples.
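One hedged option, if your Pig version ships it: the builtin BagToTuple collapses a bag of tuples into a single tuple, which is essentially this transformation. The relation and field names below are made up:

-- visits has a bag column 'pages' shaped like {(pg),(pg),...,(pg)}
collapsed = FOREACH visits GENERATE BagToTuple(pages) AS pageseq;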
SB
-----Original Message-----
From: Steve Bernstein
You might want to check out LinkedIn's DataFu contribution, particularly the
"sessionize" UDF:
http://sna-projects.com/datafu/javadoc/0.0.4/datafu/pig/sessions/Sessionize.html
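Roughly how it's wired up, going by the javadoc; the jar path, timeout, and field names here are assumptions, and Sessionize expects an ISO8601 timestamp string as the first field of each tuple:

REGISTER datafu.jar;
DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
sessions = FOREACH (GROUP events BY user_id) {
    ordered = ORDER events BY ts; -- ts must be an ISO8601 chararray
    GENERATE FLATTEN(Sessionize(ordered)); -- appends a session id to each tuple
}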
_
Steve Bernstein
VP, Analytics
Rearden Commerce, Inc.
+1.408.499.0961 Mobile
deem.com | reardencommerce.c
Unfortunately it's not that simple.
A = LOAD 'comb.txt' USING PigStorage(',') AS (id:chararray, start:long, end:long);
B = FOREACH (GROUP A BY id) GENERATE FLATTEN(group), MIN(A.start), MAX(A.end);
DUMP B;
(xxx,1,7)
(yyy,1,7)
(zzz,6,10)
This is not what I want. I want only to reduce the rows / ses
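If the goal is to merge only contiguous rows per id, a flat aggregate can't see the gaps; one hedged option is to order each group and walk it in a custom UDF. MergeSessions below is hypothetical, not an existing function:

C = FOREACH (GROUP A BY id) {
    ordered = ORDER A BY start;
    -- MergeSessions (hypothetical UDF) would emit one (id,start,end) tuple per contiguous run
    GENERATE FLATTEN(MergeSessions(ordered));
}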
Seems like you are looking to group by "id" and get the MIN and MAX
timestamp for each group?
On Thu, Aug 30, 2012 at 1:00 AM, Marco Cadetg wrote:
Hi there,
I have some user sessions which look something like the following:
id:chararray, start:long (unix timestamp), end:long (unix timestamp)
xxx,1,3
xxx,4,7
yyy,1,2
yyy,5,7
zzz,6,7
zzz,7,10
I would like to combine the rows which belong to a continuous session,
e.g. in my example the re