TOKENIZE takes a string and returns a bag. Its issue right now is that it
only allows you to split on whitespace. It would make sense to
generalize this to take a delimiter.
Alan.
On May 7, 2011, at 7:55 PM, Jacob Perkins wrote:
Dmitriy,
I see your point. It would definitely be nice to have a builtin for
returning a bag though. I'd actually be happy if
TOBAG(FLATTEN(STRSPLIT(X,','))) worked.
--jacob
@thedatachef
On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
FWIW -- the reason STRSPLIT returns a Tuple is that the more common
case is thought to be splitting a string of a known format and trying
to get some part of it.
so, "foreach address_book generate STRSPLIT(phone_number, '-') as
(area_code, top_3, bottom_4);"
RegexExtractAll (whatever it's called
On Fri, 2011-05-06 at 16:06 -0600, Christian wrote:
Thank you for taking the time to explain this to me Jacob!
Am I stuck with hard-coding for my other question?
Instead of:
2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3    1983
--
2011-05-01    32423    3433    1983
would also do as long as I could count on the column order.
On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
> > #1) Let's say you are tracking messages and extracting the hash tags from
> > the message and storing them as one field (#hash1#hash2#hash3). This means
> > you might have a line that looks something like the following:
> > 2343    2011-05-06T03:04:00.000Z    username    some+message
Christian,
I've answered inline:
On Fri, 2011-05-06 at 15:14 -0600, Christian wrote:
> I am sorry if this has been asked in the past. I can't seem to find
> information on it.
>
> I have two questions, but they are somewhat related.
>
> #1) Let's say you are tracking messages and extracting the
You can group on group, like this:
A = LOAD '/some/dir' USING PigStorage() AS (date, directive);
B = GROUP A BY (date, directive);
C = FOREACH B GENERATE FLATTEN(group) AS (date, directive), COUNT(A) AS cnt;
D = GROUP C BY date;
E = FOREACH D GENERATE group AS date, C.(directive, cnt) AS cnts;
Shaw
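
For anyone who wants to check the shape of the result before running it on a cluster, here is a rough plain-Python model of what the two GROUP steps compute. It is only a sketch and not Pig itself: the sample rows and the function name group_on_group are made up for illustration, and the real data would come from '/some/dir'.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows standing in for the (date, directive) records.
rows = [
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE1"),
    ("2011-05-01", "DIRECTIVE2"),
    ("2011-05-02", "DIRECTIVE1"),
]


def group_on_group(records):
    """Count each (date, directive) pair, then nest the counts under their date."""
    # Steps B and C: group by (date, directive) and count the rows in each group.
    pair_counts = Counter(records)
    # Steps D and E: regroup the counted rows by date alone.
    by_date = defaultdict(list)
    for (date, directive), cnt in sorted(pair_counts.items()):
        by_date[date].append((directive, cnt))
    return dict(by_date)


result = group_on_group(rows)
for date, cnts in result.items():
    print(date, cnts)
```

Each date ends up with one collection of (directive, cnt) pairs, which corresponds to the cnts bag in E. Note that Pig bags are unordered, so the sorting here is only to make the sketch deterministic.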