Re: Working with an unknown number of values

Dmitriy Ryaboy Sat, 07 May 2011 18:42:15 -0700

FWIW -- the reason STRSPLIT returns a Tuple is that the more common
case is thought to be splitting a string of a known format and trying
to get some part of it.


so, "foreach address_book generate STRSPLIT(phone_number, '-') as
(area_code, top_3, bottom_4);"
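
To make the tuple case concrete, here is a small Python model of that kind of fixed-format split. It is Python rather than Pig so it can be run on its own; the function name split_phone and the sample number are made up for illustration.

```python
# Python model of a fixed-arity split; this is NOT Pig code.
def split_phone(phone_number):
    # Like STRSPLIT(phone_number, '-') on a known format: the result is one
    # tuple whose fields are addressed by position.
    area_code, top_3, bottom_4 = phone_number.split("-")
    return (area_code, top_3, bottom_4)


# Hypothetical sample value, chosen only to show the shape of the result.
parts = split_phone("415-555-0123")
```

The number of fields is fixed by the format, which is why a tuple fits this case and a bag is not needed.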

RegexExtractAll (whatever it's called these days) should return a bag, iirc.
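
For the unknown-number-of-values case, a bag is what you want, because FLATTEN on a bag yields one row per element instead of extra columns. Below is a rough Python model of FLATTEN followed by GROUP, assuming some splitter that returns a bag. The names split_to_bag and group_by_hashtag are hypothetical, not Pig built-ins, and the rows are the sample data from Jacob's message below.

```python
from collections import defaultdict


def split_to_bag(hashtags):
    # Model of a bag-returning splitter: '#two#hashtags' -> ['#two', '#hashtags'].
    return ["#" + tag for tag in hashtags.split("#") if tag]


def group_by_hashtag(rows):
    groups = defaultdict(list)
    for text, hashtags in rows:
        # FLATTEN: emit one (text, hashtag) row per element of the bag.
        for tag in split_to_bag(hashtags):
            # GROUP BY hashtag: collect the flattened rows under their key.
            groups[tag].append((text, tag))
    return dict(groups)


rows = [
    ("#foobar tweet text", "#foobar"),
    ("this tweet has #two #hashtags", "#two#hashtags"),
    ("another #foobar tweet", "#foobar"),
]
groups = group_by_hashtag(rows)
# Number of tweets per hashtag, i.e. COUNT over each group.
counts = {tag: len(tweets) for tag, tweets in groups.items()}
```

Each tweet lands in every group whose hashtag it carries, so there is never an unknown number of fields to group on.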

D

On Fri, May 6, 2011 at 2:59 PM, jacob <jacob.a.perk...@gmail.com> wrote:
> On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
>> >
>> > > #1) Let's say you are tracking messages and extracting the hash tags from
>> > > the message and storing them as one field (#hash1#hash2#hash3). This
>> > means
>> > > you might have a line that looks something like the following:
>> > >       2343    2011-05-06T03:04:00.000Z    username
>> > > some+message+goes+here#with+#hash+#tags    #with#hash#tags   some
>> >  other
>> > >  info
>> > >
>> > > How can I get the # of tweets per hash tag? Also, how can I get the # of
>> > > tweets per user per hash tag?
>> > > I know I can use the STRSPLIT function to split on '#'. That will give me
>> > a
>> > > bag of hash tags. How can I then group by these such that each hash tag
>> > has
>> > > a set of tweets?
>> > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
>> > the hashtag itself.
>> >
>>
>> If each message has an unknown number of hashtags, will a 'FLATTEN' give me
>> an unknown # of fields? If so, how do I know which field to group by? I
>> don't want to group by messages that have the exact hash tags. I want all
>> messages that have one of the hash tags.
>
> Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> NOT a bag. If you could get a bag then you could do the following (I'm
> throwing out some fields for now):
>
> A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
> hashtag;
> C = GROUP B BY hashtag;
>
> Then C will contain a key (the hashtag) and a bag containing all the
> tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
> so that it does the same as STRSPLIT but returns a bag instead.
>
> ie.
>
> #foobar tweet text,#foobar
> this tweet has #two #hashtags,#two#hashtags
> another #foobar tweet,#foobar
>
> will yield:
>
> #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
> #foobar)}
> #two,      {(this tweet has #two #hashtags, #two)}
> #hashtags, {(this tweet has #two #hashtags, #hashtags)}
>
>
>>
>>
>> > >     But now I want to end up something like the following:
>> >
>> >
>> > > 2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
>> > >  1983
>> > >
>> > > If I knew the directives ahead of time, I know I can do something like
>> > the
>> > > following:
>> > >
>> > > D = GROUP C BY date;
>> > >
>> > > E = FOREACH D {
>> > >      DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>> > >      DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>> > >      DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>> > >         GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
>> > 'DIRECTIVE2',
>> > > COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
>> > > }
>> > >
>> > > But how do I do this w/o having to hardcode the filters? Am I thinking
>> > about
>> > > this all wrong?
>> > >
>> > It's really a matter of how you structure your data ahead of time.
>> > Imagine the data looking like this instead (call it X):
>> >
>> > 201101,directive1
>> > 201101,directive1
>> > 201101,directive2
>> > 201101,directive2
>> > 201101,directive2
>> > 201101,directive3
>> > 201102,directive2
>> > 201102,directive4
>> > 201103,directive1
>> >
>> > This is how my data looks (row and column wise)
>>
>> >
>> > then, a simple:
>> >
>> > Y = GROUP X BY (date,directive);
>> > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
>> > num_occurrences;
>> >
>> > would result in:
>> >
>> > 201101,directive1,2
>> > 201101,directive2,3
>> > 201101,directive3,1
>> > 201102,directive2,1
>> > 201102,directive4,1
>> > 201103,directive1,1
>> >
>> > At least, that's what it _seems_ like you're asking for.
>> >
>> > I've gotten that far. What I'm actually asking for is a way to put those
>> into columns and not rows.
>>
>> >
>> > --jacob
>> > @thedatachef
>> >
>> > Thanks Jacob!
>>
>> -Christian
>>
>> >
>> > > Thanks very much for your help,
>> > > Christian
>> >
>> >
>> >
>
>
>
