FWIW -- the reason STRSPLIT returns a Tuple is that the more common case is thought to be splitting a string of a known format and trying to get some part of it.
So, "foreach address_book generate STRSPLIT(phone_number, '-') as (area_code, top_3, bottom_4);"

RegexExtractAll (whatever it's called these days) should return a bag, iirc.

D

On Fri, May 6, 2011 at 2:59 PM, jacob <jacob.a.perk...@gmail.com> wrote:
> On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
>> >
>> > > #1) Let's say you are tracking messages and extracting the hash tags
>> > > from the message and storing them as one field (#hash1#hash2#hash3).
>> > > This means you might have a line that looks something like the
>> > > following:
>> > > 2343 2011-05-06T03:04:00.000Z username
>> > > some+message+goes+here#with+#hash+#tags #with#hash#tags some other info
>> > >
>> > > How can I get the # of tweets per hash tag? Also, how can I get the
>> > > # of tweets per user per hash tag?
>> > > I know I can use the STRSPLIT function to split on '#'. That will
>> > > give me a bag of hash tags. How can I then group by these such that
>> > > each hash tag has a set of tweets?
>> > You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
>> > the hashtag itself.
>> >
>>
>> If each message has an unknown number of hashtags, will a 'FLATTEN' give me
>> an unknown # of fields? If so, how do I know which field to group by? I
>> don't want to group by messages that have the exact hash tags. I want all
>> messages that have one of the hash tags.
>
> Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
> NOT a bag. If you could get a bag then you could do the following (I'm
> throwing out some fields for now):
>
> A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
> B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS hashtag;
> C = GROUP B BY hashtag;
>
> Then C will contain a key (the hashtag) and a bag containing all the
> tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
> to do the same as STRSPLIT but return a bag instead.
>
> i.e.
>
> #foobar tweet text,#foobar
> this tweet has #two #hashtags,#two#hashtags
> another #foobar tweet,#foobar
>
> will yield:
>
> #foobar, {(#foobar tweet text, #foobar),(another #foobar tweet, #foobar)}
> #two, {(this tweet has #two #hashtags, #two)}
> #hashtags, {(this tweet has #two #hashtags, #hashtags)}
>
>>
>> > > But now I want to end up with something like the following:
>> > >
>> > > 2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3 1983
>> > >
>> > > If I knew the directives ahead of time, I know I can do something
>> > > like the following:
>> > >
>> > > D = GROUP C BY date;
>> > >
>> > > E = FOREACH D {
>> > >     DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
>> > >     DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
>> > >     DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
>> > >     GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
>> > >         'DIRECTIVE2', COUNT(DIRECTIVE2.date),
>> > >         'DIRECTIVE3', COUNT(DIRECTIVE3.date);
>> > > }
>> > >
>> > > But how do I do this w/o having to hardcode the filters? Am I
>> > > thinking about this all wrong?
>> > >
>> > It's really a matter of how you structure your data ahead of time.
>> > Imagine the data looking like this instead (call it X):
>> >
>> > 201101,directive1
>> > 201101,directive1
>> > 201101,directive2
>> > 201101,directive2
>> > 201101,directive2
>> > 201101,directive3
>> > 201102,directive2
>> > 201102,directive4
>> > 201103,directive1
>> >
>>
>> This is how my data looks (row and column wise)
>>
>> >
>> > then, a simple:
>> >
>> > Y = GROUP X BY (date,directive);
>> > Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS num_occurrences;
>> >
>> > would result in:
>> >
>> > 201101,directive1,2
>> > 201101,directive2,3
>> > 201101,directive3,1
>> > 201102,directive2,1
>> > 201102,directive4,1
>> > 201103,directive1,1
>> >
>> > At least, that's what it _seems_ like you're asking for.
>> >
>>
>> I've gotten that far.
>> I'm actually asking to be able to put those into columns and not rows.
>>
>> >
>> > --jacob
>> > @thedatachef
>> >
>>
>> Thanks Jacob!
>>
>> -Christian
>>
>> >
>> > > Thanks very much for your help,
>> > > Christian
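
For reference, the split / FLATTEN / GROUP BY flow from the quoted thread can be sketched in plain Python. This is only an emulation of the Pig semantics, not a Pig UDF and not anything Pig ships: split_hashtags stands in for the hypothetical 'MySplittingUDF' and returns a list of one-field tuples in place of a bag, and group_by_hashtag is an illustrative name. The sample rows are the three tweets from Jacob's example.

```python
from collections import defaultdict


def split_hashtags(field):
    """Split '#a#b' into [('#a',), ('#b',)].

    Hypothetical stand-in for 'MySplittingUDF': a list of one-field
    tuples plays the role of a Pig bag here.
    """
    return [("#" + tag,) for tag in field.split("#") if tag]


def group_by_hashtag(rows):
    """Emulate FLATTEN followed by GROUP BY on (text, hashtags) rows."""
    groups = defaultdict(list)
    for text, hashtags in rows:
        # FLATTEN: emit one (text, hashtag) row per element of the "bag".
        for (tag,) in split_hashtags(hashtags):
            # GROUP BY hashtag: collect the rows that share a key.
            groups[tag].append((text, tag))
    return dict(groups)


rows = [
    ("#foobar tweet text", "#foobar"),
    ("this tweet has #two #hashtags", "#two#hashtags"),
    ("another #foobar tweet", "#foobar"),
]

grouped = group_by_hashtag(rows)
for tag, tweets in grouped.items():
    print(tag, tweets)
```

The point of returning one tuple per hashtag is that flattening then produces one row per (tweet, hashtag) pair rather than an unknown number of columns, so there is always exactly one field to group on.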