Thanks for following through, William!

D

On Wed, Jun 8, 2011 at 1:56 PM, William Oberman <ober...@civicscience.com> wrote:
> Just in case this ends up as someone else's answer someday, here is the
> working query on real data:
> rows = LOAD 'cassandra://civicscience/observations' USING CassandraStorage();
> filter_rows = FILTER rows BY $1 is not null;
> counts = FOREACH filter_rows GENERATE COUNT($1);
> counts_in_bag = GROUP counts ALL;
> sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
> dump sum_of_bag;
>
> For some reason typing the bag was causing me problems.
>
> On Tue, Jun 7, 2011 at 4:58 PM, William Oberman
> <ober...@civicscience.com> wrote:
>
>> I think FILTER will do the trick? E.g.
>>
>> rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING
>> CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
>> filter_rows = FILTER rows BY columns is not null;
>> counts = FOREACH filter_rows GENERATE COUNT(columns);
>> counts_in_bag = GROUP counts ALL;
>> sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
>> dump sum_of_bag;
>>
>> On Tue, Jun 7, 2011 at 4:33 PM, William Oberman
>> <ober...@civicscience.com> wrote:
>>
>>> I tried this same script on closer to production data, and I'm getting
>>> errors. I'm 50% sure it's this:
>>> https://issues.apache.org/jira/browse/PIG-1283
>>>
>>> One of my rows in cassandra has no columns (maybe?), which maybe causes a
>>> null bag, which causes COUNT to blow up (at least, that's my theory). As a
>>> workaround, can I have COUNT ignore/skip rows with null columns? I'll start
>>> digging through the docs as well.
>>>
>>> will
>>>
>>> On Fri, Jun 3, 2011 at 4:09 PM, William Oberman <ober...@civicscience.com> wrote:
>>>
>>>> That is exactly what I wanted, thanks for the confirm!
>>>>
>>>> On Fri, Jun 3, 2011 at 4:06 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>>>>
>>>>> I am not sure what you mean by "count all columns". The code you have
>>>>> counts all *cells*.
>>>>> So:
>>>>> id1: col1, col2
>>>>> id2: col1, col2, col3
>>>>>
>>>>> has 3 columns in a conventional sense, but your code will return 5. Is
>>>>> that what you want? If so, your code seems correct.
>>>>>
>>>>> D
>>>>>
>>>>> On Fri, Jun 3, 2011 at 12:53 PM, William Oberman
>>>>> <ober...@civicscience.com> wrote:
>>>>> > Howdy,
>>>>> >
>>>>> > I'm coming from cassandra, and I'm actually trying to count all columns in a
>>>>> > column family. I believe that is similar to counting the number of tuples in a
>>>>> > bag in the lingo in the pig manual. It was harder than I expected, but I
>>>>> > think this works:
>>>>> > rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
>>>>> > AS (key, columns: bag {T: tuple(name, value)});
>>>>> > counts = FOREACH rows GENERATE COUNT(columns);
>>>>> > counts_in_bag = GROUP counts ALL;
>>>>> > sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
>>>>> > dump sum_of_bag;
>>>>> >
>>>>> > My question is: am I right that it works? I started with 3 keys having a
>>>>> > total of 5 columns and got (5). Then I added a new key/column, and another
>>>>> > column on an existing key and got (7). So, it seems like it's working.
>>>>> > But, was there a better way to write it?
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> > will
>>>>> >
>>>>>
>>>>
>>>> --
>>>> Will Oberman
>>>> Civic Science, Inc.
>>>> 3030 Penn Avenue., First Floor
>>>> Pittsburgh, PA 15201
>>>> (M) 412-480-7835
>>>> (E) ober...@civicscience.com
>>>>
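
For anyone reading this in the archive: the distinction Dmitriy draws between cells and columns, and the null-bag workaround, can be modeled in a few lines of plain Python. This is only a sketch of the counting logic, not Pig and not the CassandraStorage API. The dictionary layout, the function names, and the extra `id3` row are assumptions made for illustration, and a row with no columns is represented here as `None` to mirror the suspected null bag.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a column family: row key -> list of (name, value)
# cells, or None for a row whose bag comes back null.
Row = Optional[List[Tuple[str, str]]]


def count_cells(rows: Dict[str, Row]) -> int:
    """Total number of cells, skipping null bags (the FILTER ... is not null step)."""
    return sum(len(cells) for cells in rows.values() if cells is not None)


def count_distinct_columns(rows: Dict[str, Row]) -> int:
    """Number of distinct column names across all rows."""
    names = {
        name
        for cells in rows.values()
        if cells is not None
        for name, _value in cells
    }
    return len(names)


rows: Dict[str, Row] = {
    "id1": [("col1", "a"), ("col2", "b")],
    "id2": [("col1", "c"), ("col2", "d"), ("col3", "e")],
    "id3": None,  # assumed: a row with no columns, modeled as a null bag
}

print(count_cells(rows))             # 5, what the COUNT-then-SUM script reports
print(count_distinct_columns(rows))  # 3, columns "in a conventional sense"
```

The per-row count followed by a global sum corresponds to `count_cells`; skipping `None` rows plays the role of the FILTER step in the working query.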