Oh np Badri. Also fwiw the open-source brisk project - https://github.com/riptano/brisk/ - does a good job integrating cassandra with hadoop, and today they released beta 2 of it, which includes pig support. Might be worth looking at too. It simplifies operations a lot from what I understand. The explanation of what it does is at http://www.datastax.com/brisk
Anyway, so just something else to consider. Also I started a project called pygmalion to help out with pig + cassandra specifically. You might find it useful and/or want to contribute code/examples to it :). Anyway, that's here: https://github.com/jeromatron/pygmalion/

Jeremy

On Jun 17, 2011, at 9:05 PM, Badrinarayanan S wrote:

> Hi Jeremy,
>
> Thanks. Till we get 1.0 we will also adopt a separate CF for analysis
> purposes.
>
> Regards,
> badri
>
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
> Sent: Saturday, June 18, 2011 12:39 AM
> To: user@pig.apache.org
> Subject: Re: PIG Cassandra - Performance
>
> The way cassandra currently does mapreduce is that it iterates over all the
> rows of the column family. So yes, performance would be related to the
> growing number of rows. You can use the pig FILTER operator to filter them
> down, but you are still iterating over all of the rows in that column
> family.
>
> There is a ticket - CASSANDRA-1600
> (https://issues.apache.org/jira/browse/CASSANDRA-1600) that addresses this
> and allows for subsets of rows to be specified. It will also enable
> mapreducing over secondary indexes in a column family. We had hoped 1600
> would be resolved by now but there was a complication with a dependent
> issue. I have been told that it will definitely be in the next major
> release of Cassandra - 1.0, due out in the beginning of October. From what
> I understand, these updates will then enable both pig and hive to more
> easily push down selects of subsets of data.
>
> Until then, what we've done is set up a separate column family with data
> that we want to analyze that only has a subset of the data. Then when 1.0
> comes out, we'll shift over to use that.
>
> Jeremy
>
> On Jun 17, 2011, at 1:29 PM, Badrinarayanan S wrote:
>
>> Hi,
>>
>> In our production Cassandra systems we are observing that the time taken
>> by the same PIG script keeps increasing each and every day. The PIG script
>> reads data for a day at a time from a Cassandra Column Family. The number
>> of rows the PIG script is expected to return is almost the same every day,
>> however every day the number of rows we are storing in Cassandra is
>> increasing. We haven't changed the default setting for multiquery, it is
>> enabled by default.
>>
>> Could this increase in PIG script execution time be related to the
>> increasing number of rows in Cassandra every day?
>>
>> Related to this, I was trying to understand the behavior of the LOAD
>> statement. Does the LOAD statement read all the data from Cassandra and
>> then apply the required filter conditions? If so, the increase in
>> execution time could be attributed to the extra time required to read the
>> ever-increasing data in Cassandra.
>>
>> We are also working on a suitable archival mechanism for our data so that
>> the total number of rows stored is always maintained at an optimum
>> count. This should also help us maintain an almost constant PIG script
>> execution time every day.
>>
>> Please advise.
>>
>> Thanks,
>>
>> Badri
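
For anyone reading this thread later, the point about FILTER can be shown with a small Python sketch. It does not use the Cassandra or Pig APIs at all. It only models a column family as a dict, and every name in it (`scan_and_filter`, `build_rows`, `main_cf`, `analysis_cf`) is made up for illustration. Assuming the input is read the way the reply describes (every row is visited and the filter is applied afterwards), the work grows with the total row count even though the result stays the same size, while a separate column family holding only the subset keeps the scan small.

```python
from datetime import date, timedelta


def scan_and_filter(column_family, predicate):
    """Visit every row, then keep the rows matching the predicate.

    Returns (matching_keys, rows_scanned) so the cost of the scan is visible.
    """
    scanned = 0
    matches = []
    for key, columns in column_family.items():
        scanned += 1  # every row is read, whether or not it matches
        if predicate(columns):
            matches.append(key)
    return matches, scanned


def build_rows(start, days, rows_per_day):
    """Build a toy column family: row key -> columns, one batch of rows per day."""
    rows = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        for i in range(rows_per_day):
            rows[f"{day.isoformat()}:{i}"] = {"day": day}
    return rows


start = date(2011, 6, 1)
target = date(2011, 6, 17)

# Hypothetical main column family: 17 days of data, growing by 100 rows a day.
main_cf = build_rows(start, days=17, rows_per_day=100)

# Hypothetical separate column family holding only the rows to analyze.
analysis_cf = {k: v for k, v in main_cf.items() if v["day"] == target}


def is_target(columns):
    return columns["day"] == target


full_matches, full_scanned = scan_and_filter(main_cf, is_target)
small_matches, small_scanned = scan_and_filter(analysis_cf, is_target)

print("main CF:     scanned", full_scanned, "matched", len(full_matches))
print("analysis CF: scanned", small_scanned, "matched", len(small_matches))
```

Both runs return the same rows. Only the number of rows scanned differs, which is the effect the separate-column-family workaround relies on until row subsets can be specified directly.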