Re: PIG Cassandra - Performance

Jeremy Hanna Fri, 17 Jun 2011 20:21:50 -0700

Oh np Badri.

Also fwiw the open-source brisk project - https://github.com/riptano/brisk/ - 
does a good job integrating cassandra with hadoop and today they released beta 
2 of it, which includes pig support in there.  Might be worth looking at too.  
It simplifies operations a lot from what I understand.  An explanation of what 
it does is at http://www.datastax.com/brisk


Anyway, so just something else to consider.

Also I started a project called pygmalion to help out with pig + cassandra 
specifically.  You might find it useful and/or want to contribute code/examples 
to it :).  Anyway, that's here: https://github.com/jeromatron/pygmalion/

Jeremy

On Jun 17, 2011, at 9:05 PM, Badrinarayanan S wrote:

> Hi Jeremy,
> 
> Thanks. Until we get 1.0 we will also adopt a separate CF for analysis
> purposes.
> 
> Regards,
> badri
> 
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
> Sent: Saturday, June 18, 2011 12:39 AM
> To: user@pig.apache.org
> Subject: Re: PIG Cassandra - Performance
> 
> The way cassandra currently does mapreduce is that it iterates over all the
> rows of the column family.  So yes, performance would be related to the
> growing number of rows.  You can use the pig FILTER operator to filter them
> down, but you are still iterating over all of the rows in that column
> family.
> 
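
To make the cost model above concrete, here is a minimal sketch in plain Python. It is a toy model only, not Pig or Cassandra code: the column family is just a dict, and the names build_column_family, full_scan and load_then_filter are hypothetical. It assumes each row carries a "day" column and shows that filtering after a full scan keeps the number of matched rows constant while the number of rows read grows with everything stored.

```python
from typing import Dict, Iterator, Tuple

Columns = Dict[str, str]
ColumnFamily = Dict[str, Columns]


def build_column_family(days: int, rows_per_day: int) -> ColumnFamily:
    """Build a toy column family: one row per (day, n), keyed 'day:n'."""
    return {
        f"{day}:{n}": {"day": str(day)}
        for day in range(days)
        for n in range(rows_per_day)
    }


def full_scan(cf: ColumnFamily, stats: Dict[str, int]) -> Iterator[Tuple[str, Columns]]:
    """Yield every row, counting each one read (stands in for the load step)."""
    for key, columns in cf.items():
        stats["rows_read"] += 1
        yield key, columns


def load_then_filter(cf: ColumnFamily, day: int) -> Tuple[int, int]:
    """Return (rows_matched, rows_read) for a full scan followed by a filter."""
    stats = {"rows_read": 0}
    matched = [key for key, cols in full_scan(cf, stats) if cols["day"] == str(day)]
    return len(matched), stats["rows_read"]


if __name__ == "__main__":
    # Same number of rows per day, but more days of history stored.
    for days in (10, 100):
        cf = build_column_family(days, rows_per_day=1000)
        matched, read = load_then_filter(cf, day=days - 1)
        print(f"{days} days stored: matched={matched}, rows read={read}")
```

In this model the filter only decides which rows are kept; it has no effect on how many rows the scan has to touch.
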
> There is a ticket - CASSANDRA-1600
> (https://issues.apache.org/jira/browse/CASSANDRA-1600) that addresses this
> and allows for subsets of rows to be specified.  It will also enable
> mapreducing over secondary indexes in a column family.  We had hoped 1600
> would be resolved by now but there was a complication with a dependent
> issue.  I have been told that it will definitely be in the next major
> release of Cassandra - 1.0, due out in the beginning of October.  From what
> I understand, these updates will then enable both pig and hive to more
> easily push down selects of subsets of data.
> 
> Until then, what we've done is set up a separate column family with data
> that we want to analyze that only has a subset of the data.  Then when 1.0
> comes out, we'll shift over to use that.
> 
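
The workaround above can be sketched with a similar toy model in plain Python. The thread does not say how the separate column family is populated, so writing each wanted row to both places at write time is an assumption here, and the names write_row, populate and the wanted predicate are hypothetical. The point is only that a scan over the smaller column family touches just the subset.

```python
from typing import Callable, Dict, Tuple

Columns = Dict[str, str]
ColumnFamily = Dict[str, Columns]


def write_row(
    main_cf: ColumnFamily,
    analysis_cf: ColumnFamily,
    key: str,
    columns: Columns,
    wanted: Callable[[Columns], bool],
) -> None:
    """Store the row in the main column family, and also in the
    analysis column family when it belongs to the subset to analyze."""
    main_cf[key] = columns
    if wanted(columns):
        analysis_cf[key] = columns


def populate(
    days: int, rows_per_day: int, wanted: Callable[[Columns], bool]
) -> Tuple[ColumnFamily, ColumnFamily]:
    """Write rows_per_day rows for each day and return (main, analysis)."""
    main_cf: ColumnFamily = {}
    analysis_cf: ColumnFamily = {}
    for day in range(days):
        for n in range(rows_per_day):
            write_row(main_cf, analysis_cf, f"{day}:{n}", {"day": str(day)}, wanted)
    return main_cf, analysis_cf


if __name__ == "__main__":
    # Keep only the most recent day in the analysis column family.
    main_cf, analysis_cf = populate(100, 1000, lambda cols: cols["day"] == "99")
    print(f"full scan of main CF reads {len(main_cf)} rows")
    print(f"full scan of analysis CF reads {len(analysis_cf)} rows")
```

The trade-off in this sketch is extra writes and storage in exchange for a scan whose size tracks the subset rather than the whole history.
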
> Jeremy
> 
> On Jun 17, 2011, at 1:29 PM, Badrinarayanan S wrote:
> 
>> Hi,
>> 
>> 
>> 
>> In our production Cassandra systems we are observing that the time taken by
>> the same PIG script keeps increasing every day. The PIG script reads data
>> for a day at a time from a Cassandra Column Family. The number of rows the
>> PIG script is expected to return is almost the same every day; however, every
>> day the number of rows we are storing in Cassandra is increasing. We haven't
>> changed the default setting for multiquery; it is enabled by default.
>> 
>> 
>> 
>> Could this increase in PIG script execution time be related to the
>> increasing number of rows in Cassandra every day? 
>> 
>> 
>> 
>> Related to this, I was trying to understand the behavior of the LOAD
>> statement. Does the LOAD statement read all the data from Cassandra and then
>> apply the required filter conditions? If so, the increase in execution time
>> could be attributed to the extra time required to read the ever-increasing
>> data in Cassandra.
>> 
>> 
>> 
>> We are also working on a suitable archival mechanism for our data so that
>> the total number of rows stored is always maintained at an optimum
>> count. This should also help us maintain an almost constant PIG script
>> execution time every day.
>> 
>> 
>> 
>> Please advise.
>> 
>> 
>> 
>> Thanks,
>> 
>> Badri
