I've tried to help out with some UDFs and references for our use case:
https://github.com/jeromatron/pygmalion/

There are some Brisk docs on Pig as well that might be helpful:
http://www.datastax.com/docs/0.8/brisk/about_pig
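
In case a concrete example helps, here's roughly the shape of a small Pig UDF
over Cassandra-style data (just an illustrative sketch, not one of the
pygmalion UDFs; it assumes the input tuple holds a column name plus a bag of
(name, value) pairs, which is how a Cassandra loader typically hands columns
to Pig):

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Illustrative sketch: given a column name and a bag of (name, value)
// tuples, return the matching value, or null if the column is absent.
public class ColumnValue extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        String wanted = (String) input.get(0);
        DataBag columns = (DataBag) input.get(1);
        for (Iterator<Tuple> it = columns.iterator(); it.hasNext();) {
            Tuple column = it.next();
            if (wanted.equals(column.get(0).toString())) {
                Object value = column.get(1);
                return value == null ? null : value.toString();
            }
        }
        return null;
    }
}

In a Pig script you'd REGISTER the jar and call it like any other UDF.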

On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:

> Thanks Jeremy for your response. That gives me some encouragement that I
> might be on the right track.
> 
> I think I need to try out more stuff before coming to a conclusion on Brisk.
> 
> For Pig operations over Cassandra, I could only find 
> http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any 
> other resources that you can point me to? There seems to be a lack of samples 
> on this subject.
> 
> On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> 
> wrote:
> FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to 
> potentially move to Brisk because of the simplicity of operations there.
> 
> Not sure what you mean about the true power of Hadoop.  In my mind the true 
> power of Hadoop is the ability to parallelize jobs and send each task to 
> where the data resides.  HDFS exists to enable that.  Brisk is just another 
> HDFS-compatible implementation.  If you're already storing your data in 
> Cassandra and are looking to use Hadoop with it, then I would seriously 
> consider using Brisk.
> 
> That said, Cassandra with Hadoop works fine.
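> 
> For what it's worth, the wiring for a plain Hadoop job that reads straight
> from a column family is pretty small. Here's a rough sketch against the
> 0.8-era API (the keyspace/column family and column names are made up, and
> the ConfigHelper method names have moved around a bit between versions):
> 
> import java.util.Arrays;
> 
> import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
> import org.apache.cassandra.hadoop.ConfigHelper;
> import org.apache.cassandra.thrift.SlicePredicate;
> import org.apache.cassandra.utils.ByteBufferUtil;
> import org.apache.hadoop.mapreduce.Job;
> 
> // Sketch only: points a vanilla Hadoop job at a Cassandra column family,
> // much like the word_count example shipped in the Cassandra source tree.
> public class CassandraJobSetup {
>     public static void configure(Job job) {
>         job.setInputFormatClass(ColumnFamilyInputFormat.class);
>         ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
>         ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
>         ConfigHelper.setPartitioner(job.getConfiguration(),
>                 "org.apache.cassandra.dht.RandomPartitioner");
>         // Hypothetical keyspace/column family names.
>         ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Events");
> 
>         // Only pull the columns the job actually needs.
>         SlicePredicate predicate = new SlicePredicate().setColumn_names(
>                 Arrays.asList(ByteBufferUtil.bytes("RequestCount"),
>                               ByteBufferUtil.bytes("ResponseCount")));
>         ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
>     }
> }
> 
> The mapper then sees each row as a key plus a map of its columns, and the
> rest is ordinary MapReduce.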
> 
> On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
> 
> > Hi Eric,
> >
> > Thanks for your response.
> >
> > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <djatsa...@gmail.com> wrote:
> >
> >> Hi Tharindu, try having a look at Brisk
> >> (http://www.datastax.com/products/brisk); it integrates Hadoop with
> >> Cassandra and ships with Hive for SQL analysis. You can then install
> >> Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order
> >> to enable data import/export between Hadoop and MySQL.
> >> Does this sound ok to you?
> >>
> > These do sound ok, but I was looking at using something from Apache itself.
> >
> > Brisk sounds nice, but I feel that disregarding HDFS and totally switching
> > to Cassandra is not the right thing to do. Just my opinion there. I feel we
> > would not be using the true power of Hadoop then.
> >
> > I feel Pig has more integration with Cassandra, so I might take a look
> > there.
> >
> > Whichever I choose, I will contribute the code back to the Apache projects I
> > use. Here's a sample data analysis I do with my language. Maybe there is no
> > generic way to do what I want to do.
> >
> >
> >
> > <get name="NodeId">
> >   <index name="ServerName" start="" end=""/>
> >   <!--<index name="nodeId" start="AS" end="FB"/>-->
> >   <!--<groupBy index="nodeId"/>-->
> >   <granularity index="timeStamp" type="hour"/>
> > </get>
> >
> > <lookup name="Event"/>
> >
> > <aggregate>
> >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeResult" indexRow="allKeys"/>
> >
> > <log/>
> >
> > <get name="NodeResult">
> >   <index name="ServerName" start="" end=""/>
> >   <groupBy index="ServerName"/>
> > </get>
> >
> > <aggregate>
> >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeAccumilator" indexRow="allKeys"/>
> >
> > <log/>
> >
> >
> >> 2011/8/29 Tharindu Mathew <mcclou...@gmail.com>
> >>
> >>> Hi,
> >>>
> >>> I have an already running system where I define a simple data flow (using
> >>> a simple custom data flow language) and configure jobs to run against
> >>> stored data. I use Quartz to schedule and run these jobs, and the data
> >>> lives in various data stores (mainly Cassandra, but some data exists in an
> >>> RDBMS like MySQL as well).
> >>>
> >>> Thinking about scalability and the existing support for standard data
> >>> flow languages in the form of Pig and HiveQL, I plan to move my system to
> >>> Hadoop.
> >>>
> >>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
> >>> been reading up and am still contemplating how to make this change.
> >>>
> >>> It would be great to hear the recommended approach for doing this on Hadoop
> >>> with the integration of Cassandra and other RDBMSs. For example, a sample
> >>> task that already runs on the system is "once every hour, get rows from
> >>> column family X, aggregate the data in columns A, B and C, write the results
> >>> back to column family Y, and enter the details of the last aggregated row
> >>> into a table in MySQL".
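> >>>
> >>> Roughly, I imagine the reduce side of that hourly job looking something
> >>> like the sketch below (the MySQL table, JDBC URL and credentials are made
> >>> up, and the write back to column family Y would presumably go through
> >>> Cassandra's output format or a client library rather than the plain Text
> >>> output shown here):
> >>>
> >>> import java.io.IOException;
> >>> import java.sql.Connection;
> >>> import java.sql.DriverManager;
> >>> import java.sql.PreparedStatement;
> >>> import java.sql.SQLException;
> >>>
> >>> import org.apache.hadoop.io.LongWritable;
> >>> import org.apache.hadoop.io.Text;
> >>> import org.apache.hadoop.mapreduce.Reducer;
> >>>
> >>> // Sketch only: sums one measure per row key and, when the task finishes,
> >>> // records the last key it aggregated in a MySQL bookkeeping table
> >>> // (assumes the MySQL JDBC driver is on the task classpath).
> >>> public class HourlyAggregateReducer
> >>>         extends Reducer<Text, LongWritable, Text, LongWritable> {
> >>>     private String lastKey;
> >>>
> >>>     @Override
> >>>     protected void reduce(Text key, Iterable<LongWritable> values, Context context)
> >>>             throws IOException, InterruptedException {
> >>>         long sum = 0;
> >>>         for (LongWritable v : values) {
> >>>             sum += v.get();
> >>>         }
> >>>         context.write(key, new LongWritable(sum));
> >>>         lastKey = key.toString();
> >>>     }
> >>>
> >>>     @Override
> >>>     protected void cleanup(Context context) throws IOException {
> >>>         if (lastKey == null) {
> >>>             return;
> >>>         }
> >>>         try {
> >>>             Connection conn = DriverManager.getConnection(
> >>>                     "jdbc:mysql://localhost/analytics", "user", "password");
> >>>             try {
> >>>                 PreparedStatement stmt = conn.prepareStatement(
> >>>                         "INSERT INTO last_aggregated (row_key, run_time) VALUES (?, NOW())");
> >>>                 stmt.setString(1, lastKey);
> >>>                 stmt.executeUpdate();
> >>>                 stmt.close();
> >>>             } finally {
> >>>                 conn.close();
> >>>             }
> >>>         } catch (SQLException e) {
> >>>             throw new IOException(e);
> >>>         }
> >>>     }
> >>> }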
> >>>
> >>> Thanks in advance.
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> Tharindu
> >>>
> >>
> >>
> >>
> >> --
> >> *Eric Djatsa Yota*
> >> *Double degree MSc Student in Computer Science Engineering and
> >> Communication Networks
> >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> >> *Intern at AMADEUS S.A.S Sophia Antipolis*
> >> djatsa...@gmail.com
> >> *Tel : 0601791859*
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Tharindu
> 
> 
> 
> 
> -- 
> Regards,
> 
> Tharindu
