Hi Eric,

Thanks for your response.
On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <djatsa...@gmail.com> wrote:

> Hi Tharindu, try having a look at Brisk
> (http://www.datastax.com/products/brisk). It integrates Hadoop with
> Cassandra and is shipped with Hive for SQL analysis. You can then install
> Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in
> order to enable data import/export between Hadoop and MySQL.
> Does this sound ok to you?

These do sound ok, but I was looking at using something from Apache itself.

Brisk sounds nice, but I feel that disregarding HDFS and switching entirely
to Cassandra is not the right thing to do. Just my opinion; we would not be
using the true power of Hadoop then.

Pig seems to have better integration with Cassandra, so I might take a look
there. Whichever I choose, I will contribute the code back to the Apache
projects I use.

Here's a sample data analysis I do with my language. Maybe there is no
generic way to do what I want to do.

<get name="NodeId">
  <index name="ServerName" start="" end=""/>
  <!--<index name="nodeId" start="AS" end="FB"/>-->
  <!--<groupBy index="nodeId"/>-->
  <granularity index="timeStamp" type="hour"/>
</get>
<lookup name="Event"/>
<aggregate>
  <measure name="RequestCount" aggregationType="CUMULATIVE"/>
  <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
  <measure name="MaximumResponseTime" aggregationType="AVG"/>
</aggregate>
<put name="NodeResult" indexRow="allKeys"/>
<log/>

<get name="NodeResult">
  <index name="ServerName" start="" end=""/>
  <groupBy index="ServerName"/>
</get>
<aggregate>
  <measure name="RequestCount" aggregationType="CUMULATIVE"/>
  <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
  <measure name="MaximumResponseTime" aggregationType="AVG"/>
</aggregate>
<put name="NodeAccumilator" indexRow="allKeys"/>
<log/>

> 2011/8/29 Tharindu Mathew <mcclou...@gmail.com>
>
>> Hi,
>>
>> I have an already running system where I define a simple data flow
>> (using a simple custom data flow language) and configure jobs to run
>> against stored data. I use Quartz to schedule and run these jobs, and
>> the data exists on various data stores (mainly Cassandra, but some data
>> exists in RDBMS like MySQL as well).
>>
>> Thinking about scalability and the already existing support for
>> standard data flow languages in the form of Pig and HiveQL, I plan to
>> move my system to Hadoop.
>>
>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
>> been reading up and am still contemplating how to make this change.
>>
>> It would be great to hear the recommended approach of doing this on
>> Hadoop with the integration of Cassandra and other RDBMS. For example,
>> a sample task that already runs on the system is "once in every hour,
>> get rows from column family X, aggregate data in columns A, B and C,
>> write back to column family Y, and enter details of the last aggregated
>> row into a table in MySQL".
>>
>> Thanks in advance.
>>
>> --
>> Regards,
>>
>> Tharindu
>
> --
> *Eric Djatsa Yota*
> *Double degree MSc Student in Computer Science Engineering and
> Communication Networks
> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> *Intern at AMADEUS S.A.S Sophia Antipolis*
> djatsa...@gmail.com
> *Tel : 0601791859*

--
Regards,

Tharindu
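P.S. In case it helps anyone map the sample above onto Pig or plain
MapReduce, here is a rough Python sketch of what I intend the two
aggregation passes to mean. Everything here is illustrative: the measure
and index names come from the sample, but the row layout, the mock data,
and the reading of CUMULATIVE as "sum" and AVG as "mean" are assumptions,
and the real data of course lives in Cassandra, not in memory.

```python
from collections import defaultdict

# Mock "Event" rows; field names come from the sample XML, the
# layout itself is an assumption made up for this illustration.
events = [
    {"ServerName": "node1", "timeStamp": "2011-08-30T10:15",
     "RequestCount": 10, "ResponseCount": 9, "MaximumResponseTime": 120},
    {"ServerName": "node1", "timeStamp": "2011-08-30T10:45",
     "RequestCount": 20, "ResponseCount": 18, "MaximumResponseTime": 80},
    {"ServerName": "node2", "timeStamp": "2011-08-30T10:30",
     "RequestCount": 5, "ResponseCount": 5, "MaximumResponseTime": 200},
]

CUMULATIVE = ("RequestCount", "ResponseCount")  # summed per group
AVERAGED = ("MaximumResponseTime",)             # averaged per group


def aggregate(rows, key_fn):
    """Group rows by key_fn, sum the CUMULATIVE measures and
    average the AVERAGED measures within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    out = []
    for key, grp in groups.items():
        rec = {"key": key}
        for m in CUMULATIVE:
            rec[m] = sum(r[m] for r in grp)
        for m in AVERAGED:
            rec[m] = sum(r[m] for r in grp) / len(grp)
        out.append(rec)
    return out


# Pass 1: hourly granularity per server, as in
# <granularity index="timeStamp" type="hour"/> -> NodeResult.
node_result = aggregate(
    events, lambda r: (r["ServerName"], r["timeStamp"][:13]))

# Pass 2: roll NodeResult up by server, as in
# <groupBy index="ServerName"/> -> NodeAccumilator (spelling from the sample).
node_result_rows = [dict(r, ServerName=r["key"][0]) for r in node_result]
node_accumilator = aggregate(node_result_rows, lambda r: r["ServerName"])

for rec in sorted(node_accumilator, key=lambda r: r["key"]):
    print(rec)
```

Note pass 2 averages the per-hour averages rather than the raw events,
which is what re-aggregating NodeResult (instead of Event) implies.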