Hi Eric,

Thanks for your response.

On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <djatsa...@gmail.com> wrote:

> Hi Tharindu, try having a look at Brisk(
> http://www.datastax.com/products/brisk) it integrates Hadoop with
> Cassandra and is shipped with Hive for SQL analysis. You can then install
> Sqoop(http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order
> to enable data import/export between Hadoop and MySQL.
> Does this sound ok to you ?
>
These do sound ok. But I was looking at using something from Apache itself.

Brisk sounds nice, but I feel that dropping HDFS and switching entirely to
Cassandra is not the right move; we would not be using the true power of
Hadoop then. Just my opinion.

Pig seems to have better integration with Cassandra, so I might take a look
there first.

Whichever I choose, I will contribute the code back to the Apache projects I
use. Here's a sample data analysis I run with my data flow language; maybe
there is no generic way to do what I want.



<get name="NodeId">
  <index name="ServerName" start="" end=""/>
  <!--<index name="nodeId" start="AS" end="FB"/>-->
  <!--<groupBy index="nodeId"/>-->
  <granularity index="timeStamp" type="hour"/>
</get>

<lookup name="Event"/>

<aggregate>
  <measure name="RequestCount" aggregationType="CUMULATIVE"/>
  <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
  <measure name="MaximumResponseTime" aggregationType="AVG"/>
</aggregate>

<put name="NodeResult" indexRow="allKeys"/>

<log/>

<get name="NodeResult">
  <index name="ServerName" start="" end=""/>
  <groupBy index="ServerName"/>
</get>

<aggregate>
  <measure name="RequestCount" aggregationType="CUMULATIVE"/>
  <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
  <measure name="MaximumResponseTime" aggregationType="AVG"/>
</aggregate>

<put name="NodeAccumilator" indexRow="allKeys"/>

<log/>
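To make the intent concrete, here is a rough Python sketch of the semantics of
that flow: stage one groups raw events by node and hour, stage two rolls the
hourly rows up by ServerName, with CUMULATIVE meaning sum and AVG the
arithmetic mean. The sample rows and in-memory dicts are made up purely for
illustration; the real data lives in Cassandra column families.

```python
from collections import defaultdict

# Made-up events standing in for the "Event" column family lookup;
# field names mirror the measures in the flow definition above.
events = [
    {"ServerName": "node1", "timeStamp": "2011-08-30T10:15",
     "RequestCount": 10, "ResponseCount": 9, "MaximumResponseTime": 120},
    {"ServerName": "node1", "timeStamp": "2011-08-30T10:45",
     "RequestCount": 5, "ResponseCount": 5, "MaximumResponseTime": 80},
    {"ServerName": "node2", "timeStamp": "2011-08-30T10:30",
     "RequestCount": 7, "ResponseCount": 6, "MaximumResponseTime": 200},
]

def aggregate(rows, key_fn):
    """Apply CUMULATIVE (sum) and AVG (mean) measures per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    out = []
    for key, members in groups.items():
        out.append({
            "key": key,
            "RequestCount": sum(m["RequestCount"] for m in members),
            "ResponseCount": sum(m["ResponseCount"] for m in members),
            "MaximumResponseTime":
                sum(m["MaximumResponseTime"] for m in members) / len(members),
        })
    return out

# Stage 1: hourly granularity per node -> "NodeResult".
# timeStamp[:13] truncates to the hour, e.g. "2011-08-30T10".
node_result = aggregate(
    events, lambda r: (r["ServerName"], r["timeStamp"][:13]))

# Stage 2: roll the hourly rows up by ServerName -> "NodeAccumilator".
node_acc = aggregate(node_result, lambda r: r["key"][0])
```

The point is that both stages are the same group-then-aggregate shape, which
is exactly the pattern Pig's GROUP/FOREACH (or a Hive GROUP BY) expresses
naturally.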


> 2011/8/29 Tharindu Mathew <mcclou...@gmail.com>
>
>> Hi,
>>
>> I have an already running system where I define a simple data flow (using
>> a simple custom data flow language) and configure jobs to run against stored
>> data. I use quartz to schedule and run these jobs and the data exists on
>> various data stores (mainly Cassandra but some data exists in RDBMS like
>> mysql as well).
>>
>> Thinking about scalability and already existing support for standard data
>> flow languages in the form of Pig and HiveQL, I plan to move my system to
>> Hadoop.
>>
>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
>> been reading up and still am contemplating on how to make this change.
>>
>> It would be great to hear the recommended approach of doing this on Hadoop
>> with the integration of Cassandra and other RDBMS. For example, a sample
>> task that already runs on the system is "once in every hour, get rows from
>> column family X, aggregate data in columns A, B and C and write back to
>> column family Y, and enter details of last aggregated row into a table in
>> mysql"
>>
>> Thanks in advance.
>>
>> --
>> Regards,
>>
>> Tharindu
>>
>
>
>
> --
> *Eric Djatsa Yota*
> *Double degree MsC Student in Computer Science Engineering and
> Communication Networks
> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> *Intern at AMADEUS S.A.S Sophia Antipolis*
> djatsa...@gmail.com
> *Tel : 0601791859*
>
>


-- 
Regards,

Tharindu
