Re: A Big Data Trifecta: Storm, Kafka and Cassandra

2012-08-04 Thread Philip Nelson
--- On Sat, 8/4/12, Brian O'Neill  wrote:

> From: Brian O'Neill 
> Subject: A Big Data Trifecta: Storm, Kafka and Cassandra
> To: user@cassandra.apache.org
> Date: Saturday, August 4, 2012, 4:41 AM
> Philip,
> 
> I figured I would reply via blog post. =)
> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html

Brian -- thanks again for this. It's always great to get a reference to another 
approach.



Re: Secondary index impact on write performance

2012-08-04 Thread David McNelis
Thanks. That was what I expected, but wanted to confirm.
On Aug 4, 2012 11:24 AM, "Dave Brosius"  wrote:

> There is a second (system-managed) column family for each secondary index,
> so any write to an indexed column causes two writes: one to the main column
> family and another to the index column family. In the index column family,
> the key is the value of the indexed column, and the value is the key of the
> original row.
>
>
>
> On 08/04/2012 11:40 AM, David McNelis wrote:
>
>> Morning,
>>
>> Was reading up on secondary indexes and on the Datastax post about them,
>> it mentions the additional management overhead, and also that if you alter
>> an existing column family, that data will be updated in the background.
>>  But how do secondary indexes affect write performance?
>>
>> If the answer is "it doesn't", then how do brand new records get located
>> by a subsequent indexed query?
>>
>> If someone has a link to a post with some of this info, that would be
>> awesome.
>>
>> David
>>
>
>


Re: Secondary index impact on write performance

2012-08-04 Thread Dave Brosius
There is a second (system-managed) column family for each secondary
index, so any write to an indexed column causes two writes: one to the
main column family and another to the index column family. In the index
column family, the key is the value of the indexed column, and the value
is the key of the original row.
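
To make the two-write behaviour concrete, here is a rough in-memory sketch in Python. It is not Cassandra code: ToyTable, write, query_by_index and write_count are hypothetical names made up for illustration, and the stale-entry cleanup is a simplification of whatever the server really does. It only shows the shape described above: the index maps the indexed value to the keys of the original rows, so a brand new record is findable by an indexed query as soon as the second write lands.

```python
# Toy in-memory model of the two-write behaviour described above.
# All names here are hypothetical; none of this is a Cassandra API.
from collections import defaultdict


class ToyTable:
    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # main column family: row key -> columns
        self.index = defaultdict(set)  # index column family: value -> row keys
        self.write_count = 0           # counts writes, for illustration only

    def write(self, row_key, columns):
        old = self.rows.get(row_key, {})
        self.rows[row_key] = {**old, **columns}
        self.write_count += 1          # write 1: main column family
        if self.indexed_column in columns:
            new_value = columns[self.indexed_column]
            old_value = old.get(self.indexed_column)
            if old_value is not None and old_value != new_value:
                # Simplification: drop the entry for the previous value.
                self.index[old_value].discard(row_key)
            self.index[new_value].add(row_key)
            self.write_count += 1      # write 2: index column family

    def query_by_index(self, value):
        # Look up row keys by indexed value, without scanning the main rows.
        return sorted(self.index.get(value, ()))


users = ToyTable(indexed_column="state")
users.write("user1", {"name": "alice", "state": "PA"})  # two writes
users.write("user2", {"name": "bob"})                   # one write, not indexed
print(users.write_count)            # 3
print(users.query_by_index("PA"))   # ['user1']
```

A write that does not touch the indexed column costs one write in this model; only writes to the indexed column pay for the second one.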




On 08/04/2012 11:40 AM, David McNelis wrote:

> Morning,
>
> Was reading up on secondary indexes and on the Datastax post about
> them, it mentions the additional management overhead, and also that if
> you alter an existing column family, that data will be updated in the
> background.  But how do secondary indexes affect write performance?
>
> If the answer is "it doesn't", then how do brand new records get
> located by a subsequent indexed query?
>
> If someone has a link to a post with some of this info, that would be
> awesome.
>
> David




Secondary index impact on write performance

2012-08-04 Thread David McNelis
Morning,

Was reading up on secondary indexes and on the Datastax post about them, it
mentions the additional management overhead, and also that if you alter an
existing column family, that data will be updated in the background.  But
how do secondary indexes affect write performance?

If the answer is "it doesn't", then how do brand new records get located by
a subsequent indexed query?

If someone has a link to a post with some of this info, that would be
awesome.

David


A Big Data Trifecta: Storm, Kafka and Cassandra

2012-08-04 Thread Brian O'Neill
Philip,

I figured I would reply via blog post. =)
http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html

That blog post shows how we pieced together Kafka and Cassandra (via Storm).
With LinkedIn behind Kafka, it is well supported.  They use it in
production, and most likely we will too. =)

Let me know if you end up using it.  Thus far, I think it pairs nicely
with Cassandra, but we don't have it in production yet.

-brian

On Fri, Aug 3, 2012 at 3:41 PM, Milind Parikh  wrote:
> Kafka is relatively stable and has an active, well-supported newsgroup as
> well.
>
> As discussed by Brian, you would be inverting the paradigm of store-process.
> Essentially in your original approach, you are storing the messages first
> and then processing them after the fact. In the Kafka model, you would
> process the messages as they come in.
>
> Since you are thinking about parallelism anyway, I trust that your
> processing paradigm is inherently parallelizable.
>
> Regards
> Milind
>
>
>
>
>
> On Fri, Aug 3, 2012 at 12:22 PM, Philip Nelson wrote:
>>
>> Brian -- thanks.
>>
>> > We were looking to do the same thing, but in the end decided
>> > to go with Kafka.
>> > Given your throughput requirements, Kafka might be a good
>> > option for you as well.
>>
>> This might be off-topic, so I'll keep it short. Kafka is reasonably
>> stable? Mature (I see it's in the Incubator)? Relative to Cassandra?
>>
>> Philip
>>
>>
>
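
For anyone skimming the thread, the store-then-process versus process-on-arrival distinction discussed above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses no Kafka, Storm or Cassandra client, and handle, store_then_process and process_on_arrival are hypothetical names, with plain lists standing in for the datastore and the message feed.

```python
# Hypothetical illustration only; no Kafka, Storm, or Cassandra client is used.

def handle(message):
    """Stand-in for real processing: here, just upper-case the text."""
    return message.upper()


def store_then_process(messages):
    """Store every message first, then process the stored batch afterwards."""
    store = []                         # stand-in for the datastore
    for m in messages:
        store.append(m)                # step 1: store the raw message
    return [handle(m) for m in store]  # step 2: process after the fact


def process_on_arrival(messages):
    """Process each message as it comes in and keep only the result."""
    results = []                       # stand-in for the processed output
    for m in messages:
        results.append(handle(m))      # handled immediately on arrival
    return results


feed = ["login", "click", "logout"]
print(store_then_process(feed))
print(process_on_arrival(feed))
```

Both functions produce the same results here; the difference is when the work happens. Because handle treats each message independently, the per-message step is the part that could be spread across parallel workers.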



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/