Re: Comprehensive documentation on Cassandra Data modelling

2014-12-16 Thread Ryan Svihla
There is a lot of material out there, and the best thing you can do today is
watch Patrick McFadin's series. This is what I used before I started
at DataStax. Planet Cassandra has a data modeling playlist of videos you
can watch at
https://www.youtube.com/playlist?list=PLqcm6qE9lgKJoSWKYWHWhrVupRbS8mmDA
which includes the McFadin videos I mentioned.

Finally, you hit a key point: a series of tables is the normal approach to
most data modeling. You model your tables around the queries you need. With
the exception of the nuance I referred to in the last email, this one
concept will get you through 80% of use cases fine.
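
To make the series-of-tables idea concrete, here is a minimal Python sketch, not real driver code. Plain dicts stand in for Cassandra tables, and the table names (`jobs_by_id`, `jobs_by_status`), the columns, and the helper functions are all hypothetical. It assumes one table per query. Because a primary key column cannot be updated in place, a change of status is written as a delete of the old row plus an insert of a new one.

```python
# Hypothetical sketch: dicts stand in for two Cassandra tables that hold
# the same jobs under different primary keys, one table per query.
jobs_by_id = {}       # partition key: job_id -> row
jobs_by_status = {}   # partition key: status -> {job_id: row}


def insert_job(job_id, payload):
    """Write a new job to both tables with an initial 'pending' status."""
    row = {"job_id": job_id, "status": "pending", "payload": payload}
    jobs_by_id[job_id] = dict(row)
    jobs_by_status.setdefault("pending", {})[job_id] = dict(row)


def set_status(job_id, new_status):
    """Change a job's status in both tables.

    In jobs_by_id, status is an ordinary column and is overwritten.
    In jobs_by_status, status is part of the key, so the old row is
    deleted and a new row is inserted under the new status.
    """
    old_status = jobs_by_id[job_id]["status"]
    jobs_by_id[job_id]["status"] = new_status
    row = jobs_by_status[old_status].pop(job_id)
    row["status"] = new_status
    jobs_by_status.setdefault(new_status, {})[job_id] = row


def jobs_with_status(status):
    """Read one partition: all job ids that currently have this status."""
    return sorted(jobs_by_status.get(status, {}))


insert_job("a1", "first")
insert_job("b2", "second")
set_status("a1", "processed")
print(jobs_with_status("pending"))    # ['b2']
print(jobs_with_status("processed"))  # ['a1']
```

In a real cluster the delete would leave a tombstone, and a partition per status can grow large, so this shape is only a starting point.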


Re: Comprehensive documentation on Cassandra Data modelling

2014-12-16 Thread Jason Kania
Ryan,

Thanks for the response. It offers a bit more clarity.

I think a series of blog posts with good real-world examples would go a long
way toward increasing the usability of Cassandra. Right now I find the process
like going through a minefield, because I only discover what is not possible
after trying something that I would find logical and failing.

For my specific questions, the problem is that since searching is only possible
on columns in the primary key and the primary key cannot be updated, I am not
sure what the appropriate solution is when data exists that needs to be
searched and then updated. What is the preferable approach to this? Is the
expectation to maintain a series of tables, one for each stage of data
manipulation, each with its own primary key?

Thanks,
Jason

Re: Comprehensive documentation on Cassandra Data modelling

2014-12-16 Thread Ryan Svihla
Data modeling for a distributed application could be a book unto itself.
However, I will add that modeling by restriction is basically the entire
thought process in Cassandra data modeling. Cassandra is a distributed hash
table, and a core aspect of that sort of application is that you need to be
able to quickly locate which server in the cluster owns the data you want
(which is what the partition key provides).
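
As a rough illustration of how a partition key locates the owning server, here is a minimal Python sketch. It is a simplification under stated assumptions: the node names are made up, MD5 modulo the node count stands in for Cassandra's real partitioner and token ranges, and replication is ignored.

```python
import hashlib

# Hypothetical three-node cluster; the names are placeholders.
NODES = ["node-a", "node-b", "node-c"]


def owner_of(partition_key, nodes=NODES):
    """Map a partition key to one node by hashing it.

    Assumption: hash modulo node count replaces real token-range
    ownership. It is enough to show that the key alone decides
    which server to ask.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Knowing the partition key, a read goes straight to one server.
print(owner_of("user:42"))

# Without it, every server has to be asked.
print(len(NODES), "nodes to scan when the partition key is unknown")
```

The same key always hashes to the same node, which is why a query that supplies the partition key can skip the rest of the cluster.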

In specific response to your questions:
1) As long as you know the primary key and the column name, this just works.
I'm not sure what the problem is.
2) Yes, the partition key tells you which server owns the data; otherwise
you'd have to scan all servers to find what you're asking for.
3) I'm not sure I understand this.

To summarize, all modeling can be understood when you embrace the idea that:

   1. Querying a single server will be faster than querying many servers.
   2. Multiple tables with the same data but with different partition keys
   are much easier to scale than a single table where you have to scan the
   whole cluster for your answer.
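
The second point can be sketched in a few lines of Python. This is an illustration only: dicts stand in for tables, and the user rows, table names (`users_by_id`, `users_by_email`), and helper functions are hypothetical. It compares finding a row by scanning a table keyed on something else with finding it in a second table keyed on the column being queried.

```python
# Hypothetical data: the same user rows stored twice, under two
# different partition keys, so each query is a single keyed read.
users = [
    {"user_id": 1, "email": "ann@example.com", "name": "Ann"},
    {"user_id": 2, "email": "bob@example.com", "name": "Bob"},
    {"user_id": 3, "email": "cy@example.com", "name": "Cy"},
]

users_by_id = {u["user_id"]: u for u in users}
users_by_email = {u["email"]: u for u in users}


def find_by_email_scan(email):
    """Single-table approach: check every row until one matches."""
    rows_read = 0
    for row in users_by_id.values():
        rows_read += 1
        if row["email"] == email:
            return row, rows_read
    return None, rows_read


def find_by_email_lookup(email):
    """Second-table approach: one keyed read, no scan."""
    return users_by_email.get(email), 1


row, reads = find_by_email_scan("cy@example.com")
print(row["name"], reads)    # Cy 3
row, reads = find_by_email_lookup("cy@example.com")
print(row["name"], reads)    # Cy 1
```

The cost is that every write has to go to both tables, which is the trade-off this style of modeling accepts.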


If you accept this, you've basically got the key principle down. Most
other ideas are extensions of this. Some nuances include dealing with
tombstones, partition size, and ordering, and I can answer any more specifics.

I've been meaning to write a series of blog posts on this, but as I stated,
it's almost a book unto itself. Data modeling a distributed application
requires a fundamental rethink of all the assumptions we've been taught for
master/slave style databases.




-- 
Ryan Svihla
Solution Architect


DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
DataStax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the world’s
most innovative companies such as Netflix, Adobe, Intuit, and eBay.


Comprehensive documentation on Cassandra Data modelling

2014-12-16 Thread Jason Kania
Hi,

I have been having a few exchanges with contributors to the project around what
is possible with Cassandra, and a common response that comes up when I describe
functionality as broken or missing is that I am not modelling my data
correctly. Unfortunately, I cannot seem to find comprehensive documentation on
modelling with Cassandra. In particular, I am finding myself modelling by
restriction rather than modelling what I would like to do.

Does such documentation exist? If not, is there any effort to create such
documentation? The DataStax documentation on data modelling is far too weak to
be meaningful.

In particular, I am caught because:

1) I want to search on a specific column to make updates to it after further
processing; i.e. I don't know its value on first insert
2) If I want to search on a column, it has to be part of the primary key
3) If a column is part of the primary key, it cannot be edited, so I have a
circular dependency

Thanks,
Jason