Re: Re: Dynamic Columns

2015-01-21 Thread Jack Krupansky
Peter,

At least from your description, the proposed use of the clustering column
name seems at first blush to fully fit the bill. The point is not that the
resulting clustered primary key is used to reference an object, but that a
SELECT on the partition key references the entire object, which will be a
sequence of CQL3 rows in a partition, and then the clustering column key is
added when you wish to access that specific aspect of the object. What's
missing? Again, just store the partition key to reference the full object -
no pollution required!

And please note that any number of clustering columns can be specified, so
more structured "dynamic columns" can be supported. For example, you could
have a timestamp as a separate clustering column to maintain temporal state
of the database. The partition key can also be composed of multiple columns,
as a composite partition key.
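
A minimal sketch of that shape (the table and column names here are
hypothetical, purely for illustration): a composite partition key identifies
the object, and an attribute-name plus timestamp pair of clustering columns
gives each dynamic attribute a temporal history:

  CREATE TABLE temporal_attributes (
    domain text,
    entity_id uuid,
    attr_name text,
    updated_at timestamp,
    attr_value blob,
    PRIMARY KEY ((domain, entity_id), attr_name, updated_at)
  );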

As far as all these static columns, consider them optional and merely an
optimization. If you wish to have a 100% opaque object model, you wouldn't
have any static columns and the only non-primary key column would be the
blob value field. Every object attribute would be specified using another
clustering column name and blob value. Presto, everything you need for a
pure, opaque, fully-generalized object management system - all with just
CQL3. Maybe we should include such an example in the doc and with the
project to more strongly emphasize this capability to fully model
arbitrarily complex object structures - including temporal structures.
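
In that spirit, here is a minimal sketch of such a pure, opaque model (all
names are made up for illustration):

  CREATE TABLE objects (
    key blob,
    attr_name blob,
    attr_value blob,
    PRIMARY KEY (key, attr_name)
  );

  -- the entire object:
  SELECT attr_name, attr_value FROM objects WHERE key = 0x0a;

  -- one specific aspect of the object:
  SELECT attr_value FROM objects WHERE key = 0x0a AND attr_name = 0x6e616d65;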

Anything else missing?

As a general proposition, you can use the term "clustering column" in CQL3
wherever you might have used "dynamic column" in Thrift. The point in CQL3
is not to eliminate a useful feature, dynamic column, but to repackage the
feature to make a lot more sense for the vast majority of use cases. Maybe
there are some cases that don't fit exactly as well as desired, but feel
free to specifically identify such cases so that we can elaborate how we
think they are covered or at least covered well enough for most users.


-- Jack Krupansky

On Wed, Jan 21, 2015 at 12:19 PM, Peter Lin  wrote:

>
> the example you provided does not work for my use case.
>
>   CREATE TABLE t (
>     key blob,
>     my_static_column_1 int static,
>     my_static_column_2 float static,
>     my_static_column_3 blob static,
>     dynamic_column_name blob,
>     dynamic_column_value blob,
>     PRIMARY KEY (key, dynamic_column_name)
>   );
>
> the dynamic column can't be part of the primary key. The temporal entity
> key can be the default UUID or the user can choose the field in their
> object. Within our framework, we have the concept of temporal links between
> one or more temporal entities. Polluting the primary key with the dynamic
> column wouldn't work.
>
> Please excuse the confusing RDB comparison. My point is that Cassandra's
> dynamic column feature is the "unique" feature that makes it better than
> traditional RDB or newSql like VoltDB for building temporal databases. With
> databases that require a static schema plus ALTER TABLE for managing schema
> evolution, evolving the model is harder and results in downtime.
>
> One of the challenges of data management over time is evolving the data
> model and making queries simple. If the record is 5 years old, it probably
> has a different schema than a record inserted this week. With temporal
> databases, every update is an insert, so it's a little bit more complex
> than just "use a blob". There's a whole level of complication with temporal
> data, and CQL3 custom types aren't clear to me. I've read the CQL3
> documentation on the custom types several times and it is rather poor. It
> gives me the impression there's still work needed to get custom types in
> good shape.
>
> With regard to examples others have told me, your advice is fair. A few
> minutes with google, and some blogs should pop up. The reason I bring these
> things up isn't to put down CQL. It's because I care and want to help
> improve Cassandra by sharing my experience. I consistently recommend new
> users learn and understand both Thrift and CQL.
>
>
>
> On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne 
> wrote:
>
>> On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin  wrote:
>>
>>> I don't remember other people's examples in detail due to my shitty
>>> memory, so I'd rather not misquote.
>>>
>>
>> Fair enough, but maybe you shouldn't use "people's examples you don't
>> remember" as an argument then. Those examples might be wrong or outdated and
>> that kind of stuff creates confusion for everyone.
>>
>>
>>>
>>> In my case, I mix static and dynamic columns in a single column family
>>> with primitives and objects. The objects are temporal object graphs with a
>>> known type. Doing this type of stuff is basically transparent for me, since
>>> I'm using thrift and our data modeler generates helper classes. Our tooling
>>> seamlessly converts the bytes back to the target object. We have a few
>>> standard static columns related to temporal metadata. At any time, dynamic
>>> columns can be added and they can be primitives or objects.

Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-21 Thread Yatong Zhang
Yes, my cluster is almost full and there are lots of pending tasks. You
helped me a lot and thank you Eric~

On Thu, Jan 22, 2015 at 11:59 AM, Eric Stevens  wrote:

> Yes, bootstrapping a new node will cause read loads on your existing nodes
> - it is becoming the owner and replica of a whole new set of existing
> data.  To do that it needs to know what data it's now responsible for, and
> that's what bootstrapping is for.
>
> If you're at the point where bootstrapping a new node is placing a
> too-heavy burden on your existing nodes, you may be dangerously close to or
> even past the tipping point where you ought to have already grown your
> cluster.  You need to grow your cluster as soon as possible, and chances
> are you're close to no longer being able to keep up with compaction (see
> nodetool compactionstats, make sure pending tasks is <5, preferably 0 or
> 1).  Once you're falling behind on compaction, it becomes difficult to
> successfully bootstrap new nodes, and you're in a very tough spot.
>
>
> On Wed, Jan 21, 2015 at 7:43 PM, Yatong Zhang  wrote:
>
>> Thanks for the reply. The bootstrap of a new node puts a heavy burden on the
>> whole cluster and I don't know why. So that's the issue I want to fix,
>> actually.
>>
>> On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens  wrote:
>>
>>> Yes, but it won't do what I suspect you're hoping for.  If you disable
>>> auto_bootstrap in cassandra.yaml the node will join the cluster and will
>>> not stream any old data from existing nodes.
>>>
>>> The cluster will now be in an inconsistent state.  If you bring enough
>>> nodes online this way to violate your read consistency level (eg RF=3,
>>> CL=Quorum, if you bring on 2 nodes this way), some of your queries will be
>>> missing data that they ought to have returned.
>>>
>>> There is no way to bring a new node online and have it be responsible
>>> just for new data, and have no responsibility for old data.  It *will* be
>>> responsible for old data, it just won't *know* about the old data it
>>> should be responsible for.  Executing a repair will fix this, but only
>>> because the existing nodes will stream all the missing data to the new
>>> node.  This will create more pressure on your cluster than just normal
>>> bootstrapping would have.
>>>
>>> I can't think of any reason you'd want to do that unless you needed to
>>> grow your cluster really quickly, and were ok with corrupting your old data.
>>>
>>> On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang 
>>> wrote:
>>>
 Hi there,

 I am using C* 2.0.10 and I was trying to add a new node to a
 cluster (actually to replace a dead node). But after adding the new node,
 some other nodes in the cluster had a very high workload, which affected
 the whole performance of the cluster.
 So I am wondering, is there a way to add a new node such that this node
 only serves new data?

>>>
>>>
>>
>


Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-21 Thread Eric Stevens
Yes, bootstrapping a new node will cause read loads on your existing nodes
- it is becoming the owner and replica of a whole new set of existing
data.  To do that it needs to know what data it's now responsible for, and
that's what bootstrapping is for.

If you're at the point where bootstrapping a new node is placing a
too-heavy burden on your existing nodes, you may be dangerously close to or
even past the tipping point where you ought to have already grown your
cluster.  You need to grow your cluster as soon as possible, and chances
are you're close to no longer being able to keep up with compaction (see
nodetool compactionstats, make sure pending tasks is <5, preferably 0 or
1).  Once you're falling behind on compaction, it becomes difficult to
successfully bootstrap new nodes, and you're in a very tough spot.


On Wed, Jan 21, 2015 at 7:43 PM, Yatong Zhang  wrote:

> Thanks for the reply. The bootstrap of a new node puts a heavy burden on the
> whole cluster and I don't know why. So that's the issue I want to fix,
> actually.
>
> On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens  wrote:
>
>> Yes, but it won't do what I suspect you're hoping for.  If you disable
>> auto_bootstrap in cassandra.yaml the node will join the cluster and will
>> not stream any old data from existing nodes.
>>
>> The cluster will now be in an inconsistent state.  If you bring enough
>> nodes online this way to violate your read consistency level (eg RF=3,
>> CL=Quorum, if you bring on 2 nodes this way), some of your queries will be
>> missing data that they ought to have returned.
>>
>> There is no way to bring a new node online and have it be responsible
>> just for new data, and have no responsibility for old data.  It *will* be
>> responsible for old data, it just won't *know* about the old data it
>> should be responsible for.  Executing a repair will fix this, but only
>> because the existing nodes will stream all the missing data to the new
>> node.  This will create more pressure on your cluster than just normal
>> bootstrapping would have.
>>
>> I can't think of any reason you'd want to do that unless you needed to
>> grow your cluster really quickly, and were ok with corrupting your old data.
>>
>> On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang 
>> wrote:
>>
>>> Hi there,
>>>
>>> I am using C* 2.0.10 and I was trying to add a new node to a
>>> cluster (actually to replace a dead node). But after adding the new node,
>>> some other nodes in the cluster had a very high workload, which affected
>>> the whole performance of the cluster.
>>> So I am wondering, is there a way to add a new node such that this node
>>> only serves new data?
>>>
>>
>>
>


Fwd: ReadTimeoutException in Cassandra 2.0.11

2015-01-21 Thread Neha Trivedi
Hello All,
I am trying to process a 200MB file and I am getting the following error. We
are using apache-cassandra-2.0.3.jar:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout
during read query at consistency ONE (1 responses were required but only 0
replica responded)

1. Is it due to memory?
2. Is it related to the driver?

Initially, when I was trying a 15MB file, it threw the same exception, but
after that it started working.


thanks
regards
neha


Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-21 Thread Yatong Zhang
Thanks for the reply. The bootstrap of a new node puts a heavy burden on the
whole cluster and I don't know why. So that's the issue I want to fix,
actually.

On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens  wrote:

> Yes, but it won't do what I suspect you're hoping for.  If you disable
> auto_bootstrap in cassandra.yaml the node will join the cluster and will
> not stream any old data from existing nodes.
>
> The cluster will now be in an inconsistent state.  If you bring enough
> nodes online this way to violate your read consistency level (eg RF=3,
> CL=Quorum, if you bring on 2 nodes this way), some of your queries will be
> missing data that they ought to have returned.
>
> There is no way to bring a new node online and have it be responsible just
> for new data, and have no responsibility for old data.  It *will* be
> responsible for old data, it just won't *know* about the old data it
> should be responsible for.  Executing a repair will fix this, but only
> because the existing nodes will stream all the missing data to the new
> node.  This will create more pressure on your cluster than just normal
> bootstrapping would have.
>
> I can't think of any reason you'd want to do that unless you needed to
> grow your cluster really quickly, and were ok with corrupting your old data.
>
> On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang 
> wrote:
>
>> Hi there,
>>
>> I am using C* 2.0.10 and I was trying to add a new node to a
>> cluster (actually to replace a dead node). But after adding the new node,
>> some other nodes in the cluster had a very high workload, which affected
>> the whole performance of the cluster.
>> So I am wondering, is there a way to add a new node such that this node
>> only serves new data?
>>
>
>


Re: get partition key from tombstone warnings?

2015-01-21 Thread Ian Rose
Ah, thanks for the pointer Philip.  Is there any kind of formal way to
"vote up" issues?  I'm assuming that adding a comment of "+1" or the like
is more likely to be *counter*productive.

- Ian


On Wed, Jan 21, 2015 at 5:02 PM, Philip Thompson <
philip.thomp...@datastax.com> wrote:

> There is an open ticket for this improvement at
> https://issues.apache.org/jira/browse/CASSANDRA-8561
>
> On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose  wrote:
>
>> When I see a warning like "Read 9 live and 5769 tombstoned cells in ...
>> " is there a way for me to see the partition key that this query was
>> operating on?
>>
>> The description in the original JIRA ticket (
>> https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
>> exposing this information was one of the original goals, but it isn't
>> obvious to me in the logs...
>>
>> Cheers!
>> - Ian
>>
>>
>


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
I've written my fair share of crappy code, which became legacy. then I or
someone else was left with supporting it and something newer. Isn't that
the nature of software development?

I forget who said this quote first, but I'm gonna borrow it "only pretty
code is code that is in your head. once it's written, it becomes crap." I
tell my son this all the time. When we start a project we have no clue what
we should have known, so we make a butt load of mistakes. If we're lucky,
by the third or fourth version it's not so smelly, but in the meantime we
have to keep supporting the stuff. Not because we want to, but because
we're the ones that put the users through it. At least that's how I see it.

having said that, at some point, the really old stuff should be deprecated
and cleaned out. It totally makes sense to remove thrift at some point. I
don't know when that is, but every piece of software eventually dies or is
abandoned. Except for Cobol. That thing will be around 200 yrs from now



On Wed, Jan 21, 2015 at 6:57 PM, Robert Coli  wrote:

> On Wed, Jan 21, 2015 at 2:09 PM, Peter Lin  wrote:
>
>> on the topic of multiple incompatible APIs, I recommend you look at
>> SQL Server and Sybase. Most of the legacy RDBMSes have multiple incompatible
>> APIs. Though in some cases, it is/was unavoidable.
>>
>
> My bet is that the small development team responsible for Cassandra does
> not have anything like the number of contractual obligations that
> commercial databases from the 1980s had. In other words, I believe having
> two persistent, non-pluggable (this attribute probably excludes various
> "legacy" APIs?) APIs is far more "avoidable" in the Cassandra case than in
> the historic cases you cite. I could certainly be wrong... people who
> disagree with my assessment now have a way to make me pay for my wrongness
> by making me donate $20 to the Apache Foundation on Jan 1, 2019. [1] :D
>
> =Rob
> [1] Project committers/others with material ability (Datastax...) to
> affect outcome ineligible.
>
>


Re: Re: Dynamic Columns

2015-01-21 Thread Robert Coli
On Wed, Jan 21, 2015 at 2:09 PM, Peter Lin  wrote:

> on the topic of multiple incompatible APIs, I recommend you look at
> SQL Server and Sybase. Most of the legacy RDBMSes have multiple incompatible
> APIs. Though in some cases, it is/was unavoidable.
>

My bet is that the small development team responsible for Cassandra does
not have anything like the number of contractual obligations that
commercial databases from the 1980s had. In other words, I believe having
two persistent, non-pluggable (this attribute probably excludes various
"legacy" APIs?) APIs is far more "avoidable" in the Cassandra case than in
the historic cases you cite. I could certainly be wrong... people who
disagree with my assessment now have a way to make me pay for my wrongness
by making me donate $20 to the Apache Foundation on Jan 1, 2019. [1] :D

=Rob
[1] Project committers/others with material ability (Datastax...) to affect
outcome ineligible.


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
everyone is different. I also recommend users take time to understand
every tool they use as much as time allows. We don't always have the luxury
of time, but I see no point recommending laziness.

I'm probably insane, since I also spend time reading papers on CRDT, paxos,
query compilers, machine learning and other topics I find fun.

on the topic of multiple incompatible APIs, I recommend you look at
SQL Server and Sybase. Most of the legacy RDBMSes have multiple incompatible
APIs. Though in some cases, it is/was unavoidable.

On Wed, Jan 21, 2015 at 4:47 PM, Robert Coli  wrote:

> On Wed, Jan 21, 2015 at 9:19 AM, Peter Lin  wrote:
>
>>
>> I consistently recommend new users learn and understand both Thrift and
>> CQL.
>>
>
> FWIW, I consider this a disservice to new users. New users should use CQL,
> and not deploy against a deprecated-in-all-but-name API. Understanding
> non-CQL *storage* might be valuable, understanding the Thrift interface to
> storage is anti-valuable.
>
> Despite the dissembling public statements regarding Thrift "not going
> anywhere" it is obvious to me that no other databases exist with two
> non-pluggable and incompatible APIs for a reason. The pain of maintaining
> these two APIs will eventually become not worth the backwards
> compatibility. At this time it will be deprecated and then shortly
> thereafter removed; I expect this to happen at latest by EOY 2018. [1]
>
> =Rob
> [1] If anyone strongly disagrees, I am taking $20 cash bets, with any
> proceeds donated to the Apache Foundation.
>
>


Re: get partition key from tombstone warnings?

2015-01-21 Thread Philip Thompson
There is an open ticket for this improvement at
https://issues.apache.org/jira/browse/CASSANDRA-8561

On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose  wrote:

> When I see a warning like "Read 9 live and 5769 tombstoned cells in ...
> " is there a way for me to see the partition key that this query was
> operating on?
>
> The description in the original JIRA ticket (
> https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
> exposing this information was one of the original goals, but it isn't
> obvious to me in the logs...
>
> Cheers!
> - Ian
>
>


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
I apologize if I've offended you, but I clearly stated CQL3 supports
dynamic columns. How it supports dynamic columns is different. If I'm
reading you correctly, I believe we agree both thrift and CQL3 support
dynamic columns. Where we differ is that I feel the coverage for existing
thrift use cases isn't 100%. That may be right or wrong, but it is my
impression. I agree with you that CQL3 supports the majority of dynamic
column use cases, but in a slightly different way. There are cases like
mine which fit better in thrift.

Could I rip out all the stuff I did and replace it with CQL3 with a major
redesign? Yes, I could but honestly I see some downsides with that
proposition.

1. for modeling tools like mine, an object API is a far better fit, in my
biased opinion
2. text-based languages like SQL and CQL could "in theory" provide similar
object safety, but it's so much work that most people don't bother. This is
from first-hand experience building 3 ORMs and using most of the open
source ORMs in the Java space. I've also used several ORMs in .NET and they
all suffer from this pain point. There's a reason why Microsoft created
LINQ.
3. the structure and syntax of SQL and all variations of SQL are not
ideally suited to complex data structures that are graphs. A temporal
entity is an object graph that may be shallow (3-8 levels) or deep (15+).
SQL is ideally suited to tables. CQL in this regard is more flexible and
supports collections, but it's still not ideal for things like insurance
policies. Look at the Acord standard for property insurance, if you want to
get a better understanding. For example, a temporal record using ORM could
result in 500 rows of data in a dozen tables for a small entity to 50K+
rows for a large entity. The mailing list isn't the right place to go into
the theory and practice of temporal databases, but a lot of the design
choices I made are based on formal logic.



On Wed, Jan 21, 2015 at 4:06 PM, Sylvain Lebresne 
wrote:

> On Wed, Jan 21, 2015 at 6:19 PM, Peter Lin  wrote:
>
>> the dynamic column can't be part of the primary key. The temporal entity
>> key can be the default UUID or the user can choose the field in their
>> object. Within our framework, we have the concept of temporal links between
>> one or more temporal entities. Polluting the primary key with the dynamic
>> column wouldn't work.
>>
>
> Not totally sure I understand. Are you talking about the underlying
> storage space used? If you are, we can discuss it (it's not too hard to
> remedy it in CQL, I was mainly trying to illustrate my point, not
> pretending this was a drop-in solution for your use case) but it's more of
> a performance discussion, and I think we've somewhat quit the realm of
> "there's things CQL3 doesn't support".
>
>
>> Please excuse the confusing RDB comparison. My point is that Cassandra's
>> dynamic column feature is the "unique" feature that makes it better than
>> traditional RDB or newSql like VoltDB for building temporal databases. With
>> databases that require a static schema plus ALTER TABLE for managing schema
>> evolution, evolving the model is harder and results in downtime.
>>
>
> Here again you seem to imply that CQL doesn't support dynamic columns, or
> has a somewhat inferior support, but that's just not true.
>
>
>> One of the challenges of data management over time is evolving the data
>> model and making queries simple. If the record is 5 years old, it probably
>> has a different schema than a record inserted this week. With temporal
>> databases, every update is an insert, so it's a little bit more complex
>> than just "use a blob". There's a whole level of complication with temporal
>> data, and CQL3 custom types aren't clear to me. I've read the CQL3
>> documentation on the custom types several times and it is rather poor. It
>> gives me the impression there's still work needed to get custom types in
>> good shape.
>>
>
> I'm sorry but that's a bit of hand waving. Custom types (and by that I
> mean user-provided AbstractType implementations) works in CQL *exactly*
> like in thrift: they are not in a better or worse shape than in thrift. And
> while the documentation on CQL3 is indeed poor on this part, so is the
> thrift documentation on the same subject (besides, I don't think you're
> whole point is about saying that documentation could be improved). Again,
> what you can do in thrift, you can do in CQL.
>

Honestly, I haven't tried to use CQL3 user-provided types. I read the
specification several times and had a ton of questions, along with several
other people who were trying to understand what it meant. If you want people to
use it, the documentation needs to improve. I did give a good faith effort
and spent a week trying to understand what the spec is trying to say, but
it only resulted in more questions. So yes, I am hand waving because it
left me frustrated. Having been part of apache community for many years,
writing great docs is hard and most of us hate doing it. Just to be clear,
I'm not

get partition key from tombstone warnings?

2015-01-21 Thread Ian Rose
When I see a warning like "Read 9 live and 5769 tombstoned cells in ...
" is there a way for me to see the partition key that this query was
operating on?

The description in the original JIRA ticket (
https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
exposing this information was one of the original goals, but it isn't
obvious to me in the logs...

Cheers!
- Ian


Re: Re: Dynamic Columns

2015-01-21 Thread Robert Coli
On Wed, Jan 21, 2015 at 9:19 AM, Peter Lin  wrote:

>
> I consistently recommend new users learn and understand both Thrift and
> CQL.
>

FWIW, I consider this a disservice to new users. New users should use CQL,
and not deploy against a deprecated-in-all-but-name API. Understanding
non-CQL *storage* might be valuable, understanding the Thrift interface to
storage is anti-valuable.

Despite the dissembling public statements regarding Thrift "not going
anywhere" it is obvious to me that no other databases exist with two
non-pluggable and incompatible APIs for a reason. The pain of maintaining
these two APIs will eventually become not worth the backwards
compatibility. At this time it will be deprecated and then shortly
thereafter removed; I expect this to happen at latest by EOY 2018. [1]

=Rob
[1] If anyone strongly disagrees, I am taking $20 cash bets, with any
proceeds donated to the Apache Foundation.


Re: Re: Dynamic Columns

2015-01-21 Thread Sylvain Lebresne
On Wed, Jan 21, 2015 at 6:19 PM, Peter Lin  wrote:

> the dynamic column can't be part of the primary key. The temporal entity
> key can be the default UUID or the user can choose the field in their
> object. Within our framework, we have the concept of temporal links between
> one or more temporal entities. Polluting the primary key with the dynamic
> column wouldn't work.
>

Not totally sure I understand. Are you talking about the underlying storage
space used? If you are, we can discuss it (it's not too hard to remedy it
in CQL, I was mainly trying to illustrate my point, not pretending this
was a drop-in solution for your use case) but it's more of a performance
discussion, and I think we've somewhat quit the realm of "there's things
CQL3 doesn't support".


> Please excuse the confusing RDB comparison. My point is that Cassandra's
> dynamic column feature is the "unique" feature that makes it better than
> traditional RDB or newSql like VoltDB for building temporal databases. With
> databases that require a static schema plus ALTER TABLE for managing schema
> evolution, evolving the model is harder and results in downtime.
>

Here again you seem to imply that CQL doesn't support dynamic columns, or
has a somewhat inferior support, but that's just not true.


> One of the challenges of data management over time is evolving the data
> model and making queries simple. If the record is 5 years old, it probably
> has a different schema than a record inserted this week. With temporal
> databases, every update is an insert, so it's a little bit more complex
> than just "use a blob". There's a whole level of complication with temporal
> data, and CQL3 custom types aren't clear to me. I've read the CQL3
> documentation on the custom types several times and it is rather poor. It
> gives me the impression there's still work needed to get custom types in
> good shape.
>

I'm sorry but that's a bit of hand waving. Custom types (and by that I mean
user-provided AbstractType implementations) work in CQL *exactly* like in
thrift: they are not in a better or worse shape than in thrift. And while
the documentation on CQL3 is indeed poor on this part, so is the thrift
documentation on the same subject (besides, I don't think your whole
point is about saying that documentation could be improved). Again, what
you can do in thrift, you can do in CQL.


> I consistently recommend new users learn and understand both Thrift and
> CQL.
>

I understand that you do this with the best of intentions and don't take it
the wrong way but it is my opinion that you are counterproductive by doing
so, and this for 2 reasons:
1) you don't only recommend users to learn both APIs, you justify that
advice by affirming that there is a whole family of important use cases
that thrift supports and CQL does not. Except that I contend that this
affirmation is technically incorrect, and so far I haven't seen any
example proving me wrong.
2) there is a wealth of evidence that trying to learn both thrift and CQL
confuses the hell out of new users. Which is, btw, not surprising: both APIs
present the same concepts in seemingly different ways (even though they
are the same concepts) and even have conflicting vocabulary, so it's
obviously confusing when you try to learn those concepts in the first
place. Trying to learn CQL when you know thrift well is fine, and why not
learn thrift once you know and understand CQL well, but learning both is
imo bad advice. It could maybe (maybe) be justified if what you say about
whole families of use cases not being doable with CQL were true, but
it's not.

--
Sylvain


>
>
>
> On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne 
> wrote:
>
>> On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin  wrote:
>>
>>> I don't remember other people's examples in detail due to my shitty
>>> memory, so I'd rather not misquote.
>>>
>>
>> Fair enough, but maybe you shouldn't use "people's examples you don't
>> remember" as an argument then. Those examples might be wrong or outdated and
>> that kind of stuff creates confusion for everyone.
>>
>>
>>>
>>> In my case, I mix static and dynamic columns in a single column family
>>> with primitives and objects. The objects are temporal object graphs with a
>>> known type. Doing this type of stuff is basically transparent for me, since
>>> I'm using thrift and our data modeler generates helper classes. Our tooling
>>> seamlessly converts the bytes back to the target object. We have a few
>>> standard static columns related to temporal metadata. At any time, dynamic
>>> columns can be added and they can be primitives or objects.
>>>
>>
>> I don't see anything in that that cannot be done with CQL. You can mix
>> static and dynamic columns in CQL thanks to static columns. More precisely,
>> you can do what you're describing with a table looking a bit like this:
>>   CREATE TABLE t (
>>     key blob,
>>     my_static_column_1 int static,
>>     my_static_column_2 float static,
>>     my_static_column_3 blob static,
>>     dynamic_column_name blob,
>>     dynamic_column_value blob,
>>     PRIMARY KEY (key, dynamic_column_name)
>>   );

Re: Compaction failing to trigger

2015-01-21 Thread Robert Coli
On Wed, Jan 21, 2015 at 10:10 AM, Flavien Charlon  wrote:

> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>>
>>
>
> This doesn't really answer my question; I asked whether this particular
> bug (which I can't find in JIRA) is planned to be fixed in 2.1.3, not
> whether 2.1.3 would be production ready.
>

No idea, but I didn't recognize your name/email, and you were
encountering problems with an IMO not-ready-for-production version. Many
people who are new to Cassandra and pre-or-close-to-production might be
better served by running a slightly older version and focusing on the
challenge of writing their app against a mostly-working distributed
database instead of troubleshooting Cassandra bugs.

tl;dr - Cassandra bugs in cutting edge versions are likely best encountered
by experienced operators who can recognize them and respond, not new
operators.

While we're on this topic, the version numbering is very misleading.
> Versions which are not recommended for production should be very explicitly
> labelled as such (beta for example), and 2.1.0 should really be what you
> call now 2.1.6.
>

That's why I wrote the blog post. It is however important to note that I
speak in no official capacity for Apache Cassandra or Datastax.

The intent of the project is for x.y.0 to be production ready, and in
fairness they have recently added new QA processes which are likely to
drive the production ready version down from x.y.6. They are only human,
however, and as human developers they are likely to have slightly different (lower)
standards for production readiness than the typical operator. I wrote that
blog post to help set operator-appropriate expectations, so people are not
disappointed with the overall stability of Cassandra.

I personally operate Cassandra slightly on the trailing edge, and as a
result only encounter a limited subset of the problems I assist people with
on the list and IRC.

=Rob


Re: How to know disk utilization by each row on a node

2015-01-21 Thread nitin padalia
Did you use cfstats and cfhistograms?
On Jan 22, 2015 12:37 AM, "Edson Marquezani Filho" <
edsonmarquez...@gmail.com> wrote:

> Ok, nice tool, but I still can't see how much data each row occupies
> on the SSTable (or am I missing something?).
>
> Note: considering the SSTable format, where rows are strictly sequential
> and sorted, a feature like that doesn't seem very hard to
> implement, anyway. Wouldn't it be possible to calculate it only from
> index files, without even needing to read the actual table?
>
> On Tue, Jan 20, 2015 at 5:05 PM, Jens Rantil  wrote:
> > Hi,
> >
> > Datastax comes with sstablekeys that does that. You could also use
> > sstable2json script to find keys.
> >
> > Cheers,
> > Jens
> >
> >
> >
> > On Tue, Jan 20, 2015 at 2:53 PM, Edson Marquezani Filho
> >  wrote:
> >>
> >> Hello, everybody.
> >>
> >> Does anyone know a way to list, for an arbitrary column family, all
> >> the rows owned (including replicas) by a given node and the data size
> >> (real size or disk occupation) of each one of them on that node?
> >>
> >> I would like to do that because I have data on one of my nodes growing
> >> faster than the others, although rows (and replicas) seem evenly
> >> distributed across the cluster. So, I would like to verify if I have
> >> some specific rows growing too much.
> >>
> >> Thank you.
> >
> >
>


Re: How to know disk utilization by each row on a node

2015-01-21 Thread Edson Marquezani Filho
Ok, nice tool, but I still can't see how much data each row occupies
on the SSTable (or am I missing something?).

Note: considering the SSTable format, where rows are strictly sequential
and sorted, a feature like that doesn't seem very hard to
implement, anyway. Wouldn't it be possible to calculate it only from
index files, without even needing to read the actual table?

On Tue, Jan 20, 2015 at 5:05 PM, Jens Rantil  wrote:
> Hi,
>
> Datastax comes with sstablekeys that does that. You could also use
> sstable2json script to find keys.
>
> Cheers,
> Jens
>
>
>
> On Tue, Jan 20, 2015 at 2:53 PM, Edson Marquezani Filho
>  wrote:
>>
>> Hello, everybody.
>>
>> Does anyone know a way to list, for an arbitrary column family, all
>> the rows owned (including replicas) by a given node and the data size
>> (real size or disk occupation) of each one of them on that node?
>>
>> I would like to do that because I have data on one of my nodes growing
>> faster than the others, although rows (and replicas) seem evenly
>> distributed across the cluster. So, I would like to verify if I have
>> some specific rows growing too much.
>>
>> Thank you.
>
>


Re: How do replica become out of sync

2015-01-21 Thread Flavien Charlon
Quite a few, see here: http://pastebin.com/SMnprHdp. In total about 3,000
ranges across the 3 nodes.

This is with vnodes disabled. It was at least an order of magnitude worse
when we had it enabled.

Flavien

On 20 January 2015 at 22:22, Robert Coli  wrote:

> On Mon, Jan 19, 2015 at 5:44 PM, Flavien Charlon <
> flavien.char...@gmail.com> wrote:
>
>> Thanks Andi. The reason I was asking is that even though my nodes have
>> been 100% available and no write has been rejected, when running an
>> incremental repair, the logs still indicate that some ranges are out of
>> sync (which then results in large amounts of compaction), how can this be
>> possible?
>>
>
> This is most likely, as you conjecture, due to slight differences between
> nodes at the time of Merkle Tree calculation.
>
> How many rows differ?
>
> =Rob
>
>


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
the example you provided does not work for my use case.

  CREATE TABLE t (
    key blob,
    my_static_column_1 int static,
    my_static_column_2 float static,
    my_static_column_3 blob static,
    dynamic_column_name blob,
    dynamic_column_value blob,
    PRIMARY KEY (key, dynamic_column_name)
  );

the dynamic column can't be part of the primary key. The temporal entity
key can be the default UUID or the user can choose the field in their
object. Within our framework, we have the concept of temporal links between
one or more temporal entities. Polluting the primary key with the dynamic
column wouldn't work.

Please excuse the confusing RDB comparison. My point is that Cassandra's
dynamic column feature is the "unique" feature that makes it better than
traditional RDB or newSql like VoltDB for building temporal databases. With
databases that require a static schema plus ALTER TABLE for managing schema
evolution, evolving the model is harder and results in downtime.

One of the challenges of data management over time is evolving the data
model and making queries simple. If the record is 5 years old, it probably
has a difference schema than a record inserted this week. With temporal
databases, every update is an insert, so it's a little bit more complex
than just "use a blob". There's a whole level of complication with temporal
data and CQL3 custom types isn't clear to me. I've read the CQL3
documentation on the custom types several times and it is rather poor. It
gives me the impression there's still work needed to get custom types in
good shape.

With regard to examples others have told me, your advice is fair. A few
minutes with google, and some blogs should pop up. The reason I bring these
things up isn't to put down CQL. It's because I care and want to help
improve Cassandra by sharing my experience. I consistently recommend new
users learn and understand both Thrift and CQL.



On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne 
wrote:

> On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin  wrote:
>
>> I don't remember other people's examples in detail due to my shitty
>> memory, so I'd rather not misquote.
>>
>
> Fair enough, but maybe you shouldn't use "people's examples you don't
> remember" as an argument then. Those examples might be wrong or outdated and
> that kind of stuff creates confusion for everyone.
>
>
>>
>> In my case, I mix static and dynamic columns in a single column family
>> with primitives and objects. The objects are temporal object graphs with a
>> known type. Doing this type of stuff is basically transparent for me, since
>> I'm using thrift and our data modeler generates helper classes. Our tooling
>> seamlessly converts the bytes back to the target object. We have a few
>> standard static columns related to temporal metadata. At any time, dynamic
>> columns can be added and they can be primitives or objects.
>>
>
> I don't see anything in that that cannot be done with CQL. You can mix
> static and dynamic columns in CQL thanks to static columns. More precisely,
> you can do what you're describing with a table looking a bit like this:
>   CREATE TABLE t (
>     key blob,
>     my_static_column_1 int static,
>     my_static_column_2 float static,
>     my_static_column_3 blob static,
>     dynamic_column_name blob,
>     dynamic_column_value blob,
>     PRIMARY KEY (key, dynamic_column_name)
>   );
>
> And your helper classes will serialize your objects as they probably do
> today (if you use a custom comparator, you can do that too). And let it be
> clear that I'm not pretending that doing it this way is tremendously
> simpler than thrift. But I'm saying that 1) it's possible and 2) while it's
> not meaningfully simpler than thrift, it's not really harder either (and
> in fact, it's actually less verbose with CQL than with raw thrift).
>
>
>>
>> For the record, doing this kind of stuff in a relational database sucks
>> horribly.
>>
>
> I don't know what that has to do with CQL to be honest. If you're doing
> relational with CQL you're doing it wrong. And please note that I'm not
> saying CQL is the perfect API for modeling temporal data. But I don't get
> how thrift, which is very crude API, is a much better API at that than CQL
> (or, again, how it allows you to do things you can't with CQL).
>
> --
> Sylvain
>


Re: Compaction failing to trigger

2015-01-21 Thread Flavien Charlon
>
> What version of Cassandra are you running?


2.1.2

Are they all "live"? Are there pending compactions, or exceptions regarding
> compactions in your logs?


Yes they are all live according to cfstats. There is no pending compaction
or exception in the logs.

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/


This doesn't really answer my question; I asked whether this particular bug
(which I can't find in JIRA) is planned to be fixed in 2.1.3, not whether
2.1.3 would be production ready.

While we're on this topic, the version numbering is very misleading.
Versions which are not recommended for production should be very explicitly
labelled as such (beta for example), and 2.1.0 should really be what you
call now 2.1.6.

Setting 'cold_reads_to_omit' to 0 did the job for me


Thanks, I've tried it, and it works. This should probably be made the
default IMO.
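
For anyone else hitting this, the workaround looks roughly like the following
(a sketch only; 'mytable' is a placeholder, and cold_reads_to_omit is a
size-tiered compaction option in 2.1):

  ALTER TABLE mytable WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'cold_reads_to_omit': '0.0'
  };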

Flavien


On 20 January 2015 at 22:51, Eric Stevens  wrote:

> @Rob - he's probably referring to the thread titled "Reasons for nodes not
> compacting?" where Tyler speculates that the tables are falling below the
> cold read threshold for compaction.  He speculated it may be a bug.  At the
> same time in a different thread, Roland had a similar problem, and Tyler's
> proposed workaround seemed to work for him.
>
> On Tue, Jan 20, 2015 at 3:35 PM, Robert Coli  wrote:
>
>> On Sun, Jan 18, 2015 at 6:06 PM, Flavien Charlon <
>> flavien.char...@gmail.com> wrote:
>>
>>> It's set on all the tables, as I'm using the default for all the tables.
>>> But for that particular table there are 41 SSTables between 60MB and 85MB,
>>> it should only take 4 for the compaction to kick in.
>>>
>>
>> What version of Cassandra are you running?
>>
>> Are they all "live"? Are there pending compactions, or exceptions
>> regarding compactions in your logs?
>>
>>
>>> As this is probably a bug and going back in the mailing list archive, it
>>> seems it's already been reported:
>>>
>>
>> This is a weird statement. Are you saying that you've found it in the
>> mailing list archives? If so, why not paste the threads so those of us who
>> might remember can refer to them?
>>
>>>
>>>- Will it be fixed in 2.1.3?
>>>
>>>
>> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>>
>>
>> =Rob
>>
>>
>


Re: Re: Dynamic Columns

2015-01-21 Thread Sylvain Lebresne
On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin  wrote:

> I don't remember other people's examples in detail due to my shitty
> memory, so I'd rather not misquote.
>

Fair enough, but maybe you shouldn't use "people's examples you don't
remember" as an argument then. Those examples might be wrong or outdated and
that kind of stuff creates confusion for everyone.


>
> In my case, I mix static and dynamic columns in a single column family
> with primitives and objects. The objects are temporal object graphs with a
> known type. Doing this type of stuff is basically transparent for me, since
> I'm using thrift and our data modeler generates helper classes. Our tooling
> seamlessly converts the bytes back to the target object. We have a few
> standard static columns related to temporal metadata. At any time, dynamic
> columns can be added and they can be primitives or objects.
>

I don't see anything in that that cannot be done with CQL. You can mix
static and dynamic columns in CQL thanks to static columns. More precisely,
you can do what you're describing with a table looking a bit like this:
  CREATE TABLE t (
    key blob,
    my_static_column_1 int static,
    my_static_column_2 float static,
    my_static_column_3 blob static,
    dynamic_column_name blob,
    dynamic_column_value blob,
    PRIMARY KEY (key, dynamic_column_name)
  );

And your helper classes will serialize your objects as they probably do
today (if you use a custom comparator, you can do that too). And let it be
clear that I'm not pretending that doing it this way is tremendously
simpler than thrift. But I'm saying that 1) it's possible and 2) while it's
not meaningfully simpler than thrift, it's not really harder either (and
in fact, it's actually less verbose with CQL than with raw thrift).
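
To make the mechanics concrete, a small usage sketch against the table above
(all values are arbitrary): the static columns are stored once per partition,
while each dynamic column is one clustering row:

  INSERT INTO t (key, my_static_column_1) VALUES (0x01, 42);

  INSERT INTO t (key, dynamic_column_name, dynamic_column_value)
  VALUES (0x01, 0x6e616d65, 0x626f62);

  -- all dynamic columns of the object, with the static columns alongside:
  SELECT my_static_column_1, dynamic_column_name, dynamic_column_value
  FROM t WHERE key = 0x01;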


>
> For the record, doing this kind of stuff in a relational database sucks
> horribly.
>

I don't know what that has to do with CQL to be honest. If you're doing
relational with CQL you're doing it wrong. And please note that I'm not
saying CQL is the perfect API for modeling temporal data. But I don't get
how thrift, which is a very crude API, is a much better API at that than CQL
(or, again, how it allows you to do things you can't with CQL).

--
Sylvain


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
I don't remember other people's examples in detail due to my shitty memory,
so I'd rather not misquote.

In my case, I mix static and dynamic columns in a single column family with
primitives and objects. The objects are temporal object graphs with a known
type. Doing this type of stuff is basically transparent for me, since I'm
using thrift and our data modeler generates helper classes. Our tooling
seamlessly converts the bytes back to the target object. We have a few
standard static columns related to temporal metadata. At any time, dynamic
columns can be added and they can be primitives or objects. The framework
we built uses CQL for basic queries and views the user defines.

We model the schema in a GUI modeler and the framework provides a query API
to access a specific version or versions of any record. The design borrows
heavily from temporal logic and active databases.

For the record, doing this kind of stuff in a relational database sucks
horribly. The reason I chose to build a temporal database on Cassandra is
because I've done it on oracle/sqlserver in the past. Last year I submitted
a talk about our temporal database for the datastax conference, but it was
rejected since there were too many submissions. I know spotify also built a
temporal database on Cassandra and they gave a talk on what they did.

peter


On Wed, Jan 21, 2015 at 10:13 AM, Sylvain Lebresne 
wrote:

>
> I've chatted with several long-time users of Cassandra, and there are things
>> CQL3 doesn't support.
>>
>
> Would you care to elaborate then? Maybe a simple example of something (or
> multiple things since you used plural) in thrift that cannot be supported
> in CQL?
> And please note that I'm *not* saying that all existing thrift tables can
> be seamlessly used from CQL: there are indeed a few cases for which that's
> not the case. But that does not mean those cases cannot easily be done in
> CQL from scratch.
>


Re: Re: Dynamic Columns

2015-01-21 Thread Sylvain Lebresne
> I've chatted with several long-time users of Cassandra, and there are things
> CQL3 doesn't support.
>

Would you care to elaborate then? Maybe a simple example of something (or
multiple things since you used plural) in thrift that cannot be supported
in CQL?
And please note that I'm *not* saying that all existing thrift tables can be
seamlessly used from CQL: there are indeed a few cases for which that's not
the case. But that does not mean those cases cannot easily be done in CQL
from scratch.


Re: keyspace not exists?

2015-01-21 Thread Jason Wee
Thanks Rob, we'll keep this in mind for our learning journey.

Jason

On Wed, Jan 21, 2015 at 6:45 AM, Robert Coli  wrote:

> On Sun, Jan 18, 2015 at 8:55 PM, Jason Wee  wrote:
>
>> two nodes running cassandra 2.1.2 and one running cassandra 2.1.1
>>
>
> For the record, this is an unsupported persistent configuration. You are
> only supposed to have split minor versions during an upgrade.
>
> I have no idea if it is causing the problem you are having.
>
> =Rob
>
>


cassandra-stress - confusing documentation

2015-01-21 Thread Tzach Livyatan
Hi all
I'm using cassandra-stress directly from apache-cassandra-2.1.2/tools/bin
The documentation I found
http://www.datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCStress_t.html
is either too old or too advanced, and does not match what I use.

In particular, I fail to use the -key populate=1..100 option as used in
the two-node example from the link above.

# On Node1
$ cassandra-stress write tries=20 n=100 cl=one -mode native cql3 -schema keyspace="Keyspace1" -key populate=1..100 -log file=~/node1_load.log -node $NODES

# On Node2
$ cassandra-stress write tries=20 n=100 cl=one -mode native cql3 -schema keyspace="Keyspace1" -key populate=101..200 -log file=~/node2_load.log -node $NODES

Can someone please direct me to the right doc, or to a valid example of
using populate range?

Thanks
Tzach


Re: Count of column values

2015-01-21 Thread Poonam Ligade
Hi,

Sorry for the previous incomplete message.
I am using a where clause as follows:
select count(*) from trends where data1='abc' ALLOW FILTERING;
How can I store this count output in any other column?

Can you help with any workaround?

Thanks,
Poonam.

On Wed, Jan 21, 2015 at 7:46 PM, Poonam Ligade 
wrote:

> Hi,
>
> I am a newbie to Cassandra.
> I have to find out top 10 recent trends in data
>
> I have schema as follows
>
> create table trends(
> day int,
> data1 text,
> data2 map,
> PRIMARY KEY (day, data1)) ;
>
> I have to take a count of duplicate values in data1 so that I can find the
> top 10 data1 trends.
>
> 1. I tried adding a counter column, but again you can't use an ORDER BY
> clause on a counter column.
> 2. I tried using where clause
>


Count of column values

2015-01-21 Thread Poonam Ligade
Hi,

I am a newbie to Cassandra.
I have to find out top 10 recent trends in data

I have schema as follows

create table trends(
day int,
data1 text,
data2 map,
PRIMARY KEY (day, data1)) ;

I have to take a count of duplicate values in data1 so that I can find the
top 10 data1 trends.

1. I tried adding a counter column, but again you can't use an ORDER BY
clause on a counter column.
2. I tried using where clause
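
For what it's worth, a minimal sketch of the counter approach from point 1
(the table name is hypothetical): counters can maintain the per-value totals,
but since ORDER BY is not supported on counter columns, the top-10 ranking has
to be computed client-side after reading back the day's partition:

  CREATE TABLE trend_counts (
    day int,
    data1 text,
    hits counter,
    PRIMARY KEY (day, data1)
  );

  UPDATE trend_counts SET hits = hits + 1
  WHERE day = 20150121 AND data1 = 'abc';

  SELECT data1, hits FROM trend_counts WHERE day = 20150121;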


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
I've studied the source code and I don't believe that statement is true.
I've chatted with several long-time users of Cassandra, and there are things
CQL3 doesn't support.

Like I've said before, Thrift and CQL3 complement each other. I totally
understand some committers don't want the overhead due to time and resource
limitations. On more than one occasion, people have offered to help and
work on thrift, but were rejected. There are logs in JIRA.

For the record, it's great that CQL was created to make life easier for new
users. But here's the thing that annoys me. There are users that just want to
save and query data, but there are people out there like me that are building
tools for Cassandra. For tool builders, having an object API like thrift is
invaluable. If we look at relational databases, we see many of them have 2
separate APIs for that reason. Microsoft SQL Server has SQL and an object API.
Having both makes it easier to build tools. It's a shame to ignore all the
lessons RDBMS can teach us and suffer NIH syndrome. I've built several data
modeling tools over the years, including ORMs.

We built our own data modeling tool for the temporal database I built on
Cassandra, so this isn't just some hypothetical complaint. This is from
many years of first-hand experience. I understand my needs often don't and
won't line up with what's in Cassandra's roadmap. But that's the great
thing about open source. Should thrift go away permanently I'll just fork
Cassandra and do my own thing.


On Wed, Jan 21, 2015 at 8:53 AM, Sylvain Lebresne 
wrote:

> On Wed, Jan 21, 2015 at 3:46 AM, Peter Lin  wrote:
>
>>
>>  I don't understand why people [...] pretend it supports 100% of the use
>> cases.
>>
>
> Have you considered the possibility that it's actually true and you're just
> wrong by lack of knowledge?
>
> --
> Sylvain
>


Re: row cache hit is costlier for partiton with large rows

2015-01-21 Thread Sylvain Lebresne
The row cache saves partition data off-heap, which means that every cache
hit requires copying/deserializing the cached partition into the heap, and
the more rows per partition you cache, the longer it will take. This is why
it's currently not a good idea to cache too many rows per partition (unless
you know what you're doing).
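
One mitigation, if caching such tables is still desirable (a sketch in
Cassandra 2.1 syntax, reusing the table name from the mail below), is to cap
how many rows per partition the row cache keeps:

  ALTER TABLE table2_row_cache
  WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};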

On Wed, Jan 21, 2015 at 1:15 PM, nitin padalia 
wrote:

> Hi,
>
> With two different column families, when I do a read, a row cache hit is
> almost 15x costlier with larger partitions (1 rows per partition) in
> comparison to a partition with only 100 rows.
>
> The difference between the two column families is that one has 100 rows per
> partition and the other 1 rows per partition. The schema for the two tables
> is:
> CREATE TABLE table1_row_cache (
>   user_id uuid,
>   dept_id uuid,
>   location_id text,
>   locationmap_id uuid,
>   PRIMARY KEY ((user_id, location_id), dept_id)
> )
>
> CREATE TABLE table2_row_cache (
>   user_id uuid,
>   dept_id uuid,
>   location_id text,
>   locationmap_id uuid,
>   PRIMARY KEY ((user_id, dept_id), location_id)
> )
>
> Here is the tracing:
>
> Row cache Hit with Column Family table1_row_cache, 100 rows per partition:
>  Preparing statement [SharedPool-Worker-2] | 2015-01-20
> 14:35:47.54 | x.x.x.x |   1023
>   Row cache hit [SharedPool-Worker-5] | 2015-01-20
> 14:35:47.542000 | x.x.x.x |   2426
>
> Row cache Hit with CF table2_row_cache, 1 rows per partition:
> Preparing statement [SharedPool-Worker-1] | 2015-01-20 16:02:51.696000
> | x.x.x.x |490
>  Row cache hit [SharedPool-Worker-2] | 2015-01-20
> 16:02:51.711000 | x.x.x.x |  15146
>
>
> If for both cases the data is in memory, why is it not the same? Can someone
> point out what's wrong here?
>
> Nitin Padalia
>


Re: Re: Dynamic Columns

2015-01-21 Thread Sylvain Lebresne
On Wed, Jan 21, 2015 at 3:46 AM, Peter Lin  wrote:

>
>  I don't understand why people [...] pretend it supports 100% of the use
> cases.
>

Have you considered the possibility that it's actually true and you're just
wrong by lack of knowledge?

--
Sylvain


Re: Re: Dynamic Columns

2015-01-21 Thread Jonathan Lacefield
Hello,

  Peter highlighted the tradeoff between Thrift and CQL3 nicely in this
case, i.e. requiring a different design approach for this solution.
Collections do not sound like a good fit for your current challenge, but is
there a different way to design/solve your challenge using CQL techniques?

  It is recommended to leverage CQL for new projects as this is the
direction that Cassandra is heading and where the majority of effort is
being applied from a development perspective.

  Sounds like you have a decision to make.  Leverage Thrift and the Dynamic
Column approach to solving this problem.  Or, rethink the design approach
and leverage CQL.

  Please let the mailing list know the direction you choose.

Jonathan

Jonathan Lacefield
Solution Architect | (404) 822 3487 | jlacefi...@datastax.com

On Tue, Jan 20, 2015 at 9:46 PM, Peter Lin  wrote:

>
> the thing is, CQL only handles some types of dynamic column use cases.
> There's plenty of examples on datastax.com that show how to do CQL-style
> dynamic columns.
>
> based on what was described by Chetan, I don't feel CQL3 is a perfect fit
> for what he wants to do. To use CQL3, he'd have to change his approach.
>
> In my temporal database, I use both Thrift and CQL. They complement each
> other very nicely. I don't understand why people have to put down Thrift or
> pretend it supports 100% of the use cases. Lots of people started using
> Cassandra pre-CQL and had no problems using thrift. Yes you have to
> understand more and the learning curve is steeper, but taking time to learn
> the internals of cassandra is a good thing.
>
> Using CQL3 lists or maps, it would force the query to load the entire
> collection, but that is by design. To get the full power of the old style
> of dynamic columns, thrift is a better fit. I hope CQL continues to improve
> so that it supports 100% of the existing use cases.
>
>
>
> On Tue, Jan 20, 2015 at 8:50 PM, Xu Zhongxing 
> wrote:
>
>> I approximate dynamic columns by data_key and data_value columns.
>> Is there a better way to get dynamic columns in CQL 3?
>>
>> At 2015-01-21 09:41:02, "Peter Lin"  wrote:
>>
>>
>> I think that table example misses the point of chetan's functional
>> requirement. He actually needs dynamic columns.
>>
>> On Tue, Jan 20, 2015 at 8:12 PM, Xu Zhongxing 
>> wrote:
>>
>>> Maybe this is the closest thing to "dynamic columns" in CQL 3.
>>>
>>> create table review (
>>> product_id bigint,
>>> created_at timestamp,
>>> data_key text,
>>> data_tvalue text,
>>> data_ivalue int,
>>> primary key ((product_id, created_at), data_key)
>>> );
>>>
>>> data_tvalue and data_ivalue are optional.
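>>>
>>> For illustration (the values are hypothetical), a "slice" of the dynamic
>>> columns can then be read back with a range restriction on data_key:
>>>
>>> select data_key, data_tvalue from review
>>> where product_id = 1001 and created_at = '2015-01-21'
>>> and data_key >= 'a' and data_key < 'n';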
>>>
>>> At 2015-01-21 04:44:07, "chetan verma"  wrote:
>>>
>>> Hi,
>>>
>>> Adding to previous mail. For example: We have a column family named
>>> review (with some arbitrary data in maps).
>>>
>>> CREATE TABLE review(
>>> product_id bigint,
>>> created_at timestamp,
>>> data_int map<text, int>,
>>> data_text map<text, text>,
>>> PRIMARY KEY (product_id, created_at)
>>> );
>>>
>>> Assume that these 2 maps I use to store arbitrary data (i.e. data_int
>>> and data_text for int and text values)
>>> when we see the output in cassandra-cli, it looks, within a partition, like
>>> :data_int:map_key as the column name and the map value as the value.
>>> Suppose I need to get this value; I couldn't do that with CQL3, but in
>>> thrift it's possible. Any solution?
>>>
>>> On Wed, Jan 21, 2015 at 1:06 AM, chetan verma 
>>> wrote:
>>>
 Hi,

 Most of the time I will be querying on product_id and created_at, but
 for analytics I need to query on almost all columns.
 The multiple-collections idea is good, but the only issue is Cassandra reads
 a collection entirely; what if I need a slice of it, I mean
 columns for certain keys, which is possible with thrift. Please suggest.

 On Wed, Jan 21, 2015 at 12:36 AM, Jonathan Lacefield <
 jlacefi...@datastax.com> wrote:

> Hello,
>
> There are probably lots of options to this challenge.  The more
> details around your use case that you can provide, the easier it will be
> for this group to offer advice.
>
> A few follow-up questions:
>   - How will you query this data?
>   - Do your queries require filtering on specific columns other than
> product_id and created_at, i.e. the dynamic columns?
>
> Depending on the answers to these questions, you have several options,
> of which here are a few:
>
>- Cassandra efficiently stores sparse data, so you could create
>columns and not populate them, without much of a penalty
>- Could use a clustering column to store a columns type

row cache hit is costlier for partiton with large rows

2015-01-21 Thread nitin padalia
Hi,

With two different column families, when I do a read, a row cache hit is
almost 15x costlier with larger partitions (1 rows per partition) in
comparison to a partition with only 100 rows.

The difference between the two column families is that one has 100 rows per
partition and the other 1 rows per partition. The schema for the two tables
is:
CREATE TABLE table1_row_cache (
  user_id uuid,
  dept_id uuid,
  location_id text,
  locationmap_id uuid,
  PRIMARY KEY ((user_id, location_id), dept_id)
)

CREATE TABLE table2_row_cache (
  user_id uuid,
  dept_id uuid,
  location_id text,
  locationmap_id uuid,
  PRIMARY KEY ((user_id, dept_id), location_id)
)

Here is the tracing:

Row cache Hit with Column Family table1_row_cache, 100 rows per partition:
 Preparing statement [SharedPool-Worker-2] | 2015-01-20
14:35:47.54 | x.x.x.x |   1023
  Row cache hit [SharedPool-Worker-5] | 2015-01-20
14:35:47.542000 | x.x.x.x |   2426

Row cache Hit with CF table2_row_cache, 1 rows per partition:
Preparing statement [SharedPool-Worker-1] | 2015-01-20 16:02:51.696000
| x.x.x.x |490
 Row cache hit [SharedPool-Worker-2] | 2015-01-20
16:02:51.711000 | x.x.x.x |  15146


If for both cases the data is in memory, why is it not the same? Can someone
point out what's wrong here?

Nitin Padalia


Re: Versioning in cassandra while indexing ?

2015-01-21 Thread Kai Wang
depending on your data model, a static column might be useful.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-6561
On Jan 21, 2015 2:56 AM, "Pandian R"  wrote:

> Hi,
>
> I just wanted to know if there is any kind of versioning system in
> cassandra while indexing new data(like the one we have for ElasticSearch,
> for example).
>
> For example, I have a series of payloads each coming with an id and
> 'updatedAt' timestamp. I just want to maintain the latest state of any
> payload for all the ids, i.e., index the data only if the current payload has
> a greater 'updatedAt' than the previously stored timestamp. I can do this
> with one additional self-lookup, but is there a way to achieve this without
> the overhead of an additional lookup?
>
> Thanks !
>
> --
> Regards,
> Pandian
>


Re: Versioning in cassandra while indexing ?

2015-01-21 Thread Pandian R
Awesome. Thanks a lot Graham. Will use the clock timestamp for versioning :)

On Wed, Jan 21, 2015 at 2:02 PM, graham sanderson  wrote:

> I believe you can use “USING TIMESTAMP XXX” with your inserts which will
> set the actual cell write times to the timestamp you provide. Then at least
> on read you’ll get the “latest” value… you may or may not incur an actual
> write of the old data to disk, but either way it’ll get cleaned up for you.
>
> > On Jan 21, 2015, at 1:54 AM, Pandian R  wrote:
> >
> > Hi,
> >
> > I just wanted to know if there is any kind of versioning system in
> cassandra while indexing new data(like the one we have for ElasticSearch,
> for example).
> >
> > For example, I have a series of payloads each coming with an id and
> 'updatedAt' timestamp. I just want to maintain the latest state of any
> payload for all the ids, i.e., index the data only if the current payload has
> a greater 'updatedAt' than the previously stored timestamp. I can do this
> with one additional self-lookup, but is there a way to achieve this without
> the overhead of an additional lookup?
> >
> > Thanks !
> >
> > --
> > Regards,
> > Pandian
>
>


-- 
Regards,
Pandian


Re: Versioning in cassandra while indexing ?

2015-01-21 Thread graham sanderson
I believe you can use “USING TIMESTAMP XXX” with your inserts which will set 
the actual cell write times to the timestamp you provide. Then at least on read 
you’ll get the “latest” value… you may or may not incur an actual write of the 
old data to disk, but either way it’ll get cleaned up for you.
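
To make that concrete, a sketch (the table and values are hypothetical):
supply updatedAt, converted to microseconds since the epoch, as the write
timestamp, and the insert carrying the highest timestamp wins on read
regardless of arrival order:

  CREATE TABLE payload_state (
    id bigint PRIMARY KEY,
    payload text
  );

  -- updatedAt = 2015-01-21 00:00:00 UTC, expressed in microseconds:
  INSERT INTO payload_state (id, payload) VALUES (42, 'v2')
  USING TIMESTAMP 1421798400000000;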

> On Jan 21, 2015, at 1:54 AM, Pandian R  wrote:
> 
> Hi,
> 
> I just wanted to know if there is any kind of versioning system in cassandra 
> while indexing new data(like the one we have for ElasticSearch, for example). 
> 
> For example, I have a series of payloads each coming with an id and 
> 'updatedAt' timestamp. I just want to maintain the latest state of any 
> payload for all the ids, i.e., index the data only if the current payload has 
> a greater 'updatedAt' than the previously stored timestamp. I can do this with 
> one additional self-lookup, but is there a way to achieve this without 
> the overhead of an additional lookup?
> 
> Thanks !
> 
> -- 
> Regards,
> Pandian


