Re: How much disk is needed to compact Leveled compaction?

2015-04-06 Thread Jean Tremblay
I am only using LeveledCompactionStrategy, and as I describe in my original 
mail, I don’t understand why C* is complaining that it cannot compact when I 
have more than 40% free disk space.



On 07 Apr 2015, at 01:10, Bryan Holladay <holla...@longsight.com> wrote:


What other storage-impacting commands or nuances do you have to consider when 
you switch to leveled compaction? For instance, the nodetool cleanup documentation says

"Running the nodetool cleanup command causes a temporary increase in disk space 
usage proportional to the size of your largest SSTable."

Are SSTables smaller with leveled compaction, making this a non-issue?

How can you determine what the new threshold for storage space is?

Thanks,
Bryan

On Apr 6, 2015 6:19 PM, "DuyHai Doan" <doanduy...@gmail.com> wrote:

If you have SSD, you may afford switching to leveled compaction strategy, which 
requires much less than 50% of the current dataset for free space
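
For reference, a minimal sketch of what switching a table to leveled compaction looks 
like in CQL; the keyspace and table names are placeholders, not taken from this thread:

-- Switch an existing table to LCS; 160 MB is the default SSTable target size,
-- so listing it explicitly is optional.
ALTER TABLE my_ks.my_table
  WITH compaction = { 'class': 'LeveledCompactionStrategy',
                      'sstable_size_in_mb': 160 };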

On 5 Apr 2015 at 19:04, "daemeon reiydelle" <daeme...@gmail.com> wrote:

You appear to have multiple java binaries in your path. That needs to be 
resolved.

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872

On Apr 5, 2015 1:40 AM, "Jean Tremblay" <jean.tremb...@zen-innovations.com> wrote:
Hi,
I have a cluster of 5 nodes. We use cassandra 2.1.3.

The 5 nodes use about 50-57% of the 1T SSD.
One node managed to compact all its data. During one compaction this node used 
almost 100% of the drive. The other nodes refuse to continue compaction 
claiming that there is not enough disk space.

From the documentation LeveledCompactionStrategy should be able to compact my 
data, well at least this is what I understand.

<<Size-tiered compaction requires at least as much free disk space for compaction 
as the size of the largest column family. Leveled compaction needs much less space 
for compaction, only 10 * sstable_size_in_mb. However, even if you’re using leveled 
compaction, you should leave much more free disk space available than this to 
accommodate streaming, repair, and snapshots, which can easily use 10GB or more of 
disk space. Furthermore, disk performance tends to decline after 80 to 90% of the 
disk space is used, so don’t push the boundaries.>>

This is the disk usage. Node 4 is the only one that could compact everything.
node0: /dev/disk1 931Gi 534Gi 396Gi 57% /
node1: /dev/disk1 931Gi 513Gi 417Gi 55% /
node2: /dev/disk1 931Gi 526Gi 404Gi 57% /
node3: /dev/disk1 931Gi 507Gi 424Gi 54% /
node4: /dev/disk1 931Gi 475Gi 456Gi 51% /

When I try to compact the other ones I get this:

objc[18698]: Class JavaLaunchHelper is implemented in both 
/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java and 
/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/libinstrument.dylib.
 One of the two will be used. Which one is undefined.
error: Not enough space for compaction, estimated sstables = 2894, expected 
write size = 485616651726
-- StackTrace --
java.lang.RuntimeException: Not enough space for compaction, estimated sstables 
= 2894, expected write size = 485616651726
at 
org.apache.cassandra.db.compaction.CompactionTask.checkAvailableDiskSpace(CompactionTask.java:293)
at 
org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:127)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at 
org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:76)
at 
org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at 
org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:512)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

I did not set sstable_size_in_mb; I use the 160MB default.
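
For reference, the setting a table is actually using can be checked on a 2.1 node by 
querying the schema tables; a sketch, assuming the 2.1-era system.schema_columnfamilies 
catalog and placeholder keyspace/table names:

-- Shows the configured compaction strategy and its options (e.g. sstable_size_in_mb).
SELECT compaction_strategy_class, compaction_strategy_options
  FROM system.schema_columnfamilies
 WHERE keyspace_name = 'my_ks' AND columnfamily_name = 'my_table';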

Is it normal that during compaction it needs so much disk space? What would be 
the best solution to overcome this problem?

Thanks for your help




Re: log all the query statement

2015-04-06 Thread Anishek Agarwal
Hey Peter,

This is from the perspective of 2.0.13, but there should be something
similar in your version. Can you enable debug logging for Cassandra and see if
the log files have additional info? Depending on how soon or late in your test
you get the error, you might also want to modify the "maxBackupIndex" or
"maxFileSize" settings to make sure you keep enough log files around.

anishek

On Thu, Apr 2, 2015 at 11:53 AM, 鄢来琼  wrote:

>  Hi all,
>
>
>
> Cassandra 2.1.2 is used in my project, but some nodes go down after
> executing some query statements.
>
> Could I configure Cassandra to log all the executed statements?
>
> Hope the log file can be used to identify the problem.
>
> Thanks.
>
>
>
> Peter
>
>
>


DataStax object mapping and lightweight transactions

2015-04-06 Thread Sha Liu
Hi,

Does the latest DataStax Java driver (2.1.5) support lightweight transactions 
using object mapping? For example, if I set the write consistency level of the 
mapped class to SERIAL through an annotation, does the “save” operation then use a 
lightweight transaction instead of a normal write?

Thanks,
Sha Liu
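
Whether or not the 2.1.5 mapper generates one, at the CQL level a lightweight 
transaction is a statement that carries an IF clause; the SERIAL consistency level 
governs the Paxos phase of such conditional statements and does not by itself turn an 
ordinary write into one. A minimal sketch with placeholder keyspace, table, and column 
names:

-- Conditional insert (lightweight transaction); the IF clause is what makes it an LWT.
INSERT INTO my_ks.users (user_id, name)
VALUES (now(), 'alice')
IF NOT EXISTS;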

Re: How much disk is needed to compact Leveled compaction?

2015-04-06 Thread Bryan Holladay
What other storage-impacting commands or nuances do you have to consider
when you switch to leveled compaction? For instance, the nodetool cleanup documentation says

"Running the nodetool cleanup command causes a temporary increase in disk
space usage proportional to the size of your largest SSTable."

Are SSTables smaller with leveled compaction, making this a non-issue?

How can you determine what the new threshold for storage space is?

Thanks,
Bryan
 On Apr 6, 2015 6:19 PM, "DuyHai Doan"  wrote:

> If you have SSD, you may afford switching to leveled compaction strategy,
> which requires much less than 50% of the current dataset for free space
> On 5 Apr 2015 at 19:04, "daemeon reiydelle"  wrote:
>
>> You appear to have multiple java binaries in your path. That needs to be
>> resolved.
>>
>> sent from my mobile
>> Daemeon C.M. Reiydelle
>> USA 415.501.0198
>> London +44.0.20.8144.9872
>> On Apr 5, 2015 1:40 AM, "Jean Tremblay" <
>> jean.tremb...@zen-innovations.com> wrote:
>>
>>>  Hi,
>>> I have a cluster of 5 nodes. We use cassandra 2.1.3.
>>>
>>>  The 5 nodes use about 50-57% of the 1T SSD.
>>>  One node managed to compact all its data. During one compaction this
>>> node used almost 100% of the drive. The other nodes refuse to continue
>>> compaction claiming that there is not enough disk space.
>>>
>>>  From the documentation LeveledCompactionStrategy should be able to
>>> compact my data, well at least this is what I understand.
>>>
>>>  <<Size-tiered compaction requires at least as much free disk space for
>>> compaction as the size of the largest column family. Leveled compaction
>>> needs much less space for compaction, only 10 * sstable_size_in_mb.
>>> However, even if you’re using leveled compaction, you should leave much
>>> more free disk space available than this to accommodate streaming, repair,
>>> and snapshots, which can easily use 10GB or more of disk space.
>>> Furthermore, disk performance tends to decline after 80 to 90% of the disk
>>> space is used, so don’t push the boundaries.>>
>>>
>>>  This is the disk usage. Node 4 is the only one that could compact
>>> everything.
>>>  node0: /dev/disk1 931Gi 534Gi 396Gi 57% /
>>> node1: /dev/disk1 931Gi 513Gi 417Gi 55% /
>>> node2: /dev/disk1 931Gi 526Gi 404Gi 57% /
>>> node3: /dev/disk1 931Gi 507Gi 424Gi 54% /
>>> node4: /dev/disk1 931Gi 475Gi 456Gi 51% /
>>>
>>>  When I try to compact the other ones I get this:
>>>
>>>  objc[18698]: Class JavaLaunchHelper is implemented in both
>>> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java
>>> and /Library/Java/JavaVirtualMachines/jdk1.8.0_
>>> 40.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will
>>> be used. Which one is undefined.
>>> error: Not enough space for compaction, estimated sstables = 2894,
>>> expected write size = 485616651726
>>> -- StackTrace --
>>> java.lang.RuntimeException: Not enough space for compaction, estimated
>>> sstables = 2894, expected write size = 485616651726
>>> at org.apache.cassandra.db.compaction.CompactionTask.
>>> checkAvailableDiskSpace(CompactionTask.java:293)
>>> at org.apache.cassandra.db.compaction.CompactionTask.
>>> runMayThrow(CompactionTask.java:127)
>>> at org.apache.cassandra.utils.WrappedRunnable.run(
>>> WrappedRunnable.java:28)
>>> at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(
>>> CompactionTask.java:76)
>>> at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(
>>> AbstractCompactionTask.java:59)
>>> at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(
>>> CompactionManager.java:512)
>>> at org.apache.cassandra.utils.WrappedRunnable.run(
>>> WrappedRunnable.java:28)
>>>
>>>   I did not set the sstable_size_in_mb I use the 160MB default.
>>>
>>>  Is it normal that during compaction it needs so much diskspace? What
>>> would be the best solution to overcome this problem?
>>>
>>>  Thanks for your help
>>>
>>>


Re: How much disk is needed to compact Leveled compaction?

2015-04-06 Thread Ali Akhtar
I may have misunderstood, but it seems that he was already using
LeveledCompaction

On Tue, Apr 7, 2015 at 3:17 AM, DuyHai Doan  wrote:

> If you have SSD, you may afford switching to leveled compaction strategy,
> which requires much less than 50% of the current dataset for free space
> On 5 Apr 2015 at 19:04, "daemeon reiydelle"  wrote:
>
>> You appear to have multiple java binaries in your path. That needs to be
>> resolved.
>>
>> sent from my mobile
>> Daemeon C.M. Reiydelle
>> USA 415.501.0198
>> London +44.0.20.8144.9872
>> On Apr 5, 2015 1:40 AM, "Jean Tremblay" <
>> jean.tremb...@zen-innovations.com> wrote:
>>
>>>  Hi,
>>> I have a cluster of 5 nodes. We use cassandra 2.1.3.
>>>
>>>  The 5 nodes use about 50-57% of the 1T SSD.
>>>  One node managed to compact all its data. During one compaction this
>>> node used almost 100% of the drive. The other nodes refuse to continue
>>> compaction claiming that there is not enough disk space.
>>>
>>>  From the documentation LeveledCompactionStrategy should be able to
>>> compact my data, well at least this is what I understand.
>>>
>>>  <<Size-tiered compaction requires at least as much free disk space for
>>> compaction as the size of the largest column family. Leveled compaction
>>> needs much less space for compaction, only 10 * sstable_size_in_mb.
>>> However, even if you’re using leveled compaction, you should leave much
>>> more free disk space available than this to accommodate streaming, repair,
>>> and snapshots, which can easily use 10GB or more of disk space.
>>> Furthermore, disk performance tends to decline after 80 to 90% of the disk
>>> space is used, so don’t push the boundaries.>>
>>>
>>>  This is the disk usage. Node 4 is the only one that could compact
>>> everything.
>>>  node0: /dev/disk1 931Gi 534Gi 396Gi 57% /
>>> node1: /dev/disk1 931Gi 513Gi 417Gi 55% /
>>> node2: /dev/disk1 931Gi 526Gi 404Gi 57% /
>>> node3: /dev/disk1 931Gi 507Gi 424Gi 54% /
>>> node4: /dev/disk1 931Gi 475Gi 456Gi 51% /
>>>
>>>  When I try to compact the other ones I get this:
>>>
>>>  objc[18698]: Class JavaLaunchHelper is implemented in both
>>> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java
>>> and /Library/Java/JavaVirtualMachines/jdk1.8.0_
>>> 40.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will
>>> be used. Which one is undefined.
>>> error: Not enough space for compaction, estimated sstables = 2894,
>>> expected write size = 485616651726
>>> -- StackTrace --
>>> java.lang.RuntimeException: Not enough space for compaction, estimated
>>> sstables = 2894, expected write size = 485616651726
>>> at org.apache.cassandra.db.compaction.CompactionTask.
>>> checkAvailableDiskSpace(CompactionTask.java:293)
>>> at org.apache.cassandra.db.compaction.CompactionTask.
>>> runMayThrow(CompactionTask.java:127)
>>> at org.apache.cassandra.utils.WrappedRunnable.run(
>>> WrappedRunnable.java:28)
>>> at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(
>>> CompactionTask.java:76)
>>> at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(
>>> AbstractCompactionTask.java:59)
>>> at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(
>>> CompactionManager.java:512)
>>> at org.apache.cassandra.utils.WrappedRunnable.run(
>>> WrappedRunnable.java:28)
>>>
>>>   I did not set the sstable_size_in_mb I use the 160MB default.
>>>
>>>  Is it normal that during compaction it needs so much diskspace? What
>>> would be the best solution to overcome this problem?
>>>
>>>  Thanks for your help
>>>
>>>


Re: How much disk is needed to compact Leveled compaction?

2015-04-06 Thread DuyHai Doan
If you have SSD, you may afford switching to leveled compaction strategy,
which requires much less than 50% of the current dataset for free space
On 5 Apr 2015 at 19:04, "daemeon reiydelle"  wrote:

> You appear to have multiple java binaries in your path. That needs to be
> resolved.
>
> sent from my mobile
> Daemeon C.M. Reiydelle
> USA 415.501.0198
> London +44.0.20.8144.9872
> On Apr 5, 2015 1:40 AM, "Jean Tremblay" 
> wrote:
>
>>  Hi,
>> I have a cluster of 5 nodes. We use cassandra 2.1.3.
>>
>>  The 5 nodes use about 50-57% of the 1T SSD.
>>  One node managed to compact all its data. During one compaction this
>> node used almost 100% of the drive. The other nodes refuse to continue
>> compaction claiming that there is not enough disk space.
>>
>>  From the documentation LeveledCompactionStrategy should be able to
>> compact my data, well at least this is what I understand.
>>
>>  <<Size-tiered compaction requires at least as much free disk space for
>> compaction as the size of the largest column family. Leveled compaction
>> needs much less space for compaction, only 10 * sstable_size_in_mb.
>> However, even if you’re using leveled compaction, you should leave much
>> more free disk space available than this to accommodate streaming, repair,
>> and snapshots, which can easily use 10GB or more of disk space.
>> Furthermore, disk performance tends to decline after 80 to 90% of the disk
>> space is used, so don’t push the boundaries.>>
>>
>>  This is the disk usage. Node 4 is the only one that could compact
>> everything.
>>  node0: /dev/disk1 931Gi 534Gi 396Gi 57% /
>> node1: /dev/disk1 931Gi 513Gi 417Gi 55% /
>> node2: /dev/disk1 931Gi 526Gi 404Gi 57% /
>> node3: /dev/disk1 931Gi 507Gi 424Gi 54% /
>> node4: /dev/disk1 931Gi 475Gi 456Gi 51% /
>>
>>  When I try to compact the other ones I get this:
>>
>>  objc[18698]: Class JavaLaunchHelper is implemented in both
>> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java
>> and /Library/Java/JavaVirtualMachines/jdk1.8.0_
>> 40.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be
>> used. Which one is undefined.
>> error: Not enough space for compaction, estimated sstables = 2894,
>> expected write size = 485616651726
>> -- StackTrace --
>> java.lang.RuntimeException: Not enough space for compaction, estimated
>> sstables = 2894, expected write size = 485616651726
>> at org.apache.cassandra.db.compaction.CompactionTask.
>> checkAvailableDiskSpace(CompactionTask.java:293)
>> at org.apache.cassandra.db.compaction.CompactionTask.
>> runMayThrow(CompactionTask.java:127)
>> at org.apache.cassandra.utils.WrappedRunnable.run(
>> WrappedRunnable.java:28)
>> at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(
>> CompactionTask.java:76)
>> at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(
>> AbstractCompactionTask.java:59)
>> at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(
>> CompactionManager.java:512)
>> at org.apache.cassandra.utils.WrappedRunnable.run(
>> WrappedRunnable.java:28)
>>
>>   I did not set the sstable_size_in_mb I use the 160MB default.
>>
>>  Is it normal that during compaction it needs so much diskspace? What
>> would be the best solution to overcome this problem?
>>
>>  Thanks for your help
>>
>>


Re: Timeseries analysis using Cassandra and partition by date period

2015-04-06 Thread Serega Sheypak
Thank you, we'll have a look at that tool.

2015-04-06 12:30 GMT+02:00 Srinivasa T N :

> Comparison to OpenTSDB HBase
>
> For one we do not use id’s for strings. The string data (metric names and
> tags) are written to row keys and the appropriate indexes. Because
> Cassandra has much wider rows there are far fewer keys written to the
> database. The space saved by using id’s is minor and by not using id’s we
> avoid having to use any kind of locks across the cluster.
>
> As mentioned the Cassandra has wider rows. The default row size in
> OpenTSDB HBase is 1 hour. Cassandra is set to 3 weeks.
> http://kairosdb.github.io/kairosdocs/CassandraSchema.html
>
> On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak 
> wrote:
>
>> Thanks, is it a kind of opentsdb?
>>
>> 2015-04-05 18:28 GMT+02:00 Kevin Burton :
>>
>>> > Hi, I switched from HBase to Cassandra and try to find problem
>>> solution for timeseries analysis on top Cassandra.
>>>
>>> Depending on what you’re looking for, you might want to check out
>>> KairosDB.
>>>
>>> 0.95 beta2 just shipped yesterday as well so you have good timing.
>>>
>>> https://github.com/kairosdb/kairosdb
>>>
>>> On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <
>>> serega.shey...@gmail.com> wrote:
>>>
 Okay, so bucketing by day/week/month is a capacity planning stuff and
 actual questions I want to ask.
 As a conclusion:
 I have a table events

 CREATE TABLE user_plans (
   id timeuuid,
   user_id timeuuid,
   event_ts timestamp,
   event_type int,
   some_other_attr text

 PRIMARY KEY (user_id, ends)
 );
 which fits tactic queries:
 select smth from user_plans where user_id='xxx' and end_ts > now()

 Then I create a second table user_plans_daily (or weekly, monthly)

 with DDL:
 CREATE TABLE user_plans_daily/weekly/monthly (
   ymd int,
   user_id timeuuid,
   event_ts timestamp,
   event_type int,
   some_other_attr text
 )
 PRIMARY KEY ((ymd, user_id), event_ts )
 WITH CLUSTERING ORDER BY (event_ts DESC);

 And this table is good for answering strategic questions:
 select * from
 user_plans_daily/weekly/monthly
 where ymd in ()
 And I should avoid long condition inside IN clause, that is why you
 suggest me to create bigger bucket, correct?
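
 As a reference point, a cleaned-up sketch of the bucketed table and the two query
 shapes above, with illustrative names and assumptions: the bucket is a monthly
 integer (yyyymm), the event timestamp is the clustering column, and the bucket sits
 last in the partition key because Cassandra 2.x only accepts an IN restriction on
 the last partition-key column:

-- Illustrative schema; assumes a keyspace has already been selected (USE ...).
CREATE TABLE user_plans_monthly (
    user_id timeuuid,
    yyyymm int,
    event_ts timestamp,
    event_type int,
    some_other_attr text,
    PRIMARY KEY ((user_id, yyyymm), event_ts)
) WITH CLUSTERING ORDER BY (event_ts DESC);

-- One month for one user, with a time range inside the partition
-- (the user id is a placeholder v1 UUID):
SELECT event_ts, event_type, some_other_attr
  FROM user_plans_monthly
 WHERE user_id = 50554d6e-29bb-11e5-b345-feff819cdc9f
   AND yyyymm = 201501
   AND event_ts >= '2015-01-01' AND event_ts < '2015-02-01';

-- Several months in one statement, using IN on the last partition-key column;
-- keeping this list short is exactly the "avoid long IN conditions" point above.
SELECT event_ts, event_type, some_other_attr
  FROM user_plans_monthly
 WHERE user_id = 50554d6e-29bb-11e5-b345-feff819cdc9f
   AND yyyymm IN (201501, 201502, 201503);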


 2015-04-04 20:00 GMT+02:00 Jack Krupansky :

> It sounds like your time bucket should be a month, but it depends on
> the amount of data per user per day and your main query range. Within the
> partition you can then query for a range of days.
>
> Yes, all of the rows within a partition are stored on one physical
> node as well as the replica nodes.
>
> -- Jack Krupansky
>
> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <
> serega.shey...@gmail.com> wrote:
>
>> >non-equal relation on a partition key is not supported
>> Ok, can I generate select query:
>> select some_attributes
>> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ...
>> or 20150331
>>
>> > The partition key determines which node can satisfy the query
>> So you mean that all rows with the same *(ymd, user_id)* would be on
>> one physical node?
>>
>>
>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky :
>>
>>> Unfortunately, a non-equal relation on a partition key is not
>>> supported. You would need to bucket by some larger unit, like a month, 
>>> and
>>> then use the date/time as a clustering column for the row key. Then you
>>> could query within the partition. The partition key determines which 
>>> node
>>> can satisfy the query. Designing your partition key judiciously is the 
>>> key
>>> (haha!) to performant Cassandra applications.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
>>> serega.shey...@gmail.com> wrote:
>>>
 Hi, we plan to have 10^8 users and each user could generate 10
 events per day.
 So we have:
 10^8 records per day
 10^8*30 records per month.
 Our timewindow analysis could be from 1 to 6 months.

 Right now PK is PRIMARY KEY (user_id, ends) where endts is exact
 ts of event.

 So you suggest this approach:
 *PRIMARY KEY ((ymd, user_id), event_ts ) *
 *WITH CLUSTERING ORDER BY (**event_ts*
 * DESC);*

 where ymd=20150102 (the Second of January)?

 *What happens to writes:*
 SSTable with past days (ymd < current_day) stay untouched and don't
 take part in Compaction process since there are no changes to them?

 What happens to read:
 I issue query:
 select some_attributes
 from events where ymd >= 20150101 and ymd < 20150301
 Does Cassandra skip SSTables which don't have ymd in specified range
 and give me a kind of partition elimination, like in traditional DBs?

Re: Timeseries analysis using Cassandra and partition by date period

2015-04-06 Thread Srinivasa T N
 Comparison to OpenTSDB HBase

For one we do not use id’s for strings. The string data (metric names and
tags) are written to row keys and the appropriate indexes. Because
Cassandra has much wider rows there are far fewer keys written to the
database. The space saved by using id’s is minor and by not using id’s we
avoid having to use any kind of locks across the cluster.

As mentioned the Cassandra has wider rows. The default row size in OpenTSDB
HBase is 1 hour. Cassandra is set to 3 weeks.
http://kairosdb.github.io/kairosdocs/CassandraSchema.html
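
This is not KairosDB's actual schema (the link above documents that), but a rough
sketch of the wide-row idea being described, with made-up names: one partition holds a
three-week slice of one metric, so far fewer partition keys are written than with
one-hour rows:

-- One partition = one metric for one 3-week bucket; points are clustered by time.
CREATE TABLE metric_points (
    metric text,
    bucket_start timestamp,     -- start of the 3-week bucket this point falls in
    ts timestamp,
    value double,
    PRIMARY KEY ((metric, bucket_start), ts)
) WITH CLUSTERING ORDER BY (ts ASC);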

On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak 
wrote:

> Thanks, is it a kind of opentsdb?
>
> 2015-04-05 18:28 GMT+02:00 Kevin Burton :
>
>> > Hi, I switched from HBase to Cassandra and try to find problem solution
>> for timeseries analysis on top Cassandra.
>>
>> Depending on what you’re looking for, you might want to check out
>> KairosDB.
>>
>> 0.95 beta2 just shipped yesterday as well so you have good timing.
>>
>> https://github.com/kairosdb/kairosdb
>>
>> On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak  wrote:
>>
>>> Okay, so bucketing by day/week/month is a capacity planning stuff and
>>> actual questions I want to ask.
>>> As a conclusion:
>>> I have a table events
>>>
>>> CREATE TABLE user_plans (
>>>   id timeuuid,
>>>   user_id timeuuid,
>>>   event_ts timestamp,
>>>   event_type int,
>>>   some_other_attr text
>>>
>>> PRIMARY KEY (user_id, ends)
>>> );
>>> which fits tactic queries:
>>> select smth from user_plans where user_id='xxx' and end_ts > now()
>>>
>>> Then I create a second table user_plans_daily (or weekly, monthly)
>>>
>>> with DDL:
>>> CREATE TABLE user_plans_daily/weekly/monthly (
>>>   ymd int,
>>>   user_id timeuuid,
>>>   event_ts timestamp,
>>>   event_type int,
>>>   some_other_attr text
>>> )
>>> PRIMARY KEY ((ymd, user_id), event_ts )
>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>
>>> And this table is good for answering strategic questions:
>>> select * from
>>> user_plans_daily/weekly/monthly
>>> where ymd in ()
>>> And I should avoid long condition inside IN clause, that is why you
>>> suggest me to create bigger bucket, correct?
>>>
>>>
>>> 2015-04-04 20:00 GMT+02:00 Jack Krupansky :
>>>
 It sounds like your time bucket should be a month, but it depends on
 the amount of data per user per day and your main query range. Within the
 partition you can then query for a range of days.

 Yes, all of the rows within a partition are stored on one physical node
 as well as the replica nodes.

 -- Jack Krupansky

 On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <
 serega.shey...@gmail.com> wrote:

> >non-equal relation on a partition key is not supported
> Ok, can I generate select query:
> select some_attributes
> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or
> 20150331
>
> > The partition key determines which node can satisfy the query
> So you mean that all rows with the same *(ymd, user_id)* would be on
> one physical node?
>
>
> 2015-04-04 16:38 GMT+02:00 Jack Krupansky :
>
>> Unfortunately, a non-equal relation on a partition key is not
>> supported. You would need to bucket by some larger unit, like a month, 
>> and
>> then use the date/time as a clustering column for the row key. Then you
>> could query within the partition. The partition key determines which node
>> can satisfy the query. Designing your partition key judiciously is the 
>> key
>> (haha!) to performant Cassandra applications.
>>
>> -- Jack Krupansky
>>
>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
>> serega.shey...@gmail.com> wrote:
>>
>>> Hi, we plan to have 10^8 users and each user could generate 10
>>> events per day.
>>> So we have:
>>> 10^8 records per day
>>> 10^8*30 records per month.
>>> Our timewindow analysis could be from 1 to 6 months.
>>>
>>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts
>>> of event.
>>>
>>> So you suggest this approach:
>>> *PRIMARY KEY ((ymd, user_id), event_ts ) *
>>> *WITH CLUSTERING ORDER BY (**event_ts*
>>> * DESC);*
>>>
>>> where ymd=20150102 (the Second of January)?
>>>
>>> *What happens to writes:*
>>> SSTable with past days (ymd < current_day) stay untouched and don't
>>> take part in Compaction process since there are no changes to them?
>>>
>>> What happens to read:
>>> I issue query:
>>> select some_attributes
>>> from events where ymd >= 20150101 and ymd < 20150301
>>> Does Cassandra skip SSTables which don't have ymd in specified range
>>> and give me a kind of partition elimination, like in traditional DBs?
>>>
>>>
>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky 
>>> :
>>>
 It depends on the actual number of events per user, but simply
 bucketing the partition 

Re: Timeseries analysis using Cassandra and partition by date period

2015-04-06 Thread Serega Sheypak
Thanks, is it a kind of OpenTSDB?

2015-04-05 18:28 GMT+02:00 Kevin Burton :

> > Hi, I switched from HBase to Cassandra and try to find problem solution
> for timeseries analysis on top Cassandra.
>
> Depending on what you’re looking for, you might want to check out KairosDB.
>
> 0.95 beta2 just shipped yesterday as well so you have good timing.
>
> https://github.com/kairosdb/kairosdb
>
> On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak 
> wrote:
>
>> Okay, so bucketing by day/week/month is a capacity planning stuff and
>> actual questions I want to ask.
>> As a conclusion:
>> I have a table events
>>
>> CREATE TABLE user_plans (
>>   id timeuuid,
>>   user_id timeuuid,
>>   event_ts timestamp,
>>   event_type int,
>>   some_other_attr text
>>
>> PRIMARY KEY (user_id, ends)
>> );
>> which fits tactic queries:
>> select smth from user_plans where user_id='xxx' and end_ts > now()
>>
>> Then I create a second table user_plans_daily (or weekly, monthly)
>>
>> with DDL:
>> CREATE TABLE user_plans_daily/weekly/monthly (
>>   ymd int,
>>   user_id timeuuid,
>>   event_ts timestamp,
>>   event_type int,
>>   some_other_attr text
>> )
>> PRIMARY KEY ((ymd, user_id), event_ts )
>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>
>> And this table is good for answering strategic questions:
>> select * from
>> user_plans_daily/weekly/monthly
>> where ymd in ()
>> And I should avoid long condition inside IN clause, that is why you
>> suggest me to create bigger bucket, correct?
>>
>>
>> 2015-04-04 20:00 GMT+02:00 Jack Krupansky :
>>
>>> It sounds like your time bucket should be a month, but it depends on the
>>> amount of data per user per day and your main query range. Within the
>>> partition you can then query for a range of days.
>>>
>>> Yes, all of the rows within a partition are stored on one physical node
>>> as well as the replica nodes.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak  wrote:
>>>
 >non-equal relation on a partition key is not supported
 Ok, can I generate select query:
 select some_attributes
 from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or
 20150331

 > The partition key determines which node can satisfy the query
 So you mean that all rows with the same *(ymd, user_id)* would be on
 one physical node?


 2015-04-04 16:38 GMT+02:00 Jack Krupansky :

> Unfortunately, a non-equal relation on a partition key is not
> supported. You would need to bucket by some larger unit, like a month, and
> then use the date/time as a clustering column for the row key. Then you
> could query within the partition. The partition key determines which node
> can satisfy the query. Designing your partition key judiciously is the key
> (haha!) to performant Cassandra applications.
>
> -- Jack Krupansky
>
> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
> serega.shey...@gmail.com> wrote:
>
>> Hi, we plan to have 10^8 users and each user could generate 10 events
>> per day.
>> So we have:
>> 10^8 records per day
>> 10^8*30 records per month.
>> Our timewindow analysis could be from 1 to 6 months.
>>
>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts
>> of event.
>>
>> So you suggest this approach:
>> *PRIMARY KEY ((ymd, user_id), event_ts ) *
>> *WITH CLUSTERING ORDER BY (**event_ts*
>> * DESC);*
>>
>> where ymd=20150102 (the Second of January)?
>>
>> *What happens to writes:*
>> SSTable with past days (ymd < current_day) stay untouched and don't
>> take part in Compaction process since there are no changes to them?
>>
>> What happens to read:
>> I issue query:
>> select some_attributes
>> from events where ymd >= 20150101 and ymd < 20150301
>> Does Cassandra skip SSTables which don't have ymd in specified range
>> and give me a kind of partition elimination, like in traditional DBs?
>>
>>
>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky :
>>
>>> It depends on the actual number of events per user, but simply
>>> bucketing the partition key can give you the same effect - clustering 
>>> rows
>>> by time range. A composite partition key could be comprised of the user
>>> name and the date.
>>>
>>> It also depends on the data rate - is it many events per day or just
>>> a few events per week, or over what time period. You need to be careful 
>>> -
>>> you don't want your Cassandra partitions to be too big (millions of 
>>> rows)
>>> or too small (just a few or even one row per partition.)
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <
>>> serega.shey...@gmail.com> wrote:
>>>
 Hi, I switched from HBase to Cassandra and try to find problem
 solution for timeseries analysis on top Cassandra.
 I have