Re: How much disk is needed to compact Leveled compaction?
I am only using LeveledCompactionStrategy, and as I described in my original mail, I don't understand why Cassandra is complaining that it cannot compact when I have more than 40% free disk space.

On 07 Apr 2015, at 01:10, Bryan Holladay <holla...@longsight.com> wrote:

> What other storage-impacting commands or nuances do you have to consider when you switch to leveled compaction? For instance, nodetool cleanup says "Running the nodetool cleanup command causes a temporary increase in disk space usage proportional to the size of your largest SSTable." Are SSTables smaller with leveled compaction, making this a non-issue? How can you determine what the new threshold for storage space is?
>
> Thanks, Bryan
>
> On Apr 6, 2015 6:19 PM, "DuyHai Doan" <doanduy...@gmail.com> wrote:
>
>> If you have SSDs, you may be able to afford switching to the leveled compaction strategy, which requires much less than 50% of the current dataset as free space.
>>
>> On 5 Apr 2015 19:04, "daemeon reiydelle" <daeme...@gmail.com> wrote:
>>
>>> You appear to have multiple java binaries in your path. That needs to be resolved.
>>>
>>> sent from my mobile
>>> Daemeon C.M. Reiydelle
>>> USA 415.501.0198
>>> London +44.0.20.8144.9872
>>>
>>> On Apr 5, 2015 1:40 AM, "Jean Tremblay" <jean.tremb...@zen-innovations.com> wrote:
>>>
>>>> Hi,
>>>> I have a cluster of 5 nodes. We use Cassandra 2.1.3.
>>>>
>>>> The 5 nodes use about 50-57% of the 1T SSD.
>>>> One node managed to compact all its data. During one compaction this node used almost 100% of the drive. The other nodes refuse to continue compaction, claiming that there is not enough disk space.
>>>>
>>>> From the documentation, LeveledCompactionStrategy should be able to compact my data, at least as I understand it:
>>>>
>>>> <Size-tiered compaction requires as much free disk space for compaction as the size of the largest column family. Leveled compaction needs much less space for compaction, only 10 * sstable_size_in_mb. However, even if you're using leveled compaction, you should leave much more free disk space available than this to accommodate streaming, repair, and snapshots, which can easily use 10GB or more of disk space. Furthermore, disk performance tends to decline after 80 to 90% of the disk space is used, so don't push the boundaries.>
>>>>
>>>> This is the disk usage. Node 4 is the only one that could compact everything:
>>>>
>>>> node0: /dev/disk1  931Gi  534Gi  396Gi  57%  /
>>>> node1: /dev/disk1  931Gi  513Gi  417Gi  55%  /
>>>> node2: /dev/disk1  931Gi  526Gi  404Gi  57%  /
>>>> node3: /dev/disk1  931Gi  507Gi  424Gi  54%  /
>>>> node4: /dev/disk1  931Gi  475Gi  456Gi  51%  /
>>>>
>>>> When I try to compact the other ones I get this:
>>>>
>>>> objc[18698]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined.
>>>> error: Not enough space for compaction, estimated sstables = 2894, expected write size = 485616651726
>>>> -- StackTrace --
>>>> java.lang.RuntimeException: Not enough space for compaction, estimated sstables = 2894, expected write size = 485616651726
>>>>     at org.apache.cassandra.db.compaction.CompactionTask.checkAvailableDiskSpace(CompactionTask.java:293)
>>>>     at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:127)
>>>>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>>>>     at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:76)
>>>>     at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
>>>>     at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:512)
>>>>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>>>>
>>>> I did not set sstable_size_in_mb; I use the 160MB default.
>>>>
>>>> Is it normal that it needs so much disk space during compaction? What would be the best solution to overcome this problem?
>>>>
>>>> Thanks for your help
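The numbers in the error line up with the df output: before a compaction starts, the expected write size is checked against free disk space. A quick sanity check in plain arithmetic (not Cassandra code):

```python
# Figures taken from the error message and df output quoted above.
expected_write_bytes = 485_616_651_726            # "expected write size"
expected_write_gib = expected_write_bytes / 2**30
print(f"compaction needs ~{expected_write_gib:.1f} GiB")  # ~452.3 GiB

# Free space per node, from the df output (in GiB).
free_gib = {"node0": 396, "node1": 417, "node2": 404, "node3": 424, "node4": 456}
for node, free in sorted(free_gib.items()):
    verdict = "enough" if free > expected_write_gib else "not enough"
    print(f"{node}: {free} GiB free -> {verdict}")
# Only node4 (456 GiB free) clears the ~452 GiB bar, matching the report
# that node4 was the only node able to compact everything.
```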
Re: log all the query statement
Hey Peter,

This is from the perspective of 2.0.13, but there should be something similar in your version. Can you enable debug logging for Cassandra and see if the log files have additional info? Depending on how early or late in your test you hit the error, you might also want to raise "maxBackupIndex" or "maxFileSize" to make sure you keep enough log files around.

anishek

On Thu, Apr 2, 2015 at 11:53 AM, 鄢来琼 wrote:

> Hi all,
>
> Cassandra 2.1.2 is used in my project, but some nodes go down after executing certain query statements.
> Could I configure Cassandra to log all the executed statements?
> Hopefully the log file can be used to identify the problem.
>
> Thanks.
>
> Peter
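For 2.0.x, the knobs mentioned above live in conf/log4j-server.properties. A sketch with illustrative values (the property names are from the stock 2.0 config; the sizes are examples, not recommendations):

```properties
# conf/log4j-server.properties (Cassandra 2.0.x) -- illustrative values
# Raise the root logger to DEBUG to capture statement-level detail.
log4j.rootLogger=DEBUG,stdout,R

# Keep more / larger rolled files so the failure window isn't rotated away.
log4j.appender.R.maxFileSize=20MB
log4j.appender.R.maxBackupIndex=50
```

On 2.1.x (Peter's version) the same ideas apply, but logging is configured in conf/logback.xml instead (the rolling appender's maxFileSize and the FixedWindowRollingPolicy's maxIndex). Query activity can also be sampled into system_traces with `nodetool settraceprobability`.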
DataStax object mapping and lightweight transactions
Hi, Does the latest DataStax Java driver (2.1.5) support lightweight transactions through object mapping? For example, if I set the write consistency level of the mapped class to SERIAL through an annotation, does the "save" operation then use a lightweight transaction instead of a normal write? Thanks, Sha Liu
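For context (independent of the mapper): in CQL, a lightweight transaction is expressed by an IF clause on the statement itself; the SERIAL consistency level governs the Paxos reads rather than turning a plain write into an LWT. A hypothetical example, with an invented users table:

```sql
-- Plain write: last-writer-wins, no Paxos round.
INSERT INTO users (user_id, email)
VALUES (123e4567-e89b-12d3-a456-426655440000, 'a@example.com');

-- Lightweight transaction: the IF clause is what triggers Paxos.
INSERT INTO users (user_id, email)
VALUES (123e4567-e89b-12d3-a456-426655440000, 'a@example.com')
IF NOT EXISTS;
```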
Re: How much disk is needed to compact Leveled compaction?
What other storage-impacting commands or nuances do you have to consider when you switch to leveled compaction? For instance, nodetool cleanup says "Running the nodetool cleanup command causes a temporary increase in disk space usage proportional to the size of your largest SSTable." Are SSTables smaller with leveled compaction, making this a non-issue? How can you determine what the new threshold for storage space is?

Thanks, Bryan

On Apr 6, 2015 6:19 PM, "DuyHai Doan" wrote:

> If you have SSDs, you may be able to afford switching to the leveled compaction strategy, which requires much less than 50% of the current dataset as free space.
Re: How much disk is needed to compact Leveled compaction?
I may have misunderstood, but it seems that he was already using LeveledCompaction.

On Tue, Apr 7, 2015 at 3:17 AM, DuyHai Doan wrote:

> If you have SSDs, you may be able to afford switching to the leveled compaction strategy, which requires much less than 50% of the current dataset as free space.
Re: How much disk is needed to compact Leveled compaction?
If you have SSDs, you may be able to afford switching to the leveled compaction strategy, which requires much less than 50% of the current dataset as free space.

On 5 Apr 2015 19:04, "daemeon reiydelle" wrote:

> You appear to have multiple java binaries in your path. That needs to be resolved.
Re: Timeseries analysis using Cassandra and partition by date period
Thank you, we'll look at that tool.

2015-04-06 12:30 GMT+02:00 Srinivasa T N :

> Comparison to OpenTSDB HBase
>
> For one, we do not use ids for strings. The string data (metric names and tags) is written to row keys and the appropriate indexes. Because Cassandra has much wider rows, there are far fewer keys written to the database. The space saved by using ids is minor, and by not using ids we avoid having to use any kind of locks across the cluster.
>
> As mentioned, Cassandra has wider rows. The default row size in OpenTSDB HBase is 1 hour; in Cassandra it is set to 3 weeks.
> http://kairosdb.github.io/kairosdocs/CassandraSchema.html
>
> On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak wrote:
>
>> Thanks, is it a kind of opentsdb?
>>
>> 2015-04-05 18:28 GMT+02:00 Kevin Burton :
>>
>>> > Hi, I switched from HBase to Cassandra and try to find a problem solution for timeseries analysis on top of Cassandra.
>>>
>>> Depending on what you're looking for, you might want to check out KairosDB.
>>>
>>> 0.95 beta2 just shipped yesterday as well, so you have good timing.
>>>
>>> https://github.com/kairosdb/kairosdb
>>>
>>> On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak wrote:
>>>
>>>> Okay, so bucketing by day/week/month is capacity-planning stuff and the actual question I want to ask.
>>>> As a conclusion: I have a table of events:
>>>>
>>>> CREATE TABLE user_plans (
>>>>   id timeuuid,
>>>>   user_id timeuuid,
>>>>   event_ts timestamp,
>>>>   event_type int,
>>>>   some_other_attr text,
>>>>   PRIMARY KEY (user_id, event_ts)
>>>> );
>>>>
>>>> which fits tactical queries:
>>>> select smth from user_plans where user_id='xxx' and event_ts > now()
>>>>
>>>> Then I create a second table user_plans_daily (or weekly, monthly) with this DDL:
>>>>
>>>> CREATE TABLE user_plans_daily/weekly/monthly (
>>>>   ymd int,
>>>>   user_id timeuuid,
>>>>   event_ts timestamp,
>>>>   event_type int,
>>>>   some_other_attr text,
>>>>   PRIMARY KEY ((ymd, user_id), event_ts)
>>>> ) WITH CLUSTERING ORDER BY (event_ts DESC);
>>>>
>>>> And this table is good for answering strategic questions:
>>>> select * from user_plans_daily/weekly/monthly where ymd in ()
>>>>
>>>> And I should avoid a long condition inside the IN clause, which is why you suggest I create a bigger bucket, correct?
>>>>
>>>> 2015-04-04 20:00 GMT+02:00 Jack Krupansky :
>>>>
>>>>> It sounds like your time bucket should be a month, but it depends on the amount of data per user per day and your main query range. Within the partition you can then query for a range of days.
>>>>>
>>>>> Yes, all of the rows within a partition are stored on one physical node as well as the replica nodes.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak wrote:
>>>>>
>>>>>> > non-equal relation on a partition key is not supported
>>>>>> OK, can I generate the select query:
>>>>>> select some_attributes from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or 20150331
>>>>>>
>>>>>> > The partition key determines which node can satisfy the query
>>>>>> So you mean that all rows with the same (ymd, user_id) would be on one physical node?
>>>>>>
>>>>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky :
>>>>>>
>>>>>>> Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak wrote:
>>>>>>>
>>>>>>>> Hi, we plan to have 10^8 users, and each user could generate 10 events per day.
>>>>>>>> So we have:
>>>>>>>> 10^8 records per day
>>>>>>>> 10^8 * 30 records per month.
>>>>>>>> Our time-window analysis could be from 1 to 6 months.
>>>>>>>>
>>>>>>>> Right now the PK is PRIMARY KEY (user_id, event_ts), where event_ts is the exact timestamp of the event.
>>>>>>>>
>>>>>>>> So you suggest this approach:
>>>>>>>> PRIMARY KEY ((ymd, user_id), event_ts)
>>>>>>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>>>>>> where ymd=20150102 (the second of January)?
>>>>>>>>
>>>>>>>> What happens to writes:
>>>>>>>> SSTables with past days (ymd < current_day) stay untouched and don't take part in the compaction process, since there are no changes to them?
>>>>>>>>
>>>>>>>> What happens to reads:
>>>>>>>> I issue the query:
>>>>>>>> select some_attributes from events where ymd >= 20150101 and ymd < 20150301
>>>>>>>> Does Cassandra skip SSTables which don't have ymd in the specified range and give me a kind of partition elimination, like in traditional DBs?
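To make the IN-clause concern concrete: with daily yyyymmdd buckets, the three-month query above needs roughly 90 values; with a monthly bucket it needs three. A small sketch (the helper name and yyyymm integer encoding are my own, not from the thread):

```python
from datetime import date

def month_buckets(start: date, end: date) -> list[int]:
    """Return yyyymm bucket values covering [start, end], one per month.

    Illustrative only: the thread's examples use daily yyyymmdd buckets
    (20150101 ... 20150331), which would need ~90 IN values for the
    same window.
    """
    buckets = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        buckets.append(y * 100 + m)
        m += 1
        if m == 13:          # roll over into the next year
            y, m = y + 1, 1
    return buckets

# The 3-month window from the quoted query (20150101 .. 20150301):
print(month_buckets(date(2015, 1, 1), date(2015, 3, 1)))  # [201501, 201502, 201503]
```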
Re: Timeseries analysis using Cassandra and partition by date period
Comparison to OpenTSDB HBase

For one, we do not use ids for strings. The string data (metric names and tags) is written to row keys and the appropriate indexes. Because Cassandra has much wider rows, there are far fewer keys written to the database. The space saved by using ids is minor, and by not using ids we avoid having to use any kind of locks across the cluster.

As mentioned, Cassandra has wider rows. The default row size in OpenTSDB HBase is 1 hour; in Cassandra it is set to 3 weeks.
http://kairosdb.github.io/kairosdocs/CassandraSchema.html

On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak wrote:

> Thanks, is it a kind of opentsdb?
Re: Timeseries analysis using Cassandra and partition by date period
Thanks, is it a kind of opentsdb?

2015-04-05 18:28 GMT+02:00 Kevin Burton :

> > Hi, I switched from HBase to Cassandra and try to find a problem solution for timeseries analysis on top of Cassandra.
>
> Depending on what you're looking for, you might want to check out KairosDB.
>
> 0.95 beta2 just shipped yesterday as well, so you have good timing.
>
> https://github.com/kairosdb/kairosdb
>
> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak wrote:
>
>> Hi, we plan to have 10^8 users, and each user could generate 10 events per day.
>> So we have:
>> 10^8 records per day
>> 10^8 * 30 records per month.
>> Our time-window analysis could be from 1 to 6 months.
>>
>> Right now the PK is PRIMARY KEY (user_id, event_ts), where event_ts is the exact timestamp of the event.
>>
>> So you suggest this approach:
>> PRIMARY KEY ((ymd, user_id), event_ts)
>> WITH CLUSTERING ORDER BY (event_ts DESC);
>> where ymd=20150102 (the second of January)?
>>
>> What happens to writes:
>> SSTables with past days (ymd < current_day) stay untouched and don't take part in the compaction process, since there are no changes to them?
>>
>> What happens to reads:
>> I issue the query:
>> select some_attributes from events where ymd >= 20150101 and ymd < 20150301
>> Does Cassandra skip SSTables which don't have ymd in the specified range and give me a kind of partition elimination, like in traditional DBs?
>>
>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky :
>>
>>> It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.
>>>
>>> It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak wrote:
>>>
>>>> Hi, I switched from HBase to Cassandra and try to find a problem solution for timeseries analysis on top of Cassandra. I have
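Jack's "not too big, not too small" caution can be checked against the numbers quoted in this thread (10 events per user per day). Rough arithmetic only, not Cassandra code:

```python
# Rows per (bucket, user_id) partition under different bucket sizes,
# using the thread's estimate of ~10 events per user per day.
events_per_user_per_day = 10

rows_per_daily_partition = events_per_user_per_day * 1     # ymd = one day
rows_per_monthly_partition = events_per_user_per_day * 30  # ymd = one month

# A 6-month analysis window touches 6 monthly partitions per user:
rows_read_per_user = rows_per_monthly_partition * 6

print(rows_per_daily_partition)    # 10   -> near the "too small" end
print(rows_per_monthly_partition)  # 300  -> comfortably mid-sized
print(rows_read_per_user)          # 1800 rows across 6 partitions
```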