Re: Spark and intermediate results
I know the connector, but having the connector only means it will take *input* data from Cassandra, right? What about intermediate results? If it stores intermediate results in Cassandra, could you please clarify how data locality is handled? Will it store them in another keyspace? I could not find any documentation about it...

From: user@cassandra.apache.org
Subject: Re: Spark and intermediate results

You can run Spark against your Cassandra data directly without using a shared filesystem.
https://github.com/datastax/spark-cassandra-connector

On Fri, Oct 9, 2015 at 6:09 AM Marcelo Valle (BLOOMBERG/ LONDON) wrote:

> Hello, I saw this nice link from an event:
> http://www.datastax.com/dev/blog/zen-art-spark-maintenance?mkt_tok=3RkMMJWWfF9wsRogvqzIZKXonjHpfsX56%2B8uX6GylMI%2F0ER3fOvrPUfGjI4GTcdmI%2BSLDwEYGJlv6SgFSrXMMblswLgIXBY%3D
>
> I would like to test using Spark to perform some operations on a column family; my objective is reading from CF A and writing the output of my M/R job to CF B. That said, I've read this in Spark's FAQ (http://spark.apache.org/faq.html):
>
> "Do I need Hadoop to run Spark? No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode."
>
> The question I ask is: if I don't want to have an HDFS installation just to run Spark on Cassandra, is my only option to have this NFS mounted over the network? It doesn't seem smart to me to have something like NFS to store Spark files, as it would probably affect performance, and at the same time I wouldn't like to have an additional HDFS cluster just to run jobs on Cassandra. Is there a way of using Cassandra itself as this "some form of shared file system"?
>
> -Marcelo

<< ideas don't deserve respect >>
Spark and intermediate results
Hello, I saw this nice link from an event:
http://www.datastax.com/dev/blog/zen-art-spark-maintenance?mkt_tok=3RkMMJWWfF9wsRogvqzIZKXonjHpfsX56%2B8uX6GylMI%2F0ER3fOvrPUfGjI4GTcdmI%2BSLDwEYGJlv6SgFSrXMMblswLgIXBY%3D

I would like to test using Spark to perform some operations on a column family; my objective is reading from CF A and writing the output of my M/R job to CF B. That said, I've read this in Spark's FAQ (http://spark.apache.org/faq.html):

"Do I need Hadoop to run Spark? No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode."

The question I ask is: if I don't want to have an HDFS installation just to run Spark on Cassandra, is my only option to have this NFS mounted over the network? It doesn't seem smart to me to have something like NFS to store Spark files, as it would probably affect performance, and at the same time I wouldn't like to have an additional HDFS cluster just to run jobs on Cassandra. Is there a way of using Cassandra itself as this "some form of shared file system"?

-Marcelo

<< ideas don't deserve respect >>
Re: ScyllaDB, a new open source, Cassandra-compatible NoSQL
I think there is a very important point about ScyllaDB - latency. Performance can be an important requirement, but the fact that ScyllaDB is written in C++ and uses lock-free algorithms internally means it should have lower latency than Cassandra, which enables its use for a wider range of applications. It seems like a huge milestone for the Cassandra community - congratulations!

From: user@cassandra.apache.org
Subject: Re: ScyllaDB, a new open source, Cassandra-compatible NoSQL

Looking at the architecture and what ScyllaDB does, I'm not surprised they got a 10x improvement. Seastar skips a lot of the overhead of copying data and gives them CPU core affinity. Anyone who has listened to Cliff Click talk about cache misses, locks and other low-level issues would recognize the huge boost in performance when many of those bottlenecks are removed. Using an actor model to avoid locks doesn't hurt either.

On Tue, Sep 22, 2015 at 5:20 PM, Minh Do wrote:

At first glance at their GitHub, it looks like they re-implemented Cassandra in C++. 90% of the components in Cassandra are in ScyllaDB, i.e. compaction, repair, CQL, gossip, SSTables. With C++, I believe this helps performance to some extent, up to the point when compaction has not run yet. Then disk IO becomes the dominant factor in the performance measurement, as the more traffic a node takes, the more performance degrades across the cluster. Also, they only support the Thrift protocol, so it won't work with the Java Driver and the new asynchronous protocol. I doubt their tests are truly fair ones.

On Tue, Sep 22, 2015 at 2:13 PM, Venkatesh Arivazhagan wrote:

I came across this article: zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/
Tzach, I would love to know/understand more about ScyllaDB too. Also, the benchmark seems to have only 1 DB server. Do you have benchmark numbers where more than 1 DB server was involved? :)

On Tue, Sep 22, 2015 at 1:40 PM, Sachin Nikam wrote:

Tzach, can you point to any documentation on the ScyllaDB site which talks about how/why ScyllaDB performs better than Cassandra while using the same architecture?
Regards
Sachin

On Tue, Sep 22, 2015 at 9:18 AM, Tzach Livyatan wrote:

Hello Cassandra users,
We are pleased to announce a new member of the Cassandra ecosystem - ScyllaDB. ScyllaDB is a new, open source, Cassandra-compatible NoSQL data store, written with the goal of delivering superior performance and consistently low latency. Today, ScyllaDB runs 1M tps per server with sub-1ms latency. ScyllaDB supports CQL, is compatible with Cassandra drivers, and works out of the box with Cassandra tools like cqlsh, the Spark connector, nodetool and cassandra-stress. ScyllaDB is a drop-in replacement for the Cassandra server-side packages. Scylla is implemented using the new shared-nothing Seastar framework for extreme performance on modern multicore hardware, and the Data Plane Development Kit (DPDK) for high-speed, low-latency networking.
Try Scylla now - http://www.scylladb.com
We will be at Cassandra Summit 2015; you are welcome to visit our booth to hear more and see a demo. Avi Kivity, our CTO, will host a session on Scylla on Thursday, 1:50 PM - 2:30 PM in rooms M1 - M3.
Regards
Tzach
scylladb

<< ideas don't deserve respect >>
Re: how many rows can one partion key hold?
> When one partition's data is extremely large, will the write/read be slow?

This is actually a good question. If a partition has nearly 2 billion rows, will writes or reads get too slow? My understanding is that they shouldn't, as data is indexed inside a partition, and when you read or write you are doing a binary search, so the operation should take log(n) time. However, my practical experience tells me it can be a problem, depending on the number of reads you do and how you do them. If your binary search takes a few more steps per read, then over 1 billion reads the extra cost can be considerable. Also, this search could hit disk, as it depends a lot on how your cache is configured. Having a small amount of data per partition could be a Cassandra anti-pattern too, mainly if your reads can go across many partitions. I think there is no single correct answer here; it depends on your data and on your application, IMHO.
-Marcelo

From: user@cassandra.apache.org
Subject: Re: how many rows can one partion key hold?

you might want to read here http://wiki.apache.org/cassandra/CassandraLimitations

jason

On Fri, Feb 27, 2015 at 2:44 PM, wateray wrote:

> Hi all,
> My team is using Cassandra as our database. We have one question, as below. As we know, rows with the same partition key will be stored on the same node. But how many rows can one partition key hold? What does it depend on? The node's volume, the partition data size, or the number of rows in the partition? When one partition's data is extremely large, will the write/read be slow? Can anyone show me some existing use cases? thanks!
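The log(n) claim above is easy to sanity-check with back-of-the-envelope arithmetic; a minimal Python sketch (the row counts are illustrative, not from the thread):

```python
import math

def binary_search_steps(n_rows: int) -> int:
    """Worst-case probes for a binary search over n_rows sorted entries."""
    return max(1, math.ceil(math.log2(n_rows)))

# Even near the ~2 billion cells-per-partition limit, a single lookup is
# only ~31 probes -- cost grows with log(n), not n. The practical caveat
# is that each probe may be a disk seek if the partition is not cached.
print(binary_search_steps(2_000_000_000))  # 31
print(binary_search_steps(1_000_000))      # 20
```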
Re: Unexplained query slowness
I didn't know about this cfhistograms thing, very nice!

From: user@cassandra.apache.org
Subject: Re: Unexplained query slowness

Have a look at your column family histograms (nodetool cfhistograms, IIRC). If you notice things like a very long tail, a double hump, or outliers, it would indicate something wrong with your data model, or that you have a hot partition key (or keys). Also, looking at your 99th and 95th percentile latencies will just hide these occasional high-latency reads, as they fall outside those percentiles. If you are running a stock config, first rule out your data model, then investigate things like disk latency and noisy neighbours (if you are on VMs / in the cloud).

On 26 February 2015 at 03:01, Marcelo Valle (BLOOMBERG/ LONDON) wrote:

> I am sorry if it's too basic and you already looked at that, but the first thing I would ask about would be the data model. What data model are you using (how is your data partitioned)? What queries are you running? If you are using ALLOW FILTERING, for instance, it will be very easy to say why it's slow. Most times people get slow queries in Cassandra, they are using the wrong data model.
> []s
>
> From: user@cassandra.apache.org
> Subject: Re:Unexplained query slowness
>
> Our Cassandra database just rolled to live last night. I'm looking at our query performance, and overall it is very good, but perhaps 1 in 10,000 queries takes several hundred milliseconds (up to a full second). I've grepped for GC in the system.log on all nodes, and there aren't any recent GC events. I'm executing ~500 queries per second, which produces negligible load and CPU utilization. I have very minimal writes (one every few minutes). The slow queries are across the board. There isn't one particular query that is slow. I'm running 2.0.12 with SSDs. I've got a 10-node cluster with RF=3. I have no idea where to even begin to look. Any thoughts on where to start would be greatly appreciated.
>
> Robert

--
Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | (650) 284 9692
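The point about p95/p99 hiding 1-in-10,000 slow reads can be illustrated with a small Python sketch (the latency numbers are made up, not from Robert's cluster):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# 10,000 reads at ~2 ms plus one 800 ms outlier, mirroring the 1-in-10,000
# pattern: the outlier sits beyond the p99 index, so p99 never sees it.
latencies = [2.0] * 10_000 + [800.0]
print(percentile(latencies, 99))  # 2.0   -> p99 hides the slow read
print(max(latencies))             # 800.0 -> only the max exposes it
```

This is why looking at histograms (long tails, double humps) or the max latency is more revealing than percentile summaries for rare outliers.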
Re:Unexplained query slowness
I am sorry if it's too basic and you already looked at that, but the first thing I would ask about would be the data model. What data model are you using (how is your data partitioned)? What queries are you running? If you are using ALLOW FILTERING, for instance, it will be very easy to say why it's slow. Most times people get slow queries in Cassandra, they are using the wrong data model.
[]s

From: user@cassandra.apache.org
Subject: Re:Unexplained query slowness

Our Cassandra database just rolled to live last night. I'm looking at our query performance, and overall it is very good, but perhaps 1 in 10,000 queries takes several hundred milliseconds (up to a full second). I've grepped for GC in the system.log on all nodes, and there aren't any recent GC events. I'm executing ~500 queries per second, which produces negligible load and CPU utilization. I have very minimal writes (one every few minutes). The slow queries are across the board. There isn't one particular query that is slow. I'm running 2.0.12 with SSDs. I've got a 10-node cluster with RF=3. I have no idea where to even begin to look. Any thoughts on where to start would be greatly appreciated.

Robert
Re: Cassandra Read Timeout
I am sorry, not sure if I will be able to help you. I am not familiar with super columns; I would tell you to try to get rid of them as soon as possible. Maybe someone else on the list can help you. Anyway, it really seems you have a cell with a very large amount of data, and the request might be taking longer to complete because of that amount, but I also don't understand why the requests on the same row with a different SC, and on the same SC with a different row, would work.
[]s

From: oifa.yul...@gmail.com
Subject: Re: Cassandra Read Timeout

Hello
I am running 1.2.19
Best regards
Yulian Oifa

On Tue, Feb 24, 2015 at 6:57 PM, Marcelo Valle (BLOOMBERG/ LONDON) wrote:

> Super column? Out of curiosity, which Cassandra version are you running?
>
> From: user@cassandra.apache.org
> Subject: Re: Cassandra Read Timeout
>
> Hello
> The structure is the same; the CFs are super column CFs, where the key is a long (a timestamp to partition the index, so every 11 days a new row is created), the super column is an int32, and the columns/values are timeuuids. I am running the same queries, getting a reversed slice by row key and super column. The number of reads is relatively high on the second CF since I have been testing it for several hours already; most of the time there are no read requests on either of them, only writes. There is at most 1 read request every 20-30 seconds, so it should not create load. There are also no reads pending in tpstats (0 before and 1 after). Please also note that queries on a different row with the same super column, and on the same row with a different super column, are working, and if I am not mistaken Cassandra loads the complete row, including all super columns, into memory (so either any request to this row should fail if this were a memory problem, or none...).
> Best regards
> Yulian Oifa
Re: Cassandra Read Timeout
Super column? Out of curiosity, which Cassandra version are you running?

From: user@cassandra.apache.org
Subject: Re: Cassandra Read Timeout

Hello
The structure is the same; the CFs are super column CFs, where the key is a long (a timestamp to partition the index, so every 11 days a new row is created), the super column is an int32, and the columns/values are timeuuids. I am running the same queries, getting a reversed slice by row key and super column. The number of reads is relatively high on the second CF since I have been testing it for several hours already; most of the time there are no read requests on either of them, only writes. There is at most 1 read request every 20-30 seconds, so it should not create load. There are also no reads pending in tpstats (0 before and 1 after). Please also note that queries on a different row with the same super column, and on the same row with a different super column, are working, and if I am not mistaken Cassandra loads the complete row, including all super columns, into memory (so either any request to this row should fail if this were a memory problem, or none...).
Best regards
Yulian Oifa
Re: Cassandra Read Timeout
Indeed, I thought something odd could be happening to your cluster, but it seems it's working fine and the request is just taking too long to complete. I noticed from your cfstats that the read count was about 10 on the first CF, and on the second one it was about 1000... Would you be doing many more reads on the second one? If the load is higher, it could justify the timeout. Do both CFs have the same data model? Are you running exactly the same queries?
Best regards, Marcelo.

From: user@cassandra.apache.org
Subject: Re: Cassandra Read Timeout

Hello
TP STATS before request:

Pool Name               Active  Pending  Completed  Blocked  All time blocked
ReadStage                    0        0    7592835        0                 0
RequestResponseStage         0        0          0        0                 0
MutationStage                0        0  215980736        0                 0
ReadRepairStage              0        0          0        0                 0
ReplicateOnWriteStage        0        0          0        0                 0
GossipStage                  0        0          0        0                 0
AntiEntropyStage             0        0          0        0                 0
MigrationStage               0        0         28        0                 0
MemoryMeter                  0        0        474        0                 0
MemtablePostFlusher          0        0      32845        0                 0
FlushWriter                  0        0       4013        0              2239
MiscStage                    0        0          0        0                 0
PendingRangeCalculator       0        0          1        0                 0
commitlog_archiver           0        0          0        0                 0
InternalResponseStage        0        0          0        0                 0
HintedHandoff                0        0          0        0                 0

Message type      Dropped
RANGE_SLICE             0
READ_REPAIR             0
BINARY                  0
READ                    0
MUTATION                0
_TRACE                  0
REQUEST_RESPONSE        0
COUNTER_MUTATION        0

TP STATS after request:

Pool Name               Active  Pending  Completed  Blocked  All time blocked
ReadStage                    1        1    7592942        0                 0
RequestResponseStage         0        0          0        0                 0
MutationStage                0        0  215983339        0                 0
ReadRepairStage              0        0          0        0                 0
ReplicateOnWriteStage        0        0          0        0                 0
GossipStage                  0        0          0        0                 0
AntiEntropyStage             0        0          0        0                 0
MigrationStage               0        0         28        0                 0
MemoryMeter                  0        0        474        0                 0
MemtablePostFlusher          0        0      32845        0                 0
FlushWriter                  0        0       4013        0              2239
MiscStage                    0        0          0        0                 0
PendingRangeCalculator       0        0          1        0                 0
commitlog_archiver           0        0          0        0                 0
InternalResponseStage        0        0          0        0                 0
HintedHandoff                0        0          0        0                 0

Message type      Dropped
RANGE_SLICE             0
READ_REPAIR             0
BINARY                  0
READ                    0
MUTATION                0
_TRACE                  0
REQUEST_RESPONSE        0
COUNTER_MUTATION        0

The only items that changed are ReadStage (Completed increased by 107, plus 1 Active and 1 Pending) and MutationStage, whose Completed count increased by 2603. Please note that the system is writing all the time in batches (each second, 2 servers write one batch each), so I don't see anything special in these numbers.
Best regards
Yulian Oifa
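The deltas between the two snapshots can be checked mechanically from the Completed columns; a quick sketch:

```python
# Completed counts taken from the two tpstats snapshots above.
before = {"ReadStage": 7_592_835, "MutationStage": 215_980_736}
after = {"ReadStage": 7_592_942, "MutationStage": 215_983_339}

# Per-pool growth between the snapshots; every other pool was unchanged.
deltas = {pool: after[pool] - before[pool] for pool in before}
print(deltas)  # {'ReadStage': 107, 'MutationStage': 2603}
```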
Re:Cassandra Read Timeout
Yulian, maybe other people have other clues, but I think if you could monitor the behavior in tpstats after the activity "Seeking to partition beginning in data file", it could help to find the problem. Which type of thread is getting stuck? Do you see any number increasing continuously during the request?
Best regards, Marcelo.

From: user@cassandra.apache.org
Subject: Re:Cassandra Read Timeout

Hello to all
I have a single-node Cassandra on Amazon EC2. Currently I am having a read timeout problem on a single CF, single row. The row size is around 190MB. There are bigger rows with a similar structure (they are index rows, which actually store keys) and everything is working fine on them; everything also works fine on other rows of this CF. Table data from cfstats (the first table has bigger rows but works fine, whereas the second has the timeout):

Column Family: pendindexes
  SSTable count: 5
  Space used (live): 462298352
  Space used (total): 462306752
  SSTable Compression Ratio: 0.3511107495795905
  Number of Keys (estimate): 640
  Memtable Columns Count: 63339
  Memtable Data Size: 12328802
  Memtable Switch Count: 78
  Read Count: 10
  Read Latency: NaN ms.
  Write Count: 1530113
  Write Latency: 0.022 ms.
  Pending Tasks: 0
  Bloom Filter False Positives: 0
  Bloom Filter False Ratio: 0.0
  Bloom Filter Space Used: 3920
  Compacted row minimum size: 73
  Compacted row maximum size: 223875792
  Compacted row mean size: 42694982
  Average live cells per slice (last five minutes): 21.0
  Average tombstones per slice (last five minutes): 0.0

Column Family: statuspindexes
  SSTable count: 1
  Space used (live): 99602136
  Space used (total): 99609360
  SSTable Compression Ratio: 0.34278775390997873
  Number of Keys (estimate): 128
  Memtable Columns Count: 6250
  Memtable Data Size: 6061097
  Memtable Switch Count: 65
  Read Count: 1000
  Read Latency: NaN ms.
  Write Count: 1193142
  Write Latency: 3.616 ms.
  Pending Tasks: 0
  Bloom Filter False Positives: 0
  Bloom Filter False Ratio: 0.0
  Bloom Filter Space Used: 656
  Compacted row minimum size: 180
  Compacted row maximum size: 186563160
  Compacted row mean size: 63225562
  Average live cells per slice (last five minutes): 0.0
  Average tombstones per slice (last five minutes): 0.0

I have tried to debug it with cql; this is what I get:

 activity                                                                                        | timestamp    | source       | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+--------------+----------------
 execute_cql3_query                                                                              | 15:39:53,120 | 172.31.6.173 |              0
 Parsing Select * from statuspindexes LIMIT 1;                                                   | 15:39:53,120 | 172.31.6.173 |            875
 Preparing statement                                                                             | 15:39:53,121 | 172.31.6.173 |           1643
 Determining replicas to query                                                                   | 15:39:53,121 | 172.31.6.173 |           1740
 Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 15:39:53,122 | 172.31.6.173 |           2581
 Seeking to partition beginning in data file                                                     | 15:39:53,123 | 172.31.6.173 |           3118
 Timed out; received 0 of 1 responses for range 2 of 2                                           | 15:40:03,121 | 172.31.6.173 |       10001370
 Request complete                                                                                | 15:40:03,121 | 172.31.6.173 |       10001513

I have executed compaction on that CF. What could lead to this problem?
Best regards
Yulian Oifa
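The source_elapsed column in the trace is in microseconds, so the gap between the seek row and the timeout row pins down how long the scan stalled; a quick check (the resulting 10 s is consistent with a 10-second server-side request timeout, assuming default settings, though the thread does not state which timeout fired):

```python
# source_elapsed values (microseconds) from the two relevant trace rows.
seek_us = 3_118          # "Seeking to partition beginning in data file"
timeout_us = 10_001_370  # "Timed out; received 0 of 1 responses"

# The request made no further progress after the seek and hung until the
# timeout window expired.
stalled_s = (timeout_us - seek_us) / 1_000_000
print(round(stalled_s, 1))  # 10.0
```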
Re:PySpark and Cassandra integration
I will try it for sure Frens, very nice! Thanks for sharing!

From: user@cassandra.apache.org
Subject: Re:PySpark and Cassandra integration

Hi all,
I wanted to let you know I've forked PySpark Cassandra at https://github.com/TargetHolding/pyspark-cassandra. Unfortunately the original code didn't work for me and I couldn't figure out how it could work. But it inspired! So I rewrote the majority of the project. The rewrite makes full use of https://github.com/datastax/spark-cassandra-connector and brings much of its goodness to PySpark! I hope some of you are able to put this to good use. Feedback, pull requests, etc. are more than welcome!
Best regards, Frens Jan
Re:designing table
My two cents: you could partition your data per date, and the second query would be easy. If you need to query ALL data for a client id it would be hard, though; querying the last 10 days for a client id could be easy, for instance. If you really need to query ALL, it would probably be better to create another CF and write to both; google for "Cassandra materialized view" in this case.
[]s

From: user@cassandra.apache.org
Subject: Re:designing table

I am trying to design a table in Cassandra in which I will have multiple JSON strings for a particular client id:

abc123 - jsonA
abc123 - jsonB
abcd12345 - jsonC

My query pattern is going to be:
1. Give me all JSON strings for a particular client id.
2. Give me all the client ids and JSON strings for a particular date.

What is the best way to design a table for this?
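The write-to-both-CFs idea can be sketched as a toy in-memory model (the table names and dates are illustrative, not from the thread): every insert goes to a client_id-partitioned "table" and a date-partitioned one, so each query pattern becomes a single-partition read.

```python
from collections import defaultdict

# Toy stand-ins for two column families; a real design would be two
# Cassandra tables, one with client_id as partition key and one with date.
json_by_client = defaultdict(list)
json_by_date = defaultdict(list)

def insert(client_id, date, json_doc):
    """Application-maintained 'materialized view': write to both CFs."""
    json_by_client[client_id].append(json_doc)
    json_by_date[date].append((client_id, json_doc))

insert("abc123", "2015-02-20", "jsonA")
insert("abc123", "2015-02-21", "jsonB")
insert("abcd12345", "2015-02-21", "jsonC")

# Query 1: all JSON for a client id -> one partition in the first CF.
print(json_by_client["abc123"])    # ['jsonA', 'jsonB']
# Query 2: all (client, json) for a date -> one partition in the second CF.
print(json_by_date["2015-02-21"])  # [('abc123', 'jsonB'), ('abcd12345', 'jsonC')]
```

The application pays the double write so that both access patterns stay cheap; this is the standard denormalization trade-off the reply alludes to.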
Re: Many pending compactions
Cassandra 2.1 comes with incremental repair; I haven't read the details myself: http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html
However, AFAIK, a full repair will rebuild all SSTables; that's why you should have more than 50% of disk space available on each node. Of course, it will also make sure data is replicated to the right nodes in the process.
[]s

From: user@cassandra.apache.org
Subject: Re: Many pending compactions

Can you explain to me what the correlation is between growing SSTables and repair? I was sure, until your mail, that repair was only to make data consistent between nodes.
Regards

On Wed, Feb 18, 2015 at 4:20 PM, Roni Balthazar wrote:

Which error are you getting when running repairs? You need to run repair on your nodes within gc_grace_seconds (eg: weekly), since they have data that is not read frequently. You can run "repair -pr" on all nodes. Since you do not have deletes, you will not have trouble with that. If you have deletes, it's better to increase gc_grace_seconds before the repair.
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
After the repair, try to run a "nodetool cleanup". Check if the number of SSTables goes down after that... Pending compactions must decrease as well...
Cheers,
Roni Balthazar

On Wed, Feb 18, 2015 at 12:39 PM, Ja Sam wrote:
> 1) We tried to run repairs but they usually do not succeed. But we had
> Leveled compaction before. Last week we ALTERed tables to STCS, because the guys
> from DataStax suggested that we should not use Leveled, and alter tables to
> STCS, because we don't have SSDs. After this change we did not run any
> repair. Anyway, I don't think it will change anything in the SSTable count - if I
> am wrong, please give me that information.
>
> 2) I did this. My tables are 99% write-only. It is an audit system.
>
> 3) Yes, I am using default values.
>
> 4) In both operations I am using LOCAL_QUORUM.
>
> I am almost sure that the READ timeouts happen because of too many SSTables.
> Anyway, firstly I would like to fix the many pending compactions. I still
> don't know how to speed them up.
>
> On Wed, Feb 18, 2015 at 2:49 PM, Roni Balthazar wrote:
>>
>> Are you running repairs within gc_grace_seconds? (default is 10 days)
>>
>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
>>
>> Double-check if you set cold_reads_to_omit to 0.0 on tables with STCS
>> that you do not read often.
>>
>> Are you using default values for the properties
>> min_compaction_threshold (4) and max_compaction_threshold (32)?
>>
>> Which Consistency Level are you using for read operations? Check that
>> you are not reading from DC_B due to your Replication Factor and CL.
>>
>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
>>
>> Cheers,
>> Roni Balthazar
>>
>> On Wed, Feb 18, 2015 at 11:07 AM, Ja Sam wrote:
>> > I don't have problems with DC_B (the replica); only in DC_A (my system
>> > writes only to it) do I have read timeouts.
>> >
>> > I checked the SSTable count in OpsCenter and I have:
>> > 1) in DC_A the same +-10% for the last week, with a small increase for
>> > the last 24h (it is more than 15000-2 SSTables, depending on the node)
>> > 2) in DC_B the last 24h shows up to a 50% decrease, which gives a nice
>> > prognosis. Now I have fewer than 1000 SSTables.
>> >
>> > What did you measure during system optimizations? Or do you have an idea
>> > what more I should check?
>> > 1) I looked at CPU idle (one node is 50% idle, the rest 70% idle)
>> > 2) Disk queue -> mostly it is near zero: avg 0.09. Sometimes there are
>> > spikes
>> > 3) system RAM usage is almost full
>> > 4) In Total Bytes Compacted most lines are below 3MB/s. For total DC_A
>> > it is less than 10MB/s; in DC_B it looks much better (avg is like 17MB/s)
>> >
>> > something else?
>> >
>> > On Wed, Feb 18, 2015 at 1:32 PM, Roni Balthazar wrote:
>> >>
>> >> Hi,
>> >>
>> >> You can check if the number of SSTables is decreasing. Look for the
>> >> "SSTable count" information for your tables using "nodetool cfstats".
>> >> The compaction history can be viewed using "nodetool compactionhistory".
>> >>
>> >> About the timeouts, check this out:
>> >> http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
>> >> Also try to run "nodetool tpstats" to see the thread statistics. It
>> >> can lead you to know if you are having performance problems. If you
>> >> are having too many pending tasks or dropped messages, maybe you will
>> >> need to tune your system (eg: driver's timeout, concurrent reads and
>> >> so on)
>> >>
>> >> Regards,
>> >>
>> >> Roni Balthazar
>> >>
>> >> On Wed, Feb 18, 2015 at 9:51 AM, Ja Sam wrote:
>> >> > Hi,
>> >> > Thanks for your "tip"; it looks like something changed - I still don't
>> >> > know if it is ok.
>> >> >
>> >> > My nodes started to do more compa
Re: best supported spark connector for Cassandra
For SQL queries on Cassandra I used to use Presto: https://prestodb.io/ It's a nice tool from FB and it seems to work well with Cassandra. You can use their JDBC driver with your favourite Java SQL tool. Inside my apps, I never needed to use SQL queries.
[]s

From: pavel.velik...@gmail.com
Subject: Re: best supported spark connector for Cassandra

Hi Marcelo, were you able to use the Spark SQL features of the Cassandra connector? I couldn't make a .jar that wouldn't conflict with the Spark SQL native .jar… So I ended up using only the basic features, and cannot use SQL queries.

On Feb 13, 2015, at 7:49 PM, Paulo Ricardo Motta Gomes wrote:

I used to use Calliope, which was really awesome before DataStax's native integration with Spark. Now I'm quite happy with the official DataStax Spark connector; it's very straightforward to use. I never tried to use these drivers with Java, though; I'd suggest you use them with Scala, which is the best option for writing Spark jobs.

On Fri, Feb 13, 2015 at 12:12 PM, Carlos Rolo wrote:

Not for sure ;) If you need Cassandra support I can forward you to someone to talk to at Pythian.
Regards,
Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Tel: 1649
www.pythian.com

On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) wrote:

Actually, I am not the one looking for support, but I thank you a lot anyway. From your message I guess the answer is yes, DataStax is not the only Cassandra vendor offering support and changing the official Cassandra source at this moment - is this right?

From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra

Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone at Stratio who can help you.

2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON):

Thanks for the hint Gaspar. Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0? I had an interest in knowing more about Stratio when I was working at a start-up. Now, at a blueship, it seems one of the hardest obstacles to using Cassandra in a project is the need for an area supporting it, and it seems people are especially concerned about how many vendors an open source solution has to provide support. This seems to be kind of an advantage of HBase, as there are many vendors supporting it, but I wonder if Stratio can be considered an alternative to DataStax regarding Cassandra support? It's not my call here to decide anything, but as part of the community it helps to have this business scenario clear. I could say Cassandra could be the best-fit technical solution for some projects, but sometimes non-technical factors are in the game, like this need for having more than one vendor available...

From: gmu...@stratio.com
Subject: Re: best supported spark connector for Cassandra

My suggestion is to use Java or Scala instead of Python. For Java/Scala, both the DataStax and Stratio drivers are valid and similar options. As far as I know they both take care of data locality and are not based on the Hadoop interface. The advantage of Stratio Deep is that it allows you to integrate Spark not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others as well. Stratio has forked Cassandra to include some additional features, such as Lucene-based secondary indexes. So the Stratio driver works fine with Apache Cassandra and also with their fork. You can find some examples of using Deep here: https://github.com/Stratio/deep-examples Please do not hesitate to contact us if you need some help with Stratio Deep.

2015-02-11 17:18 GMT+01:00 shahab:

I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated; it works with Spark 1.1.0. Hopefully a new version comes soon.
best, /Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) wrote:

I just finished a Scala course - a nice exercise to check what I learned :D Thanks for the answer!

From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector
Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336
Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig into their code to understand the logic.

On Wed, Fe
Re: best supported spark connector for Cassandra
Actually, I am not the one looking for support, but I thank you a lot anyway. But from your message I guess the answer is yes, Datastax is not the only Cassandra vendor offering support and changing official Cassandra source at this moment, is this right? From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio that can help you. 2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) : Thanks for the hint Gaspar. Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0? I had interest in knowing more about Stratio when I was working on a start up. Now, on a blueship, it seems one of the hardest obstacles to use Cassandra in a project is the need of an area supporting it, and it seems people are specially concerned about how many vendors an open source solution has to provide support. This seems to be kind of an advantage of HBase, as there are many vendors supporting it, but I wonder if Stratio can be considered an alternative to Datastax reggarding Cassandra support? It's not my call here to decide anything, but as part of the community it helps to have this business scenario clear. I could say Cassandra could be the best fit technical solution for some projects but sometimes non-technical factors are in the game, like this need for having more than one vendor available... From: gmu...@stratio.com Subject: Re: best supported spark connector for Cassandra My suggestion is to use Java or Scala instead of Python. For Java/Scala both the Datastax and Stratio drivers are valid and similar options. As far as I know they both take care about data locality and are not based on the Hadoop interface. The advantage of Stratio Deep is that allows you to integrate Spark not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others as well. 
Stratio has forked Cassandra to include some additional features such as Lucene based secondary indexes. So the Stratio driver works fine with Apache Cassandra and also with their fork. You can find some examples of using Deep here: https://github.com/Stratio/deep-examples Please if you need some help with Stratio Deep do not hesitate to contact us. 2015-02-11 17:18 GMT+01:00 shahab : I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated; it works with Spark 1.1.0. Hopefully a new version comes soon. best, /Shahab On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: I just finished a scala course, nice exercise to check what I learned :D Thanks for the answer! From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig through their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: Taking the opportunity Spark was being discussed in another thread, I decided to start a new one as I have interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_.
From what I can see, Spark is starting to be known as a "better Hadoop" and it seems the market is going this way now. I can also see I have many more options to decide how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I also have found python Cassandra support on Spark's repo, but it still seems experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra also, I am still a little confused about it. Question: which driver should I use, if I want to use Java? And which if I want to use python? I think the way Spark can integrate with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at...
Re: query by column size
There is no automatic indexing in Cassandra. There are secondary indexes, but they don't cover cases like this. You could use a solution like DSE to get data automatically indexed in Solr, on each node, as soon as it arrives. Then you could run such a query on Solr. If the query is allowed to be slow, you could run an MR job over all rows, filtering the ones you want. []s From: user@cassandra.apache.org Subject: Re: query by column size Greetings, I have one column family with 10 columns; in one of the columns we store xml/json. Is there a way I can query that column where size > 50kb? Assuming I have an index on that column. thanks CV.
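Since Cassandra can't index on value size, the "MR job over all rows" suggestion above boils down to a client-side (or per-task) filter. A minimal sketch of that filter step, assuming the rows arrive as dicts and the payload column is called `payload` (both names hypothetical):

```python
# Sketch: client-side filtering of rows whose payload column exceeds 50 KB.
# Cassandra cannot answer "WHERE size(col) > X" natively, so the size check
# runs in the client or inside a map task. Column name is a placeholder.

SIZE_THRESHOLD = 50 * 1024  # 50 KB

def rows_over_threshold(rows, column="payload", threshold=SIZE_THRESHOLD):
    """Yield rows whose given column is larger than `threshold` bytes."""
    for row in rows:
        value = row.get(column) or ""
        if len(value.encode("utf-8")) > threshold:
            yield row
```

In a Spark or Hadoop job, the same predicate would simply be the body of the map/filter function applied to each row of the full scan.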
Re: best supported spark connector for Cassandra
Thanks for the hint Gaspar. Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0? I had interest in knowing more about Stratio when I was working on a start up. Now, on a blueship, it seems one of the hardest obstacles to use Cassandra in a project is the need of an area supporting it, and it seems people are especially concerned about how many vendors an open source solution has to provide support. This seems to be kind of an advantage of HBase, as there are many vendors supporting it, but I wonder if Stratio can be considered an alternative to Datastax regarding Cassandra support? It's not my call here to decide anything, but as part of the community it helps to have this business scenario clear. I could say Cassandra could be the best fit technical solution for some projects, but sometimes non-technical factors are in the game, like this need for having more than one vendor available... From: gmu...@stratio.com Subject: Re: best supported spark connector for Cassandra My suggestion is to use Java or Scala instead of Python. For Java/Scala both the Datastax and Stratio drivers are valid and similar options. As far as I know they both take care of data locality and are not based on the Hadoop interface. The advantage of Stratio Deep is that it allows you to integrate Spark not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others as well. Stratio has forked Cassandra to include some additional features such as Lucene based secondary indexes. So the Stratio driver works fine with Apache Cassandra and also with their fork. You can find some examples of using Deep here: https://github.com/Stratio/deep-examples Please if you need some help with Stratio Deep do not hesitate to contact us. 2015-02-11 17:18 GMT+01:00 shahab : I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use!
The only problem is that it is a bit outdated; it works with Spark 1.1.0. Hopefully a new version comes soon. best, /Shahab On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: I just finished a scala course, nice exercise to check what I learned :D Thanks for the answer! From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig through their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: Taking the opportunity Spark was being discussed in another thread, I decided to start a new one as I have interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a "better Hadoop" and it seems the market is going this way now. I can also see I have many more options to decide how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat.
I have found this java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I also have found python Cassandra support on Spark's repo, but it still seems experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra also, I am still a little confused about it. Question: which driver should I use, if I want to use Java? And which if I want to use python? I think the way Spark can integrate with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using python and/or C++, but I wonder if it wouldn't pay off to use the java driver instead. Thanks in advance -- Gaspar Muñoz @gmunozsoria Vía de las dos Castillas, 33, Ática 4, 3ª Planta 28224 Pozuelo de Alarcón, Madrid Tel: +34 91 352 59 42 // @stratiobd
Re: How to speed up SELECT * query in Cassandra
Thanks Jirka! From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra Hi, here are some snippets of code in scala which should get you started. Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None => initialQuery(lowerLimit)
  }
  session.execute(query).all
}

private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}

private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: {}", "<" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ">")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency) yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency) yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName, columnKeyName, columnValueName, columnValueName, columnFamily,
    "%s", // template
    whereCondition, pageSize,
    if (cqlAllowFiltering) " allow filtering" else ""
  )
}

case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}

On 02/11/2015 02:21 PM, Ja Sam wrote: Your answer looks very promising. How do you calculate start and stop? On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky wrote: The fastest way I am aware of is to do the queries in parallel to multiple cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information, if the driver does not take this into account for you. Then, you can play with concurrency and batch size of a single query against one node. Basically, what you/driver should do is to transform the query into a series of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)". I will need to look up the actual code, but the idea should be clear :) Jirka H. On 02/11/2015 11:26 AM, Ja Sam wrote:
> Is there a simple way (or even a complicated one) how can I speed up
> SELECT * FROM [table] query?
> I need to get all rows from one table every day. I split tables, and
> create one for each day, but still query is quite slow (200 millions
> of records)
>
> I was thinking about running this query in parallel, but I don't know
> if it is possible
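The core of Jirka's `calculateRanges` is just slicing the partitioner's token space into equal sub-ranges and building one token-range query per slice. A minimal Python sketch of the same idea for the Murmur3 partitioner (table and key names are placeholders):

```python
# Sketch: split the Murmur3 token space [-2^63, 2^63 - 1] into equal
# half-open slices, one per concurrent query, and build a token-range
# CQL query for each slice. This mirrors the Scala calculateRanges above.

MURMUR3_MIN = -(2 ** 63)
MURMUR3_MAX = 2 ** 63 - 1

def token_ranges(concurrency):
    """Return `concurrency` (start, end) slices covering the token space."""
    span = MURMUR3_MAX - MURMUR3_MIN
    step = span // concurrency
    ranges = []
    for i in range(concurrency):
        start = MURMUR3_MIN + i * step
        # Last slice absorbs the rounding remainder so the space is covered.
        end = MURMUR3_MAX if i == concurrency - 1 else MURMUR3_MIN + (i + 1) * step
        ranges.append((start, end))
    return ranges

def range_query(start, end, table="mytable", pk="primary_key"):
    return ("SELECT * FROM %s WHERE token(%s) >= %d AND token(%s) < %d"
            % (table, pk, start, pk, end))
```

Each generated query can then be sent concurrently, ideally routed to a node owning that token slice so the coordinator does not have to forward it.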
Re: best supported spark connector for Cassandra
I just finished a scala course, nice exercise to check what I learned :D Thanks for the answer! From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig through their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: Taking the opportunity Spark was being discussed in another thread, I decided to start a new one as I have interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a "better Hadoop" and it seems the market is going this way now. I can also see I have many more options to decide how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I also have found python Cassandra support on Spark's repo, but it still seems experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra also, I am still a little confused about it.
Question: which driver should I use, if I want to use Java? And which if I want to use python? I think the way Spark can integrate with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using python and/or C++, but I wonder if it wouldn't pay off to use the java driver instead. Thanks in advance
best supported spark connector for Cassandra
Taking the opportunity Spark was being discussed in another thread, I decided to start a new one as I have interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a "better Hadoop" and it seems the market is going this way now. I can also see I have many more options to decide how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I also have found python Cassandra support on Spark's repo, but it still seems experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra also, I am still a little confused about it. Question: which driver should I use, if I want to use Java? And which if I want to use python? I think the way Spark can integrate with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using python and/or C++, but I wonder if it wouldn't pay off to use the java driver instead. Thanks in advance
Re: How to speed up SELECT * query in Cassandra
> cassandra makes a very poor data warehouse or long term time series store Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information, it's only not good for storing temporary data. Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. > The very nature of cassandra's distributed nature vs partitioning data on > hadoop makes spark on hdfs actually faster than on cassandra. I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with hadoop, is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part spark or hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results? From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra I use spark with cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always, why aren't you updating both places at once? For example, we use hadoop and cassandra in conjunction with each other, we use a message bus to store every event in both, aggregate in both, but only keep current data in cassandra (cassandra makes a very poor data warehouse or long term time series store) and then use services to process queries that merge data from hadoop and cassandra. Also, spark on hdfs gives more flexibility in terms of large datasets and performance.
The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra -- Colin Clark +1 612 859 6129 Skype colin.p.clark On Feb 11, 2015, at 4:49 AM, Jens Rantil wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter
Re: How to speed up SELECT * query in Cassandra
Look for the message "Re: Fastest way to map/parallel read all values in a table?" in the mailing list, it was recently discussed. You can have several parallel processes, each one reading a slice of the data, by splitting min/max murmur3 hash ranges. In the company I used to work for, we developed a system to run custom python processes on demand to process Cassandra data, among other things, to be able to do that. I hope it will be released as open source soon; it seems there are a lot of people always having this same problem. If you use Cassandra enterprise, you can use hive, AFAIK. A good idea would be running a hadoop or spark process over your cluster and doing the processing you want, but sometimes I think it might be a bit hard to achieve good results with that, mainly because these tools work fine but are "auto magic". It's hard to control where intermediate data will be stored, for example. From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split tables and create one for each day, but still the query is quite slow (200 millions of records) I was thinking about running this query in parallel, but I don't know if it is possible
Re: Adding more nodes causes performance problem
AFAIK, you were using RF 3 in a 3 node cluster, so all your nodes had all your data. When the number of nodes started to grow, this assumption stopped being true. I think Cassandra will scale linearly from 9 nodes on, but comparing against a situation where all your nodes hold all your data is not really fair, as in that situation Cassandra will behave, for reads, as a database with two extra replicas. I can be wrong, but this is my call. From: user@cassandra.apache.org Subject: Re: Adding more nodes causes performance problem I have a cluster with 3 nodes, the only keyspace is with replication factor of 3, the application reads/writes UUID-keyed data. I use CQL (cassandra-python), most writes are done by execute_async, most reads are done with consistency level of ONE; overall performance in this setup is better than I expected. Then I tested a 6-node cluster and a 9-node one. The performance (both read and write) was getting worse and worse. Roughly speaking, 6 nodes is about 2~3 times slower than 3 nodes, and 9 nodes is about 5~6 times slower than 3 nodes. All tests were done with the same data set, same test program, same client machines, multiple times. I'm running Cassandra 2.1.2 with default configuration. What I observed is that with 6 nodes and 9 nodes, the Cassandra servers were doing OK with IO, but CPU utilization was about 60%~70% higher than with 3 nodes. I'd like to get suggestions on how to troubleshoot this, as this is totally against what I read, that Cassandra scales linearly.
Re: Fastest way to map/parallel read all values in a table?
Just for the record, I was doing the exact same thing in an internal application at the start up I used to work for. We had the need to write custom code to process, in parallel, all rows of a column family. Normally we would use Spark for the job, but in our case the logic was a little more complicated, so we wrote custom code. What we did was to run N processes in M machines (N cores in each), each one processing tasks. The tasks were created by splitting the range -2^63 to 2^63 - 1 into N*M*10 tasks. Even if data was not completely evenly distributed along the tasks, no machines were idle, as when some task was completed another one was taken from the task pool. It was fast enough for us, but I am interested in knowing if there is a better way of doing it. For your specific case, here is a tool we released as open source which can be useful for simpler tests: https://github.com/s1mbi0se/cql_record_processor Also, I guess you probably know that, but I would consider using Spark for doing this. Best regards, Marcelo. From: user@cassandra.apache.org Subject: Re: Fastest way to map/parallel read all values in a table? What's the fastest way to map/parallel read all values in a table? Kind of like a mini map-only job. I'm doing this to compute stats across our entire corpus. What I did to begin with was use token() and then split it into the number of splits I needed. So I just took the total key range space, which is -2^63 to 2^63 - 1, and broke it into N parts. Then the queries come back as: select * from mytable where token(primaryKey) >= x and token(primaryKey) < y From reading on this list I thought this was the correct way to handle this problem. However, I'm seeing horrible performance doing this. After about 1% it just flat out locks up. Could it be that I need to randomize the token order so that it's not contiguous? Maybe it's all mapping on the first box to begin with.
-- Founder/CEO Spinn3r.com Location: San Francisco, CA blog: http://burtonator.wordpress.com … or check out my Google+ profile
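The task-pool scheme Marcelo describes (many more token slices than workers, so an uneven slice never leaves machines idle) can be sketched like this in Python. `process_range` is a hypothetical stand-in for the real per-slice work (e.g. running the token-range query and computing stats):

```python
# Sketch: cut the full Murmur3 token space into workers * 10 slices and
# let a fixed-size worker pool pull slices until the pool is drained.
# With fine-grained tasks, a worker finishing a sparse slice immediately
# picks up the next one instead of idling.

from concurrent.futures import ThreadPoolExecutor

TOKEN_MIN = -(2 ** 63)
TOKEN_MAX = 2 ** 63 - 1

def run_over_token_space(process_range, workers=8, tasks_per_worker=10):
    n_tasks = workers * tasks_per_worker
    step = (TOKEN_MAX - TOKEN_MIN) // n_tasks
    tasks = [
        (TOKEN_MIN + i * step,
         TOKEN_MAX if i == n_tasks - 1 else TOKEN_MIN + (i + 1) * step)
        for i in range(n_tasks)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map keeps result order; scheduling is pull-based, like a task pool
        return list(pool.map(lambda t: process_range(*t), tasks))
```

With N processes on M machines, the same idea applies per process; a shared queue (or a pre-partitioned task list) replaces the in-process executor.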
Re: to normalize or not to normalize - read penalty vs write penalty
Perfect Tyler. My feeling was leading me to this, but I wasn't being able to put it in words as you did. Thanks a lot for the message. From: user@cassandra.apache.org Subject: Re: to normalize or not to normalize - read penalty vs write penalty Okay. Let's assume with denormalization you have to do 1000 writes (and one read per user) and with normalization you have to do 1 write (and maybe 1000 reads for each user). If you execute the writes in the most optimal way (batched by partition, if applicable, and separate, concurrent requests per partition), I think it's reasonable to say you can do 1000 writes in 10 to 20ms. Doing 1000 reads is going to take longer. Exactly how long depends on your systems (SSDs or not, whether the data is cached, etc). But this is probably going to take at least 2x as long as the writes. So, with denormalization, it's 10 to 20ms for all users to see the change (with a median somewhere around 5 to 10ms). With normalization, all users *could* see the update almost immediately, because it's only one write. However, each of your users needs to read 1000 partitions, which takes, say, 20 to 50ms. So effectively, they won't see the changes for 20 to 50ms, unless they know to read the details for that exact alert. On Wed, Feb 4, 2015 at 11:57 AM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: I don't want to optimize for reads or writes, I want to optimize for having the smallest gap possible between the time I write and the time I read. []s From: user@cassandra.apache.org Subject: Re: to normalize or not to normalize - read penalty vs write penalty Roughly how often do you expect to update alerts? How often do you expect to read the alerts? I suspect you'll be doing 100x more reads (or more), in which case optimizing for reads is definitely the right choice.
On Wed, Feb 4, 2015 at 9:50 AM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: Hello everyone, I am thinking about the architecture of my application using Cassandra and I am asking myself if I should or shouldn't normalize an entity. I have users and alerts in my application and, for each user, several alerts. The first model which came into my mind was creating an "alerts" CF with user-id as part of the partition key. This way, I can have fast writes and my reads will be fast too, as I will always read per partition. However, I received a requirement later that made my life more complicated. Alerts can be shared by 1000s of users and alerts can change. I am building a real time app and if I change an alert, all users related to it should see the change. Suppose I want to keep things not normalized - whenever an alert changes I would need to do a write on 1000s of records. This way my write performance every time I change an alert would be affected. On the other hand, I could have a CF for users-alerts and another for alert details. Then, at read time, I would need to query 1000s of alerts for a given user. In both situations, there is a gap between the time data is written and the time it's available to be read. I understand not normalizing will make me use more disk space, but once data is written once, I will be able to perform as many reads as I want to with no penalty in performance. Also, I understand writes are faster than reads in Cassandra, so the gap would be smaller in the first solution. I would be glad to hear thoughts from the community. Best regards, Marcelo Valle. -- Tyler Hobbs DataStax
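Tyler's "1000 writes in 10 to 20ms" assumes the denormalized writes are issued concurrently, not serially, so total latency tracks the slowest write rather than the sum. A minimal sketch of that fan-out, where `execute_async` is a stand-in for a driver call (e.g. the DataStax python driver's `session.execute_async`) and the table/column names are hypothetical:

```python
# Sketch: fan out one alert update to all subscribed users' partitions
# concurrently, then wait for the whole batch. Total latency is roughly
# that of the slowest single write, which is why 1000 denormalized
# writes can land in tens of milliseconds.

from concurrent.futures import ThreadPoolExecutor, wait

def fan_out_update(execute_async, user_ids, alert_id, new_body):
    with ThreadPoolExecutor(max_workers=32) as pool:
        futures = [
            pool.submit(
                execute_async,
                "UPDATE alerts SET body = %s WHERE user_id = %s AND alert_id = %s",
                (new_body, user_id, alert_id),
            )
            for user_id in user_ids
        ]
        wait(futures)  # all users see the change once the slowest write lands
        return len(futures)
```

With a real async driver you would submit the statements directly and wait on the driver's futures instead of wrapping them in a thread pool.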
Re: to normalize or not to normalize - read penalty vs write penalty
I don't want to optimize for reads or writes, I want to optimize for having the smallest gap possible between the time I write and the time I read. []s From: user@cassandra.apache.org Subject: Re: to normalize or not to normalize - read penalty vs write penalty Roughly how often do you expect to update alerts? How often do you expect to read the alerts? I suspect you'll be doing 100x more reads (or more), in which case optimizing for reads is definitely the right choice. On Wed, Feb 4, 2015 at 9:50 AM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: Hello everyone, I am thinking about the architecture of my application using Cassandra and I am asking myself if I should or shouldn't normalize an entity. I have users and alerts in my application and, for each user, several alerts. The first model which came into my mind was creating an "alerts" CF with user-id as part of the partition key. This way, I can have fast writes and my reads will be fast too, as I will always read per partition. However, I received a requirement later that made my life more complicated. Alerts can be shared by 1000s of users and alerts can change. I am building a real time app and if I change an alert, all users related to it should see the change. Suppose I want to keep things not normalized - whenever an alert changes I would need to do a write on 1000s of records. This way my write performance every time I change an alert would be affected. On the other hand, I could have a CF for users-alerts and another for alert details. Then, at read time, I would need to query 1000s of alerts for a given user. In both situations, there is a gap between the time data is written and the time it's available to be read. I understand not normalizing will make me use more disk space, but once data is written once, I will be able to perform as many reads as I want to with no penalty in performance. Also, I understand writes are faster than reads in Cassandra, so the gap would be smaller in the first solution.
I would be glad to hear thoughts from the community. Best regards, Marcelo Valle. -- Tyler Hobbs DataStax
Re: data distribution along column family partitions
From: clohfin...@gmail.com Subject: Re: data distribution along column family partitions > not ok :) don't let a single partition get to 1gb, 100's of mb should be when > flares are going up. The main reasoning is compactions would be horrifically > slow and there would be a lot of gc pain. Bringing the time bucket down to a > day will probably be sufficient. It would take billions of alarm events in a > single time bucket, if that's the entire data payload, to get that bad. > Wide rows work well, but keeping them smaller is an optimization that will > save you a lot of pain down the road from troublesome jvm gcs, slower > compactions, unbalanced nodes, and higher read latencies. That's the point, I won't have many partitions with more than 15gb. But suppose I will have them for 1000 users, among 10 million. Almost all partitions will have a good size, but won't I have a problem with the few ones which are big? I am asking this because in prior experience I felt I was having a huge performance penalty reading updates from these 1000 users; like, I might have few such cases, but assuming every time data changes I will have to process the user again, I will hit the worst case very often. Chris On Wed, Feb 4, 2015 at 9:33 AM, Marcelo Valle (BLOOMBERG/ LONDON) wrote: > The data model lgtm. You may need to balance the size of the time buckets > with the amount of alarms to prevent partitions from getting too large. 1 month may be a little large, I would aim to keep the partitions below 25mb (can check with nodetool cfstats) or so in size to keep everything happy. Its ok if occasional ones go larger, something like 1gb can be bad.. but it would still work if not very efficiently. What about 15 gb? > Deletes on an entire time-bucket at a time seems like a good approach, but > just setting TTL would be far far better imho (why not just set it to two > years?).
> May want to look into the new DateTieredCompactionStrategy, or
> LeveledCompactionStrategy, or the obsoleted data will very rarely go away.

Excellent hint, I will take a good look at this. I didn't know about DateTieredCompactionStrategy.

> When reading, just be sure to use paging (the good cql drivers will have it
> built in) and don't actually read it all in one massive query. If you
> decrease the size of your time bucket you may end up having to page the query
> across multiple partitions if Y-X > bucket size.

If I use paging, Cassandra won't try to allocate the whole partition on the server node; it will just allocate memory in the heap for that page. Check?

Marcelo Valle
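The 25mb-per-partition guidance above is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch; the alert rate and row size below are illustrative assumptions, not measurements from this thread:

```python
# Back-of-envelope partition sizing -- a sketch, not a benchmark.
# alerts_per_day and bytes_per_row are made-up example numbers.

def partition_size_mb(alerts_per_day: int, bytes_per_row: int, bucket_days: int) -> float:
    """Rough size of one (user-id, time-bucket) partition, in MB."""
    return alerts_per_day * bytes_per_row * bucket_days / (1024 * 1024)

# A heavy user generating 5,000 alerts/day at ~200 bytes per row:
monthly = partition_size_mb(5_000, 200, 30)  # ~28.6 MB: already past the 25 MB target
daily = partition_size_mb(5_000, 200, 1)     # ~0.95 MB: plenty of headroom
```

Running numbers like these for the heaviest expected user shows whether a monthly bucket stays in the comfort zone or needs to shrink toward daily, as suggested above.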
to normalize or not to normalize - read penalty vs write penalty
Hello everyone,

I am thinking about the architecture of my application using Cassandra and I am asking myself whether or not I should normalize an entity.

I have users and alerts in my application and, for each user, several alerts. The first model that came to my mind was creating an "alerts" CF with user-id as part of the partition key. This way I can have fast writes, and my reads will be fast too, as I will always read per partition.

However, I later received a requirement that made my life more complicated: alerts can be shared by 1000s of users, and alerts can change. I am building a real-time app, and if I change an alert, all users related to it should see the change. Suppose I keep things denormalized: whenever an alert changes, I would need to write to 1000s of records, so my write performance would suffer every time I change an alert. On the other hand, I could have one CF for users-alerts and another for alert details; then, at read time, I would need to query 1000s of alerts for a given user. In both situations there is a gap between the time data is written and the time it's available to be read.

I understand that not normalizing will make me use more disk space, but once data is written, I will be able to perform as many reads as I want with no performance penalty. Also, I understand writes are faster than reads in Cassandra, so the gap would be smaller in the first solution.

I would be glad to hear thoughts from the community.

Best regards, Marcelo Valle.
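To make the trade-off concrete, here is a toy model of the two designs, with plain Python dicts standing in for column families (the names are hypothetical, not a proposed schema): denormalizing pays O(subscribers) at write time, normalizing pays O(alerts) at read time.

```python
# Toy comparison of the two designs; dicts stand in for Cassandra CFs.

# Design 1 -- denormalized: the alert payload is copied into every
# subscriber's partition, so an alert update fans out to N writes.
alerts_by_user = {}  # user_id -> {alert_id: payload}

def denorm_update_alert(alert_id, payload, subscribers):
    for user_id in subscribers:          # O(len(subscribers)) writes
        alerts_by_user.setdefault(user_id, {})[alert_id] = payload

# Design 2 -- normalized: subscriptions and payloads live in separate
# tables, so an update is one write, but a read does one lookup per alert.
user_alert_ids = {}   # user_id -> set of alert_ids
alert_details = {}    # alert_id -> payload

def norm_update_alert(alert_id, payload):
    alert_details[alert_id] = payload    # O(1) write

def norm_read_user(user_id):
    ids = user_alert_ids.get(user_id, set())
    return {a: alert_details[a] for a in ids}   # O(len(ids)) lookups
```

Either way the same total work exists; the choice is only whether it lands on the write path or the read path.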
Re: data distribution along column family partitions
> The data model lgtm. You may need to balance the size of the time buckets
> with the amount of alarms to prevent partitions from getting too large. 1
> month may be a little large; I would aim to keep the partitions below 25mb
> (you can check with nodetool cfstats) or so in size to keep everything happy.
> It's ok if occasional ones go larger; something like 1gb can be bad... but it
> would still work, if not very efficiently.

What about 15 gb?

> Deletes on an entire time-bucket at a time seems like a good approach, but
> just setting a TTL would be far, far better imho (why not just set it to two
> years?). May want to look into the new DateTieredCompactionStrategy, or
> LeveledCompactionStrategy, or the obsoleted data will very rarely go away.

Excellent hint, I will take a good look at this. I didn't know about DateTieredCompactionStrategy.

> When reading, just be sure to use paging (the good cql drivers will have it
> built in) and don't actually read it all in one massive query. If you
> decrease the size of your time bucket you may end up having to page the query
> across multiple partitions if Y-X > bucket size.

If I use paging, Cassandra won't try to allocate the whole partition on the server node; it will just allocate memory in the heap for that page. Check?

Marcelo Valle

Chris

On Wed, Feb 4, 2015 at 4:34 AM, Marcelo Elias Del Valle wrote:

Hello,

I am designing a model to store alerts users receive over time. I will want to store probably the last two years of alerts for each user. The first thought I had was having a column family partitioned by user + time bucket, where the time bucket could be something like year + month. For instance:

partition key:
  user-id = f47ac10b-58cc-4372-a567-0e02b2c3d479
  time-bucket = 201502
rest of primary key:
  timestamp = column of type timestamp
  alert-id = f47ac10b-58cc-4372-a567-0e02b2c3d480

Question: would this make it easier to delete old data? Supposing I am not using TTL and I want to remove alerts older than 2 years, what would be better: just deleting the entire time-bucket for each user-id (through a map/reduce process), or having just user-id as the partition key and deleting, for each user, where X > timestamp > Y? Is it the same for Cassandra, internally?

Another question: would data be distributed enough if I just chose to partition by user-id? I will have some users with a large number of alerts, but on average I could consider that alerts would have a good distribution along user ids. The problem is I don't feel confident that having a few partitions with A LOT of alerts would not be a problem at read time. What happens at read time when I try to read data from a big partition? Say I want to read alerts for a user where X > timestamp > Y, but it would return 1 million alerts.
As it's all in a single partition, this read will occur on the same node, thus allocating a lot of memory for this single operation, right? What if the memory needed for this operation is bigger than what fits in the Java heap? Would this be a problem for Cassandra?

Best regards,
-- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
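On the paging question in this thread: my understanding is that with CQL paging (e.g. setting `fetch_size` on a statement in the DataStax Python driver), the client holds only one page of rows at a time rather than the whole result. A small simulation of that client-side behavior; the row counts are made up for illustration:

```python
# Paging simulation -- shows client memory bounded by the page size, not
# the partition size. A paging-aware driver does this transparently when
# you iterate a result set; this sketch just mimics the access pattern.

def paged(rows, fetch_size):
    """Yield successive pages of rows, like a paging-aware driver iterator."""
    for start in range(0, len(rows), fetch_size):
        yield rows[start:start + fetch_size]

partition = list(range(1_000_000))  # stand-in for a 1M-row partition
largest_page = max(len(page) for page in paged(partition, 5_000))
# largest_page is 5,000 -- the client never materializes all 1M rows at once
```

The simulation only demonstrates the client side; whether the server streams a huge partition efficiently also depends on the Cassandra version and how the partition is laid out on disk.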