Re: How to speed up SELECT * query in Cassandra
> Could you please share how much data you store on the cluster and what is the HW configuration of the nodes?

These nodes are dedicated HW: 24 CPUs and 50GB RAM each. Each node has a few TB of data (you don't want to go over this) in RAID 50 (we're migrating over to JBOD). Each C* node is running 2.0.11 and configured to use an 8GB heap, 2GB new generation, and jdk1.7.0_55. Hadoop (2.2.0) tasktrackers and dfs run on these nodes as well; all up they use up to 12GB RAM, leaving ~30GB for the kernel and page cache. Data locality is an important goal: in the worst-case scenarios we've seen, it meant a four-times throughput benefit. HDFS, being a volatile hadoop-internals space for us, is on SSDs, providing strong m/r performance. (The commitlog of course is also on SSD. We made the mistake of putting it on the same SSD to begin with; don't do that, the commitlog gets its own SSD.)

> I am really impressed that you are able to read 100M records in ~4 minutes on 4 nodes. It makes something like 100k reads per node, which is something we are quite far away from.

These are not individual reads, and not the number of partition keys, but m/r records (or CQL rows). But yes, the performance of Spark against Cassandra is impressive.

> It leads me to question whether reading from Spark goes through Cassandra's JVM and thus the normal read path, or whether it reads the sstables directly from disk sequentially and possibly filters out old/tombstoned values by itself.

Both the Hadoop-Cassandra integration and the Spark-Cassandra connector go through the normal read path, like all CQL read queries. With our m/r jobs each task works with just one partition key, doing repeated column-slice reads through that partition key according to the ConfigHelper.rangeBatchSize setting, which we have set to 100. These hadoop jobs use a custom-written CqlInputFormat because of the poor performance CqlInputFormat has today against a vnodes setup; the customisation we have is pretty much the same as the patch on offer in CASSANDRA-6091.
This problem with vnodes we haven't experienced with the spark connector. I presume that, like the hadoop integration, spark also bulk reads (column slices) from each partition key. Otherwise this is useful reading: http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting

This is also a cluster that serves requests to web applications that need low latency. Let it be said this isn't something I'd recommend, just the path we had to take because of our small initial dedicated-HW cluster. (You really want to separate online and offline datacenters, so that you can maximise the offline cluster for the heavy batch reads.)

~mck
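The per-partition slicing described above can be sketched as follows. This is an illustrative Python sketch, not the actual CqlInputFormat code; the function and data names are made up, with only the batch size of 100 taken from the ConfigHelper.rangeBatchSize setting mentioned in the message.

```python
# Hypothetical illustration of batched column-slice reads: a map task owns
# one partition key and pulls that partition's CQL rows in slices of
# range_batch_size (cf. ConfigHelper.rangeBatchSize = 100), instead of one
# huge read or millions of tiny ones.
def slice_reads(partition_rows, range_batch_size=100):
    """Yield the rows of one partition in successive slices."""
    offset = 0
    while offset < len(partition_rows):
        # in the real job, each iteration would be one CQL column-slice query
        batch = partition_rows[offset:offset + range_batch_size]
        yield batch
        offset += len(batch)

# a partition with 250 rows is fetched in 3 slices: 100, 100, 50
batches = list(slice_reads([("pk1", i) for i in range(250)]))
```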
Re: How to speed up SELECT * query in Cassandra
Jirka,

> But I am really interested how it can work well with Spark/Hadoop where you basically need to read all the data as well (as far as I understand that).

I can't give you any benchmarking between technologies (nor am I particularly interested in getting involved in such a discussion), but I can share our experiences with Cassandra, Hadoop, and Spark over the past 4+ years, and hopefully assure you that Cassandra+Spark is a smart choice.

On a four node cluster we were running 5000+ small hadoop jobs each day, each finishing within two minutes, often within one minute, resulting in (give or take) a billion records read and 150 million records written from and to C*. These small jobs do incremental processing over limited partition key sets each time. They primarily read data from a raw events store that has a TTL of 3 months and 22+GB of tombstones a day (reads over old partition keys are rare). We also run full-table-scan jobs and have never come across any issues particular to that. There are hadoop map/reduce settings to increase durability if you have tables with troublesome partition keys. This is also a cluster that serves requests to web applications that need low latency.

We recently wrote a spark job that does full table scans over 100 million+ rows, involves a handful of stages (two tables, 9 maps, 4 reduces, and 2 joins), and writes 5 million rows back to a new table. This job runs in ~260 seconds. Spark is becoming a natural complement to schema evolution for cassandra, something you'll want to do to keep your schema optimised against your read request patterns, even for little things like switching cluster keys around.

With any new technology, hitting some hurdles (especially if you go wandering outside recommended practices) will of course be part of the game, but that said, I've only had positive experiences with this community's ability to help out (and do so quickly).
Starting from scratch I'd use Spark (on scala) over Hadoop, no questions asked. Otherwise Cassandra has always been our 'big data' platform; hadoop/spark is just an extra tool on top. We've never kept data in hdfs and are very grateful for having made that choice.

~mck

ref: https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/
Re: How to speed up SELECT * query in Cassandra
If you are using Spark you need to be _really_ careful about your tombstones. In our experience a single partition with too many tombstones can take down the whole batch job (until something like https://issues.apache.org/jira/browse/CASSANDRA-8574 is fixed). This was a major obstacle for us to overcome when using Spark.

Cheers,
Jens

On Wed, Feb 11, 2015 at 5:12 PM, Jiri Horky ho...@avast.com wrote:
> Well, I always wondered how Cassandra can be used in a Hadoop-like environment where you basically need to do full table scans. [...]

--
Jens Rantil
Backend engineer
Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Facebook: https://www.facebook.com/#!/tink.se
Linkedin: http://www.linkedin.com/company/2735919
Twitter: https://twitter.com/tink
Re: How to speed up SELECT * query in Cassandra
Thanks Jirka!

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

> Hi, here are some snippets of code in scala which should get you started. [...]
Re: How to speed up SELECT * query in Cassandra
Well, I always wondered how Cassandra can be used in a Hadoop-like environment where you basically need to do full table scans. I need to say that our experience is that cassandra is perfect for writing and for reading specific values by key, but definitely not for reading all of the data out of it. Some of our projects found out that doing that with a non-trivial amount of data in a timely manner is close to impossible in many situations. We are slowly moving to storing the data in HDFS and possibly reprocessing them on a daily basis for such use cases (statistics). This is nothing against Cassandra; it cannot be perfect for everything. But I am really interested how it can work well with Spark/Hadoop where you basically need to read all the data as well (as far as I understand it).

Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:
> > The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra
>
> Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?
>
> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
> > > cassandra makes a very poor datawarehouse or long term time series store
> >
> > Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it's only not good for storing temporary data. Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with hadoop, is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part spark or hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results? [...]
Re: How to speed up SELECT * query in Cassandra
For your information Colin: http://en.wikipedia.org/wiki/List_of_fallacies. Look at "Burden of proof": http://en.wikipedia.org/wiki/Philosophic_burden_of_proof

You stated: "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra". It's up to YOU to prove it right, not up to me to prove it wrong. All other bla bla is trolling. Come back to me once you have some decent benchmarks supporting your statement; until then, the question is closed.

On Wed, Feb 11, 2015 at 3:17 PM, Colin co...@clark.ws wrote:
> Did you want me to include specific examples from my employment at datastax or start from the ground up? [...]
Re: How to speed up SELECT * query in Cassandra
Did you want me to include specific examples from my employment at datastax, or start from the ground up? All spark on cassandra is, is a better alternative to the previous use of hive. The fact that datastax hasn't provided any benchmarks themselves, other than glossy marketing statements, pretty much says it all. Where are your benchmarks?

Maybe you could combine it with the in-memory option to really boogie... :)

(If I find time, I might just write a blog post about exactly how to do this. It involves the use of parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)

--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 6:51 AM, DuyHai Doan doanduy...@gmail.com wrote:
> Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement? [...]
Re: How to speed up SELECT * query in Cassandra
No, the question isn't closed. You don't get to decide that. I don't run a website making claims regarding cassandra and spark; your employer does. Again, where are your benchmarks? I will publish mine, then we'll see what you've got.

--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 8:39 AM, DuyHai Doan doanduy...@gmail.com wrote:
> For your information Colin: http://en.wikipedia.org/wiki/List_of_fallacies. Look at "Burden of proof". [...]
Re: How to speed up SELECT * query in Cassandra
Hi,

here are some snippets of code in scala which should get you started.

Jirka H.

    loop { lastRow =>
      val query = lastRow match {
        case Some(row) => nextPageQuery(row, upperLimit)
        case None => initialQuery(lowerLimit)
      }
      session.execute(query).all
    }

    private def nextPageQuery(row: Row, upperLimit: String): String = {
      val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
        .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
      basicQuery.format(tokenPart)
    }

    private def initialQuery(lowerLimit: String): String = {
      val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
      basicQuery.format(tokenPart)
    }

    private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
      tokenRange match {
        case Some((start, end)) =>
          Logger.info("Token range given: {}",
            "<" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ">")
          val tokenSpaceSize = end - start
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency)
            yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
        case None =>
          val tokenSpaceSize = partitioner.max - partitioner.min
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency)
            yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
      }
    }

    private val basicQuery = {
      "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
        rowKeyName,
        columnKeyName,
        columnValueName,
        columnValueName,
        columnFamily,
        "%s", // template
        whereCondition,
        pageSize,
        if (cqlAllowFiltering) " allow filtering" else "")
    }

    case object Murmur3 extends Partitioner {
      override val min = BigDecimal(-2).pow(63)
      override val max = BigDecimal(2).pow(63) - 1
    }

    case object Random extends Partitioner {
      override val min = BigDecimal(0)
      override val max = BigDecimal(2).pow(127) - 1
    }

On 02/11/2015 02:21 PM, Ja Sam wrote:
> Your answer looks very promising. How do you calculate start and stop?
On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote:
> The fastest way I am aware of is to do the queries in parallel against multiple cassandra nodes and to make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information if the driver does not take this into account for you. Then you can play with the concurrency and batch size of a single query against one node. Basically, what you/the driver should do is transform the query into a series of SELECT * FROM TABLE WHERE TOKEN IN (start, stop) queries. I will need to look up the actual code, but the idea should be clear :)
>
> Jirka H.
>
> On 02/11/2015 11:26 AM, Ja Sam wrote:
> > Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? [...]
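The token-range splitting in Jirka's scala snippets translates roughly to the following Python sketch. This is illustrative only: it assumes the Murmur3 partitioner bounds from the snippets, and the table and key names are made up.

```python
# Split the Murmur3 token space into contiguous ranges, one SELECT per range,
# mirroring calculateRanges and nextPageQuery/initialQuery above.
MURMUR3_MIN = -(2 ** 63)        # cf. Murmur3.min in the scala code
MURMUR3_MAX = 2 ** 63 - 1       # cf. Murmur3.max

def token_ranges(concurrency, lo=MURMUR3_MIN, hi=MURMUR3_MAX):
    """Return `concurrency` contiguous (start, end] token ranges covering the space."""
    size = (hi - lo) // concurrency
    ranges = [(lo + i * size, lo + (i + 1) * size) for i in range(concurrency)]
    ranges[-1] = (ranges[-1][0], hi)  # last range absorbs the rounding remainder
    return ranges

def range_query(table, key, start, end):
    # each query can then be sent, in parallel, to a node owning that range
    return ("SELECT * FROM %s WHERE token(%s) > %d AND token(%s) <= %d"
            % (table, key, start, key, end))

queries = [range_query("events", "id", s, e) for s, e in token_ranges(8)]
```

The ranges are contiguous and cover the whole token space, so running the per-range queries in parallel is equivalent to one full table scan.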
How to speed up SELECT * query in Cassandra
Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the data into one table per day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible.
Re: How to speed up SELECT * query in Cassandra
I use spark with cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always: why aren't you updating both places at once? For example, we use hadoop and cassandra in conjunction with each other: we use a message bus to store every event in both and aggregate in both, but only keep current data in cassandra (cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from hadoop and cassandra. Also, spark on hdfs gives more flexibility in terms of large datasets and performance; cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra. -- Colin Clark +1 612 859 6129 Skype colin.p.clark On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens
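The "update both places at once" pattern Colin describes is just a fan-out from the message bus to two sinks. Below is a minimal sketch of that dispatch step; the `Sink` class is an in-memory stand-in (real code would write to Cassandra and to HDFS via their respective clients), and all names here are illustrative, not from any actual library:

```python
# Sketch: every event from the bus is dispatched to both stores, so the
# warehouse (hadoop) and the serving layer (cassandra) receive the same
# stream and never diverge. Sink is an in-memory stand-in, not a real client.
class Sink:
    def __init__(self, name):
        self.name = name
        self.rows = []

    def write(self, event):
        self.rows.append(event)

def dispatch(event, sinks):
    """Fan one event out to every configured sink."""
    for sink in sinks:
        sink.write(event)

cassandra, hadoop = Sink("cassandra"), Sink("hadoop")
for event in [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]:
    dispatch(event, [cassandra, hadoop])
```

In practice the serving-layer sink would also apply a TTL or retention policy so only current data stays in Cassandra, per Colin's setup.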
Re: How to speed up SELECT * query in Cassandra
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se
Re: How to speed up SELECT * query in Cassandra
The fastest way I am aware of is to run the queries in parallel against multiple cassandra nodes and make sure that you only ask each node for keys it is responsible for. Otherwise, the node needs to forward your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information if the driver does not take this into account for you. Then, you can play with the concurrency and the batch size of a single query against one node. Basically, what you (or the driver) should do is transform the query into a series of SELECT * FROM table WHERE token(key) > start AND token(key) <= stop queries. I will need to look up the actual code, but the idea should be clear :) Jirka H. On 02/11/2015 11:26 AM, Ja Sam wrote: Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the data into one table per day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible.
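The transformation Jirka describes can be sketched as follows. This is a minimal illustration, not his actual code: it splits the full Murmur3 token space into equal slices and builds one range-restricted CQL string per slice (each of which could then be run concurrently). The table and key names (`events`, `id`) are placeholders; note also that the first slice uses a strict `>`, so production code would need `>=` there to include the minimum token itself:

```python
# Sketch: split the Murmur3 token space into `concurrency` contiguous
# (start, stop] ranges and build one range-restricted query per slice.
MURMUR3_MIN = -(2 ** 63)
MURMUR3_MAX = 2 ** 63 - 1

def token_ranges(concurrency, lo=MURMUR3_MIN, hi=MURMUR3_MAX):
    """Yield (start, stop] pairs that cover the token space exactly once."""
    size = hi - lo
    for i in range(concurrency):
        yield lo + size * i // concurrency, lo + size * (i + 1) // concurrency

def range_query(table, key, start, stop):
    """Build the per-slice CQL string; placeholder table/key names."""
    return ("SELECT * FROM %s WHERE token(%s) > %d AND token(%s) <= %d"
            % (table, key, start, key, stop))

queries = [range_query("events", "id", s, e) for s, e in token_ranges(4)]
for q in queries:
    print(q)
```

Each query string would be executed by a separate worker against a coordinator that owns the slice, which is exactly the "only ask each node for keys it is responsible for" part.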
Re: How to speed up SELECT * query in Cassandra
"cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra" Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement? On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: "cassandra makes a very poor data warehouse or long-term time series store" Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it's only not good for storing temporary data. Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. "makes spark on hdfs actually faster than on cassandra" I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with hadoop, is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part spark or hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?
Re: How to speed up SELECT * query in Cassandra
Your answer looks very promising. How do you calculate start and stop? On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote: The fastest way I am aware of is to run the queries in parallel against multiple cassandra nodes and make sure that you only ask each node for keys it is responsible for.
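Jirka's snippet elsewhere in the thread answers this: start and stop come from the partitioner's min/max tokens split into `concurrency` equal slices, and within each slice the loop pages by restarting from the last row seen. That restart step can be sketched like this; it is a loose paraphrase of his `nextPageQuery`/`initialQuery`, with illustrative names and an assumed page size of 1000:

```python
# Sketch of the page-restart step: after fetching a page, the next query
# resumes strictly after the last partition key seen (by its token),
# staying inside the slice's upper token bound.
def initial_query(table, key, lower_limit):
    """First page of a slice, starting at the slice's lower token bound."""
    return ("SELECT * FROM %s WHERE token(%s) >= %s LIMIT 1000"
            % (table, key, lower_limit))

def next_page_query(table, key, last_key_hex, upper_limit):
    """Subsequent pages: resume after the last row's key (hex-encoded blob)."""
    return ("SELECT * FROM %s WHERE token(%s) > token(0x%s) "
            "AND token(%s) <= %s LIMIT 1000"
            % (table, key, last_key_hex, key, upper_limit))
```

The loop stops for a slice when a page comes back smaller than the limit, meaning the slice's token range is exhausted.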