If you are using Spark you need to be _really_ careful about your tombstones. In our experience a single partition with too many tombstones can take down the whole batch job (until something like https://issues.apache.org/jira/browse/CASSANDRA-8574 is fixed). This was a major obstacle for us to overcome when using Spark.
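To make the failure mode concrete, here is a toy Python sketch. It is not Cassandra or Spark connector code: `Partition`, `read_partition`, `scan_table` and `TombstoneLimitExceeded` are made-up names, and the default of 100000 is only an assumption borrowed from the `tombstone_failure_threshold` setting in cassandra.yaml. It only shows why one bad partition is enough to fail a job that has to read every partition.

```python
from typing import Iterable, NamedTuple


class TombstoneLimitExceeded(Exception):
    """Hypothetical stand-in for a read aborted by too many tombstones."""


class Partition(NamedTuple):
    key: str
    live_rows: int
    tombstones: int


def read_partition(partition: Partition, failure_threshold: int) -> int:
    """Return the live row count, or abort if the read hits too many tombstones."""
    if partition.tombstones > failure_threshold:
        raise TombstoneLimitExceeded(
            f"partition {partition.key!r}: {partition.tombstones} tombstones "
            f"exceeds threshold {failure_threshold}"
        )
    return partition.live_rows


def scan_table(partitions: Iterable[Partition], failure_threshold: int = 100_000) -> int:
    """Toy full-table scan: a single failed partition read fails the whole job."""
    total = 0
    for partition in partitions:
        # No per-partition error handling here, so the exception propagates
        # and the rows already counted are thrown away with the job.
        total += read_partition(partition, failure_threshold)
    return total


if __name__ == "__main__":
    table = [
        Partition("user-1", live_rows=120, tombstones=40),
        Partition("user-2", live_rows=80, tombstones=250_000),  # the bad one
        Partition("user-3", live_rows=300, tombstones=0),
    ]
    try:
        scan_table(table)
    except TombstoneLimitExceeded as exc:
        print("batch job failed:", exc)
```

A key lookup that never touches the bad partition would not notice it; a full scan always does, which is why this hurts batch jobs in particular.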
Cheers,
Jens

On Wed, Feb 11, 2015 at 5:12 PM, Jiri Horky <ho...@avast.com> wrote:

> Well, I always wondered how Cassandra can be used in a Hadoop-like
> environment where you basically need to do a full table scan.
>
> I need to say that our experience is that Cassandra is perfect for
> writing and for reading specific values by key, but definitely not for
> reading all of the data out of it. Some of our projects found out that
> doing that with a non-trivial [amount of data] in a timely manner is close
> to impossible in many situations. We are slowly moving to storing the data
> in HDFS and possibly reprocessing it on a daily basis for such use cases
> (statistics).
>
> This is nothing against Cassandra; it cannot be perfect for everything.
> But I am really interested in how it can work well with Spark/Hadoop, where
> you basically need to read all the data as well (as far as I understand it).
>
> Jirka H.
>
>
> On 02/11/2015 01:51 PM, DuyHai Doan wrote:
>
> "The very nature of cassandra's distributed nature vs partitioning data
> on hadoop makes spark on hdfs actually faster than on cassandra...."
>
> Prove it. Did you ever have a look into the source code of the
> Spark/Cassandra connector to see how data locality is achieved before
> throwing out such a statement?
>
> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote:
>
>> > cassandra makes a very poor data warehouse or long-term time series store
>>
>> Really? This is not the impression I have... I think Cassandra is good
>> for storing large amounts of data and historical information; it's only
>> not good for storing temporary data.
>> Netflix has a large amount of data and it's all stored in Cassandra,
>> AFAIK.
>>
>> > The very nature of cassandra's distributed nature vs partitioning
>> > data on hadoop makes spark on hdfs actually faster than on cassandra.
>>
>> I am not sure about the current state of Spark support for Cassandra,
>> but I guess if you create a map reduce job, the intermediate map results
>> will still be stored in HDFS, as happens with Hadoop, is this right? I
>> think the problem with Spark + Cassandra or with Hadoop + Cassandra is that
>> the hard part Spark or Hadoop does, the shuffling, could be done out of the
>> box with Cassandra, but no one takes advantage of that. What if a map /
>> reduce job used a temporary CF in Cassandra to store intermediate results?
>>
>> From: user@cassandra.apache.org
>> Subject: Re: How to speed up SELECT * query in Cassandra
>>
>> I use Spark with Cassandra, and you don't need DSE.
>>
>> I see a lot of people ask this same question below (how do I get a lot
>> of data out of Cassandra?), and my question is always, why aren't you
>> updating both places at once?
>>
>> For example, we use Hadoop and Cassandra in conjunction with each
>> other. We use a message bus to store every event in both and aggregate in
>> both, but only keep current data in Cassandra (Cassandra makes a very poor
>> data warehouse or long-term time series store), and then use services to
>> process queries that merge data from Hadoop and Cassandra.
>>
>> Also, Spark on HDFS gives more flexibility in terms of large datasets
>> and performance. The very nature of Cassandra's distributed nature vs
>> partitioning data on Hadoop makes Spark on HDFS actually faster than on
>> Cassandra....
>>
>>
>>
>> --
>> *Colin Clark*
>> +1 612 859 6129
>> Skype colin.p.clark
>>
>> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>>
>>
>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote:
>>
>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>
>>
>> Even better, you can use Spark/Shark with DSE.
>>
>> Cheers,
>> Jens
>>
>>
>> --
>> Jens Rantil
>> Backend engineer
>> Tink AB
>>
>> Email: jens.ran...@tink.se
>> Phone: +46 708 84 18 32
>> Web: www.tink.se
>>
>> Facebook <https://www.facebook.com/#%21/tink.se> Linkedin
>> <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
>> Twitter <https://twitter.com/tink>
>>
>>
>
>

--
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook <https://www.facebook.com/#!/tink.se> Linkedin
<http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
Twitter <https://twitter.com/tink>