"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra...."
Prove it. Did you ever have a look at the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote:

> > cassandra makes a very poor datawarehouse or long term time series store
>
> Really? This is not the impression I have... I think Cassandra is good to store large amounts of data and historical information, it's only not good to store temporary data.
> Netflix has a large amount of data and it's all stored in Cassandra, AFAIK.
>
> > The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra.
>
> I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with hadoop, is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part spark or hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map / reduce job used a temporary CF in Cassandra to store intermediate results?
>
> From: user@cassandra.apache.org
> Subject: Re: How to speed up SELECT * query in Cassandra
>
> I use spark with cassandra, and you don't need DSE.
>
> I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always, why aren't you updating both places at once?
>
> For example, we use hadoop and cassandra in conjunction with each other, we use a message bus to store every event in both, aggregate in both, but only keep current data in cassandra (cassandra makes a very poor datawarehouse or long term time series store) and then use services to process queries that merge data from hadoop and cassandra.
>
> Also, spark on hdfs gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra....
>
> --
> *Colin Clark*
> +1 612 859 6129
> Skype colin.p.clark
>
> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>
> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote:
>
> > If you use Cassandra enterprise, you can use hive, AFAIK.
>
> Even better, you can use Spark/Shark with DSE.
>
> Cheers,
> Jens
>
> --
> Jens Rantil
> Backend engineer
> Tink AB
>
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se