"The very nature of Cassandra's distributed nature vs partitioning data on
Hadoop makes Spark on HDFS actually faster than on Cassandra...."

Prove it. Have you ever looked at the source code of the
Spark/Cassandra connector to see how data locality is achieved before
making such a statement?
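
For readers unfamiliar with the idea, here is a minimal sketch of how data locality can work in principle. It is not the connector's actual code. The ring metadata, host addresses, and the `plan_partitions` helper are all hypothetical. The point is only that each token range can be tagged with the hosts holding its replicas, so that a scheduler can prefer running the task on a node that already has the data.

```python
# Hypothetical ring metadata: (start_token, end_token) -> replica hosts.
# Real clusters expose this through driver metadata; the values here are made up.
TOKEN_RANGES = {
    (0, 100): ["10.0.0.1", "10.0.0.2"],
    (100, 200): ["10.0.0.2", "10.0.0.3"],
    (200, 300): ["10.0.0.3", "10.0.0.1"],
}


def plan_partitions(token_ranges):
    """Turn each token range into one partition tagged with its replica hosts.

    A scheduler can use `preferred_hosts` as a hint to run the task that
    reads this range on a node that stores the data locally.
    """
    partitions = []
    for (start, end), replicas in sorted(token_ranges.items()):
        partitions.append({
            "range": (start, end),
            "preferred_hosts": list(replicas),
        })
    return partitions


if __name__ == "__main__":
    for partition in plan_partitions(TOKEN_RANGES):
        print(partition["range"], "->", partition["preferred_hosts"])
```

With a hint like this, a read of a given token range does not need to cross the network when an executor runs on one of the replica hosts.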

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <
mvallemil...@bloomberg.net> wrote:

> > Cassandra makes a very poor data warehouse or long-term time series store
>
> Really? This is not the impression I have... I think Cassandra is good for
> storing large amounts of data and historical information; it's just not
> good for storing temporary data.
> Netflix has a large amount of data and it's all stored in Cassandra,
> AFAIK.
>
> > The very nature of Cassandra's distributed nature vs partitioning data
> on Hadoop makes Spark on HDFS actually faster than on Cassandra.
>
> I am not sure about the current state of Spark support for Cassandra, but
> I guess that if you create a map/reduce job, the intermediate map results
> will still be stored in HDFS, as happens with Hadoop. Is this right? I think
> the problem with Spark + Cassandra or Hadoop + Cassandra is that the
> hard part that Spark or Hadoop does, the shuffling, could be done out of the
> box with Cassandra, but no one takes advantage of that. What if a map/reduce
> job used a temporary CF in Cassandra to store intermediate results?
>
> From: user@cassandra.apache.org
> Subject: Re: How to speed up SELECT * query in Cassandra
>
> I use Spark with Cassandra, and you don't need DSE.
>
> I see a lot of people ask this same question (how do I get a lot of
> data out of Cassandra?), and my question is always: why aren't you updating
> both places at once?
>
> For example, we use Hadoop and Cassandra in conjunction with each other.
> We use a message bus to store every event in both and aggregate in both, but
> only keep current data in Cassandra (Cassandra makes a very poor
> data warehouse or long-term time series store), and then use services to
> process queries that merge data from Hadoop and Cassandra.
>
> Also, Spark on HDFS gives more flexibility in terms of large datasets and
> performance.  The very nature of Cassandra's distributed nature vs
> partitioning data on Hadoop makes Spark on HDFS actually faster than on
> Cassandra....
>
>
>
> --
> *Colin Clark*
> +1 612 859 6129
> Skype colin.p.clark
>
> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>
>
> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <
> mvallemil...@bloomberg.net> wrote:
>
>> If you use Cassandra enterprise, you can use hive, AFAIK.
>
>
> Even better, you can use Spark/Shark with DSE.
>
> Cheers,
> Jens
>
>
> --
> Jens Rantil
> Backend engineer
> Tink AB
>
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
>
>
>
