Re: How to speed up SELECT * query in Cassandra

Jiri Horky Wed, 11 Feb 2015 08:14:37 -0800

Well, I have always wondered how Cassandra can be used in a Hadoop-like
environment where you basically need to do a full table scan.


I have to say that, in our experience, Cassandra is perfect for
writing and for reading specific values by key, but definitely not for
reading all of the data out of it. Some of our projects found that doing
that with a non-trivial amount of data in a timely manner is close to
impossible in many situations. We are slowly moving to storing the data
in HDFS and possibly reprocessing it on a daily basis for such use cases
(statistics).

This is nothing against Cassandra; it cannot be perfect for everything.
But I am really interested in how it can work well with Spark/Hadoop,
where you basically need to read all the data as well (as far as I
understand it).
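
For reference, the usual way to read a whole table without one giant
SELECT * is to split the token ring into ranges and scan each range
with its own query, so the ranges can be read in parallel. As far as I
know, this is broadly the idea behind the Hadoop and Spark integrations
as well. Below is a minimal sketch that only builds the CQL strings. It
assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1);
the keyspace, table and column names are made up for illustration, and
the functions are hypothetical helpers, not part of any driver.

```python
# Token bounds of Murmur3Partitioner (assumption: the cluster uses the default partitioner).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Return num_splits contiguous (start, end) token ranges covering the whole ring."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    # Integer arithmetic only, so the last bound lands exactly on MAX_TOKEN.
    bounds = [MIN_TOKEN + (span * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def scan_queries(keyspace, table, partition_key, num_splits):
    """Build one CQL range query per token range (hypothetical helper)."""
    queries = []
    for i, (start, end) in enumerate(split_token_ring(num_splits)):
        # The first range includes its lower bound so no token is left out;
        # every later range starts just after the previous range's end.
        lower_op = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {keyspace}.{table} "
            f"WHERE token({partition_key}) {lower_op} {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


if __name__ == "__main__":
    # Placeholder names; substitute a real keyspace, table and partition key.
    for query in scan_queries("my_keyspace", "events", "id", 4):
        print(query)
```

Each query can then be sent to a separate worker, ideally one running
on a replica that owns that range. Whether this is fast enough for a
given data set is exactly the open question above.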

Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:
> "The very nature of cassandra's distributed nature vs partitioning
> data on hadoop makes spark on hdfs actually faster than on cassandra...."
>
> Prove it. Did you ever have a look into the source code of the
> Spark/Cassandra connector to see how data locality is achieved before
> throwing out such a statement?
>
> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON)
> <[email protected] <mailto:[email protected]>> wrote:
>
>     > cassandra makes a very poor data warehouse or long term time series store
>
>     Really? This is not the impression I have... I think Cassandra is
>     good to store large amounts of data and historical information,
>     it's only not good to store temporary data.
>     Netflix has a large amount of data and it's all stored in
>     Cassandra, AFAIK.
>
>     > The very nature of cassandra's distributed nature vs partitioning
>     > data on hadoop makes spark on hdfs actually faster than on cassandra.
>
>     I am not sure about the current state of Spark support for
>     Cassandra, but I guess if you create a map reduce job, the
>     intermediate map results will still be stored in HDFS, as
>     happens with Hadoop, is this right? I think the problem with Spark +
>     Cassandra or with Hadoop + Cassandra is that the hard part Spark
>     or Hadoop does, the shuffling, could be done out of the box with
>     Cassandra, but no one takes advantage of that. What if a map /
>     reduce job used a temporary CF in Cassandra to store intermediate
>     results?
>
>     From: [email protected] <mailto:[email protected]>
>     Subject: Re: How to speed up SELECT * query in Cassandra
>
>         I use Spark with Cassandra, and you don't need DSE.
>
>         I see a lot of people ask this same question below (how do I
>         get a lot of data out of cassandra?), and my question is
>         always, why aren't you updating both places at once?
>
>         For example, we use hadoop and cassandra in conjunction with
>         each other. We use a message bus to store every event in both,
>         aggregate in both, but only keep current data in cassandra
>         (cassandra makes a very poor data warehouse or long term time
>         series store) and then use services to process queries that
>         merge data from hadoop and cassandra.
>
>         Also, spark on hdfs gives more flexibility in terms of large
>         datasets and performance.  The very nature of cassandra's
>         distributed nature vs partitioning data on hadoop makes spark
>         on hdfs actually faster than on cassandra....
>
>
>
>         -- 
>         *Colin Clark* 
>         +1 612 859 6129 <tel:%2B1%20612%20859%206129>
>         Skype colin.p.clark
>
>         On Feb 11, 2015, at 4:49 AM, Jens Rantil <[email protected]
>         <mailto:[email protected]>> wrote:
>
>>
>>         On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/
>>         LONDON) <[email protected]
>>         <mailto:[email protected]>> wrote:
>>
>>             If you use Cassandra enterprise, you can use hive, AFAIK.
>>
>>
>>         Even better, you can use Spark/Shark with DSE.
>>
>>         Cheers,
>>         Jens
>>
>>
>>         -- 
>>         Jens Rantil
>>         Backend engineer
>>         Tink AB
>>
>>         Email: [email protected] <mailto:[email protected]>
>>         Phone: +46 708 84 18 32
>>         Web: www.tink.se <http://www.tink.se/>
>>
>>         Facebook <https://www.facebook.com/#%21/tink.se> Linkedin
>>         
>> <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
>>  Twitter
>>         <https://twitter.com/tink>
>
>
>
