Re: How to speed up SELECT * query in Cassandra

Jens Rantil Fri, 13 Feb 2015 06:12:17 -0800

If you are using Spark you need to be _really_ careful about your
tombstones. In our experience a single partition with too many tombstones
can take down the whole batch job (until something like
https://issues.apache.org/jira/browse/CASSANDRA-8574 is fixed). This was a
major obstacle for us to overcome when using Spark.
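
To illustrate the failure mode, here is a minimal Python sketch. It is a toy
model, not Cassandra's actual read path. It assumes that a read of a partition
aborts once the number of tombstones it scans passes a limit, in the spirit of
the tombstone_failure_threshold setting in cassandra.yaml (100000 by default,
as far as I know). The names Partition, TooManyTombstones, scan_partition,
full_scan and find_offenders are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Assumed limit, modelled on cassandra.yaml's tombstone_failure_threshold.
TOMBSTONE_FAILURE_THRESHOLD = 100_000


class TooManyTombstones(Exception):
    """Hypothetical error raised when a scan passes the tombstone limit."""


@dataclass
class Partition:
    key: str
    live_rows: int
    tombstones: int


def scan_partition(partition, threshold=TOMBSTONE_FAILURE_THRESHOLD):
    """Return the live row count, or abort if the partition holds too many tombstones."""
    if partition.tombstones > threshold:
        raise TooManyTombstones(
            f"partition {partition.key!r}: {partition.tombstones} tombstones"
        )
    return partition.live_rows


def full_scan(partitions, threshold=TOMBSTONE_FAILURE_THRESHOLD):
    """Model a batch job that reads every partition; the first error fails the job."""
    total = 0
    for partition in partitions:
        total += scan_partition(partition, threshold)
    return total


def find_offenders(partitions, threshold=TOMBSTONE_FAILURE_THRESHOLD):
    """List the keys of partitions that would abort a full scan."""
    return [p.key for p in partitions if p.tombstones > threshold]
```

In this model a single partition over the limit makes full_scan raise even
when every other partition is healthy, which is the behaviour we ran into.
Something like find_offenders is how I would think about locating the bad
partitions before launching the job.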


Cheers,
Jens

On Wed, Feb 11, 2015 at 5:12 PM, Jiri Horky <ho...@avast.com> wrote:

>  Well, I always wondered how Cassandra can be used in a Hadoop-like
> environment where you basically need to do a full table scan.
>
> I have to say that in our experience Cassandra is perfect for writing and
> for reading specific values by key, but definitely not for reading all of
> the data out of it. Some of our projects found that doing that with a
> non-trivial amount of data in a timely manner is close to impossible in
> many situations. We are slowly moving to storing the data in HDFS and
> possibly reprocessing it on a daily basis for such use cases (statistics).
>
> This is nothing against Cassandra; it cannot be perfect for everything.
> But I am really interested in how it can work well with Spark/Hadoop, where
> you basically need to read all the data as well (as far as I understand).
>
> Jirka H.
>
>
> On 02/11/2015 01:51 PM, DuyHai Doan wrote:
>
> "The very nature of cassandra's distributed nature vs partitioning data
> on hadoop makes spark on hdfs actually faster than on cassandra...."
>
>  Prove it. Did you ever look at the source code of the Spark/Cassandra
> connector to see how data locality is achieved before throwing out such a
> statement?
>
> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <
> mvallemil...@bloomberg.net> wrote:
>
>>  > cassandra makes a very poor datawarehouse or long term time series
>> store
>>
>>  Really? This is not the impression I have... I think Cassandra is good
>> for storing large amounts of data and historical information; it's only
>> not good for storing temporary data.
>> Netflix has a large amount of data and it's all stored in Cassandra,
>> AFAIK.
>>
>>  > The very nature of cassandra's distributed nature vs partitioning
>> data on hadoop makes spark on hdfs actually faster than on cassandra.
>>
>>  I am not sure about the current state of Spark support for Cassandra,
>> but I guess if you create a map reduce job, the intermediate map results
>> will still be stored in HDFS, as happens with Hadoop, is this right? I
>> think the problem with Spark + Cassandra or with Hadoop + Cassandra is that
>> the hard part Spark or Hadoop does, the shuffling, could be done out of the
>> box with Cassandra, but no one takes advantage of that. What if a map /
>> reduce job used a temporary CF in Cassandra to store intermediate results?
>>
>>   From: user@cassandra.apache.org
>> Subject: Re: How to speed up SELECT * query in Cassandra
>>
>> I use Spark with Cassandra, and you don't need DSE.
>>
>>  I see a lot of people ask this same question below (how do I get a lot
>> of data out of cassandra?), and my question is always, why aren't you
>> updating both places at once?
>>
>>  For example, we use hadoop and cassandra in conjunction with each
>> other: we use a message bus to store every event in both and aggregate in
>> both, but only keep current data in cassandra (cassandra makes a very poor
>> datawarehouse or long term time series store), and then use services to
>> process queries that merge data from hadoop and cassandra.
>>
>>  Also, spark on hdfs gives more flexibility in terms of large datasets
>> and performance.  The very nature of cassandra's distributed nature vs
>> partitioning data on hadoop makes spark on hdfs actually faster than on
>> cassandra....
>>
>>
>>
>> --
>> *Colin Clark*
>>  +1 612 859 6129
>> Skype colin.p.clark
>>
>> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>>
>>
>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <
>> mvallemil...@bloomberg.net> wrote:
>>
>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>
>>
>> Even better, you can use Spark/Shark with DSE.
>>
>>  Cheers,
>> Jens
>>
>>
>>  --
>>  Jens Rantil
>> Backend engineer
>> Tink AB
>>
>>  Email: jens.ran...@tink.se
>> Phone: +46 708 84 18 32
>> Web: www.tink.se
>>
>>  Facebook <https://www.facebook.com/#%21/tink.se> Linkedin
>> <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
>>  Twitter <https://twitter.com/tink>
>>
>>
>>
>
>


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook <https://www.facebook.com/#!/tink.se> Linkedin
<http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
 Twitter <https://twitter.com/tink>
