Re: implementing a 'sorted set' on top of cassandra

Benjamin Roth Tue, 17 Jan 2017 10:44:09 -0800

That was what I had in mind. Which solution (populate on demand,
pre-populate) really fits your needs depends on
- write frequency
- required cache expiration time
- read frequency
- ratio of written / read sets


If you prefer an event-based approach, the Kafka solution Jon proposed could
be quite interesting.
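
To make the populate-on-demand variant a bit more concrete, here is a minimal Python sketch. It assumes the key -> score pairs of one set can be read from a single Cassandra partition into a dict. The names `OnDemandTopN` and `load_partition` are made up for illustration, and a plain in-memory dict stands in for redis, so this shows the idea rather than a real client integration.

```python
import heapq
import time
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, int]
Ranked = List[Tuple[str, int]]


def top_n(scores: Scores, n: int) -> Ranked:
    """Return the n highest-scoring (key, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])


class OnDemandTopN:
    """Hypothetical read cache: fills a set's top N on first read, then
    serves it until the entry expires. A plain dict stands in for redis."""

    def __init__(self, load_partition: Callable[[str], Scores],
                 n: int = 100, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_partition   # assumed: reads one full partition
        self._n = n
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Ranked]] = {}

    def get(self, set_id: str) -> Ranked:
        now = self._clock()
        entry = self._cache.get(set_id)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit, still fresh
        # Miss or expired: load the whole set, sort once, cache the top N.
        ranked = top_n(self._load(set_id), self._n)
        self._cache[set_id] = (now + self._ttl, ranked)
        return ranked
```

Sets that are never read are never sorted or cached, which is the point when only a small share of the sets is read at all. The expiration time bounds how stale a served top list can be.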

2017-01-17 19:10 GMT+01:00 Jonathan Haddad <j...@jonhaddad.com>:

> You could store the key -> score pairs in Cassandra, pull out the full
> partition and repopulate the cache in redis with the top N whatever you
> need.  I'd only read the Cassandra values directly in order to repopulate
> the cache.
>
> I wouldn't try to store the score -> key values, the perf will be a
> nightmare.
>
> On Tue, Jan 17, 2017 at 8:47 AM Mike Torra <mto...@demandware.com> wrote:
>
>> Thanks for the feedback everyone! Redis `zincrby` and `zrangebyscore` are
>> indeed what we use today.
>>
>> Caching the resulting 'sorted sets' in redis is exactly what I plan to
>> do. There will be tens of thousands of these sorted sets, each generally
>> with <10k items (with maybe a few exceptions going a bit over that). The
>> reason to periodically calculate the set and store it in cassandra is to
>> avoid having the client do that work, when the client only really cares
>> about the top 100 or so items at any given time. Being truly "real time" is
>> not critical for us, but it is a selling point to be as up to date as
>> possible.
>>
>> I'd like to understand the performance issue of frequently updating these
>> sets. I understand that every time I 'regenerate' the sorted set, any rows
>> that change will create a tombstone - for example, if "item_1" is in first
>> place and "item_2" is in second place, then they switch on the next update,
>> that would be two tombstones. Do you think this will be a big enough
>> problem that it is worth doing the sorting work client side, on demand, and
>> just try to eat the performance hit there? My thought was to make a
>> tradeoff by using more cassandra disk space (ie pre calculating all sets),
>> in exchange for faster reads when requests actually come in that need this
>> data.
>>
>> From: Benjamin Roth <benjamin.r...@jaumo.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Saturday, January 14, 2017 at 1:25 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: implementing a 'sorted set' on top of cassandra
>>
>> Mike mentioned "increment" in his initial post. That made me think of a
>> case with increments and fetching a top list by a counter like
>> https://redis.io/commands/zincrby
>> https://redis.io/commands/zrangebyscore
>>
>> 1. Cassandra is absolutely not made to sort by a counter (or a
>> non-counter numeric incrementing value) but it is made to store counters.
>> In this case a partition could be seen as a set.
>> 2. I thought of CS for persistence and - depending on the app
>> requirements like real-time and set size - still use redis as a read cache
>>
>> 2017-01-14 18:45 GMT+01:00 Jonathan Haddad <j...@jonhaddad.com>:
>>
>> Sorted sets don't have a requirement of incrementing / decrementing.
>> They're commonly used for things like leaderboards where the values are
>> arbitrary.
>>
>> In Redis they are implemented with 2 data structures for efficient
>> lookups of either key or value. No getting around that as far as I know.
>>
>> In Cassandra they would require using the score as a clustering column in
>> order to select top N scores (and paginate). That means a tombstone
>> whenever the value for a key in the set changes. In sets with high rates of
>> change that means a lot of tombstones and thus terrible performance.
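
The tombstone cost described above can be sketched in a few lines of Python. The sketch assumes a hypothetical table where each row is keyed by (score, key) within the set's partition. `plan_score_update` is an illustrative helper, not part of any driver. It only computes which rows would have to be deleted and inserted when scores change.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (score, key): score is the clustering column


def plan_score_update(old: Dict[str, int],
                      new: Dict[str, int]) -> Tuple[List[Row], List[Row]]:
    """Return (deletes, inserts) needed to move a set from old to new scores.

    Because the score is part of the primary key, a changed score cannot be
    updated in place: the old row is deleted (one tombstone) and a new row
    is inserted.
    """
    deletes: List[Row] = []
    inserts: List[Row] = []
    for key, old_score in old.items():
        if new.get(key) != old_score:
            deletes.append((old_score, key))     # tombstone for the old row
    for key, new_score in new.items():
        if old.get(key) != new_score:
            inserts.append((new_score, key))
    return deletes, inserts
```

Every score change costs one delete, so a set whose items change on every update produces one tombstone per item per update.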
>>
>> On Sat, Jan 14, 2017 at 9:40 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>> Sorting on an "incremented" numeric value has always been a nightmare to
>> do properly in C*
>>
>> Either use the Counter type, but then no sorting is possible, since a counter
>> cannot be used as the type of a clustering column (which is what allows sorting)
>>
>> Or use a simple numeric type on a clustering column, but then incrementing the
>> value *concurrently* and *safely* is prohibitive (SELECT to fetch the current
>> value + UPDATE ... IF value = <old_value>, plus retries)
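
The read, conditional write, retry pattern mentioned above looks roughly like the following Python sketch. `FakeScoreTable` is a hypothetical in-memory stand-in for a table with a conditional update, so it shows only the control flow and not a real driver call. In Cassandra each conditional write is a lightweight transaction, which is considerably more expensive than a plain write.

```python
import threading
from typing import Dict


class FakeScoreTable:
    """In-memory stand-in for a table with a conditional (IF) update."""

    def __init__(self) -> None:
        self._rows: Dict[str, int] = {}
        self._lock = threading.Lock()

    def select(self, key: str) -> int:
        with self._lock:
            return self._rows.get(key, 0)

    def update_if(self, key: str, expected: int, new_value: int) -> bool:
        """Apply the write only if the stored value still equals expected."""
        with self._lock:
            if self._rows.get(key, 0) != expected:
                return False
            self._rows[key] = new_value
            return True


def increment(table: FakeScoreTable, key: str, delta: int,
              max_retries: int = 1000) -> int:
    """Read, conditionally write, and retry when another writer won."""
    for _ in range(max_retries):
        current = table.select(key)
        if table.update_if(key, current, current + delta):
            return current + delta
    raise RuntimeError(f"gave up incrementing {key!r} after {max_retries} tries")
```

Under contention every failed attempt costs another read plus another conditional write, so the cost per increment grows with the number of concurrent writers.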
>>
>>
>>
>> On Sat, Jan 14, 2017 at 8:54 AM, Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>> Whether your proposed solution is crazy depends on your needs :)
>> It sounds like you can live with not-realtime data. So it is ok to cache
>> it. Why preproduce the results if you only need 5% of them? Why not use
>> redis as a cache with expiring sorted sets that are filled on demand from
>> cassandra partitions with counters?
>> That way redis has much less to do and can scale much better. And you are
>> not limited to keeping all data in RAM, as cache data is volatile and can be
>> evicted on demand.
>> Whether this is effective also depends on the size of your sets. CS won't be
>> able to sort them by score for you, so you will have to load the complete
>> set to redis for caching and / or do sorting in your app on demand. This
>> certainly won't work out well with sets with millions of entries.
>>
>> 2017-01-13 23:14 GMT+01:00 Mike Torra <mto...@demandware.com>:
>>
>> We currently use redis to store sorted sets that we increment many, many
>> times more than we read. For example, only about 5% of these sets are ever
>> read. We are getting to the point where redis is becoming difficult to
>> scale (currently at >20 nodes).
>>
>> We've started using cassandra for other things, and now we are
>> experimenting to see if having a similar 'sorted set' data structure is
>> feasible in cassandra. My approach so far is:
>>
>>    1. Use a counter CF to store the values I want to sort by
>>    2. Periodically read in all key/values in the counter CF and sort in
>>    the client application (~every five minutes or so)
>>    3. Write back to a different CF with the ordered keys I care about
>>
>> Does this seem crazy? Is there a simpler way to do this in cassandra?
>>
>>
>>
>>
>> --
>> Benjamin Roth
>> Prokurist
>>
>> Jaumo GmbH · www.jaumo.com
>> Wehrstraße 46 · 73035 Göppingen · Germany
>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>
>>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
