I like how in this medium people can talk behind your back and in your face
at the same time! :P

I actually invested about two weeks (both at work AND at home),
experimenting with MANY different options for storing and retrieving data
in redis, using all the structure-types, with both generic
(procedurally-generated) data and our own real-world data. It started out
as a pet-project, but it mushroomed into a very detailed and flexible
"py-redis-benchmarking tool", which I have every intention of sharing on
github - I think it's over 1k LOC already... You basically tell it which
benchmark-combination(s) you wish to run, and it prints the results in a
nicely-organized table. If you choose the procedurally-generated data (for
synthetic benchmarking), you can define each of its 3 dimensions (keys,
records, fields), to see how each affects each redis storage-option
(lists, sets, hashes, etc.). So you can get a feel for how "scale"
influences the benefits/trade-offs of each storage-option. I think I will
add graph-plotting for IPython, just for the fun of it...
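To give a feel for the shape of such a tool (all names here are mine, not
the actual code): each benchmark run is just a point in the (keys,
records, fields) grid, crossed with a storage option:

```python
from itertools import product

STORAGE_OPTIONS = ("string", "list", "set", "hash")  # redis structure-types
DIMENSIONS = {"keys": (10, 100), "records": (10, 1000), "fields": (5, 50)}

def combinations(options=STORAGE_OPTIONS, dims=DIMENSIONS):
    """Yield every (storage_option, keys, records, fields) combination to run."""
    for opt in options:
        for keys, records, fields in product(*dims.values()):
            yield opt, keys, records, fields
```

Each yielded tuple would drive one timed store/retrieve pass, so scaling
one dimension while holding the others fixed shows its effect in isolation.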

In conclusion:
A major performance-factor is the number of round-trips to redis, so I
made heavy use of "pipeline". But it turns out that the next major
performance-factor after that is the manipulation the data needs in
python, pre-store and post-retrieval, in order to fit it into the redis
structures. Turns out that - at least for bulk store/retrieval
(pipeline usage) - the overhead of fitting a data structure into redis
outweighs the benefits, sometimes by orders of magnitude. Perhaps if an
application is written to use redis as a database it would be worth it, as
updating a specific value "nested" inside a redis structure "may" be
faster than having to pull an entire "key" with serialized data - but
that's not the use-case we're talking about for "caching" in web2py.

So, the *tl;dr* version of it, is:
"A flat key-value store of serialized data is fastest for
bulk store/retrieval"

* Especially when using "hiredis" (a C-compiled reply parser with a
Python wrapper, used by the redis-py client) - that's orders of magnitude
faster...

Then I went on to testing many serialization formats/libraries:
- json (the stdlib module, pure-python)
- simplejson (with C-compiled optimizations)
- cjson (a C-compiled library with a Python wrapper)
- ujson (a C-compiled library with a Python wrapper)
- pickle (pure-python)
- cPickle (the C-compiled implementation of pickle)
- msgpack (with C-compiled optimizations)
- u-msgpack (pure-python)
- marshal (the stdlib's internal serializer)

Results:
- All the pure-python options are slowest (unsurprising).
- simplejson is almost as fast as cjson when its C-compiled optimizations
are used, and is better maintained, so no use for cjson.
- cPickle is almost as fast as marshal, and is platform/version agnostic,
so no use for marshal.
- ujson is only faster than simplejson for very long (and flat) lists,
and is less maintained/popular/mature.

So, that leaves us with:
- simplejson
- cPickle
- msgpack

- cPickle is actually the "slowest" of the three, AND is Python-only.
- With either simplejson or msgpack, you can read the data from redis
from non-Python clients, AND they both (surprisingly) handle unicode
really well.
- msgpack is roughly 2x faster than simplejson, but is less readable in a
redis GUI.
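A minimal harness in the spirit of those comparisons, using only the
stdlib serializers (simplejson/msgpack timings would slot in the same
way; the sample record is made up):

```python
import json
import pickle
import timeit

RECORD = {"id": 1, "name": "item", "tags": ["a", "b"], "price": 9.99}
DATA = [dict(RECORD, id=i) for i in range(1000)]  # a bulk payload of 1000 records

def time_roundtrip(dumps, loads, n=50):
    """Seconds for n serialize+deserialize passes over DATA."""
    return timeit.timeit(lambda: loads(dumps(DATA)), number=n)

json_t = time_roundtrip(json.dumps, json.loads)
pickle_t = time_roundtrip(pickle.dumps, pickle.loads)
```

Swapping in each candidate's dumps/loads pair gives an apples-to-apples
round-trip number for the same payload.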

However:
With simplejson or msgpack, once you introduce "datetime" values you need
to post-process the results in Python by hooking into the parsers... Once
you do that, all the performance gain evaporates...
So cPickle becomes fastest, as it generates the Python "datetime" objects
at the C level...
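These are the kind of hooks I mean, shown for the stdlib json module (a
sketch; the "__dt__" sentinel key is my own convention):

```python
import json
from datetime import datetime

def encode_dt(obj):
    """json.dumps 'default' hook: tag datetimes so they survive round-trips."""
    if isinstance(obj, datetime):
        return {"__dt__": obj.isoformat()}
    raise TypeError("not JSON serializable: %r" % obj)

def decode_dt(d):
    """json.loads 'object_hook': called in Python for EVERY decoded object."""
    if "__dt__" in d:
        return datetime.fromisoformat(d["__dt__"])
    return d

blob = json.dumps({"created": datetime(2014, 2, 18, 21, 27)}, default=encode_dt)
restored = json.loads(blob, object_hook=decode_dt)
```

Because the object_hook fires for every dict the parser produces, the
decode path drops back into Python per-object, which is exactly where the
C-level speed advantage is lost.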

So I ended up where I started, coming full circle back to flat keys with
cPickle...

The only benefit I ended up gaining is from refactoring our high-level
cache-data-structure on top of redis_cache.py, which does bulk-retrieval
and smart refreshes - but I'm not sure I can share that code...

We are now doing a bulk-get of our entire redis cache on every request.
It has over 100 keys, some very small and some with hundreds of
nested records. We narrowed it down to 16 ms per request (best-case),
which is good enough for me.

We basically have a class in a module, which instantiates a
non-thread-local singleton once per process. It holds an
ordered-dictionary mapping "keys" to "lambdas" - we call it the
"cache-catalog". The results are stored in a regular (thread-local)
dictionary, mapping the keys to their respective resultant values. On
each request, a bulk-get is issued with a list of all the keys (which we
already have - it's the catalog's keys plus the "w2py:<app>:" prefix, so
we don't even need to store them in redis in a separate set... And we
still avoid the infamous "KEYS" redis command...). Since the catalog is
an ordered-dictionary, we know which value in the result maps to which
key, so the "None" values represent the keys currently "missing" in
redis, due to a deletion triggered by a cache-update on another
request/thread/process. We then run through that list of "missing keys"
in a regular for-loop, generating new values via the regular
cache-mechanism (which triggers the lambdas) - so we only update what's
missing.
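A rough sketch of that mechanism (all names are mine - the real code
stays private - and it assumes a redis-py-style client):

```python
import pickle
from collections import OrderedDict

class CacheCatalog:
    """Per-process catalog of cache keys mapped to refresh lambdas."""

    def __init__(self, client, prefix="w2py:app:"):
        self.client = client
        self.prefix = prefix
        self.catalog = OrderedDict()  # key -> zero-arg callable producing the value

    def register(self, key, producer):
        self.catalog[key] = producer

    def bulk_get(self):
        """One MGET for every catalog key; regenerate only the missing ones."""
        keys = list(self.catalog)
        blobs = self.client.mget([self.prefix + k for k in keys])
        results = {}
        for key, blob in zip(keys, blobs):
            if blob is None:                 # evicted by another worker/request
                value = self.catalog[key]()  # regenerate via the lambda
                self.client.set(self.prefix + key, pickle.dumps(value))
            else:
                value = pickle.loads(blob)
            results[key] = value
        return results
```

Because the catalog preserves registration order, the positional MGET
result lines up with the key list, and only the None slots cost anything
beyond deserialization.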

This turns out to be extremely efficient, fast and resilient.
I suggest this approach be factored into the redis_cache.py file itself
somehow...
Not sure I can share our code though... (legally...)

Anyways, hope this sums up the topic, and hope some people learned
something from this summary of my experience.
If not, hey, what do I know, I'm just an "idea guy" after all, right? :P

I'll be posting a link to the git-repo of the benchmark-code in a few days,
after I clean it up a bit...


On Tue, Feb 18, 2014 at 9:27 PM, Derek <sp1d...@gmail.com> wrote:

> >endless arguments just to "win"?
>
> I don't think it's that, I think that people who consider themselves "idea
> men" are people who are generally lazy who don't want to do any of the
> work, but want to take credit for it. They discount the amount of time that
> developers put into a project and state that they could do it better (if
> they could just be bothered to implement their idea, which happens to be
> too simple for them to bother with.) I was merely suggesting that the best
> way to handle such people is to say 'it is a wonderful idea! people might
> steal it! better be the first to implement it yourself and then patent it!'
> What I've seen is that they usually shut up about their great 'new idea'
> and maybe they learn that programming isn't as easy or 'simple' as they
> thought it was.
>
>
> On Thursday, February 6, 2014 3:44:37 PM UTC-7, Jufsa Lagom wrote:
>>
>> Hello Arnon.
>>
>> I just made a quick search of your posts on the other groups on
>> groups.google.com..
>>
>> On many (almost all) groups that you have made posts, you run into
>> arguments with longtime members/contributors that have put down huge amount
>> of time in the projects.
>>
>> You say yourself in many posts, that you are inexperienced in the subject
>> that are being discussed?
>> Then, perhaps it's good to take a more humble approach when addressing
>> your questions/statements?
>> I can only speak for myself, that I should at least pick that approach if
>> I had a question to the community..
>>
>> Don't misunderstand me, It's always good with new ideas and fresh
>> insights..
>> But when meeting massive resistance in a community about an idea that
>> doesn't seem to get any traction, then perhaps that idea shouldn't be
>> forced with endless arguments just to "win"?
>>
>> Sorry for the OT, and this is just a friendly hint from an old news user
>> :)
>>
>> --
>> Kind Regards
>> Jufsa Lagom
>>
>> On Thursday, January 16, 2014 11:57:05 PM UTC+1, Arnon Marcus wrote:
>>>
>>> Derek: Are you being sarcastic and mean?
>>>
>>>
>>>
>>>> cache doesn't cache only resultsets, hence pickle is the only possible
>>>> choice.
>>>>
>>>>
>>>
>>> Well, not if you only need flat and basic objects - there the benefit of
>>> pickle is moot and its overhead is obvious - take a look at this project:
>>> https://redis-collections.readthedocs.org/en/latest/
>>>
>>>
>>>> It's cool. Actually, I started developing something like that using DAL
>>>> callbacks, but as soon as multiple tables are involved with FK and such, it
>>>> starts to lose "speed". Also, your whole app needs to be coded a-la
>>>> "ActiveRecord", i.e. fetch only by PK.
>>>>
>>>
>>> Hmmm... Haven't thought of that... Well, you can't search/query for
>>> specific records by their hashed-values, but that's not the use-case I was
>>> thinking about - I am not suggesting "replacing" the dal... Plus, that
>>> restriction would also exist when using pickles for such a use-case...
>>> What I had in mind is simpler than that - just have a bunch of simple
>>> queries that you would do in your cache.ram anyways, and instead have their
>>> "raw" result-set (before being parsed into "rows" objects) and cached as-is
>>> (almost...) - that would be faster to load-in the cache than into
>>> cache.ram, and also faster for retrieval.
>>>
>>>
>>>> BTW, I'm not properly sure that fetching 100 records with 100 calls to
>>>> redis vs pulling a single time a pickle of 1000 records and discarding what
>>>> you don't need is faster.
>>>>
>>>
>>> Hmmm... I don't know, redis is famous for crunching somewhere in the
>>> order of 500K requests per-second - have you tested it?
>>>
>>>
>>>> BTW2: ORM are already there: redisco and redis-lympid
>>>>
>>>
>>> 10x, I'll take a look - though I think an ORM would defeat the purpose
>>> (in terms of speed) and would be overkill...
>>>
>>  --
> Resources:
> - http://web2py.com
> - http://web2py.com/book (Documentation)
> - http://github.com/web2py/web2py (Source code)
> - https://code.google.com/p/web2py/issues/list (Report Issues)
> ---
> You received this message because you are subscribed to a topic in the
> Google Groups "web2py-users" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/web2py/im3pZuKWkWI/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> web2py+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
