> No, I haven't tested using 2 instances... for now I have it resolved
> using just DUMP/GET... which is not so good in my opinion either (but
> works and isn't affecting performance for my small instance), so I'm
> trying to find some better way. The short-expire list seems good; maybe
> this will work... Rewriting the app to use 2 instances is not an
> option for me because staging data has (in many cases) the same length
> as normal cache data, so I'd have to check all calls to memcached in
> the whole code.

Grep your code for short expiration times, add a key prefix for those,
and extend your client with set/get calls that route to the "real" client
based on key prefix? (Then optionally remove the key prefix before
sending, to shave some bytes.) You shouldn't have to rewrite the
application code if you can change the client out from under it.
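
Rough sketch of what I mean (untested; pymemcache, the "st:" prefix, and
the two ports are just placeholders for whatever client and layout you
actually use):

# Wrapper that looks like one memcached client, but routes short-expire
# keys (marked with a prefix) to a second, dedicated instance.
from pymemcache.client.base import Client

SHORT_PREFIX = "st:"   # whatever prefix your grep/rename pass adds

class PrefixRoutingClient(object):
    def __init__(self,
                 normal_addr=("127.0.0.1", 11211),
                 short_addr=("127.0.0.1", 11212)):
        self.normal = Client(normal_addr)
        self.short = Client(short_addr)

    def _route(self, key):
        # Strip the prefix before sending, to shave a few bytes per key.
        if key.startswith(SHORT_PREFIX):
            return self.short, key[len(SHORT_PREFIX):]
        return self.normal, key

    def set(self, key, value, expire=0):
        client, real_key = self._route(key)
        return client.set(real_key, value, expire=expire)

    def get(self, key):
        client, real_key = self._route(key)
        return client.get(real_key)

The application keeps calling set/get exactly as before; only the spot
where the client object is constructed changes.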

> In many cases I have to make assumptions and ask someone with
> greater experience, because I have no time to implement / test
> everything; unfortunately I can't devote much time to working
> with memcached code (and there are too many possible resolutions to
> the problem... most of which probably won't work) :)

Your above reasoning is a better start than shrugging off the idea. I
always try to push for the most flexible, simplest method first.

> So I think, since you've said that everything I thought about won't
> work... I'll just do a quick patch and modify dump to return only
> expired items using a smaller buffer... should be a little (not much, I
> know) better. Or maybe implement the short-expire list in syslog if I
> have more time.

The syslog one sounds pretty easy.
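
Something like this, roughly (untested sketch; assuming the same
pymemcache-style client as above, and the syslog tag, log path, and
10-second threshold are just placeholders): mirror every short-expire SET
to syslog from the client, then a tiny tailer issues a get once each key
should be dead, so memcached itself never has to scan anything.

import syslog
import time

SHORT_TTL = 10  # anything at or below this counts as "short-expire"

# Producer side (in the app/client): log key + absolute expiry time.
syslog.openlog("memcached-expiry")

def set_with_log(client, key, value, expire=0):
    if 0 < expire <= SHORT_TTL:
        syslog.syslog(syslog.LOG_INFO,
                      "%s %d" % (key, int(time.time()) + expire))
    return client.set(key, value, expire=expire)

# Consumer side (separate process): tail wherever syslog routes those
# lines and fetch each key after it expires; the fetch is what actually
# frees the expired item inside memcached.
def reap_expired(client, logfile="/var/log/memcached-expiry.log"):
    with open(logfile) as f:
        f.seek(0, 2)                      # start at the end, like tail -f
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            try:
                key, expires_at = line.split()[-2:]
                delay = int(expires_at) - time.time()
            except ValueError:
                continue
            if delay > 0:
                time.sleep(delay)         # lines arrive roughly in expiry order
            client.get(key)

You still pay for all the extra gets, but nothing blocks inside the daemon.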

> Btw: how large are the instances you're running (and how many req/s)? You
> said you'll keep 3 or more LRUs in the new version; any other improvements?

Lots of ideas, on a roughly monthly release cycle. See the release notes
page on our wiki for the slow crawl; but I don't have a real "roadmap" at
the moment.

> > I see too many alternatives that have the potential to work far better.
> Can you talk about some more? That's interesting.

Those were the examples I gave already: syslog, extending the client, or
writing a storage engine with an internal LRU split. I'm sure I could come
up with more, but I have other things to think about right now :P

> On 26 Feb, 21:31, dormando <dorma...@rydia.net> wrote:
> > > For the "running multiple copies"... I'm using persistent connections,
> > > but are you sure the amount of TCP communication will be good for
> > > performance?
> >
> > Have you tested it? You're making an awful lot of assumptions and seem to
> > be really itching to go modify some core code and deploy it. Why not
> > *test* the simplest ideas first and move on when you have to?
> >
> > > I mean, even locking a whole 1MB slab and
> > > scanning it? Will it take more than 1 ms on a modern machine? Besides,
> > > it's complicated to rewrite the application like this.
> >
> > If you're blocking the daemon at all, you're causing anything that would
> > be running in parallel to block for that 1ms. For really low request rates
> > that's fine, but we must support much more than that.
> >
> > > @Dormando... why do you call it "bizarre"? :) Rebalancing slabs shouldn't
> > > be much different.
> >
> > Because it's a corner case, and your solution is to do a *ton* of work. So
> > much so that it walks into another corner case itself; someone with a 192G
> > cache with 100 million entries that are all *valid* would end up
> > constantly locking/unlocking the cache while never invalidating anything.
> > Your tradeoff just moves the corner case to another area; my
> > complaint is that it's not sufficiently generic for us to ship.
> >
> > > What do you think about forking the app? (I mean forking the in-memory
> > > process.) It should work well on a modern kernel without locking, because
> > > you have copy-on-write? Maybe locking and then copying a whole single
> > > slab? I can allocate some buffer the size of a single slab,
> > > then use LOCK, copy ONE slab into the buffer, and use another thread to
> > > build a list of items we can remove. Copying e.g. 1 MB of memory should
> > > happen in no time.
> >
> > I had some test code which memcpy'd about 2k of memory 5 times per second
> > while holding a stats lock, and that cut the top throughput by at least
> > 5%. The impact was worse than that, since the test code had removed dozens
> > of (uncontested) mutex lock calls and replaced them with the tiny memcpy.
> >
> > > Generally, you think I should move the cleanup into the storage engine?
> > > How advanced is that (production ready)?
> >
> > > > The worst we do is in slab rebalance, which holds a slab logically
> > > > and glances at it with tiny locks.
> > > The good thing about cleanup is that you won't have to use tiny locks
> > > (I think). Just lock the slab, copy the memory, and then wake up some
> > > thread to take a look, add the keys to some list, then just process the
> > > list from time to time (or am I wrong?)
> >
> > > Can you give me some pointers please?
> >
> > > For now I see you're using:
> > > it = heads[slabs_clsid];
> > > and then iterating with it = it->next;
> >
> > > That's probably why you say it's too slow... but what if we just
> > > lock => copy one slab's memory => unlock => analyze slab => [make 100 get
> > > requests => sleep] and repeat? We have fixed-size items in a slab, so we
> > > know exactly where the key and expiration time are, right?
> >
> > I tried to explain the method I'd been thinking of for doing this most
> > efficiently, but you seem to be ignoring that. There's just no way in hell
> > we'll ever ship something that issues requests against itself or forks or
> > copies memory around to scan them.
> >
> > Here are some reasons, and then even more alternatives (since your request
> > rate is really low):
> >
> > 1) The most common use case has a mix of reads and writes, not a batch of
> > writes followed by batch reads (which is what you're doing). That means
> > common keys with a 5-second expiration would get fetched and expired more
> > naturally; everything else would fall through the bottom due to disuse.
> >
> > 2) Tossing huge chunks of memory around and then issuing mass fetches back
> > against itself doesn't test well. Issuing more locks doesn't test well
> > (especially on NUMA; contesting locks or copying memory around causes
> > cacheline flushes, pipeline stalls, cross-CPU memory barriers, etc.). I've
> > tested this: copying 1MB of memory is not fast enough for us if I can't
> > even copy 2k without impacting performance.
> >
> > 3) Issuing extraneous micro locks or scanning does terrible things to
> > large instances for the above reasons. If your traffic pattern *isn't*
> > your particular corner case, everything else gets slower.
> >
> > You could also ship a copy of all your short-expiration SETs to syslog,
> > and have a daemon tailing the syslog and issuing gets as things expire...
> > then you don't need to block the daemon at all, but you're still issuing
> > all those extra gets.
> >
> > But, again, if you're really attached to doing it your way, go ahead and
> > use the engine-pu branch. In a few months memcached will do this better
> > anyway, and I don't agree with the method you're insisting on. I see too
> > many alternatives that have the potential to work far better.