Re: Extstore revival after crash

2023-04-25 Thread 'Danny Kopping' via memcached
> It would be really bad for both of us if you created a mission critical
> backup solution based off of an undocumented, unsupported dataformat which
> potentially changes with version updates.

Oh absolutely haha! This is more of a POC to prove feasibility, and I was
also just curious about what data was actually in the file.

> One of many questions; is this due to cost? (ie; don't want to double the
> cache storage) or some other reason?

Mostly about cost, yeah.

I'll hit you up on Discord

On Mon, Apr 24, 2023 at 11:50 PM dormando  wrote:

> Hey,
>
>
> Aside:
> I'm actually busy trying to parse the datafile with a small Go program to
> try and replay all the data. Solving this warming problem will give us a lot of
> confidence to roll this out in a big way across our infra.
> What're your thoughts on this and the above?
>
>
> It would be really bad for both of us if you created a mission critical
> backup solution based off of an undocumented, unsupported dataformat which
> potentially changes with version updates. I think you may have also
> misunderstood me; the data is actually partially in RAM.
>
> Is there any chance I could get you into the MC discord to chat a bit
> further about your use case? (linked from https://memcached.org/) -
> easier to play 20 questions there. If that's not possible I'll list a bunch
> of questions in the mailing list here instead :)
>
>
> @Javier, thanks for your thoughts here too. Replication is not an option
> for us at this scale; that said, your solution is pretty cool!
>
>
> One of many questions; is this due to cost? (ie; don't want to double the
> cache storage) or some other reason?
>
> On Monday, April 24, 2023 at 1:05:23 PM UTC+2 Javier Arias Losada wrote:
>
>> Hi there,
>>
>> One thing we've done to mitigate this kind of risk is having two copies
>> of every shard in different availability zones in our cloud provider. Also,
>> we run in Kubernetes, so for us nodes leaving the cluster is a relatively
>> frequent occurrence... we are playing with a small process that makes the
>> warmup of new nodes quicker.
>>
>> Since we have more than one copy of the data, we do a warmup process. Our
>> cache nodes are MUCH MUCH smaller... so this approach might not be
>> reasonable for your use-case.
>>
>> This is how our process works. When a new node is restarted, or in any
>> other situation where an empty memcached process starts, our warmup process:
>> 1. locates the warmer node for the shard
>> 2. gets all the keys and TTLs from the warmer node: lru_crawler metadump all
>> 3. traverses the list of keys in reverse (lru_crawler goes from the least
>> recently used, and for this it's better to go from most recent)
>> 4. for each key: gets the value from the warmer node and adds (not sets) it
>> to the cold node, including the TTL.
>>
>> This process might lead to some small data inconsistencies; how important
>> that is will depend on your use case.
>>
>> Since our access patterns are very skewed (a small % of keys gets the
>> biggest % of the traffic, at least for some time), going through the LRU
>> dump in reverse makes the warmup much more effective.
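
(In Go, the warmup loop described above comes out to roughly the sketch below.
This is only a sketch: the warm/cold node addresses are placeholders, the
github.com/bradfitz/gomemcache client is assumed for the get/add half, a plain
TCP connection is used for the lru_crawler call, and error handling/retries
are trimmed.)

  // warmup.go: replay keys from a warm memcached node into a cold one,
  // most recently used first. Sketch only: single connection, no batching.
  package main

  import (
      "bufio"
      "fmt"
      "net"
      "net/url"
      "strconv"
      "strings"

      "github.com/bradfitz/gomemcache/memcache"
  )

  type entry struct {
      key string
      exp int64 // absolute unix timestamp; -1 means "never expires"
  }

  // metadump asks the warm node for every key it knows about via
  // "lru_crawler metadump all" and returns them in the order the crawler
  // emits them (least recently used first).
  func metadump(addr string) ([]entry, error) {
      conn, err := net.Dial("tcp", addr)
      if err != nil {
          return nil, err
      }
      defer conn.Close()

      fmt.Fprintf(conn, "lru_crawler metadump all\r\n")
      var out []entry
      sc := bufio.NewScanner(conn)
      for sc.Scan() {
          line := strings.TrimSpace(sc.Text())
          if line == "END" {
              break
          }
          e := entry{exp: -1}
          for _, f := range strings.Fields(line) {
              switch {
              case strings.HasPrefix(f, "key="):
                  // keys are URL-encoded in the dump output
                  e.key, _ = url.QueryUnescape(strings.TrimPrefix(f, "key="))
              case strings.HasPrefix(f, "exp="):
                  e.exp, _ = strconv.ParseInt(strings.TrimPrefix(f, "exp="), 10, 64)
              }
          }
          if e.key != "" {
              out = append(out, e)
          }
      }
      return out, sc.Err()
  }

  func main() {
      warmAddr, coldAddr := "warm-node:11211", "cold-node:11211" // placeholders
      warm := memcache.New(warmAddr)
      cold := memcache.New(coldAddr)

      entries, err := metadump(warmAddr)
      if err != nil {
          panic(err)
      }

      // Walk the dump from most recently used to least recently used,
      // so the hottest keys land on the cold node first.
      for i := len(entries) - 1; i >= 0; i-- {
          e := entries[i]
          it, err := warm.Get(e.key)
          if err != nil {
              continue // expired or evicted since the dump; skip it
          }
          if e.exp > 0 {
              it.Expiration = int32(e.exp) // absolute unix timestamps pass through as-is
          }
          cold.Add(it) // add, not set: never clobber fresher data on the cold node
      }
  }

(Using add rather than set matches the last step above, so a cold node that
already holds a fresher copy of a key keeps it; the exp= field from the
metadump is what carries the TTL across.)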
>>
>> Best
>> Javier Arias
>> On Sunday, April 23, 2023 at 7:24:28 PM UTC+2 dormando wrote:
>>
>>> Hey,
>>>
>>> Thanks for reaching out!
>>>
>>> There is no crash safety in memcached or extstore; it does look like the
>>> data is on disk but it is actually spread across memory and disk, with
>>> recent or heavily accessed data staying in RAM. Best case you only
>>> recover
>>> your cold data. Further, keys can appear multiple times in the extstore
>>> datafile and we rely on the RAM index to know which one is current.
>>>
>>> I've never heard of anyone losing an entire cluster, but people do try
>>> to
>>> mitigate this by replicating cache across availability zones/regions.
>>> This can be done with a few methods, like our new proxy code. I'd be
>>> happy
>>> to go over a few scenarios if you'd like.
>>>
>>> -Dormando
>>>
>>> On Sun, 23 Apr 2023, 'Danny Kopping' via memcached wrote:
>>>
>>> > First off, thanks for the amazing work @dormando & others!
>>> > Context:
>>> > I work at Grafana Labs, and we are very interested in trying out
>>> > extstore for some very large (>50TB) caches. We plan to split this 50TB
>>> > cache into about 35 different nodes, each with 1.5TB of NVMe & a small
>>> > memcached instance. Losing any given node will result in losing ~3% of
>>> > the overa

Re: Extstore revival after crash

2023-04-24 Thread 'Danny Kopping' via memcached
Thanks for the reply @dormando!

> Best case you only recover your cold data. Further, keys can appear 
multiple times in the extstore datafile and we rely on the RAM index to 
know which one is current.

This is actually perfect for our use-case. We just need a big ol' cache of 
cold data, and we never overwrite keys; they're immutable in our system.
The volume of data we're dealing with is so big that there will be very 
little hotspotting on any particular keys, so I'm intending to force most 
of the data into cold storage.

The cache will be used as a read-through, to protect our upstream service
from which we're loading many millions of files (object storage) - sometimes
up to several hundred thousand RPS.
It's true that it's unlikely that we'll lose everything all at once, and we 
will design for frequent failure, but as ever "hope is not a strategy" 
(although it springs eternal... :))
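
(Shape-wise that read-through path is just get, miss, fetch, fill; a minimal
Go sketch, where fetch stands in for the real object storage client and the
gomemcache client is assumed again:)

  package cache

  import "github.com/bradfitz/gomemcache/memcache"

  // ReadThrough tries memcached first and only falls back to the (expensive)
  // fetch on a miss, filling the cache afterwards so the next reader hits.
  // fetch stands in for the real object storage client.
  func ReadThrough(mc *memcache.Client, fetch func(string) ([]byte, error),
      key string, ttl int32) ([]byte, error) {

      if it, err := mc.Get(key); err == nil {
          return it.Value, nil // cache hit: the upstream never sees the request
      }
      val, err := fetch(key) // the path the cache is protecting
      if err != nil {
          return nil, err
      }
      // Best effort: if the Set fails, the next reader simply misses too.
      mc.Set(&memcache.Item{Key: key, Value: val, Expiration: ttl})
      return val, nil
  }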

Aside:
I'm actually busy trying to parse the datafile with a small Go program to 
try and replay all the data. Solving this warming problem will give us a lot of
confidence to roll this out in a big way across our infra.
What're your thoughts on this and the above?

@Javier, thanks for your thoughts here too. Replication is not an option 
for us at this scale; that said, your solution is pretty cool!

On Monday, April 24, 2023 at 1:05:23 PM UTC+2 Javier Arias Losada wrote:

> Hi there,
>
> One thing we've done to mitigate this kind of risk is having two copies of
> every shard in different availability zones in our cloud provider. Also, we
> run in Kubernetes, so for us nodes leaving the cluster is a relatively
> frequent occurrence... we are playing with a small process that makes the
> warmup of new nodes quicker.
>
> Since we have more than one copy of the data, we do a warmup process. Our 
> cache nodes are MUCH MUCH smaller... so this approach might not be 
> reasonable for your use-case.
>
> This is how our process works. When a new node is restarted, or in any
> other situation where an empty memcached process starts, our warmup process:
> 1. locates the warmer node for the shard
> 2. gets all the keys and TTLs from the warmer node: lru_crawler metadump all
> 3. traverses the list of keys in reverse (lru_crawler goes from the least
> recently used, and for this it's better to go from most recent)
> 4. for each key: gets the value from the warmer node and adds (not sets) it
> to the cold node, including the TTL.
>
> This process might lead to some small data inconsistencies; how important
> that is will depend on your use case.
>
> Since our access patterns are very skewed (a small % of keys gets the
> biggest % of the traffic, at least for some time), going through the LRU
> dump in reverse makes the warmup much more effective.
>
> Best
> Javier Arias
> On Sunday, April 23, 2023 at 7:24:28 PM UTC+2 dormando wrote:
>
>> Hey, 
>>
>> Thanks for reaching out! 
>>
>> There is no crash safety in memcached or extstore; it does look like the 
>> data is on disk but it is actually spread across memory and disk, with 
>> recent or heavily accessed data staying in RAM. Best case you only 
>> recover 
>> your cold data. Further, keys can appear multiple times in the extstore 
>> datafile and we rely on the RAM index to know which one is current. 
>>
>> I've never heard of anyone losing an entire cluster, but people do try to 
>> mitigate this by replicating cache across availability zones/regions. 
>> This can be done with a few methods, like our new proxy code. I'd be 
>> happy 
>> to go over a few scenarios if you'd like. 
>>
>> -Dormando 
>>
>> On Sun, 23 Apr 2023, 'Danny Kopping' via memcached wrote: 
>>
>> > First off, thanks for the amazing work @dormando & others! 
>> > Context: 
>> > I work at Grafana Labs, and we are very interested in trying out
>> > extstore for some very large (>50TB) caches. We plan to split this 50TB
>> > cache into about 35 different nodes, each with 1.5TB of NVMe & a small
>> > memcached instance. Losing any given node will result in losing ~3% of
>> > the overall cache, which is acceptable; however, if we lose all nodes at
>> > once somehow, losing all of our cache will be pretty bad and will put
>> > severe pressure on our backend.
>> >
>> > Ask:
>> > Having looked at the file that extstore writes on disk, it looks like it
>> > has both keys & values contained in it. Would it be possible to "re-warm"
>> > the cache on startup by scanning this data and resubmitting it to itself?
>> > We could then add some condition to our readiness check in k8s to wait
>> > until the d

Extstore revival after crash

2023-04-23 Thread 'Danny Kopping' via memcached
First off, thanks for the amazing work @dormando & others!

Context:
I work at Grafana Labs, and we are very interested in trying out extstore
for some very large (>50TB) caches. We plan to split this 50TB cache into
about 35 different nodes, each with 1.5TB of NVMe & a small memcached
instance. Losing any given node will result in losing ~3% of the overall
cache, which is acceptable; however, if we lose all nodes at once somehow,
losing all of our cache will be pretty bad and will put severe pressure on
our backend.

Ask:
Having looked at the file that extstore writes on disk, it looks like it
has both keys & values contained in it. Would it be possible to "re-warm"
the cache on startup by scanning this data and resubmitting it to itself?
We could then add some condition to our readiness check in k8s to wait
until the data is all re-warmed, and only then allow traffic to flow to
those instances. Is this feature planned anytime soon?
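
(The readiness-gate half of that is straightforward; a hypothetical sketch, in
which a warmer goroutine flips a flag once replay finishes and the k8s
readinessProbe points an httpGet at /ready:)

  package main

  import (
      "net/http"
      "sync/atomic"
  )

  // warmed flips to true once the re-warm pass has finished replaying
  // data back into the local memcached instance.
  var warmed atomic.Bool

  // rewarmCache is a stand-in for the actual replay logic.
  func rewarmCache() {}

  func main() {
      go func() {
          rewarmCache()
          warmed.Store(true)
      }()

      // The k8s readinessProbe does an httpGet against /ready, so the pod
      // only receives traffic once warmup has completed.
      http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
          if warmed.Load() {
              w.WriteHeader(http.StatusOK)
              return
          }
          w.WriteHeader(http.StatusServiceUnavailable)
      })
      http.ListenAndServe(":8080", nil)
  }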

Thanks!



Re: "Out of memory during read" errors instead of key eviction

2022-11-28 Thread 'Danny Kopping' via memcached
To add another datapoint here, we at Grafana Labs use memcached extensively 
in our cloud and this fix made a massive impact on our cache effectiveness:
https://user-images.githubusercontent.com/373762/204228886-7c5a759a-927c-46fb-ae55-3e0b4056ebae.png

Thank you very much to you both for the investigation and bugfix!

On Saturday, August 27, 2022 at 8:53:47 AM UTC+2 Dormando wrote:

> Thanks for taking the time to evaluate! It helps my confidence level with
> the fix.
>
> You caught me at a good time :) Been really behind with fixes for quite a
> while and only catching up this week. I've looked at this a few times and
> didn't see the easy fix before...
>
> I think earlier versions of the item chunking code were more fragile and I
> didn't revisit it after the cleanup work. In this case each chunk
> remembers its original slab class, so having the final chunk be from an
> unintended class doesn't break anything. Otherwise freeing the chunks
> would be impossible if I had to recalculate their original slab class from
> the chunk size.
>
> So now it'll use too much memory in some cases, and lowering slab chunk
> max would ease that a bit... so maybe soon will finally be a good time to
> lower the default chunk max a little to at least 128k or 256k.
>
> -Dormando
>
> On Fri, 26 Aug 2022, Hayden wrote:
>
> > I didn't see the docker files in the repo that could build the docker
> > image, and when I tried cloning the git repo and doing a docker build I
> > encountered errors that I think were related to the web proxy on my work
> > network. I was able to grab the release tarball and the bitnami docker
> > file, do a little surgery to work around my proxy issue, and build a
> > 1.6.17 docker image though.
> > I ran my application against the new version and it ran for ~2hr without
> > any errors (it previously wouldn't run more than 30s or so before
> > encountering blocks of the "out of memory during read" errors). I also
> > made a little test loop that just hammered the instance with similar-sized
> > writes (1-2MB) as fast as it could and let it run a few hours, and it
> > didn't have a single blip. That encompassed a couple million evictions.
> > I'm pretty comfortable saying the issue is fixed, at least for the kind of
> > use I had in mind.
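
(A loop like that is only a handful of lines with any memcached client;
roughly, with the address and key space made up and gomemcache assumed:)

  package main

  import (
      "fmt"
      "math/rand"

      "github.com/bradfitz/gomemcache/memcache"
  )

  func main() {
      // Test instance; it needs its max item size raised above the default
      // 1MB (e.g. memcached -I 4m) for values this large to be accepted.
      mc := memcache.New("127.0.0.1:11211")

      for i := 0; ; i++ {
          // 1-2MB of random bytes, mirroring the large chunked items
          // that triggered the "out of memory during read" errors.
          val := make([]byte, 1<<20+rand.Intn(1<<20))
          rand.Read(val)

          // A bounded key space keeps the instance full and forces evictions.
          key := fmt.Sprintf("stress:%d", i%10000)
          if err := mc.Set(&memcache.Item{Key: key, Value: val}); err != nil {
              fmt.Println("set error:", err)
          }
      }
  }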
> >
> > I added a comment to the issue on GitHub to the same effect.
> >
> > I'm impressed by the quick turnaround, BTW. ;-)
> >
> > H
> >
> > On Friday, August 26, 2022 at 5:54:26 PM UTC-7 Dormando wrote:
> > So I tested this a bit more and released it in 1.6.17; I think bitnami
> > should pick it up soonish. if not I'll try to figure out docker this
> > weekend if you still need it.
> >
> > I'm not 100% sure it'll fix your use case but it does fix some things I
> > can test and it didn't seem like a regression. would be nice to validate
> > still.
> >
> > On Fri, 26 Aug 2022, dormando wrote:
> >
> > > You can't build docker images or compile binaries? there's a
> > > docker-compose.yml in the repo already if that helps.
> > >
> > > If not I can try but I don't spend a lot of time with docker directly.
> > >
> > > On Fri, 26 Aug 2022, Hayden wrote:
> > >
> > > > I'd be happy to help validate the fix, but I can't do it until the
> > > > weekend, and I don't have a ready way to build an updated image. Any
> > > > chance you could create a docker image with the fix that I could grab
> > > > from somewhere?
> > > >
> > > > On Friday, August 26, 2022 at 10:38:54 AM UTC-7 Dormando wrote:
> > > > I have an opportunity to put this fix into a release today if anyone 
> wants
> > > > to help validate :)
> > > >
> > > > On Thu, 25 Aug 2022, dormando wrote:
> > > >
> > > > > Took another quick look...
> > > > >
> > > > > Think there's an easy patch that might work:
> > > > > https://github.com/memcached/memcached/pull/924
> > > > >
> > > > > If you wouldn't mind helping validate? An external validator would
> > > > > help me get it in time for the next release :)
> > > > >
> > > > > Thanks,
> > > > > -Dormando
> > > > >
> > > > > On Wed, 24 Aug 2022, dormando wrote:
> > > > >
> > > > > > Hey,
> > > > > >
> > > > > > Thanks for the info. Yes; this generally confirms the issue. I
> > > > > > see some of your higher slab classes with "free_chunks 0", so if
> > > > > > you're setting data that requires these chunks it could error out.
> > > > > > The "stats items" confirms this since there are no actual items in
> > > > > > those lower slab classes.
> > > > > >
> > > > > > You're certainly right a workaround of making your items < 512k
> > > > > > would also work; but in general if I have features it'd be nice if
> > > > > > they worked well :) Please open an issue so we can improve things!
> > > > > >
> > > > > > I intended to lower the slab_chunk_max default from 512k to much
> > > > > > lower, as that actually raises the memory efficiency by a bit
> > > > > > (less gap at the higher classes). That may help here. The system
> > > > > > should also try ejecting items from the