Hi Dan, hi Rafael, we found the issue. The cause was a cleanup script that did not work correctly: it removed objects directly via rados, so the bucket index was never updated.
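For anyone who hits the same symptom (object still listed in the bucket, but 404 NoSuchKey on download), a check along the following lines might help to spot index entries whose data was removed behind radosgw's back. This is only a rough sketch, not the script we used: the function name `index_keys_missing_head` is made up for illustration, the inputs are assumed to be plain lists you have already exported (the S3 keys from the bucket listing, the object names from the data pool), and it assumes that the head object of a plain, non-versioned S3 object is named `<bucket marker>_<key>`. That naming does not cover versioned objects or keys with special prefixes, so treat any hit as a hint to investigate, not as proof.

```python
def index_keys_missing_head(index_keys, bucket_marker, pool_objects):
    """Return the S3 keys whose assumed head object is absent from the pool.

    index_keys    -- S3 object keys as listed in the bucket index
    bucket_marker -- the bucket's marker string (hypothetical value below)
    pool_objects  -- RADOS object names found in the data pool

    ASSUMPTION: the head object is named "<bucket_marker>_<key>". This only
    holds for plain, non-versioned objects and is used here for illustration.
    """
    # Normalise the pool listing once so lookups are cheap.
    present = {name.strip() for name in pool_objects if name.strip()}

    missing = []
    for key in index_keys:
        head_name = f"{bucket_marker}_{key}"
        if head_name not in present:
            missing.append(key)
    return sorted(missing)


if __name__ == "__main__":
    # Made-up sample data: b.jpg is in the index but has no head object.
    marker = "abc.123.1"
    pool = ["abc.123.1_photos/a.jpg", "abc.123.1__shadow_.xyz_1"]
    keys = ["photos/a.jpg", "photos/b.jpg"]
    for key in index_keys_missing_head(keys, marker, pool):
        print("listed in index, head object missing:", key)
```

The comparison is deliberately read-only. It never deletes anything, which is the part our cleanup script got wrong.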
Thank you very much for your help. (I will also close the bug on the Ceph tracker.)

On Fri, 23 Jul 2021 at 01:16, Rafael Lopez <rafael.lo...@monash.edu> wrote:
>
> Thanks for further clarification Dan.
>
> Boris, if you have a test/QA environment on the same code as production, you
> can confirm if the problem is as above. Do NOT do this in production - if the
> problem exists it might result in losing production data.
>
> 1. Upload a large S3 object that would take 10+ seconds to download (several GB)
> 2. Download the object to ensure it is working
> 3. Set "rgw_gc_obj_min_wait" to a very low value (2-3 seconds)
> 4. Download the object
>
> Step (4) may succeed, but run this:
> `radosgw-admin gc list`
>
> And check for shadow objects associated with the S3 object.
>
> Once the garbage collection completes, you will get the 404 NoSuchKey return
> when you try to download the S3 object, although it will still be listed as
> an object in the bucket.
> I also recommend setting "rgw_gc_obj_min_wait" back to a high value after
> you finish testing.
>
> On Thu, 22 Jul 2021 at 19:45, Dan van der Ster <d...@vanderster.com> wrote:
>>
>> Boris,
>>
>> To check if your issue is related to Rafael's, could you check your
>> access logs for requests on the missing objects which lasted longer
>> than one hour?
>>
>> I ask because Nautilus also has rgw_gc_obj_min_wait (2hr by default),
>> which is the main config option related to
>> https://tracker.ceph.com/issues/47866
>>
>> -- Dan
>>
>> On Thu, Jul 22, 2021 at 11:12 AM Dan van der Ster <d...@vanderster.com>
>> wrote:
>> >
>> > Hi Rafael,
>> >
>> > AFAIU, that gc issue was not relevant for N -- the bug is in the new
>> > rgw_gc code which landed in Octopus and was not backported to N.
>> >
>> > Well, RHCEPH had the new rgw_gc cls backported to it, and RHCEPH has
>> > the bugfix you refer to:
>> > * Wed Dec 02 2020 Ceph Jenkins <ceph-jenk...@redhat.com> 2:14.2.11-86
>> > - rgw: during GC defer, prevent new GC enqueue (rhbz#1892644)
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1892644
>> >
>> > But still, I think it shouldn't apply to the upstream community
>> > Nautilus that we run.
>> >
>> > That said, this indeed looks really similar so perhaps Nautilus has
>> > similar faulty gc logic.
>> >
>> > Cheers, Dan
>> >
>> > On Thu, Jul 22, 2021 at 6:47 AM Rafael Lopez <rafael.lo...@monash.edu>
>> > wrote:
>> > >
>> > > Hi Boris,
>> > >
>> > > We hit an issue late last year that sounds similar to what you are
>> > > experiencing. I am not sure if the fix was backported to Nautilus. I
>> > > can't see any reference to a Nautilus backport, so it's possible it was
>> > > only backported to Octopus (15.x), the exception being Red Hat Ceph
>> > > Nautilus.
>> > >
>> > > https://tracker.ceph.com/issues/47866?next_issue_id=48255#note-59
>> > > https://www.mail-archive.com/ceph-users@ceph.io/msg05312.html
>> > >
>> > > Basically, a read request on an S3/Swift object that took a very long
>> > > time to complete would cause the associated rados data objects to be put
>> > > in the GC queue, but the head object would still be present. So the S3
>> > > object would still show as present, `radosgw-admin bi list` would show
>> > > it (since the head object was present) but the data objects would be
>> > > gone, resulting in 404 NoSuchKey when retrieving the object.
>> > >
>> > > raf
>> > >
>> > > On Wed, 21 Jul 2021 at 18:12, Boris Behrens <b...@kervyn.de> wrote:
>> > >>
>> > >> Good morning everybody,
>> > >>
>> > >> we've dug further into it but still don't know how this could happen.
>> > >> What we ruled out for now:
>> > >> * Orphan objects cleanup process.
>> > >> ** There is only one bucket with missing data (I checked all other
>> > >> buckets yesterday)
>> > >> ** The "keep these files" list is generated by radosgw-admin bucket
>> > >> radoslist. I would doubt that there were files listed that are
>> > >> accessible via radosgw
>> > >> ** The deleted files are somewhat random, but always with their
>> > >> corresponding counterparts (per folder there are 2-3 files that belong
>> > >> together)
>> > >>
>> > >> * The customer removed their data, but radosgw didn't clean up the
>> > >> bucket index
>> > >> ** There are no delete requests in the bucket's usage log.
>> > >> ** The customer told us that they do not have a delete job for this
>> > >> bucket
>> > >>
>> > >> So I am out of ideas to check, and hope that you people
>> > >> might be able to help with further ideas.
>> > >>
>> > >> --
>> > >> The "UTF-8 problems" self-help group is meeting in the large hall this
>> > >> time, as an exception.
>> > >> _______________________________________________
>> > >> ceph-users mailing list -- ceph-users@ceph.io
>> > >> To unsubscribe send an email to ceph-users-le...@ceph.io
>> > >
>> > > --
>> > > Rafael Lopez
>> > > Devops Systems Engineer
>> > > Monash University eResearch Centre
>> > >
>> > > E: rafael.lo...@monash.edu
>
> --
> Rafael Lopez
> Devops Systems Engineer
> Monash University eResearch Centre
>
> T: +61 3 9905 9118
> E: rafael.lo...@monash.edu

--
The "UTF-8 problems" self-help group is meeting in the large hall this time, as an exception.