Hi Dan,
Hi Rafael,

We found the issue.
It was a cleanup script that didn't work correctly: it removed the
objects directly via rados, so the bucket index was never updated.
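
For anyone who hits the same thing: deleting RGW objects directly with
the rados CLI bypasses radosgw completely, so the bucket index never
learns about the deletion. A minimal illustration (pool, marker and
object names are placeholders for our setup):

    # wrong: removes the rados object, but the bucket index still lists it
    # (pool and marker are placeholders)
    rados -p default.rgw.buckets.data rm '<marker>_myfolder/myfile'

    # better: goes through radosgw, which also updates the bucket index
    radosgw-admin object rm --bucket=mybucket --object=myfolder/myfile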

Thank you both for your help. (I will also close the bug on the Ceph tracker.)

On Fri, 23 Jul 2021 at 01:16, Rafael Lopez
<rafael.lo...@monash.edu> wrote:
>
> Thanks for further clarification Dan.
>
> Boris, if you have a test/QA environment on the same code as production, you
> can confirm whether the problem is as described above. Do NOT do this in
> production - if the problem exists, it might result in losing production data.
>
> 1. Upload large S3 object that would take 10+ seconds to download (several GB)
> 2. Download object to ensure it is working
> 3. Set "rgw_gc_obj_min_wait" to a very low value (2-3 seconds)
> 4. Download object
>
> Step (4) may succeed, but run this:
> `radosgw-admin gc list`
>
> And check for shadow objects associated with the S3 object.
>
> Once garbage collection completes, you will get a 404 NoSuchKey response
> when you try to download the S3 object, although it will still be listed as
> an object in the bucket.
> I also recommend setting "rgw_gc_obj_min_wait" back to a high value after
> you finish testing. A rough sketch of the whole test follows below.
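>
> With the aws CLI, the test would look something like this (endpoint,
> bucket and file names are placeholders; adjust to your environment):
>
>     # 1. upload a large object (several GB, so a download takes 10+ seconds)
>     aws --endpoint-url http://rgw.test.local s3 cp ./big.bin s3://testbucket/big.bin
>
>     # 2. confirm the object downloads cleanly
>     aws --endpoint-url http://rgw.test.local s3 cp s3://testbucket/big.bin /tmp/big.bin
>
>     # 3. lower the GC deferral window; the daemon name is a placeholder,
>     #    and some setups need the value in ceph.conf plus an rgw restart
>     ceph config set client.rgw.myhost rgw_gc_obj_min_wait 3
>
>     # 4. download again, then look for the object's tail objects in the GC queue
>     aws --endpoint-url http://rgw.test.local s3 cp s3://testbucket/big.bin /tmp/big2.bin
>     radosgw-admin gc list
>
>     # afterwards, drop the override again
>     ceph config rm client.rgw.myhost rgw_gc_obj_min_wait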
>
> On Thu, 22 Jul 2021 at 19:45, Dan van der Ster <d...@vanderster.com> wrote:
>>
>> Boris,
>>
>> To check if your issue is related to Rafael's, could you check your
>> access logs for requests on the missing objects which lasted longer
>> than one hour?
>>
>> I ask because Nautilus also has rgw_gc_obj_min_wait (2hr by default),
>> which is the main config option related to
>> https://tracker.ceph.com/issues/47866
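>>
>> To see what value your cluster is actually running with, you can query
>> the rgw admin socket (the daemon name here is a placeholder):
>>
>>     # daemon name is a placeholder for your rgw instance
>>     ceph daemon client.rgw.myhost config get rgw_gc_obj_min_wait
>>
>> For the access logs, the duration field depends on what sits in front of
>> your rgw (haproxy, beast, civetweb, ...), so you'll have to adapt the
>> search to your frontend.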
>>
>>
>> -- Dan
>>
>> On Thu, Jul 22, 2021 at 11:12 AM Dan van der Ster <d...@vanderster.com> 
>> wrote:
>> >
>> > Hi Rafael,
>> >
>> > AFAIU, that gc issue was not relevant for Nautilus -- the bug is in the
>> > new rgw_gc code, which landed in Octopus and was not backported to Nautilus.
>> >
>> > Well, RHCEPH had the new rgw_gc cls backported to it, and RHCEPH has
>> > the bugfix you refer to:
>> > * Wed Dec 02 2020 Ceph Jenkins <ceph-jenk...@redhat.com> 2:14.2.11-86
>> > - rgw: during GC defer, prevent new GC enqueue (rhbz#1892644)
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1892644
>> >
>> > But still, I think it shouldn't apply to the upstream community
>> > Nautilus that we run.
>> >
>> > That said, this indeed looks really similar, so perhaps Nautilus has
>> > similarly faulty gc logic.
>> >
>> > Cheers, Dan
>> >
>> > On Thu, Jul 22, 2021 at 6:47 AM Rafael Lopez <rafael.lo...@monash.edu> 
>> > wrote:
>> > >
>> > > Hi Boris,
>> > >
>> > > We hit an issue late last year that sounds similar to what you are
>> > > experiencing. I am not sure whether the fix was backported to Nautilus;
>> > > I can't see any reference to a Nautilus backport, so it's possible it
>> > > was only backported to Octopus (15.x), the exception being Red Hat Ceph's
>> > > Nautilus.
>> > >
>> > > https://tracker.ceph.com/issues/47866?next_issue_id=48255#note-59
>> > > https://www.mail-archive.com/ceph-users@ceph.io/msg05312.html
>> > >
>> > > Basically, a read request on an S3/Swift object that took a very long
>> > > time to complete would cause the associated rados data objects to be put
>> > > in the GC queue, but the head object would still be present. So the S3
>> > > object would still show as present, and `radosgw-admin bi list` would
>> > > show it (since the head object was present), but the data objects would
>> > > be gone, resulting in a 404 NoSuchKey when retrieving the object.
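>> > >
>> > > If you want to confirm an object is in that state, one way (bucket and
>> > > object names are placeholders, and tail-object naming varies a bit
>> > > between versions) is:
>> > >
>> > >     # reads the head object and prints the manifest, incl. the tail objects
>> > >     radosgw-admin object stat --bucket=mybucket --object=mykey
>> > >
>> > >     # stat one of the manifest's tail objects directly (placeholder name);
>> > >     # ENOENT here, while the stat above succeeds, matches the symptom
>> > >     rados -p default.rgw.buckets.data stat '<marker>__shadow_<tag>_1'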
>> > >
>> > > raf
>> > >
>> > > On Wed, 21 Jul 2021 at 18:12, Boris Behrens <b...@kervyn.de> wrote:
>> > >>
>> > >> Good morning everybody,
>> > >>
>> > >> we've dug further into it, but we still don't know how this could have happened.
>> > >> What we have ruled out so far:
>> > >> * The orphan objects cleanup process.
>> > >> ** There is only one bucket with missing data (I checked all other
>> > >> buckets yesterday).
>> > >> ** The "keep these files" list is generated by `radosgw-admin bucket
>> > >> radoslist`. I doubt that files which are accessible via radosgw were
>> > >> missing from that list (the cross-check is sketched below).
>> > >> ** The deleted files are somewhat random, but they always vanish
>> > >> together with their corresponding counterparts (per folder there are
>> > >> 2-3 files that belong together).
>> > >>
>> > >> * The customer removed his data, but radosgw didn't clean up the bucket
>> > >> index.
>> > >> ** There are no delete requests in the bucket's usage log (also checked
>> > >> below).
>> > >> ** The customer told us that they do not have a delete job for this
>> > >> bucket.
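>> > >>
>> > >> For reference, both checks look roughly like this (bucket, pool, marker
>> > >> and uid are placeholders for our values):
>> > >>
>> > >>     # every rados object the bucket should own (the "keep" list)
>> > >>     radosgw-admin bucket radoslist --bucket=mybucket | sort > keep.txt
>> > >>
>> > >>     # everything in the data pool that carries this bucket's marker
>> > >>     rados -p default.rgw.buckets.data ls | grep '^<marker>_' | sort > have.txt
>> > >>
>> > >>     # objects the index expects but rados no longer has
>> > >>     comm -23 keep.txt have.txt
>> > >>
>> > >>     # delete requests according to the usage log
>> > >>     radosgw-admin usage show --uid=<owner> --show-log-entries=true | grep -i delete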
>> > >>
>> > >> So I have run out of things to check, and I hope you might be able to
>> > >> help with further ideas.
>> > >>
>> > >> --
>> > >> This time, the "UTF-8 Problems" self-help group will meet, as an
>> > >> exception, in the big hall.
>> > >
>> > >
>> > >
>> > > --
>> > > Rafael Lopez
>> > > Devops Systems Engineer
>> > > Monash University eResearch Centre
>> > >
>> > > E: rafael.lo...@monash.edu
>> > >
>
>
>
> --
> Rafael Lopez
> Devops Systems Engineer
> Monash University eResearch Centre
>
> T: +61 3 9905 9118
> E: rafael.lo...@monash.edu
>


-- 
This time, the "UTF-8 Problems" self-help group will meet, as an
exception, in the big hall.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
