Hi Frédéric Nass,

Sorry, I may not have expressed it clearly before. The epoch and OSD up/down 
timeline was extracted and merged from the logs of all 9 OSDs, and I analyzed 
the peering process of PG 9.11b6. OSDs 494, 1169 and 1057 fully recorded the 
down/up events of the other OSDs, and I also checked the logs of the other 6 
OSDs: the role changes during peering were as expected and no abnormalities 
were found. I also checked the status of the monitors. One of the 5 monitors 
lost power and was powered back on after about 40 minutes; its log showed a 
relatively large rank value, so it never became the leader.

Regarding the failure domain: the failure domain we configured is the host 
level, but in fact all hosts are distributed across 2 buildings, and the 
original designer did not take the building level into account as a failure 
domain.
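
For what it's worth, a building level could be expressed in CRUSH roughly like 
this (a sketch only; the bucket names are hypothetical, and with just 2 
buildings and k=6/m=3 a rule would still have to place several chunks per 
building):

    # 'datacenter' is a default CRUSH bucket type; the names below are examples
    ceph osd crush add-bucket building-a datacenter
    ceph osd crush add-bucket building-b datacenter
    ceph osd crush move building-a root=default
    ceph osd crush move building-b root=default
    ceph osd crush move cz-ceph-01 datacenter=building-a   # repeat per host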



In such a setup the OSDs could in theory end up in a split-brain situation, but 
from the logs that did not happen.



Best regards,
wu_chu...@qq.com






                       
Original Email

From: "Frédéric Nass" <frederic.n...@univ-lorraine.fr>
Sent Time: 2024/8/8 15:40
To: "Best Regards" <wu_chu...@qq.com>
Cc: "ceph-users" <ceph-users@ceph.io>
Subject: Re: Re: [ceph-users] Re: Please guide us in identifying the cause of the data miss in EC pool


ceph osd pause comes with a lot of constraints from an operational perspective. :-)


Host uptime and service running time are one thing, but they don't mean that 
these 3 OSDs were in the acting set when the power outage occurred.


Since OSDs 494, 1169 and 1057 did not crash, I assume they're in the same 
failure domain. Is that right? 


Being isolated along with their local MON(s) from the other MONs and the other 
6 OSDs, there's a fair chance that one of the 6 other OSDs in other failure 
domains took the lead, sent 5 chunks around and acknowledged the write to the 
RGW client. Then all of them crashed.


Your thoughts?


Frédéric.






From: Best Regards <wu_chu...@qq.com>
Sent: Thursday, August 8, 2024 09:16
To: Frédéric Nass
Cc: ceph-users
Subject: Re: [ceph-users] Re: Please guide us in identifying the cause of the data miss in EC pool






Hi Frédéric Nass,


Yes. I checked the uptime of the hosts where the OSDs are located and the 
running time of the OSD services. The OSDs were only stopped when I executed 
`ceph-objectstore-tool`; they were running before that.


Because we need to maintain the hardware frequently (the hardware is quite 
old), min_size is set to the lowest value. When a failure occurs, we set the 
read/write pause flag. During this failure, there was no PUT action on the S3 
keys to which these objects belong.
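
For context, the pause and maintenance flags involved are the standard 
cluster-wide ones, e.g. (a sketch, not our exact runbook):

    ceph osd set pause       # blocks client reads and writes cluster-wide
    ceph osd set noout       # keep OSDs from being marked out during maintenance
    ceph osd unset pause
    ceph osd unset noout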








Best regards,
wu_chu...@qq.com






                       
Original Email

From: "Frédéric Nass" <frederic.n...@univ-lorraine.fr>
Sent Time: 2024/8/8 14:40
To: "Best Regards" <wu_chu...@qq.com>
Cc: "ceph-users" <ceph-users@ceph.io>
Subject: [ceph-users] Re: Please guide us in identifying the cause of the data miss in EC pool


Hi Chulin,

Are you 100% sure that 494, 1169 and 1057 (which did not restart) were in the 
acting set at the exact moment the power outage occurred?

I'm asking because min_size 6 would have allowed the data to be written to as 
few as 6 OSDs, possibly the very 6 that ended up crashing.

Bests,
Frédéric.


________________________________
From: Best Regards
Sent: Thursday, August 8, 2024 08:10
To: Frédéric Nass
Cc: ceph-users
Subject: Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool

Hi Frédéric Nass,


Thank you for your continued attention and guidance. Let's analyze and verify 
this issue from different perspectives.


The reason we have not stopped the investigation is that we are trying to find 
other ways to avoid the losses caused by this kind of sudden failure. Turning 
off the disk cache is the last option; of course, that operation will only be 
carried out after we find definite evidence.

I also have a question: among the 9 OSDs, some were never restarted. In 
theory, these OSDs should still retain the object info (metadata, pg_log, 
etc.), even if the object itself cannot be recovered. I sorted out the boot 
logs of the OSDs where the object should be located and the PG peering process:


OSDs 494/1169/1057 stayed in the running state, and osd.494 was the primary of 
the acting set during the failure. However, no record of the object was found 
on them using `ceph-objectstore-tool --op list` or `--op log`, so data loss due 
to the disk cache being lost does not seem to be a complete explanation 
(perhaps there is some processing logic that we have not paid attention to).
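
For reference, the per-OSD check looks roughly like this (a sketch assuming the 
default OSD data path; 9.11b6 is the PG in question, and the OSD must be 
stopped while the tool runs):

    systemctl stop ceph-osd@494
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-494 --pgid 9.11b6 --op list
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-494 --pgid 9.11b6 --op log
    systemctl start ceph-osd@494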





Best regards,

Woo
wu_chu...@qq.com






                       
Original Email

From: "Frédéric Nass" <frederic.n...@univ-lorraine.fr>
Sent Time: 2024/8/8 4:01
To: "wu_chulin" <wu_chu...@qq.com>
Subject: Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool


Hey Chulin,


Looks clearer now.
 


Non-persistent cache for KV metadata and Bluestore metadata certainly explains 
how data was lost without the cluster even noticing.


What's unexpected is data staying for so long in the disks' buffers without 
being written to persistent sectors at all.


Anyways, thank you for sharing your use case and investigation. It was nice 
chatting with you.


If you can, share this in the ceph-user list. It will for sure benefit everyone 
in the community.


Best regards,
Frédéric.


PS: Note that using min_size >= k + 1 on EC pools is recommended (just as 
min_size >= 2 is on replicated x3 pools) because you don't want to write data 
without any parity chunks.
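
For a k=6, m=3 pool that recommendation translates to something like the 
following (the pool name is a placeholder):

    ceph osd pool set <ec-pool-name> min_size 7   # k + 1: never acknowledge a write with zero surviving parity chunks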









From: wu_chu...@qq.com
Sent: Wednesday, August 7, 2024 11:30
To: Frédéric Nass
Subject: Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool




Hi,
Yes, after the file -> object -> PG -> OSD correspondence is found, the object 
record can be found on the specified OSD using the command 
`ceph-objectstore-tool --op list`.
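
For example, the correspondence can be walked roughly like this (a sketch; 
bucket, key and pool names are placeholders):

    radosgw-admin object stat --bucket=<bucket> --object=<key>   # the manifest lists the underlying rados objects
    ceph osd map <data-pool> <rados-object-name>                 # prints the pg id and the up/acting OSD set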

The pool min_size is 6


The business department reported more than 30, but we identified more than 100 
by proactively screening. The upload times of the lost files were mainly within 
about 3 hours before the failure, and these files had been successfully 
downloaded after being uploaded (per the RGW logs).


One OSD corresponds to one disk, and no separate space is allocated for WAL/DB.


The HDD cache is at its default setting (enabled by default on SATA drives), 
and the disk cache has not been forcibly turned off, for performance reasons.


The loss of OSD data due to the loss of the hard disk cache was our initial 
inference, and the initial explanation provided to the business department was 
the same. When the cluster was restored, Ceph reported 12 unfound objects, 
which is acceptable; after all, most devices were powered off abnormally, and 
it is difficult to guarantee the integrity of all data. Up to now, our team has 
not located how the data was lost. In the past, when hard disk hardware was 
damaged, either the OSD could not start because key data was damaged, or some 
objects were read incorrectly after the OSD started, which could be repaired. 
Now deep-scrub cannot find the problem, which may be related to the loss (or 
deletion) of the object metadata. After all, deep-scrub needs the object list 
of the current PG; if those 9 OSDs do not have the object's metadata, 
deep-scrub does not know the object exists.
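
For completeness, the scrub-side commands involved are the standard ones (a 
sketch; 9.11b6 is the PG discussed above):

    ceph pg deep-scrub 9.11b6
    rados list-inconsistent-obj 9.11b6 --format=json-pretty   # can only report shards the PG still has metadata for
    ceph pg repair 9.11b6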



wu_chu...@qq.com








Original Email

From: "Frédéric Nass" <frederic.n...@univ-lorraine.fr>
Sent Time: 2024/8/6 20:40
To: "wu_chulin" <wu_chu...@qq.com>
Subject: Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool




That's interesting.


Have you tried to correlate an existing, retrievable object to its PG id and 
OSD mapping, in order to verify the presence of each of that object's shards 
with ceph-objectstore-tool on each of its acting OSDs, for an S3 object that 
was previously written successfully?


This would help verify that the command you've run trying to find the missing 
shard was good.


Also what min_size is this pool using?
How many S3 objects like these were reported missing by the business unit? Have 
you or they made an inventory of unretrievable / missing objects?


Are WAL / RocksDB collocated on HDD-only OSDs, or are these OSDs using SSDs for 
WAL / RocksDB?


Did you disable HDD buffers (also known as disk cache)? HDD buffers are 
non-persistent.


I know nothing about the intensity of your workloads, but if you're looking for 
a few tens or a few hundreds of unwritten S3 objects, there might be some 
situation with non-persistent cache (like volatile disk buffers) where Ceph 
would consider the data written when it actually was not at the moment of the 
power outage. Especially if you kept writing data with fewer shards (min_size) 
than k+1 (no parity at all). That sounds like a possibility.




Also, what I'm thinking right now is... if you can identify which shard out of 
the 9 is wrong, then you may use ceph-objectstore-tool or ceph-kvstore-tool to 
destroy that particular shard, then deep-scrub the PG so as to detect the 
missing shard and have it rebuilt.


Never tried this myself though.


Best regards,
Frédéric.






From: wu_chu...@qq.com
Sent: Tuesday, August 6, 2024 12:15
To: Frédéric Nass
Subject: Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool









Hi,
Thank you for your attention to this matter.


1. Manually deep-scrubbing the PG did not report any errors. I checked the OSD 
logs and did not see any errors detected or fixed by the OSDs. Ceph health was 
also normal, and the OS hosting the OSDs did not report any IO errors.

2. At first, I also suspected that the objects might not have been mapped to 
these OSDs before the failure. I used ceph osd getmap (a sketch of that check 
follows the cluster status below).

3. Our Ceph version is 13.2.10, which does not have pg autoscaler; lifecycle 
policies are not set.


4. We have 90+ hosts, most of which are Dell R720xd; most of the hard disks are 
3.5-inch 5400 rpm 10TB Western Digital, and most of the controllers are PERC 
H330 Mini. This is the current cluster status:
  cluster:
    id:     f990db28-9604-4d49-9733-b17155887e3b
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum cz-ceph-01,cz-ceph-02,cz-ceph-03,cz-ceph-07,cz-ceph-13
    mgr: cz-ceph-01(active), standbys: cz-ceph-03, cz-ceph-02
    osd: 1172 osds: 1172 up, 1172 in
    rgw: 9 daemons active

  data:
    pools:   16 pools, 25752 pgs
    objects: 2.23 G objects, 6.1 PiB
    usage:   9.4 PiB used, 2.5 PiB / 12 PiB avail
    pgs:     25686 active+clean
             64    active+clean+scrubbing+deep+repair
             2     active+clean+scrubbing+deep
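
Regarding step 2 above, the check with the extracted osdmap goes roughly like 
this (a sketch; the epoch, pool id and object name are placeholders):

    ceph osd getmap <epoch-before-outage> -o osdmap.bin
    osdmaptool osdmap.bin --test-map-object <rados-object-name> --pool <pool-id>
    # prints the PG and the up/acting OSD set for that object at that epoch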


Best regards.



Original Email

From: "Frédéric Nass" <frederic.n...@univ-lorraine.fr>
Sent Time: 2024/8/6 15:34
To: "wu_chulin" <wu_chu...@qq.com>
Subject: Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool


Hi,


Did the deep-scrub report any errors? (Any inconsistencies should show up as 
errors after deep-scrubbing the PG.)

Were these errors fixed by the PG repair?


Is it possible that you looked at the wrong PG or OSD when trying to list these 
objects with ceph-objectstore-tool?


Was the PG autoscaler running at that time? Are you using S3 lifecycle policies 
that could have moved this object to another placement pool, and therefore 
another PG?


Can you give details about this cluster? Hardware, disks, controller, etc.


Cheers,
Frédéric.











From: wu_chu...@qq.com
Sent: Monday, August 5, 2024 10:09
To: Frédéric Nass
Subject: Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool




Hi,
Thank you for your reply. I apologize for the omission in the previous email; 
please disregard it and refer to this one instead.
After the failure, we executed the repair and deep-scrub commands on some of 
the PGs that lost data, and their status was active+clean after completion, but 
the objects still could not be retrieved.
Our erasure code parameters are k=6, m=3, so theoretically the data on up to 
three OSDs lost due to a power failure should be recoverable. However, we 
stopped all nine OSDs and exported their object lists, but could not find the 
lost objects' information. What puzzled us is that some OSDs were never powered 
off and kept running, yet their object lists did not have the information 
either.




Best regards.




wu_chu...@qq.com








Original Email

From: "Frédéric Nass" <frederic.n...@univ-lorraine.fr>
Sent Time: 2024/8/3 15:11
To: "wu_chulin" <wu_chu...@qq.com>; "ceph-users" <ceph-users@ceph.io>
Subject: Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool


Hi,


The first thing that comes to mind when it comes to data unavailability or 
inconsistencies after a power outage is that some dirty data may have been lost 
along the IO path before reaching persistent storage. This can happen with 
non-enterprise-grade SSDs using non-persistent cache, or with HDD disk buffers 
if left enabled, for example.


With that said, have you tried to deep-scrub the PG from which you can't 
retrieve data? What's the status of this PG now? Did it recover?


Regards,
Frédéric.






From: wu_chu...@qq.com
Sent: Wednesday, July 31, 2024 05:49
To: ceph-users
Subject: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool




Dear Ceph team,

On July 13th at 4:55 AM, our Ceph cluster experienced a significant power 
outage in the data center, causing a large number of OSDs to power off and 
restart (total: 1172, down: 821). Approximately two hours later, all OSDs 
successfully started, and the cluster resumed its services. However, around 
6 PM, the business department reported that some files, which had been 
successfully written (via the RGW service), were failing to download, and the 
number of such files was quite significant. Consequently, we began a series of 
investigations: 


 1. The incident occurred at 04:55. At 05:01, we set the noout, nobackfill, 
and norecover flags. At 06:22, we executed `ceph osd pause`. By 07:23, all OSDs 
were UP&IN, and subsequently we executed `ceph osd unpause`. 


 2. We randomly selected a problematic file and attempted to download it via 
the S3 API. The RGW returned "No such key". 


 3. The RGW logs showed op status=-2, http status=200. We also checked the 
upload logs, which indicated 2024-07-13 04:19:20.052, op status=0, 
http_status=200. 


 4. We set debug_rgw=20 and attempted to download the file again. It was found 
that one 4M chunk (this file is 64M) failed to be fetched. 


 5. Using rados get for this chunk returned: "No such file or directory" (see 
the sketch after this list). 


 6. Setting debug_osd=20, we observed get_object_context: obc NOT found in 
cache. 


 7. Setting debug_bluestore=20, we saw get_onode oid xxx, key xxx != 
'0xfffffffffffffffeffffffffffffffff'o'. 


 8. We stopped the primary OSD and tried to get the file again, but the result 
was the same. The object’s corresponding PG state was 
active+recovery_wait+degraded. 


 9. Using ceph-objectstore-tool --op list && --op log, we could not find the 
object information. The ceph-kvstore-tool rocksdb command also did not reveal 
anything new. 


 10. If an OSD had lost data, the PG state should have shown unfound objects or 
become inconsistent. 


 11. We started reanalyzing the startup logs of the OSDs related to the PG. The 
pool is erasure-coded k=6, m=3, so 9 OSDs per PG. Six of these OSDs had 
restarted, and after peering, the PG state became ACTIVE. 


 12. We sorted the lost files by upload time; all were uploaded before the 
failure occurred. The earliest upload time was around 1 AM, and the successful 
upload records could be found in the RGW logs. 


 13. We have submitted an issue on the Ceph issue tracker: 
https://tracker.ceph.com/issues/66942; it includes the original logs needed for 
troubleshooting. However, four days have passed without any response. In 
desperation, we are sending this email, hoping that someone from the Ceph team 
can guide us as soon as possible. 
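
 For reference, the debug settings and the per-chunk check in steps 4-7 were 
along these lines (a rough sketch; the pool name, chunk object name and OSD id 
are placeholders, not the exact values we used):

    ceph tell osd.<id> injectargs '--debug_osd 20 --debug_bluestore 20'
    rados -p <data-pool> get <chunk-object-name> /tmp/chunk.bin   # this is the call that returned "No such file or directory"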


 We are currently in a difficult situation and hope you can provide guidance. 
Thank you. 



 Best regards. 





 wu_chu...@qq.com
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io