On 27 Feb 2018 06:46, "Jan Pekař - Imatic" <jan.pe...@imatic.cz> wrote:

I think I hit the same issue.
I have corrupted data on CephFS, and I don't remember seeing this issue before
Luminous (I ran the same tests before).

It is on my one-node test cluster with less memory than recommended (so the
server is swapping), but it shouldn't lose data (it never did before).
Slow requests may appear in the log, as Florent B mentioned.

My test is to take some larger files (a few GB) and copy them to CephFS, or
from CephFS to CephFS, while stressing the cluster so that the copy stalls for
a while. It resumes after a few seconds or minutes and everything looks OK (no
error during copying), but the copied file may be silently corrupted.

I checked the files with md5sum and compared some corrupted files in detail.
Some 4 MB blocks of data (the CephFS object size) were missing: the corrupted
file had those blocks filled with zeroes.
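The comparison described above can be automated. Below is a minimal sketch (not the tool I actually used; the block size assumes the default CephFS object size of 4 MB) that reports the offsets of any fully zero-filled, block-aligned region in a file:

```python
# Scan a file for 4 MB blocks that are entirely zero-filled.
# Assumes the default CephFS object size of 4 MB; adjust BLOCK
# if your file layout uses a different object size.
BLOCK = 4 * 1024 * 1024

def zero_blocks(path, block=BLOCK):
    """Return byte offsets of blocks that consist only of zero bytes."""
    offsets = []
    zero = bytes(block)  # a reference block of all zeroes
    with open(path, "rb") as f:
        off = 0
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            # Only a full-size, all-zero chunk is flagged; a short
            # tail at end-of-file is ignored.
            if len(chunk) == block and chunk == zero:
                offsets.append(off)
            off += len(chunk)
    return offsets
```

Running this over a copy whose md5sum differs from the original should point directly at the missing objects.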

My idea is that something goes wrong when the cluster is under pressure and
the client wants to write a block. The client gets an OK and continues with
the next block, so the data is lost and the corrupted block is filled with
zeros.

I tried the 4.x kernel client and the ceph-fuse client, with the same result.

I'm using an erasure-coded CephFS data pool with a cache tier, and my storage
is a mix of BlueStore and FileStore.

How can I help debug this, or what should I do to help find the problem?


Always worrying to see the dreaded C word. I operate a Luminous cluster
with a pretty varied workload and have yet to see any signs of corruption,
although of course that doesn't mean it's not happening. Initial questions:

- What's the history of your cluster? Was this an upgrade or a fresh Luminous
install?
- Was Ceph healthy when you ran this test?
- Are you accessing this one-node cluster from the node itself or from a
separate client?

I'd recommend starting a new thread with more details; it sounds like it's
pretty reproducible for you, so maybe crank up your debugging and send logs.
http://docs.ceph.com/docs/luminous/dev/kernel-client-troubleshooting/


With regards
Jan Pekar


On 14.12.2017 15:41, Yan, Zheng wrote:

> On Thu, Dec 14, 2017 at 8:52 PM, Florent B <flor...@coppint.com> wrote:
>
>> On 14/12/2017 03:38, Yan, Zheng wrote:
>>
>>> On Thu, Dec 14, 2017 at 12:49 AM, Florent B <flor...@coppint.com> wrote:
>>>
>>>>
>>>> Systems are on Debian Jessie : kernel 3.16.0-4-amd64 & libfuse 2.9.3-15.
>>>>
>>>> I don't know pattern of corruption, but according to error message in
>>>> Dovecot, it seems to expect data to read but reach EOF.
>>>>
>>>> All seems fine using fuse_disable_pagecache (no more corruption, and
>>>> performance increased: no more MDS slow requests on filelock requests).
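For reference, fuse_disable_pagecache is a ceph-fuse client option; a sketch of how it would be set in ceph.conf on the client host (I am assuming the [client] section and that ceph-fuse is remounted afterwards):

```ini
# ceph.conf on the client host; remount ceph-fuse for this
# to take effect.
[client]
    fuse_disable_pagecache = true
```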
>>>>
>>>
>>> I checked the ceph-fuse changes since Kraken and didn't find any clue. It
>>> would be helpful if you could try a recent kernel version.
>>>
>>> Regards
>>> Yan, Zheng
>>>
>>
>> Problem occurred this morning even with fuse_disable_pagecache=true.
>>
>> It seems to be a lock issue between imap & lmtp processes.
>>
>> Dovecot uses fcntl as its locking method. Has anything changed about it in
>> Luminous? I switched to flock to see if the problem persists...
>>
>>
> I don't remember any change.
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
-- 
============
Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============