[ceph-users] cephfs poor performance

2018-10-07 Thread Tomasz Płaza

Hi,

Can someone please help me figure out how to improve the performance of our
CephFS cluster?


The system in use is CentOS 7.5 with Ceph 12.2.7.
The hardware in use is as follows:
3x MON/MGR:
1x Intel(R) Xeon(R) Bronze 3106
16GB RAM
2x SSD for system
1GbE NIC

2x MDS:
2x Intel(R) Xeon(R) Bronze 3106
64GB RAM
2x SSD for system
10GbE NIC

6x OSD:
1x Intel(R) Xeon(R) Silver 4108
2x SSD for system
6x HGST HUS726060ALE610 SATA HDDs
1x Intel SSDSC2BB150G7 for OSD DBs (10G partitions); the rest of the SSD is
used as an OSD to hold cephfs_metadata
10GbE NIC

Pools (the default CRUSH rules are device-class aware; see the quick sanity
check below):
rbd: 1024 PGs, crush rule replicated_hdd
cephfs_data: 256 PGs, crush rule replicated_hdd
cephfs_metadata: 32 PGs, crush rule replicated_ssd
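
A minimal sanity check that the pools really land on the intended device
classes (pool and rule names as above):

# which rule does each pool use?
ceph osd pool get cephfs_data crush_rule
ceph osd pool get cephfs_metadata crush_rule
# does replicated_ssd really select the ssd device class?
ceph osd crush rule dump replicated_ssd
# per-OSD device class and utilisation overview
ceph osd df tree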

Test done with fio:
fio --randrepeat=1 --ioengine=libaio --direct=1
--gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k 
--iodepth=64 --size=1G --readwrite=randrw --rwmixread=75


shows write/read IOPS performance as follows:
rbd: 3663/1223
cephfs (fuse): 205/68 (which is a little lower than the raw performance of a
single HDD used in the cluster)
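
For comparison it may also be worth running the exact same fio job on a
kernel-client mount, since ceph-fuse and the kernel client can behave very
differently for small random IO. A rough sketch (monitor address, credentials
and mount point below are placeholders):

mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
cd /mnt/cephfs
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test \
    --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G \
    --readwrite=randrw --rwmixread=75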


Everything is connected to one Cisco 10GbE switch.
Please help.



Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-07 Thread Yan, Zheng
On Mon, Oct 8, 2018 at 11:34 AM Daniel Carrasco  wrote:
>
> I've got several problems on 12.2.8 too. All my standby MDS uses a lot of 
> memory (while active uses normal memory), and I'm receiving a lot of slow MDS 
> messages (causing the webpage to freeze and fail until MDS are restarted)... 
> Finally I had to copy the entire site to DRBD and use NFS to solve all 
> problems...
>

was standby-replay enabled?
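
A quick way to check is to look at the fsmap; a daemon running standby-replay
shows up in state up:standby-replay (it is enabled per daemon, typically via
mds_standby_replay = true in ceph.conf):

ceph fs dump | grep -i standby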

> El lun., 8 oct. 2018 a las 5:21, Alex Litvak () 
> escribió:
>>
>> How is this not an emergency announcement?  Also I wonder if I can
>> downgrade at all ?  I am using ceph with docker deployed with
>> ceph-ansible.  I wonder if I should push downgrade or basically wait for
>> the fix.  I believe, a fix needs to be provided.
>>
>> Thank you,
>>
>> On 10/7/2018 9:30 PM, Yan, Zheng wrote:
>> > There is a bug in v13.2.2 mds, which causes decoding purge queue to
>> > fail. If mds is already in damaged state, please downgrade mds to
>> > 13.2.1, then run 'ceph mds repaired fs_name:damaged_rank' .
>> >
>> > Sorry for all the trouble I caused.
>> > Yan, Zheng
>> >
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> _
>
>   Daniel Carrasco Marín
>   Ingeniería para la Innovación i2TIC, S.L.
>   Tlf:  +34 911 12 32 84 Ext: 223
>   www.i2tic.com
> _
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error in MDS (laggy or crashed)

2018-10-07 Thread Yan, Zheng
On Mon, Oct 8, 2018 at 10:32 AM Alfredo Daniel Rezinovsky
 wrote:
>
> I tried to downgrade and the mds still fails.
>
> I removed the mds and created them again.
>
> I already messed up with the purgue_queue. Can I reset the queue (with
> 13.2.1) ?

have you run 'ceph mds repaired ...' ?

>
> Thanks for your help
>
> On 07/10/18 23:18, Yan, Zheng wrote:
> > Sorry there is bug in 13.2.2 that breaks compatibility of purge queue
> > disk format. Please downgrading mds to 13.2.1, then run 'ceph mds
> > repaired cephfs_name:0'.
> >
> > Regards
> > Yan, Zheng
> > On Mon, Oct 8, 2018 at 9:20 AM Alfredo Daniel Rezinovsky
> >  wrote:
> >> Cluster with 4 nodes
> >>
> >> node 1: 2 HDDs
> >> node 2: 3 HDDs
> >> node 3: 3 HDDs
> >> node 4: 2 HDDs
> >>
> >> After a problem with upgrade from 13.2.1 to 13.2.2 (I restarted the
> >> nodes 1 at a time, think that was the problem)
> >>
> >> I upgraded with ubuntu apt-get upgrade. I had 1 active mds at a time
> >> when did the upgrade.
> >>
> >> All MDSs stopped working
> >>
> >> Status shows 1 crashed and no one in standby.
> >>
> >> If I restart an MDS status shows replay then crash with this log output:
> >>
> >>ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
> >> (stable)
> >> 1: (()+0x3f5480) [0x555de8a51480]
> >> 2: (()+0x12890) [0x7f6e4cb41890]
> >> 3: (gsignal()+0xc7) [0x7f6e4bc39e97]
> >> 4: (abort()+0x141) [0x7f6e4bc3b801]
> >> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x250) [0x7f6e4d22a710]
> >> 6: (()+0x26c787) [0x7f6e4d22a787]
> >> 7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b)
> >> [0x555de8a3c83b]
> >> 8: (EUpdate::replay(MDSRank*)+0x39) [0x555de8a3dd79]
> >> 9: (MDLog::_replay_thread()+0x864) [0x555de89e6e04]
> >> 10: (MDLog::ReplayThread::entry()+0xd) [0x555de8784ebd]
> >> 11: (()+0x76db) [0x7f6e4cb366db]
> >> 12: (clone()+0x3f) [0x7f6e4bd1c88f]
> >> NOTE: a copy of the executable, or `objdump -rdS ` is needed
> >> to interpret this
> >>
> >> journal reports OK
> >>
> >> Now im trying:
> >>
> >>cephfs-data-scan scan_extents cephfs_data
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-07 Thread Daniel Carrasco
I've got several problems on 12.2.8 too. All my standby MDSes use a lot of
memory (while the active one uses a normal amount), and I'm receiving a lot of
slow MDS messages (causing the webpage to freeze and fail until the MDSes are
restarted)... Finally I had to copy the entire site to DRBD and use NFS to
solve all the problems...

El lun., 8 oct. 2018 a las 5:21, Alex Litvak ()
escribió:

> How is this not an emergency announcement?  Also I wonder if I can
> downgrade at all ?  I am using ceph with docker deployed with
> ceph-ansible.  I wonder if I should push downgrade or basically wait for
> the fix.  I believe, a fix needs to be provided.
>
> Thank you,
>
> On 10/7/2018 9:30 PM, Yan, Zheng wrote:
> > There is a bug in v13.2.2 mds, which causes decoding purge queue to
> > fail. If mds is already in damaged state, please downgrade mds to
> > 13.2.1, then run 'ceph mds repaired fs_name:damaged_rank' .
> >
> > Sorry for all the trouble I caused.
> > Yan, Zheng
> >
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-07 Thread Alex Litvak
How is this not an emergency announcement? Also, I wonder if I can downgrade
at all? I am using Ceph with Docker, deployed with ceph-ansible. I wonder if I
should push a downgrade or basically wait for the fix. I believe a fix needs
to be provided.


Thank you,

On 10/7/2018 9:30 PM, Yan, Zheng wrote:

There is a bug in v13.2.2 mds, which causes decoding purge queue to
fail. If mds is already in damaged state, please downgrade mds to
13.2.1, then run 'ceph mds repaired fs_name:damaged_rank' .

Sorry for all the trouble I caused.
Yan, Zheng






Re: [ceph-users] Error in MDS (laggy or crashed)

2018-10-07 Thread Alfredo Daniel Rezinovsky

I tried to downgrade and the MDS still fails.

I removed the MDSes and created them again.

I already messed with the purge_queue. Can I reset the queue (with 13.2.1)?


Thanks for your help

On 07/10/18 23:18, Yan, Zheng wrote:

Sorry there is bug in 13.2.2 that breaks compatibility of purge queue
disk format. Please downgrading mds to 13.2.1, then run 'ceph mds
repaired cephfs_name:0'.

Regards
Yan, Zheng
On Mon, Oct 8, 2018 at 9:20 AM Alfredo Daniel Rezinovsky
 wrote:

Cluster with 4 nodes

node 1: 2 HDDs
node 2: 3 HDDs
node 3: 3 HDDs
node 4: 2 HDDs

After a problem with upgrade from 13.2.1 to 13.2.2 (I restarted the
nodes 1 at a time, think that was the problem)

I upgraded with ubuntu apt-get upgrade. I had 1 active mds at a time
when did the upgrade.

All MDSs stopped working

Status shows 1 crashed and no one in standby.

If I restart an MDS status shows replay then crash with this log output:

   ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
(stable)
1: (()+0x3f5480) [0x555de8a51480]
2: (()+0x12890) [0x7f6e4cb41890]
3: (gsignal()+0xc7) [0x7f6e4bc39e97]
4: (abort()+0x141) [0x7f6e4bc3b801]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x250) [0x7f6e4d22a710]
6: (()+0x26c787) [0x7f6e4d22a787]
7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b)
[0x555de8a3c83b]
8: (EUpdate::replay(MDSRank*)+0x39) [0x555de8a3dd79]
9: (MDLog::_replay_thread()+0x864) [0x555de89e6e04]
10: (MDLog::ReplayThread::entry()+0xd) [0x555de8784ebd]
11: (()+0x76db) [0x7f6e4cb366db]
12: (clone()+0x3f) [0x7f6e4bd1c88f]
NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this

journal reports OK

Now im trying:

   cephfs-data-scan scan_extents cephfs_data


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-07 Thread Yan, Zheng
There is a bug in the v13.2.2 MDS which causes decoding of the purge queue to
fail. If an MDS is already in the damaged state, please downgrade the MDS to
13.2.1, then run 'ceph mds repaired fs_name:damaged_rank'.
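
A rough outline of that procedure on a Debian/Ubuntu-style install (the
package version string and the systemd unit name are assumptions; adjust for
your distro and MDS names):

# on each MDS host: downgrade only the ceph-mds package to 13.2.1 and restart the daemon
apt-get install ceph-mds=13.2.1-1bionic
systemctl restart ceph-mds@$(hostname -s)
# once all MDS daemons are back on 13.2.1, clear the damaged flag for the affected rank
ceph mds repaired <fs_name>:<damaged_rank>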

Sorry for all the trouble I caused.
Yan, Zheng


Re: [ceph-users] Error in MDS (laggy or crashed)

2018-10-07 Thread Yan, Zheng
Sorry, there is a bug in 13.2.2 that breaks compatibility of the purge queue
disk format. Please downgrade the MDS to 13.2.1, then run 'ceph mds repaired
cephfs_name:0'.

Regards
Yan, Zheng
On Mon, Oct 8, 2018 at 9:20 AM Alfredo Daniel Rezinovsky
 wrote:
>
> Cluster with 4 nodes
>
> node 1: 2 HDDs
> node 2: 3 HDDs
> node 3: 3 HDDs
> node 4: 2 HDDs
>
> After a problem with upgrade from 13.2.1 to 13.2.2 (I restarted the
> nodes 1 at a time, think that was the problem)
>
> I upgraded with ubuntu apt-get upgrade. I had 1 active mds at a time
> when did the upgrade.
>
> All MDSs stopped working
>
> Status shows 1 crashed and no one in standby.
>
> If I restart an MDS status shows replay then crash with this log output:
>
>   ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
> (stable)
> 1: (()+0x3f5480) [0x555de8a51480]
> 2: (()+0x12890) [0x7f6e4cb41890]
> 3: (gsignal()+0xc7) [0x7f6e4bc39e97]
> 4: (abort()+0x141) [0x7f6e4bc3b801]
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x250) [0x7f6e4d22a710]
> 6: (()+0x26c787) [0x7f6e4d22a787]
> 7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b)
> [0x555de8a3c83b]
> 8: (EUpdate::replay(MDSRank*)+0x39) [0x555de8a3dd79]
> 9: (MDLog::_replay_thread()+0x864) [0x555de89e6e04]
> 10: (MDLog::ReplayThread::entry()+0xd) [0x555de8784ebd]
> 11: (()+0x76db) [0x7f6e4cb366db]
> 12: (clone()+0x3f) [0x7f6e4bd1c88f]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this
>
> journal reports OK
>
> Now im trying:
>
>   cephfs-data-scan scan_extents cephfs_data
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-07 Thread Yan, Zheng
Sorry, this was caused by a wrong backport. Downgrading the MDS to 13.2.1 and
marking the MDS repaired can resolve this.

Yan, Zheng
On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin  wrote:
>
> Update:
> I discovered http://tracker.ceph.com/issues/24236 and 
> https://github.com/ceph/ceph/pull/22146
> Make sure that it is not relevant in your case before proceeding to 
> operations that modify on-disk data.
>
>
> On 6.10.2018, at 03:17, Sergey Malinin  wrote:
>
> I ended up rescanning the entire fs using alternate metadata pool approach as 
> in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> The process has not competed yet because during the recovery our cluster 
> encountered another problem with OSDs that I got fixed yesterday (thanks to 
> Igor Fedotov @ SUSE).
> The first stage (scan_extents) completed in 84 hours (120M objects in data 
> pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by 
> OSDs failure so I have no timing stats but it seems to be runing 2-3 times 
> faster than extents scan.
> As to root cause -- in my case I recall that during upgrade I had forgotten 
> to restart 3 OSDs, one of which was holding metadata pool contents, before 
> restarting MDS daemons and that seemed to had an impact on MDS journal 
> corruption, because when I restarted those OSDs, MDS was able to start up but 
> soon failed throwing lots of 'loaded dup inode' errors.
>
>
> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky  
> wrote:
>
> Same problem...
>
> # cephfs-journal-tool --journal=purge_queue journal inspect
> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
> Overall journal integrity: DAMAGED
> Objects missing:
>   0x16c
> Corrupt regions:
>   0x5b00-
>
> Just after upgrade to 13.2.2
>
> Did you fixed it?
>
>
> On 26/09/18 13:05, Sergey Malinin wrote:
>
> Hello,
> Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
> After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
> damaged. Resetting purge_queue does not seem to work well as journal still 
> appears to be damaged.
> Can anybody help?
>
> mds log:
>
>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to 
> version 586 from mon.2
>   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am 
> now mds.0.583
>   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
> state change up:rejoin --> up:active
>   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
> successful recovery!
> 
>-38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: 
> Decode error at read_pos=0x322ec6636
>-37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 
> set_want_state: up:active -> down:damaged
>-36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send 
> down:damaged seq 137
>-35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message 
> to mon.ceph3 at mon:6789/0
>-34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> 
> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 
> 0x563b321ad480 con 0
> 
> -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> 
> mon:6789/0 conn(0x563b3213e000 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 
> 29 0x563b321ab880 mdsbeaco
> n(85106/mds2 down:damaged seq 311 v587) v7
> -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== 
> mon.2 mon:6789/0 29  mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 
>  129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e
> 000
> -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 
> handle_mds_beacon down:damaged seq 311 rtt 0.038261
>  0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
>
> # cephfs-journal-tool --journal=purge_queue journal inspect
> Overall journal integrity: DAMAGED
> Corrupt regions:
>   0x322ec65d9-
>
> # cephfs-journal-tool --journal=purge_queue journal reset
> old journal was 13470819801~8463
> new journal start will be 13472104448 (1276184 bytes past old end)
> writing journal head
> done
>
> # cephfs-journal-tool --journal=purge_queue journal inspect
> 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.0c8c
> Overall journal integrity: DAMAGED
> Objects missing:
>   0xc8c
> Corrupt regions:
>   0x32300-
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent directory content in cephfs

2018-10-07 Thread Yan, Zheng
On Fri, Oct 5, 2018 at 6:57 PM Burkhard Linke
 wrote:
>
> Hi,
>
>
> a user just stumbled across a problem with directory content in cephfs
> (kernel client, ceph 12.2.8, one active, one standby-replay instance):
>
>
> root@host1:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
> 224
> root@host1:~# uname -a
> Linux host1 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@host2:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
> 224
> root@host2:~# uname -a
> Linux host2 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@host3:~# ls /ceph/sge-tmp/db/work/6c | wc -l
> 225
> root@host3:~# uname -a
> Linux host3 4.13.0-19-generic #22~16.04.1-Ubuntu SMP Mon Dec 4 15:35:18
> UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>
>
> Three hosts, different kernel versions, and one extra directory entry on
> the third host. All host used the same mount configuration:
>

which kernel versions?
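
For what it's worth, one way to see what the MDS actually has on disk for that
directory is to list the dentry keys of the directory object in the metadata
pool. The pool name and the inode number below are placeholders; the object
name is the directory's inode number in hex plus the fragment suffix:

# on a client: get the directory's inode number
ls -id /ceph/sge-tmp/db/work/06
# convert it to hex, e.g. inode 1099511627862 -> 10000000056
printf '%x\n' 1099511627862
# list the dentries stored for the first (and usually only) fragment of that directory
rados -p cephfs_metadata listomapkeys 10000000056.00000000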

> # mount | grep ceph
> :/volumes on /ceph type ceph
> (rw,relatime,name=volumes,secret=,acl,readdir_max_entries=8192,readdir_max_bytes=4104304)
>
> MDS logs only contain '2018-10-05 12:43:55.565598 7f2b7c578700  1
> mds.ceph-storage-04 Updating MDS map to version 325550 from mon.0' about
> every few minutes, with increasing version numbers. ceph -w also shows
> the following warnings:
>
> 2018-10-05 12:25:06.955085 mon.ceph-storage-03 [WRN] Health check
> failed: 2 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
> 2018-10-05 12:26:18.895358 mon.ceph-storage-03 [INF] MDS health message
> cleared (mds.0): Client host1:volumes failing to respond to cache pressure
> 2018-10-05 12:26:18.895401 mon.ceph-storage-03 [INF] MDS health message
> cleared (mds.0): Client cb-pc10:volumes failing to respond to cache pressure
> 2018-10-05 12:26:19.415890 mon.ceph-storage-03 [INF] Health check
> cleared: MDS_CLIENT_RECALL (was: 2 clients failing to respond to cache
> pressure)
> 2018-10-05 12:26:19.415919 mon.ceph-storage-03 [INF] Cluster is now healthy
>
> Timestamps of the MDS log messages and the messages about cache pressure
> are equal, so I assume that the MDS map has a list of failing clients
> and thus gets updated.
>
>
> But this does not explain the difference in the directory content. All
> entries are subdirectories. I also tried to enforce renewal of cached
> information by drop the kernel caches on the affected host, but to no
> avail yet. Caps on the MDS have dropped from 3.2 million to 800k, so
> dropping was effective.
>
>
> Any hints on the root cause for this problem? I've also tested various
> other clients; some show 224 entries, some 225.
>
>
> Regards,
>
> Burkhard
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error in MDS (laggy or crashed)

2018-10-07 Thread Yan, Zheng
does the log show which assertion was triggered?
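
In case it helps to locate it, the failed assertion is normally printed just
above the backtrace; assuming a default packaged install, something like:

grep "FAILED assert" /var/log/ceph/ceph-mds.*.log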

Yan, Zheng
On Mon, Oct 8, 2018 at 9:20 AM Alfredo Daniel Rezinovsky
 wrote:
>
> Cluster with 4 nodes
>
> node 1: 2 HDDs
> node 2: 3 HDDs
> node 3: 3 HDDs
> node 4: 2 HDDs
>
> After a problem with upgrade from 13.2.1 to 13.2.2 (I restarted the
> nodes 1 at a time, think that was the problem)
>
> I upgraded with ubuntu apt-get upgrade. I had 1 active mds at a time
> when did the upgrade.
>
> All MDSs stopped working
>
> Status shows 1 crashed and no one in standby.
>
> If I restart an MDS status shows replay then crash with this log output:
>
>   ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
> (stable)
> 1: (()+0x3f5480) [0x555de8a51480]
> 2: (()+0x12890) [0x7f6e4cb41890]
> 3: (gsignal()+0xc7) [0x7f6e4bc39e97]
> 4: (abort()+0x141) [0x7f6e4bc3b801]
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x250) [0x7f6e4d22a710]
> 6: (()+0x26c787) [0x7f6e4d22a787]
> 7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b)
> [0x555de8a3c83b]
> 8: (EUpdate::replay(MDSRank*)+0x39) [0x555de8a3dd79]
> 9: (MDLog::_replay_thread()+0x864) [0x555de89e6e04]
> 10: (MDLog::ReplayThread::entry()+0xd) [0x555de8784ebd]
> 11: (()+0x76db) [0x7f6e4cb366db]
> 12: (clone()+0x3f) [0x7f6e4bd1c88f]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this
>
> journal reports OK
>
> Now im trying:
>
>   cephfs-data-scan scan_extents cephfs_data
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Error in MDS (laggy or crashed)

2018-10-07 Thread Alfredo Daniel Rezinovsky

Cluster with 4 nodes

node 1: 2 HDDs
node 2: 3 HDDs
node 3: 3 HDDs
node 4: 2 HDDs

After a problem with the upgrade from 13.2.1 to 13.2.2 (I restarted the
nodes one at a time; I think that was the problem)

I upgraded with Ubuntu apt-get upgrade. I had 1 active MDS at a time
when I did the upgrade.

All MDSs stopped working.

Status shows 1 crashed and none in standby.

If I restart an MDS, status shows replay and then it crashes with this log output:

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic 
(stable)

1: (()+0x3f5480) [0x555de8a51480]
2: (()+0x12890) [0x7f6e4cb41890]
3: (gsignal()+0xc7) [0x7f6e4bc39e97]
4: (abort()+0x141) [0x7f6e4bc3b801]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x250) [0x7f6e4d22a710]

6: (()+0x26c787) [0x7f6e4d22a787]
7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) 
[0x555de8a3c83b]

8: (EUpdate::replay(MDSRank*)+0x39) [0x555de8a3dd79]
9: (MDLog::_replay_thread()+0x864) [0x555de89e6e04]
10: (MDLog::ReplayThread::entry()+0xd) [0x555de8784ebd]
11: (()+0x76db) [0x7f6e4cb366db]
12: (clone()+0x3f) [0x7f6e4bd1c88f]
NOTE: a copy of the executable, or `objdump -rdS ` is needed 
to interpret this


The journal reports OK.

Now I'm trying:

 cephfs-data-scan scan_extents cephfs_data
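
For reference, scan_extents (and scan_inodes) can be split across several
workers, which speeds things up considerably on large data pools; a sketch
with 4 workers, each backgrounded or run in its own shell:

cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data &
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data &
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data &
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data &
wait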




[ceph-users] Error in MDS (laggy or crashed)

2018-10-07 Thread Alfredo Daniel Rezinovsky

Cluster with 4 nodes

node 1: 2 HDDs
node 2: 3 HDDs
node 3: 3 HDDs
node 4: 2 HDDs

After a problem with the upgrade from 13.2.1 to 13.2.2 (I restarted the
nodes one at a time)

I upgraded with Ubuntu apt-get upgrade. I had 1 active MDS at a time
when I did the upgrade.

All MDSs stopped working.

Status shows 1 crashed and none in standby.

If I restart an MDS, status shows replay and then it crashes with this log output:

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic 
(stable)

1: (()+0x3f5480) [0x555de8a51480]
2: (()+0x12890) [0x7f6e4cb41890]
3: (gsignal()+0xc7) [0x7f6e4bc39e97]
4: (abort()+0x141) [0x7f6e4bc3b801]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x250) [0x7f6e4d22a710]

6: (()+0x26c787) [0x7f6e4d22a787]
7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) 
[0x555de8a3c83b]

8: (EUpdate::replay(MDSRank*)+0x39) [0x555de8a3dd79]
9: (MDLog::_replay_thread()+0x864) [0x555de89e6e04]
10: (MDLog::ReplayThread::entry()+0xd) [0x555de8784ebd]
11: (()+0x76db) [0x7f6e4cb366db]
12: (clone()+0x3f) [0x7f6e4bd1c88f]
NOTE: a copy of the executable, or `objdump -rdS ` is needed 
to interpret this


The journal reports OK.

Now I'm trying:

 cephfs-data-scan scan_extents cephfs_data




Re: [ceph-users] list admin issues

2018-10-07 Thread Paul Emmerich
I'm also seeing this once every few months or so on Gmail with G Suite.

Paul
Am So., 7. Okt. 2018 um 08:18 Uhr schrieb Joshua Chen
:
>
> I also got removed once, got another warning once (need to re-enable).
>
> Cheers
> Joshua
>
>
> On Sun, Oct 7, 2018 at 5:38 AM Svante Karlsson  wrote:
>>
>> I'm also getting removed but not only from ceph. I subscribe 
>> d...@kafka.apache.org list and the same thing happens there.
>>
>> Den lör 6 okt. 2018 kl 23:24 skrev Jeff Smith :
>>>
>>> I have been removed twice.
>>> On Sat, Oct 6, 2018 at 7:07 AM Elias Abacioglu
>>>  wrote:
>>> >
>>> > Hi,
>>> >
>>> > I'm bumping this old thread cause it's getting annoying. My membership 
>>> > get disabled twice a month.
>>> > Between my two Gmail accounts I'm in more than 25 mailing lists and I see 
>>> > this behavior only here. Why is only ceph-users only affected? Maybe 
>>> > Christian was on to something, is this intentional?
>>> > Reality is that there is a lot of ceph-users with Gmail accounts, perhaps 
>>> > it wouldn't be so bad to actually trying to figure this one out?
>>> >
>>> > So can the maintainers of this list please investigate what actually gets 
>>> > bounced? Look at my address if you want.
>>> > I got disabled 20181006, 20180927, 20180916, 20180725, 20180718 most 
>>> > recently.
>>> > Please help!
>>> >
>>> > Thanks,
>>> > Elias
>>> >
>>> > On Mon, Oct 16, 2017 at 5:41 AM Christian Balzer  wrote:
>>> >>
>>> >>
>>> >> Most mails to this ML score low or negatively with SpamAssassin, however
>>> >> once in a while (this is a recent one) we get relatively high scores.
>>> >> Note that the forged bits are false positives, but the SA is up to date 
>>> >> and
>>> >> google will have similar checks:
>>> >> ---
>>> >> X-Spam-Status: No, score=3.9 required=10.0 tests=BAYES_00,DCC_CHECK,
>>> >>  
>>> >> FORGED_MUA_MOZILLA,FORGED_YAHOO_RCVD,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
>>> >>  
>>> >> HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,MIME_HTML_MOSTLY,RCVD_IN_MSPIKE_H4,
>>> >>  RCVD_IN_MSPIKE_WL,RDNS_NONE,T_DKIM_INVALID shortcircuit=no autolearn=no
>>> >> ---
>>> >>
>>> >> Between attachment mails and some of these and you're well on your way 
>>> >> out.
>>> >>
>>> >> The default mailman settings and logic require 5 bounces to trigger
>>> >> unsubscription and 7 days of NO bounces to reset the counter.
>>> >>
>>> >> Christian
>>> >>
>>> >> On Mon, 16 Oct 2017 12:23:25 +0900 Christian Balzer wrote:
>>> >>
>>> >> > On Mon, 16 Oct 2017 14:15:22 +1100 Blair Bethwaite wrote:
>>> >> >
>>> >> > > Thanks Christian,
>>> >> > >
>>> >> > > You're no doubt on the right track, but I'd really like to figure out
>>> >> > > what it is at my end - I'm unlikely to be the only person subscribed
>>> >> > > to ceph-users via a gmail account.
>>> >> > >
>>> >> > > Re. attachments, I'm surprised mailman would be allowing them in the
>>> >> > > first place, and even so gmail's attachment requirements are less
>>> >> > > strict than most corporate email setups (those that don't already use
>>> >> > > a cloud provider).
>>> >> > >
>>> >> > Mailman doesn't do anything with this by default AFAIK, but see below.
>>> >> > Strict is fine if you're in control, corporate mail can be hell, 
>>> >> > doubly so
>>> >> > if on M$ cloud.
>>> >> >
>>> >> > > This started happening earlier in the year after I turned off digest
>>> >> > > mode. I also have a paid google domain, maybe I'll try setting
>>> >> > > delivery to that address and seeing if anything changes...
>>> >> > >
>>> >> > Don't think google domain is handled differently, but what do I know.
>>> >> >
>>> >> > Though the digest bit confirms my suspicion about attachments:
>>> >> > ---
>>> >> > When a subscriber chooses to receive plain text daily “digests” of list
>>> >> > messages, Mailman sends the digest messages without any original
>>> >> > attachments (in Mailman lingo, it “scrubs” the messages of 
>>> >> > attachments).
>>> >> > However, Mailman also includes links to the original attachments that 
>>> >> > the
>>> >> > recipient can click on.
>>> >> > ---
>>> >> >
>>> >> > Christian
>>> >> >
>>> >> > > Cheers,
>>> >> > >
>>> >> > > On 16 October 2017 at 13:54, Christian Balzer  wrote:
>>> >> > > >
>>> >> > > > Hello,
>>> >> > > >
>>> >> > > > You're on gmail.
>>> >> > > >
>>> >> > > > Aside from various potential false positives with regards to spam 
>>> >> > > > my bet
>>> >> > > > is that gmail's known dislike for attachments is the cause of these
>>> >> > > > bounces and that setting is beyond your control.
>>> >> > > >
>>> >> > > > Because Google knows best[tm].
>>> >> > > >
>>> >> > > > Christian
>>> >> > > >
>>> >> > > > On Mon, 16 Oct 2017 13:50:43 +1100 Blair Bethwaite wrote:
>>> >> > > >
>>> >> > > >> Hi all,
>>> >> > > >>
>>> >> > > >> This is a mailing-list admin issue - I keep being unsubscribed 
>>> >> > > >> from
>>> >> > > >> ceph-users with the message:
>>> >> > > >> "Your membership in the mailing list ceph-users has been disabled 
>>> >> > > >> due
>>> >> > > 

Re: [ceph-users] Cannot write to cephfs if some OSDs are not available on the client network

2018-10-07 Thread Paul Emmerich
solarflow99 :
> now this goes against what I thought I learned about ceph fs.  You should be 
> able to RW to/from all OSDs, how can it be limited to only a single OSD??

Clients only connect to the primary OSD of a PG, so technically an OSD
that isn't the primary of any PG doesn't have to be reachable by
clients.
That doesn't mean it's a good idea to configure a system like this ;)

Paul
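
For reference, the primary-affinity workaround mentioned further down the
thread boils down to something like this (osd.23 is just an example id; on
older releases 'mon osd allow primary affinity = true' may also be needed):

# stop osd.23 from ever being selected as a primary of any PG
ceph osd primary-affinity osd.23 0

Setting the value back to 1 restores the default behaviour.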


>
> On Sat, Oct 6, 2018 at 4:30 AM Christopher Blum  
> wrote:
>>
>> I wouldn't recommend you pursuit this any further, but if this is the only 
>> client that would reside on the same VM as the OSD, one thing you could try 
>> is to decrease the primary affinity to 0 [1] for the local OSD .
>> That way that single OSD would never become a primary OSD ;)
>>
>> Disclaimer: This is more like a hack.
>>
>>
>> [1] https://ceph.com/geen-categorie/ceph-primary-affinity/
>>
>> On Fri, Oct 5, 2018 at 10:23 PM Gregory Farnum  wrote:
>>>
>>> On Fri, Oct 5, 2018 at 3:13 AM Marc Roos  wrote:



 I guess then this waiting "quietly" should be looked at again, I am
 having load of 10 on this vm.

 [@~]# uptime
  11:51:58 up 4 days,  1:35,  1 user,  load average: 10.00, 10.01, 10.05

 [@~]# uname -a
 Linux smb 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
 x86_64 x86_64 x86_64 GNU/Linux

 [@~]# cat /etc/redhat-release
 CentOS Linux release 7.5.1804 (Core)

 [@~]# dmesg
 [348948.927734] libceph: osd23 192.168.10.114:6810 socket closed (con
 state CONNECTING)
 [348957.120090] libceph: osd27 192.168.10.114:6802 socket closed (con
 state CONNECTING)
 [349010.370171] libceph: osd26 192.168.10.114:6806 socket closed (con
 state CONNECTING)
 [349114.822301] libceph: osd24 192.168.10.114:6804 socket closed (con
 state CONNECTING)
 [349141.447330] libceph: osd29 192.168.10.114:6812 socket closed (con
 state CONNECTING)
 [349278.668658] libceph: osd25 192.168.10.114:6800 socket closed (con
 state CONNECTING)
 [349440.467038] libceph: osd28 192.168.10.114:6808 socket closed (con
 state CONNECTING)
 [349465.043957] libceph: osd23 192.168.10.114:6810 socket closed (con
 state CONNECTING)
 [349473.236400] libceph: osd27 192.168.10.114:6802 socket closed (con
 state CONNECTING)
 [349526.486408] libceph: osd26 192.168.10.114:6806 socket closed (con
 state CONNECTING)
 [349630.938498] libceph: osd24 192.168.10.114:6804 socket closed (con
 state CONNECTING)
 [349657.563561] libceph: osd29 192.168.10.114:6812 socket closed (con
 state CONNECTING)
 [349794.784936] libceph: osd25 192.168.10.114:6800 socket closed (con
 state CONNECTING)
 [349956.583300] libceph: osd28 192.168.10.114:6808 socket closed (con
 state CONNECTING)
 [349981.160225] libceph: osd23 192.168.10.114:6810 socket closed (con
 state CONNECTING)
 [349989.352510] libceph: osd27 192.168.10.114:6802 socket closed (con
 state CONNECTING)
>>>
>>>
>>> Looks like in this case the client is spinning trying to establish the 
>>> network connections it expects to be available. There's not really much 
>>> else it can do — we expect and require full routing. The monitors are 
>>> telling the clients that the OSDs are up and available, and it is doing 
>>> data IO that requires them. So it tries to establish a connection, sees the 
>>> network fail, and tries again.
>>>
>>> Unfortunately the restricted-network use case you're playing with here is 
>>> just not supported by Ceph.
>>> -Greg
>>>

 ..
 ..
 ..




 -Original Message-
 From: John Spray [mailto:jsp...@redhat.com]
 Sent: donderdag 27 september 2018 11:43
 To: Marc Roos
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cannot write to cephfs if some osd's are not
 available on the client network

 On Thu, Sep 27, 2018 at 10:16 AM Marc Roos 
 wrote:
 >
 >
 > I have a test cluster and on a osd node I put a vm. The vm is using a
 > macvtap on the client network interface of the osd node. Making access

 > to local osd's impossible.
 >
 > the vm of course reports that it cannot access the local osd's. What I

 > am getting is:
 >
 > - I cannot reboot this vm normally, need to reset it.

 When linux tries to shut down cleanly, part of that is flushing buffers
 from any mounted filesystem back to disk.  If you have a network
 filesystem mounted, and the network is unavailable, that can cause the
 process to block.  You can try forcibly unmounting before rebooting.

 > - vm is reporting very high load.

 The CPU load part is surprising -- in general Ceph clients should wait
 quietly when blocked, rather than spinning.

 > I guess this should not be happening not? Because it should choose an
 > other available osd of the 3x replicated pool and just 

Re: [ceph-users] Cluster broken and OSDs crash with failed assertion in PGLog::merge_log

2018-10-07 Thread Jonas Jelten
Thanks, I did that now: https://tracker.ceph.com/issues/36337
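
For anyone hitting the same assert, the information Neha asks for below can be
gathered roughly like this (log path assumes a default packaged install, osd.7
is just an example id):

ceph pg dump --format json-pretty > pg_dump.json
cp /var/log/ceph/ceph-osd.7.log ./osd.7.log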

On 05/10/2018 19.12, Neha Ojha wrote:
> Hi JJ,
> 
> In the case, the condition olog.head >= log.tail is not true,
> therefore it crashes. Could you please open a tracker
> issue(https://tracker.ceph.com/) and attach the osd logs and the pg
> dump output?
> 
> Thanks,
> Neha
> 


Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-07 Thread Sergey Malinin
I was able to start the MDS and mount the fs, with broken ownership/permissions
and 8k out of millions of files in lost+found.


> On 7.10.2018, at 02:04, Sergey Malinin  wrote:
> 
> I'm at scan_links now, will post an update once it has finished.
> Have you reset the journal after fs recovery as suggested in the doc?
> 
> quote:
> 
> If the damaged filesystem contains dirty journal data, it may be recovered 
> next with:
> 
> cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool recovery
> cephfs-journal-tool --rank recovery-fs:0 journal reset --force
> 
> 
>> On 7.10.2018, at 00:36, Alfredo Daniel Rezinovsky > > wrote:
>> 
>> I did something wrong in the upgrade restart also...
>> 
>> after rescaning with:
>> 
>> cephfs-data-scan scan_extents cephfs_data (with threads)
>> 
>> cephfs-data-scan scan_inodes cephfs_data (with threads)
>> 
>> cephfs-data-scan scan_links
>> 
>> My MDS still crashes and wont replay.
>>  1: (()+0x3ec320) [0x55b0e2bd2320]
>>  2: (()+0x12890) [0x7fc3adce3890]
>>  3: (gsignal()+0xc7) [0x7fc3acddbe97]
>>  4: (abort()+0x141) [0x7fc3acddd801]
>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x250) [0x7fc3ae3cc080]
>>  6: (()+0x26c0f7) [0x7fc3ae3cc0f7]
>>  7: (()+0x21eb27) [0x55b0e2a04b27]
>>  8: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, 
>> snapid_t)+0xc0) [0x55b0e2a04d40]
>>  9: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned 
>> long, utime_t)+0x91d) [0x55b0e2a6a0fd]
>>  10: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x39f) 
>> [0x55b0e2a3ca2f]
>>  11: (MDSIOContextBase::complete(int)+0x119) [0x55b0e2b54ab9]
>>  12: (Filer::C_Probe::finish(int)+0xe7) [0x55b0e2bd94e7]
>>  13: (Context::complete(int)+0x9) [0x55b0e28e9719]
>>  14: (Finisher::finisher_thread_entry()+0x12e) [0x7fc3ae3ca4ce]
>>  15: (()+0x76db) [0x7fc3adcd86db]
>>  16: (clone()+0x3f) [0x7fc3acebe88f]
>> 
>> Did you do somenthing else before starting the MDSs again?
>> 
>> On 05/10/18 21:17, Sergey Malinin wrote:
>>> I ended up rescanning the entire fs using alternate metadata pool approach 
>>> as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/ 
>>> 
>>> The process has not competed yet because during the recovery our cluster 
>>> encountered another problem with OSDs that I got fixed yesterday (thanks to 
>>> Igor Fedotov @ SUSE).
>>> The first stage (scan_extents) completed in 84 hours (120M objects in data 
>>> pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by 
>>> OSDs failure so I have no timing stats but it seems to be runing 2-3 times 
>>> faster than extents scan.
>>> As to root cause -- in my case I recall that during upgrade I had forgotten 
>>> to restart 3 OSDs, one of which was holding metadata pool contents, before 
>>> restarting MDS daemons and that seemed to had an impact on MDS journal 
>>> corruption, because when I restarted those OSDs, MDS was able to start up 
>>> but soon failed throwing lots of 'loaded dup inode' errors.
>>> 
>>> 
 On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky >>> > wrote:
 
 Same problem...
 
 # cephfs-journal-tool --journal=purge_queue journal inspect
 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
 Overall journal integrity: DAMAGED
 Objects missing:
   0x16c
 Corrupt regions:
   0x5b00-
 
 Just after upgrade to 13.2.2
 
 Did you fixed it?
 
 
 On 26/09/18 13:05, Sergey Malinin wrote:
> Hello,
> Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
> After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
> damaged. Resetting purge_queue does not seem to work well as journal 
> still appears to be damaged.
> Can anybody help?
> 
> mds log:
> 
>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map 
> to version 586 from mon.2
>   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
> i am now mds.0.583
>   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
> state change up:rejoin --> up:active
>   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done 
> -- successful recovery!
> 
>-38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue 
> _consume: Decode error at read_pos=0x322ec6636
>-37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 
> set_want_state: up:active -> down:damaged
>-36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send 
> down:damaged seq 137
>-35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: 
> _send_mon_message to mon.ceph3 at mon:6789/0
>-34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> 

Re: [ceph-users] list admin issues

2018-10-07 Thread Joshua Chen
I also got removed once, got another warning once (need to re-enable).

Cheers
Joshua


On Sun, Oct 7, 2018 at 5:38 AM Svante Karlsson 
wrote:

> I'm also getting removed but not only from ceph. I subscribe
> d...@kafka.apache.org list and the same thing happens there.
>
> Den lör 6 okt. 2018 kl 23:24 skrev Jeff Smith :
>
>> I have been removed twice.
>> On Sat, Oct 6, 2018 at 7:07 AM Elias Abacioglu
>>  wrote:
>> >
>> > Hi,
>> >
>> > I'm bumping this old thread cause it's getting annoying. My membership
>> get disabled twice a month.
>> > Between my two Gmail accounts I'm in more than 25 mailing lists and I
>> see this behavior only here. Why is only ceph-users only affected? Maybe
>> Christian was on to something, is this intentional?
>> > Reality is that there is a lot of ceph-users with Gmail accounts,
>> perhaps it wouldn't be so bad to actually trying to figure this one out?
>> >
>> > So can the maintainers of this list please investigate what actually
>> gets bounced? Look at my address if you want.
>> > I got disabled 20181006, 20180927, 20180916, 20180725, 20180718 most
>> recently.
>> > Please help!
>> >
>> > Thanks,
>> > Elias
>> >
>> > On Mon, Oct 16, 2017 at 5:41 AM Christian Balzer  wrote:
>> >>
>> >>
>> >> Most mails to this ML score low or negatively with SpamAssassin,
>> however
>> >> once in a while (this is a recent one) we get relatively high scores.
>> >> Note that the forged bits are false positives, but the SA is up to
>> date and
>> >> google will have similar checks:
>> >> ---
>> >> X-Spam-Status: No, score=3.9 required=10.0 tests=BAYES_00,DCC_CHECK,
>> >>
>> FORGED_MUA_MOZILLA,FORGED_YAHOO_RCVD,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
>> >>
>> HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,MIME_HTML_MOSTLY,RCVD_IN_MSPIKE_H4,
>> >>  RCVD_IN_MSPIKE_WL,RDNS_NONE,T_DKIM_INVALID shortcircuit=no
>> autolearn=no
>> >> ---
>> >>
>> >> Between attachment mails and some of these and you're well on your way
>> out.
>> >>
>> >> The default mailman settings and logic require 5 bounces to trigger
>> >> unsubscription and 7 days of NO bounces to reset the counter.
>> >>
>> >> Christian
>> >>
>> >> On Mon, 16 Oct 2017 12:23:25 +0900 Christian Balzer wrote:
>> >>
>> >> > On Mon, 16 Oct 2017 14:15:22 +1100 Blair Bethwaite wrote:
>> >> >
>> >> > > Thanks Christian,
>> >> > >
>> >> > > You're no doubt on the right track, but I'd really like to figure
>> out
>> >> > > what it is at my end - I'm unlikely to be the only person
>> subscribed
>> >> > > to ceph-users via a gmail account.
>> >> > >
>> >> > > Re. attachments, I'm surprised mailman would be allowing them in
>> the
>> >> > > first place, and even so gmail's attachment requirements are less
>> >> > > strict than most corporate email setups (those that don't already
>> use
>> >> > > a cloud provider).
>> >> > >
>> >> > Mailman doesn't do anything with this by default AFAIK, but see
>> below.
>> >> > Strict is fine if you're in control, corporate mail can be hell,
>> doubly so
>> >> > if on M$ cloud.
>> >> >
>> >> > > This started happening earlier in the year after I turned off
>> digest
>> >> > > mode. I also have a paid google domain, maybe I'll try setting
>> >> > > delivery to that address and seeing if anything changes...
>> >> > >
>> >> > Don't think google domain is handled differently, but what do I know.
>> >> >
>> >> > Though the digest bit confirms my suspicion about attachments:
>> >> > ---
>> >> > When a subscriber chooses to receive plain text daily “digests” of
>> list
>> >> > messages, Mailman sends the digest messages without any original
>> >> > attachments (in Mailman lingo, it “scrubs” the messages of
>> attachments).
>> >> > However, Mailman also includes links to the original attachments
>> that the
>> >> > recipient can click on.
>> >> > ---
>> >> >
>> >> > Christian
>> >> >
>> >> > > Cheers,
>> >> > >
>> >> > > On 16 October 2017 at 13:54, Christian Balzer 
>> wrote:
>> >> > > >
>> >> > > > Hello,
>> >> > > >
>> >> > > > You're on gmail.
>> >> > > >
>> >> > > > Aside from various potential false positives with regards to
>> spam my bet
>> >> > > > is that gmail's known dislike for attachments is the cause of
>> these
>> >> > > > bounces and that setting is beyond your control.
>> >> > > >
>> >> > > > Because Google knows best[tm].
>> >> > > >
>> >> > > > Christian
>> >> > > >
>> >> > > > On Mon, 16 Oct 2017 13:50:43 +1100 Blair Bethwaite wrote:
>> >> > > >
>> >> > > >> Hi all,
>> >> > > >>
>> >> > > >> This is a mailing-list admin issue - I keep being unsubscribed
>> from
>> >> > > >> ceph-users with the message:
>> >> > > >> "Your membership in the mailing list ceph-users has been
>> disabled due
>> >> > > >> to excessive bounces..."
>> >> > > >> This seems to be happening on roughly a monthly basis.
>> >> > > >>
>> >> > > >> Thing is I have no idea what the bounce is or where it is
>> coming from.
>> >> > > >> I've tried emailing ceph-users-ow...@lists.ceph.com and the
>> contact
>> >> > > >> listed in Mailman