Re: [ceph-users] MDS damaged

2017-10-24 Thread Daniel Davidson

This finally finished:

2017-10-24 22:50:11.766519 7f775e539bc0  1 scavenge_dentries: frag 
607. is corrupt, overwriting

Events by type:
  OPEN: 5640344
  SESSION: 10
  SUBTREEMAP: 8070
  UPDATE: 1384964
Errors: 0

I truncated:
#cephfs-journal-tool journal reset
old journal was 6255163020467~8616264519
new journal start will be 6263781982208 (2697222 bytes past old end)
writing journal head
writing EResetJournal entry
done

I reset sessions:
# cephfs-table-tool all reset session
{
    "0": {
    "data": {},
    "result": 0
    }
}

I marked it repaired:

#ceph mds repaired 0

And I still got errors, as shown by ceph -w:
2017-10-25 00:02:08.929404 mds.0 [ERR] dir 607 object missing on disk; 
some files may be lost (~mds0/stray7)
2017-10-25 00:02:09.099472 mon.0 [INF] mds.0 172.16.31.1:6800/3462673422 
down:damaged

2017-10-25 00:02:09.105643 mon.0 [INF] fsmap e121619: 0/1/1 up, 1 damaged
2017-10-25 00:02:10.182101 mon.0 [INF] mds.? 172.16.31.1:6809/2991612296 
up:boot
2017-10-25 00:02:10.182189 mon.0 [INF] fsmap e121620: 0/1/1 up, 1 
up:standby, 1 damaged


What should I do next? ceph fs reset igbhome scares me.

Dan


On 10/24/2017 09:25 PM, Daniel Davidson wrote:

Out of desperation, I started with the disaster recovery guide:

http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/

After exporting the journal, I started doing:

cephfs-journal-tool event recover_dentries summary

And that was about 7 hours ago, and it is still running.  I am getting 
a lot of messages like:


2017-10-24 21:24:10.910489 7f775e539bc0  1 scavenge_dentries: frag 
607. is corrupt, overwriting


The frag number is the same for every line and there have been thousands.

I really could use some assistance,

Dan




On 10/24/2017 12:14 PM, Daniel Davidson wrote:

Our ceph system is having a problem.

A few days ago we had a pg that was marked as inconsistent, and 
today I fixed it with:


#ceph pg repair 1.37c

Then a file was stuck as missing, so I did:

#ceph pg 1.37c mark_unfound_lost delete
pg has 1 objects unfound and apparently lost marking

That fixed the unfound file problem and all the pgs went 
active+clean.  A few minutes later though, the FS seemed to pause and 
the MDS started giving errors.


# ceph -w
    cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
 health HEALTH_ERR
    mds rank 0 is damaged
    mds cluster is degraded
    noscrub,nodeep-scrub flag(s) set
 monmap e3: 4 mons at 
{ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
    election epoch 652, quorum 0,1,2,3 
ceph-0,ceph-1,ceph-2,ceph-3

  fsmap e121409: 0/1/1 up, 4 up:standby, 1 damaged
 osdmap e35220: 32 osds: 32 up, 32 in
    flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
  pgmap v28398840: 1536 pgs, 2 pools, 795 TB data, 329 Mobjects
    1595 TB used, 1024 TB / 2619 TB avail
    1536 active+clean

Looking into the logs when I try a:

#ceph mds repaired 0

2017-10-24 12:01:27.354271 mds.0 172.16.31.3:6801/1949050374 75 : 
cluster [ERR] dir 607 object missing on disk; some files may be lost 
(~mds0/stray7)


Any ideas as to what to do next? I am stumped.

Dan











[ceph-users] pg inconsistent and repair doesn't work

2017-10-24 Thread Wei Jin
Hi, list,

We ran into a PG deep-scrub error and tried to repair it with `ceph pg
repair pgid`, but it didn't work. We also checked the object files and
found that all 3 replicas are zero-sized. What's the problem? Is it
a bug? And how can we fix the inconsistency? I haven't restarted the
osds so far, as I am not sure whether that would help.

ceph version: 10.2.9
user case: cephfs
kernel client: 4.4/4.9
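
For reference, this is roughly how I have been inspecting it (just a sketch; 1.fcd is the pg from the log below):

# list what the last deep scrub recorded as inconsistent for this pg
rados list-inconsistent-obj 1.fcd --format=json-pretty

# re-run the deep scrub after any manual intervention to re-check the state
ceph pg deep-scrub 1.fcd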

Error info from primary osd:

root@n10-075-019:~# grep -Hn 'ERR' /var/log/ceph/ceph-osd.27.log.1
/var/log/ceph/ceph-osd.27.log.1:3038:2017-10-25 04:47:34.460536
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 27: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3039:2017-10-25 04:47:34.460722
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 62: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3040:2017-10-25 04:47:34.460725
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 133: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3041:2017-10-25 04:47:34.460800
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head:
failed to pick suitable auth object
/var/log/ceph/ceph-osd.27.log.1:3042:2017-10-25 04:47:34.461458
7f39c4829700 -1 log_channel(cluster) log [ERR] : deep-scrub 1.fcd
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
on disk size (0) does not match object info size (3461120) adjusted
for ondisk to (3461120)
/var/log/ceph/ceph-osd.27.log.1:3043:2017-10-25 04:47:44.645934
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd deep-scrub 4
errors


Object file info:

root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head#


root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head#


root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head#


[ceph-users] Bluestore with SSD-backed DBs; what if the SSD fails?

2017-10-24 Thread Christian Sarrasin
I'm planning to migrate an existing Filestore cluster with (SATA)
SSD-based journals fronting multiple HDD-hosted OSDs - should be a
common enough setup.  So I've been trying to parse various contributions
here and Ceph devs' blog posts (for which, thanks!)

Seems the best way to repurpose that hardware would basically be to use
those SSDs as DB partitions for Bluestore.

The one thing I'm still wondering about is failure domains.  With
Filestore and SSD-backed journals, an SSD failure would kill writes but
OSDs were otherwise still whole.  Replacing the failed SSD quickly would
get you back on your feet with relatively little data movement.

Hence the question: what happens if an SSD that contains several
partitions hosting DBs for multiple OSDs fails?  Is the OSDs' data still
recoverable upon replacing the SSD, or is the entire lot basically toast?

If so, might this warrant revisiting the old debate about RAID-1'ing
SSDs in such a setup?  Or, I suppose, at least not being too ambitious
with the number of DBs hosted on a single SSD?

Thoughts much appreciated!

PS: It's not fully clear whether a separate WAL partition is useful in
that setup.  Sage posted about a month back: "[WAL] will always just
spill over onto the next fastest device (wal -> db -> main)".  I'll take
that as meaning that a separate WAL partition would be
counter-productive if hosted on the same SSD.  Please correct me if I'm
wrong.

Cheers
Christian



Re: [ceph-users] MDS damaged

2017-10-24 Thread Daniel Davidson

Out of desperation, I started with the disaster recovery guide:

http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
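
The first step from that guide, taking a backup of the journal before any destructive step, was roughly (backup.bin is just a local file name):

# export a copy of the MDS journal before touching it
cephfs-journal-tool journal export backup.bin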

After exporting the journal, I started doing:

cephfs-journal-tool event recover_dentries summary

And that was about 7 hours ago, and it is still running.  I am getting a 
lot of messages like:


2017-10-24 21:24:10.910489 7f775e539bc0  1 scavenge_dentries: frag 
607. is corrupt, overwriting


The frag number is the same for every line and there have been thousands.

I really could use some assistance,

Dan




On 10/24/2017 12:14 PM, Daniel Davidson wrote:

Our ceph system is having a problem.

A few days ago we had a pg that was marked as inconsistent, and today 
I fixed it with:


#ceph pg repair 1.37c

Then a file was stuck as missing, so I did:

#ceph pg 1.37c mark_unfound_lost delete
pg has 1 objects unfound and apparently lost marking

That fixed the unfound file problem and all the pgs went 
active+clean.  A few minutes later though, the FS seemed to pause and 
the MDS started giving errors.


# ceph -w
    cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
 health HEALTH_ERR
    mds rank 0 is damaged
    mds cluster is degraded
    noscrub,nodeep-scrub flag(s) set
 monmap e3: 4 mons at 
{ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
    election epoch 652, quorum 0,1,2,3 
ceph-0,ceph-1,ceph-2,ceph-3

  fsmap e121409: 0/1/1 up, 4 up:standby, 1 damaged
 osdmap e35220: 32 osds: 32 up, 32 in
    flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
  pgmap v28398840: 1536 pgs, 2 pools, 795 TB data, 329 Mobjects
    1595 TB used, 1024 TB / 2619 TB avail
    1536 active+clean

Looking into the logs when I try a:

#ceph mds repaired 0

2017-10-24 12:01:27.354271 mds.0 172.16.31.3:6801/1949050374 75 : 
cluster [ERR] dir 607 object missing on disk; some files may be lost 
(~mds0/stray7)


Any ideas as to what to do next? I am stumped.

Dan






Re: [ceph-users] Erasure code profile

2017-10-24 Thread jorpilo
That's a pretty hard question. I don't think it would speed up writes that much, 
because you end up writing the same amount of data, but I think that on a 4+4 setup 
rebuilding or serving data while a node is down will go faster and will use 
fewer resources, because it only has to rebuild smaller chunks of data.
Another question: when you create an EC pool you also create a CRUSH EC rule, so 
what would happen if you set a failure domain of node and in the rule you 
divide it by OSDs?  How do the failure domain and the CRUSH rule interact?
 Original message  From: Oliver Humpage  
Date: 24/10/17 10:32 p.m. (GMT+01:00) To: Karun Josy 
 Cc: ceph-users  Subject: Re: 
[ceph-users] Erasure code profile 

Consider a cluster of 8 OSD servers with 3 disks on each server. 
If I use a profile setting of k=5, m=3 and  ruleset-failure-domain=host ;
As far as I understand it can tolerate failure of 3 OSDs and 1 host, am I right 
?
When setting up your pool, you specify a crush map which says what your 
"failure domain” is. You can think of a failure domain as "what’s the largest 
single thing that could fail and the cluster would still survive?”. By default 
this is a node (a server). Large clusters often use a rack instead.  Ceph 
places your data across the OSDs in your cluster so that if that large single 
thing (node or rack) fails, your data is still safe and available.
If you specify a single OSD (a disk) as your failure domain, then ceph might 
end up placing lots of data on different OSDs on the same node. This is a bad 
idea since if that node goes down you'll lose several OSDs, and so your data 
might not survive.
If you have 8 nodes, and erasure of 5+3, then with the default failure domain 
of a node your data will be spread across all 8 nodes (data chunks on 5 of 
them, and parity chunks on the other three). Therefore you could tolerate 3 
whole nodes failing. You are right that 5+3 encoding will result in 1.6xdata 
disk usage.
If you were being pathological about minimising disk usage, I think you could 
in theory set a failure domain of an OSD, then use 8+2 encoding with a crush 
map that never used more than 2 OSDs in each node for a placement group. Then 
technically you could tolerate a node failure. I doubt anyone would recommend 
that though!
That said, here’s a question for others: say a cluster only has 4 nodes (each 
with many OSDs), would you use 2+2 or 4+4? Either way you use 2xdata space and 
could lose 2 nodes (assuming a proper crush map), but presumably the 4+4 would 
be faster and you could lose more OSDs?
Oliver.


Re: [ceph-users] Infinite degraded objects

2017-10-24 Thread Christian Wuerdig
From which version of ceph to which other version of ceph did you
upgrade? Can you provide logs from crashing OSDs? The degraded object
percentage being larger than 100% has been reported before
(https://www.spinics.net/lists/ceph-users/msg39519.html) and looks
like it's been fixed a week or so ago:
http://tracker.ceph.com/issues/21803

On Mon, Oct 23, 2017 at 5:10 AM, Gonzalo Aguilar Delgado
 wrote:
> Hello,
>
> Since we upgraded the ceph cluster we have been facing a lot of problems, most of them
> due to OSDs crashing. What can cause this?
>
>
> This morning I woke up with this message:
>
>
> root@red-compute:~# ceph -w
> cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>  health HEALTH_ERR
> 1 pgs are stuck inactive for more than 300 seconds
> 7 pgs inconsistent
> 1 pgs stale
> 1 pgs stuck stale
> recovery 20266198323167232/287940 objects degraded
> (7038340738753.641%)
> 37154696925806626 scrub errors
> too many PGs per OSD (305 > max 300)
>  monmap e12: 2 mons at
> {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
> election epoch 4986, quorum 0,1 red-compute,blue-compute
>   fsmap e913: 1/1/1 up {0=blue-compute=up:active}
>  osdmap e8096: 5 osds: 5 up, 5 in
> flags require_jewel_osds
>   pgmap v68755349: 764 pgs, 6 pools, 558 GB data, 140 kobjects
> 1119 GB used, 3060 GB / 4179 GB avail
> 20266198323167232/287940 objects degraded (7038340738753.641%)
>  756 active+clean
>7 active+clean+inconsistent
>1 stale+active+clean
>   client io 1630 B/s rd, 552 kB/s wr, 0 op/s rd, 64 op/s wr
>
> 2017-10-22 18:10:13.000812 mon.0 [INF] pgmap v68755348: 764 pgs: 7
> active+clean+inconsistent, 756 active+clean, 1 stale+active+clean; 558 GB
> data, 1119 GB used, 3060 GB / 4179 GB avail; 1641 B/s rd, 229 kB/s wr, 39
> op/s; 20266198323167232/287940 objects degraded (7038340738753.641%)
>
>


Re: [ceph-users] Erasure Pool OSD fail

2017-10-24 Thread Jorge Pinilla López
Well, you should use M > 1; the more you have, the less risk and the more
performance.

You don't read twice as much data; you read it from different sources.
Furthermore, you can even read less data and rebuild the rest, because
on erasure pools you don't replicate the data.


On the other hand, the configuration isn't as bad as you think, it's
just different.

3 nodes cluster

Replicated pool, size = 2

    -you can take 1 failure, then re-balance and take another failure.
(max 2 separate)

    -you use 2*data space

    -you have to write 2*data, full data on one node and full data on
the second one.

Erasure code pool

    -you can only lose 1 node

    -you use less space

    -as you don't write 2*data, writes are also faster. You write half the
data on one node, half on the other, and parity on separate nodes, so the
write work is a lot more distributed.

    -reads are slower because you need all the data parts.


On both configurations, if you have corrupted data you lose your data,
so that's not really a point to compare.

Replicated pools can handle much more read-intensive workloads, while erasure
pools are designed for big writes but really few reads.


I have checked myself that both configurations can work with a 3-node
cluster, so it's not a matter of a better and a worse configuration; it really
depends on your workload. And the best thing :) you can have both in the same OSDs!


On 24/10/2017 at 12:37, Eino Tuominen wrote:
>
> Hello,
>
>
> Correct me if I'm wrong, but isn't your configuration just twice as
> bad as running with replication size=2? With replication size=2 when
> you lose a disk you lose data if there is even one defect block found
> when ceph is reconstructing the pgs that had a replica on the failed
> disk. No, with your setup you have to be able to read twice as much
> data correctly in order to reconstruct the pgs. When using EC I think
> that you have to use m>1 in production.
>
>
> -- 
>
>   Eino Tuominen
>
>
> 
> *From:* ceph-users  on behalf of
> Jorge Pinilla López 
> *Sent:* Tuesday, October 24, 2017 11:24
> *To:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Erasure Pool OSD fail
>  
>
> Okay I think I can respond myself, the pool is created with a default
> min_size of 3, so when one of the OSDs goes down, the pool doenst
> perform any IO, manually changing the the pool min_size to 2 worked great.
>
>
>> On 24/10/2017 at 10:13, Jorge Pinilla López wrote:
>> I am testing erasure code pools and doing a rados test write to try
>> fault tolerance.
>> I have 3 Nodes with 1 OSD each, K=2 M=1.
>>
>> While performing the write (rados bench -p replicate 100 write), I
>> stop one of the OSD daemons (example osd.0), simulating a node failure,
>> and then the whole write stops and I can't write any data anymore.
>>
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>>     1      16        28        12   46.8121        48      1.01548   0.616034
>>     2      16        40        24   47.3907        48      1.04219   0.923728
>>     3      16        52        36   47.5889        48     0.593145     1.0038
>>     4      16        68        52   51.6633        64      1.39638    1.08098
>>     5      16        74        58    46.158        24      1.02699    1.10172
>>     6      16        83        67   44.4711        36      3.01542    1.18012
>>     7      16        95        79   44.9722        48     0.776493    1.24003
>>     8      16        95        79   39.3681         0            -    1.24003
>>     9      16        95        79   35.0061         0            -    1.24003
>>    10      16        95        79   31.5144         0            -    1.24003
>>    11      16        95        79   28.6561         0            -    1.24003
>>    12      16        95        79   26.2732         0            -    1.24003
>>
>> It's pretty clear where the OSD failed.
>>
>> On the other hand, using a replicated pool, the client (rados test)
>> doesn't even notice the OSD failure, which is awesome.
>>
>> Is this a normal behaviour on EC pools?
>> 
>> *Jorge Pinilla López*
>> jorp...@unizar.es
>> Estudiante de ingenieria informática
>> Becario del area de sistemas (SICUZ)
>> Universidad de Zaragoza
>> PGP-KeyID: A34331932EBC715A
>> 
>> 
>>
>>
>
> -- 
> 
> *Jorge Pinilla López*
> jorp...@unizar.es
> Estudiante de ingenieria informática
> Becario del area de sistemas (SICUZ)
> Universidad de Zaragoza
> PGP-KeyID: A34331932EBC715A
> 

Re: [ceph-users] Reported bucket size incorrect (Luminous)

2017-10-24 Thread Christian Wuerdig
What version of Ceph are you using? There were a few bugs leaving
behind orphaned objects (e.g. http://tracker.ceph.com/issues/18331 and
http://tracker.ceph.com/issues/10295). If that's your problem then
there is a tool for finding these objects so you can then manually
delete them - have a google search for rgw orphan find
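
Roughly along these lines (a sketch from memory, not verified on your version; the pool matches the data pool in your bucket stats below, and the job id is just an arbitrary label):

# scan the data pool for RADOS objects that no longer belong to any bucket index
radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphan-scan-1

# when done reviewing the results, clean up the job's intermediate data
radosgw-admin orphans finish --job-id=orphan-scan-1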

On Sat, Oct 21, 2017 at 2:40 AM, Mark Schouten  wrote:
> Hi,
>
> I have a bucket that according to radosgw-admin is about 8TB, even though
> it's really only 961GB.
>
> I have ran radosgw-admin gc process, and that completes quite fast.
> root@osdnode04:~# radosgw-admin gc process
> root@osdnode04:~# radosgw-admin gc list
> []
>
> {
> "bucket": "qnapnas",
> "zonegroup": "8e81f1e2-c173-4b8d-b421-6ccabdf69f2e",
> "placement_rule": "default-placement",
> "explicit_placement": {
> "data_pool": "default.rgw.buckets.data",
> "data_extra_pool": "default.rgw.buckets.non-ec",
> "index_pool": "default.rgw.buckets.index"
> },
> "id": "1c19a332-7ffc-4472-b852-ec4a143785cc.19675875.3",
> "marker": "1c19a332-7ffc-4472-b852-ec4a143785cc.19675875.3",
> "index_type": "Normal",
> "owner": "DB0339$REDACTED",
> "ver": "0#963948",
> "master_ver": "0#0",
> "mtime": "2017-08-23 12:15:50.203650",
> "max_marker": "0#",
> "usage": {
> "rgw.main": {
> "size": 8650431493893,
> "size_actual": 8650431578112,
> "size_utilized": 8650431493893,
> "size_kb": 8447687006,
> "size_kb_actual": 8447687088,
> "size_kb_utilized": 8447687006,
> "num_objects": 227080
> },
> "rgw.multimeta": {
> "size": 0,
> "size_actual": 0,
> "size_utilized": 0,
> "size_kb": 0,
> "size_kb_actual": 0,
> "size_kb_utilized": 0,
> "num_objects": 17
> }
> },
> "bucket_quota": {
> "enabled": false,
> "check_on_raw": false,
> "max_size": -1024,
> "max_size_kb": 0,
> "max_objects": -1
> }
> },
>
>
> Can anybody explain what's wrong?
>
>
> Kind regards,
>
> --
> Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
> Mark Schouten | Tuxis Internet Engineering
> KvK: 61527076 | http://www.tuxis.nl/
> T: 0318 200208 | i...@tuxis.nl
>


Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-24 Thread David Turner
Thank you so much for chiming in, Ben.

Can you explain what each setting value means? I believe I understand min
wait, that's just how long to wait before allowing the object to be cleaned
up.  gc max objs is how many will be cleaned up during each period?  gc
processor period is how often it will kick off gc to clean things up?  And
gc processor max time is the longest the process can run after the period
starts?  Is that about right for that?  I read somewhere saying that prime
numbers are optimal for gc max objs.  Do you know why that is?  I notice
you're using one there.  What is lc max objs?  I couldn't find a reference
for that setting.

Additionally, do you know if the radosgw-admin gc list is ever cleaned up,
or is it an ever growing list?  I got up to 3.6 Billion objects in the list
before I killed the gc list command.

On Tue, Oct 24, 2017 at 4:47 PM Ben Hines  wrote:

> I agree the settings are rather confusing. We also have many millions of
> objects and had this trouble, so i set these rather aggressive gc settings
> on our cluster which result in gc almost always running. We also use
> lifecycles to expire objects.
>
> rgw lifecycle work time = 00:01-23:59
> rgw gc max objs = 2647
> rgw lc max objs = 2647
> rgw gc obj min wait = 300
> rgw gc processor period = 600
> rgw gc processor max time = 600
>
>
> -Ben
>
> On Tue, Oct 24, 2017 at 9:25 AM, David Turner 
> wrote:
>
>> As I'm looking into this more and more, I'm realizing how big of a
>> problem garbage collection has been in our clusters.  The biggest cluster
>> has over 1 billion objects in its gc list (the command is still running, it
>> just recently passed by the 1B mark).  Does anyone have any guidance on
>> what to do to optimize the gc settings to hopefully/eventually catch up on
>> this as well as stay caught up once we are?  I'm not expecting an overnight
>> fix, but something that could feasibly be caught up within 6 months would
>> be wonderful.
>>
>> On Mon, Oct 23, 2017 at 11:18 AM David Turner 
>> wrote:
>>
>>> We recently deleted a bucket that was no longer needed that had 400TB of
>>> data in it to help as our cluster is getting quite full.  That should free
>>> up about 30% of our cluster used space, but in the last week we haven't
>>> seen nearly a fraction of that free up yet.  I left the cluster with this
>>> running over the weekend to try to help `radosgw-admin --rgw-realm=local gc
>>> process`, but it didn't seem to put a dent into it.  Our regular ingestion
>>> is faster than how fast the garbage collection is cleaning stuff up, but
>>> our regular ingestion is less than 2% growth at it's maximum.
>>>
>>> As of yesterday our gc list was over 350GB when dumped into a file (I
>>> had to stop it as the disk I was redirecting the output to was almost
>>> full).  In the future I will use the --bypass-gc option to avoid the
>>> cleanup, but is there a way to speed up the gc once you're in this
>>> position?  There were about 8M objects that were deleted from this bucket.
>>> I've come across a few references to the rgw-gc settings in the config, but
>>> nothing that explained the times well enough for me to feel comfortable
>>> doing anything with them.
>>>
>>> On Tue, Jul 25, 2017 at 4:01 PM Bryan Stillwell 
>>> wrote:
>>>
 Excellent, thank you!  It does exist in 0.94.10!  :)



 Bryan



 *From: *Pavan Rallabhandi 
 *Date: *Tuesday, July 25, 2017 at 11:21 AM


 *To: *Bryan Stillwell , "
 ceph-users@lists.ceph.com" 
 *Subject: *Re: [ceph-users] Speeding up garbage collection in RGW



 I’ve just realized that the option is present in Hammer (0.94.10) as
 well, you should try that.



 *From: *Bryan Stillwell 
 *Date: *Tuesday, 25 July 2017 at 9:45 PM
 *To: *Pavan Rallabhandi , "
 ceph-users@lists.ceph.com" 
 *Subject: *EXT: Re: [ceph-users] Speeding up garbage collection in RGW



 Unfortunately, we're on hammer still (0.94.10).  That option looks like
 it would work better, so maybe it's time to move the upgrade up in the
 schedule.



 I've been playing with the various gc options and I haven't seen any
 speedups like we would need to remove them in a reasonable amount of time.



 Thanks,

 Bryan



 *From: *Pavan Rallabhandi 
 *Date: *Tuesday, July 25, 2017 at 3:00 AM
 *To: *Bryan Stillwell , "
 ceph-users@lists.ceph.com" 
 *Subject: *Re: [ceph-users] Speeding up garbage collection in RGW



 If your Ceph version is >=Jewel, you can try the `--bypass-gc` option
 

Re: [ceph-users] Erasure code profile

2017-10-24 Thread Oliver Humpage

> Consider a cluster of 8 OSD servers with 3 disks on each server. 
> 
> If I use a profile setting of k=5, m=3 and  ruleset-failure-domain=host ;
> 
> As far as I understand it can tolerate failure of 3 OSDs and 1 host, am I 
> right ?

When setting up your pool, you specify a crush map which says what your 
"failure domain” is. You can think of a failure domain as "what’s the largest 
single thing that could fail and the cluster would still survive?”. By default 
this is a node (a server). Large clusters often use a rack instead.  Ceph 
places your data across the OSDs in your cluster so that if that large single 
thing (node or rack) fails, your data is still safe and available.

If you specify a single OSD (a disk) as your failure domain, then ceph might 
end up placing lots of data on different OSDs on the same node. This is a bad 
idea since if that node goes down you'll lose several OSDs, and so your data 
might not survive.

If you have 8 nodes, and erasure of 5+3, then with the default failure domain 
of a node your data will be spread across all 8 nodes (data chunks on 5 of 
them, and parity chunks on the other three). Therefore you could tolerate 3 
whole nodes failing. You are right that 5+3 encoding will result in 1.6xdata 
disk usage.
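
For the record, setting that up is roughly (a sketch using the profile keys from your message; the pool name and pg counts are only examples):

# 5 data chunks + 3 coding chunks, at most one chunk per host
ceph osd erasure-code-profile set ec-5-3 k=5 m=3 ruleset-failure-domain=host

# create an erasure-coded pool that uses the profile
ceph osd pool create ecpool 256 256 erasure ec-5-3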

If you were being pathological about minimising disk usage, I think you could 
in theory set a failure domain of an OSD, then use 8+2 encoding with a crush 
map that never used more than 2 OSDs in each node for a placement group. Then 
technically you could tolerate a node failure. I doubt anyone would recommend 
that though!

That said, here’s a question for others: say a cluster only has 4 nodes (each 
with many OSDs), would you use 2+2 or 4+4? Either way you use 2xdata space and 
could lose 2 nodes (assuming a proper crush map), but presumably the 4+4 would 
be faster and you could lose more OSDs?

Oliver.



Re: [ceph-users] luminous ubuntu 16.04 HWE (4.10 kernel). ceph-disk can't prepare a disk

2017-10-24 Thread Webert de Souza Lima
When you umount the device, is the error raised still the same?
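
If it is the leftover lockbox partition from an earlier attempt, something like this might clear it before retrying (just a sketch; double-check the device name, and note that zap wipes the whole disk):

# unmount the stale partition the error complains about
umount /dev/sdy5

# wipe partition table and previous ceph-disk state (destructive!)
ceph-disk zap /dev/sdy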


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Mon, Oct 23, 2017 at 4:46 AM, Wido den Hollander  wrote:

>
> > Op 22 oktober 2017 om 18:45 schreef Sean Sullivan :
> >
> >
> > On freshly installed ubuntu 16.04 servers with the HWE kernel selected
> > (4.10). I can not use ceph-deploy or ceph-disk to provision osd.
> >
> >
> >  whenever I try I get the following::
> >
> > ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys
> > --bluestore --cluster ceph --fs-type xfs -- /dev/sdy
> > command: Running command: /usr/bin/ceph-osd --cluster=ceph
> > --show-config-value=fsid
> > get_dm_uuid: get_dm_uuid /dev/sdy uuid path is
> /sys/dev/block/65:128/dm/uuid
> > set_type: Will colocate block with data on /dev/sdy
> > command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> > --lookup bluestore_block_size
> > [command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> > --lookup bluestore_block_db_size
> > command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> > --lookup bluestore_block_size
> > command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> > --lookup bluestore_block_wal_size
> > get_dm_uuid: get_dm_uuid /dev/sdy uuid path is
> /sys/dev/block/65:128/dm/uuid
> > get_dm_uuid: get_dm_uuid /dev/sdy uuid path is
> /sys/dev/block/65:128/dm/uuid
> > get_dm_uuid: get_dm_uuid /dev/sdy uuid path is
> /sys/dev/block/65:128/dm/uuid
> > Traceback (most recent call last):
> >   File "/usr/sbin/ceph-disk", line 9, in 
> > load_entry_point('ceph-disk==1.0.0', 'console_scripts',
> 'ceph-disk')()
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5704,
> in
> > run
> > main(sys.argv[1:])
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5655,
> in
> > main
> > args.func(args)
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2091,
> in
> > main
> > Prepare.factory(args).prepare()
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2080,
> in
> > prepare
> > self._prepare()
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2154,
> in
> > _prepare
> > self.lockbox.prepare()
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2842,
> in
> > prepare
> > verify_not_in_use(self.args.lockbox, check_partitions=True)
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 950,
> in
> > verify_not_in_use
> > raise Error('Device is mounted', partition)
> > ceph_disk.main.Error: Error: Device is mounted: /dev/sdy5
> >
> > unmounting the disk does not seem to help either. I'm assuming something
> is
> > triggering too early but i'm not sure how to delay or figure that out.
> >
> > has anyone deployed on xenial with the 4.10 kernel? Am I missing
> something
> > important?
>
> Yes I have without any issues, I've did:
>
> $ ceph-disk prepare /dev/sdb
>
> Luminous default to BlueStore and that worked just fine.
>
> Yes, this is with a 4.10 HWE kernel from Ubuntu 16.04.
>
> Wido
>


[ceph-users] Re: [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-10-24 Thread shadow_lin
Hi Sage,
When will 12.2.2 be released?

2017-10-24 

lin.yunfan



From: Sage Weil 
Sent: 2017-10-24 20:03 
Subject: Re: [ceph-users] [luminous]OSD memory usage increase when writing a lot of 
data to cluster 
To: "shadow_lin"
Cc: "ceph-users"

On Tue, 24 Oct 2017, shadow_lin wrote: 
> Hi All, 
> The cluster has 24 OSDs with 24 8TB HDDs. 
> Each OSD server has 2GB of RAM and runs 2 OSDs with 2 8TB HDDs. I know the memory 
> is below the recommended value, but this OSD server is an ARM server so I 
> can't do anything to add more RAM. 
> I created a replicated (2 rep) pool and a 20TB image and mounted it on the test 
> server with an xfs fs. 
> server with xfs fs.  
>   
> I have set the ceph.conf to this(according to other related post suggested): 
> [osd] 
> bluestore_cache_size = 104857600 
> bluestore_cache_size_hdd = 104857600 
> bluestore_cache_size_ssd = 104857600 
> bluestore_cache_kv_max = 103809024 
>   
>  osd map cache size = 20 
> osd map max advance = 10 
> osd map share max epochs = 10 
> osd pg epoch persisted max stale = 10 
> The bluestore cache settings did improve the situation, but if I try to write 
> 1TB of data with a dd command (dd if=/dev/zero of=test bs=1G count=1000) to the rbd, the 
> OSD will eventually be killed by the OOM killer. 
> If I only write about 100G of data at once then everything is fine. 
>   
> Why does the OSD memory usage keep increasing while writing? 
> Is there anything I can do to reduce the memory usage? 

There is a bluestore memory bug that was fixed just after 12.2.1 was 
released; it will be fixed in 12.2.2.  In the meantime, you can 
consider running the latest luminous branch (not fully tested) from 
https://shaman.ceph.com/builds/ceph/luminous. 

sage


Re: [ceph-users] [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-10-24 Thread Sage Weil
On Tue, 24 Oct 2017, shadow_lin wrote:
> Hi All,
> The cluster has 24 OSDs with 24 8TB HDDs.
> Each OSD server has 2GB of RAM and runs 2 OSDs with 2 8TB HDDs. I know the memory
> is below the recommended value, but this OSD server is an ARM server so I
> can't do anything to add more RAM.
> I created a replicated (2 rep) pool and a 20TB image and mounted it on the test
> server with an xfs fs. 
>  
> I have set the ceph.conf to this(according to other related post suggested):
> [osd]
> bluestore_cache_size = 104857600
> bluestore_cache_size_hdd = 104857600
> bluestore_cache_size_ssd = 104857600
> bluestore_cache_kv_max = 103809024
>  
>  osd map cache size = 20
> osd map max advance = 10
> osd map share max epochs = 10
> osd pg epoch persisted max stale = 10
> The bluestore cache settings did improve the situation, but if I try to write
> 1TB of data with a dd command (dd if=/dev/zero of=test bs=1G count=1000) to the rbd, the
> OSD will eventually be killed by the OOM killer.
> If I only write about 100G of data at once then everything is fine.
>  
> Why does the OSD memory usage keep increasing while writing?
> Is there anything I can do to reduce the memory usage?

There is a bluestore memory bug that was fixed just after 12.2.1 was 
released; it will be fixed in 12.2.2.  In the meantime, you can 
consider running the latest luminous branch (not fully tested) from
https://shaman.ceph.com/builds/ceph/luminous.

sage


Re: [ceph-users] Erasure Pool OSD fail

2017-10-24 Thread Eino Tuominen
Hello,


Correct me if I'm wrong, but isn't your configuration just twice as bad as 
running with replication size=2? With replication size=2, when you lose a disk 
you lose data if there is even one defective block found when ceph is 
reconstructing the pgs that had a replica on the failed disk. Now, with your 
setup, you have to be able to read twice as much data correctly in order to 
reconstruct the pgs. When using EC I think that you have to use m>1 in 
production.


--

  Eino Tuominen



From: ceph-users  on behalf of Jorge Pinilla 
López 
Sent: Tuesday, October 24, 2017 11:24
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Erasure Pool OSD fail


Okay, I think I can respond myself: the pool is created with a default min_size 
of 3, so when one of the OSDs goes down the pool doesn't perform any IO; 
manually changing the pool min_size to 2 worked great.

On 24/10/2017 at 10:13, Jorge Pinilla López wrote:
I am testing erasure code pools and doing a rados test write to try fault 
tolerance.
I have 3 Nodes with 1 OSD each, K=2 M=1.

While performing the write (rados bench -p replicate 100 write), I stop one of 
the OSD daemons (example osd.0), simulating a node failure, and then the whole 
write stops and I can't write any data anymore.

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    1      16        28        12   46.8121        48      1.01548   0.616034
    2      16        40        24   47.3907        48      1.04219   0.923728
    3      16        52        36   47.5889        48     0.593145     1.0038
    4      16        68        52   51.6633        64      1.39638    1.08098
    5      16        74        58    46.158        24      1.02699    1.10172
    6      16        83        67   44.4711        36      3.01542    1.18012
    7      16        95        79   44.9722        48     0.776493    1.24003
    8      16        95        79   39.3681         0            -    1.24003
    9      16        95        79   35.0061         0            -    1.24003
   10      16        95        79   31.5144         0            -    1.24003
   11      16        95        79   28.6561         0            -    1.24003
   12      16        95        79   26.2732         0            -    1.24003

It's pretty clear where the OSD failed.

On the other hand, using a replicated pool, the client (rados test) doesn't even 
notice the OSD failure, which is awesome.

Is this a normal behaviour on EC pools?

Jorge Pinilla López
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A






--

Jorge Pinilla López
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A



Re: [ceph-users] Lots of reads on default.rgw.usage pool

2017-10-24 Thread Mark Schouten
Stracing the radosgw-process, I see a lot of the following:



[pid 12364] sendmsg(23, {msg_name(0)=NULL, 
msg_iov(5)=[{"\7{\340\r\0\0\0\0\0P\200\16\0\0\0\0\0*\0?\0\10\0\331\0\0\0\0\0\0\0M"...,
 54}, 
{"\1\1\22\0\0\0\1\10\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\377\20\226\206\351\v3\0\0"...,
 217}, {"rgwuser_usage_log_read", 22}, 
{"\1\0011\0\0\0\\320Y\0\0\0\0\200\16\371Y\0\0\0\0\25\0\0\0DB0339"..., 55}, 
{"\305\234\203\332\0\0\0\0K~\356z\4\266\305\272\27hTx\5", 21}], 
msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 369

Does anybody know where this is coming from?
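
What I will probably look at next is the usage log itself, roughly like this (a sketch; whether it is related to these reads is just a guess on my part):

# show what is in the usage log that rgw keeps reading
radosgw-admin usage show --show-log-entries=false

# if the log has grown huge it can be trimmed for a date range (placeholder dates)
radosgw-admin usage trim --start-date=2017-01-01 --end-date=2017-10-01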


Kind regards,

-- 
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten  | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl



 From:   Mark Schouten  
 To:    
 Sent:   24-10-2017 12:11 
 Subject:   [ceph-users] Lots of reads on default.rgw.usage pool 

Hi,


Since I upgraded to Luminous last week, I see a lot of read activity on the 
default.rgw.usage pool (see attached image). I think it has something to do with 
the rgw daemons, since restarting them slows the reads down for a while. It 
might also have to do with tenants and the fact that dynamic bucket sharding 
isn't working for me [1].


So this morning I disabled the dynamic bucket sharding via 
'rgw_dynamic_resharding = false', but that doesn't seem to help. Maybe 
bucket sharding is still trying to run because of the entry in 'radosgw-admin 
reshard list' that I cannot delete?


[1]: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021774.html


Kind regards,

-- 
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten  | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl





Re: [ceph-users] Erasure code profile

2017-10-24 Thread Ronny Aasen
Yes, you can. But just like a RAID5 array with a lost disk, it is not a 
comfortable way to run your cluster for any significant time. You also 
get performance degradation.


Having a warning active all the time makes it harder to detect new 
issues, and such. One becomes numb to the warning always being on.


Strive to have your cluster in HEALTH_OK all the time, and design so 
that you have the fault tolerance you want as overhead. Having more 
nodes than strictly needed allows ceph to self-heal quickly, and also 
gives better performance by spreading load over more machines.

10+4 on 14 nodes means each and every node is hit on each write.
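
A quick way to sanity-check a profile against the cluster before creating the pool (just a sketch; the profile name is an example):

# the profile needs k+m hosts when the failure domain is host
ceph osd erasure-code-profile get ec-10-4

# count how many host buckets the CRUSH map actually has
ceph osd tree | grep -c host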


kind regards
Ronny Aasen


On 23. okt. 2017 21:12, Jorge Pinilla López wrote:
I have one question: what can or can't a cluster do while working in degraded 
mode?


With K=10 + M=4, if one of my OSD nodes fails it will start working in 
degraded mode, but can I still do writes and reads on that pool?



El 23/10/2017 a las 21:01, Ronny Aasen escribió:

On 23.10.2017 20:29, Karun Josy wrote:

Hi,

While creating a pool with erasure code profile k=10, m=4, I get PG 
status as

"200 creating+incomplete"

While creating pool with profile k=5, m=3 it works fine.

Cluster has 8 OSDs with total 23 disks.

Is there any requirements for setting the first profile ?



you need K+M+X  osd nodes. K and M comes from the profile, X is how 
many nodes you want to be able to tolerate failure of, without 
becoming degraded. (how many failed nodes ceph should be able to 
automatically heal)


So with K=10 + M=4 you need a minimum of 14 nodes, and you have 0 fault 
tolerance (a single failure = a degraded cluster), so you have to 
scramble to replace the node to get HEALTH_OK again.  If you have 15 
nodes you can lose 1 node and ceph will automatically rebalance to 
the 14 needed nodes, and you can replace the lost node at your leisure.


kind regards
Ronny Aasen



--

*Jorge Pinilla López*
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A 










[ceph-users] Lots of reads on default.rgw.usage pool

2017-10-24 Thread Mark Schouten
Hi,


Since I upgraded to Luminous last week, I see a lot of read activity on the 
default.rgw.usage pool (see attached image). I think it has something to do with 
the rgw daemons, since restarting them slows the reads down for a while. It 
might also have to do with tenants and the fact that dynamic bucket sharding 
isn't working for me [1].


So this morning I disabled the dynamic bucket sharding via 
'rgw_dynamic_resharding = false', but that doesn't seem to help. Maybe 
bucket sharding is still trying to run because of the entry in 'radosgw-admin 
reshard list' that I cannot delete?


[1]: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021774.html


Kind regards,

-- 
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten  | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl



Re: [ceph-users] [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-10-24 Thread Denes Dolhay

Hi,

There was a thread about this not long ago; please check:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021676.html

Denes.

On 10/24/2017 11:48 AM, shadow_lin wrote:

Hi All,
The cluster has 24 OSDs with 24 8TB HDDs.
Each OSD server has 2GB of RAM and runs 2 OSDs with 2 8TB HDDs. I know the 
memory is below the recommended value, but this OSD server is an ARM 
server so I can't do anything to add more RAM.
I created a replicated (2 rep) pool and a 20TB image and mounted it on 
the test server with an xfs fs.
I have set the ceph.conf to this(according to other related post 
suggested):

[osd]
bluestore_cache_size = 104857600
bluestore_cache_size_hdd = 104857600
bluestore_cache_size_ssd = 104857600
bluestore_cache_kv_max = 103809024




[ceph-users] [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-10-24 Thread shadow_lin
Hi All,
The cluster has 24 OSDs with 24 8TB HDDs. 
Each OSD server has 2GB of RAM and runs 2 OSDs with 2 8TB HDDs. I know the memory is 
below the recommended value, but this OSD server is an ARM server so I can't do 
anything to add more RAM.
I created a replicated (2 rep) pool and a 20TB image and mounted it on the test 
server with an xfs fs. 

I have set ceph.conf to this (according to what other related posts suggested):
[osd]
bluestore_cache_size = 104857600
bluestore_cache_size_hdd = 104857600
bluestore_cache_size_ssd = 104857600
bluestore_cache_kv_max = 103809024

 osd map cache size = 20
osd map max advance = 10
osd map share max epochs = 10
osd pg epoch persisted max stale = 10

The bluestore cache settings did improve the situation, but if I try to write 1TB 
of data with a dd command (dd if=/dev/zero of=test bs=1G count=1000) to the rbd, the OSD 
will eventually be killed by the OOM killer.
If I only write about 100G of data at once then everything is fine.

Why does the OSD memory usage keep increasing while writing?
Is there anything I can do to reduce the memory usage?
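Is there a recommended way to see where the memory actually goes? So far I have only looked at the mempool stats over the admin socket, roughly (a sketch; osd.0 is just an example id):

# per-pool memory accounting of a running OSD (bluestore cache, pglog, etc.)
ceph daemon osd.0 dump_mempools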

2017-10-24



lin.yunfan


Re: [ceph-users] Erasure Pool OSD fail

2017-10-24 Thread Jorge Pinilla López
Okay, I think I can respond myself: the pool is created with a default
min_size of 3, so when one of the OSDs goes down the pool doesn't
perform any IO; manually changing the pool min_size to 2 worked great.
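
For reference, the change was simply (a sketch; substitute your own pool name):

# check the current value (it was 3 here)
ceph osd pool get ecpool min_size

# allow IO to continue with only 2 shards available
ceph osd pool set ecpool min_size 2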


On 24/10/2017 at 10:13, Jorge Pinilla López wrote:
> I am testing erasure code pools and doing a rados test write to try
> fault tolerance.
> I have 3 Nodes with 1 OSD each, K=2 M=1.
>
> While performing the write (rados bench -p replicate 100 write), I
> stop one of the OSD daemons (example osd.0), simulating a node failure,
> and then the whole write stops and I can't write any data anymore.
>
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     1      16        28        12   46.8121        48      1.01548   0.616034
>     2      16        40        24   47.3907        48      1.04219   0.923728
>     3      16        52        36   47.5889        48     0.593145     1.0038
>     4      16        68        52   51.6633        64      1.39638    1.08098
>     5      16        74        58    46.158        24      1.02699    1.10172
>     6      16        83        67   44.4711        36      3.01542    1.18012
>     7      16        95        79   44.9722        48     0.776493    1.24003
>     8      16        95        79   39.3681         0            -    1.24003
>     9      16        95        79   35.0061         0            -    1.24003
>    10      16        95        79   31.5144         0            -    1.24003
>    11      16        95        79   28.6561         0            -    1.24003
>    12      16        95        79   26.2732         0            -    1.24003
>
> It's pretty clear where the OSD failed.
>
> On the other hand, using a replicated pool, the client (rados test)
> doesn't even notice the OSD failure, which is awesome.
>
> Is this a normal behaviour on EC pools?
> 
> *Jorge Pinilla López*
> jorp...@unizar.es
> Estudiante de ingenieria informática
> Becario del area de sistemas (SICUZ)
> Universidad de Zaragoza
> PGP-KeyID: A34331932EBC715A
> 
> 
>
>

-- 

*Jorge Pinilla López*
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A




[ceph-users] Erasure Pool OSD fail

2017-10-24 Thread Jorge Pinilla López
I am testing erasure code pools and doing a rados test write to try
fault tolerance.
I have 3 Nodes with 1 OSD each, K=2 M=1.

While performing the write (rados bench -p replicate 100 write), I stop
one of the OSD daemons (example osd.0), simulating a node failure, and
then the whole write stops and I can't write any data anymore.

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    1      16        28        12   46.8121        48      1.01548   0.616034
    2      16        40        24   47.3907        48      1.04219   0.923728
    3      16        52        36   47.5889        48     0.593145     1.0038
    4      16        68        52   51.6633        64      1.39638    1.08098
    5      16        74        58    46.158        24      1.02699    1.10172
    6      16        83        67   44.4711        36      3.01542    1.18012
    7      16        95        79   44.9722        48     0.776493    1.24003
    8      16        95        79   39.3681         0            -    1.24003
    9      16        95        79   35.0061         0            -    1.24003
   10      16        95        79   31.5144         0            -    1.24003
   11      16        95        79   28.6561         0            -    1.24003
   12      16        95        79   26.2732         0            -    1.24003

It's pretty clear where the OSD failed.

On the other hand, using a replicated pool, the client (rados test)
doesn't even notice the OSD failure, which is awesome.

Is this a normal behaviour on EC pools?

*Jorge Pinilla López*
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A




Re: [ceph-users] [Jewel] Crash Osd with void Hit_set_trim

2017-10-24 Thread pascal.pu...@pci-conseil.net

Hello,

On 24/10/2017 at 07:49, Brad Hubbard wrote:



On Mon, Oct 23, 2017 at 4:51 PM, pascal.pu...@pci-conseil.net 
 > wrote:


Hello,

On 23/10/2017 at 02:05, Brad Hubbard wrote:

2017-10-22 17:32:56.031086 7f3acaff5700 1 osd.14 pg_epoch: 72024
pg[37.1c( v 71593'41657 (60849'38594,71593'41657] local-les=72023
n=13 ec=7037 les/c/f 72023/72023/66447 72022/72022/72022)
[14,1,41] r=0 lpr=72022 crt=71593'41657 lcod 0'
0 mlcod 0'0 active+clean] hit_set_trim
37:3800:.ceph-internal::hit_set_37.1c_archive_2017-08-31
01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head not found
2017-10-22 17:32:56.033936 7f3acaff5700 -1 osd/ReplicatedPG.cc:
In function 'void
ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned
int)' thread 7f3acaff5700 time 2017-10-22 17:32:56.031105
osd/ReplicatedPG.cc: 11782: FAILED assert(obc)

It appears to be looking for (and failing to find) a hitset
object with a timestamp from August? Does that sound right to
you? Of course, it appears an object for that timestamp does not
exist.


How is it possible? How can I fix it? I am sure that if I run a lot of
reads, other objects like this will crash other OSDs.
(The cluster is OK now; I will probably destroy OSD 14 and recreate it.)
How can I find this object?


You should be able to do a find on the OSDs filestore and grep the 
output for 'hit_set_37.1c_archive_2017-08-31'. I'd start with the OSDs 
responsible for pg 37.1c and then move on to the others if it's feasible.



So with grep, I found OSD.14 (already destroyed and recreated) and OSD.1.

ceph-osd-01: /var/log/ceph/ceph-osd.1.log-20171019.gz:2017-10-18 
05:37:52.793802 7f9754ec5700 -1 osd.1 pg_epoch: 71592 pg[37.1c( v 
71591'41652 (60849'38594,71591'41652] local-les=71583 n=17 ec=7037 
les/c/f 71583/71554/66447 71561/71578/71578) [43,26,13]/[1,41] r=0 
lpr=71578 pi=71553-71577/5 luod=71590'41651 bft=13,26,43 crt=71588'41647 
lcod 71589'41650 mlcod 0'0 
active+undersized+degraded+remapped+wait_backfill] agent_load_hit_sets: 
could not load hitset 
37:3800:.ceph-internal::hit_set_37.1c_archive_2017-08-31 
01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head


Should I destroy OSD 1 and recreate it as well to force the data to move, or just 
reweight the OSD to force it to move?


How can I find other objects with the same issue? (Just restart the rsync and see?)
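
To look for more of them, I suppose I can just list the cache pool and grep for hitset archive objects, something like (a sketch; cache-nvme-data is the cache tier pool, and --all is needed because these objects live in the .ceph-internal namespace, if I read the log right):

# list hitset archive objects across namespaces and filter on the pg
rados -p cache-nvme-data --all ls | grep 'hit_set_37.1c_archive'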

Another question: I run a nightly crontab with fstrim on the rbd disk. 
Could that be the cause of the problem?



Let us know the results.



--
*Performance Conseil Informatique*
Pascal Pucci
Infrastructure Consultant
pascal.pu...@pci-conseil.net 
Mobile: 06 51 47 84 98
Office: 02 85 52 41 81
http://www.performance-conseil-informatique.net
*News:* DataCore partnership - PCI is a Silver Partner.
Very happy to have been delivering storage continuity projects with DataCore 
since 2008. PCI is a DataCore Silver partner. Thanks to DataCore ...read more...



Re: [ceph-users] [Jewel] Crash Osd with void Hit_set_trim

2017-10-24 Thread Brad Hubbard
On Tue, Oct 24, 2017 at 3:49 PM, Brad Hubbard  wrote:

>
>
> On Mon, Oct 23, 2017 at 4:51 PM, pascal.pu...@pci-conseil.net <
> pascal.pu...@pci-conseil.net> wrote:
>
>> Hello,
>> On 23/10/2017 at 02:05, Brad Hubbard wrote:
>>
>> 2017-10-22 17:32:56.031086 7f3acaff5700  1 osd.14 pg_epoch: 72024
>> pg[37.1c( v 71593'41657 (60849'38594,71593'41657] local-les=72023 n=13
>> ec=7037 les/c/f 72023/72023/66447 72022/72022/72022) [14,1,41] r=0
>> lpr=72022 crt=71593'41657 lcod 0'
>> 0 mlcod 0'0 active+clean] hit_set_trim 
>> 37:3800:.ceph-internal::hit_set_37.1c_archive_2017-08-31
>> 01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head not found
>> 2017-10-22 17:32:56.033936 7f3acaff5700 -1 osd/ReplicatedPG.cc: In
>> function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
>> unsigned int)' thread 7f3acaff5700 time 2017-10-22 17:32:56.031105
>> osd/ReplicatedPG.cc: 11782: FAILED assert(obc)
>>
>> It appears to be looking for (and failing to find) a hitset object with a
>> timestamp from August? Does that sound right to you? Of course, it appears
>> an object for that timestamp does not exist.
>>
>> How is it possible? How can I fix it? I am sure that if I run a lot of reads,
>> other objects like this will crash other OSDs.
>> (The cluster is OK now; I will probably destroy OSD 14 and recreate it.)
>> How can I find this object?
>>
>
> You should be able to do a find on the OSDs filestore and grep the output
> for 'hit_set_37.1c_archive_2017-08-31'. I'd start with the OSDs
> responsible for pg 37.1c and then move on to the others if it's feasible.
>

Many thanks to Kefu for correcting me on this.

You'll need to use something more like the following command to find this
object.

find ${path_to_osd} -name 'hit\\uset\\u37.1c\\uarchive\\u2017-08-31
01:03:24.697717Z\\u2017-08-31 01:52:34.767197Z*'

Apologies for the confusion, it was entirely mine.


> Let us know the results.
>
>
>> For information : All ceph server are NTP time synchrone.
>>
>> What are the settings for this cache tier?
>>
>>
>> Just Tier in "backwrite" on erasure pool 2+1.
>>
>> # ceph osd pool get cache-nvme-data all
>> size: 3
>> min_size: 2
>> crash_replay_interval: 0
>> pg_num: 512
>> pgp_num: 512
>> crush_ruleset: 10
>> hashpspool: true
>> nodelete: false
>> nopgchange: false
>> nosizechange: false
>> write_fadvise_dontneed: false
>> noscrub: false
>> nodeep-scrub: false
>> hit_set_type: bloom
>> hit_set_period: 14400
>> hit_set_count: 12
>> hit_set_fpp: 0.05
>> use_gmt_hitset: 1
>> auid: 0
>> target_max_objects: 100
>> target_max_bytes: 1000
>> cache_target_dirty_ratio: 0.4
>> cache_target_dirty_high_ratio: 0.6
>> cache_target_full_ratio: 0.8
>> cache_min_flush_age: 600
>> cache_min_evict_age: 1800
>> min_read_recency_for_promote: 1
>> min_write_recency_for_promote: 1
>> fast_read: 0
>> hit_set_grade_decay_rate: 0
>> hit_set_search_last_n: 0
>>
>> #  ceph osd pool get raid-2-1-data all
>> size: 3
>> min_size: 2
>> crash_replay_interval: 0
>> pg_num: 1024
>> pgp_num: 1024
>> crush_ruleset: 8
>> hashpspool: true
>> nodelete: false
>> nopgchange: false
>> nosizechange: false
>> write_fadvise_dontneed: false
>> noscrub: false
>> nodeep-scrub: false
>> use_gmt_hitset: 1
>> auid: 0
>> erasure_code_profile: raid-2-1
>> min_write_recency_for_promote: 0
>> fast_read: 0
>>
>> # ceph osd erasure-code-profile get raid-2-1
>> jerasure-per-chunk-alignment=false
>> k=2
>> m=1
>> plugin=jerasure
>> ruleset-failure-domain=host
>> ruleset-root=default
>> technique=reed_sol_van
>> w=8
>>
>> Could you check your logs for any errors from the 'agent_load_hit_sets'
>> function?
>>
>>
>> join log : #  pdsh -R exec -w ceph-osd-01,ceph-osd-02,ceph-osd-03,ceph-osd-04
>> ssh -x  %h 'zgrep -B10 -A10 agent_load_hit_sets
>> /var/log/ceph/ceph-osd.*gz'|less > log_agent_load_hit_sets.log
>>
>> On 19 October, I restarted on morning OSD 14.
>>
>> thanks for your help.
>>
>> regards,
>>
>>
>> On Mon, Oct 23, 2017 at 2:41 AM, pascal.pu...@pci-conseil.net <
>> pascal.pu...@pci-conseil.net> wrote:
>>
>>> Hello,
>>>
>>> I ran a lot of read IO today with a simple rsync... and again, an OSD
>>> crashed:
>>>
>>> But as before, I can't restart the OSD. It continues crashing again. So the OSD
>>> is out and the cluster is recovering.
>>>
>>> I had just time to increase OSD log.
>>>
>>> # ceph tell osd.14 injectargs --debug-osd 5/5
>>>
>>> Join log :
>>>
>>> # grep -B100 -100 objdump /var/log/ceph/ceph-osd.14.log
>>>
>>> If I run another read, another OSD will probably crash.
>>>
>>> Any idea?
>>>
>>> I will probably plan to move data from the erasure pool to a replicated 3x pool.
>>> It's becoming unstable without any change.
>>>
>>> Regards,
>>>
>>> PS: Last sunday, I lost RBD header during remove of cache tier... a lot
>>> of thanks to http://fnordahl.com/2017/04/17
>>> /ceph-rbd-volume-header-recovery/, to recreate it and resurrect RBD
>>> disk :)
>>> On 19/10/2017 at 00:19, Brad Hubbard wrote:
>>>
>>> On Wed, Oct 18, 2017 at 11:16 PM, 
>>>