Re: [ceph-users] Slow requests during OSD maintenance

2018-07-17 Thread Konstantin Shalygin

2. What is the best way to remove an OSD node from the cluster during
maintenance? ceph osd set noout is not the way to go, since no OSD's are
out during yum update and the node is still part of the cluster and will
handle I/O.
I think the best way is the combination of "ceph osd set noout" + stopping
the OSD services so the OSD node does not have any traffic anymore.



You are right.

ceph osd set noout

systemctl stop ceph-osd.target
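
For completeness, a sketch of the full maintenance cycle this implies (the yum
step stands in for whatever node maintenance is planned; wait for HEALTH_OK
before moving on to the next node):

ceph osd set noout
systemctl stop ceph-osd.target
yum update -y                      # node maintenance
systemctl start ceph-osd.target
ceph osd unset noout
ceph -s                            # confirm HEALTH_OK before the next node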




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Linh Vu
I think the P4600 should be fine, although 2TB is probably way overkill for 15 
OSDs.


Our older nodes use the P3700 400GB for 16 OSDs. I have yet to see the WAL and 
DB getting filled up at 2GB/10GB each. Our newer nodes use the Intel Optane 
900P 480GB, which is actually faster than the P4600, significantly cheaper in 
our country (we bought ~100 OSD nodes recently and that was a big saving), and 
has a high endurance rating of 10 DWPD. For NLSAS OSDs, even the older P3700 is 
more than enough, but for our flash OSDs, the Optane 900P performs a lot better. 
It's about 2x faster than the P3700 we had, and allows us to get more out of our 
flash drives.
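
For a rough sense of scale: at the 2 GB WAL / 10 GB DB per OSD mentioned above,
15 OSDs need only about 180 GB, so a 2 TB device leaves a lot of headroom. A
minimal sketch of attaching pre-made NVMe partitions to one OSD with
ceph-volume (device names are placeholders):

ceph-volume lvm create --bluestore \
    --data /dev/sdb \
    --block.db /dev/nvme0n1p1 \
    --block.wal /dev/nvme0n1p2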


From: Oliver Schulz 
Sent: Wednesday, 18 July 2018 12:00:14 PM
To: Linh Vu; ceph-users
Subject: Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

Thanks, Linh!

A question regarding choice of NVMe - do you think an
Intel P4510 or P4600 would do well for WAL+DB? I'm
thinking about using a single 2 TB NVMe for 15 OSDs.
Would you recommend a different model?

Is there any experience on how many 4k IOPS one should
have for WAL+DB per OSD?

We have a few new BlueStore nodes in an older
cluster, and we use Intel Optanes for WAL. We wanted to
use them for DB too - only to learn that, while fast,
they're just too small for the DB for several OSDs ...
so I hope a "regular" NVMe is fast enough?

We currently use the Gigabyte D120-C21 server barebone
(https://b2b.gigabyte.com/Storage-Server/D120-C21-rev-100)
for our OSD nodes, and we'd like to use it in our
next cluster too, because of the high storage density
and the good hdd-price to server-price ratio.
But it can only fit a single NVMe-drive (we use one of
the 16 HDD slots for a U.2 drive and connect it to the
single M.2-PCIe slot on the mainboard).


Cheers,

Oliver


On 18.07.2018 09:11, Linh Vu wrote:
> On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and
> DBs (we accept that the risk of 1 card failing is low, and our failure
> domain is host anyway). Each OSD (16 per host) gets 2GB of WAL and 10GB
> of DB.
>
>
> On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node,
> and 2x NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL
> and 40GB of DB.
>
>
> On our upcoming NVMe OSD nodes, for obvious reasons, we don't do any such
> special allocation. 
>
>
> Cheers,
>
> Linh
>
>
> 
> *From:* Oliver Schulz 
> *Sent:* Tuesday, 17 July 2018 11:39:26 PM
> *To:* Linh Vu; ceph-users
> *Subject:* Re: [ceph-users] CephFS with erasure coding, do I need a
> cache-pool?
> Dear Linh,
>
> another question, if I may:
>
> How do you handle Bluestore WAL and DB, and
> how much SSD space do you allocate for them?
>
>
> Cheers,
>
> Oliver
>
>
> On 17.07.2018 08:55, Linh Vu wrote:
> > Hi Oliver,
> >
> >
> > We have several CephFS on EC pool deployments, one has been in production
> > for a while, the others about to go in, pending all the Bluestore+EC fixes
> > in 12.2.7 
> >
> >
> > Firstly as John and Greg have said, you don't need SSD cache pool at all.
> >
> >
> > Secondly, regarding k/m, it depends on how many hosts or racks you have,
> > and how many failures you want to tolerate.
> >
> >
> > For our smallest pool with only 8 hosts in 4 different racks and 2
> > different pairs of switches (note: we consider switch failure more
> > common than rack cooling or power failure), we're using 4/2 with failure
> > domain = host. We currently use this for SSD scratch storage for HPC.
> >
> >
> > For one of our larger pools, with 24 hosts over 6 different racks and 6
> > different pairs of switches, we're using 4:2 with failure domain = rack.
> >
> >
> > For another pool with similar host count but not spread over so many
> > pairs of switches, we're using 6:3 and failure domain = host.
> >
> >
> > Also keep in mind that a higher value of k/m may give you more
> > throughput but increase latency especially for small files, so it also
> > depends on how important performance is and what kind of file size you
> > store on your CephFS.
> >
> >
> > Cheers,
> >
> > Linh
> >
> > 
> > *From:* ceph-users  on behalf of
> > Oliver Schulz 
> > *Sent:* Sunday, 15 July 2018 9:46:16 PM
> > *To:* ceph-users
> > *Subject:* [ceph-users] CephFS with erasure coding, do I need a
> cache-pool?
> > Dear all,
> >
> > we're planning a new Ceph cluster, with CephFS as the
> > main workload, and would like to use erasure coding to
> > use the disks more efficiently. Access pattern will
> > probably be more read- than write-heavy, on average.
> >
> > I don't have any practical experience with erasure-
> > coded pools so far.
> >
> > I'd be glad for any hints / recommendations regarding
> > these questions:
> >
> > * Is an SSD cache pool recommended/necessary for
> > CephFS on an erasure-coded HDD pool 

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz

Thanks, Linh!

A question regarding choice of NVMe - do you think an
Intel P4510 or P4600 would do well for WAL+DB? I'm
thinking about using a single 2 TB NVMe for 15 OSDs.
Would you recommend a different model?

Is there any experience on how many 4k IOPS one should
have for WAL+DB per OSD?

We have a few new BlueStore nodes in an older
cluster, and we use Intel Optanes for WAL. We wanted to
use them for DB too - only to learn that, while fast,
they're just too small for the DB for several OSDs ...
so I hope a "regular" NVMe is fast enough?

We currently use the Gigabyte D120-C21 server barebone
(https://b2b.gigabyte.com/Storage-Server/D120-C21-rev-100)
for our OSD nodes, and we'd like to use it in our
next cluster too, because of the high storage density
and the good hdd-price to server-price ratio.
But it can only fit a single NVMe-drive (we use one of
the 16 HDD slots for a U.2 drive and connect it to the
single M.2-PCIe slot on the mainboard).


Cheers,

Oliver


On 18.07.2018 09:11, Linh Vu wrote:
On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and 
DBs (we accept that the risk of 1 card failing is low, and our failure 
domain is host anyway). Each OSD (16 per host) gets 2GB of WAL and 10GB 
of DB.



On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node, 
and 2x NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL 
and 40GB of DB.



On our upcoming NVMe OSD nodes, for obvious reasons, we don't do any such 
special allocation. 



Cheers,

Linh



*From:* Oliver Schulz 
*Sent:* Tuesday, 17 July 2018 11:39:26 PM
*To:* Linh Vu; ceph-users
*Subject:* Re: [ceph-users] CephFS with erasure coding, do I need a 
cache-pool?

Dear Linh,

another question, if I may:

How do you handle Bluestore WAL and DB, and
how much SSD space do you allocate for them?


Cheers,

Oliver


On 17.07.2018 08:55, Linh Vu wrote:
 > Hi Oliver,
 >
 >
 > We have several CephFS on EC pool deployments, one has been in production
 > for a while, the others about to go in, pending all the Bluestore+EC fixes
 > in 12.2.7 
 >
 >
 > Firstly as John and Greg have said, you don't need SSD cache pool at all.
 >
 >
 > Secondly, regarding k/m, it depends on how many hosts or racks you have,
 > and how many failures you want to tolerate.
 >
 >
 > For our smallest pool with only 8 hosts in 4 different racks and 2
 > different pairs of switches (note: we consider switch failure more
 > common than rack cooling or power failure), we're using 4/2 with failure
 > domain = host. We currently use this for SSD scratch storage for HPC.
 >
 >
 > For one of our larger pools, with 24 hosts over 6 different racks and 6
 > different pairs of switches, we're using 4:2 with failure domain = rack.
 >
 >
 > For another pool with similar host count but not spread over so many
 > pairs of switches, we're using 6:3 and failure domain = host.
 >
 >
 > Also keep in mind that a higher value of k/m may give you more
 > throughput but increase latency especially for small files, so it also
 > depends on how important performance is and what kind of file size you
 > store on your CephFS.
 >
 >
 > Cheers,
 >
 > Linh
 >
 > 
 > *From:* ceph-users  on behalf of
 > Oliver Schulz 
 > *Sent:* Sunday, 15 July 2018 9:46:16 PM
 > *To:* ceph-users
 > *Subject:* [ceph-users] CephFS with erasure coding, do I need a 
cache-pool?

 > Dear all,
 >
 > we're planning a new Ceph cluster, with CephFS as the
 > main workload, and would like to use erasure coding to
 > use the disks more efficiently. Access pattern will
 > probably be more read- than write-heavy, on average.
 >
 > I don't have any practical experience with erasure-
 > coded pools so far.
 >
 > I'd be glad for any hints / recommendations regarding
 > these questions:
 >
 > * Is an SSD cache pool recommended/necessary for
 > CephFS on an erasure-coded HDD pool (using Ceph
 > Luminous and BlueStore)?
 >
 > * What are good values for k/m for erasure coding in
 > practice (assuming a cluster of about 300 OSDs), to
 > make things robust and ease maintenance (ability to
 > take a few nodes down)? Is k/m = 6/3 a good choice?
 >
 > * Will it be sufficient to have k+m racks, resp. failure
 > domains?
 >
 >
 > Cheers and thanks for any advice,
 >
 > Oliver
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz

On 18.07.2018 00:43, Gregory Farnum wrote:

 > But you could also do a workaround like letting it choose (K+M)/2
racks
 > and putting two shards in each rack.
Oh yes, you are more susceptible to top-of-rack switch failures in this 
case or whatever. It's just one option — many people are less concerned 


Would I set failure domain to host, in this case?
Or still to rack?
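
For reference, in the simple one-shard-per-failure-domain case the choice is
expressed through the crush-failure-domain option of the erasure-code profile;
the two-shards-per-rack workaround Greg describes would need a hand-written
CRUSH rule instead. A minimal sketch of the simple case, with placeholder names
(PG counts are illustrative; allow_ec_overwrites is what lets CephFS write
directly to an EC BlueStore pool):

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack
ceph osd pool create cephfs_ec_data 1024 1024 erasure ec42
ceph osd pool set cephfs_ec_data allow_ec_overwrites true
ceph fs add_data_pool cephfs cephfs_ec_data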
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Exact scope of OSD heartbeating?

2018-07-17 Thread Anthony D'Atri
The documentation here:

http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/ 


says

"Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 
seconds"

and

" If a neighboring Ceph OSD Daemon doesn’t show a heartbeat within a 20 second 
grace period, the Ceph OSD Daemon may consider the neighboring Ceph OSD Daemon 
down and report it back to a Ceph Monitor,"

I've always thought that each OSD heartbeats with *every* other OSD, which of 
course means that total heartbeat traffic grows ~ quadratically.  However, in 
extended testing we've observed that the number of other OSDs a given OSD 
heartbeats with was < N, which has us wondering if perhaps only OSDs with which 
a given OSD shares PGs are contacted -- or some other subset.
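
For reference, the intervals and thresholds the docs describe can be confirmed
on a live daemon via the admin socket; a sketch (osd.0 and mon.a are just
example daemon ids):

ceph daemon osd.0 config show | grep -E 'osd_heartbeat_interval|osd_heartbeat_grace'
ceph daemon mon.a config show | grep mon_osd_min_down_reporters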

I plan to submit a doc fix for mon_osd_min_down_reporters and wanted to resolve 
this FUD first.

-- aad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Linh Vu
On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and DBs (we 
accept that the risk of 1 card failing is low, and our failure domain is host 
anyway). Each OSD (16 per host) gets 2GB of WAL and 10GB of DB.


On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node, and 2x 
NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL and 40GB of DB.


On our upcoming NVMe OSD nodes, for obvious reasons, we don't do any such 
special allocation. 


Cheers,

Linh



From: Oliver Schulz 
Sent: Tuesday, 17 July 2018 11:39:26 PM
To: Linh Vu; ceph-users
Subject: Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

Dear Linh,

another question, if I may:

How do you handle Bluestore WAL and DB, and
how much SSD space do you allocate for them?


Cheers,

Oliver


On 17.07.2018 08:55, Linh Vu wrote:
> Hi Oliver,
>
>
> We have several CephFS on EC pool deployments, one has been in production
> for a while, the others about to go in, pending all the Bluestore+EC fixes
> in 12.2.7 
>
>
> Firstly as John and Greg have said, you don't need SSD cache pool at all.
>
>
> Secondly, regarding k/m, it depends on how many hosts or racks you have,
> and how many failures you want to tolerate.
>
>
> For our smallest pool with only 8 hosts in 4 different racks and 2
> different pairs of switches (note: we consider switch failure more
> common than rack cooling or power failure), we're using 4/2 with failure
> domain = host. We currently use this for SSD scratch storage for HPC.
>
>
> For one of our larger pools, with 24 hosts over 6 different racks and 6
> different pairs of switches, we're using 4:2 with failure domain = rack.
>
>
> For another pool with similar host count but not spread over so many
> pairs of switches, we're using 6:3 and failure domain = host.
>
>
> Also keep in mind that a higher value of k/m may give you more
> throughput but increase latency especially for small files, so it also
> depends on how important performance is and what kind of file size you
> store on your CephFS.
>
>
> Cheers,
>
> Linh
>
> 
> *From:* ceph-users  on behalf of
> Oliver Schulz 
> *Sent:* Sunday, 15 July 2018 9:46:16 PM
> *To:* ceph-users
> *Subject:* [ceph-users] CephFS with erasure coding, do I need a cache-pool?
> Dear all,
>
> we're planning a new Ceph cluster, with CephFS as the
> main workload, and would like to use erasure coding to
> use the disks more efficiently. Access pattern will
> probably be more read- than write-heavy, on average.
>
> I don't have any practical experience with erasure-
> coded pools so far.
>
> I'd be glad for any hints / recommendations regarding
> these questions:
>
> * Is an SSD cache pool recommended/necessary for
> CephFS on an erasure-coded HDD pool (using Ceph
> Luminous and BlueStore)?
>
> * What are good values for k/m for erasure coding in
> practice (assuming a cluster of about 300 OSDs), to
> make things robust and ease maintenance (ability to
> take a few nodes down)? Is k/m = 6/3 a good choice?
>
> * Will it be sufficient to have k+m racks, resp. failure
> domains?
>
>
> Cheers and thanks for any advice,
>
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Linh Vu
Thanks for all your hard work in putting out the fixes so quickly! :)

We have a cluster on 12.2.5 with Bluestore and an EC pool, but for CephFS, not RGW. 
In the release notes, it says RGW is at risk, especially the garbage collection, 
and the recommendation is to either pause IO or disable RGW garbage collection.


In our case with CephFS, not RGW, is it a lot less risky to perform the upgrade 
to 12.2.7 without the need to pause IO?


What does pause IO do? Do current sessions just get queued up, and does IO resume 
normally with no problems after unpausing?


If we have to pause IO, is it better to do something like: pause IO, restart 
OSDs on one node, unpause IO - repeated for all the nodes involved in the EC 
pool?
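
For reference, "pause IO" here means the cluster-wide pause flag; a rough
sketch of the per-node sequence being proposed (whether the pause is needed at
all for a CephFS-only cluster is exactly the open question):

ceph osd set pause                    # sets pauserd/pausewr; new client IO blocks
systemctl restart ceph-osd.target     # restart/upgrade the OSDs on this node
ceph osd unset pause                  # blocked client IO resumes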


Regards,

Linh


From: ceph-users  on behalf of Sage Weil 

Sent: Wednesday, 18 July 2018 4:42:41 AM
To: Stefan Kooman
Cc: ceph-annou...@ceph.com; ceph-de...@vger.kernel.org; 
ceph-maintain...@ceph.com; ceph-us...@ceph.com
Subject: Re: [ceph-users] v12.2.7 Luminous released

On Tue, 17 Jul 2018, Stefan Kooman wrote:
> Quoting Abhishek Lekshmanan (abhis...@suse.com):
>
> > *NOTE* The v12.2.5 release has a potential data corruption issue with
> > erasure coded pools. If you ran v12.2.5 with erasure coding, please see
^^^
> > below.
>
> < snip >
>
> > Upgrading from v12.2.5 or v12.2.6
> > -
> >
> > If you used v12.2.5 or v12.2.6 in combination with erasure coded
^
> > pools, there is a small risk of corruption under certain workloads.
> > Specifically, when:
>
> < snip >
>
> One section mentions Luminous clusters _with_ EC pools specifically, the other
> section mentions Luminous clusters running 12.2.5.

I think they both do?

> I might be misreading this, but to make things clear for current Ceph
> Luminous 12.2.5 users. Is the following statement correct?
>
> If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools), there 
> is
> no need to quiesce IO (ceph osd pause).

Correct.

> http://docs.ceph.com/docs/master/releases/luminous/#upgrading-from-other-versions
> If your cluster did not run v12.2.5 or v12.2.6 then none of the above
> issues apply to you and you should upgrade normally.
>
> ^^ Above section would indicate all 12.2.5 luminous clusters.

The intent here is to clarify that any cluster running 12.2.4 or
older can upgrade without reading carefully. If the cluster
does/did run 12.2.5 or .6, then read carefully because it may (or may not)
be affected.

Does that help? Any suggested revisions to the wording in the release
notes that make it clearer are welcome!

Thanks-
sage


>
> Please clarify,
>
> Thanks,
>
> Stefan
>
> --
> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at 
> http://vger.kernel.org/majordomo-info.html
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resize wal/db

2018-07-17 Thread Shunde Zhang
Hi Igor,

Thanks for the reply.
I am using an NVMe SSD for wal/db. Do you have any instructions on how to create a 
partition for wal/db on it?
For example, the type of the existing partitions shows as unknown.
If I create a new one, it defaults to Linux:

Disk /dev/nvme0n1: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 300AA8A7-8E72-4D4D-8232-02A0EA7FEB00


 #     Start       End   Size  Type     Name
 1      2048   2099199     1G  unknown  ceph block.db
 2   2099200   3278847   576M  unknown  ceph block.wal
 3   3278848   5375999     1G  unknown  ceph block.db
 4   5376000   6555647   576M  unknown  ceph block.wal
 5   6555648   8652799     1G  unknown  ceph block.db
 6   8652800   9832447   576M  unknown  ceph block.wal
 7   9832448  11929599     1G  unknown  ceph block.db
 8  11929600  13109247   576M  unknown  ceph block.wal
 9  13109248  15206399     1G  unknown  ceph block.db
10  15206400  16386047   576M  unknown  ceph block.wal
11  16386048  18483199     1G  unknown  ceph block.db
12  18483200  19662847   576M  unknown  ceph block.wal
13  19662848  21759999     1G  unknown  ceph block.db
14  21760000  22939647   576M  unknown  ceph block.wal
15  22939648  25036799     1G  unknown  ceph block.db
16  25036800  26216447   576M  unknown  ceph block.wal
17  26216448  28313599     1G  unknown  ceph block.db
18  28313600  29493247   576M  unknown  ceph block.wal
19  29493248  31590399     1G  unknown  ceph block.db
20  31590400  32770047   576M  unknown  ceph block.wal
21  32770048  34867199     1G  unknown  ceph block.db
22  34867200  36046847   576M  unknown  ceph block.wal
23  36046848  38143999     1G  unknown  ceph block.db
24  38144000  39323647   576M  unknown  ceph block.wal
25  39323648  41420799     1G  unknown  ceph block.db
26  41420800  42600447   576M  unknown  ceph block.wal
27  42600448  44697599     1G  unknown  ceph block.db
28  44697600  45877247   576M  unknown  ceph block.wal
29  45877248  47974399     1G  Linux filesystem
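
For what it's worth, the ceph-bluestore-tool route Igor describes below (grow
the partition first, then fix up the size label and expand BlueFS) looks
roughly like the following sketch; the OSD id, paths and new size are
placeholders, and the exact steps may vary by release:

systemctl stop ceph-osd@3
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-3/block.db
ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-3/block.db -k size -v <new size in bytes>
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-3
systemctl start ceph-osd@3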


Thanks,
Shunde

> On 18 Jul 2018, at 12:08 am, Igor Fedotov  wrote:
> 
> For now you can expand that space up to actual volume size using 
> ceph-bluestore-tool commands ( bluefs-bdev-expand and set-label-key).
> 
> Which is a bit tricky though.
> 
> And I'm currently working on a solution within ceph-bluestore tool to  
> simplify both expansion and migration.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 7/17/2018 5:02 PM, Nicolas Huillard wrote:
>> Le mardi 17 juillet 2018 à 16:20 +0300, Igor Fedotov a écrit :
>>> Right, but procedure described in the blog can be pretty easily
>>> adjusted
>>> to do a resize.
>> Sure, but if I remember correctly, Ceph itself cannot use the increased
>> size: you'll end up with a larger device with unused additional space.
>> Using that space may be on the TODO, though, so this may not be a
>> complete waste of space...
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan,

Try both of the commands that are timing out again, but with the --verbose flag, and 
see if we can get anything from that.

Tom

From: Bryan Banister 
Sent: 17 July 2018 23:51
To: Tom W ; ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Hey Tom,

Ok, yeah, I've used those steps before myself for other operations like cluster 
software updates.  I'll try them.

As for the query, not sure how I missed that, and think I've used it before!  
Unfortunately it just hangs similar to the other daemon operation:
root@rook-tools:/# ceph pg 19.1fdf query

Thanks,
-Bryan


From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 5:36 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

Try "ceph pg xx.xx query", with xx.xx of course being the PG number. With some 
luck this will give you the state of that individual pg and which OSDs or 
issues may be blocking the peering from completing, which can be used as a clue 
perhaps as to the cause. If you can find a point of two OSDs unable to peer, 
you then at least have a pathway to begin testing connectivity again.

For pausing your cluster, my method (never tested in an environment with more 
than a few osds or in production) is:
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
ceph osd set nodown
ceph osd set norebalance
ceph osd pause

To return, just reverse the order with "ceph osd pause" becoming "ceph osd 
unpause" I believe. All the above flags will stop most activity and just get 
things peered and not much else, once they are successfully peered, you can 
slowly begin to unset the above (and I do recommend going slowly..). You don't 
have any significant misplaced/degraded objects so you won't likely see much 
recovery activity, but as scrubbing kicks in there might be significant amounts 
of inconsistent PGs and backfill/recovery going on. It might be best to limit 
the impact of these from going 0 to 100 with these parameters (1 backfill at a 
time, wait 0.1 seconds between recovery op per OSD).

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-sleep 0.1'

Tom



From: Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 23:22
To: Tom W mailto:to...@ukfast.co.uk>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Thanks Tom,

Yes, we can try pausing I/O to give the cluster time to recover.  I assume that 
you're talking about using `ceph osd set pause` for this?

We did finally get some health output, which seems to indicate everything is 
basically stuck:
2018-07-17 21:00:00.000107 mon.rook-ceph-mon7 [WRN] overall HEALTH_WARN 
nodown,noout flag(s) set; 1/8884349 objects misplaced (0.000%); Reduced data 
availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering; Degraded 
data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs 
undersized; 135 slow requests are blocked > 32 sec
2018-07-17 22:00:00.000124 mon.rook-ceph-mon7 [WRN] overall HEALTH_WARN 
nodown,noout flag(s) set; 1/8884349 objects misplaced (0.000%); Reduced data 
availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering; Degraded 
data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs 
undersized; 135 slow requests are blocked > 32 sec

I can't seem to find a command to run a query on a specific PG, though I'm 
really new to ceph so sorry if that's an obvious thing.  What would I run to 
query the status and condition of a PG?

I'll talk with our kubernetes team to see if they can also help rule out any 
networking related issues.

Cheers,
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 5:06 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

That's unusual, and not something I can really begin to unravel. As some other 
pointers, perhaps run a PG query on some of the inactive and peering PGs for 
any potentially useful output?

I suspect from what you've put that most PGs are simply in a down and peering 
state, and it can't peer as they are down still. The nodown flag doesn't seem 
to have fixed that, but then again it can't peer if they actually are down 
which nodown will mask.

Is pausing all cluster IO an option for you? My thinking here is to pause all 
IO, completely 

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hey Tom,

Ok, yeah, I've used those steps before myself for other operations like cluster 
software updates.  I'll try them.

As for the query, not sure how I missed that, and think I've used it before!  
Unfortunately it just hangs similar to the other daemon operation:
root@rook-tools:/# ceph pg 19.1fdf query

Thanks,
-Bryan


From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 5:36 PM
To: Bryan Banister ; ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

Try "ceph pg xx.xx query", with xx.xx of course being the PG number. With some 
luck this will give you the state of that individual pg and which OSDs or 
issues may be blocking the peering from completing, which can be used as a clue 
perhaps as to the cause. If you can find a point of two OSDs unable to peer, 
you then at least have a pathway to begin testing connectivity again.

For pausing your cluster, my method (never tested in an environment with more 
than a few osds or in production) is:
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
ceph osd set nodown
ceph osd set norebalance
ceph osd pause

To return, just reverse the order with "ceph osd pause" becoming "ceph osd 
unpause" I believe. All the above flags will stop most activity and just get 
things peered and not much else, once they are successfully peered, you can 
slowly begin to unset the above (and I do recommend going slowly..). You don't 
have any significant misplaced/degraded objects so you won't likely see much 
recovery activity, but as scrubbing kicks in there might be significant amounts 
of inconsistent PGs and backfill/recovery going on. It might be best to limit 
the impact of these from going 0 to 100 with these parameters (1 backfill at a 
time, wait 0.1 seconds between recovery op per OSD).

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-sleep 0.1'

Tom



From: Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 23:22
To: Tom W mailto:to...@ukfast.co.uk>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Thanks Tom,

Yes, we can try pausing I/O to give the cluster time to recover.  I assume that 
you're talking about using `ceph osd set pause` for this?

We did finally get some health output, which seems to indicate everything is 
basically stuck:
2018-07-17 21:00:00.000107 mon.rook-ceph-mon7 [WRN] overall HEALTH_WARN 
nodown,noout flag(s) set; 1/8884349 objects misplaced (0.000%); Reduced data 
availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering; Degraded 
data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs 
undersized; 135 slow requests are blocked > 32 sec
2018-07-17 22:00:00.000124 mon.rook-ceph-mon7 [WRN] overall HEALTH_WARN 
nodown,noout flag(s) set; 1/8884349 objects misplaced (0.000%); Reduced data 
availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering; Degraded 
data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs 
undersized; 135 slow requests are blocked > 32 sec

I can't seem to find a command to run a query on a specific PG, though I'm 
really new to ceph so sorry if that's an obvious thing.  What would I run to 
query the status and condition of a PG?

I'll talk with our kubernetes team to see if they can also help rule out any 
networking related issues.

Cheers,
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 5:06 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

That's unusual, and not something I can really begin to unravel. As some other 
pointers, perhaps run a PG query on some of the inactive and peering PGs for 
any potentially useful output?

I suspect from what you've put that most PGs are simply in a down and peering 
state, and it can't peer as they are down still. The nodown flag doesn't seem 
to have fixed that, but then again it can't peer if they actually are down 
which nodown will mask.

Is pausing all cluster IO an option for you? My thinking here is to pause all 
IO, completely restart and verify all OSDs are back up and operational? If they 
fail to come up during paused IO, it will rule out any spiking load, but this 
seems to be more of a network issue, as even peering would normally generate 
some volume of traffic as it cycles to reattempt.

I'm not familiar at all with Rook or Kubernetes at this stage so I also have 
concern over how the networking stack there would work. 

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan,

Try "ceph pg xx.xx query", with xx.xx of course being the PG number. With some 
luck this will give you the state of that individual pg and which OSDs or 
issues may be blocking the peering from completing, which can be used as a clue 
perhaps as to the cause. If you can find a point of two OSDs unable to peer, 
you then at least have a pathway to begin testing connectivity again.
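
A couple of cluster-wide views can complement the per-PG query; a sketch:

ceph pg dump_stuck inactive
ceph health detail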

For pausing your cluster, my method (never tested in an environment with more 
than a few osds or in production) is:
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
ceph osd set nodown
ceph osd set norebalance
ceph osd pause

To return, just reverse the order with "ceph osd pause" becoming "ceph osd 
unpause" I believe. All the above flags will stop most activity and just get 
things peered and not much else, once they are successfully peered, you can 
slowly begin to unset the above (and I do recommend going slowly..). You don't 
have any significant misplaced/degraded objects so you won't likely see much 
recovery activity, but as scrubbing kicks in there might be significant amounts 
of inconsistent PGs and backfill/recovery going on. It might be best to limit 
the impact of these from going 0 to 100 with these parameters (1 backfill at a 
time, wait 0.1 seconds between recovery op per OSD).

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-sleep 0.1'
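
A sketch of the reversal described above, in the opposite order, with "ceph osd
pause" becoming "ceph osd unpause":

ceph osd unpause
ceph osd unset norebalance
ceph osd unset nodown
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noout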

Tom



From: Bryan Banister 
Sent: 17 July 2018 23:22
To: Tom W ; ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Thanks Tom,

Yes, we can try pausing I/O to give the cluster time to recover.  I assume that 
you're talking about using `ceph osd set pause` for this?

We did finally get some health output, which seems to indicate everything is 
basically stuck:
2018-07-17 21:00:00.000107 mon.rook-ceph-mon7 [WRN] overall HEALTH_WARN 
nodown,noout flag(s) set; 1/8884349 objects misplaced (0.000%); Reduced data 
availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering; Degraded 
data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs 
undersized; 135 slow requests are blocked > 32 sec
2018-07-17 22:00:00.000124 mon.rook-ceph-mon7 [WRN] overall HEALTH_WARN 
nodown,noout flag(s) set; 1/8884349 objects misplaced (0.000%); Reduced data 
availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering; Degraded 
data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs 
undersized; 135 slow requests are blocked > 32 sec

I can't seem to find a command to run a query on a specific PG, though I'm 
really new to ceph so sorry if that's an obvious thing.  What would I run to 
query the status and condition of a PG?

I'll talk with our kubernetes team to see if they can also help rule out any 
networking related issues.

Cheers,
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 5:06 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

That's unusual, and not something I can really begin to unravel. As some other 
pointers, perhaps run a PG query on some of the inactive and peering PGs for 
any potentially useful output?

I suspect from what you've put that most PGs are simply in a down and peering 
state, and it can't peer as they are down still. The nodown flag doesn't seem 
to have fixed that, but then again it can't peer if they actually are down 
which nodown will mask.

Is pausing all cluster IO an option for you? My thinking here is to pause all 
IO, completely restart and verify all OSDs are back up and operational? If they 
fail to come up during paused IO, it will rule out any spiking load, but this 
seems to be more of a network issue, as even peering would normally generate 
some volume of traffic as it cycles to reattempt.

I'm not familiar at all with Rook or Kubernetes at this stage so I also have 
concern over how the networking stack there would work. MTU has been a problem 
in the past but this would only affect performance and not operation in my 
mind. Also perhaps being able to reach other nodes on the right interfaces, so 
can you definitely traverse the public and cluster networks successfully?

Tom



From: Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 22:36
To: Tom W mailto:to...@ukfast.co.uk>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Hi Tom,

I tried to check out the ops in flight as you suggested but this seems to just 
hang:
root@rook-ceph-osd-carg-kubelet-osd02-m9rhx:/# ceph --admin-daemon 

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Thanks Tom,

Yes, we can try pausing I/O to give the cluster time to recover.  I assume that 
you're talking about using `ceph osd set pause` for this?

We did finally get some health output, which seems to indicate everything is 
basically stuck:
2018-07-17 21:00:00.000107 mon.rook-ceph-mon7 [WRN] overall HEALTH_WARN 
nodown,noout flag(s) set; 1/8884349 objects misplaced (0.000%); Reduced data 
availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering; Degraded 
data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs 
undersized; 135 slow requests are blocked > 32 sec
2018-07-17 22:00:00.000124 mon.rook-ceph-mon7 [WRN] overall HEALTH_WARN 
nodown,noout flag(s) set; 1/8884349 objects misplaced (0.000%); Reduced data 
availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering; Degraded 
data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs 
undersized; 135 slow requests are blocked > 32 sec

I can't seem to find a command to run a query on a specific PG, though I'm 
really new to ceph so sorry if that's an obvious thing.  What would I run to 
query the status and condition of a PG?

I'll talk with our kubernetes team to see if they can also help rule out any 
networking related issues.

Cheers,
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 5:06 PM
To: Bryan Banister ; ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

That's unusual, and not something I can really begin to unravel. As some other 
pointers, perhaps run a PG query on some of the inactive and peering PGs for 
any potentially useful output?

I suspect from what you've put that most PGs are simply in a down and peering 
state, and it can't peer as they are down still. The nodown flag doesn't seem 
to have fixed that, but then again it can't peer if they actually are down 
which nodown will mask.

Is pausing all cluster IO an option for you? My thinking here is to pause all 
IO, completely restart and verify all OSDs are back up and operational? If they 
fail to come up during paused IO, it will rule out any spiking load, but this 
seems to be more of a network issue, as even peering would normally generate 
some volume of traffic as it cycles to reattempt.

I'm not familiar at all with Rook or Kubernetes at this stage so I also have 
concern over how the networking stack there would work. MTU has been a problem 
in the past but this would only affect performance and not operation in my 
mind. Also perhaps being able to reach other nodes on the right interfaces, so 
can you definitely traverse the public and cluster networks successfully?

Tom



From: Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 22:36
To: Tom W mailto:to...@ukfast.co.uk>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Hi Tom,

I tried to check out the ops in flight as you suggested but this seems to just 
hang:
root@rook-ceph-osd-carg-kubelet-osd02-m9rhx:/# ceph --admin-daemon 
/var/lib/rook/osd238/rook-osd.238.asok daemon osd.238 dump_ops_in_flight

Nothing returns and I don't get a prompt back.

The cluster is somewhat new, but has been running without any major issues for 
more than a week or so.  We're not even sure how this all started.

I'm happy to provide more details of our deployment if you or others need 
anything.

We haven't changed anything today/recently.  I think you're correct that 
unsetting 'nodown' will just return things to the previous state.

Thanks!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 4:19 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

OSDs may not truly be up, this flag merely prevents them being marked as down 
even if they are unresponsive. It may be worth unsetting nodown as soon as you 
are confident, but unsetting it before anything changes will just return to the 
previous state. Perhaps not harmful, but I have no oversight on your deployment 
nor am I an expert in any regards.

Find an OSD which is up and having issues peering, and perhaps try something 
like this

ceph daemon osd.x dump_ops_in_flight

Replacing x with the OSD number, I am curious to see what may be holding it up. 
I assume you have already done the usual tests to ensure it is traversing the 
right interface, correct VLANs, reachable via ICMP, perhaps even run an iperf 
and tcpdump to be certain the flow is as expected.

Tom

From: Bryan Banister 

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan,

That's unusual, and not something I can really begin to unravel. As some other 
pointers, perhaps run a PG query on some of the inactive and peering PGs for 
any potentially useful output?

I suspect from what you've put that most PGs are simply in a down and peering 
state, and it can't peer as they are down still. The nodown flag doesn't seem 
to have fixed that, but then again it can't peer if they actually are down 
which nodown will mask.

Is pausing all cluster IO an option for you? My thinking here is to pause all 
IO, completely restart and verify all OSDs are back up and operational? If they 
fail to come up during paused IO, it will rule out any spiking load, but this 
seems to be more of a network issue, as even peering would normally generate 
some volume of traffic as it cycles to reattempt.

I'm not familiar at all with Rook or Kubernetes at this stage so I also have 
concern over how the networking stack there would work. MTU has been a problem 
in the past but this would only affect performance and not operation in my 
mind. Also perhaps being able to reach other nodes on the right interfaces, so 
can you definitely traverse the public and cluster networks successfully?

Tom


From: Bryan Banister 
Sent: 17 July 2018 22:36
To: Tom W ; ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Hi Tom,

I tried to check out the ops in flight as you suggested but this seems to just 
hang:
root@rook-ceph-osd-carg-kubelet-osd02-m9rhx:/# ceph --admin-daemon 
/var/lib/rook/osd238/rook-osd.238.asok daemon osd.238 dump_ops_in_flight

Nothing returns and I don't get a prompt back.

The cluster is somewhat new, but has been running without any major issues for 
more than a week or so.  We're not even sure how this all started.

I'm happy to provide more details of our deployment if you or others need 
anything.

We haven't changed anything today/recently.  I think you're correct that 
unsetting 'nodown' will just return things to the previous state.

Thanks!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 4:19 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

OSDs may not truly be up, this flag merely prevents them being marked as down 
even if they are unresponsive. It may be worth unsetting nodown as soon as you 
are confident, but unsetting it before anything changes will just return to the 
previous state. Perhaps not harmful, but I have no oversight on your deployment 
nor am I an expert in any regards.

Find an OSD which is up and having issues peering, and perhaps try something 
like this

ceph daemon osd.x dump_ops_in_flight

Replacing x with the OSD number, I am curious to see what may be holding it up. 
I assume you have already done the usual tests to ensure it is traversing the 
right interface, correct VLANs, reachable via ICMP, perhaps even run an iperf 
and tcpdump to be certain the flow is as expected.

Tom

From: Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 22:03
To: Tom W mailto:to...@ukfast.co.uk>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Hi Tom,

Decided to try your suggestion of the 'nodown' setting and this indeed has 
gotten all of the OSDs up and they haven't failed out like before.  However the 
PGs are in bad states and Ceph doesn't seem interested in starting recovery 
over the last 30 minutes since the latest health message was reported:

2018-07-17 20:29:00.638398 mon.rook-ceph-mon7 [WRN] Health check update: 
1/8884343 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:00.864863 mon.rook-ceph-mon7 [INF] osd.221 
7.129.220.49:6957/30346 boot
2018-07-17 20:29:01.907855 mon.rook-ceph-mon7 [INF] Health check cleared: 
OSD_DOWN (was: 1 osds down)
2018-07-17 20:29:02.598518 mon.rook-ceph-mon7 [INF] osd.238 
7.129.220.49:6923/30330 boot
2018-07-17 20:29:02.988546 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10895 pgs inactive, 6514 pgs down, 4391 pgs peering, 
2 pgs stale (PG_AVAILABILITY)
2018-07-17 20:29:04.380454 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 
80 pgs undersized (PG_DEGRADED)
2018-07-17 20:29:08.319073 mon.rook-ceph-mon7 [WRN] Health check update: 
1/8884349 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:08.319103 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10893 pgs inactive, 6391 pgs down, 4515 pgs peering, 
1 pg stale (PG_AVAILABILITY)

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hi Tom,

I tried to check out the ops in flight as you suggested but this seems to just 
hang:
root@rook-ceph-osd-carg-kubelet-osd02-m9rhx:/# ceph --admin-daemon 
/var/lib/rook/osd238/rook-osd.238.asok daemon osd.238 dump_ops_in_flight

Nothing returns and I don't get a prompt back.

The cluster is somewhat new, but has been running without any major issues for 
more than a week or so.  We're not even sure how this all started.

I'm happy to provide more details of our deployment if you or others need 
anything.

We haven't changed anything today/recently.  I think you're correct that 
unsetting 'nodown' will just return things to the previous state.

Thanks!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 4:19 PM
To: Bryan Banister ; ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Hi Bryan,

OSDs may not truly be up, this flag merely prevents them being marked as down 
even if they are unresponsive. It may be worth unsetting nodown as soon as you 
are confident, but unsetting it before anything changes will just return to the 
previous state. Perhaps not harmful, but I have no oversight on your deployment 
nor am I an expert in any regards.

Find an OSD which is up and having issues peering, and perhaps try something 
like this

ceph daemon osd.x dump_ops_in_flight

Replacing x with the OSD number, I am curious to see what may be holding it up. 
I assume you have already done the usual tests to ensure it is traversing the 
right interface, correct VLANs, reachable via ICMP, perhaps even run an iperf 
and tcpdump to be certain the flow is as expected.

Tom

From: Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 22:03
To: Tom W mailto:to...@ukfast.co.uk>>; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Hi Tom,

Decided to try your suggestion of the 'nodown' setting and this indeed has 
gotten all of the OSDs up and they haven't failed out like before.  However the 
PGs are in bad states and Ceph doesn't seem interested in starting recovery 
over the last 30 minutes since the latest health message was reported:

2018-07-17 20:29:00.638398 mon.rook-ceph-mon7 [WRN] Health check update: 
1/8884343 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:00.864863 mon.rook-ceph-mon7 [INF] osd.221 
7.129.220.49:6957/30346 boot
2018-07-17 20:29:01.907855 mon.rook-ceph-mon7 [INF] Health check cleared: 
OSD_DOWN (was: 1 osds down)
2018-07-17 20:29:02.598518 mon.rook-ceph-mon7 [INF] osd.238 
7.129.220.49:6923/30330 boot
2018-07-17 20:29:02.988546 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10895 pgs inactive, 6514 pgs down, 4391 pgs peering, 
2 pgs stale (PG_AVAILABILITY)
2018-07-17 20:29:04.380454 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 
80 pgs undersized (PG_DEGRADED)
2018-07-17 20:29:08.319073 mon.rook-ceph-mon7 [WRN] Health check update: 
1/8884349 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:08.319103 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10893 pgs inactive, 6391 pgs down, 4515 pgs peering, 
1 pg stale (PG_AVAILABILITY)
2018-07-17 20:29:13.319406 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10893 pgs inactive, 6354 pgs down, 4552 pgs peering 
(PG_AVAILABILITY)
2018-07-17 20:29:14.044696 mon.rook-ceph-mon7 [WRN] Health check update: 123 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:20.277493 mon.rook-ceph-mon7 [WRN] Health check update: 129 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:27.344834 mon.rook-ceph-mon7 [WRN] Health check update: 135 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:54.516115 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10899 pgs inactive, 6354 pgs down, 4552 pgs peering 
(PG_AVAILABILITY)
2018-07-17 20:30:03.322101 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering 
(PG_AVAILABILITY)

Nothing since then, which was 30 min ago.  Hosts are basically idle.

I'm thinking of unsetting the 'nodown' flag now to see what it does, but is there 
any other recommendations here before I do that?

Thanks again!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 1:58 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Prior to the 

Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Cassiano Pilipavicius

FYI,

I have updated some OSDs from 12.2.6 that were suffering from the CRC 
error, and 12.2.7 fixed the issue!


I installed some new OSDs on 12/07 without being aware of the issue, 
and in my small clusters I just noticed the problem when I was trying 
to copy some RBD images to another pool... probably my VMs were not 
reading the full objects at once, so this has not affected my users.


Thanks to the developers for fixing the issue quickly and for all the 
users that posted info about the issue to the list!



Em 7/17/2018 3:42 PM, Sage Weil escreveu:

On Tue, 17 Jul 2018, Stefan Kooman wrote:

Quoting Abhishek Lekshmanan (abhis...@suse.com):


*NOTE* The v12.2.5 release has a potential data corruption issue with
erasure coded pools. If you ran v12.2.5 with erasure coding, please see

 ^^^

below.

< snip >


Upgrading from v12.2.5 or v12.2.6
-

If you used v12.2.5 or v12.2.6 in combination with erasure coded

^

pools, there is a small risk of corruption under certain workloads.
Specifically, when:

< snip >

One section mentions Luminous clusters _with_ EC pools specifically, the other
section mentions Luminous clusters running 12.2.5.

I think they both do?


I might be misreading this, but to make things clear for current Ceph
Luminous 12.2.5 users. Is the following statement correct?

If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools), there is
no need to quiesce IO (ceph osd pause).

Correct.


http://docs.ceph.com/docs/master/releases/luminous/#upgrading-from-other-versions
If your cluster did not run v12.2.5 or v12.2.6 then none of the above
issues apply to you and you should upgrade normally.

^^ Above section would indicate all 12.2.5 luminous clusters.

The intent here is to clarify that any cluster running 12.2.4 or
older can upgrade without reading carefully.  If the cluster
does/did run 12.2.5 or .6, then read carefully because it may (or may not)
be affected.

Does that help?  Any suggested revisions to the wording in the release
notes that make it clearer are welcome!

Thanks-
sage



Please clarify,

Thanks,

Stefan

--
| BIT BV  http://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan,

OSDs may not truly be up, this flag merely prevents them being marked as down 
even if they are unresponsive. It may be worth unsetting nodown as soon as you 
are confident, but unsetting it before anything changes will just return to the 
previous state. Perhaps not harmful, but I have no oversight on your deployment 
nor am I an expert in any regards.

Find an OSD which is up and having issues peering, and perhaps try something 
like this

ceph daemon osd.x dump_ops_in_flight

Replacing x with the OSD number, I am curious to see what may be holding it up. 
I assume you have already done the usual tests to ensure it is traversing the 
right interface, correct VLANs, reachable via ICMP, perhaps even run an iperf 
and tcpdump to be certain the flow is as expected.

Tom

From: Bryan Banister 
Sent: 17 July 2018 22:03
To: Tom W ; ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Hi Tom,

Decided to try your suggestion of the 'nodown' setting and this indeed has 
gotten all of the OSDs up and they haven't failed out like before.  However the 
PGs are in bad states and Ceph doesn't seem interested in starting recovery 
over the last 30 minutes since the latest health message was reported:

2018-07-17 20:29:00.638398 mon.rook-ceph-mon7 [WRN] Health check update: 
1/8884343 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:00.864863 mon.rook-ceph-mon7 [INF] osd.221 
7.129.220.49:6957/30346 boot
2018-07-17 20:29:01.907855 mon.rook-ceph-mon7 [INF] Health check cleared: 
OSD_DOWN (was: 1 osds down)
2018-07-17 20:29:02.598518 mon.rook-ceph-mon7 [INF] osd.238 
7.129.220.49:6923/30330 boot
2018-07-17 20:29:02.988546 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10895 pgs inactive, 6514 pgs down, 4391 pgs peering, 
2 pgs stale (PG_AVAILABILITY)
2018-07-17 20:29:04.380454 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 
80 pgs undersized (PG_DEGRADED)
2018-07-17 20:29:08.319073 mon.rook-ceph-mon7 [WRN] Health check update: 
1/8884349 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:08.319103 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10893 pgs inactive, 6391 pgs down, 4515 pgs peering, 
1 pg stale (PG_AVAILABILITY)
2018-07-17 20:29:13.319406 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10893 pgs inactive, 6354 pgs down, 4552 pgs peering 
(PG_AVAILABILITY)
2018-07-17 20:29:14.044696 mon.rook-ceph-mon7 [WRN] Health check update: 123 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:20.277493 mon.rook-ceph-mon7 [WRN] Health check update: 129 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:27.344834 mon.rook-ceph-mon7 [WRN] Health check update: 135 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:54.516115 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10899 pgs inactive, 6354 pgs down, 4552 pgs peering 
(PG_AVAILABILITY)
2018-07-17 20:30:03.322101 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering 
(PG_AVAILABILITY)

Nothing since then, which was 30 min ago.  Hosts are basically idle.

I'm thinking of unsetting 'nodown' now to see what it does, but are there 
any other recommendations here before I do that?
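
For reference, a minimal sketch of what clearing the flag and watching recovery 
would look like (plain Luminous CLI; nothing here is specific to this deployment):

   ceph osd unset nodown
   ceph -s               # overall cluster state
   ceph health detail    # which PGs remain inactive/down/peering
   ceph -w               # stream cluster log events while things settle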

Thanks again!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 1:58 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Prior to the OSD being marked as down by the cluster, do you note the PGs 
become inactive on it? Using a flag such as nodown may prevent OSDs flapping if 
it helps reduce the IO load to see if things stabilise out, but be wary of this 
flag as I believe PGs using the OSD as the primary will not failover to another 
OSD while nodown is set.

My thoughts here, albeit I am shooting in the dark a little with this theory, 
are that perhaps individual OSDs are being overloaded and not returning a heartbeat as a 
result of the load. When OSDs are marked as down and new maps are distributed 
this would add further load so while it keeps recalculating it may be a vicious 
cycle which may be alleviated if it could stabilise.

With networks mainly idle, do you see any spikes at all? Perhaps an OSD coming 
online, OSD attempts backfill/recovery and QoS dropping the heartbeat packets 
if it overloads the link?
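
If dropped heartbeats under load are the suspicion, one way to test that theory 
is to temporarily widen the heartbeat grace while watching the links. This is 
purely a diagnostic band-aid, not a fix, and the option name below assumes stock 
Luminous defaults:

   ceph tell osd.* injectargs '--osd_heartbeat_grace 60'   # default is 20s; revert afterwards
   sar -n DEV 5        # per-interface throughput on an OSD host
   iostat -x 5         # disk saturation while OSDs boot/backfill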

Just spitballing some ideas here until somebody more qualified may have an idea.


From: Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hi Tom,

Decided to try your suggestion of the 'nodown' setting and this indeed has 
gotten all of the OSDs up and they haven't failed out like before.  However the 
PGs are in bad states and Ceph doesn't seem interested in starting recovery 
over the last 30 minutes since the latest health message was reported:

2018-07-17 20:29:00.638398 mon.rook-ceph-mon7 [WRN] Health check update: 
1/8884343 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:00.864863 mon.rook-ceph-mon7 [INF] osd.221 
7.129.220.49:6957/30346 boot
2018-07-17 20:29:01.907855 mon.rook-ceph-mon7 [INF] Health check cleared: 
OSD_DOWN (was: 1 osds down)
2018-07-17 20:29:02.598518 mon.rook-ceph-mon7 [INF] osd.238 
7.129.220.49:6923/30330 boot
2018-07-17 20:29:02.988546 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10895 pgs inactive, 6514 pgs down, 4391 pgs peering, 
2 pgs stale (PG_AVAILABILITY)
2018-07-17 20:29:04.380454 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 
80 pgs undersized (PG_DEGRADED)
2018-07-17 20:29:08.319073 mon.rook-ceph-mon7 [WRN] Health check update: 
1/8884349 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:08.319103 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10893 pgs inactive, 6391 pgs down, 4515 pgs peering, 
1 pg stale (PG_AVAILABILITY)
2018-07-17 20:29:13.319406 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10893 pgs inactive, 6354 pgs down, 4552 pgs peering 
(PG_AVAILABILITY)
2018-07-17 20:29:14.044696 mon.rook-ceph-mon7 [WRN] Health check update: 123 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:20.277493 mon.rook-ceph-mon7 [WRN] Health check update: 129 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:27.344834 mon.rook-ceph-mon7 [WRN] Health check update: 135 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:54.516115 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10899 pgs inactive, 6354 pgs down, 4552 pgs peering 
(PG_AVAILABILITY)
2018-07-17 20:30:03.322101 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering 
(PG_AVAILABILITY)

Nothing since then, which was 30 min ago.  Hosts are basically idle.

I'm thinking of unsetting 'nodown' now to see what it does, but are there 
any other recommendations here before I do that?

Thanks again!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 1:58 PM
To: Bryan Banister ; ceph-users@lists.ceph.com
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email

Prior to the OSD being marked as down by the cluster, do you note the PGs 
become inactive on it? Using a flag such as nodown may prevent OSDs flapping if 
it helps reduce the IO load to see if things stabilise out, but be wary of this 
flag as I believe PGs using the OSD as the primary will not failover to another 
OSD while nodown is set.

My thoughts here, albeit I am shooting in the dark a little with this theory, 
are that perhaps individual OSDs are being overloaded and not returning a heartbeat as a 
result of the load. When OSDs are marked as down and new maps are distributed 
this would add further load so while it keeps recalculating it may be a vicious 
cycle which may be alleviated if it could stabilise.

With networks mainly idle, do you see any spikes at all? Perhaps an OSD coming 
online, OSD attempts backfill/recovery and QoS dropping the heartbeat packets 
if it overloads the link?

Just spitballing some ideas here until somebody more qualified may have an idea.


From: Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 19:18:15
To: Bryan Banister; Tom W; 
ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

I didn't find anything obvious in the release notes about this issue we seem to 
have, but I don't really understand it.

We have seen logs indicating some kind of heartbeat issue with OSDs, but we 
don't believe there is any issues with the networking between the nodes, which 
are mostly idle as well:

2018-07-17 17:41:32.903871 I | osd12: 2018-07-17 17:41:32.903793 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6866 osd.219 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903875 I | osd12: 2018-07-17 17:41:32.903795 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6922 osd.220 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Prior to the OSD being marked as down by the cluster, do you note the PGs 
become inactive on it? Using a flag such as nodown may prevent OSDs flapping if 
it helps reduce the IO load to see if things stabilise out, but be wary of this 
flag as I believe PGs using the OSD as the primary will not failover to another 
OSD while nodown is set.

My thoughts here, albeit I am shooting in the dark a little with this theory, 
are that perhaps individual OSDs are being overloaded and not returning a heartbeat as a 
result of the load. When OSDs are marked as down and new maps are distributed 
this would add further load so while it keeps recalculating it may be a vicious 
cycle which may be alleviated if it could stabilise.

With networks mainly idle, do you see any spikes at all? Perhaps an OSD coming 
online, OSD attempts backfill/recovery and QoS dropping the heartbeat packets 
if it overloads the link?

Just spitballing some ideas here until somebody more qualified may have an idea.


From: Bryan Banister 
Sent: 17 July 2018 19:18:15
To: Bryan Banister; Tom W; ceph-users@lists.ceph.com
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

I didn’t find anything obvious in the release notes about this issue we seem to 
have, but I don’t really understand it.

We have seen logs indicating some kind of heartbeat issue with OSDs, but we 
don’t believe there is any issues with the networking between the nodes, which 
are mostly idle as well:

2018-07-17 17:41:32.903871 I | osd12: 2018-07-17 17:41:32.903793 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6866 osd.219 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903875 I | osd12: 2018-07-17 17:41:32.903795 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6922 osd.220 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903878 I | osd12: 2018-07-17 17:41:32.903798 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6901 osd.221 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903880 I | osd12: 2018-07-17 17:41:32.903800 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6963 osd.222 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903884 I | osd12: 2018-07-17 17:41:32.903803 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6907 osd.224 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)

Is there a way to resolve this issue, which seems to be the root cause of the 
OSDs being marked as failed?

Thanks in advance for any help,
-Bryan

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bryan 
Banister
Sent: Tuesday, July 17, 2018 12:08 PM
To: Tom W ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs 
failed, then marked down, then booted, then failed again

Note: External Email

Hi Tom,

We’re apparently running ceph version 12.2.5 on a Rook based cluster.  We have 
EC pools on large 8TB HDDs and metadata on bluestore OSDs on NVMe drives.

I’ll look at the release notes.

Thanks!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 12:05 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email


Hi Bryan,



What version of Ceph are you currently running on, and do you run any erasure 
coded pools or bluestore OSDs? Might be worth having a quick glance over the 
recent changelogs:



http://docs.ceph.com/docs/master/releases/luminous/



Tom


From: ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 18:00:05
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs 
failed, then marked down, then booted, then failed again

Hi all,

We’re still very new to managing Ceph and seem to have a cluster that is in an 
endless loop of failing OSDs, then marking them down, then booting them again:

Here are some example logs:
2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491973 >= grace 20.010293)
2018-07-17 16:48:28.976730 

Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Sage Weil
On Tue, 17 Jul 2018, Stefan Kooman wrote:
> Quoting Abhishek Lekshmanan (abhis...@suse.com):
> 
> > *NOTE* The v12.2.5 release has a potential data corruption issue with
> > erasure coded pools. If you ran v12.2.5 with erasure coding, please see
^^^
> > below.
> 
> < snip >
> 
> > Upgrading from v12.2.5 or v12.2.6
> > -
> > 
> > If you used v12.2.5 or v12.2.6 in combination with erasure coded
   ^
> > pools, there is a small risk of corruption under certain workloads.
> > Specifically, when:
> 
> < snip >
> 
> One section mentions Luminous clusters _with_ EC pools specifically, the other
> section mentions Luminous clusters running 12.2.5.

I think they both do?

> I might be misreading this, but to make things clear for current Ceph 
> Luminous 12.2.5 users: is the following statement correct?
> 
> If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools), there 
> is
> no need to quiesce IO (ceph osd pause).

Correct.

> http://docs.ceph.com/docs/master/releases/luminous/#upgrading-from-other-versions
> If your cluster did not run v12.2.5 or v12.2.6 then none of the above
> issues apply to you and you should upgrade normally.
> 
> ^^ Above section would indicate all 12.2.5 luminous clusters.

The intent here is to clarify that any cluster running 12.2.4 or 
older can upgrade without reading carefully.  If the cluster 
does/did run 12.2.5 or .6, then read carefully because it may (or may not) 
be affected.

Does that help?  Any suggested revisions to the wording in the release 
notes that make it clearer are welcome!

Thanks-
sage


> 
> Please clarify,
> 
> Thanks,
> 
> Stefan
> 
> -- 
> | BIT BV  http://www.bit.nl/   Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Stefan Kooman
Quoting Abhishek Lekshmanan (abhis...@suse.com):

> *NOTE* The v12.2.5 release has a potential data corruption issue with
> erasure coded pools. If you ran v12.2.5 with erasure coding, please see
> below.

< snip >

> Upgrading from v12.2.5 or v12.2.6
> -
> 
> If you used v12.2.5 or v12.2.6 in combination with erasure coded
> pools, there is a small risk of corruption under certain workloads.
> Specifically, when:

< snip >

One section mentions Luminous clusters _with_ EC pools specifically, the other
section mentions Luminous clusters running 12.2.5. I might be
misreading this, but to make things clear for current Ceph Luminous
12.2.5 users: is the following statement correct?

If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools), there is
no need to quiesce IO (ceph osd pause).

http://docs.ceph.com/docs/master/releases/luminous/#upgrading-from-other-versions
If your cluster did not run v12.2.5 or v12.2.6 then none of the above
issues apply to you and you should upgrade normally.

^^ Above section would indicate all 12.2.5 luminous clusters.

Please clarify,

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
I didn't find anything obvious in the release notes about this issue we seem to 
have, but I don't really understand it.

We have seen logs indicating some kind of heartbeat issue with OSDs, but we 
don't believe there is any issues with the networking between the nodes, which 
are mostly idle as well:

2018-07-17 17:41:32.903871 I | osd12: 2018-07-17 17:41:32.903793 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6866 osd.219 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903875 I | osd12: 2018-07-17 17:41:32.903795 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6922 osd.220 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903878 I | osd12: 2018-07-17 17:41:32.903798 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6901 osd.221 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903880 I | osd12: 2018-07-17 17:41:32.903800 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6963 osd.222 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903884 I | osd12: 2018-07-17 17:41:32.903803 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6907 osd.224 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)

Is there a way to resolve this issue, which seems to be the root cause of the 
OSDs being marked as failed?

Thanks in advance for any help,
-Bryan

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bryan 
Banister
Sent: Tuesday, July 17, 2018 12:08 PM
To: Tom W ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs 
failed, then marked down, then booted, then failed again

Note: External Email

Hi Tom,

We're apparently running ceph version 12.2.5 on a Rook based cluster.  We have 
EC pools on large 8TB HDDs and metadata on bluestore OSDs on NVMe drives.

I'll look at the release notes.

Thanks!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 12:05 PM
To: Bryan Banister 
mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email


Hi Bryan,



What version of Ceph are you currently running on, and do you run any erasure 
coded pools or bluestore OSDs? Might be worth having a quick glance over the 
recent changelogs:



http://docs.ceph.com/docs/master/releases/luminous/



Tom


From: ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 18:00:05
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs 
failed, then marked down, then booted, then failed again

Hi all,

We're still very new to managing Ceph and seem to have a cluster that is in an 
endless loop of failing OSDs, then marking them down, then booting them again:

Here are some example logs:
2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491973 >= grace 20.010293)
2018-07-17 16:48:28.976730 mon.rook-ceph-mon7 [INF] osd.84 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491916 >= grace 20.010293)
2018-07-17 16:48:28.976785 mon.rook-ceph-mon7 [INF] osd.85 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491870 >= grace 20.011151)
2018-07-17 16:48:28.976843 mon.rook-ceph-mon7 [INF] osd.86 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491828 >= grace 20.010293)
2018-07-17 16:48:28.976890 mon.rook-ceph-mon7 [INF] Marking osd.1 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976913 mon.rook-ceph-mon7 [INF] Marking osd.2 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976933 mon.rook-ceph-mon7 [INF] Marking osd.3 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976954 mon.rook-ceph-mon7 [INF] Marking osd.4 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976979 mon.rook-ceph-mon7 [INF] Marking osd.9 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.977000 mon.rook-ceph-mon7 [INF] Marking osd.10 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977020 mon.rook-ceph-mon7 [INF] Marking osd.11 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977040 

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hi Tom,

We're apparently running ceph version 12.2.5 on a Rook based cluster.  We have 
EC pools on large 8TB HDDs and metadata on bluestore OSDs on NVMe drives.

I'll look at the release notes.

Thanks!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 12:05 PM
To: Bryan Banister ; ceph-users@lists.ceph.com
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

Note: External Email


Hi Bryan,



What version of Ceph are you currently running on, and do you run any erasure 
coded pools or bluestore OSDs? Might be worth having a quick glance over the 
recent changelogs:



http://docs.ceph.com/docs/master/releases/luminous/



Tom


From: ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Bryan Banister 
mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 18:00:05
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs 
failed, then marked down, then booted, then failed again

Hi all,

We're still very new to managing Ceph and seem to have a cluster that is in an 
endless loop of failing OSDs, then marking them down, then booting them again:

Here are some example logs:
2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491973 >= grace 20.010293)
2018-07-17 16:48:28.976730 mon.rook-ceph-mon7 [INF] osd.84 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491916 >= grace 20.010293)
2018-07-17 16:48:28.976785 mon.rook-ceph-mon7 [INF] osd.85 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491870 >= grace 20.011151)
2018-07-17 16:48:28.976843 mon.rook-ceph-mon7 [INF] osd.86 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491828 >= grace 20.010293)
2018-07-17 16:48:28.976890 mon.rook-ceph-mon7 [INF] Marking osd.1 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976913 mon.rook-ceph-mon7 [INF] Marking osd.2 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976933 mon.rook-ceph-mon7 [INF] Marking osd.3 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976954 mon.rook-ceph-mon7 [INF] Marking osd.4 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976979 mon.rook-ceph-mon7 [INF] Marking osd.9 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.977000 mon.rook-ceph-mon7 [INF] Marking osd.10 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977020 mon.rook-ceph-mon7 [INF] Marking osd.11 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977040 mon.rook-ceph-mon7 [INF] Marking osd.12 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977059 mon.rook-ceph-mon7 [INF] Marking osd.13 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977079 mon.rook-ceph-mon7 [INF] Marking osd.14 out (has 
been down for 605 seconds)
2018-07-17 16:48:30.889316 mon.rook-ceph-mon7 [INF] osd.55 
7.129.218.12:6920/90761 boot
2018-07-17 16:48:31.113052 mon.rook-ceph-mon7 [WRN] Health check update: 
4946/8854434 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:31.113087 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 7951/8854434 objects degraded (0.090%), 88 pgs 
degraded, 273 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:32.763546 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10439 pgs inactive, 8994 pgs down, 1639 pgs peering, 
88 pgs incomplete, 3430 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:32.763578 mon.rook-ceph-mon7 [WRN] Health check update: 29 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 16:48:34.096178 mon.rook-ceph-mon7 [INF] osd.88 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
66.612054 >= grace 20.010283)
2018-07-17 16:48:34.108020 mon.rook-ceph-mon7 [WRN] Health check update: 112 
osds down (OSD_DOWN)
2018-07-17 16:48:38.736108 mon.rook-ceph-mon7 [WRN] Health check update: 
4946/8843715 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:38.736140 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10415 pgs inactive, 9000 pgs down, 1635 pgs peering, 
88 pgs incomplete, 3418 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:38.736166 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 7949/8843715 objects degraded (0.090%), 86 pgs 
degraded, 267 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:40.430146 mon.rook-ceph-mon7 [WRN] Health check update: 111 
osds down (OSD_DOWN)
2018-07-17 16:48:40.812579 mon.rook-ceph-mon7 [INF] osd.117 
7.129.217.10:6833/98090 boot
2018-07-17 16:48:42.427204 mon.rook-ceph-mon7 [INF] osd.115 
7.129.217.10:6940/98114 boot
2018-07-17 16:48:42.427297 mon.rook-ceph-mon7 [INF] osd.100 

Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Tom W
Hi Bryan,


What version of Ceph are you currently running on, and do you run any erasure 
coded pools or bluestore OSDs? Might be worth having a quick glance over the 
recent changelogs:


http://docs.ceph.com/docs/master/releases/luminous/


Tom


From: ceph-users  on behalf of Bryan 
Banister 
Sent: 17 July 2018 18:00:05
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs 
failed, then marked down, then booted, then failed again

Hi all,

We’re still very new to managing Ceph and seem to have a cluster that is in an 
endless loop of failing OSDs, then marking them down, then booting them again:

Here are some example logs:
2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491973 >= grace 20.010293)
2018-07-17 16:48:28.976730 mon.rook-ceph-mon7 [INF] osd.84 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491916 >= grace 20.010293)
2018-07-17 16:48:28.976785 mon.rook-ceph-mon7 [INF] osd.85 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491870 >= grace 20.011151)
2018-07-17 16:48:28.976843 mon.rook-ceph-mon7 [INF] osd.86 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491828 >= grace 20.010293)
2018-07-17 16:48:28.976890 mon.rook-ceph-mon7 [INF] Marking osd.1 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976913 mon.rook-ceph-mon7 [INF] Marking osd.2 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976933 mon.rook-ceph-mon7 [INF] Marking osd.3 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976954 mon.rook-ceph-mon7 [INF] Marking osd.4 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976979 mon.rook-ceph-mon7 [INF] Marking osd.9 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.977000 mon.rook-ceph-mon7 [INF] Marking osd.10 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977020 mon.rook-ceph-mon7 [INF] Marking osd.11 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977040 mon.rook-ceph-mon7 [INF] Marking osd.12 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977059 mon.rook-ceph-mon7 [INF] Marking osd.13 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977079 mon.rook-ceph-mon7 [INF] Marking osd.14 out (has 
been down for 605 seconds)
2018-07-17 16:48:30.889316 mon.rook-ceph-mon7 [INF] osd.55 
7.129.218.12:6920/90761 boot
2018-07-17 16:48:31.113052 mon.rook-ceph-mon7 [WRN] Health check update: 
4946/8854434 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:31.113087 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 7951/8854434 objects degraded (0.090%), 88 pgs 
degraded, 273 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:32.763546 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10439 pgs inactive, 8994 pgs down, 1639 pgs peering, 
88 pgs incomplete, 3430 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:32.763578 mon.rook-ceph-mon7 [WRN] Health check update: 29 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 16:48:34.096178 mon.rook-ceph-mon7 [INF] osd.88 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
66.612054 >= grace 20.010283)
2018-07-17 16:48:34.108020 mon.rook-ceph-mon7 [WRN] Health check update: 112 
osds down (OSD_DOWN)
2018-07-17 16:48:38.736108 mon.rook-ceph-mon7 [WRN] Health check update: 
4946/8843715 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:38.736140 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10415 pgs inactive, 9000 pgs down, 1635 pgs peering, 
88 pgs incomplete, 3418 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:38.736166 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 7949/8843715 objects degraded (0.090%), 86 pgs 
degraded, 267 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:40.430146 mon.rook-ceph-mon7 [WRN] Health check update: 111 
osds down (OSD_DOWN)
2018-07-17 16:48:40.812579 mon.rook-ceph-mon7 [INF] osd.117 
7.129.217.10:6833/98090 boot
2018-07-17 16:48:42.427204 mon.rook-ceph-mon7 [INF] osd.115 
7.129.217.10:6940/98114 boot
2018-07-17 16:48:42.427297 mon.rook-ceph-mon7 [INF] osd.100 
7.129.217.10:6899/98091 boot
2018-07-17 16:48:42.427502 mon.rook-ceph-mon7 [INF] osd.95 
7.129.217.10:6901/98092 boot

Not sure this is going to fix itself.  Any ideas on how to handle this 
situation??

Thanks in advance!
-Bryan





[ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

2018-07-17 Thread Bryan Banister
Hi all,

We're still very new to managing Ceph and seem to have a cluster that is in an 
endless loop of failing OSDs, then marking them down, then booting them again:

Here are some example logs:
2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491973 >= grace 20.010293)
2018-07-17 16:48:28.976730 mon.rook-ceph-mon7 [INF] osd.84 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491916 >= grace 20.010293)
2018-07-17 16:48:28.976785 mon.rook-ceph-mon7 [INF] osd.85 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491870 >= grace 20.011151)
2018-07-17 16:48:28.976843 mon.rook-ceph-mon7 [INF] osd.86 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491828 >= grace 20.010293)
2018-07-17 16:48:28.976890 mon.rook-ceph-mon7 [INF] Marking osd.1 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976913 mon.rook-ceph-mon7 [INF] Marking osd.2 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976933 mon.rook-ceph-mon7 [INF] Marking osd.3 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976954 mon.rook-ceph-mon7 [INF] Marking osd.4 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976979 mon.rook-ceph-mon7 [INF] Marking osd.9 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.977000 mon.rook-ceph-mon7 [INF] Marking osd.10 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977020 mon.rook-ceph-mon7 [INF] Marking osd.11 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977040 mon.rook-ceph-mon7 [INF] Marking osd.12 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977059 mon.rook-ceph-mon7 [INF] Marking osd.13 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977079 mon.rook-ceph-mon7 [INF] Marking osd.14 out (has 
been down for 605 seconds)
2018-07-17 16:48:30.889316 mon.rook-ceph-mon7 [INF] osd.55 
7.129.218.12:6920/90761 boot
2018-07-17 16:48:31.113052 mon.rook-ceph-mon7 [WRN] Health check update: 
4946/8854434 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:31.113087 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 7951/8854434 objects degraded (0.090%), 88 pgs 
degraded, 273 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:32.763546 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10439 pgs inactive, 8994 pgs down, 1639 pgs peering, 
88 pgs incomplete, 3430 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:32.763578 mon.rook-ceph-mon7 [WRN] Health check update: 29 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 16:48:34.096178 mon.rook-ceph-mon7 [INF] osd.88 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
66.612054 >= grace 20.010283)
2018-07-17 16:48:34.108020 mon.rook-ceph-mon7 [WRN] Health check update: 112 
osds down (OSD_DOWN)
2018-07-17 16:48:38.736108 mon.rook-ceph-mon7 [WRN] Health check update: 
4946/8843715 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:38.736140 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10415 pgs inactive, 9000 pgs down, 1635 pgs peering, 
88 pgs incomplete, 3418 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:38.736166 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 7949/8843715 objects degraded (0.090%), 86 pgs 
degraded, 267 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:40.430146 mon.rook-ceph-mon7 [WRN] Health check update: 111 
osds down (OSD_DOWN)
2018-07-17 16:48:40.812579 mon.rook-ceph-mon7 [INF] osd.117 
7.129.217.10:6833/98090 boot
2018-07-17 16:48:42.427204 mon.rook-ceph-mon7 [INF] osd.115 
7.129.217.10:6940/98114 boot
2018-07-17 16:48:42.427297 mon.rook-ceph-mon7 [INF] osd.100 
7.129.217.10:6899/98091 boot
2018-07-17 16:48:42.427502 mon.rook-ceph-mon7 [INF] osd.95 
7.129.217.10:6901/98092 boot

Not sure this is going to fix itself.  Any ideas on how to handle this 
situation??

Thanks in advance!
-Bryan





[ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-17 Thread Troy Ablan
I was on 12.2.5 for a couple weeks and started randomly seeing
corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
loose.  I panicked and moved to Mimic, and when that didn't solve the
problem, only then did I start to root around in mailing lists archives.

It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
out, but I'm unsure how to proceed now that the damaged cluster is
running under Mimic.  Is there anything I can do to get the cluster back
online and objects readable?

Everything is BlueStore and most of it is EC.

Thanks.

-Troy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is Ceph the right tool for storing lots of small files?

2018-07-17 Thread Gregory Farnum
On Mon, Jul 16, 2018 at 11:41 PM Christian Wimmer <
christian.wim...@jaumo.com> wrote:

> Hi all,
>
> I am trying to use Ceph with RGW to store lots (>300M) of small files (80%
> 2-15kB, 20% up to 500kB).
> After some testing, I wonder if Ceph is the right tool for that.
>
> Does anybody of you have experience with this use case?
>
> Things I came across:
> - EC pools: default stripe-width is 4kB. Does it make sense to lower the
> stripe width for small objects or is EC a bad idea for this use case?
> - Bluestore: bluestore min alloc size is per default 64kB. Would it be
> better to lower it to say 2kB or am I better off with Filestore (probably
> not if I want to store a huge amount of small files)?
> - Bluestore / RocksDB: RocksDB seems to consume a lot of disk space when
> storing lots of files.
>   For example: I have OSDs with about 500k onodes (which should translate
> to 500k stored objects, right?) and the DB size is about 30GB. That's about
> 63kB per onode - which is a lot, considering the original object is about
> 5kB.
>

Those numbers seem a little large to me (although with erasure coding they
could make sense due to the "object info" replication across shards), but
in general I would not expect Ceph or RGW to be a good fit for files which
tend to be that small from a data storage efficiency standpoint.

That said, you've got about 4TB of data there. Are you sure some large SSDs
in a RAID1(+0) or something wouldn't fulfill your needs? ;) If you're more
concerned about scaling out the IO than the ratio of data stored to data
used, Ceph may still be a good choice. *shrug*
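
To sanity-check numbers like the 500k onodes / 30GB DB quoted above, the per-OSD
counters can be pulled from the admin socket; a sketch (the counter names are
from memory for Luminous BlueStore, so verify them against your own perf dump
output):

   ceph daemon osd.0 perf dump | python -m json.tool | \
       grep -E 'bluestore_onodes|db_used_bytes|db_total_bytes'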
-Greg


>
> Thanks,
> Christian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Gregory Farnum
On Tue, Jul 17, 2018 at 3:40 AM Jake Grimmett  wrote:

>
> I'd be interested to hear more from Greg about why cache pools are best
> avoided...
>

While performance has improved over many releases, cache pools still don't
do well on most workloads that most people use them for. As a result we've
moved away from their current implementation; we continue to run their
tests and don't merge code which fails them, but bugs which pop up in the
community or are intermittent get a lot less attention than other areas of
RADOS do.

On Tue, Jul 17, 2018 at 6:32 AM Oliver Schulz 
wrote:

> > But you could also do workaround like letting it choose (K+M)/2 racks
> > and putting two shards in each rack.
>
> I probably have this wrong - wouldn't it reduce durability
> to put two shards in one failure domain?
>

Oh yes, you are more susceptible to top-of-rack switch failures in this
case or whatever. It's just one option — many people are less concerned
about their switches than their hard drives, especially since two lost
switches are an accessibility but not a durability issue.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] luminous librbd::image::OpenRequest: failed to retreive immutable metadata

2018-07-17 Thread Dan van der Ster
Hi,

This mail is for the search engines. An old "Won't Fix" ticket is
still quite relevant: http://tracker.ceph.com/issues/16211

When you upgrade an old rbd cluster to luminous, there is a good
chance you will have several rbd images with unreadable header
objects. E.g.

# rbd info -p volumes volume-d865f046-6f0e-4a95-9e2f-9d228af8c3ef
2018-07-17 18:15:57.224508 7fdf5a7fc700 -1 librbd::image::OpenRequest:
failed to retreive immutable metadata: (2) No such file or directory
rbd: error opening image volume-d865f046-6f0e-4a95-9e2f-9d228af8c3ef:
(2) No such file or directory

The workaround is to set a dummy omap key on the header object -- Wido
provided a script to do this in the tracker issue above.
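
Roughly, the idea looks like this (object and key names below are placeholders;
the image id has to be looked up first, and Wido's script in the tracker issue
is the authoritative version):

   # v2 images keep the name -> id mapping in the pool's rbd_directory object
   rados -p volumes listomapvals rbd_directory | less
   # set (and optionally remove again) a dummy omap key on the header object
   rados -p volumes setomapval rbd_header.<image-id> dummy value
   rados -p volumes rmomapkey rbd_header.<image-id> dummy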

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Abhishek Lekshmanan

This is the seventh bugfix release of Luminous v12.2.x long term
stable release series. This release contains several fixes for
regressions in the v12.2.6 and v12.2.5 releases.  We recommend that
all users upgrade. 

*NOTE* The v12.2.6 release has serious known regressions. While 12.2.6
wasn't formally announced on the mailing lists or blog, the packages
were built and have been available on download.ceph.com since last week. If you
installed this release, please see the upgrade procedure below.

*NOTE* The v12.2.5 release has a potential data corruption issue with
erasure coded pools. If you ran v12.2.5 with erasure coding, please see
below.

The full blog post, along with the complete changelog, is published on the
official Ceph blog at https://ceph.com/releases/12-2-7-luminous-released/

Upgrading from v12.2.6
--

v12.2.6 included an incomplete backport of an optimization for
BlueStore OSDs that avoids maintaining both the per-object checksum
and the internal BlueStore checksum.  Due to the accidental omission
of a critical follow-on patch, v12.2.6 corrupts (fails to update) the
stored per-object checksum value for some objects.  This can result in
an EIO error when trying to read those objects.

#. If your cluster uses FileStore only, no special action is required.
   This problem only affects clusters with BlueStore.

#. If your cluster has only BlueStore OSDs (no FileStore), then you
   should enable the following OSD option::

 osd skip data digest = true

   This will avoid setting and start ignoring the full-object digests
   whenever the primary for a PG is BlueStore.

#. If you have a mix of BlueStore and FileStore OSDs, then you should
   enable the following OSD option::

 osd distrust data digest = true

   This will avoid setting and start ignoring the full-object digests
   in all cases.  This weakens the data integrity checks for
   FileStore (although those checks were always only opportunistic).
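
For reference, a sketch of applying whichever option fits your cluster at
runtime (same option names as above; if injectargs reports the value is not
observed at runtime, add it to the ``[osd]`` section of ceph.conf and restart
instead)::

 ceph tell osd.* injectargs '--osd-skip-data-digest=true'
 # or, for mixed BlueStore/FileStore clusters:
 ceph tell osd.* injectargs '--osd-distrust-data-digest=true'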

If your cluster includes BlueStore OSDs and was affected, deep scrubs
will generate errors about mismatched CRCs for affected objects.
Currently the repair operation does not know how to correct them
(since all replicas do not match the expected checksum it does not
know how to proceed).  These warnings are harmless in the sense that
IO is not affected and the replicas are all still in sync.  The number
of affected objects is likely to drop (possibly to zero) on their own
over time as those objects are modified.  We expect to include a scrub
improvement in v12.2.8 to clean up any remaining objects.

Additionally, see the notes below, which apply to both v12.2.5 and v12.2.6.

Upgrading from v12.2.5 or v12.2.6
-

If you used v12.2.5 or v12.2.6 in combination with erasure coded
pools, there is a small risk of corruption under certain workloads.
Specifically, when:

* An erasure coded pool is in use
* The pool is busy with successful writes
* The pool is also busy with updates that result in an error result to
  the librados user.  RGW garbage collection is the most common
  example of this (it sends delete operations on objects that don't
  always exist.)
* Some OSDs are reasonably busy.  One known example of such load is
  FileStore splitting, although in principle any load on the cluster
  could also trigger the behavior.
* One or more OSDs restarts.

This combination can trigger an OSD crash and possibly leave PGs in a state
where they fail to peer.

Notably, upgrading a cluster involves OSD restarts and as such may
increase the risk of encountering this bug.  For this reason, for
clusters with erasure coded pools, we recommend the following upgrade
procedure to minimize risk:

1. Install the v12.2.7 packages.
2. Temporarily quiesce IO to cluster::

 ceph osd pause

3. Restart all OSDs and wait for all PGs to become active.
4. Resume IO::

 ceph osd unpause
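
Condensed, the sequence above looks roughly like this (the systemd unit name is
an assumption for a typical packaged install; adapt to your deployment)::

 # 1. install the v12.2.7 packages on all nodes first
 ceph osd pause                      # 2. quiesce client IO
 systemctl restart ceph-osd.target   # 3. on each OSD node in turn
 ceph -s                             #    wait until all PGs are active
 ceph osd unpause                    # 4. resume IO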

This will cause an availability outage for the duration of the OSD
restarts.  If this is unacceptable, a *more risky* alternative is to
disable RGW garbage collection (the primary known cause of these rados
operations) for the duration of the upgrade::

1. Set ``rgw_enable_gc_threads = false`` in ceph.conf
2. Restart all radosgw daemons
3. Upgrade and restart all OSDs
4. Remove ``rgw_enable_gc_threads = false`` from ceph.conf
5. Restart all radosgw daemons

Upgrading from other versions
-

If your cluster did not run v12.2.5 or v12.2.6 then none of the above
issues apply to you and you should upgrade normally.

v12.2.7 Changelog
-

* mon/AuthMonitor: improve error message (issue#21765, pr#22963, Douglas Fuller)
* osd/PG: do not blindly roll forward to log.head (issue#24597, pr#22976, Sage 
Weil)
* osd/PrimaryLogPG: rebuild attrs from clients (issue#24768 , pr#22962, Sage 
Weil)
* osd: work around data digest problems in 12.2.6 (version 2) (issue#24922, 
pr#23055, Sage Weil)
* rgw: objects in cache never refresh after 

[ceph-users] multisite and link speed

2018-07-17 Thread Robert Stanford
 I have ceph clusters in a zone configured as active/passive, or
primary/backup.  If the network link between the two clusters is slower
than the speed of data coming in to the active cluster, what will
eventually happen?  Will data pool on the active cluster until memory runs
out?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resize wal/db

2018-07-17 Thread Igor Fedotov
For now you can expand that space up to the actual volume size using 
ceph-bluestore-tool commands (bluefs-bdev-expand and set-label-key).


Which is a bit tricky though.
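
Very roughly, and only as a sketch (device paths, the size value and even the
exact order may differ per version, so check ceph-bluestore-tool --help first
and try it on a non-production OSD):

   systemctl stop ceph-osd@3          # the OSD must be down
   # after growing the LV/partition that backs block.db:
   ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-3/block.db -k size -v <new-size-in-bytes>
   ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-3
   systemctl start ceph-osd@3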

And I'm currently working on a solution within ceph-bluestore-tool to 
simplify both expansion and migration.



Thanks,

Igor


On 7/17/2018 5:02 PM, Nicolas Huillard wrote:

Le mardi 17 juillet 2018 à 16:20 +0300, Igor Fedotov a écrit :

Right, but procedure described in the blog can be pretty easily
adjusted
to do a resize.

Sure, but if I remember correctly, Ceph itself cannot use the increased
size: you'll end up with a larger device with unused additional space.
Using that space may be on the TODO, though, so this may not be a
complete waste of space...



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resize wal/db

2018-07-17 Thread Nicolas Huillard
Le mardi 17 juillet 2018 à 16:20 +0300, Igor Fedotov a écrit :
> Right, but procedure described in the blog can be pretty easily
> adjusted 
> to do a resize.

Sure, but if I remember correctly, Ceph itself cannot use the increased
size: you'll end up with a larger device with unused additional space.
Using that space may be on the TODO, though, so this may not be a
complete waste of space...

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz

Dear Linh,

another question, if I may:

How do you handle Bluestore WAL and DB, and
how much SSD space do you allocate for them?


Cheers,

Oliver


On 17.07.2018 08:55, Linh Vu wrote:

Hi Oliver,


We have several CephFS on EC pool deployments; one has been in production 
for a while, and the others are about to be, pending all the Bluestore+EC fixes in 
12.2.7.



Firstly as John and Greg have said, you don't need SSD cache pool at all.


Secondly, regarding k/m, it depends on how many hosts or racks you have, 
and how many failures you want to tolerate.



For our smallest pool with only 8 hosts in 4 different racks and 2 
different pairs of switches (note: we consider switch failure more 
common than rack cooling or power failure), we're using 4/2 with failure 
domain = host. We currently use this for SSD scratch storage for HPC.



For one of our larger pools, with 24 hosts over 6 different racks and 6 
different pairs of switches, we're using 4:2 with failure domain = rack.



For another pool with similar host count but not spread over so many 
pairs of switches, we're using 6:3 and failure domain = host.



Also keep in mind that a higher value of k/m may give you more 
throughput but increase latency especially for small files, so it also 
depends on how important performance is and what kind of file size you 
store on your CephFS.
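
For completeness, a sketch of what creating such a pool and attaching it to
CephFS looks like on Luminous (the profile name, PG counts and filesystem name
are made up; allow_ec_overwrites requires BlueStore OSDs):

   ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack
   ceph osd pool create cephfs_data_ec 1024 1024 erasure ec42
   ceph osd pool set cephfs_data_ec allow_ec_overwrites true
   ceph fs add_data_pool cephfs cephfs_data_ec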



Cheers,

Linh


*From:* ceph-users  on behalf of 
Oliver Schulz 

*Sent:* Sunday, 15 July 2018 9:46:16 PM
*To:* ceph-users
*Subject:* [ceph-users] CephFS with erasure coding, do I need a cache-pool?
Dear all,

we're planning a new Ceph cluster, with CephFS as the
main workload, and would like to use erasure coding to
use the disks more efficiently. Access pattern will
probably be more read- than write-heavy, on average.

I don't have any practical experience with erasure-
coded pools so far.

I'd be glad for any hints / recommendations regarding
these questions:

* Is an SSD cache pool recommended/necessary for
CephFS on an erasure-coded HDD pool (using Ceph
Luminous and BlueStore)?

* What are good values for k/m for erasure coding in
practice (assuming a cluster of about 300 OSDs), to
make things robust and ease maintenance (ability to
take a few nodes down)? Is k/m = 6/3 a good choice?

* Will it be sufficient to have k+m racks, resp. failure
domains?


Cheers and thanks for any advice,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz

Thanks a lot, Linh!

On 17.07.2018 08:55, Linh Vu wrote:

Hi Oliver,


We have several CephFS on EC pool deployments; one has been in production 
for a while, and the others are about to be, pending all the Bluestore+EC fixes in 
12.2.7.



Firstly as John and Greg have said, you don't need SSD cache pool at all.


Secondly, regarding k/m, it depends on how many hosts or racks you have, 
and how many failures you want to tolerate.



For our smallest pool with only 8 hosts in 4 different racks and 2 
different pairs of switches (note: we consider switch failure more 
common than rack cooling or power failure), we're using 4/2 with failure 
domain = host. We currently use this for SSD scratch storage for HPC.



For one of our larger pools, with 24 hosts over 6 different racks and 6 
different pairs of switches, we're using 4:2 with failure domain = rack.



For another pool with similar host count but not spread over so many 
pairs of switches, we're using 6:3 and failure domain = host.



Also keep in mind that a higher value of k/m may give you more 
throughput but increase latency especially for small files, so it also 
depends on how important performance is and what kind of file size you 
store on your CephFS.



Cheers,

Linh


*From:* ceph-users  on behalf of 
Oliver Schulz 

*Sent:* Sunday, 15 July 2018 9:46:16 PM
*To:* ceph-users
*Subject:* [ceph-users] CephFS with erasure coding, do I need a cache-pool?
Dear all,

we're planning a new Ceph cluster, with CephFS as the
main workload, and would like to use erasure coding to
use the disks more efficiently. Access pattern will
probably be more read- than write-heavy, on average.

I don't have any practical experience with erasure-
coded pools so far.

I'd be glad for any hints / recommendations regarding
these questions:

* Is an SSD cache pool recommended/necessary for
CephFS on an erasure-coded HDD pool (using Ceph
Luminous and BlueStore)?

* What are good values for k/m for erasure coding in
practice (assuming a cluster of about 300 OSDs), to
make things robust and ease maintenance (ability to
take a few nodes down)? Is k/m = 6/3 a good choice?

* Will it be sufficient to have k+m racks, resp. failure
domains?


Cheers and thanks for any advice,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Oliver Schulz

Hi Greg,

On 17.07.2018 03:01, Gregory Farnum wrote:

Since Luminous, you can use an erasure coded pool (on bluestore)
directly as a CephFS data pool, no cache pool needed.
More than that, we'd really prefer you didn't use cache pools for 
anything. Just Say No. :)


Thanks for the confirmation - I'll happily go
without a cache pool, then. :-)



 > * Will it be sufficient to have k+m racks, resp. failure
 >    domains?


Generally, if you want CRUSH to select X "buckets" at any level, it's 
good to have at least X+1 choices for it to prevent mapping failures. 


So for k/m = 6/3, it would make sense to have 10 racks,
and to deploy OSD nodes in multiples of 10, accordingly?


But you could also do workaround like letting it choose (K+M)/2 racks 
and putting two shards in each rack.


I probably have this wrong - wouldn't it reduce durability
to put two shards in one failure domain?


Thanks for the advice!

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tcmalloc performance still relevant?

2018-07-17 Thread Uwe Sauter
I asked a similar question about 2 weeks ago, subject "jemalloc / Bluestore". 
Have a look at the archives.


Regards,

Uwe

Am 17.07.2018 um 15:27 schrieb Robert Stanford:
> Looking here: 
> https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/
> 
>  I see that it was a good idea to change to JEMalloc.  Is this still the 
> case, with up to date Linux and current Ceph?
> 
>  
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Read/write statistics per RBD image

2018-07-17 Thread Jason Dillaman
Yes, you just need to enable the "admin socket" in your ceph.conf and then
use "ceph --admin-daemon /path/to/image/admin/socket.asok perf dump".

On Tue, Jul 17, 2018 at 8:53 AM Mateusz Skala (UST, POL) <
mateusz.sk...@ust-global.com> wrote:

> Hi,
>
> Is it possible to get statistics of issued reads/writes for a specific RBD
> image? Ideally something like /proc/diskstats in Linux.
>
> Regards
>
> Mateusz
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] checking rbd volumes modification times

2018-07-17 Thread Jason Dillaman
That's not possible right now, but work is in-progress to add that to a
future release of Ceph [1].

[1] https://github.com/ceph/ceph/pull/21114

On Mon, Jul 16, 2018 at 7:12 PM Andrei Mikhailovsky 
wrote:

> Dear cephers,
>
> Could someone tell me how to check RBD volume modification times in a
> Ceph pool? I am currently in the process of trimming our Ceph pool and
> would like to start with volumes which have not been modified for a long time.
> How do I get that information?
>
> Cheers
>
> Andrei
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tcmalloc performance still relevant?

2018-07-17 Thread Robert Stanford
Looking here:
https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/

 I see that it was a good idea to change to JEMalloc.  Is this still the
case, with up to date Linux and current Ceph?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resize wal/db

2018-07-17 Thread Igor Fedotov
Right, but the procedure described in the blog can be adjusted fairly easily 
to do a resize.


Thanks,

Igor


On 7/17/2018 11:10 AM, Eugen Block wrote:

Hi,

There is no way to resize the DB while the OSD is running. There is a bit 
shorter "unofficial" but risky way than redeploying the OSD, though. But 
you'll need to take the specific OSD out for a while in any case. You will 
also need either additional free partition(s), or the initial deployment 
has to have been done using LVM.

See this blog for more details. 
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/


just for clarification: we did NOT resize the block.db in the 
described procedure! We used the exact same size for the new LVM-based 
block.db as before. This is also mentioned in the article.


Regards,
Eugen


Zitat von Igor Fedotov :


Hi Zhang,

There is no way to resize the DB while the OSD is running. There is a bit 
shorter "unofficial" but risky way than redeploying the OSD, though. But 
you'll need to take the specific OSD out for a while in any case. You will 
also need either additional free partition(s), or the initial deployment 
has to have been done using LVM.

See this blog for more details. 
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/



And I advise to try such things at non-production cluster first.


Thanks,

Igor


On 7/12/2018 7:03 AM, Shunde Zhang wrote:

Hi Ceph Gurus,

I have installed Ceph Luminous with Bluestore using ceph-ansible.
However, when I did the install, I didn’t set the wal/db size, so it ended 
up using the default values, which are quite small: a 1G db and a 576MB wal.
Note that each OSD node has 12 OSDs and each OSD has a 1.8T spinning 
disk for data. All 12 OSDs share one NVMe M.2 SSD for wal/db.
Now the cluster is in use, and after doing some research I want to increase 
the size of db/wal: I want to use a 20G db and a 1G wal. (Are 
they reasonable numbers?)
I can delete one OSD and then re-create it with ceph-ansible but 
that is troublesome.
I wonder if there is a (simple) way to increase the size of both db 
and wal when an OSD is running?
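
(For reference, these sizes are only honoured when an OSD is created; in
ceph.conf they would look like this, values in bytes taken from the numbers
above:)

[osd]
bluestore_block_db_size = 21474836480    # 20 GiB
bluestore_block_wal_size = 1073741824    # 1 GiB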


Thanks in advance,
Shunde.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread John Spray
On Tue, Jul 17, 2018 at 8:26 AM Surya Bala  wrote:
>
> Hi folks,
>
> We have a production cluster with 8 nodes and each node has 60 disks of size
> 6TB each. We are using CephFS and the FUSE client with a global mount point. We are
> doing rsync from our old server to this cluster; rsync is slow compared to a
> normal server.
>
> When we do 'ls' inside some folder which has a very large number of files, like
> one or two lakh (100,000-200,000), the response is too slow.

The first thing to check is what kind of "ls" you're doing.  Some
systems colorize ls by default, and that involves statting every file
in addition to listing the directory.  Try with "ls --color=never".
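
(A quick, hypothetical way to see the difference on the client:)

time ls --color=never     # plain readdir, no per-file stat
time ls --color=always    # stats every entry to pick a colour
strace -c ls              # summarises the stat()/lstat() calls ls makes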

It also helps to be more specific about what "too slow" means.  How
many seconds, and how many files?

John

>
> Any suggestions please
>
> Regards
> Surya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Read/write statistics per RBD image

2018-07-17 Thread Mateusz Skala (UST, POL)
Hi,

Is it possible to get statistics of the reads/writes issued to a specific RBD image? 
Statistics like those in /proc/diskstats on Linux would be best.

Regards

Mateusz

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Brenno Augusto Falavinha Martinez
Hi,
could you share your MDS hardware specifications? 


Att., 
Brenno Martinez 
SUPCD / CDENA / CDNIA 
(41) 3593-8423

- Mensagem original -
De: "Daniel Baumann" 
Para: "Ceph Users" 
Enviadas: Terça-feira, 17 de julho de 2018 6:45:25
Assunto: Re: [ceph-users] ls operation is too slow in cephfs

On 07/17/2018 11:43 AM, Marc Roos wrote:
> I had similar thing with doing the ls. Increasing the cache limit helped 
> with our test cluster

same here; additionally we also had to use more than one MDS to get good
performance (currently 3 MDS plus 2 stand-by per FS).

Regards,
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-


"Esta mensagem do SERVIÇO FEDERAL DE PROCESSAMENTO DE DADOS (SERPRO), empresa 
pública federal regida pelo disposto na Lei Federal nº 5.615, é enviada 
exclusivamente a seu destinatário e pode conter informações confidenciais, 
protegidas por sigilo profissional. Sua utilização desautorizada é ilegal e 
sujeita o infrator às penas da lei. Se você a recebeu indevidamente, queira, 
por gentileza, reenviá-la ao emitente, esclarecendo o equívoco."

"This message from SERVIÇO FEDERAL DE PROCESSAMENTO DE DADOS (SERPRO) -- a 
government company established under Brazilian law (5.615/70) -- is directed 
exclusively to its addressee and may contain confidential data, protected under 
professional secrecy rules. Its unauthorized use is illegal and may subject the 
transgressor to the law's penalties. If you're not the addressee, please send 
it back, elucidating the failure."
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why the change from ceph-disk to ceph-volume and lvm? (and just not stick with direct disk access)

2018-07-17 Thread Marc Roos
 
I still wanted to thank you for the nicely detailed arguments regarding 
this, it is much appreciated. It really gives me the broader perspective 
I was lacking. 



-Original Message-
From: Warren Wang [mailto:warren.w...@walmart.com] 
Sent: maandag 11 juni 2018 17:30
To: Konstantin Shalygin; ceph-users@lists.ceph.com; Marc Roos
Subject: Re: [ceph-users] Why the change from ceph-disk to ceph-volume 
and lvm? (and just not stick with direct disk access)

I'll chime in as a large scale operator, and a strong proponent of 
ceph-volume.
Ceph-disk wasn't accomplishing what was needed for anything other than 
vanilla use cases (even then, still kind of broken). I'm not going to 
re-hash Sage's valid points too much, but trying to manipulate the old 
ceph-disk to work with your own LVM (or other block manager) was painful. As far as 
the pain of doing something new goes, yes, sometimes moving to newer 
more flexible methods results in a large amount of work. Trust me, I 
feel that pain when we're talking about things like ceph-volume, 
bluestore, etc, but these changes are not made without reason.

As far as LVM performance goes, I think that's well understood in the 
larger Linux community. We accept that minimal overhead to accomplish 
some of the setups that we're interested in, such as encrypted, 
lvm-cached OSDs. The above is not a trivial thing to do using ceph-disk. 
We know, we run that in production, at large scale. It's plagued with 
problems, and since it's done without Ceph itself, it is difficult to 
tie the two together. Having it managed directly by Ceph, via 
ceph-volume makes much more sense. 
We're not alone in this, so I know it will benefit others as well, at 
the cost of technical expertise.

There are maintainers now for ceph-volume, so if there's something you 
don't like, I suggest proposing a change. 

Warren Wang

On 6/8/18, 11:05 AM, "ceph-users on behalf of Konstantin Shalygin" 
 wrote:

> - ceph-disk was replaced for two reasons: (1) Its design was
> centered around udev, and it was terrible.  We have been plagued 
for years
> with bugs due to race conditions in the udev-driven activation of 
OSDs,
> mostly variations of "I rebooted and not all of my OSDs started."  
It's
> horrible to observe and horrible to debug. (2) It was based on GPT
> partitions, lots of people had block layer tools they wanted to 
use
> that were LVM-based, and the two didn't mix (no GPT partitions on 
top of
> LVs).
>
> - We designed ceph-volume to be *modular* because we anticipate that 
there
> are going to be lots of ways that people provision the hardware 
devices
> that we need to consider.  There are already two: legacy ceph-disk 
devices
> that are still in use and have GPT partitions (handled by 
'simple'), and
> lvm.  SPDK devices where we manage NVMe devices directly from 
userspace
> are on the immediate horizon--obviously LVM won't work there since 
the
> kernel isn't involved at all.  We can add any other schemes we 
like.
>
> - If you don't like LVM (e.g., because you find that there is a 
measurable
> overhead), let's design a new approach!  I wouldn't bother unless 
you can
> actually measure an impact.  But if you can demonstrate a 
measurable cost,
> let's do it.
>
> - LVM was chosen as the default approach for new devices for a few
> reasons:
>    - It allows you to attach arbitrary metadata to each device, 
like which
> cluster uuid it belongs to, which osd uuid it belongs to, which 
type of
> device it is (primary, db, wal, journal), any secrets needed to 
fetch its
> decryption key from a keyserver (the mon by default), and so on.
>- One of the goals was to enable lvm-based block layer modules 
beneath
> OSDs (dm-cache).  All of the other devicemapper-based tools we are
> aware of work with LVM.  It was a hammer that hit all nails.
>
> - The 'simple' mode is the current 'out' that avoids using LVM if 
it's not
> an option for you.  We only implemented scan and activate because 
that was
> all that we saw a current need for.  It should be quite easy to 
add the
> ability to create new OSDs.
>
> I would caution you, though, that simple relies on a file in 
/etc/ceph
> that has the metadata about the devices.  If you lose that file 
you need
> to have some way to rebuild it or we won't know what to do with 
your
> devices.  That means you should make the devices self-describing 
in some
> way... not, say, a raw device with dm-crypt layered directly on 
top, or
> some other option that makes it impossible to tell what it is.  As 
long as
> you can implement 'scan' and get any other info you need (e.g., 
whatever
> is necessary to fetch decryption keys) then great.
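
(For context, the two provisioning paths mentioned above look roughly like
this; device names and the OSD id/fsid are placeholders:)

ceph-volume simple scan /dev/sdb1           # writes /etc/ceph/osd/<id>-<fsid>.json
ceph-volume simple activate 0 <osd-fsid>    # id and fsid come from that json
ceph-volume lvm create --bluestore --data /dev/sdc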


Thanks, I got what I wanted. It was in this form that it was 
necessary 
to submit deprecations to the community: "why do we do 

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-17 Thread Jake Grimmett
Hi Oliver,

We put CephFS directly on an 8+2 EC pool (10 nodes, 450 OSDs), but
put metadata on a replicated pool using NVMe drives (1 per node, 5 nodes).

We get great performance with large files, but as Linh indicated, IOPS
with small files could be better.

I did consider adding a replicated SSD tier to improve IOPS, but having
seen very inconsistent performance on a Kraken test cluster that used
tiering, I decided that it might not give a worthwhile speed-up, and the
added complexity could make the system fragile.

I'd be interested to hear more from Greg about why cache pools are best
avoided...

best regards,

Jake

On 17/07/18 01:55, Linh Vu wrote:
> Hi Oliver,
> 
> 
> We have several CephFS on EC pool deployments, one been in production
> for a while, the others about to pending all the Bluestore+EC fixes in
> 12.2.7 
> 
> 
> Firstly as John and Greg have said, you don't need SSD cache pool at all. 
> 
> 
> Secondly, regarding k/m, it depends on how many hosts or racks you have,
> and how many failures you want to tolerate. 
> 
> 
> For our smallest pool with only 8 hosts in 4 different racks and 2
> different pairs of switches (note: we consider switch failure more
> common than rack cooling or power failure), we're using 4/2 with failure
> domain = host. We currently use this for SSD scratch storage for HPC.
> 
> 
> For one of our larger pools, with 24 hosts over 6 different racks and 6
> different pairs of switches, we're using 4:2 with failure domain = rack. 
> 
> 
> For another pool with similar host count but not spread over so many
> pairs of switches, we're using 6:3 and failure domain = host.
> 
> 
> Also keep in mind that a higher value of k/m may give you more
> throughput but increase latency especially for small files, so it also
> depends on how important performance is and what kind of file size you
> store on your CephFS. 
> 
> 
> Cheers,
> 
> Linh
> 
> 
> *From:* ceph-users  on behalf of
> Oliver Schulz 
> *Sent:* Sunday, 15 July 2018 9:46:16 PM
> *To:* ceph-users
> *Subject:* [ceph-users] CephFS with erasure coding, do I need a cache-pool?
>  
> Dear all,
> 
> we're planning a new Ceph-Clusterm, with CephFS as the
> main workload, and would like to use erasure coding to
> use the disks more efficiently. Access pattern will
> probably be more read- than write-heavy, on average.
> 
> I don't have any practical experience with erasure-
> coded pools so far.
> 
> I'd be glad for any hints / recommendations regarding
> these questions:
> 
> * Is an SSD cache pool recommended/necessary for
> CephFS on an erasure-coded HDD pool (using Ceph
> Luminous and BlueStore)?
> 
> * What are good values for k/m for erasure coding in
> practice (assuming a cluster of about 300 OSDs), to
> make things robust and ease maintenance (ability to
> take a few nodes down)? Is k/m = 6/3 a good choice?
> 
> * Will it be sufficient to have k+m racks, resp. failure
> domains?
> 
> 
> Cheers and thanks for any advice,
> 
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow requests during OSD maintenance

2018-07-17 Thread sinan
Hi,

On one of our OSD nodes I performed a "yum update" with Ceph repositories
disabled. So only the OS packages were being updated.

During and namely at the end of the yum update, the cluster started to
have slow/blocked requests and all VM's with Ceph storage backend had high
I/O load. After ~15 minutes the cluster health was OK.

The load on the OSD was not high, plenty of available memory + CPU.

1. How is it possible that a yum update on 1 node causes slow requests?
2. What is the best way to remove an OSD node from the cluster during
maintenance? ceph osd set noout is not the way to go, since no OSD's are
out during yum update and the node is still part of the cluster and will
handle I/O.
I think the best way is the combination of "ceph osd set noout" + stopping
the OSD services so the OSD node does not have any traffic anymore.

Any thoughts on this?

Thanks!
Sinan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Surya Bala
Previously we had multiple active MDSes, but at that time we got slow/stuck
requests when multiple clients were accessing the cluster, so we decided to have
a single active MDS with all the others on standby.

When we hit this issue, MDS trimming was going on. When we checked the last
ops:

{
    "ops": [
        {
            "description": "client_request(client.8784398:69290 readdir #0x100cf10 2018-06-22 21:16:35.303754 caller_uid=0, caller_gid=0{0,})",
            "initiated_at": "2018-06-22 21:16:35.319622",
            "age": 1982.691792,
            "duration": 1982.691821,
            "type_data": {
                "flag_point": "failed to authpin local pins",
                "reqid": "client.8784398:69290",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.8784398",
                    "tid": 69290
                },
                "events": [
                    {
                        "time": "2018-06-22 21:16:35.319622",
                        "event": "initiated"
                    },
                    {
                        "time": "2018-06-22 21:16:35.319998",
                        "event": "failed to authpin local pins"
                    }
                ]
            }
        },
All the requests coming to the server hung and a deadlock situation
occurred. We restarted the MDS that hung, and then everything became normal. We were not
able to find the reason, but we saw some posts saying that multi-active MDS is
still not stable, so we changed it to a single active MDS.

Regards
Surya

On Tue, Jul 17, 2018 at 3:15 PM, Daniel Baumann 
wrote:

> On 07/17/2018 11:43 AM, Marc Roos wrote:
> > I had similar thing with doing the ls. Increasing the cache limit helped
> > with our test cluster
>
> same here; additionally we also had to use more than one MDS to get good
> performance (currently 3 MDS plus 2 stand-by per FS).
>
> Regards,
> Daniel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Daniel Baumann
On 07/17/2018 11:43 AM, Marc Roos wrote:
> I had similar thing with doing the ls. Increasing the cache limit helped 
> with our test cluster

same here; additionally we also had to use more than one MDS to get good
performance (currently 3 MDS plus 2 stand-by per FS).
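
(For reference, on Luminous multiple active MDS daemons are enabled per
filesystem roughly like this; "cephfs" is a placeholder name:)

ceph fs set cephfs allow_multimds true
ceph fs set cephfs max_mds 3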

Regards,
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Marc Roos
 
I had similar thing with doing the ls. Increasing the cache limit helped 
with our test cluster

mds_cache_memory_limit = 80
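
(The limit can also be changed at runtime; the 8 GiB value below is purely
illustrative, not the value used above:)

ceph tell mds.* injectargs '--mds_cache_memory_limit=8589934592'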





-Original Message-
From: Surya Bala [mailto:sooriya.ba...@gmail.com] 
Sent: dinsdag 17 juli 2018 11:39
To: Anton Aleksandrov
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ls operation is too slow in cephfs

Thanks for the reply anton. 


CPU core count - 40
RAM - 250GB 

We have a single active MDS, Ceph version Luminous 12.2.4. The default PG 
number is 64 and we are not changing the PG count while creating pools. We 
have 8 servers in total, each with 60 OSDs of 6TB size.
The 8 servers are split into 2 per region. The CRUSH map is designed to use 2 
servers for each pool.

Regards
Surya Balan


On Tue, Jul 17, 2018 at 1:48 PM, Anton Aleksandrov 
 wrote:


You need to give us more details about your OSD setup and hardware 
specification of nodes (CPU core count, RAM amount)



On 2018.07.17. 10:25, Surya Bala wrote:


Hi folks, 

We have a production cluster with 8 nodes and each node has 60 
disks of size 6TB each. We are using CephFS and the FUSE client with a global 
mount point. We are doing rsync from our old server to this cluster; 
rsync is slow compared to a normal server.

When we do 'ls' inside some folder which has a very large number 
of files, like one or two lakh (100,000-200,000), the response is too slow. 

Any suggestions please

Regards
Surya

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
 





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Surya Bala
Thanks for the reply anton.


CPU core count - 40
RAM - 250GB

We have a single active MDS, Ceph version Luminous 12.2.4.
The default PG number is 64 and we are not changing the PG count while creating pools.
We have 8 servers in total, each with 60 OSDs of 6TB size.
The 8 servers are split into 2 per region. The CRUSH map is designed to use 2
servers for each pool.

Regards
Surya Balan


On Tue, Jul 17, 2018 at 1:48 PM, Anton Aleksandrov 
wrote:

> You need to give us more details about your OSD setup and hardware
> specification of nodes (CPU core count, RAM amount)
>
> On 2018.07.17. 10:25, Surya Bala wrote:
>
> Hi folks,
>
> We have a production cluster with 8 nodes and each node has 60 disks of size
> 6TB each. We are using CephFS and the FUSE client with a global mount point. We
> are doing rsync from our old server to this cluster; rsync is slow compared
> to a normal server.
>
> When we do 'ls' inside some folder which has a very large number of files,
> like one or two lakh (100,000-200,000), the response is too slow.
>
> Any suggestions please
>
> Regards
> Surya
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon scrub errors

2018-07-17 Thread Elias Abacioglu
Hi,

Sorry for bumping an old thread, but I have the same issue.

Is it the mon.0 that I should reset?

2018-07-02 11:39:51.223731 [ERR]  mon.2 ScrubResult(keys
{monmap=48,osd_metadata=52} crc {monmap=3659766006,osd_metadata=1928984927})
2018-07-02 11:39:51.223701 [ERR]  mon.0 ScrubResult(keys
{monmap=45,osd_metadata=55} crc {monmap=1049715966,osd_metadata=409731371})
2018-07-02 11:39:51.223666 [ERR]  scrub mismatch
2018-07-02 11:39:51.223632 [ERR]  mon.1 ScrubResult(keys
{monmap=48,osd_metadata=52} crc {monmap=672606225,osd_metadata=1928984927})
2018-07-02 11:39:51.223598 [ERR]  mon.0 ScrubResult(keys
{monmap=45,osd_metadata=55} crc {monmap=1049715966,osd_metadata=409731371})
2018-07-02 11:39:51.223551 [ERR]  scrub mismatch
2018-07-02 11:39:51.211198 [ERR]  mon.2 ScrubResult(keys
{logm=33,mds_health=10,mds_metadata=1,mdsmap=56} crc
{logm=3564099269,mds_health=692648600,mds_metadata=2226414273,mdsmap=643765345})
2018-07-02 11:39:51.211148 [ERR]  mon.0 ScrubResult(keys
{logm=33,mds_health=6,mds_metadata=1,mdsmap=60} crc
{logm=3564099269,mds_health=4085298890,mds_metadata=2226414273,mdsmap=758287940})
2018-07-02 11:39:51.211100 [ERR]  scrub mismatch
2018-07-02 11:39:51.211052 [ERR]  mon.1 ScrubResult(keys
{logm=33,mds_health=9,mds_metadata=1,mdsmap=57} crc
{logm=3564099269,mds_health=4264808124,mds_metadata=2226414273,mdsmap=704105513})
2018-07-02 11:39:51.211000 [ERR]  mon.0 ScrubResult(keys
{logm=33,mds_health=6,mds_metadata=1,mdsmap=60} crc
{logm=3564099269,mds_health=4085298890,mds_metadata=2226414273,mdsmap=758287940})
2018-07-02 11:39:51.210944 [ERR]  scrub mismatch

2018-07-03 12:42:21.674471 [ERR]  mon.2 ScrubResult(keys
{monmap=24,osd_metadata=76} crc {monmap=578360729,osd_metadata=2641573038})
2018-07-03 12:42:21.674447 [ERR]  mon.0 ScrubResult(keys
{monmap=24,osd_metadata=76} crc {monmap=962305203,osd_metadata=2641573038})
2018-07-03 12:42:21.674422 [ERR]  scrub mismatch
2018-07-03 12:42:21.674399 [ERR]  mon.1 ScrubResult(keys
{monmap=24,osd_metadata=76} crc {monmap=3891180386,osd_metadata=2641573038})
2018-07-03 12:42:21.674359 [ERR]  mon.0 ScrubResult(keys
{monmap=24,osd_metadata=76} crc {monmap=962305203,osd_metadata=2641573038})
2018-07-03 12:42:21.674319 [ERR]  scrub mismatch
2018-07-03 12:42:21.664602 [ERR]  mon.2 ScrubResult(keys
{mgrstat=73,monmap=27} crc {mgrstat=199345228,monmap=2433199488})
2018-07-03 12:42:21.664523 [ERR]  mon.0 ScrubResult(keys
{mgrstat=73,monmap=27} crc {mgrstat=199345228,monmap=880923984})

2018-07-16 12:42:32.413856 [ERR]  mon.2 ScrubResult(keys
{monmap=17,osd_metadata=83} crc {monmap=3927456148,osd_metadata=4212506330})
2018-07-16 12:42:32.413828 [ERR]  mon.0 ScrubResult(keys
{monmap=17,osd_metadata=83} crc {monmap=4047155390,osd_metadata=4212506330})
2018-07-16 12:42:32.413800 [ERR]  scrub mismatch
2018-07-16 12:42:32.413769 [ERR]  mon.1 ScrubResult(keys
{monmap=17,osd_metadata=83} crc {monmap=797941615,osd_metadata=4212506330})
2018-07-16 12:42:32.413740 [ERR]  mon.0 ScrubResult(keys
{monmap=17,osd_metadata=83} crc {monmap=4047155390,osd_metadata=4212506330})
2018-07-16 12:42:32.413689 [ERR]  scrub mismatch
2018-07-16 12:42:32.412795 [ERR]  mon.2 ScrubResult(keys
{mgrstat=66,monmap=34} crc {mgrstat=1700428385,monmap=4227348033})
2018-07-16 12:42:32.412768 [ERR]  mon.0 ScrubResult(keys
{mgrstat=66,monmap=34} crc {mgrstat=1700428385,monmap=3150674595})
2018-07-16 12:42:32.412741 [ERR]  scrub mismatch


Thanks,
Elias

On Thu, Apr 5, 2018 at 4:40 PM, kefu chai  wrote:

> On Thu, Apr 5, 2018 at 3:10 PM, Rickard Nilsson
>  wrote:
> > Hi,
> >
> > Im' having a cluster with three moitors, two mds and nine osd. Lately
> I've
> > been getting scrub errors from the monitors;
> >
> > 2018-04-05 07:26:52.147185 [ERR]  mon.2 ScrubResult(keys
> > {osd_pg_creating=1,osdmap=99} crc
> > {osd_pg_creating=1404726104,osdmap=3323124730})
> > 2018-04-05 07:26:52.147167 [ERR]  mon.0 ScrubResult(keys
> > {osd_metadata=5,osd_pg_creating=1,osdmap=94} crc
> > {osd_metadata=477302505,osd_pg_creating=1404726104,osdmap=2387598890})
> > 2018-04-05 07:26:52.147139 [ERR]  scrub mismatch
> > 2018-04-05 07:26:52.144378 [ERR]  mon.2 ScrubResult(keys
> > {mgrstat=92,monmap=5,osd_pg_creating=1,osdmap=2} crc
> > {mgrstat=2630742218,monmap=4118007020,osd_pg_creating=
> 1404726104,osdmap=940126788})
> > 2018-04-05 07:26:52.144360 [ERR]  mon.0 ScrubResult(keys
> > {mgrstat=92,monmap=5,osd_metadata=3} crc
> > {mgrstat=2630742218,monmap=4118007020,osd_metadata=3256871745})
> > 2018-04-05 07:26:52.144334 [ERR]  scrub mismatch
> > 2018-04-05 07:26:52.140213 [ERR]  mon.2 ScrubResult(keys
> {mgr=67,mgrstat=33}
> > crc {mgr=1823433831,mgrstat=205032})
> > 2018-04-05 07:26:52.140193 [ERR]  mon.0 ScrubResult(keys
> > {mgr=67,mgr_command_descs=1,mgr_metadata=2,mgrstat=30} crc
> > {mgr=1823433831,mgr_command_descs=2758154725,mgr_metadata=
> 2776211204,mgrstat=779157107})
> > 2018-04-05 07:26:52.140165 [ERR]  scrub mismatch
> > 2018-04-05 07:26:52.120025 [ERR]  mon.2 ScrubResult(keys
> {logm=23,mdsmap=77}
> > 

Re: [ceph-users] Delete pool nicely

2018-07-17 Thread Simon Ironside


On 22/05/18 18:28, David Turner wrote:
 From my experience, that would cause you some troubles as it would 
throw the entire pool into the deletion queue to be processed as it 
cleans up the disks and everything.  I would suggest using a pool 
listing from `rados -p .rgw.buckets ls` and iterating on that using some 
scripts around the `rados -p .rgw.buckets rm <object>` command that 
you could stop, restart at a faster pace, slow down, etc.  Once the 
objects in the pool are gone, you can delete the empty pool without any 
problems.  I like this option because it makes it simple to stop it if 
you're impacting your VM traffic.


Just to finish the story here; thanks again for the advice - it worked well.

Generating the list of objects took around 6 hours but didn't cause any 
issues doing so. I had a sleep 0.1 between each rm iteration. Probably a 
bit on the conservative side but didn't cause me any problems either and 
was making acceptable progress so I didn't change it.
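
For anyone following along, a minimal sketch of that kind of loop (the object
list filename is hypothetical; quoting "$obj" also copes with odd characters
in object names):

rados -p .rgw.buckets ls > objects.txt
while IFS= read -r obj; do
    rados -p .rgw.buckets rm "$obj"
    sleep 0.1
done < objects.txt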


Three weeks later the pool was more or less empty (I skipped ~600 
objects/400KiB with $ characters in their names that I couldn't be bothered 
handling automatically), so I deleted the pools. I did get some slow 
request warnings immediately after deleting the pools, but they went away 
in a minute or so.


Thanks,
Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Anton Aleksandrov
You need to give us more details about your OSD setup and hardware 
specification of nodes (CPU core count, RAM amount)



On 2018.07.17. 10:25, Surya Bala wrote:

Hi folks,

We have a production cluster with 8 nodes and each node has 60 disks of 
size 6TB each. We are using CephFS and the FUSE client with a global mount 
point. We are doing rsync from our old server to this cluster; rsync is 
slow compared to a normal server.


When we do 'ls' inside some folder which has a very large number of 
files, like one or two lakh (100,000-200,000), the response is too slow.


Any suggestions please

Regards
Surya


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resize wal/db

2018-07-17 Thread Eugen Block

Hi,

There is no way to resize the DB while the OSD is running. There is a bit  
shorter "unofficial" but risky way than redeploying the OSD, though. But  
you'll need to take the specific OSD out for a while in any case. You  
will also need either additional free partition(s), or the initial  
deployment has to have been done using LVM.

See this blog for more details.  
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/


just for clarification: we did NOT resize the block.db in the  
described procedure! We used the exact same size for the new LVM-based  
block.db as before. This is also mentioned in the article.


Regards,
Eugen


Zitat von Igor Fedotov :


Hi Zhang,

There is no way to resize the DB while the OSD is running. There is a bit  
shorter "unofficial" but risky way than redeploying the OSD, though. But  
you'll need to take the specific OSD out for a while in any case. You  
will also need either additional free partition(s), or the initial  
deployment has to have been done using LVM.

See this blog for more details.  
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/



And I advise to try such things at non-production cluster first.


Thanks,

Igor


On 7/12/2018 7:03 AM, Shunde Zhang wrote:

Hi Ceph Gurus,

I have installed Ceph Luminous with Bluestore using ceph-ansible.
However, when I did the install, I didn’t set the wal/db size, so it  
ended up using the default values, which are quite small: a 1G db  
and a 576MB wal.
Note that each OSD node has 12 OSDs and each OSD has a 1.8T  
spinning disk for data. All 12 OSDs share one NVMe M.2 SSD for wal/db.
Now the cluster is in use, and after doing some research I want to  
increase the size of db/wal: I want to use a 20G db and a 1G wal. (Are  
they reasonable numbers?)
I can delete one OSD and then re-create it with ceph-ansible but  
that is troublesome.
I wonder if there is a (simple) way to increase the size of both db  
and wal when an OSD is running?


Thanks in advance,
Shunde.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ls operation is too slow in cephfs

2018-07-17 Thread Surya Bala
Hi folks,

We have a production cluster with 8 nodes and each node has 60 disks of size
6TB each. We are using CephFS and the FUSE client with a global mount point. We
are doing rsync from our old server to this cluster; rsync is slow compared
to a normal server.

When we do 'ls' inside some folder which has a very large number of files,
like one or two lakh (100,000-200,000), the response is too slow.

Any suggestions please

Regards
Surya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is Ceph the right tool for storing lots of small files?

2018-07-17 Thread Christian Wimmer
Hi all,

I am trying to use Ceph with RGW to store lots (>300M) of small files (80%
2-15kB, 20% up to 500kB).
After some testing, I wonder if Ceph is the right tool for that.

Does anybody of you have experience with this use case?

Things I came across:
- EC pools: default stripe-width is 4kB. Does it make sense to lower the
stripe width for small objects or is EC a bad idea for this use case?
- Bluestore: bluestore min alloc size is 64kB by default. Would it be
better to lower it to, say, 2kB, or am I better off with Filestore (probably
not if I want to store a huge amount of small files)?
- Bluestore / RocksDB: RocksDB seems to consume a lot of disk space when
storing lots of files.
  For example: I have OSDs with about 500k onodes (which should translate
to 500k stored objects, right?) and the DB size is about 30GB. That's about
63kB per onode - which is a lot, considering the original object is about
5kB.
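
(One way to check DB consumption per OSD, assuming access to the OSD admin
socket; osd.0 is a placeholder:)

ceph daemon osd.0 perf dump | grep -E '"db_used_bytes"|"db_total_bytes"'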

Thanks,
Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com