[ceph-users] Is Ceph the right tool for storing lots of small files?

2018-07-16 Thread Christian Wimmer
Hi all,

I am trying to use Ceph with RGW to store lots (>300M) of small files (80%
2-15kB, 20% up to 500kB).
After some testing, I wonder if Ceph is the right tool for that.

Does anybody of you have experience with this use case?

Things I came across:
- EC pools: default stripe-width is 4kB. Does it make sense to lower the
stripe width for small objects or is EC a bad idea for this use case?
- Bluestore: bluestore min alloc size defaults to 64kB. Would it be
better to lower it to, say, 2kB, or am I better off with Filestore (probably
not if I want to store a huge amount of small files)?
- Bluestore / RocksDB: RocksDB seems to consume a lot of disk space when
storing lots of files.
  For example: I have OSDs with about 500k onodes (which should translate
to 500k stored objects, right?) and the DB size is about 30GB. That's about
63kB per onode - which is a lot, considering the original object is about
5kB.
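
For concreteness, this is roughly what I have been poking at so far (the OSD id, profile name and the 4kB values are just placeholders/assumptions on my side, so corrections welcome):

# how many onodes and how much DB space an OSD currently uses
ceph daemon osd.0 perf dump | python -m json.tool | grep -E '"bluestore_onodes"|"db_used_bytes"'

# an EC profile with a smaller stripe unit, for small objects
ceph osd erasure-code-profile set rgw-small-ec k=4 m=2 stripe_unit=4K crush-failure-domain=host

# ceph.conf fragment; as far as I understand this is baked in at mkfs time,
# so it only affects newly created OSDs
[osd]
bluestore_min_alloc_size_hdd = 4096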

Thanks,
Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] intermittent slow requests on idle ssd ceph clusters

2018-07-16 Thread Glen Baars
Hello Pavel,

I don't have all that much info (fairly new to Ceph) but we are facing a 
similar issue. If the cluster is fairly idle we get slow requests - if I'm 
backfilling a new node there are no slow requests. Same X540 network cards but 
ceph 12.2.5 and Ubuntu 16.04. 4.4.0 kernel. LACP with VLANs for ceph 
front/backend networks.

Not sure that it is the same issue but if you want me to do any tests - let me 
know.

Kind regards,
Glen Baars

-Original Message-
From: ceph-users  On Behalf Of Xavier Trilla
Sent: Tuesday, 17 July 2018 6:16 AM
To: Pavel Shub ; Ceph Users 
Subject: Re: [ceph-users] intermittent slow requests on idle ssd ceph clusters

Hi Pavel,

Any strange messages on dmesg, syslog, etc?

I would recommend profiling the kernel with perf and checking for the calls 
that are consuming more CPU.

We had several problems like the one you are describing, and for example one of 
them got fixed by increasing vm.min_free_kbytes to 4GB.

Also, how is the sys usage if you run top on the machines hosting the OSDs?

Kind regards,
Xavier Trilla P.
Clouding.io

A Cloud Server with SSDs, redundant
and available in less than 30 seconds?

Try it now at Clouding.io!

-Original Message-
From: ceph-users  On Behalf Of Pavel Shub 
Sent: Monday, 16 July 2018 23:52
To: Ceph Users 
Subject: [ceph-users] intermittent slow requests on idle ssd ceph clusters

Hello folks,

We've been having issues with slow requests cropping up on practically idle 
ceph clusters. From what I can tell the requests are hanging waiting for 
subops, and the OSD on the other end receives requests minutes later! Below it 
started waiting for subops at 12:09:51 and the subop was completed at 12:14:28.

{
"description": "osd_op(client.903117.0:569924 6.391 
6:89ed76f2:::%2fraster%2fv5%2fes%2f16%2f36320%2f24112:head [writefull 0~2072] 
snapc 0=[] ondisk+write+known_if_redirected e5777)",
"initiated_at": "2018-07-05 12:09:51.191419",
"age": 326.651167,
"duration": 276.977834,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.903117",
"client_addr": "10.20.31.234:0/1433094386",
"tid": 569924
},
"events": [
{
"time": "2018-07-05 12:09:51.191419",
"event": "initiated"
},
{
"time": "2018-07-05 12:09:51.191471",
"event": "queued_for_pg"
},
{
"time": "2018-07-05 12:09:51.191538",
"event": "reached_pg"
},
{
"time": "2018-07-05 12:09:51.191877",
"event": "started"
},
{
"time": "2018-07-05 12:09:51.192135",
"event": "waiting for subops from 11"
},
{
"time": "2018-07-05 12:09:51.192599",
"event": "op_commit"
},
{
"time": "2018-07-05 12:09:51.192616",
"event": "op_applied"
},
{
"time": "2018-07-05 12:14:28.169018",
"event": "sub_op_commit_rec from 11"
},
{
"time": "2018-07-05 12:14:28.169164",
"event": "commit_sent"
},
{
"time": "2018-07-05 12:14:28.169253",
"event": "done"
}
]
}
},

Below is what I assume is the corresponding request on osd.11; it seems to be 
receiving the network request ~4 minutes later.

2018-07-05 12:14:28.058552 7fb75ee0e700 20 osd.11 5777 share_map_peer
0x562b61bca000 already has epoch 5777
2018-07-05 12:14:28.167247 7fb75de0c700 10 osd.11 5777  new session
0x562cc23f0200 con=0x562baaa0e000 addr=10.16.15.28:6805/3218
2018-07-05 12:14:28.167282 7fb75de0c700 10 osd.11 5777  session
0x562cc23f0200 osd.20 has caps osdcap[grant(*)] 'allow *'
2018-07-05 12:14:28.167291 7fb75de0c700  0 -- 10.16.16.32:6817/3808 >>
10.16.15.28:6805/3218 conn(0x562baaa0e000 :6817 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept connect_seq 20 vs existing csq=19 existing_state=STATE_STANDBY
2018-07-05 12:14:28.167322 7fb7546d6700  2 osd.11 5777 ms_handle_reset con 
0x562baaa0e000 session 0x562cc23f0200
2018-07-05 12:14:28.167546 7fb75de0c700 10 osd.11 5777  session
0x562b62195c00 osd.20 has caps osdcap[grant(*)] 'allow *'

This is an all SSD cluster with minimal load. All hardware checks return good 
values. The cluster is currently running latest ceph mimic
(13.2.0) but we have also experienced this on other versions of luminous 12.2.2 
and 12.2.5.

I'm starting to think that this is a potential network driver issue.
We're currently running on kernel 4.14.15 and when we updated to latest 4.17 
the slow requests seem to occur more frequently. The network cards that we run 
are 10g intel X540.

Does anyone know how I can debug this further?

Thanks,
Pavel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 
disclosure or copying of this e-mail is prohibited

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-16 Thread Linh Vu
Hi Oliver,


We have several CephFS on EC pool deployments; one has been in production for a 
while, the others are about to be, pending all the Bluestore+EC fixes in 12.2.7 😊


Firstly, as John and Greg have said, you don't need an SSD cache pool at all.


Secondly, regarding k/m, it depends on how many hosts or racks you have, and 
how many failures you want to tolerate.


For our smallest pool with only 8 hosts in 4 different racks and 2 different 
pairs of switches (note: we consider switch failure more common than rack 
cooling or power failure), we're using 4/2 with failure domain = host. We 
currently use this for SSD scratch storage for HPC.


For one of our larger pools, with 24 hosts over 6 different racks and 6 
different pairs of switches, we're using 4:2 with failure domain = rack.


For another pool with similar host count but not spread over so many pairs of 
switches, we're using 6:3 and failure domain = host.


Also keep in mind that a higher value of k/m may give you more throughput but 
increase latency especially for small files, so it also depends on how 
important performance is and what kind of file size you store on your CephFS.
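
As a rough sketch, setting one of these pools up looks something like this (the profile, pool, fs names and PG counts below are placeholders, adjust for your cluster):

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=rack
ceph osd pool create cephfs_data_ec 1024 1024 erasure ec42
ceph osd pool set cephfs_data_ec allow_ec_overwrites true
ceph fs add_data_pool cephfs cephfs_data_ec
# then point a directory at the EC pool
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/scratch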


Cheers,

Linh


From: ceph-users  on behalf of Oliver Schulz 

Sent: Sunday, 15 July 2018 9:46:16 PM
To: ceph-users
Subject: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

Dear all,

we're planning a new Ceph cluster, with CephFS as the
main workload, and would like to use erasure coding to
use the disks more efficiently. Access pattern will
probably be more read- than write-heavy, on average.

I don't have any practical experience with erasure-
coded pools so far.

I'd be glad for any hints / recommendations regarding
these questions:

* Is an SSD cache pool recommended/necessary for
CephFS on an erasure-coded HDD pool (using Ceph
Luminous and BlueStore)?

* What are good values for k/m for erasure coding in
practice (assuming a cluster of about 300 OSDs), to
make things robust and ease maintenance (ability to
take a few nodes down)? Is k/m = 6/3 a good choice?

* Will it be sufficient to have k+m racks, resp. failure
domains?


Cheers and thanks for any advice,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel PG stuck inconsistent with 3 0-size objects

2018-07-16 Thread Brad Hubbard
Your issue is different since not only do the omap digests of all
replicas not match the omap digest from the auth object info, but they
are also all different from each other.

What is min_size of pool 67 and what can you tell us about the events
leading up to this?
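
e.g. something like (adjust the pool name):

ceph osd dump | grep "^pool 67 "
ceph osd pool get <pool-name> min_size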

On Mon, Jul 16, 2018 at 7:06 PM, Matthew Vernon  wrote:
> Hi,
>
> Our cluster is running 10.2.9 (from Ubuntu; on 16.04 LTS), and we have a
> pg that's stuck inconsistent; if I repair it, it logs "failed to pick
> suitable auth object" (repair log attached, to try and stop my MUA
> mangling it).
>
> We then deep-scrubbed that pg, at which point
> rados list-inconsistent-obj 67.2e --format=json-pretty produces a bit of
> output (also attached), which includes that all 3 osds have a zero-sized
> object e.g.
>
> "osd": 1937,
> "errors": [
> "omap_digest_mismatch_oi"
> ],
> "size": 0,
> "omap_digest": "0x45773901",
> "data_digest": "0x"
>
> All 3 osds have different omap_digest, but all have 0 size. Indeed,
> looking on the OSD disks directly, each object is 0 size (i.e. they are
> identical).
>
> This looks similar to one of the failure modes in
> http://tracker.ceph.com/issues/21388 where there is a suggestion (comment
> 19 from David Zafman) to do:
>
> rados -p default.rgw.buckets.index setomapval
> .dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key anything
> [deep-scrub]
> rados -p default.rgw.buckets.index rmomapkey
> .dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key
>
> Is this likely to be the correct approach here, too? And is there an
> underlying bug in ceph that still needs fixing? :)
>
> Thanks,
>
> Matthew
>
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] checking rbd volumes modification times

2018-07-16 Thread Andrei Mikhailovsky
Dear cephers, 

Could someone tell me how to check RBD volume modification times in a Ceph 
pool? I am currently in the process of trimming our Ceph pool and would like to 
start with volumes which have not been modified for a long time. How do I get that 
information? 
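
The closest I have come up with myself is something like the sketch below, which takes the newest mtime of an image's data objects - but it walks the whole pool for every image, so it is painfully slow and I am not sure it is the right way (the pool name is a placeholder):

POOL=rbd
for img in $(rbd -p "$POOL" ls); do
    prefix=$(rbd -p "$POOL" info "$img" | awk '/block_name_prefix/ {print $2}')
    last=$(rados -p "$POOL" ls | grep "^$prefix" | while read -r obj; do
               rados -p "$POOL" stat "$obj" | sed -e 's/.*mtime //' -e 's/, size.*//'
           done | sort | tail -1)
    echo "$img  last data object mtime: $last"
done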

Cheers 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD tuning no longer required?

2018-07-16 Thread Xavier Trilla
Hi there,

I would just like to note that for some scenarios the defaults are not good enough.

Recently we upgraded one of our clusters from Jewel to Luminous, during the 
upgrade we removed all the custom tuning we had done on it over the years from 
ceph.conf -I was extremely excited to get rid of all those parameters!!- but 
performance after the upgrade was really bad.

The cluster runs mainly over write-cache-backed HDDs (via a dedicated SAS 
controller), so it looks like the defaults for HDDs are aimed at really low performance 
HDD setups.

So, in the end we had to reapply our previous config, and tune some other -not 
as well documented as I would like- parameters. And now we have even more 
tuning parameters than we did… :/

I think the main issue is the lack of an updated guide to tuning Ceph OSD 
performance, or at least proper documentation explaining what each 
parameter really does (and I’ve been running Ceph clusters since Cuttlefish, so I did 
spend my time checking documentation, presentations, etc…)
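
For anyone in the same situation, asking a running OSD directly at least shows what is actually in effect (the OSD id and option name below are just examples):

ceph daemon osd.0 config diff                            # everything that differs from the built-in defaults
ceph daemon osd.0 config get bluestore_cache_size_hdd    # a single option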

It’s a pity that so many new users become disappointed with Ceph because they don’t 
get anywhere near the performance they are expecting.

I still have to try the new default values in a pure NVMe cluster; I hope the 
result will be better.

I think a good document about how to tune OSD performance would really help 
Ceph :)

Cheers!

Kind regards,
Xavier Trilla P.
Clouding.io

A Cloud Server with SSDs, redundant
and available in less than 30 seconds?

Try it now at Clouding.io!

From: ceph-users  On Behalf Of Robert Stanford
Sent: Monday, 16 July 2018 21:40
To: Gregory Farnum 
CC: ceph-users ; Konstantin Shalygin 
Subject: Re: [ceph-users] OSD tuning no longer required?


 Golden advice.  Thank you Greg

On Mon, Jul 16, 2018 at 1:45 PM, Gregory Farnum 
mailto:gfar...@redhat.com>> wrote:
On Fri, Jul 13, 2018 at 2:50 AM Robert Stanford 
mailto:rstanford8...@gmail.com>> wrote:

 This is what leads me to believe it's other settings being referred to as well:
https://ceph.com/community/new-luminous-rados-improvements/

"There are dozens of documents floating around with long lists of Ceph 
configurables that have been tuned for optimal performance on specific hardware 
or for specific workloads.  In most cases these ceph.conf fragments tend to 
induce funny looks on developers’ faces because the settings being adjusted 
seem counter-intuitive, unrelated to the performance of the system, and/or 
outright dangerous.  Our goal is to make Ceph work as well as we can out of the 
box without requiring any tuning at all, so we are always striving to choose 
sane defaults.  And generally, we discourage tuning by users. "

To me it's not just bluestore settings / ssd vs. hdd they're talking about 
("dozens of documents floating around"... "our goal... without any tuning at 
all").  Am I off base?

Ceph is *extremely* tunable, because whenever we set up a new behavior 
(snapshot trimming sleeps, scrub IO priorities, whatever) and we're not sure 
how it should behave we add a config option. Most of these config options we 
come up with some value through testing or informed guesswork, set it in the 
config, and expect that users won't ever see it. Some of these settings we 
don't know what they should be, and we really hope the whole mechanism gets 
replaced before users see it, but they don't. Some of the settings should be 
auto-tuning or manually set to a different value for each deployment to get 
optimal performance.
So there are lots of options for people to make things much better or much 
worse for themselves.

However, by far the biggest impact and most common tunables are those that 
basically vary on if the OSD is using a hard drive or an SSD for its local 
storage — those are order-of-magnitude differences in expected latency and 
throughput. So we now have separate default tunables for those cases which are 
automatically applied.

Could somebody who knows what they're doing tweak things even better for a 
particular deployment? Undoubtedly. But do *most* people know what they're 
doing that well? They don't.
In particular, the old "fix it" configuration settings that a lot of people 
were sharing and using starting in the Cuttlefish days are rather dangerously 
out of date, and we no longer have defaults that are quite as stupid as some of 
those were.

So I'd generally recommend you remove any custom tuning you've set up unless 
you have specific reason to think it will do better than the defaults for your 
currently-deployed release.
-Greg


 Regards

On Thu, Jul 12, 2018 at 9:12 PM, Konstantin Shalygin 
mailto:k0...@k0ste.ru>> wrote:
  I saw this in the Luminous release notes:

  "Each OSD now adjusts its default configuration based on whether the
backing device is an HDD or SSD. Manual tuning generally not required"

  Which tuning in particular?  The ones in my configuration are
osd_op_threads, osd_disk_threads, osd_recovery_max_active,

Re: [ceph-users] intermittent slow requests on idle ssd ceph clusters

2018-07-16 Thread Xavier Trilla
Hi Pavel,

Any strange messages on dmesg, syslog, etc? 

I would recommend profiling the kernel with perf and checking for the calls 
that are consuming more CPU.

We had several problems like the one you are describing, and for example one of 
them got fixed by increasing vm.min_free_kbytes to 4GB. 
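
For reference, this is more or less what we did (the 4GB value is what worked for us; adjust to your RAM and make it persistent via /etc/sysctl.conf if it helps):

# sample kernel/user call stacks for ~30 seconds and look at the hottest paths
perf record -a -g -- sleep 30
perf report

# check and raise min_free_kbytes (4GB = 4194304 kB)
sysctl vm.min_free_kbytes
sysctl -w vm.min_free_kbytes=4194304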

Also, how is the sys usage if you run top on the machines hosting the OSDs?

Kind regards,
Xavier Trilla P.
Clouding.io

A Cloud Server with SSDs, redundant
and available in less than 30 seconds?

Try it now at Clouding.io!

-Original Message-
From: ceph-users  On Behalf Of Pavel Shub
Sent: Monday, 16 July 2018 23:52
To: Ceph Users 
Subject: [ceph-users] intermittent slow requests on idle ssd ceph clusters

Hello folks,

We've been having issues with slow requests cropping up on practically idle 
ceph clusters. From what I can tell the requests are hanging waiting for 
subops, and the OSD on the other end receives requests minutes later! Below it 
started waiting for subops at 12:09:51 and the subop was completed at 12:14:28.

{
"description": "osd_op(client.903117.0:569924 6.391 
6:89ed76f2:::%2fraster%2fv5%2fes%2f16%2f36320%2f24112:head [writefull 0~2072] 
snapc 0=[] ondisk+write+known_if_redirected e5777)",
"initiated_at": "2018-07-05 12:09:51.191419",
"age": 326.651167,
"duration": 276.977834,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.903117",
"client_addr": "10.20.31.234:0/1433094386",
"tid": 569924
},
"events": [
{
"time": "2018-07-05 12:09:51.191419",
"event": "initiated"
},
{
"time": "2018-07-05 12:09:51.191471",
"event": "queued_for_pg"
},
{
"time": "2018-07-05 12:09:51.191538",
"event": "reached_pg"
},
{
"time": "2018-07-05 12:09:51.191877",
"event": "started"
},
{
"time": "2018-07-05 12:09:51.192135",
"event": "waiting for subops from 11"
},
{
"time": "2018-07-05 12:09:51.192599",
"event": "op_commit"
},
{
"time": "2018-07-05 12:09:51.192616",
"event": "op_applied"
},
{
"time": "2018-07-05 12:14:28.169018",
"event": "sub_op_commit_rec from 11"
},
{
"time": "2018-07-05 12:14:28.169164",
"event": "commit_sent"
},
{
"time": "2018-07-05 12:14:28.169253",
"event": "done"
}
]
}
},

Below is what I assume is the corresponding request on osd.11; it seems to be 
receiving the network request ~4 minutes later.

2018-07-05 12:14:28.058552 7fb75ee0e700 20 osd.11 5777 share_map_peer
0x562b61bca000 already has epoch 5777
2018-07-05 12:14:28.167247 7fb75de0c700 10 osd.11 5777  new session
0x562cc23f0200 con=0x562baaa0e000 addr=10.16.15.28:6805/3218
2018-07-05 12:14:28.167282 7fb75de0c700 10 osd.11 5777  session
0x562cc23f0200 osd.20 has caps osdcap[grant(*)] 'allow *'
2018-07-05 12:14:28.167291 7fb75de0c700  0 -- 10.16.16.32:6817/3808 >>
10.16.15.28:6805/3218 conn(0x562baaa0e000 :6817 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept connect_seq 20 vs existing csq=19 existing_state=STATE_STANDBY
2018-07-05 12:14:28.167322 7fb7546d6700  2 osd.11 5777 ms_handle_reset con 
0x562baaa0e000 session 0x562cc23f0200
2018-07-05 12:14:28.167546 7fb75de0c700 10 osd.11 5777  session
0x562b62195c00 osd.20 has caps osdcap[grant(*)] 'allow *'

This is an all SSD cluster with minimal load. All hardware checks return good 
values. The cluster is currently running latest ceph mimic
(13.2.0) but we have also experienced this on other versions of luminous 12.2.2 
and 12.2.5.

I'm starting to think that this is a potential network driver issue.
We're currently running on kernel 4.14.15 and when we updated to latest 4.17 
the slow requests seem to occur more frequently. The network cards that we run 
are 10g intel X540.

Does anyone know how I can debug this further?

Thanks,
Pavel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] intermittent slow requests on idle ssd ceph clusters

2018-07-16 Thread Pavel Shub
Hello folks,

We've been having issues with slow requests cropping up on practically
idle ceph clusters. From what I can tell the requests are hanging
waiting for subops, and the OSD on the other end receives requests
minutes later! Below it started waiting for subops at 12:09:51 and the
subop was completed at 12:14:28.

{
"description": "osd_op(client.903117.0:569924 6.391
6:89ed76f2:::%2fraster%2fv5%2fes%2f16%2f36320%2f24112:head [writefull
0~2072] snapc 0=[] ondisk+write+known_if_redirected e5777)",
"initiated_at": "2018-07-05 12:09:51.191419",
"age": 326.651167,
"duration": 276.977834,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.903117",
"client_addr": "10.20.31.234:0/1433094386",
"tid": 569924
},
"events": [
{
"time": "2018-07-05 12:09:51.191419",
"event": "initiated"
},
{
"time": "2018-07-05 12:09:51.191471",
"event": "queued_for_pg"
},
{
"time": "2018-07-05 12:09:51.191538",
"event": "reached_pg"
},
{
"time": "2018-07-05 12:09:51.191877",
"event": "started"
},
{
"time": "2018-07-05 12:09:51.192135",
"event": "waiting for subops from 11"
},
{
"time": "2018-07-05 12:09:51.192599",
"event": "op_commit"
},
{
"time": "2018-07-05 12:09:51.192616",
"event": "op_applied"
},
{
"time": "2018-07-05 12:14:28.169018",
"event": "sub_op_commit_rec from 11"
},
{
"time": "2018-07-05 12:14:28.169164",
"event": "commit_sent"
},
{
"time": "2018-07-05 12:14:28.169253",
"event": "done"
}
]
}
},

Below is what I assume is the corresponding request on osd.11; it
seems to be receiving the network request ~4 minutes later.

2018-07-05 12:14:28.058552 7fb75ee0e700 20 osd.11 5777 share_map_peer
0x562b61bca000 already has epoch 5777
2018-07-05 12:14:28.167247 7fb75de0c700 10 osd.11 5777  new session
0x562cc23f0200 con=0x562baaa0e000 addr=10.16.15.28:6805/3218
2018-07-05 12:14:28.167282 7fb75de0c700 10 osd.11 5777  session
0x562cc23f0200 osd.20 has caps osdcap[grant(*)] 'allow *'
2018-07-05 12:14:28.167291 7fb75de0c700  0 -- 10.16.16.32:6817/3808 >>
10.16.15.28:6805/3218 conn(0x562baaa0e000 :6817
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 20 vs existing csq=19
existing_state=STATE_STANDBY
2018-07-05 12:14:28.167322 7fb7546d6700  2 osd.11 5777 ms_handle_reset
con 0x562baaa0e000 session 0x562cc23f0200
2018-07-05 12:14:28.167546 7fb75de0c700 10 osd.11 5777  session
0x562b62195c00 osd.20 has caps osdcap[grant(*)] 'allow *'

This is an all SSD cluster with minimal load. All hardware checks
return good values. The cluster is currently running latest ceph mimic
(13.2.0) but we have also experienced this on other versions of
luminous 12.2.2 and 12.2.5.

I'm starting to think that this is a potential network driver issue.
We're currently running on kernel 4.14.15 and when we updated to
latest 4.17 the slow requests seem to occur more frequently. The
network cards that we run are 10g intel X540.
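
For what it's worth, this is the kind of thing I have been checking so far (the interface name and OSD id are placeholders):

# NIC driver/firmware versions and error/drop counters
ethtool -i eth0
ethtool -S eth0 | grep -iE 'err|drop|miss'
ip -s link show eth0
# TCP retransmissions on the OSD hosts
netstat -s | grep -i retrans
# ops as seen by the OSD itself
ceph daemon osd.11 dump_ops_in_flight
ceph daemon osd.11 dump_historic_ops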

Does anyone know how I can debug this further?

Thanks,
Pavel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD tuning no longer required?

2018-07-16 Thread Robert Stanford
 Golden advice.  Thank you Greg

On Mon, Jul 16, 2018 at 1:45 PM, Gregory Farnum  wrote:

> On Fri, Jul 13, 2018 at 2:50 AM Robert Stanford 
> wrote:
>
>>
>>  This is what leads me to believe it's other settings being referred to
>> as well:
>> https://ceph.com/community/new-luminous-rados-improvements/
>>
>> *"There are dozens of documents floating around with long lists of Ceph
>> configurables that have been tuned for optimal performance on specific
>> hardware or for specific workloads.  In most cases these ceph.conf
>> fragments tend to induce funny looks on developers’ faces because the
>> settings being adjusted seem counter-intuitive, unrelated to the
>> performance of the system, and/or outright dangerous.  Our goal is to make
>> Ceph work as well as we can out of the box without requiring any tuning at
>> all, so we are always striving to choose sane defaults.  And generally, we
>> discourage tuning by users. "*
>>
> > To me it's not just bluestore settings / ssd vs. hdd they're talking
> > about ("dozens of documents floating around"... "our goal... without any
> > tuning at all").  Am I off base?
>>
>
> Ceph is *extremely* tunable, because whenever we set up a new behavior
> (snapshot trimming sleeps, scrub IO priorities, whatever) and we're not
> sure how it should behave we add a config option. Most of these config
> options we come up with some value through testing or informed guesswork,
> set it in the config, and expect that users won't ever see it. Some of
> these settings we don't know what they should be, and we really hope the
> whole mechanism gets replaced before users see it, but they don't. Some of
> the settings should be auto-tuning or manually set to a different value for
> each deployment to get optimal performance.
> So there are lots of options for people to make things much better or much
> worse for themselves.
>
> However, by far the biggest impact and most common tunables are those that
> basically vary on if the OSD is using a hard drive or an SSD for its local
> storage — those are order-of-magnitude differences in expected latency and
> throughput. So we now have separate default tunables for those cases which
> are automatically applied.
>
> Could somebody who knows what they're doing tweak things even better for a
> particular deployment? Undoubtedly. But do *most* people know what they're
> doing that well? They don't.
> In particular, the old "fix it" configuration settings that a lot of
> people were sharing and using starting in the Cuttlefish days are rather
> dangerously out of date, and we no longer have defaults that are quite as
> stupid as some of those were.
>
> So I'd generally recommend you remove any custom tuning you've set up
> unless you have specific reason to think it will do better than the
> defaults for your currently-deployed release.
> -Greg
>
>
>>
>>  Regards
>>
>> On Thu, Jul 12, 2018 at 9:12 PM, Konstantin Shalygin 
>> wrote:
>>
>>>   I saw this in the Luminous release notes:

   "Each OSD now adjusts its default configuration based on whether the
 backing device is an HDD or SSD. Manual tuning generally not required"

   Which tuning in particular?  The ones in my configuration are
 osd_op_threads, osd_disk_threads, osd_recovery_max_active,
 osd_op_thread_suicide_timeout, and osd_crush_chooseleaf_type, among
 others.  Can I rip these out when I upgrade to
 Luminous?

>>>
>>> This mean that some "bluestore_*" settings tuned for nvme/hdd separately.
>>>
>>> Also with Luminous we have:
>>>
>>> osd_op_num_shards_(ssd|hdd)
>>>
>>> osd_op_num_threads_per_shard_(ssd|hdd)
>>>
>>> osd_recovery_sleep_(ssd|hdd)
>>>
>>>
>>>
>>>
>>> k
>>>
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-16 Thread Gregory Farnum
On Mon, Jul 16, 2018 at 1:25 AM John Spray  wrote:

> On Sun, Jul 15, 2018 at 12:46 PM Oliver Schulz
>  wrote:
> >
> > Dear all,
> >
> > we're planning a new Ceph cluster, with CephFS as the
> > main workload, and would like to use erasure coding to
> > use the disks more efficiently. Access pattern will
> > probably be more read- than write-heavy, on average.
> >
> > I don't have any practical experience with erasure-
> > coded pools so far.
> >
> > I'd be glad for any hints / recommendations regarding
> > these questions:
> >
> > * Is an SSD cache pool recommended/necessary for
> >CephFS on an erasure-coded HDD pool (using Ceph
> >Luminous and BlueStore)?
>
> Since Luminous, you can use an erasure coded pool (on bluestore)
> directly as a CephFS data pool, no cache pool needed.
>

More than that, we'd really prefer you didn't use cache pools for anything.
Just Say No. :)
-Greg


>
> John
>
> > * What are good values for k/m for erasure coding in
> >practice (assuming a cluster of about 300 OSDs), to
> >make things robust and ease maintenance (ability to
> >take a few nodes down)? Is k/m = 6/3 a good choice?
>

That will depend on your file sizes, IO patterns, and expected durability
needs. I think 6+3 is a common one but I don't deal with many deployments.


> >
> > * Will it be sufficient to have k+m racks, resp. failure
> >domains?
>

Generally, if you want CRUSH to select X "buckets" at any level, it's good
to have at least X+1 choices for it to prevent mapping failures. But you
could also do a workaround like letting it choose (K+M)/2 racks and putting
two shards in each rack.
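
A sketch of that workaround as a CRUSH rule, for e.g. a 4+2 profile spread over 3 racks (the rule name/id are placeholders, and you'd want to compile and test the edited map with crushtool before injecting it):

rule cephfs_ec_by_rack {
    id 2
    type erasure
    min_size 3
    max_size 6
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type rack
    step chooseleaf indep 2 type host
    step emit
}
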
-Greg


> >
> >
> > Cheers and thanks for any advice,
> >
> > Oliver
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD fails to start after power failure (with FAILED assert(num_unsent <= log_queue.size()) error)

2018-07-16 Thread Gregory Farnum
It's *repeatedly* crashing and restarting? I think the other times we've
seen this it was entirely ephemeral and went away on restart, and I really
don't know what about this state *could* be made persistent, so that's
quite strange. If you can set "debug monc = 20", reproduce this, and post
the log it might help us track down the issue.
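
e.g. a ceph.conf fragment like this on the OSD's host (the OSD id is a placeholder), then restart the OSD and grab /var/log/ceph/ceph-osd.<id>.log:

[osd.12]
debug monc = 20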

But if you just want it to work, I bet that restarting the host node will
resolve it...
-Greg

On Sat, Jul 14, 2018 at 3:29 PM David Young  wrote:

> Hey folks,
>
> Sorry, posting this from a second account, since for some reason my
> primary account doesn't seem to be able to post to the list...
>
> I have a Luminous 12.2.6 cluster which suffered a power failure recently.
> On recovery, one of my OSDs is continually crashing and restarting, with
> the error below:
>
> 
> 9ae00 con 0
> -3> 2018-07-15 09:50:58.313242 7f131c5a9700 10 monclient: tick
> -2> 2018-07-15 09:50:58.313277 7f131c5a9700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after 2018-07-15
> 09:50:28.313274)
> -1> 2018-07-15 09:50:58.313320 7f131c5a9700 10 log_client  log_queue
> is 8 last_log 10 sent 0 num 8 unsent 10 sending 10
>  0> 2018-07-15 09:50:58.320255 7f131c5a9700 -1
> /build/ceph-12.2.6/src/common/LogClient.cc: In function 'Message*
> LogClient::_get_mon_log_message()' thread 7f131c5a9700 time 2018-07-15
> 09:50:58.313336
> /build/ceph-12.2.6/src/common/LogClient.cc: 294: FAILED assert(num_unsent
> <= log_queue.size())
> 
>
>
> I've found a few recent references to this "FAILED assert" message
> (assuming that's the cause of the problem), such as
> https://bugzilla.redhat.com/show_bug.cgi?id=1599718 and
> http://tracker.ceph.com/issues/18209, with the most recent occurance
> being 3 days ago (http://tracker.ceph.com/issues/18209#note-12).
>
> Is there any resolution to this issue, or anything I can attempt to
> recover?
>
> Thanks!
> D
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD tuning no longer required?

2018-07-16 Thread Gregory Farnum
On Fri, Jul 13, 2018 at 2:50 AM Robert Stanford 
wrote:

>
>  This is what leads me to believe it's other settings being referred to as
> well:
> https://ceph.com/community/new-luminous-rados-improvements/
>
> *"There are dozens of documents floating around with long lists of Ceph
> configurables that have been tuned for optimal performance on specific
> hardware or for specific workloads.  In most cases these ceph.conf
> fragments tend to induce funny looks on developers’ faces because the
> settings being adjusted seem counter-intuitive, unrelated to the
> performance of the system, and/or outright dangerous.  Our goal is to make
> Ceph work as well as we can out of the box without requiring any tuning at
> all, so we are always striving to choose sane defaults.  And generally, we
> discourage tuning by users. "*
>
> To me it's not just bluestore settings / ssd vs. hdd they're talking about
> ("dozens of documents floating around"... "our goal... without any tuning
> at all").  Am I off base?
>

Ceph is *extremely* tunable, because whenever we set up a new behavior
(snapshot trimming sleeps, scrub IO priorities, whatever) and we're not
sure how it should behave we add a config option. Most of these config
options we come up with some value through testing or informed guesswork,
set it in the config, and expect that users won't ever see it. Some of
these settings we don't know what they should be, and we really hope the
whole mechanism gets replaced before users see it, but they don't. Some of
the settings should be auto-tuning or manually set to a different value for
each deployment to get optimal performance.
So there are lots of options for people to make things much better or much
worse for themselves.

However, by far the biggest impact and most common tunables are those that
basically vary on if the OSD is using a hard drive or an SSD for its local
storage — those are order-of-magnitude differences in expected latency and
throughput. So we now have separate default tunables for those cases which
are automatically applied.
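
(If you want to check what a given OSD ended up with, something like this should show it; the OSD id is a placeholder. The first command tells you whether the device was detected as rotational, the second lists the per-class values actually in effect.)

ceph osd metadata 12 | grep -i rotational
ceph daemon osd.12 config show | grep -E '_(hdd|ssd)'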

Could somebody who knows what they're doing tweak things even better for a
particular deployment? Undoubtedly. But do *most* people know what they're
doing that well? They don't.
In particular, the old "fix it" configuration settings that a lot of people
were sharing and using starting in the Cuttlefish days are rather
dangerously out of date, and we no longer have defaults that are quite as
stupid as some of those were.

So I'd generally recommend you remove any custom tuning you've set up
unless you have specific reason to think it will do better than the
defaults for your currently-deployed release.
-Greg


>
>  Regards
>
> On Thu, Jul 12, 2018 at 9:12 PM, Konstantin Shalygin 
> wrote:
>
>>   I saw this in the Luminous release notes:
>>>
>>>   "Each OSD now adjusts its default configuration based on whether the
>>> backing device is an HDD or SSD. Manual tuning generally not required"
>>>
>>>   Which tuning in particular?  The ones in my configuration are
>>> osd_op_threads, osd_disk_threads, osd_recovery_max_active,
>>> osd_op_thread_suicide_timeout, and osd_crush_chooseleaf_type, among
>>> others.  Can I rip these out when I upgrade to
>>> Luminous?
>>>
>>
>> This mean that some "bluestore_*" settings tuned for nvme/hdd separately.
>>
>> Also with Luminous we have:
>>
>> osd_op_num_shards_(ssd|hdd)
>>
>> osd_op_num_threads_per_shard_(ssd|hdd)
>>
>> osd_recovery_sleep_(ssd|hdd)
>>
>>
>>
>>
>> k
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.6 CRC errors

2018-07-16 Thread Sage Weil
We are in the process of building the 12.2.7 release now that will fix 
this.  (If you don't want to wait you can also install the autobuilt 
packages from shaman.ceph.com... official packages are only a few hours 
away from being ready though).

I would pause data migration for the time being (set the norebalance flag).  Once the new 
version is installed it will stop creating the crc mismatches and it 
will prevent them from triggering an incorrect EIO on read.  However, 
scrub doesn't repair them yet.  They will tend to go away on their own 
as normal IO touches the affected objects.  In 12.2.8 scrub will repair 
the CRCs.
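
Something along these lines, for example (and check that every OSD is actually on 12.2.7 before unsetting the flag):

ceph osd set norebalance
# ... upgrade packages and restart OSDs ...
ceph versions
ceph osd unset norebalance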

In the meantime, while waiting for the fix, you can set 
osd_skip_data_digest = false to avoid generating more errors.  But note 
that once you upgrade you need to turn that back on (or 
osd_distrust_data_digest) to apply the fix/workaround.

You'll want to read the 12.2.7 release notes carefully (PR at 
https://github.com/ceph/ceph/pull/23057).

The bug doesn't corrupt data; only the whole-object checksums.  However, 
some reads (when the entire object is read) will see the bad checksum and 
return EIO.  This could break applications at a higher layer (although 
hopefully they will just abort and exit cleanly; it is hard to tell given 
the breadth of workloads).

I hope that helps, and I'm very sorry this regression crept in!
sage


On Mon, 16 Jul 2018, Stefan Schneebeli wrote:

> hello guys,
> 
> unfortunately I missed the warning on Friday and upgraded my cluster on
> Saturday to 12.2.6.
> The cluster is in a migration state from filestore to bluestore (10/2) and I
> constantly get inconsistent PGs, only on the two bluestore OSDs.
> If I run rados list-inconsistent-obj 2.17 --format=json-pretty, for example, I
> see these mismatches at the end:
> 
> "shards": [
> {
> "osd": 0,
> "primary": true,
> "errors": [],
> "size": 4194304,
> "omap_digest": "0x"
> },
> {
> "osd": 1,
> "primary": false,
> "errors": [
> "data_digest_mismatch_info"
> ],
> "size": 4194304,
> "omap_digest": "0x",
> "data_digest": "0x21b21973"
> 
> Is this the issue you are talking about?
> I can repair these PGs with ceph pg repair and it reports the error is fixed.
> But is it really fixed?
> Do I have to be afraid that I now have corrupted data?
> Would it be an option to noout these bluestore OSDs and stop them?
> When do you expect the new 12.2.7 Release? Will it fix all the errors?
> 
> Thank you in advance for your answers!
> 
> Stefan
> 
> 
> 
> 
> 
> -- Original Message --
> From: "Sage Weil" 
> To: "Glen Baars" 
> Cc: "ceph-users@lists.ceph.com" 
> Sent: 14.07.2018 19:15:57
> Subject: Re: [ceph-users] 12.2.6 CRC errors
> 
> > On Sat, 14 Jul 2018, Glen Baars wrote:
> > > Hello Ceph users!
> > > 
> > > Note to users, don't install new servers on Friday the 13th!
> > > 
> > > We added a new ceph node on Friday and it has received the latest 12.2.6
> > > update. I started to see CRC errors and investigated hardware issues. I
> > > have since found that it is caused by the 12.2.6 release. About 80TB
> > > copied onto this server.
> > > 
> > > I have set noout,noscrub,nodeepscrub and repaired the affected PGs (
> > > ceph pg repair ) . This has cleared the errors.
> > > 
> > > * no idea if this is a good way to fix the issue. From the bug
> > > report this issue is in the deepscrub and therefore I suppose stopping
> > > it will limit the issues. ***
> > > 
> > > Can anyone tell me what to do? Downgrade seems that it won't fix the
> > > issue. Maybe remove this node and rebuild with 12.2.5 and resync data?
> > > Wait a few days for 12.2.7?
> > 
> > I would sit tight for now.  I'm working on the right fix and hope to
> > having something to test shortly, and possibly a release by tomorrow.
> > 
> > There is a remaining danger that, for objects with bad full-object
> > digests, a read of the entire object will throw an EIO.  It's up
> > to you whether you want to try to quiesce workloads to avoid that (to
> > prevent corruption at higher layers) or avoid a service
> > degradation/outage.  :(  Unfortunately I don't have super precise guidance
> > as far as how likely that is.
> > 
> > Are you using bluestore only, or is it a mix of bluestore and filestore?
> > 
> > sage
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.6 CRC errors

2018-07-16 Thread Stefan Schneebeli

hello guys,

unfortunately I missed the warning on Friday and upgraded my cluster on 
Saturday to 12.2.6.
The cluster is in a migration state from filestore to bluestore (10/2) 
and I constantly get inconsistent PGs, only on the two bluestore OSDs.
If I run rados list-inconsistent-obj 2.17 --format=json-pretty, for 
example, I see these mismatches at the end:


"shards": [
{
"osd": 0,
"primary": true,
"errors": [],
"size": 4194304,
"omap_digest": "0x"
},
{
"osd": 1,
"primary": false,
"errors": [
"data_digest_mismatch_info"
],
"size": 4194304,
"omap_digest": "0x",
"data_digest": "0x21b21973"

Is this the issue you are talking about?
I can repair these PGs with ceph pg repair and it reports the error is 
fixed.

But is it really fixed?
Do I have to be afraid that I now have corrupted data?
Would it be an option to noout these bluestore OSDs and stop them?
When do you expect the new 12.2.7 Release? Will it fix all the errors?

Thank you in advance for your answers!

Stefan





-- Original Message --
From: "Sage Weil" 
To: "Glen Baars" 
Cc: "ceph-users@lists.ceph.com" 
Sent: 14.07.2018 19:15:57
Subject: Re: [ceph-users] 12.2.6 CRC errors


On Sat, 14 Jul 2018, Glen Baars wrote:

Hello Ceph users!

Note to users, don't install new servers on Friday the 13th!

We added a new ceph node on Friday and it has received the latest 12.2.6
update. I started to see CRC errors and investigated hardware issues. I
have since found that it is caused by the 12.2.6 release. About 80TB
copied onto this server.

I have set noout,noscrub,nodeepscrub and repaired the affected PGs (
ceph pg repair ) . This has cleared the errors.

* no idea if this is a good way to fix the issue. From the bug
report this issue is in the deepscrub and therefore I suppose stopping
it will limit the issues. ***

Can anyone tell me what to do? Downgrade seems that it won't fix the
issue. Maybe remove this node and rebuild with 12.2.5 and resync data?
Wait a few days for 12.2.7?


I would sit tight for now.  I'm working on the right fix and hope to
having something to test shortly, and possibly a release by tomorrow.

There is a remaining danger that, for objects with bad full-object
digests, a read of the entire object will throw an EIO.  It's up
to you whether you want to try to quiesce workloads to avoid that (to
prevent corruption at higher layers) or avoid a service
degradation/outage.  :(  Unfortunately I don't have super precise guidance
as far as how likely that is.

Are you using bluestore only, or is it a mix of bluestore and filestore?


sage


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs for data drives

2018-07-16 Thread Satish Patel
I just ran a test on a Samsung 850 Pro 500GB (how do I interpret the result of
the following output?)



[root@compute-01 tmp]# fio --filename=/dev/sda --direct=1 --sync=1
--rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
--group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B,
(T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=76.0MiB/s][r=0,w=19.7k
IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=6969: Mon Jul 16 14:21:27 2018
  write: IOPS=20.1k, BW=78.6MiB/s (82.5MB/s)(4719MiB/60001msec)
clat (usec): min=36, max=4525, avg=47.22, stdev=16.65
 lat (usec): min=36, max=4526, avg=47.57, stdev=16.69
clat percentiles (usec):
 |  1.00th=[   39],  5.00th=[   40], 10.00th=[   40], 20.00th=[   41],
 | 30.00th=[   43], 40.00th=[   48], 50.00th=[   49], 60.00th=[   50],
 | 70.00th=[   50], 80.00th=[   51], 90.00th=[   52], 95.00th=[   53],
 | 99.00th=[   62], 99.50th=[   65], 99.90th=[  108], 99.95th=[  363],
 | 99.99th=[  396]
   bw (  KiB/s): min=72152, max=96464, per=100.00%, avg=80581.45,
stdev=7032.18, samples=119
   iops: min=18038, max=24116, avg=20145.34, stdev=1758.05, samples=119
  lat (usec)   : 50=71.83%, 100=28.06%, 250=0.03%, 500=0.08%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 10=0.01%
  cpu  : usr=9.44%, sys=31.95%, ctx=1209952, majf=0, minf=78
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwt: total=0,1207979,0, short=0,0,0, dropped=0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=78.6MiB/s (82.5MB/s), 78.6MiB/s-78.6MiB/s
(82.5MB/s-82.5MB/s), io=4719MiB (4948MB), run=60001-60001msec

Disk stats (read/write):
  sda: ios=0/1205921, merge=0/29, ticks=0/41418, in_queue=40965, util=68.35%

On Mon, Jul 16, 2018 at 1:18 PM, Michael Kuriger  wrote:
> I dunno, to me benchmark tests are only really useful to compare different
> drives.
>
>
>
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Paul Emmerich
> Sent: Monday, July 16, 2018 8:41 AM
> To: Satish Patel
> Cc: ceph-users
>
>
> Subject: Re: [ceph-users] SSDs for data drives
>
>
>
> This doesn't look like a good benchmark:
>
> (from the blog post)
>
> dd if=/dev/zero of=/mnt/rawdisk/data.bin bs=1G count=20 oflag=direct
>
> 1. it writes compressible data which some SSDs might compress, you should
> use urandom
>
> 2. that workload does not look like something Ceph will do to your disk,
> like not at all
>
> If you want a quick estimate of an SSD in worst-case scenario: run the usual
> 4k oflag=direct,dsync test (or better: fio).
>
> A bad SSD will get < 1k IOPS, a good one > 10k
>
> But that doesn't test everything. In particular, performance might degrade
> as the disks fill up. Also, it's the absolute
>
> worst-case, i.e., a disk used for multiple journal/wal devices
>
>
>
>
>
> Paul
>
>
>
> 2018-07-16 10:09 GMT-04:00 Satish Patel :
>
> https://blog.cypressxt.net/hello-ceph-and-samsung-850-evo/
>
>
> On Thu, Jul 12, 2018 at 3:37 AM, Adrian Saul
>  wrote:
>>
>>
>> We started our cluster with consumer (Samsung EVO) disks and the write
>> performance was pitiful, they had periodic spikes in latency (average of
>> 8ms, but much higher spikes) and just did not perform anywhere near where
>> we
>> were expecting.
>>
>>
>>
>> When replaced with SM863 based devices the difference was night and day.
>> The DC grade disks held a nearly constant low latency (constantly sub-ms), no
>> no
>> spiking and performance was massively better.   For a period I ran both
>> disks in the cluster and was able to graph them side by side with the same
>> workload.  This was not even a moderately loaded cluster so I am glad we
>> discovered this before we went full scale.
>>
>>
>>
>> So while you certainly can do cheap and cheerful and let the data
>> availability be handled by Ceph, don’t expect the performance to keep up.
>>
>>
>>
>>
>>
>>
>>
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Satish Patel
>> Sent: Wednesday, 11 July 2018 10:50 PM
>> To: Paul Emmerich 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] SSDs for data drives
>>
>>
>>
>> Prices going way up if I am picking Samsung SM863a for all data drives.
>>
>>
>>
>> We have many servers running on consumer grade SSD drives and we never
>> noticed any performance issues or any faults so far (but we never used ceph
>> before)
>>
>>
>>
>> I thought that is the whole point of ceph: to provide high availability if a
>> drive goes down, and also parallel reads from multiple osd nodes
>>
>>
>>
>> Sent from my iPhone
>>
>>
>> On Jul 11, 2018, at 6:57 AM, Paul Emmerich  wrote:
>>
>> Hi,
>>
>>
>>
>> we‘ve no long-term data for

Re: [ceph-users] SSDs for data drives

2018-07-16 Thread Michael Kuriger
I dunno, to me benchmark tests are only really useful to compare different 
drives.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Paul 
Emmerich
Sent: Monday, July 16, 2018 8:41 AM
To: Satish Patel
Cc: ceph-users
Subject: Re: [ceph-users] SSDs for data drives

This doesn't look like a good benchmark:

(from the blog post)

dd if=/dev/zero of=/mnt/rawdisk/data.bin bs=1G count=20 oflag=direct
1. it writes compressible data which some SSDs might compress, you should use 
urandom
2. that workload does not look like something Ceph will do to your disk, like 
not at all
If you want a quick estimate of an SSD in worst-case scenario: run the usual 4k 
oflag=direct,dsync test (or better: fio).
A bad SSD will get < 1k IOPS, a good one > 10k
But that doesn't test everything. In particular, performance might degrade as 
the disks fill up. Also, it's the absolute
worst-case, i.e., a disk used for multiple journal/wal devices


Paul

2018-07-16 10:09 GMT-04:00 Satish Patel 
mailto:satish@gmail.com>>:
https://blog.cypressxt.net/hello-ceph-and-samsung-850-evo/

On Thu, Jul 12, 2018 at 3:37 AM, Adrian Saul
mailto:adrian.s...@tpgtelecom.com.au>> wrote:
>
>
> We started our cluster with consumer (Samsung EVO) disks and the write
> performance was pitiful, they had periodic spikes in latency (average of
> 8ms, but much higher spikes) and just did not perform anywhere near where we
> were expecting.
>
>
>
> When replaced with SM863 based devices the difference was night and day.
> The DC grade disks held a nearly constant low latency (constantly sub-ms), no
> spiking and performance was massively better.   For a period I ran both
> disks in the cluster and was able to graph them side by side with the same
> workload.  This was not even a moderately loaded cluster so I am glad we
> discovered this before we went full scale.
>
>
>
> So while you certainly can do cheap and cheerful and let the data
> availability be handled by Ceph, don’t expect the performance to keep up.
>
>
>
>
>
>
>
> From: ceph-users 
> [mailto:ceph-users-boun...@lists.ceph.com]
>  On Behalf Of
> Satish Patel
> Sent: Wednesday, 11 July 2018 10:50 PM
> To: Paul Emmerich mailto:paul.emmer...@croit.io>>
> Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
> Subject: Re: [ceph-users] SSDs for data drives
>
>
>
> Prices going way up if I am picking Samsung SM863a for all data drives.
>
>
>
> We have many servers running on consumer grade SSD drives and we never
> noticed any performance issues or any faults so far (but we never used ceph before)
>
>
>
> I thought that is the whole point of ceph: to provide high availability if a
> drive goes down, and also parallel reads from multiple osd nodes
>
>
>
> Sent from my iPhone
>
>
> On Jul 11, 2018, at 6:57 AM, Paul Emmerich 
> mailto:paul.emmer...@croit.io>> wrote:
>
> Hi,
>
>
>
> we‘ve no long-term data for the SM variant.
>
> Performance is fine as far as we can tell, but the main difference between
> these two models should be endurance.
>
>
>
>
>
> Also, I forgot to mention that my experiences are only for the 1, 2, and 4
> TB variants. Smaller SSDs are often proportionally slower (especially below
> 500GB).
>
>
>
> Paul
>
>
> Robert Stanford mailto:rstanford8...@gmail.com>>:
>
> Paul -
>
>
>
>  That's extremely helpful, thanks.  I do have another cluster that uses
> Samsung SM863a just for journal (spinning disks for data).  Do you happen to
> have an opinion on those as well?
>
>
>
> On Wed, Jul 11, 2018 at 4:03 AM, Paul Emmerich 
> mailto:paul.emmer...@croit.io>>
> wrote:
>
> PM/SM863a are usually great disks and should be the default go-to option,
> they outperform
>
> even the more expensive PM1633 in our experience.
>
> (But that really doesn't matter if it's for the full OSD and not as
> dedicated WAL/journal)
>
>
>
> We got a cluster with a few hundred SanDisk Ultra II (discontinued, i
> believe) that was built on a budget.
>
> Not the best disk but great value. They have been running for ~3 years now
> with very few failures and
>
> okayish overall performance.
>
>
>
> We also got a few clusters with a few hundred SanDisk Extreme Pro, but we
> are not yet sure about their
>
> long-time durability as they are only ~9 months old (average of ~1000 write
> IOPS on each disk over that time).
>
> Some of them report only 50-60% lifetime left.
>
>
>
> For NVMe, the Intel NVMe 750 is still a great disk
>
>
>
> Be careful to get these exact models. Seemingly similar disks might be just
> completely bad, for
>
> example, the Samsung PM961 is just unusable for Ceph in our experience.
>
>
>
> Paul
>
>
>
> 2018-07-11 10:14 GMT+02:00 Wido den Hollander 
> mailto:w...@42on.com>>:
>
>
>
> On 07/11/201

Re: [ceph-users] chkdsk /b fails on Ceph iSCSI volume

2018-07-16 Thread Mike Christie
On 07/15/2018 08:08 AM, Wladimir Mutel wrote:
> Hi,
> 
> I cloned an NTFS filesystem with bad blocks from a USB HDD onto a Ceph RBD volume
> (using ntfsclone, so the copy has sparse regions), and decided to clean
> the bad blocks within the copy. I ran chkdsk /b from Windows and it fails on
> free space verification (step 5 of 5).
> In tcmu-runner.log I see that command 8f (SCSI Verify) is not supported.
> Does it mean that I should not try to run chkdsk /b on this volume at
> all ? (it seems that bad blocks were re-verified and cleared)
> Are there any plans to make user:rbd backstore support verify requests ?
> 

I did not know of any apps using the command so it was just not implemented.

I have put it on my TODO list:

https://github.com/open-iscsi/tcmu-runner/issues/445


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs for data drives

2018-07-16 Thread Paul Emmerich
This doesn't look like a good benchmark:

(from the blog post)

dd if=/dev/zero of=/mnt/rawdisk/data.bin bs=1G count=20 oflag=direct

1. it writes compressible data which some SSDs might compress, you should
use urandom
2. that workload does not look like something Ceph will do to your disk,
like not at all

If you want a quick estimate of an SSD in worst-case scenario: run the
usual 4k oflag=direct,dsync test (or better: fio).
A bad SSD will get < 1k IOPS, a good one > 10k

But that doesn't test everything. In particular, performance might degrade
as the disks fill up. Also, it's the absolute
worst-case, i.e., a disk used for multiple journal/wal devices
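
i.e. something along these lines (the device name is a placeholder and the test overwrites data on it):

# fio: 4k synchronous direct writes, queue depth 1
fio --filename=/dev/sdX --name=journal-test --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --direct=1 --sync=1 --runtime=60 --time_based --group_reporting

# or the quick dd variant (note that urandom itself can become the bottleneck on very fast drives)
dd if=/dev/urandom of=/dev/sdX bs=4k count=100000 oflag=direct,dsync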



Paul

2018-07-16 10:09 GMT-04:00 Satish Patel :

> https://blog.cypressxt.net/hello-ceph-and-samsung-850-evo/
>
> On Thu, Jul 12, 2018 at 3:37 AM, Adrian Saul
>  wrote:
> >
> >
> > We started our cluster with consumer (Samsung EVO) disks and the write
> > performance was pitiful, they had periodic spikes in latency (average of
> > 8ms, but much higher spikes) and just did not perform anywhere near
> where we
> > were expecting.
> >
> >
> >
> > When replaced with SM863 based devices the difference was night and day.
> > The DC grade disks held a nearly constant low latency (constantly
> > sub-ms), no
> > spiking and performance was massively better.   For a period I ran both
> > disks in the cluster and was able to graph them side by side with the
> same
> > workload.  This was not even a moderately loaded cluster so I am glad we
> > discovered this before we went full scale.
> >
> >
> >
> > So while you certainly can do cheap and cheerful and let the data
> > availability be handled by Ceph, don’t expect the performance to keep up.
> >
> >
> >
> >
> >
> >
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Satish Patel
> > Sent: Wednesday, 11 July 2018 10:50 PM
> > To: Paul Emmerich 
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] SSDs for data drives
> >
> >
> >
> > Prices going way up if I am picking Samsung SM863a for all data drives.
> >
> >
> >
> > We have many servers running on consumer grade SSD drives and we never
> > noticed any performance issues or any faults so far (but we never used ceph
> > before)
> >
> >
> >
> > I thought that is the whole point of ceph: to provide high availability if a
> > drive goes down, and also parallel reads from multiple osd nodes
> >
> >
> >
> > Sent from my iPhone
> >
> >
> > On Jul 11, 2018, at 6:57 AM, Paul Emmerich 
> wrote:
> >
> > Hi,
> >
> >
> >
> > we‘ve no long-term data for the SM variant.
> >
> > Performance is fine as far as we can tell, but the main difference
> between
> > these two models should be endurance.
> >
> >
> >
> >
> >
> > Also, I forgot to mention that my experiences are only for the 1, 2, and
> 4
> > TB variants. Smaller SSDs are often proportionally slower (especially
> below
> > 500GB).
> >
> >
> >
> > Paul
> >
> >
> > Robert Stanford :
> >
> > Paul -
> >
> >
> >
> >  That's extremely helpful, thanks.  I do have another cluster that uses
> > Samsung SM863a just for journal (spinning disks for data).  Do you
> happen to
> > have an opinion on those as well?
> >
> >
> >
> > On Wed, Jul 11, 2018 at 4:03 AM, Paul Emmerich 
> > wrote:
> >
> > PM/SM863a are usually great disks and should be the default go-to option,
> > they outperform
> >
> > even the more expensive PM1633 in our experience.
> >
> > (But that really doesn't matter if it's for the full OSD and not as
> > dedicated WAL/journal)
> >
> >
> >
> > We got a cluster with a few hundred SanDisk Ultra II (discontinued, I
> > believe) that was built on a budget.
> >
> > Not the best disk but great value. They have been running for ~3 years
> now
> > with very few failures and
> >
> > okayish overall performance.
> >
> >
> >
> > We also got a few clusters with a few hundred SanDisk Extreme Pro, but we
> > are not yet sure about their
> >
> > long-time durability as they are only ~9 months old (average of ~1000
> write
> > IOPS on each disk over that time).
> >
> > Some of them report only 50-60% lifetime left.
> >
> >
> >
> > For NVMe, the Intel NVMe 750 is still a great disk
> >
> >
> >
> > Be careful to get these exact models. Seemingly similar disks might be
> just
> > completely bad, for
> >
> > example, the Samsung PM961 is just unusable for Ceph in our experience.
> >
> >
> >
> > Paul
> >
> >
> >
> > 2018-07-11 10:14 GMT+02:00 Wido den Hollander :
> >
> >
> >
> > On 07/11/2018 10:10 AM, Robert Stanford wrote:
> >>
> >>  In a recent thread the Samsung SM863a was recommended as a journal
> >> SSD.  Are there any recommendations for data SSDs, for people who want
> >> to use just SSDs in a new Ceph cluster?
> >>
> >
> > Depends on what you are looking for, SATA, SAS3 or NVMe?
> >
> > I have very good experiences with these drives running with BlueStore in
> > them in SuperMicro machines:
> >
> > - SATA: Samsung PM863a
> > - SATA: Intel S4500
> > - SAS: Samsung PM1633
> > - NVMe: Samsung PM963
> >
> > Running WAL+DB+DATA with BlueStore on the same drives.

[ceph-users] Luminous 12.2.5 - crushable RGW

2018-07-16 Thread Jakub Jaszewski
Hi,
We run 5 RADOS Gateways on Luminous 12.2.5 as upstream servers in an nginx
active-active setup, based on keepalived.
The cluster is 12 Ceph nodes (16x 10TB OSDs (BlueStore) per node, 2x 10Gb
network links shared by the access and cluster networks); the RGW pool is EC 9+3.

We recently noticed below entries in RGW logs:

2018-07-11 06:19:13.726392 7f2eeed46700  1 == starting new request
req=0x7f2eeed402c0 =
2018-07-11 06:19:13.871358 7f2eeed46700  0 NOTICE: resharding operation on
bucket index detected, blocking
2018-07-11 06:19:58.953816 7f2eeed46700  0 block_while_resharding ERROR:
bucket is still resharding, please retry
2018-07-11 06:19:58.959424 7f2eeed46700  0 NOTICE: resharding operation on
bucket index detected, blocking
2018-07-11 06:20:44.088045 7f2eeed46700  0 block_while_resharding ERROR:
bucket is still resharding, please retry
2018-07-11 06:20:44.090664 7f2eeed46700  0 NOTICE: resharding operation on
bucket index detected, blocking
2018-07-11 06:21:29.141182 7f2eeed46700  0 block_while_resharding ERROR:
bucket is still resharding, please retry
2018-07-11 06:21:29.146598 7f2eeed46700  0 NOTICE: resharding operation on
bucket index detected, blocking
2018-07-11 06:22:14.178369 7f2eeed46700  0 block_while_resharding ERROR:
bucket is still resharding, please retry
2018-07-11 06:22:14.181697 7f2eeed46700  0 NOTICE: resharding operation on
bucket index detected, blocking
2018-07-11 06:22:34.199763 7f2eeed46700  1 == req done
req=0x7f2eeed402c0 op status=0 http_status=200 ==
2018-07-11 06:22:34.199851 7f2eeed46700  1 civetweb: 0x5599a1158000:
10.195.17.6 - - [11/Jul/2018:06:10:11 +] "PUT
/BUCKET/PATH/OBJECT?partNumber=2&uploadId=2~ol_fQw_u7eKRjuP1qVwnj5V12GxDYXu
HTTP/1.1" 200 0 - -

This causes 'upstream timed out (110: Connection timed out) while reading
response header from upstream' errors and 504 response codes on the nginx
side, due to the 30-second timeout.

Other recurring log entries look like:

2018-07-11 06:20:47.407632 7f2e97c98700  1 == starting new request
req=0x7f2e97c922c0 =
2018-07-11 06:20:47.412455 7f2e97c98700  0 NOTICE: resharding operation on
bucket index detected, blocking
2018-07-11 06:21:32.424983 7f2e97c98700  0 block_while_resharding ERROR:
bucket is still resharding, please retry
2018-07-11 06:21:32.426597 7f2e97c98700  0 NOTICE: resharding operation on
bucket index detected, blocking
2018-07-11 06:22:17.67 7f2e97c98700  0 block_while_resharding ERROR:
bucket is still resharding, please retry
2018-07-11 06:22:17.492217 7f2e97c98700  0 NOTICE: resharding operation on
bucket index detected, blocking

2018-07-11 06:22:32.495254 7f2e97c98700  0 ERROR: update_bucket_id()
new_bucket_id=d644765c-1705-49b2-9609-a8511d3c4fed.151639.105 returned
r=-125
2018-07-11 06:22:32.495386 7f2e97c98700  0 WARNING: set_req_state_err
err_no=125 resorting to 500

2018-07-11 06:22:32.495509 7f2e97c98700  1 == req done
req=0x7f2e97c922c0 op status=-125 http_status=500 ==
2018-07-11 06:22:32.495569 7f2e97c98700  1 civetweb: 0x5599a14f4000:
10.195.17.6 - - [11/Jul/2018:06:19:25 +] "POST PUT
/BUCKET/PATH/OBJECT?uploads HTTP/1.1" 500 0 - -


To avoid 504 & 500 responses we disabled dynamic resharding via
'rgw_dynamic_resharding = false'. Not sure if setting the nginx
'proxy_read_timeout' option to a value higher than the bucket resharding
time is a good idea.
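
For reference, a minimal sketch of how that looks in ceph.conf (the RGW
instance name below is a placeholder; the gateways need a restart to pick
up the change):

[client.rgw.gateway-1]
rgw_dynamic_resharding = false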

Once done, 'block_while_resharding ERROR: bucket is still resharding,
please retry' disappeared from the RGW logs; however, another ERROR is now
logged, and then the RGWs catch signal Aborted and get restarted by systemd:

2018-07-13 05:27:31.149618 7f7eb72c7700  1 == starting new request
req=0x7f7eb72c12c0 =
2018-07-13 05:27:52.593413 7f7eb72c7700  0 ERROR: flush_read_list():
d->client_cb->handle_data() returned -5
2018-07-13 05:27:52.594040 7f7eb72c7700  1 == req done
req=0x7f7eb72c12c0 op status=-5 http_status=206 ==
2018-07-13 05:27:52.594633 7f7eb72c7700  1 civetweb: 0x55ab3171b000:
10.195.17.6 - - [13/Jul/2018:05:24:28 +] "GET /BUCKET/PATH/OBJECT_580MB
HTTP/1.1" 206 0 - Hadoop 2.7.3.2.5.3.0-37, aws-sdk-java/1.10.6
Linux/4.4.0-97-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.77-b03/1.8.0_77

We see ~40 such ERRORs (each GET requests a ~580 MB object) prior to the RGW
crash:

2018-07-13 05:21:43.993778 7fcce6575700  1 == starting new request
req=0x7fcce656f2c0 =
2018-07-13 05:22:16.137676 7fcce6575700 -1
/build/ceph-12.2.5/src/common/buffer.cc: In function 'void
ceph::buffer::list::append(const ceph::buffer::ptr&, unsigned int, unsigned
int)' thread 7fcce6575700 time 2018-07-13 05:22:16.135271
/build/ceph-12.2.5/src/common/buffer.cc: 1967: FAILED assert(len+off <=
bp.length())

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7fcd5b7aab72]
 2: (ceph::buffer::list::append(ceph::buffer::ptr const&, unsigned int,
unsigned int)+0x118) [0x7fcd64993cf8]
 3: (RGWPutObj_ObjStore::get_data(ceph::buffer::list&)+0xd

Re: [ceph-users] Jewel PG stuck inconsistent with 3 0-size objects

2018-07-16 Thread Pavan Rallabhandi
Yes, that suggestion worked for us, although we hit this when we upgraded 
from 10.2.7 to 10.2.10.

I guess this was fixed via http://tracker.ceph.com/issues/21440 and 
http://tracker.ceph.com/issues/19404

Thanks,
-Pavan. 

On 7/16/18, 5:07 AM, "ceph-users on behalf of Matthew Vernon" 
 wrote:

Hi,

Our cluster is running 10.2.9 (from Ubuntu; on 16.04 LTS), and we have a
pg that's stuck inconsistent; if I repair it, it logs "failed to pick
suitable auth object" (repair log attached, to try and stop my MUA
mangling it).

We then deep-scrubbed that pg, at which point
rados list-inconsistent-obj 67.2e --format=json-pretty produces a bit of
output (also attached), which includes that all 3 osds have a zero-sized
object e.g.

"osd": 1937,
"errors": [
"omap_digest_mismatch_oi"
],
"size": 0,
"omap_digest": "0x45773901",
"data_digest": "0x"

All 3 osds have different omap_digest, but all have 0 size. Indeed,
looking on the OSD disks directly, each object is 0 size (i.e. they are
identical).

This looks similar to one of the failure modes in
http://tracker.ceph.com/issues/21388 where there is a suggestion (comment
19 from David Zafman) to do:

rados -p default.rgw.buckets.index setomapval
.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key anything
[deep-scrub]
rados -p default.rgw.buckets.index rmomapkey
.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key

Is this likely to be the correct approach here, too? And is there an
underlying bug in ceph that still needs fixing? :)

Thanks,

Matthew



-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [rgw] Very high cache misses with automatic bucket resharding

2018-07-16 Thread Rudenko Aleksandr
Yes, I have tasks in `radosgw-admin reshard list`.

And the object count in .rgw.buckets.index is increasing, slowly.

But I'm a bit confused. I have one big bucket with 161 shards.

…
"max_marker": 
"0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#,32#,33#,34#,35#,36#,37#,38#,39#,40#,41#,42#,43#,44#,45#,46#,47#,48#,49#,50#,51#,52#,53#,54#,55#,56#,57#,58#,59#,60#,61#,62#,63#,64#,65#,66#,67#,68#,69#,70#,71#,72#,73#,74#,75#,76#,77#,78#,79#,80#,81#,82#,83#,84#,85#,86#,87#,88#,89#,90#,91#,92#,93#,94#,95#,96#,97#,98#,99#,100#,101#,102#,103#,104#,105#,106#,107#,108#,109#,110#,111#,112#,113#,114#,115#,116#,117#,118#,119#,120#,121#,122#,123#,124#,125#,126#,127#,128#,129#,130#,131#,132#,133#,134#,135#,136#,137#,138#,139#,140#,141#,142#,143#,144#,145#,146#,147#,148#,149#,150#,151#,152#,153#,154#,155#,156#,157#,158#,159#,160#»,
…

But in reshard list i see:

{
"time": "2018-07-15 21:11:31.290620Z",
"tenant": "",
"bucket_name": "my-bucket",
"bucket_id": "default.32785769.2",
"new_instance_id": "",
"old_num_shards": 1,
"new_num_shards": 162
},

"old_num_shards": 1 - it’s correct?

I hit a lot of problems trying to use auto resharding in 12.2.5

Which problems?

On 16 Jul 2018, at 16:57, Sean Redmond <sean.redmo...@gmail.com> wrote:

Hi,

Do you have ongoing resharding? 'radosgw-admin reshard list' should show you 
the status.

Do you see the number of objects in the .rgw.buckets.index pool increasing?

I hit a lot of problems trying to use auto resharding in 12.2.5 - I have 
disabled it for the moment.

Thanks

[1] https://tracker.ceph.com/issues/24551

On Mon, Jul 16, 2018 at 12:32 PM, Rudenko Aleksandr <arude...@croc.ru> wrote:

Hi, guys.

I use Luminous 12.2.5.

Automatic bucket index resharding has not been activated in the past.

A few days ago I activated auto resharding.

After that and up to now I see:

- very high Ceph read I/O (~300 I/O before activating resharding, ~4k now),
- very high Ceph read bandwidth (50 MB/s before activating resharding, 250 MB/s 
now),
- very high RGW cache misses (400 count/s before activating resharding, ~3.5k 
now).

For Ceph monitoring I use the MGR+Zabbix plugin and the zabbix-template from 
the ceph github repo.
For RGW monitoring I use RGW perf dump and my script.

Why is this happening? When will it end?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resize wal/db

2018-07-16 Thread Igor Fedotov

Hi Zhang,

There is no way to resize the DB while the OSD is running. There is a 
somewhat shorter, "unofficial" but risky, way than redeploying the OSD, 
though. But you'll need to take the specific OSD out for a while in any 
case. You will also need either additional free partition(s), or the 
initial deployment has to have been done using LVM.


See this blog for more details. 
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/
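
Before attempting that, it can help to check how much of the current DB/WAL
space is actually in use (a sketch; osd.0 is a placeholder, run on the host
that carries the OSD):

ceph daemon osd.0 perf dump | grep -E '"(db|wal|slow)_(total|used)_bytes"'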



And I advise trying such things on a non-production cluster first.


Thanks,

Igor


On 7/12/2018 7:03 AM, Shunde Zhang wrote:

Hi Ceph Gurus,

I have installed Ceph Luminous with Bluestore using ceph-ansible.
However, when I did the install, I didn't set the wal/db size, so it ended up 
using the default values, which are quite small: 1G db and 576MB wal.
Note that each OSD node has 12 OSDs and each OSD has a 1.8T spinning disk for 
data. All 12 OSDs share one NVMe M2 SSD for wal/db.
Now the cluster is in use, and after doing some research I want to increase 
the size of the db/wal: 20G db and 1G wal. (Are those reasonable numbers?)
I can delete one OSD and then re-create it with ceph-ansible but that is 
troublesome.
I wonder if there is a (simple) way to increase the size of both db and wal 
when an OSD is running?
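
For reference, the sizes used for newly (re)deployed BlueStore OSDs come from
these ceph.conf options (a sketch using the 20G/1G values mentioned above;
values are in bytes and only apply at OSD creation time, not to existing OSDs):

[osd]
bluestore_block_db_size = 21474836480   # 20 GiB
bluestore_block_wal_size = 1073741824   # 1 GiB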

Thanks in advance,
Shunde.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs for data drives

2018-07-16 Thread Satish Patel
https://blog.cypressxt.net/hello-ceph-and-samsung-850-evo/

On Thu, Jul 12, 2018 at 3:37 AM, Adrian Saul
 wrote:
>
>
> We started our cluster with consumer (Samsung EVO) disks and the write
> performance was pitiful; they had periodic spikes in latency (average of
> 8ms, but much higher spikes) and just did not perform anywhere near where we
> were expecting.
>
>
>
> When replaced with SM863 based devices the difference was night and day.
> The DC grade disks held a nearly constant low latency (constantly sub-ms), no
> spiking and performance was massively better.   For a period I ran both
> disks in the cluster and was able to graph them side by side with the same
> workload.  This was not even a moderately loaded cluster so I am glad we
> discovered this before we went full scale.
>
>
>
> So while you certainly can do cheap and cheerful and let the data
> availability be handled by Ceph, don’t expect the performance to keep up.
>
>
>
>
>
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Satish Patel
> Sent: Wednesday, 11 July 2018 10:50 PM
> To: Paul Emmerich 
> Cc: ceph-users 
> Subject: Re: [ceph-users] SSDs for data drives
>
>
>
> Prices going way up if I am picking Samsung SM863a for all data drives.
>
>
>
> We have many servers running on consumer grade SSD drives and we never
> noticed any performance issues or faults so far (but we never used ceph before)
>
>
>
> I thought that is the whole point of ceph: to provide high availability if
> a drive goes down, plus parallel reads from multiple OSD nodes
>
>
>
> Sent from my iPhone
>
>
> On Jul 11, 2018, at 6:57 AM, Paul Emmerich  wrote:
>
> Hi,
>
>
>
> we've no long-term data for the SM variant.
>
> Performance is fine as far as we can tell, but the main difference between
> these two models should be endurance.
>
>
>
>
>
> Also, I forgot to mention that my experiences are only for the 1, 2, and 4
> TB variants. Smaller SSDs are often proportionally slower (especially below
> 500GB).
>
>
>
> Paul
>
>
> Robert Stanford :
>
> Paul -
>
>
>
>  That's extremely helpful, thanks.  I do have another cluster that uses
> Samsung SM863a just for journal (spinning disks for data).  Do you happen to
> have an opinion on those as well?
>
>
>
> On Wed, Jul 11, 2018 at 4:03 AM, Paul Emmerich 
> wrote:
>
> PM/SM863a are usually great disks and should be the default go-to option,
> they outperform
>
> even the more expensive PM1633 in our experience.
>
> (But that really doesn't matter if it's for the full OSD and not as
> dedicated WAL/journal)
>
>
>
> We got a cluster with a few hundred SanDisk Ultra II (discontinued, I
> believe) that was built on a budget.
>
> Not the best disk but great value. They have been running for ~3 years now
> with very few failures and
>
> okayish overall performance.
>
>
>
> We also got a few clusters with a few hundred SanDisk Extreme Pro, but we
> are not yet sure about their
>
> long-time durability as they are only ~9 months old (average of ~1000 write
> IOPS on each disk over that time).
>
> Some of them report only 50-60% lifetime left.
>
>
>
> For NVMe, the Intel NVMe 750 is still a great disk
>
>
>
> Be careful to get these exact models. Seemingly similar disks might be just
> completely bad, for
>
> example, the Samsung PM961 is just unusable for Ceph in our experience.
>
>
>
> Paul
>
>
>
> 2018-07-11 10:14 GMT+02:00 Wido den Hollander :
>
>
>
> On 07/11/2018 10:10 AM, Robert Stanford wrote:
>>
>>  In a recent thread the Samsung SM863a was recommended as a journal
>> SSD.  Are there any recommendations for data SSDs, for people who want
>> to use just SSDs in a new Ceph cluster?
>>
>
> Depends on what you are looking for, SATA, SAS3 or NVMe?
>
> I have very good experiences with these drives running with BlueStore in
> them in SuperMicro machines:
>
> - SATA: Samsung PM863a
> - SATA: Intel S4500
> - SAS: Samsung PM1633
> - NVMe: Samsung PM963
>
> Running WAL+DB+DATA with BlueStore on the same drives.
>
> Wido
>
>>  Thank you
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
>
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Re: [ceph-users] [rgw] Very high cache misses with automatic bucket resharding

2018-07-16 Thread Sean Redmond
Hi,

Do you have ongoing resharding? 'radosgw-admin reshard list' should show you
the status.

Do you see the number of objects in the .rgw.buckets.index pool increasing?
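
For example (a sketch; the pool name is taken from the earlier message):

rados df | grep buckets.index
# or, more directly:
rados -p .rgw.buckets.index ls | wc -l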

I hit a lot of problems trying to use auto resharding in 12.2.5 - I have
disabled it for the moment.

Thanks

[1] https://tracker.ceph.com/issues/24551

On Mon, Jul 16, 2018 at 12:32 PM, Rudenko Aleksandr 
wrote:

> Hi, guys.
>
> I use Luminous 12.2.5.
>
> Automatic bucket index resharding has not been activated in the past.
>
> A few days ago I activated auto resharding.
>
> After that and up to now I see:
>
> - very high Ceph read I/O (~300 I/O before activating resharding, ~4k now),
> - very high Ceph read bandwidth (50 MB/s before activating resharding, 250
> MB/s now),
> - very high RGW cache misses (400 count/s before activating resharding,
> ~3.5k now).
>
> For Ceph monitoring I use the MGR+Zabbix plugin and the zabbix-template from
> the ceph github repo.
> For RGW monitoring I use RGW perf dump and my script.
>
> Why is this happening? When will it end?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-16 Thread Oliver Schulz

Dear John,


On 16.07.2018 16:25, John Spray wrote:

Since Luminous, you can use an erasure coded pool (on bluestore)
directly as a CephFS data pool, no cache pool needed.


Great! I'll be happy to go without
a cache pool then.


Thanks for your help, John,

Oliver


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safe to use rados -p rbd cleanup?

2018-07-16 Thread Piotr Dałek

On 18-07-16 01:40 PM, Wido den Hollander wrote:



On 07/15/2018 11:12 AM, Mehmet wrote:

hello guys,

in my production cluster i've many objects like this

"#> rados -p rbd ls | grep 'benchmark'"
... .. .
benchmark_data_inkscope.example.net_32654_object1918
benchmark_data_server_26414_object1990
... .. .

Is it safe to run "rados -p rbd cleanup" or is there any risk for my
images?


the cleanup will require more than just that, as you will need to specify
the benchmark prefix as well.


Yes and no. "rados -p rbd cleanup" will try to locate the benchmark metadata 
object and remove only the objects indexed by that metadata. "--prefix" is used 
when the metadata is lost or overwritten.
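
For example (a sketch; the prefix matches the object names quoted above):

rados -p rbd cleanup --prefix benchmark_data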


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safe to use rados -p rbd cleanup?

2018-07-16 Thread Piotr Dałek

On 18-07-15 11:12 AM, Mehmet wrote:

hello guys,

in my production cluster i've many objects like this

"#> rados -p rbd ls | grep 'benchmark'"
... .. .
benchmark_data_inkscope.example.net_32654_object1918
benchmark_data_server_26414_object1990
... .. .

Is it safe to run "rados -p rbd cleanup" or is there any risk for my images?


It'll probably fail due to a hostname mismatch (rados bench write produces 
objects with the caller's hostname embedded in the object name). Try what Wido 
suggested to clean up all benchmark-made objects.

Otherwise yes, it's safe as objects for rbd images are named differently.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safe to use rados -p rbd cleanup?

2018-07-16 Thread Wido den Hollander



On 07/15/2018 11:12 AM, Mehmet wrote:
> hello guys,
> 
> in my production cluster i've many objects like this
> 
> "#> rados -p rbd ls | grep 'benchmark'"
> ... .. .
> benchmark_data_inkscope.example.net_32654_object1918
> benchmark_data_server_26414_object1990
> ... .. .
> 
> Is it safe to run "rados -p rbd cleanup" or is there any risk for my
> images?

the cleanup will require more than just that, as you will need to specify
the benchmark prefix as well.

Why not run:

$ rados -p rbd ls > ls.txt
$ cat ls.txt|grep 'benchmark_data'|xargs -n 1 rados -p rbd rm

That should remove those objects as well. That's how I usually do it.

Wido

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous dynamic resharding, when index max shards already set

2018-07-16 Thread Robert Stanford
I am upgrading my clusters to Luminous.  We are already using rados
gateway, and index max shards has been set for the rgw data pools.  Now we
want to use Luminous dynamic index resharding.  How do we make this
transition?
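
For reference, the main knobs on the Luminous side look like this (a sketch;
how cleanly they interact with a pre-existing
rgw_override_bucket_index_max_shards setting is exactly the open question):

rgw_dynamic_resharding = true      # enabled by default in Luminous
rgw_max_objs_per_shard = 100000    # objects per shard before a reshard is queued

radosgw-admin reshard list
radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<n>   # manual alternative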

 Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [rgw] Very high cache misses with automatic bucket resharding

2018-07-16 Thread Rudenko Aleksandr
Hi, guys.

I use Luminous 12.2.5.

Automatic bucket index resharding has not been activated in the past.

A few days ago I activated auto resharding.

After that and up to now I see:

- very high Ceph read I/O (~300 I/O before activating resharding, ~4k now),
- very high Ceph read bandwidth (50 MB/s before activating resharding, 250 MB/s 
now),
- very high RGW cache misses (400 count/s before activating resharding, ~3.5k 
now).

For Ceph monitoring I use the MGR+Zabbix plugin and the zabbix-template from 
the ceph github repo.
For RGW monitoring I use RGW perf dump and my script.
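
For example, the RGW cache counters can be pulled straight from the admin
socket (a sketch; the client name / socket path depends on how the gateway
was started):

ceph daemon client.rgw.$(hostname -s) perf dump | grep -E '"cache_(hit|miss)"'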

Why is this happening? When will it end?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph issue too many open files.

2018-07-16 Thread Daznis
Hi,

Recently, about ~2 weeks ago, something strange started happening with
one of the ceph clusters I'm managing. It's running ceph jewel 10.2.10
with a cache layer. Some OSDs started crashing with a "too many open
files" error. From looking at the issue I have found that the OSD keeps a
lot of links in /proc/self/fd, and once the 1 million limit is reached it
crashes. I have tried increasing the limit to 2 million, but the same thing
happened. The problem with this is that it's not clearing
/proc/self/fd, as there are about 900k inodes used inside the OSD drive.
Once the OSD is restarted and scrub starts I'm getting missing shard
errors:

2018-07-15 18:32:26.554348 7f604ebd1700 -1 log_channel(cluster) log
[ERR] : 6.58 shard 51 missing
6:1a3a2565:::rbd_data.314da9e52da0f2.d570:head

OSD crash log:
-4> 2018-07-15 17:40:25.566804 7f97143fe700  0
filestore(/var/lib/ceph/osd/ceph-44)  error (24) Too many open files
not handled on operation 0x7f970e0274c0 (5142329351.0.0, or op 0,
counting from 0)
-3> 2018-07-15 17:40:25.566825 7f97143fe700  0
filestore(/var/lib/ceph/osd/ceph-44) unexpected error code
-2> 2018-07-15 17:40:25.566829 7f97143fe700  0
filestore(/var/lib/ceph/osd/ceph-44)  transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "touch",
"collection": "6.f0_head",
"oid": "#-8:0f00:::temp_6.f0_0_55255967_2688:head#"
},
{
"op_num": 1,
"op_name": "write",
"collection": "6.f0_head",
"oid": "#-8:0f00:::temp_6.f0_0_55255967_2688:head#",
"length": 65536,
"offset": 0,
"bufferlist length": 65536
},
{
"op_num": 2,
"op_name": "omap_setkeys",
"collection": "6.f0_head",
"oid": "#6:0f00head#",
"attr_lens": {
"_info": 925
}
}
]
}

-1> 2018-07-15 17:40:25.566886 7f97143fe700 -1 dump_open_fds
unable to open /proc/self/fd
 0> 2018-07-15 17:40:25.569564 7f97143fe700 -1
os/filestore/FileStore.cc: In function 'void
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
ThreadPool::TPHandle*)' thread 7f97143fe700 time 2018-07-15
17:40:25.566888
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

Any insight on how to fix this issue is appreciated.

Regards,
Darius
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jewel PG stuck inconsistent with 3 0-size objects

2018-07-16 Thread Matthew Vernon
Hi,

Our cluster is running 10.2.9 (from Ubuntu; on 16.04 LTS), and we have a
pg that's stuck inconsistent; if I repair it, it logs "failed to pick
suitable auth object" (repair log attached, to try and stop my MUA
mangling it).

We then deep-scrubbed that pg, at which point
rados list-inconsistent-obj 67.2e --format=json-pretty produces a bit of
output (also attached), which includes that all 3 osds have a zero-sized
object e.g.

"osd": 1937,
"errors": [
"omap_digest_mismatch_oi"
],
"size": 0,
"omap_digest": "0x45773901",
"data_digest": "0x"

All 3 osds have different omap_digest, but all have 0 size. Indeed,
looking on the OSD disks directly, each object is 0 size (i.e. they are
identical).

This looks similar to one of the failure modes in
http://tracker.ceph.com/issues/21388 where there is a suggestion (comment
19 from David Zafman) to do:

rados -p default.rgw.buckets.index setomapval
.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key anything
[deep-scrub]
rados -p default.rgw.buckets.index rmomapkey
.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key

Is this likely to be the correct approach here, too? And is there an
underlying bug in ceph that still needs fixing? :)

Thanks,

Matthew



-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

2018-07-16 09:17:33.351755 7f058a047700  0 log_channel(cluster) log [INF] : 
67.2e repair starts
2018-07-16 09:17:51.521378 7f0587842700 -1 log_channel(cluster) log [ERR] : 
67.2e shard 1937: soid 
67:7463f933:::.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6:head 
omap_digest 0x45773901 != omap_digest 0x952ce474 from auth oi 
67:7463f933:::.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6:head(444843'17812260
 osd.1987.0:16910852 dirty|omap|data_digest|omap_digest s 0 uv 17812259 dd 
 od 952ce474 alloc_hint [0 0])
2018-07-16 09:17:51.521463 7f0587842700 -1 log_channel(cluster) log [ERR] : 
67.2e shard 1987: soid 
67:7463f933:::.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6:head 
omap_digest 0xec3afbe != omap_digest 0x45773901 from shard 1937, omap_digest 
0xec3afbe != omap_digest 0x952ce474 from auth oi 
67:7463f933:::.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6:head(444843'17812260
 osd.1987.0:16910852 dirty|omap|data_digest|omap_digest s 0 uv 17812259 dd 
 od 952ce474 alloc_hint [0 0])
2018-07-16 09:17:51.521653 7f0587842700 -1 log_channel(cluster) log [ERR] : 
67.2e shard 2796: soid 
67:7463f933:::.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6:head 
omap_digest 0x5eec6452 != omap_digest 0x45773901 from shard 1937, omap_digest 
0x5eec6452 != omap_digest 0x952ce474 from auth oi 
67:7463f933:::.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6:head(444843'17812260
 osd.1987.0:16910852 dirty|omap|data_digest|omap_digest s 0 uv 17812259 dd 
 od 952ce474 alloc_hint [0 0])
2018-07-16 09:17:51.521702 7f0587842700 -1 log_channel(cluster) log [ERR] : 
67.2e soid 
67:7463f933:::.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6:head: failed 
to pick suitable auth object
2018-07-16 09:17:51.521988 7f0587842700 -1 log_channel(cluster) log [ERR] : 
67.2e repair 4 errors, 0 fixed
{
"epoch": 514919,
"inconsistents": [
{
"object": {
"name": ".dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6",
"nspace": "",
"locator": "",
"snap": "head",
"version": 17812259
},
"errors": [
"omap_digest_mismatch"
],
"union_shard_errors": [
"omap_digest_mismatch_oi"
],
"selected_object_info": 
"67:7463f933:::.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6:head(444843'17812260
 osd.1987.0:16910852 dirty|omap|data_digest|omap_digest s 0 uv 17812259 dd 
 od 952ce474 alloc_hint [0 0])",
"shards": [
{
"osd": 1937,
"errors": [
"omap_digest_mismatch_oi"
],
"size": 0,
"omap_digest": "0x45773901",
"data_digest": "0x"
},
{
"osd": 1987,
"errors": [
"omap_digest_mismatch_oi"
],
"size": 0,
"omap_digest": "0x0ec3afbe",
"data_digest": "0x"
},
{
"osd": 2796,
"errors": [
"omap_digest_mismatch_oi"
],

Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?

2018-07-16 Thread John Spray
On Sun, Jul 15, 2018 at 12:46 PM Oliver Schulz
 wrote:
>
> Dear all,
>
> we're planning a new Ceph cluster, with CephFS as the
> main workload, and would like to use erasure coding to
> use the disks more efficiently. Access pattern will
> probably be more read- than write-heavy, on average.
>
> I don't have any practical experience with erasure-
> coded pools so far.
>
> I'd be glad for any hints / recommendations regarding
> these questions:
>
> * Is an SSD cache pool recommended/necessary for
>CephFS on an erasure-coded HDD pool (using Ceph
>Luminous and BlueStore)?

Since Luminous, you can use an erasure coded pool (on bluestore)
directly as a CephFS data pool, no cache pool needed.
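
A minimal sketch of what that looks like (assuming Luminous, a filesystem
named "cephfs", and the k=6/m=3 profile asked about below; pool name and PG
counts are placeholders):

ceph osd erasure-code-profile set ec63 k=6 m=3 crush-failure-domain=host
ceph osd pool create cephfs_ec_data 1024 1024 erasure ec63
ceph osd pool set cephfs_ec_data allow_ec_overwrites true
ceph fs add_data_pool cephfs cephfs_ec_data
# point a directory at the EC pool via a file layout:
setfattr -n ceph.dir.layout.pool -v cephfs_ec_data /mnt/cephfs/somedir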

John

> * What are good values for k/m for erasure coding in
>practice (assuming a cluster of about 300 OSDs), to
>make things robust and ease maintenance (ability to
>take a few nodes down)? Is k/m = 6/3 a good choice?
>
> * Will it be sufficient to have k+m racks, resp. failure
>domains?
>
>
> Cheers and thanks for any advice,
>
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com