Re: [ceph-users] performance in a small cluster

2019-05-24 Thread Maged Mokhtar

Hi Robert

1) Can you specify how many threads were used in the 4k write rados test?
I suspect that only 16 threads were used, because that is the default, and
the average latency of 2.9 ms gives an average of 344 iops per thread; your
average iops were 5512, and dividing this by 344 we get 16.02. If this is
the case then this is too low: you have 12 OSDs, so you need to use 64 or
128 threads to get a couple of concurrent operations on each OSD to stress
it. Use the -t option to specify the thread count. It is also better if you
can run more than one client process, preferably from different hosts, and
sum the total iops.
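
For example (a sketch; the pool name, runtime and run name are placeholders,
adjust them to your setup):

# 4k writes with 64 concurrent ops, keep the objects for a later read test
rados bench -p rbd 60 write -b 4096 -t 64 --no-cleanup
# 4k random reads against those objects with the same concurrency
rados bench -p rbd 60 rand -t 64
# several clients in parallel: give each its own run name and sum the iops
rados bench -p rbd 60 write -b 4096 -t 64 --run-name client1 --no-cleanup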


2) The read latency you see of 0.4 ms is good. The write latency of 2.9
ms is not very good but not terrible: a fast all-flash bluestore system
should give around 1 to 1.5 ms write latency (i.e. around 600 to 1000
iops per thread), and some users are able to go below 1 ms, but it is not
easy. The disk model as well as tuning your cpu c-states and p-state
frequency will help reduce latency; there are several topics in this
mailing list that go into this in great detail, and also search for a
presentation by Nick Fisk.
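
As a rough starting point (examples only, not verified on your hardware;
check what your distro provides):

# keep cores at their highest frequency
cpupower frequency-set -g performance
# disable deep C-states at runtime (idle states with wakeup latency > 1 us)
cpupower idle-set -D 1
# or persistently via the kernel command line, e.g.:
#   intel_idle.max_cstate=1 processor.max_cstate=1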


3) Running a simple tool like atop while doing the tests can also reveal
a lot about where the bottlenecks are; % utilization of disks and cpu is
important. However, I expect that if you were using only 16 threads, they
will not be highly utilized, as the dominant factor would be latency as
noted earlier.


/Maged


On 24/05/2019 13:22, Robert Sander wrote:

Hi,

we have a small cluster at a customer's site with three nodes and 4 
SSD-OSDs each.

Connected with 10G the system is supposed to perform well.

rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB 
objects but only 20MB/s write and 95MB/s read with 4KB objects.


This is a little bit disappointing as the 4K performance is also seen 
in KVM VMs using RBD.


Is there anything we can do to improve performance with small objects 
/ block sizes?


Jumbo frames have already been enabled.

4MB objects write:

Total time run: 30.218930
Total writes made:  3391
Write size: 4194304
Object size:    4194304
Bandwidth (MB/sec): 448.858
Stddev Bandwidth:   63.5044
Max bandwidth (MB/sec): 552
Min bandwidth (MB/sec): 320
Average IOPS:   112
Stddev IOPS:    15
Max IOPS:   138
Min IOPS:   80
Average Latency(s): 0.142475
Stddev Latency(s):  0.0990132
Max latency(s): 0.814715
Min latency(s): 0.0308732

4MB objects rand read:

Total time run:   30.169312
Total reads made: 7223
Read size:    4194304
Object size:  4194304
Bandwidth (MB/sec):   957.662
Average IOPS: 239
Stddev IOPS:  23
Max IOPS: 272
Min IOPS: 175
Average Latency(s):   0.0653696
Max latency(s):   0.517275
Min latency(s):   0.00201978

4K objects write:

Total time run: 30.002628
Total writes made:  165404
Write size: 4096
Object size:    4096
Bandwidth (MB/sec): 21.5351
Stddev Bandwidth:   2.0575
Max bandwidth (MB/sec): 22.4727
Min bandwidth (MB/sec): 11.0508
Average IOPS:   5512
Stddev IOPS:    526
Max IOPS:   5753
Min IOPS:   2829
Average Latency(s): 0.00290095
Stddev Latency(s):  0.0015036
Max latency(s): 0.0778454
Min latency(s): 0.00174262

4K objects read:

Total time run:   30.000538
Total reads made: 1064610
Read size:    4096
Object size:  4096
Bandwidth (MB/sec):   138.619
Average IOPS: 35486
Stddev IOPS:  3776
Max IOPS: 42208
Min IOPS: 26264
Average Latency(s):   0.000443905
Max latency(s):   0.0123462
Min latency(s):   0.000123081


Regards


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-24 Thread Paul Emmerich
On Sat, May 25, 2019 at 12:30 AM Mark Lehrer  wrote:

> > but only 20MB/s write and 95MB/s read with 4KB objects.
>
> There is copy-on-write overhead for each block, so 4K performance is
> going to be limited no matter what.
>

No snapshots are involved, and he's using rados bench, which operates on
block sizes as specified, so no partial updates are involved.

This workload basically goes straight into the WAL for up to 512 MB, so it's
virtually identical to running the standard fio benchmark for Ceph disks.
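
i.e. roughly the usual single-threaded sync write test (the device name is a
placeholder, and it is destructive, so only run it against an empty disk):

fio --name=wal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based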


>
> However, if your system is like mine the main problem you will run
> into is that Ceph was designed for spinning disks.  Therefore, its
> main goal is to make sure that no individual OSD is doing more than
> one or two things at a time no matter what.  Unfortunately, SSDs
> typically don't show best performance until you are doing 20+
> simultaneous I/Os (especially if you use a small block size).
>

No, there are different defaults for number of threads and other tuning
parameters since Luminous.


>
> You can see this most clearly with iostat (run "iostat -mtxy 1" on one
> of your OSD nodes) and a high queue depth 4K workload.  You'll notice
> that even though the client is trying to do many things at a time, the
> OSD node is practically idle.  Especially problematic is the fact that
> iostat will stay below 1 in the "avgqu-sz" column and the utilization
> % will be very low.  This makes it look like a thread semaphore kind
> of problem to me... and increasing the number of clients doesn't seem
> to make the OSDs work any harder.
>

The RocksDB WAL uses 4 threads/WALs by default IIRC; you can change that
in bluestore_rocksdb_options. Yes, that is often a bottleneck, and it is one
of the standard options to tune to get the most IOPS out of NVMe disks.
Well, that and creating more partitions/OSDs on a single disk.
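
If someone wants to experiment: read the current string from the admin
socket and override it in ceph.conf (which keys are worth touching depends
on the release, so treat any change as an experiment and restart the OSD
afterwards):

ceph daemon osd.0 config get bluestore_rocksdb_options
# [osd] bluestore_rocksdb_options = <default string with your adjustments>

# more OSDs per NVMe device (check that your ceph-volume supports the flag):
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1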


But the main problem is that you want to write your data for real. Many
SSDs are just bad at writing small chunks of data.
These benchmark results simply look like a case of a slow disk.


>
> I still haven't found a good solution unfortunately but definitely
> keep an eye on the queue size and util% in iostat -- SSD bandwidth &
> iops depend on maximizing the number of parallel I/O operations.  If
> anyone has hints on improving Ceph threading I would love to figure
> this one out.
>

Agreed, everyone should monitor util%



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


>
>
> On Fri, May 24, 2019 at 5:23 AM Robert Sander
>  wrote:
> >
> > Hi,
> >
> > we have a small cluster at a customer's site with three nodes and 4
> > SSD-OSDs each.
> > Connected with 10G the system is supposed to perform well.
> >
> > rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB
> > objects but only 20MB/s write and 95MB/s read with 4KB objects.
> >
> > This is a little bit disappointing as the 4K performance is also seen in
> > KVM VMs using RBD.
> >
> > Is there anything we can do to improve performance with small objects /
> > block sizes?
> >
> > Jumbo frames have already been enabled.
> >
> > 4MB objects write:
> >
> > Total time run: 30.218930
> > Total writes made:  3391
> > Write size: 4194304
> > Object size:4194304
> > Bandwidth (MB/sec): 448.858
> > Stddev Bandwidth:   63.5044
> > Max bandwidth (MB/sec): 552
> > Min bandwidth (MB/sec): 320
> > Average IOPS:   112
> > Stddev IOPS:15
> > Max IOPS:   138
> > Min IOPS:   80
> > Average Latency(s): 0.142475
> > Stddev Latency(s):  0.0990132
> > Max latency(s): 0.814715
> > Min latency(s): 0.0308732
> >
> > 4MB objects rand read:
> >
> > Total time run:   30.169312
> > Total reads made: 7223
> > Read size:4194304
> > Object size:  4194304
> > Bandwidth (MB/sec):   957.662
> > Average IOPS: 239
> > Stddev IOPS:  23
> > Max IOPS: 272
> > Min IOPS: 175
> > Average Latency(s):   0.0653696
> > Max latency(s):   0.517275
> > Min latency(s):   0.00201978
> >
> > 4K objects write:
> >
> > Total time run: 30.002628
> > Total writes made:  165404
> > Write size: 4096
> > Object size:4096
> > Bandwidth (MB/sec): 21.5351
> > Stddev Bandwidth:   2.0575
> > Max bandwidth (MB/sec): 22.4727
> > Min bandwidth (MB/sec): 11.0508
> > Average IOPS:   5512
> > Stddev IOPS:526
> > Max IOPS:   5753
> > Min IOPS:   2829
> > Average Latency(s): 0.00290095
> > Stddev Latency(s):  0.0015036
> > Max latency(s): 0.0778454
> > Min latency(s): 0.00174262
> >
> > 4K objects read:
> >
> > Total time run:   30.000538
> > Total reads made: 1064610
> > Read size:4096
> > Object size:  4096
> > Bandwidth (MB/sec):   138.619
> > Average 

Re: [ceph-users] "allow profile rbd" or "profile rbd"

2019-05-24 Thread Jason Dillaman
On Fri, May 24, 2019 at 6:09 PM Marc Roos  wrote:
>
>
> I still have some accounts listing either "allow" or not. What should
> this be? Should this not be kept uniform?

What if the profile in the future adds denials? What does "allow
profile XYZ" (or "deny profile rbd") mean when it has other embedded
logic? That was at least the thought about how to address a grouped
ACL (i.e. the "allow" prefix doesn't make much sense).
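
To make the existing entries uniform, the caps can be rewritten in place,
e.g. something like this (client name and pools taken from the quoted
keyring below):

ceph auth caps client.xxx.xx mon 'profile rbd' \
    osd 'profile rbd pool=rbd, profile rbd pool=rbd.ssd'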

>
>
> [client.xxx.xx]
>  key = xxx
>  caps mon = "allow profile rbd"
>  caps osd = "profile rbd pool=rbd,profile rbd pool=rbd.ssd"
>
>
>
> [client.xxx]
>  key = 
>  caps mon = "profile rbd"
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-24 Thread Mark Lehrer
> but only 20MB/s write and 95MB/s read with 4KB objects.

There is copy-on-write overhead for each block, so 4K performance is
going to be limited no matter what.

However, if your system is like mine the main problem you will run
into is that Ceph was designed for spinning disks.  Therefore, its
main goal is to make sure that no individual OSD is doing more than
one or two things at a time no matter what.  Unfortunately, SSDs
typically don't show best performance until you are doing 20+
simultaneous I/Os (especially if you use a small block size).

You can see this most clearly with iostat (run "iostat -mtxy 1" on one
of your OSD nodes) and a high queue depth 4K workload.  You'll notice
that even though the client is trying to do many things at a time, the
OSD node is practically idle.  Especially problematic is the fact that
iostat will stay below 1 in the "avgqu-sz" column and the utilization
% will be very low.  This makes it look like a thread semaphore kind
of problem to me... and increasing the number of clients doesn't seem
to make the OSDs work any harder.

I still haven't found a good solution unfortunately but definitely
keep an eye on the queue size and util% in iostat -- SSD bandwidth &
iops depend on maximizing the number of parallel I/O operations.  If
anyone has hints on improving Ceph threading I would love to figure
this one out.


On Fri, May 24, 2019 at 5:23 AM Robert Sander
 wrote:
>
> Hi,
>
> we have a small cluster at a customer's site with three nodes and 4
> SSD-OSDs each.
> Connected with 10G the system is supposed to perform well.
>
> rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB
> objects but only 20MB/s write and 95MB/s read with 4KB objects.
>
> This is a little bit disappointing as the 4K performance is also seen in
> KVM VMs using RBD.
>
> Is there anything we can do to improve performance with small objects /
> block sizes?
>
> Jumbo frames have already been enabled.
>
> 4MB objects write:
>
> Total time run: 30.218930
> Total writes made:  3391
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 448.858
> Stddev Bandwidth:   63.5044
> Max bandwidth (MB/sec): 552
> Min bandwidth (MB/sec): 320
> Average IOPS:   112
> Stddev IOPS:15
> Max IOPS:   138
> Min IOPS:   80
> Average Latency(s): 0.142475
> Stddev Latency(s):  0.0990132
> Max latency(s): 0.814715
> Min latency(s): 0.0308732
>
> 4MB objects rand read:
>
> Total time run:   30.169312
> Total reads made: 7223
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   957.662
> Average IOPS: 239
> Stddev IOPS:  23
> Max IOPS: 272
> Min IOPS: 175
> Average Latency(s):   0.0653696
> Max latency(s):   0.517275
> Min latency(s):   0.00201978
>
> 4K objects write:
>
> Total time run: 30.002628
> Total writes made:  165404
> Write size: 4096
> Object size:4096
> Bandwidth (MB/sec): 21.5351
> Stddev Bandwidth:   2.0575
> Max bandwidth (MB/sec): 22.4727
> Min bandwidth (MB/sec): 11.0508
> Average IOPS:   5512
> Stddev IOPS:526
> Max IOPS:   5753
> Min IOPS:   2829
> Average Latency(s): 0.00290095
> Stddev Latency(s):  0.0015036
> Max latency(s): 0.0778454
> Min latency(s): 0.00174262
>
> 4K objects read:
>
> Total time run:   30.000538
> Total reads made: 1064610
> Read size:4096
> Object size:  4096
> Bandwidth (MB/sec):   138.619
> Average IOPS: 35486
> Stddev IOPS:  3776
> Max IOPS: 42208
> Min IOPS: 26264
> Average Latency(s):   0.000443905
> Max latency(s):   0.0123462
> Min latency(s):   0.000123081
>
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-24 Thread Paul Emmerich
On Fri, May 24, 2019 at 3:27 PM Robert Sander 
wrote:

> Am 24.05.19 um 14:43 schrieb Paul Emmerich:
> > 20 MB/s at 4K blocks is ~5000 iops, that's 1250 IOPS per SSD (assuming
> > replica 3).
> >
> > What we usually check in scenarios like these:
> >
> > * SSD model? Lots of cheap SSDs simply can't handle more than that
>
> The system has been newly created and is not busy at all.
>
> We tested a single SSD without OSD on top with fio: it can do 50K IOPS
> read and 16K IOPS write.
>

If you tell us the disk model someone here might be able to share their
experiences with that disk.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


>
> > * Get some proper statistics such as OSD latencies, disk IO utilization,
> > etc. A benchmark without detailed performance data doesn't really help
> > to debug such a problem
>
> Yes, that is correct, we will try to setup a perfdata gathering system.
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed Disk simulation question

2019-05-24 Thread solarflow99
I think a deep scrub would eventually catch this, right?
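
e.g. kicking one off by hand rather than waiting for the schedule (IDs are
placeholders):

ceph pg deep-scrub <pgid>
ceph osd deep-scrub <osd-id>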


On Wed, May 22, 2019 at 2:56 AM Eugen Block  wrote:

> Hi Alex,
>
> > The cluster has been idle at the moment being new and all.  I
> > noticed some disk related errors in dmesg but that was about it.
> > It looked to me for the next 20 - 30 minutes the failure has not
> > been detected.  All osds were up and in and health was OK. OSD logs
> > had no smoking gun either.
> > After 30 minutes, I restarted the OSD container and it failed to
> > start as expected.
>
> if the cluster doesn't have to read or write to specific OSDs (or
> sectors on that OSD) the failure won't be detected immediately. We had
> an issue last year where one of the SSDs (used for rocksdb and wal)
> had a failure, but that was never reported. We discovered that when we
> tried to migrate the lvm to a new device and got read errors.
>
> > Later on, I performed the same operation during the fio bench mark
> > and OSD failed immediately.
>
> This confirms our experience, if there's data to read/write on that
> disk the failure will be detected.
> Please note that this was in a Luminous cluster, I don't know if and
> how Nautilus has improved in sensing disk failures.
>
> Regards,
> Eugen
>
>
> Zitat von Alex Litvak :
>
> > Hello cephers,
> >
> > I know that there was similar question posted 5 years ago.  However
> > the answer was inconclusive for me.
> > I installed a new Nautilus 14.2.1 cluster and started pre-production
> > testing.  I followed RedHat document and simulated a soft disk
> > failure by
> >
> > #  echo 1 > /sys/block/sdc/device/delete
> >
> > The cluster has been idle at the moment being new and all.  I
> > noticed some disk related errors in dmesg but that was about it.
> > It looked to me for the next 20 - 30 minutes the failure has not
> > been detected.  All osds were up and in and health was OK. OSD logs
> > had no smoking gun either.
> > After 30 minutes, I restarted the OSD container and it failed to
> > start as expected.
> >
> > Later on, I performed the same operation during the fio bench mark
> > and OSD failed immediately.
> >
> > My question is:  Should the disk problem have been detected quick
> > enough even on the idle cluster? I thought Nautilus has the means to
> > sense failure before intensive IO hit the disk.
> > Am I wrong to expect that?
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent number of pools

2019-05-24 Thread Michel Raabe

On 20.05.19 13:04, Lars Täuber wrote:

Mon, 20 May 2019 10:52:14 +
Eugen Block  ==> ceph-users@lists.ceph.com :

Hi, have you tried 'ceph health detail'?



No I hadn't. Thanks for the hint.


You can also try

$ rados lspools
$ ceph osd pool ls

and verify that with the pgs

$ ceph pg ls --format=json-pretty | jq -r '.pg_stats[].pgid' | cut -d. 
-f1 | uniq



hth
michel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-24 Thread Robert LeBlanc
On Fri, May 24, 2019 at 6:26 AM Robert Sander 
wrote:

> Am 24.05.19 um 14:43 schrieb Paul Emmerich:
> > 20 MB/s at 4K blocks is ~5000 iops, that's 1250 IOPS per SSD (assuming
> > replica 3).
> >
> > What we usually check in scenarios like these:
> >
> > * SSD model? Lots of cheap SSDs simply can't handle more than that
>
> The system has been newly created and is not busy at all.
>
> We tested a single SSD without OSD on top with fio: it can do 50K IOPS
> read and 16K IOPS write.
>

You probably tested with async writes; try passing sync to fio, as that is
much closer to what Ceph will do: it syncs every write to make sure it
is written to disk before acknowledging back to the client that the write
is done. When I did these tests, I also filled the entire drive and ran the
test for an hour. Most drives looked fine with short tests or small
amounts of data, but once the drive started getting full, the performance
dropped off a cliff. Considering that Ceph is really hard on drives, it's
good to test the extreme.
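
Something along these lines (a sketch, destructive, only against a disk you
can wipe; the device name is a placeholder):

# fill the device first, then measure steady-state sync 4k writes
fio --name=fill --filename=/dev/sdX --direct=1 --rw=write --bs=1M
fio --name=sync-write --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=3600 --time_based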

Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] large omap object in usage_log_pool

2019-05-24 Thread Casey Bodley



On 5/24/19 1:15 PM, shubjero wrote:

Thanks for chiming in Konstantin!

Wouldn't setting this value to 0 disable the sharding?

Reference: http://docs.ceph.com/docs/mimic/radosgw/config-ref/

rgw override bucket index max shards
Description:Represents the number of shards for the bucket index
object, a value of zero indicates there is no sharding. It is not
recommended to set a value too large (e.g. thousand) as it increases
the cost for bucket listing. This variable should be set in the client
or global sections so that it is automatically applied to
radosgw-admin commands.
Type:Integer
Default:0

rgw dynamic resharding is enabled:
ceph daemon mon.controller1 config show | grep rgw_dynamic_resharding
 "rgw_dynamic_resharding": "true",

I'd like to know more about the purpose of our .usage pool and the
'usage_log_pool' in general, as I can't find much about this component
of Ceph.


You can find docs for the usage log at 
http://docs.ceph.com/docs/master/radosgw/admin/#usage


Unless trimmed, the usage log will continue to grow. If you aren't using 
it, I'd recommend turning it off and trimming it all.
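
For example (dates are placeholders; double-check the option name and trim
arguments against your release before relying on this):

# stop recording usage, e.g. in ceph.conf on the rgw hosts
rgw enable usage log = false
# then trim the existing entries
radosgw-admin usage trim --start-date=2011-01-01 --end-date=2019-05-24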




On Thu, May 23, 2019 at 11:24 PM Konstantin Shalygin  wrote:

in the config.
```"rgw_override_bucket_index_max_shards": "8",```. Should this be
increased?

Should be decreased to default `0`, I think.

Modern Ceph releases resolve large omaps automatically via bucket dynamic 
resharding:

```

{
 "option": {
 "name": "rgw_dynamic_resharding",
 "type": "bool",
 "level": "basic",
 "desc": "Enable dynamic resharding",
 "long_desc": "If true, RGW will dynamicall increase the number of shards in 
buckets that have a high number of objects per shard.",
 "default": true,
 "daemon_default": "",
 "tags": [],
 "services": [
 "rgw"
 ],
 "see_also": [
 "rgw_max_objs_per_shard"
 ],
 "min": "",
 "max": ""
 }
}
```

```

{
 "option": {
 "name": "rgw_max_objs_per_shard",
 "type": "int64_t",
 "level": "basic",
 "desc": "Max objects per shard for dynamic resharding",
 "long_desc": "This is the max number of objects per bucket index shard that 
RGW will allow with dynamic resharding. RGW will trigger an automatic reshard operation on the 
bucket if it exceeds this number.",
 "default": 10,
 "daemon_default": "",
 "tags": [],
 "services": [
 "rgw"
 ],
 "see_also": [
 "rgw_dynamic_resharding"
 ],
 "min": "",
 "max": ""
 }
}
```


So when your bucket reaches the next 100k objects, rgw will shard this bucket
automatically.

Some old buckets may not be sharded, like your ancient ones from Giant. You can
check the fill status like this: `radosgw-admin bucket limit check | jq '.[]'`. If
some buckets are not resharded you can shard them by hand via `radosgw-admin
reshard add ...`. Also, there may be some stale reshard instances (fixed ~ in
12.2.11); you can check them via `radosgw-admin reshard stale-instances list` and
then remove them via `radosgw-admin reshard stale-instances rm`.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS object mapping.

2019-05-24 Thread Robert LeBlanc
On Fri, May 24, 2019 at 2:14 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
> On 5/22/19 5:53 PM, Robert LeBlanc wrote:
>
> When you say 'some' is it a fixed offset that the file data starts? Is the
> first stripe just metadata?
>
> No, the first stripe contains the first 4 MB of a file by default,. The
> xattr and omap data are stored separately.
>
Ahh, so it must be in the XFS xattrs, that makes sense.

For future posterity, I combined a couple of your commands to remove the
temporary intermediate file for others who may run across this.

rados -p  getxattr . parent
| ceph-dencoder type inode_backtrace_t import - decode dump_json


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] large omap object in usage_log_pool

2019-05-24 Thread shubjero
Thanks for chiming in Konstantin!

Wouldn't setting this value to 0 disable the sharding?

Reference: http://docs.ceph.com/docs/mimic/radosgw/config-ref/

rgw override bucket index max shards
Description:Represents the number of shards for the bucket index
object, a value of zero indicates there is no sharding. It is not
recommended to set a value too large (e.g. thousand) as it increases
the cost for bucket listing. This variable should be set in the client
or global sections so that it is automatically applied to
radosgw-admin commands.
Type:Integer
Default:0

rgw dynamic resharding is enabled:
ceph daemon mon.controller1 config show | grep rgw_dynamic_resharding
"rgw_dynamic_resharding": "true",

I'd like to know more about the purpose of our .usage pool and the
'usage_log_pool' in general, as I can't find much about this component
of Ceph.


On Thu, May 23, 2019 at 11:24 PM Konstantin Shalygin  wrote:
>
> in the config.
> ```"rgw_override_bucket_index_max_shards": "8",```. Should this be
> increased?
>
> Should be decreased to default `0`, I think.
>
> Modern Ceph releases resolve large omaps automatically via bucket dynamic 
> resharding:
>
> ```
>
> {
> "option": {
> "name": "rgw_dynamic_resharding",
> "type": "bool",
> "level": "basic",
> "desc": "Enable dynamic resharding",
> "long_desc": "If true, RGW will dynamicall increase the number of 
> shards in buckets that have a high number of objects per shard.",
> "default": true,
> "daemon_default": "",
> "tags": [],
> "services": [
> "rgw"
> ],
> "see_also": [
> "rgw_max_objs_per_shard"
> ],
> "min": "",
> "max": ""
> }
> }
> ```
>
> ```
>
> {
> "option": {
> "name": "rgw_max_objs_per_shard",
> "type": "int64_t",
> "level": "basic",
> "desc": "Max objects per shard for dynamic resharding",
> "long_desc": "This is the max number of objects per bucket index 
> shard that RGW will allow with dynamic resharding. RGW will trigger an 
> automatic reshard operation on the bucket if it exceeds this number.",
> "default": 10,
> "daemon_default": "",
> "tags": [],
> "services": [
> "rgw"
> ],
> "see_also": [
> "rgw_dynamic_resharding"
> ],
> "min": "",
> "max": ""
> }
> }
> ```
>
>
> So when your bucket reaches the next 100k objects, rgw will shard this bucket
> automatically.
>
> Some old buckets may not be sharded, like your ancient ones from Giant. You can
> check the fill status like this: `radosgw-admin bucket limit check | jq '.[]'`.
> If some buckets are not resharded you can shard them by hand via `radosgw-admin
> reshard add ...`. Also, there may be some stale reshard instances (fixed ~ in
> 12.2.11); you can check them via `radosgw-admin reshard stale-instances list`
> and then remove them via `radosgw-admin reshard stale-instances rm`.
>
>
>
> k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "allow profile rbd" or "profile rbd"

2019-05-24 Thread Marc Roos


I still have some accounts listing either "allow" or not. What should
this be? Should this not be kept uniform?



[client.xxx.xx]
 key = xxx
 caps mon = "allow profile rbd"
 caps osd = "profile rbd pool=rbd,profile rbd pool=rbd.ssd"



[client.xxx]
 key = 
 caps mon = "profile rbd"





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Robert LeBlanc
I'd say that if you can't find that object in Rados, then your assumption
may be good. I haven't run into this problem before. Try doing a Rados get
for that object and see if you get anything. I've done a Rados list
grepping for the hex inode, but it took almost two days on our cluster that
had half a billion objects. Your cluster may be faster.
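
For example, with the pool and hex inode from the earlier mails (the oid
placeholder comes from the list_missing output; the listing can take a very
long time on a big pool):

rados -p ec31 ls | grep 10004dfce92
rados -p ec31 get <oid from list_missing> /tmp/chunk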

Sent from a mobile device, please excuse any typos.

On Fri, May 24, 2019, 8:21 AM Kevin Flöh  wrote:

> ok this just gives me:
>
> error getting xattr ec31/10004dfce92./parent: (2) No such file or
> directory
>
> Does this mean that the lost object isn't even a file that appears in the
> ceph directory. Maybe a leftover of a file that has not been deleted
> properly? It wouldn't be an issue to mark the object as lost in that case.
> On 24.05.19 5:08 nachm., Robert LeBlanc wrote:
>
> You need to use the first stripe of the object as that is the only one
> with the metadata.
>
> Try "rados -p ec31 getxattr 10004dfce92. parent" instead.
>
> Robert LeBlanc
>
> Sent from a mobile device, please excuse any typos.
>
> On Fri, May 24, 2019, 4:42 AM Kevin Flöh  wrote:
>
>> Hi,
>>
>> we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" but
>> this is just hanging forever if we are looking for unfound objects. It
>> works fine for all other objects.
>>
>> We also tried scanning the ceph directory with find -inum 1099593404050
>> (decimal of 10004dfce92) and found nothing. This is also working for non
>> unfound objects.
>>
>> Is there another way to find the corresponding file?
>> On 24.05.19 11:12 vorm., Burkhard Linke wrote:
>>
>> Hi,
>> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>>
>> We got the object ids of the missing objects with ceph pg 1.24c
>> list_missing:
>>
>> {
>> "offset": {
>> "oid": "",
>> "key": "",
>> "snapid": 0,
>> "hash": 0,
>> "max": 0,
>> "pool": -9223372036854775808,
>> "namespace": ""
>> },
>> "num_missing": 1,
>> "num_unfound": 1,
>> "objects": [
>> {
>> "oid": {
>> "oid": "10004dfce92.003d",
>> "key": "",
>> "snapid": -2,
>> "hash": 90219084,
>> "max": 0,
>> "pool": 1,
>> "namespace": ""
>> },
>> "need": "46950'195355",
>> "have": "0'0",
>> "flags": "none",
>> "locations": [
>> "36(3)",
>> "61(2)"
>> ]
>> }
>> ],
>> "more": false
>> }
>>
>> we want to give up those objects with:
>>
>> ceph pg 1.24c mark_unfound_lost revert
>>
>> But first we would like to know which file(s) is affected. Is there a way to 
>> map the object id to the corresponding file?
>>
>>
>> The object name is composed of the file inode id and the chunk within the
>> file. The first chunk has some metadata you can use to retrieve the
>> filename. See the 'CephFS object mapping' thread on the mailing list for
>> more information.
>>
>>
>> Regards,
>>
>> Burkhard
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Kevin Flöh

ok this just gives me:

error getting xattr ec31/10004dfce92./parent: (2) No such file 
or directory


Does this mean that the lost object isn't even a file that appears in
the ceph directory? Maybe a leftover of a file that has not been deleted
properly? It wouldn't be an issue to mark the object as lost in that case.


On 24.05.19 5:08 nachm., Robert LeBlanc wrote:
You need to use the first stripe of the object as that is the only one 
with the metadata.


Try "rados -p ec31 getxattr 10004dfce92. parent" instead.

Robert LeBlanc

Sent from a mobile device, please excuse any typos.

On Fri, May 24, 2019, 4:42 AM Kevin Flöh wrote:


Hi,

we already tried "rados -p ec31 getxattr 10004dfce92.003d
parent" but this is just hanging forever if we are looking for
unfound objects. It works fine for all other objects.

We also tried scanning the ceph directory with find -inum
1099593404050 (decimal of 10004dfce92) and found nothing. This is
also working for non unfound objects.

Is there another way to find the corresponding file?

On 24.05.19 11:12 vorm., Burkhard Linke wrote:


Hi,

On 5/24/19 9:48 AM, Kevin Flöh wrote:


We got the object ids of the missing objects with ceph pg 1.24c
list_missing:

{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "10004dfce92.003d",
    "key": "",
    "snapid": -2,
    "hash": 90219084,
    "max": 0,
    "pool": 1,
    "namespace": ""
    },
    "need": "46950'195355",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "36(3)",
    "61(2)"
    ]
    }
    ],
    "more": false
}

we want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) is affected. Is there a way
to map the object id to the corresponding file?



The object name is composed of the file inode id and the chunk
within the file. The first chunk has some metadata you can use to
retrieve the filename. See the 'CephFS object mapping' thread on
the mailing list for more information.


Regards,

Burkhard



___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Robert LeBlanc
You need to use the first stripe of the object as that is the only one with
the metadata.

Try "rados -p ec31 getxattr 10004dfce92. parent" instead.

Robert LeBlanc

Sent from a mobile device, please excuse any typos.

On Fri, May 24, 2019, 4:42 AM Kevin Flöh  wrote:

> Hi,
>
> we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" but
> this is just hanging forever if we are looking for unfound objects. It
> works fine for all other objects.
>
> We also tried scanning the ceph directory with find -inum 1099593404050
> (decimal of 10004dfce92) and found nothing. This is also working for non
> unfound objects.
>
> Is there another way to find the corresponding file?
> On 24.05.19 11:12 vorm., Burkhard Linke wrote:
>
> Hi,
> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>
> We got the object ids of the missing objects with ceph pg 1.24c
> list_missing:
>
> {
> "offset": {
> "oid": "",
> "key": "",
> "snapid": 0,
> "hash": 0,
> "max": 0,
> "pool": -9223372036854775808,
> "namespace": ""
> },
> "num_missing": 1,
> "num_unfound": 1,
> "objects": [
> {
> "oid": {
> "oid": "10004dfce92.003d",
> "key": "",
> "snapid": -2,
> "hash": 90219084,
> "max": 0,
> "pool": 1,
> "namespace": ""
> },
> "need": "46950'195355",
> "have": "0'0",
> "flags": "none",
> "locations": [
> "36(3)",
> "61(2)"
> ]
> }
> ],
> "more": false
> }
>
> we want to give up those objects with:
>
> ceph pg 1.24c mark_unfound_lost revert
>
> But first we would like to know which file(s) is affected. Is there a way to 
> map the object id to the corresponding file?
>
>
> The object name is composed of the file inode id and the chunk within the
> file. The first chunk has some metadata you can use to retrieve the
> filename. See the 'CephFS object mapping' thread on the mailing list for
> more information.
>
>
> Regards,
>
> Burkhard
>
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Guillaume Chenuet
Hi,

Thanks for your answers.
I recreated the OSD and I'll monitor the disk health (currently OK).

Thanks a lot,
Guillaume

On Fri, 24 May 2019 at 15:56, Igor Fedotov  wrote:

> Hi Guillaume,
>
> Could you please set debug-bluefs to 20, restart OSD and collect the whole
> log.
>
>
> Thanks,
>
> Igor
> On 5/24/2019 4:50 PM, Guillaume Chenuet wrote:
>
> Hi,
>
> We are running a Ceph cluster with 36 OSD splitted on 3 servers (12 OSD
> per server) and Ceph version
> 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable).
>
> This cluster is used by an OpenStack private cloud and deployed with
> OpenStack Kolla. Every OSD ran into a Docker container on the server and
> MON, MGR, MDS, and RGW are running on 3 other servers.
>
> This week, one OSD crashed and failed to restart, with this stack trace:
>
>  Running command: '/usr/bin/ceph-osd -f --public-addr 10.106.142.30
> --cluster-addr 10.106.142.30 -i 35'
> + exec /usr/bin/ceph-osd -f --public-addr 10.106.142.30 --cluster-addr
> 10.106.142.30 -i 35
> starting osd.35 at - osd_data /var/lib/ceph/osd/ceph-35
> /var/lib/ceph/osd/ceph-35/journal
> /builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: In function
> 'int BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*,
> uint64_t, size_t, ceph::bufferlist*, char*)' thread 7efd088d6d80 time
> 2019-05-24 05:40:47.799918
> /builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: 1000:
> FAILED assert(r == 0)
>  ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x556f7833f8f0]
>  2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*,
> unsigned long, unsigned long, ceph::buffer::list*, char*)+0xca4)
> [0x556f782b5574]
>  3: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
>  4: (BlueFS::mount()+0x1d4) [0x556f782cc014]
>  5: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
>  6: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
>  7: (OSD::init()+0x3bd) [0x556f77dbbaed]
>  8: (main()+0x2d07) [0x556f77cbe667]
>  9: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
>  10: (()+0x4c1f73) [0x556f77d5ef73]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
> *** Caught signal (Aborted) **
>  in thread 7efd088d6d80 thread_name:ceph-osd
>  ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
> (stable)
>  1: (()+0xa63931) [0x556f78300931]
>  2: (()+0xf5d0) [0x7efd05f995d0]
>  3: (gsignal()+0x37) [0x7efd04fba207]
>  4: (abort()+0x148) [0x7efd04fbb8f8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x284) [0x556f7833fa64]
>  6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*,
> unsigned long, unsigned long, ceph::buffer::list*, char*)+0xca4)
> [0x556f782b5574]
>  7: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
>  8: (BlueFS::mount()+0x1d4) [0x556f782cc014]
>  9: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
>  10: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
>  11: (OSD::init()+0x3bd) [0x556f77dbbaed]
>  12: (main()+0x2d07) [0x556f77cbe667]
>  13: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
>  14: (()+0x4c1f73) [0x556f77d5ef73]
>
> The cluster health is OK and Ceph sees this OSD as shutdown.
>
> I tried to find more information on the internet about this error without
> luck.
> Do you have any idea or input about this error, please?
>
> Thanks,
> Guillaume
>
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
Guillaume Chenuet
*DevOps Engineer Productivity*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RFC: relicence Ceph LGPL-2.1 code as LGPL-2.1 or LGPL-3.0

2019-05-24 Thread Sage Weil
On Fri, 10 May 2019, Robin H. Johnson wrote:
> On Fri, May 10, 2019 at 02:27:11PM +, Sage Weil wrote:
> > If you are a Ceph developer who has contributed code to Ceph and object to 
> > this change of license, please let us know, either by replying to this 
> > message or by commenting on that pull request.
> Am I correct in reading the diff that only a very small number of files
> did not already have the 'or later' clause of *GPL in effect?

To the contrary, I think one file (the COPYING file) has one line as a 
catch-all for everything (that isn't a special case) which is changing 
from 2.1 to 2.1 or 3.

https://github.com/ceph/ceph/pull/22446/files#diff-7116ef0705885343c9e1b2171a06be0eR6

> As a slight tangent, can we get SPDX tags on files rather than this
> hard-to-parse text?

(/me googles SPDX)

Sure?  The current format is based on the Debian copyright file format, 
which seemed appropriate at the time.  Happy to take patches that add more 
appropriate annotations...

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Igor Fedotov

Hi Guillaume,

Could you please set debug-bluefs to 20, restart OSD and collect the 
whole log.
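
For example (a sketch; since the OSD dies during startup, setting it in
ceph.conf for that OSD is the easiest way):

[osd.35]
    debug bluefs = 20/20

Then restart the OSD container and grab its log (the log path inside a
kolla container may differ from the usual /var/log/ceph/ceph-osd.35.log).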



Thanks,

Igor

On 5/24/2019 4:50 PM, Guillaume Chenuet wrote:

Hi,

We are running a Ceph cluster with 36 OSD splitted on 3 servers (12 
OSD per server) and Ceph version 
12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable).


This cluster is used by an OpenStack private cloud and deployed with 
OpenStack Kolla. Every OSD ran into a Docker container on the server 
and MON, MGR, MDS, and RGW are running on 3 other servers.


This week, one OSD crashed and failed to restart, with this stack trace:

 Running command: '/usr/bin/ceph-osd -f --public-addr 10.106.142.30 
--cluster-addr 10.106.142.30 -i 35'
+ exec /usr/bin/ceph-osd -f --public-addr 10.106.142.30 --cluster-addr 
10.106.142.30 -i 35
starting osd.35 at - osd_data /var/lib/ceph/osd/ceph-35 
/var/lib/ceph/osd/ceph-35/journal
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: In 
function 'int BlueFS::_read(BlueFS::FileReader*, 
BlueFS::FileReaderBuffer*, uint64_t, size_t, ceph::bufferlist*, 
char*)' thread 7efd088d6d80 time 2019-05-24 05:40:47.799918
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: 1000: 
FAILED assert(r == 0)
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) 
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x556f7833f8f0]
 2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, 
unsigned long, unsigned long, ceph::buffer::list*, char*)+0xca4) 
[0x556f782b5574]

 3: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
 4: (BlueFS::mount()+0x1d4) [0x556f782cc014]
 5: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
 6: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
 7: (OSD::init()+0x3bd) [0x556f77dbbaed]
 8: (main()+0x2d07) [0x556f77cbe667]
 9: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
 10: (()+0x4c1f73) [0x556f77d5ef73]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.

*** Caught signal (Aborted) **
 in thread 7efd088d6d80 thread_name:ceph-osd
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) 
luminous (stable)

 1: (()+0xa63931) [0x556f78300931]
 2: (()+0xf5d0) [0x7efd05f995d0]
 3: (gsignal()+0x37) [0x7efd04fba207]
 4: (abort()+0x148) [0x7efd04fbb8f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x284) [0x556f7833fa64]
 6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, 
unsigned long, unsigned long, ceph::buffer::list*, char*)+0xca4) 
[0x556f782b5574]

 7: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
 8: (BlueFS::mount()+0x1d4) [0x556f782cc014]
 9: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
 10: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
 11: (OSD::init()+0x3bd) [0x556f77dbbaed]
 12: (main()+0x2d07) [0x556f77cbe667]
 13: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
 14: (()+0x4c1f73) [0x556f77d5ef73]

The cluster health is OK and Ceph sees this OSD as shutdown.

I tried to find more information on the internet about this error 
without luck.

Do you have any idea or input about this error, please?

Thanks,
Guillaume


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Paul Emmerich
Disk got corrupted, it might be dead. Check kernel log for errors and SMART
reallocated sector count or errors.
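
For example (the device name is a placeholder):

dmesg -T | grep -iE 'sdX|i/o error|medium error'
smartctl -H /dev/sdX
smartctl -A /dev/sdX | grep -iE 'realloc|pending|uncorrect'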

If the disk is still good: simply re-create the OSD.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, May 24, 2019 at 3:51 PM Guillaume Chenuet <
guillaume.chen...@schibsted.com> wrote:

> Hi,
>
> We are running a Ceph cluster with 36 OSD splitted on 3 servers (12 OSD
> per server) and Ceph version
> 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable).
>
> This cluster is used by an OpenStack private cloud and deployed with
> OpenStack Kolla. Every OSD ran into a Docker container on the server and
> MON, MGR, MDS, and RGW are running on 3 other servers.
>
> This week, one OSD crashed and failed to restart, with this stack trace:
>
>  Running command: '/usr/bin/ceph-osd -f --public-addr 10.106.142.30
> --cluster-addr 10.106.142.30 -i 35'
> + exec /usr/bin/ceph-osd -f --public-addr 10.106.142.30 --cluster-addr
> 10.106.142.30 -i 35
> starting osd.35 at - osd_data /var/lib/ceph/osd/ceph-35
> /var/lib/ceph/osd/ceph-35/journal
> /builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: In function
> 'int BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*,
> uint64_t, size_t, ceph::bufferlist*, char*)' thread 7efd088d6d80 time
> 2019-05-24 05:40:47.799918
> /builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: 1000:
> FAILED assert(r == 0)
>  ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x556f7833f8f0]
>  2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*,
> unsigned long, unsigned long, ceph::buffer::list*, char*)+0xca4)
> [0x556f782b5574]
>  3: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
>  4: (BlueFS::mount()+0x1d4) [0x556f782cc014]
>  5: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
>  6: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
>  7: (OSD::init()+0x3bd) [0x556f77dbbaed]
>  8: (main()+0x2d07) [0x556f77cbe667]
>  9: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
>  10: (()+0x4c1f73) [0x556f77d5ef73]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
> *** Caught signal (Aborted) **
>  in thread 7efd088d6d80 thread_name:ceph-osd
>  ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
> (stable)
>  1: (()+0xa63931) [0x556f78300931]
>  2: (()+0xf5d0) [0x7efd05f995d0]
>  3: (gsignal()+0x37) [0x7efd04fba207]
>  4: (abort()+0x148) [0x7efd04fbb8f8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x284) [0x556f7833fa64]
>  6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*,
> unsigned long, unsigned long, ceph::buffer::list*, char*)+0xca4)
> [0x556f782b5574]
>  7: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
>  8: (BlueFS::mount()+0x1d4) [0x556f782cc014]
>  9: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
>  10: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
>  11: (OSD::init()+0x3bd) [0x556f77dbbaed]
>  12: (main()+0x2d07) [0x556f77cbe667]
>  13: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
>  14: (()+0x4c1f73) [0x556f77d5ef73]
>
> The cluster health is OK and Ceph sees this OSD as shutdown.
>
> I tried to find more information on the internet about this error without
> luck.
> Do you have any idea or input about this error, please?
>
> Thanks,
> Guillaume
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Guillaume Chenuet
Hi,

We are running a Ceph cluster with 36 OSDs split across 3 servers (12 OSDs per
server) and Ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee)
luminous (stable).

This cluster is used by an OpenStack private cloud and deployed with
OpenStack Kolla. Every OSD ran into a Docker container on the server and
MON, MGR, MDS, and RGW are running on 3 other servers.

This week, one OSD crashed and failed to restart, with this stack trace:

 Running command: '/usr/bin/ceph-osd -f --public-addr 10.106.142.30
--cluster-addr 10.106.142.30 -i 35'
+ exec /usr/bin/ceph-osd -f --public-addr 10.106.142.30 --cluster-addr
10.106.142.30 -i 35
starting osd.35 at - osd_data /var/lib/ceph/osd/ceph-35
/var/lib/ceph/osd/ceph-35/journal
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: In function
'int BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*,
uint64_t, size_t, ceph::bufferlist*, char*)' thread 7efd088d6d80 time
2019-05-24 05:40:47.799918
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: 1000: FAILED
assert(r == 0)
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x556f7833f8f0]
 2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned
long, unsigned long, ceph::buffer::list*, char*)+0xca4) [0x556f782b5574]
 3: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
 4: (BlueFS::mount()+0x1d4) [0x556f782cc014]
 5: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
 6: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
 7: (OSD::init()+0x3bd) [0x556f77dbbaed]
 8: (main()+0x2d07) [0x556f77cbe667]
 9: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
 10: (()+0x4c1f73) [0x556f77d5ef73]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.
*** Caught signal (Aborted) **
 in thread 7efd088d6d80 thread_name:ceph-osd
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
(stable)
 1: (()+0xa63931) [0x556f78300931]
 2: (()+0xf5d0) [0x7efd05f995d0]
 3: (gsignal()+0x37) [0x7efd04fba207]
 4: (abort()+0x148) [0x7efd04fbb8f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x284) [0x556f7833fa64]
 6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned
long, unsigned long, ceph::buffer::list*, char*)+0xca4) [0x556f782b5574]
 7: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
 8: (BlueFS::mount()+0x1d4) [0x556f782cc014]
 9: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
 10: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
 11: (OSD::init()+0x3bd) [0x556f77dbbaed]
 12: (main()+0x2d07) [0x556f77cbe667]
 13: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
 14: (()+0x4c1f73) [0x556f77d5ef73]

The cluster health is OK and Ceph sees this OSD as shutdown.

I tried to find more information on the internet about this error without
luck.
Do you have any idea or input about this error, please?

Thanks,
Guillaume
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-24 Thread Robert Sander
Am 24.05.19 um 14:43 schrieb Paul Emmerich:
> 20 MB/s at 4K blocks is ~5000 iops, that's 1250 IOPS per SSD (assuming
> replica 3).
> 
> What we usually check in scenarios like these:
> 
> * SSD model? Lots of cheap SSDs simply can't handle more than that

The system has been newly created and is not busy at all.

We tested a single SSD without OSD on top with fio: it can do 50K IOPS
read and 16K IOPS write.

> * Get some proper statistics such as OSD latencies, disk IO utilization,
> etc. A benchmark without detailed performance data doesn't really help
> to debug such a problem

Yes, that is correct, we will try to setup a perfdata gathering system.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-24 Thread Paul Emmerich
20 MB/s at 4K blocks is ~5000 iops, that's 1250 IOPS per SSD (assuming
replica 3).

What we usually check in scenarios like these:

* SSD model? Lots of cheap SSDs simply can't handle more than that
* Get some proper statistics such as OSD latencies, disk IO utilization,
etc. A benchmark without detailed performance data doesn't really help to
debug such a problem


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, May 24, 2019 at 1:23 PM Robert Sander 
wrote:

> Hi,
>
> we have a small cluster at a customer's site with three nodes and 4
> SSD-OSDs each.
> Connected with 10G the system is supposed to perform well.
>
> rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB
> objects but only 20MB/s write and 95MB/s read with 4KB objects.
>
> This is a little bit disappointing as the 4K performance is also seen in
> KVM VMs using RBD.
>
> Is there anything we can do to improve performance with small objects /
> block sizes?
>
> Jumbo frames have already been enabled.
>
> 4MB objects write:
>
> Total time run: 30.218930
> Total writes made:  3391
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 448.858
> Stddev Bandwidth:   63.5044
> Max bandwidth (MB/sec): 552
> Min bandwidth (MB/sec): 320
> Average IOPS:   112
> Stddev IOPS:15
> Max IOPS:   138
> Min IOPS:   80
> Average Latency(s): 0.142475
> Stddev Latency(s):  0.0990132
> Max latency(s): 0.814715
> Min latency(s): 0.0308732
>
> 4MB objects rand read:
>
> Total time run:   30.169312
> Total reads made: 7223
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   957.662
> Average IOPS: 239
> Stddev IOPS:  23
> Max IOPS: 272
> Min IOPS: 175
> Average Latency(s):   0.0653696
> Max latency(s):   0.517275
> Min latency(s):   0.00201978
>
> 4K objects write:
>
> Total time run: 30.002628
> Total writes made:  165404
> Write size: 4096
> Object size:4096
> Bandwidth (MB/sec): 21.5351
> Stddev Bandwidth:   2.0575
> Max bandwidth (MB/sec): 22.4727
> Min bandwidth (MB/sec): 11.0508
> Average IOPS:   5512
> Stddev IOPS:526
> Max IOPS:   5753
> Min IOPS:   2829
> Average Latency(s): 0.00290095
> Stddev Latency(s):  0.0015036
> Max latency(s): 0.0778454
> Min latency(s): 0.00174262
>
> 4K objects read:
>
> Total time run:   30.000538
> Total reads made: 1064610
> Read size:4096
> Object size:  4096
> Bandwidth (MB/sec):   138.619
> Average IOPS: 35486
> Stddev IOPS:  3776
> Max IOPS: 42208
> Min IOPS: 26264
> Average Latency(s):   0.000443905
> Max latency(s):   0.0123462
> Min latency(s):   0.000123081
>
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Kevin Flöh

Hi,

we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" 
but this is just hanging forever if we are looking for unfound objects. 
It works fine for all other objects.


We also tried scanning the ceph directory with find -inum 1099593404050 
(decimal of 10004dfce92) and found nothing. This is also working for non 
unfound objects.


Is there another way to find the corresponding file?

On 24.05.19 11:12 vorm., Burkhard Linke wrote:


Hi,

On 5/24/19 9:48 AM, Kevin Flöh wrote:


We got the object ids of the missing objects with ceph pg 1.24c
list_missing:


{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "10004dfce92.003d",
    "key": "",
    "snapid": -2,
    "hash": 90219084,
    "max": 0,
    "pool": 1,
    "namespace": ""
    },
    "need": "46950'195355",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "36(3)",
    "61(2)"
    ]
    }
    ],
    "more": false
}

we want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) is affected. Is there a way
to map the object id to the corresponding file?



The object name is composed of the file inode id and the chunk within 
the file. The first chunk has some metadata you can use to retrieve 
the filename. See the 'CephFS object mapping' thread on the mailing 
list for more information.



Regards,

Burkhard



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] performance in a small cluster

2019-05-24 Thread Robert Sander

Hi,

we have a small cluster at a customer's site with three nodes and 4 
SSD-OSDs each.

Connected with 10G the system is supposed to perform well.

rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB 
objects but only 20MB/s write and 95MB/s read with 4KB objects.


This is a little bit disappointing as the 4K performance is also seen in 
KVM VMs using RBD.


Is there anything we can do to improve performance with small objects / 
block sizes?


Jumbo frames have already been enabled.
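
The numbers below come from rados bench runs roughly along these lines 
(the pool name is a placeholder and the -t concurrency option was left 
at its default):

    rados bench -p <pool> 30 write --no-cleanup           # 4MB objects (default size)
    rados bench -p <pool> 30 rand                         # random reads of those objects
    rados bench -p <pool> 30 write -b 4096 --no-cleanup   # 4KB objects
    rados bench -p <pool> 30 rand                         # random 4KB reads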

4MB objects write:

Total time run: 30.218930
Total writes made:  3391
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 448.858
Stddev Bandwidth:   63.5044
Max bandwidth (MB/sec): 552
Min bandwidth (MB/sec): 320
Average IOPS:   112
Stddev IOPS:15
Max IOPS:   138
Min IOPS:   80
Average Latency(s): 0.142475
Stddev Latency(s):  0.0990132
Max latency(s): 0.814715
Min latency(s): 0.0308732

4MB objects rand read:

Total time run:   30.169312
Total reads made: 7223
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   957.662
Average IOPS: 239
Stddev IOPS:  23
Max IOPS: 272
Min IOPS: 175
Average Latency(s):   0.0653696
Max latency(s):   0.517275
Min latency(s):   0.00201978

4K objects write:

Total time run: 30.002628
Total writes made:  165404
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 21.5351
Stddev Bandwidth:   2.0575
Max bandwidth (MB/sec): 22.4727
Min bandwidth (MB/sec): 11.0508
Average IOPS:   5512
Stddev IOPS:526
Max IOPS:   5753
Min IOPS:   2829
Average Latency(s): 0.00290095
Stddev Latency(s):  0.0015036
Max latency(s): 0.0778454
Min latency(s): 0.00174262

4K objects read:

Total time run:   30.000538
Total reads made: 1064610
Read size:4096
Object size:  4096
Bandwidth (MB/sec):   138.619
Average IOPS: 35486
Stddev IOPS:  3776
Max IOPS: 42208
Min IOPS: 26264
Average Latency(s):   0.000443905
Max latency(s):   0.0123462
Min latency(s):   0.000123081


Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing director: Peer Heinlein  -- Registered office: Berlin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to fix this? session lost, hunting for new mon, session established, io error

2019-05-24 Thread Ilya Dryomov
On Tue, May 21, 2019 at 11:41 AM Marc Roos  wrote:
>
>
>
> I have this on a CephFS client. I had ceph-common 12.2.11 and upgraded
> to 12.2.12 while having this error. They write here [0] that you need
> to upgrade the kernel and that it is fixed in 12.2.2.
>
> [@~]# uname -a
> Linux mail03 3.10.0-957.5.1.el7.x86_64
>
> [Tue May 21 11:23:26 2019] libceph: mon2 192.168.10.113:6789 session
> established
> [Tue May 21 11:23:26 2019] libceph: mon2 192.168.10.113:6789 io error
> [Tue May 21 11:23:26 2019] libceph: mon2 192.168.10.113:6789 session
> lost, hunting for new mon
> [Tue May 21 11:23:26 2019] libceph: mon0 192.168.10.111:6789 session
> established
> [Tue May 21 11:23:26 2019] libceph: mon0 192.168.10.111:6789 io error
> [Tue May 21 11:23:26 2019] libceph: mon0 192.168.10.111:6789 session
> lost, hunting for new mon
> [Tue May 21 11:23:26 2019] libceph: mon1 192.168.10.112:6789 session
> established
> [Tue May 21 11:23:26 2019] libceph: mon1 192.168.10.112:6789
> [Tue May 21 11:23:26 2019] libceph: mon1 192.168.10.112:6789 session
> lost, hunting for new mon
> [Tue May 21 11:23:26 2019] libceph: mon2 192.168.10.113:6789 session
> established
>
>
>
> ceph version
> ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
> (stable)
>
> [0]
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg52177.html
> https://tracker.ceph.com/issues/23537

Hi Marc,

The issue you linked is definitely not related -- no "io error" there.

This looks like http://tracker.ceph.com/issues/38040.  This is a server
side issue, so no point in upgrading the kernel.  It's still present in
luminous, but there is an easy workaround -- try decreasing "osd map
message max" as described in the thread linked from the description.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is there some changes in ceph instructions in latest version(14.2.1)?

2019-05-24 Thread Yuan Minghui
Hello,

When I try to install the latest version, ceph 14.2.1, and create a 
'mon', something goes wrong.

What should I do now?

 

Thanks 

kyle

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS object mapping.

2019-05-24 Thread Burkhard Linke

Hi,

On 5/22/19 5:53 PM, Robert LeBlanc wrote:
On Wed, May 22, 2019 at 12:22 AM Burkhard Linke wrote:


Hi,

On 5/21/19 9:46 PM, Robert LeBlanc wrote:
> I'm at a new job working with Ceph again and am excited to back
in the
> community!
>
> I can't find any documentation to support this, so please help me
> understand if I got this right.
>
> I've got a Jewel cluster with CephFS and we have an inconsistent
PG.
> All copies of the object are zero size, but the digest says that it
> should be a non-zero size, so it seems that my two options are,
delete
> the file that the object is part of, or rewrite the object with
RADOS
> to update the digest. So, this leads to my question, how to I tell
> which file the object belongs to.
>
> From what I found, the object is prefixed with the hex value of the
> inode and suffixed by the stripe number:
> 1000d2ba15c.0005
> .
>
> I then ran `find . -xdev -inum 1099732590940` and found a file
on the
> CephFS file system. I just want to make sure that I found the right
> file before I start trying recovery options.
>

The first stripe XYZ. has some metadata stored as xattr (rados 
xattr, not cephfs xattr). One of the entries has the key 'parent':


When you say 'some', is there a fixed offset at which the file data 
starts? Is the first stripe just metadata?


No, the first stripe contains the first 4 MB of a file by default. The 
xattr and omap data are stored separately.
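
A rough sketch using the object name from above (the data pool name and 
the exact first-stripe suffix are assumptions on my part):

    # the first stripe is a normal rados object holding the first 4 MB of data
    rados -p <cephfs data pool> stat 1000d2ba15c.00000000
    # its rados-level xattrs include the 'parent' backtrace
    rados -p <cephfs data pool> listxattr 1000d2ba15c.00000000
    rados -p <cephfs data pool> getxattr 1000d2ba15c.00000000 parent > parent.bin
    ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json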



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Burkhard Linke

Hi,

On 5/24/19 9:48 AM, Kevin Flöh wrote:


We got the object IDs of the missing objects with ceph pg 1.24c list_missing:


{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "10004dfce92.003d",
    "key": "",
    "snapid": -2,
    "hash": 90219084,
    "max": 0,
    "pool": 1,
    "namespace": ""
    },
    "need": "46950'195355",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "36(3)",
    "61(2)"
    ]
    }
    ],
    "more": false
}

We want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) are affected. Is there a 
way to map the object ID to the corresponding file?



The object name is composed of the file inode id and the chunk within 
the file. The first chunk has some metadata you can use to retrieve the 
filename. See the 'CephFS object mapping' thread on the mailing list for 
more information.
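
For illustration, with the object from above (the suffix interpretation 
is an assumption on my part):

    # object name = <inode number in hex>.<chunk number within the file>
    # 10004dfce92.003d -> inode 0x10004dfce92, chunk 0x3d
    printf '%d\n' 0x10004dfce92    # 1099593404050, usable with find -inum on CephFS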



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph dovecot

2019-05-24 Thread Danny Al-Gaaf
Hi,

you can find the slides here:

https://dalgaaf.github.io/Cephalocon-Barcelona-librmb/

And Wido is right, it's not production ready, and we still have some work 
ahead to make it work with acceptable performance, especially at our scale.

If you have any questions don't hesitate to contact me.

Danny

Am 23.05.19 um 12:13 schrieb Wido den Hollander:
> 
> 
> On 5/23/19 12:02 PM, Marc Roos wrote:
>>
>> Sorry for not waiting until it is published on the ceph website but,
>> did anyone attend this talk? Is it production ready?
>>
> 
> Danny from Deutsche Telekom can answer this better, but no, it's not
> production ready.
> 
> It seems it's more challenging to get it working, especially at the scale
> of Telekom (millions of mailboxes).
> 
> Wido
> 
>> https://cephalocon2019.sched.com/event/M7j8
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Kevin Flöh
We got the object IDs of the missing objects with ceph pg 1.24c list_missing:


{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "10004dfce92.003d",
    "key": "",
    "snapid": -2,
    "hash": 90219084,
    "max": 0,
    "pool": 1,
    "namespace": ""
    },
    "need": "46950'195355",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "36(3)",
    "61(2)"
    ]
    }
    ],
    "more": false
}

We want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) are affected. Is there a 
way to map the object ID to the corresponding file?



On 23.05.19 3:52 PM, Alexandre Marangone wrote:
The PGs will stay active+recovery_wait+degraded until you solve the 
unfound objects issue.
You can follow this doc to look at which objects are unfound [1] and, 
if there is no other recourse, mark them lost.


[1] 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#unfound-objects



On Thu, May 23, 2019 at 5:47 AM Kevin Flöh wrote:


thank you for this idea, it has improved the situation. Nevertheless,
there are still 2 PGs in recovery_wait. ceph -s gives me:

   cluster:
 id: 23e72372-0d44-4cad-b24f-3641b14b86f4
 health: HEALTH_WARN
 3/125481112 objects unfound (0.000%)
 Degraded data redundancy: 3/497011315 objects degraded
(0.000%), 2 pgs degraded

   services:
 mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
 mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu

  mds: cephfs-1/1/1 up  {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
 osd: 96 osds: 96 up, 96 in

   data:
 pools:   2 pools, 4096 pgs
 objects: 125.48M objects, 259TiB
 usage:   370TiB used, 154TiB / 524TiB avail
 pgs: 3/497011315 objects degraded (0.000%)
  3/125481112 objects unfound (0.000%)
  4083 active+clean
  10   active+clean+scrubbing+deep
  2    active+recovery_wait+degraded
  1    active+clean+scrubbing

   io:
 client:   318KiB/s rd, 77.0KiB/s wr, 190op/s rd, 0op/s wr


and ceph health detail:

HEALTH_WARN 3/125481112 objects unfound (0.000%); Degraded data 
redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
OBJECT_UNFOUND 3/125481112 objects unfound (0.000%)
 pg 1.24c has 1 unfound objects
 pg 1.779 has 2 unfound objects
PG_DEGRADED Degraded data redundancy: 3/497011315 objects degraded
(0.000%), 2 pgs degraded
 pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound
 pg 1.779 is active+recovery_wait+degraded, acting [50,4,77,62], 2 unfound


Also the status changed from HEALTH_ERR to HEALTH_WARN. We also did 
ceph osd down for all OSDs of the degraded PGs. Do you have any further 
suggestions on how to proceed?

On 23.05.19 11:08 AM, Dan van der Ster wrote:
> I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer
> their degraded PGs.
>
> Open a window with `watch ceph -s`, then in another window slowly do
>
>      ceph osd down 1
>      # then wait a minute or so for that osd.1 to re-peer fully.
>      ceph osd down 11
>      ...
>
> Continue that for each of the osds with stuck requests, or until there
> are no more recovery_wait/degraded PGs.
>
> After each `ceph osd down...`, you should expect to see several PGs
> re-peer, and then ideally the slow requests will disappear and the
> degraded PGs will become active+clean.
> If anything else happens, you should stop and let us know.
>
>
> -- dan
>
> On Thu, May 23, 2019 at 10:59 AM Kevin Flöh  wrote:
>> This is the current status of ceph:
>>
>>
>>     cluster:
>>       id:     23e72372-0d44-4cad-b24f-3641b14b86f4
>>       health: HEALTH_ERR
>>               9/125481144 objects unfound (0.000%)
>>               Degraded data redundancy: 9/497011417 objects degraded
>> (0.000%), 7 pgs degraded
>>               9 stuck requests are blocked > 4096 sec. Implicated osds
>> 1,11,21,32,43,50,65
>>
>>     services:
>>       mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
>>       mgr: ceph-node01(active),