Re: [ceph-users] PG Balancer Upmap mode not working

2019-12-10 Thread Richard Bade
> How is that possible? I don't know how much more proof I need to present that
> there's a bug.

I also think there's a bug in the balancer plugin, as it seems to have
stopped working for me too. I'm on Luminous though, so I'm not sure
whether it's the same bug.
The balancer used to work flawlessly, giving me a very even
distribution with about 1% variance. Some time between 12.2.7 (maybe)
and 12.2.12 it stopped working.
Here's a small selection of my osd's showing a 47%-62% spread.

ID  CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
210   hdd 7.27739  1.0 7.28TiB 3.43TiB 3.84TiB 47.18 0.74 104
211   hdd 7.27739  1.0 7.28TiB 3.96TiB 3.32TiB 54.39 0.85 118
212   hdd 7.27739  1.0 7.28TiB 4.50TiB 2.77TiB 61.88 0.97 136
213   hdd 7.27739  1.0 7.28TiB 4.06TiB 3.21TiB 55.85 0.87 124
214   hdd 7.27739  1.0 7.28TiB 4.30TiB 2.98TiB 59.05 0.92 130
215   hdd 7.27739  1.0 7.28TiB 4.41TiB 2.87TiB 60.54 0.95 134
 TOTAL 1.26PiB  825TiB  463TiB 64.01
MIN/MAX VAR: 0.74/1.10  STDDEV: 3.22

$ sudo ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}

I'm happy to add debugging data or test things to get this bug fixed.
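For anyone looking into it, this is roughly what I can run and post (a
sketch - the balancer module commands below are what I believe exist in
Luminous, "myplan" is just an example name, and the injectargs line for
extra mgr logging is my assumption of the right knob):

$ sudo ceph balancer eval                    # score the current distribution
$ sudo ceph balancer optimize myplan         # ask the module to build a plan
$ sudo ceph balancer show myplan             # list the upmaps it would apply, if any
$ sudo ceph tell mgr injectargs '--debug_mgr 4/5'   # bump mgr/balancer logging (assumed)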


[ceph-users] Upgrade Documentation: Wait for recovery

2019-06-17 Thread Richard Bade
Hi Everyone,
Recently we moved a bunch of our servers from one rack to another. In
the late stages of this we hit a point where some requests were blocked
because one pg was in the "peered" state.

This was unexpected, but after discussing it with Wido we understand
why it happened. However, it raises another point: we believed we were
following the instructions in the upgrade documentation. We've done our
upgrades this way in the past without hitting this "peered" state. The
documentation says:
"Ensure each upgraded Ceph OSD Daemon has rejoined the cluster"

We read this as meaning that you can go through and restart all the
osd's in the whole cluster one by one, without waiting for recovery in
between. Whereas it seems it should really read:
"Ensure each upgraded Ceph OSD Daemon has rejoined the cluster" and
"ensure recovery has completed before moving on to the next {failure
domain}", where failure domain is host, rack etc. depending on what is
in your crush map. A rough sketch of what I mean is below.
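In practice that would look something like this per failure domain (a
rough sketch - the noout flag and the health check loop are just how we
would do it, not wording from the docs):

$ sudo ceph osd set noout
# ...upgrade/restart all the osd's in this host/rack...
$ while sudo ceph -s | grep -qE 'degraded|recovering|backfill|peering'; do sleep 30; done
$ sudo ceph pg stat            # confirm everything is back to active+clean
$ sudo ceph osd unset noout    # once the whole cluster is done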

Thoughts? Should the documentation be clearer on this, to stop people
such as myself from making this mistake?

Rich


Re: [ceph-users] cephfs compression?

2018-06-28 Thread Richard Bade
Oh, also because the compression is at the osd level you don't see it
in ceph df. You just see that your RAW is not increasing as much as
you'd expect. E.g.
$ sudo ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
785T  300T 485T 61.73
POOLS:
NAME            ID USED  %USED MAX AVAIL   OBJECTS
cephfs-metadata 11  185M      0    68692G       178
cephfs-data     12  408T  75.26      134T 132641159

You can see that we've used 408TB in the pool but only 485TB RAW,
rather than the ~600TB RAW I'd expect for my k=4, m=2 pool settings.
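If you want an actual per-osd ratio rather than eyeballing RAW, something
like this does the maths from those bluestore counters (a sketch - it
assumes jq is installed and that the counters live under the "bluestore"
section of perf dump):

$ for osd in `seq 0 11`; do
    echo -n "osd.$osd compression ratio: "
    sudo ceph daemon osd.$osd perf dump | \
      jq '.bluestore.bluestore_compressed_original / .bluestore.bluestore_compressed_allocated'
  done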
On Fri, 29 Jun 2018 at 17:08, Richard Bade  wrote:
>
> I'm using compression on a cephfs-data pool in luminous. I didn't do
> anything special
>
> $ sudo ceph osd pool get cephfs-data all | grep ^compression
> compression_mode: aggressive
> compression_algorithm: zlib
>
> You can check how much compression you're getting on the osd's
> $ for osd in `seq 0 11`; do echo osd.$osd; sudo ceph daemon osd.$osd
> perf dump | grep 'bluestore_compressed'; done
> osd.0
> "bluestore_compressed": 686487948225,
> "bluestore_compressed_allocated": 788659830784,
> "bluestore_compressed_original": 1660064620544,
> 
> osd.11
> "bluestore_compressed": 700999601387,
> "bluestore_compressed_allocated": 808854355968,
> "bluestore_compressed_original": 1752045551616,
>
> I can't say for mimic, but definitely for luminous v12.2.5 compression
> is working well with mostly default options.
>
> -Rich
>
> > For RGW, compression works very well. We use rgw to store crash dumps, in
> > most cases, the compression ratio is about 2.0 ~ 4.0.
>
> > I tried to enable compression for cephfs data pool:
>
> > # ceph osd pool get cephfs_data all | grep ^compression
> > compression_mode: force
> > compression_algorithm: lz4
> > compression_required_ratio: 0.95
> > compression_max_blob_size: 4194304
> > compression_min_blob_size: 4096
>
> > (we built ceph packages and enabled lz4.)
>
> > It doesn't seem to work. I copied a 8.7GB folder to cephfs, ceph df says it
> > used 8.7GB:
>
> > root@ceph-admin:~# ceph df
> > GLOBAL:
> > SIZE   AVAIL  RAW USED %RAW USED
> > 16 TiB 16 TiB  111 GiB  0.69
> > POOLS:
> > NAME            ID USED    %USED MAX AVAIL OBJECTS
> > cephfs_data 1  8.7 GiB  0.17   5.0 TiB  360545
> > cephfs_metadata 2  221 MiB 0   5.0 TiB   77707
>
> > I know this folder can be compressed to ~4.0GB under zfs lz4 compression.
>
> > Am I missing anything? how to make cephfs compression work? is there any
> trick?
>
> > By the way, I am evaluating ceph mimic v13.2.0.
>
> > Thanks in advance,
> > --Youzhong


Re: [ceph-users] cephfs compression?

2018-06-28 Thread Richard Bade
I'm using compression on a cephfs-data pool in luminous. I didn't do
anything special:

$ sudo ceph osd pool get cephfs-data all | grep ^compression
compression_mode: aggressive
compression_algorithm: zlib
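(If anyone wants to replicate that, pool settings like the above would be
applied with something like the following - a sketch, so double-check the
option names against your release:)

$ sudo ceph osd pool set cephfs-data compression_mode aggressive
$ sudo ceph osd pool set cephfs-data compression_algorithm zlib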

You can check how much compression you're getting on the osd's
$ for osd in `seq 0 11`; do echo osd.$osd; sudo ceph daemon osd.$osd
perf dump | grep 'bluestore_compressed'; done
osd.0
"bluestore_compressed": 686487948225,
"bluestore_compressed_allocated": 788659830784,
"bluestore_compressed_original": 1660064620544,

osd.11
"bluestore_compressed": 700999601387,
"bluestore_compressed_allocated": 808854355968,
"bluestore_compressed_original": 1752045551616,

I can't say for mimic, but definitely for luminous v12.2.5 compression
is working well with mostly default options.

-Rich

> For RGW, compression works very well. We use rgw to store crash dumps, in
> most cases, the compression ratio is about 2.0 ~ 4.0.

> I tried to enable compression for cephfs data pool:

> # ceph osd pool get cephfs_data all | grep ^compression
> compression_mode: force
> compression_algorithm: lz4
> compression_required_ratio: 0.95
> compression_max_blob_size: 4194304
> compression_min_blob_size: 4096

> (we built ceph packages and enabled lz4.)

> It doesn't seem to work. I copied a 8.7GB folder to cephfs, ceph df says it
> used 8.7GB:

> root@ceph-admin:~# ceph df
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED
> 16 TiB 16 TiB  111 GiB  0.69
> POOLS:
> NAME            ID USED    %USED MAX AVAIL OBJECTS
> cephfs_data 1  8.7 GiB  0.17   5.0 TiB  360545
> cephfs_metadata 2  221 MiB 0   5.0 TiB   77707

> I know this folder can be compressed to ~4.0GB under zfs lz4 compression.

> Am I missing anything? how to make cephfs compression work? is there any
trick?

> By the way, I am evaluating ceph mimic v13.2.0.

> Thanks in advance,
> --Youzhong


Re: [ceph-users] Luminous Bluestore performance, bcache

2018-06-28 Thread Richard Bade
Hi Andrei,
These are good questions. We have another cluster with filestore and
bcache but for this particular one I was interested in testing out
bluestore. So I have used bluestore both with and without bcache.
For my synthetic load on the vm's I'm using this fio command:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test --filename=test --bs=4k --iodepth=64 --size=4G
--readwrite=randwrite --rate_iops=50

Currently on bluestore with my synthetic load I'm getting a 7% hit ratio
(cat /sys/block/bcache*/bcache/stats_total/cache_hit_ratio).
On our filestore cluster with ~700 vm's of varied workload we're
getting about a 30-35% hit ratio.
In the hourly hit ratio I have seen as high as 50% on some osd's in our
filestore cluster, but only 25% with my synthetic load on bluestore so
far - though I hadn't actually been checking this stat until now.
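To compare the hourly numbers across devices I'm just using something
like this (a sketch - as far as I know bcache exposes stats_hour and
stats_day alongside stats_total in sysfs):

$ for c in /sys/block/bcache*/bcache; do
    echo "$c: $(cat $c/stats_hour/cache_hit_ratio)% (hour) $(cat $c/stats_total/cache_hit_ratio)% (total)"
  done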

I hope that helps.
Regards,
Richard

> Hi Richard,
> It is an interesting test for me too as I am planning to migrate to
> Bluestore storage and was considering repurposing the ssd disks
> that we currently use for journals.
> I was wondering if you are using the Filestore or the Bluestore
> for the osds?
> Also, when you perform your testing, how good is the hit ratio
> that you have on the bcache?
> Are you using a lot of random data for your benchmarks? How
> large is your test file for each vm?
> We have been playing around with a few caching scenarios a
> few years back (EnhanceIO and a few more which I can't
> remember now) and we have seen a very poor hit ratio on the
> caching system. Was wondering if you see a different picture?
> Cheers


[ceph-users] Luminous Bluestore performance, bcache

2018-06-27 Thread Richard Bade
Hi Everyone,
There have been a few threads go past around this, but I haven't seen
any that pointed me in the right direction.
We've recently set up a new luminous (12.2.5) cluster with 5 hosts
each with 12 4TB Seagate Constellation ES spinning disks for osd's. We
also have 2x 400GB Intel DC P3700's per node. We're using this for rbd
storage for VM's running under Proxmox VE.
I initially set these up with the DB partition (approx 60GB per osd) on
nvme and data directly onto the spinning disk using ceph-deploy create.
This worked great and was very simple.
However performance wasn't great. I fired up 20 vm's, each running fio
and trying to attain 50 iops. Ceph was only just able to keep up with
the 1000 iops this generated, and vm's started to have trouble hitting
their 50 iops target.
So I rebuilt all the osd's, halving the DB space (~30GB per osd) and
adding a 200GB bcache partition shared between 6 osd's. Again this
worked great with ceph-deploy create and was very simple.
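For reference, the osd creation itself was nothing more exotic than
ceph-deploy; something like the following should be close (a sketch -
device and host names are placeholders, and the exact flags depend on
your ceph-deploy version):

$ ceph-deploy osd create --data /dev/sdb --block-db /dev/nvme0n1p1 cstor-node1
# for the bcache variant, point --data at the bcache device instead:
$ ceph-deploy osd create --data /dev/bcache0 --block-db /dev/nvme0n1p1 cstor-node1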
I have had a vast improvement with my synthetic test. I can now run 100
of the 50 iops test vm's, generating a constant 5000 iops load, and
each one keeps up without any trouble.

The question I have is whether the poor performance out of the box is
expected, or is there some kind of tweaking I should be doing to make
this usable for rbd images? Are others able to work ok with this kind
of config at a small scale like my 60 osd's? Or is it only workable at
a larger scale?

Regards,
Rich


[ceph-users] Ceph ObjectCacher FAILED assert (qemu/kvm)

2018-05-08 Thread Richard Bade
Hi Everyone,
We run some hosts with Proxmox 4.4 connected to our ceph cluster for
RBD storage. Occasionally a vm suddenly stops with no real explanation.
After the last time this happened to one particular vm I turned on some
qemu logging via the Proxmox Monitor tab for that vm, and got this dump
when the vm stopped again:

osdc/ObjectCacher.cc: In function 'void
ObjectCacher::Object::discard(loff_t, loff_t)' thread 7f1c6ebfd700
time 2018-05-08 07:00:47.816114
osdc/ObjectCacher.cc: 533: FAILED assert(bh->waitfor_read.empty())
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x2d0712) [0x7f1c8e093712]
 2: (()+0x52c107) [0x7f1c8e2ef107]
 3: (()+0x52c45f) [0x7f1c8e2ef45f]
 4: (()+0x82107) [0x7f1c8de45107]
 5: (()+0x83388) [0x7f1c8de46388]
 6: (()+0x80e74) [0x7f1c8de43e74]
 7: (()+0x86db0) [0x7f1c8de49db0]
 8: (()+0x2c0ddf) [0x7f1c8e083ddf]
 9: (()+0x2c1d00) [0x7f1c8e084d00]
 10: (()+0x8064) [0x7f1c804e0064]
 11: (clone()+0x6d) [0x7f1c8021562d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

We're using virtio-scsi for the disk with discard option and writeback
cache enabled. The vm is Win2012r2.

Has anyone seen this before? Is there a resolution?
I couldn't find any mention of this while googling for various key
words in the dump.

Regards,
Richard


Re: [ceph-users] Safe to delete data, metadata pools?

2018-01-15 Thread Richard Bade
Thanks John, I removed these pools on Friday and as you suspected
there was no impact.
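In case it's useful to anyone else, removing a pool is just the standard
delete command (a sketch - on newer releases you may also need to set
mon_allow_pool_delete = true on the mons first):

$ sudo ceph osd pool delete data data --yes-i-really-really-mean-it
$ sudo ceph osd pool delete metadata metadata --yes-i-really-really-mean-it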

Regards,
Rich

On 8 January 2018 at 23:15, John Spray <jsp...@redhat.com> wrote:
> On Mon, Jan 8, 2018 at 2:55 AM, Richard Bade <hitr...@gmail.com> wrote:
>> Hi Everyone,
>> I've got a couple of pools that I don't believe are being used but
>> have a reasonably large number of pg's (approx 50% of our total pg's).
>> I'd like to delete them but as they were pre-existing when I inherited
>> the cluster, I wanted to make sure they aren't needed for anything
>> first.
>> Here's the details:
>> POOLS:
>> NAME   ID USED   %USED MAX AVAIL OBJECTS
>> data       0    0     0    88037G       0
>> metadata   1    0     0    88037G       0
>>
>> We don't run cephfs and I believe these are meant for that, but may
>> have been created by default when the cluster was set up (back on
>> dumpling or bobtail I think).
>> As far as I can tell there is no data in them. Do they need to exist
>> for some ceph function?
>> The pool names worry me a little, as they sound important.
>
> The data and metadata pools were indeed created by default in older
> versions of Ceph, for use by CephFS.  Since you're not using CephFS,
> and nobody is using the pools for anything else either (they're
> empty), you can go ahead and delete them.
>
>>
>> They have 3136 pg's each so I'd like to be rid of those so I can
>> increase the number of pg's in my actual data pools without getting
>> over the 300 pg's per osd.
>> Here's the osd dump:
>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> rjenkins pg_num 3136 pgp_num 3136 last_change 1 crash_replay_interval
>> 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1
>> stripe_width 0
>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1
>> object_hash rjenkins pg_num 3136 pgp_num 3136 last_change 1
>> min_read_recency_for_promote 1 min_write_recency_for_promote 1
>> stripe_width 0
>>
>> Also, what performance impact am I likely to see when ceph removes the
>> empty pg's considering it's approx 50% of my total pg's on my 180
>> osd's.
>
> Given that they're empty, I'd expect little if any noticeable impact.
>
> John
>
>>
>> Thanks,
>> Rich
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Safe to delete data, metadata pools?

2018-01-07 Thread Richard Bade
Hi Everyone,
I've got a couple of pools that I don't believe are being used but
have a reasonably large number of pg's (approx 50% of our total pg's).
I'd like to delete them but as they were pre-existing when I inherited
the cluster, I wanted to make sure they aren't needed for anything
first.
Here's the details:
POOLS:
NAME   ID USED   %USED MAX AVAIL OBJECTS
data       0    0     0    88037G       0
metadata   1    0     0    88037G       0

We don't run cephfs and I believe these are meant for that, but may
have been created by default when the cluster was set up (back on
dumpling or bobtail I think).
As far as I can tell there is no data in them. Do they need to exist
for some ceph function?
The pool names worry me a little, as they sound important.

They have 3136 pg's each so I'd like to be rid of those so I can
increase the number of pg's in my actual data pools without getting
over the 300 pg's per osd.
Here's the osd dump:
pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 3136 pgp_num 3136 last_change 1 crash_replay_interval
45 min_read_recency_for_promote 1 min_write_recency_for_promote 1
stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1
object_hash rjenkins pg_num 3136 pgp_num 3136 last_change 1
min_read_recency_for_promote 1 min_write_recency_for_promote 1
stripe_width 0

Also, what performance impact am I likely to see when ceph removes the
empty pg's considering it's approx 50% of my total pg's on my 180
osd's.

Thanks,
Rich


Re: [ceph-users] Inconsistent PG won't repair

2017-11-08 Thread Richard Bade
For anyone that encounters this in the future, I was able to resolve
the issue by finding the three osd's that the object is on. One by one
I stopped the osd, flushed the journal and used the objectstore tool to
remove the data (sudo ceph-objectstore-tool --data-path
/var/lib/ceph/osd/ceph-19 --journal-path
/dev/disk/by-partlabel/journal19 --pool tier3-rbd-3X
rbd_data.19cdf512ae8944a.0001bb56 remove). Then I started the
osd again and let it recover before moving on to the next osd.
After the object was deleted from all three osd's I ran a scrub on the
PG (ceph pg scrub 3.f05). Once the scrub was finished the
inconsistency went away.
Note, the object in question was empty (size of zero bytes) before I
started this process. I emptied the object by moving the rbd image to
another pool.
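For clarity, the per-osd sequence was essentially the following (a
sketch - it assumes upstart-style service control as on our Ubuntu
hosts, and the noout flag is just a precaution I'd recommend):

$ sudo ceph osd set noout
$ sudo stop ceph-osd id=19          # or: systemctl stop ceph-osd@19
$ sudo ceph-osd -i 19 --flush-journal
$ sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-19 \
    --journal-path /dev/disk/by-partlabel/journal19 \
    --pool tier3-rbd-3X rbd_data.19cdf512ae8944a.0001bb56 remove
$ sudo start ceph-osd id=19
# wait for recovery to finish, repeat on the next osd, then:
$ sudo ceph osd unset noout
$ sudo ceph pg scrub 3.f05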

Rich

On 24 October 2017 at 14:34, Richard Bade <hitr...@gmail.com> wrote:
> What I'm thinking about trying is using the ceph-objectstore-tool to
> remove the offending clone metadata. From the help the syntax is this:
> ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid>
> i.e. something like for my object and expected clone from the log message
> ceph-objectstore-tool rbd_data.19cdf512ae8944a.0001bb56
> remove-clone-metadata 148d2
> Anyone had experience with this? I'm not 100% sure if this will
> resolve the issue or cause much the same situation (since it's already
> expecting a clone that's not there currently).
>
> Rich
>
> On 21 October 2017 at 14:13, Brad Hubbard <bhubb...@redhat.com> wrote:
>> On Sat, Oct 21, 2017 at 1:59 AM, Richard Bade <hitr...@gmail.com> wrote:
>>> Hi Lincoln,
>>> Yes the object is 0-bytes on all OSD's. Has the same filesystem
>>> date/time too. Before I removed the rbd image (migrated disk to
>>> different pool) it was 4MB on all the OSD's and md5 checksum was the
>>> same on all so it seems that only metadata is inconsistent.
>>> Thanks for your suggestion, I just looked into this as I thought maybe
>>> I can delete the object (since it's empty anyway). But I just get file
>>> not found:
>>> ~$ rados stat rbd_data.19cdf512ae8944a.0001bb56 --pool=tier3-rbd-3X
>>>  error stat-ing
>>> tier3-rbd-3X/rbd_data.19cdf512ae8944a.0001bb56: (2) No such
>>> file or directory
>>
>> Maybe try downing the osds involved?
>>
>>>
>>> Regards,
>>> Rich
>>>
>>> On 21 October 2017 at 04:32, Lincoln Bryant <linco...@uchicago.edu> wrote:
>>>> Hi Rich,
>>>>
>>>> Is the object inconsistent and 0-bytes on all OSDs?
>>>>
>>>> We ran into a similar issue on Jewel, where an object was empty across the 
>>>> board but had inconsistent metadata. Ultimately it was resolved by doing a 
>>>> "rados get" and then a "rados put" on the object. *However* that was a 
>>>> last ditch effort after I couldn't get any other repair option to work, 
>>>> and I have no idea if that will cause any issues down the road :)
>>>>
>>>> --Lincoln
>>>>
>>>>> On Oct 20, 2017, at 10:16 AM, Richard Bade <hitr...@gmail.com> wrote:
>>>>>
>>>>> Hi Everyone,
>>>>> In our cluster running 0.94.10 we had a pg pop up as inconsistent
>>>>> during scrub. Previously when this has happened running ceph pg repair
>>>>> [pg_num] has resolved the problem. This time the repair runs but it
>>>>> remains inconsistent.
>>>>> ~$ ceph health detail
>>>>> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
>>>>> pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
>>>>> 1 scrub errors
>>>>>
>>>>> The error in the logs is:
>>>>> cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
>>>>> 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir
>>>>> expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2
>>>>>
>>>>> Now, I've tried several things to resolve this. I've tried stopping
>>>>> each of the osd's in turn and running a repair. I've located the rbd
>>>>> image and removed it to empty out the object. The object is now zero
>>>>> bytes but still inconsistent. I've tried stopping each osd, removing
>>>>> the object and starting the osd again. It correctly identifies the
>>>>> object as missing and repair works to fix this but it still remains
>>>>> inconsistent.
>>>>> I've run out of ideas.

Re: [ceph-users] Inconsistent PG won't repair

2017-10-23 Thread Richard Bade
What I'm thinking about trying is using the ceph-objectstore-tool to
remove the offending clone metadata. From the help the syntax is this:
ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid>
i.e. something like for my object and expected clone from the log message
ceph-objectstore-tool rbd_data.19cdf512ae8944a.0001bb56
remove-clone-metadata 148d2
Anyone had experience with this? I'm not 100% sure if this will
resolve the issue or cause much the same situation (since it's already
expecting a clone that's not there currently).

Rich

On 21 October 2017 at 14:13, Brad Hubbard <bhubb...@redhat.com> wrote:
> On Sat, Oct 21, 2017 at 1:59 AM, Richard Bade <hitr...@gmail.com> wrote:
>> Hi Lincoln,
>> Yes the object is 0-bytes on all OSD's. Has the same filesystem
>> date/time too. Before I removed the rbd image (migrated disk to
>> different pool) it was 4MB on all the OSD's and md5 checksum was the
>> same on all so it seems that only metadata is inconsistent.
>> Thanks for your suggestion, I just looked into this as I thought maybe
>> I can delete the object (since it's empty anyway). But I just get file
>> not found:
>> ~$ rados stat rbd_data.19cdf512ae8944a.0001bb56 --pool=tier3-rbd-3X
>>  error stat-ing
>> tier3-rbd-3X/rbd_data.19cdf512ae8944a.0001bb56: (2) No such
>> file or directory
>
> Maybe try downing the osds involved?
>
>>
>> Regards,
>> Rich
>>
>> On 21 October 2017 at 04:32, Lincoln Bryant <linco...@uchicago.edu> wrote:
>>> Hi Rich,
>>>
>>> Is the object inconsistent and 0-bytes on all OSDs?
>>>
>>> We ran into a similar issue on Jewel, where an object was empty across the 
>>> board but had inconsistent metadata. Ultimately it was resolved by doing a 
>>> "rados get" and then a "rados put" on the object. *However* that was a last 
>>> ditch effort after I couldn't get any other repair option to work, and I 
>>> have no idea if that will cause any issues down the road :)
>>>
>>> --Lincoln
>>>
>>>> On Oct 20, 2017, at 10:16 AM, Richard Bade <hitr...@gmail.com> wrote:
>>>>
>>>> Hi Everyone,
>>>> In our cluster running 0.94.10 we had a pg pop up as inconsistent
>>>> during scrub. Previously when this has happened running ceph pg repair
>>>> [pg_num] has resolved the problem. This time the repair runs but it
>>>> remains inconsistent.
>>>> ~$ ceph health detail
>>>> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
>>>> pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
>>>> 1 scrub errors
>>>>
>>>> The error in the logs is:
>>>> cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
>>>> 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir
>>>> expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2
>>>>
>>>> Now, I've tried several things to resolve this. I've tried stopping
>>>> each of the osd's in turn and running a repair. I've located the rbd
>>>> image and removed it to empty out the object. The object is now zero
>>>> bytes but still inconsistent. I've tried stopping each osd, removing
>>>> the object and starting the osd again. It correctly identifies the
>>>> object as missing and repair works to fix this but it still remains
>>>> inconsistent.
>>>> I've run out of ideas.
>>>> The object is now zero bytes:
>>>> ~$ find /var/lib/ceph/osd/ceph-23/current/3.f05_head/ -name
>>>> "*19cdf512ae8944a.0001bb56*" -ls
>>>> 537598582  0 -rw-r--r--   1 root root0 Oct 21
>>>> 03:54 
>>>> /var/lib/ceph/osd/ceph-23/current/3.f05_head/DIR_5/DIR_0/DIR_F/DIR_5/DIR_B/rbd\\udata.19cdf512ae8944a.0001bb56__snapdir_68AB5F05__3
>>>>
>>>> How can I resolve this? Is there some way to remove the empty object
>>>> completely? I saw reference to ceph-objectstore-tool which has some
>>>> options to remove-clone-metadata but I don't know how to use this.
>>>> Will using this to remove the mentioned 148d2 expected clone resolve
>>>> this? Or would this do the opposite as it would seem that it can't
>>>> find that clone?
>>>> Documentation on this tool is sparse.
>>>>
>>>> Any help here would be appreciated.
>>>>
>>>> Regards,
>>>> Rich
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad


Re: [ceph-users] Inconsistent PG won't repair

2017-10-20 Thread Richard Bade
Hi Lincoln,
Yes the object is 0-bytes on all OSD's. Has the same filesystem
date/time too. Before I removed the rbd image (migrated disk to
different pool) it was 4MB on all the OSD's and md5 checksum was the
same on all so it seems that only metadata is inconsistent.
Thanks for your suggestion, I just looked into this as I thought maybe
I can delete the object (since it's empty anyway). But I just get file
not found:
~$ rados stat rbd_data.19cdf512ae8944a.0001bb56 --pool=tier3-rbd-3X
 error stat-ing
tier3-rbd-3X/rbd_data.19cdf512ae8944a.0001bb56: (2) No such
file or directory

Regards,
Rich

On 21 October 2017 at 04:32, Lincoln Bryant <linco...@uchicago.edu> wrote:
> Hi Rich,
>
> Is the object inconsistent and 0-bytes on all OSDs?
>
> We ran into a similar issue on Jewel, where an object was empty across the 
> board but had inconsistent metadata. Ultimately it was resolved by doing a 
> "rados get" and then a "rados put" on the object. *However* that was a last 
> ditch effort after I couldn't get any other repair option to work, and I have 
> no idea if that will cause any issues down the road :)
>
> --Lincoln
>
>> On Oct 20, 2017, at 10:16 AM, Richard Bade <hitr...@gmail.com> wrote:
>>
>> Hi Everyone,
>> In our cluster running 0.94.10 we had a pg pop up as inconsistent
>> during scrub. Previously when this has happened running ceph pg repair
>> [pg_num] has resolved the problem. This time the repair runs but it
>> remains inconsistent.
>> ~$ ceph health detail
>> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
>> pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
>> 1 scrub errors
>>
>> The error in the logs is:
>> cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
>> 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir
>> expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2
>>
>> Now, I've tried several things to resolve this. I've tried stopping
>> each of the osd's in turn and running a repair. I've located the rbd
>> image and removed it to empty out the object. The object is now zero
>> bytes but still inconsistent. I've tried stopping each osd, removing
>> the object and starting the osd again. It correctly identifies the
>> object as missing and repair works to fix this but it still remains
>> inconsistent.
>> I've run out of ideas.
>> The object is now zero bytes:
>> ~$ find /var/lib/ceph/osd/ceph-23/current/3.f05_head/ -name
>> "*19cdf512ae8944a.0001bb56*" -ls
>> 537598582  0 -rw-r--r--   1 root root0 Oct 21
>> 03:54 
>> /var/lib/ceph/osd/ceph-23/current/3.f05_head/DIR_5/DIR_0/DIR_F/DIR_5/DIR_B/rbd\\udata.19cdf512ae8944a.0001bb56__snapdir_68AB5F05__3
>>
>> How can I resolve this? Is there some way to remove the empty object
>> completely? I saw reference to ceph-objectstore-tool which has some
>> options to remove-clone-metadata but I don't know how to use this.
>> Will using this to remove the mentioned 148d2 expected clone resolve
>> this? Or would this do the opposite as it would seem that it can't
>> find that clone?
>> Documentation on this tool is sparse.
>>
>> Any help here would be appreciated.
>>
>> Regards,
>> Rich
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] Inconsistent PG won't repair

2017-10-20 Thread Richard Bade
Hi Everyone,
In our cluster running 0.94.10 we had a pg pop up as inconsistent
during scrub. Previously when this has happened running ceph pg repair
[pg_num] has resolved the problem. This time the repair runs but it
remains inconsistent.
~$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
1 scrub errors

The error in the logs is:
cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir
expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2

Now, I've tried several things to resolve this. I've tried stopping
each of the osd's in turn and running a repair. I've located the rbd
image and removed it to empty out the object. The object is now zero
bytes but still inconsistent. I've tried stopping each osd, removing
the object and starting the osd again. It correctly identifies the
object as missing and repair works to fix this but it still remains
inconsistent.
I've run out of ideas.
The object is now zero bytes:
~$ find /var/lib/ceph/osd/ceph-23/current/3.f05_head/ -name
"*19cdf512ae8944a.0001bb56*" -ls
537598582  0 -rw-r--r--   1 root root0 Oct 21
03:54 
/var/lib/ceph/osd/ceph-23/current/3.f05_head/DIR_5/DIR_0/DIR_F/DIR_5/DIR_B/rbd\\udata.19cdf512ae8944a.0001bb56__snapdir_68AB5F05__3

How can I resolve this? Is there some way to remove the empty object
completely? I saw reference to ceph-objectstore-tool which has some
options to remove-clone-metadata but I don't know how to use this.
Will using this to remove the mentioned 148d2 expected clone resolve
this? Or would this do the opposite as it would seem that it can't
find that clone?
Documentation on this tool is sparse.

Any help here would be appreciated.

Regards,
Rich


Re: [ceph-users] Objects Stuck Degraded

2017-01-25 Thread Richard Bade
Hi Everyone,
Just an update to this in case anyone has the same issue. This seems
to have been caused by ceph osd reweight-by-utilization. Because we
have two pools that map to two separate sets of disks, and one pool was
fuller than the other, reweight-by-utilization had reduced the weight
of the osd's in one pool down to around 0.3. This seems to have
prevented the crush map from finding a suitable osd for the 2nd copy.
Setting the reweight values back up to near 1 has resolved the issue.
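For anyone else hitting this, checking and resetting the override
weights is straightforward (a sketch - the osd id is an example):

$ sudo ceph osd df               # the REWEIGHT column shows the override weights
$ sudo ceph osd reweight 105 1.0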

Regards,
Richard

On 25 January 2017 at 10:58, Richard Bade <hitr...@gmail.com> wrote:
> Hi Everyone,
> I've got a strange one. After doing a reweight of some osd's the other
> night our cluster is showing 1pg stuck unclean.
>
> 2017-01-25 09:48:41 : 1 pgs stuck unclean | recovery 140/71532872
> objects degraded (0.000%) | recovery 2553/71532872 objects misplaced
> (0.004%)
>
> When I query the pg it shows one of the osd's is not up.
>
> "state": "active+remapped",
> "snap_trimq": "[]",
> "epoch": 231928,
> "up": [
> 155
> ],
> "acting": [
> 155,
> 105
> ],
> "actingbackfill": [
> "105",
> "155"
> ],
>
> I've tried restarting the osd's, ceph pg repair, ceph pg 4.559
> list_missing, ceph pg 4.559 mark_unfound_lost revert.
> Nothing works.
> I've just tried setting osd.105 out, waiting for backfill to evacuate
> the osd and stopping the osd process to see if it'll recreate the 2nd
> set of data but no luck.
> It would seem that the primary copy of the data on osd.155 is fine but
> the 2nd copy on osd.105 isn't there.
>
> Any ideas how I can force rebuilding the 2nd copy? Or any other ideas
> to resolve this?
>
> We're running Hammer
> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>
> Regards,
> Richard


[ceph-users] Objects Stuck Degraded

2017-01-24 Thread Richard Bade
Hi Everyone,
I've got a strange one. After doing a reweight of some osd's the other
night, our cluster is showing 1 pg stuck unclean.

2017-01-25 09:48:41 : 1 pgs stuck unclean | recovery 140/71532872
objects degraded (0.000%) | recovery 2553/71532872 objects misplaced
(0.004%)

When I query the pg it shows one of the osd's is not up.

"state": "active+remapped",
"snap_trimq": "[]",
"epoch": 231928,
"up": [
155
],
"acting": [
155,
105
],
"actingbackfill": [
"105",
"155"
],

I've tried restarting the osd's, ceph pg repair, ceph pg 4.559
list_missing, ceph pg 4.559 mark_unfound_lost revert.
Nothing works.
I've just tried setting osd.105 out, waiting for backfill to evacuate
the osd and stopping the osd process to see if it'll recreate the 2nd
set of data but no luck.
It would seem that the primary copy of the data on osd.155 is fine but
the 2nd copy on osd.105 isn't there.

Any ideas how I can force rebuilding the 2nd copy? Or any other ideas
to resolve this?

We're running Hammer
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)

Regards,
Richard


[ceph-users] mark_unfound_lost revert|delete behaviour

2016-06-01 Thread Richard Bade
Hi Everyone,
Can anyone tell me how the ceph pg x.x mark_unfound_lost revert|delete
command is meant to work?
Due to some not fully known, strange circumstances I have 1 unfound
object in one of my pools.
I've read through
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#unfound-objects
and it seems pretty clear that the object is lost and needs to be
reverted or deleted.
However when I run the revert it returns quickly and doesn't seem to
do anything. The first time I tried this it also kicked the osd that
was the primary for this pg. This seemed bad so I restarted the osd. I
tried reverting again and nothing happened.
Later I tried deleting the unfound object and a similar thing happened
- the osd which was primary went down. This time though the command
didn't return straight away, and the load average on that box
skyrocketed to around 1800. I restarted the OSD, but noticed that it
didn't seem to have caused a lot of pg's to require recovery as it
normally would when an osd is down.
So I'm wondering if the osd is meant to go down?
Can anyone confirm the sequence of events that is expected when
issuing mark_unfound_lost? I have not managed to find any info
when googling.
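For the record, this is what I've been using to look at the unfound
object before and after running the command (a sketch - the pg id is a
placeholder, and I believe the query output has a might_have_unfound
section, although the field names may differ between versions):

$ sudo ceph health detail | grep -i unfound
$ sudo ceph pg <pgid> list_missing
$ sudo ceph pg <pgid> query | grep -A20 might_have_unfound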

Regards,
Richard


Re: [ceph-users] ceph-mon crash after update to Hammer 0.94.3 from Firefly 0.80.10

2016-03-13 Thread Richard Bade
Hi Everyone,
Thanks for your input on this. I know it's been a long time but I just
wanted to report back that this issue has been resolved.
We added two more monitors which happened to be on Ubuntu 14.04
(rather than 12.04) and these had no issues. So we upgraded every host
to 14.04.
Since the OS update we have not had any Monitor crashes. It's now been
over two months and the Mon's have been stable.
Thanks again,
Richard

On 17 October 2015 at 07:26, Richard Bade <hitr...@gmail.com> wrote:
> Ok, debugging increased
> ceph tell mon.[abc] injectargs --debug-mon 20
> ceph tell mon.[abc] injectargs --debug-ms 1
>
> Regards,
> Richard
>
> On 17 October 2015 at 01:38, Sage Weil <s...@newdream.net> wrote:
>>
>> This doesn't look familiar.  Are you able to enable a higher log level so
>> that if it happens again we'll have more info?
>>
>> debug mon = 20
>> debug ms = 1
>>
>> Thanks!
>> sage
>>
>> On Fri, 16 Oct 2015, Dan van der Ster wrote:
>>
>> > Hmm, that's strange. I didn't see anything in the tracker that looks
>> > related. Hopefully an expert can chime in...
>> >
>> > Cheers, Dan
>> >
>> > On Fri, Oct 16, 2015 at 1:38 PM, Richard Bade <hitr...@gmail.com> wrote:
>> > > Thanks for your quick response Dan, but no. All the ceph-mon.*.log
>> > > files are
>> > > empty.
>> > > I did track this down in syslog though, in case it helps:
>> > > ceph-mon: 2015-10-16 21:25:00.117115 7f4c9f458700 -1 *** Caught signal
>> > > (Segmentation fault) **#012 in thread 7f4c9f458700#012#012 ceph
>> > > version
>> > > 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)#012 1:
>> > > /usr/bin/ceph-mon()
>> > > [0x928b05]#012 2: (()+0xfcb0) [0x7f4ca50e0cb0]#012 3:
>> > > (get_str_map_key(std::map<std::string, std::string,
>> > > std::less,
>> > > std::allocator<std::pair > > const&,
>> > > std::string const&, std::string const*)+0x37) [0x87d8e7]#012 4:
>> > > (LogMonitor::update_from_paxos(bool*)+0x801) [0x6846e1]#012 5:
>> > > (PaxosService::refresh(bool*)+0x3c6) [0x5dc326]#012 6:
>> > > (Monitor::refresh_from_paxos(bool*)+0x36b) [0x588aab]#012 7:
>> > > (Paxos::do_refresh()+0x4c) [0x5c465c]#012 8:
>> > > (Paxos::handle_commit(MMonPaxos*)+0x243) [0x5cb2d3]#012 9:
>> > > (Paxos::dispatch(PaxosServiceMessage*)+0x22b) [0x5d3fbb]#012 10:
>> > > (Monitor::dispatch(MonSession*, Message*, bool)+0x864) [0x5ab0d4]#012
>> > > 11:
>> > > (Monitor::_ms_dispatch(Message*)+0x2c9) [0x5a8a19]#012 12:
>> > > (Monitor::ms_dispatch(Message*)+0x32) [0x5c3952]#012 13:
>> > > (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x8ac987]#012 14:
>> > > (DispatchQueue::entry()+0x44a) [0x8a9b2a]#012 15:
>> > > (DispatchQueue::DispatchThread::entry()+0xd) [0x79e4ad]#012 16:
>> > > (()+0x7e9a)
>> > > [0x7f4ca50d8e9a]#012 17: (clone()+0x6d) [0x7f4ca3dca38d]#012 NOTE: a
>> > > copy of
>> > > the executable, or `objdump -rdS ` is needed to interpret
>> > > this.
>> > >
>> > > Regards,
>> > > Richard
>> > >
>> > > On 17 October 2015 at 00:33, Dan van der Ster <d...@vanderster.com>
>> > > wrote:
>> > >>
>> > >> Hi,
>> > >> Is there a backtrace in /var/log/ceph/ceph-mon.*.log ?
>> > >> Cheers, Dan
>> > >>
>> > >> On Fri, Oct 16, 2015 at 12:46 PM, Richard Bade <hitr...@gmail.com>
>> > >> wrote:
>> > >> > Hi Everyone,
>> > >> > I upgraded our cluster to Hammer 0.94.3 a couple of days ago and
>> > >> > today
>> > >> > we've
>> > >> > had one monitor crash twice and another one once. We have 3
>> > >> > monitors
>> > >> > total
>> > >> > and have been running Firefly 0.80.10 for quite some time without
>> > >> > any
>> > >> > monitor issues.
>> > >> > When the monitor crashes it leaves a core file and a crash file in
>> > >> > /var/crash
>> > >> > I can't see anything obviously the same goolging about.
>> > >> > Has anyone seen anything like this?
>> > >> > Any suggestions? What other info would be useful to help track down
>> > >> > the
>> > >> > issue.
>> > >> >
>> > >> > Regards,
>> > >> > Richard
>> > >> >
>> > >> > ___
>> > >> > ceph-users mailing list
>> > >> > ceph-users@lists.ceph.com
>> > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >> >
>> > >
>> > >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>
>


Re: [ceph-users] ceph-mon crash after update to Hammer 0.94.3 from Firefly 0.80.10

2015-10-16 Thread Richard Bade
Ok, debugging increased
ceph tell mon.[abc] injectargs --debug-mon 20
ceph tell mon.[abc] injectargs --debug-ms 1

Regards,
Richard

On 17 October 2015 at 01:38, Sage Weil <s...@newdream.net> wrote:

> This doesn't look familiar.  Are you able to enable a higher log level so
> that if it happens again we'll have more info?
>
> debug mon = 20
> debug ms = 1
>
> Thanks!
> sage
>
> On Fri, 16 Oct 2015, Dan van der Ster wrote:
>
> > Hmm, that's strange. I didn't see anything in the tracker that looks
> > related. Hopefully an expert can chime in...
> >
> > Cheers, Dan
> >
> > On Fri, Oct 16, 2015 at 1:38 PM, Richard Bade <hitr...@gmail.com> wrote:
> > > Thanks for your quick response Dan, but no. All the ceph-mon.*.log
> files are
> > > empty.
> > > I did track this down in syslog though, in case it helps:
> > > ceph-mon: 2015-10-16 21:25:00.117115 7f4c9f458700 -1 *** Caught signal
> > > (Segmentation fault) **#012 in thread 7f4c9f458700#012#012 ceph version
> > > 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)#012 1:
> /usr/bin/ceph-mon()
> > > [0x928b05]#012 2: (()+0xfcb0) [0x7f4ca50e0cb0]#012 3:
> > > (get_str_map_key(std::map<std::string, std::string,
> std::less,
> > > std::allocator<std::pair > > const&,
> > > std::string const&, std::string const*)+0x37) [0x87d8e7]#012 4:
> > > (LogMonitor::update_from_paxos(bool*)+0x801) [0x6846e1]#012 5:
> > > (PaxosService::refresh(bool*)+0x3c6) [0x5dc326]#012 6:
> > > (Monitor::refresh_from_paxos(bool*)+0x36b) [0x588aab]#012 7:
> > > (Paxos::do_refresh()+0x4c) [0x5c465c]#012 8:
> > > (Paxos::handle_commit(MMonPaxos*)+0x243) [0x5cb2d3]#012 9:
> > > (Paxos::dispatch(PaxosServiceMessage*)+0x22b) [0x5d3fbb]#012 10:
> > > (Monitor::dispatch(MonSession*, Message*, bool)+0x864) [0x5ab0d4]#012
> 11:
> > > (Monitor::_ms_dispatch(Message*)+0x2c9) [0x5a8a19]#012 12:
> > > (Monitor::ms_dispatch(Message*)+0x32) [0x5c3952]#012 13:
> > > (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x8ac987]#012 14:
> > > (DispatchQueue::entry()+0x44a) [0x8a9b2a]#012 15:
> > > (DispatchQueue::DispatchThread::entry()+0xd) [0x79e4ad]#012 16:
> (()+0x7e9a)
> > > [0x7f4ca50d8e9a]#012 17: (clone()+0x6d) [0x7f4ca3dca38d]#012 NOTE: a
> copy of
> > > the executable, or `objdump -rdS ` is needed to interpret
> this.
> > >
> > > Regards,
> > > Richard
> > >
> > > On 17 October 2015 at 00:33, Dan van der Ster <d...@vanderster.com>
> wrote:
> > >>
> > >> Hi,
> > >> Is there a backtrace in /var/log/ceph/ceph-mon.*.log ?
> > >> Cheers, Dan
> > >>
> > >> On Fri, Oct 16, 2015 at 12:46 PM, Richard Bade <hitr...@gmail.com>
> wrote:
> > >> > Hi Everyone,
> > >> > I upgraded our cluster to Hammer 0.94.3 a couple of days ago and
> today
> > >> > we've
> > >> > had one monitor crash twice and another one once. We have 3 monitors
> > >> > total
> > >> > and have been running Firefly 0.80.10 for quite some time without
> any
> > >> > monitor issues.
> > >> > When the monitor crashes it leaves a core file and a crash file in
> > >> > /var/crash
> > >> > I can't see anything obviously the same goolging about.
> > >> > Has anyone seen anything like this?
> > >> > Any suggestions? What other info would be useful to help track down
> the
> > >> > issue.
> > >> >
> > >> > Regards,
> > >> > Richard
> > >> >
> > >> > ___
> > >> > ceph-users mailing list
> > >> > ceph-users@lists.ceph.com
> > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >> >
> > >
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>


[ceph-users] ceph-mon crash after update to Hammer 0.94.3 from Firefly 0.80.10

2015-10-16 Thread Richard Bade
Hi Everyone,
I upgraded our cluster to Hammer 0.94.3 a couple of days ago and today
we've had one monitor crash twice and another one once. We have 3 monitors
total and have been running Firefly 0.80.10 for quite some time without any
monitor issues.
When the monitor crashes it leaves a core file and a crash file in
/var/crash
I can't see anything that looks obviously the same when googling.
Has anyone seen anything like this?
Any suggestions? What other info would be useful to help track down
the issue?

Regards,
Richard


Re: [ceph-users] ceph-mon crash after update to Hammer 0.94.3 from Firefly 0.80.10

2015-10-16 Thread Richard Bade
Thanks for your quick response Dan, but no. All the ceph-mon.*.log files
are empty.
I did track this down in syslog though, in case it helps:
ceph-mon: 2015-10-16 21:25:00.117115 7f4c9f458700 -1 *** Caught signal
(Segmentation fault) **#012 in thread 7f4c9f458700#012#012 ceph version
0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)#012 1:
/usr/bin/ceph-mon() [0x928b05]#012 2: (()+0xfcb0) [0x7f4ca50e0cb0]#012 3:
(get_str_map_key(std::map<std::string, std::string, std::less,
std::allocator<std::pair > > const&,
std::string const&, std::string const*)+0x37) [0x87d8e7]#012 4:
(LogMonitor::update_from_paxos(bool*)+0x801) [0x6846e1]#012 5:
(PaxosService::refresh(bool*)+0x3c6) [0x5dc326]#012 6:
(Monitor::refresh_from_paxos(bool*)+0x36b) [0x588aab]#012 7:
(Paxos::do_refresh()+0x4c) [0x5c465c]#012 8:
(Paxos::handle_commit(MMonPaxos*)+0x243) [0x5cb2d3]#012 9:
(Paxos::dispatch(PaxosServiceMessage*)+0x22b) [0x5d3fbb]#012 10:
(Monitor::dispatch(MonSession*, Message*, bool)+0x864) [0x5ab0d4]#012 11:
(Monitor::_ms_dispatch(Message*)+0x2c9) [0x5a8a19]#012 12:
(Monitor::ms_dispatch(Message*)+0x32) [0x5c3952]#012 13:
(Messenger::ms_deliver_dispatch(Message*)+0x77) [0x8ac987]#012 14:
(DispatchQueue::entry()+0x44a) [0x8a9b2a]#012 15:
(DispatchQueue::DispatchThread::entry()+0xd) [0x79e4ad]#012 16: (()+0x7e9a)
[0x7f4ca50d8e9a]#012 17: (clone()+0x6d) [0x7f4ca3dca38d]#012 NOTE: a copy
of the executable, or `objdump -rdS <executable>` is needed to interpret
this.

Regards,
Richard

On 17 October 2015 at 00:33, Dan van der Ster <d...@vanderster.com> wrote:

> Hi,
> Is there a backtrace in /var/log/ceph/ceph-mon.*.log ?
> Cheers, Dan
>
> On Fri, Oct 16, 2015 at 12:46 PM, Richard Bade <hitr...@gmail.com> wrote:
> > Hi Everyone,
> > I upgraded our cluster to Hammer 0.94.3 a couple of days ago and today
> we've
> > had one monitor crash twice and another one once. We have 3 monitors
> total
> > and have been running Firefly 0.80.10 for quite some time without any
> > monitor issues.
> > When the monitor crashes it leaves a core file and a crash file in
> > /var/crash
> > I can't see anything obviously the same goolging about.
> > Has anyone seen anything like this?
> > Any suggestions? What other info would be useful to help track down the
> > issue.
> >
> > Regards,
> > Richard
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>


Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-13 Thread Richard Bade
Hi Everyone,

I updated the firmware on 3 S3710 drives (one host) last Tuesday and have
not seen any ATA resets or Task Aborts on that host in the 5 days since.

I also set nobarriers on another host on Wednesday and have only seen one
Task Abort, and that was on an S3710.

I have seen 18 ATA resets or Task Aborts on the two hosts that I made no
changes on.

It looks like this firmware has fixed my issues, but it also looks like
nobarriers improves the situation significantly, which seems to
correlate with your experience, Christian.

Thanks everyone for the info in this thread, I plan to update the firmware
on the remainder of the S3710 drives this week and also set nobarriers.
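For anyone following along, the per-host change is roughly this (a
sketch - the firmware check and mount details are from memory, so treat
the exact commands and paths as examples):

$ sudo smartctl -i /dev/sdX | grep -i firmware      # confirm the drive firmware version
$ sudo mount -o remount,nobarrier /var/lib/ceph/osd/ceph-NN   # apply nobarriers per osd mount
# to make it persistent, add nobarrier to the osd mount options, e.g. in ceph.conf:
#   osd mount options xfs = rw,noatime,inode64,nobarrier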

Regards,

Richard

On 8 September 2015 at 14:27, Richard Bade <hitr...@gmail.com> wrote:

> Hi Christian,
>
> On 8 September 2015 at 14:02, Christian Balzer <ch...@gol.com> wrote:
>>
>> Indeed. But first a word about the setup where I'm seeing this.
>> These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
>> over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
>> with the Samsung DC SSDs, one with the Intel S3610.
>> 2 of these chassis to be precise:
>> https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm
>
>
> We are using the same box, but DC0R (no infiniband) so I guess not
> surprising we're seeing the same thing happening.
>
>
>>
>>
>> Of course latest firmware and I tried this with any kernel from Debian
>> 3.16 to stock 4.1.6.
>>
>> With nobarrier I managed to trigger the error only once yesterday on the
>> DRBD replication target, not the machine that actual has the FS mounted.
>> Usually I'd be able to trigger quite a bit more often during those tests.
>>
>> So this morning I updated the firmware of all S3610s on one node and
>> removed the nobarrier flag. It took a lot of punishment, but eventually
>> this happened:
>> ---
>> Sep  8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting task
>> abort! scmd(880fdc85b680)
>> Sep  8 10:43:47 mbx09 kernel: [ 1743.358339] sd 0:0:1:0: [sdb] CDB:
>> Write(10) 2a 00 0e 9a fb b8 00 00 08 00
>> Sep  8 10:43:47 mbx09 kernel: [ 1743.358345] scsi target0:0:1:
>> handle(0x000a), sas_address(0x443322110100), phy(1)
>> Sep  8 10:43:47 mbx09 kernel: [ 1743.358348] scsi target0:0:1:
>> enclosure_logical_id(0x5003048019e98d00), slot(1)
>> Sep  8 10:43:47 mbx09 kernel: [ 1743.387951] sd 0:0:1:0: task abort:
>> SUCCESS scmd(880fdc85b680)
>> ---
>> Note that on the un-patched node (DRBD replication target) I managed to
>> trigger this bug 3 times in the same period.
>>
>> So unless Intel has something to say (and given that this happens with
>> Samsungs as well), I'd still look beady eyed at LSI/Avago...
>>
>
> Yes, I think there may be more than one issue here. The reduction in
> occurrences seems to prove there is an issue fixed by the Intel firmware,
> but something is still happening.
> Once I have updated the firmware on the drives on one of our hosts
> tonight, hopefully I can get some more statistics and pinpoint if there is
> another issue specifically with the LSI3008.
> I'd be interested to know if the combination of nobarriers and the updated
> firmware fixes the issue.
>
> Regards,
> Richard
>


Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Richard Bade
Thanks guys for the pointers to this Intel thread:

https://communities.intel.com/thread/77801

It looks promising. I intend to update the firmware on disks in one
node tonight and will report back after a few days to a week on my
findings.

I've also posted to that forum and will update there too.

Regards,

Richard


On 5 September 2015 at 07:55, Richard Bade <hitr...@gmail.com> wrote:

> Hi Everyone,
>
> We have a Ceph pool that is entirely made up of Intel S3700/S3710
> enterprise SSD's.
>
> We are seeing some significant I/O delays on the disks causing a “SCSI
> Task Abort” from the OS. This seems to be triggered by the drive receiving
> a “Synchronize cache command”.
>
> My current thinking is that setting nobarriers in XFS will stop the drive
> receiving a sync command and therefore stop the I/O delay associated with
> it.
>
> In the XFS FAQ it looks like the recommendation is that if you have a
> Battery Backed raid controller you should set nobarriers for performance
> reasons.
>
> Our LSI card doesn’t have battery backed cache as it’s configured in HBA
> mode (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a capacitor
> backed cache though.
>
> So is it recommended that barriers are turned off as the drive has a safe
> cache (I am confident that the cache will write out to disk on power
> failure)?
>
> Has anyone else encountered this issue?
>
> Any info or suggestions about this would be appreciated.
>
> Regards,
>
> Richard
>


Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Richard Bade
Hi Christian,

On 8 September 2015 at 14:02, Christian Balzer  wrote:
>
> Indeed. But first a word about the setup where I'm seeing this.
> These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
> over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
> with the Samsung DC SSDs, one with the Intel S3610.
> 2 of these chassis to be precise:
> https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm


We are using the same box, but DC0R (no infiniband) so I guess not
surprising we're seeing the same thing happening.


>
>
> Of course latest firmware and I tried this with any kernel from Debian
> 3.16 to stock 4.1.6.
>
> With nobarrier I managed to trigger the error only once yesterday on the
> DRBD replication target, not the machine that actual has the FS mounted.
> Usually I'd be able to trigger quite a bit more often during those tests.
>
> So this morning I updated the firmware of all S3610s on one node and
> removed the nobarrier flag. It took a lot of punishment, but eventually
> this happened:
> ---
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting task
> abort! scmd(880fdc85b680)
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358339] sd 0:0:1:0: [sdb] CDB:
> Write(10) 2a 00 0e 9a fb b8 00 00 08 00
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358345] scsi target0:0:1:
> handle(0x000a), sas_address(0x443322110100), phy(1)
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358348] scsi target0:0:1:
> enclosure_logical_id(0x5003048019e98d00), slot(1)
> Sep  8 10:43:47 mbx09 kernel: [ 1743.387951] sd 0:0:1:0: task abort:
> SUCCESS scmd(880fdc85b680)
> ---
> Note that on the un-patched node (DRBD replication target) I managed to
> trigger this bug 3 times in the same period.
>
> So unless Intel has something to say (and given that this happens with
> Samsungs as well), I'd still look beady eyed at LSI/Avago...
>

Yes, I think there may be more than one issue here. The reduction in
occurrences seems to prove there is an issue fixed by the Intel firmware,
but something is still happening.
Once I have updated the firmware on the drives on one of our hosts tonight,
hopefully I can get some more statistics and pinpoint if there is another
issue specifically with the LSI3008.
I'd be interested to know if the combination of nobarriers and the updated
firmware fixes the issue.

Regards,
Richard


Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Richard Bade
Hi Christian,
Thanks for the info. I'm just wondering, have you updated your S3610's with
the new firmware that was released on 21/08 as referred to in the thread?
We thought we weren't seeing the issue on the Intel controller to
start with, but after further investigation it turned out we were; it
was just reported as a different log item, such as this:
ata5.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x6 frozen
ata5.00: failed command: READ FPDMA QUEUED
ata5.00: cmd 60/10:a0:18:ca:ca/00:00:32:00:00/40 tag 20 ncq 8192 in
  res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5.00: failed command: READ FPDMA QUEUED
ata5.00: cmd 60/40:a8:48:ca:ca/00:00:32:00:00/40 tag 21 ncq 32768 in
 res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata5.00: configured for UDMA/133
ata5.00: device reported invalid CHS sector 0
ata5.00: device reported invalid CHS sector 0
ata5: EH complete
ata5.00: Enabling discard_zeroes_data

I believe this to be the same thing as the LSI3008 which gives these log
messages:
sd 0:0:6:0: attempting task abort! scmd(8804cac00600)
sd 0:0:6:0: [sdg] CDB:
Read(10): 28 00 1c e7 76 a0 00 01 30 00
scsi target0:0:6: handle(0x000f), sas_address(0x443322110600), phy(6)
scsi target0:0:6: enclosure_logical_id(0x50030480), slot(6)
sd 0:0:6:0: task abort: SUCCESS scmd(8804cac00600)
sd 0:0:6:0: attempting task abort! scmd(8804cac03780)

I appreciate your info with regards to nobarriers. I assume by "alleviate
it, but didn't fix" you mean the number of occurrences is reduced?

Regards,
Richard


On 8 September 2015 at 11:43, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> Note that I see exactly your errors (in a non-Ceph environment) with both
> Samsung 845DC EVO and Intel DC S3610.
> Though I need to stress things quite a bit to make it happen.
>
> Also setting nobarrier did alleviate it, but didn't fix it 100%, so I
> guess something still issues flushes at some point.
>
> From where I stand LSI/Avago are full of it.
> Not only does this problem NOT happen with any onboard SATA chipset I have
> access to, but their task abort and reset is what actually impacts things
> (several seconds to recover), not whatever insignificant delay is caused
> by the SSDs.
>
> Christian
> On Tue, 8 Sep 2015 11:35:38 +1200 Richard Bade wrote:
>
> > Thanks guys for the pointers to this Intel thread:
> >
> > https://communities.intel.com/thread/77801
> >
> > It looks promising. I intend to update the firmware on disks in one
> > node tonight and will report back after a few days to a week on my
> > findings.
> >
> > I've also posted to that forum and will update there too.
> >
> > Regards,
> >
> > Richard
> >
> >
> > On 5 September 2015 at 07:55, Richard Bade <hitr...@gmail.com> wrote:
> >
> > > Hi Everyone,
> > >
> > > We have a Ceph pool that is entirely made up of Intel S3700/S3710
> > > enterprise SSD's.
> > >
> > > We are seeing some significant I/O delays on the disks causing a “SCSI
> > > Task Abort” from the OS. This seems to be triggered by the drive
> > > receiving a “Synchronize cache command”.
> > >
> > > My current thinking is that setting nobarriers in XFS will stop the
> > > drive receiving a sync command and therefore stop the I/O delay
> > > associated with it.
> > >
> > > In the XFS FAQ it looks like the recommendation is that if you have a
> > > Battery Backed raid controller you should set nobarriers for
> > > performance reasons.
> > >
> > > Our LSI card doesn’t have battery backed cache as it’s configured in
> > > HBA mode (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a
> > > capacitor backed cache though.
> > >
> > > So is it recommended that barriers are turned off as the drive has a
> > > safe cache (I am confident that the cache will write out to disk on
> > > power failure)?
> > >
> > > Has anyone else encountered this issue?
> > >
> > > Any info or suggestions about this would be appreciated.
> > >
> > > Regards,
> > >
> > > Richard
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] XFS and nobarriers on Intel SSD

2015-09-04 Thread Richard Bade
Hi Everyone,

We have a Ceph pool that is entirely made up of Intel S3700/S3710
enterprise SSD's.

We are seeing some significant I/O delays on the disks causing a “SCSI Task
Abort” from the OS. This seems to be triggered by the drive receiving a
“Synchronize cache command”.

My current thinking is that setting nobarriers in XFS will stop the drive
receiving a sync command and therefore stop the I/O delay associated with
it.
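
For reference, what I'm considering is simply adding the nobarrier option
to the OSD filesystem mounts, something like the sketch below (the device,
partition and mount point are examples, not our actual layout):

/dev/sdf1  /var/lib/ceph/osd/ceph-12  xfs  rw,noatime,inode64,nobarrier  0 0

$ sudo mount -o remount,nobarrier /var/lib/ceph/osd/ceph-12

I believe XFS lets you flip barrier/nobarrier on a remount, but if not it
would need an unmount/mount (i.e. an OSD restart).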

In the XFS FAQ it looks like the recommendation is that if you have a
Battery Backed raid controller you should set nobarriers for performance
reasons.

Our LSI card doesn’t have battery backed cache as it’s configured in HBA
mode (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a capacitor
backed cache though.

So is it recommended that barriers are turned off as the drive has a safe
cache (I am confident that the cache will write out to disk on power
failure)?
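
(On the capacitor side, the only sanity check I've found from the OS is the
drive's SMART attributes, e.g. the command below; our Intel DC drives
report an attribute along the lines of Power_Loss_Cap_Test, though I'm not
sure that name is consistent across models or smartmontools versions.)

$ sudo smartctl -A /dev/sdf | grep -i power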

Has anyone else encountered this issue?

Any info or suggestions about this would be appreciated.

Regards,

Richard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-04 Thread Richard Bade
Hi Jan,
Thanks for your response.


> How exactly do you know this is the cause? This is usually just an effect
> of something going wrong and part of error recovery process.
> Preceding this event should be the real error/root cause...

We have been working with LSI/Avago to resolve this. We get a bunch of
these type log events:

2015-09-04T14:58:59.169677+12:00  ceph-osd: - ceph-osd:
2015-09-04 14:58:59.168444 7fbc5ec71700  0 log [WRN] : slow request
30.894936 seconds old, received at 2015-09-04 14:58:28.272976:
osd_op(client.42319583.0:1185218039
rbd_data.1d8a5a92eb141f2.56a0 [read 3579392~8192] 4.f9f016cb
ack+read e66603) v4 currently no flag points reached

Followed by the task abort I mentioned:
 sd 11:0:4:0: attempting task abort! scmd(8804c07d0480)
 sd 11:0:4:0: [sdf] CDB:
 Write(10): 2a 00 24 6f 01 a8 00 00 08 00
 scsi target11:0:4: handle(0x000d), sas_address(0x443322110400), phy(4)
 scsi target11:0:4: enclosure_logical_id(0x50030480), slot(4)
 sd 11:0:4:0: task abort: SUCCESS scmd(8804c07d0480)

LSI had us enable debugging on our card and send them many logs and
debugging data. Their response was:

> Please do not send in the Synchronize cache command(35h). That's the one
> causing the drive from not responding to Read/write commands quick enough.
>
> A Synchronize cache command instructs the ATA device to flush the cache
> contents to medium and so while the disk is in the process of doing it,
> it's probably causing the read/write commands to take longer time to
> complete.

LSI/Avago believe this to be the root cause of the IO delay based on the
debugging info.
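
For what it's worth, the way we've been trying to see this from the host
side is to watch the flush traffic on one of the affected devices with
blktrace while the aborts happen, roughly like this (blktrace/blkparse are
from the blktrace package; the device name is an example):

$ sudo blktrace -d /dev/sdf -o - | blkparse -i -

and then look for entries whose RWBS column contains 'F' (a cache flush)
and see whether they line up in time with the slow requests and task
aborts. How flushes are flagged varies a bit between kernel versions, so
take that as a rough sketch.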

> and from what I've seen it is not necessary with fast drives (such as
> S3700).

While I agree with you that it should not be necessary, given that the
S3700s should be very fast, our current experience does not show this to
be the case.

Just a little more about our setup: we're using Ceph Firefly (0.80.10) on
Ubuntu 14.04. We see this same thing on every S3700/S3710 across four
hosts. We do not see it happening on the spinning disks, which sit in a
different pool in the same cluster on similar hardware.

If you know of any other reason this may be happening, we would appreciate
it. Otherwise we will need to continue investigating the possibility of
setting nobarriers.
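
In the meantime the quick check we're doing is just confirming what the OSD
filesystems are currently mounted with (assuming the default
/var/lib/ceph/osd mount points; barrier is the XFS default, so depending on
kernel version it may not be listed explicitly):

$ grep /var/lib/ceph/osd /proc/mounts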

Regards,
Richard

On 5 September 2015 at 09:32, Jan Schermer <j...@schermer.cz> wrote:

> We are seeing some significant I/O delays on the disks causing a “SCSI
> Task Abort” from the OS. This seems to be triggered by the drive receiving
> a “Synchronize cache command”.
>
>
> How exactly do you know this is the cause? This is usually just an effect
> of something going wrong and part of error recovery process.
> Preceding this event should be the real error/root cause...
>
> It is _supposedly_ safe to disable barriers in this scenario, but IMO the
> assumptions behind that are deeply flawed, and from what I've seen it is
> not necessary with fast drives (such as S3700).
>
> Take a look in the mailing list archives, I elaborated on this quite a bit
> in the past, including my experience with Kingston drives + XFS + LSI (and
> the effect is present even on Intels, but because they are much faster it
> shouldn't cause any real problems).
>
> Jan
>
>
> On 04 Sep 2015, at 21:55, Richard Bade <hitr...@gmail.com> wrote:
>
> Hi Everyone,
>
> We have a Ceph pool that is entirely made up of Intel S3700/S3710
> enterprise SSD's.
>
> We are seeing some significant I/O delays on the disks causing a “SCSI
> Task Abort” from the OS. This seems to be triggered by the drive receiving
> a “Synchronize cache command”.
>
> My current thinking is that setting nobarriers in XFS will stop the drive
> receiving a sync command and therefore stop the I/O delay associated with
> it.
>
> In the XFS FAQ it looks like the recommendation is that if you have a
> Battery Backed raid controller you should set nobarriers for performance
> reasons.
>
> Our LSI card doesn’t have battery backed cache as it’s configured in HBA
> mode (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a capacitor
> backed cache though.
>
> So is it recommended that barriers are turned off as the drive has a safe
> cache (I am confident that the cache will write out to disk on power
> failure)?
>
> Has anyone else encountered this issue?
>
> Any info or suggestions about this would be appreciated.
>
> Regards,
>
> Richard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com