Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-23 Thread Sean Sullivan
Thanks Yan! I did this for the bug ticket and missed these replies. I hope
I did it correctly. Here are the pastes of the dumps:

https://pastebin.com/kw4bZVZT -- primary
https://pastebin.com/sYZQx0ER -- secondary


They are not that long; here is the output of one:


Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe3b100a700 (LWP 120481)]
0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
3065    /build/ceph-12.2.5/src/mds/Server.cc: No such file or directory.
(gdb) t
[Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))]
(gdb) bt
#0  0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
#1  0x5617aacfc98b in Server::dispatch_client_request (this=this@entry=0x5617b5acbcd0, mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:1802
#2  0x5617aacfce9b in Server::handle_client_request (this=this@entry=0x5617b5acbcd0, req=req@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:1716
#3  0x5617aad017b6 in Server::dispatch (this=0x5617b5acbcd0, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:258
#4  0x5617aac6afac in MDSRank::handle_deferrable_message (this=this@entry=0x5617b5d22000, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/MDSRank.cc:716
#5  0x5617aac795cb in MDSRank::_dispatch (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700, new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551
#6  0x5617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000, m=) at /build/ceph-12.2.5/src/mds/MDSRank.cc:998
#7  0x5617aaf0207b in Context::complete (r=0, this=0x5617bd568080) at /build/ceph-12.2.5/src/include/Context.h:70
#8  MDSInternalContextBase::complete (this=0x5617bd568080, r=0) at /build/ceph-12.2.5/src/mds/MDSContext.cc:30
#9  0x5617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000) at /build/ceph-12.2.5/src/mds/MDSRank.cc:776
#10 0x5617aac7921a in MDSRank::ProgressThread::entry (this=0x5617b5d22d40) at /build/ceph-12.2.5/src/mds/MDSRank.cc:502
#11 0x7fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at pthread_create.c:333
#12 0x7fe3ba37241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109



To capture this I:
* set the debug level to mds=20, mon=1,
* attached gdb prior to trying to mount aufs from a separate client,
* typed continue and attempted the mount,
* then backtraced after it seg faulted.
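
For anyone who wants to repeat the capture, the procedure amounts to roughly the following (a sketch; it assumes ceph-mds-dbg / debug symbols are installed and that the MDS daemon is the only ceph-mds process on the host):

ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
gdb -p "$(pidof ceph-mds)"
# inside gdb: let the daemon run, trigger the mount from the client, then inspect
#   (gdb) continue
#   ... mount -t aufs ... on the client, wait for the SIGSEGV ...
#   (gdb) thread apply all bt
#   (gdb) frame 0
#   (gdb) info args
#   (gdb) info locals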

I hope this is more helpful. Is there something else I should try to get
more info? I was hoping for something closer to a Python trace, where it
says a variable is the wrong type or a delimiter is missing. Womp. I am
definitely out of my depth, but now is a great time to learn! Can anyone
shed some more light on what may be wrong?



On Fri, May 4, 2018 at 7:49 PM, Yan, Zheng  wrote:

> On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan  wrote:
> > Forgot to reply to all:
> >
> > Sure thing!
> >
> > I couldn't install the ceph-mds-dbg packages without upgrading. I just
> > finished upgrading the cluster to 12.2.5. The issue still persists in
> 12.2.5
> >
> > From here I'm not really sure how to generate the backtrace, so I hope I
> > did it right. For others on Ubuntu this is what I did:
> >
> > * firstly up the debug_mds to 20 and debug_ms to 1:
> > ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
> >
> > * install the debug packages
> > ceph-mds-dbg in my case
> >
> > * I also added these options to /etc/ceph/ceph.conf just in case they
> > restart.
> >
> > * Now allow pids to dump (stolen partly from redhat docs and partly from
> > ubuntu)
> > echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
> > /etc/systemd/system.conf
> > sysctl fs.suid_dumpable=2
> > sysctl kernel.core_pattern=/tmp/core
> > systemctl daemon-reload
> > systemctl restart ceph-mds@$(hostname -s)
> >
> > * A crash was created in /var/crash by apport but gdb can't read it. I used
> > apport-unpack and then ran GDB on what is inside:
> >
>
> core dump should be in /tmp/core
>
> > apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> > cd /root/crash_dump/
> > gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
> > /root/ceph_mds_$(hostname -s)_backtrace
> >
> > * This left me with the attached backtraces (which I think are wrong as I

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-04 Thread Sean Sullivan
Most of this is over my head but the last line of the logs on both mds
servers shows something similar to:

 0> 2018-05-01 15:37:46.871932 7fd10163b700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7fd10163b700 thread_name:mds_rank_progr

When I search for this in the ceph-users and ceph-devel mailing lists, the
only mention I can see is from 12.0.3:

https://marc.info/?l=ceph-devel&m=149726392820648&w=2 -- ceph-devel

I don't see any mention of journal.cc in my logs, however, so I hope they are
not related. I also have not experienced any major loss in my cluster as of
yet, and cephfs-journal-tool shows my journals as healthy. To trigger this
bug I created a cephfs directory and user called aufstest. Here is the part
of the log with the crash mentioning aufstest.

https://pastebin.com/EL5ALLuE



I created a new bug ticket on ceph.com with all of the current info as I
believe this isn't a problem with my setup specifically and anyone else
trying this will have the same issue.
https://tracker.ceph.com/issues/23972

I hope this is the correct path. If anyone can guide me in the right
direction for troubleshooting this further I would be grateful.

On Tue, May 1, 2018 at 6:19 PM, Sean Sullivan  wrote:

> Forgot to reply to all:
>
>
> Sure thing!
>
> I couldn't install the ceph-mds-dbg packages without upgrading. I just
> finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5
>
> From here I'm not really sure how to generate the backtrace, so I hope I
> did it right. For others on Ubuntu this is what I did:
>
> * firstly up the debug_mds to 20 and debug_ms to 1:
> ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
>
> * install the debug packages
> ceph-mds-dbg in my case
>
> * I also added these options to /etc/ceph/ceph.conf just in case they
> restart.
>
> * Now allow pids to dump (stolen partly from redhat docs and partly from
> ubuntu)
> echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
> /etc/systemd/system.conf
> sysctl fs.suid_dumpable=2
> sysctl kernel.core_pattern=/tmp/core
> systemctl daemon-reload
> systemctl restart ceph-mds@$(hostname -s)
>
> * A crash was created in /var/crash by apport but gdb can't read it. I used
> apport-unpack and then ran GDB on what is inside:
>
> apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> cd /root/crash_dump/
> gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
> /root/ceph_mds_$(hostname -s)_backtrace
>
> * This left me with the attached backtraces (which I think are wrong as I
> see a lot of ?? yet gdb says /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)
>
>  kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
>  kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
>
>
> The log files are pretty large (one 4.1G and the other 200MB)
>
> kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
> kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log
>
> On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly 
> wrote:
>
>> Hello Sean,
>>
>> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan 
>> wrote:
>> > I was creating a new user and mount point. On another hardware node I
>> > mounted CephFS as admin to mount as root. I created /aufstest and then
>> > unmounted. From there it seems that both of my mds nodes crashed for
>> some
>> > reason and I can't start them any more.
>> >
>> > https://pastebin.com/1ZgkL9fa -- my mds log
>> >
>> > I have never had this happen in my tests so now I have live data here.
>> If
>> > anyone can lend a hand or point me in the right direction while
>> > troubleshooting that would be a godsend!
>>
>> Thanks for keeping the list apprised of your efforts. Since this is so
>> easily reproduced for you, I would suggest that you next get higher
>> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
>> a segmentation fault, a backtrace with debug symbols from gdb would
>> also be helpful.
>>
>> --
>> Patrick Donnelly
>>
>
>


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-01 Thread Sean Sullivan
Forgot to reply to all:

Sure thing!

I couldn't install the ceph-mds-dbg packages without upgrading. I just
finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5

From here I'm not really sure how to generate the backtrace, so I hope I
did it right. For others on Ubuntu this is what I did:

* firstly up the debug_mds to 20 and debug_ms to 1:
ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'

* install the debug packages
ceph-mds-dbg in my case

* I also added these options to /etc/ceph/ceph.conf just in case they
restart.

* Now allow pids to dump (stolen partly from redhat docs and partly from
ubuntu)
echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
/etc/systemd/system.conf
sysctl fs.suid_dumpable=2
sysctl kernel.core_pattern=/tmp/core
systemctl daemon-reload
systemctl restart ceph-mds@$(hostname -s)

* A crash was created in /var/crash by apport but gdb can't read it. I used
apport-unpack and then ran GDB on what is inside (see also the note below
on reading /tmp/core directly):

apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
cd /root/crash_dump/
gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
/root/ceph_mds_$(hostname -s)_backtrace

* This left me with the attached backtraces (which I think are wrong as I
see a lot of ?? yet gdb says /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)

 kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
 kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
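
Since kernel.core_pattern was set to /tmp/core above, it may also be possible to skip apport entirely and point gdb at that core file directly; a hedged sketch (binary path and core file name may differ on your build):

gdb /usr/bin/ceph-mds /tmp/core -ex 'thread apply all bt' -ex 'quit' | tee /root/ceph_mds_$(hostname -s)_core_backtrace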


The log files are pretty large (one 4.1G and the other 200MB)

kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log

On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly 
wrote:

> Hello Sean,
>
> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan 
> wrote:
> > I was creating a new user and mount point. On another hardware node I
> > mounted CephFS as admin to mount as root. I created /aufstest and then
> > unmounted. From there it seems that both of my mds nodes crashed for some
> > reason and I can't start them any more.
> >
> > https://pastebin.com/1ZgkL9fa -- my mds log
> >
> > I have never had this happen in my tests so now I have live data here. If
> > anyone can lend a hand or point me in the right direction while
> > troubleshooting that would be a godsend!
>
> Thanks for keeping the list apprised of your efforts. Since this is so
> easily reproduced for you, I would suggest that you next get higher
> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
> a segmentation fault, a backtrace with debug symbols from gdb would
> also be helpful.
>
> --
> Patrick Donnelly
>


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
I forgot that I left my VM mount command running. It hangs my VM, but more
alarmingly it crashes the MDS servers on the ceph cluster. The ceph cluster
is all hardware nodes, and the openstack VM does not have an admin keyring
(although the cephx keyring generated for cephfs does have write permissions
to the ec42 pool).


Cluster layout:

   Luminous CephFS cluster, version 12.2.4, Ubuntu 16.04,
   4.10.0-38-generic (all hardware nodes)

   Client: Openstack VM, Ubuntu 16.04, 4.13.0-39-generic,
   CephFS mounted via the kernel client

   kh08-8: Ceph monitor A (mon server)
   kh09-8: Ceph monitor B, Ceph MDS A
   kh10-8: Ceph monitor C, Ceph MDS failover

   ec42:            CephFS data pool, 16384 PGs, erasure coded with a 4/2 profile
   cephfs_metadata: CephFS metadata pool, 4096 PGs, replicated (n=3)

As far as I am aware this shouldn't happen. I will try upgrading as soon as
I can but I didn't see anything like this mentioned in the change log and
am worried this will still exist in 12.2.5. Has anyone seen this before?


On Mon, Apr 30, 2018 at 7:24 PM, Sean Sullivan  wrote:

> So I think I can reliably reproduce this crash from a ceph client.
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
> id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
> mgr: kh08-8(active)
> mds: cephfs-1/1/1 up  {0=kh09-8=up:active}, 1 up:standby
> osd: 570 osds: 570 up, 570 in
> ```
>
>
> then from a client try to mount aufs over cephfs:
> ```
> mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
> ```
>
> Now watch as your ceph mds servers fail:
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
> id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
> health: HEALTH_WARN
> insufficient standby MDS daemons available
>
>   services:
> mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
> mgr: kh08-8(active)
> mds: cephfs-1/1/1 up  {0=kh10-8=up:active(laggy or crashed)}
> ```
>
>
> I am now stuck in a degraded state and I can't seem to get them to start again.
>
> On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan 
> wrote:
>
>> I had 2 MDS servers (one active, one standby) and both were down. I took a
>> dumb chance and marked the active as down (it said it was up but laggy).
>> Then I started the primary again and now both are back up. I have never seen
>> this before, and I am also not sure what I just did.
>>
>> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan 
>> wrote:
>>
>>> I was creating a new user and mount point. On another hardware node I
>>> mounted CephFS as admin to mount as root. I created /aufstest and then
>>> unmounted. From there it seems that both of my mds nodes crashed for some
>>> reason and I can't start them any more.
>>>
>>> https://pastebin.c

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
So I think I can reliably reproduce this crash from a ceph client.

```
root@kh08-8:~# ceph -s
  cluster:
id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
health: HEALTH_OK

  services:
mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
mgr: kh08-8(active)
mds: cephfs-1/1/1 up  {0=kh09-8=up:active}, 1 up:standby
osd: 570 osds: 570 up, 570 in
```


then from a client try to mount aufs over cephfs:
```
mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
```

Now watch as your ceph mds servers fail:

```
root@kh08-8:~# ceph -s
  cluster:
id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
health: HEALTH_WARN
insufficient standby MDS daemons available

  services:
mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
mgr: kh08-8(active)
mds: cephfs-1/1/1 up  {0=kh10-8=up:active(laggy or crashed)}
```


I am now stuck in a degraded state and I can't seem to get them to start again.

On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan  wrote:

> I had 2 MDS servers (one active, one standby) and both were down. I took a
> dumb chance and marked the active as down (it said it was up but laggy).
> Then I started the primary again and now both are back up. I have never seen
> this before, and I am also not sure what I just did.
>
> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan 
> wrote:
>
>> I was creating a new user and mount point. On another hardware node I
>> mounted CephFS as admin to mount as root. I created /aufstest and then
>> unmounted. From there it seems that both of my mds nodes crashed for some
>> reason and I can't start them any more.
>>
>> https://pastebin.com/1ZgkL9fa -- my mds log
>>
>> I have never had this happen in my tests so now I have live data here. If
>> anyone can lend a hand or point me in the right direction while
>> troubleshooting that would be a godsend!
>>
>> I tried cephfs-journal-tool inspect and it reports that the journal
>> should be fine. I am not sure why it's crashing:
>>
>> /home/lacadmin# cephfs-journal-tool journal inspect
>> Overall journal integrity: OK
>>
>>
>>
>>
>


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
I had 2 MDS servers (one active, one standby) and both were down. I took a
dumb chance and marked the active as down (it said it was up but laggy).
Then I started the primary again and now both are back up. I have never seen
this before, and I am also not sure what I just did.
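
In case it helps anyone hitting the same thing, the commands involved are roughly these (a hedged sketch; the rank number and hostname are placeholders for whatever `ceph mds stat` reports):

# mark the laggy active rank as failed, then restart the daemon on its host
ceph mds fail 0
systemctl restart ceph-mds@kh09-8
ceph mds stat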

On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan  wrote:

> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting that would be a godsend!
>
> I tried cephfs-journal-tool inspect and it reports that the journal should
> be fine. I am not sure why it's crashing:
>
> /home/lacadmin# cephfs-journal-tool journal inspect
> Overall journal integrity: OK
>
>
>
>


[ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
I was creating a new user and mount point. On another hardware node I
mounted CephFS as admin to mount as root. I created /aufstest and then
unmounted. From there it seems that both of my mds nodes crashed for some
reason and I can't start them any more.

https://pastebin.com/1ZgkL9fa -- my mds log

I have never had this happen in my tests so now I have live data here. If
anyone can lend a hand or point me in the right direction while
troubleshooting that would be a godsend!

I tried cephfs-journal-tool inspect and it reports that the journal should
be fine. I am not sure why it's crashing:

/home/lacadmin# cephfs-journal-tool journal inspect
Overall journal integrity: OK
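
Before trying anything more invasive, it is probably worth exporting a backup of the journal and looking at the events it contains; a hedged sketch (the output path is a placeholder):

cephfs-journal-tool journal export /root/mds-journal.backup.bin
cephfs-journal-tool event get summary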


[ceph-users] luminous ubuntu 16.04 HWE (4.10 kernel). ceph-disk can't prepare a disk

2017-10-22 Thread Sean Sullivan
On freshly installed Ubuntu 16.04 servers with the HWE kernel selected
(4.10), I can not use ceph-deploy or ceph-disk to provision OSDs.

Whenever I try I get the following::

ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys
--bluestore --cluster ceph --fs-type xfs -- /dev/sdy
command: Running command: /usr/bin/ceph-osd --cluster=ceph
--show-config-value=fsid
get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
set_type: Will colocate block with data on /dev/sdy
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
--lookup bluestore_block_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
--lookup bluestore_block_db_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
--lookup bluestore_block_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
--lookup bluestore_block_wal_size
get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 9, in 
load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5704, in
run
main(sys.argv[1:])
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5655, in
main
args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2091, in
main
Prepare.factory(args).prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2080, in
prepare
self._prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2154, in
_prepare
self.lockbox.prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2842, in
prepare
verify_not_in_use(self.args.lockbox, check_partitions=True)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 950, in
verify_not_in_use
raise Error('Device is mounted', partition)
ceph_disk.main.Error: Error: Device is mounted: /dev/sdy5

Unmounting the disk does not seem to help either. I'm assuming something is
triggering too early, but I'm not sure how to delay it or figure that out.

Has anyone deployed on xenial with the 4.10 kernel? Am I missing something
important?
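
One way to see what grabs the partition as soon as ceph-disk creates it (a sketch of the investigation, not a fix; /dev/sdy5 is the partition from the error above):

# in one terminal, watch udev events while ceph-disk runs in another
udevadm monitor --udev --property
# afterwards, check what claimed the lockbox partition
udevadm info --name=/dev/sdy5
systemctl list-units | grep -i ceph-disk
mount | grep sdy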


[ceph-users] zombie partitions, ceph-disk failure.

2017-10-20 Thread Sean Sullivan
I am trying to stand up ceph (luminous) on three 72-disk Supermicro servers
running Ubuntu 16.04 with HWE enabled (for a 4.10 kernel for cephfs). I am
not sure how this is possible, but even though I am running the following
line to wipe all disks of their partitions, once I run ceph-disk to
partition the drive, udev or device mapper automatically mounts a lockbox
partition and ceph-disk fails::


wipe line::

for disk in $(lsblk --output MODEL,NAME | grep -iE "HGST|SSDSC2BA40" | awk '{print $NF}'); do
  sgdisk -Z /dev/${disk}
  dd if=/dev/zero of=/dev/${disk} bs=1024 count=1
  ceph-disk zap /dev/${disk}
  sgdisk -o /dev/${disk}
  sgdisk -G /dev/${disk}
done

ceph-disk line:
cephcmd="ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir
/etc/ceph/dmcrypt-keys --block.db /dev/${pssd}  --block.wal /dev/${pssd}
--bluestore --cluster ceph --fs-type xfs
-- /dev/${phdd}"


Prior to running that on a single disk, all of the drives are empty except
the OS drives:

root@kg15-1:/home/ceph-admin# lsblk --fs
NAMEFSTYPELABELUUID
 MOUNTPOINT
sdbu
sdy
sdam
sdbb
sdf
sdau
sdab
sdbk
sdo
sdbs
sdw
sdak
sdd
sdas
sdbi
sdm
sdbq
sdu
sdai
sdb
sdaq
sdbg
sdk
sdaz
sds
sdag
sdbe
sdi
sdax
sdq
sdae
sdbn
sdbv
├─sdbv3 linux_raid_member kg15-1:2 664f69b7-2dd7-7012-75e3-a920ba7416b8
│ └─md2 ext4   6696d9f5-3385-47cb-8e8b-058637f8a1b8 /
├─sdbv1 linux_raid_member kg15-1:0 c4c78d8b-5c0b-6d51-d0a4-ecd40432f98c
│ └─md0 ext4   44f76d8d-0333-49a7-ab89-dafe70f6f12d
/boot
└─sdbv2 linux_raid_member kg15-1:1 e3a74474-502c-098c-9415-7b99abcbd2e1
  └─md1 swap   37e071a9-9361-456b-a740-87ddc99a8260
[SWAP]
sdz
sdan
sdbc
sdg
sdav
sdac
sdbl
sdbt
sdx
sdal
sdba
sde
sdat
sdaa
sdbj
sdn
sdbr
sdv
sdaj
sdc
sdar
sdbh
sdl
sdbp
sdt
sdah
sda
├─sda2  linux_raid_member kg15-1:1 e3a74474-502c-098c-9415-7b99abcbd2e1
│ └─md1 swap   37e071a9-9361-456b-a740-87ddc99a8260
[SWAP]
├─sda3  linux_raid_member kg15-1:2 664f69b7-2dd7-7012-75e3-a920ba7416b8
│ └─md2 ext4   6696d9f5-3385-47cb-8e8b-058637f8a1b8 /
└─sda1  linux_raid_member kg15-1:0 c4c78d8b-5c0b-6d51-d0a4-ecd40432f98c
  └─md0 ext4   44f76d8d-0333-49a7-ab89-dafe70f6f12d
/boot
sdap
sdbf
sdj
sday
sdr
sdaf
sdbo
sdao
sdbd
sdh
sdaw
sdp
sdad
sdbm

-

But as soon as I run that cephcmd (which worked prior to upgrading to the
4.10 kernel):

ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys
--block.db /dev/sdd  --block.wal /dev/sdd  --bluestore --cluster ceph
--fs-type xfs -- /dev/sdbu
command: Running command: /usr/bin/ceph-osd --cluster=ceph
--show-config-value=fsid
get_dm_uuid: get_dm_uuid /dev/sdbu uuid path is
/sys/dev/block/68:128/dm/uuid
set_type: Will colocate block with data on /dev/sdbu
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
--lookup bluestore_block_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
--lookup bluestore_block_db_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
--lookup bluestore_block_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
--lookup bluestore_block_wal_size
get_dm_uuid: get_dm_uuid /dev/sdbu uuid path is
/sys/dev/block/68:128/dm/uuid
get_dm_uuid: get_dm_uuid /dev/sdbu uuid path is
/sys/dev/block/68:128/dm/uuid
get_dm_uuid: get_dm_uuid /dev/sdbu uuid path is
/sys/dev/block/68:128/dm/uuid
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 9, in 
load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5704, in
run
main(sys.argv[1:])
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5655, in
main
args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2091, in
main
Prepare.factory(args).prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2080, in
prepare
self._prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2154, in
_prepare
self.lockbox.prepare()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2842, in
prepare
verify_not_in_use(self.args.lockbox, check_partitions=True)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 950, in
verify_not_in_use
raise Error('Device is mounted', partition)
ceph_disk.main.Error: Error: Device is mounted: /dev/sdbu5


So it says sdbu is mounted. I unmount it and again it errors saying it
can't create the partition it just tried to create.

root@kg15-1:/# mount | grep sdbu
/dev/sdbu5 on
/var/lib/ceph/osd-lockbox/0e3baee9-a5dd-46f0-ae53-0e7dd2b0b257 type ext4
(rw,relatime,stripe=4,

Re: [ceph-users] Luminous can't seem to provision more than 32 OSDs per server

2017-10-19 Thread Sean Sullivan
I have tried using ceph-disk directly and I'm running into all sorts of
trouble, but I'm trying my best. Currently I am using the following
cobbled-together script, which seems to be working:
https://github.com/seapasulli/CephScripts/blob/master/provision_storage.sh
I'm at 11 right now. I hope this works.
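
For anyone who doesn't want to read the script, driving ceph-disk directly mostly boils down to a loop of this shape (a hedged sketch with placeholder device names and options, not the script's actual contents):

for disk in sdb sdc sdd; do
  ceph-disk zap /dev/${disk}
  ceph-disk -v prepare --dmcrypt --bluestore --cluster ceph -- /dev/${disk}
done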


[ceph-users] Luminous can't seem to provision more than 32 OSDs per server

2017-10-18 Thread Sean Sullivan
I am trying to install Ceph luminous (ceph version 12.2.1) on 4 Ubuntu
16.04 servers, each with 74 disks, 60 of which are HGST 7200rpm SAS drives::

HGST HUS724040AL sdbv  sas
root@kg15-2:~# lsblk --output MODEL,KNAME,TRAN | grep HGST | wc -l
60

I am trying to deploy them all with a line like the following::
ceph-deploy osd zap kg15-2:(sas_disk)
ceph-deploy osd create --dmcrypt --bluestore --block-db (ssd_partition)
kg15-2:(sas_disk)

This didn't seem to work at all so I am now trying to troubleshoot by just
provisioning the sas disks::
ceph-deploy osd create --dmcrypt --bluestore kg15-2:(sas_disk)

Across all 4 hosts I can only seem to get 32 OSDs up and after that the
rest fail::
root@kg15-1:~# ps faux | grep '[c]eph-osd' | wc -l
32
root@kg15-2:~# ps faux | grep '[c]eph-osd' | wc -l
32
root@kg15-3:~# ps faux | grep '[c]eph-osd' | wc -l
32

The ceph-deploy tool doesn't seem to log or notice any failure but the host
itself shows the following in the osd log:

2017-10-17 23:05:43.121016 7f8ca75c9e00  0 set uid:gid to 64045:64045
(ceph:ceph)
2017-10-17 23:05:43.121040 7f8ca75c9e00  0 ceph version 12.2.1 (
3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable), process
(unknown), pid 69926
2017-10-17 23:05:43.123939 7f8ca75c9e00  1
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b)
mkfs path /var/lib/ceph/tmp/mnt.8oIc5b
2017-10-17 23:05:43.124037 7f8ca75c9e00  1 bdev create path
/var/lib/ceph/tmp/mnt.8oIc5b/block type kernel
2017-10-17 23:05:43.124045 7f8ca75c9e00  1 bdev(0x564b7a05e900
/var/lib/ceph/tmp/mnt.8oIc5b/block) open path /var/lib/ceph/tmp/mnt.8oIc5b/
block
2017-10-17 23:05:43.124231 7f8ca75c9e00  1 bdev(0x564b7a05e900
/var/lib/ceph/tmp/mnt.8oIc5b/block) open size 4000668520448 (0x3a37a6d1000,
3725 GB) block_size 4096 (4096 B) rotational
2017-10-17 23:05:43.124296 7f8ca75c9e00  1
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b)
_set_cache_sizes max 0.5 < ratio 0.99
2017-10-17 23:05:43.124313 7f8ca75c9e00  1
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b)
_set_cache_sizes cache_size 1073741824 meta 0.5 kv 0.5 data 0
2017-10-17 23:05:43.124349 7f8ca75c9e00 -1
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b)
_open_db /var/lib/ceph/tmp/mnt.8oIc5b/block.db link target doesn't exist
2017-10-17 23:05:43.124368 7f8ca75c9e00  1 bdev(0x564b7a05e900
/var/lib/ceph/tmp/mnt.8oIc5b/block) close
2017-10-17 23:05:43.402165 7f8ca75c9e00 -1
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b)
mkfs failed, (2) No such file or directory
2017-10-17 23:05:43.402185 7f8ca75c9e00 -1 OSD::mkfs: ObjectStore::mkfs
failed with error (2) No such file or directory
2017-10-17 23:05:43.402258 7f8ca75c9e00 -1  ** ERROR: error creating empty
object store in /var/lib/ceph/tmp/mnt.8oIc5b: (2) No such file or directory


I am not sure where to start troubleshooting, so I have a few questions.

1.) Does anyone have any idea why 32?
2.) Is there a good guide / outline on how to get the benefit of storing
the keys in the monitor while still having ceph more or less manage the
drives, but provisioning the drives without ceph-deploy? I looked at the
manual deployment long and short forms and they don't mention dmcrypt or
bluestore at all. I know I can use crypttab and cryptsetup to do this and
then give ceph-disk the path to the mapped device (a rough sketch of that is
after these questions), but I would prefer to keep as much management in
ceph as possible if I could. (mailing list thread ::
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg38575.html )

3.) Ideally I would like to provision the drives with the DB on the SSD.
(Or would it be better to make a cache tier? I read on a reddit thread that
the tiering in ceph isn't being developed anymore; is it still worth it?)
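
The crypttab/cryptsetup route mentioned in question 2 would look roughly like this (a hedged sketch with placeholder device and key paths, not a tested recipe):

# create and open a LUKS container by hand, then hand the mapper device to ceph-disk
KEYFILE=/etc/ceph/dmcrypt-keys/osd-sdb.key
dd if=/dev/urandom of="$KEYFILE" bs=256 count=1
cryptsetup --batch-mode luksFormat --key-file "$KEYFILE" /dev/sdb
cryptsetup luksOpen --key-file "$KEYFILE" /dev/sdb osd-sdb
ceph-disk -v prepare --bluestore --cluster ceph -- /dev/mapper/osd-sdb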

Sorry for the bother and thanks for all the help!!!


[ceph-users] ceph-monstore-tool rebuild assert error

2017-02-07 Thread Sean Sullivan
I have a hammer cluster that died a bit ago (hammer 94.9) consisting of 3
monitors and 630 osds spread across 21 storage hosts. The cluster's monitors
all died due to leveldb corruption and the cluster was shut down. I was
finally given word that I could try to revive the cluster this week!

https://github.com/ceph/ceph/blob/hammer/doc/rados/troubleshooting/troubleshooting-mon.rst#recovery-using-osds

I see that the latest hammer code on GitHub has the ceph-monstore-tool
rebuild backport, and that is what I am running on the cluster now (ceph
version 0.94.9-4530-g83af8cd (83af8cdaaa6d94404e6146b68e532a784e3cc99c)). I
was able to scrape all 630 of the osds and am left with a 1.1G store.db
directory. Using Python I was successfully able to list all of the keys and
values, which was very promising. That said, I can not run the final command
in the recovery-using-osds article (ceph-monstore-tool rebuild)
successfully.
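
For reference, the procedure from that article boils down to something like the following (a condensed sketch; paths are placeholders and the scrape loop has to be run against every OSD, e.g. over ssh per host):

ms=/tmp/mon-store
mkdir -p $ms
# pull the cluster map fragments out of every OSD's copy
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --op update-mon-db --mon-store-path $ms
done
# then rebuild the monitor store with a keyring that has full mon/osd/mds caps
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring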

Whenever I run the tool (with the newly created admin keyring or with my
existing one) it errors with the following:


 0> 2017-02-17 15:00:47.516901 7f8b4d7408c0 -1 ./mon/MonitorDBStore.h: In function 'KeyValueDB::Iterator MonitorDBStore::get_iterator(const string&)' thread 7f8b4d7408c0 time 2017-02-07 15:00:47.516319


The complete trace is here
http://pastebin.com/NQE8uYiG

Can anyone lend a hand and tell me what may be wrong? I am able to iterate
over the leveldb database in Python, so the structure should be somewhat
okay. Am I SOL at this point? The cluster isn't production any longer and
while I don't have months of time I would really like to recover this
cluster just to see if it is at all possible.
-- 
- Sean:  I wrote this. -


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2017-02-06 Thread Sean Sullivan
cd6d88c0  5 asok(0x355a000)
register_command log flush hook 0x350a0d0
-3> 2017-02-06 17:35:54.362215 7f10cd6d88c0  5 asok(0x355a000)
register_command log dump hook 0x350a0d0
-2> 2017-02-06 17:35:54.362220 7f10cd6d88c0  5 asok(0x355a000)
register_command log reopen hook 0x350a0d0
-1> 2017-02-06 17:35:54.379684 7f10cd6d88c0  2 auth: KeyRing::load:
loaded key file /home/lacadmin/admin.keyring
 0> 2017-02-06 17:35:59.885651 7f10cd6d88c0 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f10cd6d88c0

 ceph version 0.94.9-4530-g83af8cd
(83af8cdaaa6d94404e6146b68e532a784e3cc99c)
 1: ceph-monstore-tool() [0x5e960a]
 2: (()+0x10330) [0x7f10cc5c8330]
 3: (strlen()+0x2a) [0x7f10cac629da]
 4: (std::basic_string, std::allocator
>::basic_string(char const*, std::allocator const&)+0x25)
[0x7f10cb576d75]
 5: (rebuild_monstore(char const*, std::vector >&, MonitorDBStore&)+0x878) [0x544958]
 6: (main()+0x3e05) [0x52c035]
 7: (__libc_start_main()+0xf5) [0x7f10cabfbf45]
 8: ceph-monstore-tool() [0x540347]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   1/ 1 ms
  10/10 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent   500
  max_new 1000
  log_file
--- end dump of recent events ---
Segmentation fault (core dumped)

--

I have tried copying my monitor and admin keyring into the admin.keyring
used to try to rebuild and it still fails. I am not sure whether this is
due to my packages or if something else is wrong. Is there a way to test or
see what may be happening?
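
One thing that may be worth double-checking (an assumption on my part, based on the recovery article, not something I have verified against this crash): the keyring handed to rebuild is expected to contain both a mon. and a client.admin entry with explicit caps, e.g. assembled with ceph-authtool along these lines (substitute your real keys for the generated ones):

ceph-authtool /tmp/admin.keyring --create-keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool /tmp/admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
ceph-authtool /tmp/admin.keyring --list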


On Sat, Aug 13, 2016 at 10:36 PM, Sean Sullivan 
wrote:

> So with a patched leveldb to skip errors I now have a store.db that I can
> extract the pg,mon,and osd map from. That said when I try to start kh10-8
> it bombs out::
>
> ---
> ---
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8# ceph-mon -i $(hostname) -d
> 2016-08-13 22:30:54.596039 7fa8b9e088c0  0 ceph version 0.94.7 (
> d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 708653
> starting mon.kh10-8 rank 2 at 10.64.64.125:6789/0 mon_data
> /var/lib/ceph/mon/ceph-kh10-8 fsid e452874b-cb29-4468-ac7f-f8901dfccebf
> 2016-08-13 22:30:54.608150 7fa8b9e088c0  0 starting mon.kh10-8 rank 2 at
> 10.64.64.125:6789/0 mon_data /var/lib/ceph/mon/ceph-kh10-8 fsid
> e452874b-cb29-4468-ac7f-f8901dfccebf
> 2016-08-13 22:30:54.608395 7fa8b9e088c0  1 mon.kh10-8@-1(probing) e1
> preinit fsid e452874b-cb29-4468-ac7f-f8901dfccebf
> 2016-08-13 22:30:54.608617 7fa8b9e088c0  1 
> mon.kh10-8@-1(probing).paxosservice(pgmap
> 0..35606392) refresh upgraded, format 0 -> 1
> 2016-08-13 22:30:54.608629 7fa8b9e088c0  1 mon.kh10-8@-1(probing).pg v0
> on_upgrade discarding in-core PGMap
> terminate called after throwing an instance of
> 'ceph::buffer::end_of_buffer'
>   what():  buffer::end_of_buffer
> *** Caught signal (Aborted) **
>  in thread 7fa8b9e088c0
>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>  1: ceph-mon() [0x9b25ea]
>  2: (()+0x10330) [0x7fa8b8f0b330]
>  3: (gsignal()+0x37) [0x7fa8b73a8c37]
>  4: (abort()+0x148) [0x7fa8b73ac028]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535]
>  6: (()+0x5e6d6) [0x7fa8b7cb16d6]
>  7: (()+0x5e703) [0x7fa8b7cb1703]
>  8: (()+0x5e922) [0x7fa8b7cb1922]
>  9: ceph-mon() [0x853c39]
>  10: (object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167)
> [0x894227]
>  11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf]
>  12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3]
>  13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8]
>  14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7]
>  15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a]
>  16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb]
>  17: (Monitor::init_paxos()+0x85) [0x5b2365]
>  18: (Monitor::preinit()+0x7d7) [0x5b6f87]
>  19: (main()+0x230c) [0x57853c]
>  20: (__libc_start_main()+0xf5) [0x7fa8b7393f45]
>  21: ceph-m

[ceph-users] ceph radosgw - 500 errors -- odd

2017-01-13 Thread Sean Sullivan
I am sorry for posting this if it has been addressed already. I am not
sure how to search through old ceph-users mailing list posts. I used to
use gmane.org but that seems to be down.

My setup::

I have a moderate ceph cluster (ceph hammer 94.9
- fe6d859066244b97b24f09d46552afc2071e6f90). The cluster is running Ubuntu
but the gateways are running CentOS 7 due to an odd memory issue we had
across all of our gateways.

Outside of that the cluster is pretty standard and healthy:

[root@kh11-9 ~]# ceph -s
cluster XXX-XXX-XXX-XXX
 health HEALTH_OK
 monmap e4: 3 mons at
{kh11-8=X.X.X.X:6789/0,kh12-8=X.X.X.X:6789/0,kh13-8=X.X.X.X:6789/0}
election epoch 150, quorum 0,1,2 kh11-8,kh12-8,kh13-8
 osdmap e69678: 627 osds: 627 up, 627 in

Here is my radosgw config in ceph::

[client.rgw.kh09-10]
log_file = /var/log/radosgw/client.radosgw.log
rgw_frontends = "civetweb port=80
access_log_file=/var/log/radosgw/rgw.access
 error_log_file=/var/log/radosgw/rgw.error"
rgw_enable_ops_log = true
rgw_ops_log_rados = true
rgw_thread_pool_size = 1000
rgw_override_bucket_index_max_shards = 23
error_log_file = /var/log/radosgw/civetweb.error.log
access_log_file = /var/log/radosgw/civetweb.access.log
objecter_inflight_op_bytes = 1073741824
objecter_inflight_ops = 20480
ms_dispatch_throttle_bytes = 209715200


The gateways are sitting behind haproxy for ssl termination. Here is my
haproxy config:

global
    log /dev/log local0
    log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /var/lib/haproxy/admin.sock mode 660 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
ca-base /etc/ssl/certs
crt-base /etc/ssl/private
tune.ssl.default-dh-param 2048
tune.ssl.maxrecord 2048

ssl-default-bind-ciphers
ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
ssl-default-server-ciphers
ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11
no-tls-tickets



defaults
log global
    mode    http
option  httplog
option  dontlognull
timeout connect 5000
timeout client  5
timeout server  5
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http
option forwardfor
option http-server-close

frontend fourfourthree
   bind :443 ssl crt /etc/ssl/STAR.opensciencedatacloud.org.pem
   reqadd X-Forwarded-Proto:\ https
   default_backend radosgw

backend radosgw
   cookie RADOSGWLB insert indirect nocache
   server primary 127.0.0.1:80 check cookie primary




I am seeing sporadic 500 errors in my access logs on all of my radosgws:

/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.635645 7feacf6c6700
 0 RGWObjManifest::operator++(): result: ofs=12607029248
stripe_ofs=12607029248 part_ofs=12598640640 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.637559 7feacf6c6700
 0 RGWObjManifest::operator++(): result: ofs=12611223552
stripe_ofs=12611223552 part_ofs=12598640640 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.642630 7feacf6c6700
 0 RGWObjManifest::operator++(): result: ofs=12614369280
stripe_ofs=12614369280 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.644368 7feadf6e6700
 1 == req done req=0x7fed00053a50 http_status=500 ==
/var/log/radosgw/client.radosgw.log:2017-01-13 11:30:41.644475 7feadf6e6700
 1 civetweb: 0x7fed9340: 10.64.0.124 - - [13/Jan/2017:11:28:24 -0600]
"GET
/BUCKET/306d4fe1-1515-44e0-b527-eee0e83412bf/306d4fe1-1515-44e0-b527-eee0e83412bf_gdc_realn_rehead.bam
HTTP/1.1" 500 0 - Boto/2.36.0 Python/2.7.6 Linux/3.13.0-95-generic
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.645611 7feacf6c6700
 0 RGWObjManifest::operator++(): result: ofs=12618563584
stripe_ofs=12618563584 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.647998 7feacf6c6700
 0 RGWObjManifest::operator++(): result: ofs=12622757888
stripe_ofs=12622757888 pa

Re: [ceph-users] Filling up ceph past 75%

2016-08-28 Thread Sean Sullivan
I've seen it in the past in the ML but I don't remember seeing it lately.
We recently had a Ceph engineer come out from RH and he mentioned he
hasn't seen this kind of disparity either, which made me jump on here to
double check, as I thought it was a well-known thing.

So I'm not crazy and the roughly 30% difference is normal? I've tried the
reweight-by-utilization function before (with other clusters) and have been
left with broken PGs (ones that seem to be stuck backfilling), so
I've stayed away from it. I saw that it has been redone, but with past
exposure I've been hesitant. I'll give it another shot in a test instance
and see how it goes.
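
(For reference, the commands I mean are roughly the following; the threshold and the per-OSD weight are illustrative numbers only, and the dry-run variant may need the newer/backported code mentioned below:)

# simulate, then apply, a reweight of OSDs sitting >10% above average utilization
ceph osd test-reweight-by-utilization 110
ceph osd reweight-by-utilization 110
# or persistently nudge a single over-full OSD via its CRUSH weight
ceph osd crush reweight osd.193 2.3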

Thanks for your help as always Mr. Balzer.

On Aug 28, 2016 8:59 PM, "Christian Balzer"  wrote:

>
> Hello,
>
> On Sun, 28 Aug 2016 14:34:25 -0500 Sean Sullivan wrote:
>
> > I was curious if anyone has filled ceph storage beyond 75%.
>
> If you (re-)search the ML archives, you will find plenty of cases like
> this, albeit most of them involuntary.
> Same goes for uneven distribution.
>
> > Admittedly we
> > lost a single host due to power failure and are down 1 host until the
> > replacement parts arrive but outside of that I am seeing disparity
> between
> > the most and least full osd::
> >
> > ID  WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR
> > MIN/MAX VAR: 0/1.26  STDDEV: 7.12
> >TOTAL 2178T 1625T  552T 74.63
> >
> > 559 4.54955  1.0 3724G 2327G 1396G 62.50 0.84
> > 193 2.48537  1.0 3724G 3406G  317G 91.47 1.23
> >
> Those extremes, especially with the weights they have, look odd indeed.
> Unless OSD 193 is in the rack which lost a node.
>
> > The crush weights are really off right now but even with a default crush
> > map I am seeing a similar spread::
> >
> > # osdmaptool --test-map-pgs --pool 1 /tmp/osdmap
> >  avg 82 stddev 10.54 (0.128537x) (expected 9.05095 0.110377x))
> >  min osd.336 55
> >  max osd.54 115
> >
> > That's with a default weight of 3.000 across all osds. I was wondering if
> > anyone can give me any tips on how to reach closer to 80% full.
> >
> > We have 630 osds (down one host right now but it will be back in in a
> week
> > or so) spread across 3 racks of 7 hosts (30 osds each). Our data
> > replication scheme is by rack and we only use S3 (so 98% of our data is
> in
> > .rgw.buckets pool). We are on hammer (94.7) and using the hammer
> tunables.
> >
> What comes to mind here is that probably your split into 3 buckets (racks)
> and then into 7 (hosts) is probably not helping the already rather fuzzy
> CRUSH algorithm to come up with an even distribution.
> Meaning that imbalances are likely to be amplified.
>
> And dense (30 OSDs) storage servers amplify things of course when one goes
> down.
>
> So how many PGs in the bucket pool then?
>
> With jewel (backport exists, check the ML archives) there's an improved
> reweight-by-utilization script that can help with these things.
> And I prefer to do this manually by using the (persistent) crush-reweight
> to achieve a more even distribution.
>
> For example on one cluster here I got the 18 HDD OSDs all within 100GB of
> each other.
>
> However having lost 3 of those OSDs 2 days ago the spread is now 300GB,
> most likely NOT helped by the manual adjustments done earlier.
> So your nice and evenly distributed cluster during normal state may be
> worse off using custom weights when there is a significant OSD loss.
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>


[ceph-users] Filling up ceph past 75%

2016-08-28 Thread Sean Sullivan
I was curious if anyone has filled ceph storage beyond 75%. Admittedly we
lost a single host due to power failure and are down 1 host until the
replacement parts arrive, but outside of that I am seeing disparity between
the most and least full osd::

ID  WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR
MIN/MAX VAR: 0/1.26  STDDEV: 7.12
   TOTAL 2178T 1625T  552T 74.63

559 4.54955  1.0 3724G 2327G 1396G 62.50 0.84
193 2.48537  1.0 3724G 3406G  317G 91.47 1.23

The crush weights are really off right now but even with a default crush
map I am seeing a similar spread::

# osdmaptool --test-map-pgs --pool 1 /tmp/osdmap
 avg 82 stddev 10.54 (0.128537x) (expected 9.05095 0.110377x))
 min osd.336 55
 max osd.54 115

That's with a default weight of 3.000 across all osds. I was wondering if
anyone can give me any tips on how to reach closer to 80% full.

We have 630 osds (down one host right now but it will be back in in a week
or so) spread across 3 racks of 7 hosts (30 osds each). Our data
replication scheme is by rack and we only use S3 (so 98% of our data is in
.rgw.buckets pool). We are on hammer (94.7) and using the hammer tunables.




-- 
- Sean:  I wrote this. -


Re: [ceph-users] How can we repair OSD leveldb?

2016-08-18 Thread Sean Sullivan
We have a hammer cluster that experienced a similar power failure and ended
up corrupting our monitors' leveldb stores. I am still trying to repair ours,
but I can give you a few tips that seem to help.

1.)  I would copy the database off to somewhere safe right away. Just
opening it seems to change it.

2.) Check out the ceph-test tools (ceph-objectstore-tool, ceph-kvstore-tool,
ceph-osdmap-tool, etc.). They let you list the keys/data in your
osd leveldb, possibly export them, and get some bearings on what you need to
do to recover your map.


3.) I am making a few assumptions here: a.) you are using
replication for your pools; b.) you are using either S3 or RBD, not CephFS.
From here, worst case, chances are your data is recoverable sans the osd and
monitor leveldb stores so long as the rest of the data is okay. (The actual
rados objects are spread across each osd in
'/var/lib/ceph/osd/ceph-*/current/blah_head)

If you use RBD there is a tool out there that lets you recover your RBD
images:: https://github.com/ceph/ceph/tree/master/src/tools/rbd_recover_tool
We only use S3 but this seems to be doable as well:

As an example we have a 9MB file that was stored in ceph::
I ran a find across all of the osds in my cluster and compiled a list of
files::

find /var/lib/ceph/osd/ceph-*/current/ -type f -iname \*this_is_my_File\.gzip\*

From here I ended up with a list that looks like the following (the segment
labels A-E are explained below)::

This is the head. It's usually the bucket.id\file__head__

[A]\[B].[C]:
default.20283.1\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam__head_CA57D598__1

[A]\[D]\[B].[C]:
default.20283.1\u\umultipart\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1__head_C338075C__1

And for each of those you'll have matching shadow files::

[A]\[E]\[B].[C]:
default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u1__head_02F05634__1

Here is another part of the multipart (this file only had 1 multipart and we
use multipart for all files larger than 5MB irrespective of size)::

[A]\[E]\[B].[C]:
default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u2__head_1EA07BDF__1

^^ notice the different part number (\u2) here.

A is the bucket.id and is the same for every object in the same bucket.
Even if you don't know what the bucket id for your bucket is, you should be
able to tell with good certainty which is which after you review your list.

B is our object name. We generate uuids for each object so I can not be
certain how much of this is ceph or us but the tail of your object name
should exist and be the same across all of your parts.

C is the suffix for each object. From here you may have suffixes like
the above.

D is your upload chunks.

E is your shadow chunks for each part of the multipart (I think).

I'm sure it's much more complicated than that, but that's what worked for
me. From here I just scanned through all of my osds, slowly pulled all of
the individual parts via ssh, and concatenated them all into their
respective files. So far the md5 sums match our md5s of the files prior to
uploading them to ceph in the first place.
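
The pull-and-reassemble step is conceptually just this (a heavily simplified sketch; hostnames, paths, and the part ordering file are placeholders, and the real work is getting the part/chunk order right):

# list candidate pieces on every OSD host (run per host, or wrap in ssh)
find /var/lib/ceph/osd/ceph-*/current/ -type f -iname '*this_is_my_File.gzip*'
# once the pieces are known and ordered (head first, then each part's chunks),
# pull them over ssh and append them in that order
> reassembled_this_is_my_File.gzip
while read host path; do
  ssh "$host" cat "$path" >> reassembled_this_is_my_File.gzip
done < parts_in_order.txt   # "host path" lines, ordered by hand
md5sum reassembled_this_is_my_File.gzip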

We have a python tool to do this but it's kind of specific to us. I can ask
the author and see if I can post a gist of the code if that helps. Please
let me know.



I can't speak for CephFS unfortunately as we do not use it but I wouldn't
be surprised if it is similar. So if you set up ssh-keys across all of your
osd nodes you should be able to export all of the data to another
server/cluster/etc.


I am working on trying to rebuild leveldb for our monitors with the correct
keys/values but I have a feeling this is going to be a long way off. I
wouldn't be surprised if the leveldb structure for the mon database is
similar to the osd omap database.

On Wed, Aug 17, 2016 at 4:54 PM, Dan Jakubiec 
wrote:

> Hi Wido,
>
> Thank you for the response:
>
> > On Aug 17, 2016, at 16:25, Wido den Hollander  wrote:
> >
> >
> >> Op 17 augustus 2016 om 17:44 schreef Dan Jakubiec <
> dan.jakub...@gmail.com>:
> >>
> >>
> >> Hello, we have a Ceph cluster with 8 OSD that recently lost power to
> all 8 machines.  We've managed to recover the XFS filesystems on 7 of the
> machines, but the O

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-13 Thread Sean Sullivan
13 22:30:54.593557 7fa8b9e088c0  5 asok(0x36a20f0)
register_command log dump hook 0x365a050
   -20> 2016-08-13 22:30:54.593561 7fa8b9e088c0  5 asok(0x36a20f0)
register_command log reopen hook 0x365a050
   -19> 2016-08-13 22:30:54.596039 7fa8b9e088c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 708653
   -18> 2016-08-13 22:30:54.597587 7fa8b9e088c0  5 asok(0x36a20f0) init
/var/run/ceph/ceph-mon.kh10-8.asok
   -17> 2016-08-13 22:30:54.597601 7fa8b9e088c0  5 asok(0x36a20f0)
bind_and_listen /var/run/ceph/ceph-mon.kh10-8.asok
   -16> 2016-08-13 22:30:54.597767 7fa8b9e088c0  5 asok(0x36a20f0)
register_command 0 hook 0x36560c0
   -15> 2016-08-13 22:30:54.597775 7fa8b9e088c0  5 asok(0x36a20f0)
register_command version hook 0x36560c0
   -14> 2016-08-13 22:30:54.597778 7fa8b9e088c0  5 asok(0x36a20f0)
register_command git_version hook 0x36560c0
   -13> 2016-08-13 22:30:54.597781 7fa8b9e088c0  5 asok(0x36a20f0)
register_command help hook 0x365a150
   -12> 2016-08-13 22:30:54.597783 7fa8b9e088c0  5 asok(0x36a20f0)
register_command get_command_descriptions hook 0x365a140
   -11> 2016-08-13 22:30:54.597860 7fa8b5181700  5 asok(0x36a20f0) entry
start
   -10> 2016-08-13 22:30:54.608150 7fa8b9e088c0  0 starting mon.kh10-8 rank
2 at 10.64.64.125:6789/0 mon_data /var/lib/ceph/mon/ceph-kh10-8 fsid
e452874b-cb29-4468-ac7f-f8901dfccebf
-9> 2016-08-13 22:30:54.608210 7fa8b9e088c0  1 -- 10.64.64.125:6789/0
learned my addr 10.64.64.125:6789/0
-8> 2016-08-13 22:30:54.608214 7fa8b9e088c0  1 accepter.accepter.bind
my_inst.addr is 10.64.64.125:6789/0 need_addr=0
-7> 2016-08-13 22:30:54.608279 7fa8b9e088c0  5 adding auth protocol:
cephx
-6> 2016-08-13 22:30:54.608282 7fa8b9e088c0  5 adding auth protocol:
cephx
-5> 2016-08-13 22:30:54.608311 7fa8b9e088c0 10 log_channel(cluster)
update_config to_monitors: true to_syslog: false syslog_facility: daemon
prio: info)
-4> 2016-08-13 22:30:54.608317 7fa8b9e088c0 10 log_channel(audit)
update_config to_monitors: true to_syslog: false syslog_facility: local0
prio: info)
-3> 2016-08-13 22:30:54.608395 7fa8b9e088c0  1 mon.kh10-8@-1(probing)
e1 preinit fsid e452874b-cb29-4468-ac7f-f8901dfccebf
-2> 2016-08-13 22:30:54.608617 7fa8b9e088c0  1
mon.kh10-8@-1(probing).paxosservice(pgmap
0..35606392) refresh upgraded, format 0 -> 1
-1> 2016-08-13 22:30:54.608629 7fa8b9e088c0  1 mon.kh10-8@-1(probing).pg
v0 on_upgrade discarding in-core PGMap
 0> 2016-08-13 22:30:54.611791 7fa8b9e088c0 -1 *** Caught signal
(Aborted) **
 in thread 7fa8b9e088c0

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph-mon() [0x9b25ea]
 2: (()+0x10330) [0x7fa8b8f0b330]
 3: (gsignal()+0x37) [0x7fa8b73a8c37]
 4: (abort()+0x148) [0x7fa8b73ac028]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535]
 6: (()+0x5e6d6) [0x7fa8b7cb16d6]
 7: (()+0x5e703) [0x7fa8b7cb1703]
 8: (()+0x5e922) [0x7fa8b7cb1922]
 9: ceph-mon() [0x853c39]
 10:
(object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167)
[0x894227]
 11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf]
 12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3]
 13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8]
 14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7]
 15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a]
 16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb]
 17: (Monitor::init_paxos()+0x85) [0x5b2365]
 18: (Monitor::preinit()+0x7d7) [0x5b6f87]
 19: (main()+0x230c) [0x57853c]
 20: (__libc_start_main()+0xf5) [0x7fa8b7393f45]
 21: ceph-mon() [0x59a3c7]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file
--- end dump of recent events ---
Aborted (core dumped)
---
---

I feel like I am so close, yet so far. Can anyone give me a nudge as to what
I can do next? It looks like it is bombing out while trying to get an updated
paxos.



On Fri, Aug 12, 2016 at 1:09 PM, Sean Sullivan 
wrote:

> A coworker patched leveldb and w

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-12 Thread Sean Sullivan
A coworker patched leveldb and we were able to export quite a bit of data
from kh08's leveldb database. At this point I think I need to re-construct
a new leveldb with whatever values I can. Is it the same leveldb database
across all 3 monitors? That is, will keys exported from one work in the
other? All should have the same keys/values, although constructed
differently, right? I can't blindly copy
/var/lib/ceph/mon/ceph-$(hostname)/store.db/ from one host to another,
right? But can I copy the keys/values from one to another?
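
For reference, a minimal sketch of dumping and comparing the keys in two
monitor stores with plyvel (assuming plyvel is installed and the stores can
be opened at all; paths are examples) looks roughly like this::

'''
# Minimal sketch: dump the keys from a monitor's leveldb store with plyvel so
# two monitors' stores can be compared. Assumes plyvel is installed and that
# the store is readable (or has been repaired enough to open). Paths are
# examples only.
import plyvel

def dump_keys(store_path):
    db = plyvel.DB(store_path, create_if_missing=False)
    try:
        # Keys are grouped by service prefix (paxos, osdmap, monmap, auth, ...)
        return sorted(key for key, _ in db.iterator())
    finally:
        db.close()

a = dump_keys('/var/lib/ceph/mon/ceph-kh08-8/store.db')
b = dump_keys('/var/lib/ceph/mon/ceph-kh10-8/store.db')
print len(a), len(b)
print sorted(set(a) - set(b))[:20]  # keys present in kh08-8 but not kh10-8
'''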

On Fri, Aug 12, 2016 at 12:45 PM, Sean Sullivan 
wrote:

> ceph-monstore-tool? Is that the same as monmaptool? Oops, never mind, I
> found it in the ceph-test package::
>
> I can't seem to get it working, though :-( dump monmap (or any of the other
> commands) bombs out with the same message:
>
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
> /var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
> /var/lib/ceph/mon/ceph-kh10-8 dump-keys
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
>
>
> I need to clarify: I originally had 2 clusters with this issue; I now have
> 1 with all 3 monitors dead and 1 that I was able to repair successfully. I
> am about to recap everything I know about the remaining issue. Should I
> start a new email thread about this instead?
>
> The cluster that is currently having issues is on hammer (0.94.7), and the
> monitor specs are the same::
> root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c
>  24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
>  ext4 volume comprised of 4x300GB 10k drives in raid 10.
>  ubuntu 14.04
>
> root@kh08-8:~# uname -a
> Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux
> root@kh08-8:~# ceph --version
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>
>
> From here, these are the errors I am getting when starting each of the
> monitors::
>
>
> ---
> root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d
> 2016-08-11 22:15:23.731550 7fe5ad3e98c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309
> Corruption: error in middle of record
> 2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument
> --
> root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d
> 2016-08-11 22:14:28.252370 7f7eaab908c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 30
> Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/
> store.db/10845998.ldb
> 2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument
> --
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon
> --cluster=ceph -i kh10-8 -d
> 2016-08-11 22:17:54.632762 7f80bf34d8c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
> 2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument
> ---
>
>
> For kh08, a coworker patched leveldb to print and skip on the first error,
> and that one is also missing a bunch of files. As such I think kh10-8 is my
> most likely candidate to recover, but either way recovery is probably not an
> option. I see leveldb has a repair.cc
> (https://github.com/google/leveldb/blob/master/db/repair.cc) but I do not
> see repair mentioned anywhere in the monitor code with respect to the
> dbstore. I tried using the leveldb python module (plyvel) to attempt a
> repair but my REPL just ends up dying.
>
> I understand two things: 1.) Without rebuilding the monitor backend
> leveldb store (the cluster map, as I understand it), all of the data in the
> cluster is essentially lost (right?)
> 2.) It should be possible to rebuild
> this database via some form of magic or (source)ry, as all of this data is
> essentially held throughout the cluster as well.
>
> We only use radosgw / S3 for this cluster. If there is a way to recover my
> data that is easier/more likely to succeed than rebuilding the leveldb of a
> monitor and starting up a single-monitor cluster, I would like to switch
> gears and focus on that.
>
> Looking

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-12 Thread Sean Sullivan
iew a CRUSH map, execute
ceph osd getcrushmap -o {filename}; then, decompile it by executing
crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}. You
can view the decompiled map in a text editor or with cat.
The MDS Map: Contains the current MDS map epoch, when the map was created,
and the last time it changed. It also contains the pool for storing
metadata, a list of metadata servers, and which metadata servers are up and
in. To view an MDS map, execute ceph mds dump.
```

As we don't use CephFS, the MDS map can essentially be blank (right?), so I
am left with 4 valid maps needed to get a working cluster again. I don't see
auth mentioned in there, but I need that too. Then I just need to rebuild the
leveldb database somehow with the right information and I should be good. So
a long, long journey ahead.

I don't think that the data is stored as strings or JSON, right? Am I going
down the wrong path here? Is there a shorter/simpler path to retrieve the
data from a cluster that lost all 3 monitors in a power failure? If I am
going down the right path, is there any advice on how I can assemble/repair
the database?

I see that there is an RBD recovery-from-a-dead-cluster tool. Is it possible
to do the same with S3 objects?

On Thu, Aug 11, 2016 at 11:15 AM, Wido den Hollander  wrote:

>
> > Op 11 augustus 2016 om 15:17 schreef Sean Sullivan <
> seapasu...@uchicago.edu>:
> >
> >
> > Hello Wido,
> >
> > Thanks for the advice.  While the data center has a/b circuits and
> > redundant power, etc if a ground fault happens it  travels outside and
> > fails causing the whole building to fail (apparently).
> >
> > The monitors are each the same with
> > 2x e5 cpus
> > 64gb of ram
> > 4x 300gb 10k SAS drives in raid 10 (write through mode).
> > Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10
> -
> > 3am CST)
> > Ceph hammer LTS 0.94.7
> >
> > (we are still working on our jewel test cluster so it is planned but not
> in
> > place yet)
> >
> > The only thing that seems to be corrupt is the monitors leveldb store.  I
> > see multiple issues on Google leveldb github from March 2016 about fsync
> > and power failure so I assume this is an issue with leveldb.
> >
> > I have backed up /var/lib/ceph/Mon on all of my monitors before trying to
> > proceed with any form of recovery.
> >
> > Is there any way to reconstruct the leveldb or replace the monitors and
> > recover the data?
> >
> I don't know. I have never done it. Other people might know this better
> than me.
>
> Maybe 'ceph-monstore-tool' can help you?
>
> Wido
>
> > I found the following post in which sage says it is tedious but
> possible. (
> > http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine
> if
> > I have any chance of doing it.  I have the fsid, the Mon key map and all
> of
> > the osds look to be fine so all of the previous osd maps  are there.
> >
> > I just don't understand what key/values I need inside.
> >
> > On Aug 11, 2016 1:33 AM, "Wido den Hollander"  wrote:
> >
> > >
> > > > Op 11 augustus 2016 om 0:10 schreef Sean Sullivan <
> > > seapasu...@uchicago.edu>:
> > > >
> > > >
> > > > I think it just got worse::
> > > >
> > > > all three monitors on my other cluster say that ceph-mon can't open
> > > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you
> lose
> > > all
> > > > 3 monitors? I saw a post by Sage saying that the data can be
> recovered as
> > > > all of the data is held on other servers. Is this possible? If so has
> > > > anyone had any experience doing so?
> > >
> > > I have never done so, so I couldn't tell you.
> > >
> > > However, it is weird that on all three it got corrupted. What hardware
> are
> > > you using? Was it properly protected against power failure?
> > >
> > > If you mon store is corrupted I'm not sure what might happen.
> > >
> > > However, make a backup of ALL monitors right now before doing anything.
> > >
> > > Wido
> > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>



-- 
- Sean:  I wrote this. -
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Sean Sullivan
Hello Wido,

Thanks for the advice. While the data center has A/B circuits, redundant
power, etc., apparently if a ground fault happens it travels outside and
takes down the whole building.

The monitors are each the same with
2x e5 cpus
64gb of ram
4x 300gb 10k SAS drives in raid 10 (write through mode).
Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 -
3am CST)
Ceph hammer LTS 0.94.7

(we are still working on our jewel test cluster so it is planned but not in
place yet)

The only thing that seems to be corrupt is the monitors' leveldb store. I
see multiple issues on the Google leveldb GitHub from March 2016 about fsync
and power failure, so I assume this is an issue with leveldb.

I have backed up /var/lib/ceph/mon on all of my monitors before trying to
proceed with any form of recovery.

Is there any way to reconstruct the leveldb or replace the monitors and
recover the data?

I found the following post in which Sage says it is tedious but possible
(http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if
I have any chance of doing it. I have the fsid and the mon keymap, and all of
the OSDs look to be fine, so all of the previous osdmaps are there.

I just don't understand what key/values I need inside.

On Aug 11, 2016 1:33 AM, "Wido den Hollander"  wrote:

>
> > Op 11 augustus 2016 om 0:10 schreef Sean Sullivan <
> seapasu...@uchicago.edu>:
> >
> >
> > I think it just got worse::
> >
> > all three monitors on my other cluster say that ceph-mon can't open
> > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose
> all
> > 3 monitors? I saw a post by Sage saying that the data can be recovered as
> > all of the data is held on other servers. Is this possible? If so has
> > anyone had any experience doing so?
>
> I have never done so, so I couldn't tell you.
>
> However, it is weird that on all three it got corrupted. What hardware are
> you using? Was it properly protected against power failure?
>
> If you mon store is corrupted I'm not sure what might happen.
>
> However, make a backup of ALL monitors right now before doing anything.
>
> Wido
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-10 Thread Sean Sullivan
I think it just got worse::

all three monitors on my other cluster say that ceph-mon can't open
/var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all
3 monitors? I saw a post by Sage saying that the data can be recovered as
all of the data is held on other servers. Is this possible? If so has
anyone had any experience doing so?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] lost power. monitors died. Cephx errors now

2016-08-10 Thread Sean Sullivan
So our datacenter lost power and 2/3 of our monitors died with FS
corruption. I tried fixing it but it looks like the store.db didn't make
it.

I copied the working journal via



   1. sudo mv /var/lib/ceph/mon/ceph-$(hostname){,.BAK}

   2. sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename}
      --keyring {tmp}/{key-filename}

   3. ceph-mon -i `hostname` --extract-monmap /tmp/monmap

   4. ceph-mon -i {mon-id} --inject-monmap {map-path}


and for a brief moment I had a quorum, but any ceph CLI commands would
result in cephx errors. Now the two failed monitors have elected a quorum
and the monitor that was working keeps getting kicked out of the cluster::


 '''
{
"election_epoch": 402,
"quorum": [
0,
1
],
"quorum_names": [
"kh11-8",
"kh12-8"
],
"quorum_leader_name": "kh11-8",
"monmap": {
"epoch": 1,
"fsid": "a6ae50db-5c71-4ef8-885e-8137c7793da8",
"modified": "0.00",
"created": "0.00",
"mons": [
{
"rank": 0,
"name": "kh11-8",
"addr": "10.64.64.134:6789\/0"
},
{
"rank": 1,
"name": "kh12-8",
"addr": "10.64.64.143:6789\/0"
},
{
"rank": 2,
"name": "kh13-8",
"addr": "10.64.64.151:6789\/0"
}
]
}
}
'''

At this point I am not sure what to do, as any ceph commands return cephx
errors and I can't seem to verify whether the new "quorum" is actually valid.

Is there any way to regenerate a cephx authentication key, or recover it with
hardware access to the nodes? Any advice on how to recover from what seems to
be complete monitor failure?


-- 
- Sean:  I wrote this. -
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Power Outage! Oh No!

2016-08-10 Thread Sean Sullivan
So we recently had a power outage and I seem to have lost 2 of 3 of my
monitors. I have since backed up /var/lib/ceph/mon/ceph-$(hostname) (to .BAK)
and then generated a new monitor filesystem via

''' sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename}
--keyring {tmp}/{key-filename} '''

After this I copied the monmap from the working monitor to the other two.
via::
''' ceph-mon -i {mon-id} --inject-monmap {map-path} '''

At this point I was left with a working monitor map (AFAIK), but ceph
CLI commands return::
'''
root@kh11-8:/var/run/ceph# ceph -s
2016-08-10 14:13:58.563241 7fdd719b3700  0 librados: client.admin
authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError
'''

Now after waiting a little while it looks like the quorum kicked out the
only working monitor::

'''
{
"election_epoch": 358,
"quorum": [
0,
1
],
"quorum_names": [
"kh11-8",
"kh12-8"
],
"quorum_leader_name": "kh11-8",
"monmap": {
"epoch": 1,
"fsid": "a6ae50db-5c71-4ef8-885e-8137c7793da8",
"modified": "0.00",
"created": "0.00",
"mons": [
{
"rank": 0,
"name": "kh11-8",
"addr": "10.64.64.134:6789\/0"
},
{
"rank": 1,
"name": "kh12-8",
"addr": "10.64.64.143:6789\/0"
},
{
"rank": 2,
"name": "kh13-8",
"addr": "10.64.64.151:6789\/0"
}
]
}
}
'''
kh13-8 was the original working node and kh11-8 and kh12-8 were the ones
that had fs issues.

Currently I am at a loss as to what to do as ceph -w and -s commands do not
work due to permissions/cephx errors and the original working monitor was
kicked out.

Is there any way to regenerate the cephx authentication and recover the
monitor map?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw (civetweb) hangs once around 850 established connections

2016-03-20 Thread Sean Sullivan

Hi Ben!

I'm using ubuntu 14.04
I have restarted the gateways with the numthreads line you suggested.  I 
hope this helps.  I would think I would get some kind of throttle log or 
something.


500 seems really strange as well.  Do you have a thread for this? RGW still 
has a weird race condition with multipart uploads where it garbage collects 
the parts but I think I get a 404 for those which makes sense. I hope 
you're not seeing something similar.


Thanks for the tip and good luck! I'll bump this thread when it happens again.


Sent from my pocket typo cannon.



On March 16, 2016 8:30:46 PM Ben Hines  wrote:


What OS are you using?

I have a lot more open connections than that. (though i have some other
issues, where rgw sometimes returns 500 errors, it doesn't stop like yours)

You might try tuning civetweb's num_threads and 'rgw num rados handles':

rgw frontends = civetweb num_threads=125
error_log_file=/var/log/radosgw/civetweb.error.log
access_log_file=/var/log/radosgw/civetweb.access.log
rgw num rados handles = 32

You can also up civetweb loglevel:

debug civetweb = 20

-Ben

On Wed, Mar 16, 2016 at 5:03 PM, seapasu...@uchicago.edu <
seapasu...@uchicago.edu> wrote:


I have a cluster of around 630 OSDs with 3 dedicated monitors and 2
dedicated gateways. The entire cluster is running hammer (0.94.5
(9764da52395923e0b32908d83a9f7304401fee43)).

Both of my gateways have stopped responding to curl right now:
root@host:~# timeout 5 curl localhost ; echo $?
124

From here I checked and it looks like radosgw has over 1 million open
files:
root@host:~# grep -i rados whatisopen.files.list | wc -l
1151753

And around 750 open connections:
root@host:~# netstat -planet | grep radosgw | wc -l
752
root@host:~# ss -tnlap | grep rados | wc -l
752
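
A quick way to watch both numbers together is something like the following
rough sketch (assuming a reasonably recent psutil is installed; it just
mirrors the netstat/lsof checks above)::

'''
# Sketch: count open file descriptors and established TCP connections for
# every radosgw process on this host. Assumes psutil 2.x+ is installed;
# equivalent to the netstat/lsof checks above.
import psutil

for proc in psutil.process_iter():
    try:
        if 'radosgw' not in proc.name():
            continue
        conns = proc.connections(kind='tcp')
        established = [c for c in conns if c.status == psutil.CONN_ESTABLISHED]
        print proc.pid, proc.num_fds(), 'fds,', len(established), 'established'
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
'''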

I don't think that the backend storage is hanging based on the following
dump:

root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok
objecter_requests | grep -i mtime
"mtime": "0.00",
"mtime": "0.00",
"mtime": "0.00",
"mtime": "0.00",
"mtime": "0.00",
"mtime": "0.00",
[...]
"mtime": "0.00",

The radosgw log is still showing lots of activity, and so does strace, which
makes me think this is a config issue or a limit of some kind that is not
being logged. Of what kind, I am not sure, as the log doesn't seem to show
any open file limit being hit and I don't see any big errors showing up in
the logs.
(last 500 lines of /var/log/radosgw/client.radosgw.log)
http://pastebin.com/jmM1GFSA

Perf dump of radosgw
http://pastebin.com/rjfqkxzE

Radosgw objecter requests:
http://pastebin.com/skDJiyHb

After restarting the gateway with '/etc/init.d/radosgw restart' the old
process remains, no error is sent, and then I get connection refused via
curl or netcat::
root@kh11-9:~# curl localhost
curl: (7) Failed to connect to localhost port 80: Connection refused

Once I kill the old radosgw via sigkill the new radosgw instance restarts
automatically and starts responding::
root@kh11-9:~# curl localhost
(the gateway now returns the anonymous bucket-listing XML, xmlns
"http://s3.amazonaws.com/doc/2006-03-01/"; the rest of the response was
truncated in the archive)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-deploy won't write journal if partition exists and using -- dmcrypt

2015-07-16 Thread Sean Sullivan
Some context: I have a small cluster running Ubuntu 14.04 and giant (now
hammer). I ran some updates and everything was fine. I rebooted a node and a
drive must have failed, as it no longer shows up.


I use --dmcrypt with ceph deploy and 5 osds per ssd journal.  To do this I 
created the ssd partitions already and pointed ceph-deploy towards the 
partition for the journal.


This worked in giant without issue (I was able to zap the OSD and redeploy
using the same journal all of the time). Now it seems to fail in hammer,
stating that the partition exists and I'm using --dmcrypt.


This raises a few questions.

1.) The ceph OSD start scripts must have a list of dm-crypt keys and uuids
somewhere, as the init mounts the drives. Is this accessible? Normally,
outside of ceph, I've used crypttab; how is ceph doing it?


2.) my ceph-deploy line is:
ceph-deploy osd --dmcrypt create ${host}:/dev/drive:/dev/journal_partition

I see that a variable in ceph-disk exists and is set to false. Is this
what I would need to change to get this working again? Or is this set to
false for a reason?


3.) I see multiple references to journal_uuid in Sebastien Han's blog as
well as on the mailing list when replacing a disk. I don't have this file,
and I'm assuming it's due to the --dmcrypt flag. I also see 60
dmcrypt-keys in /etc/ceph/dmcrypt-keys but only 30 mapped devices. Are the
journals not using these keys at all?






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW - Can't download complete object

2015-05-13 Thread Sean Sullivan
Thank you so much, Yehuda! I look forward to testing these. Is there a way
for me to pull this code in? Is it in master?



On May 13, 2015 7:08:44 PM Yehuda Sadeh-Weinraub  wrote:

Ok, I dug a bit more, and it seems to me that the problem is with the 
manifest that was created. I was able to reproduce a similar issue (opened 
ceph bug #11622), for which I also have a fix.


I created new tests to cover this issue, and we'll get those recent fixes 
as soon as we can, after we test for any regressions.


Thanks,
Yehuda

- Original Message -
> From: "Yehuda Sadeh-Weinraub" 
> To: "Sean Sullivan" 
> Cc: ceph-users@lists.ceph.com
> Sent: Wednesday, May 13, 2015 2:33:07 PM
> Subject: Re: [ceph-users] RGW - Can't download complete object
>
> That's another interesting issue. Note that for part 12_80 the manifest
> specifies (I assume, by the messenger log) this part:
>
> 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80

> (note the 'tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14')
>
> whereas it seems that you do have the original part:
> 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.12_80

> (note the '2/...')
>
> The part that the manifest specifies does not exist, which makes me think
> that there is some weird upload sequence, something like:
>
>  - client uploads part, upload finishes but client does not get ack for it
>  - client retries (second upload)
>  - client gets ack for the first upload and gives up on the second one
>
> But I'm not sure if it would explain the manifest, I'll need to take a look
> at the code. Could such a sequence happen with the client that you're using
> to upload?
>
> Yehuda
>
> - Original Message -
> > From: "Sean Sullivan" 
> > To: "Yehuda Sadeh-Weinraub" 
> > Cc: ceph-users@lists.ceph.com
> > Sent: Wednesday, May 13, 2015 2:07:22 PM
> > Subject: Re: [ceph-users] RGW - Can't download complete object
> >
> > Sorry for the delay. It took me a while to figure out how to do a range
> > request and append the data to a single file. The good news is that the end
> > file seems to be 14G in size which matches the files manifest size. The bad
> > news is that the file is completely corrupt and the radosgw log has errors.
> > I am using the following code to perform the download::
> >
> > 
https://raw.githubusercontent.com/mumrah/s3-multipart/master/s3-mp-download.py

> >
> > Here is a clip of the log file::
> > --
> > 2015-05-11 15:28:52.313742 7f570db7d700  1 -- 10.64.64.126:0/108 <==
> > osd.11 10.64.64.101:6809/942707 5  osd_op_reply(74566287
> > 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_12

> > [read 0~858004] v0'0 uv41308 ondisk = 0) v6  304+0+858004 (1180387808 0
> > 2445559038) 0x7f53d005b1a0 con 0x7f56f8119240
> > 2015-05-11 15:28:52.313797 7f57067fc700 20 get_obj_aio_completion_cb: io
> > completion ofs=12934184960 len=858004
> > 2015-05-11 15:28:52.372453 7f570db7d700  1 -- 10.64.64.126:0/108 <==
> > osd.45 10.64.64.101:6845/944590 2  osd_op_reply(74566142
> > 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80

> > [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 
> > 302+0+0 (3754425489 0 0) 0x7f53d005b1a0 con 0x7f56f81b1f30
> > 2015-05-11 15:28:52.372494 7f57067fc700 20 get_obj_aio_completion_cb: io
> > completion ofs=12145655808 len=4194304
> >
> > 2015-05-11 15:28:52.372501 7f57067fc700  0 ERROR: got unexpected error when
> > trying to read object: -2
> >
> > 2015-05-11 15:28:52.426079 7f570db7d700  1 -- 10.64.64.126:0/108 <==
> > osd.21 10.64.64.102:6856/1133473 16  osd_op_reply(74566144
> > 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.11_12

> > [read 0~3671316] v0'0 uv41395 ondisk = 0) v6  304+0+3671316 (1695485150
> > 0 3933234139) 0x7f53d005b1a0 con 0x7f56f81e17d0
> > 2015-05-11 15:28:52.426123 7f57067fc700 20 get_obj_aio_completion_cb: io
> > completion ofs=10786701312 len=3671316
> > 2015-05-11 15:28:52.504072 7f570db7d700  1 -- 10.64.64.126:0/108 <==
> > osd.82 10.64.64.103:6857/88524 2  osd_op_reply(74566283
> > 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de6

Re: [ceph-users] RGW - Can't download complete object

2015-05-13 Thread Sean Sullivan
Sorry for the delay. It took me a while to figure out how to do a range
request and append the data to a single file. The good news is that the
resulting file is 14G in size, which matches the file's manifest size. The
bad news is that the file is completely corrupt and the radosgw log has
errors. I am using the following code to perform the download::

https://raw.githubusercontent.com/mumrah/s3-multipart/master/s3-mp-download.py
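
The core of that script is just ranged GETs stitched back together; a
stripped-down sketch of the same idea (connection details, bucket and object
names are placeholders) would be::

'''
# Sketch: download an S3/RGW object in ranged chunks and append them to a
# single file. This mirrors what s3-mp-download.py does, minus the process
# pool. Keys, hostname, bucket and object names are placeholders.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS', aws_secret_access_key='SECRET',
    host='gateway.example.com', is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

key = conn.get_bucket('mybucket').get_key('path/to/object.bam')

chunk = 64 * 1024 * 1024  # 64MB per range request
with open('object.bam', 'wb') as out:
    offset = 0
    while offset < key.size:
        end = min(offset + chunk, key.size) - 1
        headers = {'Range': 'bytes=%d-%d' % (offset, end)}
        out.write(key.get_contents_as_string(headers=headers))
        offset = end + 1
'''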

Here is a clip of the log file::
--
2015-05-11 15:28:52.313742 7f570db7d700  1 -- 10.64.64.126:0/108 <== osd.11 
10.64.64.101:6809/942707 5  osd_op_reply(74566287 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_12
 [read 0~858004] v0'0 uv41308 ondisk = 0) v6  304+0+858004 (1180387808 0 
2445559038) 0x7f53d005b1a0 con 0x7f56f8119240
2015-05-11 15:28:52.313797 7f57067fc700 20 get_obj_aio_completion_cb: io 
completion ofs=12934184960 len=858004
2015-05-11 15:28:52.372453 7f570db7d700  1 -- 10.64.64.126:0/108 <== osd.45 
10.64.64.101:6845/944590 2  osd_op_reply(74566142 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80
 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6  
302+0+0 (3754425489 0 0) 0x7f53d005b1a0 con 0x7f56f81b1f30
2015-05-11 15:28:52.372494 7f57067fc700 20 get_obj_aio_completion_cb: io 
completion ofs=12145655808 len=4194304

2015-05-11 15:28:52.372501 7f57067fc700  0 ERROR: got unexpected error when 
trying to read object: -2

2015-05-11 15:28:52.426079 7f570db7d700  1 -- 10.64.64.126:0/108 <== osd.21 
10.64.64.102:6856/1133473 16  osd_op_reply(74566144 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.11_12
 [read 0~3671316] v0'0 uv41395 ondisk = 0) v6  304+0+3671316 (1695485150 0 
3933234139) 0x7f53d005b1a0 con 0x7f56f81e17d0
2015-05-11 15:28:52.426123 7f57067fc700 20 get_obj_aio_completion_cb: io 
completion ofs=10786701312 len=3671316
2015-05-11 15:28:52.504072 7f570db7d700  1 -- 10.64.64.126:0/108 <== osd.82 
10.64.64.103:6857/88524 2  osd_op_reply(74566283 
default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_8
 [read 0~4194304] v0'0 uv41566 ondisk = 0) v6  303+0+4194304 (1474509283 0 
3209869954) 0x7f53d005b1a0 con 0x7f56f81b1420
2015-05-11 15:28:52.504118 7f57067fc700 20 get_obj_aio_completion_cb: io 
completion ofs=12917407744 len=4194304

I couldn't really find any good documentation on how fragments/files are laid
out on the object store, so I am not sure where the file will be. How could
the 4MB object have issues while the cluster is completely HEALTH_OK? I did
do a rados stat of each object inside ceph and they all appear to be
there::

http://paste.ubuntu.com/8561/

The sum of all of the objects :: 14584887282
The stat of the object inside ceph:: 14577056082

So for some reason I have more data in objects than the key manifest says. We
easily identified this object via the same method as in my other thread::

# keys, conn and bucket come from the interactive boto session used above
for key in keys:
    if key.name == 'b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam':
        implicit = key.size
        explicit = conn.get_bucket(bucket).get_key(key.name).size
        absolute = abs(implicit - explicit)
        print key.name
        print implicit
        print explicit

b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam
14578628946
14577056082

So it looks like I have 3 different sizes. I figured this might be the
network issue that was mentioned in the other thread, but seeing as this is
not the first 512k, the overall size still matches, and there are errors in
the gateway log, I feel that this may be a bigger issue.

Has anyone seen this before?  The only mention of the "got unexpected error 
when trying to read object" is here 
(http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-May/021688.html) but 
my google skills are pretty poor. 
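
For anyone else digging into this, a rough sketch of inspecting the raw
pieces with the rados Python bindings (the pool and marker prefix come from
the stats above; how the head object and shadow pieces should be ordered is
exactly the open question, so this is an inspection aid, not a recovery
tool)::

'''
# Sketch: list and stat the raw rados objects belonging to one RGW object,
# using the bucket marker / manifest prefix seen in the logs above. Listing
# a large pool this way is slow, and the correct ordering of the shadow
# pieces is still an open question here.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('.rgw.buckets')

prefix = 'default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/'
pieces = sorted(o.key for o in ioctx.list_objects() if o.key.startswith(prefix))

total = 0
for name in pieces:
    size, _ = ioctx.stat(name)
    total += size
    print name, size
print 'sum of shadow pieces:', total

ioctx.close()
cluster.shutdown()
'''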
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Civet RadosGW S3 not storing complete obects; civetweb logs stop after rotation

2015-04-28 Thread Sean Sullivan
Will do. The reason for the partial request is that the total size of the
file is close to 1TB, so attempting a full download would take quite some
time on our 10Gb connection. What is odd is that if I request from the last
byte received to the end of the file we get a 406 "cannot be satisfied"
response, while if I request one byte less to the end of the file we are
only given 1 byte, not the rest of the file.


I will bump it up and attempt a partial then full download.  Thanks for the 
reply!!



On April 28, 2015 5:03:12 PM Yehuda Sadeh-Weinraub  wrote:




- Original Message -
> From: "Sean" 
> To: ceph-users@lists.ceph.com
> Sent: Tuesday, April 28, 2015 2:52:35 PM
> Subject: [ceph-users] Civet RadosGW S3 not storing complete obects; 
civetweb logs stop after rotation

>
> Hey yall!
>
> I have a weird issue and I am not sure where to look so any help would
> be appreciated. I have a large ceph giant cluster that has been stable
> and healthy almost entirely since its inception. We have stored over
> 1.5PB into the cluster currently through RGW and everything seems to be
> functioning great. We have downloaded smaller objects without issue but
> last night we did a test on our largest file (almost 1 terabyte) and it
> continuously times out at almost the exact same place. Investigating
> further it looks like Civetweb/RGW is returning that the uploads
> completed even though the objects are truncated. At least when we
> download the objects they seem to be truncated.
>
> I have tried searching through the mailing list archives to see what may
> be going on, but it looks like the mailing list DB may be going through
> some maintenance:
>
> 
> Unable to read word database file
> '/dh/mailman/dap/archives/private/ceph-users-ceph.com/htdig/db.words.db'
> 
>
> After checking through the gzipped logs I see that civetweb just stops
> logging after a rotation for some reason as well and my last log is from
> the 28th of march. I tried manually running /etc/init.d/radosgw reload
> but this didn't seem to work. As running the download again could take
> all day to error out we instead use the range request to try and pull
> the missing bites.
>
> https://gist.github.com/MurphyMarkW/8e356823cfe00de86a48 -- there is the
> code we are using to download via S3 / boto as well as the returned size
> report and overview of our issue.
> http://pastebin.com/cVLdQBMF -- Here is some of the log from the civetweb
> server they are hitting.
>
> Here is our current config ::
> http://pastebin.com/2SGfSDYG
>
> Current output of ceph health::
> http://pastebin.com/3f6iJEbu
>
> I am thinking that this must be a civetweb/radosgw bug of some kind. My
> questions are: 1.) Is there a way to try to download the object via rados
> directly? I am guessing I will need to find the prefix and then just cat
> all of the pieces together and hope I get it right. 2.) Why would ceph say
> the upload went fine but then return a smaller object?
>
>


Note that the returned HTTP response code is 206 (partial content):
/var/log/radosgw/client.radosgw.log:2015-04-28 16:08:26.525268 7f6e93fff700 
 2 req 0:1.067030:s3:GET 
/tcga_cghub_protected/ff9b730c-d303-4d49-b28f-e0bf9d8f1c84/759366461d2bf8bb0583d5b9566ce947.bam:get_obj:http 
status=206


It'll only return that if partial content is requested (through the http 
Range header). It's really hard to tell from these logs whether there's any 
actual problem. I suggest bumping up the log level (debug ms = 1, debug rgw 
= 20), and take a look at an entire request (one that includes all the
request http headers).


Yehuda



>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can not list objects in large bucket

2015-03-11 Thread Sean Sullivan
I have a single radosgw user with 2 S3 keys and 1 swift key. I have created a
few buckets, and I can list all of the contents of buckets A and C but not B
with either S3 (boto) or python-swiftclient. I am able to list the first 1000
entries using radosgw-admin 'bucket list --bucket=bucketB' without any
issues, but this doesn't really help.

The odd thing is I can still upload and download objects in the bucket. I just 
can't list them. I tried setting the bucket canned_acl to private and public 
but I still can't list the objects inside.

I'm using ceph .87 (Giant) Here is some info about the cluster::
http://pastebin.com/LvQYnXem -- ceph.conf
http://pastebin.com/efBBPCwa -- ceph -s
http://pastebin.com/tF62WMU9 -- radosgw-admin bucket list
http://pastebin.com/CZ8TkyNG -- python list bucket objects script
http://pastebin.com/TUCyxhMD -- radosgw-admin bucket stats --bucketB
http://pastebin.com/uHbEtGHs -- rados -p .rgw.buckets ls | grep default.20283.2 
(bucketB marker)
http://pastebin.com/WYwfQndV -- Python Error when trying to list BucketB via 
boto

I have no idea why this could be happening, outside of the ACL. Has anyone
seen this before? Any idea how I can get access to this bucket again via
S3/swift? Also, is there a way to list the full contents of a bucket via
radosgw-admin rather than the first 9000 lines / 1000 entries, or a way to
page through them?
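
For the paging part at least, boto can walk a bucket page by page with an
explicit marker; a quick sketch (assuming an existing boto connection `conn`
to the gateway)::

'''
# Sketch: page through a large bucket 1000 keys at a time using an explicit
# marker (the same page size radosgw-admin appears to stop at). Assumes
# `conn` is an existing boto S3 connection; bucket.list() would also iterate
# transparently, this just makes the paging visible.
def iter_all_keys(conn, bucket_name):
    bucket = conn.get_bucket(bucket_name)
    marker = ''
    while True:
        page = bucket.get_all_keys(marker=marker, max_keys=1000)
        for key in page:
            yield key
        if not page.is_truncated:
            break
        marker = page[-1].name

count = sum(1 for _ in iter_all_keys(conn, 'bucketB'))
print count
'''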

EDIT:: I just fixed it (I hope) but the fix doesn't make any sense:

radosgw-admin bucket unlink --uid=user --bucket=bucketB
radosgw-admin bucket link --uid=user --bucket=bucketB 
--bucket-id=default.20283.2

Now with swift or s3 (boto) I am able to list the bucket contents without issue 
^_^

Can someone elaborate on why this works and how it broke in the first place
when ceph was HEALTH_OK the entire time? With 3 replicas how did this happen?
Could this be a bug? Sorry for the rambling; I am confused and tired ;p



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-23 Thread Sean Sullivan
I am trying to understand the drive throttle markers that were mentioned, to
get an idea of why these drives are marked as slow::

here is the iostat of the drive /dev/sdbm
http://paste.ubuntu.com/9607168/
 
An I/O wait of 0.79 doesn't seem bad, but a write wait of 21.52 seems
really high. Looking at the ops in flight::
http://paste.ubuntu.com/9607253/


If we check against all of the osds on this node, this seems strange::
http://paste.ubuntu.com/9607331/

I do not understand why this node has ops in flight while the
remainder seem to be performing without issue. The load on the node is
pretty light as well, with an average CPU of 16 and an average iowait of
0.79::

---
/var/run/ceph# iostat -xm /dev/sdbm
Linux 3.13.0-40-generic (kh10-4) 12/23/2014 _x86_64_(40 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   3.940.00   23.300.790.00   71.97

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdbm  0.09 0.255.033.42 0.55 0.63  
288.02 0.09   10.562.55   22.32   2.54   2.15
---

I am still trying to understand the OSD throttle perf dump, so if anyone
can help shed some light on this that would be rad. From what I can tell
from the perf dump, 4 OSDs stand out (the last one, 228, being the slow one
currently). I ended up pulling .228 from the cluster and I have yet to
see another slow/blocked OSD in the output of ceph -s. It is still
rebuilding as I just pulled .228 out, but I am still getting at least
200MB/s via bonnie while the rebuild is occurring.
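
A rough sketch of polling every OSD admin socket on a node for ops in flight
(assuming the default .asok paths under /var/run/ceph) would be::

'''
# Sketch: ask every OSD admin socket on this node how many ops it currently
# has in flight. Assumes default .asok paths and that the local ceph CLI can
# reach them; the exact JSON layout can differ a bit between releases.
import glob
import json
import subprocess

for sock in sorted(glob.glob('/var/run/ceph/ceph-osd.*.asok')):
    try:
        out = subprocess.check_output(
            ['ceph', '--admin-daemon', sock, 'dump_ops_in_flight'])
        ops = json.loads(out).get('ops', [])
        if ops:
            print sock, len(ops), 'ops in flight'
    except (subprocess.CalledProcessError, ValueError):
        print sock, 'could not be queried'
'''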

Finally, in case this helps anyone: a single 1GB upload takes around 2.0 -
2.5 minutes, but if we split a 10G file into 100 x 100MB parts we get a
completion time of about 1 minute. That works out to a 10G file in about
1-1.5 minutes, or roughly 166MB/s, versus the 8MB/s I was getting before
with sequential uploads. All of these are coming from a single client via
boto. This leads me to think that this is a radosgw issue specifically.
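
Since splitting the upload helps so much, roughly what that path looks like
with boto's multipart API (a sketch; `conn` and the bucket/file names are
assumptions, and the s3-multipart scripts do the same thing with a process
pool)::

'''
# Sketch: upload one large file in 100MB parts through boto's multipart API,
# the same thing the 100 x 100MB test above does, just sequential here for
# clarity. Assumes `conn` is an existing boto S3 connection; wrap the part
# uploads in a pool to make it parallel.
import math
import os

def multipart_upload(bucket, keyname, path, part_size=100 * 1024 * 1024):
    size = os.path.getsize(path)
    parts = int(math.ceil(size / float(part_size)))
    mp = bucket.initiate_multipart_upload(keyname)
    try:
        with open(path, 'rb') as fp:
            for i in range(parts):
                fp.seek(i * part_size)
                mp.upload_part_from_file(
                    fp, part_num=i + 1,
                    size=min(part_size, size - i * part_size))
        return mp.complete_upload()
    except Exception:
        mp.cancel_upload()
        raise

multipart_upload(conn.get_bucket('testbucket'), 'bigfile.bin', '/tmp/bigfile.bin')
'''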

This again makes me think that this is not a slow disk issue but an
overall radosgw issue. If this were structural in any way, I would think
that all of rados/ceph's facilities would be hit, and the 8MBps limit per
client would be due to client throttling from a ceiling being hit. As it
turns out I am not hitting the ceiling, but some other aspect of radosgw or
boto is limiting my throughput. Is this logic not correct? I feel like I am
missing something.

Thanks for the help everyone!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-22 Thread Sean Sullivan
Awesome! I have yet to hear of any ZFS-in-ceph chat, nor have I seen it
on the mailing lists that I have caught. I would assume it would
function pretty well considering how long it has been in use in some
production systems I have seen. I have little to no experience with it
personally, though.

I thought the rados issue was weird as well. Even with a degraded
cluster I feel like I should be getting better throughput, unless I hit
an object with a bunch of bad PGs or something. We are using 2x 2x10G
cards in LACP to get over 10G on average, and we have separate gateway nodes
(we went with the Supermicro kit after all), so CPU on those nodes shouldn't
be an issue. It is extremely low as it is currently, which is again
surprising.

I honestly think that this is some kind of radosgw bug in giant, as I
have another giant cluster with the exact same config that is performing
much better with much less hardware. Hopefully it is indeed a bug of
some sort and not yet another screw-up on my end. Furthermore, hopefully I
can find the bug and fix it for others to find and profit from ^_^.

Thanks for all of your help!


On 12/22/2014 05:26 PM, Craig Lewis wrote:
>
>
> On Mon, Dec 22, 2014 at 2:57 PM, Sean Sullivan
> mailto:seapasu...@uchicago.edu>> wrote:
>
> Thanks Craig!
>
> I think that this may very well be my issue with osds dropping out
> but I am still not certain as I had the cluster up for a small
> period while running rados bench for a few days without any status
> changes.
>
>
> Mine were fine for a while too, through several benchmarks and a large
> RadosGW import.  My problems were memory pressure plus an XFS bug, so
> it took a while to manifest.  When it did, all of the ceph-osd
> processes on that node would have periods of ~30 seconds with 100%
> CPU.  Some OSDs would get kicked out.  Once that started, it was a
> downward spiral of recovery causing increasing load causing more OSDs
> to get kicked out...
>
> Once I found the memory problem, I cronned a buffer flush, and that
> usually kept things from getting too bad.
>
> I was able to see on the CPU graphs that CPU was increasing before the
> problems started.  Once CPU got close to 100% usage on all cores,
> that's when the OSDs started dropping out.  Hard to say if it was the
> CPU itself, or if the CPU was just a symptom of the memory pressure
> plus XFS bug.
>
>
>  
>
> The real big issue that I have is the radosgw one currently. After
> I figure out the root cause of the slow radosgw performance and
> correct that, it should hopefully buy me enough time to figure out
> the osd slow issue.
>
> It just doesn't make sense that I am getting 8mbps per client no
> matter 1 or 60 clients while rbd and rados shoot well above 600MBs
> (above 1000 as well).
>
>
> That is strange.  I was able to get >300 Mbps per client, on a 3 node
> cluster with GigE.  I expected that each client would saturate the
> GigE on their own, but 300 Mbps is more than enough for now.
>
> I am using the Ceph apache and fastcgi module, but otherwise it's a
> pretty standard apache setup.  My RadosGW processes are using a fair
> amount of CPU, but as long as you have some idle CPU, that shouldn't
> be the bottleneck.
>  
>
>  
>
>
> May I ask how you are monitoring your clusters logs? Are you just
> using rsyslog or do you have a logstash type system set up? Load
> wise I do not see a spike until I pull an osd out of the cluster
> or stop then start an osd without marking nodown.
>
>
> I'm monitoring the cluster with Zabbix, and that gives me pretty much
> the same info that I'd get in the logs.  I am planning to start
> pushing the logs to Logstash soon, as soon as I get my logstash is
> able to handle the extra load.
>  
>
>
> I do think that CPU is probably the cause of the osd slow issue
> though as it makes the most logical sense. Did you end up dropping
> ceph and moving to zfs or did you stick with it and try to
> mitigate it via file flusher/ other tweaks?
>
>
> I'm still on Ceph.  I worked around the memory pressure by
> reformatting my XFS filesystems to use regular sized inodes.  It was a
> rough couple of months, but everything has been stable for the last
> two months.
>
> I do still want to use ZFS on my OSDs.  It's got all the features of
> BtrFS, with the extra feature of being production ready.  It's just
> not production ready in Ceph yet.  It's coming along nicely though,
> and I hope to reformat one node to be all ZFS sometime next year.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-22 Thread Sean Sullivan
Hello Christian,

Sorry for the long wait. I actually did a rados bench earlier on
in the cluster without any failure, but it did take a while. That, and
there is actually a lot of data being downloaded to the cluster now.
Here are the rados results for 100 seconds::
http://pastebin.com/q5E6JjkG

On 12/19/2014 08:10 PM, Christian Balzer wrote:
> Hello Sean,
>
> On Fri, 19 Dec 2014 02:47:41 -0600 Sean Sullivan wrote:
>
>> Hello Christian,
>>
>> Thanks again for all of your help! I started a bonnie test using the 
>> following::
>> bonnie -d /mnt/rbd/scratch2/  -m $(hostname) -f -b
>>
> While that gives you a decent idea of what the limitations of kernelspace-
> mounted RBD images are, it won't tell you what your cluster is actually
> capable of in raw power.
Indeed I agree here, and I am not interested in raw power at this point
as I am a bit past this. I performed a rados test prior and it seemed to
do pretty well, or as expected. What I have noticed in rados bench tests
is that the test can only go as fast as the client network can allow.
The above seems to demonstrate this as well. If I were to start two
rados bench tests from two different hosts I am confident I can push
above 1100 Mbps without any issue.



>
> For that use rados bench, however if your cluster is as brittle as it
> seems, this may very well cause OSDs to flop, so look out for that.
> Observe your nodes (a bit tricky with 21, but try) while this is going on.
>
> To test the write throughput, do something like this:
> "rados -p rbd bench 60 write  -t 64"
>  
> To see your CPUs melt and get an idea of the IOPS capability with 4k
> blocks, do this:
>
> "rados -p rbd bench 60 write  -t 64 -b 4096"
>  
I will try with 4k blocks next to see how this works out. I honestly
think that the cluster will be stressed but should be able to handle it. A
rebuild on failure will be scary, however.

>> Hopefully it completes in the next hour or so. A reboot of the slow OSDs 
>> clears the slow marker for now
>>
>> kh10-9$ ceph -w
>>  cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec
>>   health HEALTH_OK
>>   monmap e1: 3 mons at
> 3 monitors, another recommendation/default that isn't really adequate for
> a cluster of this size and magnitude. Because it means you can only loose
> ONE monitor before the whole thing seizes up. I'd get 2 more (with DC
> S3700, 100 or 200GB will do fine) and spread them among the racks. 
The plan is to scale out the monitors with two more; they have not
arrived yet, but that is in the plan. I agree about the number of
monitors. I talked to Inktank/Red Hat about this when I was testing
the 36-disk storage node cluster, though they said something along the
lines of "we shouldn't need 2 more until we have a much larger cluster."
Just know that two more monitors are indeed on the way and that this is
a known issue.

>  
>> {kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0},
>>  
>> election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8
>>   osdmap e15356: 1256 osds: 1256 up, 1256 in
>>pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects
>>  566 TB used, 4001 TB / 4567 TB avail
> That's a lot of objects and data, was your cluster that full before
> it started to have problems?
This is due to the rados benches I ran as well as the massive amount of
data we are transferring to the current cluster.
We have 20 pools currently:
1 data,2 rbd,3 .rgw,4 .rgw.root,5 .rgw.control,6 .rgw.gc,7
.rgw.buckets,8 .rgw.buckets.index,9 .log,10 .intent-log,11 .usage,12
.users,13 .users.email,14 .users.swift,15 .users.uid,16 volumes,18
vms,19 .rgw.buckets.extra,20 images,

data and rbd will be removed once I am done testing; these were the
test pools I created. The rest are the standard S3/swift / OpenStack
pools.

>> 87560 active+clean
> Odd number of PGs, it makes for 71 per OSD, a bit on the low side. OTOH
> you're already having scaling issues of sorts, so probably leave it be for
> now. How many pools?
20 pools, but we will only have 18 once I delete data and rbd (these
were just testing pools to begin with).
>
>>client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s
>>
> Is that a typical, idle, steady state example or is this while you're
> running bonnie and pushing things into radosgw?
I am doing both actually. The downloads into radosgw can't be stopped
right now but I can stop the bonnie tests.


>
>> 2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 
>> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 
>> MB/s rd, 1090 MB/s wr, 5774 op/s
>> 2014-12-19 01:27:29.581955 mon.0 [INF] pgmap 

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-19 Thread Sean Sullivan

Hello Christian,

Thanks again for all of your help! I started a bonnie test using the 
following::

bonnie -d /mnt/rbd/scratch2/  -m $(hostname) -f -b

Hopefully it completes in the next hour or so. A reboot of the slow OSDs 
clears the slow marker for now


kh10-9$ ceph -w
cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec
 health HEALTH_OK
 monmap e1: 3 mons at 
{kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0}, 
election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8

 osdmap e15356: 1256 osds: 1256 up, 1256 in
  pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects
566 TB used, 4001 TB / 4567 TB avail
   87560 active+clean
  client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s

2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 
MB/s rd, 1090 MB/s wr, 5774 op/s
2014-12-19 01:27:29.581955 mon.0 [INF] pgmap v788798: 87560 pgs: 87560 
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 542 
MB/s rd, 1548 MB/s wr, 7552 op/s
2014-12-19 01:27:30.638744 mon.0 [INF] pgmap v788799: 87560 pgs: 87560 
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 726 
MB/s rd, 2284 MB/s wr, 10451 op/s


Once the next slow osd comes up I guess I can tell it to bump it's log 
up to 5 and see what may be going on.


That said I didn't see much last time.

On 12/19/2014 12:17 AM, Christian Balzer wrote:

Hello,

On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote:


Wow Christian,

Sorry I missed these in line replies. Give me a minute to gather some
data. Thanks a million for the in depth responses!


No worries.


I thought about raiding it but I needed the space unfortunately. I had a
3x60 osd node test cluster that we tried before this and it didn't have
this flopping issue or rgw issue I am seeing .


I think I remember that...


I hope not. I don't think I posted about it at all. I only had it for a
short period before it was repurposed. I did post about a cluster
before that with 32 OSDs per node, though. That one had tons of issues
but now seems to be running relatively smoothly.




You do realize that the RAID6 configuration option I mentioned would
actually give you MORE space (replication of 2 is sufficient with reliable
OSDs) than what you have now?
Albeit probably at reduced performance, how much would also depend on the
controllers used, but at worst the RAID6 OSD performance would be
equivalent to that of single disk.
So a Cluster (performance wise) with 21 nodes and 8 disks each.


Ah, I must have misread; I thought you said RAID 10, which would halve the
storage and add a small write penalty. For a RAID 6 of 4 drives I would get
something like 160 IOPS (assuming each drive is 75), which may be worth
it. I would just hate to have 2+ failures and lose 4-5 drives as opposed
to 2, and the rebuild for a RAID 6 always left a sour taste in my mouth.
Still, 4 slow drives is better than 4TB of data over the network slowing
down the whole cluster.


I knew about the 40 cores being low, but I thought that at 2.7GHz we might
be fine, as the docs recommend roughly 1GHz of Xeon per OSD. The cluster
hovers around a load of 15-18, but with the constantly flapping disks I am
seeing it bump up as high as 120 when a disk is marked out of the cluster.


kh10-3$ cat /proc/loadavg
14.35 29.50 66.06 14/109434 724476





  
No need, now that strange monitor configuration makes sense, you (or
whoever spec'ed this) went for the Supermicro Ceph solution, right?

Indeed.

In my not so humble opinion, this the worst storage chassis ever designed
by a long shot and totally unsuitable for Ceph.
I told the Supermicro GM for Japan as much. ^o^
Well, it looks like I done goofed. I thought it was odd that they went
against most of what the ceph documentation says about recommended hardware.
I read/heard from them that they worked with Inktank on this, though, so I
was swayed. Besides that, we really needed the density per rack due to
limited floor space. As I said, in capable hands this cluster would work,
but by stroke of luck...




Every time a HDD dies, you will have to go and shut down the other OSD
that resides on the same tray (and set the cluster to noout).
Even worse of course if a SSD should fail.
And if somebody should just go and hotswap things w/o that step first,
hello data movement storm (2 or 10 OSDs instead of 1 or 5 respectively).

Christian
Thanks for your help and insight on this! I am going to take a nap and
hope the cluster doesn't catch fire before I wake up o_o

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan

Wow Christian,

Sorry I missed these in line replies. Give me a minute to gather some data. 
Thanks a million for the in depth responses!


I thought about raiding it, but unfortunately I needed the space. I had a
3x60 OSD node test cluster that we tried before this, and it didn't have
this flopping issue or the RGW issue I am seeing.


I can quickly answer the case/make questions; the model will need to wait
till I get home :)


Case is a 72 disk supermicro chassis, I'll grab the exact model in my next 
reply.


Drives are HGST 4TB drives; I'll grab the model once I get home as well.

The 300 was completely incorrect and it can push more, it was just meant 
for a quick comparison but I agree it should be higher.


Thank you so much. Please hold on and I'll grab the extra info ^~^


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan

Thanks!
It would be really great in the right hands. Through some stroke of luck 
it's in mine. The flapping osd is becoming a real issue at this point as it 
is the only possible lead I have to why the gateways are transferring so 
slowly. The weird issue is that I can have 8 or 60 transfers going to the 
radosgw and they are all at roughly 8mbps. To work around this right now I 
am starting 60+ clients across 10 boxes to get roughly 1gbps per gateway 
across gw1 and gw2.


I have been staring at logs for hours trying to get a handle on what the
issue may be, with no luck.


The third gateway was made last minute to test and rule out the hardware.


On December 18, 2014 10:57:41 PM Christian Balzer  wrote:



Hello,

Nice cluster, I wouldn't mind getting my hand or her ample nacelles, er,
wrong movie. ^o^

On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote:

> Hello Yall!
>
> I can't figure out why my gateways are performing so poorly and I am not
> sure where to start looking. My RBD mounts seem to be performing fine
> (over 300 MB/s)
>
I wouldn't call 300MB/s writes fine with a cluster of this size.
How are you testing this (which tool, settings, from where)?

> while uploading a 5G file to Swift/S3 takes 2m32s
> (32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing
> with nuttcp shows that I can transfer from a client with 10G interface
> to any node on the ceph cluster at the full 10G and ceph can transfer
> close to 20G between itself. I am not really sure where to start looking
> as outside of another issue which I will mention below I am clueless.
>
I know nuttin about radosgw, but I wouldn't be surprised that the
difference you see here is based how that is eventually written to the
storage (smaller chunks than what you're using to test RBD performance).

> I have a weird setup
I'm always interested in monster storage nodes, care to share what case
this is?

> [osd nodes]
> 60 x 4TB 7200 RPM SATA Drives
What maker/model?

> 12 x  400GB s3700 SSD drives
Journals, one assumes.

> 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
> the 3 cards)
I smell a port-expander or 3 on your backplane.
And while making sure that your SSDs get undivided 6Gb/s love would
probably help, you still have plenty of bandwidth here (4.5Gb/s per
drive), so no real issue.

> 512 GB of RAM
Sufficient.

> 2 x CPU E5-2670 v2 @ 2.50GHz
Vastly, and I mean VASTLY insufficient.
It would still be 10GHz short of the (optimistic IMHO) recommendation of
1GHz per OSD w/o SSD journals.
With SSD journals my experience shows that with certain write patterns
even 3.5GHz per OSD isn't sufficient. (there are several threads
about this here)

> 2 x 10G interfaces  LACP bonded for cluster traffic
> 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
> ports)
>
Your journals could handle 5.5GB/s, so you're limiting yourself here a
bit, but not too horribly.

If I had been given this hardware, I would have RAIDed things (different
controller) to keep the number of OSDs per node to something the CPUs (any
CPU really!) can handle.
Something like 16 x 4HDD RAID10 + SSDs +spares (if possible) for
performance and  8 x 8HDD RAID6 + SSDs +spares for capacity.
That still gives you 336 or 168 OSDs, allows for a replication size of 2
and as bonus you'll probably never have to deal with a failed OSD. ^o^

> [monitor nodes and gateway nodes]
> 4 x 300G 1500RPM SAS drives in raid 10
I would have used Intel DC S3700s here as well, mons love their leveldb to
be fast but
> 1 x SAS 2208
combined with this it should be fine.

> 64G of RAM
> 2 x CPU E5-2630 v2
> 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)
>
>
> Here is a pastebin dump of my details, I am running ceph giant 0.87
> (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
> across the entire cluster.
>
> http://pastebin.com/XQ7USGUz -- ceph health detail
That looks positively scary, blocked requests for hours...

> http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> http://pastebin.com/BC3gzWhT -- ceph osd tree
scroll, scroll, woah! ^o^

> http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
> http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)
>
>
> We ran into a few issues with density (conntrack limits, pid limit, and
> number of open files) all of which I adjusted by bumping the ulimits in
> /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
> signs of these limits being hit so I have not included my limits or
> sysctl conf. If you like this as well let me know and I can include it.
>
> One of the issues I am seeing is that OSDs have started to flop/ be
> marked as slow. The cluster was HEALTH_OK with all of the d

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
Thanks for the reply, Gregory.

Sorry if this is in the wrong direction or something. Maybe I do not
understand.

To test uploads I use bash time with either python-swiftclient or
boto's key.set_contents_from_filename against the radosgw. I was unaware
that radosgw had any type of throttle settings in the configuration (I
can't seem to find any either). As for RBD mounts, I test by creating a
1TB image, mounting it, and writing a file to it with time+cp or dd. Not
the most accurate test, but I think it should be good enough as a quick
functionality check. So for writes, it's more about functionality than
performance. I would think a basic functionality test should yield more
than 8MB/s though.
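
For reference, the tests are roughly along these lines (a sketch, not the
exact commands; container, file and mount point names are placeholders):

time swift upload testcontainer 5G.img
dd if=/dev/zero of=/mnt/rbdtest/ddfile bs=4M count=1280 oflag=direct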

As for checking admin sockets: I have, actually. I set the 3rd gateway's
debug_civetweb to 10 as well as debug_rgw to 5, but I still do not see
anything that stands out. The snippet of the log I pasted has these
values set. I did the same for an OSD that is marked as slow (1112). All
I can see in the log for that OSD are ticks and heartbeat responses,
though, nothing that shows any issues. Finally I did it for the primary
monitor node, with debug_mon set to 5, to see if I would see anything
there (http://pastebin.com/hhnaFac1). I do not really see anything that
would stand out as a failure (like a fault or timeout error).

What kind of throttler limits do you mean? I didn't/don't see any
mention of rgw throttler limits in the ceph.com docs or the admin socket,
just OSD/filesystem throttles like the inode/flusher limits. Do you mean
those? I have not messed with those limits on this cluster yet; do you
think it would help?
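
For what it's worth, this is roughly how I have been poking at the admin
sockets (socket paths depend on how the daemons were started, so treat them
as placeholders):

ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.rgw03.asok perf dump \
    | python -m json.tool | grep -A 4 -i throttle
ceph --admin-daemon /var/run/ceph/ceph-osd.1112.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.1112.asok dump_historic_ops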

On 12/18/2014 10:24 PM, Gregory Farnum wrote:
> What kind of uploads are you performing? How are you testing?
> Have you looked at the admin sockets on any daemons yet? Examining the
> OSDs to see if they're behaving differently on the different requests
> is one angle of attack. The other is to look into whether the RGW daemons
> are hitting throttler limits or something that the RBD clients aren't.
> -Greg
> On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan wrote:
>
> Hello Yall!
>
> I can't figure out why my gateways are performing so poorly and I
> am not
> sure where to start looking. My RBD mounts seem to be performing fine
> (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s
> (32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing
> with nuttcp shows that I can transfer from a client with 10G interface
> to any node on the ceph cluster at the full 10G and ceph can transfer
> close to 20G between itself. I am not really sure where to start
> looking
> as outside of another issue which I will mention below I am clueless.
>
> I have a weird setup
> [osd nodes]
> 60 x 4TB 7200 RPM SATA Drives
> 12 x  400GB s3700 SSD drives
> 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly
> across
> the 3 cards)
> 512 GB of RAM
> 2 x CPU E5-2670 v2 @ 2.50GHz
> 2 x 10G interfaces  LACP bonded for cluster traffic
> 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
> ports)
>
> [monitor nodes and gateway nodes]
> 4 x 300G 1500RPM SAS drives in raid 10
> 1 x SAS 2208
> 64G of RAM
> 2 x CPU E5-2630 v2
> 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G
> ports)
>
>
> Here is a pastebin dump of my details, I am running ceph giant 0.87
> (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel
> 3.13.0-40-generic
> across the entire cluster.
>
> http://pastebin.com/XQ7USGUz -- ceph health detail
> http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> http://pastebin.com/BC3gzWhT -- ceph osd tree
> http://pastebin.com/eRyY4H4c --
> /var/log/radosgw/client.radosgw.rgw03.log
> http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't
> let me)
>
>
> We ran into a few issues with density (conntrack limits, pid
> limit, and
> number of open files) all of which I adjusted by bumping the
> ulimits in
> /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
> signs of these limits being hit so I have not included my limits or
> sysctl conf. If you like this as well let me know and I can
> include it.
>
> One of the issues I am seeing is that OSDs have started to flop/ be
> marked as slow. The cluster was HEALTH_OK with all of the disks added
> for over 3 weeks before this behaviour started. RBD transfers seem
> to be
> fine for the most part which makes me think that this has little
> baring
> on the gateway issue but it may be related.

[ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
Hello Yall!

I can't figure out why my gateways are performing so poorly and I am not
sure where to start looking. My RBD mounts seem to be performing fine
(over 300 MB/s), while uploading a 5G file to Swift/S3 takes 2m32s
(32MBps, I believe). If we try a 1G file it's closer to 8MBps. Testing
with nuttcp shows that I can transfer from a client with a 10G interface
to any node on the ceph cluster at the full 10G, and the ceph nodes can
transfer close to 20G between themselves. I am not really sure where to
start looking as, outside of another issue which I will mention below, I
am clueless.

I have a weird setup:
[osd nodes]
60 x 4TB 7200 RPM SATA Drives
12 x  400GB s3700 SSD drives
3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
the 3 cards)
512 GB of RAM
2 x CPU E5-2670 v2 @ 2.50GHz
2 x 10G interfaces  LACP bonded for cluster traffic
2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
ports)

[monitor nodes and gateway nodes]
4 x 300G 1500RPM SAS drives in raid 10
1 x SAS 2208
64G of RAM
2 x CPU E5-2630 v2
2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)


Here is a pastebin dump of my details. I am running ceph giant 0.87
(c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
across the entire cluster.

http://pastebin.com/XQ7USGUz -- ceph health detail
http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
http://pastebin.com/BC3gzWhT -- ceph osd tree
http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)


We ran into a few issues with density (conntrack limits, pid limit, and
number of open files), all of which I addressed by bumping the limits in
/etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
signs of these limits being hit, so I have not included my limits or
sysctl conf. If you would like these as well, let me know and I can
include them.
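
In the meantime, the shape of it is roughly this (illustrative values,
not our exact ones):

# /etc/security/limits.d/ceph.conf
*     soft  nofile  131072
*     hard  nofile  131072
root  soft  nofile  131072
root  hard  nofile  131072
*     soft  nproc   131072
*     hard  nproc   131072

# /etc/sysctl.d/30-ceph.conf (file name is arbitrary)
kernel.pid_max = 4194303
fs.file-max = 6553600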

One of the issues I am seeing is that OSDs have started to flap / be
marked as slow. The cluster was HEALTH_OK with all of the disks added
for over 3 weeks before this behaviour started. RBD transfers seem to be
fine for the most part, which makes me think that this has little bearing
on the gateway issue, but it may be related. Rebooting the OSD seems to
fix this issue.
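
In case it helps narrow things down, this is the sort of thing I have been
looking at for the flapping (osd.1112 is just an example id):

ceph osd perf
grep -iE 'slow request|wrongly marked|heartbeat_check' \
    /var/log/ceph/ceph-osd.1112.log | tail -20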

I would like to figure out the root cause of both of these issues and
post the results back here if possible (perhaps it can help other
people). I am really looking for a place to start, as the gateway just
logs that it is posting data and all of the logs (outside of the
monitors reporting down OSDs) seem to show a fully functioning cluster.

Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as
well.


Re: [ceph-users] ceph health related message

2014-09-22 Thread Sean Sullivan
I had this happen to me as well. It turned out to be a connlimit thing for
me. I would check dmesg / the kernel log and see if you see any conntrack
limit reached / connection dropped messages, then increase the conntrack
limit. Odd, as I was connected over ssh at the time, but I can't deny what
syslog says.
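
Concretely, something along these lines (the limit value is just an example):

dmesg | grep -i conntrack
# the tell-tale message is something like: nf_conntrack: table full, dropping packet
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
sysctl -w net.netfilter.nf_conntrack_max=1048576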


[ceph-users] Swift can upload, list, and delete, but not download

2014-09-19 Thread Sean Sullivan
So this was working a moment ago and I was running rados benchmarks as
well as swift benchmarks to try to see how my install was doing. Now
when I try to download an object I get this read_length error:


http://pastebin.com/R4CW8Cgj

To try to poke at this I wiped all of the .rgw pools, removed the rados 
gateways and re-installed them. I am still getting the same error.


Here is my config if that helps:

http://pastebin.com/q9DRHaQr

Uploads work, downloads fail with the same error (the content_length
changes, though, depending on the object):


Error downloading 1G: read_length != content_length, 0 != 1073741824


I have tried removing all of the .rgw pools and re-creating a user, but
this has not changed the behaviour. Any help would be greatly
appreciated.
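
I am guessing the next things to check are along these lines, but I am not
sure what to look for in the output (container/object names are
placeholders, and I am assuming the default .rgw.buckets data pool):

swift stat testcontainer 1G
swift download testcontainer 1G -o /dev/null
rados -p .rgw.buckets ls | grep 1G | head
radosgw-admin bucket check --bucket=testcontainer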



[ceph-users] Ceph can't seem to forget

2014-08-07 Thread Sean Sullivan
I think I have a split issue, or at least I can't seem to get rid of these
objects. How can I tell ceph to forget the objects and revert?

How this happened: due to the python 2.7.8/ceph bug, a whole rack of
ceph went down (it had ubuntu 14.10, which seemed to have 2.7.8 before
14.04 did). I didn't know what was going on and tried re-installing, which
killed the vast majority of the data (2/3). The drives are gone and the
data on them is lost now.

I tried deleting them via rados, but that didn't seem to work either; it
just froze. Any help would be much appreciated.


Pastebin data below
http://pastebin.com/HU8yZ1ae


cephuser@host:~/CephPDC$ ceph --version
ceph version 0.82-524-gbf04897 (bf048976f50bd0142f291414ea893ef0f205b51a)

cephuser@host:~/CephPDC$ ceph -s
cluster 9e0a4a8e-91fa-4643-887a-c7464aa3fd14
 health HEALTH_WARN 2 pgs recovering; 2 pgs stuck unclean; 5 requests
are blocked > 32 sec; recovery 478/15386946 objects degraded (0.003%);
23/5128982 unfound (0.000%)
 monmap e9: 5 mons at {kg37-12=
10.16.0.124:6789/0,kg37-17=10.16.0.129:6789/0,kg37-23=10.16.0.135:6789/0,kg37-28=10.16.0.140:6789/0,kg37-5=10.16.0.117:6789/0},
election epoch 1450, quorum 0,1,2,3,4 kg37-5,kg37-12,kg37-17,kg37-23,kg37-28
 mdsmap e100: 1/1/1 up {0=kg37-5=up:active}
 osdmap e46061: 245 osds: 245 up, 245 in
  pgmap v3268915: 22560 pgs, 19 pools, 20020 GB data, 5008 kobjects
61956 GB used, 830 TB / 890 TB avail
478/15386946 objects degraded (0.003%); 23/5128982 unfound
(0.000%)
   22558 active+clean
   2 active+recovering
  client io 95939 kB/s rd, 80854 B/s wr, 795 op/s


cephuser@host:~/CephPDC$ ceph health detail
HEALTH_WARN 2 pgs recovering; 2 pgs stuck unclean; 5 requests are blocked >
32 sec; 1 osds have slow requests; recovery 478/15386946 objects degraded
(0.003%); 23/5128982 unfound (0.000%)
pg 5.f4f is stuck unclean since forever, current state active+recovering,
last acting [279,115,78]
pg 5.27f is stuck unclean since forever, current state active+recovering,
last acting [213,0,258]
pg 5.f4f is active+recovering, acting [279,115,78], 10 unfound
pg 5.27f is active+recovering, acting [213,0,258], 13 unfound
5 ops are blocked > 67108.9 sec
5 ops are blocked > 67108.9 sec on osd.279
1 osds have slow requests
recovery 478/15386946 objects degraded (0.003%); 23/5128982 unfound (0.000%)

cephuser@host:~/CephPDC$ ceph pg 5.f4f mark_unfound_lost revert
2014-08-06 12:59:42.282672 7f7d4a6fb700  0 -- 10.16.0.117:0/1005129 >>
10.16.64.29:6844/718 pipe(0x7f7d4005c120 sd=4 :0 s=1 pgs=0 cs=0 l=1
c=0x7f7d4005c3b0).fault
2014-08-06 12:59:51.890574 7f7d4a4f9700  0 -- 10.16.0.117:0/1005129 >>
10.16.64.29:6806/7875 pipe(0x7f7d4005f180 sd=4 :0 s=1 pgs=0 cs=0 l=1
c=0x7f7d4005fae0).fault
pg has no unfound objects
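
I am guessing the next step is something along these lines, but I do not
want to mark anything lost without a sanity check (123 is a made-up osd id;
the .fault lines above make me think the pg is still trying to query osds
that no longer exist):

ceph pg 5.f4f list_missing
ceph pg 5.f4f query
ceph osd lost 123 --yes-i-really-mean-it
ceph pg 5.f4f mark_unfound_lost revert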

