[ceph-users] osd is stuck in "bluestore(/var/lib/ceph/osd/ceph-3) _open_alloc loaded 599 G in 1055 extents" when it starts

2018-10-01 Thread jython.li
# tail -f /var/log/ceph/ceph-osd.3.log
...
2018-10-02 03:49:48.552686 7ffa10b6bd00  1 bluestore(/var/lib/ceph/osd/ceph-3) 
_open_db opened rocksdb path db options 
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to
_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2018-10-02 03:49:48.568184 7ffa10b6bd00  1 freelist init
2018-10-02 03:49:48.579042 7ffa10b6bd00  1 bluestore(/var/lib/ceph/osd/ceph-3) 
_open_alloc opening allocation metadata
2018-10-02 03:49:48.767642 7ffa10b6bd00  1 bluestore(/var/lib/ceph/osd/ceph-3) 
_open_alloc loaded 599 G in 1055 extents
2018-10-02 03:58:46.061109 7ffa10b6bd00  2 osd.3 0 journal looks like hdd
2018-10-02 03:58:46.061146 7ffa10b6bd00  2 osd.3 0 boot
2018-10-02 03:58:46.061236 7ffa10b6bd00 20 osd.3 0 configured 
osd_max_object_name[space]_len looks ok
2018-10-02 03:58:46.062040 7ffa10b6bd00 10 osd.3 0 read_superblock 
sb(4c9eb31c-b6a9-4b46-a8a2-cbbfaf56444b osd.3 
656ca1d3-7e50-4bcc-bbfe-27b88a28dcf6 e7019 [5862,7019] lci=[7015,7019])
2018-10-02 03:58:46.086503 7ffa10b6bd00 10 open_all_classes
...


As above, the "bluestore(/var/lib/ceph/osd/ceph-3) _open_alloc loaded xxx" took 
8 minutes, and at this time, top to see a cpu core iowait 80% ~ 100%, as follow
"%Cpu3  :  3.6 us,  4.2 sy,  0.0 ni,  0.0 id, 92.2 wa,  0.0 hi,  0.0 si,  0.0 
st"


I tested the disk with the `dd` command and the disk itself seems fine, so I am confused: 
what could be the cause?
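
For reference, the kind of raw read test I mean looks roughly like this (the device path 
is just an example), plus iostat to watch per-device wait while the OSD starts:

  # sequential read straight from the block device, bypassing the page cache
  dd if=/dev/sdc of=/dev/null bs=4M count=2048 iflag=direct
  # watch utilization and await per device, refreshed every 5 seconds
  iostat -x 5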


---
ceph version 12.2.4 luminous (stable)
Linux 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 
x86_64 x86_64 GNU/Linux
---





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client stability

2018-10-01 Thread Linh Vu
This might be a networking problem. Are your client nodes on the same subnet as the Ceph 
client network (i.e. public_network in ceph.conf)? In my experience, the kernel client 
only likes being on the same public_network subnet as the MDSs, mons and OSDs; otherwise 
you get tons of weird issues. The fuse client, however, is a lot more tolerant of this 
and can go through gateways etc. with no problem.
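
For reference, a minimal sketch of what I mean by the same subnet (the address range here 
is only a placeholder):

  [global]
  # clients, MONs, MDSs and OSDs should all be reachable on this network
  public_network = 192.168.10.0/24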


From: ceph-users  on behalf of Andras Pataki 

Sent: Tuesday, 2 October 2018 6:40:44 AM
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] cephfs kernel client stability

Unfortunately the CentOS kernel (3.10.0-862.14.4.el7.x86_64) has issues
as well.  Different ones, but the nodes end up with an unusable mount in
an hour or two.  Here are some syslogs:

Oct  1 11:50:28 worker1004 kernel: INFO: task fio:29007 blocked for more
than 120 seconds.
Oct  1 11:50:28 worker1004 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  1 11:50:28 worker1004 kernel: fio D
996d86e92f70 0 29007  28970 0x
Oct  1 11:50:28 worker1004 kernel: Call Trace:
Oct  1 11:50:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:50:28 worker1004 kernel: [] schedule+0x29/0x70
Oct  1 11:50:28 worker1004 kernel: []
schedule_timeout+0x239/0x2c0
Oct  1 11:50:28 worker1004 kernel: [] ?
ktime_get_ts64+0x52/0xf0
Oct  1 11:50:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:50:28 worker1004 kernel: []
io_schedule_timeout+0xad/0x130
Oct  1 11:50:28 worker1004 kernel: []
io_schedule+0x18/0x20
Oct  1 11:50:28 worker1004 kernel: []
bit_wait_io+0x11/0x50
Oct  1 11:50:28 worker1004 kernel: []
__wait_on_bit_lock+0x61/0xc0
Oct  1 11:50:28 worker1004 kernel: []
__lock_page+0x74/0x90
Oct  1 11:50:28 worker1004 kernel: [] ?
wake_bit_function+0x40/0x40
Oct  1 11:50:28 worker1004 kernel: []
__find_lock_page+0x54/0x70
Oct  1 11:50:28 worker1004 kernel: []
grab_cache_page_write_begin+0x55/0xc0
Oct  1 11:50:28 worker1004 kernel: []
ceph_write_begin+0x43/0xe0 [ceph]
Oct  1 11:50:28 worker1004 kernel: []
generic_file_buffered_write+0x124/0x2c0
Oct  1 11:50:28 worker1004 kernel: []
ceph_aio_write+0xa3e/0xcb0 [ceph]
Oct  1 11:50:28 worker1004 kernel: [] ?
do_numa_page+0x1be/0x250
Oct  1 11:50:28 worker1004 kernel: [] ?
handle_pte_fault+0x316/0xd10
Oct  1 11:50:28 worker1004 kernel: [] ?
aio_read_events+0x1f3/0x2e0
Oct  1 11:50:28 worker1004 kernel: [] ?
security_file_permission+0x27/0xa0
Oct  1 11:50:28 worker1004 kernel: [] ?
ceph_direct_read_write+0xcd0/0xcd0 [ceph]
Oct  1 11:50:28 worker1004 kernel: []
do_io_submit+0x3c3/0x870
Oct  1 11:50:28 worker1004 kernel: []
SyS_io_submit+0x10/0x20
Oct  1 11:50:28 worker1004 kernel: []
system_call_fastpath+0x22/0x27
Oct  1 11:52:28 worker1004 kernel: INFO: task fio:29007 blocked for more
than 120 seconds.
Oct  1 11:52:28 worker1004 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  1 11:52:28 worker1004 kernel: fio D
996d86e92f70 0 29007  28970 0x
Oct  1 11:52:28 worker1004 kernel: Call Trace:
Oct  1 11:52:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:52:28 worker1004 kernel: [] schedule+0x29/0x70
Oct  1 11:52:28 worker1004 kernel: []
schedule_timeout+0x239/0x2c0
Oct  1 11:52:28 worker1004 kernel: [] ?
ktime_get_ts64+0x52/0xf0
Oct  1 11:52:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:52:28 worker1004 kernel: []
io_schedule_timeout+0xad/0x130
Oct  1 11:52:28 worker1004 kernel: []
io_schedule+0x18/0x20
Oct  1 11:52:28 worker1004 kernel: []
bit_wait_io+0x11/0x50
Oct  1 11:52:28 worker1004 kernel: []
__wait_on_bit_lock+0x61/0xc0
Oct  1 11:52:28 worker1004 kernel: []
__lock_page+0x74/0x90
Oct  1 11:52:28 worker1004 kernel: [] ?
wake_bit_function+0x40/0x40
Oct  1 11:52:28 worker1004 kernel: []
__find_lock_page+0x54/0x70
Oct  1 11:52:28 worker1004 kernel: []
grab_cache_page_write_begin+0x55/0xc0
Oct  1 11:52:28 worker1004 kernel: []
ceph_write_begin+0x43/0xe0 [ceph]
Oct  1 11:52:28 worker1004 kernel: []
generic_file_buffered_write+0x124/0x2c0
Oct  1 11:52:28 worker1004 kernel: []
ceph_aio_write+0xa3e/0xcb0 [ceph]
Oct  1 11:52:28 worker1004 kernel: [] ?
do_numa_page+0x1be/0x250
Oct  1 11:52:28 worker1004 kernel: [] ?
handle_pte_fault+0x316/0xd10
Oct  1 11:52:28 worker1004 kernel: [] ?
aio_read_events+0x1f3/0x2e0
Oct  1 11:52:28 worker1004 kernel: [] ?
security_file_permission+0x27/0xa0
Oct  1 11:52:28 worker1004 kernel: [] ?
ceph_direct_read_write+0xcd0/0xcd0 [ceph]
Oct  1 11:52:28 worker1004 kernel: []
do_io_submit+0x3c3/0x870
Oct  1 11:52:28 worker1004 kernel: []
SyS_io_submit+0x10/0x20
Oct  1 11:52:28 worker1004 kernel: []
system_call_fastpath+0x22/0x27

Oct  1 15:04:08 worker1004 kernel: libceph: reset on mds0
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 closed our session
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 reconnect start
Oct  1 15:04:08 worker1004 kernel: libceph: osd182 10.128.150.155:6976
socket closed (con state OPEN)
Oct  1 15:04:08 

Re: [ceph-users] NVMe SSD not assigned "nvme" device class

2018-10-01 Thread Konstantin Shalygin

It looks like Ceph (13.2.2) assigns device class "ssd" to our Samsung
PM1725a NVMe SSDs instead of "nvme". Is that a bug or is the "nvme"
class reserved for a different kind of device?


Nope. This is because (I'm not sure, but I think) the is_rotational() function is used to 
determine from the kernel whether the device is rotational or not, so the behavior is the 
same for SSD and NVMe.


If you prefer (like me) the nvme class instead of ssd, you should create this class 
yourself and assign it to your NVMe OSDs.
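
For example, something along these lines (the OSD id is just a placeholder):

  # drop the auto-detected class, then assign the one you want
  ceph osd crush rm-device-class osd.7
  ceph osd crush set-device-class nvme osd.7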




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic Upgrade, features not showing up

2018-10-01 Thread William Law
Cool, thanks for the quick turn around!

Regards,

Will

> On Oct 1, 2018, at 1:09 PM, Gregory Farnum  wrote:
> 
> On Mon, Oct 1, 2018 at 12:37 PM William Law  > wrote:
> Hi -
> 
> I feel like we missed something with upgrading to mimic from luminous.  
> Everything went fine, but running 'ceph features' still shows luminous across 
> the system.  Running 'ceph versions' shows that everything is at mimic.
> 
> The cluster shows healthy.  Any ideas?
> 
> ...yep! We didn't update a key function in that pipeline so that it was 
> willing to show Mimic. I generated a PR: 
> https://github.com/ceph/ceph/pull/24360 
> 
> 
> Thanks (to you and others) for mentioning this!
> -Greg
>  
> 
> Thanks.
> 
> Will
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs mds cache tuning

2018-10-01 Thread Adam Tygart
Okay, here's what I've got: https://www.paste.ie/view/abe8c712

Of note, I've changed things up a little bit for the moment: I've activated a second mds 
to see whether a particular subtree is more prone to issues, maybe EC vs. replica... The 
one that is currently being slow has my EC volume pinned to it.
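
For reference, roughly the commands I mean by "activated a second mds" and "pinned" 
(fs name and path are just examples):

  # allow a second active MDS (on Luminous this may also need: ceph fs set cephfs allow_multimds true)
  ceph fs set cephfs max_mds 2
  # pin the EC-backed subtree to rank 1
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/ec_volume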

--
Adam
On Mon, Oct 1, 2018 at 10:02 PM Gregory Farnum  wrote:
>
> Can you grab the perf dump during this time, perhaps plus dumps of the ops in 
> progress?
>
> This is weird but given it’s somewhat periodic it might be something like the 
> MDS needing to catch up on log trimming (though I’m unclear why changing the 
> cache size would impact this).
>
> On Sun, Sep 30, 2018 at 9:02 PM Adam Tygart  wrote:
>>
>> Hello all,
>>
>> I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
>> cephfs mounts (kernel client). We're currently only using 1 active
>> mds.
>>
>> Performance is great about 80% of the time. MDS responses (per ceph
>> daemonperf mds.$(hostname -s)) indicate 2k-9k requests per second,
>> with a latency under 100.
>>
>> It is the other 20ish percent I'm worried about. I'll check on it and
>> it will be going 5-15 seconds with "0" requests, "0" latency, then
>> give me 2 seconds of reasonable response times, and then back to
>> nothing. Clients are actually seeing blocked requests for this period
>> of time.
>>
>> The strange bit is that when I *reduce* the mds_cache_size, requests
>> and latencies go back to normal for a while. When it happens again,
>> I'll increase it back to where it was. It feels like the mds server
>> decides that some of these inodes can't be dropped from the cache
>> unless the cache size changes. Maybe something wrong with the LRU?
>>
>> I feel like I've got a reasonable cache size for my workload, 30GB on
>> the small end, 55GB on the large. No real reason for a swing this
>> large except to potentially delay it recurring after expansion for
>> longer.
>>
>> I also feel like there is probably some magic tunable to change how
>> inodes get stuck in the LRU. perhaps mds_cache_mid. Anyone know what
>> this tunable actually does? The documentation is a little sparse.
>>
>> I can grab logs from the mds if needed, just let me know the settings
>> you'd like to see.
>>
>> --
>> Adam
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs mds cache tuning

2018-10-01 Thread Gregory Farnum
Can you grab the perf dump during this time, perhaps plus dumps of the ops
in progress?
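
Something like the following, run against the MDS admin socket (the daemon name is an 
example):

  ceph daemon mds.$(hostname -s) perf dump
  ceph daemon mds.$(hostname -s) dump_ops_in_flight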

This is weird but given it’s somewhat periodic it might be something like
the MDS needing to catch up on log trimming (though I’m unclear why
changing the cache size would impact this).

On Sun, Sep 30, 2018 at 9:02 PM Adam Tygart  wrote:

> Hello all,
>
> I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
> cephfs mounts (kernel client). We're currently only using 1 active
> mds.
>
> Performance is great about 80% of the time. MDS responses (per ceph
> daemonperf mds.$(hostname -s)) indicate 2k-9k requests per second,
> with a latency under 100.
>
> It is the other 20ish percent I'm worried about. I'll check on it and
> it will be going 5-15 seconds with "0" requests, "0" latency, then
> give me 2 seconds of reasonable response times, and then back to
> nothing. Clients are actually seeing blocked requests for this period
> of time.
>
> The strange bit is that when I *reduce* the mds_cache_size, requests
> and latencies go back to normal for a while. When it happens again,
> I'll increase it back to where it was. It feels like the mds server
> decides that some of these inodes can't be dropped from the cache
> unless the cache size changes. Maybe something wrong with the LRU?
>
> I feel like I've got a reasonable cache size for my workload, 30GB on
> the small end, 55GB on the large. No real reason for a swing this
> large except to potentially delay it recurring after expansion for
> longer.
>
> I also feel like there is probably some magic tunable to change how
> inodes get stuck in the LRU. perhaps mds_cache_mid. Anyone know what
> this tunable actually does? The documentation is a little sparse.
>
> I can grab logs from the mds if needed, just let me know the settings
> you'd like to see.
>
> --
> Adam
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread Brady Deetz
I have a python script that is migrating my data from replicated to ec
pools for cephfs on files that haven't been accessed in a while. My process
involves setting the data_pool recursively for an existing replicated dir
to the new ec pool, copying the existing replicated file to a temporary
file in the same directory, then moving/renaming the ec file over the
replicated file. Ceph does correctly handle discarding the replicated file
data from the replicated pool.

Since mv operations are based on inode, you can't simply perform a mv to
migrate data to a new pool. Obviously it would be nice if Ceph was smart
enough to do this for us in the backend, but I feel like it's moderately
reasonable for it not to.
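
A rough sketch of that flow for a single directory and file (pool and path names are just 
examples, not my actual script):

  # point new files in this directory at the EC pool
  setfattr -n ceph.dir.layout.pool -v cephfs_ec_data /mnt/cephfs/archive
  # rewrite the file so its data lands in the EC pool, then rename over the original
  cp -p /mnt/cephfs/archive/big.dat /mnt/cephfs/archive/big.dat.ec
  mv /mnt/cephfs/archive/big.dat.ec /mnt/cephfs/archive/big.dat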

On Mon, Oct 1, 2018 at 3:13 PM Gregory Farnum  wrote:

> On Mon, Oct 1, 2018 at 12:43 PM Marc Roos 
> wrote:
>
>> Hmmm, did not know that, so it becomes a soft link or so?
>>
>> totally new for me, also not what I would expect of a mv on a fs. I know
>> this is normal to expect coping between pools, also from the s3cmd
>> client. But I think more people will not expect this behaviour. Can't
>> the move be implemented as a move?
>>
>> How can users even know about what folders have a 'different layout'.
>> What happens if we export such mixed pool filesystem via smb. How would
>> smb deal with the 'move' between those directories?
>>
>
> Since the CephX permissions are thoroughly outside of POSIX, handling this
> is unfortunately just your problem. :(
>
> Consider it the other way around — what if a mv *did* copy the file data
> into a new pool, and somebody who had the file open was suddenly no longer
> able to access it? There's no feasible way for us to handle that with rules
> that fall inside of POSIX; what we have now is better.
>
> John's right; it would be great if we could do a server-side "re-stripe"
> or "re-layout" or something, but that will also be an "outside POSIX"
> operation and never the default.
> -Greg
>
>
>>
>>
>>
>>
>> -Original Message-
>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>> Sent: maandag 1 oktober 2018 21:28
>> To: Marc Roos
>> Cc: ceph-users; jspray; ukernel
>> Subject: Re: [ceph-users] cephfs issue with moving files between data
>> pools gives Input/output error
>>
>> Moving a file into a directory with a different layout does not, and is
>> not intended to, copy the underlying file data into a different pool
>> with the new layout. If you want to do that you have to make it happen
>> yourself by doing a copy.
>>
>> On Mon, Oct 1, 2018 at 12:16 PM Marc Roos 
>> wrote:
>>
>>
>>
>> I will explain the test again, I think you might have some bug in
>> your
>> cephfs copy between data pools.
>>
>> c04 has mounted the root cephfs
>> /a (has data pool a, ec21)
>> /test (has data pool b, r1)
>>
>> test2 has mounted
>> /m  (nfs mount of cephfs /a)
>> /m2 (cephfs mount of /a)
>>
>> Creating the test file.
>> [root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf >
>> testfile.txt
>>
>> Then I am moving on c04 the test file from the test folder(pool
>> b)
>> to
>> the a folder/pool
>>
>> Now on test2
>> [root@test2 m]# ls -arlt
>> -rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
>> -rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
>> -rw-r--r--  1 nobody nobody37 Oct  1 21:02
>> testfile.txt
>>
>> [root@test2 /]# cat /mnt/m/testfile.txt
>> cat: /mnt/m/old/testfile.txt: Input/output error
>>
>> [root@test2 /]# cat /mnt/m2/testfile.txt
>> cat: /mnt/m2/old/testfile.txt: Operation not permitted
>>
>> Now I am creating a copy of the test file in the same directory
>> back on
>> c04
>>
>> [root@c04 a]# cp testfile.txt testfile-copy.txt
>> [root@c04 a]# ls -alrt
>> -rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
>> -rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
>> -rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt
>>
>> Now I trying to access the copy of testfile.txt back on test2
>> (without
>> unmounting, or changing permissions)
>>
>> [root@test2 /]# cat /mnt/m/testfile-copy.txt
>> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>> [root@test2 /]# cat /mnt/m2/testfile-copy.txt
>> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Yan, Zheng [mailto:uker...@gmail.com]
>> Sent: zaterdag 29 september 2018 6:55
>> To: Marc Roos
>> Subject: Re: [ceph-users] cephfs issue with moving files between
>> data
>> pools gives Input/output error
>>
>> check_pool_perm on pool 30 ns  need Fr, but no read perm
>>
>> client does not permission to read the pool.  ceph-fuse did
>> return
>> EPERM
>> for the kernel readpage 

Re: [ceph-users] too few PGs per OSD

2018-10-01 Thread John Petrini
You need to set the pg number before setting the pgp number; it's a
two-step process.

ceph osd pool set cephfs_data pg_num 64

Setting the pg number creates new placement groups by splitting
existing ones but keeps them on the local OSD. Setting the pgp number
then allows Ceph to move the new PGs to different OSDs, which will trigger
rebalancing and data movement as a result.
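
Once the pg_num change has settled, the second step would then be along the lines of:

  ceph osd pool set cephfs_data pgp_num 64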

OSD size has no effect on the PG-per-OSD count.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
I was able to apply the patches to mimic, but nothing changed. The one OSD whose main 
device I expanded fails with a bluefs mount I/O error; the others keep failing with 
enospc.


> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
> 
> So you should call repair which rebalances (i.e. allocates additional space) 
> BlueFS space. Hence allowing OSD to start.
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>> Not exactly. The rebalancing from this kv_sync_thread still might be 
>> deferred due to the nature of this thread (haven't 100% sure though).
>> 
>> Here is my PR showing the idea (still untested and perhaps unfinished!!!)
>> 
>> https://github.com/ceph/ceph/pull/24353
>> 
>> 
>> Igor
>> 
>> 
>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>> Can you please confirm whether I got this right:
>>> 
>>> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
>>> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
>>> @@ -9049,22 +9049,17 @@
>>> throttle_bytes.put(costs);
>>>   PExtentVector bluefs_gift_extents;
>>> -  if (bluefs &&
>>> -  after_flush - bluefs_last_balance >
>>> -  cct->_conf->bluestore_bluefs_balance_interval) {
>>> -bluefs_last_balance = after_flush;
-    int r = _balance_bluefs_freespace(&bluefs_gift_extents);
>>> -assert(r >= 0);
>>> -if (r > 0) {
>>> -  for (auto& p : bluefs_gift_extents) {
>>> -bluefs_extents.insert(p.offset, p.length);
>>> -  }
>>> -  bufferlist bl;
>>> -  encode(bluefs_extents, bl);
>>> -  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>> -   << bluefs_extents << std::dec << dendl;
>>> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
>>> +  ceph_assert(r >= 0);
>>> +  if (r > 0) {
>>> +for (auto& p : bluefs_gift_extents) {
>>> +  bluefs_extents.insert(p.offset, p.length);
>>>   }
>>> +bufferlist bl;
>>> +encode(bluefs_extents, bl);
>>> +dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>> + << bluefs_extents << std::dec << dendl;
>>> +synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>> }
>>>   // cleanup sync deferred keys
>>> 
 On 1.10.2018, at 18:39, Igor Fedotov  wrote:
 
 So you have just a single main device per OSD
 
 Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition 
 at main device, standalone devices are supported only.
 
 Given that you're able to rebuild the code I can suggest to make a patch 
 that triggers BlueFS rebalance (see code snippet below) on repairing.
  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
for (auto& p : bluefs_gift_extents) {
  bluefs_extents.insert(p.offset, p.length);
}
bufferlist bl;
encode(bluefs_extents, bl);
dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
 << bluefs_extents << std::dec << dendl;
synct->set(PREFIX_SUPER, "bluefs_extents", bl);
  }
 
 If it waits I can probably make a corresponding PR tomorrow.
 
 Thanks,
 Igor
 On 10/1/2018 6:16 PM, Sergey Malinin wrote:
> I have rebuilt the tool, but none of my OSDs no matter dead or alive have 
> any symlinks other than 'block' pointing to LVM.
> I adjusted main device size but it looks like it needs even more space 
> for db compaction. After executing bluefs-bdev-expand OSD fails to start, 
> however 'fsck' and 'repair' commands finished successfully.
> 
> 2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
> 2018-10-01 18:02:39.763 7fc9226c6240  1 
> bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation 
> metadata
> 2018-10-01 18:02:40.907 7fc9226c6240  1 
> bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 285 GiB in 2249899 
> extents
> 2018-10-01 18:02:40.951 7fc9226c6240 -1 
> bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
> extra 0x[6d6f00~50c80]
> 2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 
> shutdown
> 2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
> 2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling 
> all background work
> 2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
> 2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
> 2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 
> shutdown
> 2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
> /var/lib/ceph/osd/ceph-1/block) close
> 2018-10-01 18:02:41.267 7fc9226c6240  1 

[ceph-users] too few PGs per OSD

2018-10-01 Thread solarflow99
I have a new deployment and it always has this problem: even if I increase
the size of the OSD, it stays at 8. I saw examples where others had this
problem, but that was with the RBD pool; I don't have an RBD pool, and I just
deployed it fresh with ansible.


health: HEALTH_WARN
1 MDSs report slow metadata IOs
Reduced data availability: 16 pgs inactive
Degraded data redundancy: 16 pgs undersized
too few PGs per OSD (16 < min 30)
 data:
pools:   2 pools, 16 pgs
objects: 0  objects, 0 B
usage:   2.0 GiB used, 39 GiB / 41 GiB avail
pgs: 100.000% pgs not active
 16 undersized+peered


# ceph osd pool ls
cephfs_data
cephfs_metadata


# ceph osd tree
ID CLASS WEIGHT  TYPE NAMESTATUS REWEIGHT PRI-AFF
-1   0.03989 root default
-3   0.03989 host mytesthost104
 0   hdd 0.03989 osd.0up  1.0 1.0


# ceph osd pool set cephfs_data pgp_num 64
Error EINVAL: specified pgp_num 64 > pg_num 8
# ceph osd pool set cephfs_data pgp_num 256
Error EINVAL: specified pgp_num 256 > pg_num 8
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe SSD not assigned "nvme" device class

2018-10-01 Thread Gregory Farnum
Ceph only cares about the SSD and HDD distinction right now, so that's all
the device classes try to handle. (In fact it's *actually* just looking at
the rotational flag the kernel exports; the detection would need to become
a lot more advanced to start assigning stuff to an nvme class.)
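For reference, the flag in question can be inspected directly (the device name is just an
example); it prints 0 for non-rotational devices:

  cat /sys/block/nvme0n1/queue/rotational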
-Greg

On Mon, Oct 1, 2018 at 2:25 PM Vladimir Brik 
wrote:

> Hello,
>
> It looks like Ceph (13.2.2) assigns device class "ssd" to our Samsung
> PM1725a NVMe SSDs instead of "nvme". Is that a bug or is the "nvme"
> class reserved for a different kind of device?
>
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NVMe SSD not assigned "nvme" device class

2018-10-01 Thread Vladimir Brik
Hello,

It looks like Ceph (13.2.2) assigns device class "ssd" to our Samsung
PM1725a NVMe SSDs instead of "nvme". Is that a bug or is the "nvme"
class reserved for a different kind of device?


Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount cephfs from a public network ip of mds

2018-10-01 Thread Joshua Chen
Thank you all for your replies.
I will consider changing the design or negotiating with my colleagues about the
topology issue, or, if none of that works, come back to this solution.

Cheers
Joshua

On Mon, Oct 1, 2018 at 9:05 PM Paul Emmerich  wrote:

> No, mons can only have exactly one IP address and they'll only listen
> on that IP.
>
> As David suggested: check if you really need separate networks. This
> setup usually creates more problems than it solves, especially if you
> have one 1G and one 10G network.
>
> Paul
> Am Mo., 1. Okt. 2018 um 04:11 Uhr schrieb Joshua Chen
> :
> >
> > Hello Paul,
> >   Thanks for your reply.
> >   Now my clients will be from 140.109 (LAN, the real ip network 1Gb/s)
> and from 10.32 (SAN, a closed 10Gb network). Could I make this
> public_network to be 0.0.0.0? so mon daemon listens on both 1Gb and 10Gb
> network?
> >   Or could I have
> > public_network = 140.109.169.0/24, 10.32.67.0/24
> > cluster_network = 10.32.67.0/24
> >
> > does ceph allow 2 (multiple) public_network?
> >
> >   And I don't want to limit the client read/write speed to be 1Gb/s nics
> unless they don't have 10Gb nic installed. To guarantee clients read/write
> to osd (when they know the details of the location) they should be using
> the fastest nic (10Gb) when available. But other clients with only 1Gb nic
> will go through 140.109.0.0 (1Gb LAN) to ask mon or to read/write to osds.
> This is why my osds also have 1Gb and 10Gb nics with 140.109.0.0 and
> 10.32.0.0 networking respectively.
> >
> > Cheers
> > Joshua
> >
> > On Sun, Sep 30, 2018 at 12:09 PM David Turner 
> wrote:
> >>
> >> The cluster/private network is only used by the OSDs. Nothing else in
> ceph or its clients communicate using it. Everything other than osd to osd
> communication uses the public network. That includes the MONs, MDSs,
> clients, and anything other than an osd talking to an osd. Nothing else
> other than osd to osd traffic can communicate on the private/cluster
> network.
> >>
> >> On Sat, Sep 29, 2018, 6:43 AM Paul Emmerich 
> wrote:
> >>>
> >>> All Ceph clients will always first connect to the mons. Mons provide
> >>> further information on the cluster such as the IPs of MDS and OSDs.
> >>>
> >>> This means you need to provide the mon IPs to the mount command, not
> >>> the MDS IPs. Your first command works by coincidence since
> >>> you seem to run the mons and MDS' on the same server.
> >>>
> >>>
> >>> Paul
> >>> Am Sa., 29. Sep. 2018 um 12:07 Uhr schrieb Joshua Chen
> >>> :
> >>> >
> >>> > Hello all,
> >>> >   I am testing the cephFS cluster so that clients could mount -t
> ceph.
> >>> >
> >>> >   the cluster has 6 nodes, 3 mons (also mds), and 3 osds.
> >>> >   All these 6 nodes has 2 nic, one 1Gb nic with real ip
> (140.109.0.0) and 1 10Gb nic with virtual ip (10.32.0.0)
> >>> >
> >>> > 140.109. Nic1 1G<-MDS1->Nic2 10G 10.32.
> >>> > 140.109. Nic1 1G<-MDS2->Nic2 10G 10.32.
> >>> > 140.109. Nic1 1G<-MDS3->Nic2 10G 10.32.
> >>> > 140.109. Nic1 1G<-OSD1->Nic2 10G 10.32.
> >>> > 140.109. Nic1 1G<-OSD2->Nic2 10G 10.32.
> >>> > 140.109. Nic1 1G<-OSD3->Nic2 10G 10.32.
> >>> >
> >>> >
> >>> >
> >>> > and I have the following questions:
> >>> >
> >>> > 1, can I have both public (140.109.0.0) and cluster (10.32.0.0)
> clients all be able to mount this cephfs resource
> >>> >
> >>> > I want to do
> >>> >
> >>> > (in a 140.109 network client)
> >>> > mount -t ceph mds1(140.109.169.48):/ /mnt/cephfs -o user=,secret=
> >>> >
> >>> > and also in a 10.32.0.0 network client)
> >>> > mount -t ceph mds1(10.32.67.48):/
> >>> > /mnt/cephfs -o user=,secret=
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > Currently, only this 10.32.0.0 clients can mount it. that of public
> network (140.109) can not. How can I enable this?
> >>> >
> >>> > here attached is my ceph.conf
> >>> >
> >>> > Thanks in advance
> >>> >
> >>> > Cheers
> >>> > Joshua
> >>> > ___
> >>> > ceph-users mailing list
> >>> > ceph-users@lists.ceph.com
> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>>
> >>>
> >>> --
> >>> Paul Emmerich
> >>>
> >>> Looking for help with your Ceph cluster? Contact us at
> https://croit.io
> >>>
> >>> croit GmbH
> >>> Freseniusstr. 31h
> >>> 81247 München
> >>> www.croit.io
> >>> Tel: +49 89 1896585 90
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mimic offline problem

2018-10-01 Thread Göktuğ Yıldırım
I mistyped the user list mail address. I am correcting and sending again. 
Apologies for the noise.

My mail is below.


İleti başlangıcı:

> Kimden: Goktug Yildirim 
> Tarih: 1 Ekim 2018 21:54:31 GMT+2
> Kime: ceph-users-j...@lists.ceph.com
> Bilgi: ceph-de...@vger.kernel.org
> Konu: Mimic offline problem
> 
> Hi all,
> 
> We have recently upgraded from luminous to mimic. It’s been 6 days since this 
> cluster went offline. Long story short: 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> 
> I’ve also CC’ed the developers since I believe this is a bug. If this is not the 
> correct way to report it, I apologize; please let me know.
> 
> Over these 6 days a lot has happened and there were some findings about the 
> problem. Some of them were misjudged and some have not been investigated deeply. 
> However, the most certain diagnosis is this: each OSD causes very high disk 
> I/O on its bluestore disk (WAL and DB are fine). After that the OSDs become 
> unresponsive or respond very, very slowly. For example, "ceph tell osd.x 
> version" gets stuck seemingly forever.
> 
> So, because of the unresponsive OSDs, the cluster does not settle. This is our problem! 
> 
> This much we are very sure of. But we are not sure of the reason. 
> 
> Here is the latest ceph status: 
> https://paste.ubuntu.com/p/2DyZ5YqPjh/. 
> 
> This is the status after we started all of the OSDs 24 hours ago.
> Some of the OSDs have not started. However, it didn't make any difference when 
> all of them were online.
> 
> Here is the debug=20 log of an OSD, which is the same for all the others: 
> https://paste.ubuntu.com/p/8n2kTvwnG6/
> As we figured out, there is a loop pattern, though it is hard to catch by eye.
> 
> This is the full log of the same OSD.
> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> 
> Here is the strace of the same OSD process:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
> 
> Recently we have been hearing more advice to upgrade to mimic. I hope nobody else 
> gets hurt as we did. I am sure we made lots of mistakes to let this happen, and 
> this situation may serve as an example for other users and could point to a 
> potential bug for the ceph developers.
> 
> Any help to figure out what is going on would be great.
> 
> Best Regards,
> Goktug Yildirim
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client stability

2018-10-01 Thread Gregory Farnum
The critical bit in your logs below are the lines
Oct  1 15:04:08 worker1004 kernel: libceph: reset on mds0
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 closed our session
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 reconnect start
...
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 reconnect denied

I couldn't tell you why the kernel client is facing disconnects that it
doesn't handle more often than the userspace client is; perhaps it (or at
least this kernel version) isn't subscribing to mdsmap updates or
handling them quickly enough.
But that sequence means that the mount is busted and can't recover itself.
-Greg


On Mon, Oct 1, 2018 at 1:41 PM Andras Pataki 
wrote:

> Unfortunately the CentOS kernel (3.10.0-862.14.4.el7.x86_64) has issues
> as well.  Different ones, but the nodes end up with an unusable mount in
> an hour or two.  Here are some syslogs:
>
> Oct  1 11:50:28 worker1004 kernel: INFO: task fio:29007 blocked for more
> than 120 seconds.
> Oct  1 11:50:28 worker1004 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct  1 11:50:28 worker1004 kernel: fio D
> 996d86e92f70 0 29007  28970 0x
> Oct  1 11:50:28 worker1004 kernel: Call Trace:
> Oct  1 11:50:28 worker1004 kernel: [] ?
> bit_wait+0x50/0x50
> Oct  1 11:50:28 worker1004 kernel: [] schedule+0x29/0x70
> Oct  1 11:50:28 worker1004 kernel: []
> schedule_timeout+0x239/0x2c0
> Oct  1 11:50:28 worker1004 kernel: [] ?
> ktime_get_ts64+0x52/0xf0
> Oct  1 11:50:28 worker1004 kernel: [] ?
> bit_wait+0x50/0x50
> Oct  1 11:50:28 worker1004 kernel: []
> io_schedule_timeout+0xad/0x130
> Oct  1 11:50:28 worker1004 kernel: []
> io_schedule+0x18/0x20
> Oct  1 11:50:28 worker1004 kernel: []
> bit_wait_io+0x11/0x50
> Oct  1 11:50:28 worker1004 kernel: []
> __wait_on_bit_lock+0x61/0xc0
> Oct  1 11:50:28 worker1004 kernel: []
> __lock_page+0x74/0x90
> Oct  1 11:50:28 worker1004 kernel: [] ?
> wake_bit_function+0x40/0x40
> Oct  1 11:50:28 worker1004 kernel: []
> __find_lock_page+0x54/0x70
> Oct  1 11:50:28 worker1004 kernel: []
> grab_cache_page_write_begin+0x55/0xc0
> Oct  1 11:50:28 worker1004 kernel: []
> ceph_write_begin+0x43/0xe0 [ceph]
> Oct  1 11:50:28 worker1004 kernel: []
> generic_file_buffered_write+0x124/0x2c0
> Oct  1 11:50:28 worker1004 kernel: []
> ceph_aio_write+0xa3e/0xcb0 [ceph]
> Oct  1 11:50:28 worker1004 kernel: [] ?
> do_numa_page+0x1be/0x250
> Oct  1 11:50:28 worker1004 kernel: [] ?
> handle_pte_fault+0x316/0xd10
> Oct  1 11:50:28 worker1004 kernel: [] ?
> aio_read_events+0x1f3/0x2e0
> Oct  1 11:50:28 worker1004 kernel: [] ?
> security_file_permission+0x27/0xa0
> Oct  1 11:50:28 worker1004 kernel: [] ?
> ceph_direct_read_write+0xcd0/0xcd0 [ceph]
> Oct  1 11:50:28 worker1004 kernel: []
> do_io_submit+0x3c3/0x870
> Oct  1 11:50:28 worker1004 kernel: []
> SyS_io_submit+0x10/0x20
> Oct  1 11:50:28 worker1004 kernel: []
> system_call_fastpath+0x22/0x27
> Oct  1 11:52:28 worker1004 kernel: INFO: task fio:29007 blocked for more
> than 120 seconds.
> Oct  1 11:52:28 worker1004 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct  1 11:52:28 worker1004 kernel: fio D
> 996d86e92f70 0 29007  28970 0x
> Oct  1 11:52:28 worker1004 kernel: Call Trace:
> Oct  1 11:52:28 worker1004 kernel: [] ?
> bit_wait+0x50/0x50
> Oct  1 11:52:28 worker1004 kernel: [] schedule+0x29/0x70
> Oct  1 11:52:28 worker1004 kernel: []
> schedule_timeout+0x239/0x2c0
> Oct  1 11:52:28 worker1004 kernel: [] ?
> ktime_get_ts64+0x52/0xf0
> Oct  1 11:52:28 worker1004 kernel: [] ?
> bit_wait+0x50/0x50
> Oct  1 11:52:28 worker1004 kernel: []
> io_schedule_timeout+0xad/0x130
> Oct  1 11:52:28 worker1004 kernel: []
> io_schedule+0x18/0x20
> Oct  1 11:52:28 worker1004 kernel: []
> bit_wait_io+0x11/0x50
> Oct  1 11:52:28 worker1004 kernel: []
> __wait_on_bit_lock+0x61/0xc0
> Oct  1 11:52:28 worker1004 kernel: []
> __lock_page+0x74/0x90
> Oct  1 11:52:28 worker1004 kernel: [] ?
> wake_bit_function+0x40/0x40
> Oct  1 11:52:28 worker1004 kernel: []
> __find_lock_page+0x54/0x70
> Oct  1 11:52:28 worker1004 kernel: []
> grab_cache_page_write_begin+0x55/0xc0
> Oct  1 11:52:28 worker1004 kernel: []
> ceph_write_begin+0x43/0xe0 [ceph]
> Oct  1 11:52:28 worker1004 kernel: []
> generic_file_buffered_write+0x124/0x2c0
> Oct  1 11:52:28 worker1004 kernel: []
> ceph_aio_write+0xa3e/0xcb0 [ceph]
> Oct  1 11:52:28 worker1004 kernel: [] ?
> do_numa_page+0x1be/0x250
> Oct  1 11:52:28 worker1004 kernel: [] ?
> handle_pte_fault+0x316/0xd10
> Oct  1 11:52:28 worker1004 kernel: [] ?
> aio_read_events+0x1f3/0x2e0
> Oct  1 11:52:28 worker1004 kernel: [] ?
> security_file_permission+0x27/0xa0
> Oct  1 11:52:28 worker1004 kernel: [] ?
> ceph_direct_read_write+0xcd0/0xcd0 [ceph]
> Oct  1 11:52:28 worker1004 kernel: []
> do_io_submit+0x3c3/0x870
> Oct  1 11:52:28 worker1004 kernel: []
> SyS_io_submit+0x10/0x20
> Oct  1 11:52:28 worker1004 kernel: []
> 

Re: [ceph-users] Is object name used by CRUSH algorithm?

2018-10-01 Thread Jin Mao
Gregory and Paul,

Thank you for sharing the information.

Jin.

On Mon, Oct 1, 2018 at 4:08 PM Paul Emmerich  wrote:

> You are probably thinking of Amazon's S3 where adding a random prefix
> to object names was a common performance optimization in the past, but
> I think they fixed that recently.
>
> Anyways, common prefixes are very common in Ceph and no problem. For
> example, all objects within an rbd imge have the same prefix.
>
> Paul
> Am Do., 27. Sep. 2018 um 21:09 Uhr schrieb Jin Mao  >:
> >
> > I am running luminous and the objects were copied from Isilon with a
> long and similar prefix in path like /dir1/dir2/dir3//mm/dd. The
> objects are copied to various buckets like
> bucket_MMDD/dir1/dir2/dir3//mm/dd. This setup minimize some
> internal code change when moving from NFS to object store.
> >
> > I heard that CRUSH may NOT evenly balance OSDs if there are many common
> leading characters in the object name? However, I couldn't find any
> evidence to support this.
> >
> > Does anyone know further details about this?
> >
> > Thank you.
> >
> > Jin.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client stability

2018-10-01 Thread Andras Pataki
Unfortunately the CentOS kernel (3.10.0-862.14.4.el7.x86_64) has issues 
as well.  Different ones, but the nodes end up with an unusable mount in 
an hour or two.  Here are some syslogs:


Oct  1 11:50:28 worker1004 kernel: INFO: task fio:29007 blocked for more 
than 120 seconds.
Oct  1 11:50:28 worker1004 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  1 11:50:28 worker1004 kernel: fio D 
996d86e92f70 0 29007  28970 0x

Oct  1 11:50:28 worker1004 kernel: Call Trace:
Oct  1 11:50:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:50:28 worker1004 kernel: [] schedule+0x29/0x70
Oct  1 11:50:28 worker1004 kernel: [] 
schedule_timeout+0x239/0x2c0
Oct  1 11:50:28 worker1004 kernel: [] ? 
ktime_get_ts64+0x52/0xf0

Oct  1 11:50:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:50:28 worker1004 kernel: [] 
io_schedule_timeout+0xad/0x130
Oct  1 11:50:28 worker1004 kernel: [] 
io_schedule+0x18/0x20
Oct  1 11:50:28 worker1004 kernel: [] 
bit_wait_io+0x11/0x50
Oct  1 11:50:28 worker1004 kernel: [] 
__wait_on_bit_lock+0x61/0xc0
Oct  1 11:50:28 worker1004 kernel: [] 
__lock_page+0x74/0x90
Oct  1 11:50:28 worker1004 kernel: [] ? 
wake_bit_function+0x40/0x40
Oct  1 11:50:28 worker1004 kernel: [] 
__find_lock_page+0x54/0x70
Oct  1 11:50:28 worker1004 kernel: [] 
grab_cache_page_write_begin+0x55/0xc0
Oct  1 11:50:28 worker1004 kernel: [] 
ceph_write_begin+0x43/0xe0 [ceph]
Oct  1 11:50:28 worker1004 kernel: [] 
generic_file_buffered_write+0x124/0x2c0
Oct  1 11:50:28 worker1004 kernel: [] 
ceph_aio_write+0xa3e/0xcb0 [ceph]
Oct  1 11:50:28 worker1004 kernel: [] ? 
do_numa_page+0x1be/0x250
Oct  1 11:50:28 worker1004 kernel: [] ? 
handle_pte_fault+0x316/0xd10
Oct  1 11:50:28 worker1004 kernel: [] ? 
aio_read_events+0x1f3/0x2e0
Oct  1 11:50:28 worker1004 kernel: [] ? 
security_file_permission+0x27/0xa0
Oct  1 11:50:28 worker1004 kernel: [] ? 
ceph_direct_read_write+0xcd0/0xcd0 [ceph]
Oct  1 11:50:28 worker1004 kernel: [] 
do_io_submit+0x3c3/0x870
Oct  1 11:50:28 worker1004 kernel: [] 
SyS_io_submit+0x10/0x20
Oct  1 11:50:28 worker1004 kernel: [] 
system_call_fastpath+0x22/0x27
Oct  1 11:52:28 worker1004 kernel: INFO: task fio:29007 blocked for more 
than 120 seconds.
Oct  1 11:52:28 worker1004 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  1 11:52:28 worker1004 kernel: fio D 
996d86e92f70 0 29007  28970 0x

Oct  1 11:52:28 worker1004 kernel: Call Trace:
Oct  1 11:52:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:52:28 worker1004 kernel: [] schedule+0x29/0x70
Oct  1 11:52:28 worker1004 kernel: [] 
schedule_timeout+0x239/0x2c0
Oct  1 11:52:28 worker1004 kernel: [] ? 
ktime_get_ts64+0x52/0xf0

Oct  1 11:52:28 worker1004 kernel: [] ? bit_wait+0x50/0x50
Oct  1 11:52:28 worker1004 kernel: [] 
io_schedule_timeout+0xad/0x130
Oct  1 11:52:28 worker1004 kernel: [] 
io_schedule+0x18/0x20
Oct  1 11:52:28 worker1004 kernel: [] 
bit_wait_io+0x11/0x50
Oct  1 11:52:28 worker1004 kernel: [] 
__wait_on_bit_lock+0x61/0xc0
Oct  1 11:52:28 worker1004 kernel: [] 
__lock_page+0x74/0x90
Oct  1 11:52:28 worker1004 kernel: [] ? 
wake_bit_function+0x40/0x40
Oct  1 11:52:28 worker1004 kernel: [] 
__find_lock_page+0x54/0x70
Oct  1 11:52:28 worker1004 kernel: [] 
grab_cache_page_write_begin+0x55/0xc0
Oct  1 11:52:28 worker1004 kernel: [] 
ceph_write_begin+0x43/0xe0 [ceph]
Oct  1 11:52:28 worker1004 kernel: [] 
generic_file_buffered_write+0x124/0x2c0
Oct  1 11:52:28 worker1004 kernel: [] 
ceph_aio_write+0xa3e/0xcb0 [ceph]
Oct  1 11:52:28 worker1004 kernel: [] ? 
do_numa_page+0x1be/0x250
Oct  1 11:52:28 worker1004 kernel: [] ? 
handle_pte_fault+0x316/0xd10
Oct  1 11:52:28 worker1004 kernel: [] ? 
aio_read_events+0x1f3/0x2e0
Oct  1 11:52:28 worker1004 kernel: [] ? 
security_file_permission+0x27/0xa0
Oct  1 11:52:28 worker1004 kernel: [] ? 
ceph_direct_read_write+0xcd0/0xcd0 [ceph]
Oct  1 11:52:28 worker1004 kernel: [] 
do_io_submit+0x3c3/0x870
Oct  1 11:52:28 worker1004 kernel: [] 
SyS_io_submit+0x10/0x20
Oct  1 11:52:28 worker1004 kernel: [] 
system_call_fastpath+0x22/0x27


Oct  1 15:04:08 worker1004 kernel: libceph: reset on mds0
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 closed our session
Oct  1 15:04:08 worker1004 kernel: ceph: mds0 reconnect start
Oct  1 15:04:08 worker1004 kernel: libceph: osd182 10.128.150.155:6976 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd548 10.128.150.171:6936 
socket closed (con state OPEN)
Oct  1 15:04:08 worker1004 kernel: libceph: osd59 10.128.150.154:6918 
socket closed (con state OPEN)

Oct  1 15:04:08 worker1004 kernel: ceph: mds0 reconnect denied
Oct  1 15:04:08 worker1004 kernel: ceph:  dropping dirty+flushing Fw 
state for 997ff05193b0 1099516450605
Oct  1 15:04:08 worker1004 kernel: ceph:  dropping dirty+flushing Fw 
state for 997ff0519930 1099516450607
Oct  1 15:04:08 worker1004 kernel: ceph:  dropping dirty+flushing Fw 

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-10-01 Thread Pavan Rallabhandi
Yeah, I think this is something to do with the CentOS binaries, sorry that I 
couldn’t be of much help here.

Thanks,
-Pavan.

From: David Turner 
Date: Monday, October 1, 2018 at 1:37 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I tried modifying filestore_rocksdb_options by removing 
compression=kNoCompression as well as setting it to 
compression=kSnappyCompression.  Leaving it with kNoCompression or removing it 
results in the same segfault in the previous log.  Setting it to 
kSnappyCompression resulted in [1] this being logged and the OSD just failing 
to start instead of segfaulting.  Is there anything else you would suggest 
trying before I purge this OSD from the cluster?  I'm afraid it might be 
something with the CentOS binaries. 

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option compression 
= kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: 
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) 
mount(1723): Error initializing rocksdb :
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: 
(1) Operation not permittedESC[0m
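
For reference, the kind of override I was toggling lives in ceph.conf, roughly like this 
(the value shown is only the compression part under discussion, not my full option string):

  [osd]
  filestore_rocksdb_options = "compression=kSnappyCompression"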

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi 
 wrote:
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and 
interestingly I found this(below) in one of the OSD logs, which is different 
from your OSD boot log, where none of the compression algorithms seem to be 
supported. This hints more at how rocksdb was built on CentOS for Ceph.

2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms 
supported:
2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb:     Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb:     Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb:     Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb:     LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb:     ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0

On 9/27/18, 2:56 PM, "Pavan Rallabhandi"  
wrote:

    I see Filestore symbols on the stack, so the bluestore config doesn’t 
affect. And the top frame of the stack hints at a RocksDB issue, and there are 
a whole lot of these too:

    “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

    It really seems to be something with RocksDB on centOS. I still think you 
can try removing “compression=kNoCompression” from the 
filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be 
enabled.

    Thanks,
    -Pavan.

    From: David Turner 
    Date: Thursday, September 27, 2018 at 1:18 PM
    To: Pavan Rallabhandi 
    Cc: ceph-users 
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    I got pulled away from this for a while.  The error in the log is "abort: 
Corruption: Snappy not supported or corrupted Snappy compressed block contents" 
and the OSD has 2 settings set to snappy by default, async_compressor_type and 
bluestore_compression_algorithm.  Do either of these settings affect the omap 
store?

    On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi 
 wrote:
    Looks like you are running on CentOS, fwiw. We’ve successfully ran the 
conversion commands on Jewel, Ubuntu 16.04.

    Have a feel it’s expecting the compression to be enabled, can you try 
removing “compression=kNoCompression” from the filestore_rocksdb_options? 
And/or you might want to check if rocksdb is expecting snappy to be enabled.

    From: David Turner 
    Date: Tuesday, September 18, 2018 at 6:01 PM
    To: Pavan Rallabhandi 
    Cc: ceph-users 
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    Here's the [1] full log from the time the OSD was started to the end of the 
crash dump.  These logs are so hard to parse.  Is there anything useful in them?

    I did confirm that all perms were set correctly and that the superblock was 
changed to rocksdb before the first time I attempted to start the OSD with it's 
new DB.  This is 

Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread Gregory Farnum
On Mon, Oct 1, 2018 at 12:43 PM Marc Roos  wrote:

> Hmmm, did not know that, so it becomes a soft link or so?
>
> totally new for me, also not what I would expect of a mv on a fs. I know
> this is normal to expect coping between pools, also from the s3cmd
> client. But I think more people will not expect this behaviour. Can't
> the move be implemented as a move?
>
> How can users even know about what folders have a 'different layout'.
> What happens if we export such mixed pool filesystem via smb. How would
> smb deal with the 'move' between those directories?
>

Since the CephX permissions are thoroughly outside of POSIX, handling this
is unfortunately just your problem. :(

Consider it the other way around — what if a mv *did* copy the file data
into a new pool, and somebody who had the file open was suddenly no longer
able to access it? There's no feasible way for us to handle that with rules
that fall inside of POSIX; what we have now is better.

John's right; it would be great if we could do a server-side "re-stripe" or
"re-layout" or something, but that will also be an "outside POSIX"
operation and never the default.
-Greg


>
>
>
>
> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: maandag 1 oktober 2018 21:28
> To: Marc Roos
> Cc: ceph-users; jspray; ukernel
> Subject: Re: [ceph-users] cephfs issue with moving files between data
> pools gives Input/output error
>
> Moving a file into a directory with a different layout does not, and is
> not intended to, copy the underlying file data into a different pool
> with the new layout. If you want to do that you have to make it happen
> yourself by doing a copy.
>
> On Mon, Oct 1, 2018 at 12:16 PM Marc Roos 
> wrote:
>
>
>
> I will explain the test again, I think you might have some bug in
> your
> cephfs copy between data pools.
>
> c04 has mounted the root cephfs
> /a (has data pool a, ec21)
> /test (has data pool b, r1)
>
> test2 has mounted
> /m  (nfs mount of cephfs /a)
> /m2 (cephfs mount of /a)
>
> Creating the test file.
> [root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf >
> testfile.txt
>
> Then I am moving on c04 the test file from the test folder(pool b)
> to
> the a folder/pool
>
> Now on test2
> [root@test2 m]# ls -arlt
> -rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
> -rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
> -rw-r--r--  1 nobody nobody37 Oct  1 21:02 testfile.txt
>
> [root@test2 /]# cat /mnt/m/testfile.txt
> cat: /mnt/m/old/testfile.txt: Input/output error
>
> [root@test2 /]# cat /mnt/m2/testfile.txt
> cat: /mnt/m2/old/testfile.txt: Operation not permitted
>
> Now I am creating a copy of the test file in the same directory
> back on
> c04
>
> [root@c04 a]# cp testfile.txt testfile-copy.txt
> [root@c04 a]# ls -alrt
> -rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
> -rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
> -rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt
>
> Now I trying to access the copy of testfile.txt back on test2
> (without
> unmounting, or changing permissions)
>
> [root@test2 /]# cat /mnt/m/testfile-copy.txt
> asdfasdfasdfasdfasdfasdfasdfasdfasdf
> [root@test2 /]# cat /mnt/m2/testfile-copy.txt
> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>
>
>
>
>
>
>
> -Original Message-
> From: Yan, Zheng [mailto:uker...@gmail.com]
> Sent: zaterdag 29 september 2018 6:55
> To: Marc Roos
> Subject: Re: [ceph-users] cephfs issue with moving files between
> data
> pools gives Input/output error
>
> check_pool_perm on pool 30 ns  need Fr, but no read perm
>
> client does not permission to read the pool.  ceph-fuse did return
> EPERM
> for the kernel readpage request.  But kernel return -EIO for any
> readpage error.
> On Fri, Sep 28, 2018 at 10:09 PM Marc Roos
> 
> wrote:
> >
> >
> > Is this useful? I think this is the section of the client log
> when
> >
> > [@test2 m]$ cat out6
> > cat: out6: Input/output error
> >
> > 2018-09-28 16:03:39.082200 7f1ad01f1700 10 client.3246756
> fill_statx
> > on 0x100010943bc snap/devhead mode 040557 mtime 2018-09-28
> > 14:49:35.349370 ctime 2018-09-28 14:49:35.349370
> > 2018-09-28 16:03:39.082223 7f1ad01f1700  3 client.3246756
> ll_getattrx
> > 0x100010943bc.head = 0
> > 2018-09-28 16:03:39.082727 7f1ae813f700 10 client.3246756
> fill_statx
> > on
> > 0x10001698ac5 snap/devhead mode 0100644 mtime 2018-09-28
> > 14:45:50.323273 ctime 2018-09-28 14:47:47.028679
> 

Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-01 Thread Paul Emmerich
Which kernel version are you using for the kernel cephfs clients?
I've seen this problem with "older" kernels (where old is as recent as 4.9)

Paul
Am Mo., 1. Okt. 2018 um 18:35 Uhr schrieb Jaime Ibar :
>
> Hi all,
>
> we're running a ceph 12.2.7 Luminous cluster, two weeks ago we enabled
> multi mds and after few hours
>
> these errors started showing up
>
> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
> old, received at 2018-09-28 09:40:16.155841:
> client_request(client.31059144:8544450 getattr Xs #0$
> 12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
> currently failed to authpin local pins
>
> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
> failing to respond to cache pressure (MDS_CLIENT_RECALL)
> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
> below; oldest blocked for > 4614.580689 secs
> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
> old, received at 2018-09-28 10:53:03.203476:
> client_request(client.31059144:9080057 lookup #0x100
> 000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
> currently initiated
> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
> failing to respond to capability release; 5 clients failing to respond
> to cache pressure; 1 MDSs report slow requests,
>
> Due to this, we decided to go back to a single mds (as it worked before);
> however, the clients pointing to mds.1 started hanging, while the
> ones pointing to mds.0 worked fine.
>
> Then we tried to enable multi mds again and the clients pointing to mds.1
> went back online, but the ones pointing to mds.0 stopped working.
>
> Today we tried to go back to a single mds, but this error was
> preventing ceph from disabling the second active mds (mds.1):
>
> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
> X: (30108925), after 68213.084174 seconds
>
> After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck in
> the stopping state forever due to the above error), waited for it to
> become active again,
>
> unmounted the problematic clients, waited for the cluster to be healthy, and
> tried to go back to a single mds again.
>
> Apparently this worked for some of the clients. We tried to enable
> multi mds again to bring the faulty clients back, but no luck this
> time,
>
> and some of them are still hanging and can't access cephfs.
>
> This is what we have in kern.log
>
> Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
> Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
> Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
>
> Not sure what else we can try to bring the hanging clients back without
> rebooting, as they're in production and rebooting is not an option.
>
> Does anyone know how can we deal with this, please?
>
> Thanks
>
> Jaime
>
> --
>
> Jaime Ibar
> High Performance & Research Computing, IS Services
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
> Tel: +353-1-896-3725
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic Upgrade, features not showing up

2018-10-01 Thread Gregory Farnum
On Mon, Oct 1, 2018 at 12:37 PM William Law  wrote:

> Hi -
>
> I feel like we missed something with upgrading to mimic from luminous.
> Everything went fine, but running 'ceph features' still shows luminous
> across the system.  Running 'ceph versions' shows that everything is at
> mimic.
>
> The cluster shows healthy.  Any ideas?
>

...yep! We didn't update a key function in that pipeline, so it wasn't
willing to show Mimic. I generated a PR:
https://github.com/ceph/ceph/pull/24360

Thanks (to you and others) for mentioning this!
-Greg


>
> Thanks.
>
> Will
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread John Spray
On Mon, Oct 1, 2018 at 8:43 PM Marc Roos  wrote:
>
> Hmmm, did not know that, so it becomes a soft link or so?
>
> totally new for me, also not what I would expect of a mv on a fs. I know
> this is normal to expect coping between pools, also from the s3cmd
> client. But I think more people will not expect this behaviour. Can't
> the move be implemented as a move?

In almost all filesystems, a rename (like "mv") is a pure metadata
operation -- it doesn't involve reading all the file's data and
re-writing it.  It would be very surprising for most users if they
found that their "mv" command blocked for a very long time while
waiting for a large file's content to be e.g. read out of one pool and
written into another.  It would be especially surprising if the system
was close to being full, and the user found that "mv" operations
started getting ENOSPC because there wasn't enough room to write to a
new location before clearing data out of the old one.

All that said, I would quite like to be able to explicitly change
layouts on files and have the data copy happen in the background --
the route of doing a "cp -R" does feel hacky.
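
In the meantime the copy has to be explicit: write the data out again under
a directory whose layout points at the target pool, then rename over the
original. A rough sketch (paths and names are made up for illustration):

  # the directory layout determines where newly written data goes
  getfattr -n ceph.dir.layout.pool /mnt/cephfs/a
  # rewrite the file so its objects land in the new pool, then rename
  cp -a /mnt/cephfs/test/bigfile /mnt/cephfs/a/.bigfile.tmp
  mv /mnt/cephfs/a/.bigfile.tmp /mnt/cephfs/a/bigfile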

> How can users even know about what folders have a 'different layout'.
> What happens if we export such mixed pool filesystem via smb. How would
> smb deal with the 'move' between those directories?

Samba would pass a rename through to Ceph, so you'd have the same
behaviour whether doing a mv via Samba or doing it via a native CephFS
client.
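
As for knowing which directories have a different layout: the layout a
directory or file actually uses can be read from any client via the ceph
virtual xattrs (paths here are only examples; a directory that never had an
explicit layout set simply reports no such attribute):

  getfattr -n ceph.dir.layout /mnt/cephfs/a
  getfattr -n ceph.file.layout.pool /mnt/cephfs/a/testfile.txt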

John

>
>
>
> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: maandag 1 oktober 2018 21:28
> To: Marc Roos
> Cc: ceph-users; jspray; ukernel
> Subject: Re: [ceph-users] cephfs issue with moving files between data
> pools gives Input/output error
>
> Moving a file into a directory with a different layout does not, and is
> not intended to, copy the underlying file data into a different pool
> with the new layout. If you want to do that you have to make it happen
> yourself by doing a copy.
>
> On Mon, Oct 1, 2018 at 12:16 PM Marc Roos 
> wrote:
>
>
>
> I will explain the test again, I think you might have some bug in
> your
> cephfs copy between data pools.
>
> c04 has mounted the root cephfs
> /a (has data pool a, ec21)
> /test (has data pool b, r1)
>
> test2 has mounted
> /m  (nfs mount of cephfs /a)
> /m2 (cephfs mount of /a)
>
> Creating the test file.
> [root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf >
> testfile.txt
>
> Then I am moving on c04 the test file from the test folder(pool b)
> to
> the a folder/pool
>
> Now on test2
> [root@test2 m]# ls -arlt
> -rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
> -rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
> -rw-r--r--  1 nobody nobody37 Oct  1 21:02 testfile.txt
>
> [root@test2 /]# cat /mnt/m/testfile.txt
> cat: /mnt/m/old/testfile.txt: Input/output error
>
> [root@test2 /]# cat /mnt/m2/testfile.txt
> cat: /mnt/m2/old/testfile.txt: Operation not permitted
>
> Now I am creating a copy of the test file in the same directory
> back on
> c04
>
> [root@c04 a]# cp testfile.txt testfile-copy.txt
> [root@c04 a]# ls -alrt
> -rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
> -rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
> -rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt
>
> Now I trying to access the copy of testfile.txt back on test2
> (without
> unmounting, or changing permissions)
>
> [root@test2 /]# cat /mnt/m/testfile-copy.txt
> asdfasdfasdfasdfasdfasdfasdfasdfasdf
> [root@test2 /]# cat /mnt/m2/testfile-copy.txt
> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>
>
>
>
>
>
>
> -Original Message-
> From: Yan, Zheng [mailto:uker...@gmail.com]
> Sent: zaterdag 29 september 2018 6:55
> To: Marc Roos
> Subject: Re: [ceph-users] cephfs issue with moving files between
> data
> pools gives Input/output error
>
> check_pool_perm on pool 30 ns  need Fr, but no read perm
>
> client does not permission to read the pool.  ceph-fuse did return
> EPERM
> for the kernel readpage request.  But kernel return -EIO for any
> readpage error.
> On Fri, Sep 28, 2018 at 10:09 PM Marc Roos
> 
> wrote:
> >
> >
> > Is this useful? I think this is the section of the client log
> when
> >
> > [@test2 m]$ cat out6
> > cat: out6: Input/output error
> >
> > 2018-09-28 16:03:39.082200 7f1ad01f1700 10 client.3246756
> fill_statx
> > on 0x100010943bc snap/devhead mode 040557 mtime 2018-09-28
> > 14:49:35.349370 ctime 2018-09-28 14:49:35.349370
> > 

Re: [ceph-users] Is object name used by CRUSH algorithm?

2018-10-01 Thread Paul Emmerich
You are probably thinking of Amazon's S3 where adding a random prefix
to object names was a common performance optimization in the past, but
I think they fixed that recently.

Anyways, common prefixes are very common in Ceph and no problem. For
example, all objects within an rbd image have the same prefix.
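
If you want to convince yourself, you can ask the cluster where a handful of
objects sharing a long prefix would be placed; the PG/OSD mappings scatter
even though the names only differ in the last character (the pool and object
names below are just examples):

  for i in 1 2 3 4; do
    ceph osd map default.rgw.buckets.data "dir1/dir2/dir3/2018/10/0$i"
  done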

Paul
Am Do., 27. Sep. 2018 um 21:09 Uhr schrieb Jin Mao :
>
> I am running luminous and the objects were copied from Isilon with a long and 
> similar prefix in path like /dir1/dir2/dir3//mm/dd. The objects are 
> copied to various buckets like bucket_MMDD/dir1/dir2/dir3//mm/dd. 
> This setup minimize some internal code change when moving from NFS to object 
> store.
>
> I heard that CRUSH may NOT evenly balance OSDs if there are many common 
> leading characters in the object name? However, I couldn't find any evidence 
> to support this.
>
> Does anyone know further details about this?
>
> Thank you.
>
> Jin.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread John Spray
On Mon, Oct 1, 2018 at 8:41 PM Vasu Kulkarni  wrote:
>
> On Mon, Oct 1, 2018 at 12:28 PM Gregory Farnum  wrote:
> >
> > Moving a file into a directory with a different layout does not, and is not 
> > intended to, copy the underlying file data into a different pool with the 
> > new layout. If you want to do that you have to make it happen yourself by 
> > doing a copy.
> Isn't this a bug? Isnt unix 'mv' from admin point same as "cp" and "rm"

The result is similar but not identical.  cp creates a new file with a
new inode number (this matters -- if someone had the old file open,
they won't have the new one open).  mv really moves the existing file
(if someone had the file open, they wouldn't notice).
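
A quick way to see the difference (a hypothetical illustration; the inode
numbers are made up):

  $ ls -i /mnt/cephfs/test/testfile.txt
  1099511627776 testfile.txt
  $ mv /mnt/cephfs/test/testfile.txt /mnt/cephfs/a/testfile.txt
  $ ls -i /mnt/cephfs/a/testfile.txt
  1099511627776 testfile.txt        # same inode, data objects stay where they were
  $ cp /mnt/cephfs/a/testfile.txt /mnt/cephfs/a/testfile-copy.txt
  $ ls -i /mnt/cephfs/a/testfile-copy.txt
  1099511627777 testfile-copy.txt   # new inode, data written via the directory's layout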

John

>
> >
> > On Mon, Oct 1, 2018 at 12:16 PM Marc Roos  wrote:
> >>
> >>
> >> I will explain the test again, I think you might have some bug in your
> >> cephfs copy between data pools.
> >>
> >> c04 has mounted the root cephfs
> >> /a (has data pool a, ec21)
> >> /test (has data pool b, r1)
> >>
> >> test2 has mounted
> >> /m  (nfs mount of cephfs /a)
> >> /m2 (cephfs mount of /a)
> >>
> >> Creating the test file.
> >> [root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf >
> >> testfile.txt
> >>
> >> Then I am moving on c04 the test file from the test folder(pool b) to
> >> the a folder/pool
> >>
> >> Now on test2
> >> [root@test2 m]# ls -arlt
> >> -rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
> >> -rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
> >> -rw-r--r--  1 nobody nobody37 Oct  1 21:02 testfile.txt
> >>
> >> [root@test2 /]# cat /mnt/m/testfile.txt
> >> cat: /mnt/m/old/testfile.txt: Input/output error
> >>
> >> [root@test2 /]# cat /mnt/m2/testfile.txt
> >> cat: /mnt/m2/old/testfile.txt: Operation not permitted
> >>
> >> Now I am creating a copy of the test file in the same directory back on
> >> c04
> >>
> >> [root@c04 a]# cp testfile.txt testfile-copy.txt
> >> [root@c04 a]# ls -alrt
> >> -rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
> >> -rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
> >> -rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt
> >>
> >> Now I trying to access the copy of testfile.txt back on test2 (without
> >> unmounting, or changing permissions)
> >>
> >> [root@test2 /]# cat /mnt/m/testfile-copy.txt
> >> asdfasdfasdfasdfasdfasdfasdfasdfasdf
> >> [root@test2 /]# cat /mnt/m2/testfile-copy.txt
> >> asdfasdfasdfasdfasdfasdfasdfasdfasdf
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> -Original Message-
> >> From: Yan, Zheng [mailto:uker...@gmail.com]
> >> Sent: zaterdag 29 september 2018 6:55
> >> To: Marc Roos
> >> Subject: Re: [ceph-users] cephfs issue with moving files between data
> >> pools gives Input/output error
> >>
> >> check_pool_perm on pool 30 ns  need Fr, but no read perm
> >>
> >> client does not permission to read the pool.  ceph-fuse did return EPERM
> >> for the kernel readpage request.  But kernel return -EIO for any
> >> readpage error.
> >> On Fri, Sep 28, 2018 at 10:09 PM Marc Roos 
> >> wrote:
> >> >
> >> >
> >> > Is this useful? I think this is the section of the client log when
> >> >
> >> > [@test2 m]$ cat out6
> >> > cat: out6: Input/output error
> >> >
> >> > 2018-09-28 16:03:39.082200 7f1ad01f1700 10 client.3246756 fill_statx
> >> > on 0x100010943bc snap/devhead mode 040557 mtime 2018-09-28
> >> > 14:49:35.349370 ctime 2018-09-28 14:49:35.349370
> >> > 2018-09-28 16:03:39.082223 7f1ad01f1700  3 client.3246756 ll_getattrx
> >> > 0x100010943bc.head = 0
> >> > 2018-09-28 16:03:39.082727 7f1ae813f700 10 client.3246756 fill_statx
> >> > on
> >> > 0x10001698ac5 snap/devhead mode 0100644 mtime 2018-09-28
> >> > 14:45:50.323273 ctime 2018-09-28 14:47:47.028679
> >> > 2018-09-28 16:03:39.082737 7f1ae813f700  3 client.3246756 ll_getattrx
> >> > 0x10001698ac5.head = 0
> >> > 2018-09-28 16:03:39.083149 7f1ac07f8700  3 client.3246756 ll_open
> >> > 0x10001698ac5.head 0
> >> > 2018-09-28 16:03:39.083160 7f1ac07f8700 10 client.3246756 _getattr
> >> > mask As issued=1
> >> > 2018-09-28 16:03:39.083165 7f1ac07f8700  3 client.3246756 may_open
> >> > 0x7f1a7810ad00 = 0
> >> > 2018-09-28 16:03:39.083169 7f1ac07f8700 10 break_deleg: breaking
> >> > delegs on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1 cap_refs={}
> >> > open={1=1}
> >> > mode=100644 size=17/0 nlink=1 mtime=2018-09-28 14:45:50.323273
> >> > caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts 0/0 objects 0
> >> > dirty_or_tx 0] parents=0x7f1a780f1dd0 0x7f1a7810ad00)
> >> > 2018-09-28 16:03:39.083183 7f1ac07f8700 10 delegations_broken:
> >> > delegations empty on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1
> >> > cap_refs={} open={1=1} mode=100644 size=17/0 nlink=1 mtime=2018-09-28
> >> > 14:45:50.323273 caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts
> >>
> >> > 0/0 objects 0 dirty_or_tx 0] parents=0x7f1a780f1dd0 0x7f1a7810ad00)
> >> > 2018-09-28 16:03:39.083198 7f1ac07f8700 10 client.3246756
> >> > 

Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread Marc Roos
Hmmm, did not know that, so it becomes a soft link or so?

totally new for me, also not what I would expect of a mv on a fs. I know 
this is normal to expect copying between pools, also from the s3cmd 
client. But I think more people will not expect this behaviour. Can't 
the move be implemented as a move?

How can users even know which folders have a 'different layout'?
What happens if we export such a mixed-pool filesystem via smb? How would
smb deal with the 'move' between those directories?




-Original Message-
From: Gregory Farnum [mailto:gfar...@redhat.com] 
Sent: maandag 1 oktober 2018 21:28
To: Marc Roos
Cc: ceph-users; jspray; ukernel
Subject: Re: [ceph-users] cephfs issue with moving files between data 
pools gives Input/output error

Moving a file into a directory with a different layout does not, and is 
not intended to, copy the underlying file data into a different pool 
with the new layout. If you want to do that you have to make it happen 
yourself by doing a copy.

On Mon, Oct 1, 2018 at 12:16 PM Marc Roos  
wrote:


 
I will explain the test again, I think you might have some bug in 
your 
cephfs copy between data pools.

c04 has mounted the root cephfs 
/a (has data pool a, ec21)
/test (has data pool b, r1)

test2 has mounted
/m  (nfs mount of cephfs /a)
/m2 (cephfs mount of /a)

Creating the test file.
[root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf > 
testfile.txt

Then I am moving on c04 the test file from the test folder(pool b) 
to 
the a folder/pool

Now on test2
[root@test2 m]# ls -arlt
-rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
-rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
-rw-r--r--  1 nobody nobody37 Oct  1 21:02 testfile.txt

[root@test2 /]# cat /mnt/m/testfile.txt
cat: /mnt/m/old/testfile.txt: Input/output error

[root@test2 /]# cat /mnt/m2/testfile.txt
cat: /mnt/m2/old/testfile.txt: Operation not permitted

Now I am creating a copy of the test file in the same directory 
back on 
c04

[root@c04 a]# cp testfile.txt testfile-copy.txt
[root@c04 a]# ls -alrt
-rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
-rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
-rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt

Now I trying to access the copy of testfile.txt back on test2 
(without 
unmounting, or changing permissions)

[root@test2 /]# cat /mnt/m/testfile-copy.txt
asdfasdfasdfasdfasdfasdfasdfasdfasdf
[root@test2 /]# cat /mnt/m2/testfile-copy.txt
asdfasdfasdfasdfasdfasdfasdfasdfasdf







-Original Message-
From: Yan, Zheng [mailto:uker...@gmail.com] 
Sent: zaterdag 29 september 2018 6:55
To: Marc Roos
Subject: Re: [ceph-users] cephfs issue with moving files between 
data 
pools gives Input/output error

check_pool_perm on pool 30 ns  need Fr, but no read perm

client does not permission to read the pool.  ceph-fuse did return 
EPERM 
for the kernel readpage request.  But kernel return -EIO for any 
readpage error.
On Fri, Sep 28, 2018 at 10:09 PM Marc Roos 
 
wrote:
>
>
> Is this useful? I think this is the section of the client log 
when
>
> [@test2 m]$ cat out6
> cat: out6: Input/output error
>
> 2018-09-28 16:03:39.082200 7f1ad01f1700 10 client.3246756 
fill_statx 
> on 0x100010943bc snap/devhead mode 040557 mtime 2018-09-28 
> 14:49:35.349370 ctime 2018-09-28 14:49:35.349370
> 2018-09-28 16:03:39.082223 7f1ad01f1700  3 client.3246756 
ll_getattrx 
> 0x100010943bc.head = 0
> 2018-09-28 16:03:39.082727 7f1ae813f700 10 client.3246756 
fill_statx 
> on
> 0x10001698ac5 snap/devhead mode 0100644 mtime 2018-09-28 
> 14:45:50.323273 ctime 2018-09-28 14:47:47.028679
> 2018-09-28 16:03:39.082737 7f1ae813f700  3 client.3246756 
ll_getattrx 
> 0x10001698ac5.head = 0
> 2018-09-28 16:03:39.083149 7f1ac07f8700  3 client.3246756 ll_open 

> 0x10001698ac5.head 0
> 2018-09-28 16:03:39.083160 7f1ac07f8700 10 client.3246756 
_getattr 
> mask As issued=1
> 2018-09-28 16:03:39.083165 7f1ac07f8700  3 client.3246756 
may_open 
> 0x7f1a7810ad00 = 0
> 2018-09-28 16:03:39.083169 7f1ac07f8700 10 break_deleg: breaking 
> delegs on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1 
cap_refs={} 
> open={1=1}
> mode=100644 size=17/0 nlink=1 

Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread Vasu Kulkarni
On Mon, Oct 1, 2018 at 12:28 PM Gregory Farnum  wrote:
>
> Moving a file into a directory with a different layout does not, and is not 
> intended to, copy the underlying file data into a different pool with the new 
> layout. If you want to do that you have to make it happen yourself by doing a 
> copy.
Isn't this a bug? Isn't unix 'mv', from an admin's point of view, the same as "cp" and "rm"?

>
> On Mon, Oct 1, 2018 at 12:16 PM Marc Roos  wrote:
>>
>>
>> I will explain the test again, I think you might have some bug in your
>> cephfs copy between data pools.
>>
>> c04 has mounted the root cephfs
>> /a (has data pool a, ec21)
>> /test (has data pool b, r1)
>>
>> test2 has mounted
>> /m  (nfs mount of cephfs /a)
>> /m2 (cephfs mount of /a)
>>
>> Creating the test file.
>> [root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf >
>> testfile.txt
>>
>> Then I am moving on c04 the test file from the test folder(pool b) to
>> the a folder/pool
>>
>> Now on test2
>> [root@test2 m]# ls -arlt
>> -rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
>> -rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
>> -rw-r--r--  1 nobody nobody37 Oct  1 21:02 testfile.txt
>>
>> [root@test2 /]# cat /mnt/m/testfile.txt
>> cat: /mnt/m/old/testfile.txt: Input/output error
>>
>> [root@test2 /]# cat /mnt/m2/testfile.txt
>> cat: /mnt/m2/old/testfile.txt: Operation not permitted
>>
>> Now I am creating a copy of the test file in the same directory back on
>> c04
>>
>> [root@c04 a]# cp testfile.txt testfile-copy.txt
>> [root@c04 a]# ls -alrt
>> -rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
>> -rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
>> -rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt
>>
>> Now I trying to access the copy of testfile.txt back on test2 (without
>> unmounting, or changing permissions)
>>
>> [root@test2 /]# cat /mnt/m/testfile-copy.txt
>> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>> [root@test2 /]# cat /mnt/m2/testfile-copy.txt
>> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Yan, Zheng [mailto:uker...@gmail.com]
>> Sent: zaterdag 29 september 2018 6:55
>> To: Marc Roos
>> Subject: Re: [ceph-users] cephfs issue with moving files between data
>> pools gives Input/output error
>>
>> check_pool_perm on pool 30 ns  need Fr, but no read perm
>>
>> client does not permission to read the pool.  ceph-fuse did return EPERM
>> for the kernel readpage request.  But kernel return -EIO for any
>> readpage error.
>> On Fri, Sep 28, 2018 at 10:09 PM Marc Roos 
>> wrote:
>> >
>> >
>> > Is this useful? I think this is the section of the client log when
>> >
>> > [@test2 m]$ cat out6
>> > cat: out6: Input/output error
>> >
>> > 2018-09-28 16:03:39.082200 7f1ad01f1700 10 client.3246756 fill_statx
>> > on 0x100010943bc snap/devhead mode 040557 mtime 2018-09-28
>> > 14:49:35.349370 ctime 2018-09-28 14:49:35.349370
>> > 2018-09-28 16:03:39.082223 7f1ad01f1700  3 client.3246756 ll_getattrx
>> > 0x100010943bc.head = 0
>> > 2018-09-28 16:03:39.082727 7f1ae813f700 10 client.3246756 fill_statx
>> > on
>> > 0x10001698ac5 snap/devhead mode 0100644 mtime 2018-09-28
>> > 14:45:50.323273 ctime 2018-09-28 14:47:47.028679
>> > 2018-09-28 16:03:39.082737 7f1ae813f700  3 client.3246756 ll_getattrx
>> > 0x10001698ac5.head = 0
>> > 2018-09-28 16:03:39.083149 7f1ac07f8700  3 client.3246756 ll_open
>> > 0x10001698ac5.head 0
>> > 2018-09-28 16:03:39.083160 7f1ac07f8700 10 client.3246756 _getattr
>> > mask As issued=1
>> > 2018-09-28 16:03:39.083165 7f1ac07f8700  3 client.3246756 may_open
>> > 0x7f1a7810ad00 = 0
>> > 2018-09-28 16:03:39.083169 7f1ac07f8700 10 break_deleg: breaking
>> > delegs on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1 cap_refs={}
>> > open={1=1}
>> > mode=100644 size=17/0 nlink=1 mtime=2018-09-28 14:45:50.323273
>> > caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts 0/0 objects 0
>> > dirty_or_tx 0] parents=0x7f1a780f1dd0 0x7f1a7810ad00)
>> > 2018-09-28 16:03:39.083183 7f1ac07f8700 10 delegations_broken:
>> > delegations empty on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1
>> > cap_refs={} open={1=1} mode=100644 size=17/0 nlink=1 mtime=2018-09-28
>> > 14:45:50.323273 caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts
>>
>> > 0/0 objects 0 dirty_or_tx 0] parents=0x7f1a780f1dd0 0x7f1a7810ad00)
>> > 2018-09-28 16:03:39.083198 7f1ac07f8700 10 client.3246756
>> > choose_target_mds from caps on inode 0x10001698ac5.head(faked_ino=0
>> > ref=3 ll_ref=1 cap_refs={} open={1=1} mode=100644 size=17/0 nlink=1
>> > mtime=2018-09-28 14:45:50.323273 caps=pAsLsXsFs(0=pAsLsXsFs)
>> > objectset[0x10001698ac5 ts 0/0 objects 0 dirty_or_tx 0]
>> > parents=0x7f1a780f1dd0 0x7f1a7810ad00)
>> > 2018-09-28 16:03:39.083209 7f1ac07f8700 10 client.3246756 send_request
>>
>> > rebuilding request 1911 for mds.0
>> > 2018-09-28 16:03:39.083218 7f1ac07f8700 10 client.3246756 send_request
>> > client_request(unknown.0:1911 open #0x10001698ac5 

[ceph-users] Mimic Upgrade, features not showing up

2018-10-01 Thread William Law
Hi -

I feel like we missed something with upgrading to mimic from luminous.  
Everything went fine, but running 'ceph features' still shows luminous across 
the system.  Running 'ceph versions' shows that everything is at mimic.

The cluster shows healthy.  Any ideas?

Thanks.

Will
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread Gregory Farnum
Moving a file into a directory with a different layout does not, and is not
intended to, copy the underlying file data into a different pool with the
new layout. If you want to do that you have to make it happen yourself by
doing a copy.

On Mon, Oct 1, 2018 at 12:16 PM Marc Roos  wrote:

>
> I will explain the test again, I think you might have some bug in your
> cephfs copy between data pools.
>
> c04 has mounted the root cephfs
> /a (has data pool a, ec21)
> /test (has data pool b, r1)
>
> test2 has mounted
> /m  (nfs mount of cephfs /a)
> /m2 (cephfs mount of /a)
>
> Creating the test file.
> [root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf >
> testfile.txt
>
> Then I am moving on c04 the test file from the test folder(pool b) to
> the a folder/pool
>
> Now on test2
> [root@test2 m]# ls -arlt
> -rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
> -rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
> -rw-r--r--  1 nobody nobody37 Oct  1 21:02 testfile.txt
>
> [root@test2 /]# cat /mnt/m/testfile.txt
> cat: /mnt/m/old/testfile.txt: Input/output error
>
> [root@test2 /]# cat /mnt/m2/testfile.txt
> cat: /mnt/m2/old/testfile.txt: Operation not permitted
>
> Now I am creating a copy of the test file in the same directory back on
> c04
>
> [root@c04 a]# cp testfile.txt testfile-copy.txt
> [root@c04 a]# ls -alrt
> -rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
> -rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
> -rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt
>
> Now I trying to access the copy of testfile.txt back on test2 (without
> unmounting, or changing permissions)
>
> [root@test2 /]# cat /mnt/m/testfile-copy.txt
> asdfasdfasdfasdfasdfasdfasdfasdfasdf
> [root@test2 /]# cat /mnt/m2/testfile-copy.txt
> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>
>
>
>
>
>
>
> -Original Message-
> From: Yan, Zheng [mailto:uker...@gmail.com]
> Sent: zaterdag 29 september 2018 6:55
> To: Marc Roos
> Subject: Re: [ceph-users] cephfs issue with moving files between data
> pools gives Input/output error
>
> check_pool_perm on pool 30 ns  need Fr, but no read perm
>
> client does not permission to read the pool.  ceph-fuse did return EPERM
> for the kernel readpage request.  But kernel return -EIO for any
> readpage error.
> On Fri, Sep 28, 2018 at 10:09 PM Marc Roos 
> wrote:
> >
> >
> > Is this useful? I think this is the section of the client log when
> >
> > [@test2 m]$ cat out6
> > cat: out6: Input/output error
> >
> > 2018-09-28 16:03:39.082200 7f1ad01f1700 10 client.3246756 fill_statx
> > on 0x100010943bc snap/devhead mode 040557 mtime 2018-09-28
> > 14:49:35.349370 ctime 2018-09-28 14:49:35.349370
> > 2018-09-28 16:03:39.082223 7f1ad01f1700  3 client.3246756 ll_getattrx
> > 0x100010943bc.head = 0
> > 2018-09-28 16:03:39.082727 7f1ae813f700 10 client.3246756 fill_statx
> > on
> > 0x10001698ac5 snap/devhead mode 0100644 mtime 2018-09-28
> > 14:45:50.323273 ctime 2018-09-28 14:47:47.028679
> > 2018-09-28 16:03:39.082737 7f1ae813f700  3 client.3246756 ll_getattrx
> > 0x10001698ac5.head = 0
> > 2018-09-28 16:03:39.083149 7f1ac07f8700  3 client.3246756 ll_open
> > 0x10001698ac5.head 0
> > 2018-09-28 16:03:39.083160 7f1ac07f8700 10 client.3246756 _getattr
> > mask As issued=1
> > 2018-09-28 16:03:39.083165 7f1ac07f8700  3 client.3246756 may_open
> > 0x7f1a7810ad00 = 0
> > 2018-09-28 16:03:39.083169 7f1ac07f8700 10 break_deleg: breaking
> > delegs on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1 cap_refs={}
> > open={1=1}
> > mode=100644 size=17/0 nlink=1 mtime=2018-09-28 14:45:50.323273
> > caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts 0/0 objects 0
> > dirty_or_tx 0] parents=0x7f1a780f1dd0 0x7f1a7810ad00)
> > 2018-09-28 16:03:39.083183 7f1ac07f8700 10 delegations_broken:
> > delegations empty on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1
> > cap_refs={} open={1=1} mode=100644 size=17/0 nlink=1 mtime=2018-09-28
> > 14:45:50.323273 caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts
>
> > 0/0 objects 0 dirty_or_tx 0] parents=0x7f1a780f1dd0 0x7f1a7810ad00)
> > 2018-09-28 16:03:39.083198 7f1ac07f8700 10 client.3246756
> > choose_target_mds from caps on inode 0x10001698ac5.head(faked_ino=0
> > ref=3 ll_ref=1 cap_refs={} open={1=1} mode=100644 size=17/0 nlink=1
> > mtime=2018-09-28 14:45:50.323273 caps=pAsLsXsFs(0=pAsLsXsFs)
> > objectset[0x10001698ac5 ts 0/0 objects 0 dirty_or_tx 0]
> > parents=0x7f1a780f1dd0 0x7f1a7810ad00)
> > 2018-09-28 16:03:39.083209 7f1ac07f8700 10 client.3246756 send_request
>
> > rebuilding request 1911 for mds.0
> > 2018-09-28 16:03:39.083218 7f1ac07f8700 10 client.3246756 send_request
> > client_request(unknown.0:1911 open #0x10001698ac5 2018-09-28
> > 16:03:39.083194 caller_uid=501, caller_gid=501{501,}) v4 to mds.0
> > 2018-09-28 16:03:39.084088 7f1a82ffd700  5 client.3246756
> > set_cap_epoch_barrier epoch = 24093
> > 2018-09-28 16:03:39.084097 7f1a82ffd700 10 client.3246756  mds.0 seq
> > now
> > 1

Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread Marc Roos
 
I will explain the test again, I think you might have some bug in your 
cephfs copy between data pools.

c04 has mounted the root cephfs 
/a (has data pool a, ec21)
/test (has data pool b, r1)

test2 has mounted
/m  (nfs mount of cephfs /a)
/m2 (cephfs mount of /a)

Creating the test file.
[root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf > 
testfile.txt

Then I am moving on c04 the test file from the test folder(pool b) to 
the a folder/pool

Now on test2
[root@test2 m]# ls -arlt
-rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
-rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
-rw-r--r--  1 nobody nobody37 Oct  1 21:02 testfile.txt

[root@test2 /]# cat /mnt/m/testfile.txt
cat: /mnt/m/old/testfile.txt: Input/output error

[root@test2 /]# cat /mnt/m2/testfile.txt
cat: /mnt/m2/old/testfile.txt: Operation not permitted

Now I am creating a copy of the test file in the same directory back on 
c04

[root@c04 a]# cp testfile.txt testfile-copy.txt
[root@c04 a]# ls -alrt
-rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
-rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
-rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt

Now I am trying to access the copy of testfile.txt back on test2 (without 
unmounting, or changing permissions)

[root@test2 /]# cat /mnt/m/testfile-copy.txt
asdfasdfasdfasdfasdfasdfasdfasdfasdf
[root@test2 /]# cat /mnt/m2/testfile-copy.txt
asdfasdfasdfasdfasdfasdfasdfasdfasdf







-Original Message-
From: Yan, Zheng [mailto:uker...@gmail.com] 
Sent: zaterdag 29 september 2018 6:55
To: Marc Roos
Subject: Re: [ceph-users] cephfs issue with moving files between data 
pools gives Input/output error

check_pool_perm on pool 30 ns  need Fr, but no read perm

client does not permission to read the pool.  ceph-fuse did return EPERM 
for the kernel readpage request.  But kernel return -EIO for any 
readpage error.
On Fri, Sep 28, 2018 at 10:09 PM Marc Roos  
wrote:
>
>
> Is this useful? I think this is the section of the client log when
>
> [@test2 m]$ cat out6
> cat: out6: Input/output error
>
> 2018-09-28 16:03:39.082200 7f1ad01f1700 10 client.3246756 fill_statx 
> on 0x100010943bc snap/devhead mode 040557 mtime 2018-09-28 
> 14:49:35.349370 ctime 2018-09-28 14:49:35.349370
> 2018-09-28 16:03:39.082223 7f1ad01f1700  3 client.3246756 ll_getattrx 
> 0x100010943bc.head = 0
> 2018-09-28 16:03:39.082727 7f1ae813f700 10 client.3246756 fill_statx 
> on
> 0x10001698ac5 snap/devhead mode 0100644 mtime 2018-09-28 
> 14:45:50.323273 ctime 2018-09-28 14:47:47.028679
> 2018-09-28 16:03:39.082737 7f1ae813f700  3 client.3246756 ll_getattrx 
> 0x10001698ac5.head = 0
> 2018-09-28 16:03:39.083149 7f1ac07f8700  3 client.3246756 ll_open 
> 0x10001698ac5.head 0
> 2018-09-28 16:03:39.083160 7f1ac07f8700 10 client.3246756 _getattr 
> mask As issued=1
> 2018-09-28 16:03:39.083165 7f1ac07f8700  3 client.3246756 may_open 
> 0x7f1a7810ad00 = 0
> 2018-09-28 16:03:39.083169 7f1ac07f8700 10 break_deleg: breaking 
> delegs on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1 cap_refs={} 
> open={1=1}
> mode=100644 size=17/0 nlink=1 mtime=2018-09-28 14:45:50.323273
> caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts 0/0 objects 0 
> dirty_or_tx 0] parents=0x7f1a780f1dd0 0x7f1a7810ad00)
> 2018-09-28 16:03:39.083183 7f1ac07f8700 10 delegations_broken:
> delegations empty on 0x10001698ac5.head(faked_ino=0 ref=2 ll_ref=1 
> cap_refs={} open={1=1} mode=100644 size=17/0 nlink=1 mtime=2018-09-28
> 14:45:50.323273 caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts 

> 0/0 objects 0 dirty_or_tx 0] parents=0x7f1a780f1dd0 0x7f1a7810ad00)
> 2018-09-28 16:03:39.083198 7f1ac07f8700 10 client.3246756 
> choose_target_mds from caps on inode 0x10001698ac5.head(faked_ino=0
> ref=3 ll_ref=1 cap_refs={} open={1=1} mode=100644 size=17/0 nlink=1
> mtime=2018-09-28 14:45:50.323273 caps=pAsLsXsFs(0=pAsLsXsFs)
> objectset[0x10001698ac5 ts 0/0 objects 0 dirty_or_tx 0] 
> parents=0x7f1a780f1dd0 0x7f1a7810ad00)
> 2018-09-28 16:03:39.083209 7f1ac07f8700 10 client.3246756 send_request 

> rebuilding request 1911 for mds.0
> 2018-09-28 16:03:39.083218 7f1ac07f8700 10 client.3246756 send_request
> client_request(unknown.0:1911 open #0x10001698ac5 2018-09-28
> 16:03:39.083194 caller_uid=501, caller_gid=501{501,}) v4 to mds.0
> 2018-09-28 16:03:39.084088 7f1a82ffd700  5 client.3246756 
> set_cap_epoch_barrier epoch = 24093
> 2018-09-28 16:03:39.084097 7f1a82ffd700 10 client.3246756  mds.0 seq 
> now
> 1
> 2018-09-28 16:03:39.084108 7f1a82ffd700  5 client.3246756 
> handle_cap_grant on in 0x10001698ac5 mds.0 seq 7 caps now pAsLsXsFscr 
> was pAsLsXsFs
> 2018-09-28 16:03:39.084118 7f1a82ffd700 10 client.3246756 
> update_inode_file_time 0x10001698ac5.head(faked_ino=0 ref=3 ll_ref=1 
> cap_refs={} open={1=1} mode=100644 size=17/0 nlink=1 mtime=2018-09-28
> 14:45:50.323273 caps=pAsLsXsFs(0=pAsLsXsFs) objectset[0x10001698ac5 ts 

> 0/0 objects 0 dirty_or_tx 0] 

Re: [ceph-users] cephfs kernel client stability

2018-10-01 Thread Andras Pataki
These hangs happen during random I/O fio benchmark loads.  Something 
like 4 or 8 fio processes doing random reads/writes to distinct large 
files (to ensure there is no caching possible).  This is all on CentOS 
7.4 nodes.  Same (and even tougher) tests run without any problems with 
ceph-fuse.  We do have jobs that do heavy parallel I/O (MPI-IO, HDF5 via 
MPI-IO, etc.) - so running 8 parallel random I/O generating processes on 
nodes with 28 cores and plenty of RAM (256GB - 512GB) should not be 
excessive.
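
For what it's worth, the load is roughly of this shape (a sketch, not the
exact job file; all parameters here are illustrative):

  fio --name=randrw --directory=/mnt/cephfs/fio-test --numjobs=8 \
      --rw=randrw --bs=4k --size=50G --ioengine=libaio --direct=1 \
      --iodepth=16 --time_based --runtime=600 --group_reporting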


I am going to test the latest CentOS kernel next (the one you are 
referencing).  The RedHat/CentOS kernels are not "old kernel clients" - 
they contains various backports of hundreds of patches to all kinds of 
subsystems of Linux.  What is unclear there is exactly what ceph client 
RedHat is backporting to their kernels.  Any pointers there would be 
helpful.


Andras


On 10/1/18 2:26 PM, Marc Roos wrote:
  
How do you test this? I have had no issues under "normal load" with an
old kernel client and a stable os.

CentOS Linux release 7.5.1804 (Core)
Linux c04 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
x86_64 x86_64 x86_64 GNU/Linux





-Original Message-
From: Andras Pataki [mailto:apat...@flatironinstitute.org]
Sent: maandag 1 oktober 2018 20:10
To: ceph-users
Subject: [ceph-users] cephfs kernel client stability

We have so far been using ceph-fuse for mounting cephfs, but the small
file performance of ceph-fuse is often problematic.  We've been testing
the kernel client, and have seen some pretty bad crashes/hangs.

What is the policy on fixes to the kernel client?  Is only the latest
stable kernel updated (4.18.x nowadays), or are fixes backported to LTS
kernels also (like 4.14.x or 4.9.x for example)? I've seen various
threads that certain newer features require pretty new kernels - but I'm
wondering whether newer kernels are also required for better stability -
or - in general, where the kernel client stability stands nowadays.

Here is an example of kernel hang with 4.14.67.  On heavy loads the
machine isn't even pingable.

Sep 29 21:10:16 worker1004 kernel: INFO: rcu_sched self-detected stall
on CPU Sep 29 21:10:16 worker1004 kernel: #0111-...: (1 GPs behind)
idle=bee/141/0 softirq=21319/21319 fqs=7499 Sep 29 21:10:16
worker1004 kernel: #011 (t=15000 jiffies g=13989 c=13988
q=8334)
Sep 29 21:10:16 worker1004 kernel: NMI backtrace for cpu 1 Sep 29
21:10:16 worker1004 kernel: CPU: 1 PID: 19436 Comm: kworker/1:42
Tainted: P    W  O    4.14.67 #1
Sep 29 21:10:16 worker1004 kernel: Hardware name: Dell Inc. PowerEdge
C6320/082F9M, BIOS 2.6.0 10/27/2017 Sep 29 21:10:16 worker1004 kernel:
Workqueue: ceph-msgr ceph_con_workfn [libceph] Sep 29 21:10:16
worker1004 kernel: Call Trace:
Sep 29 21:10:16 worker1004 kernel:  Sep 29 21:10:16 worker1004
kernel: dump_stack+0x46/0x5f Sep 29 21:10:16 worker1004 kernel:
nmi_cpu_backtrace+0xba/0xc0 Sep 29 21:10:16 worker1004 kernel: ?
irq_force_complete_move+0xd0/0xd0 Sep 29 21:10:16 worker1004 kernel:
nmi_trigger_cpumask_backtrace+0x8a/0xc0
Sep 29 21:10:16 worker1004 kernel: rcu_dump_cpu_stacks+0x81/0xb1 Sep 29
21:10:16 worker1004 kernel: rcu_check_callbacks+0x642/0x790 Sep 29
21:10:16 worker1004 kernel: ? update_wall_time+0x26d/0x6e0 Sep 29
21:10:16 worker1004 kernel: update_process_times+0x23/0x50 Sep 29
21:10:16 worker1004 kernel: tick_sched_timer+0x2f/0x60 Sep 29 21:10:16
worker1004 kernel: __hrtimer_run_queues+0xa3/0xf0 Sep 29 21:10:16
worker1004 kernel: hrtimer_interrupt+0x94/0x170 Sep 29 21:10:16
worker1004 kernel: smp_apic_timer_interrupt+0x4c/0x90
Sep 29 21:10:16 worker1004 kernel: apic_timer_interrupt+0x84/0x90 Sep 29
21:10:16 worker1004 kernel:  Sep 29 21:10:16 worker1004 kernel:
RIP: 0010:crush_hash32_3+0x1e5/0x270 [libceph] Sep 29 21:10:16
worker1004 kernel: RSP: 0018:c9000fdff5d8 EFLAGS:
0a97 ORIG_RAX: ff10
Sep 29 21:10:16 worker1004 kernel: RAX: 06962033 RBX:
883f6e7173c0 RCX: dcdcc373
Sep 29 21:10:16 worker1004 kernel: RDX: bd5425ca RSI:
8a8b0b56 RDI: b1983b87
Sep 29 21:10:16 worker1004 kernel: RBP: 0023 R08:
bd5425ca R09: 137904e9
Sep 29 21:10:16 worker1004 kernel: R10:  R11:
0002 R12: b0f29f21
Sep 29 21:10:16 worker1004 kernel: R13: 000c R14:
f0ae R15: 0023
Sep 29 21:10:16 worker1004 kernel: crush_bucket_choose+0x2ad/0x340
[libceph] Sep 29 21:10:16 worker1004 kernel:
crush_choose_firstn+0x1b0/0x4c0 [libceph] Sep 29 21:10:16 worker1004
kernel: crush_choose_firstn+0x48d/0x4c0 [libceph] Sep 29 21:10:16
worker1004 kernel: crush_do_rule+0x28c/0x5a0 [libceph] Sep 29 21:10:16
worker1004 kernel: ceph_pg_to_up_acting_osds+0x459/0x850
[libceph]
Sep 29 21:10:16 worker1004 kernel: calc_target+0x213/0x520 [libceph] Sep
29 21:10:16 worker1004 kernel: ? ixgbe_xmit_frame_ring+0x362/0xe80
[ixgbe] Sep 29 21:10:16 worker1004 kernel: ? 

Re: [ceph-users] cephfs kernel client stability

2018-10-01 Thread Marc Roos
 
How do you test this? I have had no issues under "normal load" with an 
old kernel client and a stable os.  

CentOS Linux release 7.5.1804 (Core)
Linux c04 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux





-Original Message-
From: Andras Pataki [mailto:apat...@flatironinstitute.org] 
Sent: maandag 1 oktober 2018 20:10
To: ceph-users
Subject: [ceph-users] cephfs kernel client stability

We have so far been using ceph-fuse for mounting cephfs, but the small 
file performance of ceph-fuse is often problematic.  We've been testing 
the kernel client, and have seen some pretty bad crashes/hangs.

What is the policy on fixes to the kernel client?  Is only the latest 
stable kernel updated (4.18.x nowadays), or are fixes backported to LTS 
kernels also (like 4.14.x or 4.9.x for example)? I've seen various 
threads that certain newer features require pretty new kernels - but I'm 
wondering whether newer kernels are also required for better stability - 
or - in general, where the kernel client stability stands nowadays.

Here is an example of kernel hang with 4.14.67.  On heavy loads the 
machine isn't even pingable.

Sep 29 21:10:16 worker1004 kernel: INFO: rcu_sched self-detected stall 
on CPU Sep 29 21:10:16 worker1004 kernel: #0111-...: (1 GPs behind) 
idle=bee/141/0 softirq=21319/21319 fqs=7499 Sep 29 21:10:16 
worker1004 kernel: #011 (t=15000 jiffies g=13989 c=13988
q=8334)
Sep 29 21:10:16 worker1004 kernel: NMI backtrace for cpu 1 Sep 29 
21:10:16 worker1004 kernel: CPU: 1 PID: 19436 Comm: kworker/1:42
Tainted: P    W  O    4.14.67 #1
Sep 29 21:10:16 worker1004 kernel: Hardware name: Dell Inc. PowerEdge 
C6320/082F9M, BIOS 2.6.0 10/27/2017 Sep 29 21:10:16 worker1004 kernel: 
Workqueue: ceph-msgr ceph_con_workfn [libceph] Sep 29 21:10:16 
worker1004 kernel: Call Trace:
Sep 29 21:10:16 worker1004 kernel:  Sep 29 21:10:16 worker1004 
kernel: dump_stack+0x46/0x5f Sep 29 21:10:16 worker1004 kernel: 
nmi_cpu_backtrace+0xba/0xc0 Sep 29 21:10:16 worker1004 kernel: ? 
irq_force_complete_move+0xd0/0xd0 Sep 29 21:10:16 worker1004 kernel: 
nmi_trigger_cpumask_backtrace+0x8a/0xc0
Sep 29 21:10:16 worker1004 kernel: rcu_dump_cpu_stacks+0x81/0xb1 Sep 29 
21:10:16 worker1004 kernel: rcu_check_callbacks+0x642/0x790 Sep 29 
21:10:16 worker1004 kernel: ? update_wall_time+0x26d/0x6e0 Sep 29 
21:10:16 worker1004 kernel: update_process_times+0x23/0x50 Sep 29 
21:10:16 worker1004 kernel: tick_sched_timer+0x2f/0x60 Sep 29 21:10:16 
worker1004 kernel: __hrtimer_run_queues+0xa3/0xf0 Sep 29 21:10:16 
worker1004 kernel: hrtimer_interrupt+0x94/0x170 Sep 29 21:10:16 
worker1004 kernel: smp_apic_timer_interrupt+0x4c/0x90
Sep 29 21:10:16 worker1004 kernel: apic_timer_interrupt+0x84/0x90 Sep 29 
21:10:16 worker1004 kernel:  Sep 29 21:10:16 worker1004 kernel: 
RIP: 0010:crush_hash32_3+0x1e5/0x270 [libceph] Sep 29 21:10:16 
worker1004 kernel: RSP: 0018:c9000fdff5d8 EFLAGS: 
0a97 ORIG_RAX: ff10
Sep 29 21:10:16 worker1004 kernel: RAX: 06962033 RBX: 
883f6e7173c0 RCX: dcdcc373
Sep 29 21:10:16 worker1004 kernel: RDX: bd5425ca RSI: 
8a8b0b56 RDI: b1983b87
Sep 29 21:10:16 worker1004 kernel: RBP: 0023 R08: 
bd5425ca R09: 137904e9
Sep 29 21:10:16 worker1004 kernel: R10:  R11: 
0002 R12: b0f29f21
Sep 29 21:10:16 worker1004 kernel: R13: 000c R14: 
f0ae R15: 0023
Sep 29 21:10:16 worker1004 kernel: crush_bucket_choose+0x2ad/0x340 
[libceph] Sep 29 21:10:16 worker1004 kernel: 
crush_choose_firstn+0x1b0/0x4c0 [libceph] Sep 29 21:10:16 worker1004 
kernel: crush_choose_firstn+0x48d/0x4c0 [libceph] Sep 29 21:10:16 
worker1004 kernel: crush_do_rule+0x28c/0x5a0 [libceph] Sep 29 21:10:16 
worker1004 kernel: ceph_pg_to_up_acting_osds+0x459/0x850
[libceph]
Sep 29 21:10:16 worker1004 kernel: calc_target+0x213/0x520 [libceph] Sep 
29 21:10:16 worker1004 kernel: ? ixgbe_xmit_frame_ring+0x362/0xe80 
[ixgbe] Sep 29 21:10:16 worker1004 kernel: ? put_prev_entity+0x27/0x620 
Sep 29 21:10:16 worker1004 kernel: ? pick_next_task_fair+0x1c7/0x520 Sep 
29 21:10:16 worker1004 kernel: 
scan_requests.constprop.55+0x16f/0x280 [libceph] Sep 29 21:10:16 
worker1004 kernel: handle_one_map+0x175/0x200 [libceph] Sep 29 21:10:16 
worker1004 kernel: ceph_osdc_handle_map+0x390/0x850 [libceph] Sep 29 
21:10:16 worker1004 kernel: ? ceph_x_encrypt+0x46/0x70 [libceph] Sep 29 
21:10:16 worker1004 kernel: dispatch+0x2ef/0xba0 [libceph] Sep 29 
21:10:16 worker1004 kernel: ? read_partial_message+0x215/0x880 [libceph] 
Sep 29 21:10:16 worker1004 kernel: ? inet_recvmsg+0x45/0xb0 Sep 29 
21:10:16 worker1004 kernel: try_read+0x6f8/0x11b0 [libceph] Sep 29 
21:10:16 worker1004 kernel: ? sched_clock_cpu+0xc/0xa0 Sep 29 21:10:16 
worker1004 kernel: ? put_prev_entity+0x27/0x620 Sep 29 21:10:16 
worker1004 kernel: ? pick_next_task_fair+0x415/0x520 Sep 29 21:10:16 

Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-01 Thread Burkhard Linke

Hi,


we also experience hanging clients after MDS restarts; in our case we 
only use a single active MDS server, and the clients are actively 
blacklisted by the MDS server after restart. It usually happens if the 
clients are not responsive during MDS restart (e.g. being very busy).



You can check whether this is the case in your setup by inspecting the 
blacklist ('ceph osd blacklist ls'). It should print the connections 
which are currently blacklisted.



You can also remove entries ('ceph osd blacklist rm ...'), but be warned 
that the mechanism is there for a reason. Removing a blacklisted entry 
might result in file corruption if client and MDS server disagree about 
the current state. Use at own risk.



We were also trying a multi active setup after upgrading to luminous, 
but we were running into the same problem with the same error message. 
It was probably due to old kernel clients, so in the case of the kernel-based
cephfs client I would recommend upgrading to the latest available kernel.



As another approach you can check the current state of the cephfs 
client, either by using the daemon socket in case of ceph-fuse, or the 
debug information in /sys/kernel/debug/ceph/... for the kernel client.
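
Something along these lines (socket path, fsid and client id are
placeholders):

  # ceph-fuse client
  ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
  # kernel client
  cat /sys/kernel/debug/ceph/<fsid>.client<id>/mdsc
  cat /sys/kernel/debug/ceph/<fsid>.client<id>/osdc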


Regards,

Burkhard


On 01.10.2018 18:34, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster, two weeks ago we enabled 
multi mds and after a few hours


these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{}) 
currently failed to authpin local pins


2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds 
old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to respond 
to cache pressure; 1 MDSs report slow requests,


Due to this, we decided to go back to single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.


Then, we tried to enable multi mds again and the clients pointing to
mds.1 went back online; however, the ones pointing to mds.0 stopped working.


Today, we tried to go back to single mds, however this error was 
preventing ceph from disabling the second active mds (mds.1)


2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
X: (30108925), after 68213.084174 seconds


After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck in
the stopping state forever due to the above error), we waited for it to
become active again,


unmounted the problematic clients, waited for the cluster to be healthy
and tried to go back to single mds again.


Apparently this worked with some of the clients; we tried to enable
multi mds again to bring the faulty clients back again, however no luck
this time


and some of them are hanging and can't access to ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.


Does anyone know how can we deal with this, please?

Thanks

Jaime



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs kernel client stability

2018-10-01 Thread Andras Pataki
We have so far been using ceph-fuse for mounting cephfs, but the small 
file performance of ceph-fuse is often problematic.  We've been testing 
the kernel client, and have seen some pretty bad crashes/hangs.


What is the policy on fixes to the kernel client?  Is only the latest 
stable kernel updated (4.18.x nowadays), or are fixes backported to LTS 
kernels also (like 4.14.x or 4.9.x for example)? I've seen various 
threads that certain newer features require pretty new kernels - but I'm 
wondering whether newer kernels are also required for better stability - 
or - in general, where the kernel client stability stands nowadays.


Here is an example of kernel hang with 4.14.67.  On heavy loads the 
machine isn't even pingable.


Sep 29 21:10:16 worker1004 kernel: INFO: rcu_sched self-detected stall 
on CPU
Sep 29 21:10:16 worker1004 kernel: #0111-...: (1 GPs behind) 
idle=bee/141/0 softirq=21319/21319 fqs=7499
Sep 29 21:10:16 worker1004 kernel: #011 (t=15000 jiffies g=13989 c=13988 
q=8334)

Sep 29 21:10:16 worker1004 kernel: NMI backtrace for cpu 1
Sep 29 21:10:16 worker1004 kernel: CPU: 1 PID: 19436 Comm: kworker/1:42 
Tainted: P    W  O    4.14.67 #1
Sep 29 21:10:16 worker1004 kernel: Hardware name: Dell Inc. PowerEdge 
C6320/082F9M, BIOS 2.6.0 10/27/2017
Sep 29 21:10:16 worker1004 kernel: Workqueue: ceph-msgr ceph_con_workfn 
[libceph]

Sep 29 21:10:16 worker1004 kernel: Call Trace:
Sep 29 21:10:16 worker1004 kernel: 
Sep 29 21:10:16 worker1004 kernel: dump_stack+0x46/0x5f
Sep 29 21:10:16 worker1004 kernel: nmi_cpu_backtrace+0xba/0xc0
Sep 29 21:10:16 worker1004 kernel: ? irq_force_complete_move+0xd0/0xd0
Sep 29 21:10:16 worker1004 kernel: nmi_trigger_cpumask_backtrace+0x8a/0xc0
Sep 29 21:10:16 worker1004 kernel: rcu_dump_cpu_stacks+0x81/0xb1
Sep 29 21:10:16 worker1004 kernel: rcu_check_callbacks+0x642/0x790
Sep 29 21:10:16 worker1004 kernel: ? update_wall_time+0x26d/0x6e0
Sep 29 21:10:16 worker1004 kernel: update_process_times+0x23/0x50
Sep 29 21:10:16 worker1004 kernel: tick_sched_timer+0x2f/0x60
Sep 29 21:10:16 worker1004 kernel: __hrtimer_run_queues+0xa3/0xf0
Sep 29 21:10:16 worker1004 kernel: hrtimer_interrupt+0x94/0x170
Sep 29 21:10:16 worker1004 kernel: smp_apic_timer_interrupt+0x4c/0x90
Sep 29 21:10:16 worker1004 kernel: apic_timer_interrupt+0x84/0x90
Sep 29 21:10:16 worker1004 kernel: 
Sep 29 21:10:16 worker1004 kernel: RIP: 0010:crush_hash32_3+0x1e5/0x270 
[libceph]
Sep 29 21:10:16 worker1004 kernel: RSP: 0018:c9000fdff5d8 EFLAGS: 
0a97 ORIG_RAX: ff10
Sep 29 21:10:16 worker1004 kernel: RAX: 06962033 RBX: 
883f6e7173c0 RCX: dcdcc373
Sep 29 21:10:16 worker1004 kernel: RDX: bd5425ca RSI: 
8a8b0b56 RDI: b1983b87
Sep 29 21:10:16 worker1004 kernel: RBP: 0023 R08: 
bd5425ca R09: 137904e9
Sep 29 21:10:16 worker1004 kernel: R10:  R11: 
0002 R12: b0f29f21
Sep 29 21:10:16 worker1004 kernel: R13: 000c R14: 
f0ae R15: 0023

Sep 29 21:10:16 worker1004 kernel: crush_bucket_choose+0x2ad/0x340 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_choose_firstn+0x1b0/0x4c0 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_choose_firstn+0x48d/0x4c0 [libceph]
Sep 29 21:10:16 worker1004 kernel: crush_do_rule+0x28c/0x5a0 [libceph]
Sep 29 21:10:16 worker1004 kernel: ceph_pg_to_up_acting_osds+0x459/0x850 
[libceph]

Sep 29 21:10:16 worker1004 kernel: calc_target+0x213/0x520 [libceph]
Sep 29 21:10:16 worker1004 kernel: ? ixgbe_xmit_frame_ring+0x362/0xe80 
[ixgbe]

Sep 29 21:10:16 worker1004 kernel: ? put_prev_entity+0x27/0x620
Sep 29 21:10:16 worker1004 kernel: ? pick_next_task_fair+0x1c7/0x520
Sep 29 21:10:16 worker1004 kernel: 
scan_requests.constprop.55+0x16f/0x280 [libceph]

Sep 29 21:10:16 worker1004 kernel: handle_one_map+0x175/0x200 [libceph]
Sep 29 21:10:16 worker1004 kernel: ceph_osdc_handle_map+0x390/0x850 
[libceph]

Sep 29 21:10:16 worker1004 kernel: ? ceph_x_encrypt+0x46/0x70 [libceph]
Sep 29 21:10:16 worker1004 kernel: dispatch+0x2ef/0xba0 [libceph]
Sep 29 21:10:16 worker1004 kernel: ? read_partial_message+0x215/0x880 
[libceph]

Sep 29 21:10:16 worker1004 kernel: ? inet_recvmsg+0x45/0xb0
Sep 29 21:10:16 worker1004 kernel: try_read+0x6f8/0x11b0 [libceph]
Sep 29 21:10:16 worker1004 kernel: ? sched_clock_cpu+0xc/0xa0
Sep 29 21:10:16 worker1004 kernel: ? put_prev_entity+0x27/0x620
Sep 29 21:10:16 worker1004 kernel: ? pick_next_task_fair+0x415/0x520
Sep 29 21:10:16 worker1004 kernel: ceph_con_workfn+0x9d/0x5a0 [libceph]
Sep 29 21:10:16 worker1004 kernel: process_one_work+0x127/0x290
Sep 29 21:10:16 worker1004 kernel: worker_thread+0x3f/0x3b0
Sep 29 21:10:16 worker1004 kernel: kthread+0xf2/0x130
Sep 29 21:10:16 worker1004 kernel: ? process_one_work+0x290/0x290
Sep 29 21:10:16 worker1004 kernel: ? __kthread_parkme+0x90/0x90
Sep 29 21:10:16 worker1004 kernel: ret_from_fork+0x1f/0x30

Andras


Re: [ceph-users] CRUSH puzzle: step weighted-take

2018-10-01 Thread Gregory Farnum
On Fri, Sep 28, 2018 at 12:03 AM Dan van der Ster  wrote:
>
> On Thu, Sep 27, 2018 at 9:57 PM Maged Mokhtar  wrote:
> >
> >
> >
> > On 27/09/18 17:18, Dan van der Ster wrote:
> > > Dear Ceph friends,
> > >
> > > I have a CRUSH data migration puzzle and wondered if someone could
> > > think of a clever solution.
> > >
> > > Consider an osd tree like this:
> > >
> > >-2   4428.02979 room 0513-R-0050
> > >   -72911.81897 rack RA01
> > >-4917.27899 rack RA05
> > >-6917.25500 rack RA09
> > >-9786.23901 rack RA13
> > >   -14895.43903 rack RA17
> > >   -65   1161.16003 room 0513-R-0060
> > >   -71578.76001 ipservice S513-A-IP38
> > >   -70287.56000 rack BA09
> > >   -80291.20001 rack BA10
> > >   -76582.40002 ipservice S513-A-IP63
> > >   -75291.20001 rack BA11
> > >   -78291.20001 rack BA12
> > >
> > > In the beginning, for reasons that are not important, we created two 
> > > pools:
> > >* poolA chooses room=0513-R-0050 then replicates 3x across the racks.
> > >* poolB chooses room=0513-R-0060, replicates 2x across the
> > > ipservices, then puts a 3rd replica in room 0513-R-0050.
> > >
> > > For clarity, here is the crush rule for poolB:
> > >  type replicated
> > >  min_size 1
> > >  max_size 10
> > >  step take 0513-R-0060
> > >  step chooseleaf firstn 2 type ipservice
> > >  step emit
> > >  step take 0513-R-0050
> > >  step chooseleaf firstn -2 type rack
> > >  step emit
> > >
> > > Now to the puzzle.
> > > For reasons that are not important, we now want to change the rule for
> > > poolB to put all three 3 replicas in room 0513-R-0060.
> > > And we need to do this in a way which is totally non-disruptive
> > > (latency-wise) to the users of either pools. (These are both *very*
> > > active RBD pools).
> > >
> > > I see two obvious ways to proceed:
> > >(1) simply change the rule for poolB to put a third replica on any
> > > osd in room 0513-R-0060. I'm afraid though that this would involve way
> > > too many concurrent backfills, cluster-wide, even with
> > > osd_max_backfills=1.
> > >(2) change poolB size to 2, then change the crush rule to that from
> > > (1), then reset poolB size to 3. This would risk data availability
> > > during the time that the pool is size=2, and also risks that every osd
> > > in room 0513-R-0050 would be too busy deleting for some indeterminate
> > > time period (10s of minutes, I expect).
> > >
> > > So I would probably exclude those two approaches.
> > >
> > > Conceptually what I'd like to be able to do is a gradual migration,
> > > which if I may invent some syntax on the fly...
> > >
> > > Instead of
> > > step take 0513-R-0050
> > > do
> > > step weighted-take 99 0513-R-0050 1 0513-R-0060
> > >
> > > That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> > > of the time take room 0513-R-0060.
> > > With a mechanism like that, we could gradually adjust those "step
> > > weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
> > >
> > > I have a feeling that something equivalent to that is already possible
> > > with weight-sets or some other clever crush trickery.
> > > Any ideas?
> > >
> > > Best Regards,
> > >
> > > Dan
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > would it be possible in your case to create a parent datacenter bucket
> > to hold both rooms and assign their relative weights there, then for the
> > third replica do a step take to this parent bucket? It's not elegant but
> > may do the trick.
>
> Hey, that might work! both rooms are already in the default root:
>
>   -1   5589.18994 root default
>   -2   4428.02979 room 0513-R-0050
>  -65   1161.16003 room 0513-R-0060
>  -71578.76001 ipservice S513-A-IP38
>  -76582.40002 ipservice S513-A-IP63
>
> so I'll play with a test pool and weighting down room 0513-R-0060 to
> see if this can work.

I don't think this will work — it will probably change the seed that
is used and mean that the rule tries to move *everything*, not just
the third PG replicas. But perhaps I'm mistaken about the details of
this mechanic...

The crush weighted-take is interesting, but I'm not sure I would want
to do something probabilistic like that in this situation. What we've
discussed before — but *not* implemented or even scheduled, sadly for
you here — is having multiple CRUSH "epochs" active at the same time,
and letting the OSDMap specify a pg as the crossover point from one
CRUSH epoch to the next. (Among other things, this would let us
finally limit the number of backfills in progress at the 

Re: [ceph-users] Is object name used by CRUSH algorithm?

2018-10-01 Thread Gregory Farnum
On Thu, Sep 27, 2018 at 12:08 PM Jin Mao  wrote:

> I am running luminous and the objects were copied from Isilon with a long
> and similar prefix in path like /dir1/dir2/dir3//mm/dd. The objects are
> copied to various buckets like bucket_MMDD/dir1/dir2/dir3//mm/dd.
> This setup minimize some internal code change when moving from NFS to
> object store.
>
> I heard that CRUSH may NOT evenly balance OSDs if there are many common
> leading characters in the object name? However, I couldn't find any
> evidence to support this.
>
> Does anyone know further details about this?
>

CRUSH does use the object name as an input to generate a hashed value, but
it's not a stable hash and having common prefixes should not be an issue.

Also since you're going through RGW there's quite a bit more happening to
the object names, so I wouldn't worry about it.


>
> Thank you.
>
> Jin.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-10-01 Thread David Turner
I tried modifying filestore_rocksdb_options by removing
compression=kNoCompression
as well as setting it to compression=kSnappyCompression.  Leaving it with
kNoCompression or removing it results in the same segfault in the previous
log.  Setting it to kSnappyCompression resulted in [1] this being logged
and the OSD just failing to start instead of segfaulting.  Is there
anything else you would suggest trying before I purge this OSD from the
cluster?  I'm afraid it might be something with the CentOS binaries.

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option
compression = kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument:
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1
filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb
:
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to
mount object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1  ** ERROR: osd init
failed: (1) Operation not permitted
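
For what it's worth, one way to check which compression codecs the packaged
RocksDB was built with is to grep the OSD boot log for the lines it prints at
startup; the log path below is just an example:

  grep -A6 'Compression algorithms supported' /var/log/ceph/ceph-osd.1.log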

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> I looked at one of my test clusters running Jewel on Ubuntu 16.04, and
> interestingly I found this(below) in one of the OSD logs, which is
> different from your OSD boot log, where none of the compression algorithms
> seem to be supported. This hints more at how rocksdb was built on CentOS
> for Ceph.
>
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms
> supported:
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
> 2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0
>
> On 9/27/18, 2:56 PM, "Pavan Rallabhandi" 
> wrote:
>
> I see Filestore symbols on the stack, so the bluestore config doesn’t
> affect this. And the top frame of the stack hints at a RocksDB issue, and there
> are a whole lot of these too:
>
> “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
> Cannot find Properties block from file.”
>
> It really seems to be something with RocksDB on CentOS. I still think
> you can try removing “compression=kNoCompression” from the
> filestore_rocksdb_options, and/or check if rocksdb is expecting snappy to be
> enabled.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Thursday, September 27, 2018 at 1:18 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I got pulled away from this for a while.  The error in the log is
> "abort: Corruption: Snappy not supported or corrupted Snappy compressed
> block contents" and the OSD has 2 settings set to snappy by default,
> async_compressor_type and bluestore_compression_algorithm.  Do either of
> these settings affect the omap store?
>
> On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> Looks like you are running on CentOS, fwiw. We’ve successfully ran the
> conversion commands on Jewel, Ubuntu 16.04.
>
> I have a feeling it’s expecting the compression to be enabled; can you try
> removing “compression=kNoCompression” from the filestore_rocksdb_options?
> And/or you might want to check if rocksdb is expecting snappy to be enabled.
>
> From: David Turner 
> Date: Tuesday, September 18, 2018 at 6:01 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> Here's the [1] full log from the time the OSD was started to the end
> of the crash dump.  These logs are so hard to parse.  Is there anything
> useful in them?
>
> I did confirm that all perms were set correctly and that the
> superblock was changed to rocksdb before the first time I attempted to
> start the OSD with it's new DB.  This is on a fully Luminous cluster with
> [2] the defaults you mentioned.
>
> [1]
> https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed
> [2] "filestore_omap_backend": "rocksdb",
> "filestore_rocksdb_options":
> "max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression",
>
> On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi  

[ceph-users] cephfs clients hanging multi mds to single mds

2018-10-01 Thread Jaime Ibar

Hi all,

we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled 
multi mds, and after a few hours


these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{}) 
currently failed to authpin local pins


2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds 
old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to respond 
to cache pressure; 1 MDSs report slow requests,


Due to this, we decided to go back to single mds (as it worked before); 
however, the clients pointing to mds.1 started hanging, while the 
ones pointing to mds.0 worked fine.


Then, we tried to enable multi mds again and the clients pointing to mds.1 
went back online; however, the ones pointing to mds.0 stopped working.


Today, we tried to go back to single mds; however, this error was 
preventing ceph from disabling the second active mds (mds.1):


2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
X: (30108925), after 68213.084174 seconds


After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck in 
the stopping state forever due to the above error), waited for it to 
become active again,


unmounted the problematic clients, waited for the cluster to be healthy and 
tried to go back to single mds again.


Apparently this worked for some of the clients. We tried to enable 
multi mds again to bring the faulty clients back, however no luck this 
time,


and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without 
rebooting, as they're in production and rebooting is not an option.
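
For reference, the switching between single and multi mds described above was done
with commands along these lines on Luminous; the fs name and rank here are examples:

  ceph fs set cephfs allow_multimds true   # needed once on Luminous before raising max_mds
  ceph fs set cephfs max_mds 2             # second active mds
  ceph fs set cephfs max_mds 1             # back towards a single active mds
  ceph mds deactivate cephfs:1             # ask rank 1 to stop (Luminous syntax)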


Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
So you should call repair, which rebalances (i.e. allocates additional 
space for) BlueFS, hence allowing the OSD to start.
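
With the rebalance-on-repair change from the PR below applied, that would be
something like the following; the OSD path is just an example:

  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-1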


Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:
Not exactly. The rebalancing from this kv_sync_thread still might be 
deferred due to the nature of this thread (I'm not 100% sure though).


Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak    2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc    2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
    throttle_bytes.put(costs);
      PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-    bluefs_last_balance = after_flush;
-    int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-    assert(r >= 0);
-    if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-    bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+    for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
  }
+    bufferlist bl;
+    encode(bluefs_extents, bl);
+    dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+ << bluefs_extents << std::dec << dendl;
+    synct->set(PREFIX_SUPER, "bluefs_extents", bl);
    }
      // cleanup sync deferred keys


On 1.10.2018, at 18:39, Igor Fedotov  wrote:

So you have just a single main device per OSD

Then bluestore-tool wouldn't help, it's unable to expand BlueFS 
partition at main device, standalone devices are supported only.


Given that you're able to rebuild the code I can suggest to make a 
patch that triggers BlueFS rebalance (see code snippet below) on 
repairing.

 PExtentVector bluefs_gift_extents;
 int r = _balance_bluefs_freespace(&bluefs_gift_extents);
 ceph_assert(r >= 0);
 if (r > 0) {
   for (auto& p : bluefs_gift_extents) {
 bluefs_extents.insert(p.offset, p.length);
   }
   bufferlist bl;
   encode(bluefs_extents, bl);
   dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
    << bluefs_extents << std::dec << dendl;
   synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }

If it waits I can probably make a corresponding PR tomorrow.

Thanks,
Igor
On 10/1/2018 6:16 PM, Sergey Malinin wrote:
I have rebuilt the tool, but none of my OSDs no matter dead or 
alive have any symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more 
space for db compaction. After executing bluefs-bdev-expand OSD 
fails to start, however 'fsck' and 'repair' commands finished 
successfully.


2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 
bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation 
metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 
bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 285 GiB in 
2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace 
bluefs extra 0x[6d6f00~50c80]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 
0x0x55d053fb9180 shutdown

2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: 
canceling all background work
2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete

2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 
0x0x55d053883800 shutdown
2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to 
mount object store
2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: 
(5) Input/output error




On 1.10.2018, at 18:09, Igor Fedotov  wrote:

Well, actually you can avoid bluestore-tool rebuild.

You'll need to edit the first chunk of blocks.db where labels are 
stored. (Please make a backup first!!!)


Size label is stored at offset 0x52 and is 8 bytes long - 
little-endian 64bit integer encoding. (Please verify that old 
value at this offset exactly corresponds to you original volume 
size and/or 'size' label reported by ceph-bluestore-tool).


So you have to put new DB volume size 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
Not exactly. The rebalancing from this kv_sync_thread still might be 
deferred due to the nature of this thread (I'm not 100% sure though).


Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
throttle_bytes.put(costs);
  
PExtentVector bluefs_gift_extents;

-  if (bluefs &&
- after_flush - bluefs_last_balance >
- cct->_conf->bluestore_bluefs_balance_interval) {
-   bluefs_last_balance = after_flush;
-   int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-   assert(r >= 0);
-   if (r > 0) {
- for (auto& p : bluefs_gift_extents) {
-   bluefs_extents.insert(p.offset, p.length);
- }
- bufferlist bl;
- encode(bluefs_extents, bl);
- dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-  << bluefs_extents << std::dec << dendl;
- synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+   for (auto& p : bluefs_gift_extents) {
+ bluefs_extents.insert(p.offset, p.length);
}
+   bufferlist bl;
+   encode(bluefs_extents, bl);
+   dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+<< bluefs_extents << std::dec << dendl;
+   synct->set(PREFIX_SUPER, "bluefs_extents", bl);
}
  
// cleanup sync deferred keys



On 1.10.2018, at 18:39, Igor Fedotov  wrote:

So you have just a single main device per OSD

Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition at 
main device, standalone devices are supported only.

Given that you're able to rebuild the code I can suggest to make a patch that 
triggers BlueFS rebalance (see code snippet below) on repairing.
 PExtentVector bluefs_gift_extents;
 int r = _balance_bluefs_freespace(&bluefs_gift_extents);
 ceph_assert(r >= 0);
 if (r > 0) {
   for (auto& p : bluefs_gift_extents) {
 bluefs_extents.insert(p.offset, p.length);
   }
   bufferlist bl;
   encode(bluefs_extents, bl);
   dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
<< bluefs_extents << std::dec << dendl;
   synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }

If it waits I can probably make a corresponding PR tomorrow.

Thanks,
Igor
On 10/1/2018 6:16 PM, Sergey Malinin wrote:

I have rebuilt the tool, but none of my OSDs no matter dead or alive have any 
symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db 
compaction. After executing bluefs-bdev-expand OSD fails to start, however 
'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 
_reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all 
background work
2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: (5) 
Input/output error



On 1.10.2018, at 18:09, Igor Fedotov  wrote:

Well, actually you can avoid bluestore-tool rebuild.

You'll need to edit the first chunk of blocks.db where labels are stored. 
(Please make a backup first!!!)

Size label is stored at offset 0x52 and is 8 bytes long - little-endian 64bit 
integer encoding. (Please verify that old value at this offset exactly 
corresponds to you original volume size and/or 'size' label reported by 
ceph-bluestore-tool).

So you have to put new DB volume size there. Or you can send the first 4K chunk 
(e.g. extracted with dd) along with new DB volume size (in bytes) to me and 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
   throttle_bytes.put(costs);
 
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
- after_flush - bluefs_last_balance >
- cct->_conf->bluestore_bluefs_balance_interval) {
-   bluefs_last_balance = after_flush;
-   int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-   assert(r >= 0);
-   if (r > 0) {
- for (auto& p : bluefs_gift_extents) {
-   bluefs_extents.insert(p.offset, p.length);
- }
- bufferlist bl;
- encode(bluefs_extents, bl);
- dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-  << bluefs_extents << std::dec << dendl;
- synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+   for (auto& p : bluefs_gift_extents) {
+ bluefs_extents.insert(p.offset, p.length);
}
+   bufferlist bl;
+   encode(bluefs_extents, bl);
+   dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+<< bluefs_extents << std::dec << dendl;
+   synct->set(PREFIX_SUPER, "bluefs_extents", bl);
   }
 
   // cleanup sync deferred keys

> On 1.10.2018, at 18:39, Igor Fedotov  wrote:
> 
> So you have just a single main device per OSD
> 
> Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition at 
> main device, standalone devices are supported only.
> 
> Given that you're able to rebuild the code I can suggest to make a patch that 
> triggers BlueFS rebalance (see code snippet below) on repairing.
> PExtentVector bluefs_gift_extents;
> int r = _balance_bluefs_freespace(&bluefs_gift_extents);
> ceph_assert(r >= 0);
> if (r > 0) {
>   for (auto& p : bluefs_gift_extents) {
> bluefs_extents.insert(p.offset, p.length);
>   }
>   bufferlist bl;
>   encode(bluefs_extents, bl);
>   dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
><< bluefs_extents << std::dec << dendl;
>   synct->set(PREFIX_SUPER, "bluefs_extents", bl);
> }
> 
> If it waits I can probably make a corresponding PR tomorrow.
> 
> Thanks,
> Igor
> On 10/1/2018 6:16 PM, Sergey Malinin wrote:
>> I have rebuilt the tool, but none of my OSDs no matter dead or alive have 
>> any symlinks other than 'block' pointing to LVM.
>> I adjusted main device size but it looks like it needs even more space for 
>> db compaction. After executing bluefs-bdev-expand OSD fails to start, 
>> however 'fsck' and 'repair' commands finished successfully.
>> 
>> 2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
>> 2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
>> _open_alloc opening allocation metadata
>> 2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
>> _open_alloc loaded 285 GiB in 2249899 extents
>> 2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 
>> _reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
>> 2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
>> 2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
>> 2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
>> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all 
>> background work
>> 2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
>> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
>> 2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
>> 2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
>> 2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
>> /var/lib/ceph/osd/ceph-1/block) close
>> 2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
>> /var/lib/ceph/osd/ceph-1/block) close
>> 2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount 
>> object store
>> 2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: (5) 
>> Input/output error
>> 
>> 
>>> On 1.10.2018, at 18:09, Igor Fedotov  wrote:
>>> 
>>> Well, actually you can avoid bluestore-tool rebuild.
>>> 
>>> You'll need to edit the first chunk of blocks.db where labels are stored. 
>>> (Please make a backup first!!!)
>>> 
>>> Size label is stored at offset 0x52 and is 8 bytes long - little-endian 
>>> 64bit integer encoding. (Please verify that old value at this offset 
>>> exactly corresponds to you original volume size and/or 'size' label 
>>> reported by ceph-bluestore-tool).
>>> 
>>> So you have to put new DB volume size there. Or you can send the first 4K 
>>> chunk (e.g. extracted with dd) along with new DB volume size (in bytes) to 
>>> me and I'll do that for you.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/1/2018 5:32 PM, Igor Fedotov wrote:
 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov

So you have just a single main device per OSD.

Then ceph-bluestore-tool wouldn't help; it's unable to expand the BlueFS 
partition on the main device, only standalone devices are supported.


Given that you're able to rebuild the code, I can suggest making a patch 
that triggers a BlueFS rebalance (see the code snippet below) on repair.

    PExtentVector bluefs_gift_extents;
    int r = _balance_bluefs_freespace(&bluefs_gift_extents);
    ceph_assert(r >= 0);
    if (r > 0) {
      for (auto& p : bluefs_gift_extents) {
        bluefs_extents.insert(p.offset, p.length);
      }
      bufferlist bl;
      encode(bluefs_extents, bl);
      dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
           << bluefs_extents << std::dec << dendl;
      synct->set(PREFIX_SUPER, "bluefs_extents", bl);
    }

If it can wait, I can probably make a corresponding PR tomorrow.

Thanks,
Igor
On 10/1/2018 6:16 PM, Sergey Malinin wrote:

I have rebuilt the tool, but none of my OSDs no matter dead or alive have any 
symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db 
compaction. After executing bluefs-bdev-expand OSD fails to start, however 
'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 
_reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all 
background work
2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: (5) 
Input/output error



On 1.10.2018, at 18:09, Igor Fedotov  wrote:

Well, actually you can avoid bluestore-tool rebuild.

You'll need to edit the first chunk of blocks.db where labels are stored. 
(Please make a backup first!!!)

Size label is stored at offset 0x52 and is 8 bytes long - little-endian 64bit 
integer encoding. (Please verify that old value at this offset exactly 
corresponds to you original volume size and/or 'size' label reported by 
ceph-bluestore-tool).

So you have to put new DB volume size there. Or you can send the first 4K chunk 
(e.g. extracted with dd) along with new DB volume size (in bytes) to me and 
I'll do that for you.


Thanks,

Igor


On 10/1/2018 5:32 PM, Igor Fedotov wrote:


On 10/1/2018 5:03 PM, Sergey Malinin wrote:

Before I received your response, I had already added 20GB to the OSD (by epanding LV followed 
by bluefs-bdev-expand) and ran "ceph-kvstore-tool bluestore-kv  compact", 
however it still needs more space.
Is that because I didn't update DB size with set-label-key?

In mimic you need to run both "bluefs-bdev-expand" and "set-label-key" command 
to commit bluefs volume expansion.
Unfortunately the last command doesn't handle "size" label properly. That's why 
you might need to backport and rebuild with the mentioned commits.


What exactly is the label-key that needs to be updated, as I couldn't find 
which one is related to DB:

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
{
  "/var/lib/ceph/osd/ceph-1/block": {
  "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
  "size": 471305551872,
  "btime": "2018-07-31 03:06:43.751243",
  "description": "main",
  "bluefs": "1",
  "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
  "kv_backend": "rocksdb",
  "magic": "ceph osd volume v026",
  "mkfs_done": "yes",
  "osd_key": "XXX",
  "ready": "ready",
  "whoami": "1"
  }
}

'size' label but your output is for block(aka slow) device.

It should return labels for db/wal devices as well (block.db and block.wal 
symlinks respectively). It works for me in master, can't verify with mimic at 
the moment though..
Here is output for master:

# bin/ceph-bluestore-tool show-label --path dev/osd0
inferring bluefs devices from bluestore path
{
 "dev/osd0/block": {
 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
I have rebuilt the tool, but none of my OSDs, no matter whether dead or alive, have any 
symlinks other than 'block', which points to LVM.
I adjusted the main device size, but it looks like it needs even more space for db 
compaction. After executing bluefs-bdev-expand the OSD fails to start; however, the 
'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 
_reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all 
background work
2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: (5) 
Input/output error


> On 1.10.2018, at 18:09, Igor Fedotov  wrote:
> 
> Well, actually you can avoid bluestore-tool rebuild.
> 
> You'll need to edit the first chunk of blocks.db where labels are stored. 
> (Please make a backup first!!!)
> 
> Size label is stored at offset 0x52 and is 8 bytes long - little-endian 64bit 
> integer encoding. (Please verify that old value at this offset exactly 
> corresponds to you original volume size and/or 'size' label reported by 
> ceph-bluestore-tool).
> 
> So you have to put new DB volume size there. Or you can send the first 4K 
> chunk (e.g. extracted with dd) along with new DB volume size (in bytes) to me 
> and I'll do that for you.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 5:32 PM, Igor Fedotov wrote:
>> 
>> 
>> On 10/1/2018 5:03 PM, Sergey Malinin wrote:
>>> Before I received your response, I had already added 20GB to the OSD (by 
>>> expanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool 
>>> bluestore-kv  compact", however it still needs more space.
>>> Is that because I didn't update DB size with set-label-key?
>> In mimic you need to run both "bluefs-bdev-expand" and "set-label-key" 
>> command to commit bluefs volume expansion.
>> Unfortunately the last command doesn't handle "size" label properly. That's 
>> why you might need to backport and rebuild with the mentioned commits.
>> 
>>> What exactly is the label-key that needs to be updated, as I couldn't find 
>>> which one is related to DB:
>>> 
>>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
>>> inferring bluefs devices from bluestore path
>>> {
>>>  "/var/lib/ceph/osd/ceph-1/block": {
>>>  "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
>>>  "size": 471305551872,
>>>  "btime": "2018-07-31 03:06:43.751243",
>>>  "description": "main",
>>>  "bluefs": "1",
>>>  "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
>>>  "kv_backend": "rocksdb",
>>>  "magic": "ceph osd volume v026",
>>>  "mkfs_done": "yes",
>>>  "osd_key": "XXX",
>>>  "ready": "ready",
>>>  "whoami": "1"
>>>  }
>>> }
>> 'size' label but your output is for block(aka slow) device.
>> 
>> It should return labels for db/wal devices as well (block.db and block.wal 
>> symlinks respectively). It works for me in master, can't verify with mimic 
>> at the moment though..
>> Here is output for master:
>> 
>> # bin/ceph-bluestore-tool show-label --path dev/osd0
>> inferring bluefs devices from bluestore path
>> {
>> "dev/osd0/block": {
>> "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
>> "size": 21474836480,
>> "btime": "2018-09-10 15:55:09.044039",
>> "description": "main",
>> "bluefs": "1",
>> "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c",
>> "kv_backend": "rocksdb",
>> "magic": "ceph osd volume v026",
>> "mkfs_done": "yes",
>> "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==",
>> "ready": "ready",
>> "whoami": "0"
>> },
>> "dev/osd0/block.wal": {
>> "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
>> "size": 1048576000,
>> "btime": "2018-09-10 15:55:09.044985",
>> "description": "bluefs wal"
>> },
>> 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov

Well, actually you can avoid bluestore-tool rebuild.

You'll need to edit the first chunk of block.db, where the labels are 
stored. (Please make a backup first!!!)


The size label is stored at offset 0x52 and is 8 bytes long - little-endian 
64-bit integer encoding. (Please verify that the old value at this offset 
exactly corresponds to your original volume size and/or the 'size' label 
reported by ceph-bluestore-tool.)


So you have to put the new DB volume size there. Or you can send the first 
4K chunk (e.g. extracted with dd) along with the new DB volume size (in 
bytes) to me and I'll do that for you.
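
A rough sketch of that edit with standard tools; the device path and size here are
made-up examples, and it assumes xxd is available. Double-check the offset against
the show-label output and keep the backup:

  # back up the first 4K of the DB volume first
  dd if=/dev/ceph-vg/osd-1-db of=/root/osd-1-db-first4k.bak bs=4096 count=1
  # write the new size (example: 69793218560 bytes) as a little-endian uint64 at offset 0x52
  printf '%016x' 69793218560 | fold -w2 | tac | tr -d '\n' | xxd -r -p | \
    dd of=/dev/ceph-vg/osd-1-db bs=1 seek=$((0x52)) count=8 conv=notrunc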



Thanks,

Igor


On 10/1/2018 5:32 PM, Igor Fedotov wrote:



On 10/1/2018 5:03 PM, Sergey Malinin wrote:
Before I received your response, I had already added 20GB to the OSD 
(by epanding LV followed by bluefs-bdev-expand) and ran 
"ceph-kvstore-tool bluestore-kv  compact", however it still 
needs more space.

Is that because I didn't update DB size with set-label-key?
In mimic you need to run both "bluefs-bdev-expand" and "set-label-key" 
command to commit bluefs volume expansion.
Unfortunately the last command doesn't handle "size" label properly. 
That's why you might need to backport and rebuild with the mentioned 
commits.


What exactly is the label-key that needs to be updated, as I couldn't 
find which one is related to DB:


# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
{
 "/var/lib/ceph/osd/ceph-1/block": {
 "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
 "size": 471305551872,
 "btime": "2018-07-31 03:06:43.751243",
 "description": "main",
 "bluefs": "1",
 "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
 "kv_backend": "rocksdb",
 "magic": "ceph osd volume v026",
 "mkfs_done": "yes",
 "osd_key": "XXX",
 "ready": "ready",
 "whoami": "1"
 }
}

'size' label but your output is for block(aka slow) device.

It should return labels for db/wal devices as well (block.db and 
block.wal symlinks respectively). It works for me in master, can't 
verify with mimic at the moment though..

Here is output for master:

# bin/ceph-bluestore-tool show-label --path dev/osd0
inferring bluefs devices from bluestore path
{
    "dev/osd0/block": {
    "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
    "size": 21474836480,
    "btime": "2018-09-10 15:55:09.044039",
    "description": "main",
    "bluefs": "1",
    "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c",
    "kv_backend": "rocksdb",
    "magic": "ceph osd volume v026",
    "mkfs_done": "yes",
    "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==",
    "ready": "ready",
    "whoami": "0"
    },
    "dev/osd0/block.wal": {
    "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
    "size": 1048576000,
    "btime": "2018-09-10 15:55:09.044985",
    "description": "bluefs wal"
    },
    "dev/osd0/block.db": {
    "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
    "size": 1048576000,
    "btime": "2018-09-10 15:55:09.044469",
    "description": "bluefs db"
    }
}


You can try --dev option instead of --path, e.g.
ceph-bluestore-tool show-label --dev 





On 1.10.2018, at 16:48, Igor Fedotov  wrote:

This looks like a sort of deadlock when BlueFS needs some additional 
space to replay the log left after the crash. Which happens during 
BlueFS open.


But such a space (at slow device as DB is full) is gifted in 
background during bluefs rebalance procedure which will occur after 
the open.


Hence OSDs stuck in permanent crashing..

The only way to recover I can suggest for now is to expand DB 
volumes. You can do that with lvm tools if you have any spare space 
for that.


Once resized you'll need ceph-bluestore-tool to indicate volume 
expansion to BlueFS (bluefs-bdev-expand command ) and finally update 
DB volume size label with  set-label-key command.


The latter is a bit tricky for mimic - you might need to backport 
https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b


and rebuild ceph-bluestore-tool. Alternatively you can backport 
https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b


then bluefs expansion and label updates will occur in a single step.

I'll do these backports in upstream but this will take some time to 
pass all the procedures and get into official mimic release.


Will fire a ticket to fix the original issue as well.


Thanks,

Igor


On 10/1/2018 3:28 PM, Sergey Malinin wrote:
These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm 
prepare --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db 
devices.
OSDs were created with bluestore_min_alloc_size_ssd=4096, another 
modified setting is bluestore_cache_kv_max=1073741824


DB/block usage collected by prometheus module for 3 failed and 1 
survived 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov



On 10/1/2018 5:03 PM, Sergey Malinin wrote:

Before I received your response, I had already added 20GB to the OSD (by expanding LV followed 
by bluefs-bdev-expand) and ran "ceph-kvstore-tool bluestore-kv  compact", 
however it still needs more space.
Is that because I didn't update DB size with set-label-key?
In mimic you need to run both the "bluefs-bdev-expand" and "set-label-key" 
commands to commit the bluefs volume expansion.
Unfortunately the last command doesn't handle "size" label properly. 
That's why you might need to backport and rebuild with the mentioned 
commits.



What exactly is the label-key that needs to be updated, as I couldn't find 
which one is related to DB:

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
{
 "/var/lib/ceph/osd/ceph-1/block": {
 "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
 "size": 471305551872,
 "btime": "2018-07-31 03:06:43.751243",
 "description": "main",
 "bluefs": "1",
 "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
 "kv_backend": "rocksdb",
 "magic": "ceph osd volume v026",
 "mkfs_done": "yes",
 "osd_key": "XXX",
 "ready": "ready",
 "whoami": "1"
 }
}

The 'size' label, but your output is for the block (aka slow) device.

It should return labels for db/wal devices as well (block.db and 
block.wal symlinks respectively). It works for me in master, can't 
verify with mimic at the moment though..

Here is output for master:

# bin/ceph-bluestore-tool show-label --path dev/osd0
inferring bluefs devices from bluestore path
{
    "dev/osd0/block": {
    "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
    "size": 21474836480,
    "btime": "2018-09-10 15:55:09.044039",
    "description": "main",
    "bluefs": "1",
    "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c",
    "kv_backend": "rocksdb",
    "magic": "ceph osd volume v026",
    "mkfs_done": "yes",
    "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==",
    "ready": "ready",
    "whoami": "0"
    },
    "dev/osd0/block.wal": {
    "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
    "size": 1048576000,
    "btime": "2018-09-10 15:55:09.044985",
    "description": "bluefs wal"
    },
    "dev/osd0/block.db": {
    "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
    "size": 1048576000,
    "btime": "2018-09-10 15:55:09.044469",
    "description": "bluefs db"
    }
}


You can try --dev option instead of --path, e.g.
ceph-bluestore-tool show-label --dev 
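
For instance (the device path here is just an example):

  ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-1/block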





On 1.10.2018, at 16:48, Igor Fedotov  wrote:

This looks like a sort of deadlock when BlueFS needs some additional space to 
replay the log left after the crash. Which happens during BlueFS open.

But such a space (at slow device as DB is full) is gifted in background during 
bluefs rebalance procedure which will occur after the open.

Hence OSDs stuck in permanent crashing..

The only way to recover I can suggest for now is to expand DB volumes. You can 
do that with lvm tools if you have any spare space for that.

Once resized you'll need ceph-bluestore-tool to indicate volume expansion to 
BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label 
with  set-label-key command.

The latter is a bit tricky for mimic - you might need to backport 
https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b

and rebuild ceph-bluestore-tool. Alternatively you can backport 
https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b

then bluefs expansion and label updates will occur in a single step.

I'll do these backports in upstream but this will take some time to pass all 
the procedures and get into official mimic  release.

Will fire a ticket to fix the original issue as well.


Thanks,

Igor


On 10/1/2018 3:28 PM, Sergey Malinin wrote:

These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare 
--bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices.
OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified 
setting is bluestore_cache_kv_max=1073741824

DB/block usage collected by prometheus module for 3 failed and 1 survived OSDs:

ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one has 
survived
ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0

ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0

ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0

Re: [ceph-users] Problems after increasing number of PGs in a pool

2018-10-01 Thread Vladimir Brik
Thanks to everybody who responded. The problem was, indeed, that I hit
the limit on the number of PGs per SSD OSD when I increased the number
of PGs in a pool.

One question though: should I have received a warning that some OSDs are
close to their maximum PG limit? A while back, in a Luminous test pool I
remember seeing something like "too many PGs per OSD" in some of my
testing, but not this time (perhaps because this time I hit the limit
during the resizing operation). Where might such a warning be recorded if
not in "ceph status"?

Thanks,

Vlad
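
For reference, the per-OSD PG count and the two limits Paul mentions below can be
inspected and raised roughly like this on mimic; the values are examples only:

  ceph osd df tree      # the PGS column shows the PG count per OSD
  ceph config set global mon_max_pg_per_osd 400
  ceph config set global osd_max_pg_per_osd_hard_ratio 3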



On 09/28/2018 01:04 PM, Paul Emmerich wrote:
> I guess the pool is mapped to SSDs only from the name and you only got 20 
> SSDs.
> So you should have about ~2000 effective PGs taking replication into account.
> 
> Your pool has ~10k effective PGs with k+m=5 and you seem to have 5
> more pools
> 
> Check "ceph osd df tree" to see how many PGs per OSD you got.
> 
> Try increasing these two options to "fix" it.
> 
> mon max pg per osd
> osd max pg per osd hard ratio
> 
> 
> Paul
> On Fri, 28 Sep 2018 at 18:05, Vladimir Brik wrote:
>>
>> Hello
>>
>> I've attempted to increase the number of placement groups of the pools
>> in our test cluster and now ceph status (below) is reporting problems. I
>> am not sure what is going on or how to fix this. Troubleshooting
>> scenarios in the docs don't seem to quite match what I am seeing.
>>
>> I have no idea how to begin to debug this. I see OSDs listed in
>> "blocked_by" of pg dump, but don't know how to interpret that. Could
>> somebody assist please?
>>
>> I attached output of "ceph pg dump_stuck -f json-pretty" just in case.
>>
>> The cluster consists of 5 hosts, each with 16 HDDs and 4 SSDs. I am
>> running 13.2.2.
>>
>> This is the affected pool:
>> pool 6 'fs-data-ec-ssd' erasure size 5 min_size 4 crush_rule 6
>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 2493 lfor
>> 0/2491 flags hashpspool,ec_overwrites stripe_width 12288 application cephfs
>>
>>
>> Thanks,
>>
>> Vlad
>>
>>
>> ceph health
>>
>>   cluster:
>> id: 47caa1df-42be-444d-b603-02cad2a7fdd3
>> health: HEALTH_WARN
>> Reduced data availability: 155 pgs inactive, 47 pgs peering,
>> 64 pgs stale
>> Degraded data redundancy: 321039/114913606 objects degraded
>> (0.279%), 108 pgs degraded, 108 pgs undersized
>>
>>   services:
>> mon: 5 daemons, quorum ceph-1,ceph-2,ceph-3,ceph-4,ceph-5
>> mgr: ceph-3(active), standbys: ceph-2, ceph-5, ceph-1, ceph-4
>> mds: cephfs-1/1/1 up  {0=ceph-5=up:active}, 4 up:standby
>> osd: 100 osds: 100 up, 100 in; 165 remapped pgs
>>
>>   data:
>> pools:   6 pools, 5120 pgs
>> objects: 22.98 M objects, 88 TiB
>> usage:   154 TiB used, 574 TiB / 727 TiB avail
>> pgs: 3.027% pgs not active
>>  321039/114913606 objects degraded (0.279%)
>>  4903 active+clean
>>  105  activating+undersized+degraded+remapped
>>  61   stale+active+clean
>>  47   remapped+peering
>>  3stale+activating+undersized+degraded+remapped
>>  1active+clean+scrubbing+deep
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
Before I received your response, I had already added 20GB to the OSD (by 
expanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool 
bluestore-kv  compact", however it still needs more space.
Is that because I didn't update DB size with set-label-key?

What exactly is the label-key that needs to be updated, as I couldn't find 
which one is related to DB:

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
{
"/var/lib/ceph/osd/ceph-1/block": {
"osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
"size": 471305551872,
"btime": "2018-07-31 03:06:43.751243",
"description": "main",
"bluefs": "1",
"ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "XXX",
"ready": "ready",
"whoami": "1"
}
}


> On 1.10.2018, at 16:48, Igor Fedotov  wrote:
> 
> This looks like a sort of deadlock when BlueFS needs some additional space to 
> replay the log left after the crash. Which happens during BlueFS open.
> 
> But such a space (at slow device as DB is full) is gifted in background 
> during bluefs rebalance procedure which will occur after the open.
> 
> Hence OSDs stuck in permanent crashing..
> 
> The only way to recover I can suggest for now is to expand DB volumes. You 
> can do that with lvm tools if you have any spare space for that.
> 
> Once resized you'll need ceph-bluestore-tool to indicate volume expansion to 
> BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label 
> with  set-label-key command.
> 
> The latter is a bit tricky for mimic - you might need to backport 
> https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b
> 
> and rebuild ceph-bluestore-tool. Alternatively you can backport 
> https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b
> 
> then bluefs expansion and label updates will occur in a single step.
> 
> I'll do these backports in upstream but this will take some time to pass all 
> the procedures and get into official mimic  release.
> 
> Will fire a ticket to fix the original issue as well.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 3:28 PM, Sergey Malinin wrote:
>> These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare 
>> --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices.
>> OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified 
>> setting is bluestore_cache_kv_max=1073741824
>> 
>> DB/block usage collected by prometheus module for 3 failed and 1 survived 
>> OSDs:
>> 
>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one 
>> has survived
>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0
>> 
>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0
>> 
>> ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
>> ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
>> ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0
>> ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0
>> 
>> ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0
>> ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0
>> ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0
>> ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0
>> 
>> 
>> First crashed OSD was doing DB compaction, others crashed shortly after 
>> during backfilling. Workload was "ceph-data-scan scan_inodes" filling 
>> metadata pool located on these OSDs at the rate close to 10k objects/second.
>> Here is the log excerpt of the first crash occurrence:
>> 
>> 2018-10-01 03:27:12.762 7fbf16dd6700  0 bluestore(/var/lib/ceph/osd/ceph-1) 
>> _balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x1000
>> 2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: 
>> [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 
>> 24] Generated table #89741: 106356 keys, 68110589 bytes
>> 2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 
>> {"time_micros": 1538353632892744, "cf_name": "default", "job": 24, "event": 
>> "table_file_creation", "file_number": 89741, "file_size": 68110589, 
>> "table_properties": {"data_size": 67112903, "index_size": 579319, 
>> "filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, 
>> "raw_value_size": 60994583, "raw_average_value_size": 573, 
>> "num_data_blocks": 16336, "num_entries": 106356, "filter_policy_name": 
>> "rocksdb.BuiltinBloomFilter", "kDeletedKeys": 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov
This looks like a sort of deadlock: BlueFS needs some additional 
space to replay the log left after the crash, which happens during 
BlueFS open.


But such space (on the slow device, as the DB is full) is gifted in the background 
during the bluefs rebalance procedure, which only occurs after the open.


Hence the OSDs are stuck permanently crashing.

The only way to recover I can suggest for now is to expand DB volumes. 
You can do that with lvm tools if you have any spare space for that.


Once resized, you'll need ceph-bluestore-tool to indicate the volume 
expansion to BlueFS (the bluefs-bdev-expand command) and finally update the DB 
volume size label with the set-label-key command.
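
Assuming the DB sits on its own LV, that flow is roughly the following; the VG/LV
names, the size and the OSD path are examples only:

  lvextend -L +20G /dev/ceph-vg/osd-1-db        # grow the LV backing block.db
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
  ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block.db \
      -k size -v <new size in bytes>            # see the mimic caveat below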


The latter is a bit tricky for mimic - you might need to backport 
https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b


and rebuild ceph-bluestore-tool. Alternatively you can backport 
https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b


then bluefs expansion and label updates will occur in a single step.

I'll do these backports upstream, but this will take some time to pass 
all the procedures and get into an official mimic release.


Will fire a ticket to fix the original issue as well.


Thanks,

Igor


On 10/1/2018 3:28 PM, Sergey Malinin wrote:

These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare 
--bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices.
OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified 
setting is bluestore_cache_kv_max=1073741824

DB/block usage collected by prometheus module for 3 failed and 1 survived OSDs:

ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one has 
survived
ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0

ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0

ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0

ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0


First crashed OSD was doing DB compaction, others crashed shortly after during 
backfilling. Workload was "ceph-data-scan scan_inodes" filling metadata pool 
located on these OSDs at the rate close to 10k objects/second.
Here is the log excerpt of the first crash occurrence:

2018-10-01 03:27:12.762 7fbf16dd6700  0 bluestore(/var/lib/ceph/osd/ceph-1) 
_balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x1000
2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] 
Generated table #89741: 106356 keys, 68110589 bytes
2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632892744, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89741, "file_size": 68110589, "table_properties": 
{"data_size": 67112903, "index_size": 579319, "filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, "raw_value_size": 60994583, "raw_average_value_size": 573, "num_data_blocks": 16336, "num_entries": 106356, 
"filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "1", "kMergeOperands": "0"}}
2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] 
Generated table #89742: 23214 keys, 16352315 bytes
2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632938670, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89742, "file_size": 16352315, "table_properties": 
{"data_size": 16116986, "index_size": 139894, "filter_size": 94386, "raw_key_size": 1470883, "raw_average_key_size": 63, "raw_value_size": 14775006, "raw_average_value_size": 636, "num_data_blocks": 3928, "num_entries": 23214, 
"filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "90", "kMergeOperands": "0"}}
2018-10-01 03:27:13.042 7fbf1e5e5700  1 bluefs _allocate failed to allocate 
0x410 on bdev 1, free 0x1a0; fallback to bdev 2
2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate failed to allocate 
0x410 on bdev 2, dne
2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range allocated: 0x0 
offset: 0x0 length: 0x40ea9f1
2018-10-01 03:27:13.046 7fbf1e5e5700 -1 
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int 

Re: [ceph-users] Cephfs mds cache tuning

2018-10-01 Thread Paul Emmerich
You might find something by looking at the MDS server with perf:

   perf top --pid $(pidof ceph-mds)

as the simplest command to get started. If you can catch it during a
period of blocked requests/not doing anything, you might be able to
see what it is actually doing and figure out something from there.

But that might also yield nothing if it's not blocked on anything CPU-intensive.
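
If perf doesn't show anything useful, the MDS admin socket is another angle; the
daemon name here is just an example:

  ceph daemon mds.$(hostname -s) dump_ops_in_flight   # what the MDS thinks it is working on
  ceph daemon mds.$(hostname -s) objecter_requests    # whether it is waiting on the OSDs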

Paul

On Mon, 1 Oct 2018 at 06:02, Adam Tygart wrote:
>
> Hello all,
>
> I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
> cephfs mounts (kernel client). We're currently only using 1 active
> mds.
>
> Performance is great about 80% of the time. MDS responses (per ceph
> daemonperf mds.$(hostname -s)) indicate 2k-9k requests per second,
> with a latency under 100.
>
> It is the other 20ish percent I'm worried about. I'll check on it and
> it will be going 5-15 seconds with "0" requests, "0" latency, then
> give me 2 seconds of reasonable response times, and then back to
> nothing. Clients are actually seeing blocked requests for this period
> of time.
>
> The strange bit is that when I *reduce* the mds_cache_size, requests
> and latencies go back to normal for a while. When it happens again,
> I'll increase it back to where it was. It feels like the mds server
> decides that some of these inodes can't be dropped from the cache
> unless the cache size changes. Maybe something wrong with the LRU?
>
> I feel like I've got a reasonable cache size for my workload, 30GB on
> the small end, 55GB on the large. No real reason for a swing this
> large except to potentially delay it recurring after expansion for
> longer.
>
> I also feel like there is probably some magic tunable to change how
> inodes get stuck in the LRU, perhaps mds_cache_mid. Anyone know what
> this tunable actually does? The documentation is a little sparse.
>
> I can grab logs from the mds if needed, just let me know the settings
> you'd like to see.
>
> --
> Adam
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount cephfs from a public network ip of mds

2018-10-01 Thread Paul Emmerich
No, mons can only have exactly one IP address and they'll only listen
on that IP.

As David suggested: check if you really need separate networks. This
setup usually creates more problems than it solves, especially if you
have one 1G and one 10G network.
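
For what it's worth, the kernel mount then has to point at a mon address the
client can actually reach, not at the MDS; a rough sketch (the address is the
one from the quoted mail below, name/secretfile are placeholders):

   mount -t ceph 140.109.169.48:6789:/ /mnt/cephfs \
       -o name=admin,secretfile=/etc/ceph/admin.secret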

Paul
On Mon, 1 Oct 2018 at 04:11, Joshua Chen wrote:
>
> Hello Paul,
>   Thanks for your reply.
>   Now my clients will be on 140.109 (LAN, the real-IP network, 1Gb/s) and
> on 10.32 (SAN, a closed 10Gb network). Could I set public_network to
> 0.0.0.0 so the mon daemon listens on both the 1Gb and 10Gb networks?
>   Or could I have
> public_network = 140.109.169.0/24, 10.32.67.0/24
> cluster_network = 10.32.67.0/24
>
> Does Ceph allow two (multiple) public_network subnets?
>
>   And I don't want to limit client read/write speed to 1Gb/s unless a
> client has no 10Gb NIC installed. Clients that read/write to the OSDs
> (once they know where the data is) should use the fastest NIC available
> (10Gb). Clients with only a 1Gb NIC will go through 140.109.0.0 (1Gb LAN)
> to ask the mons and to read/write to the OSDs. This is why my OSDs also
> have both 1Gb and 10Gb NICs, on 140.109.0.0 and 10.32.0.0 respectively.
>
> Cheers
> Joshua
>
> On Sun, Sep 30, 2018 at 12:09 PM David Turner  wrote:
>>
>> The cluster/private network is only used by the OSDs. Nothing else in Ceph
>> or its clients communicates over it. Everything other than OSD-to-OSD
>> communication uses the public network; that includes the MONs, MDSs,
>> clients, and anything other than an OSD talking to an OSD. Only
>> OSD-to-OSD traffic can use the private/cluster network.
>>
>> On Sat, Sep 29, 2018, 6:43 AM Paul Emmerich  wrote:
>>>
>>> All Ceph clients will always first connect to the mons. Mons provide
>>> further information on the cluster such as the IPs of MDS and OSDs.
>>>
>>> This means you need to provide the mon IPs to the mount command, not
>>> the MDS IPs. Your first command works by coincidence since
>>> you seem to run the mons and MDS' on the same server.
>>>
>>>
>>> Paul
>>> On Sat, 29 Sep 2018 at 12:07, Joshua Chen wrote:
>>> >
>>> > Hello all,
>>> >   I am testing the CephFS cluster so that clients can mount it with mount -t ceph.
>>> >
>>> >   the cluster has 6 nodes, 3 mons (also mds), and 3 osds.
>>> >   All these 6 nodes has 2 nic, one 1Gb nic with real ip (140.109.0.0) and 
>>> > 1 10Gb nic with virtual ip (10.32.0.0)
>>> >
>>> > 140.109. Nic1 1G<-MDS1->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-MDS2->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-MDS3->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-OSD1->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-OSD2->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-OSD3->Nic2 10G 10.32.
>>> >
>>> >
>>> >
>>> > and I have the following questions:
>>> >
>>> > 1. Can clients on both the public (140.109.0.0) and cluster (10.32.0.0)
>>> > networks mount this CephFS resource?
>>> >
>>> > I want to do
>>> >
>>> > (in a 140.109 network client)
>>> > mount -t ceph mds1(140.109.169.48):/ /mnt/cephfs -o user=,secret=
>>> >
>>> > and also in a 10.32.0.0 network client)
>>> > mount -t ceph mds1(10.32.67.48):/
>>> > /mnt/cephfs -o user=,secret=
>>> >
>>> >
>>> >
>>> >
>>> > Currently, only the 10.32.0.0 clients can mount it; those on the public
>>> > network (140.109) cannot. How can I enable this?
>>> >
>>> > here attached is my ceph.conf
>>> >
>>> > Thanks in advance
>>> >
>>> > Cheers
>>> > Joshua
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Paul Emmerich
>>>
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>
>>> croit GmbH
>>> Freseniusstr. 31h
>>> 81247 München
>>> www.croit.io
>>> Tel: +49 89 1896585 90
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount cephfs from a public network ip of mds

2018-10-01 Thread John Petrini
Multiple subnets are supported.

http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/#id1
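
In ceph.conf that is roughly the form Joshua already proposed (a sketch only;
check the linked page before applying, and note each mon still binds to a
single address):

   [global]
   public_network = 140.109.169.0/24, 10.32.67.0/24
   cluster_network = 10.32.67.0/24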
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
These are LVM bluestore NVMe SSDs created with "ceph-volume lvm prepare 
--bluestore --data /dev/nvme0n1p3", i.e. without specifying separate wal/db devices.
The OSDs were created with bluestore_min_alloc_size_ssd=4096; another modified 
setting is bluestore_cache_kv_max=1073741824.

DB/block usage collected by prometheus module for 3 failed and 1 survived OSDs:

ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one has 
survived
ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0

ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0

ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0
ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0

ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0
ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0


The first crashed OSD was doing a DB compaction; the others crashed shortly after, 
during backfilling. The workload was "cephfs-data-scan scan_inodes" filling the 
metadata pool located on these OSDs at a rate close to 10k objects/second.
Here is the log excerpt of the first crash occurrence:

2018-10-01 03:27:12.762 7fbf16dd6700  0 bluestore(/var/lib/ceph/osd/ceph-1) 
_balance_bluefs_freespace no allocate on 0x8000 min_alloc_size 0x1000
2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] 
Generated table #89741: 106356 keys, 68110589 bytes
2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1538353632892744, "cf_name": "default", "job": 24, "event": 
"table_file_creation", "file_number": 89741, "file_size": 68110589, 
"table_properties": {"data_size": 67112903, "index_size": 579319, 
"filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, 
"raw_value_size": 60994583, "raw_average_value_size": 573, "num_data_blocks": 
16336, "num_entries": 106356, "filter_policy_name": 
"rocksdb.BuiltinBloomFilter", "kDeletedKeys": "1", "kMergeOperands": "0"}}
2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] 
Generated table #89742: 23214 keys, 16352315 bytes
2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1538353632938670, "cf_name": "default", "job": 24, "event": 
"table_file_creation", "file_number": 89742, "file_size": 16352315, 
"table_properties": {"data_size": 16116986, "index_size": 139894, 
"filter_size": 94386, "raw_key_size": 1470883, "raw_average_key_size": 63, 
"raw_value_size": 14775006, "raw_average_value_size": 636, "num_data_blocks": 
3928, "num_entries": 23214, "filter_policy_name": "rocksdb.BuiltinBloomFilter", 
"kDeletedKeys": "90", "kMergeOperands": "0"}}
2018-10-01 03:27:13.042 7fbf1e5e5700  1 bluefs _allocate failed to allocate 
0x410 on bdev 1, free 0x1a0; fallback to bdev 2
2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate failed to allocate 
0x410 on bdev 2, dne
2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range allocated: 0x0 
offset: 0x0 length: 0x40ea9f1
2018-10-01 03:27:13.046 7fbf1e5e5700 -1 
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int 
BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 
7fbf1e5e5700 time 2018-10-01 03:27:13.048298
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs 
enospc")

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x102) [0x7fbf2d4fe5c2]
 2: (()+0x26c787) [0x7fbf2d4fe787]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x1ab4) [0x5619325114b4]
 4: (BlueRocksWritableFile::Flush()+0x3d) [0x561932527c1d]
 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x56193271c399]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x56193271d42b]
 7: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, 
rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, 
CompactionIterationStats*, rocksdb::Slice const*)+0x3db) [0x56193276098b]
 8: 
(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9)
 [0x561932763da9]
 9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504]
 10: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc54) 
[0x5619325b5c44]
 11: 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Igor Fedotov

Hi Sergey,

could you please provide more details on your OSDs?

What are the sizes of the DB/block devices?

Do you have any modified BlueStore config settings?

Can you share stats you're referring to?
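
If the failed OSDs won't start, the device and DB sizes can also be read
offline with ceph-bluestore-tool; a rough sketch (path as in your log, with
the OSD stopped):

   ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
   ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1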


Thanks,

Igor


On 10/1/2018 12:29 PM, Sergey Malinin wrote:

Hello,
3 of 4 NVMe OSDs crashed at the same time on assert(0 == "bluefs enospc") and 
no longer start.
Stats collected just before the crash show ceph_bluefs_db_used_bytes at 100%. 
Although the OSDs have over 50% free space, it is not being reallocated for 
DB use.

2018-10-01 12:18:06.744 7f1d6a04d240  1 bluefs _allocate failed to allocate 
0x10 on bdev 1, free 0x0; fallback to bdev 2
2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate failed to allocate 
0x10 on bdev 2, dne
2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range allocated: 0x0 
offset: 0x0 length: 0xa8700
2018-10-01 12:18:06.748 7f1d6a04d240 -1 
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int 
BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 
7f1d6a04d240 time 2018-10-01 12:18:06.746800
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs 
enospc")

  ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x102) [0x7f1d6146f5c2]
  2: (()+0x26c787) [0x7f1d6146f787]
  3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x1ab4) [0x5586b22684b4]
  4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d]
  5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x5586b2473399]
  6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x5586b247442b]
  7: (rocksdb::BuildTable(std::__cxx11::basic_string, 
std::allocator > const&, rocksdb::Env*, rocksdb::ImmutableCFOptions const&, 
rocksdb::MutableCFOptions const&, rocksdb::EnvOptions const&, rock
sdb::TableCache*, rocksdb::InternalIterator*, std::unique_ptr >, rocksdb::FileMetaData*, 
rocksdb::InternalKeyComparator const&, std::vector >, 
std::allocator > > > co
nst*, unsigned int, std::__cxx11::basic_string, std::allocator 
> const&, std::vector >, unsigned long, 
rocksdb::SnapshotChecker*, rocksdb::Compression
Type, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, 
rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, 
rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned 
long, rocksdb
::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94]
  8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, 
rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcb7) 
[0x5586b2321457]
  9: (rocksdb::DBImpl::RecoverLogFiles(std::vector > const&, unsigned long*, bool)+0x19de) [0x5586b232373e]
  10: (rocksdb::DBImpl::Recover(std::vector > const&, bool, bool, bool)+0x5d4) 
[0x5586b23242f4]
  11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string, std::allocator > const&, 
std::vector > const&, std::vector >*, rocksdb::DB**, bool)+0x68b) [0x5586b232559b]
  12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string, std::allocator > const&, 
std::vector
const&, std::vector >*, rocksdb::DB**)+0x22) [0x5586b2326e72]

  13: (RocksDBStore::do_open(std::ostream&, bool, std::vector > const*)+0x170c) [0x5586b220219c]
  14: (BlueStore::_open_db(bool, bool)+0xd8e) [0x5586b218ee1e]
  15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807]
  16: (OSD::init()+0x295) [0x5586b1d673c5]
  17: (main()+0x268d) [0x5586b1c554ed]
  18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97]
  19: (_start()+0x2a) [0x5586b1d1d7fa]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Manually deleting an RGW bucket

2018-10-01 Thread Sean Purdy
On Sat, 29 Sep 2018, Konstantin Shalygin said:
> > How do I delete an RGW/S3 bucket and its contents if the usual S3 API 
> > commands don't work?
> > 
> > The bucket has S3 delete markers that S3 API commands are not able to 
> > remove, and I'd like to reuse the bucket name.  It was set up for 
> > versioning and lifecycles under ceph 12.2.5 which broke the bucket when a 
> > reshard happened.  12.2.7 allowed me to remove the regular files but not 
> > the delete markers.
> > 
> > There must be a way of removing index files and so forth through rados 
> > commands.
> 
> 
> What is the actual error?
> 
> To delete a bucket you should first delete all bucket objects ("s3cmd rm -rf
> s3://bucket/") and any multipart uploads.


No errors, but I can't remove delete markers from the versioned bucket.


Here's the bucket:

$ aws --profile=mybucket --endpoint-url http://myserver/ s3 ls s3://mybucket/

(no objects returned)

Try removing the bucket:

$ aws --profile=mybucket --endpoint-url http://myserver/ s3 rb s3://mybucket/
remove_bucket failed: s3://mybucket/ An error occurred (BucketNotEmpty) when 
calling the DeleteBucket operation: Unknown

So the bucket is not empty.

List object versions:

$ aws --profile=mybucket --endpoint-url http://myserver/ s3api 
list-object-versions --bucket mybucket --prefix someprefix/0/0

Shows lots of delete markers from the versioned bucket:

{
"Owner": {
"DisplayName": "mybucket bucket owner", 
"ID": "mybucket"
}, 
"IsLatest": true, 
"VersionId": "ZB8ty9c3hxjxV5izmIKM1QwDR6fwnsd", 
"Key": "someprefix/0/0/00fff6df-863d-48b5-9089-cc6e7c5997e7", 
}
 
Let's try removing that delete marker object:

$ aws --profile=mybucket --endpoint-url http://myserver/ s3api delete-object 
--bucket mybucket --key someprefix/0/0/00fff6df-863d-48b5-9089-cc6e7c5997e7 
--version-id ZB8ty9c3hxjxV5izmIKM1QwDR6fwnsd

Returns 0, has it worked?

$ aws --profile=mybucket --endpoint-url http://myserver/ s3api 
list-object-versions --bucket mybucket --prefix 
someprefix/0/0/00fff6df-863d-48b5-9089-cc6e7c5997e7

No:

"DeleteMarkers": [
{
"Owner": {
"DisplayName": "static bucket owner", 
"ID": "static"
}, 
"IsLatest": true, 
"VersionId": "ZB8ty9c3hxjxV5izmIKM1QwDR6fwnsd", 
"Key": "candidate-photo/0/0/00fff6df-863d-48b5-9089-cc6e7c5997e7", 
"LastModified": "2018-09-17T16:19:58.187Z"
}
]


So how do I get rid of the delete markers to empty the bucket?  This is my 
problem.
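
The only other route I can think of is to bypass S3 entirely with
radosgw-admin; untested here, just a sketch:

   radosgw-admin bucket check --bucket=mybucket --fix --check-objects
   radosgw-admin bucket rm --bucket=mybucket --purge-objects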

Sean
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
Hello,
3 of 4 NVMe OSDs crashed at the same time on assert(0 == "bluefs enospc") and 
no longer start.
Stats collected just before the crash show ceph_bluefs_db_used_bytes at 100%. 
Although the OSDs have over 50% free space, it is not being reallocated for 
DB use.

2018-10-01 12:18:06.744 7f1d6a04d240  1 bluefs _allocate failed to allocate 
0x10 on bdev 1, free 0x0; fallback to bdev 2
2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate failed to allocate 
0x10 on bdev 2, dne
2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range allocated: 0x0 
offset: 0x0 length: 0xa8700
2018-10-01 12:18:06.748 7f1d6a04d240 -1 
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int 
BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 
7f1d6a04d240 time 2018-10-01 12:18:06.746800
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs 
enospc")

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x102) [0x7f1d6146f5c2]
 2: (()+0x26c787) [0x7f1d6146f787]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x1ab4) [0x5586b22684b4]
 4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d]
 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x5586b2473399]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x5586b247442b]
 7: (rocksdb::BuildTable(std::__cxx11::basic_string, std::allocator > const&, rocksdb::Env*, 
rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, 
rocksdb::EnvOptions const&, rock
sdb::TableCache*, rocksdb::InternalIterator*, 
std::unique_ptr >, rocksdb::FileMetaData*, 
rocksdb::InternalKeyComparator const&, std::vector >, 
std::allocator > > > co
nst*, unsigned int, std::__cxx11::basic_string, 
std::allocator > const&, std::vector >, unsigned long, rocksdb::SnapshotChecker*, 
rocksdb::Compression
Type, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, 
rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, 
rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, 
unsigned long, rocksdb
::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94]
 8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, 
rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcb7) 
[0x5586b2321457]
 9: (rocksdb::DBImpl::RecoverLogFiles(std::vector > const&, unsigned long*, bool)+0x19de) 
[0x5586b232373e]
 10: (rocksdb::DBImpl::Recover(std::vector > const&, bool, bool, 
bool)+0x5d4) [0x5586b23242f4]
 11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, std::vector >*, rocksdb::DB**, bool)+0x68b) 
[0x5586b232559b]
 12: (rocksdb::DB::Open(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, std::vector > std::allocator >*, rocksdb::DB**)+0x22) 
> > [0x5586b2326e72]
 13: (RocksDBStore::do_open(std::ostream&, bool, 
std::vector 
> const*)+0x170c) [0x5586b220219c]
 14: (BlueStore::_open_db(bool, bool)+0xd8e) [0x5586b218ee1e]
 15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807]
 16: (OSD::init()+0x295) [0x5586b1d673c5]
 17: (main()+0x268d) [0x5586b1c554ed]
 18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97]
 19: (_start()+0x2a) [0x5586b1d1d7fa]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com