Re: [ceph-users] OSD crashed during the fio test

2019-10-01 Thread Brad Hubbard
If it is only this one osd I'd be inclined to take a hard look at
the underlying hardware and how it behaves/performs compared to the hw
backing identical osds. The less likely possibility is that you have
some sort of "hot spot" causing resource contention for that osd. To
investigate that further you could look at whether the pattern of cpu
and ram usage of that daemon varies significantly compared to the
other osd daemons in the cluster. You could also compare perf dumps
between daemons.
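
For example, something like this (osd ids are illustrative; jq is only
used to get a stable key order for diffing; run each command on the host
that owns the daemon's admin socket):

  # dump perf counters for the suspect osd and for a healthy peer
  ceph daemon osd.34 perf dump | jq -S . > /tmp/osd.34.perf.json
  ceph daemon osd.35 perf dump | jq -S . > /tmp/osd.35.perf.json
  # large deltas in op latencies or bluestore counters are worth a look
  diff -u /tmp/osd.35.perf.json /tmp/osd.34.perf.json | less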

On Wed, Oct 2, 2019 at 1:46 PM Sasha Litvak
 wrote:
>
> I updated the firmware and kernel and am running torture tests (job shape 
> sketched below).  So far no assert, but I still noticed this on the same 
> osd as yesterday:
>
> Oct 01 19:35:13 storage2n2-la ceph-osd-34[11188]: 2019-10-01 19:35:13.721 
> 7f8d03150700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 
> 0x7f8cd05d7700' had timed out after 60
> Oct 01 19:35:13 storage2n2-la ceph-osd-34[11188]: 2019-10-01 19:35:13.721 
> 7f8d03150700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 
> 0x7f8cd0dd8700' had timed out after 60
> Oct 01 19:35:13 storage2n2-la ceph-osd-34[11188]: 2019-10-01 19:35:13.721 
> 7f8d03150700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 
> 0x7f8cd2ddc700' had timed out after 60
> Oct 01 19:35:13 storage2n2-la ceph-osd-34[11188]: 2019-10-01 19:35:13.721 
> 7f8d03150700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 
> 0x7f8cd35dd700' had timed out after 60
> Oct 01 19:35:13 storage2n2-la ceph-osd-34[11188]: 2019-10-01 19:35:13.721 
> 7f8d03150700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 
> 0x7f8cd3dde700' had timed out after 60
>
> The latency spike on this OSD is 6 seconds at that time.  Any ideas?
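>
> For reference, the torture jobs are fio runs of roughly this shape (the
> target device and sizes are illustrative, not the exact job file):
>
>   fio --name=torture --ioengine=libaio --direct=1 --rw=randwrite \
>       --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=600 \
>       --filename=/dev/rbd0
>
> While the spike happens I can also pull the slowest recent ops from the
> daemon's admin socket on the osd's host:
>
>   # slowest recently completed ops, with per-event timelines
>   ceph daemon osd.34 dump_historic_ops
>   # anything currently stuck in flight
>   ceph daemon osd.34 dump_ops_in_flight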
>
> On Tue, Oct 1, 2019 at 8:03 AM Sasha Litvak  
> wrote:
>>
>> It was hardware indeed.  The Dell server reported a disk being reset with 
>> power on.  I'm checking the usual suspects, i.e. controller firmware, the 
>> controller event log (if I can get one), and drive firmware.
>> I will report more when I have a better idea.
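>>
>> The checks will look roughly like this (device name is illustrative, and
>> the controller event log needs the vendor CLI):
>>
>>   # SMART health, error counters, and firmware version for the drive
>>   smartctl -a /dev/sdX
>>   # kernel-side evidence of the reset
>>   dmesg -T | grep -iE 'reset|i/o error'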
>>
>> Thank you!
>>
>> On Tue, Oct 1, 2019 at 2:33 AM Brad Hubbard  wrote:
>>>
>>> Removed ceph-de...@vger.kernel.org and added d...@ceph.io
>>>
>>> On Tue, Oct 1, 2019 at 4:26 PM Alex Litvak  
>>> wrote:
>>> >
>>> > Hello everyone,
>>> >
>>> > Can you shed some light on the cause of the crash?  Could a client 
>>> > request actually trigger it?
>>> >
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30 22:52:58.867 
>>> > 7f093d71e700 -1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) 
>>> > aio_submit retries 16
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30 22:52:58.867 
>>> > 7f093d71e700 -1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block)  aio 
>>> > submit got (11) Resource temporarily unavailable
>>>
>>> The KernelDevice::aio_submit function has tried to submit I/O 16 times
>>> (a hard-coded limit) and received an error each time, causing it to
>>> assert. Can you check the status of the underlying device(s)?
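>>>
>>> A quick sanity check along these lines might help (device name is
>>> illustrative):
>>>
>>>   # one known source of EAGAIN for kernel AIO is exhausting the
>>>   # system-wide context limit; compare current usage to the cap
>>>   cat /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr
>>>   # look for I/O errors or resets against the backing device
>>>   dmesg -T | grep -iE 'i/o error|reset'
>>>   smartctl -H /dev/sdX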
>>>
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:
>>> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc:
>>> > In fun
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:
>>> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc:
>>> > 757: F
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  ceph version 14.2.2 
>>> > (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  1: 
>>> > (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>> > const*)+0x14a) [0x55b71f668cf4]
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  2: 
>>> > (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
>>> > char const*, ...)+0) [0x55b71f668ec2]
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  3: 
>>> > (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  4: 
>>> > (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) 
>>> > [0x55b71fc29892]
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  5: 
>>> > (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) 
>>> > [0x55b71fc496ab]
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  6: 
>>> > (BlueStore::queue_transactions(boost::intrusive_ptr&,
>>> >  std::vector>> > std::allocator >&, 
>>> > boost::intrusive_ptr, ThreadPool::T
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  7: (non-virtual thunk 
>>> > to PrimaryLogPG::queue_transactions(std::vector>> > std::allocator >&,
>>> > boost::intrusive_ptr)+0x54) [0x55b71f9b1b84]
>>> > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  8: 
>>> > 


[ceph-users] OSD crashed during the fio test

2019-10-01 Thread Alex Litvak

Hello everyone,

Can you shed some light on the cause of the crash?  Could a client request 
actually trigger it?

Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30 22:52:58.867 
7f093d71e700 -1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) aio_submit 
retries 16
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30 22:52:58.867 
7f093d71e700 -1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block)  aio 
submit got (11) Resource temporarily unavailable
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: 
In fun
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: 
757: F

Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  ceph version 14.2.2 
(4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  1: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) 
[0x55b71f668cf4]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  2: 
(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char 
const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  3: 
(KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  4: 
(BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  5: 
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  6: (BlueStore::queue_transactions(boost::intrusive_ptr&, std::vectorstd::allocator >&, boost::intrusive_ptr, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector >&, 
boost::intrusive_ptr)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  8: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptrstd::default_delete >&&, eversion_t const&, eversion_t const&, s

Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  9: 
(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, 
PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  10: 
(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  11: 
(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  12: 
(PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  13: 
(OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, 
ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  14: (PGOpItem::run(OSD*, OSDShard*, 
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  15: 
(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) 
[0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  16: 
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) 
[0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  17: 
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  18: (()+0x7dd5) 
[0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  19: (clone()+0x6d) 
[0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30 22:52:58.879 7f093d71e700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: 
757: F

Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  ceph version 14.2.2 
(4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  1: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) 
[0x55b71f668cf4]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  2: 
(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char 
const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]:  3: 
(KernelDevice::aio_submit(IOContext*)+0x701)