Re: [ceph-users] MDS getattr op stuck in snapshot

2019-06-12 Thread Hector Martin
On 12/06/2019 22.33, Yan, Zheng wrote:
> I have tracked down the bug. Thank you for reporting this.  'echo 2 >
> /proc/sys/vm/drop_caches' should fix the hang.  If you can compile ceph
> from source, please try the following patch.
> 
> diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
> index ecd06294fa..94b947975a 100644
> --- a/src/mds/Locker.cc
> +++ b/src/mds/Locker.cc
> @@ -2956,7 +2956,8 @@ void Locker::handle_client_caps(MClientCaps *m)
> 
>// client flushes and releases caps at the same time. make sure
> MDCache::cow_inode()
>// properly setup CInode::client_need_snapflush
> -  if ((m->get_dirty() & ~cap->issued()) && !need_snapflush)
> +  if (!need_snapflush && (m->get_dirty() & ~cap->issued()) &&
> + (m->flags & MClientCaps::FLAG_PENDING_CAPSNAP))
> cap->mark_needsnapflush();
>  }
> 
> 
> 

That was quick, thanks! I can build from source but I won't have time to
do so and test it until next week, if that's okay.
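For anyone following along, a rough sketch of one way to test a patch like this
from a source checkout (the release tag, patch file name and build target here
are assumptions, not from this thread):

 git clone https://github.com/ceph/ceph.git && cd ceph
 git checkout v14.2.1                 # substitute the release tag your cluster runs
 git apply locker-snapflush.patch     # the diff above, saved to a file
 ./install-deps.sh && ./do_cmake.sh
 cd build && make -j"$(nproc)" ceph-mds
 # then install the rebuilt ceph-mds binary on the MDS host and restart the daemon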


-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


[ceph-users] Enable buffered write for bluestore

2019-06-12 Thread Trilok Agarwal
Hi,
How can we enable bluestore_default_buffered_write using the ceph-conf utility?
Any pointers would be appreciated.
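For reference, ceph-conf itself only reads configuration; the option is set in
ceph.conf or (on Mimic and later) in the central config database. A minimal
sketch of both, to be verified against your own release:

 # 1) ceph.conf on the OSD hosts, then restart the OSDs:
 [osd]
 bluestore_default_buffered_write = true

 # 2) Mimic and later, via the monitors' config database:
 ceph config set osd bluestore_default_buffered_write true

 # verify on a running OSD:
 ceph daemon osd.0 config get bluestore_default_buffered_write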


Re: [ceph-users] Verifying current configuration values

2019-06-12 Thread Mark Nelson


On 6/12/19 5:51 PM, Jorge Garcia wrote:
I'm following the bluestore config reference guide and trying to 
change the value for osd_memory_target. I added the following entry in 
the /etc/ceph/ceph.conf file:


  [osd]
  osd_memory_target = 2147483648

and restarted the osd daemons with "systemctl restart 
ceph-osd.target". Now, how do I verify that the value has changed? I 
have tried "ceph daemon osd.0 config show" and it lists many settings, 
but osd_memory_target isn't one of them. What am I doing wrong?



What version of ceph are you using? Here's a quick dump of one of my 
test OSDs from master (ignore the low memory target):



$ sudo ceph daemon osd.0 config show  | grep osd_memory_target
    "osd_memory_target": "1073741824",
    "osd_memory_target_cgroup_limit_ratio": "0.80",


It looks to me like you may be on an older version of ceph that hasn't 
had the osd_memory_target code backported?
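A couple of quick checks along those lines (a sketch; the first command assumes
a Luminous-or-newer cluster, the others assume the admin socket is reachable on
the OSD host):

 ceph versions                                    # which release each daemon actually runs
 ceph daemon osd.0 version
 ceph daemon osd.0 config get osd_memory_target   # errors out if the option doesn't exist in this build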



Mark



[ceph-users] Verifying current configuration values

2019-06-12 Thread Jorge Garcia
I'm following the bluestore config reference guide and trying to change 
the value for osd_memory_target. I added the following entry in the 
/etc/ceph/ceph.conf file:


  [osd]
  osd_memory_target = 2147483648

and restarted the osd daemons with "systemctl restart ceph-osd.target". 
Now, how do I verify that the value has changed? I have tried "ceph 
daemon osd.0 config show" and it lists many settings, but 
osd_memory_target isn't one of them. What am I doing wrong?


Thanks!




Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Sage Weil wrote:
> On Thu, 13 Jun 2019, Simon Leinen wrote:
> > Sage Weil writes:
> > >> 2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column 
> > >> families: [default]
> > >> Unrecognized command: stats
> > >> ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
> > >> rocksdb::Version::~Version(): Assertion `path_id < 
> > >> cfd_->ioptions()->cf_paths.size()' failed.
> > >> *** Caught signal (Aborted) **
> > 
> > > Ah, this looks promising.. it looks like it got it open and has some
> > > problem with the error/teardown path.
> > 
> > > Try 'compact' instead of 'stats'?
> > 
> > That ran for a while and then crashed, also in the destructor for
> > rocksdb::Version, but with an otherwise different backtrace.  I'm
> > attaching the log again.
> 
> Hmm, I'm pretty sure this is a shutdown problem, but not certain.  If you 
> do
> 
>  ceph-kvstore-tool rocksdb /mnt/ceph/db list > keys
> 
> is the keys file huge?  Can you send the head and tail of it so we can 
> make sure it looks complete?
> 
> One last thing to check:
> 
>  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-NNN list > keys
> 
> and see if that behaves similarly or crashes in the way it did before when 
> the OSD was starting.

One other thing to try before taking any drastic steps (as described 
below):

 ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck

And, if we do the below,
 
> If the exported version looks intact, I have a workaround that will 
> make the osd use that external rocksdb db instead of the embedded one... 
> basically,
> 
>  - symlink the db, db.wal, db.slow files from the osd dir 
> (/var/lib/ceph/osd/ceph-NNN/db -> ... etc)
>  - ceph-bluestore-tool --dev /var/lib/ceph/osd/ceph-NNN/block set-label-key 
> -k bluefs -v 0
>  - start osd

...then before starting the OSD we should again do

 ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck

sage



> 
> but be warned this is fragile: there isn't a bluefs import function, so 
> this OSD will be permanently in that weird state.  The goal will be to get 
> it up and the PG/cluster behaving, and then eventually let rados recover 
> elsewhere and reprovision this osd.
> 
> But first, let's make sure the external rocksdb has a complete set of 
> keys!
> 
> sage
> 


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
[Sorry for the piecemeal information... it's getting late here]

> Oops, I forgot: Before it crashed, it did modify /mnt/ceph/db; the
> overall size of that directory increased(!) from 3.9GB to 12GB.  The
> compaction seems to have eaten two .log files, but created many more
> .sst files.

...and it upgraded the contents of db/CURRENT from "MANIFEST-053662" to
"MANIFEST-079750".

Good night,
-- 
Simon.


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Thu, 13 Jun 2019, Simon Leinen wrote:
> Sage Weil writes:
> >> 2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column families: 
> >> [default]
> >> Unrecognized command: stats
> >> ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
> >> rocksdb::Version::~Version(): Assertion `path_id < 
> >> cfd_->ioptions()->cf_paths.size()' failed.
> >> *** Caught signal (Aborted) **
> 
> > Ah, this looks promising.. it looks like it got it open and has some
> problem with the error/teardown path.
> 
> > Try 'compact' instead of 'stats'?
> 
> That ran for a while and then crashed, also in the destructor for
> rocksdb::Version, but with an otherwise different backtrace.  I'm
> attaching the log again.

Hmm, I'm pretty sure this is a shutdown problem, but not certain.  If you 
do

 ceph-kvstore-tool rocksdb /mnt/ceph/db list > keys

is the keys file huge?  Can you send the head and tail of it so we can 
make sure it looks complete?

One last thing to check:

 ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-NNN list > keys

and see if that behaves similarly or crashes in the way it did before when 
the OSD was starting.

If the exported version looks intact, I have a workaround that will 
make the osd use that external rocksdb db instead of the embedded one... 
basically,

 - symlink the db, db.wal, db.slow files from the osd dir 
(/var/lib/ceph/osd/ceph-NNN/db -> ... etc)
 - ceph-bluestore-tool --dev /var/lib/ceph/osd/ceph-NNN/block set-label-key -k 
bluefs -v 0
 - start osd
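For concreteness, an untested sketch of those three steps using the paths from 
this thread (osd.49, export under /mnt/ceph; no db.wal link since the export 
did not produce one):

 ln -s /mnt/ceph/db      /var/lib/ceph/osd/ceph-49/db
 ln -s /mnt/ceph/db.slow /var/lib/ceph/osd/ceph-49/db.slow
 ceph-bluestore-tool --dev /var/lib/ceph/osd/ceph-49/block set-label-key -k bluefs -v 0
 systemctl start ceph-osd@49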

but be warned this is fragile: there isn't a bluefs import function, so 
this OSD will be permanently in that weird state.  The goal will be to get 
it up and the PG/cluster behaving, and then eventually let rados recover 
elsewhere and reprovision this osd.

But first, let's make sure the external rocksdb has a complete set of 
keys!

sage



Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
Simon Leinen writes:
> Sage Weil writes:
>> Try 'compact' instead of 'stats'?

> That ran for a while and then crashed, also in the destructor for
> rocksdb::Version, but with an otherwise different backtrace. [...]

Oops, I forgot: Before it crashed, it did modify /mnt/ceph/db; the
overall size of that directory increased(!) from 3.9GB to 12GB.  The
compaction seems to have eaten two .log files, but created many more
.sst files.
-- 
Simon.


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
Sage Weil writes:
>> 2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column families: 
>> [default]
>> Unrecognized command: stats
>> ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
>> rocksdb::Version::~Version(): Assertion `path_id < 
>> cfd_->ioptions()->cf_paths.size()' failed.
>> *** Caught signal (Aborted) **

> Ah, this looks promising.. it looks like it got it open and has some
> problem with the error/teardown path.

> Try 'compact' instead of 'stats'?

That ran for a while and then crashed, also in the destructor for
rocksdb::Version, but with an otherwise different backtrace.  I'm
attaching the log again.
-- 
Simon.
leinen@unil0047:/mnt/ceph/db$ sudo ceph-kvstore-tool rocksdb /mnt/ceph/db 
compact
2019-06-13 00:00:08.650 7f00f4c510c0  1 rocksdb: do_open column families: 
[default]
ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
rocksdb::Version::~Version(): Assertion `path_id < 
cfd_->ioptions()->cf_paths.size()' failed.
*** Caught signal (Aborted) **
 in thread 7f00e5788700 thread_name:rocksdb:low0
 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus 
(stable)
 1: (()+0x12890) [0x7f00ea641890]
 2: (gsignal()+0xc7) [0x7f00e9531e97]
 3: (abort()+0x141) [0x7f00e9533801]
 4: (()+0x3039a) [0x7f00e952339a]
 5: (()+0x30412) [0x7f00e9523412]
 6: (rocksdb::Version::~Version()+0x224) [0x5603bd78bfe4]
 7: (rocksdb::Version::Unref()+0x35) [0x5603bd78c065]
 8: (rocksdb::Compaction::~Compaction()+0x25) [0x5603bd880f05]
 9: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc46) 
[0x5603bd6dab76]
 10: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
 rocksdb::Env::Priority)+0x141) [0x5603bd6dcfa1]
 11: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5603bd6dd5f7]
 12: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x267) 
[0x5603bd8dc847]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x49) 
[0x5603bd8dca29]
 14: (()+0xbd57f) [0x7f00e9f5757f]
 15: (()+0x76db) [0x7f00ea6366db]
 16: (clone()+0x3f) [0x7f00e961488f]
2019-06-13 00:05:09.471 7f00e5788700 -1 *** Caught signal (Aborted) **
 in thread 7f00e5788700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus 
(stable)
 1: (()+0x12890) [0x7f00ea641890]
 2: (gsignal()+0xc7) [0x7f00e9531e97]
 3: (abort()+0x141) [0x7f00e9533801]
 4: (()+0x3039a) [0x7f00e952339a]
 5: (()+0x30412) [0x7f00e9523412]
 6: (rocksdb::Version::~Version()+0x224) [0x5603bd78bfe4]
 7: (rocksdb::Version::Unref()+0x35) [0x5603bd78c065]
 8: (rocksdb::Compaction::~Compaction()+0x25) [0x5603bd880f05]
 9: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc46) 
[0x5603bd6dab76]
 10: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
 rocksdb::Env::Priority)+0x141) [0x5603bd6dcfa1]
 11: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5603bd6dd5f7]
 12: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x267) 
[0x5603bd8dc847]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x49) 
[0x5603bd8dca29]
 14: (()+0xbd57f) [0x7f00e9f5757f]
 15: (()+0x76db) [0x7f00ea6366db]
 16: (clone()+0x3f) [0x7f00e961488f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

--- begin dump of recent events ---
   -23> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command assert hook 0x5603be844130
   -22> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command abort hook 0x5603be844130
   -21> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perfcounters_dump hook 0x5603be844130
   -20> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command 1 hook 0x5603be844130
   -19> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf dump hook 0x5603be844130
   -18> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perfcounters_schema hook 0x5603be844130
   -17> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf histogram dump hook 0x5603be844130
   -16> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command 2 hook 0x5603be844130
   -15> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf schema hook 0x5603be844130
   -14> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf histogram schema hook 0x5603be844130
   -13> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf reset hook 0x5603be844130
   -12> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command config show hook 0x5603be844130
   -11> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command config 

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Simon Leinen wrote:
> We hope that we can get some access to S3 bucket indexes back, possibly
> by somehow dropping and re-creating those indexes.

Are all 3 OSDs crashing in the same way?

My guess is that the reshard process triggered some massive rocksdb 
transaction that in turn made all 3 replicas' rocksdbs break in the same 
way.  So we have to get one of them in working order one way or another.

If the exported rocksdb is read/write-able we can coax bluestore into 
using the external rocksdb instead of the internal one.  We need to make 
sure we understand what the issue is first, though! 

sage


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Simon Leinen wrote:
> Sage Weil writes:
> > What happens if you do
> 
> >  ceph-kvstore-tool rocksdb /mnt/ceph/db stats
> 
> (I'm afraid that our ceph-kvstore-tool doesn't know about a "stats"
> command; but it still tries to open the database.)
> 
> That aborts after complaining about many missing files in /mnt/ceph/db.
> 
> When I ( cd /mnt/ceph/db && sudo ln -s ../db.slow/* . ) and re-run,
> it still aborts, just without complaining about missing files.

Ah, yes--I forgot that part :)

> I'm attaching the output (stdout+stderr combined), in case that helps.
> 
> > or, if that works,
> 
> >  ceph-kvstore-tool rocksdb /mnt/ceph/db compact
> 
> > It looks like bluefs is happy (in that it can read the whole set 
> > of rocksdb files), so the question is if rocksdb can open them, or 
> > if there's some corruption or problem at the rocksdb level.
> 
> > The original crash is actually here:
> 
> >  ...
> >  9: (tc_new()+0x283) [0x7fbdbed8e943]
> >  10: (std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> >::_M_mutate(unsigned long, unsigned long, char
> > const*, unsigned long)+0x69) [0x5600b1268109]
> >  11: (std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> >::_M_append(char const*, unsigned long)+0x63)
> > [0x5600b12f5b43]
> >  12: (rocksdb::BlockBuilder::Add(rocksdb::Slice const&, rocksdb::Slice 
> > const&, rocksdb::Slice const*)+0x10b) [0x5600b1eaca9b]
> >  ...
> 
> > where tc_new is (I think) tcmalloc.  Which looks to me like rocksdb 
> > is probably trying to allocate something very big.  The question is will 
> > that happen with the exported files or only on bluefs...
> 
> Yes, that's what I was thinking as well.  The server seems to have about
> 50GB of free RAM though, so maybe it was more like really big :-)
> 
> Also, your ceph-kvstore-tool command seems to have crashed somewhere
> else (the destructor of a rocksdb::Version object?)
> 
>   2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column families: 
> [default]
>   Unrecognized command: stats
>   ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
> rocksdb::Version::~Version(): Assertion `path_id < 
> cfd_->ioptions()->cf_paths.size()' failed.
>   *** Caught signal (Aborted) **

Ah, this looks promising.. it looks like it got it open and has some 
problem with the error/teardown path.

Try 'compact' instead of 'stats'?

sage


>in thread 7f724b27f0c0 thread_name:ceph-kvstore-to
>ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus 
> (stable)
>1: (()+0x12890) [0x7f7240c6f890]
>2: (gsignal()+0xc7) [0x7f723fb5fe97]
>3: (abort()+0x141) [0x7f723fb61801]
>4: (()+0x3039a) [0x7f723fb5139a]
>5: (()+0x30412) [0x7f723fb51412]
>6: (rocksdb::Version::~Version()+0x224) [0x559749529fe4]
>7: (rocksdb::Version::Unref()+0x35) [0x55974952a065]
>8: (rocksdb::SuperVersion::Cleanup()+0x68) [0x55974960f328]
>9: (rocksdb::ColumnFamilyData::~ColumnFamilyData()+0xf4) [0x5597496123d4]
>10: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0xb8) [0x559749612ba8]
>11: (rocksdb::VersionSet::~VersionSet()+0x4d) [0x55974951da5d]
>12: (rocksdb::DBImpl::CloseHelper()+0x6a8) [0x55974944a868]
>13: (rocksdb::DBImpl::~DBImpl()+0x65b) [0x559749455deb]
>14: (rocksdb::DBImpl::~DBImpl()+0x11) [0x559749455e21]
>15: (RocksDBStore::~RocksDBStore()+0xe9) [0x559749265349]
>16: (RocksDBStore::~RocksDBStore()+0x9) [0x559749265599]
>17: (main()+0x307) [0x5597490b5fb7]
>18: (__libc_start_main()+0xe7) [0x7f723fb42b97]
>19: (_start()+0x2a) [0x55974918e03a]
>   2019-06-12 23:40:51.363 7f724b27f0c0 -1 *** Caught signal (Aborted) **
>in thread 7f724b27f0c0 thread_name:ceph-kvstore-to
> 
> > Thanks!
> 
> Thanks so much for looking into this!
> 
> We hope that we can get some access to S3 bucket indexes back, possibly
> by somehow dropping and re-creating those indexes.
> -- 
> Simon.
> 
> 


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
Sage Weil writes:
> What happens if you do

>  ceph-kvstore-tool rocksdb /mnt/ceph/db stats

(I'm afraid that our ceph-kvstore-tool doesn't know about a "stats"
command; but it still tries to open the database.)

That aborts after complaining about many missing files in /mnt/ceph/db.

When I ( cd /mnt/ceph/db && sudo ln -s ../db.slow/* . ) and re-run,
it still aborts, just without complaining about missing files.

I'm attaching the output (stdout+stderr combined), in case that helps.

> or, if that works,

>  ceph-kvstore-tool rocksdb /mnt/ceph/db compact

> It looks like bluefs is happy (in that it can read the whole set 
> of rocksdb files), so the question is if rocksdb can open them, or 
> if there's some corruption or problem at the rocksdb level.

> The original crash is actually here:

>  ...
>  9: (tc_new()+0x283) [0x7fbdbed8e943]
>  10: (std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >::_M_mutate(unsigned long, unsigned long, char const*,
> unsigned long)+0x69) [0x5600b1268109]
>  11: (std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >::_M_append(char const*, unsigned long)+0x63)
> [0x5600b12f5b43]
>  12: (rocksdb::BlockBuilder::Add(rocksdb::Slice const&, rocksdb::Slice 
> const&, rocksdb::Slice const*)+0x10b) [0x5600b1eaca9b]
>  ...

> where tc_new is (I think) tcmalloc.  Which looks to me like rocksdb 
> is probably trying to allocate something very big.  The question is will 
> that happen with the exported files or only on bluefs...

Yes, that's what I was thinking as well.  The server seems to have about
50GB of free RAM though, so maybe it was more like really big :-)

Also, your ceph-kvstore-tool command seems to have crashed somewhere
else (the destructor of a rocksdb::Version object?)

  2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column families: 
[default]
  Unrecognized command: stats
  ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
rocksdb::Version::~Version(): Assertion `path_id < 
cfd_->ioptions()->cf_paths.size()' failed.
  *** Caught signal (Aborted) **
   in thread 7f724b27f0c0 thread_name:ceph-kvstore-to
   ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus 
(stable)
   1: (()+0x12890) [0x7f7240c6f890]
   2: (gsignal()+0xc7) [0x7f723fb5fe97]
   3: (abort()+0x141) [0x7f723fb61801]
   4: (()+0x3039a) [0x7f723fb5139a]
   5: (()+0x30412) [0x7f723fb51412]
   6: (rocksdb::Version::~Version()+0x224) [0x559749529fe4]
   7: (rocksdb::Version::Unref()+0x35) [0x55974952a065]
   8: (rocksdb::SuperVersion::Cleanup()+0x68) [0x55974960f328]
   9: (rocksdb::ColumnFamilyData::~ColumnFamilyData()+0xf4) [0x5597496123d4]
   10: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0xb8) [0x559749612ba8]
   11: (rocksdb::VersionSet::~VersionSet()+0x4d) [0x55974951da5d]
   12: (rocksdb::DBImpl::CloseHelper()+0x6a8) [0x55974944a868]
   13: (rocksdb::DBImpl::~DBImpl()+0x65b) [0x559749455deb]
   14: (rocksdb::DBImpl::~DBImpl()+0x11) [0x559749455e21]
   15: (RocksDBStore::~RocksDBStore()+0xe9) [0x559749265349]
   16: (RocksDBStore::~RocksDBStore()+0x9) [0x559749265599]
   17: (main()+0x307) [0x5597490b5fb7]
   18: (__libc_start_main()+0xe7) [0x7f723fb42b97]
   19: (_start()+0x2a) [0x55974918e03a]
  2019-06-12 23:40:51.363 7f724b27f0c0 -1 *** Caught signal (Aborted) **
   in thread 7f724b27f0c0 thread_name:ceph-kvstore-to

> Thanks!

Thanks so much for looking into this!

We hope that we can get some access to S3 bucket indexes back, possibly
by somehow dropping and re-creating those indexes.
-- 
Simon.

2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column families: 
[default]
Unrecognized command: stats
ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
rocksdb::Version::~Version(): Assertion `path_id < 
cfd_->ioptions()->cf_paths.size()' failed.
*** Caught signal (Aborted) **
 in thread 7f724b27f0c0 thread_name:ceph-kvstore-to
 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus 
(stable)
 1: (()+0x12890) [0x7f7240c6f890]
 2: (gsignal()+0xc7) [0x7f723fb5fe97]
 3: (abort()+0x141) [0x7f723fb61801]
 4: (()+0x3039a) [0x7f723fb5139a]
 5: (()+0x30412) [0x7f723fb51412]
 6: (rocksdb::Version::~Version()+0x224) [0x559749529fe4]
 7: (rocksdb::Version::Unref()+0x35) [0x55974952a065]
 8: (rocksdb::SuperVersion::Cleanup()+0x68) [0x55974960f328]
 9: (rocksdb::ColumnFamilyData::~ColumnFamilyData()+0xf4) [0x5597496123d4]
 10: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0xb8) [0x559749612ba8]
 11: (rocksdb::VersionSet::~VersionSet()+0x4d) [0x55974951da5d]
 12: (rocksdb::DBImpl::CloseHelper()+0x6a8) [0x55974944a868]
 13: (rocksdb::DBImpl::~DBImpl()+0x65b) [0x559749455deb]
 14: (rocksdb::DBImpl::~DBImpl()+0x11) [0x559749455e21]
 15: (RocksDBStore::~RocksDBStore()+0xe9) [0x559749265349]
 16: (RocksDBStore::~RocksDBStore()+0x9) [0x559749265599]
 17: (main()+0x307) [0x5597490b5fb7]
 18: (__libc_start_main()+0xe7) [0x7f723fb42b97]
 19: (_start()+0x2a) [0x55974918e03a]
2019-06-12 23:40:51.363 7f724b27f0c0 -1 *** Caught 

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Simon Leinen wrote:
> Dear Sage,
> 
> > Also, can you try ceph-bluestore-tool bluefs-export on this osd?  I'm
> > pretty sure it'll crash in the same spot, but just want to confirm
> > it's a bluefs issue.
> 
> To my surprise, this actually seems to have worked:
> 
>   $ time sudo ceph-bluestore-tool --out-dir /mnt/ceph bluefs-export --path 
> /var/lib/ceph/osd/ceph-49
>   inferring bluefs devices from bluestore path
>slot 2 /var/lib/ceph/osd/ceph-49/block -> /dev/dm-9
>slot 1 /var/lib/ceph/osd/ceph-49/block.db -> /dev/dm-8
>   db/
>   db/072900.sst
>   db/072901.sst
>   db/076487.sst
>   db/076488.sst
>   db/076489.sst
>   db/076490.sst
>   [...]
>   db/079726.sst
>   db/079727.log
>   db/CURRENT
>   db/IDENTITY
>   db/LOCK
>   db/MANIFEST-053662
>   db/OPTIONS-053662
>   db/OPTIONS-053665
>   db.slow/
>   db.slow/049192.sst
>   db.slow/049193.sst
>   db.slow/049831.sst
>   db.slow/057443.sst
>   db.slow/057444.sst
>   db.slow/058254.sst
>   [...]
>   db.slow/079718.sst
>   db.slow/079719.sst
>   db.slow/079720.sst
>   db.slow/079721.sst
>   db.slow/079722.sst
>   db.slow/079723.sst
>   db.slow/079724.sst
>   
>   real5m19.953s
>   user0m0.101s
>   sys 1m5.571s
>   leinen@unil0047:/var/lib/ceph/osd/ceph-49$
> 
> It left 3GB in /mnt/ceph/db (55 files of varying sizes),
> 
> and 39GB in /mnt/ceph/db.slow (620 files of mostly 68MB each).

What happens if you do

 ceph-kvstore-tool rocksdb /mnt/ceph/db stats

or, if that works,

 ceph-kvstore-tool rocksdb /mnt/ceph/db compact

It looks like bluefs is happy (in that it can read the whole set 
of rocksdb files), so the question is if rocksdb can open them, or 
if there's some corruption or problem at the rocksdb level.

The original crash is actually here:

 ...
 9: (tc_new()+0x283) [0x7fbdbed8e943]
 10: (std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >::_M_mutate(unsigned long, unsigned long, char const*,
unsigned long)+0x69) [0x5600b1268109]
 11: (std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >::_M_append(char const*, unsigned long)+0x63)
[0x5600b12f5b43]
 12: (rocksdb::BlockBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&, 
rocksdb::Slice const*)+0x10b) [0x5600b1eaca9b]
 ...

where tc_new is (I think) tcmalloc.  Which looks to me like rocksdb 
is probably trying to allocate something very big.  The question is will 
that happen with the exported files or only on bluefs...

Thanks!
sage



Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
Dear Sage,

> Also, can you try ceph-bluestore-tool bluefs-export on this osd?  I'm
> pretty sure it'll crash in the same spot, but just want to confirm
> it's a bluefs issue.

To my surprise, this actually seems to have worked:

  $ time sudo ceph-bluestore-tool --out-dir /mnt/ceph bluefs-export --path 
/var/lib/ceph/osd/ceph-49
  inferring bluefs devices from bluestore path
   slot 2 /var/lib/ceph/osd/ceph-49/block -> /dev/dm-9
   slot 1 /var/lib/ceph/osd/ceph-49/block.db -> /dev/dm-8
  db/
  db/072900.sst
  db/072901.sst
  db/076487.sst
  db/076488.sst
  db/076489.sst
  db/076490.sst
  [...]
  db/079726.sst
  db/079727.log
  db/CURRENT
  db/IDENTITY
  db/LOCK
  db/MANIFEST-053662
  db/OPTIONS-053662
  db/OPTIONS-053665
  db.slow/
  db.slow/049192.sst
  db.slow/049193.sst
  db.slow/049831.sst
  db.slow/057443.sst
  db.slow/057444.sst
  db.slow/058254.sst
  [...]
  db.slow/079718.sst
  db.slow/079719.sst
  db.slow/079720.sst
  db.slow/079721.sst
  db.slow/079722.sst
  db.slow/079723.sst
  db.slow/079724.sst
  
  real  5m19.953s
  user  0m0.101s
  sys   1m5.571s
  leinen@unil0047:/var/lib/ceph/osd/ceph-49$

It left 3GB in /mnt/ceph/db (55 files of varying sizes),

and 39GB in /mnt/ceph/db.slow (620 files of mostly 68MB each).

Is there anything we can do with this?
-- 
Simon.


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Harald Staub wrote:
> On 12.06.19 17:40, Sage Weil wrote:
> > On Wed, 12 Jun 2019, Harald Staub wrote:
> > > Also opened an issue about the rocksdb problem:
> > > https://tracker.ceph.com/issues/40300
> > 
> > Thanks!
> > 
> > The 'rocksdb: Corruption: file is too short' is the root of the problem
> > here. Can you try starting the OSD with 'debug_bluestore=20' and
> > 'debug_bluefs=20'?  (And attach them to the ticket, or ceph-post-file and
> > put the uuid in the ticket..)
> 
> Now we have a log file of size 270 GB, so we tried to extract the interesting
> part:
> ceph-post-file: 78842f1f-1981-4a81-b3f4-3a3ca1dc6ae3
> Is that ok?

Hmm, it looks like bluefs is crashing reading a very large .sst file 
(~13GB).  Can you look back in the logs a bit to see the *first* crash of 
this OSD?  Presumably it's not part of BlueStore::_mount ...

Also, can you try ceph-bluestore-tool bluefs-export on this osd?  I'm 
pretty sure it'll crash in the same spot, but just want to confirm 
it's a bluefs issue.

Thanks!
s



Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Harald Staub

On 12.06.19 17:40, Sage Weil wrote:

On Wed, 12 Jun 2019, Harald Staub wrote:

Also opened an issue about the rocksdb problem:
https://tracker.ceph.com/issues/40300


Thanks!

The 'rocksdb: Corruption: file is too short' is the root of the problem
here. Can you try starting the OSD with 'debug_bluestore=20' and
'debug_bluefs=20'?  (And attach them to the ticket, or ceph-post-file and
put the uuid in the ticket..)


Now we have a log file of size 270 GB, so we tried to extract the 
interesting part:

ceph-post-file: 78842f1f-1981-4a81-b3f4-3a3ca1dc6ae3
Is that ok?

Thank you!
 Harry


Re: [ceph-users] Error when I compare hashes of export-diff / import-diff

2019-06-12 Thread Rafael Diaz Maurin

Le 12/06/2019 à 16:01, Jason Dillaman a écrit :

On Wed, Jun 12, 2019 at 9:50 AM Rafael Diaz Maurin wrote:

Hello Jason,

Le 11/06/2019 à 15:31, Jason Dillaman a écrit :

4- I export the snapshot from the source pool and I import the snapshot
towards the destination pool (in the pipe)
rbd export-diff --from-snap ${LAST-SNAP}
${POOL-SOURCE}/${KVM-IMAGE}@${TODAY-SNAP} - | rbd -c ${BACKUP-CLUSTER}
import-diff - ${POOL-DESTINATION}/${KVM-IMAGE}

What's the actual difference between the "rbd diff" outputs? There is
a known "issue" where object-map will flag an object as dirty if you
had run an fstrim/discard on the image, but it doesn't affect the
actual validity of the data.


The feature discard=on is activated, and qemu-guest-agent is running on
the guest.

Here are the only differences between the 2 outputs of the "rbd diff
--format plain" (image source and image destination) :
Offset  Length  Type
121c121
< 14103347200 2097152 data
---
  > 14103339008 2105344 data
153c153
< 14371782656 2097152 data
---
  > 14369685504 4194304 data
216c216
< 14640218112 2097152 data
---
  > 14638120960 4194304 data
444c444
< 15739125760 2097152 data
---
  > 15738519552 2703360 data

And the hashes of the exports are identical (between source and
destination) :
rbd -p ${POOL-SOURCE} export ${KVM-IMAGE}@${TODAY-SNAP} - | md5sum
=> ee7012e14870b36e7b9695e52c417c06

rbd -c ${BACKUP-CLUSTER} -p ${POOL-DESTINATION} export
${KVM-IMAGE}@${TODAY-SNAP} - | md5sum
=> ee7012e14870b36e7b9695e52c417c06


So do you think this can be caused by the fstrim/discard feature?

You said you weren't using fstrim/discard.


In fact, the fstrim daemon is not enabled in the guest, but... in 
Proxmox I set the option discard=on, and maybe I ran fstrim once during 
heavy testing...



If you export both images
and compare those image extents where the diffs are different, is it
just filled w/ zeroes?


I can't find how to compare whether the differing extents are filled 
with zeros.
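(One possible approach, sketched with GNU dd against one of the differing 
extents listed above; if the final byte count printed is 0, that extent is all 
zeroes:)

 rbd -p ${POOL-SOURCE} export ${KVM-IMAGE}@${TODAY-SNAP} /tmp/src.raw
 dd if=/tmp/src.raw iflag=skip_bytes,count_bytes skip=14103339008 count=2105344 \
   2>/dev/null | tr -d '\0' | wc -c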


However, I finally succeeded in getting identical "rbd diff" hashes with 
the option "--whole-object", like this:


Hash source:
rbd diff --from-snap ${LAST-SNAP} ${POOL-SOURCE}/${KVM-IMAGE}@${TODAY-SNAP} \
  --format json --whole-object | md5sum | cut -d ' ' -f 1


Hash destination:
rbd -c ${BACKUP-CLUSTER} diff --from-snap ${LAST-SNAP} \
  ${POOL-DESTINATION}/${KVM-IMAGE}@${TODAY-SNAP} --format json --whole-object \
  | md5sum | cut -d ' ' -f 1



Best regards,
Rafael



--
Rafael Diaz Maurin
DSI de l'Université de Rennes 1
Pôle Infrastructures, équipe Systèmes
02 23 23 71 57






[ceph-users] Ceph Cluster Replication / Disaster Recovery

2019-06-12 Thread DHilsbos
All;

I'm testing and evaluating Ceph for the next generation of storage architecture 
for our company, and so far I'm fairly impressed, but I've got a couple of 
questions around cluster replication and disaster recovery.

First; intended uses.
Ceph Object Gateway will be used to support new software projects presently in 
the works.
CephFS behind Samba will be used for Windows file shares both during 
transition, and to support long term needs.
The ISCSi gateway and RADOS Block Devices will be used to support 
virtualization.

My research suggests that native replication isn't available within the Ceph 
Cluster (i.e. have a cluster replicate all objects to a second cluster).  
RADOSgw supports replicating objects into more than one Ceph cluster, but I 
can't find information on multi-site / replication for ISCSigw or CephFS.

So... How do you plan / manage major event disaster recovery with your Ceph 
Clusters (i.e. loss of the entire cluster)?
What backup solutions do you use / recommend with your Ceph clusters?  Are you 
doing any off-site backup?
Anyone backing up to the cloud?  What kind of bandwidth are you using for this?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com





Re: [ceph-users] [Ceph-large] Large Omap Warning on Log pool

2019-06-12 Thread Aaron Bassett
Correct, it was pre-jewel. I believe we toyed with multisite replication back 
then so it may have gotten baked into the zonegroup inadvertently. Thanks for 
the info!

> On Jun 12, 2019, at 11:08 AM, Casey Bodley  wrote:
> 
> Hi Aaron,
> 
> The data_log objects are storing logs for multisite replication. Judging by 
> the pool name '.us-phx2.log', this cluster was created before jewel. Are you 
> (or were you) using multisite or radosgw-agent?
> 
> If not, you'll want to turn off the logging (log_meta and log_data -> false) 
> in your zonegroup configuration using 'radosgw-admin zonegroup get/set', 
> restart gateways, then delete the data_log and meta_log objects.
> 
> If it is multisite, then the logs should all be trimmed in the background as 
> long as all peer zones are up-to-date. There was a bug prior to 12.2.12 that 
> prevented datalog trimming 
> (http://tracker.ceph.com/issues/38412).
> 
> Casey
> 
> 
> On 6/11/19 5:41 PM, Aaron Bassett wrote:
>> Hey all,
>> I've just recently upgraded some of my larger rgw clusters to latest 
>> luminous and now I'm getting a lot of warnings about large omap objects. 
>> Most of them were on the indices and I've taken care of them by sharding 
>> where appropriate. However on two of my clusters I have a large object in 
>> the rgw log pool.
>> 
>> ceph health detail
>> HEALTH_WARN 1 large omap objects
>> LARGE_OMAP_OBJECTS 1 large omap objects
>> 1 large objects found in pool '.us-phx2.log'
>> Search the cluster log for 'Large omap object found' for more details.
>> 
>> 
>> 2019-06-11 10:50:04.583354 7f8d2b737700  0 log_channel(cluster) log [WRN] : 
>> Large omap object found. Object: 51:b9a904f6:::data_log.27:head Key count: 
>> 15903755 Size (bytes): 2305116273
>> 
>> 
>> I'm not sure what to make of this. I don't see much chatter on the mailing 
>> lists about the log pool, other than a thread about swift lifecycles, which 
>> I dont use.  The log pool is pretty large, making it difficult to poke 
>> around in there:
>> 
>> .us-phx2.log    51    118GiB    0.03    384TiB    12782413
>> 
>> That said I did a little poking around and it looks like a mix of these 
>> data_log object and some delete hints, but mostly a lot of objects starting 
>> with dates that point to different s3 pools. The object referenced in the 
>> osd log has 15912300  omap keys, and spot checking it, it looks like it's 
>> mostly referencing a pool we use with our dns resolver. We have a dns 
>> service that checks rgw endpoint health by uploading and deleting an object 
>> every few minutes to check health, and adds/removes endpoints from the A 
>> record as indicated.
>> 
>> So I guess I've got a few questions:
>> 
>> 1) what is the nature of the data in the data_log.* objects in the log pool? 
>> Is it safe to remove or is it more like a binlog that needs to be intact 
>> from the beginning of time?
>> 
>> 2) with the log pool in general, beyond the individual objects omap sizes, 
>> is there any concern about size? If so, is there a way to force it to 
>> truncate? I see some log commands in radosgw-admin, but documentation is 
>> light.
>> 
>> 
>> Thanks,
>> Aaron
>> 
>> CONFIDENTIALITY NOTICE
>> This e-mail message and any attachments are only for the use of the intended 
>> recipient and may contain information that is privileged, confidential or 
>> exempt from disclosure under applicable law. If you are not the intended 
>> recipient, any disclosure, distribution or other use of this e-mail message 
>> or attachments is prohibited. If you have received this e-mail message in 
>> error, please delete and notify the sender immediately. Thank you.
>> 




[ceph-users] RGW Multisite Q's

2019-06-12 Thread Peter Eisch
Hi,

Could someone be able to point me to a blog or documentation page which helps 
me resolve the issues noted below?

All nodes are Luminous, 12.2.12; one realm, one zonegroup (clustered haproxies 
fronting), two zones (three rgw in each); All endpoint references to each zone 
go through an haproxy.

Replacing a Swift config with RGW has been interesting.  Crafting 
a functional configuration from blog posts and documentation takes time.  It 
was crucial to find and use 
http://docs.ceph.com/docs/luminous/radosgw/multisite/ instead of 
http://docs.ceph.com/docs/master/radosgw/config-ref/ except parts suggest 
incorrect configurations.  I've submitted corrections to the former in #28517, 
for what it's worth.

Through this I'm now finding fewer resources to help explain the abundance of 
404's in the gateway logs:

  "GET 
/admin/log/?type=data=8=true= 
HTTP/1.1" 404 0 - -
  "GET 
/admin/log/?type=data=8=true= 
HTTP/1.1" 404 0 - -
  "GET 
/admin/log/?type=data=8=true= 
HTTP/1.1" 404 0 - -
  "GET 
/admin/log/?type=data=8=true= 
HTTP/1.1" 404 0 - -

These run into the hundreds of thousands.  The site seems to work with just 
minimal testing so far.  The 404's also seem to be limited to the data queries 
while the metadata queries are mostly more successful with 200's.

  "GET 
/admin/log?type=metadata=55=58b43d07-03e2-48e4-b2dc-74d64ef7f0c9=100&=
 HTTP/1.1" 200 0 - -
   "GET 
/admin/log?type=metadata=45=58b43d07-03e2-48e4-b2dc-74d64ef7f0c9=100&==
 HTTP/1.1" 200 0 - -
  "GET 
/admin/log?type=metadata=4=58b43d07-03e2-48e4-b2dc-74d64ef7f0c9=100&==
 HTTP/1.1" 200 0 - -
   "GET 
/admin/log?type=metadata=35=58b43d07-03e2-48e4-b2dc-74d64ef7f0c9=100&==
 HTTP/1.1" 200 0 - -

Q: How do I address the 404 events to help them succeed?


Other log events which I cannot resolve are the tens of thousands (even while 
no reads or writes are requested) of:

  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... etc.
These seem to fire off every 30 seconds but don't seem to be managed by the "rgw 
usage log tick interval" or "rgw init timeout" values.  Meanwhile the usage 
between the two zones matches for each bucket.

Q:  What are these log events indicating?
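(For reference, a few sync-status commands that usually help narrow down errors 
like these; syntax as in Luminous 12.2.x, and the zone name below is a 
placeholder:)

 radosgw-admin sync status
 radosgw-admin metadata sync status
 radosgw-admin data sync status --source-zone=<other-zone>
 radosgw-admin sync error list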

Thanks,

peter





Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Harald Staub wrote:
> Also opened an issue about the rocksdb problem:
> https://tracker.ceph.com/issues/40300

Thanks!

The 'rocksdb: Corruption: file is too short' is the root of the problem 
here. Can you try starting the OSD with 'debug_bluestore=20' and 
'debug_bluefs=20'?  (And attach them to the ticket, or ceph-post-file and 
put the uuid in the ticket..)

Thanks!
sage

> 
> On 12.06.19 16:06, Harald Staub wrote:
> > We ended in a bad situation with our RadosGW (Cluster is Nautilus 
> > 14.2.1, 350 OSDs with BlueStore):
> > 
> > 1. There is a bucket with about 60 million objects, without shards.
> > 
> > 2. radosgw-admin bucket reshard --bucket $BIG_BUCKET --num-shards 1024
> > 
> > 3. Resharding looked fine first, it counted up to the number of objects, 
> > but then it hung.
> > 
> > 4. 3 OSDs crashed with a segfault: "rocksdb: Corruption: file is too short"
> > 
> > 5. Trying to start the OSDs manually led to the same segfaults.
> > 
> > 6. ceph-bluestore-tool repair ...
> > 
> > 7. The repairs all aborted, with the same rocksdb error as above.
> > 
> > 8. Now 1 PG is stale. It belongs to the radosgw bucket index pool, and 
> > it contained the index of this big bucket.
> > 
> > Is there any hope in getting these rocksdbs up again?
> > 
> > Otherwise: how would we fix the bucket index pool? Our ideas:
> > 
> > 1. ceph pg $BAD_PG mark_unfound_lost delete
> > 2. rados -p .rgw.buckets ls, search $BAD_BUCKET_ID and remove these 
> > objects. The hope of this step would be to make the following step 
> > faster, and avoid another similar problem.
> > 3. radosgw-admin bucket check --check-objects
> > 
> > Will this really rebuild the bucket index? Is it ok to leave the 
> > existing bucket indexes in place? Is it ok to run for all buckets at 
> > once, or has it to be run bucket by bucket? Is there a risk that the 
> > indexes that are not affected by the BAD_PG will be broken afterwards?
> > 
> > Some more details that may be of interest.
> > 
> > ceph-bluestore-repair says:
> > 
> > 2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: file is too 
> > short (6139497190 bytes) to be an sstabledb/079728.sst
> > 2019-06-12 11:15:38.345 7f56269670c0 -1 
> > bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:
> > error from fsck: (5) Input/output error
> > 
> > The repairs also showed several warnings like:
> > 
> > tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @ 
> > 0x7f5626521887 0x56126a287229 0x56126a2873a3 0x56126a5dc1ec 
> > 0x56126a584ce2 0x56126a586a05 0x56126a587dd0 0x56126a589344 
> > 0x56126a38c3cf 0x56126a2eae94 0x56126a30654e 0x56126a337ae1 
> > 0x56126a1a73a1 0x7f561b228b97 0x56126a28077a
> > 
> > The processes showed up with like 45 GB of RAM used. Fortunately, there 
> > was no Out-Of-Memory.
> > 
> >   Harry


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Casey Bodley

Hi Harald,

If the bucket reshard didn't complete, it's most likely one of the new 
bucket index shards that got corrupted here and the original index shard 
should still be intact. Does $BAD_BUCKET_ID correspond to the 
new/resharded instance id? If so, once the rocksdb/osd issues are 
resolved, you should still be able to access and write to the bucket. 
The 'radosgw-admin reshard stale-instances list/rm' commands should be 
able to detect and clean up after the failed reshard. Without knowing 
more about the rocksdb problem, it's hard to tell whether it's safe to 
re-reshard.
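(A sketch of that cleanup, to be run once the OSDs/PG are healthy again; command 
names as in Nautilus:)

 radosgw-admin reshard stale-instances list   # show leftover bucket instances from failed reshards
 radosgw-admin reshard stale-instances rm     # remove them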


Casey


On 6/12/19 10:31 AM, Harald Staub wrote:

Also opened an issue about the rocksdb problem:
https://tracker.ceph.com/issues/40300

On 12.06.19 16:06, Harald Staub wrote:
We ended in a bad situation with our RadosGW (Cluster is Nautilus 
14.2.1, 350 OSDs with BlueStore):


1. There is a bucket with about 60 million objects, without shards.

2. radosgw-admin bucket reshard --bucket $BIG_BUCKET --num-shards 1024

3. Resharding looked fine first, it counted up to the number of 
objects, but then it hung.


4. 3 OSDs crashed with a segfault: "rocksdb: Corruption: file is too 
short"


5. Trying to start the OSDs manually led to the same segfaults.

6. ceph-bluestore-tool repair ...

7. The repairs all aborted, with the same rocksdb error as above.

8. Now 1 PG is stale. It belongs to the radosgw bucket index pool, 
and it contained the index of this big bucket.


Is there any hope in getting these rocksdbs up again?

Otherwise: how would we fix the bucket index pool? Our ideas:

1. ceph pg $BAD_PG mark_unfound_lost delete
2. rados -p .rgw.buckets ls, search $BAD_BUCKET_ID and remove these 
objects. The hope of this step would be to make the following step 
faster, and avoid another similar problem.

3. radosgw-admin bucket check --check-objects

Will this really rebuild the bucket index? Is it ok to leave the 
existing bucket indexes in place? Is it ok to run for all buckets at 
once, or has it to be run bucket by bucket? Is there a risk that the 
indexes that are not affected by the BAD_PG will be broken afterwards?


Some more details that may be of interest.

ceph-bluestore-repair says:

2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: file is 
too short (6139497190 bytes) to be an sstabledb/079728.sst
2019-06-12 11:15:38.345 7f56269670c0 -1 
bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:

error from fsck: (5) Input/output error

The repairs also showed several warnings like:

tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @ 
0x7f5626521887 0x56126a287229 0x56126a2873a3 0x56126a5dc1ec 
0x56126a584ce2 0x56126a586a05 0x56126a587dd0 0x56126a589344 
0x56126a38c3cf 0x56126a2eae94 0x56126a30654e 0x56126a337ae1 
0x56126a1a73a1 0x7f561b228b97 0x56126a28077a


The processes showed up with like 45 GB of RAM used. Fortunately, 
there was no Out-Of-Memory.


  Harry


Re: [ceph-users] [Ceph-large] Large Omap Warning on Log pool

2019-06-12 Thread Casey Bodley

Hi Aaron,

The data_log objects are storing logs for multisite replication. Judging 
by the pool name '.us-phx2.log', this cluster was created before jewel. 
Are you (or were you) using multisite or radosgw-agent?


If not, you'll want to turn off the logging (log_meta and log_data -> 
false) in your zonegroup configuration using 'radosgw-admin zonegroup 
get/set', restart gateways, then delete the data_log and meta_log objects.
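(For the non-multisite case, a rough, untested sketch of that cleanup, assuming 
a single/default zonegroup and using the pool name from this thread:)

 radosgw-admin zonegroup get > zonegroup.json
 # edit zonegroup.json: set "log_meta": "false" and "log_data": "false"
 radosgw-admin zonegroup set < zonegroup.json
 radosgw-admin period update --commit    # only if a realm/period is configured
 # restart the radosgw daemons, then remove the log objects, e.g.:
 rados -p .us-phx2.log ls | grep '^data_log\.' \
     | xargs -r -n 1 rados -p .us-phx2.log rm   # likewise for the meta log objects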


If it is multisite, then the logs should all be trimmed in the 
background as long as all peer zones are up-to-date. There was a bug 
prior to 12.2.12 that prevented datalog trimming 
(http://tracker.ceph.com/issues/38412).


Casey


On 6/11/19 5:41 PM, Aaron Bassett wrote:

Hey all,
I've just recently upgraded some of my larger rgw clusters to latest luminous 
and now I'm getting a lot of warnings about large omap objects. Most of them 
were on the indices and I've taken care of them by sharding where appropriate. 
However on two of my clusters I have a large object in the rgw log pool.

ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
 1 large objects found in pool '.us-phx2.log'
 Search the cluster log for 'Large omap object found' for more details.


2019-06-11 10:50:04.583354 7f8d2b737700  0 log_channel(cluster) log [WRN] : 
Large omap object found. Object: 51:b9a904f6:::data_log.27:head Key count: 
15903755 Size (bytes): 2305116273


I'm not sure what to make of this. I don't see much chatter on the mailing 
lists about the log pool, other than a thread about swift lifecycles, which I 
dont use.  The log pool is pretty large, making it difficult to poke around in 
there:

.us-phx2.log    51    118GiB    0.03    384TiB    12782413

That said I did a little poking around and it looks like a mix of these 
data_log object and some delete hints, but mostly a lot of objects starting 
with dates that point to different s3 pools. The object referenced in the osd 
log has 15912300  omap keys, and spot checking it, it looks like it's mostly 
referencing a pool we use with our dns resolver. We have a dns service that 
checks rgw endpoint health by uploading and deleting an object every few 
minutes to check health, and adds/removes endpoints from the A record as 
indicated.

So I guess I've got a few questions:

1) what is the nature of the data in the data_log.* objects in the log pool? Is 
it safe to remove or is it more like a binlog that needs to be intact from the 
beginning of time?

2) with the log pool in general, beyond the individual objects omap sizes, is 
there any concern about size? If so, is there a way to force it to truncate? I 
see some log commands in radosgw-admin, but documentation is light.


Thanks,
Aaron

CONFIDENTIALITY NOTICE
This e-mail message and any attachments are only for the use of the intended 
recipient and may contain information that is privileged, confidential or 
exempt from disclosure under applicable law. If you are not the intended 
recipient, any disclosure, distribution or other use of this e-mail message or 
attachments is prohibited. If you have received this e-mail message in error, 
please delete and notify the sender immediately. Thank you.



Re: [ceph-users] RFC: relicence Ceph LGPL-2.1 code as LGPL-2.1 or LGPL-3.0

2019-06-12 Thread Sage Weil
On Fri, 10 May 2019, Sage Weil wrote:
> Hi everyone,
> 
> -- What --
> 
> The Ceph Leadership Team[1] is proposing a change of license from 
> *LGPL-2.1* to *LGPL-2.1 or LGPL-3.0* (dual license). The specific changes 
> are described by this pull request:
> 
>   https://github.com/ceph/ceph/pull/22446
> 
> If you are a Ceph developer who has contributed code to Ceph and object to 
> this change of license, please let us know, either by replying to this 
> message or by commenting on that pull request.
> 
> Our plan is to leave the issue open for comment for some period of time 
> and, if no objections are raised that cannot be adequately addressed (via 
> persuasion, code replacement, or whatever) we will move forward with the 
> change.

We've heard no concerns about this change, so I just merged the pull 
request.  Thank you, everyone!


Robin suggested that we also add SPDX tags to all files.  IIUC those look 
like this:

 // SPDX-License-Identifier: LGPL-2.1 or LGPL-3.0

This sounds like a fine idea.  Any takers?  Note that this can't replace 
the COPYING and debian/copyright files, the latter of which at least 
is needed by Debian.  But additional and explicit license notifications 
in each file sounds like a good thing.

Thanks!
sage



Re: [ceph-users] [Ceph-community] Monitors not in quorum (1 of 3 live)

2019-06-12 Thread Lluis Arasanz i Nonell - Adam
Hi,

If there is nothing special about the defined "initial monitors" on the cluster, we'll try to 
remove mon01 from the cluster.

I mention the "initial monitor" because in our ceph deployment there is 
only one monitor listed as "initial":

[root@mon01 ceph]# cat /etc/ceph/ceph.conf
[global]
fsid = ----
mon_initial_members = mon01
mon_host = 10.10.200.20
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
public_network = 10.10.200.0/24

So, I could change ceph.conf on every storage-related host, but this does 
not work with the monitors.
I have mon05 in a "probing" state, trying to contact only mon01 (which is down), and 
I changed "mon_initial_members" in ceph.conf to point at mon02 as initial… and this 
does not work ☹

2019-06-12 03:39:47.033242 7f04e630f700  0 mon.mon05@4(probing).data_health(0) 
update_stats avail 98% total 223 GB, used 4255 MB, avail 219 GB

And asking the admin socket:

[root@mon05 ~]# ceph daemon mon.mon05 mon_status
{ "name": "mon05",
  "rank": 4,
  "state": "probing",
  "election_epoch": 0,
  "quorum": [],
  "outside_quorum": [
"mon05"],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 21,
  "fsid": "----",
  "modified": "2019-06-07 16:59:26.729467",
  "created": "0.00",
  "mons": [
{ "rank": 0,
  "name": "mon01",
  "addr": "10.10.200.20:6789\/0"},
{ "rank": 1,
  "name": "mon02",
  "addr": "10.10.200.21:6789\/0"},
{ "rank": 2,
  "name": "mon03",
  "addr": "10.10.200.22:6789\/0"},
{ "rank": 3,
  "name": "mon04",
  "addr": "10.10.200.23:6789\/0"},
{ "rank": 4,
  "name": "mon05",
  "addr": "10.10.200.24:6789\/0"}]}}

In any case it contacts mon02, mon03 or mon04, which are healthy and in quorum:

[root@mon02 ceph-mon02]# ceph daemon mon.mon02 mon_status
{ "name": "mon02",
  "rank": 1,
  "state": "leader",
  "election_epoch": 476,
  "quorum": [
1,
2,
3],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 21,
  "fsid": "----",
  "modified": "2019-06-07 16:59:26.729467",
  "created": "0.00",
  "mons": [
{ "rank": 0,
  "name": "mon01",
  "addr": "10.10.200.20:6789\/0"},
{ "rank": 1,
  "name": "mon02",
  "addr": "10.10.200.21:6789\/0"},
{ "rank": 2,
  "name": "mon03",
  "addr": "10.10.200.22:6789\/0"},
{ "rank": 3,
  "name": "mon04",
  "addr": "10.10.200.23:6789\/0"},
{ "rank": 4,
  "name": "mon05",
  "addr": "10.10.200.24:6789\/0"}]}}

Of course, no communication-related problems exist.

So, this is why we fear touching the monitors…

Regards


From: Paul Emmerich 
Sent: Wednesday, June 12, 2019 15:12
To: Lluis Arasanz i Nonell - Adam 
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Ceph-community] Monitors not in quorum (1 of 3 live)



On Wed, Jun 12, 2019 at 11:45 AM Lluis Arasanz i Nonell - Adam 
mailto:lluis.aras...@adam.es>> wrote:
- Be careful adding or removing monitors in an unhealthy monitor cluster: if 
they lose quorum you will be in trouble.

safe procedure: remove the dead monitor before adding a new one


Now, we have some work to do:
- Remove mon01 with "ceph mon destroy mon01": we want to remove it from the monmap, 
but it is the "initial monitor" so we do not know if it is safe to do.

yes, that's safe to do; there's nothing special about the first mon. The command is 
"ceph mon remove <id>", though

- Clean and "format" the monitor data (as we did on mon02 and mon03) for mon01, but 
we have the same situation: is it safe to do when it is the "initial mon"?

all (fully synched and in quorum) mons have the exact same data

- Modify the monmap, deleting mon01, and inject it on mon05, but... what happens 
when we delete the "initial mon" from the monmap? Is that safe?

"ceph mon remove" will modify the mon map for you; manually modifying the mon 
map is only required if the cluster is down




--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



Regards
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Harald Staub

Also opened an issue about the rocksdb problem:
https://tracker.ceph.com/issues/40300

On 12.06.19 16:06, Harald Staub wrote:
We ended in a bad situation with our RadosGW (Cluster is Nautilus 
14.2.1, 350 OSDs with BlueStore):


1. There is a bucket with about 60 million objects, without shards.

2. radosgw-admin bucket reshard --bucket $BIG_BUCKET --num-shards 1024

3. Resharding looked fine at first; it counted up to the number of objects, 
but then it hung.


4. 3 OSDs crashed with a segfault: "rocksdb: Corruption: file is too short"

5. Trying to start the OSDs manually led to the same segfaults.

6. ceph-bluestore-tool repair ...

7. The repairs all aborted, with the same rocksdb error as above.

8. Now 1 PG is stale. It belongs to the radosgw bucket index pool, and 
it contained the index of this big bucket.


Is there any hope of getting these rocksdbs up again?

Otherwise: how would we fix the bucket index pool? Our ideas:

1. ceph pg $BAD_PG mark_unfound_lost delete
2. rados -p .rgw.buckets ls, search $BAD_BUCKET_ID and remove these 
objects. The hope of this step would be to make the following step 
faster, and avoid another similar problem.

3. radosgw-admin bucket check --check-objects

Will this really rebuild the bucket index? Is it ok to leave the 
existing bucket indexes in place? Is it ok to run for all buckets at 
once, or has it to be run bucket by bucket? Is there a risk that the 
indexes that are not affected by the BAD_PG will be broken afterwards?


Some more details that may be of interest.

ceph-bluestore-tool repair says:

2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: file is too 
short (6139497190 bytes) to be an sstabledb/079728.sst
2019-06-12 11:15:38.345 7f56269670c0 -1 
bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:

error from fsck: (5) Input/output error

The repairs also showed several warnings like:

tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @ 
0x7f5626521887 0x56126a287229 0x56126a2873a3 0x56126a5dc1ec 
0x56126a584ce2 0x56126a586a05 0x56126a587dd0 0x56126a589344 
0x56126a38c3cf 0x56126a2eae94 0x56126a30654e 0x56126a337ae1 
0x56126a1a73a1 0x7f561b228b97 0x56126a28077a


The processes showed up with like 45 GB of RAM used. Fortunately, there 
was no Out-Of-Memory.


  Harry
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS getattr op stuck in snapshot

2019-06-12 Thread Nathan Fish
I have run into a similar hang on 'ls .snap' recently:
https://tracker.ceph.com/issues/40101#note-2

On Wed, Jun 12, 2019 at 9:33 AM Yan, Zheng  wrote:
>
> On Wed, Jun 12, 2019 at 3:26 PM Hector Martin  wrote:
> >
> > Hi list,
> >
> > I have a setup where two clients mount the same filesystem and
> > read/write from mostly non-overlapping subsets of files (Dovecot mail
> > storage/indices). There is a third client that takes backups by
> > snapshotting the top-level directory, then rsyncing the snapshot over to
> > another location.
> >
> > Ever since I switched the backup process to using snapshots, the rsync
> > process has stalled at a certain point during the backup with a stuck
> > MDS op:
> >
> > root@mon02:~# ceph daemon mds.mon02 dump_ops_in_flight
> > {
> > "ops": [
> > {
> > "description": "client_request(client.146682828:199050
> > getattr pAsLsXsFs #0x107//bak-20190612094501/<path>/dovecot.index.log 2019-06-12 12:20:56.992049 caller_uid=5000,
> > caller_gid=5000{})",
> > "initiated_at": "2019-06-12 12:20:57.001534",
> > "age": 9563.847754,
> > "duration": 9563.847780,
> > "type_data": {
> > "flag_point": "failed to rdlock, waiting",
> > "reqid": "client.146682828:199050",
> > "op_type": "client_request",
> > "client_info": {
> > "client": "client.146682828",
> > "tid": 199050
> > },
> > "events": [
> > {
> > "time": "2019-06-12 12:20:57.001534",
> > "event": "initiated"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001534",
> > "event": "header_read"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001538",
> > "event": "throttled"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001550",
> > "event": "all_read"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001713",
> > "event": "dispatched"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001997",
> > "event": "failed to rdlock, waiting"
> > }
> > ]
> > }
> > }
> > ],
> > "num_ops": 1
> > }
> >
> > AIUI, when a snapshot is taken, all clients with dirty data are supposed
> > to get a message to flush it to the cluster in order to produce a
> > consistent snapshot. My guess is this isn't happening properly, so reads
> > of that file in the snapshot are blocked. Doing a 'echo 3 >
> > /proc/sys/vm/drop_caches' on both of the writing clients seems to clear
> > the stuck op, but doing it once isn't enough; usually I get the stuck up
> > and have to clear caches twice after making any given snapshot.
> >
> > Everything is on Ubuntu. The cluster is running 13.2.4 (mimic), and the
> > clients are using the kernel client version 4.18.0-20-generic (writers)
> > and 4.18.0-21-generic (backup host).
> >
> > I managed to reproduce it like this:
> >
> > host1$ mkdir _test
> > host1$ cd _test/.snap
> >
> > host2$ cd _test
> > host2$ for i in $(seq 1 1); do (sleep 0.1; echo $i; sleep 1) > b_$i
> > & sleep 0.05; done
> >
> > (while that is running)
> >
> > host1$ mkdir s11
> > host1$ cd s11
> >
> > (wait a few seconds)
> >
> > host2$ ^C
> >
> > host1$ ls -al
> > (hangs)
> >
> > This yielded this stuck request:
> >
> > {
> > "ops": [
> > {
> > "description": "client_request(client.146687505:13785
> > getattr pAsLsXsFs #0x117f41c//s11/b_42 2019-06-12 16:15:59.095025
> > caller_uid=0, caller_gid=0{})",
> > "initiated_at": "2019-06-12 16:15:59.095559",
> > "age": 30.846294,
> > "duration": 30.846318,
> > "type_data": {
> > "flag_point": "failed to rdlock, waiting",
> > "reqid": "client.146687505:13785",
> > "op_type": "client_request",
> > "client_info": {
> > "client": "client.146687505",
> > "tid": 13785
> > },
> > "events": [
> > {
> > "time": "2019-06-12 16:15:59.095559",
> > "event": "initiated"
> > },
> > {
> > "time": "2019-06-12 16:15:59.095559",
> > "event": "header_read"
> > },
> > {
> > "time": "2019-06-12 16:15:59.095562",
> > "event": 

[ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Harald Staub
We ended in a bad situation with our RadosGW (Cluster is Nautilus 
14.2.1, 350 OSDs with BlueStore):


1. There is a bucket with about 60 million objects, without shards.

2. radosgw-admin bucket reshard --bucket $BIG_BUCKET --num-shards 1024

3. Resharding looked fine at first; it counted up to the number of objects, 
but then it hung.


4. 3 OSDs crashed with a segfault: "rocksdb: Corruption: file is too short"

5. Trying to start the OSDs manually led to the same segfaults.

6. ceph-bluestore-tool repair ...

7. The repairs all aborted, with the same rocksdb error as above.

8. Now 1 PG is stale. It belongs to the radosgw bucket index pool, and 
it contained the index of this big bucket.


Is there any hope of getting these rocksdbs up again?

Otherwise: how would we fix the bucket index pool? Our ideas:

1. ceph pg $BAD_PG mark_unfound_lost delete
2. rados -p .rgw.buckets ls, search $BAD_BUCKET_ID and remove these 
objects. The hope of this step would be to make the following step 
faster, and avoid another similar problem.

3. radosgw-admin bucket check --check-objects

Will this really rebuild the bucket index? Is it ok to leave the 
existing bucket indexes in place? Is it ok to run for all buckets at 
once, or has it to be run bucket by bucket? Is there a risk that the 
indexes that are not affected by the BAD_PG will be broken afterwards?
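
For step 2, what we have in mind is roughly this sketch (assuming the data 
objects in .rgw.buckets are prefixed with the bucket marker ID; obviously 
destructive, so we would verify the prefix on a few objects first):

# sketch for step 2: remove the data objects belonging to the big bucket
# BAD_BUCKET_ID is assumed to be the bucket marker / object name prefix
rados -p .rgw.buckets ls > /tmp/all-objects
grep "^${BAD_BUCKET_ID}_" /tmp/all-objects | while read -r obj; do
  rados -p .rgw.buckets rm "$obj"
done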


Some more details that may be of interest.

ceph-bluestore-tool repair says:

2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: file is too 
short (6139497190 bytes) to be an sstabledb/079728.sst
2019-06-12 11:15:38.345 7f56269670c0 -1 
bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:

error from fsck: (5) Input/output error

The repairs also showed several warnings like:

tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @ 
0x7f5626521887 0x56126a287229 0x56126a2873a3 0x56126a5dc1ec 
0x56126a584ce2 0x56126a586a05 0x56126a587dd0 0x56126a589344 
0x56126a38c3cf 0x56126a2eae94 0x56126a30654e 0x56126a337ae1 
0x56126a1a73a1 0x7f561b228b97 0x56126a28077a


The processes showed up with like 45 GB of RAM used. Fortunately, there 
was no Out-Of-Memory.


 Harry
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph threads and performance

2019-06-12 Thread Paul Emmerich
If there was an optimal setting, then it would be the default.

Also, both of these options were removed in Luminous ~2 years ago.
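
If you want to double-check which thread-related options a running OSD still 
exposes, a quick look (just a sketch) is:

ceph daemon osd.0 config show | grep -i thread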


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 12, 2019 at 3:51 PM tim taler  wrote:

> I will look into that, but:
> IS there a rule of thumb to determine the optimal setting for
> osd disk threads
> and
> osd op threads
> ?
> TIA
>
> On Wed, Jun 12, 2019 at 3:22 PM Paul Emmerich 
> wrote:
> >
> >
> >
> > On Wed, Jun 12, 2019 at 10:57 AM tim taler  wrote:
> >>
> >> We experience absurd slow i/o in the VMs and I suspect
> >> our thread settings in ceph.conf to be one of the culprits.
> >
> >
> > this is probably not the cause. But someone might be able to help you if
> you
> > share details on your setup (hardware, software) and workload (iops,
> bandwidth, latency)
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> >>
> >> Is there a rule of thumb how to get the proper setting?
> >> What side effect will have a higher threads setting?
> >> Higer cpu utilisation? higher memory impact?
> >>
> >> TIA
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error when I compare hashes of export-diff / import-diff

2019-06-12 Thread Jason Dillaman
On Wed, Jun 12, 2019 at 9:50 AM Rafael Diaz Maurin
 wrote:
>
> Hello Jason,
>
> Le 11/06/2019 à 15:31, Jason Dillaman a écrit :
> >> 4- I export the snapshot from the source pool and I import the snapshot
> >> towards the destination pool (in the pipe)
> >> rbd export-diff --from-snap ${LAST-SNAP}
> >> ${POOL-SOURCE}/${KVM-IMAGE}@${TODAY-SNAP} - | rbd -c ${BACKUP-CLUSTER}
> >> import-diff - ${POOL-DESTINATION}/${KVM-IMAGE}
> > What's the actual difference between the "rbd diff" outputs? There is
> > a known "issue" where object-map will flag an object as dirty if you
> > had run an fstrim/discard on the image, but it doesn't affect the
> > actual validity of the data.
>
>
> The feature discard=on is activated, and qemu-guest-agent is running on
> the guest.
>
> Here are the only differences between the 2 outputs of the "rbd diff
> --format plain" (image source and image destination) :
> Offset  Length  Type
> 121c121
> < 14103347200 2097152 data
> ---
>  > 14103339008 2105344 data
> 153c153
> < 14371782656 2097152 data
> ---
>  > 14369685504 4194304 data
> 216c216
> < 14640218112 2097152 data
> ---
>  > 14638120960 4194304 data
> 444c444
> < 15739125760 2097152 data
> ---
>  > 15738519552 2703360 data
>
> And the hashes of the exports are identical (between source and
> destination) :
> rbd -p ${POOL-SOURCE} export ${KVM-IMAGE}@${TODAY-SNAP} - | md5sum
> => ee7012e14870b36e7b9695e52c417c06
>
> rbd -c ${BACKUP-CLUSTER} -p ${POOL-DESTINATION} export
> ${KVM-IMAGE}@${TODAY-SNAP} - | md5sum
> => ee7012e14870b36e7b9695e52c417c06
>
>
> So do you think this can be caused by the fstrim/discard feature.

You said you weren't using fstrim/discard. If you export both images
and compare those image extents where the diffs are different, is it
just filled w/ zeroes?
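
Something along these lines would do it (just a sketch reusing your 
placeholders; offset/length cover the first differing extent pair above):

OFF=14103339008
LEN=2105344
rbd -p ${POOL-SOURCE} export ${KVM-IMAGE}@${TODAY-SNAP} - | tail -c +$((OFF + 1)) | head -c "$LEN" | md5sum
rbd -c ${BACKUP-CLUSTER} -p ${POOL-DESTINATION} export ${KVM-IMAGE}@${TODAY-SNAP} - | tail -c +$((OFF + 1)) | head -c "$LEN" | md5sum
head -c "$LEN" /dev/zero | md5sum   # hash of an all-zero extent, for comparison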

> In fact, when fstrim/discard activated it, the only way to validate the
> export-diff is to compare full exports hashes ?
>
>
> Thank you.
>
>
> Best regards,
> Rafael
>
> --
> Rafael Diaz Maurin
> DSI de l'Université de Rennes 1
> Pôle Infrastructures, équipe Systèmes
> 02 23 23 71 57
>
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph threads and performance

2019-06-12 Thread Bastiaan Visser
On both larger and smaller clusters I have never had problems with the default 
values, so I guess that's a pretty good start.

- Original Message -
From: "tim taler" 
To: "Paul Emmerich" 
Cc: "ceph-users" 
Sent: Wednesday, June 12, 2019 3:51:43 PM
Subject: Re: [ceph-users] ceph threads and performance

I will look into that, but:
IS there a rule of thumb to determine the optimal setting for
osd disk threads
and
osd op threads
?
TIA

On Wed, Jun 12, 2019 at 3:22 PM Paul Emmerich  wrote:
>
>
>
> On Wed, Jun 12, 2019 at 10:57 AM tim taler  wrote:
>>
>> We experience absurd slow i/o in the VMs and I suspect
>> our thread settings in ceph.conf to be one of the culprits.
>
>
> this is probably not the cause. But someone might be able to help you if you
> share details on your setup (hardware, software) and workload (iops, 
> bandwidth, latency)
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>>
>> Is there a rule of thumb how to get the proper setting?
>> What side effect will have a higher threads setting?
>> Higer cpu utilisation? higher memory impact?
>>
>> TIA
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph threads and performance

2019-06-12 Thread tim taler
I will look into that, but:
IS there a rule of thumb to determine the optimal setting for
osd disk threads
and
osd op threads
?
TIA

On Wed, Jun 12, 2019 at 3:22 PM Paul Emmerich  wrote:
>
>
>
> On Wed, Jun 12, 2019 at 10:57 AM tim taler  wrote:
>>
>> We experience absurd slow i/o in the VMs and I suspect
>> our thread settings in ceph.conf to be one of the culprits.
>
>
> this is probably not the cause. But someone might be able to help you if you
> share details on your setup (hardware, software) and workload (iops, 
> bandwidth, latency)
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>>
>> Is there a rule of thumb how to get the proper setting?
>> What side effect will have a higher threads setting?
>> Higer cpu utilisation? higher memory impact?
>>
>> TIA
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error when I compare hashes of export-diff / import-diff

2019-06-12 Thread Rafael Diaz Maurin

Hello Jason,

Le 11/06/2019 à 15:31, Jason Dillaman a écrit :

4- I export the snapshot from the source pool and I import the snapshot
towards the destination pool (in the pipe)
rbd export-diff --from-snap ${LAST-SNAP}
${POOL-SOURCE}/${KVM-IMAGE}@${TODAY-SNAP} - | rbd -c ${BACKUP-CLUSTER}
import-diff - ${POOL-DESTINATION}/${KVM-IMAGE}

What's the actual difference between the "rbd diff" outputs? There is
a known "issue" where object-map will flag an object as dirty if you
had run an fstrim/discard on the image, but it doesn't affect the
actual validity of the data.



The feature discard=on is activated, and qemu-guest-agent is running on 
the guest.


Here are the only differences between the two outputs of "rbd diff 
--format plain" (source image and destination image):

Offset              Length  Type
121c121
< 14103347200 2097152 data
---
> 14103339008 2105344 data
153c153
< 14371782656 2097152 data
---
> 14369685504 4194304 data
216c216
< 14640218112 2097152 data
---
> 14638120960 4194304 data
444c444
< 15739125760 2097152 data
---
> 15738519552 2703360 data

And the hashes of the exports are identical (between source and 
destination):

rbd -p ${POOL-SOURCE} export ${KVM-IMAGE}@${TODAY-SNAP} - | md5sum
=> ee7012e14870b36e7b9695e52c417c06

rbd -c ${BACKUP-CLUSTER} -p ${POOL-DESTINATION} export 
${KVM-IMAGE}@${TODAY-SNAP} - | md5sum

=> ee7012e14870b36e7b9695e52c417c06


So do you think this can be caused by the fstrim/discard feature?

In fact, when fstrim/discard is activated, is the only way to validate the 
export-diff to compare the hashes of full exports?



Thank you.


Best regards,
Rafael

--
Rafael Diaz Maurin
DSI de l'Université de Rennes 1
Pôle Infrastructures, équipe Systèmes
02 23 23 71 57




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS getattr op stuck in snapshot

2019-06-12 Thread Yan, Zheng
On Wed, Jun 12, 2019 at 3:26 PM Hector Martin  wrote:
>
> Hi list,
>
> I have a setup where two clients mount the same filesystem and
> read/write from mostly non-overlapping subsets of files (Dovecot mail
> storage/indices). There is a third client that takes backups by
> snapshotting the top-level directory, then rsyncing the snapshot over to
> another location.
>
> Ever since I switched the backup process to using snapshots, the rsync
> process has stalled at a certain point during the backup with a stuck
> MDS op:
>
> root@mon02:~# ceph daemon mds.mon02 dump_ops_in_flight
> {
> "ops": [
> {
> "description": "client_request(client.146682828:199050
> getattr pAsLsXsFs #0x107//bak-20190612094501/<path>/dovecot.index.log 2019-06-12 12:20:56.992049 caller_uid=5000,
> caller_gid=5000{})",
> "initiated_at": "2019-06-12 12:20:57.001534",
> "age": 9563.847754,
> "duration": 9563.847780,
> "type_data": {
> "flag_point": "failed to rdlock, waiting",
> "reqid": "client.146682828:199050",
> "op_type": "client_request",
> "client_info": {
> "client": "client.146682828",
> "tid": 199050
> },
> "events": [
> {
> "time": "2019-06-12 12:20:57.001534",
> "event": "initiated"
> },
> {
> "time": "2019-06-12 12:20:57.001534",
> "event": "header_read"
> },
> {
> "time": "2019-06-12 12:20:57.001538",
> "event": "throttled"
> },
> {
> "time": "2019-06-12 12:20:57.001550",
> "event": "all_read"
> },
> {
> "time": "2019-06-12 12:20:57.001713",
> "event": "dispatched"
> },
> {
> "time": "2019-06-12 12:20:57.001997",
> "event": "failed to rdlock, waiting"
> }
> ]
> }
> }
> ],
> "num_ops": 1
> }
>
> AIUI, when a snapshot is taken, all clients with dirty data are supposed
> to get a message to flush it to the cluster in order to produce a
> consistent snapshot. My guess is this isn't happening properly, so reads
> of that file in the snapshot are blocked. Doing a 'echo 3 >
> /proc/sys/vm/drop_caches' on both of the writing clients seems to clear
> the stuck op, but doing it once isn't enough; usually I get the stuck up
> and have to clear caches twice after making any given snapshot.
>
> Everything is on Ubuntu. The cluster is running 13.2.4 (mimic), and the
> clients are using the kernel client version 4.18.0-20-generic (writers)
> and 4.18.0-21-generic (backup host).
>
> I managed to reproduce it like this:
>
> host1$ mkdir _test
> host1$ cd _test/.snap
>
> host2$ cd _test
> host2$ for i in $(seq 1 1); do (sleep 0.1; echo $i; sleep 1) > b_$i
> & sleep 0.05; done
>
> (while that is running)
>
> host1$ mkdir s11
> host1$ cd s11
>
> (wait a few seconds)
>
> host2$ ^C
>
> host1$ ls -al
> (hangs)
>
> This yielded this stuck request:
>
> {
> "ops": [
> {
> "description": "client_request(client.146687505:13785
> getattr pAsLsXsFs #0x117f41c//s11/b_42 2019-06-12 16:15:59.095025
> caller_uid=0, caller_gid=0{})",
> "initiated_at": "2019-06-12 16:15:59.095559",
> "age": 30.846294,
> "duration": 30.846318,
> "type_data": {
> "flag_point": "failed to rdlock, waiting",
> "reqid": "client.146687505:13785",
> "op_type": "client_request",
> "client_info": {
> "client": "client.146687505",
> "tid": 13785
> },
> "events": [
> {
> "time": "2019-06-12 16:15:59.095559",
> "event": "initiated"
> },
> {
> "time": "2019-06-12 16:15:59.095559",
> "event": "header_read"
> },
> {
> "time": "2019-06-12 16:15:59.095562",
> "event": "throttled"
> },
> {
> "time": "2019-06-12 16:15:59.095573",
> "event": "all_read"
> },
> {
> "time": "2019-06-12 16:15:59.096201",
> "event": "dispatched"
> },
> {
> 

[ceph-users] Fwd: ceph threads and performance

2019-06-12 Thread tim taler
Hi all,

we have a 5-node ceph cluster with 44 OSDs
where all nodes also serve as virtualization hosts,
running about 22 virtual machines with, all in all, about 75 RBDs
(158 including snapshots).

We experience absurdly slow I/O in the VMs and I suspect
our thread settings in ceph.conf are one of the culprits.

Is there a rule of thumb for getting the proper setting?
What side effects will a higher thread setting have?
Higher CPU utilisation? Higher memory impact?

TIA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph threads and performance

2019-06-12 Thread Paul Emmerich
On Wed, Jun 12, 2019 at 10:57 AM tim taler  wrote:

> We experience absurd slow i/o in the VMs and I suspect
> our thread settings in ceph.conf to be one of the culprits.
>

this is probably not the cause. But someone might be able to help you if you
share details on your setup (hardware, software) and workload (iops,
bandwidth, latency)

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


> Is there a rule of thumb how to get the proper setting?
> What side effect will have a higher threads setting?
> Higer cpu utilisation? higher memory impact?
>
> TIA
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Monitors not in quorum (1 of 3 live)

2019-06-12 Thread Paul Emmerich
On Wed, Jun 12, 2019 at 11:45 AM Lluis Arasanz i Nonell - Adam <
lluis.aras...@adam.es> wrote:

> - Be careful adding or removing monitors in a not healthy monitor cluster:
> If they lost quorum you will be into problems.
>

safe procedure: remove the dead monitor before adding a new one


>
>
> Now, we have some work to do:
>
> - Remove mon01 with "ceph mon destroy mon01": we want to remove it from
> monmap, but is the "initial monitor" so we do not know if it is safe to do.
>

yes, that's safe to do; there's nothing special about the first mon. The command
is "ceph mon remove <id>", though


> - Clean and "format" monitor data (as we do on mon02 and mon03) for mon01,
> but we have the same situation: is safe to do when is the "initial mon"?
>

all (fully synched and in quorum) mons have the exact same data


> - Modify monmap, deleting mon01, and inyect it om mon05, but...  what
> happens when we delete "initial mon" from monmap? Is safe?
>

"ceph mon remove" will modify the mon map for you; manually modifying the
mon map is only required if the cluster is down
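
If you ever do need the offline route, it looks roughly like this (a sketch; mon
IDs and paths are just examples, and all mons must be stopped first):

# on one surviving monitor (e.g. mon02), with all ceph-mon daemons stopped:
ceph-mon -i mon02 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap        # inspect the current members
monmaptool --rm mon01 /tmp/monmap     # drop the dead monitor
ceph-mon -i mon02 --inject-monmap /tmp/monmap
# repeat the inject step on each remaining monitor, then start them again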


>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



>
>
> Regards
>
>
>
>
>
> *Lluís Arasanz Nonell* • Departamento de Sistemas
>
> *Tel:* +34 902 902 685
>
> *email: **lluis.aras...@adam.es *
>
> *www.adam.es* 
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any CEPH's iSCSI gateway users?

2019-06-12 Thread Paul Emmerich
On Wed, Jun 12, 2019 at 6:48 AM Glen Baars 
wrote:

> Interesting performance increase! I'm running iSCSI at a few installations and
> now I wonder what version of CentOS is required to improve performance! Did
> the cluster go from Luminous to Mimic?
>

wild guess: probably related to updating tcmu-runner from 1.3 to 1.4; the
old versions had a bug that caused a long timeout on some operations in
some workloads
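
To see which version is actually installed on a gateway (CentOS/RHEL example):

rpm -q tcmu-runner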


Paul


>
> Glen
>
> -Original Message-
> From: ceph-users  On Behalf Of Heðin
> Ejdesgaard Møller
> Sent: Saturday, 8 June 2019 8:00 AM
> To: Paul Emmerich ; Igor Podlesny <
> ceph-u...@poige.ru>
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Any CEPH's iSCSI gateway users?
>
>
> I recently upgraded a RHCS-3.0 cluster with 4 iGW's to RHCS-3.2 on top of
> RHEL-
> 7.6
> Big block size performance went from ~350MB/s to about 1100MB/s on each
> lun, seen from a VM in vSphere-6.5 with data read from an ssd pool and
> written to a hdd pool, both being 3/2 replica.
> I have not experienced any hick-up since the upgrade.
> You will always have a degree of performance hit when using the iGW,
> because it's both an extra layer between consumer and hardware, and a
> potential choke- point, just like any "traditional" iSCSI based SAN
> solution.
>
> If you are considering to deploy the iGW on the upstream bits then I would
> recommend you to stick to CentOS, since a lot of it's development have
> happened on the RHEL platform.
>
> Regards
> Heðin Ejdesgaard
> Synack sp/f
>
> On frí, 2019-06-07 at 12:44 +0200, Paul Emmerich wrote:
> > Hi,
> >
> > ceph-iscsi 3.0 fixes a lot of problems and limitations of the older
> gateway.
> >
> > Best way to run it on Debian/Ubuntu is to build it yourself
> >
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at
> > https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> >
> > On Tue, May 28, 2019 at 10:02 AM Igor Podlesny 
> wrote:
> > > What is your experience?
> > > Does it make sense to use it -- is it solid enough or beta quality
> > > rather (both in terms of stability and performance)?
> > >
> > > I've read it was more or less packaged to work with RHEL. Does it
> > > hold true still?
> > > What's the best way to install it on, say, CentOS or Debian/Ubuntu?
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Monitors not in quorum (1 of 3 live)

2019-06-12 Thread Lluis Arasanz i Nonell - Adam
Hi all,

Here is our story; perhaps some day it will help someone. Bear in mind that English is 
not my native language, so sorry if I make mistakes.

Our system is: Ceph 0.87.2 (Giant), with 5 OSD servers (116 1TB osd total) and 
3 monitors.

After a nightmare of a time, we have initially "corrected" our ceph monitor problems. But 
first, some additional info and a timeline (dates are in dd-mm-yyyy format).

At the beginning, we had 3 working monitors and we were happy. (MON01, MON02 
and MON03)

Wednesday 05/06/2019:
After a UPS outage on the B power line, we found that the ceph-mon process on mon03 
would not start cleanly: after starting ceph-mon, ceph-create-keys could not contact 
the daemon. We kept working with a quorum of 2 monitors and still had access to the 
Ceph storage.

Thursday 06/06/2019
We had the "good" idea of adding a new mon to the monitor cluster... this was our 
first mistake. After the "ceph-deploy mon mon.mon04" command, the new monitor 
activated (4 monitors in the cluster) but... only 2 monitors had data (mon01 and 
mon02), which means no quorum. With no quorum, mon04 could not join the mon 
cluster. We lost the "ceph" commands as no monitor could hold quorum, so no 
ceph-related command worked.

Fortunately, the storage kept "working" and the active OpenStack instances were not 
affected (we do not know why it worked, but it did). At this point we restarted 
mon02 and mon04 a few times. I do not remember the order, but our priority was to 
recover mon quorum :(  After the mon02 restart, it showed the same behaviour as 
mon03: ceph-create-keys could not contact the daemon.

We left the cluster "working" with mon01 in electing status and mon04 waiting to 
be added to the cluster.

Friday 07/06/2019
We prepared a new monitor machine (mon05) to integrate into the mon cluster. Our 
idea was: "If we deploy mon05 and integrate it into the mon cluster, this could 
work, as 3 mons up will make quorum..."

We did a "ceph-mon -i mon05 --mkfs --monmap /root/monmap-mon04-original 
--keyring /root/keyring" with data extracted from mon04 (keyring and monmap) 
and started it with "ceph-mon -i mon05 -c /etc/ceph/ceph.conf --cluster ceph"...

Yes, it worked. We were very happy because we had recovered monitor quorum, we had 
the ceph-related commands back and everything worked... but only for 10 minutes :(

And here the nightmare began.

Slow requests began to increase. We did not know why, so initially we restarted the 
affected OSDs. After 3 hours of restarting OSDs we thought: "this is not normal. 
What's happening here?"
The OSD logs showed some "key errors" when contacting other OSDs and the monitors. 
We really were in trouble, because OpenStack Cinder could not reach the RBD volumes, 
and rbd commands showed a lot of key errors when reading pool volumes. The whole 
system effectively went down, so no writes or reads reached the storage. We tried 
restarting the mons, restarting the OpenStack services, restarting the OSDs (one at 
a time), checking NTP (no errors there), checking iptables, checking anything that 
could be checked... with no success.
We rebuilt monitors 2 and 3, formatting the ceph-mon data in the same way we did 
for mon05, so we had a 5-monitor cluster, but the key errors did not disappear.

And when there was nothing more we could do... we applied a Spanish saying: "De 
perdidos, al rio" (literally "from lost, to the river", i.e. when nothing works and 
all is lost, you can try anything you want). So we thought: "the only monitor we 
never touched is mon01 (the active monitor), so what if we reset it?"

No sooner thought than done. We stopped mon01. Monitor quorum was transferred to 
mon02, but the slow requests were still there. We restarted ceph-mon on mon01... 
but again, ceph-create-keys could not contact the daemon. We lost mon01. So mon02 
to mon05 were working in quorum.

And suddenly the storage began to recover: slow requests decreased, rbd commands 
worked, OSD logs showed normal info (no key-related errors) and 10 minutes after 
mon01 went down, the whole cluster was active and clean.

After this story, we have some "things to keep in mind" we want to share:

- Always have more than one monitor defined in "mon_initial_members". We have only 
one, and if it is not active, the other monitors do not start (after the storage 
recovery we stopped mon05, and it stays in "probing" status trying to contact 
mon01, which is down).
- Keep a copy of the monitors' keyring and monmap; this is the safe way to manually 
add monitors to the cluster when no ceph-related commands work (see the commands 
below).
- Be careful adding or removing monitors in an unhealthy monitor cluster: if 
they lose quorum you will be in trouble.
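
Grabbing those copies while the cluster is healthy is as simple as this (a 
sketch; the output paths are just examples):

# save the current monmap and the mon. keyring somewhere off-cluster
ceph mon getmap -o /root/monmap-backup
ceph auth get mon. -o /root/mon-keyring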

Now we have some work to do:
- Remove mon01 with "ceph mon destroy mon01": we want to remove it from the monmap, 
but it is the "initial monitor" so we do not know if it is safe to do.
- Clean and "format" the monitor data (as we did on mon02 and mon03) for mon01, but 
we have the same situation: is it safe to do when it is the "initial mon"?
- Modify the monmap, deleting mon01, and inject it on mon05, but... what happens 
when we delete the "initial mon" from the monmap? Is that safe?

As you can understand, we now have working storage but in a critical 
situation, because any problem with the monitors could make it unstable again... 
And there is 

Re: [ceph-users] Expected IO in luminous Ceph Cluster

2019-06-12 Thread Виталий Филиппов
Hi Felix,

Better use fio.

Like fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 
-rw=randwrite -pool=rpool_hdd -runtime=60 -rbdname=testimg (for peak parallel 
random iops)

Or the same with -iodepth=1 for the latency test. Here you usually get

Or the same with -ioengine=libaio -filename=testfile -size=10G instead of 
-ioengine=rbd -pool=.. -rbdname=.. to test it from inside a VM.

...or the same with -sync=1 to determine how a DBMS will perform inside a VM...
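
Spelled out, the in-VM sync latency variant would look roughly like this (file 
name and size are just examples):

# single-queue sync 4k random writes from inside the VM (close to what a DBMS does)
fio -ioengine=libaio -direct=1 -sync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -filename=testfile -size=10G -runtime=60
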
-- 
With best regards,
  Vitaliy Filippov___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph threads and performance

2019-06-12 Thread tim taler
Hi all,

we have a 5-node ceph cluster with 44 OSDs
where all nodes also serve as virtualization hosts,
running about 22 virtual machines with, all in all, about 75 RBDs
(158 including snapshots).

We experience absurdly slow I/O in the VMs and I suspect
our thread settings in ceph.conf are one of the culprits.

Is there a rule of thumb for getting the proper setting?
What side effects will a higher thread setting have?
Higher CPU utilisation? Higher memory impact?

TIA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS getattr op stuck in snapshot

2019-06-12 Thread Hector Martin
Hi list,

I have a setup where two clients mount the same filesystem and
read/write from mostly non-overlapping subsets of files (Dovecot mail
storage/indices). There is a third client that takes backups by
snapshotting the top-level directory, then rsyncing the snapshot over to
another location.

Ever since I switched the backup process to using snapshots, the rsync
process has stalled at a certain point during the backup with a stuck
MDS op:

root@mon02:~# ceph daemon mds.mon02 dump_ops_in_flight
{
"ops": [
{
"description": "client_request(client.146682828:199050
getattr pAsLsXsFs #0x107//bak-20190612094501//dovecot.index.log 2019-06-12 12:20:56.992049 caller_uid=5000,
caller_gid=5000{})",
"initiated_at": "2019-06-12 12:20:57.001534",
"age": 9563.847754,
"duration": 9563.847780,
"type_data": {
"flag_point": "failed to rdlock, waiting",
"reqid": "client.146682828:199050",
"op_type": "client_request",
"client_info": {
"client": "client.146682828",
"tid": 199050
},
"events": [
{
"time": "2019-06-12 12:20:57.001534",
"event": "initiated"
},
{
"time": "2019-06-12 12:20:57.001534",
"event": "header_read"
},
{
"time": "2019-06-12 12:20:57.001538",
"event": "throttled"
},
{
"time": "2019-06-12 12:20:57.001550",
"event": "all_read"
},
{
"time": "2019-06-12 12:20:57.001713",
"event": "dispatched"
},
{
"time": "2019-06-12 12:20:57.001997",
"event": "failed to rdlock, waiting"
}
]
}
}
],
"num_ops": 1
}

AIUI, when a snapshot is taken, all clients with dirty data are supposed
to get a message to flush it to the cluster in order to produce a
consistent snapshot. My guess is this isn't happening properly, so reads
of that file in the snapshot are blocked. Doing a 'echo 3 >
/proc/sys/vm/drop_caches' on both of the writing clients seems to clear
the stuck op, but doing it once isn't enough; usually I get the stuck op
and have to clear caches twice after making any given snapshot.

Everything is on Ubuntu. The cluster is running 13.2.4 (mimic), and the
clients are using the kernel client version 4.18.0-20-generic (writers)
and 4.18.0-21-generic (backup host).

I managed to reproduce it like this:

host1$ mkdir _test
host1$ cd _test/.snap

host2$ cd _test
host2$ for i in $(seq 1 1); do (sleep 0.1; echo $i; sleep 1) > b_$i
& sleep 0.05; done

(while that is running)

host1$ mkdir s11
host1$ cd s11

(wait a few seconds)

host2$ ^C

host1$ ls -al
(hangs)

This yielded this stuck request:

{
"ops": [
{
"description": "client_request(client.146687505:13785
getattr pAsLsXsFs #0x117f41c//s11/b_42 2019-06-12 16:15:59.095025
caller_uid=0, caller_gid=0{})",
"initiated_at": "2019-06-12 16:15:59.095559",
"age": 30.846294,
"duration": 30.846318,
"type_data": {
"flag_point": "failed to rdlock, waiting",
"reqid": "client.146687505:13785",
"op_type": "client_request",
"client_info": {
"client": "client.146687505",
"tid": 13785
},
"events": [
{
"time": "2019-06-12 16:15:59.095559",
"event": "initiated"
},
{
"time": "2019-06-12 16:15:59.095559",
"event": "header_read"
},
{
"time": "2019-06-12 16:15:59.095562",
"event": "throttled"
},
{
"time": "2019-06-12 16:15:59.095573",
"event": "all_read"
},
{
"time": "2019-06-12 16:15:59.096201",
"event": "dispatched"
},
{
"time": "2019-06-12 16:15:59.096318",
"event": "failed to rdlock, waiting"
},
{
"time": "2019-06-12 16:15:59.268368",
"event": "failed to rdlock, waiting"
}
]
}
 

Re: [ceph-users] Large OMAP object in RGW GC pool

2019-06-12 Thread Wido den Hollander



On 6/11/19 9:48 PM, J. Eric Ivancich wrote:
> Hi Wido,
> 
> Interleaving below
> 
> On 6/11/19 3:10 AM, Wido den Hollander wrote:
>>
>> I thought it was resolved, but it isn't.
>>
>> I counted all the OMAP values for the GC objects and I got back:
>>
>> gc.0: 0
>> gc.11: 0
>> gc.14: 0
>> gc.15: 0
>> gc.16: 0
>> gc.18: 0
>> gc.19: 0
>> gc.1: 0
>> gc.20: 0
>> gc.21: 0
>> gc.22: 0
>> gc.23: 0
>> gc.24: 0
>> gc.25: 0
>> gc.27: 0
>> gc.29: 0
>> gc.2: 0
>> gc.30: 0
>> gc.3: 0
>> gc.4: 0
>> gc.5: 0
>> gc.6: 0
>> gc.7: 0
>> gc.8: 0
>> gc.9: 0
>> gc.13: 110996
>> gc.10: 04
>> gc.26: 42
>> gc.28: 111292
>> gc.17: 111314
>> gc.12: 111534
>> gc.31: 111956
> 
> Casey Bodley mentioned to me that he's seen similar behavior to what
> you're describing when RGWs are upgraded but not all OSDs are upgraded
> as well. Is it possible that the OSDs hosting gc.13, gc.10, and so forth
> are running a different version of ceph?
> 

Yes, the OSDs are still on 13.2.5. As this is a big (2500 OSD)
production environment, we only created a temporary machine with 13.2.6
(just a few hours before its release) to run the GC.

We did not upgrade the cluster itself, as we have to wait until we have
validated the release on the testing cluster first.
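
For reference, checking for version skew on the OSDs that host these objects is 
straightforward (a sketch; the pool name is an assumption and will differ per 
deployment):

# overview of which daemon versions are in the mix
ceph versions
# find the OSDs that currently host one of the affected gc objects
ceph osd map default.rgw.log gc.13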

Wido

> Eric
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-06-12 Thread Stefan Kooman
Quoting Patrick Donnelly (pdonn...@redhat.com):
> Hi Stefan,
> 
> Sorry I couldn't get back to you sooner.

NP.

> Looks like you hit the infinite loop bug in OpTracker. It was fixed in
> 12.2.11: https://tracker.ceph.com/issues/37977
> 
> The problem was introduced in 12.2.8.

We were on 12.2.8 for quite a long time (because of issues with 12.2.9 /
uncertainty around 12.2.10) ... We upgraded to 12.2.11 at the end of February,
after which we stopped seeing crashes ... so it does correlate with the
upgrade, so yeah, probably this bug then.

Thanks,

Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com