Re: [ceph-users] Ceph 12.2.5 - FAILED assert(0 == "put on missing extent (nothing before)")

2018-10-02 Thread Gregory Farnum
I'd create a new ticket and reference the older one; they may not have the
same cause.

On Tue, Oct 2, 2018 at 12:33 PM Ricardo J. Barberis 
wrote:

> Hello,
>
> I'm having this same issue on 12.2.8. Should I reopen the bug report?
>
> This cluster started on 12.2.4 and was upgraded to 12.2.5 and then directly
> to 12.2.8 (we skipped 12.2.6 and 12.2.7), but the malfunctioning OSD is on a
> new node installed with 12.2.8.
>
> We're using CentOS 7.5, and bluestore for ceph. This particular node has
> SSD disks.
>
> I have an extract of the log and objdump if needed.
>
> Thanks,
>
> On Wednesday 11/07/2018 at 18:31, Gregory Farnum wrote:
> > A bit delayed, but Radoslaw looked at this some and has a diagnosis on
> the
> > tracker ticket: http://tracker.ceph.com/issues/24715
> > So it looks like a symptom of a bug that was already fixed for unrelated
> > reasons. :)
> > -Greg
> >
> > On Wed, Jun 27, 2018 at 4:51 AM Dyweni - Ceph-Users
> > <6exbab4fy...@dyweni.com>
> >
> > wrote:
> > > Good Morning,
> > >
> > > I have rebuilt the OSD and the cluster is healthy now.
> > >
> > > I have one pool with a 3-replica setup.  I am a bit concerned that
> > > removing a snapshot can cause an OSD to crash.  I've asked myself what
> > > would have happened if 2 OSDs had crashed?  God forbid, what if 3 or
> > > more OSDs had crashed with this same error?  How would I have recovered
> > > from that?
> > >
> > > So planning for the future:
> > >
> > >   1. Is there any way to proactively scan for (and even repair) this?
> > >
> > >   2. What could have caused this?
> > >
> > > We experienced a cluster-wide power outage lasting several hours several
> > > days ago.  The outage occurred at a time when no snapshots were being
> > > created.  The cluster was brought back up in a controlled manner and no
> > > errors were discovered immediately afterward (Ceph reported healthy).
> > > Could this have caused corruption?
> > >
> > > Thanks,
> > > Dyweni
> > >
> > > On 2018-06-25 09:34, Dyweni - Ceph-Users wrote:
> > > > Hi,
> > > >
> > > > Is there any information you'd like to grab off this OSD?  Anything I
> > > > can provide to help you troubleshoot this?
> > > >
> > > > I ask, because if not, I'm going to reformat / rebuild this OSD
> > > > (unless there is a faster way to repair this issue).
> > > >
> > > > Thanks,
> > > > Dyweni
> > > >
> > > > On 2018-06-25 07:30, Dyweni - Ceph-Users wrote:
> > > >> Good Morning,
> > > >>
> > > >> After removing roughly 20-some rbd snapshots, one of my OSDs has
> > > >> begun flapping.
> > > >>
> > > >>
> > > >>  ERROR 1 
> > > >>
> > > >> 2018-06-25 06:46:39.132257 a0ce2700 -1 osd.8 pg_epoch: 44738
> pg[4.e8(
> > > >> v 44721'485588 (44697'484015,44721'485588] local-lis/les=44593/44595
> > > >> n=2972 ec=9422/9422 lis/c 44593/44593 les/c/f 44595/44595/40729
> > > >> 44593/44593/44593) [8,7,10] r=0 lpr=44593 crt=44721'485588 lcod
> > > >> 44721'485586 mlcod 44721'485586 active+clean+snaptrim
> > > >> snaptrimq=[276~1,280~1,2af~1,2e8~4]] removing snap head
> > > >> 2018-06-25 06:46:41.314172 a1ce2700 -1
> > >
> >
> >
> /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc:
> > > >> In function 'void bluestore_extent_ref_map_t::put(uint64_t,
> uint32_t,
> > > >> PExtentVector*, bool*)' thread a1ce2700 time 2018-06-25
> > > >> 06:46:41.220388
> > >
> >
> >
> /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc:
> > > >> 217: FAILED assert(0 == "put on missing extent (nothing before)")
> > > >>
> > > >>  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> > > >> luminous (stable)
> > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > >> const*)+0x1bc) [0x2a2c314]
> > > >>  2: (bluestore_extent_ref_map_t::put(unsigned long long, unsigned
> int,
> > > >> std::vector > > >> mempool::pool_allocator<(mempool::pool_index_t)4,
> bluestore_pextent_t>
> > > >>
> > > >> >*, bool*)+0x128) [0x2893650]
> > > >>
> > > >>  3: (BlueStore::SharedBlob::put_ref(unsigned long long, unsigned
> int,
> > > >> std::vector > > >> mempool::pool_allocator<(mempool::pool_index_t)4,
> bluestore_pextent_t>
> > > >>
> > > >> >*, std::set > > >> > std::less,
> > >
> > > std::allocator >*)+0xb8) [0x2791bdc]
> > >
> > > >>  4: (BlueStore::_wctx_finish(BlueStore::TransContext*,
> > > >> boost::intrusive_ptr&,
> > > >> boost::intrusive_ptr, BlueStore::WriteContext*,
> > > >> std::set,
> > > >> std::allocator >*)+0x5c8) [0x27f3254]
> > > >>  5: (BlueStore::_do_truncate(BlueStore::TransContext*,
> > > >> boost::intrusive_ptr&,
> > > >> boost::intrusive_ptr, unsigned long long,
> > > >> std::set,
> > > >> std::allocator >*)+0x360) [0x27f7834]
> > > >>  6: (BlueStore::_do_remove(BlueStore::TransContext*,
> > > >> boost::intrusive_ptr&,
> > > >> boost::intrusive_ptr)+0xb4) [0x27f81b4]
> > > >>  7: (BlueStore::_remove(BlueStore::TransContext*,
> > > >> boost::intrusive_ptr&,
> > > >> 

Re: [ceph-users] commit_latency equals apply_latency on bluestore

2018-10-02 Thread Gregory Farnum
As I mentioned in that email, the apply and commit values in BlueStore are
equivalent. They're exported because it's part of the interface (thanks to
FileStore), but they won't differ. If you're doing monitoring or graphs,
just pick one.
-Greg
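
For example, a minimal sketch of pulling just one of the two values for graphing
(assuming jq is installed; the JSON field names below are as seen in
luminous/mimic and may differ in other releases):

  # per-OSD commit latency only - apply will be identical on BlueStore
  ceph osd perf -f json | \
    jq -r '.osd_perf_infos[] | "osd.\(.id) \(.perf_stats.commit_latency_ms)ms"'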

On Tue, Oct 2, 2018 at 3:43 PM Jakub Jaszewski 
wrote:

> Hi Cephers, Hi Gregory,
>
> I consider same case like here, commit_latency==apply_latency in ceph osd
> perf
>
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024317.html
>
> What's the meaning of commit_latency and apply_latency in bluestore OSD
> setups? How useful are they when troubleshooting? How do they correspond to
> separate block.db and block.wal devices?
>
> Thanks
> Jakub
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Sage Weil
osd_find_best_info_ignore_history_les is a dangerous option and you should 
only use it in very specific circumstances when directed by a developer.  
In such cases it will allow a stuck PG to peer.  But you're not getting to 
that point...you're seeing some sort of resource exhaustion.

The noup trick works when OSDs are way behind on maps and all need to
catch up.  The way to tell if they are behind is by looking at the 'ceph
daemon osd.NNN status' output and comparing to the latest OSDMap epoch that
the mons have.  Were they really caught up when you unset noup?
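
A minimal sketch of that comparison (run on the host where the OSD lives,
assuming the default admin socket path; osd.12 is just a placeholder):

  # map epoch range this OSD has processed
  ceph daemon osd.12 status | grep -E '"oldest_map"|"newest_map"'
  # latest OSDMap epoch according to the mons
  ceph osd dump | head -n 1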

I'm just catching up and haven't read the whole thread but I haven't seen
anything that explains why the OSDs are doing lots of disk IO.  Catching up
on maps could explain it but not why they wouldn't peer once they were all
marked up...

sage


On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:

> Has anyone heard about osd_find_best_info_ignore_history_les = true?
> Would it be useful here? There is very little information about it.
> 
> Goktug Yildirim wrote (2 Oct 2018 22:11):
> 
> > Hi,
> > 
> > Indeed I left ceph-disk to decide the wal and db partitions when I read
> > somewhere that that will do the proper sizing.
> > For the bluestore cache size I have plenty of RAM. I will increase it to 8GB
> > for each and decide on a more calculated number after the cluster settles.
> > 
> > For the osd map loading I’ve also figured it out. And it is in a loop. For
> > that reason I started the cluster with the noup flag and waited for the OSDs
> > to reach the up-to-date epoch number. After that I unset noup. But I did not
> > pay attention to the manager logs. Let me check it, thank you!
> > 
> > I am not forcing jemalloc or anything else really. I have a very standard
> > installation and no tweaks or tunings. All we asked for was stability over
> > speed from the beginning. And here we are :/
> > 
> >> On 2 Oct 2018, at 21:53, Darius Kasparavičius  wrote:
> >> 
> >> Hi,
> >> 
> >> 
> >> I can see some issues from the osd log file. You have an extremely low
> >> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >> recommend cranking up rocksdb cache size as much as possible. If you
> >> have RAM you can also increase bluestores cache size for hdd. Default
> >> is 1GB be as liberal as you can without getting OOM kills. You also
> >> have lots of osd map loading and decoding in the log. Are you sure all
> >> monitors/managers/osds are up to date? Plus make sure you aren't
> >> forcing jemalloc loading. I had a funny interaction after upgrading to
> >> mimic.
> >> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>  wrote:
> >>> 
> >>> Hello Darius,
> >>> 
> >>> Thanks for reply!
> >>> 
> >>> The main problem is we can not query PGs. “ceph pg 67.54f query” does 
> >>> stucks and wait forever since OSD is unresponsive.
> >>> We are certain that OSD gets unresponsive as soon as it UP. And we are 
> >>> certain that OSD responds again after its disk utilization stops.
> >>> 
> >>> So we have a small test like that:
> >>> * Stop all OSDs (168 of them)
> >>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 
> >>> mins to finish. Only after that “ceph pg 67.54f query” works!
> >>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 
> >>> starts %95 disk utilization. This takes 17 minutes to finish.
> >>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 
> >>> 25 mins to settle.
> >>> * If you happen to start 5 of them at the same all of the OSDs start high 
> >>> I/O again. And it takes 1 hour to finish.
> >>> 
> >>> So in the light of these findings we flagged noup, started all OSDs. At 
> >>> first there was no I/O. After 10 minutes we unset noup. All of 168 OSD 
> >>> started to make high I/O. And we thought that if we wait long enough it 
> >>> will finish & OSDs will be responsive again. After 24hours they did not 
> >>> because I/O did not finish or even slowed down.
> >>> One can think that is a lot of data there to scan. But it is just 33TB.
> >>> 
> >>> So at short we dont know which PG is stuck so we can remove it.
> >>> 
> >>> However we met an weird thing half an hour ago. We exported the same PG 
> >>> from two different OSDs. One was 4.2GB and the other is 500KB! So we 
> >>> decided to export all OSDs for backup. Then we will delete strange sized 
> >>> ones and start the cluster all over. Maybe then we could solve the 
> >>> stucked or unfound PGs as you advise.
> >>> 
> >>> Any thought would be greatly appreciated.
> >>> 
> >>> 
>  On 2 Oct 2018, at 18:16, Darius Kasparavičius  wrote:
>  
>  Hello,
>  
>  Currently you have 15 objects missing. I would recommend finding them
>  and making backups of them. Ditch all other osds that are failing to
>  start and concentrate on bringing online those that have missing
>  objects. Then slowly turn off nodown and noout on the cluster and see
>  if it stabilises. If it stabilises leave these setting if 

Re: [ceph-users] Bluestore vs. Filestore

2018-10-02 Thread Christian Balzer


Hello,

this has crept up before, find my thread 
"Bluestore caching, flawed by design?" for starters, if you haven't
already.

I'll have to build a new Ceph cluster next year and am also less than
impressed with the choices at this time:

1. Bluestore is the new shiny, filestore is going to die (and EXT4 support,
which in my experience and use case was MUCH better than XFS, already did).
Never mind the very bleeding-edge vibe I'm still getting from bluestore two
major releases in.

2. Caching is currently not up to snuff. 
To get the same level of caching as with pagecache AND being safe in
heavy recovery/rebalance situations one needs a lot more RAM, 30% at least
would be my guess. 

3. Small IOPS (for my use case) can't be guaranteed to have similar
behavior as with filestore and journals (some IOPS will go to disk
directly).

4. Cache tiers are deprecated. 
Despite working beautifully for my use case and probably for yours as well.
The proposed alternatives leave me cold, I tried them all a year ago.
LVM dm-cache was a nightmare to configure (documentation and complexity)
and performed abysmally compared to bcache. 
While bcache is fast (using it on 2 non-Ceph systems) it has other issues,
some load spikes (often at near idle times), can be crashed by querying
its sysfs counters and doesn't honor IO priorities...


In short, if you're willing to basically treat your Ceph cluster as an
appliance that can be disposed of after 5 years instead of a
continuously upgraded and expanded storage solution, you can do just that
and install filestore OSDs and/or cache-tiering now and never upgrade (at
least to a version where it becomes unsupported).


In my case I'm currently undecided between doing something like the above
or go all SSD (if that's affordable, maybe the RAM savings will help) and
thus bypass all the bluestore performance issues at least.

Regards,

Christian

On Tue, 2 Oct 2018 19:28:13 +0200 jes...@krogh.cc wrote:

> Hi.
> 
> Based on some recommendations we have setup our CephFS installation using
> bluestore*. We're trying to get a strong replacement for "huge" xfs+NFS
> server - 100TB-ish size.
> 
> Current setup is - a sizeable Linux host with 512GB of memory - one large
> Dell MD1200 or MD1220 - 100TB + a Linux kernel NFS server.
> 
> Since our "hot" dataset is < 400GB we can actually serve the hot data
> directly out of the host page-cache and never really touch the "slow"
> underlying drives. Except when new bulk data are written where a Perc with
> BBWC is consuming the data.
> 
> In the CephFS + Bluestore world, Ceph is "deliberately" bypassing the host
> OS page-cache, so even when we have 4-5 x 256GB memory** in the OSD hosts
> it is really hard to create a synthetic test where the hot data does not
> end up being read out of the underlying disks. Yes, the
> client side page cache works very well, but in our scenario we have 30+
> hosts pulling the same data over NFS.
> 
> Is bluestore just a "bad fit" .. Filestore "should" do the right thing? Is
> the recommendation to make an SSD "overlay" on the slow drives?
> 
> Thoughts?
> 
> Jesper
> 
> * Bluestore should be the new and shiny future - right?
> ** Total mem 1TB+
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "rgw relaxed s3 bucket names" and underscores

2018-10-02 Thread Ryan Leimenstoll
Nope, you are right. I think it was just boto catching this for me and I took 
that for granted. 

I think that is the behavior I would expect too: S3-compliant restrictions on
create, while allowing legacy buckets to remain. Anyway, I noticed you created
a ticket [0] in the tracker for this, thanks!

Best,
Ryan

[0] https://tracker.ceph.com/issues/36293 



> On Oct 2, 2018, at 6:08 PM, Robin H. Johnson  wrote:
> 
> On Tue, Oct 02, 2018 at 12:37:02PM -0400, Ryan Leimenstoll wrote:
>> I was hoping to get some clarification on what "rgw relaxed s3 bucket
>> names = false” is intended to filter. 
> Yes, it SHOULD have caught this case, but does not.
> 
> Are you sure it rejects the uppercase? My test also showed that it did
> NOT reject the uppercase as intended.
> 
> This code did used to work, I contributed to the logic and discussion
> for earlier versions. A related part I wanted was allowing access to
> existing buckets w/ relaxed names, but disallowing creating of relaxed
> names.
> 
> -- 
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail   : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Vasu Kulkarni
Can you file a tracker for your issue
(http://tracker.ceph.com/projects/ceph/issues/new)? Email, once it gets
lengthy, is not a great way to track an issue. Ideally, full details of the
environment (OS/Ceph versions, before/after, workload info, tool used for the
upgrade) are important if one has to recreate it.  There are various upgrade
tests in the suite, so it might be a miss; please file a tracker with details.
Thanks
On Tue, Oct 2, 2018 at 3:18 PM Goktug Yildirim
 wrote:
>
> Hi,
>
> Sorry to hear that. I’ve been battling with mine for 2 weeks :/
>
> I’ve corrected my OSDs with the following commands. My OSD logs
> (/var/log/ceph/ceph-OSDx.log) have a line including log(ERR) with the PG
> number beside it, just before the crash dump.
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op trim-pg-log 
> --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op fix-lost 
> --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair 
> --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op 
> mark-complete --pgid $2
> systemctl restart ceph-osd@$1
>
> I don't know if it works for you, but it should be no harm to try on one OSD.
>
> There is so little information about these tools, so it might be risky. I
> hope someone more experienced can help further.
>
>
> > On 2 Oct 2018, at 23:23, Brett Chancellor  
> > wrote:
> >
> > Help. I have a 60 node cluster and most of the OSDs decided to crash 
> > themselves at the same time. They wont restart, the messages look like...
> >
> > --- begin dump of recent events ---
> >  0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal 
> > (Aborted) **
> >  in thread 7f57ab5b7d80 thread_name:ceph-osd
> >
> >  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous 
> > (stable)
> >  1: (()+0xa3c611) [0x556d618bb611]
> >  2: (()+0xf6d0) [0x7f57a885e6d0]
> >  3: (gsignal()+0x37) [0x7f57a787f277]
> >  4: (abort()+0x148) [0x7f57a7880968]
> >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > const*)+0x284) [0x556d618fa6e4]
> >  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t 
> > const&)+0x3b2) [0x556d615c74a2]
> >  7: (PastIntervals::check_new_interval(int, int, std::vector > std::allocator > const&, std::vector > 
> > const&, int, int, std::vector > const&, 
> > std::vector > const&, unsigned int, unsigned int, 
> > std::shared_ptr, std::shared_ptr, pg_t, 
> > IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) 
> > [0x556d615ae6c0]
> >  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
> >  9: (OSD::load_pgs()+0x545) [0x556d61373095]
> >  10: (OSD::init()+0x2169) [0x556d613919d9]
> >  11: (main()+0x2d07) [0x556d61295dd7]
> >  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
> >  13: (()+0x4b53e3) [0x556d613343e3]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
> > to interpret this.
> >
> >
> > Some hosts have no working OSDs, others seem to have 1 working, and 2 dead. 
> >  It's spread all across the cluster, across several different racks. Any 
> > idea on where to look next? The cluster is dead in the water right now.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] commit_latency equals apply_latency on bluestore

2018-10-02 Thread Jakub Jaszewski
Hi Cephers, Hi Gregory,

I consider same case like here, commit_latency==apply_latency in ceph osd
perf

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024317.html

What's the meaning of commit_latency and apply_latency in bluestore OSD
setups? How useful are they when troubleshooting? How do they correspond to
separate block.db and block.wal devices?

Thanks
Jakub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Goktug Yildirim
Hi,

Sorry to hear that. I’ve been battling with mine for 2 weeks :/

I’ve corrected my OSDs with the following commands. My OSD logs
(/var/log/ceph/ceph-OSDx.log) have a line including log(ERR) with the PG number
beside it, just before the crash dump.

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op trim-pg-log --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op fix-lost --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op mark-complete --pgid $2
systemctl restart ceph-osd@$1

I don't know if it works for you, but it should be no harm to try on one OSD.

There is so little information about these tools, so it might be risky. I hope
someone more experienced can help further.
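
A cautious extra step before trying any of the above (a sketch only - the
backup path is an example, and the OSD must be stopped first): export the PG
so it can be re-imported with --op import if the repair makes things worse.

systemctl stop ceph-osd@$1
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op export --pgid $2 --file /backup/osd-$1-pg-$2.export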


> On 2 Oct 2018, at 23:23, Brett Chancellor  wrote:
> 
> Help. I have a 60 node cluster and most of the OSDs decided to crash 
> themselves at the same time. They wont restart, the messages look like...
> 
> --- begin dump of recent events ---
>  0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal 
> (Aborted) **
>  in thread 7f57ab5b7d80 thread_name:ceph-osd
> 
>  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous 
> (stable)
>  1: (()+0xa3c611) [0x556d618bb611]
>  2: (()+0xf6d0) [0x7f57a885e6d0]
>  3: (gsignal()+0x37) [0x7f57a787f277]
>  4: (abort()+0x148) [0x7f57a7880968]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x284) [0x556d618fa6e4]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t 
> const&)+0x3b2) [0x556d615c74a2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector std::allocator > const&, std::vector > const&, 
> int, int, std::vector > const&, std::vector std::allocator > const&, unsigned int, unsigned int, 
> std::shared_ptr, std::shared_ptr, pg_t, 
> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) 
> [0x556d615ae6c0]
>  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
>  9: (OSD::load_pgs()+0x545) [0x556d61373095]
>  10: (OSD::init()+0x2169) [0x556d613919d9]
>  11: (main()+0x2d07) [0x556d61295dd7]
>  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
>  13: (()+0x4b53e3) [0x556d613343e3]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
> 
> 
> Some hosts have no working OSDs, others seem to have 1 working, and 2 dead.  
> It's spread all across the cluster, across several different racks. Any idea 
> on where to look next? The cluster is dead in the water right now.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Testing cluster throughput - one OSD is always 100% utilized during rados bench write

2018-10-02 Thread Jakub Jaszewski
Hi Cephers,

I'm testing cluster throughput before moving to production. Ceph
version 13.2.1 (I'll update to 13.2.2).

I run rados bench from 10 cluster nodes and 10 clients in parallel.
Just after I call the rados command, the HDDs behind three OSDs are 100%
utilized while others are < 40%. After a short while only one OSD stays 100%
utilized. I've stopped this OSD to rule out a hardware issue, but then
another OSD on another node started hitting 100% disk util during the next
rados bench write. The same OSD is fully utilized for each bench run.

Device:  rrqm/s  wrqm/s  r/s   w/s     rMB/s  wMB/s   avgrq-sz  avgqu-sz  await   r_await  w_await  svctm  %util
sdd      0,00    0,00    0,00  518,00  0,00   129,50  512,00    87,99     155,12  0,00     155,12   1,93   100,00

The test pool size is 3 (replicated). (Deep)scrubbing is temporary off.

Networking, CPU and memory is underutilized during the test.

Particular rados command is
rados bench --name client.rbd_test -p rbd_test 600 write --no-cleanup
--run-name $(hostname)_bench

The same story with
rados --name client.rbd_test -p rbd_test load-gen --min-object-size 4M
--max-object-size 4M --min-op-len 4M --max-op-len 4M --max-ops 16
--read-percent 0 --target-throughput 1000 --run-length 600

Do you face the same behavior? It smells like something PG-related. Is it
the effect of running a number of rados bench tasks in parallel?
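
One way to check that (a sketch; osd id 12 and pool rbd_test are placeholders,
and "5" stands for whatever pool id "ceph osd pool ls detail" reports for the
test pool):

  ceph osd pool ls detail | grep rbd_test               # note the pool id
  ceph pg ls-by-primary 12 | awk '$1 ~ /^5\./' | wc -l   # PGs of that pool with osd.12 as primary
  ceph osd df tree                                       # per-OSD PG counts and fill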

Of course, I do not deny it's a cluster limit, but I'm not sure why only one
and always the same OSD keeps hitting 100% util. Tomorrow I'm going to test
the cluster using rbd.

What does your cluster's limit look like? Saturated LACP? 100% utilized HDDs?

Thanks,
Jakub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "rgw relaxed s3 bucket names" and underscores

2018-10-02 Thread Robin H. Johnson
On Tue, Oct 02, 2018 at 12:37:02PM -0400, Ryan Leimenstoll wrote:
> I was hoping to get some clarification on what "rgw relaxed s3 bucket
> names = false” is intended to filter. 
Yes, it SHOULD have caught this case, but does not.

Are you sure it rejects the uppercase? My test also showed that it did
NOT reject the uppercase as intended.

This code did use to work; I contributed to the logic and discussion
for earlier versions. A related part I wanted was allowing access to
existing buckets w/ relaxed names, but disallowing creation of relaxed
names.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


signature.asc
Description: Digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Göktuğ Yıldırım
Has anyone heard about osd_find_best_info_ignore_history_les = true?
Would it be useful here? There is very little information about it.

Goktug Yildirim wrote (2 Oct 2018 22:11):

> Hi,
> 
> Indeed I left ceph-disk to decide the wal and db partitions when I read
> somewhere that that will do the proper sizing.
> For the bluestore cache size I have plenty of RAM. I will increase it to 8GB
> for each and decide on a more calculated number after the cluster settles.
> 
> For the osd map loading I’ve also figured it out. And it is in a loop. For
> that reason I started the cluster with the noup flag and waited for the OSDs
> to reach the up-to-date epoch number. After that I unset noup. But I did not
> pay attention to the manager logs. Let me check it, thank you!
> 
> I am not forcing jemalloc or anything else really. I have a very standard
> installation and no tweaks or tunings. All we asked for was stability over
> speed from the beginning. And here we are :/
> 
>> On 2 Oct 2018, at 21:53, Darius Kasparavičius  wrote:
>> 
>> Hi,
>> 
>> 
>> I can see some issues from the osd log file. You have an extremely low
>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>> recommend cranking up rocksdb cache size as much as possible. If you
>> have RAM you can also increase bluestores cache size for hdd. Default
>> is 1GB be as liberal as you can without getting OOM kills. You also
>> have lots of osd map loading and decoding in the log. Are you sure all
>> monitors/managers/osds are up to date? Plus make sure you aren't
>> forcing jemalloc loading. I had a funny interaction after upgrading to
>> mimic.
>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>>  wrote:
>>> 
>>> Hello Darius,
>>> 
>>> Thanks for reply!
>>> 
>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does 
>>> stucks and wait forever since OSD is unresponsive.
>>> We are certain that OSD gets unresponsive as soon as it UP. And we are 
>>> certain that OSD responds again after its disk utilization stops.
>>> 
>>> So we have a small test like that:
>>> * Stop all OSDs (168 of them)
>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins 
>>> to finish. Only after that “ceph pg 67.54f query” works!
>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts 
>>> %95 disk utilization. This takes 17 minutes to finish.
>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 
>>> 25 mins to settle.
>>> * If you happen to start 5 of them at the same all of the OSDs start high 
>>> I/O again. And it takes 1 hour to finish.
>>> 
>>> So in the light of these findings we flagged noup, started all OSDs. At 
>>> first there was no I/O. After 10 minutes we unset noup. All of 168 OSD 
>>> started to make high I/O. And we thought that if we wait long enough it 
>>> will finish & OSDs will be responsive again. After 24hours they did not 
>>> because I/O did not finish or even slowed down.
>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>> 
>>> So at short we dont know which PG is stuck so we can remove it.
>>> 
>>> However we met an weird thing half an hour ago. We exported the same PG 
>>> from two different OSDs. One was 4.2GB and the other is 500KB! So we 
>>> decided to export all OSDs for backup. Then we will delete strange sized 
>>> ones and start the cluster all over. Maybe then we could solve the stucked 
>>> or unfound PGs as you advise.
>>> 
>>> Any thought would be greatly appreciated.
>>> 
>>> 
 On 2 Oct 2018, at 18:16, Darius Kasparavičius  wrote:
 
 Hello,
 
 Currently you have 15 objects missing. I would recommend finding them
 and making backups of them. Ditch all other osds that are failing to
 start and concentrate on bringing online those that have missing
 objects. Then slowly turn off nodown and noout on the cluster and see
 if it stabilises. If it stabilises leave these setting if not turn
 them back on.
 Now get some of the pg's that are blocked and querry the pgs to check
 why they are blocked. Try removing as much blocks as possible and then
 remove the norebalance/norecovery flags and see if it starts to fix
 itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
  wrote:
> 
> One of ceph experts indicated that bluestore is somewhat preview tech
> (as for Redhat).
> So it could be best to checkout bluestore and rocksdb. There are some
> tools to check health and also repair. But there are limited
> documentation.
> Anyone who has experince with it?
> Anyone lead/help to a proper check would be great.
> On Mon, 1 Oct 2018 at 22:55, Goktug Yildirim wrote:
>> 
>> Hi all,
>> 
>> We have recently upgraded from luminous to mimic. It’s been 6 days since 
>> this cluster is offline. The long short story is here: 
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>> 
>> 

[ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Brett Chancellor
Help. I have a 60-node cluster and most of the OSDs decided to crash
themselves at the same time. They won't restart; the messages look like...

--- begin dump of recent events ---
 0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
(Aborted) **
 in thread 7f57ab5b7d80 thread_name:ceph-osd

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous
(stable)
 1: (()+0xa3c611) [0x556d618bb611]
 2: (()+0xf6d0) [0x7f57a885e6d0]
 3: (gsignal()+0x37) [0x7f57a787f277]
 4: (abort()+0x148) [0x7f57a7880968]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x284) [0x556d618fa6e4]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
const&)+0x3b2) [0x556d615c74a2]
 7: (PastIntervals::check_new_interval(int, int, std::vector > const&, std::vector >
const&, int, int, std::vector > const&,
std::vector > const&, unsigned int, unsigned int,
std::shared_ptr, std::shared_ptr, pg_t,
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
[0x556d615ae6c0]
 8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
 9: (OSD::load_pgs()+0x545) [0x556d61373095]
 10: (OSD::init()+0x2169) [0x556d613919d9]
 11: (main()+0x2d07) [0x556d61295dd7]
 12: (__libc_start_main()+0xf5) [0x7f57a786b445]
 13: (()+0x4b53e3) [0x556d613343e3]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.


Some hosts have no working OSDs, others seem to have 1 working, and 2
dead.  It's spread all across the cluster, across several different racks.
Any idea on where to look next? The cluster is dead in the water right now.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Mirror Question

2018-10-02 Thread Jason Dillaman
On Tue, Oct 2, 2018 at 4:47 PM Vikas Rana  wrote:
>
> Hi,
>
> We have a CEPH 3 node cluster at primary site. We created a RBD image and the 
> image has about 100TB of data.
>
> Now we installed another 3 node cluster on secondary site. We want to 
> replicate the image at primary site to this new cluster on secondary site.
>
> As per documentation, we enabled journaling on primary site. We followed all 
> the procedure and peering looks good but the image is not copying.
> The status is always showing down.

Do you have an "rbd-mirror" daemon running on the secondary site? Are
you running "rbd mirror pool status" against the primary site or the
secondary site? The mirroring status is only available on the sites
running "rbd-mirror" daemon (the "down" means that the cluster you are
connected to doesn't have the daemon running).
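
A quick sketch of checking that (names are placeholders - "backup" is assumed
to be the local cluster name/conf for the secondary site, and "rbd" the
mirrored pool; the systemd unit id is an example):

  systemctl status ceph-rbd-mirror@admin          # on a secondary-site node
  rbd --cluster backup mirror pool status rbd --verbose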

> So my question is, is it possible to replicate an image which already has
> some data before enabling journaling?

Indeed -- it will perform a full image sync to the secondary site.

> We are using the image mirroring instead of pool mirroring. Do we need to 
> create the RBD image on secondary site? As per documentation, its not 
> required.

The only difference between the two modes is whether or not you need
to run "rbd mirror image enable".

> Is there any other option to copy the image to the remote site?

No other procedure should be required.

> Thanks,
> -Vikas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore vs. Filestore

2018-10-02 Thread Ronny Aasen

On 02.10.2018 21:21, jes...@krogh.cc wrote:

On 02.10.2018 19:28, jes...@krogh.cc wrote:
In the cephfs world there is no central server that holds the cache. Each
cephfs client reads data directly from the OSDs.

I can accept this argument, but nevertheless .. if I used Filestore - it
would work.


bluestore is fairly new though, so if your use case fits filestore better,
there is no huge reason not to just use that.





This also means no
single point of failure, and you can scale out performance by spreading
metadata tree information over multiple MDS servers, and scale out
storage and throughput with added OSD nodes.

so if the cephfs client cache is not sufficient, you can look at at the
bluestore cache.

http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size

I have been there, but it seems to "not work" - I think the need to
slice the cache per OSD and statically allocate memory per OSD breaks the
efficiency (but I cannot prove it).


or you can look at adding an SSD layer over the spinning disks, with e.g.
bcache.  I assume you are using an SSD/NVRAM for the bluestore db already

My current bluestore(s) are backed by 10TB 7.2K RPM drives, although behind
BBWC. Can you elaborate on the "assumption"? As we're not doing that, I'd like
to explore it.


https://ceph.com/community/new-luminous-bluestore/
read about "multiple devices"
You can split out the DB part of bluestore to a faster drive (SSD);
many tend to put the DBs for 4 spinners on a single SSD.
The DB holds the OSD metadata - it says where on the block device the objects
are - and it increases the performance of bluestore significantly.
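
A minimal sketch of that layout at OSD-creation time (the device names are
just examples), using ceph-volume as shipped with luminous/mimic:

  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1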






you should also look at tuning the cephfs metadata servers.
Make sure the metadata pool is on fast SSD OSDs, and tune the MDS
cache to the MDS server's RAM, so you cache as much metadata as possible.

Yes, we're in the process of doing that - I believe we're seeing the MDS
suffering when we saturate a few disks in the setup - and they are sharing.
Thus we'll move the metadata, as per the recommendations, to SSD.
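
A sketch of both steps (rule name, MDS name and the 16GB value are examples;
device classes require luminous or newer):

  ceph osd crush rule create-replicated ssd-only default host ssd
  ceph osd pool set cephfs_metadata crush_rule ssd-only
  ceph tell mds.<name> injectargs '--mds_cache_memory_limit=17179869184'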



good luck

Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD Mirror Question

2018-10-02 Thread Vikas Rana
Hi,

We have a CEPH 3 node cluster at primary site. We created a RBD image and
the image has about 100TB of data.

Now we installed another 3 node cluster on secondary site. We want to
replicate the image at primary site to this new cluster on secondary site.

As per documentation, we enabled journaling on primary site. We followed
all the procedure and peering looks good but the image is not copying.
The status is always showing down.


So my question is, is it possible to replicate an image which already has
some data before enabling journaling?

We are using image mirroring instead of pool mirroring. Do we need to
create the RBD image on the secondary site? As per the documentation, it's not
required.

Is there any other option to copy the image to the remote site?

Thanks,
-Vikas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Goktug Yildirim
Hi,

Indeed I left ceph-disk to decide the wal and db partitions when I read
somewhere that that will do the proper sizing.
For the bluestore cache size I have plenty of RAM. I will increase it to 8GB
for each and decide on a more calculated number after the cluster settles.

For the osd map loading I’ve also figured it out. And it is in a loop. For that
reason I started the cluster with the noup flag and waited for the OSDs to
reach the up-to-date epoch number. After that I unset noup. But I did not pay
attention to the manager logs. Let me check it, thank you!

I am not forcing jemalloc or anything else really. I have a very standard
installation and no tweaks or tunings. All we asked for was stability over
speed from the beginning. And here we are :/

> On 2 Oct 2018, at 21:53, Darius Kasparavičius  wrote:
> 
> Hi,
> 
> 
> I can see some issues from the osd log file. You have an extremely low
> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> recommend cranking up rocksdb cache size as much as possible. If you
> have RAM you can also increase bluestores cache size for hdd. Default
> is 1GB be as liberal as you can without getting OOM kills. You also
> have lots of osd map loading and decoding in the log. Are you sure all
> monitors/managers/osds are up to date? Plus make sure you aren't
> forcing jemalloc loading. I had a funny interaction after upgrading to
> mimic.
> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>  wrote:
>> 
>> Hello Darius,
>> 
>> Thanks for reply!
>> 
>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks 
>> and wait forever since OSD is unresponsive.
>> We are certain that OSD gets unresponsive as soon as it UP. And we are 
>> certain that OSD responds again after its disk utilization stops.
>> 
>> So we have a small test like that:
>> * Stop all OSDs (168 of them)
>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins 
>> to finish. Only after that “ceph pg 67.54f query” works!
>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts 
>> %95 disk utilization. This takes 17 minutes to finish.
>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 
>> mins to settle.
>> * If you happen to start 5 of them at the same all of the OSDs start high 
>> I/O again. And it takes 1 hour to finish.
>> 
>> So in the light of these findings we flagged noup, started all OSDs. At 
>> first there was no I/O. After 10 minutes we unset noup. All of 168 OSD 
>> started to make high I/O. And we thought that if we wait long enough it will 
>> finish & OSDs will be responsive again. After 24hours they did not because 
>> I/O did not finish or even slowed down.
>> One can think that is a lot of data there to scan. But it is just 33TB.
>> 
>> So at short we dont know which PG is stuck so we can remove it.
>> 
>> However we met an weird thing half an hour ago. We exported the same PG from 
>> two different OSDs. One was 4.2GB and the other is 500KB! So we decided to 
>> export all OSDs for backup. Then we will delete strange sized ones and start 
>> the cluster all over. Maybe then we could solve the stucked or unfound PGs 
>> as you advise.
>> 
>> Any thought would be greatly appreciated.
>> 
>> 
>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius  wrote:
>>> 
>>> Hello,
>>> 
>>> Currently you have 15 objects missing. I would recommend finding them
>>> and making backups of them. Ditch all other osds that are failing to
>>> start and concentrate on bringing online those that have missing
>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>> if it stabilises. If it stabilises leave these setting if not turn
>>> them back on.
>>> Now get some of the pg's that are blocked and querry the pgs to check
>>> why they are blocked. Try removing as much blocks as possible and then
>>> remove the norebalance/norecovery flags and see if it starts to fix
>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>  wrote:
 
 One of ceph experts indicated that bluestore is somewhat preview tech
 (as for Redhat).
 So it could be best to checkout bluestore and rocksdb. There are some
 tools to check health and also repair. But there are limited
 documentation.
 Anyone who has experince with it?
 Anyone lead/help to a proper check would be great.
 On Mon, 1 Oct 2018 at 22:55, Goktug Yildirim wrote:
> 
> Hi all,
> 
> We have recently upgraded from luminous to mimic. It’s been 6 days since 
> this cluster is offline. The long short story is here: 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> 
> I’ve also CC’ed developers since I believe this is a bug. If this is not 
> to correct way I apology and please let me know.
> 
> For the 6 days lots of thing happened and there were some outcomes about 
> the problem. Some of them was misjudged and some of them are not looked 
> deeper.

Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Goktug Yildirim
Thanks for the reply! My answers are inline.

> On 2 Oct 2018, at 21:51, Paul Emmerich  wrote:
> 
> (Didn't follow the whole story, so you might have already answered that)
> Did you check what the OSDs are doing during the period of high disk
> utilization?
> As in:
> 
> * running perf top
Did not cross my mind. Thanks for the pop-up! Will do.
> * sampling a few stack traces from procfs or gdb
I have strace for OSD. https://paste.ubuntu.com/p/8n2kTvwnG6/
> * or just high log settings
They have default debug settings and the log disk is separate. Indeed I have
a fairly fast system. OS disks are mirrored SSD, WALs+DBs are mirrored NVMe and
OSD disks are NL-SAS. All hardware came from Dell (R730). Also 28 cores and
256GB RAM per server, and 2x10GbE for the cluster and 2x10GbE for the public
networks.
> * running "status" on the admin socket locally
I can run daemon and see status. I must have checked it but will do again.
> 
> 
> Paul
> 
> On Tue, 2 Oct 2018 at 20:02, Goktug Yildirim wrote:
>> 
>> Hello Darius,
>> 
>> Thanks for reply!
>> 
>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks 
>> and wait forever since OSD is unresponsive.
>> We are certain that OSD gets unresponsive as soon as it UP. And we are 
>> certain that OSD responds again after its disk utilization stops.
>> 
>> So we have a small test like that:
>> * Stop all OSDs (168 of them)
>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins 
>> to finish. Only after that “ceph pg 67.54f query” works!
>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts 
>> %95 disk utilization. This takes 17 minutes to finish.
>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 
>> mins to settle.
>> * If you happen to start 5 of them at the same all of the OSDs start high 
>> I/O again. And it takes 1 hour to finish.
>> 
>> So in the light of these findings we flagged noup, started all OSDs. At 
>> first there was no I/O. After 10 minutes we unset noup. All of 168 OSD 
>> started to make high I/O. And we thought that if we wait long enough it will 
>> finish & OSDs will be responsive again. After 24hours they did not because 
>> I/O did not finish or even slowed down.
>> One can think that is a lot of data there to scan. But it is just 33TB.
>> 
>> So at short we dont know which PG is stuck so we can remove it.
>> 
>> However we met an weird thing half an hour ago. We exported the same PG from 
>> two different OSDs. One was 4.2GB and the other is 500KB! So we decided to 
>> export all OSDs for backup. Then we will delete strange sized ones and start 
>> the cluster all over. Maybe then we could solve the stucked or unfound PGs 
>> as you advise.
>> 
>> Any thought would be greatly appreciated.
>> 
>> 
>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius  wrote:
>>> 
>>> Hello,
>>> 
>>> Currently you have 15 objects missing. I would recommend finding them
>>> and making backups of them. Ditch all other osds that are failing to
>>> start and concentrate on bringing online those that have missing
>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>> if it stabilises. If it stabilises leave these setting if not turn
>>> them back on.
>>> Now get some of the pg's that are blocked and querry the pgs to check
>>> why they are blocked. Try removing as much blocks as possible and then
>>> remove the norebalance/norecovery flags and see if it starts to fix
>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>  wrote:
 
 One of ceph experts indicated that bluestore is somewhat preview tech
 (as for Redhat).
 So it could be best to checkout bluestore and rocksdb. There are some
 tools to check health and also repair. But there are limited
 documentation.
 Anyone who has experince with it?
 Anyone lead/help to a proper check would be great.
 On Mon, 1 Oct 2018 at 22:55, Goktug Yildirim wrote:
> 
> Hi all,
> 
> We have recently upgraded from luminous to mimic. It’s been 6 days since 
> this cluster is offline. The long short story is here: 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> 
> I’ve also CC’ed developers since I believe this is a bug. If this is not 
> to correct way I apology and please let me know.
> 
> For the 6 days lots of thing happened and there were some outcomes about 
> the problem. Some of them was misjudged and some of them are not looked 
> deeper.
> However the most certain diagnosis is this: each OSD causes very high 
> disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs 
> become unresponsive or very very less responsive. For example "ceph tell 
> osd.x version” stucks like for ever.
> 
> So due to unresponsive OSDs cluster does not settle. This is our problem!
> 
> This is the one we are very sure of. But we are not sure of the 

Re: [ceph-users] EC pool spread evenly across failure domains?

2018-10-02 Thread Paul Emmerich
step take default
step choose indep 3 chassis
step chooseleaf indep 2 host

which will only work for k+m=6 setups
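
Spelled out, a full rule of that shape might look like this in a decompiled
crushmap (the rule name and id are examples; compile it with crushtool and
load it with "ceph osd setcrushmap -i"):

rule ec_k4m2_by_chassis {
    id 2
    type erasure
    min_size 6
    max_size 6
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type chassis
    step chooseleaf indep 2 type host
    step emit
}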

Paul

On Tue, 2 Oct 2018 at 20:36, Mark Johnston wrote:
>
> I have the following setup in a test cluster:
>
>  -1   8.49591 root default
> -15   2.83197 chassis vm1
>  -3   1.41599 host ceph01
>   0   ssd 1.41599 osd.0
>  -5   1.41599 host ceph02
>   1   ssd 1.41599 osd.1
> -19   2.83197 chassis vm2
>  -7   1.41599 host ceph03
>   2   ssd 1.41599 osd.2
>  -9   1.41599 host ceph04
>   3   ssd 1.41599 osd.3
> -20   2.83197 chassis vm3
> -11   1.41599 host ceph05
>   4   ssd 1.41599 osd.4
> -13   1.41599 host ceph06
>   5   ssd 1.41599 osd.5
>
> I created an EC pool with k=4 m=2 and crush-failure-domain=chassis.  The PGs
> are stuck in creating+incomplete with only 3 assigned OSDs each.  I'm assuming
> this is because using crush-failure-domain=chassis requires a different 
> chassis
> for every chunk.
>
> I don't want to switch to k=2 m=1 because I want to be able to survive two OSD
> failures, and I don't want to use crush-failure-domain=host because I don't 
> want
> more than two chunks to be placed on the same chassis.  (The production 
> cluster
> will have more than two hosts per chassis, so crush-failure-domain=host could
> put all 6 chunks on the same chassis.)
>
> Do I need to write a custom CRUSH rule to get this to happen?  Or have I 
> missed
> something?
>
> Thanks,
> Mark
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Darius Kasparavičius
Hi,


I can see some issues from the osd log file. You have extremely small
db and wal partitions: only 1GB for the DB and 576MB for the WAL. I would
recommend cranking up the rocksdb cache size as much as possible. If you
have RAM you can also increase bluestore's cache size for hdd. The default
is 1GB; be as liberal as you can without getting OOM kills. You also
have lots of osd map loading and decoding in the log. Are you sure all
monitors/managers/osds are up to date? Also make sure you aren't
forcing jemalloc loading. I had a funny interaction after upgrading to
mimic.
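
A hedged example of the cache bump (the values are only an example, sized for
a host with plenty of free RAM; these are the luminous/mimic option names and
need an OSD restart to take effect):

[osd]
bluestore_cache_size_hdd = 4294967296   # 4 GiB per HDD OSD, default is 1 GiB
bluestore_cache_size_ssd = 4294967296
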
On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
 wrote:
>
> Hello Darius,
>
> Thanks for the reply!
>
> The main problem is we can not query PGs. “ceph pg 67.54f query” gets stuck
> and waits forever since the OSD is unresponsive.
> We are certain that the OSD gets unresponsive as soon as it is UP. And we are
> certain that the OSD responds again after its disk utilization stops.
>
> So we ran a small test like this:
> * Stop all OSDs (168 of them)
> * Start OSD1. 95% osd disk utilization immediately starts. It takes 8 mins to
> finish. Only after that does “ceph pg 67.54f query” work!
> * While OSD1 is "up" start OSD2. As soon as OSD2 starts, OSD1 & OSD2 start
> 95% disk utilization. This takes 17 minutes to finish.
> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25
> mins to settle.
> * If you happen to start 5 of them at the same time, all of the OSDs start
> high I/O again. And it takes 1 hour to finish.
>
> So in the light of these findings we flagged noup and started all OSDs. At
> first there was no I/O. After 10 minutes we unset noup. All 168 OSDs started
> to do high I/O. And we thought that if we waited long enough it would finish
> and the OSDs would be responsive again. After 24 hours they were not, because
> the I/O did not finish or even slow down.
> One could think there is a lot of data there to scan. But it is just 33TB.
>
> So in short, we don't know which PG is stuck so we can remove it.
>
> However we hit a weird thing half an hour ago. We exported the same PG from
> two different OSDs. One was 4.2GB and the other was 500KB! So we decided to
> export all OSDs for backup. Then we will delete the strangely sized ones and
> start the cluster all over. Maybe then we could solve the stuck or unfound
> PGs as you advise.
>
> Any thoughts would be greatly appreciated.
>
>
> > On 2 Oct 2018, at 18:16, Darius Kasparavičius  wrote:
> >
> > Hello,
> >
> > Currently you have 15 objects missing. I would recommend finding them
> > and making backups of them. Ditch all other osds that are failing to
> > start and concentrate on bringing online those that have missing
> > objects. Then slowly turn off nodown and noout on the cluster and see
> > if it stabilises. If it stabilises leave these setting if not turn
> > them back on.
> > Now get some of the pg's that are blocked and querry the pgs to check
> > why they are blocked. Try removing as much blocks as possible and then
> > remove the norebalance/norecovery flags and see if it starts to fix
> > itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >  wrote:
> >>
> >> One of ceph experts indicated that bluestore is somewhat preview tech
> >> (as for Redhat).
> >> So it could be best to checkout bluestore and rocksdb. There are some
> >> tools to check health and also repair. But there are limited
> >> documentation.
> >> Anyone who has experince with it?
> >> Anyone lead/help to a proper check would be great.
> >> On Mon, 1 Oct 2018 at 22:55, Goktug Yildirim wrote:
> >>>
> >>> Hi all,
> >>>
> >>> We have recently upgraded from luminous to mimic. It’s been 6 days since 
> >>> this cluster is offline. The long short story is here: 
> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>
> >>> I’ve also CC’ed developers since I believe this is a bug. If this is not 
> >>> to correct way I apology and please let me know.
> >>>
> >>> For the 6 days lots of thing happened and there were some outcomes about 
> >>> the problem. Some of them was misjudged and some of them are not looked 
> >>> deeper.
> >>> However the most certain diagnosis is this: each OSD causes very high 
> >>> disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs 
> >>> become unresponsive or very very less responsive. For example "ceph tell 
> >>> osd.x version” stucks like for ever.
> >>>
> >>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>
> >>> This is the one we are very sure of. But we are not sure of the reason.
> >>>
> >>> Here is the latest ceph status:
> >>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>
> >>> This is the status after we started all of the OSDs 24 hours ago.
> >>> Some of the OSDs are not started. However it didnt make any difference 
> >>> when all of them was online.
> >>>
> >>> Here is the debug=20 log of an OSD which is same for all others:
> >>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>> As we figure out there 

Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Paul Emmerich
(Didn't follow the whole story, so you might have already answered that)
Did you check what the OSDs are doing during the period of high disk
utilization?
As in:

* running perf top
* sampling a few stack traces from procfs or gdb
* or just high log settings
* running "status" on the admin socket locally


Paul

On Tue, 2 Oct 2018 at 20:02, Goktug Yildirim wrote:
>
> Hello Darius,
>
> Thanks for the reply!
>
> The main problem is we can not query PGs. “ceph pg 67.54f query” gets stuck
> and waits forever since the OSD is unresponsive.
> We are certain that the OSD gets unresponsive as soon as it is UP. And we are
> certain that the OSD responds again after its disk utilization stops.
>
> So we ran a small test like this:
> * Stop all OSDs (168 of them)
> * Start OSD1. 95% osd disk utilization immediately starts. It takes 8 mins to
> finish. Only after that does “ceph pg 67.54f query” work!
> * While OSD1 is "up" start OSD2. As soon as OSD2 starts, OSD1 & OSD2 start
> 95% disk utilization. This takes 17 minutes to finish.
> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25
> mins to settle.
> * If you happen to start 5 of them at the same time, all of the OSDs start
> high I/O again. And it takes 1 hour to finish.
>
> So in the light of these findings we flagged noup and started all OSDs. At
> first there was no I/O. After 10 minutes we unset noup. All 168 OSDs started
> to do high I/O. And we thought that if we waited long enough it would finish
> and the OSDs would be responsive again. After 24 hours they were not, because
> the I/O did not finish or even slow down.
> One could think there is a lot of data there to scan. But it is just 33TB.
>
> So in short, we don't know which PG is stuck so we can remove it.
>
> However we hit a weird thing half an hour ago. We exported the same PG from
> two different OSDs. One was 4.2GB and the other was 500KB! So we decided to
> export all OSDs for backup. Then we will delete the strangely sized ones and
> start the cluster all over. Maybe then we could solve the stuck or unfound
> PGs as you advise.
>
> Any thoughts would be greatly appreciated.
>
>
> > On 2 Oct 2018, at 18:16, Darius Kasparavičius  wrote:
> >
> > Hello,
> >
> > Currently you have 15 objects missing. I would recommend finding them
> > and making backups of them. Ditch all other osds that are failing to
> > start and concentrate on bringing online those that have missing
> > objects. Then slowly turn off nodown and noout on the cluster and see
> > if it stabilises. If it stabilises leave these setting if not turn
> > them back on.
> > Now get some of the pg's that are blocked and querry the pgs to check
> > why they are blocked. Try removing as much blocks as possible and then
> > remove the norebalance/norecovery flags and see if it starts to fix
> > itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >  wrote:
> >>
> >> One of ceph experts indicated that bluestore is somewhat preview tech
> >> (as for Redhat).
> >> So it could be best to checkout bluestore and rocksdb. There are some
> >> tools to check health and also repair. But there are limited
> >> documentation.
> >> Anyone who has experince with it?
> >> Anyone lead/help to a proper check would be great.
> >> Goktug Yildirim , 1 Eki 2018 Pzt, 22:55
> >> tarihinde şunu yazdı:
> >>>
> >>> Hi all,
> >>>
> >>> We have recently upgraded from luminous to mimic. It’s been 6 days since 
> >>> this cluster is offline. The long short story is here: 
> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>
> >>> I’ve also CC’ed developers since I believe this is a bug. If this is not 
> >>> to correct way I apology and please let me know.
> >>>
> >>> For the 6 days lots of thing happened and there were some outcomes about 
> >>> the problem. Some of them was misjudged and some of them are not looked 
> >>> deeper.
> >>> However the most certain diagnosis is this: each OSD causes very high 
> >>> disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs 
> >>> become unresponsive or very very less responsive. For example "ceph tell 
> >>> osd.x version” stucks like for ever.
> >>>
> >>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>
> >>> This is the one we are very sure of. But we are not sure of the reason.
> >>>
> >>> Here is the latest ceph status:
> >>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>
> >>> This is the status after we started all of the OSDs 24 hours ago.
> >>> Some of the OSDs are not started. However it didnt make any difference 
> >>> when all of them was online.
> >>>
> >>> Here is the debug=20 log of an OSD which is same for all others:
> >>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>> As we figure out there is a loop pattern. I am sure it wont caught from 
> >>> eye.
> >>>
> >>> This the full log the same OSD.
> >>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>
> >>> Here is the strace of the same OSD process:
> >>> 

Re: [ceph-users] EC pool spread evenly across failure domains?

2018-10-02 Thread Vasu Kulkarni
On Tue, Oct 2, 2018 at 11:35 AM Mark Johnston  wrote:
>
> I have the following setup in a test cluster:
>
>  -1   8.49591 root default
> -15   2.83197 chassis vm1
>  -3   1.41599 host ceph01
>   0   ssd 1.41599 osd.0
>  -5   1.41599 host ceph02
>   1   ssd 1.41599 osd.1
> -19   2.83197 chassis vm2
>  -7   1.41599 host ceph03
>   2   ssd 1.41599 osd.2
>  -9   1.41599 host ceph04
>   3   ssd 1.41599 osd.3
> -20   2.83197 chassis vm3
> -11   1.41599 host ceph05
>   4   ssd 1.41599 osd.4
> -13   1.41599 host ceph06
>   5   ssd 1.41599 osd.5
>
> I created an EC pool with k=4 m=2 and crush-failure-domain=chassis.  The PGs
> are stuck in creating+incomplete with only 3 assigned OSDs each.  I'm assuming
> this is because using crush-failure-domain=chassis requires a different 
> chassis
> for every chunk.
>
> I don't want to switch to k=2 m=1 because I want to be able to survive two OSD
> failures, and I don't want to use crush-failure-domain=host because I don't 
> want
> more than two chunks to be placed on the same chassis.  (The production 
> cluster
> will have more than two hosts per chassis, so crush-failure-domain=host could
> put all 6 chunks on the same chassis.)
>
> Do I need to write a custom CRUSH rule to get this to happen?  Or have I 
> missed
> something?
your hierarchy only includes 3 chassis with 2 nodes each, so the
max k+m configuration for a "chassis" failure domain will be 3. You can
either change the failure domain to "rack" and move each node into its
own rack so that you have 6 racks, or you can create additional chassis
and move the nodes around. You don't have to edit the crushmap by hand for
that; "ceph osd crush" with add-bucket/move/remove etc. should help you
achieve it ( http://docs.ceph.com/docs/master/rados/operations/crush-map/
)
e.g.:
ceph osd crush add-bucket rack1 rack
ceph osd crush move node1 rack=rack1
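
A fuller sketch of that rearrangement for the tree quoted above (the rack
names are invented; the host names are taken from the listing):

    for i in 1 2 3 4 5 6; do
        ceph osd crush add-bucket rack$i rack
        ceph osd crush move rack$i root=default
    done
    ceph osd crush move ceph01 rack=rack1
    ceph osd crush move ceph02 rack=rack2
    ceph osd crush move ceph03 rack=rack3
    ceph osd crush move ceph04 rack=rack4
    ceph osd crush move ceph05 rack=rack5
    ceph osd crush move ceph06 rack=rack6
    # then build the pool from a profile whose failure domain is "rack"
    ceph osd erasure-code-profile set ec-k4-m2-rack k=4 m=2 crush-failure-domain=rack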

>
> Thanks,
> Mark
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 12.2.5 - FAILED assert(0 == "put on missing extent (nothing before)")

2018-10-02 Thread Ricardo J. Barberis
Hello,

I'm having this same issue on 12.2.8. Should I reopen the bug report?

This cluster started on 12.2.4 and was upgraded to 12.2.5 and then directly to 
12.2.8 (we skipped 2.6 and 2.7) but the malfunctioning OSD is on a new node 
installed with 12.2.8.

We're using CentOS 7.5, and bluestore for ceph. This particular node has SSD 
disks.

I have an extract of the log and objdump if needed.

Thanks,

El Miércoles 11/07/2018 a las 18:31, Gregory Farnum escribió:
> A bit delayed, but Radoslaw looked at this some and has a diagnosis on the
> tracker ticket: http://tracker.ceph.com/issues/24715
> So it looks like a symptom of a bug that was already fixed for unrelated
> reasons. :)
> -Greg
>
> On Wed, Jun 27, 2018 at 4:51 AM Dyweni - Ceph-Users
> <6exbab4fy...@dyweni.com>
>
> wrote:
> > Good Morning,
> >
> > I have rebuilt the OSD and the cluster is healthy now.
> >
> > I have one pool with 3 replica setup.  I am a bit concerned that
> > removing a snapshot can cause an OSD to crash.  I've asked myself what
> > would have happened if 2 OSD's had crashed?  God forbid, what if 3 or
> > more OSD's had crashed with this same error?  How would I have recovered
> > from that?
> >
> > So planning for the future:
> >
> >   1. Is there any way to proactively scan for (and even repair) this?
> >
> >   2. What could have caused this?
> >
> > We experienced a cluster wide power outage lasting several hours several
> > days ago.  The outage occurred at a time when no snapshots were being
> > created.  The cluster was brought back up in a controlled manner and no
> > errors were discovered immediately afterward (Ceph reported healthly).
> > Could this have caused corruption?
> >
> > Thanks,
> > Dyweni
> >
> > On 2018-06-25 09:34, Dyweni - Ceph-Users wrote:
> > > Hi,
> > >
> > > Is there any information you'd like to grab off this OSD?  Anything I
> > > can provide to help you troubleshoot this?
> > >
> > > I ask, because if not, I'm going to reformat / rebuild this OSD
> > > (unless there is a faster way to repair this issue).
> > >
> > > Thanks,
> > > Dyweni
> > >
> > > On 2018-06-25 07:30, Dyweni - Ceph-Users wrote:
> > >> Good Morning,
> > >>
> > >> After removing roughly 20-some rbd shapshots, one of my OSD's has
> > >> begun flapping.
> > >>
> > >>
> > >>  ERROR 1 
> > >>
> > >> 2018-06-25 06:46:39.132257 a0ce2700 -1 osd.8 pg_epoch: 44738 pg[4.e8(
> > >> v 44721'485588 (44697'484015,44721'485588] local-lis/les=44593/44595
> > >> n=2972 ec=9422/9422 lis/c 44593/44593 les/c/f 44595/44595/40729
> > >> 44593/44593/44593) [8,7,10] r=0 lpr=44593 crt=44721'485588 lcod
> > >> 44721'485586 mlcod 44721'485586 active+clean+snapt
> > >> rim snaptrimq=[276~1,280~1,2af~1,2e8~4]] removing snap head
> > >> 2018-06-25 06:46:41.314172 a1ce2700 -1
> >
> 
> /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc:
> > >> In function 'void bluestore_extent_ref_map_t::put(uint64_t, uint32_t,
> > >> PExtentVector*, bool*)' thread a1ce2700 time 2018-06-25
> > >> 06:46:41.220388
> >
> 
> /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc:
> > >> 217: FAILED assert(0 == "put on missing extent (nothing before)")
> > >>
> > >>  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> > >> luminous (stable)
> > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > >> const*)+0x1bc) [0x2a2c314]
> > >>  2: (bluestore_extent_ref_map_t::put(unsigned long long, unsigned int,
> > >> std::vector > >> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
> > >>
> > >> >*, bool*)+0x128) [0x2893650]
> > >>
> > >>  3: (BlueStore::SharedBlob::put_ref(unsigned long long, unsigned int,
> > >> std::vector > >> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
> > >>
> > >> >*, std::set > >> > std::less,
> >
> > std::allocator >*)+0xb8) [0x2791bdc]
> >
> > >>  4: (BlueStore::_wctx_finish(BlueStore::TransContext*,
> > >> boost::intrusive_ptr&,
> > >> boost::intrusive_ptr, BlueStore::WriteContext*,
> > >> std::set,
> > >> std::allocator >*)+0x5c8) [0x27f3254]
> > >>  5: (BlueStore::_do_truncate(BlueStore::TransContext*,
> > >> boost::intrusive_ptr&,
> > >> boost::intrusive_ptr, unsigned long long,
> > >> std::set,
> > >> std::allocator >*)+0x360) [0x27f7834]
> > >>  6: (BlueStore::_do_remove(BlueStore::TransContext*,
> > >> boost::intrusive_ptr&,
> > >> boost::intrusive_ptr)+0xb4) [0x27f81b4]
> > >>  7: (BlueStore::_remove(BlueStore::TransContext*,
> > >> boost::intrusive_ptr&,
> > >> boost::intrusive_ptr&)+0x1dc) [0x27f9638]
> > >>  8: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > >> ObjectStore::Transaction*)+0xe7c) [0x27e855c]
> > >>  9: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > >> std::vector > >> std::allocator >&,
> > >> boost::intrusive_ptr, ThreadPool::TPHandle*)+0x67c)
> > >> [0x27e6f80]
> > >>  10: (ObjectStore::queue_transactions(ObjectStore::Sequencer*,
> > >> 

Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-02 Thread Marc Roos
 
I would 'also' choose a solution where, in the case of a mv across 
pools, the user has to wait a bit longer for the copy to finish. And as 
said before, if you export cephfs via smb or nfs, I wonder how the 
nfs/smb server will execute the move. 

If I use a 1x replicated pool on /tmp and move the file to a 3x replicated 
pool, I assume my data is moved there and is more secure. 
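
A quick way to see this behaviour on a live CephFS mount is the layout
xattrs; a sketch with made-up paths and pool names, assuming the target
directory's layout was pointed at the 3x pool before any files were
created there:

    # setfattr -n ceph.dir.layout.pool -v cephfs_data_3x /mnt/cephfs/secure   (done earlier)
    getfattr -n ceph.file.layout.pool /mnt/cephfs/tmp/bigfile
    mv /mnt/cephfs/tmp/bigfile /mnt/cephfs/secure/
    getfattr -n ceph.file.layout.pool /mnt/cephfs/secure/bigfile
    # same pool before and after: the rename only moved metadata, the
    # objects still live in the pool the file was created in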




-Original Message-
From: Janne Johansson [mailto:icepic...@gmail.com] 
Sent: Tuesday, October 2, 2018 15:44
To: jsp...@redhat.com
Cc: Marc Roos; Ceph Users
Subject: Re: [ceph-users] cephfs issue with moving files between data 
pools gives Input/output error

On Mon, Oct 1, 2018 at 22:08, John Spray  wrote:



> totally new for me, also not what I would expect of a mv on a fs. 
I know
> this is normal to expect coping between pools, also from the 
s3cmd
> client. But I think more people will not expect this behaviour. 
Can't
> the move be implemented as a move?

In almost all filesystems, a rename (like "mv") is a pure metadata
operation -- it doesn't involve reading all the file's data and
re-writing it.  It would be very surprising for most users if they
found that their "mv" command blocked for a very long time while
waiting for a large file's content to be e.g. read out of one pool 
and
written into another.


There are other networked filesystems which do behave like that, where 
the OS thinks the whole mount is one single FS, but when you move things 
around with mv it actually needs to move all the data to other servers/disks 
and incur the slowness of a copy/delete operation.
 

-- 

May the most significant bit of your life be positive.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore vs. Filestore

2018-10-02 Thread jesper
> On 02.10.2018 19:28, jes...@krogh.cc wrote:
> In the cephfs world there is no central server that hold the cache. each
> cephfs client reads data directly from the osd's.

I can accept this argument, but nevertheless .. if I used Filestore - it
would work.

> This also means no
> single point of failure, and you can scale out performance by spreading
> metadata tree information over multiple MDS servers. and scale out
> storage and throughput with added osd nodes.
>
> so if the cephfs client cache is not sufficient, you can look at at the
> bluestore cache.
http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size

I have been there, but it seems to "not work" - I think the need to
slice the cache per OSD and statically allocate memory per OSD hurts the
efficiency (but I cannot prove it).

> or you can look at adding a ssd layer over the spinning disks. with eg 
> bcache.  I assume you are using a ssd/nvram for bluestore db already

My current bluestore OSDs are backed by 10TB 7.2K RPM drives, although behind
BBWC. Can you elaborate on that "assumption"? We're not doing that, and I'd like
to explore it.

> you should also look at tuning the cephfs metadata servers.
> make sure the metadata pool is on fast ssd osd's .  and tune the mds
> cache to the mds server's ram, so you cache as much metadata as possible.

Yes, we're in the process of doing that - I believe we're seeing the MDS
suffering when we saturate a few disks in the setup - and they are sharing.
Thus we'll move the metadata to SSD as per the recommendations.

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] getattr - failed to rdlock waiting

2018-10-02 Thread Thomas Sumpter
Hi Folks,

I am looking for advice on how to troubleshoot some long operations found in 
MDS. Most of the time performance is fantastic, but occasionally, and with no real 
pattern or trend, a getattr op will take up to ~30 seconds to complete in MDS, 
stuck on "event": "failed to rdlock, waiting"

E.g.
"description": "client_request(client.84183:54794012 getattr pAsLsXsFs 
#0x1038585 2018-10-02 07:56:27.554282 caller_uid=48, caller_gid=48{})",
"duration": 28.987992,
{
"time": "2018-09-25 07:56:27.552511",
"event": "failed to rdlock, waiting"
},
{
"time": "2018-09-25 07:56:56.529748",
"event": "failed to rdlock, waiting"
},
{
"time": "2018-09-25 07:56:56.540386",
"event": "acquired locks"
}

I can find no corresponding long op on any of the OSDs and no other op in MDS 
which this one could be waiting for.
Nearly all configuration will be the default. Currently have a small amount of 
data which is constantly being updated. 1 data pool and 1 metadata pool.
How can I track down what is holding up this op and try to stop it happening?
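
One place to start digging is the MDS admin socket while such an op is
stuck; a sketch, assuming the active MDS id is "a":

    ceph daemon mds.a dump_ops_in_flight     # shows the stuck getattr and its current flag_point
    ceph daemon mds.a objecter_requests      # anything still outstanding against the OSDs
    ceph daemon mds.a session ls             # which client holds the caps blocking the rdlock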

# rados df
...
total_objects    191
total_used   5.7 GiB
total_avail  367 GiB
total_space  373 GiB


Cephfs version 13.2.1 on CentOs 7.5
Kernel: 3.10.0-862.11.6.el7.x86_64
1x Active MDS, 1x Replay Standby MDS
3x MON
4x OSD
Bluestore FS

Ceph kernel client on CentOs 7.4
Kernel: 4.18.7-1.el7.elrepo.x86_64  (almost the latest, should be good?)

Many Thanks!
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore vs. Filestore

2018-10-02 Thread Ronny Aasen

On 02.10.2018 19:28, jes...@krogh.cc wrote:

Hi.

Based on some recommendations we have setup our CephFS installation using
bluestore*. We're trying to get a strong replacement for "huge" xfs+NFS
server - 100TB-ish size.

Current setup is - a sizeable Linux host with 512GB of memory - one large
Dell MD1200 or MD1220 - 100TB + a Linux kernel NFS server.

Since our "hot" dataset is < 400GB we can actually serve the hot data
directly out of the host page-cache and never really touch the "slow"
underlying drives. Except when new bulk data are written where a Perc with
BBWC is consuming the data.

In the CephFS + Bluestore world, Ceph is "deliberately" bypassing the host
OS page-cache, so even when we have 4-5 x 256GB memory** in the OSD hosts
it is really hard to create a synthetic test where the hot data does not
end up being read out of the underlying disks. Yes, the
client side page cache works very well, but in our scenario we have 30+
hosts pulling the same data over NFS.

Is bluestore just a "bad fit" .. Filestore "should" do the right thing? Is
the recommendation to make an SSD "overlay" on the slow drives?

Thoughts?

Jesper

* Bluestore should be the new and shiny future - right?
** Total mem 1TB+





In the cephfs world there is no central server that holds the cache; each 
cephfs client reads data directly from the OSDs. This also means no 
single point of failure, and you can scale out performance by spreading 
metadata tree information over multiple MDS servers, and scale out 
storage and throughput with added OSD nodes.


so if the cephfs client cache is not sufficient, you can look at the 
bluestore cache.

http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size

or you can look at adding an SSD layer over the spinning disks, with e.g. 
bcache.  I assume you are using an SSD/NVRAM device for the bluestore DB already


you should also look at tuning the cephfs metadata servers:
make sure the metadata pool is on fast SSD OSDs, and tune the MDS 
cache to the MDS server's RAM, so you cache as much metadata as possible.
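
Both of those are plain config knobs; a minimal ceph.conf sketch (the sizes
below are placeholders and come out of each daemon's RAM budget, so set them
to what the hosts can actually spare):

    [osd]
    bluestore cache size hdd = 10737418240   # e.g. ~10 GiB per HDD-backed OSD
    bluestore cache size ssd = 10737418240   # e.g. ~10 GiB per SSD-backed OSD

    [mds]
    mds cache memory limit = 34359738368     # e.g. ~32 GiB on a large MDS host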


good luck
Ronny Aasen




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] EC pool spread evenly across failure domains?

2018-10-02 Thread Mark Johnston
I have the following setup in a test cluster:

 -1   8.49591 root default 
-15   2.83197 chassis vm1
 -3   1.41599 host ceph01 
  0   ssd 1.41599 osd.0
 -5   1.41599 host ceph02 
  1   ssd 1.41599 osd.1
-19   2.83197 chassis vm2
 -7   1.41599 host ceph03 
  2   ssd 1.41599 osd.2
 -9   1.41599 host ceph04 
  3   ssd 1.41599 osd.3
-20   2.83197 chassis vm3
-11   1.41599 host ceph05 
  4   ssd 1.41599 osd.4
-13   1.41599 host ceph06 
  5   ssd 1.41599 osd.5

I created an EC pool with k=4 m=2 and crush-failure-domain=chassis.  The PGs
are stuck in creating+incomplete with only 3 assigned OSDs each.  I'm assuming
this is because using crush-failure-domain=chassis requires a different chassis
for every chunk.

I don't want to switch to k=2 m=1 because I want to be able to survive two OSD
failures, and I don't want to use crush-failure-domain=host because I don't want
more than two chunks to be placed on the same chassis.  (The production cluster
will have more than two hosts per chassis, so crush-failure-domain=host could
put all 6 chunks on the same chassis.)

Do I need to write a custom CRUSH rule to get this to happen?  Or have I missed
something?
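
(For what it's worth - sketched here for reference, not taken from this
thread's replies - a custom rule for "at most two chunks per chassis"
usually combines two choose steps, e.g. pick 3 chassis and 2 hosts in each:

    rule ec_two_per_chassis {
        id 2
        type erasure
        min_size 4
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 3 type chassis
        step chooseleaf indep 2 type host
        step emit
    }

compiled and injected with crushtool -c / ceph osd setcrushmap, then attached
to the pool with "ceph osd pool set <pool> crush_rule ec_two_per_chassis".)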

Thanks,
Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recover data from cluster / get rid of down, incomplete, unknown pgs

2018-10-02 Thread Dylan Jones
Our ceph cluster stopped responding to requests two weeks ago, and I have
been trying to fix it since then.  After a semi-hard reboot, we had 11-ish
OSDs "fail" spread across two hosts, with the pool size set to two.  I was
able to extract a copy of every PG that resided solely on the nonfunctional
OSDs, but the cluster is refusing to let me read from it.  I marked all the
"failed" OSDs as lost and used ceph pg $pg mark_unfound_lost revert for all
the PGs reporting unfound objects, but that didn't help either.  ddrescue
also breaks, because ceph will never admit that it has lost data and just
blocks forever instead of returning a read error.
Is there any way to tell ceph to cut its losses and just let me access my
data again?
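
For context, the counterpart of that extraction step - pushing an exported PG
back into a (stopped) OSD - looks roughly like this with ceph-objectstore-tool
(ids and paths are placeholders; this is only a sketch of the tooling, not a
recovery recipe for this cluster):

    systemctl stop ceph-osd@7
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
        --op import --file /backup/pg.2.1a.export
    systemctl start ceph-osd@7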


  cluster:
id: 313be153-5e8a-4275-b3aa-caea1ce7bce2
health: HEALTH_ERR
noout,nobackfill,norebalance flag(s) set
2720183/6369036 objects misplaced (42.709%)
9/3184518 objects unfound (0.000%)
39 scrub errors
Reduced data availability: 131 pgs inactive, 16 pgs down, 114
pgs incomplete
Possible data damage: 7 pgs recovery_unfound, 1 pg
inconsistent, 7 pgs snaptrim_error
Degraded data redundancy: 1710175/6369036 objects degraded
(26.851%), 1069 pgs degraded, 1069 pgs undersized
Degraded data redundancy (low space): 82 pgs backfill_toofull

  services:
mon: 1 daemons, quorum waitaha
mgr: waitaha(active)
osd: 43 osds: 34 up, 34 in; 1786 remapped pgs
 flags noout,nobackfill,norebalance

  data:
pools:   2 pools, 2048 pgs
objects: 3.18 M objects, 8.4 TiB
usage:   21 TiB used, 60 TiB / 82 TiB avail
pgs: 0.049% pgs unknown
 6.348% pgs not active
 1710175/6369036 objects degraded (26.851%)
 2720183/6369036 objects misplaced (42.709%)
 9/3184518 objects unfound (0.000%)
 987 active+undersized+degraded+remapped+backfill_wait
 695 active+remapped+backfill_wait
 124 active+clean
 114 incomplete
 62
active+undersized+degraded+remapped+backfill_wait+backfill_toofull
 20  active+remapped+backfill_wait+backfill_toofull
 16  down
 12  active+undersized+degraded+remapped+backfilling
 7   active+recovery_unfound+undersized+degraded+remapped
 7   active+clean+snaptrim_error
 2   active+remapped+backfilling
 1   unknown
 1
active+undersized+degraded+remapped+inconsistent+backfill_wait

Thanks,
Dylan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] QEMU/Libvirt + librbd issue using Luminous 12.2.7

2018-10-02 Thread Andre Goree

On 2018/10/02 2:03 pm, Andre Goree wrote:

On 2018/10/02 1:54 pm, Jason Dillaman wrote:

On Tue, Oct 2, 2018 at 1:48 PM Andre Goree  wrote:



I'm actually not so sure the libvirt user has write access to the
location -- will libvirt automatically try to write to the file 
(given

that it's a setting in ceph.conf)?

I just confirmed that the libvirt-qemu user could NOT write to the
location I have defined (/var/log/ceph_client.log).

After adjusting perms though, I still have nothing printed in the 
logs

except the creation.


Technically, libvirt will just use QEMU's monitor protocol to instruct
an already running QEMU instance to attach the device. Therefore, it
really comes down to the permissions of that QEMU process.




--
Jason


Very true, thanks for clarifying that.  In any case, I changed the
ceph client log location to a world readable/writeable location
(/tmp/ceph_client.log) and still am not getting in the log regarding
the attach, _only_ the creation.

Do I need to explicitly adjust the location of the qemu logs somehow?
In fact, I didn't even think to add debugging to qemu nor libvirt, do
you happen to know off-hand where to configure that?  I'm sure I have
logs in /var/log/libvirt/qemu/* already, let me see what I can come up
with...


--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



I figured this out, thanks for the help.  Definite PEBKAC error here.  
Here's the relevant part of my qemu logs:


server name not found:  (Name or service not known)
unable to parse addrs in 'xxx.xxx.xxx.xxx:6789;:6789;:6789'
server name not found:  (Name or service not known)
unable to parse addrs in 'xxx.xxx.xxx.xxx:6789;:6789;:6789'
server name not found:  (Name or service not known)
unable to parse addrs in 'xxx.xxx.xxx.xxx:6789;:6789;:6789'

In my xml, I'm defining three MONs, but only giving an IP for one 
(xxx.xxx.xxx.xxx, redacted).  Some change between 12.2.4 and 12.2.5 must have 
made qemu or ceph or both more stringent about the server names (because the 
same configuration worked on 12.2.4), of course with good reason.


Sorry for wasting your and everyone else's time, thanks again for the 
help.


--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] QEMU/Libvirt + librbd issue using Luminous 12.2.7

2018-10-02 Thread Andre Goree

On 2018/10/02 1:54 pm, Jason Dillaman wrote:

On Tue, Oct 2, 2018 at 1:48 PM Andre Goree  wrote:



I'm actually not so sure the libvirt user has write access to the
location -- will libvirt automatically try to write to the file (given
that it's a setting in ceph.conf)?

I just confirmed that the libvirt-qemu user could NOT write to the
location I have defined (/var/log/ceph_client.log).

After adjusting perms though, I still have nothing printed in the logs
except the creation.


Technically, libvirt will just use QEMU's monitor protocol to instruct
an already running QEMU instance to attach the device. Therefore, it
really comes down to the permissions of that QEMU process.




--
Jason


Very true, thanks for clarifying that.  In any case, I changed the ceph 
client log location to a world readable/writeable location 
(/tmp/ceph_client.log) and still am not getting anything in the log regarding 
the attach, _only_ the creation.


Do I need to explicitly adjust the location of the qemu logs somehow?  
In fact, I didn't even think to add debugging to qemu nor libvirt, do 
you happen to know off-hand where to configure that?  I'm sure I have 
logs in /var/log/libvirt/qemu/* already, let me see what I can come up 
with...



--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Goktug Yildirim
Hello Darius,

Thanks for reply!

The main problem is that we can not query PGs. “ceph pg 67.54f query” gets stuck 
and waits forever since the OSD is unresponsive.
We are certain that an OSD becomes unresponsive as soon as it is UP. And we are 
certain that the OSD responds again after its disk utilization stops. 

So we ran a small test like this:
* Stop all OSDs (168 of them).
* Start OSD1. 95% OSD disk utilization starts immediately. It takes 8 mins to 
finish. Only after that does “ceph pg 67.54f query” work!
* While OSD1 is “up", start OSD2. As soon as OSD2 starts, OSD1 & OSD2 both go to 
95% disk utilization. This takes 17 minutes to finish.
* Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 
mins to settle.
* If you happen to start 5 of them at the same time, all of the OSDs start high 
I/O again, and it takes 1 hour to finish.

So in light of these findings we set the noup flag and started all OSDs. At first 
there was no I/O. After 10 minutes we unset noup. All 168 OSDs started to 
generate high I/O. We thought that if we waited long enough it would finish and 
the OSDs would become responsive again. After 24 hours they had not, because the 
I/O did not finish or even slow down.
One might think there is a lot of data there to scan. But it is just 33TB.

So, in short, we don't know which PG is stuck so that we can remove it.
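
For reference, the stuck PGs can usually be listed from the monitor side
without querying the unresponsive OSDs themselves; a rough sketch:

    ceph health detail | grep -E 'unfound|down|incomplete|stuck'
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean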

However we hit a weird thing half an hour ago. We exported the same PG from 
two different OSDs. One was 4.2GB and the other was 500KB! So we decided to 
export everything from all OSDs for backup. Then we will delete the strangely 
sized ones and start the cluster all over. Maybe then we can resolve the stuck 
or unfound PGs as you advise.
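
For context, the per-PG export mentioned here is typically done with the OSD
stopped, roughly like this (the paths are assumptions; the PG id is the one
quoted above):

    systemctl stop ceph-osd@90
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-90 \
        --pgid 67.54f --op export --file /backup/osd.90-pg.67.54f.export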

Any thought would be greatly appreciated.


> On 2 Oct 2018, at 18:16, Darius Kasparavičius  wrote:
> 
> Hello,
> 
> Currently you have 15 objects missing. I would recommend finding them
> and making backups of them. Ditch all other osds that are failing to
> start and concentrate on bringing online those that have missing
> objects. Then slowly turn off nodown and noout on the cluster and see
> if it stabilises. If it stabilises leave these setting if not turn
> them back on.
> Now get some of the pg's that are blocked and querry the pgs to check
> why they are blocked. Try removing as much blocks as possible and then
> remove the norebalance/norecovery flags and see if it starts to fix
> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>  wrote:
>> 
>> One of ceph experts indicated that bluestore is somewhat preview tech
>> (as for Redhat).
>> So it could be best to checkout bluestore and rocksdb. There are some
>> tools to check health and also repair. But there are limited
>> documentation.
>> Anyone who has experince with it?
>> Anyone lead/help to a proper check would be great.
>> Goktug Yildirim , 1 Eki 2018 Pzt, 22:55
>> tarihinde şunu yazdı:
>>> 
>>> Hi all,
>>> 
>>> We have recently upgraded from luminous to mimic. It’s been 6 days since 
>>> this cluster is offline. The long short story is here: 
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>> 
>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to 
>>> correct way I apology and please let me know.
>>> 
>>> For the 6 days lots of thing happened and there were some outcomes about 
>>> the problem. Some of them was misjudged and some of them are not looked 
>>> deeper.
>>> However the most certain diagnosis is this: each OSD causes very high disk 
>>> I/O to its bluestore disk (WAL and DB are fine). After that OSDs become 
>>> unresponsive or very very less responsive. For example "ceph tell osd.x 
>>> version” stucks like for ever.
>>> 
>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>> 
>>> This is the one we are very sure of. But we are not sure of the reason.
>>> 
>>> Here is the latest ceph status:
>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>> 
>>> This is the status after we started all of the OSDs 24 hours ago.
>>> Some of the OSDs are not started. However it didnt make any difference when 
>>> all of them was online.
>>> 
>>> Here is the debug=20 log of an OSD which is same for all others:
>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>> 
>>> This the full log the same OSD.
>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>> 
>>> Here is the strace of the same OSD process:
>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>> 
>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am 
>>> sure we have done lots of mistakes to let this happening. And this 
>>> situation may be a example for other user and could be a potential bug for 
>>> ceph developer.
>>> 
>>> Any help to figure out what is going on would be great.
>>> 
>>> Best Regards,
>>> Goktug Yildirim
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] QEMU/Libvirt + librbd issue using Luminous 12.2.7

2018-10-02 Thread Jason Dillaman
On Tue, Oct 2, 2018 at 1:48 PM Andre Goree  wrote:
>
> On 2018/10/02 1:29 pm, Jason Dillaman wrote:
> > On Tue, Oct 2, 2018 at 1:25 PM Andre Goree  wrote:
> >>
> >>
> >> Unfortunately, it would appear that I'm not getting anything in the
> >> logs
> >> _but_ the creation of the rbd image -- i.e., nothing regarding the
> >> attempt to attach it via libvirt.  Here are the logs, for the sake of
> >> clarity: https://pastebin.com/usAUspCa
> >>
> >>
> >> I think this might be a problem in libvirt and/or QEMU, in that
> >> case...but then why am I only seeing this _after_ an upgrade past
> >> 12.2.4; that makes me think something changed between 12.2.4 & 12.2.5,
> >> libvirt-related, that is causing this (I've not upgraded my libvirt
> >> version in months).
> >
> > AFAIK, you are the only person reporting a similar issue. Are you
> > confident that your libvirt / QEMU user has access to write to the
> > specified log location (both permissions and SElinux/AppArmor)?
> >
> >> OS: Ubuntu 16.04
> >> Libvirt ver: libvirtd (libvirt) 1.3.1
> >> Ceph ver:  12.2.8
> >>
> >
> >
> > --
> > Jason
>
>
> I'm actually not so sure the libvirt user has write access to the
> location -- will libvirt automatically try to write to the file (given
> that it's a setting in ceph.conf)?
>
> I just confirmed that the libvirt-qemu user could NOT write to the
> location I have defined (/var/log/ceph_client.log).
>
> After adjusting perms though, I still have nothing printed in the logs
> except the creation.

Technically, libvirt will just use QEMU's monitor protocol to instruct
an already running QEMU instance to attach the device. Therefore, it
really comes down to the permissions of that QEMU process.
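
For reference, the librbd client logging being discussed is usually wired up
with something like this in the hypervisor's ceph.conf (a sketch; the paths
are assumptions and, per the above, must be writable by the user the QEMU
process actually runs as):

    [client]
    log file = /var/log/ceph/qemu-guest.$pid.log
    debug rbd = 20
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok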

> FWIW, I'm creating the rbd image with 'qemu-img create -f rbd
> rbd:$image_name $storage_size' and I'm attaching the image with 'virsh
> attach-device $vm_name $xml --persistent'
>
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email - andre at drenet.net
> Website   - http://blog.drenet.net
> PGP key   - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] QEMU/Libvirt + librbd issue using Luminous 12.2.7

2018-10-02 Thread Andre Goree

On 2018/10/02 1:29 pm, Jason Dillaman wrote:

On Tue, Oct 2, 2018 at 1:25 PM Andre Goree  wrote:



Unfortunately, it would appear that I'm not getting anything in the 
logs

_but_ the creation of the rbd image -- i.e., nothing regarding the
attempt to attach it via libvirt.  Here are the logs, for the sake of
clarity: https://pastebin.com/usAUspCa


I think this might be a problem in libvirt and/or QEMU, in that
case...but then why am I only seeing this _after_ an upgrade past
12.2.4; that makes me think something changed between 12.2.4 & 12.2.5,
libvirt-related, that is causing this (I've not upgraded my libvirt
version in months).


AFAIK, you are the only person reporting a similar issue. Are you
confident that your libvirt / QEMU user has access to write to the
specified log location (both permissions and SElinux/AppArmor)?


OS: Ubuntu 16.04
Libvirt ver: libvirtd (libvirt) 1.3.1
Ceph ver:  12.2.8




--
Jason



I'm actually not so sure the libvirt user has write access to the 
location -- will libvirt automatically try to write to the file (given 
that it's a setting in ceph.conf)?


I just confirmed that the libvirt-qemu user could NOT write to the 
location I have defined (/var/log/ceph_client.log).


After adjusting perms though, I still have nothing printed in the logs 
except the creation.


FWIW, I'm creating the rbd image with 'qemu-img create -f rbd 
rbd:$image_name $storage_size' and I'm attaching the image with 'virsh 
attach-device $vm_name $xml --persistent'



--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] QEMU/Libvirt + librbd issue using Luminous 12.2.7

2018-10-02 Thread Jason Dillaman
On Tue, Oct 2, 2018 at 1:25 PM Andre Goree  wrote:
>
> On 2018/10/02 10:26 am, Andre Goree wrote:
> > On 2018/10/02 9:54 am, Jason Dillaman wrote:
> >> Perhaps that pastebin link has the wrong log pasted? The provided log
> >> looks like it's associated with the creation of image
> >> "32635-b6592790-5519-5184-b5ef-5f16b3523250" and not the attachment of
> >> an image to a VM.
> >> On Fri, Sep 28, 2018 at 3:15 PM Andre Goree  wrote:
> >>>
> >>>
> >>>
> >>> I actually got the logging working, here's the log from a failed
> >>> attach:
> >>>   https://pastebin.com/jCiD4E2p
> >>>
> >>> Thanks!
> >>>
> >>>
> >>> --
> >>> Andre Goree
> >>> -=-=-=-=-=-
> >>> Email - andre at drenet.net
> >>> Website   - http://blog.drenet.net
> >>> PGP key   - http://www.drenet.net/pubkey.html
> >>> -=-=-=-=-=-
> >>
> >>
> >>
> >> --
> >> Jason
> >
> >
> > Interesting.  I just pasted everything that was in the log that was
> > generated.
> >
> > I'll try again here soon, with ONLY an attempt to attach.  Standby.
> > Thanks again for the help :)
> >
> >
>
>
> Unfortunately, it would appear that I'm not getting anything in the logs
> _but_ the creation of the rbd image -- i.e., nothing regarding the
> attempt to attach it via libvirt.  Here are the logs, for the sake of
> clarity: https://pastebin.com/usAUspCa
>
>
> I think this might be a problem in libvirt and/or QEMU, in that
> case...but then why am I only seeing this _after_ an upgrade past
> 12.2.4; that makes me think something changed between 12.2.4 & 12.2.5,
> libvirt-related, that is causing this (I've not upgraded my libvirt
> version in months).

AFAIK, you are the only person reporting a similar issue. Are you
confident that your libvirt / QEMU user has access to write to the
specified log location (both permissions and SElinux/AppArmor)?

> OS: Ubuntu 16.04
> Libvirt ver: libvirtd (libvirt) 1.3.1
> Ceph ver:  12.2.8
>
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email - andre at drenet.net
> Website   - http://blog.drenet.net
> PGP key   - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore vs. Filestore

2018-10-02 Thread jesper
Hi.

Based on some recommendations we have setup our CephFS installation using
bluestore*. We're trying to get a strong replacement for "huge" xfs+NFS
server - 100TB-ish size.

Current setup is - a sizeable Linux host with 512GB of memory - one large
Dell MD1200 or MD1220 - 100TB + a Linux kernel NFS server.

Since our "hot" dataset is < 400GB we can actually serve the hot data
directly out of the host page-cache and never really touch the "slow"
underlying drives. Except when new bulk data are written where a Perc with
BBWC is consuming the data.

In the CephFS + Bluestore world, Ceph is "deliberately" bypassing the host
OS page-cache, so even when we have 4-5 x 256GB memory** in the OSD hosts
it is really hard to create a synthetic test where the hot data does not
end up being read out of the underlying disks. Yes, the
client side page cache works very well, but in our scenario we have 30+
hosts pulling the same data over NFS.

Is bluestore just a "bad fit" .. Filestore "should" do the right thing? Is
the recommendation to make an SSD "overlay" on the slow drives?

Thoughts?

Jesper

* Bluestore should be the new and shiny future - right?
** Total mem 1TB+



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] QEMU/Libvirt + librbd issue using Luminous 12.2.7

2018-10-02 Thread Andre Goree

On 2018/10/02 10:26 am, Andre Goree wrote:

On 2018/10/02 9:54 am, Jason Dillaman wrote:

Perhaps that pastebin link has the wrong log pasted? The provided log
looks like it's associated with the creation of image
"32635-b6592790-5519-5184-b5ef-5f16b3523250" and not the attachment of
an image to a VM.
On Fri, Sep 28, 2018 at 3:15 PM Andre Goree  wrote:




I actually got the logging working, here's the log from a failed 
attach:

  https://pastebin.com/jCiD4E2p

Thanks!


--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-




--
Jason



Interesting.  I just pasted everything that was in the log that was 
generated.


I'll try again here soon, with ONLY an attempt to attach.  Standby.
Thanks again for the help :)





Unfortunately, it would appear that I'm not getting anything in the logs 
_but_ the creation of the rbd image -- i.e., nothing regarding the 
attempt to attach it via libvirt.  Here are the logs, for the sake of 
clarity: https://pastebin.com/usAUspCa



I think this might be a problem in libvirt and/or QEMU, in that 
case...but then why am I only seeing this _after_ an upgrade past 
12.2.4; that makes me think something changed between 12.2.4 & 12.2.5, 
libvirt-related, that is causing this (I've not upgraded my libvirt 
version in months).


OS: Ubuntu 16.04
Libvirt ver: libvirtd (libvirt) 1.3.1
Ceph ver:  12.2.8


--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs mds cache tuning

2018-10-02 Thread Adam Tygart
It may be that having multiple mds is masking the issue, or that we
truly didn't have a large enough inode cache at 55GB. Things are
behaving for me now, even when presenting the same 0 entries in req
and rlat.

If this happens again, I'll attempt to get perf trace logs, along with
ops, ops_in_flight, perf dump and objecter requests. Thanks for your
time.
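
Those are all available on the MDS admin socket; roughly (reusing the same
mds id the thread already uses for daemonperf):

    ceph daemon mds.$(hostname -s) dump_ops_in_flight
    ceph daemon mds.$(hostname -s) dump_historic_ops
    ceph daemon mds.$(hostname -s) perf dump
    ceph daemon mds.$(hostname -s) objecter_requests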

--
Adam
On Mon, Oct 1, 2018 at 10:36 PM Adam Tygart  wrote:
>
> Okay, here's what I've got: https://www.paste.ie/view/abe8c712
>
> Of note, I've changed things up a little bit for the moment. I've
> activated a second mds to see if it is a particular subtree that is
> more prone to issues. maybe EC vs replica... The one that is currently
> being slow has my EC volume pinned to it.
>
> --
> Adam
> On Mon, Oct 1, 2018 at 10:02 PM Gregory Farnum  wrote:
> >
> > Can you grab the perf dump during this time, perhaps plus dumps of the ops 
> > in progress?
> >
> > This is weird but given it’s somewhat periodic it might be something like 
> > the MDS needing to catch up on log trimming (though I’m unclear why 
> > changing the cache size would impact this).
> >
> > On Sun, Sep 30, 2018 at 9:02 PM Adam Tygart  wrote:
> >>
> >> Hello all,
> >>
> >> I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
> >> cephfs mounts (kernel client). We're currently only using 1 active
> >> mds.
> >>
> >> Performance is great about 80% of the time. MDS responses (per ceph
> >> daemonperf mds.$(hostname -s), indicates 2k-9k requests per second,
> >> with a latency under 100.
> >>
> >> It is the other 20ish percent I'm worried about. I'll check on it and
> >> it with be going 5-15 seconds with "0" requests, "0" latency, then
> >> give me 2 seconds of reasonable response times, and then back to
> >> nothing. Clients are actually seeing blocked requests for this period
> >> of time.
> >>
> >> The strange bit is that when I *reduce* the mds_cache_size, requests
> >> and latencies go back to normal for a while. When it happens again,
> >> I'll increase it back to where it was. It feels like the mds server
> >> decides that some of these inodes can't be dropped from the cache
> >> unless the cache size changes. Maybe something wrong with the LRU?
> >>
> >> I feel like I've got a reasonable cache size for my workload, 30GB on
> >> the small end, 55GB on the large. No real reason for a swing this
> >> large except to potentially delay it recurring after expansion for
> >> longer.
> >>
> >> I also feel like there is probably some magic tunable to change how
> >> inodes get stuck in the LRU. perhaps mds_cache_mid. Anyone know what
> >> this tunable actually does? The documentation is a little sparse.
> >>
> >> I can grab logs from the mds if needed, just let me know the settings
> >> you'd like to see.
> >>
> >> --
> >> Adam
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "rgw relaxed s3 bucket names" and underscores

2018-10-02 Thread Ryan Leimenstoll
Hi all, 

I was hoping to get some clarification on what "rgw relaxed s3 bucket names = 
false” is intended to filter. In our cluster (Luminous 12.2.8, serving S3) it 
seems that RGW, with that setting set to false, is still allowing buckets with 
underscores in the name to be created, although this is now prohibited by 
Amazon in US-East and seemingly all of their other regions [0]. Since clients 
typically follow Amazon’s direction, should RGW be rejecting underscores in 
these names to be in compliance? (I did notice it already rejects uppercase 
letters.) 

Thanks much!
Ryan Leimenstoll
rleim...@umiacs.umd.edu 


[0] https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Darius Kasparavičius
Hello,

 Currently you have 15 objects missing. I would recommend finding them
and making backups of them. Ditch all other osds that are failing to
start and concentrate on bringing online those that have the missing
objects. Then slowly turn off nodown and noout on the cluster and see
if it stabilises. If it stabilises, leave these settings; if not, turn
them back on.
Now get some of the pg's that are blocked and query the pgs to check
why they are blocked. Try removing as many blockers as possible and then
remove the norebalance/norecovery flags and see if it starts to fix
itself.

On Tue, Oct 2, 2018 at 5:14 PM by morphin
 wrote:
>
> One of ceph experts indicated that bluestore is somewhat preview tech
> (as for Redhat).
> So it could be best to checkout bluestore and rocksdb. There are some
> tools to check health and also repair. But there are limited
> documentation.
> Anyone who has experince with it?
> Anyone lead/help to a proper check would be great.
> Goktug Yildirim , 1 Eki 2018 Pzt, 22:55
> tarihinde şunu yazdı:
> >
> > Hi all,
> >
> > We have recently upgraded from luminous to mimic. It’s been 6 days since 
> > this cluster is offline. The long short story is here: 
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >
> > I’ve also CC’ed developers since I believe this is a bug. If this is not to 
> > correct way I apology and please let me know.
> >
> > For the 6 days lots of thing happened and there were some outcomes about 
> > the problem. Some of them was misjudged and some of them are not looked 
> > deeper.
> > However the most certain diagnosis is this: each OSD causes very high disk 
> > I/O to its bluestore disk (WAL and DB are fine). After that OSDs become 
> > unresponsive or very very less responsive. For example "ceph tell osd.x 
> > version” stucks like for ever.
> >
> > So due to unresponsive OSDs cluster does not settle. This is our problem!
> >
> > This is the one we are very sure of. But we are not sure of the reason.
> >
> > Here is the latest ceph status:
> > https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >
> > This is the status after we started all of the OSDs 24 hours ago.
> > Some of the OSDs are not started. However it didnt make any difference when 
> > all of them was online.
> >
> > Here is the debug=20 log of an OSD which is same for all others:
> > https://paste.ubuntu.com/p/8n2kTvwnG6/
> > As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >
> > This the full log the same OSD.
> > https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >
> > Here is the strace of the same OSD process:
> > https://paste.ubuntu.com/p/8n2kTvwnG6/
> >
> > Recently we hear more to uprade mimic. I hope none get hurts as we do. I am 
> > sure we have done lots of mistakes to let this happening. And this 
> > situation may be a example for other user and could be a potential bug for 
> > ceph developer.
> >
> > Any help to figure out what is going on would be great.
> >
> > Best Regards,
> > Goktug Yildirim
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange Ceph host behaviour

2018-10-02 Thread Steve Taylor
Unless this is related to load and the OSDs really are unresponsive, it is
almost certainly some sort of network issue. Duplicate IP address
maybe?
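
A couple of quick checks along those lines (the interface name is a
placeholder; the addresses are the ones from the log excerpt below):

    arping -D -c 2 -I <cluster-iface> 192.168.1.228   # duplicate address detection
    ping -M do -s 8972 192.168.1.215                  # large unfragmented probe, in case of an MTU mismatch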


 
Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | 
 
If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.

 

On Tue, 2018-10-02 at 17:17 +0200, Vincent Godin wrote:
> Ceph cluster in Jewel 10.2.11
> Mons & Hosts are on CentOS 7.5.1804 kernel 3.10.0-862.6.3.el7.x86_64
> 
> Everyday, we can see in ceph.log on Monitor a lot of logs like these :
> 
> 2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 :
> cluster [WRN] map e612590 wrongly marked me down
> 2018-10-02 16:07:06.462653 osd.464 192.168.1.232:6830/6650 317 :
> cluster [WRN] map e612588 wrongly marked me down
> 2018-10-02 16:07:10.717673 osd.470 192.168.1.232:6836/7554 371 :
> cluster [WRN] map e612591 wrongly marked me down
> 2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 :
> cluster [WRN] map e612599 wrongly marked me down
> 2018-10-02 16:14:48.422442 osd.403 192.168.1.227:6832/6727 509 :
> cluster [WRN] map e612597 wrongly marked me down
> 2018-10-02 16:15:13.198180 osd.436 192.168.1.228:6828/6402 533 :
> cluster [WRN] map e612608 wrongly marked me down
> 2018-10-02 16:15:08.792369 osd.433 192.168.1.228:6832/6732 515 :
> cluster [WRN] map e612604 wrongly marked me down
> 2018-10-02 16:15:11.680405 osd.429 192.168.1.228:6838/7393 536 :
> cluster [WRN] map e612607 wrongly marked me down
> 2018-10-02 16:15:14.246717 osd.431 192.168.1.228:6822/5937 474 :
> cluster [WRN] map e612609 wrongly marked me down
> 
> On the server 192.168.1.228 for example, the /var/log/messages looks like :
> 
> Oct  2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658
> 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
> 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
> 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
> Oct  2 16:15:03 bd-ceph-22 ceph-osd: 2018-10-02 16:15:03.935841
> 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
> 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
> 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:43.935824)
> Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.283822
> 7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
> 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
> 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:44.283811)
> Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.353645
> 7f1110a32700 -1 osd.438 612603 heartbeat_check: no reply from
> 192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
> 2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:44.353612)
> Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.373905
> 7f71375de700 -1 osd.432 612603 heartbeat_check: no reply from
> 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
> 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.373897)
> Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.935997
> 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
> 192.168.1.215:6815 osd.242 since back 2018-10-02 16:15:04.369740 front
> 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.935981)
> Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.007484
> 7f10d97ec700 -1 osd.438 612603 heartbeat_check: no reply from
> 192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
> 2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:45.007477)
> Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.017154
> 7fd4cee4d700 -1 osd.435 612603 heartbeat_check: no reply from
> 192.168.1.212:6833 osd.195 since back 2018-10-02 16:15:03.273909 front
> 2018-10-02 16:14:44.648411 (cutoff 2018-10-02 16:14:45.017106)
> Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.158580
> 7fe343c96700 -1 osd.426 612603 heartbeat_check: no reply from
> 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
> 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.158567)
> Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.283983
> 7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
> 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:05.154458 front
> 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.283975)
> 
> There is no network problem at that time (i checked the logs on the
> host and on the switch). OSD logs shows nothing but "wrongly marked me
> down" and sessions reset due to this monitor action. As several OSDs
> are impacted, it looks like a host problem.
> 
> The sysctl.conf is:
> 
> net.core.rmem_max=56623104
> net.core.wmem_max=56623104
> net.core.rmem_default=56623104
> 

[ceph-users] Strange Ceph host behaviour

2018-10-02 Thread Vincent Godin
Ceph cluster in Jewel 10.2.11
Mons & Hosts are on CentOS 7.5.1804 kernel 3.10.0-862.6.3.el7.x86_64

Everyday, we can see in ceph.log on Monitor a lot of logs like these :

2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 :
cluster [WRN] map e612590 wrongly marked me down
2018-10-02 16:07:06.462653 osd.464 192.168.1.232:6830/6650 317 :
cluster [WRN] map e612588 wrongly marked me down
2018-10-02 16:07:10.717673 osd.470 192.168.1.232:6836/7554 371 :
cluster [WRN] map e612591 wrongly marked me down
2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 :
cluster [WRN] map e612599 wrongly marked me down
2018-10-02 16:14:48.422442 osd.403 192.168.1.227:6832/6727 509 :
cluster [WRN] map e612597 wrongly marked me down
2018-10-02 16:15:13.198180 osd.436 192.168.1.228:6828/6402 533 :
cluster [WRN] map e612608 wrongly marked me down
2018-10-02 16:15:08.792369 osd.433 192.168.1.228:6832/6732 515 :
cluster [WRN] map e612604 wrongly marked me down
2018-10-02 16:15:11.680405 osd.429 192.168.1.228:6838/7393 536 :
cluster [WRN] map e612607 wrongly marked me down
2018-10-02 16:15:14.246717 osd.431 192.168.1.228:6822/5937 474 :
cluster [WRN] map e612609 wrongly marked me down

On the server 192.168.1.228 for example, the /var/log/messages looks like :

Oct  2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
Oct  2 16:15:03 bd-ceph-22 ceph-osd: 2018-10-02 16:15:03.935841
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:43.935824)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.283822
7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:44.283811)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.353645
7f1110a32700 -1 osd.438 612603 heartbeat_check: no reply from
192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:44.353612)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.373905
7f71375de700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.373897)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.935997
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:15:04.369740 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.935981)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.007484
7f10d97ec700 -1 osd.438 612603 heartbeat_check: no reply from
192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:45.007477)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.017154
7fd4cee4d700 -1 osd.435 612603 heartbeat_check: no reply from
192.168.1.212:6833 osd.195 since back 2018-10-02 16:15:03.273909 front
2018-10-02 16:14:44.648411 (cutoff 2018-10-02 16:14:45.017106)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.158580
7fe343c96700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.158567)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.283983
7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:05.154458 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.283975)

There is no network problem at that time (I checked the logs on the
host and on the switch). The OSD logs show nothing but "wrongly marked me
down" and sessions being reset due to this monitor action. As several OSDs
are impacted, it looks like a host-level problem.
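
In case it helps with ruling things out, here is roughly how I have been
checking the failure reports and the effective heartbeat settings (the osd id
and the value 30 below are only examples, not a recommendation):

  # On the affected host, effective heartbeat settings of one OSD:
  ceph daemon osd.432 config show | grep -E 'osd_heartbeat_(grace|interval)'

  # Which OSDs reported the failures (cluster log on a monitor host):
  grep 'reporters from different' /var/log/ceph/ceph.log

  # Temporarily raise the grace period on OSDs and monitors while debugging:
  ceph tell osd.* injectargs '--osd_heartbeat_grace 30'
  ceph tell mon.* injectargs '--osd_heartbeat_grace 30'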

The sysctl.conf is:

net.core.rmem_max=56623104
net.core.wmem_max=56623104
net.core.rmem_default=56623104
net.core.wmem_default=56623104
net.core.optmem_max=40960
net.ipv4.tcp_rmem=4096 87380 56623104
net.ipv4.tcp_wmem=4096 65536 56623104
net.core.somaxconn=1024
net.core.netdev_max_backlog=5
net.ipv4.tcp_max_syn_backlog=3
net.ipv4.tcp_max_tw_buckets=200
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.udp_rmem_min=8192
net.ipv4.udp_wmem_min=8192
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.all.accept_source_route=0

kernel.pid_max=4194303
fs.file-max=26234859

Does anyone have any idea, or has anyone already run into this behaviour?
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
Sent download link by email. verbosity=10, over 900M uncompressed.
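
For anyone wanting to reproduce such a log, the invocation is roughly the
following (the OSD path is an example, and the exact flags are my
understanding of how the tool maps to debug verbosity):

  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-1 \
      --log-file /tmp/osd.1-repair.log --log-level 20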


> On 2.10.2018, at 16:52, Igor Fedotov  wrote:
> 
> May I have a repair log for that "already expanded" OSD?
> 
> 
> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>> Repair goes through only when LVM volume has been expanded, otherwise it 
>> fails with enospc as well as any other operation. However, expanding the 
>> volume immediately renders bluefs unmountable with IO error.
>> 2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at the very 
>> end of bluefs-log-dump), I'm not sure whether corruption occurred before or 
>> after volume expansion.
>> 
>> 
>>> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
>>> 
>>> You mentioned repair had worked before, is that correct? What's the 
>>> difference now except the applied patch? Different OSD? Anything else?
>>> 
>>> 
>>> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
>>> 
 It didn't work, emailed logs to you.
 
 
> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
> 
> The major change is in get_bluefs_rebalance_txn function, it lacked 
> bluefs_rebalance_txn assignment..
> 
> 
> 
> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>> PR doesn't seem to have changed since yesterday. Am I missing something?
>> 
>> 
>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>> 
>>> Please update the patch from the PR - it didn't update bluefs extents 
>>> list before.
>>> 
>>> Also please set debug bluestore 20 when re-running repair and collect 
>>> the log.
>>> 
>>> If repair doesn't help - would you send repair and startup logs 
>>> directly to me as I have some issues accessing ceph-post-file uploads.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
 Yes, I did repair all OSDs and it finished with 'repair success'. I 
 backed up OSDs so now I have more room to play.
 I posted log files using ceph-post-file with the following IDs:
 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
 20df7df5-f0c9-4186-aa21-4e5c0172cd93
 
 
> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
> 
> You did repair for any of this OSDs, didn't you? For all of them?
> 
> 
> Would you please provide the log for both types (failed on mount and 
> failed with enospc) of failing OSDs. Prior to collecting please 
> remove existing ones prior and set debug bluestore to 20.
> 
> 
> 
> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>> I was able to apply patches to mimic, but nothing changed. One osd 
>> that I had space expanded on fails with bluefs mount IO error, 
>> others keep failing with enospc.
>> 
>> 
>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>> 
>>> So you should call repair which rebalances (i.e. allocates 
>>> additional space) BlueFS space. Hence allowing OSD to start.
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
 Not exactly. The rebalancing from this kv_sync_thread still might 
 be deferred due to the nature of this thread (haven't 100% sure 
 though).
 
 Here is my PR showing the idea (still untested and perhaps 
 unfinished!!!)
 
 https://github.com/ceph/ceph/pull/24353
 
 
 Igor
 
 
 On 10/1/2018 7:07 PM, Sergey Malinin wrote:
> Can you please confirm whether I got this right:
> 
> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
> @@ -9049,22 +9049,17 @@
> throttle_bytes.put(costs);
>   PExtentVector bluefs_gift_extents;
> -  if (bluefs &&
> -  after_flush - bluefs_last_balance >
> -  cct->_conf->bluestore_bluefs_balance_interval) {
> -bluefs_last_balance = after_flush;
> -int r = _balance_bluefs_freespace(_gift_extents);
> -assert(r >= 0);
> -if (r > 0) {
> -  for (auto& p : bluefs_gift_extents) {
> -bluefs_extents.insert(p.offset, p.length);
> -  }
> -  bufferlist bl;
> -  encode(bluefs_extents, bl);
> -  dout(10) << __func__ << " bluefs_extents now 0x" << 
> std::hex
> -   << bluefs_extents << std::dec << dendl;
> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
> +  int r = _balance_bluefs_freespace(_gift_extents);
> + 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Alfredo Deza
On Tue, Oct 2, 2018 at 10:23 AM Alex Litvak
 wrote:
>
> Igor,
>
> Thank you for your reply.  So what you are saying there are really no
> sensible space requirements for a collocated device? Even if I setup 30
> GB for DB (which I really wouldn't like to do due to a space waste
> considerations ) there is a chance that if this space feels up I will be
> in the same trouble under some heavy load scenario?

We do have good sizing recommendations for a separate block.db
partition: roughly, it shouldn't be less than 4% of the size of the data
device.

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
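
As a rough worked example of that guideline (sizes and device names below are
made up, not a recommendation for specific hardware): a 4 TB data disk would
want a block.db of at least ~160 GB, e.g.:

  # ~4% of a 4 TB data device
  lvcreate -L 160G -n db-sdb ssd-vg
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db ssd-vg/db-sdb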

>
> On 10/2/2018 9:15 AM, Igor Fedotov wrote:
> > Even with a single device bluestore has a sort of implicit "BlueFS
> > partition" where DB is stored.  And it dynamically adjusts (rebalances)
> > the space for that partition in background. Unfortunately it might
> > perform that "too lazy" and hence under some heavy load it might end-up
> > with the lack of space for that partition. While main device still has
> > plenty of free space.
> >
> > I'm planning to refactor this re-balancing procedure in the future to
> > eliminate the root cause.
> >
> >
> > Thanks,
> >
> > Igor
> >
> >
> > On 10/2/2018 5:04 PM, Alex Litvak wrote:
> >> I am sorry for interrupting the thread, but my understanding always
> >> was that blue store on the single device should not care of the DB
> >> size, i.e. it would use the data part for all operations if DB is
> >> full.  And if it is not true, what would be sensible defaults on 800
> >> GB SSD?  I used ceph-ansible to build my cluster with system defaults
> >> and from I reading in this thread doesn't give me a good feeling at
> >> all. Document ion on the topic is very sketchy and online posts
> >> contradict each other some times.
> >>
> >> Thank you in advance,
> >>
> >> On 10/2/2018 8:52 AM, Igor Fedotov wrote:
> >>> May I have a repair log for that "already expanded" OSD?
> >>>
> >>>
> >>> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>  Repair goes through only when LVM volume has been expanded,
>  otherwise it fails with enospc as well as any other operation.
>  However, expanding the volume immediately renders bluefs unmountable
>  with IO error.
>  2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at
>  the very end of bluefs-log-dump), I'm not sure whether corruption
>  occurred before or after volume expansion.
> 
> 
> > On 2.10.2018, at 16:07, Igor Fedotov  wrote:
> >
> > You mentioned repair had worked before, is that correct? What's the
> > difference now except the applied patch? Different OSD? Anything else?
> >
> >
> > On 10/2/2018 3:52 PM, Sergey Malinin wrote:
> >
> >> It didn't work, emailed logs to you.
> >>
> >>
> >>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
> >>>
> >>> The major change is in get_bluefs_rebalance_txn function, it
> >>> lacked bluefs_rebalance_txn assignment..
> >>>
> >>>
> >>>
> >>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>  PR doesn't seem to have changed since yesterday. Am I missing
>  something?
> 
> 
> > On 2.10.2018, at 14:15, Igor Fedotov  wrote:
> >
> > Please update the patch from the PR - it didn't update bluefs
> > extents list before.
> >
> > Also please set debug bluestore 20 when re-running repair and
> > collect the log.
> >
> > If repair doesn't help - would you send repair and startup logs
> > directly to me as I have some issues accessing ceph-post-file
> > uploads.
> >
> >
> > Thanks,
> >
> > Igor
> >
> >
> > On 10/2/2018 11:39 AM, Sergey Malinin wrote:
> >> Yes, I did repair all OSDs and it finished with 'repair
> >> success'. I backed up OSDs so now I have more room to play.
> >> I posted log files using ceph-post-file with the following IDs:
> >> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
> >> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
> >>
> >>
> >>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
> >>>
> >>> You did repair for any of this OSDs, didn't you? For all of
> >>> them?
> >>>
> >>>
> >>> Would you please provide the log for both types (failed on
> >>> mount and failed with enospc) of failing OSDs. Prior to
> >>> collecting please remove existing ones prior and set debug
> >>> bluestore to 20.
> >>>
> >>>
> >>>
> >>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>  I was able to apply patches to mimic, but nothing changed.
>  One osd that I had space expanded on fails with bluefs mount
>  IO error, others keep failing with enospc.
> 
> 
> > On 1.10.2018, at 

Re: [ceph-users] QEMU/Libvirt + librbd issue using Luminous 12.2.7

2018-10-02 Thread Andre Goree




On 2018/10/02 9:54 am, Jason Dillaman wrote:

Perhaps that pastebin link has the wrong log pasted? The provided log
looks like it's associated with the creation of image
"32635-b6592790-5519-5184-b5ef-5f16b3523250" and not the attachment of
an image to a VM.
On Fri, Sep 28, 2018 at 3:15 PM Andre Goree  wrote:




I actually got the logging working, here's the log from a failed 
attach:

  https://pastebin.com/jCiD4E2p

Thanks!


--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-




--
Jason



Interesting.  I just pasted everything that was in the log that was 
generated.


I'll try again here soon, with ONLY an attempt to attach.  Standby.  
Thanks again for the help :)



--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Alex Litvak

Igor,

Thank you for your reply.  So what you are saying is that there are really no 
sensible space requirements for a collocated device? Even if I set up 30 
GB for the DB (which I really wouldn't like to do due to space waste 
considerations) there is a chance that if this space fills up I will be 
in the same trouble under some heavy load scenario?


On 10/2/2018 9:15 AM, Igor Fedotov wrote:
Even with a single device bluestore has a sort of implicit "BlueFS 
partition" where DB is stored.  And it dynamically adjusts (rebalances) 
the space for that partition in background. Unfortunately it might 
perform that "too lazy" and hence under some heavy load it might end-up 
with the lack of space for that partition. While main device still has 
plenty of free space.


I'm planning to refactor this re-balancing procedure in the future to 
eliminate the root cause.



Thanks,

Igor


On 10/2/2018 5:04 PM, Alex Litvak wrote:
I am sorry for interrupting the thread, but my understanding always 
was that blue store on the single device should not care of the DB 
size, i.e. it would use the data part for all operations if DB is 
full.  And if it is not true, what would be sensible defaults on 800 
GB SSD?  I used ceph-ansible to build my cluster with system defaults 
and from I reading in this thread doesn't give me a good feeling at 
all. Document ion on the topic is very sketchy and online posts 
contradict each other some times.


Thank you in advance,

On 10/2/2018 8:52 AM, Igor Fedotov wrote:

May I have a repair log for that "already expanded" OSD?


On 10/2/2018 4:32 PM, Sergey Malinin wrote:
Repair goes through only when LVM volume has been expanded, 
otherwise it fails with enospc as well as any other operation. 
However, expanding the volume immediately renders bluefs unmountable 
with IO error.
2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at 
the very end of bluefs-log-dump), I'm not sure whether corruption 
occurred before or after volume expansion.




On 2.10.2018, at 16:07, Igor Fedotov  wrote:

You mentioned repair had worked before, is that correct? What's the 
difference now except the applied patch? Different OSD? Anything else?



On 10/2/2018 3:52 PM, Sergey Malinin wrote:


It didn't work, emailed logs to you.



On 2.10.2018, at 14:43, Igor Fedotov  wrote:

The major change is in get_bluefs_rebalance_txn function, it 
lacked bluefs_rebalance_txn assignment..




On 10/2/2018 2:40 PM, Sergey Malinin wrote:
PR doesn't seem to have changed since yesterday. Am I missing 
something?




On 2.10.2018, at 14:15, Igor Fedotov  wrote:

Please update the patch from the PR - it didn't update bluefs 
extents list before.


Also please set debug bluestore 20 when re-running repair and 
collect the log.


If repair doesn't help - would you send repair and startup logs 
directly to me as I have some issues accessing ceph-post-file 
uploads.



Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:
Yes, I did repair all OSDs and it finished with 'repair 
success'. I backed up OSDs so now I have more room to play.

I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did repair for any of this OSDs, didn't you? For all of 
them?



Would you please provide the log for both types (failed on 
mount and failed with enospc) of failing OSDs. Prior to 
collecting please remove existing ones prior and set debug 
bluestore to 20.




On 10/2/2018 2:16 AM, Sergey Malinin wrote:
I was able to apply patches to mimic, but nothing changed. 
One osd that I had space expanded on fails with bluefs mount 
IO error, others keep failing with enospc.




On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair which rebalances (i.e. allocates 
additional space) BlueFS space. Hence allowing OSD to start.


Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:
Not exactly. The rebalancing from this kv_sync_thread 
still might be deferred due to the nature of this thread 
(haven't 100% sure though).


Here is my PR showing the idea (still untested and perhaps 
unfinished!!!)


https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak    2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc    2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
- cct->_conf->bluestore_bluefs_balance_interval) {
-    bluefs_last_balance = after_flush;
-    int r = 
_balance_bluefs_freespace(_gift_extents);

-    assert(r >= 0);
-    if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-    bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
Even with a single device, BlueStore has a sort of implicit "BlueFS 
partition" where the DB is stored.  And it dynamically adjusts (rebalances) 
the space for that partition in the background. Unfortunately it might 
perform that "too lazily", and hence under some heavy load it might end up 
with a lack of space for that partition while the main device still has 
plenty of free space.


I'm planning to refactor this re-balancing procedure in the future to 
eliminate the root cause.
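
If you want to watch how much space BlueFS has taken from the main device
before it runs out, the OSD perf counters expose it; a quick way to look
(the osd id is an example):

  # On the OSD host:
  ceph daemon osd.0 perf dump | python -m json.tool | grep -A 12 '"bluefs"'

  # Fields of interest: db_total_bytes, db_used_bytes, slow_total_bytes,
  # slow_used_bytes, gift_bytes, reclaim_bytes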



Thanks,

Igor


On 10/2/2018 5:04 PM, Alex Litvak wrote:
I am sorry for interrupting the thread, but my understanding always 
was that blue store on the single device should not care of the DB 
size, i.e. it would use the data part for all operations if DB is 
full.  And if it is not true, what would be sensible defaults on 800 
GB SSD?  I used ceph-ansible to build my cluster with system defaults 
and from I reading in this thread doesn't give me a good feeling at 
all. Document ion on the topic is very sketchy and online posts 
contradict each other some times.


Thank you in advance,

On 10/2/2018 8:52 AM, Igor Fedotov wrote:

May I have a repair log for that "already expanded" OSD?


On 10/2/2018 4:32 PM, Sergey Malinin wrote:
Repair goes through only when LVM volume has been expanded, 
otherwise it fails with enospc as well as any other operation. 
However, expanding the volume immediately renders bluefs unmountable 
with IO error.
2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at 
the very end of bluefs-log-dump), I'm not sure whether corruption 
occurred before or after volume expansion.




On 2.10.2018, at 16:07, Igor Fedotov  wrote:

You mentioned repair had worked before, is that correct? What's the 
difference now except the applied patch? Different OSD? Anything else?



On 10/2/2018 3:52 PM, Sergey Malinin wrote:


It didn't work, emailed logs to you.



On 2.10.2018, at 14:43, Igor Fedotov  wrote:

The major change is in get_bluefs_rebalance_txn function, it 
lacked bluefs_rebalance_txn assignment..




On 10/2/2018 2:40 PM, Sergey Malinin wrote:
PR doesn't seem to have changed since yesterday. Am I missing 
something?




On 2.10.2018, at 14:15, Igor Fedotov  wrote:

Please update the patch from the PR - it didn't update bluefs 
extents list before.


Also please set debug bluestore 20 when re-running repair and 
collect the log.


If repair doesn't help - would you send repair and startup logs 
directly to me as I have some issues accessing ceph-post-file 
uploads.



Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:
Yes, I did repair all OSDs and it finished with 'repair 
success'. I backed up OSDs so now I have more room to play.

I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did repair for any of this OSDs, didn't you? For all of 
them?



Would you please provide the log for both types (failed on 
mount and failed with enospc) of failing OSDs. Prior to 
collecting please remove existing ones prior and set debug 
bluestore to 20.




On 10/2/2018 2:16 AM, Sergey Malinin wrote:
I was able to apply patches to mimic, but nothing changed. 
One osd that I had space expanded on fails with bluefs mount 
IO error, others keep failing with enospc.




On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair which rebalances (i.e. allocates 
additional space) BlueFS space. Hence allowing OSD to start.


Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:
Not exactly. The rebalancing from this kv_sync_thread 
still might be deferred due to the nature of this thread 
(haven't 100% sure though).


Here is my PR showing the idea (still untested and perhaps 
unfinished!!!)


https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak    2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc    2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
- cct->_conf->bluestore_bluefs_balance_interval) {
-    bluefs_last_balance = after_flush;
-    int r = 
_balance_bluefs_freespace(_gift_extents);

-    assert(r >= 0);
-    if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-    bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" 
<< std::hex

-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = 
_balance_bluefs_freespace(_gift_extents);

+  ceph_assert(r >= 0);
+  if (r > 0) {
+    for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+    bufferlist bl;
+    

Re: [ceph-users] Mimic offline problem

2018-10-02 Thread by morphin
One of the Ceph experts indicated that BlueStore is still somewhat preview tech
(at least as far as Red Hat is concerned).
So it could be best to check out BlueStore and RocksDB themselves. There are some
tools to check health and also to repair, but the documentation on them is
limited.
Does anyone have experience with them?
Any lead/help towards a proper check would be great.
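
For reference, the checks I could find so far are along these lines (run with
the OSD stopped; the path is an example):

  systemctl stop ceph-osd@90
  ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-90
  ceph-bluestore-tool fsck --deep --path /var/lib/ceph/osd/ceph-90
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-90
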
On Mon, 1 Oct 2018 at 22:55, Goktug Yildirim wrote:
>
> Hi all,
>
> We have recently upgraded from Luminous to Mimic. It’s been 6 days since this 
> cluster went offline. The long story, in short, is here: 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>
> I’ve also CC’ed the developers since I believe this is a bug. If this is not the 
> correct way, I apologize; please let me know.
>
> Over those 6 days lots of things happened and there were some findings about the 
> problem. Some of them were misjudged and some of them were not looked into deeply.
> However the most certain diagnosis is this: each OSD causes very high disk 
> I/O to its bluestore disk (WAL and DB are fine). After that the OSDs become 
> unresponsive or respond very, very slowly. For example "ceph tell osd.x 
> version" gets stuck practically forever.
>
> So due to unresponsive OSDs cluster does not settle. This is our problem!
>
> This is the one we are very sure of. But we are not sure of the reason.
>
> Here is the latest ceph status:
> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>
> This is the status after we started all of the OSDs 24 hours ago.
> Some of the OSDs have not started. However, it didn't make any difference when 
> all of them were online.
>
> Here is the debug=20 log of an OSD, which is the same for all the others:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
> As far as we can figure out there is a loop pattern. I am sure it won't be caught by eye.
>
> This is the full log of the same OSD.
> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>
> Here is the strace of the same OSD process:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
>
> Recently we hear more advice to upgrade to Mimic. I hope no one gets hurt as we 
> did. I am sure we have made lots of mistakes to let this happen. And this situation 
> may be an example for other users and could be a potential bug for the Ceph 
> developers.
>
> Any help to figure out what is going on would be great.
>
> Best Regards,
> Goktug Yildirim
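
Regarding the unresponsive OSDs and "ceph tell osd.x version" hanging in the
quoted report above: the local admin socket usually still answers even when an
OSD is unreachable over the network, so something like this might narrow it
down (the osd id is an example):

  ceph daemon osd.90 status
  ceph daemon osd.90 dump_ops_in_flight
  ceph daemon osd.90 dump_historic_ops
  iostat -x 1 5    # confirm which device is actually saturated
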
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Alex Litvak
I am sorry for interrupting the thread, but my understanding always was 
that BlueStore on a single device should not care about the DB size, 
i.e. it would use the data part for all operations if the DB is full.  And 
if that is not true, what would be sensible defaults on an 800 GB SSD?  I 
used ceph-ansible to build my cluster with system defaults, and what I am 
reading in this thread doesn't give me a good feeling at all. Documentation 
on the topic is very sketchy and online posts sometimes contradict each 
other.


Thank you in advance,

On 10/2/2018 8:52 AM, Igor Fedotov wrote:

May I have a repair log for that "already expanded" OSD?


On 10/2/2018 4:32 PM, Sergey Malinin wrote:
Repair goes through only when LVM volume has been expanded, otherwise 
it fails with enospc as well as any other operation. However, 
expanding the volume immediately renders bluefs unmountable with IO 
error.
2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at the 
very end of bluefs-log-dump), I'm not sure whether corruption occurred 
before or after volume expansion.



On 2.10.2018, at 16:07, Igor Fedotov 
 wrote:


You mentioned repair had worked before, is that correct? What's the 
difference now except the applied patch? Different OSD? Anything else?



On 10/2/2018 3:52 PM, Sergey Malinin wrote:


It didn't work, emailed logs to you.


On 2.10.2018, at 14:43, Igor Fedotov 
 wrote:


The major change is in get_bluefs_rebalance_txn function, it lacked 
bluefs_rebalance_txn assignment..




On 10/2/2018 2:40 PM, Sergey Malinin wrote:
PR doesn't seem to have changed since yesterday. Am I missing 
something?



On 2.10.2018, at 14:15, Igor Fedotov 
 wrote:


Please update the patch from the PR - it didn't update bluefs 
extents list before.


Also please set debug bluestore 20 when re-running repair and 
collect the log.


If repair doesn't help - would you send repair and startup logs 
directly to me as I have some issues accessing ceph-post-file 
uploads.



Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:
Yes, I did repair all OSDs and it finished with 'repair 
success'. I backed up OSDs so now I have more room to play.

I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93


On 2.10.2018, at 11:26, Igor Fedotov 
 wrote:


You did repair for any of this OSDs, didn't you? For all of them?


Would you please provide the log for both types (failed on 
mount and failed with enospc) of failing OSDs. Prior to 
collecting please remove existing ones prior and set debug 
bluestore to 20.




On 10/2/2018 2:16 AM, Sergey Malinin wrote:
I was able to apply patches to mimic, but nothing changed. One 
osd that I had space expanded on fails with bluefs mount IO 
error, others keep failing with enospc.



On 1.10.2018, at 19:26, Igor Fedotov 
 wrote:


So you should call repair which rebalances (i.e. allocates 
additional space) BlueFS space. Hence allowing OSD to start.


Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:
Not exactly. The rebalancing from this kv_sync_thread still 
might be deferred due to the nature of this thread (haven't 
100% sure though).


Here is my PR showing the idea (still untested and perhaps 
unfinished!!!)


https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak    2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc    2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-    bluefs_last_balance = after_flush;
-    int r = _balance_bluefs_freespace(_gift_extents);
-    assert(r >= 0);
-    if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-    bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << 
std::hex

-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = 
_balance_bluefs_freespace(_gift_extents);

+  ceph_assert(r >= 0);
+  if (r > 0) {
+    for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+    bufferlist bl;
+    encode(bluefs_extents, bl);
+    dout(10) << __func__ << " bluefs_extents now 0x" << 
std::hex

+ << bluefs_extents << std::dec << dendl;
+    synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }
   // cleanup sync deferred keys

On 1.10.2018, at 18:39, Igor Fedotov 
 wrote:


So you have just a single main device per OSD

Then bluestore-tool wouldn't help, it's unable to expand 
BlueFS partition at main device, standalone devices are 
supported only.


Given that you're able to rebuild the code I 

Re: [ceph-users] QEMU/Libvirt + librbd issue using Luminous 12.2.7

2018-10-02 Thread Jason Dillaman
Perhaps that pastebin link has the wrong log pasted? The provided log
looks like it's associated with the creation of image
"32635-b6592790-5519-5184-b5ef-5f16b3523250" and not the attachment of
an image to a VM.
On Fri, Sep 28, 2018 at 3:15 PM Andre Goree  wrote:
>
> On 2018/09/28 2:26 pm, Andre Goree wrote:
> > On 2018/08/21 1:24 pm, Jason Dillaman wrote:
> >> Can you collect any librados / librbd debug logs and provide them via
> >> pastebin? Just add / tweak the following in your "/etc/ceph/ceph.conf"
> >> file's "[client]" section and re-run to gather the logs.
> >>
> >> [client]
> >> log file = /path/to/a/log/file
> >> debug ms = 1
> >> debug monc = 20
> >> debug objecter = 20
> >> debug rados = 20
> >> debug rbd = 20
> > ...
> >>
> >>
> >>
> >> --
> >> Jason
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > Returning to this as I've finally had time, lol.
> >
> > I've tried adding the above [client] lines to both the machine on
> > which the VMs run (the one running libvirt and the VMs) as well as the
> > ceph node running the MON and MGR, but nothing happens -- i.e.,
> > nothing is printed to the logfile that I define.
> >
> > FWIW, I'm still having this issue in 12.2.8 as well :/
>
>
>
> I actually got the logging working, here's the log from a failed attach:
>   https://pastebin.com/jCiD4E2p
>
> Thanks!
>
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email - andre at drenet.net
> Website   - http://blog.drenet.net
> PGP key   - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov

May I have a repair log for that "already expanded" OSD?


On 10/2/2018 4:32 PM, Sergey Malinin wrote:

Repair goes through only when LVM volume has been expanded, otherwise it fails 
with enospc as well as any other operation. However, expanding the volume 
immediately renders bluefs unmountable with IO error.
2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at the very end 
of bluefs-log-dump), I'm not sure whether corruption occurred before or after 
volume expansion.



On 2.10.2018, at 16:07, Igor Fedotov  wrote:

You mentioned repair had worked before, is that correct? What's the difference 
now except the applied patch? Different OSD? Anything else?


On 10/2/2018 3:52 PM, Sergey Malinin wrote:


It didn't work, emailed logs to you.



On 2.10.2018, at 14:43, Igor Fedotov  wrote:

The major change is in get_bluefs_rebalance_txn function, it lacked 
bluefs_rebalance_txn assignment..



On 10/2/2018 2:40 PM, Sergey Malinin wrote:

PR doesn't seem to have changed since yesterday. Am I missing something?



On 2.10.2018, at 14:15, Igor Fedotov  wrote:

Please update the patch from the PR - it didn't update bluefs extents list 
before.

Also please set debug bluestore 20 when re-running repair and collect the log.

If repair doesn't help - would you send repair and startup logs directly to me 
as I have some issues accessing ceph-post-file uploads.


Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:

Yes, I did repair all OSDs and it finished with 'repair success'. I backed up 
OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did repair for any of this OSDs, didn't you? For all of them?


Would you please provide the log for both types (failed on mount and failed 
with enospc) of failing OSDs. Prior to collecting please remove existing ones 
prior and set debug bluestore to 20.



On 10/2/2018 2:16 AM, Sergey Malinin wrote:

I was able to apply patches to mimic, but nothing changed. One osd that I had 
space expanded on fails with bluefs mount IO error, others keep failing with 
enospc.



On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair which rebalances (i.e. allocates additional space) 
BlueFS space. Hence allowing OSD to start.

Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:

Not exactly. The rebalancing from this kv_sync_thread still might be deferred 
due to the nature of this thread (haven't 100% sure though).

Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-bluefs_last_balance = after_flush;
-int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-assert(r >= 0);
-if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+bufferlist bl;
+encode(bluefs_extents, bl);
+dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+ << bluefs_extents << std::dec << dendl;
+synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }
   // cleanup sync deferred keys


On 1.10.2018, at 18:39, Igor Fedotov  wrote:

So you have just a single main device per OSD

Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition at 
main device, standalone devices are supported only.

Given that you're able to rebuild the code I can suggest to make a patch that 
triggers BlueFS rebalance (see code snippet below) on repairing.
  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
for (auto& p : bluefs_gift_extents) {
  bluefs_extents.insert(p.offset, p.length);
}
bufferlist bl;
encode(bluefs_extents, bl);
dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
 << bluefs_extents << std::dec << dendl;
synct->set(PREFIX_SUPER, "bluefs_extents", bl);
  }

If it waits 

Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-02 Thread Janne Johansson
Den mån 1 okt. 2018 kl 22:08 skrev John Spray :

>
> > totally new for me, also not what I would expect of a mv on a fs. I know
> > this is normal to expect coping between pools, also from the s3cmd
> > client. But I think more people will not expect this behaviour. Can't
> > the move be implemented as a move?
>
> In almost all filesystems, a rename (like "mv") is a pure metadata
> operation -- it doesn't involve reading all the file's data and
> re-writing it.  It would be very surprising for most users if they
> found that their "mv" command blocked for a very long time while
> waiting for a large file's content to be e.g. read out of one pool and
> written into another.


There are other networked filesystems which do behave like that, where
the OS thinks the whole mount is one single FS, but when you move stuff
around with mv it actually needs to move all the data to other servers/disks
and incur the slowness of a copy/delete operation.
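
For completeness, the layout attributes make it visible which data pool a
directory is bound to, and an actual data migration has to be a copy; a rough
sketch (paths and the pool name are examples):

  getfattr -n ceph.dir.layout.pool /mnt/cephfs/olddir
  # Only files created after this point land in the new pool:
  setfattr -n ceph.dir.layout.pool -v new_pool /mnt/cephfs/newdir
  # mv would leave existing objects in the old pool; copying rewrites them:
  cp -a /mnt/cephfs/olddir/bigfile /mnt/cephfs/newdir/ && rm /mnt/cephfs/olddir/bigfile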

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
Repair goes through only when the LVM volume has been expanded; otherwise it fails 
with enospc, as does any other operation. However, expanding the volume 
immediately renders bluefs unmountable with an IO error. 
2 of 3 OSDs got their bluefs log corrupted (the bluestore tool segfaults at the very 
end of bluefs-log-dump); I'm not sure whether the corruption occurred before or after 
the volume expansion.
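
For context, the expansion sequence I'm referring to is roughly the following
(VG/LV names are examples; and as noted further down in the thread, the tool is
not really meant to expand a BlueFS partition on the main device in this
release):

  systemctl stop ceph-osd@2
  lvextend -L +20G /dev/ceph-vg/osd-block-2
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2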


> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
> 
> You mentioned repair had worked before, is that correct? What's the 
> difference now except the applied patch? Different OSD? Anything else?
> 
> 
> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
> 
>> It didn't work, emailed logs to you.
>> 
>> 
>>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
>>> 
>>> The major change is in get_bluefs_rebalance_txn function, it lacked 
>>> bluefs_rebalance_txn assignment..
>>> 
>>> 
>>> 
>>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
 PR doesn't seem to have changed since yesterday. Am I missing something?
 
 
> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
> 
> Please update the patch from the PR - it didn't update bluefs extents 
> list before.
> 
> Also please set debug bluestore 20 when re-running repair and collect the 
> log.
> 
> If repair doesn't help - would you send repair and startup logs directly 
> to me as I have some issues accessing ceph-post-file uploads.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>> Yes, I did repair all OSDs and it finished with 'repair success'. I 
>> backed up OSDs so now I have more room to play.
>> I posted log files using ceph-post-file with the following IDs:
>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>> 
>> 
>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>> 
>>> You did repair for any of this OSDs, didn't you? For all of them?
>>> 
>>> 
>>> Would you please provide the log for both types (failed on mount and 
>>> failed with enospc) of failing OSDs. Prior to collecting please remove 
>>> existing ones prior and set debug bluestore to 20.
>>> 
>>> 
>>> 
>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
 I was able to apply patches to mimic, but nothing changed. One osd 
 that I had space expanded on fails with bluefs mount IO error, others 
 keep failing with enospc.
 
 
> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
> 
> So you should call repair which rebalances (i.e. allocates additional 
> space) BlueFS space. Hence allowing OSD to start.
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>> Not exactly. The rebalancing from this kv_sync_thread still might be 
>> deferred due to the nature of this thread (haven't 100% sure though).
>> 
>> Here is my PR showing the idea (still untested and perhaps 
>> unfinished!!!)
>> 
>> https://github.com/ceph/ceph/pull/24353
>> 
>> 
>> Igor
>> 
>> 
>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>> Can you please confirm whether I got this right:
>>> 
>>> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
>>> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
>>> @@ -9049,22 +9049,17 @@
>>> throttle_bytes.put(costs);
>>>   PExtentVector bluefs_gift_extents;
>>> -  if (bluefs &&
>>> -  after_flush - bluefs_last_balance >
>>> -  cct->_conf->bluestore_bluefs_balance_interval) {
>>> -bluefs_last_balance = after_flush;
>>> -int r = _balance_bluefs_freespace(_gift_extents);
>>> -assert(r >= 0);
>>> -if (r > 0) {
>>> -  for (auto& p : bluefs_gift_extents) {
>>> -bluefs_extents.insert(p.offset, p.length);
>>> -  }
>>> -  bufferlist bl;
>>> -  encode(bluefs_extents, bl);
>>> -  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>> -   << bluefs_extents << std::dec << dendl;
>>> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>> +  int r = _balance_bluefs_freespace(_gift_extents);
>>> +  ceph_assert(r >= 0);
>>> +  if (r > 0) {
>>> +for (auto& p : bluefs_gift_extents) {
>>> +  bluefs_extents.insert(p.offset, p.length);
>>>   }
>>> +bufferlist bl;
>>> +encode(bluefs_extents, bl);
>>> +dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>> + << bluefs_extents << std::dec << dendl;
>>> +synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>> }

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
You mentioned repair had worked before, is that correct? What's the 
difference now except the applied patch? Different OSD? Anything else?



On 10/2/2018 3:52 PM, Sergey Malinin wrote:


It didn't work, emailed logs to you.



On 2.10.2018, at 14:43, Igor Fedotov  wrote:

The major change is in get_bluefs_rebalance_txn function, it lacked 
bluefs_rebalance_txn assignment..



On 10/2/2018 2:40 PM, Sergey Malinin wrote:

PR doesn't seem to have changed since yesterday. Am I missing something?



On 2.10.2018, at 14:15, Igor Fedotov  wrote:

Please update the patch from the PR - it didn't update bluefs extents list 
before.

Also please set debug bluestore 20 when re-running repair and collect the log.

If repair doesn't help - would you send repair and startup logs directly to me 
as I have some issues accessing ceph-post-file uploads.


Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:

Yes, I did repair all OSDs and it finished with 'repair success'. I backed up 
OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did repair for any of this OSDs, didn't you? For all of them?


Would you please provide the log for both types (failed on mount and failed 
with enospc) of failing OSDs. Prior to collecting please remove existing ones 
prior and set debug bluestore to 20.



On 10/2/2018 2:16 AM, Sergey Malinin wrote:

I was able to apply patches to mimic, but nothing changed. One osd that I had 
space expanded on fails with bluefs mount IO error, others keep failing with 
enospc.



On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair which rebalances (i.e. allocates additional space) 
BlueFS space. Hence allowing OSD to start.

Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:

Not exactly. The rebalancing from this kv_sync_thread still might be deferred 
due to the nature of this thread (haven't 100% sure though).

Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-bluefs_last_balance = after_flush;
-int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-assert(r >= 0);
-if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+bufferlist bl;
+encode(bluefs_extents, bl);
+dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+ << bluefs_extents << std::dec << dendl;
+synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }
   // cleanup sync deferred keys


On 1.10.2018, at 18:39, Igor Fedotov  wrote:

So you have just a single main device per OSD

Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition at 
main device, standalone devices are supported only.

Given that you're able to rebuild the code I can suggest to make a patch that 
triggers BlueFS rebalance (see code snippet below) on repairing.
  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
for (auto& p : bluefs_gift_extents) {
  bluefs_extents.insert(p.offset, p.length);
}
bufferlist bl;
encode(bluefs_extents, bl);
dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
 << bluefs_extents << std::dec << dendl;
synct->set(PREFIX_SUPER, "bluefs_extents", bl);
  }

If it waits I can probably make a corresponding PR tomorrow.

Thanks,
Igor
On 10/1/2018 6:16 PM, Sergey Malinin wrote:

I have rebuilt the tool, but none of my OSDs no matter dead or alive have any 
symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db 
compaction. After executing bluefs-bdev-expand OSD fails to start, however 
'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 

Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Paul Emmerich
The kernel cephfs client unfortunately has a tendency to get stuck
in states that are effectively unrecoverable, especially on older
kernels.
Usually it cannot be cleared without a reboot.
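
Before rebooting, it is at least possible to confirm that requests are stuck
in the kernel client: its state is exposed under debugfs (the fsid/client
directory name will differ per mount):

  mount -t debugfs none /sys/kernel/debug 2>/dev/null
  ls /sys/kernel/debug/ceph/
  cat /sys/kernel/debug/ceph/*/mdsc    # pending MDS requests
  cat /sys/kernel/debug/ceph/*/osdc    # pending OSD/RADOS requests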

Paul
On Tue, 2 Oct 2018 at 14:55, Jaime Ibar wrote:
>
> Hi Paul,
>
> I tried ceph-fuse mounting it in a different mount point and it worked.
>
> The problem here is we can't unmount ceph kernel client as it is in use
>
> by some virsh processes. We forced the unmount and mount ceph-fuse
>
> but we got an I/O error and mount -l cleared all the processes but after
>
> rebooting the vm's they didn't come back and a server reboot was needed.
>
> Not sure how can I restore mds session or remounting cephfs keeping
>
> all processes.
>
> Thanks a lot for your help.
>
> Jaime
>
>
> On 02/10/18 11:02, Paul Emmerich wrote:
> > Kernel 4.4 is not suitable for a multi MDS setup. In general, I
> > wouldn't feel comfortable running 4.4 with kernel cephfs in
> > production.
> > I think at least 4.15 (not sure, but definitely > 4.9) is recommended
> > for multi MDS setups.
> >
> > If you can't reboot: maybe try cephfs-fuse instead which is usually
> > very awesome and usually fast enough.
> >
> > Paul
> >
> > On Tue, 2 Oct 2018 at 10:45, Jaime Ibar wrote:
> >> Hi Paul,
> >>
> >> we're using 4.4 kernel. Not sure if more recent kernels are stable
> >>
> >> for production services. In any case, as there are some production
> >>
> >> services running on those servers, rebooting wouldn't be an option
> >>
> >> if we can bring ceph clients back without rebooting.
> >>
> >> Thanks
> >>
> >> Jaime
> >>
> >>
> >> On 01/10/18 21:10, Paul Emmerich wrote:
> >>> Which kernel version are you using for the kernel cephfs clients?
> >>> I've seen this problem with "older" kernels (where old is as recent as 
> >>> 4.9)
> >>>
> >>> Paul
> >>> On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:
>  Hi all,
> 
>  we're running a ceph 12.2.7 Luminous cluster, two weeks ago we enabled
>  multi mds and after few hours
> 
>  these errors started showing up
> 
>  2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
>  old, received at 2018-09-28 09:40:16.155841:
>  client_request(client.31059144:8544450 getattr Xs #0$
>  12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
>  currently failed to authpin local pins
> 
>  2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
>  failing to respond to cache pressure (MDS_CLIENT_RECALL)
>  2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
>  below; oldest blocked for > 4614.580689 secs
>  2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
>  old, received at 2018-09-28 10:53:03.203476:
>  client_request(client.31059144:9080057 lookup #0x100
>  000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
>  currently initiated
>  2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
>  failing to respond to capability release; 5 clients failing to respond
>  to cache pressure; 1 MDSs report slow requests,
> 
>  Due to this, we decide to go back to single mds(as it worked before),
>  however, the clients pointing to mds.1 started hanging, however, the
>  ones pointing to mds.0 worked fine.
> 
>  Then, we tried to enable multi mds again and the clients pointing mds.1
>  went back online, however the ones pointing to mds.0 stopped work.
> 
>  Today, we tried to go back to single mds, however this error was
>  preventing ceph to disable second active mds(mds.1)
> 
>  2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
>  X: (30108925), after 68213.084174 seconds
> 
>  After wait for 3 hours, we restarted mds.1 daemon (as it was stuck in
>  stopping state forever due to the above error), we waited for it to
>  become active again,
> 
>  unmount the problematic clients, wait for the cluster to be healthy and
>  try to go back to single mds again.
> 
>  Apparently this worked with some of the clients, we tried to enable
>  multi mds again to bring faulty clients back again, however no luck this
>  time
> 
>  and some of them are hanging and can't access to ceph fs.
> 
>  This is what we have in kern.log
> 
>  Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
>  Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
>  Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
> 
>  Not sure what else can we try to bring hanging clients back without
>  rebooting as they're in production and rebooting is not an option.
> 
>  Does anyone know how can we deal with this, please?
> 
>  Thanks
> 
>  Jaime
> 
>  --
> 
>  Jaime Ibar
>  High 

Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi Paul,

I tried ceph-fuse mounting it in a different mount point and it worked.

The problem here is that we can't unmount the ceph kernel client as it is in use

by some virsh processes. We forced the unmount and mounted ceph-fuse,

but we got an I/O error; mount -l cleared all the processes, but after

rebooting the VMs they didn't come back and a server reboot was needed.

Not sure how I can restore the MDS session or remount CephFS while keeping

all processes running.
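
A sketch of what I mean, in case it clarifies the question (daemon names, the
client id and the mount point are examples):

  # On the active MDS host: list sessions to find the stuck kernel client
  ceph daemon mds.$(hostname -s) session ls

  # Evict one stuck session by client id (note: this blacklists the client
  # by default):
  ceph tell mds.0 client evict id=31059144

  # Workaround mount via FUSE on a different mount point:
  ceph-fuse --id admin -m 192.168.1.10:6789 /mnt/cephfs-fuse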

Thanks a lot for your help.

Jaime


On 02/10/18 11:02, Paul Emmerich wrote:

Kernel 4.4 is not suitable for a multi MDS setup. In general, I
wouldn't feel comfortable running 4.4 with kernel cephfs in
production.
I think at least 4.15 (not sure, but definitely > 4.9) is recommended
for multi MDS setups.

If you can't reboot: maybe try cephfs-fuse instead which is usually
very awesome and usually fast enough.

Paul

On Tue, 2 Oct 2018 at 10:45, Jaime Ibar wrote:

Hi Paul,

we're using 4.4 kernel. Not sure if more recent kernels are stable

for production services. In any case, as there are some production

services running on those servers, rebooting wouldn't be an option

if we can bring ceph clients back without rebooting.

Thanks

Jaime


On 01/10/18 21:10, Paul Emmerich wrote:

Which kernel version are you using for the kernel cephfs clients?
I've seen this problem with "older" kernels (where old is as recent as 4.9)

Paul
On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster, two weeks ago we enabled
multi mds and after few hours

these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
old, received at 2018-09-28 09:40:16.155841:
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
currently failed to authpin local pins

2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
old, received at 2018-09-28 10:53:03.203476:
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
failing to respond to capability release; 5 clients failing to respond
to cache pressure; 1 MDSs report slow requests,

Due to this, we decided to go back to a single MDS (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then, we tried to enable multi MDS again and the clients pointing to mds.1
went back online, but the ones pointing to mds.0 stopped working.

Today, we tried to go back to single mds, however this error was
preventing ceph to disable second active mds(mds.1)

2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
X: (30108925), after 68213.084174 seconds

After wait for 3 hours, we restarted mds.1 daemon (as it was stuck in
stopping state forever due to the above error), we waited for it to
become active again,

unmount the problematic clients, wait for the cluster to be healthy and
try to go back to single mds again.

Apparently this worked with some of the clients, we tried to enable
multi mds again to bring faulty clients back again, however no luck this
time

and some of them are hanging and can't access to ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else can we try to bring hanging clients back without
rebooting as they're in production and rebooting is not an option.

Does anyone know how can we deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725





--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
It didn't work, emailed logs to you.


> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
> 
> The major change is in get_bluefs_rebalance_txn function, it lacked 
> bluefs_rebalance_txn assignment..
> 
> 
> 
> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>> PR doesn't seem to have changed since yesterday. Am I missing something?
>> 
>> 
>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>> 
>>> Please update the patch from the PR - it didn't update bluefs extents list 
>>> before.
>>> 
>>> Also please set debug bluestore 20 when re-running repair and collect the 
>>> log.
>>> 
>>> If repair doesn't help - would you send repair and startup logs directly to 
>>> me as I have some issues accessing ceph-post-file uploads.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
 Yes, I did repair all OSDs and it finished with 'repair success'. I backed 
 up OSDs so now I have more room to play.
 I posted log files using ceph-post-file with the following IDs:
 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
 20df7df5-f0c9-4186-aa21-4e5c0172cd93
 
 
> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
> 
> You did repair for any of this OSDs, didn't you? For all of them?
> 
> 
> Would you please provide the log for both types (failed on mount and 
> failed with enospc) of failing OSDs. Prior to collecting please remove 
> existing ones prior and set debug bluestore to 20.
> 
> 
> 
> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>> I was able to apply patches to mimic, but nothing changed. One osd that 
>> I had space expanded on fails with bluefs mount IO error, others keep 
>> failing with enospc.
>> 
>> 
>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>> 
>>> So you should call repair which rebalances (i.e. allocates additional 
>>> space) BlueFS space. Hence allowing OSD to start.
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
 Not exactly. The rebalancing from this kv_sync_thread still might be 
 deferred due to the nature of this thread (haven't 100% sure though).
 
 Here is my PR showing the idea (still untested and perhaps 
 unfinished!!!)
 
 https://github.com/ceph/ceph/pull/24353
 
 
 Igor
 
 
 On 10/1/2018 7:07 PM, Sergey Malinin wrote:
> Can you please confirm whether I got this right:
> 
> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
> @@ -9049,22 +9049,17 @@
> throttle_bytes.put(costs);
>   PExtentVector bluefs_gift_extents;
> -  if (bluefs &&
> -  after_flush - bluefs_last_balance >
> -  cct->_conf->bluestore_bluefs_balance_interval) {
> -bluefs_last_balance = after_flush;
> -int r = _balance_bluefs_freespace(&bluefs_gift_extents);
> -assert(r >= 0);
> -if (r > 0) {
> -  for (auto& p : bluefs_gift_extents) {
> -bluefs_extents.insert(p.offset, p.length);
> -  }
> -  bufferlist bl;
> -  encode(bluefs_extents, bl);
> -  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
> -   << bluefs_extents << std::dec << dendl;
> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
> +  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
> +  ceph_assert(r >= 0);
> +  if (r > 0) {
> +for (auto& p : bluefs_gift_extents) {
> +  bluefs_extents.insert(p.offset, p.length);
>   }
> +bufferlist bl;
> +encode(bluefs_extents, bl);
> +dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
> + << bluefs_extents << std::dec << dendl;
> +synct->set(PREFIX_SUPER, "bluefs_extents", bl);
> }
>   // cleanup sync deferred keys
> 
>> On 1.10.2018, at 18:39, Igor Fedotov  wrote:
>> 
>> So you have just a single main device per OSD
>> 
>> Then bluestore-tool wouldn't help, it's unable to expand BlueFS 
>> partition at main device, standalone devices are supported only.
>> 
>> Given that you're able to rebuild the code I can suggest to make a 
>> patch that triggers BlueFS rebalance (see code snippet below) on 
>> repairing.
>>  PExtentVector bluefs_gift_extents;
>>  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
>>  ceph_assert(r >= 0);
>>  if (r > 0) {
>>for (auto& p : bluefs_gift_extents) {
>>  bluefs_extents.insert(p.offset, p.length);
>>}
>>

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
The major change is in the get_bluefs_rebalance_txn function; it lacked the
bluefs_rebalance_txn assignment.




On 10/2/2018 2:40 PM, Sergey Malinin wrote:

PR doesn't seem to have changed since yesterday. Am I missing something?



On 2.10.2018, at 14:15, Igor Fedotov  wrote:

Please update the patch from the PR - it didn't update bluefs extents list 
before.

Also please set debug bluestore 20 when re-running repair and collect the log.

If repair doesn't help - would you send repair and startup logs directly to me 
as I have some issues accessing ceph-post-file uploads.


Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:

Yes, I did repair all OSDs and it finished with 'repair success'. I backed up 
OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did run repair on these OSDs, didn't you? On all of them?


Would you please provide the logs for both types of failing OSDs (failed on
mount and failed with enospc)? Prior to collecting, please remove the existing
logs and set debug bluestore to 20.



On 10/2/2018 2:16 AM, Sergey Malinin wrote:

I was able to apply patches to mimic, but nothing changed. One osd that I had 
space expanded on fails with bluefs mount IO error, others keep failing with 
enospc.



On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair, which rebalances BlueFS space (i.e. allocates
additional space), hence allowing the OSD to start.

Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:

Not exactly. The rebalancing from this kv_sync_thread still might be deferred
due to the nature of this thread (not 100% sure though).

Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-bluefs_last_balance = after_flush;
-int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-assert(r >= 0);
-if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+bufferlist bl;
+encode(bluefs_extents, bl);
+dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+ << bluefs_extents << std::dec << dendl;
+synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }
   // cleanup sync deferred keys


On 1.10.2018, at 18:39, Igor Fedotov  wrote:

So you have just a single main device per OSD.

Then bluestore-tool won't help; it's unable to expand the BlueFS partition on
the main device, only standalone devices are supported.

Given that you're able to rebuild the code, I can suggest making a patch that
triggers a BlueFS rebalance (see the code snippet below) during repair.
  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
for (auto& p : bluefs_gift_extents) {
  bluefs_extents.insert(p.offset, p.length);
}
bufferlist bl;
encode(bluefs_extents, bl);
dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
 << bluefs_extents << std::dec << dendl;
synct->set(PREFIX_SUPER, "bluefs_extents", bl);
  }

If it waits I can probably make a corresponding PR tomorrow.

Thanks,
Igor
On 10/1/2018 6:16 PM, Sergey Malinin wrote:

I have rebuilt the tool, but none of my OSDs no matter dead or alive have any 
symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db 
compaction. After executing bluefs-bdev-expand OSD fails to start, however 
'fsck' and 'repair' commands finished successfully.
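
For reference, a rough sketch of the ceph-bluestore-tool invocations being discussed here; the OSD path is a placeholder and should be adjusted to the failing OSD:

$ ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
$ ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1
$ ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-1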

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
PR doesn't seem to have changed since yesterday. Am I missing something?


> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
> 
> Please update the patch from the PR - it didn't update bluefs extents list 
> before.
> 
> Also please set debug bluestore 20 when re-running repair and collect the log.
> 
> If repair doesn't help - would you send repair and startup logs directly to 
> me as I have some issues accessing ceph-post-file uploads.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>> Yes, I did repair all OSDs and it finished with 'repair success'. I backed 
>> up OSDs so now I have more room to play.
>> I posted log files using ceph-post-file with the following IDs:
>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>> 
>> 
>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>> 
>>> You did repair for any of this OSDs, didn't you? For all of them?
>>> 
>>> 
>>> Would you please provide the log for both types (failed on mount and failed 
>>> with enospc) of failing OSDs. Prior to collecting please remove existing 
>>> ones prior and set debug bluestore to 20.
>>> 
>>> 
>>> 
>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
 I was able to apply patches to mimic, but nothing changed. One osd that I 
 had space expanded on fails with bluefs mount IO error, others keep 
 failing with enospc.
 
 
> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
> 
> So you should call repair which rebalances (i.e. allocates additional 
> space) BlueFS space. Hence allowing OSD to start.
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>> Not exactly. The rebalancing from this kv_sync_thread still might be 
>> deferred due to the nature of this thread (haven't 100% sure though).
>> 
>> Here is my PR showing the idea (still untested and perhaps unfinished!!!)
>> 
>> https://github.com/ceph/ceph/pull/24353
>> 
>> 
>> Igor
>> 
>> 
>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>> Can you please confirm whether I got this right:
>>> 
>>> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
>>> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
>>> @@ -9049,22 +9049,17 @@
>>> throttle_bytes.put(costs);
>>>   PExtentVector bluefs_gift_extents;
>>> -  if (bluefs &&
>>> -  after_flush - bluefs_last_balance >
>>> -  cct->_conf->bluestore_bluefs_balance_interval) {
>>> -bluefs_last_balance = after_flush;
>>> -int r = _balance_bluefs_freespace(_gift_extents);
>>> -assert(r >= 0);
>>> -if (r > 0) {
>>> -  for (auto& p : bluefs_gift_extents) {
>>> -bluefs_extents.insert(p.offset, p.length);
>>> -  }
>>> -  bufferlist bl;
>>> -  encode(bluefs_extents, bl);
>>> -  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>> -   << bluefs_extents << std::dec << dendl;
>>> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>> +  int r = _balance_bluefs_freespace(_gift_extents);
>>> +  ceph_assert(r >= 0);
>>> +  if (r > 0) {
>>> +for (auto& p : bluefs_gift_extents) {
>>> +  bluefs_extents.insert(p.offset, p.length);
>>>   }
>>> +bufferlist bl;
>>> +encode(bluefs_extents, bl);
>>> +dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>> + << bluefs_extents << std::dec << dendl;
>>> +synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>> }
>>>   // cleanup sync deferred keys
>>> 
 On 1.10.2018, at 18:39, Igor Fedotov  wrote:
 
 So you have just a single main device per OSD
 
 Then bluestore-tool wouldn't help, it's unable to expand BlueFS 
 partition at main device, standalone devices are supported only.
 
 Given that you're able to rebuild the code I can suggest to make a 
 patch that triggers BlueFS rebalance (see code snippet below) on 
 repairing.
  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
for (auto& p : bluefs_gift_extents) {
  bluefs_extents.insert(p.offset, p.length);
}
bufferlist bl;
encode(bluefs_extents, bl);
dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
 << bluefs_extents << std::dec << dendl;
synct->set(PREFIX_SUPER, "bluefs_extents", bl);
  }
 
 If it waits I can probably make a corresponding PR tomorrow.
 
 Thanks,
 Igor
 On 10/1/2018 6:16 PM, Sergey Malinin wrote:
> I have rebuilt the tool, but none of my OSDs 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov
Please update the patch from the PR - it didn't update bluefs extents 
list before.


Also please set debug bluestore 20 when re-running repair and collect 
the log.
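
For example, something along these lines should produce a suitably verbose repair log (the OSD path and log file are placeholders, and the exact option spelling may vary slightly between releases):

$ ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-1 \
    --log-file /var/log/ceph/osd.1-repair.log --log-level 20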


If repair doesn't help - would you send repair and startup logs directly 
to me as I have some issues accessing ceph-post-file uploads.



Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:

Yes, I did repair all OSDs and it finished with 'repair success'. I backed up 
OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did run repair on these OSDs, didn't you? On all of them?


Would you please provide the logs for both types of failing OSDs (failed on
mount and failed with enospc)? Prior to collecting, please remove the existing
logs and set debug bluestore to 20.



On 10/2/2018 2:16 AM, Sergey Malinin wrote:

I was able to apply patches to mimic, but nothing changed. One osd that I had 
space expanded on fails with bluefs mount IO error, others keep failing with 
enospc.



On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair, which rebalances BlueFS space (i.e. allocates
additional space), hence allowing the OSD to start.

Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:

Not exactly. The rebalancing from this kv_sync_thread still might be deferred
due to the nature of this thread (not 100% sure though).

Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-bluefs_last_balance = after_flush;
-int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-assert(r >= 0);
-if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+bufferlist bl;
+encode(bluefs_extents, bl);
+dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+ << bluefs_extents << std::dec << dendl;
+synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }
   // cleanup sync deferred keys


On 1.10.2018, at 18:39, Igor Fedotov  wrote:

So you have just a single main device per OSD.

Then bluestore-tool won't help; it's unable to expand the BlueFS partition on
the main device, only standalone devices are supported.

Given that you're able to rebuild the code, I can suggest making a patch that
triggers a BlueFS rebalance (see the code snippet below) during repair.
  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
for (auto& p : bluefs_gift_extents) {
  bluefs_extents.insert(p.offset, p.length);
}
bufferlist bl;
encode(bluefs_extents, bl);
dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
 << bluefs_extents << std::dec << dendl;
synct->set(PREFIX_SUPER, "bluefs_extents", bl);
  }

If it waits I can probably make a corresponding PR tomorrow.

Thanks,
Igor
On 10/1/2018 6:16 PM, Sergey Malinin wrote:

I have rebuilt the tool, but none of my OSDs no matter dead or alive have any 
symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db 
compaction. After executing bluefs-bdev-expand OSD fails to start, however 
'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 
_reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 

Re: [ceph-users] CRUSH puzzle: step weighted-take

2018-10-02 Thread Dan van der Ster
On Mon, Oct 1, 2018 at 8:09 PM Gregory Farnum  wrote:
>
> On Fri, Sep 28, 2018 at 12:03 AM Dan van der Ster  wrote:
> >
> > On Thu, Sep 27, 2018 at 9:57 PM Maged Mokhtar  wrote:
> > >
> > >
> > >
> > > On 27/09/18 17:18, Dan van der Ster wrote:
> > > > Dear Ceph friends,
> > > >
> > > > I have a CRUSH data migration puzzle and wondered if someone could
> > > > think of a clever solution.
> > > >
> > > > Consider an osd tree like this:
> > > >
> > > >-2   4428.02979 room 0513-R-0050
> > > >   -72911.81897 rack RA01
> > > >-4917.27899 rack RA05
> > > >-6917.25500 rack RA09
> > > >-9786.23901 rack RA13
> > > >   -14895.43903 rack RA17
> > > >   -65   1161.16003 room 0513-R-0060
> > > >   -71578.76001 ipservice S513-A-IP38
> > > >   -70287.56000 rack BA09
> > > >   -80291.20001 rack BA10
> > > >   -76582.40002 ipservice S513-A-IP63
> > > >   -75291.20001 rack BA11
> > > >   -78291.20001 rack BA12
> > > >
> > > > In the beginning, for reasons that are not important, we created two 
> > > > pools:
> > > >* poolA chooses room=0513-R-0050 then replicates 3x across the racks.
> > > >* poolB chooses room=0513-R-0060, replicates 2x across the
> > > > ipservices, then puts a 3rd replica in room 0513-R-0050.
> > > >
> > > > For clarity, here is the crush rule for poolB:
> > > >  type replicated
> > > >  min_size 1
> > > >  max_size 10
> > > >  step take 0513-R-0060
> > > >  step chooseleaf firstn 2 type ipservice
> > > >  step emit
> > > >  step take 0513-R-0050
> > > >  step chooseleaf firstn -2 type rack
> > > >  step emit
> > > >
> > > > Now to the puzzle.
> > > > For reasons that are not important, we now want to change the rule for
> > > > poolB to put all three 3 replicas in room 0513-R-0060.
> > > > And we need to do this in a way which is totally non-disruptive
> > > > (latency-wise) to the users of either pools. (These are both *very*
> > > > active RBD pools).
> > > >
> > > > I see two obvious ways to proceed:
> > > >(1) simply change the rule for poolB to put a third replica on any
> > > > osd in room 0513-R-0060. I'm afraid though that this would involve way
> > > > too many concurrent backfills, cluster-wide, even with
> > > > osd_max_backfills=1.
> > > >(2) change poolB size to 2, then change the crush rule to that from
> > > > (1), then reset poolB size to 3. This would risk data availability
> > > > during the time that the pool is size=2, and also risks that every osd
> > > > in room 0513-R-0050 would be too busy deleting for some indeterminate
> > > > time period (10s of minutes, I expect).
> > > >
> > > > So I would probably exclude those two approaches.
> > > >
> > > > Conceptually what I'd like to be able to do is a gradual migration,
> > > > which if I may invent some syntax on the fly...
> > > >
> > > > Instead of
> > > > step take 0513-R-0050
> > > > do
> > > > step weighted-take 99 0513-R-0050 1 0513-R-0060
> > > >
> > > > That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> > > > of the time take room 0513-R-0060.
> > > > With a mechanism like that, we could gradually adjust those "step
> > > > weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
> > > >
> > > > I have a feeling that something equivalent to that is already possible
> > > > with weight-sets or some other clever crush trickery.
> > > > Any ideas?
> > > >
> > > > Best Regards,
> > > >
> > > > Dan
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > would it be possible in your case to create a parent datacenter bucket
> > > to hold both rooms and assign their relative weights there, then for the
> > > third replica do a step take to this parent bucket ? its not elegant but
> > > may do the trick.
> >
> > Hey, that might work! both rooms are already in the default root:
> >
> >   -1   5589.18994 root default
> >   -2   4428.02979 room 0513-R-0050
> >  -65   1161.16003 room 0513-R-0060
> >  -71578.76001 ipservice S513-A-IP38
> >  -76582.40002 ipservice S513-A-IP63
> >
> > so I'll play with a test pool and weighting down room 0513-R-0060 to
> > see if this can work.
>
> I don't think this will work — it will probably change the seed that
> is used and mean that the rule tries to move *everything*, not just
> the third PG replicas. But perhaps I'm mistaken about the details of
> this mechanic...
>

Indeed, osdmaptool indicated that this isn't going to work -- the
first two replicas were the same (a, b), but the third replica was
always (a) again... seems like two "overlapping" chooseleafs 
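
In case it's useful to anyone following along, the offline test referred to above can be run roughly like this (the pool id and temporary paths are placeholders):

$ ceph osd getmap -o /tmp/osdmap
$ osdmaptool /tmp/osdmap --export-crush /tmp/crush.bin
$ crushtool -d /tmp/crush.bin -o /tmp/crush.txt    # edit the candidate rule here
$ crushtool -c /tmp/crush.txt -o /tmp/crush.new
$ osdmaptool /tmp/osdmap --import-crush /tmp/crush.new
$ osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 2

The last command dumps the resulting up/acting sets per PG, which makes it easy to see whether only the third replicas would move.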

Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Paul Emmerich
Kernel 4.4 is not suitable for a multi MDS setup. In general, I
wouldn't feel comfortable running 4.4 with kernel cephfs in
production.
I think at least 4.15 (not sure, but definitely > 4.9) is recommended
for multi MDS setups.

If you can't reboot: maybe try ceph-fuse instead, which is usually
very solid and fast enough.
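
A minimal mount would look something like this (the client name and mount point are assumptions, adjust them to your setup):

$ sudo ceph-fuse -n client.cephfs -m <mon-host>:6789 /mnt/cephfs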

Paul

Am Di., 2. Okt. 2018 um 10:45 Uhr schrieb Jaime Ibar :
>
> Hi Paul,
>
> we're using 4.4 kernel. Not sure if more recent kernels are stable
>
> for production services. In any case, as there are some production
>
> services running on those servers, rebooting wouldn't be an option
>
> if we can bring ceph clients back without rebooting.
>
> Thanks
>
> Jaime
>
>
> On 01/10/18 21:10, Paul Emmerich wrote:
> > Which kernel version are you using for the kernel cephfs clients?
> > I've seen this problem with "older" kernels (where old is as recent as 4.9)
> >
> > Paul
> > Am Mo., 1. Okt. 2018 um 18:35 Uhr schrieb Jaime Ibar :
> >> Hi all,
> >>
> >> we're running a ceph 12.2.7 Luminous cluster, two weeks ago we enabled
> >> multi mds and after few hours
> >>
> >> these errors started showing up
> >>
> >> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
> >> old, received at 2018-09-28 09:40:16.155841:
> >> client_request(client.31059144:8544450 getattr Xs #0$
> >> 12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
> >> currently failed to authpin local pins
> >>
> >> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
> >> failing to respond to cache pressure (MDS_CLIENT_RECALL)
> >> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
> >> below; oldest blocked for > 4614.580689 secs
> >> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
> >> old, received at 2018-09-28 10:53:03.203476:
> >> client_request(client.31059144:9080057 lookup #0x100
> >> 000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
> >> currently initiated
> >> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
> >> failing to respond to capability release; 5 clients failing to respond
> >> to cache pressure; 1 MDSs report slow requests,
> >>
> >> Due to this, we decide to go back to single mds(as it worked before),
> >> however, the clients pointing to mds.1 started hanging, however, the
> >> ones pointing to mds.0 worked fine.
> >>
> >> Then, we tried to enable multi mds again and the clients pointing mds.1
> >> went back online, however the ones pointing to mds.0 stopped work.
> >>
> >> Today, we tried to go back to single mds, however this error was
> >> preventing ceph to disable second active mds(mds.1)
> >>
> >> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
> >> X: (30108925), after 68213.084174 seconds
> >>
> >> After wait for 3 hours, we restarted mds.1 daemon (as it was stuck in
> >> stopping state forever due to the above error), we waited for it to
> >> become active again,
> >>
> >> unmount the problematic clients, wait for the cluster to be healthy and
> >> try to go back to single mds again.
> >>
> >> Apparently this worked with some of the clients, we tried to enable
> >> multi mds again to bring faulty clients back again, however no luck this
> >> time
> >>
> >> and some of them are hanging and can't access to ceph fs.
> >>
> >> This is what we have in kern.log
> >>
> >> Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
> >> Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
> >> Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
> >>
> >> Not sure what else can we try to bring hanging clients back without
> >> rebooting as they're in production and rebooting is not an option.
> >>
> >> Does anyone know how can we deal with this, please?
> >>
> >> Thanks
> >>
> >> Jaime
> >>
> >> --
> >>
> >> Jaime Ibar
> >> High Performance & Research Computing, IS Services
> >> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> >> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
> >> Tel: +353-1-896-3725
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>
> --
>
> Jaime Ibar
> High Performance & Research Computing, IS Services
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
> Tel: +353-1-896-3725
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe SSD not assigned "nvme" device class

2018-10-02 Thread Hervé Ballans

Hi,

You can easily configure it manually, e.g.:

$ sudo ceph osd crush rm-device-class osd.xx
$ sudo ceph osd crush set-device-class nvme osd.xx

Indeed, it may be useful when you want to create custom rules on this 
type of device.
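
For instance, once the class is set you can pin a pool to it with something like this (the rule and pool names below are only examples):

$ sudo ceph osd crush rule create-replicated nvme_rule default host nvme
$ sudo ceph osd pool set mypool crush_rule nvme_rule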


Hervé

Le 01/10/2018 à 23:25, Vladimir Brik a écrit :

Hello,

It looks like Ceph (13.2.2) assigns device class "ssd" to our Samsung
PM1725a NVMe SSDs instead of "nvme". Is that a bug or is the "nvme"
class reserved for a different kind of device?


Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi Paul,

we're using the 4.4 kernel. Not sure if more recent kernels are stable
for production services. In any case, as there are some production
services running on those servers, we'd rather not reboot if we can
bring the Ceph clients back without rebooting.

Thanks

Jaime


On 01/10/18 21:10, Paul Emmerich wrote:

Which kernel version are you using for the kernel cephfs clients?
I've seen this problem with "older" kernels (where old is as recent as 4.9)

Paul
Am Mo., 1. Okt. 2018 um 18:35 Uhr schrieb Jaime Ibar :

Hi all,

we're running a Ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
multi MDS, and after a few hours

these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
old, received at 2018-09-28 09:40:16.155841:
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
currently failed to authpin local pins

2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
old, received at 2018-09-28 10:53:03.203476:
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
failing to respond to capability release; 5 clients failing to respond
to cache pressure; 1 MDSs report slow requests,

Due to this, we decided to go back to a single MDS (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi MDS again and the clients pointing to mds.1
went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to a single MDS, however this error was
preventing Ceph from disabling the second active MDS (mds.1)

2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
X: (30108925), after 68213.084174 seconds

After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck in
the stopping state forever due to the above error), waited for it to
become active again,

unmounted the problematic clients, waited for the cluster to be healthy and
tried to go back to a single MDS again.

Apparently this worked for some of the clients. We tried to enable
multi MDS again to bring the faulty clients back, however no luck this
time

and some of them are hanging and can't access CephFS.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi,

there's only one entry in the blacklist, however it is a mon, not a CephFS
client, and no CephFS is mounted on that host.

We're using the kernel client, and the kernel version is 4.4 for both the
Ceph services and the CephFS clients.


This is what we have in /sys/kernel/debug/ceph

cat mdsmap

epoch 59259
root 0
session_timeout 60
session_autoclose 300
    mds0    xxx:6800    (up:active)


cat mdsc

13049   mds0    getattr  #1506e43
13051   (no request)    getattr  #150922b
13053   (no request)    getattr  #150922b
13055   (no request)    getattr  #150922b
13057   (no request)    getattr  #150922b
13058   (no request)    getattr  #150922b
13059   (no request)    getattr  #150922b
13063   mds0    lookup   #150922b/.cache (.cache)

[...]

cat mds_sessions

global_id 29669848
name "cephfs"
mds.0 opening
mds.1 restarting

And is similar for other clients.
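
In case it helps, the MDS-side view of these sessions can be inspected via the admin socket, e.g. (the MDS name below is a placeholder):

$ ceph daemon mds.mds01 session ls
$ ceph daemon mds.mds01 dump_ops_in_flight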

Thanks

Jaime


On 01/10/18 19:13, Burkhard Linke wrote:

Hi,


we also experience hanging clients after MDS restarts; in our case we 
only use a single active MDS server, and the clients are actively 
blacklisted by the MDS server after restart. It usually happens if the 
clients are not responsive during MDS restart (e.g. being very busy).



You can check whether this is the case in your setup by inspecting the 
blacklist ('ceph osd blacklist ls'). It should print the connections 
which are currently blacklisted.



You can also remove entries ('ceph osd blacklist rm ...'), but be 
warned that the mechanism is there for a reason. Removing a 
blacklisted entry might result in file corruption if client and MDS 
server disagree about the current state. Use at own risk.
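
As a concrete sketch (the blacklisted address below is made up; take the real one from the 'ls' output):

$ ceph osd blacklist ls
$ ceph osd blacklist rm 192.168.0.15:0/3271543726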



We were also trying a multi active setup after upgrading to luminous, 
but we were running into the same problem with the same error message. 
It was probably due to old kernel clients, so in the case of kernel-based 
CephFS I would recommend upgrading to the latest available kernel.



As another approach you can check the current state of the cephfs 
client, either by using the daemon socket in the case of ceph-fuse, or the 
debug information in /sys/kernel/debug/ceph/... for the kernel client.
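
For example (the admin socket path and client id are placeholders):

$ ceph daemon /var/run/ceph/ceph-client.cephfs.12345.asok mds_sessions   # ceph-fuse
$ cat /sys/kernel/debug/ceph/*/mds_sessions                              # kernel client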


Regards,

Burkhard


On 01.10.2018 18:34, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster, two weeks ago we 
enabled multi mds and after few hours


these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, 
caller_gid=124{}) currently failed to authpin local pins


2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 
seconds old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to 
respond to cache pressure; 1 MDSs report slow requests,


Due to this, we decide to go back to single mds(as it worked before), 
however, the clients pointing to mds.1 started hanging, however, the 
ones pointing to mds.0 worked fine.


Then, we tried to enable multi mds again and the clients pointing 
mds.1 went back online, however the ones pointing to mds.0 stopped work.


Today, we tried to go back to single mds, however this error was 
preventing ceph to disable second active mds(mds.1)


2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
X: (30108925), after 68213.084174 seconds


After wait for 3 hours, we restarted mds.1 daemon (as it was stuck in 
stopping state forever due to the above error), we waited for it to 
become active again,


unmount the problematic clients, wait for the cluster to be healthy 
and try to go back to single mds again.


Apparently this worked with some of the clients, we tried to enable 
multi mds again to bring faulty clients back again, however no luck 
this time


and some of them are hanging and can't access to ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery 
completed


Not sure what else can we try to bring hanging clients back without 
rebooting as they're in production and rebooting is not an option.


Does anyone know how can we deal with this, please?

Thanks

Jaime



___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
Yes, I did repair all OSDs and it finished with 'repair success'. I backed up 
OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93
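
(For reference, the logs were uploaded roughly like this; the description and path are placeholders:

$ ceph-post-file -d "mimic bluefs enospc, osd.1 repair log" /var/log/ceph/osd.1-repair.log
)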


> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
> 
> You did repair for any of this OSDs, didn't you? For all of them?
> 
> 
> Would you please provide the log for both types (failed on mount and failed 
> with enospc) of failing OSDs. Prior to collecting please remove existing ones 
> prior and set debug bluestore to 20.
> 
> 
> 
> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>> I was able to apply patches to mimic, but nothing changed. One osd that I 
>> had space expanded on fails with bluefs mount IO error, others keep failing 
>> with enospc.
>> 
>> 
>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>> 
>>> So you should call repair which rebalances (i.e. allocates additional 
>>> space) BlueFS space. Hence allowing OSD to start.
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
 Not exactly. The rebalancing from this kv_sync_thread still might be 
 deferred due to the nature of this thread (haven't 100% sure though).
 
 Here is my PR showing the idea (still untested and perhaps unfinished!!!)
 
 https://github.com/ceph/ceph/pull/24353
 
 
 Igor
 
 
 On 10/1/2018 7:07 PM, Sergey Malinin wrote:
> Can you please confirm whether I got this right:
> 
> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
> @@ -9049,22 +9049,17 @@
> throttle_bytes.put(costs);
>   PExtentVector bluefs_gift_extents;
> -  if (bluefs &&
> -  after_flush - bluefs_last_balance >
> -  cct->_conf->bluestore_bluefs_balance_interval) {
> -bluefs_last_balance = after_flush;
> -int r = _balance_bluefs_freespace(_gift_extents);
> -assert(r >= 0);
> -if (r > 0) {
> -  for (auto& p : bluefs_gift_extents) {
> -bluefs_extents.insert(p.offset, p.length);
> -  }
> -  bufferlist bl;
> -  encode(bluefs_extents, bl);
> -  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
> -   << bluefs_extents << std::dec << dendl;
> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
> +  int r = _balance_bluefs_freespace(_gift_extents);
> +  ceph_assert(r >= 0);
> +  if (r > 0) {
> +for (auto& p : bluefs_gift_extents) {
> +  bluefs_extents.insert(p.offset, p.length);
>   }
> +bufferlist bl;
> +encode(bluefs_extents, bl);
> +dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
> + << bluefs_extents << std::dec << dendl;
> +synct->set(PREFIX_SUPER, "bluefs_extents", bl);
> }
>   // cleanup sync deferred keys
> 
>> On 1.10.2018, at 18:39, Igor Fedotov  wrote:
>> 
>> So you have just a single main device per OSD
>> 
>> Then bluestore-tool wouldn't help, it's unable to expand BlueFS 
>> partition at main device, standalone devices are supported only.
>> 
>> Given that you're able to rebuild the code I can suggest to make a patch 
>> that triggers BlueFS rebalance (see code snippet below) on repairing.
>>  PExtentVector bluefs_gift_extents;
>>  int r = _balance_bluefs_freespace(_gift_extents);
>>  ceph_assert(r >= 0);
>>  if (r > 0) {
>>for (auto& p : bluefs_gift_extents) {
>>  bluefs_extents.insert(p.offset, p.length);
>>}
>>bufferlist bl;
>>encode(bluefs_extents, bl);
>>dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>> << bluefs_extents << std::dec << dendl;
>>synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>  }
>> 
>> If it waits I can probably make a corresponding PR tomorrow.
>> 
>> Thanks,
>> Igor
>> On 10/1/2018 6:16 PM, Sergey Malinin wrote:
>>> I have rebuilt the tool, but none of my OSDs no matter dead or alive 
>>> have any symlinks other than 'block' pointing to LVM.
>>> I adjusted main device size but it looks like it needs even more space 
>>> for db compaction. After executing bluefs-bdev-expand OSD fails to 
>>> start, however 'fsck' and 'repair' commands finished successfully.
>>> 
>>> 2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
>>> 2018-10-01 18:02:39.763 7fc9226c6240  1 
>>> bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation 
>>> metadata
>>> 2018-10-01 18:02:40.907 7fc9226c6240  1 
>>> bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 285 GiB in 
>>> 2249899 extents
>>> 2018-10-01 18:02:40.951 7fc9226c6240 -1 
>>> 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Igor Fedotov

You did run repair on these OSDs, didn't you? On all of them?


Would you please provide the logs for both types of failing OSDs (failed on
mount and failed with enospc)? Prior to collecting, please remove the existing
logs and set debug bluestore to 20.




On 10/2/2018 2:16 AM, Sergey Malinin wrote:

I was able to apply patches to mimic, but nothing changed. One osd that I had 
space expanded on fails with bluefs mount IO error, others keep failing with 
enospc.



On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair, which rebalances BlueFS space (i.e. allocates
additional space), hence allowing the OSD to start.

Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:

Not exactly. The rebalancing from this kv_sync_thread still might be deferred
due to the nature of this thread (not 100% sure though).

Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-bluefs_last_balance = after_flush;
-int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-assert(r >= 0);
-if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+bufferlist bl;
+encode(bluefs_extents, bl);
+dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+ << bluefs_extents << std::dec << dendl;
+synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }
   // cleanup sync deferred keys


On 1.10.2018, at 18:39, Igor Fedotov  wrote:

So you have just a single main device per OSD.

Then bluestore-tool won't help; it's unable to expand the BlueFS partition on
the main device, only standalone devices are supported.

Given that you're able to rebuild the code, I can suggest making a patch that
triggers a BlueFS rebalance (see the code snippet below) during repair.
  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
for (auto& p : bluefs_gift_extents) {
  bluefs_extents.insert(p.offset, p.length);
}
bufferlist bl;
encode(bluefs_extents, bl);
dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
 << bluefs_extents << std::dec << dendl;
synct->set(PREFIX_SUPER, "bluefs_extents", bl);
  }

If it waits I can probably make a corresponding PR tomorrow.

Thanks,
Igor
On 10/1/2018 6:16 PM, Sergey Malinin wrote:

I have rebuilt the tool, but none of my OSDs no matter dead or alive have any 
symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db 
compaction. After executing bluefs-bdev-expand OSD fails to start, however 
'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 
_reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all 
background work
2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init