[ceph-users] Slow Requests when deep scrubbing PGs that hold Bucket Index

2018-07-10 Thread Christian Wimmer
Hi,

I'm using ceph primarily for block storage (which works quite well) and as
an object gateway using the S3 API.

Here is some info about my system:
Ceph: 12.2.4, OS: Ubuntu 18.04
OSD: Bluestore
6 servers in total, about 60 OSDs, 2TB SSDs each, no HDDs, CFQ scheduler
20 GBit private network
20 GBit public network
Block storage and object storage run on separate disks

Main use case:
Saving small (30KB - 2MB) objects in rgw buckets.
- dynamic bucket index resharding is disabled for now but I keep the index
objects per shard at about 100k.
- data pool: EC4+2
- index pool: replicated (3)
- atm around 500k objects in each bucket

My problem:
Sometimes, I get "slow request" warnings like so:
"[WRN] Health check update: 7 slow requests are blocked > 32 sec
(REQUEST_SLOW)"

It turned out that these warnings appear whenever specific PGs are being
deep scrubbed.
After further investigation, I figured out that these PGs hold the bucket
index of the RADOS gateway.

I already tried some configuration changes like:
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 0'
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
ceph tell osd.* injectargs '--osd_scrub_sleep 1';
ceph tell osd.* injectargs '--osd_deep_scrub_stride 1048576'
ceph tell osd.* injectargs '--osd_scrub_chunk_max 1'
ceph tell osd.* injectargs '--osd_scrub_chunk_min 1'

This helped a lot to mitigate the effects but the problem is still there.
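
(As a side note, values set via injectargs do not survive an OSD restart; to make
them permanent, the equivalent settings can also go into ceph.conf — a sketch,
assuming the throttling should apply to all OSDs:)

[osd]
osd scrub sleep = 1
osd scrub chunk min = 1
osd scrub chunk max = 1
osd deep scrub stride = 1048576
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 0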

Does anybody else have this issue?

I have a few questions to better understand what's going on:

As far as I know, the bucket index is stored in rocksdb and the (empty)
objects in the index pool are just references to the data in rocksdb. Is
that correct?

How does a deep scrub affect rocksdb?
Does the index pool even need deep scrubbing or could I just disable it?

Also:

Does it make sense to create more index shards to get the objects per shard
down to let's say 50k or 20k?

Right now, I have about 500k objects per bucket. I want to increase that
number to a couple of hundred million objects. Do you see any problems with
that, provided that the bucket index is sharded appropriately?
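
(For reference: with dynamic resharding disabled, a bucket can still be resharded
manually with radosgw-admin — a sketch, with a hypothetical bucket name and
shard count:)

# check the current index layout and object count
radosgw-admin bucket stats --bucket=mybucket
# offline reshard to 256 index shards (quiesce writes to the bucket first)
radosgw-admin bucket reshard --bucket=mybucket --num-shards=256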

Any help is appreciated. Let me know if you need anything like logs,
configs, etc.

Thanks!

Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journel SSD recommendation

2018-07-10 Thread Konstantin Shalygin

I have lots of Samsung 850 EVOs, but they are consumer drives. Do you think
a consumer drive would be good for the journal?


No.

Since the fall of 2017, purchasing an Intel P3700 has not been difficult;
you should buy one if you can.






k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding RBD pool for OpenStack Glance, Nova and Cinder

2018-07-10 Thread Konstantin Shalygin

So if you want, two more questions to you :

- How do you handle your ceph.conf configuration (default data pool by
user) / distribution ? Manually, config management, openstack-ansible...
?
- Did you make comparisons, benchmarks between replicated pools and EC
pools, on the same hardware / drives ? I read that small writes are not
very performant with EC.



The ceph.conf entry with the default data pool is only needed by Cinder at
image creation time; after that, a Luminous+ rbd client will detect the
"data-pool" feature on the image and direct its data I/O to that pool.



# rbd info erasure_rbd_meta/volume-09ed44bf-7d16-453a-b712-a636a0d3d812   <- meta pool!

rbd image 'volume-09ed44bf-7d16-453a-b712-a636a0d3d812':
    size 1500 GB in 384000 objects
    order 22 (4096 kB objects)
    data_pool: erasure_rbd_data    <- our data pool
    block_name_prefix: rbd_data.6.a2720a1ec432bf
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten, data-pool  <- "data-pool" feature

    flags:
    create_timestamp: Sat Jan 27 20:24:04 2018
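
(For completeness, a sketch of how such a pool pair and an image with a separate
data pool could be created on Luminous — the PG counts, size and image name here
are hypothetical:)

ceph osd pool create erasure_rbd_data 128 128 erasure
ceph osd pool set erasure_rbd_data allow_ec_overwrites true   # requires BlueStore OSDs
ceph osd pool create erasure_rbd_meta 64 64
rbd create --size 100G --data-pool erasure_rbd_data erasure_rbd_meta/myimage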




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journel SSD recommendation

2018-07-10 Thread Satish Patel
I am planning to use an Intel 3700 (200GB) for the journal and a 500GB Samsung
850 EVO for the OSD. Do you think this design makes sense?

On Tue, Jul 10, 2018 at 3:04 PM, Simon Ironside  wrote:
>
> On 10/07/18 19:32, Robert Stanford wrote:
>>
>>
>>   Do the recommendations apply to both data and journal SSDs equally?
>>
>
> Search the list for "Many concurrent drive failures - How do I activate
> pgs?" to read about the Intel DC S4600 failure story. The OP had several 2TB
> models of these fail when used as Bluestore data devices. The Samsung SM863a
> is discussed as a good alternative in the same thread.
>
> Simon
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-10 Thread Linh Vu
Hi John,

Thanks for the explanation, that command has a much bigger impact than I
thought! I hope the renaming of the verb "reset" makes it into the next
version, because it is very easy to misunderstand.

"The first question is why we're talking about running it at all.  What
chain of reasoning led you to believe that your inotable needed
erasing?"

I thought the reset inode command was just like the reset session command:
since you can pass an MDS rank to it as a parameter, I assumed it only resets
whatever that MDS was holding.

"The most typical case is where the journal has been recovered/erased,
and take_inos is used to skip forward to avoid re-using any inode
numbers that had been claimed by journal entries that we threw away."

We had the situation where our MDS was crashing at MDCache::add_inode(CInode*), 
as discussed earlier. take_inos should fix this, as you mentioned, but we 
thought that we would need to reset what the MDS was holding, just like the 
session.

So with your clarification, I believe we only need to do these:

journal backup
recover dentries
reset mds journal (it wasn't replaying anyway, kept crashing)
reset session
take_inos
start mds up again

Is that correct?
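
(In command form that sequence would look roughly like the sketch below; the
rank, the backup path and the take_inos value are hypothetical, and the value
must be larger than any inode number already in use:)

cephfs-journal-tool --rank=cephfs:0 journal export /root/journal.rank0.bin
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-table-tool all reset session
cephfs-table-tool all take_inos 100000
systemctl start ceph-mds@mds1    # "mds1" is a placeholder for the MDS name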

Many thanks, I've learned a lot more about this process.

Cheers,
Linh


From: John Spray 
Sent: Tuesday, 10 July 2018 7:24 PM
To: Linh Vu
Cc: Wido den Hollander; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

On Tue, Jul 10, 2018 at 2:49 AM Linh Vu  wrote:
>
> While we're on this topic, could someone please explain to me what 
> `cephfs-table-tool all reset inode` does?

The inode table stores an interval set of free inode numbers.  Active
MDS daemons consume inode numbers as they create files.  Resetting the
inode table means rewriting it to its original state (i.e. everything
free).  Using the "take_inos" command consumes some range of inodes,
to reflect that the inodes up to a certain point aren't really free,
but in use by some files that already exist.
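
(For inspection rather than modification, the current table contents can also be
dumped, e.g.:)

cephfs-table-tool all show inode
cephfs-table-tool all show session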

> Does it only reset what the MDS has in its cache, and after starting up 
> again, the MDS will read in new inode range from the metadata pool?

I'm repeating myself a bit, but for the benefit of anyone reading this
thread in the future: no, it's nothing like that.  It effectively
*erases the inode table* by overwriting it ("resetting") with a blank
one.


As with the journal tool (https://github.com/ceph/ceph/pull/22853),
perhaps the verb "reset" is too prone to misunderstanding.

> If so, does it mean *before* we run `cephfs-table-tool take_inos`, we must 
> run `cephfs-table-tool all reset inode`?

The first question is why we're talking about running it at all.  What
chain of reasoning led you to believe that your inotable needed
erasing?

The most typical case is where the journal has been recovered/erased,
and take_inos is used to skip forward to avoid re-using any inode
numbers that had been claimed by journal entries that we threw away.

John

>
> Cheers,
>
> Linh
>
> 
> From: ceph-users  on behalf of Wido den 
> Hollander 
> Sent: Saturday, 7 July 2018 12:26:15 AM
> To: John Spray
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
>
>
>
> On 07/06/2018 01:47 PM, John Spray wrote:
> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
> >>
> >>
> >>
> >> On 07/05/2018 03:36 PM, John Spray wrote:
> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  
> >>> wrote:
> 
>  Hi list,
> 
>  I have a serious problem now... I think.
> 
>  One of my users just informed me that a file he created (.doc file) has
>  a different content then before. It looks like the file's inode is
>  completely wrong and points to the wrong object. I myself have found
>  another file with the same symptoms. I'm afraid my (production) FS is
>  corrupt now, unless there is a possibility to fix the inodes.
> >>>
> >>> You can probably get back to a state with some valid metadata, but it
> >>> might not necessarily be the metadata the user was expecting (e.g. if
> >>> two files are claiming the same inode number, one of them's is
> >>> probably going to get deleted).
> >>>
>  Timeline of what happend:
> 
>  Last week I upgraded our Ceph Jewel to Luminous.
>  This went without any problem.
> 
>  I already had 5 MDS available and went with the Multi-MDS feature and
>  enabled it. The seemed to work okay, but after a while my MDS went
>  beserk and went flapping (crashed -> replay -> rejoin -> crashed)
> 
>  The only way to fix this and get the FS back online was the disaster
>  recovery procedure:
> 
>  cephfs-journal-tool event recover_dentries summary
>  ceph fs set cephfs cluster_down true
>  cephfs-table-tool all reset session
>  cephfs-table-tool all reset inode
>  cephfs-journal-tool --rank=cephfs:0 

Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-10 Thread Linh Vu
Thanks John :) Has it - the MDS asserting out on a dup inode - been logged as
a bug yet? I could put one in if needed.


Cheers,

Linh



From: John Spray 
Sent: Tuesday, 10 July 2018 7:11 PM
To: Linh Vu
Cc: Wido den Hollander; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

On Tue, Jul 10, 2018 at 12:43 AM Linh Vu  wrote:
>
> We're affected by something like this right now (the dup inode causing MDS to 
> crash via assert(!p) with add_inode(CInode) function).
>
> In terms of behaviours, shouldn't the MDS simply skip to the next available 
> free inode in the event of a dup, than crashing the entire FS because of one 
> file? Probably I'm missing something but that'd be a no brainer picking 
> between the two?

Historically (a few years ago) the MDS asserted out on any invalid
metadata.  Most of these cases have been picked up and converted into
explicit damage handling, but this one appears to have been missed --
so yes, it's a bug that the MDS asserts out.

John

> 
> From: ceph-users  on behalf of Wido den 
> Hollander 
> Sent: Saturday, 7 July 2018 12:26:15 AM
> To: John Spray
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
>
>
>
> On 07/06/2018 01:47 PM, John Spray wrote:
> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
> >>
> >>
> >>
> >> On 07/05/2018 03:36 PM, John Spray wrote:
> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  
> >>> wrote:
> 
>  Hi list,
> 
>  I have a serious problem now... I think.
> 
>  One of my users just informed me that a file he created (.doc file) has
>  a different content then before. It looks like the file's inode is
>  completely wrong and points to the wrong object. I myself have found
>  another file with the same symptoms. I'm afraid my (production) FS is
>  corrupt now, unless there is a possibility to fix the inodes.
> >>>
> >>> You can probably get back to a state with some valid metadata, but it
> >>> might not necessarily be the metadata the user was expecting (e.g. if
> >>> two files are claiming the same inode number, one of them's is
> >>> probably going to get deleted).
> >>>
>  Timeline of what happend:
> 
>  Last week I upgraded our Ceph Jewel to Luminous.
>  This went without any problem.
> 
>  I already had 5 MDS available and went with the Multi-MDS feature and
>  enabled it. The seemed to work okay, but after a while my MDS went
>  beserk and went flapping (crashed -> replay -> rejoin -> crashed)
> 
>  The only way to fix this and get the FS back online was the disaster
>  recovery procedure:
> 
>  cephfs-journal-tool event recover_dentries summary
>  ceph fs set cephfs cluster_down true
>  cephfs-table-tool all reset session
>  cephfs-table-tool all reset inode
>  cephfs-journal-tool --rank=cephfs:0 journal reset
>  ceph mds fail 0
>  ceph fs reset cephfs --yes-i-really-mean-it
> >>>
> >>> My concern with this procedure is that the recover_dentries and
> >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
> >>> would have retained lots of content in their journals.  I wonder if we
> >>> should be adding some more multi-mds aware checks to these tools, to
> >>> warn the user when they're only acting on particular ranks (a
> >>> reasonable person might assume that recover_dentries with no args is
> >>> operating on all ranks, not just 0).  Created
> >>> http://tracker.ceph.com/issues/24780 to track improving the default
> >>> behaviour.
> >>>
>  Restarted the MDS and I was back online. Shortly after I was getting a
>  lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
>  looks like it had trouble creating new inodes. Right before the crash
>  it mostly complained something like:
> 
>  -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
>  handle_client_request client_request(client.324932014:1434 create
>  #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>  caller_gid=0{}) v2
>  -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
>  _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1
>  dirs], 1 open files
>   0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
>  12.2.5/src/mds/MDCache.cc: In function 'void
>  MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
>  05:05:01.615123
>  /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> 
>  I also tried to counter the create inode crash by doing the following:
> 
>  cephfs-journal-tool event recover_dentries
>  cephfs-journal-tool journal reset
>  cephfs-table-tool all reset session
>  cephfs-table-tool all reset inode
>  cephfs-table-tool all take_inos 10
> 

Re: [ceph-users] size of journal partitions pretty small

2018-07-10 Thread Paul Emmerich
1) yes, 5 GB is the default. You can control this with the 'osd journal
size' option during creation. (Or partition the disk manually)

2) no, well, maybe a little bit in weird edge cases with tuned configs but
that's rarely advisable.

But using Bluestore instead of Filestore might help with the performance.
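
(For example, to have ceph-disk create 20 GB journal partitions at OSD creation
time, something like this could go into ceph.conf before provisioning — a
sketch; the value is in MB:)

[osd]
osd journal size = 20480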

Paul


2018-07-10 21:03 GMT+02:00 Robert Stanford :

>
>  I installed my OSDs using ceph-disk.  The journals are SSDs and are 1TB.
> I notice that Ceph has only dedicated 5GB each to the four OSDs that use
> the journal.
>
>  1) Is this normal
>
>  2) Would performance increase if I made the partitions bigger?
>
>  Thank you
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journel SSD recommendation

2018-07-10 Thread Simon Ironside


On 10/07/18 19:32, Robert Stanford wrote:


  Do the recommendations apply to both data and journal SSDs equally?



Search the list for "Many concurrent drive failures - How do I activate 
pgs?" to read about the Intel DC S4600 failure story. The OP had several 
2TB models of these fail when used as Bluestore data devices. The 
Samsung SM863a is discussed as a good alternative in the same thread.


Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] size of journal partitions pretty small

2018-07-10 Thread Robert Stanford
 I installed my OSDs using ceph-disk.  The journals are SSDs and are 1TB.
I notice that Ceph has only dedicated 5GB each to the four OSDs that use
the journal.

 1) Is this normal

 2) Would performance increase if I made the partitions bigger?

 Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journel SSD recommendation

2018-07-10 Thread Simon Ironside



On 10/07/18 18:59, Satish Patel wrote:


Thanks, I would also like to know about Intel SSD 3700 (Intel SSD SC
3700 Series SSDSC2BA400G3P), price also looking promising, Do you have
opinion on it?

I can't quite tell from Google what exactly that is. If it's the Intel
DC S3700, then I believe those are discontinued now, but if you can still
get hold of them, they were used successfully and recommended by lots of
cephers, myself included.


Cheers,
Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding RBD pool for OpenStack Glance, Nova and Cinder

2018-07-10 Thread Paul Emmerich
2018-07-10 6:26 GMT+02:00 Konstantin Shalygin :

>
> rbd default data pool = erasure_rbd_data
>
>
> Keep in mind, your minimal client version is Luminous.
>

specifically, it's 12.2.2 or later for the clients!
12.2.0/1 clients have serious bugs in the rbd ec code that will ruin your
day as soon as you try to use snapshots.
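
(A quick way to see which feature releases the connected clients report, and to
enforce the requirement cluster-wide once no pre-Luminous clients remain — a
sketch:)

ceph features
ceph osd set-require-min-compat-client luminous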




-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for some advise on distributed FS: Is Ceph the right option for me?

2018-07-10 Thread Paul Emmerich
Yes, Ceph is probably a good fit for what you are planning.

The documentation should answer your questions:
http://docs.ceph.com/docs/master/
Look for erasure coding, crush rules, and CephFS-specific pages in
particular.



Paul


2018-07-10 18:40 GMT+02:00 Jones de Andrade :

> Hi all.
>
> I'm looking for some information on several distributed filesystems for
> our application.
>
> It looks like it finally came down to two candidates, Ceph being one of
> them. But there are still a few questions about ir that I would really like
> to clarify, if possible.
>
> Our plan, initially on 6 workstations, is to have it hosting a distributed
> file system that can withstand two simultaneous computers failures without
> data loss (something that can remember a raid 6, but over the network).
> This file system will also need to be also remotely mounted (NFS server
> with fallbacks) by other 5+ computers. Students will be working on all 11+
> computers at the same time (different requisites from different softwares:
> some use many small files, other a few really big, 100s gb, files), and
> absolutely no hardware modifications are allowed. This initial test bed is
> for undergraduate students usage, but if successful will be employed also
> for our small clusters. The connection is a simple GbE.
>
> Our actual concerns are:
> 1) Data Resilience: It seems that double copy of each block is the
> standard setting, is it correct? As such, it will strip-parity data among
> three computers for each block?
>
> 2) Metadata Resilience: We seen that we can now have more than a single
> Metadata Server (which was a show-stopper on previous versions). However,
> do they have to be dedicated boxes, or they can share boxes with the Data
> Servers? Can it be configured in such a way that even if two metadata
> server computers fail the whole system data will still be accessible from
> the remaining computers, without interruptions, or they share different
> data aiming only for performance?
>
> 3) Other softwares compability: We seen that there is NFS incompability,
> is it correct? Also, any posix issues?
>
> 4) No single (or double) point of failure: every single possible stance
> has to be able to endure a *double* failure (yes, things can get time to be
> fixed here). Does Ceph need s single master server for any of its
> activities? Can it endure double failure? How long would it take to any
> sort of "fallback" to be completed, users would need to wait to regain
> access?
>
> I think that covers the initial questions we have. Sorry if this is the
> wrong list, however.
>
> Looking forward for any answer or suggestion,
>
> Regards,
>
> Jones
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journel SSD recommendation

2018-07-10 Thread Robert Stanford
 Do the recommendations apply to both data and journal SSDs equally?

On Tue, Jul 10, 2018 at 12:59 PM, Satish Patel  wrote:

> On Tue, Jul 10, 2018 at 11:51 AM, Simon Ironside
>  wrote:
> > Hi,
> >
> > On 10/07/18 16:25, Satish Patel wrote:
> >>
> >> Folks,
> >>
> >> I am in middle or ordering hardware for my Ceph cluster, so need some
> >> recommendation from communities.
> >>
> >> - What company/Vendor SSD is good ?
> >
> >
> > Samsung SM863a is the current favourite I believe.
>
> Thanks, I would also like to know about Intel SSD 3700 (Intel SSD SC
> 3700 Series SSDSC2BA400G3P), price also looking promising, Do you have
> opinion on it?  also should i get 1 SSD driver for journal or need
> two? I am planning to put 5 OSD per server
>
>
> >
> > The Intel DC S4600 is one to specifically avoid at the moment unless the
> > latest firmware has resolved some of the list member reported issues.
> >
> >> - What size should be good for Journal (for BlueStore)
> >
> >
> > ceph-disk defaults to a RocksDB partition that is 1% of the main device
> > size. That'll get you in the right ball park.
> >
> >> I have lots of Samsung 850 EVO but they are consumer, Do you think
> >> consume drive should be good for journal?
> >
> >
> > No :)
> >
> > Cheers,
> > Simon.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journel SSD recommendation

2018-07-10 Thread Satish Patel
On Tue, Jul 10, 2018 at 11:51 AM, Simon Ironside
 wrote:
> Hi,
>
> On 10/07/18 16:25, Satish Patel wrote:
>>
>> Folks,
>>
>> I am in middle or ordering hardware for my Ceph cluster, so need some
>> recommendation from communities.
>>
>> - What company/Vendor SSD is good ?
>
>
> Samsung SM863a is the current favourite I believe.

Thanks, I would also like to know about the Intel SSD 3700 (Intel SSD SC
3700 Series SSDSC2BA400G3P); the price is also looking promising. Do you have
an opinion on it? Also, should I get one SSD drive for the journal, or do I
need two? I am planning to put 5 OSDs per server.


>
> The Intel DC S4600 is one to specifically avoid at the moment unless the
> latest firmware has resolved some of the list member reported issues.
>
>> - What size should be good for Journal (for BlueStore)
>
>
> ceph-disk defaults to a RocksDB partition that is 1% of the main device
> size. That'll get you in the right ball park.
>
>> I have lots of Samsung 850 EVO but they are consumer, Do you think
>> consume drive should be good for journal?
>
>
> No :)
>
> Cheers,
> Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovering from no quorum (2/3 monitors down) via 1 good monitor

2018-07-10 Thread Syahrul Sazli Shaharir
Hi Paul,

Yes, that's what I did - it caused some errors. In the end I had to delete
the /var/lib/ceph/mon/* directory on the bad node and run the inject with the
--mkfs argument to recreate the database. I am good now - thanks. :)

On Tue, Jul 10, 2018 at 10:46 PM, Paul Emmerich  wrote:
> easy:
>
> 1. make sure that none of the mons are running
> 2. extract the monmap from the good one
> 3. use monmaptool to remove the two other mons from it
> 4. inject the mon map back into the good mon
> 5. start the good mon
> 6. you now have a running cluster with only one mon, add two new ones
>
>
>   Paul
>
>
> 2018-07-10 5:50 GMT+02:00 Syahrul Sazli Shaharir :
>>
>> Hi,
>>
>> I am running proxmox pve-5.1, with ceph luminous 12.2.4 as storage. I
>> have been running on 3 monitors, up until an abrupt power outage,
>> resulting in 2 monitors down and unable to start, while 1 monitor up
>> but with no quorum.
>>
>> I tried extracting monmap from the good monitor and injecting it into
>> the other two, but got different errors for each:-
>>
>> 1. mon.mail1
>>
>> # ceph-mon -i mail1 --inject-monmap /tmp/monmap
>> 2018-07-10 11:29:03.562840 7f7d82845f80 -1 abort: Corruption: Bad
>> table magic number*** Caught signal (Aborted) **
>>  in thread 7f7d82845f80 thread_name:ceph-mon
>>
>>  ceph version 12.2.4 (4832b6f0acade977670a37c20ff5dbe69e727416)
>> luminous (stable)
>>  1: (()+0x9439e4) [0x5652655669e4]
>>  2: (()+0x110c0) [0x7f7d81bfe0c0]
>>  3: (gsignal()+0xcf) [0x7f7d7ee12fff]
>>  4: (abort()+0x16a) [0x7f7d7ee1442a]
>>  5: (RocksDBStore::get(std::__cxx11::basic_string> std::char_traits, std::allocator > const&,
>> std::__cxx11::basic_string,
>> std::allocator > const&, ceph::buffer::list*)+0x2f9)
>> [0x5652650a2eb9]
>>  6: (main()+0x1377) [0x565264ec3c57]
>>  7: (__libc_start_main()+0xf1) [0x7f7d7ee002e1]
>>  8: (_start()+0x2a) [0x565264f5954a]
>> 2018-07-10 11:29:03.563721 7f7d82845f80 -1 *** Caught signal (Aborted) **
>>  in thread 7f7d82845f80 thread_name:ceph-mon
>>
>> 2.  mon,mail2
>>
>> # ceph-mon -i mail2 --inject-monmap /tmp/monmap
>> 2018-07-10 11:18:07.536097 7f161e2e3f80 -1 rocksdb: Corruption: Can't
>> access /065339.sst: IO error:
>> /var/lib/ceph/mon/ceph-mail2/store.db/065339.sst: No such file or
>> directory
>> Can't access /065337.sst: IO error:
>> /var/lib/ceph/mon/ceph-mail2/store.db/065337.sst: No such file or
>> directory
>>
>> 2018-07-10 11:18:07.536106 7f161e2e3f80 -1 error opening mon data
>> directory at '/var/lib/ceph/mon/ceph-mail2': (22) Invalid argument
>>
>> Any other way I can recover other than rebuilding the monitor store
>> from the OSDs?
>>
>> Thanks.
>>
>> --
>> --sazli
>> Syahrul Sazli Shaharir 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90



-- 
--sazli
Syahrul Sazli Shaharir 
Mobile: +6019 385 8301 - YM/Skype: syahrulsazli
System Administrator
TMK Pulasan (002339810-M) http://pulasan.my/
11 Jalan 3/4, 43650 Bandar Baru Bangi, Selangor, Malaysia.
Tel/Fax: +603 8926 0338
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Looking for some advise on distributed FS: Is Ceph the right option for me?

2018-07-10 Thread Jones de Andrade
Hi all.

I'm looking for some information on several distributed filesystems for our
application.

It looks like it finally came down to two candidates, Ceph being one of
them. But there are still a few questions about it that I would really like
to clarify, if possible.

Our plan, initially on 6 workstations, is to have them hosting a distributed
file system that can withstand two simultaneous computer failures without
data loss (something resembling RAID 6, but over the network).
This file system will also need to be remotely mounted (NFS server
with fallbacks) by another 5+ computers. Students will be working on all 11+
computers at the same time (different requirements from different software:
some use many small files, others a few really big files, 100s of GB), and
absolutely no hardware modifications are allowed. This initial test bed is
for undergraduate student usage, but if successful it will also be employed
for our small clusters. The connection is simple GbE.

Our actual concerns are:
1) Data Resilience: It seems that a double copy of each block is the standard
setting, is that correct? As such, will it stripe parity data among three
computers for each block?

2) Metadata Resilience: We have seen that we can now have more than a single
Metadata Server (which was a show-stopper in previous versions). However,
do they have to be dedicated boxes, or can they share boxes with the Data
Servers? Can it be configured in such a way that even if two metadata
server computers fail, the whole system's data will still be accessible from
the remaining computers without interruption, or do they each handle
different data, aiming only for performance?

3) Other software compatibility: We have seen that there is NFS incompatibility,
is that correct? Also, any POSIX issues?

4) No single (or double) point of failure: every single possible instance has
to be able to endure a *double* failure (yes, things can take time to be
fixed here). Does Ceph need a single master server for any of its
activities? Can it endure a double failure? How long would it take for any
sort of "fallback" to be completed, and would users need to wait to regain
access?

I think that covers the initial questions we have. Sorry if this is the
wrong list, however.

Looking forward for any answer or suggestion,

Regards,

Jones
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd lock remove unable to parse address

2018-07-10 Thread Kevin Olbrich
2018-07-10 14:37 GMT+02:00 Jason Dillaman :

> On Tue, Jul 10, 2018 at 2:37 AM Kevin Olbrich  wrote:
>
>> 2018-07-10 0:35 GMT+02:00 Jason Dillaman :
>>
>>> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
>>> present on the client computer you used? I would have expected the OSD to
>>> determine the client address, so it's odd that it was able to get a
>>> link-local address.
>>>
>>
>> Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is
>> attached to brX which has an ULA-prefix for the ceph cluster.
>> Eth0 has no address itself. In this case this must mean, the address has
>> been carried down to the hardware interface.
>>
>> I am wondering why it uses link local when there is an ULA-prefix
>> available.
>>
>> The address is available on brX on this client node.
>>
>
I'll open a tracker ticket to get that issue fixed, but in the meantime,
> you can run "rados -p  rmxattr rbd_header.
> lock.rbd_lock" to remove the lock.
>

Worked perfectly, thank you very much!


>
>> - Kevin
>>
>>
>>> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich  wrote:
>>>
 2018-07-09 21:25 GMT+02:00 Jason Dillaman :

> BTW -- are you running Ceph on a one-node computer? I thought IPv6
> addresses starting w/ fe80 were link-local addresses which would probably
> explain why an interface scope id was appended. The current IPv6 address
> parser stops reading after it encounters a non hex, colon character [1].
>

 No, this is a compute machine attached to the storage vlan where I
 previously had also local disks.


>
>
> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman 
> wrote:
>
>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
>> since it is failing to parse the address as valid. Perhaps it's barfing 
>> on
>> the "%eth0" scope id suffix within the address.
>>
>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich  wrote:
>>
>>> Hi!
>>>
>>> I tried to convert an qcow2 file to rbd and set the wrong pool.
>>> Immediately I stopped the transfer but the image is stuck locked:
>>>
>>> Previusly when that happened, I was able to remove the image after
>>> 30 secs.
>>>
>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
>>> There is 1 exclusive lock on this image.
>>> Locker ID  Address
>>>
>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86%
>>> eth0]:0/1200385089
>>>
>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02
>>> "auto 93921602220416" client.1195723
>>> rbd: releasing lock failed: (22) Invalid argument
>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to
>>> blacklist client: (22) Invalid argument
>>>
>>> The image is not in use anywhere!
>>>
>>> How can I force removal of all locks for this image?
>>>
>>> Kind regards,
>>> Kevin
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> --
>> Jason
>>
>
> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108
>
> --
> Jason
>


>>>
>>> --
>>> Jason
>>>
>>
>>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journel SSD recommendation

2018-07-10 Thread Simon Ironside

Hi,

On 10/07/18 16:25, Satish Patel wrote:

Folks,

I am in the middle of ordering hardware for my Ceph cluster, so I need some
recommendations from the community.

- What company/Vendor SSD is good ?


Samsung SM863a is the current favourite I believe.

The Intel DC S4600 is one to specifically avoid at the moment unless the 
latest firmware has resolved some of the list member reported issues.



- What size should be good for Journal (for BlueStore)


ceph-disk defaults to a RocksDB partition that is 1% of the main device 
size. That'll get you in the right ball park.
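
(If a larger DB partition is wanted when provisioning with ceph-disk, the size
can be set in ceph.conf beforehand — a sketch; the value is in bytes, here
roughly 30 GiB:)

[osd]
bluestore block db size = 32212254720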



I have lots of Samsung 850 EVOs, but they are consumer drives. Do you think
a consumer drive would be good for the journal?


No :)

Cheers,
Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs stalling on Intel SSDs

2018-07-10 Thread Shawn Iverson
Hi everybody,

I have a situation that occurs under moderate I/O load on Ceph Luminous:

2018-07-10 10:27:01.257916 mon.node4 mon.0 172.16.0.4:6789/0 15590 :
cluster [INF] mon.node4 is new leader, mons node4,node5,node6,node7,node8
in quorum (ranks 0,1,2,3,4)
2018-07-10 10:27:01.306329 mon.node4 mon.0 172.16.0.4:6789/0 15595 :
cluster [INF] Health check cleared: MON_DOWN (was: 1/5 mons down, quorum
node4,node6,node7,node8)
2018-07-10 10:27:01.386124 mon.node4 mon.0 172.16.0.4:6789/0 15596 :
cluster [WRN] overall HEALTH_WARN 1 osds down; Reduced data availability: 1
pg peering; Degraded data redundancy: 58774/10188798 objects degraded
(0.577%), 13 pgs degraded; 412 slow requests are blocked > 32 sec
2018-07-10 10:27:02.598175 mon.node4 mon.0 172.16.0.4:6789/0 15597 :
cluster [WRN] Health check update: Degraded data redundancy: 77153/10188798
objects degraded (0.757%), 17 pgs degraded (PG_DEGRADED)
2018-07-10 10:27:02.598225 mon.node4 mon.0 172.16.0.4:6789/0 15598 :
cluster [WRN] Health check update: 381 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-07-10 10:27:02.598264 mon.node4 mon.0 172.16.0.4:6789/0 15599 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 1 pg peering)
2018-07-10 10:27:02.608006 mon.node4 mon.0 172.16.0.4:6789/0 15600 :
cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-07-10 10:27:02.701029 mon.node4 mon.0 172.16.0.4:6789/0 15601 :
cluster [INF] osd.36 172.16.0.5:6800/3087 boot
2018-07-10 10:27:01.184334 osd.36 osd.36 172.16.0.5:6800/3087 23 : cluster
[WRN] Monitor daemon marked osd.36 down, but it is still running
2018-07-10 10:27:04.861372 mon.node4 mon.0 172.16.0.4:6789/0 15604 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 381 slow requests
are blocked > 32 sec)

The OSDs that seem to be affected are Intel SSDs, specific model is
SSDSC2BX480G4L.

I have throttled backups to try to lessen the situation, but it seems to
affect the same OSDs when it happens.  It has the added side effect of
taking down the mon on the same node for a few seconds and triggering a
monitor election.

I am wondering if this may be a firmware issue on this drive and if anyone
has any insight or additional troubleshooting steps I should try to get a
deeper look at this behavior.

I am going to upgrade firmware on these drives and see if it helps.

-- 
Shawn Iverson, CETL
Director of Technology
Rush County Schools
765-932-3901 x1171
ivers...@rushville.k12.in.us
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journel SSD recommendation

2018-07-10 Thread Anton Aleksandrov

I think you will get some useful information from this link:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Even though it is dated 2014, you can still get an approximate direction from it.
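
(The core of that test is a 4k, queue-depth-1 sync write issued directly against
the device — roughly the following; note that it overwrites data on /dev/sdX:)

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test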

Anton

On 10.07.2018 18:25, Satish Patel wrote:

Folks,

I am in the middle of ordering hardware for my Ceph cluster, so I need some
recommendations from the community.

- What company/Vendor SSD is good ?
- What size should be good for Journal (for BlueStore)


I have lots of Samsung 850 EVOs, but they are consumer drives. Do you think
a consumer drive would be good for the journal?

~S
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Journel SSD recommendation

2018-07-10 Thread Satish Patel
Folks,

I am in the middle of ordering hardware for my Ceph cluster, so I need some
recommendations from the community.

- What company/Vendor SSD is good ?
- What size should be good for Journal (for BlueStore)


I have lots of Samsung 850 EVOs, but they are consumer drives. Do you think
a consumer drive would be good for the journal?

~S
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovering from no quorum (2/3 monitors down) via 1 good monitor

2018-07-10 Thread Paul Emmerich
easy:

1. make sure that none of the mons are running
2. extract the monmap from the good one
3. use monmaptool to remove the two other mons from it
4. inject the mon map back into the good mon
5. start the good mon
6. you now have a running cluster with only one mon, add two new ones
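
(Roughly, with hypothetical mon IDs "good", "bad1" and "bad2", and with all mons
stopped first:)

ceph-mon -i good --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --print
monmaptool /tmp/monmap --rm bad1
monmaptool /tmp/monmap --rm bad2
ceph-mon -i good --inject-monmap /tmp/monmap
systemctl start ceph-mon@good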


  Paul


2018-07-10 5:50 GMT+02:00 Syahrul Sazli Shaharir :

> Hi,
>
> I am running proxmox pve-5.1, with ceph luminous 12.2.4 as storage. I
> have been running on 3 monitors, up until an abrupt power outage,
> resulting in 2 monitors down and unable to start, while 1 monitor up
> but with no quorum.
>
> I tried extracting monmap from the good monitor and injecting it into
> the other two, but got different errors for each:-
>
> 1. mon.mail1
>
> # ceph-mon -i mail1 --inject-monmap /tmp/monmap
> 2018-07-10 11:29:03.562840 7f7d82845f80 -1 abort: Corruption: Bad
> table magic number*** Caught signal (Aborted) **
>  in thread 7f7d82845f80 thread_name:ceph-mon
>
>  ceph version 12.2.4 (4832b6f0acade977670a37c20ff5dbe69e727416)
> luminous (stable)
>  1: (()+0x9439e4) [0x5652655669e4]
>  2: (()+0x110c0) [0x7f7d81bfe0c0]
>  3: (gsignal()+0xcf) [0x7f7d7ee12fff]
>  4: (abort()+0x16a) [0x7f7d7ee1442a]
>  5: (RocksDBStore::get(std::__cxx11::basic_string std::char_traits, std::allocator > const&,
> std::__cxx11::basic_string,
> std::allocator > const&, ceph::buffer::list*)+0x2f9)
> [0x5652650a2eb9]
>  6: (main()+0x1377) [0x565264ec3c57]
>  7: (__libc_start_main()+0xf1) [0x7f7d7ee002e1]
>  8: (_start()+0x2a) [0x565264f5954a]
> 2018-07-10 11:29:03.563721 7f7d82845f80 -1 *** Caught signal (Aborted) **
>  in thread 7f7d82845f80 thread_name:ceph-mon
>
> 2.  mon,mail2
>
> # ceph-mon -i mail2 --inject-monmap /tmp/monmap
> 2018-07-10 11:18:07.536097 7f161e2e3f80 -1 rocksdb: Corruption: Can't
> access /065339.sst: IO error:
> /var/lib/ceph/mon/ceph-mail2/store.db/065339.sst: No such file or
> directory
> Can't access /065337.sst: IO error:
> /var/lib/ceph/mon/ceph-mail2/store.db/065337.sst: No such file or
> directory
>
> 2018-07-10 11:18:07.536106 7f161e2e3f80 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-mail2': (22) Invalid argument
>
> Any other way I can recover other than rebuilding the monitor store
> from the OSDs?
>
> Thanks.
>
> --
> --sazli
> Syahrul Sazli Shaharir 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-10 Thread John Spray
On Tue, Jul 10, 2018 at 3:14 PM Dennis Kramer (DBS)  wrote:
>
> Hi John,
>
> On Tue, 2018-07-10 at 10:11 +0100, John Spray wrote:
> > On Tue, Jul 10, 2018 at 12:43 AM Linh Vu  wrote:
> > >
> > >
> > > We're affected by something like this right now (the dup inode
> > > causing MDS to crash via assert(!p) with add_inode(CInode)
> > > function).
> > >
> > > In terms of behaviours, shouldn't the MDS simply skip to the next
> > > available free inode in the event of a dup, than crashing the
> > > entire FS because of one file? Probably I'm missing something but
> > > that'd be a no brainer picking between the two?
> > Historically (a few years ago) the MDS asserted out on any invalid
> > metadata.  Most of these cases have been picked up and converted into
> > explicit damage handling, but this one appears to have been missed --
> > so yes, it's a bug that the MDS asserts out.
>
> I have followed the disaster recovery and now all my files and
> directories in CephFS which complained about duplicate inodes
> disappeared from my FS. I see *some* data in "lost+found", but thats
> only a part of it. Is there any way to retrieve those missing files?

If you had multiple files trying to use the same inode number, then
the contents of the data pool would only have been storing the
contents of one of those files (or, worst case, some interspersed
mixture of both files).  So the chances are that if something wasn't
linked into lost+found, it is gone for good.

Now that your damaged filesystem is up and running again, if you have
the capacity then it's a good precaution to create a fresh filesystem,
copy the files over, and then restore anything missing from backups.
The multi-filesystem functionality is officially an experimental
feature (mainly because it gets little testing), but when you've gone
through a metadata damage episode it's the lesser of two evils.
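
(For anyone following along later, standing up a second filesystem looks roughly
like this — the pool names and PG counts are hypothetical:)

ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create cephfs2_metadata 64
ceph osd pool create cephfs2_data 256
ceph fs new cephfs2 cephfs2_metadata cephfs2_data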

John

>
> > John
> >
> > >
> > > 
> > > From: ceph-users  on behalf of
> > > Wido den Hollander 
> > > Sent: Saturday, 7 July 2018 12:26:15 AM
> > > To: John Spray
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode"
> > > errors
> > >
> > >
> > >
> > > On 07/06/2018 01:47 PM, John Spray wrote:
> > > >
> > > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  > > > > wrote:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 07/05/2018 03:36 PM, John Spray wrote:
> > > > > >
> > > > > > On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  > > > > > lmes.nl> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Hi list,
> > > > > > >
> > > > > > > I have a serious problem now... I think.
> > > > > > >
> > > > > > > One of my users just informed me that a file he created
> > > > > > > (.doc file) has
> > > > > > > a different content then before. It looks like the file's
> > > > > > > inode is
> > > > > > > completely wrong and points to the wrong object. I myself
> > > > > > > have found
> > > > > > > another file with the same symptoms. I'm afraid my
> > > > > > > (production) FS is
> > > > > > > corrupt now, unless there is a possibility to fix the
> > > > > > > inodes.
> > > > > > You can probably get back to a state with some valid
> > > > > > metadata, but it
> > > > > > might not necessarily be the metadata the user was expecting
> > > > > > (e.g. if
> > > > > > two files are claiming the same inode number, one of them's
> > > > > > is
> > > > > > probably going to get deleted).
> > > > > >
> > > > > > >
> > > > > > > Timeline of what happend:
> > > > > > >
> > > > > > > Last week I upgraded our Ceph Jewel to Luminous.
> > > > > > > This went without any problem.
> > > > > > >
> > > > > > > I already had 5 MDS available and went with the Multi-MDS
> > > > > > > feature and
> > > > > > > enabled it. The seemed to work okay, but after a while my
> > > > > > > MDS went
> > > > > > > beserk and went flapping (crashed -> replay -> rejoin ->
> > > > > > > crashed)
> > > > > > >
> > > > > > > The only way to fix this and get the FS back online was the
> > > > > > > disaster
> > > > > > > recovery procedure:
> > > > > > >
> > > > > > > cephfs-journal-tool event recover_dentries summary
> > > > > > > ceph fs set cephfs cluster_down true
> > > > > > > cephfs-table-tool all reset session
> > > > > > > cephfs-table-tool all reset inode
> > > > > > > cephfs-journal-tool --rank=cephfs:0 journal reset
> > > > > > > ceph mds fail 0
> > > > > > > ceph fs reset cephfs --yes-i-really-mean-it
> > > > > > My concern with this procedure is that the recover_dentries
> > > > > > and
> > > > > > journal reset only happened on rank 0, whereas the other 4
> > > > > > MDS ranks
> > > > > > would have retained lots of content in their journals.  I
> > > > > > wonder if we
> > > > > > should be adding some more multi-mds aware checks to these
> > > > > > tools, to
> > > > > > warn the user when they're only acting on particular ranks (a
> > > > > > reasonable person might assume that recover_dentries 

Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-10 Thread Dennis Kramer (DBS)
Hi John,

On Tue, 2018-07-10 at 10:11 +0100, John Spray wrote:
> On Tue, Jul 10, 2018 at 12:43 AM Linh Vu  wrote:
> > 
> > 
> > We're affected by something like this right now (the dup inode
> > causing MDS to crash via assert(!p) with add_inode(CInode)
> > function).
> > 
> > In terms of behaviours, shouldn't the MDS simply skip to the next
> > available free inode in the event of a dup, than crashing the
> > entire FS because of one file? Probably I'm missing something but
> > that'd be a no brainer picking between the two?
> Historically (a few years ago) the MDS asserted out on any invalid
> metadata.  Most of these cases have been picked up and converted into
> explicit damage handling, but this one appears to have been missed --
> so yes, it's a bug that the MDS asserts out.

I have followed the disaster recovery procedure, and now all the files and
directories in CephFS which complained about duplicate inodes have
disappeared from my FS. I see *some* data in "lost+found", but that's
only a part of it. Is there any way to retrieve those missing files?

> John
> 
> > 
> > 
> > From: ceph-users  on behalf of
> > Wido den Hollander 
> > Sent: Saturday, 7 July 2018 12:26:15 AM
> > To: John Spray
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode"
> > errors
> > 
> > 
> > 
> > On 07/06/2018 01:47 PM, John Spray wrote:
> > > 
> > > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  > > > wrote:
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On 07/05/2018 03:36 PM, John Spray wrote:
> > > > > 
> > > > > On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  > > > > lmes.nl> wrote:
> > > > > > 
> > > > > > 
> > > > > > Hi list,
> > > > > > 
> > > > > > I have a serious problem now... I think.
> > > > > > 
> > > > > > One of my users just informed me that a file he created
> > > > > > (.doc file) has
> > > > > > a different content then before. It looks like the file's
> > > > > > inode is
> > > > > > completely wrong and points to the wrong object. I myself
> > > > > > have found
> > > > > > another file with the same symptoms. I'm afraid my
> > > > > > (production) FS is
> > > > > > corrupt now, unless there is a possibility to fix the
> > > > > > inodes.
> > > > > You can probably get back to a state with some valid
> > > > > metadata, but it
> > > > > might not necessarily be the metadata the user was expecting
> > > > > (e.g. if
> > > > > two files are claiming the same inode number, one of them's
> > > > > is
> > > > > probably going to get deleted).
> > > > > 
> > > > > > 
> > > > > > Timeline of what happend:
> > > > > > 
> > > > > > Last week I upgraded our Ceph Jewel to Luminous.
> > > > > > This went without any problem.
> > > > > > 
> > > > > > I already had 5 MDS available and went with the Multi-MDS
> > > > > > feature and
> > > > > > enabled it. The seemed to work okay, but after a while my
> > > > > > MDS went
> > > > > > beserk and went flapping (crashed -> replay -> rejoin ->
> > > > > > crashed)
> > > > > > 
> > > > > > The only way to fix this and get the FS back online was the
> > > > > > disaster
> > > > > > recovery procedure:
> > > > > > 
> > > > > > cephfs-journal-tool event recover_dentries summary
> > > > > > ceph fs set cephfs cluster_down true
> > > > > > cephfs-table-tool all reset session
> > > > > > cephfs-table-tool all reset inode
> > > > > > cephfs-journal-tool --rank=cephfs:0 journal reset
> > > > > > ceph mds fail 0
> > > > > > ceph fs reset cephfs --yes-i-really-mean-it
> > > > > My concern with this procedure is that the recover_dentries
> > > > > and
> > > > > journal reset only happened on rank 0, whereas the other 4
> > > > > MDS ranks
> > > > > would have retained lots of content in their journals.  I
> > > > > wonder if we
> > > > > should be adding some more multi-mds aware checks to these
> > > > > tools, to
> > > > > warn the user when they're only acting on particular ranks (a
> > > > > reasonable person might assume that recover_dentries with no
> > > > > args is
> > > > > operating on all ranks, not just 0).  Created
> > > > > https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?doma
> > > > > in=tracker.ceph.com to track improving the default
> > > > > behaviour.
> > > > > 
> > > > > > 
> > > > > > Restarted the MDS and I was back online. Shortly after I
> > > > > > was getting a
> > > > > > lot of "loaded dup inode". In the meanwhile the MDS kept
> > > > > > crashing. It
> > > > > > looks like it had trouble creating new inodes. Right before
> > > > > > the crash
> > > > > > it mostly complained something like:
> > > > > > 
> > > > > > -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4
> > > > > > mds.0.server
> > > > > > handle_client_request client_request(client.324932014:1434
> > > > > > create
> > > > > > #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458
> > > > > > caller_uid=0,
> > > > > > caller_gid=0{}) v2
> > > > > > -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5
> > > > > 

Re: [ceph-users] rbd lock remove unable to parse address

2018-07-10 Thread Jason Dillaman
On Tue, Jul 10, 2018 at 2:37 AM Kevin Olbrich  wrote:

> 2018-07-10 0:35 GMT+02:00 Jason Dillaman :
>
>> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
>> present on the client computer you used? I would have expected the OSD to
>> determine the client address, so it's odd that it was able to get a
>> link-local address.
>>
>
> Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is
> attached to brX which has an ULA-prefix for the ceph cluster.
> Eth0 has no address itself. In this case this must mean, the address has
> been carried down to the hardware interface.
>
> I am wondering why it uses link local when there is an ULA-prefix
> available.
>
> The address is available on brX on this client node.
>

I'll open a tracker ticket to get that issue fixed, but in the meantime,
you can run "rados -p  rmxattr rbd_header.
lock.rbd_lock" to remove the lock.

>
> - Kevin
>
>
>> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich  wrote:
>>
>>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman :
>>>
 BTW -- are you running Ceph on a one-node computer? I thought IPv6
 addresses starting w/ fe80 were link-local addresses which would probably
 explain why an interface scope id was appended. The current IPv6 address
 parser stops reading after it encounters a non hex, colon character [1].

>>>
>>> No, this is a compute machine attached to the storage vlan where I
>>> previously had also local disks.
>>>
>>>


 On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman 
 wrote:

> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
> since it is failing to parse the address as valid. Perhaps it's barfing on
> the "%eth0" scope id suffix within the address.
>
> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich  wrote:
>
>> Hi!
>>
>> I tried to convert an qcow2 file to rbd and set the wrong pool.
>> Immediately I stopped the transfer but the image is stuck locked:
>>
>> Previusly when that happened, I was able to remove the image after 30
>> secs.
>>
>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
>> There is 1 exclusive lock on this image.
>> Locker ID  Address
>>
>> client.1195723 auto 93921602220416
>> [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>
>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto
>> 93921602220416" client.1195723
>> rbd: releasing lock failed: (22) Invalid argument
>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to
>> blacklist client: (22) Invalid argument
>>
>> The image is not in use anywhere!
>>
>> How can I force removal of all locks for this image?
>>
>> Kind regards,
>> Kevin
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> --
> Jason
>

 [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108

 --
 Jason

>>>
>>>
>>
>> --
>> Jason
>>
>
>

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding RBD pool for OpenStack Glance, Nova and Cinder

2018-07-10 Thread Gilles Mocellin

Le 2018-07-10 06:26, Konstantin Shalygin a écrit :

Does someone have used EC
pools with OpenStack in production ?



By chance, I found that link :


https://www.reddit.com/r/ceph/comments/72yc9m/ceph_openstack_with_ec/

Yes, this is a good post.

My configuration is:

cinder.conf:


[erasure-rbd-hdd]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = erasure-rbd-hdd
rbd_pool = erasure_rbd_meta
rbd_user = cinder_erasure_hdd
rbd_ceph_conf = /etc/ceph/ceph.conf


ceph.conf:


[client.cinder_erasure_hdd]
rbd default data pool = erasure_rbd_data


Keep in mind, your minimal client version is Luminous.

So the trick is: tell everyone your pool is "erasure_rbd_meta", and rbd
clients will find the data pool "erasure_rbd_data" automatically.
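
(One detail not shown above: the cephx user needs OSD caps on both pools. The
exact caps below are an assumption, not copied from a working config:)

ceph auth get-or-create client.cinder_erasure_hdd \
    mon 'profile rbd' \
    osd 'profile rbd pool=erasure_rbd_meta, profile rbd pool=erasure_rbd_data'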

k


Thank you for your feed back Konstantin !

So, if you don't mind, two more questions for you:

- How do you handle your ceph.conf configuration (default data pool by 
user) / distribution ? Manually, config management, openstack-ansible... 
?
- Did you make comparisons/benchmarks between replicated pools and EC
pools on the same hardware / drives? I read that small writes are not
very performant with EC.


Thanks again,
--
Gilles
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.1 release date

2018-07-10 Thread Martin Overgaard Hansen


> Den 9. jul. 2018 kl. 17.12 skrev Wido den Hollander :
> 
> Hi,
> 
> Is there a release date for Mimic 13.2.1 yet?
> 
> There are a few issues which currently make deploying with Mimic 13.2.0
> a bit difficult, for example:
> 
> - https://tracker.ceph.com/issues/24423
> - https://github.com/ceph/ceph/pull/22393
> 
> Especially the first one makes it difficult.
> 
> 13.2.1 would be very welcome with these fixes in there.
> 
> Is there a ETA for this version yet?
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Also looking forward to this release; we had to revert to Luminous to continue
expanding our cluster. An ETA would be great, thanks.

Best regards,
Martin Overgaard Hansen
MultiHouse IT Partner A/S
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Add Partitions to Ceph Cluster

2018-07-10 Thread Dimitri Roschkowski

Hi,

is it possible to use just a partition instead of a whole disk for an OSD?
On a server I already use /dev/sdb for Ceph and want to add /dev/sda4 to the
Ceph cluster, but it didn't work for me.


On the server with the partition I tried:

ceph-disk prepare /dev/sda4

and

ceph-disk activate /dev/sda4

And with df I see that ceph did something on the partition:

/dev/sda4   1.8T  2.8G  1.8T   1% /var/lib/ceph/osd/ceph-4


My problem is that after I activated the partition, I didn't see a change in
the ceph status output:


  data:
pools:   6 pools, 168 pgs
objects: 25.84 k objects, 100 GiB
usage:   305 GiB used, 6.8 TiB / 7.1 TiB avail
pgs: 168 active+clean

Can someone help me?
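
To narrow it down, the first thing worth checking would be whether osd.4 (the
id taken from the mount point above) is actually running and registered in the
CRUSH map, for example:

ceph osd tree                  # does osd.4 show up, and is it up/in?
systemctl status ceph-osd@4    # is the OSD daemon running on the node?
ceph osd df tree               # does the cluster account for its capacity?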
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.6 release date?

2018-07-10 Thread Sean Purdy
Hi Sean,

On Tue, 10 Jul 2018, Sean Redmond said:
> Can you please link me to the tracker 12.2.6 fixes? I have disabled
> resharding in 12.2.5 due to it running endlessly.

http://tracker.ceph.com/issues/22721


Sean
 
> Thanks
> 
> On Tue, Jul 10, 2018 at 9:07 AM, Sean Purdy 
> wrote:
> 
> > While we're at it, is there a release date for 12.2.6?  It fixes a
> > reshard/versioning bug for us.
> >
> > Sean
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-10 Thread John Spray
On Tue, Jul 10, 2018 at 2:49 AM Linh Vu  wrote:
>
> While we're on this topic, could someone please explain to me what 
> `cephfs-table-tool all reset inode` does?

The inode table stores an interval set of free inode numbers.  Active
MDS daemons consume inode numbers as they create files.  Resetting the
inode table means rewriting it to its original state (i.e. everything
free).  Using the "take_inos" command consumes some range of inodes,
to reflect that the inodes up to a certain point aren't really free,
but in use by some files that already exist.
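
As a rough sketch (the number is only a placeholder for the highest inode
number you believe is already in use, and "all" acts on every rank's table):

cephfs-table-tool all show inode         # inspect the current inode table
cephfs-table-tool all take_inos 100000   # mark numbers up to 100000 as consumed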

> Does it only reset what the MDS has in its cache, and after starting up 
> again, the MDS will read in new inode range from the metadata pool?

I'm repeating myself a bit, but for the benefit of anyone reading this
thread in the future: no, it's nothing like that.  It effectively
*erases the inode table* by overwriting it ("resetting") with a blank
one.

As with the journal tool (https://github.com/ceph/ceph/pull/22853),
perhaps the verb "reset" is too prone to misunderstanding.

> If so, does it mean *before* we run `cephfs-table-tool take_inos`, we must 
> run `cephfs-table-tool all reset inode`?

The first question is why we're talking about running it at all.  What
chain of reasoning led you to believe that your inotable needed
erasing?

The most typical case is where the journal has been recovered/erased,
and take_inos is used to skip forward to avoid re-using any inode
numbers that had been claimed by journal entries that we threw away.

John

>
> Cheers,
>
> Linh
>
> 
> From: ceph-users  on behalf of Wido den 
> Hollander 
> Sent: Saturday, 7 July 2018 12:26:15 AM
> To: John Spray
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
>
>
>
> On 07/06/2018 01:47 PM, John Spray wrote:
> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
> >>
> >>
> >>
> >> On 07/05/2018 03:36 PM, John Spray wrote:
> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  
> >>> wrote:
> 
>  Hi list,
> 
>  I have a serious problem now... I think.
> 
>  One of my users just informed me that a file he created (.doc file) has
>  a different content then before. It looks like the file's inode is
>  completely wrong and points to the wrong object. I myself have found
>  another file with the same symptoms. I'm afraid my (production) FS is
>  corrupt now, unless there is a possibility to fix the inodes.
> >>>
> >>> You can probably get back to a state with some valid metadata, but it
> >>> might not necessarily be the metadata the user was expecting (e.g. if
> >>> two files are claiming the same inode number, one of them's is
> >>> probably going to get deleted).
> >>>
>  Timeline of what happend:
> 
>  Last week I upgraded our Ceph Jewel to Luminous.
>  This went without any problem.
> 
>  I already had 5 MDS available and went with the Multi-MDS feature and
>  enabled it. The seemed to work okay, but after a while my MDS went
>  beserk and went flapping (crashed -> replay -> rejoin -> crashed)
> 
>  The only way to fix this and get the FS back online was the disaster
>  recovery procedure:
> 
>  cephfs-journal-tool event recover_dentries summary
>  ceph fs set cephfs cluster_down true
>  cephfs-table-tool all reset session
>  cephfs-table-tool all reset inode
>  cephfs-journal-tool --rank=cephfs:0 journal reset
>  ceph mds fail 0
>  ceph fs reset cephfs --yes-i-really-mean-it
> >>>
> >>> My concern with this procedure is that the recover_dentries and
> >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
> >>> would have retained lots of content in their journals.  I wonder if we
> >>> should be adding some more multi-mds aware checks to these tools, to
> >>> warn the user when they're only acting on particular ranks (a
> >>> reasonable person might assume that recover_dentries with no args is
> >>> operating on all ranks, not just 0).  Created
> >>> https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com
> >>>  to track improving the default
> >>> behaviour.
> >>>
>  Restarted the MDS and I was back online. Shortly after I was getting a
>  lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
>  looks like it had trouble creating new inodes. Right before the crash
>  it mostly complained something like:
> 
>  -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
>  handle_client_request client_request(client.324932014:1434 create
>  #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>  caller_gid=0{}) v2
>  -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
>  _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1
>  dirs], 1 open files
>   0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 

Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-10 Thread John Spray
On Tue, Jul 10, 2018 at 12:43 AM Linh Vu  wrote:
>
> We're affected by something like this right now (the dup inode causing MDS to 
> crash via assert(!p) with add_inode(CInode) function).
>
> In terms of behaviours, shouldn't the MDS simply skip to the next available 
> free inode in the event of a dup, than crashing the entire FS because of one 
> file? Probably I'm missing something but that'd be a no brainer picking 
> between the two?

Historically (a few years ago) the MDS asserted out on any invalid
metadata.  Most of these cases have been picked up and converted into
explicit damage handling, but this one appears to have been missed --
so yes, it's a bug that the MDS asserts out.

John

> 
> From: ceph-users  on behalf of Wido den 
> Hollander 
> Sent: Saturday, 7 July 2018 12:26:15 AM
> To: John Spray
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors
>
>
>
> On 07/06/2018 01:47 PM, John Spray wrote:
> > On Fri, Jul 6, 2018 at 12:19 PM Wido den Hollander  wrote:
> >>
> >>
> >>
> >> On 07/05/2018 03:36 PM, John Spray wrote:
> >>> On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  
> >>> wrote:
> 
>  Hi list,
> 
>  I have a serious problem now... I think.
> 
>  One of my users just informed me that a file he created (.doc file) has
>  a different content then before. It looks like the file's inode is
>  completely wrong and points to the wrong object. I myself have found
>  another file with the same symptoms. I'm afraid my (production) FS is
>  corrupt now, unless there is a possibility to fix the inodes.
> >>>
> >>> You can probably get back to a state with some valid metadata, but it
> >>> might not necessarily be the metadata the user was expecting (e.g. if
> >>> two files are claiming the same inode number, one of them's is
> >>> probably going to get deleted).
> >>>
>  Timeline of what happend:
> 
>  Last week I upgraded our Ceph Jewel to Luminous.
>  This went without any problem.
> 
>  I already had 5 MDS available and went with the Multi-MDS feature and
>  enabled it. The seemed to work okay, but after a while my MDS went
>  beserk and went flapping (crashed -> replay -> rejoin -> crashed)
> 
>  The only way to fix this and get the FS back online was the disaster
>  recovery procedure:
> 
>  cephfs-journal-tool event recover_dentries summary
>  ceph fs set cephfs cluster_down true
>  cephfs-table-tool all reset session
>  cephfs-table-tool all reset inode
>  cephfs-journal-tool --rank=cephfs:0 journal reset
>  ceph mds fail 0
>  ceph fs reset cephfs --yes-i-really-mean-it
> >>>
> >>> My concern with this procedure is that the recover_dentries and
> >>> journal reset only happened on rank 0, whereas the other 4 MDS ranks
> >>> would have retained lots of content in their journals.  I wonder if we
> >>> should be adding some more multi-mds aware checks to these tools, to
> >>> warn the user when they're only acting on particular ranks (a
> >>> reasonable person might assume that recover_dentries with no args is
> >>> operating on all ranks, not just 0).  Created
> >>> https://protect-au.mimecast.com/s/PZyQCJypvAfnP9VmfVwGUS?domain=tracker.ceph.com
> >>>  to track improving the default
> >>> behaviour.
> >>>
>  Restarted the MDS and I was back online. Shortly after I was getting a
>  lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
>  looks like it had trouble creating new inodes. Right before the crash
>  it mostly complained something like:
> 
>  -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
>  handle_client_request client_request(client.324932014:1434 create
>  #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
>  caller_gid=0{}) v2
>  -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
>  _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1
>  dirs], 1 open files
>   0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
>  12.2.5/src/mds/MDCache.cc: In function 'void
>  MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
>  05:05:01.615123
>  /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> 
>  I also tried to counter the create inode crash by doing the following:
> 
>  cephfs-journal-tool event recover_dentries
>  cephfs-journal-tool journal reset
>  cephfs-table-tool all reset session
>  cephfs-table-tool all reset inode
>  cephfs-table-tool all take_inos 10
> >>>
> >>> This procedure is recovering some metadata from the journal into the
> >>> main tree, then resetting everything, but duplicate inodes are
> >>> happening when the main tree has multiple dentries containing inodes
> >>> using the same inode number.
> >>>
> >>> What you need is something that scans 

Re: [ceph-users] Luminous 12.2.6 release date?

2018-07-10 Thread Sean Redmond
Hi Sean (Good name btw),

Can you please link me to the tracker for the 12.2.6 fixes? I have disabled
resharding in 12.2.5 because it was running endlessly.

Thanks

On Tue, Jul 10, 2018 at 9:07 AM, Sean Purdy 
wrote:

> While we're at it, is there a release date for 12.2.6?  It fixes a
> reshard/versioning bug for us.
>
> Sean
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.1 release date

2018-07-10 Thread Steffen Winther Sørensen


> On 9 Jul 2018, at 17.11, Wido den Hollander  wrote:
> 
> Hi,
> 
> Is there a release date for Mimic 13.2.1 yet?
> 
> There are a few issues which currently make deploying with Mimic 13.2.0
> a bit difficult, for example:
> 
> - https://tracker.ceph.com/issues/24423
> - https://github.com/ceph/ceph/pull/22393
> 
> Especially the first one makes it difficult.
+1

> 13.2.1 would be very welcome with these fixes in there.
+1

> 
> Is there a ETA for this version yet?
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous 12.2.6 release date?

2018-07-10 Thread Sean Purdy
While we're at it, is there a release date for 12.2.6?  It fixes a 
reshard/versioning bug for us.

Sean
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph poor performance when compress files

2018-07-10 Thread Mostafa Hamdy Abo El-Maty El-Giar
Hi Ceph Experts,

When I compress my files stored in the Ceph cluster using the gzip command,
the command takes a long time.

The poor performance occurs only when zipping files stored on Ceph.

Please, any ideas about this problem?
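
To see whether the time is spent reading from Ceph or in gzip itself, a
comparison like this would help (the path is only a placeholder for a file on
the Ceph-backed mount):

time cat /mnt/cephfs/bigfile > /dev/null       # read path only
time gzip -c /mnt/cephfs/bigfile > /dev/null   # read plus compression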

Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd lock remove unable to parse address

2018-07-10 Thread Kevin Olbrich
2018-07-10 0:35 GMT+02:00 Jason Dillaman :

> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
> present on the client computer you used? I would have expected the OSD to
> determine the client address, so it's odd that it was able to get a
> link-local address.
>

Yes, it is. eth0 is part of bond0, which is a VLAN trunk. bond0.X is
attached to brX, which has a ULA prefix for the Ceph cluster.
eth0 has no address itself, so in this case the address must have been
carried down to the hardware interface.

I am wondering why it uses the link-local address when a ULA prefix is
available.

The address is available on brX on this client node.
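
For reference, a quick way to see which IPv6 addresses live on which interface
(brX stands in for the actual bridge name):

ip -6 addr show dev brX    # where the ULA address is configured
ip -6 addr show dev eth0   # check whether a link-local fe80:: address is here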

- Kevin


> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich  wrote:
>
>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman :
>>
>>> BTW -- are you running Ceph on a one-node computer? I thought IPv6
>>> addresses starting w/ fe80 were link-local addresses which would probably
>>> explain why an interface scope id was appended. The current IPv6 address
>>> parser stops reading after it encounters a non hex, colon character [1].
>>>
>>
>> No, this is a compute machine attached to the storage vlan where I
>> previously had also local disks.
>>
>>
>>>
>>>
>>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman 
>>> wrote:
>>>
 Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
 since it is failing to parse the address as valid. Perhaps it's barfing on
 the "%eth0" scope id suffix within the address.

 On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich  wrote:

> Hi!
>
> I tried to convert an qcow2 file to rbd and set the wrong pool.
> Immediately I stopped the transfer but the image is stuck locked:
>
> Previusly when that happened, I was able to remove the image after 30
> secs.
>
> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
> There is 1 exclusive lock on this image.
> Locker ID  Address
>
> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86%
> eth0]:0/1200385089
>
> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto
> 93921602220416" client.1195723
> rbd: releasing lock failed: (22) Invalid argument
> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to blacklist
> client: (22) Invalid argument
>
> The image is not in use anywhere!
>
> How can I force removal of all locks for this image?
>
> Kind regards,
> Kevin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


 --
 Jason

>>>
>>> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108
>>>
>>> --
>>> Jason
>>>
>>
>>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com