[ceph-users] bluestore db & wal use spdk device how to ?

2019-08-05 Thread Chris Hsiang
Hi All,

I have multiple NVMe SSDs and I wish to use two of them via SPDK as
bluestore db & wal devices.

my assumption would be to put the following in ceph.conf, under the OSD
section:

bluestore_block_db_path = "spdk::01:00.0"
bluestore_block_db_size = 40 * 1024 * 1024 * 1024 (40G)

Then how do I prepare the OSD?
ceph-volume lvm prepare --bluestore --data vg_ceph/lv_sas-sda
--block.db spdk::01:00.0  ?

What if I have a second NVMe SSD (:1a:00.0) that I want to use for a different OSD?
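For illustration, the layout I have in mind would look roughly like this in
ceph.conf (the OSD IDs are made up, the PCI selectors are the ones above, and
the size is just 40 * 1024^3 written out in bytes):

[osd.0]
bluestore_block_db_path = "spdk::01:00.0"
bluestore_block_db_size = 42949672960

[osd.1]
bluestore_block_db_path = "spdk::1a:00.0"
bluestore_block_db_size = 42949672960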
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] about ceph v12.2.12 rpm have no found

2019-08-05 Thread 潘东元
Hello, everyone,
  I cannot find the ceph v12.2.12 RPMs at
https://download.ceph.com/rpm-luminous/el7/aarch64/
  Why?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] even number of monitors

2019-08-05 Thread DHilsbos
All;

While most discussion of MONs and their failure modes revolves around the
failure of the MONs themselves, the recommendation for odd numbers of MONs has
nothing to do with the loss of one or more MONs.  It's actually a response to
the split-brain problem.

Imagine you have the following (where "s" is a switch):
s1--mon1
|
s2--mon2
|
s3--mon3
|
s4--mon4

Now imagine what happens when the link between s2 and s3 breaks (imagine
accidentally pulling the wrong cable, a port failure on a switch, a WAN fiber cut, etc.):
s1--mon1
|
s2--mon2
X
s3--mon3
|
s4--mon4

All 4 MONs are alive, but which has the official state of the cluster?  Which 
MON(s) can make decisions on behalf of the cluster?

Now imagine a similar situation for 3 MONs:
s1--mon1
X
s2--mon2
|
s3--mon3

or:
s1--mon1
|
s2--mon2
X
s3--mon3

The cluster can continue.

Similarly imagine 5 MONs:
s1--mon1
X
s2--mon2
|
s3--mon3
|
s4--mon4
|
s5--mon5
or:
s1--mon1
|
s2--mon2
X
s3--mon3
|
s4--mon4
|
s5--mon5
or:
s1--mon1
|
s2--mon2
|
s3--mon3
X
s4--mon4
|
s5--mon5
or:
s1--mon1
|
s2--mon2
|
s3--mon3
|
s4--mon4
X
s5--mon5

In each case, one side retains a quorum; enough MONs to definitively make 
decisions on behalf of the cluster. 
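
To put the same majority rule in code (plain Python for illustration, not
anything Ceph ships; the partition sizes are the ones from the diagrams above):

def has_quorum(partition_size, total_mons):
    # A partition may act for the cluster only if it holds a strict majority.
    return partition_size > total_mons // 2

print(has_quorum(2, 4))                    # False - 4 MONs split 2/2: neither side can act
print(has_quorum(2, 3), has_quorum(1, 3))  # True False - 3 MONs split 2/1
print(has_quorum(3, 5), has_quorum(2, 5))  # True False - 5 MONs split 3/2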

Note that it is just as important, in solving the split-brain problem, to 
recognize when you are NOT in the quorum (and thus should not make decisions), 
as to recognize when you are.

Within a single datacenter it is relatively easy to ensure that this kind of
failure shouldn't occur (ring-style switch stacking, for instance), but imagine
that your cluster covers a good portion of the Eastern U.S., with MON(s) in
Philadelphia, New York, and Baltimore.  Can you achieve redundant interconnects
without going through the same fiber bundle?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Alfredo Daniel Rezinovsky
Sent: Monday, August 05, 2019 3:28 AM
To: ceph-users
Subject: [ceph-users] even number of monitors

With 3 monitors, paxos needs at least 2 to reach consensus about the 
cluster status

With 4 monitors, more than half is 3. The only problem I can see here is 
that I will have only 1 spare monitor.

Is there any other problem with an even number of monitors?

--
Alfrenovsky

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-05 Thread Patrick Donnelly
On Mon, Aug 5, 2019 at 12:21 AM Janek Bevendorff
 wrote:
>
> Hi,
>
> > You can also try increasing the aggressiveness of the MDS recall but
> > I'm surprised it's still a problem with the settings I gave you:
> >
> > ceph config set mds mds_recall_max_caps 15000
> > ceph config set mds mds_recall_max_decay_rate 0.75
>
> I finally had the chance to try the more aggressive recall settings, but
> they did not change anything. As soon as the client starts copying files
again, the numbers go up and I get a health message that the client is
> failing to respond to cache pressure.
>
> After this week of idle time, the dns/inos numbers (what does dns stand
> for anyway?) settled at around 8000k. That's basically that "idle"
> number that it goes back to when the client stops copying files. Though,
> for some weird reason, this number gets (quite) a bit higher every time
> (last time it was around 960k). Of course, I wouldn't expect it to go
> back all the way to zero, because that would mean dropping the entire
> cache for no reason, but it's still quite high and the same after
> restarting the MDS and all clients, which doesn't make a lot of sense to
> me. After resuming the copy job, the number went up to 20M in just the
> time it takes to write this email. There must be a bug somewhere.
>
> > Can you share two captures of `ceph daemon mds.X perf dump` about 1
> > second apart.
>
> I attached the requested perf dumps.

Thanks, that helps. Looks like the problem is that the MDS is not
automatically trimming its cache fast enough. Please try bumping
mds_cache_trim_threshold:

bin/ceph config set mds mds_cache_trim_threshold 512K

Increase it further if it's not aggressive enough. Please let us know
if that helps.

It shouldn't be necessary to do this so I'll make a tracker ticket
once we confirm that's the issue.
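
For reference, a rough way to check what the running MDS is using and to raise
the value again if needed (mds.a is just a placeholder for your MDS name):

ceph daemon mds.a config get mds_cache_trim_threshold   # current value via the admin socket
ceph config set mds mds_cache_trim_threshold 1M         # raise further if trimming still lags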

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS snapshot for backup & disaster recovery

2019-08-05 Thread Eitan Mosenkis
I'm using it for a NAS to make backups from the other machines on my home
network. Since everything is in one location, I want to keep a copy offsite
for disaster recovery. Running Ceph across the internet is not recommended
and is also very expensive compared to just storing snapshots.

On Sun, Aug 4, 2019 at 3:08 PM Vitaliy Filippov  wrote:

> Afaik no. What's the idea of running a single-host cephfs cluster?
>
> On August 4, 2019, 13:27:00 GMT+03:00, Eitan Mosenkis 
> wrote:
>>
>> I'm running a single-host Ceph cluster for CephFS and I'd like to keep
>> backups in Amazon S3 for disaster recovery. Is there a simple way to
>> extract a CephFS snapshot as a single file and/or to create a file that
>> represents the incremental difference between two snapshots?
>>
>
> --
> With best regards,
> Vitaliy Filippov
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tcmu-runner: "Acquired exclusive lock" every 21s

2019-08-05 Thread Mike Christie
On 08/05/2019 05:58 AM, Matthias Leopold wrote:
> Hi,
> 
> I'm still testing my 2 node (dedicated) iSCSI gateway with ceph 12.2.12
> before I dare to put it into production. I installed latest tcmu-runner
> release (1.5.1) and (like before) I'm seeing that both nodes switch
> exclusive locks for the disk images every 21 seconds. tcmu-runner logs
> look like this:
> 
> 2019-08-05 12:53:04.184 13742 [WARN] tcmu_notify_lock_lost:222
> rbd/iscsi.test03: Async lock drop. Old state 1
> 2019-08-05 12:53:04.714 13742 [WARN] tcmu_rbd_lock:762 rbd/iscsi.test03:
> Acquired exclusive lock.
> 2019-08-05 12:53:25.186 13742 [WARN] tcmu_notify_lock_lost:222
> rbd/iscsi.test03: Async lock drop. Old state 1
> 2019-08-05 12:53:25.773 13742 [WARN] tcmu_rbd_lock:762 rbd/iscsi.test03:
> Acquired exclusive lock.
> 
> Old state can sometimes be 0 or 2.
> Is this expected behaviour?

What initiator OS are you using?

It happens if you have 2 or more initiators accessing the same image but
it should not happen normally. It occurs when one initiator cannot
access the image's primary gateway and it is using the secondary, and
the other initiators are accessing the image via the primary gateway.
The lock then bounces between the gws as the initiators access the image
via both.

You could also hit it if somehow you mapped the image to multiple LUNs
and so the initiator thinks LUN0 and LUN10 are different images with
different primary gws.

> 
> What may be of interest in my case is that I use a dedicated
> cluster_client_name in iscsi-gateway.cfg (not client.admin) and that I'm
> running 2 separate targets in different IP networks.
>

Your network setup might not be correct on one initiator node and so it
has dropped down to the secondary gw.

On the initiator OS check that all initiators are accessing the primary
(active optimized) path.
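
As a sketch of what that check can look like (assuming Linux initiators with
dm-multipath; other initiator OSes have their own equivalents):

# Path groups per mapped LUN: the active path group should be the one going
# through the image's owning (primary) gateway; paths through the other
# gateway should show as enabled (non-optimized), not active.
multipath -ll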
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is the admin burden avoidable? "1 pg inconsistent" every other day?

2019-08-05 Thread Harry G. Coin
Thanks for that.  Seeing 'health err' so frequently has led to worrisome 
'alarm fatigue'. Yup that's half of what I want to do.


The number of copies of a pg in the crush map drives how time-critical
and how human-intervention-critical the pg repair process is.  Having
several copies makes automatic pg repair reasonable -- but only if there's a
way to log the count of repairs filed against pgs on the same osd since
it was last marked 'in'.  I'd love to have looking at that list be a
periodic staffer chore for proactive osd replacement.


Appreciate the lead for the setting.
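
For anyone following along, a minimal sketch of the knob Brett points at below
(assuming a release where 'ceph config set' exists; on older releases the same
option can go in ceph.conf instead):

# let scrub repair the inconsistencies it finds on its own
ceph config set osd osd_scrub_auto_repair true
# the companion option osd_scrub_auto_repair_num_errors caps how many errors
# auto-repair will touch before leaving the PG to a human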


On 8/4/19 10:47 AM, Brett Chancellor wrote:
If all you want to do is repair the pg when it finds an inconsistent 
pg, you could set osd_scrub_auto_repair to true.


On Sun, Aug 4, 2019, 9:16 AM Harry G. Coin > wrote:


Question: If you have enough osds, it seems an almost daily thing: when
you get to work in the morning there's a "ceph health error" "1 pg
inconsistent" arising from a 'scrub error'.  Or 2, etc. Then like
most such mornings you look to see there's two or more valid
instances
of the pg and one with an issue.  So, like putting on socks that just
takes time every day: there's the 'ceph pg repair xx' (making note of
the likely soon to fail osd) then hey presto on with the day.

Am I missing some way to automate this so that I'm notified only if an
attempt at pg repair has failed, with just a log entry for successful
repairs?  I don't need calls about dashboard "HEALTH ERR" warnings this
often.

Ideas welcome!

Thanks



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Built-in HA?

2019-08-05 Thread Robert LeBlanc
Another option: if both RDMA ports are on the same card, then you can do
RDMA with a bond. This does not work if you have two separate cards.

As far as your questions go, my guess would be that you would want to have
the different NICs in different broadcast domains, or set up Source Based
Routing and bind the source port on the connection (not the easiest, but
allows you to have multiple NICs in the same broadcast domain). I don't
have experience with Ceph in this type of configuration.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Aug 2, 2019 at 9:41 AM Volodymyr Litovka  wrote:

> Dear colleagues,
>
> at the moment, we use Ceph in a routed environment (OSPF, ECMP) and
> everything is ok, reliability is high and there is nothing to complain
> about. But for hardware reasons (to be more precise - RDMA offload), we are
> faced with the need to operate Ceph directly on physical interfaces.
>
> According to documentation, "We generally recommend that dual-NIC systems
> either be configured with two IPs on the same network, or bonded."
>
> Q1: Did anybody test, and can explain, how Ceph will behave in the first
> scenario (two IPs on the same network)? I think this configuration requires
> just one statement in 'public network' (where both interfaces reside)? How
> will it distribute traffic between links, how will it detect link failures,
> and how will it switch over?
>
> Q2: Did anybody test a slightly different scenario - both NICs have addresses
> in different networks and the Ceph configuration contains two 'public
> networks'? The questions are the same - how does Ceph distribute traffic
> between links and how does it recover from link failures?
>
> Thank you.
>
> --
> Volodymyr Litovka
>   "Vision without Execution is Hallucination." -- Thomas Edison
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore write iops calculation

2019-08-05 Thread vitalif

Hi Team,
@vita...@yourcmc.ru, thank you for the information. Could you please
clarify the below queries as well:

1. The average object size we use will be 256KB to 512KB; will there be a
deferred write queue?


With the default settings, no (bluestore_prefer_deferred_size_hdd = 32KB,
so only writes smaller than that take the deferred path).


Are you sure that 256-512KB operations aren't counted as multiple 
operations in your disk stats?
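
A quick way to check both on one OSD host (osd.0 is a placeholder):

# the deferred-write cutoff the running OSD is actually using
ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd

# watch whether one 256-512KB client write shows up as one or several
# physical writes on the data device
iostat -x 1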


2. Please share the link to the existing rocksdb ticket about the 2 writes +
syncs.


My PR is here https://github.com/ceph/ceph/pull/26909, you can find the 
issue tracker links inside it.



3. Is there any configuration by which we can reduce/optimize the IOPS?


As already said part of your I/O may be caused by the metadata (rocksdb) 
reads if it doesn't fit into RAM. You can try to add more RAM in that 
case... :)


You can also try to add SSDs for metadata (block.db/block.wal).

Is there something else?... I don't think so.

--
Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore caching oddities, again

2019-08-05 Thread Mark Nelson


On 8/4/19 7:36 PM, Christian Balzer wrote:

Hello,

On Sun, 4 Aug 2019 06:34:46 -0500 Mark Nelson wrote:


On 8/4/19 6:09 AM, Paul Emmerich wrote:


On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer  wrote:
  

2. Bluestore caching still broken
When writing data with the fios below, it isn't cached on the OSDs.
Worse, existing cached data that gets overwritten is removed from the
cache, which, while of course correct, can't be free in terms of allocation
overhead.
Why not do what any sensible person would expect from experience with
any other cache there is: cache writes in case the data gets read again
soon, and in case of overwrites use the existing allocations.

This is by design.
The BlueStore only populates its cache on reads, not on writes. The idea is
that a reasonable application does not read data it just wrote (and if it does
it's already cached at a higher layer like the page cache or a cache on the
hypervisor).


Note that this behavior can be changed by setting
bluestore_default_buffered_write = true.


Thanks to Mark for his detailed reply.
Given these points, I assume that with HDD-backed (but SSD WAL/DB) OSDs it's
not actually a performance killer?



Not typically from the overhead perspective (ie CPU usage shouldn't be 
an issue unless you have a lot of HDDs and wimpy CPUs or possibly if you 
are also doing EC/compression/encryption with lots of small IO).  The 
next question though is if you are better off caching bluestore onodes 
vs rocksdb block cache vs object data.  When you have DB/WAL on the same 
device as bluestore block, you typically want to prioritize rocksdb 
indexes/filters, bluestore onode, rocksdb block cache, and bluestore 
data in that order (the ratios here though are very workload 
dependent).  If you have HDD + SSD DB/WAL, you probably still want to 
cache the indexes/filters with high priority (these are relatively small 
and will reduce read amplification in the DB significantly!).  Now 
caching bluestore onodes and rocksdb block cache may be less important 
since the SSDs may be able to handle the metadata reads fast enough to 
have little impact on the HDD side of things.  Not all SSDs are made 
equal and people often like to put multiple DB/WALs on a single SSD, so 
all of this can be pretty hardware dependent.  You'll also eat more CPU 
going this path due to encode/decode between bluestore and rocksdb and 
all of the work involved in finding the right key/value pair in rocksdb 
itself. So there are definitely going to be hardware-dependent 
trade-offs (ie even if it's faster on HDD/SSD setups to focus on 
bluestore buffer cache, you may eat more CPU per IO doing it).  Probably 
the take-away is that if you have really beefy CPUs and really fast
SSDs in an HDD+SSD setup, it may be worth trying a higher buffer cache
ratio and seeing what happens.



Note that with the prioritycachemanager and osd memory autotuning, if 
you enable bluestore_default_buffered_write and neither the rocksdb 
block cache nor the bluestore onode cache need more memory, the rest 
automatically gets assigned to bluestore buffer cache for objects.
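
A sketch of what trying that looks like (the first option is the one discussed
above; the memory target value is only an example and depends on your
hardware):

# cache writes in the BlueStore buffer cache as well as reads
ceph config set osd bluestore_default_buffered_write true

# give the autotuner headroom to hand spare memory to the buffer cache
ceph config set osd osd_memory_target 8589934592    # 8 GiB, example only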



Mark


I'll test that of course, but a gut feeling or ballpark figure would be
appreciated by probably more people than me.

As Paul's argument, I'm not buying it because:
- It's a complete paradigm change when comparing it to filestore. Somebody
   migrating from FS to BS is likely to experience yet another performance
   decrease they didn't expect.
- Arguing for larger caches on the client only increases the cost of Ceph
   further. In that vein, BS currently can't utilize as much memory as FS
   did for caching in a safe manner.
- Use cases like a DB with enough caching to deal with the normal working
   set, but doing some hourly crunching on data that exceeds it, come to mind.
   One application here also processes written data once an hour, more than
   would fit in the VM pagecache, but currently comes from the FS pagecache.
- The overwrites of already cached data _clearly_ indicate a hotness and
   thus should be preferably cached. That bit in particular is upsetting,
   initial write caching or not.
  
Regards,


Christian

FWIW, there's also a CPU usage and lock contention penalty for default
buffered write when using extremely fast flash storage.  A lot of my
recent work on improving cache performance and intelligence in bluestore
is to reduce contention in the onode/buffer cache and also significantly
reduce the impact of default buffered write = true.  The
PriorityCacheManager was a big one to do a better job of autotuning.
Another big one that recently merged was refactoring bluestore's caches
to trim on write (better memory behavior, shorter more frequent trims,
trims distributed across threads) and not share a single lock between
the onode and buffer cache:


https://github.com/ceph/ceph/pull/28597


Ones still coming down the pipe are to avoid double caching onodes in
the bluestore onode cache and rocksdb block cache, and age-binning the
LRU cache.

Re: [ceph-users] Problems understanding 'ceph-features' output

2019-08-05 Thread Massimo Sgaravatto
On Mon, Aug 5, 2019 at 11:43 AM Ilya Dryomov  wrote:

> On Tue, Jul 30, 2019 at 10:33 AM Massimo Sgaravatto
>  wrote:
> >
> > The documentation that I have seen says that the minimum requirements
> for clients to use upmap are:
> >
> > - CentOs 7.5 or kernel 4.5
> > - Luminous version
>
> Do you have a link for that?
>
> This is wrong: CentOS 7.5 (i.e. RHEL 7.5 kernel) is right, but for
> upstream kernels it is 4.13 (unless someone did a large backport that
> I'm not aware of).
>


Yes sorry: 4.13 !


>
> >
> > But in general ceph admins may not have access to all clients to check
> > these versions.
> >
> > In general: is there a table somewhere reporting the minimum "feature"
> version supported by upmap ?
> >
> > E.g. right now I am interested about 0x1ffddff8eea4fffb. Is this also
> good enough for upmap ?
>
> Yeah, this is annoying.  The missing feature bit has been merged into
> 5.3, so starting with 5.3 the kernel client will finally report itself
> as luminous.
>
> In the meantime, use this:
>
> $ cat /tmp/detect_upmap.py
> if int(input()) & (1 << 21):
>     print("Upmap is supported")
> else:
>     print("Upmap is NOT supported")
>
> $ echo 0x1ffddff8eea4fffb | python /tmp/detect_upmap.py
> Upmap is supported
>


Great !!

Thanks a lot !

Cheers, Massimo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tcmu-runner: "Acquired exclusive lock" every 21s

2019-08-05 Thread Matthias Leopold

Hi,

I'm still testing my 2 node (dedicated) iSCSI gateway with ceph 12.2.12 
before I dare to put it into production. I installed latest tcmu-runner 
release (1.5.1) and (like before) I'm seeing that both nodes switch 
exclusive locks for the disk images every 21 seconds. tcmu-runner logs 
look like this:


2019-08-05 12:53:04.184 13742 [WARN] tcmu_notify_lock_lost:222 
rbd/iscsi.test03: Async lock drop. Old state 1
2019-08-05 12:53:04.714 13742 [WARN] tcmu_rbd_lock:762 rbd/iscsi.test03: 
Acquired exclusive lock.
2019-08-05 12:53:25.186 13742 [WARN] tcmu_notify_lock_lost:222 
rbd/iscsi.test03: Async lock drop. Old state 1
2019-08-05 12:53:25.773 13742 [WARN] tcmu_rbd_lock:762 rbd/iscsi.test03: 
Acquired exclusive lock.


Old state can sometimes be 0 or 2.
Is this expected behaviour?

What may be of interest in my case is that I use a dedicated 
cluster_client_name in iscsi-gateway.cfg (not client.admin) and that I'm 
running 2 separate targets in different IP networks.


thx for advice
matthias

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] even number of monitors

2019-08-05 Thread Lars Marowsky-Bree
On 2019-08-05T07:27:39, Alfredo Daniel Rezinovsky  wrote:

There's no massive problem with even MON counts.

As you note, n+2 doesn't really provide added fault tolerance compared
to n+1, so there's no win either. That's fairly obvious.

Somewhat less obvious - since the failure of any additional MON will now
lose quorum, and you now have, say, 3 instead of just 2, there's a
slightly higher chance that that case will trigger.

If the reason you're doing this is that you, say, want to standardize on
having one MON in each of your racks, and you happen to have 4 racks,
this is likely worth the trade-off.

And you can always manually lower the MON count to recover service even
then - from the durability perspective, you have one more copy of the
MON database after all.

Probability is fun ;-)


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] even number of monitors

2019-08-05 Thread EDH - Manuel Rios Fernandez
With 4 monitors, if you lose 2 you lose quorum, because a quorum needs
to be a majority (more than half).

Monitors recommended:

1 - 3 - 5 - 7

Regards
Manuel


-Original Message-
From: ceph-users  On Behalf Of Alfredo
Daniel Rezinovsky
Sent: Monday, August 5, 2019 12:28
To: ceph-users 
Subject: [ceph-users] even number of monitors

With 3 monitors, paxos needs at least 2 to reach consensus about the cluster
status

With 4 monitors, more than half is 3. The only problem I can see here is
that I will have only 1 spare monitor.

Is there any other problem with an even number of monitors?

--
Alfrenovsky

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] even number of monitors

2019-08-05 Thread Alfredo Daniel Rezinovsky
With 3 monitors, paxos needs at least 2 to reach consensus about the 
cluster status


With 4 monitors, more than half is 3. The only problem I can see here is 
that I will have only 1 spare monitor.


Is there any other problem with an even number of monitors?

--
Alfrenovsky

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems understanding 'ceph-features' output

2019-08-05 Thread Ilya Dryomov
On Tue, Jul 30, 2019 at 10:33 AM Massimo Sgaravatto
 wrote:
>
> The documentation that I have seen says that the minimum requirements for 
> clients to use upmap are:
>
> - CentOs 7.5 or kernel 4.5
> - Luminous version

Do you have a link for that?

This is wrong: CentOS 7.5 (i.e. RHEL 7.5 kernel) is right, but for
upstream kernels it is 4.13 (unless someone did a large backport that
I'm not aware of).

>
> But in general ceph admins may not have access to all clients to check
> these versions.
>
> In general: is there a table somewhere reporting the minimum "feature" 
> version supported by upmap ?
>
> E.g. right now I am interested about 0x1ffddff8eea4fffb. Is this also good 
> enough for upmap ?

Yeah, this is annoying.  The missing feature bit has been merged into
5.3, so starting with 5.3 the kernel client will finally report itself
as luminous.

In the meantime, use this:

$ cat /tmp/detect_upmap.py
if int(input()) & (1 << 21):
    print("Upmap is supported")
else:
    print("Upmap is NOT supported")

$ echo 0x1ffddff8eea4fffb | python /tmp/detect_upmap.py
Upmap is supported

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS snapshot for backup & disaster recovery

2019-08-05 Thread Lars Marowsky-Bree
On 2019-08-04T13:27:00, Eitan Mosenkis  wrote:

> I'm running a single-host Ceph cluster for CephFS and I'd like to keep
> backups in Amazon S3 for disaster recovery. Is there a simple way to
> extract a CephFS snapshot as a single file and/or to create a file that
> represents the incremental difference between two snapshots?

You could either use rclone to sync your CephFS to S3.

rsync can build the latter - see the --write-batch and --only-write-batch
options.

(Unfortunately, there's not yet CephFS support for building the diff
between snapshots quickly; this will do a full fs scan, even though it
obviously takes advantage of mtime/ctime.)

You could also use something like duplicity to run a regular incremental
backup.
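
As a rough illustration of the first two suggestions (the rclone remote,
bucket, paths and snapshot names are all placeholders):

# push the contents of one CephFS snapshot to S3
rclone sync /mnt/cephfs/.snap/2019-08-05 s3remote:backup-bucket/2019-08-05

# build a shippable delta between two snapshots with rsync batch mode;
# the resulting batch file is what you then upload to S3
rsync -a --only-write-batch=/tmp/delta-2019-08-05 \
    /mnt/cephfs/.snap/2019-08-05/ /mnt/cephfs/.snap/2019-08-04/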



-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Listing buckets in a pool

2019-08-05 Thread Euan Tilley
I have inherited a Ceph cluster and, being new to Ceph, I am trying to understand
what's being stored in the cluster. I can see we have the pools below:

# ceph df
GLOBAL:
    SIZE  AVAIL   RAW USED  %RAW USED
    170T  24553G  146T      85.90
POOLS:
    NAME                             ID  USED    %USED  MAX AVAIL  OBJECTS
    rook-ceph-pool                   1   234G    16.76  1165G      60233
    .rgw.root                        8   9075    0      1165G      38
    objectstore2.rgw.control         11  0       0      1165G      8
    objectstore2.rgw.meta            12  12433   0      1165G      45
    objectstore2.rgw.log             13  186     0      1165G      227
    objectstore2.rgw.buckets.index   14  0       0      1165G      41
    objectstore2.rgw.buckets.data    15  11480G  86.77  1748G      3166026
    default.rgw.meta                 16  2794    0      3497G      12
    default.rgw.log                  17  0       0      3497G      128
    default.rgw.control              18  0       0      3497G      8
    objectstore2.rgw.buckets.non-ec  19  0       0      3497G      0
    objectstore.rgw.control          25  0       0      1165G      8
    objectstore.rgw.meta             26  40894   0      1165G      165
    objectstore.rgw.log              27  0       0      1165G      208
    objectstore.rgw.buckets.index    28  0       0      1165G      55
    objectstore.rgw.buckets.data     29  51563G  96.72  1748G      14210801
    objectstore.rgw.buckets.non-ec   30  0       0      3497G      394

but when it came to listing the buckets in the pools I got nothing back.
# radosgw-admin bucket list
[]

Eventually I discovered we have a realm & zonegroup setup
# radosgw-admin realm list
{
"default_info": "eb9f511a-3d96-4c0f-b0b1-212e5d185846",
"realms": [
"objectstore"
]
}

# radosgw-admin zonegroup list
{
"default_info": "",
"zonegroups": [
"objectstore",
"default"
]
}

# radosgw-admin zone list
{
"default_info": "844a9975-a467-4ff1-bda2-6715d559ef53",
"zones": [
"objectstore",
"default"
]
}

so I was able to list the buckets in the objectstore pool:
[root@rook-ceph-tools-deployment-6848dddfbf-k7z5b /]# radosgw-admin bucket list 
--rgw-zonegroup=objectstore
[
"load-test-media-resources-common",
"load-test-media-resources-latimer",
"backup-ops-01.recordsure.com-snipeit-mysql",
"elastic-backup-2018-12-14",
"elastic-backup-2018-10-15",
"elastic-backup-2019-01-02",
...
]

However, I can't list the buckets from the objectstore2 pool. I have tried:

# radosgw-admin bucket list --rgw-zonegroup=default
2019-08-05 08:32:00.686739 7f334a731c80  1 Cannot find zone 
id=844a9975-a467-4ff1-bda2-6715d559ef53 (name=objectstore), switching to local 
zonegroup configuration
2019-08-05 08:32:00.702751 7f334a731c80 -1 Cannot find zone 
id=844a9975-a467-4ff1-bda2-6715d559ef53 (name=objectstore)
couldn't init storage provider

So my question is how can I go about listing the buckets in the 
objectstore2.rgw.buckets.data pool?

Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-05 Thread Janek Bevendorff

Hi,


You can also try increasing the aggressiveness of the MDS recall but
I'm surprised it's still a problem with the settings I gave you:

ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75


I finally had the chance to try the more aggressive recall settings, but 
they did not change anything. As soon as the client starts copying files 
again, the numbers go up and I get a health message that the client is
failing to respond to cache pressure.


After this week of idle time, the dns/inos numbers (what does dns stand 
for anyway?) settled at around 8000k. That's basically that "idle" 
number that it goes back to when the client stops copying files. Though, 
for some weird reason, this number gets (quite) a bit higher every time 
(last time it was around 960k). Of course, I wouldn't expect it to go 
back all the way to zero, because that would mean dropping the entire 
cache for no reason, but it's still quite high and the same after 
restarting the MDS and all clients, which doesn't make a lot of sense to 
me. After resuming the copy job, the number went up to 20M in just the 
time it takes to write this email. There must be a bug somewhere.



Can you share two captures of `ceph daemon mds.X perf dump` about 1
second apart.


I attached the requested perf dumps.


Thanks!



perf_dump_1.json
Description: application/json


perf_dump_2.json
Description: application/json
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com