Re: [ceph-users] Ceph in OSPF environment

2019-01-18 Thread Robin H. Johnson
On Fri, Jan 18, 2019 at 12:21:07PM +, Max Krasilnikov wrote:
> Dear colleagues,
> 
> we build L3 topology for use with CEPH, which is based on OSPF routing 
> between Loopbacks, in order to get reliable and ECMPed topology, like this:
...
> CEPH configured in the way
You have a minor misconfiguration, but I've had trouble with the address
picking logic before, on an L3 routed ECMP BGP topology on IPv6 (using
the Cumulus magic link-local IPv6 BGP).

> 
> [global]
> public_network = 10.10.200.0/24
Keep this, but see below.

> [osd.0]
> public bind addr = 10.10.200.5
public_bind_addr is only used by mons.

> cluster bind addr = 10.10.200.5
There is no such option as 'cluster_bind_addr'; it's just 'cluster_addr'

Set the following in the OSD block:
| public_network = # keep empty; empty != unset
| cluster_network = # keep empty; empty != unset
| cluster_addr = 10.10.200.5
| public_addr = 10.10.200.5

Alternatively, see the code src/common/pick_address.cc to see about
using cluster_network_interface and public_network_interface.
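A quick way to confirm what an OSD actually picked up after a restart with
that config (a sketch, assuming osd.0 and a local admin socket):

# dump the addresses the running daemon resolved
ceph daemon osd.0 config show | egrep 'public_addr|cluster_addr'
# and check what it is actually listening on
ss -tlnp | grep ceph-osd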

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs crashing in EC pool (whack-a-mole)

2019-01-18 Thread Peter Woodman
At the risk of hijacking this thread, like I said, I've run into this
problem again, and have captured a log with debug_osd=20, viewable at
https://www.dropbox.com/s/8zoos5hhvakcpc4/ceph-osd.3.log?dl=0 - any
pointers?
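For reference, one way to capture a log like that - a sketch, assuming the
affected OSD is osd.3:

# bump logging at runtime on the affected OSD
ceph tell osd.3 injectargs '--debug_osd 20/20'
# ...reproduce the crash, then turn it back down
ceph tell osd.3 injectargs '--debug_osd 1/5'

If the OSD dies right at startup, putting "debug osd = 20/20" under [osd.3]
in ceph.conf before starting it achieves the same thing.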

On Tue, Jan 8, 2019 at 11:31 AM Peter Woodman  wrote:
>
> For the record, in the linked issue, it was thought that this might be
> due to write caching. This seems not to be the case, as it happened
> again to me with write caching disabled.
>
> On Tue, Jan 8, 2019 at 11:15 AM Sage Weil  wrote:
> >
> > I've seen this on luminous, but not on mimic.  Can you generate a log with
> > debug osd = 20 leading up to the crash?
> >
> > Thanks!
> > sage
> >
> >
> > On Tue, 8 Jan 2019, Paul Emmerich wrote:
> >
> > > I've seen this before a few times but unfortunately there doesn't seem
> > > to be a good solution at the moment :(
> > >
> > > See also: http://tracker.ceph.com/issues/23145
> > >
> > > Paul
> > >
> > > --
> > > Paul Emmerich
> > >
> > > Looking for help with your Ceph cluster? Contact us at https://croit.io
> > >
> > > croit GmbH
> > > Freseniusstr. 31h
> > > 81247 München
> > > www.croit.io
> > > Tel: +49 89 1896585 90
> > >
> > > On Tue, Jan 8, 2019 at 9:37 AM David Young  
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > One of my OSD hosts recently ran into RAM contention (was swapping 
> > > > heavily), and after rebooting, I'm seeing this error on random OSDs in 
> > > > the cluster:
> > > >
> > > > ---
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  ceph version 13.2.4 
> > > > (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  1: /usr/bin/ceph-osd() 
> > > > [0xcac700]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  2: (()+0x11390) 
> > > > [0x7f8fa5d0e390]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  3: (gsignal()+0x38) 
> > > > [0x7f8fa5241428]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  4: (abort()+0x16a) 
> > > > [0x7f8fa524302a]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  5: 
> > > > (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > > > const*)+0x250) [0x7f8fa767c510]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  6: (()+0x2e5587) 
> > > > [0x7f8fa767c587]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  7: 
> > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> > > > ObjectStore::Transaction*)+0x923) [0xbab5e3]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  8: 
> > > > (BlueStore::queue_transactions(boost::intrusive_ptr&,
> > > >  std::vector > > > std::allocator >&, 
> > > > boost::intrusive_ptr, ThreadPool::TPHandle*)+0x5c3) 
> > > > [0xbade03]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  9: 
> > > > (ObjectStore::queue_transaction(boost::intrusive_ptr&,
> > > >  ObjectStore::Transaction&&, boost::intrusive_ptr, 
> > > > ThreadPool::TPHandle*)+0x82) [0x79c812]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  10: 
> > > > (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, 
> > > > ThreadPool::TPHandle*)+0x58) [0x730ff8]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  11: 
> > > > (OSD::dequeue_peering_evt(OSDShard*, PG*, 
> > > > std::shared_ptr, ThreadPool::TPHandle&)+0xfe) [0x759aae]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  12: (PGPeeringItem::run(OSD*, 
> > > > OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x50) 
> > > > [0x9c5720]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  13: 
> > > > (OSD::ShardedOpWQ::_process(unsigned int, 
> > > > ceph::heartbeat_handle_d*)+0x590) [0x769760]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  14: 
> > > > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) 
> > > > [0x7f8fa76824f6]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  15: 
> > > > (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f8fa76836b0]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  16: (()+0x76ba) 
> > > > [0x7f8fa5d046ba]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  17: (clone()+0x6d) 
> > > > [0x7f8fa531341d]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  NOTE: a copy of the 
> > > > executable, or `objdump -rdS ` is needed to interpret this.
> > > > Jan 08 03:34:36 prod1 systemd[1]: ceph-osd@43.service: Main process 
> > > > exited, code=killed, status=6/ABRT
> > > > ---
> > > >
> > > > I've restarted all the OSDs and the mons, but still encountering the 
> > > > above.
> > > >
> > > > Any ideas / suggestions?
> > > >
> > > > Thanks!
> > > > D
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread jesper
Hi Everyone.

Thanks for the testing, everyone - I think my system works as intended.

When reading from another client - hitting the cache of the OSD hosts -
I also get down to 7-8ms.

As mentioned, this is probably as expected.

I need to figure out how to increase parallelism somewhat - or convince users
not to create those ridiculous amounts of small files.

-- 
Jesper


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Today's DocuBetter meeting topic is... SEO

2019-01-18 Thread Brian Topping
Hi Noah!

With an eye toward improving documentation and community, two things come to 
mind:

1. I didn’t know about this meeting or I would have done my very best to enlist 
my roommate, who probably could have answered these questions very quickly. I 
do know there’s something to do with the metadata tags in the HTML that 
manage most of this. Web spiders see these tags and know what to do.

2. I realized I really didn’t know there were any Ceph meetings like this and 
thought I would raise awareness of 
https://github.com/kubernetes/community/blob/master/events/community-meeting.md, 
where the Kubernetes team has created an iCal subscription from which one can 
automatically get alerts and updates for upcoming events. Best of all, they work 
accurately across time zones, so no need to have people doing math (“daylight 
savings time” is a pet peeve, please don’t get me started! :))

Hope this provides some value! 

Brian

> On Jan 18, 2019, at 11:37 AM, Noah Watkins  wrote:
> 
> 1 PM PST / 9 PM GMT
> https://bluejeans.com/908675367
> 
> On Fri, Jan 18, 2019 at 10:31 AM Noah Watkins  wrote:
>> 
>> We'll be discussing SEO for the Ceph documentation site today at the
>> DocuBetter meeting. Currently when Googling or DuckDuckGoing for
>> Ceph-related things you may see results from master, mimic, or what's
>> a dumpling? The goal is to figure out what sort of approach we can take
>> to make these results more relevant. If you happen to know a bit about
>> the topic of SEO please join and contribute to the conversation.
>> 
>> Best,
>> Noah
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Today's DocuBetter meeting topic is... SEO

2019-01-18 Thread Noah Watkins
1 PM PST / 9 PM GMT
https://bluejeans.com/908675367

On Fri, Jan 18, 2019 at 10:31 AM Noah Watkins  wrote:
>
> We'll be discussing SEO for the Ceph documentation site today at the
> DocuBetter meeting. Currently when Googling or DuckDuckGoing for
> Ceph-related things you may see results from master, mimic, or what's
> a dumpling? The goal is to figure out what sort of approach we can take
> to make these results more relevant. If you happen to know a bit about
> the topic of SEO please join and contribute to the conversation.
>
> Best,
> Noah
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Today's DocuBetter meeting topic is... SEO

2019-01-18 Thread Noah Watkins
We'll be discussing SEO for the Ceph documentation site today at the
DocuBetter meeting. Currently when Googling or DuckDuckGoing for
Ceph-related things you may see results from master, mimic, or what's
a dumpling? The goal is to figure out what sort of approach we can take
to make these results more relevant. If you happen to know a bit about
the topic of SEO please join and contribute to the conversation.

Best,
Noah
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Boot volume on OSD device

2019-01-18 Thread Hector Martin
On 19/01/2019 02.24, Brian Topping wrote:
> 
> 
>> On Jan 18, 2019, at 4:29 AM, Hector Martin  wrote:
>>
>> On 12/01/2019 15:07, Brian Topping wrote:
>>> I’m a little nervous that BlueStore assumes it owns the partition table and 
>>> will not be happy that a couple of primary partitions have been used. Will 
>>> this be a problem?
>>
>> You should look into using ceph-volume in LVM mode. This will allow you to 
>> create an OSD out of any arbitrary LVM logical volume, and it doesn't care 
>> about other volumes on the same PV/VG. I'm running BlueStore OSDs sharing 
>> PVs with some non-Ceph stuff without any issues. It's the easiest way for 
>> OSDs to coexist with other stuff right now.
> 
> Very interesting, thanks!
> 
> On the subject, I just rediscovered the technique of putting boot and root 
> volumes on mdadm-backed stores. The last time I felt the need for this, it 
> was a lot of careful planning and commands. 
> 
> Now, at least with RHEL/CentOS, it’s now available in Anaconda. As it’s set 
> up before mkfs, there’s no manual hackery to reduce the size of a volume to 
> make room for the metadata. Even better, one isn’t stuck using metadata 0.9.0 
> just because they need the /boot volume to have the header at the end (grub 
> now understands mdadm 1.2 headers). Just be sure /boot is RAID 1 and it 
> doesn’t seem to matter what one does with the rest of the volumes. Kernel 
> upgrades process correctly as well (another major hassle in the old days 
> since mkinitrd had to be carefully managed).
> 

Just to add a related experience: you still need 1.0 metadata (that's
the 1.x variant at the end of the partition, like 0.9.0) for an
mdadm-backed EFI system partition if you boot using UEFI. This generally
works well, except on some Dell servers where the firmware inexplicably
*writes* to the ESP, messing up the RAID mirroring. But there is a hacky
workaround. They create a directory ("Dell" IIRC) to put their junk in.
If you create a *file* with the same name ahead of time, that makes the
firmware fail to mkdir, but it doesn't seem to cause any issues and it
doesn't touch the disk in this case, so the RAID stays in sync.
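A sketch of both pieces, with placeholder device names (and the directory
name from memory, so check what your firmware actually tries to create):

# the ESP mirror needs 1.0 metadata so the firmware still sees a plain FAT filesystem
mdadm --create /dev/md/esp --level=1 --raid-devices=2 --metadata=1.0 /dev/sda1 /dev/sdb1
# pre-create a *file* where the firmware wants to mkdir, so its write attempt fails
touch /boot/efi/Dell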


-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Boot volume on OSD device

2019-01-18 Thread Brian Topping


> On Jan 18, 2019, at 4:29 AM, Hector Martin  wrote:
> 
> On 12/01/2019 15:07, Brian Topping wrote:
>> I’m a little nervous that BlueStore assumes it owns the partition table and 
>> will not be happy that a couple of primary partitions have been used. Will 
>> this be a problem?
> 
> You should look into using ceph-volume in LVM mode. This will allow you to 
> create an OSD out of any arbitrary LVM logical volume, and it doesn't care 
> about other volumes on the same PV/VG. I'm running BlueStore OSDs sharing PVs 
> with some non-Ceph stuff without any issues. It's the easiest way for OSDs to 
> coexist with other stuff right now.

Very interesting, thanks!

On the subject, I just rediscovered the technique of putting boot and root 
volumes on mdadm-backed stores. The last time I felt the need for this, it was 
a lot of careful planning and commands. 

Now, at least with RHEL/CentOS, it’s available in Anaconda. As it’s set up 
before mkfs, there’s no manual hackery to reduce the size of a volume to make 
room for the metadata. Even better, one isn’t stuck using metadata 0.9.0 just 
because they need the /boot volume to have the header at the end (grub now 
understands mdadm 1.2 headers). Just be sure /boot is RAID 1 and it doesn’t 
seem to matter what one does with the rest of the volumes. Kernel upgrades 
process correctly as well (another major hassle in the old days since mkinitrd 
had to be carefully managed).
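For anyone doing the same thing outside Anaconda, a rough sketch with
placeholder devices:

# /boot as RAID 1; grub2 can read 1.2 superblocks, so 0.9.0 is no longer required
mdadm --create /dev/md/boot --level=1 --raid-devices=2 --metadata=1.2 /dev/sda2 /dev/sdb2
mkfs.xfs /dev/md/boot
# install the bootloader on both members so either disk can boot on its own
grub2-install /dev/sda
grub2-install /dev/sdb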

best, B

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-18 Thread Mark Nelson


On 1/18/19 9:22 AM, Nils Fahldieck - Profihost AG wrote:

Hello Mark,

I'm answering on behalf of Stefan.
Am 18.01.19 um 00:22 schrieb Mark Nelson:

On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote:

Hello Mark,

after reading
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

again i'm really confused how the behaviour is exactly under 12.2.8
regarding memory and 12.2.10.

Also i stumbled upon "When tcmalloc and cache autotuning is enabled," -
we're compiling against and using jemalloc. What happens in this case?


Hi Stefan,


The autotuner uses the existing in-tree perfglue code that grabs the
tcmalloc heap and unmapped memory statistics to determine how to tune
the caches.  Theoretically we might be able to do the same thing for
jemalloc and maybe even glibc malloc, but there's no perfglue code for
those yet.  If the autotuner can't get heap statistics it won't try to
tune the caches and should instead revert to using the
bluestore_cache_size and whatever the ratios are (the same as if you set
bluestore_cache_autotune to false).


Thank you for that information on the difference between tcmalloc and
jemalloc. We compiled a new 12.2.10 version using tcmalloc. I upgraded a
cluster, which was running _our_ old 12.2.10 version (which used
jemalloc). This cluster has a very low load, so the
jemalloc-ceph-version didn't trigger any performance problems. Prior to
upgrading, one OSD never used more than 1 GB of RAM. After upgrading
there are OSDs using approx. 5,7 GB right now.

I also removed the 'osd_memory_target' option, which we falsely believed
has replaced 'bluestore_cache_size'.

We still have to test this on a cluster generating more I/O load.

For now, this seems to be working fine. Thanks.


Also i saw now - that 12.2.10 uses 1GB mem max while 12.2.8 uses 6-7GB
Mem (with bluestore_cache_size = 1073741824).


If you are using the autotuner (but it sounds like maybe you are not if
jemalloc is being used?) you'll want to set the osd_memory_target at
least 1GB higher than what you previously had the bluestore_cache_size
set to.  It's likely that trying to set the OSD to stay within 1GB of
memory will cause the cache to sit at osd_memory_cache_min because the
tuner simply can't shrink the cache enough to meet the target (too much
other memory consumed by pglog, rocksdb WAL buffers, random other stuff).

The fact that you see 6-7GB of mem usage with 12.2.8 vs 1GB with 12.2.10
sounds like a clue.  A bluestore OSD using 1GB of memory is going to
have very little space for cache and it's quite likely that it would be
performing reads from disk for a variety of reasons.  Getting to the
root of that might explain what's going on.  If you happen to still have
a 12.2.8 OSD up that's consuming 6-7GB of memory (with
bluestore_cache_size = 1073741824), can you dump the mempool stats and
running configuration for it?



This is one OSD from a different cluster using approximately 6,1 GB of
memory. This OSD and its cluster are still running with version 12.2.8.

This OSD (and every other OSD running with 12.2.8) is still configured
with 'bluestore_cache_size = 1073741824'. Please see the following
pastebins:


ceph daemon osd.NNN dump_mempools

https://pastebin.com/Pdcrr4ut


And


ceph daemon osd.NNN show config

https://pastebin.com/nkKpNFU3

Best Regards
Nils



Hi Nils,


Forgive me if you already said this, but is osd.32 backed by an SSD?  I 
believe what you are seeing is that the OSD is actually using 3GB of 
cache due to:



bluestore_cache_size_ssd = 3221225472


on line 132 of your show config paste.


That is backed up by the mempool data:


    "bluestore_cache_other": {
    "items": 62839413,
    "bytes": 2573767714
    },

    "total": {
    "items": 214595893,
    "bytes": 3087934707
    }


I.e. even though you guys set bluestore_cache_size to 1GB, it is being 
overridden by bluestore_cache_size_ssd.  Later, when you compiled the 
tcmalloc version of 12.2.10 and set the osd_memory_target to 1GB, it was 
properly being applied and the autotuner desperately attempted to fit 
the entire OSD into 1GB of memory by shrinking all of the caches to fit 
within osd_memory_cache_min (128MB by default).  Ultimately that led to 
many reads from disk, as even the rocksdb bloom filters may not have 
properly fit into that small a cache.  Generally I think the absolute 
minimum osd_memory_target for bluestore is probably around 1.5-2GB (with 
potential performance penalties), but 3-4GB gives it a lot more 
breathing room.  If you are OK with the OSD taking up 6-7GB of memory 
you might set the osd_memory_target accordingly.
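A minimal sketch of that last option, assuming ceph.conf is managed by hand
on the OSD hosts (the value is in bytes; 6 GiB shown here):

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
# overall memory target per OSD process; the autotuner sizes the caches to fit
osd_memory_target = 6442450944
EOF
# restart the OSDs (one at a time) so the new target takes effect
systemctl restart ceph-osd@32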



The reason we wrote the autotuning code is to try to make all of this 
simpler and more explicit.  The idea is that a user shouldn't need to 
think about any of this beyond giving the OSD a target for how much 
memory it should consume and let it worry about figuring out how to use 
it.  We're still working on making it smarter, but the goal is f

Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Alfredo Deza
On Fri, Jan 18, 2019 at 10:07 AM Jan Kasprzak  wrote:
>
> Alfredo,
>
> Alfredo Deza wrote:
> : On Fri, Jan 18, 2019 at 7:21 AM Jan Kasprzak  wrote:
> : > Eugen Block wrote:
> : > :
> : > : I think you're running into an issue reported a couple of times.
> : > : For the use of LVM you have to specify the name of the Volume Group
> : > : and the respective Logical Volume instead of the path, e.g.
> : > :
> : > : ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data 
> /dev/sda
> : > thanks, I will try it. In the meantime, I have discovered another way
> : > how to get around it: convert my SSDs from MBR to GPT partition table,
> : > and then create 15 additional GPT partitions for the respective block.dbs
> : > instead of 2x15 LVs.
> :
> : This is because ceph-volume can accept both LVs or GPT partitions for 
> block.db
> :
> : Another way around this, that doesn't require you to create the LVs is
> : to use the `batch` sub-command, that will automatically
> : detect your HDD and put data on it, and detect the SSD and create the
> : block.db LVs. The command could look something like:
> :
> :
> : ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc /dev/sdd
> : /dev/nvme0n1
> :
> : Would create 4 OSDs, place data on: sda, sdb, sdc, and sdd. And create
> : 4 block.db LVs on nvme0n1
>
> Interesting. Thanks!
>
> Can the batch command accept also partitions instead of a whole
> device for block.db? I already have two partitions on my SSDs for
> root and swap.

Ah, in that case, no. The idea is that it abstracts the handling in
such a way that it is as hands-off as possible. It is hard to
accomplish that if there are partitions on the SSDs. However, it is
still possible if the SSDs have LVs in them. The sub-command will just
figure out what extra space is available and use that.
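If you want to see what batch would do before it touches anything, there is
a report mode - a quick sketch:

# print the proposed layout (data devices vs. block.db LVs) without creating anything
ceph-volume lvm batch --bluestore --report /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/nvme0n1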


>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak  |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google
>  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-18 Thread KEVIN MICHAEL HRPCEK


On 1/18/19 7:26 AM, Igor Fedotov wrote:

Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:
Hey,

I recall reading about this somewhere but I can't find it in the docs or list 
archive and confirmation from a dev or someone who knows for sure would be 
nice. What I recall is that bluestore has a max 4GB file size limit based on 
the design of bluestore, not the osd_max_object_size setting. The bluestore 
source seems to suggest this: OBJECT_MAX_SIZE is set to the 32-bit max, an 
error is given if osd_max_object_size is > OBJECT_MAX_SIZE, and the data is 
not written if offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD 
file size int can't exceed 32 bits, which is 4GB, like FAT32. Am I correct, or 
am I maybe reading all this wrong..?

You're correct, BlueStore doesn't support object larger than 
OBJECT_MAX_SIZE(i.e. 4Gb)

Thanks for confirming that!


If bluestore has a hard 4GB object limit using radosstriper to break up an 
object would work, but does using an EC pool that breaks up the object to 
shards smaller than OBJECT_MAX_SIZE have the same effect as radosstriper to get 
around a 4GB limit? We use rados directly and would like to move to bluestore 
but we have some large objects <= 13G that may need attention if this 4GB limit 
does exist and an ec pool doesn't get around it.
Theoretically object split using EC might help. But I'm not sure whether one 
needs to adjust osd_max_object_size greater than 4Gb to permit 13Gb object 
usage in an EC pool. If it's needed, then the osd_max_object_size <= OBJECT_MAX_SIZE 
constraint is violated and BlueStore wouldn't start.
In my experience I had to increase osd_max_object_size from the 128M default 
(which it changed to a couple of versions ago) to ~20G to be able to write our 
largest objects with some margin. Do you think there is another way to handle 
osd_max_object_size > OBJECT_MAX_SIZE, so that BlueStore will start and EC pools 
or striping can be used to write objects that are greater than OBJECT_MAX_SIZE, 
as long as each stripe/shard ends up smaller than OBJECT_MAX_SIZE after striping 
or placement in an EC pool?
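For reference, a sketch of the radosstriper route mentioned above (the rados
CLI exposes libradosstriper via --striper; the pool name is a placeholder):

# each stripe unit becomes its own RADOS object, so no single object nears the 4GB cap
rados --striper -p somepool put bigobject ./13g_file
rados --striper -p somepool get bigobject ./13g_file.out
# the per-OSD limit currently in effect can be checked with:
ceph daemon osd.0 config get osd_max_object_size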



https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395

 // sanity check(s)
  auto osd_max_object_size =
cct->_conf.get_val("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
derr << __func__ << " osd_max_object_size >= 0x" << std::hex << 
OBJECT_MAX_SIZE
  << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." <<  
std::dec << dendl;
return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
r = -E2BIG;
  } else {
_assign_nid(txc, o);
r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
txc->write_onode(o);
  }

Thanks!
Kevin


--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Thanks,

Igor

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-18 Thread Nils Fahldieck - Profihost AG
Hello Mark,

I'm answering on behalf of Stefan.
Am 18.01.19 um 00:22 schrieb Mark Nelson:
> 
> On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote:
>> Hello Mark,
>>
>> after reading
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
>>
>> again i'm really confused how the behaviour is exactly under 12.2.8
>> regarding memory and 12.2.10.
>>
>> Also i stumbled upon "When tcmalloc and cache autotuning is enabled," -
>> we're compiling against and using jemalloc. What happens in this case?
> 
> 
> Hi Stefan,
> 
> 
> The autotuner uses the existing in-tree perfglue code that grabs the
> tcmalloc heap and unmapped memory statistics to determine how to tune
> the caches.  Theoretically we might be able to do the same thing for
> jemalloc and maybe even glibc malloc, but there's no perfglue code for
> those yet.  If the autotuner can't get heap statistics it won't try to
> tune the caches and should instead revert to using the
> bluestore_cache_size and whatever the ratios are (the same as if you set
> bluestore_cache_autotune to false).

Thank you for that information on the difference between tcmalloc and
jemalloc. We compiled a new 12.2.10 version using tcmalloc. I upgraded a
cluster, which was running _our_ old 12.2.10 version (which used
jemalloc). This cluster has a very low load, so the
jemalloc-ceph-version didn't trigger any performance problems. Prior to
upgrading, one OSD never used more than 1 GB of RAM. After upgrading
there are OSDs using approx. 5,7 GB right now.

I also removed the 'osd_memory_target' option, which we falsely believed
has replaced 'bluestore_cache_size'.

We still have to test this on a cluster generating more I/O load.

For now, this seems to be working fine. Thanks.

> 
>>
>> Also i saw now - that 12.2.10 uses 1GB mem max while 12.2.8 uses 6-7GB
>> Mem (with bluestore_cache_size = 1073741824).
> 
> 
> If you are using the autotuner (but it sounds like maybe you are not if
> jemalloc is being used?) you'll want to set the osd_memory_target at
> least 1GB higher than what you previously had the bluestore_cache_size
> set to.  It's likely that trying to set the OSD to stay within 1GB of
> memory will cause the cache to sit at osd_memory_cache_min because the
> tuner simply can't shrink the cache enough to meet the target (too much
> other memory consumed by pglog, rocksdb WAL buffers, random other stuff).
> 
> The fact that you see 6-7GB of mem usage with 12.2.8 vs 1GB with 12.2.10
> sounds like a clue.  A bluestore OSD using 1GB of memory is going to
> have very little space for cache and it's quite likely that it would be
> performing reads from disk for a variety of reasons.  Getting to the
> root of that might explain what's going on.  If you happen to still have
> a 12.2.8 OSD up that's consuming 6-7GB of memory (with
> bluestore_cache_size = 1073741824), can you dump the mempool stats and
> running configuration for it?


This is one OSD from a different cluster using approximately 6,1 GB of
memory. This OSD and its cluster are still running with version 12.2.8.

This OSD (and every other OSD running with 12.2.8) is still configured
with 'bluestore_cache_size = 1073741824'. Please see the following
pastebins:

> 
> 
> ceph daemon osd.NNN dump_mempools
https://pastebin.com/Pdcrr4ut
> 
> 
> And
> 
> 
> ceph daemon osd.NNN show config
https://pastebin.com/nkKpNFU3
> 
Best Regards
Nils
> 
> Thanks,
> 
> Mark
> 
> 
>>
>> Greets,
>> Stefan
>>
>> Am 17.01.19 um 22:59 schrieb Stefan Priebe - Profihost AG:
>>> Hello Mark,
>>>
>>> for whatever reason i didn't get your mails - most probably you kicked
>>> me out of CC/TO and only sent to the ML? I've only subscribed to a daily
>>> digest. (changed that for now)
>>>
>>> So i'm very sorry to answer so late.
>>>
>>> My messages might sound a bit confuse as it isn't easy reproduced and we
>>> tried a lot to find out what's going on.
>>>
>>> As 12.2.10 does not contain the pg hard limit i don't suspect it is
>>> related to it.
>>>
>>> What i can tell right now is:
>>>
>>> 1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824
>>>
>>> 2.) While upgrading to 12.2.10 we replaced it with osd_memory_target =
>>> 1073741824
>>>
>>> 3.) i also tried 12.2.10 without setting osd_memory_target or
>>> bluestore_cache_size
>>>
>>> 4.) it's not kernel related - for some unknown reason it worked for some
>>> hours with a newer kernel but gave problems again later
>>>
>>> 5.) a backfill with 12.2.10 of 6x 2TB SSDs took about 14 hours using
>>> 12.2.10 while it took 2 hours with 12.2.8
>>>
>>> 6.) with 12.2.10 i have a constant rate of 100% read i/o (400-500MB/s)
>>> on most of my bluestore OSDs - while on 12.2.8 i've 100kb - 2MB/s max
>>> read on 12.2.8.
>>>
>>> 7.) upgrades on small clusters or fresh installs seem to work fine. (no
>>> idea why or it is related to cluste size)
>>>
>>> That's currently all i know.
>>>
>>> Thanks a lot!
>>>
>>> Greets,
>>> Stefan
>>> Am 16.01.19 um 20:56 schrieb Stefan Priebe - Profihost

Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Jan Kasprzak
Alfredo,

Alfredo Deza wrote:
: On Fri, Jan 18, 2019 at 7:21 AM Jan Kasprzak  wrote:
: > Eugen Block wrote:
: > :
: > : I think you're running into an issue reported a couple of times.
: > : For the use of LVM you have to specify the name of the Volume Group
: > : and the respective Logical Volume instead of the path, e.g.
: > :
: > : ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data 
/dev/sda
: > thanks, I will try it. In the meantime, I have discovered another way
: > how to get around it: convert my SSDs from MBR to GPT partition table,
: > and then create 15 additional GPT partitions for the respective block.dbs
: > instead of 2x15 LVs.
: 
: This is because ceph-volume can accept both LVs or GPT partitions for block.db
: 
: Another way around this, that doesn't require you to create the LVs is
: to use the `batch` sub-command, that will automatically
: detect your HDD and put data on it, and detect the SSD and create the
: block.db LVs. The command could look something like:
: 
: 
: ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc /dev/sdd
: /dev/nvme0n1
: 
: Would create 4 OSDs, place data on: sda, sdb, sdc, and sdd. And create
: 4 block.db LVs on nvme0n1

Interesting. Thanks!

Can the batch command accept also partitions instead of a whole
device for block.db? I already have two partitions on my SSDs for
root and swap.

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread Marc Roos
 
Yes, and to be sure I did the read test again from another client. 


-Original Message-
From: David C [mailto:dcsysengin...@gmail.com] 
Sent: 18 January 2019 16:00
To: Marc Roos
Cc: aderumier; Burkhard.Linke; ceph-users
Subject: Re: [ceph-users] CephFS - Small file - single thread - read 
performance.



On Fri, 18 Jan 2019, 14:46 Marc Roos 

[@test]# time cat 50b.img > /dev/null

real    0m0.004s
user    0m0.000s
sys     0m0.002s
[@test]# time cat 50b.img > /dev/null

real    0m0.002s
user    0m0.000s
sys     0m0.002s
[@test]# time cat 50b.img > /dev/null

real    0m0.002s
user    0m0.000s
sys     0m0.001s
[@test]# time cat 50b.img > /dev/null

real    0m0.002s
user    0m0.001s
sys     0m0.001s
[@test]#

Luminous, centos7.6 kernel cephfs mount, 10Gbit, ssd meta, hdd data, mds
2,2Ghz



Did you drop the caches on your client before reading the file? 




-Original Message-
From: Alexandre DERUMIER [mailto:aderum...@odiso.com] 
Sent: 18 January 2019 15:37
To: Burkhard Linke
Cc: ceph-users
Subject: Re: [ceph-users] CephFS - Small file - single thread - 
read 
performance.

Hi,
I don't have so big latencies:

# time cat 50bytesfile > /dev/null

real0m0,002s
user0m0,001s
sys 0m0,000s


(It's on an ceph ssd cluster (mimic), kernel cephfs client (4.18), 
10GB 
network with small latency too, client/server have 3ghz cpus)



- Mail original -
De: "Burkhard Linke" 

À: "ceph-users" 
Envoyé: Vendredi 18 Janvier 2019 15:29:45
Objet: Re: [ceph-users] CephFS - Small file - single thread - read 
performance.

Hi, 

On 1/18/19 3:11 PM, jes...@krogh.cc wrote: 
> Hi. 
> 
> We have the intention of using CephFS for some of our shares, 
which 
> we'd like to spool to tape as a part normal backup schedule. 
CephFS 
> works nice for large files but for "small" .. < 0.1MB .. there 
seem to 

> be a "overhead" on 20-40ms per file. I tested like this:
> 
> root@abe:/nfs/home/jk# time cat 
/ceph/cluster/rsyncbackups/13kbfile > 
> /dev/null
> 
> real 0m0.034s
> user 0m0.001s
> sys 0m0.000s
> 
> And from local page-cache right after. 
> root@abe:/nfs/home/jk# time cat 
/ceph/cluster/rsyncbackups/13kbfile > 
> /dev/null
> 
> real 0m0.002s
> user 0m0.002s
> sys 0m0.000s
> 
> Giving a ~20ms overhead in a single file. 
> 
> This is about x3 higher than on our local filesystems (xfs) based 
on 
> same spindles.
> 
> CephFS metadata is on SSD - everything else on big-slow HDD's (in 
both 

> cases).
> 
> Is this what everyone else see? 


Each file access on client side requires the acquisition of a 
corresponding locking entity ('file capability') from the MDS. This 
adds 
an extra network round trip to the MDS. In the worst case the MDS 
needs 
to request a capability release from another client which still 
holds 
the cap (e.g. file is still in page cache), adding another extra 
network 
round trip. 


CephFS is not NFS, and has a strong consistency model. This comes 
at a 
price. 


Regards, 

Burkhard 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread David C
On Fri, 18 Jan 2019, 14:46 Marc Roos 
>
> [@test]# time cat 50b.img > /dev/null
>
> real0m0.004s
> user0m0.000s
> sys 0m0.002s
> [@test]# time cat 50b.img > /dev/null
>
> real0m0.002s
> user0m0.000s
> sys 0m0.002s
> [@test]# time cat 50b.img > /dev/null
>
> real0m0.002s
> user0m0.000s
> sys 0m0.001s
> [@test]# time cat 50b.img > /dev/null
>
> real0m0.002s
> user0m0.001s
> sys 0m0.001s
> [@test]#
>
> Luminous, centos7.6 kernel cephfs mount, 10Gbit, ssd meta, hdd data, mds
> 2,2Ghz
>

Did you drop the caches on your client before reading the file?
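For reference, a minimal way to do that on a Linux client (as root):

# flush dirty pages, then drop the page cache, dentries and inodes
sync && echo 3 > /proc/sys/vm/drop_caches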

>
>
>
> -Original Message-
> From: Alexandre DERUMIER [mailto:aderum...@odiso.com]
> Sent: 18 January 2019 15:37
> To: Burkhard Linke
> Cc: ceph-users
> Subject: Re: [ceph-users] CephFS - Small file - single thread - read
> performance.
>
> Hi,
> I don't have so big latencies:
>
> # time cat 50bytesfile > /dev/null
>
> real0m0,002s
> user0m0,001s
> sys 0m0,000s
>
>
> (It's on an ceph ssd cluster (mimic), kernel cephfs client (4.18), 10GB
> network with small latency too, client/server have 3ghz cpus)
>
>
>
> - Mail original -
> De: "Burkhard Linke" 
> À: "ceph-users" 
> Envoyé: Vendredi 18 Janvier 2019 15:29:45
> Objet: Re: [ceph-users] CephFS - Small file - single thread - read
> performance.
>
> Hi,
>
> On 1/18/19 3:11 PM, jes...@krogh.cc wrote:
> > Hi.
> >
> > We have the intention of using CephFS for some of our shares, which
> > we'd like to spool to tape as a part normal backup schedule. CephFS
> > works nice for large files but for "small" .. < 0.1MB .. there seem to
>
> > be a "overhead" on 20-40ms per file. I tested like this:
> >
> > root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
> > /dev/null
> >
> > real 0m0.034s
> > user 0m0.001s
> > sys 0m0.000s
> >
> > And from local page-cache right after.
> > root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
> > /dev/null
> >
> > real 0m0.002s
> > user 0m0.002s
> > sys 0m0.000s
> >
> > Giving a ~20ms overhead in a single file.
> >
> > This is about x3 higher than on our local filesystems (xfs) based on
> > same spindles.
> >
> > CephFS metadata is on SSD - everything else on big-slow HDD's (in both
>
> > cases).
> >
> > Is this what everyone else see?
>
>
> Each file access on client side requires the acquisition of a
> corresponding locking entity ('file capability') from the MDS. This adds
> an extra network round trip to the MDS. In the worst case the MDS needs
> to request a capability release from another client which still holds
> the cap (e.g. file is still in page cache), adding another extra network
> round trip.
>
>
> CephFS is not NFS, and has a strong consistency model. This comes at a
> price.
>
>
> Regards,
>
> Burkhard
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread Marc Roos
 

[@test]# time cat 50b.img > /dev/null

real    0m0.004s
user    0m0.000s
sys     0m0.002s
[@test]# time cat 50b.img > /dev/null

real    0m0.002s
user    0m0.000s
sys     0m0.002s
[@test]# time cat 50b.img > /dev/null

real    0m0.002s
user    0m0.000s
sys     0m0.001s
[@test]# time cat 50b.img > /dev/null

real    0m0.002s
user    0m0.001s
sys     0m0.001s
[@test]#

Luminous, centos7.6 kernel cephfs mount, 10Gbit, ssd meta, hdd data, mds 
2,2Ghz



-Original Message-
From: Alexandre DERUMIER [mailto:aderum...@odiso.com] 
Sent: 18 January 2019 15:37
To: Burkhard Linke
Cc: ceph-users
Subject: Re: [ceph-users] CephFS - Small file - single thread - read 
performance.

Hi,
I don't have so big latencies:

# time cat 50bytesfile > /dev/null

real0m0,002s
user0m0,001s
sys 0m0,000s


(It's on an ceph ssd cluster (mimic), kernel cephfs client (4.18), 10GB 
network with small latency too, client/server have 3ghz cpus)



- Mail original -
De: "Burkhard Linke" 
À: "ceph-users" 
Envoyé: Vendredi 18 Janvier 2019 15:29:45
Objet: Re: [ceph-users] CephFS - Small file - single thread - read 
performance.

Hi, 

On 1/18/19 3:11 PM, jes...@krogh.cc wrote: 
> Hi. 
> 
> We have the intention of using CephFS for some of our shares, which 
> we'd like to spool to tape as a part normal backup schedule. CephFS 
> works nice for large files but for "small" .. < 0.1MB .. there seem to 

> be a "overhead" on 20-40ms per file. I tested like this:
> 
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile > 
> /dev/null
> 
> real 0m0.034s
> user 0m0.001s
> sys 0m0.000s
> 
> And from local page-cache right after. 
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile > 
> /dev/null
> 
> real 0m0.002s
> user 0m0.002s
> sys 0m0.000s
> 
> Giving a ~20ms overhead in a single file. 
> 
> This is about x3 higher than on our local filesystems (xfs) based on 
> same spindles.
> 
> CephFS metadata is on SSD - everything else on big-slow HDD's (in both 

> cases).
> 
> Is this what everyone else see? 


Each file access on client side requires the acquisition of a 
corresponding locking entity ('file capability') from the MDS. This adds 
an extra network round trip to the MDS. In the worst case the MDS needs 
to request a capability release from another client which still holds 
the cap (e.g. file is still in page cache), adding another extra network 
round trip. 


CephFS is not NFS, and has a strong consistency model. This comes at a 
price. 


Regards, 

Burkhard 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread David C
On Fri, Jan 18, 2019 at 2:12 PM  wrote:

> Hi.
>
> We have the intention of using CephFS for some of our shares, which we'd
> like to spool to tape as a part normal backup schedule. CephFS works nice
> for large files but for "small" .. < 0.1MB  .. there seem to be a
> "overhead" on 20-40ms per file. I tested like this:
>
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
> /dev/null
>
> real0m0.034s
> user0m0.001s
> sys 0m0.000s
>
> And from local page-cache right after.
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
> /dev/null
>
> real0m0.002s
> user0m0.002s
> sys 0m0.000s
>
> Giving a ~20ms overhead in a single file.
>
> This is about x3 higher than on our local filesystems (xfs) based on
> same spindles.
>
> CephFS metadata is on SSD - everything else on big-slow HDD's (in both
> cases).
>
> Is this what everyone else see?
>

Pretty much. Reading a file from a pool of Filestore spinners:

# time cat 13kb > /dev/null

real    0m0.013s
user    0m0.000s
sys     0m0.003s

That's after dropping the caches on the client; however, the file would have
still been in the page cache on the OSD nodes as I had just created it. If the
file was coming straight off the spinners I'd expect to see something
closer to your time.

I guess if you wanted to improve the latency you would be looking at the
usual stuff, e.g. (off the top of my head):

- Faster network links/tuning your network
- Turning down Ceph debugging
- Trying a different striping layout on the dirs with the small files
(unlikely to have much effect)
- If you're using the fuse mount, try the kernel mount (or maybe vice versa)
- Play with mount options
- Tune CPU on MDS node

Still, even with all of that, it's unlikely you'll get to local file-system
performance; as Burkhard says, you have the locking overhead. You'll
probably need to look at getting more parallelism going in your rsyncs.
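A rough sketch of one way to do that, assuming the source tree splits
sensibly at the top level (the destination path is a placeholder):

# run 8 rsyncs in parallel, one per top-level directory; adjust -P to taste
cd /ceph/cluster/rsyncbackups
ls -d */ | xargs -n1 -P8 -I{} rsync -a {} /backup/target/{}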



>
> Thanks
>
> --
> Jesper
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread Alexandre DERUMIER
Hi,
I don't have so big latencies:

# time cat 50bytesfile > /dev/null

real    0m0,002s
user    0m0,001s
sys     0m0,000s


(It's on an ceph ssd cluster (mimic), kernel cephfs client (4.18), 10GB network 
with small latency too, client/server have 3ghz cpus)



- Mail original -
De: "Burkhard Linke" 
À: "ceph-users" 
Envoyé: Vendredi 18 Janvier 2019 15:29:45
Objet: Re: [ceph-users] CephFS - Small file - single thread - read performance.

Hi, 

On 1/18/19 3:11 PM, jes...@krogh.cc wrote: 
> Hi. 
> 
> We have the intention of using CephFS for some of our shares, which we'd 
> like to spool to tape as a part normal backup schedule. CephFS works nice 
> for large files but for "small" .. < 0.1MB .. there seem to be a 
> "overhead" on 20-40ms per file. I tested like this: 
> 
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile > 
> /dev/null 
> 
> real 0m0.034s 
> user 0m0.001s 
> sys 0m0.000s 
> 
> And from local page-cache right after. 
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile > 
> /dev/null 
> 
> real 0m0.002s 
> user 0m0.002s 
> sys 0m0.000s 
> 
> Giving a ~20ms overhead in a single file. 
> 
> This is about x3 higher than on our local filesystems (xfs) based on 
> same spindles. 
> 
> CephFS metadata is on SSD - everything else on big-slow HDD's (in both 
> cases). 
> 
> Is this what everyone else see? 


Each file access on client side requires the acquisition of a 
corresponding locking entity ('file capability') from the MDS. This adds 
an extra network round trip to the MDS. In the worst case the MDS needs 
to request a capability release from another client which still holds 
the cap (e.g. file is still in page cache), adding another extra network 
round trip. 


CephFS is not NFS, and has a strong consistency model. This comes at a 
price. 


Regards, 

Burkhard 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread Burkhard Linke

Hi,

On 1/18/19 3:11 PM, jes...@krogh.cc wrote:

Hi.

We have the intention of using CephFS for some of our shares, which we'd
like to spool to tape as a part normal backup schedule. CephFS works nice
for large files but for "small" .. < 0.1MB  .. there seem to be a
"overhead" on 20-40ms per file. I tested like this:

root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
/dev/null

real0m0.034s
user0m0.001s
sys 0m0.000s

And from local page-cache right after.
root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
/dev/null

real0m0.002s
user0m0.002s
sys 0m0.000s

Giving a ~20ms overhead in a single file.

This is about x3 higher than on our local filesystems (xfs) based on
same spindles.

CephFS metadata is on SSD - everything else on big-slow HDD's (in both
cases).

Is this what everyone else see?



Each file access on client side requires the acquisition of a 
corresponding locking entity ('file capability') from the MDS. This adds 
an extra network round trip to the MDS. In the worst case the MDS needs 
to request a capability release from another client which still holds 
the cap (e.g. file is still in page cache), adding another extra network 
round trip.



CephFS is not NFS, and has a strong consistency model. This comes at a 
price.



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread jesper
Hi.

We have the intention of using CephFS for some of our shares, which we'd
like to spool to tape as a part normal backup schedule. CephFS works nice
for large files but for "small" .. < 0.1MB  .. there seem to be a
"overhead" on 20-40ms per file. I tested like this:

root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
/dev/null

real    0m0.034s
user    0m0.001s
sys     0m0.000s

And from local page-cache right after.
root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile >
/dev/null

real    0m0.002s
user    0m0.002s
sys     0m0.000s

Giving a ~20ms overhead in a single file.

This is about x3 higher than on our local filesystems (xfs) based on
same spindles.

CephFS metadata is on SSD - everything else on big-slow HDD's (in both
cases).

Is this what everyone else see?

Thanks

-- 
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping python 2 for nautilus... go/no-go

2019-01-18 Thread Wido den Hollander



On 1/16/19 4:54 PM, c...@jack.fr.eu.org wrote:
> Hi,
> 
> My 2 cents:
> - do drop python2 support

I wouldn't agree. Python 2 needs to be dropped.

> - do not drop python2 support unexpectedly, aka do a deprecation phase
> 
Indeed. Deprecate it at the Nautilus release and drop it after N.

Write blogs, post on the ML, tweet about it, e-mail everybody you know
about the fact that Ceph is dropping Python 2 support after N.

Dropping it in N without a deprecation period doesn't seem like a good idea.

Wido

> People should already know that python2 is dead
> That is not enough, though, to remove that "by surprise"
> 
> Regards,
> 
> On 01/16/2019 04:45 PM, Sage Weil wrote:
>> Hi everyone,
>>
>> This has come up several times before, but we need to make a final 
>> decision.  Alfredo has a PR prepared that drops Python 2 support entirely 
>> in master, which will mean nautilus is Python 3 only.
>>
>> All of our distro targets (el7, bionic, xenial) include python 3, so that 
>> isn't an issue.  However, it also means that users of python-rados, 
>> python-rbd, and python-cephfs will need to be using python 3.
>>
>> Python 2 is on its way out, and has been for years.  See
>>
>>  https://pythonclock.org/
>>
>> If we don't kill it in Nautilus, we'll be doing it for Octopus.
>>
>> Are there major python-{rbd,cephfs,rgw,rados} users that are still Python 
>> 2 that we need to be worried about?  (OpenStack?)
>>
>> sage
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping python 2 for nautilus... go/no-go

2019-01-18 Thread Hector Martin
On 18/01/2019 22.33, Alfredo Deza wrote:
> On Fri, Jan 18, 2019 at 7:07 AM Hector Martin  wrote:
>>
>> On 17/01/2019 00:45, Sage Weil wrote:
>>> Hi everyone,
>>>
>>> This has come up several times before, but we need to make a final
>>> decision.  Alfredo has a PR prepared that drops Python 2 support entirely
>>> in master, which will mean nautilus is Python 3 only.
>>>
>>> All of our distro targets (el7, bionic, xenial) include python 3, so that
>>> isn't an issue.  However, it also means that users of python-rados,
>>> python-rbd, and python-cephfs will need to be using python 3.
>>
>> I'm not sure dropping Python 2 support in Nautilus is reasonable...
>> simply because Python 3 support isn't quite stable in Mimic yet - I just
>> filed https://tracker.ceph.com/issues/37963 for ceph-volume being broken
>> with Python 3 and dm-crypt :-)
> 
> These are the exact type of things we can't really get to test because
> we rely on functional coverage. Because we currently build Ceph with
> support with Python2, then the binaries end up "choosing" the Python2
> interpreter and so the tests are all Python2

Sounds like that should be changed to default to Python3. On Gentoo, the
Ceph ebuilds do the usual Gentoo thing, that is: Gentoo lets users
select both a set of supported Python versions, and a preferred/selected
version. User modules get built for all supported versions and
binaries/tools get built for the active/main/preferred version. This is
how I ended up with a Python3 ceph-volume, because most of my systems
default to Python3 these days.

Perhaps the same thing could be done on other distros, forcing all the
tools to switch to Python3 while keeping the Python2-compatible modules?

-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping python 2 for nautilus... go/no-go

2019-01-18 Thread Alfredo Deza
On Fri, Jan 18, 2019 at 7:07 AM Hector Martin  wrote:
>
> On 17/01/2019 00:45, Sage Weil wrote:
> > Hi everyone,
> >
> > This has come up several times before, but we need to make a final
> > decision.  Alfredo has a PR prepared that drops Python 2 support entirely
> > in master, which will mean nautilus is Python 3 only.
> >
> > All of our distro targets (el7, bionic, xenial) include python 3, so that
> > isn't an issue.  However, it also means that users of python-rados,
> > python-rbd, and python-cephfs will need to be using python 3.
>
> I'm not sure dropping Python 2 support in Nautilus is reasonable...
> simply because Python 3 support isn't quite stable in Mimic yet - I just
> filed https://tracker.ceph.com/issues/37963 for ceph-volume being broken
> with Python 3 and dm-crypt :-)

These are exactly the type of things we can't really get to test because
we rely on functional coverage. Because we currently build Ceph with
support for Python2, the binaries end up "choosing" the Python2
interpreter and so the tests are all Python2.

The other issue is that we found it to be borderline impossible to toggle
Python2/Python3 builds to allow some builds to be Python3 so that we can
actually run some tests.
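As a quick sanity check on an installed system, the shebang shows which
interpreter a given tool ended up with - a sketch:

# prints something like #!/usr/bin/python2.7
head -1 "$(command -v ceph-volume)"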

I do expect breakage though, so the sooner we get to switch the
better. Seems like we are leaning towards merging the
Python3-exclusive branch once Nautilus is out.

>
> I think there needs to be a release that supports both equally well to
> give people time to safely migrate over. Might be worth doing some
> tree-wide reviews (like that division thing) to hopefully squash more
> lurking Python 3 bugs.
>
> (just my 2c - maybe I got unlucky and otherwise things work well enough
> for everyone else in Py3; I'm certainly happy to get rid of Py2 ASAP).
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://marcan.st/marcan.asc
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Alfredo Deza
On Fri, Jan 18, 2019 at 7:21 AM Jan Kasprzak  wrote:
>
> Eugen Block wrote:
> : Hi Jan,
> :
> : I think you're running into an issue reported a couple of times.
> : For the use of LVM you have to specify the name of the Volume Group
> : and the respective Logical Volume instead of the path, e.g.
> :
> : ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda
>
> Eugen,
>
> thanks, I will try it. In the meantime, I have discovered another way
> how to get around it: convert my SSDs from MBR to GPT partition table,
> and then create 15 additional GPT partitions for the respective block.dbs
> instead of 2x15 LVs.

This is because ceph-volume can accept both LVs or GPT partitions for block.db

Another way around this, that doesn't require you to create the LVs is
to use the `batch` sub-command, that will automatically
detect your HDD and put data on it, and detect the SSD and create the
block.db LVs. The command could look something like:


ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc /dev/sdd
/dev/nvme0n1

Would create 4 OSDs, place data on: sda, sdb, sdc, and sdd. And create
4 block.db LVs on nvme0n1



>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak  |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google
>  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-18 Thread Igor Fedotov

Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:

Hey,

I recall reading about this somewhere, but I can't find it in the docs
or the list archive, and confirmation from a dev or someone who knows
for sure would be nice. What I recall is that BlueStore has a maximum
4GB object size limit that comes from the design of BlueStore itself,
not from the osd_max_object_size setting. The BlueStore source seems
to suggest this by setting OBJECT_MAX_SIZE to the 32-bit maximum,
giving an error if osd_max_object_size is > OBJECT_MAX_SIZE, and not
writing the data if offset+length >= OBJECT_MAX_SIZE. So it seems like
the per-object size in the OSD can't exceed 32 bits, which is 4GB,
like FAT32. Am I correct, or am I reading all this wrong?


You're correct, BlueStore doesn't support objects larger than
OBJECT_MAX_SIZE (i.e. 4GB).





If BlueStore has a hard 4GB object limit, using radosstriper to break
up an object would work, but does using an EC pool, which breaks the
object up into shards smaller than OBJECT_MAX_SIZE, have the same
effect as radosstriper for getting around the 4GB limit? We use rados
directly and would like to move to BlueStore, but we have some large
objects <= 13G that may need attention if this 4GB limit does exist
and an EC pool doesn't get around it.
Theoretically the object splitting done by EC might help. But I'm not
sure whether one needs to raise osd_max_object_size above 4GB to permit
13GB objects in an EC pool. If that is needed, then the
osd_max_object_size <= OBJECT_MAX_SIZE constraint is violated and
BlueStore wouldn't start.
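
As a quick sanity check, the value currently in effect can be read from
a running OSD via the admin socket on its host (osd.0 is just an
example id):

ceph daemon osd.0 config get osd_max_object_size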



https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex << OBJECT_MAX_SIZE
         << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." << std::dec << dendl;
    return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }

Thanks!
Kevin
--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Thanks,

Igor

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] quick questions about a 5-node homelab setup

2019-01-18 Thread Eugen Leitl
On Fri, Jan 18, 2019 at 12:42:21PM +0100, Robert Sander wrote:
> On 18.01.19 11:48, Eugen Leitl wrote:
> 
> > OSD on every node (Bluestore), journal on SSD (do I need a directory, or a 
> > dedicated partition? How large, assuming 2 TB and 4 TB Bluestore HDDs?)
> 
> You need a partition on the SSD for the block.db (it's not a journal

Thanks, didn't realize that. Can I do that and remain flexible by going LVM on 
both SSD and HDD?

> anymore with BlueStore). You should look into osd_memory_target to
> configure the osd process with 1 or 2 GB of RAM in your setup.

Got that.
 
> > Can I run ceph-mon instances on the two D510, or would that already 
> > overload them? No sense to try running 2x monitors on D510 and one on the 
> > 330, right?
> 
> Yes, Mons need some resources. If you have set osd_memory_target they
> may fit on your Atoms.

Thanks. I guess I'll have to experiment a little. It's not that I'll be 
using this heavily.
 
> > I've just realized that I'll also need ceph-mgr daemons on the hosts 
> > running ceph-mon. I don't see the added system resource requirements for 
> > these.
> 
> The mgr process is quite light in resource usage.

Good to know.
 
> > Assuming BlueStore is too fat for my crappy nodes, do I need to go to 
> > FileStore? If yes, then with xfs as the file system? Journal on the SSD as 
> > a directory, then?
> 
> Journal for FileStore is also a block device.

Didn't realize that, either.

Thank you for your answers. Appreciated.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Jan Kasprzak
Eugen Block wrote:
: Hi Jan,
: 
: I think you're running into an issue reported a couple of times.
: For the use of LVM you have to specify the name of the Volume Group
: and the respective Logical Volume instead of the path, e.g.
: 
: ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda

Eugen,

thanks, I will try it. In the meantime, I have discovered another way
to get around it: convert my SSDs from MBR to GPT partition tables,
and then create 15 additional GPT partitions for the respective block.dbs
instead of 2x15 LVs.
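
For reference, something along these lines should do it (just a sketch;
/dev/sdb stands for one of the SSDs, the 30G size is arbitrary, and the
partition number will depend on what already exists on the disk):

# create the next free GPT partition, 30G in size
sgdisk -n 0:0:+30G /dev/sdb
# then hand the new partition to ceph-volume instead of an LV
ceph-volume lvm prepare --bluestore --data /dev/sda --block.db /dev/sdb3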

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph in OSPF environment

2019-01-18 Thread Max Krasilnikov
Dear colleagues,

we build L3 topology for use with CEPH, which is based on OSPF routing 
between Loopbacks, in order to get reliable and ECMPed topology, like this:

10.10.200.6 proto bird metric 64
     nexthop via 10.10.15.3 dev enp97s0f1 weight 1
     nexthop via 10.10.25.3 dev enp19s0f0 weight 1

where 10.10.200.x are loopbacks (plenty of them on every node) and 
10.10.15.x/25.x are the interface addresses of the host. So, my physical interface
addresses on this host are 10.10.25.2 and 10.10.15.2, local loopback is
10.10.200.5.

CEPH is configured in the following way:

[global]
public_network = 10.10.200.0/24
[osd.0]
public bind addr = 10.10.200.5
cluster bind addr = 10.10.200.5

but regardless of these settings the ceph-osd process originates connections 
from the interface addresses, e.g.

tcp  0  0 10.10.15.2:57476  10.10.200.7:6817  ESTABLISHED  52896/ceph-osd
tcp  0  0 10.10.25.2:42650  10.10.200.9:6814  ESTABLISHED  52896/ceph-osd
tcp  0  0 10.10.25.2:36422  10.10.200.7:6804  ESTABLISHED  52896/ceph-osd
tcp  0  0 10.10.15.2:49940  10.10.200.6:6815  ESTABLISHED  52896/ceph-osd

which has a negative impact when one or another physical port goes down.

Is there a way to tell CEPH to originate connections from a specified IP
address that is not bound to the physical infrastructure?

Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-18 Thread Ilya Dryomov
On Fri, Jan 18, 2019 at 11:25 AM Mykola Golub  wrote:
>
> On Thu, Jan 17, 2019 at 10:27:20AM -0800, Void Star Nill wrote:
> > Hi,
> >
> > We are trying to use Ceph in our products to address some of the use cases.
> > We think the Ceph block device is a good fit for us. One of the use cases is that we have a
> > number of jobs running in containers that need to have Read-Only access to
> > shared data. The data is written once and is consumed multiple times. I
> > have read through some of the similar discussions and the recommendations
> > on using CephFS for these situations, but in our case Block device makes
> > more sense as it fits well with other use cases and restrictions we have
> > around this use case.
> >
> > The following scenario seems to work as expected when we tried on a test
> > cluster, but we wanted to get an expert opinion to see if there would be
> > any issues in production. The usage scenario is as follows:
> >
> > - A block device is created with "--image-shared" options:
> >
> > rbd create mypool/foo --size 4G --image-shared
>
> "--image-shared" just means that the created image will have
> "exclusive-lock" feature and all other features that depend on it
> disabled. It is useful for scenarios when one wants simulteous write
> access to the image (e.g. when using a shared-disk cluster fs like
> ocfs2) and does not want a performance penalty due to "exlusive-lock"
> being pinged-ponged between writers.
>
> For your scenario it is not necessary but is ok.
>
> > - The image is mapped to a host, formatted in ext4 format (or other file
> > formats), mounted to a directory in read/write mode and data is written to
> > it. Please note that the image will be mapped in exclusive write mode -- no
> > other read/write mounts are allowed at this time.
>
> The map "exclusive" option works only for images with "exclusive-lock"
> feature enabled and prevent in this case automatic exclusive lock
> transitions (ping-pong mentioned above) from one writer to
> another. And in this case it will not prevent from mapping and
> mounting it ro and probably even rw (I am not familiar enough with
> kernel rbd implementation to be sure here), though in the last case
> the write will fail.

With -o exclusive, in addition to preventing automatic lock
transitions, the kernel will attempt to acquire the lock at map time
(i.e. before allowing any I/O) and return an error from "rbd map" in
case the lock cannot be acquired.

However, the fact the image is mapped -o exclusive on one host doesn't
mean that it can't be mapped without -o exclusive on another host.  If
you then try to write though the non-exclusive mapping, the write will
block until the exclusive mapping goes away resulting a hung tasks in
uninterruptible sleep state -- a much less pleasant failure mode.

So make sure that all writers use -o exclusive.
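
For example, with the image from earlier in the thread:

rbd map mypool/foo -o exclusive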

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping python 2 for nautilus... go/no-go

2019-01-18 Thread Hector Martin

On 17/01/2019 00:45, Sage Weil wrote:

Hi everyone,

This has come up several times before, but we need to make a final
decision.  Alfredo has a PR prepared that drops Python 2 support entirely
in master, which will mean nautilus is Python 3 only.

All of our distro targets (el7, bionic, xenial) include python 3, so that
isn't an issue.  However, it also means that users of python-rados,
python-rbd, and python-cephfs will need to be using python 3.


I'm not sure dropping Python 2 support in Nautilus is reasonable... 
simply because Python 3 support isn't quite stable in Mimic yet - I just 
filed https://tracker.ceph.com/issues/37963 for ceph-volume being broken 
with Python 3 and dm-crypt :-)


I think there needs to be a release that supports both equally well to 
give people time to safely migrate over. Might be worth doing some 
tree-wide reviews (like that division thing) to hopefully squash more 
lurking Python 3 bugs.
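
A quick illustration of the division difference I mean, assuming both
interpreters are installed:

python2 -c 'print(1/2)'   # prints 0
python3 -c 'print(1/2)'   # prints 0.5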


(just my 2c - maybe I got unlucky and otherwise things work well enough 
for everyone else in Py3; I'm certainly happy to get rid of Py2 ASAP).


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Eugen Block

Hi Jan,

I think you're running into an issue reported a couple of times.
For the use of LVM you have to specify the name of the Volume Group  
and the respective Logical Volume instead of the path, e.g.


ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda

Regards,
Eugen


Zitat von Jan Kasprzak :


Hello, Ceph users,

replying to my own post from several weeks ago:

Jan Kasprzak wrote:
: [...] I plan to add new OSD hosts,
: and I am looking for setup recommendations.
:
: Intended usage:
:
: - small-ish pool (tens of TB) for RBD volumes used by QEMU
: - large pool for object-based cold (or not-so-hot :-) data,
:   write-once read-many access pattern, average object size
:   10s or 100s of MBs, probably custom programmed on top of
:   libradosstriper.
:
: Hardware:
:
: The new OSD hosts have ~30 HDDs 12 TB each, and two 960 GB SSDs.
: There is a small RAID-1 root and RAID-1 swap volume spanning both SSDs,
: leaving about 900 GB free on each SSD.
: The OSD hosts have two CPU sockets (32 cores including SMT), 128 GB RAM.
:
: My questions:
[...]
: - block.db on SSDs? The docs recommend about 4 % of the data size
:   for block.db, but my SSDs are only 0.6 % of total storage size.
:
: - or would it be better to leave SSD caching on the OS and use LVMcache
:   or something?
:
: - LVM or simple volumes?

I have a problem setting this up with ceph-volume: I want to have an OSD
on each HDD, with block.db on the SSD. In order to set this up,
I have created a VG on the two SSDs, created 30 LVs on top of it for
block.db, and wanted to create an OSD using the following:

# ceph-volume lvm prepare --bluestore \
--block.db /dev/ssd_vg/ssd00 \
--data /dev/sda
[...]
--> blkid could not detect a PARTUUID for device: /dev/cbia_ssd_vg/ssd00
--> Was unable to complete a new OSD, will rollback changes
[...]

Then it failed, because deploying a volume used client.bootstrap-osd user,
but trying to roll the changes back required the client.admin user,
which does not have a keyring on the OSD host. Never mind.

The problem is with determining the PARTUUID of the SSD LV for block.db.
How can I deploy an OSD which is on top of bare HDD, but which also
has a block.db on an existing LV?

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Suggestions/experiences with mixed disk sizes and models from 4TB - 14TB

2019-01-18 Thread Hector Martin

On 16/01/2019 18:33, Götz Reinicke wrote:

My question is: how are your experiences with the current >=8TB SATA disks? Are 
there some very bad models out there which I should avoid?


Be careful with Seagate consumer SATA drives. They are now shipping SMR 
drives without mentioning that fact anywhere in the documentation. One 
example of such a model is the 4TB ST4000DM004 (previous models like the 
ST4000DM000 were not SMR). I expect this to cause catastrophically slow 
performance under heavy write volumes, e.g. when rebuilding or 
rebalancing PGs.


I assume enterprise models are fine (if you read the fine print), but I 
would avoid any current generation Seagate consumer models unless you're 
happy buying a sample first and benchmarking it to confirm what kind of 
drive it is, or you can find someone who has done so. SMR drives have a 
telltale sign of unreasonably fast random write performance for a brief 
time (well beyond practical IOPS for any normal HDD), which then craters 
to nearly zero once the internal journal fills up.
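
A rough way to benchmark a sample drive for this (a sketch only, and
destructive to any data on /dev/sdX; watch how the IOPS evolve over time):

fio --name=smr-check --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --time_based --runtime=1800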


Personally I'm using MD05ACA800 (8TB toshiba, spec unknown, seems to be 
a B2B model but they're available for cheap) and they seem to work well 
so far in my home cluster, but I haven't finished setting things up yet. 
Those are definitely not SMR.


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] quick questions about a 5-node homelab setup

2019-01-18 Thread Robert Sander
On 18.01.19 11:48, Eugen Leitl wrote:

> OSD on every node (Bluestore), journal on SSD (do I need a directory, or a 
> dedicated partition? How large, assuming 2 TB and 4 TB Bluestore HDDs?)

You need a partition on the SSD for the block.db (it's not a journal
anymore with BlueStore). You should look into osd_memory_target to
configure the osd process with 1 or 2 GB of RAM in your setup.
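
For example, to cap each OSD at roughly 1 GB, something like this in
ceph.conf should do (the value is in bytes):

[osd]
osd_memory_target = 1073741824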

> Can I run ceph-mon instances on the two D510, or would that already overload 
> them? No sense to try running 2x monitors on D510 and one on the 330, right?

Yes, Mons need some resources. If you have set osd_memory_target they
may fit on your Atoms.

> I've just realized that I'll also need ceph-mgr daemons on the hosts running 
> ceph-mon. I don't see the added system resource requirements for these.

The mgr process is quite light in resource usage.

> Assuming BlueStore is too fat for my crappy nodes, do I need to go to 
> FileStore? If yes, then with xfs as the file system? Journal on the SSD as a 
> directory, then?

Journal for FileStore is also a block device.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Jan Kasprzak
Hello, Ceph users,

replying to my own post from several weeks ago:

Jan Kasprzak wrote:
: [...] I plan to add new OSD hosts,
: and I am looking for setup recommendations.
: 
: Intended usage:
: 
: - small-ish pool (tens of TB) for RBD volumes used by QEMU
: - large pool for object-based cold (or not-so-hot :-) data,
:   write-once read-many access pattern, average object size
:   10s or 100s of MBs, probably custom programmed on top of
:   libradosstriper.
: 
: Hardware:
: 
: The new OSD hosts have ~30 HDDs 12 TB each, and two 960 GB SSDs.
: There is a small RAID-1 root and RAID-1 swap volume spanning both SSDs,
: leaving about 900 GB free on each SSD.
: The OSD hosts have two CPU sockets (32 cores including SMT), 128 GB RAM.
: 
: My questions:
[...]
: - block.db on SSDs? The docs recommend about 4 % of the data size
:   for block.db, but my SSDs are only 0.6 % of total storage size.
: 
: - or would it be better to leave SSD caching on the OS and use LVMcache
:   or something?
: 
: - LVM or simple volumes?

I have a problem setting this up with ceph-volume: I want to have an OSD
on each HDD, with block.db on the SSD. In order to set this up,
I have created a VG on the two SSDs, created 30 LVs on top of it for block.db,
and wanted to create an OSD using the following:

# ceph-volume lvm prepare --bluestore \
--block.db /dev/ssd_vg/ssd00 \
--data /dev/sda
[...]
--> blkid could not detect a PARTUUID for device: /dev/cbia_ssd_vg/ssd00
--> Was unable to complete a new OSD, will rollback changes
[...]

Then it failed, because deploying a volume used client.bootstrap-osd user,
but trying to roll the changes back required the client.admin user,
which does not have a keyring on the OSD host. Never mind.

The problem is with determining the PARTUUID of the SSD LV for block.db.
How can I deploy an OSD which is on top of bare HDD, but which also
has a block.db on an existing LV?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Boot volume on OSD device

2019-01-18 Thread Hector Martin

On 12/01/2019 15:07, Brian Topping wrote:

I’m a little nervous that BlueStore assumes it owns the partition table and 
will not be happy that a couple of primary partitions have been used. Will this 
be a problem?


You should look into using ceph-volume in LVM mode. This will allow you 
to create an OSD out of any arbitrary LVM logical volume, and it doesn't 
care about other volumes on the same PV/VG. I'm running BlueStore OSDs 
sharing PVs with some non-Ceph stuff without any issues. It's the 
easiest way for OSDs to coexist with other stuff right now.


So, for example, you could have /boot on a partition, an LVM PV on 
another partition, containing an LV for / and an LV for your OSD. Or you 
could just use a partition for / (including /boot) and just have another 
partition for a PV wholly occupied by a single OSD LV. How you set up 
everything around LVM is up to you, ceph-volume just wants a logical 
volume to own (and uses LVM metadata to store its stuff, so it doesn't 
require a separate filesystem for metadata, just the main BlueStore device).
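
As a rough sketch (names made up), once the VG exists it's just:

lvcreate -n osd0 -L 900G vg0
ceph-volume lvm create --bluestore --data vg0/osd0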


I also have two clusters using ceph-disk with a rootfs RAID1 across OSD 
data drives, with extra partitions; at least with GPT this works without 
any problems, but set-up might be finicky. For our deployment I ended up 
rewriting what ceph-disk does in my own script (it's not that 
complicated, just create a few partitions with the right GUIDs and write 
some files to the OSD filesystem root). So the OSDs get set up with some 
custom code, but then normal usage just uses ceph-disk (it certainly 
doesn't care about extra partitions once everything is set up). This was 
formerly FileStore and now BlueStore, but it's a legacy setup. I expect 
to move this over to ceph-volume at some point.


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-18 Thread Ilya Dryomov
On Fri, Jan 18, 2019 at 9:25 AM Burkhard Linke
 wrote:
>
> Hi,
>
> On 1/17/19 7:27 PM, Void Star Nill wrote:
>
> Hi,
>
> We are trying to use Ceph in our products to address some of the use cases. We
> think the Ceph block device is a good fit for us. One of the use cases is that we have a number
> of jobs running in containers that need to have Read-Only access to shared 
> data. The data is written once and is consumed multiple times. I have read 
> through some of the similar discussions and the recommendations on using 
> CephFS for these situations, but in our case Block device makes more sense as 
> it fits well with other use cases and restrictions we have around this use 
> case.
>
> The following scenario seems to work as expected when we tried on a test 
> cluster, but we wanted to get an expert opinion to see if there would be any 
> issues in production. The usage scenario is as follows:
>
> - A block device is created with "--image-shared" options:
>
> rbd create mypool/foo --size 4G --image-shared
>
>
> - The image is mapped to a host, formatted in ext4 format (or other file 
> formats), mounted to a directory in read/write mode and data is written to 
> it. Please note that the image will be mapped in exclusive write mode -- no 
> other read/write mounts are allowed at this time.
>
> - The volume is unmapped from the host and then mapped on to N number of 
> other hosts where it will be mounted in read-only mode and the data is read 
> simultaneously from N readers
>
>
> There is no read-only ext4. Using the 'ro' mount option is by no means a 
> read-only access to the underlying storage. ext4 maintains a journal for 
> example, and needs to access and flush the journal on mount. You _WILL_ run 
> into unexpected issues.

Only if the journal needs replaying.  If you ensure a clean unmount
after writing the data, it shouldn't need to write to the underlying
block device on subsequent read-only mounts.

As an additional safeguard, map the image with -o ro.  This way the
block device will be read-only from the get-go.
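
For example:

rbd map mypool/foo -o ro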

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] quick questions about a 5-node homelab setup

2019-01-18 Thread Eugen Leitl


(Crossposting this from Reddit /r/ceph , since likely to have more technical 
audience present here).

I've scrounged up 5 old Atom Supermicro nodes and would like to run them 365/7 
for limited production as RBD with Bluestore (ideally latest 13.2.4 Mimic), 
triple copy redundancy. Underlying OS is a Debian 9 64 bit, minimal install.

Specs: 

3x Atom 330, 2 GB RAM, 1x SSD, 1x 2 TB HDD, dual 1G NICs (4x 1G but for one 
node actually which only has 2x Realtek, 2x Realtek, 2x Intel) - 1 NIC for 
private storage network, one front-facing

2x Atom D510, 4 GB RAM, 1x SSD, 1x 4 TB HDD, quad 1G NICs (4x Intel) - 1 NIC 
for private storage network, one front facing, one management (IPMI)

Jumbo frames enabled.

Question: can I use the following role distribution for the nodes?

OSD on every node (Bluestore), journal on SSD (do I need a directory, or a 
dedicated partition? How large, assuming 2 TB and 4 TB Bluestore HDDs?)

Can I run ceph-mon instances on the two D510, or would that already overload 
them? No sense to try running 2x monitors on D510 and one on the 330, right?

I've just realized that I'll also need ceph-mgr daemons on the hosts running 
ceph-mon. I don't see the added system resource requirements for these.

Assuming BlueStore is too fat for my crappy nodes, do I need to go to 
FileStore? If yes, then with xfs as the file system? Journal on the SSD as a 
directory, then?

Thanks!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-18 Thread Mykola Golub
On Thu, Jan 17, 2019 at 10:27:20AM -0800, Void Star Nill wrote:
> Hi,
> 
> We are trying to use Ceph in our products to address some of the use cases.
> We think the Ceph block device is a good fit for us. One of the use cases is that we have a
> number of jobs running in containers that need to have Read-Only access to
> shared data. The data is written once and is consumed multiple times. I
> have read through some of the similar discussions and the recommendations
> on using CephFS for these situations, but in our case Block device makes
> more sense as it fits well with other use cases and restrictions we have
> around this use case.
> 
> The following scenario seems to work as expected when we tried on a test
> cluster, but we wanted to get an expert opinion to see if there would be
> any issues in production. The usage scenario is as follows:
> 
> - A block device is created with "--image-shared" options:
> 
> rbd create mypool/foo --size 4G --image-shared

"--image-shared" just means that the created image will have
"exclusive-lock" feature and all other features that depend on it
disabled. It is useful for scenarios when one wants simulteous write
access to the image (e.g. when using a shared-disk cluster fs like
ocfs2) and does not want a performance penalty due to "exlusive-lock"
being pinged-ponged between writers.

For your scenario it is not necessary but is ok.

> - The image is mapped to a host, formatted in ext4 format (or other file
> formats), mounted to a directory in read/write mode and data is written to
> it. Please note that the image will be mapped in exclusive write mode -- no
> other read/write mounts are allowed at this time.

The map "exclusive" option works only for images with "exclusive-lock"
feature enabled and prevent in this case automatic exclusive lock
transitions (ping-pong mentioned above) from one writer to
another. And in this case it will not prevent from mapping and
mounting it ro and probably even rw (I am not familiar enough with
kernel rbd implementation to be sure here), though in the last case
the write will fail.

> - The volume is unmapped from the host and then mapped on to N number of
> other hosts where it will be mounted in read-only mode and the data is read
> simultaneously from N readers
> 
> As mentioned above, this seems to work as expected, but we wanted to
> confirm that we won't run into any unexpected issues.

It should work. Although, as you can see, rbd hardly protects against
simultaneous access in this case, so it has to be carefully organized at
a higher level. You may also consider creating a snapshot after modifying
the image and mapping and mounting the snapshot on the readers. This way
you can even modify the image without unmounting the readers, and then
remap/remount the new snapshot. And you get a rollback option for free.
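
A sketch of what I mean (image and snapshot names are arbitrary):

# on the writer, after the data is written and the fs cleanly unmounted
rbd snap create mypool/foo@v1

# on each reader, map and mount the snapshot read-only
rbd map mypool/foo@v1 -o ro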

Also, there is a valid concern mentioned by others that ext4 might want
to flush the journal if it is not clean, even when mounting ro. I
expect the mount will just fail in this case because the image is
mapped ro, but you might want to investigate how to improve this.
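
One thing that might be worth testing is ext4's noload mount option,
which skips loading the journal entirely (only safe when the writer
unmounted the filesystem cleanly), e.g.:

mount -o ro,noload /dev/rbd0 /mnt/data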

-- 
Mykola Golub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-01-18 Thread Marc Roos


Is there an overview of previous t-shirts? 


-Original Message-
From: Anthony D'Atri [mailto:a...@dreamsnake.net] 
Sent: 18 January 2019 01:07
To: Tim Serong
Cc: Ceph Development; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph Nautilus Release T-shirt Design

>> Lenz has provided this image that is currently being used for the 404 
>> page of the dashboard:
>> 
>> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/frontend/src/assets/1280px-Nautilus_Octopus.jpg
> 
> Nautilus *shells* are somewhat iconic/well known/distinctive.  Maybe a 
> variant of https://en.wikipedia.org/wiki/File:Nautilus_Section_cut.jpg
> would be interesting on a t-shirt?

I agree with Tim.  T-shirts with photos can be tricky; it's easy for 
them to look cheesy and they don't age well.

In the same vein, something with a lower bit depth that is not a 
cross-section might be slightly more recognizable:

https://www.vectorstock.com/royalty-free-vector/nautilus-vector-2806848

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-18 Thread Burkhard Linke

Hi,

On 1/17/19 7:27 PM, Void Star Nill wrote:

Hi,

We are trying to use Ceph in our products to address some of the use 
cases. We think the Ceph block device is a good fit for us. One of the 
use cases is that we have a number of jobs running in containers that need to have
Read-Only access to shared data. The data is written once and is 
consumed multiple times. I have read through some of the similar 
discussions and the recommendations on using CephFS for these 
situations, but in our case Block device makes more sense as it fits 
well with other use cases and restrictions we have around this use case.


The following scenario seems to work as expected when we tried on a 
test cluster, but we wanted to get an expert opinion to see if there 
would be any issues in production. The usage scenario is as follows:


- A block device is created with "--image-shared" options:

rbd create mypool/foo --size 4G --image-shared


- The image is mapped to a host, formatted in ext4 format (or other 
file formats), mounted to a directory in read/write mode and data is 
written to it. Please note that the image will be mapped in exclusive 
write mode -- no other read/write mounts are allowed at this time.


- The volume is unmapped from the host and then mapped on to N number 
of other hosts where it will be mounted in read-only mode and the data 
is read simultaneously from N readers



There is no read-only ext4. Using the 'ro' mount option is by no means a 
read-only access to the underlying storage. ext4 maintains a journal for 
example, and needs to access and flush the journal on mount. You _WILL_ 
run into unexpected issues.



There are filesystems that are intended for this use case like ocfs2. 
But they require extra overhead, since any parallel access to any kind 
of data has its cost.



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com