Re: [ceph-users] Ceph performance IOPS

2019-07-15 Thread Christian Wuerdig
Option 1 is the official way; option 2 will be a lot faster if it works for
you (I was never in a situation requiring this, so I can't say), and option 3
is for filestore and not applicable to bluestore.
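For reference, option 1 boils down to something like the following (device
names are placeholders only, adjust to your layout; untested sketch):

# after taking the OSD out and destroying/zapping it
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm prepare --bluestore --data /dev/sdX \
    --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2
ceph-volume lvm activate --all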

On Wed, 10 Jul 2019 at 07:55, Davis Mendoza Paco 
wrote:

> What would be the most appropriate procedure to move blockdb/wal to SSD?
>
> 1.- remove the OSD and recreate it (affects the performance)
> ceph-volume lvm prepare --bluestore --data <data device> --block.wal
> <wal device> --block.db <db device>
>
> 2.- Follow the documentation
>
> http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/
>
> 3.- Follow the documentation
>
> https://swamireddy.wordpress.com/2016/02/19/ceph-how-to-add-the-ssd-journal/
>
> Thanks for the help
>
> On Sun, 7 Jul 2019 at 14:39, Christian Wuerdig (<
> christian.wuer...@gmail.com>) wrote:
>
>> One thing to keep in mind is that the blockdb/wal becomes a Single Point
>> Of Failure for all OSDs using it. So if that SSD dies, essentially you have
>> to consider all OSDs using it as lost. I think most go with something like
>> 4-8 OSDs per blockdb/wal drive, but it really depends on how risk-averse you
>> are, what your budget is, etc. Given that you only have 5 nodes I'd probably
>> go for fewer OSDs per blockdb device.
>>
>>
>> On Sat, 6 Jul 2019 at 02:16, Davis Mendoza Paco 
>> wrote:
>>
>>> Hi all,
>>> I have installed Ceph Luminous with 5 nodes (45 OSDs); each OSD server
>>> supports up to 16 HDDs and I'm only using 9.
>>>
>>> I wanted to ask for help to improve IOPS performance since I have about
>>> 350 virtual machines of approximately 15 GB in size and I/O processes are
>>> very slow.
>>> What would you recommend?
>>>
>>> The Ceph documentation recommends using SSDs for the journal, so my
>>> question is:
>>> How many SSDs do I need per server so that the journals of the
>>> 9 OSDs can be moved onto SSDs?
>>>
>>> I currently use Ceph with OpenStack, on 11 servers running Debian
>>> Stretch:
>>> * 3 controller
>>> * 3 compute
>>> * 5 ceph-osd
>>>   network: 10Gb LACP bond
>>>   RAM: 96GB
>>>   HDD: 9 x 3TB SATA disks (bluestore)
>>>
>>> --
>>> *Davis Mendoza P.*
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
> --
> *Davis Mendoza P.*
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance IOPS

2019-07-07 Thread Christian Wuerdig
One thing to keep in mind is that the blockdb/wal becomes a Single Point Of
Failure for all OSDs using it. So if that SSD dies, essentially you have to
consider all OSDs using it as lost. I think most go with something like 4-8
OSDs per blockdb/wal drive, but it really depends on how risk-averse you are,
what your budget is, etc. Given that you only have 5 nodes I'd probably go
for fewer OSDs per blockdb device.
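If you want to check which OSDs currently share a given blockdb device, a
quick way is the following (run on the OSD node; from memory, so double-check
the output on your version):

ceph-volume lvm list                 # each OSD with its block/db/wal devices
ceph-volume lvm list /dev/nvme0n1    # limit the output to one device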


On Sat, 6 Jul 2019 at 02:16, Davis Mendoza Paco 
wrote:

> Hi all,
> I have installed Ceph Luminous with 5 nodes (45 OSDs); each OSD server
> supports up to 16 HDDs and I'm only using 9.
>
> I wanted to ask for help to improve IOPS performance since I have about
> 350 virtual machines of approximately 15 GB in size and I/O processes are
> very slow.
> What would you recommend?
>
> The Ceph documentation recommends using SSDs for the journal, so my
> question is:
> How many SSDs do I need per server so that the journals of the 9
> OSDs can be moved onto SSDs?
>
> I currently use Ceph with OpenStack, on 11 servers running Debian Stretch:
> * 3 controller
> * 3 compute
> * 5 ceph-osd
>   network: 10Gb LACP bond
>   RAM: 96GB
>   HDD: 9 x 3TB SATA disks (bluestore)
>
> --
> *Davis Mendoza P.*
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Thoughts on rocksdb and erasurecode

2019-06-26 Thread Christian Wuerdig
Hm, according to https://tracker.ceph.com/issues/24025 snappy compression
should be available out of the box at least since luminous. What ceph
version are you running?
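Something along these lines should tell you (osd.0 is just an example id; the
daemon command has to run on the node hosting that OSD):

ceph versions                                          # version per daemon type
ceph daemon osd.0 config get bluestore_rocksdb_options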

On Wed, 26 Jun 2019 at 21:51, Rafał Wądołowski 
wrote:

> We changed these settings. Our config now is:
>
> bluestore_rocksdb_options =
> "compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=50331648,target_file_size_base=50331648,max_background_compactions=31,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=603979776,max_bytes_for_level_multiplier=10,compaction_threads=32,flusher_threads=8"
>
> It can be changed without redeploying. It changes the SST files when
> compaction is triggered. The additional improvement is Snappy compression;
> we rebuilt Ceph with support for it. I can create a PR for it, if you want :)
>
>
> Best Regards,
>
> Rafał Wądołowski
> Cloud & Security Engineer
>
> On 25.06.2019 22:16, Christian Wuerdig wrote:
>
> The sizes are determined by rocksdb settings - some details can be found
> here: https://tracker.ceph.com/issues/24361
> One thing to note, in this thread
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030775.html
> it's noted that rocksdb could use up to 100% extra space during compaction
> so if you want to avoid spill over during compaction then safer values
> would be 6/60/600 GB
>
> You can change max_bytes_for_level_base and max_bytes_for_level_multiplier
> to suit your needs better but I'm not sure if that can be changed on the
> fly or if you have to re-create OSDs in order to make them apply
>
> On Tue, 25 Jun 2019 at 18:06, Rafał Wądołowski 
> wrote:
>
>> Why are you selected this specific sizes? Are there any tests/research on
>> it?
>>
>>
>> Best Regards,
>>
>> Rafał Wądołowski
>>
>> On 24.06.2019 13:05, Konstantin Shalygin wrote:
>>
>> Hi
>>
>> Have been thinking a bit about rocksdb and EC pools:
>>
>> Since a RADOS object written to a EC(k+m) pool is split into several
>> minor pieces, then the OSD will receive many more smaller objects,
>> compared to the amount it would receive in a replicated setup.
>>
>> This must mean that the rocksdb will also need to handle many more
>> entries, and will grow faster. This will have an impact when using
>> bluestore for slow HDD with DB on SSD drives, where the faster growing
>> rocksdb might result in spillover to slow store - if not taken into
>> consideration when designing the disk layout.
>>
>> Are my thoughts on the right track or am I missing something?
>>
>> Has somebody done any measurement on rocksdb growth, comparing replica
>> vs EC ?
>>
>> If you don't want to be affected by spillover of block.db - use a 3/30/300 GB
>> partition for your block.db.
>>
>>
>>
>> k
>>
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Thoughts on rocksdb and erasurecode

2019-06-25 Thread Christian Wuerdig
The sizes are determined by rocksdb settings - some details can be found
here: https://tracker.ceph.com/issues/24361
One thing to note, in this thread
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030775.html
it's noted that rocksdb could use up to 100% extra space during compaction
so if you want to avoid spill over during compaction then safer values
would be 6/60/600 GB

You can change max_bytes_for_level_base and max_bytes_for_level_multiplier
to suit your needs better but I'm not sure if that can be changed on the
fly or if you have to re-create OSDs in order to make them apply
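For what it's worth, the often-quoted 3/30/300 GB steps roughly follow from
the level sizing; a back-of-the-envelope sketch, assuming the commonly cited
defaults of max_bytes_for_level_base=256MB and a multiplier of 10 (your
build's defaults may differ):

base=$((256*1024*1024)); mult=10; total=0
for lvl in 1 2 3; do
    size=$(( base * mult ** (lvl - 1) ))
    total=$(( total + size ))
    printf 'L%d ~ %d MiB, cumulative ~ %d MiB\n' "$lvl" $((size/1024/1024)) $((total/1024/1024))
done
# prints L1 ~ 256 MiB, L2 ~ 2560 MiB (cumulative ~2.8 GiB), L3 ~ 25600 MiB
# (cumulative ~28 GiB) - double those figures to get headroom for compaction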

On Tue, 25 Jun 2019 at 18:06, Rafał Wądołowski 
wrote:

> Why are you selected this specific sizes? Are there any tests/research on
> it?
>
>
> Best Regards,
>
> Rafał Wądołowski
>
> On 24.06.2019 13:05, Konstantin Shalygin wrote:
>
> Hi
>
> Have been thinking a bit about rocksdb and EC pools:
>
> Since a RADOS object written to a EC(k+m) pool is split into several
> minor pieces, then the OSD will receive many more smaller objects,
> compared to the amount it would receive in a replicated setup.
>
> This must mean that the rocksdb will also need to handle many more
> entries, and will grow faster. This will have an impact when using
> bluestore for slow HDD with DB on SSD drives, where the faster growing
> rocksdb might result in spillover to slow store - if not taken into
> consideration when designing the disk layout.
>
> Are my thoughts on the right track or am I missing something?
>
> Has somebody done any measurement on rocksdb growth, comparing replica
> vs EC ?
>
> If you don't want to be affected by spillover of block.db - use a 3/30/300 GB
> partition for your block.db.
>
>
>
> k
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus, k+m erasure coding a profile vs size+min_size

2019-05-21 Thread Christian Wuerdig
The simple answer is because k+1 is the default min_size for EC pools.
min_size means that the pool will still accept writes if that many failure
domains are still available. If you set min_size to k then you have entered
dangerous territory: if you lose another failure domain (OSD or
host) while the pool is recovering, you will potentially lose data. Same as
why min_size=1 is a bad idea for replicated pools (which has been
extensively discussed on this list)
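You can check what your pool currently uses with:

ceph osd pool get cephfs_data min_size
# lowering it to k is exactly the risky setting described above - only do this
# if you accept that risk:
ceph osd pool set cephfs_data min_size 4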

On Tue, 21 May 2019 at 12:52, Yoann Moulin  wrote:

> Dear all,
>
> I am doing some tests with Nautilus and cephfs on erasure coding pool.
>
> I noticed something strange between k+m in my erasure profile and
> size+min_size in the pool created:
>
> > test@icadmin004:~$ ceph osd erasure-code-profile get ecpool-4-2
> > crush-device-class=
> > crush-failure-domain=osd
> > crush-root=default
> > jerasure-per-chunk-alignment=false
> > k=4
> > m=2
> > plugin=jerasure
> > technique=reed_sol_van
> > w=8
>
> > test@icadmin004:~$ ceph --cluster test osd pool create cephfs_data 8 8
> erasure ecpool-4-2
> > pool 'cephfs_data' created
>
> > test@icadmin004:~$ ceph osd pool ls detail | grep cephfs_data
> > pool 14 'cephfs_data' erasure size 6 min_size 5 crush_rule 1 object_hash
> rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 2646 flags
> hashpspool stripe_width 16384
>
> Why min_size = 5 and not 4 ?
>
> Best,
>
> --
> Yoann Moulin
> EPFL IC-IT
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does CEPH calculates PGs per OSD for erasure coded (EC) pools?

2019-04-29 Thread Christian Wuerdig
On Sun, 28 Apr 2019 at 21:45, Igor Podlesny  wrote:

> On Sun, 28 Apr 2019 at 16:14, Paul Emmerich 
> wrote:
> > Use k+m for PG calculation, that value also shows up as "erasure size"
> > in ceph osd pool ls detail
>
> So does it mean that for PG calculation those 2 pools are equivalent:
>
> 1) EC(4, 2)
> 2) replicated, size 6
>

Correct


>
> ? Sounds weird to be honest. Replicated with size 6 means each logical
> data is stored 6 times, so what needed a single PG now requires 6 PGs.
> And with EC(4, 2) there's still only 1.5 overhead in terms of raw
> occupied space -- how come PG calculation distribution needs adjusting
> to 6 instead of 1.5 then?
>

A single logical data unit (an object in ceph terms) will be allocated to a
single PG. For a replicated pool of size n this PG will simply be stored on
n OSDs. For an EC(k+m) pool this PG will get stored on k+m OSDs with the
difference that this single PG will contain different parts of the data on
the different OSDs.
http://docs.ceph.com/docs/master/architecture/#erasure-coding provides a
good overview on how this is actually achieved.
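So for a rough PGs-per-OSD estimate you can treat k+m like the replica count;
a sketch with made-up numbers:

pg_num=1024; k=4; m=2; num_osds=60
echo $(( pg_num * (k + m) / num_osds ))   # ~102 PG copies per OSD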


> Also, why does CEPH documentation say "It is equivalent to a
> replicated pool of size __two__" when describing EC(2, 1) example?
>

This relates to fault tolerance. A replicated pool of size 2 can lose one
OSD without data loss, and so can an EC(2+1) pool.


>
> --
> End of message. Next message?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread Christian Wuerdig
If you use librados directly it's up to you to ensure you can identify your
objects. Generally RADOS stores objects and not files so when you provide
your object ids you need to come up with a convention so you can correctly
identify them. If you need to provide meta data (i.e. a list of all
existing backups, when they were taken etc.) then again you need to manage
that yourself (probably in dedicated meta-data objects). Using RADOS
namespaces (like one per database) is probably a good idea.
Also keep in mind that for example Bluestore has a maximum object size of
4GB, so mapping files 1:1 to objects is probably not a wise approach and you
should break up your files into smaller chunks when storing them. There is
libradosstriper which handles the striping of large objects transparently
but not sure if that has support for RADOS namespaces.
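Just to illustrate the naming-convention idea with the rados CLI (pool,
namespace and object names are made up, and check the exact flags on your
version - this is from memory):

# client A stores a chunk of a backup for database dbA in its own namespace
rados -p backups -N dbA put dbA/2019-01-22T1200/dump.part-0000 ./dump.part-0000
# client Z lists what exists for that database
rados -p backups -N dbA ls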

Using RGW instead might be an easier route to go down

On Wed, 23 Jan 2019 at 10:10, cmonty14 <74cmo...@gmail.com> wrote:

> My backup client is using librados.
> I understand that defining a pool for the same application is recommended.
>
> However this would not answer my other questions:
> How can I identify a backup created by client A that I want to restore
> on another client Z?
> I mean typically client A would write a backup file identified by the
> filename.
> Would it be possible on client Z to identify this backup file by
> filename? If yes, how?
>
> On Tue, 22 Jan 2019 at 15:07,  wrote:
> >
> > Hi,
> >
> > Ceph's pools are meant to let you define specific engineering rules
> > and/or applications (rbd, cephfs, rgw).
> > They are not designed to be created in massive numbers (see pgs etc).
> > So, create a pool for each engineering ruleset, and store your data in
> > them.
> > For what is left of your project, I believe you have to implement that
> > on top of Ceph.
> >
> > For instance, let say you simply create a pool, with a rbd volume in it
> > You then create a filesystem on that, and map it on some server
> > Finally, you can push your files on that mountpoint, using various
> > Linux's user, acl or whatever : beyond that point, there is nothing more
> > specific to Ceph, it is "just" a mounted filesystem
> >
> > Regards,
> >
> > On 01/22/2019 02:16 PM, cmonty14 wrote:
> > > Hi,
> > >
> > > my use case for Ceph is providing a central backup storage.
> > > This means I will backup multiple databases in Ceph storage cluster.
> > >
> > > This is my question:
> > > What is the best practice for creating pools & images?
> > > Should I create multiple pools, means one pool per database?
> > > Or should I create a single pool "backup" and use namespace when
> writing
> > > data in the pool?
> > >
> > > This is the security demand that should be considered:
> > > DB-owner A can only modify the files that belong to A; other files
> > > (owned by B, C or D) are accessible for A.
> > >
> > > And there's another issue:
> > > How can I identify a backup created by client A that I want to restore
> > > on another client Z?
> > > I mean typically client A would write a backup file identified by the
> > > filename.
> > > Would it be possible on client Z to identify this backup file by
> > > filename? If yes, how?
> > >
> > >
> > > THX
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] How much RAM and CPU cores would you recommend when using ceph only as block storage for KVM?

2018-08-19 Thread Christian Wuerdig
Depends a little bit on what your expected work-load will be, but again,
generally you want to aim for one logical core per OSD - so a 14-core
CPU in your case; 16 wouldn't hurt. Ceph tends to eat heavily into
resources (RAM and/or CPU) when it needs to recover from a problem,
which is exactly the situation where you really don't want to have to deal
with under-resourced hardware.

On Wed, 8 Aug 2018 at 12:26, Cheyenne Forbes
 wrote:
>
> Next time I will ask there, any number of core recommendation?
>
> Regards,
>
> Cheyenne O. Forbes
>
>
> On Tue, Aug 7, 2018 at 2:49 PM, Christian Wuerdig 
>  wrote:
>>
>> ceph-users is a better place to ask this kind of question.
>>
>> Anyway the 1GB RAM per TB of storage recommendation still stands as far as I
>> know, plus you want some for the OS and some safety margin, so in your case
>> 64GB seems sensible
>>
>> On Wed, 8 Aug 2018, 01:51 Cheyenne Forbes,  
>> wrote:
>>>
>>> The case is 28TB (14x2TB drives) of storage for each data server.
>>>
>>> For both Replicated and Erasure Coded.
>>>
>>> Regards,
>>>
>>> Cheyenne O. Forbes
>>> ___
>>> Ceph-community mailing list
>>> ceph-commun...@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-19 Thread Christian Wuerdig
It should be added though that you're running at only 1/3 of the
recommended RAM usage for the OSD setup alone - not to mention that
you also co-host MON, MGR and MDS daemons on there. The next time you
run into an issue - in particular with OSD recovery - you may be in a
pickle again and then it might not be so easy to get going.
On Fri, 17 Aug 2018 at 02:48, Jonathan Woytek  wrote:
>
> On Thu, Aug 16, 2018 at 10:15 AM, Gregory Farnum  wrote:
> > Do note that while this works and is unlikely to break anything, it's
> > not entirely ideal. The MDS was trying to probe the size and mtime of
> > any files which were opened by clients that have since disappeared. By
> > removing that list of open files, it can't do that any more, so you
> > may have some inaccurate metadata about individual file sizes or
> > mtimes.
>
> Understood, and thank you for the additional details. However, when
> the difference is having a working filesystem, or having a filesystem
> permanently down because the ceph-mds rejoin is impossible to
> complete, I'll accept the risk involved. I'd prefer to see the rejoin
> process able to proceed without chewing up memory until the machine
> deadlocks on itself, but I don't yet know enough about the internals
> of the rejoin process to even attempt to comment on how that could be
> done. Ideally, it seems like flushing the current recovery/rejoin
> status periodically and monitoring memory usage during recovery would
> help to fix the problem. From what I could see, ceph-mds just
> continued to allocate memory as it processed every open handle, and
> never released any of it until it was killed.
>
> jonathan
> --
> Jonathan Woytek
> http://www.dryrose.com
> KB3HOZ
> PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] How much RAM and CPU cores would you recommend when using ceph only as block storage for KVM?

2018-08-07 Thread Christian Wuerdig
ceph-users is a better place to ask this kind of question.

Anyway the 1GB RAM per TB of storage recommendation still stands as far as I
know, plus you want some for the OS and some safety margin, so in your case
64GB seems sensible.

On Wed, 8 Aug 2018, 01:51 Cheyenne Forbes, 
wrote:

> The case is 28TB (14x2TB drives) of storage for each data server.
>
> For both Replicated and Erasure Coded.
>
> Regards,
>
> Cheyenne O. Forbes
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Default erasure code profile and sustaining loss of one host containing 4 OSDs

2018-07-22 Thread Christian Wuerdig
Generally the recommendation is: if your redundancy is X you should have at
least X+1 entities in your failure domain to allow ceph to automatically
self-heal

Given your setup of 6 servers and a failure domain of host, you should
select k+m=5 at most. So 3+2 should make for a good profile in your case.

You could go to 4+2, but then the loss of a host means Ceph can't auto-heal,
meaning if another server gets into trouble before the first one is replaced,
your cluster will become unavailable.

Please note that you can't change the EC profile of an existing pool -
you'll need to create a new pool and copy the data over if you want to
change your current profile
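A sketch of what that would look like (pool name and PG count are just
examples):

ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
ceph osd pool create mypool-ec32 256 256 erasure ec-3-2
# then migrate the data from the old pool to the new one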

Cheers
Christian

On Sat, 21 Jul 2018, 01:52 Ziggy Maes,  wrote:

> Hello Caspar
>
>
>
> That makes a great deal of sense, thank you for elaborating. Am I correct
> to assume that if we were to use a k=2, m=2 profile, it would be identical
> to a replicated pool (since there would be an equal amount of data and
> parity chunks)? Furthermore, how should the proper erasure profile be
> determined then? Are we to strive for as high a data chunk
> value (k) as possible and a low parity/coding value (m)?
>
>
>
> Kind regards
>
>
> *Ziggy Maes *DevOps Engineer
> CELL +32 478 644 354
> SKYPE Ziggy.Maes
>
>
> *www.be-mobile.com *
>
>
>
>
>
> *From: *Caspar Smit 
> *Date: *Friday, 20 July 2018 at 14:15
> *To: *Ziggy Maes 
> *Cc: *"ceph-users@lists.ceph.com" 
> *Subject: *Re: [ceph-users] Default erasure code profile and sustaining
> loss of one host containing 4 OSDs
>
>
>
> Ziggy,
>
>
>
> For EC pools: min_size = k+1
>
>
>
> So in your case (m=1) -> min_size is 3  which is the same as the number of
> shards. So if ANY shard goes down, IO is freezed.
>
>
>
> If you choose m=2 min_size will still be 3 but you now have 4 shards (k+m
> = 4) so you can loose a shard and still remain availability.
>
>
>
> Of course a failure domain of 'host' is required to do this but since you
> have 6 hosts that would be ok.
>
>
> Kind regards,
>
> Caspar Smit
> Systemengineer
> SuperNAS
> Dorsvlegelstraat 13
> 1445 PA Purmerend
>
> t: (+31) 299 410 414
> e: caspars...@supernas.eu
> w: www.supernas.eu
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] amount of PGs/pools/OSDs for your openstack / Ceph

2018-04-07 Thread Christian Wuerdig
The general recommendation is to target around 100 PG/OSD. Have you tried
the https://ceph.com/pgcalc/ tool?
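A quick way to see where you currently sit (the PGS column is the per-OSD
count):

ceph osd df
ceph osd pool ls detail   # pg_num and size per pool, the inputs to the calculation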

On Wed, 4 Apr 2018 at 21:38, Osama Hasebou  wrote:

> Hi Everyone,
>
> I would like to know what kind of setup the Ceph community has been using
> for their OpenStack Ceph configuration when it comes to the number of Pools &
> OSDs and their PGs.
>
> The Ceph documentation briefly mentions it for small cluster sizes, and I would
> like to know from your experience how many PGs you have created for your
> OpenStack pools in reality, for a Ceph cluster ranging from 1-2 PB capacity
> or 400-600 OSDs, that performs well without issues.
>
> Hope to hear from you!
>
> Thanks.
>
> Regards,
> Ossi
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-22 Thread Christian Wuerdig
I think the primary area where people are concerned about latency is rbd
and 4k block size access. OTOH 2.3us latency seems to be 2 orders of
magnitude below what seems to be realistically achievable on a real
world cluster anyway (
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-July/011731.html)
so I don't really think the basic latency difference of copper vs fiber
as listed makes much of a difference at this point.

On Thu, 22 Mar 2018 at 17:14, Subhachandra Chandra 
wrote:

> Latency is a concern if your application is sending one packet at a time
> and waiting for a reply. If you are streaming large blocks of data, the
> first packet is delayed by the network latency but after that you will
> receive a 10Gbps stream continuously. The latency for jumbo frames vs 1500
> byte frames depends upon the switch type. On a cut-through switch there is
> very little difference but on a store-and-forward switch it will be
> proportional to packet size. Most modern switching ASICs are capable of
> cut-through operation.
>
> Subhachandra
>
> On Wed, Mar 21, 2018 at 7:15 AM, Willem Jan Withagen 
> wrote:
>
>> On 21-3-2018 13:47, Paul Emmerich wrote:
>> > Hi,
>> >
>> > 2.3µs is a typical delay for a 10GBASE-T connection. But fiber or SFP+
>> > DAC connections should be faster: switches are typically in the range of
>> > ~500ns to 1µs.
>> >
>> >
>> > But you'll find that this small difference in latency induced by the
>> > switch will be quite irrelevant in the grand scheme of things when using
>> > the Linux network stack...
>>
>> But I think it does when people start to worry about selecting High
>> clock speed CPUS versus packages with more cores...
>>
>> 900ns is quite a lot if you have that mindset.
>> And probably 1800ns at that, because the delay will be at both ends.
>> Or perhaps even 3600ns because the delay is added at every ethernet
>> connector???
>>
>> But I'm inclined to believe you that the network stack could take quite
>> some time...
>>
>>
>> --WjW
>>
>>
>> > Paul
>> >
>> > 2018-03-21 12:16 GMT+01:00 Willem Jan Withagen > > >:
>> >
>> > Hi,
>> >
>> > I just ran into this table for a 10G Netgear switch we use:
>> >
>> > Fiber delays:
>> > 10 Gbps fiber delay (64-byte packets): 1.827 µs
>> > 10 Gbps fiber delay (512-byte packets): 1.919 µs
>> > 10 Gbps fiber delay (1024-byte packets): 1.971 µs
>> > 10 Gbps fiber delay (1518-byte packets): 1.905 µs
>> >
>> > Copper delays:
>> > 10 Gbps copper delay (64-byte packets): 2.728 µs
>> > 10 Gbps copper delay (512-byte packets): 2.85 µs
>> > 10 Gbps copper delay (1024-byte packets): 2.904 µs
>> > 10 Gbps copper delay (1518-byte packets): 2.841 µs
>> >
>> > Fiber delays:
>> > 1 Gbps fiber delay (64-byte packets) 2.289 µs
>> > 1 Gbps fiber delay (512-byte packets) 2.393 µs
>> > 1 Gbps fiber delay (1024-byte packets) 2.423 µs
>> > 1 Gbps fiber delay (1518-byte packets) 2.379 µs
>> >
>> > Copper delays:
>> > 1 Gbps copper delay (64-byte packets) 2.707 µs
>> > 1 Gbps copper delay (512-byte packets) 2.821 µs
>> > 1 Gbps copper delay (1024-byte packets) 2.866 µs
>> > 1 Gbps copper delay (1518-byte packets) 2.826 µs
>> >
>> > So the difference is serious: 900ns on a total of 1900ns for a 10G
>> > packet.
>> > Another strange thing is that 1K packets are slower than 1518 bytes.
>> >
>> > So that might warrant connecting boxes preferably with optics
>> > instead of CAT cabling if you are trying to squeeze the max out of
>> > a setup.
>> >
>> > Sad thing is that they do not report for jumbo frames, and doing these
>> > measurements yourself is not easy...
>> >
>> > --WjW
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com 
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > 
>> >
>> >
>> >
>> >
>> > --
>> > --
>> > Paul Emmerich
>> >
>> > croit GmbH
>> > Freseniusstr. 31h
>> > 81247 München
>> > www.croit.io 
>> > Tel: +49 89 1896585 90
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS Metadata corruption while activating OSD

2018-03-11 Thread Christian Wuerdig
Hm, so you're running OSD nodes with 2GB of RAM and 2x10TB = 20TB of
storage? Literally everything posted on this list in relation to HW
requirements and related problems will tell you that this simply isn't
going to work. The slightest hint of a problem will simply kill the OSD
nodes with OOM. Have you tried with smaller disks - like 1TB models (or
even smaller, like 256GB SSDs) to see if the same problem persists?


On Tue, 6 Mar 2018 at 10:51, 赵赵贺东  wrote:

> Hello ceph-users,
>
> It is a really really *Really* tough problem for our team.
> We have investigated the problem for a long time and tried a lot of things, but
> can't solve it; even the concrete cause of the problem is still
> unclear to us!
> So, any solution/suggestion/opinion whatsoever will be highly,
> highly appreciated!!!
>
> Problem Summary:
> When we activate an OSD, there will be metadata corruption on the
> disk being activated; the probability is 100%!
>
> Admin Nodes node:
> Platform: X86
> OS: Ubuntu 16.04
> Kernel: 4.12.0
> Ceph: Luminous 12.2.2
>
> OSD nodes:
> Platform: armv7
> OS:   Ubuntu 14.04
> Kernel:   4.4.39
> Ceph: Lominous 12.2.2
> Disk: 10T+10T
> Memory: 2GB
>
> Deploy log:
>
>
> dmesg log: (Sorry, the arms001-01 dmesg log has been lost, but the error
> messages about metadata corruption on arms003-10 are the same as on
> arms001-01)
> Mar  5 11:08:49 arms003-10 kernel: [  252.534232] XFS (sda1): Unmount and
> run xfs_repair
> Mar  5 11:08:49 arms003-10 kernel: [  252.539100] XFS (sda1): First 64
> bytes of corrupted metadata buffer:
> Mar  5 11:08:49 arms003-10 kernel: [  252.545504] eb82f000: 58 46 53 42 00
> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
> Mar  5 11:08:49 arms003-10 kernel: [  252.553569] eb82f010: 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00  
> Mar  5 11:08:49 arms003-10 kernel: [  252.561624] eb82f020: fc 4e e3 89 50
> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
> Mar  5 11:08:49 arms003-10 kernel: [  252.569706] eb82f030: 00 00 00 00 80
> 00 00 07 ff ff ff ff ff ff ff ff  
> Mar  5 11:08:49 arms003-10 kernel: [  252.58] XFS (sda1): metadata I/O
> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
> Mar  5 11:08:49 arms003-10 kernel: [  252.602944] XFS (sda1): Metadata
> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data
> block 0x48b9ff80
> Mar  5 11:08:49 arms003-10 kernel: [  252.614170] XFS (sda1): Unmount and
> run xfs_repair
> Mar  5 11:08:49 arms003-10 kernel: [  252.619030] XFS (sda1): First 64
> bytes of corrupted metadata buffer:
> Mar  5 11:08:49 arms003-10 kernel: [  252.625403] eb901000: 58 46 53 42 00
> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
> Mar  5 11:08:49 arms003-10 kernel: [  252.633441] eb901010: 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00  
> Mar  5 11:08:49 arms003-10 kernel: [  252.641474] eb901020: fc 4e e3 89 50
> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
> Mar  5 11:08:49 arms003-10 kernel: [  252.649519] eb901030: 00 00 00 00 80
> 00 00 07 ff ff ff ff ff ff ff ff  
> Mar  5 11:08:49 arms003-10 kernel: [  252.657554] XFS (sda1): metadata I/O
> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
> Mar  5 11:08:49 arms003-10 kernel: [  252.675056] XFS (sda1): Metadata
> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data
> block 0x48b9ff80
> Mar  5 11:08:49 arms003-10 kernel: [  252.686228] XFS (sda1): Unmount and
> run xfs_repair
> Mar  5 11:08:49 arms003-10 kernel: [  252.691054] XFS (sda1): First 64
> bytes of corrupted metadata buffer:
> Mar  5 11:08:49 arms003-10 kernel: [  252.697425] eb901000: 58 46 53 42 00
> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
> Mar  5 11:08:49 arms003-10 kernel: [  252.705459] eb901010: 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00  
> Mar  5 11:08:49 arms003-10 kernel: [  252.713489] eb901020: fc 4e e3 89 50
> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
> Mar  5 11:08:49 arms003-10 kernel: [  252.721520] eb901030: 00 00 00 00 80
> 00 00 07 ff ff ff ff ff ff ff ff  
> Mar  5 11:08:49 arms003-10 kernel: [  252.729558] XFS (sda1): metadata I/O
> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
> Mar  5 11:08:49 arms003-10 kernel: [  252.741953] XFS (sda1): Metadata
> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data
> block 0x48b9ff80
> Mar  5 11:08:49 arms003-10 kernel: [  252.753139] XFS (sda1): Unmount and
> run xfs_repair
> Mar  5 11:08:49 arms003-10 kernel: [  252.757955] XFS (sda1): First 64
> bytes of corrupted metadata buffer:
> Mar  5 11:08:49 arms003-10 kernel: [  252.764336] eb901000: 58 46 53 42 00
> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
> Mar  5 11:08:49 arms003-10 kernel: [  252.772365] eb901010: 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00  
> Mar  5 11:08:49 arms003-10 

Re: [ceph-users] Disaster Backups

2018-02-01 Thread Christian Wuerdig
In the case of bluestore, if your blockdb is on a different drive from the
OSD and that's included in your hardware loss, then I think you're
pretty much toast. Not sure if you can re-build the blockdb from the
OSD data somehow. In the case of filestore, if you lose your journal drive
you also risk data corruption.


On 1 February 2018 at 08:05, Dyweni - Ceph-Users
<6exbab4fy...@dyweni.com> wrote:
> Hi,
>
> I'm trying to plan for a disaster, in which all data and all hardware
> (excluding the full set of Ceph OSD data drives) is lost.  What data do I
> need to back up in order to put those drives into new machines and start up my
> cluster?
>
> Would a flat file backup of /var/lib/ceph/mon  (while the monitor daemon is
> running) be sufficient?
>
> Thanks,
> Dyweni
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have I configured erasure coding wrong ?

2018-01-14 Thread Christian Wuerdig
Depends on what you mean by "your pool overloads"? What's your
hardware setup (CPU, RAM, how many nodes, network etc.)? What can you
see when you monitor the system resources with atop or the like?

On Sat, Jan 13, 2018 at 8:59 PM, Mike O'Connor  wrote:
> I followed the announcement of Luminous and erasure coding when I
> configured my system. Could this be the reason why my pool overloads
> when I push too much data to it?
>
>
> root@pve:/#  ceph osd erasure-code-profile get ec-42-profile
> crush-device-class=hdd
> crush-failure-domain=osd
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> Thanks
> Mike
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues on Luminous

2018-01-05 Thread Christian Wuerdig
You should do your reference test with dd with oflag=direct,dsync

direct will only bypass the cache, while dsync will fsync on every
block, which is much closer to the reality of what Ceph is doing, afaik.
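Applied to your earlier test that would be something like (destructive, so
only run it against a scratch volume):

dd if=/dev/zero of=/dev/vdc bs=4k count=1000 oflag=direct,dsync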

On Thu, Jan 4, 2018 at 9:54 PM, Rafał Wądołowski
 wrote:
> Hi folks,
>
> I am currently benchmarking my cluster for a performance issue and I have
> no idea what is going on. I am using these devices in qemu.
>
> Ceph version 12.2.2
>
> Infrastructure:
>
> 3 x Ceph-mon
>
> 11 x Ceph-osd
>
> Ceph-osd has 22x1TB Samsung SSD 850 EVO 1TB
>
> 96GB RAM
>
> 2x E5-2650 v4
>
> 4x10G Network (2 seperate bounds for cluster and public) with MTU 9000
>
>
> I had tested it with rados bench:
>
> # rados bench -p rbdbench 30 write -t 1
>
> Total time run: 30.055677
> Total writes made:  1199
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 159.571
> Stddev Bandwidth:   6.83601
> Max bandwidth (MB/sec): 168
> Min bandwidth (MB/sec): 140
> Average IOPS:   39
> Stddev IOPS:1
> Max IOPS:   42
> Min IOPS:   35
> Average Latency(s): 0.0250656
> Stddev Latency(s):  0.00321545
> Max latency(s): 0.0471699
> Min latency(s): 0.0206325
>
> # ceph tell osd.0 bench
> {
> "bytes_written": 1073741824,
> "blocksize": 4194304,
> "bytes_per_sec": 414199397
> }
>
> Testing osd directly
>
> # dd if=/dev/zero of=/dev/sdc bs=4M oflag=direct count=100
> 100+0 records in
> 100+0 records out
> 419430400 bytes (419 MB, 400 MiB) copied, 1.0066 s, 417 MB/s
>
> When I do dd inside the VM (bs=4M with direct), I get a result like in rados
> bench.
>
> I think that the speed should be around ~400MB/s.
>
> Are there any new parameters for rbd in Luminous? Maybe I forgot about some
> performance tricks? If more information is needed, feel free to ask.
>
> --
> BR,
> Rafal Wadolowski
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing PG number

2018-01-03 Thread Christian Wuerdig
A while back there was a thread on the ML where someone posted a bash
script to slowly increase the number of PGs in steps of 256 AFAIR; the
script would monitor the cluster activity and, once all data shuffling
had finished, it would do another round until the target was hit.

That was on filestore though and hammer or jewel, not sure if you can
go faster on bluestore or luminous in general.
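From memory the idea was roughly this (sketch only, untested; pool name,
target and step size are placeholders to adjust):

pool=ecpool; target=512; step=256
current=$(ceph osd pool get "$pool" pg_num | awk '{print $2}')
while [ "$current" -lt "$target" ]; do
    next=$(( current + step )); [ "$next" -gt "$target" ] && next=$target
    ceph osd pool set "$pool" pg_num "$next"
    ceph osd pool set "$pool" pgp_num "$next"
    # wait for the splitting/backfill from this step to settle before the next round
    while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
    current=$next
done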

On Thu, Jan 4, 2018 at 12:04 AM,   wrote:
> Last summer we increased an EC 8+3 pool from 1024 to 2048 PGs on our ~1500
> OSD (Kraken) cluster. This pool contained ~2 petabytes of data at the time.
>
>
>
> We did a fair amount of testing on a throwaway pool on the same cluster
> beforehand, starting with small increases (16/32/64).
>
>
>
> The main observation was that the act of splitting the PGs causes issues,
> not the resulting data movement, assuming your backfills are tuned to a
> level where they don’t affect client IO.
>
>
>
> As the PG splitting and peering (pg_num and pgp_num) increases are a) non
> reversible and b) the resulting operations happen instantaneously, overly
> large increases can end up with an unhappy mess of excessive storage node
> load, OSDs flapping and blocked requests.
>
>
>
> We ended up doing increases of 128 PGs at a time.
>
>
>
> I’d hazard a guess that you will be fine going straight to 512 PGs, but the
> only way to be sure of the correct increase size for your cluster is to test
> it.
>
>
>
> Cheers
>
> Tom
>
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Karun Josy
> Sent: 02 January 2018 16:23
> To: Hans van den Bogert 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Increasing PG number
>
>
>
> https://access.redhat.com/solutions/2457321
>
> It says it is a very intensive process and can affect cluster performance.
>
>
>
> Our Version is Luminous 12.2.2
>
> And we are using erasure coding profile for a pool 'ecpool' with k=5 and m=3
>
> Current PG number is 256 and it has about 20 TB of data.
>
>
> Should I increase it gradually? Or set pg as 512 in one step ?
>
>
>
>
>
>
>
>
> Karun Josy
>
>
>
> On Tue, Jan 2, 2018 at 9:26 PM, Hans van den Bogert 
> wrote:
>
> Please refer to standard documentation as much as possible,
>
>
>
>
> http://docs.ceph.com/docs/jewel/rados/operations/placement-groups/#set-the-number-of-placement-groups
>
>
>
> Han’s is also incomplete, since you also need to change the ‘pgp_num’ as
> well.
>
>
>
> Regards,
>
>
>
> Hans
>
>
>
> On Jan 2, 2018, at 4:41 PM, Vladimir Prokofev  wrote:
>
>
>
> Increased number of PGs in multiple pools in a production cluster on 12.2.2
> recently - zero issues.
>
> CEPH claims that increasing pg_num and pgp_num are safe operations, which
> are essential for its ability to scale, and this sounds pretty reasonable
> to me. [1]
>
>
>
>
>
> [1]
> https://www.sebastien-han.fr/blog/2013/03/12/ceph-change-pg-number-on-the-fly/
>
>
>
> 2018-01-02 18:21 GMT+03:00 Karun Josy :
>
> Hi,
>
>
>
>  Initial PG count was not properly planned while setting up the cluster, so
> now there are fewer than 50 PGs per OSD.
>
>
>
> What are the best practices to increase the PG number of a pool?
>
> We have replicated pools as well as EC pools.
>
>
>
> Or is it better to create a new pool with higher PG numbers?
>
>
>
>
>
> Karun
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about pg num setting

2018-01-03 Thread Christian Wuerdig
Well, there is a setting for the minimum number of pgs per OSD (mon pg
warn min per osd, see
http://docs.ceph.com/docs/master/rados/configuration/pool-pg-config-ref/)
and there will be a HEALTH_WARN state if you have too few. As far as I
know, not having enough PGs can cause trouble, with CRUSH sometimes not
being able to place an object.
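To see the thresholds and where you currently are (mon.a is just an example
id; run it on a monitor host):

ceph daemon mon.a config show | grep mon_pg_warn
ceph osd df                       # PGS column shows the per-OSD count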

On Wed, Jan 3, 2018 at 11:10 PM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:
>
>
> Is there a disadvantage to just always start pg_num and pgp_num with
> something low like 8, and then later increase it when necessary?
>
> Question is then how to identify when necessary
>
>
>
>
>
> -----Original Message-----
> From: Christian Wuerdig [mailto:christian.wuer...@gmail.com]
> Sent: Tuesday, 2 January 2018 19:40
> To: 于相洋
> Cc: Ceph-User
> Subject: Re: [ceph-users] Questions about pg num setting
>
> Have you had a look at http://ceph.com/pgcalc/?
>
> Generally if you have too many PGs per OSD you can get yourself into
> trouble during recovery and backfilling operations consuming a lot more
> RAM than you have and eventually making your cluster unusable (some more
> info can be found here for example:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013614.html
> but there are other threads on the ML).
> Also currently you cannot reduce the number of PGs for a pool so you are
> much better off starting with a lower value and then gradually increasing
> it.
>
> The fact that the ceph developers introduced a config option which
> prevents users from increasing the number of PGs if it exceeds the
> configured limit should be a tell-tale sign that having too many PGs per
> OSD is considered a problem (see also
> https://bugzilla.redhat.com/show_bug.cgi?id=1489064 and linked PRs)
>
> On Wed, Dec 27, 2017 at 3:15 PM, 于相洋 <penglai...@gmail.com> wrote:
>> Hi cephers,
>>
>> I have two questions about pg number setting.
>>
>> First :
>> My storage information is shown below:
>> HDD: 10 * 8TB
>> CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (24 cores)
>> Memory: 64GB
>>
>> As my HDD capacity and my memory are quite large, I want to set as many
>> as 300 pgs per OSD, although 100 pgs per OSD is preferred. I want
>> to know what is the disadvantage of setting too many pgs?
>>
>>
>> Second:
>> At the beginning, I cannot judge the capacity proportion of my workloads, so
>> I cannot set accurate pg numbers for the different pools. How many pgs
>> should I set for each pool at first?
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majord...@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow 4k writes, Luminous with bluestore backend

2018-01-02 Thread Christian Wuerdig
The main difference is that rados bench uses 4MB objects while your dd
test uses 4k block size
rados bench shows an average of 283 IOPS, which at 4k block size would
be around 1.1MB/s, so it's somewhat consistent with the dd result.
Monitor your CPU usage and network latency with something like atop on
the OSD nodes and check what might be causing the problem.
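If you want rados bench to produce a number comparable to the 4k dd test,
you can shrink the op size, e.g.:

rados bench -p volumes 30 write -t 1 -b 4096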

On Wed, Dec 27, 2017 at 7:31 AM, kevin parrikar
 wrote:
> Hi All,
> I upgraded my cluster from Hammer to Jewel and then to Luminous , changed
> from filestore to bluestore backend.
>
> on a KVM vm with 4 cpu /2 Gb RAM i have attached a 20gb rbd volume as vdc
> and performed following test.
>
> dd if=/dev/zero of=/dev/vdc bs=4k count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 4096000 bytes (4.1 MB) copied, 3.08965 s, 1.3 MB/s
>
> and it's consistently giving 1.3MB/s, which I feel is too low. I have 3 Ceph
> OSD nodes, each with 24 x 15k RPM disks, with a replication of 2, connected via
> 2x10G LACP bonded NICs with an MTU of 9100.
>
> Rados Bench results:
>
> rados bench -p volumes 4 write
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304
> for up to 4 seconds or 0 objects
> Object prefix: benchmark_data_ceph3.sapiennetworks.com_820994
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
> 0
> 1  16   276   260   1039.98  1040   0.0165053
> 0.0381299
> 2  16   545   529   1057.92  10760.043151
> 0.0580376
> 3  16   847   831   1107.91  1208   0.0394811
> 0.0567684
> 4  16  1160  11441143.9  1252 0.63265
> 0.0541888
> Total time run: 4.099801
> Total writes made:  1161
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 1132.74
> Stddev Bandwidth:   101.98
> Max bandwidth (MB/sec): 1252
> Min bandwidth (MB/sec): 1040
> Average IOPS:   283
> Stddev IOPS:25
> Max IOPS:   313
> Min IOPS:   260
> Average Latency(s): 0.0560897
> Stddev Latency(s):  0.107352
> Max latency(s): 1.02123
> Min latency(s): 0.00920514
> Cleaning up (deleting benchmark objects)
> Removed 1161 objects
> Clean up completed and total clean up time :0.079850
>
>
> After upgrading to Luminous i have executed
>
> ceph osd crush tunables optimal
>
> ceph.conf
>
> [global]
> fsid = 06c5c906-fc43-499f-8a6f-6c8e21807acf
> mon_initial_members = node-16 node-30 node-31
> mon_host = 172.16.1.9 172.16.1.3 172.16.1.11
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> log_to_syslog_level = info
> log_to_syslog = True
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 64
> public_network = 172.16.1.0/24
> log_to_syslog_facility = LOG_LOCAL0
> osd_journal_size = 2048
> auth_supported = cephx
> osd_pool_default_pgp_num = 64
> osd_mkfs_type = xfs
> cluster_network = 172.16.1.0/24
> osd_recovery_max_active = 1
> osd_max_backfills = 1
> max_open_files = 131072
> debug_default = False
>
>
> [client]
> rbd_cache_writethrough_until_flush = True
> rbd_cache = True
>
> [client.radosgw.gateway]
> rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
> keyring = /etc/ceph/keyring.radosgw.gateway
> rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
> rgw_socket_path = /tmp/radosgw.sock
> rgw_keystone_revocation_interval = 100
> rgw_keystone_url = http://192.168.1.3:35357
> rgw_keystone_admin_token = jaJSmlTNxgsFp1ttq5SuAT1R
> rgw_init_timeout = 36
> host = controller2
> rgw_dns_name = *.sapiennetworks.com
> rgw_print_continue = True
> rgw_keystone_token_cache_size = 10
> rgw_data = /var/lib/ceph/radosgw
> user = www-data
>
> [osd]
> journal_queue_max_ops = 3000
> objecter_inflight_ops = 10240
> journal_queue_max_bytes = 1048576000
> filestore_queue_max_ops = 500
> osd_mkfs_type = xfs
> osd_mount_options_xfs = rw,relatime,inode64,logbsize=256k,allocsize=4M
> osd_op_threads = 20
> filestore_queue_committing_max_ops = 5000
> journal_max_write_entries = 1000
> objecter_infilght_op_bytes = 1048576000
> filestore_queue_max_bytes = 1048576000
> filestore_max_sync_interval = 10
> journal_max_write_bytes = 1048576000
> filestore_queue_committing_max_bytes = 1048576000
> ms_dispatch_throttle_bytes = 1048576000
>
>  ceph -s
>   cluster:
> id: 06c5c906-fc43-499f-8a6f-6c8e21807acf
> health: HEALTH_WARN
> application not enabled on 2 pool(s)
>
>   services:
> mon: 3 daemons, quorum controller3,controller2,controller1
> mgr: controller1(active)
> osd: 72 osds: 72 up, 72 in
> rgw: 1 daemon active
>
>   data:
> pools:   5 pools, 6240 pgs
> objects: 12732 objects, 72319 MB
> usage:   229 GB used, 39965 GB / 40195 GB avail
> pgs: 6240 active+clean
>
> 

Re: [ceph-users] Questions about pg num setting

2018-01-02 Thread Christian Wuerdig
Have you had a look at http://ceph.com/pgcalc/?

Generally if you have too many PGs per OSD you can get yourself into
trouble during recovery and backfilling operations consuming a lot
more RAM than you have and eventually making your cluster unusable
(some more info can be found here for example:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013614.html
but there are other threads on the ML).
Also currently you cannot reduce the number of PGs for a pool so you
are much better of starting with a lower value and then gradually
increasing it.

The fact that the ceph developers introduced a config option which
prevents users from increasing the number of PGs if it exceeds the
configured limit should be a tell-tale sign that having too many PGs
per OSD is considered a problem (see also
https://bugzilla.redhat.com/show_bug.cgi?id=1489064 and linked PRs)

On Wed, Dec 27, 2017 at 3:15 PM, 于相洋  wrote:
> Hi cephers,
>
> I have two questions about pg number setting.
>
> First :
> My storage information is shown below:
> HDD: 10 * 8TB
> CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (24 cores)
> Memory: 64GB
>
> As my HDD capacity and my memory are quite large, I want to set as many
> as 300 pgs per OSD, although 100 pgs per OSD is preferred. I want
> to know what is the disadvantage of setting too many pgs?
>
>
> Second:
> At the beginning, I cannot judge the capacity proportion of my workloads, so
> I cannot set accurate pg numbers for the different pools. How many pgs
> should I set for each pool at first?
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hangs with qemu/libvirt/rbd when one host disappears

2017-12-07 Thread Christian Wuerdig
On Thu, Dec 7, 2017 at 10:24 PM, Marcus Priesch  wrote:
> Hello Alwin, Dear All,

[snip]

>> Mixing of spinners with SSDs is not recommended, as spinners will slow
>> down the pools residing on that root.
>
> why should this happen? I would assume that OSDs are separate parts
> running on hosts - not influencing each other ?
>
> otherwise i would need a different set of hosts for the ssd's and the
> hdd's ?
>

You stated that you have one pool which uses SSD and HDD mixed with 3
replicas where 2 end up on SSD and 1 on HDD
Writes will not be ACKed until all 3 replicas have been written; as
such, the slowest disk in the replica set determines the latency of the
write op.
Reads can also be affected if you don't take care to ensure that the
primary OSD for the set will always end up on an SSD - if you sift
through the archives there was a thread about this a while back
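One knob for the read side is primary affinity - lowering it on the HDD OSDs
keeps them from being chosen as primary (the osd ids are just examples, and
older releases needed 'mon osd allow primary affinity = true' set first):

ceph osd primary-affinity osd.12 0   # HDD OSD, never primary
ceph osd primary-affinity osd.3 1    # SSD OSD, preferred primary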

Cheers
Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd after xfs repair only 50 percent data and osd won't start

2017-11-26 Thread Christian Wuerdig
In filestore the journal is crucial for the operation of the OSD to
ensure consistency. If it's toast then so is the associated OSD in
most cases. I think people often overlook this fact when they share
a single journal drive among many OSDs to save cost.

On Sun, Nov 26, 2017 at 5:23 AM, Hauke Homburg  wrote:
> Hello List,
>
> Yesterday I repaired a partition with an OSD on it. I flushed and recreated
> the journal. After this I saw that I only have 50 percent of the data in the
> OSD. Now the OSD daemon will not start. Enabling debug gives:
>
> ceph tell osd.0 injectargs --debug-osd 0/5
> Error ENXIO: problem getting command descriptions from osd.0
>
> The last Log Entry of the osd:
>
> 2017-11-25 15:22:00.351424 7f146f95b800  1 journal check: header looks ok
> 2017-11-25 15:22:00.351433 7f146f95b800  1 journal close
> /var/lib/ceph/osd/ceph-0/journal
> 2017-11-25 15:22:00.351514 7f146f95b800 -1 created new journal
> /var/lib/ceph/osd/ceph-0/journal for object store /var/lib/ceph/osd/ceph-0
>
> i have set ceph osd noout.
>
> Can anybody help me to start the OSD daemon? I guess the OSD doesn't
> start because the OSD has only 50% of its data. Is this correct? How can I
> repair this? I tried unsetting noout but after this the OSD crashed, too.
>
> Hauke
>
> --
> www.w3-creative.de
>
> www.westchat.de
>
> https://friendica.westchat.de/profile/hauke
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] S3/Swift :: Pools Ceph

2017-11-14 Thread Christian Wuerdig
As per documentation: http://docs.ceph.com/docs/luminous/radosgw/
"The S3 and Swift APIs share a common namespace, so you may write data
with one API and retrieve it with the other."
So you can access one pool through both APIs and the data will be
available via both.
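A quick cross-check, assuming radosgw is running and you have a user with
both S3 keys and a Swift subuser configured (client setup not shown):

s3cmd put backup.tar.gz s3://mybucket/backup.tar.gz    # write via S3
swift list mybucket                                    # read it back via Swift
swift download mybucket backup.tar.gz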


On Wed, Nov 15, 2017 at 7:52 AM, Osama Hasebou  wrote:
> Hi Everyone,
>
> I was wondering, has anyone tried in a Test/Production environment, to have
> 1 pool, to which you can input/output data using S3 and Swift, or would each
> need a separate pool, one to serve via S3 and one to serve via Swift ?
>
> Also, I believe you can use 1 pool for RBD and Object storage as well, or is
> that false ?
>
> Thank you!
>
> Regards,
> Ossi
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Getting errors on erasure pool writes k=2, m=1

2017-11-13 Thread Christian Wuerdig
I haven't used the rados command line utility but it has an "-o
object_size" option as well as "--striper" to make it use the
libradosstriper library, so I'd suggest giving these options a go.
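E.g. something like this (untested, so verify against your version's help
output first):

rados -p ec21 --striper put blablablalbalblablalablalb.txt /mnt/disk/blablablalbalblablalablalb.txt
rados -p ec21 --striper stat blablablalbalblablalablalb.txt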

On Mon, Nov 13, 2017 at 9:40 PM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:
>
> 1. I don't think an OSD should 'crash' in such a situation.
> 2. How else should I 'rados put' an 8GB file?
>
>
>
>
>
>
> -Original Message-
> From: Christian Wuerdig [mailto:christian.wuer...@gmail.com]
> Sent: maandag 13 november 2017 0:12
> To: Marc Roos
> Cc: ceph-users
> Subject: Re: [ceph-users] Getting errors on erasure pool writes k=2, m=1
>
> As per: https://www.spinics.net/lists/ceph-devel/msg38686.html
> Bluestore as a hard 4GB object size limit
>
>
> On Sat, Nov 11, 2017 at 9:27 AM, Marc Roos <m.r...@f1-outsourcing.eu>
> wrote:
>>
>> OSDs are crashing when putting an (8GB) file in an erasure coded pool,
>> just before finishing. The same OSDs are used for replicated pools
>> rbd/cephfs, and seem to do fine. Did I make some error, or is this a bug?
>> Looks similar to
>> https://www.spinics.net/lists/ceph-devel/msg38685.html
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021
>> 045.html
>>
>>
>> [@c01 ~]# date ; rados -p ec21 put  $(basename
>> "/mnt/disk/blablablalbalblablalablalb.txt")
>> blablablalbalblablalablalb.txt
>> Fri Nov 10 20:27:26 CET 2017
>>
>> [Fri Nov 10 20:33:51 2017] libceph: osd9 down [Fri Nov 10 20:33:51
>> 2017] libceph: osd9 down [Fri Nov 10 20:33:51 2017] libceph: osd0
>> 192.168.10.111:6802 socket closed (con state OPEN) [Fri Nov 10
>> 20:33:51 2017] libceph: osd0 192.168.10.111:6802 socket error on write
>
>> [Fri Nov 10 20:33:52 2017] libceph: osd0 down [Fri Nov 10 20:33:52
>> 2017] libceph: osd7 down [Fri Nov 10 20:33:55 2017] libceph: osd0 down
>
>> [Fri Nov 10 20:33:55 2017] libceph: osd7 down [Fri Nov 10 20:34:41
>> 2017] libceph: osd7 up [Fri Nov 10 20:34:41 2017] libceph: osd7 up
>> [Fri Nov 10 20:35:03 2017] libceph: osd9 up [Fri Nov 10 20:35:03 2017]
>
>> libceph: osd9 up [Fri Nov 10 20:35:47 2017] libceph: osd0 up [Fri Nov
>> 10 20:35:47 2017] libceph: osd0 up
>>
>> [@c02 ~]# rados -p ec21 stat blablablalbalblablalablalb.txt 2017-11-10
>
>> 20:39:31.296101 7f840ad45e40 -1 WARNING: the following dangerous and
>> experimental features are enabled: bluestore 2017-11-10
>> 20:39:31.296290 7f840ad45e40 -1 WARNING: the following dangerous and
>> experimental features are enabled: bluestore 2017-11-10
>> 20:39:31.331588 7f840ad45e40 -1 WARNING: the following dangerous and
>> experimental features are enabled: bluestore
>> ec21/blablablalbalblablalablalb.txt mtime 2017-11-10 20:32:52.00,
>> size 8585740288
>>
>>
>>
>> 2017-11-10 20:32:52.287503 7f933028d700  4 rocksdb: EVENT_LOG_v1
>> {"time_micros": 1510342372287484, "job": 32, "event": "flush_started",
>> "num_memtables": 1, "num_entries": 728747, "num_deletes": 363960,
>> "memory_usage": 263854696}
>> 2017-11-10 20:32:52.287509 7f933028d700  4 rocksdb:
>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_
>> AR
>> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/releas
>> e/ 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/flush_job.cc:293]
>> [default] [JOB 32] Level-0 flush table #25279: started 2017-11-10
>> 20:32:52.503311 7f933028d700  4 rocksdb: EVENT_LOG_v1
>> {"time_micros": 1510342372503293, "cf_name": "default", "job": 32,
>> "event": "table_file_creation", "file_number": 25279, "file_size":
>> 4811948, "table_properties": {"data_size": 4675796, "index_size":
>> 102865, "filter_size": 32302, "raw_key_size": 646440,
>> "raw_average_key_size": 75, "raw_value_size": 4446103,
>> "raw_average_value_size": 519, "num_data_blocks": 1180, "num_entries":
>> 8560, "filter_policy_name": "rocksdb.BuiltinBloomFilter",
>> "kDeletedKeys": "0", "kMergeOperands": "330"}} 2017-11-10
>> 20:32:52.503327 7f933028d700  4 rocksdb:
>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_
>> AR
>> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/releas
>> e/ 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/flush_job.cc:319]
>> [default] [JOB 32] Level-0 

Re: [ceph-users] Erasure Coding Pools and PG calculation - documentation

2017-11-12 Thread Christian Wuerdig
Well, as stated in the other email, I think in the EC scenario you can
set size=k+m for the pgcalc tool. If you want 10+2 then in theory you
should be able to get away with 6 nodes and still survive a single node
failure, if you can guarantee that every node will always receive 2 out
of the 12 chunks - it looks like this might be achievable:
http://ceph.com/planet/erasure-code-on-small-clusters/
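
The gist of that article is a CRUSH rule which picks 2 OSDs from each of 6
hosts, roughly like the sketch below (rule name and numbers are assumptions
and the exact fields depend on your Ceph version, so treat it as
illustrative only):

  rule ec_k10m2_2per_host {
          ruleset 1
          type erasure
          min_size 3
          max_size 12
          step set_chooseleaf_tries 5
          step set_choose_tries 100
          step take default
          step choose indep 6 type host
          step chooseleaf indep 2 type osd
          step emit
  }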

On Mon, Nov 13, 2017 at 1:32 PM, Tim Gipson <tgip...@ena.com> wrote:
> I guess my questions are more centered around k+m and PG calculations.
>
> As we were starting to build and test our EC pools with our infrastructure we 
> were trying to figure out what our calculations needed to be starting with 3 
> OSD hosts with 12 x 10 TB OSDs a piece.  The nodes have the ability to expand 
> to 24 drives a piece and we hope to eventually get to around a 1PB cluster 
> after we add some more hosts.  Initially we hoped to be able to do a k=10 m=2 
> on the pool but I am not sure that is going to be feasible.  We’d like to set 
> up the failure domain so that we would be able to lose an entire host without 
> losing the cluster.  At this point I’m not sure that’s possible without 
> bringing in more hosts.
>
> Thanks for the help!
>
> Tim Gipson
>
>
> On 11/12/17, 5:14 PM, "Christian Wuerdig" <christian.wuer...@gmail.com> wrote:
>
> I might be wrong, but from memory I think you can use
> http://ceph.com/pgcalc/ and use k+m for the size
>
> On Sun, Nov 12, 2017 at 5:41 AM, Ashley Merrick <ash...@amerrick.co.uk> 
> wrote:
> > Hello,
> >
> > Are you having any issues with getting the pool working or just around 
> the
> > PG num you should use?
> >
> > ,Ashley
> >
> > Get Outlook for Android
> >
> > 
> > From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Tim 
> Gipson
> > <tgip...@ena.com>
> > Sent: Saturday, November 11, 2017 5:38:02 AM
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] Erasure Coding Pools and PG calculation -
> > documentation
> >
> > Hey all,
> >
> > I’m having some trouble setting up a Pool for Erasure Coding.  I haven’t
> > found much documentation around the PG calculation for an Erasure Coding
> > pool.  It seems from what I’ve tried so far that the math needed to set 
> one
> > up is different than the math you use to calculate PGs for a regular
> > replicated pool.
> >
> > Does anyone have any experience setting up a pool this way and can you 
> give
> > me some help or direction, or point me toward some documentation that 
> goes
> > over the math behind this sort of pool setup?
> >
> > Any help would be greatly appreciated!
> >
> > Thanks,
> >
> >
> > Tim Gipson
> > Systems Engineer
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding Pools and PG calculation - documentation

2017-11-12 Thread Christian Wuerdig
I might be wrong, but from memory I think you can use
http://ceph.com/pgcalc/ and use k+m for the size

On Sun, Nov 12, 2017 at 5:41 AM, Ashley Merrick  wrote:
> Hello,
>
> Are you having any issues with getting the pool working or just around the
> PG num you should use?
>
> ,Ashley
>
> Get Outlook for Android
>
> 
> From: ceph-users  on behalf of Tim Gipson
> 
> Sent: Saturday, November 11, 2017 5:38:02 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Erasure Coding Pools and PG calculation -
> documentation
>
> Hey all,
>
> I’m having some trouble setting up a Pool for Erasure Coding.  I haven’t
> found much documentation around the PG calculation for an Erasure Coding
> pool.  It seems from what I’ve tried so far that the math needed to set one
> up is different than the math you use to calculate PGs for a regular
> replicated pool.
>
> Does anyone have any experience setting up a pool this way and can you give
> me some help or direction, or point me toward some documentation that goes
> over the math behind this sort of pool setup?
>
> Any help would be greatly appreciated!
>
> Thanks,
>
>
> Tim Gipson
> Systems Engineer
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Getting errors on erasure pool writes k=2, m=1

2017-11-12 Thread Christian Wuerdig
As per: https://www.spinics.net/lists/ceph-devel/msg38686.html
Bluestore has a hard 4GB object size limit.


On Sat, Nov 11, 2017 at 9:27 AM, Marc Roos  wrote:
>
> osd's are crashing when putting a (8GB) file in a erasure coded pool,
> just before finishing. The same osd's are used for replicated pools
> rbd/cephfs, and seem to do fine. Did I made some error is this a bug?
> Looks similar to
> https://www.spinics.net/lists/ceph-devel/msg38685.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021045.html
>
>
> [@c01 ~]# date ; rados -p ec21 put  $(basename
> "/mnt/disk/blablablalbalblablalablalb.txt")
> blablablalbalblablalablalb.txt
> Fri Nov 10 20:27:26 CET 2017
>
> [Fri Nov 10 20:33:51 2017] libceph: osd9 down
> [Fri Nov 10 20:33:51 2017] libceph: osd9 down
> [Fri Nov 10 20:33:51 2017] libceph: osd0 192.168.10.111:6802 socket
> closed (con state OPEN)
> [Fri Nov 10 20:33:51 2017] libceph: osd0 192.168.10.111:6802 socket
> error on write
> [Fri Nov 10 20:33:52 2017] libceph: osd0 down
> [Fri Nov 10 20:33:52 2017] libceph: osd7 down
> [Fri Nov 10 20:33:55 2017] libceph: osd0 down
> [Fri Nov 10 20:33:55 2017] libceph: osd7 down
> [Fri Nov 10 20:34:41 2017] libceph: osd7 up
> [Fri Nov 10 20:34:41 2017] libceph: osd7 up
> [Fri Nov 10 20:35:03 2017] libceph: osd9 up
> [Fri Nov 10 20:35:03 2017] libceph: osd9 up
> [Fri Nov 10 20:35:47 2017] libceph: osd0 up
> [Fri Nov 10 20:35:47 2017] libceph: osd0 up
>
> [@c02 ~]# rados -p ec21 stat blablablalbalblablalablalb.txt
> 2017-11-10 20:39:31.296101 7f840ad45e40 -1 WARNING: the following
> dangerous and experimental features are enabled: bluestore
> 2017-11-10 20:39:31.296290 7f840ad45e40 -1 WARNING: the following
> dangerous and experimental features are enabled: bluestore
> 2017-11-10 20:39:31.331588 7f840ad45e40 -1 WARNING: the following
> dangerous and experimental features are enabled: bluestore
> ec21/blablablalbalblablalablalb.txt mtime 2017-11-10 20:32:52.00,
> size 8585740288
>
>
>
> 2017-11-10 20:32:52.287503 7f933028d700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1510342372287484, "job": 32, "event": "flush_started",
> "num_memtables": 1, "num_entries": 728747, "num_deletes": 363960,
> "memory_usage": 263854696}
> 2017-11-10 20:32:52.287509 7f933028d700  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/flush_job.cc:293]
> [default] [JOB 32] Level-0 flush table #25279: started
> 2017-11-10 20:32:52.503311 7f933028d700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1510342372503293, "cf_name": "default", "job": 32,
> "event": "table_file_creation", "file_number": 25279, "file_size":
> 4811948, "table_properties": {"data_size": 4675796, "index_size":
> 102865, "filter_size": 32302, "raw_key_size": 646440,
> "raw_average_key_size": 75, "raw_value_size": 4446103,
> "raw_average_value_size": 519, "num_data_blocks": 1180, "num_entries":
> 8560, "filter_policy_name": "rocksdb.BuiltinBloomFilter",
> "kDeletedKeys": "0", "kMergeOperands": "330"}}
> 2017-11-10 20:32:52.503327 7f933028d700  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/flush_job.cc:319]
> [default] [JOB 32] Level-0 flush table #25279: 4811948 bytes OK
> 2017-11-10 20:32:52.572413 7f933028d700  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/db_impl_files.cc:242]
> adding log 25276 to recycle list
>
> 2017-11-10 20:32:52.572422 7f933028d700  4 rocksdb: (Original Log Time
> 2017/11/10-20:32:52.503339)
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/memtable_list.cc:360]
> [default] Level-0 commit table #25279 started
> 2017-11-10 20:32:52.572425 7f933028d700  4 rocksdb: (Original Log Time
> 2017/11/10-20:32:52.572312)
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/memtable_list.cc:383]
> [default] Level-0 commit table #25279: memtable #1 done
> 2017-11-10 20:32:52.572428 7f933028d700  4 rocksdb: (Original Log Time
> 2017/11/10-20:32:52.572328) EVENT_LOG_v1 {"time_micros":
> 1510342372572321, "job": 32, "event": "flush_finished", "lsm_state": [4,
> 4, 36, 140, 0, 0, 0], "immutable_memtables": 0}
> 2017-11-10 20:32:52.572430 7f933028d700  4 rocksdb: (Original Log Time
> 2017/11/10-20:32:52.572397)
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> 

Re: [ceph-users] Undersized fix for small cluster, other than adding a 4th node?

2017-11-12 Thread Christian Wuerdig
The default failure domain is host, so you will need 5 (=k+m) nodes
for this config. If you have 4 nodes you can run k=3,m=1 or k=2,m=2;
otherwise you'd have to change the failure domain to OSD.
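
On Luminous that would look roughly like this (profile and pool names are
made up; on Jewel the option is called ruleset-failure-domain instead):

  ceph osd erasure-code-profile set ec-k2m2-osd k=2 m=2 crush-failure-domain=osd
  ceph osd pool create ecpool 64 64 erasure ec-k2m2-osd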

On Fri, Nov 10, 2017 at 10:52 AM, Marc Roos  wrote:
>
> I added an erasure k=3,m=2 coded pool on a 3 node test cluster and am
> getting these errors.
>
>pg 48.0 is stuck undersized for 23867.00, current state
> active+undersized+degraded, last acting [9,13,2147483647,7,2147483647]
> pg 48.1 is stuck undersized for 27479.944212, current state
> active+undersized+degraded, last acting [12,1,2147483647,8,2147483647]
> pg 48.2 is stuck undersized for 27479.944514, current state
> active+undersized+degraded, last acting [12,1,2147483647,3,2147483647]
> pg 48.3 is stuck undersized for 27479.943845, current state
> active+undersized+degraded, last acting [11,0,2147483647,2147483647,5]
> pg 48.4 is stuck undersized for 27479.947473, current state
> active+undersized+degraded, last acting [8,4,2147483647,2147483647,5]
> pg 48.5 is stuck undersized for 27479.940289, current state
> active+undersized+degraded, last acting [6,5,11,2147483647,2147483647]
> pg 48.6 is stuck undersized for 27479.947125, current state
> active+undersized+degraded, last acting [5,8,2147483647,1,2147483647]
> pg 48.7 is stuck undersized for 23866.977708, current state
> active+undersized+degraded, last acting [13,11,2147483647,0,2147483647]
>
> Mentioned here
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009572.html
> is that the problem was resolved by adding an extra node, I already
> changed the min_size to 3. Or should I change to k=2,m=2 but do I still
> then have good saving on storage then? How can you calculate saving
> storage of erasure pool?
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Christian Wuerdig
It should be noted that the general advice is not to use such large
objects since cluster performance will suffer; see also this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021051.html

libradosstriper might be an option which will automatically break the
object into smaller chunks

On Fri, Nov 10, 2017 at 9:08 AM, Kevin Hrpcek
 wrote:
> Marc,
>
> If you're running luminous you may need to increase osd_max_object_size.
> This snippet is from the Luminous change log.
>
> "The default maximum size for a single RADOS object has been reduced from
> 100GB to 128MB. The 100GB limit was completely impractical in practice while
> the 128MB limit is a bit high but not unreasonable. If you have an
> application written directly to librados that is using objects larger than
> 128MB you may need to adjust osd_max_object_size"
>
> Kevin
>
> On 11/09/2017 02:01 PM, Marc Roos wrote:
>
>
> I would like store objects with
>
> rados -p ec32 put test2G.img test2G.img
>
> error putting ec32/test2G.img: (27) File too large
>
> Changing the pool application from custom to rgw did not help
>
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-02 Thread Christian Wuerdig
I'm not a big expert, but the OP said he suspects bitrot is at
least part of the issue, in which case you can have the situation where
the drive has ACK'ed the write but a later scrub discovered checksum
errors.
Plus you don't need to actually lose a drive to get inconsistent pgs
with size=2 min_size=1: flapping OSDs (even just temporarily) while the
cluster is receiving writes can generate this.

On Fri, Nov 3, 2017 at 12:05 PM, Denes Dolhay  wrote:
> Hi Greg,
>
> Accepting the fact, that an osd with outdated data can never accept write,
> or io of any kind, how is it possible, that the system goes into this state?
>
> -All osds are Bluestore, checksum, mtime etc.
>
> -All osds are up and in
>
> -No hw failures, lost disks, damaged journals or databases etc.
>
> -The data became inconsistent
>
>
> Thanks,
>
> Denke.
>
>
> On 11/02/2017 11:51 PM, Gregory Farnum wrote:
>
>
> On Thu, Nov 2, 2017 at 1:21 AM koukou73gr  wrote:
>>
>> The scenario is actually a bit different, see:
>>
>> Let's assume size=2, min_size=1
>> -We are looking at pg "A" acting [1, 2]
>> -osd 1 goes down
>> -osd 2 accepts a write for pg "A"
>> -osd 2 goes down
>> -osd 1 comes back up, while osd 2 still down
>> -osd 1 has no way to know osd 2 accepted a write in pg "A"
>> -osd 1 accepts a new write to pg "A"
>> -osd 2 comes back up.
>>
>> bang! osd 1 and 2 now have different views of pg "A" but both claim to
>> have current data.
>
>
> In this case, OSD 1 will not accept IO precisely because it can not prove it
> has the current data. That is the basic purpose of OSD peering and holds in
> all cases.
> -Greg
>
>>
>>
>> -K.
>>
>> On 2017-11-01 20:27, Denes Dolhay wrote:
>> > Hello,
>> >
>> > I have a trick question for Mr. Turner's scenario:
>> > Let's assume size=2, min_size=1
>> > -We are looking at pg "A" acting [1, 2]
>> > -osd 1 goes down, OK
>> > -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1,
>> > OK
>> > -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is
>> > incomplete and stopped) not OK, but this is the case...
>> > --> In this event, why does osd 1 accept IO to pg "A" knowing full well,
>> > that it's data is outdated and will cause an inconsistent state?
>> > Wouldn't it be prudent to deny io to pg "A" until either
>> > -osd 2 comes back (therefore we have a clean osd in the acting group)
>> > ... backfill would continue to osd 1 of course
>> > -or data in pg "A" is manually marked as lost, and then continues
>> > operation from osd 1 's (outdated) copy?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-26 Thread Christian Wuerdig
Hm, not necessarily directly related to your performance problem,
however: these SSDs have a listed endurance of 72TB total data written
- over a 5 year period that's 40GB a day or approx 0.04 DWPD. Given
that you run the journal for each OSD on the same disk, that's
effectively at most 0.02 DWPD (about 20GB per day per disk). I don't
know many who'd run a cluster on disks like those. Also it means these
are pure consumer drives which have a habit of exhibiting random
performance at times (based on unquantified anecdotal personal
experience with other consumer model SSDs). I wouldn't touch these
with a long stick for anything but small toy-test clusters.
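
If you want to verify that yourself, the usual journal-style check is a
single-threaded sync write test with fio, something like the sketch below
(device name and runtime are placeholders, and note that it writes to the
raw device, so don't run it on a disk holding data you care about):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test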

On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue  wrote:
>
> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar  wrote:
>>
>> It depends on what stage you are in:
>> in production, probably the best thing is to setup a monitoring tool
>> (collectd/grahite/prometheus/grafana) to monitor both ceph stats as well as
>> resource load. This will, among other things, show you if you have slowing
>> disks.
>
> I am monitoring Ceph performance with ceph-dash
> (http://cephdash.crapworks.de/), that is why I knew to look into the slow
> writes issue. And I am using Monitorix (http://www.monitorix.org/) to
> monitor system resources, including Disk I/O.
>
> However, though I can monitor individual disk performance at the system
> level, it seems Ceph does not tax any disk more than the worst disk. So in
> my monitoring charts, all disks have the same performance.
> All four nodes are base-lining at 50 writes/sec during the cluster's normal
> load, with the non-problem hosts spiking up to 150, and the problem host
> only spikes up to 100.
> But during the window of time I took the problem host OSDs down to run the
> bench tests, the OSDs on the other nodes increased to 300-500 writes/sec.
> Otherwise, the chart looks the same for all disks on all ceph nodes/hosts.
>
>> Before production you should first make sure your SSDs are suitable for
>> Ceph, either by being recommend by other Ceph users or you test them
>> yourself for sync writes performance using fio tool as outlined earlier.
>> Then after you build your cluster you can use rados and/or rbd bencmark
>> tests to benchmark your cluster and find bottlenecks using atop/sar/collectl
>> which will help you tune your cluster.
>
> All 36 OSDs are: Crucial_CT960M500SSD1
>
> Rados bench tests were done at the beginning. The speed was much faster than
> it is now. I cannot recall the test results, someone else on my team ran
> them. Recently, I had thought the slow disk problem was a configuration
> issue with Ceph - before I posted here. Now we are hoping it may be resolved
> with a firmware update. (If it is firmware related, rebooting the problem
> node may temporarily resolve this)
>
>>
>> Though you did see better improvements, your cluster with 27 SSDs should
>> give much higher numbers than 3k iops. If you are running rados bench while
>> you have other client ios, then obviously the reported number by the tool
>> will be less than what the cluster is actually giving...which you can find
>> out via ceph status command, it will print the total cluster throughput and
>> iops. If the total is still low i would recommend running the fio raw disk
>> test, maybe the disks are not suitable. When you removed your 9 bad disk
>> from 36 and your performance doubled, you still had 2 other disk slowing
>> you..meaning near 100% busy ? It makes me feel the disk type used is not
>> good. For these near 100% busy disks can you also measure their raw disk
>> iops at that load (i am not sure atop shows this, if not use
>> sat/syssyat/iostat/collecl).
>
> I ran another bench test today with all 36 OSDs up. The overall performance
> was improved slightly compared to the original tests. Only 3 OSDs on the
> problem host were increasing to 101% disk busy.
> The iops reported from ceph status during this bench test ranged from 1.6k
> to 3.3k, the test yielding 4k iops.
>
> Yes, the two other OSDs/disks that were the bottleneck were at 101% disk
> busy. The other OSD disks on the same host were sailing along at like 50-60%
> busy.
>
> All 36 OSD disks are exactly the same disk. They were all purchased at the
> same time. All were installed at the same time.
> I cannot believe it is a problem with the disk model. A failed/bad disk,
> perhaps is possible. But the disk model itself cannot be the problem based
> on what I am seeing. If I am seeing bad performance on all disks on one ceph
> node/host, but not on another ceph node with these same disks, it has to be
> some other factor. This is why I am now guessing a firmware upgrade is
> needed.
>
> Also, as I eluded to here earlier. I took down all 9 OSDs in the problem
> host yesterday to run the bench test.
> Today, with those 9 OSDs back online, I rerun the bench test, I am see 2-3
> OSD disks with 101% busy on the problem host, and the other disks are 

Re: [ceph-users] Infinite degraded objects

2017-10-25 Thread Christian Wuerdig
ancer
>1/ 5 mds_locker
>1/ 5 mds_log
>1/ 5 mds_log_expire
>1/ 5 mds_migrator
>0/ 1 buffer
>0/ 1 timer
>0/ 1 filer
>0/ 1 striper
>0/ 1 objecter
>0/ 5 rados
>0/ 5 rbd
>0/ 5 rbd_mirror
>0/ 5 rbd_replay
>0/ 5 journaler
>0/ 5 objectcacher
>0/ 5 client
>0/ 5 osd
>0/ 5 optracker
>0/ 5 objclass
>1/ 3 filestore
>1/ 3 journal
>0/ 5 ms
>1/ 5 mon
>0/10 monc
>1/ 5 paxos
>0/ 5 tp
>1/ 5 auth
>1/ 5 crypto
>1/ 1 finisher
>1/ 5 heartbeatmap
>1/ 5 perfcounter
>1/ 5 rgw
>1/10 civetweb
>1/ 5 javaclient
>1/ 5 asok
>1/ 1 throttle
>0/ 0 refs
>1/ 5 xio
>1/ 5 compressor
>1/ 5 newstore
>1/ 5 bluestore
>1/ 5 bluefs
>1/ 3 bdev
>1/ 5 kstore
>4/ 5 rocksdb
>4/ 5 leveldb
>1/ 5 kinetic
>1/ 5 fuse
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 1
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.3.log
> -
>
>
> On 25/10/17 00:42, Christian Wuerdig wrote:
>
> From which version of ceph to which other version of ceph did you
> upgrade? Can you provide logs from crashing OSDs? The degraded object
> percentage being larger than 100% has been reported before
> (https://www.spinics.net/lists/ceph-users/msg39519.html) and looks
> like it's been fixed a week or so ago:
> http://tracker.ceph.com/issues/21803
>
> On Mon, Oct 23, 2017 at 5:10 AM, Gonzalo Aguilar Delgado
> <gagui...@aguilardelgado.com> wrote:
>
> Hello,
>
> Since we upgraded ceph cluster we are facing a lot of problems. Most of them
> due to osd crashing. What can cause this?
>
>
> This morning I woke up with thi message:
>
>
> root@red-compute:~# ceph -w
> cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>  health HEALTH_ERR
> 1 pgs are stuck inactive for more than 300 seconds
> 7 pgs inconsistent
> 1 pgs stale
> 1 pgs stuck stale
> recovery 20266198323167232/287940 objects degraded
> (7038340738753.641%)
> 37154696925806626 scrub errors
> too many PGs per OSD (305 > max 300)
>  monmap e12: 2 mons at
> {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
> election epoch 4986, quorum 0,1 red-compute,blue-compute
>   fsmap e913: 1/1/1 up {0=blue-compute=up:active}
>  osdmap e8096: 5 osds: 5 up, 5 in
> flags require_jewel_osds
>   pgmap v68755349: 764 pgs, 6 pools, 558 GB data, 140 kobjects
> 1119 GB used, 3060 GB / 4179 GB avail
> 20266198323167232/287940 objects degraded (7038340738753.641%)
>  756 active+clean
>7 active+clean+inconsistent
>1 stale+active+clean
>   client io 1630 B/s rd, 552 kB/s wr, 0 op/s rd, 64 op/s wr
>
> 2017-10-22 18:10:13.000812 mon.0 [INF] pgmap v68755348: 764 pgs: 7
> active+clean+inconsistent, 756 active+clean, 1 stale+active+clean; 558 GB
> data, 1119 GB used, 3060 GB / 4179 GB avail; 1641 B/s rd, 229 kB/s wr, 39
> op/s; 20266198323167232/287940 objects degraded (7038340738753.641%)
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infinite degraded objects

2017-10-24 Thread Christian Wuerdig
From which version of ceph to which other version of ceph did you
upgrade? Can you provide logs from crashing OSDs? The degraded object
percentage being larger than 100% has been reported before
(https://www.spinics.net/lists/ceph-users/msg39519.html) and looks
like it's been fixed a week or so ago:
http://tracker.ceph.com/issues/21803

On Mon, Oct 23, 2017 at 5:10 AM, Gonzalo Aguilar Delgado
 wrote:
> Hello,
>
> Since we upgraded ceph cluster we are facing a lot of problems. Most of them
> due to osd crashing. What can cause this?
>
>
> This morning I woke up with thi message:
>
>
> root@red-compute:~# ceph -w
> cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>  health HEALTH_ERR
> 1 pgs are stuck inactive for more than 300 seconds
> 7 pgs inconsistent
> 1 pgs stale
> 1 pgs stuck stale
> recovery 20266198323167232/287940 objects degraded
> (7038340738753.641%)
> 37154696925806626 scrub errors
> too many PGs per OSD (305 > max 300)
>  monmap e12: 2 mons at
> {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
> election epoch 4986, quorum 0,1 red-compute,blue-compute
>   fsmap e913: 1/1/1 up {0=blue-compute=up:active}
>  osdmap e8096: 5 osds: 5 up, 5 in
> flags require_jewel_osds
>   pgmap v68755349: 764 pgs, 6 pools, 558 GB data, 140 kobjects
> 1119 GB used, 3060 GB / 4179 GB avail
> 20266198323167232/287940 objects degraded (7038340738753.641%)
>  756 active+clean
>7 active+clean+inconsistent
>1 stale+active+clean
>   client io 1630 B/s rd, 552 kB/s wr, 0 op/s rd, 64 op/s wr
>
> 2017-10-22 18:10:13.000812 mon.0 [INF] pgmap v68755348: 764 pgs: 7
> active+clean+inconsistent, 756 active+clean, 1 stale+active+clean; 558 GB
> data, 1119 GB used, 3060 GB / 4179 GB avail; 1641 B/s rd, 229 kB/s wr, 39
> op/s; 20266198323167232/287940 objects degraded (7038340738753.641%)
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reported bucket size incorrect (Luminous)

2017-10-24 Thread Christian Wuerdig
What version of Ceph are you using? There were a few bugs leaving
behind orphaned objects (e.g. http://tracker.ceph.com/issues/18331 and
http://tracker.ceph.com/issues/10295). If that's your problem then
there is a tool for finding these objects so you can then manually
delete them - have a google search for rgw orphan find
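
Roughly (sketch; the pool name and job id are assumptions, and you should
sanity-check the reported objects before deleting anything):

  radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphans1
  radosgw-admin orphans finish --job-id=orphans1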

On Sat, Oct 21, 2017 at 2:40 AM, Mark Schouten  wrote:
> Hi,
>
> I have a bucket that according to radosgw-admin is about 8TB, even though
> it's really only 961GB.
>
> I have ran radosgw-admin gc process, and that completes quite fast.
> root@osdnode04:~# radosgw-admin gc process
> root@osdnode04:~# radosgw-admin gc list
> []
>
> {
> "bucket": "qnapnas",
> "zonegroup": "8e81f1e2-c173-4b8d-b421-6ccabdf69f2e",
> "placement_rule": "default-placement",
> "explicit_placement": {
> "data_pool": "default.rgw.buckets.data",
> "data_extra_pool": "default.rgw.buckets.non-ec",
> "index_pool": "default.rgw.buckets.index"
> },
> "id": "1c19a332-7ffc-4472-b852-ec4a143785cc.19675875.3",
> "marker": "1c19a332-7ffc-4472-b852-ec4a143785cc.19675875.3",
> "index_type": "Normal",
> "owner": "DB0339$REDACTED",
> "ver": "0#963948",
> "master_ver": "0#0",
> "mtime": "2017-08-23 12:15:50.203650",
> "max_marker": "0#",
> "usage": {
> "rgw.main": {
> "size": 8650431493893,
> "size_actual": 8650431578112,
> "size_utilized": 8650431493893,
> "size_kb": 8447687006,
> "size_kb_actual": 8447687088,
> "size_kb_utilized": 8447687006,
> "num_objects": 227080
> },
> "rgw.multimeta": {
> "size": 0,
> "size_actual": 0,
> "size_utilized": 0,
> "size_kb": 0,
> "size_kb_actual": 0,
> "size_kb_utilized": 0,
> "num_objects": 17
> }
> },
> "bucket_quota": {
> "enabled": false,
> "check_on_raw": false,
> "max_size": -1024,
> "max_size_kb": 0,
> "max_objects": -1
> }
> },
>
>
> Can anybody explain what's wrong?
>
>
> Met vriendelijke groeten,
>
> --
> Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
> Mark Schouten | Tuxis Internet Engineering
> KvK: 61527076 | http://www.tuxis.nl/
> T: 0318 200208 | i...@tuxis.nl
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bareos and libradosstriper works only for 4M sripe_unit size

2017-10-16 Thread Christian Wuerdig
Maybe an additional example where the numbers don't line up quite so
nicely would be good as well. For example, it's not immediately obvious
to me what would happen with the stripe settings given in your example
if you write 97M of data.
Would it be 4 objects of 24M and 4 objects of 250KB? Or will the last
4 objects be artificially padded (with 0's) to meet the stripe_unit?



On Tue, Oct 17, 2017 at 12:35 PM, Alexander Kushnirenko
 wrote:
> Hi, Gregory, Ian!
>
> There is very little information on striper mode in Ceph documentation.
> Could this explanation help?
>
> The logic of striper mode is very much the same as in RAID-0.  There are 3
> parameters that drives it:
>
> stripe_unit - the stripe size  (default=4M)
> stripe_count - how many objects to write in parallel (default=1)
> object_size  - when to stop increasing object size and create new objects.
> (default =4M)
>
> For example if you write 132M of data (132 consecutive pieces of data 1M
> each) in striped mode with the following parameters:
> stripe_unit = 8M
> stripe_count = 4
> object_size = 24M
> Then 8 objects will be created - 4 objects with 24M size and 4 objects with
> 8M size.
>
> Obj1=24MObj2=24MObj3=24MObj4=24M
> 00 .. 07 08 .. 0f 10 .. 17 18 .. 1f  <-- consecutive
> 1M pieces of data
> 20 .. 27 21 .. 2f 30 .. 37 38 .. 3f
> 40 .. 47 48 .. 4f 50 .. 57 58 .. 5f
>
> Obj5= 8MObj6= 8MObj7= 8MObj8= 8M
> 60 .. 6768 .. 6f70 .. 7778 .. 7f
>
> Alexander.
>
>
>
>
> On Wed, Oct 11, 2017 at 3:19 PM, Alexander Kushnirenko
>  wrote:
>>
>> Oh!  I put a wrong link, sorry  The picture which explains stripe_unit and
>> stripe count is here:
>>
>>
>> https://indico.cern.ch/event/330212/contributions/1718786/attachments/642384/883834/CephPluginForXroot.pdf
>>
>> I tried to attach it in the mail, but it was blocked.
>>
>>
>> On Wed, Oct 11, 2017 at 3:16 PM, Alexander Kushnirenko
>>  wrote:
>>>
>>> Hi, Ian!
>>>
>>> Thank you for your reference!
>>>
>>> Could you comment on the following rule:
>>> object_size = stripe_unit * stripe_count
>>> Or it is not necessarily so?
>>>
>>> I refer to page 8 in this report:
>>>
>>>
>>> https://indico.cern.ch/event/531810/contributions/2298934/attachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf
>>>
>>>
>>> Alexander.
>>>
>>> On Wed, Oct 11, 2017 at 1:11 PM,  wrote:

 Hi Gregory

 You’re right, when setting the object layout in libradosstriper, one
 should set all three parameters (the number of stripes, the size of the
 stripe unit, and the size of the striped object). The Ceph plugin for
 GridFTP has an example of this at
 https://github.com/stfc/gridFTPCephPlugin/blob/master/ceph_posix.cpp#L371



 At RAL, we use the following values:



 $STRIPER_NUM_STRIPES 1

 $STRIPER_STRIPE_UNIT 8388608

 $STRIPER_OBJECT_SIZE 67108864



 Regards,



 Ian Johnson MBCS

 Data Services Group

 Scientific Computing Department

 Rutherford Appleton Laboratory




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] list admin issues

2017-10-15 Thread Christian Wuerdig
You're not the only one, it happens to me too. I found some old ML thread
from a couple of years back where someone mentioned the same thing.
I do notice spam coming through from time to time (not much, though, and
it seems to come in waves), although I'm not sure how much gmail is
bouncing; nobody else seems to complain about spam.

Gmail has a 25MB attachment limit and I'd be surprised if someone
sends 25MB+ attachments. So it's most likely an attachment with a
blocked extension or a "dangerous" link as written here:
https://support.google.com/mail/answer/6590

Who knows, there might be someone on the list deliberately including a
"signature" or attaching a file with a blocked extension to
deliberately annoy gmail users in order to "educate" them (I remember
similarly annoying behaviour from the Usenet days)

On Mon, Oct 16, 2017 at 4:15 PM, Blair Bethwaite
 wrote:
> Thanks Christian,
>
> You're no doubt on the right track, but I'd really like to figure out
> what it is at my end - I'm unlikely to be the only person subscribed
> to ceph-users via a gmail account.
>
> Re. attachments, I'm surprised mailman would be allowing them in the
> first place, and even so gmail's attachment requirements are less
> strict than most corporate email setups (those that don't already use
> a cloud provider).
>
> This started happening earlier in the year after I turned off digest
> mode. I also have a paid google domain, maybe I'll try setting
> delivery to that address and seeing if anything changes...
>
> Cheers,
>
> On 16 October 2017 at 13:54, Christian Balzer  wrote:
>>
>> Hello,
>>
>> You're on gmail.
>>
>> Aside from various potential false positives with regards to spam my bet
>> is that gmail's known dislike for attachments is the cause of these
>> bounces and that setting is beyond your control.
>>
>> Because Google knows best[tm].
>>
>> Christian
>>
>> On Mon, 16 Oct 2017 13:50:43 +1100 Blair Bethwaite wrote:
>>
>>> Hi all,
>>>
>>> This is a mailing-list admin issue - I keep being unsubscribed from
>>> ceph-users with the message:
>>> "Your membership in the mailing list ceph-users has been disabled due
>>> to excessive bounces..."
>>> This seems to be happening on roughly a monthly basis.
>>>
>>> Thing is I have no idea what the bounce is or where it is coming from.
>>> I've tried emailing ceph-users-ow...@lists.ceph.com and the contact
>>> listed in Mailman (l...@redhat.com) to get more info but haven't
>>> received any response despite several attempts.
>>>
>>> Help!
>>>
>>
>>
>> --
>> Christian BalzerNetwork/Systems Engineer
>> ch...@gol.com   Rakuten Communications
>
>
>
> --
> Cheers,
> ~Blairo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating a custom cluster name using ceph-deploy

2017-10-15 Thread Christian Wuerdig
See also this ML thread regarding removing the cluster name option:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018520.html

On Mon, Oct 16, 2017 at 11:42 AM, Erik McCormick
 wrote:
> Do not, under any circumstances, make a custom named cluster. There be pain
> and suffering (and dragons) there, and official support for it has been
> deprecated.
>
> On Oct 15, 2017 6:29 PM, "Bogdan SOLGA"  wrote:
>>
>> Hello, everyone!
>>
>> We are trying to create a custom cluster name using the latest ceph-deploy
>> version (1.5.39), but we keep getting the error:
>> 'ceph-deploy new: error: subnet must have at least 4 numbers separated by
>> dots like x.x.x.x/xx, but got: cluster_name'
>>
>> We tried to run the new command using the following orders for the
>> parameters:
>>
>> ceph-deploy new --cluster cluster_name ceph-mon-001
>> ceph-deploy new ceph-mon-001 --cluster cluster_name
>>
>> The output of 'ceph-deploy new -h' no longer lists the '--cluster' option,
>> but the 'man ceph-deploy' lists it.
>>
>> Any help is highly appreciated.
>>
>> Thank you,
>> Bogdan
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cache pool full

2017-10-05 Thread Christian Wuerdig
The default file size limit for CephFS is 1TB; see also here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-May/018208.html
(also includes a pointer on how to increase it)
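
For reference, raising the limit is a single setting on the filesystem,
e.g. to allow files up to 2TB (the filesystem name and size here are
assumptions, substitute your own):

  ceph fs set cephfs max_file_size 2199023255552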

On Fri, Oct 6, 2017 at 12:45 PM, Shawfeng Dong  wrote:
> Dear all,
>
> We just set up a Ceph cluster, running the latest stable release Ceph
> v12.2.0 (Luminous):
> # ceph --version
> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>
> The goal is to serve Ceph filesystem, for which we created 3 pools:
> # ceph osd lspools
> 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
> where
> * cephfs_data is the data pool (36 OSDs on HDDs), which is erased-coded;
> * cephfs_metadata is the metadata pool
> * cephfs_cache is the cache tier (3 OSDs on NVMes) for cephfs_data. The
> cache-mode is writeback.
>
> Everything had worked fine, until today when we tried to copy a 1.3TB file
> to the CephFS.  We got the "No space left on device" error!
>
> 'ceph -s' says some OSDs are full:
> # ceph -s
>   cluster:
> id: e18516bf-39cb-4670-9f13-88ccb7d19769
> health: HEALTH_ERR
> full flag(s) set
> 1 full osd(s)
> 1 pools have many more objects per pg than average
>
>   services:
> mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
> mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
> mds: pulpos-1/1/1 up  {0=pulpo-mds01=up:active}
> osd: 39 osds: 39 up, 39 in
>  flags full
>
>   data:
> pools:   3 pools, 2176 pgs
> objects: 347k objects, 1381 GB
> usage:   2847 GB used, 262 TB / 265 TB avail
> pgs: 2176 active+clean
>
>   io:
> client:   19301 kB/s rd, 2935 op/s rd, 0 op/s wr
>
> And indeed the cache pool is full:
> # rados df
> POOL_NAME   USED  OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND
> DEGRADED RD_OPS   RD
> WR_OPS  WR
> cephfs_cache1381G  355385  0 710770  0   0
> 0 10004954 15
> 22G 1398063  1611G
> cephfs_data 0   0  0  0  0   0
> 00
>   0   0  0
> cephfs_metadata 8515k  24  0 72  0   0
> 03  3
> 0723953 10541k
>
> total_objects355409
> total_used   2847G
> total_avail  262T
> total_space  265T
>
> However, the data pool is completely empty! So it seems that data has only
> been written to the cache pool, but not written back to the data pool.
>
> I am really at a loss whether this is due to a setup error on my part, or a
> Luminous bug. Could anyone shed some light on this? Please let me know if
> you need any further info.
>
> Best,
> Shaw
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW how to delete orphans

2017-10-02 Thread Christian Wuerdig
yes, at least that's how I'd interpret the information given in this
thread: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-February/016521.html

On Tue, Oct 3, 2017 at 1:11 AM, Webert de Souza Lima
<webert.b...@gmail.com> wrote:
> Hey Christian,
>
>> On 29 Sep 2017 12:32 a.m., "Christian Wuerdig"
>> <christian.wuer...@gmail.com> wrote:
>>>
>>> I'm pretty sure the orphan find command does exactly just that -
>>> finding orphans. I remember some emails on the dev list where Yehuda
>>> said he wasn't 100% comfortable of automating the delete just yet.
>>> So the purpose is to run the orphan find tool and then delete the
>>> orphaned objects once you're happy that they all are actually
>>> orphaned.
>>>
>
> so what you mean is that one should manually remove the result listed
> objects that are output?
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW how to delete orphans

2017-09-28 Thread Christian Wuerdig
I'm pretty sure the orphan find command does exactly that -
finding orphans. I remember some emails on the dev list where Yehuda
said he wasn't 100% comfortable with automating the delete just yet.
So the purpose is to run the orphan find tool and then delete the
orphaned objects once you're happy that they are all actually
orphaned.
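
The actual deletion is then just plain object removes against the data pool,
e.g. (the pool name is an assumption, the object names come from the find
output):

  rados -p default.rgw.buckets.data rm <orphan-object-name>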

On Fri, Sep 29, 2017 at 7:46 AM, Webert de Souza Lima
 wrote:
> When I had to use that I just took for granted that it worked, so I can't
> really tell you if that's just it.
>
> :|
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
>
> On Thu, Sep 28, 2017 at 1:31 PM, Andreas Calminder
>  wrote:
>>
>> Hi,
>> Yes I'm able to run these commands, however it is unclear both in man file
>> and the docs what's supposed to happen with the orphans, will they be
>> deleted once I run finish? Or will that just throw away the job? What will
>> orphans find actually produce? At the moment it just outputs a lot of text
>> saying something like putting $num in orphans.$jobid.$shardnum and listing
>> objects that are not orphans?
>>
>> Regards,
>> Andreas
>>
>> On 28 Sep 2017 15:10, "Webert de Souza Lima" 
>> wrote:
>>
>> Hello,
>>
>> not an expert here but I think the answer is something like:
>>
>> radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
>> radosgw-admin orphans finish --job-id=_JOB_ID_
>>
>> _JOB_ID_ being anything.
>>
>>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> Belo Horizonte - Brasil
>>
>> On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder
>>  wrote:
>>>
>>> Hello,
>>> running Jewel on some nodes with rados gateway I've managed to get a
>>> lot of leaked multipart objects, most of them belonging to buckets
>>> that do not even exist anymore. We estimated these objects to occupy
>>> somewhere around 60TB, which would be great to reclaim. Question is
>>> how, since trying to find them one by one and perform some kind of
>>> sanity check if they're in use or not will take forever.
>>>
>>> The radosgw-admin orphans find command sounds like something I could
>>> use, but it's not clear if the command also removes the orphans? If
>>> not, what does it do? Can I use it to help me removing my orphan
>>> objects?
>>>
>>> Best regards,
>>> Andreas
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage not balanced over OSDs

2017-09-17 Thread Christian Wuerdig
There is a ceph command "reweight-by-utilization" you can run to
adjust the OSD weights automatically based on their utilization:
http://docs.ceph.com/docs/master/rados/operations/control/#osd-subsystem
Some people run this on a periodic basis (cron script)

Check the mailing list archives, for example this thread
https://www.spinics.net/lists/ceph-devel/msg15083.html provides a bit
of background information, but there are many others.

You probably want to run test-reweight-by-utilization first just to
see what it would do before too much data moves around.

And last but not least there is some work being done on adding an
automatic balancer to ceph which runs periodically and adjusts the
weights to achieve an even distribution but I don't think that's fully
baked yet.
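
A dry run followed by the real thing looks like this (120 is the default
threshold, i.e. only OSDs more than 20% above the average utilization get
adjusted; pick a value that suits you):

  ceph osd test-reweight-by-utilization 120
  ceph osd reweight-by-utilization 120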

On Thu, Sep 14, 2017 at 8:30 AM, Sinan Polat  wrote:
> Hi,
>
>
>
> I have 52 OSD’s in my cluster, all with the same disk size and same weight.
>
>
>
> When I perform a:
>
> ceph osd df
>
>
>
> The disk with the least available space: 863G
>
> The disk with the most available space: 1055G
>
>
>
> I expect the available space or the usage on the disks to be the same, since
> they have the same weight, but there is a difference of almost 200GB.
>
>
>
> Due to this, the MAX AVAIL in ceph df is lower than expected (the MAX AVAIL
> is based on the disk with the least available space).
>
>
>
> -  How can I balance the disk usage over the disks, so the usage /
> available space on each disk is more or less the same?
>
> -  What will happen if I hit the MAX AVAIL, while most of the disks
> still have space?
>
>
>
> Thanks!
>
> Sinan
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD memory usage

2017-09-15 Thread Christian Wuerdig
Assuming you're using Bluestore, you could experiment with the cache
settings 
(http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/)

In your case setting bluestore_cache_size_hdd lower than the default
1GB might help with the RAM usage

Various people have reported solving OOM issues by setting this to
512MB; I'm not sure what the performance impact might be.
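
In ceph.conf that would be something like the following (value in bytes,
512MB here; the OSDs need a restart to pick it up):

  [osd]
  bluestore_cache_size_hdd = 536870912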

On Tue, Sep 12, 2017 at 6:15 AM,   wrote:
> Please excuse my brain-fart.  We're using 24 disks on the servers in
> question.  Only after discussing this further with a colleague did we
> realize this.
>
> This brings us right to the minimum-spec which generally isn't a good idea.
>
> Sincerely
>
> -Dave
>
>
> On 11/09/17 11:38 AM, bulk.sch...@ucalgary.ca wrote:
>>
>> [This sender failed our fraud detection checks and may not be who they
>> appear to be. Learn about spoofing at http://aka.ms/LearnAboutSpoofing]
>>
>>
>> Hi Everyone,
>>
>> I wonder if someone out there has a similar problem to this?
>>
>> I keep having issues with memory usage.  I have 2 OSD servers wiith 48G
>> memory and 12 2TB OSDs.  I seem to have significantly more memory than
>> the minimum spec, but these two machines with 2TB drives seem to OOM
>> kill and crash periodically -- basically any time the cluster goes into
>> recovery for even 1 OSD this happens.
>>
>> 12 Drives * 2TB = 24 TB.  By using the 1GB RAM per 1TB Disk rule: I
>> should need only 24TB or so.
>>
>> I am testing and benchmarking at this time so most changes are fine.  I
>> am abusing this filesystem considerably by running 14 clients with
>> something that is more or less dd each to a different file but that's
>> the point :)
>>
>> When it's working, the performance is really good.  3GB/s with 3x
>> replicated data pool up to around 10GB/s with 1X replication (just for
>> kicks and giggles) My bottleneck is likely the SAS channels to those
>> disks.
>>
>> I'm using the 12.2.0 release running on Centos 7
>>
>> Testing cephfs with one MDS and 3 montors.  The MON/MDS are not on the
>> servers in question.
>>
>> Total of around 350 OSDs (all spinning disk) most of which are 1TB
>> drives on 15 servers that are a bit older with Xeon E5620's.
>>
>> Dual QDR Infiniband (20GBit) fabrics (1 cluster and 1 client).
>>
>> Any thoughts?  Am I missing some tuning parameter in /proc or something?
>>
>> Thanks
>> -Dave
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous BlueStore EC performance

2017-09-07 Thread Christian Wuerdig
What type of EC config (k+m) was used if I may ask?

On Fri, Sep 8, 2017 at 1:34 AM, Mohamad Gebai  wrote:
> Hi,
>
> These numbers are probably not as detailed as you'd like, but it's
> something. They show the overhead of reading and/or writing to EC pools as
> compared to 3x replicated pools using 1, 2, 8 and 16 threads (single
> client):
>
>  Rep   EC Diff  Slowdown
>  IOPS  IOPS
> Read
> 123,32522,052 -5.46%1.06
> 227,26127,147 -0.42%1.00
> 827,15127,127 -0.09%1.00
> 16   26,79326,728 -0.24%1.00
> Write
> 119,444 5,708-70.64%3.41
> 223,902 5,395-77.43%4.43
> 823,912 5,641-76.41%4.24
> 16   24,587 5,643-77.05%4.36
> RW
> 120,37911,166-45.21%1.83
> 234,246 9,525-72.19%3.60
> 833,195 9,300-71.98%3.57
> 16   31,641 9,762-69.15%3.24
>
> This is on an all-SSD cluster, with 3 OSD nodes and Bluestore. Ceph version
> 12.1.0-671-g2c11b88d14 (2c11b88d14e64bf60c0556c6a4ec8c9eda36ff6a) luminous
> (rc).
>
> Mohamad
>
>
> On 09/06/2017 01:28 AM, Blair Bethwaite wrote:
>
> Hi all,
>
> (Sorry if this shows up twice - I got auto-unsubscribed and so first attempt
> was blocked)
>
> I'm keen to read up on some performance comparisons for replication versus
> EC on HDD+SSD based setups. So far the only recent thing I've found is
> Sage's Vault17 slides [1], which have a single slide showing 3X / EC42 /
> EC51 for Kraken. I guess there is probably some of this data to be found in
> the performance meeting threads, but it's hard to know the currency of those
> (typically master or wip branch tests) with respect to releases. Can anyone
> point out any other references or highlight something that's coming?
>
> I'm sure there are piles of operators and architects out there at the moment
> wondering how they could and should reconfigure their clusters once upgraded
> to Luminous. A couple of things going around in my head at the moment:
>
> * We want to get to having the bulk of our online storage in CephFS on EC
> pool/s...
> *-- is overwrite performance on EC acceptable for near-line NAS use-cases?
> *-- recovery implications (currently recovery on our Jewel RGW EC83 pool is
> _way_ slower that 3X pools, what does this do to reliability? maybe split
> capacity into multiple pools if it helps to contain failure?)
>
> [1]
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in/37
>
> --
> Cheers,
> ~Blairo
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph kraken: Calamari Centos7

2017-07-20 Thread Christian Wuerdig
Judging by the github repo, development on it has all but stalled; the last
commit was more than 3 months ago
(https://github.com/ceph/calamari/commits/master).
Also, there is the new dashboard in the new ceph mgr daemon in Luminous -
so my guess is that Calamari is pretty much dead.
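
If you are on Luminous you could try the built-in mgr dashboard instead,
roughly:

  ceph mgr module enable dashboard

If I remember correctly it listens on port 7000 by default.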

On Thu, Jul 20, 2017 at 4:28 AM, Oscar Segarra 
wrote:

> Hi,
>
> Anybody has been able to setup Calamari on Centos7??
>
> I've done a lot of Google but I haven't found any good documentation...
>
> The command "ceph-deploy calamari connect"  does not work!
>
> Thanks a lot for your help!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph random read IOPS

2017-06-26 Thread Christian Wuerdig
Well, preferring faster-clocked CPUs for SSD scenarios has been floated
several times over the last few months on this list. And realistic or not,
Nick's and Kostas' setups are similar enough (testing a single disk) that
it's a distinct possibility.
Anyway, as mentioned, measuring the performance counters would probably
provide more insight.
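
For example, dumping the perf counters of an OSD before and after a test run
and diffing the interesting ones (the osd id is an assumption, run it on the
node hosting that OSD):

  ceph daemon osd.0 perf dump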


On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen <w...@digiware.nl>
wrote:

>
>
> Op 24 jun. 2017 om 14:17 heeft Maged Mokhtar <mmokh...@petasan.org> het
> volgende geschreven:
>
> My understanding was this test is targeting latency more than IOPS. This
> is probably why its was run using QD=1. It also makes sense that cpu freq
> will be more important than cores.
>
>
> But then it is not generic enough to be used as an advise!
> It is just a line in 3D-space.
> As there are so many
>
> --WjW
>
> On 2017-06-24 12:52, Willem Jan Withagen wrote:
>
> On 24-6-2017 05:30, Christian Wuerdig wrote:
>
The general advice floating around is that you want CPUs with high
> clock speeds rather than more cores to reduce latency and increase IOPS
> for SSD setups (see also
> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/) So
> something like a E5-2667V4 might bring better results in that situation.
> Also there was some talk about disabling the processor C states in order
> to bring latency down (something like this should be easy to test:
> https://stackoverflow.com/a/22482722/220986)
>
>
> I would be very careful to call this general advice...
>
> Although the article is interesting, it is rather single sided.
>
> The only thing it shows is that there is a linear relation between
> clock speed and write or read speeds???
> The article is rather vague on how and what is actually tested.
>
> By just running a single OSD with no replication a lot of the
> functionality is left out of the equation.
> Nobody is running just 1 OSD on a box in a normal cluster host.
>
> Not using a serious SSD is another source of noise in the conclusion.
> More Queue depth can/will certainly have impact on concurrency.
>
> I would call this an observation, and nothing more.
>
> --WjW
>
>
> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> <reverend...@gmail.com <mailto:reverend...@gmail.com>> wrote:
>
> Hello,
>
> We are in the process of evaluating the performance of a testing
> cluster (3 nodes) with ceph jewel. Our setup consists of:
> 3 monitors (VMs)
> 2 physical servers each connected with 1 JBOD running Ubuntu Server
> 16.04
>
> Each server has 32 threads @2.1GHz and 128GB RAM.
> The disk distribution per server is:
> 38 * HUS726020ALS210 (SAS rotational)
> 2 * HUSMH8010BSS200 (SAS SSD for journals)
> 2 * ST1920FM0043 (SAS SSD for data)
> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>
> Since we don't currently have a 10Gbit switch, we test the performance
> with the cluster in a degraded state, the noout flag set and we mount
> rbd images on the powered on osd node. We confirmed that the network
> is not saturated during the tests.
>
> We ran tests on the NVME disk and the pool created on this disk where
> we hoped to get the most performance without getting limited by the
> hardware specs since we have more disks than CPU threads.
>
> The nvme disk was at first partitioned with one partition and the
> journal on the same disk. The performance on random 4K reads was
> topped at 50K iops. We then removed the osd and partitioned with 4
> data partitions and 4 journals on the same disk. The performance
> didn't increase significantly. Also, since we run read tests, the
> journals shouldn't cause performance issues.
>
> We then ran 4 fio processes in parallel on the same rbd mounted image
> and the total iops reached 100K. More parallel fio processes didn't
> increase the measured iops.
>
> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
> the crushmap just defines the different buckets/rules for the disk
> separation (rotational, ssd, nvme) in order to create the required
> pools
>
> Is the performance of 100,000 iops for random 4K reads normal for a
> disk that on the same benchmark runs at more than 300K iops on the
> same hardware or are we missing something?
>
> Best regards,
> Kostas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> <http://lists.ceph.com/listinfo.cgi/ceph-user

Re: [ceph-users] Ceph random read IOPS

2017-06-23 Thread Christian Wuerdig
The general advice floating around is that you want CPUs with high clock
speeds rather than more cores to reduce latency and increase IOPS for SSD
setups (see also
http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/) So
something like a E5-2667V4 might bring better results in that situation.
Also there was some talk about disabling the processor C states in order to
bring latency down (something like this should be easy to test:
https://stackoverflow.com/a/22482722/220986)
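
If anyone wants to try the C-state angle, a rough sketch of where to look
(standard sysfs paths and kernel parameters, but verify the exact knobs for
your particular kernel and CPU):

grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name   # allowed C-states
# for a test run, keep the cores out of deep C-states, e.g. by booting with
#   intel_idle.max_cstate=1 processor.max_cstate=1
# or by holding /dev/cpu_dma_latency open as in the stackoverflow answer above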

On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos <
reverend...@gmail.com> wrote:

> Hello,
>
> We are in the process of evaluating the performance of a testing
> cluster (3 nodes) with ceph jewel. Our setup consists of:
> 3 monitors (VMs)
> 2 physical servers each connected with 1 JBOD running Ubuntu Server 16.04
>
> Each server has 32 threads @2.1GHz and 128GB RAM.
> The disk distribution per server is:
> 38 * HUS726020ALS210 (SAS rotational)
> 2 * HUSMH8010BSS200 (SAS SSD for journals)
> 2 * ST1920FM0043 (SAS SSD for data)
> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>
> Since we don't currently have a 10Gbit switch, we test the performance
> with the cluster in a degraded state, the noout flag set and we mount
> rbd images on the powered on osd node. We confirmed that the network
> is not saturated during the tests.
>
> We ran tests on the NVME disk and the pool created on this disk where
> we hoped to get the most performance without getting limited by the
> hardware specs since we have more disks than CPU threads.
>
> The nvme disk was at first partitioned with one partition and the
> journal on the same disk. The performance on random 4K reads was
> topped at 50K iops. We then removed the osd and partitioned with 4
> data partitions and 4 journals on the same disk. The performance
> didn't increase significantly. Also, since we run read tests, the
> journals shouldn't cause performance issues.
>
> We then ran 4 fio processes in parallel on the same rbd mounted image
> and the total iops reached 100K. More parallel fio processes didn't
> increase the measured iops.
>
> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
> the crushmap just defines the different buckets/rules for the disk
> separation (rotational, ssd, nvme) in order to create the required
> pools
>
> Is the performance of 100,000 iops for random 4K reads normal for a
> disk that on the same benchmark runs at more than 300K iops on the
> same hardware or are we missing something?
>
> Best regards,
> Kostas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] handling different disk sizes

2017-06-05 Thread Christian Wuerdig
Yet another option is to change the failure domain to OSD instead of host
(this avoids having to move disks around and will probably meet your
initial expectations).
It means your cluster will become unavailable when you lose a host until
you fix it though. OTOH you probably don't have too much leeway anyway with
just 3 hosts, so it might be an acceptable trade-off. It also means you can
just add new OSDs to the servers wherever they fit.
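
For reference, the change itself boils down to editing the chooseleaf step of
the CRUSH rule (a minimal sketch against the default replicated ruleset; check
the edited map with crushtool before injecting it):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# in crush.txt change the rule's
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type osd
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new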

On Tue, Jun 6, 2017 at 1:51 AM, David Turner  wrote:

> If you want to resolve your issue without purchasing another node, you
> should move one disk of each size into each server.  This process will be
> quite painful as you'll need to actually move the disks in the crush map to
> be under a different host and then all of your data will move around, but
> then your weights will be able to utilize the weights and distribute the
> data between the 2TB, 3TB, and 8TB drives much more evenly.
>
> On Mon, Jun 5, 2017 at 9:21 AM Loic Dachary  wrote:
>
>>
>>
>> On 06/05/2017 02:48 PM, Christian Balzer wrote:
>> >
>> > Hello,
>> >
>> > On Mon, 5 Jun 2017 13:54:02 +0200 Félix Barbeira wrote:
>> >
>> >> Hi,
>> >>
>> >> We have a small cluster for radosgw use only. It has three nodes, with 3
>> witch 3
>> > ^  ^
>> >> osds each. Each node has different disk sizes:
>> >>
>> >
>> > There's your answer, staring you right in the face.
>> >
>> > Your default replication size is 3, your default failure domain is host.
>> >
>> > Ceph can not distribute data according to the weight, since it needs to
>> be
>> > on a different node (one replica per node) to comply with the replica
>> size.
>>
>> Another way to look at it is to imagine a situation where 10TB worth of
>> data
>> is stored on node01 which has 8x3 24TB. Since you asked for 3 replicas,
>> this
>> data must be replicated to node02 but ... there only is 2x3 6TB available.
>> So the maximum you can store is 6TB and remaining disk space on node01
>> and node03
>> will never be used.
>>
>> python-crush analyze will display a message about that situation and show
>> which buckets
>> are overweighted.
>>
>> Cheers
>>
>> >
>> > If your cluster had 4 or more nodes, you'd see what you expected.
>> > And most likely wouldn't be happy about the performance with your 8TB
>> HDDs
>> > seeing 4 times more I/Os than then 2TB ones and thus becoming the
>> > bottleneck of your cluster.
>> >
>> > Christian
>> >
>> >> node01 : 3x8TB
>> >> node02 : 3x2TB
>> >> node03 : 3x3TB
>> >>
>> >> I thought that the weight handle the amount of data that every osd
>> receive.
>> >> In this case for example the node with the 8TB disks should receive
>> more
>> >> than the rest, right? All of them receive the same amount of data and
>> the
>> >> smaller disk (2TB) reaches 100% before the bigger ones. Am I doing
>> >> something wrong?
>> >>
>> >> The cluster is jewel LTS 10.2.7.
>> >>
>> >> # ceph osd df
>> >> ID WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
>> >>  0 7.27060  1.0  7445G 1012G  6432G 13.60 0.57 133
>> >>  3 7.27060  1.0  7445G 1081G  6363G 14.52 0.61 163
>> >>  4 7.27060  1.0  7445G  787G  6657G 10.58 0.44 120
>> >>  1 1.81310  1.0  1856G 1047G   809G 56.41 2.37 143
>> >>  5 1.81310  1.0  1856G  956G   899G 51.53 2.16 143
>> >>  6 1.81310  1.0  1856G  877G   979G 47.24 1.98 130
>> >>  2 2.72229  1.0  2787G 1010G  1776G 36.25 1.52 140
>> >>  7 2.72229  1.0  2787G  831G  1955G 29.83 1.25 130
>> >>  8 2.72229  1.0  2787G 1038G  1748G 37.27 1.56 146
>> >>   TOTAL 36267G 8643G 27624G 23.83
>> >> MIN/MAX VAR: 0.44/2.37  STDDEV: 18.60
>> >> #
>> >>
>> >> # ceph osd tree
>> >> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> >> -1 35.41795 root default
>> >> -2 21.81180 host node01
>> >>  0  7.27060 osd.0   up  1.0  1.0
>> >>  3  7.27060 osd.3   up  1.0  1.0
>> >>  4  7.27060 osd.4   up  1.0  1.0
>> >> -3  5.43929 host node02
>> >>  1  1.81310 osd.1   up  1.0  1.0
>> >>  5  1.81310 osd.5   up  1.0  1.0
>> >>  6  1.81310 osd.6   up  1.0  1.0
>> >> -4  8.16687 host node03
>> >>  2  2.72229 osd.2   up  1.0  1.0
>> >>  7  2.72229 osd.7   up  1.0  1.0
>> >>  8  2.72229 osd.8   up  1.0  1.0
>> >> #
>> >>
>> >> # ceph -s
>> >> cluster 49ba9695-7199-4c21-9199-ac321e60065e
>> >>  health HEALTH_OK
>> >>  monmap e1: 3 mons at
>> >> {ceph-mon01=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon02=[x:x:x:x:x:
>> x:x:x]:6789/0,ceph-mon03=[x:x:x:x:x:x:x:x]:6789/0}
>> >> election epoch 48, quorum 0,1,2 ceph-mon01,ceph-mon03,ceph-
>> mon02
>> >>  osdmap e265: 9 osds: 9 up, 9 in
>> >> flags sortbitwise,require_jewel_osds
>> >>   pgmap 

Re: [ceph-users] Recovery stuck in active+undersized+degraded

2017-06-02 Thread Christian Wuerdig
Well, what's "best" really depends on your needs and use-case. The general
advice which has been floated several times now is to have at least N+2
entities of your failure domain in your cluster.
So for example if you run with size=3 then you should have at least 5 OSDs
if your failure domain is OSD and 5 hosts if your failure domain is host.


On Sat, Jun 3, 2017 at 3:20 AM, Oleg Obleukhov  wrote:

> But what would be the best? Have 3 servers and how many osd?
> Thanks!
>
> On 2 Jun 2017, at 17:09, David Turner  wrote:
>
> That's good for testing in the small scale.  For production I would
> revisit using size 3.  Glad you got it working.
>
> On Fri, Jun 2, 2017 at 11:02 AM Oleg Obleukhov 
> wrote:
>
>> Thanks to everyone,
>> problem is solved by:
>> ceph osd pool set cephfs_metadata size 2
>> ceph osd pool set cephfs_data size 2
>>
>> Best, Oleg.
>>
>> On 2 Jun 2017, at 16:15, Oleg Obleukhov  wrote:
>>
>> Hello,
>> I am playing around with ceph (ceph version 10.2.7 (
>> 50e863e0f4bc8f4b9e31156de690d765af245185)) on Debian Jessie and I build
>> a test setup:
>>
>> $ ceph osd tree
>> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 0.01497 root default
>> -2 0.00499 host af-staging-ceph01
>>  0 0.00499 osd.0   up  1.0  1.0
>> -3 0.00499 host af-staging-ceph02
>>  1 0.00499 osd.1   up  1.0  1.0
>> -4 0.00499 host af-staging-ceph03
>>  2 0.00499 osd.2   up  1.0  1.0
>>
>> So I have 3 osd on 3 servers.
>> I also created 2 pools:
>>
>> ceph osd dump | grep 'replicated size'
>> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0
>> object_hash rjenkins pg_num 32 pgp_num 32 last_change 33 flags hashpspool
>> crash_replay_interval 45 stripe_width 0
>> pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0
>> object_hash rjenkins pg_num 32 pgp_num 32 last_change 31 flags hashpspool
>> stripe_width 0
>>
>> Now I am testing failover and kill one of servers:
>> ceph osd tree
>> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 0.01497 root default
>> -2 0.00499 host af-staging-ceph01
>>  0 0.00499 osd.0   up  1.0  1.0
>> -3 0.00499 host af-staging-ceph02
>>  1 0.00499 osd.1 down  1.0  1.0
>> -4 0.00499 host af-staging-ceph03
>>  2 0.00499 osd.2   up  1.0  1.0
>>
>> And now it stuck in the recovery state:
>> ceph -s
>> cluster 6b5ff07a-7232-4840-b486-6b7906248de7
>>  health HEALTH_WARN
>> 64 pgs degraded
>> 18 pgs stuck unclean
>> 64 pgs undersized
>> recovery 21/63 objects degraded (33.333%)
>> 1/3 in osds are down
>> 1 mons down, quorum 0,2 af-staging-ceph01,af-staging-ceph03
>>  monmap e1: 3 mons at {af-staging-ceph01=10.36.0.
>> 121:6789/0,af-staging-ceph02=10.36.0.122:6789/0,af-staging-
>> ceph03=10.36.0.123:6789/0}
>> election epoch 38, quorum 0,2 af-staging-ceph01,af-staging-
>> ceph03
>>   fsmap e29: 1/1/1 up {0=af-staging-ceph03.crm.ig.local=up:active},
>> 2 up:standby
>>  osdmap e78: 3 osds: 2 up, 3 in; 64 remapped pgs
>> flags sortbitwise,require_jewel_osds
>>   pgmap v334: 64 pgs, 2 pools, 47129 bytes data, 21 objects
>> 122 MB used, 15204 MB / 15326 MB avail
>> 21/63 objects degraded (33.333%)
>>   64 active+undersized+degraded
>>
>> And if I kill one more node I lose access to mounted file system on
>> client.
>> Normally I would expect replica-factor to be respected and ceph should
>> create the missing copies of degraded pg.
>>
>> I was trying to rebuild the crush map and it looks like this, but this
>> did not help:
>> rule replicated_ruleset {
>> ruleset 0
>> type replicated
>> min_size 1
>> max_size 10
>> step take default
>> step chooseleaf firstn 0 type osd
>> step emit
>> }
>>
>> # end crush map
>>
>> Would very appreciate help,
>> Thank you very much in advance,
>> Oleg.
>>
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance

2017-05-04 Thread Christian Wuerdig
On Thu, May 4, 2017 at 7:53 PM, Fuxion Cloud  wrote:

> Hi all,
>
> I'm a newbie to Ceph. We had Ceph deployed by a vendor 2 years ago
> on Ubuntu 14.04 LTS without any performance tuning. I noticed that the
> storage performance is very slow. Can someone please advise how to
> improve the performance?
>
>
You really need to provide a bit more information than that, like what
hardware is involved (CPU, RAM, how many nodes, how many OSDs, what kind and
size of disks, networking hardware) and how you use Ceph (RBD, RGW, CephFS,
plain RADOS object storage).

Outputs of

ceph status
ceph osd tree
ceph df

also provide useful information.

Also what does "slow performance" mean - how have you determined that
(throughout, latency)?
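
To put numbers on "slow", a couple of quick benchmarks from a client node help
(a rough sketch; the pool name "rbd" and the 60 second runtime are only
placeholders, and rados bench writes real objects into the pool, so clean up
afterwards):

rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq
rados bench -p rbd 60 rand
rados -p rbd cleanup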


> Any changes or configuration require for OS kernel?
>
> Regards,
> James
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding Ceph in case of a failure

2017-03-20 Thread Christian Wuerdig
On Tue, Mar 21, 2017 at 8:57 AM, Karol Babioch  wrote:

> Hi,
>
> Am 20.03.2017 um 05:34 schrieb Christian Balzer:
> > you do realize that you very much have a corner case setup there, right?
>
> Yes, I know that this is not exactly a recommendation, but I hoped it
> would be good enough for the start :-).
>
> > That being said, if you'd search the archives, a similar question was
> > raised by me a long time ago.
>
> Do you have some sort of reference to this? Sounds interesting, but
> couldn't find a particular thread, and you posted quite a lot on this
> list already :-).
>
> > The new CRUSH map of course results in different computations of where
> PGs
> > should live, so they get copied to their new primary OSDs.
> > This is the I/O you're seeing and that's why it stops eventually.
>
> Hm, ok, that might be an explanation. Haven't considered the fact that
> it gets removed from the CRUSH map and a new location is calculated. Is
> there a way to prevent this in my case?
>
>
If an OSD doesn't respond it will be marked as down, and then after some
time (default 300 sec) it will be marked as out.
Data will start to move once the OSD is marked out (i.e. no longer part of
the CRUSH map), which is what you are observing.

The settings you are probably interested in are listed below, with a config
sketch after the list (docs from here:
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/)

1. mon osd down out interval   - defaults to 300sec after which a down OSD
will be marked out
2. mon osd down out subtree limit - will prevent down OSDs being marked out
automatically if the whole subtree disappears. This defaults to rack - if
you change it to host then turning off an entire host should prevent all
those OSDs from being marked out automatically
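
A minimal config sketch of the above (the values are purely illustrative, and
the injectargs form only affects the running monitors until their next
restart):

# ceph.conf, [mon] section
mon osd down out interval = 600
mon osd down out subtree limit = host

# or set at runtime
ceph tell mon.* injectargs '--mon-osd-down-out-interval 600'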



> Thank you very much for your insights!
>
> Best regards,
> Karol Babioch
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating data from a Ceph clusters to another

2017-02-11 Thread Christian Wuerdig
According to:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009485.html it
seems not entirely safe to copy an RBD pool this way.

This thread mentions doing a rados ls and then get/put on the objects, but
Greg mentioned that this may also have issues with snapshots.

Maybe cppool has been fixed but maybe there is no generic way to do this.
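
If the pools only hold RBD images, a per-image loop along the lines of the rbd
import/export suggestion further down in the thread sidesteps the pool-copy
question entirely (a rough sketch, not battle-tested; "srcpool", "dstpool" and
the cluster name "clusterB" are placeholders, and snapshots would still need
separate handling, e.g. via rbd export-diff):

for img in $(rbd -p srcpool ls); do
    rbd -p srcpool export "$img" - | rbd --cluster clusterB -p dstpool import - "$img"
done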

On Sat, Feb 11, 2017 at 4:42 AM, Eugen Block  wrote:

> I'm not sure if this is what you need, but I recently tried 'rados cppool
>  ' to populate a new test-pool with different
> properties. I simply needed some data in there, and this command worked for me.
>
> Regards,
> Eugen
>
> Quote from 林自均 :
>
> Hi Irek & Craig,
>>
>> Sorry, I misunderstood "RBD mirroring". What I want to do is not like
>> that.
>>
>> I just want to move all the data from one cluster to another. It can be
>> achieved by `rados -p  get  ` for all objects on
>> cluster A, and then `rados -p  put  ` on cluster
>> B.
>> Is there any tool for that?
>>
>> Best,
>> John Lin
>>
>> On Thu, 9 Feb 2017 at 16:43, Craig Chi wrote:
>>
>> Hi,
>>>
>>> Sorry I gave the wrong feature.
>>> The rbd mirroring method can only be used on an rbd with the "journaling" feature
>>> (not layering).
>>>
>>> Sincerely,
>>> Craig Chi
>>>
>>> On 2017-02-09 16:41, Craig Chi  wrote:
>>>
>>> Hi John,
>>>
>>> rbd mirroring can be configured per pool.
>>> http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
>>> However, the rbd mirroring method can only be used on rbd images with the
>>> layering feature; it cannot mirror objects other than rbd for you.
>>>
>>> Sincerely,
>>> Craig Chi
>>>
>>> On 2017-02-09 16:24, Irek Fasikhov  wrote:
>>>
>>> Hi.
>>> I recommend using rbd import/export.
>>>
>>> With kind regards, Фасихов Ирек Нургаязович
>>> Mob.: +79229045757
>>>
>>> 2017-02-09 11:13 GMT+03:00 林自均 :
>>>
>>> Hi,
>>>
>>> I have 2 Ceph clusters, cluster A and cluster B. I want to move all the
>>> pools on A to B. The pool names don't conflict between clusters. I guess
>>> it's like RBD mirroring, except that it's pool mirroring. Is there any
>>> proper ways to do it?
>>>
>>> Thanks for any suggestions.
>>>
>>> Best,
>>> John Lin
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___ ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>
>
> --
> Eugen Block voice   : +49-40-559 51 75
> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> Postfach 61 03 15
> D-22423 Hamburg e-mail  : ebl...@nde.ag
>
> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>   Sitz und Registergericht: Hamburg, HRB 90934
>   Vorstand: Jens-U. Mozdzen
>USt-IdNr. DE 814 013 983
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-09 Thread Christian Wuerdig
 the other way, we have now 40722 GB free. You can see the
> change on the %USE of the osds. For me this looks like there is some data
> lost, since ceph did not do any backfill or other operation. That’s the
> problem...
>
>
Ok, that output is indeed a bit different. However, as you can see, the
actual data stored in the cluster goes from 4809 to 4830 GB. 4830 * 3 is
only 14490 GB, so currently the cluster is using a bit more raw space than
strictly necessary. My guess would be that the data gets migrated to the
new OSDs first before being deleted from the old OSDs, and as such it will
transiently use up more space. Pretty sure that you didn't lose any data.
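
That should be easy to confirm by just watching the raw usage settle again
over the next hours (nothing cluster-specific assumed, just the standard
views):

ceph df
ceph osd df
rados df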


>
> On 09.01.2017 at 21:55, Christian Wuerdig <
> christian.wuer...@gmail.com> wrote:
>
>
>
> On Tue, Jan 10, 2017 at 8:23 AM, Marcus Müller <mueller.mar...@posteo.de>
> wrote:
>
>> Hi all,
>>
>> Recently I added a new node with new osds to my cluster, which, of course
>> resulted in backfilling. At the end, there are 4 pgs left in the state 4
>> active+remapped and I don’t know what to do.
>>
>> Here is how my cluster looks like currently:
>>
>> ceph -s
>>  health HEALTH_WARN
>> 4 pgs stuck unclean
>> recovery 3586/58734009 objects degraded (0.006%)
>> recovery 420074/58734009 objects misplaced (0.715%)
>> noscrub,nodeep-scrub flag(s) set
>>  monmap e9: 5 mons at {ceph1=192.168.10.3:6789/0,cep
>> h2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.
>> 168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>> election epoch 478, quorum 0,1,2,3,4
>> ceph1,ceph2,ceph3,ceph4,ceph5
>>  osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>> flags noscrub,nodeep-scrub
>>   pgmap v9970276: 320 pgs, 3 pools, 4831 GB data, 19119 kobjects
>> 15152 GB used, 40719 GB / 55872 GB avail
>> 3586/58734009 objects degraded (0.006%)
>> 420074/58734009 objects misplaced (0.715%)
>>  316 active+clean
>>4 active+remapped
>>   client io 643 kB/s rd, 7 op/s
>>
>> # ceph osd df
>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
>>  0 1.28899  1.0  3724G  1697G  2027G 45.57 1.68
>>  1 1.57899  1.0  3724G  1706G  2018G 45.81 1.69
>>  2 1.68900  1.0  3724G  1794G  1929G 48.19 1.78
>>  3 6.78499  1.0  7450G  1240G  6209G 16.65 0.61
>>  4 8.3  1.0  7450G  1226G  6223G 16.47 0.61
>>  5 9.51500  1.0  7450G  1237G  6212G 16.62 0.61
>>  6 7.66499  1.0  7450G  1264G  6186G 16.97 0.63
>>  7 9.75499  1.0  7450G  2494G  4955G 33.48 1.23
>>  8 9.32999  1.0  7450G  2491G  4958G 33.45 1.23
>>   TOTAL 55872G 15152G 40719G 27.12
>> MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54
>>
>> # ceph health detail
>> HEALTH_WARN 4 pgs stuck unclean; recovery 3586/58734015 objects degraded
>> (0.006%); recovery 420074/58734015 objects misplaced (0.715%);
>> noscrub,nodeep-scrub flag(s) set
>> pg 9.7 is stuck unclean for 512936.160212, current state active+remapped,
>> last acting [7,3,0]
>> pg 7.84 is stuck unclean for 512623.894574, current state
>> active+remapped, last acting [4,8,1]
>> pg 8.1b is stuck unclean for 513164.616377, current state
>> active+remapped, last acting [4,7,2]
>> pg 7.7a is stuck unclean for 513162.316328, current state
>> active+remapped, last acting [7,4,2]
>> recovery 3586/58734015 objects degraded (0.006%)
>> recovery 420074/58734015 objects misplaced (0.715%)
>> noscrub,nodeep-scrub flag(s) set
>>
>> # ceph osd tree
>> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 56.00693 root default
>> -2  1.28899 host ceph1
>>  0  1.28899 osd.0   up  1.0  1.0
>> -3  1.57899 host ceph2
>>  1  1.57899 osd.1   up  1.0  1.0
>> -4  1.68900 host ceph3
>>  2  1.68900 osd.2   up  1.0  1.0
>> -5 32.36497 host ceph4
>>  3  6.78499 osd.3   up  1.0  1.0
>>  4  8.3 osd.4   up  1.0  1.0
>>  5  9.51500 osd.5   up  1.0  1.0
>>  6  7.66499 osd.6   up  1.0  1.0
>> -6 19.08498 host ceph5
>>  7  9.75499 osd.7   up  1.0  1.0
>>  8  9.32999 osd.8   up  1.0  1.0
>>
>> I’m using a customized crushmap because as you can see this cluster is
>> not very optimal. Ceph1, ceph2 and ceph3 are vms on one physical host

Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-09 Thread Christian Wuerdig
On Tue, Jan 10, 2017 at 8:23 AM, Marcus Müller 
wrote:

> Hi all,
>
> Recently I added a new node with new osds to my cluster, which, of course
> resulted in backfilling. At the end, there are 4 pgs left in the state 4
> active+remapped and I don’t know what to do.
>
> Here is how my cluster looks like currently:
>
> ceph -s
>  health HEALTH_WARN
> 4 pgs stuck unclean
> recovery 3586/58734009 objects degraded (0.006%)
> recovery 420074/58734009 objects misplaced (0.715%)
> noscrub,nodeep-scrub flag(s) set
>  monmap e9: 5 mons at {ceph1=192.168.10.3:6789/0,
> ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,
> ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
> election epoch 478, quorum 0,1,2,3,4
> ceph1,ceph2,ceph3,ceph4,ceph5
>  osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
>   pgmap v9970276: 320 pgs, 3 pools, 4831 GB data, 19119 kobjects
> 15152 GB used, 40719 GB / 55872 GB avail
> 3586/58734009 objects degraded (0.006%)
> 420074/58734009 objects misplaced (0.715%)
>  316 active+clean
>4 active+remapped
>   client io 643 kB/s rd, 7 op/s
>
> # ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
>  0 1.28899  1.0  3724G  1697G  2027G 45.57 1.68
>  1 1.57899  1.0  3724G  1706G  2018G 45.81 1.69
>  2 1.68900  1.0  3724G  1794G  1929G 48.19 1.78
>  3 6.78499  1.0  7450G  1240G  6209G 16.65 0.61
>  4 8.3  1.0  7450G  1226G  6223G 16.47 0.61
>  5 9.51500  1.0  7450G  1237G  6212G 16.62 0.61
>  6 7.66499  1.0  7450G  1264G  6186G 16.97 0.63
>  7 9.75499  1.0  7450G  2494G  4955G 33.48 1.23
>  8 9.32999  1.0  7450G  2491G  4958G 33.45 1.23
>   TOTAL 55872G 15152G 40719G 27.12
> MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54
>
> # ceph health detail
> HEALTH_WARN 4 pgs stuck unclean; recovery 3586/58734015 objects degraded
> (0.006%); recovery 420074/58734015 objects misplaced (0.715%);
> noscrub,nodeep-scrub flag(s) set
> pg 9.7 is stuck unclean for 512936.160212, current state active+remapped,
> last acting [7,3,0]
> pg 7.84 is stuck unclean for 512623.894574, current state active+remapped,
> last acting [4,8,1]
> pg 8.1b is stuck unclean for 513164.616377, current state active+remapped,
> last acting [4,7,2]
> pg 7.7a is stuck unclean for 513162.316328, current state active+remapped,
> last acting [7,4,2]
> recovery 3586/58734015 objects degraded (0.006%)
> recovery 420074/58734015 objects misplaced (0.715%)
> noscrub,nodeep-scrub flag(s) set
>
> # ceph osd tree
> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 56.00693 root default
> -2  1.28899 host ceph1
>  0  1.28899 osd.0   up  1.0  1.0
> -3  1.57899 host ceph2
>  1  1.57899 osd.1   up  1.0  1.0
> -4  1.68900 host ceph3
>  2  1.68900 osd.2   up  1.0  1.0
> -5 32.36497 host ceph4
>  3  6.78499 osd.3   up  1.0  1.0
>  4  8.3 osd.4   up  1.0  1.0
>  5  9.51500 osd.5   up  1.0  1.0
>  6  7.66499 osd.6   up  1.0  1.0
> -6 19.08498 host ceph5
>  7  9.75499 osd.7   up  1.0  1.0
>  8  9.32999 osd.8   up  1.0  1.0
>
> I’m using a customized crushmap because as you can see this cluster is not
> very optimal. Ceph1, ceph2 and ceph3 are vms on one physical host - Ceph4
> and Ceph5 are both separate physical hosts. So the idea is to spread 33% of
> the data to ceph1, ceph2 and ceph3 and the other 66% to each ceph4 and
> ceph5.
>
> Everything went fine with the backfilling but now I see those 4 pgs stuck
> active+remapped since 2 days while the degrades objects increase.
>
> I did a restart of all osds after and after but this helped not really. It
> first showed me no degraded objects and then increased again.
>
> What can I do in order to get those pgs to active+clean state again? My
> idea was to increase the weight of a osd a little bit in order to let ceph
> calculate the map again, is this a good idea?
>

Trying Google with "ceph pg stuck in active and remapped" points to a
couple of posts on this ML, typically indicating that it's a problem with
the CRUSH map and ceph being unable to satisfy the mapping rules. Your ceph
-s output indicates that you're using replication of size 3 in your pools.
You also said you had a custom CRUSH map - can you post it?
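
Easiest way to share it is a plain dump (standard commands, nothing about your
particular setup assumed):

ceph osd crush dump        # full CRUSH map as JSON
ceph osd crush rule dump   # just the rules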


>
> ---
>
> On the other side I saw something very strange too: After the backfill was
> done (2 days ago), my ceph osd df looked like this:
>
> # ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
>  0 1.28899  1.0  3724G  1924G  1799G 51.67 1.79
>  1 1.57899  1.0  3724G  2143G  1580G 57.57 2.00
>  2 1.68900  1.0  3724G  

Re: [ceph-users] rgw civetweb ssl official documentation?

2016-12-19 Thread Christian Wuerdig
No official documentation but here is how I got it to work on Ubuntu
16.04.01 (in this case I'm using a self-signed certificate):

assuming you're running rgw on a computer called rgwnode:

1. create self-signed certificate

ssh rgwnode
openssl req -x509 -nodes -newkey rsa:4096 -keyout key.pem -out cert.pem
-days 1000

cat key.pem >> /usr/share/ca-certificates/cert.pem
 ^--- without doing this you get errors like this "civetweb:
0x564d0357d8c0: set_ssl_option: cannot open
/usr/share/ca-certificates/cert.pem: error:0906D06C:PEM
routines:PEM_read_bio:no start line"
cp cert.pem /usr/share/ca-certificates/

2. configure civetweb:

edit your ceph.conf on the admin node and add:

[client.rgw.rgwnode]
rgw_frontends = civetweb port=443s
ssl_certificate=/usr/share/ca-certificates/cert.pem

push the config
ceph-deploy config push rgwnode

ssh rgwnode 'sudo systemctl restart ceph-radosgw@rgwnode'

this ended up not being enough and I found log messages like these in the
logs:
2016-09-09 17:22:21.593231 7f36c33f8a00  0 civetweb: 0x555a3b7988c0:
load_dll: cannot load libssl.so
2016-09-09 17:22:21.593278 7f36c33f8a00  0 civetweb: 0x555a3b7988c0:
load_dll: cannot load libcrypto.so

to fix it:
ssh rgwnode
sudo ln -s /lib/x86_64-linux-gnu/libssl.so.1.0.0 /usr/lib/libssl.so
sudo ln -s /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 /usr/lib/libcrypto.so
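
After restarting the gateway a quick sanity check against the SSL endpoint is
worthwhile (-k because of the self-signed certificate; rgwnode is the same
placeholder hostname as above):

curl -vk https://rgwnode/
# an anonymous ListAllMyBuckets XML response means civetweb is serving SSL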


On Thu, Dec 8, 2016 at 7:44 AM, Puff, Jonathon 
wrote:

> There’s a few documents out around this subject, but I can’t find anything
> official.  Can someone point me to any official documentation for deploying
> this?   Other alternatives appear to be a HAproxy frontend.  Currently
> running 10.2.3 with a single radosgw.
>
>
>
> -JP
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pgs stuck on undersized+degraded+peered

2016-12-09 Thread Christian Wuerdig
Hi,

It's generally useful to provide some detail about the setup, like:
What are your pool settings - size and min_size?
What is your failure domain - osd or host?
What version of ceph are you running on which OS?

You can check which specific PGs are problematic by running "ceph health
detail" and then you can use "ceph pg x.y query" (where x.y is a
problematic PG identified from ceph health).
http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-pg/
might provide you some pointers.
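
A minimal sketch of that sequence (the PG id 1.2f is only an example of what
you would pick out of the health output):

ceph health detail
ceph pg dump_stuck unclean
ceph pg 1.2f query | less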

One obvious fix would be to get your 3rd osd server up and running again -
but I guess you're already working on this.

Cheers
Christian

On Sat, Dec 10, 2016 at 7:25 AM, fridifree  wrote:

> Hi,
> 1 of 3 of my osd servers is down and I get this error
> And I do not have any access to rbds on the cluster
>
> Any suggestions?
>
> Thank you
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VM disk operation blocked during OSDs failures

2016-11-04 Thread Christian Wuerdig
What are your pool size and min_size settings? An object with less than
min_size replicas will not receive I/O (
http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas).
So if size=2 and min_size=1 then an OSD failure means blocked operations to
all objects located on the failed OSD until they have been replicated again.
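
For the record, the current values can be checked per pool like this ("rbd" is
just an example pool name):

ceph osd pool get rbd size
ceph osd pool get rbd min_size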

On Sat, Nov 5, 2016 at 9:04 AM, fcid  wrote:

> Dear ceph community,
>
> I'm working in a small ceph deployment for testing purposes, in which i
> want to test the high availability features of Ceph and how clients are
> affected during outages in the cluster.
>
> This small cluster is deployed using 3 servers on which are running 2 OSDs
> and 1 monitor each, and we are using it to serve Rados block devices for
> KVM hypervisors in other hosts. The ceph software was installed using
> ceph-deploy.
>
> For HA testing we are simulating disk failures by physically detaching OSD
> disks from servers and also by eliminating the power source from servers we
> want to fail.
>
> I have some doubts regarding the behavior during OSD and disk failures
> under light workloads.
>
> During disk failures, the cluster takes a long time to promote the
> secondary OSD to primary, thus blocking all the disk operations of virtual
> machines using RBD until the cluster map is updated with the failed OSD
> (which can take up to 10 minutes in our cluster). Is this the expected
> behavior of the OSD cluster? or should it be transparent to clients when
> the disks fails?
>
> Thanks in advance, kind regards.
>
> Configuration and version of our ceph cluster:
>
> root@ceph00:~# cat /etc/ceph/ceph.conf
> [global]
> fsid = 440fce60-3097-4f1c-a489-c170e65d8e09
> mon_initial_members = ceph00
> mon_host = 192.168.x1.x1
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public network = 192.168.x.x/x
> cluster network = y.y.y.y/y
> [osd]
> osd mkfs options = -f -i size=2048 -n size=64k
> osd mount options xfs = inode64,noatime,logbsize=256k
> osd journal size = 20480
> filestore merge threshold = 40
> filestore split multiple = 8
> filestore xattr use omap = true
>
> root@ceph00:~# ceph -v
> ceph version 10.2.3
>
> --
> Fernando Cid O.
> Ingeniero de Operaciones
> AltaVoz S.A.
>  http://www.altavoz.net
> Viña del Mar, Valparaiso:
>  2 Poniente 355 of 53
>  +56 32 276 8060
> Santiago:
>  San Pío X 2460, oficina 304, Providencia
>  +56 2 2585 4264
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer Cache Tiering

2016-11-01 Thread Christian Wuerdig
On Wed, Nov 2, 2016 at 5:19 PM, Ashley Merrick 
wrote:

> Hello,
>
> Thanks for your reply, when you say latest version do you mean .6 and not .5?
>
> The use case is large-scale storage VMs, which may have a burst of high
> writes while new storage is being loaded onto the environment; looking to
> place the SSD cache in front, currently with a replica of 3 and a usable
> size of 1.5TB.
>
> Looking to run in read-forward mode, so reads will come directly from the
> OSD layer, where there is no issue with current read performance; however,
> any large writes will first go to the SSD and then later be flushed to the
> OSDs once the SSD cache reaches, for example, 60% full.
>
> So the use case is not so much to store hot DB data that will stay there,
> but to act as a temporary sponge for short bursts of heavy writes.
>

This is precisely what the journals are for. From what I've seen and read
on this list so far, I'd say you will be way better off putting your
journals on SSDs in the OSD nodes than trying to set up a cache tier. In
general, using a cache as a write buffer sounds, to me at least, like the
wrong way round - typically you want a cache for fast read access (i.e.
serving very frequently read data as fast as possible).
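
For completeness, a rough sketch of what SSD journals look like at OSD
creation time with ceph-deploy on hammer/jewel (host and device names are
placeholders and the exact syntax should be checked against your ceph-deploy
version's docs; the SSD partition needs to match the configured osd journal
size, and one SSD usually carries the journals of several OSDs):

ceph-deploy osd prepare osdnode1:/dev/sdb:/dev/sdc1
ceph-deploy osd activate osdnode1:/dev/sdb1:/dev/sdc1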


>
> ,Ashley
>
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: Wednesday, 2 November 2016 11:48 AM
> To: ceph-us...@ceph.com
> Cc: Ashley Merrick 
> Subject: Re: [ceph-users] Hammer Cache Tiering
>
>
> Hello,
>
> On Tue, 1 Nov 2016 15:07:33 + Ashley Merrick wrote:
>
> > Hello,
> >
> > Currently using a Proxmox & CEPH cluster, currently they are running on
> Hammer looking to update to Jewel shortly, I know I can do a manual upgrade
> however would like to keep what is tested well with Proxmox.
> >
> > Looking to put a SSD Cache tier in front, however have seen and read
> there has been a few bug's with Cache Tiering causing corruption, from what
> I read all fixed on Jewel however not 100% if they have been pushed to
> Hammer (even though is still not EOL for a little while).
> >
> You will want to read at LEAST the last two threads about "cache tier" in
> this ML, more if you can.
>
> > Is anyone running Cache Tiering on Hammer in production and had no
> issues, or is anyone aware of any bugs' / issues that means I should hold
> off till I upgrade to Jewel, or any reason basically to hold off for a
> month or so to update to Jewel before enabling a cache tier.
> >
> The latest Hammer should be fine, 0.94.5 has been working for me a long
> time, 0.94.6 is DEFINITELY to be avoided at all costs.
>
> A cache tier is a complex beast.
> Does it fit your needs/use patterns, can you afford to make it large
> enough to actually fit all your hot data in it?
>
> Jewel has more control knobs to help you, so unless you are 100% sure that
> you know what you're doing or have a cache pool in mind that's as large as
> your current used data, waiting for Jewel might be a better proposition.
>
> Of course the lack of any official response to the last relevant thread
> here about the future of cache tiering makes adding/designing a cache tier
> an additional challenge...
>
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com