Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more?
Hi! Thanks. The parameter gets reset when the SSD is reconnected, so in fact it requires that you do not power-cycle the drive after changing it :-) OK, this case seems lucky - a ~2x difference isn't a lot. Can you tell me the exact model and capacity of this Micron, and which controller was used in the test? I'll add it to the spreadsheet.

--
With best regards,
Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more?
...disable signatures and rbd cache. I didn't mention it in the email so as not to repeat myself, but I do have it in the article :-)

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] Consumer-grade SSD in Ceph
I had. 100-200 write iops with iodepth=1, ~5k iops with iodepth=128. These were Intel 545s. Not that awful, but a Micron 5200 costs only a fraction more, so using desktop Samsungs seems pointless to me.

On 19 December 2019 22:20:28 GMT+03:00, Sinan Polat wrote:
>Hi all,
>
>Thanks for the replies. I am not worried about their lifetime. We will
>be adding only 1 SSD disk per physical server. All SSDs are enterprise
>drives. If the added consumer-grade disk fails, no problem.
>
>I am more curious about their I/O performance. I do not want to have a
>50% drop in performance.
>
>So, anyone any experience with the 860 EVO or Crucial MX500 in a Ceph
>setup?
>
>Thanks!
>
>> On 19 Dec 2019 at 19:18, Mark Nelson wrote:
>>
>> The way I try to look at this is:
>>
>> 1) How much more do the enterprise-grade drives cost?
>>
>> 2) What are the benefits? (Faster performance, longer life, etc.)
>>
>> 3) How much does it cost to deal with downtime, diagnose issues, and
>> replace malfunctioning hardware?
>>
>> My personal take is that enterprise drives are usually worth it.
>> There may be consumer-grade drives worth considering in very specific
>> scenarios if they still have power loss protection and high write
>> durability. Even when I was in academia years ago with very limited
>> budgets, we got burned by consumer-grade SSDs to the point where we
>> had to replace them all. You have to be very careful and know exactly
>> what you are buying.
>>
>> Mark
>>
>>> On 12/19/19 12:04 PM, jes...@krogh.cc wrote:
>>> I don't think "usually" is good enough in a production setup.
>>>
>>> Sent from myMail for iOS
>>>
>>> Thursday, 19 December 2019, 12.09 +0100 from Vitaliy Filippov:
>>> Usually it doesn't, it only harms performance and probably SSD
>>> lifetime too
>>>
>>>> I would not be running ceph on ssds without powerloss protection.
>>>> It delivers a potential data loss scenario

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] Consumer-grade SSD in Ceph
https://yourcmc.ru/wiki/Ceph_performance
https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc

On 19 December 2019 0:41:02 GMT+03:00, Sinan Polat wrote:
>Hi,
>
>I am aware that
>https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>holds a list with benchmarks of quite a few different SSD models.
>Unfortunately it doesn't have benchmarks for recent SSD models.
>
>A client is planning to expand a running cluster (Luminous, FileStore,
>SSD only, Replicated). I/O utilization is close to 0, but capacity-wise
>the cluster is almost nearfull. To save costs the cluster will be
>expanded with consumer-grade SSDs, but I am unable to find benchmarks
>of recent SSD models.
>
>Does anyone have experience with the Samsung 860 EVO, 860 PRO or
>Crucial MX500 in a Ceph cluster?
>
>Thanks!
>Sinan

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] WAL/DB size
30 GB already includes the WAL, see http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing

On 15 August 2019 1:15:58 GMT+03:00, Anthony D'Atri wrote:
>Good points in both posts, but I think there's still some unclarity.
>
>Absolutely let's talk about DB and WAL together. By "bluestore goes on
>flash" I assume you mean WAL+DB?
>
>"Simply allocate DB and WAL will appear there automatically"
>
>Forgive me please if this is obvious, but I'd like to see a holistic
>explanation of WAL and DB sizing *together*, which I think would help
>folks put these concepts together and plan deployments with some sense
>of confidence.
>
>We've seen good explanations on the list of why only specific DB sizes,
>say 30GB, are actually used _for the DB_.
>If the WAL goes along with the DB, shouldn't we also explicitly
>determine an appropriate size N for the WAL, and make the partition
>(30+N) GB?
>If so, how do we derive N? Or is it a constant?
>
>Filestore was so much simpler: 10GB, set and forget, for the journal.
>Not that I miss XFS, mind you.
>
>>> Actually a standalone WAL is required when you have either a very
>>> small fast device (and don't want the DB to use it) or three devices
>>> (different in performance) behind an OSD (e.g. hdd, ssd, nvme). The
>>> WAL should then be located on the fastest one.
>>>
>>> For the given use case you just have HDD and NVMe, and DB and WAL
>>> can safely colocate. Which means you don't need to allocate a
>>> specific volume for the WAL. Hence no need to answer the question of
>>> how much space is needed for the WAL. Simply allocate the DB and the
>>> WAL will appear there automatically.
>>>
>> Yes, I'm surprised how often people talk about the DB and WAL
>> separately for no good reason. In common setups bluestore goes on
>> flash and the storage goes on the HDDs, simple.
>>
>> In the event the flash is 100s of GB and would be wasted, is there
>> anything that needs to be done to set rocksdb to use the highest
>> level? 600 I believe.

--
With best regards,
Vitaliy Filippov
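To make the 30 GB figure less magic, here is a toy calculation (my own sketch, assuming the RocksDB defaults described in the linked wiki article: max_bytes_for_level_base = 256 MB and a level size multiplier of 10). A whole level must fit on the fast device to be placed there, so only partition sizes just above a cumulative level boundary are fully used:

```python
# Toy illustration of RocksDB level sizing (assumed defaults:
# max_bytes_for_level_base = 256 MB, level multiplier = 10).
# A whole level must fit on the fast device to live there, which is
# why only ~3 / 30 / 300 GB of block.db are actually useful.
base_mb = 256
multiplier = 10

cumulative_mb = 0
for level in range(1, 5):
    level_mb = base_mb * multiplier ** (level - 1)
    cumulative_mb += level_mb
    print(f"L{level}: {level_mb / 1024:7.2f} GB, levels up to here need "
          f"{cumulative_mb / 1024:7.2f} GB")
```

With these assumptions, L1 through L3 together need ~27.75 GB (plus WAL and L0), so a ~30 GB partition covers them, while anything between ~30 GB and ~280 GB buys nothing extra - matching the "only specific DB sizes are actually used" observation in the thread.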
Re: [ceph-users] CephFS snapshot for backup & disaster recovery
AFAIK no. What's the idea of running a single-host CephFS cluster?

On 4 August 2019 13:27:00 GMT+03:00, Eitan Mosenkis wrote:
>I'm running a single-host Ceph cluster for CephFS and I'd like to keep
>backups in Amazon S3 for disaster recovery. Is there a simple way to
>extract a CephFS snapshot as a single file and/or to create a file that
>represents the incremental difference between two snapshots?

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] Future of Filestore?
Hi again, I reread your initial email - do you also run a nanoceph on some SBCs, each having one 2.5" 5400rpm HDD plugged into it? What SBCs do you use? :-)

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] Future of Filestore?
Cache=writeback is perfectly safe: it's flushed when the guest calls fsync, so journaled filesystems and databases don't lose data that's committed to the journal.

On 25 July 2019 2:28:26 GMT+03:00, Stuart Longland wrote:
>On 25/7/19 9:01 am, vita...@yourcmc.ru wrote:
>>> 60 millibits per second? 60 bits every 1000 seconds? Are you
>>> serious? Or did we get the capitalisation wrong?
>>>
>>> Assuming 60MB/sec (as 60 Mb/sec would still be slower than the
>>> 5MB/sec I was getting), maybe there's some characteristic that
>>> Bluestore is particularly dependent on regarding the HDDs.
>>>
>>> I'll admit right up front the drives I'm using were chosen because
>>> they were all I could get with a 2TB storage capacity for a
>>> reasonable price.
>>>
>>> I'm not against moving to Bluestore, however, I think I need to
>>> research it better to understand why the performance I was getting
>>> before was so poor.
>>
>> It's a nano-ceph! So millibits :) I mean 60 megabytes per second, of
>> course. My drives are also crap. I just want to say that you probably
>> miss some option for your VM, for example "cache=writeback".
>
>cache=writeback should have no effect on read performance but could be
>quite dangerous if the VM host were to go down immediately after a
>write for any reason.
>
>While 60MB/sec is getting respectable, doing so at the cost of data
>safety is not something I'm keen on.
>--
>Stuart Longland (aka Redhatter, VK4MSL)
>
>I haven't lost my mind...
>  ...it's backed up on a tape somewhere.

--
With best regards,
Vitaliy Filippov
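The safety argument can be seen from the guest side: with cache=writeback the hypervisor may buffer writes, but it must flush its cache when the guest issues a flush request, which is exactly what fsync() generates. A minimal sketch of that durability contract (generic POSIX, not Ceph- or QEMU-specific):

```python
# Generic POSIX durability pattern: data is only guaranteed to be on
# stable storage after fsync() returns. cache=writeback honors exactly
# this boundary - buffered writes may be lost on a crash, fsynced ones not.
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"journal commit record\n")  # may still sit in caches
    os.fsync(fd)                              # flush request: now durable
    print("record is durable after fsync")
finally:
    os.close(fd)
    os.remove(path)
```

This is why journaled filesystems and databases survive: anything they consider committed sits on the fsync side of that boundary.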
Re: [ceph-users] Observation of bluestore db/wal performance
Bluestore's deferred write queue doesn't act like Filestore's journal because a) it's very small (64 requests) and b) it doesn't have a background flush thread. Bluestore basically refuses to do writes faster than the HDD can do them _on_average_. With Filestore you can get 1000-2000 write iops until the journal becomes full; after that the performance drops to 30-50 iops with very unstable latency. With Bluestore you only get 100-300 iops, but these 100-300 iops are always stable :-) I'd recommend bcache. It should perform much better than Ceph's tiering.

--
With best regards,
Vitaliy Filippov
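The burst-then-collapse behaviour of Filestore can be quantified with a back-of-the-envelope model (my own toy numbers, not measurements: a 5 GB journal, 4k writes, a 2000 iops burst, and an HDD that drains ~100 iops on average):

```python
# Toy model of a Filestore journal absorbing a write burst.
# All numbers are assumptions for illustration only.
journal_bytes = 5 * 1024**3   # assumed journal partition size
write_size = 4096             # 4k random writes
burst_iops = 2000             # rate the journal can accept
drain_iops = 100              # sustained rate of the backing HDD

journal_ops = journal_bytes // write_size
secs_to_full = journal_ops / (burst_iops - drain_iops)
print(f"journal holds {journal_ops} deferred ops")
print(f"burst lasts ~{secs_to_full / 60:.1f} minutes, then iops collapse "
      f"to ~{drain_iops}")
```

With these numbers the burst lasts about 11.5 minutes before throughput collapses to the drain rate. Bluestore's 64-entry deferred queue fills in a fraction of a second under the same load, which is why it never shows a burst phase - just the steady HDD-average rate.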
Re: [ceph-users] Expected IO in luminous Ceph Cluster
Hi Felix,

Better use fio. Like:

fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=rpool_hdd -runtime=60 -rbdname=testimg

(for peak parallel random iops)

Or the same with -iodepth=1 for the latency test.

Or the same with -ioengine=libaio -filename=testfile -size=10G instead of -ioengine=rbd -pool=.. -rbdname=.. to test it from inside a VM.

...or the same with -sync=1 to determine how a DBMS will perform inside a VM...

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?
Is that a question for me or for Victor? :-) I did test my drives; the Intel NVMes are capable of something like 95100 single-thread iops.

On 10 March 2019 1:31:15 GMT+03:00, Martin Verges wrote:
>Hello,
>
>did you test the performance of your individual drives?
>
>Here is a small snippet:
>-
>DRIVE=/dev/XXX
>smartctl -a $DRIVE
>for i in 1 2 4 8 16; do echo "Test $i"; fio --filename=$DRIVE
>--direct=1 --sync=1 --rw=write --bs=4k --numjobs=$i --iodepth=1
>--runtime=60 --time_based --group_reporting --name=journal-test; done
>-
>
>Please share the results so that we know what's possible with your
>hardware.
>
>--
>Martin Verges
>Managing director
>
>Mobile: +49 174 9335695
>E-Mail: martin.ver...@croit.io
>Chat: https://t.me/MartinVerges
>
>croit GmbH, Freseniusstr. 31h, 81247 Munich
>CEO: Martin Verges - VAT-ID: DE310638492
>Com. register: Amtsgericht Munich HRB 231263
>
>Web: https://croit.io
>YouTube: https://goo.gl/PGE1Bx
>
>Vitaliy Filippov wrote on Sat., 9 March 2019, 21:09:
>
>> There are 2:
>>
>> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1
>> -rw=randwrite -pool=bench -rbdname=testimg
>>
>> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128
>> -rw=randwrite -pool=bench -rbdname=testimg
>>
>> The first measures your minimum possible latency - it does not scale
>> with the number of OSDs at all, but it's usually what real
>> applications like DBMSes need.
>>
>> The second measures your maximum possible random write throughput,
>> which you probably won't be able to utilize if you don't have enough
>> VMs all writing in parallel.
>>
>> --
>> With best regards,
>> Vitaliy Filippov

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SSD
"Advanced power loss protection" is in fact a performance feature, not a safety one. 28 февраля 2019 г. 13:03:51 GMT+03:00, Uwe Sauter пишет: >Hi all, > >thanks for your insights. > >Eneko, > >> We tried to use a Samsung 840 Pro SSD as OSD some time ago and it was >a no-go; it wasn't that performance was bad, it >> just didn't work for the kind of use of OSD. Any HDD was better than >it (the disk was healthy and have been used in a >> software raid-1 for a pair of years). >> >> I suggest you check first that your Samsung 860 Pro disks work well >for Ceph. Also, how is your host's RAM? > >As already mentioned the hosts each have 64GB RAM. Each host has 3 SSDs >for OSD usage. Each OSD is using about 1.3GB virtual >memory / 400MB residual memory. > > > >Joachim, > >> I can only recommend the use of enterprise SSDs. We've tested many >consumer SSDs in the past, including your SSDs. Many >> of them are not suitable for long-term use and some weard out within >6 months. > >Unfortunately I couldn't afford enterprise grade SSDs. But I suspect >that my workload (about 20 VMs for our infrastructure, the >most IO demanding is probably LDAP) is light enough that wearout won't >be a problem. > >The issue I'm seeing then is probably related to direct IO if using >bluestore. But with filestore, the file system cache probably >hides the latency issues. > > >Igor, > >> AFAIR Samsung 860 Pro isn't for enterprise market, you shouldn't use >consumer SSDs for Ceph. >> >> I had some experience with Samsung 960 Pro a while ago and it turned >out that it handled fsync-ed writes very slowly >> (comparing to the original/advertised performance). Which one can >probably explain by the lack of power loss protection >> for these drives. I suppose it's the same in your case. 
>> >> Here are a couple links on the topic: >> >> >https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/ >> >> >https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ > >Power loss protection wasn't a criteria for me as the cluster hosts are >distributed in two buildings with separate battery backed >UPSs. As mentioned above I suspect the main difference for my case >between filestore and bluestore is file system cache vs. direct >IO. Which means I will keep using filestore. > >Regards, > > Uwe >___ >ceph-users mailing list >ceph-users@lists.ceph.com >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- With best regards, Vitaliy Filippov___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
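The "performance feature" point is easy to demonstrate outside Ceph. The sketch below (my own illustration; absolute numbers depend entirely on the drive and filesystem backing the temp directory) times plain buffered writes against fsync-ed ones - the same comparison the Percona and Sébastien Han articles linked above make with fio. A drive with power loss protection can safely acknowledge the fsync path from its RAM cache, which is why it stays fast; a consumer drive has to reach flash on every fsync:

```python
# Compare buffered vs fsync-ed 4k writes (illustrative only; results
# depend on the storage backing the temp directory).
import os
import tempfile
import time

def write_iops(count, do_fsync):
    """Write `count` 4k blocks, optionally fsyncing each, return iops."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, b"\0" * 4096)
            if do_fsync:
                os.fsync(fd)   # force each write to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return count / elapsed

print(f"buffered: {write_iops(200, False):10.0f} iops")
print(f"fsynced:  {write_iops(200, True):10.0f} iops")
```

On a consumer SSD the gap between the two lines is typically orders of magnitude; on a drive with power loss protection it is much smaller.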
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
rados bench is garbage - it creates and benches a very small number of objects. If you want RBD, better test it with the fio rbd ioengine.

On 7 February 2019 15:16:11 GMT+03:00, Ryan wrote:
>I just ran your test on a cluster with 5 hosts, 2x Intel 6130, 12x 860
>Evo 2TB SSD per host (6 per SAS3008), 2x bonded 10GB NIC, 2x Arista
>switches.
>
>Pool with 3x replication
>
>rados bench -p scbench -b 4096 10 write --no-cleanup
>hints = 1
>Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096
>for up to 10 seconds or 0 objects
>Object prefix: benchmark_data_dc1-kube-01_3458991
>  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>    0       0         0         0         0         0           -           0
>    1      16      5090      5074   19.7774   19.8203  0.00312568  0.00315352
>    2      16     10441     10425   20.3276   20.9023  0.00332591  0.00307105
>    3      16     15548     15532    20.201   19.9492  0.00337573  0.00309134
>    4      16     20906     20890   20.3826   20.9297  0.00282902  0.00306437
>    5      16     26107     26091   20.3686   20.3164  0.00269844  0.00306698
>    6      16     31246     31230   20.3187   20.0742  0.00339814  0.00307462
>    7      16     36372     36356   20.2753   20.0234  0.00286653   0.0030813
>    8      16     41470     41454   20.2293   19.9141  0.00272051  0.00308839
>    9      16     46815     46799   20.3011   20.8789  0.00284063  0.00307738
>Total time run:         10.0035
>Total writes made:      51918
>Write size:             4096
>Object size:            4096
>Bandwidth (MB/sec):     20.2734
>Stddev Bandwidth:       0.464082
>Max bandwidth (MB/sec): 20.9297
>Min bandwidth (MB/sec): 19.8203
>Average IOPS:           5189
>Stddev IOPS:            118
>Max IOPS:               5358
>Min IOPS:               5074
>Average Latency(s):     0.00308195
>Stddev Latency(s):      0.00142825
>Max latency(s):         0.0267947
>Min latency(s):         0.00217364
>
>rados bench -p scbench 10 rand
>hints = 1
>  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>    0       0         0         0         0         0           -           0
>    1      15     39691     39676    154.95   154.984  0.00027022 0.000395993
>    2      16     83701     83685   163.416    171.91 0.000318949 0.000375363
>    3      15    129218    129203   168.199   177.805 0.000300898 0.000364647
>    4      15    173733    173718   169.617   173.887 0.000311723  0.00036156
>    5      15    216073    216058   168.769   165.391 0.000407594 0.000363371
>    6      16    260381    260365   169.483   173.074 0.000323371 0.000361829
>    7      15    306838    306823   171.193   181.477 0.000284247 0.000358199
>    8      15    353675    353660   172.661   182.957 0.000338128 0.000355139
>    9      15    399221    399206   173.243   177.914 0.000422527  0.00035393
>Total time run:       10.0003
>Total reads made:     446353
>Read size:            4096
>Object size:          4096
>Bandwidth (MB/sec):   174.351
>Average IOPS:         44633
>Stddev IOPS:          2220
>Max IOPS:             46837
>Min IOPS:             39676
>Average Latency(s):   0.000351679
>Max latency(s):       0.00530195
>Min latency(s):       0.000135292
>
>On Thu, Feb 7, 2019 at 2:17 AM wrote:
>
>> Hi List
>>
>> We are in the process of moving to the next use case for our ceph
>> cluster (bulk, cheap, slow, erasure-coded, cephfs storage was the
>> first - and that works fine).
>>
>> We're currently on luminous / bluestore; if upgrading is deemed to
>> change what we're seeing then please let us know.
>>
>> We have 6 OSD hosts, each with one 1TB S4510 SSD. Connected through
>> an H700 MegaRaid Perc BBWC, each disk as a single-disk RAID0 - and
>> scheduler set to deadline, nomerges = 1, rotational = 0.
>>
>> Each disk "should" give approximately 36K IOPS random write and
>> double that random read.
>>
>> The pool is set up with 3x replication. We would like a "scale-out"
>> setup of well-performing SSD block devices - potentially to host
>> databases and things like that. I read through this nice document
>> [0]; I know the HW is radically different from mine, but I still
>> think I'm in the very low end of what 6 x S4510 should be capable of
>> doing.
>>
>> Since it is IOPS I care about, I have lowered the block size to 4096
>> -- a 4M block size nicely saturates the NICs in both directions.
>>
>> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
>> hints = 1
>> Maintaining 16 concurrent writes of 4096 bytes to objects of size
>> 4096 for up to 10 seconds or 0 objects
>> Object prefix: benchmark_data_torsk2_11207
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>     0       0         0         0         0         0           -           0
>>     1      16      5857      5841   22.8155   22.8164  0.00238437  0.00273434
>>     2      15     11768     11753   22.9533   23.0938   0.0028559  0.00271944
>>     3      16     17264     17
Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host
Is RDMA officially supported? I'm asking because I recently tried to use DPDK and it seems broken... i.e. the code is there, but it doesn't compile until I fix the cmake scripts, and after fixing the build the OSDs just segfault and die after processing something like 40-50 incoming packets. Maybe RDMA is in the same state?

On 13 December 2018 2:42:23 GMT+03:00, Michael Green wrote:
>Sorry for bumping the thread. I refuse to believe there are no people
>on this list who have successfully enabled and run RDMA with Mimic. :)
>
>Mike
>
>> Hello collective wisdom,
>>
>> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
>> (stable) here.
>>
>> I have a working cluster here consisting of 3 monitor hosts, 64 OSD
>> processes across 4 osd hosts, plus 2 MDSs, plus 2 MGRs. All of that
>> is consumed by 10 client nodes.
>>
>> Every host in the cluster, including clients, is
>> RHEL 7.5
>> Mellanox OFED 4.4-2.0.7.0
>> RoCE NICs are either MCX416A-CCAT or MCX414A-CCAT @ 50Gbit/sec
>> The NICs are all mlx5_0 port 1
>>
>> rping and ib_send_bw work fine both ways on any two nodes in the
>> cluster.
>>
>> The full configuration of the cluster is pasted below, but the
>> RDMA-related parameters are configured as follows:
>>
>> ms_public_type = async+rdma
>> ms_cluster = async+rdma
>> # Exclude clients for now
>> ms_type = async+posix
>>
>> ms_async_rdma_device_name = mlx5_0
>> ms_async_rdma_polling_us = 0
>> ms_async_rdma_port_num=1
>>
>> When I try to start the MON, it immediately fails as below. Has
>> anybody seen this, or could anyone give pointers on what/where to
>> look next?
>>
>> --ceph-mon.rio.log--begin--
>> 2018-12-12 22:35:30.011 7f515dc39140 0 set uid:gid to 167:167 (ceph:ceph)
>> 2018-12-12 22:35:30.011 7f515dc39140 0 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process ceph-mon, pid 2129843
>> 2018-12-12 22:35:30.011 7f515dc39140 0 pidfile_write: ignore empty --pid-file
>> 2018-12-12 22:35:30.036 7f515dc39140 0 load: jerasure load: lrc load: isa
>> 2018-12-12 22:35:30.036 7f515dc39140 0 set rocksdb option compression = kNoCompression
>> 2018-12-12 22:35:30.036 7f515dc39140 0 set rocksdb option level_compaction_dynamic_level_bytes = true
>> 2018-12-12 22:35:30.036 7f515dc39140 0 set rocksdb option write_buffer_size = 33554432
>> 2018-12-12 22:35:30.036 7f515dc39140 0 set rocksdb option compression = kNoCompression
>> 2018-12-12 22:35:30.036 7f515dc39140 0 set rocksdb option level_compaction_dynamic_level_bytes = true
>> 2018-12-12 22:35:30.036 7f515dc39140 0 set rocksdb option write_buffer_size = 33554432
>> 2018-12-12 22:35:30.147 7f51442ed700 2 Event(0x55d927e95700 nevent=5000 time_id=1).set_owner idx=1 owner=139987012998912
>> 2018-12-12 22:35:30.147 7f51442ed700 10 stack operator() starting
>> 2018-12-12 22:35:30.147 7f5143aec700 2 Event(0x55d927e95200 nevent=5000 time_id=1).set_owner idx=0 owner=139987004606208
>> 2018-12-12 22:35:30.147 7f5144aee700 2 Event(0x55d927e95c00 nevent=5000 time_id=1).set_owner idx=2 owner=139987021391616
>> 2018-12-12 22:35:30.147 7f5143aec700 10 stack operator() starting
>> 2018-12-12 22:35:30.147 7f5144aee700 10 stack operator() starting
>> 2018-12-12 22:35:30.147 7f515dc39140 0 starting mon.rio rank 0 at public addr 192.168.1.58:6789/0 at bind addr 192.168.1.58:6789/0 mon_data /var/lib/ceph/mon/ceph-rio fsid 376540c8-a362-41cc-9a58-9c8ceca0e4ee
>> 2018-12-12 22:35:30.147 7f515dc39140 10 -- - bind bind 192.168.1.58:6789/0
>> 2018-12-12 22:35:30.147 7f515dc39140 10 -- - bind Network Stack is not ready for bind yet - postponed
>> 2018-12-12 22:35:30.147 7f515dc39140 0 starting mon.rio rank 0 at 192.168.1.58:6789/0 mon_data /var/lib/ceph/mon/ceph-rio fsid 376540c8-a362-41cc-9a58-9c8ceca0e4ee
>> 2018-12-12 22:35:30.148 7f515dc39140 0 mon.rio@-1(probing).mds e84 new map
>> 2018-12-12 22:35:30.148 7f515dc39140 0 mon.rio@-1(probing).mds e84 print_map
>> e84
>> enable_multiple, ever_enabled_multiple: 0,0
>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> legacy client fscid: -1
>>
>> No filesystems configured
>> Standby daemons:
>>
>> 5906437: 192.168.1.152:6800/1077205146 'prince' mds.-1.0 up:standby seq 2
>> 6284118: 192.168.1.59:6800/1266235911 'salvador' mds.-1.0 up:standby seq 2
>>
>> 2018-12-12 22:35:30.148 7f515dc39140 0 mon.rio@-1(probing).osd e25894 crush map has features 288514051259236352, adjusting msgr requires
>> 2018-12-12 22:35:30.148 7f515dc39140 0 mon.rio@-1(probing).osd e25894 crush map has features 288514051259236352, adjusting msgr requires
>> 2018-12-12 22:35:30.148 7f515dc39140 0 mon.rio@-1(probing).osd e25894 crush map has features 1009089991638532096, adjusting msgr requires
>> 2018-12-12 22:35:30.148 7f515dc39140 0 mon.rio@-
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
OK... That's better than the previous thread with the file download, where the topic starter suffered from a normal filesystem that journals only metadata... Thanks for the link, it would be interesting to repeat similar tests. Although I suspect it shouldn't be that bad... at least not all desktop SSDs are that broken - for example, https://engineering.nordeus.com/power-failure-testing-with-ssds/ says the Samsung 840 Pro is OK.

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times
This may be the explanation: https://serverfault.com/questions/857271/better-performance-when-hdd-write-cache-is-disabled-hgst-ultrastar-7k6000-and Other manufacturers may have started to do the same, I suppose.

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing
Is there a way to force OSDs to remove old data?

--
With best regards,
Vitaliy Filippov
[ceph-users] Ceph cluster uses substantially more disk space after rebalancing
Hi,

After I recreated one OSD and increased the pg count of my erasure-coded (2+1) pool (which was way too low, only 100 for 9 osds), the cluster started to eat additional disk space.

First I thought that was caused by the moved PGs using additional space during unfinished backfills. I pinned most of the new PGs to old OSDs via `pg-upmap` and indeed it freed some space in the cluster. Then I reduced osd_max_backfills to 1 and started to remove upmap pins in small portions, which allowed Ceph to finish backfills for these PGs.

HOWEVER, used capacity still grows! It drops after moving each PG, but still grows overall. It grew +1.3TB yesterday. In the same period of time clients wrote only ~200 new objects (~800 MB; there are only RBD images). Why, what's using such a big amount of additional space?

Graphs from our prometheus are attached. Only ~200 objects were created by RBD clients yesterday, but used raw space increased by +1.3 TB.

An additional question is why ceph df / rados df says there is only 16 TB of actual data written, but 29.8 TB (now 31 TB) of raw disk space is used. Shouldn't it be 16 / 2 * 3 = 24 TB?

ceph df output:

[root@sill-01 ~]# ceph df
GLOBAL:
    SIZE      AVAIL      RAW USED    %RAW USED
    38 TiB    6.9 TiB    32 TiB      82.03
POOLS:
    NAME          ID    USED       %USED    MAX AVAIL    OBJECTS
    ecpool_hdd    13    16 TiB     93.94    1.0 TiB      7611672
    rpool_hdd     15    9.2 MiB    0        515 GiB      92
    fs_meta       44    20 KiB     0        515 GiB      23
    fs_data       45    0 B        0        1.0 TiB      0

How to heal it?

--
With best regards,
Vitaliy Filippov
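For reference, the expected raw usage of the EC pool works out as simple arithmetic (using the numbers from the ceph df output above):

```python
# Expected raw usage of a k=2, m=1 erasure-coded pool:
# every 2 data chunks get 1 coding chunk, i.e. overhead = (k + m) / k.
k, m = 2, 1
logical_tb = 16          # USED reported by ceph df for ecpool_hdd
observed_raw_tb = 32     # RAW USED reported by ceph df

expected_raw_tb = logical_tb * (k + m) / k
print(f"expected raw usage: {expected_raw_tb:.1f} TB")  # -> 24.0 TB
print(f"unexplained extra:  {observed_raw_tb - expected_raw_tb:.1f} TB")
```

So ~24 TB raw would be expected for 16 TB of logical data; the observed 32 TB leaves ~8 TB unexplained, which is exactly the gap the question is about.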
Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs
I mean, does every upgraded installation hit this bug, or do some upgrade without any problem?

> The problem occurs after upgrade; fresh 13.2.2 installs are not affected.

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs
By the way, does it happen with all installations or only under some conditions?

> CephFS will be offline and show up as "damaged" in ceph -s
>
> The fix is to downgrade to 13.2.1 and issue a "ceph fs repaired " command.
>
> Paul

--
With best regards,
Vitaliy Filippov
[ceph-users] CephFS "authorize" on erasure-coded FS
Hi,

I've recently tried to set up a user for a CephFS running on a pair of replicated+erasure pools, but after I ran

ceph fs authorize ecfs client.samba / rw

the "client.samba" user could only see listings, but couldn't read or write any files. I tried looking in the logs and raising the debug level, and saw no clues about this problem. However, when I then modified its caps with:

ceph auth caps client.samba mds 'allow rw' mon 'allow r' osd 'allow rw tag cephfs data=ecfs, allow rw pool=ecpool'

everything went OK and the user gained read-write access to the files. Does that mean there's a bug in CephFS caps that prevents users from reading or writing to an FS running on an EC pool?

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] ceph issue tracker tells that posting issues is forbidden
Thanks for the reply! OK, I understand :-) But the page still shows 403 as of now...

On 5 August 2018 6:42:33 GMT+03:00, Gregory Farnum wrote:
>On Sun, Aug 5, 2018 at 1:25 AM Vitaliy Filippov wrote:
>
>> Hi!
>>
>> I wanted to report a bug in ceph, but I found out that visiting
>> http://tracker.ceph.com/projects/ceph/issues/new gives me only "403
>> You are not authorized to access this page."
>>
>> What does it mean - why is it forbidden to post issues?
>
>We just got spammed via the API last week so we had to lock some things
>down temporarily to prevent it from continuing. I don't think it was
>expected to impact users of the web site, but it might have for a few
>hours. The page is showing up for me when I try and visit it now, so
>try again?
>Sorry you ran into this!
>-Greg

--
With best regards,
Vitaliy Filippov
[ceph-users] ceph issue tracker tells that posting issues is forbidden
Hi! I wanted to report a bug in ceph, but I found out that visiting http://tracker.ceph.com/projects/ceph/issues/new gives me only "403 You are not authorized to access this page." What does it mean - why is it forbidden to post issues?

--
With best regards,
Vitaliy Filippov
[ceph-users] Strange copy errors in osd log
Hi!

I'm playing with a test setup of ceph jewel with bluestore and cephfs over an erasure-coded pool, with a replicated pool as a cache tier. After writing some number of small files to cephfs, I begin seeing the following error messages during the migration of data from the cache to the EC pool:

2016-09-01 10:19:27.364710 7f37c1a09700 -1 osd.0 pg_epoch: 329 pg[6.2cs0( v 329'388 (0'0,329'388] local-les=315 n=326 ec=279 les/c/f 315/315/0 314/314/314) [0,1,2] r=0 lpr=314 crt=329'387 lcod 329'387 mlcod 329'387 active+clean] process_copy_chunk data digest 0x648fd38c != source 0x40203b61
2016-09-01 10:19:27.364742 7f37c1a09700 -1 log_channel(cluster) log [ERR] : 6.2cs0 copy from 8:372dc315:::200.002b:head to 6:372dc315:::200.002b:head data digest 0x648fd38c != source 0x40203b61

These messages then repeat infinitely for the same set of objects at some interval. I'm not sure - does this mean some objects are corrupted in the OSDs? (How would I check?) Is it a bug at all?

P.S.: I've also reported this as an issue: http://tracker.ceph.com/issues/17194 (not sure if it was right to do so :))

--
With best regards,
Vitaliy Filippov