Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Виталий Филиппов
Hi! Thanks.

The parameter gets reset when you reconnect the SSD, so in fact you just have to
avoid power-cycling the drive after changing it :-)

Ok, this case seems lucky, ~2x change isn't a lot. Can you tell the exact model 
and capacity of this Micron, and what controller was used in this test? I'll 
add it to the spreadsheet.
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Виталий Филиппов
...disable signatures and rbd cache. I didn't mention it in the email to not 
repeat myself. But I have it in the article :-)
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-19 Thread Виталий Филиппов
I had. 100-200 write iops with iodepth=1, ~5k iops with iodepth=128. These were 
intel 545s.

Not that awful, but micron 5200 costs only a fraction more, so it seems 
pointless to me to use desktop samsungs.

On 19 December 2019 at 22:20:28 GMT+03:00, Sinan Polat  wrote:
>Hi all,
>
>Thanks for the replies. I am not worried about their lifetime. We will
>be adding only 1 SSD disk per physical server. All SSD’s are enterprise
>drives. If the added consumer grade disk will fail, no problem.
>
>I am more curious regarding their I/O performance. I do not want to see
>a 50% drop in performance.
>
>So anyone any experience with 860 EVO or Crucial MX500 in a Ceph setup?
>
>Thanks!
>
>> On 19 Dec 2019 at 19:18, Mark Nelson  wrote:
>> 
>> The way I try to look at this is:
>> 
>> 
>> 1) How much more do the enterprise grade drives cost?
>> 
>> 2) What are the benefits? (Faster performance, longer life, etc)
>> 
>> 3) How much does it cost to deal with downtime, diagnose issues, and
>replace malfunctioning hardware?
>> 
>> 
>> My personal take is that enterprise drives are usually worth it.
>There may be consumer grade drives that may be worth considering in
>very specific scenarios if they still have power loss protection and
>high write durability.  Even when I was in academia years ago with very
>limited budgets, we got burned with consumer grade SSDs to the point
>where we had to replace them all.  You have to be very careful and know
>exactly what you are buying.
>> 
>> 
>> Mark
>> 
>> 
>>> On 12/19/19 12:04 PM, jes...@krogh.cc wrote:
>>> I don't think “usually” is good enough in a production setup.
>>> 
>>> 
>>> 
>>> Sent from myMail for iOS
>>> 
>>> 
>>> Thursday, 19 December 2019, 12.09 +0100 from Виталий Филиппов
>:
>>> 
>>>Usually it doesn't, it only harms performance and probably SSD
>>>lifetime too
>>>
>>>> I would not be running ceph on ssds without powerloss
>>>> protection. It delivers a potential data loss scenario
>>> 
>>> 

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-18 Thread Виталий Филиппов
https://yourcmc.ru/wiki/Ceph_performance

https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc

On 19 December 2019 at 0:41:02 GMT+03:00, Sinan Polat  wrote:
>Hi,
>
>I am aware that
>https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>holds a list with benchmark of quite some different ssd models.
>Unfortunately it
>doesn't have benchmarks for recent ssd models.
>
>A client is planning to expand a running cluster (Luminous, FileStore,
>SSD only,
>Replicated). I/O Utilization is close to 0, but capacity wise the
>cluster is
>almost nearfull. To save costs the cluster will be expanded with
>consumer-grade
>SSD's, but I am unable to find benchmarks of recent SSD models.
>
>Does anyone have experience with Samsung 860 EVO, 860 PRO and Crucial
>MX500 in a
>Ceph cluster?
>
>Thanks!
>Sinan

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] WAL/DB size

2019-08-15 Thread Виталий Филиппов
30gb already includes WAL, see 
http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
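
In practice that means you only give the OSD a DB device and let the WAL live
there too. A minimal ceph-volume sketch (the device names below are just
placeholders):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
# no --block.wal is given, so the WAL is colocated on the same ~30 GB DB partition

A separate --block.wal only makes sense when there is a third, even faster device.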

On 15 August 2019 at 1:15:58 GMT+03:00, Anthony D'Atri  wrote:
>Good points in both posts, but I think there’s still some unclarity.
>
>Absolutely let’s talk about DB and WAL together.  By “bluestore goes on
>flash” I assume you mean WAL+DB?
>
>“Simply allocate DB and WAL will appear there automatically”
>
>Forgive me please if this is obvious, but I’d like to see a holistic
>explanation of WAL and DB sizing *together*, which I think would help
>folks put these concepts together and plan deployments with some sense
>of confidence.
>
>We’ve seen good explanations on the list of why only specific DB sizes,
>say 30GB, are actually used _for the DB_.
>If the WAL goes along with the DB, shouldn’t we also explicitly
>determine an appropriate size N for the WAL, and make the partition
>(30+N) GB?
>If so, how do we derive N?  Or is it a constant?
>
>Filestore was so much simpler, 10GB set+forget for the journal.  Not
>that I miss XFS, mind you.
>
>
>>> Actually standalone WAL is required when you have either very small
>fast
>>> device (and don't want db to use it) or three devices (different in
>>> performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be
>located
>>> at the fastest one.
>>> 
>>> For the given use case you just have HDD and NVMe and DB and WAL can
>>> safely collocate. Which means you don't need to allocate specific
>volume
>for WAL. Hence no need to answer the question how much space is
>needed
>>> for WAL. Simply allocate DB and WAL will appear there automatically.
>>> 
>>> 
>> Yes, i'm surprised how often people talk about the DB and WAL
>separately
>> for no good reason.  In common setups bluestore goes on flash and the
>> storage goes on the HDDs, simple.
>> 
>> In the event flash is 100s of GB and would be wasted, is there
>anything
>> that needs to be done to set rocksdb to use the highest level?  600 I
>> believe
>
>
>

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] CephFS snapshot for backup & disaster recovery

2019-08-04 Thread Виталий Филиппов
Afaik no. What's the idea of running a single-host cephfs cluster?

On 4 August 2019 at 13:27:00 GMT+03:00, Eitan Mosenkis  wrote:
>I'm running a single-host Ceph cluster for CephFS and I'd like to keep
>backups in Amazon S3 for disaster recovery. Is there a simple way to
>extract a CephFS snapshot as a single file and/or to create a file that
>represents the incremental difference between two snapshots?

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Future of Filestore?

2019-07-25 Thread Виталий Филиппов
Hi again,

I reread your initial email - do you also run a nanoceph on some SBCs each 
having one 2.5" 5400rpm HDD plugged into it? What SBCs do you use? :-)
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Future of Filestore?

2019-07-24 Thread Виталий Филиппов
Cache=writeback is perfectly safe, it's flushed when the guest calls fsync, so 
journaled filesystems and databases don't lose data that's committed to the 
journal.
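
For reference, a sketch of where that option is set for a QEMU guest backed by
RBD (the pool/image name is made up):

qemu-system-x86_64 ... \
  -drive file=rbd:rpool_hdd/vm1,format=raw,if=virtio,cache=writeback

With cache=writeback the guest sees writes as complete once they are in the
cache, and the data only has to reach the OSDs after the guest issues a
flush/fsync.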

On 25 July 2019 at 2:28:26 GMT+03:00, Stuart Longland  wrote:
>On 25/7/19 9:01 am, vita...@yourcmc.ru wrote:
>>> 60 millibits per second?  60 bits every 1000 seconds?  Are you
>serious?
>>>  Or did we get the capitalisation wrong?
>>>
>>> Assuming 60MB/sec (as 60 Mb/sec would still be slower than the
>5MB/sec I
>>> was getting), maybe there's some characteristic that Bluestore is
>>> particularly dependent on regarding the HDDs.
>>>
>>> I'll admit right up front the drives I'm using were chosen because
>they
>>> were all I could get with a 2TB storage capacity for a reasonable
>price.
>>>
>>> I'm not against moving to Bluestore, however, I think I need to
>research
>>> it better to understand why the performance I was getting before was
>so
>>> poor.
>> 
>> It's a nano-ceph! So millibits :) I mean 60 megabytes per second, of 
>> course. My drives are also crap. I just want to say that you probably
>> miss some option for your VM, for example "cache=writeback".
>
>cache=writeback should have no effect on read performance but could be 
>quite dangerous if the VM host were to go down immediately after a
>write 
>for any reason.
>
>While 60MB/sec is getting respectable, doing so at the cost of data 
>safety is not something I'm keen on.
>-- 
>Stuart Longland (aka Redhatter, VK4MSL)
>
>I haven't lost my mind...
>   ...it's backed up on a tape somewhere.

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Observation of bluestore db/wal performance

2019-07-23 Thread Виталий Филиппов
Bluestore's deferred write queue doesn't act like Filestore's journal because 
a) it's very small = 64 requests b) it doesn't have a background flush thread. 
Bluestore basically refuses to do writes faster than the HDD can do them 
_on_average_. With Filestore you can have 1000-2000 write iops until the 
journal becomes full. After that the performance will drop to 30-50 iops with 
very unstable latency. With Bluestore you only get 100-300 iops, but these 
100-300 iops are always stable :-)

I'd recommend bcache. It should perform much better than ceph's tiering.
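
A rough bcache setup sketch, in case it helps (device names are examples and the
commands wipe them):

make-bcache -C /dev/nvme0n1p2   # SSD/NVMe partition becomes the cache device
make-bcache -B /dev/sdb         # HDD becomes the backing device, /dev/bcache0 appears
# the cache set UUID shows up under /sys/fs/bcache/ once the cache device is registered
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode
# then create the OSD on /dev/bcache0 instead of the raw HDD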
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Expected IO in luminous Ceph Cluster

2019-06-12 Thread Виталий Филиппов
Hi Felix,

Better use fio.

Like fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 
-rw=randwrite -pool=rpool_hdd -runtime=60 -rbdname=testimg (for peak parallel 
random iops)

Or the same with -iodepth=1 for the latency test. Here you usually get

Or the same with -ioengine=libaio -filename=testfile -size=10G instead of 
-ioengine=rbd -pool=.. -rbdname=.. to test it from inside a VM.

...or the same with -sync=1 to determine how a DBMS will perform inside a VM...
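
Putting the last two variants together, a DBMS-like test from inside the VM
would be something like (file name and size are arbitrary):

fio -ioengine=libaio -direct=1 -sync=1 -invalidate=1 -name=test -bs=4k -iodepth=1 \
    -rw=randwrite -filename=testfile -size=10G -runtime=60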
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Виталий Филиппов
Is it a question to me or Victor? :-)

I did test my drives, intel nvmes are capable of something like 95100 single 
thread iops.

On 10 March 2019 at 1:31:15 GMT+03:00, Martin Verges  wrote:
>Hello,
>
>did you test the performance of your individual drives?
>
>Here is a small snippet:
>-
>DRIVE=/dev/XXX
>smartctl -a $DRIVE
>for i in 1 2 4 8 16; do echo "Test $i"; fio --filename=$DRIVE --direct=1 \
>  --sync=1 --rw=write --bs=4k --numjobs=$i --iodepth=1 --runtime=60 \
>  --time_based --group_reporting --name=journal-test; done
>-
>
>Please share the results that we know what's possible with your
>hardware.
>
>--
>Martin Verges
>Managing director
>
>Mobile: +49 174 9335695
>E-Mail: martin.ver...@croit.io
>Chat: https://t.me/MartinVerges
>
>croit GmbH, Freseniusstr. 31h, 81247 Munich
>CEO: Martin Verges - VAT-ID: DE310638492
>Com. register: Amtsgericht Munich HRB 231263
>
>Web: https://croit.io
>YouTube: https://goo.gl/PGE1Bx
>
>Vitaliy Filippov  schrieb am Sa., 9. März 2019,
>21:09:
>
>> There are 2:
>>
>> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1
>-rw=randwrite
>> -pool=bench -rbdname=testimg
>>
>> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128
>-rw=randwrite
>> -pool=bench -rbdname=testimg
>>
>> The first measures your min possible latency - it does not scale with
>the
>> number of OSDs at all, but it's usually what real applications like
>> DBMSes
>> need.
>>
>> The second measures your max possible random write throughput which
>you
>> probably won't be able to utilize if you don't have enough VMs all
>> writing
>> in parallel.
>>
>> --
>> With best regards,
>>Vitaliy Filippov

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Виталий Филиппов
"Advanced power loss protection" is in fact a performance feature, not a safety 
one.
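
The difference shows up as soon as writes are fsync'ed. A quick way to see it on
a spare drive is the usual journal-style test (destructive, /dev/sdX is a
placeholder):

fio --name=sync-test --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting

Drives with capacitors can acknowledge sync writes from their cache; drives
without them have to wait for the flash on every single write.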

On 28 February 2019 at 13:03:51 GMT+03:00, Uwe Sauter  wrote:
>Hi all,
>
>thanks for your insights.
>
>Eneko,
>
>> We tried to use a Samsung 840 Pro SSD as OSD some time ago and it was
>a no-go; it wasn't that performance was bad, it 
>> just didn't work for the kind of use of OSD. Any HDD was better than
>it (the disk was healthy and have been used in a 
>> software raid-1 for a pair of years).
>> 
>> I suggest you check first that your Samsung 860 Pro disks work well
>for Ceph. Also, how is your host's RAM?
>
>As already mentioned the hosts each have 64GB RAM. Each host has 3 SSDs
>for OSD usage. Each OSD is using about 1.3GB virtual
>memory / 400MB residual memory.
>
>
>
>Joachim,
>
>> I can only recommend the use of enterprise SSDs. We've tested many
>> consumer SSDs in the past, including your SSDs. Many
>> of them are not suitable for long-term use and some wore out within
>> 6 months.
>
>Unfortunately I couldn't afford enterprise grade SSDs. But I suspect
>that my workload (about 20 VMs for our infrastructure, the
>most IO demanding is probably LDAP) is light enough that wearout won't
>be a problem.
>
>The issue I'm seeing then is probably related to direct IO if using
>bluestore. But with filestore, the file system cache probably
>hides the latency issues.
>
>
>Igor,
>
>> AFAIR Samsung 860 Pro isn't for enterprise market, you shouldn't use
>consumer SSDs for Ceph.
>> 
>> I had some experience with Samsung 960 Pro a while ago and it turned
>out that it handled fsync-ed writes very slowly 
>> (comparing to the original/advertised performance). Which one can
>probably explain by the lack of power loss protection 
>> for these drives. I suppose it's the same in your case.
>> 
>> Here are a couple links on the topic:
>> 
>>
>https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
>> 
>>
>https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
>Power loss protection wasn't a criteria for me as the cluster hosts are
>distributed in two buildings with separate battery backed
>UPSs. As mentioned above I suspect the main difference for my case
>between filestore and bluestore is file system cache vs. direct
>IO. Which means I will keep using filestore.
>
>Regards,
>
>   Uwe

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Виталий Филиппов
rados bench is garbage, it creates and benchmarks only a very small number of objects.
If you want to test RBD performance, better do it with fio and ioengine=rbd.
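
For example, something along these lines (pool and image names are placeholders,
and the image has to exist already):

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite \
    -pool=rpool_hdd -rbdname=testimg -runtime=60
# and the same with -iodepth=1 for the single-client latency number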

On 7 February 2019 at 15:16:11 GMT+03:00, Ryan  wrote:
>I just ran your test on a cluster with 5 hosts 2x Intel 6130, 12x 860
>Evo
>2TB SSD per host (6 per SAS3008), 2x bonded 10GB NIC, 2x Arista
>switches.
>
>Pool with 3x replication
>
>rados bench -p scbench -b 4096 10 write --no-cleanup
>hints = 1
>Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096
>for
>up to 10 seconds or 0 objects
>Object prefix: benchmark_data_dc1-kube-01_3458991
>  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>lat(s)
>0   0 0 0 0 0   -
> 0
>1  16  5090  5074   19.7774   19.8203  0.00312568
>0.00315352
>2  16 10441 10425   20.3276   20.9023  0.00332591
>0.00307105
>3  16 15548 15532    20.201   19.9492  0.00337573
>0.00309134
>4  16 20906 20890   20.3826   20.9297  0.00282902
>0.00306437
>5  16 26107 26091   20.3686   20.3164  0.00269844
>0.00306698
>6  16 31246 31230   20.3187   20.0742  0.00339814
>0.00307462
>7  16 36372 36356   20.2753   20.0234  0.00286653
> 0.0030813
>8  16 41470 41454   20.2293   19.9141  0.00272051
>0.00308839
>9  16 46815 46799   20.3011   20.8789  0.00284063
>0.00307738
>Total time run: 10.0035
>Total writes made:  51918
>Write size: 4096
>Object size:4096
>Bandwidth (MB/sec): 20.2734
>Stddev Bandwidth:   0.464082
>Max bandwidth (MB/sec): 20.9297
>Min bandwidth (MB/sec): 19.8203
>Average IOPS:   5189
>Stddev IOPS:118
>Max IOPS:   5358
>Min IOPS:   5074
>Average Latency(s): 0.00308195
>Stddev Latency(s):  0.00142825
>Max latency(s): 0.0267947
>Min latency(s): 0.00217364
>
>rados bench -p scbench 10 rand
>hints = 1
>  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>lat(s)
>0   0 0 0 0 0   -
> 0
>1  15 39691 39676    154.95   154.984  0.00027022
>0.000395993
>2  16 83701 83685   163.416    171.91 0.000318949
>0.000375363
>3  15 129218 129203   168.199   177.805 0.000300898
>0.000364647
>4  15 173733 173718   169.617   173.887 0.000311723
>0.00036156
>5  15 216073 216058   168.769   165.391 0.000407594
>0.000363371
>6  16 260381 260365   169.483   173.074 0.000323371
>0.000361829
>7  15 306838 306823   171.193   181.477 0.000284247
>0.000358199
>8  15 353675 353660   172.661   182.957 0.000338128
>0.000355139
>9  15 399221 399206   173.243   177.914 0.000422527
>0.00035393
>Total time run:   10.0003
>Total reads made: 446353
>Read size:4096
>Object size:  4096
>Bandwidth (MB/sec):   174.351
>Average IOPS: 44633
>Stddev IOPS:  2220
>Max IOPS: 46837
>Min IOPS: 39676
>Average Latency(s):   0.000351679
>Max latency(s):   0.00530195
>Min latency(s):   0.000135292
>
>On Thu, Feb 7, 2019 at 2:17 AM  wrote:
>
>> Hi List
>>
>> We are in the process of moving to the next usecase for our ceph
>cluster
>> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
>> that works fine.
>>
>> We're currently on luminous / bluestore, if upgrading is deemed to
>> change what we're seeing then please let us know.
>>
>> We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each.
>Connected
>> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set
>to
>> deadline, nomerges = 1, rotational = 0.
>>
>> Each disk "should" give approximately 36K IOPS random write and the
>double
>> random read.
>>
>> Pool is setup with a 3x replicaiton. We would like a "scaleout" setup
>of
>> well performing SSD block devices - potentially to host databases and
>> things like that. I ready through this nice document [0], I know the
>> HW are radically different from mine, but I still think I'm in the
>> very low end of what 6 x S4510 should be capable of doing.
>>
>> Since it is IOPS i care about I have lowered block size to 4096 -- 4M
>> blocksize nicely saturates the NIC's in both directions.
>>
>>
>> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
>> hints = 1
>> Maintaining 16 concurrent writes of 4096 bytes to objects of size
>4096 for
>> up to 10 seconds or 0 objects
>> Object prefix: benchmark_data_torsk2_11207
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s) 
>avg
>> lat(s)
>> 0   0 0 0 0 0   -
>>  0
>> 1  16  5857  5841   22.8155   22.8164  0.00238437
>> 0.00273434
>> 2  15 11768 11753   22.9533   23.0938   0.0028559
>> 0.00271944
>> 3  16 17264 17

Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host

2018-12-18 Thread Виталий Филиппов
Is RDMA officially supported? I'm asking because I recently tried to use DPDK
and it seems it's broken... i.e. the code is there, but it doesn't compile until I
fix the cmake scripts, and after fixing the build the OSDs just segfault and die
after processing something like 40-50 incoming packets.

Maybe RDMA is in the same state?

On 13 December 2018 at 2:42:23 GMT+03:00, Michael Green  wrote:
>Sorry for bumping the thread. I refuse to believe there are no people
>on this list who have successfully enabled and run RDMA with Mimic. :)
>
>Mike
>
>> Hello collective wisdom,
>> 
>> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
>(stable) here.
>> 
>> I have a working cluster here consisting of 3 monitor hosts,  64 OSD
>processes across 4 osd hosts, plus 2 MDSs, plus 2 MGRs. All of that is
>consumed by 10 client nodes.
>> 
>> Every host in the cluster, including clients is 
>> RHEL 7.5
>> Mellanox OFED 4.4-2.0.7.0
>> RoCE NICs are either MCX416A-CCAT or MCX414A-CCAT @ 50Gbit/sec
>> The NICs are all mlx5_0 port 1
>> 
>> ring and ib_send_bw work fine both ways on any two nodes in the
>cluster.
>> 
>> Full configuration of the cluster is pasted below, but RDMA related
>parameters are configured as following:
>> 
>> 
>> ms_public_type = async+rdma
>> ms_cluster = async+rdma
>> # Exclude clients for now 
>> ms_type = async+posix
>> 
>> ms_async_rdma_device_name = mlx5_0
>> ms_async_rdma_polling_us = 0
>> ms_async_rdma_port_num=1
>> 
>> When I try to start MON, it immediately fails as below. Anybody has
>seen this or could give any pointers what to/where to look next?
>> 
>> 
>> --ceph-mon.rio.log--begin--
>> 2018-12-12 22:35:30.011 7f515dc39140  0 set uid:gid to 167:167
>(ceph:ceph)
>> 2018-12-12 22:35:30.011 7f515dc39140  0 ceph version 13.2.2
>(02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process
>ceph-mon, pid 2129843
>> 2018-12-12 22:35:30.011 7f515dc39140  0 pidfile_write: ignore empty
>--pid-file
>> 2018-12-12 22:35:30.036 7f515dc39140  0 load: jerasure load: lrc
>load: isa
>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>compression = kNoCompression
>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>level_compaction_dynamic_level_bytes = true
>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>write_buffer_size = 33554432
>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>compression = kNoCompression
>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>level_compaction_dynamic_level_bytes = true
>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>write_buffer_size = 33554432
>> 2018-12-12 22:35:30.147 7f51442ed700  2 Event(0x55d927e95700
>nevent=5000 time_id=1).set_owner idx=1 owner=139987012998912
>> 2018-12-12 22:35:30.147 7f51442ed700 10 stack operator() starting
>> 2018-12-12 22:35:30.147 7f5143aec700  2 Event(0x55d927e95200
>nevent=5000 time_id=1).set_owner idx=0 owner=139987004606208
>> 2018-12-12 22:35:30.147 7f5144aee700  2 Event(0x55d927e95c00
>nevent=5000 time_id=1).set_owner idx=2 owner=139987021391616
>> 2018-12-12 22:35:30.147 7f5143aec700 10 stack operator() starting
>> 2018-12-12 22:35:30.147 7f5144aee700 10 stack operator() starting
>> 2018-12-12 22:35:30.147 7f515dc39140  0 starting mon.rio rank 0 at
>public addr 192.168.1.58:6789/0 at bind addr 192.168.1.58:6789/0
>mon_data /var/lib/ceph/mon/ceph-rio fsid
>376540c8-a362-41cc-9a58-9c8ceca0e4ee
>> 2018-12-12 22:35:30.147 7f515dc39140 10 -- - bind bind
>192.168.1.58:6789/0
>> 2018-12-12 22:35:30.147 7f515dc39140 10 -- - bind Network Stack is
>not ready for bind yet - postponed
>> 2018-12-12 22:35:30.147 7f515dc39140  0 starting mon.rio rank 0 at
>192.168.1.58:6789/0 mon_data /var/lib/ceph/mon/ceph-rio fsid
>376540c8-a362-41cc-9a58-9c8ceca0e4ee
>> 2018-12-12 22:35:30.148 7f515dc39140  0 mon.rio@-1(probing).mds e84
>new map
>> 2018-12-12 22:35:30.148 7f515dc39140  0 mon.rio@-1(probing).mds e84
>print_map
>> e84
>> enable_multiple, ever_enabled_multiple: 0,0
>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client
>writeable ranges,3=default file layouts on dirs,4=dir inode in separate
>object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
>anchor table,9=file layout v2,10=snaprealm v2}
>> legacy client fscid: -1
>> 
>> No filesystems configured
>> Standby daemons:
>> 
>> 5906437:192.168.1.152:6800/1077205146 'prince' mds.-1.0
>up:standby seq 2
>> 6284118:192.168.1.59:6800/1266235911 'salvador' mds.-1.0
>up:standby seq 2
>> 
>> 2018-12-12 22:35:30.148 7f515dc39140  0 mon.rio@-1(probing).osd
>e25894 crush map has features 288514051259236352, adjusting msgr
>requires
>> 2018-12-12 22:35:30.148 7f515dc39140  0 mon.rio@-1(probing).osd
>e25894 crush map has features 288514051259236352, adjusting msgr
>requires
>> 2018-12-12 22:35:30.148 7f515dc39140  0 mon.rio@-1(probing).osd
>e25894 crush map has features 1009089991638532096, adjusting msgr
>requires
>> 2018-12-12 22:35:30.148 7f515dc39140  0 mon.rio@-

Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-25 Thread Виталий Филиппов
Ok... That's better than the previous thread about file downloads, where the topic
starter suffered from a normal fs that journals only metadata... Thanks for the link,
it would be interesting to repeat similar tests. Although I suspect it
shouldn't be that bad... at least not all desktop SSDs are that broken - for
example https://engineering.nordeus.com/power-failure-testing-with-ssds/ says
the Samsung 840 Pro is ok.
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times

2018-11-13 Thread Виталий Филиппов
This may be the explanation:

https://serverfault.com/questions/857271/better-performance-when-hdd-write-cache-is-disabled-hgst-ultrastar-7k6000-and

Other manufacturers may have started to do the same, I suppose.
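
For anyone who wants to check their own drives (example device; keep in mind the
setting is usually volatile, i.e. the cache comes back after a power cycle):

hdparm -W /dev/sdX            # show the current volatile write cache state
hdparm -W 0 /dev/sdX          # disable the write cache
sdparm --get=WCE /dev/sdX     # the same via the SCSI caching mode page
sdparm --set=WCE=0 /dev/sdX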
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-10-29 Thread Виталий Филиппов

Is there a way to force OSDs to remove old data?


Hi

After I recreated one OSD + increased pg count of my erasure-coded (2+1)  
pool (which was way too low, only 100 for 9 osds) the cluster started to  
eat additional disk space.


First I thought that was caused by the moved PGs using additional space  
during unfinished backfills. I pinned most of new PGs to old OSDs via  
`pg-upmap` and indeed it freed some space in the cluster.


Then I reduced osd_max_backfills to 1 and started to remove upmap pins  
in small portions which allowed Ceph to finish backfills for these PGs.


HOWEVER, used capacity still grows! It drops after moving each PG, but  
still grows overall.


It has grown +1.3TB yesterday. In the same period of time clients have  
written only ~200 new objects (~800 MB, there are RBD images only).


Why, what's using such big amount of additional space?

Graphs from our prometheus are attached. Only ~200 objects were created  
by RBD clients yesterday, but used raw space increased +1.3 TB.


Additional question is why ceph df / rados df tells there is only 16 TB  
actual data written, but it uses 29.8 TB (now 31 TB) of raw disk space.  
Shouldn't it be 16 / 2*3 = 24 TB ?


ceph df output:

[root@sill-01 ~]# ceph df
GLOBAL:
SIZE   AVAIL   RAW USED %RAW USED
38 TiB 6.9 TiB   32 TiB 82.03
POOLS:
NAME   ID USED%USED MAX AVAIL OBJECTS
ecpool_hdd 13  16 TiB 93.94   1.0 TiB 7611672
rpool_hdd  15 9.2 MiB 0   515 GiB  92
fs_meta44  20 KiB 0   515 GiB  23
fs_data45 0 B 0   1.0 TiB   0

How to heal it?



--
With best regards,
  Vitaliy Filippov


[ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-10-29 Thread Виталий Филиппов
Hi

After I recreated one OSD + increased pg count of my erasure-coded (2+1) pool 
(which was way too low, only 100 for 9 osds) the cluster started to eat 
additional disk space.

First I thought that was caused by the moved PGs using additional space during 
unfinished backfills. I pinned most of new PGs to old OSDs via `pg-upmap` and 
indeed it freed some space in the cluster.

Then I reduced osd_max_backfills to 1 and started to remove upmap pins in small 
portions which allowed Ceph to finish backfills for these PGs.

HOWEVER, used capacity still grows! It drops after moving each PG, but still 
grows overall.

It has grown +1.3TB yesterday. In the same period of time clients have written 
only ~200 new objects (~800 MB, there are RBD images only).

Why, what's using such big amount of additional space?

Graphs from our prometheus are attached. Only ~200 objects were created by RBD 
clients yesterday, but used raw space increased +1.3 TB.

Additional question is why ceph df / rados df tells there is only 16 TB actual 
data written, but it uses 29.8 TB (now 31 TB) of raw disk space. Shouldn't it 
be 16 / 2*3 = 24 TB ?

ceph df output:

[root@sill-01 ~]# ceph df
GLOBAL:
SIZE   AVAIL   RAW USED %RAW USED 
38 TiB 6.9 TiB   32 TiB 82.03 
POOLS:
NAME   ID USED%USED MAX AVAIL OBJECTS 
ecpool_hdd 13  16 TiB 93.94   1.0 TiB 7611672 
rpool_hdd  15 9.2 MiB 0   515 GiB  92 
fs_meta44  20 KiB 0   515 GiB  23 
fs_data45 0 B 0   1.0 TiB   0 

How to heal it?
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Виталий Филиппов
I mean, does every upgraded installation hit this bug, or do some upgrade  
without any problem?



The problem occurs after upgrade, fresh 13.2.2 installs are not affected.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Виталий Филиппов
By the way, does it happen with all installations or only under some  
conditions?



CephFS will be offline and show up as "damaged" in ceph -s
The fix is to downgrade to 13.2.1 and issue a "ceph fs repaired "  
command.



Paul


--
With best regards,
  Vitaliy Filippov


[ceph-users] CephFS "authorize" on erasure-coded FS

2018-09-14 Thread Виталий Филиппов

Hi,

I've recently tried to set up a user for CephFS running on a pair of
replicated+erasure pools, but after I ran


ceph fs authorize ecfs client.samba / rw

The "client.samba" user could only see directory listings, but couldn't read or
write any files. I've tried looking in the logs and raising the debug level,
but I've found no clues about this problem.


However, when I then modified its caps with:

ceph auth caps client.samba mds 'allow rw' mon 'allow r' osd 'allow rw tag  
cephfs data=ecfs, allow rw pool=ecpool'


Everything went OK and the user gained read-write access to files.

Does that mean there's a bug in CephFS caps that prevents users from
reading or writing to an FS running on an EC pool?


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] ceph issue tracker tells that posting issues is forbidden

2018-08-05 Thread Виталий Филиппов
Thanks for the reply! Ok, I understand :-)

But the page still shows a 403 as of now...

On 5 August 2018 at 6:42:33 GMT+03:00, Gregory Farnum  wrote:
>On Sun, Aug 5, 2018 at 1:25 AM Виталий Филиппов 
>wrote:
>
>> Hi!
>>
>> I wanted to report a bug in ceph, but I found out that visiting
>> http://tracker.ceph.com/projects/ceph/issues/new gives me only "403
>You
>> are not authorized to access this page."
>>
>> What does it mean - why is it forbidden to post issues?
>
>
>We just got spammed via the API last week so we had to lock some things
>down temporarily to prevent it from continuing. I don’t think it was
>expected to impact users of the web site, but it might have for a few
>hours. The page is showing up for me when I try and visit it now, so
>try
>again?
>Sorry you ran in to this!
>-Greg
>
>
>
>>
>> --
>> With best regards,
>>Vitaliy Filippov
>>

-- 
With best regards,
  Vitaliy Filippov


[ceph-users] ceph issue tracker tells that posting issues is forbidden

2018-08-04 Thread Виталий Филиппов

Hi!

I wanted to report a bug in ceph, but I found out that visiting  
http://tracker.ceph.com/projects/ceph/issues/new gives me only "403 You  
are not authorized to access this page."


What does it mean - why is it forbidden to post issues?

--
With best regards,
  Vitaliy Filippov


[ceph-users] Strange copy errors in osd log

2016-09-01 Thread Виталий Филиппов
Hi! I'm playing with a test setup of ceph jewel with bluestore and cephfs
over an erasure-coded pool with a replicated pool as a cache tier. After
writing some number of small files to cephfs I begin seeing the following
error messages during the migration of data from the cache to the EC pool:


2016-09-01 10:19:27.364710 7f37c1a09700 -1 osd.0 pg_epoch: 329 pg[6.2cs0(  
v 329'388 (0'0,329'388] local-les=315 n=326 ec=279 les/c/f 315/315/0  
314/314/314) [0,1,2] r=0 lpr=314 crt=329'387 lcod 329'387 mlcod 329'387  
active+clean] process_copy_chunk data digest 0x648fd38c != source  
0x40203b61
2016-09-01 10:19:27.364742 7f37c1a09700 -1 log_channel(cluster) log [ERR]  
: 6.2cs0 copy from 8:372dc315:::200.002b:head to  
6:372dc315:::200.002b:head data digest 0x648fd38c != source 0x40203b61


These messages then repeat indefinitely for the same set of objects at
some interval. I'm not sure - does this mean some objects are corrupted on
the OSDs? (how would I check?) Is it a bug at all?
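
The only check I can think of myself (just a guess, I haven't verified it helps
here) would be a deep scrub of the affected PG and then the inconsistency report:

ceph pg deep-scrub 6.2c
ceph health detail
rados list-inconsistent-obj 6.2c --format=json-pretty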


P.S: I've also reported this as an issue:  
http://tracker.ceph.com/issues/17194 (not sure if it was right to do :))


--
With best regards,
  Vitaliy Filippov