Re: [ceph-users] SSD test results with Plextor M6 Pro, HyperX Fury, Kingston V300, ADATA SP90

2015-09-02 Thread Jan Schermer
Hi,
comments below

> On 01 Sep 2015, at 18:08, Jelle de Jong  wrote:
> 
> Hi Jan,
> 
> I am building two new clusters for testing. I been reading your messages
> on the mailing list for a while now and want to thank you for your support.
> 
> I can redo all the numbers, but is your question to run all the test
> again with [hdparm -W 1 /dev/sdc]? Please tell me what else you would
> like to see test, commands?
> 

Probably not necessary, I figured a lot of it out already.

> My experience was that enabling disk cache causes about a 45%
> performance drop, iops=25690 vs iops=46185
> 

This very much depends on the controller - low-end "smart" HBAs (like some RAID 
controllers) are limited by their clock speed, so while they can handle 
relatively large throughput they also introduce latency, and this translates 
into fewer "IOPS".
Toggling the write cache and benchmarking a synchronous workload changes the IOPS 
by roughly a factor of 2, which is almost exactly what you're seeing.
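If you want to double-check that on your own hardware, a minimal sketch (assuming fio is 
installed and /dev/sdc is a scratch journal SSD - this test overwrites data on it):

# on-disk write cache off, measure synchronous 4k write IOPS
hdparm -W 0 /dev/sdc
fio --name=wcache-test --filename=/dev/sdc --direct=1 --sync=1 --rw=write \
    --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based

# write cache on, repeat and compare
hdparm -W 1 /dev/sdc
fio --name=wcache-test --filename=/dev/sdc --direct=1 --sync=1 --rw=write \
    --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based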

See my comment below on power loss protection.

> I am going to test DT01ACA300 vs WD1003FBYZ disks with SV300S37A ssd's
> in my other two three node ceph clusters.
> 
> What is your advice on making hdparm and possible scheduler (noop)
> changes persistent (cmd in rc.local or special udev rules, examples?)
> 

We do that with puppet that runs every few minutes. Use whatever tool you have.

The correct way is via udev on hotplug, since that eliminates any window where 
it is set incorrectly, but it is slightly distribution specific.

You can also pass "elevator=noop" on the kernel cmdline, which makes noop the 
default for all devices; then you only need to re-set the scheduler for your non-OSD 
drives, which are not likely to be hotplugged... not a 100% solution if a drive is 
replaced, though.
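For reference, a rough udev sketch (the rule file name and the match on non-rotational 
sd devices are my assumptions - adapt it to your distribution and hardware, and use 
whichever hdparm setting you settled on):

# /etc/udev/rules.d/60-ceph-disks.rules   (hypothetical file name)
# noop scheduler for SSDs, and apply the write-cache setting on (hot)plug
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", RUN+="/sbin/hdparm -W 1 /dev/%k"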


See more comments below

> Kind regards,
> 
> Jelle de Jong
> 
> 
> On 23/06/15 12:41, Jan Schermer wrote:
>> Those are interesting numbers - can you rerun the test with write cache 
>> enabled this time? I wonder how much your drop will be…
>> 
>> thanks
>> 
>> Jan
>> 
>>> On 18 Jun 2015, at 17:48, Jelle de Jong  wrote:
>>> 
>>> Hello everybody,
>>> 
>>> I thought I would share the benchmarks from these four ssd's I tested
>>> (see attachment)
>>> 
>>> I do still have some question:
>>> 
>>> #1 *Data Set Management TRIM supported (limit 1 block)
>>>   vs
>>>  *Data Set Management TRIM supported (limit 8 blocks)
>>> and how this effects Ceph and also how can I test if TRIM is actually
>>> working and not corruption data.
>>> 

This by itself means nothing, it just says how many blocks are TRIMmed in one 
OP. The trimming itself could be fast or slow, and I have not seen a clear 
correlation between the TRIM parameters and the speed of TRIM itself.
Ceph doesn't trim anything; that is a job for the filesystem.

I recommend _disabling_ filesystem trimming as a mount option (discard) because it 
causes significant overhead on writes.
It's much better to schedule a daily/weekly fstrim cron job instead, and even 
then only discard large blocks (fstrim -m 131072 or similar) - fstrim can in 
some cases cause the SSD to pause IO for significant amounts of time, so test 
how it behaves after filling and erasing the drive.
However, it's not really that necessary to use TRIM with modern SSDs, unless you 
want to squeeze out more endurance than the drive is rated for, and then it should 
be combined with under-provisioning or simply partitioning only part of the 
drive (both after a one-time TRIM!).
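A minimal sketch of such a cron job (the script path is hypothetical; 
/var/lib/ceph/osd/* is the usual OSD mount location - adjust to yours):

#!/bin/sh
# /etc/cron.weekly/fstrim-osds
# only discard large contiguous free ranges, as suggested above
for fs in /var/lib/ceph/osd/*; do
    fstrim -m 131072 -v "$fs"
done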


>>> #2 are there other things I should test to compare ssd's for Ceph Journals
>>> 

Test whether the drive performance is consistent. Some drives fill the "cache" part 
(sometimes SLC or eMLC) of the NAND and then the throughput drops 
significantly. Some drives perform garbage collection that causes periodic 
spikes/drops in performance, and some drives simply slow down when many blocks are 
dirty... 
For example, I have been running a fio job on an Intel S3610 almost non-stop for 
the past week, and the performance increased from 17K IOPS to 21K IOPS :-)
Samsung drives also sped up over time.

If you have a database server that performs many transactions and it's possible 
to put the SSD in there, do that. I can't think of a better test. You should 
know how it behaves after a few weeks.
You can google various fio jobs simulating workloads, or you can write your own 
scripts - fio is very powerful, but it's still only a synthetic test.
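As a hedged example, a long-running fio job file for checking the consistency of 
journal-style sync writes (the device path and runtime are assumptions, and the run 
destroys data on the device):

[steady-state]
filename=/dev/sdc        ; assumption: scratch SSD, will be overwritten
rw=randwrite
bs=4k
direct=1
sync=1
iodepth=1
numjobs=1
time_based
runtime=86400
write_iops_log=steady-state   ; log IOPS over time so GC pauses and slowdowns show up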

>>> #3 are the power loss security mechanisms on SSD relevant in Ceph when
>>> configured in a way that a full node can fully die and that a power loss
>>> of all nodes at the same time should not be possible (or has an extreme
>>> low probability)

Depends :-)

When the system asks the SSD to flush data, it should always flush it to the 
platter/NAND. The standard doesn't allow exceptions*, even for devices with 
non-volatile cache.
In practice, controllers and SANs with non-volatile cache ignore flushes unless 

Re: [ceph-users] Ceph SSD CPU Frequency Benchmarks

2015-09-02 Thread Jan Schermer
What "idle" driver are you using?
/dev/cpu_dma_latency might not be sufficient if the OS uses certain idle 
instructions; IMO mwait is still issued and its latency might not be 1 on Atoms.
What is in /sys/devices/system/cpu/cpu0/cpuidle/state*/latency on Atoms?

Btw, disabling all power management is IMO not a good idea. It disables 
TurboBoost (do Atoms have it?), which gives a huge gain if the TDP headroom is 
there. The scheduler in recent kernels should be smart enough to keep some 
cores busy to an extent rather than waking up all cores, and this will give you 
better performance than spreading work across all cores concurrently, unless you 
really use all of their CPU time.
 
https://access.redhat.com/articles/65410 

http://stackoverflow.com/questions/12111954/context-switches-much-slower-in-new-linux-kernels
 


Jan


> On 01 Sep 2015, at 22:48, Robert LeBlanc  wrote:
> 
> 
> Nick,
> 
> I've been trying to replicate your results without success. Can you help me 
> understand what I'm doing that is not the same as your test?
> 
> My setup is two boxes, one is a client and the other is a server. The server 
> has Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz, 32 GB RAM and 2 Intel S3500 240 
> GB SSD drives. The boxes have Infiniband FDR cards connected to a QDR switch 
> using IPoIB. I set up OSDs on the 2 SSDs and set pool size=1. I mapped a 
> 200GB RBD using the kernel module and ran fio on the RBD. I adjusted the number 
> of cores, clock speed and C-states of the server and here are my results:
> 
> Adjusted core number and set the processor to a set frequency using the 
> userspace governor.
> 
> 8 jobs 8 depth                             Cores
>                        1    2     3     4     5     6     7     8
> Frequency   2.4 GHz  387  762  1121  1432  1657  1900  2092  2260
>             2.0 GHz  386  758  1126  1428  1657  1890  2090  2232
>             1.6 GHz  382  756  1127  1428  1656  1894  2083  2201
>             1.2 GHz  385  756  1125  1431  1656  1885  2093  2244
> 
> I then adjusted the processor to not go in a deeper sleep state than C1 and 
> also tested setting the highest CPU frequency with the ondemand governor.
> 
> 1 job 1 depth
>
> Cores = 1
>                      <=C1, freq range   C0-C6, freq range   C0-C6, static freq   <=C1, static freq
> Frequency   2.4 GHz  381                381                 379                  381
>             2.0 GHz  382                380                 381                  381
>             1.6 GHz  380                381                 379                  382
>             1.2 GHz  383                378                 379                  383
>
> Cores = 8
>                      <=C1, freq range   C0-C6, freq range   C0-C6, static freq   <=C1, static freq
> Frequency   2.4 GHz  629                580                 584                  629
>             2.0 GHz  630                579                 584                  634
>             1.6 GHz  630                579                 584                  634
>             1.2 GHz  632                581                 582                  634
> 
> Here I see a correlation between the number of cores and C-states, but not frequency.
> 
> Frequency was controlled with:
> cpupower frequency-set -d 1.2GHz -u 1.2GHz -g userspace
> and
> cpupower frequency-set -d 1.2GHz -u 2.0GHz -g ondemand
> 
> Core count adjusted by:
> for i in {1..7}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
> 
> C-states controlled by:
> # python
> Python 2.7.5 (default, Jun 24 2015, 00:41:19) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> fd = open('/dev/cpu_dma_latency','wb')
> >>> fd.write('1')
> >>> fd.flush()
> >>> fd.close() # Don't run this until the tests are completed (the handle has 
> >>> to stay open).
> >>> 
> 
> I'd like to replicate your results. I'd also like if you can verify some of 
> mine in your set-up around C-States and cores.
> 
> Thanks,
> 
> 
> 
> 

Re: [ceph-users] libvirt rbd issue

2015-09-02 Thread Jan Schermer
1) Take a look at the number of file descriptors the QEMU process is using; I 
think you are over the limit (one way to raise it is sketched below, after point 2).

pid=<pid of the qemu process>

cat /proc/$pid/limits          # compare the "Max open files" limit here...
echo /proc/$pid/fd/* | wc -w   # ...with the number of descriptors currently open

2) Jumbo frames may be the cause - are they enabled on the rest of the network? 
In any case, get rid of NetworkManager ASAP and configure the MTU manually, though 
it looks like your NIC might not support jumbo frames.
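If point 1 turns out to be the problem, one hedged way to raise the limit on a 
systemd-based host (the drop-in path is an assumption, and you should verify that your 
libvirt actually passes the limit on to the QEMU processes):

# /etc/systemd/system/libvirtd.service.d/limits.conf   (hypothetical drop-in)
[Service]
LimitNOFILE=65536

# then reload and restart libvirtd
systemctl daemon-reload
systemctl restart libvirtd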

Jan



> On 02 Sep 2015, at 01:44, Rafael Lopez  wrote:
> 
> Hi ceph-users,
> 
> Hoping to get some help with a tricky problem. I have a rhel7.1 VM guest 
> (host machine also rhel7.1) with root disk presented from ceph 0.94.2-0 (rbd) 
> using libvirt. 
> 
> The VM also has a second rbd for storage presented from the same ceph 
> cluster, also using libvirt.
> 
> The VM boots fine, no apparent issues with the OS root rbd. I am able to 
> mount the storage disk in the VM, and create a file system. I can even 
> transfer small files to it. But when I try to transfer a moderately sized file, 
> e.g. greater than 1GB, it slows to a grinding halt and eventually it 
> locks up the whole system and generates the kernel messages below. 
> 
> I have googled some *similar* issues, but haven't come across any 
> solid advice or fix. So far I have tried modifying the libvirt disk cache 
> settings, tried using the latest mainline kernel (4.2+), different file 
> systems (ext4, xfs, zfs) all produce similar results. I suspect it may be 
> network related, as when I was using the mainline kernel I was transferring 
> some files to the storage disk and this message came up, and the transfer 
> seemed to stop at the same time:
> 
> Sep  1 15:31:22 nas1-rds NetworkManager[724]:  [1441085482.078646] 
> [platform/nm-linux-platform.c:2133] sysctl_set(): sysctl: failed to set 
> '/proc/sys/net/ipv6/conf/eth0/mtu' to '9000': (22) Invalid argument
> 
> I think maybe the key info to troubleshooting is that it seems to be OK for 
> files under 1GB. 
> 
> Any ideas would be appreciated.
> 
> Cheers,
> Raf
> 
> 
> Sep  1 16:04:15 nas1-rds kernel: INFO: task kworker/u8:1:60 blocked for more 
> than 120 seconds.
> Sep  1 16:04:15 nas1-rds kernel: "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep  1 16:04:15 nas1-rds kernel: kworker/u8:1D 88023fd93680 0
> 60  2 0x
> Sep  1 16:04:15 nas1-rds kernel: Workqueue: writeback bdi_writeback_workfn 
> (flush-252:80)
> Sep  1 16:04:15 nas1-rds kernel: 880230c136b0 0046 
> 8802313c4440 880230c13fd8
> Sep  1 16:04:15 nas1-rds kernel: 880230c13fd8 880230c13fd8 
> 8802313c4440 88023fd93f48
> Sep  1 16:04:15 nas1-rds kernel: 880230c137b0 880230fbcb08 
> e8d80ec0 88022e827590
> Sep  1 16:04:15 nas1-rds kernel: Call Trace:
> Sep  1 16:04:15 nas1-rds kernel: [] io_schedule+0x9d/0x130
> Sep  1 16:04:15 nas1-rds kernel: [] bt_get+0x10f/0x1a0
> Sep  1 16:04:15 nas1-rds kernel: [] ? wake_up_bit+0x30/0x30
> Sep  1 16:04:15 nas1-rds kernel: [] blk_mq_get_tag+0xbf/0xf0
> Sep  1 16:04:15 nas1-rds kernel: [] 
> __blk_mq_alloc_request+0x1b/0x1f0
> Sep  1 16:04:15 nas1-rds kernel: [] 
> blk_mq_map_request+0x181/0x1e0
> Sep  1 16:04:15 nas1-rds kernel: [] 
> blk_sq_make_request+0x9a/0x380
> Sep  1 16:04:15 nas1-rds kernel: [] ? 
> generic_make_request_checks+0x24f/0x380
> Sep  1 16:04:15 nas1-rds kernel: [] 
> generic_make_request+0xe2/0x130
> Sep  1 16:04:15 nas1-rds kernel: [] submit_bio+0x71/0x150
> Sep  1 16:04:15 nas1-rds kernel: [] 
> ext4_io_submit+0x25/0x50 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [] 
> ext4_bio_write_page+0x159/0x2e0 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [] 
> mpage_submit_page+0x5d/0x80 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [] 
> mpage_map_and_submit_buffers+0x172/0x2a0 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [] 
> ext4_writepages+0x733/0xd60 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [] do_writepages+0x1e/0x40
> Sep  1 16:04:15 nas1-rds kernel: [] 
> __writeback_single_inode+0x40/0x220
> Sep  1 16:04:15 nas1-rds kernel: [] 
> writeback_sb_inodes+0x25e/0x420
> Sep  1 16:04:15 nas1-rds kernel: [] 
> __writeback_inodes_wb+0x9f/0xd0
> Sep  1 16:04:15 nas1-rds kernel: [] wb_writeback+0x263/0x2f0
> Sep  1 16:04:15 nas1-rds kernel: [] 
> bdi_writeback_workfn+0x1cc/0x460
> Sep  1 16:04:15 nas1-rds kernel: [] 
> process_one_work+0x17b/0x470
> Sep  1 16:04:15 nas1-rds kernel: [] 
> worker_thread+0x11b/0x400
> Sep  1 16:04:15 nas1-rds kernel: [] ? 
> rescuer_thread+0x400/0x400
> Sep  1 16:04:15 nas1-rds kernel: [] kthread+0xcf/0xe0
> Sep  1 16:04:15 nas1-rds kernel: [] ? 
> kthread_create_on_node+0x140/0x140
> Sep  1 16:04:15 nas1-rds kernel: [] ret_from_fork+0x7c/0xb0
> Sep  1 16:04:15 nas1-rds kernel: [] ? 
> kthread_create_on_node+0x140/0x140
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs read-only setting doesn't work?

2015-09-02 Thread Gregory Farnum
On Tue, Sep 1, 2015 at 9:20 PM, Erming Pei  wrote:
> Hi,
>
>   I tried to set up a read-only permission for a client but it looks always
> writable.
>
>   I did the following:
>
> ==Server end==
>
> [client.cephfs_data_ro]
> key = AQxx==
> caps mon = "allow r"
> caps osd = "allow r pool=cephfs_data, allow r pool=cephfs_metadata"

The clients don't directly access the metadata pool at all so you
don't need to grant that. :) And I presume you have an MDS cap in
there as well?

>
>
> ==Client end==
> mount -v -t ceph hostname.domainname:6789:/ /cephfs -o
> name=cephfs_data_ro,secret=AQxx==
>
> But I still can touch, delete, overwrite.
>
> I read that touch/delete could be only meta data operations, but why I still
> can overwrite?
>
> Is there anyway I could test/check the data pool (instead of meta data) to
> see if any effect on it?

What you're seeing here is an unfortunate artifact of the page cache
and the way these user capabilities work in Ceph. As you surmise,
touch/delete are metadata operations through the MDS and in current
code you can't block the client off from that (although we have work
in progress to improve things). I think you'll find that the data
you've overwritten isn't really written to the OSDs — you wrote it in
the local page cache, but the OSDs will reject the writes with EPERM.
I don't remember the kernel's exact behavior here though — we updated
the userspace client to preemptively check access permissions on new
pools but I don't think the kernel ever got that. Zheng?
-Greg
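One rough way to check this from the client side (mount parameters copied from above; 
the file name is just an example - the point is to drop the page cache and re-read):

md5sum /cephfs/testfile      # read through the local page cache
umount /cephfs
mount -v -t ceph hostname.domainname:6789:/ /cephfs -o name=cephfs_data_ro,secret=AQxx==
md5sum /cephfs/testfile      # if the OSDs rejected the writes with EPERM,
                             # this should still match the original content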
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Appending to an open file - O_APPEND flag

2015-09-02 Thread Gregory Farnum
On Wed, Sep 2, 2015 at 10:00 AM, Janusz Borkowski
 wrote:
> Hi!
>
> I mount cephfs using kernel client (3.10.0-229.11.1.el7.x86_64).
>
> The effect is the same when doing "echo >>" from another machine and from a
> machine keeping the file open.
>
> The file is opened with open( ..,
> O_WRONLY|O_LARGEFILE|O_APPEND|O_BINARY|O_CREAT)
>
> Shell ">>" is implemented as (from strace bash -c "echo '7789' >>
> /mnt/ceph/test):
>
> open("/mnt/ceph/test", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
>
> The test file had ~500KB size.
>
> Each subsequent "echo >>" writes to the start of the test file, first "echo"
> overwriting the original contents, next "echos" overwriting bytes written by
> the preceding "echo".

Hmmm. The userspace (ie, ceph-fuse) implementation of this is a little
bit racy but ought to work. I'm not as familiar with the kernel code
but I'm not seeing any special behavior in the Ceph code — Zheng,
would you expect this to work? It looks like some of the linux
filesystems have their own O_APPEND handling and some don't, but I
can't find it in the VFS either.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

2015-09-02 Thread Jan Schermer
I somehow missed the original question, but if you run a database on Ceph you 
will be limited not by throughput but by latency.
Even if you run the OSDs on a ramdisk, the latency will still be 1-2 ms at best 
(depending strictly on OSD CPU and memory speed), and that limits the number of 
database transactions to ~500/s or even less. It will scale with the number of 
instances, so two database servers will likely have 2x the IOPS in total, until 
your disks saturate. If you can shard your database easily then that might be 
an option.

With MySQL you can achieve higher throughput by tuning InnoDB parameters (I 
assume you use InnoDB), for example setting innodb_flush_log_at_trx_commit=2. 
With this setting you can lose the last second of committed transactions, and it 
may require recovery on startup (no guarantees). I used to run quite a few 
databases like that and haven't lost any data, though. YMMV.

I believe there was another parameter that batched X commits into one flush, 
but I can't find it now - either I am mistaken and it was for PostgreSQL, or 
they removed it.

You can also save some IOPS by putting binlogs and innodb logs on a separate 
device or fast local storage, then you can disable flushing on the InnoDB 
tablespace while having all the data safely in binlog for replay.
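A hedged sketch of the relevant my.cnf knobs (values and paths are illustrative 
assumptions, not recommendations):

[mysqld]
# accept losing up to ~1s of committed transactions in exchange for far fewer flushes
innodb_flush_log_at_trx_commit = 2
# keep redo logs and binlogs on a separate fast device
innodb_log_group_home_dir = /mnt/fastlog/innodb
log_bin                   = /mnt/fastlog/binlog/mysql-bin
sync_binlog               = 1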

Jan


> On 01 Sep 2015, at 18:26, Robert LeBlanc  wrote:
> 
> 
> Just swapping out spindles for SSD will not give you orders of magnitude 
> performance gains as it does in regular cases. This is because Ceph has a lot 
> of overhead for each I/O which limits the performance of the SSDs. In my 
> testing, two Intel S3500 SSDs with an 8 core Atom (Intel(R) Atom(TM) CPU  
> C2750  @ 2.40GHz) and size=1 and fio with 8 jobs and QD=8 sync,direct 4K 
> read/writes produced 2,600 IOPs. Don't get me wrong, it will help, but don't 
> expect spectacular results.
> 
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> On Tue, Sep 1, 2015 at 8:01 AM, Kenneth Van Alstyne  wrote:
> Thanks for the awesome advice folks.  Until I can go larger scale (50+ SATA 
> disks), I’m thinking my best option here is to just swap out these 1TB SATA 
> disks with 1TB SSDs.  Am I oversimplifying the short term solution?
> 
> Thanks,
> 
> - --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> c: 228-547-8045 f: 571-266-3106
> www.knightpoint.com  
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 2 / ISO 27001
> 
> 
> On Aug 31, 2015, at 7:29 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:
> 
> In addition to the spot on comments by Warren and Quentin, verify this by
> watching your nodes with atop, iostat, etc. 
> The culprit (HDDs) should be plainly visible.
> 
> More inline:
> 
> Christian, et al:
> 
> Sorry for the lack of information.  I wasn’t sure which of our hardware
> specifications or Ceph configuration was useful information at this
> point.  Thanks for the feedback — any feedback, is appreciated at this
> point, as I’ve been beating my head against a wall trying to figure out
> what’s going on.  (If anything.  Maybe the spindle count is indeed our
> upper limit or our SSDs really suck? :-) )
> 
> Your SSDs aren't the problem.
> 
> To directly address your questions, see answers below:
>   - CBT is the Ceph Benchmarking Tool.  Since my question was more
> generic rather than with CBT itself, it was probably more useful to post
> in the ceph-users list rather than cbt.
>   - 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @
> 2.40GHz
> Not your problem either.
> 
>   - The SSDs are indeed Intel S3500s.  I agree — not ideal, but
> supposedly capable of up to 75,000 random 4KB reads/writes.  Throughput
> and longevity are quite low for an SSD, rated at about 400MB/s reads and
> 100MB/s writes, though.  When we added these as journals in front of the
> SATA spindles, both VM performance and rados benchmark numbers were
> relatively unchanged.
> 
> The only thing relevant in regards to journal SSDs is the sequential write
> speed (SYNC), they don't seek and normally don't get read either.
> This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710
> which is faster in any other aspect but sequential writes. ^o^
> 
> 

Re: [ceph-users] Appending to an open file - O_APPEND flag

2015-09-02 Thread Janusz Borkowski
Hi!

I mount cephfs using kernel client (3.10.0-229.11.1.el7.x86_64).

The effect is the same when doing "echo >>" from another machine and from a 
machine keeping the file open.

The file is opened with open( ..,  
O_WRONLY|O_LARGEFILE|O_APPEND|O_BINARY|O_CREAT)

Shell ">>" is implemented as (from strace bash -c "echo '7789' >>  
/mnt/ceph/test):

open("/mnt/ceph/test", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3

The test file had ~500KB size.

Each subsequent "echo >>" writes to the start of the test file, first "echo" 
overwriting the original contents, next "echos" overwriting bytes written by 
the preceding "echo".

Thanks!

J.


On 01.09.2015 18:15, Gregory Farnum wrote:
>
>
> On Sep 1, 2015 4:41 PM, "Janusz Borkowski"  > wrote:
> >
> > Hi!
> >
> > open( ... O_APPEND) works fine in a single system. If many processes write 
> > to the same file, their output will never overwrite each other.
> >
> > On NFS overwriting is possible, as appending is only emulated - each write 
> > is preceded by a seek to the current file size and race condition may occur.
> >
> > How it is in cephfs?
>
> CephFS generally ought to handle appends correctly. If it's not we will want 
> to fix that.
>
> >
> > I have a file F opened with  O_APPEND|O_WRONLY by some process. In a 
> > console I type
> >
> > $ echo "asd" >> F
> >
> > Effectively, this is opening of file F by another process with O_APPEND 
> > flag .
> >
> > The string "asd" is written to the beginning of file F, overwriting the 
> > starting bytes in the file. Is it a bug or a feature? If a feature, how it 
> > is described?
>
> Are you doing this in the same box that's got the the file open, or a 
> different one? Are you using the ceph-fuse or kernel clients on the systems?
>
> I'm not sure how the shell actually handles >> so I'd like to see this 
> reproduced with strace or an example program to be sure it's really not 
> handling append properly.
> -Greg
>
> >
> > It is ceph Hammer and kernel 3.10.0-229.11.1.el7.x86_64
> >
> > Thanks!
> >
> > J.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph SSD CPU Frequency Benchmarks

2015-09-02 Thread Nick Fisk
I think this may be related to what I had to do, it rings a bell at least.

http://unix.stackexchange.com/questions/153693/cant-use-userspace-cpufreq-governor-and-set-cpu-frequency

The P-state driver doesn't support the userspace governor, so you need to disable it 
and make Linux use the old acpi-cpufreq driver instead.
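For reference, a hedged sketch of switching drivers (intel_pstate=disable is a real 
kernel parameter; the grub file location and regeneration command depend on your 
distribution):

# /etc/default/grub (Debian/Ubuntu layout)
GRUB_CMDLINE_LINUX_DEFAULT="... intel_pstate=disable"

# regenerate the grub config and reboot; acpi-cpufreq should then be in use
# and the userspace governor becomes available:
cpupower frequency-set -g userspace
cpupower frequency-set -f 1.2GHz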

> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: 01 September 2015 22:21
> To: 'Robert LeBlanc' 
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] Ceph SSD CPU Frequency Benchmarks
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Robert LeBlanc
> > Sent: 01 September 2015 21:48
> > To: Nick Fisk 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Ceph SSD CPU Frequency Benchmarks
> >
> >
> > Nick,
> >
> > I've been trying to replicate your results without success. Can you
> > help me understand what I'm doing that is not the same as your test?
> >
> > My setup is two boxes, one is a client and the other is a server. The
> > server has Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz, 32 GB RAM and 2
> > Intel S3500
> > 240 GB SSD drives. The boxes have Infiniband FDR cards connected to a
> > QDR switch using IPoIB. I set up OSDs on the 2 SSDs and set pool
> > size=1. I mapped a 200GB RBD using the kernel module and ran fio on the
> > RBD. I adjusted the number of cores, clock speed and C-states of the
> > server and here are my
> > results:
> >
> > Adjusted core number and set the processor to a set frequency using
> > the userspace governor.
> >
> > 8 jobs 8 depth                             Cores
> >                        1    2     3     4     5     6     7     8
> > Frequency   2.4 GHz  387  762  1121  1432  1657  1900  2092  2260
> >             2.0 GHz  386  758  1126  1428  1657  1890  2090  2232
> >             1.6 GHz  382  756  1127  1428  1656  1894  2083  2201
> >             1.2 GHz  385  756  1125  1431  1656  1885  2093  2244
> >
> 
> I tested at QD=1 as this tends to highlight the difference in clock speed,
> whereas a higher queue depth will probably scale with both frequency and
> cores. I'm not sure this is your problem, but to make sure your environment
> is doing what you want I would suggest QD=1 and 1 job to start with.
> 
> But thank you for sharing these results regardless of your current frequency
> scaling issues. Information like this is really useful for people trying to 
> decide
> on hardware purchases. Those Atom boards look like they could support 12x
> normal HDDs quite happily, assuming 80 IOPS x 12.
> 
> I wonder if we can get enough data from various people to generate a
> IOPs/CPU Freq for various CPU architectures?
> 
> 
> > I then adjusted the processor to not go in a deeper sleep state than
> > C1 and also tested setting the highest CPU frequency with the ondemand
> governor.
> >
> > 1 job 1 depth
> >
> > Cores = 1
> >                      <=C1, freq range   C0-C6, freq range   C0-C6, static freq   <=C1, static freq
> > Frequency   2.4 GHz  381                381                 379                  381
> >             2.0 GHz  382                380                 381                  381
> >             1.6 GHz  380                381                 379                  382
> >             1.2 GHz  383                378                 379                  383
> >
> > Cores = 8
> >                      <=C1, freq range   C0-C6, freq range   C0-C6, static freq   <=C1, static freq
> > Frequency   2.4 GHz  629                580                 584                  629
> >             2.0 GHz  630                579                 584                  634
> >             1.6 GHz  630                579                 584                  634
> >             1.2 GHz  632                581                 582                  634
> >
> > Here I see a correlation between the number of cores and C-states, but not
> > frequency.
> >
> > Frequency was controlled with:
> > cpupower frequency-set -d 1.2GHz -u 1.2GHz -g userspace and cpupower
> > frequency-set -d 1.2GHz -u 2.0GHz -g ondemand
> >
> > Core count adjusted by:
> > for i in {1..7}; do echo 0 > /sys/devices/system/cpu/cpu$i/online;
> > done
> >
> > C-states controlled by:
> > # python
> > Python 2.7.5 (default, Jun 24 2015, 00:41:19) [GCC 4.8.3 20140911 (Red
> > Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or
> > "license" for more information.
> > >>> fd = open('/dev/cpu_dma_latency','wb')
> > >>> fd.write('1')
> > >>> fd.flush()
> > >>> fd.close() # Don't run this until the tests are completed (the
> > >>> handle has
> > to stay open).
> > >>>
> >
> > I'd like to replicate your results. I'd also like if you can verify
> > some of mine in your set-up around C-States and cores.
> 
> I can't remember exactly, but I think I had to do something to get the
> userspace governor to behave as I expected it to. I tend to recall setting the
> frequency low and yet still seeing it bursting up to max. I will have a look
> through my notes tomorrow and see if I 

Re: [ceph-users] cephfs read-only setting doesn't work?

2015-09-02 Thread Erming Pei

On 9/2/15, 9:31 AM, Gregory Farnum wrote:

[ Re-adding the list. ]

On Wed, Sep 2, 2015 at 4:29 PM, Erming Pei  wrote:

Hi Gregory,

Thanks very much for the confirmation and explanation.


And I presume you have an MDS cap in there as well?

   Is there a difference between setting this cap and not setting it?

Well, I don't think you can access the MDS without a read cap, but
maybe it's really just null...


  I asked this as I don't see a difference when operating on files.



I think you'll find that the data you've overwritten isn't really written
to the OSDs — you wrote it in the local page cache, but the OSDs will reject
the writes with EPERM.

I see. Is there a way for me to verify that, i.e., to see that there is no
change to the data in the OSDs? I found I can overwrite a file and then I can
see the file has changed. It may be in the local cache. But how can I test
this and retrieve a copy from the OSD pool?

Mounting it on another client and seeing if changes are reflected
there would do it. Or unmounting the filesystem, mounting again, and
seeing if the file has really changed.
-Greg


Good idea.

Thank you Gregory.

Erming

Thanks!

Erming



On 9/2/15, 2:44 AM, Gregory Farnum wrote:

On Tue, Sep 1, 2015 at 9:20 PM, Erming Pei  wrote:

Hi,

I tried to set up a read-only permission for a client but it looks
always
writable.

I did the following:

==Server end==

[client.cephfs_data_ro]
  key = AQxx==
  caps mon = "allow r"
  caps osd = "allow r pool=cephfs_data, allow r
pool=cephfs_metadata"

The clients don't directly access the metadata pool at all so you
don't need to grant that. :) And I presume you have an MDS cap in
there as well?


==Client end==
mount -v -t ceph hostname.domainname:6789:/ /cephfs -o
name=cephfs_data_ro,secret=AQxx==

But I still can touch, delete, overwrite.

I read that touch/delete could be only meta data operations, but why I
still
can overwrite?

Is there anyway I could test/check the data pool (instead of meta data)
to
see if any effect on it?

What you're seeing here is an unfortunate artifact of the page cache
and the way these user capabilities work in Ceph. As you surmise,
touch/delete are metadata operations through the MDS and in current
code you can't block the client off from that (although we have work
in progress to improve things). I think you'll find that the data
you've overwritten isn't really written to the OSDs — you wrote it in
the local page cache, but the OSDs will reject the writes with EPERM.
I don't remember the kernel's exact behavior here though — we updated
the userspace client to preemptively check access permissions on new
pools but I don't think the kernel ever got that. Zheng?
-Greg



--
-
  Erming Pei, Ph.D
  Senior System Analyst; Grid/Cloud Specialist

  Research Computing Group
  Information Services & Technology
  University of Alberta, Canada

  Tel: +1 7804929914Fax: +1 7804921729
-




--
-
 Erming Pei, Ph.D
 Senior System Analyst; Grid/Cloud Specialist

 Research Computing Group
 Information Services & Technology
 University of Alberta, Canada

 Tel: +1 7804929914Fax: +1 7804921729
-

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange logging behaviour for ceph

2015-09-02 Thread J-P Methot
Hi,

We're using Ceph Hammer 0.94.1 on CentOS 7. On the monitor, when we set
log_to_syslog = true
Ceph starts shooting logs at stdout. I thought at first it might be
rsyslog that was wrongly configured, but I did not find a rule that could
explain this behaviour.

Can anybody else replicate this? If it's a bug, has it been fixed in a
more recent version? (I couldn't find anything relating to such an issue.)

-- 
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ask Sage Anything!

2015-09-02 Thread Patrick McGarry
Hey cephers,

While I'm sure that most of you probably get your Ceph-related
questions answered here on the mailing lists, Sage is doing an "Ask me
anything" on Reddit in about an hour:

https://www.reddit.com/r/IAmA/comments/3jdnnd/i_am_sage_weil_lead_architect_and_cocreator_of/

You can ask him questions about WebRing, DreamHost, CROSS, or anything
else. We'd also appreciate it if you would spread the word so that
Sage's time can be well spent and everyone who is interested gets a
chance to ask a question. Thanks!



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption of file systems on RBD images

2015-09-02 Thread Lionel Bouton
Le 02/09/2015 18:16, Mathieu GAUTHIER-LAFAYE a écrit :
> Hi Lionel,
>
> - Original Message -
>> From: "Lionel Bouton" 
>> To: "Mathieu GAUTHIER-LAFAYE" , 
>> ceph-us...@ceph.com
>> Sent: Wednesday, 2 September, 2015 4:40:26 PM
>> Subject: Re: [ceph-users] Corruption of file systems on RBD images
>>
>> Hi Mathieu,
>>
>> Le 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE a écrit :
>>> Hi All,
>>>
>>> We have some troubles regularly with virtual machines using RBD storage.
>>> When we restart some virtual machines, they starts to do some filesystem
>>> checks. Sometime it can rescue it, sometime the virtual machine die (Linux
>>> or Windows).
>> What is the cause of death as reported by the VM? FS inconsistency?
>> Block device access timeout? ...
> The VM starts normally without any error message, but when the OS starts it 
> detects some inconsistencies on the filesystem.
>
> It tries to repair them (fsck.ext4 or chkdsk.exe)... A few times the repair was 
> successful and we didn't notice any corruption on the VM, but we did not check 
> the whole filesystem. The solution is often to reinstall the VM.

Hum. Ceph is pretty good at keeping your data safe (with a small caveat,
see below), so you might have some other problem causing data corruption.
The first thing that comes to mind is that the VM might be running on faulty
hardware (corrupting data in memory before it is written to disk).

> [...]
> We have not detected any performance issues due to scrubbing. My doubt was 
> about when it checks the data integrity of a pg across two replicas. Can it make 
> a wrong decision and replace the good data with the bad one? I have probably 
> got the scrubbing wrong. Is data safe even if we have only two replicas?

I'm not 100% sure. With non-checksumming filesystems, if the primary OSD
for a PG is corrupted I believe you are out of luck: AFAIK Ceph doesn't
have internal checksums that would allow it to detect corruption when
reading data back, so it will give you back whatever the OSD disk has, even if
it's corrupted. When repairing a pg (after detecting inconsistencies
during a deep scrub) it doesn't seem to find the "right" value
by vote (i.e. with size=3, it could choose the data on the 2
"secondary" OSDs when they match each other but don't match the primary, and so
correct corruption on the primary); instead it overwrites the secondary OSDs with
the data from the primary OSD (which obviously would propagate any corruption
from the primary to the secondary OSDs).

Then there's a subtlety: with BTRFS and disk corruption, the underlying
filesystem will return an error when reading from the primary OSD
(because all reads are checked against internal checksums), and I believe
Ceph will then switch the read to a secondary OSD to give valid data back
to the rbd client. I'm not sure how repairing works in this case: I
suspect the data is overwritten by data from the first OSD where a read
doesn't fail, which would correct the situation without any room for an
incorrect choice, but the documentation and posts on this subject were
not explicit about it.

If I'm right (please wait for confirmation about Ceph's behaviour with Btrfs
from the developers), Ceph shouldn't be able to corrupt data from your VM,
and the corruption should happen before it is stored.
That said, there's a small theoretical window where corruption could
occur outside the system running the VM: in the primary OSD contacted by
that system, if the data to be written is corrupted after being received
and before being transmitted to the secondary OSDs, Ceph itself could corrupt
data (due to flaky hardware on some OSDs). This could be protected
against by computing a checksum of the data on the rbd client and
checking it on all OSDs before writing to disk, but I don't know the
internals/protocols, so I don't know if this is done and this window closed.

>>> We use BTRFS for OSD with a kernel 3.10. This was not strongly discouraged
>>> when we start the deployment of CEPH last year. Now, it seems that the
>>> kernel version should be 3.14 or later for this kind of setup.
>> See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
>> to upgrade.
>>
>> We have a good deal of experience with Btrfs in production now. We had
>> to disable snapshots, make the journal NoCOW, disable autodefrag and
>> develop our own background defragmenter (which converts to zlib at the
>> same time it defragments for additional space savings). We currently use
>> kernel version 4.0.5 (we don't use any RAID level so we don't need 4.0.6
>> to get a fix for an online RAID level conversion bug) and I wouldn't use
>> anything less than 3.19.5. The results are pretty good, but Btrfs is
>> definitely not an out-of-the-box solution for Ceph.
>>
> We do not change any specific options for BTRFS.

On https://btrfs.wiki.kernel.org/index.php/Gotchas there are unspecified 
problems with snapshot-aware defrag fixed in 3.10.31 (so if you use autodefrag 
and <3.10.31 you are 

Re: [ceph-users] cephfs read-only setting doesn't work?

2015-09-02 Thread Gregory Farnum
[ Re-adding the list. ]

On Wed, Sep 2, 2015 at 4:29 PM, Erming Pei  wrote:
> Hi Gregory,
>
>Thanks very much for the confirmation and explanation.
>
>>And I presume you have an MDS cap in there as well?
>   Is there a difference between setting this cap and not setting it?

Well, I don't think you can access the MDS without a read cap, but
maybe it's really just null...

>
>>I think you'll find that the data you've overwritten isn't really written
>> to the OSDs — you wrote it in the local page cache, but the OSDs will reject
>> the writes with EPERM.
>I see. Is there a way for me to verify that, i.e., to see that there is no
> change to the data in the OSDs? I found I can overwrite a file and then I can
> see the file has changed. It may be in the local cache. But how can I test
> this and retrieve a copy from the OSD pool?

Mounting it on another client and seeing if changes are reflected
there would do it. Or unmounting the filesystem, mounting again, and
seeing if the file has really changed.
-Greg

>
> Thanks!
>
> Erming
>
>
>
> On 9/2/15, 2:44 AM, Gregory Farnum wrote:
>>
>> On Tue, Sep 1, 2015 at 9:20 PM, Erming Pei  wrote:
>>>
>>> Hi,
>>>
>>>I tried to set up a read-only permission for a client but it looks
>>> always
>>> writable.
>>>
>>>I did the following:
>>>
>>> ==Server end==
>>>
>>> [client.cephfs_data_ro]
>>>  key = AQxx==
>>>  caps mon = "allow r"
>>>  caps osd = "allow r pool=cephfs_data, allow r
>>> pool=cephfs_metadata"
>>
>> The clients don't directly access the metadata pool at all so you
>> don't need to grant that. :) And I presume you have an MDS cap in
>> there as well?
>>
>>>
>>> ==Client end==
>>> mount -v -t ceph hostname.domainname:6789:/ /cephfs -o
>>> name=cephfs_data_ro,secret=AQxx==
>>>
>>> But I still can touch, delete, overwrite.
>>>
>>> I read that touch/delete could be only meta data operations, but why I
>>> still
>>> can overwrite?
>>>
>>> Is there anyway I could test/check the data pool (instead of meta data)
>>> to
>>> see if any effect on it?
>>
>> What you're seeing here is an unfortunate artifact of the page cache
>> and the way these user capabilities work in Ceph. As you surmise,
>> touch/delete are metadata operations through the MDS and in current
>> code you can't block the client off from that (although we have work
>> in progress to improve things). I think you'll find that the data
>> you've overwritten isn't really written to the OSDs — you wrote it in
>> the local page cache, but the OSDs will reject the writes with EPERM.
>> I don't remember the kernel's exact behavior here though — we updated
>> the userspace client to preemptively check access permissions on new
>> pools but I don't think the kernel ever got that. Zheng?
>> -Greg
>
>
>
> --
> -
>  Erming Pei, Ph.D
>  Senior System Analyst; Grid/Cloud Specialist
>
>  Research Computing Group
>  Information Services & Technology
>  University of Alberta, Canada
>
>  Tel: +1 7804929914Fax: +1 7804921729
> -
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph new mon deploy v9.0.3-1355

2015-09-02 Thread German Anders
Hi cephers, trying to deploying a new ceph cluster with master release
(v9.0.3) and when trying to create the initial mons and error appears
saying that "admin_socket: exception getting command descriptions: [Errno
2] No such file or directory", find the log:


...
[ceph_deploy.mon][INFO  ] distro info: Ubuntu 14.04 trusty
[cibmon01][DEBUG ] determining if provided host has same hostname in remote
[cibmon01][DEBUG ] get remote short hostname
[cibmon01][DEBUG ] deploying mon to cibmon01
[cibmon01][DEBUG ] get remote short hostname
[cibmon01][DEBUG ] remote hostname: cibmon01
[cibmon01][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[cibmon01][DEBUG ] create the mon path if it does not exist
[cibmon01][DEBUG ] checking for done path:
/var/lib/ceph/mon/ceph-cibmon01/done
[cibmon01][DEBUG ] done path does not exist:
/var/lib/ceph/mon/ceph-cibmon01/done
[cibmon01][INFO  ] creating keyring file:
/var/lib/ceph/tmp/ceph-cibmon01.mon.keyring
[cibmon01][DEBUG ] create the monitor keyring file
[cibmon01][INFO  ] Running command: ceph-mon --cluster ceph --mkfs -i
cibmon01 --keyring /var/lib/ceph/tmp/ceph-cibmon01.mon.keyring
[cibmon01][WARNING] libust[16029/16029]: Warning: HOME environment variable
not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at
lttng-ust
-comm.c:305)
[cibmon01][DEBUG ] ceph-mon: set fsid to xx----x
[cibmon01][DEBUG ] ceph-mon: created monfs at
/var/lib/ceph/mon/ceph-cibmon01 for mon.cibmon01
[cibmon01][INFO  ] unlinking keyring file
/var/lib/ceph/tmp/ceph-cibmon01.mon.keyring
[cibmon01][DEBUG ] create a done file to avoid re-doing the mon deployment
[cibmon01][DEBUG ] create the init path if it does not exist
[cibmon01][DEBUG ] locating the `service` executable...
[cibmon01][INFO  ] Running command: initctl emit ceph-mon cluster=ceph
id=cibmon01
[cibmon01][INFO  ] Running command: ceph --cluster=ceph --admin-daemon
/var/run/ceph/ceph-mon.cibmon01.asok mon_status
*[cibmon01][ERROR ] admin_socket: exception getting command descriptions:
[Errno 2] No such file or directory*
[cibmon01][WARNING] monitor: mon.cibmon01, might not be running yet
[cibmon01][INFO  ] Running command: ceph --cluster=ceph --admin-daemon
/var/run/ceph/ceph-mon.cibmon01.asok mon_status
*[cibmon01][ERROR ] admin_socket: exception getting command descriptions:
[Errno 2] No such file or directory*
[cibmon01][WARNING] monitor cibmon01 does not exist in monmap
[ceph_deploy.mon][DEBUG ] detecting platform for host cibmon02 ...
[cibmon02][DEBUG ] connected to host: cibmon02
...

Any ideas?

Thanks in advance,

Regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption of file systems on RBD images

2015-09-02 Thread Mathieu GAUTHIER-LAFAYE
Hi Lionel,

- Original Message -
> From: "Lionel Bouton" 
> To: "Mathieu GAUTHIER-LAFAYE" , 
> ceph-us...@ceph.com
> Sent: Wednesday, 2 September, 2015 4:40:26 PM
> Subject: Re: [ceph-users] Corruption of file systems on RBD images
> 
> Hi Mathieu,
> 
> Le 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE a écrit :
> > Hi All,
> >
> > We have some troubles regularly with virtual machines using RBD storage.
> > When we restart some virtual machines, they starts to do some filesystem
> > checks. Sometime it can rescue it, sometime the virtual machine die (Linux
> > or Windows).
> 
> What is the cause of death as reported by the VM? FS inconsistency?
> Block device access timeout? ...

The VM starts normally without any error message, but when the OS starts it 
detects some inconsistencies on the filesystem.

It tries to repair them (fsck.ext4 or chkdsk.exe)... A few times the repair was 
successful and we didn't notice any corruption on the VM, but we did not check the 
whole filesystem. The solution is often to reinstall the VM.

> 
> >
> > We have move from Firefly to Hammer the last month. I don't know if the
> > problem is in Ceph and is still there or if we continue to see symptom of
> > a Firefly bug.
> >
> > We have two rooms in two separate building, so we set the replica size to
> > 2. I'm in doubt if it can cause this kind of problems when scrubbing
> > operations. I guess the recommended replica size is at less 3.
> 
> Scrubbing is pretty harmless, deep scrubbing is another matter.
> Simultaneous deep scrubs on the same OSD are a performance killer. It
> seems latest Ceph versions provide some way of limiting its impact on
> performance (scrubs are done per pg so 2 simultaneous scrubs can and
> often involve the same OSD and I think there's a limit on scrubs per OSD
> now). AFAIK Firefly doesn't have this (and it surely didn't when we were
> confronted to the problem) so we developed our own deep scrub scheduler
> to avoid involving the same OSD twice (in fact our scheduler tries to
> interleave scrubs so that each OSD has as much inactivity after a deep
> scrub as possible before the next). This helps a lot.
> 

We have not detected any performance issues due to scrubbing. My doubt was about 
when it checks the data integrity of a pg across two replicas. Can it make a wrong 
decision and replace the good data with the bad one? I have probably got the 
scrubbing wrong. Is data safe even if we have only two replicas?

> >
> > We use BTRFS for OSD with a kernel 3.10. This was not strongly discouraged
> > when we start the deployment of CEPH last year. Now, it seems that the
> > kernel version should be 3.14 or later for this kind of setup.
> 
> See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
> to upgrade.
> 
> We have a good deal of experience with Btrfs in production now. We had
> to disable snapshots, make the journal NoCOW, disable autodefrag and
> develop our own background defragmenter (which converts to zlib at the
> same time it defragments for additional space savings). We currently use
> kernel version 4.0.5 (we don't use any RAID level so we don't need 4.0.6
> to get a fix for an online RAID level conversion bug) and I wouldn't use
> anything less than 3.19.5. The results are pretty good, but Btrfs is
> definitely not an out-of-the-box solution for Ceph.
> 

We do not change any specific options for BTRFS. 

> 
> >
> > Does some people already have got similar problems ? Do you think, it's
> > related to our BTRFS setup. Is it the replica size of the pool ?
> 
> It mainly depends on the answer to the first question above (is it a
> corruption or a freezing problem?).

I hope my answer explains the problem more clearly.

> 
> Lionel
> 

Mathieu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph SSD CPU Frequency Benchmarks

2015-09-02 Thread Robert LeBlanc

Thanks for the responses.

I forgot to include the fio test for completeness:

8 job QD=8
[ext4-test]
runtime=150
name=ext4-test
readwrite=randrw
size=15G
blocksize=4k
ioengine=sync
iodepth=8
numjobs=8
thread
group_reporting
time_based
direct=1


1 job QD=1
[ext4-test]
runtime=150
name=ext4-test
readwrite=randrw
size=15G
blocksize=4k
ioengine=sync
iodepth=1
numjobs=1
thread
group_reporting
time_based
direct=1

I have not disabled all of the power management, I've only prevented
the CPU from going to an idle state below C1. I'll have to check on
Jan's suggestion of swapping out the intel_idle driver to see what
difference it makes. I did not run powertop as I did the testing
because it (or cpupower monitor) impacted performance and would have
thrown off the results. I'll do some runs with lower clocks and make
sure that it is staying at the lower speeds. Here is some additional
output:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
userspace
# cpupower monitor
|Nehalem|| Mperf  || Idle_Stats
CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq || POLL | C1-A | C6-A
   0|  0.00| 94.19|  0.00|  0.00||  5.70| 94.30|  1299||  0.00|  0.00| 94.32
   1|  0.00| 99.39|  0.00|  0.00||  0.53| 99.47|  1298||  0.00|  0.00| 99.48
   2|  0.00| 99.60|  0.00|  0.00||  0.38| 99.62|  1299||  0.00|  0.00| 99.61
   3|  0.00| 99.63|  0.00|  0.00||  0.36| 99.64|  1299||  0.00|  0.00| 99.64
   4|  0.00| 99.84|  0.00|  0.00||  0.11| 99.89|  1301||  0.00|  0.00| 99.97
   5|  0.00| 99.57|  0.00|  0.00||  0.40| 99.60|  1299||  0.00|  0.00| 99.61
   6|  0.00| 99.72|  0.00|  0.00||  0.27| 99.73|  1299||  0.00|  0.00| 99.73
   7|  0.00| 99.98|  0.00|  0.00||  0.01| 99.99|  1321||  0.00|  0.00| 99.99
# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

I then echo "1" into /dev/cpu_dma_latency. We can see that the idle
time moves from C6 to C1

# cpupower monitor
|Nehalem|| Mperf  || Idle_Stats
CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq || POLL | C1-A | C6-A
   0|  0.00|  0.00|  0.00|  0.00||  0.37| 99.63|  1299||  0.00| 99.63|  0.00
   1|  0.00|  0.00|  0.00|  0.00||  0.16| 99.84|  1299||  0.00| 99.84|  0.00
   2|  0.00|  0.00|  0.00|  0.00||  0.47| 99.53|  1299||  0.00| 99.53|  0.00
   3|  0.00|  0.00|  0.00|  0.00||  0.43| 99.57|  1299||  0.00| 99.57|  0.00
   4|  0.00|  0.00|  0.00|  0.00||  0.09| 99.91|  1300||  0.00| 99.91|  0.00
   5|  0.00|  0.00|  0.00|  0.00||  0.06| 99.94|  1298||  0.00| 99.94|  0.00
   6|  0.00|  0.00|  0.00|  0.00||  0.09| 99.91|  1300||  0.00| 99.91|  0.00
   7|  0.00|  0.00|  0.00|  0.00||  0.28| 99.72|  1299||  0.00| 99.72|  0.00
# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
0
2
15
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_{min,max,cur}_freq
1200000
1200000
1200000
1200000
1200000
1200000
1200000
1200000
1200000
2401000
2401000
2401000
2401000
2401000
2401000
2401000
1200000
1200000
1200000
1600000
1200000
1200000
1200000
1200000

Thanks for taking the time to collaborate with me on this.



Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Sep 2, 2015 at 3:21 AM, Nick Fisk  wrote:

> I think this may be related to what I had to do, it rings a bell at least.
>
>
> http://unix.stackexchange.com/questions/153693/cant-use-userspace-cpufreq-governor-and-set-cpu-frequency
>
> The P-state driver doesn't support the userspace governor, so you need to disable it
> and make Linux use the old acpi-cpufreq driver instead.
>
> > -Original Message-
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: 01 September 2015 22:21
> > To: 'Robert LeBlanc' 
> > Cc: ceph-users@lists.ceph.com
> > Subject: RE: [ceph-users] Ceph SSD CPU Frequency Benchmarks
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > > Of Robert LeBlanc
> > > Sent: 01 September 2015 21:48
> > > To: Nick Fisk 
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: 

Re: [ceph-users] testing a crush rule against an out osd

2015-09-02 Thread Sage Weil
On Wed, 2 Sep 2015, Dan van der Ster wrote:
> On Wed, Sep 2, 2015 at 4:23 PM, Sage Weil  wrote:
> > On Wed, 2 Sep 2015, Dan van der Ster wrote:
> >> On Wed, Sep 2, 2015 at 4:11 PM, Sage Weil  wrote:
> >> > On Wed, 2 Sep 2015, Dan van der Ster wrote:
> >> >> ...
> >> >> Normally I use crushtool --test --show-mappings to test rules, but
> >> >> AFAICT it doesn't let you simulate an out osd, i.e. with reweight = 0.
> >> >> Any ideas how to test this situation without uploading a crushmap to a
> >> >> running cluster?
> >> >
> >> > crushtool --test --weight  0 ...
> >> >
> >>
> >> Oh thanks :)
> >>
> >> I can't reproduce my real life issue with crushtool though. Still looking 
> >> ...
> >
> > osdmaptool has a --test-map-pg option that may be easier...
> >
> 
> alas I don't have the osdmap from when this happened :(
> 
> But anyway I finally managed to reproduce with crushtool:
> 
> # crushtool -i crush.map --num-rep 3 --test --show-mappings --weight
> 1008 0 --rule 4 --x 7357 2>&1
> CRUSH rule 4 x 7357 [1048,889]
> 
> That's with these tunables:
> 
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable straw_calc_version 1
> 
> So I started tweaking and found that choose_local_tries has no effect,
> choose_local_fallback_tries 1 fixes it, choose_total_tries up to 1000
> has no effect, chooseleaf_descend_once 0 fixes it.
> 
> I don't really want to disable these "optimal" tunables; any other
> advice on what might be going on here?

Ah, the tunable you need is vary_r (=1).  Switching to the 'firefly' 
tunables will enable this... just make sure no librbd/librados instances 
are older than that.

sage
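For reference, a hedged sketch of both options (flag and command names as I remember 
them from the crushtool/ceph docs - verify on your version):

# offline: enable vary_r in the compiled map and re-test the mapping
crushtool -i crush.map --set-chooseleaf-vary-r 1 -o crush.fixed
crushtool -i crush.fixed --num-rep 3 --test --show-mappings --weight 1008 0 --rule 4 --x 7357

# or switch the live cluster to the firefly profile (needs firefly-capable clients):
ceph osd crush tunables firefly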
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] testing a crush rule against an out osd

2015-09-02 Thread Dan van der Ster
On Wed, Sep 2, 2015 at 7:23 PM, Sage Weil  wrote:
> On Wed, 2 Sep 2015, Dan van der Ster wrote:
>> On Wed, Sep 2, 2015 at 4:23 PM, Sage Weil  wrote:
>> > On Wed, 2 Sep 2015, Dan van der Ster wrote:
>> >> On Wed, Sep 2, 2015 at 4:11 PM, Sage Weil  wrote:
>> >> > On Wed, 2 Sep 2015, Dan van der Ster wrote:
>> >> >> ...
>> >> >> Normally I use crushtool --test --show-mappings to test rules, but
>> >> >> AFAICT it doesn't let you simulate an out osd, i.e. with reweight = 0.
>> >> >> Any ideas how to test this situation without uploading a crushmap to a
>> >> >> running cluster?
>> >> >
>> >> > crushtool --test --weight  0 ...
>> >> >
>> >>
>> >> Oh thanks :)
>> >>
>> >> I can't reproduce my real life issue with crushtool though. Still looking 
>> >> ...
>> >
>> > osdmaptool has a --test-map-pg option that may be easier...
>> >
>>
>> alas I don't have the osdmap from when this happened :(
>>
>> But anyway I finally managed to reproduce with crushtool:
>>
>> # crushtool -i crush.map --num-rep 3 --test --show-mappings --weight
>> 1008 0 --rule 4 --x 7357 2>&1
>> CRUSH rule 4 x 7357 [1048,889]
>>
>> That's with these tunables:
>>
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable straw_calc_version 1
>>
>> So I started tweaking and found that choose_local_tries has no effect,
>> choose_local_fallback_tries 1 fixes it, choose_total_tries up to 1000
>> has no effect, chooseleaf_descend_once 0 fixes it.
>>
>> I don't really want to disable these "optimal" tunables; any other
>> advice on what might be going on here?
>
> Ah, the tunable you need is vary_r (=1).  Switching to the 'firefly'
> tunables will enable this... just make sure no librbd/librados instances
> are older than that.
>

Yes, that explains it. Thanks very much for the help!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] testing a crush rule against an out osd

2015-09-02 Thread Dan van der Ster
On Wed, Sep 2, 2015 at 4:23 PM, Sage Weil  wrote:
> On Wed, 2 Sep 2015, Dan van der Ster wrote:
>> On Wed, Sep 2, 2015 at 4:11 PM, Sage Weil  wrote:
>> > On Wed, 2 Sep 2015, Dan van der Ster wrote:
>> >> ...
>> >> Normally I use crushtool --test --show-mappings to test rules, but
>> >> AFAICT it doesn't let you simulate an out osd, i.e. with reweight = 0.
>> >> Any ideas how to test this situation without uploading a crushmap to a
>> >> running cluster?
>> >
>> > crushtool --test --weight <osd-id> 0 ...
>> >
>>
>> Oh thanks :)
>>
>> I can't reproduce my real life issue with crushtool though. Still looking ...
>
> osdmaptool has a --test-map-pg option that may be easier...
>

alas I don't have the osdmap from when this happened :(

But anyway I finally managed to reproduce with crushtool:

# crushtool -i crush.map --num-rep 3 --test --show-mappings --weight
1008 0 --rule 4 --x 7357 2>&1
CRUSH rule 4 x 7357 [1048,889]

That's with these tunables:

tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

So I started tweaking and found that choose_local_tries has no effect,
choose_local_fallback_tries 1 fixes it, choose_total_tries up to 1000
has no effect, chooseleaf_descend_once 0 fixes it.

I don't really want to disable these "optimal" tunables; any other
advice on what might be going on here?

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is Ceph appropriate for small installations?

2015-09-02 Thread Marcin Przyczyna
On 09/02/2015 02:31 PM, Janusz Borkowski wrote:
> Hi!
> 
> Do you have replication factor 2?

yes.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ESXi/LIO/RBD repeatable problem, hang when cloning VM

2015-09-02 Thread Alex Gorbachev
We have experienced a repeatable issue when performing the following:

The Ceph backend has no issues, and we can repeat this at will in lab and
production.  The trigger is cloning an ESXi VM to another VM on the same
datastore on which the original VM resides.  Practically instantly, the LIO
machine becomes unresponsive, Pacemaker fails over to another LIO machine,
and that too becomes unresponsive.

Both are running Ubuntu 14.04, kernel 4.1 (4.1.0-040100-generic x86_64),
Ceph Hammer 0.94.2, and have been able to take quite a workload with no
issues.

Output of /var/log/syslog is below.  I also have a screen dump of a
frozen system - attached.

Thank you,
Alex

Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886254] CPU: 22 PID:
18130 Comm: kworker/22:1 Tainted: G C OE
4.1.0-040100-generic #201506220235
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886303] Hardware name:
Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a
12/05/2013
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886364] Workqueue:
xcopy_wq target_xcopy_do_work [target_core_mod]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886395] task:
8810441c3250 ti: 88105bb4 task.ti: 88105bb4
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886440] RIP:
0010:[]  []
sbc_check_prot+0x49/0x210 [target_core_mod]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886498] RSP:
0018:88105bb43b88  EFLAGS: 00010246
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886525] RAX:
0400 RBX: 8810589eb008 RCX: 0400
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886554] RDX:
8810589eb0f8 RSI:  RDI: 
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886584] RBP:
88105bb43bc8 R08:  R09: 0001
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886613] R10:
 R11:  R12: 
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886643] R13:
88084860c000 R14: c02372c0 R15: 0400
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886673] FS:
() GS:88105f48()
knlGS:
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886719] CS:  0010 DS:
 ES:  CR0: 80050033
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886747] CR2:
0010 CR3: 01e0f000 CR4: 001407e0
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886777] Stack:
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886798]  000b
000c  8810589eb0f8
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886851]  8810589eb008
88084860c000 c02372c0 0400
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886904]  88105bb43c28
c03e528a 000c 0004000c
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886957] Call Trace:
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.886989]
[] sbc_parse_cdb+0x66a/0xa20 [target_core_mod]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887022]
[] iblock_parse_cdb+0x15/0x20 [target_core_iblock]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887077]
[] target_setup_cmd_from_cdb+0x1c0/0x260
[target_core_mod]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887133]
[] target_xcopy_setup_pt_cmd+0x8d/0x170
[target_core_mod]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887188]
[] target_xcopy_read_source.isra.12+0x126/0x220
[target_core_mod]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887243]
[] ? sched_clock+0x9/0x10
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887279]
[] target_xcopy_do_work+0xf1/0x370 [target_core_mod]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887329]
[] ? __switch_to+0x1e6/0x580
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887361]
[] process_one_work+0x144/0x490
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887390]
[] worker_thread+0x11e/0x460
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887418]
[] ? create_worker+0x1f0/0x1f0
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887449]
[] kthread+0xc9/0xe0
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887477]
[] ? flush_kthread_worker+0x90/0x90
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887510]
[] ret_from_fork+0x42/0x70
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.887538]
[] ? flush_kthread_worker+0x90/0x90
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.890342] Code: 7d f8 49 89
fd 4c 89 65 e0 44 0f b6 62 01 41 89 cf 48 8b be 80 00 00 00 41 8b b5
18 04 00 00 41 c0 ec 05 48 83 bb f0 01 00 00 00 <8b> 4f 10 41 89 f6 74
0a 8b 83 f8 01 00 00 85 c0 75 14 45 84 e4
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.890580] RIP
[] sbc_check_prot+0x49/0x210 [target_core_mod]
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.890636]  RSP 
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.890659] CR2: 0010
Sep  2 12:11:55 roc-4r-scd214 kernel: [86831.890956] ---[ end trace
894b2880b8116889 ]---
Sep  2 12:12:04 roc-4r-scd214 kernel: [86833.204150] BUG: unable to
handle kernel paging request at ffd8
Sep  2 12:12:04 roc-4r-scd214 kernel: [86833.204291] IP:
[] kthread_data+0x10/0x20
Sep 

Re: [ceph-users] Ceph read / write : Terrible performance

2015-09-02 Thread Vickey Singh
Thank you Mark, please see my response below.

On Wed, Sep 2, 2015 at 5:23 PM, Mark Nelson  wrote:

> On 09/02/2015 08:51 AM, Vickey Singh wrote:
>
>> Hello Ceph Experts
>>
>> I have a strange problem: when I am reading from or writing to a Ceph pool,
>> it does not perform consistently. Please notice the Cur MB/s column, which keeps going up and
>> down
>>
>> --- Ceph Hammer 0.94.2
>> -- CentOS 6, kernel 2.6
>> -- Ceph cluster is healthy
>>
>
> You might find that CentOS7 gives you better performance.  In some cases
> we were seeing nearly 2X.


Wooo 2X, I would definitely plan for an upgrade. Thanks


>
>
>
>>
>> One interesting thing is that whenever I start a rados bench command for read
>> or write, CPU idle % drops to ~10 and system load increases
>> dramatically.
>>
>> Hardware
>>
>> HpSL4540
>>
>
> Please make sure the controller is on the newest firmware.  There used to
> be a bug that would cause sequential write performance to bottleneck when
> writeback cache was enabled on the RAID controller.


Last month I upgraded the firmware for this hardware, so I hope it is
up to date.


>
>
> 32Core CPU
>> 196G Memory
>> 10G Network
>>
>
> Be sure to check the network too.  We've seen a lot of cases where folks
> have been burned by one of the NICs acting funky.
>

At first view, the interfaces look good and they are pushing data nicely
(whatever they are getting).
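For completeness, a rough sketch of the sort of checks that can rule out a flaky NIC (eth0 and the host name are placeholders):

ip -s link show dev eth0                        # error / dropped counters
ethtool -S eth0 | egrep -i 'err|drop|discard'
# end-to-end throughput between the client and an OSD host:
iperf -s                                        # on the OSD host
iperf -c osd-host.example -t 30                 # on the client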


>
>
>> I don't think hardware is a problem.
>>
>> Please give me clues/pointers on how I should troubleshoot this problem.
>>
>>
>>
>> # rados bench -p glance-test 60 write
>>   Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds
>> or 0 objects
>>   Object prefix: benchmark_data_pouta-s01.pouta.csc.fi_2173350
>> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg
>> lat
>>   0   0 0 0 0 0 -
>>  0
>>   1  1620 4 15.9916   0.12308
>>  0.10001
>>   2  163721   41.984168   1.79104
>> 0.827021
>>   3  166852   69.3122   124  0.084304
>> 0.854829
>>   4  16   11498   97.9746   184   0.12285
>> 0.614507
>>   5  16   188   172   137.568   296  0.210669
>> 0.449784
>>   6  16   248   232   154.634   240  0.090418
>> 0.390647
>>   7  16   305   289165.11   228  0.069769
>> 0.347957
>>   8  16   331   315   157.471   104  0.026247
>> 0.3345
>>   9  16   361   345   153.306   120  0.082861
>> 0.320711
>>  10  16   380   364   145.57576  0.027964
>> 0.310004
>>  11  16   393   377   137.06752   3.73332
>> 0.393318
>>  12  16   448   432   143.971   220  0.334664
>> 0.415606
>>  13  16   476   460   141.508   112  0.271096
>> 0.406574
>>  14  16   497   481   137.39984  0.257794
>> 0.412006
>>  15  16   507   491   130.90640   1.49351
>> 0.428057
>>  16  16   529   513   115.04288  0.399384
>>  0.48009
>>  17  16   533   517   94.628616   5.50641
>> 0.507804
>>  18  16   537   52183.40516   4.42682
>> 0.549951
>>  19  16   538   52280.349 4   11.2052
>> 0.570363
>> 2015-09-02 09:26:18.398641min lat: 0.023851 max lat: 11.2052 avg lat:
>> 0.570363
>> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg
>> lat
>>  20  16   538   522   77.3611 0 -
>> 0.570363
>>  21  16   540   524   74.8825 4   8.88847
>> 0.591767
>>  22  16   542   526   72.5748 8   1.41627
>> 0.593555
>>  23  16   543   527   70.2873 48.0856
>> 0.607771
>>  24  16   555   539   69.567448  0.145199
>> 0.781685
>>  25  16   560   544   68.0177201.4342
>> 0.787017
>>  26  16   564   548   66.424116  0.451905
>>  0.78765
>>  27  16   566   550   64.7055 8  0.611129
>> 0.787898
>>  28  16   570   554   63.313816   2.51086
>> 0.797067
>>  29  16   570   554   61.5549 0 -
>> 0.797067
>>  30  16   572   556   60.1071 4   7.71382
>> 0.830697
>>  31  16   577   561   59.051520   23.3501
>> 0.916368
>>  32  16   590   574   58.870552  0.336684
>> 0.956958
>>  33  16   591   575   57.4986 4   1.92811
>> 0.958647
>>  34  16   591   575   56.0961 0 -
>> 0.958647
>>  35  16   591   575   54.7603 0 -
>> 0.958647
>>  36  16   597   581   54.0447 8  0.187351
>>  1.00313
>>  37  16   625   609  

[ceph-users] rebalancing taking very long time

2015-09-02 Thread Bob Ababurko
When I lose a disk OR replace a OSD in my POC ceph cluster, it takes a very
long time to rebalance.  I should note that my cluster is slightly unique
in that I am using cephfs (shouldn't matter?) and it currently contains
about 310 million objects.

The last time I replaced a disk/OSD was 2.5 days ago and it is still
rebalancing.  This is on a cluster with no client load.

The configurations is 5 hosts with 6 x 1TB 7200rpm SATA OSD's & 1 850 Pro
SSD which contains the journals for said OSD's.  Thats means 30 OSD's in
total.  System disk is on its own disk.  I'm also using a backend network
with a single Gb NIC.  The rebalancing rate (objects/s) seems to be very slow
when it is close to finishing, say <1% of objects misplaced.

It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
with no load on the cluster.  Are my expectations off?

I'm not sure if my pg_num/pgp_num needs to be changed OR if the rebalance time
is dependent on the number of objects in the pool.  These are thoughts I've
had but am not certain are relevant here.
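For reference, the recovery/backfill throttles can be checked and bumped at runtime; a rough sketch (the values below are only illustrative, not recommendations):

$ sudo ceph daemon osd.0 config show | egrep 'osd_max_backfills|osd_recovery_max_active'   # run on the host carrying osd.0
$ sudo ceph tell osd.\* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'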

$ sudo ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

$ sudo ceph -s
[sudo] password for bababurko:
cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
 health HEALTH_WARN
5 pgs backfilling
5 pgs stuck unclean
recovery 3046506/676638611 objects misplaced (0.450%)
 monmap e1: 3 mons at {cephmon01=
10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0
}
election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
 mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
 osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
  pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
18319 GB used, 9612 GB / 27931 GB avail
3046506/676638611 objects misplaced (0.450%)
2095 active+clean
  12 active+clean+scrubbing+deep
   5 active+remapped+backfilling
recovery io 2294 kB/s, 147 objects/s

$ sudo rados df
pool name KB  objects   clones degraded
 unfound   rdrd KB   wrwr KB
cephfs_data   676756996233574670200
  0  21368341676984208   7052266742
cephfs_metadata42738  105843700
  0 16130199  30718800215295996938   3811963908
rbd0000
  00000
  total used 19209068780336805139
  total avail10079469460
  total space29288538240

$ sudo ceph osd pool get cephfs_data pgp_num
pg_num: 1024
$ sudo ceph osd pool get cephfs_metadata pgp_num
pg_num: 1024


thanks,
Bob
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds on 2 nodes vs. on one node

2015-09-02 Thread Christian Balzer

Hello,

On Wed, 2 Sep 2015 22:38:12 + Deneau, Tom wrote:

> In a small cluster I have 2 OSD nodes with identical hardware, each with
> 6 osds.
> 
> * Configuration 1:  I shut down the osds on one node so I am using 6
> OSDS on a single node
>
Shut down how?
Just a "service blah stop" or actually removing them from the cluster aka
CRUSH map?
 
> * Configuration 2:  I shut down 3 osds on each node so now I have 6
> total OSDS but 3 on each node.
> 
Same as above. 
And in this case even more relevant, because just shutting down random OSDs
on both nodes would result in massive recovery action at best and more
likely a broken cluster.

> I measure read performance using rados bench from a separate client node.
Default parameters?

> The client has plenty of spare CPU power and the network and disk
> utilization are not limiting factors. In all cases, the pool type is
> replicated so we're just reading from the primary.
>
Replicated as in size 2? 
We can guess/assume that from your cluster size, but w/o you telling us
or giving us all the various config/crush outputs that is only a guess.
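Something like the following would take the guesswork out of it (the pool name is an assumption):

ceph osd dump | grep pool
ceph osd pool get rbd size
ceph osd tree
ceph osd crush rule dump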
 
> With Configuration 1, I see approximately 70% more bandwidth than with
> configuration 2. 

Never mind that bandwidth is mostly irrelevant in real life; which
bandwidth, read or write?

> In general, any configuration where the osds span 2
> nodes gets poorer performance but in particular when the 2 nodes have
> equal amounts of traffic.
>

Again, guessing from what you're actually doing, this isn't particularly
surprising.
Because with a single node, default rules, and replication of 2, your OSDs
never have to replicate anything when it comes to writes.
Whereas with 2 nodes replication happens and takes more time (latency) and
might also saturate your network (we have of course no idea what your
cluster looks like).

Christian
 
> Is there any ceph parameter that might be throttling the cases where
> osds span 2 nodes?
> 
> -- Tom Deneau, AMD
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] libvirt rbd issue

2015-09-02 Thread Rafael Lopez
Hi Jan,

Thanks for the advice, hit the nail on the head.

I checked the limits and watched the number of fds, and as it reached the soft
limit (1024), that's when the transfer came to a grinding halt and the VM
started locking up.

After your reply I also did some more googling and found another old thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

I increased the max_files in qemu.conf and restarted libvirtd and the VM
(as per Dan's solution in thread above), and now it seems to be happy
copying any size files to the rbd. Confirmed the fd count is going past the
previous soft limit of 1024 also.
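In case it helps anyone else hitting this, the change was essentially the following (the exact limit value below is only an example):

# /etc/libvirt/qemu.conf
max_files = 32768

# then restart libvirtd and power-cycle the guest so the new limit
# applies to the qemu process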

Thanks again!!
Raf

On 2 September 2015 at 18:44, Jan Schermer  wrote:

> 1) Take a look at the number of file descriptors the QEMU process is
> using, I think you are over the limits
>
> pid=pid of qemu process
>
> cat /proc/$pid/limits
> echo /proc/$pid/fd/* | wc -w
>
> 2) Jumbo frames may be the cause, are they enabled on the rest of the
> network? In any case, get rid of NetworkManager ASAP and set it manually,
> though it looks like your NIC might not support them.
>
> Jan
>
>
>
> > On 02 Sep 2015, at 01:44, Rafael Lopez  wrote:
> >
> > Hi ceph-users,
> >
> > Hoping to get some help with a tricky problem. I have a rhel7.1 VM guest
> (host machine also rhel7.1) with root disk presented from ceph 0.94.2-0
> (rbd) using libvirt.
> >
> > The VM also has a second rbd for storage presented from the same ceph
> cluster, also using libvirt.
> >
> > The VM boots fine, no apparent issues with the OS root rbd. I am able to
> mount the storage disk in the VM, and create a file system. I can even
> transfer small files to it. But when I try to transfer a moderate size
> files, eg. greater than 1GB, it seems to slow to a grinding halt and
> eventually it locks up the whole system, and generates the kernel messages
> below.
> >
> > I have googled some *similar* issues around, but haven't come across
> some solid advice/fix. So far I have tried modifying the libvirt disk cache
> settings, tried using the latest mainline kernel (4.2+), different file
> systems (ext4, xfs, zfs) all produce similar results. I suspect it may be
> network related, as when I was using the mainline kernel I was transferring
> some files to the storage disk and this message came up, and the transfer
> seemed to stop at the same time:
> >
> > Sep  1 15:31:22 nas1-rds NetworkManager[724]: 
> [1441085482.078646] [platform/nm-linux-platform.c:2133] sysctl_set():
> sysctl: failed to set '/proc/sys/net/ipv6/conf/eth0/mtu' to '9000': (22)
> Invalid argument
> >
> > I think maybe the key info to troubleshooting is that it seems to be OK
> for files under 1GB.
> >
> > Any ideas would be appreciated.
> >
> > Cheers,
> > Raf
> >
> >
> > Sep  1 16:04:15 nas1-rds kernel: INFO: task kworker/u8:1:60 blocked for
> more than 120 seconds.
> > Sep  1 16:04:15 nas1-rds kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Sep  1 16:04:15 nas1-rds kernel: kworker/u8:1D 88023fd93680
>  060  2 0x
> > Sep  1 16:04:15 nas1-rds kernel: Workqueue: writeback
> bdi_writeback_workfn (flush-252:80)
> > Sep  1 16:04:15 nas1-rds kernel: 880230c136b0 0046
> 8802313c4440 880230c13fd8
> > Sep  1 16:04:15 nas1-rds kernel: 880230c13fd8 880230c13fd8
> 8802313c4440 88023fd93f48
> > Sep  1 16:04:15 nas1-rds kernel: 880230c137b0 880230fbcb08
> e8d80ec0 88022e827590
> > Sep  1 16:04:15 nas1-rds kernel: Call Trace:
> > Sep  1 16:04:15 nas1-rds kernel: []
> io_schedule+0x9d/0x130
> > Sep  1 16:04:15 nas1-rds kernel: [] bt_get+0x10f/0x1a0
> > Sep  1 16:04:15 nas1-rds kernel: [] ?
> wake_up_bit+0x30/0x30
> > Sep  1 16:04:15 nas1-rds kernel: []
> blk_mq_get_tag+0xbf/0xf0
> > Sep  1 16:04:15 nas1-rds kernel: []
> __blk_mq_alloc_request+0x1b/0x1f0
> > Sep  1 16:04:15 nas1-rds kernel: []
> blk_mq_map_request+0x181/0x1e0
> > Sep  1 16:04:15 nas1-rds kernel: []
> blk_sq_make_request+0x9a/0x380
> > Sep  1 16:04:15 nas1-rds kernel: [] ?
> generic_make_request_checks+0x24f/0x380
> > Sep  1 16:04:15 nas1-rds kernel: []
> generic_make_request+0xe2/0x130
> > Sep  1 16:04:15 nas1-rds kernel: []
> submit_bio+0x71/0x150
> > Sep  1 16:04:15 nas1-rds kernel: []
> ext4_io_submit+0x25/0x50 [ext4]
> > Sep  1 16:04:15 nas1-rds kernel: []
> ext4_bio_write_page+0x159/0x2e0 [ext4]
> > Sep  1 16:04:15 nas1-rds kernel: []
> mpage_submit_page+0x5d/0x80 [ext4]
> > Sep  1 16:04:15 nas1-rds kernel: []
> mpage_map_and_submit_buffers+0x172/0x2a0 [ext4]
> > Sep  1 16:04:15 nas1-rds kernel: []
> ext4_writepages+0x733/0xd60 [ext4]
> > Sep  1 16:04:15 nas1-rds kernel: []
> do_writepages+0x1e/0x40
> > Sep  1 16:04:15 nas1-rds kernel: []
> __writeback_single_inode+0x40/0x220
> > Sep  1 16:04:15 nas1-rds kernel: []
> writeback_sb_inodes+0x25e/0x420
> > Sep  1 16:04:15 nas1-rds kernel: []
> __writeback_inodes_wb+0x9f/0xd0
> > Sep  1 16:04:15 

[ceph-users] osds on 2 nodes vs. on one node

2015-09-02 Thread Deneau, Tom
In a small cluster I have 2 OSD nodes with identical hardware, each with 6 osds.

* Configuration 1:  I shut down the osds on one node so I am using 6 OSDS on a 
single node

* Configuration 2:  I shut down 3 osds on each node so now I have 6 total OSDS 
but 3 on each node.

I measure read performance using rados bench from a separate client node.
The client has plenty of spare CPU power and the network and disk utilization 
are not limiting factors.
In all cases, the pool type is replicated so we're just reading from the 
primary.

With Configuration 1, I see approximately 70% more bandwidth than with 
configuration 2.
In general, any configuration where the osds span 2 nodes gets poorer 
performance but in particular
when the 2 nodes have equal amounts of traffic.

Is there any ceph parameter that might be throttling the cases where osds span 
2 nodes?

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph SSD CPU Frequency Benchmarks

2015-09-02 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Changing to the acpi_idle driver dropped the performance by about 50%.
That was an unexpected result.

I'm having issues with powertop and the userspace governor; it always
shows 100% idle. I downloaded the latest version with the same result.
Still more work to do, but I wanted to share my findings.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Sep 2, 2015 at 9:50 AM, Robert LeBlanc  wrote:
- -BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Thanks for the responses.

I forgot to include the fio test for completeness:

8 job QD=8
[ext4-test]
runtime=150
name=ext4-test
readwrite=randrw
size=15G
blocksize=4k
ioengine=sync
iodepth=8
numjobs=8
thread
group_reporting
time_based
direct=1


1 job QD=1
[ext4-test]
runtime=150
name=ext4-test
readwrite=randrw
size=15G
blocksize=4k
ioengine=sync
iodepth=1
numjobs=1
thread
group_reporting
time_based
direct=1

I have not disabled all of the power management, I've only prevented
the CPU from going to an idle state below C1. I'll have to check on
Jan's suggestion of swapping out the intel_idle driver to see what
difference it makes. I did not run powertop as I did the testing
because it (or cpupower monitor) impacted performance and would have
thrown off the results. I'll do some runs with lower clocks and make
sure that it is staying at the lower speeds. Here is some additional
output:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
userspace
# cpupower monitor
|Nehalem|| Mperf  || Idle_Stats
CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq || POLL | C1-A | C6-A
   0|  0.00| 94.19|  0.00|  0.00||  5.70| 94.30|  1299||  0.00|  0.00| 94.32
   1|  0.00| 99.39|  0.00|  0.00||  0.53| 99.47|  1298||  0.00|  0.00| 99.48
   2|  0.00| 99.60|  0.00|  0.00||  0.38| 99.62|  1299||  0.00|  0.00| 99.61
   3|  0.00| 99.63|  0.00|  0.00||  0.36| 99.64|  1299||  0.00|  0.00| 99.64
   4|  0.00| 99.84|  0.00|  0.00||  0.11| 99.89|  1301||  0.00|  0.00| 99.97
   5|  0.00| 99.57|  0.00|  0.00||  0.40| 99.60|  1299||  0.00|  0.00| 99.61
   6|  0.00| 99.72|  0.00|  0.00||  0.27| 99.73|  1299||  0.00|  0.00| 99.73
   7|  0.00| 99.98|  0.00|  0.00||  0.01| 99.99|  1321||  0.00|  0.00| 99.99
# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

I then echo "1" into /dev/cpu_dma_latency. We can see that the idle
time moves from C6 to C1

# cpupower monitor
|Nehalem|| Mperf  || Idle_Stats
CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq || POLL | C1-A | C6-A
   0|  0.00|  0.00|  0.00|  0.00||  0.37| 99.63|  1299||  0.00| 99.63|  0.00
   1|  0.00|  0.00|  0.00|  0.00||  0.16| 99.84|  1299||  0.00| 99.84|  0.00
   2|  0.00|  0.00|  0.00|  0.00||  0.47| 99.53|  1299||  0.00| 99.53|  0.00
   3|  0.00|  0.00|  0.00|  0.00||  0.43| 99.57|  1299||  0.00| 99.57|  0.00
   4|  0.00|  0.00|  0.00|  0.00||  0.09| 99.91|  1300||  0.00| 99.91|  0.00
   5|  0.00|  0.00|  0.00|  0.00||  0.06| 99.94|  1298||  0.00| 99.94|  0.00
   6|  0.00|  0.00|  0.00|  0.00||  0.09| 99.91|  1300||  0.00| 99.91|  0.00
   7|  0.00|  0.00|  0.00|  0.00||  0.28| 99.72|  1299||  0.00| 99.72|  0.00
# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
0
2
15
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_{min,max,cur}_freq
120
120
120
120
120
120
120
120
120
2401000
2401000
2401000
2401000
2401000
2401000
2401000
120
120
120
160
120
120
120
120
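One caveat worth noting about /dev/cpu_dma_latency: as far as I understand
the pm_qos interface, the requested value only stays in effect while the
process that wrote it keeps the file descriptor open, so for anything
longer-lived it has to be held open rather than written with a one-shot echo.
A minimal sketch:

# keep the request alive for as long as fd 3 stays open in this shell
exec 3> /dev/cpu_dma_latency
echo 1 >&3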

Thanks for taking the time to collaborate with me on this.
- -BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV5xrBCRDmVDuy+mK58QAAWaoP/2bIKlsp+fmlViP4pFV7
Sv+y/1nCQdNs0l2AJdiDX2l7OQrYavDh5LldJBkcmTyB74KjDJ+i88VGYkdG
n8Q6tTbF4erw8P/gPf3DIrvQazdQm+a/6rUBpkM+MNTRyKRczxeyCu8kCNzb
jDP7erwnj0WzCZMAA1uFLa9sMKBNxOfpK9wQR5NbQCkOcsDtprNL2KPfxrFV
Rgk0OBGBSLtz9BE/PMYpbeqr9o1nChCp4hkg5AUcFrAuceOKdA7R8lKPIUZ6
0zTL1OjGsGfy/sp856poqmF02bANF9LXzmcBMKBNMO0iS89xv0YyIgRBlt/Z
lXc4M7IWtYzbbUVAtSLcOtWrzS8Yp0hMKlPrhA7LZFrhZ4+t45mvyrS3RbiP
RG8osdvjz58ZBS7/jk1gDZd8Xbj5bsU3n01DTFJ3CeAE2etAqgheAGlj4OTR
kfs/g1jbYArEgnfX3jTJ2wECjfVRTrgXJGjceoYtJYbQ4Ns/0dBWpZBrkEu0
AX4VU1dk9R1B0rootvKsWedcKvof4cSOyKRtQxGHS7ipqtkyep+1JquO41mr
cBC9p/TOXgh90M8476G1CpMqWwWHneHJ6bjO5V1W8uWGXTNFnaGbqS4v3mWk
ge1qukr9et0Su0llUb8Rz3hCDqD6PfMJpquBTAB/kaanS+t0pi+00wxu7zzB
zVQ/
=v4sY
- -END PGP SIGNATURE-

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Sep 2, 2015 at 3:21 AM, Nick Fisk  wrote:
I think this may be related to what I had to do, it rings a bell at least.
http://unix.stackexchange.com/questions/153693/cant-use-userspace-cpufreq-governor-and-set-cpu-frequency

The P-state driver doesn't support userspace, so you need to disable it
and make Linux use the old acpi driver instead.

> -Original Message-
> From: Nick 

Re: [ceph-users] ceph-deploy: too many argument: --setgroup 10

2015-09-02 Thread Travis Rhoden
Hi Noah,

What is the ownership on /var/lib/ceph ?

ceph-deploy should only be trying to use --setgroup if /var/lib/ceph is
owned by non-root.

On a fresh install of Hammer, this should be root:root.
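A quick sketch of what to check (and, if needed, fix) on the mon host, assuming default paths:

ls -ld /var/lib/ceph
# on a plain Hammer install this should be root:root; if it is not:
chown root:root /var/lib/ceph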

The --setgroup flag was added to ceph-deploy in 1.5.26.

 - Travis

On Wed, Sep 2, 2015 at 1:59 PM, Noah Watkins  wrote:

> I'm getting the following error using ceph-deploy to set up a cluster.
> It's Centos6.6 and I'm using Hammer and the latest ceph-deploy. It
> looks like setgroup wasn't an option in Hammer, but ceph-deploy adds
> it. Is there a trick or older version of ceph-deploy I should try?
>
> - Noah
>
> [cn67][INFO  ] Running command: sudo ceph-mon --cluster ceph --mkfs -i
> cn67 --keyring /var/lib/ceph/tmp/ceph-cn67.mon.keyring --setgroup 10
> [cn67][WARNIN] too many arguments: [--setgroup,10]
> [cn67][DEBUG ]   --conf/-c FILEread configuration from the given
> configuration file
> [cn67][WARNIN] usage: ceph-mon -i monid [flags]
> [cn67][DEBUG ]   --id/-i IDset ID portion of my name
> [cn67][WARNIN]   --debug_mon n
> [cn67][DEBUG ]   --name/-n TYPE.ID set name
> [cn67][WARNIN] debug monitor level (e.g. 10)
> [cn67][DEBUG ]   --cluster NAMEset cluster name (default: ceph)
> [cn67][WARNIN]   --mkfs
> [cn67][DEBUG ]   --version show version and quit
> [cn67][WARNIN] build fresh monitor fs
> [cn67][DEBUG ]
> [cn67][WARNIN]   --force-sync
> [cn67][DEBUG ]   -drun in foreground, log to stderr.
> [cn67][WARNIN] force a sync from another mon by wiping local
> data (BE CAREFUL)
> [cn67][DEBUG ]   -frun in foreground, log to usual
> location.
> [cn67][WARNIN]   --yes-i-really-mean-it
> [cn67][DEBUG ]   --debug_ms N  set message debug level (e.g. 1)
> [cn67][WARNIN] mandatory safeguard for --force-sync
> [cn67][WARNIN]   --compact
> [cn67][WARNIN] compact the monitor store
> [cn67][WARNIN]   --osdmap 
> [cn67][WARNIN] only used when --mkfs is provided: load the
> osdmap from 
> [cn67][WARNIN]   --inject-monmap 
> [cn67][WARNIN] write the  monmap to the local
> monitor store and exit
> [cn67][WARNIN]   --extract-monmap 
> [cn67][WARNIN] extract the monmap from the local monitor store and
> exit
> [cn67][WARNIN]   --mon-data 
> [cn67][WARNIN] where the mon store and keyring are located
> [cn67][ERROR ] RuntimeError: command returned non-zero exit status: 1
> [ceph_deploy.mon][ERROR ] Failed to execute command: ceph-mon
> --cluster ceph --mkfs -i cn67 --keyring
> /var/lib/ceph/tmp/ceph-cn67.mon.keyring --setgroup 10
> [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS with cache tiering - reading files are filled with 0s

2015-09-02 Thread Arthur Liu
Hi,

I am experiencing an issue with CephFS with cache tiering where the kernel
clients are reading files filled entirely with 0s.

The setup:
ceph 0.94.3
create cephfs_metadata replicated pool
create cephfs_data replicated pool
cephfs was created on the above two pools, populated with files, then:
create cephfs_ssd_cache replicated pool,
then adding the tiers:
ceph osd tier add cephfs_data cephfs_ssd_cache
ceph osd tier cache-mode cephfs_ssd_cache writeback
ceph osd tier set-overlay cephfs_data cephfs_ssd_cache

While the cephfs_ssd_cache pool is empty, multiple kernel clients on
different hosts open the same file (the size of the file is small, <10k) at
approximately the same time. A number of the clients from the OS level see
the entire file being empty. I can do a rados -p {cache pool} ls for the
list of files cached, and do a rados -p {cache pool} get {object} /tmp/file
and see the complete contents of the file.
I can repeat this by setting cache-mode to forward, rados -p {cache pool}
cache-flush-evict-all, checking no more objects in cache with rados -p
{cache pool} ls, resetting cache-mode to writeback with an empty pool, and
doing the multiple same file opens.
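Roughly, the reset-and-repeat cycle looks like this (pool names as above; the client-side path and checksum step are just one way to spot the zero-filled reads):

ceph osd tier cache-mode cephfs_ssd_cache forward
rados -p cephfs_ssd_cache cache-flush-evict-all
rados -p cephfs_ssd_cache ls                      # should come back empty
ceph osd tier cache-mode cephfs_ssd_cache writeback
# then open the same small file from several kernel clients at once, e.g.:
md5sum /mnt/cephfs/path/to/file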

Has anyone seen this issue? It seems like what may be a race condition
where the object is not yet completely loaded into the cache pool so the
cache pool serves out an incomplete object.
If anyone can shed some light or any suggestions to help debug this issue,
that would be very helpful.

Thanks,
Arthur
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve ceph cluster capacity usage

2015-09-02 Thread huang jun
After searching the source code, I found the ceph_psim tool, which can
simulate object distribution,
but it seems a little simple.



2015-09-01 22:58 GMT+08:00 huang jun :
> hi,all
>
> Recently, i did some experiments on OSD data distribution,
> we set up a cluster with 72 OSDs,all 2TB sata disk,
> and ceph version is v0.94.3 and linux kernel version is 3.18,
> and set "ceph osd crush tunables optimal".
> There are 3 pools:
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 512 pgp_num 512 last_change 302 stripe_width 0
> pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 4096 pgp_num 4096 last_change 832
> crash_replay_interval 45 stripe_width 0
> pool 2 'metadata' replicated size 3 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 512 pgp_num 512 last_change 302
> stripe_width 0
>
> the osd pg num of each osd:
> pool  : 0  1  2  | SUM
> 
> osd.0    13  105  18  | 136
> osd.1    17  110  26  | 153
> osd.2    15  114  20  | 149
> osd.3    11  101  17  | 129
> osd.4     8  106  17  | 131
> osd.5    12  102  19  | 133
> osd.6    19  114  29  | 162
> osd.7    16  115  21  | 152
> osd.8    15  117  25  | 157
> osd.9    13  117  23  | 153
> osd.10   13  133  16  | 162
> osd.11   14  105  21  | 140
> osd.12   11   94  16  | 121
> osd.13   12  110  21  | 143
> osd.14   20  119  26  | 165
> osd.15   12  125  19  | 156
> osd.16   15  126  22  | 163
> osd.17   13  109  19  | 141
> osd.18    8  119  19  | 146
> osd.19   14  114  19  | 147
> osd.20   17  113  29  | 159
> osd.21   17  111  27  | 155
> osd.22   13  121  20  | 154
> osd.23   14   95  23  | 132
> osd.24   17  110  26  | 153
> osd.25   13  133  15  | 161
> osd.26   17  124  24  | 165
> osd.27   16  119  20  | 155
> osd.28   19  134  30  | 183
> osd.29   13  121  20  | 154
> osd.30   11   97  20  | 128
> osd.31   12  109  18  | 139
> osd.32   10  112  15  | 137
> osd.33   18  114  28  | 160
> osd.34   19  112  29  | 160
> osd.35   16  121  32  | 169
> osd.36   13  111  18  | 142
> osd.37   15  107  22  | 144
> osd.38   21  129  24  | 174
> osd.39    9  121  17  | 147
> osd.40   11  102  18  | 131
> osd.41   14  101  19  | 134
> osd.42   16  119  25  | 160
> osd.43   12  118  13  | 143
> osd.44   17  114  25  | 156
> osd.45   11  114  15  | 140
> osd.46   12  107  16  | 135
> osd.47   15  111  23  | 149
> osd.48   14  115  20  | 149
> osd.49    9   94  13  | 116
> osd.50   14  117  18  | 149
> osd.51   13  112  19  | 144
> osd.52   11  126  22  | 159
> osd.53   12  122  18  | 152
> osd.54   13  121  20  | 154
> osd.55   17  114  25  | 156
> osd.56   11  118  18  | 147
> osd.57   22  137  25  | 184
> osd.58   15  105  22  | 142
> osd.59   13  120  18  | 151
> osd.60   12  110  19  | 141
> osd.61   21  114  28  | 163
> osd.62   12   97  18  | 127
> osd.63   19  109  31  | 159
> osd.64   10  132  21  | 163
> osd.65   19  137  21  | 177
> osd.66   22  107  32  | 161
> osd.67   12  107  20  | 139
> osd.68   14  100  22  | 136
> osd.69   16  110  24  | 150
> osd.70    9  101  14  | 124
> osd.71   15  112  24  | 151
>
> 
> SUM   : 1024   8192   1536   |
>
> We can see that, for poolid=1 (the data pool),
> osd.57 and osd.65 both have 137 PGs but osd.12 and osd.49 only have 94 PGs,
> which may cause data distribution imbalance and reduce the space
> utilization of the cluster.
>
> Use "crushtool -i crush.raw --test --show-mappings --rule 0 --num-rep
> 2 --min-x 1 --max-x %s"
> we tested different pool pg_num:
>
> Total PG num PG num stats
>  ---
> 4096 avg: 113.78 (avg stands for avg PG num of every osd)
> total: 8192  (total stands for total PG num, including replica PGs)
> max: 139 +0.221680 (max stands for max PG num on an OSD; +0.221680 is the
> ratio above the average PG num)
> min: 113 -0.226562 (min stands for min PG num on an OSD; -0.226562 is the
> ratio below the average PG num)
>
> 8192 avg: 227.56
> total: 16384
> max: 267 0.173340
> min: 226 -0.129883
>
> 16384 avg: 455.11
> total: 32768
> max: 502 0.103027
> min: 455 -0.127686
>
> 32768 avg: 910.22
> total: 65536
> max: 966 0.061279
> min: 910 -0.076050
>
> With bigger pg_num, the gap between the maximum and the minimum decreased.
> But it's unreasonable to set such large pg_num, which will increase
> OSD and MON load.
>
> Is there any way to get a more balanced PG distribution of the 

Re: [ceph-users] cephfs read-only setting doesn't work?

2015-09-02 Thread Yan, Zheng

> On Sep 2, 2015, at 16:44, Gregory Farnum  wrote:
> 
> On Tue, Sep 1, 2015 at 9:20 PM, Erming Pei  wrote:
>> Hi,
>> 
>>  I tried to set up read-only permission for a client but it always looks
>> writable.
>> 
>>  I did the following:
>> 
>> ==Server end==
>> 
>> [client.cephfs_data_ro]
>>key = AQxx==
>>caps mon = "allow r"
>>caps osd = "allow r pool=cephfs_data, allow r pool=cephfs_metadata"
> 
> The clients don't directly access the metadata pool at all so you
> don't need to grant that. :) And I presume you have an MDS cap in
> there as well?
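> For reference, the full cap set would look roughly like this (pool name per
> your setup; treat it as a sketch, since MDS-side read-only enforcement is
> still limited on current releases):
>
> ceph auth caps client.cephfs_data_ro \
>     mon 'allow r' \
>     mds 'allow' \
>     osd 'allow r pool=cephfs_data'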
> 
>> 
>> 
>> ==Client end==
>> mount -v -t ceph hostname.domainname:6789:/ /cephfs -o
>> name=cephfs_data_ro,secret=AQxx==
>> 
>> But I still can touch, delete, overwrite.
>> 
>> I read that touch/delete could be only metadata operations, but why can I still
>> overwrite?
>> 
>> Is there anyway I could test/check the data pool (instead of meta data) to
>> see if any effect on it?
> 
> What you're seeing here is an unfortunate artifact of the page cache
> and the way these user capabilities work in Ceph. As you surmise,
> touch/delete are metadata operations through the MDS and in current
> code you can't block the client off from that (although we have work
> in progress to improve things). I think you'll find that the data
> you've overwritten isn't really written to the OSDs — you wrote it in
> the local page cache, but the OSDs will reject the writes with EPERM.
> I don't remember the kernel's exact behavior here though — we updated
> the userspace client to preemptively check access permissions on new
> pools but I don't think the kernel ever got that. Zheng?

4.2 and later kernels include that.

Yan, Zheng


> -Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph read / write : Terrible performance

2015-09-02 Thread Vickey Singh
Hello Ceph Experts

I have a strange problem: when I am reading from or writing to a Ceph pool, it does
not perform consistently. Please notice the Cur MB/s column, which keeps going up and down

--- Ceph Hammer 0.94.2
-- CentOS 6, kernel 2.6
-- Ceph cluster is healthy


One interesting thing is that whenever I start a rados bench command for read or
write, CPU idle % drops to ~10 and system load increases dramatically.

Hardware

HpSL4540
32Core CPU
196G Memory
10G Network

I don't think hardware is a problem.

Please give me clues/pointers on how I should troubleshoot this problem.



# rados bench -p glance-test 60 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or
0 objects
 Object prefix: benchmark_data_pouta-s01.pouta.csc.fi_2173350
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1  1620 4 15.9916   0.12308   0.10001
 2  163721   41.984168   1.79104  0.827021
 3  166852   69.3122   124  0.084304  0.854829
 4  16   11498   97.9746   184   0.12285  0.614507
 5  16   188   172   137.568   296  0.210669  0.449784
 6  16   248   232   154.634   240  0.090418  0.390647
 7  16   305   289165.11   228  0.069769  0.347957
 8  16   331   315   157.471   104  0.0262470.3345
 9  16   361   345   153.306   120  0.082861  0.320711
10  16   380   364   145.57576  0.027964  0.310004
11  16   393   377   137.06752   3.73332  0.393318
12  16   448   432   143.971   220  0.334664  0.415606
13  16   476   460   141.508   112  0.271096  0.406574
14  16   497   481   137.39984  0.257794  0.412006
15  16   507   491   130.90640   1.49351  0.428057
16  16   529   513   115.04288  0.399384   0.48009
17  16   533   517   94.628616   5.50641  0.507804
18  16   537   52183.40516   4.42682  0.549951
19  16   538   52280.349 4   11.2052  0.570363
2015-09-02 09:26:18.398641min lat: 0.023851 max lat: 11.2052 avg lat:
0.570363
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
20  16   538   522   77.3611 0 -  0.570363
21  16   540   524   74.8825 4   8.88847  0.591767
22  16   542   526   72.5748 8   1.41627  0.593555
23  16   543   527   70.2873 48.0856  0.607771
24  16   555   539   69.567448  0.145199  0.781685
25  16   560   544   68.0177201.4342  0.787017
26  16   564   548   66.424116  0.451905   0.78765
27  16   566   550   64.7055 8  0.611129  0.787898
28  16   570   554   63.313816   2.51086  0.797067
29  16   570   554   61.5549 0 -  0.797067
30  16   572   556   60.1071 4   7.71382  0.830697
31  16   577   561   59.051520   23.3501  0.916368
32  16   590   574   58.870552  0.336684  0.956958
33  16   591   575   57.4986 4   1.92811  0.958647
34  16   591   575   56.0961 0 -  0.958647
35  16   591   575   54.7603 0 -  0.958647
36  16   597   581   54.0447 8  0.187351   1.00313
37  16   625   609   52.8394   112   2.12256   1.09256
38  16   631   61552.22724   1.57413   1.10206
39  16   638   622   51.723228   4.41663   1.15086
2015-09-02 09:26:40.510623min lat: 0.023851 max lat: 27.6704 avg lat:
1.15657
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
40  16   652   636   51.810256  0.113345   1.15657
41  16   682   666   53.1443   120  0.041251   1.17813
42  16   685   669   52.339512  0.501285   1.17421
43  15   690   675   51.795524   2.26605   1.18357
44  16   728   712   53.6062   148  0.589826   1.17478
45  16   728   712   52.6158 0 -   1.17478
46  16   728   712   51.6613 0 -   1.17478
47  16   728   712   50.7407 0 -   1.17478
48  16   772   756   52.933244  0.2348111.1946
49  16   835   819   56.3577   252   5.67087   1.12063
50  16   890   874   59.1252   220  0.230806   1.06778
51  16   896   880   58.5409   

[ceph-users] testing a crush rule against an out osd

2015-09-02 Thread Dan van der Ster
Hi all,

We just ran into a small problem where some PGs wouldn't backfill
after an OSD was marked out. Here's the relevant crush rule; being a
non-trivial example I'd like to test different permutations of the
crush map (e.g. increasing choose_total_tries):

rule critical {
ruleset 4
type replicated
min_size 1
max_size 10
step take 0513-R-0060
step chooseleaf firstn 2 type ipservice
step emit
step take 0513-R-0050
step chooseleaf firstn -2 type rack
step emit
}

Here's the osd tree:
   https://stikked.web.cern.ch/stikked/view/c284b6b2

The relevant pool has size=3. The problem was that when a single OSD
in 0513-R-0060 was marked out, the rule above was only emitting 2
OSDs for a few PGs (the missing replica was always from 0513-R-0060).

Normally I use crushtool --test --show-mappings to test rules, but
AFAICT it doesn't let you simulate an out osd, i.e. with reweight = 0.
Any ideas how to test this situation without uploading a crushmap to a
running cluster?

Cheers,

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Appending to an open file - O_APPEND flag

2015-09-02 Thread Yan, Zheng

> On Sep 2, 2015, at 17:11, Gregory Farnum  wrote:
> 
> Whoops, forgot to add Zheng.
> 
> On Wed, Sep 2, 2015 at 10:11 AM, Gregory Farnum  wrote:
>> On Wed, Sep 2, 2015 at 10:00 AM, Janusz Borkowski
>>  wrote:
>>> Hi!
>>> 
>>> I mount cephfs using kernel client (3.10.0-229.11.1.el7.x86_64).
>>> 
>>> The effect is the same when doing "echo >>" from another machine and from a
>>> machine keeping the file open.
>>> 
>>> The file is opened with open( ..,
>>> O_WRONLY|O_LARGEFILE|O_APPEND|O_BINARY|O_CREAT)
>>> 
>>> Shell ">>" is implemented as (from strace bash -c "echo '7789' >>
>>> /mnt/ceph/test):
>>> 
>>>open("/mnt/ceph/test", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
>>> 
>>> The test file had ~500KB size.
>>> 
>>> Each subsequent "echo >>" writes to the start of the test file, first "echo"
>>> overwriting the original contents, next "echos" overwriting bytes written by
>>> the preceding "echo".
>> 
>> Hmmm. The userspace (ie, ceph-fuse) implementation of this is a little
>> bit racy but ought to work. I'm not as familiar with the kernel code
>> but I'm not seeing any special behavior in the Ceph code — Zheng,
>> would you expect this to work? It looks like some of the linux
>> filesystems have their own O_APPEND handling and some don't, but I
>> can't find it in the VFS either.
>> -Greg

Yes, the kernel client does not handle the case that multiple clients do append 
write to the same file. I will fix it soon.

Regards
Yan, Zheng

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is Ceph appropriate for small installations?

2015-09-02 Thread Marcin Przyczyna
On 08/31/2015 09:39 AM, Wido den Hollander wrote:

> True, but your performance is greatly impacted during recovery. So a
> three node cluster might work well when the skies are clear and the sun
> is shining, but it has a hard time dealing with a complete node failure.

The question of "how tiny a cluster can be" I answer with
some perf. data I collected few days ago.

My setup:
- 3 armhf-based servers (a sort of Raspberry Pi),
- only one 100 mbit/s LAN for all sorts of access,
- 3 USB sticks, each with a dedicated 4 GB /dev/sda1 partition, stuck into an
armhf server and formatted with xfs,
- 3 MMC cards for the OS,
- OSD journal on MMC, OSD on USB stick
- 1 ordinary PC as client,
- 3 OSDs, 3 MONs, 1 MDS.
- debian8, 64bit everywhere

I/O Performance test command based on
"time dd if=/dev/zero of=./test bs=1024k count=1024"
revealed:

1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 81.3964 s, 13.2 MB/s

real1m28.619s
user0m0.052s
sys 0m1.724s

Result:
Creating a 1 GB zero-filled file takes ~1 min 30 secs.

From my point of view it is possible to set up a poor man's
fileserver cluster on typical home LAN hardware
(e.g. my switch is a SOHO DSL modem/router) with
3 low-power servers powered by smartphone chips
and 3 cheap USB sticks as data storage.

It is not quick, but it works. It consumes a tiny amount
of electrical power (the whole "farm" needs about 25 W)
and has no mechanical rotating parts.
The cluster uses passive heatsinks only and
produces almost no heat: no fans are needed at all.
During the very hot summer this year I noticed
no temperature-based failures at all.

HW cost: about 800 euro.
Hint: try to find a Centera on the EMC² webpage at that price :-)

My question:
how can I "damage" one of my OSDs in an intelligent way
to test the cluster performance during recovery?

Cheers,
Marcin.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jessie repo for ceph hammer?

2015-09-02 Thread Rottmann Jonas
Hi,

When can we expect a Jessie repo for Ceph Hammer to be
available?

Thanks!
Mit freundlichen Grüßen/Kind regards
Jonas Rottmann
Systems Engineer

FIS-ASP Application Service Providing und
IT-Outsourcing GmbH
Röthleiner Weg 4
D-97506 Grafenrheinfeld
Phone: +49 (9723) 9188-568
Fax: +49 (9723) 9188-600

email: j.rottm...@fis-asp.de   web: www.fis-asp.de

Geschäftsführer Robert Schuhmann
Registergericht Schweinfurt HRB 3865

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting rgw bucket list

2015-09-02 Thread Sam Wouters

Thanks!
Playing around with max_keys in bucket listing retrieval sometimes gives
me results and sometimes not; this gives me a way to list the content until the bug
is fixed.
Is it possible somehow to copy the objects to a new bucket (with
versioning disabled) and rename the current one? I don't think the
latter is possible through the API, but maybe there is some hidden way ;-)


Could you also take a minute to confirm another versioning-related bug I
posted: http://tracker.ceph.com/issues/12819
If you could give me some pointers on contributing, I don't mind digging
into the code; I will gladly do so.


r,
Sam

On 01-09-15 22:37, Yehuda Sadeh-Weinraub wrote:

Yeah, I'm able to reproduce the issue. It is related to the fact that
you have a bunch of delete markers in the bucket, as it triggers some
bug there. I opened a new ceph issue for this one:

http://tracker.ceph.com/issues/12913

Thanks,
Yehuda

On Tue, Sep 1, 2015 at 11:39 AM, Sam Wouters  wrote:

Sorry, forgot to mention:

- yes, filtered by thread
- the "is not valid" line occurred when performing the bucket --check
- when doing a bucket listing, I also get an "is not valid", but on a
different object:
7fe4f1d5b700 20  cls/rgw/cls_rgw.cc:460: entry
abc_econtract/data/6scbrrlo4vttk72melewizj6n3[] is not valid

bilog entry for this object similar to the one below

r, Sam

On 01-09-15 20:30, Sam Wouters wrote:

Hi,

see inline

On 01-09-15 20:14, Yehuda Sadeh-Weinraub wrote:

I assume you filtered the log by thread? I don't see the response
messages. For the bucket check you can run radosgw-admin with
--log-to-stderr.

nothing is logged to the console when I do that

Can you also set 'debug objclass = 20' on the osds? You can do it by:

$ ceph tell osd.\* injectargs --debug-objclass 20

this continuously prints "20  cls/rgw/cls_rgw.cc:460: entry
abc_econtract/data/6smuz2ysavvxbygng34tgusyse[] is not valid" on osd.0

Also, it'd be interesting to get the following:

$ radosgw-admin bi list --bucket=
--object=abc_econtract/data/6shflrwbwwcm6dsemrpjit2li3v913iad1EZQ3.S6Prb-NXLvfQRlaWC5nBYp5

this gives me an empty array:
[
]
but we did a trim of the bilog a while ago cause a lot entries regarding
objects that were already removed from the bucket kept on syncing with
the sync agent, causing a lot of delete_markers at the replication site.

The object in the error above from the osd log, gives the following:
# radosgw-admin --log-to-stderr -n client.radosgw.be-east-1 bi list
--bucket=aws-cmis-prod
--object=abc_econtract/data/6smuz2ysavvxbygng34tgusyse
[
 {
 "type": "plain",
 "idx": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
 "entry": {
 "name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
 "instance": "",
 "ver": {
 "pool": -1,
 "epoch": 0
 },
 "locator": "",
 "exists": "false",
 "meta": {
 "category": 0,
 "size": 0,
 "mtime": "0.00",
 "etag": "",
 "owner": "",
 "owner_display_name": "",
 "content_type": "",
 "accounted_size": 0
 },
 "tag": "",
 "flags": 8,
 "pending_map": [],
 "versioned_epoch": 0
 }
 },
 {
 "type": "plain",
 "idx":
"abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse\uv913\uiRQZUR76UdeymR-PGaw6sbCHMCOcaovu",
 "entry": {
 "name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
 "instance": "RQZUR76UdeymR-PGaw6sbCHMCOcaovu",
 "ver": {
 "pool": 23,
 "epoch": 9680
 },
 "locator": "",
 "exists": "true",
 "meta": {
 "category": 1,
 "size": 103410,
 "mtime": "2015-08-07 17:57:32.00Z",
 "etag": "6c67f5e6cb4aa63f4fa26a3b94d19d3a",
 "owner": "aws-cmis-prod",
 "owner_display_name": "AWS-CMIS prod user",
 "content_type": "application\/pdf",
 "accounted_size": 103410
 },
 "tag": "be-east.34319.4520377",
 "flags": 3,
 "pending_map": [],
 "versioned_epoch": 2
 }
 },
 {
 "type": "instance",
 "idx":
"�1000_abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse\uiRQZUR76UdeymR-PGaw6sbCHMCOcaovu",
 "entry": {
 "name": "abc_econtract\/data\/6smuz2ysavvxbygng34tgusyse",
 "instance": "RQZUR76UdeymR-PGaw6sbCHMCOcaovu",
 "ver": {
 "pool": 23,
 "epoch": 9680
 },
 "locator": "",
 "exists": "true",
 "meta": {
 "category": 1,
 "size": 103410,
 "mtime": "2015-08-07 

Re: [ceph-users] CephFS with cache tiering - reading files are filled with 0s

2015-09-02 Thread Arthur Liu
Hi John and Zheng,

Thanks for the quick replies!
I'm using kernel 4.2. I'll test out that fix.

Arthur

On Wed, Sep 2, 2015 at 10:29 PM, Yan, Zheng  wrote:

> probably caused by http://tracker.ceph.com/issues/12551
>
> On Wed, Sep 2, 2015 at 7:57 PM, Arthur Liu  wrote:
> > Hi,
> >
> > I am experiencing an issue with CephFS with cache tiering where the
> kernel
> > clients are reading files filled entirely with 0s.
> >
> > The setup:
> > ceph 0.94.3
> > create cephfs_metadata replicated pool
> > create cephfs_data replicated pool
> > cephfs was created on the above two pools, populated with files, then:
> > create cephfs_ssd_cache replicated pool,
> > then adding the tiers:
> > ceph osd tier add cephfs_data cephfs_ssd_cache
> > ceph osd tier cache-mode cephfs_ssd_cache writeback
> > ceph osd tier set-overlay cephfs_data cephfs_ssd_cache
> >
> > While the cephfs_ssd_cache pool is empty, multiple kernel clients on
> > different hosts open the same file (the size of the file is small, <10k)
> at
> > approximately the same time. A number of the clients from the OS level
> see
> > the entire file being empty. I can do a rados -p {cache pool} ls for the
> > list of files cached, and do a rados -p {cache pool} get {object}
> /tmp/file
> > and see the complete contents of the file.
> > I can repeat this by setting cache-mode to forward, rados -p {cache pool}
> > cache-flush-evict-all, checking no more objects in cache with rados -p
> > {cache pool} ls, resetting cache-mode to writeback with an empty pool,
> and
> > doing the multiple same file opens.
> >
> > Has anyone seen this issue? It seems like what may be a race condition
> where
> > the object is not yet completely loaded into the cache pool so the cache
> > pool serves out an incomplete object.
> > If anyone can shed some light or any suggestions to help debug this
> issue,
> > that would be very helpful.
> >
> > Thanks,
> > Arthur
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Appending to an open file - O_APPEND flag

2015-09-02 Thread Janusz Borkowski
Hi!

Thanks for the explanation. The behaviour (overwriting) was puzzling and
suggested serious filesystem corruption. Now that we have identified the scenario, we
can try workarounds.

Regards,
J.

On 02.09.2015 11:50, Yan, Zheng wrote:
>> On Sep 2, 2015, at 17:11, Gregory Farnum  wrote:
>>
>> Whoops, forgot to add Zheng.
>>
>> On Wed, Sep 2, 2015 at 10:11 AM, Gregory Farnum  wrote:
>>> On Wed, Sep 2, 2015 at 10:00 AM, Janusz Borkowski
>>>  wrote:
 Hi!

 I mount cephfs using kernel client (3.10.0-229.11.1.el7.x86_64).

 The effect is the same when doing "echo >>" from another machine and from a
 machine keeping the file open.

 The file is opened with open( ..,
 O_WRONLY|O_LARGEFILE|O_APPEND|O_BINARY|O_CREAT)

 Shell ">>" is implemented as (from strace bash -c "echo '7789' >>
 /mnt/ceph/test):

open("/mnt/ceph/test", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3

 The test file had ~500KB size.

 Each subsequent "echo >>" writes to the start of the test file, first 
 "echo"
 overwriting the original contents, next "echos" overwriting bytes written 
 by
 the preceding "echo".
>>> Hmmm. The userspace (ie, ceph-fuse) implementation of this is a little
>>> bit racy but ought to work. I'm not as familiar with the kernel code
>>> but I'm not seeing any special behavior in the Ceph code — Zheng,
>>> would you expect this to work? It looks like some of the linux
>>> filesystems have their own O_APPEND handling and some don't, but I
>>> can't find it in the VFS either.
>>> -Greg
> Yes, the kernel client does not handle the case that multiple clients do 
> append write to the same file. I will fix it soon.
>
> Regards
> Yan, Zheng
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Corruption of file systems on RBD images

2015-09-02 Thread Mathieu GAUTHIER-LAFAYE
Hi All,

We regularly have trouble with virtual machines using RBD storage. When
we restart some virtual machines, they start to do filesystem checks.
Sometimes the check can rescue the filesystem; sometimes the virtual machine dies (Linux or Windows).

We moved from Firefly to Hammer last month. I don't know whether the problem
is in Ceph and is still there, or whether we continue to see symptoms of a Firefly bug.

We have two rooms in two separate buildings, so we set the replica size to 2.
I wonder whether that can cause this kind of problem during scrubbing operations. I
guess the recommended replica size is at least 3.

We use BTRFS for the OSDs with kernel 3.10. This was not strongly discouraged when
we started the deployment of Ceph last year. Now, it seems that the kernel
version should be 3.14 or later for this kind of setup.

Have other people already had similar problems? Do you think it's related
to our BTRFS setup? Is it the replica size of the pool?

Do you have any advice on finding the problem?

Best,
Mathieu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS with cache tiering - reading files are filled with 0s

2015-09-02 Thread Yan, Zheng
probably caused by http://tracker.ceph.com/issues/12551

On Wed, Sep 2, 2015 at 7:57 PM, Arthur Liu  wrote:
> Hi,
>
> I am experiencing an issue with CephFS with cache tiering where the kernel
> clients are reading files filled entirely with 0s.
>
> The setup:
> ceph 0.94.3
> create cephfs_metadata replicated pool
> create cephfs_data replicated pool
> cephfs was created on the above two pools, populated with files, then:
> create cephfs_ssd_cache replicated pool,
> then adding the tiers:
> ceph osd tier add cephfs_data cephfs_ssd_cache
> ceph osd tier cache-mode cephfs_ssd_cache writeback
> ceph osd tier set-overlay cephfs_data cephfs_ssd_cache
>
> While the cephfs_ssd_cache pool is empty, multiple kernel clients on
> different hosts open the same file (the size of the file is small, <10k) at
> approximately the same time. A number of the clients, at the OS level, see
> the entire file as empty. I can do a rados -p {cache pool} ls for the
> list of files cached, and do a rados -p {cache pool} get {object} /tmp/file
> and see the complete contents of the file.
> I can repeat this by setting cache-mode to forward, rados -p {cache pool}
> cache-flush-evict-all, checking no more objects in cache with rados -p
> {cache pool} ls, resetting cache-mode to writeback with an empty pool, and
> doing the multiple same file opens.
>
> Has anyone seen this issue? It seems like it may be a race condition where
> the object is not yet completely loaded into the cache pool, so the cache
> pool serves out an incomplete object.
> If anyone can shed some light or any suggestions to help debug this issue,
> that would be very helpful.
>
> Thanks,
> Arthur
>
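
(The forward/flush-evict cycle described above, in command form. A hedged sketch only: the pool name comes from the report, while the mount path, file path and object name are illustrative.)

ceph osd tier cache-mode cephfs_ssd_cache forward
rados -p cephfs_ssd_cache cache-flush-evict-all
rados -p cephfs_ssd_cache ls        # should come back empty
ceph osd tier cache-mode cephfs_ssd_cache writeback
# open the same small file from several kernel clients at once, then compare
# what a client sees with what the cache pool actually holds:
md5sum /mnt/cephfs/path/to/file
rados -p cephfs_ssd_cache get <object-name> /tmp/obj && md5sum /tmp/obj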


Re: [ceph-users] Is Ceph appropriate for small installations?

2015-09-02 Thread Janusz Borkowski
Hi!

Do you have replication factor 2?

To test recovery, e.g. kill one OSD process and observe when Ceph notices it and 
starts moving data. Reformat the OSD partition, remove the killed OSD from the 
cluster, then add a new OSD using the freshly formatted partition. When you 
have 3 OSDs again, observe when data migration finishes. Until then, the system 
will be loaded with recovery traffic.
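
(A rough command sketch of that procedure, assuming osd.2 with its data on /dev/sdb1; exact commands depend on your Ceph release, init system and deployment tool:)

service ceph stop osd.2            # or: systemctl stop ceph-osd@2, or kill the process
ceph osd out 2
ceph -w                            # watch Ceph notice the failure and start recovery
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2
mkfs.xfs -f /dev/sdb1              # reformat the old OSD partition
ceph-disk prepare /dev/sdb1        # or: ceph-deploy osd create <host>:/dev/sdb1
# then watch "ceph -w" again until all PGs are active+clean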

J.

On 02.09.2015 12:15, Marcin Przyczyna wrote:
> On 08/31/2015 09:39 AM, Wido den Hollander wrote:
>
>> True, but your performance is greatly impacted during recovery. So a
>> three node cluster might work well when the skies are clear and the sun
>> is shining, but it has a hard time dealing with a complete node failure.
> I will answer the question of "how tiny a cluster can be" with
> some performance data I collected a few days ago.
>
> My setup:
> - 3 armhf based servers (a sort of raspberrypi),
> - only one 100 mbit/s LAN for all sort of access,
> - 3 USB sticks with a dedicated 4 GB /dev/sda1 partition, plugged into each
> armhf server and formatted with xfs,
> - 3 MMC cards for the OS,
> - OSD journal on MMC, OSD data on USB stick,
> - 1 ordinary PC as client,
> - 3 OSDs, 3 MONs, 1 MDS.
> - debian8, 64bit everywhere
>
> I/O Performance test command based on
> "time dd if=/dev/zero of=./test bs=1024k count=1024"
> revealed:
>
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.1 GB) copied, 81.3964 s, 13.2 MB/s
>
> real    1m28.619s
> user    0m0.052s
> sys     0m1.724s
>
> Result:
> Creating a 1 GB zero-filled file takes ~1 min 30 s.
>
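(A hedged methodological aside: without a sync flag, dd may partly be measuring the client page cache rather than the cluster. A variant like the sketch below makes dd flush the data before reporting:)

time dd if=/dev/zero of=./test bs=1024k count=1024 conv=fdatasync
# or bypass the page cache entirely:
time dd if=/dev/zero of=./test bs=1024k count=1024 oflag=direct
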
> From my point of view it is possible to set poor's man
> fileserver cluster on home lan typical hardware
> (i.e. my switch is a soho DSL modem/router) with
> 3 low power servers powered by smartphone chips
> and 3 cheap USB sticks as data storage.
>
> It is not quick, but it works. It consumes tiny amount
> of electrical power (whole "farm" needs about 25W),
> it has no mechanical rotating parts.
> The cluster uses passive radiators only and
> produces almost no heat wave: no fans are needed at all.
> During very hot summer this year I did notice
> no temperature based failure at all.
>
> HW cost: about 800 euro.
> Hint: try to find a centera on EMC² webpage with that price :-)
>
> My question:
> how can I "damage" one of my OSDs in an intelligent way
> to test the cluster performance during recovery ?
>
> Cheers,
> Marcin.
>


Re: [ceph-users] [sepia] debian jessie repository ?

2015-09-02 Thread Alfredo Deza
As of yesterday we are ready to start providing Debian Jessie packages.
They will be present by default for the upcoming Ceph release (Infernalis).

For other releases (e.g. Firefly, Hammer, Giant) this means there will
be Jessie packages only for new versions.
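
(Once those packages are published, pulling them in should be a one-line apt source. A hedged sketch for a Hammer install; the exact host, path and suite name are assumptions and may differ:)

# /etc/apt/sources.list.d/ceph.list
deb http://download.ceph.com/debian-hammer/ jessie main
# then: apt-get update && apt-get install ceph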

Let me know if you have any questions.


Thanks!

Alfredo

On Mon, Aug 31, 2015 at 1:27 AM, Alexandre DERUMIER 
wrote:

> Hi,
>
> any news about adding an official debian jessie repository on ceph.com?
>
>
> The gitbuilder repository has been available for some weeks now
>
> http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/
>
>
> Is something blocking the release of the packages?
>
> Regards,
>
> Alexandre
>
>
>


Re: [ceph-users] testing a crush rule against an out osd

2015-09-02 Thread Sage Weil
On Wed, 2 Sep 2015, Dan van der Ster wrote:
> Hi all,
> 
> We just ran into a small problem where some PGs wouldn't backfill
> after an OSD was marked out. Here's the relevant crush rule; being a
> non-trivial example I'd like to test different permutations of the
> crush map (e.g. increasing choose_total_tries):
> 
> rule critical {
> ruleset 4
> type replicated
> min_size 1
> max_size 10
> step take 0513-R-0060
> step chooseleaf firstn 2 type ipservice
> step emit
> step take 0513-R-0050
> step chooseleaf firstn -2 type rack
> step emit
> }
> 
> Here's the osd tree:
>https://stikked.web.cern.ch/stikked/view/c284b6b2
> 
> The relevant pool has size=3. The problem was that when a single OSD
> in 0513-R-0060 was marked out then the rule above was only emitting 2
> OSDs for a few PGs, (the missing replica was always from 0513-R-0060).
> 
> Normally I use crushtool --test --show-mappings to test rules, but
> AFAICT it doesn't let you simulate an out osd, i.e. with reweight = 0.
> Any ideas how to test this situation without uploading a crushmap to a
> running cluster?

crushtool --test --weight  0 ...

sage
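
(For the rule above, that might look like the following. A hedged sketch: the map path and the id of the out OSD are illustrative, while ruleset 4 and --num-rep 3 come from the rule and pool size quoted above:)

ceph osd getcrushmap -o /tmp/crushmap
crushtool -i /tmp/crushmap --test --rule 4 --num-rep 3 \
    --weight 12 0 --show-mappings
# or print only the PGs that do not get the requested 3 OSDs:
crushtool -i /tmp/crushmap --test --rule 4 --num-rep 3 \
    --weight 12 0 --show-bad-mappings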


Re: [ceph-users] testing a crush rule against an out osd

2015-09-02 Thread Sage Weil
On Wed, 2 Sep 2015, Dan van der Ster wrote:
> On Wed, Sep 2, 2015 at 4:11 PM, Sage Weil  wrote:
> > On Wed, 2 Sep 2015, Dan van der Ster wrote:
> >> ...
> >> Normally I use crushtool --test --show-mappings to test rules, but
> >> AFAICT it doesn't let you simulate an out osd, i.e. with reweight = 0.
> >> Any ideas how to test this situation without uploading a crushmap to a
> >> running cluster?
> >
> > crushtool --test --weight  0 ...
> >
> 
> Oh thanks :)
> 
> I can't reproduce my real life issue with crushtool though. Still looking ...

osdmaptool has a --test-map-pg option that may be easier...

sage
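
(A hedged sketch of that approach; the PG id is illustrative:)

ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-pg 4.2a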


Re: [ceph-users] Corruption of file systems on RBD images

2015-09-02 Thread Lionel Bouton
Hi Mathieu,

On 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE wrote:
> Hi All,
>
> We regularly have trouble with virtual machines using RBD storage. When 
> we restart some virtual machines, they start to run filesystem checks. 
> Sometimes the check can rescue the filesystem, sometimes the virtual machine 
> dies (Linux or Windows).

What is the cause of death as reported by the VM? FS inconsistency?
Block device access timeout? ...

>
> We moved from Firefly to Hammer last month. I don't know whether the 
> problem is in Ceph and still there, or whether we are still seeing symptoms of a 
> Firefly bug.
>
> We have two rooms in two separate buildings, so we set the replica size to 2. 
> I wonder whether that can cause this kind of problem during scrubbing operations. 
> I understand the recommended replica size is at least 3.

Scrubbing is pretty harmless; deep scrubbing is another matter.
Simultaneous deep scrubs on the same OSD are a performance killer. It
seems the latest Ceph versions provide some way of limiting their impact on
performance (scrubs are done per PG, so 2 simultaneous scrubs can and
often do involve the same OSD, and I think there's a limit on scrubs per OSD
now). AFAIK Firefly doesn't have this (and it surely didn't when we were
confronted with the problem), so we developed our own deep scrub scheduler
to avoid involving the same OSD twice (in fact our scheduler tries to
interleave scrubs so that each OSD gets as much inactivity after a deep
scrub as possible before the next one). This helps a lot.
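
(A very rough sketch of how such external scheduling can be wired up. The commands are standard Ceph ones, the PG id is illustrative, and this is not the scheduler described above:)

ceph osd set nodeep-scrub          # stop Ceph from launching deep scrubs on its own
ceph pg dump pgs_brief             # list PGs and their acting OSDs
ceph pg deep-scrub 4.2a            # a cron job then triggers one chosen PG at a time,
                                   # spacing them so no OSD is hit twice in a row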

>
> We use BTRFS for the OSDs with kernel 3.10. This was not strongly discouraged 
> when we started deploying Ceph last year. Now it seems that the kernel 
> version should be 3.14 or later for this kind of setup.

See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
to upgrade.

We have a good deal of experience with Btrfs in production now. We had
to disable snapshots, make the journal NoCOW, disable autodefrag and
develop our own background defragmenter (which converts to zlib at the
same time it defragments for additional space savings). We currently use
kernel version 4.0.5 (we don't use any RAID level so we don't need 4.0.6
to get a fix for an online RAID level conversion bug) and I wouldn't use
anything less than 3.19.5. The results are pretty good, but Btrfs is
definitely not an out-of-the-box solution for Ceph.
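
(A hedged sketch of two of the tweaks mentioned above, with an illustrative OSD path. Note that chattr +C only takes effect on a file that is still empty, and the defragment/recompress pass below is a stand-in for the custom background defragmenter, not that tool itself:)

touch /var/lib/ceph/osd/ceph-12/journal      # a freshly created, still-empty journal file
chattr +C /var/lib/ceph/osd/ceph-12/journal  # mark it NoCOW
# occasional defragmentation with zlib recompression of the object store:
btrfs filesystem defragment -r -czlib /var/lib/ceph/osd/ceph-12/current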


>
> Has anyone already seen similar problems? Do you think it is 
> related to our BTRFS setup, or to the replica size of the pool?

It mainly depends on the answer to the first question above (is it a
corruption or a freezing problem?).

Lionel


Re: [ceph-users] testing a crush rule against an out osd

2015-09-02 Thread Dan van der Ster
On Wed, Sep 2, 2015 at 4:11 PM, Sage Weil  wrote:
> On Wed, 2 Sep 2015, Dan van der Ster wrote:
>> ...
>> Normally I use crushtool --test --show-mappings to test rules, but
>> AFAICT it doesn't let you simulate an out osd, i.e. with reweight = 0.
>> Any ideas how to test this situation without uploading a crushmap to a
>> running cluster?
>
> crushtool --test --weight  0 ...
>

Oh thanks :)

I can't reproduce my real life issue with crushtool though. Still looking ...

-- Dan


Re: [ceph-users] Ceph read / write : Terrible performance

2015-09-02 Thread Mark Nelson

On 09/02/2015 08:51 AM, Vickey Singh wrote:

Hello Ceph Experts

I have a strange problem: when I am reading from or writing to a Ceph pool,
it is not writing steadily. Please notice the cur MB/s column, which keeps going up and down.

--- Ceph Hammer 0.94.2
-- CentOS 6, kernel 2.6
-- Ceph cluster is healthy


You might find that CentOS7 gives you better performance.  In some cases 
we were seeing nearly 2X.





One interesting thing: whenever I start a rados bench command for read
or write, the CPU idle % drops to ~10 and the system load increases
dramatically.

Hardware

HP SL4540


Please make sure the controller is on the newest firmware.  There used 
to be a bug that would cause sequential write performance to bottleneck 
when writeback cache was enabled on the RAID controller.



32Core CPU
196G Memory
10G Network


Be sure to check the network too.  We've seen a lot of cases where folks 
have been burned by one of the NICs acting funky.
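
(A hedged sketch of a quick NIC/link sanity check between two OSD hosts; host and interface names are illustrative:)

iperf -s                              # on host-a
iperf -c host-a -t 30                 # on host-b; repeat for every pair, in both directions
ethtool -S eth0 | grep -iE 'err|drop' # look for interface errors and drops
ip -s link show eth0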




I don't think the hardware is the problem.

Please give me clues / pointers on how I should troubleshoot this problem.



# rados bench -p glance-test 60 write
  Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds
or 0 objects
  Object prefix: benchmark_data_pouta-s01.pouta.csc.fi_2173350
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  0   0 0 0 0 0 - 0
  1  1620 4 15.9916   0.12308   0.10001
  2  163721   41.984168   1.79104  0.827021
  3  166852   69.3122   124  0.084304  0.854829
  4  16   11498   97.9746   184   0.12285  0.614507
  5  16   188   172   137.568   296  0.210669  0.449784
  6  16   248   232   154.634   240  0.090418  0.390647
  7  16   305   289165.11   228  0.069769  0.347957
  8  16   331   315   157.471   104  0.0262470.3345
  9  16   361   345   153.306   120  0.082861  0.320711
 10  16   380   364   145.57576  0.027964  0.310004
 11  16   393   377   137.06752   3.73332  0.393318
 12  16   448   432   143.971   220  0.334664  0.415606
 13  16   476   460   141.508   112  0.271096  0.406574
 14  16   497   481   137.39984  0.257794  0.412006
 15  16   507   491   130.90640   1.49351  0.428057
 16  16   529   513   115.04288  0.399384   0.48009
 17  16   533   517   94.628616   5.50641  0.507804
 18  16   537   52183.40516   4.42682  0.549951
 19  16   538   52280.349 4   11.2052  0.570363
2015-09-02 09:26:18.398641min lat: 0.023851 max lat: 11.2052 avg lat:
0.570363
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 20  16   538   522   77.3611 0 -  0.570363
 21  16   540   524   74.8825 4   8.88847  0.591767
 22  16   542   526   72.5748 8   1.41627  0.593555
 23  16   543   527   70.2873 48.0856  0.607771
 24  16   555   539   69.567448  0.145199  0.781685
 25  16   560   544   68.0177201.4342  0.787017
 26  16   564   548   66.424116  0.451905   0.78765
 27  16   566   550   64.7055 8  0.611129  0.787898
 28  16   570   554   63.313816   2.51086  0.797067
 29  16   570   554   61.5549 0 -  0.797067
 30  16   572   556   60.1071 4   7.71382  0.830697
 31  16   577   561   59.051520   23.3501  0.916368
 32  16   590   574   58.870552  0.336684  0.956958
 33  16   591   575   57.4986 4   1.92811  0.958647
 34  16   591   575   56.0961 0 -  0.958647
 35  16   591   575   54.7603 0 -  0.958647
 36  16   597   581   54.0447 8  0.187351   1.00313
 37  16   625   609   52.8394   112   2.12256   1.09256
 38  16   631   61552.22724   1.57413   1.10206
 39  16   638   622   51.723228   4.41663   1.15086
2015-09-02 09:26:40.510623min lat: 0.023851 max lat: 27.6704 avg lat:
1.15657
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 40  16   652   636   51.810256  0.113345   1.15657
 41  16   682   666   53.1443   120  0.041251   1.17813
 42  16   685   669   52.339512  0.501285   1.17421
 43  15   690   675   51.795524   2.26605   1.18357
 44  16   728   712