On 07/28/2014 11:28 AM, Steve Anthony wrote:
While searching for more information I happened across the following
post (http://dachary.org/?p=2961) which vaguely resembled the symptoms
I've been experiencing. I ran tcpdump and noticed what appeared to be a
high number of retransmissions on the host where the images are mounted
during a read from a Ceph rbd, so I ran iperf3 to get some concrete numbers:

Very interesting that you are seeing retransmissions.


Server: nas4 (where rbd images are mapped)
Client: ceph2 (currently not in the cluster, but configured identically to the 
other nodes)

Start server on nas4:
iperf3 -s

On ceph2, connect to server nas4, send 4096MB of data, report on 1 second 
intervals. Add -R to reverse the client/server roles.
iperf3 -c nas4 -n 4096M -i 1

Summary of traffic going out the 1Gb interface to a switch

[ ID] Interval           Transfer     Bandwidth       Retr
[  5]   0.00-36.53  sec  4.00 GBytes   941 Mbits/sec   15      sender
[  5]   0.00-36.53  sec  4.00 GBytes   940 Mbits/sec           receiver

Reversed, summary of traffic going over the fabric extender

[ ID] Interval           Transfer     Bandwidth       Retr
[  5]   0.00-80.84  sec  4.00 GBytes   425 Mbits/sec  30756    sender
[  5]   0.00-80.84  sec  4.00 GBytes   425 Mbits/sec           receiver

Definitely looks suspect!



It appears that the issue is related to the network topology employed.
The private cluster network and nas4's public interface are both
connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a
Nexus 7000. This was meant as a temporary solution until our network
team could finalize their design and bring up the Nexus 6001 for the
cluster. From what our network guys have said, the FEX has been much
more limited than they anticipated and they haven't been pleased with it
as a solution in general. The 6001 is supposed to be ready this week, so
once it's online I'll move the cluster to that switch and re-test to see
if this fixes the issues I've been experiencing.

If it's not the hardware, one other thing you might want to test is whether it's something similar to the TCP autotuning issues we used to see. I don't think this should be an issue at this point given the code changes we made to address it, but it would be easy to rule out. It doesn't seem like it should show up with simple iperf tests, though, so the hardware is probably the better theory.

http://www.spinics.net/lists/ceph-devel/msg05049.html
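
If you do want to rule it out anyway, one quick check (from memory, so
double-check the option name against your release; this is only a sketch)
is to pin the messenger's TCP receive buffer in ceph.conf so the kernel's
autotuning can't get involved, then rerun the tests with and without it:

[global]
ms tcp rcvbuf = 262144    # example value; the default (0) leaves autotuning in control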


-Steve

On 07/24/2014 05:59 PM, Steve Anthony wrote:
Thanks for the information!

Based on my reading of http://ceph.com/docs/next/rbd/rbd-config-ref I
was under the impression that rbd cache options wouldn't apply, since
presumably the kernel is handling the caching. I'll have to toggle some
of those values and see if they make a difference in my setup.

I did some additional testing today. If I limit the write benchmark to 1
concurrent operation I see a lower bandwidth number, as I expected.
However, when writing to the XFS filesystem on an rbd image I see
transfer rates closer to 400MB/s.

# rados -p bench bench 300 write --no-cleanup -t 1

Total time run:         300.105945
Total writes made:      1992
Write size:             4194304
Bandwidth (MB/sec):     26.551

Stddev Bandwidth:       5.69114
Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency:        0.15065
Stddev Latency:         0.0732024
Max latency:            0.617945
Min latency:            0.097339

# time cp -a /mnt/local/climate /mnt/ceph_test1

real    2m11.083s
user    0m0.440s
sys    1m11.632s

# du -h --max-depth=1 /mnt/local
53G    /mnt/local/climate
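
(For reference: 53GB in roughly 131 seconds works out to a bit over
400MB/s, which is where the figure above comes from.)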

This seems to imply that there is more than one concurrent operation
when writing into the filesystem on top of the rbd image. However, given
that the filesystem read speeds and the rados benchmark read speeds are
much closer in reported bandwidth, it's as if reads are occurring as a
single operation.

# time cp -a /mnt/ceph_test2/isos /mnt/local/

real    36m2.129s
user    0m1.572s
sys    3m23.404s

# du -h --max-depth=1 /mnt/ceph_test2/
68G    /mnt/ceph_test2/isos
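
(For reference: 68GB in roughly 36 minutes, about 2160 seconds, is only
around 32MB/s, i.e. right in line with the single-threaded rados bench
read numbers.)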

Is this apparent single-thread read and multi-thread write with the rbd
kernel module the expected mode of operation? If so, could someone
explain the reason for this limitation?
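
One thing I still want to test, assuming the image is mapped as /dev/rbd0
on this client (adjust to the real device), is whether raising the
block-layer read-ahead helps, since a larger read-ahead should let the
kernel keep more than one object-sized read in flight:

# show the current read-ahead, in KB
cat /sys/block/rbd0/queue/read_ahead_kb
# bump it to 4MB (matching the object size) and rerun the cp test
echo 4096 > /sys/block/rbd0/queue/read_ahead_kb

The 4096 value is just a first guess on my part.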

Based on the information on data striping in
http://ceph.com/docs/next/architecture/#data-striping I would assume
that a format 1 image would stripe a file larger than the 4MB object
size over multiple objects and that those objects would be distributed
over multiple OSDs. This would seem to indicate that reading a file back
would be much faster since even though Ceph is only reading the primary
replica, the read is still distributed over multiple OSDs. At worst I
would expect something near the read bandwidth of a single OSD, which
would still be much higher than 30-40MB/s.
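
The object size itself is easy to double-check with rbd info (the pool
and image names below are placeholders for mine):

# rbd info bench/test_image

The "order" field in the output is the object size as a power of two, so
order 22 means 2^22 bytes, i.e. 4MB objects.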

-Steve

On 07/24/2014 04:07 PM, Udo Lembke wrote:

Hi Steve,
I'm also looking for improvements in single-threaded reads.

Somewhat higher values (maybe twice as high?) should be possible with your config.
I have 5 nodes with 60 4TB HDDs and got the following:
rados -p test bench -b 4194304 60 seq -t 1 --no-cleanup
Total time run:        60.066934
Total reads made:     863
Read size:            4194304
Bandwidth (MB/sec):    57.469
Average Latency:       0.0695964
Max latency:           0.434677
Min latency:           0.016444

In my case I had some OSDs (XFS) with high fragmentation (20%).
Changing the mount options and defragmenting helped slightly.
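
To check the fragmentation level, something like the following should work
(the device and mount point are only examples, adjust to your layout):

xfs_db -c frag -r /dev/sdb1
xfs_fsr -v /var/lib/ceph/osd/ceph-0

xfs_db reports the fragmentation factor read-only; xfs_fsr does the online
defragmentation, and I would run it against one OSD at a time.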
Performance changes:

[client]
rbd cache = true
rbd cache writethrough until flush = true

[osd]
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
osd_op_threads = 4
osd_disk_threads = 4


But I expect much more speed for a single thread...

Udo

On 23.07.2014 22:13, Steve Anthony wrote:

Ah, ok. That makes sense. With one concurrent operation I see numbers
more in line with the read speeds I'm seeing from the filesystems on the
rbd images.

# rados -p bench bench 300 seq --no-cleanup -t 1
Total time run:        300.114589
Total reads made:     2795
Read size:            4194304
Bandwidth (MB/sec):    37.252

Average Latency:       0.10737
Max latency:           0.968115
Min latency:           0.039754

# rados -p bench bench 300 rand --no-cleanup -t 1
Total time run:        300.164208
Total reads made:     2996
Read size:            4194304
Bandwidth (MB/sec):    39.925

Average Latency:       0.100183
Max latency:           1.04772
Min latency:           0.039584

I really wish I could find my data on read speeds from a couple weeks
ago. It's possible that they've always been in this range, but I
remember one of my test users saturating his 1GbE link over NFS while
copying from the rbd client to his workstation. Of course, it's also
possible that the data set he was using was cached in RAM when he was
testing, masking the lower rbd speeds.
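
Next time I'll rule that out by dropping the page cache on the client and
timing an uncached read of a single large file, something like this (the
file name is just an example):

# on the rbd client
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/ceph_test2/isos/example.iso of=/dev/null bs=4M iflag=direct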

It just seems counterintuitive to me that read speeds would be so much
slower than writes at the filesystem layer in practice. With images in
the 10-100TB range, reading data at 20-60MB/s isn't going to be
pleasant. Can you suggest any tunables or other approaches to
investigate to improve these speeds, or are they in line with what you'd
expect? Thanks for your help!

-Steve



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



