On Thu, Dec 13, 2018 at 3:31 PM Bryan Henderson <bry...@giraffe-data.com>
wrote:

> I've searched the ceph-users archives and found essentially no
> discussion of Cephfs block sizes, and I wonder how much people have
> thought about it.
>
> The POSIX 'stat' system call reports a block size for each file,
> usually defined vaguely as the smallest read or write size that is
> efficient.  It usually takes into account that small writes may
> require a read-modify-write and that reads from backing storage may
> have a minimum size.
>
> One thing that uses this information is the stream I/O implementation
> (fopen/fclose/fread/fwrite) in GNU libc.  It always reads, and usually
> writes, full blocks, buffering as necessary.
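The interaction described above is easy to see from Python, whose buffered
open() sizes its buffer from st_blksize the same way glibc's fopen() does
(a minimal sketch; the 4 MiB figure for CephFS comes from the default
layout mentioned below):

```python
import os

# stat() reports a preferred I/O block size (st_blksize); buffered
# stream I/O layers (glibc's fopen/fread, Python's open()) size their
# internal read/write buffers from this value.
st = os.stat(".")
print("st_blksize:", st.st_blksize)

# On most local filesystems this prints 4096.  On a CephFS mount with
# the default layout it would instead report the 4 MiB stripe unit,
# so every buffered stream on such a file allocates a 4 MiB buffer.
```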
>
> Most filesystems report this number as 4K.
>
> Ceph reports the stripe unit (stripe column size), which is the
> maximum size of the RADOS objects that back the file.  This is 4M by
> default.
>
> One result of this is that a program uses a thousand times more
> buffer space when running against a Ceph file than against a
> traditional filesystem.
>
> And a really pernicious result occurs when you have a special file in
> Cephfs.  Block size doesn't make any sense at all for special files,
> and it's probably a bad idea to use stream I/O to read one, but I've
> seen it done.  The Chrony clock synchronizer programs use fread to
> read random numbers from /dev/urandom.  Should /dev/urandom be in a
> Cephfs filesystem, with defaults, it's going to generate 4M of random
> bits to satisfy a 4-byte request.  On one of my computers, that takes
> 7 seconds - and wipes out the entropy pool.
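The /dev/urandom scenario can be reproduced without CephFS by wrapping a
fake raw stream in a 4 MiB buffer and watching how much data a 4-byte
buffered read actually pulls through.  CountingRaw is a made-up helper for
illustration, and the 4 MiB buffer size stands in for CephFS's default
st_blksize:

```python
import io

class CountingRaw(io.RawIOBase):
    """Fake raw stream that records how many bytes each read() asks for."""
    def __init__(self):
        self.requests = []
    def readable(self):
        return True
    def readinto(self, b):
        self.requests.append(len(b))
        b[:] = b"\x00" * len(b)   # pretend to be an inexhaustible /dev/urandom
        return len(b)

# Wrap the raw stream in a 4 MiB buffer, the way a stream opened on a
# CephFS file with the default 4 MiB st_blksize would be buffered.
raw = CountingRaw()
buf = io.BufferedReader(raw, buffer_size=4 * 1024 * 1024)

data = buf.read(4)                 # the caller only wants 4 bytes...
print(len(data), raw.requests[0])  # ...but the buffer layer asks the
                                   # raw stream for the full 4 MiB
```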
>
> Has stat block size been discussed much?  Is there a good reason that
> it's the RADOS object size?
>
> I'm thinking of modifying the cephfs filesystem driver to add a mount
> option to specify a fixed block size to be reported for all files,
> using 4K or 64K.  Would that break something?
>

I remember this being a huge pain in the butt for a variety of reasons.
Going back through the logs, though, it looks like the main reason we
use a 4MiB block size is so that we have a chance of reporting actual
cluster sizes to 32-bit systems, so a mount option to change it should
work fine as long as there aren't any shortcuts in the code.  (Given
that we've previously switched from 4KiB to 4MiB, I wouldn't expect
that to be a problem.)  My main worry is that we need to make sure the
block size is appropriate for anybody using EC data pools, which may be
a little more complicated than a simple 4KiB or 64KiB setting.
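The 32-bit reporting constraint is straightforward arithmetic: on those
systems the statfs block counters are 32-bit, so the largest capacity a
filesystem can report is block_size * 2**32.  A quick sketch (the 32-bit
counter width is the assumption here):

```python
# With 32-bit block counters in statfs, the maximum reportable
# capacity is block_size * 2**32: a 4 KiB block size caps reporting
# at 16 TiB, while 4 MiB raises the cap to 16 PiB.
for bsize in (4 * 1024, 4 * 1024 * 1024):
    max_bytes = bsize * 2**32
    print(f"{bsize // 1024:>5} KiB blocks -> {max_bytes // 2**40} TiB reportable")
```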

It was kind of fun switching, though, since it revealed a lot of
ecosystem tools assuming the FS's block size was the same as the page
size. :D
-Greg



>
> --
> Bryan Henderson                                   San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
