Hi,

While not necessarily CephFS specific, we frequently end up with objects
that have inconsistent omaps. This looks like a replication issue:
anecdotally it's a replica that ends up diverging, and at least a few
times it has happened after the OSD holding that replica was restarted.
(I had hoped http://tracker.ceph.com/issues/17177 would fix this, but it
doesn't appear to have solved it completely.)
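
For anyone hitting the same thing, the affected objects can be pulled out
of the deep-scrub reports. Something along these lines (a rough sketch
only; it assumes the Jewel-era "rados list-inconsistent-pg" /
"list-inconsistent-obj" JSON output, which may differ on other releases,
and "cephfs_metadata" is just a placeholder pool name):

import json
import subprocess

# Sketch: find objects whose scrub errors mention omap.  The PGs need a
# deep scrub first for the reports to exist.

def rados_json(*args):
    out = subprocess.check_output(("rados",) + args + ("--format=json",))
    return json.loads(out)

def omap_inconsistencies(pool):
    for pgid in rados_json("list-inconsistent-pg", pool):
        report = rados_json("list-inconsistent-obj", pgid)
        for item in report.get("inconsistents", []):
            errors = item.get("errors", []) + item.get("union_shard_errors", [])
            if any("omap" in err for err in errors):
                yield pgid, item.get("object", {}).get("name"), errors

for pgid, name, errors in omap_inconsistencies("cephfs_metadata"):
    print("%s %s %s" % (pgid, name, ",".join(errors)))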

We also have one workload we'd need to re-engineer to be a good fit for
CephFS: we create a lot of hard links with no clear "origin" file, which
is slightly at odds with the hard link implementation. If I understand
correctly, unlink moves the inode from the directory tree into the stray
directories and decrements the link count; if the link count hits zero
it's purged, otherwise it stays there until another link to it is
encountered and it gets re-integrated into the tree. This netted us
hilariously large stray directories, which, combined with the omap issue
above, was less than ideal.
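
The stray counts are at least easy to keep an eye on via the MDS admin
socket. A minimal sketch (assuming a local socket and the counter names
as they show up under "mds_cache" on Jewel, which may differ on other
releases; "a" is a placeholder MDS name):

import json
import subprocess

# Sketch: dump the stray-related counters from a local MDS admin socket.

def stray_counters(mds_name):
    out = subprocess.check_output(
        ["ceph", "daemon", "mds." + mds_name, "perf", "dump"])
    cache = json.loads(out).get("mds_cache", {})
    return {k: v for k, v in cache.items() if "stray" in k}

for key, value in sorted(stray_counters("a").items()):
    print("%-25s %s" % (key, value))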

Beyond that, there have been other small(-ish) bugs we've encountered,
but they've been solvable by cherry-picking fixes, upgrading, or doing
surgery with the available tools, guided by the internet and/or an
approximate understanding of how it's all supposed to work.

-KJ

On Wed, Jul 19, 2017 at 11:20 AM, Brady Deetz <bde...@gmail.com> wrote:

> Thanks Greg. I thought it was impossible when I reported 34MB for 52
> million files.
>
> On Jul 19, 2017 1:17 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:
>
>>
>>
>> On Wed, Jul 19, 2017 at 10:25 AM David <dclistsli...@gmail.com> wrote:
>>
>>> On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite <
>>> blair.bethwa...@gmail.com> wrote:
>>>
>>>> We are a data-intensive university, with an increasingly large fleet
>>>> of scientific instruments capturing various types of data (mostly
>>>> imaging of one kind or another). That data typically needs to be
>>>> stored, protected, managed, shared, connected/moved to specialised
>>>> compute for analysis. Given the large variety of use-cases we are
>>>> being somewhat more circumspect in our CephFS adoption and really only
>>>> dipping toes in the water, ultimately hoping it will become a
>>>> long-term default NAS choice from Luminous onwards.
>>>>
>>>> On 18 July 2017 at 15:21, Brady Deetz <bde...@gmail.com> wrote:
>>>> > All of that said, you could also consider using rbd and zfs or
>>>> whatever filesystem you like. That would allow you to gain the benefits of
>>>> scaleout while still getting a feature rich fs. But, there are some down
>>>> sides to that architecture too.
>>>>
>>>> We do this today (KVMs with a couple of large RBDs attached via
>>>> librbd+QEMU/KVM), but the throughput achievable this way is nothing
>>>> like native CephFS, and adding more RBDs doesn't seem to help overall
>>>> throughput. Also, if you have NFS clients you will absolutely need an
>>>> SSD ZIL. And of course you then have a single point of failure and
>>>> downtime for regular updates etc.
>>>>
>>>> In terms of small file performance I'm interested to hear about
>>>> experiences with in-line file storage on the MDS.
>>>>
>>>> Also, while we're talking about CephFS - what size metadata pools are
>>>> people seeing on their production systems with 10s-100s millions of
>>>> files?
>>>>
>>>
>>> On a system with 10.1 million files, metadata pool is 60MB
>>>
>>>
>> Unfortunately that's not really an accurate assessment, for good but
>> terrible reasons:
>> 1) CephFS metadata is principally stored via the omap interface (which is
>> designed for handling things like the directory storage CephFS needs)
>> 2) omap is implemented via LevelDB/RocksDB
>> 3) there is not a good way to determine which pool is responsible for
>> which portion of RocksDB's data
>> 4) So the pool stats do not incorporate omap data usage at all in their
>> reports (it's part of the overall space used, and is one of the things that
>> can make that larger than the sum of the per-pool spaces)
>>
>> You could try and estimate it by looking at how much "lost" space there
>> is (and subtracting out journal sizes and things, depending on setup). But
>> I promise there's more than 60MB of CephFS metadata for 10.1 million files!
>> -Greg
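
A rough way to do the estimate Greg describes above - a sketch only,
assuming the Jewel-era "ceph df --format json" field names
("total_used_bytes", per-pool "bytes_used"), which may differ on other
releases; replication overhead and OSD journals also land in this
number, so treat it as an upper bound at best:

import json
import subprocess

# Sketch: raw space used minus the sum of per-pool usage, i.e. space not
# attributed to any pool (omap data, journals, replication overhead...).

def unaccounted_bytes():
    df = json.loads(subprocess.check_output(["ceph", "df", "--format", "json"]))
    total_used = df["stats"]["total_used_bytes"]
    pool_used = sum(p["stats"]["bytes_used"] for p in df["pools"])
    return total_used - pool_used

print("space not attributed to any pool: %.1f GiB"
      % (unaccounted_bytes() / 2.0 ** 30))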


-- 
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com