> On Oct 19, 2019, at 4:03 PM, Michał Górny <mgo...@gentoo.org> wrote:
>
> On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 9:10 PM, Richard Yao <r...@gentoo.org> wrote:
>>>
>>>
>>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <mgo...@gentoo.org> wrote:
>>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <mgo...@gentoo.org> wrote:
>>>>>>>>> Hi, everybody.
>>>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>>>>>> to a new distfile mirror layout. Users will be switching to the new
>>>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>>>> already -- as their caches expire (24hrs).
>>>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>>>> having 60000+ files in a single directory has been a problem.
>>>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>>>> hardly usable due to its size.
>>>>> This sounds like a filesystem issue. Do we know which filesystems are
>>>>> suffering?
>>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many
>>>>> files. ext4 is probably okay, but don’t quote me on that.
>>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>>> may apply only to older NTFS versions. NFS has been mentioned too.
>>>
>>> ext2 and vfat are not surprises to me (outside of the idea that anyone
>>> would use them for a mirror). NTFS and NFS are though.
>>>> However, just because modern filesystems can handle them efficiently, it
>>>> doesn't mean having directories that huge comes with zero cost.
>>> While I am okay with the change, what do you mean when you say that having
>>> huge directories does not come with zero cost?
>>>
>>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by
>>> this, but the impact should be negligible. Filesystems with O(log n)
>>> directory lookups would see faster directory lookups.
>>>
>>> Outside of directory lookups, this could speed up searches and sort
>>> operations when listing everything, with just about any filesystem
>>> benefiting from the improvement.
>>>
>>> Listing directories on such filesystems should not benefit from this unless
>>> you are using ls where the default behavior is to sort the directory
>>> contents (which is where the improvement when sorting comes into play). The
>>> need to sort the directory contents by default keeps ls from displaying
>>> anything until it has scanned the entire directory. The asymptotic
>>> complexity of a fast comparison-based sort improves in this situation from
>>> O(n log n) to O(n log(n/b)), where b is the number of subdirectories,
>>> provided that the files are spread roughly evenly and you sort each
>>> subdirectory independently. A further speedup could be obtained by using
>>> multithreading to parallelize the sort operations.
>> I read your original email late at night and I misread the description of
>> how this works.
>>
>> At first glance, I thought we were doing a prefix approach (with the
>> caveat that buckets are unbalanced). In reality, we are doing a
>> cryptographic hash of the filenames.
>>
>> That would keep all buckets balanced, which gives the best directory lookup
>> times on O(log n) lookup filesystems, but I think there is something to be
>> gained from using the less optimal approach of using filename prefixes:
>>
>> * some regex searches on distfiles can be accelerated
>> * generating a sorted list of all distfiles becomes asymptotically faster
>> * it is easy for a user to find all versions of a given distfile
>> * no need to calculate a cryptographic hash
>>
>> I realize that I am late to propose it, but could we consider a switch to
>> this alternative arrangement?
>
> No, we can't. Please read either the original discussion on the bug, or
> the linked article. It's explained in detail why this won't work.
Alright. I am convinced. Thanks.
>
> --
> Best regards,
> Michał Górny
>
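
For reference, the difference between the two bucketing schemes discussed
above can be sketched roughly as follows. This is only an illustration, not
Portage's implementation: the choice of BLAKE2b, the two-hex-character bucket
name, the one-character prefix, the function names and the sample file names
are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def hash_bucket(filename: str, hex_chars: int = 2) -> str:
    """Bucket name taken from the leading hex digits of a hash of the filename.

    BLAKE2b and the two-character width are assumptions for this sketch.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_chars]


def prefix_bucket(filename: str, chars: int = 1) -> str:
    """Bucket name taken from the leading characters of the filename itself."""
    return filename[:chars].lower()


def bucket_sizes(names, bucket_fn) -> Counter:
    """Count how many files land in each bucket under the given scheme."""
    return Counter(bucket_fn(name) for name in names)


# Hypothetical distfile names: many versions of one package, few of another.
names = [f"python-{i}.tar.xz" for i in range(500)]
names += [f"zlib-{i}.tar.gz" for i in range(12)]

by_prefix = bucket_sizes(names, prefix_bucket)
by_hash = bucket_sizes(names, hash_bucket)

print("largest prefix bucket:", max(by_prefix.values()))
print("largest hash bucket:  ", max(by_hash.values()))
```

With the prefix scheme every version of a package lands in the same
directory, which makes versions easy to find but lets popular prefixes grow
without bound. With the hash scheme the files are spread across buckets
regardless of their names.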