On Tue, Apr 18, 2017 at 06:14:36PM +0200, Lars Schneider wrote:
> > Both Git and the filter are going to have to keep these paths in
> > memory somewhere, be that in-process, or on disk. That being said, I
> > can see potential troubles with a large number of long paths that
> > exceed the memory available to Git or the filter when stored in a
> > hashmap/set.
> >
> > On Git's side, I think trading that for some CPU time might make
> > sense. If Git were to SHA1 each path and store that in a hashmap, it
> > would consume more CPU time, but less memory to store each path. Git
> > and the filter could then exchange path names, and Git would simply
> > SHA1 the pathname each time it needed to refer back to memory
> > associated with that entry in a hashmap.
>
> I would be surprised if this would be necessary. If the filter delays
> 50,000 files (= a lot!) with a path length of 1,000 characters (= very
> long!), then we would use 50MB plus some hashmap data structures.
> Modern machines should have enough RAM, I would think...

I agree, and thanks for correcting my thinking here. I ran a simple command to
get the longest pathnames in a large repository:

  $ find . -type f | awk '{ print length($0) }' | sort -r -n | uniq -c

It found a few files close to the 200-character mark as the longest pathnames
in the repository, so 50k files at 1k bytes per pathname leaves quite enough
head-room :-).
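
For reference, here's a rough, self-contained sketch of the hashed-key
approach quoted above, in case memory ever does become a concern. It is
purely illustrative: it doesn't use Git's actual hashmap API, the structure
and helper names are made up, and it leans on OpenSSL's SHA1() only to
produce a digest (build with -lcrypto):

  /*
   * Illustrative only: keep a fixed 20-byte SHA-1 of each pathname as
   * the key instead of the pathname itself.  Lookups re-hash the path,
   * so the full string never has to stay in memory.
   */
  #include <stdio.h>
  #include <string.h>
  #include <openssl/sha.h>

  #define MAX_DELAYED 50000

  struct delayed_entry {
          unsigned char key[SHA_DIGEST_LENGTH]; /* SHA-1 of the pathname */
          /* per-path state for the delayed filter would go here */
  };

  static struct delayed_entry entries[MAX_DELAYED];
  static int nr_entries;

  static void hash_path(const char *path, unsigned char *out)
  {
          SHA1((const unsigned char *)path, strlen(path), out);
  }

  /* Linear scan for brevity; a real table would bucket on the digest. */
  static struct delayed_entry *lookup_path(const char *path)
  {
          unsigned char key[SHA_DIGEST_LENGTH];
          int i;

          hash_path(path, key);
          for (i = 0; i < nr_entries; i++)
                  if (!memcmp(entries[i].key, key, SHA_DIGEST_LENGTH))
                          return &entries[i];
          return NULL;
  }

  static void add_path(const char *path)
  {
          if (nr_entries < MAX_DELAYED)
                  hash_path(path, entries[nr_entries++].key);
  }

  int main(void)
  {
          add_path("some/rather/long/path/to/a/file.txt");
          printf("delayed: %s\n",
                 lookup_path("some/rather/long/path/to/a/file.txt") ? "yes" : "no");
          return 0;
  }

Each entry then costs a fixed 20 bytes of key regardless of path length, at
the price of re-hashing the pathname on every lookup.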


--
Thanks,
Taylor Blau
