Vincent Lefevre wrote:
> Filesystem created:       Mon Jan  4 16:26:16 2010
> 
> My machine is old, but I've never changed anything concerning the
> file system.

2010 isn't that old.  Just a baby!  :-)  After a quick look on my
network I located these.

  desolation: Filesystem created:       Tue Feb 26 13:46:27 2008
  despair: Filesystem created:       Thu Oct 11 17:58:10 2007
  devastation: Filesystem created:       Tue Feb 26 15:31:37 2008
  thrill: Filesystem created:       Sun Jun  3 14:50:55 2007

I have been rolling over systems, otherwise I am sure I would have
located older ones.  If I powered up some archived systems I could
almost certainly produce older dates.
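
For anyone who wants to check their own: on ext2/3/4 that timestamp
is reported by tune2fs.  The device name below is just a placeholder
for whatever device holds your filesystem.

    # tune2fs -l /dev/sda1 | grep 'Filesystem created'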

> I also notice slowness with a large maildir directory:
> 
> drwx------ 2 vlefevre vlefevre 8409088 2015-03-24 14:04:33 Mail/oldarc/cur/
> 
> In this one, the files are real (145400 files), but I have a Perl
> script that basically reads the headers and it takes a lot of time
> (several dozens of minutes) after a reboot or dropping the caches
> as you suggested above. With a second run of this script, it just
> takes 8 seconds.

This is going to be at least two different points of slowness.  One is
reading the directory itself.  Two is that simply opening 145400 files
and reading the mail header from each of them is going to take a
while.  Opening that many files has a quantifiable cost.  Try this
experiment to separate the two: drop the caches, then cache the
directory and the inodes without opening any of the files:

    # echo 3 > /proc/sys/vm/drop_caches
    # ls -lR Mail/oldarc/cur >/dev/null

Then run your perl script:

  $ time yourperlscript Mail/oldarc/cur
  $ time yourperlscript Mail/oldarc/cur

Divide 145400 files by the elapsed seconds of the first, uncached run
and you can quantify the files-per-second performance of opening and
reading the mail headers uncached.  Then repeat with the second run to
determine the cached performance.
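
For example, if that first uncached run took 30 minutes then it would
be 145400 / 1800, or about 80 files per second.  (The 30 minutes is
only an illustrative number, not a prediction.)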

That is a lot of files!  I expect it will take a while.  But the
second, cached run should be much better, as long as you have enough
file system buffer cache to hold those blocks in memory.
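
A quick way to eyeball how much memory the system is using for cache:

    $ free -m

The cached figure should grow to cover those blocks after the first
run.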

It would also be interesting to convert the Maildir with 145400 files
into a single compressed mbox file.  (The conversion will escape
"^From " lines in message bodies, if that is a concern for you.)  I
expect that if you were to modify your perl script to read the
compressed mbox file and do the same task it might be faster!  It
would remove the overhead of opening each of those 145400 files.  It
all depends upon the distribution of message body sizes, since the
script would then need to read and skip over the message bodies.
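
A minimal sketch of one way to do that conversion, using formail from
the procmail package (formail forces each message into mbox format,
including the "^From " escaping mentioned above); the output file name
is just an example:

    $ for f in Mail/oldarc/cur/*; do formail <"$f"; done | gzip > oldarc.mbox.gz

Spawning formail once per message is slow, but it is a one-time
conversion, and because the glob is expanded inside the shell loop
there is no argument list length limit to worry about.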

But let's say that all of the bodies were small, less than 50k.  Then
I expect that converting them to a single mbox file would make reading
them much faster than the individual files.  Also, compressing the
file reduces the amount of I/O needed to pull the data into memory.
With today's fast CPUs decompression is faster than disk I/O, and in
my experience reading and decompressing a compressed file is usually
faster than reading the uncompressed data.  Every case is different,
however.  If you run that experiment I would be interested in knowing
the result.
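
One way to test the compressed read in isolation, again after dropping
the caches so the data really comes from disk:

    # echo 3 > /proc/sys/vm/drop_caches
    $ time zcat oldarc.mbox.gz > /dev/null

Compare that against timing cat on the equivalent uncompressed file
and you will see which side wins on your hardware.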

Bob
