Re: [PERFORM] filesystem performance with lots of files

2005-12-20 Thread Jim C. Nasby
On Tue, Dec 20, 2005 at 01:26:00PM +, David Roussel wrote:
> Note that you can do the tarring, zipping, copying and untarring 
> concurrently.  I can't remember the exact netcat command line options, 
> but it goes something like this:
> 
> Box1:
> tar czvf - myfiles/* | netcat myserver 12345
> 
> Box2:
> netcat -l -p 12345 | tar xzvf -

You can also use ssh... something like

tar -cf - blah/* | ssh machine tar -xf -
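The same no-temp-file pipe is easy to try locally before involving the
network; a sketch (the function name and paths here are illustrative,
not from the thread, and over the network the plain pipe is replaced by
the ssh or netcat invocations above):

```shell
# Stream a tar of src straight into an extracting tar in dst --
# no intermediate tarfile ever touches the disk.
stream_copy() {
  src=$1; dst=$2
  mkdir -p "$dst"
  tar -cf - -C "$src" . | tar -xf - -C "$dst"
}

# Over the network, the right-hand side of the pipe moves to the
# remote machine, e.g.: tar -cf - -C src . | ssh host 'tar -xf - -C dst'
```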
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com  work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] filesystem performance with lots of files

2005-12-20 Thread David Roussel




David Lang wrote:

  ext3 has an option to make searching directories faster (htree), but
enabling it kills performance when you create files. And this doesn't
help with large files.
  
  

The ReiserFS white paper talks about the data structure it uses to
store directories (some kind of tree), and says it's quick to both
read and write.  Don't forget that if you find ls slow, that could just
be ls itself, since it's ls, not the filesystem, that sorts the files
into alphabetical order.

> how long would it take to do a tar-ftp-untar cycle with no smarts

Note that you can do the tarring, zipping, copying and untarring
concurrently.  I can't remember the exact netcat command line options,
but it goes something like this:

Box1:
tar czvf - myfiles/* | netcat myserver 12345

Box2:
netcat -l -p 12345 | tar xzvf -

Not only do you gain from doing it all concurrently, but not writing a
temp file means that disk seeks are reduced too if you have a
single-spindle machine.

Also consider just copying the files onto a network mount.  It may not
be as fast as the above, but it will be faster than rsync, which has
high CPU usage and is thus not a good choice on a LAN.

Hmm, sorry this is not directly postgres anymore...

David




Re: [PERFORM] filesystem performance with lots of files

2005-12-02 Thread David Lang

On Fri, 2 Dec 2005, Qingqing Zhou wrote:



I don't have all the numbers readily available (and I didn't do all the
tests on every filesystem), but I found that even with only 1000
files/directory ext3 had some problems, and if you enabled dir_index some
functions would speed up, but writing lots of files would just collapse
(that was the 80 min run)



Interesting. I would suggest that testing a smaller number of bigger
files would be better if the target is a database performance
comparison. By smaller number, I mean 10^2 - 10^3; by bigger, I mean
file sizes from 8 kB to 1 GB (a PostgreSQL data file is at most this
size under a normal installation).


I agree. That round of tests was done on my system at home, and was in 
response to a friend who had rsync over a local LAN take > 10 hours for 
< 10 GB of data, but even so it generated some interesting info. I need 
to make a more controlled run at it though.



Let's take TPC-C as an example: if we get a TPC-C database of 500 files,
each at most 1 GB (PostgreSQL has this feature/limit in an ordinary
installation), then this gives us a 500 GB database, which is big enough
for your current configuration.

Regards,
Qingqing





Re: [PERFORM] filesystem performance with lots of files

2005-12-01 Thread Qingqing Zhou


On Fri, 2 Dec 2005, David Lang wrote:
>
> I don't have all the numbers readily available (and I didn't do all the
> tests on every filesystem), but I found that even with only 1000
> files/directory ext3 had some problems, and if you enabled dir_index some
> functions would speed up, but writing lots of files would just collapse
> (that was the 80 min run)
>

Interesting. I would suggest that testing a smaller number of bigger
files would be better if the target is a database performance
comparison. By smaller number, I mean 10^2 - 10^3; by bigger, I mean
file sizes from 8 kB to 1 GB (a PostgreSQL data file is at most this
size under a normal installation).

Let's take TPC-C as an example: if we get a TPC-C database of 500 files,
each at most 1 GB (PostgreSQL has this feature/limit in an ordinary
installation), then this gives us a 500 GB database, which is big enough
for your current configuration.

Regards,
Qingqing



Re: [PERFORM] filesystem performance with lots of files

2005-12-01 Thread David Lang

On Thu, 1 Dec 2005, Qingqing Zhou wrote:


"David Lang" <[EMAIL PROTECTED]> wrote


a few weeks ago I did a series of tests to compare different filesystems.
The test was for a different purpose, so the particulars are not what I
would do for testing aimed at postgres, but I think the data is relevant.
I saw major differences between different filesystems; I'll see about
re-running the tests to get a complete set of benchmarks in the next few
days. My tests had their times vary from 4 min to 80 min depending on the
filesystem in use (ext3 with dir_index posted the worst case). What testing
have other people done with different filesystems?



That's good ... what benchmarks did you use?


I was doing testing in the context of a requirement to sync over a million 
small files from one machine to another (rsync would take > 10 hours to do 
this over a 100Mb network), so I started with the question 'how long would 
it take to do a tar-ftp-untar cycle with no smarts?'. I created 1M x 1 kB 
files in a three-deep directory tree (10d/10d/10d/1000files) and was timing 
simple operations: 'time to copy the tree', 'time to create the tar', 'time 
to extract from the tar', and 'time to copy the tarfile' (a 1.6 GB file). I 
flushed the cache between each test with cat largefile >/dev/null (I know 
now that I should have unmounted and remounted between each test); source 
and destination were on different IDE controllers.
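For reference, the tree-building step can be scripted roughly like this
(a sketch, not the original script: the function name, parameters, and
the head -c trick are mine; the full 10/10/10/1000 run creates a
million files, so don't run it casually):

```shell
# Build a fanout x fanout x fanout directory tree with nfiles small
# files per leaf directory, as in the 10d/10d/10d/1000files layout.
make_tree() {
  root=$1; fanout=$2; nfiles=$3; size=$4
  for a in $(seq 1 "$fanout"); do
    for b in $(seq 1 "$fanout"); do
      for c in $(seq 1 "$fanout"); do
        d="$root/$a/$b/$c"
        mkdir -p "$d"
        for f in $(seq 1 "$nfiles"); do
          head -c "$size" /dev/zero > "$d/$f"  # one small file
        done
      done
    done
  done
}

# The run described above (one million 1 kB files), then time operations:
# make_tree /mnt/src 10 1000 1024
# time cp -r /mnt/src /mnt/dst
# time tar -cf /mnt/dst/tree.tar -C /mnt/src .
```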


I don't have all the numbers readily available (and I didn't do all the 
tests on every filesystem), but I found that even with only 1000 
files/directory ext3 had some problems, and if you enabled dir_index some 
functions would speed up, but writing lots of files would just collapse 
(that was the 80 min run)


I'll have to script it and re-do the tests (and when I do this I'll also 
set it up to do a test with far fewer, far larger files as well).


David Lang



Re: [PERFORM] filesystem performance with lots of files

2005-12-01 Thread Qingqing Zhou

"David Lang" <[EMAIL PROTECTED]> wrote
>
> a few weeks ago I did a series of tests to compare different filesystems. 
> The test was for a different purpose, so the particulars are not what I 
> would do for testing aimed at postgres, but I think the data is relevant. 
> I saw major differences between different filesystems; I'll see about 
> re-running the tests to get a complete set of benchmarks in the next few 
> days. My tests had their times vary from 4 min to 80 min depending on the 
> filesystem in use (ext3 with dir_index posted the worst case). What testing 
> have other people done with different filesystems?
>

That's good ... what benchmarks did you use?

Regards,
Qingqing 





[PERFORM] filesystem performance with lots of files

2005-12-01 Thread David Lang
this subject has come up a couple times just today (and it looks like one 
that keeps popping up).


under linux, ext2/3 have two known weaknesses (or rather one weakness with 
two manifestations): searching through large objects on disk is slow. This 
applies to both directories (creating, opening, or deleting files if there 
are (or have been) lots of files in a directory) and files (seeking to 
the right place in a file).


the rule of thumb that I have used for years is that if files get over a 
few tens of megs or directories get over a couple thousand entries you 
will start slowing down.


common places you can see this (outside of postgres):

1. directories: mail or news storage.
  if you let your /var/spool/mqueue directory get large (for example on a 
server that can't send mail for a while, or where mail got misconfigured), 
there may only be a few files in there after it gets fixed, but if the 
directory was once large, just doing an ls on the directory will be slow.


  news servers that store each message as a separate file suffer from this 
as well; they work around it by using multiple layers of nested 
directories so that no directory has too many files in it (navigating the 
layers of directories costs as well, it's all about the tradeoffs). Mail 
servers that use maildir (and Cyrus, which uses a similar scheme) have the 
same problem.


  to fix this you have to create a new directory, move the files to 
that directory, and then rename the new directory to the old name.
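That rebuild can be scripted roughly as follows (a sketch: the function
name is mine, dot-files aren't handled, and anything writing into the
directory must be stopped first):

```shell
# Shrink a once-huge directory: move the surviving files into a fresh
# directory (whose on-disk entry table starts small), then swap names.
rebuild_dir() {
  dir=$1
  mkdir "$dir.new" || return 1
  for f in "$dir"/*; do
    [ -e "$f" ] && mv "$f" "$dir.new"/   # few files left, so this is cheap
  done
  mv "$dir" "$dir.old" && mv "$dir.new" "$dir" && rmdir "$dir.old"
}
```

e.g. rebuild_dir /var/spool/mqueue once the backlog has drained.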


  ext3 has an option to make searching directories faster (htree), but 
enabling it kills performance when you create files. And this doesn't help 
with large files.


2. files: mbox-formatted mail files and log files.
  as these files get large, the process of appending to them takes more 
time. syslog makes this very easy to test. On a box that does synchronous 
syslog writing (the default for most systems using standard syslog; on 
linux make sure there is not a - in front of the logfile name), time how 
long it takes to write a bunch of syslog messages, then make the log file 
large and time it again.
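A rough way to see the effect without touching syslog itself (GNU dd
flags; the file sizes and message counts are illustrative, and
oflag=dsync stands in for syslog's synchronous writes):

```shell
# Time n synchronous appends of one log line each to file f.
sync_appends() {
  f=$1; n=$2; i=0
  while [ "$i" -lt "$n" ]; do
    printf 'Dec  1 12:00:00 host test: message %d\n' "$i" |
      dd of="$f" oflag=append,dsync conv=notrunc 2>/dev/null
    i=$((i + 1))
  done
}

: > /tmp/log.small                                          # empty log
dd if=/dev/zero of=/tmp/log.big bs=1M count=64 2>/dev/null  # pre-grown log
time sync_appends /tmp/log.small 100
time sync_appends /tmp/log.big 100
```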


a few weeks ago I did a series of tests to compare different filesystems. 
The test was for a different purpose, so the particulars are not what I 
would do for testing aimed at postgres, but I think the data is relevant. 
I saw major differences between different filesystems; I'll see about 
re-running the tests to get a complete set of benchmarks in the next few 
days. My tests had their times vary from 4 min to 80 min depending on the 
filesystem in use (ext3 with dir_index posted the worst case). What testing 
have other people done with different filesystems?


David Lang
