Jussi's Was: info point on linux hdr

2000-04-21 Thread Benno Senoner

I tested Jussi's preallocation scheme
(lseeking in 4096-byte increments + writing 4 bytes at the beginning of each block):
30 secs instead of the 80 secs for plain write()s, even when writing 1MB at a time.
That means it is more than twice as fast.

But I still have my doubts that all blocks really get allocated:
for example I can do

ftruncate(fd, filesize);
lseek(fd, filesize - 4, SEEK_SET);
write(fd, &value, 4);

and get an empty file of length filesize,
but these blocks aren't really allocated, since doing this takes a fraction
of a second.
When you cat such a file, the disk barely lights up
and the command returns very fast (allocate-on-write behaviour?).
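
A quick way to verify this kind of thing is to compare st_size with
st_blocks from stat(): a file full of holes reports far fewer allocated
blocks than its nominal length. A small sketch (the helper name is just
illustrative; this is not part of the benchmark code):

#include <stdio.h>
#include <sys/stat.h>

/* illustrative helper: st_blocks counts 512-byte units on Linux, so a
   sparse file reports far fewer allocated bytes than st_size */
void report_allocation(const char *path)
{
  struct stat st;

  if (stat(path, &st) < 0) {
    perror("stat");
    return;
  }
  printf("%s: size=%ld allocated=%ld blksize=%ld\n",
         path, (long) st.st_size, (long) st.st_blocks * 512l,
         (long) st.st_blksize);
}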

Although st_blksize reports 4096, the fragment size seems to be 1024;
is this related to the preallocation issue?

If the FS always did its IO in 4096-byte blocks,
why would the lseek method be much faster, since basically both methods
would touch the same amount of data?

If the granules were 1024 bytes, then the above figures would make more sense:
since the first method lseeks in 4096-byte steps, it would allocate only
1 block out of every 4 (of 1024-byte size).

And my hypothesis seems to be right!

I tried various increments (4096, 2048, 1024, 512) with Jussi's method:

results:
with 4096: 30 secs
with 2048: 51 secs
with 1024: 82 secs
with  512: 83 secs

(plain write()s of 256KB data blocks take about 80 secs)

BINGO !

That means the allocation "granule" seems to be 1024 bytes long,
since as you can see there is no difference between the 1024- and 512-byte
increments, which lets me assume that in those cases ALL blocks are touched,
whereas the 4096 case touches only 25% of the blocks.

This supports Stephen's claim that the only way to preallocate ALL blocks
of a file is to write() the data.
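
For reference, a minimal sketch of that write()-based approach (the function
name, chunk size and error handling are mine, not from Stephen):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* preallocate by writing real zeroes over the whole range, so that
   every block is truly allocated on disk */
int prealloc_by_write(const char *path, long size)
{
  char buf[65536];
  long done = 0, chunk;
  int fd = open(path, O_WRONLY|O_CREAT|O_TRUNC, 0644);

  if (fd < 0)
    return -1;
  memset(buf, 0, sizeof(buf));
  while (done < size) {
    chunk = size - done;
    if (chunk > (long) sizeof(buf))
      chunk = (long) sizeof(buf);
    if (write(fd, buf, chunk) != chunk) {
      close(fd);
      return -1;
    }
    done += chunk;
  }
  fsync(fd);
  close(fd);
  return 0;
}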

Jussi: IMHO you are using too small file sizes to get reliable results:
I tried with 300-600MB, and all numbers are consistent with my findings above.

Comments?

PS: at this point I am not really interested in preallocation, since it
degrades write performance in my multitrack test cases.

Benno.

On Fri, 21 Apr 2000, Jussi Laako wrote:
> Paul Barton-Davis wrote:
> > 
> > also, in the code you show, the ftruncate is redundant, because it
> > doesn't allocate any blocks, and the point of preallocation (if there
> > is one) is to get the blocks allocated in a certain way.
> 
> It depends on where in the block we write the data. If we write the data
> to the start of the block, without ftruncate() we get file size
> 
> filesize = requested_file_size - blocksize + write_data_size
> 
> The allocated block count is the same...
> 
> I got a little better preallocation performance with
> ftruncate()/lseek()/write() compared to write()s of full blocks.
> 
> Test source follows
> 
> --- 8< ---
> 
> #include <stdio.h>
> #include <string.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/time.h>
> #include <time.h>
> 
> #define TEST_FILE_SIZE  33554432l
> #define TEST_BLOCK_SIZE 4096l
> 
> 
> double ConvTime(const struct timeval *spTime)
> {
> double dTime;
> 
> dTime = (double) spTime->tv_sec + (double) spTime->tv_usec / 1000000.0;
> return dTime;
> }
> 
> 
> int main()
> {
> int iHandle;
> long lFilePos;
> double dStartTime;
> #ifdef TEST_WRITE_ONLY
> long lWriteRes;
> char cpData[TEST_BLOCK_SIZE];
> #endif
> struct timeval sCurrentTime;
> 
> iHandle = open("test.file", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);
> if (iHandle < 0)
> {
> printf("open() failed: %s\n", strerror(errno));
> return 1;
> }
> gettimeofday(&sCurrentTime, NULL);
> dStartTime = ConvTime(&sCurrentTime);
> #ifndef TEST_WRITE_ONLY
> if (ftruncate(iHandle, TEST_FILE_SIZE) < 0)
> {
> printf("ftruncate() failed: %s\n", strerror(errno));
> return 1;
> }
> for (lFilePos = 0l; 
> lFilePos < TEST_FILE_SIZE; 
> lFilePos += TEST_BLOCK_SIZE)
> {
> if (lseek(iHandle, lFilePos, SEEK_SET) < lFilePos)
> {
> printf("lseek() failed: %s\n", strerror(errno));
> return 1;
> }
> if (write(iHandle, &lFilePos, sizeof(long)) < (ssize_t) sizeof(long))
> {
> printf("write() failed: %s\n", strerror(errno));
> return 1;
> }
> }
> #else
> lFilePos = 0l;
> while (lFilePos < TEST_FILE_SIZE)
> {
> lWriteRes = write(iHandle, cpData, TEST_BLOCK_SIZE);
> if (lWriteRes < TEST_BLOCK_SIZE)
> {
> printf("write() failed: %s\n", strerror(errno));
> return 1;
> }
> lFilePos += lWriteRes;
> }
> #endif
> fsync(iHandle);
> close(iHandle);
> gettimeofday(&sCurrentTime, NULL);
> printf("Operation took %.2f seconds\n", 
> ConvTime(&sCurrentTime) - dStartTime);
> return 0;
> }
> 
> --- 8< ---
> 
>  - Jussi Laako
> 
> -- 
> PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B  39DD A4DE 63EB C216 1E4B
> Available at: ldap://certserver.pgp.com, http://keys.pgp.com:11

Re: [linux-audio-dev] more preallocation vs no prealloc / async vs sync tests.

2000-04-21 Thread Benno Senoner

On Fri, 21 Apr 2000, Paul Barton-Davis wrote:
> >PS: Paul, run it on your 10k rpm SCSI disk so that we can do some comparison.
> 
> I hope you are ready for some *very* different numbers.
> 
> /tmp/hdtest 500 async trunc
> SINGLE THREADED: 12.788 MByte/sec
> MULTI-THREADED: 12.788 MByte/sec
> 
> /tmp/hdtest 500 async notrunc
> SINGLE THREADED: 6.096 MByte/sec
> MULTI-THREADED:  6.168 MByte/sec
> 
> /tmp/hdtest 150 sync trunc
> SINGLE THREADED: 11.292 MByte/sec
> MULTI-THREADED: 12.233 MByte/sec
> 
> /tmp/hdtest 150 sync notrunc
> SINGLE THREADED: 5.437 MByte/sec
> MULTI-THREADED:  6.383 MByte/sec
> 
> A few notes.
> 
> In the source you sent, you are not doing 256kB writes, but 1MB
> writes, since you defined MYSIZE as (262144*4). This is puzzling.
> However, changing it to 256kB doesn't change the results in any
> significant way, as far as I can tell.

Yes, you are right, I forgot to remove that *4 before sending the mail.
I actually experimented with writing 1-2MB at a time rather than the default
256KB, but the performance is basically the same; you gain a few %
at most, nothing relevant.
Therefore I think 256KB is actually the best tradeoff between buffer size and
speed.

> 
> It troubles me that the ongoing rate display is always significantly
> higher than the eventual effective speed. I understand the reason for
> the initially very high rate, but I typically see final rates from the
> ongoing display that are very much higher than in your effective rate
> display (e.g. 13MB/sec versus 5.5MB/sec, 20MB/sec versus 12MB/sec).
> I don't have the time to stare at the source and figure out why this
> is.

I think the effective speed doesn't reflect the real speed (it's too low),
since I am calling sync() at the end of the writes and then taking the
elapsed time of the whole process.
I added this to avoid the write() loop finishing with much data
still in the buffer cache instead of on the disk.

I think the best way to get reasonable numbers is:
- use a test size that is at least 2-3 times your RAM, so that the
cache doesn't distort the results.
- use the last number of the ongoing rate display as the "effective" average
data transfer rate.
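
Roughly, the timing logic looks like this (an illustrative skeleton, not
the actual hdtest source):

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* wall-clock time in seconds, with microsecond resolution */
double now_seconds(void)
{
  struct timeval tv;

  gettimeofday(&tv, NULL);
  return (double) tv.tv_sec + (double) tv.tv_usec / 1000000.0;
}

/* usage:
     t0 = now_seconds();
     ... write loop ...
     sync();   (flush the buffer cache before stopping the clock)
     rate = total_bytes / (now_seconds() - t0) / (1024.0*1024.0);
*/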

Anyway, I ran the test on the RAID box again, this time with 256KB writes,
and I got the same performance as before (24-26MB/sec); therefore, as you
pointed out, 256KB is a quite ideal IO size.

It's amazing that you got 12MB/sec in SYNC mode;
I wonder if it's your SCSI disk and/or the 2.3.x kernel.
I am guessing a combination of both since, as Stephen said, the
SCSI driver performs much better when issuing lots of requests.

I wasn't prepared for such fast O_SYNC results; can you please rerun the
test using 500MB as the test size (and take the last MB/sec value of the
ongoing rate display as the final result)?
If you really get the same performance as in async mode, then
it probably makes more sense to use O_SYNC in ardour on SCSI boxes,
since you get more predictable buffer cache usage.
On EIDE, unfortunately, we have to forget O_SYNC.

> 
> It's very interesting that writing to pre-allocated files is 50%
> slower for me. This is even though your pre-allocation strategy causes
> block-interleaving of the files. I suspect, but at this time cannot
> prove, that this is due (in my case at least) to fs fragmentation. I
> will try the benchmark on a clean 18GB disk the next time I'm over at
> the studio.

Notice that I even tried to allocate the files in a non-interleaved fashion
(by creating 20 separate files in sequence), which gave me basically the same
performance as creating the files in interleaved mode.

But again, in your case it may be different; the only way to know is to test
it.
(Just create 20 files with dd, named outfile0, outfile1, etc., for example each
of size 25MB, and then run hdtest with the notrunc parameter.)

If you find interesting results, let us know.

> 
> Stephen Tweedie or someone else would know the answer to my last
> question: I am wondering if contiguous allocation of fs blocks to a
> file reduces the amount of metadata updating ? Does metadata belong to
> a fixed-sized unit, or an inode, or a variable-sized unit, or some
> combination ? I ask this because I see some visual indication of the
> disk stalls you have talked about when running your hdtest program (it
> may just be paging issues, however - hard to tell), and I still have
> not seen them in ardour. Assuming for a second that these are real
> stalls, one obvious difference is that your preallocation strategy
> does not produce contiguous files.
> 
> --p

I can't tell exactly here, but I guess your supposition may be true.
But again, an answer from the filesystem gurus would be nice.

Benno.



Re: [linux-audio-dev] more preallocation vs no prealloc / async vs sync tests.

2000-04-21 Thread Paul Barton-Davis

>./hdtest 500 async trunc
>SINGLE THREADED:  5.856 MByte/sec 
>MULTI-THREADED:  6.096 MByte/sec
>
>./hdtest 500 async notrunc (rewrite to preallocated files)
>SINGLE THREADED: 4.040 MByte/sec
>MULTI-THREADED: 4.766 MByte/sec
>
>./hdtest 150 sync trunc 
>
>SINGLE THREADED: 1.442 MByte/sec
>MULTI-THREADED: 0.121 MByte/sec   (floppy-like performance :-)  )
>
>./hdtest 150 sync notrunc
>SINGLE THREADED:  4.788 MByte/sec
>MULTI-THREADED: 1.984 MByte/sec

>PS: Paul, run it on your 10k rpm SCSI disk so that we can do some comparison.

I hope you are ready for some *very* different numbers.

/tmp/hdtest 500 async trunc
SINGLE THREADED: 12.788 MByte/sec
MULTI-THREADED: 12.788 MByte/sec

/tmp/hdtest 500 async notrunc
SINGLE THREADED: 6.096 MByte/sec
MULTI-THREADED:  6.168 MByte/sec

/tmp/hdtest 150 sync trunc
SINGLE THREADED: 11.292 MByte/sec
MULTI-THREADED: 12.233 MByte/sec

/tmp/hdtest 150 sync notrunc
SINGLE THREADED: 5.437 MByte/sec
MULTI-THREADED:  6.383 MByte/sec

A few notes.

In the source you sent, you are not doing 256kB writes, but 1MB
writes, since you defined MYSIZE as (262144*4). This is puzzling.
However, changing it to 256kB doesn't change the results in any
significant way, as far as I can tell.

It troubles me that the ongoing rate display is always significantly
higher than the eventual effective speed. I understand the reason for
the initially very high rate, but I typically see final rates from the
ongoing display that are very much higher than in your effective rate
display (e.g. 13MB/sec versus 5.5MB/sec, 20MB/sec versus 12MB/sec).
I don't have the time to stare at the source and figure out why this
is.

It's very interesting that writing to pre-allocated files is 50%
slower for me. This is even though your pre-allocation strategy causes
block-interleaving of the files. I suspect, but at this time cannot
prove, that this is due (in my case at least) to fs fragmentation. I
will try the benchmark on a clean 18GB disk the next time I'm over at
the studio.

Stephen Tweedie or someone else would know the answer to my last
question: I am wondering if contiguous allocation of fs blocks to a
file reduces the amount of metadata updating ? Does metadata belong to
a fixed-sized unit, or an inode, or a variable-sized unit, or some
combination ? I ask this because I see some visual indication of the
disk stalls you have talked about when running your hdtest program (it
may just be paging issues, however - hard to tell), and I still have
not seen them in ardour. Assuming for a second that these are real
stalls, one obvious difference is that your preallocation strategy
does not produce contiguous files.

--p

more preallocation vs no prealloc / async vs sync tests.

2000-04-21 Thread Benno Senoner


Hi, more real-world performance numbers for writing vs rewriting (without
O_TRUNC) files:

I enhanced hdtest.c a bit to recognize the "notrunc" flag, which opens the
files without O_TRUNC (you have to run a first test with "trunc" in order to
create the files, which are created in an interleaved fashion, in 256KB blocks;
I even tried creating them in a linear fashion, writing the entire file1 first,
then file2, etc., but I see no big performance difference between the two modes)
(testbox PII400, 256MB RAM, IBM 16GB EIDE UDMA 5400rpm, kernel 2.2.12)


./hdtest 500 async trunc
SINGLE THREADED:  5.856 MByte/sec 
MULTI-THREADED:  6.096 MByte/sec

./hdtest 500 async notrunc (rewrite to preallocated files)
SINGLE THREADED: 4.040 MByte/sec
MULTI-THREADED: 4.766 MByte/sec

./hdtest 150 sync trunc 

SINGLE THREADED: 1.442 MByte/sec
MULTI-THREADED: 0.121 MByte/sec   (floppy-like performance :-)  )

./hdtest 150 sync notrunc
SINGLE THREADED:  4.788 MByte/sec
MULTI-THREADED: 1.984 MByte/sec


Interesting that preallocation speeds up O_SYNC mode quite a bit,
although multithreading is not usable in this mode.

Does anyone know what happens here?

PS: Paul, run it on your 10k rpm SCSI disk so that we can do some comparison.

Benno.



/* hdtest.c
   small async vs O_SYNC / single vs multithreaded disk writing benchmark
   by Benno Senoner ([EMAIL PROTECTED])

   the program writes NUMFILES (=20) files simultaneously

   compile with :  gcc -O2 -o hdtest hdtest.c -lpthread
   run with: ./hdtest TOTAL_OUTPUT_MEGABYTES [sync|async] [trunc|notrunc]

   for example: ./hdtest 500  writes a total amount of 500MB of data
 
   the 2nd argument can be either sync (opens with O_SYNC) or async;
   for O_SYNC output give 'sync' as 2nd argument ( ./hdtest 500 sync )
   default is async

   the 3rd argument can be either trunc (opens with O_TRUNC) or notrunc;
   default is trunc.
   If you want to run with notrunc, run a session with trunc first
   or create the files by hand.

*/


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <time.h>

// IO OUTPUT SIZE (default 256KB; note: the *4 below makes it 1MB, as discussed in the thread)
#define MYSIZE (262144*4)

#define NUMFILES 20

#define NUMLOOPS (TOTAL_WRITE_SIZE/(MYSIZE*NUMFILES))

void *writer_thread(void *data);

pthread_mutex_t my_mutex = PTHREAD_MUTEX_INITIALIZER; 
int num_active_threads;
int written_bytes;

time_t time1,time2;
char *buf;

int TOTAL_WRITE_SIZE;

void print_status(void);

int main(int argc, char **argv) 
{
  int i,u;
  int res;
  int retcode;
  int counter=0;
  int open_flags;

  pthread_t my_thread[NUMFILES];

  char filename[200];
  int fds[NUMFILES];

  buf=(char *)malloc(MYSIZE);
  for(i=0;i<MYSIZE;i++) buf[i]=0;

  open_flags=O_WRONLY|O_CREAT|O_TRUNC;  /* reconstructed: the archive stripped
                                           this section along with the '<' */
  TOTAL_WRITE_SIZE=1024*1024*100;       /* assumed default when no size given */

  if(argc>=2) {
TOTAL_WRITE_SIZE=1024*1024*atoi(argv[1]);
  }
  TOTAL_WRITE_SIZE=MYSIZE*NUMFILES*NUMLOOPS;
  printf("TOTAL WRITE SIZE=%d\n",TOTAL_WRITE_SIZE);
 

  if(argc >=3) {
if(!strcmp(argv[2],"sync")) {
  open_flags |= O_SYNC;
  printf("opening in files in O_SYNC mode\n");
}
  }

  if(argc >=4 ) {
if(!strcmp(argv[3],"notrunc")) {
  open_flags &= ~O_TRUNC;
  printf("opening files without O_TRUNC (rewrite mode)\n");
}
  }
  

  printf("opening files");
  for(i=0;i<NUMFILES;i++) {
    /* ... the rest of the source (file opening, thread startup and the
       ongoing rate display) was truncated in the archive ... */


Re: [linux-audio-dev] Re: File writes with O_SYNC slow

2000-04-21 Thread Andrea Arcangeli

On Thu, 20 Apr 2000, Stephen C. Tweedie wrote:

>much difference, and it may be the stick we need to beat Linus into
>believing that this change is really quite important.

The change is really quite important, and it will become important again
(as it was in 2.2.x) once raid5 again breaks the layering for performance.

Andrea