> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> 
> On Thu, 12 Jan 2012, Edward Ned Harvey wrote:
> > Suppose you have a 1G file open, and a snapshot of this file is on disk
> > from a previous point in time.
> > for ( i=0 ; i<1trillion ; i++ ) {
> >     seek(random integer in range[0 to 1G]);
> >     write(4k);
> > }
> >
> > Something like this would quickly try to write a bunch of separate and
> > scattered 4k blocks at different offsets within the file.  Every 32 of
> > these 4k writes would be write-coalesced into a single 128k on-disk block.
> >
> > Sometime later, you read the whole file sequentially such as cp or tar or
> > cat.  The first 4k come from this 128k block...  The next 4k come from
> > another 128k block...  The next 4k come from yet another 128k block...
> > Essentially, the file has become very fragmented and scattered about on
> > the physical disk.  Every 4k read results in a random disk seek.
> 
> Are you talking about some other filesystem or are you talking about
> zfs?  Because zfs does not work like that ...

In what way?  I've only described the behavior of COW and write coalescing.
Which part are you saying is un-ZFS-like?

Before answering, let's do some test work:

Create a new pool called "junk", with a single disk and no compression,
dedup, or anything else.

Run this script.  All it does is generate some data in a file sequentially,
and then randomly overwrite random pieces of the file, creating snapshots
all along the way, until the file has been completely overwritten many times
over.  This should be a fragmentation nightmare.
http://dl.dropbox.com/u/543241/fragmenter.py
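
I haven't pasted the script inline, but in outline it does roughly this.
This is a simplified sketch, not the actual fragmenter.py; the block size,
pass count, and snapshot names are guesses based on the description above:

#!/usr/bin/env python
# Simplified sketch of what fragmenter.py does.  The real script is at the
# URL above; sizes, pass counts, and snapshot names here are stand-ins.
import os
import random
import subprocess

POOL     = "junk"
PATH     = "/%s/out.txt" % POOL
BLOCK    = 4 * 1024                    # 4k random overwrites
FILESIZE = 2 * 1024 * 1024 * 1024      # 2 GB test file
NBLOCKS  = FILESIZE // BLOCK

def snapshot(name):
    subprocess.check_call(["zfs", "snapshot", "%s@%s" % (POOL, name)])

# 1. Write the file sequentially, and snapshot the pristine layout.
with open(PATH, "wb") as f:
    for _ in range(NBLOCKS):
        f.write(os.urandom(BLOCK))
snapshot("sequential-before")

# 2. Randomly overwrite 4k pieces, snapshotting after each pass, until the
#    file has been rewritten many times over.
with open(PATH, "r+b") as f:
    for i in range(1400):               # e.g. snapshots random0..random1399
        for _ in range(NBLOCKS // 100): # one slice of random overwrites
            f.seek(random.randrange(NBLOCKS) * BLOCK)
            f.write(os.urandom(BLOCK))
        snapshot("random%d" % i)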

Then reboot, to ensure the cache is clear, and see how long it takes to
sequentially read the original sequential file, as compared to the highly
fragmented one:
cat /junk/.zfs/snapshot/sequential-before/out.txt | pv > /dev/null
cat /junk/.zfs/snapshot/random1399/out.txt | pv > /dev/null

While I'm waiting for this to run, I'll make some predictions:
The file is 2GB (16 Gbit) and the disk reads around 1 Gbit/sec, so reading
the initial sequential file should take ~16 sec.
After fragmentation, it should be essentially random 4 kB fragments (32768
bits each).  I figure each time the head finds useful data, it takes 32us to
read the 4 kB, followed by a 10ms random access...  So the disk is doing
useful work 0.3% of the time and wasting 99.7% of the time on random seeks.
It should take about 300x longer to read the fragmented file.
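
The same estimate, spelled out in a few lines of Python (these are just the
assumed numbers from above, not measurements):

# Back-of-envelope estimate of sequential vs. fragmented read time,
# using the assumptions stated above (1 Gbit/sec transfer, 10ms seek).
FILE_BITS = 16e9                  # 2 GB file ~= 16 Gbit
DISK_RATE = 1e9                   # ~1 Gbit/sec sequential transfer
FRAG_BITS = 4 * 1024 * 8          # 4 kB fragment = 32768 bits
SEEK_TIME = 10e-3                 # assumed average random access, 10 ms

sequential_time = FILE_BITS / DISK_RATE                # ~16 sec
frags           = FILE_BITS / FRAG_BITS                # ~488k fragments
fragmented_time = frags * (SEEK_TIME + FRAG_BITS / DISK_RATE)
print(sequential_time)                                 # 16.0 sec
print(fragmented_time)                                 # ~4900 sec
print(fragmented_time / sequential_time)               # ~306x slower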

... (Ding!) ...  Test is done.  Thank you for patiently waiting during this
time warp.  ;-)

Actual result:  15s and 45s.  So it was 3x longer, not 300x.  Either way it
proves the point - but I want to see results that are at least 100x worse due
to fragmentation, to REALLY drive home that fragmentation matters.

I hypothesize that the mere 3x performance degradation is because I have
only a single 2G file in a 2T pool, with no other activity and no other
files.  So all my supposedly randomly distributed data might reside very
close together on the platter...  The combination of short-stroking & the
read prefetcher could be doing wonders in this case.  So now I'll repeat the
test, but this time, while the sequential data is written sequentially just
like before, once the random rewriting starts I'll run a couple of separate
threads writing and removing other junk in the pool, so the write coalescing
will interleave other files, spreading the data across a larger percentage
of the total disk and getting closer to the worst-case random distribution
on disk...
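
Those junk-writer threads aren't part of the linked script; a rough sketch
of what I mean (the pool path, file sizes, and counts are just placeholders):

#!/usr/bin/env python
# Minimal sketch of the "junk writer" threads run alongside fragmenter.py.
import os
import threading

POOL  = "/junk"
CHUNK = 128 * 1024              # write in 128k chunks
SIZE  = 64 * 1024 * 1024        # 64 MB throwaway files

def churn(tag, iterations=200):
    # Repeatedly create and delete a throwaway file, so its blocks get
    # write-coalesced in between the test file's blocks on disk.
    path = os.path.join(POOL, "churn-%s.dat" % tag)
    for _ in range(iterations):
        with open(path, "wb") as f:
            written = 0
            while written < SIZE:
                f.write(os.urandom(CHUNK))
                written += CHUNK
        os.remove(path)         # free the space so later writes land elsewhere

if __name__ == "__main__":
    threads = [threading.Thread(target=churn, args=(t,)) for t in ("a", "b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()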

(destroy & recreate the pool in between test runs...)

Actual result:  15s and 104s.  So it's only 6.9x performance degradation.
That's the worst I can do without hurting myself.  It proves the point, but
not to the magnitude that I expected.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
