Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-11 Thread sensille
Andrey Kuzmin wrote:
 On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling
 richard.ell...@gmail.com wrote:
 
 On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:
 
  Andrey Kuzmin wrote:
  Well, I'm more accustomed to sequential vs. random, but YMMV.
  As to 67000 512-byte writes (this sounds suspiciously close to
  32MB fitting into cache), did you have write-back enabled?
 
  It's a sustained number, so it shouldn't matter.
 
 That is only 34 MB/sec.  The disk can do better for sequential writes.
 
 Note: in ZFS, such writes will be coalesced into 128KB chunks.
 
 
 So this is just 256 IOPS in the controller, not 64K.

No, it's 67k ops; it was a completely ZFS-free test setup, and iostat also
confirmed the numbers.

--Arne

 
 Regards,
 Andrey
  
 
  -- richard
 
 --
 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
 http://nexenta-rotterdam.eventbrite.com/
 
 
 
 
 
 
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Depth of Scrub

2010-06-05 Thread sensille

David Dyer-Bennet wrote:

But what about the parity? Obviously it has to be checked, but I can't find
any indications for it in the literature. The man page only states that the
data is being checksummed and only if that fails the redundancy is being used.
Please tell me I'm wrong ;)


I believe you're wrong.  Scrub checks all the blocks used by ZFS,
regardless of what's in them.  (It doesn't check free blocks.)



Thanks for all the reassurances :) I don't know why I had doubts; of course
it would be pointless if it didn't check all redundant data.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Depth of Scrub

2010-06-04 Thread sensille
Hi,

I have a small question about the depth of scrub in a raidz/raidz2/raidz3
configuration. I'm quite sure scrub does not check spares or unused areas of
the disks (though it could check whether the disks detect any errors there).
But what about the parity? Obviously it has to be checked, but I can't find
any indications for it in the literature. The man page only states that the
data is being checksummed and only if that fails the redundancy is being used.
Please tell me I'm wrong ;)

But what I'm really targeting with my question: how much coverage can be
reached with a find | xargs wc compared to a scrub? It misses the snapshots,
but does it miss anything beyond that?
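
To make concrete what kind of sweep I mean, roughly this (a minimal Python
sketch; /tank is just an example mountpoint, not my setup):

import os

# Rough equivalent of "find | xargs wc": read every regular file once so
# the checksums of all referenced data blocks get verified on read.
# It only exercises the normal read path of the mounted filesystem (reads
# may even be served from cache), so at minimum it skips snapshots; what
# else it skips compared to a scrub is exactly my question.
def sweep(mountpoint="/tank"):        # example path only
    errors = []
    for dirpath, _, filenames in os.walk(mountpoint):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(1 << 20):    # 1 MiB at a time, data discarded
                        pass
            except OSError as e:
                errors.append((path, e))
    return errors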

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot destroy ... dataset already exists

2010-06-02 Thread sensille
Is the pool mounted? I ran into this problem frequently until I set the
mountpoint to legacy. It may be that I also had to destroy the filesystem
afterwards, but since I stopped mounting the backup target, everything runs
smoothly. Nevertheless, I agree it would be nice to find the root cause of this.

--
Arne

Edward Ned Harvey wrote:
 This is the problem:
 
 [r...@nasbackup backup-scripts]# zfs destroy
 storagepool/nas-lyricp...@nasbackup-2010-05-14-15-56-30
 
 cannot destroy
 'storagepool/nas-lyricp...@nasbackup-2010-05-14-15-56-30': dataset
 already exists
 
  
 
 This is apparently a common problem.  It's happened to me twice already,
 and the third time now.  Each time it happens, it's on the backup
 server, so fortunately, I have total freedom to do whatever I want,
 including destroy the pool.
 
  
 
 The previous two times, I googled around, basically only found "destroy
 the pool" as a solution, and I destroyed the pool.
 
  
 
 This time, I would like to dedicate a little bit of time and resource to
 finding the cause of the problem, so hopefully this can be fixed for
 future users, including myself.  This time I also found "apply updates
 and repeat your attempt to destroy the snapshot" ...  So I applied
 updates, and repeated.  But no improvement.  The OS was sol 10u6, but
 now it’s fully updated.  Problem persists.
 
  
 
 I’ve also tried exporting and importing the pool.
 
  
 
 Somebody on the Internet suspected the problem is somehow aftermath of
 killing a zfs send or receive.  This is distinctly possible, as I’m
 sure that’s happened on my systems.  But there is currently no send or
 receive being killed ... Any such occurrence is long since past, and
 even beyond reboots and such.
 
  
 
 I do not use clones.  There are no clones of this snapshot anywhere, and
 there never have been.
 
  
 
 I do have other snapshots, which were incrementally received based on
 this one.  But that shouldn't matter, right?
 
  
 
 I have not yet called support, although we do have a support contract. 
 
  
 
 Any suggestions?
 
  
 
 FYI:
 
  
 
 [r...@nasbackup backup-scripts]# zfs list
 
 NAME  USED  AVAIL  REFER  MOUNTPOINT
 rpool  19.3G  126G  34K  /rpool
 rpool/ROOT  16.3G  126G  21K  legacy
 rpool/ROOT/nasbackup_slash  16.3G  126G  16.3G  /
 rpool/dump  1.00G  126G  1.00G  -
 rpool/swap  2.00G  127G  1.08G  -
 storagepool  1.28T  4.06T  34.4K  /storage
 storagepool/nas-lyricpool  1.27T  4.06T  1.13T  /storage/nas-lyricpool
 storagepool/nas-lyricp...@nasbackup-2010-05-14-15-56-30  94.1G  -  1.07T  -
 storagepool/nas-lyricp...@daily-2010-06-01-00-00-00  0  -  1.13T  -
 storagepool/nas-rpool-ROOT-nas_slash  8.65G  4.06T  8.65G  /storage/nas-rpool-ROOT-nas_slash
 storagepool/nas-rpool-root-nas_sl...@daily-2010-06-01-00-00-00  0  -  8.65G  -
 zfs-external1  1.13T  670G  24K  /zfs-external1
 zfs-external1/nas-lyricpool  1.12T  670G  1.12T  /zfs-external1/nas-lyricpool
 zfs-external1/nas-lyricp...@daily-2010-06-01-00-00-00  0  -  1.12T  -
 zfs-external1/nas-rpool-ROOT-nas_slash  8.60G  670G  8.60G  /zfs-external1/nas-rpool-ROOT-nas_slash
 zfs-external1/nas-rpool-root-nas_sl...@daily-2010-06-01-00-00-00  0  -  8.60G  -
 
  
 
 And
 
  
 
 [r...@nasbackup ~]# zfs get origin
 
 NAME  PROPERTY  VALUE  SOURCE
 rpool  origin  -  -
 rpool/ROOT  origin  -  -
 rpool/ROOT/nasbackup_slash  origin  -  -
 rpool/dump  origin  -  -
 rpool/swap  origin  -  -
 storagepool  origin  -  -
 storagepool/nas-lyricpool  origin  -  -
 storagepool/nas-lyricp...@nasbackup-2010-05-14-15-56-30  origin  -  -
 storagepool/nas-lyricp...@daily-2010-06-01-00-00-00  origin  -  -
 storagepool/nas-lyricp...@daily-2010-06-02-00-00-00  origin  -  -
 storagepool/nas-rpool-ROOT-nas_slash  origin  -  -
 
 

Re: [zfs-discuss] creating a fast ZIL device for $200

2010-05-27 Thread sensille
Neil Perrin wrote:
 Yes, I agree this seems very appealing. I have investigated and
 observed similar results. Just allocating larger intent log blocks but
 only writing to say the first half of them has seen the same effect.
 Despite the impressive results, we have not pursued this further mainly
 because of its maintainability. There is quite a variance between
 drives so, as mentioned, feedback profiling of the device is needed
 in the working system. The layering of the Solaris IO subsystem doesn't
 provide the feedback necessary and the ZIL code is layered on the SPA/DMU.
 Still it should be possible. Good luck!
 

Thanks :) Though I had hoped for a different answer. An integration into the
ZFS code would be much more elegant, but of course in a few years the need
for this optimization will be gone, once SSDs are cheap, fast and reliable.


There seems to be some interest in this idea here. Would it make sense
to start a project for it? Currently I'm implementing a driver as a
proof of concept, but I need a lot of discussion about algorithms and
concepts, and maybe some code reviews.

Can I count on some support from here?

--Arne



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] creating a fast ZIL device for $200

2010-05-27 Thread sensille

(resent because of mail problems)
Edward Ned Harvey wrote:

From: sensille [mailto:sensi...@gmx.net]

The only thing I'd like to point out is that ZFS doesn't do random writes on a
slog, but nearly linear writes. This might even be hurting performance more
than random writes, because you always hit the worst case of one full rotation.


Um ... I certainly have a doubt about this.  My understanding is that hard
disks are already optimized for sustained sequential throughput.  I have a
really hard time believing Seagate, WD, etc, designed their drives such that
you read/write one track, then pause and wait for a full rotation, then
read/write one track, and wait again, and so forth.  This would limit the
drive to approx 50% duty cycle, and the market is very competitive.

Yes, I am really quite sure, without any knowledge at all, that the drive
mfgrs are intelligent enough to map the logical blocks in such a way that
sequential reads/writes which are larger than a single track will not suffer
such a huge penalty.  Just a small penalty to jump up one track, and wait
for a few degrees of rotation, not 360 degrees.


I'm afraid you got me wrong here. Of course the drives are optimized for
sequential reads/writes. If you give the drive a single read or write that
is larger than one track, the drive acts exactly as you described. The same
holds if you give the drive multiple smaller consecutive reads/writes in
advance (NCQ/TCQ), so that the drive can coalesce them into one big op.

But this is not what happens in the case of ZFS/ZIL with a single application.
The application requests a synchronous op. This request goes down into
ZFS, which in turn allocates a ZIL block, writes it to the disk and issues a
cache flush. Only after the cache flush completes can ZFS acknowledge the
op to the application. Now the application can issue the next op, for which
ZFS will again allocate a ZIL block, probably immediately after the previous
one. It writes the block and issues a flush. But in the meantime the head
has traveled some sectors down the track. To physically write the block, the
drive of course has to wait until the sector is under the head again, which
means waiting nearly one full rotation. Had ZFS chosen a block appropriately
further down the track, chances are the head would not have passed it yet
and the drive could have written it without a big rotational delay.
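
To put rough numbers on this (back-of-the-envelope only; the 0.5 ms figure
comes from my own measurements described in another mail, not a property of
every drive):

# Why a strictly linear ZIL on a 10k RPM disk tops out around 166 sync
# writes/s, while writing "ahead of the head" gets close to 2000/s.
RPM = 10000
rotation_s = 60.0 / RPM          # 6.0 ms per full rotation

linear_wps = 1.0 / rotation_s    # every flush waits ~one rotation: ~166/s

setup_s = 0.0005                 # ~0.5 ms effective time per remapped write
remapped_wps = 1.0 / setup_s     # ~2000/s

print("linear ZIL:   %.0f writes/s" % linear_wps)
print("remapped ZIL: %.0f writes/s" % remapped_wps)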


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] creating a fast ZIL device for $200

2010-05-27 Thread sensille

(resent because of received bounce)
Edward Ned Harvey wrote:

From: sensille [mailto:sensi...@gmx.net]



So this brings me back to the question I indirectly asked in the middle of a
much longer previous email - 


Is there some way, in software, to detect the current position of the head?
If not, then I only see two possibilities:

Either you have some previous knowledge (or assumptions) about the drive
geometry, rotation speed, and wall clock time passed since the last write
completed, and use this (possibly vague or inaccurate) info to make your
best guess what available blocks are accessible with minimum latency next
...



That is my current approach, and it works quite well. I obtain the prior
knowledge through a special measuring process run before first using the
disk. To keep the driver in sync with the disk during idle times, it issues
dummy ops at regular intervals, say 20 per second.
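
Just to illustrate the bookkeeping, in Python (the constants below are
placeholders for what the measuring run yields on one particular drive, not
real profile data):

import time

SECTORS_PER_TRACK = 1764      # e.g. first zone, head 1 (placeholder)
ROTATION_S = 60.0 / 10000     # 10k RPM -> 6 ms per rotation (placeholder)
LEAD_SECTORS = 140            # how far ahead the command must be issued

class HeadModel:
    """Predicts which sector of the current track is under the head."""
    def __init__(self):
        self.ref_time = time.monotonic()
        self.ref_sector = 0   # sector that was under the head at ref_time

    def note_completed_write(self, sector):
        # Every completed write -- including the ~20/s idle dummy ops --
        # is a fresh phase sample: the head was at 'sector' just now.
        self.ref_time = time.monotonic()
        self.ref_sector = sector

    def next_writable_sector(self):
        elapsed = time.monotonic() - self.ref_time
        advanced = int(elapsed / ROTATION_S * SECTORS_PER_TRACK)
        under_head = (self.ref_sector + advanced) % SECTORS_PER_TRACK
        return (under_head + LEAD_SECTORS) % SECTORS_PER_TRACK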


or else some sort of new hardware behavior would be necessary.  Possibly a
special type of drive, which always assumes a command to write to a
magical block number actually means write to the next available block or
something like that ... or reading from a magical block actually tells you
the position of the head or something like that...


That would be nice. But what would be much nicer is a drive with an extremely
small setup time. Current drives need to receive the command 0.4-0.7 ms in
advance, depending on manufacturer and drive type.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] creating a fast ZIL device for $200

2010-05-26 Thread sensille
Recently, I've been reading through the ZIL/slog discussion and
have the impression that a lot of folks here are (like me)
interested in getting a viable solution for a cheap, fast and
reliable ZIL device.
I think I can provide such a solution for about $200, but it
involves a lot of development work.
The basic idea: the main problem when using a HDD as a ZIL device
is the cache flushes in combination with the linear write pattern
of the ZIL. This leads to a whole rotation of the platter after
each write, because after the first write returns, the head is
already past the sector that will be written next.
My idea goes as follows: don't write linearly. Track the rotation
and write to the position the head will hit next. This might be done
by a re-mapping layer or integrated into ZFS. This works only because
ZIL devices are basically write-only. Reads from this device will be
horribly slow.

I have done some testing and am quite enthusiastic. If I take a
decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise
the synchronous write performance from 166 writes/s to about
2000 writes/s (!). 2000 IOPS is more than sufficient for our
production environment.

Currently I'm implementing a re-mapping driver for this. The
reason I'm writing to this list is that I'd like to find support
from the zfs team, find sparring partners to discuss implementation
details and algorithms and, most important, find testers!

If there is interest it would be great to build an official project
around it. I'd be willing to contribute most of the code, but any
help will be more than welcome.

So, anyone interested? :)

--
Arne Jansen

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] creating a fast ZIL device for $200

2010-05-26 Thread sensille
Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of sensille

 The basic idea: the main problem when using a HDD as a ZIL device
 is the cache flushes in combination with the linear write pattern
 of the ZIL. This leads to a whole rotation of the platter after
 each write, because after the first write returns, the head is
 already past the sector that will be written next.
 My idea goes as follows: don't write linearly. Track the rotation
 and write to the position the head will hit next. This might be done
 by a re-mapping layer or integrated into ZFS. This works only because
 ZIL devices are basically write-only. Reads from this device will be
 horribly slow.
 
 The reason why hard drives are less effective as ZIL dedicated log devices
 compared to such things as SSD's, is because of the rotation of the hard
 drives; the physical time to seek a random block.  There may be a
 possibility to use hard drives as dedicated log devices, cheaper than SSD's
 with possibly comparable latency, if you can intelligently eliminate the
 random seek.  If you have a way to tell the hard drive Write this data, to
 whatever block happens to be available at minimum seek time.

Thanks for rephrasing my idea :) The only thing I'd like to point out is that
ZFS doesn't do random writes on a slog, but nearly linear writes. This might
even be hurting performance more than random writes, because you always hit
the worst case of one full rotation.

 
 For rough estimates:  Assume the drive is using Zone Density Recording, like
 this:
 http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm
 Suppose you're able to keep your hard drive head on the outer sectors.
 Suppose 1000 sectors per track (I have no idea if that's accurate, but at
 least according to the above article in the year 2000 it was ballpark
 realistic).  Suppose 10krpm.  Then the physical seek time could
 theoretically be brought down to as low as 10^-7 seconds.  Of course, that's
 not realistic - some sectors may already be used - the electronics
 themselves could be a factor - But the point remains, the physical seek time
 can be effectively eliminated.  At least in theory.  And that was the year
 2000.

The mentioned Hitachi disk (at least the one I have in my test machine)
has 1764 sectors per track on head 1 and 1680 on head 2 in the first zone,
which spans 50 tracks. I'm quite sure the limiting factor is the electronics.
This disk needs the write command about 140 sectors in advance. It may be
that the servo information on the platters also has to be taken into account.
Other disks don't behave that well. I tried with 1TB SATA disks, but they
don't seem to have any predictable timing.

 I have done some testing and am quite enthusiastic. If I take a
 decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise
 the synchronous write performance from 166 writes/s to about
 2000 writes/s (!). 2000 IOPS is more than sufficient for our
 production environment.
 
 Um ... Careful there.  There are many apples, oranges, and bananas to be
 compared inaccurately against each other.  When I measure IOPS of physical
 disks, with all the caches disabled, I get anywhere from 200 to 2400 for a
 single spindle disk (SAS 10k), and I get anywhere from 2000 to 6000 with a
 SSD (SATA).  Just depending on the benchmark configuration.  Because ZFS is
 doing all sorts of acceleration behind the scenes, which make the results
 vary *immensely* from some IOPS number that you look up online.

The measurement is simple: disable the write cache, write one sector; when
that write returns, calculate the next optimal sector to write to, write,
calculate again... This gives a quite stable result of about 2000 writes/s,
or 0.5 ms average service time, single-threaded. No ZFS involved, just pure
disk performance.
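
In rough pseudo-Python, the loop looks like this (a sketch only: it assumes
the drive's write cache is already off, raw device access, and a
next_optimal_sector() that wraps the drive profile; none of these are shown):

import os, time

SECTOR = 512

def benchmark(dev_path, n_writes, next_optimal_sector):
    buf = b"\0" * SECTOR
    # O_DSYNC (where available) makes each write return only once the
    # sector is stable; with the write cache off that means on the platter.
    fd = os.open(dev_path, os.O_WRONLY | os.O_DSYNC)
    try:
        sector = next_optimal_sector(None)
        start = time.monotonic()
        for _ in range(n_writes):
            os.pwrite(fd, buf, sector * SECTOR)
            sector = next_optimal_sector(sector)  # pick a sector ahead of the head
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return n_writes / elapsed     # synchronous writes per second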

 
 So you believe you can know the drive geometry, the instantaneous head
 position, and the next available physical block address in software?  No
 need for special hardware?  That's cool.  I hope there aren't any gotchas
 as-yet undiscovered.

Yes, I have already mapped several drives. I measured at least the track
length, the interleave needed between two writes, and the interleave when a
track-to-track seek is involved. Of course you can always learn more about a
disk, but that's a good starting point.
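
For the curious, the core of such a mapping run can be sketched like this
(illustration only, not the actual tool; same raw-device and cache-off
assumptions as before, fd opened with O_DSYNC by the caller):

import os, time

SECTOR = 512

def rotation_period(fd, lba, n=50):
    # With the write cache off, back-to-back writes to the same LBA each
    # take one full rotation, so the average service time is the period.
    buf = b"\0" * SECTOR
    os.pwrite(fd, buf, lba * SECTOR)          # get the head onto the track
    t0 = time.monotonic()
    for _ in range(n):
        os.pwrite(fd, buf, lba * SECTOR)
    return (time.monotonic() - t0) / n

def min_interleave(fd, lba, rotation, max_gap=400):
    # Smallest gap (in sectors) at which a follow-up write completes in
    # much less than a rotation, i.e. the head had not passed it yet.
    # max_gap must stay below the track length so both LBAs share a track.
    buf = b"\0" * SECTOR
    for gap in range(1, max_gap):
        os.pwrite(fd, buf, lba * SECTOR)
        t0 = time.monotonic()
        os.pwrite(fd, buf, (lba + gap) * SECTOR)
        if time.monotonic() - t0 < 0.5 * rotation:
            return gap
    return None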

--
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] creating a fast ZIL device for $200

2010-05-26 Thread sensille
Bob Friesenhahn wrote:
 On Wed, 26 May 2010, sensille wrote:
 The basic idea: the main problem when using a HDD as a ZIL device
  is the cache flushes in combination with the linear write pattern
 of the ZIL. This leads to a whole rotation of the platter after
 each write, because after the first write returns, the head is
 already past the sector that will be written next.
 My idea goes as follows: don't write linearly. Track the rotation
 and write to the position the head will hit next. This might be done
 by a re-mapping layer or integrated into ZFS. This works only because
  ZIL devices are basically write-only. Reads from this device will be
 horribly slow.
 
 I like your idea.  It would require a profiling application to learn the
 physical geometry and timing of a given disk drive in order to save the
 configuration data for it.  The timing could vary under heavy system
 load so the data needs to be sent early enough that it will always be
 there when needed.  The profiling application might need to drive a disk
 for several hours (or a day) in order to fully understand how it
 behaves.

A day is a good ballpark. Currently the application runs for several hours
just to map the tracks. But there's lots of room for algorithms that measure
and fine-tune on the fly: every write is also a measurement.
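
As a sketch of what I mean by fine-tuning on the fly (illustration only, not
the real driver; the 0.5 ms and 6 ms figures are from the numbers discussed
elsewhere in the thread):

# Track an exponential moving average of the per-write service time; if it
# drifts from the expected ~0.5 ms toward a full rotation (~6 ms at 10k RPM),
# the timing model has lost sync and a recalibration is due.
class ServiceTimeMonitor:
    def __init__(self, expected_s=0.0005, rotation_s=0.006, alpha=0.05):
        self.avg = expected_s
        self.rotation_s = rotation_s
        self.alpha = alpha

    def observe(self, service_time_s):
        self.avg += self.alpha * (service_time_s - self.avg)
        # Out of sync if writes start paying a large fraction of a rotation.
        return self.avg > 0.5 * self.rotation_s    # True -> recalibrate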

 Remapped failed sectors would cause this micro-timing to fail,
 but only for the remapped sectors.

Of course you could detect those remapped sectors because of the failed timing
and stop using them in the future :)

--
Arne

 Bob

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread sensille
Don wrote:
 
 With that in mind- Is anyone using the new OCZ Vertex 2 SSD's as a ZIL?
 
 They're claiming 50k IOPS (4k Write- Aligned), 2 million hour MTBF, TRIM 
 support, etc. That's more write IOPS than the ZEUS (40k IOPS, $) but at 
 half the price of an Intel X25-E (3.3k IOPS, $400).
 
 Needless to say I'd love to know if anyone has evaluated these drives to see 
 if they make sense as a ZIL- for example- do they honor cache flush requests? 
 Are those sustained IOPS numbers?

To my understanding, nearly the only relevant number is the number of cache
flushes a drive can handle per second, as this determines single-threaded
synchronous write performance.
Does anyone have an idea what numbers I can expect from an Intel X25-E or
an OCZ Vertex 2?

-Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss