Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Kjetil Torgrim Homme
"Paul B. Henson"  writes:

> Good :). I am certainly not wedded to my proposal, if some other
> solution is proposed that would meet my requirements, great. However,
> pretty much all of the advice has boiled down to either "ACL's are
> broken, don't use them", or "why would you want to do *that*?", which
> isn't particularly useful.

you haven't demonstrated why the current capabilities are insufficient
for your requirements.  it's a bit hard to offer advice for perceived
problems other than "reconsider your perception".

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Kjetil Torgrim Homme
"Paul B. Henson"  writes:

> On Tue, 2 Mar 2010, Kjetil Torgrim Homme wrote:
>
>> no.  what happens when an NFS client without ACL support mounts your
>> filesystem?  your security is blown wide open.  the filemode should
>> reflect the *least* level of access.  if the filemode on its own allows
>> more access, then you've lost.
>
> Say what?
>
> If you're using secure NFS, access control is handled on the server
> side.  If an NFS client that doesn't support ACL's mounts the
> filesystem, it will have whatever access the user is supposed to have,
> the lack of ACL support on the client is immaterial.

this is true for AUTH_SYS, too, sorry about the bad example.  but it
doesn't really affect my point.  if you just consider the filemode to be
the lower bound for access rights, aclmode=passthrough will not give you
any nasty surprises regardless of what clients do, *and* an ACL-ignorant
client will get the behaviour it needs and wants.  win-win!
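
for reference, the dataset properties involved (a minimal sketch -- the
dataset name is made up):

  zfs set aclmode=passthrough tank/export/home
  zfs set aclinherit=passthrough tank/export/home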

>> if your ACLs are completely specified and give proper access on their
>> own, and you're using aclmode=passthrough, "chmod -R 000 /" will not
>> harm your system.
>
> Actually, it will destroy the three special ACE's, user@, group@, and
> everyone@.  On the other hand, with a hypothetical aclmode=ignore or
> aclmode=deny, such a chmod would indeed not harm the system.

you're not using those, are you?  they are a direct mapping of the old
style permissions, so it would be pretty weird if they were allowed to
diverge.

>> if you have rogue processes doing "chmod a+rwx" or other nonsense, you
>> need to fix the rogue process, that's not an ACL problem or a problem
>> with traditional Unix permissions.
>
> What I have are processes that don't know about ACL's. Are they
> broken? Not in and of themselves, they are simply incompatible with a
> security model they are unaware of.

you made that model.

> Why on earth would I want to go and try to make every single
> application in the world ACL aware/compatible instead of simply having
> a filesystem which I can configure to ignore any attempt to manipulate
> legacy permissions?

you don't have to.  just subscribe to the principle of least security,
and it just works.

>> not at all.  you just have to use them correctly.
>
> I think we're just not on the same page on this; while I am not saying
> I'm on the right page, it does seem you need to do a little more
> reading up on how ACL's work.

nice insult.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Consolidating a huge stack of DVDs using ZFS dedup: automation?

2010-03-02 Thread Kjetil Torgrim Homme
Freddie Cash  writes:

> Kjetil Torgrim Homme  wrote:
>
> it would be inconvenient to make a dedup copy on harddisk or tape,
> you could only do it as a ZFS filesystem or ZFS send stream.  it's
> better to use a generic tool like hardlink(1), and just delete
> files afterwards with
>
> Why would it be inconvenient?  This is pretty much exactly what ZFS +
> dedupe is perfect for.

the duplication is not visible, so it's still a wilderness of duplicates
when you navigate the files.

> Since dedupe is pool-wide, you could create individual filesystems for
> each DVD.  Or use just 1 filesystem with sub-directories.  Or just one
> filesystem with snapshots after each DVD is copied over top.
>
> The data would be dedupe'd on write, so you would only have 1 copy of
> unique data.

for this application, I don't think the OP *wants* COW if he changes one
file.  he'll want the duplicates to be kept in sync, not diverging (in
contrast to storage for VMs, for instance).

with hardlinks, it is easier to identify duplicates and handle them
however you like.  if there is a reason for the duplicate access paths
to your data, you can keep them.  I would want to straighten the mess
out, though, rather than keep it intact as closely as possible.
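
for example, a quick way to list the duplicates after hardlink(1) has
done its job (a sketch assuming GNU find; the path is made up):

  find /archive -type f -links +1 -printf '%i %n %p\n' | sort -n

files sharing the same inode number are the same physical copy.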

> To save it to tape, just "zfs send" it, and save the stream file.

the zfs stream format is not recommended for archiving.

> ZFS dedupe would also work better than hardlinking files, as it works
> at the block layer, and will be able to dedupe partial files.

yes, but for the most part this will be negligible.  copies of growing
files, like log files, or perhaps your novel written as a stream of
consciousness, will benefit.  unrelated partially identical files are
rare.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Consolidating a huge stack of DVDs using ZFS dedup: automation?

2010-03-02 Thread Kjetil Torgrim Homme
"valrh...@gmail.com"  writes:

> I have been using DVDs for small backups here and there for a decade
> now, and have a huge pile of several hundred. They have a lot of
> overlapping content, so I was thinking of feeding the entire stack
> into some sort of DVD autoloader, which would just read each disk, and
> write its contents to a ZFS filesystem with dedup enabled. [...] That
> would allow me to consolidate a few hundred CDs and DVDs onto probably
> a terabyte or so, which could then be kept conveniently on a hard
> drive and archived to tape.

it would be inconvenient to make a dedup copy on harddisk or tape, you
could only do it as a ZFS filesystem or ZFS send stream.  it's better to
use a generic tool like hardlink(1), and just delete files afterwards
with

  find . -type f -links +1 -exec rm {} \;

(untested!  notice that using xargs or -exec rm {} + will wipe out all
copies of your duplicate files, so don't do that!)

  http://linux.die.net/man/1/hardlink

perhaps this is more convenient:
  http://netdial.caribe.net/~adrian2/fdupes.html
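
e.g. something like this for a report of duplicate files (a sketch, the
path is made up; read the fdupes man page before letting it delete
anything):

  fdupes -r /archive/dvds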

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Kjetil Torgrim Homme
"Paul B. Henson"  writes:

> On Sun, 28 Feb 2010, Kjetil Torgrim Homme wrote:
>
>> why are you doing this?  it's inherently insecure to rely on ACL's to
>> restrict access.  do as David says and use ACL's to *grant* access.
>> if needed, set permission on the file to 000 and use umask 777.
>
> Umm, it's inherently insecure to rely on Access Control Lists to,
> well, control access? Doesn't that sound a bit off?

no.  what happens when an NFS client without ACL support mounts your
filesystem?  your security is blown wide open.  the filemode should
reflect the *least* level of access.  if the filemode on its own allows
more access, then you've lost.

> The only reason it's insecure is because the ACL's don't stand alone,
> they're propped up on a legacy chmod interoperability house of cards
> which frequently falls down.

not if you do it right.

>> why is umask 022 when you want 077?  *that's* your problem.
>
> What I want is for my inheritable ACL's not to be mixed in with legacy
> concepts. ACL's don't have a umask. One of the benefits of inherited
> ACL's is you don't need to globally pick "022, let people see what I'm
> up to" vs "077, hide it all". You can just create files, with the
> confidence that every one you create will have the appropriate
> permissions as configured.

if your ACLs are completely specified and give proper access on their
own, and you're using aclmode=passthrough, "chmod -R 000 /" will not
harm your system.
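
as an illustration, granting access through an inheritable ACE rather
than the filemode looks something like this (a sketch -- the user,
directory and permission/inheritance flags are made up, adjust to your
policy):

  chmod A+user:webservd:read_data/write_data/execute:file_inherit/dir_inherit:allow /export/share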

if you have rogue processes doing "chmod a+rwx" or other nonsense, you
need to fix the rogue process, that's not an ACL problem or a problem
with traditional Unix permissions.

> Except, of course, when they're comingled with incompatible security
> models. Basically, it sounds like you're arguing I shouldn't try to
> fix ACL/chmod issues because ACL's are insecure because they have
> chmod issues 8-/.

not at all.  you just have to use them correctly.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-28 Thread Kjetil Torgrim Homme
"Paul B. Henson"  writes:
> On Fri, 26 Feb 2010, David Dyer-Bennet wrote:
>> I think of using ACLs to extend extra access beyond what the
>> permission bits grant.  Are you talking about using them to prevent
>> things that the permission bits appear to grant?  Because so long as
>> they're only granting extended access, losing them can't expose
>> anything.
>
> Consider the example of creating a file in a directory which has an
> inheritable ACL for new files:

why are you doing this?  it's inherently insecure to rely on ACL's to
restrict access.  do as David says and use ACL's to *grant* access.  if
needed, set permission on the file to 000 and use umask 777.

> drwx--s--x+  2 henson   csupomona   4 Feb 27 09:21 .
> owner@:rwxpdDaARWcC--:-di---:allow
> owner@:rwxpdDaARWcC--:--:allow
> group@:--x---a-R-c---:-di---:allow
> group@:--x---a-R-c---:--:allow
>  everyone@:--x---a-R-c---:-di---:allow
>  everyone@:--x---a-R-c---:--:allow
> owner@:rwxpdDaARWcC--:f-i---:allow
> group@:--:f-i---:allow
>  everyone@:--:f-i---:allow
>
> When the ACL is respected, then regardless of the requested creation
> mode or the umask, new files will have the following ACL:
>
> -rw---+  1 henson   csupomona   0 Feb 27 09:26 foo
> owner@:rw-pdDaARWcC--:--:allow
> group@:--:--:allow
>  everyone@:--:--:allow
>
> Now, let's say a legacy application used a requested creation mode of
> 0644, and the current umask was 022, and the application calculated
> the resultant mode and explicitly set it with chmod(0644):

why is umask 022 when you want 077?  *that's* your problem.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Oops, ran zfs destroy after renaming a folder and deleted my file system.

2010-02-25 Thread Kjetil Torgrim Homme
tomwaters  writes:

> I created a zfs file system, cloud/movies and shared it.
> I then filled it with movies and music.
> I then decided to rename it, so I used rename in the Gnome to change
> the folder name to media, i.e. cloud/media. <--- MISTAKE
> I then noticed the zfs share was pointing to /cloud/movies which no
> longer exists.

I think you should file a bug against Nautilus (the GNOME file manager).
When you rename a directory, it should check for it being a mountpoint
and warn appropriately.  (adding ZFS specific code to DTRT is perhaps
asking for a bit too much.)  evidently it got an error for the rename(2)
and instead started to copy/delete the original.  *inside* some
filesystems, this is probably correct behaviour, but when the object is
a filesystem, I don't think anyone wants this behaviour.  if they want to
move data off the filesystem, they should go inside, mark all files, and
drag (or ^X ^V) the files wherever they should go.

> So, I removed cloud/movies with zfs destroy <--- BIGGER MISTAKE

I see the reasoning behind this, but as you've learnt the hard way:
always double-check before using zfs destroy.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Kjetil Torgrim Homme
"David Dyer-Bennet"  writes:

> Which is bad enough if you say "ls".  And there's no option to say
> "don't sort" that I know of, either.

/bin/ls -f

"/bin/ls" makes sure an alias for "ls" to "ls -F" or similar doesn't
cause extra work.  you can also write "\ls -f" to ignore a potential
alias.

without an argument, GNU ls and SunOS ls behave the same.  if you write
"ls -f *" you'll only get output for directories in SunOS, while GNU ls
will list all files.

(ls -f has been there since SunOS 4.0 at least)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Kjetil Torgrim Homme
Steve  writes:

> I would like to ask a question regarding ZFS performance overhead when
> having hundreds of millions of files
>
> We have a storage solution, where one of the datasets has a folder
> containing about 400 million files and folders (very small 1K files)
>
> What kind of overhead do we get from this kind of thing?

at least 50%.  I don't think this is obvious, so I'll state it: RAID-Z
will not gain you any additional capacity over mirroring in this
scenario.

remember each individual file gets its own stripe.  if the file is 512
bytes or less, you'll need another 512 byte block for the parity (with
only one data block, the XOR parity is identical to the data, so it is
effectively just a copy -- no computation needed).  what's more, for a
file between 513 and 1024 bytes, ZFS will allocate an additional padding
block to reduce the chance of unusable single disk blocks.  a 1536 byte
file will also consume 2048 bytes of physical disk, however.  the
reasoning for RAID-Z2 is similar, except it will add a padding block
even for the 1536 byte file.  to summarise:

   net   raid-z1      raid-z2
  -----------------------------
   512   1024  2x     1536  3x
  1024   2048  2x     3072  3x
  1536   2048  1½x    3072  2x
  2048   3072  1½x    3072  1½x
  2560   3072  1⅕x    4608  1⅘x

the above assumes at least 8 (9) disks in the vdev, otherwise you'll get
a little more overhead for the "larger" filesizes.

> Our storage performance has degraded over time, and we have been
> looking in different places for cause of problems, but now I am
> wondering if its simply a file pointer issue?

adding new files will fragment directories, which might cause
performance degradation depending on access patterns.

I don't think the number of files in itself will cause problems, but since you
get a lot more ZFS records in your dataset (128x!), more of the disk
space is "wasted" on block pointers, and you may get more block pointer
writes since more levels are needed.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-23 Thread Kjetil Torgrim Homme
Miles Nordin  writes:

>>>>>> "kth" == Kjetil Torgrim Homme  writes:
>
>kth> the SCSI layer handles the replaying of operations after a
>kth> reboot or connection failure.
>
> how?
>
> I do not think it is handled by SCSI layers, not for SAS nor iSCSI.

sorry, I was inaccurate.  error reporting is done by the SCSI layer, and
the filesystem handles it by retrying whatever outstanding operations it
has.

> Also, remember a write command that goes into the write cache is a
> SCSI command that's succeeded, even though it's not actually on disk
> for sure unless you can complete a sync cache command successfully and
> do so with no errors nor ``protocol events'' in the gap between the
> successful write and the successful sync.  A facility to replay failed
> commands won't help because when a drive with write cache on reboots,
> successful writes are rolled back.

this is true, sorry about my lack of precision.  the SCSI layer can't do
this on its own.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-22 Thread Kjetil Torgrim Homme
Miles Nordin  writes:

> There will probably be clients that might seem to implicitly make this
> assuption by mishandling the case where an iSCSI target goes away and
> then comes back (but comes back less whatever writes were in its write
> cache).  Handling that case for NFS was complicated, and I bet such
> complexity is just missing without any equivalent from the iSCSI spec,
> but I could be wrong.  I'd love to be educated.
>
> Even if there is some magical thing in iSCSI to handle it, the magic
> will be rarely used and often wrong until peopel learn how to test it,
> which they haven't yet they way they have with NFS.

I decided I needed to read up on this and found RFC 3783 which is very
readable, highly recommended:

  http://tools.ietf.org/html/rfc3783

basically iSCSI just defines a reliable channel for SCSI.  the SCSI
layer handles the replaying of operations after a reboot or connection
failure.  as far as I understand it, anyway.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] improve meta data performance

2010-02-19 Thread Kjetil Torgrim Homme
Chris Banal  writes:

> We have a SunFire X4500 running Solaris 10U5 which does about 5-8k nfs
> ops of which about 90% are meta data. In hind sight it would have been
> significantly better  to use a mirrored configuration but we opted for
> 4 x (9+2) raidz2 at the time. We can not take the downtime necessary
> to change the zpool configuration.
>
> We need to improve the meta data performance with little to no
> money. Does anyone have any suggestions?

I believe the latest Solaris update will improve metadata caching.
always good to be up-to-date on patches, no?

> Is there such a thing as a Sun supported NVRAM PCI-X card compatible
> with the X4500 which can be used as an L2ARC?

I think they only have PCIe, and it hardly qualifies as "little to no
money".

  http://www.sun.com/storage/disk_systems/sss/f20/specs.xml

I'll second the recommendations for Intel X25-M for L2ARC if you can
spare a SATA slot for it.
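
adding it as a cache device is a one-liner (device name made up):

  zpool add tank cache c4t2d0
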
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS slowness under domU high load

2010-02-14 Thread Kjetil Torgrim Homme
Bogdan Ćulibrk  writes:

> What are my options from here? To move onto zvol with greater
> blocksize? 64k? 128k? Or I will get into another trouble going that
> way when I have small reads coming from domU (ext3 with default
> blocksize of 4k)?

yes, definitely.  have you considered using NFS rather than zvols for
the data filesystems?  (keep zvol for the domU software.)

it's strange that you see so much write activity during backup -- I'd
expect that to do just reads...  what's going on at the domU?

generally, the best way to improve performance is to add RAM for ARC
(512 MiB is *very* little IMHO) and SSD for your ZIL, but it does seem
to be a poor match for your concept of many small low-cost dom0's.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-10 Thread Kjetil Torgrim Homme
[please don't top-post, please remove CC's, please trim quotes.  it's
 really tedious to clean up your post to make it readable.]

Marc Nicholas  writes:
> Brent Jones  wrote:
>> Marc Nicholas  wrote:
>>> Kjetil Torgrim Homme  wrote:
>>>> his problem is "lazy" ZFS, notice how it gathers up data for 15
>>>> seconds before flushing the data to disk.  tweaking the flush
>>>> interval down might help.
>>>
>>> How does lowering the flush interval help? If he can't ingress data
>>> fast enough, faster flushing is a Bad Thibg(tm).

if network traffic is blocked during the flush, you can experience
back-off on both the TCP and iSCSI level.

>>>> what are the other values?  ie., number of ops and actual amount of
>>>> data read/written.

this remained unanswered.

>> ZIL performance issues? Is writecache enabled on the LUNs?
> This is a Windows box, not a DB that flushes every write.

have you checked if the iSCSI traffic is synchronous or not?  I don't
use Windows, but other reports on the list have indicated that at least
the NTFS format operation *is* synchronous.  use zilstat to see.
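
a rough sketch, assuming Richard Elling's zilstat script is installed
(invocation details may differ between versions):

  ./zilstat.ksh 1 10

non-zero numbers while the client is writing mean the traffic is
synchronous.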

> The drives are capable of over 2000 IOPS (albeit with high latency as
> its NCQ that gets you there) which would mean, even with sync flushes,
> 8-9MB/sec.

2000 IOPS is the aggregate, but the disks are set up as *one* RAID-Z2!
NCQ doesn't help much, since the write operations issued by ZFS are
already ordered correctly.

the OP may also want to try tweaking metaslab_df_free_pct, this helped
linear write performance on our Linux clients a lot:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6869229
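
for reference, this kind of tunable is usually adjusted like this (a
sketch -- the value 4 is just an illustration, see the bug report for
guidance):

  echo "metaslab_df_free_pct/W 4" | mdb -kw       # on a running system
  set zfs:metaslab_df_free_pct = 4                # persistently, in /etc/system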

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-10 Thread Kjetil Torgrim Homme
Bob Friesenhahn  writes:
> On Wed, 10 Feb 2010, Frank Cusack wrote:
>
> The other three commonly mentioned issues are:
>
>  - Disable the naggle algorithm on the windows clients.

for iSCSI?  shouldn't be necessary.

>  - Set the volume block size so that it matches the client filesystem
>block size (default is 128K!).

default for a zvol is 8 KiB.
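
note that volblocksize can only be set when the zvol is created, e.g.
(a sketch, name and size made up):

  zfs create -V 100G -o volblocksize=4K tank/iscsi/lun0

NTFS uses 4 KiB clusters by default, so that would be the natural match
for a Windows initiator.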

>  - Check for an abnormally slow disk drive using 'iostat -xe'.

his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds
before flushing the data to disk.  tweaking the flush interval down
might help.

>> An "iostat -xndz 1" readout of the "%b% coloum during a file copy to
>> the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
>> seconds of 100, and repeats.

what are the other values?  ie., number of ops and actual amount of data
read/written.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-10 Thread Kjetil Torgrim Homme
"Eric D. Mudama"  writes:
> On Tue, Feb  9 at  2:36, Kjetil Torgrim Homme wrote:
>> no one is selling disk brackets without disks.  not Dell, not EMC,
>> not NetApp, not IBM, not HP, not Fujitsu, ...
>
> http://discountechnology.com/Products/SCSI-Hard-Drive-Caddies-Trays

very nice, thanks.  unfortunately it probably won't last:

[http://lists.us.dell.com/pipermail/linux-poweredge/2010-February/041335.html]:
|
| In the case of Dell's PERC RAID controllers, we began informing
| customers when a non-Dell drive was detected with the introduction of
| PERC5 RAID controllers in early 2006. With the introduction of the
| PERC H700/H800 controllers, we began enabling only the use of Dell
| qualified drives.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-09 Thread Kjetil Torgrim Homme
Neil Perrin  writes:

> On 02/09/10 08:18, Kjetil Torgrim Homme wrote:
>> I think the above is easily misunderstood.  I assume the OP means
>> append, not rewrites, and in that case (with recordsize=128k):
>>
>> * after the first write, the file will consist of a single 1 KiB record.
>> * after the first append, the file will consist of a single 5 KiB
>>   record.
>
> Good so far.
>
>> * after the second append, one 128 KiB record and one 7 KiB record.
>
> A long time ago we used to write short tail blocks, but not any more.
> So after the 2nd append we actually have 2 128KB blocks.

thanks a lot for the correction!

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup Questions.

2010-02-09 Thread Kjetil Torgrim Homme
Richard Elling  writes:

> On Feb 8, 2010, at 6:04 PM, Kjetil Torgrim Homme wrote:
>> the size of [a DDT] entry is much larger:
>> 
>> | From: Mertol Ozyoney 
>> | 
>> | Approximately it's 150 bytes per individual block.
>
> "zdb -D poolname" will provide details on the DDT size.  FWIW, I have a
> pool with 52M DDT entries and the DDT is around 26GB.

wow, that's much larger than Mertol's estimate: 500 bytes per block.

> $ pfexec zdb -D tank
> DDT-sha256-zap-duplicate: 19725 entries, size 270 on disk, 153 in core
> DDT-sha256-zap-unique: 52284055 entries, size 284 on disk, 159 in core
>
> dedup = 1.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.00

how do you calculate the 26 GB size from this?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-09 Thread Kjetil Torgrim Homme
Richard Elling  writes:

> On Feb 8, 2010, at 9:10 PM, Damon Atkins wrote:
>
>> I would have thought that if I write 1k then ZFS txg times out in
>> 30secs, then the 1k will be written to disk in a 1k record block, and
>> then if I write 4k then 30secs latter txg happen another 4k record
>> size block will be written, and then if I write 130k a 128k and 2k
>> record block will be written.
>> 
>> Making the file have record sizes of
>> 1k+4k+128k+2k
>
> Close. Once the max record size is achieved, it is not reduced.  So
> the allocation is: 1KB + 4KB + 128KB + 128KB

I think the above is easily misunderstood.  I assume the OP means
append, not rewrites, and in that case (with recordsize=128k):

* after the first write, the file will consist of a single 1 KiB record.
* after the first append, the file will consist of a single 5 KiB
  record.
* after the second append, one 128 KiB record and one 7 KiB record.

in each of these operations, the *whole* file will be rewritten to a new
location, but after a third append, only the tail record will be
rewritten.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests

2010-02-08 Thread Kjetil Torgrim Homme
grarpamp  writes:

> PS: Is there any way to get a copy of the list since inception for
> local client perusal, not via some online web interface?

I prefer to read mailing lists using a newsreader and the NNTP interface
at Gmane.  a newsreader tends to be better at threading etc. than a mail
client which is fed an mbox...  see http://gmane.org/about.php for more
information.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-08 Thread Kjetil Torgrim Homme
Damon Atkins  writes:

> One problem could be block sizes, if a file is re-written and is the
> same size it may have different ZFS record sizes within, if it was
> written over a long period of time (txg's)(ignoring compression), and
> therefore you could not use ZFS checksum to compare two files.

the record size used for a file is chosen when that file is created.  it
can't change.  when the default record size for the dataset changes,
only new files will be affected.  ZFS *must* write a complete record
even if you change just one byte (unless it's the tail record of
course), since there isn't any better granularity for the block
pointers.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup Questions.

2010-02-08 Thread Kjetil Torgrim Homme
Tom Hall  writes:

> If you enable it after data is on the filesystem, it will find the
> dupes on read as well as write? Would a scrub therefore make sure the
> DDT is fully populated.

no.  only written data is added to the DDT, so you need to copy the data
somehow.  zfs send/recv is the most convenient, but you could even do a
loop of commands like

  cp -p "$file" "$file.tmp" && mv "$file.tmp" "$file"
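
e.g. something like this (a sketch -- it will trip over filenames
containing newlines or leading blanks, and ACLs/extended attributes may
need extra care):

  find /tank/data -type f |
  while read -r file; do
      cp -p "$file" "$file.tmp" && mv "$file.tmp" "$file"
  done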

> Re the DDT, can someone outline it's structure please? Some sort of
> hash table? The blogs I have read so far dont specify.

I can't help here.

> Re DDT size, is (data in use)/(av blocksize) * 256bit right as a worst
> case (ie all blocks non identical)

the size of an entry is much larger:

| From: Mertol Ozyoney 
| Subject: Re: Dedup memory overhead
| Message-ID: <00cb01caa580$a3d6f110$eb84d330$%ozyo...@sun.com>
| Date: Thu, 04 Feb 2010 11:58:44 +0200
| 
| Approximately it's 150 bytes per individual block.

> What are average block sizes?

as a start, look at your own data.  divide the used size in "df" with
used inodes in "df -i".  example from my home directory:

  $ /usr/gnu/bin/df -i ~
  FilesystemInodes IUsed IFree  IUse%Mounted on
  tank/home  223349423   3412777 219936646 2%/volumes/home

  $ df -k ~
  Filesystemkbytes  used avail capacity  Mounted on
  tank/home  573898752 257644703 10996825471%/volumes/home

so the average file size is 75 KiB, smaller than the recordsize of 128
KiB.  extrapolating to a full filesystem, we'd get 4.9M files.
unfortunately, it's more complicated than that, since a file can consist
of many records even if the *average* is smaller than a single record.

a pessimistic estimate, then, is one record for each of those 4.9M
files, plus one record for each 128 KiB of diskspace (2.8M), for a total
of 7.7M records.  the size of the DDT for this (quite small!) filesystem
would be something like 1.2 GB.  perhaps a reasonable rule of thumb is 1
GB DDT per TB of storage.
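
a hedged aside: if your zdb has the -S option, it can simulate dedup on
existing data and give real numbers instead of estimates like the above:

  zdb -S tank

expect it to take a long time (and a lot of memory) on a large pool.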

(disclaimer: I'm not a kernel hacker, I just read this list :-)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-08 Thread Kjetil Torgrim Homme
Daniel Carosone  writes:

> In that context, I haven't seen an answer, just a conclusion: 
>
>  - All else is not equal, so I give my money to some other hardware
>manufacturer, and get frustrated that Sun "won't let me" buy the
>parts I could use effectively and comfortably.  

no one is selling disk brackets without disks.  not Dell, not EMC, not
NetApp, not IBM, not HP, not Fujitsu, ...

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-07 Thread Kjetil Torgrim Homme
Christo Kutrovsky  writes:

> Has anyone seen soft corruption in NTFS iSCSI ZVOLs after a power
> loss?

this is not from experience, but I'll answer anyway.

> I mean, there is no guarantee writes will be executed in order, so in
> theory, one could corrupt it's NTFS file system.

I think you have that guarantee, actually.

the problem is that the Windows client will think that block N has been
updated, since the iSCSI server told it it was commited to stable
storage.  however, when ZIL is disabled, that update may get lost during
power loss.  if block N contains, say, directory information, this could
cause weird behaviour.  it may look fine at first -- the problem won't
appear until NTFS has thrown block N out of its cache and it needs to
re-read it from the server.  when the re-read stale data is combined
with fresh data from RAM, it's panic time...

> Would best practice be to rollback the last snapshot before making
> those iSCSI available again?

I think you need to reboot the client so that its RAM cache is cleared
before any other writes are made.

a rollback shouldn't be necessary.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-07 Thread Kjetil Torgrim Homme
Tim Cook  writes:
> Kjetil Torgrim Homme  wrote:
>>I don't know what the J4500 drive sled contains, but for the J4200
>>and J4400 they need to include quite a bit of circuitry to handle
>>SAS protocol all the way, for multipathing and to be able to
>>accept a mix of SAS and SATA drives.  it's not just a piece of
>>sheet metal, some plastic and a LED.
>>
>>the pricing does look strange, and I think it would be better to
>>raise the price of the enclosure (which is silly cheap when empty
>>IMHO) and reduce the drive prices somewhat.  but that's just
>>psychology, and doesn't really matter for total cost.
>
> Why exactly would that be better?

people are looking at the price list and seeing that the J4200 costs
22550 NOK [1], while one sixpack of 2TB SATA disks to go with it costs
82500 NOK.  on the other hand you could get six 2TB SATA disks
(Ultrastars) from your friendly neighbourhood shop for 14370 NOK (7700
NOK for six Deskstars).  and to add insult to injury, the neighbourhood
shop offers five years warranty (required by Norwegian consumer laws),
compared to Sun's three years...

everyone knows the price of a harddisk since they buy them for their
home computers.  do they know the price of a disk storage array?  not so
well.  yes, it's a matter of perception for the buyer, but perception
can matter.

> Then it's a high cost of entry.   What if an SMB only needs 6 drives
> day one?  Why charge them an arm and a leg for the enclosure, and
> nothing for the drives?  Again, the idea is that you're charging based
> on capacity.

see my numbers above.  the chassis itself is just 12% of the cost (22%
when half full).  some middle ground could be found.

anyway, we're buying these systems and are very happy with them.  when
disks fail, Sun replaces them promptly and with a minimum of fuss.


[1] all prices include VAT to simplify comparison.  prices are current
from shop.sun.com and komplett.no.  Sun list prices are subject to
haggling.
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-07 Thread Kjetil Torgrim Homme
Darren J Moffat  writes:

> That disables the ZIL for *all* datasets on *all* pools on the system.
> Doing this means that for NFS client or other applications (maybe
> local) that rely on the POSIX synchronus requirements of fsync they
> may see data loss on a crash.  Note that the ZFS pool is still
> consistent on disk but the application that is flushing writes
> synchonusly may have missing data on recovery from the crash.
>
> NEVER turn off the ZIL other than for testing on dummy data whether or
> not the ZIL is your bottleneck.  NEVER turn off the ZIL on live data
> pools.

I think that is a bit too strong, I'd say "NEVER turn off the ZIL when
external clients depend on stable storage promises".  that is, don't
turn ZIL off when your server is used for CIFS, NFS, iSCSI or even
services on a higher level, such as a web server running a storefront,
where customers expect a purchase to be followed through...

I've disabled ZIL on my server which is running our backup software
since there are no promises made to external clients.  the backup jobs
will be aborted, and the PostgreSQL database may lose a few transactions
after the reboot, but I won't lose more than 30 seconds of stored data.
this is quite unproblematic, especially compared to the disruption of
the crash itself.
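
for the record, the system-wide switch being discussed here looks
something like this (a sketch -- use at your own risk, and only on
builds where the zil_disable tunable still exists; IIRC the mdb change
only takes effect for datasets mounted after it is made):

  echo "zil_disable/W 1" | mdb -kw    # on a running system
  set zfs:zil_disable = 1             # persistently, in /etc/system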

(in my setup I potentially have to manually invalidate any backup
jobs which finished within the last 30 seconds of the crash.  this is
due to the database running on a different storage pool than the backup
data, so the point in time for the commitment of backup data and backup
metadata to stable storage may/will diverge.  "luckily" the database
transactions usually take 30+ seconds, so this is not a problem in
practice...)

of course, doing this analysis of your software requires in-depth
knowledge of both the software stack and the workings of ZFS, so I can
understand Sun^H^H^HOracle employees stick to the simple advice "NEVER
disable ZIL".

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] list new files/activity monitor

2010-02-06 Thread Kjetil Torgrim Homme
"Nilsen, Vidar"  writes:

> And once an hour I run a script that checks for new dirs last 60
> minutes matching some criteria, and outputs the path to an
> IRC-channel. Where we can see if someone else has added new stuff.
>
> Method used is "find -mmin -60", which gets horribly slow when more
> data is added.
>
> My question is if there exists some method I can get the same results
> but based on events rather than seeking through everything once an
> hour.

yes, File Events Notification (FEN)

  http://blogs.sun.com/praks/entry/file_events_notification

you access this through the event port API.

  http://developers.sun.com/solaris/articles/event_completion.html

gnome-vfs uses FEN, but unfortunately gnomevfs-monitor will only watch a
specific directory.  I think you'll need to write your own code to watch
all directories in a tree.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-06 Thread Kjetil Torgrim Homme
Frank Cusack  writes:

> On 2/4/10 8:00 AM +0100 Tomas Ögren wrote:
>> The "find -newer blah" suggested in other posts won't catch newer
>> files with an old timestamp (which could happen for various reasons,
>> like being copied with kept timestamps from somewhere else).
>
> good point.  that is definitely a restriction with find -newer.  but
> if you meet that restriction, and don't need to find added or deleted
> files, it will be faster since only 1 directory tree has to be walked.

FWIW, GNU find has -cnewer

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-06 Thread Kjetil Torgrim Homme
matthew patton  writes:

> true. but I buy a Ferrari for the engine and bodywork and chassis
> engineering. It is totally criminal what Sun/EMC/Dell/Netapp do
> charging customers 10x the open-market rate for standard drives. A
> RE3/4 or NS drive is the same damn thing no matter if I buy it from
> ebay or my local distributor. Dell/Sun/Netapp buy drives by the
> container load. Oh sure, I don't mind paying an extra couple
> pennies/GB for all the strenuous efforts the vendors spend on firmware
> verification (HA!).

I don't know what the J4500 drive sled contains, but for the J4200 and
J4400 they need to include quite a bit of circuitry to handle SAS
protocol all the way, for multipathing and to be able to accept a mix of
SAS and SATA drives.  it's not just a piece of sheet metal, some plastic
and a LED.

the pricing does look strange, and I think it would be better to raise
the price of the enclosure (which is silly cheap when empty IMHO) and
reduce the drive prices somewhat.  but that's just psychology, and
doesn't really matter for total cost.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 3ware 9650 SE

2010-02-06 Thread Kjetil Torgrim Homme
Alexandre MOREL  writes:

> It's a few day now that I try to use a 9650SE 3ware controller to work
> on opensolaris and I found the following problem: the tw driver seems
> to work, I can see my controller with the tw_cli of 3ware. I can see
> that 2 drives are created with the controller, but when I try to use
> "pfexec format", it doesn't detect the drive.

did you create logical devices using tw_cli?  a pity none of these cards
seem to support proper JBOD mode.  in 9650's case it's especially bad,
since pulling a drive will *renumber* the logical devices without
notifying the OS.  quite scary!

if you've created the devices, try running devfsadm once more.
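
something like this should make the new logical units show up (a
sketch):

  pfexec devfsadm -Cv -c disk
  pfexec format < /dev/null      # just lists the disks and exits
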
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS compressed ration inconsistency

2010-02-01 Thread Kjetil Torgrim Homme
antst  writes:

> I'm more than happy by the fact that data consumes even less physical
> space on storage.  But I want to understand why and how. And want to
> know to what numbers I can trust.

my guess is sparse files.

BTW, I think you should compare the size returned from "du -bx" with
"refer", not "used".  in this case it's not snapshots which makes the
difference, though.
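
e.g. (a sketch, dataset and mountpoint made up; -Hp makes the zfs
numbers exact and script-friendly):

  /usr/gnu/bin/du -sbx /volumes/data
  zfs get -Hp referenced,used,compressratio tank/data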

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 3ware 9650 SE

2010-02-01 Thread Kjetil Torgrim Homme
Tiernan O'Toole  writes:

> looking at the 3ware 9650 SE raid controller for a new build... anyone
> have any luck with this card? their site says they support
> OpenSolaris... anyone used one?

didn't work too well for me.  it's fast and nice for a couple of days,
then the driver gets slower and slower, and eventually it gets stuck and
all I/O freezes.  preventive reboots were needed.  I used the newest
driver from 3ware/AMCC with 2008.11 and 2009.05.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?

2010-01-31 Thread Kjetil Torgrim Homme
Mark Bennett  writes:

> Update:
>
> For the WD10EARS, the blocks appear to be aligned on the 4k boundary
> when zfs uses the whole disk (whole disk as EFI partition).
>
> Part  TagFlag First Sector Size Last Sector
>  0usrwm256   931.51Gb  1953508750
>  
>  calc256*512/4096=32

I'm afraid this isn't enough.  if you enable compression, any ZFS write
can be unaligned.  also, if you're using raid-z with an odd number of
data disks, some of (most of?) your stripes will be misaligned.

ZFS needs to use 4096 octets as the basic block to fully exploit
the performance of these disks.
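
you can check what sector size a pool was built with by looking at its
ashift value (a sketch; ashift=9 means 512 octet blocks, ashift=12 would
mean 4096):

  zdb | grep ashift
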
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Building big cheap storage system. What hardware to use?

2010-01-28 Thread Kjetil Torgrim Homme
Freddie Cash  writes:

> We use the following for our storage servers:
> [...]
> 3Ware 9650SE PCIe RAID controller (12-port, muli-lane)
> [...]
> Fully supported by FreeBSD, so everything should work with
> OpenSolaris.

FWIW, I've used the 9650SE with 16 ports in OpenSolaris 2008.11 and
2009.06, and had problems with the driver just hanging after 4-5 days of
use.  iostat would report 100% busy on all drives connected to the card,
and even "uadmin 1 1" (low-level reboot command) was ineffective.  I had
to break into the debugger and do the reboot from there.  I was using
the newest driver from AMCC.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zero out block / sectors

2010-01-25 Thread Kjetil Torgrim Homme
Mike Gerdts  writes:

> Kjetil Torgrim Homme wrote:
>> Mike Gerdts  writes:
>>
>>> John Hoogerdijk wrote:
>>>> Is there a way to zero out unused blocks in a pool?  I'm looking for
>>>> ways to shrink the size of an opensolaris virtualbox VM and using the
>>>> compact subcommand will remove zero'd sectors.
>>>
>>> I've long suspected that you should be able to just use mkfile or "dd
>>> if=/dev/zero ..." to create a file that consumes most of the free
>>> space then delete that file.  Certainly it is not an ideal solution,
>>> but seems quite likely to be effective.
>>
>> you'll need to (temporarily) enable compression for this to have an
>> effect, AFAIK.
>>
>> (dedup will obviously work, too, if you dare try it.)
>
> You are missing the point.  Compression and dedup will make it so that
> the blocks in the devices are not overwritten with zeroes.  The goal
> is to overwrite the blocks so that a back-end storage device or
> back-end virtualization platform can recognize that the blocks are not
> in use and as such can reclaim the space.

aha, I was assuming the OP's VirtualBox image was stored on ZFS, but I
realise now that it's the other way around -- he's running ZFS inside a
VirtualBox image hosted on a traditional filesystem.  in that case
you're right, and I'm wrong :-)
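
for completeness, a rough sketch of that workflow (image name made up,
and the VBoxManage syntax may vary between versions): zero out the free
space inside the guest, then compact the image on the host:

  dd if=/dev/zero of=/zerofile bs=1024k; sync; rm /zerofile    # in the guest
  VBoxManage modifyhd opensolaris.vdi --compact                # on the host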

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zero out block / sectors

2010-01-25 Thread Kjetil Torgrim Homme
Mike Gerdts  writes:

> John Hoogerdijk wrote:
>> Is there a way to zero out unused blocks in a pool?  I'm looking for
>> ways to shrink the size of an opensolaris virtualbox VM and using the
>> compact subcommand will remove zero'd sectors.
>
> I've long suspected that you should be able to just use mkfile or "dd
> if=/dev/zero ..." to create a file that consumes most of the free
> space then delete that file.  Certainly it is not an ideal solution,
> but seems quite likely to be effective.

you'll need to (temporarily) enable compression for this to have an
effect, AFAIK.

(dedup will obviously work, too, if you dare try it.)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] optimise away COW when rewriting the same data?

2010-01-24 Thread Kjetil Torgrim Homme
David Magda  writes:

> On Jan 24, 2010, at 10:26, Kjetil Torgrim Homme wrote:
>
>> But it occured to me that this is a special case which could be
>> beneficial in many cases -- if the filesystem uses secure checksums,
>> it could check the existing block pointer and see if the replaced
>> data matches.  [...]
>>
>> Are there any ZFS hackers who can comment on the feasibility of this
>> idea?
>
> There is a bug that requests an API in ZFS' DMU library to get
> checksum data:
>
>   6856024 - DMU checksum API
>   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6856024

That would work, but it would require rsync to do checksum calculations
itself to do the comparison.  Then ZFS would recalculate the checksum if
the data was actually written, so it's wasting work for local copies.
It would be interesting to extend the rsync protocol to take advantage
of this, though, so that the checksum can be calculated on the remote
host.  Hmm...  It would need very ZFS-specific support, e.g., the
recordsize is potentially different for each file, likewise for the
checksum algorithm.

Fixing a library seems easier than patching the kernel, so your approach
is probably better anyhow.

> It specifically mentions Lustre, and not anything like the ZFS POSIX
> interface to files (ZPL). There's also:
>
>> Here's another: file comparison based on values derived from files'
>> checksum or dnode block pointer. This would allow for very efficient
>> file comparison between filesystems related by cloning. Such values
>> might be made available through an extended attribute, say.
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6366224
>
> It's been brought up before on zfs-discuss: the two options would be
> linking against some kind of ZFS-specific library, or using an ioctl()
> of some kind. As it stands, ZFS is really the only mainstream(-ish)
> file system that does checksums, and so there's no standard POSIX call
> for such things. Perhaps as more file systems add this functionality
> something will come of it.

The checksum algorithms need to be very strictly specified.  Not a
problem for sha256, I guess, but fletcher4 probably doesn't have an
independent implementation which is 100% compatible with ZFS -- and
GPL-licensed (needed for rsync and many other applications).

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best 1.5TB drives for consumer RAID?

2010-01-24 Thread Kjetil Torgrim Homme
Tim Cook  writes:

> On Sat, Jan 23, 2010 at 7:57 PM, Frank Cusack  wrote:
>
> I mean, just do a triple mirror of the 1.5TB drives rather than
> say (6) .5TB drives in a raidz3.
>
> I bet you'll get the same performance out of 3x1.5TB drives you get
> out of 6x500GB drives too.

no, it will be much better.  you get 3 independent disks available for
reading, so 3x the IOPS.  in a 6x500 GB setup all disks will need to
operate in tandem, for both reading and writing.  even if the larger
disks are slower than small disks, the difference is not even close to
such a factor.  perhaps 20% fewer IOPS?

> Are you really trying to argue people should never buy anything but
> the largest drives available?

I don't think that's Frank's point.  the key here is the advantages of a
(triple) mirroring over RAID-Z.  it just so happens that it makes
economic sense.  (think about savings in power, too.)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] optimise away COW when rewriting the same data?

2010-01-24 Thread Kjetil Torgrim Homme
I was looking at the performance of using rsync to copy some large files
which change only a little between each run (database files).  I take a
snapshot after every successful run of rsync, so when using rsync
--inplace, only changed portions of the file will occupy new disk space.
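
For reference, the kind of invocation I mean (a sketch -- host and paths
are made up):

  rsync -a --inplace /srv/db/ backuphost:/tank/db/
  ssh backuphost zfs snapshot tank/db@$(date +%Y-%m-%d)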

Unfortunately, performance wasn't too good, the source server in
question simply didn't have much CPU to perform the rsync delta
algorithm, and in addition it creates read I/O load on the destination
server.  So I had to switch it off and transfer the whole file instead.
In this particular case, that means I need 120 GB to store each run
rather than 10, but that's the way it goes.

If I had enabled deduplication, this would be a moot point, dedup would
take care of it for me.  Judging from early reports my server
will probably not have the required oomph to handle it, so I'm holding
off until I get to replace it with a server with more RAM and CPU.

But it occured to me that this is a special case which could be
beneficial in many cases -- if the filesystem uses secure checksums, it
could check the existing block pointer and see if the replaced data
matches.  (Due to the (infinitesimal) potential for hash collisions this
should be configurable the same way it is for dedup.)  In essence,
rsync's writes would become no-ops, and very little CPU would be wasted
on either side of the pipe.

Even in the absence of snapshots, this would leave the filesystem less
fragmented, since the COW is avoided.  This would be a win-win if the
ZFS pipeline can communicate the correct information between layers.

Are there any ZFS hackers who can comment on the feasibility of this
idea?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-11 Thread Kjetil Torgrim Homme
Lutz Schumann  writes:

> Actually the performance decrease when disabling the write cache on
> the SSD is approx 3x (aka 66%).

for this reason, you want a controller with battery backed write cache.
in practice this means a RAID controller, even if you don't use the RAID
functionality.  of course you can buy SSDs with capacitors, too, but I
think that will be more expensive, and it will restrict your choice of
model severely.

(BTW, thank you for testing forceful removal of power.  the result is as
expected, but it's good to see that theory and practice match.)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-05 Thread Kjetil Torgrim Homme
Brad  writes:

> Hi Adam,

I'm not Adam, but I'll take a stab at it anyway.

BTW, your crossposting is a bit confusing to follow, at least when using
gmane.org.  I think it is better to stick to one mailing list anyway?

> From your the picture, it looks like the data is distributed evenly
> (with the exception of parity) across each spindle then wrapping
> around again (final 4K) - is this one single write operation or two?

it is a single write operation per device.  actually, it may be "less
than" one write operation since the transaction group, which probably
contains many more updates, is written as a whole.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedup existing data

2009-12-18 Thread Kjetil Torgrim Homme
Anil  writes:

> If you have another partition with enough space, you could technically
> just do:
>
> mv src /some/other/place
> mv /some/other/place src
>
> Anyone see a problem with that? Might be the best way to get it
> de-duped.

I get uneasy whenever I see mv(1) used to move directory trees between
filesystems, that is, whenever mv(1) can't do a simple rename(2), but
has to do a recursive copy of files.  it is essentially not restartable:
if mv(1) is interrupted, you must clean up the mess with rsync or
similar tools.  so why not use rsync from the get go?  (or zfs send/recv
of course.)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-18 Thread Kjetil Torgrim Homme
Darren J Moffat  writes:

> Kjetil Torgrim Homme wrote:
>
>> I don't know how tightly interwoven the dedup hash tree and the block
>> pointer hash tree are, or if it is all possible to disentangle them.
>
> At the moment I'd say very interwoven by design.
>
>> conceptually it doesn't seem impossible, but that's easy for me to
>> say, with no knowledge of the zio pipeline...
>
> Correct it isn't impossible but instead there would probably need to
> be two checksums held, one of the untransformed data (ie uncompressed
> and unencrypted) and one of the transformed data (compressed and
> encrypted). That has different tradeoffs and SHA256 can be expensive
> too see:
>
> http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via

great work!  SHA256 is more expensive than I thought, even with
misc/sha2 it takes 1 ms per 128 KiB?  that's roughly the same CPU usage
as lzjb!  in that case hashing the (smaller) compressed data is more
efficient than doing an additional hash of the full uncompressed block.

it's interesting to note that 64 KiB looks faster (a bit hard to read
the chart accurately), L1 cache size coming into play, perhaps?

> Note also that the compress/encrypt/checksum and the dedup are
> separate pipeline stages so while dedup is happening for block N block
> N+1 can be getting transformed - so this is designed to take advantage
> of multiple scheduling units (threads,cpus,cores etc).

nice.  are all of them separate stages, or are compress/encrypt/checksum
done as one stage?

>> oh, how does encryption play into this?  just don't?  knowing that
>> someone else has the same block as you is leaking information, but that
>> may be acceptable -- just make different pools for people you don't
>> trust.
>
> compress, encrypt, checksum, dedup.
>
> You are correct that it is an information leak but only within a
> dataset and its clones and only if you can observe the deduplication
> stats (and you need to use zdb to get enough info to see the leak -
> and that means you have access to the raw devices), the dedupratio
> isn't really enough unless the pool is really idle or has only one
> user writing at a time.
>
> For the encryption case deduplication of the same plaintext block will
> only work within a dataset or a clone of it - because only in those
> cases do you have the same key (and the way I have implemented the IV
> generation for AES CCM/GCM mode ensures that the same plaintext will
> have the same IV so the ciphertexts will match).

makes sense.

> Also if you place a block in an unencrypted dataset that happens to
> match the ciphertext in an encrypted dataset they won't dedup either
> (you need to understand what I've done with the AES CCM/GCM MAC and
> the zio_chksum_t field in the blkptr_t and how that is used by dedup
> to see why).

wow, I didn't think of that problem.  did you get bitten by wrongful
dedup during testing with image files? :-)

> If that small information leak isn't acceptable even within the
> dataset then don't enable both encryption and deduplication on those
> datasets - and don't delegate that property to your users either.  Or
> you can frequently rekey your per dataset data encryption keys 'zfs
> key -K' but then you might as well turn dedup off - though there are
> some very good use cases in multi-level security where doing
> dedup/encryption and rekey provides a nice effect.

indeed.  ZFS is extremely flexible.

thank you for your response, it was very enlightening.
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-17 Thread Kjetil Torgrim Homme
Darren J Moffat  writes:
> Kjetil Torgrim Homme wrote:
>> Andrey Kuzmin  writes:
>>
>>> Downside you have described happens only when the same checksum is
>>> used for data protection and duplicate detection. This implies sha256,
>>> BTW, since fletcher-based dedupe has been dropped in recent builds.
>>
>> if the hash used for dedup is completely separate from the hash used
>> for data protection, I don't see any downsides to computing the dedup
>> hash from uncompressed data.  why isn't it?
>
> It isn't separate because that isn't how Jeff and Bill designed it.

thanks for confirming that, Darren.

> I think the design the have is great.

I don't disagree.

> Instead of trying to pick holes in the theory can you demonstrate a
> real performance problem with compression=on and dedup=on and show
> that it is because of the compression step ?

compression requires CPU, actually quite a lot of it.  even with the
lean and mean lzjb, you won't get much more than 150 MB/s per core or
something like that.  so, if you're copying a 10 GB image file, it will
take a minute or two just to compress the data so that the hash can be
computed and the duplicate blocks identified.  if the dedup
hash was based on uncompressed data, the copy would be limited by
hashing efficiency (and dedup tree lookup).
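
(the back-of-envelope behind "a minute or two", assuming that ~150 MB/s
figure and a single core doing the compression:

  10 GB  =~  10240 MB;  10240 MB / 150 MB/s  =~  68 seconds

with hashing and dedup table lookups coming on top of that.)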

I don't know how tightly interwoven the dedup hash tree and the block
pointer hash tree are, or if it is at all possible to disentangle them.

conceptually it doesn't seem impossible, but that's easy for me to
say, with no knowledge of the zio pipeline...

oh, how does encryption play into this?  just don't?  knowing that
someone else has the same block as you is leaking information, but that
may be acceptable -- just make different pools for people you don't
trust.

> Otherwise if you want it changed code it up and show how what you have
> done is better in all cases.

I wish I could :-)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-17 Thread Kjetil Torgrim Homme
Andrey Kuzmin  writes:

> Downside you have described happens only when the same checksum is
> used for data protection and duplicate detection. This implies sha256,
> BTW, since fletcher-based dedupe has been dropped in recent builds.

if the hash used for dedup is completely separate from the hash used for
data protection, I don't see any downsides to computing the dedup hash
from uncompressed data.  why isn't it?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Kjetil Torgrim Homme
Andrey Kuzmin  writes:
> Darren J Moffat wrote:
>> Andrey Kuzmin wrote:
>>> Resilvering has noting to do with sha256: one could resilver long
>>> before dedupe was introduced in zfs.
>>
>> SHA256 isn't just used for dedup it is available as one of the
>> checksum algorithms right back to pool version 1 that integrated in
>> build 27.
>
> 'One of' is the key word. And thanks for code pointers, I'll take a
> look.

I didn't mention sha256 at all :-).  the reasoning is the same no matter
what hash algorithm you're using (fletcher2, fletcher4 or sha256).  dedup
doesn't require sha256 either, you can use fletcher4.

the question was: why does data have to be compressed before it can be
recognised as a duplicate?  it does seem like a waste of CPU, no?  I
attempted to show the downsides to identifying blocks by their
uncompressed hash.  (BTW, it doesn't affect storage efficiency, the same
duplicate blocks will be discovered either way.)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Kjetil Torgrim Homme
Andrey Kuzmin  writes:
> Yet again, I don't see how RAID-Z reconstruction is related to the
> subject discussed (what data should be sha256'ed when both dedupe and
> compression are enabled, raw or compressed ). sha256 has nothing to do
> with bad block detection (may be it will when encryption is
> implemented, but for now sha256 is used for duplicate candidates
> look-up only).

how do you think RAID-Z resilvering works?  please correct me where I'm
wrong.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Kjetil Torgrim Homme
Andrey Kuzmin  writes:

> Kjetil Torgrim Homme wrote:
>> for some reason I, like Steve, thought the checksum was calculated on
>> the uncompressed data, but a look in the source confirms you're right,
>> of course.
>>
>> thinking about the consequences of changing it, RAID-Z recovery would be
>> much more CPU intensive if hashing was done on uncompressed data --
>
> I don't quite see how dedupe (based on sha256) and parity (based on
> crc32) are related.

I tried to hint at an explanation:

>> every possible combination of the N-1 disks would have to be
>> decompressed (and most combinations would fail), and *then* the
>> remaining candidates would be hashed to see if the data is correct.

the key is that you don't know which block is corrupt.  if everything is
hunky-dory, the parity will match the data.  parity in RAID-Z1 is not a
checksum like CRC32, it is simply XOR (like in RAID 5).  here's an
example with four data disks and one parity disk:

  D1  D2  D3  D4  PP
  00  01  10  10  01

this is a single stripe with 2-bit disk blocks for simplicity.  if you
XOR together all the blocks, you get 00.  that's the simple premise for
reconstruction -- D1 = XOR(D2, D3, D4, PP), D2 = XOR(D1, D3, D4, PP) and
so on.

so what happens if a bit flips in D4 and it becomes 00?  the total XOR
isn't 00 anymore, it is 10 -- something is wrong.  but unless you get a
hardware signal from D4, you don't know which block is corrupt.  this is
a major problem with RAID 5, the data is irrevocably corrupt.  the
parity discovers the error, and can alert the user, but that's the best
it can do.  in RAID-Z the hash saves the day: first *assume* D1 is bad
and reconstruct it from parity.  if the hash for the block is OK, D1
*was* bad.  otherwise, assume D2 is bad.  and so on.
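
to make the trial-and-error concrete, here is a toy sketch in plain bash
(nothing to do with the real ZFS code) using the 2-bit blocks from the
example above, with D4 silently corrupted:

  #!/usr/bin/env bash
  # toy model of single-parity trial reconstruction -- NOT real ZFS code.
  # the block "checksum" is faked as the known-good data values.
  good=(0 1 2 2)       # D1..D4 = 00 01 10 10
  disk=(0 1 2 0)       # what we read back: D4 silently flipped 10 -> 00
  parity=1             # PP = 01, the XOR of the good data

  checksum_ok() { [[ "$*" == "${good[*]}" ]]; }

  for bad in 0 1 2 3; do              # assume disk $bad is the broken one
    x=$parity
    for i in 0 1 2 3; do
      (( i != bad )) && (( x ^= disk[i] ))
    done
    try=("${disk[@]}"); try[bad]=$x   # rebuild the suspect from the others
    if checksum_ok "${try[@]}"; then
      echo "D$((bad + 1)) was corrupt, reconstructed value: ${try[bad]}"
      break
    fi
  done

it reports that D4 was the bad one and that the reconstructed value is 2
(binary 10), matching the walkthrough above.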

so, the parity calculation will indicate which stripes contain bad
blocks.  but the hashing, the sanity check for which disk blocks are
actually bad, must be calculated over all the stripes a ZFS block
(record) consists of.

>> this would be done on a per recordsize basis, not per stripe, which
>> means reconstruction would fail if two disk blocks (512 octets) on
>> different disks and in different stripes go bad.  (doing an exhaustive
>> search for all possible permutations to handle that case doesn't seem
>> realistic.)

actually this is the same for compression before/after hashing.  it's
just that each permutation is more expensive to check.

>> in addition, hashing becomes slightly more expensive since more data
>> needs to be hashed.
>>
>> overall, my guess is that this choice (made before dedup!) will give
>> worse performance in normal situations in the future, when dedup+lzjb
>> will be very common, at a cost of faster and more reliable resilver.  in
>> any case, there is not much to be done about it now.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-15 Thread Kjetil Torgrim Homme
Robert Milkowski  writes:
> On 13/12/2009 20:51, Steve Radich, BitShop, Inc. wrote:
>> Because if you can de-dup anyway why bother to compress THEN check?
>> This SEEMS to be the behaviour - i.e. I would suspect many of the
>> files I'm writing are dups - however I see high cpu use even though
>> on some of the copies I see almost no disk writes.
>
> First, the checksum is calculated after compression happens.

for some reason I, like Steve, thought the checksum was calculated on
the uncompressed data, but a look in the source confirms you're right,
of course.

thinking about the consequences of changing it, RAID-Z recovery would be
much more CPU intensive if hashing was done on uncompressed data --
every possible combination of the N-1 disks would have to be
decompressed (and most combinations would fail), and *then* the
remaining candidates would be hashed to see if the data is correct.

this would be done on a per recordsize basis, not per stripe, which
means reconstruction would fail if two disk blocks (512 octets) on
different disks and in different stripes go bad.  (doing an exhaustive
search for all possible permutations to handle that case doesn't seem
realistic.)

in addition, hashing becomes slightly more expensive since more data
needs to be hashed.

overall, my guess is that this choice (made before dedup!) will give
worse performance in normal situations in the future, when dedup+lzjb
will be very common, at a cost of faster and more reliable resilver.  in
any case, there is not much to be done about it now.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Confusion regarding 'zfs send'

2009-12-11 Thread Kjetil Torgrim Homme
Brandon High  writes:

> Matthew Ahrens  wrote:
>> Well, changing the "compression" property doesn't really interrupt
>> service, but I can understand not wanting to have even a few blocks
>> with the "wrong"
>
> I was thinking of sharesmb or sharenfs settings when I wrote that.
> Toggling them for the send would interrupt any clients trying to use
> the resource. Obviously disabling compression or dedup on the source
> doesn't interrupt service

you can avoid the problem by making sure the target filesystem isn't
mounted, i.e., by setting canmount=noauto on the source.  it's a bit
ugly, since you'll get an outage if the source server reboots before you
set it back.
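
concretely, something along these lines (names made up, -R so the
property rides along with the stream -- adjust to taste):

  # zfs set canmount=noauto tank/export/data
  # zfs snapshot -r tank/export/data@mig
  # zfs send -R tank/export/data@mig | ssh backuphost zfs receive -d backup
  # zfs set canmount=on tank/export/data   # remember this, or a reboot hurts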

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] will deduplication know about old blocks?

2009-12-09 Thread Kjetil Torgrim Homme
Adam Leventhal  writes:
> Unfortunately, dedup will only apply to data written after the setting
> is enabled. That also means that new blocks cannot dedup against old
> block regardless of how they were written. There is therefore no way
> to "prepare" your pool for dedup -- you just have to enable it when
> you have the new bits.

thank you for the clarification!
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] will deduplication know about old blocks?

2009-12-09 Thread Kjetil Torgrim Homme
I'm planning to try out deduplication in the near future, but started
wondering if I can prepare for it on my servers.  one thing which struck
me was that I should change the checksum algorithm to sha256 as soon as
possible.  but I wonder -- is that sufficient?  will the dedup code know
about old blocks when I store new data?

let's say I have an existing file img0.jpg.  I turn on dedup, and copy
it twice, to img0a.jpg and img0b.jpg.  will all three files refer to the
same block(s), or will only img0a and img0b share blocks?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Accidentally added disk instead of attaching

2009-12-08 Thread Kjetil Torgrim Homme
Daniel Carosone  writes:

>>> Not if you're trying to make a single disk pool redundant by adding
>>> ..  er, attaching .. a mirror; then there won't be such a warning,
>>> however effective that warning might or might not be otherwise.
>>
>> Not a problem because you can then detach the vdev and add it.
>
> It's a problem if you're trying to do that, but end up adding instead
> of attaching, which you can't (yet) undo.

at least in that case the amount of data shuffling you have to do is
limited to one disk (it's unlikely you'd make this mistake with a
multi-device vdev).

in any case, the block rewrite implementation isn't *that* far away, is
it?
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] nodiratime support in ZFS?

2009-12-07 Thread Kjetil Torgrim Homme
I was catching up on old e-mail on this list, and came across a blog
posting from Henrik Johansson:

  http://sparcv9.blogspot.com/2009/10/curious-case-of-strange-arc.html

it tells of his woes with a fragmented /var/pkg/downloads combined
with atime updates.  I see the same problem on my servers, e.g. 

  $ time du -s /var/pkg/download
  1614308 /var/pkg/download
  real    11m50.682s

  $ time du -s /var/pkg/download
  1614308 /var/pkg/download
  real    12m03.395s

on this server, increasing arc_meta_limit wouldn't help, but I think
a newer kernel would be more aggressive (this is 2008.11).

  arc_meta_used  =   262 MB
  arc_meta_limit =  2812 MB
  arc_meta_max   =   335 MB

turning off atime helps:

  real    8m06.563s

in this test case, running du(1), turning off atime altogether isn't
really needed; it would suffice to turn off atime updates on
directories.  in Linux, this can be achieved with the mount option
"nodiratime".  if ZFS had it, I guess it would be a new value for the
atime property, "nodir" or somesuch.
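
spelled out (the zfs line is purely hypothetical -- no such value exists
today -- while the mount line is the existing Linux knob):

  # zfs set atime=nodir tank/export/home       # hypothetical
  # mount -o remount,nodiratime /export/home   # Linux, works today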

I quite often find it useful to have access to atime information to see
if files have been read, for forensic purposes, for debugging, etc. so I
am loath to turn it off.  however, atime on directories can hardly ever
be used for anything -- you have to take really good care not to trigger
an update just checking the atime, and even if you do get a proper
reading, there are so many tree traversing utilities that the
information value is low.  it is quite unlikely that any applications
break in a nodiratime mode, and few people should have any qualms
enabling it.

Santa, are you listening? :-)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128

2009-11-25 Thread Kjetil Torgrim Homme
Daniel Carosone  writes:

>> you can fetch the "cr_txg" (cr for creation) for a
>> snapshot using zdb,
>
> yes, but this is hardly an appropriate interface.

agreed.

> zdb is also likely to cause disk activity because it looks at many
> things other than the specific item in question.

I'd expect meta-information like this to fit comfortably in RAM over
extended amounts of time.  haven't tried, though.

>> but the very creation of a snapshot requires a new
>> txg to note that fact in the pool.
>
> yes, which is exactly what we're trying to avoid, because it requires
> disk activity to write.

you missed my point: you can't compare the current txg to an old cr_txg
directly, since the current txg value will be at least 1 higher, even if
no changes have been made.
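
a rough sketch of the adjusted test (zdb output is not a stable
interface, and the names are made up, so treat it as illustration only):

  cur=$(zdb -u tank | awk '$1 == "txg" { print $3; exit }')
  last=$(zdb tank/home@latest | sed -n 's/.*cr_txg \([0-9]*\),.*/\1/p')
  if [ "$cur" -le $(( $last + 1 )) ]; then
      echo "nothing written since @latest -- skip this snapshot round"
  fi

the "+ 1" accounts for the txg consumed by creating the snapshot itself.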

>> if the snapshot is taken recursively, all snapshots will have the
>> same cr_txg, but that requires the same configuration for all
>> filesets.
>
> again, yes, but that's irrelevant - the important knowledge at this
> moment is that the txg has not changed since last time, and that thus
> there will be no benefit in taking further snapshots, regardless of
> configuration.

yes, that's what we're trying to establish, and it's easier when
all snapshots are committed in the same txg.
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128

2009-11-24 Thread Kjetil Torgrim Homme
Daniel Carosone  writes:

>> I don't think it is easy to do, the txg counter is on
>> a pool level,
>> [..]
>> it would help when the entire pool is idle, though.
>
> .. which is exactly the scenario in question: when the disks are
> likely to be spun down already (or to spin down soon without further
> activity), and you want to avoid waking them up (or keeping them
> awake) with useless snapshot activity.

good point!

> However, this highlights that a (pool? fs?) property that exposes the
> current txg id (frozen in snapshots, as normal, if an fs property)
> might be enough for the userspace daemon to make its own decision to
> avoid requesting snapshots, without needing a whole discretionary
> mechanism in zfs itself.

you can fetch the "cr_txg" (cr for creation) for a snapshot using zdb,
but the very creation of a snapshot requires a new txg to note that fact
in the pool.  if there are several filesystems to snapshot, you'll get a
sequence of cr_txg, and they won't be adjacent.

  # zdb tank/te...@snap1
  Dataset tank/te...@snap1 [ZVOL], ID 78, cr_txg 872401, 4.03G, 3 objects
  # zdb -u tank
  txg = 872402
  timestamp = 1259064201 UTC = Tue Nov 24 13:03:21 2009
  # sync
  # zdb -u tank
  txg = 872402
  # zfs snapshot tank/te...@snap1
  # zdb tank/te...@snap1
  Dataset tank/te...@snap1 [ZVOL], ID 80, cr_txg 872419, 4.03G, 3 objects
  # zdb -u tank
  txg = 872420
  timestamp = 1259064641 UTC = Tue Nov 24 13:10:41 2009

if the snapshot is taken recursively, all snapshots will have the same
cr_txg, but that requires the same configuration for all filesets.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-raidz - simulate disk failure

2009-11-23 Thread Kjetil Torgrim Homme
sundeep dhall  writes:
> Q) How do I simulate a sudden 1-disk failure to validate that zfs /
> raidz handles things well without data errors
>
> Options considered
> 1. suddenly pulling a disk out 
> 2. using zpool offline
>
> I think both these have issues in simulating a sudden failure 

why not take a look at what HP's test department is doing and fire a
round through the disk with a rifle?  oh, I guess that won't be a
*simulation*.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128

2009-11-23 Thread Kjetil Torgrim Homme
Daniel Carosone  writes:

> Would there be a way to avoid taking snapshots if they're going to be
> zero-sized?

I don't think it is easy to do: the txg counter is on a pool level,
AFAIK:

  # zdb -u spool
  Uberblock

  magic = 00bab10c
  version = 13
  txg = 1773324
  guid_sum = 16611641539891595281
  timestamp = 1258992244 UTC = Mon Nov 23 17:04:04 2009

it would help when the entire pool is idle, though.

> (posted here, rather than in response to the mailing list reference
> given, because I'm not subscribed [...]

ditto.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic question about striping and ZFS

2009-11-23 Thread Kjetil Torgrim Homme
Kjetil Torgrim Homme  writes:
> Cindy Swearingen  writes:
>> You might check the slides on this page:
>>
>> http://hub.opensolaris.org/bin/view/Community+Group+zfs/docs
>>
>> Particularly, slides 14-18.
>>
>> In this case, graphic illustrations are probably the best way
>> to answer your questions.
>
> thanks, Cindy.  can you explain the meaning of the blocks marked X in
> the illustration on page 18?

I found the explanation in an older (2009-09-03) message to this list
from Adam Leventhal:

|   RAID-Z writes full stripes every time; note that without careful
|   accounting it would be possible to effectively fragment the vdev
|   such that single sectors were free but useless since single-parity
|   RAID-Z requires two adjacent sectors to store data (one for data,
|   one for parity). To address this, RAID-Z rounds up its allocation to
|   the next (nparity + 1).  This ensures that all space is accounted
|   for. RAID-Z will thus skip sectors that are unused based on this
|   rounding. For example, under raidz1 a write of 1024 bytes would
|   result in 512 bytes of parity, 512 bytes of data on two devices and
|   512 bytes skipped.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick drive slicing madness question

2009-11-09 Thread Kjetil Torgrim Homme
Darren J Moffat  writes:

> Mauricio Tavares wrote:
>> If I have a machine with two drives, could I create equal size slices
>> on the two disks, set them up as boot pool (mirror) and then use the
>> remaining space as a striped pool for other more wasteful
>> applications?
>
> You could but why bother ?  Why not just create one mirrored pool.

you get half the space available...  even if you don't forego redundancy
and use mirroring on both slices, you can't extend the data pool later.

> Having two pools on the same disk (or mirroring to the same disk) is
> asking for performance pain if both are being written to heavily.

not too common with heavy writing to rpool, is it?  the main source of
writing is syslog, I guess.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] how to destroy a pool by id?

2009-06-24 Thread Kjetil Torgrim Homme
Cindy Swearingen  writes:
> I wish we had a zpool destroy option like this:
>
> # zpool destroy -really_dead tank2

I think it would be clearer to call it

  zpool export --clear-name tank   # or 3280066346390919920

or alternatively,

  zpool destroy --exported 3280066346390919920

I guess the reason you don't want to allow operations on exported
pools is that they can be residing on shared storage.
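
(for completeness, the dance that works today, as far as I know, is to
import the pool by its numeric id under a throwaway name and then
destroy it:

  # zpool import 3280066346390919920 doomed
  # zpool destroy doomed

which of course doesn't help if the pool can't be imported at all.)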

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Kjetil Torgrim Homme
"Monish Shah"  writes:

>> I'd be interested to see benchmarks on MySQL/PostgreSQL performance
>> with compression enabled.  my *guess* would be it isn't beneficial
>> since they usually do small reads and writes, and there is little
>> gain in reading 4 KiB instead of 8 KiB.
>
> OK, now you have switched from compressibility of data to
> performance advantage.  As I said above, this kind of data usually
> compresses pretty well.

the thread has been about I/O performance since the first response, as
far as I can tell.

> I agree that for random reads, there wouldn't be any gain from
> compression. For random writes, in a copy-on-write file system,
> there might be gains, because the blocks may be arranged in
> sequential fashion anyway.  We are in the process of doing some
> performance tests to prove or disprove this.
>
> Now, if you are using SSDs for this type of workload, I'm pretty
> sure that compression will help writes.  The reason is that the
> flash translation layer in the SSD has to re-arrange the data and
> write it page by page.  If there is less data to write, there will
> be fewer program operations.
>
> Given that write IOPS rating in an SSD is often much less than read
> IOPS, using compression to improve that will surely be of great
> value.

not necessarily, since a partial SSD write is much more expensive than
a full block write (128 KiB?).  in a write intensive application, that
won't be an issue since the data is flowing steadily, but for the
right mix of random reads and writes, this may exacerbate the
bottleneck.

> At this point, this is educated guesswork.  I'm going to see if I
> can get my hands on an SSD to prove this.

that'd be great!

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Kjetil Torgrim Homme
"Fajar A. Nugraha"  writes:

> Kjetil Torgrim Homme wrote:
>> indeed.  I think only programmers will see any substantial benefit
>> from compression, since both the code itself and the object files
>> generated are easily compressible.
>
>>> Perhaps compressing /usr could be handy, but why bother enabling
>>> compression if the majority (by volume) of user data won't do
>>> anything but burn CPU?
>
> How do you define "substantial"? My opensolaris snv_111b installation
> has 1.47x compression ratio for "/", with the default compression.
> It's well worthed for me.

I don't really care if my "/" is 5 GB or 3 GB.  how much faster is
your system operating?  what's the compression rate on your data
areas?
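
(for reference, the number I'm after is the per-dataset compressratio,
e.g.

  $ zfs get -r compressratio tank/export

for whatever holds your data, not just "/".)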

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Kjetil Torgrim Homme
"David Magda"  writes:

> On Tue, June 16, 2009 15:32, Kyle McDonald wrote:
>
>> So the cache saves not only the time to access the disk but also
>> the CPU time to decompress. Given this, I think it could be a big
>> win.
>
> Unless you're in GIMP working on JPEGs, or doing some kind of MPEG
> video editing--or ripping audio (MP3 / AAC / FLAC) stuff. All of
> which are probably some of the largest files in most people's
> homedirs nowadays.

indeed.  I think only programmers will see any substantial benefit
from compression, since both the code itself and the object files
generated are easily compressible.

> 1 GB of e-mail is a lot (probably my entire personal mail collection
> for a decade) and will compress well; 1 GB of audio files is
> nothing, and won't compress at all.
>
> Perhaps compressing /usr could be handy, but why bother enabling
> compression if the majority (by volume) of user data won't do
> anything but burn CPU?
>
> So the correct answer on whether compression should be enabled by
> default is "it depends". (IMHO :) )

I'd be interested to see benchmarks on MySQL/PostgreSQL performance
with compression enabled.  my *guess* would be it isn't beneficial
since they usually do small reads and writes, and there is little gain
in reading 4 KiB instead of 8 KiB.
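
if someone wants to try, something like this would be a reasonable
starting point (names made up; 8 KiB recordsize to match the PostgreSQL
page size, InnoDB would want 16 KiB):

  # zfs create -o recordsize=8k -o compression=lzjb tank/pgbench
  # zfs create -o recordsize=8k -o compression=off tank/pgbench-plain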

what other use cases can benefit from compression?
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] disabling showmount -e behaviour

2009-05-27 Thread Kjetil Torgrim Homme
Roman V Shaposhnik  writes:

> I must admit that this question originates in the context of Sun's
> Storage 7210 product, which impose additional restrictions on the
> kind of knobs I can turn.
>
> But here's the question: suppose I have an installation where ZFS is
> the storage for user home directories. Since I need quotas, each
> directory gets to be its own filesystem. Since I also need these
> homes to be accessible remotely each FS is exported via NFS. Here's
> the question though: how do I prevent showmount -e (or a manually
> constructed EXPORT/EXPORTALL RPC request) to disclose a list of
> users that are hosted on a particular server?

I think the best you can do is to reject mount protocol requests
coming from "high" ports (1024+) in your firewall.  this means you
need root privileges (or a more specific capability) on the client to
fetch the list.

another option is to make the usernames opaque and anonymous, e.g.,
"u4233".

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Kjetil Torgrim Homme
Frank Middleton  writes:

> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...

we can't do much to protect ourselves from damage to the data itself
(an extra copy in RAM will help little and ruin performance).

damage to the bits holding the computed checksum before it is written
can be alleviated by doing the calculation independently for each
written copy.  in particular, this will help if the bit error is
transient.

since the number of octets in RAM holding the checksum is dwarfed by the
number of octets occupied by data (256 bits vs. one mebibit for a full
default-sized record), such a paranoia mode will
most likely tell you that the *data* is corrupt, not the checksum.
but today you don't know, so it's an improvement in my book.

> Quoting the ZFS admin guide: "The failmode property ... provides the
> failmode property for determining the behavior of a catastrophic
> pool failure due to a loss of device connectivity or the failure of
> all devices in the pool. ". Has this changed since the ZFS admin
> guide was last updated?  If not, it doesn't seem relevant.

I guess checksum error handling is orthogonal to this and should have
its own property.  it sure would be nice if the admin could ask the OS
to deliver the bits contained in a file, no matter what, and just log
the problem.

> Cheers -- Frank

thank you for pointing out this potential weakness in ZFS' consistency
checking; I didn't realise it was there.

also thank you, all ZFS developers, for your great job :-)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss