Re: [zfs-discuss] ZFS - Sudden decrease in write performance

2010-11-20 Thread Richard L. Hamilton
arc-discuss doesn't have anything specifically to do with ZFS;
in particular, it has nothing to do with the ZFS ARC.  Just an
unfortunate overlap of acronyms.

Cross-posted to zfs-discuss, where this probably belongs.


> Hey all!
> 
> Recently I've decided to implement OpenSolaris as a
> target for BackupExec.
> 
> The server I've converted into a "Storage Appliance"
> is an IBM x3650 M2 w/ ~4TB of on board storage via
> ~10 local SATA drives and I'm using OpenSolaris
> snv_134. I'm using a QLogic 4Gb FC HBA w/ the QLT
> driver and presented an 8TB sparse volume to the host
> due to dedup and compression being turned on for the
> zpool.
> 
> When writes begin, I see anywhere from 4.5GB/Min to
> 5.5GB/Min and then it drops off quickly (I mean down
> to 1GB/Min or less). I've already swapped out the
> card, cable, and port with no results. I have since
> ensured that every piece of equipment in the box had
> its firmware updated. While doing so, I installed
> Windows Server 2008 to flash all the firmware (IBM
> doesn't have a Solaris installer). 
> 
> While in Server 2008, I decided to just attempt a
> backup via share on the 1Gbs copper connection. I saw
> speeds of up to 5.5GB/Min consistently and they were
> sustained throughout 3 days of testing. Today I
> decided to move back to OpenSolaris with confidence.
> All writes began at 5.5GB/Min and quickly dropped
> off.
> 
> In my troubleshooting efforts, I have also dropped
> the fiber connection and made it an iSCSI target with
> no performance gains. I have let the on board RAID
> controller do the RAID portion instead of creating a
> zpool of multiple disks with no performance gains.
> And, I have created the target LUN using both rdsk
> and dsk paths.
> 
> I did notice today, though, that there is a direct
> correlation between the ARC memory usage and speed.
> Using arcstat.pl, as soon as arcsz hits 1G (half of the c
> column, which is the ARC target size), my throughput hits
> the floor (i.e. 600MB/Min or less). I can't figure it out.
> I've tried every configuration possible.
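
For what it's worth, an arcsz that tops out around 1-2GB suggests the ARC --
and, with dedup on, the dedup table it has to keep cached -- is being squeezed.
A rough way to check, and to raise the cap, as a sketch only: the pool name
"tank" and the 8GB figure are placeholders, and the /etc/system change takes
effect only after a reboot.

  # kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max
  # zdb -DD tank                                # rough size of the dedup table
  # echo "set zfs:zfs_arc_max = 0x200000000" >> /etc/system   # e.g. allow up to 8GB

If the dedup table plus the working set won't fit under c_max, writes with
dedup enabled can degenerate into random reads of the dedup table, which
would match the sudden drop-off described above.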
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs under medium load causes SMB to delay writes

2010-11-07 Thread Richard L. Hamilton
This is not the appropriate group/list for this message.
Crossposting to zfs-discuss (where it perhaps primarily
belongs) and to cifs-discuss, which also relates.

> Hi,
> 
> I have an I/O load issue and after days of searching
> wanted to know if anyone has pointers on how to
> approach this.
> 
> My 1-year stable zfs system (raidz3 8 2TB drives, all
> OK) just started to cause problems when I introduced
> a new backup script that puts medium I/O load. This
> script simply tars up a few filesystems and md5sums
> the tarball, to copy to another system for backup off
> the OpenSolaris box. The simple commands are:
> 
> tar -c /tank/[filesystem]/.zfs/snapshot/[snapshot] >
> /tank/[otherfilesystem]/file.tar
> md5sum -b /tank/[otherfilesystem]/file.tar >
> file.md5sum
> 
> These 2 commands obviously cause high read/write I/O
> because the 8 drives are directly reading and writing
> a large amount of data as fast as the system can go.
> This is OK and OpenSolaris functions fine.
> 
> The problem is I host VMWare images on another PC
> which access their images on this zfs box over smb,
> and during this high I/O period, these VMWare guests
> are crashing.
> 
> What I think is happening is during the backup with
> high I/O, zfs is delaying reads/writes to the VMWare
> images. This is quickly causing VMWare to freeze the
> guest machines. 
> 
> When the backup script is not running, the VMWare
> guests are fine, and have been fine for 1-year.
> (setup has been rock solid)
> 
> Any idea how to address this? I'd thought of putting the
> relevant filesystem (tank/vmware) at a higher
> priority for reads/writes, but haven't figured out
> how. 
> Another way is to deprioritize the backup somehow.
> 
> Any pointers would be appreciated.
> 
> Thanks,
> Tom
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sharesmb should be ignored if filesystem is not mounted

2010-11-04 Thread Richard L. Hamilton
> On 10/28/10 08:40 AM, Richard L. Hamilton wrote:
> > I have sharesmb=on set for a bunch of filesystems,
> > including three that weren't mounted.  Nevertheless,
> > all of those are advertised.  Needless to say,
> > the ones that aren't mounted can't be accessed remotely,
> > even though, being advertised, they look as though they could be.
> 
> When you say "advertised" do you mean that it appears in
> /etc/dfs/sharetab when the dataset is not mounted, and/or
> that you can see it from a client with 'net view'?
> 
> I'm using a recent build and I see the smb share disappear
> from both when the dataset is unmounted.

I could see it in Finder on a Mac client; presumably were
I on a Windows client, it would have appeared with "net view".
I've since turned off the sharesmb property on those filesystems,
so I may need to reboot (which I'd much rather not) to re-create
the problem.

But if recent builds don't have the problem, that's the main thing.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sharesmb should be ignored if filesystem is not mounted

2010-10-28 Thread Richard L. Hamilton
PS obviously these are home systems; in a real environment,
I'd only be sharing out filesystems with user or application
data, and not local system filesystems!  But since it's just
me, I somewhat trust myself not to shoot myself in the foot.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] sharesmb should be ignored if filesystem is not mounted

2010-10-28 Thread Richard L. Hamilton
I have sharesmb=on set for a bunch of filesystems,
including three that weren't mounted.  Nevertheless,
all of those are advertised.  Needless to say,
the ones that aren't mounted can't be accessed remotely,
even though, being advertised, they look as though they could be.

# zfs list -o name,mountpoint,sharesmb,mounted | awk '$(NF-1)!="off" && $(NF-1)!="-" && $NF!="yes"'
NAME   MOUNTPOINT  SHARESMB  MOUNTED
rpool/ROOT legacy  on no
rpool/ROOT/snv_129 /   on no
rpool/ROOT/snv_93  /tmp/.alt.luupdall.22709on no
# 


So I think that if a zfs filesystem is not mounted,
sharesmb should be ignored.

This is in snv_97 (SXCE; with a pending LU BE not yet activated,
and an old one no longer active); I don't know if it's still a problem in
current builds that unmounted filesystems are advertised, but if it is,
I can see how it could confuse clients.  So I thought I'd mention it.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-02 Thread Richard L. Hamilton
> On Thu, Sep 30, 2010 at 08:14:24PM -0400, Miles Nordin wrote:
> > >> Can the user in (3) fix the permissions from Windows?
> > 
> > no, not under my proposal.
> 
> Then your proposal is a non-starter.  Support for multiple remote
> filesystem access protocols is key for ZFS and Solaris.
> 
> The impedance mismatches between these various protocols mean that we
> need to make some trade-offs.  In this case I think the business (as
> well as the engineers involved) would assert that being a good SMB
> server is critical, and that being able to authoritatively edit file
> permissions via SMB clients is part of what it means to be a good SMB
> server.
> 
> Now, you could argue that we should bring aclmode back and let the user
> choose which trade-offs to make.  And you might propose new values for
> aclmode or enhancements to the groupmask setting of aclmode.
> 
> > but it sounds like currently people cannot ``fix'' permissions through
> > the quirky autotranslation anyway, certainly not to the point where
> > neither unix nor windows users are confused: windows users are always
> > confused, and unix users don't get to see all the permissions.
> 
> Thus the current behavior is the same as the old aclmode=discard
> setting.
> 
> > >> Now what?
> > 
> > set the unix perms to 777 as a sign to the unix people to either (a)
> > leave it alone, or (b) learn to use 'chmod A...'.  This will actually
> > work: it's not a hand-waving hypothetical that just doesn't play out.
> 
> That's not an option, not for a default behavior anyways.
> 
> Nico


One question: Casper, where are you?  The guy that did fine-grained
permissions IMO ought to have an idea of how to do something with ACLs
that's both safe and unsurprising for the various sorts of clients.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Please warn a home user against OpenSolaris under VirtualBox under WinXP ; )

2010-10-01 Thread Richard L. Hamilton
Hmm...according to
http://www.mail-archive.com/vbox-users-commun...@lists.sourceforge.net/msg00640.html

that's only needed before VirtualBox 3.2, or for IDE.  For 3.2 and later, non-IDE
controllers should honor flush requests, if I read that correctly.

Which is good, because I haven't seen an example of how to enable flushing
for SAS (which is the emulation I usually use because it's supposed to have
better performance).
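
If memory serves, the knob in question is VirtualBox's IgnoreFlush extradata
setting; the documented IDE form looks roughly like this (a sketch -- "MyVM"
and the LUN number are placeholders, and I can't vouch for an equivalent key
for the SAS controller, which is rather the problem):

  VBoxManage setextradata "MyVM" "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0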
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] fs root inode number?

2010-09-26 Thread Richard L. Hamilton
Typically on most filesystems, the inode number of the root
directory of the filesystem is 2, 0 being unused and 1 historically
once invisible and used for bad blocks (no longer done, but kept
reserved so as not to invalidate assumptions implicit in ufsdump tapes).

However, my observation seems to be (at least back at snv_97) that the
inode number of ZFS filesystem root directories (including at the
top level of a zpool) is 3, not 2.

If there's any POSIX/SUS requirement for the traditional number 2,
I haven't found it.  So maybe there's no reason founded in official
standards for keeping it the same.  But there are bound to be programs
that make what was, with other filesystems, a safe assumption.

Perhaps a warning is in order, if there isn't already one.

Is there some _reason_ why the inode number of filesystem root directories
in ZFS is 3 rather than 2?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-18 Thread Richard L. Hamilton
> Even the most expensive decompression algorithms
> generally run
> significantly faster than I/O to disk -- at least
> when real disks are
> involved.  So, as long as you don't run out of CPU
> and have to wait for
> CPU to be available for decompression, the
> decompression will win.  The
> same concept is true for dedup, although I don't
> necessarily think of
> dedup as a form of compression (others might
> reasonably do so though.)

Effectively, dedup is a form of compression of the
filesystem rather than of any single file, but one
oriented toward not interfering with access to any of the
files that may be sharing blocks.

I would imagine that if the data is read-mostly, it's a win, but
otherwise it costs more than it saves.  Even with conventional
compression, compressing tends to be more resource-intensive than
decompressing...

What I'm wondering is when dedup is a better value than compression.
Most obviously, when there are a lot of identical blocks across different
files; but I'm not sure how often that happens, aside from maybe
blocks of zeros (which may well be sparse anyway).
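
One way to answer that for a particular pool rather than guessing (a sketch;
the pool and dataset names are placeholders, and zdb -S assumes a build recent
enough to have the dedup-simulation mode):

  # zdb -S tank                         # walk existing data and report the dedup ratio it would get
  # zfs get compressratio tank/somefs   # actual savings where compression is already enabled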
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-16 Thread Richard L. Hamilton
> > It'd be handy to have a mechanism where applications could register for
> > snapshot notifications. When one is about to happen, they could be told
> > about it and do what they need to do. Once all the applications have
> > acknowledged the snapshot alert--and/or after a pre-set timeout--the file
> > system would create the snapshot, and then notify the applications that
> > it's done.
> >
> Why would an application need to be notified? I think you're under the
> misconception that something happens when a ZFS snapshot is taken.
> NOTHING happens when a snapshot is taken (OK, well, there is the
> snapshot reference name created). Blocks aren't moved around, we don't
> copy anything, etc. Applications have no need to "do anything" before a
> snapshot is taken.

It would be nice to have applications request to be notified
before a snapshot is taken, and when those that have requested
notification have acknowledged that they're ready, the snapshot
would be taken; and then another notification sent that it was
taken.  Prior to indicating they were ready, the apps could
have achieved a logically consistent on-disk state.  That
would eliminate the need for (for example) separate database
backups, if you could have a snapshot with the database on it
in a consistent state.

If I understand correctly, _that's_ what the notification mechanism
on Windows (the Volume Shadow Copy Service) achieves.

Of course, another approach would be for a zfs aware app to be
keeping its storage on a dedicated filesystem or zvol, and itself
control when snapshots were taken of that.  As lightweight as
zvols and filesystems are under zfs, having each app that needed
such functionality have its own would be no big deal, and would
even be handy insofar as each app could create snapshots on
its own independent schedule.

Either way, the apps would have to be aware of how to
participate in coordinating their logical consistency on disk with
the snapshot (or vice versa).
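
A minimal sketch of that second approach -- the app quiescing itself around
its own snapshot on a dedicated filesystem -- with the dataset names and the
quiesce/resume steps as placeholders for whatever the particular app provides:

  # zfs create tank/appdata                   # dedicated filesystem just for this app's data
  (tell the app to quiesce / flush to a consistent on-disk state)
  # zfs snapshot tank/appdata@nightly-20100716
  (tell the app to resume)
  # zfs send tank/appdata@nightly-20100716 | ssh backuphost zfs receive -d backup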

> > Given that snapshots will probably be more popular in the future (WAFL
> > NFS/LUNs, ZFS, Btrfs, VMware disk image snapshots, etc.), an agreed upon
> > consensus would be handy (D-Bus? POSIX?).

Hypothetically, one could hide some of the details
with suitable libraries and infrastructure.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-16 Thread Richard L. Hamilton
> never make it any better. Just for a record: Solaris
> 9 and 10 from Sun
> was a plain crap to work with, and still is
> inconvenient conservative
> stagnationware. They won't build a free cool tools

Everybody but geeks _wants_ stagnationware, if you mean
something that just runs.  Even my old Sun Blade 100 at home
still has Solaris 9 on it, because I haven't had a day to kill to
split the mirror, load something newer like the last SXCE, and
get everything on there working on it.  (My other SPARC is running
a semi-recent SXCE, and pending activation of an already installed
most recent SXCE.  Sitting at a Sun, I still prefer CDE to GNOME,
and the best graphics card I have for that box won't work with
the newer Xorg server, so I can't see putting OpenSolaris on it.)

For instance, recent enough Solaris 10 updates to be able to do zfs
root are pretty decent; you get into the habit of doing live upgrades
even for patching, so you can minimize downtime.  Hardly stagnant,
considering that the initial release of Solaris 10 didn't even have
zfs in it yet.

> for Solaris, hence
> the whole thing will turned to be a dry job for
> trained monkeys
> wearing suits in a corporations. Nothing more. That's
> a philosophy of
> last decade, but IT now is very changing and is very
> different. That
> is why Oracle's idea to kill community is totally
> stupid. And that's
> why IBM will win, because you run the same Linux on
> their hardware as
> you run at your home.
> 
> Yes, Oracle will run good for a while, using the
> inertia of a hype
> (and latest their financial report proves that), but
> soon people will
> realize that Oracle is just another evil mean beast
> with great
> marketing and the same sh*tty products as they always
> had. Buy Solaris
> for any single little purpose? No way ever! I may buy
> support and/or
> security patches, updates. But not the OS itself. If
> that is the only
> option, then I'd rather stick to Linux from other
> vendor, i.e. RedHat.
> That will lead me to no more talk to Oracle about
> software at OS
> level, only applications (if I am an idiot enough to
> jump into APEX or
> something like that). Hence, if all I can do is talk
> only about
> hardware (well, not really, because no more
> hardware-only support!!!),
> then I'd better talk to IBM, if I need a brand and I
> consider myself
> too dumb to get SuperMicro instead. IBM System x3550
> M3 is still
> better by characteristics than equivalent from
> Oracle, it is OEM if
> somebody needs that at first place and is still
> cheaper than Oracle's
> similar class. And IBM stuff just works great (at
> least if we talk
> about hardware).

I'm not going to say you're wrong, because in part I agree
with you.  Systems people can run at home, desktops, laptops,
those are all what get future mindshare and eventually get
people with big bucks spending them.

But the simple fact that Sun went down suggests that
just being all lovey-dovey (and plenty of people thought that
Sun wasn't lovey-dovey _enough_?) won't keep you in business
either.

[...]
> > But for home users? I doubt it. I was about to build a
> > big storage box at home running OpenSolaris, I froze that project.

Mine's running SXCE, and unless I can find a solution
to getting decent graphics working with Xorg on it,
probably always will be.  But the big (well, target 9TB redundant;
presently 3TB redundant) storage is doing just fine.
Being super latest and greatest just isn't necessary for that.

> Same here. A lot of nice ideas and potential
> open-source tools
> basically frozen and I think gonna be dumped. We
> (geeks) won't build
> stuff for Larry just for free. We need OS back opened
> in reward. So I
> think OpenSolaris is pretty much game over, thanks to
> the Oracle. Some
> Oracle fanboys might call it a plain FUD, hope to get
> updates etc, but
> the reality is that Oracle to OpenSolaris is pretty
> much the same what
> Palm did for BeOS.
> 
> Enjoy your last svn_134 build.
> 

I can't rule out that possibility, but I see some reasons
to think that it's worth being patient for a couple more
months.  As it is, I find myself updating my Mac and Windows
every darn week; so I'm pretty much past getting a kick out
of updating just to see what's kewl.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-16 Thread Richard L. Hamilton
> On Tue, 13 Jul 2010, Edward Ned Harvey wrote:
> 
> > It is true there's no new build published in the
> last 3 months.  But you
> > can't use that to assume they're killing the
> community.
> 
> Hmm, the community seems to think they're killing the
> community:
> 
>   http://developers.slashdot.org/story/10/07/14/1448209/OpenSolaris-Governing-Board-Closing-Shop?from=rss
> 
> 
> ZFS is great. It's pretty much the only reason we're
> running Solaris. But I
> don't have much confidence Oracle Solaris is going to
> be a product I'm
> going to want to run in the future. We barely put our
> ZFS stuff into
> production last year but quite frankly I'm already on
> the lookout for
> something to replace it.
> 
> No new version of OpenSolaris (which we were about to
> start migrating to).
> No new update of Solaris 10. *Zero* information about
> what the hell's going
> on...

Presumably if you have a maintenance contract or some other
formal relationship, you could get an NDA briefing.  Not having
been to one yet myself, I don't know what that would tell you,
but presumably more than without it.

Still, the silence is quite unhelpful, and the apparent lack of
anyone willing to recognize that, and with the authority to do
anything about it, is troubling.


> ZFS will surely live on as the filesystem under the
> hood in the doubtlessly
> forthcoming Oracle "database appliances", and I'm
> sure they'll keep selling
> their NAS devices. But for home users? I doubt it. I
> was about to build a
> big storage box at home running OpenSolaris, I froze
> that project. Oracle
> is all about the money. Which I guess is why they're
> succeeding and Sun
> failed to the point of having to sell out to them. My
> home use wasn't
> exactly going to make them a profit, but on the other
> hand, the philosophy
> that led to my not using the product at home is a
> direct cause of my lack
> of desire to continue using it at work, and while
> we're not exactly a huge
> client we've dropped a decent penny or two in Sun's
> wallet over the years.

FWIW, you're not the only one that's tried to make that point!

> Who knows, maybe Oracle will start to play ball
> before August 16th and the
> OpenSolaris Governing Board won't shut themselves
> down. But I wouldn't hold
> my breath.

Postponement of respiration pending hypothetical actions by others
is seldom an effective survival strategy.

Nevertheless, the zfs on my Sun Blade 2000 currently running SXCE snv_97
(pending luactivate and reboot to switch to snv_129) is doing just fine
with what is presently 3TB of redundant storage, and will eventually grow
to 9TB as I populate the rest of the slots in my JBOD
(8 slots; 2 x 1TB mirror for root; presently also 2 x 2TB mirror for data,
but that will change to 5 x 2TB raidz + 1 2TB hot spare when I can afford
four more 2TB drives).

I have a spare power supply and some other odds and ends for the
Sun Blade 2000, so, with fingers crossed, it will run (and heat my house :-)
for quite some time to come, regardless of availability of future software
updates.  If not, I'm sure I have an ISO of SXCE 129 or so for x86 somewhere
too, which I could put on any cheap x86 box with a PCIx slot for my SAS
controller, and just import the zpools and go.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-16 Thread Richard L. Hamilton
> Losing ZFS would indeed be disastrous, as it would
> leave Solaris with 
> only the Veritas File System (VxFS) as a semi-modern
> filesystem, and a 
> non-native FS at that (i.e. VxFS is a 3rd-party
> for-pay FS, which 
> severely inhibits its uptake). UFS is just way too old
> to be competitive 
> these days.

Having come to depend on them, the absence of some of the
features would certainly be significant.

But how come everyone forgets about QFS?
http://www.sun.com/storage/management_software/data_management/qfs/index.xml
http://en.wikipedia.org/wiki/QFS
http://hub.opensolaris.org/bin/view/Project+samqfs/WebHome
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Exporting iSCSI - it's still getting all the ZFS protect

2010-05-07 Thread Richard L. Hamilton
AFAIK, zfs should be able to protect against (if the pool is redundant), or at
least detect, corruption from the point that it is handed the data, to the point
that the data is written to permanent storage, _provided_that_ the system
has ECC RAM (so it can detect and often correct random background-radiation
caused memory errors), and that, if zfs controls the whole disk and the disk
has a write cache, the disk correctly honors requests to flush the write cache
to permanent storage.  That should be just as true for a zvol as for a
regular zfs file.

What I'm trying to say is that zfs should give you a lot of protection in your
situation, but that it can do nothing about it if it is handed bad data: for
example, if the client is buggy and sends corrupt data, if somehow a network
error goes undetected (unlikely given that AFAIK iSCSI runs over TCP and
at least thus far never over UDP, and TCP always checksums (UDP might not)),
if the iSCSI server software corrupts data before writing it to disk, etc.

In other words, zfs probably gives more protection to a larger portion of the
data path than just about anything else, but in the case of a remote client,
whether iSCSI, NFS, CIFS, or whatever, the data path is longer and
distributed, and the verification that zfs does only covers part of that.

What I'm saying would _not_ apply if the client were doing zfs onto iSCSI
storage; in that case, the client's zfs would also be looking after data
integrity.  So the closer to the data-generating application that integrity
protection begins, the fewer places something bad can happen without at least
being detected.
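
(For completeness: the redundancy-based protection described above can also be
exercised on demand; a sketch, with "tank" as a placeholder pool name:

  # zpool scrub tank        # re-read everything and verify it against its checksums
  # zpool status -v tank    # scrub progress, plus any files with unrecoverable errors

That only covers what zfs itself wrote, of course, not the longer client-side
path discussed above.)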

Note: I can't guarantee that any of what I said is correct, although I would
be willing to risk my own data as if it were.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why both dedup and compression?

2010-05-05 Thread Richard L. Hamilton
Another thought is this: _unless_ the CPU is the bottleneck on
a particular system, compression (_when_ it actually helps) can
speed up overall operation, by reducing the amount of I/O needed.
But storing already-compressed files in a filesystem with compression
is likely to result in wasted effort, with little or no gain to show for it.

Even deduplication requires some extra effort.  Looking at the documentation,
it involves a particular checksum algorithm, optionally _plus_ verification (if
the checksum or digest matches, make sure by doing a byte-for-byte compare of
the blocks, since nothing shorter than the data itself can _guarantee_ that
they're the same, just as no lossless compression can possibly work for
all possible bitstreams).

So doing either of these where the success rate is likely to be too low
is probably not helpful.

There are stats that show the savings for a filesystem due to compression
or deduplication.  What I think would be interesting is some advice as to
how much (percentage) savings one should be getting to expect to come
out ahead not just on storage, but on overall system performance.  Of
course, no such guidance would exactly fit any particular workload, but
I think one might be able to come up with some approximate numbers,
or at least a range, below which those features probably represented
a waste of effort unless space was at an absolute premium.
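
The stats in question are just properties; a minimal sketch of reading them
(pool and dataset names are placeholders):

  # zfs get compressratio tank/somefs
  # zpool get dedupratio tank

Comparing those against the CPU time and memory the box has to spare is about
as close as one can currently get to the break-even guidance described above.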
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why both dedup and compression?

2010-05-05 Thread Richard L. Hamilton
> I've googled this for a bit, but can't seem to find
> the answer.
> 
> What does compression bring to the party that dedupe
> doesn't cover already?
> 
> Thank you for you patience and answers.

That almost sounds like a classroom question.

Pick a simple example: large text files, of which each is
unique, maybe lines of data or something.  Not likely to
be much in the way of duplicate blocks to share, but
very likely to be highly compressible.

Contrast that with binary files, which might have blocks
of zero bytes in them (without being strictly sparse, sometimes).
With deduping, one such block is all that's actually stored (along
with all the references to it, of course).

In the 30 seconds or so I've been thinking about it to type this,
I would _guess_ that one might want one or the other, but
rarely both, since compression might tend to work against deduplication.

So given the availability of both, and how lightweight zfs filesystems
are, one might want to create separate filesystems within a pool with
one or the other as appropriate, and separate the data according to
which would likely work better on it.  Also, one might as well
put compressed video, audio, and image formats in a filesystem
that was _not_ compressed, since compressing an already compressed
file seldom gains much if anything more.
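
A minimal sketch of that split (the pool and dataset names are placeholders):

  # zfs create -o compression=on tank/text                  # unique but highly compressible data
  # zfs create -o dedup=on -o compression=off tank/images   # many identical blocks expected
  # zfs create -o compression=off tank/media                # already-compressed video/audio/archives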
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool rename?

2010-05-04 Thread Richard L. Hamilton
[...]
> To answer Richard's question, if you have to rename a
> pool during
> import due to a conflict, the only way to change it
> back is to
> re-import it with the original name. You'll have to
> either export the
> conflicting pool, or (if it's rpool) boot off of a
> LiveCD which
> doesn't use an rpool to do the rename.

Thanks.  The latter is what I ended up doing (well,
off of the SXCE install CD image that I'd used to set up that
disk image in VirtualBox in the first place).
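
For the archives, the sequence from the install/live CD environment was roughly
this (a sketch; "temprpool" stands in for whatever the pool had been renamed
to, and an altroot via -R may be wanted to keep it from mounting over the CD's
filesystems):

  # zpool import                       # list importable pools with their current names/ids
  # zpool import -f temprpool rpool    # import it under its original name
  # zpool export rpool                 # export it again; the name change sticks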
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool rename?

2010-05-02 Thread Richard L. Hamilton
One can rename a zpool on import

zpool import -f pool_or_id newname

Is there any way to rename it (back again, perhaps)
on export?

(I had to rename rpool in an old disk image to access
some stuff in it, and I'd like to put it back the way it
was so it's properly usable if I ever want to boot off of it.)

But I suppose there must be other scenarios where that would
be useful too...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mapping inode numbers to file names

2010-04-28 Thread Richard L. Hamilton
[...]
> There is a way to do this kind of object to name
> mapping, though there's no documented public
> interface for it. See zfs_obj_to_path() function and
> ZFS_IOC_OBJ_TO_PATH ioctl.
> 
> I think it should also be possible to extend it to
> handle multiple names (in case of multiple hardlinks)
> in some way, as the id of the parent directory is recorded at
> the time of link creation in the znode attributes

To add a bit: these sorts of things are _not_ required by any existing
standard, and may be limited to use by root (since they bypass directory
permissions).  So they're typically private, undocumented, and subject
to change without notice.

Some other examples:

UFS _FIOIO ioctl: obtain a read-only file descriptor given an
existing file descriptor on the file system (to make the ioctl on)
and the inode number and generation number (keeps inode numbers
from being reused too quickly, mostly to make NFS happy I think) in
an argument to the ioctl.

Mac OS X /.vol directory: allows pre-OS X style access by
volume-ID/folder-ID/name triplet

Those are all hidden behind a particular library or application
that is the only supported way of using them.

It is perhaps unfortunate that there is no generic root-only way to
look up
fsid/inode
(problematic though due to hard links)
or
fsid/dir_inode/name
(could fail if name has been moved to another directory on the same filesystem)

but implementing a generic solution would likely be a lot of work
(requiring support from every filesystem, most of which were _not_
designed to do a reverse lookup, i.e. from inode back to name), and
the use cases seem to be very few indeed.  (As an example of that,
/.vol on a Mac is said to only work for HFS or HFS+ volumes, not old UFS
volumes (Macs used to support their own flavor of UFS, apparently; no
doubt one considerably different from on Solaris, so don't go there)
In fact, I'm not sure that /.vol works at all on the latest Mac OS X.)
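
(For zfs specifically, there is at least a debug-level way to see the name
recorded for a given object number without writing any ioctl code; a sketch,
with the dataset and object number as placeholders, and with the caveat that
zdb output is not a stable interface:

  # zdb -ddddd tank/home 12345    # dumps the object, including a "path" line when one can be reconstructed
)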
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] customizing "zfs list" with less typing

2010-01-23 Thread Richard L. Hamilton
> Just make 'zfs' an alias to your version of it.  A
> one-time edit
> of .profile can update that alias.


Sure; write a shell function, and add an alias to it.
And use a quoted command name (or full path) within the function
to get to the real command.  Been there, done that.
But to do a good job of it means parsing the command line the
same way the real command would, so that it only adds
-o ${ZFS_LIST_PROPS:-name,used,available,referenced,mountpoint} 

or perhaps better

${ZFS_LIST_PROPS:+-o "${ZFS_LIST_PROPS}"}

to zfs list (rather than to other subcommands), and only
if the -o option wasn't given explicitly.

That's not only a bit of a pain, but anytime one is duplicating parsing,
it's begging for trouble: in case they don't really handle it the same,
or in case the underlying command is changed.  And unless that sort of thing
is handled with extreme care (quoting all shell variable references, just
for starters), it can turn out to be a security problem.

And that's just the implicit-options part of what I want; the other part would
involve optionally filtering the command output as well.  That's starting to
get nuts, IMO.

Heck, I can grab a copy of the source for the zfs command, modify it,
and recompile it (without building all of OpenSolaris) faster than I can
write a really good shell wrapper that does the same thing.  But then I
have to maintain my own divergent implementation, unless I can interest
someone else in the idea...OTOH by the time the hoop-jumping for getting
something accepted is over, it's definitely been more bother than gain...
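
For anyone satisfied with the simple 90% case -- no real re-parsing, just
staying out of the way whenever -o appears on the command line -- a minimal
sketch of such a wrapper for ksh/bash, using the hypothetical ZFS_LIST_PROPS
variable from above:

# rough .profile wrapper: only special-cases "zfs list" without an explicit -o
zfs() {
        if [ "$1" = "list" ] && [ -n "${ZFS_LIST_PROPS}" ]; then
                case " $* " in
                *" -o "*) command zfs "$@" ;;    # caller gave -o; don't interfere
                *)        shift
                          command zfs list -o "${ZFS_LIST_PROPS}" "$@" ;;
                esac
        else
                command zfs "$@"
        fi
}

It inherits exactly the parsing fragility described above (e.g. it misses
-oname,used written without a space), which is rather the point.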
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] customizing "zfs list" with less typing

2010-01-23 Thread Richard L. Hamilton
It might be nice if "zfs list" would check an environment variable for
a default list of properties to show (same as the comma-separated list
used with the -o option).  If not set, it would use the current default list;
if set, it would use the value of that environment variable as the list.

I find there are a lot of times I want to see the same one additional property
when using "zfs list"; an environment variable would mean a one-time edit
of .profile rather than typing the -o option with the default list modified by
whatever I want to add.

Along those lines, pseudo properties that were abbreviated (constant-length)
versions of some real properties might help.  For instance, sharenfs can be on,
off, or a rather long list of nfs sharing options.  A pseudo property with a
related name and a value of on, off, or spec (with spec implying some arbitrary
list of applicable options) would have a constant length.  Given two
potentially long properties (mountpoint and the pseudo property "name"), output
lines are already close to cumbersome (that assumes one at the beginning of the
line and one at the end).  Additional potentially long properties in the output
would tend to make it unreadable.

Both of those, esp. together, would make quickly checking or familiarizing
oneself with a server that much more civilized, IMO.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] high read iops - more memory for arc?

2009-12-25 Thread Richard L. Hamilton
FYI, the arc and arc-discuss lists or forums are not appropriate for this.  There
are two "arc" acronyms:

* Architecture Review Committee (the arc list is for cases being considered,
arc-discuss is for other discussion.  Non-committee business is most unwelcome
on the arc list.)

* the ZFS Adaptive Replacement Cache.  That is what you are talking about.

The zfs-discuss list is appropriate for that subject; storage-discuss and
database-discuss _may_ relate, but rather than sending to every list that
_might_ relate, I'd suggest starting with the most appropriate first, and
reading enough of the posts already on a list to get some idea of what's
appropriate there and what isn't, before just adding it as an additional CC
in the hopes that someone might answer.

Very few people are likely to be responding here at this time, insofar as the
largest part of the people that might are probably observing (at least
socially) the Christmas holiday right now (their families might not appreciate
them being distracted by anything else!), and many of the rest aren't
interacting much because of how many are not around right now.  Don't expect
too much until the first Monday after 1 January.  And anyway, discussion lists
are not a place where anyone is _obligated_ to answer.  Those with support
contracts presumably have other ways of getting help.

Now...I probably couldn't answer your question even if I had all the
information you left out, but maybe someone could, eventually.  Some of the
information they might need:

* what are you running (uname -a will do)?  ZFS is constantly being improved;
problems get fixed (and sometimes introduced) in just about every build

* what system, how is it configured, exactly what disk models, etc?

Free memory is _supposed_ to be low.  Free memory is wasted memory, except that
a little is kept free to quickly respond to requests for more.  Most memory not
otherwise being used for mappings, kernel data structures, etc., is used either
as additional VM page cache of pages that might be used again, or by the ZFS
ARC.  The tools that report on just how memory is used behave differently on
Solaris (and even on different versions) than they do on other OSs, because
Solaris tries really hard to make the best use of all RAM.  The uname -a
information would also help someone (more knowledgeable than I, although I
might be able to look it up) suggest which tools would best help to understand
your situation.

So while free memory alone doesn't tell you much, there's a good chance that
more would help unless there's some specific problem that's involved.  There's
also a good chance that your problem is known, recognizable, and probably has a
fix in a newer version or a workaround, if you provide enough information to
help someone find that for you.
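
For the record, two tools that are usually a safe starting point on any recent
Solaris build (run as root):

  # echo "::memstat" | mdb -k                       # breakdown of where physical memory is going
  # kstat -p zfs:0:arcstats:size zfs:0:arcstats:c   # current ARC size and its target

Those, plus the uname -a and hardware details above, are the sort of thing
that lets someone recognize a known problem.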
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is st_size of a zfs directory equal to the

2009-01-14 Thread Richard L. Hamilton
> "Richard L. Hamilton"  wrote:
> 
> > Cute idea, maybe.  But very inconsistent with the
> size in blocks (reported by ls -dls dir).
> > Is there a particular reason for this, or is it one
> of those just for the heck of it things?
> >
> > Granted that it isn't necessarily _wrong_.  I just
> checked SUSv3 for stat() and sys/stat.h,
> > and it appears that st_size is only well-defined
> for regular files and symlinks.  So I suppose
> > it could be (a) undefined, or  (b) whatever is
> deemed to be useful, for directories,
> > device files, etc.
> 
> You could also return 0 for st_size for all
> directories and would still be 
> POSIX compliant.
> 
> 
> Jörg
> 

Yes, some do IIRC (certainly for empty directories, maybe always; I forget what
OS I saw that on).

Heck, "undefined" means it wouldn't be _wrong_ to return a random number.  Even
a _negative_ number wouldn't necessarily be wrong (although it would be a new 
low
in rudeness, perhaps).

I did find the earlier discussion on the subject (someone e-mailed me that
there had been such).  It seemed to conclude that some apps are statically
linked with old scandir() code that (incorrectly) assumed that the number of
directory entries could be estimated as st_size/24; and worse, that some such
apps might be seeing the small st_size that zfs offers via NFS, so they might
not even be something that could be fixed on Solaris at all.  But I didn't see
anything in the discussion that suggested that this was going to be changed.
Nor did I see a compelling argument for leaving it the way it is, either.  In
the face of "undefined", all arguments end up as pragmatism rather than
principle, IMO.

Maybe it's not a bad thing to go and break incorrect code.  But if that code
has worked for a long time (maybe long enough for the source to have been
lost), I don't know that it's helpful to just remind everyone that st_size is
only defined for certain types of objects, and directories aren't one of them.

(Now if one wanted to write something to break code depending on 32-bit time_t
_now_ rather than waiting for 2038, that might be a good deed in terms of
breaking things.  But I'll be 80 then (if I'm still alive), and I probably
won't care.)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Why is st_size of a zfs directory equal to the number of entries?

2009-01-14 Thread Richard L. Hamilton
Cute idea, maybe.  But very inconsistent with the size in blocks (reported by
ls -dls dir).  Is there a particular reason for this, or is it one of those
just-for-the-heck-of-it things?

Granted that it isn't necessarily _wrong_.  I just checked SUSv3 for stat() and
sys/stat.h, and it appears that st_size is only well-defined for regular files
and symlinks.  So I suppose it could be (a) undefined, or (b) whatever is
deemed to be useful, for directories, device files, etc.

This is of course inconsistent with the behavior on other filesystems.  On UFS
(a bit of a special case perhaps in that it still allows read(2) on a
directory, for compatibility), the st_size seems to reflect the actual number
of bytes used by the implementation to hold the directory's current contents.
That may well also be the case for tmpfs, but from user-land, one can't tell
since it (reasonably enough) disallows read(2) on directories.  Haven't checked
any other filesystems.  Don't have anything else (pcfs, hsfs, udfs, ...)
mounted at the moment to check.

(Other stuff: ISTR that devices on Solaris will give a "size" if applicable,
but for non-LF-aware 32-bit, that may be capped at MAXOFF32_T rather than
returning an error; I think maybe for pipes, one sees the number of bytes
available to be read.  None of which is portable or should necessarily be
depended on...)

Cool ideas are fine, but IMO, if one does wish to make something nominally
undefined have some particular behavior, I wonder why one wouldn't at least
try for consistency...
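
A quick way to see the behavior for yourself (the paths are placeholders):

  $ ls -ld /tank/somedir              # st_size as returned by stat()
  $ ls -a /tank/somedir | wc -l       # number of entries, including . and ..

On zfs the two numbers track each other (give or take how . and .. are
counted); on ufs the first is instead a multiple of the directory block size.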
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mac Mini (OS X 10.5.4) with globalSAN

2008-08-14 Thread Richard L. Hamilton
> On Wed, 13 Aug 2008, Richard L. Hamilton wrote:
> >
> > Reasonable enough guess, but no, no compression,
> nothing like that;
> > nor am I running anything particularly demanding
> most of the time.
> >
> > I did have the volblocksize set down to 512 for
> that volume, since I thought
> > that for the purpose, that reflected hardware-like
> behavior.  But maybe there's
> > some reason that's not a good idea.
> 
> Yes, that would normally be a very bad idea.  The
> default is 128K. 
> The main reason to want to reduce it is if you have
> an application 
> doing random-access I/O with small block sizes (e.g.
> 8K is common for 
> applications optimized for UFS).  In that case the
> smaller block sizes 
> decrease overhead since zfs reads and updates whole
> blocks.  If the 
> block size is 512 then that means you are normally
> performing more 
> low-level I/Os, doing more disk seeks, and wasting
> disk space.
> 
> The hardware itself does not really deal with 512
> bytes any more since 
> buffering on the disk drive is sufficient to buffer
> entire disk tracks 
> and when data is read, it is common for the disk
> drive to read the 
> entire track into its local buffer.  A hardware RAID
> controller often 
> extends that 512 bytes to a somewhat larger value for
> its own 
> purposes.
> 
> Bob

Ok, but that leaves the question of what a better value would be.  I gather
that HFS+ operates in terms of 512-byte sectors but larger allocation units;
however, unless those allocation units are a power of two between 512 and 128k
inclusive _and_ are accordingly aligned within the device (or actually, with
the choice of a proper volblocksize can be made to correspond to blocks in
the underlying zvol), it seems to me that a larger volblocksize would not help;
it might well mean that a one a.u. write by HFS+ equated to two blocks read
and written by zfs, because the alignment didn't match, whereas at least with
the smallest volblocksize, there should never be a need to read/merge/write.

I'm having trouble figuring out how to get the info to make a better choice on
the HFS+ side; maybe I'll just fire up wireshark, and see if it knows
how to interpret iSCSI, and/or run truss on iscsitgtd to see what it actually
is reading from/writing to the zvol; if there is a consistent least common
aligned blocksize, I would expect the latter especially to reveal it, and
probably the former to confirm it.
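
If it does come down to recreating the volume with a different block size, the
zfs side of it is simple enough (a sketch; the size, block size, and names are
placeholders, and volblocksize can only be set at creation time, so the data
would have to be copied over):

  # zfs create -V 50g -o volblocksize=4K tank/maciscsi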

I did string Ethernet; I think that sped things up a bit, but it didn't change
the annoying pauses. In the end, I found a 500GB USB drive on sale for $89.95 (US),
and put that on the Mac, with 1 partition for backups, and 1 each for possible
future [Open]Solaris x86, Linux, and Windows OSs, assuming they can be
booted from a USB drive on a Mac Mini.  Still, I want to know if the pausing
with iscsitgtd is in part something I can tune down to being non-obnoxious,
or is (as I suspect) in some sense a real bug.

cc-ing zfs-discuss, since I suspect the problem might be there at least as much
as with iscsitgtd (not that the latter is a prize-winner, having core-dumped
with an assert() somewhere a number of times).
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can't rm file when "No space left on device"...

2008-06-13 Thread Richard L. Hamilton
I wonder if one couldn't reduce (but probably not eliminate) the likelihood
of this sort of situation by setting refreservation significantly lower than
reservation?

Along those lines, I don't see any property that would restrict the number
of concurrent snapshots of a dataset :-(   I think that would be real handy,
along with one that would say whether to refuse another when the limit was
reached, or to automatically delete the oldest snapshot.  Yes, one can script
the rotation of snapshots, but it might be nice to just make it policy for a
given dataset instead, particularly together with delegated snapshot permission
(provided that that didn't also delegate the ability to change the maximum
number of allowed snapshots).
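
A minimal sketch of the first idea, with names and sizes as placeholders:

  # zfs get reservation,refreservation tank/somevol
  # zfs set reservation=20G tank/somevol
  # zfs set refreservation=16G tank/somevol   # leave part of the reservation as headroom for snapshot growth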
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Boot from mirrored vdev

2008-06-13 Thread Richard L. Hamilton
Are you using
set md:mirrored_root_flag=1
in /etc/system?

See the entry for md:mirrored_root_flag on
http://docs.sun.com/app/docs/doc/819-2724/chapter2-156?a=view
keeping in mind all the cautions...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] memory hog

2008-06-13 Thread Richard L. Hamilton
Hmm...my SB2K, 2GB RAM, 2x 1050MHz UltraSPARC III Cu CPU, seems
to freeze momentarily for a couple of seconds every now and then in
a zfs root setup on snv_90, which it never did with mostly ufs on snv_81;
that despite having much faster disks now (LSI SAS 3800X and a pair of
Seagate 1TB SAS drives (mirrored), vs the 2x internal 73GB FC drives;
the SAS drives, at a mere 7200 RPM can sustain a sequential transfer
rate about 2.5x that of the 10KRPM FC drives!).

Then again, between the hardware differences and any other software
differences as well as the configuration change, I'm not absolutely ready
to blame any particular one of those for those annoying pauses...but
my suspicions are on zfs...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Growing root pool ?

2008-06-11 Thread Richard L. Hamilton
> I'm not even trying to stripe it across multiple
> disks, I just want to add another partition (from the
> same physical disk) to the root pool.  Perhaps that
> is a distinction without a difference, but my goal is
> to grow my root pool, not stripe it across disks or
> enable raid features (for now).
> 
> Currently, my root pool is using c1t0d0s4 and I want
> to add c1t0d0s0 to the pool, but can't.
> 
> -Wyllys

Right, that's how it is right now (which the other guy seemed to
be suggesting might change eventually, but nobody knows when
because it's just not that important compared to other things).

AFAIK, if you could shrink the partition whose data is after
c1t0d0s4 on the disk, you could grow c1t0d0s4 by that much,
and I _think_ zfs would pick up the growth of the device automatically.
(ufs partitions can be grown like that, or by being on an SVM or VxVM
volume that's grown, but then one has to run a command specific to ufs
to grow the filesystem to use the additional space).
I think zpools are supposed to grow automatically if SAN LUNs are grown,
and this should be a similar situation, anyway.  But if you can do that,
and want to try it, just be careful.  And of course you couldn't shrink it 
again, either.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Growing root pool ?

2008-06-11 Thread Richard L. Hamilton
> On Tue, Jun 10, 2008 at 11:33:36AM -0700, Wyllys
> Ingersoll wrote:
> > Im running build 91 with ZFS boot.  It seems that
> ZFS will not allow
> > me to add an additional partition to the current
> root/boot pool
> > because it is a bootable dataset.  Is this a known
> issue that will be
> > fixed or a permanent limitation?
> 
> The current limitation is that a bootable pool be
> limited to one disk or
> one disk and a mirror.  When your data is striped
> across multiple disks,
> that makes booting harder.
> 
> From a post to zfs-discuss about two months ago:
> 
>     ... we do have plans to support booting from RAID-Z.  The design is
>     still being worked out, but it's likely that it will involve a new
>     kind of dataset which is replicated on each disk of the RAID-Z pool,
>     and which contains the boot archive and other crucial files that the
>     booter needs to read.  I don't have a projected date for when it will
>     be available.  It's a lower priority project than getting the install
>     support for zfs boot done.
> 
> -- 
> Darren

If I read you right, with little or nothing extra, that would enable
growing rpool as well, since what it would really do is ensure
/boot (and whatever if anything else) was mirrored even though
the rest of the zpool was raidz or raidz2; which would also
ensure that those critical items were _not_ spread across the
stripe that would result from adding devices to an existing zpool.

Of course installation and upgrade would have to be able to recognize
and deal with such exotica too.  Which seems to pose a problem, since
having one dataset in the zpool mirrored while the rest is raidz and/or
extended by a stripe implies to me that some space is more or less
reserved for that purpose, or that such a dataset couldn't be snapshotted,
or both; so I suppose there might be a smaller-than-total-capacity limit
on the number of BEs possible.

http://en.wikipedia.org/wiki/TANSTAAFL ...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD reliability, wear levelling, warranty period

2008-06-11 Thread Richard L. Hamilton
> > btw: it seems to me that this thread is a little bit OT.
> 
> I don't think it's OT - because SSDs make perfect sense as ZFS log
> and/or cache devices.  If I did not make that clear in my OP then I
> failed to communicate clearly.  In both these roles (log/cache)
> reliability is of the utmost importance.

Older SSDs (before cheap and relatively high-cycle-limit flash)
were RAM cache+battery+hard disk.  Surely RAM+battery+flash
is also possible; the battery only needs to keep the RAM alive long
enough to stage to the flash.  That keeps the write count on the flash
down, and the speed up (RAM being faster than flash).  Such a device
would of course cost more, and be less dense (given having to have
battery+charging circuits and RAM as well as flash), than a pure flash device.
But with more limited write rates needed, and no moving parts, _provided_
it has full ECC and maybe radiation-hardened flash (if that exists), I can't
imagine why such a device couldn't be exceedingly reliable and have quite
a long lifetime (with the battery, hopefully replaceable, being more of
a limitation than the flash).

It could be a matter of paying for how much quality you want...

As for reliability, from zpool(1m):

> log
>
>   A separate intent log device. If more than one log device is specified, then
>   writes are load-balanced between devices. Log devices can be mirrored.
>   However, raidz and raidz2 are not supported for the intent log. For more
>   information, see the “Intent Log” section.
>
> cache
>
>   A device used to cache storage pool data. A cache device cannot be mirrored
>   or part of a raidz or raidz2 configuration. For more information, see the
>   “Cache Devices” section.
[...]
> Cache Devices
>
>   Devices can be added to a storage pool as “cache devices.” These devices
>   provide an additional layer of caching between main memory and disk. For
>   read-heavy workloads, where the working set size is much larger than what can
>   be cached in main memory, using cache devices allow much more of this
>   working set to be served from low latency media. Using cache devices provides
>   the greatest performance improvement for random read-workloads of mostly
>   static content.
>
>   To create a pool with cache devices, specify a “cache” vdev with any number
>   of devices. For example:
>
>     # zpool create pool c0d0 c1d0 cache c2d0 c3d0
>
>   Cache devices cannot be mirrored or part of a raidz configuration. If a read
>   error is encountered on a cache device, that read I/O is reissued to the
>   original storage pool device, which might be part of a mirrored or raidz
>   configuration.
>
>   The content of the cache devices is considered volatile, as is the case with
>   other system caches.

That tells me that the zil can be mirrored and zfs can recover from cache errors.

I think that means that these devices don't need to be any more reliable than
regular disks, just much faster.

So...expensive ultra-reliability SSD, or much less expensive SSD plus mirrored
zil?  Given what zfs can do with cheap SATA, my bet is on the latter...
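
The corresponding commands are simple enough; a sketch against an existing
pool (the device names are placeholders):

  # zpool add tank log mirror c4t0d0 c5t0d0    # mirrored slog, since losing zil contents can matter
  # zpool add tank cache c6t0d0                # unmirrored L2ARC; read errors just fall back to the pool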
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?

2008-06-11 Thread Richard L. Hamilton
> On Sat, 7 Jun 2008, Mattias Pantzare wrote:
> >
> > If I need to count useage I can use du. But if you
> can implement space
> > usage info on a per-uid basis you are not far from
> quota per uid...
> 
> That sounds like quite a challenge.  UIDs are just
> numbers and new 
> ones can appear at any time.  Files with existing
> UIDs can have their 
> UIDs switched from one to another at any time.  The
> space used per UID 
> needs to be tallied continuously and needs to track
> every change, 
> including real-time file growth and truncation.  We
> are ultimately 
> talking about 128 bit counters here.  Instead of
> having one counter 
> per filesystem we now have potentially hundreds of
> thousands, which 
> represents substantial memory.

But if you already have the ZAP code, you ought to be able to do
quick lookups of arbitrary byte sequences, right?  Just assume that
a value not stored is zero (or infinity, or uninitialized, as applicable),
and you have the same functionality as  the sparse quota file on ufs,
without the problems.

Besides, uid/gid/sid quotas would usually make more sense at the zpool level than
at the individual filesystem level, so perhaps it's not _that_ bad.  Which is to
say, you want user X to have an n GB quota over the whole zpool, and you
probably don't so much care whether the filesystem within the zpool
corresponds to his home directory or to some shared directory.

> Multicore systems have the additional challenge that
> this complex 
> information needs to be effectively shared between
> cores.  Imagine if 
> you have 512 CPU cores, all of which are running some
> of the ZFS code 
> and have their own caches which become invalidated
> whenever one of 
> those counters is updated.  This sounds like a no-go
> for an almost 
> infinite-sized pooled "last word" filesystem like
> ZFS.
> 
> ZFS is already quite lazy at evaluating space
> consumption.  With ZFS, 
> 'du' does not always reflect true usage since updates
> are delayed.

Whatever mechanism can check at block allocation/deallocation time
to keep track of per-filesystem space (vs a filesystem quota, if there is one)
could surely also do something similar against per-uid/gid/sid quotas.  I 
suspect
a lot of existing functions and data structures could be reused or adapted for
most of it.  Just one more piece of metadata to update, right?  Not as if ufs
quotas had zero runtime penalty if enabled.   And you only need counters and
quotas in-core for identifiers applicable to in-core znodes, not for every
identifier used on the zpool.

Maybe I'm off base on the details.  But in any event, I expect that it's 
entirely
possible to make it happen, scalably.  Just a question of whether it's worth the
cost of designing, coding, testing, documenting.  I suspect there may be enough
scenarios for sites with really high numbers of accounts (particularly
universities, which are not only customers in their own right, but a chance
for future mindshare) that it might be worthwhile, but I don't know that to
be the case.

IMO, even if no one sort of site using existing deployment architectures would
justify it, given the future blurring of server, SAN, and NAS (think recent
SSD announcement + COMSTAR + iSCSI initiator + separate device for zfs
zil & cache + in-kernel CIFS + enterprise authentication with Windows
interoperability + Thumper + ...), the ability to manage all that storage in all
sorts of as-yet unforeseen deployment configurations _by user or other identity_
may well be important across a broad base of customers.  Maybe identity-based,
as well as filesystem-based quotas, should be part of that.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?

2008-06-06 Thread Richard L. Hamilton
[...]
> > That's not to say that there might not be other
> problems with scaling to
> > thousands of filesystems.  But you're certainly not
> the first one to test it.
> >
> > For cases where a single filesystem must contain
> files owned by
> > multiple users (/var/mail being one example), old
> fashioned
> > UFS quotas still solve the problem where the
> alternative approach
> > with ZFS doesn't.
> >   
> 
> A single /var/mail doesn't work well for 10,000 users
> either.  When you
> start getting into that scale of service
> provisioning, you might look at
> how the big boys do it... Apple, Verizon, Google,
> Amazon, etc.  You
> should also look at e-mail systems designed to scale
> to large numbers of 
> users
> which implement limits without resorting to file
> system quotas.  Such
> e-mail systems actually tell users that their mailbox
> is too full rather 
> than
> just failing to deliver mail.  So please, when we
> start having this 
> conversation
> again, lets leave /var/mail out.

I'm not recommending such a configuration; I quite agree that it is neither
scalable nor robust.

Its only merit is that it's an obvious example of where one would have
potentially large files owned by many users necessarily on one filesystem,
inasmuch as they were in one common directory.  But there must be
other examples where the ufs quota model is a better fit than the
zfs quota model with potentially one filesystem per user.

In terms of the limitations they can provide, zfs filesystem quotas remind me
of DG/UX control point directories (presumably a relic of AOS/VS) - like regular
directories except they could have a quota bound to them restricting the sum of
the space of the subtree rooted there (the native filesystem on DG/UX didn't
have UID-based quotas).

Given restricted chown (non-root can't give files away), per-UID*filesystem
quotas IMO make just as much sense as per-filesystem quotas themselves
do on zfs, save only that per-UID*filesystem quotas make the filesystem less
lightweight.  For zfs, perhaps an answer might be if it were possible to
have per-zpool uid/gid/projid/zoneid/sid quotas too?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Per-user home filesystems and OS-X Leopard anomaly

2008-06-06 Thread Richard L. Hamilton
> I encountered an issue that people using OS-X systems
> as NFS clients 
> need to be aware of.  While not strictly a ZFS issue,
> it may be 
> encounted most often by ZFS users since ZFS makes it
> easy to support 
> and export per-user filesystems.  The problem I
> encountered was when 
> using ZFS to create exported per-user filesystems and
> the OS-X 
> automounter to perform the necessary mount magic.
> 
> OS-X creates hidden ".DS_Store" directories in every
> directory which 
> is accessed (http://en.wikipedia.org/wiki/.DS_Store).
> 
> OS-X decided that it wanted to create the path
> "/home/.DS_Store" and 
> it would not take `no' for an answer.  First it would
> try to create 
> "/home/.DS_Store" and then it would try an alternate
> name.  Since the 
> automounter was used, there would be an automount
> request for 
> "/home/.DS_Store", which does not exist on the server
> so the mount 
> request would fail.  Since OS-X does not take 'no'
> for an answer, 
> there would be subsequent thousands of back to back
> mount requests. 
> The end result was that 'mountd' was one of the top
> three resource 
> consumers on my system, there would be bursts of high
> network traffic 
> (1500 packets/second), and the affected OS-X system
> would operate 
> more strangely than normal.
> 
> The simple solution was to simply create a
> "/home/.DS_Store" directory 
> on the server so that the mount request would
> succeed.

Too bad it appears to be non-obvious how to do loopback mounts
(a mount of one local directory onto another, without having to be an
NFS server) on Darwin/MacOS X; then you could mount the
/home/.DS_Store locally from a directory elsewhere (e.g.
/export/home/.DS_Store) on each machine, rather than bothering
the server with it.
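For comparison, on Solaris that kind of loopback mount is just lofs; roughly, with
made-up paths (and leaving aside whether the automounter would let you cover a
directory it manages):

  mkdir -p /export/local/.DS_Store
  mount -F lofs /export/local/.DS_Store /home/.DS_Store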
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs incremental-forever

2008-06-06 Thread Richard L. Hamilton
If I read the man page right, you might only have to keep a minimum of two
on each side (maybe even just one on the receiving side), although I might be
tempted to keep an extra just in case; say near current, 24 hours old, and a
week old (space permitting for the larger interval of the last one).  Adjust
frequency, spacing, and number according to available space, keeping in
mind that the more COW-ing between snapshots (the longer interval if
activity is more or less constant), the more space required.  (assuming
my head is more or less on straight right now...)

Of course if you get messed up, you can always resync with a non-incremental
transfer, so if you could live with that occasionally, there may be no need for
more than two.

Your script would certainly have to be careful to check for successful send 
_and_
receive before removing old snapshots on either side.

ssh remotehost exit 1

seems to have a return code of 1 (cool).  rsh does _not_ have that desirable
property.  But that still leaves the problem of how to check the exit status
of the commands on both ends of a pipeline; maybe someone has solved
that?
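For what it's worth, bash keeps the exit status of every stage of the last
pipeline in its PIPESTATUS array, which is one way to get at both ends.  A rough
sketch (pool, filesystem, and host names are made up; the ssh exit status stands
in for the remote receive):

  #!/usr/bin/bash
  # incremental send; remove the old local snapshot only if both ends succeeded
  zfs send -i tank/fs@yesterday tank/fs@today | \
      ssh remotehost /usr/sbin/zfs receive -F backup/fs
  status=(${PIPESTATUS[@]})    # copy it before another command overwrites it
  if [ ${status[0]} -eq 0 ] && [ ${status[1]} -eq 0 ]; then
      zfs destroy tank/fs@yesterday
  else
      echo "send=${status[0]} receive=${status[1]}: keeping old snapshots" >&2
  fi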

Anyway, correctly verifying successful completion of the commands on both ends
might be a bit tricky, but is critical if you don't want failures or the need for
frequent non-incremental transfers.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can't rm file when "No space left on device"...

2008-06-06 Thread Richard L. Hamilton
> On Thu, Jun 05, 2008 at 09:13:24PM -0600, Keith
> Bierman wrote:
> > On Jun 5, 2008, at 8:58 PM   6/5/, Brad Diggs
> wrote:
> > > Hi Keith,
> > >
> > > Sure you can truncate some files but that
> effectively corrupts
> > > the files in our case and would cause more harm
> than good. The
> > > only files in our volume are data files.
> > 
> > So an rm is ok, but a truncation is not?
> > 
> > Seems odd to me, but if that's your constraint so
> be it.
> 
> Neither will help since before the space can be freed
> a transaction must
> be written, which in turn requires free space.
> 
> (So you say "let ZFS save some just-in-case-space for
> this," but, how
> much is enough?)

If you make it a parameter, that's the admin's problem.  Although
since each rm of a file also present in a snapshot just increases the
divergence, only an rm of a file _not_ present in a snapshot would
actually recover space, right?  So in some circumstances, even if it's
the admin's problem, there might be no amount that's enough to
do what one wants to do without removing a snapshot.  Specifically,
take a snapshot of a filesystem that's very nearly full, and then use
dd or whatever to create a single new file that fills up the filesystem.
At that point, only removing that single new file will help, and even that's
not possible without a just-in-case reserve of enough to handle a worst-case
metadata update (including system attributes, if any) + transaction log + any
other fudge I forgot, for at least one file's worth.

Maybe that's a simplistic view of the scenario, I dunno...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA controller suggestion

2008-06-06 Thread Richard L. Hamilton
I don't presently have any working x86 hardware, nor do I routinely work with
x86 hardware configurations.

But it's not hard to find previous discussion on the subject:
http://www.opensolaris.org/jive/thread.jspa?messageID=96790
for example...

Also, remember that SAS controllers can usually also talk to SATA drives;
they're usually more expensive of course, but sometimes you can find a deal.
I have a LSI SAS 3800x, and I paid a heck of a lot less than list for it (eBay),
I'm guessing because someone bought the bulk package and sold off whatever
they didn't need (new board, sealed, but no docs).  That was a while ago, and
being around US $100, it might still not have been what you'd call cheap.
If you want < $50, you might have better luck looking at the earlier discussion.
But I suspect to some extent you get what you pay for; the throughput on the
higher-end boards may well be a good bit higher, although for one disk
(or even two, to mirror the system disk), it might not matter so much.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] new install - when is zfs root offered? (snv_90)

2008-06-05 Thread Richard L. Hamilton
> A Darren Dunham <[EMAIL PROTECTED]> writes:
> 
> > On Tue, Jun 03, 2008 at 05:56:44PM -0700, Richard
> L. Hamilton wrote:
> >> How about SPARC - can it do zfs install+root yet,
> or if not, when?
> >> Just got a couple of nice 1TB SAS drives, and I
> think I'd prefer to
> >> have a mirrored pool where zfs owns the entire
> drives, if possible.
> >> (I'd also eventually like to have multiple
> bootable zfs filesystems in
> >> that pool, corresponding to multiple versions.)
> >
> > Is they just under 1TB?  I don't believe there's
> any boot support in
> > Solaris for EFI labels, which would be required for
> 1TB+.
> 
> ISTR that I saw an ARC case go past about a week ago
> about extended SMI
> labels to allow > 1TB disks, for exactly this reason.
> 

Thanks.  Just searched, that's
http://www.opensolaris.org/jive/thread.jspa?messageID=237603
(approved)

Since format didn't choke, and since a close reading suggests the
actual older limit is 1TiB (or maybe 1TiB - 1 sector), I should be fine
on that score.  The LSI SAS 3800x is supposed to have fcode boot support.
And snv_90 is supposed to have zfs boot install working on both SPARC and x86.
So I guess I'll just have to try it.  That only leaves me wondering whether I
should attempt a live upgrade from SXCE snv_81, or just do the text install off
a DVD onto one of the new disks (hoping the installer takes care of setting up
the disk however it needs to be to be bootable), and then adding identical
partitioning to the other disk, attaching a suitable partition on the 2nd disk
to the zpool, and using LVM (Disk Suite) to mirror any non-zfs partitions the
installation created.
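If I go the install-then-mirror route, I imagine the dance would be roughly the
following (device names, slice numbers, and the pool name are all hypothetical):

  # duplicate the first disk's partitioning onto the second
  prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

  # attach the matching slice to the root pool to form a mirror
  # (and remember to installboot the second disk so it's bootable too)
  zpool attach rpool c1t0d0s0 c1t1d0s0

  # mirror any leftover non-zfs slice (say s1) with SVM
  metadb -a -f c1t0d0s7 c1t1d0s7
  metainit d11 1 1 c1t0d0s1
  metainit d12 1 1 c1t1d0s1
  metainit d10 -m d11
  metattach d10 d12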

Never having used live upgrade myself (although I've read about it), I suppose it
would be an educational experience either way.  Time was once, I'd have looked forward
to that...must be getting tired...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] More USB Storage Issues

2008-06-05 Thread Richard L. Hamilton
> Nathan Kroenert wrote:
> > For what it's worth, I started playing with USB +
> flash + ZFS and was 
> > most unhappy for quite a while.
> > 
> > I was suffering with things hanging, going slow or
> just going away and 
> > breaking, and thought I was witnessing something
> zfs was doing as I was 
> > trying to do mirror recovery and all that sort of
> stuff.
> > 
> > On a hunch, I tried doing UFS and RAW instead and
> saw the same issues.
> > 
> > It's starting to look like my USB hubs. Once they
> are under any 
> > reasonable read/write load, they just make bunches
> of things go offline.
> > 
> > Yep - They are powered and plugged in.
> > 
> > So, at this stage, I'll be grabbing a couple of
> 'better' USB hubs (Mine 
> > are pretty much the cheapest I could buy) and see
> how that goes.
> > 
> > For gags, take ZFS out of the equation and validate
> that your hardware 
> > is actually providing a stable platform for ZFS...
> Mine wasn't...
> 
> That's my experience too. USB HUBs are cheap [ expletive deleted ]
> mostly...

What do you expect?  They're mostly consumer-grade, which is to say garbage,
rather than datacenter-grade.

And it's not just USB hubs - I've got a consumer-grade external modem,
and I swear it must have little or no ECC and/or watchdog, because I have
to power-cycle it every so often.  Wish I had a lead box to put it in to shield
it from the cosmic rays, maybe that would help...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] system backup and recovery

2008-06-05 Thread Richard L. Hamilton
> 
> On Thu, 2008-06-05 at 15:44 +0800, Aubrey Li wrote:
> > for windows we use ghost to backup system and
> recovery.
> > can we do similar thing for solaris by ZFS?
> 
> How about flar ?
> http://docs.sun.com/app/docs/doc/817-5668/flash-24?a=v
> iew
> [ I'm actually not sure if it's supported for zfs
> root though ]
> 
>   cheers,
>   tim

Oops, forgot about that one...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] system backup and recovery

2008-06-05 Thread Richard L. Hamilton
> Hi list,
> 
> for windows we use ghost to backup system and
> recovery.
> can we do similar thing for solaris by ZFS?
> 
> I want to create a image and install to another
> machine,
> So that the personal configuration will not be lost.

Since I don't do Windows, I'm not familiar with ghost, but I gather from
Wikipedia that it's more a disk cloning tool (bare metal backup/restore)
than a conventional backup program, although some people may well use it
for backups too.

Zfs has send and receive commands, which more or less correspond to
ufsdump and ufsrestore for ufs, except that the names send and receive
are perhaps more appropriate, since the zfs(1m) man page says:
>  The format of the stream is evolving. No backwards compatibility is
> guaranteed. You may not be able to receive your streams on future
> versions of ZFS."
which means to me that it's not a really good choice for archiving or long-term
backups, but it should be ok for transferring zfs filesystems between systems
that are the same OS version (or at any rate, close enough that the format
of the zfs send/receive datastream is compatible).
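For example, to move a filesystem to another box of a compatible version, something
like the following (pool, filesystem, and host names are made up):

  # snapshot and send the filesystem to the new machine
  zfs snapshot tank/home@migrate
  zfs send tank/home@migrate | ssh newhost zfs receive newpool/home

  # later, send only the changes made since the first snapshot
  zfs snapshot tank/home@migrate2
  zfs send -i tank/home@migrate tank/home@migrate2 | \
      ssh newhost zfs receive -F newpool/home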

There are of course also generic archiving utilities that can be used for
backup/restore, like tar (or star), pax, cpio, and so on.  But as far as I know,
there's no bare metal backup/restore facility that comes with Solaris, although
there are some commercial (and probably quite expensive) products that
do that.  But there's probably nothing at all that's quite equivalent to Norton
Ghost.

One can of course use "dd" to copy entire raw disk partitions, but that
won't set up the partitions, nor will it work as expected unless all disk sizes
are identical (for filesystems that don't have the OS on them), or if the OS
is on there, all hardware is identical.

Depending on just what "personal configuration" you mean, you may not
necessarily need to back up the whole system anyway.  Which is another way
of saying that I'm not sure your post was specific enough about what you're doing
to make it possible to suggest the best available (and preferably free) solution.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?

2008-06-05 Thread Richard L. Hamilton
> Hi All,
> 
> I'm new to ZFS but I'm intrigued by the possibilities
> it presents.
> 
> I'm told one of the greatest benefits is that,
> instead of setting 
> quotas, each user can have their own 'filesystem'
> under a single pool.
> 
> This is obviously great if you've got 10 users but
> what if you have 
> 10,000?  Are the overheads too great and do they
> outweigh the potential 
> benefits?
> 
> I've got a test system running with 5,000 dummy users
> which seems to 
> perform fine, even if my 'df' output is a little
> sluggish :-) .
> 
> Any advice or experiences would be greatly
> appreciated.

I think sharemgr was created to speed up the case of sharing out very
high numbers of filesystems on NFS servers, which otherwise took
quite a long time.

That's not to say that there might not be other problems with scaling to
thousands of filesystems.  But you're certainly not the first one to test it.

For cases where a single filesystem must contain files owned by
multiple users (/var/mail being one example), old fashioned
UFS quotas still solve the problem where the alternative approach
with ZFS doesn't.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] new install - when is zfs root offered? (snv_90)

2008-06-04 Thread Richard L. Hamilton
P.S. the ST31000640SS drives, together with the LSI SAS 3800x
controller (in a 64-bit 66MHz slot) gave me, using dd with
a block size of either 1024k or 16384k (1MB or 16MB) and a count
of 1024, a sustained read rate that worked out to a shade over 119MB/s,
even better than the nominal "sustained transfer rate" of 116MB/s documented
for the drives.  Even at a miserly 7200 RPM, that was better than 2 1/2 times
faster than the internal 10,000 RPM 73GB FC/AL (2Gb/s) drives, which impressed
the heck out of me.
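That was with nothing fancier than dd against the raw device, something like this
(device name is hypothetical; timex, or a stopwatch, supplies the elapsed time to
divide into the bytes read):

  # read sequentially from the raw device with 1MB and 16MB block sizes
  timex dd if=/dev/rdsk/c2t0d0s0 of=/dev/null bs=1024k count=1024
  timex dd if=/dev/rdsk/c2t0d0s0 of=/dev/null bs=16384k count=1024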
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] new install - when is zfs root offered? (snv_90)

2008-06-04 Thread Richard L. Hamilton
> On Tue, Jun 03, 2008 at 05:56:44PM -0700, Richard L.
> Hamilton wrote:
> > How about SPARC - can it do zfs install+root yet,
> or if not, when?
> > Just got a couple of nice 1TB SAS drives, and I
> think I'd prefer to
> > have a mirrored pool where zfs owns the entire
> drives, if possible.
> > (I'd also eventually like to have multiple bootable
> zfs filesystems in
> > that pool, corresponding to multiple versions.)
> 
> Is they just under 1TB?  I don't believe there's any
> boot support in
> Solaris for EFI labels, which would be required for
> 1TB+.

Don't know about Solaris or the on-disk bootloader (I would think they
ought to have that eventually if not already), but since it's been awhile
since I've seen a new firmware update for the SB2K, I doubt the firmware
could handle EFI labels.

But "format" is perfectly happy putting either Sun or EFI labels
on these drives, so that shouldn't be a problem.   SCSI "read capacity"
shows 1953525168 (512-byte) sectors, which multiplied out is
1,000,204,886,016 bytes; more than 10^12 (1TB), but less than 2^40 (1TiB).
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] new install - when is zfs root offered? (snv_90)

2008-06-03 Thread Richard L. Hamilton
How about SPARC - can it do zfs install+root yet, or if not, when?
Just got a couple of nice 1TB SAS drives, and I think I'd prefer to
have a mirrored pool where zfs owns the entire drives, if possible.
(I'd also eventually like to have multiple bootable zfs filesystems in
that pool, corresponding to multiple versions.)

Is/will all that be possible?  Would it be ok to pre-create the pool,
and if so, any particular requirements?

Currently running snv_81 on a Sun Blade 2000; SAS/SATA controller
is an LSI Logic SAS 3800X 8-port, in the 66MHz slot.  I chose SAS drives
for the first two (of 8) trusting SCSI support to probably be more mature
and functional than SATA support, but the rest (as I'm willing to part with
the $$) will probably be SATA for price.  The current two SAS drives are Seagate
ST31000640SS (which I just used smartctl to confirm have SMART support
including temperature reporting).  Enclosure is an Enhance E8-ML (no
enclosure services support).
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The ZFS inventor and Linus sitting in a tree?

2008-05-20 Thread Richard L. Hamilton
> On Mon, May 19, 2008 at 10:06 PM, Bill McGonigle
> <[EMAIL PROTECTED]> wrote:
> > On May 18, 2008, at 14:01, Mario Goebbels wrote:
> >
> >> I mean, if the Linux folks to want it, fine. But
> if Sun's actually
> >> helping with such a possible effort, then it's
> just shooting itself in
> >> the foot here, in my opinion.
> >
> >
> 
> []
> > they're quick to do it - they threatened to sue me
> when they couldn't
> > figure out how to take back a try-out server).
> 
> There's a story contained within that for sure! :)
> You brought a smile
> to this subscriber when I read it.
> 
> 
> > Having ZFS as a de- facto standard lifts all boats,
> IMHO.
> It's still hard to believe (in one sense) that the
> entire world isn't
> beating a path to Sun's door and PLEADING for ZFS.
> This is (if y'all
> will forgive the colloquialism) a kick-ass amazing
> piece of software.
> It appears to defy all the rules, a bit like
> levitation in a way, or
> perhaps it just rewrites those rules. There are days
> I still can't get
> my head around what ZFS really is.
> 
> In general, licensing issues just make my brain
> bleed, but one hopes
> that the licensing gurus can get their heads together
> and find a way
> to get this done. I don't personally believe that
> Open Solaris *OR*
> Solaris will lose if ZFS makes its way over the fence
> to Linux, I
> think that this is a big enough tent for everyone.
> Sure hope so
> anyway, it would be immensely sad to see technology
> like this not
> being adopted/ported/migrated/whatever more widely
> because of "damn
> lawyers" and the morass called licensing.
> 
> Perhaps (gazing into a cloudy crystal ball that
> hasn't been cleaned in
> a while) Solaris/Open Solaris can manage to hold onto
> ZFS-on-boot
> which is perhaps *the* most mind bending
> accomplishment within the zfs
> concept, and let the rest procreate elsewhere. That
> could contribute
> to the "must-have/must-install" cachet of
> Solaris/OpenSolaris.

Umm, I think it's too late for that; as I recall, the bits needed for
read-only access had to be made dual CDDL/GPL to be linked with GRUB.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS for write-only media?

2008-04-24 Thread Richard L. Hamilton
> "Dana H. Myers" <[EMAIL PROTECTED]> wrote:
> 
> > Bob Friesenhahn wrote:
> > > Are there any plans to support ZFS for write-only
> media such as 
> > > optical storage?  It seems that if mirroring or
> even zraid is used 
> > > that ZFS would be a good basis for long term
> archival storage.
> > I'm just going to assume that "write-only" here
> means "write-once,
> > read-many", since it's far too late for an April
> Fool's joke.
> 
> I know two write-only device types:
> 
> WOM   Write-only media
> WORN  Write-once read never (this one is often used
> for backups ;-)
> 
> Jörg

Save $$ (or €€) - use /dev/null instead.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] utf8only-property

2008-02-28 Thread Richard L. Hamilton
> So, I set utf8only=on and try to create a file with a
> filename that is
> a byte array that can't be decoded to text using
> UTF-8. What's supposed
> to happen? Should fopen(), or whatever syscall
> 'touch' uses, fail?
> Should the syscall somehow escape utf8-incompatible
> bytes, or maybe
> replace them with ?s or somesuch? Or should it
> automatically convert the
> filename from the active locale's fs-encoding
> (LC_CTYPE?) to UTF-8?

First, utf8only can AFAIK only be set when a filesystem is created.

Second, "use the source, Luke:"
http://src.opensolaris.org/source/search?q=&defs=&refs=z_utf8&path=%2Fonnv%2Fonnv-gate%2Fusr%2Fsrc%2Futs%2Fcommon%2Ffs%2Fzfs%2Fzfs_vnops.c&hist=&project=%2Fonnv

Looks to me like lookups, file create, directory create, creating symlinks,
and creating hard links will all fail with error EILSEQ ("Illegal byte 
sequence")
if utf8only is enabled and they are presented with a name that is not valid
UTF-8.  Thus, on a filesystem where it is enabled (since creation), no such
names can be created or would ever be there to be found anyway.

So in that case, the system is refusing non UTF-8 compatible byte strings
and there's no need to escape anything.
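A quick way to see that from bash or ksh, assuming a pool named tank mounted in
the default place:

  # utf8only can only be turned on when the filesystem is created
  zfs create -o utf8only=on tank/strict
  touch /tank/strict/plain.txt                # plain ASCII (valid UTF-8): fine
  touch "/tank/strict/$(printf '\377')bad"    # 0xff is never valid UTF-8: EILSEQ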

Further, your last sentence suggests that you might hold the
incorrect idea that the kernel knows or cares what locale an application is
running in: it does not.  Nor indeed does the kernel know about environment
variables at all, except as the third argument passed to execve(2); it
doesn't interpret them, or even validate that they are of the usual
name=value form, they're typically handled pretty much the same as the
command line args, and the only illusion of magic is that with the more
widely used variants of exec that don't explicitly pass the environment,
they internally call execve(2) with the external variable environ as the
last arg, thus passing the environment automatically.

There have been Unix-like OSs that make the environment available to
additional system calls (give or take what's a true system call in the
example I'm thinking of, namely variant links (symlinks with embedded
environment variable references) in the now defunct Apollo Domain/OS),
but AFAIK, that's not the case in those that are part of the historical
Unix source lineage.  (I have no idea off the top of my head whether
or not Linux, or oddballs like OSF/1 might make environment variables
implicitly available to syscalls other than execve(2).)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 'du' is not accurate on zfs

2008-02-18 Thread Richard L. Hamilton
> On Sat, 16 Feb 2008, Richard Elling wrote:
> 
> > "ls -l" shows the length.  "ls -s" shows the size,
> which may be
> > different than the length.  You probably want size
> rather than du.
> 
> That is true.  Unfortunately 'ls -s' displays in
> units of disk blocks 
> and does not also consider the 'h' option in order to
> provide a value 
> suitable for humans.
> 
> Bob

ISTR someone already proposing to make ls -h -s   work in
a way one might hope for.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vxfs vs ufs vs zfs

2008-02-18 Thread Richard L. Hamilton
> Hello,
> 
> I have just done comparison of all the above
> filesystems
> using the latest filebench.  If you are interested:
> http://przemol.blogspot.com/2008/02/zfs-vs-vxfs-vs-ufs
> -on-x4500-thumper.html
> 
> Regards
> przemol

I would think there'd be a lot more variation based on workload,
such that the overall comparison may fall far short of telling the
whole story.  For example, IIRC, VxFS is more or less
extent-based (like mainframe storage), so serial I/O for large
files should be perhaps its strongest point, while other workloads
may do relatively better with the other filesystems.

The free basic edition sounds cool, though - downloading now.
I could use a bit of practice with VxVM/VxFS; it's always struck
me as very good when it was good (online reorgs of storage and
such), and an utter terror to untangle when it got messed up,
not to mention rather more complicated than DiskSuite/SVM
(and of course _waay_ more complicated than zfs :-)
Any idea if it works with reasonably recent OpenSolaris (build 81) ?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sharenfs with over 10000 file systems

2008-01-25 Thread Richard L. Hamilton
> New, yes. Aware - probably not.
> 
> Given cheap filesystems, users would create "many"
> filesystems was an easy guess, but I somehow don't
> think anybody envisioned that users would be creating
> tens of thousands of filesystems.
> 
> ZFS - too good for it's own good :-p

IMO (and given mails/posts I've seen typically by people using
or wanting to use zfs at large universities and the like, for home
directories) this is frequently driven by the need for per-user
quotas.  Since zfs doesn't have per-uid quotas, this means they
end up creating (at least one) filesystem per user.  That means a
share per user, and locally a mount per user, which will never
scale as well as (locally) a single share of /export/home, and a
single mount (although there would of course be automounts to /home
on demand, but they wouldn't slow down bootup).  sharemgr and the
like may be attempts to improve the situation, but they mitigate rather
than eliminate the consequences of exploding what used to be a single
large filesystem into a bunch of relatively small ones, simply based on
the need to have per-user quotas with zfs.
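The resulting pattern tends to look something like this (pool, share setting, and
user names are made up):

  # one filesystem per user, each with its own quota; sharenfs is inherited
  zfs create -o sharenfs=on tank/home
  for u in alice bob carol; do
      zfs create -o quota=10g tank/home/$u
  done

...and then multiply that by ten thousand users.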

And there are still situations where a per-uid quota would be useful,
such as /var/mail (although I could see that corrupting mailboxes
in some cases) or other sorts of shared directories.

OTOH, the implementation could certainly vary a little.  The
equivalent of the "quotas" file should be automatically created
when quotas are enabled, and invisible; and unless quotas are not
only disabled but purged somehow, it should maintain per-uid use
statistics even for uids with no quotas, to eliminate the need for
quotacheck (initialization of quotas might well be restricted to filesystem
creation time, to eliminate the need for a cumbersome pass through
existing data, at least at first; but that would probably be wanted too,
since people don't always plan ahead).  But other quota-related
functionality could IMO maintain, although the implementations
might have to get smarter, and there ought to be some alternative
to the method presently used with ufs of simply reading the
quotas file to iterate through the available stats.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] do zfs filesystems isolate corruption?

2007-08-11 Thread Richard L. Hamilton
> In the old days of UFS, on occasion one might create
> multiple file systems (using multiple partitions) of
> a large LUN if filesystem corruption was a concern.
> It didn’t happen often  but filesystem corruption
> has happened.  So, if filesystem X was corrupt
>  filesystem Y would be just fine.
> 
> With ZFS, does the same logic hold true for two
> filesystems coming from the same pool?
> 
> Said slightly differently, I’m assuming that if the
> pool becomes mangled some how then all filesystems
> will be toast … but is it possible to have one
> filesystem be corrupted while the other filesystems
> are fine?
> 
> Hmmm, does the answer depend on if the filesystems
> are nested
> ex: 1  /my_fs_1  /my_fs_2
> ex: 2  /home_dirs/home_dirs/chris
> 
> TIA!


If they're always consistent on-disk, and the checksumming catches storage
subsystem errors with almost 100% certainty, then the only corruption can
come from bugs in the code, or uncaught non-storage (i.e. CPU, memory)
bugs perhaps.

So I suppose the answer would depend on where in the code things
went astray; but that you probably could not expect any sort of isolation
or even sanity at that point; if privileged code is running amok, anything
could happen, and that would be true with two distinct ufs filesystems too,
I would think.  Perhaps one might guess that it might be more likely
for corruption not to be isolated to a single zfs filesystem (given how
lightweight a zfs filesystem is).  OTOH, since zfs catches errors other
filesystems don't, think of how many ufs filesystems may well be corrupt
for a very long time before causing a panic and having that get discovered
by fsck.  Ideally, if zfs code passes its test suites, you're safer with it than
with most anything else, even if it isn't perfect.

But I'm way out on a limb here; no doubt the experts will correct and
amend what I've said...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 7zip compression?

2007-07-31 Thread Richard L. Hamilton
> Hello Marc,
> 
> Sunday, July 29, 2007, 9:57:13 PM, you wrote:
> 
> MB> MC  eastlink.ca> writes:
> >> 
> >> Obviously 7zip is far more CPU-intensive than
> anything in use with ZFS
> >> today.  But maybe with all these processor cores
> coming down the road,
> >> a high-end compression system is just the thing
> for ZFS to use.
> 
> MB> I am not sure you realize the scale of things
> here. Assuming the worst case:
> MB> that lzjb (default ZFS compression algorithm)
> performs as bad as lha in [1],
> MB> 7zip would compress your data only 20-30% better
> at the cost of being 4x-5x
> MB> slower !
> 
> MB> Also, in most cases, the bottleneck in data
> compression is the CPU, so
> MB> switching to 7zip would reduce the I/O throughput
> by about 4x.
> 
> 1. it depends on a specific case - sometimes it's cpu
> sometimes not
> 
> 2. sometimes you don't really care about cpu - you
> have hundreds TBs
> of data rarely used and then squeezing 20-30% more
> space is a huge
> benefit - especially when you only read those files
> once they are
> written

* disks are probably cheaper than CPUs

* it looks to me like 7z may also be RAM-hungry; and there are probably
better ways to use the RAM, too

No doubt it's an option that would serve _someone_ well despite its
shortcomings.  But are there enough such someones to make it worthwhile?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cluster File System Use Cases

2007-07-13 Thread Richard L. Hamilton
> Bringing this back towards ZFS-land, I think that
> there are some clever
> things we can do with snapshots and clones.  But the
> age-old problem 
> of arbitration rears its ugly head.  I think I could
> write an option to expose
> ZFS snapshots to read-only clients.  But in doing so,
> I don't see how to
> prevent an ill-behaved client from clobbering the
> data.  To solve that
> problem, an arbiter must decide who can write where.
>  The SCSI
> protocol has almost nothing to assist us in this
> cause, but NFS, QFS,
> and pxfs do.  There is room for cleverness, but not
> at the SCSI or block
> level.
>  -- richard

Yeah; ISTR that IBM mainframe complexes with what they called
"shared DASD" (DASD==Direct Access Storage Device, i.e. disk, drum, or the
like) depended on extent reserves.  IIRC, SCSI dropped extent reserve
support, and indeed it was never widely nor reliably available anyway.
AFAIK, all SCSI offers is reserves of an entire LUN; that doesn't even help
with slices, let alone anything else.  Nor (unlike either the VTOC structure
on MVS nor VxFS) is ZFS extent-based anyway; so even if extent reserves
were available, they'd only help a little.  Which means, as he says, some
sort of arbitration.

I wonder whether the hooks for putting the ZIL on a separate device
will be of any use for the cluster filesystem problem; it almost makes me
wonder if there could be any parallels between pNFS and a refactored
ZFS.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: ZFS - SAN and Raid

2007-06-27 Thread Richard L. Hamilton
> Victor Engle wrote:
> > Roshan,
> > 
> > As far as I know, there is no problem at all with
> using SAN storage
> > with ZFS and it does look like you were having an
> underlying problem
> > with either powerpath or the array.
> 
> Correct.  A write failed.
> 
> > The best practices guide on opensolaris does
> recommend replicated
> > pools even if your backend storage is redundant.
> There are at least 2
> > good reasons for that. ZFS needs a replica for the
> self healing
> > feature to work. Also there is no fsck like tool
> for ZFS so it is a
> > good idea to make sure self healing can work.
> 
> Yes, currently ZFS on Solaris will panic if a
> non-redundant write fails.
> This is known and being worked on, but there really
> isn't a good solution
> if a write fails, unless you have some ZFS-level
> redundancy.

Why not?  If O_DSYNC applies, a write() can still fail with EIO, right?
And if O_DSYNC does not apply, an app could not assume that the
written data was on stable storage anyway.

Or the write() can just block until the problem is corrected (if correctable)
or the system is rebooted.

In any case, IMO there ought to be some sort of consistent behavior
possible short of a panic.  I've seen UFS based systems stay up even
with their disks incommunicado for awhile, although they were hardly
useful like that except insofar as activity strictly involving reading
already cached pages was involved.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS Scalability/performance

2007-06-20 Thread Richard L. Hamilton
> Hello,
> 
> I'm quite interested in ZFS, like everybody else I
> suppose, and am about
> to install FBSD with ZFS.
> 
> On that note, i have a different first question to
> start with. I
> personally am a Linux fanboy, and would love to
> see/use ZFS on linux. I
> assume that I can use those ZFS disks later with any
> os that can
> work/recognizes ZFS correct? e.g.  I can
> install/setup ZFS in FBSD, and
> later use it in OpenSolaris/Linux Fuse(native) later?

I've seen some discussions that implied adding attributes
to support non-Solaris (*BSD) uses of zfs, so that the format would
remain interoperable (i.e. free of incompatible extensions),
although not all OSs might fully support those.  But I don't know
if there's some firm direction to keeping the on-disk format
compatible across platforms that zfs is ported to.  Indeed, if the
code is open-source, I'm not sure that's possible to _enforce_.  But
I suspect (and certainly hope) it's being encouraged.  If someone who
works on zfs could comment on that, it might help.

> Anyway, back to business :)
> I have a whole bunch of different sized disks/speeds.
> E.g. 3 300GB disks
> @ 40mb, a 320GB disk @ 60mb/s, 3 120gb disks @ 50mb/s
> and so on.
> 
> Raid-Z and ZFS claims to be uber scalable and all
> that, but would it
> 'just work' with a setup like that too?
> 
> I used to match up partition sizes in linux, so make
> the 320gb disk into
> 2 partitions of 300 and 20gb, then use the 4 300gb
> partitions as a
> raid5, same with the 120 gigs and use the scrap on
> those aswell, finally
> stiching everything together with LVM2. I can't easly
> find how this
> would work with raid-Z/ZFS, e.g. can I really just
> put all these disks
> in 1 big pool and remove/add to it at will? And I
> really don't need to
> use softwareraid yet still have the same reliablity
> with raid-z as I had
> with raid-5? What about hardware raid controllers,
> just use it as a JBOD
> device, or would I use it to match up disk sizes in
> raid0 stripes (e.g.
> the 3x 120gb to make a 360 raid0).
> 
> Or you'd recommend to just stick with
> raid/lvm/reiserfs and use that.

One of the advantages of zfs is said to be that if it's used
end-to-end, it can catch more potential data integrity issues
(including controller, disk, cabling glitches, misdirected writes, etc).

As far as I understand, raid-z is like raid-5 except that the stripes
are varying size, so all writes are full-stripe, closing the "write hole",
so no NVRAM is needed to ensure that recovery would always be possible.

Components of raid-z or raid-z2 or mirrors can AFAIK only be used up to the
size of the smallest component.  However, a zpool can consist of
the aggregation (dynamic striping, I think) of various mirrors or raid-z[2]
virtual devices.  So you could group similar sized chunks (be it partitions or
whole disks) into redundant virtual devices, and aggregate them all into a
zpool (and add more later to grow it, too).  Ideally, all such virtual devices
would have the same level of redundancy; I don't think that's _required_, but
there isn't much good excuse for doing otherwise, since the performance of
raid-z[2] is different from that of a mirror.
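So with a mixed collection of disks like the one described above, one might do
something like this (device names are made up), grouping the like-sized disks
into their own raid-z vdevs:

  # two raid-z vdevs, striped together into one pool
  zpool create tank \
      raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
      raidz c2t0d0 c2t1d0 c2t2d0

  # grow it later by adding another redundant vdev
  zpool add tank raidz c3t0d0 c3t1d0 c3t2d0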

There may be some advantages to giving zfs entire disks where possible;
it will handle labelling (using EFI labels) and IIRC, may be able to better
manage the disk's write cache.

For the most part, I can't see many cases where using zfs together with
something else (like vxvm or lvm) would make much sense.  One possible
exception might be AVS (http://opensolaris.org/os/project/avs/) for
geographic redundancy; see http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless
for more details.

It can be quite easy to use, with only two commands (zpool and zfs);
however, you still want to know what you're doing, and there are plenty of
issues and tradeoffs to consider to get the best out of it.

Look around a little for more info; for example,
http://www.opensolaris.org/os/community/zfs/faq/
http://en.wikipedia.org/wiki/ZFS
http://docs.sun.com/app/docs/doc/817-2271   (ZFS Administration Guide)
http://www.google.com/search?hl=en&q=zpool+OR+zfs+site%3Ablogs.sun.com&btnG=Search
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: OT: extremely poor experience with Sun Download

2007-06-16 Thread Richard L. Hamilton
Well, I just grabbed the latest SXCE, and just for the heck of it, fooled
around until I got the Java Web Start to work.

Basically, one's browser needs to know the following (how to do that depends
on the browser):

MIME Type:  application/x-java-jnlp-file
File Extension: jnlp
Open With:  /usr/bin/javaws

I got that working with both firefox and opera without inordinate difficulty.
Once that was done, after clicking "accept" and selecting the three files, I
clicked on the "download with sdm" box, it started sdm, and passed all three
files to it.  I think I also had to click start on sdm.  That's it...not so bad 
after all.

sdm has a major advantage over typical downloads done directly by browsers for
such large files: if the server supports it (it needs to be able to handle
requests for portions of files rather than just an entire file), it can restart
failed transfers more or less automatically; and they can even be paused and
resumed more or less arbitrarily.  I've used that in the past to download the
entire Solaris 10 CD set over a _dialup_.  Took a week (well, 8 hours a day
connected), but it worked.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: OT: extremely poor experience with Sun Download

2007-06-14 Thread Richard L. Hamilton
> Intending to experiment with ZFS, I have been
> struggling with what  
> should be a simple download routine.
> 
> Sun Download Manager leaves a great deal to be
> desired.
> 
> In the Online Help for Sun Download Manager there's a
> section on  
> troubleshooting, but if it causes *anyone* this much
> trouble
>  gercrap/> then  
> it should, surely, be fixed.
> 
> Sun Download Manager -- a FORCED step in an
> introduction to  
> downloadable software from Sun -- should be PROBLEM
> FREE in all  
> circumstances. It gives an extraordinarily poor first
> impression.
> 
> If it can't assuredly be fixed, then we should not be
> forced to use it.
> 
> (True, I might have ordered rather than downloaded a
> DVD, but Sun  
> Download Manager has given such a poor impression
> that right now I'm  
> disinclined to pay.)

For trying out zfs, you could always request the free "Starter Kit" DVD
at http://www.opensolaris.org/kits/ which  contains the
SXCE, Nexenta, Belenix and Schillix distros (all newer than Solaris 10).

Beyond that, while I'm sure you're right about that providing a poor
first impression, I guess I'm too old to have much sympathy for something
taking minutes rather than seconds of attention being a barrier to entry.
Yes, the download experience should be vastly improved, but if you let
that stop you, I wonder if you're all that interested in the first place.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Re: Re: Re: ZFS consistency guarantee

2007-06-09 Thread Richard L. Hamilton
I wish there was a uniform way whereby applications could
register their ability to achieve or release consistency on demand,
and if registered, could also communicate back that they had
either achieved consistency on-disk, or were unable to do so.  That
would allow backup procedures to automatically talk to apps capable
of such functions, to get them to a known state on-disk before taking
a snapshot.  That would allow one to for example not stop a DBMS, but
simply have it seem to pause for a moment while achieving consistency
and until told that the snapshot was complete; thus providing minimum
impact while still having fully usable backups (and without needing to
do the database backups _through_ the DBMS).  

Something I heard once leads me to believe that some such facility
or convention for how to communicate such issues with e.g. database
server processes exists on Windows.  If they've got it, we really ought
to have something even better, right? :-)

(That's of course not specific to ZFS, but would be useful with any filesystem
that can take snapshots.)
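Lacking such a registration mechanism, the usual hand-rolled version looks
something like the following, with a made-up dbadmin command standing in for
whatever quiesce/resume mechanism a given DBMS actually offers:

  dbadmin quiesce                        # hypothetical: get the DB consistent on disk
  zfs snapshot tank/db@nightly
  dbadmin resume                         # hypothetical: let it run again
  # back up from the snapshot at leisure, e.g.:
  zfs send tank/db@nightly > /backup/db-nightly.zfs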
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: shareiscsi is cool, but what about sharefc or sharescsi?

2007-06-01 Thread Richard L. Hamilton
> I'd love to be able to server zvols out as SCSI or FC
> targets.  Are
> there any plans to add this to ZFS?  That would be
> amazingly awesome.

Can one use a spare SCSI or FC controller as if it were a target?

Even if the hardware is capable, I don't see what you describe as
a ZFS thing really; it isn't for iSCSI, except that ZFS supports
a shareiscsi option (and property?) by "knowing" how to tell the
iSCSI server to do the right thing.

That is, there would have to be something like an iSCSI server
except that it "listened" on an otherwise unused SCSI or FC
interface.

I think that would require not just the daemon but probably new
driver facilities as well.  Given that one can run IP over FC,
it seems to me that in principle it ought to be possible, at least
for FC.  Not so sure about SCSI.

Also not sure about performance.  I suspect even high-end SAN controllers
have a bit more latency than the underlying drives.  And this is a
general-purpose OS we're talking about doing this to; I don't know that it
would be acceptably close, or as robust (depending on the hardware) as a
high-end FC SAN, although it might be possible to be a good deal cheaper.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Testing of UFS, VxFS and ZFS

2007-04-17 Thread Richard L. Hamilton
> # zfs create pool raidz d1 … d8

Surely you didn't create the zfs pool on top of SVM metadevices?  If so,
that's not useful; the zfs pool should be on top of raw devices.
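That is, something more like the following (device names are made up), with no
metadevices underneath:

  # raid-z pool built directly on the raw disks
  zpool create testpool raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
      c2t4d0 c2t5d0 c2t6d0 c2t7d0
  zfs create testpool/fs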

Also, because VxFS is extent based (if I understand correctly), not unlike how
MVS manages disk space I might add, _it ought_ to blow the doors off of
everything for sequential reads, and probably sequential writes too,
depending on the write size.  OTOH, if a lot of files are created
and deleted, it needs to be defragmented (although I think it can do that
automatically; but there's still at least some overhead while a defrag is
running).

Finally, don't forget complexity.  VxVM+VxFS is quite capable, but it
doesn't always recover from problems as gracefully as one might hope,
and it can be a real bear to get untangled sometimes (not to mention
moderately tedious just to set up).  SVM, although not as capable as VxVM,
is much easier IMO.  And zfs on top of raw devices is about as easy as it
gets.  That may not matter _now_, when whoever sets these up is still
around; but when their replacement has to troubleshoot or rebuild, it
might help to have something that's as easy as possible.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: storage type for ZFS

2007-04-17 Thread Richard L. Hamilton
Well, no; his quote did say "software or hardware".  The theory is apparently
that ZFS can do better at detecting (and with redundancy, correcting) errors
if it's dealing with raw hardware, or as nearly so as possible.  Most SANs
_can_ hand out raw LUNs as well as RAID LUNs, the folks that run them are
just not used to doing it.

Another issue that may come up with SANs and/or hardware RAID:
supposedly, storage systems with large non-volatile caches will tend to have
poor performance with ZFS, because ZFS issues cache flush commands as
part of committing every transaction group; this is worse if the filesystem
is also being used for NFS service.  Most such hardware can be
configured to ignore cache flushing commands, which is safe as long as
the cache is non-volatile.

The above is simply my understanding of what I've read; I could be way off
base, of course.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS and Linux

2007-04-14 Thread Richard L. Hamilton
> I hope this isn't turning into a License flame war.
> But why do Linux  
> contributors not deserve the right to retain their
> choice of license  
> as equally as Sun, or any other copyright holder,
> does?
> 
> The anti-GPL kneejerk just witnessed on this list is
> astonishing. The  
> BSD license, for instance, is fundamentally
> undesirable to many GPL  
> licensors (myself included).

Nothing wrong with GPL as an abstract ideology.  But when ideology trumps
practicality (which it does when code can't be as widely reused as
possible), I have a problem with that.

As far as I'm concerned, GPL is to open licenses as political correctness
is to free speech.

Of course, anyone who writes something is free to use any license they
please.  And anyone else is free to choose an incompatible license, either
for reasons that have nothing specifically to do with being incompatible,
or because they just don't want the sucking sound of their goodies being
adopted and very little being returned (which strikes me as a major
element of the relationship between Linux and *BSD; although to be sure,
there is some two-way cooperation).

I have zero problem with Linux using GPLv2 (and as some have said,
perhaps being stuck with it at this point).  I'm not sure I'd want their
code anyway, and even if I did, I darn sure wouldn't want the "we
don't need no steekin' DDI 'cause we're source based" philosophy that
comes with it, because to my mind that ends up justifying a lot of
poor design and engineering discipline in the name of not being limited
by backwards compatibility.

So, if having chosen a license based on the ideology of being a lever to
free other software (but on their terms!) for the sake of being compatible
with them, the Linux folks now have to re-invent equivalents of ZFS and
Dtrace, it serves them right, IMO.

And as someone else also mentioned, competition is good anyway.  Not
as if a lot of ideas don't cross-pollinate.  But if every free OS used
compatible licenses, I think 20 years later, the result would resemble
the result of inbreeding...not pretty, and a shallower meme pool overall.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
[EMAIL PROTECTED]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: FreeBSD's system flags.

2007-04-14 Thread Richard L. Hamilton
So you're talking about not just reserving something for on-disk compatibility,
but also maybe implementing these for Solaris?  Cool.  Might be fairly useful
for hardening systems (although as long as someone had raw device access or
physical access, they could of course still get around it; that would have to be
taken into account in the overall design for it to make much of a difference).

Other problems: from a quick look at the header files there's no room left in
the 64-bit version of the stat structure to add something in which to retrieve
the flags; that may mean a new and incompatible (with other chflags(2) 
supporting systems) system call? Also, there's no provision in pkgmap(4) for
file flags; could that be extended compatibly?
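
For anyone who hasn't run into the feature, this is roughly what it looks like
on the FreeBSD side (commands as I remember them, and the file name is just an
example); the question above is what a Solaris equivalent would look like:

  # FreeBSD: make a file system-immutable, show its flags, then clear the flag
  chflags schg /etc/important.conf
  ls -lo /etc/important.conf      # -o includes the file flags in the listing
  chflags noschg /etc/important.conf
  # clearing schg may also require the kernel securelevel to be low enough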
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
[EMAIL PROTECTED]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS for Linux (NO LISCENCE talk, please)

2007-04-14 Thread Richard L. Hamilton
BTW, flash drives have a filesystem too; AFAIK, it's usually pretty much
just FAT32, which is garbage,  but widely supported, so that you
can plug them in just about anywhere.  In most cases, one can put
some other filesystem on them, but I wouldn't rule out the possibility
that that might not work well (or at all) with some of them.
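
On Solaris, identifying the stick and putting something else on it would look
roughly like this (device name invented, and of course it wipes the stick):

  rmformat -l                     # list removable media; note the stick's device
  zpool create -f stick c3t0d0    # whole-stick zfs pool; or newfs a slice for ufs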
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
[EMAIL PROTECTED]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS for Linux (NO LISCENCE talk, please)

2007-04-14 Thread Richard L. Hamilton
You've ruled out most of what there is to talk about on the subject, I think.
If the licenses are incompatible (regardless of which if either is better),
then a Linux distro probably couldn't just include ZFS.

Now maybe (assuming ZFS were ported, which I doubt anybody would bother
with until a reasonable solution to the license issue existed) it would be 
possible to have a Linux based distro with source for a ported ZFS, that
could with one script be compiled and installed (with the kernel part as a
loadable module).  I don't know if that would be feasible, or compliant, or
how the ideologues (of whichever side) would deal with that.  And it would
probably rule out using ZFS for the root filesystem, at the very least.

In the short run, you might as well just choose one or the other; either
stick with Linux and do without ZFS, or go with the most Linux-like
OpenSolaris distro (I'm guessing Nexenta) and accept that it won't be
100% Linux compatible (and it might be a while, if ever, before it has
a fully comparable range of driver support, esp. for older hardware).

Like I suggested before, I think you have a chicken-or-egg problem (like
alternative fuel distribution vs getting the vehicles out there - both
are needed for it to work, both are an investment, and nobody will
make one investment without assurance that the other will also be
in place).  But it's more than that.  Not only is there the initial work
that would be involved in porting ZFS, but AFAIK Linux explicitly avoids
committing to something like a DDI/DKI (being totally source-based and
not wanting the burden of backwards compatibility within the kernel,
which I understand, but it does have implications), which means they make
structural changes covering the entire scope of kernel code potentially
at any time, and any kernel code that's not in the main Linux source tree
won't have those changes applied, and thus its independent maintainers
would have to struggle just to keep up.  Add to that that they'd also have
to struggle to keep up with the changes and improvements originating
on the OpenSolaris side, and it's going to fall apart fast.  If the licenses
were compatible* and the ported ZFS code could be in the main Linux
source tree, then at least the problem of keeping it up to date would
only involve bringing updates into the port, not also separately keeping
the port in sync with the rest of the Linux kernel.

You can't ignore the license, because nobody on either side is willing
to, and because breaking the law isn't a good idea.  And the license
is a real part of the problem of getting what you want working, and
keeping it working.

As for the strictly technical, I don't know much about the Solaris vfs/vnode
interface (which is not public or stable unfortunately, although I kind of
understand why - it's one of the few places other than simply adding
system calls where new magic can be introduced), and nothing about
whatever the equivalent Linux filesystem interface is.  But I think one
of the filesystems on Linux was ported from IRIX, which I think was also
SVR4 based, so it may have a somewhat similar (but separately evolved)
filesystem interface.  In other words, whatever was done to port XFS
to Linux might at least serve as a general hint as to what might be
needed to port ZFS to Linux, although not much more, since IRIX and Solaris
are different, and Linux has presumably changed some since the XFS port
was done.  In any case, I'm reasonably sure there are people out there who
would more or less know what needed to be done, and indeed might have
already taken a quick look at it; but unless it was practical (and I've explained
why I think it isn't, given the license issue), I can't imagine why they'd spend
at least a few hundred hours actually doing it.

Another interim option BTW might be to build an OpenSolaris based NAS
appliance (using ZFS), and have your Linux system(s) NFS mount from it.
Useless on a portable laptop of course, and there are some performance
issues with it, as well as some history of NFS interoperability problems
with Linux.  But it might be useful in some situations.
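
Roughly (names invented): share a filesystem from the OpenSolaris box over NFS
and mount it from the Linux side, something like:

  # on the OpenSolaris/ZFS box
  zfs create tank/export
  zfs set sharenfs=on tank/export

  # on a Linux client, assuming the server answers to the name "nas"
  mount -t nfs nas:/tank/export /mnt/nas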

* I kept this out of the main flow of what I was writing, because I'm going
to get a little more ideological than you'll like.

You won't get any sympathy from me on license issues unless they're a
two-way street.  For example, BSD code can be incorporated into GPL
code, but not the other way around.  So goodies can flow from the BSDs into 
Linux relatively freely, but not so easily the other way.  The Linux code can
of course be used as the basis for a total rewrite for one of the BSDs, but
that's at least twice the work of a port (two teams, one to create a 
specification from the code, another to create the new code from the
specification; and the only thing they talk to each other about is the
specification).  I have nothing against Linux, but neither am I a fan,
and I'm certainly not a fan of it taking more (or more freely) than it gives 
back.
OTO

[zfs-discuss] Re: Re: How big a write to a regular file is atomic?

2007-03-30 Thread Richard L. Hamilton
> On Wed, Mar 28, 2007 at 06:55:17PM -0700, Anton B. Rang wrote:
> > It's not defined by POSIX (or Solaris). You can rely on being able to
> > atomically write a single disk block (512 bytes); anything larger than
> > that is risky. Oh, and it has to be 512-byte aligned.
> > 
> > File systems with overwrite semantics (UFS, QFS, etc.) will never
> > guarantee atomicity for more than a disk block, because that's the
> > only guarantee from the underlying disks.
> 
> I thought UFS and others have a guarantee of atomicity for O_APPEND
> writes vis-a-vis other O_APPEND writes up to some write size.  (Of
> course, NFS does not have true O_APPEND support, so this wouldn't apply
> to NFS.)

That's mainly what I was thinking of, since the overwrite case
would get more complicated.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How big a write to a regular file is atomic?

2007-03-28 Thread Richard L. Hamilton
and does it vary by filesystem type? I know I ought to know the
answer, but it's been a long time since I thought about it, and
I must not be looking at the right man pages.  And also, if it varies,
how does one tell?  For a pipe, there's fpathconf() with _PC_PIPE_BUF,
but how about for a regular file?
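
(For the pipe case, the shell-level equivalent would be something like the
following; but I don't see a corresponding variable for regular files, which
is really what I'm asking about:)

  mkfifo /tmp/testfifo
  getconf PIPE_BUF /tmp/testfifo   # largest write guaranteed atomic on a pipe
  rm /tmp/testfifo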
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] missing features? Could/should zfs support a new ioctl, constrained if needed

2007-03-24 Thread Richard L. Hamilton
_FIOSATIME - why doesn't zfs support this (assuming I didn't just miss it)?
Might be handy for backups.

Could/should zfs support a new ioctl, constrained if needed to files of
zero size, that sets an explicit (and fixed) blocksize for a particular
file?  That might be useful for performance in special cases when one
didn't necessarily want to specify the attribute at the filesystem level
(or depend on its having been specified there).  One could imagine a
database that was itself tunable per-file to a similar range of
blocksizes, which would almost certainly benefit if it used those sizes
for the corresponding files.  Additional capabilities that might be
desirable: setting the blocksize to zero to let the system return to
default behavior for a file; being able to discover the file's blocksize
(does fstat() report this?) as well as whether it was fixed at the
filesystem level, at the file level, or in default state.
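
(For comparison, the control that exists today is the per-filesystem recordsize
property; what I'm musing about above would amount to a per-file version of it.
Names are only for illustration:)

  zfs set recordsize=8k tank/db    # e.g. match a database's block size
  zfs get recordsize tank/db
  # as I understand it, only data written after the change uses the new size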

Wasn't there some work going on to add real per-user (and maybe per-group)
quotas, so one doesn't necessarily need to be sharing or automounting
thousands of individual filesystems (slow)?  Haven't heard anything lately 
though...
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: mirror question

2007-03-24 Thread Richard L. Hamilton
> Yes, this is supported now. Replacing one half of a
> mirror with a larger device;
> letting it resilver; then replacing the other half
> does indeed get a larger mirror.
> I believe this is described somewhere but I can't
> remember where now.

Thanks; I sure didn't see the answer on the zpool man page,
and if it's in the admin guide, I must have missed it.  In fact, I'm
not seeing it in the code either, but this is the first time I've looked
at the zfs code, and I haven't exactly read all of it, let alone understood
it as a whole.

It strikes me as useful to be able to expand the storage that way (since
one tends to have to replace disks anyway, and if the same model
isn't available, the new ones will likely be larger), although as disks
grow larger, I suppose one might also want to consider adding more
redundancy.  Since I only have 2 to work with, that's not really an option. :-)
Anyway, I hope that will eventually be mentioned (or more prominently,
if it's already there) in the admin guide, and perhaps even on the zpool
man page.
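
In case it helps anyone searching the archives later, the procedure described
above amounts to something like this (device names invented; on some builds an
export/import of the pool may be needed before the extra space shows up):

  # replace each side of the mirror with a larger disk, one at a time
  zpool replace tank c1t2d0 c1t4d0
  zpool status tank                # wait for the resilver to finish
  zpool replace tank c1t3d0 c1t5d0
  zpool status tank
  zpool list tank                  # capacity should now reflect the larger disks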

I suppose there's no way to tell zfs to shrink its use of a device down
to a certain size, releasing space at the end if possible.  While that's
probably much less important, if one gets into a real jam, it might
still be desirable.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] mirror question

2007-03-23 Thread Richard L. Hamilton
If I create a mirror, I presumably use two or more identically sized devices if
possible, since the mirror can only be as large as the smallest of them.  But
suppose I later want to replace a disk with a larger one: I detach the mirror
(and anything else on that disk), swap in the new disk, and repartition it if
applicable.  Since it _is_ a larger disk (and/or the partitions will likely be
larger, since they mustn't be smaller, blocks per cylinder will likely differ,
and partitions are on cylinder boundaries), once I reattach everything I'll have
two differently sized devices in the mirror.  So far, the mirror is still the
original size.  But what if I later replace the other disks with ones identical
to the first one I replaced?  With all the devices in the mirror now the larger
size, will the mirror, and the zpool of which it is a part, expand?  And if that
won't happen automatically, can it be forced to do so (without inordinate
trickery, and online, i.e. without backup and restore)?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Limit ZFS Memory Utilization

2007-02-06 Thread Richard L. Hamilton
If I understand correctly, at least some systems claim not to guarantee
consistency between changes to a file via write(2) and changes via mmap(2).
But historically, at least in the case of regular files on local UFS, since 
Solaris
used the page cache for both cases, the results should have been consistent.

Since zfs uses somewhat different mechanisms, does it still have the same
consistency between write(2) and mmap(2) that was historically present
(whether or not "guaranteed") when using UFS on Solaris?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: UFS on zvol: volblocksize and maxcontig

2007-02-01 Thread Richard L. Hamilton
I hope there will be consideration given to providing compatibility with UFS
quotas (except that inode limits would be ignored).  At least to the point of
having

edquota(1m)
quot(1m)
quota(1m)
quotactl(7i)
repquota(1m)
rquotad(1m)

and possibly quotactl(7i) work with zfs (with the exception previously mentioned).
OTOH, quotaon(1m)/quotaoff(1m)/quotacheck(1m) may not be needed for support
of per-user quotas in zfs (since it will presumably have its own ways of
enabling these, and will simply never mess up?).

None of which need preclude new interfaces with greater functionality (like
both user and group quotas); but where there is similar functionality, IMO it
would be easier for a lot of folks if quota maintenance (esp. edquota and
reporting) could be done the same way for ufs and zfs.
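
(For comparison, what exists today is the per-filesystem quota property, which
is fine when each user gets a filesystem of their own, but is not per-user
within a shared filesystem; names invented:)

  zfs set quota=10G tank/home/alice
  zfs get quota,used,available tank/home/alice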
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: zpool split

2007-01-24 Thread Richard L. Hamilton
...such that a snapshot (cloned if need be) won't do what you want?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: A versioning FS

2006-10-06 Thread Richard L. Hamilton
> What would a version FS buy us that cron+ zfs
> snapshots doesn't?

Some people are making money on the concept, so I
suppose there are those who perceive benefits:

http://en.wikipedia.org/wiki/Rational_ClearCase

(I dimly remember DSEE on the Apollos; also some sort of
versioning file type on (probably long-dead) Harris VOS
real-time OS.)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS API (again!), need quotactl(7I)

2006-09-13 Thread Richard L. Hamilton
> On 13/09/2006, at 2:29 AM, Eric Schrock wrote:
> > On Tue, Sep 12, 2006 at 07:23:00AM -0400, Jeff A. Earickson wrote:
> >>
> >> Modify the dovecot IMAP server so that it can get zfs quota information
> >> to be able to implement the QUOTA feature of the IMAP protocol (RFC 2087).
> >> In this case pull the zfs quota numbers for quoted home directory/zfs
> >> filesystem.  Just like what quotactl() would do with UFS.
> >>
> >> I am really surprised that there is no zfslib API to query/set zfs
> >> filesystem properties.  Doing a fork/exec just to execute a "zfs get"
> >> or "zfs set" is expensive and inelegant.
> >
> > The libzfs API will be made public at some point.  However, we need to
> > finish implementing the bulk of our planned features before we can feel
> > comfortable with the interfaces.  It will take a non-trivial amount of
> > work to clean up all the interfaces as well as document them.  It will
> > be done eventually, but I wouldn't expect it any time soon - there are
> > simply too many important things to get done first.
> >
> > If you don't care about unstable interfaces, you're welcome to use them
> > as-is.  If you want a stable interface, you are correct that the only
> > way is through invoking 'zfs get' and 'zfs set'.
> 
> I'm sure I'm missing something, but is there some reason that statvfs()
> is not good enough?

Assuming that statvfs() reports per filesystem, and that (as I understand it)
zfs "quotas" are per-filesystem within a shared storage pool rather than
per-user, I think you've got a point there.
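
(In other words, even plain df, which is just statvfs() underneath, already
reflects a zfs quota in the filesystem-per-user case; e.g., with invented names:)

  zfs set quota=1G tank/mail
  df -h /tank/mail                 # size/avail now reflect the 1G quota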

Of course, for some things (like a /var/mail directory,
or something similar with one large _file_ per user
in a single directory), it seems to me that per-user
quotas would still be quite desirable; likewise
sometimes for shared trees meant for some sort
of collaborative activity...even if for things such
as home directories, a filesystem per user works out
fine, that model doesn't fit everything well.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS + rsync, backup on steroids.

2006-08-30 Thread Richard L. Hamilton
Are both of you doing a umount/mount (or export/import, I guess) of the
source filesystem before both the first and the second test?  Otherwise, there
might still be a fair bit of cached data left over from the first test, which
would give the second an unfair advantage.  I'm fairly sure unmounting a
filesystem invalidates all cached pages associated with files on that
filesystem, as well as any cached [iv]node entries, all of which is needed to
ensure both tests start from the most similar situation possible.  Ideally,
all this would even be done in single-user mode, so that nothing else could
interfere.
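
Something like this between runs, for example (names invented, and assuming
nothing is holding the filesystem busy):

  # drop cached pages and vnodes for the source filesystem between runs
  zfs unmount tank/src
  zfs mount tank/src
  # or, more drastic, bounce the whole pool:
  zpool export tank
  zpool import tank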

If there were a list of precautions to take that would put comparisons
like this on firmer ground, it might provide a good starting point for such
comparisons to be more than anecdotes, saving time for all concerned,
both those attempting to replicate a prior casual observation for reporting,
and those looking at the report.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Re: SCSI synchronize cache cmd

2006-08-22 Thread Richard L. Hamilton
> Filed as 6462690.
> 
> If our storage qualification test suite doesn't yet
> check for support of this bit, we might want to get
> that added; it would be useful to know (and gently
> nudge vendors who don't yet support it).

Is either the test suite, or at least a list of what it tests
(which it looks like may more or less track what Solaris
requires) publicly available, or could it be made so?
Seems to me that if people can independently discover
problem hardware, that might make your job easier
insofar as they're smarter before they start asking you
questions; even more so if they feed back what they find
(not unlike the do-it-yourself x86 compatibility testing).
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss