Re: [zfs-discuss] zfs streams & data corruption
On Wed, Feb 25, 2009 at 07:33:34PM -0500, Miles Nordin wrote: > You might also have a look at the, somewhat overcomplicated > w.r.t. database-running-snapshot backups, SQLite2 atomic commit URL > Toby posted: > > http://sqlite.org/atomiccommit.html That's for SQLite_3_, 3, not 2. Also, we don't know that there's anything wrong with SQLite2 in this case. That's because we don't have enough information. The OP mentioned panics and showed a kernel panic stack trace. That means we should look at things other than SQLite2, or SMF, first. I asked the OP about how they are transferring their zfs send images; the OP has not replied. So rather than go into the weeds I think we need more information from the OP. Enough that someone from the ZFS team could reproduce, say, or otherwise find the cause in user error. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
> "gp" == Greg Palmer writes: gp> Relying on the equivalent of crashing the database to perform gp> backups isn't how professionals get the job done. well, nevertheless, it is, and should be, supported by SQLite2. gp> Let's take a simple case of a transaction which consists of gp> three database updates within a transaction. One of those gp> writes succeeds, you take a snapshot and then the two other gp> writes succeed. Everyone concerned with the transaction gp> believes it succeeded but your snapshot does not show that. I'm glad you have some rigid procedures that work well for you, but it sounds like you do not understand how DBMS's actually deal with their backing store. You could close the gap by reviewing the glossary entry for ACID. It's irrelevant whether the transaction spawns one write or three---the lower parts of the DBMS make updates transactional. As long as writes are not re-ordered or silently discarded, it's not a hand-waving recovery-from-chaos process. It's certain. Sometimes writes ARE lost or misordered, or there are bugs in the DBMS or bad RAM or who knows what, so I'm not surprised your vendor has given you hand-waving recovery tools along with a lot of scary disclaimers. Nor am I surprised that they ask you to follow procedures that avoid exposing their bugs. But it's just plain wrong that the only way to achieve a correct backup is with the vendor's remedial freezing tools. I don't understand why you are dwelling on ``everyone concerned believes it succeeded but it's not in the backup.'' So what? Obviously the backup has to stop including things at some point. As long as the transaction is either in the backup or not in the backup, the backup is FINE. It's a BACKUP. It has to stop somewhere. You seem to be concerned that a careful forensic scientist could dig into the depths of the backup and find some lingering evidence that a transaction might have once been starting to come into existence. As far as I'm concerned, that transaction is ``not in the backup'' and thus fine. You might also have a look at the, somewhat overcomplicated w.r.t. database-running-snapshot backups, SQLite2 atomic commit URL Toby posted: http://sqlite.org/atomiccommit.html Their experience points out, filesystems tend to do certain somewhat-predictable but surprising things to the data inside files when the cord is pulled, things which taking a snapshot won't do. so, I was a little surprised to read about some of the crash behaviors SQLite had to deal with, but, with slight reservation, I stand by my statement that the database should recover swiftly and certainly when the cord is pulled. But! it looks like recovering from a ``crash-consistent'' snapshot is actually MUCH easier than a pulled cord, at least a pulled cord with some of the filesystems SQLite2 aims to support. gp> [snapshots] have no knowledge of whether or not one of three gp> writes required for the database to be consistent have gp> completed. it depends on what you mean by consistent. In my language, the database is always consistent, after each of those three writes. The DBMS orders the writes carefully to ensure this. Especially in the case of a lightweight DB like SQLite2 this is the main reason you use the database in the first place. gp> Data does not hit the disk instantly, it takes some finite gp> amount of time in between when the write command is issued for gp> it to arrive at the disk. I'm not sure it's critical to my argument, but, snapshots in ZFS have nothing to do with when data ``hits the disk''. 
gp> ZFS promises on disk consistency but as we saw in the recent gp> thread about "Unreliable for professional usage" it is gp> possible to have issues. Likewise with database systems. yes, finally we are in agreement! Here is where we disagree: you want to add a bunch of ponderous cargo-cult procedures and dire warnings, like some convoluted way to tell SMF to put SQLite2 into remedial-backup mode before taking a ZFS snapshot to clone a system. I want to fix the bugs in SQLite2, or in whatever is broken, so that it does what it says on the tin. The first step in doing that is to convince people like you that there is *necessarily* a bug if the snapshot is not a working backup. Nevermind the fact that your way simply isn't workable with hundreds of these lightweight SQLite/db4/whatever databases all over the system in nameservices and Samba and LDAP and Thunderbird and so on. Workable or not, it's not _necessary_, and installing this confusing and incorrect expectation that it's necessary blocks bugs from getting fixed, and is thus harmful for reliability overall (see _unworkable_ one sentence ago). HTH. pgp9wrTDqgHf8.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
Miles Nordin wrote: gp> Performing a checkpoint will perform such tasks as making sure gp> that all transactions recorded in the log but not yet written gp> to the database are written out and that the system is not in gp> the middle of a write when you grab the data. great copying of buzzwords out of a glossary, Wasn't copied from a glossary, I just tried to simplify it enough for you to understand. I apologize if I didn't accomplish that goal. but does it change my claim or not? My claim is: that SQLite2 should be equally as tolerant of snapshot backups as it is of cord-yanking. You're missing the point here Miles. The folks weren't asking for a method to confirm their database was able to perform proper error recovery and confirm it would survive having the cord yanked out of the wall. They were asking for a reliable way to backup their data. The best way to do that is not by snapshotting alone. The process of performing database backups is well understood and supported throughout the industry. Relying on the equivalent of crashing the database to perform backups isn't how professionals get the job done. There is a reason that database vendor do not suggest you backup their databases by pulling the plug out of the wall or killing the running process. The best way to backup a database is by using a checkpoint. Your comment about checkpoints being for systems where snapshots are not available is not accurate. That is the normal method of backing up databases under Solaris among others. Checkpoints are useful for all systems since they guarantee that the database files are consistent and do not require recovery which doesn't always work no matter what the glossy brochures tell you. Typically they are used in concert with snapshots. Force the checkpoint, trigger the snapshot and you're golden. Let's take a simple case of a transaction which consists of three database updates within a transaction. One of those writes succeeds, you take a snapshot and then the two other writes succeed. Everyone concerned with the transaction believes it succeeded but your snapshot does not show that. When the database starts up again, the data it will have in your snapshot indicates the transaction never succeeded and therefore it will roll out the database transaction and you will lose that transaction. Well, it will assuming that all code involved in that recovery works flawlessly. Issuing a checkpoint on the other hand causes the database to complete the transaction including ensuring consistency of the database files before you take your snapshot. NOTE: If you issue a checkpoint and then perform a snapshot you will get consistent data which does not require the database perform recovery. Matter of fact, that's the best way to do it. Your dismissal of write activity taking place is inaccurate. Snapshots take a picture of the file system at a point in time. They have no knowledge of whether or not one of three writes required for the database to be consistent have completed. (Refer to above example) Data does not hit the disk instantly, it takes some finite amount of time in between when the write command is issued for it to arrive at the disk. Plainly, terminating the writes between when they are issued and before it has completed is possible and a matter of timing. The database on the other hand does understand when the transaction has completed and allows outside processes to take advantage of this knowledge via checkpointing. 
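For instance, taking PostgreSQL as a stand-in (its backup hooks are easy to drive from a shell; the dataset name tank/pgdata below is made up), the checkpoint-then-snapshot sequence might look roughly like this:
# force a checkpoint and mark the start of an online backup
psql -U postgres -c "SELECT pg_start_backup('zfs-snap');"
# take the snapshot while the database guarantees its files need nothing beyond log replay
zfs snapshot tank/pgdata@backup-$(date +%Y%m%d%H%M)
# end the backup window
psql -U postgres -c "SELECT pg_stop_backup();"
Other databases spell the same thing differently (Oracle's BEGIN/END BACKUP, for example), but the shape is the same: checkpoint, snapshot, release.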
All real database systems have flaws in the recovery process and so far every database system I've seen has had issues at one time or another. If we were in a perfect world it SHOULD work every time but we aren't in a perfect world. ZFS promises on disk consistency but as we saw in the recent thread about "Unreliable for professional usage" it is possible to have issues. Likewise with database systems. Regards, Greg ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
> "tt" == Toby Thain writes: c> so it's ok for snapshots but not cord-yanks if VBox never c> bothers to call fsync(). tt> Taking good host snapshots may require VB to do that, though. AIUI the contents of a snapshot on the host will be invariant no matter where VBox places host fsync() calls along the timeline, or if it makes them at all. The host snapshot will not be invariant of when applications running inside the guest call fsync(), because this inner fsync() implicates the buffer cache in the guest OS, possibly flush commands at the guest/VBox driver/virtualdisk boundary, and stdio buffers inside the VBox app. so...in the sense that, in a hypothetical nonexistent working overall system, a guest app calling fsync() eventually propogates out until finally VBox calls fsync() on the host's kernel, then yeah, observing a lack of fsync()'s coming out of VBox probably means host snapshots won't be crash-consistent. BUT the effect of the fsync() on the host itself is not what's needed for host snapshots (only needed for host cord-yanks). It's all the other stuff that's needed for host snapshots---flushing the buffer cache inside the guest OS, flushing VBox's stdio buffers, u.s.w., that makes a bunch of write()'s spew out just before the fsync() and dams up other write()s inside VBox and the guest OS until after the fsync() comes out. c> But ext3's supposed ability to mostly work ok without c> barriers tt> Without *working* barriers, you mean? I haven't RTFS but I tt> suspect ext3 needs functioning barriers to maintain "crash tt> consistency". no, the lwn article says that ext3 is just like Solaris UFS and never issues a cache flush to the drive (except on SLES where Novell made local patches to their kernel). ext3 probably does still use an internal Linux barrier API to stop dangerous kinds of reordering within the Linux buffer cache, but nothing that makes it down to the drive (nor into VBox). so I think even if you turn on the flush-respecting feature in VBox, Linux ext3 and Solaris UFS would both still be necessarily unsafe (according to our theory so far), at least unsafe from: (1) host cord-yanking, (2) host snapshots taken without ``pausing'' the VM. If you're going to turn on the VBox flush option, maybe it would be worth trying XFS or ext4 or ZFS inside the guest and comparing their corruptability. For VBox to simulate a real disk with its write cache turned off, and thus work better with UFS and ext3, VBox would need to make sure writes are not re-ordered. For the unpaused-host-snapshot case this should be relatively easy---just make VBox stop using stdio, and call write() exactly once for every disk command the guest issues and call it in the same order the guest passed it. It's not necessary to call fsync() at all, so it should not make things too much slower. For the host cord-yanking case, I don't think POSIX gives enough to achieve this and still be fast because you'd be expected to call fsync() between each write. What we really want is some flag, ``make sure my writes appear to have been done in order after a crash.'' I don't think there's such a thing as a write barrier in POSIX, only the fsync() flush command? Maybe it should be a new rule of zvol's that they always act this way. It need not slow things down much for the host to arrange that writes not appear to have been reordered: all you have to do is batch them into chunks along the timeline, and make sure all the writes in a chunk commit, or none of them do. 
It doesn't matter how big the chunks are nor where they start and end. It's sort of a degenerate form of the snapshot case: with the fwrite()-to-write() change above we can already take a clean snapshot without fsync(), so just pretend as though you were taking a snapshot a couple times a minute, and after losing power roll back to the newest one that survived. I'm not sure real snapshots are the right way to implement it, but the idea is with a COW backing store it should be well within reach to provide the illusion writes are never reordered (and thus that your virtual hard disk has its write cache turned off) without adding lots of io/s the way fsync() does. This still compromises the D in ACID for databases running inside the guest, in the host cord-yank case, but it should stop the corruption. pgpDmKTrtWRL1.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
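A rough illustration of the frequent-snapshot-and-roll-back idea above, assuming the guest's virtual disk lives on a hypothetical zvol tank/vm/disk0 (names and interval are arbitrary):
# take frequent, cheap snapshots of the zvol backing the virtual disk
zfs snapshot tank/vm/disk0@auto-$(date +%Y%m%d-%H%M%S)
# after a host crash, see which snapshots survived, newest last
zfs list -t snapshot -o name,creation -s creation -r tank/vm/disk0
# roll the zvol back to the newest surviving snapshot (name is hypothetical)
zfs rollback tank/vm/disk0@auto-20090225-181500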
Re: [zfs-discuss] zfs streams & data corruption
On 25-Feb-09, at 1:08 PM, Miles Nordin wrote: "jm" == Moore, Joe writes: jm> This is correct. The general term for these sorts of jm> point-in-time backups is "crash consistant". phew, thanks, glad I wasn't talking out my ass again. jm> In-flight transactions (ones that have not been committed) at jm> the database level are rolled back. Applications using the jm> database will be confused by this in a recovery scenario, jm> since the transaction was reported as committed are gone when jm> the database comes back. But that's the case any time a jm> database moves "backward" in time. hm. I thought a database would not return success to the app until it was actually certain the data was on disk with fsync() or whatever, and this is why databases like NVRAM's and slogs. Are you saying it's a common ``optimisation'' for DBMS to worry about write barriers only, not about flushing? That would break the "Durable" promise of ACID. To be durable, commit must be synchronous to the application, because the application is about to promise something big to the user (e.g. printing APPROVED :) That said, this RDBMS behaviour is generally configurable. In fact the subtext of the whole thread is "know your configuration" at all layers, whether that is drive, filesystem, virtual machine, RDBMS, ... jm> Snapshots of a virtual disk are also crash-consistant. If the jm> VM has not committed its transactionally-committed data and is jm> still holding it volatile memory, that VM is not maintaining jm> its ACID requirements, and that's a bug in either the database jm> or in the OS running on the VM. I'm betting mostly ``the OS running inside the VM'' and ``the virtualizer itself''. For the latter, from Toby's thread: -8<- If desired, the virtual disk images (VDI) can be flushed when the guest issues the IDE FLUSH CACHE command. Normally these requests are ignored for improved performance. To enable flushing, issue the following command: VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/ LUN#[x]/Config/IgnoreFlush" 0 -8<- Virtualizers are able to take snapshots themselves without help from the host OS, so I would expect at least those to work, and host snapshots to be fixable. VirtualBox has a ``pause'' feature---it could pretend it's received a flush command from the guest, and flush whatever internal virtualizer buffers it has to the host OS when paused. Indeed. Also a host snapshot is a little more forgiving than a host cord-yank because the snapshot will capture things applications like VBox have written to files but not fsync()d yet. so it's ok for snapshots but not cord-yanks if VBox never bothers to call fsync(). Taking good host snapshots may require VB to do that, though. It's just not okay that VBox might buffer data internally sometimes. Even if that's all sorted, though, ``the OS running inside the VM''---neither UFS nor ext3 sends these cache flush commands to virtual drives. At least for ext3, the story is pretty long: http://lwn.net/Articles/283161/ So, for those that wish to enable them, barriers apparently are turned on by giving "barrier=1" as an option to the mount(8) command, either on the command line or in /etc/fstab: mount -t ext3 -o barrier=1 (but, does not help at all if using LVM2 because LVM2 drops the barriers) ext3 get away with it because drive write buffers are small enough they can mostly get away with only flushing the journal, and the journal's written in LBA order, so except when it wraps around there's little incentive for drives to re-order it. 
But ext3's supposed ability to mostly work ok without barriers Without *working* barriers, you mean? I haven't RTFS but I suspect ext3 needs functioning barriers to maintain "crash consistency". depends on assumptions about physical disks---the size of the write cache being <32MB, their reordering sorting algorithm being elevator-like---that probably don't apply to a virtual disk so a Linux guest OS very likely is ``broken'' Yes, the problems I observed indicate to me that with the Ignore Flushes default, VB can't crash and maintain consistency in ext3 or MySQL+InnoDB (and, I'd bet, pretty much *any* transactional system). w.r.t. taking these crash-consistent virtual disk snapshots. And also a Solaris guest: we've been told UFS+logging expects the write cache to be *off* for correctness. I don't know if UFS is less good at evading the problem than ext3, or if Solaris users are just more conservative. but, with a virtual disk the write cache will always be effectively on no matter what simon-sez flags you pass to that awful 'format' tool. That was never on the bargaining table because there's no other way it can have remotely reasonable performance. ...which may imply that a Solaris UFS filesystem is just as prone to damage in VB as a Linux one. (Even ZFS, I'd wager.) P
Re: [zfs-discuss] zfs streams & data corruption
> "jm" == Moore, Joe writes: jm> This is correct. The general term for these sorts of jm> point-in-time backups is "crash consistant". phew, thanks, glad I wasn't talking out my ass again. jm> In-flight transactions (ones that have not been committed) at jm> the database level are rolled back. Applications using the jm> database will be confused by this in a recovery scenario, jm> since the transaction was reported as committed are gone when jm> the database comes back. But that's the case any time a jm> database moves "backward" in time. hm. I thought a database would not return success to the app until it was actually certain the data was on disk with fsync() or whatever, and this is why databases like NVRAM's and slogs. Are you saying it's a common ``optimisation'' for DBMS to worry about write barriers only, not about flushing? jm> Snapshots of a virtual disk are also crash-consistant. If the jm> VM has not committed its transactionally-committed data and is jm> still holding it volatile memory, that VM is not maintaining jm> its ACID requirements, and that's a bug in either the database jm> or in the OS running on the VM. I'm betting mostly ``the OS running inside the VM'' and ``the virtualizer itself''. For the latter, from Toby's thread: -8<- If desired, the virtual disk images (VDI) can be flushed when the guest issues the IDE FLUSH CACHE command. Normally these requests are ignored for improved performance. To enable flushing, issue the following command: VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0 -8<- Virtualizers are able to take snapshots themselves without help from the host OS, so I would expect at least those to work, and host snapshots to be fixable. VirtualBox has a ``pause'' feature---it could pretend it's received a flush command from the guest, and flush whatever internal virtualizer buffers it has to the host OS when paused. Also a host snapshot is a little more forgiving than a host cord-yank because the snapshot will capture things applications like VBox have written to files but not fsync()d yet. so it's ok for snapshots but not cord-yanks if VBox never bothers to call fsync(). It's just not okay that VBox might buffer data internally sometimes. Even if that's all sorted, though, ``the OS running inside the VM''---neither UFS nor ext3 sends these cache flush commands to virtual drives. At least for ext3, the story is pretty long: http://lwn.net/Articles/283161/ So, for those that wish to enable them, barriers apparently are turned on by giving "barrier=1" as an option to the mount(8) command, either on the command line or in /etc/fstab: mount -t ext3 -o barrier=1 (but, does not help at all if using LVM2 because LVM2 drops the barriers) ext3 get away with it because drive write buffers are small enough they can mostly get away with only flushing the journal, and the journal's written in LBA order, so except when it wraps around there's little incentive for drives to re-order it. But ext3's supposed ability to mostly work ok without barriers depends on assumptions about physical disks---the size of the write cache being <32MB, their reordering sorting algorithm being elevator-like---that probably don't apply to a virtual disk so a Linux guest OS very likely is ``broken'' w.r.t. taking these crash-consistent virtual disk snapshots. And also a Solaris guest: we've been told UFS+logging expects the write cache to be *off* for correctness. 
I don't know if UFS is less good at evading the problem than ext3, or if Solaris users are just more conservative. but, with a virtual disk the write cache will always be effectively on no matter what simon-sez flags you pass to that awful 'format' tool. That was never on the bargaining table because there's no other way it can have remotely reasonable performance. Possibly the ``pause'' command would be a workaround for this because it could let you force a barrier into the write stream yourself (one the guest OS never sent) and then take a snapshot right after the barrier with no writes allowed between barrier and snapshot. If the fake barrier is inserted into the stack right at the guest/VBox boundary, then it should make the overall system behave as well as the guest running on a drive with the write cache disabled. I'm not sure such a barrier is actually implied by VBox ``pause'' but if I were designing the pause feature it would be. pgpnmTxPa2z8Y.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
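For anyone who wants to try the combination discussed above, the two knobs quoted in this thread look roughly like this; the VM name, IDE LUN, device, and mount point are all made up:
# host: stop VirtualBox ignoring the guest's FLUSH CACHE commands (from Toby's quote above)
VBoxManage setextradata myguest "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0
# Linux guest: make ext3 issue barriers so those flushes are actually sent
mount -t ext3 -o barrier=1 /dev/hda1 /data
# (or put barrier=1 in the options field of /etc/fstab; per the lwn article, LVM2 in between drops the barriers)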
Re: [zfs-discuss] zfs streams & data corruption
On 25-Feb-09, at 9:53 AM, Moore, Joe wrote: Miles Nordin wrote: that SQLite2 should be equally as tolerant of snapshot backups as it is of cord-yanking. The special backup features of databases including ``performing a checkpoint'' or whatever, are for systems incapable of snapshots, which is most of them. Snapshots are not writeable, so this ``in the middle of a write'' stuff just does not happen. This is correct. The general term for these sorts of point-in-time backups is "crash consistant". If the database can be recovered easily (and/or automatically) from pulling the plug (or a kill -9), then a snapshot is an instant backup of that database. In-flight transactions (ones that have not been committed) at the database level are rolled back. Applications using the database will be confused by this in a recovery scenario, since the transaction was reported as committed are gone when the database comes back. But that's the case any time a database moves "backward" in time. Of course Toby rightly pointed out this claim does not apply if you take a host snapshot of a virtual disk, inside which a database is running on the VM guest---that implicates several pieces of untrustworthy stacked software. But for snapshotting SQLite2 to clone the currently-running machine I think the claim does apply, no? Snapshots of a virtual disk are also crash-consistant. If the VM has not committed its transactionally-committed data and is still holding it volatile memory, that VM is not maintaining its ACID requirements, and that's a bug in either the database or in the OS running on the VM. Or the virtual machine! I hate to dredge up the recent thread again - but if your virtual machine is not maintaining guest barrier semantics (write ordering) on the underlying host, then your snapshot may contain inconsistencies entirely unexpected to the virtualised transactional/journaled database or filesystem.[1] I believe this can be reproduced by simply running VirtualBox with default settings (ignore flush), though I have been too busy lately to run tests which could prove this. (Maybe others would be interested in testing as well.) I infer this explanation from consistency failures in InnoDB and ext3fs that I have seen[2], which would not be expected on bare metal in pull-plug tests. My point is not about VB specifically, but just that in general, the consistency issue - already complex on bare metal - is tangled further as the software stack gets deeper. --Toby [1] - The SQLite web site has a good summary of related issues. http://sqlite.org/atomiccommit.html [2] http://forums.virtualbox.org/viewtopic.php?t=13661 The snapshot represents the disk state as if the VM were instantly gone. If the VM or the database can't recover from pulling the virtual plug, the snapshot can't help that. That said, it is a good idea to quiesce the software stack as much as possible to make the recovery from the crash-consistant image as painless as possible. For example, if you take a snapshot of a VM running on an EXT2 filesystem (or unlogged UFS for that matter) the recovery will require an fsck of that filesystem to ensure that the filesystem structure is consistant. Perforing a "lockfs" on the filesystem while the snapshot is taken could mitigate that, but that's still out of the scope of the ZFS snapshot. 
--Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
Miles Nordin wrote: > that SQLite2 should be equally as tolerant of snapshot backups as it > is of cord-yanking. > > The special backup features of databases including ``performing a > checkpoint'' or whatever, are for systems incapable of snapshots, > which is most of them. Snapshots are not writeable, so this ``in the > middle of a write'' stuff just does not happen. This is correct. The general term for these sorts of point-in-time backups is "crash consistent". If the database can be recovered easily (and/or automatically) from pulling the plug (or a kill -9), then a snapshot is an instant backup of that database. In-flight transactions (ones that have not been committed) at the database level are rolled back. Applications using the database will be confused by this in a recovery scenario, since transactions that were reported as committed are gone when the database comes back. But that's the case any time a database moves "backward" in time. > Of course Toby rightly pointed out this claim does not apply if you > take a host snapshot of a virtual disk, inside which a database is > running on the VM guest---that implicates several pieces of > untrustworthy stacked software. But for snapshotting SQLite2 to clone > the currently-running machine I think the claim does apply, no? > Snapshots of a virtual disk are also crash-consistent. If the VM has not committed its transactionally-committed data and is still holding it in volatile memory, that VM is not maintaining its ACID requirements, and that's a bug in either the database or in the OS running on the VM. The snapshot represents the disk state as if the VM were instantly gone. If the VM or the database can't recover from pulling the virtual plug, the snapshot can't help that. That said, it is a good idea to quiesce the software stack as much as possible to make the recovery from the crash-consistent image as painless as possible. For example, if you take a snapshot of a VM running on an EXT2 filesystem (or unlogged UFS for that matter) the recovery will require an fsck of that filesystem to ensure that the filesystem structure is consistent. Performing a "lockfs" on the filesystem while the snapshot is taken could mitigate that, but that's still out of the scope of the ZFS snapshot. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
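A sketch of the lockfs idea mentioned at the end of the message above, assuming a Solaris guest whose UFS data filesystem sits on a virtual disk backed by a hypothetical host dataset tank/vm/guest0; the write lock is held only for the moment the host-side snapshot is taken:
# inside the guest: flush pending writes and write-lock the UFS filesystem
lockfs -w /export/data
# on the host: take the snapshot while the guest filesystem is quiesced
zfs snapshot tank/vm/guest0@quiesced
# inside the guest: release the write lock
lockfs -u /export/data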
Re: [zfs-discuss] zfs streams & data corruption
> "gp" == Greg Palmer writes: gp> Performing a checkpoint will perform such tasks as making sure gp> that all transactions recorded in the log but not yet written gp> to the database are written out and that the system is not in gp> the middle of a write when you grab the data. great copying of buzzwords out of a glossary, but does it change my claim or not? My claim is: that SQLite2 should be equally as tolerant of snapshot backups as it is of cord-yanking. The special backup features of databases including ``performing a checkpoint'' or whatever, are for systems incapable of snapshots, which is most of them. Snapshots are not writeable, so this ``in the middle of a write'' stuff just does not happen. gp> Dragging the discussion of database recovery into the gp> discussion seems to me to only be increasing the FUD factor. except that you need to draw a distinction between recovery from cord-yanking which should be swift and absolutely certain, and recovery from a cpio-style backup done with the database still running which requires some kind of ``consistency scanning'' and may involve ``corruption'' and has every right to simply not work at all. The FUD I'm talking about, is mostly that people seem to think all kinds of recovery are of the second kind, which is flatly untrue! Backing up a snapshot of the database should involve the first category of recovery (after restore), the swift and certain kind, EVEN if you do not ``quiesce'' the database or take a ``checkpoint'' or whatever your particular vendor calls it, before taking the snapshot. You are entitled to just snap it, and expect that recovery work swiftly and certainly just as it does if you yank the cord. If your database vendor considers it some major catastrophe to have the cord yanked, requiring special tools, training seminars, buzzwords, and hours of manual checking, then we have a separate problem, but I don't think SQLite2 is in that category! Of course Toby rightly pointed out this claim does not apply if you take a host snapshot of a virtual disk, inside which a database is running on the VM guest---that implicates several pieces of untrustworthy stacked software. But for snapshotting SQLite2 to clone the currently-running machine I think the claim does apply, no? pgpd5AH6jPUrj.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
Miles Nordin wrote: Hope this helps untangle some FUD. Snapshot backups of databases *are* safe, unless the database or application above it is broken in a way that makes cord-yanking unsafe too. Actually Miles, what they were asking for is generally referred to as a checkpoint and they are used by all major databases for backing up files. Performing a checkpoint will perform such tasks as making sure that all transactions recorded in the log but not yet written to the database are written out and that the system is not in the middle of a write when you grab the data. Dragging the discussion of database recovery into the discussion seems to me to only be increasing the FUD factor. Regards, Greg ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 03:08:18PM -0800, Christopher Mera wrote: > It's a zfs snapshot that's then sent to a file.. > > On the new boxes I'm doing a jumpstart install with the SUNWCreq > package, and using the finish script to mount an NFS filesystem that > contains the *.zfs dump files. Zfs receive is actually importing the > data and the boot environment then boots fine. It's possible that your zfs send output files are getting corrupted when accessed via NFS. Try ssh. Also, when does the panic happen? I searched for CRs with parts of that panic string and found none. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
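A sketch of the ssh alternative Nico suggests, with hypothetical host, pool, and path names; piping the stream directly avoids staging the dump file on NFS at all:
# send the snapshot straight into the new box's pool over ssh
zfs send rpool/ROOT/s10be@clone | ssh newhost zfs receive -F rpool/ROOT/s10be
# or, if a staged file is unavoidable, checksum it and compare on both ends
zfs send rpool/ROOT/s10be@clone > /net/nfsserver/images/s10be.zfs
digest -a sha1 /net/nfsserver/images/s10be.zfs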
Re: [zfs-discuss] zfs streams & data corruption
It's a zfs snapshot that's then sent to a file.. On the new boxes I'm doing a jumpstart install with the SUNWCreq package, and using the finish script to mount an NFS filesystem that contains the *.zfs dump files. Zfs receive is actually importing the data and the boot environment then boots fine. -Original Message- From: Nicolas Williams [mailto:nicolas.willi...@sun.com] Sent: Tuesday, February 24, 2009 5:43 PM To: Christopher Mera Cc: lori@sun.com; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] zfs streams & data corruption On Mon, Feb 23, 2009 at 02:36:07PM -0800, Christopher Mera wrote: > panic[cpu0]/thread=dacac880: BAD TRAP: type=e (#pf Page fault) > rp=d9f61850 addr=1048c0d occurred in module "zfs" due to an illegal > access to a user address Can you describe what you're doing with your snapshot? Are you zfs send'ing your snapshots to new systems' rpools? Or something else? You're not using dd(1) or anything like that, right? Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Mon, Feb 23, 2009 at 02:36:07PM -0800, Christopher Mera wrote: > panic[cpu0]/thread=dacac880: BAD TRAP: type=e (#pf Page fault) > rp=d9f61850 addr=1048c0d occurred in module "zfs" due to an illegal > access to a user address Can you describe what you're doing with your snapshot? Are you zfs send'ing your snapshots to new systems' rpools? Or something else? You're not using dd(1) or anything like that, right? Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 03:25:53PM -0600, Tim wrote: > On Tue, Feb 24, 2009 at 2:37 PM, Nicolas Williams > wrote: > > > > > > > > > > > > Hot Backup? > > > > > > # Connect to the database > > > sqlite3 db $dbfile > > > # Lock the database, copy and commit or rollback > > > if {[catch {db transaction immediate {file copy $dbfile ${dbfile}.bak}} > > res]} { > > >puts "Backup failed: $res" > > > } else { > > >puts "Backup succeeded" > > > } > > > > SMF uses SQLite2. Sorry. > > > > > I don't quite follow why it wouldn't work for sqlite2 as well... Because SQLite2 doesn't have that feature. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 2:37 PM, Nicolas Williams wrote: > > > > > > > Hot Backup? > > > > # Connect to the database > > sqlite3 db $dbfile > > # Lock the database, copy and commit or rollback > > if {[catch {db transaction immediate {file copy $dbfile ${dbfile}.bak}} > res]} { > >puts "Backup failed: $res" > > } else { > >puts "Backup succeeded" > > } > > SMF uses SQLite2. Sorry. > I don't quite follow why it wouldn't work for sqlite2 as well... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On 24-Feb-09, at 1:37 PM, Mattias Pantzare wrote: On Tue, Feb 24, 2009 at 19:18, Nicolas Williams wrote: On Mon, Feb 23, 2009 at 10:05:31AM -0800, Christopher Mera wrote: I recently read up on Scott Dickson's blog with his solution for jumpstart/flashless cloning of ZFS root filesystem boxes. I have to say that it initially looks to work out cleanly, but of course there are kinks to be worked out that deal with auto mounting filesystems mostly. The issue that I'm having is that a few days after these cloned systems are brought up and reconfigured they are crashing and svc.configd refuses to start. When you snapshot a ZFS filesystem you get just that -- a snapshot at the filesystem level. That does not mean you get a snapshot at the _application_ level. Now, svc.configd is a daemon that keeps a SQLite2 database. If you snapshot the filesystem in the middle of a SQLite2 transaction you won't get the behavior that you want. In other words: quiesce your system before you snapshot its root filesystem for the purpose of replicating that root on other systems. That would be a bug in ZFS or SQLite2. A snapshot should be an atomic operation. The effect should be the same as a power failure in the middle of a transaction, and decent databases can cope with that. In this special case, that is likely so. But Nicolas' point is salutary in general, especially in the increasingly common case of virtual machines whose disk images are on ZFS. Interacting bugs or bad configuration can produce novel failure modes. Quiescing a system with a complex mix of applications and service layers is no simple matter either, as many readers of this list well know... :) --Toby ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
> "la" == Lori Alt writes: la> Could a cpio end up archiving a file that was mid-way la> through an SQLite2 transaction? cpio is actually much worse for a database than a snapshot! I don't know what's going on in this specific case, but the cpio backup is worse for SQLite2-using things like Thunderbird than a snapshot backup. It's ok if your backup is equivalent to this, and snapshot backups are equivalent: * yank the cord. * boot up, but do NOT start SQLite2. * copy SQLite2's files somewhere else. * later, feed the copied files to SQLite2, and say ``recover, as if power failed.'' SQLite2 should be able to do this ``recover'' step speedily and without ``corruption'' or ``inconsistency,'' and without any ``half completed'' transactions. The fact that databases have transactions is not something that makes them vulnerable to cord-yanking or corrupt from snapshot backups. About 1/4 of the reason databases even have something *called* a Transaction, is to support *exactly* this scenario. What's not workable is to back up the file storing the database gradually while the database is writing to it, so the backed-up blocks near the start of the file are older than blocks near the end. cpio backups on live filesystems are like your backup is a wand sweeping through the file's space, while at the same time SQlite2 writes are dipping into the file sometimes before the wand, sometimes behind. Any writes SQLite2 does to offsets behind the wand are lost, while writes in front of the wand are captured into the backup. This will cause corruption. It's not the same as a cord-yank and not speedily recoverable. The way I try to back up UFS systems is to take a snapshot with fssnap, then backup the snapshot with ufsdump. You could also UFS-mount the fssnap device somewhere read-only and use cpio on that mountpoint instead of ufsdump on the device---that's safe too. modulo bugs in SQLite2 and SMF. but backing up the writeable filesystem with cpio is never safe for SQLite2 or berkeley DB or any real database. Older systems had no fssnap and no 'zfs snap', so it was impossible to do backups by performing the cord-yank-simulation procedure above. Most Linux systems still can't do it. You need operating system support to do it, so if you don't have it, whether you're cpio or you're an ``enterprise backup solution,'' you need some help from the database to do a live backup. When databases have some mode to support backups, usually what they do is to make two kinds of promises: (1) certain files, I will not write to them at all until you take me out of backup-mode. Pass your backup wands through them all you want. I'll not be changing them. (2) other files, I will only append to them. I will never write to the middle. Both behaviors are wand-safe, so you can use userspace-only cpio backups without shutting the database all the way down. You do *NOT* need to use the (1) (2) helper-mode to do a snapshot backup. If your database can't handle a snapshot backup unless you put it into remedial backup-assistance (1) (2) mode first, then your database can't handle cord-yanking either, and is BROKEN. The observed problem doesn't mean SQLite2 is broken. It's possible the software above SQLite2 is not using the transactions aggressively enough. For example suppose SMF craps its pants if it ever boots up to find database-stored switches 1 and 2 are not set to the same value. 
If SMF is commanding SQLite2 to: * Transaction 1: flip switch 1 to B * Transaction 2: flip switch 2 to B then it could have trouble surviving cord-yanking or backups, and it'll have trouble no matter whether it's a cord-yank or a snapshot backup or a sweeping-wand backup, and no matter if you somehow put SQLite2 in backup-friendly mode first or not. The proper way is for SMF to tell SQLite2: * Transaction 1: flip switch 1 to B flip switch 2 to B SQLite2 will then guarantee that both happen, or neither happens, but only if you ask it to by putting both in one transaction. The whole *point* of using SQLite2 in your SMF project is to arrange for such guarantees as these to be kept during backups and cord-yanks. but a database cannot magically make the system appear to run continuously---SMF still needs to specify to SQLite2 what ``consistency'' means before the database can guarantee it. Hope this helps untangle some FUD. Snapshot backups of databases *are* safe, unless the database or application above it is broken in a way that makes cord-yanking unsafe too. pgp8YJDt8TU89.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
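To make the switch example above concrete, the same two updates grouped into a single transaction look like this with the sqlite3 command-line shell (SQLite 3 syntax for illustration only; the repository path, table, and column names are invented):
sqlite3 repository.db <<'SQL'
BEGIN TRANSACTION;               -- both updates become durable together, or not at all
UPDATE switches SET value = 'B' WHERE name = 'switch1';
UPDATE switches SET value = 'B' WHERE name = 'switch2';
COMMIT;                          -- a snapshot or cord-yank sees both flips or neither
SQL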
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 12:19:22PM -0800, Christopher Mera wrote: > There are over 700 boxes deployed using Flash Archive's on an S10 system > with a UFS root. We've been working on basing our platform on a ZFS > root and took Scott Dickson's suggestions > (http://blogs.sun.com/scottdickson/entry/flashless_system_cloning_with_z > fs) for doing a System Clone. The process worked out well, the system > came up and looked stable until 24 hours later kernel panic's became > incessant and svc.configd won't load its repository any longer. OK, svc.configd cannot cause a panic, so perhaps there is a ZFS bug. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 02:27:18PM -0600, Tim wrote: > On Tue, Feb 24, 2009 at 2:15 PM, Nicolas Williams > wrote: > > > On Tue, Feb 24, 2009 at 01:17:47PM -0600, Nicolas Williams wrote: > > > I don't think there's any way to ask svc.config to pause. > > > > Well, IIRC that's not quite right. You can pstop svc.startd, gently > > kill (i.e., not with SIGKILL) svc.configd, take your snapshot, then prun > > svc.startd. > > > > Nico > > -- > > > Hot Backup? > > # Connect to the database > sqlite3 db $dbfile > # Lock the database, copy and commit or rollback > if {[catch {db transaction immediate {file copy $dbfile ${dbfile}.bak}} > res]} { >puts "Backup failed: $res" > } else { >puts "Backup succeeded" > } SMF uses SQLite2. Sorry. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 02:53:14PM -0500, Miles Nordin wrote: > > "cm" == Christopher Mera writes: > > cm> it would be ideal to quiesce the system before a snapshot > cm> anyway, no? > > It would be more ideal to find the bug in SQLite2 or ZFS. Training > everyone, ``you always have to quiesce the system before proceeding, > because it's full of bugs'' is retarded MS-DOS behavior. I think it > is actually harmful. It's NOT a bug in ZFS. It might be a bug in SQLite2, it might be a bug in svc.configd. More information would help; specifically: error/log messages from svc.configd, and /etc/svc/repository.db. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 2:15 PM, Nicolas Williams wrote: > On Tue, Feb 24, 2009 at 01:17:47PM -0600, Nicolas Williams wrote: > > I don't think there's any way to ask svc.config to pause. > > Well, IIRC that's not quite right. You can pstop svc.startd, gently > kill (i.e., not with SIGKILL) svc.configd, take your snapshot, then prun > svc.startd. > > Nico > -- Hot Backup? # Connect to the database sqlite3 db $dbfile # Lock the database, copy and commit or rollback if {[catch {db transaction immediate {file copy $dbfile ${dbfile}.bak}} res]} { puts "Backup failed: $res" } else { puts "Backup succeeded" } ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 01:17:47PM -0600, Nicolas Williams wrote: > I don't think there's any way to ask svc.config to pause. Well, IIRC that's not quite right. You can pstop svc.startd, gently kill (i.e., not with SIGKILL) svc.configd, take your snapshot, then prun svc.startd. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
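A rough sketch of that sequence; the snapshot name is made up, pstop/prun are the proc(1) tools, and svc.startd restarts svc.configd on its own once it is resumed:
# freeze svc.startd so it doesn't restart configd mid-snapshot
pstop $(pgrep -x svc.startd)
# gentle kill: let svc.configd close its repository cleanly
pkill -TERM -x svc.configd
# take the snapshot of the root pool (name and recursion are up to you)
zfs snapshot -r rpool@golden
# resume svc.startd, which brings svc.configd back
prun $(pgrep -x svc.startd)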
Re: [zfs-discuss] zfs streams & data corruption
Here's what makes me say that: There are over 700 boxes deployed using Flash Archive's on an S10 system with a UFS root. We've been working on basing our platform on a ZFS root and took Scott Dickson's suggestions (http://blogs.sun.com/scottdickson/entry/flashless_system_cloning_with_z fs) for doing a System Clone. The process worked out well, the system came up and looked stable until 24 hours later kernel panic's became incessant and svc.configd won't load its repository any longer. Hope that explains where I'm coming from.. Regards, Chris From: lori@sun.com [mailto:lori@sun.com] Sent: Tuesday, February 24, 2009 3:13 PM To: Christopher Mera Cc: Brent Jones; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] zfs streams & data corruption On 02/24/09 12:57, Christopher Mera wrote: How is it that flash archives can avoid these headaches? Are we sure that they do avoid this headache? A flash archive (on ufs root) is created by doing a cpio of the root file system. Could a cpio end up archiving a file that was mid-way through an SQLite2 transaction? Lori Ultimately I'm doing this to clone ZFS root systems because at the moment Flash Archives are UFS only. -Original Message- From: Brent Jones [mailto:br...@servuhome.net] Sent: Tuesday, February 24, 2009 2:49 PM To: Christopher Mera Cc: Mattias Pantzare; Nicolas Williams; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] zfs streams & data corruption On Tue, Feb 24, 2009 at 11:32 AM, Christopher Mera <mailto:cm...@reliantsec.net> wrote: Thanks for your responses.. Brent: And I'd have to do that for every system that I'd want to clone? There must be a simpler way.. perhaps I'm missing something. Regards, Chris Well, unless the database software itself can "notice" a snapshot taking place, and flush all data to disk, pause transactions until the snapshot is finished, then properly resume, I don't know what to tell you. It's an issue for all databases, Oracle, MSSQL, MySQL... how to do an atomic backup, without stopping transactions, and maintaining consistency. Replication is on possible solution, dumping to a file periodically is one, or just tolerating that your database will not be consistent after a snapshot and have to replay logs / consistency check it after bringing it up from a snapshot. Once you figure that out in a filesystem agnostic way, you'll be a wealthy person indeed. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On 02/24/09 12:57, Christopher Mera wrote: How is it that flash archives can avoid these headaches? Are we sure that they do avoid this headache? A flash archive (on ufs root) is created by doing a cpio of the root file system. Could a cpio end up archiving a file that was mid-way through an SQLite2 transaction? Lori Ultimately I'm doing this to clone ZFS root systems because at the moment Flash Archives are UFS only. -Original Message- From: Brent Jones [mailto:br...@servuhome.net] Sent: Tuesday, February 24, 2009 2:49 PM To: Christopher Mera Cc: Mattias Pantzare; Nicolas Williams; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] zfs streams & data corruption On Tue, Feb 24, 2009 at 11:32 AM, Christopher Mera wrote: Thanks for your responses.. Brent: And I'd have to do that for every system that I'd want to clone? There must be a simpler way.. perhaps I'm missing something. Regards, Chris Well, unless the database software itself can "notice" a snapshot taking place, and flush all data to disk, pause transactions until the snapshot is finished, then properly resume, I don't know what to tell you. It's an issue for all databases, Oracle, MSSQL, MySQL... how to do an atomic backup, without stopping transactions, and maintaining consistency. Replication is on possible solution, dumping to a file periodically is one, or just tolerating that your database will not be consistent after a snapshot and have to replay logs / consistency check it after bringing it up from a snapshot. Once you figure that out in a filesystem agnostic way, you'll be a wealthy person indeed. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
How is it that flash archives can avoid these headaches? Ultimately I'm doing this to clone ZFS root systems because at the moment Flash Archives are UFS only. -Original Message- From: Brent Jones [mailto:br...@servuhome.net] Sent: Tuesday, February 24, 2009 2:49 PM To: Christopher Mera Cc: Mattias Pantzare; Nicolas Williams; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] zfs streams & data corruption On Tue, Feb 24, 2009 at 11:32 AM, Christopher Mera wrote: > Thanks for your responses.. > > Brent: > And I'd have to do that for every system that I'd want to clone? There > must be a simpler way.. perhaps I'm missing something. > > > Regards, > Chris > Well, unless the database software itself can "notice" a snapshot taking place, and flush all data to disk, pause transactions until the snapshot is finished, then properly resume, I don't know what to tell you. It's an issue for all databases, Oracle, MSSQL, MySQL... how to do an atomic backup, without stopping transactions, and maintaining consistency. Replication is on possible solution, dumping to a file periodically is one, or just tolerating that your database will not be consistent after a snapshot and have to replay logs / consistency check it after bringing it up from a snapshot. Once you figure that out in a filesystem agnostic way, you'll be a wealthy person indeed. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
> "bj" == Brent Jones writes: bj> tolerating that your database will not be consistent after a bj> snapshot and have to replay logs / consistency check it ``not be consistent'' != ``have to replay logs'' pgpLNmP6hsO3I.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
> "cm" == Christopher Mera writes: cm> it would be ideal to quiesce the system before a snapshot cm> anyway, no? It would be more ideal to find the bug in SQLite2 or ZFS. Training everyone, ``you always have to quiesce the system before proceeding, because it's full of bugs'' is retarded MS-DOS behavior. I think it is actually harmful. pgpk37ALPzeuv.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 11:32 AM, Christopher Mera wrote: > Thanks for your responses.. > > Brent: > And I'd have to do that for every system that I'd want to clone? There > must be a simpler way.. perhaps I'm missing something. > > > Regards, > Chris > Well, unless the database software itself can "notice" a snapshot taking place, and flush all data to disk, pause transactions until the snapshot is finished, then properly resume, I don't know what to tell you. It's an issue for all databases, Oracle, MSSQL, MySQL... how to do an atomic backup, without stopping transactions, and maintaining consistency. Replication is one possible solution, dumping to a file periodically is another, or just tolerating that your database will not be consistent after a snapshot and have to replay logs / consistency check it after bringing it up from a snapshot. Once you figure that out in a filesystem-agnostic way, you'll be a wealthy person indeed. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 10:56:45AM -0800, Brent Jones wrote: > If you are writing a script to handle ZFS snapshots/backups, you could > issue an SMF command to stop the service before taking the snapshot. > Or at the very minimum, perform an SQL dump of the DB so you at least > have a consistent full copy of the DB as a flat file in case you can't > stop the DB service. I don't think there's any way to ask svc.configd to pause. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
Thanks for your responses.. Brent: And I'd have to do that for every system that I'd want to clone? There must be a simpler way.. perhaps I'm missing something. Regards, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 07:37:39PM +0100, Mattias Pantzare wrote: > On Tue, Feb 24, 2009 at 19:18, Nicolas Williams > wrote: > > When you snapshot a ZFS filesystem you get just that -- a snapshot at > > the filesystem level. That does not mean you get a snapshot at the > > _application_ level. Now, svc.configd is a daemon that keeps a SQLite2 > > database. If you snapshot the filesystem in the middle of a SQLite2 > > transaction you won't get the behavior that you want. > > > > In other words: quiesce your system before you snapshot its root > > filesystem for the purpose of replicating that root on other systems. > > That would be a bug in ZFS or SQLite2. I suspect it's actually a bug in svc.configd. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 10:41 AM, Christopher Mera wrote:
> Either way - it would be ideal to quiesce the system before a snapshot
> anyway, no?
>
> My next question now is what particular steps would be recommended to
> quiesce a system for the clone/zfs stream that I'm looking to achieve...
>
> All your help is appreciated.
>
> Regards,
> Christopher Mera
>
> -Original Message-
> From: Mattias Pantzare [mailto:pantz...@gmail.com]
> Sent: Tuesday, February 24, 2009 1:38 PM
> To: Nicolas Williams
> Cc: Christopher Mera; zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] zfs streams & data corruption
>
> On Tue, Feb 24, 2009 at 19:18, Nicolas Williams wrote:
>> On Mon, Feb 23, 2009 at 10:05:31AM -0800, Christopher Mera wrote:
>>> I recently read up on Scott Dickson's blog with his solution for
>>> jumpstart/flashless cloning of ZFS root filesystem boxes.  I have to say
>>> that it initially looks to work out cleanly, but of course there are
>>> kinks to be worked out that deal with auto mounting filesystems mostly.
>>>
>>> The issue that I'm having is that a few days after these cloned systems
>>> are brought up and reconfigured they are crashing and svc.configd
>>> refuses to start.
>>
>> When you snapshot a ZFS filesystem you get just that -- a snapshot at
>> the filesystem level.  That does not mean you get a snapshot at the
>> _application_ level.  Now, svc.configd is a daemon that keeps a SQLite2
>> database.  If you snapshot the filesystem in the middle of a SQLite2
>> transaction you won't get the behavior that you want.
>>
>> In other words: quiesce your system before you snapshot its root
>> filesystem for the purpose of replicating that root on other systems.
>
> That would be a bug in ZFS or SQLite2.
>
> A snapshot should be an atomic operation.  The effect should be the
> same as a power failure in the middle of a transaction, and decent
> databases can cope with that.

If you are writing a script to handle ZFS snapshots/backups, you could
issue an SMF command to stop the service before taking the snapshot.
Or at the very minimum, perform an SQL dump of the DB so you at least
have a consistent full copy of the DB as a flat file in case you can't
stop the DB service.

--
Brent Jones
br...@servuhome.net
Re: [zfs-discuss] zfs streams & data corruption
Either way - it would be ideal to quiesce the system before a snapshot
anyway, no?

My next question now is what particular steps would be recommended to
quiesce a system for the clone/zfs stream that I'm looking to achieve...

All your help is appreciated.

Regards,
Christopher Mera

-Original Message-
From: Mattias Pantzare [mailto:pantz...@gmail.com]
Sent: Tuesday, February 24, 2009 1:38 PM
To: Nicolas Williams
Cc: Christopher Mera; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] zfs streams & data corruption

On Tue, Feb 24, 2009 at 19:18, Nicolas Williams wrote:
> On Mon, Feb 23, 2009 at 10:05:31AM -0800, Christopher Mera wrote:
>> I recently read up on Scott Dickson's blog with his solution for
>> jumpstart/flashless cloning of ZFS root filesystem boxes.  I have to say
>> that it initially looks to work out cleanly, but of course there are
>> kinks to be worked out that deal with auto mounting filesystems mostly.
>>
>> The issue that I'm having is that a few days after these cloned systems
>> are brought up and reconfigured they are crashing and svc.configd
>> refuses to start.
>
> When you snapshot a ZFS filesystem you get just that -- a snapshot at
> the filesystem level.  That does not mean you get a snapshot at the
> _application_ level.  Now, svc.configd is a daemon that keeps a SQLite2
> database.  If you snapshot the filesystem in the middle of a SQLite2
> transaction you won't get the behavior that you want.
>
> In other words: quiesce your system before you snapshot its root
> filesystem for the purpose of replicating that root on other systems.

That would be a bug in ZFS or SQLite2.

A snapshot should be an atomic operation.  The effect should be the
same as a power failure in the middle of a transaction, and decent
databases can cope with that.
Re: [zfs-discuss] zfs streams & data corruption
On Tue, Feb 24, 2009 at 19:18, Nicolas Williams wrote:
> On Mon, Feb 23, 2009 at 10:05:31AM -0800, Christopher Mera wrote:
>> I recently read up on Scott Dickson's blog with his solution for
>> jumpstart/flashless cloning of ZFS root filesystem boxes.  I have to say
>> that it initially looks to work out cleanly, but of course there are
>> kinks to be worked out that deal with auto mounting filesystems mostly.
>>
>> The issue that I'm having is that a few days after these cloned systems
>> are brought up and reconfigured they are crashing and svc.configd
>> refuses to start.
>
> When you snapshot a ZFS filesystem you get just that -- a snapshot at
> the filesystem level.  That does not mean you get a snapshot at the
> _application_ level.  Now, svc.configd is a daemon that keeps a SQLite2
> database.  If you snapshot the filesystem in the middle of a SQLite2
> transaction you won't get the behavior that you want.
>
> In other words: quiesce your system before you snapshot its root
> filesystem for the purpose of replicating that root on other systems.

That would be a bug in ZFS or SQLite2.

A snapshot should be an atomic operation.  The effect should be the
same as a power failure in the middle of a transaction, and decent
databases can cope with that.
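One way to spot-check that claim on a live system is sketched below: take
a snapshot while the database is being written, clone it, and run the
database's own integrity check against the clone.  The dataset, clone
name, and database file are illustrative assumptions, and sqlite3 is used
only as an example client; the repository at issue in this thread is
SQLite2, so the exact tool would differ.

  #!/bin/sh
  # Snapshot a dataset while the database is live, clone the snapshot,
  # and run an integrity check against the clone.
  DATASET=tank/db                 # illustrative dataset
  CLONE=tank/db-verify            # illustrative clone name
  DBFILE=app.db                   # illustrative database file

  zfs snapshot "$DATASET@verify" || exit 1
  zfs clone "$DATASET@verify" "$CLONE"

  # A crash-consistent copy should roll back or roll forward cleanly
  # when the database opens it.  The clone mounts at /tank/db-verify
  # by default under these assumptions.
  sqlite3 "/$CLONE/$DBFILE" 'PRAGMA integrity_check;'

  # Clean up: destroy the clone first, then the snapshot it depends on.
  zfs destroy "$CLONE"
  zfs destroy "$DATASET@verify"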
Re: [zfs-discuss] zfs streams & data corruption
On Mon, Feb 23, 2009 at 10:05:31AM -0800, Christopher Mera wrote:
> I recently read up on Scott Dickson's blog with his solution for
> jumpstart/flashless cloning of ZFS root filesystem boxes.  I have to say
> that it initially looks to work out cleanly, but of course there are
> kinks to be worked out that deal with auto mounting filesystems mostly.
>
> The issue that I'm having is that a few days after these cloned systems
> are brought up and reconfigured they are crashing and svc.configd
> refuses to start.

When you snapshot a ZFS filesystem you get just that -- a snapshot at
the filesystem level.  That does not mean you get a snapshot at the
_application_ level.  Now, svc.configd is a daemon that keeps a SQLite2
database.  If you snapshot the filesystem in the middle of a SQLite2
transaction you won't get the behavior that you want.

In other words: quiesce your system before you snapshot its root
filesystem for the purpose of replicating that root on other systems.

Nico
--
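A rough sketch of the quiesce-then-replicate mechanics, with the pool
name, snapshot name, and target host all illustrative.  Real root cloning
involves more than this (boot blocks, alternate boot environments, and
receiving somewhere other than the target's live root pool); this only
shows the snapshot and stream steps, and it assumes you have already
stopped the services that write to the root filesystem.

  #!/bin/sh
  # On the source system: settle in-flight writes, take a recursive
  # snapshot of the root pool, and stream it to the target machine.
  POOL=rpool                      # illustrative root pool name
  SNAP=clone-`date +%Y%m%d`       # illustrative snapshot name
  TARGET=newhost                  # illustrative destination host

  sync                            # flush anything still buffered

  zfs snapshot -r "$POOL@$SNAP" || exit 1
  zfs send -R "$POOL@$SNAP" | ssh "$TARGET" zfs receive -Fd "$POOL"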
Re: [zfs-discuss] zfs streams & data corruption
Forgive the double posts, they will cease immediately

panic[cpu0]/thread=dacac880:
BAD TRAP: type=e (#pf Page fault) rp=d9f61850 addr=1048c0d occurred in
module "zfs" due to an illegal access to a user address

net-init: #pf Page fault
Bad kernel fault at addr=0x1048c0d
pid=1069, pc=0xfebab410, sp=0xd1c38018, eflags=0x10296
cr0: 8005003b  cr4: 6b8
cr2: 1048c0d  cr3: 7965020
 gs: e7b601b0  fs: fec9  es: 160  ds: 160
edi: 0  esi: 1048bf9  ebp: d9f618c4  esp: d9f61888
ebx: 1048bf9  edx: 1048bfd  ecx: dc311900  eax: d9f6192c
trp: e  err: 0  eip: febab410  cs: 158
efl: 10296  usp: d1c38018  ss: e43feb72

d9f6178c unix:die+93 (e, d9f61850, 1048c0)
d9f6183c unix:trap+1422 (d9f61850, 1048c0d, )
d9f61850 unix:cmntrap+7c (e7b601b0, fec9,)
d9f618c4 zfs:mze_compare+18 (d9f6192c, 1048bf9, )
d9f61904 genunix:avl_find+39 (d34b2958, d9f6192c,)
d9f619a4 zfs:mze_find+4a (e45fb8c0, d9f61c9c,)
d9f619e4 zfs:zap_lookup_norm+65 (dc2665a8, 21d, 0, d)
d9f61a34 zfs:zap_lookup+31 (dc2665a8, 21d, 0, d)
d9f61a94 zfs:zfs_match_find+ba (dc8fb980, e0ed3460,)
d9f61b04 zfs:zfs_dirent_lock+358 (d9f61b38, e0ed3460,)
d9f61b54 zfs:zfs_dirlook+f7 (e0ed3460, d9f61c9c,)
d9f61ba4 zfs:zfs_lookup+d5 (e0b420c0, d9f61c9c,)
d9f61c04 genunix:fop_lookup+b0 (e0b420c0, d9f61c9c,)
d9f61dc4 genunix:lookuppnvp+3e4 (d9f61e3c, 0, 1, 0, )
d9f61e14 genunix:lookuppnat+f3 (d9f61e3c, 0, 1, 0, )
d9f61e94 genunix:lookupnameat+52 (807b51c, 0, 1, 0, d)
d9f61ef4 genunix:cstatat_getvp+15d (ffd19553, 807b51c, )
d9f61f54 genunix:cstatat64+68 (ffd19553, 807b51c, )
d9f61f84 genunix:stat64+1c (807b51c, 8047b50, 8)

From: lori@sun.com [mailto:lori@sun.com]
Sent: Monday, February 23, 2009 1:17 PM
To: Christopher Mera
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] zfs streams & data corruption

I don't know what's causing this, nor have I seen it.  Can you send more
information about the errors you see when the system crashes and
svc.configd fails?

Doing the scrub seems like a harmless and possibly useful thing to do.
Let us know what you find out from it.

Lori

On 02/23/09 11:05, Christopher Mera wrote:

Hi folks,

I recently read up on Scott Dickson's blog with his solution for
jumpstart/flashless cloning of ZFS root filesystem boxes.  I have to say
that it initially looks to work out cleanly, but of course there are
kinks to be worked out that deal with auto mounting filesystems mostly.

The issue that I'm having is that a few days after these cloned systems
are brought up and reconfigured they are crashing and svc.configd
refuses to start.

I thought about using zpool scrub right after completing the stream as
an integrity check.  If you have any suggestions about this I'd love to
hear them!

Thanks,
Christopher Mera
Re: [zfs-discuss] zfs streams & data corruption
I don't know what's causing this, nor have I seen it.  Can you send more
information about the errors you see when the system crashes and
svc.configd fails?

Doing the scrub seems like a harmless and possibly useful thing to do.
Let us know what you find out from it.

Lori

On 02/23/09 11:05, Christopher Mera wrote:

Hi folks,

I recently read up on Scott Dickson's blog with his solution for
jumpstart/flashless cloning of ZFS root filesystem boxes.  I have to say
that it initially looks to work out cleanly, but of course there are
kinks to be worked out that deal with auto mounting filesystems mostly.

The issue that I'm having is that a few days after these cloned systems
are brought up and reconfigured they are crashing and svc.configd
refuses to start.

I thought about using zpool scrub right after completing the stream as
an integrity check.  If you have any suggestions about this I'd love to
hear them!

Thanks,
Christopher Mera
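For completeness, the scrub check mentioned above is a one-liner plus a
status poll; the pool name below is illustrative, and note that a scrub
only verifies on-disk checksums, not application-level consistency of the
svc.configd repository.

  #!/bin/sh
  # Start a scrub on the receiving system after the stream completes,
  # then wait for it to finish and show the result.
  POOL=rpool      # illustrative pool name

  zpool scrub "$POOL"

  # Poll until the scrub is no longer in progress.
  while zpool status "$POOL" | grep -q 'scrub in progress'; do
      sleep 60
  done
  zpool status -v "$POOL"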