Re: [zfs-discuss] Best way to convert checksums
Richard, with respect to: "This has been answered several times in this thread already. Set checksum=sha256 on the filesystem, copy your files -- all newly written data will have the sha256 checksums." I understand that. I understood it before the thread started; it is not what I asked. It is a fact that there is no feature to convert checksums as part of a resilver or the like. I started by asking what utility to use, but quickly zeroed in on zfs send/receive as the native and presumably best method, and then had questions about how to get the property set correctly on the file system that receive creates automatically. Note that my focus in recent portions of the thread has shifted to the underlying zpool.

Simply changing checksum=sha256 and copying my data is analogous to hanging my data from a hierarchy of 0.256 welded steel chain, with the top of the hierarchy hanging it all from a 0.001 steel thread. That is not quite fair, because there are probabilities involved. Someone is going to pick a link at random and go after it with a fingernail clipper. If they pick a thick one, I have very little to worry about, to say the least. If they pick one of the few dozen? hundred? thousand? (I don't know how many) that hold the structure and services of the underlying zpool, then the nail clipper will not be stopped by the 0.001 thread. I do have 8,000,000,000 links in the chain and only a very small fraction are 0.001 thick, which is strongly in my favor, but I would also expect the heads to spend a disproportionate amount of time over the intent log. It is hard to know how it comes out. I just don't want any 0.001 steel threads protecting my data from the gremlins. I moved to ZFS to avoid gambles; if I wanted gambles I would use Linux RAID and LVM2, which work well enough as long as there are no errors.

I should have enumerated the knowns and unknowns in my list last night; then I would not have annoyed you with my apparent deafness. (Hopefully I am not still being deaf.) I will clarify below, as I should have last night. Given that I only have 1.6TB of data in a 4TB pool, what can I do to change those blocks to sha256 or fletcher4?

(1) Without destroying and recreating the zpool under U4: I know how to fix the user data (change the checksum property at the pool level -- running zfs against the pool name rather than a file system -- then copy the data). I do not know which blocks comprise the underlying zpool itself, or how to fix them without recreating the pool. It makes sense to me that at least some would be rewritten in the course of using the system, but (a) I have had no confirmation or denial that this is the case, (b) I don't know whether that covers all of them or only some, and (c) I don't know whether the checksum property affects them at all (relling's Oct 2, 3:26 post implies that it does not, by its lack of reference to the checksum property). So I don't know yet how much exposure will remain. I would think that if the user specified a stronger checksum for their data, the system would abandon its use of weaker ones in the underlying structure, but Richard's list seems to imply otherwise.

(2) With destroying and recreating the zpool under U4 (which I don't really have the resources to pull off): Due to some non-technical factors in the situation, I cannot actually execute an experimental valid zpool command, but "zpool create -o garbage" gives me a usage message that does not include any -o or -O. So it appears that under U4 I cannot do this.
I wish someone could confirm whether I can do this before I arrange for and propose that we dive into this massive undertaking. Also, from Richard's Oct 2, 3:26 note, I infer that this would not change the checksum used by the underlying zpool anyway, so it might be fruitless. But I am inferring... Richard gave a quick list, and his intent was not to provide every level of precise detail, so I really don't know. Many of the answers I have received have turned out to recommend features that are not available in U4 but only in later versions, even unreleased versions. I have no way of sorting this out unless the information is qualified with a version.

(3) With upgrading to U7 (perhaps in a few months): Not clear what this supports on the zpool, or whether it would be effective (similar to U4 above).

(4) With upgrading to U8: Not sure when it will come out, what it will support, or whether it would be effective (similar to U7 and U4 above).

So I can enable robust protection on my user data, but perhaps not on the infrastructure needed to get at that user data, and perhaps not on the intent log. The answer may be that I cannot be helped. That is not the desired answer, but if that is the case, so be it. Let's lay out the facts and the best way to move on from here, for me and everybody else. Why leave us thrashing in the dark? Am I a Mac user? I personally will still believe ZFS is the way to go in the short term because it is
Re: [zfs-discuss] Desire simple but complete copy - How?
Responding to p...@paularcher.org's Sep 30, 2009 9:21 post: For the entire file system, I have chosen zfs send/receive, per the thread "Best way to convert checksums." I had concerns; they have been answered. So my immediate need is met.

The question remains how to copy portions of trees, which zfs send/receive does not do -- that is, just copying stuff around normally. cp, tar, cpio, and rsync are all great until one digs in; then there are various issues with the various implementations, none of which are documented. Witness the warnings I have gotten that Sun tar does not support sparse files. Is this true? Was it true, and has it been fixed? Another thread says something to the effect that if you tell Sun tar not to copy ACLs, it creates an empty ACL with every file where there was no ACL at all in the source! Every implementation of every utility has what read like credible, carefully documented rocks thrown at it.

My conclusion is not to use any file system features that cannot be verified with a simple file-contents comparison. That will check the file contents, the filename, and the pathname; I can eyeball the file owner, group, and dates. Sparseness, ACLs, and extended attributes I will simply not use. I will have to use non-Solaris utilities for the compare, or build my own in scripts, because I have large files and diff does not support them and cmp does not recurse.

Thank you all for your insights and suggestions. I consider this done, but given the unsatisfying result, I would still welcome any new remarks.
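Since I mentioned building my own comparison in scripts, here is a minimal sketch of the kind of thing I have in mind, assuming the Solaris digest utility is available. The mount points are stand-ins for my real ones, and this only compares contents and relative pathnames, not ownership or dates:

    #!/bin/sh
    # Compare the file contents of two trees by MD5 digest.
    SRC=/zfs01/home          # stand-in source mount point
    DST=/zfs01/home.new      # stand-in destination mount point

    # "digest -v" prints lines like: md5 (./some/file) = <hash>
    ( cd "$SRC" && find . -type f -exec digest -v -a md5 {} \; | sort ) > /tmp/src.md5
    ( cd "$DST" && find . -type f -exec digest -v -a md5 {} \; | sort ) > /tmp/dst.md5

    # Any output here is a file that differs, or exists on only one side.
    diff /tmp/src.md5 /tmp/dst.md5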
Re: [zfs-discuss] Best way to convert checksums
With respect to relling's Oct 3, 2009 7:46 AM post: "I think you are missing the concept of pools. Pools contain datasets. One form of dataset is a file system. Pools do not contain data per se; datasets contain data. Reviewing the checksums used with this hierarchy in mind:

  Pool
    Label [SHA-256]
    Uberblock [SHA-256]
    Metadata [fletcher4]
    Gang block [SHA-256]
    ZIL log [fletcher2]
  Dataset (file system or volume)
    Metadata [fletcher4]
    Data [fletcher2 (default, today), fletcher4, or SHA-256]
  Send stream [fletcher4]

With this in mind, I don't understand your steel analogy."

I am assuming, based on the context of your presentation, that the above list of pool-level items is exhaustive -- that this is everything not in a dataset. My steel analogy is based on the assumption that the pool-level items listed above are needed to gain access to the dataset. If the dataset can be accessed with all of the pool-level items trashed, then my steel thread does not exist; but that would also mean the pool-level items are extraneous, so I doubt that is the case.

Given that all of the pool-level items are either SHA-256 or fletcher4 except for the ZIL, I now understand (though I don't understand the details of the system) that I am not depending on fletcher2-protected data, and my steel thread is actually pretty thick, not 0.001. Based on your comments regarding the ZIL, I infer that stuff is written there and never read except during a restart after a messy shutdown. I might be exposed to whatever weakness fletcher2 has as implemented, but only in those rare circumstances; normal transactions and data would not be impacted by corruption in the ZIL blocks, since they would never be read. So a large layer of probability protects me: I would have to have a crash at the same instant as corruption in the ZIL that hits a fletcher2 weakness.

Based on all of this, I believe I am relatively happy simply copying my data rather than recreating my zpool. As Darren Moffat taught me, I can zfs set checksum=sha256 zfs01, where zfs01 is the zpool, then zfs send zfs01/h...@snapshot | zfs receive zfs01/home.new, and the new file system will be all sha256 as long as I don't specify the -R option on the zfs send. All of this is supported in U4; I believe it has to be, due to the presence of files with properties in the (odd?) zfs file system that exists at the zfs01 zpool level before any zfs file systems are created.

So assuming the above process works, this thread is done as far as I am concerned. Thank you all for your help -- not to snub anyone, but Darren, Richard, and Cindy especially come to mind. Thanks for sparring with me until we understood each other. --Ray
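To pull the procedure together in one place, this is roughly the sequence I plan to run. "home" stands in for my actual file system name (the archive elides it above), and the key assumption, confirmed in this thread, is that a plain zfs send without -R does not carry the checksum property, so the received file system inherits sha256 from the pool:

    # Make newly written blocks use sha256 from here on (set at the
    # pool-level dataset so new file systems inherit it).
    zfs set checksum=sha256 zfs01

    # Snapshot and copy the file system; "home" is a stand-in name.
    zfs snapshot zfs01/home@convert
    zfs send zfs01/home@convert | zfs receive zfs01/home.new    # no -R on purpose

    # Confirm the new file system picked up the inherited value.
    zfs get -o name,value,source checksum zfs01 zfs01/home.new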
Re: [zfs-discuss] Best way to convert checksums
Data security. I migrated my organization from Linux to Solaris, driven away from Linux by the shortfalls of fsck on TB-sized file systems and toward Solaris by the features of ZFS. At the time I tried to dig up information on the tradeoffs among fletcher2, fletcher4, and SHA-256 and found nothing. Studying the algorithms, I decided that fletcher2 would tend to be weak for periodic data, which characterizes my data. I ran throughput tests and got 67MB/sec for fletcher2 and fletcher4 and 48MB/sec for SHA-256. I projected (perhaps without basis) SHA-256's cryptographic strength to also mean strength as a hash, and chose it, since 48MB/sec is more than I need.

Twenty-one months later (9/15/09) I lost everything to corrupt metadata -- ZFS-8000-CS (not sure where that was printed). No clue why to date; I will never know. The person who restored from tape was not told to set checksum=sha256, so it all went in with the default, fletcher2. Before taking rather disruptive action to correct this, I decided to question my original decision and found schlie's post stating that a bug in fletcher2 makes it essentially a one-bit parity on the entire block: http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30. While that is twice as good as any other file system in the world that has NO such checksum, it does not provide the security I migrated for. Especially since I do not know what caused the original data loss, it is all I have to lean on.

Convinced that I need to convert all of the checksums to sha256 to have the data security ZFS purports to deliver, and in the absence of a checksum conversion capability, I need to copy the data. It appears that all of the implementations of the various means of copying data -- tar, cpio, cp, rsync, pax -- have ghosts in their closets, each living in a glass house and each throwing stones at the others over various issues with file size, filename length, pathname length, ACLs, extended attributes, sparse files, etc., etc. It seems like zfs send/receive *should* be safe from all such issues as part of the zfs family, but the questions raised here become ambiguous once one starts to think about it. If the file system is faithfully duplicated, it should also duplicate all properties, including the checksum used on each block. It appears (to my advantage) that this is not what is done: the file system spontaneously created by zfs receive inherits from the pool, which evidently can be set to sha256 even though it is a pool, not a file system in the pool.

The present question is protection of the base pool. This can be set when the pool is created, though not with U4, which I am running. It is not clear (yet) whether this is simply not documented in the current release or whether the version that supports it has not been released yet. If I were to upgrade (which I cannot do in a timely fashion), it would only be to U7; I cannot run a weekly-build type of OS on my production server. Any way it goes, I am hosed. In short, there is surely some structure -- some blocks with stuff written in them when a pool is created but before anything else is done, else it would be a blank disk, not a zfs pool. Are these protected by fletcher2 as the default? I have learned that the uberblock is protected by SHA-256 and other parts by fletcher4. Is this everything? In U4 was it fletcher4, or was this a recent change stemming from schlie's report?
In short, what is the situation with regard to the data security I switched to Solaris/ZFS for, and what can I do to achieve it? What *do* the tools do? Are there tools for what needs to be done -- to convert things, to copy things, to verify things -- completely and correctly? So here is where I am: I should use zfs send/receive, but I cannot have confidence that there are no fletcher2-protected blocks (1-bit parity) at the most fundamental levels of the zpool. To verify data, I cannot depend on existing tools, since diff is not large-file aware. My best idea at this point is to calculate and compare MD5 sums of every file and spot-check other properties as best I can. Given this rather full perspective, help or comments are very much appreciated. I still think zfs is the way to go, but the road is a little bumpy at the moment.
Re: [zfs-discuss] Best way to convert checksums
Apologies that the preceding post appears out of context. I expected it to indent when I pushed the reply button on myxiplx's Oct 1, 2009 1:47 post; it was in response to his question. I will try to remember to provide links internal to my messages.
Re: [zfs-discuss] Best way to convert checksums
Replying to Cindy's Oct 1, 2009 3:34 PM post: Thank you. The second part was my attempt to guess my way out of this. If the fundamental structure of the pool (that which was created before I set the checksum=sha256 property) is using fletcher2, perhaps as I use the pool all of this structure will be updated and therefore automatically migrate to the new checksum. It would be very difficult for me to recreate the pool, but I have space to duplicate the user files (and so get the new checksum). Perhaps this will also result in the underlying structure of the pool being converted in the course of normal use. Comments for or against?
Re: [zfs-discuss] Best way to convert checksums
Replying to relling's October 1, 2009 3:34 post: Richard, regarding "when a pool is created, there is only metadata, which uses fletcher4" -- was this true in U4, or is this a new change, with U4's default having been fletcher2? Similarly, did the uberblock use SHA-256 in U4? I am running U4. --Ray
Re: [zfs-discuss] Best way to convert checksums
My pool was the default, with checksum=sha256. The default has two copies of all metadata (as I understand it) and one copy of user data. It was a raidz2 with eight 750GB drives, yielding just over 4TB of usable space. I am not happy with the situation, but I recognize that I am 2x better off (1-bit parity) than I would be with any other file system.
Re: [zfs-discuss] Best way to convert checksums
Replying to hakanson's Oct 2, 2009 2:01 post: Thanks. I suppose it is true that I am not even trying to compare the peripheral stuff, and the simple presence of a file, with its data matching, covers some of it. Using these utilities to move data, one encounters a longer list: sparse files, ACL handling, extended attributes, filename length, pathname length, large files -- and probably other interesting things that can be handled incorrectly. Most of the information on misbehavior of the various archive/backup/data-movement utilities is very old; one wonders how they behave today. A compilation would be useful, but I can't produce it.
Re: [zfs-discuss] Best way to convert checksums
Cindy's Oct 2, 2009 2:59 post -- thanks for staying with me.

Re: "The checksums are set on the file systems, not the pool." But previous responses seem to indicate that I can set the checksum for files stored in the file system that appears to be the pool, at the pool level, before I create any new ones. One post seems to indicate that there is a checksum property for this file system, and independently for the pool. (This topic needs a picture.)

Re: "If a new checksum is set and *you* rewrite the data ... then the duplicated data will have the new checksum." Understood. Now I am on to being concerned about the blocks that comprise the zpool that *contains* the file system.

Re: "ZFS doesn't rewrite data as part of normal operations. I confirmed with a simple test (like Darren's) that even if you have a single-disk pool and the disk is replaced and all the data is resilvered and a new checksum is set, you'll see data with the previous checksum and the new checksum." Yes -- a resilver duplicates exactly. Darren's example showed that without -R, no properties are sent, so zfs receive has no choice but to use the pool default for the zfs file system it creates. This also implies that there is a property associated with the pool. So my previous comment about zfs send/receive not duplicating exactly was not fair. The man page / admin guide should be clear about what is sent without -R; I would have guessed everything except descendent file systems.

It is a shame that zdb is totally undocumented. I thought I had discovered a gold mine when I first read Darren's note! --Ray
Re: [zfs-discuss] Best way to convert checksums
Re: relling's Oct 2, 2009 3:26 post: (1) Is this list everything? (2) Is it the same for U4? (3) If I change the zpool checksum property at creation, as you indicated in your Oct 1, 12:51 post (evidently very recent versions only), does that change the checksums used for the items on this list? Why would the strongest checksum not be used for the most fundamental data, rather than fooling around, allowing the user to compromise only where the tradeoff pays back -- on the 99% bulk of the data?

Re: "The big question, that is currently unanswered, is do we see single bit faults in disk-based storage systems?" I don't think that is the question. I believe the implication of schlie's post is not that single-bit faults will get through, but that the current fletcher2 is equivalent to a single-bit checksum. You could have 1,000 bits in error, or 4,095, and still have only a 50-50 chance of detecting it. A single-bit error would be certain to be detected (I think) even with the current code.
Re: [zfs-discuss] Best way to convert checksums
Re: Miles Nordin, Oct 2, 2009 4:20:

Re: "Anyway, I'm glad the problem is both fixed..." I want to know HOW it can be fixed. If they fixed it, that invalidates every pool that has not been changed from the default (probably almost all of them!). This can't be! So what WAS done? In the interest of honesty in advertising, and of enabling people to evaluate their own risks, I think we should know how it was fixed. Something either ingenious or potentially misleading must have been done. I am not suggesting it was not the best way to handle a difficult situation, but I don't see how it can be transparent: if the string fletcher2 does the same thing, it is not fixed; if it does something different, it is misleading.

Re: "... and avoidable on the broken systems." Please tell me how! Without destroying and recreating my zpool, I can only fix the zfs file system blocks, not the underlying zpool blocks. WITH destroying and recreating my zpool, I can only control the checksum on the underlying zpool using a version of Solaris that is not yet available -- and even then (pending relling's response) it may or may not affect the blocks I am concerned about. So how is this avoidable? It is partially avoidable (so far) IF I have the luxury of doing significant rebuilding. No?
Re: [zfs-discuss] Best way to convert checksums
Re: relling's Oct 2 5:06 post, the analogy to ECC memory: I appreciate the support, but the ECC memory analogy does not hold water. ECC memory is designed to correct for multiple independent events -- electrical noise, bits flipped by alpha particles from the DRAM package, cosmic rays -- and the probability of these independent events coinciding in time and space is very small indeed. It works well. ZFS does purport to cover errors such as these in the crummy double-layer boards without sufficient decoupling, the microcontrollers and memories without parity or ECC, etc., found in the cost-reduced-to-the-razor's-edge hardware most of us run on, but it also covers system-level errors such as entire blocks being replaced, or large fractions of them being corrupted by high-level bugs. With the current fletcher2 we have only a 50-50 chance of catching these multi-bit errors. The probability of multiple bits being changed is not small, because the probabilities of the error mechanism affecting the 4096~1048576 bits in the block are not independent. Indeed, in many of the show-cased mechanisms it is a sure bet -- the entire disk sector is written with the wrong data, for sure! Although there is a good chance that many of the bits in the sector happen to match, there is an excellent chance that many are different, and the mechanisms that caused those differences were not independent.

Re: "AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist." They certainly exist: by simple counting, for every SHA-256 digest there is an enormous number of 1,048,576-bit blocks (on the order of 2^(1048576-256)) that produce it. One hopes that the same properties that make SHA-256 a good cryptographic hash also make it a good hash, period. This, I admit, is a leap of ignorance (at least I know what cliff I am leaping off of). Regarding the question of what people have seen: I have seen lots of unexplained things happen, and by definition one never knows why. I am not interested in seeing any more. I see the potential for disaster, and my time, and the time of my group, is better spent doing other things. That is why I moved to ZFS.
Re: [zfs-discuss] Best way to convert checksums
Let me try to refocus. Given that I have a U4 system with a zpool created with fletcher2: what blocks in the system are protected by fletcher2, or even fletcher4 (although that does not worry me so much)? And given that I only have 1.6TB of data in a 4TB pool, what can I do to change those blocks to sha256 or fletcher4:

(1) Without destroying and recreating the zpool under U4?
(2) With destroying and recreating the zpool under U4 (which I don't really have the resources to pull off)?
(3) With upgrading to U7 (perhaps in a few months)?
(4) With upgrading to U8?

Thanks.
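For completeness, this is what I would try under options (3) or (4) if a newer release turns out to support setting file system properties at pool creation. The pool and device names are hypothetical, and I cannot verify the -O syntax on U4 since, as noted, its zpool create usage shows no -o or -O:

    # List the pool and file system versions this release supports
    # (output varies by release; zfs upgrade may not exist on older ones).
    zpool upgrade -v
    zfs upgrade -v

    # On releases whose zpool create accepts -O (root dataset properties),
    # a pool could start life with sha256 everywhere the property applies.
    # Hypothetical pool and device names.
    zpool create -O checksum=sha256 tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0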
Re: [zfs-discuss] Best way to convert checksums
Darren, thank you very much! Not only have you answered my question, you have made me aware of a tool to verify, and probably do a lot more (zdb). Can you comment on my concern regarding what checksum is used in the base zpool before anything is created in it? (No doubt my terminology is wrong, but you get the idea, I am sure.) The single most-touted feature of ZFS is arguably that every block is checksummed to enable detection of corruption, yet it appears that the user does not have the ability to choose the checksum for the highest levels of the pool itself. Given the issue with fletcher2, this is a concern! Since this activity was kicked off by a corrupt-metadata ZFS-8000-CS, I am trying to move away from fletcher2. I don't know whether that was the cause, but my goal is to restore the safety we went to ZFS for. Is my understanding correct? Are there ways to control the checksum algorithm on the empty zpool? Thanks again. --Ray
Re: [zfs-discuss] Best way to convert checksums
U4 zpool does not appear to support the -o option... Reading a current zpool man page online, the valid properties listed for the current zpool -o do not include checksum. Are you mistaken, or am I missing something? Another thought: *perhaps* all of the blocks that comprise an empty zpool are rewritten sooner or later, so once the checksum is changed with zfs set checksum=sha256 zfs01 (the pool name), they will be rewritten with the new checksum fairly soon anyway. Is this true? Answering that would require an understanding of the on-disk structure and of when what gets rewritten. --Ray
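For what it is worth, one thing I can check today, even on U4, is where each dataset's checksum value comes from. This shows only the property settings, not which already-written blocks use which algorithm:

    # Show the checksum property and whether it is local, inherited, or default.
    zfs get -r -o name,value,source checksum zfs01

    # After setting it at the pool level, children that have not overridden it
    # should report the value as inherited from zfs01.
    zfs set checksum=sha256 zfs01
    zfs get -r -o name,value,source checksum zfs01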
Re: [zfs-discuss] Best way to convert checksums
I made a typo... I only have one pool. I should have typed:

  zfs snapshot zfs01/h...@before
  zfs send zfs01/h...@before | zfs receive zfs01/home.sha256

Does that change the answer? Independently of whether it does: zfs01 is a pool, and the property is on the home zfs file system. I cannot change it on the new file system before doing the receive, because that file system does not exist yet -- it is created by the receive. This raises a related question: is the file system on the receiving end created entirely with the checksum property of the source file system, or are the blocks, with their present mix of checksums, faithfully recreated in the received file system? Finally, is there any way to verify the behavior after it is done? Thanks for helping with this.
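On the verification question, my current best guess, based on Darren's mention of zdb, is something like the following. zdb is undocumented and its output varies by release, and the object number is a placeholder, so treat this as a sketch rather than a recipe:

    # At high verbosity zdb prints block pointers, and each block pointer
    # names its checksum (fletcher2, fletcher4, sha256, ...).
    # zfs01/home.sha256 is the receive target from my command above.
    zdb -ddddd zfs01/home.sha256

    # If the full dump is too large, narrow it to one file by object number
    # (readable from a lower-verbosity listing of the same dataset).
    zdb -ddddd zfs01/home.sha256 <object-number>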
Re: [zfs-discuss] Best way to convert checksums
Dynamite! I don't feel comfortable leaving things implicit -- that is how misunderstandings happen. Would you please confirm that zfs send | zfs receive uses the checksum setting on the receiving pool instead of preserving the checksum algorithm used on the sending blocks? Thanks a million! --Ray
Re: [zfs-discuss] Best way to convert checksums
Sinking feeling... zfs01 was originally created with fletcher2. Doesn't this mean that the root-level structures of the zfs pool exist with fletcher2 checksums and so are not well protected? If so, is there a way to fix this short of a backup and restore?
Re: [zfs-discuss] True in U4? Tar and cpio...save and restore ZFS File attributes and ACLs
Joerg, thanks. As you (of all people) know, this area is quite a quagmire. I am confident that I don't have any sparse files -- or if I do, they are small and losing that property would not be a big impact. I have determined that none of the files have extended attributes or ACLs. Some are greater than 4GB and have long paths, but Sun tar supports both if I include the E option. I am trusting that, because it is recommended in the ZFS Admin Guide, it is my safest option with respect to any ZFS idiosyncrasies, given its limitations. If only those were documented! My next problem is that I want to do an exhaustive file compare afterwards, and diff is not large-file aware. I always wonder how these applications that run across every OS known to man, such as star, can possibly have the right code to work around the idiosyncrasies and exploit the capabilities of all of those OSes. Should I consider star for the compare? For the copy? (Recognizing that it cannot do the ACLs, but I don't have those.)
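For the record, the invocation I have in mind, given that I need large files and long paths but not ACLs or sparse files, is roughly the following. /SRC and /DST are stand-ins for my real directories, and I am trusting that Sun tar's E modifier (extended headers) behaves on U4 as the current man page describes:

    # Copy one tree to another with Sun tar, preserving permissions (p).
    # The E modifier writes extended headers so large files and long
    # path names survive; extraction recognizes those headers automatically.
    cd /SRC && tar cEf - . | ( cd /DST && tar xpf - )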
Re: [zfs-discuss] Best way to convert checksums
When using zfs send/receive to do the conversion, the receive creates a new file system:

  zfs snapshot zfs01/h...@before
  zfs send zfs01/h...@before | zfs receive afx01/home.sha256

Where do I get the chance to zfs set checksum=sha256 on the new file system before all of the files are written? The new file system is created automatically by the receive command! Although it does not say so in the man page or the ZFS admin guide, it seems reasonable that I don't get a chance -- the idea is that send/receive recreates the file system exactly. Even then there is an ambiguity: are the new blocks created with the checksum algorithm they had in the source file system (which would not accomplish the conversion I am after), or are they created and checksummed with the algorithm specified by the checksum PROPERTY set on the source file system at the time of the send/receive (which WOULD accomplish it)? Is there a way to use send/receive to duplicate a file system with a different checksum, or do I use cpio or tar? (I pick on cpio and tar because they are specifically called out in the ZFS admin manual as saving and restoring ZFS file attributes and ACLs.) Thanks. --Ray
[zfs-discuss] True in U4? Tar and cpio...save and restore ZFS File attributes and ACLs
The April 2009 ZFS Administration Guide states "...tar and cpio commands, to save ZFS files. All of these utilities save and restore ZFS file attributes and ACLs." I am running 8/07 (U4). Was this true for the U4 version of ZFS and the tar and cpio shipped with U4? Also, I cannot seem to figure out how to find the ZFS admin manual applicable to U4. Could someone please shove me in the right direction?
[zfs-discuss] Desire simple but complete copy - How?
It appears that I have waded into a quagmire. Every option I can find (cpio, tar (many versions!), cp, star, pax) has issues; file size, filename or path length, and ACLs are common shortfalls. Surely there is an easy answer, he says naively! I simply want to copy one zfs file system tree to another, replicating it exactly: times, permissions, hard links, symbolic links, sparse file holes, ACLs, extended attributes, and anything I don't know about. Can you give me a command line with parameters? I will then study what they mean. If there is no correct answer, please give me your best recommendation and what you know about it. Thanks. (I hope!) --Ray
[zfs-discuss] Best way to convert checksums
What is the Best way to convert the checksums of an existing ZFS file system from one checksum to another? To me Best means safest and most complete. My zpool is 39% used, so there is plenty of space available. Thanks.
Re: [zfs-discuss] Best way to convert checksums
I didn't want my question to lead to an answer, but perhaps I should have given more information. My idea is to copy the file system with one of the following:

  cp -rp
  zfs send | zfs receive
  tar
  cpio

But I don't know which would be best. Then I would do a diff -r on the two before deleting the old one. I don't know about the obscure (for me) secondary things like attributes, links, extended modes, etc. Thanks again.
[zfs-discuss] Checksum property change does not change pre-existing data - right?
My understanding is that if I run zfs set checksum=... to change the algorithm, this will change the checksum algorithm for all FUTURE data blocks written, but will not in any way change the checksums of previously written data blocks. I need to corroborate this understanding. Could someone please point me to a document that states this? I have searched and searched and cannot find it. Thank you.
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
It completed copying 191,xxx MB without issue in 17 hours and 40 minutes, an average transfer rate of 3.0MB/sec. During the copy (at least the first hour or so, and an hour in the middle), the machine was reasonably responsive. It was jerky to a greater or lesser extent, but nothing like even the best times with gzip-9; not sure how to convey it, but the machine was usable. It was stopped by running out of disk space -- the source was about 1GB larger than the target zfs file system. (When I started this exercise I had an IT8212 PCI PATA card in the system for another pair of drives for the pool, and took it out to eliminate a potential cause of my troubles.)

Interestingly, before I started I had to reboot, as there was a trash applet eating 100% of the CPU, 60% user, 40% system. Note that I have not made, much less deleted, any files with gnome, nor put any in my home directory; I don't even know how to do those things, as I am a KDE man. All I have done is futz with this zfs in a separate pool and type at terminal windows. I can't imagine what the trash applet was doing with 100% of the CPU for an extended time with no files to manage!

Something I have not mentioned is that the fourth memory socket was worn out a few years ago testing memory; that is why I only have 768MB installed (the bottom three sockets have not been abused and are fine). My next move is to trade the motherboard for one in good shape so I can put in all 1024MB, plug in the IT8212 with a couple of 160GB disks to get my pool up to 360GB, and install RC2... But it looks like 2008.11 has been released! The mirrors still have 2008.05, but the main link goes to osol-0811.iso. Is that final, not an RC? I will be beating on it to gain confidence and learn about Solaris. If anyone wants me to run any other tests, let me know. Thanks (again) for all of your help.
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Re pantzer5's suggestion: "Memory is not a big problem for ZFS, address space is. You may have to give the kernel more address space on 32-bit CPUs: eeprom kernelbase=0x80000000. This will reduce the usable address space of user processes though."

Would you please verify that I understand correctly? I am extrapolating here based on general knowledge. During a running user process, the process has the entire lower part of the address space below the kernel. The kernel is loaded at kernelbase and has from there to the top (2**32 - 1) for its purposes; evidently it is relocatable or position independent. The positioning of kernelbase really has nothing to do with how much physical RAM I have, since the user memory, and perhaps some of the kernel memory, is virtual (paged). So the fact that I have 768MB does not enter into this decision directly (it does indirectly, per Jeff's note implying that kernel structures need to be larger with larger RAM -- makes sense: more to keep track of, more page tables).

By default kernelbase is set at 3G, so presumably the kernel needs a minimum of 1G of space. Every userland process gets the full virtual space from 0 to kernelbase-1. So unless I am going to run a process that needs more than 1G, there is no advantage in setting kernelbase to anything larger than 1G, even if physical RAM is larger. If I am not going to run virtual machines or edit enormous video, audio, or image files in RAM, I really have no use for userland address space, and giving a lot of it to the kernel can only help it keep things mapped rather than having to recreate information (although I don't have a good handle on the utility of address space without a storage mechanism like RAM or disk behind it... it must be something akin to a page fault with pages mapped to a disk file, so you don't have to walk the file hierarchy). Hence your suggestion to set kernelbase to 2G -- but 1G is probably fine too (although the incremental benefit may be negligible; I am going for the principle here). How am I doing?
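For my own notes, here is how I would apply and verify the suggestion, assuming eeprom behaves the same way on my release; 0x80000000 is the 2G split discussed above, and it takes effect at the next reboot:

    # Show the current setting (on x86 this lives in the boot environment).
    eeprom kernelbase

    # Give the kernel the top 2GB of the 32-bit address space, leaving
    # userland 0 through 0x7fffffff.
    eeprom kernelbase=0x80000000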
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Andrewk8 at 11:39 on 11/29 said: "Solaris reports virtual memory as the sum of physical memory and page file -- so this is where your strange vmstat output comes from. Running ZFS stress tests on a system with only 768MB of memory is not a good idea since ZFS uses large amounts of memory for its cache."

vmstat is now jumping around between 493,xxx and 527,xxx free while top reports a solid, unchanging 509M free swap (both at 1-second updates). That explanation would only account for a swap figure presented using the virtual-memory definition being larger than a free-swap figure that counts only swap. Here we have had it larger, and now it is smaller! For whatever it is worth, top continues to display a rock-solid 509MB of free swap, never changing. Anybody know?
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
I think Chris has the right idea. This would give more little opportunities for user processes to get a word in edgewise. Since the blocks are *obviously* taking a LONG time, this would not be a big efficiency hit in the bogged-down condition; it would, however, increase overhead in the well-behaved case. I think the real answer is making the compression thread's priority lower and dynamic. I like the suggestion of an 80% cap unless there are no competing processes, in which case 100%. If the compression thread backs up, I would *assume* that there is a queue that would back up and block the process adding to it, regulating the whole pipeline. The whole machine would get 5x slower than normal, but everything would continue to work. --Ray
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Re: "gzip compression works a lot better now the compression is threaded. It's a shame userland gzip isn't!" --- What does "now" mean here? I assume you mean the zfs / kernel gzip (right?) at some point became threaded. Is this in the past, or in a kernel after 2008.11 B2? --Ray
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Re: "Experimentation will show that the compression ratio does not increase much at -9, so it is not worth it when you are short on time or CPU." --- Yes, and a large part of my experiment was to understand the cost (time) vs. compression-ratio curve. lzjb only gave me 7%, which to me is not worth goofing with. I was curious what gzip-9 would do and how it impacted performance. I guess I have one of those data points, but my graph paper is not large enough! --Ray
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
I just want to interject that, if memory serves me correctly, SPARC has been 64-bit for 10~15 years and so has had LOTS of address space to map stuff; x86 brought a new restriction. Regarding the practice of mapping files etc. into virtual memory that does not exist as RAM, now I understand why a 32-bit address space is viewed as restrictive. This is a powerful technique. I would be interested in understanding how it is done, though... it somehow ties a file reference (inode? name?) to an address range. I assume that when the range is accessed (since it does not exist in RAM), a page fault is generated to fulfill the request, which then (for this to make sense) must have a short-circuit map to the disk blocks -- which I assume would go through some disk cache in case they are in memory somewhere, else generate an I/O request to disk... but what if the file was written, and so moved? Where would I read more about what is REALLY going on and how it works? Thanks, --Ray
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
It would be extremely helpful to know which brands/models of disks lie and which don't. This information could be provided diplomatically, simply as threads documenting problems you are working on, stating the facts; use of a specific string of words would make searching for them easy. There should be no liability, since you are simply documenting compatibility with zfs. Or, if the lawyers let you, you could simply publish a compatibility/incompatibility list -- these ARE facts. If there is a way to make a detection tool, that would be very useful too, although after the purchase is made it could be hard to send the drive back; still, that information could be fed into the database as that drive model being incompatible with zfs. As Solaris / zfs gains ground, this could become a strong driver in the industry.

Re: "I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough." So go back three -- we are using zfs because we want absolute reliability (or at least as close as we can get). --Ray
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Re: "'I think Chris has the right idea. This would give more little opportunities for user processes to get a word in edgewise. Since the blocks are *obviously* taking a LONG time, this would not be a big efficiency hit in the bogged-down condition.' I still think you are expecting too much of a P3 system with limited RAM. I chose not to use gzip (default compression) on a max'd out x4540 because it slowed down zfs receive too much." ---

This is not about getting my P3 to do gzip-9 at 100Mbit wire speed; I know that is not going to happen. This is about kernel threads not completely locking out user (and other kernel) processes for undesirable lengths of time. It is about improving Solaris. It is about more appropriate CPU sharing among all of the threads in the system, kernel and user. That is the root cause of the pathological behavior I stumbled on.

To clarify: (1) this started as an experiment to see what compression ratio would result, (2) to see what the performance hit would be, and (3) to stress the system severely and so expose problems such as exposed critical sections of code and race conditions, to give myself confidence in using 2008.11. I did not expect it to perform well. I did not expect to decide to use gzip-9 on this machine. The experiment turned into a concern about the reliability of Solaris and ZFS as a platform, based on the gradual degradation to 100KB/sec and the completely unresponsive console (I understated it; at times it took 10-20 minutes to respond). That is what triggered this thread. This thread is NOT about the throughput of a gzip-9 zfs system; it is about a Solaris ZFS system becoming completely, 99.999% unresponsive, indistinguishable from crashed. No doubt I will put some effort into seeing if I can boost throughput a little, but right now my primary concern is that it WORKS.

This discussion has enabled me to go away with confidence in Solaris and ZFS despite the pathological interaction between the gzip-9 algorithm and ZFS thread scheduling. The copy completed successfully last night. (1) It still functions correctly even with the problems, and I will not lose data; it is NOT a code-correctness problem that could, under the right conditions and random chance, result in data loss even without gzip. (2) I can completely avoid it by not using compression, especially gzip-9 compression. It is also comforting to know that the pathological behavior will be eliminated by an improvement in zfs thread scheduling, which will leave only the intrinsic poor performance of gzip-9.

I do expect (though I gather many will disagree) that I will have a reliable, predictable, serviceable, if low-performance, Solaris/ZFS file server based on an 800MHz P3 with 768MB of memory, without compression. I can deal with slow; I can't deal with crashed or data loss. I don't think that is an unreasonable expectation. The discussion of how to improve the zfs kernel threads' scheduling has value regardless of gzip-9 -- it is a latent problem, a poor design the way it is now, and Jeff has said it will be fixed. The dead-idle system running gnome is a little jerky rather than smooth as silk, I expect due to the same root cause. That will be good to fix, as it gives a pretty bad impression of Solaris when Linux can run silky-smooth and responsive.
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Unless something new develops, from my perspective this thread has served its purpose. I don't want to drag it out and waste people's time. If any late readers have insights that should be captured for the archive, please add them. Thank you all VERY much for the discussion, insights, suggestions, and especially the responsiveness. As you might gather from my several mentions of Linux, I have been using Linux for almost 10 years now and on balance am very happy; but when there is a problem, there are usually very few if any responses, they come over several days or longer, and only very rarely is any light shed. The suggestions regarding running a file server on a low-end machine will be taken to heart as well. I like this machine because it has ECC memory, but in time I expect to break down and use a faster machine without ECC. Too bad Intel does not provide ECC in their desktop-grade chipsets any more. With best regards, Ray
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
This has been an evolution; see defect.opensolaris.com bug 5482 (memory leak with top). I have not run anything but gzip-9 since fixing that by only running top in one-shot mode. I will start the same copy with compression=off in about 45 minutes (got to go run an errand now). Glad to run tests. --Ray
[zfs-discuss] Slow death-spiral with zfs gzip-9 compression
I am [trying to] perform a test prior to moving my data to Solaris and zfs, and things are going very poorly. Please suggest what I might do to understand what is going on, report a meaningful bug, fix it, whatever!

Both to learn what the compression could be and to induce a heavy load that would expose issues, I am running with compression=gzip-9. I have two machines, both identical 800MHz P3s with 768MB of memory; the disk complement and OS differ. My current host is Suse Linux 10.2 (2.6.18 kernel) running two 120GB drives under LVM. My test machine is 2008.11 B2 with two 200GB drives on the motherboard secondary IDE, zfs mirroring them, NFS exported. My test is simply to run cp -rp * /testhome on the Linux machine, where /testhome is the NFS-mounted zfs file system on the Solaris system.

It starts out with reasonable throughput. Although the heavy load makes the Solaris system pretty jerky and unresponsive, it does work. The Linux system is a little jerky and unresponsive too, I assume due to waiting for sluggish network responses. After about 12 hours, the throughput has slowed to a crawl. The Solaris machine takes a minute or more to respond to every character typed and mouse click. The Linux machine is no longer jerky, which makes sense since it has to wait a lot for Solaris. Stuff is flowing, but throughput is in the range of 100K bytes/second. The Linux machine (available for tests), gzip -9ing a few multi-GB files, gets 3MB/sec +/- 5% pretty consistently; being the exact same CPU, RAM (including brand and model), chipset, etc., I would expect similar throughput from ZFS. That is in the right ballpark of what I saw when the copy first started -- in an hour or two it moved about 17GB.

I am also running a vmstat and a top to a log file. Top reports total swap size as 512MB, 510 available. vmstat for the first few hours reported something reasonable (it never seems to agree with top), but now is reporting around 570~580MB, and for a while was reporting well over 600MB of free swap out of the 512MB total! I have gotten past a top memory leak (opensolaris.com bug 5482) and so am now running top for only one iteration, in a shell for loop with a sleep, instead of letting it repeat.

This was to be my test run to see it work. What information can I capture, and how can I capture it, to figure this out? My goal is to gain confidence in this system. The idea is that Solaris and ZFS should be more reliable than Linux and LVM; although I have never lost data due to Linux problems, I have lost it due to disk failure, and zfs should cover that! Thank you in advance for any ideas or suggestions.
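For anyone who wants to reproduce the monitoring, this is roughly the logging loop I am using. The 60-second interval and log file names are arbitrary choices of mine, and the top flags are for the build of top I have installed (batch mode, one display), so adjust for yours:

    #!/bin/sh
    # Log vmstat continuously and a one-shot top every 60 seconds.
    vmstat 60 >> /var/tmp/vmstat.log &

    while true; do
        date >> /var/tmp/top.log
        top -b -d 1 >> /var/tmp/top.log    # -b batch mode, -d 1 one display
        sleep 60
    done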
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Please help me understand what you mean. There is a big difference between being unacceptably slow and not working correctly, or between being unacceptably slow and having an implementation problem that causes it to eventually stop. I expect it to be slow, but I expect it to work. Are you saying that you found that it did not function correctly, or that it was too slow for your purposes? Thanks for your insights! (3x would be awesome).
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Andrewk8, thanks for the information. I have some questions.

[1] You said zfs uses large amounts of memory for its cache. If I understand correctly, it is not that it uses a large fixed amount; it is that it uses all the memory available. If that is an accurate picture, then it should be just as happy with 128MB as with 4GB -- the result would simply be less of a cache/buffer between clients and the physical disk. It also seems like any congestion should show up fairly soon, not gradually over 12 hours! Limiting the ARC cache is certainly something I will try, but it does not make sense to me. Can you help me along?

[2] Regarding zfs vs. nfs, the reference talks about unneeded cache flushes dragging down throughput to NVRAM-buffered disks. The flushes were designed for physical rotating disks. I am using physical rotating disks, so it seems like the changes suggested for NVRAM-buffered arrays would not be appropriate for me, and the default behavior designed for rotating disks would be what I want. What am I missing?

[3] I also get ~4MB/second throughput NFS-to-disk with compression disabled, and 3MB/sec with gzip-9 for the first hour or two. That is nothing to brag about, and I had planned eventually to look into making it faster, but it pales compared to the 100KB/second it has degraded to over 12 hours. Were your comments aimed at helping me get faster NFS throughput, or at addressing the immediate gross problem?

Thanks again for taking the time to help. --Ray
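Since limiting the ARC came up: the approach I have seen suggested for capping it on this vintage of OpenSolaris is an /etc/system tunable, assuming zfs_arc_max is honored on my build. The 256MB figure is an arbitrary choice of mine for a 768MB machine, and it requires a reboot to take effect:

    # /etc/system entry: cap the ZFS ARC at 256MB (0x10000000 bytes).
    set zfs:zfs_arc_max = 0x10000000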
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
tcook, You bring up a good point. Exponentially slow is very different from crashed, though they may have the same net effect. Also, other factors like timeouts would come into play.

Regarding services, I am new to administering modern Solaris, and that is on my learning curve. My immediate need is simply a dumb file server. 3 or 4 MB/sec would be adequate for my needs (marginal and at times annoying, but adequate). If you expect it to be slow, it does work quite nicely without compression. I have to use what I have. In the meantime, perhaps my stress tests will also serve to expose issues.

Regarding the GUI, I don't know how to disable it. There are no virtual consoles, and unlike older versions of SunOS and Solaris, it comes up in XDM and there is no [apparent] way to get a shell without running GNOME. I am sure there is, but again, I come from the BSD/SunOS/Linux line and have not learned the ins and outs of Nevada/Indiana yet. I had hoped to put up a simple installation serving up disks and learn the details later. There are several 60~90MB GNOME apps evidently pre-loaded - even a 45MB clock! Wow.

Interestingly, the size fields under top add up to 950GB without getting to the bottom of the list, yet it shows NO swap being used, and 150MB free out of 768MB of RAM! So how can the size of the existing processes exceed the size of the virtual memory in use by a factor of 2, and the size of total virtual memory by a factor of 1.5? This is not the resident size - this is the total size!

News flash! It has come out of it and is moving along now at 2MB/sec. The GUI is responsive with an occasional stutter. It was going through a directory structure full of .mp3 and .flac files. Perhaps the gzip algorithm gets hung up in the data patterns they create.
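On the GUI question, the recipe I have pieced together from the replies - untested by me, and the service name is my guess for the 2008.11 setup - is to disable the graphical login service over ssh and work from a text console:

  # find the graphical login service, then disable it (persists across reboots)
  svcs -a | grep graphical-login
  pfexec svcadm disable svc:/application/graphical-login/gdm:default
  # to get the desktop back later:
  pfexec svcadm enable gdm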
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Hakimian, So you had a similar experience to what I had on my 800MHz P3 with 768MB, all the way down to totally unresponsive - with probably 5 or 6x the CPU speed (assuming single core) and 5x the memory. This can only be a real design problem or bug, not just expected performance. Is there anyone from Sun who can advise me how to file this, given the diffuse information?
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Servo / mg, I *have* noticed these effects on my system with lzjb, but they are minor. Things are a little grainy, not smooth. Eliminating the algorithm that exposes the shortfall in how the compression is integrated into the system does not change the shortfall (see opensolaris.com bug 5483). My low-end system resulted in my stress test being extra stressful. Perhaps that is a good thing for exposing problems (although frustrating for me)! What I do not understand is why things get better and worse by orders of magnitude, vs. being a relatively steady drain.
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
relling, Thank you, you gave me several things to look at. The one thing that sticks out for me is that I don't see why you listed IDE. Compared to all of the other factors, it is not the bottleneck by a long shot, even if 33MB/sec is a slow transfer rate by today's standards. What don't I know?
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
bfriesen, Ultimately stuff flows in and stuff flows out. Data is not reused, so a cache does not do anything for us. As a buffer, it is simply a rubber band, a FIFO. So if the client wrote something real quick, it would complete quickly. But if it is writing an unlimited amount of data (like 200GB) without reading anything, it all simply flows through the buffer. Whether the buffer is 128MB or 4GB, once the buffer is full the client has to wait until something flows out to the disk. So the system runs at the speed of the slowest component. If accesses are done only once, caches don't help; a buffer helps only to smooth out localized chunkiness.

Regarding the NVRAM discussion, what does this have to do with my situation of rotating magnetic disks with tiny 8MB embedded volatile caches? The behavior of disks or storage subsystems with NVRAM is not pertinent to my situation! Or do I have something backwards?
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
zpool status -v says No known data errors for both the root rpool (separate non-mirrored 80GB drive) and my pool (mirrored 200GB drives). It is getting very marginal (sluggish/unresponsive) again. Interestingly, top shows 20~30% CPU idle, with most of the remainder in the kernel. I wonder if everything is counted? Linux top definitely does not show everything... I suspected at one point that it did not count time spent servicing interrupts. Does Solaris? (Off topic, I guess). Free memory is running 150~175MB.
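To partly answer my own question, my understanding is that the standard Solaris tools below break out system and interrupt time separately, so I plan to log them alongside top; treat the exact output columns as my assumption until I have seen them on this build:

  # per-CPU usr/sys/idle plus cross-calls and interrupt counts, every 5 seconds
  mpstat 5
  # time spent in device interrupt handlers, per device, every 5 seconds
  intrstat 5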
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
bfriesen, Andrew brought up NVRAM by referring me to the following link: "Also, NFS to ZFS filesystems will run slowly under certain conditions - including with the default configuration. See this link for more information: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes"

That section discusses exclusively how ZFS cache flushes, which can be triggered by NFS requests or policies, interact unproductively with NVRAM, and how the flushes can be controlled to improve performance. Since the NVRAM is non-volatile, the flushes are not necessary to preserve data integrity anyway. It is not worth tracing the chain to see how you and I got tangled up in this; one of us made an inappropriate association or didn't follow a sub-thread. Sorry for the confusion.

Regarding the cache, right now there is 150MB of free memory not being used by ANYBODY, so I don't think there is a shortage of memory for the ZFS cache... and 150MB is far more than 128K, or even a whole slew of 128K blocks. Also, the yellow light that blinks when the disk is accessed is off 90% of the time, minimum. When the system was almost frozen, the disk almost never blinked (one real quick blink every minute or two!). Nothing is accessing the disk to re-obtain anything! Otherwise, yes, you would have a good point about re-fetching various file structure stuff. (Good thought.)

Fragmentation of kernel memory would be a good one. Wouldn't it get fragmented after 6 months or so of everyday use anyway? It must de-frag itself somehow. You bring up an excellent observation: when it was super-slow, free RAM was down to 15MB, although that still seems large compared to 32K or 128K blocks. Remember, the system is not doing ANYTHING else.
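As a cross-check on the free-memory and cache theories, I gather the kernel itself can report where RAM is going; the two commands below are what I intend to capture next time it bogs down (my assumption being that the arcstats kstats and the ::memstat dcmd exist on this build):

  # current ARC size and target size, in bytes
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c
  # kernel vs. anon vs. page cache vs. free page breakdown (run as root)
  echo "::memstat" | mdb -k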
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
tcook, zpool iostat shows 1.15MB/sec. Is this averaged since boot, or a recent running average? The two drives ARE on a single IDE cable, but again, with a 33MB/sec cable rate and 8 or 16MB of cache in each disk, 3 or 4 MB/sec should be able to time-share the cable without a significant impact on throughput.
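Answering part of my own question: as I understand it, a bare zpool iostat is averaged over the life of the pool (since import), while giving it an interval makes every line after the first a fresh sample, which is what I actually want to watch. The pool name below is a stand-in for mine:

  # first line = average since import; subsequent lines = activity in each 5-second interval
  zpool iostat -v mypool 5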
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Pantzer5: Thanks for the top size explanation. Re: eeprom kernelbase=0x80000000 - so this makes the kernel load at the 2GB mark? What is the default, something like 0xC00... for 3GB? Are PCI and AGP space in there too, such that kernel space is 4GB - (kernelbase + PCI_Size + AGP_Size)? (Shot in the dark?)
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
I get 15MB/sec to/from (I don't remember which) Linux LVM to a USB disk. It does seem to saturate there; I assume due to interrupt service time between transfers. I appreciate the contention for the IDE cable, but in a 3MB/sec system I don't think it is my bottleneck, much less in a 100KByte/second system. Do you disagree?

I *have* a PCI add-on controller card, which is unplugged to keep the system dead-simple until I figure out why it does not function! As a side note, most such controllers report themselves as a RAID card or some such, and Solaris will refuse to talk to them! The only one I could find that would work was an IT8212 with an out-of-production flash chip that ITE supported an alternate BIOS for. I went through 3 or 4 different ones before finding it! You seem to say it is easy to buy a PCI add-in card and have it work under Solaris - what card are you thinking of, and where did you find it?
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Thanks for the info about Free Memory. That also links to another sub-thread regarding kernel memory space. If disk files are mapped into memory, that would be a reason the kernel could make use of address space larger than virtual memory (RAM + swap). Regarding showing stuff as Free when it is still tracked and may be reused, I would assume it would be abandoned if the memory were needed. Wouldn't the fact that it was sitting Free indicate that nothing needed memory? I also understand working set as a page replacement algorithm, but that would make the disk light blink! These are all good things; I just don't see how they apply to the current situation, at least given the apparent information from vmstat and top!
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
If I shut down the Linux box, I won't have a host to send stuff to the Solaris box! Also, the Solaris box can only support 1024MB. I did have 1024MB in it at one time and got essentially the same performance. I might note that I had the same problem with 1024MB, albeit with top eating memory (opensolaris.com bug 5482) (up to 417MB at the highest observation) - no wonder it crashed. Anyway, 1024MB is not far, far better; it turns out there was no noticeable difference when I dropped to 768MB. Also note that Hakimian had identical symptoms with a dual-core 64-bit AMD and 4GB of RAM.
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Tim, I don't think we would really disagree if we were in the same room. I think in the process of the threaded communication a few things got overlooked, or attributed to the wrong thing. You are right that there are many differences. Some of them are:
- The tests were done a year ago; I expect the kernel has had many changes.
- He was moving data via ssh from zfs send into zfs receive, as opposed to my file operations over NFS.
- My problem seems to occur on incompressible data; his was all very compressible.
- He had roughly 5x the CPU speed (x2 cores) and 5x the memory.

Yes, I jumped on what I saw as common symptoms, in Hakimian's words: "becoming increasingly unresponsive until it was indistinguishable from a complete lockup." This is similar to my description: "After about 12 hours, the throughput has slowed to a crawl. The Solaris machine takes a minute or more to respond to every character typed..." with disk throughput in the range of 100K bytes/second. I was the one who judged these symptoms to be essentially identical; I did not say that Hakimian made that statement. I also pointed out that he was seeing these identical symptoms in a very different environment, which would be your point.

Regarding my 768 vs. 1024, there were no changes other than the change in memory. So whatever else is true, the system had at least 33% more memory to work with. Given that probably a few hundred MB is needed for a just-booted, idle system, the effective percentage increase in memory for ZFS to work with is in reality higher. I may not have given it 4GB, but I gave it substantially more than it had. It should behave substantially differently if memory is the limiting factor. Just because memory is thin does not make it the limiting factor. I believe the indications by top and vmstat that there is free memory (available to be reallocated) that nothing is gobbling up also suggest that memory is not the limiting factor.

Regarding my design decisions, I did not make bad design decisions. I have what I have. I know it is substandard. Also, you seem to be reacting as though I was complaining about the 3MB/sec throughput. I believe I stated that I understand there are many sub-optimal aspects of this system. However, I don't believe any of them explain it running fine for a few hours, then slowing down by a factor of 30 for a few hours, then going back up. I am trying to understand and resolve the dysfunctional behavior, not the poor but plausible throughput. In any system there are many possible bottlenecks, most of which are probably suboptimal, but it is not productive to focus on the 15MB/sec links in the chain when you have a 100KB/sec problem. Increasing 15MB/sec to 66 or 132MB/sec is just not going to have a large effect!

I think/hope I have reconciled our apparent differences. If not, so be it. I do appreciate your suggestions and insights, and they are not lost on me. --Ray
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Now that it has come out of its slump, I can watch what it is working on vs. the response. Whenever it is going through a folder with a lot of incompressible stuff, it gets worse. .mp3 and .flac are horrible; .iso images and .gz and .zip files are bad. It is sinking again, but still works. It depends on the data.

In hindsight, and with the help of this thread, I think I understand. Yes, it is a hypothesis, not fact. Bug 5483, and the reference in there to bug 6586537, explain how the ZFS compression task blocks out userland tasks (and probably all other kernel tasks) by running at the highest kernel priority. That part I take to be fact. The hypothesis part is that certain data characteristics (probably higher entropy) result in very tedious, laborious behavior by the gzip algorithm, or at least by its implementation in ZFS. So NOTHING else runs unless the gzip algorithm has nothing to do, and it takes FOREVER to do its thing on certain types of data. (A crude way to check this is sketched below.)

All of the free memory discussions will help me understand the system and how to get more information, but I don't see any evidence suggesting that lack of RAM was the reason for throughput to drop to 100KB/sec. No doubt if I address all of these things I can get throughput up from the 3~4MB/sec I was seeing with compression disabled.

My plan right now is to let it finish (it has somewhere around 50GB to go), just to see it do so without crashing. I may then do a diff -r to see if the decompression has the same behavior (glutton for punishment). Then I will forget compression and do the exercise without it. Not sure how I will finally be comfortable committing all my bits!

This understanding gives me hope that the system will be robust, and that my heavy load is not exposing a critical section of code. Rather, it is a problem that causes dysfunctional though still correct behavior. And I know how to avoid it. If you have more comments, or especially if you think I reached the wrong conclusion, please do post them. I will post my continuing results. Thank you ALL for giving me so much attention and help. It is good to not be alone!
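The crude check I have in mind (file names are just examples from my own tree) is to time gzip -9 in userland on already-compressed data vs. ordinary text on the same CPU; if the incompressible files take wildly longer per MB, that would support the hypothesis:

  # already-compressed audio: the hypothesis says this should take disproportionately long
  time gzip -9 -c music/track01.flac > /dev/null
  # plain text for comparison: the hypothesis says this should be much quicker per MB
  time gzip -9 -c /var/adm/messages > /dev/null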
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Tim, I am trying to look at the whole picture. I don't see any unwarranted assumptions, although I know so little about Solaris that I extrapolated all over the place based on general knowledge, sort of draping it around and over what you all said. I see quite a few misconceptions in the thread you pointed me to, based on a lack of understanding of modern systems - both clear ones and questionable ones. I suppose I probably have my share of them in here. Please refute my defenses as appropriate. --Ray
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Jeff, Thank you for weighing in, as well as for the additional insight. It is good to have confidence that I am on the right track. I like your system ... a lot. There is work to do for it to be as slick as a recent Linux distribution, but you are working from a solid core and just need some touch-up work. Thanks. Hang in there. --Ray
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
Ref relling's 12:00 post: My system does not have arcstat or nicstat. But it is the B2 distribution. Would I expect these to be in the final distribution, or where do these come from? Thanks. --Ray