Tim,

I don't think we would really disagree if we were in the same room.  I think 
that in the course of this threaded conversation a few things got overlooked, 
or attributed to the wrong person.

You are right that there are many differences.  Some of them are:

- His tests were done a year ago; I expect the kernel has had many changes 
since.
- He was moving data via ssh from zfs send into zfs receive, as opposed to my 
file operations over NFS.
- My problem seems to occur on incompressible data.  His was all very 
compressible.
- He had 5x the CPU x2 and 5x the memory.

Yes, I jumped on what I saw as common symptoms, in Hakimian's words: "becoming 
increasing unresponsive until it was indistinguishable from a complete lockup". 
This is similar to my description of "After about 12 hours, the throughput has 
slowed to a crawl.  The Solaris machine takes a minute or more to respond to 
every character typed..." and "disk throughput is in the range of 100K 
bytes/second".

I was the one who judged these symptoms to be essentially identical; I did not 
say that Hakimian made that statement.  I also pointed out that he was seeing 
these "identical" symptoms in a very different environment, which would be your 
point.

Regarding my 768MB vs. 1024MB, there were no changes other than the memory.  
So whatever else is true, the system had at minimum 33% more memory to work 
with.  Given that a few hundred megabytes is probably needed for a just-booted, 
idle system, the effective percentage increase in memory for zfs to work with 
is in reality higher.  I may not have given it 4GB, but I gave it substantially 
more than it had.  It should behave substantially differently if memory is the 
limiting factor.  Just because memory is thin does not make it the limiting 
factor.  I also believe the indication from top and vmstat that there is free 
memory (available to be reallocated) which nothing is gobbling up suggests that 
memory is not the limiting factor.
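
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. 
Only the 768MB and 1024MB totals are real figures; the 256MB baseline for a 
just-booted, idle system is purely my guess at the overhead:

    # Only the 768MB and 1024MB totals are actual figures;
    # the 256MB idle baseline is an assumed boot-time overhead.
    total_before = 768     # MB before the memory upgrade
    total_after = 1024     # MB after the memory upgrade
    idle_baseline = 256    # MB assumed consumed by the idle system

    raw = (total_after - total_before) / total_before
    effective = (total_after - idle_baseline) / (total_before - idle_baseline) - 1

    print(f"raw increase:       {raw:.0%}")        # 33%
    print(f"effective increase: {effective:.0%}")  # 50%

So under that assumption, zfs got roughly half again the headroom it had 
before, not just a third.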

Regarding my design decisions, I did not make bad decisions.  I have what I 
have, and I know it is substandard.

Also, you seem to be reacting as though I were complaining about the 3MB/Sec 
throughput.  I believe I stated that I understand there are many sub-optimal 
aspects of this system.  However, I don't believe any of them explain its 
running fine for a few hours, then slowing down by a factor of 30 for a few 
hours, then going back up.  I am trying to understand and resolve the 
dysfunctional behavior, not the poor but plausible throughput.  In any system 
there are many possible bottlenecks, most of which are probably suboptimal, but 
it is not productive to focus on the 15MB/Sec links in the chain when you have 
a 100KB/Sec problem.  Increasing the 15MB/Sec link to 66 or 132MB/Sec is just 
not going to have a large effect!
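
A toy calculation makes the point (the only real numbers here are the link 
speeds already mentioned in this thread):

    # End-to-end throughput of a serial chain is set by its slowest link.
    # Speeds in MB/sec; 0.1 MB/sec is the ~100KB/sec problem state.
    def chain_throughput(link_speeds):
        return min(link_speeds)

    print(chain_throughput([15, 0.1]))   # 0.1 -- the slow link dominates
    print(chain_throughput([132, 0.1]))  # 0.1 -- speeding up the fast link changes nothing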

I think/hope I have reconciled our apparent differences.  If not, so be it.  I 
do appreciate your suggestions and insights, and they are not lost on me.

--Ray