Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread Markus Kovero
Hi, it seems you might have some kind of hardware issue there; I have no way
of reproducing this.

Yours
Markus Kovero

-Original Message-
From: zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of bank kus
Sent: 10. tammikuuta 2010 7:21
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] I/O Read starvation

Btw, FWIW, if I redo the dd + 2 cp experiment on /tmp the result is far more
disastrous. The GUI stops moving and Caps Lock stops responding for large
intervals; no clue why.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] possible to remove a mirror pair from a zpool?

2010-01-10 Thread Dennis Clarke

Suppose the requirements for storage shrink (it can happen); is it
possible to remove a mirror set from a zpool?

Given this :

# zpool status array03
  pool: array03
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h41m with 0 errors on Sat Jan  9
22:54:11 2010
config:

NAME STATE READ WRITE CKSUM
array03  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t16d0  ONLINE   0 0 0
c5t0d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c5t1d0   ONLINE   0 0 0
c2t17d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c5t2d0   ONLINE   0 0 0
c2t18d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t20d0  ONLINE   0 0 0
c5t4d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t21d0  ONLINE   0 0 0
c5t6d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t19d0  ONLINE   0 0 0
c5t5d0   ONLINE   0 0 0
spares
  c2t22d0AVAIL

errors: No known data errors

Suppose I want to power down the disks c2t19d0 and c5t5d0 because they are
not needed. One can easily picture a thumper with many disks unused and
see reasons why one would want to power off disks.

-- 
Dennis

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] possible to remove a mirror pair from a zpool?

2010-01-10 Thread James Dickens
No, sorry Dennis, this functionality doesn't exist yet. It is being worked on,
but it will take a while; there are lots of corner cases to handle.

James Dickens
uadmin.blogspot.com


On Sun, Jan 10, 2010 at 3:23 AM, Dennis Clarke dcla...@blastwave.org wrote:


 Suppose the requirements for storage shrink (it can happen); is it
 possible to remove a mirror set from a zpool?

 Given this :

 # zpool status array03
  pool: array03
  state: ONLINE
 status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
 action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
  scrub: resilver completed after 0h41m with 0 errors on Sat Jan  9
 22:54:11 2010
 config:

NAME STATE READ WRITE CKSUM
array03  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t16d0  ONLINE   0 0 0
c5t0d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c5t1d0   ONLINE   0 0 0
c2t17d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c5t2d0   ONLINE   0 0 0
c2t18d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t20d0  ONLINE   0 0 0
c5t4d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t21d0  ONLINE   0 0 0
c5t6d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t19d0  ONLINE   0 0 0
c5t5d0   ONLINE   0 0 0
spares
  c2t22d0AVAIL

 errors: No known data errors

 Suppose I want to power down the disks c2t19d0 and c5t5d0 because they are
 not needed. One can easily picture a thumper with many disks unused and
 see reasons why one would want to power off disks.

 --
 Dennis

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread Phil Harman
What version of Solaris / OpenSolaris are you using? Older versions use 
mmap(2) for reads in cp(1). Sadly, mmap(2) does not jive well with ZFS.


To be sure, you could check how your cp(1) is implemented using truss(1) 
(i.e. does it do mmap/write or read/write?)
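For example, something along these lines (just a sketch; the exact syscall
names in the trace will vary between the Solaris and GNU cp):

   truss -topen,open64,mmap,mmap64,read,write cp largefile.txt ./test/1.txt 2>&1 | head -20

A run of mmap64/write pairs means the mmap flavour; read/write pairs mean the
plain copy loop.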


<aside>
I find it interesting that ZFS's mmap(2) deficiencies are now dictating 
implementation of utilities which may benefit from mmap(2) on other 
filesystems. And whilst some might argue that mmap(2) is dead for file 
I/O, I think it's interesting to note that Linux appears to have a 
relatively efficient mmap(2) implementation. Sadly, this means that some 
commercial apps which are mmap(2) heavy currently perform much better 
on Linux than on Solaris, especially with ZFS. However, I doubt that Linux uses 
mmap(2) for reads in cp(1).

</aside>

You could also try using dd(1) instead of cp(1).

However, it seems to me that you are using bs=1G count=8 as a lazy way 
to generate 8GB (because you don't want to do the math on smaller 
blocksizes?)


Did you know that you are asking dd(1) to do 1GB read(2) and write(2) 
system calls using a 1GB buffer? This will cause further pressure on 
the memory system.


In performance terms, you'll probably find that block sizes beyond 128K 
add little benefit. So I'd suggest something like:


dd if=/dev/urandom of=largefile.txt bs=128k count=65536

dd if=largefile.txt of=./test/1.txt bs=128k 
dd if=largefile.txt of=./test/2.txt bs=128k 

Phil

http://harmanholistix.com



bank kus wrote:

dd if=/dev/urandom of=largefile.txt bs=1G count=8

cp largefile.txt ./test/1.txt 
cp largefile.txt ./test/2.txt 

That's it; now the system is totally unusable after launching the two 8G copies. Until these copies finish, no other application is able to launch completely. Checking prstat shows them to be in the sleep state. 


Question:
 I'm guessing this is because ZFS doesn't use CFQ, and that one process is allowed 
to queue up all its I/O reads ahead of other processes?

 Is there a concept of priority among I/O reads? I only ask because if root 
were to launch some GUI application, it doesn't start up until both copies are done. So 
there is no concept of priority? Needless to say this does not exist on Linux 2.60...
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] possible to remove a mirror pair from a zpool?

2010-01-10 Thread Dennis Clarke

 No, sorry Dennis, this functionality doesn't exist yet, but
 is being worked,
 but will take a while, lots of corner cases to handle.

 James Dickens
 uadmin.blogspot.com

1 ) dammit

2 ) looks like I need to do a full offline backup and then restore
to shrink a zpool.

As usual, thanks for always being there, James.

Dennis

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs, raidz, spare and jbod

2010-01-10 Thread Arnaud Brand
We had a similar problem on an Areca 1680. It was caused by a drive that 
didn't properly reset (took ~2 seconds each time according to the drive 
tray's LED).
Replacing the drive solved this problem, but then we hit another problem 
which you can see in this thread : 
http://opensolaris.org/jive/thread.jspa?threadID=121335tstart=0


I'm curious whether you have a similar setup and encounter the same problems.
How did you set up your pools?

Please tell me if you have any luck setting the drives to pass-through.

Thanks,
Arnaud

On 09/01/10 14:26, Rob wrote:

I have installed OpenSolaris build 129 on our server. It has 12 disks on an Areca 
1130 controller, using the latest firmware.

I have put all the disks in JBOD and am running them in raidz2. After a while the 
system hangs with: arcmsr0: tran reset (level 0x1) called for target4 lun 0
target reset not supported

What can I do? I really want it to work!

(I am going to set all the disks to pass-through on Monday)
   



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread bank kus
Hi Phil,
You make some interesting points here:

- yes, bs=1G was a lazy thing

- the GNU cp I'm using does __not__ appear to use mmap;
open64 open64 read write close close is the relevant sequence

- replacing cp with dd 128K * 64K does not help; no new apps can be launched 
until the copies complete.

Regards
banks
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread Henrik Johansson
Hello again,

On Jan 10, 2010, at 5:39 AM, bank kus wrote:

 Hi Henrik
 I have 16GB RAM on my system; on a lesser-RAM system dd does cause problems, as 
 I mentioned above. My __guess__ is that dd is probably sitting in some in-memory 
 cache, since du -sh doesn't show the full file size until I do a sync.
 
 At this point I'm less looking for QA-type repro questions and/or 
 speculations, and more looking for ZFS design expectations. 
 
 What is the expected behaviour: if one thread queues 100 reads and another 
 thread comes later with 50 reads, are these 50 reads __guaranteed__ to fall 
 behind the first 100, or is timeslicing/fair-share done between the two streams? 
 
 Btw this problem is pretty serious: with 3 users on the system, one of them 
 initiating a large copy grinds the other 2 to a halt. Linux doesn't have this 
 problem, and this is almost a switch-O/S moment for us unfortunately :-(

Have you reproduced the problem without using /dev/urandom? I can only get this 
behavior when using dd from urandom, not when using files with cp, and not even 
files with dd. This could then be related to the random driver spending kernel 
time in high-priority threads.

So while I agree that this is not optimal, there is a huge difference in how 
bad it is: if it's urandom-generated, there is no problem with copying files. 
Since you also found that it's not related to ZFS (it also happens on tmpfs, and 
perhaps only with urandom?) we are on the wrong list. Please isolate the problem: 
if we can put aside any filesystem, then we are on the wrong list; I've added 
perf-discuss as well.

Regards

Henrik
http://sparcv9.blogspot.com



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Repeating scrub does random fixes

2010-01-10 Thread Gary Gendel
I've been using a 5-disk raidz for years on an SXCE machine which I converted to 
OSOL.  The only time I ever had ZFS problems in SXCE was with snv_120, which 
was fixed.

So, now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on 
random disks.  If I repeat the scrub, it will fix errors on other disks.  
Occasionally it runs cleanly.  That it doesn't happen in a consistent manner 
makes me believe it's not hardware related.

fmdump only reports three types of errors:

ereport.fs.zfs.checksum
ereport.io.scsi.cmd.disk.tran
ereport.io.scsi.cmd.disk.recovered

The middle one seems to be the issue; I'd like to track down its source.  Any 
docs on how to do this?
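(So far the best I have found is to dump the verbose error log, e.g.

   fmdump -eV | less

and read the detail records for the ereport.io.scsi.cmd.disk.tran events, which
appear to include the device path of the disk involved - but I am not sure that
is the intended way to chase these down.)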

Thanks,
Gary
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread Bob Friesenhahn

On Sun, 10 Jan 2010, Phil Harman wrote:
In performance terms, you'll probably find that block sizes beyond 128K add 
little benefit. So I'd suggest something like:


dd if=/dev/urandom of=largefile.txt bs=128k count=65536

dd if=largefile.txt of=./test/1.txt bs=128k 
dd if=largefile.txt of=./test/2.txt bs=128k 


As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), 
dd (Solaris or GNU) does not produce the expected file size when using 
/dev/urandom as input:


% /bin/dd if=/dev/urandom of=largefile.txt bs=131072 count=65536
0+65536 records in
0+65536 records out
% ls -lh largefile.txt
-rw-r--r--   1 bfriesen home 65M Jan 10 09:32 largefile.txt
% /opt/sfw/bin/dd if=/dev/urandom of=largefile.txt bs=131072 
count=65536

0+65536 records in
0+65536 records out
68157440 bytes (68 MB) copied, 1.9741 seconds, 34.5 MB/s
% ls -lh largefile.txt
-rw-r--r--   1 bfriesen home 65M Jan 10 09:33 largefile.txt
% df -h .
FilesystemSize  Used Avail Use% Mounted on
Sun_2540/zfstest/defaults
  1.2T   66M  1.2T   1% /Sun_2540/zfstest/defaults

However:
% dd if=/dev/urandom of=largefile.txt bs=1024 count=8388608
8388608+0 records in
8388608+0 records out
8589934592 bytes (8.6 GB) copied, 255.06 seconds, 33.7 MB/s
% ls -lh largefile.txt
-rw-r--r--   1 bfriesen home8.0G Jan 10 09:40 largefile.txt

% dd if=/dev/urandom of=largefile.txt bs=8192 count=1048576
0+1048576 records in
0+1048576 records out
1090519040 bytes (1.1 GB) copied, 31.8846 seconds, 34.2 MB/s

It seems that on my system dd + /dev/urandom is willing to read 1k 
blocks from /dev/urandom but with even 8K blocks, the actual blocksize 
is getting truncated down (without warning), producing much less data 
than requested.


Testing with /dev/zero produces different results:
% dd if=/dev/zero of=largefile.txt bs=8192 count=1048576
1048576+0 records in
1048576+0 records out
8589934592 bytes (8.6 GB) copied, 20.7434 seconds, 414 MB/s

WTF?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread bank kus
Place a sync call after dd?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repeating scrub does random fixes

2010-01-10 Thread Mattias Pantzare
On Sun, Jan 10, 2010 at 16:40, Gary Gendel g...@genashor.com wrote:
 I've been using a 5-disk raidZ for years on SXCE machine which I converted to 
 OSOL.  The only time I ever had zfs problems in SXCE was with snv_120, which 
 was fixed.

 So, now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on 
 random disks.  If I repeat the scrub, it will fix errors on other disks.  
 Occasionally it runs cleanly.  That it doesn't happen in a consistent manner 
 makes me believe it's not hardware related.


That is actually a good indication of hardware-related errors. Software will
do the same thing every time, but hardware errors are often random.

But you are running an older version now; I would recommend an upgrade.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-10 Thread Lutz Schumann
A very interesting thread 
(http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/)
 and some thinking about the design of SSDs led to an experiment I did with 
the Intel X25-M SSD. The question was: 

Is my data safe once it has reached the disk and has been committed to my 
application?

All transactional safety in ZFS requires the correct implementation of the 
synchronize-cache command (see 
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg27264.html, where 
someone used OpenSolaris within VirtualBox, which per default ignores 
the cache flush command). Thus qualified hardware is VERY essential (also see 
http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf).

What I did (for an Intel X25-M G2 (default settings = write cache on) and a 
Seagate SATA drive (ST3500418AS)): 

a) Create a pool
b) Create a program that opens a file 
   synchronously and writes to the file. 
   It also prints the latest record written 
   successfully.
c) Pull the power of the SATA disk 
d) Power-cycle everything 
e) Open the pool again and verify that the content 
   of the file is the one that has been committed to 
   the application 
e1) If it is the same - nice hardware 
e2) If it is NOT the same - BAD hardware 

What I found out was: 

Intel X25-M G2: 
  - If I pull the power cable, much data is lost, although committed to the app 
(some hundreds of records)
  - If I pull the SATA cable, no data is lost

ST3500418AS: 
  - If I pull the power cable, almost no data is lost, but still the last write 
is lost (strange!)
  - If I pull the SATA cable, no data is lost

Actually this result was partially expected. However, the one missing 
transaction on my SATA HDD (Seagate) is strange. 

Unfortunately I do not have enterprise SAS hardware handy to verify that my 
test procedure is correct.

Maybe someone can run this test on a SAS test machine? (See script attached.)


--- Attachments ---

--- script (call it with script.pl --file /mypool/testfile) ---

#!/usr/bin/env perl

# for O_SYNC
use Fcntl qw(:DEFAULT :flock SEEK_CUR SEEK_SET SEEK_END);
use IO::File;
use Getopt::Long;

my $pool      = "disk";
my $mountroot = "/volumes";
my $file      = "$mountroot/$pool/testfile";
my $abort     = 0;
my $count     = 0;

GetOptions(
    "pool=s"          => \$pool,
    "testfile|file=s" => \$file,
    "count=i"         => \$count,
);

my $dir = $file;
$dir =~ s/[^\/]+$//g;

if (-e $file) {
    print "ERROR: File $file already exists\n";
    exit 1;
}

if (! -d $dir) {
    print "ERROR: Directory $dir does not exist\n";
    exit 1;
}

# open the test file synchronously (O_SYNC), refusing to clobber an existing one
sysopen(FILE, $file, O_RDWR | O_CREAT | O_EXCL | O_SYNC)
    or die "ERROR Opening file $file: $!\n";

$SIG{INT} = sub { print " ... signalling Abort ... (file: $file)\n"; $abort = 1; };

$| = 1;

my $lastok = undef;
my $i      = 0;
my $msg    = sprintf("This is round number %20s", $i);

# O_SYNC, O_CREAT: rewrite the record at offset 0 over and over,
# reporting the last write that was acknowledged
while (!$abort) {
    $i++;

    if ($count && $i > $count) { last; }

    $msg = sprintf("This is round number %20s", $i);
    sysseek(FILE, 0, SEEK_SET);
    print $msg;
    my $rc = syswrite(FILE, $msg);
    if (!defined($rc)) {
        print " ERROR\n";
        print "ERROR While writing $msg\n";
        print "ERROR: $!\n";
        last;
    } else {
        print " DONE \n";
        $lastok = $msg;
    }
}

close(FILE);

print "\nTHE LAST MESSAGE WRITTEN to file $file was:\n\n\t\"$lastok\"\n\n";

 Here are the logs of my tests 

1) Test the SATA SSD (Intel X25-M) 
--
.. start write.pl

This is round number67482
This is round number67483
This is round number67484
This is round number67485
This is round number67486
This is round number67487
This is round number67488
This is round number67489
This is round number67490

( .. I pull the POWER CABLE of the SATA SSD .. )

.. I/O hangs 

.. zpool status shows 

zpool status -v
  pool: ssd
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-JQ
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
ssd UNAVAIL  011 0  insufficient replicas
  c3t5d0UNAVAIL  3 2 0  cannot open

errors: Permanent errors have been detected in the following files:

ssd:0x0
/volumes/ssd/
/volumes/ssd/testfile


... now I power cycled the machine and put back the power cable 

... lets see the pool status 

  pool: ssd
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
ssd ONLINE   0 0 0
  c3t5d0ONLINE   0 0 0

errors: No known data errors

 lets 

Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread Henrik Johansson
Hello Bob,

On Jan 10, 2010, at 4:54 PM, Bob Friesenhahn wrote:

 On Sun, 10 Jan 2010, Phil Harman wrote:
 In performance terms, you'll probably find that block sizes beyond 128K add 
 little benefit. So I'd suggest something like:
 
 dd if=/dev/urandom of=largefile.txt bs=128k count=65536
 
 dd if=largefile.txt of=./test/1.txt bs=128k 
 dd if=largefile.txt of=./test/2.txt bs=128k 
 
 As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd 
 (Solaris or GNU) does not produce the expected file size when using 
 /dev/urandom as input:

Do you feel this is related to the filesystem? Is there any difference between 
putting the data in a file on ZFS and just throwing it away? 

$(dd if=/dev/urandom of=/dev/null bs=1048576k count=16) gives me a quite 
unresponsive system too.

Henrik
http://sparcv9.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-10 Thread Lutz Schumann
I managed to disable the write cache (I did not know of a tool on Solaris; however, 
hdadm from the EON NAS binary_kit does the job): 
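(As an untested aside: format(1M) in expert mode seems to offer the same knob,
roughly

   # format -e
   format> cache
   cache> write_cache
   write_cache> disable

but I have not tried it behind this controller, so treat that as a hint only.)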

Same power disruption test with the Seagate HDD and write cache disabled ...
---

r...@nexenta:/volumes# .sc/bin/hdadm write_cache display c3t5

 c3t5 write_cache disabled

... pull power cable of Seagate SATA Disk 
 
This is round number 4543 DONE
This is round number 4544 DONE
This is round number 4545 DONE
This is round number 4546 DONE
This is round number 4547 DONE
This is round number 4548 DONE
This is round number 4549 DONE
This is round number 4550 ... hangs here

... power cycle everything

node1:/mnt/disk# cat testfile
This is round number 4549

... this looks good. 

So disabling the write cache helps, but it really limits performance (not for 
synchronous but for async writes). 

Test with Intel X25-M
--

... Same with SSD 
r...@nexenta:/volumes# hdadm write_cache off c3t5

 c3t5 write_cache disabled

r...@nexenta:/volumes# hdadm write_cache display c3t5

 c3t5 write_cache disabled

.. pull SSD power cable 

This is round number 9249 DONE
This is round number 9250 DONE
This is round number 9251 DONE
This is round number 9252 DONE
This is round number 9253 DONE
This is round number 9254 DONE
This is round number 9255 DONE
This is round number 9256 DONE
This is round number 9257 ... hangs here

.. power cycle everything 
... test 

node1:/mnt/ssd# cat testfile
This is round number 9256

So without a write cache the device works correctly.

However, be warned: on boot the cache is enabled again: 

DeviceSerialVendor   Model Rev  Temperature
------   -  ---
c3t5d0p0  7200Y5160AGN  ATA  INTEL SSDSA2M160  02HD 255 C (491 F)

r...@nexenta:/volumes# hdadm write_cache display c3t5

 c3t5 write_cache enabled

Question: Does anyone know the impact of disabling the write cache on the 
write amplification factor of the Intel SSDs? 

I would deploy the Intel X25-M only for mostly-read workloads anyway, so the 
performance impact of disabling the write cache can be ignored. However, if the 
life expectancy of the device goes down without a write cache (I mean, it is 
MLC already!) - bummer.  

And another question: how can I permanently disable the write cache on the 
Intel X25-M SSDs? 
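(The only workaround I can think of so far is to reapply the setting at every
boot, e.g. a crude rc script along these lines - assuming hdadm from the binary
kit is reachable in root's PATH and the device keeps its c3t5 name:

   #!/sbin/sh
   # S99ssd-wcache: turn the SSD write cache back off after every boot
   hdadm write_cache off c3t5

A setting that persists on the drive itself would obviously be nicer.)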

Regards
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread Bob Friesenhahn

On Sun, 10 Jan 2010, Henrik Johansson wrote:

  As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd 
(Solaris or GNU) does
  not produce the expected file size when using /dev/urandom as input:

Do you feel this is related to the filesystem, is there any difference between 
putting the data in a file on
ZFS or just throwing it away? 


My guess is that this is due to the implementation of /dev/urandom.  It 
seems to be blocked up at 1024 bytes and 'dd' is just using that block 
size.  It is interesting that OpenSolaris is different; this seems 
like a bug in Solaris 10, and a new one to me.


The /dev/random and /dev/urandom devices are rather special since 
reading from them consumes a precious resource -- entropy.  Entropy is 
created based on other activities of the system, which are expected to 
be random.  Using up all the available entropy could dramatically 
slow-down software which uses /dev/random, such as ssh or ssl.  The 
/dev/random device will completely block when the system runs out of 
entropy.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs, raidz, spare and jbod

2010-01-10 Thread Rob
Hello Arnaud,

Thanks for your reply.

We have a system (2 x Xeon 5410, Intel S5000PSL mobo and 8 GB memory) with 
12 x 500 GB SATA disks on an Areca 1130 controller. rpool is a mirror over 2 
disks, 8 disks are in raidz2, and 1 is a spare. We have 2 aggregated links.

Our goal is an ESX storage system; I am using iSCSI and NFS to serve space to 
our ESX 4.0 servers.

We can remove a disk with no problem. I can do a replace and the disk is 
resilvered. That works fine here.

Our problem comes when we make the server work a little bit harder! When we give 
the server a hard time - copy 60G+ of data or do some other stuff to give the 
system some load - it hangs. This happens after 5 minutes, or after 30 minutes or 
later, but it hangs. Then we get the problems shown in the attached pictures.

I have also emailed Areca. I hope they can fix it.

Regards,

Rob
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-10 Thread Lutz Schumann
Actually the performance decrease when disabling the write cache on the SSD is 
approx. 3x (i.e. about 66%). 

Setup: 
  node1 = Linux Client with open-iscsi 
  server = comstar (cache=write through) + zvol (recordsize=8k, compression=off)
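(For reference, an 8k zvol like that would be created along these lines - pool
and volume names made up - since the block size has to be set at creation time:

   zfs create -V 20G -o volblocksize=8k -o compression=off ssd/testvol
)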

--- with SSD-Disk-write cache disabled: 

node1:/mnt/ssd# iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
Iozone: Performance Test of File I/O
Version $Revision: 3.327 $
Compiled for 32 bit mode.
Build: linux

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

Run began: Sun Jan 10 20:14:46 2010

Include fsync in write timing
Include close in write timing
Record Size 8 KB
File size set to 131072 KB
SYNC Mode.
O_DIRECT feature enabled
Command line used: iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
Output is in Kbytes/sec
Time Resolution = 0.02 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Min process = 2
Max process = 2
Throughput test with 2 processes
Each process writes a 131072 Kbyte file in 8 Kbyte records

Children see throughput for  2 initial writers  =1324.45 KB/sec
Parent sees throughput for  2 initial writers   =1291.27 KB/sec
Min throughput per process  = 646.07 KB/sec
Max throughput per process  = 678.38 KB/sec
Avg throughput per process  = 662.23 KB/sec
Min xfer=  124832.00 KB

Children see throughput for  2 rewriters=4360.29 KB/sec
Parent sees throughput for  2 rewriters =4360.08 KB/sec
Min throughput per process  =2158.82 KB/sec
Max throughput per process  =2201.47 KB/sec
Avg throughput per process  =2180.15 KB/sec
Min xfer=  128536.00 KB

Children see throughput for 2 random readers=   43930.41 KB/sec
Parent sees throughput for 2 random readers =   43914.01 KB/sec
Min throughput per process  =   21768.16 KB/sec
Max throughput per process  =   22162.25 KB/sec
Avg throughput per process  =   21965.21 KB/sec
Min xfer=  128760.00 KB

Children see throughput for 2 random writers=5561.01 KB/sec
Parent sees throughput for 2 random writers =5560.41 KB/sec
Min throughput per process  =2780.37 KB/sec
Max throughput per process  =2780.64 KB/sec
Avg throughput per process  =2780.50 KB/sec
Min xfer=  131064.00 KB

... with SSD write cache enabled 

node1:/mnt/ssd# iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
Iozone: Performance Test of File I/O
Version $Revision: 3.327 $
Compiled for 32 bit mode.
Build: linux

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

Run began: Sun Jan 10 20:22:14 2010

Include fsync in write timing
Include close in write timing
Record Size 8 KB
File size set to 131072 KB
SYNC Mode.
O_DIRECT feature enabled
Command line used: iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
Output is in Kbytes/sec
Time Resolution = 0.02 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Min process = 2
Max process = 2
Throughput test with 2 processes
Each process writes a 131072 Kbyte file in 8 Kbyte records

Children see throughput for  2 initial writers  =3387.15 KB/sec
Parent sees throughput for  2 initial writers   =3258.90 KB/sec

Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-10 Thread Bob Friesenhahn

On Sun, 10 Jan 2010, Lutz Schumann wrote:


Talking about read performance. Assuming a reliable ZIL disk (cache 
flush = working): the ZIL can guarantee data integrity; however, if 
the backend disks (aka pool disks) do not properly implement cache 
flush, a reliable ZIL device does not work around the bad backend 
disks, right???

(meaning: having a reliable ZIL + some MLC SSD with write cache 
enabled is not reliable in the end)


As soon as there is more than one disk in the pool, it is necessary 
for cache flush to work or else the devices may contain content from 
entirely different transaction groups, resulting in a scrambled pool.


If you just had one disk which tended to ignore cache flush requests, 
then you should be ok as long as the disk writes the data in order. 
In that case any unwritten data would be lost, but the pool should not 
be lost.  If the device ignores cache flush requests and writes data 
in some random order, then the pool is likely to eventually fail.


I think that zfs mirrors should be safer than raidz when faced with 
devices which fail to flush (should be similar to the single-disk 
case), but only if there is one mirror pair.


A scary thing about SSDs is that they may re-write old data while 
writing new data, which could result in corruption of the old data if 
the power fails while it is being re-written.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread Daniel Carosone
On Sun, Jan 10, 2010 at 09:54:56AM -0600, Bob Friesenhahn wrote:
 WTF?

urandom is a character device and is returning short reads (note the
0+n vs n+0 counts). dd is not padding these out to the full blocksize
(conv=sync) or making multiple reads to fill blocks (iflag=fullblock in
GNU dd).

Evidently the urandom device changed behaviour along the way with
regard to producing/buffering additional requested data, possibly as
a result of a changed source implementation that stretches
better/faster.  No bug here, just bad assumptions. 
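For example (Solaris /usr/bin/dd understands conv=sync; iflag=fullblock is
GNU-dd-only as far as I know):

   dd if=/dev/urandom of=largefile.txt bs=131072 count=65536 conv=sync

pads each short read out to 128K with NULs (so the file is the expected size,
but no longer all random), while GNU dd with iflag=fullblock keeps reading
until each 128K block is actually full.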

--
Dan.






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-10 Thread Richard Elling
On Jan 8, 2010, at 7:49 PM, bank kus wrote:

 dd if=/dev/urandom of=largefile.txt bs=1G count=8
 
 cp largefile.txt ./test/1.txt 
 cp largefile.txt ./test/2.txt 
 
 That's it; now the system is totally unusable after launching the two 8G 
 copies. Until these copies finish, no other application is able to launch 
 completely. Checking prstat shows them to be in the sleep state. 

What disk drivers are you using?  IDE?
 -- richard

 
 Question:
  I'm guessing this is because ZFS doesn't use CFQ, and that one process is 
 allowed to queue up all its I/O reads ahead of other processes?
 
  Is there a concept of priority among I/O reads? I only ask because if root 
 were to launch some GUI application, it doesn't start up until both copies are 
 done. So there is no concept of priority? Needless to say this does not exist 
 on Linux 2.60...
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repeating scrub does random fixes

2010-01-10 Thread Gary Gendel

Mattias Pantzare wrote:

On Sun, Jan 10, 2010 at 16:40, Gary Gendel g...@genashor.com wrote:
  

I've been using a 5-disk raidZ for years on SXCE machine which I converted to 
OSOL.  The only time I ever had zfs problems in SXCE was with snv_120, which 
was fixed.

So, now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on 
random disks.  If I repeat the scrub, it will fix errors on other disks.  
Occasionally it runs cleanly.  That it doesn't happen in a consistent manner 
makes me believe it's not hardware related.




That is a good indication for hardware related errors. Software will
do the same thing every time but hardware errors are often random.

But you are running an older version now, I would recommend an upgrade.
  


I would have thought that too if it didn't start right after the switch 
from SXCE to OSOL.  As for an upgrade, I use the dev repository on my 
laptop and I find that OSOL updates aren't nearly as stable as SXCE 
was.  I tried for a bit, but always had to go back to 111b because 
something crucial broke.  I was hoping to wait until the official 
release in March in order to let things stabilize.  This is my main 
web/mail/file/etc. server and I don't really want to muck with it too much.


That said, I may take a gamble on upgrading as we're getting closer to 
the 2010.x release.



Gary

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS, xVM and NFS file serving

2010-01-10 Thread jal

I'm hoping someone has already come across this problem and solved it.

I'm using xVM to create 2 NFS fileservers. The Dom0 has a zpool with 8 
zvols in it. 4 of the zvols are used for 4 DomU. 2 of the DomUs are 
fileservers, and each fileserver attaches 2 more of the zvols for the user 
filespace. The user filespace is then exported via NFS. The zvols attached 
as the user filespace are formatted UFS.


For the first 25 or 30 NFS clients to mount the exported filespace, 
performance is OK; however, it then drops off dramatically - 90 seconds to 
do an ls -l.


I've done no tuning, everything is stock.

Would using ZFS for the user filespace rather than UFS help? Any other 
ideas for a) determining where the problem is, or b) improving the throughput?
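(For (a), the only things I know to look at are the standard counters on the
fileserver DomU, e.g.

   nfsstat -s
   iostat -xnz 5

but I am not sure they will pinpoint where the time goes.)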


Thanks

john Landamore
Dept Computer Science
University of Leicester
UK
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS Cache + ZIL on single SSD

2010-01-10 Thread A. Krijgsman
Hi all,

Sorry for spamming your mailing list, 
but since I could not find a direct answer on the internet and in the archives, I'll give 
this a try!

I am building a ZFS filesystem to export iSCSI LUNs.
Now I was wondering if the L2ARC has the ability to cache non-filesystem iSCSI 
LUNs?
Or does it only work in combination with the mounted ZFS filesystem?

Next to that, I am reading about all kinds of performance benefits from using separate 
devices
for the ZIL (write) and the cache (read). I was wondering if I could 
share a single SSD between both the ZIL and the cache device?

Or is this not recommended?

This is because the ZIL needs 1 GB tops, from what I read?
But since SSD is not cheap, I would like to make use of the other GBs on the 
disk.

Thank you.

Armand
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Cache + ZIL on single SSD

2010-01-10 Thread Thomas Burgess

 Next to that I am reading all kind of performance benefits using seperate
 devices
 for the ZIL (write) and the Cache (read). I was wondering if I could
 share a single SSD between both ZIL and Cache device?

 Or is this not recommended?


I asked something similar recently.  The answers I got were along these
lines:

You can use a single SSD, but it's not a great idea.  If you DO need to use a
single SSD for such a thing, make sure it's one of the more expensive SLC
variety like the Intel X25-E.
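Mechanically, sharing one SSD is just a matter of giving each role its own
slice, something like (device names made up):

   zpool add tank log c3t5d0s0
   zpool add tank cache c3t5d0s1

but that doesn't change the SLC-vs-MLC caveat above.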

The MLC variety of SSD works well for L2ARC and is much cheaper (you can
pick up some for less than 100 bucks)  while the ZIL really should have the
SLC variety.

I'm not an expert though; I'm just passing on advice I've been given.


For reads and dedup L2ARC does make a dramatic difference, and for NFS and
database stuff an SSD ZIL will make a huge difference.

I've heard of people getting a 5-10x performance increase on reads just by
adding a cheap SSD, so I'd say it's worth it if that's the type of dataset
you have.

Another thing I was told on more than one occasion was that you may not even
NEED an SSD for your ZIL (basically you should run a script to see whether you
need a separate log device at all; I don't remember where this script is, but
I'm SURE someone will reply with it).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Cache + ZIL on single SSD

2010-01-10 Thread Mertol Ozyoney

Hi ;

I don't think that anyone owns the list, and like anyone else you are very  
welcome to ask any question.

L2ARC caches the zpool, so if your iSCSI LUN is a zvol or a file on ZFS  
it will be cached.

Please use COMSTAR if you need performance.
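(Rough sketch of the zvol-backed COMSTAR flow, from memory, so please check the
man pages before relying on it - pool and volume names made up:

   zfs create -V 100G tank/lun0
   sbdadm create-lu /dev/zvol/rdsk/tank/lun0
   stmfadm add-view <GUID printed by sbdadm>
   itadm create-target

plus enabling the stmf and iSCSI target SMF services.)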

You are correct that you will only need a couple of GB for the ZIL.

The performance gain of using a portion of an SSD for L2ARC will depend on  
your pool configuration and your average load mix.

A side note: use RAID 10 (striped mirrors) if you are using ZFS and iSCSI and need  
random IOPS.


Best regards
Mertol

Sent from a mobile device

Mertol Ozyoney

On 11 Jan 2010, at 01:05, A. Krijgsman a.krijgs...@draftsman.nl  
wrote:



Hi all,

Sorry for spamming your mailing list,
but since I could not find a direct answer on the internet and  
archives, I'll give this a try!


I am building a ZFS filesystem to export iSCSI LUNs.
Now I was wondering if the L2ARC has the ability to cache  
non-filesystem iSCSI LUNs?

Or does it only work in combination with the ZFS mounted filesystem?

Next to that, I am reading about all kinds of performance benefits from using  
separate devices

for the ZIL (write) and the Cache (read). I was wondering if I could
share a single SSD between both ZIL and Cache device?

Or is this not recommended?

This is because the ZIL needs 1 GB tops, from what I read?
But since SSD is not cheap, I would like to make use of the other  
GB's on the disk.


Thank you.

Armand
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss