Re: [zfs-discuss] Faster copy from UFS to ZFS

2011-05-19 Thread Daniel Rock

On Thu, 19 May 2011 15:39:50 +0200, Frank Van Damme wrote:

On 03-05-11 17:55, Brandon High wrote:

-H: Hard links


If you're going to do this for 2 TB of data, remember to expand your
swap space first (or have tons of memory). Rsync will need it to store
every inode number in the directory.


No, only for files with a link count greater than 1 - unless you are
using a very old version of rsync (<2.6.1).
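
For reference, the kind of copy being discussed would look roughly like this
(source and target paths are just placeholders):

rsync -aH /ufs/export/ /tank/export/

-a preserves permissions, times and ownership; -H additionally preserves hard
links, which is what makes rsync keep a table of inode numbers for files with
a link count greater than 1.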

--
Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Daniel Rock

On 05.01.2010 16:22, Mikko Lammi wrote:

However when we deleted some other files from the volume and managed to
raise free disk space from 4 GB to 10 GB, the "rm -rf directory" method
started to perform significantly faster. Now it's deleting around 4,000
files/minute (240,000/h - quite an improvement from 10,000/h). I remember
that I saw some discussion related to ZFS performance when filesystem
becomes very full, so I wonder if that was the case here.


I did some tests. They were done on an Ultra 20 (2.2 GHz Dual-Core Opteron)
with crappy SATA disks. On this machine creation and deletion of files were
I/O bound. I was able to create about 1 million files per hour. I stopped after
5 hours, so I had approx. 5 million files in one directory.


Deletion (via the Perl script) also had a rate of ~1 million files per hour.
During deletion the disks (mirrored zpool) were both 95% busy, CPU time was
5% total.


If the T1000 has SCSI disks you can turn on the write cache on both disks
(though in my tests most I/O during delete were read operations). For the rpool
it will probably not be enabled by default because you are "just" using
partitions:


# format -e
[select disk]
format> scsi
scsi> p8 b2 |= 4

Mode select on page 8 ok.

scsi> quit

Disable write cache:

scsi> p8 b2 &= ~4


(Yes I know, there is a "cache" command in format, but I had been using the
above commands long before the "cache" command was introduced)
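
For completeness, the same thing with the "cache" menu would look roughly like
this (from memory, so the exact prompts may differ slightly):

# format -e
[select disk]
format> cache
cache> write_cache
write_cache> enable
write_cache> quit
cache> quit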


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-13 Thread Daniel Rock

Hi,


Solaris 10U7, patched to the latest released patches two weeks ago.

Four ST31000340NS attached to two SI3132 SATA controller, RAIDZ1.

Self-built system with 2GB RAM and an
  x86 (chipid 0x0 AuthenticAMD family 15 model 35 step 2 clock 2210 MHz)
AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
processor.


On the first run throughput was ~110MB/s, on the second run only 80MB/s.

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real3m37.17s
user0m11.15s
sys 0m47.74s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real4m55.69s
user0m10.69s
sys 0m47.57s




Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data size grew.. with compression on

2009-04-13 Thread Daniel Rock

Will Murnane wrote:

Perhaps ZFS could do some very simplistic de-dup here: it has the
B-tree entry for the file in question when it goes to overwrite a
piece of it, so it could calculate the checksum of the new block and
see if it matches the checksum for the block it is overwriting.  This
is a much simpler case than the global de-dup that's in the works:
only one checksum needs to be compared to one other, compared to the
needle-in-haystack problem of dedup on the whole pool.


Beware that the default checksum algorithm is not cryptographically 
safe. You can easily generate blocks with different content but the same 
checksum.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data size grew.. with compression on

2009-04-09 Thread Daniel Rock

Jonathan wrote:

OpenSolaris Forums wrote:

if you have a snapshot of your files and rsync the same files again,
you need to use the "--inplace" rsync option, otherwise completely new
blocks will be allocated for the new files. that's because rsync will
write an entirely new file and rename it over the old one.


ZFS will allocate new blocks either way


No it won't. --inplace doesn't rewrite blocks identical on source and 
target but only blocks which have been changed.


I use rsync to synchronize a directory with a few large files (each up 
to 32 GB). Data normally gets appended to one file until it reaches the 
size limit of 32 GB. Before I used --inplace a snapshot needed on 
average ~16 GB. Now with --inplace it is just a few kBytes.
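
The setup itself is nothing special; roughly the following, with made-up
dataset and path names:

zfs snapshot tank/dump@before-sync
rsync -av --inplace /source/dumps/ /tank/dump/

With --inplace rsync writes changed blocks into the existing file instead of
building a new copy and renaming it over the old one, so a snapshot taken
before the sync only has to keep the blocks that were actually overwritten.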



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create ZPOOL with missing disks?

2009-01-16 Thread Daniel Rock
Jim Klimov wrote:
> Is it possible to create a (degraded) zpool with placeholders specified 
> instead
> of actual disks (parity or mirrors)? This is possible in linux mdadm 
> ("missing" 
> keyword), so I kinda hoped this can be done in Solaris, but didn't manage to.

Create sparse files with the size of the disks (mkfile -n ...).

Create a zpool with the free disks and the sparse files (zpool create -f 
...). Then immediately put the sparse files offline (zpool offline ...). 
Copy the files to the new zpool, destroy the old one and replace the 
sparse files with the now freed up disks (zpool replace ...).
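
A rough sketch of the whole sequence (device names and the file size are
placeholders - the sparse file just has to match the size of the real disks):

mkfile -n 931g /var/tmp/fake0
zpool create -f newpool raidz c2t1d0 c2t2d0 c2t3d0 /var/tmp/fake0
zpool offline newpool /var/tmp/fake0
# copy the data over, destroy the old pool, then:
zpool replace newpool /var/tmp/fake0 c2t0d0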

Remember: during data migration you are running without safety belts.
If a disk fails during migration you will lose data.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] create raidz with 1 disk offline

2008-09-28 Thread Daniel Rock
Brandon High wrote:
> 1. Create a sparse file on an existing filesystem. The size should be
> the same as your disks.
> 2. Create the raidz with 4 of the drives and the sparse file.
> 3. Export the zpool.
> 4. Delete the sparse file
> 5. Import the zpool. It should come up as degraded, since one of its
> vdevs is missing.
> 6. Copy your files onto the zpool.
> 7. replace the file vdev with the 5th disk.

You can skip steps 3-5 and just take the sparse file offline manually
(zpool offline ...).

I migrated a 2x1TB mirror to a 4x1TB RAID-Z this way a few weeks ago.


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SIL3124 stability?

2008-09-25 Thread Daniel Rock
Joerg Schilling wrote:
> Daniel Rock <[EMAIL PROTECTED]> wrote:
>> I disabled the BIOS on my cards because I don't need it. I boot from one 
>> of the onboard SATA ports.
> 
> OK, how did you do this?

This will depend on the card you are using. I simply had to remove a 
jumper (Dawicontrol DC-300e).


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SIL3124 stability?

2008-09-25 Thread Daniel Rock
Joerg Schilling wrote:
> If it works for your system, be happy. I mentioned that the controller may 
> not be usable in all systems as it hangs up the BIOS in my machine if there 
> is a
> disk connected to the card.

I disabled the BIOS on my cards because I don't need it. I boot from one 
of the onboard SATA ports.


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SIL3124 stability?

2008-09-25 Thread Daniel Rock
Mikael Karlsson wrote:
> Hello!
> 
> Anyone with experience with the SIL3124 chipset? Does it work good?
> 
> It's in the HCL, but since SIL3114 apparently is totally crap I'm a bit
> skeptical of Silicon Image..

I'm running two SIL3132 PCIe cards in my system with no problems.
The only panic I had was when I forcefully tried to get SMART data from the
drives via "smartmontools".

I am using the si3124 driver from OpenSolaris and recompiled it for 
Solaris 10. The original Solaris 10 driver panics when it finds a disk 
at one of its SATA ports.

The system has been running for almost a year now, first with two 1TB disks in
a mirror configuration, now with four 1TB disks in RAID-Z.

I am running "zpool scrub" at irregular intervals and haven't had a
single checksum error so far.


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Pools 1+TB

2008-08-28 Thread Daniel Rock
Kenny wrote:
>2. c6t600A0B800049F93C030A48B3EA2Cd0 
>   /scsi_vhci/[EMAIL PROTECTED]
>3. c6t600A0B800049F93C030D48B3EAB6d0 
>   /scsi_vhci/[EMAIL PROTECTED]

Disk 2: 931GB
Disk 3: 931MB

Do you see the difference?



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to delete hundreds of emtpy snapshots

2008-07-17 Thread Daniel Rock
Sylvain Dusart wrote:
> Hi Joe,
> 
> I use this script to delete my empty snapshots :
> 
> #!/bin/bash
> 
> NOW=$(date +%Y%m%d-%H%M%S)
> 
> POOL=tank
> 
> zfs snapshot -r [EMAIL PROTECTED]
> 
> FS_WITH_SNAPSHOTS=$(zfs list -t snapshot | grep '@' | cut -d '@' -f 1 | uniq)
> 
> for fs in $FS_WITH_SNAPSHOTS ; do
> 
> EMPTY_SNAPSHOTS=$(zfs list -t snapshot | grep -v "@$NOW" |
> grep "${fs}@" | awk ' $2 == "0" { print $1 }' )
> 
> for snapshot in $EMPTY_SNAPSHOTS ; do
> echo "Destroying empty snapshot $snapshot"
> zfs destroy $snapshot
> done
> done
> 
> Hope that helps,

This script is broken - because the "zfs list" output is broken. You
might end up deleting snapshots with data in them.

Data which is in more than one snapshot will not get counted in the "zfs
list" output at all.

Let's try it with an example:

# zfs create export/test
# mkfile 10m /export/test/bla
# zfs snapshot export/[EMAIL PROTECTED]
# zfs snapshot export/[EMAIL PROTECTED]
# rm /export/test/bla

Now both snapshots should contain (the same) 10MB of valuable data (the
file /export/test/bla that was just removed). But:

# zfs list -r export/test
NAME  USED  AVAIL  REFER  MOUNTPOINT
export/test  24.5K   171G  24.5K  /export/test
export/[EMAIL PROTECTED]  0  -  24.5K  -
export/[EMAIL PROTECTED]  0  -  24.5K  -

... indicates two empty snapshots. But after removing one of the 
snapshots ...

# zfs destroy export/[EMAIL PROTECTED]

... the data magically shows up in the second snapshot (in my case not
10MB, because the compression setting was inherited from the parent filesystem)

# zfs list -r export/test
NAME  USED  AVAIL  REFER  MOUNTPOINT
export/test47K   171G  24.5K  /export/test
export/[EMAIL PROTECTED]  22.5K  -  24.5K  -


Your script above would have deleted both snapshots. One possible
solution would be to re-evaluate the "zfs list" output after each
snapshot is deleted. This way the last of a set of data-identical snapshots
would be preserved. Something like:

while :; do
    # pick any snapshot that currently reports zero space used
    EMPTYSNAPSHOT=$(zfs list -Ht snapshot -o name,used | \
        awk '$2 == 0 { print $1; exit }')
    [ -n "$EMPTYSNAPSHOT" ] || break
    # destroying it may make space show up on a remaining snapshot,
    # so re-evaluate the list on the next pass
    zfs destroy $EMPTYSNAPSHOT
done


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS for write-only media?

2008-04-24 Thread Daniel Rock
Joerg Schilling wrote:
> WOM   Write-only media

http://www.national.com/rap/files/datasheet.pdf


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 24-port SATA controller options?

2008-04-15 Thread Daniel Rock
Tim wrote:
> I'm sure you're already aware, but if not, 22 drives in a raid-6 is 
> absolutely SUICIDE when using SATA disks.  12 disks is the upper end of 
> what you want even with raid-6.  The odds of you losing data in a 22 
> disk raid-6 is far too great to be worth it if you care about your 
> data.  /rant

Let's do some calculations. Based on the specs of

http://www.seagate.com/docs/pdf/datasheet/disc/ds_barracuda_es_2.pdf

AFR = 0.73%
BER = 1:10^15

22 disk RAID-6 with 1TB disks.

- The probability of a disk failure is 16.06% p.a.
   (0.73% * 22)
- let's assume one day array rebuild time (22 * 1TB / 300MB/s)
- This means the probability for another disk failure during the rebuild onto
   the hot spare is 0.042%
   (0.73% * (1/365) * 21)
- If a second disk fails there is a chance of 16% of an unrecoverable
   read error on the remaining 20 disks
   (8 * 20 * 10^12 / 10^15)

So the probability for a data loss is:

16.06% * 0.042% * 16.0% = 0.001% p.a.

(a little bit higher, since I haven't included 3 or more failing
disks). The calculations assume an independent failure probability for
each disk and correct numbers for AFR and BER. In reality I have found the
AFR rates of the disk vendors way too optimistic, but the BER rate too
pessimistic.

If we calculate with

AFR = 3%
BER = same

we end up with a data loss probability of 0.018% p.a.
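
If you want to play with the numbers, the arithmetic above can be reproduced
with a small awk one-liner (AFR, disk count, one-day rebuild and 1TB disks are
the assumptions stated above):

awk -v afr=0.0073 -v n=22 'BEGIN {
    p_first  = afr * n                # some disk fails during the year
    p_second = afr * (n - 1) / 365    # another disk fails during the ~1 day rebuild
    p_ure    = 8 * (n - 2) / 1000     # 8*10^12 bits per remaining disk vs. a BER of 1 in 10^15
    printf("P(data loss) = %.3f%% p.a.\n", 100 * p_first * p_second * p_ure)
}'

Rerunning it with -v afr=0.03 gives the 0.018% figure.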



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FiberChannel, and EMC tutorial?

2008-03-17 Thread Daniel Rock
Kyle McDonald wrote:
> Hi all,
> 
> Can anyone explain to me, or point me to any docs that explain how the 
> following numbers map together?
> 
> I have multiple LUNS exported to my HBA's from multiple EMC arrays.
> 
> zpool status, and /dev/dsk show device names like:
> 
> c0t600604838794003753594D333837d0  ONLINE   0 0 0
> c0t600604838794003753594D333838d0  ONLINE   0 0 0
> c0t600604838794003753594D333839d0  ONLINE   0 0 0
> c0t600604838794003753594D333841d0  ONLINE   0 0 0
> c0t600604838794003753594D333842d0  ONLINE   0 0 0
> c0t600604838794003753594D333843d0  ONLINE   0 0 0
> c0t600604838794003753594D333844d0  ONLINE   0 0 0
> 
> but cfgadm shows disk names like:
> 
> c4::5006048dc7dfb170
> c5::5006048dc7dfb17e
> c7::5006048dc7dfb17e
> 
> How do these numbers break down?
> 
> I noticed the zpool or /dev/dsk numbers all have a 'D' in them, what is
> represented by the number after the 'D'?
> What is the number before the 'D' made up from?
> 
> I'm guessing at least part of the number before the 'D' comes from the
> WWN of the array? Does part come from the disk in the array? Or does the
> array make up part of it so that that part is unique to the LUN? Or is
> all the LUN info in the number after the 'D'?
> 
> I've also noticed that if you change the leading 5 in the cfgadm output, 
> to a 6, then the first 7 digits of cfgadm output match up with the first 
> 7 digits of Y in the cXtYdZ number. I've tried converting the remainder 
> 'dc7dfb170' from Hex to Dec, and I don't get anythign that even 
> resembles the '38794003753594' part of the target.
> 
> How does this work?
> 
> These LUNs are all 11GB slices from the EMC arrays which are filled with
> 146GB drives. Since this is a lab environment I believe the EMC arrays
> are configured with no redundancy. What I'd like to know is how I can
> tell which LUNs might live on the same physical drives?


I don't have a Symmetrix here to check (have to wait 'til tomorrow), but 
your Symmetrix should have a serial number of
387940037
and you have mapped the symmetrix IDs 387-38D in the above example:

600604838794003753594D333837

. 006048
should be the vendor identifier for EMC

. 387940037
Serial-No. of your Symmetrix

. 53594D
ASCII code of "SYM"

. 333837
ASCII code of "387"
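
You can pick these pieces apart directly from the device name; a quick sketch
with the example LUN from above (plain shell, nothing EMC-specific):

LUN=600604838794003753594D333837
echo $LUN | cut -c2-7     # 006048       -> IEEE company id of EMC
echo $LUN | cut -c8-16    # 387940037    -> serial number of the Symmetrix
echo $LUN | cut -c17-     # 53594D333837 -> ASCII "SYM" + "387" (the symdev)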


Some of the information above was built from the WWN information, some 
from the SCSI inquiry (esp. the serial number). You can get a full 
inquiry of your disk with:

# format -e c0t600604838794003753594D333837d0s2
format> scsi
scsi> inquiry

In the hexdump of the full inquiry information you can find
. 8 bytes vendor ID ("EMC "),
. followed by 16 bytes product ID ("SYMMETRIX   ")
. followed by (up to) 20 bytes serial number.
In the serial number you will find again the symdev (387) of this device 
and the last few digits of the serial number.


With the EMC SE tools you can query more information about your symdevs:

symdev -sid 037 show 387

will show under "Backend Configuration" the real disks where this device 
is located.


Daniel

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file server performance - slow 64 bit sparc or fast 32 bit intel

2007-06-17 Thread Daniel Rock

Joe S wrote:

I've read that ZFS runs best on 64 bit solaris.

Which box would make the best (fastest) fileserver:

Sun v120
650 MHz UltraSPARC IIi
2GB RAM

--or--

Intel D875PBZ
Intel Pentium D 3.0 GHz (32-bit)
2GB RAM


I'd go with the Intel server. If you want to build a pure file server just 
increase the kernel memory with:


eeprom kernelbase=0x8000

I have a similar machine with 2GB RAM. Without manually setting kernelbase
the machine would hang randomly during I/O operations. With the above kernelbase
the machine has been running stable for months now.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-07 Thread Daniel Rock

Richard Elling - PAE wrote:

For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?


And what about encrypted disks? Simply create a zpool with checksum=sha256,
fill it up, then scrub. I'd be happy if I could use my machine during
scrubbing. Throttling of scrubbing would help. Maybe also run the
scrubbing at a "high nice level" in the kernel.





NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky?


250ms is the Veritas default. It doesn't have to be the ZFS default also.


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-07 Thread Daniel Rock

Richard Elling - PAE wrote:

The big question, though, is "10% of what?"  User CPU?  iops?


Maybe something like the "slow" parameter of VxVM?

   slow[=iodelay]
        Reduces the system performance impact of copy
        operations. Such operations are usually performed
        on small regions of the volume (normally from 16
        kilobytes to 128 kilobytes). This option inserts a
        delay between the recovery of each such region. A
        specific delay can be specified with iodelay as a
        number of milliseconds; otherwise, a default is
        chosen (normally 250 milliseconds).



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very high system loads with ZFS

2006-10-29 Thread Daniel Rock

Peter Guthrie wrote:

So far I've seen *very* high loads twice using ZFS which does not
happen when the same task is implemented with UFS

This is a known bug in the SPARC IDE driver.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6421427

You could try the following workaround until a fix has been delivered:

http://groups.google.com/group/comp.unix.solaris/browse_frm/thread/bed12f2e5085e69b/bc5b8afba2b477d7?lnk=gst&q=zfs+svm&rnum=2#bc5b8afba2b477d7


Daniel

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS Inexpensive SATA Whitebox

2006-10-18 Thread Daniel Rock

Richard Elling - PAE wrote:

Where most people get confused is the expectation that a hot-plug
device works like a hot-swap device.


Well, seems like you should also inform your documentation team about this 
definition:


http://www.sun.com/products-n-solutions/hardware/docs/html/819-3722-15/index.html#21924

SATA hot plug is supported only for the Windows XP
Operating System (OS).


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS Inexpensive SATA Whitebox

2006-10-17 Thread Daniel Rock

Richard Elling - PAE wrote:


The operational definition of "hot pluggable" is:
The ability to add or remove a system component while the
system remains powered up, and without inducing any hardware
errors.

This does not imply anything about whether the component is
automatically integrated into or detached from some higher
level environment for use, nor that such an environment is
necessarily suspended during the operation (although it may
be.)  For example, a hot pluggable processor/memory card may
be added to a system without the need to power the chassis
down and then back up.  However, that does not mean it will
be automatically utilized by the operating system using that
chassis.


Poor excuse
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS Inexpensive SATA Whitebox

2006-10-17 Thread Daniel Rock

Richard Elling - PAE wrote:

Frank Cusack wrote:

I'm sorry, but that's ridiculous.  Sun sells a hardware product which
their software does not support.  The worst part is it is advertised as
working.  


What is your definition of "work"?
NVidia MCPs work with SATA drives in IDE emulation mode under Solaris
(thus I am able to compose this message on an NForce 410)


Quoted from the web page above:

Internal disk   
Up to two hot-pluggable 3.5 inch SATA or SATA II,
250 GB or 500 GB 7200 RPM disks supported.


"hot-pluggable" - you will find out only in a footnote (not on this specs 
page) that this is only supported with MS Windows.



http://www.sun.com/servers/entry/x2100/os.jsp

Still no mention that hot-plugging is not supported by Solaris.

Maybe here
http://www.sun.com/servers/entry/x2100/datasheet.pdf
... nope.


Perhaps in the FAQ
http://www.sun.com/servers/entry/x2100/faq.jsp
... hmm, no again.

Hopefully in the product notes
http://www.sun.com/products-n-solutions/hardware/docs/html/819-6594-11/
... no again.


I still haven't found the document which states that hot-plugging of disks is
not supported by Solaris.


So one would normally assume that advertised hot-plugging on Sun hardware is
also supported by a Sun operating system - better not.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Comments on a ZFS multiple use of a pool, RFE.

2006-09-14 Thread Daniel Rock
> The OP was just showing a test case.  On a real system your HA software
> would exchange a heartbeat and not do a double import.  The problem with
> zfs is that after the original system fails and the second system imports
> the pool, the original system also tries to import on [re]boot, and the OP
> didn't know how to disable this.

Ok, I see. I don't think he showed that because he didn't mention a reboot.

If you *never* want to import a pool automatically on reboot you just have to
delete the /etc/zfs/zpool.cache file before the zfs module is loaded.
This could be integrated into SMF.

ZFS should normally be loaded during svc:/system/filesystem/local. Just create
a service that removes /etc/zfs/zpool.cache and has filesystem/local as a
dependent.

Like:

/lib/svc/method/zpool-clear

#!/sbin/sh
. /lib/svc/share/smf_include.sh
/usr/bin/rm -f /etc/zfs/zpool.cache
exit $SMF_EXIT_OK


/var/svc/manifest/system/zpool-clear.xml

Clears the zpool cache file

(sorry for the poor formatting but this !$&%()# web interface is driving me nuts)
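
Since the web interface ate the XML, here is roughly what such a manifest
would look like - an untested sketch along those lines, with names chosen to
match the method script above, not necessarily the original:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='zpool-clear'>
  <service name='system/zpool-clear' type='service' version='1'>
    <create_default_instance enabled='true' />
    <single_instance />
    <!-- make filesystem/local wait for us, so the cache file is gone
         before ZFS gets a chance to import anything -->
    <dependent name='zpool-clear_local' grouping='optional_all' restart_on='none'>
      <service_fmri value='svc:/system/filesystem/local' />
    </dependent>
    <exec_method type='method' name='start'
        exec='/lib/svc/method/zpool-clear' timeout_seconds='60' />
    <exec_method type='method' name='stop' exec=':true' timeout_seconds='60' />
    <property_group name='startd' type='framework'>
      <propval name='duration' type='astring' value='transient' />
    </property_group>
    <stability value='Unstable' />
    <template>
      <common_name>
        <loctext xml:lang='C'>Clears the zpool cache file</loctext>
      </common_name>
    </template>
  </service>
</service_bundle>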

Then do a "svccfg import /var/svc/manifest/system/zpool-clear.xml". Now 
/etc/zfs/zpool.cache should disappear on each reboot before the zfs module is 
loaded.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Comments on a ZFS multiple use of a pool, RFE.

2006-09-13 Thread Daniel Rock

Anton B. Rang wrote:

The hostid solution that VxVM uses would catch this second problem,

> because when A came up after its reboot, it would find that -- even
> though it had created the pool -- it was not the last machine to access
> it, and could refuse to automatically mount it. If the administrator
> really wanted it mounted, they could force the issue. Relying on the
> administrator to know that they have to remove a file (the
> 'zpool cache') before they let the machine come up out of single-user
> mode seems the wrong approach to me. ("By default, we'll shoot you in
> the foot, but we'll give you a way to unload the gun if you're fast
> enough and if you remember.")

I haven't tried: Does ZFS try a forced (-f) import of zpools in
/etc/zfs/zpool.cache, or does it just do a normal import and fail if the disks
seem to be in use elsewhere, e.g. after a reboot of a probably failed and later
repaired machine?


Just to clear some things up: the OP who started the whole discussion would
have had the same problems with VxVM as he has now with ZFS. Forcing an
import of a disk group on one host while it is still active on another host
won't make the DG magically disappear on the other one.


The corresponding flag to "zpool import -f" is "vxdg import -C". If you issue
this command you could also end up with the same DG imported on more than one
host. Because with VxVM there is usually another level of indirection (volumes
on top of the DG, which may contain filesystems that also have to be mounted
manually), just importing a DG is normally harmless.


So also with VxVM you can shoot yourself in the foot.

On host B
B# vxdg -C import DG
B# vxvol -g DG startall
B# mount /dev/vx/dsk/DG/filesys /some/where
B# do_someting on /some/where

while still on host A
A# do_something on /some/where

Instead of a zpool.cache file VxVM uses the hostid (not to be confused with
the host-id; normally just the ordinary hostname `uname -n` of the machine) to
know which DGs it should import automatically. Additionally each DG (or more
precisely: each disk) has an autoimport flag which also has to be turned on
for the DG to be autoimported during bootup.


So to mimic VxVM in ZFS the solution would simply be: add an autoimport flag 
to the zpool.





Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Oracle on ZFS

2006-08-30 Thread Daniel Rock

[EMAIL PROTECTED] wrote:

Robert and Daniel,

How did you put oracle on ZFS:
- one zpool+ one  filesystem
- one zpool+ many filesystems
- a few zpools + one  filesystem on each
- a few zpools + many filesystem on each



My goal was not to maximize tuning for ZFS but just to compare ZFS vs. UFS for
an average setup. I just set up ZFS with default parameters (besides the tested
ones: compression on/off; checksum off/on/sha256). So my ZFS setup was simply
one zpool + one filesystem; no specific tuning of the record size.


I did some preliminary tests on different ZFS recordsize and Oracle 
db_block_size/db_file_multiblock_read_count but had mixed results so I kept 
the defaults. If I have some time (and my test machine back) I might try 
different ZFS/Oracle performance parameters and their impact.




Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Oracle on ZFS

2006-08-25 Thread Daniel Rock

[EMAIL PROTECTED] wrote:

Hi all,

Does anybody use Oracle on ZFS in production (but not as a
background/testing database but as a front line) ?


I haven't used it in production yet - but I'm planning to by the end of the year.
I did some performance stress tests of ZFS vs. UFS+SVM. For testing I used the
TPC-H benchmark as a base and tested on SAN (Symmetrix DMX1000 storage).


The test consists of four parts:

1. creation of database (CREATE DATABASE ...; @catalog.sql; @catproc.sql)
2. Importing Test data with sqlldr (TPC-H target size of 2 GB)
3. Run a single instance of the TPC-H benchmark
4. Run 4 (2 * NCPU) instances of the TPC-H benchmark

Test 1: UFS performed slightly better
Test 2: performance of UFS vs. ZFS was similar
Test 3: ZFS was somewhat faster (<10%)
Test 4: ZFS was much faster (>20%)

Summing up the runtime of all 4 tests ZFS was ~12% faster (248 mins to 280 
mins).



I am interesting especially in:
- how does it behave after a long time of using it. Becasue COW nature
  of ZFS I don't know how it influences performance of queries.


I filled up the pool with random data so that the pool reached 100% usage
during the benchmark. (The benchmark needs a huge TEMP tablespace - with the
above 2GB of data, TEMP grew up to 20GB - so you can expect a lot of write
I/O.) The total performance dropped by just 1-2%.




- general opinion of such pair (ZFS + Oracle)


Use compression and save a lot of space. Expect a space reduction of >50% by 
using compression even with filled up tablespaces. Oracle just worked on ZFS, 
no "strange" results.


I had one panic, but this was related to the Emulex FC device driver, not to
ZFS (in the meantime an Emulex patch has been released (120222/120223); I don't
know if it fixes this panic).


The total runtime didn't change (<1%) in my tests with compression turned on
compared to compression off. The CPU usage increased by ~15%. So if you have
fast CPUs, give them something to do instead of just fooling around.




- which version of Oracle


This was Oracle 10g R2 on a FSC PrimePower (2 x 1350 MHz, 4 GB RAM) with 
Solaris 10 06/06.
Currently I have only an older Xeon box with SAN connectivity for Solaris/x86 
but Oracle 10gR2 still isn't released on 32bit x86 platforms[1]. 10gR1 needed 
much more space in the TEMP tablespace so I couldn't complete the tests.



[1] 
http://www.oracle.com/technology/software/products/database/oracle10g/index.html



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: zfs create -o

2006-08-15 Thread Daniel Rock

Eric Schrock wrote:

This RFE is also required for crypto support, as the encryption
algorithm must be known when the filesystem is created. It also has the
benefit of cleaning up the implementation of other creation-time
properties (volsize and volblocksize) that were previously special
cases.


Ok,

but what about the top-level FS in each pool? Then we need a -o option for
"zpool create" as well:


zpool create pool 
zfs set compression=on pool

Or will it be impossible to set encryption directly on "pool" as opposed to
"pool/fs"?



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: zfs questions from Sun customer

2006-07-28 Thread Daniel Rock
>
> * follow-up question from customer
> 
> 
> Yes, using the c#t#d# disks works, but anyone using fibre-channel storage
> on something like IBM Shark or EMC Clariion will want multiple paths to
> disk using either IBMsdd, EMCpower or Solaris native MPIO.  Does ZFS
> work with any of these fibre channel multipathing drivers?

As a side note: EMCpower does work with ZFS:

# pkginfo -l EMCpower
[...]
   VERSION:  4.5.0_b169

# zpool create test emcpower3c
warning: device in use checking failed: No such device

# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test76K  8.24G  24.5K  /test

ZFS does not re-label the disk though, so you have to create an EFI label 
through some other means.

MPxIO works out-of-the-box.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] legato support

2006-07-20 Thread Daniel Rock

Anne Wong wrote:
The EMC/Legato NetWorker (a.k.a. Sun StorEdge EBS) support for ZFS
NFSv4/ACLs will be in the NetWorker 7.3.2 release, currently targeted for
September.


Will it also support the new zfs style automounts?

Or do I have to set
zfs set mountpoint=legacy zfs/file/system
and use 'good ole' /etc/vfstab?
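
For the legacy case the vfstab entry would be something along the lines of
(mount point made up):

zfs/file/system  -  /export/files  zfs  -  yes  -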


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-19 Thread Daniel Rock

Richard Elling wrote:

First, let's convince everyone to mirror and not RAID-Z[2] -- boil one
ocean at a time, there are only 5 you know... :-)


For maximum protection 4-disk RAID-Z2 is *always* better than 4-disk RAID-1+0. 
With more disks use multiple 4-disk RAID-Z2 packs.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-18 Thread Daniel Rock

Al Hopper wrote:

On Tue, 18 Jul 2006, Daniel Rock wrote:

I think this type of calculation is flawed. Disk failures are rare and
multiple disk failures at the same time are even more rare.


Stop right here! :)  If you have a large number of identical disks which
operate in the same environment[1], and possibly the same enclosure, it's
quite likely that you'll see 2 or more disks die within the same,
relatively short, timeframe.


Not my experience. I work and have worked with several disk arrays (EMC, IBM, 
Sun, etc.) and the failure rates of individual disks were fairly random.




Also, with todays higher density disk enclosures, a fan failure, which
goes un-noticed for a period of time, is likely to affect more than one
drive - again leading to multiple disks failing in the same general
timeframe.


Then make sure that no more than 2 disks of the same raidz2 pack are in the same
airflow path (or the equivalent for RAID-1).




Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-18 Thread Daniel Rock

Richard Elling schrieb:

Jeff Bonwick wrote:

For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
or RAID-Z2.


Maybe I'm missing something, but it ought to be the other way around.
With 6 disks, RAID-Z2 can tolerate any two disk failures, whereas
for 3x2-way mirroring, of the (6 choose 2) = 6*5/2 = 15 possible
two-disk failure scenarios, three of them are fatal.


For the 6-disk case, with RAID-1+0 you get 27/64 surviving states
versus 22/64 for RAID-Z2.  This accounts for the cases where you could
lose 3 disks and survive with RAID-1+0.


I think this type of calculation is flawed. Disk failures are rare and 
multiple disk failures at the same time are even more rare.


Let's do some other calculation:

1. Assume each disk reliability independent of the others.

For ease of calculation:

2. One week between disk failure and its replacement (including resilvering)
3. Failure rate of 1% per week for each disk.


Compare:

a. 6 disk RAID-1+0
b. 6 disk RAID-Z2


i. 1 disk failure has a probability of
   ~5.7% per week


but more interesting:

ii. 2 disk failures
0.14 % probability per week
a. fatal probability: 20%
b. fatal probability:  0%

iii. 3 disk failures
0.002% probability per week
a. fatal probability:  60%
b. fatal probability: 100%

The remaining probabilites become more and more unlikely.

In summary:

Probability for a fatal loss
a. 0.14% * 20% + 0.002% *  60% = 0.03%  per week
b. 0.14% *  0% + 0.002% * 100% = 0.002% per week
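
If you don't trust the shortcut of only looking at the 2- and 3-disk cases, a
brute-force enumeration of all 64 failure states gives almost the same numbers
(same assumed 1% weekly failure rate per disk):

awk 'BEGIN {
    p = 0.01                              # assumed weekly failure rate per disk
    for (s = 0; s < 64; s++) {            # every combination of failed disks
        n = 0
        for (i = 0; i < 6; i++) { f[i] = int(s / 2^i) % 2; n += f[i] }
        w = p^n * (1 - p)^(6 - n)         # probability of exactly this state
        if ((f[0] && f[1]) || (f[2] && f[3]) || (f[4] && f[5])) mirror += w
        if (n > 2) raidz2 += w            # RAID-Z2 only dies with 3+ failures
    }
    printf("RAID-1+0: %.4f%%  RAID-Z2: %.4f%%\n", 100 * mirror, 100 * raidz2)
}'

which prints about 0.0300% for the three mirrors and 0.0020% for RAID-Z2.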


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] system unresponsive after issuing a zpool attach

2006-07-13 Thread Daniel Rock

Joseph Mocker wrote:
Today I attempted to upgrade to S10_U2 and migrate some mirrored UFS SVM 
partitions to ZFS.


I used Live Upgrade to migrate from U1 to U2 and that went without a 
hitch on my SunBlade 2000. And the initial conversion of one side of the 
UFS mirrors to a ZFS pool and subsequent data migration went fine. 
However, when I attempted to attach the second side mirrors as a mirror 
of the ZFS pool, all hell broke loose.

>

9. attach the partition to the pool as a mirror
  zpool attach storage cXtXdXs4 cYtYdYs4

A few minutes after issuing the command the system became unresponsive 
as described above.


Same here. I also upgraded to S10_U2 and converted my non-root metadevices
in a similar way. Everything went fine until the "zpool attach". The system
seemed to hang for at least 2-3 minutes. Then I could type something
again; "top" then showed 98% system time.

This was on a SunBlade 1000 with 2 x 750MHz CPUs. The zpool/zfs was created 
with checksum=sha256.




Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs snapshot restarts scrubbing?

2006-06-22 Thread Daniel Rock

Hi,

yesterday I implemented simple hourly snapshots on my filesystems. I also
regularly initiate a manual "zpool scrub" on all my pools. Usually the
scrubbing will run for about 3 hours.


But after enabling hourly snapshots I noticed that the scrub is always
restarted if a new snapshot is created - so basically it will never get the
chance to finish:


# zpool scrub scratch
# zpool status scratch | grep scrub
 scrub: scrub in progress, 1.90% done, 0h0m to go
# zfs snapshot [EMAIL PROTECTED]
# zpool status scratch | grep scrub
 scrub: scrub stopped with 0 errors on Thu Jun 22 18:15:49 2006
# zpool status scratch | grep scrub
 scrub: scrub in progress, 7.36% done, 0h1m to go
# zfs snapshot [EMAIL PROTECTED]
# zpool status scratch | grep scrub
 scrub: scrub stopped with 0 errors on Thu Jun 22 18:16:08 2006
# zpool status scratch | grep scrub
 scrub: scrub in progress, 0.64% done, 0h0m to go


The pool above is otherwise completely idle.

Is this the expected behaviour?
How can I manually scrub the pools if snapshots are created on a regular basis?


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Safe to enable write cache?

2006-06-19 Thread Daniel Rock

Robert Milkowski wrote:

Hello UNIX,

Monday, June 19, 2006, 10:02:03 AM, you wrote:

Ua> Simple question: is it safe to enable the disk write cache when using ZFS?

As ZFS should send proper ioctl to flush cache after each transaction
group it should be safe.

Actually if you give ZFS whole disk it will try to enable write cache
on that disk anyway.


What about ATA disks?

Currently (at least on x86) the ata driver enables the write cache 
unconditionally on each drive and doesn't support the ioctl to flush the cache 
(although the function is already there).



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS ACLs and Samba

2006-06-18 Thread Daniel Rock

Hi,

is anyone working on ZFS ACL support in Samba?

Currently I have to disable ACL support in Samba. Otherwise I get
"permission denied" errors when trying to synchronize my offline folders
residing on a Samba server (now on ZFS).
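
Disabling it boils down to something like this per share (share name and path
are made up):

[data]
    path = /tank/data
    nt acl support = no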



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fdsync(5, FSYNC) problem and ZFS

2006-06-18 Thread Daniel Rock

Sean Meighan wrote:

The box runs less than 20% load. Everything has been working perfectly 
until two days ago, now it can take 10 minutes to exit from vi. The 
following truss shows that the 3 line file that is sitting on the ZFS 
volume (/archives) took almost 15 minutes in fdsync.


/me has similar observations. With virtually no other activity on the box,
exiting from vi (:wq) sometimes takes 10-60 seconds. Trussing the process
always shows vi waiting in fdsync(). This happened on at least two
different machines:

- Sun E3500 with ZFS on A5200 disks (snv_37, almost all of the time large
  delay (10-60 secs.) when exiting from vi)
- Solaris/amd64 with ZFS on SATA disks (snv_39, but only few delays, usually
  not longer than 10 secs.)


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] panic in buf_hash_remove

2006-06-13 Thread Daniel Rock

Noel Dellofano wrote:

Out of curiosity, is this panic reproducible?


Hmm, not directly. The panic happened during a long-running I/O stress test
in the middle of the night. The tests had already run for ~6 hours at that time.



> A bug should be filed on this for more investigation. Feel free to open
> one or I'll open it if you forward me info on where the crash dump is
> and information on the I/O stress test you were running.


The core dump is very large. Even compressed with bzip2 it is still ~300MB
in size. I will upload it to my external server tonight and post details
of where the crash dump can be found.


The tests I ran were Oracle database tests with many concurrent connections 
to the database. During the time of crash system load and I/O was just 
average though.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] panic in buf_hash_remove

2006-06-12 Thread Daniel Rock

Hi,

had recently this panic during some I/O stress tests:

> $BAD TRAP: type=e (#pf Page fault) rp=fe80005c3980 addr=30 occurred in 
module "zfs" due to a NULL pointer dereference



sched:
#pf Page fault
Bad kernel fault at addr=0x30
pid=0, pc=0xf3ee322e, sp=0xfe80005c3a70, eflags=0x10206
cr0: 8005003b cr4: 6f0
cr2: 30 cr3: a49a000 cr8: c
rdi: fe80f0aa2b40 rsi: 89c3a050 rdx: 6352
rcx:   2f  r8:0  r9:   30
rax: 64f2 rbx:2 rbp: fe80005c3aa0
r10:   fe80f0c979 r11:   bd7189449a7087 r12: 89c3a040
r13: 89c3a040 r14:32790 r15:0
fsb: 8000 gsb: 8149d800  ds:   43
 es:   43  fs:0  gs:  1c3
trp:e err:0 rip: f3ee322e
 cs:   28 rfl:10206 rsp: fe80005c3a70
 ss:   30

fe80005c3870 unix:die+eb ()
fe80005c3970 unix:trap+14f9 ()
fe80005c3980 unix:cmntrap+140 ()
fe80005c3aa0 zfs:buf_hash_remove+54 ()
fe80005c3b00 zfs:arc_change_state+1bd ()
fe80005c3b70 zfs:arc_evict_ghost+d1 ()
fe80005c3b90 zfs:arc_adjust+10f ()
fe80005c3bb0 zfs:arc_kmem_reclaim+d0 ()
fe80005c3bf0 zfs:arc_kmem_reap_now+30 ()
fe80005c3c60 zfs:arc_reclaim_thread+108 ()
fe80005c3c70 unix:thread_start+8 ()

syncing file systems...
 done
dumping to /dev/md/dsk/swap, offset 644874240, content: kernel
> $c
buf_hash_remove+0x54(89c3a040)
arc_change_state+0x1bd(c0099370, 89c3a040, c0098f30)
arc_evict_ghost+0xd1(c0099470, 14b5c0c4)
arc_adjust+0x10f()
arc_kmem_reclaim+0xd0()
arc_kmem_reap_now+0x30(0)
arc_reclaim_thread+0x108()
thread_start+8()
> ::status
debugging crash dump vmcore.0 (64-bit) from server
operating system: 5.11 snv_39 (i86pc)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fe80005c3980 addr=30 occurred in 
module "zfs" due to a NULL pointer dereference

dump content: kernel pages only



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Life after the pool party

2006-05-27 Thread Daniel Rock

James Dickens wrote:

tried it again... same results... restores the damn EFI label.. disk
starts on block 34 not 0, there is no slice 2... that the Solaris
installer demands

can not start any track at block 0.. so I can't create a backup slice
aka 2.


This is a SCSI disk? Then you can send SCSI commands directly to the drive
to overwrite sector 0.


I found a small test program I wrote a long time ago to become familiar
with USCSI. It can also be used to overwrite the label:


http://www.deadcafe.de/disk.c

At least it still compiles. But I don't have any SCSI drive, so I cannot test.
Simply compile with "make disk". Then call the program with the following 
options:

./disk -u -w c0t0d0s2

This will start writing to the disk at sector 0. You can abort after a
few seconds. Now you should have an empty disk without any disk label.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive write cache

2006-05-27 Thread Daniel Rock

[EMAIL PROTECTED] wrote:
What about IDE drives (PATA, SATA). Currently only the sd driver implements 
enabling/disabling the write cache?


They typically have write caches enabled by default; and some
don't take ckindly to disabling the write cache or do not allow
it at all.


But you could at least try. Enabling/disabling the write cache and flushing
the cache are both described in the ATA/ATAPI standard (at least from
ATA/ATAPI-5 onwards).


Currently the IDE driver does (by default) exactly the opposite: it always 
turns on the write cache regardless of how you set up the drive:


http://cvs.opensolaris.org/source/xref/on/usr/src/uts/intel/io/dktp/controller/ata/ata_disk.c#ata_write_cache


But disabling the write cache on ATA drives only makes sense if command
queueing (TCQ/NCQ) is supported. Otherwise write performance of these drives
is plain bad.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive write cache

2006-05-27 Thread Daniel Rock

Bart Smaalders wrote:

ZFS enables the write cache and flushes it when committing transaction
groups; this ensures that all of a transaction group appears or does
not appear on disk.


What about IDE drives (PATA, SATA)? Currently only the sd driver implements
enabling/disabling the write cache.


Also the WCE bit isn't reset if a zpool is destroyed. If you destroy a zpool
and later create a "classical" SVM+UFS volume on these disks you run the
risk of data corruption.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS mirror and read policy; kstat I/O values for zfs

2006-05-26 Thread Daniel Rock

Hi,

after some testing with ZFS I noticed that read requests are not scheduled
evenly across the drives; the first one gets predominantly selected:



My pool is setup as follows:

NAMESTATE READ WRITE CKSUM
tpc ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t0d0  ONLINE   0 0 0
c4t0d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t1d0  ONLINE   0 0 0
c4t1d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t2d0  ONLINE   0 0 0
c4t2d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t3d0  ONLINE   0 0 0
c4t3d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t4d0  ONLINE   0 0 0
c4t4d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t6d0  ONLINE   0 0 0
c4t6d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t7d0  ONLINE   0 0 0
c4t7d0  ONLINE   0 0 0


Disk I/O after doing some benchmarking:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
--  -  -  -  -  -  -
tpc 7.70G  50.9G 85 21  10.5M  1.08M
  mirror1.10G  7.28G 11  3  1.47M   159K
c1t0d0  -  - 10  2  1.34M   159K
c4t0d0  -  -  1  2   138K   159K
  mirror1.10G  7.27G 11  3  1.48M   159K
c1t1d0  -  - 10  2  1.34M   159K
c4t1d0  -  -  1  2   140K   159K
  mirror1.09G  7.28G 12  3  1.50M   159K
c1t2d0  -  - 10  2  1.37M   159K
c4t2d0  -  -  0  2   128K   159K
  mirror1.10G  7.28G 12  3  1.53M   158K
c1t3d0  -  - 11  2  1.42M   158K
c4t3d0  -  -  0  2   110K   158K
  mirror1.10G  7.28G 11  3  1.44M   158K
c1t4d0  -  - 10  2  1.33M   158K
c4t4d0  -  -  0  2   112K   158K
  mirror1.10G  7.28G 12  3  1.53M   158K
c1t6d0  -  - 11  2  1.42M   158K
c4t6d0  -  -  0  2   106K   158K
  mirror1.11G  7.26G 12  3  1.55M   158K
c1t7d0  -  - 11  2  1.42M   158K
c4t7d0  -  -  1  2   130K   158K
--  -  -  -  -  -  -


or with "iostat"
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   11.4    4.3 1451.1  157.1  0.0  0.3    0.4   19.6   0  17 c1t7d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t5d0
   10.7    4.3 1361.4  158.4  0.0  0.3    0.4   22.1   0  18 c1t0d0
   10.9    4.3 1395.7  157.9  0.0  0.3    0.4   18.6   0  16 c1t2d0
    1.0    4.3  129.0  157.1  0.0  0.0    0.8    8.9   0   2 c4t7d0
    0.9    4.3  112.0  156.9  0.0  0.0    0.9    9.4   0   2 c4t4d0
    1.1    4.4  139.5  158.3  0.0  0.0    0.9    8.8   0   3 c4t1d0
   10.6    4.3 1354.8  157.0  0.0  0.3    0.4   18.8   0  16 c1t4d0
    0.9    4.3  109.2  157.3  0.0  0.1    0.9    9.7   0   3 c4t3d0
   10.7    4.4 1363.4  158.3  0.0  0.3    0.4   21.9   0  18 c1t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t8d0
    1.0    4.3  127.0  157.8  0.0  0.0    0.9    9.0   0   2 c4t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t8d0
   11.4    4.3 1449.9  156.9  0.0  0.3    0.4   20.0   0  17 c1t6d0
    0.8    4.3  105.4  156.8  0.0  0.0    0.9    8.5   0   2 c4t6d0
   11.3    4.3 1447.4  157.4  0.0  0.3    0.4   18.9   0  17 c1t3d0
    1.1    4.4  137.7  158.4  0.0  0.0    0.9    8.8   0   2 c4t0d0



So you can see the second disk of each mirror pair (c4tXd0) gets almost no 
I/O. How does ZFS decide from which mirror device to read?



And just another note:
SVM does offer kstat values of type KSTAT_TYPE_IO. Why not ZFS (at least at
the zpool level)?


And BTW (not ZFS related, but SVM):
With the introduction of the SVM bunnahabhain project (friendly names) 
"iostat -n" output is now completely useless - even if you still use the old 
naming scheme:


% iostat -n
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    2.3   0   0 c0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    2.4   0   0 c0d1
    0.0    5.0    0.7   21.8  0.0  0.0    0.0    1.5   0   1 c3d0
    0.0    4.1    0.6   20.9  0.0  0.0    0.0    2.8   0   1 c4d0
    1.6   37.3   16.6  164.3  0.1  0.1    2.5    1.6   1   5 c2d0
    1.6   37.5   16.5  164.5  0.1  0.1    3.2    1.7   1   5 c1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 fd0
    2.9    1.9   19.3    4.8  0.0  0.2    0.3   37.2   0   1 md5
0.00.00.00.

Re: [zfs-discuss] Oracle on ZFS vs. UFS

2006-05-19 Thread Daniel Rock

Richard Elling wrote:

On Fri, 2006-05-19 at 23:09 +0200, Daniel Rock wrote:

(*) maxphys = 8388608


Pedantically, because ZFS does 128kByte I/Os.  Setting maxphys >
128kBytes won't make any difference.


I know, but with the default maxphys value of 56kByte on x86 a 128kByte 
request will be split into three physical I/Os.
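
(For reference, on x86 maxphys is an /etc/system tunable, i.e. something like

set maxphys=8388608

followed by a reboot.)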



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Oracle on ZFS vs. UFS

2006-05-19 Thread Daniel Rock

Bart Smaalders wrote:

How big is the database?


After all the data has been loaded, all datafiles together are 2.8GB; the SGA
is 320MB. But I don't think size matters for this problem, since you can
already see during the catalog creation phase that UFS is 2x faster.




Since oracle writes in small block sizes, did you set the recordsize for
ZFS?


recordsize is default (128K). Oracle uses:
db_block_size=8192
db_file_multi_block_read_count=16

I tried with "db_block_size=32768" but the results got worse.

I have just rerun the first parts of my benchmark (database + catalog 
creation) with different parameters.


The datafiles will be deleted before each run, so I assume that when Oracle
recreates the files they will already use the modified ZFS parameters
(so I don't have to recreate the zpool/zfs).
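
The recordsize runs below therefore boil down to a simple property change
before the datafiles are recreated, e.g. (dataset name is a placeholder):

zfs set recordsize=8k tank/oradata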


Below the results (UFS again as the reference point):
UFS written as UFS(forcedirectio?,ufs:blocksize,oracle:db_block_size)
ZFS written as ZFS(zfs:compression,zfs:recordsize,oracle:db_block_size)

These results are now run with memory capping in effect (physmem=262144 (1GB))

                               db creation   catalog creation
UFS(-,8K,8K) [default]            0:41.851        6:17.530
UFS(forcedirectio,8K,8K)          0:40.479        6:03.688
UFS(forcedirectio,8K,32K)         0:48.718        8:19.359

ZFS(off,128K,8K) [default]        0:52.427       13:28.081
ZFS(on,128K,8K)                   0:50.791       14:27.919
ZFS(on,8K,8K)                     0:42.611       13:34.464
ZFS(off,32K,32K)                  1:40.038       15:35.177

(times in min:sec.msec)

So you gain a few percent, but it is still slower than UFS. UFS catalog
creation is already mostly CPU bound: during the ~6 minutes of catalog
creation time the corresponding oracle process consumes ~5:30 minutes of CPU
time. So for UFS there is little margin for improvement.



If you have Oracle installed you can easily check yourself. I have uploaded 
my init.ora file and the DB creation script to

http://www.deadcafe.de/perf/
Just modify the variables
. ADMIN (location of the oracle admin files)
. DBFILES (ZFS or UFS where datafiles should be placed)
. and the paths in init.ora

Benchmark results will be in "db.bench" file.


BTW: Why is maxphys still only 56 kByte by default on x86? I have increased 
maxphys to 8MB, but not much difference on the results:


                               db creation   catalog creation
ZFS(off,128K,8K) (*)              0:53.250       13:32.369

(*) maxphys = 8388608


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Oracle on ZFS vs. UFS

2006-05-19 Thread Daniel Rock

Hi,

I'm preparing a personal TPC-H benchmark. The goal is not to measure or
optimize the database performance, but to compare ZFS to UFS in similar
configurations.

At the moment I'm preparing the tests at home. The test setup is as
follows:
. Solaris snv_37
. 2 x AMD Opteron 252
. 4 GB RAM
. 2 x 80 GB ST380817AS
. Oracle 10gR2 (small SGA (320m))

The disks also contain the OS image (mirrored via SVM). On the remaining
space I have created one zpool (on one disk) and one MD volume with a UFS
filesystem on top (on the other disk).
Later I want to rerun the tests on an old E3500 (4x400MHz, 2GB RAM) with
two A5200 attached (~15 still-alive 9GB disks each).

The first results at home are not very promising for ZFS.

I measured:
. database creation
. catalog integration (catalog + catproc)
. tablespace creation
. loading data into the database from dbgen with sqlldr

I can provide all the scripts (and precompiled binaries for qgen and dbgen
(SPARC + x86)) if anyone wants to verify my tests.


In most of these tests UFS was considerable faster than ZFS. I tested
. ZFS with default options
. ZFS with compression enabled
. ZFS without checksums
. UFS (newfs: -f 8192 -i 2097152; tunefs: -e 6144; mount: nologging)


Below the (preliminary) results (with a 1GB dataset from dbgen), runtime
in minutes:seconds

                    UFS   ZFS (default)   ZFS+comp   ZFS+nochksum
db creation        0:38       0:42          0:18        0:40
catalog            6:19      12:05         11:55       12:04
ts creation        0:13       0:14          0:04        0:16
data load[1]       8:49      26:20         25:39       26:19
index creation     0:48       0:38          0:31        0:36
key creation       1:55       1:31          1:18        1:25

[1] dbgen writes into named pipes, which are read back by sqlldr. So no
interim files are created

Esp. on catalog creation and loading data into the database, UFS is faster
than ZFS by a factor of 2-3 (regardless of ZFS options).

Only for read-intensive tasks, and for file creation if compression is enabled,
is ZFS faster than UFS. This is no surprise, since the machine has 4GB
RAM of which at least 3GB are unused, so ZFS has plenty of space for
caching (all datafiles together use just 2.8GB of disk space). If I enlarge
the dataset I suspect that UFS will again gain the lead even on the tests
where ZFS currently performs better.

I will now prepare the query benchmark to see how ZFS performs with a larger
amount of parallelism in the database. In order to also test read throughput
of ZFS vs. UFS, instead of using a larger dataset I will cut the memory the
OS uses by setting physmem to 1GB.
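
(Capping the memory is done via an /etc/system entry along the lines of

set physmem=262144

i.e. 262144 pages of 4 KB = 1 GB, plus a reboot.)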



--
Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: snapshots and directory traversal (inode numbers

2006-05-09 Thread Daniel Rock
Ok,

found the BugID:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6416101

and the relevant code:
http://cvs.opensolaris.org/source/diff/on/usr/src/uts/common/fs/zfs/zfs_dir.c?r2=1.7&r1=1.6

So I will wait for snv_39


-- 
Daniel
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] snapshots and directory traversal (inode numbers

2006-05-09 Thread Daniel Rock
Just noticed this:

# zfs create scratch/test
# cd /scratch/test
# mkdir d1 d2 d3
# zfs snapshot scratch/[EMAIL PROTECTED]
# cd .zfs/snapshot/snap
# ls
d1d2d3
# du -k
1 ./d3
3 .
{so "du" doesn't traverse the other directories 'd1' and 'd2'}

# pwd
/scratch/test/.zfs/snapshot/snap
# cd d3
# pwd
/scratch/test/.zfs/snapshot/snap/d3
# cd ..
# pwd
/scratch/test/.zfs/snapshot
{Hmm, went up two directories at once - this was done with /sbin/sh as shell,
so no 'tricks' on "cd ..", same with 'd1' and 'd2'}

# cd /scratch/test/.zfs/snapshot
# ls -ia
   2 .        1 ..     127 snap
# cd snap
# ls -ia
   3 .        3 ..       4 d1       5 d2       6 d3
{at least 'strange' non-consistent inode numbers}


-- 
Daniel
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS RAM requirements?

2006-05-07 Thread Daniel Rock

Roch Bourbonnais - Performance Engineering wrote:

A already noted, this needs not be different from other FS
but is still an interesting question. I'll touch 3 aspects
here

- reported freemem
- syscall writes to mmap pages
- application write throttling


Reported freemem will be lower when running with ZFS than, say, UFS. The
UFS page cache is considered as freemem. ZFS will return its 'cache' only
when memory is needed. So you will operate with lower freemem but won't
actually suffer from this.



The RAM usage should be made more transparent to the administrator. Just
today, after installing snv_37 on another machine, I couldn't disable swap
because ZFS had grabbed all the free memory it could get and didn't release it
(even after a "zpool export"):

# swap -l
swapfile dev  swaplo blocks   free
/dev/md/dsk/d2  85,4   8 4193272 4193272

# swap -s
total: 275372k bytes allocated + 93876k reserved = 369248k used, 1899492k 
available


# swap -d /dev/md/dsk/d2
/dev/md/dsk/d2: Not enough space

# kstat | grep pp_kernel
pp_kernel   872514

# prtconf | head -2
System Configuration:  Sun Microsystems  i86pc
Memory size: 4095 Megabytes

# zpool export pool
# zpool list
no pools available
# swap -d /dev/md/dsk/d2

This was shortly after installation with not much running on the machine. To 
speed up 'mirroring' of swap I usually do the following (but couldn't in 
this case):


swap -d /dev/md/dsk/d2
metaclear d2
metainit d2 -m d21 d22 0
swap -a /dev/md/dsk/d2




Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss