Re: [zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On 10/01/12 21:32, Richard Elling wrote:
> On Jan 9, 2012, at 7:23 PM, Jesus Cea wrote:
[...]
>> The page is written in Spanish, but the terminal transcriptions
>> should be useful for everybody.
>>
>> In the process, maybe somebody finds this interesting too:
>>
>> http://www.jcea.es/artic/zfs_flash01.htm
>
> Google translate works well for this :-) Thanks for posting!
> -- richard

Talking about this, there is something that bugs me.

For some reason, sync writes are written to the ZIL only if they are
"small". Big writes are far slower, apparently bypassing the ZIL. Maybe
there is some concern about disk bandwidth (because the data would be
written twice), but that is only speculation. This happens even when the
ZIL is on an SSD. I would expect ZFS to write sync writes to the SSD even
if they are quite big (megabytes).

In the "zil.c" code I see things like:

"""
/*
 * Define a limited set of intent log block sizes.
 * These must be a multiple of 4KB. Note only the amount used (again
 * aligned to 4KB) actually gets written. However, we can't always just
 * allocate SPA_MAXBLOCKSIZE as the slog space could be exhausted.
 */
uint64_t zil_block_buckets[] = {
    4096,               /* non TX_WRITE */
    8192+4096,          /* data base */
    32*1024 + 4096,     /* NFS writes */
    UINT64_MAX
};

/*
 * Use the slog as long as the logbias is 'latency' and the current commit size
 * is less than the limit or the total list size is less than 2X the limit.
 * Limit checking is disabled by setting zil_slog_limit to UINT64_MAX.
 */
uint64_t zil_slog_limit = 1024 * 1024;
#define USE_SLOG(zilog) (((zilog)->zl_logbias == ZFS_LOGBIAS_LATENCY) && \
        (((zilog)->zl_cur_used < zil_slog_limit) || \
        ((zilog)->zl_itx_list_sz < (zil_slog_limit << 1))))
"""

I have 2GB of ZIL on a mirrored SSD. I can write to it randomly at
240MB/s, so I guess the sync-write size restriction could be reexamined
when ZFS is using a separate log device with plenty of space to burn :-).
Am I missing anything?

Could I safely change the value of "zil_slog_limit" in the kernel (via
mdb) when using a separate log device? Would it do what I expect?

My usual database block size is 64KB... :-(. The write-ahead log write can
easily be bigger than 128KB (before and after data, plus some changes in
the parent nodes). It seems faster to do several writes with several SYNCs
than one big write with a final SYNC. That is quite counterintuitive. Am I
hitting something else, like the "write throttle"?

PS: I am talking about Solaris 10 U10. My ZFS "logbias" attribute is
"latency".

--
Jesus Cea Avion - j...@jcea.es - http://www.jcea.es/
jabber / xmpp:j...@jabber.org
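For anyone wanting to experiment with this, a minimal sketch of how the
tunable might be inspected and raised on a live kernel, assuming
"zil_slog_limit" exists and is writable on this particular build -- it is
not a documented or supported tunable, so this belongs on a test box, not
in production:

# echo 'zil_slog_limit/J' | mdb -k
# echo 'zil_slog_limit/Z 0x10000000' | mdb -kw

The first command prints the current value as 64-bit hex; the second sets
it to 256MB. The presumed persistent equivalent in /etc/system would be:

* assumption: raise the slog sync-write cutoff from 1MB to 256MB
set zfs:zil_slog_limit = 0x10000000

Since USE_SLOG() appears to be consulted on each log-block allocation, the
mdb change should take effect without a reboot, but that expectation has
not been verified here.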
Re: [zfs-discuss] Unable to allocate dma memory for extra SGL
On Tue, Jan 10, 2012 at 06:23:50PM -0800, Hung-Sheng Tsao (laoTsao) wrote:
> how is the ram size what is the zpool setup and what is your hba and
> hdd size and type

Hmm, actually this system has only 6GB of memory. For some reason I
thought it had more.

The controller is an LSISAS2008 (which, oddly enough, does not seem to be
recognized by lsiutil). There are 23x1TB disks (SATA interface, not SAS
unfortunately) in the system. Three RAIDZ2 vdevs of seven disks each and
one spare comprise a single zpool with two zfs file systems mounted (no
deduplication or compression in use).

There are two internally mounted Intel X-25Es -- these double as the
rootpool and ZIL devices. There is an 80GB X-25M mounted to the expander
along with the 1TB drives, operating as L2ARC.

>
> On Jan 10, 2012, at 21:07, Ray Van Dolson wrote:
>
> > Hi all;
> >
> > We have a Solaris 10 U9 x86 instance running on Silicon Mechanics /
> > SuperMicro hardware.
> >
> > Occasionally under high load (ZFS scrub for example), the box becomes
> > non-responsive (it continues to respond to ping but nothing else works
> > -- not even the local console). Our only solution is to hard reset
> > after which everything comes up normally.
> >
> > Logs are showing the following:
> >
> > Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> > Jan 8 09:44:08 prodsys-dmz-zfs2    MPT SGL mem alloc failed
> > Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> > Jan 8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
> > Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> > Jan 8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
> > Jan 8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> > Jan 8 09:44:10 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
> > Jan 8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> > Jan 8 09:44:10 prodsys-dmz-zfs2    MPT SGL mem alloc failed
> > Jan 8 09:44:11 prodsys-dmz-zfs2 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
> >
> > I am able to resolve the last error by adjusting upwards the duplicate
> > request cache sizes, but have been unable to find anything on the MPT
> > SGL errors.
> >
> > Anyone have any thoughts on what this error might be?
> >
> > At this point, we are simply going to apply patches to this box (we do
> > see an outstanding mpt patch):
> >
> > 147150 -- < 01 R-- 124 SunOS 5.10_x86: mpt_sas patch
> > 147702 -- < 03 R-- 21 SunOS 5.10_x86: mpt patch
> >
> > But we have another identically configured box at the same patch level
> > (admittedly with slightly less workload, though it also undergoes
> > monthly zfs scrubs) which does not experience this issue.
> >
> > Ray

Thanks, Ray
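For readers trying to picture that layout, here is a hypothetical creation
command for a pool shaped like the one described. The device names are
invented, and it assumes the log is a mirror of slices carved from the two
X-25E boot disks, with the 80GB X-25M as the cache device:

# zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
      raidz2 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0 \
      raidz2 c1t14d0 c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0 c1t20d0 \
      spare c1t21d0 \
      log mirror c2t0d0s1 c2t1d0s1 \
      cache c3t0d0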
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, January 8, 2012 00:28, Bob Friesenhahn wrote:
>
> I think that I would also be interested in a system which uses the
> so-called spare disks for more protective redundancy but then reduces
> that protective redundancy in order to use that disk to replace a
> failed disk or to automatically enlarge the pool.
>
> For example, a pool could start out with four-way mirroring when there
> is little data in the pool. When the pool becomes more full, mirror
> devices are automatically removed (from existing vdevs), and used to
> add more vdevs. Eventually a limit would be hit so that no more
> mirrors are allowed to be removed.
>
> Obviously this approach works with simple mirrors but not for raidz.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

I actually disagree about raidz. I have often thought that a "dynamic
raidz" would be a great feature.

For instance, you have a 4-way raidz. What you are saying is you want the
array to survive the loss of a single drive. So, from an empty vdev, it
starts by writing 2 copies of each block, effectively creating a pair of
mirrors. These are quicker to write and quicker to resilver than parity,
and you would likely get a read speed increase too.

As the vdev starts to get full, it starts using a parity-based redundancy,
and converting "older" data to this as well. Performance drops a bit, but
it happens slowly. In addition, any older blocks not yet converted are
still quicker to read and resilver.

This is only a theory, but it is certainly something which could be
considered. It would probably take a lot of rewriting of the raidz code,
though.
Re: [zfs-discuss] Unable to allocate dma memory for extra SGL
How much RAM is in the system? What is the zpool setup, and what are your
HBA and HDD size and type?

Sent from my iPad

On Jan 10, 2012, at 21:07, Ray Van Dolson wrote:

> Hi all;
>
> We have a Solaris 10 U9 x86 instance running on Silicon Mechanics /
> SuperMicro hardware.
>
> Occasionally under high load (ZFS scrub for example), the box becomes
> non-responsive (it continues to respond to ping but nothing else works
> -- not even the local console). Our only solution is to hard reset
> after which everything comes up normally.
>
> Logs are showing the following:
>
> Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jan 8 09:44:08 prodsys-dmz-zfs2    MPT SGL mem alloc failed
> Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jan 8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
> Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jan 8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
> Jan 8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jan 8 09:44:10 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
> Jan 8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jan 8 09:44:10 prodsys-dmz-zfs2    MPT SGL mem alloc failed
> Jan 8 09:44:11 prodsys-dmz-zfs2 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
>
> I am able to resolve the last error by adjusting upwards the duplicate
> request cache sizes, but have been unable to find anything on the MPT
> SGL errors.
>
> Anyone have any thoughts on what this error might be?
>
> At this point, we are simply going to apply patches to this box (we do
> see an outstanding mpt patch):
>
> 147150 -- < 01 R-- 124 SunOS 5.10_x86: mpt_sas patch
> 147702 -- < 03 R-- 21 SunOS 5.10_x86: mpt patch
>
> But we have another identically configured box at the same patch level
> (admittedly with slightly less workload, though it also undergoes
> monthly zfs scrubs) which does not experience this issue.
>
> Ray
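For reference, the information being asked for here can usually be pulled
together with a few standard commands on Solaris 10 -- a generic sketch,
not output from this particular box:

# prtconf | grep -i 'memory size'     (installed RAM)
# echo ::memstat | mdb -k             (how that RAM is currently used)
# zpool status                        (pool and vdev layout, log/cache devices)
# iostat -En                          (disk models, sizes and error counters)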
[zfs-discuss] Unable to allocate dma memory for extra SGL
Hi all;

We have a Solaris 10 U9 x86 instance running on Silicon Mechanics /
SuperMicro hardware.

Occasionally under high load (ZFS scrub for example), the box becomes
non-responsive (it continues to respond to ping but nothing else works --
not even the local console). Our only solution is to hard reset, after
which everything comes up normally.

Logs are showing the following:

Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:08 prodsys-dmz-zfs2    MPT SGL mem alloc failed
Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
Jan 8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:10 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
Jan 8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:10 prodsys-dmz-zfs2    MPT SGL mem alloc failed
Jan 8 09:44:11 prodsys-dmz-zfs2 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free

I am able to resolve the last error by adjusting upwards the duplicate
request cache sizes, but have been unable to find anything on the MPT SGL
errors.

Anyone have any thoughts on what this error might be?

At this point, we are simply going to apply patches to this box (we do see
an outstanding mpt patch):

147150 -- < 01 R-- 124 SunOS 5.10_x86: mpt_sas patch
147702 -- < 03 R-- 21 SunOS 5.10_x86: mpt patch

But we have another identically configured box at the same patch level
(admittedly with slightly less workload, though it also undergoes monthly
zfs scrubs) which does not experience this issue.

Ray
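For the record, the "svc_cots_kdup no slots free" warning is the one
addressed by enlarging the RPC duplicate request caches. A sketch of the
/etc/system entries involved (the values are illustrative, not necessarily
what was used on this system):

* enlarge the RPC duplicate request caches (default is 1024 entries each)
set rpcmod:cotsmaxdupreqs = 4096
set rpcmod:maxdupreqs = 4096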
Re: [zfs-discuss] zfs read-ahead and L2ARC
To follow up on the subject of VDEV caching, even if only of metadata: in
oi_148a, I have found the disabling entry in /etc/system of the LiveUSB:

set zfs:zfs_vdev_cache_size=0

Now that I have the cache turned on and my scrub continues, cache
efficiency so far happens to be 75%. Not bad for a feature turned off by
default:

# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class        misc
zfs:0:vdev_cache_stats:crtime       60.67302806
zfs:0:vdev_cache_stats:delegations  22619
zfs:0:vdev_cache_stats:hits         32989
zfs:0:vdev_cache_stats:misses       10676
zfs:0:vdev_cache_stats:snaptime     39898.161717983

//Jim
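For what it's worth, the 75% figure follows directly from the counters
above: hits / (hits + misses) = 32989 / (32989 + 10676), or about 75.5%.
To turn the cache back on explicitly rather than just removing the
disabling line, the /etc/system entries would look like this -- the values
are the commonly cited historical defaults, so verify them against your
build before relying on them:

* per-vdev device-level read-ahead cache; 0 disables it
set zfs:zfs_vdev_cache_size = 10485760
* only reads smaller than this are inflated into cached read-ahead
set zfs:zfs_vdev_cache_max = 16384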
Re: [zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On Jan 9, 2012, at 7:23 PM, Jesus Cea wrote:

> On 07/01/12 13:39, Jim Klimov wrote:
>> I have transitioned a number of systems roughly by the same
>> procedure as you've outlined. Sadly, my notes are not in English so
>> they wouldn't be of much help directly;
>
> Yes, my Russian is rusty :-).
>
> I have bitten the bullet and spent 3-4 days doing the migration. I
> wrote the details here:
>
> http://www.jcea.es/artic/solaris_zfs_split.htm
>
> The page is written in Spanish, but the terminal transcriptions should
> be useful for everybody.
>
> In the process, maybe somebody finds this interesting too:
>
> http://www.jcea.es/artic/zfs_flash01.htm

Google translate works well for this :-) Thanks for posting!
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Re: [zfs-discuss] making network configuration sticky in nexenta core/napp-it
On Tue, Jan 10, 2012 at 2:21 PM, Garrett D'Amore wrote:
> Put the configuration in /etc/hostname.if0 (where if0 is replaced by the
> name of your interface, such as /etc/hostname.e1000g0).
>
> Without an IP address in such a static file, the system will default to
> DHCP and hence override other settings.
>
> - Garrett
>
> On Jan 10, 2012, at 8:54 AM, Eugen Leitl wrote:
>
>> Sorry for an off-topic question, but does anyone know how to make
>> network configuration (done with ifconfig/route add) sticky in
>> nexenta core/napp-it?
>>
>> After reboot the system reverts to 0.0.0.0 and doesn't listen
>> to /etc/defaultrouter.
>>
>> Thanks.

You may also have to disable network auto-magic:

svcadm disable network/physical:nwam
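Putting the two suggestions together, a minimal static configuration might
look like the following (interface name and addresses are examples only):

# echo '192.168.1.10' > /etc/hostname.e1000g0
# echo '192.168.1.0 255.255.255.0' >> /etc/netmasks
# echo '192.168.1.1' > /etc/defaultrouter
# svcadm disable svc:/network/physical:nwam
# svcadm enable svc:/network/physical:default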
Re: [zfs-discuss] making network configuration sticky in nexenta core/napp-it
Put the configuration in /etc/hostname.if0 (where if0 is replaced by the
name of your interface, such as /etc/hostname.e1000g0).

Without an IP address in such a static file, the system will default to
DHCP and hence override other settings.

 - Garrett

On Jan 10, 2012, at 8:54 AM, Eugen Leitl wrote:

> Sorry for an off-topic question, but does anyone know how to make
> network configuration (done with ifconfig/route add) sticky in
> nexenta core/napp-it?
>
> After reboot the system reverts to 0.0.0.0 and doesn't listen
> to /etc/defaultrouter.
>
> Thanks.
[zfs-discuss] making network configuration sticky in nexenta core/napp-it
Sorry for an off-topic question, but does anyone know how to make network
configuration (done with ifconfig/route add) sticky in nexenta
core/napp-it?

After reboot the system reverts to 0.0.0.0 and doesn't listen to
/etc/defaultrouter.

Thanks.
Re: [zfs-discuss] Doublefree/doubledelete
Hello all,

While it is deemed uncool to reply to one's own posts, there's often no
other choice ;)

Here is some more detail on that failure: the problem was located on the
rpool, and any attempts to import it (including rollback or readonly
modes) lead to an immediate freeze of the system with those warnings on
the console.

My current guess is that ZFS wrongly tries to use a "very" old TXG number
(beyond the actual last 128) which references some overwritten metadata,
leading to apparent inconsistencies such as double allocations and double
frees. I am not sure how to properly "roll forward" the TXGs in the pool
labels so that they point to the newer COW-secured block hierarchy.

Details follow. Some ZDB research has shown that, according to the labels,
the latest TXG is 500179 (zdb -l). However, the pool history mentions
newer TXGs (zdb -h):

2011-12-19.20:00:00 zpool clear rpool
2011-12-19.20:00:00 [internal pool scrub txg:500179] func=1 mintxg=0 maxtxg=500179
2011-12-19.20:00:10 zpool scrub rpool
2011-12-19.20:19:44 [internal pool scrub done txg:500355] complete=1

When I tried the ZFS forensics script from the sources linked below (the
original source link is down at this time), I saw yet newer TXG numbers,
ranging from 500422 to 500549, and not including either of those
discovered above.

Info page:
* [1] http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script
Script code:
* [2] http://markmail.org/download.xqy?id=gde5k3zynpfhftgd&number=1

I tried to roll back to TXG 500535, which was about a minute before the
most recent one and the presumed crash. Here's a screenshot AFTER the
rollback:

# ./zfs_revert.py -bs=512 -tb=75409110 /dev/dsk/c4t1d0s0
512
Total of 75409110 blocks
Reading from the beginning to 131072 blocks
Reading from 75278038 blocks to the end
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 44.791 s, 1.5 MB/s
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 26.8802 s, 2.5 MB/s
TXG     TIME                   TIMESTAMP    BLOCK ADDRESSES
500422  19 Dec 2011 20:25:17   1324326317   [396, 908, 75408268, 75408780]
500423  19 Dec 2011 20:25:22   1324326322   [398, 910, 75408270, 75408782]
...
500530  19 Dec 2011 20:32:27   1324326747   [356, 868, 75408228, 75408740]
500531  19 Dec 2011 20:32:31   1324326751   [358, 870, 75408230, 75408742]
500532  19 Dec 2011 20:32:37   1324326757   [360, 872, 75408232, 75408744]
500533  19 Dec 2011 20:32:40   1324326760   [362, 874, 75408234, 75408746]
500534  19 Dec 2011 20:32:44   1324326764   [364, 876, 75408236, 75408748]
500535  19 Dec 2011 20:32:48   1324326768   [366, 878, 75408238, 75408750]
What is the last TXG you wish to keep?

Apparently, the script did roll back the TXGs on disk (I did not look deep
into it; it probably zeroed and invalidated the newer uberblocks).
However, the pool label still references the out-of-range TXG number
500179.

I've had some strange problems with ZDB's "-t" option: when I reference
the pool with either "-e rpool" or "-e GUIDNUMBER", it complains about not
finding "rpool" (which is not imported and can't be), while without this
flag it apparently uses the wrong old TXG number:

root@openindiana:~# zdb -b -t 500355 -e 12076177533503245216
zdb: can't open 'rpool': No such device or address
root@openindiana:~# zdb -b -F -t 500355 -e 12076177533503245216
zdb: can't open 'rpool': No such device or address
root@openindiana:~# zdb -b -F -e 12076177533503245216

Traversing all blocks to verify nothing leaked ...
error: zfs: freeing free segment (offset=3146341888 size=1024)
Abort (core dumped)

root@openindiana:~# zdb -b -F -e rpool

Traversing all blocks to verify nothing leaked ...
error: zfs: freeing free segment (offset=3146341888 size=1024)
Abort (core dumped)

So... "Kowalski, options?" (C) Madagascar

Thanks,
//Jim Klimov

2011-12-24 1:43, Jim Klimov wrote:
> Hello all,
>
> My computer has recently crashed with the following messages last
> displayed; they also pop up on boot attempts and it freezes:
>
> Dec 20 00:33:12 bofh-sol genunix: [ID 415322 kern.warning] WARNING: zfs: allocating allocated segment(offset=9662417920 size=512)
> Dec 20 00:33:14 bofh-sol genunix: [ID 361072 kern.warning] WARNING: zfs: freeing free segment (offset=9608101376 size=1536)
>
> I believe it is not good ;) But the message gives no info on which pool
> the error was, and a 9GB offset could be on either the rpool or the data
> pool.
>
> What can be done to debug and repair? ;)
>
> Thanks,
> //Jim Klimov
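For completeness, a sketch of the zdb invocations one might use to
cross-check which TXGs the labels actually advertise before attempting
another rollback. The flags are the documented ones for zdb of this era;
whether this combination gets any further than the runs above is untested:

# zdb -l /dev/dsk/c4t1d0s0                 (dump all four labels; compare their txg fields)
# zdb -e -p /dev/dsk -t 500355 -bv rpool   (traverse blocks, trusting no txg newer than 500355)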
error: zfs: freeing free segment (offset=3146341888 size=1024) Abort (core dumped) root@openindiana:~# zdb -b -F -e rpool Traversing all blocks to verify nothing leaked ... error: zfs: freeing free segment (offset=3146341888 size=1024) Abort (core dumped) So... "Kowalsky, options?" (C) Madagascar Thanks, //Jim Klimov 2011-12-24 1:43, Jim Klimov wrote: Hello all, My computer has recently crashed with the following messages last displayed; they also pop up on boot attempts and it freezes: Dec 20 00:33:12 bofh-sol genunix: [ID 415322 kern.warning] WARNING: zfs: allocating allocated segment(offset=9662417920 size=512) Dec 20 00:33:14 bofh-sol genunix: [ID 361072 kern.warning] WARNING: zfs: freeing free segment (offset=9608101376 size=1536) I believe it is not good ;) But the message has even no info on which pool the error was, and 9Gb offset could be on either the rpool and on the data pool. What can be done for debug and repair? ;) Thanks, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss