Re: [zfs-discuss] System started crashing hard after zpool reconfigure and OI upgrade

2013-03-20 Thread Michael Schuster
How about crash dumps?
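For reference, a quick way to confirm whether a dump would even be captured on a panic (a sketch, assuming the stock OpenIndiana dumpadm/savecore setup; paths are the defaults, not anything specific to this box):

  # dumpadm                   # show the configured dump device and savecore directory
  # dumpadm -y                # enable automatic savecore on reboot if it is off
  # savecore -L               # capture a live dump of the running system for comparison
  # mdb -k unix.0 vmcore.0    # inspect a saved dump in /var/crash/<hostname>; ::status and ::stack are good starting points

If the box hangs hard rather than panics, it may never write a dump on its own; forcing one (reboot -d, or an NMI from the service processor) is the usual workaround.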

michael

On Wed, Mar 20, 2013 at 4:50 PM, Peter Wood  wrote:

> I'm sorry. I should have mentioned that I can't find any errors in the
> logs. The last entry in /var/adm/messages is that I removed the keyboard
> after the last reboot, and then it shows the new boot-up messages from when
> I booted the system after the crash. The BIOS event log is empty. I'm not
> sure how to check the IPMI, but IPMI is not configured and I'm not using it.
>
> Just another observation - the crashes are more frequent the more data the
> system serves (NFS).
>
> I'm looking into firmware upgrades for the LSI HBA now.
>
>
> On Wed, Mar 20, 2013 at 8:40 AM, Will Murnane wrote:
>
>> Does the Supermicro IPMI show anything when it crashes?  Does anything
>> show up in event logs in the BIOS, or in system logs under OI?
>>
>>
>> On Wed, Mar 20, 2013 at 11:34 AM, Peter Wood wrote:
>>
>>> I have two identical Supermicro boxes with 32GB ram. Hardware details at
>>> the end of the message.
>>>
>>> They were running OI 151.a.5 for months. The zpool configuration was one
>>> storage zpool with 3 vdevs of 8 disks in RAIDZ2.
>>>
>>> The OI installation is absolutely clean. Just next-next-next until done.
>>> All I do is configure the network after install. I don't install or enable
>>> any other services.
>>>
>>> Then I added more disks and rebuilt the systems with OI 151.a.7, and this
>>> time configured the zpool with 6 vdevs of 5 disks in RAIDZ.
>>>
>>> The systems started crashing really badly. They just disappear from the
>>> network: black, unresponsive console, no error lights, but no activity
>>> indication either. The only way out is to power cycle the system.
>>>
>>> There is no pattern to the crashes. It may crash in 2 days or it may
>>> crash in 2 hours.
>>>
>>> I upgraded the memory on both systems to 128GB, to no avail. This is the
>>> max memory they can take.
>>>
>>> In summary, all I did was upgrade to OI 151.a.7 and reconfigure the zpool.
>>>
>>> Any idea what could be the problem?
>>>
>>> Thank you
>>>
>>> -- Peter
>>>
>>> Supermicro X9DRH-iF
>>> Xeon E5-2620 @ 2.0 GHz 6-Core
>>> LSI SAS9211-8i HBA
>>> 32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K
>>>
>>> ___
>>> zfs-discuss mailing list
>>> zfs-discuss@opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>>
>>>
>>
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>


-- 
Michael Schuster
http://recursiveramblings.wordpress.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] System started crashing hard after zpool reconfigure and OI upgrade

2013-03-20 Thread Michael Schuster
Peter,

sorry if this is so obvious that you didn't mention it: Have you checked
/var/adm/messages and other diagnostic tool output?
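For reference, the other usual suspects besides /var/adm/messages (a sketch; standard Solaris/illumos FMA tooling assumed):

  # fmadm faulty        # anything FMA has already diagnosed as faulty
  # fmdump              # summary of logged fault events
  # fmdump -eV | tail   # raw error telemetry (ereports), verbose
  # iostat -En          # per-device soft/hard/transport error counters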

regards
Michael

On Wed, Mar 20, 2013 at 4:34 PM, Peter Wood  wrote:

> I have two identical Supermicro boxes with 32GB ram. Hardware details at
> the end of the message.
>
> They were running OI 151.a.5 for months. The zpool configuration was one
> storage zpool with 3 vdevs of 8 disks in RAIDZ2.
>
> The OI installation is absolutely clean. Just next-next-next until done.
> All I do is configure the network after install. I don't install or enable
> any other services.
>
> Then I added more disks and rebuilt the systems with OI 151.a.7, and this
> time configured the zpool with 6 vdevs of 5 disks in RAIDZ.
>
> The systems started crashing really badly. They just disappear from the
> network: black, unresponsive console, no error lights, but no activity
> indication either. The only way out is to power cycle the system.
>
> There is no pattern to the crashes. It may crash in 2 days or it may crash
> in 2 hours.
>
> I upgraded the memory on both systems to 128GB, to no avail. This is the
> max memory they can take.
>
> In summary, all I did was upgrade to OI 151.a.7 and reconfigure the zpool.
>
> Any idea what could be the problem?
>
> Thank you
>
> -- Peter
>
> Supermicro X9DRH-iF
> Xeon E5-2620 @ 2.0 GHz 6-Core
> LSI SAS9211-8i HBA
> 32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>


-- 
Michael Schuster
http://recursiveramblings.wordpress.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] help zfs pool with duplicated and missing entry of hdd

2013-01-10 Thread Michael Hase

On Thu, 10 Jan 2013, Jim Klimov wrote:


On 2013-01-10 08:51, Jason wrote:

Hi,

One of my server's zfs faulted and it shows following:
NAME          STATE     READ WRITE CKSUM
backup        UNAVAIL      0     0     0  insufficient replicas
  raidz2-0    UNAVAIL      0     0     0  insufficient replicas
    c4t0d0    ONLINE       0     0     0
    c4t0d1    ONLINE       0     0     0
    c4t0d0    FAULTED      0     0     0  corrupted data
    c4t0d3    FAULTED      0     0     0  too many errors
    c4t0d4    FAULTED      0     0     0  too many errors
...(omit the rest).

My question is why c4t0d0 appeared twice, and c4t0d2 is missing.

I have checked the controller card and hard disks; they are all working fine.


This renaming does seem like an error in detecting (and further naming)
of the disks - i.e. if a connector got loose, and one of the disks is
not seen by the system, the numbering can shift in such manner. It is
indeed strange however that only "d2" got shifted or missing and not
all those numbers after it.

So, did you verify that the controller sees all the disks in the "format"
command (and perhaps, after a cold reboot, in the BIOS)? Just in case, try
unplugging and replugging all cables (power, data) in case their pins got
oxidized over time.


Usually the disk numbering in any Solaris-based OS stays the same if one
disk is offline/missing; it's fixed to the controller port, SCSI target, or
WWN. Imho that's a huge advantage of the c0t0d0 pattern over the Linux or
FreeBSD numbering. I once had an old Sun 5200 hooked up to a Linux box and
one of the 22 disks failed; every disk after the bad one had shifted. What
a mess.


To me the c4t0d0, c4t0d1, ... numbering looks like either a hardware RAID
controller not in JBOD mode, or even an external SAN. JBODs normally show up
as LUN 0 (d0) with different target numbers (t1, t2, ...). Maybe something
is wrong with LUN numbering on your box?
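A minimal sketch of checking how the OS actually enumerates those targets (command names only, nothing specific to this box assumed):

  # format < /dev/null         # list every disk the OS currently sees, with target/LUN numbers
  # cfgadm -al                 # controller/target occupant and attachment state
  # ls -l /dev/dsk/c4t0d*s2    # which physical /devices paths the c4t0dN names point at
  # zdb -l /dev/dsk/c4t0d0s0   # which pool guid/vdev label the device actually carries

If the duplicate c4t0d0 entry persists across a cold reboot, comparing those device paths against the vdev labels is the next step.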


-- Michael
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using L2ARC on an AdHoc basis.

2012-10-13 Thread Michael Armstrong
OK, so it is possible to remove it. Good to know, thanks. I move the pool maybe
once a month for a few days; otherwise it is used daily in a fixed location, so
I thought the warm-up trade-off might be worth it. I guess I just wanted to know
whether adding a cache device was a one-way operation, and whether it risked
integrity.
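For what it's worth, the add/remove cycle Ian describes below is just this (a sketch; pool and device names are placeholders):

  # zpool add tank cache c2t0d0     # attach the SSD as L2ARC while it is present
  # zpool remove tank c2t0d0        # drop it again before moving the enclosure
  # zpool export tank

Cache devices hold no unique data (L2ARC is only a copy of what is already in the pool), so removing or losing one does not affect pool integrity.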


Sent from my iPhone

On 13 Oct 2012, at 23:02, Ian Collins  wrote:

> On 10/14/12 10:02, Michael Armstrong wrote:
>> Hi Guys,
>> 
>> I have a "portable pool" i.e. one that I carry around in an enclosure. 
>> However, any SSD I add for L2ARC, will not be carried around... meaning the 
>> cache drive will become unavailable from time to time.
>> 
>> My question is: Will random removal of the cache drive put the pool into 
>> a "degraded" state or affect the integrity of the pool at all? Additionally, 
>> how adversely will this affect "warm up"...
>> Or will moving the enclosure between machines with and without cache, just 
>> automatically work, and offer benefits when cache is available, and less 
>> benefits when it isn't?
> 
> Why bother with cache devices at all if you are moving the pool around?  As 
> you hinted above, the cache can take a while to warm up and become useful.
> 
> You should zpool remove the cache device before exporting the pool.
> 
> -- 
> Ian.
> 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Using L2ARC on an AdHoc basis.

2012-10-13 Thread Michael Armstrong
Hi Guys,

I have a "portable pool" i.e. one that I carry around in an enclosure. However, 
any SSD I add for L2ARC, will not be carried around... meaning the cache drive 
will become unavailable from time to time.

My question is: Will random removal of the cache drive put the pool into a 
"degraded" state or affect the integrity of the pool at all? Additionally, how 
adversely will this affect "warm up"...
Or will moving the enclosure between machines with and without cache, just 
automatically work, and offer benefits when cache is available, and less 
benefits when it isn't?

I hope this question isn't too much of a ramble :) thanks.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-07 Thread Michael Hase

On Mon, 6 Aug 2012, Christopher George wrote:


I mean this as constructive criticism, not as angry bickering. I totally
respect you guys doing your own thing.


Thanks, I'll try my best to address your comments...


*) At least update the benchmarks on your site to compare against modern
   flash-based competition (not the Intel X25-E, which is seriously
   stone age by now...)


I completely agree we need to refresh the website, not even the photos are 
representative of our shipping product (we now offer VLP DIMMs).

We are engineers first and foremost, but an updated website is in the works.

In the mean time, we have benchmarked against both the Intel 320/710
in my OpenStorage Summit 2011 presentation which can be found at:

http://www.ddrdrive.com/zil_rw_revelation.pdf


Very impressive IOPS numbers, although I have some thoughts on the 
benchmarking method itself. Imho the comparison shouldn't be raw IOPS numbers 
on the DDRdrive itself as tested with iometer (it's only 4 GB), but real-world 
numbers on a real-world pool of spinning disks with the DDRdrive acting as 
ZIL accelerator.


I just introduced an Intel 320 120 GB as ZIL accelerator for a simple zpool 
with two SAS disks in a RAID0 configuration, and it's not as bad as in your 
presentation. It shows about 50% of the possible NFS ops with the SSD as ZIL 
versus no ZIL (sync=disabled on oi151), and about 6x-8x the performance 
compared to the pool without any accelerator and sync=standard. The no-ZIL 
case is the upper limit one can achieve on a given pool; in my case that is 
creation of about 750 small files/sec via NFS. With the SSD it's 380 files/sec 
(the NFS stack is a limiting factor, too). Or about 2400 8k write IOPS with 
the SSD vs. 11900 IOPS with the ZIL disabled, and 250 IOPS without an 
accelerator (GNU dd with oflag=sync). Not bad at all. This could be good 
enough for small businesses and moderately sized pools.
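For reference, the sync-write numbers above come from tests along these lines (a sketch; dataset names and sizes are placeholders, not the exact commands used):

  # dd if=/dev/zero of=/tank/test/syncfile bs=8k count=10000 oflag=sync   # GNU dd, each write synchronous
  # zfs set sync=disabled tank/test     # upper bound: ZIL effectively off
  # dd if=/dev/zero of=/tank/test/syncfile bs=8k count=10000 oflag=sync
  # zfs set sync=standard tank/test     # back to normal semantics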


Michael

--
Michael Hase
edition-software GmbH
http://edition-software.de
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Very poor small-block random write performance

2012-07-18 Thread Michael Traffanstead
I have an 8 drive ZFS array (RAIDZ2 - 1 Spare) using 5900rpm 2TB SATA drives 
with an hpt27xx controller under FreeBSD 10 (but I've seen the same issue with 
FreeBSD 9). 

The system has 8 GB of RAM and I'm letting FreeBSD auto-size the ARC.

Running iozone (from ports), everything is fine for file sizes up to 8GB, but 
when it runs with a 16GB file the random write performance plummets using 64K 
record sizes.

8G  - 64K  -> 52 MB/s
8G  - 128K -> 713 MB/s
8G  - 256K -> 442 MB/s

16G - 64K  -> 7 MB/s
16G - 128K -> 380 MB/s
16G - 256K -> 392 MB/s

Also, sequential small block performance doesn't show such a dramatic slowdown 
either.

16G - 64K -> 108 MB/s (sequential)

There's nothing else using the zpool at the moment, the system is on a separate 
ssd.

I was expecting performance to drop off at 16 GB because that's well above the 
available ARC, but that dramatic a drop-off, and then the sharp improvement at 
128K and 256K, is surprising.

Are there any configuration settings I should be looking at?
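For reference, the failing case can be reproduced in isolation with an invocation along these lines (a sketch; the exact flags of the original runs are not shown above):

  iozone -i 0 -i 2 -r 64k -s 16g -f /pool/testfile -O

i.e. sequential write plus random read/write (-i 0 -i 2) at a 64 KB record size over a 16 GB file, with -O reporting results in ops/sec instead of KB/sec.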

Mike 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-17 Thread Michael Hase

On Tue, 17 Jul 2012, Bob Friesenhahn wrote:


On Tue, 17 Jul 2012, Michael Hase wrote:


If you were to add a second vdev (i.e. stripe) then you should see very 
close to 200% due to the default round-robin scheduling of the writes.


My expectation would be > 200%, as 4 disks are involved. It may not be the 
perfect 4x scaling, but imho it should be (and is for a scsi system) more 
than half of the theoretical throughput. This is solaris or a solaris 
derivative, not linux ;-)


Here are some results from my own machine based on the 'virgin mount' test 
approach.  The results show less boost than is reported by a benchmark tool 
like 'iozone' which sees benefits from caching.


I get an initial sequential read speed of 657 MB/s on my new pool which has 
1200 MB/s of raw bandwidth (if mirrors could produce 100% boost).  Reading 
the file a second time reports 6.9 GB/s.


The below is with a 2.6 GB test file but with a 26 GB test file (just add 
another zero to 'count' and wait longer) I see an initial read rate of 618 
MB/s and a re-read rate of 8.2 GB/s.  The raw disk can transfer 150 MB/s.


To work around these caching effects, just use a file > 2 times the size of 
RAM; iostat then shows the numbers really coming from disk. I always test like 
this. A re-read rate of 8.2 GB/s is really just memory bandwidth, but quite 
impressive ;-)



% pfexec zfs create tank/zfstest/defaults
% cd /tank/zfstest/defaults
% pfexec dd if=/dev/urandom of=random.dat bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 36.8133 s, 71.2 MB/s
% cd ..
% pfexec zfs umount tank/zfstest/defaults
% pfexec zfs mount tank/zfstest/defaults
% cd defaults
% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 3.99229 s, 657 MB/s
% pfexec dd if=/dev/rdsk/c7t5393E8CA21FAd0p0 of=/dev/null bs=128k 
count=2000

2000+0 records in
2000+0 records out
262144000 bytes (262 MB) copied, 1.74532 s, 150 MB/s
% bc
scale=8
657/150
4.3800

It is very difficult to benchmark with a cache which works so well:

% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 0.379147 s, 6.9 GB/s


This is not my point, I'm pretty sure I did not measure any arc effects - 
maybe with the one exception of the raid0 test on the scsi array. Don't 
know why the arc had this effect, filesize was 2x of ram. The point is: 
I'm searching for an explanation for the relative slowness of a mirror 
pair of sata disks, or some tuning knobs, or something like "the disks are 
plain crap", or maybe: zfs throttles sata disks in general (don't know the 
internals).


In the range of > 600 MB/s other issues may show up (pcie bus contention, 
hba contention, cpu load). And performance at this level could be just 
good enough, not requiring any further tuning. Could you recheck with only 
4 disks (2 mirror pairs)? If you just get some 350 MB/s it could be the 
same problem as with my boxes. All sata disks?


Michael



Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-17 Thread Michael Hase

sorry to insist, but still no real answer...

On Mon, 16 Jul 2012, Bob Friesenhahn wrote:


On Tue, 17 Jul 2012, Michael Hase wrote:


So only one thing left: mirror should read 2x


I don't think that mirror should necessarily read 2x faster even though the 
potential is there to do so.  Last I heard, zfs did not include a special 
read scheduler for sequential reads from a mirrored pair.  As a result, 50% 
of the time, a read will be scheduled for a device which already has a read 
scheduled.  If this is indeed true, the typical performance would be 150%. 
There may be some other scheduling factor (e.g. estimate of busyness) which 
might still allow zfs to select the right side and do better than that.


If you were to add a second vdev (i.e. stripe) then you should see very close 
to 200% due to the default round-robin scheduling of the writes.


My expectation would be > 200%, as 4 disks are involved. It may not be the 
perfect 4x scaling, but imho it should be (and is for a scsi system) more 
than half of the theoretical throughput. This is solaris or a solaris 
derivative, not linux ;-)




It is really difficult to measure zfs read performance due to caching 
effects.  One way to do it is to write a large file (containing random data 
such as returned from /dev/urandom) to a zfs filesystem, unmount the 
filesystem, remount the filesystem, and then time how long it takes to read 
the file once.  The reason why this works is because remounting the 
filesystem restarts the filesystem cache.


OK, I did a zpool export/import cycle between the dd write and read tests.
This really empties the ARC; I checked this with arc_summary.pl. The test 
even uses two processes in parallel (doesn't make a difference). The result is 
still the same:


dd write:  2x 58 MB/sec  --> perfect, each disk does > 110 MB/sec
dd read:   2x 68 MB/sec  --> imho too slow, about 68 MB/sec per disk

For writes each disk gets 900 128k io requests/sec with asvc_t in the 8-9 
msec range. For reads each disk only gets 500 io requests/sec, asvc_t 
18-20 msec with the default zfs_vdev_maxpending=10. When reducing 
zfs_vdev_maxpending the asvc_t drops accordingly, the i/o rate remains at 
500/sec per disk, throughput also the same. I think iostat values should 
be reliable here. These high iops numbers make sense as we work on empty 
pools so there aren't very high seek times.
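For reference, the tunable referred to here is normally adjusted like this (a sketch; 3 is just the value used above):

  # echo zfs_vdev_max_pending/W0t3 | mdb -kw                   # on the running kernel
  # echo "set zfs:zfs_vdev_max_pending = 3" >> /etc/system     # persistent across reboots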


All benchmarks (dd, bonnie, will try iozone) lead to the same result: on 
the sata mirror pair read performance is in the range of a single disk. 
For the sas disks (only two available for testing) and for the scsi system 
there is quite good throughput scaling.


Here for comparison a table for 1-4 36gb 15k u320 scsi disks on an old 
sxde box (nevada b130):


             seq write  factor   seq read  factor
             MB/sec              MB/sec
single       82         1        78        1
mirror       79         1        137       1.75
2x mirror    120        1.5      251       3.2

This is exactly what is imho to be expected from mirrors and striped mirrors. 
It just doesn't happen for my sata pool. I still have no reference numbers for 
other sata pools, just one with the 4k/512-byte sector problem, which is even 
slower than mine. It seems the zfs performance people just use sas disks and 
are done with it.


Michael



Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
old ibm dual opteron intellistation with external hp msa30, 36gb 15k u320 scsi 
disks



  pool: scsi1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        scsi1       ONLINE       0     0     0
          c3t4d0    ONLINE       0     0     0

errors: No known data errors

Version  1.96   --Sequential Output-- --Sequential Input- --Random-
Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zfssingle       16G   137  99 82739  20 39453   9   314  99 78251   7 856.9   8
Latency               160ms    4799ms    5292ms   43210us    3274ms    2069ms
Version  1.96   --Sequential Create-- Random Create
zfssingle   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16  8819  34 + +++ 26318  68 20390  73 + +++ 26846  72
Latency 16413us 108us 231us   12206us  46us 124us
1.96,1.96,zfssingle,1,1342514790,16G,,137,99,82739,20,39453,9,314,99,78251,7,856.9,8,16,8819,34,+,+++,26318,68,20390,73,+,+++,26846,72,160ms,4799ms,5292ms,43210us,3274ms,2069ms,16413us,108us,231us,12206us,46us,124us

##

  pool: scsi1
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKS

Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Michael Hase

On Mon, 16 Jul 2012, Edward Ned Harvey wrote:


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Michael Hase

got some strange results, please see
attachements for exact numbers and pool config:

           seq write  factor   seq read  factor
           MB/sec              MB/sec
single     123        1        135       1
raid0      114        1        249       2
mirror     57         0.5      129       1


I agree with you these look wrong.  Here is what you should expect:

seq W   seq R
single  1.0 1.0
stripe  2.0 2.0
mirror  1.0 2.0

You have three things wrong:
(a) stripe should write 2x
(b) mirror should write 1x
(c) mirror should read 2x

I would have simply said "for some reason your drives are unable to operate
concurrently" but you have the stripe read 2x.

I cannot think of a single reason that the stripe should be able to read 2x,
and the mirror only 1x.



Yes, I think so too. In the meantime I switched the two disks to another 
box (hp xw8400, 2 xeon 5150 cpus, 16gb ram). On this machine I did the 
previous sas tests. OS is now OpenIndiana 151a (vs OpenSolaris b130 
before), the mirror pool was upgraded from version 22 to 28, the raid0 
pool newly created. The results look quite different:


          seq write  factor   seq read  factor
          MB/sec              MB/sec
raid0     236        2        330       2.5
mirror    111        1        128       1

Now the raid0 case shows excellent performance, the 330 MB/sec are a bit 
on the optimistic side, maybe some arc cache effects (file size 32gb, 16gb 
ram). iostat during sequential read shows about 115 MB/sec from each disk, 
which is great.


The (really desired) mirror case still has a problem with sequential 
reads. sequential writes to the mirror are twice as fast as before, and 
show the expected performance for a single disk.


So only one thing left: mirror should read 2x

I suspect the difference is not the hardware, both boxess should have 
enough horsepower to easily do sequential reads with way more than 200 
MB/sec. In all tests cpu time (user and system) remained quite low. I 
think it's an OS issue: OpenSolaris b130 is over 2 years old, OI 151a 
dates 11/2011.


Could someone please send me some bonnie++ results for a 2 disk mirror or 
a 2x2 disk mirror pool with sata disks?


Michael

--
Michael Hase
http://edition-software.de
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Michael Hase

On Mon, 16 Jul 2012, Bob Friesenhahn wrote:


On Mon, 16 Jul 2012, Michael Hase wrote:


This is my understanding of zfs: it should load balance read requests even 
for a single sequential reader. zfs_prefetch_disable is the default 0. And 
I can see exactly this scaling behaviour with sas disks and with scsi 
disks, just not on this sata pool.


Is the BIOS configured to use AHCI mode or is it using IDE mode?


Not relevant here, disks are connected to an onboard sas hba (lsi 1068, 
see first post), hardware is a primergy rx330 with 2 qc opterons.




Are the disks 512 byte/sector or 4K?


512 byte/sector, HDS721010CLA330



Maybe it's a corner case which doesn't matter in real world applications? 
The random seek values in my bonnie output show the expected performance 
boost when going from one disk to a mirrored configuration. It's just the 
sequential read/write case, that's different for sata and sas disks.


I don't have a whole lot of experience with SATA disks but it is my 
impression that you might see this sort of performance if the BIOS was 
configured so that the drives were used as IDE disks.  If not that, then 
there must be a bottleneck in your hardware somewhere.


With early nevada releases I had indeed the IDE/AHCI problem, albeit on 
different hardware. Solaris only ran in IDE mode, disks were 4 times 
slower than on linux, see 
http://www.oracle.com/webfolder/technetwork/hcl/data/components/details/intel/sol_10_05_08/2999.html


Wouldn't a hardware bottleneck show up on raw dd tests as well? I can 
stream > 130 MB/sec from each of the two disks in parallel. dd reading 
from more than these two disks at the same time results in a slight 
slowdown, but here we talk about nearly 400 MB/sec aggregated bandwidth 
through the onboard hba, the box has 6 disk slots:


                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.5   0 100 c13t6d0
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.6   0 100 c13t1d0
   93.0    0.0   93.0    0.0  0.0  1.0    0.0   10.7   0 100 c13t2d0
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.5   0 100 c13t5d0

Don't know why this is a bit slower, maybe some pci-e bottleneck. Or 
something with the mpt driver, intrstat shows only one cpu handles all mpt 
interrupts. Or even the slow cpus? These are 1.8ghz opterons.


During sequential reads from the zfs mirror I see > 1000 interrupts/sec on 
one cpu. So it could really be a bottleneck somewhere triggerd by the 
"smallish" 128k i/o requests from the zfs side. I think I'll benchmark 
again on a xeon box with faster cpus, my tests with sas disks were done on 
this other box.


Michael



Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Michael Hase

On Mon, 16 Jul 2012, Bob Friesenhahn wrote:


On Mon, 16 Jul 2012, Stefan Ring wrote:


It is normal for reads from mirrors to be faster than for a single disk
because reads can be scheduled from either disk, with different I/Os being
handled in parallel.


That assumes that there *are* outstanding requests to be scheduled in
parallel, which would only happen with multiple readers or a large
read-ahead buffer.


That is true.  Zfs tries to detect the case of sequential reads and requests 
to read more data than the application has already requested. In this case 
the data may be prefetched from the other disk before the application has 
requested it.


This is my understanding of zfs: it should load balance read requests even 
for a single sequential reader. zfs_prefetch_disable is the default 0. And 
I can see exactly this scaling behaviour with sas disks and with scsi 
disks, just not on this sata pool.


zfs_vdev_max_pending is already tuned down to 3 as recommended for sata 
disks, iostat -Mxnz 2 looks something like


    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  507.1    0.0   63.4    0.0  0.0  2.9    0.0    5.8   1  99 c13t5d0
  477.6    0.0   59.7    0.0  0.0  2.8    0.0    5.8   1  94 c13t4d0

when reading from the zfs mirror. The default zfs_vdev_max_pending=10 
leads to much higher service times in the 20-30msec range, throughput 
remains roughly the same.


I can read from the dsk or rdsk devices in parallel with real platter 
speeds:


dd if=/dev/dsk/c13t4d0s0 of=/dev/null bs=1024k count=8192 &
dd if=/dev/dsk/c13t5d0s0 of=/dev/null bs=1024k count=8192 &

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 2467.5    0.0  134.9    0.0  0.0  0.9    0.0    0.4   1  87 c13t5d0
 2546.5    0.0  139.3    0.0  0.0  0.8    0.0    0.3   1  84 c13t4d0

So I think there is no problem with the disks.

Maybe it's a corner case which doesn't matter in real world applications? 
The random seek values in my bonnie output show the expected performance 
boost when going from one disk to a mirrored configuration. It's just the 
sequential read/write case, that's different for sata and sas disks.


Michael



Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Michael Hase

Hello list,

did some bonnie++ benchmarks for different zpool configurations
consisting of one or two 1tb sata disks (hitachi hds721010cla332, 512
bytes/sector, 7.2k), and got some strange results, please see
attachements for exact numbers and pool config:

          seq write  factor   seq read  factor
          MB/sec              MB/sec
single    123        1        135       1
raid0     114        1        249       2
mirror    57         0.5      129       1

Each of the disks is capable of about 135 MB/sec sequential reads and
about 120 MB/sec sequential writes, iostat -En shows no defects. Disks
are 100% busy in all tests, and show normal service times. This is on
opensolaris 130b, rebooting with openindiana 151a live cd gives the
same results, dd tests give the same results, too. Storage controller
is an lsi 1068 using mpt driver. The pools are newly created and
empty. atime on/off doesn't make a difference.

Is there an explanation why

1) in the raid0 case the write speed is more or less the same as a
single disk.

2) in the mirror case the write speed is cut by half, and the read
speed is the same as a single disk. I'd expect about twice the
performance for both reading and writing, maybe a bit less, but
definitely more than measured.

For comparison I did the same tests with 2 old 2.5" 36gb sas 10k disks
maxing out at about 50-60 MB/sec on the outer tracks.

          seq write  factor   seq read  factor
          MB/sec              MB/sec
single    38         1        50        1
raid0     89         2        111       2
mirror    36         1        92        2

Here we get the expected behaviour: raid0 with about double the
performance for reading and writing, mirror about the same performance
for writing, and double the speed for reading, compared to a single
disk. An old scsi system with 4x2 mirror pairs also shows these
scaling characteristics, about 450-500 MB/sec seq read and 250 MB/sec
write, each disk capable of 80 MB/sec. I don't care about absolute
numbers, just don't get why the sata system is so much slower than
expected, especially for a simple mirror. Any ideas?

Thanks,
Michael

--
Michael Hase
http://edition-software.de

  pool: ptest
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ptest       ONLINE       0     0     0
          c13t4d0   ONLINE       0     0     0

Version  1.96   --Sequential Output-- --Sequential Input- --Random-
Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zfssingle       32G    79  98 123866  51 63626  35   255  99 135359  25 530.6  13
Latency               333ms     111ms    5283ms   73791us     465ms    2535ms
Version  1.96   --Sequential Create-- Random Create
zfssingle   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16  4536  40 + +++ 14140  50 10382  69 + +++  6260  73
Latency 21655us 154us 206us   24539us  46us 405us
1.96,1.96,zfssingle,1,1342165334,32G,,79,98,123866,51,63626,35,255,99,135359,25,530.6,13,16,4536,40,+,+++,14140,50,10382,69,+,+++,6260,73,333ms,111ms,5283ms,73791us,465ms,2535ms,21655us,154us,206us,24539us,46us,405us

###

  pool: ptest
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ptest       ONLINE       0     0     0
          c13t4d0   ONLINE       0     0     0
          c13t5d0   ONLINE       0     0     0

Version  1.96   --Sequential Output-- --Sequential Input- --Random-
Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zfsstripe       32G    78  98 114243  46 72938  37   192  77 249022  44 815.1  20
Latency               483ms     106ms    5179ms    3613ms     259ms    1567ms
Version  1.96   --Sequential Create-- Random Create
zfsstripe   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16  6474  53 + +++ 15505  47  8562  81 + +++ 10839  65
Latency 21894us 131us 208us   22203us  52us 230us
1.96,1.96,zfsstripe,1,1342172768,32G,,78,98,114243,46,72938,37,192,77,249022,44,815.1,20,16,6474,53,+,+++,15505,47,8562,81,+,+++,10839,65,483ms,106ms,5179ms,3613ms,259ms,1567ms,21894us,131us,208us,22203us,52us,230us



  pool: ptest
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
ptest        ONLINE   0 0 0
  mirror-0   ONLINE   0 0 0
c13t4d0  O

Re: [zfs-discuss] Drive upgrades

2012-04-13 Thread Michael Armstrong
Yes, this is another thing I'm wary of... I should have slightly 
under-provisioned at the start or mixed manufacturers... Now I may have to 
replace a failed 2TB drive with a 2.5TB one for the sake of a few blocks.

Sent from my iPhone

On 13 Apr 2012, at 17:30, Tim Cook  wrote:

> 
> 
> On Fri, Apr 13, 2012 at 9:35 AM, Edward Ned Harvey 
>  wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Michael Armstrong
> >
> > Is there a way to quickly ascertain if my seagate/hitachi drives are as
> large as
> > the 2.0tb samsungs? I'd like to avoid the situation of replacing all
> drives and
> > then not being able to grow the pool...
> 
> It doesn't matter.  If you have a bunch of drives that are all approx the
> same size but vary slightly, and you make (for example) a raidz out of them,
> then the raidz will only be limited by the size of the smallest one.  So you
> will only be wasting 1% of the drives that are slightly larger.
> 
> Also, given that you have a pool currently made up of 13x2T and 5x1T ... I
> presume these are separate vdev's.  You don't have one huge 18-disk raidz3,
> do you?  that would be bad.  And it would also mean that you're currently
> wasting 13x1T.  I assume the 5x1T are a single raidzN.  You can increase the
> size of these disks, without any cares about the size of the other 13.
> 
> Just make sure you have the autoexpand property set.
> 
> But most of all, make sure you do a scrub first, and make sure you complete
> the resilver in between each disk swap.  Do not pull out more than one disk
> (or whatever your redundancy level is) while it's still resilvering from the
> previously replaced disk.  If you're very thorough, you would also do a
> scrub in between each disk swap, but if it's just a bunch of home movies
> that are replaceable, you will probably skip that step.
> 
> 
> You will however have an issue replacing them if one should fail.  You need 
> to have the same block count to replace a device, which is why I asked for a 
> "right-sizing" years ago.  Deaf ears :/
> 
> --Tim
>  
>  
> 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Drive upgrades

2012-04-13 Thread Michael Armstrong
Hi Guys,

I currently have an 18 drive system built from 13x 2.0TB Samsungs and 5x WD 1TB's... I'm about to swap out all of my 1TB drives with 2TB ones to grow the pool a bit. My question is: the replacement 2TB drives are from various manufacturers (Seagate/Hitachi/Samsung), and I know from previous experience that the geometry/boundaries of each manufacturer's 2TB offerings are different. Is there a way to quickly ascertain if my Seagate/Hitachi drives are as large as the 2.0TB Samsungs? I'd like to avoid the situation of replacing all drives and then not being able to grow the pool...

Thanks,
Michael
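For reference, a quick way to compare the raw capacities before committing to the swap (a sketch; device and pool names are placeholders):

  # iostat -En | egrep 'Soft Errors|Size'    # reported capacity in bytes per device
  # prtvtoc /dev/rdsk/c5t0d0                 # sector size and accessible sector count (use the sN slice on SMI-labeled disks)
  # zpool get autoexpand tank                # needs to be on before the pool can grow after the swap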
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2012-01-04 Thread Michael Sullivan

On 3 Jan 12, at 04:22 , Darren J Moffat wrote:

> On 12/28/11 06:27, Richard Elling wrote:
>> On Dec 27, 2011, at 7:46 PM, Tim Cook wrote:
>>> On Tue, Dec 27, 2011 at 9:34 PM, Nico Williams  
>>> wrote:
>>> On Tue, Dec 27, 2011 at 8:44 PM, Frank Cusack  wrote:
>>>> So with a de facto fork (illumos) now in place, is it possible that two
>>>> zpools will report the same version yet be incompatible across
>>>> implementations?
>> 
>> This was already broken by Sun/Oracle when the deduplication feature was not
>> backported to Solaris 10. If you are running Solaris 10, then zpool version 
>> 29 features
>> are not implemented.
> 
> Solaris 10 does have some deduplication support, it can import and read 
> datasets in a deduped pool just fine.  You can't enable dedup on a dataset 
> and any writes won't dedup they will "rehydrate".
> 
> So it is more like partial dedup support rather than it not being there at 
> all.

"rehydrate"???


Is it instant or freeze dried?


Mike

- ---
Michael Sullivan   
m...@axsh.us
http://www.axsh.us/
Phone: +1-662-259-
Mobile: +1-662-202-7716

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-11-14 Thread Michael Schuster
On Mon, Nov 14, 2011 at 14:40, Paul Kraus  wrote:
> On Fri, Nov 11, 2011 at 9:25 PM, Edward Ned Harvey
>  wrote:
>
>> LOL.  Well, for what it's worth, there are three common pronunciations for
>> btrfs.  Butterfs, Betterfs, and B-Tree FS (because it's based on b-trees.)
>> Check wikipedia.  (This isn't really true, but I like to joke, after saying
>> something like that, I wrote the wikipedia page just now.)   ;-)
>
> Is it really B-Tree based? Apple's HFS+ is B-Tree based and falls
> apart (in terms of performance) when you get too many objects in one
> FS, which is specifically what drove us to ZFS. We had 4.5 TB of data
> in about 60 million files/directories on an Apple X-Serve and X-RAID
> and the overall response was terrible. We moved the data to ZFS and
> the performance was limited by the Windows client at that point.
>
>> Speaking of which. zettabyte filesystem.   ;-)  Is it just a dumb filesystem
>> with a lot of address bits?  Or is it something that offers functionality
>> that other filesystems don't have?      ;-)
>
> The stories I have heard indicate that the name came after the TLA.
> "zfs" came first and "zettabyte" later.

as Jeff told it (IIRC), the "expanded" version of zfs underwent
several changes during the development phase, until it was decided one
day to attach none of them to "zfs" and just have it be "the last word
in filesystems". (perhaps he even replied to a similar message on this
list ... check the archives :-)

regards
-- 
Michael Schuster
http://recursiveramblings.wordpress.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remove corrupt files from snapshot

2011-11-03 Thread Michael Schuster
Hi,

snapshots are read-only by design; you can clone them and manipulate
the clone, but the snapshot itself remains r/o.
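A minimal sketch of the clone route, using the dataset and snapshot names visible in the zpool status output below (the clone name is a placeholder):

  # zfs clone backups/memory_card@20110218230726 backups/memory_card_fixed
  # rm /backups/memory_card_fixed/Backup/Backup.arc
  # zfs promote backups/memory_card_fixed    # optional: make the clone independent of the snapshot

If the goal is only to stop rsync/diff from tripping over the corrupt copies, destroying the affected snapshots outright (zfs destroy backups/memory_card@20110218230726) is the shorter path.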

HTH
Michael

On Thu, Nov 3, 2011 at 13:35,   wrote:
>
> Hello,
>
> I have got a bunch of corrupted files in various snapshots on my ZFS file 
> backing store. I was not able to recover them, so I decided to remove them 
> all; otherwise they continuously make trouble for my incremental backups 
> (rsync, diff, etc. fail).
>
> However, snapshots seem to be read-only:
>
> # zpool status -v
>  pool: backups
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>        corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>        entire pool from backup.
>   see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>        NAME        STATE     READ WRITE CKSUM
>        backups     ONLINE       0     0    13
>          md0       ONLINE       0     0    13
> errors: Permanent errors have been detected in the following files:
>        /backups/memory_card/.zfs/snapshot/20110218230726/Backup/Backup.arc
> ...
>
> # rm /backups/memory_card/.zfs/snapshot/20110218230726/Backup/Backup.arc
> rm: /backups/memory_card/.zfs/snapshot/20110218230726/Backup/Backup.arc: 
> Read-only file system
>
>
> Is there any way to force the file removal?
>
>
> Cheers,
> B.
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>



-- 
Michael Schuster
http://recursiveramblings.wordpress.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-10-17 Thread Michael DeMan
Or, if you absolutely must run linux for the operating system, see: 
http://zfsonlinux.org/

On Oct 17, 2011, at 8:55 AM, Freddie Cash wrote:

> If you absolutely must run Linux on your storage server, for whatever reason, 
> then you probably won't be running ZFS.  For the next year or two, it would 
> probably be safer to run software RAID (md), with LVM on top, with XFS or 
> Ext4 on top.  It's not the easiest setup to manage, but it would be safer 
> than btrfs.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] commercial zfs-based storage replication software?

2011-10-01 Thread Michael Sullivan
On 1 Oct 11, at 08:01 , Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Fajar A. Nugraha
>> 
>> On Sat, Oct 1, 2011 at 5:06 AM, Edward Ned Harvey
>>  wrote:
>>> Have you looked at Sun Unified Storage, AKA the 7000 series?
>> 
>> Thanks, that would be my fallback plan (along with nexentastor and
> netapp).
> 
> So you're basically looking for installable 3rd party software that
> replicates that functionality?  I don't know of any, but that's not saying
> much, because when it comes to ZFS, I'm not very platform explorative.

As I said before, hack an open source job scheduler, or find one that allows 
creating jobs with parameters or panels for custom fields, to put together the 
crontab command, and wrap it in something that preserves the output of cron, 
storing it in a database or similar rather than emailing it, as well as keeping 
track of success or failure and notifying someone (and/or restarting) in the 
event of failure. Which also probably means it needs to do distributed process 
management to kick off everything it needs to. It should probably be ZFS-aware 
so it can present filesystems and select based on filesystem rather than job.

Oracle Enterprise Manager does this.  It's commercial, and I'm sure they would 
negotiate on price for you and give you a good deal if you are good at 
bargaining with your Oracle Sales Rep.

I think his requirements are being driven by a PHB who wants to see a "GUI".

crontab, ssh - functionality already there, simple and not many "moving parts" 
but obviously too obfuscated for the PHB to understand.

Good luck.

Mike

---
Michael Sullivan   
m...@axsh.us
http://www.axsh.us/
Phone: +1-662-259-
Mobile: +1-662-202-7716

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] commercial zfs-based storage replication software?

2011-09-30 Thread Michael Sullivan
Maybe I'm missing something here, but Amanda has a whole bunch of bells and 
whistles, and scans the filesystem to determine what should be backed up.  Way 
overkill for this task I think.

Seems to me like zfs send blah | ssh replicatehost zfs receive …  more than 
meets the requirement when combined with just plain old crontab.
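For what it's worth, the moving parts really are this small (a sketch; host, pool, and snapshot names are placeholders):

  # run from cron on the source host:
  zfs snapshot tank/data@`date +%Y%m%d%H%M`
  zfs send -i tank/data@PREVIOUS tank/data@LATEST | ssh replicatehost zfs receive -F tank/data

Everything a commercial wrapper adds on top (scheduling GUI, success/failure tracking, alerting) is bookkeeping around those two commands, which is essentially the point here.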

If it's a graphical interface you're looking for, I'm sure someone has hacked 
together something in Tcl/Tk or Perl/Tk as an interface to cron, which you 
could probably adapt to construct your particular crontab entry.

Just a thought,

Mike

---
Michael Sullivan   
m...@axsh.us
http://www.axsh.us/
Phone: +1-662-259-
Mobile: +1-662-202-7716

On 30 Sep 11, at 07:33 , Fajar A. Nugraha wrote:

> On Fri, Sep 30, 2011 at 7:22 PM, Edward Ned Harvey
>  wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Fajar A. Nugraha
>>> 
>>> Does anyone know a good commercial zfs-based storage replication
>>> software that runs on Solaris (i.e. not an appliance, not another OS
>>> based on solaris)?
>>> Kinda like Amanda, but for replication (not backup).
>> 
>> Please define replication, not backup?  To me, your question is unclear what
>> you want to accomplish.  What don't you like about zfs send | zfs receive?
> 
> Basically I need something that does zfs send | zfs receive, plus
> GUI/web interface to configure stuff (e.g. which fs to backup,
> schedule, etc.), support, and a price tag.
> 
> Believe it or not the last two requirement are actually important
> (don't ask :P ), and are the main reasons why I can't use automated
> send - receive scripts already available from the internet.
> 
> CMIIW, Amanda can use "zfs send" but it only store the resulting
> stream somewhere, while the requirement for this one is that the send
> stream must be received on a different server (e.g. DR site) and be
> accessible there.
> 
> -- 
> Fajar
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Mike

---
Michael Sullivan   
m...@axsh.us
http://www.axsh.us/
Phone: +1-662-259-
Mobile: +1-662-202-7716

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I'm back!

2011-09-02 Thread Michael DeMan
Warm welcomes back.

So whats neXt?

- Mike DeMan


On Sep 2, 2011, at 6:30 PM, Erik Trimble wrote:

> Hi folks.
> 
> I'm now no longer at Oracle, and the past couple of weeks have been a bit of 
> a mess for me as I disentangle myself from it.
> 
> I apologize to those who may have tried to contact me during August, as my 
> @oracle.com email is no longer being read by myself, and I didn't have a lot 
> of extra time to devote to things like making sure my email subscription 
> lists pointed to my personal email. I've done that now.
> 
> I now have a free(er) hand to do some work in IllumOS (hopefully, in ZFS in 
> particular), so I'm looking forward to getting back into the swing of things. 
> And, hopefully, not be too much of a PITA.
> 
> :-)
> 
> -Erik Trimble
> tr...@netdemons.com
> 
> 
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Advice with SSD, ZIL and L2ARC

2011-08-29 Thread Michael DeMan
Are you truly new to ZFS?   Or do you work for NetApp or EMC or somebody else 
that is curious?

- Mike

On Aug 29, 2011, at 9:15 PM, Jesus Cea wrote:

> 
> Hi all. Sorry if I am asking a FAQ, but I haven't found a really
> authorizative answer to this. Most references are old, incomplete or
> of "I have heard of" kind.
> 
> I am running Solaris 10 Update 9, and my pool is v22.
> 
> I recently got two 40GB SSD I plan to add to my pool. My idea is this:
> 
> 1. Format each SSD as 39GB+1GB.
> 2. Use the TWO 39GB's as L2ARC, with no redundancy.
> 3. Use the TWO 1GB's as mirrored ZIL.
> 
> 1GB of ZIL seems more than enough for my needs. I have synchronous
> writes, but they are, 99.9% of the time, <1MB/s, with occasional bursts.
> 
> My main concern here is about pool stability if there have any kind of
> problem with the SSD's. Especifically:
> 
> 1. Is the L2ARC data stored in the SSD checksummed?. If so, can I
> expect that ZFS goes directly to the disk if the checksum is wrong?.
> 
> 2. Can I import a POOL if one/both L2ARC's are not available?.
> 
> 3. What happend if a L2ARC device, suddenly, "dissappears"?.
> 
> 4. Any idea if L2ARC content will be persistent across system
> rebooting "eventually"?
> 
> 5. Can I import a POOL if one/both ZIL devices are not available?. My
> pool is v22. I know that I can remove ZIL devices since v19, but I
> don't know if I can remove them AFTER they are physically unavailable,
> of before importing the pool (after a reboot).
> 
> 6. Can I remove a ZIL device after ZFS consider it "faulty"?.
> 
> 7. What if a ZIL device "dissapears", suddenly?. I know that I could
> lose "committed" transactions in-fight, but will the machine crash?.
> Will it fallback to ZIL on harddisk?.
> 
> 8. Since my ZIL will be mirrored, I assume that the OS will actually
> will look for transactions to be replayed in both devices (AFAIK, the
> ZIL chain is considered done when the checksum of the last block is
> not valid, and I wonder how this interacts with ZIL device mirroring).
> 
> 9. If a ZIL device mirrored goes offline/online, will it resilver from
> the other side, or it will simply get new transactions, since old
> transactions are irrelevant after ¿30? seconds?.
> 
> 10. What happens if my 1GB of ZIL is too optimistic?. Will ZFS use the
> disks or it will stop writers until flushing ZIL to the HDs?.
> 
> Anything else I should consider?.
> 
> As you can see, my concerns concentrate in what happens if the SSDs go
> bad or "somebody" unplugs them "live".
> 
> I have backup of (most) of my data, but rebuilding a 12TB pool from
> backups, in a production machine, in a remote hosting, would be
> something I rather avoid :-p.
> 
> I know that hybrid HD+SSD pools were a bit flacky in the past (you
> lost the ZIL device, you kiss goodbye to your ZPOOL, in the pre-v19
> days), and I want to know what terrain I am getting into.
> 
> PS: I plan to upgrade to S10 U10 when available, and I will upgrade
> the ZPOOL version after a while.
> 
> - -- 
> Jesus Cea Avion _/_/  _/_/_/_/_/_/
> j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
> jabber / xmpp:j...@jabber.org _/_/_/_/  _/_/_/_/_/
> .  _/_/  _/_/_/_/  _/_/  _/_/
> "Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
> "My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
> "El amor es poner tu felicidad en la felicidad de otro" - Leibniz
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Kernel panic on zpool import. 200G of data inaccessible!

2011-08-22 Thread Michael DeMan
I can not help but agree with Tim's comment below.

If you want a free version of ZFS, in which case you are still responsible for 
things yourself - like having backups, then maybe:

www.freenas.org
www.linuxonzfs.org
www.openindiana.org

Meanwhile, it is grossly inappropriate to complain about lack of support when 
using an operating system / file system that you know has no support. Doubly so 
if your data is important, and doubly so again if you did not already back it 
up.

- mike

On Aug 19, 2011, at 6:54 AM, Tim Cook wrote:

> 
> 
> You digitally signed a license agreement stating the following:
> No Technical Support
> Our technical support organization will not provide technical support, phone 
> support, or updates to you for the Programs licensed under this agreement.
> 
> To turn around and keep repeating that they're "holding your data hostage" is 
> disingenuous at best.  Nobody is holding your data hostage.  You voluntarily 
> put it on an operating system that explicitly states doesn't offer support 
> from the parent company.  Nobody from Oracle is going to show up with a patch 
> for you on this mailing list because none of the Oracle employees want to 
> lose their job and subsequently be subjected to a lawsuit.  If that's what 
> you're planning on waiting for, I'd suggest you take a new approach.
> 
> Sorry to be a downer, but that's reality.
> 
> --Tim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disable ZIL - persistent

2011-08-05 Thread Michael Sullivan
On 5 Aug 11, at 08:14 , Darren J Moffat wrote:

> On 08/05/11 13:11, Edward Ned Harvey wrote:
>> 
>> My question: Is there any way to make Disabled ZIL a normal mode of
>> operations in solaris 10? Particularly:
>> 
>> If I do this "echo zil_disable/W0t1 | mdb -kw" then I have to remount
>> the filesystem. It's kind of difficult to do this automatically at boot
>> time, and impossible (as far as I know) for rpool. The only solution I
>> see is to write some startup script which applies it to filesystems
>> other than rpool. Which feels kludgy. Is there a better way?
> 
> echo "set zfs:zil_disable = 1" > /etc/system

echo "set zfs:zil_disable = 1" >> /etc/system

Mike

---
Michael Sullivan   
m...@axsh.us
http://www.axsh.us/
Phone: +1-662-259-
Mobile: +1-662-202-7716


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zil on multiple usb keys

2011-07-22 Thread Michael DeMan
+1 on the below, and in addition...

...compact flash, like that in USB sticks, is not designed to deal with very many 
writes. Commonly it is used to store a bootable image that maybe gets an upgrade 
once a year.

Basically, if you try to use those devices for a ZIL, even mirrored, you should 
be prepared to have one die and need replacement very, very regularly.

Performance is generally going to be pretty bad as well - USB sticks are not made 
to be written to rapidly. They are entirely different animals from SSDs. I would 
not be surprised (but would be curious to know, if you still move forward on 
this) if you find performance even worse trying to do this.

On Jul 18, 2011, at 1:54 AM, Fajar A. Nugraha wrote:

> First of all, using USB disks for permanent storage is a bad idea. Go
> for e-sata instead (http://en.wikipedia.org/wiki/Serial_ata#eSATA). It

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resizing ZFS partition, shrinking NTFS?

2011-06-17 Thread Michael Sullivan
On 17 Jun 11, at 21:14 , Bob Friesenhahn wrote:

> On Fri, 17 Jun 2011, Jim Klimov wrote:
>> I gather that he is trying to expand his root pool, and you can
>> not add a vdev to one. Though, true, it might be possible to
>> create a second, data pool, in the partition. I am not sure if
>> zfs can make two pools in different partitions of the same
>> device though - underneath it still uses Solaris slices, and
>> I think those can be used on one partition. That was my
>> assumption for a long time, though never really tested.
> 
> This would be a bad assumption.  Zfs should not care and you are able to do 
> apparently silly things with it.  Sometimes allowing potentially silly things 
> is quite useful.
> 

This is true.  If one has mirrored disks, you could do something like I explain 
here WRT partitioning and resizing pools.

http://www.kamiogi.net/Kamiogi/Frame_Dragging/Entries/2009/5/19_Everything_in_Its_Place_-_Moving_and_Reorganizing_ZFS_Storage.html

I did some shuffling using Solaris partitions here on a home server, but it was 
using mirrors of the same geometry disks.

You might be able to do a similar shuffle using an appropriately sized 
external USB drive with autoexpand turned on.
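A minimal sketch of that shuffle for a root pool (all device names are placeholders; each attach needs the resilver to finish, per zpool status, before the next detach):

  # zpool set autoexpand=on rpool
  # zpool attach rpool c0t0d0s0 c5t0d0s0    # mirror onto the USB disk, wait for resilver
  # zpool detach rpool c0t0d0s0             # drop the small internal slice
  (repartition the internal disk, attach it back, then detach the USB disk)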

Mike

---
Michael Sullivan   
m...@axsh.us
http://www.axsh.us/
Phone: +1-662-259-
Mobile: +1-662-202-7716

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-17 Thread Michael Sullivan
On 17 Jun 11, at 21:02 , Ross Walker wrote:

> On Jun 16, 2011, at 7:23 PM, Erik Trimble  wrote:
> 
>> On 6/16/2011 1:32 PM, Paul Kraus wrote:
>>> On Thu, Jun 16, 2011 at 4:20 PM, Richard Elling
>>>   wrote:
>>> 
>>>> You can run OpenVMS :-)
>>> Since *you* brought it up (I was not going to :-), how does VMS'
>>> versioning FS handle those issues ?
>>> 
>> It doesn't, per se.  VMS's filesystem has a "versioning" concept (i.e. every 
>> time you do a close() on a file, it creates a new file with the version 
>> number appended, e.g.  foo;1  and foo;2  are the same file, different 
>> versions).  However, it is completely missing the rest of the features we're 
>> talking about, like data *consistency* in that file. It's still up to the 
>> app using the file to figure out what data consistency means, and such.  
>> Really, all VMS adds is versioning, nothing else (no API, no additional 
>> features, etc.).
> 
> I believe NTFS was built on the same concept of file streams the VMS FS used 
> for versioning.
> 
> It's a very simple versioning system.
> 
> Personally I use SharePoint, but there are other content management systems 
> out there that provide what you're looking for, so no need to bring out the 
> crypt keeper.
> 

I think, from following this whole discussion, that people are wanting "Versions",
which will be offered by OS X Lion soon. However, it is dependent upon
applications playing nice, behaving, and using the "standard" APIs.

It would likely take a major overhaul in the way ZFS handles snapshots to
create them at the object level rather than the filesystem level.  Might be a
nice exploratory exercise for those in the know with the ZFS roadmap, but then
there are two "roadmaps", right?

Also consistency and integrity cannot be guaranteed on the object level since 
an application may have more than a single filesystem object in use at a time 
and operations would need to be transaction based with commits and rollbacks.

Way off-topic, but Smalltalk and its variants do this by maintaining the state 
of everything in an operating environment image.

But then again, I could be wrong.

Mike

---
Michael Sullivan   
m...@axsh.us
http://www.axsh.us/
Phone: +1-662-259-
Mobile: +1-662-202-7716

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resizing ZFS partition, shrinking NTFS?

2011-06-16 Thread Michael Schuster

On 17.06.2011 01:44, John D Groenveld wrote:

In message<444915109.61308252125289.JavaMail.Twebapp@sf-app1>, Clive Meredith
writes:

I currently run a dual-boot machine with a 45GB partition for Win7 Ultimate and
a 25GB partition for OpenSolaris 10 (134).  I need to shrink NTFS to 20GB and
increase the ZFS partition to 45GB.  Is this possible please?  I have looked at
using the partition tool in OpenSolaris but both partitions are locked, even
under admin.  Win7 won't allow me to shrink the dynamic volume, as the Finish
button is always greyed out, so no luck in that direction.


Shrink the NTFS filesystem first.
I've used the Knoppix LiveCD against a defragmented NTFS.

Then use beadm(1M) to duplicate your OpenSolaris BE to
a USB drive and also send snapshots of any other rpool ZFS
there.


I'd suggest a somewhat different approach:
1) boot a live cd and use something like parted to shrink the NTFS partition
2) create a new partition without FS in the space now freed from NTFS
3) boot OpenSolaris, add the partition from 2) as a vdev to your zpool (rough command sketch below).
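A rough sketch of step 3), assuming the freed space ended up as an fdisk
partition visible as c1t0d0p2 (the device and the pool name "tank" are
hypothetical). As noted earlier in this thread, a vdev can be added to a data
pool but not to the root pool, and 'zpool add' is permanent:

zpool add tank c1t0d0p2     # grow the data pool with the new partition
zpool status tank           # verify the new top-level vdev shows up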

HTH
Michael
--
Michael Schuster
http://recursiveramblings.wordpress.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-15 Thread Michael Schuster

On 15.06.2011 14:30, Simon Walter wrote:


Another one is that snapshots are per-filesystem, while the intention
here is to capture a document in one user session. Taking a snapshot
will of course say nothing about the state of other user sessions. Any
document in the process of being saved by another user, for example,
will be corrupt.


Would it be? I think that's pretty lame for ZFS to corrupt data.


I think "corrupt" is not the right word to use here - "inconsistent" is 
probably better. ZFS has no idea when a document is "OK", so if your 
snapshot happens between two writes (even from a single user), it will 
be consistent from the POV of the FS, but may not be from the POV of the 
application.


HTH
Michael
--
Michael Schuster
http://recursiveramblings.wordpress.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Have my RMA... Now what??

2011-05-28 Thread Michael DeMan
Yes, particularly if you have older drives with 512-byte sectors and then buy a
newer drive that seems the same but is not, because it has 4k sectors.  It looks
like it works, and it will work, but performance drops.


On May 28, 2011, at 4:59 PM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. wrote:

> yes, good idea. Another thing to keep in mind:
> technology changes so fast that by the time you want a replacement, maybe the HDD
> doesn't exist any more,
> or the supplier has changed, so the drives are not exactly like your original
> drive
> 
> 
> 
> 
> On 5/28/2011 6:05 PM, Michael DeMan wrote:
>> Always pre-purchase one extra drive to have on hand.  When you get it, 
>> confirm it was not dead-on-arrival by hooking it up to a workstation over 
>> external USB and running whatever your favorite tools are to validate it is 
>> okay.  Then put it back in its original packaging, and put a label on it 
>> about what it is, and that it is a spare for box(s) XYZ disk system.
>> 
>> When a drive fails, use that one off the shelf to do your replacement 
>> immediately then deal with the RMA, paperwork, and snailmail to get the bad 
>> drive replaced.
>> 
>> Also, depending how many disks you have in your array - keeping multiple 
>> spares can be a good idea as well to cover another disk dying while waiting 
>> on that replacement one.
>> 
>> In my opinion, the above goes whether you have your disk system configured 
>> with hot spare or not.  And the technique is applicable to both 
>> personal/home-use and commercial uses if your data is important.
>> 
>> 
>> - Mike
>> 
>> On May 28, 2011, at 9:30 AM, Brian wrote:
>> 
>>> I have a raidz2 pool with one disk that seems to be going bad, several 
>>> errors are noted in iostat.  I have an RMA for the drive; however, now I am 
>>> wondering how I proceed.  I need to send the drive in and then they will 
>>> send me one back.  If I had the drive on hand, I could do a zpool replace.
>>> 
>>> Do I do a zpool offline? zpool detach?
>>> Once I get the drive back and put it in the same drive bay..  Is it just a 
>>> zpool replace?
>>> -- 
>>> This message posted from opensolaris.org
>>> ___
>>> zfs-discuss mailing list
>>> zfs-discuss@opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Have my RMA... Now what??

2011-05-28 Thread Michael DeMan
Always pre-purchase one extra drive to have on hand.  When you get it, confirm
it was not dead-on-arrival by hooking it up to a workstation over external USB
and running whatever your favorite tools are to validate it is okay.  Then put
it back in its original packaging, and put a label on it about what it is, and 
that it is a spare for box(s) XYZ disk system.

When a drive fails, use that one off the shelf to do your replacement 
immediately then deal with the RMA, paperwork, and snailmail to get the bad 
drive replaced.

Also, depending how many disks you have in your array - keeping multiple spares 
can be a good idea as well to cover another disk dying while waiting on that 
replacement one.

In my opinion, the above goes whether you have your disk system configured with 
hot spare or not.  And the technique is applicable to both personal/home-use 
and commercial uses if your data is important.
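For the mechanics of the swap itself, a hedged sketch of the usual zpool commands
(pool name "tank" and device c1t5d0 are hypothetical; depending on the enclosure,
cfgadm may also be needed to unconfigure/configure the bay):

zpool status tank            # identify the failing device
zpool offline tank c1t5d0    # take it out of service before pulling it
# ...physically swap in the shelf spare, same bay...
zpool replace tank c1t5d0    # resilver onto the new disk in the same slot
zpool status tank            # watch the resilver progress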


- Mike

On May 28, 2011, at 9:30 AM, Brian wrote:

> I have a raidz2 pool with one disk that seems to be going bad, several errors 
> are noted in iostat.  I have an RMA for the drive; however, now I am 
> wondering how I proceed.  I need to send the drive in and then they will send 
> me one back.  If I had the drive on hand, I could do a zpool replace.  
> 
> Do I do a zpool offline? zpool detach? 
> Once I get the drive back and put it in the same drive bay..  Is it just a 
> zpool replace ?
> -- 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] best migration path from Solaris 10

2011-03-23 Thread Michael DeMan
I think on this, the big question is going to be whether Oracle continues to 
release ZFS updates under CDDL after their commercial releases.

Overall, in the past it has obviously and necessarily been the case that 
FreeBSD has been a '2nd class citizen'.

Moving forward, that 2nd class idea becomes very mutable - and ironically it 
becomes more so in regards to dealing with organizations that have longevity.

Moving forward...

If Oracle continues to release critical ZFS feature sets under CDDL to the 
community, then:

A) They are no longer pre-releasing those features to OpenSolaris
B) FreeBSD gets them at the same time.

If Oracle does not continue to release ZFS features sets under CDDL, then then 
game changes.  Pick your choice of operating systems - one that has a history 
of surviving for nearly two decades on its own with community support, or the 
'green leaf off the dead tree' that just decided to jump into the willy-nilly 
world without direct/giant corporate support.

The 2nd-class-citizen issue for FreeBSD disappears either way.

The only remaining question would be the leftover cruft of legal disposition.
I could, for instance, see NetApp or somebody try to sue iXsystems, but I have
a really, really hard time seeing Oracle/Larry Ellison suing the FreeBSD
Foundation or something like that.

Oh yeah - plus BTRFS on the horizon?

Honestly - I am not here to start a flame war - I am asking these questions 
because businesses both big and small need to know what to do.

My hunch is, we all have to wait and see if Oracle releases ZFS updates after 
Solaris 11, and if so, whether that is a subset of functionality or full 
functionality. 

- mike


On Mar 19, 2011, at 11:54 PM, Fajar A. Nugraha wrote:

> On Sun, Mar 20, 2011 at 4:05 AM, Pawel Jakub Dawidek  wrote:
>> On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote:
>>> Newer versions of FreeBSD have newer ZFS code.
>> 
>> Yes, we are at v28 at this point (the lastest open-source version).
>> 
>>> That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...]
>> 
>> That's actually not true. There are more FreeBSD committers working on
>> ZFS than on UFS.
> 
> How is the performance of ZFS under FreeBSD? Is it comparable to that
> in Solaris, or still slower due to some needed compatibility layer?
> 
> -- 
> Fajar
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] best migration path from Solaris 10

2011-03-18 Thread Michael DeMan
ZFSv28 is in HEAD now and will be out in 8.3.

ZFS + HAST in 9.x means being able to cluster off different hardware.

In regards to OpenSolaris and Indiana - can somebody clarify the relationship 
there?  It was clear with OpenSolaris that the latest/greatest ZFS would always 
be available since it was a guinea-pig product for cost conscious folks and 
served as an excellent area for Sun to get marketplace feedback and bug fixes 
done before rolling updates into full Solaris.

To me it seems that Open Indiana is basically a green branch off of a dead tree 
- if I am wrong, please enlighten me.

On Mar 18, 2011, at 6:16 PM, Roy Sigurd Karlsbakk wrote:

>> I think we all feel the same pain with Oracle's purchase of Sun.
>> 
>> FreeBSD that has commercial support for ZFS maybe?
> 
> Fbsd currently has a very old zpool version, not suitable for running with 
> SLOGs, since if you lose it, you may lose the pool, which isn't very 
> amusing...
> 
> Vennlige hilsener / Best regards
> 
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> r...@karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> In all pedagogy it is essential that the curriculum be presented intelligibly. It
> is an elementary imperative for all pedagogues to avoid excessive use of idioms
> of foreign origin. In most cases adequate and relevant synonyms exist in
> Norwegian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [OpenIndiana-discuss] best migration path from Solaris 10

2011-03-18 Thread Michael DeMan
Hi David,

Caught your note about bonnie, actually do some testing myself over the weekend.

All on older hardware for fun - dual Opteron 285 with 16GB RAM.  The disk system
is off a pair of SuperMicro SATA cards, with a combination of WD enterprise and
Seagate ES 1TB drives.  No ZIL, no L2ARC, no tuning at all from the base FreeNAS
install.

10 drives total.  I'm going to be running the tests below, mostly curious about
IOPS and wanting to sort out a little debate with a co-worker.

- all 10 in one raidz2 (running now)
- 5 by 2-way mirrors
- 2 by 5-disk raidz1

The script is below - if folks would find the data I collect useful
at all, let me know and I will post it publicly somewhere.




freenas# cat test.sh
#!/bin/sh

# Basic test for file I/O.  We run lots and lots of the traditional
# 'bonnie' tool at 50GB file size, starting one every minute.  Resulting
# data should give us a good work mixture in the middle given all the
# different tests that bonnie runs, 100 instances running at the same
# time, and at different stages of their processing.


MAX=100
COUNT=0

FILESYSTEM=testrz2
LOG=${FILESYSTEM}.log


# record the date and the pool layout under test at the top of the log
date > ${LOG}
echo "Test with file system named ${FILESYSTEM} and Configuration of..." >> ${LOG}
zpool status >> ${LOG}

# DEMAN grab zfs and regular dev iostats every 10 minutes during the test
zpool iostat -v 600 >> ${LOG} &
iostat -w 600 ada0 ada1 ada2 ada3 ada4 ada5 ada6 ada7 ada8 ada9 > ${LOG}.iostat &


# start one bonnie instance per minute until MAX are running in parallel
while [ $COUNT -le $MAX ]; do
    echo kicking off bonnie
    bonnie -d /mnt/${FILESYSTEM} -s 5 &
    sleep 60
    COUNT=$((COUNT+1))    # advance the loop counter
done



On Mar 18, 2011, at 3:26 PM, David Brodbeck wrote:

> I'm in a similar position, so I'll be curious what kinds of responses you 
> get.  I can give you a thumbnail sketch of what I've looked at so far:
> 
> I evaluated FreeBSD, and ruled it out because I need NFSv4, and FreeBSD's 
> NFSv4 support is still in an early stage.  The NFS stability and performance 
> just isn't there yet, in my opinion.
> 
> Nexenta Core looked promising, but locked up in bonnie++ NFS testing with our 
> RedHat nodes, so its stability is a bit of a question mark for me.
> 
> I haven't gotten the opportunity to thoroughly evaluate OpenIndiana, yet.  
> It's only available as a DVD ISO, and my test machine currently has only a 
> CD-ROM drive.  Changing that is on my to-do list, but other things keep 
> slipping in ahead of it.
> 
> For now I'm running OpenSolaris, with a locally-compiled version of Samba.  
> (The OpenSolaris Samba package is very old and has several unpatched security 
> holes, at this point.)
> 
> -- 
> David Brodbeck
> System Administrator, Linguistics
> University of Washington
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] best migration path from Solaris 10

2011-03-18 Thread Michael DeMan
I think we all feel the same pain with Oracle's purchase of Sun.

FreeBSD that has commercial support for ZFS maybe?

Not here quite yet, but it is something being looked at by an F500 that I am 
currently on contract with.

www.freenas.org, www.ixsystems.com.

Not saying this would be the right solution by any means, but for that 
'corporate barrier', sometimes the option to get both the hardware and ZFS from 
the same place, with support, helps out.

- mike


On Mar 18, 2011, at 2:56 PM, Paul B. Henson wrote:

> We've been running Solaris 10 for the past couple of years, primarily to 
> leverage zfs to provide storage for about 40,000 faculty, staff, and students 
> as well as about 1000 groups. Access is provided via NFSv4, CIFS (by samba), 
> and http/https (including a local module allowing filesystem acl's to be 
> respected via web access). This has worked reasonably well barring some 
> ongoing issues with scalability (approximately a 2 hour reboot window on an 
> x4500 with ~8000 zfs filesystems, complete breakage of live upgrade) and 
> acl/chmod interaction madness.
> 
> We were just about to start working on a cutover to OpenSolaris (for the 
> in-kernel CIFS server, and quicker access to new features/developments) when 
> Oracle finished assimilating Sun and killed off the OpenSolaris distribution. 
> We've been sitting pat for a while to see how things ended up shaking out, 
> and at this point want to start reevaluating our best migration option to 
> move forward from Solaris 10.
> 
> There's really nothing else available that is comparable to zfs (perhaps 
> btrfs someday in the indefinite future, but who knows when that day might 
> come), so our options would appear to be Solaris 11 Express, Nexenta (either 
> NexentaStor or NexentaCore), and OpenIndiana (FreeBSD is occasionally 
> mentioned as a possibility, but I don't really see that as suitable for our 
> enterprise needs).
> 
> Solaris 11 is the official successor to OpenSolaris, has commercial support, 
> and the backing of a huge corporation which historically has contributed the 
> majority of Solaris forward development. However, that corporation is Oracle, 
> and frankly, I don't like doing business with Oracle. With no offense 
> intended to the no doubt numerous talented and goodhearted people that might 
> work there, Oracle is simply evil. We've dealt with Oracle for a long time 
> (in addition to their database itself, we're a PeopleSoft shop) and a 
> positive interaction with them is quite rare. Since they took over Sun, costs 
> on licensing, support contracts, and hardware have increased dramatically, at 
> least in the cases where we've actually been able to get a quote. Arguably, 
> we are not their target market, and they make that quite clear ;). There's 
> also been significant brain drain of prior Sun employees since the takeover, 
> so while they might still continue to contribute the most money into Solaris 
> development, they might not be the future source of the most innovation. Given 
> our needs, and our budget, I really don't consider this a viable option.
> 
> Nexenta, on the other hand, seems to be the kind of company I'd like to deal 
> with. Relatively small, nimble, with a ton of former Sun zfs talent working 
> for them, and what appears to be actual consideration for the needs of their 
> customers. I think I'd more likely get my needs addressed through Nexenta, 
> they've already started work on adding aclmode back and I've had some initial 
> discussion with one of their engineers on the possibility of adding 
> additional options such as denying or ignoring attempted chmod updates on 
> objects with acls. It looks like they only offer commercial support for 
> NexentaStor, not NexentaCore. Commercial support isn't a strict requirement, 
> a sizable chunk of our infrastructure runs on a non-commercial linux 
> distribution and open source software, but it can make management happier. 
> NexentaStor seems positioned as a storage appliance, which isn't really what 
> we need. I'm not particularly interested in a web gui or cli interface that 
> hides the underlying complexity of the operating system and zfs, on the contrary, 
> I want full access to the guts :). We have our zfs deployment integrated into our 
> identity management system, which automatically provisions, destroys, and maintains 
> filespace for our user/groups, as well as providing an API for end-users and 
> administrators to manage quotas and other attributes. We also run apache with 
> some custom modules. I still need to investigate further, but I'm not even sure 
> if NexentaStor provides access into the underlying OS or encapsulates 
> everything and only allows control through its own administrative functionality.
> 
> NexentaCore is more of the raw operating system we're probably looking for, 
> but with only community-based support. Given that NexentaCore and OpenIndiana 
> are now both going to be based off of the illumos core, I'm no

Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 21

2011-02-07 Thread Michael Armstrong
I obtained smartmontools (which includes smartctl) from the standard apt
repository (I'm using Nexenta, however).  In addition, it's necessary to use the
device type of sat,12 with smartctl to get it to read attributes correctly under
this OS, afaik.  Also, regarding device IDs on the system, from what I've seen
they are assigned to ports and therefore do not change; however, after changing a
controller they will most likely change, unless it is the same chipset with
exactly the same port configuration. Hope this helps.
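For reference, a hedged example of the invocation being described (the device
path is hypothetical; adjust controller/target to your system):

smartctl -i -d sat,12 /dev/rdsk/c9t3d0s0    # identity, including serial number
smartctl -a -d sat,12 /dev/rdsk/c9t3d0s0    # full SMART attribute dump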

On 7 Feb 2011, at 18:04, zfs-discuss-requ...@opensolaris.org wrote:

> Having managed to muddle through this weekend without loss (though with a
> certain amount of angst and duplication of efforts), I'm in the mood to
> label things a bit more clearly on my system :-).
> 
> smartctl doesn't seem to be on my system, though.  I'm running
> snv_134.  I'm still pretty badly lost in the whole repository /
> package thing with Solaris, most of my brain cells were already
> occupied with Red Hat, Debian, and Perl package information :-( .
> Where do I look?
> 
> Are the controller port IDs, the "C9T3D0" things that ZFS likes,
> reasonably stable?  They won't change just because I add or remove
> drives, right; only maybe if I change controller cards?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] deduplication requirements

2011-02-07 Thread Michael
Hi guys,

I'm currently running 2 zpools, each in a raidz1 configuration, totaling
around 16TB of usable data. I'm running it all on an OpenSolaris-based box with
2GB of memory and an old Athlon 64 3700 CPU. I understand this is very poor and
underpowered for deduplication, so I'm looking at building a new system, but
wanted some advice first. Here is what I've planned so far:

Core i7 2600 CPU
16gb DDR3 Memory
64GB SSD for ZIL (optional)

Would this produce decent results for deduplication of 16TB worth of pools
or would I need more RAM still?
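As a hedged back-of-the-envelope only (the ~320 bytes per in-core DDT entry and
the 64K average block size are assumptions; the real numbers depend entirely on
your data):

BYTES_PER_TB=$((1024 * 1024 * 1024 * 1024))
ENTRIES_PER_TB=$((BYTES_PER_TB / (64 * 1024)))    # ~16.8M blocks per TB at 64K
echo "approx DDT bytes for 16TB: $((16 * ENTRIES_PER_TB * 320))"    # roughly 80GB
# 'zdb -S <pool>' can simulate dedup on an existing pool and print a real histogram.

On those assumptions, 16GB of RAM alone would not hold the whole dedup table; the
L2ARC SSD helps, though it also costs some RAM for the L2ARC headers.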
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 13

2011-02-06 Thread Michael Armstrong
Additionally, the way I do it is to draw a diagram of the drives in the system, 
labelled with the drive serial numbers. Then when a drive fails, I can find out 
from smartctl which drive it is and remove/replace without trial and error.
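A hedged sketch of gathering that mapping (device names are hypothetical, and
-d sat,12 is the device type mentioned elsewhere in this digest):

# print the serial number each disk reports, for the diagram
for d in c8t0d0 c8t1d0 c8t2d0 c8t3d0; do
    echo "== ${d}"
    smartctl -i -d sat,12 /dev/rdsk/${d}s0 | grep -i serial
done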

On 5 Feb 2011, at 21:54, zfs-discuss-requ...@opensolaris.org wrote:

> 
> Message: 7
> Date: Sat, 5 Feb 2011 15:42:45 -0500
> From: rwali...@washdcmail.com
> To: David Dyer-Bennet 
> Cc: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Identifying drives (SATA)
> Message-ID: <58b53790-323b-4ae4-98cd-575f93b66...@washdcmail.com>
> Content-Type: text/plain; charset=us-ascii
> 
> 
> On Feb 5, 2011, at 2:43 PM, David Dyer-Bennet wrote:
> 
>> Is there a clever way to figure out which drive is which?  And if I have to 
>> fall back on removing a drive I think is right, and seeing if that's true, 
>> what admin actions will I have to perform to get the pool back to safety?  
>> (I've got backups, but it's a pain to restore of course.) (Hmmm; in 
>> single-user mode, use dd to read huge chunks of one disk, and see which 
>> lights come on?  Do I even need to be in single-user mode to do that?)
> 
> Obviously this depends on your lights working to some extent (the right light 
> doing something when the right disk is accessed), but I've used:
> 
> dd if=/dev/rdsk/c8t3d0s0 of=/dev/null bs=4k count=10
> 
> which someone mentioned on this list.  Assuming you can actually read from 
> the disk (it isn't completely dead), it should allow you to direct traffic to 
> each drive individually.
> 
> Good luck,
> Ware

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] NFS slow for small files: idle disks

2011-01-20 Thread Michael Hase
sks each, and 1 
channel with 2 disks).

Richard Elling's zilstat gives

   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
      9552       9552        9552     671744     671744      671744    164    164       0       0
     10192      10192       10192     724992     724992      724992    177    177       0       0
      9568       9568        9568     679936     679936      679936    166    166       0       0
     11712      11712       11712     823296     823296      823296    201    201       0       0
     10784      10784       10784     765952     765952      765952    187    187       0       0
     10024      10024       10024     708608     708608      708608    173    173       0       0

About 200 ZIL ops at most, all of them <= 4kB. As said, the disks aren't busy during
this test.

The test zfs is configured with atime off. logbias makes almost no difference; with
logbias=latency the iops rate is a little bit lower.

Attached are some bonnie++ results to show that all disks and the whole pool
are quite healthy. I get > 1000 random reads/sec locally and still nearly 900
reads/sec via nfs. For large files I easily get gbit wirespeed (105 MB/sec
read) with nfs. And for random reads in a bonnie or iozone test the disks are
really 80%-100% busy. Only for small files does the array sit almost idle; the
array can do way more. I have seen this on different Solaris versions, not
only this test system. Is there any explanation for this behaviour?

Thanks,
Michael
-- 
This message posted from opensolaris.org

local

Version 1.03c   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ibmr10  16G   108972  25 89923  21   263540  26 1073.5   3
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16 30359  99 + +++ + +++ 24836  99 + +++ + +++
ibmr10,16G,,,108972,25,89923,21,,,263540,26,1073.5,3,16,30359,99,+,+++,+,+++,24836,99,+,+++,+,+++
NFS

Version 1.03d   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nfsibmr10   16G   50022  11 42524  14   105335  18 884.8  20
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16   152   3 + +++   182   1   151   3 + +++   183   1
nfsibmr10,16G,,,50022,11,42524,14,,,105335,18,884.8,20,16,152,3,+,+++,182,1,151,3,+,+++,183,1
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Troubleshooting help on ZFS

2011-01-20 Thread Michael Schuster
On Thu, Jan 20, 2011 at 01:47, Steve Kellam
 wrote:
> I have a home media server set up using OpenSolaris.   All my experience with 
> OpenSolaris has been through setting up and maintaining this server so it is 
> rather limited.   I have run in to some problems recently and I am not sure 
> how the best way to troubleshoot this.  I was hoping to get some feedback on 
> possible fixes for this.
>
> I am running SunOS 5.11 snv_134.  It is running on a tower with 6 HDD 
> configured in as raidz2 array.  Motherboard: ECS 945GCD-M(1.0) Intel Atom 330 
> Intel 945GC Micro ATX Motherboard/CPU Combo.  Memory: 4GB.
>
> I set this up about a year ago and have had very few problems.  I was 
> streaming a movie off the server a few days ago and it all of a sudden lost 
> connectivity with the server.  When I checked the server, there was no output 
> on the display from the server but the power supply seemed to be running and 
> the fans were going.
> The next day it started working again and I was able to log in.  The SMB and 
> NFS file server was connecting without problems.
>
> Now I am able to connect remotely via SSH.  I am able to bring up a zpool 
> status screen that shows no problems.  It reports no known data errors.  I am 
> able to go to the top level data directories but when I cd into the 
> sub-directories the SSH connection freezes.
>
> I have tried to do a ZFS scrub on the pool and it only gets to 0.02% and 
> never gets beyond that but does not report any errors.  Now, also, I am 
> unable to stop the scrub.  I use the zpool scrub -s command but this freezes 
> the SSH connection.
> When I reboot, it is still trying to scrub but not making progress.
>
> I have the system set up to a battery back up with surge protection and I'm 
> not aware of any spikes in electricity recently.  I have not made any 
> modifications to the system.  All the drives have been run through SpinRite 
> less than a couple months ago without any data errors.
>
> I can't figure out how this happened all of the sudden and how best to 
> troubleshoot it.
>
> If you have any help or technical wisdom to offer, I'd appreciate it as this 
> has been frustrating.

look in /var/adm/messages (and its rotated copies, messages.*) to see whether
there's anything interesting around the time you saw the loss of connectivity,
and also since then; take it from there.
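A hedged sketch of a first pass (all standard Solaris tools; adjust to taste):

grep -i -e error -e warn /var/adm/messages*   # live log plus rotated copies
fmdump -eV | tail -60                         # FMA error telemetry, if any
iostat -En                                    # per-device error counters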

HTH
Michael
-- 
regards/mit freundlichen Grüssen
Michael Schuster
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is my bottleneck RAM?

2011-01-18 Thread Michael Armstrong
Ah ok, I wont be using dedup anyway just wanted to try. Ill be adding more ram 
though, I guess you can't have too much. Thanks

Erik Trimble  wrote:

>You can't really do that.
>
>Adding an SSD for L2ARC will help a bit, but L2ARC storage also consumes
>RAM to maintain a cache table of what's in the L2ARC.  Using 2GB of RAM
>with an SSD-based L2ARC (even without Dedup) likely won't help you too
>much vs not having the SSD. 
>
>If you're going to turn on Dedup, you need at least 8GB of RAM to go
>with the SSD.
>
>-Erik
>
>
>On Tue, 2011-01-18 at 18:35 +, Michael Armstrong wrote:
>> Thanks everyone, I think overtime I'm gonna update the system to include an 
>> ssd for sure. Memory may come later though. Thanks for everyone's responses
>> 
>> Erik Trimble  wrote:
>> 
>> >On Tue, 2011-01-18 at 15:11 +, Michael Armstrong wrote:
>> >> I've since turned off dedup, added another 3 drives and results have 
>> >> improved to around 148388K/sec on average, would turning on compression 
>> >> make things more CPU bound and improve performance further?
>> >> 
>> >> On 18 Jan 2011, at 15:07, Richard Elling wrote:
>> >> 
>> >> > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote:
>> >> > 
>> >> >> Hi guys, sorry in advance if this is somewhat a lowly question, I've 
>> >> >> recently built a zfs test box based on nexentastor with 4x samsung 2tb 
>> >> >> drives connected via SATA-II in a raidz1 configuration with dedup 
>> >> >> enabled compression off and pool version 23. From running bonnie++ I 
>> >> >> get the following results:
>> >> >> 
>> >> >> Version 1.03b   --Sequential Output-- --Sequential Input- 
>> >> >> --Random-
>> >> >>   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- 
>> >> >> --Seeks--
>> >> >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  
>> >> >> /sec %CP
>> >> >> nexentastor  4G 60582  54 20502   4 12385   3 53901  57 105290  10 
>> >> >> 429.8   1
>> >> >>   --Sequential Create-- Random 
>> >> >> Create
>> >> >>   -Create-- --Read--- -Delete-- -Create-- --Read--- 
>> >> >> -Delete--
>> >> >> files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  
>> >> >> /sec %CP
>> >> >>16  7181  29 + +++ + +++ 21477  97 + +++ 
>> >> >> + +++
>> >> >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++
>> >> >> 
>> >> >> 
>> >> >> I'd expect more than 105290K/s on a sequential read as a peak for a 
>> >> >> single drive, let alone a striped set. The system has a relatively 
>> >> >> decent CPU, however only 2GB memory, do you think increasing this to 
>> >> >> 4GB would noticeably affect performance of my zpool? The memory is 
>> >> >> only DDR1.
>> >> > 
>> >> > 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, 
>> >> > turn off dedup
>> >> > and enable compression.
>> >> > -- richard
>> >> > 
>> >> 
>> >> ___
>> >> zfs-discuss mailing list
>> >> zfs-discuss@opensolaris.org
>> >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>> >
>> >
>> >Compression will help speed things up (I/O, that is), presuming that
>> >you're not already CPU-bound, which it doesn't seem you are.
>> >
>> >If you want Dedup, you pretty much are required to buy an SSD for L2ARC,
>> >*and* get more RAM.
>> >
>> >
>> >These days, I really don't recommend running ZFS as a fileserver without
>> >a bare minimum of 4GB of RAM (8GB for anything other than light use),
>> >even with Dedup turned off. 
>> >
>> >
>> >-- 
>> >Erik Trimble
>> >Java System Support
>> >Mailstop:  usca22-317
>> >Phone:  x67195
>> >Santa Clara, CA
>> >Timezone: US/Pacific (GMT-0800)
>> >
>-- 
>Erik Trimble
>Java System Support
>Mailstop:  usca22-317
>Phone:  x67195
>Santa Clara, CA
>Timezone: US/Pacific (GMT-0800)
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is my bottleneck RAM?

2011-01-18 Thread Michael Armstrong
Thanks everyone. I think over time I'm going to update the system to include an SSD
for sure. Memory may come later though. Thanks for everyone's responses

Erik Trimble  wrote:

>On Tue, 2011-01-18 at 15:11 +, Michael Armstrong wrote:
>> I've since turned off dedup, added another 3 drives and results have 
>> improved to around 148388K/sec on average, would turning on compression make 
>> things more CPU bound and improve performance further?
>> 
>> On 18 Jan 2011, at 15:07, Richard Elling wrote:
>> 
>> > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote:
>> > 
>> >> Hi guys, sorry in advance if this is somewhat a lowly question, I've 
>> >> recently built a zfs test box based on nexentastor with 4x samsung 2tb 
>> >> drives connected via SATA-II in a raidz1 configuration with dedup enabled 
>> >> compression off and pool version 23. From running bonnie++ I get the 
>> >> following results:
>> >> 
>> >> Version 1.03b   --Sequential Output-- --Sequential Input- 
>> >> --Random-
>> >>   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- 
>> >> --Seeks--
>> >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  
>> >> /sec %CP
>> >> nexentastor  4G 60582  54 20502   4 12385   3 53901  57 105290  10 
>> >> 429.8   1
>> >>   --Sequential Create-- Random 
>> >> Create
>> >>   -Create-- --Read--- -Delete-- -Create-- --Read--- 
>> >> -Delete--
>> >> files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec 
>> >> %CP
>> >>16  7181  29 + +++ + +++ 21477  97 + +++ + 
>> >> +++
>> >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++
>> >> 
>> >> 
>> >> I'd expect more than 105290K/s on a sequential read as a peak for a 
>> >> single drive, let alone a striped set. The system has a relatively decent 
>> >> CPU, however only 2GB memory, do you think increasing this to 4GB would 
>> >> noticeably affect performance of my zpool? The memory is only DDR1.
>> > 
>> > 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, turn 
>> > off dedup
>> > and enable compression.
>> > -- richard
>> > 
>> 
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
>Compression will help speed things up (I/O, that is), presuming that
>you're not already CPU-bound, which it doesn't seem you are.
>
>If you want Dedup, you pretty much are required to buy an SSD for L2ARC,
>*and* get more RAM.
>
>
>These days, I really don't recommend running ZFS as a fileserver without
>a bare minimum of 4GB of RAM (8GB for anything other than light use),
>even with Dedup turned off. 
>
>
>-- 
>Erik Trimble
>Java System Support
>Mailstop:  usca22-317
>Phone:  x67195
>Santa Clara, CA
>Timezone: US/Pacific (GMT-0800)
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is my bottleneck RAM?

2011-01-18 Thread Michael Armstrong
I've since turned off dedup and added another 3 drives, and results have improved
to around 148388K/sec on average. Would turning on compression make things more
CPU-bound and improve performance further?
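If you do try it, a hedged sketch of the knobs involved (dataset name tank/data
is hypothetical; compression only applies to blocks written after it is enabled):

zfs set compression=on tank/data              # 'on' maps to lzjb, cheap on CPU
zfs get compression,compressratio tank/data   # see what it actually saves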

On 18 Jan 2011, at 15:07, Richard Elling wrote:

> On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote:
> 
>> Hi guys, sorry in advance if this is somewhat a lowly question, I've 
>> recently built a zfs test box based on nexentastor with 4x samsung 2tb 
>> drives connected via SATA-II in a raidz1 configuration with dedup enabled 
>> compression off and pool version 23. From running bonnie++ I get the 
>> following results:
>> 
>> Version 1.03b   --Sequential Output-- --Sequential Input- 
>> --Random-
>>   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec 
>> %CP
>> nexentastor  4G 60582  54 20502   4 12385   3 53901  57 105290  10 429.8 
>>   1
>>   --Sequential Create-- Random Create
>>   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>16  7181  29 + +++ + +++ 21477  97 + +++ + +++
>> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++
>> 
>> 
>> I'd expect more than 105290K/s on a sequential read as a peak for a single 
>> drive, let alone a striped set. The system has a relatively decent CPU, 
>> however only 2GB memory, do you think increasing this to 4GB would 
>> noticeably affect performance of my zpool? The memory is only DDR1.
> 
> 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, turn off 
> dedup
> and enable compression.
> -- richard
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Is my bottleneck RAM?

2011-01-18 Thread Michael Armstrong
Hi guys, sorry in advance if this is somewhat a lowly question. I've recently
built a zfs test box based on NexentaStor with 4x Samsung 2TB drives connected
via SATA-II in a raidz1 configuration, with dedup enabled, compression off, and
pool version 23. From running bonnie++ I get the following results:

Version 1.03b   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nexentastor  4G 60582  54 20502   4 12385   3 53901  57 105290  10 429.8   1
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16  7181  29 + +++ + +++ 21477  97 + +++ + +++
nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++


I'd expect more than 105290K/s on a sequential read as a peak for a single 
drive, let alone a striped set. The system has a relatively decent CPU, however 
only 2GB memory, do you think increasing this to 4GB would noticeably affect 
performance of my zpool? The memory is only DDR1.

Thanks in advance.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-09 Thread Michael Sullivan
Just to add a bit to this, I just love sweeping generalizations...

On 9 Jan 2011, at 19:33 , Richard Elling wrote:

> On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey 
>  wrote:
> 
>>> From: Pasi Kärkkäinen [mailto:pa...@iki.fi]
>>> 
>>> Other OS's have had problems with the Broadcom NICs aswell..
>> 
>> Yes.  The difference is, when I go to support.dell.com and punch in my
>> service tag, I can download updated firmware and drivers for RHEL that (at
>> least supposedly) solve the problem.  I haven't tested it, but the dell
>> support guy told me it has worked for RHEL users.  There is nothing
>> available to download for solaris.
> 
> The drivers are written by Broadcom and are, AFAIK, closed source.
> By going through Dell, you are going through a middle-man. For example,
> 
> http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php
> 
> where you see the release of the Solaris drivers was at the same time
> as Windows.
> 

What Richard says is true.

Broadcom have been a source of contention in the Linux world as well as the 
*BSD world due to the proprietary nature of their firmware.  
OpenSolaris/Solaris users are not the only ones who have complained about this. 
 There's been much uproar in the FOSS community about Broadcom and their 
drivers.  As a result, I've seen some pretty nasty hacks like people using the 
Windows drivers linked into their kernel - *gack*  I forget all the gory 
details, but it was rather disgusting as I recall: bubblegum, baling wire, 
duct tape and all.

Dell and Red Hat aren't exactly a marriage made in heaven either.  I've had 
problems getting support from both Dell and Red Hat, them pointing fingers at 
each other rather than solving the problem.  Like most people, I've had to come 
up with my own work-arounds, like others with the Broadcom issue, using a 
"known quantity" NIC.

When dealing with Dell as a corporate buyer, they have always made it quite 
clear that they are primarily a Windows platform.  Linux, oh yes, we have that 
too...

>> Also, the bcom is not the only problem on that server.  After I added-on an
>> intel network card and disabled the bcom, the weekly crashes stopped, but
>> now it's ...  I don't know ... once every 3 weeks with a slightly different
>> mode of failure.  This is yet again, rare enough that the system could very
>> well pass a certification test, but not rare enough for me to feel
>> comfortable putting into production as a primary mission critical server.

I've never been particularly warm and fuzzy with Dell servers.  They seem to 
like to change their chipsets slightly while a model is in production.  This 
can cause all sorts of problems which are difficult to diagnose since an 
"identical" Dell system will have no problems, and it's mate will crash weekly.

>> 
>> I really think there are only two ways in the world to engineer a good solid
>> server:
>> (a) Smoke your own crack.  Systems engineering teams use the same systems
>> that are sold to customers.
> 
> This is rarely practical, not to mention that product development
> is often not in the systems engineering organization.
> 
>> or
>> (b) Sell millions of 'em.  So despite whether or not the engineering team
>> uses them, you're still going to have sufficient mass to dedicate engineers
>> to the purpose of post-sales bug solving.
> 
> yes, indeed :-)
> -- richard

As for certified systems, it's my understanding that Nexenta themselves don't 
"certify" anything.  They have systems which are recommended and supported by 
their network of VAR's.  It just so happens that SuperMicro is one of the 
brands of choice, but even then one must adhere to a fairly tight HCL.  The 
same holds true for Solaris/OpenSolaris with third-party hardware.

SATA controllers and multiplexers are another example of drivers being
written by the manufacturer, where Solaris/OpenSolaris is not a priority compared
with Windows and Linux, in that order.

Deviation from items which are not somewhat "plain vanilla" and are not listed 
on the HCL is just asking for trouble.

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242



smime.p7s
Description: S/MIME cryptographic signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Michael DeMan

On Jan 7, 2011, at 6:13 AM, David Magda wrote:

> On Fri, January 7, 2011 01:42, Michael DeMan wrote:
>> Then - there is the other side of things.  The 'black swan' event.  At
>> some point, given percentages on a scenario like the example case above,
>> one simply has to make the business justification case internally at their
>> own company about whether to go SHA-256 only or Fletcher+Verification?
>> Add Murphy's Law to the 'black swan event' and of course the only data
>> that is lost is that .01% of your data that is the most critical?
> 
> The other thing to note is that by default (with de-dupe disabled), ZFS
> uses Fletcher checksums to prevent data corruption. Add also the fact all
> other file systems don't have any checksums, and simply rely on the fact
> that disks have a bit error rate of (at best) 10^-16.
> 
Agreed - but I think it is still missing the point of what the original poster 
was asking about.

In all honesty I think the debate is a business decision - the highly 
improbable vs. certainty.

Somebody somewhere must have written this stuff up, along with simple use cases?
Perhaps even a new acronym?  MTBC - mean time before collision?

And even with the 'certainty' factor being the choice - other things like human
error come into play and are far riskier?




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-06 Thread Michael DeMan
At the end of the day this issue essentially is about mathematical 
improbability versus certainty?

To be quite honest, I too am skeptical about using de-dupe based on SHA256
alone.  In prior posts it was asked that the potential adopter of the
technology provide the mathematical reason NOT to use SHA-256 only.  However,
if Oracle believes that it is adequate to do that, would it be possible for
somebody to provide:

(A) The theoretical documents and associated mathematics specific to say one 
simple use case?
(A1) Total data size is 1PB (lets say the zpool is 2PB to not worry about that 
part of it).
(A2) Daily, 10TB of data is updated, 1TB of data is deleted, and 1TB of data is 
'new'.
(A3) Out of the dataset, 25% of the data is capable of being de-duplicated
(A4) Between A2 and A3 above, the 25% rule from A3 also applies to everything 
in A2.


I think the above would be a pretty 'soft' case for justifying the case that 
SHA-256 works?  I would presume some kind of simple kind of scenario 
mathematically has been run already by somebody inside Oracle/Sun long ago when 
first proposing that ZFS be funded internally at all?


Then - there is the other side of things.  The 'black swan' event.  At some 
point, given percentages on a scenario like the example case above, one simply 
has to make the business justification case internally at their own company 
about whether to go SHA-256 only or Fletcher+Verification?  Add Murphy's Law to 
the 'black swan event' and of course the only data that is lost is that .01% of 
your data that is the most critical?



Not trying to be aggressive or combative here at all against people's opinions
and understandings of it all - I would just like to see some hard information
about it all - it must exist somewhere already?
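As one hedged back-of-the-envelope for scenario (A1)-(A4) above - assuming an
ideal 256-bit hash, 4KB blocks, and bounding every pair of blocks in the whole
1PB working set (a worst case):

N = \frac{2^{50}\ \text{bytes}}{2^{12}\ \text{bytes/block}} = 2^{38}\ \text{blocks},
\qquad
P(\text{collision}) \le \binom{N}{2}\, 2^{-256} \approx 2^{75} \cdot 2^{-256}
  = 2^{-181} \approx 3 \times 10^{-55}

This is the same style of bound Ned works through below for a 128TB pool.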

Thanks,
 
- Mike




On Jan 6, 2011, at 10:05 PM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Peter Taps
>> 
>> Perhaps (Sha256+NoVerification) would work 99.99% of the time. But
> 
> Append 50 more 9's on there. 
> 99.%
> 
> See below.
> 
> 
>> I have been told that the checksum value returned by Sha256 is almost
>> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
>> bigger problem such as memory corruption, etc. Essentially, adding
>> verification to sha256 is an overkill.
> 
> Someone please correct me if I'm wrong.  I assume ZFS dedup matches both the
> blocksize and the checksum right?  A simple checksum collision (which is
> astronomically unlikely) is still not sufficient to produce corrupted data.
> It's even more unlikely than that.
> 
> Using the above assumption, here's how you calculate the probability of
> corruption if you're not using verification:
> 
> Suppose every single block in your whole pool is precisely the same size
> (which is unrealistic in the real world, but I'm trying to calculate worst
> case.)  Suppose the block is 4K, which is again, unrealistically worst case.
> Suppose your dataset is purely random or sequential ... with no duplicated
> data ... which is unrealisic because if your data is like that, then why in
> the world are you enabling dedupe?  But again, assuming worst case
> scenario...  At this point we'll throw in some evil clowns, spit on a voodoo
> priestess, and curse the heavens for some extra bad luck.
> 
> If you have astronomically infinite quantities of data, then your
> probability of corruption approaches 100%.  With infinite data, eventually
> you're guaranteed to have a collision.  So the probability of corruption is
> directly related to the total amount of data you have, and the new question
> is:  For anything Earthly, how near are you to 0% probability of collision
> in reality?
> 
> Suppose you have 128TB of data.  That is ...  you have 2^35 unique 4k blocks
> of uniformly sized data.  Then the probability you have any collision in
> your whole dataset is (sum(1 thru 2^35))*2^-256 
> Note: sum of integers from 1 to N is  (N*(N+1))/2
> Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> So the probability of data corruption in this case, is 2^-187 + 2^-222 ~=
> 5.1E-57 + 1.5E-67
> 
> ~= 5.1E-57
> 
> In other words, even in the absolute worst case, cursing the gods, running
> without verification, using data that's specifically formulated to try and
> cause errors, on a dataset that I bet is larger than what you're doing, ...
> 
> Before we go any further ... The total number of bits stored on all the
> storage in the whole planet is a lot smaller than the total number of
> molecules in the planet.
> 
> There are estimated 8.87 * 10^49 molecules in planet Earth.
> 
> The probability of a collision in your worst-case unrealistic dataset as
> described, is even 100 million times less likely than randomly finding a
> single specific molecule in the whole planet Earth by pure luck.

Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-06 Thread Michael Sullivan
Ed, with all due respect to your math,

I've seen rsync bomb due to an SHA256 collision, so I know it can and does 
happen.

I respect my data, so even with checksumming and comparing the block size, I'll
still do a full comparison check when those two match.  Otherwise you can end up
with silent data corruption, which could affect you in so many ways.

Do you want to stake your career and reputation on that?  With a client or 
employer's data? I sure don't.

"Those who walk on the razor's edge are destined to be cut to ribbons…" Someone 
I used to work with said that, not me.

For my home media server, maybe, but even then I'd hate to lose any of my 
family photos or video due to a hash collision.

I'll play it safe if I dedup.

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242

On 7 Jan 2011, at 00:05 , Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Peter Taps
>> 
>> Perhaps (Sha256+NoVerification) would work 99.99% of the time. But
> 
> Append 50 more 9's on there. 
> 99.%
> 
> See below.
> 
> 
>> I have been told that the checksum value returned by Sha256 is almost
>> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
>> bigger problem such as memory corruption, etc. Essentially, adding
>> verification to sha256 is an overkill.
> 
> Someone please correct me if I'm wrong.  I assume ZFS dedup matches both the
> blocksize and the checksum right?  A simple checksum collision (which is
> astronomically unlikely) is still not sufficient to produce corrupted data.
> It's even more unlikely than that.
> 
> Using the above assumption, here's how you calculate the probability of
> corruption if you're not using verification:
> 
> Suppose every single block in your whole pool is precisely the same size
> (which is unrealistic in the real world, but I'm trying to calculate worst
> case.)  Suppose the block is 4K, which is again, unrealistically worst case.
> Suppose your dataset is purely random or sequential ... with no duplicated
> data ... which is unrealisic because if your data is like that, then why in
> the world are you enabling dedupe?  But again, assuming worst case
> scenario...  At this point we'll throw in some evil clowns, spit on a voodoo
> priestess, and curse the heavens for some extra bad luck.
> 
> If you have astronomically infinite quantities of data, then your
> probability of corruption approaches 100%.  With infinite data, eventually
> you're guaranteed to have a collision.  So the probability of corruption is
> directly related to the total amount of data you have, and the new question
> is:  For anything Earthly, how near are you to 0% probability of collision
> in reality?
> 
> Suppose you have 128TB of data.  That is ...  you have 2^35 unique 4k blocks
> of uniformly sized data.  Then the probability you have any collision in
> your whole dataset is (sum(1 thru 2^35))*2^-256 
> Note: sum of integers from 1 to N is  (N*(N+1))/2
> Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> So the probability of data corruption in this case, is 2^-187 + 2^-222 ~=
> 5.1E-57 + 1.5E-67
> 
> ~= 5.1E-57
> 
> In other words, even in the absolute worst case, cursing the gods, running
> without verification, using data that's specifically formulated to try and
> cause errors, on a dataset that I bet is larger than what you're doing, ...
> 
> Before we go any further ... The total number of bits stored on all the
> storage in the whole planet is a lot smaller than the total number of
> molecules in the planet.
> 
> There are estimated 8.87 * 10^49 molecules in planet Earth.
> 
> The probability of a collision in your worst-case unrealistic dataset as
> described, is even 100 million times less likely than randomly finding a
> single specific molecule in the whole planet Earth by pure luck.
> 
> 
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



smime.p7s
Description: S/MIME cryptographic signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-05 Thread Michael Schuster
On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey
 wrote:
>> From: Deano [mailto:de...@rattie.demon.co.uk]
>> Sent: Wednesday, January 05, 2011 9:16 AM
>>
>> So honestly do we want to innovate ZFS (I do) or do we just want to follow
>> Oracle?
>
> Well, you can't follow Oracle.  Unless you wait till they release something,
> reverse engineer it, and attempt to reimplement it.

that's not my understanding - while we will have to wait, Oracle is
supposed to release *some* source code afterwards to satisfy some
claim or other. I agree, some would argue that that should have
already happened with S11 Express... I don't know whether it has, but that's
not *the* release of S11, is it? And once the code is released, even
if after the fact, it's not reverse-engineering anymore, is it?

Michael
PS: just in case: even while at Oracle, I had no insight into any of
these plans, much less do I have now.
-- 
regards/mit freundlichen Grüssen
Michael Schuster
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A couple of quick questions

2010-12-22 Thread Michael Schuster
I can't answer any of these authoritatively(?), but have a comment:

On Wed, Dec 22, 2010 at 10:55, Per Hojmark  wrote:
> 1) What's the maximum number of disk devices that can be used to construct 
> filesystems?

lots.

> 2) Is there a practical limit on #1? I've seen messages where folks suggested 
> 40 physical devices is the practical maximum. That would seem to imply a 
> maximum single volume size of 80TB...

how does that follow, or, in other words, why do you believe zfs can
only handle 2 TB per physical disc? (hint: look up GPT or EFI labels
;-)

HTH
-- 
regards/mit freundlichen Grüssen
Michael Schuster
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideas for ghetto file server data reliability?

2010-11-16 Thread michael . p . sullivan
Ummm… there's a difference between data integrity and data corruption.

Integrity is enforced programmatically by something like a DBMS.  This sets up 
basic rules that ensure the programmer, program or algorithm adheres to a level 
of sanity and bounds.

Corruption is where cosmic rays, bit rot, malware or some other agent damages data 
at the block level.  ZFS protects systems from a lot of this by the way it's 
constructed to keep metadata, checksums, and duplicates of critical data.

If the filesystem is given bad data it will faithfully lay it down on disk.  If 
that data later gets corrupted on disk, ZFS will come in and save the day.

Regards,

Mike

On Nov 16, 2010, at 11:28, Edward Ned Harvey  wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Toby Thain
>> 
>> The corruption will at least be detected by a scrub, even in cases where
> it
>> cannot be repaired.
> 
> Not necessarily.  Let's suppose you have some bad memory, and no ECC.  Your
> application does 1 + 1 = 3.  Then your application writes the answer to a
> file.  Without ECC, the corruption happened in memory and went undetected.
> Then the corruption was written to file, with a correct checksum.  So in
> fact it's not filesystem corruption, and ZFS will correctly mark the
> filesystem as clean and free of checksum errors.
> 
> In conclusion:
> 
> Use ECC if you care about your data.
> Do backups if you care about your data.
> 
> Don't be a cheapskate, or else, don't complain when you get bitten by lack
> of adequate data protection.
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Running on Dell hardware?

2010-11-01 Thread Michael Sullivan
Congratulations Ed, and welcome to "open systems…"

Ah, but Nexenta is open and has "no vendor lock-in."  What you probably 
should have done is bank everything on Illumos and Nexenta.  A winning 
combination by all accounts.

But then again, you could have used Linux on any hardware as well.  Then your 
hardware and software issues would probably be multiplied even more.

Cheers,

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 23 Oct 2010, at 12:53 , Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Kyle McDonald
>> 
>> I'm currently considering purchasing 1 or 2 Dell R515's.
>> 
>> With up to 14 drives, and up to 64GB of RAM, it seems like it's well
>> suited
>> for a low-end ZFS server.
>> 
>> I know this box is new, but I wonder if anyone out there has any
>> experience with it?
>> 
>> How about the H700 SAS controller?
>> 
>> Anyone know where to find the Dell 3.5" sleds that take 2.5" drives? I
>> want to put some SSD's in a box like this, but there's no way I'm
>> going to pay Dell's SSD prices. $1300 for a 50GB 'mainstream' SSD? Are
>> they kidding?
> 
> You are asking for a world of hurt.  You may luck out, and it may work
> great, thus saving you money.  Take my example for example ... I took the
> "safe" approach (as far as any non-sun hardware is concerned.)  I bought an
> officially supported dell server, with all dell blessed and solaris
> supported components, with support contracts on both the hardware and
> software, fully patched and updated on all fronts, and I am getting system
> failures approx once per week.  I have support tickets open with both dell
> and oracle right now ... Have no idea how it's all going to turn out.  But
> if you have a problem like mine, using unsupported hardware, you have no
> alternative.  You're up a tree full of bees, naked, with a hunter on the
> ground trying to shoot you.  And IMHO, I think the probability of having a
> problem like mine is higher when you use the unsupported hardware.  But of
> course there's no definable way to quantify that belief.
> 
> My advice to you is:  buy the supported hardware, and the support contracts
> for both the hardware and software.  But of course, that's all just a
> calculated risk, and I doubt you're going to take my advice.  ;-)
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Michael DeMan

On Oct 8, 2010, at 4:33 AM, Edward Ned Harvey wrote:

>> From: Peter Jeremy [mailto:peter.jer...@alcatel-lucent.com]
>> Sent: Thursday, October 07, 2010 10:02 PM
>> 
>> On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey 
>> wrote:
>>> If you're going raidz3, with 7 disks, then you might as well just make
>>> mirrors instead, and eliminate the slow resilver.
>> 
>> There is a difference in reliability:  raidzN means _any_ N disks can
>> fail, whereas mirror means one disk in each mirror pair can fail.
>> With a mirror, Murphy's Law says that the second disk to fail will be
>> the pair of the first disk :-).
> 
> Maybe.  But in reality, you're just guessing the probability of a single
> failure, the probability of multiple failures, and the probability of
> multiple failures within the critical time window and critical redundancy
> set.
> 
> The probability of a 2nd failure within the critical time window is smaller
> whenever the critical time window is decreased, and the probability of that
> failure being within the critical redundancy set is smaller whenever your
> critical redundancy set is smaller.  So if raidz2 takes twice as long to
> resilver than a mirror, and has a larger critical redundancy set, then you
> haven't gained any probable resiliency over a mirror.
> 
> Although it's true with mirrors, it's possible for 2 disks to fail and
> result in loss of pool, I think the probability of that happening is smaller
> than the probability of a 3-disk failure in the raidz2.
> 
> How much longer does a 7-disk raidz2 take to resilver as compared to a
> mirror?  According to my calculations, it's in the vicinity of 10x longer.  
> 

This article has been posted elsewhere, is about 10 months old, but is a good 
read:

http://queue.acm.org/detail.cfm?id=1670144



Really, there should be a ballpark / back-of-the-napkin formula to be able to 
calculate this.  I've been curious about this too, so here goes a first cut...



DR = disk reliability, in terms of chance of the disk dying in any given time 
period, say any given hour?

DFW = disk full write - time to write every sector on the disk.  This will vary 
depending on system load, but is still an input item that can be determined by 
some testing.


RSM = resilver time for a mirror of two of the given disks
RSZ1 = resilver time for a raidz1 vdev built from the given disks
RSZ2 = resilver time for a raidz2 vdev built from the given disks


chances of losing all data in a mirror: DLM = RSM * DR (the lone surviving disk dies)
chances of losing all data in a raidz1: DLRZ1 = RSZ1 * DR * (surviving disks in the vdev)
chances of losing all data in a raidz2: DLRZ2 = RSZ2 * (DR * surviving disks) * (DR * (surviving disks - 1))



Now, for the above, I'll make some other assumptions...


Let's just guess at a 1-year MTBF for our disks and, for purposes here, flat-line 
that into a constant chance of failure per hour throughout the year.

Let's presume rebuilding a mirror takes one hour.
Let's presume that a 7-disk raidz1 takes 24 times longer to rebuild one disk 
than a mirror; I think this would be a 'safe' ratio to the benefit of the 
mirror.
Let's presume that a 7-disk raidz2 takes 72 times longer to rebuild one disk 
than a mirror; this should be 'safe' and again benefit the mirror.




DR for a one-hour period = 1 / (24 hours * 365 days) = .000114 - the chance a disk 
might die in any given hour.


DLM = one hour * DR = .000114

DLRZ1 = 24 hours * (DR * 6) = 24 * .000114 * 6 = roughly .016 (x6 because there are 
six more drives in the vdev, and any one of them could fail)

DLRZ2 = 72 hours * (DR * 6) * (DR * 5) = 72 * (.000114 * 6) * (.000114 * 5) = roughly 
.000028 - a much tinier chance of losing all that data.





Maybe a better way to think about it:

Based on our 1-year flat-line MTBF for disks, figure out how slow the raidz2 
resilver could get (or how much faster the mirror must rebuild) for the 
reliability to be the same...

DLM = DLRZ2

.000114 * 1 hour = X hours * (.000114 * 6 disks) * (.000114 * 5 disks)

X = 1 / (30 * .000114)

X = roughly 292 hours

So the raidz2 could take roughly 292 hours to resilver and still match a mirror 
that resilvers in one hour; put the other way, the mirror would have to resilver 
about three hundred times faster than the raidz2 in order for the two to offer 
the same level of protection against losing the entire vdev to additional disk 
failures during a resilver.





The governing thing here is the second-order reliability: the raidz2 number 
depends on the chance of *two* additional disks failing during the resilver 
window, vs. a single additional failure for mirrors and raidz1.

Note that the above is second-order for raidz2 and first-order for mirror/raidz1, 
because we are working on the assumption we have already lost one disk.

With raidz3, we would gain another factor of 1 / (.000114 * 4 disks remaining in 
the vdev), or about 2,000 times more reliability again.




Now, the above does not include proper statistics: the chances of that 2nd and 
3rd disk failing may be correlated and higher than our flat-line %/hr figure 
based on a 1-year MTBF - for example, if all the disks were purchased in the 
same lot at the same time, their chances of failing around the same time are 
higher, etc.
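The same napkin model is easy to play with in code. Here is a rough Python sketch
of the flat-line math above (the per-hour failure rate, the 1/24/72-hour resilver
guesses and the 7-disk vdev width are all assumptions from this post, not measured
values):

# Flat-line failure model from the post: 1-year MTBF spread evenly over the year.
DR = 1.0 / (24 * 365)        # chance a given disk dies in any one hour (~0.000114)

RSM, RSZ1, RSZ2 = 1.0, 24.0, 72.0   # guessed resilver hours: mirror, 7-disk raidz1/raidz2

# Probability of losing the vdev while resilvering after the first failure.
dlm   = RSM  * DR                     # mirror: the one surviving partner dies
dlrz1 = RSZ1 * (DR * 6)               # raidz1: any of the 6 surviving disks dies
dlrz2 = RSZ2 * (DR * 6) * (DR * 5)    # raidz2: two more of the survivors die

print(f"mirror  {dlm:.2e}")    # ~1.1e-4
print(f"raidz1  {dlrz1:.2e}")  # ~1.6e-2
print(f"raidz2  {dlrz2:.2e}")  # ~2.8e-5

# How slow could the raidz2 resilver get before it is no better than the mirror?
breakeven_hours = dlm / ((DR * 6) * (DR * 5))
print(f"raidz2 break-even resilver time: {breakeven_hours:.0f} hours")  # ~292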

___

Re: [zfs-discuss] TLER and ZFS

2010-10-06 Thread Michael DeMan
Can you give us release numbers that confirm that this is 'automatic'?  It is 
my understanding that the last available public release of OpenSolaris does not 
do this.



On Oct 5, 2010, at 8:52 PM, Richard Elling wrote:

> ZFS already aligns the beginning of data areas to 4KB offsets from the label.
> For modern OpenSolaris and Solaris implementations, the default starting 
> block for partitions is also aligned to 4KB.
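For anyone who wants to check rather than take it on faith: a start offset is
4KB-aligned when its starting sector (counted in 512-byte sectors) is divisible
by 8. A small Python sketch; the example start sectors are illustrative, not
output from any particular tool:

SECTOR = 512                # bytes per legacy sector
ALIGN = 4096 // SECTOR      # 8 sectors = one 4 KiB physical sector

def aligned_4k(start_sector: int) -> bool:
    """True if a slice starting at this 512-byte sector is 4 KiB aligned."""
    return start_sector % ALIGN == 0

# Hypothetical start sectors: 34 and 63 are the classic old defaults,
# 256 and 2048 are typical aligned choices.
for start in (34, 63, 256, 2048):
    print(start, "aligned" if aligned_4k(start) else "NOT aligned")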

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-05 Thread Michael DeMan
Hi upfront, and thanks for the valuable information.


On Oct 5, 2010, at 4:12 PM, Peter Jeremy wrote:

>> Another annoying thing with the whole 4K sector size, is what happens
>> when you need to replace drives next year, or the year after?
> 
> About the only mitigation needed is to ensure that any partitioning is
> based on multiples of 4KB.

I agree, but to be quite honest, I have no clue how to do this with ZFS.  It 
seems that it should be something covered in the regular tuning documentation.  

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide


Is it going to be the case that basic information about how to deal with 
common scenarios like this is no longer going to be publicly available - that 
Oracle will simply keep it close to the vest, with the relevant information 
available only to those who choose to research it themselves or those with 
certain levels of support contracts from Oracle?

To put it another way - does the community that uses ZFS need to fork 'ZFS Best 
Practices' and 'ZFS Evil Tuning' to ensure that they stay reasonably up to date?

Sorry for the somewhat hostile tone above, but the changes with the merger have 
demoralized a lot of folks, I think.

- Mike




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-05 Thread Michael DeMan

On Oct 5, 2010, at 2:47 PM, casper@sun.com wrote:
> 
> 
> I've seen several important features when selecting a drive for
> a mirror:
> 
>   TLER (the ability of the drive to timeout a command)
>   sector size (native vs virtual)
>   power use (specifically at home)
>   performance (mostly for work)
>   price
> 
> I've heard scary stories about a mismatch of the native sector size and
> unaligned Solaris partitions (4K sectors, unaligned cylinder).
> 

Yes, avoiding the 4K sector sizes is a huge issue right now too - another item 
I forgot on the reasons to absolutely avoid those WD 'green' drives.

Three good reasons to avoid WD 'green' drives for ZFS...

- TLER issues
- IntelliPower head park issues
- 4K sector size issues

...they are an absolute nightmare.  

The WD 1TB 'enterprise' drives still use a 512-byte sector size and are safe to 
use - though who knows, maybe they just started shipping with 4K sectors as I 
write this e-mail?

Another annoying thing with the whole 4K sector size is what happens when you 
need to replace drives next year, or the year after.  That part has me more 
worried about this whole 4K sector migration than what to buy today.  Given the 
choice, I would prefer to buy 4K-sector drives now, but operating system support 
is still limited.  Does anybody know of any vendors shipping 4K-sector drives 
that have a jumper option to make them present 512-byte sectors?  WD has a 
jumper, but it is there explicitly to work with Windows XP, and it is not a real 
way to dumb the drive down to 512.  I would presume that any vendor shipping 
4K-sector drives now with a jumper to make them 'real' 512 would be supporting 
that over the long run?

I would be interested, and probably others would too, on what the original 
poster finally decides on this?

- Mike


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-05 Thread Michael DeMan

On Oct 5, 2010, at 1:47 PM, Roy Sigurd Karlsbakk wrote:

>> Western Digital RE3 WD1002FBYS 1TB 7200 RPM SATA 3.0Gb/s 3.5" Internal
>> Hard Drive -Bare Drive
>> 
>> are only $129.
>> 
>> vs. $89 for the 'regular' black drives.
>> 
>> 45% higher price, but it is my understanding that the 'RAID Edition'
>> ones also are physically constructed for longer life, lower vibration
>> levels, etc.
> 
> Well, here it's about 60% up and for 150 drives, that makes a wee 
> difference...
> 
> Vennlige hilsener / Best regards
> 
> roy

Understood on 1.6  times cost, especially for quantity 150 drives.

I think (and if I am wrong, somebody else correct me) that if you are using 
commodity controllers, which seem to generally be fine for ZFS, then a drive 
that times out while endlessly re-reading a bad sector could stall reads on 
the entire pool.  On the other hand, if the drives are exported 
as JBOD from a RAID controller, I would think the RAID controller itself would 
just mark the drive as bad and offline it quickly based on its own internal 
algorithms.

The above is also relevant to the anticipated usage.  For instance, if it 
is some sort of backup machine, then delays due to reads stalling without 
TLER are perhaps not a big deal.  If it is for more of an up-front 
production use, that could be intolerable.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-05 Thread Michael DeMan
I'm not sure about the TLER issues by themselves, but after the nightmares I have 
gone through dealing with the 'green' drives, which have both the TLER issue 
and the IntelliPower head-parking issue, I would just stay away from it all 
entirely and pay extra for the 'RAID Edition' drives.

Just out of curiosity, I took a peek at Newegg.

Western Digital RE3 WD1002FBYS 1TB 7200 RPM SATA 3.0Gb/s 3.5" Internal Hard 
Drive -Bare Drive  

are only $129.

vs. $89 for the 'regular' black drives.

45% higher price, but it is my understanding that the 'RAID Edition' ones also 
are physically constructed for longer life, lower vibration levels, etc.


On Oct 5, 2010, at 1:30 PM, Roy Sigurd Karlsbakk wrote:

> Hi all
> 
> I just discovered WD Black drives are rumored not to be set to allow TLER. 
> Does anyone know how much performance impact the lack of TLER might have on a 
> large pool? Choosing Enterprise drives will cost about 60% more, and on a 
> large install, that means a lot of money...
> 
> Vennlige hilsener / Best regards
> 
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> r...@karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det 
> er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
> idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
> relevante synonymer på norsk.
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "zfs unmount" versus "umount"?

2010-09-30 Thread Michael Schuster

On 30.09.10 15:42, Mark J Musante wrote:

On Thu, 30 Sep 2010, Linder, Doug wrote:


Is there any technical difference between using "zfs unmount" to unmount
a ZFS filesystem versus the standard unix "umount" command? I always use
"zfs unmount" but some of my colleagues still just use umount. Is there
any reason to use one over the other?


No, they're identical. If you use 'zfs umount' the code automatically maps
it to 'unmount'. It also maps 'recv' to 'receive' and '-?' to call into the
usage function. Here's the relevant code from main():
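The gist of it is a small alias table consulted before the subcommand is
dispatched - sketched here in Python for illustration rather than the actual C:

# Rough sketch of the alias handling described above -- illustrative only,
# not the actual C from the zfs command's main().
ALIASES = {
    "umount": "unmount",
    "recv": "receive",
    "-?": "help",     # '-?' falls through to the usage output
}

def dispatch(argv):
    cmd = ALIASES.get(argv[0], argv[0])
    if cmd == "help":
        print("usage: zfs ...")      # stand-in for the real usage() text
        return
    print(f"running subcommand {cmd!r} with args {argv[1:]}")

dispatch(["umount", "tank/home"])    # behaves exactly like "unmount"
dispatch(["recv", "tank/backup"])    # behaves exactly like "receive"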


Mark, I think that wasn't the question, rather, "what's the difference 
between 'zfs u[n]mount' and '/usr/bin/umount'?"


HTH
Michael
--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file recovery on lost RAIDZ array

2010-09-28 Thread Michael Eskowitz

I'm sorry to say that I am quite the newbie to ZFS.  When you say zfs 
send/receive what exactly are you referring to?

I had the zfs array mounted to a specific location in my file system 
(/mnt/Share) and I was sharing that location over the network with a samba 
server.  The directory had read-write-execute persion set to allow anyone to 
write to it and I was copying data from windows into it.

At what point do file changes get committed to the file system?  I sort of 
assumed that any additional files copied over would be committed once the next 
file began copying.

Thanks for your insight.

-Mike



  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file recovery on lost RAIDZ array

2010-09-13 Thread Michael Eskowitz
Oh and yes, raidz1.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file recovery on lost RAIDZ array

2010-09-13 Thread Michael Eskowitz
I don't know what happened.  I was in the process of copying files onto my new 
file server when the copy process from the other machine failed.  I turned on 
the monitor for the fileserver and found that it had rebooted by itself at some 
point (machine fault maybe?) and when I remounted the drives every last thing 
was gone.

I am new to zfs.  How do you take snapshots?  Does the system do it 
automagically for you?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] file recovery on lost RAIDZ array

2010-09-12 Thread Michael Eskowitz
I recently lost all of the data on my single-parity raidz array.  Each of the 
drives was encrypted with the zfs array built within the encrypted volumes.

I am not exactly sure what happened.  The files were there and accessible and 
then they were all gone.  The server apparently crashed and rebooted and 
everything was lost.  After the crash I remounted the encrypted drives and the 
zpool was still reporting that roughly 3TB of the 7TB array were used, but I 
could not see any of the files through the array's mount point.  I unmounted 
the zpool and then remounted it and suddenly zpool was reporting 0TB were used. 
 I did not remap the virtual device.  The only thing of note that I saw was 
that the name of the storage pool had changed.  Originally it was "Movies" and then 
it became "Movita".  I am guessing that the file system became corrupted 
somehow.  (zpool status did not report any errors)

So, my questions are these... 

Is there anyway to undelete data from a lost raidz array?  If I build a new 
virtual device on top of the old one and the drive topology remains the same, 
can we scan the drives for files from old arrays?

Also, is there any way to repair a corrupted storage pool?  Is it possible to 
backup the file table or whatever partition index zfs maintains?


I imagine that you all are going to suggest that I scrub the array, but that is 
not an option at this point.  I had a backup of all of the data lost as I am 
moving between file servers so at a certain point I gave up and decided to 
start fresh.  This doesn't give me a warm fuzzy feeling about zfs, though.

Thanks,
-Mike
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with SAN's and HA

2010-08-26 Thread Michael Dodwell
Lao,

I had a look at HAStoragePlus etc. and from what I understand that's for 
mirroring local storage across 2 nodes so that services can access it, 'DRBD 
style'.

Having read through the documentation on the Oracle site, the cluster software 
from what I gather is about clustering services together (Oracle/Apache etc.), and 
again any documentation I've found on storage is about duplicating local storage 
to multiple hosts for HA failover. I can't really see anything on clustering 
services to use shared storage/ZFS pools.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS with SAN's and HA

2010-08-26 Thread Michael Dodwell
Hey all,

I currently work for a company that has purchased a number of different SAN 
solutions (whatever was cheap at the time!) and i want to setup a HA ZFS file 
store over fiber channel.

Basically I've taken slices from each of the sans and added them to a ZFS pool 
on this box (which I'm calling a 'ZFS proxy'). I've then carved out LUN's from 
this pool and assigned them to other servers. I then have snapshots taken on 
each of the LUN's and replication off site for DR. This all works perfectly 
(backups for ESXi!)

However, I'd like to be able to a) expand and b) make it HA. All the 
documentation i can find on setting up a HA cluster for file stores replicates 
data from 2 servers and then serves from these computers (i trust the SAN's to 
take care of the data and don't want to replicate anything -- cost!). Basically 
all i want is for the node that serves the ZFS pool to be HA (if this was to be 
put into production we have around 128tb and are looking to expand to a pb). We 
have a couple of IBM SVC's that seem to handle the HA node setup in some 
obscure property IBM way so logically it seems possible.

Clients would only be making changes via a single 'ZFS proxy' at a time 
(multi-pathing set up for failover only), so I don't believe I'd need OCFS for 
the setup? If I do need to set up OCFS, can I put ZFS on top of that? (I want 
snapshotting/rollback and replication to an off-site location, as well as all 
the goodness of thin provisioning and de-duplication.)

However, when I imported the ZFS pool onto the 2nd box I got large warnings about 
it being mounted elsewhere and I needed to force the import; then when 
importing the LUNs I saw that the GUID was different, so multi-pathing doesn't 
pick up that the LUNs are the same. Can I change a GUID via stmfadm? Is any of 
this even possible over fiber channel? Is anyone able to point me at some 
documentation? Am I simply crazy?

Any input would be most welcome.

Thanks in advance,
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs/iSCSI: 0000 = SNS Error Type: Current Error (0x70)

2010-08-26 Thread Michael W Lucas
Hi,

I'm trying to track down an error with a 64bit x86 OpenSolaris 2009.06 ZFS 
shared via iSCSI and an Ubuntu 10.04 client.  The client can successfully log 
in, but no device node appears.  I captured a session with wireshark.  When the 
client attempts a "SCSI: Inquiry LUN: 0x00", OpenSolaris sends a "SCSI Response 
(Check Condition) LUN:0x00" that contains the following:

.111  = SNS Error Type: Current Error (0x70)
Filemark: 0, EOM: 0, ILI: 0
 0100 = Sense Key: Hardware Error (0x04)

The ZFS being exported is a 400GB chunk of a 1TB ZFS mirror.  The underlying OS 
reports no hardware errors, and "zpool status" looks OK.  Why would OpenSolaris 
give this error?  Is there anything I can do for it?  Any suggestions would be 
appreciated.

(I discussed this with the open-iscsi people at 
http://groups.google.com/group/open-iscsi/browse_thread/thread/06b83227ffc6a31a/2e58a163e21ec74e#2e58a163e21ec74e.)

Thanks,
==ml
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 64-bit vs 32-bit applications

2010-08-16 Thread Michael Schuster

On 17.08.10 04:17, Will Murnane wrote:

On Mon, Aug 16, 2010 at 21:58, Kishore Kumar Pusukuri
  wrote:

Hi,
I am surprised with the performances of some 64-bit multi-threaded
applications on my AMD Opteron machine. For most of the applications, the
performance of 32-bit version is almost same as the performance of 64-bit
version. However, for a couple of applications, 32-bit versions provide
better performance (running-time is around 76 secs) than 64-bit (running
time is around 96 secs). Could anyone help me to find the reason behind
this, please?

[...]

This list discusses the ZFS filesystem.  Perhaps you'd be better off
posting to perf-discuss or tools-gcc?

That said, you need to provide more information.  What compiler and
flags did you use?  What does your program (broadly speaking) do?
What did you measure to conclude that it's slower in 64-bit mode?


add to that: what OS are you using?

Michael
--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Degraded Pool, Spontaneous Reboots

2010-08-12 Thread Michael Anderson
Hello,

I've been getting warnings that my zfs pool is degraded. At first it was 
complaining about a few corrupt files, which were listed as hex numbers instead 
of filenames, i.e.

VOL1:<0x0>

After a scrub, a couple of the filenames appeared - turns out they were in 
snapshots I don't really need, so I destroyed those snapshots and started a new 
scrub. Subsequently, I typed " zpool status -v VOL1" ... and the machine 
rebooted. When I could log on again, I looked at /var/log/messages, but found 
nothing interesting prior to the reboot. I typed " zpool status -v VOL1" again, 
whereupon the machine rebooted. When the machine was back up, I stopped the 
scrub, waited a while, then typed "zpool status -v VOL1" again, and this time 
got:


r...@nexenta1:~# zpool status -v VOL1
pool: VOL1
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scan: scrub canceled on Wed Aug 11 11:03:15 2010
config:

NAME          STATE     READ WRITE CKSUM
VOL1          DEGRADED     0     0     0
  raidz1-0    DEGRADED     0     0     0
    c2d0      DEGRADED     0     0     0  too many errors
    c3d0      DEGRADED     0     0     0  too many errors
    c4d0      DEGRADED     0     0     0  too many errors
    c5d0      DEGRADED     0     0     0  too many errors

So, I have the following questions:

1) How do I find out which file is corrupt, when I only get something like 
"VOL1:<0x0>"
2) What could be causing these reboots?
3) How can I fix my pool?

Thanks!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS p[erformance drop with new Xeon 55xx and 56xx cpus

2010-08-11 Thread michael schuster

On 08/12/10 04:16, Steve Gonczi wrote:

Greetings,

I am seeing some unexplained performance drop using the above cpus,
using a fairly up-to-date build ( late 145).
Basically, the system seems to be 98% idle, spending most of its time in this 
stack:

   unix`i86_mwait+0xd
   unix`cpu_idle_mwait+0xf1
   unix`idle+0x114
   unix`thread_start+0x8
455645

Most cpus seem to be idling most of the time, sitting on the mwait instruction.
No lock contention, not waiting on io, I am finding myself at a loss explaining 
what this system is doing.
(I am monitoring the system w. lockstat, mpstat, prstat).  Despite the 
predominantly idle system,
I see some latency reported by prstat microstate accounting on the zfs threads.

This is a fairly beefy box, 24G memory,  16 cpus.
Doing a local zfs send | receive, I should be getting at least 100MB/s+,
but I am only getting 5-10MB/s.
I see some Intel errata on the 55xx series xeons, a problem with the
monitor/mwait instructions, that could conceivably cause missed wake-up or 
mis-reported  mwait status.


I'd suggest you supply a bit more information (to the list, not to me, I 
don't know very much about zfs internals):


- zpool/zfs configuration
- history of this issue: has it been like this since you installed the 
machine?

  - if no: what changes were introduced around the time you saw this first?
- does this happen on a busy machine too?
- describe your test in more detail
- provide measurements (lockstat, iostat, maybe some DTrace) before and 
during test, add some timestamps so people can correlate data to events.

- anything else you can think of that might be relevant.

HTH
Michael
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] core dumps eating space in snapshots

2010-07-27 Thread Michael Schuster

On 27.07.10 14:21, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of devsk

I have many core files stuck in snapshots eating up gigs of my disk
space. Most of these are BE's which I don't really want to delete right
now.


Ok, you don't want to delete them ...



Is there a way to get rid of them? I know snapshots are RO but can I do
some magic with clones and reclaim my space?


You don't want to delete them, but you don't want them to take up space
either?  Um ... Sorry, can't be done.  Move them to a different disk ...

Or clarify what it is that you want.

If you're saying you have core files in your present filesystem that you
don't want to delete ... And you also have core files in snapshots that you
*do* want to delete ...  As long as the file hasn't been changing, it's not
consuming space beyond what's in the current filesystem.  (See the output of
zfs list, looking at sizes and you'll see that.)  If it has been changing
... the cores in snapshot are in fact different from the cores in present
filesystem ... then the only way to delete them is to destroy snapshots.
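A quick way to see whether those cores are actually costing anything is the
per-snapshot accounting; a rough sketch that just shells out to zfs list (the
dataset name is a placeholder, substitute your own BE or filesystem):

import subprocess

# Show how much space each snapshot holds uniquely (USED) versus what it
# references (REFER). A core file that still exists unchanged in the live
# filesystem adds nothing to a snapshot's USED column.
out = subprocess.run(
    ["zfs", "list", "-t", "snapshot", "-r", "-o", "name,used,refer",
     "rpool/export/home"],        # hypothetical dataset name
    capture_output=True, text=True, check=True,
).stdout
print(out)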

Or have I still misunderstood the question?


yes, I think so.

Here's how I read it: the snapshots contain lots more than the core files, 
and OP wants to remove only the core files (I'm assuming they weren't 
discovered before the snapshot was taken) but retain the rest.


does that explain it better?

HTH
Michael
--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Michael Shadle
On Mon, Jul 19, 2010 at 4:35 PM, Richard Elling  wrote:

> I depends on if the problem was fixed or not.  What says
>        zpool status -xv
>
>  -- richard

[r...@nas01 ~]# zpool status -xv
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 14h2m with 0 errors on Sun Jul 18 18:32:38 2010
config:

NAMESTATE READ WRITE CKSUM
tankDEGRADED 0 0 0
  raidz2ONLINE   0 0 0
c0t3d0  ONLINE   0 0 0
c0t2d0  ONLINE   0 0 0
c0t4d0  ONLINE   0 0 0
c0t1d0  ONLINE   0 0 0
c0t6d0  ONLINE   0 0 0
c0t7d0  ONLINE   0 0 0
c0t0d0  ONLINE   0 0 0
c0t5d0  ONLINE   0 0 0
  raidz2DEGRADED 0 0 0
c2t0d0  ONLINE   0 0 0
c2t1d0  ONLINE   0 0 0
c2t2d0  ONLINE   0 0 0
c2t3d0  ONLINE   0 0 0
c2t4d0  ONLINE   0 0 0
c2t5d0  DEGRADED 0 0 0  too many errors
c2t6d0  ONLINE   0 0 0
c2t7d0  ONLINE   0 0 0

was never fixed. I thought I needed to replace the drive. Should I
mark it as "resolved" or whatever the syntax is and re-run a scrub?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Michael Shadle
On Mon, Jul 19, 2010 at 4:26 PM, Richard Elling  wrote:

> Aren't you assuming the I/O error comes from the drive?
> fmdump -eV

okay - I guess I am. Is this just telling me "hey stupid, a checksum
failed" ? In which case why did this never resolve itself and the
specific device get marked as degraded?

Apr 04 2010 21:52:38.920978339 ereport.fs.zfs.checksum
nvlist version: 0
class = ereport.fs.zfs.checksum
ena = 0x64350d4040300c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0xfd80ebd352cc9271
vdev = 0x29282dc6fa073a2
(end detector)

pool = tank
pool_guid = 0xfd80ebd352cc9271
pool_context = 0
pool_failmode = wait
vdev_guid = 0x29282dc6fa073a2
vdev_type = disk
vdev_path = /dev/dsk/c2t5d0s0
vdev_devid = id1,s...@sata_st31500341as9vs077gt/a
parent_guid = 0xc2d5959dd2c07bf7
parent_type = raidz
zio_err = 0
zio_offset = 0x40abbf2600
zio_size = 0x200
zio_objset = 0x10
zio_object = 0x1c06000
zio_level = 2
zio_blkid = 0x0
__ttl = 0x1
__tod = 0x4bb96c96 0x36e503a3
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Michael Shadle
On Mon, Jul 19, 2010 at 4:16 PM, Marty Scholes  wrote:

> Start a scrub or do an obscure find, e.g. "find /tank_mointpoint -name core" 
> and watch the drive activity lights.  The drive in the pool which isn't 
> blinking like crazy is a faulted/offlined drive.

Actually I guess my real question is why iostat hasn't logged any
errors in its counters even though the device has been bad in there
for months?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Michael Shadle
On Mon, Jul 19, 2010 at 4:16 PM, Marty Scholes  wrote:

> Start a scrub or do an obscure find, e.g. "find /tank_mointpoint -name core" 
> and watch the drive activity lights.  The drive in the pool which isn't 
> blinking like crazy is a faulted/offlined drive.
>
> Ugly and oh-so-hackerish, but it works.

That was my idea, except for figuring out something to make just specific
drives write one at a time. Although if it has been offlined or
whatever, then it shouldn't receive any requests - that sounds even
easier. :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Michael Shadle
On Mon, Jul 19, 2010 at 3:11 PM, Haudy Kazemi  wrote:

> ' iostat -Eni ' indeed outputs Device ID on some of the drives,but I still
> can't understand how it helps me to identify model of specific drive.

Curious:

[r...@nas01 ~]# zpool status -x
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 14h2m with 0 errors on Sun Jul 18 18:32:38 2010
config:

NAMESTATE READ WRITE CKSUM
tankDEGRADED 0 0 0
  raidz2ONLINE   0 0 0
...
  raidz2DEGRADED 0 0 0
...
c2t5d0  DEGRADED 0 0 0  too many errors
...


c2t5d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST31500341AS Revision: SD1B Device Id:
id1,s...@sata_st31500341as9vs077gt
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0


Why has it been reported as bad (for probably 2 months now; I haven't
got around to figuring out which disk in the case it is, etc.) while
iostat isn't showing me any errors?

Note: I do a weekly scrub too. Not sure if that matters or helps reset
the device.
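For physically locating the disk, the Device Id line above already carries the
drive's serial number; a rough sketch that groups iostat -Eni output per device
so it can be matched against the label on the drive (assumes the Solaris-style
output format shown earlier in this thread):

import re
import subprocess

# Group `iostat -Eni` output into per-device blocks and print the identifying
# lines, so c2t5d0 can be matched to the serial printed on the drive label.
out = subprocess.run(["iostat", "-Eni"], capture_output=True, text=True).stdout

device = None
for line in out.splitlines():
    m = re.match(r"^(c\d+t\d+d\d+)\s", line)
    if m:
        device = m.group(1)
        print(f"\n{device}")
    elif device and ("Vendor:" in line or "Device Id:" in line or "Serial No:" in line):
        print("   ", line.strip())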
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommended RAM for ZFS on various platforms

2010-07-16 Thread Michael Johnson
Garrett D'Amore wrote:
>On Fri, 2010-07-16 at 10:24 -0700, Michael Johnson wrote:
>> I'm currently planning on running FreeBSD with ZFS, but I wanted to double-check
>> how much memory I'd need for it to be stable.  The ZFS wiki currently says you
>> can go as low as 1 GB, but recommends 2 GB; however, elsewhere I've seen someone
>> claim that you need at least 4 GB.  Does anyone here know how much RAM FreeBSD
>> would need in this case?
>> 
>> Likewise, how much RAM does OpenSolaris need for stability when running ZFS?
>> How about other OpenSolaris-based OSs, like NexentaStor?  (My searching found
>> that OpenSolaris recommended at least 1 GB, while NexentaStor said 2 GB was
>> okay, 4 GB was better.  I'd be interested in hearing your input, though.)

>
>1GB isn't enough for a real system.  2GB is a bare minimum.  If you're
>going to use dedup, plan on a *lot* more.  I think 4 or 8 GB are good
>for a typical desktop or home NAS setup.  With FreeBSD you may be able
>to get away with less.  (Probably, in fact.)

Fortunately, I don't need deduplication; it's kind of a nice feature, but the 
extra RAM it would take isn't worth it.
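For anyone who does want dedup, the RAM appetite can be ballparked from the
number of unique blocks. A rough sketch - the ~320 bytes per dedup-table entry is
the figure commonly quoted on this list and should be treated as an assumption,
as should the example pool numbers:

# Back-of-the-envelope dedup table (DDT) sizing.
BYTES_PER_DDT_ENTRY = 320            # assumed in-core cost per unique block

def ddt_ram_gib(pool_bytes: float, avg_block_bytes: float) -> float:
    unique_blocks = pool_bytes / avg_block_bytes   # worst case: nothing dedups
    return unique_blocks * BYTES_PER_DDT_ENTRY / 2**30

# Example: 1 TB of data at an average 64 KiB block size -> roughly 4.5 GiB of DDT.
print(f"{ddt_ram_gib(1e12, 64 * 1024):.1f} GiB of DDT")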

Just curious, why do you say I'd be able to get away with less RAM in FreeBSD 
(as compared to NexentaStor, I'm assuming)?  I don't know tons about the OSs in 
question; is FreeBSD just leaner in general?

>> If it matters, I'm currently planning on RAID-Z2 with 4x500GB consumer-grade
>> SATA drives.  (I know that's not a very efficient configuration, but I'd really
>> like the redundancy of RAID-Z2 and I just don't need more than 1 TB of available
>> storage right now, or for the next several years.)  This is on an AMD64 system,
>> and the OS in question will be running inside of VirtualBox, with raw access to
>> the drives.

>
>Btw, instead of RAIDZ2, I'd recommend simply using stripe of mirrors.
>You'll have better performance, and good resilience against errors.  And
>you can grow later as you need to by just adding additional drive pairs.


A pair of mirrors would be nice, but it only guarantees surviving a single drive 
failure; a second failure is survivable only if it lands in the other mirror 
(about two out of three of the possible two-drive failures).  Performance is less 
important to me than redundancy; this setup won't be seeing tons of disk 
activity, but I want it to be as reliable as possible.
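The difference is easy to enumerate for a 4-disk layout; a quick Python check of
the two-drive failure combinations, under the simplifying assumption that any two
drives are equally likely to be the ones that fail:

from itertools import combinations

disks = ["d0", "d1", "d2", "d3"]
mirrors = [{"d0", "d1"}, {"d2", "d3"}]          # two striped mirror pairs

def mirror_survives(failed):
    # Survives as long as no mirror pair loses both members.
    return all(not pair <= failed for pair in mirrors)

def raidz2_survives(failed):
    # A single 4-disk raidz2 vdev tolerates any two failures.
    return len(failed) <= 2

two_disk_failures = [set(c) for c in combinations(disks, 2)]
fatal_for_mirrors = [f for f in two_disk_failures if not mirror_survives(f)]

print(f"two-disk failure combinations: {len(two_disk_failures)}")   # 6
print(f"fatal for striped mirrors:     {len(fatal_for_mirrors)}")   # 2 (one in three)
print(f"fatal for raidz2:              "
      f"{sum(not raidz2_survives(f) for f in two_disk_failures)}")  # 0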

Michael


  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Recommended RAM for ZFS on various platforms

2010-07-16 Thread Michael Johnson
I'm currently planning on running FreeBSD with ZFS, but I wanted to 
double-check 
how much memory I'd need for it to be stable.  The ZFS wiki currently says you 
can go as low as 1 GB, but recommends 2 GB; however, elsewhere I've seen 
someone 
claim that you need at least 4 GB.  Does anyone here know how much RAM FreeBSD 
would need in this case?

Likewise, how much RAM does OpenSolaris need for stability when running ZFS? 
 How about other OpenSolaris-based OSs, like NexentaStor?  (My searching found 
that OpenSolaris recommended at least 1 GB, while NexentaStor said 2 GB was 
okay, 4 GB was better.  I'd be interested in hearing your input, though.)

If it matters, I'm currently planning on RAID-Z2 with 4x500GB consumer-grade 
SATA drives.  (I know that's not a very efficient configuration, but I'd really 
like the redundancy of RAID-Z2 and I just don't need more than 1 TB of 
available 
storage right now, or for the next several years.)  This is on an AMD64 system, 
and the OS in question will be running inside of VirtualBox, with raw access to 
the drives.

Thanks,
Michael


  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption?

2010-07-12 Thread Michael Johnson
Garrett wrote:
>I don't know about ramifications (though I suspect that a broadening
>error scope would decrease ZFS' ability to isolate and work around
>problematic regions on the media), but one thing I do know.  If you use
>FreeBSD disk encryption below ZFS, then you won't be able to import
>your pools to another implementation -- you will be stuck with FreeBSD.


This is an excellent point.  Geli isn't a good option for me, then, though 
using 
encryption outside of the VM would still work.

>Btw, if you want a commercially supported and maintained product, have
>you looked at NexentaStor?  Regardless of what happens with OpenSolaris,
>we aren't going anywhere. (Full disclosure: I'm a Nexenta Systems
>employee. :-)


I probably ought to consider other OpenSolaris alternatives, like NexentaStor. 
 (Though I'd be looking at the free version, not the commercial one: this is 
just for personal use, despite how careful I'm being with it. :) )  However 
(and 
please correct me if I'm wrong), isn't your future still tied to the future of 
OpenSolaris?  The code is open, of course, but my understanding is that there 
isn't the same kind of developer community supporting OpenSolaris itself that 
you see with Linux (or even the BSDs).

In other words, if Oracle stops development of OpenSolaris, there wouldn't be 
enough developers still working on it to keep it from stagnating.  Or are you 
saying that you employ enough kernel hackers to keep up even without Oracle?  
(I 
am admittedly ignorant about the OpenSolaris developer community; this is all 
based on others' statements and opinions that I've read.)

Michael


  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption?

2010-07-12 Thread Michael Johnson
Nikola M wrote:
>Freddie Cash wrote:
>> You definitely want to do the ZFS bits from within FreeBSD.
>Why not using ZFS in OpenSolaris? At least it has most stable/tested
>implementation and also the newest one if needed?


I'd love to use OpenSolaris for exactly those reasons, but I'm wary of using an 
operating system that may not continue to be updated/maintained.  If 
OpenSolaris 
had continued to be regularly released after Oracle bought Sun I'd be choosing 
it.  As it is, I don't want to be pessimistic, but the doubt about 
OpenSolaris's 
future is enough to make me choose FreeBSD instead.  (I'm sure that such 
sentiments won't make me popular here, but so far Oracle has been frustratingly 
silent on their plans for OpenSolaris.)  At the very least, if FreeBSD doesn't 
do what I want I can switch the system disk to OpenSolaris and keep using the 
same pool.  (Right?)

Going back to my original question: does anyone know of any problems that could 
be caused by using raidz on top of encrypted drives?  If there were a physical 
read error, which would get amplified by the encryption layer (if I'm 
understanding full-disk encryption correctly, which I may not be), would ZFS 
still be able to recover?


  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption?

2010-07-11 Thread Michael Johnson
on 11/07/2010 15:54 Andriy Gapon said the following:

>on 11/07/2010 14:21 Roy Sigurd Karlsbakk said the following:
>> 
>> I'm planning on running FreeBSD in VirtualBox (with a Linux host)
>> and giving it raw disk access to four drives, which I plan to
>> configure as a raidz2 volume.
>> 
>> Wouldn't it be better or just as good to use fuse-zfs for such a
>> configuration? I/O from VirtualBox isn't really very good, but then, I
>> haven't tested the linux/fbsd configuration...


Like Freddie already mentioned, I'd heard that fuse-zfs wasn't really all that 
good of an option, and I wanted something that was more stable/reliable.

>Hmm, an unexpected question IMHO - wouldn't it better to just install FreeBSD 
on
>the hardware? :-)
>If an original poster is using Linux as a host OS, then probably he has some
>very good reason to do that.  But performance and etc -wise, directly using
>FreeBSD, of course, should win over fuse-zfs.  Right?
>
>[Installing and maintaining one OS instead of two is the first thing that comes
>to mind]


I'm going with a virtual machine because the box I ended up building for this 
was way more powerful than I needed for just my file server; thus, I figured 
I'd 
use it as a personal machine too.  (I wanted ECC RAM, and there just aren't 
that 
many motherboards that support ECC RAM that are also really cheap and 
low-powered.)  And since I'm much more comfortable with Linux, I wanted to use 
it for the "personal" side of things.


  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Encryption?

2010-07-10 Thread Michael Johnson
I'm planning on running FreeBSD in VirtualBox (with a Linux host) and giving it 
raw disk access to four drives, which I plan to configure as a raidz2 volume.

On top of that, I'm considering using encryption.  I understand that ZFS 
doesn't 
yet natively support encryption, so my idea was to set each drive up with 
full-disk encryption in the Linux host (e.g., using TrueCrypt or dmcrypt), 
mount 
the encrypted drives, and then give the virtual machine access to the virtual 
unencrypted drives.  So the encryption would be transparent to FreeBSD.

However, I don't know enough about ZFS to know if this is a good idea.  I know 
that I need to specifically configure VirtualBox to respect cache flushes, so 
that data really is on disk when ZFS expects it to be.  Would putting ZFS on 
top 
of full-disk encryption like this cause any problems?  E.g., if the (encrypted) 
physical disk has a problem and as a result a larger chunk of the unencrypted 
data is corrupted, would ZFS handle that well?  Are there any other possible 
consequences of this idea that I should know about?  (I'm not too worried about 
any hits in performance; I won't be reading or writing heavily, nor in 
time-sensitive applications.)

I should add that since this is a desktop I'm not nearly as worried about 
encryption as if it were a laptop (theft or loss are less likely), but 
encryption would still be nice.  However, data integrity is the most important 
thing (I'm storing backups of my personal files on this), so if there's a 
chance 
that ZFS wouldn't handle errors well when on top of encryption, I'll just go 
without it.

Thanks,
Michael


  ___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Consequences of resilvering failure

2010-07-06 Thread Michael Johnson
I'm just about to start using ZFS in a RAIDZ configuration for a home file 
server (mostly holding backups), and I wasn't clear on what happens if data 
corruption is detected while resilvering.  For example: let's say I'm using 
RAIDZ1 and a drive fails.  I pull it and put in a new one.  While resilvering, 
ZFS detects corrupt data on one of the remaining disks.  Will the resilvering 
continue, with some files marked as containing errors, or will it simply fail?

(I found this process[1] to repair damaged data, but I wasn't sure what would 
happen if it was detected in the middle of resilvering.)

I will of course have a backup of the pool, but I may opt for additional backup 
if the entire pool could be lost due to data corruption (as opposed to just a 
few files potentially being lost).

Thanks,
Michael

[1] http://dlc.sun.com/osol/docs/content/ZFSADMIN/gbbwl.html


  ___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] b134 pool borked!

2010-06-30 Thread Michael Mattsson
Just in case any stray searches finds it way here, this is what happened to my 
pool: http://phrenetic.to/zfs
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Native ZFS for Linux

2010-06-11 Thread Michael Shadle
On Fri, Jun 11, 2010 at 2:50 AM, Alex Blewitt  wrote:

> You are sadly mistaken.
>
> From GNU.org on license compatibilities:
>
> http://www.gnu.org/licenses/license-list.html
>
>        Common Development and Distribution License (CDDL), version 1.0
>        This is a free software license. It has a copyleft with a scope
> that's similar to the one in the Mozilla Public License, which makes it
> incompatible with the GNU GPL. This means a module covered by the GPL and a
> module covered by the CDDL cannot legally be linked together. We urge you
> not to use the CDDL for this reason.
>
>        Also unfortunate in the CDDL is its use of the term “intellectual
> property”.
>
> Whether a license is classified as "Open Source" or not does not imply that
> all open source licenses are compatible with each other.

Can we stop the license talk *yet again*

Nobody here is a lawyer (IANAL!) and everyone has their own
interpretations and are splitting hairs.

In my opinion, the source code itself shouldn't be ported, the
CONCEPTS should be. Then there's no licensing issues at all. No
questions. etc.

To me, ZFS is important for bitrot protection, pooled storage and
snapshots come in handy in a couple places. Getting a COW filesystem
w/ snapshots and storage pooling would cover a lot of the demand for
ZFS as far as I'm concerned. (However, that's when a comparison with
Btrfs makes sense as it is COW too)

The minute I saw "ZFS on Linux" I knew this would degrade into a
virtual pissing contest on "my understanding is better than yours" and
a licensing fight.

To me, this is what needs to happen:

a) Get a Sun/Oracle attorney involved who understands this and flat
out explains what needs to be done to allow ZFS to be used with the
Linux kernel, or
b) Port the concepts and not the code (or the portions of code under
the restrictive license), or
c) Look at Btrfs or other filesystems which may be extended to give
the same capabilities as ZFS without the licensing issue and focus all
this development time on extending those.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool replace lockup / replace process now stalled, how to fix?

2010-05-21 Thread Michael Donaghy
For the record, in case anyone else experiences this behaviour: I tried 
various things which failed, and finally, as a last-ditch effort, upgraded my 
FreeBSD, giving me zpool v14 rather than v13 - and now it's resilvering as it 
should.

Michael

On Monday 17 May 2010 09:26:23 Michael Donaghy wrote:
> Hi,
> 
> I recently moved to a freebsd/zfs system for the sake of data integrity,
>  after losing my data on linux. I've now had my first hard disk failure;
>  the bios refused to even boot with the failed drive (ad18) connected, so I
>  removed it. I have another drive, ad16, which had enough space to replace
>  the failed one, so I partitioned it and attempted to use "zpool replace"
>  to replace the failed partitions for new ones, i.e. "zpool replace tank
>  ad18s1d ad16s4d". This seemed to simply hang, with no processor or disk
>  use; any "zpool status" commands also hung. Eventually I attempted to
>  reboot the system, which also eventually hung; after waiting a while,
>  having no other option, rightly or wrongly, I hard-rebooted. Exactly the
>  same behaviour happened with the other zpool replace.
> 
> Now, my zpool status looks like:
> arcueid ~ $ zpool status
>   pool: tank
>  state: DEGRADED
>  scrub: none requested
> config:
> 
> NAME   STATE READ WRITE CKSUM
> tank   DEGRADED 0 0 0
>   raidz2   DEGRADED 0 0 0
> ad4s1d ONLINE   0 0 0
> ad6s1d ONLINE   0 0 0
> ad9s1d ONLINE   0 0 0
> ad17s1dONLINE   0 0 0
> replacing  DEGRADED 0 0 0
>   ad18s1d  UNAVAIL  0 9.62K 0  cannot open
>   ad16s4d  ONLINE   0 0 0
> ad20s1dONLINE   0 0 0
>   raidz2   DEGRADED 0 0 0
> ad4s1e ONLINE   0 0 0
> ad6s1e ONLINE   0 0 0
> ad17s1eONLINE   0 0 0
> replacing  DEGRADED 0 0 0
>   ad18s1e  UNAVAIL  0 11.2K 0  cannot open
>   ad16s4e  ONLINE   0 0 0
> ad20s1eONLINE   0 0 0
> 
> errors: No known data errors
> 
> It looks like the replace has taken in some sense, but ZFS doesn't seem to
>  be resilvering as it should. Attempting to zpool offline doesn't work:
>  arcueid ~ # zpool offline tank ad18s1d
> cannot offline ad18s1d: no valid replicas
> Attempting to scrub causes a similar hang to before. Data is still readable
> (from the zvol which is the only thing actually on this filesystem),
>  although slowly.
> 
> What should I do to recover this / trigger a proper replace of the failed
> partitions?
> 
> Many thanks,
> Michael
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs mount -a kernel panic

2010-05-19 Thread Michael Schuster

On 19.05.10 17:53, John Andrunas wrote:

Not to my knowledge, how would I go about getting one?  (CC'ing discuss)


man savecore and dumpadm.

Michael



On Wed, May 19, 2010 at 8:46 AM, Mark J Musante  wrote:


Do you have a coredump?  Or a stack trace of the panic?

On Wed, 19 May 2010, John Andrunas wrote:


Running ZFS on a Nexenta box, I had a mirror get broken, and apparently
the metadata is corrupt now.  If I try to mount vol2 it works, but if
I try "zfs mount -a" or mount vol2/vm2 it instantly kernel panics and
reboots.  Is it possible to recover from this?  I don't care if I lose
the file listed below, but the other data in the volume would be
really nice to get back.  I have scrubbed the volume to no avail.  Any
other thoughts?


zpool status -xv vol2
  pool: vol2
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:

   NAMESTATE READ WRITE CKSUM
   vol2ONLINE   0 0 0
 mirror-0  ONLINE   0 0 0
   c3t3d0  ONLINE   0 0 0
   c3t2d0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

   vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

--
John
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Regards,
markm








--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool replace lockup / replace process now stalled, how to fix?

2010-05-17 Thread Michael Donaghy
Hi,

I recently moved to a FreeBSD/ZFS system for the sake of data integrity, after
losing my data on Linux. I've now had my first hard disk failure; the BIOS
refused to even boot with the failed drive (ad18) connected, so I removed it.
I have another drive, ad16, which had enough space to replace the failed one,
so I partitioned it and attempted to use "zpool replace" to replace the failed
partitions with new ones, i.e. "zpool replace tank ad18s1d ad16s4d". This
seemed to simply hang, with no processor or disk use; any "zpool status"
commands also hung. Eventually I attempted to reboot the system, which also
eventually hung; after waiting a while, having no other option, rightly or
wrongly, I hard-rebooted. Exactly the same behaviour happened with the other
zpool replace.

Now, my zpool status looks like:
arcueid ~ $ zpool status
  pool: tank
 state: DEGRADED
 scrub: none requested
config:

NAME   STATE READ WRITE CKSUM
tank   DEGRADED 0 0 0
  raidz2   DEGRADED 0 0 0
ad4s1d ONLINE   0 0 0
ad6s1d ONLINE   0 0 0
ad9s1d ONLINE   0 0 0
ad17s1dONLINE   0 0 0
replacing  DEGRADED 0 0 0
  ad18s1d  UNAVAIL  0 9.62K 0  cannot open
  ad16s4d  ONLINE   0 0 0
ad20s1dONLINE   0 0 0
  raidz2   DEGRADED 0 0 0
ad4s1e ONLINE   0 0 0
ad6s1e ONLINE   0 0 0
ad17s1eONLINE   0 0 0
replacing  DEGRADED 0 0 0
  ad18s1e  UNAVAIL  0 11.2K 0  cannot open
  ad16s4e  ONLINE   0 0 0
ad20s1eONLINE   0 0 0

errors: No known data errors

It looks like the replace has taken in some sense, but ZFS doesn't seem to be 
resilvering as it should. Attempting to zpool offline doesn't work:
arcueid ~ # zpool offline tank ad18s1d
cannot offline ad18s1d: no valid replicas
Attempting to scrub causes a similar hang to before. Data is still readable 
(from the zvol which is the only thing actually on this filesystem), although 
slowly.

What should I do to recover this / trigger a proper replace of the failed 
partitions?
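
For completeness, the commands I would guess are relevant here - I haven't
dared run any of them yet, so please treat this as a sketch rather than
something I know to be safe - would be along these lines:

  # detach the dead half of the 'replacing' pair so the new disk takes over
  zpool detach tank ad18s1d

  # or force the replacement again if the first attempt never registered
  zpool replace -f tank ad18s1d ad16s4d

  # and then check whether a resilver has actually started
  zpool status tank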

Many thanks,
Michael
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?

2010-05-11 Thread Michael DeMan
I agree on the motherboard and peripheral chipset issue.

This board, and the last generation of AMD quad/six-core motherboards, all
seem to use the AMD SP56x0/SP5100 chipset, about which I can't find much
information on support under either OpenSolaris or FreeBSD.

Another issue is the LSI SAS2008 SAS controller chipset, which is frequently
offered as an onboard option on many of these motherboards and still seems to
be somewhat of a work in progress in regards to being 'production ready'.



On May 11, 2010, at 3:29 PM, Brandon High wrote:

> On Tue, May 11, 2010 at 5:29 AM, Thomas Burgess  wrote:
>> I'm specificially looking at this motherboard:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16813182230
> 
> I'd be more concerned that the motherboard and its attached
> peripherals are unsupported than the processor. Solaris can handle 12
> cores with no problems.
> 
> -B
> 
> -- 
> Brandon High : bh...@freaks.com
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] osol monitoring question

2010-05-10 Thread Michael Schuster

On 10.05.10 08:57, Roy Sigurd Karlsbakk wrote:

Hi all

It seems that when using ZFS, the usual tools like vmstat, sar, top, etc. are
quite worthless, since ZFS I/O load is not reported as iowait and so on. Are
there any plans to rewrite the old performance monitoring tools, or the ZFS
parts, so that standard monitoring tools work? If not, what other tools exist
that can do the same?


"zpool iostat" for one.

Michael
--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why both dedup and compression?

2010-05-06 Thread Michael Sullivan
This is interesting, but what about iSCSI volumes for virtual machines?

Compress or de-dupe?  Assuming the virtual machine was made from a clone of the 
original iSCSI or a master iSCSI volume.

Does anyone have any real-world data on this?  I would think the iSCSI volumes
would diverge quite a bit over time, even with compression and/or de-duplication.

Just curious…
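
If anyone wants to gather such numbers, I imagine the relevant properties
would be something like this (dataset name hypothetical):

  # space used and compression actually achieved on a zvol backing an iSCSI LUN
  zfs get used,referenced,compressratio tank/vm-lun0

  # dedup ratio is reported at the pool level
  zpool get dedupratio tank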

On 6 May 2010, at 16:39 , Peter Tribble wrote:

> On Thu, May 6, 2010 at 2:06 AM, Richard Jahnel  wrote:
>> I've googled this for a bit, but can't seem to find the answer.
>> 
>> What does compression bring to the party that dedupe doesn't cover already?
> 
> Compression will reduce the storage requirements for non-duplicate data.
> 
> As an example, I have a system that I rsync the web application data
> from a whole
> bunch of servers (zones) to. There's a fair amount of duplication in
> the application
> files (java, tomcat, apache, and the like) so dedup is a big win. On
> the other hand,
> there's essentially no duplication whatsoever in the log files, which
> are pretty big,
> but compress really well. So having both enabled works really well.
> 
> -- 
> -Peter Tribble
> http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Michael Sullivan
Hi Marc,

Well, if you are striping over multiple devices then your I/O should be spread
over the devices and you should be reading from them all simultaneously rather
than just accessing a single device.  With traditional striping, a read is
serviced by all n disks in the stripe, so it completes in roughly 1/n of the
time it would take from a single disk.

The round-robin access I am referring to is the way the L2ARC vdevs appear to
be accessed.  Any given object will be taken from a single device rather than
from several devices simultaneously, which would otherwise increase the I/O
throughput.  Theoretically, a stripe spread over 4 disks would give 4 times
the read performance of a single disk.  This also assumes the controller can
handle multiple concurrent I/Os, or that the stripe is spread over a different
disk controller for each disk.

SSDs are fast, but if I can read a block from more devices simultaneously, it
will cut the latency of the overall read.
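
To make the comparison concrete, the L2ARC we are talking about is just a set
of independently added cache devices, roughly like this (device names
hypothetical):

  # add several SSDs as independent L2ARC (cache) devices
  zpool add tank cache c7t0d0 c7t1d0 c7t2d0 c7t3d0

  # watch how reads are actually distributed across them
  zpool iostat -v tank 5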

On 7 May 2010, at 02:57 , Marc Nicholas wrote:

> Hi Michael,
> 
> What makes you think striping the SSDs would be faster than round-robin?
> 
> -marc
> 
> On Thu, May 6, 2010 at 1:09 PM, Michael Sullivan  
> wrote:
> Everyone,
> 
> Thanks for the help.  I really appreciate it.
> 
> Well, I actually walked through the source code with an associate today and 
> we found out how things work by looking at the code.
> 
> It appears that L2ARC is just assigned in round-robin fashion.  If a device 
> goes offline, then it goes to the next and marks that one as offline.  The 
> failure to retrieve the requested object is treated like a cache miss and 
> everything goes along its merry way, as far as we can tell.
> 
> I would have hoped it to be different in some way.  Like if the L2ARC was 
> striped for performance reasons, that would be really cool and using that 
> device as an extension of the VM model it is modeled after.  Which would mean 
> using the L2ARC as an extension of the virtual address space and striping it 
> to make it more efficient.  Way cool.  If it took out the bad device and 
> reconfigured the stripe device, that would be even way cooler.  Replacing it 
> with a hot spare more cool too.  However, it appears from the source code 
> that the L2ARC is just a (sort of) jumbled collection of ZFS objects.  Yes, 
> it gives you better performance if you have it, but it doesn't really use it 
> in a way you might expect something as cool as ZFS does.
> 
> I understand why it is read-only, and why it invalidates its cache when a write 
> occurs - to be expected for any object written.
> 
> If an object is not there because of a failure or because it has been removed 
> from the cache, it is treated as a cache miss, all well and good - go fetch 
> from the pool.
> 
> I also understand why the ZIL is important and that it should be mirrored if 
> it is to be on a separate device.  Though I'm wondering how it is handled 
> internally when there is a failure of one of its default devices, but then 
> again, it's on a regular pool and should be redundant enough, only just some 
> degradation in speed.
> 
> Breaking these devices out from their default locations is great for 
> performance, and I understand.  I just wish the knowledge of how they work 
> and their internal mechanisms were not so much of a black box.  Maybe that is 
> due to the speed at which ZFS is progressing and the features it adds with 
> each subsequent release.
> 
> Overall, I am very impressed with ZFS, its flexibility and, even more so, the 
> way it breaks all the rules about how storage should be managed, and I really 
> like it.  I have yet to see anything come close in its approach to disk data 
> management.  Let's just hope it keeps moving forward; it is truly a unique 
> way to view disk storage.
> 
> Anyway, sorry for the ramble, but to everyone, thanks again for the answers.
> 
> Mike
> 
> ---
> Michael Sullivan
> michael.p.sulli...@me.com
> http://www.kamiogi.net/
> Japan Mobile: +81-80-3202-2599
> US Phone: +1-561-283-2034
> 
> On 7 May 2010, at 00:00 , Robert Milkowski wrote:
> 
> > On 06/05/2010 15:31, Tomas Ögren wrote:
> >> On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:
> >>
> >>
> >>> On Wed, 5 May 2010, Edward Ned Harvey wrote:
> >>>
> >>>> In the L2ARC (cache) there is no ability to mirror, because cache device
> >>>> removal has always been supported.  You can't mirror a cache device, 
> >>>> because
> >>>> you don't need it.
> >>>>
> >>> How do you know that I don't need it?  The ability seems useful to me.
> >>

Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Michael Sullivan
Everyone,

Thanks for the help.  I really appreciate it.

Well, I actually walked through the source code with an associate today and we 
found out how things work by looking at the code.

It appears that L2ARC is just assigned in round-robin fashion.  If a device 
goes offline, then it goes to the next and marks that one as offline.  The 
failure to retrieve the requested object is treated like a cache miss and 
everything goes along its merry way, as far as we can tell.

I would have hoped it to be different in some way.  Like if the L2ARC was 
striped for performance reasons, that would be really cool and using that 
device as an extension of the VM model it is modeled after.  Which would mean 
using the L2ARC as an extension of the virtual address space and striping it to 
make it more efficient.  Way cool.  If it took out the bad device and 
reconfigured the stripe device, that would be even way cooler.  Replacing it 
with a hot spare more cool too.  However, it appears from the source code that 
the L2ARC is just a (sort of) jumbled collection of ZFS objects.  Yes, it gives 
you better performance if you have it, but it doesn't really use it in a way 
you might expect something as cool as ZFS does.

I understand why it is read-only, and why it invalidates its cache when a write 
occurs - to be expected for any object written.

If an object is not there because of a failure or because it has been removed 
from the cache, it is treated as a cache miss, all well and good - go fetch 
from the pool.

I also understand why the ZIL is important and that it should be mirrored if it 
is to be on a separate device.  Though I'm wondering how it is handled 
internally when there is a failure of one of its default devices, but then 
again, it's on a regular pool and should be redundant enough, only just some 
degradation in speed.
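
For reference, the kind of mirrored separate log I have in mind would be added
roughly like this (device names purely hypothetical):

  # attach a mirrored pair of SSDs as a dedicated log (slog) device
  zpool add tank log mirror c2t0d0 c2t1d0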

Breaking these devices out from their default locations is great for 
performance, and I understand.  I just wish the knowledge of how they work and 
their internal mechanisms were not so much of a black box.  Maybe that is due 
to the speed at which ZFS is progressing and the features it adds with each 
subsequent release.

Overall, I am very impressed with ZFS, its flexibility and, even more so, the 
way it breaks all the rules about how storage should be managed, and I really 
like it.  I have yet to see anything come close in its approach to disk data 
management.  Let's just hope it keeps moving forward; it is truly a unique way 
to view disk storage.

Anyway, sorry for the ramble, but to everyone, thanks again for the answers.

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 7 May 2010, at 00:00 , Robert Milkowski wrote:

> On 06/05/2010 15:31, Tomas Ögren wrote:
>> On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:
>> 
>>   
>>> On Wed, 5 May 2010, Edward Ned Harvey wrote:
>>> 
>>>> In the L2ARC (cache) there is no ability to mirror, because cache device
>>>> removal has always been supported.  You can't mirror a cache device, 
>>>> because
>>>> you don't need it.
>>>>   
>>> How do you know that I don't need it?  The ability seems useful to me.
>>> 
>> The gain is quite minimal.. If the first device fails (which doesn't
>> happen too often I hope), then it will be read from the normal pool once
>> and then stored in ARC/L2ARC again. It just behaves like a cache miss
>> for that specific block... If this happens often enough to become a
>> performance problem, then you should throw away that L2ARC device
>> because it's broken beyond usability.
>> 
>>   
> 
> Well if a L2ARC device fails there might be an unacceptable drop in delivered 
> performance.
> If it were mirrored then the drop usually would be much smaller, or there could 
> be no drop if a mirror had an option to read only from one side.
> 
> Being able to mirror L2ARC might be especially useful once a persistent L2ARC 
> is implemented, since after a node restart or a resource failover in a cluster 
> the L2ARC will be kept warm. Then the only thing which might considerably 
> affect L2 performance would be an L2ARC device failure...
> 
> 
> -- 
> Robert Milkowski
> http://milek.blogspot.com
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-05 Thread Michael Sullivan
On 6 May 2010, at 13:18 , Edward Ned Harvey wrote:

>> From: Michael Sullivan [mailto:michael.p.sulli...@mac.com]
>> 
>> While it explains how to implement these, there is no information
>> regarding failure of a device in a striped L2ARC set of SSD's.  I have
> 
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Sepa
> rate_Cache_Devices
> 
> It is not possible to mirror or use raidz on cache devices, nor is it
> necessary. If a cache device fails, the data will simply be read from the
> main pool storage devices instead.
> 

I understand this.

> I guess I didn't write this part, but:  If you have multiple cache devices,
> they are all independent from each other.  Failure of one does not negate
> the functionality of the others.
> 

OK, this is what I wanted to know: that the L2ARC devices assigned to the pool 
are not striped but are independent.  Loss of one drive will just cause cache 
misses and force ZFS to go out to the pool for those objects.

But then I'm not talking about using RAIDZ on a cache device.  I'm talking 
about a striped device which would be RAID-0.  If the SSD's are all assigned to 
L2ARC, then they are not striped in any fashion (RAID-0), but are completely 
independent and the L2ARC will continue to operate, just missing a single SSD.

> 
>> I'm running 2009.11 which is the latest OpenSolaris.  
> 
> Quoi??  2009.06 is the latest available from opensolaris.com and
> opensolaris.org.
> 
> If you want something newer, AFAIK, you have to go to developer build, such
> as osol-dev-134
> 
> Sure you didn't accidentally get 2008.11?
> 

My mistake… snv_111b which is 2009.06.  I know it went up to 11 somewhere.

> 
>> I am also well aware of the effect of losing a ZIL device will cause
>> loss of the entire pool.  Which is why I would never have a ZIL device
>> unless it was mirrored and on different controllers.
> 
> Um ... the log device is not special.  If you lose *any* unmirrored device,
> you lose the pool.  Except for cache devices, or log devices on zpool >=19
> 

Well, if I've got a separate ZIL - separate for performance, and mirrored 
because I think my data is valuable and important - I will have something more 
than RAID-0 on my main storage pool too.  More than likely RAIDZ2, since I 
plan on using L2ARC to help improve performance along with separate mirrored 
SSD ZIL devices.

> 
>> From the information I've been reading about the loss of a ZIL device,
>> it will be relocated to the storage pool it is assigned to.  I'm not
>> sure which version this is in, but it would be nice if someone could
>> provide the release number it is included in (and actually works), it
>> would be nice.  
> 
> What the heck?  Didn't I just answer that question?
> I know I said this is answered in ZFS Best Practices Guide.
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Sepa
> rate_Log_Devices
> 
> Prior to pool version 19, if you have an unmirrored log device that fails,
> your whole pool is permanently lost.
> Prior to pool version 19, mirroring the log device is highly recommended.
> In pool version 19 or greater, if an unmirrored log device fails during
> operation, the system reverts to the default behavior, using blocks from the
> main storage pool for the ZIL, just as if the log device had been gracefully
> removed via the "zpool remove" command.
> 

No need to get defensive here; all I'm looking for is the zpool version number 
which supports it and the OpenSolaris release which supports that zpool 
version.

I think that if you are building for performance, it would be almost intuitive 
to have a mirrored ZIL in the event of failure, and perhaps even a hot spare 
available as well.  I don't like the idea of my ZIL being transferred back to 
the pool, but having it transferred back is better than the alternative which 
would be data loss or corruption.

> 
>> Also, will this functionality be included in the
>> mythical 2010.03 release?
> 
> 
> Zpool 19 was released in build 125.  Oct 16, 2009.  You can rest assured it
> will be included in 2010.03, or 04, or whenever that thing comes out.
> 

Thanks, build 125.
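
For the record, a quick way to confirm where a given system stands should be
something like this ('tank' is a placeholder pool name):

  # list the zpool versions this build supports - version 19 is the one
  # that adds log device removal
  zpool upgrade -v

  # show which on-disk version an existing pool is currently running
  zpool get version tank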

> 
>> So what you are saying is that if a single device fails in a striped
>> L2ARC VDEV, then the entire VDEV is taken offline and the fallback is
>> to simply use the regular ARC and fetch from the pool whenever there is
>> a cache miss.
> 
> It sounds like you're only going to believe it if you test it.  Go for it.
> That's what I did before I wrote that section of the ZFS Best Practices
> Guide.
> 
> In ZFS, there is no such thing as striping, although 

Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-05 Thread Michael Sullivan
Hi Ed,

Thanks for your answers.  They seem to make sense, sort of…

On 6 May 2010, at 12:21 , Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Michael Sullivan
>> 
>> I have a question I cannot seem to find an answer to.
> 
> Google for ZFS Best Practices Guide  (on solarisinternals).  I know this
> answer is there.
> 

My Google is very strong and I have the Best Practices Guide committed to 
bookmark as well as most of it to memory.

While it explains how to implement these, there is no information regarding 
failure of a device in a striped L2ARC set of SSDs.  I have been hard pressed 
to find this information anywhere, short of testing it myself, but I don't 
have the necessary hardware in a lab to test it properly.  If someone has 
pointers to references, could you please cite chapter and verse, rather than 
offering the advice to "Go read the manual".

> 
>> I know if I set up ZIL on SSD and the SSD goes bad, the ZIL will be
>> relocated back to the pool.  I'd probably have it mirrored anyway,
>> just in case.  However you cannot mirror the L2ARC, so...
> 
> Careful.  The "log device removal" feature exists, and is present in the
> developer builds of opensolaris today.  However, it's not included in
> opensolars 2009.06, and it's not included in the latest and greatest solaris
> 10 yet.  Which means, right now, if you lose an unmirrored ZIL (log) device,
> your whole pool is lost, unless you're running a developer build of
> opensolaris.
> 

I'm running 2009.11, which is the latest OpenSolaris.  I should have made that 
clear, and also that I don't intend this to be on a Solaris 10 system; I am 
waiting for the next production build anyway.  As you say, it does not exist 
in 2009.06, but that is not the latest production OpenSolaris, which is 
2009.11, and I'd be more interested in its behavior than in that of an older 
release.

I am also well aware that losing a ZIL device will cause loss of the entire 
pool, which is why I would never have a ZIL device unless it was mirrored and 
on different controllers.

From the information I've been reading about the loss of a ZIL device, it will 
be relocated back to the storage pool it is assigned to.  I'm not sure which 
version this is in; it would be nice if someone could provide the release 
number in which it is included (and actually works).  Also, will this 
functionality be included in the mythical 2010.03 release?

Also, I'd be interested to know what features along these lines will be 
available in 2010.03 if it ever sees the light of day.

> 
>> What I want to know, is what happens if one of those SSD's goes bad?
>> What happens to the L2ARC?  Is it just taken offline, or will it
>> continue to perform even with one drive missing?
> 
> In the L2ARC (cache) there is no ability to mirror, because cache device
> removal has always been supported.  You can't mirror a cache device, because
> you don't need it.
> 
> If one of the cache devices fails, no harm is done.  That device goes
> offline.  The rest stay online.
> 

So what you are saying is that if a single device fails in a striped L2ARC 
VDEV, then the entire VDEV is taken offline and the fallback is to simply use 
the regular ARC and fetch from the pool whenever there is a cache miss.

Or does what you are saying here mean that if I have 4 SSDs in a stripe for 
my L2ARC, and one device fails, the L2ARC will be reconfigured dynamically, 
using the remaining SSDs?

It would be good to get an answer to this from someone who has actually tested 
this or is more intimately familiar with the ZFS code rather than all the 
speculation I've been getting so far.

> 
>> Sorry, if these questions have been asked before, but I cannot seem to
>> find an answer.
> 
> Since you said this twice, I'll answer it twice.  ;-)
> I think the best advice regarding cache/log device mirroring is in the ZFS
> Best Practices Guide.
> 

Been there read that, many, many times.  It's an invaluable reference, I agree.

Thanks

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] b134 pool borked!

2010-05-05 Thread Michael Mattsson
I got a suggestion to check the output of fmdump -eV for PCI errors, in case 
the controller is broken.

Attached you'll find the last panic's fmdump -eV. It indicates that ZFS can't 
open the drives. That might suggest a broken controller, but my slog is on the 
motherboard's internal controller. 

One might think that the motherboard itself is toast - or do we have a case 
of unstable power?
-- 
This message posted from opensolaris.org

May 04 2010 19:44:31.716566239 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
class = ereport.fs.zfs.vdev.open_failed
ena = 0xeeed67dca00c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x97541c1ea1ad833e
vdev = 0x645834a4c69584e5
(end detector)

pool = tank
pool_guid = 0x97541c1ea1ad833e
pool_context = 1
pool_failmode = wait
vdev_guid = 0x645834a4c69584e5
vdev_type = disk
vdev_path = /dev/dsk/c13t1d0s0
vdev_devid = id1,s...@sata_wdc_wd5001aals-0_wd-wmasy3260051/a
parent_guid = 0x6041a7903a345374
parent_type = raidz
prev_state = 0x1
__ttl = 0x1
__tod = 0x4be05cff 0x2ab5eedf

May 04 2010 19:44:31.716565705 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
class = ereport.fs.zfs.vdev.open_failed
ena = 0xeeed67dca00c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x97541c1ea1ad833e
vdev = 0x928ecd01b281b313
(end detector)

pool = tank
pool_guid = 0x97541c1ea1ad833e
pool_context = 1
pool_failmode = wait
vdev_guid = 0x928ecd01b281b313
vdev_type = disk
vdev_path = /dev/dsk/c13t2d0s0
vdev_devid = id1,s...@sata_samsung_hd103si___s1vsj90sc22634/a
parent_guid = 0x6041a7903a345374
parent_type = raidz
prev_state = 0x1
__ttl = 0x1
__tod = 0x4be05cff 0x2ab5ecc9

May 04 2010 19:44:31.716565713 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
class = ereport.fs.zfs.vdev.open_failed
ena = 0xeeed67dca00c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x97541c1ea1ad833e
vdev = 0xc6c893601f1263cb
(end detector)

pool = tank
pool_guid = 0x97541c1ea1ad833e
pool_context = 1
pool_failmode = wait
vdev_guid = 0xc6c893601f1263cb
vdev_type = disk
vdev_path = /dev/dsk/c8t0d0s0
vdev_devid = id1,s...@sata_intel_ssdsa2m080__cvpo003401vt080bgn/a
parent_guid = 0x97541c1ea1ad833e
parent_type = root
prev_state = 0x1
__ttl = 0x1
__tod = 0x4be05cff 0x2ab5ecd1

May 04 2010 19:44:31.716566468 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
class = ereport.fs.zfs.vdev.open_failed
ena = 0xeeed67dca00c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x97541c1ea1ad833e
vdev = 0x381e0480469b4ed7
(end detector)

pool = tank
pool_guid = 0x97541c1ea1ad833e
pool_context = 1
pool_failmode = wait
vdev_guid = 0x381e0480469b4ed7
vdev_type = disk
vdev_path = /dev/dsk/c13t3d0s0
vdev_devid = id1,s...@sata_samsung_hd103si___s1vsj90sc22045/a
parent_guid = 0x6041a7903a345374
parent_type = raidz
prev_state = 0x1
__ttl = 0x1
__tod = 0x4be05cff 0x2ab5efc4

May 04 2010 19:44:31.716566182 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
class = ereport.fs.zfs.vdev.open_failed
ena = 0xeeed67dca00c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x97541c1ea1ad833e
vdev = 0x6e5ce9b416a3f8a4
(end detector)

pool = tank
pool_guid = 0x97541c1ea1ad833e
pool_context = 1
pool_failmode = wait
vdev_guid = 0x6e5ce9b416a3f8a4
vdev_type = disk
vdev_path = /dev/dsk/c13t6d0s0
vdev_devid = id1,s...@sata_wdc_wd6400aacs-0_wd-wcauf0934679/a
parent_guid = 0x4491e617ebc26c75
parent_type = raidz
prev_state = 0x1
__ttl = 0x1
__tod = 0x4be05cff 0x2ab5eea6

May 04 2010 19:44:31.716565740 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
class = ereport.fs.zfs.vdev.open_failed
ena = 0xeeed67dca00c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x97541c1ea1ad833e
vdev = 0x69f0986c92adda53
 

Re: [zfs-discuss] b134 pool borked!

2010-05-05 Thread Michael Mattsson
Thanks for your reply! I ran memtest86 and it did not report any errors. I 
have not replaced the disk controller yet. The server is up in multi-user mode 
with the broken pool left un-imported. format now works and properly lists all 
my devices without panicking. zpool import  still panics the box with the same 
stack trace as above.

Could it still be the disk controller? I'd jump through the roof with 
happiness if that's the case. It's one of those Supermicro thumper 
controllers. Does anyone know any good non-destructive diagnostics to run?
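
The non-destructive checks I can think of myself are only these (a sketch;
none of them should touch the pool's data):

  # per-device soft/hard/transport error counters as seen by the kernel
  iostat -En

  # anything the fault manager already considers faulted
  fmadm faulty

  # read the on-disk ZFS labels of one of the pool's disks without importing
  zdb -l /dev/dsk/c13t1d0s0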
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] b134 pool borked!

2010-05-05 Thread Michael Mattsson
This is what the output of my zpool import command looks like:

Attached you'll find the output of zdb -l for each device.

  pool: tank
id: 10904371515657913150
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

tank ONLINE
  raidz1-0   ONLINE
c13t4d0  ONLINE
c13t5d0  ONLINE
c13t6d0  ONLINE
c13t7d0  ONLINE
  raidz1-1   ONLINE
c13t3d0  ONLINE
c13t1d0  ONLINE
c13t2d0  ONLINE
c13t0d0  ONLINE
cache
  c8t2d0
logs
  c8t0d0 ONLINE
-- 
This message posted from opensolaris.org

zdbl.gz
Description: Binary data
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] b134 pool borked!

2010-05-04 Thread Michael Mattsson
90 reads and not a single comment? Not the slightest hint of what's going on?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-04 Thread Michael Sullivan
Ok, thanks.

So, if I understand correctly, it will just remove the device from the VDEV and 
continue to use the good ones in the stripe.

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 5 May 2010, at 04:34 , Marc Nicholas wrote:

> The L2ARC will continue to function.
> 
> -marc
> 
> On 5/4/10, Michael Sullivan  wrote:
>> HI,
>> 
>> I have a question I cannot seem to find an answer to.
>> 
>> I know I can set up a stripe of L2ARC SSD's with say, 4 SSD's.
>> 
>> I know if I set up ZIL on SSD and the SSD goes bad, the ZIL will be
>> relocated back to the pool.  I'd probably have it mirrored anyway, just in
>> case.  However you cannot mirror the L2ARC, so...
>> 
>> What I want to know, is what happens if one of those SSD's goes bad?  What
>> happens to the L2ARC?  Is it just taken offline, or will it continue to
>> perform even with one drive missing?
>> 
>> Sorry, if these questions have been asked before, but I cannot seem to find
>> an answer.
>> Mike
>> 
>> ---
>> Michael Sullivan
>> michael.p.sulli...@me.com
>> http://www.kamiogi.net/
>> Japan Mobile: +81-80-3202-2599
>> US Phone: +1-561-283-2034
>> 
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>> 
> 
> -- 
> Sent from my mobile device

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


  1   2   3   4   >