Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Eric D. Mudama

On Thu, Dec 31 at 16:53, David Magda wrote:

Just as the first 4096-byte block disks are silently emulating
4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes.
Perhaps in the future there will be a setting to say "no really, I'm
talking about the /actual/ LBA 123456".


What, exactly, is the "/actual/ LBA 123456" on a modern SSD?

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Eric D. Mudama

On Thu, Dec 31 at 10:18, Bob Friesenhahn wrote:
There are of course SSDs with hardly any (or no) reserve space, but 
while we might be willing to sacrifice an image or two to SSD block 
failure in our digital camera, that is just not acceptable for 
serious computer use.


Some people are doing serious computing on devices with 6-7% reserve.
Devices with less enforced reserve will be significantly cheaper per
exposed gigabyte, independent of all other factors, and they always give
the user the flexibility to increase the effective reserve by
short-stroking (destroking) the working area a little or a lot.
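As a rough illustration (made-up numbers, ignoring compression and
metadata overhead), the arithmetic looks something like this:

  # Hypothetical device: 128 GiB of raw flash exposed as 128 GB of LBAs
  # (~7% reserve).  Restricting the pool to a 100 GB slice raises the
  # effective reserve the controller can draw on:
  echo "scale=1; (128*2^30 - 100*10^9) * 100 / (128*2^30)" | bc
  # prints ~27.2 (percent of raw flash left over as spare area)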

If someone just needs blazing fast read access and isn't expecting to
put more than a few cycles/day on their devices, small reserve MLC
drives may be very cost effective and just as fast as their 20-30%
reserve SLC counterparts.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Ragnar Sundblad

On 31 dec 2009, at 22.53, David Magda wrote:

> On Dec 31, 2009, at 13:44, Joerg Schilling wrote:
> 
>> ZFS is COW, but does the SSD know which block is "in use" and which is not?
>> 
>> If the SSD did know whether a block is in use, it could erase unused blocks
>> in advance. But what is an "unused block" on a filesystem that supports
>> snapshots?

Snapshots make no difference - when you delete the last
dataset/snapshot that references a block, the block is freed.
Snapshots are a way to keep more data around; they are not a
way to keep the disk permanently full or anything like that.
There is no problem distinguishing between used and unused
blocks, and zfs (or btrfs or similar) is no different in that
respect.
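For example (just a sketch, with a made-up pool/dataset name), the
accounting is visible straight from the command line, and the space
only shows up as free once the last reference is gone:

  zfs list -o name,used,refer,usedbysnapshots tank/fs   # space held only by snapshots
  zfs destroy tank/fs@old                               # drop the last snapshot referencing it
  zpool list tank                                       # the freed space now shows up in the pool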

> Personally, I think that at some point in the future there will need to be a 
> command telling SSDs that the file system will take care of handling blocks, 
> as new FS designs will be COW. ZFS is the first "mainstream" one to do it, 
> but Btrfs is there as well, and it looks like Apple will be making its own FS.

That could be an idea, but there will still be holes left by
deleted files that need to be reclaimed. Do you mean it would
be a major win to have the file system take care of the
space reclamation instead of the drive?

> Just as the first 4096-byte block disks are silently emulating 4096-to-512 
> blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the 
> future there will be a setting to say "no really, I'm talking about the 
> /actual/ LBA 123456".

A typical flash erase block is on the order of 512 KB. You probably
don't want to use all the physical blocks, since some could be worn
out or bad, so those need to be remapped (or otherwise avoided) at
some level anyway. These days disks typically do the remapping without
the host computer knowing (both SSDs and rotating rust).

I see the possible win that you could always use all the working
blocks on the disk, and when blocks go bad your disk will shrink.
I am not sure that is really what people expect, though. Apart from
that, I am not sure what the gain would be.
Could you elaborate on why this would be called for?

/ragge 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] (snv_129, snv_130) can't import zfs pool

2010-01-01 Thread LevT
Hi 

(snv_130) created zfs pool storage (a mirror of two whole disks)

zfs created storage/iscsivol,  made some tests, wrote some GBs

zfs created storage/mynas filesystem
(sharesmb
dedup=on
compression=on)

FILLED the storage/mynas


tried to ZFS DESTROY my storage/iscsivol, but the system has HUNG...
this system now tends to boot to maintenance mode due to the boot-archive 
corruption

The pool can't be imported with -f by the recent EON storage (snv_129) either;
it also hangs and doesn't return to the CLI.


Any help is appreciated
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread David Magda


On Jan 1, 2010, at 03:30, Eric D. Mudama wrote:


On Thu, Dec 31 at 16:53, David Magda wrote:

Just as the first 4096-byte block disks are silently emulating
4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the
scenes. Perhaps in the future there will be a setting to say "no
really, I'm talking about the /actual/ LBA 123456".


What, exactly, is the "/actual/ LBA 123456" on a modern SSD?


It doesn't exist currently because of the behind-the-scenes re-mapping  
that's being done by the SSD's firmware.


While arbitrary to some extent, an "actual" LBA would presumably be the  
number of a particular cell in the SSD.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread David Magda

On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote:


I see the possible win that you could always use all the working
blocks on the disk, and when blocks goes bad your disk will shrink.
I am not sure that is really what people expect, though. Apart from
that, I am not sure what the gain would be.

Could you elaborate on why this would be called for?


Currently you have SSDs that look like disks, but under certain  
circumstances the OS / FS know that it isn't rotating rust--in which  
case the TRIM command is then used by the OS to help the SSD's  
allocation algorithm(s).


If the file system is COW, and knows about SSDs via TRIM, why not just  
skip the middle-man and tell the SSD "I'll take care of managing  
blocks".


In the ZFS case, I think it's a logical extension of how RAID is  
handled: ZFS' approach is much more helpful in most cases than  
hardware- / firmware-based RAID, so it's generally best just to expose  
the underlying hardware to ZFS. In the same way, ZFS already does COW,  
so why bother with the SSD's firmware doing it when giving that extra  
knowledge to ZFS could be more useful?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] preview of new SSD based on SandForce controller

2010-01-01 Thread Al Hopper
Interesting article - rumor has it that this is the same controller
that Seagate will use in its upcoming enterprise level SSDs:

http://anandtech.com/storage/showdoc.aspx?i=3702

It reads like  SandForce has implemented a bunch of ZFS like
functionality in firmware.  Hmm, I wonder if they used any ZFS source
code??

Happy new year.

-- 
Al Hopper  Logical Approach Inc,Plano,TX a...@logical-approach.com
   Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Ragnar Sundblad

On 1 jan 2010, at 14.14, David Magda wrote:

> On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote:
> 
>> I see the possible win that you could always use all the working
>> blocks on the disk, and when blocks goes bad your disk will shrink.
>> I am not sure that is really what people expect, though. Apart from
>> that, I am not sure what the gain would be.
>> 
>> Could you elaborate on why this would be called for?
> 
> Currently you have SSDs that look like disks, but under certain circumstances 
> the OS / FS know that it isn't rotating rust--in which case the TRIM command 
> is then used by the OS to help the SSD's allocation algorithm(s).

(Note that TRIM and equivalents are not only useful on SSDs,
but on other storage too, such as when using sparse/thin
storage.)

> If the file system is COW, and knows about SSDs via TRIM, why not just skip 
> the middle-man and tell the SSD "I'll take care of managing blocks".
> 
> In the ZFS case, I think it's a logical extension of how RAID is handling: 
> ZFS' system is much more helpful in most case that hardware- / firmware-based 
> RAID, so it's generally best just to expose the underlying hardware to ZFS. 
> In the same way ZFS already does COW, so why bother with the SSD's firmware 
> doing it when giving extra knowledge to ZFS could be more useful?

But that would only move the hardware-specific and hardware-dependent
flash chip handling code into the file system code, wouldn't it? What
would be gained by that? As long as the flash chips have larger erase
blocks than the file system blocks, someone will have to shuffle blocks
around to reclaim space, so why not let the one thing that knows the
hardware and also is very close to the hardware do it?

And if this is good for SSDs, why isn't it as good for rotating rust?

/ragge s

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2010-01-01 Thread R.G. Keen
> On Dec 31, 2009, at 6:14 PM, Richard Elling wrote:
> Some nits:
> disks aren't marked as semi-bad, but if ZFS has trouble with a
> block, it will try to not use the block again.  So there is two levels
> of recovery at work: whole device and block.
Ah. I hadn't found that yet.

> The "one more and you're dead" is really N errors in T time.
I'm interpreting this as "OS/S/zfs/drivers will not mark a disk 
as failed until it returns N errors in T time," which means - 
check me on this - that the window for a second real-or-fake 
disk failure is T: a second soft-failing disk can hit its own 
problem at any point while the system is balled up worrying 
about the first disk, which has not yet responded.

This is based on a paper I read online about the increasing
need for raidz3 (or similar) over raidz2 (or similar), because 
throughput from disks has not increased commensurately with 
their size; this leads to increasing times to recover from a 
first failure using the checking data stored in the array to 
rebuild. The notice-an-error time plus the rebuild-the-array
time is the window in which losing another disk, soft or hard, 
will lead to the inability to resilver the array.

> For disks which don't return when there is an error, you can
> reasonably expect that T will be a long time (multiples of 60
> seconds) and therefore the N in T threshold will not be triggered.
The scenario I had in mind was two disks ready to fail, either
soft (long time to return data) or hard (bang! That sector/block
or disk is not coming back, period). The first fails and starts 
trying to recover in desktop-disk fashion, maybe taking hours.

This leaves the system with no error report (i.e. the N-count is
zero) and the T-timer ticking. Meanwhile the array is spinning.
The second fragile disk is going to hit its own personal pothole
at some point soon in this scenario. 

What happens next is not clear to me. Is OS/S/zfs going to 
suspend disk operations until it finally does hear from first
failing disk 1, based on N still being at 0 because the disk 
hasn't reported back yet? Or will the array continue with other
operations, noting that the operation involving failing disk1
has not completed, and either stack another request on 
failing disk 1, or access failing disk 2 and get its error too
at some point? Or both?

If the timeout is truly N errors in T time, and N is never
reported back because the disk spends some hours retrying, 
then it looks like this is a zfs hang if not a system hang.

If there is a timeout of some kind which takes place even 
if N never gets over 0, that would at least unhang the 
file system/system, but it leaves you exposed to the second 
failing disk having faulted in the meantime, and you're in for 
either another hang-forever or a failed array in the raidz case. 

> The term "degraded" does not have a consistent
> definition across the industry. 
Of course not! 8-)  Maybe we should use "depraved" 8-)

> See the zpool man page for the definition
> used for ZFS.  In particular, DEGRADED != FAULTED

> Issues are logged, for sure.  If you want to monitor
> them proactively,
> you need to configure SNMP traps for FMA.
Ok, can deal with that.
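For my own notes, these are the manual checks I'd fall back on until
the SNMP side is wired up (the stock FMA tools, as I understand them):

  fmadm faulty     # resources FMA currently considers faulted
  fmdump -v        # the fault log, with details per event
  fmstat           # per-module fmd statistics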

> It already does this, as long as there are N errors
> in T time.  
OK, I can work that one out. I'm still puzzled about what
happens in the "N=0 forever" case. The net result
there seems to be that you need RAID-specific
disks to ever get some kind of timeout to happen at the 
disk level (depending on the disk firmware, 
which, as you note later, is likely to have been written
by a junior EE as his first assignment 8-) )


>There is room for improvement here, but I'm not sure how
> one can set a rule that would explicitly take care of the I/O never
> returning from a disk while a different I/O to the same disk
> returns.  More research required here...
Yep. I'm thinking that it might be possible to do a policy-based
setup section for an array where you could select one of a number
of rule-sets for what to do, based on your experience and/or
paranoia about the disks in your array. I had good luck with that
in a primitive whole-machine hardware diagnosis system I worked
with at one point in the dim past. Kind of "if you can't do the 
right/perfect thing, then ensure that *something* happens."

One of the rule scenarios might be "if one seek to a disk never 
returns while other actions to that disk do work, then halt the 
pending action(s) to the disk and/or array, increment N, restart that
disk or the entire array as needed, and retry that action in a 
diagnostic loop, which decides whether it's a soft fail, hard 
block fail, or hard disk fail," and then take the proper action 
based on the diagnosis. Or it could be "map that disk out and
run diagnostics on it while the hot spare is swapped in," based 
on whether there's a hot spare or not. 

But yes, some thought is needed. I always tend to pick the side
of "let the user/admin pick the way they want to fail" which 
m

Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread David Magda

On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote:


But that would only move the hardware specific and dependent flash
chip handling code into the file system code, wouldn't it? What
is won with that? As long as the flash chips have larger pages than
the file system blocks, someone will have to shuffle around blocks
to reclaim space, why not let the one thing that knows the hardware
and also is very close to the hardware do it?

And if this is good for SSDs, why isn't it as good for rotating rust?


I don't really see how things are either hardware-specific or dependent.  
COW is COW. Am I missing something? It's done by code somewhere in the  
stack, and if the FS knows about it, it can lay things out as sequential  
writes. If we're talking about 512 KB erase blocks, ZFS in particular could  
fill one with four 128 KB records--and 128 KB is simply the currently  
#define'd maximum record size, which can be changed in the future.
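(To be clear about which knob I mean -- a quick sketch with a
placeholder pool/dataset name:)

  zfs get recordsize tank/fs            # 128K by default
  zfs set recordsize=128K tank/fs       # per-dataset, up to the compiled-in maximum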


One thing you gain is perhaps not requiring as much of a  
reserve. At most you have some hidden bad-block re-mapping, similar to  
rotating rust nowadays. If you're shuffling blocks around, you're  
doing a read-modify-write, which, if done in the file system, you could  
use as a mechanism to defragment on the fly or to group many small files  
together.



Not quite sure what you mean by your last question.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Richard Elling

On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote:

Flash SSDs actually always remap new writes into an
only-append-to-new-pages style, pretty much as ZFS does itself.
So for an SSD there is no big difference between ZFS and
filesystems such as UFS, NTFS, HFS+ et al; at the flash level they
all work the same.



The reason is that there is no way for it to rewrite single
disk blocks; it can only fill up already-erased blocks of,
for example, 512 KB. When the old data gets mixed with unused
blocks (because of block rewrites, TRIM, or WRITE SAME/UNMAP),
it needs to compact the data by copying all active blocks from
those erase blocks into previously erased ones, and write the
active data there compacted/contiguous. (When this happens, things
tend to get really slow.)


However, the quantity of small, overwritten pages is vastly different.
I am not convinced that a workload that generates few overwrites
will be penalized as much as a workload that generates a large
number of overwrites.

I think most folks here will welcome good, empirical studies,
but thus far the only one I've found is from STEC and their
disks behave very well after they've been filled and subjected
to a rewrite workload. You get what you pay for.  Additional
pointers are always appreciated :-)
http://www.stec-inc.com/ssd/videos/ssdvideo1.php

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Bob Friesenhahn

On Fri, 1 Jan 2010, David Magda wrote:


It doesn't exist currently because of the behind-the-scenes re-mapping that's 
being done by the SSD's firmware.


While arbitrary to some extent, and "actual" LBA would presumably the number 
of a particular cell in the SSD.


There seems to be some severe misunderstanding of what an SSD is. 
This misunderstanding leads one to assume that an SSD has a 
"native" blocksize.  SSDs (as used in computer drives) are comprised 
of many tens of FLASH memory chips which can be laid out and mapped 
in whatever fashion the designers choose.  They could be mapped 
sequentially, in parallel, a combination of the two, or perhaps even 
change behavior depending on use.  Individual FLASH devices usually 
have a much smaller page size than 4K.  A 4K write would likely be 
striped across several/many FLASH devices.


The construction of any given SSD is typically a closely-held trade 
secret and the vendor will not reveal how it is designed.  You would 
have to chip away the epoxy yourself and reverse-engineer in order to 
gain some understanding of how a given SSD operates and even then it 
would be mostly guesswork.


It would be wrong for anyone here, including someone who has 
participated in the design of an SSD, to claim that they know how a 
"SSD" will behave unless they have access to the design of that 
particular SSD.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS not working in non-global zone after upgrading to snv_130

2010-01-01 Thread Bernd Schemmer

Hi

After upgrading OpenSolaris from snv_111 to snv_130

r...@t61p:/export/home/xtrnaw7# cat /etc/release
   OpenSolaris Development snv_130 X86
   Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
Use is subject to license terms.
   Assembled 18 December 2009

my zone does not boot anymore:

Inside the zone:

-bash-3.2$ svcs -x
svc:/system/filesystem/local:default (local file system mounts)
 State: maintenance since Fri Jan 01 16:18:54 2010
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
   See: http://sun.com/msg/SMF-8000-KS
   See: /var/svc/log/system-filesystem-local:default.log
Impact: 16 dependent services are not running.  (Use -v for list.)

bash-3.2$ tail /var/svc/log/system-filesystem-local:default.log
[ Sep  3 19:43:08 Executing start method ("/lib/svc/method/fs-local"). ]
[ Sep  3 19:43:08 Method "start" exited with status 0. ]
[ Oct 17 18:00:54 Enabled. ]
[ Oct 17 18:01:25 Executing start method ("/lib/svc/method/fs-local"). ]
[ Oct 17 18:01:27 Method "start" exited with status 0. ]
[ Jan  1 16:18:45 Enabled. ]
[ Jan  1 16:18:53 Executing start method ("/lib/svc/method/fs-local"). ]
/lib/svc/method/fs-local: line 91: 12888: Abort(coredump)
WARNING: /usr/sbin/zfs mount -a failed: exit status 262

But there is no ZFS filesystem configured for the zone:

r...@t61p:/export/home/xtrnaw7# zonecfg -z develop001 info
zonename: develop001
zonepath: /zones/develop001
brand: ipkg
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
fs:
    dir: /tools
    special: /tools
    raw not specified
    type: lofs
    options: [ro]
fs:
    dir: /data/develop
    special: /data/develop
    raw not specified
    type: lofs
    options: [rw]
fs:
    dir: /data/img
    special: /data/img
    raw not specified
    type: lofs
    options: [ro]
fs:
    dir: /opt/SunStudioExpress
    special: /opt/SunStudioExpress
    raw not specified
    type: lofs
    options: [ro]
net:
    address not specified
    physical: vnic0
    defrouter not specified


Looks like a general problem with ZFS in the zone:

bash-3.2$ /usr/sbin/zfs list
internal error: Unknown error
Abort


ZFS in the global zones works without problems.


regards

Bernd





--
Bernd Schemmer, Frankfurt am Main, Germany
http://bnsmb.de/

Sooner rather than later, the world will change. ("Más temprano que tarde el mundo cambiará.")
Fidel Castro

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-01 Thread Bob Friesenhahn

On Fri, 1 Jan 2010, Al Hopper wrote:


Interesting article - rumor has it that this is the same controller
that Seagate will use in its upcoming enterprise level SSDs:

http://anandtech.com/storage/showdoc.aspx?i=3702

It reads like  SandForce has implemented a bunch of ZFS like
functionality in firmware.  Hmm, I wonder if they used any ZFS source
code??


The article (and product) seem interesting, but (in usual form) the 
article is written as a sort of unsubstantiated guess-work propped up 
by vendor charts and graphs and with links so the gentle reader can 
purchase the product on-line.


It is good to see that Intel is seeing some competition.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-01 Thread Al Hopper
On Fri, Jan 1, 2010 at 11:17 AM, Bob Friesenhahn
 wrote:
> On Fri, 1 Jan 2010, David Magda wrote:
>>
>> It doesn't exist currently because of the behind-the-scenes re-mapping
>> that's being done by the SSD's firmware.
>>
>> While arbitrary to some extent, and "actual" LBA would presumably the
>> number of a particular cell in the SSD.
>
> There seems to be some severe misunderstanding of that a SSD is. This severe
> misunderstanding leads one to assume that a SSD has a "native" blocksize.
>  SSDs (as used in computer drives) are comprised of many tens of FLASH
> memory chips which can be layed out and mapped in whatever fashion the
> designers choose to do.  They could be mapped sequentially, in parallel, a
> combination of the two, or perhaps even change behavior depending on use.
>  Individual FLASH devices usually have a much smaller page size than 4K.  A
> 4K write would likely be striped across several/many FLASH devices.
>
> The construction of any given SSD is typically a closely-held trade secret
> and the vendor will not reveal how it is designed.  You would have to chip
> away the epoxy yourself and reverse-engineer in order to gain some
> understanding of how a given SSD operates and even then it would be mostly
> guesswork.
>
> It would be wrong for anyone here, including someone who has participated in
> the design of an SSD, to claim that they know how a "SSD" will behave unless
> they have access to the design of that particular SSD.
>

The main issue is that most flash devices use 128 KB erase blocks, and
the smallest "chunk" (for want of a better word) of flash memory that
can be erased and rewritten is an erase block - i.e. 128 KB.  So a write
to an SSD that only changes 1 byte in one 512-byte "disk" sector forces
the SSD controller to either read/re-write the affected erase block or
figure out how to update the flash memory with the minimum effect on
flash wear.

If one didn't have to worry about flash wear levelling, one could
read/update/write the affected erase block all day long.
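Back-of-the-envelope, for the naive case (ignoring any cleverness in
the controller):

  # rewriting a whole 128 KB erase block to change one 512-byte sector:
  echo "128*1024/512" | bc
  # 256  -- sectors physically rewritten per sector logically written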

And, to date, flash writes are much slower than flash reads - which is
another basic property of the current generation of flash devices.

For anyone who is interested in getting more details of the challenges
with flash memory, when used to build solid state drives, reading the
tech data sheets on the flash memory devices will give you a feel for
the basic issues that must be solved.

Bob's point is well made.  The specifics of a given SSD implementation
make the performance characteristics of the resulting SSD very
difficult to predict or even describe - especially as the device
hardware and firmware continue to evolve.  And some SSDs change the
algorithms they implement on the fly, depending on the
characteristics of the current workload and of the (inbound) data
being written.

There are some links to well written articles in the URL I posted
earlier this morning:
http://www.anandtech.com/storage/showdoc.aspx?i=3702

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX a...@logical-approach.com
   Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] trying to buy an Intel MLC SSD

2010-01-01 Thread Al Hopper
The 80GB Intel MLC SSDs have been hard to find in stock and prices
keep varying.  The original list price on the X25-M 80GB MLC drive
was $230 - and it was *supposed* to be available for less than that.
Demand has been high and a lot of on-line sellers have taken advantage
of the demand to keep prices high.  In particular newegg.com, who
usually have very keen pricing, have been selling the Intel SSDs at way
above list and are not competitive with other online retailers.

A possible work around is to shop for the 1.8" version of the drive -
which is more widely available with better pricing.  You'll need a
cable with a micro-SATA connector for this drive.  You can find a
micro-SATA cable to standard SATA cable assembly here:
http://www.satacables.com/micro-sata-cables.html

Bear in mind that the 1.8" (x18m) drive is 5mm high - whereas the 2.5"
(x25m) drive is 7mm tall - and it comes with a plastic frame to
increase the height to 9.5mm for compatibility with some laptop
mountings.

Here's a reference for physical form factor info on the Intel drives:

wget http://download.intel.com/design/flash/nand/mainstream/322296.pdf

and here is the spec for the micro SATA connector:

wget ftp://ftp.seagate.com/pub/sff/SFF-8144.PDF

NB: Make sure you get the G2 version of the Intel drive - regardless
of the form factor.

No affiliation with Intel etc.

-- 
Al Hopper  Logical Approach Inc,Plano,TX a...@logical-approach.com
   Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Supermicro AOC-SAT2-MV8 -- cfgadm won't create attach point (dsk/xxxx)

2010-01-01 Thread Jeb Campbell
I have a Supermicro AOC-SAT2-MV8 running on snv_130.  I have a 6 disk raidz2 
pool that has been running great.

Today I added a Western Digital Green 1.5TB WD15EADS so I could create some 
scratch space.

But, cfgadm will not assign the drive a dsk/xxx ...

I have tried unconfigure/configure and disconnect/connect/configure with cfgadm 
without luck.
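To be specific, this is roughly the sequence I mean (a sketch; the port
name comes from the cfgadm output below):

  cfgadm -c unconfigure sata0/1
  cfgadm -c configure sata0/1
  devfsadm -Cv     # clean up stale /dev links and build new ones afterwards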

[r...@solaris:~]$ uname -a
SunOS solaris 5.11 snv_130 i86pc i386 i86pc
[r...@solaris:~]$ cfgadm -alv | grep sata
sata0/0::dsk/c0t0d0    connected    configured   ok
    Mod: ST3750330AS FRev: SD1A SN: 3QK0382L
sata0/1                connected    configured   ok
    Mod: WDC WD15EADS-00P8B0 FRev: 01.00A01 SN: WD-WCAVU0382812
sata0/2::dsk/c0t2d0    connected    configured   ok
    Mod: ST3750330AS FRev: SD1A SN: 3QK0382M
sata0/3::dsk/c0t3d0    connected    configured   ok
    Mod: ST3750330AS FRev: SD1A SN: 3QK03DEP
sata0/4::dsk/c0t4d0    connected    configured   ok
    Mod: ST3750330AS FRev: SD1A SN: 3QK038A6
sata0/5::dsk/c0t5d0    connected    configured   ok
    Mod: ST3750330AS FRev: SD1A SN: 3QK0313K
sata0/6::dsk/c0t6d0    connected    configured   ok
    Mod: ST3750330AS FRev: SD1A SN: 3QK037X5
sata0/7                empty        unconfigured ok
unavailable  sata-port    n    /devices/p...@0,0/pci8086,2...@1/pci8086,3...@0/pci11ab,1...@1:7

Any tips?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-01 Thread Richard Elling

On Jan 1, 2010, at 11:28 AM, Bob Friesenhahn wrote:


On Fri, 1 Jan 2010, Al Hopper wrote:


Interesting article - rumor has it that this is the same controller
that Seagate will use in its upcoming enterprise level SSDs:

http://anandtech.com/storage/showdoc.aspx?i=3702

It reads like  SandForce has implemented a bunch of ZFS like
functionality in firmware.  Hmm, I wonder if they used any ZFS source
code??


The article (and product) seem interesting, but (in usual form) the  
article is written as a sort of unsubstantiated guess-work propped  
up by vendor charts and graphs and with links so the gentle reader  
can purchase the product on-line.


It is good to see that Intel is seeing some competition.


Yep, it is good to see that people who are being creative are finding
design wins.  IMHO, the rate of change in the SSD world right now is
about 1000x the rate of change in the HDD world.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (snv_129, snv_130) can't import zfs pool

2010-01-01 Thread Richard Elling

On Jan 1, 2010, at 4:57 AM, LevT wrote:


Hi

(snv_130) created zfs pool storage (a mirror of two whole disks)

zfs created storage/iscsivol,  made some tests, wrote some GBs

zfs created storage/mynas filesystem
(sharesmb
dedup=on
compression=on)

FILLED the storage/mynas


tried to ZFS DESTROY my storage/iscsivol, but the system has HUNG...


dedup is still new and several people have reported that destroying
deduped datasets can take a long time. Plenty of memory or cache
devices seems to help, as does having high-IOPS drives in the main
pool.  Otherwise, you'll have to wait for it to finish.
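If you want some evidence that it is still making progress while you
wait (assuming you can still get a responsive shell), something along
these lines from another terminal:

  zpool iostat -v storage 10    # per-vdev read/write activity every 10 seconds
  iostat -xn 10                 # per-device busy%, service times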

this system now tends to boot to maintenance mode due to the
boot-archive corruption


This is unrelated to the above problem.  More likely this occurred
when you gave up and forced a restart. Follow the standard
instructions for rebuilding the boot archive.
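Roughly (a sketch -- adjust for where your root file system actually
ends up mounted):

  # from the maintenance shell, with root mounted read-write:
  bootadm update-archive
  reboot

  # or, if booted from install media with the root pool mounted at /a:
  bootadm update-archive -R /a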
 -- richard

The pool can't be imported -f by the recent EON storage (snv_129),  
it hangs also and don't return to the CLI



Any help is appreciated
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2010-01-01 Thread Richard Elling


On Jan 1, 2010, at 8:11 AM, R.G. Keen wrote:


On Dec 31, 2009, at 6:14 PM, Richard Elling wrote:
Some nits:
disks aren't marked as semi-bad, but if ZFS has trouble with a
block, it will try to not use the block again.  So there is two  
levels

of recovery at work: whole device and block.

Ah. I hadn't found that yet.


The "one more and you're dead" is really N errors in T time.

I'm interpreting this as "OS/S/zfs/drivers will not mark a disk
as failed until it returns N errors in T time," which means -
check me on this - that to get a second failed disk, the time
to get a second real-or-fake failed disk is T, where T is the
time a second soft-failing disk may happen while the system
is balled up in worrying about the first disk not responding in
T time.


Perhaps I am not being clear.  If a disk is really dead, then
there are several different failure modes that can be responsible.
For example, if a disk does not respond to selection, then it
is diagnosed as failed very quickly. But that is not the TLER
case.  The TLER case is when the disk cannot read from
media without error, so it will continue to retry... perhaps
forever or until reset. If a disk does not complete an I/O operation
in (default) 60 seconds (for sd driver), then it will be reset and
the I/O operation retried.

If a disk returns bogus data (failed ZFS checksum), then the
N in T algorithm may kick in. I have seen this failure mode many
times.


This based on a paper I read on line about the increasing
need for raidz3 or similar over raidz2 or similar because
throughput from disks has not increased concomitantly with
their size; this leading to increasing times to recover from
first failures using the stored checking data in the array to
rebuild. The notice-an-error time plus the rebuild-the-array
time is the window in which losing another disk, soft or hard,
will lead to the inability to resilver the array.


A similar observation is that the error rate (errors/bit) has not
changed, but the number of bits continues to increase.


For disks which don't return when there is an error, you can
reasonably expect that T will be a long time (multiples of 60
seconds) and therefore the N in T threshold will not be triggered.

The scenario I had in mind was two disks ready to fail, either
soft (long time to return data) or hard (bang! That sector/block
or disk is not coming back, period). The first fails and starts
trying to recover in desktop-disk fashion, maybe taking hours.


Yes, this is the case for TLER. The only way around this is to
use disks that return failures when they occur.


This leaves the system with no error report (i.e. the N-count is
zero) and the T-timer ticking. Meanwhile the array is spinning.
The second fragile disk is going to hit its own personal pothole
at some point soon in this scenario.

What happens next is not clear to me. Is OS/S/zfs going to
suspend disk operations until it finally does hear from first
failing disk 1, based on N still being at 0 because the disk
hasn't reported back yet? Or will the array continue with other
operations, noting that the operation involving failing disk1
has not completed, and either stack another request on
failing disk 1, or access failing disk 2 and get its error too
at some point? Or both?


ZFS issues I/O in parallel. However, that does not prevent an
application or ZFS metadata transactions from waiting on a
sequence of I/O.


If the timeout is truly N errors in T time, and N is never
reported back because the disk spends some hours retrying,
then it looks like this is a zfs hang if not a system hang.


The drivers will retry and fail the I/O. By default, for SATA
disks using the sd driver, there are 5 retries of 60 seconds.
After 5 minutes, the I/O will be declared failed and that info
is passed back up the stack to ZFS, which will start its
recovery.  This is why the T part of N in T doesn't work so
well for the TLER case.
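For reference, that 60-second figure is the sd per-command timeout and
it is tunable; a minimal sketch (test carefully -- some devices
legitimately need the longer timeout, and this affects every disk
driven by sd):

  # /etc/system -- shorten the sd per-command timeout (takes effect on reboot)
  set sd:sd_io_time=10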


If there is a timeout of some kind which takes place even
if N never gets over 0, that would at least unhang the
file system/system, but it opens you to the second failing
disk fault having occurred, and you're in for another of
either hung-forever or failed-array in the case of raidz.


I don't think the second disk scenario adds value to this
analysis.


The term "degraded" does not have a consistent
definition across the industry.

Of course not! 8-)  Maybe we should use "depraved" 8-)


See the zpool man page for the definition
used for ZFS.  In particular, DEGRADED != FAULTED



Issues are logged, for sure.  If you want to monitor
them proactively,
you need to configure SNMP traps for FMA.

Ok, can deal with that.


It already does this, as long as there are N errors
in T time.

OK, I can work that one out. I'm still puzzled on what
happens with the "N=0 forever" case. The net result
on that one seems to be that  you need raid specific
disks to get some kind of timeout to happen at the
disk level ever (depending on the disk firmware,
which as y

Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2010-01-01 Thread tom wagner
Yeah, still no joy.  I moved the disks to another machine altogether, with 8 GB 
and a quad-core Intel versus the dual-core AMD I was using, and it still just 
hangs the box on import. This time I did a nohup zpool import -fFX vault after 
booting off the b130 live DVD on this machine into single-user text mode, so I'd 
have minimal processes, and the machine still hangs tighter than a drum.  Can't 
even hit Enter and get a newline this way, probably because the bash 
process is locked. I've left it for 24 hours like this and will leave it for 
another day or two to see if it is actually doing anything behind the scenes.

I guess my plan B will be to leave these disks in a closet and try again some 
time in the future; hopefully in some later build the kinks will get worked out 
of dedup enough to deal with my pool, as I'd really rather not lose the 
data in this pool.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2010-01-01 Thread Richard Elling

On Jan 1, 2010, at 2:23 PM, tom wagner wrote:

Yeah, still no joy.  I moved the disks to another machine altogether  
with 8gb and a quad core intel versus the dual core amd I was using  
and it still just hangs the box on import. this time I did a nohup  
zpool import -fFX vault after booting off the b130 live dvd on this  
machine into single user text mode so I'd have minimal processes and  
the machine still hangs tighter than a drum.  Can't even hit the  
enter and get a newline this way, probably because the bash process  
is locked. I've left it for 24 hours like this and will leave it for  
another day or two to see if it is actually doing anything behind  
the scenes.  I guess my plan B will be to leave these disks in a  
closet and try again some time in the future and hopefully in some  
later build the kinks get all worked out enough with dedup  to deal  
with my pool as I'd really not like to lose the data in this pool.


Are the drive lights blinking?  If so, then let it do its work.
Rebooting won't help because when the pool is imported, the destroy
will continue.  See other recent threads in this forum on the subject
for more insight.
http://opensolaris.org/jive/forum.jspa?forumID=80&start=0
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] best way to configure raidz groups

2010-01-01 Thread Orvar Korvar
raidz2 is recommended. As discs get larger, it can take a long time to repair 
raidz - maybe several days. With raidz1, if another disc fails during the repair, 
you are screwed.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2010-01-01 Thread tom wagner
That's the thing - the drive lights aren't blinking, but I was thinking maybe 
the writes are going so slowly that they aren't registering. And 
since I can't keep a running iostat, I can't tell if anything is going on.  I 
can, however, get into kmdb.  Is there something in there that can monitor 
storage activity or anything? Probably not, but it's worth asking.
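For instance, I was wondering whether the zfs dcmds would work from
there -- just a guess on my part, I don't know whether the module is
even loadable under kmdb on b130:

  [0]> ::spa -v
  [0]> ::zio_state

::spa -v should show pool state and the vdev tree; ::zio_state, if that
dcmd exists in this build, should list outstanding I/Os.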
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (snv_129, snv_130) can't import zfs pool

2010-01-01 Thread tom wagner
You might want to check out another thread that some of the others and I 
started on this topic. Some of the people in that thread got their pools back, 
but I haven't been able to.  I have SSDs for my log and cache and it hasn't 
helped me, because my system hangs hard on import the way you are describing. 
Thus far I still haven't been able to regain my pool after switching to a 
totally different system with more memory, but some of the others have.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-01 Thread Erik Trimble

Bob Friesenhahn wrote:

On Fri, 1 Jan 2010, Al Hopper wrote:


Interesting article - rumor has it that this is the same controller
that Seagate will use in its upcoming enterprise level SSDs:

http://anandtech.com/storage/showdoc.aspx?i=3702

It reads like  SandForce has implemented a bunch of ZFS like
functionality in firmware.  Hmm, I wonder if they used any ZFS source
code??


The article (and product) seem interesting, but (in usual form) the 
article is written as a sort of unsubstantiated guess-work propped up 
by vendor charts and graphs and with links so the gentle reader can 
purchase the product on-line.


It is good to see that Intel is seeing some competition.

Bob
--


Yeah, there were a bunch more "maybe" and "looks like" and "might be" 
than I'm really comfortable with in that article.


The one thing it does bring up is the old problem of Where Intelligence 
Belongs.  You see this most typically in the CPU/coprocessor cycle, 
where the question of whether there is enough performance gain in using 
a separate chip versus the main CPU for some task swings back and forth 
in a never-ending cycle.


One of ZFS's founding ideas is that Intelligence belongs up in the main 
system (i.e. running in the OS, on the primary CPU(s)), and that all 
devices are stupid and unreliable.   I'm looking at all the (purported) 
features in this SandForce controller, and wondering how they'll 
interact with a "smart" filesystem like ZFS, rather than a traditional 
"stupid" filesystem a la UFS.   I see a lot of overlap, which I'm not 
sure is a good thing.


Maybe it's approaching time for vendors to just produce really stupid 
SSDs: that is, ones that just do wear-leveling, and expose their true 
page-size info (e.g. for MLC, how many blocks of X size have to be 
written at once) and that's about it.  Let filesystem makers worry about 
scheduling writes appropriately, doing redundancy, etc.


Oooh!   Oooh!  a whole cluster of USB thumb drives!  Yeah!




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-01 Thread Bob Friesenhahn

On Fri, 1 Jan 2010, Erik Trimble wrote:


Maybe it's approaching time for vendors to just produce really stupid SSDs: 
that is, ones that just do wear-leveling, and expose their true page-size 
info (e.g. for MLC, how many blocks of X size have to be written at once) and 
that's about it.  Let filesystem makers worry about scheduling writes 
appropriately, doing redundancy, etc.


From the benchmarks, it is clear that the drive interface is already 
often the bottleneck for these new SSDs.  That implies that the 
current development path is in the wrong direction unless we are 
willing to accept legacy-sized devices implementing a complex legacy 
protocol.  If the devices remain the same physical size with more 
storage then we are faced with the same current situation we have with 
rotating media, with huge media density and relatively slow I/O 
performance.  We do need stupider SSDs which fit in a small form 
factor, offer considerable bandwidth (e.g. 300MB/second) per device, 
and use a specialized communication protocol which is not defined by 
legacy disk drives.  This allows more I/O to occur in parallel, for 
much better I/O rates.



Oooh!   Oooh!  a whole cluster of USB thumb drives!  Yeah!


That is not far from what we should have (small chassis-oriented 
modules), but without the crummy USB.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-01 Thread Richard Elling

On Jan 1, 2010, at 6:33 PM, Bob Friesenhahn wrote:


On Fri, 1 Jan 2010, Erik Trimble wrote:


Maybe it's approaching time for vendors to just produce really  
stupid SSDs: that is, ones that just do wear-leveling, and expose  
their true page-size info (e.g. for MLC, how many blocks of X size  
have to be written at once) and that's about it.  Let filesystem  
makers worry about scheduling writes appropriately, doing  
redundancy, etc.


From the benchmarks, it is clear that the drive interface is already  
often the bottleneck for these new SSDs.  That implies that the  
current development path is in the wrong direction unless we are  
willing to accept legacy-sized devices implementing a complex legacy  
protocol.  If the devices remain the same physical size with more  
storage then we are faced with the same current situation we have  
with rotating media, with huge media density and relatively slow I/O  
performance.  We do need stupider SSDs which fit in a small form  
factor, offer considerable bandwidth (e.g. 300MB/second) per device,  
and use a specialized communication protocol which is not defined by  
legacy disk drives.  This allows more I/O to occur in parallel, for  
much better I/O rates.


You can already see this affecting the design of high-throughput
storage.  The Sun Storage F5100 Flash Array has 80 SSDs and
uses 64 SAS channels for host connection. Some folks think that
6 Gbps SATA/SAS connections are the Next Great Thing^TM but
that only means you need 32 host connections.  It is quite amazing
to have 1M IOPS and 12.8 GB/s in 1 RU.  Perhaps this is the DAS
of the future?
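The arithmetic behind those connection counts is simple (assuming you
want to keep every lane roughly saturated):

  echo "12.8*1000/64" | bc     # => 200 MB/s per lane at 3 Gbps SAS
  echo "12.8*1000/400" | bc    # => 32 lanes if each 6 Gbps lane moves ~400 MB/s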
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-01 Thread Erik Trimble

Bob Friesenhahn wrote:

On Fri, 1 Jan 2010, Erik Trimble wrote:


Maybe it's approaching time for vendors to just produce really stupid 
SSDs: that is, ones that just do wear-leveling, and expose their true 
page-size info (e.g. for MLC, how many blocks of X size have to be 
written at once) and that's about it.  Let filesystem makers worry 
about scheduling writes appropriately, doing redundancy, etc.


From the benchmarks, it is clear that the drive interface is already 
often the bottleneck for these new SSDs.  That implies that the 
current development path is in the wrong direction unless we are 
willing to accept legacy-sized devices implementing a complex legacy 
protocol.  If the devices remain the same physical size with more 
storage then we are faced with the same current situation we have with 
rotating media, with huge media density and relatively slow I/O 
performance.  We do need stupider SSDs which fit in a small form 
factor, offer considerable bandwidth (e.g. 300MB/second) per device, 
and use a specialized communication protocol which is not defined by 
legacy disk drives.  This allows more I/O to occur in parallel, for 
much better I/O rates.



Oooh!   Oooh!  a whole cluster of USB thumb drives!  Yeah!


That is not far from what we should have (small chassis-oriented 
modules), but without the crummy USB.


Bob
I tend to like the 2.5" form factor, for a lot of reasons (economies of 
scale, and all).   And, the new SATA III (i.e. 6Gbit/s) interface is 
really sufficient for reasonable I/O, at least until the 12Gbit SAS 
comes along in a year or so.  The 1.8" drive form factor might be useful 
as Flash densities go up (in order to keep down the GB to drive 
interface ratio), but physically, that size is a bit of a pain (it's 
actually too small for reliability reasons, and makes chassis design 
harder).  I'm actually all for adding a second SATA/SAS I/O connector on 
a 2.5" drive (it's just possible, physically).


That all said, it certainly would be really nice to get an SSD controller 
which can really push the bandwidth, and the only way I see this 
happening now is to go the "stupid" route and dumb down the controller 
as much as possible.  I really think we just want the controller to Do 
What I Say, and not try any optimizations or such.  There's simply much 
more benefit to doing the optimization up at the filesystem level than 
down at the device level. For a trivial case, consider the dreaded 
read-modify-write problem of MLCs: to write a single bit, a whole 
page has to be read, then the page recomposed with the changed bits, 
before writing again.  If the filesystem were aware that the drive had 
this kind of issue, then in-RAM caching would almost always allow 
the first "read" cycle to be avoided, and performance goes back to a 
typical copy-on-write style stripe write. 

I can see why having "dumb" controllers might not appeal to the 
consumer/desktop market, but certainly for the Enterprise market I 
think it's actually /more/ likely that they start showing up soon.  
Which would be a neat reversal of sorts:  consumer drives using a 
complex controller with cheap flash (and a large "spare" capacity 
area), while Enterprise drives use a simple controller, higher-quality 
flash chips, and likely a much smaller spare capacity area.  Which 
means I expect price parity between the two. 


Whee!

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss