Re: [zfs-discuss] ZFS performance degradation when backups are running
2008/9/30 Jean Dion [EMAIL PROTECTED]: iSCSI requires a dedicated network, not a shared network or even a VLAN. Backups cause large I/O that fills your network quickly, like any SAN today. Could you clarify why it is not suitable to use VLANs for iSCSI? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
The good news is that even though the answer to your question is no, it doesn't matter, because it sounds like what you are doing is a piece of cake :) Given how cheap hardware is, and how modest your requirements sound, I expect you could build multiple custom systems for the cost of an EMC system. Even that Pogo Linux box is overshooting the mark compared to what a custom system might be. The price is typical too, considering they're trying to sell 1TB drives for $260 when similar drives are less than $150 for regular folks. The manageability of the NexentaStor software might be worth it to you over a Solaris terminal, but for a small shop with one machine and one guy who knows it well, you might just do the hardware from scratch :) Especially given what there is to know about ZFS and your use case, such as being able to use slower disks with more RAM and an SSD ZIL cache to produce deceptively fast results. If cost continues to be a concern over performance, also consider that these pre-made systems are not designed for power conservation at all. They're still shipping old, inefficient processors and other such parts, hoping to take advantage of IT people who don't care or know any better. A custom system could potentially cut the total power cost in half...

The original message: Hi everyone, We're a small Linux shop (20 users). I am currently using a Linux server to host our 2TB of data. I am considering better options for our data storage needs. I mostly need instant snapshots and better data protection. I have been considering EMC NS20 filers and ZFS-based solutions. For the ZFS solutions, I am considering the NexentaStor product installed on a Pogo Linux StorageDirector box. The box will mostly be sharing 2TB over NFS, nothing fancy.

Now, my question is that I need to assess ZFS reliability today (Q4 2008) in comparison to an EMC solution. Something like EMC is pretty mature and used at the most demanding sites. ZFS is fairly new, and from time to time I have heard it had some pretty bad bugs. However, the EMC solution is about 4X more expensive. I need to somehow "quantify" the relative quality level, in order to judge whether or not I should be paying all that much to EMC. The only really important reliability measure to me is not having data loss! Is there any real measure, like "percentage of total corruption of a pool", that can assess such quality, so you'd tell me ZFS has a pool failure rate of 1 in 10^6 while EMC has a rate of 1 in 10^7? If not, would you rate such a ZFS solution as ??% of the reliability of an EMC solution?

I know it's a pretty difficult question to answer, but it's the one I need to answer and weigh against the cost. Thanks a million, I really appreciate your help -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Thanks for all the answers. Please find more questions below :)
- Good to know EMC filers do not have end-to-end checksums! What about NetApp?
- Any other limitations of the big two NAS vendors as compared to ZFS?
- I still don't have my original question answered: I want to somehow assess the reliability of that ZFS storage stack. If there's no hard data on that, then if any storage expert who works with lots of systems can give his impression of the reliability compared to the big two, that would be great!
- Regarding building my own hardware, I don't really want to do that (I am scared enough putting our small but very important data on ZFS). If you know of any Dell box (we usually deal with Dell) that can host say 10 drives minimum (for expandability) and that is *known* to work very well with NexentaStor, then please, please let me know about it. I am not confident about the hardware quality of the Pogo Linux solution, but forced to go with it for Nexenta. The Sun Thumper solution is too expensive for me; I am looking for a solution around $10k. I don't need all those disks or RAM in a Thumper!
- Assuming I plan to host a maximum of 8TB usable data on the Pogo box as seen in http://www.pogolinux.com/quotes/editsys?sys_id=8498 :
  * Would I need one or two of those quad-core Xeon CPUs?
  * How much RAM is needed?
  * I'm planning on using Seagate 1TB SATA 7200 rpm disks. Is that crazy? The EMC guy insisted we use 10k Fibre/SAS drives at least. We're currently on three 1TB SATA disks on my current Linux box, and it's fine for me, at least when it's not rsnapshotting. The workload is 20-user NFS for home directories and some software shares.
  * Assuming the Pogo SATA controller dies, do you suppose I could plug the disks into any other machine and work with them? I wonder why the Pogo box does not come with two controllers; doesn't Solaris support that?
Thanks a lot for your replies
On Tue, Sep 30, 2008 at 10:31 AM, MC [EMAIL PROTECTED] wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org
Re: [zfs-discuss] Quantifying ZFS reliability
On Sep 30, 2008, at 06:58, Ahmed Kamal wrote: - I still don't have my original question answered, I want to somehow assess the reliability of that zfs storage stack. If there's no hard data on that, then if any storage expert who works with lots of systems can give his impression of the reliability compared to the big two, that would be great! What would you consider hard data? Can you give examples of hard data for EMC and NetApp (or anyone else)? Then perhaps similar things can be found for ZFS. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Solaris
Hi, can anyone please tell me what is the maximum number of files that can be in one folder in Solaris with the ZFS file system? I am working on an application in which I have to support 1 million users. In my application I am using MySQL MyISAM, and in MyISAM there are 3 files created for 1 table. I am using an application architecture in which each user will have a separate table, so the expected number of files in the database folder is 3 million. I have read somewhere that each OS has a limit on how many files can be created in a folder. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Solaris
On Tue, 30 Sep 2008, Ram Sharma wrote: Hi, can anyone please tell me what is the maximum number of files that can be in one folder in Solaris with the ZFS file system? By folder, I assume you mean directory and not, say, pool. In any case, the 'limit' is 2^48, but that's effectively no limit at all. Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
Simple. You cannot go faster than the slowest link. VLANs share the bandwidth of the underlying link and do not provide dedicated bandwidth for each of them. That means if you have multiple VLANs coming out of the same wire on your server you do not have "n" times the bandwidth but only a fraction of it. Simple network math. Also, iSCSI works better on segregated IP network switches. Beware that some switches do not guarantee full 1 Gbit/s speed on all ports when all are active at the same time. Plan multiple uplinks if you have more than one switch. Once again, you cannot go faster than the slowest link. Jean gm_sjo wrote: 2008/9/30 Jean Dion [EMAIL PROTECTED]: iSCSI requires a dedicated network, not a shared network or even a VLAN. Backups cause large I/O that fills your network quickly, like any SAN today. Could you clarify why it is not suitable to use VLANs for iSCSI? -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
On Mon, Sep 29, 2008 at 06:01:18PM -0700, Jean Dion wrote: Do you have dedicated iSCSI ports from your server to your NetApp? Yes, it's a dedicated redundant gigabit network. iSCSI requires a dedicated network, not a shared network or even a VLAN. Backups cause large I/O that fills your network quickly, like any SAN today. Backups are extremely demanding on hardware (CPU, memory, I/O ports, disks, etc.). It is not rare to see performance issues during backups with several thousand small files. Each small file causes seeks on your disks and file system; as the number and size of files grow, you will be impacted. That means thousands of small files cause thousands of small I/Os but not a lot of throughput. What statistics can I generate to observe this contention? ZFS pool I/O statistics are not that different when the backup is running. The bigger your files are, the more likely the blocks will be consecutive on the file system. Small files can be spread across the entire file system, causing seeks, latency, and bottlenecks. The Legato client and server contain tuning parameters to avoid such small-file problems. Check your Legato buffer parameters. These buffers will use your server memory as disk cache. I'll ask our backup person to investigate those settings. I assume that Networker should not be buffering files, since those files won't be read again. How can I see memory usage by ZFS and by applications? Here is a good source of network tuning parameters for your T2000: http://www.solarisinternals.com/wiki/index.php/Networks#Tunable_for_general_workloads_on_T1000.2FT2000 The soft_ring setting is one of the best ones. Here is another interesting place to look: http://www.solarisinternals.com/wiki/index.php/Solaris_Internals_and_Performance_FAQ Thanks. I'll review those documents. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
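A quick, hedged sketch of commands that could help answer Gary's two questions (what to watch for contention during the backup window, and how much memory ZFS is using); the pool name matches the one shown later in the thread, everything else is generic:

  # Per-vdev I/O and service times while the backup runs; compare with a quiet window
  zpool iostat -v space 10
  iostat -xnz 10

  # ZFS ARC size and hit/miss counters
  kstat -n arcstats | egrep 'size|hits|misses'

  # Coarse kernel vs. application memory breakdown (the ARC is counted in the
  # kernel bucket on Solaris 10); run as root
  echo "::memstat" | mdb -k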
Re: [zfs-discuss] ZFS Solaris
ZFS has no limit on snapshots and filesystems either, but try creating a lot of snapshots and filesystems and you will also have to wait a long time for your pool to import... ;-) I think you should not think about the limits, but about performance. Any filesystem with *too many* entries per directory will suffer. So, my advice is to configure your app to create a better hierarchy. Leal. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
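One common way to follow Leal's advice is to fan the tables out over many directories instead of one flat folder; with MySQL/MyISAM that can be done by creating one database (one directory under the datadir) per name prefix. A minimal bash sketch that pre-creates such a fan-out, with hypothetical paths:

  # ~676 prefix directories instead of one folder holding ~3 million files
  # (paths are placeholders; each directory can back one MySQL database)
  for a in {a..z}; do
    for b in {a..z}; do
      mkdir -p /tank/mysql/users_${a}${b}
    done
  done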
Re: [zfs-discuss] ZFS performance degradation when backups are running
For Solaris internal debugging tools, look here: http://opensolaris.org/os/community/advocacy/events/techdays/seattle/OS_SEA_POD_JMAURO.pdf ZFS specifics are available here: http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide Jean Gary Mills wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] c1t0d0 to c3t1d0
I have a ZFS disk (c1t0d0) in an eSATA/USB2 enclosure. If I were to install this drive in the machine (internal SATA) it would become c3t1d0. When I did that (for testing), zpool status did not see it. What do I have to do to be able to switch this drive? -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D ++ http://nagual.nl/ + SunOS sxce snv95 ++ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] c1t0d0 to c3t1d0
On Tue, Sep 30, 2008 at 14:00, dick hoogendijk [EMAIL PROTECTED] wrote: What do I have to do to be able to switch this drive? I'd suggest running zpool import. If that doesn't show the pool, put it back in the external enclosure, run zpool export mypool and then see if it shows up in zpool import when it's internal. Will ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
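A minimal sketch of the sequence Will describes, with a hypothetical pool name:

  # While the disk is still in the external eSATA/USB2 enclosure:
  zpool export mypool
  # Shut down, move the disk to the internal SATA bay, boot, then:
  zpool import            # should list mypool on its new c3t1d0 device
  zpool import mypool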
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: Thanks for all the answers .. Please find more questions below :) - Good to know EMC filers do not have end2end checksums! What about netapp ? If they are not at the end, they can't do end-to-end data validation. Ideally, application writers would do this, but it is a lot of work. ZFS does this on behalf of applications which use ZFS. Hence my comment about ZFS being complementary to your storage device decision. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, 30 Sep 2008, Ahmed Kamal wrote: - I still don't have my original question answered, I want to somehow assess the reliability of that zfs storage stack. If there's no hard data on that, then if any storage expert who works with lots of systems can give his impression of the reliability compared to the big two, that would be great The reliability of that zfs storage stack primarily depends on the reliability of the hardware it runs on. Note that there is a huge difference between 'reliability' and 'mean time to data loss' (MTDL). There is also the concern about 'availability' which is a function of how often the system fails, and the time to correct a failure. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
I guess I am mostly interested in MTDL for a ZFS system on whitebox hardware (like the Pogo), vs. Data ONTAP on NetApp hardware. Any numbers? On Tue, Sep 30, 2008 at 4:36 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, 30 Sep 2008, Ahmed Kamal wrote: I guess I am mostly interested in MTDL for a zfs system on whitebox hardware (like pogo), vs dataonTap on netapp hardware. Any numbers ? Barring kernel bugs or memory errors, Richard Elling's blog entry seems to be the best place to use as a guide: http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl It is pretty easy to build a ZFS pool with data loss probabilities (on paper) which are about as low as your chances of winning the state jumbo lottery jackpot with a ticket you found on the ground. However, if you want to compete with an EMC system, then you will want to purchase hardware of similar grade. If you purchase a cheapo system from Dell without ECC memory then the actual data reliability will suffer. ZFS protects you against corruption in the data storage path. It does not protect you against main memory errors or random memory overwrites due to a horrific kernel bug. ZFS also does not protect against data loss due to user error, which remains the primary factor in data loss. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: I guess I am mostly interested in MTDL for a zfs system on whitebox hardware (like pogo), vs dataonTap on netapp hardware. Any numbers ? It depends to a large degree on the disks chosen. NetApp uses enterprise class disks and you can expect better reliability from such disks. I've blogged about a few different MTTDL models and posted some model results. http://blogs.sun.com/relling/tags/mttdl -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
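To get a feel for what those models produce, the simplest of them (ignoring unrecoverable read errors during reconstruction, which the blog's other models add) is MTTDL = MTBF^2 / (N * (N-1) * MTTR) for a single-parity or two-way-mirror set. A worked example with assumed numbers (1,200,000-hour disk MTBF, 24-hour resilver, 2-disk mirror):

  MTTDL = (1,200,000 h)^2 / (2 * 1 * 24 h)
        = 1.44e12 / 48
        = 3.0e10 hours, roughly 3.4 million years per mirror pair

The per-pool number is lower, since every vdev in the pool contributes its own chance of failure.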
Re: [zfs-discuss] zpool import of bootable root pool renders it
Stephen Quintero [EMAIL PROTECTED] writes: I am running OpenSolaris 2008.05 as a PV guest under Xen. If you import the bootable root pool of a VM into another Solaris VM, the root pool is no longer bootable. I had a similar problem: after installing and booting OpenSolaris 2008.05, I succeeded in locking myself out through some passwd/shadow inconsistency (totally my own fault). Not a problem, I thought -- I booted from the install disk, imported the root pool, fixed the inconsistency, and rebooted. Lo, instant panic. No idea why, though; I am not that familiar with the underlying code. I just did a reinstall. Regards, Juergen. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool error: must be a block device or regular file
The zfs kernel modules handle the caching/flushing of data across all the devices in the zpools. It uses a different method for this than the standard virtual memory system used by traditional file systems like UFS. Try defining your NVRAM card with ZFS as a log device using the /dev/dsk/xyz path and let us know how it goes. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
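A minimal sketch of what that looks like, assuming a hypothetical device path for the NVRAM card and a pool named tank:

  # Add the card as a dedicated ZIL (separate intent log) device
  zpool add tank log /dev/dsk/c5t0d0s0
  zpool status tank     # the device should now show up under a 'logs' section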
Re: [zfs-discuss] ZFS performance degradation when backups are running
Gary - Besides the network questions... What does your zpool status look like? Are you using compression on the file systems? (Was single-threaded and fixed in s10u4 or equiv patches) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
2008/9/30 Jean Dion [EMAIL PROTECTED]: Simple. You cannot go faster than the slowest link. That is indeed correct, but what is the slowest link when using a Layer 2 VLAN? You made a broad statement that iSCSI 'requires' a dedicated, standalone network. I do not believe this is the case. VLANs share the bandwidth of the underlying link and do not provide dedicated bandwidth for each of them. That means if you have multiple VLANs coming out of the same wire on your server you do not have n times the bandwidth but only a fraction of it. Simple network math. I can only assume that you are only referring to VLAN trunks, e.g. using a NIC on a server for both 'normal' traffic and having another virtual interface on it bound to a 'storage' VLAN. If this is the case then what you say is true; of course you are sharing the same physical link, so ultimately that will be the limit. However, and this should be clarified before anyone gets the wrong idea, there is nothing wrong with segmenting a switch by using VLANs to have some ports for storage traffic and some ports for 'normal' traffic. You can have one or multiple NICs for storage, and another NIC or NICs for everything else (or however you please to use your interfaces!). These can be hooked up to switch ports that are on different physical VLANs with no performance degradation. It's best not to assume that every use of a VLAN is a trunk. Also, iSCSI works better on segregated IP network switches. Beware that some switches do not guarantee full 1 Gbit/s speed on all ports when all are active at the same time. Plan multiple uplinks if you have more than one switch. Once again you cannot go faster than the slowest link. I think it's fairly safe to assume that you're going to get per-port line speed across anything other than the cheapest budget switches. Most SMB (and above) switches will be rated at, say, 48 Gbit/s backplane on a 24-port item, for example. However, I am keen to see any benchmarks you may have that show the performance difference between running a single switch with Layer 2 VLANs vs. two separate switches. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
A normal iSCSI setup splits network traffic at the physical layer, not the logical layer. That means separate physical ports, and often a separate physical PCI bridge chip if you can. A VLAN will be fine for small traffic, but we are talking about backup performance issues. The IP network and the number of small files are very often the bottlenecks. If you want performance you do not put all your I/O across the same physical wire. Once again, you cannot go faster than the physical wire can support (CAT5E, CAT6, fibre), no matter whether it is Layer 2 or not. Using VLANs on a single port you "share" the bandwidth; Layer 2 does not create more Gbit/s of speed. iSCSI best practice requires a separate physical network. Many books and white papers have been written about this. This is like any FC SAN implementation: we always split the workload between disk and tape using more than one HBA. Never forget, backups are intensive I/O and will fill the entire I/O path. Jean gm_sjo wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
On Tue, Sep 30, 2008 at 10:32:50AM -0700, William D. Hathaway wrote: Gary - Besides the network questions... Yes, I suppose I should see if traffic on the iSCSI network is hitting a limit of some sort. What does your zpool status look like? Pretty simple:

$ zpool status
  pool: space
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        space                                    ONLINE       0     0     0
          c4t60A98000433469764E4A2D456A644A74d0  ONLINE       0     0     0
          c4t60A98000433469764E4A2D456A696579d0  ONLINE       0     0     0
          c4t60A98000433469764E4A476D2F6B385Ad0  ONLINE       0     0     0
          c4t60A98000433469764E4A476D2F664E4Fd0  ONLINE       0     0     0

errors: No known data errors

The four LUNs use the built-in I/O multipathing, with separate iSCSI networks, switches, and ethernet interfaces. Are you using compression on the file systems? (Was single-threaded and fixed in s10u4 or equiv patches) No, I've never enabled compression there. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
On Mon, Sep 29, 2008 at 06:01:18PM -0700, Jean Dion wrote: Legato client and server contains tuning parameters to avoid such small file problems. Check your Legato buffer parameters. These buffer will use your server memory as disk cache. Our backup person tells me that there are no settings in Networker that affect buffering on the client side. Here is a good source of network tuning parameters for your T2000 http://www.solarisinternals.com/wiki/index.php/Networks#Tunable_for_general_workloads_on_T1000.2FT2000 The soft_ring is one of the best one. Those references are for network tuning. I don't want to change things blindly. How do I tell if they are necessary, that is if the network is the bottleneck in the I/O system? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
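One way to answer that without tuning anything is to watch the interface counters during the backup window and compare them against the wire speed; a sketch with placeholder interface names:

  # Packet counters every 10 seconds on the iSCSI-facing interface
  netstat -I e1000g0 10

  # Byte counters per link, if this release's dladm supports interval statistics
  dladm show-link -s -i 10 e1000g0

If rbytes/obytes sit near the ~110 MB/s that a single gigabit link can carry, the network is the bottleneck; if they stay well below that, look elsewhere.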
Re: [zfs-discuss] Quantifying ZFS reliability
ak == Ahmed Kamal [EMAIL PROTECTED] writes: ak I need to answer and weigh against the cost. I suggest translating the reliability problems into a cost for mitigating them: price the ZFS alternative as two systems, and keep the second system offline except for nightly backup. Since you care mostly about data loss, not availability, this should work okay. You can lose 1 day of data, right? I think you need two zpools, or zpool + LVM2/XFS, some kind of two-filesystem setup, because of the ZFS corruption and panic/freeze-on-import problems. Having two zpools helps with other things, too, like if you need to destroy and recreate the pool to remove a slog or a vdev, or change from mirroring to raidz2, or something like that. I don't think it's realistic to give a quantitative MTDL for loss caused by software bugs, from netapp or from ZFS. ak The EMC guy insisted we use 10k Fibre/SAS drives at least. I'm still not experienced at dealing with these guys without wasting huge amounts of time. I guess one strategy is to call a bunch of them, so they are all wasting your time in parallel. Last time I tried, the EMC guy wanted to meet _in person_ in the financial district, and then he just stopped calling, so I had to guesstimate his quote from some low-end iSCSI/FC box that Dell was reselling. Have you called netapp, hitachi, storagetek? The IBM NAS is netapp, so you could call IBM if netapp ignores you, but you probably want the StoreVault, which is sold differently. The HP NAS looks weird because it runs your choice of Linux or Windows instead of WeirdNASplatform---maybe read some more about that one. Of course you don't get source, but it surprised me that these guys are MUCH worse than ordinary proprietary software. At least with netapp stuff, you may as well consider it leased. They leverage the ``appliance'' aspect, and then have sneaky licenses that attempt to obliterate any potential market for used filers. When you're cut off from support you can't even download manuals. If you're accustomed to the ``first sale doctrine'' then ZFS with source has a huge advantage over netapp, beyond even ZFS's advantage over proprietary software. The idea of dumping all my data into some opaque DRM canister lorded over by asshole CEOs who threaten to sic their corporate lawyers on users on the mailing list offends me just a bit, but I guess we have to follow the ``market forces.'' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs allow interaction with file system privileges
On Tue, 23 Sep 2008, Darren J Moffat wrote: Run the service with the file_chown privilege. See privileges(5), rbac(5) and if it runs as an SMF service smf_method(5). Thanks for the pointer. After reviewing this documentation, it seems that file_chown_self is the best privilege to delegate, as the service account only needs to give away the filesystems it has created to the appropriate owner; it should never need to arbitrarily chown other things. I'm actually running a separate instance of Apache/mod_perl which exposes my ZFS management API as a web service to our central identity management server. So it does run under SMF, but I'm having trouble getting the privilege delegation the way I need it to be. The method_credential option in the manifest only seems to apply to the initial start of the service. Apache needs to start as root, and then gives up the privileges when it spawns children. I can't have SMF control the privileges of the initial parent Apache process or it won't start. Started with full privileges, the parent process looks like:

        E: all
        I: basic
        P: all
        L: all

And the children:

        flags = none
        E: basic
        I: basic
        P: basic
        L: all

I manually ran 'ppriv -s I+file_chown_self' on the parent Apache process, which resulted in:

        flags = none
        E: all
        I: basic,file_chown_self
        P: all
        L: all

And the children:

        flags = none
        E: basic,file_chown_self
        I: basic,file_chown_self
        P: basic,file_chown_self
        L: all

Which worked perfectly. Is there any syntax available for the SMF manifest that would allow starting the original process with all privileges, but configure the inheritable privileges to include the additional file_chown_self? If not, the only other option I can think of offhand is to put together a small Apache module that runs during server initialization and changes the inheritable privileges before the children are spawned. Thanks... -- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | [EMAIL PROTECTED] California State Polytechnic University | Pomona CA 91768 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Thanks guys, it seems the problem is even more difficult than I thought, and it seems there is no real measure of the software quality of the ZFS stack vs. others, neutralizing the hardware used under both. I will be using ECC RAM, since you mentioned it, and I will shift to using enterprise disks (I had initially thought ZFS always recovers from cheapo SATA disks, making other disks only faster but not also safer), so now I am shifting to 10k rpm SAS disks. So, I am changing my question: do you see any obvious problems with the following setup I am considering?
- CPU: 1 Xeon Quad Core E5410 2.33GHz 12MB Cache 1333MHz
- 16GB ECC FB-DIMM 667MHz (8 x 2GB)
- 10 Seagate 400GB 10K 16MB SAS HDD
The 10 disks will be: 2 spare + 2 parity for raidz2 + 6 data = 2.4TB usable space.
* Do I need more CPU power? How do I measure that? What about RAM?
* Now that I'm using ECC RAM and enterprisey disks, does this put this solution on par with a low-end NetApp 2020, for example? I will be replicating the important data daily to a Linux box, just in case I hit a wonderful zpool bug.
Any final advice before I take the blue pill ;) Thanks a lot
On Tue, Sep 30, 2008 at 8:40 PM, Miles Nordin [EMAIL PROTECTED] wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
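A minimal sketch of the pool layout Ahmed describes (an 8-disk raidz2 set, i.e. 6 data + 2 parity, plus 2 hot spares); the device names are hypothetical:

  zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
      spare  c1t8d0 c1t9d0
  zpool status tank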
Re: [zfs-discuss] ZFS performance degradation when backups are running
2008/9/30 Jean Dion [EMAIL PROTECTED]: If you want performance you do not put all your I/O across the same physical wire. Once again you cannot go faster than the physical wire can support (CAT5E, CAT6, fibre), no matter whether it is Layer 2 or not. Using VLANs on a single port you share the bandwidth; Layer 2 does not create more Gbit/s of speed. iSCSI best practice requires a separate physical network. Many books and white papers have been written about this. Yes, that's true, but I don't believe you mentioned single-NIC implementations in your original statement. Just seeking clarification to help others :-) I think it's worth clarifying that iSCSI and VLANs are okay as long as people appreciate that you will require separate interfaces to get the best performance. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Oracle DB sequential dump questions
Server: T5120 on Solaris 10 U5. Storage: 8 internal drives on SAS HW RAID (RAID-5). Oracle: ZFS filesystem, recordsize=8K and atime=off. Tape: LTO-4 (half height) on a SAS interface. Dumping a large file from memory using tar to LTO yields 44 MB/s ... I suspect the CPU cannot push more, since it's a single thread doing all the work. Dumping Oracle db files from the filesystem yields ~25 MB/s. The interesting bit (apart from it being a rather slow speed) is the fact that the speed fluctuates on the disk side but stays constant to the tape. I see spikes of up to 50-60 MB/s over 5 seconds, while the tape continues to push its steady 25 MB/s. There has been NO tuning ... the above is absolutely standard. Where should I investigate to increase throughput? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
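One low-cost experiment (a sketch, not a recommendation) is to put a re-blocking stage between the filesystem reads and the tape, so the drive sees large steady writes even while the disk-side rate fluctuates; the tape device path and block size here are assumptions:

  # Read with tar, re-block to 256 KB writes on the no-rewind tape device
  tar cf - /u01/oradata | dd of=/dev/rmt/0cbn obs=256k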
Re: [zfs-discuss] c1t0d0 to c3t1d0
Inserting the drive does not automatically mount the ZFS filesystem on it. You need to use the zpool import command, which lists any pools available to import, then zpool import -f {name of pool} to force the import if you haven't exported the pool first. Cheers Andrew. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 2:10 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: * Now that I'm using ECC RAM and enterprisey disks, does this put this solution on par with a low-end NetApp 2020, for example? *sort of*. What are you going to be using it for? Half the beauty of NetApp is all the add-on applications you run server side, such as the SnapManager products. If you're just using it for basic single-head file serving, I'd say you're pretty much on par. IMO, NetApp's clustering is still far superior (yes folks, from a fileserver perspective, not an application clustering perspective) to anything Solaris has to offer right now, and also much, much, MUCH easier to configure/manage. Let me know when I can plug an infiniband cable between two Solaris boxes and type cf enable and we'll talk :) --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
gm_sjo wrote: 2008/9/30 Jean Dion [EMAIL PROTECTED]: If you want performance you do not put all your I/O across the same physical wire. Once again you cannot go faster than the physical wire can support (CAT5E, CAT6, fibre), no matter whether it is Layer 2 or not. Using VLANs on a single port you share the bandwidth; Layer 2 does not create more Gbit/s of speed. iSCSI best practice requires a separate physical network. Many books and white papers have been written about this. Yes, that's true, but I don't believe you mentioned single-NIC implementations in your original statement. Just seeking clarification to help others :-) I think it's worth clarifying that iSCSI and VLANs are okay as long as people appreciate that you will require separate interfaces to get the best performance. Separate interfaces or networks may not be required, but properly sized networks are highly desirable. For example, a back-of-the-envelope analysis shows that a single 10GbE pipe is sufficient to drive 8 T10KB drives. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
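For reference, the arithmetic behind that back-of-the-envelope claim, assuming roughly 120 MB/s native throughput per T10000B drive:

  8 drives x 120 MB/s = 960 MB/s ~= 7.7 Gbit/s, which fits within a single 10GbE link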
Re: [zfs-discuss] Oracle DB sequential dump questions
Louwtjie Burger wrote: Dumping a large file from memory using tar to LTO yields 44 MB/s ... I suspect the CPU cannot push more since it's a single thread doing all the work. Dumping oracle db files from filesystem yields ~ 25 MB/s. The interesting bit (apart from it being a rather slow speed) is the fact that the speed fluctuates from the disk area.. but stays constant to the tape. I see up to 50-60 MB/s spikes over 5 seconds, while the tape continues to push it's steady 25 MB/s. There has been NO tuning .. above is absolutely standard. Where should I investigate to increase throughput ... Does your tape drive compress (most do)? If so, you may be seeing compressible vs. uncompressible data effects. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
No apology necessary and I'm glad you figured it out - I was just reading this thread and thinking I'm missing something here - this can't be right. If you have the budget to run a few more experiments, try this SuperMicro card: http://www.springsource.com/repository/app/faq that others have had success with. Regards, -- Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED] Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ Wrong link? --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
Is there more information that I need to post in order to help diagnose this problem? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
On Mon, Sep 29, 2008 at 12:57 PM, Ross Becker [EMAIL PROTECTED] wrote: I have to come back and face the shame; this was a total newbie mistake by myself. I followed the ZFS shortcuts for noobs guide off bigadmin; http://wikis.sun.com/display/BigAdmin/ZFS+Shortcuts+for+Noobs What that had me doing was creating a UFS filesystem on top of a ZFS volume, so I was using only 2 layers of ZFS. I just re-did this against end-to-end ZFS, and the results are pretty freaking impressive; ZFS is handily outrunning the hardware RAID. Bonnie++ is achieving 257 mb/sec write, and 312 mb/sec read. My apologies for wasting folks time; this is my first experience with a solaris of recent vintage. No apology necessary and I'm glad you figured it out - I was just reading this thread and thinking I'm missing something here - this can't be right. If you have the budget to run a few more experiments, try this SuperMicro card: http://www.springsource.com/repository/app/faq that others have had success with. Regards, -- Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED] Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Just to confuse you more, I mean, give you another point of view: - CPU: 1 Xeon Quad Core E5410 2.33GHz 12MB Cache 1333MHz The reason the Xeon line is good is because it allows you to squeeze maximum performance out of a given processor technology from Intel, possibly getting the highest performance density. The reason it is bad is because it isn't that much better for a lot more money. A mainstream processor is 80% of the performance for 20% of the price, so unless you need the highest possible performance density, you can save money going mainstream. Not that you should. Intel mainstream (and indeed many tech companies') stuff is purposely stratified from the enterprise stuff by cutting out features like ECC and higher memory capacity and using different interface form factors. - 10 Seagate 400GB 10K 16MB SAS HDD There is nothing magical about SAS drives. Hard drives are for the most part all built with the same technology. The MTBF on that is 1.4M hours vs 1.2M hours for the enterprise 1TB SATA disk, which isn't a big difference. And for comparison, the WD3000BLFS is a consumer drive with 1.4M hours MTBF. And we know that enterprise SATA drives are the same as the consumer drives, just with different firmware optimized for server workloads and longer testing designed to detect infant mortality, which affects MTBF just as much as old-age failure. The MTBF difference from this extra testing at the start is huge. So you can tell right there that the perceived extra reliability scam they're running is bunk. The SAS interface is a psychological tool to help disguise the fact that we're all using roughly the same stuff :) Do your own 24 hour or 7-day stress-testing before deployment to weed out bad drives. Apparently old humans don't live that much longer than they did in years gone by, instead much fewer of our babies die, which makes the average lifespan of everyone go up :) You know that 1TB SATA works for you now. Don't let some big greedy company convince you otherwise. That extra money should be spent on your payroll, not on filling EMC's coffers. ZFS provides a new landscape for storage. It is entirely possible that a server built with mainstream hardware can be cheaper, faster, and at least as reliable as an EMC system. Manageability and interoperability and all those things are another issue however. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
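A minimal sketch of the kind of burn-in MC describes, assuming the disks already sit in a pool named tank; let it run for a day or a week and then check the error counters:

  # Sequential read pass over every disk in parallel (device names are placeholders)
  for d in c1t0d0 c1t1d0 c1t2d0; do
    dd if=/dev/rdsk/${d}s0 of=/dev/null bs=1024k &
  done
  wait

  # Exercise the pool end to end and look for READ/WRITE/CKSUM errors
  zpool scrub tank
  zpool status -v tank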
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
BJ Quinn wrote: Is there more information that I need to post in order to help diagnose this problem? Segmentation faults should be correctly handled by the software. Please file a bug and attach the core. http://bugs.opensolaris.org -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote: Thanks for all the answers .. Please find more questions below :) - Good to know EMC filers do not have end2end checksums! What about netapp ? Bluntly - no remote storage can have it, by definition. The checksum needs to be computed as close as possible to the application. That's why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data). --Toby ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import of bootable root pool renders it
Hello Juergen, Tuesday, September 30, 2008, 5:43:56 PM, you wrote: JN Stephen Quintero [EMAIL PROTECTED] writes: I am running OpenSolaris 2008.05 as a PV guest under Xen. If you import the bootable root pool of a VM into another Solaris VM, the root pool is no longer bootable. JN I had a similar problem: After installing and booting Opensolaris JN 2008.05, I succeded to lock myself out through some passwd/shadow JN inconsistency (totally my own fault). Not a problem, I thought -- I JN booted from the install disk, imported the root pool, fixed the JN inconsistency, and rebooted. Lo, instant panic. JN No idea why, though, I am not that familiar with the underlying JN code. I just did a reinstall. I hit the same issue - once I tried to boot OpenSolaris from within VirtualBox with the disk partition exposed to VB, the kernel couldn't mount the root fs either from VB or directly from the notebook - I had to import/export the pool while booting from CD. I haven't investigated it further, but I'm surprised it's not working OOB. -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
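A sketch of the boot-CD dance being described, with a hypothetical root pool and boot environment name; the point of the final export is to leave the pool cleanly released before rebooting from disk:

  # Booted from the install/live CD:
  zpool import -f -R /a rpool          # import the root pool under an alternate root
  zfs mount rpool/ROOT/opensolaris     # BE dataset name is an assumption
  vi /a/etc/shadow                     # ...fix whatever needed fixing...
  zpool export rpool                   # then reboot from the disk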
Re: [zfs-discuss] ZFS Solaris
On 30-Sep-08, at 7:50 AM, Ram Sharma wrote: Hi, can anyone please tell me what is the maximum number of files that can be in one folder in Solaris with the ZFS file system. I am working on an application in which I have to support 1 million users. In my application I am using MySQL MyISAM and in MyISAM there are 3 files created for 1 table. I am using an application architecture in which each user will have a separate table, so the expected number of files in the database folder is 3 million. That sounds like a disastrous schema design. Apart from that, you're going to run into problems on several levels, including O/S resources (file descriptors) and filesystem scalability. --Toby I have read somewhere that each OS has a limit on how many files can be created in a folder. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
On Tue, Sep 30, 2008 at 3:51 PM, Tim [EMAIL PROTECTED] wrote: [...] Wrong link? Sorry! :( http://www.supermicro.com/products/accessories/addon/AOC-USASLP-L8i.cfm --Tim -- Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED] Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Toby Thain wrote: On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote: Thanks for all the answers .. Please find more questions below :) - Good to know EMC filers do not have end2end checksums! What about netapp ? Bluntly - no remote storage can have it, by definition. The checksum needs to be computed as close as possible to the application. That's why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data). --Toby Well, that's not _strictly_ true. ZFS can still munge things up as a result of faulty memory. And it's entirely possible to build a hardware end-to-end system which is at least as reliable as ZFS (i.e. one that is only faultable due to host memory failures). It's just neither easy, nor currently available from anyone I know of. Doing such checking is far easier at the filesystem level than any other place, which is a big strength of ZFS over other hardware solutions. I do believe several of the storage vendors (EMC and NetApp included) support hardware checksumming on the SAN/NAS device, but that still leaves them vulnerable to HBA and transport medium (e.g. FibreChannel/SCSI/Ethernet) errors, which they don't currently have a solution for. I'd be interested in seeing if anyone has statistics about where errors occur in the data stream. My gut tells me that (from most common to least):
(1) hard drives
(2) transport medium (particularly if it's Ethernet)
(3) SAN/NAS controller cache
(4) Host HBA
(5) SAN/NAS controller
(6) Host RAM
(7) Host bus issues
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 4:26 PM, Toby Thain [EMAIL PROTECTED]wrote: On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote: Thanks for all the answers .. Please find more questions below :) - Good to know EMC filers do not have end2end checksums! What about netapp ? Blunty - no remote storage can have it by definition. The checksum needs to be computed as close as possible to the application. What's why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data). --Toby ... So how is a Server running Solaris with a QLogic HBA connected to an FC JBOD any different than a NetApp filer, running ONTAP with a QLogic HBA directly connected to an FC JBOD? How is it several unreliable subsystems away from the data? That's a great talking point but it's far from accurate. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import of bootable root pool renders it
On Tue, 30 Sep 2008, Robert Milkowski wrote: Hello Juergen, Tuesday, September 30, 2008, 5:43:56 PM, you wrote: JN Stephen Quintero [EMAIL PROTECTED] writes: I am running OpenSolaris 2008.05 as a PV guest under Xen. If you import the bootable root pool of a VM into another Solaris VM, the root pool is no longer bootable. JN I had a similar problem: After installing and booting Opensolaris JN 2008.05, I succeded to lock myself out through some passwd/shadow JN inconsistency (totally my own fault). Not a problem, I thought -- I JN booted from the install disk, imported the root pool, fixed the JN inconsistency, and rebooted. Lo, instant panic. JN No idea why, though, I am not that familiar with the underlying JN code. I just did a reinstall. I hit the same issue - once I tried to boot OS from within virtualbox with disk partition exposed to VB - kernel couldn't mount root fs either from VB or directly from notebook - I had to import/export pool while booting from CD. I haven't investigated it further but I'm surprised it's not working OOB. I think this is 6737463 -- Dave ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
Please forgive my ignorance. I'm fairly new to Solaris (Linux convert), and although I recognize that Linux has the same concept of Segmentation faults / core dumps, I believe my typical response to a Segmentation Fault was to upgrade the kernel and that always fixed the problem (i.e. somebody else filed the bug and fixed the problem before I got around to doing it myself). So - I'm running stock OpenSolaris 2008.05. Even if the bug was fixed, I imagine it would require a Solaris kernel upgrade anyway, right? Perhaps I could simply try that first? Are the kernel upgrades stable? I know for a while there, before the 2008.05 release, Solaris just released a new development kernel every two weeks. I don't think I want to just haphazardly upgrade to some random bi-weekly development kernel. Are there actually stable kernel upgrades for OS, and how would I go about upgrading it if there are? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
At this point, ZFS is performing admirably with the Areca card. Also, that card is only 8-port, and the Areca controllers I have are 12-port. My chassis has 24 SATA bays, so being able to cover all the drives with 2 controllers is preferable. Also, the driver for the Areca controllers is being integrated into OpenSolaris as we discuss, so the next spin of Opensolaris won't even require me to add the driver for it. --Ross -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
BJ Quinn wrote: Please forgive my ignorance. I'm fairly new to Solaris (Linux convert), and although I recognize that Linux has the same concept of Segmentation faults / core dumps, I believe my typical response to a Segmentation Fault was to upgrade the kernel and that always fixed the problem (i.e. somebody else filed the bug and fixed the problem before I got around to doing it myself). So - I'm running stock OpenSolaris 2008.05. Even if the bug was fixed, I imagine it would require a Solaris kernel upgrade anyway, right? Perhaps I could simply try that first? Are the kernel upgrades stable? I know for a while there, before the 2008.05 release, Solaris just released a new development kernel every two weeks. I don't think I want to just haphazardly upgrade to some random bi-weekly development kernel. Are there actually stable kernel upgrades for OS, and how would I go about upgrading it if there are? If there was a bug already filed and fixed, then it should be in the bugs database, which is searchable at: http://bugs.opensolaris.org -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Will Murnane wrote: On Tue, Sep 30, 2008 at 21:48, Tim [EMAIL PROTECTED] wrote: why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data). So how is a Server running Solaris with a QLogic HBA connected to an FC JBOD any different than a NetApp filer, running ONTAP with a QLogic HBA directly connected to an FC JBOD? How is it several unreliable subsystems away from the data? That's a great talking point but it's far from accurate. Do your applications run on the NetApp filer? The idea of ZFS as I see it is to checksum the data from when the application puts the data into memory until it reads it out of memory again. Separate filers can checksum from when data is written into their buffers until they receive the request for that data, but to get from the filer to the machine running the application the data must be sent across an unreliable medium. If data is corrupted between the filer and the host, the corruption cannot be detected. Perhaps the filer could use a special protocol and include the checksum for each block, but then the host must verify the checksum for it to be useful. Contrast this with ZFS. It takes the application data, checksums it, and writes the data and the checksum out across the (unreliable) wire to the (unreliable) disk. Then when a read request comes, it reads the data and checksum across the (unreliable) wire, and verifies the checksum on the *host* side of the wire. If the data is corrupted any time between the checksum being calculated on the host and checked on the host, it can be detected. This adds a couple more layers of verifiability than filer-based checksums. Will To make Will's argument more succinct (wink), with a NetApp, undetectable (by the NetApp) errors can be introduced at the HBA and transport layer (FC Switch, slightly damage cable) levels. ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability. Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. That is, they can determine if a block is written correctly, but I don't know if they keep the block checksum around permanently, and, how redundant that stored block checksum is. If they don't permanently write the block checksum somewhere, then the NetApp has no way to determine if a READ block is OK, and hasn't suffered from bit-rot (aka disk block failure). And, if it's not either multiply stored, then they have the potential to lose the ability to do READ verification. Neither are problems of ZFS. In many of my production environments, I've got at least 2 different FC switches between my hosts and disks. And, with longer cables, comes more of the chance that something gets bent a bit too much. Finally, HBAs are not the most reliable things I've seen (sadly). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
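Will's point about which side of the wire the verification happens on can be sketched in shell terms (hostnames and paths below are invented, digest(1) stands in for the per-block checksums ZFS keeps automatically, and a real test would also have to defeat the NFS client cache on the read-back):

digest -a sha256 /local/data/payload.dat > /tmp/payload.sha256    # checksum computed at the source
cp /local/data/payload.dat /net/filer/export/payload.dat         # write across the wire
cp /net/filer/export/payload.dat /tmp/payload.readback           # read it back
digest -a sha256 /tmp/payload.readback | diff - /tmp/payload.sha256 \
  && echo "round trip verified on the host" \
  || echo "corruption somewhere between host memory and the filer"

The filer can run whatever internal checksums it likes; only a comparison done on the host side of the wire, like the one above, covers the HBA, switches and cabling in between.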
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 03:19:40PM -0700, Erik Trimble wrote: To make Will's argument more succinct (wink), with a NetApp, undetectable (by the NetApp) errors can be introduced at the HBA and transport layer (FC Switch, slightly damaged cable) levels. ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability. It sounds like you mean the Netapp can't detect silent errors in its own storage. It can (in a manner similar, but not identical to ZFS). The difference is that the Netapp is always remote from the application, and cannot detect corruption introduced before it arrives at the filer. Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. That is, they can determine if a block is written correctly, but I don't know if they keep the block checksum around permanently, and, how redundant that stored block checksum is. If they don't permanently write the block checksum somewhere, then the NetApp has no way to determine if a READ block is OK, and hasn't suffered from bit-rot (aka disk block failure). And, if it's not either multiply stored, then they have the potential to lose the ability to do READ verification. Neither are problems of ZFS. A netapp filer does have a permanent block checksum that can verify reads. To my knowledge, it is not redundant. But then if it fails, you can just declare that block bad and fall back on the RAID/mirror redundancy to supply the data. -- Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 5:19 PM, Erik Trimble [EMAIL PROTECTED] wrote: To make Will's argument more succinct (wink), with a NetApp, undetectable (by the NetApp) errors can be introduced at the HBA and transport layer (FC Switch, slightly damaged cable) levels. ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability. Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. That is, they can determine if a block is written correctly, but I don't know if they keep the block checksum around permanently, and, how redundant that stored block checksum is. If they don't permanently write the block checksum somewhere, then the NetApp has no way to determine if a READ block is OK, and hasn't suffered from bit-rot (aka disk block failure). And, if it's not either multiply stored, then they have the potential to lose the ability to do READ verification. Neither are problems of ZFS. In many of my production environments, I've got at least 2 different FC switches between my hosts and disks. And, with longer cables, comes more of the chance that something gets bent a bit too much. Finally, HBAs are not the most reliable things I've seen (sadly). NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
On Tue, Sep 30, 2008 at 5:04 PM, Ross Becker [EMAIL PROTECTED]wrote: At this point, ZFS is performing admirably with the Areca card. Also, that card is only 8-port, and the Areca controllers I have are 12-port. My chassis has 24 SATA bays, so being able to cover all the drives with 2 controllers is preferable. Also, the driver for the Areca controllers is being integrated into OpenSolaris as we discuss, so the next spin of Opensolaris won't even require me to add the driver for it. --Ross -- All very valid points... if you don't mind spending 8x as much for the cards :) --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import of bootable root pool renders it unbootable
I have not tried importing bootable root pools onto other VMs, but there have been recent ZFS bug fixes in the area of importing and exporting bootable root pools - the panic might not occur on Solaris Nevada releases after approximately 97. There are still issues with renaming of bootable root pools - particularly if they are renamed during import - newpool from zpool(1m). If you import with a different name, at the moment, you will have to export, then import by the original name before it can be booted without GRUB menu changes. Check your Solaris version, check to see if your zpool import is using an alternate pool name. If so, try re-importing using the original name before trying to reboot. Let us know what happens. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
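For anyone who hits this, the recovery described above looks roughly like the following when booted from the install CD (the pool names and numeric pool ID are examples only, not taken from the report):

zpool export altrpool                        # let go of the pool imported under the alternate name
zpool import                                 # list exportable pools along with their numeric IDs
zpool import -f 6930485479747913425 rpool    # re-import by ID under the original name
# reboot; GRUB should again find the root pool under the name its menu expects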
Re: [zfs-discuss] Quantifying ZFS reliability
Intel mainstream (and indeed many tech companies') stuff is purposely stratified from the enterprise stuff by cutting out features like ECC and higher memory capacity and using different interface form factors. Well I guess I am getting a Xeon anyway There is nothing magical about SAS drives. Hard drives are for the most part all built with the same technology. The MTBF on that is 1.4M hours vs 1.2M hours for the enterprise 1TB SATA disk, which isn't a big difference. And for comparison, the WD3000BLFS is a consumer drive with 1.4M hours MTBF. Hmm ... well, there is a considerable price difference, so unless someone says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200 drives. By the way, how many of those would saturate a single (non trunked) Gig ethernet link ? Workload NFS sharing of software and homes. I think 4 disks should be about enough to saturate it ? BTW, for everyone saying zfs is more reliable because it's closer to the application than a netapp, well at least in my case it isn't. The solaris box will be NFS sharing and the apps will be running on remote Linux boxes. So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
True, but a search for zfs segmentation fault returns 500 bugs. It's possible one of those is related to my issue, but it would take all day to find out. If it's not flaky or unstable, I'd like to try upgrading to the newest kernel first, unless my Linux mindset is truly out of place here, or if it's not relatively easy to do. Are these kernels truly considered stable? How would I upgrade? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 6:03 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: Hmm ... well, there is a considerable price difference, so unless someone says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200 drives. By the way, how many of those would saturate a single (non trunked) Gig ethernet link ? Workload NFS sharing of software and homes. I think 4 disks should be about enough to saturate it ? SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. BTW, for everyone saying zfs is more reliable because it's closer to the application than a netapp, well at least in my case it isn't. The solaris box will be NFS sharing and the apps will be running on remote Linux boxes. So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! Won't be happening anytime soon. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: BTW, for everyone saying zfs is more reliable because it's closer to the application than a netapp, well at least in my case it isn't. The solaris box will be NFS sharing and the apps will be running on remote Linux boxes. So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! We've actually prototyped an NFS protocol extension that does this, but the challenges are integrating it with ZFS to form a single protection domain, and getting the protocol to be a standard. For now, an option you have is Kerberos with data integrity; the sender computes a CRC of the data and the receiver can verify it to rule out OTW corruption. This is, of course, not end-to-end from platter to memory, but introduces a separate protection domain for the NFS link. Rob T ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
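As a concrete sketch of the Kerberos option Rob mentions -- this assumes a working realm, host principals and keytabs are already in place on both ends:

share -F nfs -o sec=krb5i,rw /export/home             # on the Solaris NFS server
mount -F nfs -o sec=krb5i filer:/export/home /mnt     # on a Solaris client

sec=krb5i adds a per-RPC integrity checksum computed by the sender and verified by the receiver, which is the separate protection domain for the NFS link described above.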
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, 30 Sep 2008, Miles Nordin wrote: I think you need two zpools, or zpool + LVM2/XFS, some kind of two-filesystem setup, because of the ZFS corruption and panic/freeze-on-import problems. Having two zpools helps with other If ZFS provides such a terrible experience for you can I be brave enough to suggest that perhaps you are on the wrong mailing list and perhaps you should be watching the pinwheels with HFS+? ;-) While we surely do hear all the horror stories on this list, I don't think that ZFS is as wildly unstable as you make it out to be. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, Ahmed Kamal wrote: So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
Actually, the one that'll hurt most is ironically the most closely related to bad database schema design... With a zillion files in the one directory, if someone does an 'ls' in that directory, it'll not only take ages, but steal a whole heap of memory and compute power... Provided the only things that'll be doing *anything* in that directory are using indexed methods, there is no real problem from a ZFS perspective, but if something decides to list (or worse, list and sort) that directory, it won't be that pleasant. Oh - That's of course assuming you have sufficient memory in the system to cache all that metadata somewhere... If you don't then that's another zillion I/O's you need to deal with each time you list the entire directory. an ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.) My box has the ARC limited to about 1GB, so it's obviously undersized for such a workload, but still gives you an indication... I generally look to keep directories to a size that allows the utilities that work on and in it to perform at a reasonable rate... which for the most part is around the 100K files or less... Perhaps you are using larger hardware than I am for some of this stuff? :) Nathan. On 1/10/08 07:29 AM, Toby Thain wrote: On 30-Sep-08, at 7:50 AM, Ram Sharma wrote: Hi, can anyone please tell me what is the maximum number of files that can be there in 1 folder in Solaris with ZSF file system. I am working on an application in which I have to support 1mn users. In my application I am using MySql MyISAM and in MyISAM there is 3 files created for 1 table. I am having application architechture in which each user will be having separate table, so the expected number of files in database folder is 3mn. That sounds like a disastrous schema design. Apart from that, you're going to run into problems on several levels, including O/S resources (file descriptors) and filesystem scalability. --Toby I have read somewhere that there is a limit of each OS to create files in a folder. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert [EMAIL PROTECTED] // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
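One way to see how much of that time is the directory read itself versus ls sorting and stat-ing every entry is to stream the entries instead (the directory name is an example; the perl one-liner just counts entries as readdir returns them):

cd /tank/bigdir
ptime perl -e 'opendir(D, "."); $n++ while defined(readdir(D)); print "$n\n"'
ptime ls -1rt > /dev/null        # the sorted-by-mtime case, for comparison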
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 06:09:30PM -0500, Tim wrote: On Tue, Sep 30, 2008 at 6:03 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: BTW, for everyone saying zfs is more reliable because it's closer to the application than a netapp, well at least in my case it isn't. The solaris box will be NFS sharing and the apps will be running on remote Linux boxes. So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! Won't be happening anytime soon. If you use RPCSEC_GSS with integrity protection then you've got it already. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
rt == Robert Thurlow [EMAIL PROTECTED] writes: rt introduces a separate protection domain for the NFS link. There are checksums in the ethernet FCS, checksums in IP headers, checksums in UDP headers (which are sometimes ignored), and checksums in TCP (which are not ignored). There might be an RPC layer checksum, too, not sure. Different arguments can be made against each, I suppose, but did you have a particular argument in mind? Have you experienced corruption with NFS that you can blame on the network, not the CPU/memory/busses of the server and client? I've experienced enough to make me buy stories of corruption in disks, disk interfaces, and memory. but not yet with TCP so I'd like to hear the story as well as the hypothetical argument, if there is one. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
BJ Quinn wrote: True, but a search for zfs segmentation fault returns 500 bugs. It's possible one of those is related to my issue, but it would take all day to find out. If it's not flaky or unstable, I'd like to try upgrading to the newest kernel first, unless my Linux mindset is truly out of place here, or if it's not relatively easy to do. Are these kernels truly considered stable? How would I upgrade? Searching bug databases can be an art... Project Indiana is where notifications of package repository changes are made. b98 is available, with instructions posted recently http://www.opensolaris.org/jive/thread.jspa?threadID=75115&tstart=15 Be sure to read the release notes http://opensolaris.org/os/project/indiana/resources/rn3/image-update/ -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
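For reference, the update path described in that thread boils down to roughly the following on a stock 2008.05 install (read the linked release notes first; the exact package set and repository configuration may differ):

pfexec pkg refresh
pfexec pkg install SUNWipkg      # update the packaging tools themselves first
pfexec pkg image-update          # builds a new boot environment at the newer build
# reboot and choose the new boot environment from the GRUB menu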
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
On Tue, 30 Sep 2008, BJ Quinn wrote: True, but a search for zfs segmentation fault returns 500 bugs. It's possible one of those is related to my issue, but it would take all day to find out. If it's not flaky or unstable, I'd like to try upgrading to the newest kernel first, unless my Linux mindset is truly out of place here, or if it's not relatively easy to do. Are these kernels truly considered stable? How would I upgrade? -- This Linux and Solaris are quite different when it comes to kernel strategies. Linux documents and stabilizes its kernel interfaces while Solaris does not document its kernel interfaces, but focuses on stable shared library interfaces. Most Linux system APIs have a direct kernel API equivalent but Solaris often uses a completely different kernel interface. Segmentation faults in user applications are generally due to user-space bugs rather than due to the kernel. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
On Wed, 1 Oct 2008, Nathan Kroenert wrote: zillion I/O's you need to deal with each time you list the entire directory. an ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.) A million files in ZFS is no big deal:

% ptime ls -1rt > /dev/null
real       17.277
user        8.992
sys         8.231
% ptime ls -1rt | wc -l
real       17.045
user        8.607
sys         8.413
 100

Maybe the problem is that you need to increase your screen's scroll rate. :-) Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Sep 30, 2008, at 19:44, Miles Nordin wrote: There are checksums in the ethernet FCS, checksums in IP headers, checksums in UDP headers (which are sometimes ignored), and checksums in TCP (which are not ignored). There might be an RPC layer checksum, too, not sure. None of which helped Amazon when their S3 service went down due to a flipped bit: More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. http://status.aws.amazon.com/s3-20080720.html ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Sep 30, 2008, at 19:09, Tim wrote: SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Pool Question
Hello, I'm looking for info on adding a disk to my current zfs pool. I am running OpenSolaris snv_98. I have upgraded my pool since my image-update. When I installed OpenSolaris it was a machine with 2 hard disks (regular IDE). Is it possible to add the second hard disk to the pool to increase my storage capacity without a raid controller? From what I've found, the command should be zpool add rpool device. Is that right? If so, how do I track down the device name? zpool status tells me my current device (hdd0) is named c3d0s0. Where do I find the other device name? Thanks! -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 7:15 PM, David Magda [EMAIL PROTECTED] wrote: On Sep 30, 2008, at 19:09, Tim wrote: SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). More disks will not solve SATA's problem. I run into this on a daily basis working on enterprise storage. If it's for just archive/storage, or even sequential streaming, it shouldn't be a big deal. If it's random workload, there's pretty much nothing you can do to get around it short of more front-end cache and intelligence which is simply a band-aid, not a fix. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). Hmm, that's actually cool ! If I configure the system with 10 x 400G 10k rpm disk == cost == 13k$ 10 x 1TB SATA 7200 == cost == 9k$ Always assuming 2 spare disks, and Using the sata disks, I would configure them in raid1 mirror (raid6 for the 400G), Besides being cheaper, I would get more useable space (4TB vs 2.4TB), Better performance of raid1 (right?), and better data reliability ?? (don't really know about that one) ? Is this a recommended setup ? It looks too good to be true ? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Bob Friesenhahn wrote: On Wed, 1 Oct 2008, Ahmed Kamal wrote: So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps. Less than we'd sometimes like :-) The TCP checksum isn't very strong, and we've seen corruption tied to a broken router, where the Ethernet checksum was recomputed on bad data, and the TCP checksum didn't help. It sucked. Rob T ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Fwd: Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 7:30 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). Hmm, that's actually cool ! If I configure the system with 10 x 400G 10k rpm disk == cost == 13k$ 10 x 1TB SATA 7200 == cost == 9k$ Always assuming 2 spare disks, and Using the sata disks, I would configure them in raid1 mirror (raid6 for the 400G), Besides being cheaper, I would get more useable space (4TB vs 2.4TB), Better performance of raid1 (right?), and better data reliability ?? (don't really know about that one) ? Is this a recommended setup ? It looks too good to be true ? I *HIGHLY* doubt you'll see better performance out of the SATA, but it is possible. You don't need 2 spares with SAS, 1 is more than enough with that few disks. I'd suggest doing RAID-Z (raid-5) as well if you've only got 9 data disks. 8+1 is more than acceptable with SAS drives. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Miles Nordin wrote: There are checksums in the ethernet FCS, checksums in IP headers, checksums in UDP headers (which are sometimes ignored), and checksums in TCP (which are not ignored). There might be an RPC layer checksum, too, not sure. Different arguments can be made against each, I suppose, but did you have a particular argument in mind? Have you experienced corruption with NFS that you can blame on the network, not the CPU/memory/busses of the server and client? Absolutely. See my recent post in this thread. The TCP checksum is not that strong, and a router broken the right way can regenerate a correct-looking Ethernet checksum on bad data. krb5i fixed it nicely. Rob T ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
On Wed, 1 Oct 2008, Nathan Kroenert wrote: That being said, there is a large delta in your results and mine... If I get a chance, I'll look into it... I suspect it's a cached versus I/O issue... The first time I posted was the first time the directory has been read in well over a month so it was not currently cached. You might find this to be interesting since it shows that the 'rt' options are taking most of the time:

% ptime ls -1 | wc -l
real        5.497
user        4.825
sys         0.654
 100

I will certainly agree that huge directories can cause problems for many applications, particularly ones that access the files over a network. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Tim wrote: On Tue, Sep 30, 2008 at 7:15 PM, David Magda [EMAIL PROTECTED] wrote: On Sep 30, 2008, at 19:09, Tim wrote: SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). More disks will not solve SATA's problem. I run into this on a daily basis working on enterprise storage. If it's for just archive/storage, or even sequential streaming, it shouldn't be a big deal. If it's random workload, there's pretty much nothing you can do to get around it short of more front-end cache and intelligence which is simply a band-aid, not a fix. I observe that there are no disk vendors supplying SATA disks with speed > 7,200 rpm. It is no wonder that a 10k rpm disk outperforms a 7,200 rpm disk for random workloads. I'll attribute this to intentional market segmentation by the industry rather than a deficiency in the transfer protocol (SATA). -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
I observe that there are no disk vendors supplying SATA disks with speed > 7,200 rpm. It is no wonder that a 10k rpm disk outperforms a 7,200 rpm disk for random workloads. I'll attribute this to intentional market segmentation by the industry rather than a deficiency in the transfer protocol (SATA). I don't really need more performance than what's needed to saturate a gig link (4 sata disks?) So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: I observe that there are no disk vendors supplying SATA disks with speed > 7,200 rpm. It is no wonder that a 10k rpm disk outperforms a 7,200 rpm disk for random workloads. I'll attribute this to intentional market segmentation by the industry rather than a deficiency in the transfer protocol (SATA). I don't really need more performance than what's needed to saturate a gig link (4 sata disks?) It depends on the disk. A Seagate Barracuda 500 GByte SATA disk is rated at a media speed of 105 MBytes/s which is near the limit of a GbE link. In theory, one disk would be close, two should do it. So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. You're the expert, but isn't the 100M/s for streaming not random read/write. For that, I suppose the disk drops to around 25M/s which is why I was mentioning 4 sata disks. When I was asking for comparing the 2 raids, It's was aside from performance, basically sata is obviously cheaper, it will saturate the gig link, so performance yes too, so the question becomes which has better data protection ( 8 sata raid1 or 8 sas raidz2) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, Ahmed Kamal wrote: Always assuming 2 spare disks, and Using the sata disks, I would configure them in raid1 mirror (raid6 for the 400G), Besides being cheaper, I would get more useable space (4TB vs 2.4TB), Better performance of raid1 (right?), and better data reliability ?? (don't really know about that one) ? Is this a recommended setup ? It looks too good to be true ? Using mirrors will surely make up quite a lot for disks with slow seek times. Reliability is acceptable for most purposes. Resilver should be pretty fast. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 8:13 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. You're the expert, but isn't the 100M/s for streaming not random read/write. For that, I suppose the disk drops to around 25M/s which is why I was mentioning 4 sata disks. When I was asking for comparing the 2 raids, It's was aside from performance, basically sata is obviously cheaper, it will saturate the gig link, so performance yes too, so the question becomes which has better data protection ( 8 sata raid1 or 8 sas raidz2) SAS's main benefits are seek time and max IOPS. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, 30 Sep 2008, Robert Thurlow wrote: Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps. Less than we'd sometimes like :-) The TCP checksum isn't very strong, and we've seen corruption tied to a broken router, where the Ethernet checksum was recomputed on bad data, and the TCP checksum didn't help. It sucked. TCP does not see the router. The TCP and ethernet checksums are at completely different levels. Routers do not pass ethernet packets. They pass IP packets. Your statement does not make technical sense. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Hm, richard's excellent Graphs here http://blogs.sun.com/relling/tags/mttdl as well as his words say he prefers mirroring over raidz/raidz2 almost always. It's better for performance and MTTDL. Since 8 sata raid1 is cheaper and probably more reliable than 8 raidz2 sas (and I dont need extra sas performance), and offers better performance and MTTDL than 8 sata raidz2, I guess I will go with 8-sata-raid1 then! Hope I'm not horribly mistaken :) On Wed, Oct 1, 2008 at 3:18 AM, Tim [EMAIL PROTECTED] wrote: On Tue, Sep 30, 2008 at 8:13 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. You're the expert, but isn't the 100M/s for streaming not random read/write. For that, I suppose the disk drops to around 25M/s which is why I was mentioning 4 sata disks. When I was asking for comparing the 2 raids, It's was aside from performance, basically sata is obviously cheaper, it will saturate the gig link, so performance yes too, so the question becomes which has better data protection ( 8 sata raid1 or 8 sas raidz2) SAS's main benefits are seek time and max IOPS. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
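For what it's worth, the two layouts being weighed here would be created along these lines (device names are invented; eight data disks plus one spare):

# striped mirrors ("8 sata raid1"):
zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
                  mirror c1t4d0 c1t5d0 mirror c1t6d0 c1t7d0
zpool add tank spare c1t8d0

# the raidz2 alternative, for comparison:
# zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0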
Re: [zfs-discuss] Quantifying ZFS reliability
On 30-Sep-08, at 6:31 PM, Tim wrote: On Tue, Sep 30, 2008 at 5:19 PM, Erik Trimble [EMAIL PROTECTED] wrote: To make Will's argument more succinct (wink), with a NetApp, undetectable (by the NetApp) errors can be introduced at the HBA and transport layer (FC Switch, slightly damage cable) levels. ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability. Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. ... NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit. This is not end to end protection; they are merely saying the data arrived in the storage subsystem's memory verifiably intact. The data still has a long way to go before it reaches the application. --Toby ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
rt == Robert Thurlow [EMAIL PROTECTED] writes: dm == David Magda [EMAIL PROTECTED] writes: dm None of which helped Amazon when their S3 service went down due dm to a flipped bit: ok, I get that S3 went down due to corruption, and that the network checksums I mentioned failed to prevent the corruption. The missing piece is: belief that the corruption occurred on the network rather than somewhere else. Their post-mortem sounds to me as though a bit flipped inside the memory of one server could be spread via this ``gossip'' protocol to infect the entire cluster. The replication and spreadability of the data makes their cluster into a many-terabyte gamma ray detector. I wonder if they even use a meaningful VPN. Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps. Yeah fine, but IP and UDP and Ethernet also have checksums. The one in TCP isn't much fancier. rt The TCP checksum isn't very strong, and we've seen corruption rt tied to a broken router, where the Ethernet checksum was rt recomputed on bad data, and the TCP checksum didn't help. It rt sucked. That's more like what I was looking for. The other concept from your first post of ``protection domains'' is interesting, too (of one domain including ZFS and NFS). Of course, what do you do when you get an error on an NFS client, throw ``stale NFS file handle?'' Even speaking hypothetically, it depends on good exception handling for its value, which has been a big trouble spot for ZFS so far. This ``protection domain'' concept is already enshrined in IEEE 802.1d---bridges are not supposed to recalculate the FCS, and if they need to mangle the packet they're supposed to update the FCS algorithmically based on fancy math and only the bits they changed, not just recalculate it over the whole packet. They state this is to protect against bad RAM inside the bridge. I don't know if anyone DOES that, but it's written into the spec. But if the network is L3, then FCS and IP checksums (ttl decrement) will have to be recalculated, so the ``protection domain'' is partly split leaving only the UDP/TCP checksum contiguous. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 8:50 PM, Toby Thain [EMAIL PROTECTED]wrote: * NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit.* This is not end to end protection; they are merely saying the data arrived in the storage subsystem's memory verifiably intact. The data still has a long way to go before it reaches the application. --Toby As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. You're the expert, but isn't the 100M/s for streaming not random read/write. For that, I suppose the disk drops to around 25M/s which is why I was mentioning 4 sata disks. When I was asking for comparing the 2 raids, it was aside from performance, basically sata is obviously cheaper, it will saturate the gig link, so performance yes too, so the question becomes which has better data protection ( 8 sata raid1 or 8 sas raidz2) Good question. Since you are talking about different disks, the vendor specs are different. The 500 GByte Seagate Barracuda 7200.11 I described above is rated with an MTBF of 750,000 hours, even though it comes in either a SATA or SAS interface -- but that isn't so interesting. A 450 GByte Seagate Cheetah 15k.6 (SAS) has a rated MTBF of 1.6M hours. Putting that into RAIDoptimizer we see:

Disk        RAID   MTTDL[1] (yrs)    MTTDL[2] (yrs)
Barracuda   1+0    284,966           5,351
            z2     180,663,117       6,784,904
Cheetah     1+0    1,316,385         126,839
            z2     1,807,134,968     348,249,968

For ZFS, 50% space used, logistical MTTR=24 hours, mirror resync time = 60 GBytes/hr. In general, (2-way) mirrors are single parity, raidz2 is double parity. If you use a triple mirror, then the numbers will be closer to the raidz2 numbers. For explanations of these models, see my blog, http://blogs.sun.com/relling/entry/a_story_of_two_mttdl -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
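To get a feel for where numbers of this shape come from, a back-of-the-envelope version of the single- and double-parity MTTDL models can be run with bc (the MTBF and MTTR inputs below are assumptions and this is not the full RAIDoptimizer model, so expect the right order of magnitude rather than matching digits):

bc -l <<'EOF'
m = 1200000              /* assumed drive MTBF, hours */
r = 24 + 500/60          /* logistical MTTR plus resync of ~500 GB at 60 GB/hr */
n = 8                    /* disks in the vdev */
/* single-parity flavor: MTTF^2 / (N*(N-1)*MTTR), converted to years */
m^2 / (n * (n-1) * r) / 8760
/* double-parity (raidz2) flavor: MTTF^3 / (N*(N-1)*(N-2)*MTTR^2), in years */
m^3 / (n * (n-1) * (n-2) * r^2) / 8760
EOF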
[zfs-discuss] ZFS, NFS and Auto Mounting
I am in the process of beefing up our development environment. In essence I am really going to simply replicate what we have spread across here and there (that's what happens when you keep running out of disk space). Unfortunately, I inherited all of this and the guy who dreamed up the conflagration is lonnnggg gone. So here's the way it works today. There are two top level directories called GroupWS and ReleaseWS. The auto mount map (auto.ws) looks like this:

Upgrades      chekov:/mnt/dsk1/GroupWS/
cstools       chekov:/mnt/dsk1/GroupWS/
com           chekov:/mnt/dsk1/GroupWS
Integration   chekov:/mnt/dsk1/GroupWS/

Everything is fine. Do a cd to /ws/Integration and you are taken to chekov:/mnt/dsk1/GroupWS/Integration. The directory Integration is a real directory that lives in GroupWS. If you cd to /ws/com, you are taken to chekov:/mnt/dsk1/GroupWS and then you can move about as one sees fit. To replicate this in ZFS, I did the following:

1) Parked all of the drives (except c1t0d0 and c1t1d0) into several RAIDZ configurations in a zpool called dpool.
2) Created a file system called dpool/GroupWS and set the mountpoint to /mnt/zfs1/GroupWS. The sharenfs properties were set to sharenfs=rw,log,root=msc-servers.
3) Next I created another file system called dpool/GroupWS/Integration. Its mount point was inherited from GroupWS and is /mnt/zfs1/GroupWS/Integration. Essentially I only allowed the new file system to inherit from its parent.
4) I changed the auto.ws map thusly:

Integration   chekov:/mnt/zfs1/GroupWS/
Upgrades      chekov:/mnt/zfs1/GroupWS/
cstools       chekov:/mnt/zfs1/GroupWS/
com           chekov:/mnt/zfs1/GroupWS

Now the odd behavior. You will notice that the directories Upgrades and cstools are just that. Directories in GroupWS. You can cd /ws/cstools from any server without a problem. Perform an ls and you see what you expect to see. Now the rub. If on chekov, one does a cd /ws/Integration you end up in chekov:/mnt/zfs1/GroupWS/Integration and everything is great. Do a cd to /ws/com and everything is fine. You can do a cd to Integration and everything is fine. But. If you go to another server and do a cd /ws/Integration all is well. However, if you do a cd to /ws/com and then a cd Integration, Integration is EMPTY!! I know this was long winded but it is a strange problem. The workaround is to destroy the dpool/GroupWS/Integration file system and recreate it as a regular directory in GroupWS. But I was hoping to be able to use file systems in this way for snapshot ease. Any ideas? Thanks, Doug -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
On Tue, Sep 30, 2008 at 6:30 PM, Nathan Kroenert [EMAIL PROTECTED] wrote: Actually, the one that'll hurt most is ironically the most closely related to bad database schema design... With a zillion files in the one directory, if someone does an 'ls' in that directory, it'll not only take ages, but steal a whole heap of memory and compute power... Provided the only things that'll be doing *anything* in that directory are using indexed methods, there is no real problem from a ZFS perspective, but if something decides to list (or worse, list and sort) that directory, it won't be that pleasant. Oh - That's of course assuming you have sufficient memory in the system to cache all that metadata somewhere... If you don't then that's another zillion I/O's you need to deal with each time you list the entire directory. an ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to ^^^ Here's your problem! in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.) My box has the ARC limited to about 1GB, so it's obviously undersized for such a workload, but still gives you an indication... I generally look to keep directories to a size that allows the utilities that work on and in it to perform at a reasonable rate... which for the most part is around the 100K files or less... Perhaps you are using larger hardware than I am for some of this stuff? :)

I've seen this problem where Solaris has issues with many files created with this type of file naming pattern. For example, the file naming pattern produced by tmpfile(3C). I saw it originally on a tmpfs and it can be easily reproduced by: [note: I'm writing this from memory - so don't beat me up over specific details]

1) pick a number for the number of files you want to test with (try different numbers - start with 1,500 and then increase it). Call this test#
2) cd /tmp
3) IMPORTANT: Make a test directory for this experiment - let's call it temp
4) cd /tmp/temp (your playground)
5) using your favorite language generate your test# of files using a pattern similar to the one above by (ultimately) calling tmpfile()
6) ptime ls -al; - it will be quick the first time
7) ptime rm * ; - it will be quick the first time
8) repeat steps 5, 6 and 7. Your ptimes will be a little slower
9) repeat steps 5, 6 and 7. Your ptimes will be much slower
10) repeat steps 5, 6 and 7. Your ptimes will be *really* slow. Now you'll understand that you have a problem.
11) repeat 5, 6 and 7 a couple more times. Notice how bad your ptimes are now!
12) look at the size of /tmp/temp using ls -ald /tmp/temp and you'll notice that it has grown substantially.

The larger this directory grows, the slower the filesystem operations will get. This behavior is common to tmpfs, UFS and I tested it on early ZFS releases. I have no idea why - I have not made the time to figure it out. What I have observed is that all operations on your (victim) test directory will max out (100% utilization) one CPU or one CPU core - and all directory operations become single-threaded and limited by the performance of one CPU (or core). Now for the weird part: the *only* way to return everything to normal performance levels (that I've found) is to rmdir the (victim) directory. This is why I recommend you perform this experiment in a subdirectory. If you do it in /tmp - you'll have to reboot the box to get reasonable performance back - and you don't want to do it in your home directory either!!
I'll try to set aside some time tomorrow to re-run this experiment. But I'm nearly sure this is why your directory related file ops are so slow and *dramatically* slower than they should be. This problem/bug is insideous - because using tmpfile() in /tmp is a very common practice and the application(s) using /tmp will slow down dramatically while maxing out (100% utilization) one CPU (or core). And if your system only has a single CPU... :( Let me know what you find out. I know that the file name pattern is what causes this bug to bite bigtime - and not so much the number of files you use to test it. I *suspect* that there might be something like a hash table that is degenerating into a singly linked list as the root cause of this issue. But this is only my WAG. Regards, -- Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED] Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
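A rough transcription of that recipe as a script, for anyone who wants to reproduce it (the file count and names are arbitrary, and a loop of touch calls only approximates tmpfile(3C)-style naming, which is what Al suspects is the trigger):

#!/bin/sh
# Create and remove batches of files in a scratch directory, timing ls and rm
# on each pass and watching the directory itself grow.
mkdir -p /tmp/temp && cd /tmp/temp || exit 1
pass=1
while [ $pass -le 5 ]; do
    i=0
    while [ $i -lt 1500 ]; do
        touch file$$.$pass.$i          # crude stand-in for tmpfile()-generated names
        i=`expr $i + 1`
    done
    echo "pass $pass:"
    ptime ls -al > /dev/null
    ptime rm -f ./file*
    ls -ald /tmp/temp                  # the directory's own size keeps growing across passes
    pass=`expr $pass + 1`
done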
Re: [zfs-discuss] Quantifying ZFS reliability
On 30-Sep-08, at 9:54 PM, Tim wrote: On Tue, Sep 30, 2008 at 8:50 PM, Toby Thain [EMAIL PROTECTED] wrote: NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit. This is not end to end protection; they are merely saying the data arrived in the storage subsystem's memory verifiably intact. The data still has a long way to go before it reaches the application. --Toby As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. ZFS allows the architectural option of separate storage without losing end to end protection, so the distinction is still important. Of course this means ZFS itself runs on the application server, but so what? --Toby That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
Bob Friesenhahn wrote:

On Wed, 1 Oct 2008, Nathan Kroenert wrote: [...] zillion I/Os you need to deal with each time you list the entire directory. An ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.)

A million files in ZFS is no big deal:

But how similar were your file names?

Ian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pool Question
Josh Hardman wrote:

Hello, I'm looking for info on adding a disk to my current zfs pool. I am running OpenSolaris snv_98, and I have upgraded my pool since my image-update. When I installed OpenSolaris it was a machine with 2 hard disks (regular IDE). Is it possible to add the second hard disk to the pool to increase my storage capacity without a raid controller? From what I've found, the command should be zpool add rpool device. Is that right? If so, how do I track down the device name? zpool status tells me my current device (hdd0) is named c3d0s0. Where do I find the other device name?

Do not try zpool add on your rpool! IIRC, it will not be allowed, but if it were, your system would be unbootable and recovery would be difficult... very uncool. A better idea is to create a new storage pool. Alas, it seems that OpenSolaris 2008.05 does not include the ZFS BUI, so you might need to descend to the command line. format is the command to set up your disk slices (and a gateway to managing partitions). Once you set up a slice, a simple zpool create will do the trick. Many more details are available in the ZFS Administration Guide http://www.opensolaris.org/os/community/zfs/docs/zfsadmin.pdf -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
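A minimal sketch of the create-a-second-pool route; Richard suggests slicing with format first, but for a non-root pool you can also hand ZFS the whole disk as shown here. The device name c4d0 and the pool name datapool are hypothetical - check what format actually reports before running anything destructive:

  echo | format                  # lists the disks the system sees, e.g. c3d0 (rpool) and c4d0
  zpool create datapool c4d0     # whole-disk data pool on the second drive (erases its contents)
  zpool status datapool          # confirm the new pool is ONLINE
  zfs create datapool/data       # optional child filesystem, mounted at /datapool/data by default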
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 08:54:50PM -0500, Tim wrote: As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of.

First off, there's an issue of design. Wherever possible, end-to-end protection is better (and easier to implement and deploy) than hop-by-hop protection. Hop-by-hop protection implies a lot of trust. Yes, in a NAS you're going to have at least one hop: from the client to the server. But how does the necessity of one hop mean that N hops is fine? One hop is manageable. N hops is a disaster waiting to happen.

Second, NAS is not the only way to access remote storage. There's also SAN (e.g., iSCSI). So you might host a DB on a ZFS pool backed by iSCSI targets. If you do that with a random iSCSI target implementation then you get end-to-end integrity protection regardless of what else the vendor does for you in terms of hop-by-hop integrity protection. And you can even host the target on a ZFS pool, in which case there are two layers of integrity protection - and so some waste of disk space - but you get the benefit of very flexible volume management on both the initiator and the target.

Third, who's to say that end-to-end integrity protection can't possibly be had in a NAS environment? Sure, with today's protocols you can't have it - you can get hop-by-hop protection with at least one hop (see above) - but having end-to-end integrity protection built into the filesystem may enable new NAS protocols that do provide end-to-end protection. (This is a variant of the first point above: good design decisions pay off.)

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
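A minimal sketch of the ZFS-on-iSCSI layering Nico describes, in circa-2008 syntax; the addresses, pool and volume names, size, and the device name the initiator ends up seeing are all hypothetical, so treat this as an outline rather than a recipe:

  # --- storage host: export a ZFS-backed iSCSI LUN (legacy shareiscsi property)
  zfs create -V 200g tank/lun0
  zfs set shareiscsi=on tank/lun0

  # --- application host: discover the LUN and build a pool on top of it
  iscsiadm add discovery-address 192.168.10.5:3260
  iscsiadm modify discovery --sendtargets enable
  devfsadm -i iscsi                            # create device nodes for the new LUN(s)
  echo | format                                # note the new cXtYdZ disk name
  zpool create appdata c2t600144F0XXXXXXXXd0   # hypothetical device; the initiator's ZFS now checksums end to end

With this layering, the pool on the application host checksums every block it reads, so corruption anywhere along the iSCSI path is detected regardless of what the target does internally.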
Re: [zfs-discuss] ZSF Solaris
On Tue, Sep 30, 2008 at 09:44:21PM -0500, Al Hopper wrote: This behavior is common to tmpfs, UFS and I tested it on early ZFS releases. I have no idea why - I have not made the time to figure it out. What I have observed is that all operations on your (victim) test directory will max out (100% utilization) one CPU or one CPU core - and all directory operations become single-threaded and limited by the performance of one CPU (or core).

And sometimes it's just a little bug. E.g. with a recent version of Solaris (i.e. >= snv_95 || >= S10U5) on UFS:

SunOS graf 5.10 Generic_137112-07 i86pc i386 i86pc (X4600, S10U5)
=
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 9.78s 0:29.42 33.4%
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 293.37s 5:13.67 93.5%
admin.graf /var/tmp > rm xx
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 9.92s 0:31.75 31.4%
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 305.15s 5:28.67 92.8%
admin.graf /var/tmp > time dd if=/dev/zero of=xx bs=1k count=2048
2048+0 records in
2048+0 records out
0.00u 298.40s 4:58.46 99.9%
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 394.06s 6:52.79 95.4%

SunOS kaiser 5.10 Generic_137111-07 sun4u sparc SUNW,Sun-Fire-V440 (S10, U5)
=
admin.kaiser /var/tmp > time mkfile 1g xx
0.14u 5.24s 0:26.72 20.1%
admin.kaiser /var/tmp > time mkfile 1g xx
0.13u 64.23s 1:25.67 75.1%
admin.kaiser /var/tmp > time mkfile 1g xx
0.13u 68.36s 1:30.12 75.9%
admin.kaiser /var/tmp > rm xx
admin.kaiser /var/tmp > time mkfile 1g xx
0.14u 5.79s 0:29.93 19.8%
admin.kaiser /var/tmp > time mkfile 1g xx
0.13u 66.37s 1:28.06 75.5%

SunOS q 5.11 snv_98 i86pc i386 i86pc (U40, S11b98)
=
elkner.q /var/tmp > time mkfile 2g xx
0.05u 3.63s 0:42.91 8.5%
elkner.q /var/tmp > time mkfile 2g xx
0.04u 315.15s 5:54.12 89.0%

SunOS dax 5.11 snv_79a i86pc i386 i86pc (U40, S11b79)
=
elkner.dax /var/tmp > time mkfile 2g xx
0.05u 3.09s 0:43.09 7.2%
elkner.dax /var/tmp > time mkfile 2g xx
0.05u 4.95s 0:43.62 11.4%

Regards, jel. -- Otto-von-Guericke University http://www.cs.uni-magdeburg.de/ Department of Computer Science Geb. 29 R 027, Universitaetsplatz 2 39106 Magdeburg, Germany Tel: +49 391 67 12768 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Tim wrote: As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of.

I think you'd be surprised how large an organisation can migrate most, if not all, of its application servers to zones on one or two Thumpers. Isn't that the reason for buying in server appliances?

Ian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 10:44 PM, Toby Thain [EMAIL PROTECTED] wrote: ZFS allows the architectural option of separate storage without losing end-to-end protection, so the distinction is still important. Of course this means ZFS itself runs on the application server, but so what? --Toby

The "so what" would be that the application has to run on Solaris - and requires a LUN to function. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 12:24 AM, Ian Collins [EMAIL PROTECTED] wrote: Tim wrote: As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of.

I think you'd be surprised how large an organisation can migrate most, if not all, of its application servers to zones on one or two Thumpers. Isn't that the reason for buying in server appliances? Ian

I think you'd be surprised how quickly they'd be fired for putting that much risk into their enterprise. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 11:58 PM, Nicolas Williams [EMAIL PROTECTED] wrote: On Tue, Sep 30, 2008 at 08:54:50PM -0500, Tim wrote: As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of.

First off, there's an issue of design. Wherever possible, end-to-end protection is better (and easier to implement and deploy) than hop-by-hop protection. Hop-by-hop protection implies a lot of trust. Yes, in a NAS you're going to have at least one hop: from the client to the server. But how does the necessity of one hop mean that N hops is fine? One hop is manageable. N hops is a disaster waiting to happen.

Who's talking about N hops? WAFL gives you the exact same number of hops as ZFS.

Second, NAS is not the only way to access remote storage. There's also SAN (e.g., iSCSI). So you might host a DB on a ZFS pool backed by iSCSI targets. If you do that with a random iSCSI target implementation then you get end-to-end integrity protection regardless of what else the vendor does for you in terms of hop-by-hop integrity protection. And you can even host the target on a ZFS pool, in which case there are two layers of integrity protection - and so some waste of disk space - but you get the benefit of very flexible volume management on both the initiator and the target.

I don't recall saying it was. The original poster is talking about a FILESERVER, not iSCSI targets. As off-topic as it is, the current iSCSI target is hardly fully baked or production ready.

Third, who's to say that end-to-end integrity protection can't possibly be had in a NAS environment? Sure, with today's protocols you can't have it - you can get hop-by-hop protection with at least one hop (see above) - but having end-to-end integrity protection built into the filesystem may enable new NAS protocols that do provide end-to-end protection. (This is a variant of the first point above: good design decisions pay off.)

Which would apply to WAFL as well as ZFS. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
Hi Guys, Thanks for so many good comments. Perhaps I got even more than what I asked for! I am targeting 1 million users for my application. My DB will be on a Solaris machine, and the reason I am making one table per user is that it is a simpler design than keeping all the data in a single table; in that case I would need to worry about things like horizontal partitioning, which in turn would require a higher level of management. So for storing 1 million MyISAM tables (MyISAM being a good performer when the data is not very large), I need to save 3 million data files in a single folder on disk - this is the way MyISAM saves data. I will never need to do an ls on this folder. This folder (~database) will be used just by the MySQL engine to execute my SQL queries and fetch me results. And now that ZFS allows me to do this easily, I believe I can go forward with this design. Correct me if I am missing something. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
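Given the naming-pattern slowdown discussed earlier in this thread, it may be worth running a quick synthetic test with MyISAM-style names on a scratch ZFS dataset before committing to one table per user; the path, the count, and the user<N>.frm/.MYD/.MYI names below are placeholders, not the poster's real schema:

  #!/bin/sh
  # Create COUNT fake per-user tables (3 files each) and time basic metadata ops.
  DIR=/tank/myisam-test            # a scratch dataset, not the live MySQL datadir
  COUNT=${1:-10000}                # scale toward the real table count gradually
  mkdir -p $DIR && cd $DIR || exit 1

  i=0
  while [ $i -lt $COUNT ]; do
      touch user$i.frm user$i.MYD user$i.MYI
      i=`expr $i + 1`
  done

  ptime ls -f > /dev/null              # unsorted full listing of the directory
  ptime ls -l user5000.* > /dev/null   # lookup of one user's three files by name

If the per-name lookups stay fast as the count grows, the "never ls this folder" assumption holds; if even name lookups degrade, the design needs a rethink (for example, hashing users into subdirectories).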