Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-06-05 Thread Bill Sommerfeld
On Thu, 2007-05-31 at 13:27 +0100, Darren J Moffat wrote:

  What errors and error rates have you seen?
 
 I have seen switches flip bits in NFS traffic such that the TCP checksum 
 still matched yet the data was corrupted.  One of the ways we saw this was 
 when files were being checked out of SCCS: the SCCS checksum failed. 
 Another way we saw it was the compiler failing to compile untouched code.

To be specific, we found that an Ethernet switch in one of our
development labs had a tendency to toggle a particular bit in packets
going through it.  The problem was originally suspected to be a data
corruption problem within Solaris itself and got a lot of attention as a
result.

In the cases I examined (corrupted source file after SCCS checkout)
there were complementary changes (0-1 and 1-0) in the same bit in
bytes which were 256, 512, or 1024 bytes apart in the source file.

Because of the mathematics of the 16-bit ones-complement checksum used
by TCP, the packet checksummed to the same value after the switch made
these two offsetting changes.  (I believe that the switch was either
inserting or removing a VLAN tag, so the Ethernet CRC had to be
recomputed by the switch.)
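
(To illustrate with made-up numbers rather than the actual packet
contents: the TCP checksum is the 16-bit ones-complement sum of the
segment taken as 16-bit words, and two bytes that are 256, 512, or 1024
bytes apart sit at the same byte position within their respective
words, so the affected bit has the same weight in both.  For example:

    word A:  0x4100 -> 0x4300   (bit 9 goes 0 -> 1, sum gains 0x0200)
    word B:  0x6300 -> 0x6100   (bit 9 goes 1 -> 0, sum loses 0x0200)

The two changes cancel exactly in the ones-complement sum, so the
corrupted segment still passes the TCP checksum.)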

Once we realized that this was going on we went back, looked at the
output of netstat -s, and noticed that the systems in this lab had been
dropping an abnormally high number of packets due to bad TCP checksums;
only a few of the broken packets were making it through, but there were
enough of them to disrupt things in the lab.

The problem went away when the suspect switch was taken out of service.

- Bill








Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-31 Thread David Anderson

Nathan,

Keep in mind that the iSCSI target is only available in OpenSolaris at this time.

On 05/30/2007 10:15 PM, Nathan Huisman wrote:

snip



= QUESTION #1

What is the best way to mirror two ZFS pools in order to achieve a sort
of HA storage system? I don't want to have to physically swap my disks
into another system if any of the hardware on the ZFS server dies. If I
have the following configuration, what is the best way to mirror these in
near real time?

BOX 1 (JBOD-ZFS) BOX 2 (JBOD-ZFS)

I've seen the zfs send and receive commands but I'm not sure how well
that would work with a close to real time mirror.


If you want this to be redundant (and very scalable) you will want at 
least two of BOX 1 and two of BOX 2, plus IPMP with redundant GbE 
switches and NICs.


Do not use zfs send/recv. Use Sun Cluster 3.2 for HA-ZFS.

http://docs.sun.com/app/docs/doc/820-0335/6nc35dge2?a=view

There is potential for data loss if the active ZFS node crashes before 
outstanding transaction groups commit for non-synchronous writes, but 
the ZVOL (and underlying ext3fs) should not become corrupt (hasn't 
happened to me yet). Can someone from the ZFS team comment on this?





= QUESTION #2

Can ZFS be exported via iSCSI and then imported as a disk to a Linux
system and then be formatted with another file system? I wish to use ZFS
as a block-level file system for my virtual machines, specifically
using Xen. If this is possible, how stable is this?


This is possible and is stable in my experience. Scales well if you 
design your infrastructure correctly.



How is error
checking handled if the ZFS volume is exported via iSCSI and then the block
device formatted to ext3? Will ZFS still be able to check for errors?


Yes, ZFS will detect/correct block-level errors in ZVOLs as long as you 
have a redundant zpool configuration (see the note below about LVM).
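
As a minimal sketch (pool, volume, and device names here are
hypothetical, and this assumes the OpenSolaris iSCSI target and its
shareiscsi property):

  # create a mirrored pool from two disks
  zpool create tank mirror c1t0d0 c2t0d0
  # carve out a 100GB volume for a guest and export it over iSCSI
  zfs create -V 100g tank/xenvol01
  zfs set shareiscsi=on tank/xenvol01

The Linux initiator then puts ext3 on the LUN as usual; any corruption
ZFS detects in the ZVOL's blocks is repaired from the other half of the
mirror.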



If this is possible and this all works, then are there ways to expand a
ZFS iSCSI exported volume and then expand the ext3 file system on the
remote host?



Haven't tested it myself (yet), but should be possible. You might have 
to export and re-import the iSCSI target on the Xen dom0 and then resize 
the ext3 partition (e.g. using 'parted'). If that doesn't work there are 
other ways to accomplish this.
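
Roughly like this, though I have not verified the resize path end to
end (names are hypothetical, and the rescan step depends on your
initiator):

  # on the ZFS node: grow the volume
  zfs set volsize=200g tank/xenvol01
  # on the Xen dom0: make the initiator see the new size (e.g. log the
  # session out and back in), then grow ext3 on the unmounted device
  e2fsck -f /dev/sdb && resize2fs /dev/sdb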



= QUESTION #3

How does zfs handle a bad drive? What process must I go through in
order to take out a bad drive and replace it with a good one?


If you have a redundant zpool configuration you will replace the failed 
disk and then issue a 'zpool replace'.
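
For example (hypothetical pool and device names):

  # see which device is faulted
  zpool status tank
  # swap the physical disk, then tell ZFS to resilver onto the new one
  zpool replace tank c1t3d0 c1t4d0
  # or just 'zpool replace tank c1t3d0' if the new disk is in the same slot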




= QUESTION #4

What is a good way to back up this HA storage unit? Snapshots will
provide an easy way to do it live, but should it be dumped into a tape
library, or a third offsite ZFS pool using zfs send/receive, or ?


Send snapshots to another server that has a RAIDZ (or RAIDZ2) zpool 
(you want space over performance/redundancy for backup, the opposite of 
the *MIRRORS* you will want to use for the HA-ZFS cluster and storage 
nodes). From this node you can dump to tape, etc.
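
A sketch of that (snapshot, pool, and host names are invented):

  # on the HA-ZFS node: snapshot and ship to the backup box
  zfs snapshot tank/xenvol01@backup-20070601
  zfs send tank/xenvol01@backup-20070601 | ssh backup1 zfs receive bpool/xenvol01
  # later runs can use 'zfs send -i older-snap newer-snap' for incrementals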




= QUESTION #5

Does the following setup work?

BOX 1 (JBOD) - iscsi export - BOX 2 ZFS.

In other words, can I set up a bunch of thin storage boxes with low CPU
and RAM instead of using SAS or FC to supply the JBOD to the ZFS server?


Yes. And ZFS+iSCSI makes this relatively cheap. I very strongly 
recommend against using LVM to handle the mirroring. *You will lose the 
ability to correct data corruption* at the ZFS level. It also does not 
scale well, increases complexity, increases cost, and reduces throughput 
over iSCSI to your ZFS nodes. Leave volume management and redundancy to ZFS.


Set up your Xen dom0 boxes to have a redundant path to your ZVOLs over 
iSCSI. Send your data _one time_ to your ZFS nodes. Let ZFS handle the 
mirroring and then send that to your iSCSI LUNs on the storage nodes. 
Make sure you set up half of each mirror in the zpool with a disk from a 
separate storage node.
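
Something like this on the ZFS node, where each device is an iSCSI LUN
and the names are made up:

  # each mirror pairs one LUN from storage node A with one from node B
  zpool create tank mirror c2t1d0 c3t1d0 mirror c2t2d0 c3t2d0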


Be wary of layering ZFS/ZVOLs like this. There are multiple ways to set 
up your storage nodes (plain iscsitadm or using ZVOLs), and if you use 
ZVOLs you may want to disable checksums there and leave that to your ZFS nodes.


Other:
 -Others have reported that Sil3124 based SATA expansion cards work 
well with Solaris.
 -Test your failover times between ZFS nodes (BOX 2s). Having lots of 
iscsi shares/filesystems can cause this to be slow. Hopefully this will 
be improved with parallel zpool device mounting in the future.
 -ZVOLs are not sparse by default. I prefer this, but if you really 
want to use sparse ZVOLs there is a switch for it in 'zfs create' (see 
the sketch after this list).

 -This will work, but TEST, TEST, TEST for your particular scenario.
 -Yes, this can be built for less than $30k US for your storage size 
requirement.
 -I get ~150MB/s throughput on this setup with 2 storage nodes of 6 
disks each. Appears as ~3TB mirror on ZFS nodes.
 -Use Build 64 or later, as there is a ZVOL bug in b63 if I'm not 
mistaken. Probably a good idea to read through the open ZFS bugs, too.
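
Sketch of the sparse-ZVOL switch mentioned above (volume names are made
up):

  # the default: a ZVOL whose full size is reserved up front
  zfs create -V 100g tank/vol-thick
  # -s creates a sparse ZVOL with no reservation
  zfs create -s -V 100g tank/vol-sparse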

 

Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-31 Thread Al Hopper

On Thu, 31 May 2007, Darren J Moffat wrote:

Since you are doing iSCSI and may not be running ZFS on the initiator 
(client) then I highly recommend that you run with IPsec using at least AH 
(or ESP with Authentication) to protect the transport.  Don't assume that 
your network is reliable.  ZFS won't help you here if it isn't running on the 
iSCSI initiator, and even if it is it would need two targets to be able to 
repair.


[Hi Darren]

That's a curious recommendation!  You don't think that TCP/IP is 
reliable enough to provide iSCSI data integrity?

What errors and error rates have you seen?


Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/


Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-31 Thread Darren J Moffat

Al Hopper wrote:

On Thu, 31 May 2007, Darren J Moffat wrote:

Since you are doing iSCSI and may not be running ZFS on the initiator 
(client) then I highly recommend that you run with IPsec using at 
least AH (or ESP with Authentication) to protect the transport.  Don't 
assume that your network is reliable.  ZFS won't help you here if it 
isn't running on the


[Hi Darren]

That's a curious recommendation!  You don't think that TCP/IP is reliable 
enough to provide iSCSI data integrity?


No, I don't.  Also, I don't personally think that the access control model 
of iSCSI is sufficient, and I trust IPsec more in that respect.


Personally I would actually like to see IPsec AH be the default for 
all traffic that isn't otherwise doing a cryptographically strong 
integrity check of its own.



What errors and error rates have you seen?


I have seen switches flip bits in NFS traffic such that the TCP checksum 
still matched yet the data was corrupted.  One of the ways we saw this was 
when files were being checked out of SCCS: the SCCS checksum failed. 
Another way we saw it was the compiler failing to compile untouched code.


It's just like how with ZFS we don't trust the HBA and the disks to give 
us correct data.  With iSCSI the network is your HBA and cabling, and in 
part your disk controller as well.  Defence in depth is a common mantra 
in the security geek world; I take that forward to protecting the data 
in transit too, even when it isn't purely for security reasons.


--
Darren J Moffat


Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-31 Thread Matty

On 5/31/07, Darren J Moffat [EMAIL PROTECTED] wrote:

Since you are doing iSCSI and may not be running ZFS on the initiator
(client) then I highly recommend that you run with IPsec using at least
AH (or ESP with Authentication) to protect the transport.  Don't assume
that your network is reliable.  ZFS won't help you here if it isn't
running on the iSCSI initiator, and even if it is it would need two
targets to be able to repair.


If you don't intend to encrypt the iSCSI headers / payloads, why not
just use the header and data digests that are part of the iSCSI
protocol?
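
For example, on a Linux open-iscsi initiator the digests can be turned
on in /etc/iscsi/iscsid.conf (parameter names may differ on other
initiators, so treat this as a sketch):

  # CRC32C digests over iSCSI PDU headers and data
  node.conn[0].iscsi.HeaderDigest = CRC32C
  node.conn[0].iscsi.DataDigest = CRC32C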

Thanks,
- Ryan
--
UNIX Administrator
http://prefetch.net


Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-31 Thread Michael Li

Al Hopper wrote:


On Thu, 31 May 2007, David Anderson wrote:

 snip .


Other:
-Others have reported that Sil3124 based SATA expansion cards work 
well with Solaris.



[Sorry - don't mean to hijack this interesting thread]

I believe that there is a serious bug with the si3124 driver that has 
not been addressed. Ben Rockwood and I have seen it firsthand, and a 
quick look at the Hg logs shows that si3124.c has not been changed in 
6 months.


Basic description of the bug: under heavy load (lots of I/O ops/sec), 
all data transfer from the drive(s) will completely stop for an extended 
period of time - 60 to 90+ seconds.


There was a recent discussion of the same issue on the Solaris on x86 
list ([EMAIL PROTECTED]) - several experienced x86ers have 
seen this bug and found the current driver unusable. Interestingly, 
one individual said (paraphrased) "... don't see any issues" and then 
later "... now I see it and it was there the entire time."


Recommendation: If you plan to use the 3124 driver, test it yourself 
under heavy load. A simple test with one disk drive will suffice.


In my case, it was plainly obvious with one (ex Sun M20) drive and a 
UFS filesystem - all I was doing was tarring up /export/home to 
another drive. Periodically the tar process would simply stop (iostat 
went flatline) - it looked like the system was going to crash - then 
(after 60+ Secs) the tar process continued as if nothing had happened. 
This was repeated 4 or 5 times before the 'tar cvf' (of around 40Mb of 
data) completed successfully.


Regards,

Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/


Does the si3124 bug Hopper mentioned have something to do with the 
error below? I hit it in the warlock build step of my workspace, but I 
did not change the si3124 code...


warlock -c ../../common/io/warlock/si3124.wlcmd si3124.ll \
../sd/sd.ll ../sd/sd_xbuf.ll \
-l ../scsi/scsi_capabilities.ll -l ../scsi/scsi_control.ll -l 
../scsi/scsi_watch.ll -l ../scsi/scsi_data.ll -l 
../scsi/scsi_resource.ll -l ../scsi/scsi_subr.ll -l ../scsi/scsi_hba.ll 
-l ../scsi/scsi_transport.ll -l ../scsi/scsi_confsubr.ll -l 
../scsi/scsi_reset_notify.ll \
-l ../cmlb/cmlb.ll \
-l ../sata/sata.ll \
-l ../warlock/ddi_dki_impl.ll

The following variables don't seem to be protected consistently:

dev_info::devi_state

*** Error code 10
make: Fatal error: Command failed for target `si3124.ok'
Current working directory 
/net/greatwall/workspaces/wifi_rtw/usr/src/uts/intel/si3124

*** Error code 1
The following command caused the error:
cd ../si3124; make clean; make warlock
make: Fatal error: Command failed for target `warlock.sata'
Current working directory 
/net/greatwall/workspaces/wifi_rtw/usr/src/uts/intel/warlock


-
Michael



[zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-30 Thread Nathan Huisman

= PROBLEM

To create a disk storage system that will act as an archive point for
user data (Non-recoverable data), and also act as a back end storage
unit for virtual machines at a block level.

= BUDGET

Currently I have about 25-30k to start the project, more could be
allocated in the next fiscal year for perhaps a backup solution.

= TIMEFRAME

I have 8 days to cut a P.O. before our fiscal year ends.

= STORAGE REQUIREMENTS

5-10TB of redundant, fairly high-speed storage


= QUESTION #1

What is the best way to mirror two ZFS pools in order to achieve a sort
of HA storage system? I don't want to have to physically swap my disks
into another system if any of the hardware on the ZFS server dies. If I
have the following configuration, what is the best way to mirror these in
near real time?

BOX 1 (JBOD-ZFS) BOX 2 (JBOD-ZFS)

I've seen the zfs send and receive commands but I'm not sure how well
that would work with a close to real time mirror.


= QUESTION #2

Can ZFS be exported via iSCSI and then imported as a disk to a Linux
system and then be formatted with another file system? I wish to use ZFS
as a block-level file system for my virtual machines, specifically
using Xen. If this is possible, how stable is this? How is error
checking handled if the ZFS volume is exported via iSCSI and then the block
device formatted to ext3? Will ZFS still be able to check for errors?
If this is possible and this all works, then are there ways to expand a
ZFS iSCSI exported volume and then expand the ext3 file system on the
remote host?

= QUESTION #3

How does zfs handle a bad drive? What process must I go through in
order to take out a bad drive and replace it with a good one?

= QUESTION #4

What is a good way to back up this HA storage unit? Snapshots will
provide an easy way to do it live, but should it be dumped into a tape
library, or a third offsite ZFS pool using zfs send/receive, or ?

= QUESTION #5

Does the following setup work?

BOX 1 (JBOD) - iscsi export - BOX 2 ZFS.

In other words, can I set up a bunch of thin storage boxes with low CPU
and RAM instead of using SAS or FC to supply the JBOD to the ZFS server?



I appreciate any advice or answers you might have.





Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-30 Thread Dale Ghent

On May 31, 2007, at 12:15 AM, Nathan Huisman wrote:


= PROBLEM

To create a disk storage system that will act as an archive point for
user data (Non-recoverable data), and also act as a back end storage
unit for virtual machines at a block level.


snip

Here are some tips from me. I notice you mention iSCSI a lot so I'll  
stick to that...


Q1: The best way to mirror in real time is to do it from the  
consumers of the storage, ie, your iSCSI clients. Implement two  
storage servers (say, two x4100s with attached disk) and put their  
disk into zpools. The two servers do not have to know about each  
other. Configure ZFS file systems identically on both and export them  
to the client that'll use it. Use the software mirroring feature on  
the client to mirror these iSCSI shares (eg: dynamic disks on  
Windows, LVM on Linux, SVM on Solaris).
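
As a rough sketch on a Linux client, using mdadm software RAID in place
of LVM (here /dev/sdb and /dev/sdc stand for the two imported iSCSI
LUNs, one from each ZFS/iSCSI server; names and mount point are made up):

  # mirror the two iSCSI LUNs and put a filesystem on the mirror
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mkfs.ext3 /dev/md0
  mount /dev/md0 /srv/vmstore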


What this gives you are two storage servers (ZFS-backed, serving out  
iSCSI shares) and the client(s) take a share from each and mirror  
them... if one of the ZFS servers were to go kaput, the other is  
still there actively taking in and serving data. From the client's  
perspective, it'll just look like one side of the mirror went down  
and after you get the downed ZFS server back up, you would initiate the  
normal mirror reattachment procedure on the client(s).


This will also allow you to patch your ZFS servers without downtime  
incurred on your clients.


The disk storage on your two ZFS+iSCSI servers could be anything.  
Given your budget and space needs, I would suggest looking at the  
Apple Xserve RAID with 750GB drives. You're a .edu, so the price of  
these things will likely please you (I just snapped up two of them at  
my .edu for a really insane price).


Q2: The client will just see the iSCSI share as a raw block device.  
Put your ext3/xfs/jfs on it as you please... to ZFS it is just  
data.  That's the only way you can use iSCSI, really; it's block  
level, remember.  On ZFS, the iSCSI backing store is one large sparse  
file.


Q3: See the zpool man page, specifically the 'zpool replace ...'  
command.


Q4: Since (or if) you're doing iSCSI, ZFS snapshots will be of no  
value to you since ZFS can't see into those iSCSI backing store  
files. I'll assume that you have a backup system in place for your  
existing infrastructure (Networker, NetBackup or what have you) so  
back up the stuff from the *clients* and not the ZFS servers. Just  
space the backup schedule out if you have multiple clients so that  
the ZFS+iSCSI servers aren't overloaded with all their clients reading  
data suddenly when backup time rolls around.


Q5: Sure, nothing would stop you from doing that sort of config, but  
it's something that would make Rube Goldberg smile. Keep out any  
unneeded complexity and condense the solution.


Excuse my ASCII art skills, but consider this:

[JBOD/ARRAY]---(fc)---[ZFS/iSCSI server 1]---(iscsi share)---+
                                                             |   [Client]
                                                             |   [mirroring the]
[JBOD/ARRAY]---(fc)---[ZFS/iSCSI server 2]---(iscsi share)---+   [two shares]

Kill one of the JBODs or arrays, OR the ZFS+iSCSI servers, and your  
clients are still in good shape as long as their software mirroring  
facility behaves.


/dale


Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-30 Thread Will Murnane

Questions I don't know answers to are omitted.  I am but a nestling.

On 5/31/07, Nathan Huisman [EMAIL PROTECTED] wrote:

= STORAGE REQUIREMENTS

5-10TB of redundant, fairly high-speed storage

What does "high speed" mean?  How many users are there for this
system?  Are they accessing it via Ethernet? FC? Something else?  Why
the emphasis on iSCSI?


= QUESTION #2

Can ZFS be exported via iSCSI and then imported as a disk to a Linux
system and then be formatted with another file system[?]

Yes. It's in OpenSolaris but not (as I understand it) in Solaris
direct from Sun.  If running OpenSolaris isn't an issue (but it
probably is) it works out of the box.


= QUESTION #3

How does zfs handle a bad drive? What process must I go through in
order to take out a bad drive and replace it with a good one?

ZFS only notices drives are dead when they're really dead - they can't
be opened.  If a drive is causing intermittent problems (returning bad
data and so forth) it won't get noticed, but ZFS will recover the
blocks from mirrors or parity.  'zpool replace' should take care of
the replacement procedure, or you could keep hot spares online.  I
can't comment on hotswapping drives while the machine is on; does this
work in general, or require special hardware?


= QUESTION #4

What is a good way to back up this HA storage unit? Snapshots will
provide an easy way to do it live, but should it be dumped into a tape
library, or a third offsite ZFS pool using zfs send/receive, or ?

ZFS will be no help if all you've got is iscsi targets.  You need
something that knows what those targets hold; whatever client-OS-based
stuff you use in other places will do.  Otherwise you end up
storing/backing up a lot more than you need to - filesystem metadata,
et cetera.


= QUESTION #5

Does the following setup work?

BOX 1 (JBOD) - iscsi export - BOX 2 ZFS.

In other words, can I set up a bunch of thin storage boxes with low CPU
and RAM instead of using SAS or FC to supply the JBOD to the ZFS server?

As Dale mentions, this seems overly complicated.   Consuming iscsi and
producing different iscsi doesn't sound like a good idea to me.

Will


Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-30 Thread Sanjeev Bagewadi

Nathan,

Some answers inline...

Nathan Huisman wrote:


= PROBLEM

To create a disk storage system that will act as an archive point for
user data (Non-recoverable data), and also act as a back end storage
unit for virtual machines at a block level.

= BUDGET

Currently I have about 25-30k to start the project, more could be
allocated in the next fiscal year for perhaps a backup solution.

= TIMEFRAME

I have 8 days to cut a P.O. before our fiscal year ends.

= STORAGE REQUIREMENTS

5-10TB of redundant, fairly high-speed storage


= QUESTION #1

What is the best way to mirror two ZFS pools in order to achieve a sort
of HA storage system? I don't want to have to physically swap my disks
into another system if any of the hardware on the ZFS server dies. If I
have the following configuration, what is the best way to mirror these in
near real time?

BOX 1 (JBOD-ZFS) BOX 2 (JBOD-ZFS)

I've seen the zfs send and receive commands but I'm not sure how well
that would work with a close to real time mirror.


If you want close to realtime mirroring (across pools in this case), AVS 
would be a better option in my opinion.
Refer to: http://www.opensolaris.org/os/project/avs/Demos/AVS-ZFS-Demo-V1/




= QUESTION #2

Can ZFS be exported via iSCSI and then imported as a disk to a Linux
system and then be formatted with another file system? I wish to use ZFS
as a block-level file system for my virtual machines, specifically
using Xen. If this is possible, how stable is this? How is error
checking handled if the ZFS volume is exported via iSCSI and then the block
device formatted to ext3? Will ZFS still be able to check for errors?
If this is possible and this all works, then are there ways to expand a
ZFS iSCSI exported volume and then expand the ext3 file system on the
remote host?


Yes, you can create volumes (ZVOLs) in a zpool and export them over iSCSI.
The ZVOL would guarantee data consistency at the block level.

Expanding the ZVOL should be possible. However, I am not sure if/how 
iSCSI behaves here.

You might need to try it out.



= QUESTION #3

How does zfs handle a bad drive? What process must I go through in
order to take out a bad drive and replace it with a good one?


# zpool replace poolname bad-drive new-good-drive

The other option would be to configure hot spares, and they will kick in 
automatically when a bad drive is detected.
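
A minimal sketch of the hot-spare option (hypothetical names):

  # add a spare to the pool; ZFS pulls it in automatically when a disk fails
  zpool add tank spare c1t5d0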



= QUESTION #4

What is a good way to back up this HA storage unit? Snapshots will
provide an easy way to do it live, but should it be dumped into a tape
library, or a third offsite ZFS pool using zfs send/receive, or ?

= QUESTION #5

Does the following setup work?

BOX 1 (JBOD) - iscsi export - BOX 2 ZFS.

In other words, can I set up a bunch of thin storage boxes with low CPU
and RAM instead of using SAS or FC to supply the JBOD to the ZFS server?


Should be feasible.  Just note that you would then need a robust LAN, as 
it would be flooded with storage traffic.


Thanks and regards,
Sanjeev.

--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 

