Re: [zfs-discuss] Is there _any_ suitable motherboard?

2007-08-22 Thread Mark
Hey,

I will submit it. However, does OpenSolaris have a separate HCL, or do I just
use the Solaris one?

Cheers

Mark
 
 


Re: [zfs-discuss] Samba and ZFS ACL Question

2007-08-22 Thread Peter Baumgartner
On 8/22/07, Peter Baumgartner <[EMAIL PROTECTED]> wrote:
>
> I would also like to use this module. This bug
> http://bugs.opensolaris.org/view_bug.do?bug_id=6561700 leads me to believe
> it can be used with the current version of Samba.
>
> Do I need to rebuild Samba? If so, does anybody have pointers on doing
> that? I'm not having any luck trying to build 3.2.0 with ads/krb5 support.


Some more info:
I've installed krb5_lib and krb5_lib_dev from Blastwave, and when running
configure on Samba 3.2.0 I get:

checking whether krb5_mk_error takes 3 arguments MIT or 9 Heimdal... yes
configure: WARNING: krb5_mk_req_extended not found in -lkrb5
configure: WARNING: no CREATE_KEY_FUNCTIONS detected
configure: WARNING: no GET_ENCTYPES_FUNCTIONS detected
configure: WARNING: no KT_FREE_FUNCTION detected
configure: WARNING: no KRB5_VERIFY_CHECKSUM_FUNCTION detected
configure: error: krb5 libs don't have all features required for Active
Directory support


Here is my configure command:
LD_LIBRARY_PATH="/opt/csw/lib:/usr/lib" LDFLAGS="-L/opt/csw/lib
-R/opt/csw/lib" CPPFLAGS="-I/opt/csw/include" CFLAGS="-I/opt/csw/include
-DHAS_LDAP" LIBS="-lldap" ./configure --with-ads --with-ldap --with-krb5
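
One thing I may try next (just a guess on my part, not a verified fix): Samba 3.x
accepts a base directory for --with-krb5, so pointing it straight at the
Blastwave tree might get the MIT libraries picked up, roughly:

# Untested guess; /opt/csw is assumed to be the Blastwave prefix.
LDFLAGS="-L/opt/csw/lib -R/opt/csw/lib" CPPFLAGS="-I/opt/csw/include" \
    ./configure --with-ads --with-ldap --with-krb5=/opt/csw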


Re: [zfs-discuss] ZFS quota

2007-08-22 Thread Matthew Ahrens
Brad Plecs wrote:
> I hate to start rsyncing again, but may be forced to; policing the snapshot 
> space consumption is 
> getting painful, but the online snapshot feature is too valuable to discard 
> altogether.  
> 
> or if there are other creative solutions, I'm all ears...

OK, you asked for "creative" workarounds... here's one (though it requires 
that the filesystem be briefly unmounted, which may be deal-killing):

zfs create pool/realfs
zfs set quota=1g pool/realfs

again:
zfs umount pool/realfs
zfs rename pool/realfs pool/oldfs
zfs snapshot pool/[EMAIL PROTECTED]
zfs clone pool/[EMAIL PROTECTED] pool/realfs
zfs set quota=1g pool/realfs  (6364688 would be useful here)
zfs set quota=none pool/oldfs
zfs promote pool/oldfs
zfs destroy pool/backupfs
zfs rename pool/oldfs pool/backupfs
backup pool/[EMAIL PROTECTED]
sleep $backupinterval
goto again
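
Spelled out as a loop, purely as a sketch (the snapshot name and interval below
are made up, since the real ones got mangled above, and `backup' stands in for
whatever actually backs up the filesystem):

#!/bin/sh
# Rough sketch of the rotation above; names and interval are assumptions.
backupinterval=86400
while true; do
    snap=backup-`date +%Y%m%d%H%M%S`
    zfs umount pool/realfs
    zfs rename pool/realfs pool/oldfs
    zfs snapshot pool/oldfs@$snap
    zfs clone pool/oldfs@$snap pool/realfs
    zfs set quota=1g pool/realfs        # 6364688 would make this automatic
    zfs set quota=none pool/oldfs
    zfs promote pool/oldfs              # errors harmlessly on the first pass
    zfs destroy pool/backupfs           # likewise; nothing to destroy yet
    zfs rename pool/oldfs pool/backupfs
    backup pool/backupfs@$snap          # placeholder for the real backup step
    sleep $backupinterval
done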

FYI, we are working on "fs-only" quotas.

--matt


Re: [zfs-discuss] Samba and ZFS ACL Question

2007-08-22 Thread Peter Baumgartner
I would also like to use this module. This bug 
http://bugs.opensolaris.org/view_bug.do?bug_id=6561700 leads me to believe it 
can be used with the current version of Samba. 

Do I need to rebuild Samba? If so, does anybody have pointers on doing that? 
I'm not having any luck trying to build 3.2.0 with ads/krb5 support.

Thanks in advance!
 
 


Re: [zfs-discuss] ZFS quota

2007-08-22 Thread Brad Plecs
Just wanted to voice another request for this feature.

I was forced on a previous Solaris 10/ZFS system to rsync whole filesystems and
snapshot the backup copy to prevent the snapshots from negatively impacting
users. This obviously has the effect of reducing available space on the system
by over half. It also robs you of lots of I/O bandwidth while all that data is
rsyncing, and it means that users can't see their snapshots; only a sysadmin
with access to the backup copy can.
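
For the curious, the old setup boiled down to something like this (pool and
path names here are made up):

# rsync users' data to a second pool, then snapshot only the copy
rsync -a --delete /export/home/ /backup/home/ && \
    zfs snapshot backup/home@`date +%Y%m%d`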

We've got a new system that isn't doing the rsync, and users very quickly
discovered over-quota problems when their directories appeared empty and
deleting files didn't help. They required sysadmin intervention to increase
their filesystem quotas to accommodate the snapshots and their real data.
Trying to anticipate the space required for the snapshots and giving them
that as a quota is more or less hopeless; it also gives them that much more
rope with which to hang themselves via massive snapshots.

I hate to start rsyncing again, but I may be forced to; policing the snapshot
space consumption is getting painful, but the online snapshot feature is too
valuable to discard altogether.

Or if there are other creative solutions, I'm all ears...
 
 


Re: [zfs-discuss] ZFS problems in dCache

2007-08-22 Thread Xavier Canehan
We have the same issue (using dCache on Thumpers, data on ZFS).
A workaround has been to move the directory to a local UFS filesystem created
with a low nbpi parameter.
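
Roughly like this (the device name and nbpi value are only an example):

# create a UFS filesystem with a low nbpi (i.e. many inodes) and mount it
newfs -i 2048 /dev/rdsk/c0t1d0s6
mount /dev/dsk/c0t1d0s6 /dcache-meta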

However, this is not a solution.

Doesn't look like a threading problem; thanks anyway, Jens!
 
 


Re: [zfs-discuss] Mirrored zpool across network

2007-08-22 Thread Matthew Ahrens
Ralf Ramge wrote:
> I consider this a big design flaw of ZFS.

Are you saying that it's a design flaw of ZFS that we haven't yet implemented
remote replication?  I would consider that a missing feature, not a design
flaw.  There's nothing in the design of ZFS to prevent such a feature (and in
fact, several aspects of the design would work very well with such a feature,
e.g. as used with "zfs send").
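
Today it can already be approximated by hand with "zfs send" over ssh; a
sketch, with made-up pool and host names:

zfs snapshot tank/fs@rep1
zfs send tank/fs@rep1 | ssh otherhost zfs receive tank/fs
# later, ship only the changes; assumes the remote copy is left untouched
zfs snapshot tank/fs@rep2
zfs send -i tank/fs@rep1 tank/fs@rep2 | ssh otherhost zfs receive tank/fs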

 > I'm not very familiar with the
> code, but I still have hope that there'll be a parameter which allows to 
> get rid of the cache flushes.

You mean zfs_nocacheflush?  Admittedly, this is a hack.  We're working on 
making this simply do the right thing, based on the capabilities of the 
underlying storage device.
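
For the archives, the blunt form of that hack is a single /etc/system line
(only appropriate when the underlying storage has nonvolatile, battery-backed
cache, and it takes effect after a reboot):

# append the tunable to /etc/system, then reboot
echo "set zfs:zfs_nocacheflush = 1" >> /etc/system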

 > ZFS, and the X4500, are typical examples
> of different departments not really working together, e.g. they have a
> wonderful file system, but there is no storage that supports it.

I'm not sure what you mean.  ZFS supports any storage, and works great on the 
X4500.

--matt


Re: [zfs-discuss] ZFS problems in dCache

2007-08-22 Thread Jens Elkner
On Wed, Aug 01, 2007 at 09:49:26AM -0700, Sergey Chechelnitskiy wrote:
Hi Sergey,
> 
> I have a flat directory with a lot of small files inside. And I have a java 
> application that reads all these files when it starts. If this directory is 
> located on ZFS the application starts fast (15 mins) when the number of files 
> is around 300,000 and starts very slow (more than 24 hours) when the number 
> of files is around 400,000. 
> 
> The question is why ? 
> Let's set aside the question why this application is designed this way.
> 
> I still needed to run this application. So, I installed a Linux box with XFS,
> mounted this XFS directory on the Solaris box and moved my flat directory
> there. Then my application started fast (< 30 mins) even if the number of
> files (in the Linux-hosted XFS directory mounted via NFS on the Solaris box)
> was 400,000 or more.
> 
> Basically, what I want to do is run this application on a Solaris box. Now I
> cannot do it.

Just a rough guess - this might be a Solaris threading problem. See
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6518490

So perhaps starting the app with -XX:-UseThreadPriorities may help ...
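
I.e. something along the lines of (the jar name is just a placeholder):

java -XX:-UseThreadPriorities -jar your-dcache-app.jar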

Regards,
jel.
-- 
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 12768


Re: [zfs-discuss] Mirrored zpool across network

2007-08-22 Thread Jim Dunham
Ralf,

> Well, and what I want to say: if you place the bitmap volume on the same
> disk, this situation even gets worse. The problem is the involvement of
> SVM. Building the soft partition again makes the handling even more
> complex and the case harder to handle for operators. It's the best way
> to make sure that the disk will be replaced, but not added to the zpool
> during the night - and replacing it during regular working hours isn't
> an option either, because syncing 500 GB over a 1 GBit/s interface during
> daytime just isn't possible without putting the guaranteed service times
> at risk. Having to take care of soft partitions just isn't idiot-proof
> enough. And *poof*, there's a good chance the TCO of an X4500 is
> considered too high.

You are quite correct in that increasing the number of data path technologies
(ZFS + AVS + SVM) increases the TCO, as the skills required by everyone
involved must increase proportionately. For the record, using ZFS zvols for
bitmap volumes does not scale: the overhead of bit flipping is way too many
I/Os for raidz or raidz2 storage pools, and even on a mirrored storage pool it
is high, as the COW semantics of ZFS make the I/O cost too high.

>
>>> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB
>>> will be rebuilt. These 500 GB are synced over a single 1 GBit/s
>>> crossover cable. This takes a bit of time and is 100% unnecessary
>>
>> But it is necessary! As soon as the HSP disk kicks in, not only is the
>> disk being rebuilt by ZFS, but newly allocated ZFS data will also be
>> written to this HSP disk. So although it may appear that there is wasted
>> replication cost (of which there is), the instant that ZFS writes new
>> data to this HSP disk, the old replicated disk is instantly inconsistent,
>> and there is no means to fix it.
> It's necessary from your point of view, Jim. But not in the minds of the
> customers. Even worse, it could be considered a design flaw - not in
> AVS, but in ZFS.

I wouldn't go so far as to say it is a design flaw. The fact that AVS works
with ZFS, and vice versa, without either having knowledge of the other's
presence, says a lot for the I/O architecture of Solaris. If there is a
compelling advantage to interoperate, the OpenSolaris community as a whole is
free to propose a project, gather community interest, and go from there. The
potential of OpenSolaris is huge, especially when it is riding a technology
wave like the one created by the X4500 and ZFS.


> Just have a look at how the usual Linux dude works. He doesn't use AVS; he
> uses a kernel module called DRBD. It does basically the same, it
> replicates one raw device to another over a network interface, like AVS
> does. But the Linux dude has one advantage: he doesn't have ZFS. Yes, as
> impossible as it may sound, it is an advantage. Why? Because he never
> has to mirror 40 or 46 devices, because his lame file systems depend on
> a hardware RAID controller! Same goes with UFS, of course. There's only
> ONE replicated device, no matter how many discs are involved.
> And so, it's definitely NOT necessary to sync a disc when a HSP kicks
> in, because this disc failure will never be reported to the host, it's
> handled by the RAID controller. As a result, no replication will take
> place, because AVS simply isn't involved. We even tried to deploy ZFS
> upon SVM RAID5 stripes to get rid of this problem, just to learn how
> much the RAID 5 performance of SVM sucks ... a cluster of six USB sticks
> was faster than the Thumpers.

Instead of using SVM for RAID 5, to keep the volume count low, consider
concatenating 8 devices (RAID 0) into each of 5 separate SVM volumes, then
configuring both a ZFS raidz storage pool and AVS on these 5 volumes. This
prevents SVM from performing software RAID 5 (RAID 0 is a low-overhead
pass-through for SVM), and, prior to giving the entire SVM volume to ZFS, one
can also get the AVS bitmaps from this pool too.
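
Roughly along these lines; device names are made up and the metadb/AVS setup
is omitted:

# build 5 SVM concats (RAID 0) of 8 disks each ...
metainit d10 8 1 c1t0d0s0 1 c1t1d0s0 1 c1t2d0s0 1 c1t3d0s0 \
    1 c1t4d0s0 1 c1t5d0s0 1 c1t6d0s0 1 c1t7d0s0
# ... repeat for d11-d14 with the remaining disks, then raidz across them
zpool create tank raidz /dev/md/dsk/d10 /dev/md/dsk/d11 /dev/md/dsk/d12 \
    /dev/md/dsk/d13 /dev/md/dsk/d14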

> I consider this a big design flaw of ZFS. I'm not very familiar with the
> code, but I still have hope that there'll be a parameter which allows one
> to get rid of the cache flushes. ZFS, and the X4500, are typical examples
> of different departments not really working together, e.g. they have a
> wonderful file system, but there is no storage that supports it. Or a
> great X4500, an 11-24 TB file server for $40,000, but no options to make
> it highly available like the $1,000 boxes. AVS is, in my opinion, clearly
> one of the components which suffers from it. The Sun marketing and
> Jonathan still have a long way to go. But, on the other hand, difficult
> customers like me and my company are always happy to point out some
> difficulties and to help resolve them :-)

Sun does recognize the potential of both the X4500 and ZFS, and also the
difficulties (and problems) of combining them.

Re: [zfs-discuss] Is there _any_ suitable motherboard?

2007-08-22 Thread Casper . Dik

>Hi,
>
>For what you're looking for, the Gigabyte M61P-S3 is the perfect mobo.
>Six SATA ports, DDR2, and an AM2 dual-core AMD is really cheap. Only
>downside is that the Realtek NIC doesn't work, as far as I know.
>However, an Intel gigabit card is relatively cheap and works. And even
>with all that I was able to buy it all for around AUS$350: CPU, mobo,
>RAM and everything. I've tried it with a few Solaris distros
>and it's worked fine and been rather fast.

Did you submit it to the HCL :-)?

If power consumption and heat is a consideration, the newer Intel CPUs
have an advantage in that Solaris supports native power management on
those CPUs.

We will not do native power management on AMDs until we get some with a
P-state invariant TSC (my powernow driver still supports single-core,
single-socket systems).

This also, I think, will require a "socket AM3" motherboard (the 0x10
Opterons will run in AM2 motherboards, but those do not provide the
different power and CPU planes needed for the 2/4 cores).

But to measure is to know, and I would really like to know the idle power
consumption (no disks, or disks spun down) of the various motherboards under
Solaris (with CPU/memory accounted for, of course).

Casper






Re: [zfs-discuss] Is there _any_ suitable motherboard?

2007-08-22 Thread Mark
Hi,

For what you're looking for, the Gigabyte M61P-S3 is the perfect mobo. Six
SATA ports, DDR2, and an AM2 dual-core AMD is really cheap. Only downside is
that the Realtek NIC doesn't work, as far as I know. However, an Intel gigabit
card is relatively cheap and works. And even with all that I was able to buy
it all for around AUS$350: CPU, mobo, RAM and everything. I've tried it with a
few Solaris distros and it's worked fine and been rather fast.

Cheers

Mark
 
 


Re: [zfs-discuss] Mirrored zpool across network

2007-08-22 Thread Mark
Wow,

I just opened a whole can of worms there that went flying over my head. Thanks
for all the information! I'll see if I can plough through it all :)

I'm guessing that I might be able to do asynchronous replication, but the
problem is that the video is going to be streaming from a camera in real time,
and it's only going to the file server. Also, this isn't like security footage
or something. It's for a feature documentary, so we can't afford to lose any of
this footage, and we really only get one try at it. Hence mirroring, so at
least we have two copies of it.

I'm guessing that AVS is some kind of low-level data replication software.
Correct me if I'm wrong.

I suppose the other thing is that we need sustained transfer speeds of between
50 MB/s and about 300 MB/s depending on what video format we choose. So I'm
guessing that with those speeds only fibre is going to cut it, really. Is this
correct?

Thanks again for all your help.

Cheers

Mark
 
 


Re: [zfs-discuss] Odp: Is ZFS efficient for large collections of small files?

2007-08-22 Thread Roch - PAE
Łukasz K writes:
 > > Is ZFS efficient at handling huge populations of tiny-to-small files -
 > > for example, 20 million TIFF images in a collection, each between 5
 > > and 500k in size?
 > > 
 > > I am asking because I could have sworn that I read somewhere that it
 > > isn't, but I can't find the reference.
 > 
 > It depends on what type of I/O you will do. If only reads, there is no
 > problem. Writing small files (and removing them) will fragment the pool,
 > and that will be a huge problem.
 > You can set recordsize to 32k (or 16k) and it will help for some time.
 > 

Comparing recordsize of 16K with 128K.

Files in the range of [0,16K] : no difference.
Files in the range of [16K,128K]  : more efficient to use 128K
Files in the range of [128K,500K] : more efficient to use 16K

In the [16K,128K] range the actual filesize is rounded up to the next
multiple of 16K with a 16K recordsize, and to the nearest 512B boundary
with a 128K recordsize. This will be fairly catastrophic for files
slightly above 16K (rounded up to 32K vs 16K+512B).

In the [128K,500K] range we're hurt by

5003563 use smaller "tail block" for last block of object

Until it is fixed, then yes, files stored using 16K records are rounded
up more tightly; metadata probably eats part of the gains.
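
One way to see the rounding for yourself (pool and dataset names are made up):

# compare on-disk usage of a 20K file under 16K vs 128K recordsize
zfs create -o recordsize=16k tank/rs16
zfs create -o recordsize=128k tank/rs128
mkfile 20k /tank/rs16/f /tank/rs128/f
sync; sleep 5                       # wait for the txg to commit
du -k /tank/rs16/f /tank/rs128/f    # expect roughly 32K vs ~21K charged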

-r


 > Lukas
 > 
 > 



Re: [zfs-discuss] Is ZFS efficient for large collections of small files?

2007-08-22 Thread Roch - PAE

  Brandorr wrote:
  > Is ZFS efficient at handling huge populations of tiny-to-small files -
  > for example, 20 million TIFF images in a collection, each between 5
  > and 500k in size?
  >
  > I am asking because I could have sworn that I read somewhere that it
  > isn't, but I can't find the reference.
  >   
  If you're worried about I/O throughput, you should avoid RAIDZ1/2
  configurations. Random read performance will be disastrous if you do;

A raid-z group can do one random read per disk-I/O latency. So 8 disks (each
capable of 200 IOPS) in a zpool split into 2 raid-z groups should be able to
serve 400 files per second. If you need to serve more files, then you need
more disks or need to use mirroring. With mirroring, I'd expect to serve 1600
files per second (8*200). This model only applies to random reading, not to
sequential access, nor to any type of write load.
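
Back-of-the-envelope, if you want to plug in your own numbers (200 IOPS per
disk is just an assumption about 7200rpm drives):

DISKS=8 IOPS=200 GROUPS=2
echo "raid-z random reads/sec: $((GROUPS * IOPS))"   # one read per group at a time
echo "mirror random reads/sec: $((DISKS * IOPS))"    # every disk can serve reads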

For small file creation ZFS can be extremely efficient, in that it can create
more than one file per I/O. It should also approach disk streaming performance
for write loads.

  I've seen random read rates of less than 1 MB/s on an X4500 with 40
  dedicated disks for data storage.

It would be nice to see if the above model matches your data. So if you have
all 40 disks in a single raid-z group (an anti best practice; see
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide),
I'd expect <200 files served per second, and if the files were of 5K average
size then I'd expect that 1 MB/sec.

  If you don't have to worry about disk 
  space, use mirrors;  

right on !

  I got my best results during my extensive X4500 benchmarking sessions when
  I mirrored single slices instead of complete disks (resulting in 40 2-way
  mirrors on 40 physical discs, mirroring c0t0d0s0->c0t1d0s1 and
  c0t1d0s0->c0t0d0s1, and so on). If you're worried about disk space, you
  should consider striping several instances of RAIDZ1 arrays, each one
  consisting of three discs or slices. Sequential access will go down the
  cliff, but random reads will be boosted.

Writes should be good if not great, no matter what the
workload is. I'm interested in data that shows otherwise.

  You should also adjust the recordsize. 

For small files I certainly would not. Small files are stored as a single
record when they are smaller than the recordsize. A single record is good in
my book. Not sure when one would want otherwise for small files.


  Try to measure the average I/O 
  transaction size. There's a good chance that your I/O performance will 
  be best if you set your recordsize to a smaller value. For instance, if 
  your average file size is 12 KB, try using 8K or even 4K recordsize, 
  stay away from 16K or higher.

Tuning the recordsize is currently only recommended for databases (large
files with fixed-size record access). Again, it's interesting input if tuning
the recordsize helped another type of workload.

-r

  -- 

  Ralf Ramge
  Senior Solaris Administrator, SCNA, SCSA

  Tel. +49-721-91374-3963 
  [EMAIL PROTECTED] - http://web.de/

  1&1 Internet AG
  Brauerstraße 48
  76135 Karlsruhe

  Amtsgericht Montabaur HRB 6484

  Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
  Aufsichtsratsvorsitzender: Michael Scheeren



Re: [zfs-discuss] Mirrored zpool across network

2007-08-22 Thread Ralf Ramge
Jim Dunham wrote:
> This is just one scenario for deploying the 48 disks of x4500. The 
> blog listed below offers another option, by mirroring the bitmaps 
> across all available disks, bring the total disk count back up to 46, 
> (or 44, if 2x HSP) leaving the other two for a mirrored root disk.  
> http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless
>
I know your blog entry, Jim. And I still admire your skills in doing
calculations within shell scripts (I just gave each soft partition 100
megabytes of space, finished ;-) ). But after some thinking, I decided against
using a slice on the same disk for bitmaps. Not just because of performance
issues; that's not a valid reason. Again, the disaster scenarios make me
think. In this case, it's the complexity of administration.

You know, the x64 Solaris boxes are basically competing against Linux 
boxes all day. The X4500 is a very attractive replacement for the typical
Linux file server, consisting of a server, a hardware RAID controller 
and several cheap and stupid fibre-channeled SATA JBODs for less than 
$5,000 each. Double this to have a cluster. In our case, the X4500 is 
competing against more than 60 of those clusters with a total of 360 
JBODs. The X4500's main advantage isn't the price per gigabyte (the price is
exactly the same!), as most members of the sales department may expect; the
real advantage is the gigabytes per rack unit. But there
are several disadvantages, for instance: not being able to access the 
hard drives from the front and needing a ladder and a screwdriver 
instead, or, most important for the typical data center, the *operator* 
is not able to replace a disk like he's used to: pulling the old disc 
out, putting the new disc in, resync starting, finished. You'll always 
have to wait until the next morning, until a Solaris administrator is 
available again (which may impact your high-availability concepts) or get a
Solaris administrator into the company 24/7 (which raises the TCO of the
Solaris boxes).
Well, and what I want to say: if you place the bitmap volume on the same 
disk, this situation even gets worse. The problem is the involvement of 
SVM. Building the soft partition again makes the handling even more 
complex and the case harder to handle for operators. It's the best way 
to make sure that the disk will be replaced, but not added to the zpool 
during the night - and replacing it during regular working hours isn't an
option either, because syncing 500 GB over a 1 GBit/s interface during
daytime just isn't possible without putting the guaranteed service times at
risk. Having to take care of soft partitions just isn't idiot-proof enough.
And *poof*, there's a good chance the TCO of an X4500 is considered too high.

>> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB
>> will be rebuilt. These 500 GB are synced over a single 1 GBit/s
>> crossover cable. This takes a bit of time and is 100% unnecessary
>
>
> But it is necessary! As soon as the HSP disk kicks in, not only is the 
> disk being rebuilt by ZFS, but newly allocated ZFS data will also 
> being written to this HSP disk. So although it may appear that there 
> is wasted replication cost (of which there is), the instant that ZFS 
> writes new data to this HSP disk, the old replicated disk is instantly 
> inconsistent, and there is no means to fix.
It's necessary from your point of view, Jim. But not in the minds of the 
customers. Even worse, it could be considered a design flaw - not in 
AVS, but in ZFS.

Just have a look at how the usual Linux dude works. He doesn't use AVS; he
uses a kernel module called DRBD. It does basically the same, it 
replicates one raw device to another over a network interface, like AVS 
does. But the Linux dude has one advantage: he doesn't have ZFS. Yes, as
impossible as it may sound, it is an advantage. Why? Because he never 
has to mirror 40 or 46 devices, because his lame file systems depend on 
a hardware RAID controller! Same goes with UFS, of course. There's only 
ONE replicated device, no matter how many discs are involved.
And so, it's definitely NOT necessary to sync a disc when a HSP kicks 
in, because this disc failure will never be reported to the host, it's 
handled by the RAID controller. As a result, no replication will take 
place, because AVS simply isn't involved. We even tried to deploy ZFS 
upon SVM RAID5 stripes to get rid of this problem, just to learn how 
much the RAID 5 performance of SVM sucks ... a cluster of six USB sticks 
was faster than the Thumpers.


I consider this a big design flaw of ZFS. I'm not very familiar with the
code, but I still have hope that there'll be a parameter which allows one to
get rid of the cache flushes. ZFS, and the X4500, are typical examples of
different departments not really working together, e.g. they have a
wonderful file system, but there is no storage that supports it. Or a
great X4500, an 11-24 TB file server for $40,000, but no options to make
it highly available like the $1,000 boxes. AVS is, in my opinion, clearly
one of the components which suffers from it. The Sun marketing and Jonathan
still have a long way to go. But, on the other hand, difficult customers
like me and my company are always happy to point out some difficulties and
to help resolve them :-)