Re: [zfs-discuss] [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
Hey Jim - There's something we're missing here. There does not appear to be enough ZFS write activity to cause the system to pause regularly. Were you able to capture a kernel profile during the pause period? Thanks, /jim

Jim Leonard wrote:

The only thing that jumps out at me is the ARC size - 53.4GB, or most of your 64GB of RAM. This in-and-of-itself is not necessarily a bad thing - if there are no other memory consumers, let ZFS cache data in the ARC. But if something is coming along to flush dirty ARC pages periodically

The workload is a set of 50 python processes, each receiving a stream of data via TCP/IP. The processes run until they notice something interesting in the stream (sorry I can't be more specific), then they connect to a server via TCP/IP and issue a command or two. Log files are written that take up about 50M per day per process. It's relatively low-traffic.

I found what looked to be an applicable bug: CR 6699438 "zfs induces crosscall storm under heavy mapped sequential read workload", but the stack signature for the above bug is different than yours, and it doesn't sound like your workload is doing mmap'd sequential reads. That said, I would be curious to know if your workload used mmap(), versus read/write?

I asked and they couldn't say. It's python so I think it's unlikely.

For the ZFS folks just seeing this, here's the stack frame:

unix`xc_do_call+0x8f
unix`xc_wait_sync+0x36
unix`x86pte_invalidate_pfn+0x135
unix`hat_pte_unmap+0xa9
unix`hat_unload_callback+0x109
unix`hat_unload+0x2a
unix`segkmem_free_vn+0x82
unix`segkmem_zio_free+0x10
genunix`vmem_xfree+0xee
genunix`vmem_free+0x28
genunix`kmem_slab_destroy+0x80
genunix`kmem_slab_free+0x1be
genunix`kmem_magazine_destroy+0x54
genunix`kmem_depot_ws_reap+0x4d
genunix`taskq_thread+0xbc
unix`thread_start+0x8

Let's see what the fsstat and zpool iostat data looks like when this starts happening..

Both are unremarkable, I'm afraid. Here's the fsstat from when it starts happening:

 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes    ops bytes
    0     0     0    75     0      0     0     0     0     10 1.25M zfs
    0     0     0    83     0      0     0     0     0      7  896K zfs
    0     0     0    78     0      0     0     0     0     13 1.62M zfs
    0     0     0   229     0      0     0     0     0     29 3.62M zfs
    0     0     0   217     0      0     0     0     0     28 3.37M zfs
    0     0     0   212     0      0     0     0     0     26 3.03M zfs
    0     0     0   151     0      0     0     0     0     18 2.07M zfs
    0     0     0   184     0      0     0     0     0     31 3.41M zfs
    0     0     0   187     0      0     0     0     0     32 2.74M zfs
    0     0     0   219     0      0     0     0     0     24 2.61M zfs
    0     0     0   222     0      0     0     0     0     29 3.29M zfs
    0     0     0   206     0      0     0     0     0     29 3.26M zfs
    0     0     0   205     0      0     0     0     0     19 2.26M zfs

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
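One way to get both the cross-call origins and the kernel profile Jim asked about is a pair of DTrace one-liners; this is a generic sketch (the 30-second window is arbitrary), not something taken from the system in question:

# aggregate the kernel stacks that issue cross-calls during a pause window
dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); } tick-30s { exit(0); }'

# kernel profile at 997 Hz over the same window (arg0 != 0 means the sample landed in the kernel)
dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }'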
Re: [zfs-discuss] ZFS pool replace single disk with raidz
On Fri, 25 Sep 2009, Ryan Hirsch wrote: I have a zpool named rtank. I accidentally attached a single drive to the pool. I am an idiot I know :D Now I want to replace this single drive with a raidz group. Below is the pool setup and what I tried:

I think that the best you will be able to do is to turn this single drive into a mirror. It seems that this sort of human error occurs pretty often and there is not yet a way to properly fix it.

Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
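For reference, turning the stray top-level disk into a mirror is a single zpool attach; a minimal sketch assuming a spare disk is available (c6d0 is an illustrative device name):

# attach a second disk to the lone c5d0 vdev, making it a 2-way mirror
zpool attach rtank c5d0 c6d0
# watch the resilver and confirm the new mirror vdev
zpool status rtank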
Re: [zfs-discuss] Which directories must be part of rpool?
On Sep 25, 2009, at 19:39, Frank Middleton wrote: /var/tmp is a strange beast. It can get quite large, and be a serious bottleneck if mapped to a physical disk and used by any program that synchronously creates and deletes large numbers of files. I have had no problems mapping /var/tmp to /tmp. Hopefully a guru will step in here and explain why this is a bad idea, but so far no problems... The contents of /var/tmp can be expected to survive between boots (e.g., /var/tmp/vi.recover); /tmp is nuked on power cycles (because it's just memory/swap): /tmp: A directory made available for applications that need a place to create temporary files. Applications shall be allowed to create files in this directory, but shall not assume that such files are preserved between invocations of the application. http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap10.html If a program is creating and deleting large numbers of files, and those files aren't needed between reboots, then it really should be using /tmp. Similar definition for Linux FWIW: http://www.pathname.com/fhs/pub/fhs-2.3.html ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On 09/25/09 16:19, Bob Friesenhahn wrote: On Fri, 25 Sep 2009, Ross Walker wrote: Problem is most SSD manufactures list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. Who said that the slog SSD is written to in 128K chunks? That seems wrong to me. Previously we were advised that the slog is basically a log of uncommitted system calls so the size of the data chunks written to the slog should be similar to the data sizes in the system calls. Log blocks are variable in size dependent on what needs to be committed. The minimum size is 4KB and the max 128KB. Log records are aggregated and written together as much as possible. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS pool replace single disk with raidz
I have a zpool named rtank. I accidentally attached a single drive to the pool. I am an idiot I know :D Now I want to replace this single drive with a raidz group. Below is the pool setup and what I tried:

        NAME        STATE     READ WRITE CKSUM
        rtank       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
          c5d0      ONLINE       0     0     0   <--- single drive in the pool, not in any raidz

$ pfexec zpool replace rtank c5d0 raidz c3t6d0 c3t7d0 c3t8d0 c3t9d0 c3t10d0 c3t11d0
too many arguments

$ zpool upgrade -v
This system is currently running ZFS pool version 18.

Is what I am trying to do possible? If so what am I doing wrong? Thanks. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Sep 25, 2009, at 6:19 PM, Bob Friesenhahn > wrote: On Fri, 25 Sep 2009, Ross Walker wrote: Problem is most SSD manufactures list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. Who said that the slog SSD is written to in 128K chunks? That seems wrong to me. Previously we were advised that the slog is basically a log of uncommitted system calls so the size of the data chunks written to the slog should be similar to the data sizes in the system calls. Are these not broken into recordsize chunks? -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On Fri, 2009-09-25 at 14:39 -0600, Lori Alt wrote: > The list of datasets in a root pool should look something like this: ... > rpool/swap I've had success with putting swap into other pools. I believe others have, as well. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
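For anyone who wants to try the same thing, swap on a zvol in a data pool is only a couple of commands; a minimal sketch with illustrative names and size (add a matching /etc/vfstab entry if it should survive a reboot):

zfs create -V 4G tank/swap              # size is illustrative
swap -a /dev/zvol/dsk/tank/swap
swap -l                                 # confirm the new swap device is listed
# /etc/vfstab entry for persistence:
# /dev/zvol/dsk/tank/swap  -  -  swap  -  no  -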
Re: [zfs-discuss] Which directories must be part of rpool?
on Fri Sep 25 2009, Glenn Lagasse wrote: > The question you're asking can't easily be answered. Sun doesn't test > configs like that. If you really want to do this, you'll pretty much > have to 'try it and see what breaks'. And you get to keep both pieces > if anything breaks. Heh, that doesn't sound like much fun. I have a VM I can experiment with, but I don't want to do this badly enough to take that risk. > There's very little you can safely move in my experience. /export > certainly. Anything else, not really (though ymmv). I tried to create > a seperate zfs dataset for /usr/local. That worked some of the time, > but it also screwed up my system a time or two during > image-updates/package installs. That's hard to imagine. My OpenSolaris installation didn't come with a /usr/local directory. How can mounting a filesystem from a non-root pool under /usr possibly mess anything up? > On my 2010.02/123 system I see: > > bin Symlink to /usr/bin > boot/ > dev/ > devices/ > etc/ > export/ Safe to move, not tied to the 'root' system Good to know. > kernel/ > lib/ > media/ > mnt/ > net/ > opt/ > platform/ > proc/ > rmdisk/ > root/ Could probably move root's homedir I don't think I'd risk it. > rpool/ > sbin/ > system/ > tmp/ > usr/ > var/ > > Other than /export, everything else is considered 'part of the root > system'. Thus part of the root pool. > > Really, if you can't add a mirror for your root pool, then make backups > of your root pool (left as an exercise to the reader) and store the > non-system specific bits (/export) on you're raidz2 pool. Yeah, that's my fallback. Actually, that along with copies=2 on my root pool, which I might well do anyhow. But you people are making a pretty strong case for making the effort to figure out how to do the mirror thing. Thanks, all, for the feedback. -- Dave Abrahams BoostPro Computing http://www.boostpro.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
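The copies=2 fallback mentioned above is a single property; a small sketch — note it only applies to blocks written after the property is set, and it guards against localized corruption, not against losing the whole disk:

zfs set copies=2 rpool
zfs get -r copies rpool      # child datasets inherit unless overridden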
Re: [zfs-discuss] selecting zfs BE from OBP
Ah yes. Thanks Cindy! donour On Sep 25, 2009, at 10:37 AM, Cindy Swearingen wrote: Hi Donour, You would use the boot -L syntax to select the ZFS BE to boot from, like this: ok boot -L Rebooting with command: boot -L Boot device: /p...@8,60/SUNW,q...@4/f...@0,0/ d...@w2104cf7fa6c7,0:a File and args: -L 1 zfs1009BE 2 zfs10092BE Select environment to boot: [ 1 - 2 ]: 2 Then copy and paste the boot string that is provided: To boot the selected entry, invoke: boot [] -Z rpool/ROOT/zfs10092BE Program terminated {0} ok boot -Z rpool/ROOT/zfs10092BE See this pointer as well: http://docs.sun.com/app/docs/doc/819-5461/ggpco?a=view Cindy On 09/25/09 11:09, Donour Sizemore wrote: Can you select the LU boot environment from sparc obp, if the filesystem is zfs? With ufs, you simply invoke 'boot [slice]'. thanks donour ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
From a product standpoint, expanding the variety available in the Storage 7000 (Amber Road) line is somewhere I think we'd (Sun) make bank on. Things like: [ for the home/very small business market ] Mini-Tower sized case, 4-6 3.5" HS SATA-only bays (to take the X2200-style spud bracket drives), 2 CF slots (for boot), single-socket, with 4 DIMMs, and a built-in ILOM. /maybe/ a x4 PCI-E slot, but maybe not. [ for the small business/branch office with no racks] Mid-tower case, 4-bay 2.5" HS area, 6-8 bay 3.5" HS area, single socket, 4/6 DIMMs, ILOM. (2) x4 or x8 PCI-E slots too. (I'd probably go with Socket AM3, with ECC, of course) I'd sell them in both fully loaded with the Amber Road software (and mandatory Service Contract), and no-OS Loaded, no-Service Contract appliance versions. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/25/09 04:44 PM, Lori Alt wrote:

rpool
rpool/ROOT
rpool/ROOT/snv_124 (or whatever version you're running)
rpool/ROOT/snv_124/var (you might not have this)
rpool/ROOT/snv_121 (or whatever other BEs you still have)
rpool/dump
rpool/export
rpool/export/home
rpool/swap

Unless your machine is so starved for physical memory that you couldn't possibly install anything, AFAIK you can always boot without dump and swap, so even if your data pool can't be mounted, you should be OK. I've done many a reboot and pkg image-update with dump and swap inaccessible. Of course with no dump, you won't get, well, a dump, after a panic...

Having /usr/local (IIRC this doesn't even exist in a straight OpenSolaris install) in a shared space on your data pool is quite useful if you have more than one machine, unless you have multiple architectures. Then it turns into the /opt problem. Hiving off /opt does not seem to prevent booting, and having it on a data pool doesn't seem to prevent upgrade installs. The big problem with putting /opt on a shared pool is when multiple hosts have different /opts. Using legacy mounts seems to be the only way around this. Do the gurus have a technical explanation why putting /opt in a different pool shouldn't work?

/var/tmp is a strange beast. It can get quite large, and be a serious bottleneck if mapped to a physical disk and used by any program that synchronously creates and deletes large numbers of files. I have had no problems mapping /var/tmp to /tmp. Hopefully a guru will step in here and explain why this is a bad idea, but so far no problems...

A 32GB SSD is marginal for a root pool, so shrinking it as much as possible makes a lot of sense until bigger SSDs become cost effective (not long from now I imagine). But if you already have a 16GB or 32GB SSD, or a dedicated boot disk <= 32GB, then you can be SOL unless you are very careful to empty /var/pkg/download, which doesn't seem to get emptied even if you set the magic flag.

HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
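On the /var/pkg/download point, the IPS download cache can usually be told to clean up after itself; a hedged sketch — the property name here is from memory for OpenSolaris builds of this era, so confirm it with pkg property first:

pkg property | grep flush                           # confirm the property exists on your build
pkg set-property flush-content-cache-on-success True
rm -rf /var/pkg/download/*                          # reclaim space already used by the cache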
Re: [zfs-discuss] periodic slow responsiveness
rswwal...@gmail.com said:
> Yes, but if it's on NFS you can just figure out the workload in MB/s and use
> that as a rough guideline.

I wonder if that's the case. We have an NFS server without NVRAM cache (X4500), and it gets huge MB/sec throughput on large-file writes over NFS. But it's painfully slow on the "tar extract lots of small files" test, where many, tiny, synchronous metadata operations are performed.

> I did a similar test with a 512MB BBU controller and saw no difference with
> or without the SSD slog, so I didn't end up using it.
>
> Does your BBU controller ignore the ZFS flushes?

I believe it does (it would be slow otherwise). It's the Sun StorageTek internal SAS RAID HBA.

Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
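When every device behind a pool really does have nonvolatile (battery-backed) write cache, the usual workaround of this era was to stop ZFS from issuing cache flushes at all; a sketch — this is unsafe if any pool device has a volatile cache:

# /etc/system (takes effect on the next boot)
set zfs:zfs_nocacheflush = 1

# or toggle it live for testing
echo zfs_nocacheflush/W0t1 | mdb -kw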
Re: [zfs-discuss] periodic slow responsiveness
On Fri, 25 Sep 2009, Ross Walker wrote: Problem is most SSD manufactures list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. Who said that the slog SSD is written to in 128K chunks? That seems wrong to me. Previously we were advised that the slog is basically a log of uncommitted system calls so the size of the data chunks written to the slog should be similar to the data sizes in the system calls. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Fri, 25 Sep 2009, Richard Elling wrote: By default, the txg commit will occur when 1/8 of memory is used for writes. For 30 GBytes, that would mean a main memory of only 240 Gbytes... feasible for modern servers. Ahem. We were advised that 7/8s of memory is currently what is allowed for writes. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
* David Magda (dma...@ee.ryerson.ca) wrote: > On Sep 25, 2009, at 16:39, Glenn Lagasse wrote: > > >There's very little you can safely move in my experience. /export > >certainly. Anything else, not really (though ymmv). I tried to > >create > >a seperate zfs dataset for /usr/local. That worked some of the time, > >but it also screwed up my system a time or two during > >image-updates/package installs. > > I'd be very surprised (disappointed?) if /usr/local couldn't be > detached from the rpool. Given that in many cases it's an NFS mount, > I'm curious to know why it would need to be part of the rpool. If it > is a 'dependency' I would consider that a bug. It can be detached, however one issue I ran in to was packages which installed into /usr/local caused problems when those packages were upgraded. Essentially what occurred was that /usr/local was created on the root pool and upon reboot caused the filesystem service to go into maintenance because it couldn't mount the zfs /usr/local dataset on top of the filled /usr/local root pool location. I didn't have time to investigate into it fully. At that point, spinning /usr/local off into it's own zfs dataset just didn't seem worth the hassle. Others mileage may vary. -- Glenn ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Fri, Sep 25, 2009 at 5:47 PM, Marion Hakanson wrote: > j...@jamver.id.au said: >> For a predominantly NFS server purpose, it really looks like a case of the >> slog has to outperform your main pool for continuous write speed as well as >> an instant response time as the primary criterion. Which might as well be a >> fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of >> them. > > I wonder if you ran Richard Elling's "zilstat" while running your > workload. That should tell you how much ZIL bandwidth is needed, > and it would be interesting to see if its stats match with your > other measurements of slog-device traffic. Yes, but if it's on NFS you can just figure out the workload in MB/s and use that as a rough guideline. Problem is most SSD manufactures list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. > I did some filebench and "tar extract over NFS" tests of J4400 (500GB, > 7200RPM SATA drives), with and without slog, where slog was using the > internal 2.5" 10kRPM SAS drives in an X4150. These drives were behind > the standard Sun/Adaptec internal RAID controller, 256MB battery-backed > cache memory, all on Solaris-10U7. > > We saw slight differences on filebench oltp profile, and a huge speedup > for the "tar extract over NFS" tests with the slog present. Granted, the > latter was with only one NFS client, so likely did not fill NVRAM. Pretty > good results for a poor-person's slog, though: > http://acc.ohsu.edu/~hakansom/j4400_bench.html I did a smiliar test with a 512MB BBU controller and saw no difference with or without the SSD slog, so I didn't end up using it. Does your BBU controller ignore the ZFS flushes? > Just as an aside, and based on my experience as a user/admin of various > NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem > to get very good improvements with relatively small amounts of NVRAM > (128K, 1MB, 256MB, etc.). None of the filers I've seen have ever had > tens of GB of NVRAM. They don't hold on to the cache for a long time, just as long as it takes to write it all to disk. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded
Chris Kirby wrote: On Sep 25, 2009, at 2:43 PM, Robert Milkowski wrote: Chris Kirby wrote: On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote: That's useful information indeed. I've filed this CR: 6885860 zfs send shouldn't require support for snapshot holds Sorry for the trouble, please look for this to be fixed soon. Thank you. btw: how do you want to fix it? Do you want to acquire a snapshot hold but continue anyway if it is not possible (only in case whene error is ENOTSUP I think)? Or do you want to get rid of it entirely? In this particular case, we should make sure the pool version supports snapshot holds before trying to request (or release) any. We still want to acquire the temporary holds if we can, since that prevents a race with zfs destroy. That case is becoming more common with automated snapshots and their associated retention policies. Yeah, this makes sense. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
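For readers on a pool that is already at version 18 or later, the hold mechanism being discussed looks like this; a generic sketch with illustrative names:

zpool get version tank             # holds require pool version >= 18
zfs hold keep tank/fs@snap         # place a user hold tagged 'keep'
zfs holds tank/fs@snap             # list holds; a held snapshot can't be destroyed
zfs release keep tank/fs@snap      # drop the hold again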
Re: [zfs-discuss] periodic slow responsiveness
j...@jamver.id.au said: > For a predominantly NFS server purpose, it really looks like a case of the > slog has to outperform your main pool for continuous write speed as well as > an instant response time as the primary criterion. Which might as well be a > fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of > them. I wonder if you ran Richard Elling's "zilstat" while running your workload. That should tell you how much ZIL bandwidth is needed, and it would be interesting to see if its stats match with your other measurements of slog-device traffic. I did some filebench and "tar extract over NFS" tests of J4400 (500GB, 7200RPM SATA drives), with and without slog, where slog was using the internal 2.5" 10kRPM SAS drives in an X4150. These drives were behind the standard Sun/Adaptec internal RAID controller, 256MB battery-backed cache memory, all on Solaris-10U7. We saw slight differences on filebench oltp profile, and a huge speedup for the "tar extract over NFS" tests with the slog present. Granted, the latter was with only one NFS client, so likely did not fill NVRAM. Pretty good results for a poor-person's slog, though: http://acc.ohsu.edu/~hakansom/j4400_bench.html Just as an aside, and based on my experience as a user/admin of various NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem to get very good improvements with relatively small amounts of NVRAM (128K, 1MB, 256MB, etc.). None of the filers I've seen have ever had tens of GB of NVRAM. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
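zilstat is a DTrace-based script published by Richard Elling rather than a bundled command; a loose sketch of how it is typically run — the path and argument syntax here are assumptions, so check the script's own usage text:

./zilstat.ksh 10 6     # six 10-second samples of ZIL write activity while the workload runs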
Re: [zfs-discuss] periodic slow responsiveness
On Fri, Sep 25, 2009 at 1:39 PM, Richard Elling wrote: > On Sep 25, 2009, at 9:14 AM, Ross Walker wrote: > >> On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn >> wrote: >>> >>> On Fri, 25 Sep 2009, Ross Walker wrote: As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much >>> >>> Surely this depends on the origin of the large sequential writes. If the >>> origin is NFS and the SSD has considerably more sustained write bandwidth >>> than the ethernet transfer bandwidth, then using the SSD is a win. If >>> the SSD accepts data slower than the ethernet can deliver it (which seems to >>> be this particular case) then the SSD is not helping. >>> >>> If the ethernet can pass 100MB/second, then the sustained write >>> specification for the SSD needs to be at least 100MB/second. Since data >>> is buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the >>> SSD should support write bursts of at least double that or else it will >>> not be helping bulk-write performance. >> >> Specifically I was talking NFS as that was what the OP was talking >> about, but yes it does depend on the origin, but you also assume that >> NFS IO goes over only a single 1Gbe interface when it could be over >> multiple 1Gbe interfaces or a 10Gbe interface or even multple 10Gbe >> interfaces. You also assume the IO recorded in the ZIL is just the raw >> IO when there is also meta-data or multiple transaction copies as >> well. >> >> Personnally I still prefer to spread the ZIL across the pool and have >> a large NVRAM backed HBA as opposed to an slog which really puts all >> my IO in one basket. If I had a pure NVRAM device I might consider >> using that as an slog device, but SSDs are too variable for my taste. > > Back of the envelope math says: > 10 Gbe = ~1 GByte/sec of I/O capacity > > If the SSD can only sink 70 MByte/s, then you will need: > int(1000/70) + 1 = 15 SSDs for the slog > > For capacity, you need: > 1 GByte/sec * 30 sec = 30 GBytes Where did the 30 seconds come in here? The amount of time to hold cache depends on how fast you can fill it. > Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes > or so. I'm thinking you can do less if you don't need to hold it for 30 seconds. > Both of the above assume there is lots of memory in the server. > This is increasingly becoming easier to do as the memory costs > come down and you can physically fit 512 GBytes in a 4u server. > By default, the txg commit will occur when 1/8 of memory is used > for writes. For 30 GBytes, that would mean a main memory of only > 240 Gbytes... feasible for modern servers. > > However, most folks won't stomach 15 SSDs for slog or 30 GBytes of > NVRAM in their arrays. So Bob's recommendation of reducing the > txg commit interval below 30 seconds also has merit. Or, to put it > another way, the dynamic sizing of the txg commit interval isn't > quite perfect yet. [Cue for Neil to chime in... :-)] I'm sorry did I miss something Bob said about the txg commit interval? I looked back and didn't see it, maybe it was off-list? -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
On Fri, Sep 25, 2009 at 10:56 PM, Toby Thain wrote: > > On 25-Sep-09, at 2:58 PM, Frank Middleton wrote: > >> On 09/25/09 11:08 AM, Travis Tabbal wrote: >>> >>> ... haven't heard if it's a known >>> bug or if it will be fixed in the next version... >> >> Out of courtesy to our host, Sun makes some quite competitive >> X86 hardware. I have absolutely no idea how difficult it is >> to buy Sun machines retail, > > Not very difficult. And there is try and buy. Indeed, at least in Spain and in Italy I had no problem buying workstations. Recently I owned both Sun Ultra 20 M2 and Ultra 24. I had a great feeling with them and price seemed very competitive to me, compared to offers of other mainstream hardware providers. > > People overestimate the cost of Sun, and underestimate the real value of > "fully integrated". +1. People like "fully integration" when it comes, for example, to Apple, iPods and iPhones. When it comes, just to make another example..., to Solaris, ZFS, ECC memory and so forth (do you remember those posts some time ago?), they quickly forget. > > --Toby > >> but it seems they might be missing >> out on an interesting market - robust and scalable SOHO servers >> for the DYI gang ... >> >> Cheers -- Frank >> >> >> ___ >> zfs-discuss mailing list >> zfs-discuss@opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Ελευθερία ή θάνατος "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." GPG key: 1024D/FD2229AF ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Fri, Sep 25, 2009 at 5:24 PM, James Lever wrote: > > On 26/09/2009, at 1:14 AM, Ross Walker wrote: > >> By any chance do you have copies=2 set? > > No, only 1. So the double data going to the slog (as reported by iostat) is > still confusing me and clearly potentially causing significant harm to my > performance. Weird then, I thought that would be an easy explaination. >> Also, try setting zfs_write_limit_override equal to the size of the >> NVRAM cache (or half depending on how long it takes to flush): >> >> echo zfs_write_limit_override/W0t268435456 | mdb -kw > > That’s an interesting concept. All data still appears to go via the slog > device, however, under heavy load my responsive to a new write is typically > below 2s (a few outliers at about 3.5s) and a read (directory listing of a > non-cached entry) is about 2s. > > What will this do once it hits the limit? Will streaming writes now be sent > directly to a txg and streamed to the primary storage devices? (that is > what I would like to see happen). It's sets the max size of a txg to the given size. When it hits that number it flushes to disk. >> As a side an slog device will not be too beneficial for large >> sequential writes, because it will be throughput bound not latency >> bound. slog devices really help when you have lots of small sync >> writes. A RAIDZ2 with the ZIL spread across it will provide much >> higher throughput then an SSD. An example of a workload that benefits >> from an slog device is ESX over NFS, which does a COMMIT for each >> block written, so it benefits from an slog, but a standard media >> server will not (but an L2ARC would be beneficial). >> >> Better workload analysis is really what it is about. > > > It seems that it doesn’t matter what the workload is if the NFS pipe can > sustain more continuous throughput the slog chain can support. Only on large sequentials, small sync IO should benefit from the slog. > I suppose some creative use of the logbias setting might assist this > situation and force all potentially heavy writers directly to the primary > storage. This would, however, negate any benefit for having a fast, low > latency device for those filesystems for the times when it is desirable (any > large batch of small writes, for example). > > Is there a way to have a dynamic, auto logbias type setting depending on the > transaction currently presented to the server such that if it is clearly a > large streaming write it gets treated as logbias=throughput and if it is a > small transaction it gets treated as logbias=latency? (i.e. such that NFS > transactions can be effectively treated as if it was local storage but > minorly breaking the benefits of the txg scheduling). I'll leave that to the Sun guys to answer. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On 26/09/2009, at 1:14 AM, Ross Walker wrote: By any chance do you have copies=2 set? No, only 1. So the double data going to the slog (as reported by iostat) is still confusing me and clearly potentially causing significant harm to my performance. Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half depending on how long it takes to flush): echo zfs_write_limit_override/W0t268435456 | mdb -kw That’s an interesting concept. All data still appears to go via the slog device, however, under heavy load my responsive to a new write is typically below 2s (a few outliers at about 3.5s) and a read (directory listing of a non-cached entry) is about 2s. What will this do once it hits the limit? Will streaming writes now be sent directly to a txg and streamed to the primary storage devices? (that is what I would like to see happen). As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much higher throughput then an SSD. An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (but an L2ARC would be beneficial). Better workload analysis is really what it is about. It seems that it doesn’t matter what the workload is if the NFS pipe can sustain more continuous throughput the slog chain can support. I suppose some creative use of the logbias setting might assist this situation and force all potentially heavy writers directly to the primary storage. This would, however, negate any benefit for having a fast, low latency device for those filesystems for the times when it is desirable (any large batch of small writes, for example). Is there a way to have a dynamic, auto logbias type setting depending on the transaction currently presented to the server such that if it is clearly a large streaming write it gets treated as logbias=throughput and if it is a small transaction it gets treated as logbias=latency? (i.e. such that NFS transactions can be effectively treated as if it was local storage but minorly breaking the benefits of the txg scheduling). On 26/09/2009, at 3:39 AM, Richard Elling wrote: Back of the envelope math says: 10 Gbe = ~1 GByte/sec of I/O capacity If the SSD can only sink 70 MByte/s, then you will need: int(1000/70) + 1 = 15 SSDs for the slog For capacity, you need: 1 GByte/sec * 30 sec = 30 GBytes Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes or so. At this point, enter the fusionIO cards or similar devices. Unfortunately there does not seem to be anything on the market with infinitely fast write capacity (memory speeds) that is also supported under OpenSolaris as a slog device. I think this is precisely what I (and anybody running a general purpose NFS server) need for a general purpose slog device. Both of the above assume there is lots of memory in the server. This is increasingly becoming easier to do as the memory costs come down and you can physically fit 512 GBytes in a 4u server. By default, the txg commit will occur when 1/8 of memory is used for writes. For 30 GBytes, that would mean a main memory of only 240 Gbytes... feasible for modern servers. However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays. 
So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit. Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet. [Cue for Neil to chime in... :-)] How does reducing the txg commit interval really help? WIll data no longer go via the slog once it is streaming to disk? or will data still all be pushed through the slog regardless? For a predominantly NFS server purpose, it really looks like a case of the slog has to outperform your main pool for continuous write speed as well as an instant response time as the primary criterion. Which might as well be a fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of them. Is there also a way to throttle synchronous writes to the slog device? Much like the ZFS write throttling that is already implemented, so that there is a gap for new writers to enter when writing to the slog device? (or is this the norm and includes slog writes?) cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
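The per-dataset logbias property already gives a coarse, static version of this; a sketch with illustrative dataset names:

zfs set logbias=throughput tank/streams   # bulk/streaming writers bypass the slog
zfs set logbias=latency tank/homes        # small sync writes keep using the slog
zfs get logbias tank/streams tank/homes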
Re: [zfs-discuss] Which directories must be part of rpool?
On Sep 25, 2009, at 16:39, Glenn Lagasse wrote: There's very little you can safely move in my experience. /export certainly. Anything else, not really (though ymmv). I tried to create a seperate zfs dataset for /usr/local. That worked some of the time, but it also screwed up my system a time or two during image-updates/package installs. I'd be very surprised (disappointed?) if /usr/local couldn't be detached from the rpool. Given that in many cases it's an NFS mount, I'm curious to know why it would need to be part of the rpool. If it is a 'dependency' I would consider that a bug. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
Hi David, I believe /opt is an essential file system as it contains software that is maintained by the packaging system. In fact anywhere you install software via pkgadd probably should be in the BE under /rpool/ROOT/bename AFIK it should not even be split from root in the BE under zfs boot (only /var is supported) other wise LU breaks. I have sub directories of /opt like /aop/app which does not contain software installed via pkgadd. I also split off /var/core and /var/crash. Unfortunately when you need to boot -F and import the pool for maintenance it doesn't mount /var causing directory /var/core and /var/crash to be created in the root file system. The system then reboots but when you do a lucreate, or lumount it fails due to /var/core and /var/crash existing on the / file system causing the mount of /var to fail in the ABE. I have found it a bit problematic to split of file systems from / under zfs boot and still have LU work properly. I haven't tried putting split off file systems as apposed to application file systems on a different pool but I believe there may be mount ordering issues with mounting dependent file systems from different pools where the parent file system are not part of the BE or legacy mounts. It is not possible to mount a vxfs file system under a non legacy zone root file system due to ordering issues with mounting on boot (legacy is done before automatic zfs mounts). Perhaps u7 addressed some of there issues as I believe it is now allowable to have zone root file system on a non root pool. These are just my experiences and I'm sure others can give more definitive answers. Perhaps its easier to get some bigger disks. Thanks Peter 2009/9/25 David Abrahams : > > on Fri Sep 25 2009, Cindy Swearingen wrote: > >> Hi David, >> >> All system-related components should remain in the root pool, such as >> the components needed for booting and running the OS. > > Yes, of course. But which *are* those? > >> If you have datasets like /export/home or other non-system-related >> datasets in the root pool, then feel free to move them out. > > Well, for example, surely /opt can be moved? > >> Moving OS components out of the root pool is not tested by us and I've >> heard of one example recently of breakage when usr and var were moved >> to a non-root RAIDZ pool. >> >> It would be cheaper and easier to buy another disk to mirror your root >> pool then it would be to take the time to figure out what could move out >> and then possibly deal with an unbootable system. >> >> Buy another disk and we'll all sleep better. > > Easy for you to say. There's no room left in the machine for another disk. > > -- > Dave Abrahams > BoostPro Computing > http://www.boostpro.com > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ARC vs Oracle cache
Hi,

Definitely large SGA, small ARC. In fact, it's best to disable the ARC altogether for the Oracle filesystems. Blocks in the db_cache (Oracle cache) can be used "as is", while cached data from the ARC needs significant CPU processing before it's inserted back into the db_cache. Not to mention that blocks in the db_cache can remain dirty for longer periods, saving disk writes.

But definitely:
- separate redo disk (preferably a dedicated disk/pool)
- your ZFS filesystem needs to match the Oracle block size (8 KB default)

With your configuration, and assuming nothing else (but the Oracle database server) on the system, a db_cache size in the 70 GiB range would be perfectly acceptable. Don't forget to set pga_aggregate_target to something reasonable too, like 20 GiB.

Christo Kutrovsky Senior DBA The Pythian Group I Blog at: www.pythian.com/news -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
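A sketch of how those recommendations map onto dataset properties — the names are illustrative, and primarycache=metadata (or =none, to keep data out of the ARC entirely) is the usual lever for the "disable the ARC for data" advice:

zfs create -o recordsize=8k -o primarycache=metadata tank/oradata   # match db_block_size
zfs create tank/oraredo                                             # redo on its own dataset (ideally its own pool/disks)
zfs get recordsize,primarycache tank/oradata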
Re: [zfs-discuss] White box server for OpenSolaris
On 25-Sep-09, at 2:58 PM, Frank Middleton wrote: On 09/25/09 11:08 AM, Travis Tabbal wrote: ... haven't heard if it's a known bug or if it will be fixed in the next version... Out of courtesy to our host, Sun makes some quite competitive X86 hardware. I have absolutely no idea how difficult it is to buy Sun machines retail, Not very difficult. And there is try and buy. People overestimate the cost of Sun, and underestimate the real value of "fully integrated". --Toby but it seems they might be missing out on an interesting market - robust and scalable SOHO servers for the DYI gang ... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
* David Abrahams (d...@boostpro.com) wrote: > > on Fri Sep 25 2009, Cindy Swearingen wrote: > > > Hi David, > > > > All system-related components should remain in the root pool, such as > > the components needed for booting and running the OS. > > Yes, of course. But which *are* those? > > > If you have datasets like /export/home or other non-system-related > > datasets in the root pool, then feel free to move them out. > > Well, for example, surely /opt can be moved? Don't be so sure. > > Moving OS components out of the root pool is not tested by us and I've > > heard of one example recently of breakage when usr and var were moved > > to a non-root RAIDZ pool. > > > > It would be cheaper and easier to buy another disk to mirror your root > > pool then it would be to take the time to figure out what could move out > > and then possibly deal with an unbootable system. > > > > Buy another disk and we'll all sleep better. > > Easy for you to say. There's no room left in the machine for another disk. The question you're asking can't easily be answered. Sun doesn't test configs like that. If you really want to do this, you'll pretty much have to 'try it and see what breaks'. And you get to keep both pieces if anything breaks. There's very little you can safely move in my experience. /export certainly. Anything else, not really (though ymmv). I tried to create a seperate zfs dataset for /usr/local. That worked some of the time, but it also screwed up my system a time or two during image-updates/package installs. On my 2010.02/123 system I see: bin Symlink to /usr/bin boot/ dev/ devices/ etc/ export/ Safe to move, not tied to the 'root' system kernel/ lib/ media/ mnt/ net/ opt/ platform/ proc/ rmdisk/ root/ Could probably move root's homedir rpool/ sbin/ system/ tmp/ usr/ var/ Other than /export, everything else is considered 'part of the root system'. Thus part of the root pool. Really, if you can't add a mirror for your root pool, then make backups of your root pool (left as an exercise to the reader) and store the non-system specific bits (/export) on you're raidz2 pool. Cheers, -- Glenn ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
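For the backup left as an exercise, a recursive snapshot plus send is often enough; a hedged sketch (paths are illustrative, and restoring still means following the documented root pool recovery procedure):

zfs snapshot -r rpool@backup
zfs send -R rpool@backup | gzip > /tank/backups/rpool.backup.gz    # or pipe to zfs receive on another pool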
Re: [zfs-discuss] Which directories must be part of rpool?
I have no idea why that last mail lost its line feeds. Trying again: On 09/25/09 13:35, David Abrahams wrote: Hi, Since I don't even have a mirror for my root pool "rpool," I'd like to move as much of my system as possible over to my raidz2 pool, "tank." Can someone tell me which parts need to stay in rpool in order for the system to work normally? Thanks. The list of datasets in a root pool should look something like this: rpool rpool/ROOT rpool/ROOT/snv_124 (or whatever version you're running) rpool/ROOT/snv_124/var (you might not have this) rpool/ROOT/snv_121 (or whatever other BEs you still have) rpool/dump rpool/export rpool/export/home rpool/swap plus any other datasets you might have added. Datasets you've added in addition to the above (unless they are zone roots under rpool/ROOT/ ) can be moved to another pool. Anything you have in /export or /export/ home can be moved to another pool. Everything else needs to stay in the root pool. Yes, there are contents of the above datasets that could be moved and your system would still run (you'd have to play with mount points or symlinks to get them included in the Solaris name space), but such a configuration would be non-standard, unsupported, and probably not upgradeable. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] extremely slow writes (with good reads)
Oh, for the record, the drives are 1.5TB SATA, in a 4+1 raidz-1 config. All the drives are on the same LSI 150-6 PCI controller card, and the M/B is a generic something or other with a triple-core, and 2GB RAM. Paul 3:34pm, Paul Archer wrote: Since I got my zfs pool working under solaris (I talked on this list last week about moving it from linux & bsd to solaris, and the pain that was), I'm seeing very good reads, but nada for writes. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/25/09 13:35, David Abrahams wrote: Hi, Since I don't even have a mirror for my root pool "rpool," I'd like to move as much of my system as possible over to my raidz2 pool, "tank." Can someone tell me which parts need to stay in rpool in order for the system to work normally? Thanks. The list of datasets in a root pool should look something like this: rpool rpool/ROOT rpool/ROOT/snv_124 (or whatever version you're running) rpool/ROOT/snv_124/var (you might not have this) rpool/ROOT/snv_121 (or whatever other BEs you still have) rpool/dump rpool/export rpool/export/home rpool/swap plus any other datasets you might have added. Datasets you've added in addition to the above (unless they are zone roots under rpool/ROOT/ ) can be moved to another pool. Anything you have in /export or /export/ home can be moved to another pool. Everything else needs to stay in the root pool. Yes, there are contents of the above datasets that could be moved and your system would still run (you'd have to play with mount points or symlinks to get them included in the Solaris name space), but such a configuration would be non-standard, unsupported, and probably not upgradeable. lori ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] extremely slow writes (with good reads)
Since I got my zfs pool working under solaris (I talked on this list last week about moving it from linux & bsd to solaris, and the pain that was), I'm seeing very good reads, but nada for writes.

Reads:

r...@shebop:/data/dvds# rsync -aP young_frankenstein.iso /tmp
sending incremental file list
young_frankenstein.iso
^C 1032421376  20%   86.23MB/s    0:00:44

Writes:

r...@shebop:/data/dvds# rsync -aP /tmp/young_frankenstein.iso yf.iso
sending incremental file list
young_frankenstein.iso
^C   68976640   6%    2.50MB/s    0:06:42

This is pretty typical of what I'm seeing.

r...@shebop:/data/dvds# zpool status -v
  pool: datapool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2d0s0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c4d0s0  ONLINE       0     0     0
            c6d0s0  ONLINE       0     0     0
            c5d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: syspool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        syspool     ONLINE       0     0     0
          c0d1s0    ONLINE       0     0     0

errors: No known data errors

(This is while running an rsync from a remote machine to a ZFS filesystem)

r...@shebop:/data/dvds# iostat -xn 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   11.1    4.8  395.8  275.9  5.8  0.1  364.7    4.3   2   5 c0d1
    9.8   10.9  514.3  346.4  6.8  1.4  329.7   66.7  68  70 c5d0
    9.8   10.9  516.6  346.4  6.7  1.4  323.1   66.2  67  70 c6d0
    9.7   10.9  491.3  346.3  6.7  1.4  324.7   67.2  67  70 c3d0
    9.8   10.9  519.9  346.3  6.8  1.4  326.7   67.2  68  71 c4d0
    9.8   11.0  493.5  346.6  3.6  0.8  175.3   37.9  38  41 c2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0d1
   64.6   12.6 8207.4  382.1 32.8  2.0  424.7   25.9 100 100 c5d0
   62.2   12.2 7203.2  370.1 27.9  2.0  375.1   26.7  99 100 c6d0
   53.2   11.8 5973.9  390.2 25.9  2.0  398.8   30.5  98  99 c3d0
   49.4   10.6 5398.2  389.8 30.2  2.0  503.7   33.3  99 100 c4d0
   45.2   12.8 5431.4  337.0 14.3  1.0  247.3   17.9  52  52 c2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0

Any ideas?

Paul ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
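Per-vdev and per-disk views over the same interval would help separate a pool-layout problem from a shared-controller bottleneck (the follow-up elsewhere in the thread notes all drives sit on one PCI LSI card); a generic sketch:

zpool iostat -v datapool 5     # per-vdev ops and bandwidth while the slow rsync runs
iostat -xnz 5                  # per-disk service times; -z hides idle devices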
Re: [zfs-discuss] Which directories must be part of rpool?
on Fri Sep 25 2009, Cindy Swearingen wrote: > Hi David, > > All system-related components should remain in the root pool, such as > the components needed for booting and running the OS. Yes, of course. But which *are* those? > If you have datasets like /export/home or other non-system-related > datasets in the root pool, then feel free to move them out. Well, for example, surely /opt can be moved? > Moving OS components out of the root pool is not tested by us and I've > heard of one example recently of breakage when usr and var were moved > to a non-root RAIDZ pool. > > It would be cheaper and easier to buy another disk to mirror your root > pool then it would be to take the time to figure out what could move out > and then possibly deal with an unbootable system. > > Buy another disk and we'll all sleep better. Easy for you to say. There's no room left in the machine for another disk. -- Dave Abrahams BoostPro Computing http://www.boostpro.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
Hi David, All system-related components should remain in the root pool, such as the components needed for booting and running the OS. If you have datasets like /export/home or other non-system-related datasets in the root pool, then feel free to move them out. Moving OS components out of the root pool is not tested by us and I've heard of one example recently of breakage when usr and var were moved to a non-root RAIDZ pool. It would be cheaper and easier to buy another disk to mirror your root pool then it would be to take the time to figure out what could move out and then possibly deal with an unbootable system. Buy another disk and we'll all sleep better. Cindy On 09/25/09 13:35, David Abrahams wrote: Hi, Since I don't even have a mirror for my root pool "rpool," I'd like to move as much of my system as possible over to my raidz2 pool, "tank." Can someone tell me which parts need to stay in rpool in order for the system to work normally? Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded
On Sep 25, 2009, at 2:43 PM, Robert Milkowski wrote: Chris Kirby wrote: On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote: That's useful information indeed. I've filed this CR: 6885860 zfs send shouldn't require support for snapshot holds Sorry for the trouble, please look for this to be fixed soon. Thank you. btw: how do you want to fix it? Do you want to acquire a snapshot hold but continue anyway if it is not possible (only in case whene error is ENOTSUP I think)? Or do you want to get rid of it entirely? In this particular case, we should make sure the pool version supports snapshot holds before trying to request (or release) any. We still want to acquire the temporary holds if we can, since that prevents a race with zfs destroy. That case is becoming more common with automated snapshots and their associated retention policies. -Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded
Chris Kirby wrote: On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote: Hi, I have a zfs send command failing for some reason... # uname -a SunOS 5.11 snv_123 i86pc i386 i86pc Solaris # zfs send -R -I archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 >/dev/null cannot hold 'archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59': pool must be upgraded # zfs list -r -t all archive-1/archive/ NAME USED AVAIL REFER MOUNTPOINT archive-1/archive/ 65.6G 7.69T 8.69G /archive-1/archive/ archive-1/archive/x...@rsync-2009-04-21_14:52--2009-04-21_15:13 11.9G - 12.0G - archive-1/archive/x...@rsync-2009-05-01_07:45--2009-05-01_08:06 12.0G - 12.1G - archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 12.2G - 12.3G - archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59 8.26G - 8.37G - archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14 12.6G - 12.7G - archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 0 - 8.69G - The pool is at version 14 and all file systems are at version 3. Ahhh... if -R is provided zfs send now calls zfs_hold_range() which later fails in dsl_dataset_user_hold_check() as it checks if dataset is not below SPA_VERSION_USERREFS which is defined as SPA_VERSION_18 and in my case it is 14 so it fails. But I don't really want to upgrade to version 18 as then I won't be able to reboot back to snv_111b (which supports up-to version 14 only). I guess if I would use libzfs from older build it would work as keeping a user hold is not really required... I can understand why it was introduced I'm just unhappy that I can't do zfs send -R -I now without upgrading a pool Probably no point sending the email, as I was looking at the code and dtracing while writing it, but since I've written it I will post it. Maybe someone will find it useful. Robert, That's useful information indeed. I've filed this CR: 6885860 zfs send shouldn't require support for snapshot holds Sorry for the trouble, please look for this to be fixed soon. Thank you. btw: how do you want to fix it? Do you want to acquire a snapshot hold but continue anyway if it is not possible (only in case whene error is ENOTSUP I think)? Or do you want to get rid of it entirely? -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to convert checksums
I didn't want my question to lead to an answer, but perhaps I should have put more information. My idea is to copy the file system with one of the following: cp -rp zfs send | zfs receive tar cpio But I don't know what would be the best. Then I would do a "diff -r" on them before deleting the old. I don't know the "obscure" (for me) secondary things like attributes, links, extended modes, etc. Thanks again. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
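Of the options listed, send/receive is the one that preserves ZFS-level attributes and lets the new copy pick up the new checksum as it is written; a sketch, assuming the goal is to rewrite data under a new checksum property, that the target dataset does not exist yet, and that all names are illustrative:

zfs set checksum=sha256 tank                               # new blocks under tank use the new checksum
zfs snapshot tank/olddata@migrate
zfs send tank/olddata@migrate | zfs receive tank/newdata   # received blocks are written (and checksummed) fresh
zfs get checksum tank/newdata
diff -r /tank/olddata /tank/newdata                        # spot-check before destroying the old dataset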
[zfs-discuss] Which directories must be part of rpool?
Hi, Since I don't even have a mirror for my root pool "rpool," I'd like to move as much of my system as possible over to my raidz2 pool, "tank." Can someone tell me which parts need to stay in rpool in order for the system to work normally? Thanks. -- Dave Abrahams BoostPro Computing http://www.boostpro.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New to ZFS: One LUN, multiple zones
2009/9/24 Robert Milkowski
> Mike Gerdts wrote:
>> On Wed, Sep 23, 2009 at 7:32 AM, bertram fukuda wrote:
>>> Thanks for the info Mike.
>>> Just so I'm clear, you suggest: 1) create a single zpool from my LUN, 2) create a single ZFS filesystem, 3) create 2 zones in the ZFS filesystem. Sound right?
>>
>> Correct
>
> Well, I would actually recommend creating a dedicated zfs file system for each zone (which zoneadm should do for you anyway). The reason is that it is then much easier to get information on how much storage each zone is using, you can set a quota or reservation for storage for each zone independently, you can easily clone each zone, snapshot it, etc.

Another thing: if you will use Live Upgrade (and as I understand it, "pkg image-update" does that seamlessly), then besides putting each zone on its own filesystem you should also add another two datasets to be delegated to the zones, where they can store their data. This ensures that during LU you don't boot up with slightly old data in the zones. For example, this could be very important on mail servers, so you don't "forget" new mail in spool directories that arrived after the new boot environment was created but before the reboot.

> --
> Robert Milkowski
> http://milek.blogspot.com
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
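A rough sketch of the layout described above, using hypothetical names (tank is the pool built on the LUN, web1 is one of the two zones, and the zone is assumed to already be configured; only the storage pieces and the dataset delegation are shown):

# zfs create tank/zones
# zfs create tank/zones/web1
# zfs set quota=20g tank/zones/web1
# zfs create tank/delegated
# zfs create tank/delegated/web1
# zonecfg -z web1
zonecfg:web1> add dataset
zonecfg:web1:dataset> set name=tank/delegated/web1
zonecfg:web1:dataset> end
zonecfg:web1> exit

The zonepath would point at /tank/zones/web1. With this layout, zfs list shows per-zone usage at a glance, and quotas, reservations, snapshots and clones can be applied to each zone independently, as described above.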
Re: [zfs-discuss] White box server for OpenSolaris
On 09/25/09 11:08 AM, Travis Tabbal wrote: ... haven't heard if it's a known bug or if it will be fixed in the next version...

Out of courtesy to our host, I'll note that Sun makes some quite competitive x86 hardware. I have absolutely no idea how difficult it is to buy Sun machines retail, but it seems they might be missing out on an interesting market - robust and scalable SOHO servers for the DIY gang - certainly OEMs like us recommend them, although there doesn't seem to be a single-box file+application server in the lineup, which might be a disadvantage to some. Also, assuming Oracle keeps the product line going, we plan to give them a serious look when we finally have to replace those sturdy old SPARCs. Unfortunately there aren't entry-level SPARCs in the lineup, but sadly there probably isn't a big enough market to justify them, and small developers don't need the big iron.

It would be interesting to hear from Sun whether they have any specific recommendations for the use of Suns in the DIY SOHO market; AFAIK it is the profits from hardware that go a long way toward supporting Sun's support of FOSS that we are all benefiting from, and there's a good bet that OpenSolaris will run well on Sun hardware :-) Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cloning Systems using zpool
The whole pool. Although you can choose to exclude individual datasets from the flar when creating it. lori On 09/25/09 12:03, Peter Pickford wrote: Hi Lori, Is the u8 flash support for the whole root pool or an individual BE using live upgrade? Thanks Peter 2009/9/24 Lori Alt : On 09/24/09 15:54, Peter Pickford wrote: Hi Cindy, Wouldn't touch /reconfigure mv /etc/path_to_inst* /var/tmp/ regenerate all device information? It might, but it's hard to say whether that would accomplish everything needed to move a root file system from one system to another. I just got done modifying flash archive support to work with zfs root on Solaris 10 Update 8. For those not familiar with it, "flash archives" are a way to clone full boot environments across multiple machines. The S10 Solaris installer knows how to install one of these flash archives on a system and then do all the customizations to adapt it to the local hardware and local network environment. I'm pretty sure there's more to the customization than just a device reconfiguration. So feel free to hack together your own solution. It might work for you, but don't assume that you've come up with a completely general way to clone root pools. lori AFIK zfs doesn't care about the device names it scans for them it would only affect things like vfstab. I did a restore from a E2900 to V890 and is seemed to work Created the pool and zfs recieve. I would like to be able to have a zfs send of a minimal build and install it in an abe and activate it. I tried that is test and it seems to work. It seems to work but IM just wondering what I may have missed. I saw someone else has done this on the list and was going to write a blog. It seems like a good way to get a minimal install on a server with reduced downtime. Now if I just knew how to run the installer in and abe without there being an OS there already that would be cool too. Thanks Peter 2009/9/24 Cindy Swearingen : Hi Peter, I can't provide it because I don't know what it is. Even if we could provide a list of items, tweaking the device informaton if the systems are not identical would be too difficult. cs On 09/24/09 12:04, Peter Pickford wrote: Hi Cindy, Could you provide a list of system specific info stored in the root pool? Thanks Peter 2009/9/24 Cindy Swearingen : Hi Karl, Manually cloning the root pool is difficult. We have a root pool recovery procedure that you might be able to apply as long as the systems are identical. I would not attempt this with LiveUpgrade and manually tweaking. http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery The problem is that the amount system-specific info stored in the root pool and any kind of device differences might be insurmountable. Solaris 10 ZFS/flash archive support is available with patches but not for the Nevada release. The ZFS team is working on a split-mirrored-pool feature and that might be an option for future root pool cloning. If you're still interested in a manual process, see the steps below attempted by another community member who moved his root pool to a larger disk on the same system. This is probably more than you wanted to know... 
Cindy

# zpool create -f altrpool c1t1d0s0
# zpool set listsnapshots=on rpool
# SNAPNAME=`date +%Y%m%d`
# zfs snapshot -r rpool/r...@$snapname
# zfs list -t snapshot
# zfs send -R rp...@$snapname | zfs recv -vFd altrpool
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0
for x86 do
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
Set the bootfs property on the root pool BE.
# zpool set bootfs=altrpool/ROOT/zfsBE altrpool
# zpool export altrpool
# init 5
remove source disk (c1t0d0s0) and move target disk (c1t1d0s0) to slot0
-insert solaris10 dvd
ok boot cdrom -s
# zpool import altrpool rpool
# init 0
ok boot disk1

On 09/24/09 10:06, Karl Rossing wrote:

I would like to clone the configuration on a v210 with snv_115. The current pool looks like this:

-bash-3.2$ /usr/sbin/zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

After I run zpool detach rpool c1t1d0s0, how can I remount c1t1d0s0 to /tmp/a so that I can make the changes I need prior to removing the drive and putting it into the new v210?

I suppose I could lucreate -n new_v210, lumount new_v210, edit what I need to, luumount new_v210, luactivate new_v210, zpool detach rpool c1t1d0s0 and then luactivate the original boot environment.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___
[zfs-discuss] Best way to convert checksums
What is the "Best" way to convert the checksums of an existing ZFS file system from one checksum to another? To me "Best" means safest and most complete. My zpool is 39% used, so there is plenty of space available. Thanks. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cloning Systems using zpool
Hi Lori, Is the u8 flash support for the whole root pool or an individual BE using live upgrade? Thanks Peter 2009/9/24 Lori Alt : > On 09/24/09 15:54, Peter Pickford wrote: > > Hi Cindy, > > Wouldn't > > touch /reconfigure > mv /etc/path_to_inst* /var/tmp/ > > regenerate all device information? > > > It might, but it's hard to say whether that would accomplish everything > needed to move a root file system from one system to another. > > I just got done modifying flash archive support to work with zfs root on > Solaris 10 Update 8. For those not familiar with it, "flash archives" are a > way to clone full boot environments across multiple machines. The S10 > Solaris installer knows how to install one of these flash archives on a > system and then do all the customizations to adapt it to the local hardware > and local network environment. I'm pretty sure there's more to the > customization than just a device reconfiguration. > > So feel free to hack together your own solution. It might work for you, but > don't assume that you've come up with a completely general way to clone root > pools. > > lori > > AFIK zfs doesn't care about the device names it scans for them > it would only affect things like vfstab. > > I did a restore from a E2900 to V890 and is seemed to work > > Created the pool and zfs recieve. > > I would like to be able to have a zfs send of a minimal build and > install it in an abe and activate it. > I tried that is test and it seems to work. > > It seems to work but IM just wondering what I may have missed. > > I saw someone else has done this on the list and was going to write a blog. > > It seems like a good way to get a minimal install on a server with > reduced downtime. > > Now if I just knew how to run the installer in and abe without there > being an OS there already that would be cool too. > > Thanks > > Peter > > 2009/9/24 Cindy Swearingen : > > > Hi Peter, > > I can't provide it because I don't know what it is. > > Even if we could provide a list of items, tweaking > the device informaton if the systems are not identical > would be too difficult. > > cs > > On 09/24/09 12:04, Peter Pickford wrote: > > > Hi Cindy, > > Could you provide a list of system specific info stored in the root pool? > > Thanks > > Peter > > 2009/9/24 Cindy Swearingen : > > > Hi Karl, > > Manually cloning the root pool is difficult. We have a root pool recovery > procedure that you might be able to apply as long as the > systems are identical. I would not attempt this with LiveUpgrade > and manually tweaking. > > > http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery > > The problem is that the amount system-specific info stored in the root > pool and any kind of device differences might be insurmountable. > > Solaris 10 ZFS/flash archive support is available with patches but not > for the Nevada release. > > The ZFS team is working on a split-mirrored-pool feature and that might > be an option for future root pool cloning. > > If you're still interested in a manual process, see the steps below > attempted by another community member who moved his root pool to a > larger disk on the same system. > > This is probably more than you wanted to know... 
> > Cindy > > > > # zpool create -f altrpool c1t1d0s0 > # zpool set listsnapshots=on rpool > # SNAPNAME=`date +%Y%m%d` > # zfs snapshot -r rpool/r...@$snapname > # zfs list -t snapshot > # zfs send -R rp...@$snapname | zfs recv -vFd altrpool > # installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk > /dev/rdsk/c1t1d0s0 > for x86 do > # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0 > Set the bootfs property on the root pool BE. > # zpool set bootfs=altrpool/ROOT/zfsBE altrpool > # zpool export altrpool > # init 5 > remove source disk (c1t0d0s0) and move target disk (c1t1d0s0) to slot0 > -insert solaris10 dvd > ok boot cdrom -s > # zpool import altrpool rpool > # init 0 > ok boot disk1 > > On 09/24/09 10:06, Karl Rossing wrote: > > > I would like to clone the configuration on a v210 with snv_115. > > The current pool looks like this: > > -bash-3.2$ /usr/sbin/zpool status pool: rpool > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > rpool ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c1t0d0s0 ONLINE 0 0 0 > c1t1d0s0 ONLINE 0 0 0 > > errors: No known data errors > > After I run zpool detach rpool c1t1d0s0, how can I remount c1t1d0s0 to > /tmp/a so that I can make the changes I need prior to removing the drive > and > putting it into the new v210. > > I supose I could lucreate -n new_v210, lumount new_v210, edit what I > need > to, luumount new_v210, luactivate new_v210, zpool detach rpool c1t1d0s0 > and > then luactivate the original boot environment. > > > ___ > zfs-discuss ma
Re: [zfs-discuss] ZFS flar image.
Hi Peter, Do you have any notes on what you did to restore a sendfile to an existing BE? I'm interested in creating a 'golden image' and restoring it into a new BE on a running system as part of a hardening project. Thanks Peter

2009/9/14 Peter Karlsson :
> Hi Greg,
>
> We did a hack along those lines when we installed 100 Ultra 27s that were used during J1, but we automated the process by using AI to install a bootstrap image that had an SMF service that pulled over the zfs sendfile, created a new BE and received the sendfile into the new BE. Worked fairly OK; there were a few things that we had to run a few scripts to fix, but by and large it was smooth. I really need to get that blog entry done :)
>
> /peter
>
> Greg Mason wrote:
>>
>> As an alternative, I've been taking a snapshot of rpool on the golden system, sending it to a file, and creating a boot environment from the archived snapshot on target systems. After fiddling with the snapshots a little, I then either appropriately anonymize the system or provide it with its identity. When it boots up, it's ready to go.
>>
>> The only downfall to my method is that I still have to run the full OpenSolaris installer, and I can't exclude anything in the archive.
>>
>> Essentially, it's a poor man's flash archive.
>>
>> -Greg
>>
>> cindy.swearin...@sun.com wrote:
>>>
>>> Hi RB,
>>>
>>> We have a draft of the ZFS/flar image support here:
>>>
>>> http://opensolaris.org/os/community/zfs/boot/flash/
>>>
>>> Make sure you review the Solaris OS requirements.
>>>
>>> Thanks,
>>>
>>> Cindy
>>>
>>> On 09/14/09 11:45, RB wrote: Is it possible to create a flar image of a ZFS root filesystem to install it on other machines?
>>>
>>> ___
>>> zfs-discuss mailing list
>>> zfs-discuss@opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
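A very rough outline of the receive-into-a-new-BE idea discussed above, assuming an OpenSolaris-style root pool named rpool and a send file golden.zfs taken from a root dataset (all names hypothetical); the property fix-ups and activation details vary by build, so treat this as a sketch rather than a tested recipe:

# zfs receive rpool/ROOT/golden < /var/tmp/golden.zfs
# zfs set canmount=noauto rpool/ROOT/golden
# zfs set mountpoint=/ rpool/ROOT/golden
# beadm activate golden
# init 6

Any host-specific identity (hostname, network configuration, keys) still has to be fixed up before or at first boot, as Greg describes above.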
Re: [zfs-discuss] periodic slow responsiveness
On Sep 25, 2009, at 9:14 AM, Ross Walker wrote: On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn wrote: On Fri, 25 Sep 2009, Ross Walker wrote: As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much Surely this depends on the origin of the large sequential writes. If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win. If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping. If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second. Since data is buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance. Specifically I was talking NFS as that was what the OP was talking about, but yes it does depend on the origin, but you also assume that NFS IO goes over only a single 1Gbe interface when it could be over multiple 1Gbe interfaces or a 10Gbe interface or even multple 10Gbe interfaces. You also assume the IO recorded in the ZIL is just the raw IO when there is also meta-data or multiple transaction copies as well. Personnally I still prefer to spread the ZIL across the pool and have a large NVRAM backed HBA as opposed to an slog which really puts all my IO in one basket. If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste. Back of the envelope math says: 10 Gbe = ~1 GByte/sec of I/O capacity If the SSD can only sink 70 MByte/s, then you will need: int(1000/70) + 1 = 15 SSDs for the slog For capacity, you need: 1 GByte/sec * 30 sec = 30 GBytes Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes or so. Both of the above assume there is lots of memory in the server. This is increasingly becoming easier to do as the memory costs come down and you can physically fit 512 GBytes in a 4u server. By default, the txg commit will occur when 1/8 of memory is used for writes. For 30 GBytes, that would mean a main memory of only 240 Gbytes... feasible for modern servers. However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays. So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit. Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet. [Cue for Neil to chime in... :-)] -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
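The same back-of-the-envelope numbers as a quick shell calculation, purely to make the arithmetic above easy to rerun with different figures; the 70 MByte/s sustained SSD write rate, the ~1 GByte/s of 10 GbE capacity, and the 30 second txg interval are the assumptions already stated in the discussion:

echo $(( 1000 / 70 + 1 ))   # SSDs needed to absorb ~1 GByte/s -> 15
echo $(( 1 * 30 ))          # GBytes of slog/NVRAM to cover a 30 s txg at 1 GByte/s -> 30
echo $(( 30 * 8 ))          # GBytes of RAM at which the default 1/8 write limit reaches 30 GBytes -> 240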
Re: [zfs-discuss] selecting zfs BE from OBP
Hi Donour, You would use the boot -L syntax to select the ZFS BE to boot from, like this: ok boot -L Rebooting with command: boot -L Boot device: /p...@8,60/SUNW,q...@4/f...@0,0/d...@w2104cf7fa6c7,0:a File and args: -L 1 zfs1009BE 2 zfs10092BE Select environment to boot: [ 1 - 2 ]: 2 Then copy and paste the boot string that is provided: To boot the selected entry, invoke: boot [] -Z rpool/ROOT/zfs10092BE Program terminated {0} ok boot -Z rpool/ROOT/zfs10092BE See this pointer as well: http://docs.sun.com/app/docs/doc/819-5461/ggpco?a=view Cindy On 09/25/09 11:09, Donour Sizemore wrote: Can you select the LU boot environment from sparc obp, if the filesystem is zfs? With ufs, you simply invoke 'boot [slice]'. thanks donour ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] selecting zfs BE from OBP
Can you select the LU boot environment from sparc obp, if the filesystem is zfs? With ufs, you simply invoke 'boot [slice]'. thanks donour ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded
On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote: Hi, I have a zfs send command failing for some reason... # uname -a SunOS 5.11 snv_123 i86pc i386 i86pc Solaris # zfs send -R -I archive-1/archive/ x...@rsync-2009-06-01_07:45--2009-06-01_08:50 archive-1/archive/ x...@rsync-2009-09-01_07:45--2009-09-01_07:59 >/dev/null cannot hold 'archive-1/archive/ x...@rsync-2009-06-01_07:45--2009-06-01_08:50': pool must be upgraded cannot hold 'archive-1/archive/ x...@rsync-2009-07-01_07:45--2009-07-01_07:59': pool must be upgraded cannot hold 'archive-1/archive/ x...@rsync-2009-08-01_07:45--2009-08-01_10:14': pool must be upgraded cannot hold 'archive-1/archive/ x...@rsync-2009-09-01_07:45--2009-09-01_07:59': pool must be upgraded # zfs list -r -t all archive-1/archive/ NAME USED AVAIL REFER MOUNTPOINT archive-1/archive/ 65.6G 7.69T 8.69G /archive-1/archive/ archive-1/archive/x...@rsync-2009-04-21_14:52--2009-04-21_15:13 11.9G - 12.0G - archive-1/archive/x...@rsync-2009-05-01_07:45--2009-05-01_08:06 12.0G - 12.1G - archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 12.2G - 12.3G - archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59 8.26G - 8.37G - archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14 12.6G - 12.7G - archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 0 - 8.69G - The pool is at version 14 and all file systems are at version 3. Ahhh... if -R is provided zfs send now calls zfs_hold_range() which later fails in dsl_dataset_user_hold_check() as it checks if dataset is not below SPA_VERSION_USERREFS which is defined as SPA_VERSION_18 and in my case it is 14 so it fails. But I don't really want to upgrade to version 18 as then I won't be able to reboot back to snv_111b (which supports up-to version 14 only). I guess if I would use libzfs from older build it would work as keeping a user hold is not really required... I can understand why it was introduced I'm just unhappy that I can't do zfs send -R -I now without upgrading a pool Probably no point sending the email, as I was looking at the code and dtracing while writing it, but since I've written it I will post it. Maybe someone will find it useful. Robert, That's useful information indeed. I've filed this CR: 6885860 zfs send shouldn't require support for snapshot holds Sorry for the trouble, please look for this to be fixed soon. -Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NLM_DENIED_NOLOCKS Solaris 10u5 X4500
Try nfs-disc...@opensolaris.org -- richard On Sep 25, 2009, at 7:28 AM, Chris Banal wrote: This was previously posed to the sun-managers mailing list but the only reply I received recommended I post here at well. We have a production Solaris 10u5 / ZFS X4500 file server which is reporting NLM_DENIED_NOLOCKS immediately for any nfs locking request. The lockd does not appear to be busy so is it possible we have hit some sort of limit on the number of files that can be locked? Are there any items to check before restarting lockd / statd. This appears to have at least temporarily cleared up the issue. Thanks, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] cannot hold 'xxx': pool must be upgraded
Hi, I have a zfs send command failing for some reason... # uname -a SunOS 5.11 snv_123 i86pc i386 i86pc Solaris # zfs send -R -I archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 >/dev/null cannot hold 'archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14': pool must be upgraded cannot hold 'archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59': pool must be upgraded # zfs list -r -t all archive-1/archive/ NAME USED AVAIL REFER MOUNTPOINT archive-1/archive/ 65.6G 7.69T 8.69G /archive-1/archive/ archive-1/archive/x...@rsync-2009-04-21_14:52--2009-04-21_15:13 11.9G - 12.0G - archive-1/archive/x...@rsync-2009-05-01_07:45--2009-05-01_08:06 12.0G - 12.1G - archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 12.2G - 12.3G - archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59 8.26G - 8.37G - archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14 12.6G - 12.7G - archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 0 - 8.69G - The pool is at version 14 and all file systems are at version 3. Ahhh... if -R is provided zfs send now calls zfs_hold_range() which later fails in dsl_dataset_user_hold_check() as it checks if dataset is not below SPA_VERSION_USERREFS which is defined as SPA_VERSION_18 and in my case it is 14 so it fails. But I don't really want to upgrade to version 18 as then I won't be able to reboot back to snv_111b (which supports up-to version 14 only). I guess if I would use libzfs from older build it would work as keeping a user hold is not really required... I can understand why it was introduced I'm just unhappy that I can't do zfs send -R -I now without upgrading a pool Probably no point sending the email, as I was looking at the code and dtracing while writing it, but since I've written it I will post it. Maybe someone will find it useful. -- Robert Milkowski http://milek.blogspot.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
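For anyone else who hits this before the fix for CR 6885860 arrives: per the analysis above, the hold is only taken on the -R (replication) path, so one possible workaround sketch is to check the pool version and, if upgrading is not an option, send each file system's incrementals individually without -R. Dataset and snapshot names below are hypothetical:

# zpool get version archive-1
# zpool upgrade -v                          (lists what each pool version adds)
# zfs send -I tank/fs@snap1 tank/fs@snap4 > /var/tmp/fs.incr

This loses the recursion and property replication that -R provides, so it is only a stopgap.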
Re: [zfs-discuss] Help! System panic when pool imported
Assertion failures indicate bugs. You might try another version of the OS. In general, they are easy to search for in the bugs database. A quick search reveals http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6822816 but that doesn't look like it will help you. I suggest filing a new bug at the very least. http://en.wikipedia.org/wiki/Assertion_(computing) -- richard On Sep 24, 2009, at 10:21 PM, Albert Chin wrote: Running snv_114 on an X4100M2 connected to a 6140. Made a clone of a snapshot a few days ago: # zfs snapshot a...@b # zfs clone a...@b tank/a # zfs clone a...@b tank/b The system started panicing after I tried: # zfs snapshot tank/b...@backup So, I destroyed tank/b: # zfs destroy tank/b then tried to destroy tank/a # zfs destroy tank/a Now, the system is in an endless panic loop, unable to import the pool at system startup or with "zpool import". The panic dump is: panic[cpu1]/thread=ff0010246c60: assertion failed: 0 == zap_remove_int(mos, ds_prev->ds_phys->ds_next_clones_obj, obj, tx) (0x0 == 0x2), file: ../../common/fs/zfs/dsl_dataset.c, line: 1512 ff00102468d0 genunix:assfail3+c1 () ff0010246a50 zfs:dsl_dataset_destroy_sync+85a () ff0010246aa0 zfs:dsl_sync_task_group_sync+eb () ff0010246b10 zfs:dsl_pool_sync+196 () ff0010246ba0 zfs:spa_sync+32a () ff0010246c40 zfs:txg_sync_thread+265 () ff0010246c50 unix:thread_start+8 () We really need to import this pool. Is there a way around this? We do have snv_114 source on the system if we need to make changes to usr/src/uts/common/fs/zfs/dsl_dataset.c. It seems like the "zfs destroy" transaction never completed and it is being replayed, causing the panic. This cycle continues endlessly. -- albert chin (ch...@thewrittenword.com) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS flar image.
On 09/25/09 09:59, RB wrote:

I tried to install the flar image using the method explained in this link http://opensolaris.org/os/community/zfs/boot/flash/ I installed the 119534-15 patch on the box whose flar image was required, then created a flar image using flarcreate -n zfs_flar /flar_dir/zfs_flar.flar. I then installed the 124630-26 patch on the miniroot of the Solaris 05/09 (update 7) net install image. This was done by unpacking the miniroot using root_archive, patching the miniroot using patchadd and then repacking it.

Profile for the jumpstart:
install_type flash_install
archive_location nfs ://zfs_flar.flar
partitioning explicit

But the jumpstart fails with the following error:
Executing SolStart preinstall phase...
Executing begin script "install_begin"...
Begin script install_begin execution completed.
Processing profile - Opening Flash archive
ERROR: Could not mount ://zfs_flar.flar
ERROR: Flash installation failed
Solaris installation program exited.

Any clues what could be wrong?

I don't know. There are all kinds of reasons an NFS mount might fail. One thing you could do is boot the system from the install image and then escape out of the install after going through all the configuration steps (i.e. the questions about name server, routers, etc.). Then try to do an explicit NFS mount of the flar location (onto /mnt or a temporary mount point created in /tmp). If it fails, that may be the source of your problem.

lori ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
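A sketch of the manual check Lori suggests, run after escaping out of the installer (the install server name and share below are hypothetical, standing in for the elided archive_location; only the flar file name comes from the profile above):

# mkdir /tmp/flar
# mount -F nfs installserver:/export/flash /tmp/flar
# ls -l /tmp/flar/zfs_flar.flar

If the mount or the ls fails here, the jumpstart failure likely points at the NFS path or permissions rather than the flash archive itself.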
Re: [zfs-discuss] periodic slow responsiveness
On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn wrote: > On Fri, 25 Sep 2009, Ross Walker wrote: >> >> As a side an slog device will not be too beneficial for large >> sequential writes, because it will be throughput bound not latency >> bound. slog devices really help when you have lots of small sync >> writes. A RAIDZ2 with the ZIL spread across it will provide much > > Surely this depends on the origin of the large sequential writes. If the > origin is NFS and the SSD has considerably more sustained write bandwidth > than the ethernet transfer bandwidth, then using the SSD is a win. If the > SSD accepts data slower than the ethernet can deliver it (which seems to be > this particular case) then the SSD is not helping. > > If the ethernet can pass 100MB/second, then the sustained write > specification for the SSD needs to be at least 100MB/second. Since data is > buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the > SSD should support write bursts of at least double that or else it will not > be helping bulk-write performance. Specifically I was talking NFS as that was what the OP was talking about, but yes it does depend on the origin, but you also assume that NFS IO goes over only a single 1Gbe interface when it could be over multiple 1Gbe interfaces or a 10Gbe interface or even multple 10Gbe interfaces. You also assume the IO recorded in the ZIL is just the raw IO when there is also meta-data or multiple transaction copies as well. Personnally I still prefer to spread the ZIL across the pool and have a large NVRAM backed HBA as opposed to an slog which really puts all my IO in one basket. If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS flar image.
I tried to install the flar image using the method explained in this link http://opensolaris.org/os/community/zfs/boot/flash/ I installed the 119534-15 patch on the box whose flar image was required, then created a flar image using flarcreate -n zfs_flar /flar_dir/zfs_flar.flar. I then installed the 124630-26 patch on the miniroot of the Solaris 05/09 (update 7) net install image. This was done by unpacking the miniroot using root_archive, patching the miniroot using patchadd and then repacking it.

Profile for the jumpstart:
install_type flash_install
archive_location nfs ://zfs_flar.flar
partitioning explicit

But the jumpstart fails with the following error:
Executing SolStart preinstall phase...
Executing begin script "install_begin"...
Begin script install_begin execution completed.
Processing profile - Opening Flash archive
ERROR: Could not mount ://zfs_flar.flar
ERROR: Flash installation failed
Solaris installation program exited.

Any clues what could be wrong? Thanks. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Fri, 25 Sep 2009, Ross Walker wrote: As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much Surely this depends on the origin of the large sequential writes. If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win. If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping. If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second. Since data is buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On Thu, Sep 24, 2009 at 11:29 PM, James Lever wrote: > > On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote: > > The commentary says that normally the COMMIT operations occur during > close(2) or fsync(2) system call, or when encountering memory pressure. If > the problem is slow copying of many small files, this COMMIT approach does > not help very much since very little data is sent per file and most time is > spent creating directories and files. > > The problem appears to be slog bandwidth exhaustion due to all data being > sent via the slog creating a contention for all following NFS or locally > synchronous writes. The NFS writes do not appear to be synchronous in > nature - there is only a COMMIT being issued at the very end, however, all > of that data appears to be going via the slog and it appears to be inflating > to twice its original size. > For a test, I just copied a relatively small file (8.4MB in size). Looking > at a tcpdump analysis using wireshark, there is a SETATTR which ends with a > V3 COMMIT and no COMMIT messages during the transfer. > iostat output that matches looks like this: > slog write of the data (17MB appears to hit the slog) [snip] > then a few seconds later, the transaction group gets flushed to primary > storage writing nearly 11.4MB which is inline with raid Z2 (expect around > 10.5MB; 8.4/8*10): [snip] > So I performed the same test with a much larger file (533MB) to see what it > would do, being larger than the NVRAM cache in front of the SSD. Note that > after the second second of activity the NVRAM is full and only allowing in > about the sequential write speed of the SSD (~70MB/s). [snip] > Again, the slog wrote about double the file size (1022.6MB) and a few > seconds later, the data was pushed to the primary storage (684.9MB with an > expectation of 666MB = 533MB/8*10) so again about the right number hit the > spinning platters. [snip] > Can anybody explain what is going on with the slog device in that all data > is being shunted via it and why about double the data size is being written > to it per transaction? By any chance do you have copies=2 set? That will make 2 transactions of 1. Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half depending on how long it takes to flush): echo zfs_write_limit_override/W0t268435456 | mdb -kw Set the PERC flush interval to say 1 second. As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much higher throughput then an SSD. An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (but an L2ARC would be beneficial). Better workload analysis is really what it is about. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
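Two quick checks related to the suggestions above, assuming the pool is called tank (hypothetical): the first shows whether copies is set to something other than 1 anywhere in the pool, the second reads the current value of zfs_write_limit_override before overwriting it with the /W command shown above (the variable is a 64-bit value on recent builds, hence /E; adjust if your build differs):

# zfs get -r copies tank
# echo zfs_write_limit_override/E | mdb -k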
Re: [zfs-discuss] White box server for OpenSolaris
> I am after suggestions of motherboard, CPU and ram. > Basically I want ECC ram and at least two PCI-E x4 > channels. As I want to run 2 x AOC-USAS_L8i cards > for 16 drives. Asus M4N82 Deluxe. I have one running with 2 USAS-L8i cards just fine. I don't have all the drives loaded in yet, but the cards are detected and they can use the drives I do have attached. I currently have 8GB of ECC RAM on the board and it's working fine. The ECC options in the BIOS are enabled and it reports the ECC is enabled at boot. It has 3 PCIe x16 slots, I have a graphics card in the other slot, and an Intel e1000g card in the PCIe x1 slot. The onboard peripherals all work, with the exception of the onboard AHCI ports being buggy in b123 under xVM. Not sure what that's all about, I posted in the main discussion board but haven't heard if it's a known bug or if it will be fixed in the next version. It would be nice as my boot drives are on that controller. 2009.06 works fine though. CPU is a Phenom II X3 720. Probably overkill for fileserver duties, but I also want to do some VMs for other things, thus the bug I found with the xVM updates. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)
The opensolaris.org site will be transitioning to a wiki-based site soon, as described here: http://www.opensolaris.org/os/about/faq/site-transition-faq/ I think it would be best to use the new site to collect this information because it will be much easier for community members to contribute. I'll provide a heads up when the transition, which has been delayed, is complete. Cindy On 09/25/09 03:31, Eugen Leitl wrote: On Fri, Sep 25, 2009 at 10:18:15AM +0100, Tim Foster wrote: I don't have enough experience myself in terms of knowing what's the best hardware on the market, but from time to time, I do think about upgrading my system at home, and would really appreciate a zfs-community-recommended configuration to use. Any takers? I'm willing to contribute (zfs on Opensolaris, mostly Supermicro boxes and FreeNAS (FreeBSD 7.2, next 8.x probably)). Is there a wiki for that somewhere? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] NLM_DENIED_NOLOCKS Solaris 10u5 X4500
This was previously posed to the sun-managers mailing list, but the only reply I received recommended I post here as well. We have a production Solaris 10u5 / ZFS X4500 file server which is reporting NLM_DENIED_NOLOCKS immediately for any NFS locking request. The lockd does not appear to be busy, so is it possible we have hit some sort of limit on the number of files that can be locked? Are there any items to check before restarting lockd / statd? Restarting them appears to have at least temporarily cleared up the issue. Thanks, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
It does seem to come up regularly... perhaps someone with access could throw up a page under the ZFS community with the conclusions (and periodic updates as appropriate).. On Fri, Sep 25, 2009 at 3:32 AM, Erik Trimble wrote: > Nathan wrote: >> >> While I am about to embark on building a home NAS box using OpenSolaris >> with ZFS. >> >> Currently I have a chassis that will hold 16 hard drives, although not in >> caddies - down time doesn't bother me if I need to switch a drive, probably >> could do it running anyways just a bit of a pain. :) >> >> I am after suggestions of motherboard, CPU and ram. Basically I want ECC >> ram and at least two PCI-E x4 channels. As I want to run 2 x AOC-USAS_L8i >> cards for 16 drives. >> >> I want something with a bit of guts but over the top. I know the HCL is >> there but I want to see what other people are using in their solutions. >> > > Go back and look through the archives for this list. We just had this > discussion last month. Let's not rehash it again, as it seems to get redone > way too often. > > > > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Strange issue with ZFS and du (snv_118 and snv_123)
Hi Guys, maybe someone has some time to take a look at my issue, I didn't find a answer using the search. Here we go: I was running a backup of a directory located on a ZFS pool named TimeMachine, before I started the job, I checked the size of the directory called NFS, and du -h or du -s was telling me 25GB for /TimeMachine/NFS. So I started the job after a while I was very surprised that the backup app requested a new tape, and the a new one again, so in total 3 tapes (lto1) for 25GB ??!?! After that I checked the directory again: r...@fileserver:/TimeMachine/NFS# du -h . 25G r...@fileserver:/TimeMachine/NFS# du -s 25861519 r...@fileserver:/TimeMachine/NFS# ls -lh total 25G -rw-r--r-- 1 root root 232G 2009-09-25 14:04 nfs.tar zfs list TimeMachine/NFS NAME USED AVAIL REFER MOUNTPOINT TimeMachine/NFS 24.7G 818G 24.7G /TimeMachine/NFS Also, if I use nautilus under Gnome, he also tells me that Directory NFS used 232GB and not 24.7GB as du and zfs list reports to me ?!?! Same if I mount that share (AFP) from a Mac and via NFS, still got 232GB used for TimeMachine/NFS. r...@fileserver:/Data/nfs_org# ls -lh total 232G -rw-r--r-- 1 root root 232G 2009-09-24 17:57 nfs.tar r...@fileserver:/Data/nfs_org# du -h . 232G. r...@fileserver:/Data/nfs_org# I've upgraded from snv_118 to snv_123 but still the same. I also copy the contend of the directory to another ZFS spool, removed the org content and copy it back again, but I still get an incorrect value! pool: TimeMachine state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM TimeMachine ONLINE 0 0 0 raidz1ONLINE 0 0 0 c4t1d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c6t0d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 errors: No known data errors r...@fileserver:/TimeMachine/NFS# zfs get all TimeMachine NAME PROPERTY VALUE SOURCE TimeMachine type filesystem - TimeMachine creation Sat Feb 28 17:48 2009 - TimeMachine used 2.76T - TimeMachine available 818G - TimeMachine referenced1.24T - TimeMachine compressratio 1.00x - TimeMachine mounted yes- TimeMachine quota none default TimeMachine reservation none default TimeMachine recordsize128K default TimeMachine mountpoint/TimeMachine default TimeMachine sharenfs offdefault TimeMachine checksum on default TimeMachine compression offdefault TimeMachine atime on default TimeMachine devices on default TimeMachine exec on default TimeMachine setuidon default TimeMachine readonly offdefault TimeMachine zoned offdefault TimeMachine snapdir hidden default TimeMachine aclmode groupmask default TimeMachine aclinheritrestricted default TimeMachine canmount on default TimeMachine shareiscsioffdefault TimeMachine xattr on default TimeMachine copies1 default TimeMachine version 3 - TimeMachine utf8only off- TimeMachine normalization none - TimeMachine casesensitivity sensitive - TimeMachine vscan offdefault TimeMachine nbmandoffdefault TimeMachine sharesmb offdefault TimeMachine refquota none default TimeMachine refreservationnone default TimeMachine primarycache alldefault TimeMachine secondarycachealldefault TimeMachine usedbysnapshots 0 - TimeMachine usedbydataset 1.24T - TimeMachine usedbychildren1.53T - TimeMachine usedbyrefreservation 0 - TimeMachine logbias latencydefault r...@fileserver:/TimeMachine/NFS# r...@fileserver:/TimeMachine/NFS# zfs list TimeMachine NAME USED AVAIL REFER MOUNTPOINT
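One thing worth ruling out here (offered as a guess, not a diagnosis): du and zfs list report the blocks actually allocated on disk, while ls -l and Nautilus report the logical file length, so a sparse nfs.tar could legitimately show 232G in ls and ~25G in du even with compression off. Comparing the allocated and logical sizes directly makes this visible:

# ls -ls /TimeMachine/NFS/nfs.tar      (first column is the allocated size in blocks)
# du -k /TimeMachine/NFS/nfs.tar

If the allocated size is far smaller than the length, the tar file contains holes, and backup software reading it at its full logical size would be consistent with the three LTO1 tapes.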
Re: [zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)
On Fri, Sep 25, 2009 at 10:18:15AM +0100, Tim Foster wrote: > I don't have enough experience myself in terms of knowing what's the > best hardware on the market, but from time to time, I do think about > upgrading my system at home, and would really appreciate a > zfs-community-recommended configuration to use. > > Any takers? I'm willing to contribute (zfs on Opensolaris, mostly Supermicro boxes and FreeNAS (FreeBSD 7.2, next 8.x probably)). Is there a wiki for that somewhere? -- Eugen* Leitl http://leitl.org";>leitl http://leitl.org __ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)
On Fri, 2009-09-25 at 01:32 -0700, Erik Trimble wrote: > Go back and look through the archives for this list. We just had this > discussion last month. Let's not rehash it again, as it seems to get > redone way too often. You know, this seems like such a common question to the list, would we (the zfs community) be interested in coming up with a rolling set of 'recommended' systems that home users could use as a reference, rather than requiring people to trawl through the archives each time? Perhaps a few tiers, with as many user-submitted systems per-tier as we get. * small boot disk + 2 or 3 disks, low power, quiet, small media server * medium boot disk + 3 - 9 disks, home office, larger media server * large boot disk + 9 or more disks, thumper-esque and keep them up to date as new hardware becomes available, with a bit of space on a website somewhere to manage them. These could either be off-the-shelf dedicated NAS systems, or build-to-order machines, but getting their configuration & last-known-price would be useful. I don't have enough experience myself in terms of knowing what's the best hardware on the market, but from time to time, I do think about upgrading my system at home, and would really appreciate a zfs-community-recommended configuration to use. Any takers? cheers, tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
Nathan wrote: While I am about to embark on building a home NAS box using OpenSolaris with ZFS. Currently I have a chassis that will hold 16 hard drives, although not in caddies - down time doesn't bother me if I need to switch a drive, probably could do it running anyways just a bit of a pain. :) I am after suggestions of motherboard, CPU and ram. Basically I want ECC ram and at least two PCI-E x4 channels. As I want to run 2 x AOC-USAS_L8i cards for 16 drives. I want something with a bit of guts but over the top. I know the HCL is there but I want to see what other people are using in their solutions. Go back and look through the archives for this list. We just had this discussion last month. Let's not rehash it again, as it seems to get redone way too often. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] White box server for OpenSolaris
I am about to embark on building a home NAS box using OpenSolaris with ZFS. Currently I have a chassis that will hold 16 hard drives, although not in caddies - downtime doesn't bother me if I need to switch a drive, and I could probably do it running anyway, just a bit of a pain. :) I am after suggestions for motherboard, CPU and RAM. Basically I want ECC RAM and at least two PCI-E x4 slots, as I want to run 2 x AOC-USAS-L8i cards for 16 drives. I want something with a bit of guts but not over the top. I know the HCL is there, but I want to see what other people are using in their solutions. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
>On Fri, 25 Sep 2009, James Lever wrote:
>> NFS Version 3 introduces the concept of "safe asynchronous writes."
>
>Being "safe" then requires a responsibility level on the client which is often not present. For example, if the server crashes, and then the client crashes, how does the client resend the uncommitted data? If the client had a non-volatile storage cache, then it would be able to responsibly finish the writes that failed.

If the client crashes, it is clear that "work will be lost" up to the point of the last successful commit. Beyond supporting the NFSv3 commit operation and resending the missing operations, nothing more is required of the client: if the client crashes, we know that uncommitted operations may be dropped on the floor.

>The commentary says that normally the COMMIT operations occur during close(2) or fsync(2) system call, or when encountering memory pressure. If the problem is slow copying of many small files, this COMMIT approach does not help very much since very little data is sent per file and most time is spent creating directories and files.

Indeed; the commit is mostly to make sure that the pipe between the server and the client can be filled for write operations.

Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
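If it would help to see how often clients actually issue COMMITs without resorting to packet captures, a small DTrace sketch using the nfsv3 provider (assuming that provider is available on the server's build; probe and argument names are those of the standard nfsv3 provider):

# dtrace -n 'nfsv3:::op-commit-start { @commits[args[0]->ci_remote] = count(); }'

Let it run during a copy and interrupt it with Ctrl-C; it prints a per-client count of COMMIT operations, which makes it easy to see whether commits arrive only at close/fsync time, as described above, or far more frequently.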