Re: [zfs-discuss] partitioned cache devices
On 03/19/13 20:27, Jim Klimov wrote:
> I disagree; at least, I've always thought differently: the "d" device
> is the whole disk denomination, with a unique number for a particular
> controller link ("c+t"). The disk has some partitioning table, MBR or
> GPT/EFI. In these tables, partition "p0" stands for the table itself
> (i.e. to manage partitioning),

p0 is the whole disk regardless of any partitioning. (Hence you can use p0 to access any type of partition table.)

> and the rest kind of "depends". In case of MBR tables, one partition
> may be named as having a Solaris (or Solaris2) type, and there it
> holds an SMI table of Solaris slices, and these slices can hold legacy
> filesystems or components of ZFS pools. In case of GPT, the
> GPT-partitions can be used directly by ZFS. However, they are also
> denominated as "slices" in ZFS and the format utility. The GPT
> partitioning spec requires the disk to be FDISK-partitioned with just
> one single FDISK partition of type EFI, so that tools which predate
> GPT partitioning will still see such a GPT disk as fully assigned to
> FDISK partitions, and therefore it is less likely to be accidentally
> blown away. I believe Solaris-based OSes accessing a "p"-named
> partition and an "s"-named slice of the same number on a GPT disk
> should lead to the same range of bytes on disk, but I am not really
> certain about this.

No, you'll see just p0 (whole disk), and p1 (whole disk less space for the backwards-compatible FDISK partitioning).

> Also, if a "whole disk" is given to ZFS (and for OSes other than the
> latest Solaris 11 this means non-rpool disks), then ZFS labels the
> disk as GPT and defines a partition for itself plus a small trailing
> partition (likely to level out discrepancies with replacement disks
> that might happen to be a few sectors too small). In this case ZFS
> reports that it uses "cXtYdZ" as a pool component,

For an EFI disk, the device name without a final p* or s* component is the whole EFI partition. (It's actually the s7 slice minor device node, but the s7 is dropped from the device name to avoid the confusion we had with s2 on SMI-labeled disks being the whole SMI partition.)

> since it considers itself in charge of the partitioning table and its
> inner contents, and doesn't intend to share the disk with other usages
> (dual-booting and other OSes' partitions, or SLOG and L2ARC parts,
> etc). This also "allows" ZFS to influence hardware-related choices,
> like caching and throttling, and likely auto-expansion with changed
> LUN sizes by fixing up the partition table along the way, since it
> assumes being 100% in charge of the disk.
>
> I don't think there is a "crime" in trying to use the partitions (of
> either kind) as ZFS leaf vdevs; even the zpool(1M) manpage states
> that:
>
>   ... The following virtual devices are supported:
>   disk   A block device, typically located under /dev/dsk. ZFS can
>          use individual slices or partitions, though the recommended
>          mode of operation is to use whole disks. ...

Right.

> This is orthogonal to the fact that there can only be one Solaris
> slice table, inside one partition, on MBR. AFAIK this is irrelevant
> on GPT/EFI - no SMI slices there.

There's a simpler way to think of it on x86. You always have FDISK partitioning (p1, p2, p3, p4). You can then have SMI or GPT/EFI slices (both called s0, s1, ...) in an FDISK partition of the appropriate type. With SMI labeling, s2 is by convention the whole Solaris FDISK partition (although this is not enforced). With EFI labeling, s7 is enforced as the whole EFI FDISK partition, and so the trailing s7 is dropped off the device name for clarity. This simplicity is brought about because the GPT spec requires that backwards-compatible FDISK partitioning is included, but with just 1 partition assigned.

-- 
Andrew

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
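The naming scheme described above can be inspected directly; a sketch, assuming a GPT/EFI-labeled disk named c0t0d0 (the disk name is a placeholder - substitute your own):

```shell
# On an EFI/GPT disk you should see p0 (whole disk), p1 (the single
# protective FDISK partition) and the s0..s8 slice nodes:
ls -l /dev/dsk/c0t0d0*

# fdisk's view of the protective FDISK table goes through p0:
fdisk -W - /dev/rdsk/c0t0d0p0

# The slice table inside the EFI partition (note: no trailing s*
# component needed - the whole-EFI-partition node is used):
prtvtoc /dev/rdsk/c0t0d0
```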
Re: [zfs-discuss] partitioned cache devices
Andrew Werchowiecki wrote:
>  Total disk size is 9345 cylinders
>  Cylinder size is 12544 (512 byte) blocks
>
>                                      Cylinders
>    Partition   Status   Type         Start   End    Length    %
>    =========   ======   ====         =====   ===    ======   ===
>        1                EFI              0   9345     9346   100

You only have a p1 (and for a GPT/EFI labeled disk, you can only have p1 - no other FDISK partitions are allowed).

> partition> print
> Current partition table (original):
> Total disk sectors available: 117214957 + 16384 (reserved sectors)
>
> Part        Tag   Flag    First Sector      Size      Last Sector
>   0         usr   wm                64    2.00GB          4194367
>   1         usr   wm           4194368   53.89GB        117214990
>   2  unassigned   wm                 0         0                0
>   3  unassigned   wm                 0         0                0
>   4  unassigned   wm                 0         0                0
>   5  unassigned   wm                 0         0                0
>   6  unassigned   wm                 0         0                0
>   8    reserved   wm         117214991    8.00MB        117231374

You have an s0 and s1.

> This isn't the output from when I did it, but it is exactly the same
> steps that I followed. Thanks for the info about slices, I may give
> that a go later on. I'm not keen on that because I have clear evidence
> (as in zpools set up this way, right now, working, without issue) that
> GPT partitions of the style shown above work, and I want to see why it
> doesn't work in my set up rather than simply ignoring and moving on.

You would have to blow away the partitioning you have, and create an FDISK partitioned disk (not EFI), and then create a p1 and p2 partition. (Don't use the 'partition' subcommand, which confusingly creates Solaris slices.) Give the FDISK partitions a partition type which nothing will recognise, such as 'other', so that nothing will try and interpret them as OS partitions. Then you can use them as raw devices, and they should be portable between OSs which can handle FDISK partitioned devices.

-- 
Andrew
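A sketch of that recipe; the disk name c0t1d0 and pool name tank are placeholders, and the interactive fdisk menu steps are paraphrased rather than exact:

```shell
# Write a default FDISK table to the whole-disk p0 node, then adjust it
# interactively: delete the default partition and create two partitions
# of type 'Other' sized as required.
fdisk -B /dev/rdsk/c0t1d0p0
fdisk /dev/rdsk/c0t1d0p0

# The FDISK partitions then appear as p1/p2 and can be used as raw
# devices, e.g. as log and cache vdevs:
zpool add tank log /dev/dsk/c0t1d0p1
zpool add tank cache /dev/dsk/c0t1d0p2
```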
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> From: Darren J Moffat [mailto:darr...@opensolaris.org]
>> Support for SCSI UNMAP - both issuing it and honoring it when it is
>> the backing store of an iSCSI target.
>
> When I search for scsi unmap, I come up with all sorts of
> documentation that ... is ... like reading a medical journal when all
> you want to know is the conversion from 98.6F to C. Would you mind
> momentarily describing what SCSI UNMAP is used for? If I were
> describing it to a customer (CEO, CFO) I'm not going to tell them
> about SCSI UNMAP, I'm going to say the new system has a new feature
> that enables ... or solves the ___ problem... Customer doesn't
> *necessarily* have to be as clueless as CEO/CFO. Perhaps just another
> IT person, or whatever.

SCSI UNMAP (or SATA TRIM) is a means of telling a storage device that some blocks are no longer needed. (This might be because a file has been deleted in the filesystem on the device.)

In the case of a Flash device, it can optimise usage by knowing this: e.g. it can perhaps perform a background erase on the real blocks so they're ready for reuse sooner, and/or better optimise wear leveling by having more spare space to play with, which in some devices also improves the device's lifetime. It can also help by avoiding some read-modify-write operations, if the device knows the data in the rest of the 4k block is no longer needed.

In the case of an iSCSI LUN target, these blocks no longer need to be archived, and if sparse space allocation is in use, the space they occupied can be freed off. In the particular case of ZFS provisioning the iSCSI LUN (COMSTAR), you might get performance improvements by having more free space to play with during other write operations, allowing better storage layout optimisation.
So, bottom line is longer life of SSDs (maybe higher performance too, if there's less waiting for erases during writes), and better space utilisation and performance for a ZFS COMSTAR target.

-- 
Andrew Gabriel
Re: [zfs-discuss] any more efficient way to transfer snapshot between two hosts than ssh tunnel?
In my own experiments with my own equivalent of mbuffer, it's well worth giving the receiving side a buffer which is sized to hold the amount of data in a transaction commit, which allows ZFS to be banging out one tx group to disk whilst the network is bringing the next one across for it. This will be roughly the link speed in bytes/second x 5, plus a bit more for good measure - say 250-300Mbytes for a gigabit link. It seems to be most important when the disks and the network link have similar max theoretical bandwidths (100Mbytes/sec is what you might expect from both gigabit ethernet and reasonable disks), and it becomes less important as the difference in max performance between them increases. Without the buffer, you tend to see the network run flat out for 5 seconds and then the receiving disks run flat out for 5 seconds, alternating back and forth, whereas with the buffer, both continue streaming at full gigabit speed without a break. I have not seen any benefit of buffering on the sending side, although I'd still be inclined to include a small one. YMMV...

Palmer, Trey wrote:
> We have found mbuffer to be the fastest solution. Our rates for large
> transfers on 10GbE are:
>
>   280MB/s   mbuffer
>   220MB/s   rsh
>   180MB/s   HPN-ssh unencrypted
>    60MB/s   standard ssh
>
> The tradeoffs: mbuffer is a little more complicated to script; rsh is,
> well, you know; and hpn-ssh requires rebuilding ssh and (probably)
> maintaining a second copy of it.

-- 
Andrew Gabriel
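A sketch of the mbuffer pipeline being discussed; the host names, dataset, and port are placeholders, and the ~300MB receive buffer follows the sizing suggestion above (a full txg commit's worth for a gigabit link):

```shell
# On the receiving host (start this side first):
mbuffer -s 128k -m 300M -I 9090 | zfs receive -F tank/backup

# On the sending host - only a small buffer is needed on this side:
zfs send tank/data@snap | mbuffer -s 128k -m 16M -O recvhost:9090
```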
Re: [zfs-discuss] ZFS array on marvell88sx in Solaris 11.1
The 3112 and 3114 were very early SATA controllers from before there were any SATA drivers; they pretend to be ATA controllers to the OS. No one should be using these today.

sol wrote:
> Oh I can run the disks off a SiliconImage 3114 but it's the marvell
> controller that I'm trying to get working. I'm sure it's the
> controller which is used in the Thumpers so it should surely work in
> solaris 11.1
>
> *From:* Bob Friesenhahn
>> If the SATA card you are using is a JBOD-style card (i.e. disks are
>> portable to a different controller), are you able/willing to swap it
>> for one that Solaris is known to support well?

-- 
Andrew Gabriel
Re: [zfs-discuss] Remove disk
Bob Friesenhahn wrote:
> On Sat, 1 Dec 2012, Jan Owoc wrote:
>>> When I would like to change the disk, I also would like to change
>>> the disk enclosure; I don't want to use the old one.
>>
>> You didn't give much detail about the enclosure (how it's connected,
>> how many disk bays it has, how it's used etc.), but are you able to
>> power off the system and transfer all the disks at once?
>>
>>> And what happens if I have 24 or 36 disks to change? It would take
>>> months to do that.
>>
>> Those are the current limitations of zfs.
>
> Yes, with 12x2TB of data to copy it could take about a month. You can
> create a brand new pool with the new chassis and use 'zfs send' to
> send a full snapshot of each filesystem to the new pool. After the
> bulk of the data has been transferred, take new snapshots and send the
> remainder. This expects that both pools can be available at once.

Or, if you don't care about existing snapshots, use Shadow Migration to move the data across.

-- 
Andrew Gabriel
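The two-pass migration described above can be sketched as follows; the pool names are placeholders:

```shell
# Pass 1: bulk copy while the old pool stays in service.
zfs snapshot -r oldpool@migrate1
zfs send -R oldpool@migrate1 | zfs receive -F -d newpool

# Pass 2: later, after quiescing writers, send just the (much smaller)
# delta accumulated since the first snapshot:
zfs snapshot -r oldpool@migrate2
zfs send -R -i @migrate1 oldpool@migrate2 | zfs receive -F -d newpool
```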
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
Arne Jansen wrote:
> We have finished a beta version of the feature.

What does FITS stand for?
Re: [zfs-discuss] Making ZIL faster
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Schweiss, Chip
>> How can I determine for sure that my ZIL is my bottleneck? If it is
>> the bottleneck, is it possible to keep adding mirrored pairs of SSDs
>> to the ZIL to make it faster? Or should I be looking for a DDR drive,
>> ZeusRAM, etc.
>
> Temporarily set sync=disabled
> Or, depending on your application, leave it that way permanently. I
> know, for the work I do, most systems I support at most locations
> have sync=disabled. It all depends on the workload.

Noting of course that this means that in the case of an unexpected system outage or loss of connectivity to the disks, synchronous writes since the last txg commit will be lost, even though the applications will believe they are secured to disk. (The ZFS filesystem won't be corrupted, but it will look like it's been wound back by up to 30 seconds when you reboot.)

This is fine for some workloads, such as those where you would start again with fresh data and those which can look closely at the data to see how far they got before being rudely interrupted, but not for those which rely on the POSIX semantics of synchronous writes/syncs meaning data is secured on non-volatile storage when the function returns.

-- 
Andrew
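The temporary test suggested above might look like this as commands; the dataset name is a placeholder:

```shell
zfs set sync=disabled tank/myfs   # bypass the ZIL temporarily
# ...re-run the synchronous-write workload; if it speeds up
# substantially, ZIL/slog latency is the bottleneck...
zfs set sync=standard tank/myfs   # restore synchronous semantics
zfs get sync tank/myfs            # confirm the current setting
```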
Re: [zfs-discuss] all in one server
Richard Elling wrote:
> On Sep 18, 2012, at 7:31 AM, Eugen Leitl <eu...@leitl.org> wrote:
>> Can I actually have a year's worth of snapshots in zfs without too
>> much performance degradation?
>
> I've got 6 years of snapshots with no degradation :-)
>
> $ zfs list -t snapshot -r export/home | wc -l
>     1951

$ echo 1951 / 365 | bc -l
5.34520547945205479452
$

So you're slightly ahead of my 5.3 years of daily snapshots :-)

-- 
Andrew Gabriel
Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs
On 05/28/12 20:06, Iwan Aucamp wrote:
> I'm getting sub-optimal performance with an mmap-based database
> (mongodb) which is running on zfs on Solaris 10u9. System is a
> Sun Fire X4270-M2 with 2x X5680 and 72GB (6 * 8GB + 6 * 4GB) RAM
> (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks.
> - a few mongodb instances are running with moderate IO and total rss
>   of 50 GB
> - a service which logs quite excessively (5GB every 20 mins) is also
>   running (max 2GB ram use) - log files are compressed after some
>   time to bzip2
> Database performance is quite horrid though - it seems that zfs does
> not know how to manage allocation between page cache and arc cache -
> and it seems arc cache wins most of the time. I'm thinking of doing
> the following:
> - relocating mmaped (mongo) data to a zfs filesystem with only
>   metadata cache
> - reducing zfs arc cache to 16 GB
> Are there any other recommendations - and is the above likely to
> improve performance?

1. Upgrade to S10 Update 10 - this has various performance improvements, in particular related to database type loads (but I don't know anything about mongodb).

2. Reduce the ARC size so RSS + ARC + other memory users < RAM size. I assume the RSS includes whatever caching the database does. In theory, a database should be able to work out what's worth caching better than any filesystem can guess from underneath it, so you want to configure more memory in the DB's cache than in the ARC. (The default ARC tuning is unsuitable for a database server.)

3. If the database has some concept of blocksize or recordsize that it uses to perform i/o, make sure the filesystems it uses are configured with the same recordsize. The ZFS default recordsize (128kB) is usually much bigger than database blocksizes. This is probably going to have less impact with an mmapped database than a read(2)/write(2) database, where it may prove better to match the filesystem's record size to the system's page size (4kB, unless it's using some type of large pages).
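Suggestions (2) and (3), together with the metadata-only caching idea from the original post, could be sketched as zfs commands; the pool/dataset names and the 8k recordsize are placeholder assumptions - check your database's actual i/o size:

```shell
# Filesystem for the mmapped DB files: recordsize matched to the DB's
# i/o size, and only metadata cached in the ARC (the page cache holds
# the data via mmap). recordsize only affects newly written files.
zfs create -o recordsize=8k -o primarycache=metadata tank/mongodb

# Separate filesystem for the chatty log writer, tuned independently:
zfs create -o recordsize=128k tank/applogs
```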
I haven't tried playing with recordsize for memory-mapped i/o, so I'm speculating here. Blocksize or recordsize may apply to the log file writer too, and it may be that this needs a different recordsize and therefore has to be in a different filesystem. If it uses write(2) or some variant rather than mmap(2) and doesn't document this in detail, DTrace is your friend.

4. Keep plenty of free space in the zpool if you want good database performance. If you're more than 60% full (S10U9) or 80% full (S10U10), that could be a factor.

Anyway, there are a few things to think about.

-- 
Andrew
Re: [zfs-discuss] zfs_arc_max values
On 05/17/12 15:03, Bob Friesenhahn wrote:
> On Thu, 17 May 2012, Paul Kraus wrote:
>> Why are you trying to tune the ARC as _low_ as possible? In my
>> experience the ARC gives up memory readily for other uses. The only
>> place I _had_ to tune the ARC in production was a couple systems
>> running an app that checks for free memory _before_ trying to
>> allocate it. If the ARC has all but 1 GB in use, the app (which is
>> looking for
>
> On my system I adjusted the ARC down due to running user-space
> applications with very bursty short-term large memory usage. Reducing
> the ARC assured that there would be no contention between zfs ARC and
> the applications.

If the system is running one app which expects to do lots of application-level caching (and in theory, the app should be able to work out what's worth caching and what isn't better than any filesystem underneath it can guess), then you should be planning your memory usage accordingly. For example, on a database server you probably want to allocate much of the system's memory to the database cache (in the case of Oracle, the SGA), leaving enough for a smaller ZFS ARC and the memory required by the OS and app. It depends on the system and database size, but something like 50% SGA, 25% ZFS ARC, 25% for everything else might be an example, with the SGA disproportionately bigger on larger systems with larger databases.

On my desktop system (supposed to be 8GB RAM, but currently 6GB due to a dead DIMM), I have knocked the ARC down to 1GB. I used to find the ARC wouldn't shrink in size until the system had got to the point of crawling along showing anon page-ins, and some app (usually firefox or thunderbird) had already become too difficult to use. I must admit I did this a long time ago, and ZFS's shrinking of the ARC may be more proactive now than it was back then, but I don't notice any ZFS performance issues with the ARC restricted to 1GB on a desktop system. It may have increased scrub times, but that happens when I'm in bed, so I don't care.
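For reference, capping the ARC on Solaris 10 is done in /etc/system and takes effect at the next reboot; a sketch for a 16GB cap (the value is in bytes, and the right cap for any given system depends on the sizing exercise above):

```
* Cap the ZFS ARC at 16 GB (0x400000000 bytes)
set zfs:zfs_arc_max = 0x400000000
```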
-- 
Andrew
Re: [zfs-discuss] test for holes in a file?
I just played and knocked this up (note the stunning lack of comments, missing optarg processing, etc)... Give it a list of files to check...

#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++) {
        int fd;

        fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
        } else {
            off_t eof;
            off_t hole;

            if (((eof = lseek(fd, 0, SEEK_END)) < 0) ||
                lseek(fd, 0, SEEK_SET) < 0) {
                perror(argv[i]);
            } else if (eof == 0) {
                printf("%s: empty\n", argv[i]);
            } else {
                /*
                 * Offset of the first hole; every file has a virtual
                 * hole at EOF, so hole == eof means no real holes.
                 */
                hole = lseek(fd, 0, SEEK_HOLE);
                if (hole < 0) {
                    perror(argv[i]);
                } else if (hole < eof) {
                    printf("%s: sparse\n", argv[i]);
                } else {
                    printf("%s: not sparse\n", argv[i]);
                }
            }
            close(fd);
        }
    }
    return 0;
}

On 03/26/12 10:06 PM, ольга крыжановская wrote:
> Mike, I was hoping that some one has a complete example for a
> bool has_file_one_or_more_holes(const char *path) function.
>
> Olga
>
> 2012/3/26 Mike Gerdts:
>> 2012/3/26 ольга крыжановская:
>>> How can I test if a file on ZFS has holes, i.e. is a sparse file,
>>> using the C api?
>>
>> See SEEK_HOLE in lseek(2).
>>
>> -- 
>> Mike Gerdts
>> http://mgerdts.blogspot.com/
Re: [zfs-discuss] Any recommendations on Perc H700 controller on Dell Rx10 ?
On 03/10/12 09:29, Sriram Narayanan wrote:
> Hi folks:
> At work, I have an R510, an R610 and an R710 - all with the H700 PERC
> controller. Based on experiments, it seems like there is no way to
> bypass the PERC controller - it seems like one can only access the
> individual disks if they are each set up as RAID0. This brings me to
> ask some questions:
> a. Is it fine (in terms of an intelligent controller coming in the
>    way of ZFS) to have the PERC controllers present each drive as
>    RAID0 drives?
> b. Would there be any errors in terms of PERC doing things that ZFS
>    is not aware of, and this causing any issues later?

I had to produce a ZFS hybrid storage pool performance demo, and was initially given a system with a RAID-only controller (different from yours, but same idea). I created the demo with it, but disabled the RAID's cache as that wasn't what I wanted in the picture. Meanwhile, I ordered the non-RAID version of the card, and when it came, I swapped it in. A couple of issues...

ZFS doesn't recognise any of the disks, obviously, because they have proprietary RAID headers on them, so they have to be created again from scratch. (That was no big deal in this case, and if it had been, I could have done a zfs send and receive to somewhere else temporarily.)

The performance went up, a tiny bit for the spinning disks, and by 50% for the SSDs, so the RAID controller was seriously limiting the IOPs of the SSDs in particular. This was when SSDs were relatively new, and the controllers may not have been designed with SSDs in mind. That's likely to be somewhat different nowadays, but I don't have any data to show that either way.

-- 
Andrew Gabriel
Re: [zfs-discuss] send only difference between snapshots
skeletor wrote:
> There is a task: make backups by sending snapshots to another server.
> But I don't want to send a complete snapshot of the system each time -
> I want to send only the difference between snapshots. For example:
> there are 2 servers, and I want to take a snapshot on the master, send
> only the difference between the current and most recent snapshots to
> the backup, and then deploy it on the backup. Any ideas how this can
> be done?

It's called an incremental - it's part of the zfs send command line options.

-- 
Andrew Gabriel
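As a minimal sketch of an incremental send; the dataset, snapshot, and host names are placeholders, and @yesterday is assumed to exist on both sides already:

```shell
zfs snapshot tank/data@today
# Send only the delta between yesterday's snapshot and today's:
zfs send -i tank/data@yesterday tank/data@today | \
    ssh backuphost zfs receive backup/data
```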
Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
Gary Mills wrote:
> On Sun, Jan 15, 2012 at 04:06:33PM +, Peter Tribble wrote:
>> On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov wrote:
>>> "Does raidzN actually protect against bitrot?"
>>> That's a kind of radical, possibly offensive, question formula that
>>> I have lately.
>>
>> Yup, it does. That's why many of us use it.
>
> There's actually no such thing as bitrot on a disk. Each sector on
> the disk is accompanied by a CRC that's verified by the disk
> controller on each read. It will either return correct data or report
> an unreadable sector. There's nothing in between.

Actually, there are a number of disk firmware and cache faults in between, which zfs has picked up over the years.

-- 
Andrew Gabriel
Re: [zfs-discuss] Stress test zfs
grant lowe wrote:
> Ok. I blew it. I didn't add enough information. Here's some more
> detail: Disk array is a RAMSAN array, with RAID6 and 8K stripes. I'm
> measuring performance with the results of the bonnie++ output and
> comparing with the zpool iostat output. It's with the zpool iostat
> that I'm not seeing a lot of writes.

Since ZFS never writes data back where it was, it can coalesce multiple outstanding writes into fewer device writes. This may be what you're seeing. I have a ZFS IOPs demo where the (multi-threaded) application is performing over 10,000 synchronous write IOPs, but the underlying devices are only performing about 1/10th of that, due to ZFS coalescing multiple outstanding writes.

Sorry, I'm not familiar with what type of load bonnie generates.

-- 
Andrew Gabriel | Solaris Systems Architect
Email: andrew.gabr...@oracle.com
Mobile: +44 7720 598213
Oracle EMEA Server Pre-Sales
ORACLE Corporation UK Ltd is a company incorporated in England & Wales | Company Reg. No. 1782505 | Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA
Hardware and Software, Engineered to Work Together
Re: [zfs-discuss] Can I create a mirror for a root rpool?
On 12/16/11 07:27 AM, Gregg Wonderly wrote:
> Cindy, will it ever be possible to just have attach mirror the
> surfaces, including the partition tables? I spent an hour today
> trying to get a new mirror on my root pool. There was a 250GB disk
> that failed. I only had a 1.5TB handy as a replacement. prtvtoc ... |
> fmthard does not work in this case

Can you be more specific about why it fails? I have seen a couple of cases, and I'm wondering if you're hitting the same thing. Can you post the prtvtoc output of your original disk please?

> and so you have to do the partitioning by hand, which is just silly
> to fight with anyway.
>
> Gregg

-- 
Andrew Gabriel
Re: [zfs-discuss] slow zfs send/recv speed
On 11/15/11 23:40, Tim Cook wrote:
> On Tue, Nov 15, 2011 at 5:17 PM, Andrew Gabriel
> <andrew.gabr...@oracle.com> wrote:
>> On 11/15/11 23:05, Anatoly wrote:
>>> Good day,
>>> The speed of send/recv is around 30-60 MBytes/s for initial send
>>> and 17-25 MBytes/s for incremental. I have seen lots of setups with
>>> 1 disk to 100+ disks in a pool, but the speed doesn't vary to any
>>> degree. As I understand it, 'zfs send' is the limiting factor. I
>>> did tests by sending to /dev/null. It worked out too slow and
>>> absolutely not scalable. None of cpu/memory/disk activity were at
>>> peak load, so there is room for improvement. Is there any bug
>>> report or article that addresses this problem? Any workaround or
>>> solution? I found these guys have the same result - around
>>> 7 Mbytes/s for 'send' and 70 Mbytes for 'recv'.
>>> http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html
>>
>> Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk
>> mirror, the send runs at almost 100Mbytes/sec, so it's pretty much
>> limited by the ethernet. Since you have provided none of the
>> diagnostic data you collected, it's difficult to guess what the
>> limiting factor is for you.
>>
>> -- 
>> Andrew Gabriel
>
> So all the bugs have been fixed?

Probably not, but the OP's implication that zfs send has a specific rate limit in the range suggested is demonstrably untrue. So I don't know what's limiting the OP's send rate. (I could guess a few possibilities, but that's pointless without the data.)

> I seem to recall people on this mailing list using mbuffer to speed it
> up because it was so bursty and slow at one point. IE:
> http://blogs.everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/

Yes, this idea originally came from me, having analyzed the send/receive traffic behavior in combination with network connection behavior. However, it's the receive side that's bursty around the TXG commits, not the send side, so that doesn't match the issue the OP is seeing.
(The buffer sizes in that blog are not optimal, although any buffer at the receive side will make a significant improvement if the network bandwidth is of the same order of magnitude as the send/recv are capable of.)

-- 
Andrew Gabriel
Re: [zfs-discuss] slow zfs send/recv speed
On 11/15/11 23:05, Anatoly wrote:
> Good day,
> The speed of send/recv is around 30-60 MBytes/s for initial send and
> 17-25 MBytes/s for incremental. I have seen lots of setups with 1
> disk to 100+ disks in a pool, but the speed doesn't vary to any
> degree. As I understand it, 'zfs send' is the limiting factor. I did
> tests by sending to /dev/null. It worked out too slow and absolutely
> not scalable. None of cpu/memory/disk activity were at peak load, so
> there is room for improvement. Is there any bug report or article
> that addresses this problem? Any workaround or solution? I found
> these guys have the same result - around 7 Mbytes/s for 'send' and
> 70 Mbytes for 'recv'.
> http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html

Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk mirror, the send runs at almost 100Mbytes/sec, so it's pretty much limited by the ethernet. Since you have provided none of the diagnostic data you collected, it's difficult to guess what the limiting factor is for you.

-- 
Andrew Gabriel
Re: [zfs-discuss] zpool scrub bad block list
ZFS detects far more errors that traditional filesystems will simply miss. This means that many of the possible causes of those errors will be something other than a real bad block on the disk. As Edward said, the disk firmware should automatically remap real bad blocks, so if ZFS did that too, we'd not use the remapped block, which is probably fine. For other errors, there's nothing wrong with the real block on the disk - it's going to be firmware, driver, cache corruption, or something else, so blacklisting the block will not solve the issue. Also, with some types of disk (SSD), block numbers are moved around to achieve wear leveling, so blacklisting a block number won't stop you reusing that real block.

-- 
Andrew Gabriel (from mobile)

--- Original message ---
From: Edward Ned Harvey
To: didier.reb...@u-bourgogne.fr, zfs-discuss@opensolaris.org
Sent: 8.11.'11, 12:50

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Didier Rebeix
>>> from ZFS documentation it appears unclear to me if a "zpool scrub"
>>> will blacklist any found bad blocks so they won't be used anymore.
>>
>> If there are any physically bad blocks, such that the hardware (hard
>> disk) will return an error every time that block is used, then the
>> disk should be replaced. All disks have a certain amount of error
>> detection/correction built in, and remap bad blocks internally and
>> secretly behind the scenes, transparent to the OS. So if there are
>> any blocks regularly reporting bad to the OS, then it means there is
>> a growing problem inside the disk. Offline the disk and replace it.
>> It is ok to get an occasional cksum error. Say, once a year. Because
>> the occasional cksum error will be re-read, and as long as the data
>> is correct the second time, no problem.
Re: [zfs-discuss] Thumper (X4500), and CF SSD for L2ARC = ?
Jim Klimov wrote:
> Thanks, but I believe currently that's out of budget, but a 90MB/s CF
> module may be acceptable for the small business customer. I wondered
> if that is known to work or not...

I've had a compact flash IDE drive not work in a white-box system. In that case it was a ufs root disk, but any attempt to put a serious load on it and it corrupted data all over the place. So if you're going to try one, make sure you hammer it very hard in a test environment before you commit anything important to it.

-- 
Andrew Gabriel
Re: [zfs-discuss] solaris 10u8 hangs with message Disconnected command timeout for Target 0
Ding Honghui wrote:
> Hi,
> My solaris storage hangs. I log in to the console and there are
> messages[1] displayed on the console. I can't log in to the console
> and it seems the IO is totally blocked. The system is solaris 10u8 on
> a Dell R710 with disk array Dell MD3000. 2 HBA cables connect the
> server and the MD3000. The symptom is random. It would be very much
> appreciated if anyone can help me out.

The SCSI target you are talking to is being reset. "Unit Attention" means it's forgotten what operating parameters have been negotiated with the system and is a warning the device might have been changed without the system knowing, and it's telling you this happened because of "device internal reset". That sort of thing can happen if the firmware in the SCSI target crashes and restarts, or the power supply blips, or if the device was swapped. I don't know anything about a Dell MD3000, but given it's happened on lots of disks at the same moment following a timeout, it looks like the array power cycled or the array firmware (if any) rebooted. (Not sure if a SCSI bus reset can do this or not.)
[1] Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /pci@0,0/pci8086,3410@9/pci8086,32c@0/pci1028,1f04@8 (mpt1): Aug 16 13:14:16 nas-hz-02 Disconnected command timeout for Target 0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802a44b8f0ded (sd47): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679073Error Block: 1380679073 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa18029e4b8f0d61 (sd41): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679072Error Block: 1380679072 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802a24b8f0dc5 (sd45): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679073Error Block: 1380679073 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa18029c4b8f0d35 (sd39): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679072Error Block: 1380679072 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), 
ASCQ: 0x4, FRU: 0x0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802984b8f0cd2 (sd35): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679072Error Block: 1380679072 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0 -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sudden drop in disk performance - WD20EURS & 4k sectors to blame?
David Wragg wrote: I've not done anything different this time from when I created the original (512b) pool. How would I check ashift? For a zpool called "export"... # zdb export | grep ashift ashift: 12 ^C # As far as I know (although I don't have any WDs), all the current 4K-sector hard drives claim to have 512b sectors, so if you didn't do anything special, you'll probably have ashift=9. I would look at zpool iostat -v to see what the IOPS rate is (you may have bottomed out on that), and I would also work out the average transfer size (although that alone doesn't necessarily tell you much - a dtrace quantize aggregation would be better). Also check service times on the disks (iostat) to see if there's one which is significantly worse and might be going bad. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
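The zdb check above can be wrapped in a one-liner. A minimal sketch, using the pool name "export" from the post; the sample text stands in for real zdb output so the parsing step can be shown without a live pool:

```shell
# Parse the ashift value out of zdb output. On a live system you would
# feed this from `zdb export` (pool name from the post); here a sample
# line stands in so the parsing can be demonstrated anywhere.
zdb_output='            ashift: 12'
ashift=$(printf '%s\n' "$zdb_output" | awk -F': ' '/ashift/ {print $2; exit}')
echo "$ashift"
```

ashift=9 means 512-byte alignment (the likely result if nothing special was done at pool creation); ashift=12 means the pool was created 4K-aligned.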
Re: [zfs-discuss] Disk IDs and DD
Lanky Doodle wrote: Oh no I am not bothered at all about the target ID numbering. I just wondered if there was a problem in the way it was enumerating the disks. Can you elaborate on the dd command LaoTsao? Is the 's' you refer to a parameter of the command or the slice of a disk - none of my 'data' disks have been 'configured' yet. I wanted to ID them before adding them to pools. Use p0 on x86 (whole disk, without regard to any partitioning). Any other s or p device node may or may not be there, depending on what partitions/slices are on the disk. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
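The dd approach being asked about can be sketched like this; the device name is a placeholder (pick yours from `format`), and reading the p0 node works even on an unconfigured disk because it bypasses any partitioning:

```shell
# Placeholder device name: substitute the c#t#d# you want to identify.
# Reading from p0 (the whole disk on x86) generates enough I/O to light
# the drive's activity LED, so you can match logical names to bays.
dd if=/dev/rdsk/c0t0d0p0 of=/dev/null bs=1M count=1000
```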
Re: [zfs-discuss] matching zpool versions to development builds
John Martin wrote: Is there a list of zpool versions for development builds? I found: http://blogs.oracle.com/stw/entry/zfs_zpool_and_file_system where it says Solaris 11 Express is zpool version 31, but my system has BEs back to build 139 and I have not done a zpool upgrade since installing this system, yet it reports on the current development build: # zpool upgrade -v This system is currently running ZFS pool version 33. It's painfully laid out (each on a separate page), but have a look at http://hub.opensolaris.org/bin/view/Community+Group+zfs/31 (change the version on the end of the URL). It conks out at version 31 though. I have systems back to build 125, so I tend to always force zpool version 19 for that (and that automatically limits the zfs version to 4). There's also some info about some builds on the zfs wikipedia page http://en.wikipedia.org/wiki/Zfs -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Large scale performance query
Alexander Lesle wrote: And what is your suggestion for scrubbing a mirror pool? Once per month, every 2 weeks, every week. There isn't just one answer. For a pool with redundancy, you need to do a scrub just before the redundancy is lost, so you can be reasonably sure the remaining data is correct and can rebuild the redundancy. The problem comes with knowing when this might happen. Of course, if you are doing some planned maintenance which will reduce the pool redundancy, then always do a scrub before that. However, in most cases, the redundancy is lost without prior warning, and you need to do periodic scrubs to cater for this case. I do a scrub via cron once a week on my home system. Having almost completely filled the pool, this was taking about 24 hours. However, now that I've replaced the disks and done a send/recv of the data across to a new larger pool which is only 1/3rd full, that's dropped down to 2 hours. For a pool with no redundancy, where you rely only on backups for recovery, the scrub needs to be integrated into the backup cycle, such that you will discover corrupt data before it has crept too far through your backup cycle to be able to find a non corrupt version of the data. When you have a new hardware setup, I would perform scrubs more frequently as a further check that the hardware doesn't have any systemic problems, until you have gained confidence in it. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
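The weekly scrub mentioned above can be automated from cron; a sketch, with "tank" as a placeholder pool name:

```shell
# root crontab entry: scrub the pool "tank" (placeholder name) every
# Sunday at 02:00. Tune the interval to your redundancy and backup cycle,
# as discussed above.
0 2 * * 0 /usr/sbin/zpool scrub tank
```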
Re: [zfs-discuss] Adding mirrors to an existing zfs-pool
Bernd W. Hennig wrote: G'Day, - zfs pool with 4 disks (from Clariion A) - must migrate to Clariion B (so I created 4 disks with the same size, available for the zfs) The zfs pool has no mirrors, my idea was to add the new 4 disks from the Clariion B to the 4 disks which are still in the pool - and later remove the original 4 disks. I only found in all the examples how to create a new pool with mirrors, but no example of how to add a mirror disk for each "disk" in a pool without mirrors. - is it possible to add disks to each disk in the pool (they have different sizes, so I have to match exactly the correct disks from Clariion B to the original disks from Clariion A) - can I later "remove" the disks from the Clariion A, pool is intact, user can work with the pool Depends on a few things... What OS are you running, and what release/update or build? What's the RAID layout of your pool "zpool status"? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs send/receive and ashift
Does anyone know if it's OK to do zfs send/receive between zpools with different ashift values? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Mount Options
Tony MacDoodle wrote: I have a zfs pool called logs (about 200G). I would like to create 2 volumes using this chunk of storage. However, they would have different mount points. ie. 50G would be mounted as /oracle/logs 100G would be mounted as /session/logs is this possible? Yes... zfs create -o mountpoint=/oracle/logs logs/oracle zfs create -o mountpoint=/session/logs logs/session If you don't otherwise specify, the two filesystems will share the pool without any constraints. If you wish to limit their max space... zfs set quota=50g logs/oracle zfs set quota=100g logs/session and/or if you wish to reserve a minimum space... zfs set reservation=50g logs/oracle zfs set reservation=100g logs/session Do I have to use the legacy mount options? You don't have to. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How create a FAT filesystem on a zvol?
Gary Mills wrote: On Sun, Jul 10, 2011 at 11:16:02PM +0700, Fajar A. Nugraha wrote: On Sun, Jul 10, 2011 at 10:10 PM, Gary Mills wrote: The `lofiadm' man page describes how to export a file as a block device and then use `mkfs -F pcfs' to create a FAT filesystem on it. Can't I do the same thing by first creating a zvol and then creating a FAT filesystem on it? seems not. [...] Some solaris tools (like fdisk, or "mkfs -F pcfs") need disk geometry to function properly. zvols don't provide that. If you want to use zvols to work with such tools, the easiest way would be using lofi, or exporting a zvol as an iscsi share and importing it again. For example, if you have a 10MB zvol and use lofi, fdisk would show this geometry Total disk size is 34 cylinders Cylinder size is 602 (512 byte) blocks ... which will then be used if you run "mkfs -F pcfs -o nofdisk,size=20480". Without lofi, the same command would fail with Drive geometry lookup (need tracks/cylinder and/or sectors/track: Operation not supported So, why can I do it with UFS? # zfs create -V 10m rpool/vol1 # newfs /dev/zvol/rdsk/rpool/vol1 newfs: construct a new file system /dev/zvol/rdsk/rpool/vol1: (y/n)? y Warning: 4130 sector(s) in last cylinder unallocated /dev/zvol/rdsk/rpool/vol1: 20446 sectors in 4 cylinders of 48 tracks, 128 sectors 10.0MB in 1 cyl groups (14 c/g, 42.00MB/g, 20160 i/g) super-block backups (for fsck -F ufs -o b=#) at: 32, Why is this different from PCFS? UFS has known for years that drive geometries are bogus, and just fakes something up to keep itself happy. What UFS thinks of as a cylinder bears no relation to actual disk cylinders. If you give mkfs_pcfs all the geometry data it needs, then it won't try asking the device... andrew@opensolaris:~# zfs create -V 10m rpool/vol1 andrew@opensolaris:~# mkfs -F pcfs -o fat=16,nofdisk,nsect=255,ntrack=63,size=2 /dev/zvol/rdsk/rpool/vol1 Construct a new FAT file system on /dev/zvol/rdsk/rpool/vol1: (y/n)? 
y andrew@opensolaris:~# fstyp /dev/zvol/rdsk/rpool/vol1 pcfs andrew@opensolaris:~# fsck -F pcfs /dev/zvol/rdsk/rpool/vol1 ** /dev/zvol/rdsk/rpool/vol1 ** Scanning file system meta-data ** Correcting any meta-data discrepancies 10143232 bytes. 0 bytes in bad sectors. 0 bytes in 0 directories. 0 bytes in 0 files. 10143232 bytes free. 512 bytes per allocation unit. 19811 total allocation units. 19811 available allocation units. andrew@opensolaris:~# mount -F pcfs /dev/zvol/dsk/rpool/vol1 /mnt andrew@opensolaris:~# -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 512b vs 4K sectors
Richard Elling wrote: On Jul 4, 2011, at 6:42 AM, Lanky Doodle wrote: Hiya, I've been doing a lot of research surrounding this and ZFS, including some posts on here, though I am still left scratching my head. I am planning on using slow RPM drives for a home media server, and it's these that seem to 'suffer' from a few problems; Seagate Barracuda LP - Looks to be the only true 512b sector hard disk. Serious firmware issues Western Digital Caviar Green - 4K sectors = crap write performance Hitachi 5K3000 - Variable sector sizing (according to tech. specs) Samsung SpinPoint F4 - Just plain old problems with them What is the best drive of the above 4, and are 4K drives really a no-no with ZFS. Are there any alternatives in the same price bracket? 4K drives are fine, especially if the workload is read-mostly. Depending on the OS, you can tell ZFS to ignore the incorrect physical sector size reported by some drives. Today, this is easiest in FreeBSD, a little bit more tricky in OpenIndiana (patches and source are available for a few different implementations). Or you can just trick them out by starting the pool with a 4K sector device that doesn't lie (eg, an iscsi target). Who would have thought choosing a hard disk could be so 'hard'! I recommend enterprise-grade disks, none of which made your short list ;-(. -- richard I'm going through this at the moment. I've bought a pair of Seagate Barracuda XT 2Tb disks (which are a bit more Enterprise than the list above), just plugged them in, and so far they're OK. Not had them long enough to report on longevity. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 700GB gone?
On 06/30/11 08:50 PM, Orvar Korvar wrote: I have a 1.5TB disk that has several partitions. One of them is 900GB. Now I can only see 300GB. Where is the rest? Is there a command I can do to reach the rest of the data? Will scrub help? Not much to go on - no one can answer this. How did you go about partitioning the disk? What does the fdisk partitioning look like (if it's x86)? What does the VTOC slice layout look like? What are you using each partition and slice for? What tells you that you can only see 300GB? -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 06/27/11 11:32 PM, Bill Sommerfeld wrote: On 06/27/11 15:24, David Magda wrote: Given the amount of transistors that are available nowadays I think it'd be simpler to just create a series of SIMD instructions right in/on general CPUs, and skip the whole co-processor angle. see: http://en.wikipedia.org/wiki/AES_instruction_set Present in many current Intel CPUs; also expected to be present in AMD's "Bulldozer" based CPUs. I recall seeing a blog comparing the existing Solaris hand-tuned AES assembler performance with the (then) new AES instruction version, where the Intel AES instructions only got you about a 30% performance increase. I've seen reports of better performance improvements, but usually by comparing with the performance on older processors which are going to be slower for additional reasons than just missing the AES instructions. Also, you could claim a better performance improvement if you compared against a less efficient original implementation of AES. What this means is that a faster CPU may buy you more crypto performance than the AES instructions alone will do. My understanding from reading the Intel AES instruction set documentation (which I warn might not be completely correct) is that the AES encryption/decryption instruction is executed between 10 and 14 times (depending on key length) for each 128 bits (16 bytes) of data being encrypted/decrypted, so it's very much part of the regular instruction pipeline. The code will have to loop through this process multiple times to process a data block bigger than 16 bytes, i.e. a double nested loop, although I expect it's normally loop-unrolled a fair degree for optimisation purposes. Conversely, the crypto units in the T-series processors are separate from the CPU, and do the encryption/decryption whilst the CPU is getting on with something else, and they do it much faster than it could be done on the CPU. 
Small blocks are normally a problem for crypto offload engines because the overhead of farming off the work to the engine and getting the result back often means that you can do the crypto on the CPU faster than the time it takes to get the crypto engine started and stopped. However, T-series crypto is particularly good at handling small blocks efficiently, such as around 1kbyte which you are likely to find in a network packet, as it is much closer coupled to the CPU than a PCI crypto card can be, and performance with small packets was key for the crypto networking support T-series was designed for. Of course, it handles crypto of large blocks just fine too. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote: On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote: Richard Elling wrote: Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs. Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS by reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this. I agree. And disksort is in the mix, too. Oh, I'd never looked at that. This is something I played with ~30 years ago, when the OS disk driver was responsible for queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb. ...and disksort still survives... maybe we should kill it? It looks like it's possibly slightly worse than the pathologically worst response time case I described below... There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. The best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. 
That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time. The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os sent to the disk. Does that also go through disksort? Disksort doesn't seem to have any concept of priorities (but I haven't looked in detail at where it plugs into the whole framework). So it might make better sense for ZFS to keep the disk queue depth small for HDDs. -- richard -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote: Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs. Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS by reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this. This is something I played with ~30 years ago, when the OS disk driver was responsible for queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb. There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. The best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. 
Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
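The elevator-sweep reordering described above can be modelled in a few lines of shell; this is only a toy illustration of the idea (not the actual disksort code): given the current head position, service the queued block numbers in one upward sweep, then wrap around to the ones behind the head.

```shell
# Toy elevator-sweep model: sort the queued block numbers, then service
# everything at or above the current head position first (the upward
# sweep), followed by the blocks behind the head (the wrap-around).
head=500
queue='900 100 600 300 750'
printf '%s\n' $queue | sort -n | awk -v h="$head" '
  $1 >= h { above = above " " $1; next }
          { below = below " " $1 }
  END     { print substr(above below, 2) }'
```

With the head at 500, the service order comes out as 600 750 900 100 300 -- no longer a random sequence of seeks, which is one way to see why IOPS holds up better than naive random-seek arithmetic would suggest.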
Re: [zfs-discuss] compare snapshot to current zfs fs
Harry Putnam wrote: I have a sneaking feeling I'm missing something really obvious. If you have a zfs fs that sees little use and you have lost track of whether changes may have occurred since the last snapshot, is there some handy way to determine if a snapshot matches its filesystem? Or put another way, some way to determine if the snapshot is different from its current filesystem? I know about the diff tools, and of course I guess one could compare overall sizes in bytes for a good idea, but is there a way provided by zfs? If you have a recent enough OS release... zfs diff <snapshot> [<snapshot>|<filesystem>] -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely slow zpool scrub performance
On 05/14/11 01:08 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Donald Stahl Running a zpool scrub on our production pool is showing a scrub rate of about 400K/s. (When this pool was first set up we saw rates in the MB/s range during a scrub). Wait longer, and keep watching it. Or just wait till it's done and look at the total time required. It is normal to have periods of high and low during scrub. I don't know why. Check the IOPS per drive - you may be maxing out on one of them if it's in an area where there are lots of small blocks. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
Toby Thain wrote: On 08/05/11 10:31 AM, Edward Ned Harvey wrote: ... Incidentally, does fsync() and sync return instantly or wait? Cuz "time sync" might produce 0 sec every time even if there were something waiting to be flushed to disk. The semantics need to be synchronous. Anything else would be a horrible bug. sync(2) is not required to be synchronous. I believe that for ZFS it is synchronous, but for most other filesystems, it isn't (although a second sync will block until the actions resulting from a previous sync have completed). fsync(3C) is synchronous. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Faster copy from UFS to ZFS
Dan Shelton wrote: Is anyone aware of any freeware program that can speed up copying tons of data (2 TB) from UFS to ZFS on same server? I use 'ufsdump | ufsrestore'*. I would also suggest try setting 'sync=disabled' during the operation, and reverting it afterwards. Certainly, fastfs (a similar although more dangerous option for ufs) makes ufs to ufs copying significantly faster. *ufsrestore works fine on ZFS filesystems (although I haven't tried it with any POSIX ACLs on the original ufs filesystem, which would probably simply get lost). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
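Spelled out, the pipeline and the sync toggle suggested above look something like this; dataset and slice names are placeholders, and this is a sketch rather than a tested procedure:

```shell
# Placeholder names throughout: tank/data is the destination zfs dataset,
# c0t0d0s0 the source ufs slice. sync=disabled speeds up the restore;
# revert it as soon as the copy finishes.
zfs set sync=disabled tank/data
cd /tank/data || exit 1
ufsdump 0f - /dev/rdsk/c0t0d0s0 | ufsrestore rf -
zfs inherit sync tank/data
```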
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
Matthew Anderson wrote: Hi All, I've run into a massive performance problem after upgrading to Solaris 11 Express from oSol 134. Previously the server was performing a batch write every 10-15 seconds and the client servers (connected via NFS and iSCSI) had very low wait times. Now I'm seeing constant writes to the array with a very low throughput and high wait times on the client servers. Zil is currently disabled. How/Why? There is currently one failed disk that is being replaced shortly. Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134? What does "zfs get sync" report? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SIL3114 and sparc solaris 10
Krunal Desai wrote: On Wed, Feb 23, 2011 at 8:38 AM, Mauricio Tavares wrote: I see what you mean; in http://mail.opensolaris.org/pipermail/opensolaris-discuss/2008-September/043024.html they claim it is supported by the uata driver. What would you suggest instead? Also, since I have the card already, how about if I try it out? My experience with SPARC is limited, but perhaps the Option ROM/BIOS for that card is intended for x86, and not SPARC? I might be thinking of another controller, but this could be the case. You could always try to boot with the card; the worst that'll probably happen is the boot hangs before the OS even comes into play. SPARC won't try to run the BIOS on the card anyway (it will only run OpenFirmware BIOS), but you will have to make sure the card has the non-RAID BIOS so that the PCI class doesn't claim it to be a RAID controller, which would prevent Solaris going anywhere near the card at all. These cards could be bought with either RAID or non-RAID BIOS, but RAID was more common. You can (or could some time back) download the RAID and non-RAID BIOS from Silicon Image and re-flash, which also updates the PCI class, and I think you'll need a Windows system to actually flash the BIOS. You might want to do a google search on "3114 data corruption" too, although it never hit me back when I used the cards. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SIL3114 and sparc solaris 10
Mauricio Tavares wrote: Perhaps a bit off-topic (I asked on the rescue list -- http://web.archiveorange.com/archive/v/OaDWVGdLhxWVWIEabz4F -- and was told to try here), but I am kinda shooting in the dark: I have been finding online scattered and vague info stating that this card can be made to work with a sparc solaris 10 box (http://old.nabble.com/eSATA-or-firewire-in-Solaris-Sparc-system-td27150246.html is the only link I can offer right now). Can anyone confirm or deny that? 3112/3114 was a very early (possibly the first?) SATA chipset, I think aimed for use before SATA drivers had been developed. I would suggest looking for something more modern. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)
Roy Sigurd Karlsbakk wrote: Nope. Most HDDs today have a single read channel, and they select which head uses that channel at any point in time. They cannot use multiple heads at the same time, because the heads do not travel the same path on their respective surfaces at the same time. There's no real vertical alignment of the tracks between surfaces, and every surface has its own embedded position information that is used when that surface's head is active. There were attempts at multi-actuator designs with separate servo arms and multiple channels, but mechanically they're too difficult to manufacture at high yields, as I understood it. Perhaps a stupid question, but why don't they read from all platters in parallel? The answer is in the text you quoted above. There are drives now with two-level actuators. The primary actuator is the standard actuator you are familiar with which moves all the arms. The secondary actuator is a piezo crystal towards the head end of the arm which can move the head a few tracks very quickly without having to move the arm, and these are one per head. In theory, this might allow multiple heads to lock on to their respective tracks at the same time for parallel reads, but I haven't heard that they are used in this way. If you go back to the late 1970's before tracks had embedded servo data, on multi-platter disks you had one surface which contained the head positioning servo data, and the drive relied on accurate vertical alignment between heads/surfaces to keep on track (and drives could head-switch instantly). Around 1980, tracks got too close together for this to work anymore, and the servo positioning data was embedded into each track itself. The very first drives of this type scanned all the surfaces on startup to build up an internal table of the relative misalignment of tracks across the surfaces, but this rapidly became unviable as drive capacity increased and this scan would take an unreasonable length of time. 
It may be that modern drives learn this as they go - I don't know. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey My google-fu is coming up short on this one... I didn't see that it had been discussed in a while ... BTW, there were a bunch of places where people said "ZFS doesn't need trim." Which I hope, by now, has been commonly acknowledged as bunk. The only situation where you don't need TRIM on a SSD is when (a) you're going to fill it once and never write to it again, which is highly unlikely considering the fact that you're buying a device for its fast write performance... (b) you don't care about performance, which is highly unlikely considering the fact that you bought a performance device ... (c) you are using whole disk encryption. This is a valid point. You would probably never TRIM anything from a fully encrypted disk ... In places where people said TRIM was thought to be unnecessary, the justification they stated was that TRIM will only benefit people whose usage patterns are sporadic, rather than sustained. The downfall of that argument is the assumption that the device can't perform TRIM operations simultaneously while performing other operations. That may be true in some cases, or even globally, but without backing, it's just an assumption. One which I find highly questionable. TRIM could also be useful where ZFS uses a storage LUN which is sparsely provisioned, in order to deallocate blocks in the LUN which have previously been allocated, but whose contents have since been invalidated. In this case, both ZFS and whatever is providing the storage LUN would need to support TRIM. Out of interest, what other filesystems out there today can generate TRIM commands? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] show hdd life time ?
Richard Elling wrote: On Jan 21, 2011, at 7:36 PM, Tobias Lauridsen wrote: is it possible to see my hdd's total time it has been in use, so I can switch to a new one before it gets too many hours old? In theory, yes. In practice, I've never seen a disk properly report this data on a consistent basis :-( Perhaps some of the more modern disks do a better job? Look for the power on hours (POH) attribute of SMART. http://en.wikipedia.org/wiki/S.M.A.R.T. If you're looking for stats to give an indication of likely wear, and thus increasing probability of failure, POH is probably not very useful by itself (or even at all). Things like Head Flying Hours and Load Cycle Count are probably more indicative, although not necessarily maintained by all drives. Of course, data which gives indication of actual (rather than likely) wear is even more important as an indicator of impending failure, such as the various error and retry counts. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
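A hedged sketch of pulling those attributes out of smartctl-style output. The sample text below is invented for illustration; real `smartctl -A /dev/...` output varies by drive and firmware, and not all drives maintain all of these attributes.

```python
# Parse wear-related attributes from smartctl-style "-A" output.
# SAMPLE is fabricated for this example; the raw value is the last field.

SAMPLE = """\
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21377
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       612406
240 Head_Flying_Hours       0x0019   100   100   000    Old_age   Offline      -       18211
"""

def smart_attrs(text):
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric attribute ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

a = smart_attrs(SAMPLE)
print(a["Power_On_Hours"], a["Load_Cycle_Count"])
```

The made-up Load_Cycle_Count here (612406, with a normalised value of 001) is the kind of reading that flags real wear long before POH alone would worry anyone.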
Re: [zfs-discuss] incorrect vdev added to pool
On 01/15/11 11:32 PM, Gal Buki wrote: Hi I have a pool with a raidz2 vdev. Today I accidentally added a single drive to the pool. I now have a pool that partially has no redundancy as this vdev is a single drive. Is there a way to remove the vdev Not at the moment, as far as I know. and replace it with a new raidz2 vdev? If not what can I do to do damage control and add some redundancy to the single drive vdev? I think you should be able to attach another disk to it to make them into a mirror. (Make sure you attach, and not add.) -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to quiesce and unquiesc zfs and zpool for array/hardware snapshots ?
Sridhar, You have switched to a new disruptive filesystem technology, and it has to be disruptive in order to break out of all the issues older filesystems have, and give you all the new and wonderful features. However, you are still trying to use old filesystem techniques with it, which is why things don't fit for you, and you are missing out on the more powerful way ZFS presents these features to you. On 11/16/10 06:59 AM, Ian Collins wrote: On 11/16/10 07:19 PM, sridhar surampudi wrote: Hi, How it would help for instant recovery or point in time recovery ?? i.e restore data at device/LUN level ? Why would you want to? If you are sending snapshots to another pool, you can do instant recovery at the pool level. Point in time recovery is a feature of ZFS snapshots. What's more, with ZFS you can see all your snapshots online all the time, read and/or recover just individual files or whole datasets, and the storage overhead is very efficient. If you want to recover a whole LUN, that's presumably because you lost the original, and in this case the system won't have the original filesystem mounted. Currently it is easy as I can unwind the primary device stack and restore data at device/ LUN level and recreate stack. It's probably easier with ZFS to restore data at the pool or filesystem level from snapshots. Trying to work at the device level is just adding an extra level of complexity to a problem already solved. I won't claim ZFS couldn't better support use of back-end Enterprise storage, but in this case, you haven't given any use cases where that's relevant. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Changing GUID
sridhar surampudi wrote: Hi I am looking in similar lines, my requirement is 1. create a zpool on one or many devices ( LUNs ) from an array ( array can be IBM or HPEVA or EMC etc.. not SS7000). 2. Create file systems on zpool 3. Once file systems are in use (I/0 is happening) I need to take snapshot at array level a. Freeze the zfs flle system ( not required due to zfs consistency : source : mailing groups) b. take array snapshot ( say .. IBM flash copy ) c. Got new snapshot device (having same data and metadata including same GUID of source pool) Now I need a way to change the GUID and pool of snapshot device so that the snapshot device can be accessible on same host or an alternate host (if the LUN is shared). Could you please post commands for the same. There is no way I know of currently. (There was an unofficial program floating around to do this on much earlier opensolaris versions, but it no longer works). If you have a support contract, raise a call and asked to be added to RFE 6744320. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to quiesce and unquiesc zfs and zpool for array/hardware snapshots ?
sridhar surampudi wrote: Hi Darren, In short, I am looking for a way to freeze and thaw a zfs file system so that for a hardware snapshot, I can do 1. run zfs freeze 2. run hardware snapshot on devices belonging to the zpool where the given file system is residing. 3. run zfs thaw Unlike other filesystems, ZFS is always consistent on disk, so there's no need to freeze a zpool to take a hardware snapshot. The hardware snapshot will effectively contain all transactions up to the last transaction group commit, plus all synchronous transactions up to the hardware snapshot. If you want to be sure that all transactions up to a certain point in time are included (for the sake of an application's data), take a ZFS snapshot (which will force a TXG commit), and then take the hardware snapshot. You will not be able to access the hardware snapshot from the system which has the original zpool mounted, because the two zpools will have the same pool GUID (there's an RFE outstanding on fixing this). The one thing you do need to be careful of is, that with a multi-disk zpool, the hardware snapshot is taken at an identical point in time across all the disks in the zpool. This functionality is usually an extra-charge option in Enterprise storage systems. If the hardware snapshots are staggered across multiple disks, all bets are off, although if you take a zfs snapshot immediately beforehand and you test import/scrub the hardware snapshot (on a different system) immediately (so you can repeat the hardware snapshot again if it fails), maybe you will be lucky. The right way to do this with zfs is to send/recv the datasets to a fresh zpool, or (S10 Update 9) to create an extra zpool mirror and then split it off with zpool split. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How do you use >1 partition on x86?
Change the new partition type to something that none of the OS's on the system will know anything about, so they don't make any invalid assumptions about what might be in it. Then use the appropriate partition device node, /dev/dsk/c7t0d0p4 (assuming it's the 4th primary FDISK partition). Multiple zpools on one disk is not going to be good for performance if you use both together. There may be some way to grow the existing Solaris partition into the spare space without destroying the contents and then growing the zpool into the new space, but I haven't tried this with FDISK partitions, so I don't know if it works without damaging the existing contents. (I have done it with slices, and it does work in that case.) Bill Werner wrote: So when I built my new workstation last year, I partitioned the one and only disk in half, 50% for Windows, 50% for 2009.06. Now, I'm not using Windows, so I'd like to use the other half for another ZFS pool, but I can't figure out how to access it. I have used fdisk to create a second Solaris2 partition, did a re-con reboot, but format still only shows the 1 available partition. How do I use the second partition?

selecting c7t0d0
Total disk size is 30401 cylinders
Cylinder size is 16065 (512 byte) blocks

                                      Cylinders
Partition  Status   Type        Start    End   Length    %
    1               Other OS        0      4        5    0
    2               IFS: NTFS       5   1917     1913    6
    3      Active   Solaris2     1917  14971    13055   43
    4               Solaris2    14971  30170    15200   50

format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c7t0d0 /p...@0,0/pci1028,2...@1f,2/d...@0,0

Thanks for any idea. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
casper@sun.com wrote: On Tue, Oct 5, 2010 at 11:49 PM, wrote: I'm not sure that that is correct; the drive works on naive clients but I believe it can reveal its true colors. The drive reports 512 byte sectors to all hosts. AFAIK there's no way to make it report 4k sectors. Too bad because it makes it less useful (specifically because the label mentions sectors and if you can use bigger sectors, you can address a larger drive). Having now read a number of forums about these, there's a strong feeling WD screwed up by not providing a switch to disable pseudo 512b access so you can use the 4k native. The industry as a whole will transition to 4k sectorsize over next few years, but these first 4k sectorsize HDs are rather less useful with 4k sectorsize-aware OS's. Let's hope other manufacturers get this right in their first 4k products. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
Michael DeMan wrote: The WD 1TB 'enterprise' drives are still 512 sector size and safe to use, who knows though, maybe they just started shipping with 4K sector size as I write this e-mail? Another annoying thing with the whole 4K sector size, is what happens when you need to replace drives next year, or the year after? That part has me worried on this whole 4K sector migration thing more than what to buy today. Given the choice, I would prefer to buy 4K sector size now, but operating system support is still limited. Does anybody know if there any vendors that are shipping 4K sector drives that have a jumper option to make them 512 size? WD has a jumper, but is there explicitly to work with WindowsXP, and is not a real way to dumb down the drive to 512. I would presume that any vendor that is shipping 4K sector size drives now, with a jumper to make it 'real' 512, would be supporting that over the long run? Changing the sector size (if it's possible at all) would require a reformat of the drive. On SCSI disks which support it, you do it by changing the sector size on the relevant mode select page, and then sending a format-unit command to make the drive relayout all the sectors. I've no idea if these 4K sata drives have any such mechanism, but I would expect they would. BTW, I've been using a pair of 1TB Hitachi Ultrastar for something like 18 months without any problems at all. Of course, a 1 year old disk model is no longer available now. I'm going to have to swap out for bigger disks in the not too distant future. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
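The cost of the 512-byte emulation discussed above can be put in numbers. A hedged sketch (simple arithmetic, not drive firmware): a drive with 4 KiB physical sectors pays a read-modify-write penalty whenever a logical write only partially covers a physical sector.

```python
# Illustrative arithmetic: physical-sector operations an emulated-512B
# drive performs for one logical write, aligned vs misaligned.

PHYS = 4096  # physical sector size in bytes

def phys_ops(offset, length, phys=PHYS):
    """Count physical sector writes plus the extra reads needed for
    read-modify-write of partially covered physical sectors."""
    first = offset // phys
    last = (offset + length - 1) // phys
    sectors = last - first + 1          # physical sector writes
    rmw = 0
    if offset % phys:                   # leading partial sector
        rmw += 1
    if (offset + length) % phys:        # trailing partial sector
        rmw += 1
    if sectors == 1 and rmw == 2:       # both partials hit the same sector
        rmw = 1
    return sectors + rmw                # writes + extra reads

print(phys_ops(0, 8192))    # aligned 8 KiB write: 2 physical ops
print(phys_ops(512, 8192))  # same write shifted 512B: 3 writes + 2 reads = 5
```

This is why partition/slice alignment matters so much on these first-generation 4K drives, and why the XP-compatibility jumper is not a real fix.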
Re: [zfs-discuss] fs root inode number?
Richard L. Hamilton wrote: Typically on most filesystems, the inode number of the root directory of the filesystem is 2, 0 being unused and 1 historically once invisible and used for bad blocks (no longer done, but kept reserved so as not to invalidate assumptions implicit in ufsdump tapes). However, my observation seems to be (at least back at snv_97), the inode number of ZFS filesystem root directories (including at the top level of a zpool) is 3, not 2. If there's any POSIX/SUS requirement for the traditional number 2, I haven't found it. So maybe there's no reason founded in official standards for keeping it the same. But there are bound to be programs that make what was with other filesystems a safe assumption. Perhaps a warning is in order, if there isn't already one. Is there some _reason_ why the inode number of filesystem root directories in ZFS is 3 rather than 2? If you look at zfs_create_fs(), you will see the first 3 objects created are: the zap object used for SA attribute registration, the delete queue, and the root znode. Hence, inode 3. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
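Checking this for yourself is a one-liner. A small sketch (the specific inode values are filesystem-dependent, as the thread explains):

```python
# Report the inode number of a filesystem root directory.
# Conventionally 2 on UFS/ext-style filesystems; 3 on a ZFS dataset
# root, because the SA-registration zap object and the delete queue
# are allocated before the root znode.
import os

def root_inode(mountpoint):
    return os.stat(mountpoint).st_ino

print(root_inode("/"))  # e.g. 2 on ext4/UFS, 3 on a ZFS root
```

Any program that hard-codes `st_ino == 2` as "is this the filesystem root?" will misbehave on ZFS; comparing `st_dev` across a directory and its parent is the portable test.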
Re: [zfs-discuss] Adding a higher level partition to ZFS pool
Axelle Apvrille wrote: Hi all, I would like to add a new partition to my ZFS pool but it looks like it's more tricky than expected. The layout of my disk is the following: - first partition for Windows. I want to keep it. (no formatting !) - second partition for OpenSolaris. This is where I have all the Solaris slices (c0d0s0 etc). I have a single ZFS pool. OpenSolaris boots on ZFS. - third partition: a FAT partition I want to keep (no formatting !) - fourth partition: I want to add this partition to my ZFS pool (or another pool ?). I don't care if information on that partition is lost, I can format it if necessary. zpool add c0d0p0:2 ? Hmm... You cannot add it to the root pool, as the root pool cannot be a RAID0. You can make another pool from it... Ideally, set the FDISK partition type to something that none of the OS's on the system will know anything about. (It doesn't matter what it is from the zfs point of view, but you don't want any of the OS's thinking it's something they believe they know how to use.) zpool create tank c0d0p4 (for the 4th FDISK primary partition). Note that two zpools on the same disk may give you poor performance if you are accessing both at the same time, as you are forcing head seeking between them. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Failed zfs send "invalid backup stream".............
Humberto Ramirez wrote: I'm trying to replicate a 300 GB pool with this command zfs send al...@3 | zfs receive -F omega about two hours into the process it fails with this error "cannot receive new filesystem stream: invalid backup stream" I have tried setting the target read only (zfs set readonly=on omega) and also disabling Timeslider, thinking it might have something to do with it. What could be causing the error ? Could be zfs filesystem version too old on the sending side (I have one such case). What are their versions, and what release/build of the OS are you using? if the target is a new hard drive can I use this zfs send al...@3 > /dev/c10t0d0 ? That command doesn't make much sense for the purpose of doing anything useful. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...
I'm not intimately familiar with the firmware versions, but if you're having problems, making sure you have latest firmware is probably a good thing to do. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
What you say is true only on the system itself. On an NFS client system, 30 seconds of lost data in the middle of a file (as per my earlier example) is a corrupt file. -original message- Subject: Re: [zfs-discuss] Solaris startup script location From: Edward Ned Harvey Date: 18/08/2010 17:17 > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Alxen4 > > Disabling ZIL converts all synchronous calls to asynchronous which > makes ZSF to report data acknowledgment before it actually was written > to stable storage which in turn improves performance but might cause > data corruption in case of server crash. > > Is it correct ? It is partially correct. With the ZIL disabled, you could lose up to 30 sec of writes, but it won't cause an inconsistent filesystem, or "corrupt" data. If you make a distinction between "corrupt" and "lost" data, then this is valuable for you to know: Disabling the ZIL can result in up to 30sec of lost data, but not corrupt data. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
Alxen4 wrote: Thanks...Now I think I understand... Let me summarize it and let me know if I'm wrong. Disabling ZIL converts all synchronous calls to asynchronous which makes ZFS report data acknowledgment before it actually was written to stable storage which in turn improves performance but might cause data corruption in case of server crash. Is it correct ? In my case I'm having serious performance issues with NFS over ZFS. You need a non-volatile slog, such as an SSD. My NFS Client is ESXi so the major question is there risk of corruption for VMware images if I disable ZIL ? Yes. If your NFS server takes an unexpected outage and comes back up again, some writes will have been lost which ESXi thinks succeeded (typically 5 to 30 seconds worth of writes/updates immediately before the outage). So as an example, if you had an application writing a file sequentially, you will likely find an area of the file is corrupt because the data was lost. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
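The loss window described above can be sketched as a toy model (this is not ZFS code, just an illustration of the semantics): with no stable-storage log, acknowledged writes only survive a crash if they landed before the last transaction-group commit.

```python
# Toy model of the loss window with the ZIL disabled: synchronous
# writes are acknowledged immediately, but only reach stable storage
# at each transaction-group commit (here, every commit_interval secs).

def surviving_writes(writes, commit_interval, crash_time):
    """writes: list of (timestamp, data) already acknowledged to the
    client. Only writes before the last commit boundary survive."""
    last_commit = (crash_time // commit_interval) * commit_interval
    return [d for t, d in writes if t < last_commit]

# A client writes one block every 5 seconds; the server crashes at t=95.
writes = [(t, f"block{t}") for t in range(0, 100, 5)]
ok = surviving_writes(writes, commit_interval=30, crash_time=95)
lost = len(writes) - len(ok)
print(lost)  # the acknowledged writes at t=90 and t=95 are gone
```

This is exactly why a sequentially written VM image on the NFS client can end up with a silently corrupt region: the client believes blocks 90 and 95 were committed.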
Re: [zfs-discuss] Solaris startup script location
Andrew Gabriel wrote: Alxen4 wrote: Is there any way run start-up script before non-root pool is mounted ? For example I'm trying to use ramdisk as ZIL device (ramdiskadm ) So I need to create ramdisk before actual pool is mounted otherwise it complains that log device is missing :) For sure I can manually remove/and add it by script and put the script in regular rc2.d location...I'm just looking for more elegant way to it. Can you start by explaining what you're trying to do, because this may be completely misguided? A ramdisk is volatile, so you'll lose it when system goes down, causing failure to mount on reboot. Recreating a ramdisk on reboot won't recreate the slog device you lost when the system went down. I expect the zpool would fail to mount. Furthermore, using a ramdisk as a ZIL is effectively just a very inefficient way to disable the ZIL. A better way to do this is to "zfs set sync=disabled ..." on relevant filesystems. I can't recall which build introduced this, but prior to that, you can set zfs://zil_disable=1 in /etc/system but that applies to all pools/filesystems. The double-slash was brought to you by a bug in thunderbird. The original read: set zfs:zil_disable=1 -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
Alxen4 wrote: Is there any way run start-up script before non-root pool is mounted ? For example I'm trying to use ramdisk as ZIL device (ramdiskadm ) So I need to create ramdisk before actual pool is mounted otherwise it complains that log device is missing :) For sure I can manually remove/and add it by script and put the script in regular rc2.d location...I'm just looking for more elegant way to it. Can you start by explaining what you're trying to do, because this may be completely misguided? A ramdisk is volatile, so you'll lose it when system goes down, causing failure to mount on reboot. Recreating a ramdisk on reboot won't recreate the slog device you lost when the system went down. I expect the zpool would fail to mount. Furthermore, using a ramdisk as a ZIL is effectively just a very inefficient way to disable the ZIL. A better way to do this is to "zfs set sync=disabled ..." on relevant filesystems. I can't recall which build introduced this, but prior to that, you can set zfs://zil_disable=1 in /etc/system but that applies to all pools/filesystems. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS automatic rollback and data rescue.
Constantine wrote: ZFS doesn't do this. I thought so too. ;) Situation brief: I've got OpenSolaris 2009.06 installed on the RAID-5 array on the controller with 512 Mb cache (as i can remember) without a cache-saving battery. I hope the controller disabled the cache then. Probably a good idea to run "zpool scrub rpool" to find out if it's broken. It will probably take some time. zpool status will show the progress. On Friday a lightning bolt hit the power supply station of the colocating company, and it turned out that their UPSs are not much more than decoration. After reboot filesystem and logs are on their last snapshot version. Would also be useful to see output of: zfs list -t all -r zpool/filesystem wi...@zeus:~/.zfs/snapshot# zfs list -t all -r rpool

NAME                                   USED  AVAIL  REFER  MOUNTPOINT
rpool                                  427G  1.37T  82.5K  /rpool
rpool/ROOT                             366G  1.37T    19K  legacy
rpool/ROOT/opensolaris                20.6M  1.37T  3.21G  /
rpool/ROOT/xvm                        8.10M  1.37T  8.24G  /
rpool/ROOT/xvm-1                       690K  1.37T  8.24G  /
rpool/ROOT/xvm-2                      35.1G  1.37T   232G  /
rpool/ROOT/xvm-3                       851K  1.37T   221G  /
rpool/ROOT/xvm-4                       331G  1.37T   221G  /
rpool/ROOT/xv...@install               144M      -  2.82G  -
rpool/ROOT/xv...@xvm                  38.3M      -  3.21G  -
rpool/ROOT/xv...@2009-07-27-01:09:14    56K      -  8.24G  -
rpool/ROOT/xv...@2009-07-27-01:09:57    56K      -  8.24G  -
rpool/ROOT/xv...@2009-09-13-23:34:54  2.30M      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:35:17  1.14M      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:42:12  5.72M      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:42:45  5.69M      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:46:25   573K      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:46:34   525K      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:48:11  6.51M      -   206G  -
rpool/ROOT/xv...@2010-04-22-03:50:25  24.6M      -   221G  -
rpool/ROOT/xv...@2010-04-22-03:51:28  24.6M      -   221G  -

Actually, there's 24.6Mbytes worth of changes to the filesystem since the last snapshot, which is coincidentally about the same as there was over the preceding minute between the last two snapshots. I can't tell if (or how much of) that happened before, versus after, the reboot though.

rpool/dump                            16.0G  1.37T  16.0G  -
rpool/export                          28.6G  1.37T    21K  /export
rpool/export/home                     28.6G  1.37T    21K  /export/home
rpool/export/home/wiron               28.6G  1.37T  28.6G  /export/home/wiron
rpool/swap                            16.0G  1.38T   101M  -

Normally in a power-out scenario, you will only lose asynchronous writes since the last transaction group commit, which will be up to 30 seconds worth (although normally much less), and you lose no synchronous writes. However, I've no idea what your potentially flaky RAID array will have done. If it was using its cache and thinking it was non-volatile, then it could easily have corrupted the zfs filesystem due to having got writes out of sequence with transaction commits, and this can render the filesystem no longer mountable because the back-end storage has lied to zfs about committing writes. Even though you were lucky and it still mounts, it might still be corrupted, hence the suggestion to run zpool scrub (and even more important, get the RAID array fixed). Since I presume ZFS doesn't have redundant storage for this zpool, any corrupted data can't be repaired by ZFS, although it will tell you about it. Running ZFS without redundancy on flaky storage is not a good place to be. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS automatic rollback and data rescue.
Constantine wrote: Hi. I've got the ZFS filesystem (opensolaris 2009.06), witch, as i can see, was automatically rollbacked by OS to the lastest snapshot after the power failure. ZFS doesn't do this. Can you give some more details of what you're seeing? Would also be useful to see output of: zfs list -t all -r zpool/filesystem There is a trouble - snapshot is too old, and ,consequently, there is a questions -- Can I browse pre-rollbacked corrupted branch of FS ? And, if I can, how ? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Global Spare for 2 pools
Tony MacDoodle wrote: I have 2 ZFS pools all using the same drive type and size. The question is can I have 1 global hot spare for both of those pools? Yes. A hot spare disk can be added to more than one pool at the same time. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAID Z stripes
Phil Harman wrote: On 10 Aug 2010, at 08:49, Ian Collins wrote: On 08/10/10 06:21 PM, Terry Hull wrote: I am wanting to build a server with 16 x 1TB drives with 2 x 8-drive RAID Z2 arrays striped together. However, I would like the capability of adding additional stripes of 2TB drives in the future. Will this be a problem? I thought I read it is best to keep the stripes the same width and was planning to do that, but I was wondering about using drives of different sizes. These drives would all be in a single pool. It would work, but you run the risk of the smaller drives becoming full and all new writes going to the bigger vdev. So while usable, performance would suffer. Almost by definition, the 1TB drives are likely to be getting full when the new drives are added (presumably because of running out of space). Performance can only be said to suffer relative to a new pool built entirely with drives of the same size. Even if he added 8x 2TB drives in a RAIDZ3 config it is hard to predict what the performance gap will be (on the one hand: RAIDZ3 vs RAIDZ2, on the other: an empty group vs an almost full, presumably fragmented, group). One option would be to add 2TB drives as 5 drive raidz3 vdevs. That way your vdevs would be approximately the same size and you would have the optimum redundancy for the 2TB drives. I think you meant 6, but I don't see a good reason for matching the group sizes. I'm for RAIDZ3, but I don't see much logic in mixing groups of 6+2 x 1TB and 3+3 x 2TB in the same pool (in one group I appear to care most about maximising space, in the other I'm maximising availability). Another option - use the new 2TB drives to swap out the existing 1TB drives. If you can find another use for the swapped out drives, this works well, and avoids ending up with sprawling lower capacity drives as your pool grows in size. This is what I do at home. The freed-up drives get used in other systems and for off-site backups. 
Over the last 4 years, I've upgraded from 1/4TB, to 1/2TB, and now on 1TB drives. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
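A rough usable-capacity comparison of the configurations discussed in this thread can be sketched with simple arithmetic (raw space only; this ignores ZFS metadata, slop space, and compression):

```python
# Back-of-envelope usable capacity for striped raidz groups:
# each group contributes (drives - parity) * drive_size.

def raidz_usable(n_drives, drive_tb, parity):
    return (n_drives - parity) * drive_tb

pool = [
    raidz_usable(8, 1, 2),  # existing 8 x 1TB raidz2 ->  6 TB
    raidz_usable(8, 1, 2),  # second   8 x 1TB raidz2 ->  6 TB
    raidz_usable(8, 2, 3),  # added    8 x 2TB raidz3 -> 10 TB
]
print(sum(pool))  # 22 TB usable across the three groups
```

The 6-drive raidz3 alternative (`raidz_usable(6, 2, 3)` = 6 TB) matches the existing group sizes but gives up 4 TB per group, which is the space-versus-availability trade-off Phil is pointing at.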
Re: [zfs-discuss] ZFS SCRUB
Mohammed Sadiq wrote: Hi Is it recommended to do scrub while the filesystem is mounted . How frequently do we have to do scrub and at what circumstances. You can scrub while the filesystems are mounted - most people do, there's no reason to unmount for a scrub. (Scrub is pool level, not filesystem level.) Scrub does noticeably slow the filesystem, so pick a time of low application load or a time when performance isn't critical. If it overruns into a busy period, you can cancel the scrub. Unfortunately, you can't pause and resume - there's an RFE for this, so if you cancel one you can't restart it from where it got to - it has to restart from the beginning. You should scrub occasionally anyway. That's your check that data you haven't accessed in your application isn't rotting on the disks. You should also do a scrub before you do a planned reduction of the pool redundancy (e.g. if you're going to detach a mirror side in order to attach a larger disk), most particularly if you are reducing the redundancy to nothing. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Maximum zfs send/receive throughput
Jim Barker wrote: Just an update, I had a ticket open with Sun regarding this and it looks like they have a CR for what I was seeing (6975124). That would seem to describe a zfs receive which has stopped for 12 hours. You described yours as slow, which is not the term I personally would use for one which is stopped. However, you haven't given anything like enough detail here of your situation and what's happening for me to make any worthwhile guesses. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zvol recordsize for backing a zpool over iSCSI
Just wondering if anyone has experimented with working out the best zvol recordsize for a zvol which is backing a zpool over iSCSI? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Phil Harman Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes with sync=disabled, when the changes work their way into an available The fact that people run unsafe systems seemingly without complaint for years assumes that they know silent data corruption when they see^H^H^Hhear it ... which, of course, they didn't ... because it is silent ... or having encountered corrupted data, that they have the faintest idea where it came from. In my day to day work I still find many people that have been (apparently) very lucky. Running with sync disabled, or ZIL disabled, you could call "unsafe" if you want to use a generalization and a stereotype. Just like people say "writeback" is unsafe. If you apply a little more intelligence, you'll know, it's safe in some conditions, and not in other conditions. Like ... If you have a BBU, you can use your writeback safely. And if you're not sharing stuff across the network, you're guaranteed the disabled ZIL is safe. But even when you are sharing stuff across the network, the disabled ZIL can still be safe under the following conditions: If you are only doing file sharing (NFS, CIFS) and you are willing to reboot/remount from all your clients after an ungraceful shutdown of your server, then it's safe to run with ZIL disabled. No, that's not safe. The client can still lose up to 30 seconds of data, which could be, for example, an email message which is received and foldered on the server, and is then lost. It's probably /*safe enough*/ for most home users, but you should be fully aware of the potential implications before embarking on this route. (As I said before, the zpool itself is not at any additional risk of corruption, it's just that you might find the zfs filesystems with sync=disabled appear to have been rewound by up to 30 seconds.) 
If you're unsure, then adding SSD nonvolatile log device, as people have said, is the way to go. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
Thomas Burgess wrote: On Fri, Jul 23, 2010 at 3:11 AM, Sigbjorn Lie wrote: Hi, I've been searching around on the Internet to find some help with this, but have been unsuccessful so far. I have some performance issues with my file server. I have an OpenSolaris server with a Pentium D 3GHz CPU, 4GB of memory, and a RAIDZ1 over 4 x Seagate (ST31500341AS) 1,5TB SATA drives. If I compile or even just unpack a tar.gz archive with source code (or any archive with lots of small files), on my Linux client onto a NFS mounted disk to the OpenSolaris server, it's extremely slow compared to unpacking this archive locally on the server. A 22MB .tar.gz file containing 7360 files takes 9 minutes and 12 seconds to unpack over NFS. Unpacking the same file locally on the server takes just under 2 seconds. Between the server and client I have a gigabit network, which at the time of testing had no other significant load. My NFS mount options are: "rw,hard,intr,nfsvers=3,tcp,sec=sys". Any suggestions to why this is? Regards, Sigbjorn as someone else said, adding an ssd log device can help hugely. I saw about a 500% nfs write increase by doing this. I've heard of people getting even more. Another option if you don't care quite so much about data security in the event of an unexpected system outage would be to use Robert Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes with sync=disabled, when the changes work their way into an available build. The risk is that if the file server goes down unexpectedly, it might come back up having lost some seconds worth of changes which it told the client (lied) that it had committed to disk, when it hadn't, and this violates the NFS protocol. That might be OK if you are using it to hold source that's being built, where you can kick off a build again if the server did go down in the middle of it. 
Wouldn't be a good idea for some other applications though (although Linux ran this way for many years, seemingly without many complaints). Note that there's no increased risk of the zpool going bad - it's just that after the reboot, filesystems with sync=disabled will look like they were rewound by some seconds (possibly up to 30 seconds). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
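Once the zil synchronicity changes mentioned above reach an available build, the per-dataset switch would look something like the sketch below. The dataset name is hypothetical, and the script falls back to a plain echo where no ZFS stack is present, so treat it as illustrative rather than verified on a live pool.

```shell
# Hypothetical dataset name; requires a build with PSARC/2010/108 integrated.
DATASET=tank/export
if command -v zfs >/dev/null 2>&1; then
  zfs set sync=disabled "$DATASET"                 # trade crash-safety for NFS speed
  MODE=$(zfs get -H -o value sync "$DATASET")      # confirm the setting
else
  MODE=disabled                                    # what 'zfs get' would report
fi
echo "$MODE"
```

Setting it back to `sync=standard` restores the normal ZIL behaviour for that dataset only, which is the advantage over globally disabling the ZIL.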
Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?
Richard Jahnel wrote: Any idea why? Does the zfs send or zfs receive bomb out part way through? I have no idea why mbuffer fails. Changing the -s from 128 to 1536 made it take longer to occur and slowed it down by about 20%, but didn't resolve the issue. It just meant I might get as far as 2.5GB before mbuffer bombed with broken pipe. Trying -r and -R with various values had no effect. I found that where the network bandwidth and the disks' throughput are similar (which requires a pool with many top-level vdevs in the case of a 10Gb link), you ideally want a buffer on the receive side which will hold about 5 seconds' worth of data. A large buffer on the transmit side didn't help. The aim is to be able to continue streaming data across the network whilst a transaction commit happens at the receive end and zfs receive isn't reading, but to have the data ready locally for zfs receive when it starts reading again. Then the network will stream, in spite of the bursty read nature of zfs receive. I recorded this in bugid http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6729347 However, I haven't verified the extent to which this still happens on more recent builds. Might be worth trying it over rsh if security isn't an issue, and then you lose the encryption overhead. Trouble is that then you've got almost no buffering, which can do bad things to the performance, which is why mbuffer would be ideal if it worked for you. I seem to remember reading that rsh was remapped to ssh in Solaris. No. On the system you're rsh'ing to, you will have to "svcadm enable svc:/network/shell:default", and set up appropriate authorisation in ~/.rhosts I heard of some folks using netcat. I haven't figured out where to get netcat nor the syntax for using it yet. I used a buffering program of my own, but I presume mbuffer would work too. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
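As a rough sizing sketch for that receive-side buffer: the 10Gb link and 5-second window are the figures from the post, while the mbuffer pipeline in the trailing comment uses hypothetical host, pool, and snapshot names.

```shell
# Size a receive-side buffer to hold ~5 seconds of a 10Gb/s stream.
LINK_GBPS=10
BYTES_PER_SEC=$(( LINK_GBPS * 1000000000 / 8 ))   # 1.25 GB/s
BUF_BYTES=$(( BYTES_PER_SEC * 5 ))                # ~6.25e9 bytes, call it 6G
echo "receive-side buffer: $BUF_BYTES bytes"
# On a real pair of hosts (names hypothetical), something like:
#   zfs send tank/fs@snap | ssh recvhost 'mbuffer -s 128k -m 6G | zfs receive -d pool'
```

The point is that the buffer sits on the *receiving* side of the network hop, so the link keeps streaming while zfs receive pauses for a transaction commit.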
Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?
Richard Jahnel wrote: I've tried ssh blowfish and scp arcfour. Both are CPU limited long before the 10g link is. I've also tried mbuffer, but I get broken pipe errors part way through the transfer. Any idea why? Does the zfs send or zfs receive bomb out part way through? Might be worth trying it over rsh if security isn't an issue, and then you lose the encryption overhead. Trouble is that then you've got almost no buffering, which can do bad things to the performance, which is why mbuffer would be ideal if it worked for you. I'm open to ideas for faster ways to either zfs send directly or through a compressed file of the zfs send output. For the moment I: zfs send > pigz; scp (arcfour) the gz file to the remote host; gunzip < the file into zfs receive. This takes a very long time for 3 TB of data, and barely makes use of the 10g connection between the machines due to the CPU limiting on the scp and gunzip processes. Also, if you have multiple datasets to send, might be worth seeing if sending them in parallel helps. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
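The parallel-send suggestion can be sketched with a generic shell pattern. The dataset names are hypothetical placeholders, and each `echo` stands in for a full `zfs send | ssh | zfs receive` pipeline on a real system.

```shell
# One background job per dataset; wait collects them all before finishing.
send_all() {
  for ds in tank/ds1 tank/ds2 tank/ds3; do
    (
      # Placeholder for: zfs send "$ds@snap" | ssh recvhost "zfs receive -d pool"
      echo "sending $ds"
    ) &
  done
  wait
  echo "all sends complete"
}
OUT=$(send_all)
echo "$OUT"
```

With several streams in flight, a single CPU-bound ssh or gunzip process no longer caps the whole transfer, which is the bottleneck described above.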
Re: [zfs-discuss] 1tb SATA drives
Arne Jansen wrote: Jordan McQuown wrote: I’m curious to know what other people are running for HDs in white box systems? I’m currently looking at Seagate Barracudas and Hitachi Deskstars. I’m looking at the 1TB models. These will be attached to an LSI expander in an SC847E2 chassis driven by an LSI 9211-8i HBA. This system will be used as a large storage array for backups and archiving. I wouldn't recommend using desktop drives in a server RAID. They can't handle the vibrations well that are present in a server. I'd recommend at least the Seagate Constellation or the Hitachi Ultrastar, though I haven't tested the Deskstar myself. I've been using a couple of 1TB Hitachi Ultrastars for about a year with no problem. I don't think mine are still available, but I expect they have something equivalent. The pool is scrubbed 3 times a week, which takes nearly 19 hours now and hammers the heads quite hard. I keep meaning to reduce the scrub frequency now it's getting to take so long, but haven't got around to it. What I really want is pause/resume scrub, and the ability to trigger the pause/resume from the screensaver (or something similar). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recommended RAM for ZFS on various platforms
Garrett D'Amore wrote: Btw, instead of RAIDZ2, I'd recommend simply using stripe of mirrors. You'll have better performance, and good resilience against errors. And you can grow later as you need to by just adding additional drive pairs. -- Garrett Or in my case, I find my home data growth is slightly less than the rate of disk capacity increase, so every 18 months or so, I simply swap out the disks for higher capacity ones. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Legality and the future of zfs...
Linder, Doug wrote: Out of sheer curiosity - and I'm not disagreeing with you, just wondering - how does ZFS make money for Oracle when they don't charge for it? Do you think it's such an important feature that it's a big factor in customers picking Solaris over other platforms? Yes, it is one of many significant factors in customers choosing Solaris over other OS's. Having chosen Solaris, customers then tend to buy Sun/Oracle systems to run it on. Of course, there are the 7000 series products too, which are heavily based on the capabilities of ZFS, amongst other Solaris features. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ls says: /tank/ws/fubar: Operation not applicable
Gordon Ross wrote: Anyone know why my ZFS filesystem might suddenly start giving me an error when I try to "ls -d" the top of it? i.e.: ls -d /tank/ws/fubar /tank/ws/fubar: Operation not applicable zpool status says all is well. I've tried snv_139 and snv_137 (my latest and previous installs). It's an amd64 box. Both OS versions show the same problem. Do I need to run a scrub? (will take days...) Other ideas? It might be interesting to run it under truss, to see which syscall is returning that error. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
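The truss suggestion could look something like the one-liner below; it's a sketch for a Solaris system, filtering truss's trace down to the syscalls that returned errors (I believe "Operation not applicable" is the Solaris message text for ENOSYS, but confirm from the Err# annotation truss prints).

```shell
# Solaris truss: trace the failing ls and show only syscalls that errored.
# Failing calls are annotated with "Err#<errno>" in truss output.
truss ls -d /tank/ws/fubar 2>&1 | grep 'Err#'
```

Seeing *which* syscall fails (e.g. a stat variant vs. a getattr on the mount point) narrows down whether the problem is in ZFS, the mount, or something above it.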
Re: [zfs-discuss] Is it possible to disable MPxIO during OpenSolaris installation?
James C. McPherson wrote: On 2/06/10 03:11 PM, Fred Liu wrote: Fix some typos. # In fact, there is no problem for MPxIO name in technology. It only matters for storage admins to remember the name. You are correct. I think there is no way to give short aliases to these long tedious MPxIO names. You are correct that we don't have aliases. However, I do not agree that the naming is tedious. It gives you certainty about the actual device that you are dealing with, without having to worry about whether you've cabled it right. Might want to add a call record to CR 6901193 "Need a command to list current usage of disks, partitions, and slices", which includes a request for vanity naming for disks. (Actually, vanity naming for disks should probably be brought out into a separate RFE.) -- Andrew Gabriel | Solaris Systems Architect Email: andrew.gabr...@oracle.com Mobile: +44 7720 598213 Oracle Pre-Sales Guillemont Park | Minley Road | Camberley | GU17 9QG | United Kingdom ORACLE Corporation UK Ltd is a company incorporated in England & Wales | Company Reg. No. 1782505 | Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA Oracle is committed to developing practices and products that help protect the environment ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] unsetting the bootfs property possible? imported a FreeBSD pool
Reshekel Shedwitz wrote: r...@nexenta:~# zpool set bootfs= tank cannot set property for 'tank': property 'bootfs' not supported on EFI labeled devices r...@nexenta:~# zpool get bootfs tank
NAME  PROPERTY  VALUE  SOURCE
tank  bootfs    tank   local
Could this be related to the way FreeBSD's zfs partitioned my disk? I thought ZFS used EFI by default though (except for boot pools). Looks like this bit of code to me: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libzfs/common/libzfs_pool.c#473
473     /*
474      * bootfs property cannot be set on a disk which has
475      * been EFI labeled.
476      */
477     if (pool_uses_efi(nvroot)) {
478             zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
479                 "property '%s' not supported on "
480                 "EFI labeled devices"), propname);
481             (void) zfs_error(hdl, EZFS_POOL_NOTSUP, errbuf);
482             zpool_close(zhp);
483             goto error;
484     }
485     zpool_close(zhp);
486     break;
It's not checking whether you're clearing the property before bailing out with the error about setting it. A few lines above, another test (for a valid bootfs name) does get bypassed in the case of clearing the property. Don't know if that alone would fix it. -- Andrew Gabriel | Solaris Systems Architect ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [ZIL device brainstorm] intel x25-M G2 has ram cache?
Erik Trimble wrote: Frankly, I'm really surprised that there's no solution, given that the *amount* of NVRAM needed for ZIL (or similar usage) is really quite small. A dozen GB is more than sufficient, and really, most systems do fine with just a couple of GB (3-4 or so). Producing a small, DRAM-based device in a 3.5" HD form-factor with built-in battery shouldn't be hard, and I'm kinda flabbergasted nobody is doing it. Well, at least in the sub-$1000 category. I mean, it's 2 SODIMMs, an AAA-NiCad battery, a PCI-E->DDR2 memory controller, a PCI-E to SATA 6Gbps controller, and that's it. It's a bit of a wonky design. The DRAM could do something of the order of 1,000,000 IOPS, and is then throttled back to a tiny fraction of that by the SATA bottleneck. Disk interfaces like SATA/SAS really weren't designed for this type of use. What you probably want is a motherboard which has a small area of main memory protected by battery, and a ramdisk driver which knows how to use it. Then you'd get the 1,000,000 IOPS. No idea if anyone makes such a thing. You are correct that ZFS gets an enormous benefit from even tiny amounts of NV ZIL. Trouble is that no other operating systems or filesystems work this well with such relatively tiny amounts of NV storage, so such a hardware solution is very ZFS-specific. -- Andrew Gabriel | Solaris Systems Architect ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs question
Mihai wrote: hello all, I have the following scenario of using zfs. - I have an HDD image that has an NTFS partition, stored in a zfs dataset in a file called images.img - I have X physical machines that boot from my server via iSCSI from such an image - Every time a machine asks for a boot request from my server, a clone of the zfs dataset is created and the machine is given the clone to boot from I want to make an optimization to my framework that involves using a ramdisk pool to store the initial hdd images and the clones of the image being stored on a disk based pool. I tried to do this using zfs, but it wouldn't let me do cross pool clones. If someone has any idea on how to proceed in doing this, please let me know. It is not necessary to do this exactly as I proposed, but it has to be something in this direction, a ramdisk backed initial image and more disk backed clones. You haven't said what your requirement is - i.e. what are you hoping to improve by making this change? I can only guess. If you are reading blocks from your initial hdd images (golden images) frequently enough, and you have enough memory on your system, these blocks will end up in the ARC (memory) anyway. If you don't have enough RAM for this to help, then you could add more memory, and/or an SSD as an L2ARC device ("cache" device in zpool command line terms). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
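Adding an L2ARC device, as suggested, is a single zpool operation. A sketch with hypothetical pool and device names, shown as configuration rather than run:

```shell
# Add an SSD as an L2ARC ("cache") device to the pool holding the golden
# images; c2t0d0 is a hypothetical device name.
zpool add tank cache c2t0d0
zpool status tank     # the SSD now appears under a "cache" section
```

Unlike a log device, a cache device holds no pool state that can be lost, so it can be added or removed freely while the pool is in use.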
Re: [zfs-discuss] Proposition of a new zpool property.
Robert Milkowski wrote: To add my 0.2 cents... I think starting/stopping scrub belongs to cron, smf, etc. and not to zfs itself. However, what would be nice to have is an ability to freeze/resume a scrub and also to limit its rate of scrubbing. One of the reasons is that when working in SAN environments one has to take into account more than just the server where the scrub will be running, as while it might not impact the server, it might cause an issue for others, etc. There's an RFE for this (pause/resume a scrub), or rather there was - unfortunately, it got subsumed into another RFE/BUG and the pause/resume requirement got lost. I'll see about reinstating it. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] When to Scrub..... ZFS That Is
Thomas Burgess wrote: I scrub once a week. I think the general rule is: once a week for consumer grade drives once a month for enterprise grade drives. and before any planned operation which will reduce your redundancy/resilience, such as swapping out a disk for a new larger one when growing a pool. The resulting resilver will read all the data in the datasets in order to reconstruct the new disk, some of which might not have been read for ages (or since the last scrub), and that's not the ideal time to discover your existing copy of some blocks went bad some time back. Better to discover this before you reduce the pool redundancy/resilience, whilst it's still fixable. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
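The advice above amounts to the following sequence; pool and device names are hypothetical, and it's shown as configuration rather than run:

```shell
# 1. Verify every block is readable while redundancy is still intact.
zpool scrub tank
zpool status tank          # wait for "scrub completed" with 0 errors
# 2. Only then reduce redundancy by swapping in the larger disk.
zpool replace tank c0t3d0 c0t9d0
zpool status tank          # the resilver re-reads all data onto c0t9d0
```

If the scrub does turn up bad blocks, they can be repaired from the still-present redundancy before the replace begins, which is exactly the window the post describes.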
Re: [zfs-discuss] Intel SASUC8I - worth every penny
Dedhi Sujatmiko wrote: As a user of the el-cheapo US$18 SIL3114, I managed to make the system freeze continuously when one of the SATA cables got disconnected. I am using 8 disks in RAIDZ2 driven by 2 x SIL3114. The system is still able to answer pings, but SSH and the console are no longer responsive, and obviously neither are the NFS and CIFS shares. The console keeps printing a "waiting for disk" loop. The only way to recover is to reset the system, and as expected, one of the disks went offline, but the service came back online with a degraded ZFS pool. The SIL3112/3114 were very early SATA controllers, indeed barely SATA controllers at all by today's standards, as I think they always pretend to be PATA to the host system. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
Gary Mills wrote: We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. Is it destroying old snapshots or creating new ones that causes this dead time? What does each of these procedures do that could affect the system? What can I do to make this less visible to users? Creating a snapshot shouldn't do anything much more than a regular transaction group commit, which should be happening at least every 30 seconds anyway. Deleting a snapshot potentially results in freeing up the space occupied by files/blocks which aren't in any other snapshots. One way to think of this is that when you're using regular snapshots, the freeing up of space which happens when you delete files is in effect all deferred until you destroy the snapshot(s) which also refer to that space, which has the effect of bunching up all of your space freeing. If this is the cause (a big _if_, as I'm just speculating), then it might be a good idea to: a) spread out the deleting of the snapshots, and b) create snapshots more often (and conversely delete more snapshots, more often), so each one contains less accumulated space to be freed off. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
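Suggestions (a) and (b) could be sketched as a small rotation script run more often from cron. All names and numbers are hypothetical, and the zfs command is echoed rather than executed so the sketch is a dry run:

```shell
# Dry-run sketch of a more frequent snapshot rotation (hypothetical names).
# Run e.g. every 4 hours from cron instead of once daily at 11:56, so each
# destroy frees a smaller batch of space.
POOL=pool/imap
KEEP=84                              # 14 days x 6 snapshots/day
NOW=$(date +%Y%m%d-%H%M)
echo "zfs snapshot -r $POOL@$NOW"
# In a real script, list snapshots oldest-first and destroy any beyond $KEEP:
#   zfs list -H -t snapshot -o name -s creation -r $POOL | ... | xargs -n1 zfs destroy -r
```

The retention window stays the same 14 days; only the granularity changes, so each destroy has roughly a sixth as much deferred space to free.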
Re: [zfs-discuss] Pool vdev imbalance
Ian Collins wrote: I was running zpool iostat on a pool comprising a stripe of raidz2 vdevs that appears to be writing slowly, and I noticed a considerable imbalance of both free space and write operations. The pool is currently feeding a tape backup while receiving a large filesystem. Is this imbalance normal? I would expect a more even distribution, as the pool configuration hasn't been changed since creation. The second and third ones are pretty much full, with the others having well over 10 times more free space, so I wouldn't expect many writes to the full ones. Have the others ever been in a degraded state? That might explain why the fill level has become unbalanced. The system is running Solaris 10 update 7.

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        15.9T  2.19T     87    119  2.34M  1.88M
  raidz2    2.90T   740G     24     27   762K  95.5K
    c0t1d0      -      -     14     13   273K  18.6K
    c1t1d0      -      -     15     13   263K  18.3K
    c4t2d0      -      -     17     14   288K  18.2K
    spare       -      -     17     20   104K  17.2K
      c5t2d0    -      -     16     13   277K  17.6K
      c7t5d0    -      -      0     14      0  17.6K
    c6t3d0      -      -     15     12   242K  18.7K
    c7t3d0      -      -     15     12   242K  17.6K
    c6t4d0      -      -     16     12   272K  18.1K
    c1t0d0      -      -     15     13   275K  16.8K
  raidz2    3.59T  37.8G     20      0   546K      0
    c0t2d0      -      -     11      0   184K    361
    c1t3d0      -      -     10      0   182K    361
    c4t5d0      -      -     14      0   237K    361
    c5t5d0      -      -     13      0   220K    361
    c6t6d0      -      -     12      0   155K    361
    c7t6d0      -      -     11      0   149K    361
    c7t4d0      -      -     14      0   219K    361
    c4t0d0      -      -     14      0   213K    361
  raidz2    3.58T  44.1G     27      0  1.01M      0
    c0t5d0      -      -     16      0   290K    361
    c1t6d0      -      -     15      0   301K    361
    c4t7d0      -      -     20      0   375K    361
    c5t1d0      -      -     19      0   374K    361
    c6t7d0      -      -     17      0   285K    361
    c7t7d0      -      -     15      0   253K    361
    c0t0d0      -      -     18      0   328K    361
    c6t0d0      -      -     18      0   348K    361
  raidz2    3.05T   587G      7     47  24.9K  1.07M
    c0t4d0      -      -      3     21   254K   187K
    c1t2d0      -      -      3     22   254K   187K
    c4t3d0      -      -      5     22   350K   187K
    c5t3d0      -      -      5     21   350K   186K
    c6t2d0      -      -      4     22   265K   187K
    c7t1d0      -      -      4     21   271K   187K
    c6t1d0      -      -      5     22   345K   186K
    c4t1d0      -      -      5     24   333K   184K
  raidz2    2.81T   835G      8     45  30.9K   733K
    c0t3d0      -      -      5     16   339K   126K
    c1t5d0      -      -      5     16   333K   126K
    c4t6d0      -      -      6     16   441K   127K
    c5t6d0      -      -      6     17   435K   126K
    c6t5d0      -      -      4     18   294K   126K
    c7t2d0      -      -      4     18   282K   124K
    c0t6d0      -      -      7     19   446K   124K
    c5t7d0      -      -      7     21   452K   122K
----------  -----  -----  -----  -----  -----  -----

-- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Help, my zone's dataset has disappeared!
Jesse Reynolds wrote: Does ZFS store a log file of all operations applied to it? It feels like someone has gained access and run 'zfs destroy mailtmp' to me, but then again it could just be my own ineptitude. Yes... zpool history rpool -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
Darren J Moffat wrote: You have done a risk analysis and if you are happy that your NTFS filesystems could be corrupt on those ZFS ZVOLs if you lose data then you could consider turning off the ZIL. Note though that it isn't just those ZVOLs you are serving to Windows that lose access to a ZIL but *ALL* datasets on *ALL* pools and that includes your root pool. For what it's worth I personally run with the ZIL disabled on my home NAS system which is serving over NFS and CIFS to various clients, but I wouldn't recommend it to anyone. The reason I say never to turn off the ZIL is because in most environments outside of home usage it just isn't worth the risk to do so (not even for a small business). People used fastfs for years in specific environments (hopefully understanding the risks), and disabling the ZIL is safer than fastfs. Seems like it would be a useful ZFS dataset parameter. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot receive new filesystem stream: invalid backup stream
Darren J Moffat wrote: On 12/02/2010 09:55, Andrew Gabriel wrote: Can anyone suggest how I can get around the above error when sending/receiving a ZFS filesystem? It seems to fail when about 2/3rds of the data have been passed from send to recv. Is it possible to get more diagnostics out? You could try using /usr/bin/zstreamdump on recent builds. Ah, thanks Darren, didn't know about that. As far as I can see, it runs without a problem right to the end of the stream. Not sure what I should be looking for in the way of errors, but there are no occurrences in the output of any of the error strings revealed by running strings(1) on zstreamdump. Without -v ...

zfs send export/h...@20100211 | zstreamdump
BEGIN record
        version = 1
        magic = 2f5bacbac
        creation_time = 4b7499c5
        type = 2
        flags = 0x0
        toguid = 435481b4a8c20fbd
        fromguid = 0
        toname = export/h...@20100211
END checksum = 6797d43d709150c1/e1ec581cbaf0cfa4/7f6d80fa0f23c741/2c2cb821b4a2e639
SUMMARY:
        Total DRR_BEGIN records = 1
        Total DRR_END records = 1
        Total DRR_OBJECT records = 90963
        Total DRR_FREEOBJECTS records = 23389
        Total DRR_WRITE records = 212381
        Total DRR_FREE records = 520856
        Total records = 847591
        Total write size = 17395359232 (0x40cd81e00)
        Total stream length = 17683854968 (0x41e0a3678)

This filesystem has failed in this way for a long time, and I've ignored it thinking something might get fixed in the future, but this hasn't happened yet. It's a home directory which has been in existence and used for about 3 years. One thing is that the pool version (3) and zfs version (1) are old - could that be the problem? The sending system is currently running build 125 and the receiving system something approximating to 133, but I've had the same problem with this filesystem for all builds I've used over the last 2 years. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] cannot receive new filesystem stream: invalid backup stream
Can anyone suggest how I can get around the above error when sending/receiving a ZFS filesystem? It seems to fail when about 2/3rds of the data have been passed from send to recv. Is it possible to get more diagnostics out? This filesystem has failed in this way for a long time, and I've ignored it thinking something might get fixed in the future, but this hasn't happened yet. It's a home directory which has been in existence and used for about 3 years. One thing is that the pool version (3) and zfs version (1) are old - could that be the problem? The sending system is currently running build 125 and receiving system something approximating to 133, but I've had the same problem with this filesystem for all builds I've used over the last 2 years. -- Cheers Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Impact of an enterprise class SSD on ZIL performance
Peter Radig wrote: I was interested in the impact the type of an SSD has on the performance of the ZIL. So I did some benchmarking and just want to share the results. My test case is simply untarring the latest ON source (528 MB, 53k files) on a Linux system that has a ZFS file system mounted via NFS over gigabit ethernet. I got the following results: - remotely with no dedicated ZIL device: 36 min 37 sec (factor 73 compared to local) - remotely with an Intel X25-E 32 GB as ZIL device: 3 min 11 sec (factor 6.4 compared to local) That's about the same ratio I get when I demonstrate this on the SSD/Flash/Turbocharge Discovery Days I run in the UK from time to time (the name changes over time ;-). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
Brandon High wrote: On Wed, Feb 3, 2010 at 3:13 PM, David Dyer-Bennet wrote: Which is to say that 45 drives is really quite a lot for a HOME NAS. Particularly when you then think about backing up that data. The origin of this thread was how to buy a J4500 (48 drive chassis). One thing that I enjoy about this list (and I'm sure the Sun guys get annoyed about) is the discussion of how to build various "small" systems for home use. We don't get annoyed at all. What do you think we build to run at home? ;-) After sitting on the sidelines for a while, I assembled an 8TB server for home. Yeah, 8TB is more than I can back up over my home DSL connection. But it's only got about 2.5TB in use, and most of that is our DVD collection that I've ripped to play on the Popcorn Hour in our living room, or CDs that I've ripped. I'd hate to have to re-rip it all, but I can get it back. The rest is photos and important documents which are copied to a VM instance and backed up offsite via Mozy. I'm considering doing a send/receive of a few volumes to a friend's system (as he will do to mine) to have offsite backups of the pools. It's mostly dependent on him buying more disk. ;-) And for what it's worth, my toying with ZFS and discussing it with coworkers has raised interest in Sun's storage line to replace NetApp at the office. Absolutely. My homebrew system has come up in many conversations about ZFS, which has ended up with a customer buying Thumper or Amber Road systems from Sun. (But that's my job, I guess ;-) -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
LICON, RAY (ATTPB) wrote: Thanks for the reply. In many situations, the hardware design isn't up to me and budgets tend to dictate everything these days. True, nobody wants to swap, but the question is "if" you had to -- what design serves you best: independent swap slices, or putting it all under the control of zfs? It depends why you need to swap, i.e. why are you using more memory than you have, and is your working set size bigger than memory (thrashing), or is swapping likely to be just a once-off event or infrequently repeated? You probably need to forget most of what you learned about swapping 25 years ago, when systems routinely swapped, and technology was very different. Disks have got faster over that period, probably of the order of 100 times faster. However, CPUs have got 100,000 times faster, so in reality a disk looks to be 1000 times slower from the CPU's standpoint than it did 25 years ago. This means that CPU cycles lost due to swapping will appear to have a proportionally much more dire effect on performance than they did many years back. There are lots more options available today than there were when systems routinely swapped. A couple of examples that spring to mind... ZFS has been explicitly designed to swap its own cache data, only we don't call it swapping - we call it an L2ARC or ReadZilla. So if you have a system where the application is going to struggle with main memory, you might configure ZFS to significantly reduce its memory buffer (ARC), and instead give it an L2ARC on a fast solid state disk. This might result in less performance degradation in some systems where memory is short, depending heavily on the behaviour of the application. If you do have to go with brute force old style swapping, then you might want to invest in solid state disk swap devices, which will go some way towards reducing the factor of 1000 I mentioned above. (Take note of aligning swap to the 4k flash I/O boundaries.) 
Probably lots of other possibilities too, given more than a couple of minutes thought. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
RayLicon wrote: Has anyone done research into the performance of SWAP on the traditional partitioned based SWAP device as compared to a SWAP area set up on ZFS with a zvol? I can find no best practices for this issue. In the old days it was considered important to separate the swap devices onto individual disks (controllers) and select the outer cylinder groups for the partition (to gain some read speed). How does this compare to creating a single SWAP zvol within a rootpool and then mirroring the rootpool across two separate disks? Best practice nowadays is to design a system so it doesn't need to swap. Then it doesn't matter what the performance of the swap device is. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 2gig file limit on ZFS?
Michelle Knight wrote: Fair enough. So where do you think my problem lies? Do you think it could be a limitation of the driver I loaded to read the ext3 partition? Without knowing exactly what commands you typed and exactly what error messages they produced, and which directories/files are on which types of file systems, we're limited to guessing. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive as backup - reliability?
Robert Milkowski wrote: I think one should actually compare whole solutions - including servers, FC infrastructure, tape drives, robots, software costs, rack space, ... Servers like the X4540 are ideal for a zfs+rsync backup solution - very compact, good $/GB ratio, enough CPU power for its capacity, they allow you to easily scale horizontally, and it is not too small and not too big. Then, thanks to their compactness, they are very easy to administer. Depending on the environment one could always deploy them in pairs - one in one datacenter and the 2nd one in another datacenter, with ZFS send based replication of all backups (snapshots). Or one may replicate (cross-replicate) only selected clients if needed. Something else that often sells the 4500/4540 relates to internal company politics. Often, inside a company, storage has to be provisioned from the company's storage group, using very expensive SAN based storage, indeed so expensive by the time the company's storage group have added their overhead onto the already expensive SAN, that whole projects become unviable. Instead, teams find they can order 4500/4540s, which slip under the radar as servers (or even PCs), and they now have affordable storage for their projects, which makes them viable once more. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "NearLine SAS"?
Edward Ned Harvey wrote: A poster in another forum mentioned that Seagate (and Hitachi, amongst others) is now selling something labeled as "NearLine SAS" storage (e.g. Seagate's NL35 series). The industry has moved again. Better get used to it. Nearline SAS is a replacement for SATA. It's a lower cost drive than SAS, with higher reliability than SATA. I have begun seeing vendors that sell only SAS and NearLine SAS. (Some Dell servers.) SATA is not dead. But this will certainly change things up a bit. Actually, this sounds like really good news for ZFS. ZFS (or rather, Solaris) can make good use of the multi-pathing capability, previously only available on high speed drives. Of course, ZFS can make use of any additional IOPS to be had, again previously only available on high speed drives. However, ZFS generally doesn't need high speed drives and does much better with hybrid storage pools. So these drives sound to me to have been designed specifically for ZFS! It's hard to imagine any other filesystem which can exploit them so completely. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
Mark Grant wrote: Yeah, this is my main concern with moving from my cheap Linux server with no redundancy to ZFS RAID on OpenSolaris. I don't really want to have to pay twice as much for 'enterprise' disks which appear to be exactly the same drives with a flag set in the firmware to limit read retries, but I also don't want to lose all my data because a sector fails and the drive hangs for a minute trying to relocate it, causing the file system to fall over. I haven't found a definitive answer as to whether this will kill a ZFS RAID like it kills traditional hardware RAID, or whether ZFS will recover after the drive stops attempting to relocate the sector. At least with a single-drive setup the OS will eventually get an error response, and the other files on the disk will be readable when I copy them over to a new drive. I don't think ZFS does any timing out itself. It's up to the drivers underneath to time out and send an error back to ZFS - only they know what's reasonable for a given disk type and bus type. So I guess this may depend on which drivers you are using. I don't know what the timeouts are, but I have observed them to be long in some cases when things do go wrong and timeouts and retries are triggered. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
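(For what it's worth: on drives where the vendor does expose the knob, smartmontools can read the SCT Error Recovery Control timers with `smartctl -l scterc`, and on many drives set them with `smartctl -l scterc,READ,WRITE` in tenths of a second. A sketch of checking the setting; the sample text below is made up to match the usual smartctl output format, which may differ on your version.)

```shell
# Sample of what `smartctl -l scterc /dev/rdsk/c2d0` typically prints
# (the format and device path are assumptions - check your smartctl).
sample='SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write: Disabled'

# Pull out the read/write ERC timers; raw values are tenths of a second.
echo "$sample" | awk '
  $1 == "Read:" || $1 == "Write:" {
    if ($2 == "Disabled") print $1, "disabled"
    else                  print $1, $2/10, "seconds"
  }'
```

A drive reporting "Disabled" (or rejecting the command outright) will retry internally for as long as it likes, which is exactly the minute-long hang described above.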
Re: [zfs-discuss] zfs on ssd
Bob Friesenhahn wrote: The interesting thing for the future will be non-volatile main memory, with the primary concern being how to firewall damage due to a bug. You would be able to turn your computer off and back on and be working again almost instantaneously. Some of us are old enough (just) to have used computers back in the days when they all did this anyway... Funny how things go full circle... -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
Kjetil Torgrim Homme wrote: Daniel Carosone writes: Would there be a way to avoid taking snapshots if they're going to be zero-sized? I don't think it is easy to do, the txg counter is on a pool level, AFAIK:

# zdb -u spool
Uberblock
        magic = 00bab10c
        version = 13
        txg = 1773324
        guid_sum = 16611641539891595281
        timestamp = 1258992244 UTC = Mon Nov 23 17:04:04 2009

it would help when the entire pool is idle, though. (posted here, rather than in response to the mailing list reference given, because I'm not subscribed [...]) ditto. I think I can see from this which filesystems have and have not changed since the last snapshot (@20091122 in this case)...

a20$ zfs list -t snapshot | grep 20091122
export/d...@20091122                  499K  -   227G  -
export/h...@20091122                  144K  -  15.6G  -
export/mu...@20091122                    0  -  66.5G  -
export/virtual...@20091122               0  -   484K  -
export/virtualbox/os0...@20091122        0  -  3.52G  -
export/virtualbox/x...@20091122          0  -  12.1G  -
export/zo...@20091122                    0  -  22.5K  -
export/zones/s...@20091122               0  -  5.21G  -
a20$

All the ones with USED = 0 haven't changed. Don't know if this info is available without spinning up disks though. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
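(So even if the pool-level txg counter can't answer it, skipping per-filesystem looks scriptable: take the parsable `zfs list` output for the previous round of snapshots and only snapshot datasets whose last snapshot has non-zero USED. A sketch; the dataset names in the sample are made up for illustration.)

```shell
# Which filesystems changed since the @20091122 snapshots?
# Feed in `zfs list -H -t snapshot -o name,used` style output
# (dataset names below are made up for illustration).
sample='export/docs@20091122 499K
export/music@20091122 0
export/zones@20091122 0'

# Print the filesystems whose last snapshot has USED != 0, i.e. the
# only ones worth snapshotting again this time around.
echo "$sample" | awk '$2 != "0" { sub(/@.*/, "", $1); print $1 }'
```

The caveat from the post still applies: reading the USED property may itself spin up the disks.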
Re: [zfs-discuss] Resilver/scrub times?
Bill Sommerfeld wrote: Yesterday's integration of 6678033 "resilver code should prefetch" as part of changeset 74e8c05021f1 (which should be in build 129 when it comes out) may improve scrub times, particularly if you have a large number of small files and a large number of snapshots. I recently tested an early version of the fix, and saw one pool go from an elapsed time of 85 hours to 20 hours; another (with many fewer snapshots) went from 35 to 17. I've been wondering what difference that might make. I'm currently running snv_125. $ zfs list -t snapshot | wc -l 4407 $ Yep, quite a few snapshots. However, more important to me would be reducing the impact a scrub has on the rest of the system, even if it takes longer. Conversely, I can imagine you might want a resilver to run as fast as possible in some cases, even at the expense of other system activities. It would be really nice to have a speed knob on these operations, which you can vary while the activity progresses, depending on other uses of the system. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
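(There is a crude version of that knob in the newer scan code, in the form of kernel tunables, though they are undocumented, unstable interfaces whose names and defaults vary by build - treat the fragment below as an assumption to verify against your own source tree before use.)

```
* /etc/system fragment - slow scrubs down to cut their interactive impact.
* zfs_scrub_delay (ticks of delay inserted per scrub i/o) and
* zfs_top_maxinflight (max queued scrub i/os per top-level vdev) exist in
* the newer scan code, but they are undocumented and build-dependent -
* verify the names in your source tree before relying on them.
set zfs:zfs_scrub_delay = 4
set zfs:zfs_top_maxinflight = 8
```

It's not the live, per-operation dial asked for above (a reboot is needed for /etc/system changes to take effect), but it does trade scrub speed for interactive response.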
Re: [zfs-discuss] Resilver/scrub times?
Colin Raven wrote: Hi all! I've decided to take the "big jump" and build a ZFS home filer (although it might also do "other work" like caching DNS, mail, usenet, bittorrent and so forth). YAY! I wonder if anyone can shed some light on how long a pool scrub would take on a fairly decent rig. These are the specs as-ordered: Asus P5Q-EM mainboard, Core2 Quad 2.83 GHz, 8GB DDR2/80. On this system (dual-core Athlon64 4600+ 2.4GHz, 8GB RAM) with 7200RPM enterprise duty disks, it takes about 14 hours on the following zpool:

# zpool status
  pool: export
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 13h45m with 0 errors on Fri Nov 20 15:05:20 2009
config:

        NAME        STATE   READ WRITE CKSUM
        export      ONLINE     0     0     0
          mirror-0  ONLINE     0     0     0
            c3d0    ONLINE     0     0     0
            c2d0    ONLINE     0     0     0

errors: No known data errors

# zpool list
NAME     SIZE  USED  AVAIL  CAP  HEALTH  ALTROOT
export   930G  503G   427G  54%  ONLINE  -
#

OS: 2 x SSDs in RAID 0 (brand/size not decided on yet, but they will definitely be some flavor of SSD). Can only be RAID1 (mirrored) for the OS. Data: 4 x 1TB Samsung SpinPoint 7200 RPM 32MB cache SATA HDs (RAIDZ). Prefer mirrors myself, but it depends on how much data you have versus how much disk you can afford. Data payload initially will be around 550GB or so (before loading any stuff from another NAS and so on). Does scrub like memory, or CPU, or both? There's enough horsepower available, I would think. Same question applies to resilvering if I need to swap out drives at some point. [cough] I can't wait to get this thing built! :) I also use the system as a desktop now (after doing some system consolidation). Scrub doesn't use much CPU, but it does interfere with the interactive response of the desktop. I suspect this is due to the i/o's it queues up on the disks.
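(A rough back-of-envelope check supports that: 503G scrubbed in 13h45m is only about 10 MB/s, well below what a single 7200RPM spindle can stream sequentially, which fits a workload of seek-bound small i/os rather than anything CPU- or memory-limited.)

```shell
# Back-of-envelope scrub rate from the zpool output above:
# 503G of allocated data verified in 13h45m.
awk 'BEGIN {
  used_mb = 503 * 1024             # 503G in MB
  elapsed = 13*3600 + 45*60        # 13h45m in seconds
  printf "%.1f MB/s\n", used_mb / elapsed
}'
```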
-- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss