Re: [zfs-discuss] ZFS very slow under xVM

2007-11-03 Thread Erblichs
Martin,

This is a shot in the dark, but this seems to be an I/O scheduling
issue.

Since I am late to this thread, what are the characteristics of
the I/O: mostly reads, appending writes, read-modify-write,
sequential or random, a single large file, or multiple files?

And if we are talking about writes, have you tracked whether any
I/O ages much beyond 30 seconds?

If we were talking about Xen by itself, I am sure there is
some type of scheduler involvement that COULD slow down your
IO due to fairness or some specified weight against other
processes/threads/tasks.

Can you boost the scheduling of the IO task, by making it
realtime or adjusting its niceness, in an experimental
environment and compare the stats?
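
Something along these lines, as a minimal sketch (this assumes a
POSIX-style setpriority() interface is usable in dom0; the pid and
the nice value are supplied by the tester and are purely illustrative):

/* boost_io.c - experimentally change the priority of an I/O-heavy
 * process and rerun the test.  Sketch only: assumes POSIX
 * setpriority() semantics; the pid and nice value come from the
 * tester. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

int
main(int argc, char **argv)
{
        pid_t pid;
        int prio;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <pid> <nice-value>\n", argv[0]);
                return (1);
        }
        pid = (pid_t)atoi(argv[1]);
        prio = atoi(argv[2]);   /* e.g. -10 to favor the I/O task */

        if (setpriority(PRIO_PROCESS, pid, prio) != 0) {
                fprintf(stderr, "setpriority: %s\n", strerror(errno));
                return (1);
        }
        printf("pid %ld now at nice %d; rerun the I/O test and compare.\n",
            (long)pid, prio);
        return (0);
}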

Whether this is the bottleneck in your case would require
a closer examination of the system's various metrics.

Mitchell Erblich
-





Martin wrote:
> 
> > The behaviour of ZFS might vary between invocations, but I don't think that
> > is related to xVM. Can you get the results to vary when just booting under
> > "bare metal"?
> 
> It pretty consistently displays the behaviors of good IO (approx 60Mb/s - 
> 80Mb/s) for about 10-20 seconds, then always drops to approx 2.5 Mb/s for 
> virtually all of the rest of the output. It always displays this when running 
> under xVM/Xen with Dom0, and never on bare metal when xVM/Xen isn't booted.
> 
> 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [docs-discuss] Introduction to Operating Systems

2007-08-02 Thread Erblichs
http://www.sun.com/software/solaris/ds/zfs.jsp

Solaris ZFS—The Most Advanced File System on the Planet

Anyone who has ever lost important files, run out of space on a
partition, spent weekends adding new storage to servers, tried to grow
or shrink a file system, or experienced data corruption knows that there
is room for improvement in file systems and volume managers. The Solaris
Zettabyte File System (ZFS), is designed from the ground up to meet the
emerging needs of a general-purpose file system that spans the desktop
to the data center. 

Mitchell Erblich
Ex-Sun Eng
--

Alan Coopersmith wrote:
> 
> Lisa Shepherd wrote:
> > "Zettabyte File System" is the formal, expanded name of the file system and 
> > "ZFS" is its abbreviation. In most Sun manuals, the name is expanded at 
> > first use and the abbreviation used the rest of the time. Though I was 
> > surprised to find that the Solaris ZFS System Administration Guide, which I 
> > would consider the main source of ZFS information, doesn't seem to have 
> > "Zettabyte" anywhere in it. Anyway, both names are official and correct, 
> > but since "Zettabyte" is such a mouthful, "ZFS" is what gets used most of 
> > the time.
> 
> How current is that?   I thought that while "Zettabyte File System"
> was the original name, use of it was dropped a couple years ago and
> ZFS became the only name.   I don't see "Zettabyte" appearing anywhere
> in the ZFS community pages.
> 
> --
> -Alan Coopersmith-   [EMAIL PROTECTED]
>  Sun Microsystems, Inc. - X Window System Engineering
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Mac OS X "Leopard" to use ZFS

2007-06-13 Thread Erblichs
Toby Thain, et al,

I am guessing here, but perhaps just to be able to access
the FS data locally without the headaches of verifying FS
consistency, write caches, etc.

Mitchell Erblich


Toby Thain wrote:
> 
> On 13-Jun-07, at 1:14 PM, Rick Mann wrote:
> 
> >> From (http://www.informationweek.com/news/showArticle.jhtml;?
> >> articleID=199903525)
> >
> > ... Croll explained, "ZFS is not the default file system for
> > Leopard. We are exploring it as a file system option for high-end
> > storage systems with really large storage. As a result, we have
> > included ZFS -- a read-only copy of ZFS -- in Leopard."
> 
> I don't get it. What possible use is "read only" ZFS?
> 
> So that people can see if their FC array can be mounted on Leopard
> beta? I must be missing the point here.
> 
> --Toby
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS Apple WWDC Keynote Absence

2007-06-12 Thread Erblichs
Group,

Isn't Apple's strength really the non-compute-intensive
personal computer / small business environment?
I.e., plug and play.

Thus, even though ZFS is able to work as the default
FS, should it be the default FS for the small-system
environment, where the average user wants it more to
just work and cares less about administration issues?

ZFS, IMO, almost needs to check for configuration mistakes
in the first case, whereas the larger business environment
can have one or more dedicated admins who are more adept
at config and tuning issues.

Mitchell Erblich


Rich Teer wrote:
> 
> On Tue, 12 Jun 2007, Robert Smicinski wrote:
> 
> > Apple's strength is the desktop, Sun's is the datacenter.
> 
> Agreed, to a large extent.
> 
> > There's no need to have ZFS on the desktop, just as there's no need
> > to have HFS+ in the datacenter.
> 
> I strongly disagree with the first clause of that sentence.  There's
> no reason why one wouldn't want to have mirrored file systems on a
> workstation, or make use of snapshots and clones.  All three of those
> features are supplied rather handily by ZFS.
> 
> ZFS isn't just about easily creating massive pools of data, although
> admittedly that is the first feature most people mention.
> 
> > There is a need to improve ZFS in the datacenter, however, and I wish
> > Sun had invested their time in getting dynamic LUN expansion going
> > instead of working on a port to OS/X.
> 
> I have no insider knowledge, but I don't think Sun invested much time
> in this.  I believe Apple's engineers did most of the work.
> 
> --
> Rich Teer, SCSA, SCNA, SCSECA, OGB member
> 
> CEO,
> My Online Home Inventory
> 
> Voice: +1 (250) 979-1638
> URLs: http://www.rite-group.com/rich
>   http://www.myonlinehomeinventory.com
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Optimal strategy (add or replace disks) to build acheap and raidz?

2007-05-08 Thread Erblichs
Group,

MOST people want a system to work without doing
 ANYTHING when they turn on the system.

So yes, the thought of people buying another
 drive and installing it in a brand new system
 would be insane for this group of buyers.

Mitchell Erblich
--

Richard Elling wrote:
> 
> Harold Ancell wrote:
> > I checked Dell.com, and their "we want you to buy this higher end home
> > machine" offer has a 320GB stock drive, but highlights a 500GB just
> > below it with a bolded "Dell recommended for photos, music, and
> > games!", for an extra 120 US$, about 10% of the machine's price.
> >
> > I'll bet a lot of people take them up on that.
> 
> Interestingly, I was online recently comparing Dell, Frys.com, and Apple
> Store prices (for a research project).  For a sampling of products exactly
> the same, Dell generally had the worst prices, Frys the best, and Apple
> more often matched Frys than Dell.  Specifically, for a 500 GByte disk,
> Dell was asking $189 versus $129 at Frys.  I couldn't directly compare
> Apple's price because they don't sell raw disks, they sell "modules" which
> cost 2x the price of an external drive listed right next to the modules --
> go figure.
> 
> As the old saying goes, it pays to shop around.
>   -- richard
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] gzip compression throttles system?

2007-05-04 Thread Erblichs
Darren Moffat,

Yes and no. An earlier statement within this discussion
was whether gzip is appropriate for .wav files. This just
gets a relative time to compress, and the relative
sizes of the files after compression.

My assumption is that gzip will run as a user app
in one environment. The normal read/write syscalls then take
a user buffer. So, it would be hard to believe that the
.wav file won't be read one user buffer at a time. Yes,
it could be mmap'ed, but then it would have to be
unmapped. Too many syscalls, I think, for the app.
Sorry, I haven't looked at it for a while.

Overall, I am just trying to guess at the read-ahead
delay versus the user buffer versus the internal FS.
The internal FS should take it basically one FS block
at a time (or do multiple blocks in parallel),
and the user app takes it anywhere from
one buffer to one page size, 8k at a time. So, because it
reads one buffer at a time in a loop, with
a context switch from kernel to user each time,
I would expect the gzip app to be slower.

So, my first step is to keep it simple (KISS) and ask
the group: what happens if we do this simple
comparison? How many bytes/sec are compressed?
Are they approximately the same speed? Do you end up
with the same size file?
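
Something like the following is what I have in mind for the userland
half, as a rough sketch (it assumes zlib is installed; the 128K chunk
stands in for the ZFS recordsize and level 6 is an arbitrary choice):

/* chunkgz.c - compress a file in fixed-size chunks and report the
 * throughput and compression ratio.  Sketch only: the 128K chunk
 * stands in for the ZFS recordsize; gzip level 6 is arbitrary.
 * Build with: cc chunkgz.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

#define CHUNK   (128 * 1024)

int
main(int argc, char **argv)
{
        FILE *fp;
        static unsigned char in[CHUNK];
        static unsigned char out[2 * CHUNK];    /* > compressBound(CHUNK) */
        size_t n, total_in = 0, total_out = 0;
        struct timespec t0, t1;
        double secs;

        if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
                fprintf(stderr, "usage: %s <file.wav>\n", argv[0]);
                return (1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while ((n = fread(in, 1, CHUNK, fp)) > 0) {
                uLongf outlen = sizeof (out);

                if (compress2(out, &outlen, in, n, 6) != Z_OK) {
                        fprintf(stderr, "compress2 failed\n");
                        return (1);
                }
                total_in += n;
                total_out += outlen;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fclose(fp);
        if (total_in == 0) {
                fprintf(stderr, "empty file\n");
                return (1);
        }
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu bytes in, %zu bytes out (%.1f%% of original), %.1f MB/s\n",
            total_in, total_out, 100.0 * total_out / total_in,
            total_in / 1048576.0 / secs);
        return (0);
}

Running the same .wav through gzip(1) and through a gzip-compressed ZFS
filesystem then gives three numbers to compare.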

Mitchell Erblich
--


Darren J Moffat wrote:
> 
> Erblichs wrote:
> >   So, my first order would be to take 1GB or 10GB .wav files
> >   AND time both the kernel implementation of Gzip and the
> >   user application. Approx the same times MAY indicate
> >   that the kernel implementation gzip funcs should
> >   be treatedly maybe more as  interactive scheduling
> >   threads and that it is too high and blocks other
> >   threads or proces from executing.
> 
> If you just run gzip(1) against the files you are operating on the whole
> file so you only incur startup costs once and are thus doing quite a
> different compression to operating on a block level.  A fairer
> comparison would be to build a userland program that compresses and then
> writes to disk in ZFS blocksize chunks, that way you are compressing the
> same sizes of data and doing the startup every time just like zio has to do.
> 
> --
> Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] gzip compression throttles system?

2007-05-03 Thread Erblichs
Ian Collins,

My two free cents..

If the gzip were run in application space, most gzip implementations
support (maybe with a recompile) a less extensive/expensive deflation
level that would consume fewer CPU cycles.

Secondly, if the file objects are being written locally, the
writes to disk are being done asynchronously and shouldn't
really delay other processes and slow down the system.

So, my first order would be to take 1GB or 10GB .wav files
AND time both the kernel implementation of gzip and the
user application. Approximately the same times MAY indicate
that the kernel gzip functions should perhaps
be treated more like interactive scheduling
threads, and that their priority is too high and blocks other
threads or processes from executing.


Mitchell Erblich
Sr Software Engineer



Ian Collins wrote:
> 
> I just had a quick play with gzip compression on a filesystem and the
> result was the machine grinding to a halt while copying some large
> (.wav) files to it from another filesystem in the same pool.
> 
> The system became very unresponsive, taking several seconds to echo
> keystrokes.  The box is a maxed out AMD QuadFX, so it should have plenty
> of grunt for this.
> 
> Comments?
> 
> Ian
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very Large Filesystems

2007-04-28 Thread Erblichs
Joerg,

Do you really think that ANY FS actually needs to support
more FS objects? If that were an issue, why not create
more FSs?

A multi-TB FS SHOULD support 100MB+ to GB-sized FS objects, which
IMO is the more common use. I have seen this a lot in video
environments. The largest that I have personally seen is in
excess of 64TB.

I would assume that normal FS ops that search or display
an extremely large number of FS objects are going to be
difficult to use. Just try placing 10k+ FS objects/files within
a directory and then list that directory (see the sketch below).
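
A small sketch of that experiment: just time how long a plain scan of
the directory entries takes (the directory path is supplied by whoever
runs it):

/* lsbench.c - count and time the entries in one directory, to see what
 * a 10k+ entry directory costs just to scan (stat()s would add more).
 * Sketch only. */
#include <stdio.h>
#include <time.h>
#include <dirent.h>

int
main(int argc, char **argv)
{
        DIR *dp;
        struct dirent *de;
        long count = 0;
        struct timespec t0, t1;
        double secs;

        if (argc != 2 || (dp = opendir(argv[1])) == NULL) {
                fprintf(stderr, "usage: %s <directory>\n", argv[0]);
                return (1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while ((de = readdir(dp)) != NULL)
                count++;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        closedir(dp);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%ld entries scanned in %.3f seconds\n", count, secs);
        return (0);
}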

As for backup/restore type ops, I would assume that a
smaller granularity of specified paths/directories would be
more common, to recover from user error without disturbing other
directories.

Mitchell Erblich
-

 
Joerg Schilling wrote:
> 
> Yaniv Aknin <[EMAIL PROTECTED]> wrote:
> 
> > Following my previous post across several mailing lists regarding 
> > multi-tera volumes with small files on them, I'd be glad if people could 
> > share real life numbers on large filesystems and their experience with 
> > them. I'm slowly coming to a realization that regardless of theoretical 
> > filesystem capabilities (1TB, 32TB, 256TB or more), more or less across the 
> > enterprise filesystem arena people are recommending to keep practical 
> > filesystems up to 1TB in size, for manageability and recoverability.
> 
> UFS is limited to 2**31 inodes and this also limits the filesystem size.
> On Berlios we have a mixture of small and large files and the average file
> size is 100 kB. This would still give you a limit of 200 TB which is more
> than UFS allows you.
> 
> I would guess that the recommendations are rather oriented on the backup.
> On backup speed and on the size of the backup media.
> 
> Jörg
> 
> --
>  EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
>[EMAIL PROTECTED](uni)
>[EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
>  URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cow performance penatly

2007-04-26 Thread Erblichs
Ming,

Let's take a pro example with a minimal performance
tradeoff.

All FSs that modify a disk block, IMO, do a full
disk-block read before anything else.

If you are doing an extended write and moving to a
larger block size, with COW you give yourself
the ability to write a single block
versus
having to fill the original block and also needing
to write the next block. The "performance loss" is the
additional latency of transferring more bytes within the
larger block on the next access.

This pro doesn't just benefit the end of the file
but also both ends of a hole within the file. In
addition, the next non-recent IO op that accesses the
disk block will be able to perform a single seek. Also,
if we allow ourselves to dynamically increase the size
of the block and we are within direct access to the
blocks, we can delay moving to the additional latencies
of going to an indirect block or...

So, this has a performance benefit, in addition to
removing the case where an OS panic occurs in the
middle of the disk block and we lose the original and
the full next iteration of the file. After the
write completes we should be able to update the
FS's node data struct.

Mitchell Erblich
Ex-Sun Kernel Engineer who proposed and implemented this
 in a limited release of UFS many years ago.
--


Ming Zhang wrote:
> 
> Hi All
> 
> I wonder if any one have idea about the performance loss caused by COW
> in ZFS? If you have to read old data out before write it to some other
> place, it involve disk seek.
> 
> Thanks
> 
> Ming
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS+NFS on storedge 6120 (sun t4)

2007-04-21 Thread Erblichs
Spencer,

Summary: I am not sure that v4 would have a significant
advantage over v3 or v2 in all environments. I just believe it can
have a significant advantage (with no/minimal drawbacks),
and one should use it if at all possible to verify
that it is not the bottleneck.

So, no, I cannot say that NFSv3 has the same performance
as v4. At its worst, I don't believe that v4
performs worse than v3, and at best it performs up to 2x or more
better than v3.

So,

The assumptions are:

-  V4 is being actively worked on,
-  v3 is stable but no major changes are being done on it..
-  leases,
-  better data caching (delagations and client callbacks),
-  state behaviour,
-  compound NFS requests (procs) to remove the sequential rtt
   of individual NFS requests
- Significantly improved lookups for pathing (multi-lookup)
  and later attr requests.. I am sure that the attr calls
  are/were a significant percentage of NFS ops.
- etc...
** I am not telling Spencer this; he should already
   know it, so skip ahead...

So, with the compound procs in v4, the increased latencies
of some of the ops might have a different congestion-type
behaviour (it scales better under more environments and
allows the IO bandwidth to be more of an issue).

So, yes, my assumption is that NFSv4 has a good possibility
of significantly outperforming v3. Either way, I know
of no degradation in any op moving to v4.

So, again, if we are tuning a setup, I would rather see what
 ZFS does with v4, knowing that a few performance holes were
 closed or almost closed versus v3. I don't think this is
 specific to Sun; it would apply to all NFSv4 environments.

**Yes, however, even when the public (Paw, Spencer, etc.) NFSv4 paper
was done, the SFS work was stated as not yet done.

-- LASTLY, I would also be interested in the actual times
   of the different TCP segments. To see if ACKs are
   constantly in the pipeline between the dst and src, or
   whether "slow-start restart" behaviour is occurring. It
   is also theoretically possible that, with delayed ACKs at the dst,
   the number of ACKs is reduced, which reduces the
   bandwidth (IO ops) of subsequent data bursts. Also,
   is Allman's ABC (Appropriate Byte Counting) being used in the TCP implementation?

Mitchell Erblich




Spencer Shepler wrote:
> 
> On Apr 21, 2007, at 9:46 AM, Andy Lubel wrote:
> 
> > so what you are saying is that if we were using NFS v4 things
> > should be dramatically better?
> 
> I certainly don't support this assertion (if it was being made).
> 
> NFSv4 does have some advantages from the perspective of enabling
> more aggressive file data caching; that will enable NFSv4 to
> outperform NFSv3 in some specific workloads.  In general, however,
> NFSv4 performs similarly to NFSv3.
> 
> Spencer
> 
> >
> > do you think this applies to any NFS v4 client or only Suns?
> >
> >
> >
> > -Original Message-
> > From: [EMAIL PROTECTED] on behalf of Erblichs
> > Sent: Sun 4/22/2007 4:50 AM
> > To: Leon Koll
> > Cc: zfs-discuss@opensolaris.org
> > Subject: Re: [zfs-discuss] Re: ZFS+NFS on storedge 6120 (sun t4)
> >
> > Leon Koll,
> >
> >   As a knowldegeable outsider I can say something.
> >
> >   The benchmark (SFS) page specifies NFSv3/v2 support, so I question
> >whether you ran NFSv4. I would expect a major change in
> >performance just from moving to NFS version 4 with ZFS.
> >
> >   The benchmark seems to stress your configuration enough that
> >   the latency to service NFS ops increases to the point of non
> >   serviced NFS requests. However, you don't know what is the
> >   byte count per IO op. Reads are bottlenecked against rtt of
> >   the connection and writes are normally sub 1K with a later
> >   commit. However, many ops are probably just file handle
> >   verifications which again are limited to your connection
> >   rtt (round trip time). So, my initial guess is that the number
> >   of NFS threads are somewhat related to the number of non
> >   state (v4 now has state) per file handle op. Thus, if a 64k
> >   ZFS block is being modified by 1 byte, COW would require a
> >   64k byte read, 1 byte modify, and then allocation of another
> >   64k block. So, for every write op, you COULD be writing a
> >   full ZFS block.
> >
> &

Re: [zfs-discuss] Re: ZFS+NFS on storedge 6120 (sun t4)

2007-04-21 Thread Erblichs
Leon Koll,

As a knowledgeable outsider I can say something.

The benchmark (SFS) page specifies NFSv3/v2 support, so I question
 whether you ran NFSv4. I would expect a major change in
 performance just from moving to NFS version 4 with ZFS.

The benchmark seems to stress your configuration enough that
the latency to service NFS ops increases to the point of
unserviced NFS requests. However, you don't know the
byte count per IO op. Reads are bottlenecked against the rtt of
the connection, and writes are normally sub-1K with a later
commit. However, many ops are probably just file handle
verifications, which again are limited to your connection
rtt (round trip time). So, my initial guess is that the number
of NFS threads is somewhat related to the number of non-state
(v4 now has state) per-file-handle ops. Thus, if a 64k
ZFS block is being modified by 1 byte, COW would require a
64k read, a 1-byte modify, and then allocation of another
64k block. So, for every write op, you COULD be writing a
full ZFS block.
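
In other words, the effective work per 1-byte update looks roughly like
the following application-level sketch (the 64k record size, the offset,
and the path are hypothetical, purely to show the read/modify/write
amplification):

/* rmw.c - what a 1-byte change costs once it is expanded to a full
 * 64k record: a 64k read plus a 64k write for one changed byte.
 * Sketch only: record size, offset, and path are hypothetical. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define RECSZ   (64 * 1024)

int
main(void)
{
        static unsigned char rec[RECSZ];
        off_t recstart = 10 * (off_t)RECSZ;     /* which record to touch */
        int fd = open("/pool/fs/datafile", O_RDWR);

        if (fd < 0) {
                perror("open");
                return (1);
        }
        /* read the whole record, flip one byte, write the whole record */
        if (pread(fd, rec, RECSZ, recstart) != RECSZ) {
                perror("pread");
                return (1);
        }
        rec[123] ^= 0xff;                       /* the actual 1-byte change */
        if (pwrite(fd, rec, RECSZ, recstart) != RECSZ) {
                perror("pwrite");
                return (1);
        }
        /* 1 byte of payload cost RECSZ bytes read and RECSZ bytes
         * (re)written to a new location under COW */
        return (close(fd) != 0);
}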

This COW philosophy works best with extending delayed writes, etc.,
where later reads make the trade-off of increased
latency of the larger block on a read op versus being able
to minimize the number of seeks on the write and read; basically
increasing the block size from, say, 8k to 64k. Thus, your
read latency goes up just to get the data off the disk
while minimizing the number of seeks, and dropping the read-ahead
logic for the needed 8k-to-64k file offset.

I do NOT know that THAT 4000 IOPS load would match your maximal
load, or that your actual load would never increase past 2000 IOPS.
Secondly, jumping from 2000 to 4000 seems to be too big of a jump
for your environment. Going to 2500 or 3000 might be more
appropriate. Lastly, wrt the benchmark, some remnants (NFS and/or ZFS
and/or benchmark) seem to remain that have a negative impact.

Lastly, my guess is that this NFS setup and the benchmark are stressing small
partial-block writes, and that is probably one of the worst-case
scenarios for ZFS. So, my guess is the proper analogy is trying to
kill a gnat with a sledgehammer. Each write IO op really needs to be
equal to a full-size ZFS block to get the full benefit of ZFS on a per-byte
basis.

Mitchell Erblich
Sr Software Engineer
-





Leon Koll wrote:
> 
> Welcome to the club, Andy...
> 
> I tried several times to attract the attention of the community to the
> dramatic performance degradation (about 3 times) of the NFS/ZFS vs. NFS/UFS
> combination - without any result:
> [1] http://www.opensolaris.org/jive/thread.jspa?messageID=98592
> [2] http://www.opensolaris.org/jive/thread.jspa?threadID=24015
> 
> Just look at two graphs in my posting dated August, 2006
> (http://napobo3.blogspot.com/2006/08/spec-sfs-bencmark-of-zfsufsvxfs.html)
> to see how bad the situation was and, unfortunately, this situation
> wasn't changed much recently:
> http://photos1.blogger.com/blogger/7591/428/1600/sfs.1.png
> 
> I don't think the storage array is a source of the problems you reported. 
> It's somewhere else...
> 
> [i]-- leon[/i]
> 
> 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Linux

2007-04-18 Thread Erblichs
Joerg Schilling,

Stepping back into the tech discussion.

If we want a port of ZFS to Linux to get started, SHOULD the kitchen-sink
approach be abandoned for the 1.0 release? For later
releases, the dropped functionality could be added back in.

Suggested 1.0 Requirements
--
1) No NFS export support
2) Basic local FS support (all vnode-ops and VFS-op review)
3) Identify any FSs (with source availability) that are common
   between Linux and SunOS and use those as porting guides
4) Identify all Sun DDI/DKI calls that have no Linux equivalents
5) Identify what ZFS apps need supporting
6) Identify any/all libraries that are needed for the ZFS apps
7) Identify and acquire as many ZFS validation tests as possible.
8) Can we/should we assume that the Sun ZFS docs will suffice
   as the main reference, and identify any and all diffs in a
   supplementary doc?
9) Create a one-pager on the mmap() diffs.
10) Identify whether lookuppathname should be ported over to
    Linux, and whether a "ships-in-the-night" approach would
    cause more problems.

Mitchell Erblich
Sr Software Engineer
-


Joerg Schilling wrote:
> 
> Erblichs <[EMAIL PROTECTED]> wrote:
> 
> > Joerg Shilling,
> >
> >   Putting the license issues aside for a moment.
> 
> I was trying to point people to the fact that the biggest problems are
> technical problems and that the license discussion was done the wrong way.
> 
> >   If their is "INTEREST" in ZFS within Linux, should
> >a small Linux group be formed to break down ZFS in
> >easily portable sections and non-portable sections.
> >   And get a real-time/effort assessment as to what is
> >   needed to get it done.
> 
> Going back to the technical stuff:
> 
> -   The NFS export interface from Linux is weird and needs
> adoptation
> 
> -   Linux still has the outdated "namei" inteface instead of
> the more than 20 year old lookuppathname() interface
> from SunOS.
> 
> -   The mmap interface is extremely different
> 
> In general, the problem on Linux is that the Linux "vfs"
> interface is a low-level interface, so it is most likely easier
> to adopt a Linux FS to the Solaris vfs interface than vice versa.
> 
> There is nothing like the clean global vfsops and vnodeops on Solaris
> but a lot of small interfaces.
> 
> >   Assuming their is interest and usage, if ported, I
> >   would assume that someone/some group would make sure
> >   that the code is resynced on a periodic basis.
> 
> I also asume that the same people who are interested in a port
> will do the maintenance...
> 
> >   I know a FS from Veritas and SGI were reviewed in
> >   these manners. The Veritas's FS originally was
> >   developed using the Sun's VFS layer.
> >
> >   So, if the license issues are removed, I am sure
> >   that ZFS could be ported over to Linux. It is just
> >   time and effort...
> 
> I am sure it could be done but Linux people cannot assume that
> Sun will do it ;-)
> 
> Jörg
> 
> --
>  EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
>[EMAIL PROTECTED](uni)
>[EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
>  URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on the desktop

2007-04-17 Thread Erblichs
Rich Teer,

I have a perfect app for the masses.

A hi-def video/audio server for the hi-def TV
and audio setup.

I would think the average person would want
 to have access to 1000s of DVDs / CDs within
 a small box versus taking up a full wall.

Yes, assuming the quality were there...

Extrapolating the cost of drives, this is now
 reality for the few, but give it 1.5 years and this
 is for the masses.

Wouldn't this sell enough boxes to make this
 the new killer app?

Mitchell Erblich
Sr Software Engineer


Rich Teer wrote:
> 
> On Tue, 17 Apr 2007, Toby Thain wrote:
> 
> > The killer feature for me is checksumming and self-healing.
> 
> Same here.  I think anyone who dismisses ZFS as being inappropriate for
> desktop use ("who needs access to Petabytes of space in their desktop
> machine?!") doesn't get it.  (A close 2nd for me personally is the
> ease of creating mirrors, but granted that's on my servers rather than
> my desktop.)
> 
> --
> Rich Teer, SCSA, SCNA, SCSECA, OGB member
> 
> CEO,
> My Online Home Inventory
> 
> Voice: +1 (250) 979-1638
> URLs: http://www.rite-group.com/rich
>   http://www.myonlinehomeinventory.com
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Linux

2007-04-17 Thread Erblichs
Joerg Schilling,

Putting the license issues aside for a moment.

If there is "INTEREST" in ZFS within Linux, should
 a small Linux group be formed to break ZFS down into
 easily portable sections and non-portable sections,
and get a real time/effort assessment as to what is
needed to get it done?

Assuming there is interest and usage, if ported, I
would assume that someone/some group would make sure
that the code is resynced on a periodic basis.

I know FSs from Veritas and SGI were reviewed in
this manner. The Veritas FS was originally
developed using Sun's VFS layer.

So, if the license issues are removed, I am sure
that ZFS could be ported over to Linux. It is just
time and effort...

Mitchell Erblich
Ex: Sun Kernel Engineer



Joerg Schilling wrote:
> 
> Nicolas Williams <[EMAIL PROTECTED]> wrote:
> 
> > Sigh.  We have devolved.  Every thread on OpenSolaris discuss lists
> > seems to devolve into a license discussion.
> 
> It is funny to see that in our case, the technical problems (those caused
> by the fact that linux implements a different VFS interface layer) are
> creating much bigger problem than the license issue does.
> 
> > I have seen mailing list posts (I'd have to search again) that indicate
> > [that some believe] that even dynamic linking via dlopen() qualifies as
> > making a derivative.
> 
> There is no single place in the GPL that mentions the term "linking".
> For this reason, the GPL FAQ from the FSF is wrong as it is based on the
> term "linking".
> 
> There is no difference whether you link statically or dynamically.
> 
> Whether using GPLd code from a non-GPLd program creates a "derived work"
> thus cannot depend on whether you link agaist it or not. If a GPLd program
> however "uses" a non-GPLd library, this is definitely not a problem or
> every GPLd program linked against the libc from HP-UX would be a problem.
> 
> > If true that would mean that one could not distribute an OpenSolaris
> > distribution containing a GPLed PAM module.  Or perhaps, because in that
> > case the header files needed to make the linking possible are not GPLed
> > the linking-makes-derivatives argument would not apply.
> 
> If the GPLd PAM module just implements a well known plug in interface,
> a program that uses this module cannot be a derivative of the GPLd code.
> 
> Jörg
> 
> --
>  EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
>[EMAIL PROTECTED](uni)
>[EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
>  URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS for Linux (NO LISCENCE talk, please)

2007-04-17 Thread Erblichs
Toby Thain,

I am sure someone will devise a method of subdividing
the FS and run a background fsck and/or checksums on the
different file objects or ... before this becomes an issue. :)

Mitchell Erblich
-



Toby Thain wrote:
> 
> >
> > It seems that there are other reasons for the Linux kernel folks
> > for not
> > liking ZFS.
> 
> I certainly don't understand why they ignore it.
> 
> How can one have a "Storage and File Systems Workshop" in 2007
> without ZFS dominating the agenda??
> http://lwn.net/Articles/226351/
> 
> That "long fscks" should be a hot topic, given the state of the art,
> is just bizarre.
> 
> --Toby
> 
> >
> > Jörg
> >
> > --
> >  EMail:[EMAIL PROTECTED] (home) Jörg Schilling
> > D-13353 Berlin
> >[EMAIL PROTECTED](uni)
> >[EMAIL PROTECTED] (work) Blog: http://
> > schily.blogspot.com/
> >  URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/
> > pub/schily
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS for Linux (NO LISCENCE talk, please)

2007-04-17 Thread Erblichs
Group,

Did Joerg Schilling bring up a bigger issue within this
discussion thread?

> And it seems that you misunderstand the way the Linux kernel is developed.
> If _you_ started a ZFS project for Linux, _you_ would need to maintain it too
> or otherwise it would not be kept up to date. Note that it is a well known
> fact that a lot of the non-mainstream parts of the linux kernel sources
> do not work although they _are_ part of the linux kernel source tree.

Whose job is it to "clean" or declare for removal kernel
sources that "do not work"?

Mitchell Erblich
---

Joerg Schilling wrote:
> 
> "David R. Litwin" <[EMAIL PROTECTED]> wrote:
> 
> > If you refer to the licensing, yes. Coding-wise, I have no idea exept
> > to say that I would be VERY surprised if ZFS can not be ported to
> > Linux, especially since there already
> > exists the FUSE project.
> 
> So if you are interested in this project, I would encourage you to just start
> with the code...
> 
> > > ZFS is not part of the Linux Kernel. Only if you declare ZFS a "part of
> > > Linux", you will observe the license conflict.
> >
> >
> > And, as brought up elsewhere, ZFS would have to be a part of the
> > Kernel -- or else some persons would have to employ Herculean
> > attention to make sure ZFS was upgraded with the kernel. if some one
> > were
> > willing to do this, a swift resolution MAY ba possible.
> 
> The fact that someone may put the ZFS sources in the Linux source tree
> does not make it a part of that software
> 
> And it seems that you misunderstand the way the Linux kernel is developed.
> If _you_ started a ZFS project for Linux, _you_ would need to maintain it too
> or otherwise it would not be kept up to date. Note that it is a well known
> fact that a lot of the non-mainstream parts of the linux kernel sources
> do not work although they _are_ part of the linux kernel source tree.
> 
> Creating a port does not mean that you may forget about it once you believe 
> that
> you are ready.
> 
> > The GPL is talking about "works" and there is no problem to use GPL code
> > > together with code under other licenses as long as this is mere
> > > aggregation
> > > (like creating a driver for Linux) instead of creating a "derived work".
> > >
> > > It seems that there are other reasons for the Linux kernel folks for not
> > > liking ZFS.
> >
> >
> > Indeed? What are these reasons? I want to have every thing in the open.
> 
> This is something you would need to ask the Linux kernel folks
> 
> Jörg
> 
> --
>  EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
>[EMAIL PROTECTED](uni)
>[EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
>  URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Gzip compression for ZFS

2007-04-05 Thread Erblichs
My two cents,

Assuming that you may pick a specific compression algorithm,
most algorithms have different levels/percentages of
deflation/inflation, which affects the time to compress
and/or inflate wrt the CPU capacity.

Secondly, if I can add an additional item: would anyone
want to be able to encrypt the data instead of compressing it, or to
be able to combine encryption with compression?

Third, if data were compressed within a file
object, should a reader be made aware that the data
being read is compressed, or should it just read
garbage? Would/should a field in the znode be read
transparently so that already-compressed data
is decompressed?

Fourth, if you take 8k, expect to allocate 8k of disk
block storage for it, and compress it to 7k, are you
really saving 1k? Or are you just creating an additional
1k of internal fragmentation? It is possible that moving
7k of data across your "SCSI"-type interface may
give you faster read/write performance. But that comes
after the additional latency of the compress on the
async write, and it adds a real latency on the current
block read. So, what are you really gaining?
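
A back-of-the-envelope sketch of that question (the 512-byte allocation
unit and the "must save at least 1/8" cutoff below are assumptions for
illustration, not a statement of the actual ZFS policy):

/* alloc_gain.c - does compressing 8k down to 7k save anything once the
 * result is rounded up to whole allocation units?  Sketch only: the
 * 512-byte unit and the 1/8 cutoff are assumptions. */
#include <stdio.h>

#define ALLOC_UNIT      512

static size_t
allocated(size_t nbytes)
{
        /* round up to a whole number of allocation units */
        return (((nbytes + ALLOC_UNIT - 1) / ALLOC_UNIT) * ALLOC_UNIT);
}

int
main(void)
{
        size_t logical = 8192, compressed = 7 * 1024;
        size_t before = allocated(logical);
        size_t after = allocated(compressed);

        printf("logical %zu -> %zu allocated; compressed %zu -> %zu allocated\n",
            logical, before, compressed, after);
        printf("real savings: %zu bytes (%.1f%%)\n",
            before - after, 100.0 * (before - after) / before);

        /* example cutoff: only keep the compressed copy if it saves 1/8 */
        if (before - after < before / 8)
                printf("below the cutoff: better to store it uncompressed\n");
        return (0);
}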

Fifth, and hopefully last: should the znode have a
new length field that keeps the uncompressed length
for POSIX compatibility? I am assuming large-file
support, where a process that is not large-file aware
should not even be able to open the file. With the
additional field (uncompressed size) the file may
lie on the boundary for the large-file open requirements.

Really last: why not just compress the data stream
before writing it out to disk? Then you can at least run
file(1) on it and identify the type of compression...
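
For what it is worth, a gzip stream carries a two-byte magic number
(0x1f 0x8b), so file(1)-style detection is trivial; a small sketch:

/* isgz.c - report whether a file starts with the gzip magic bytes,
 * the same check file(1) uses to label gzip streams.  Sketch only. */
#include <stdio.h>

int
main(int argc, char **argv)
{
        FILE *fp;
        unsigned char magic[2];

        if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return (2);
        }
        if (fread(magic, 1, 2, fp) == 2 &&
            magic[0] == 0x1f && magic[1] == 0x8b)
                printf("%s: gzip compressed data\n", argv[1]);
        else
                printf("%s: not a gzip stream\n", argv[1]);
        fclose(fp);
        return (0);
}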

Mitchell Erblich
-

Darren Reed wrote:
> 
> From: "Darren J Moffat" <[EMAIL PROTECTED]>
> ...
> > The other problem is that you basically need a global unique registry
> > anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is 
> > etc etc.  Similarly for crypto and any other transform.
> 
> I've two thoughts on that:
> 1) if there is to be a registry, it should be hosted by OpenSolaris
>and be open to all and
> 
> 2) there should be provision for a "private number space" so that
>people can implement their own whatever so long as they understand
>that the filesystem will not work if plugged into something else.
> 
> Case in point for (2), if I wanted to make a bzip2 version of ZFS at
> home then I should be able to and in doing so chose a number for it
> that I know will be safe for my playing at home.  I shouldn't have
> to come to zfs-discuss@opensolaris.org to "pick a number."
> 
> Darren
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Layout for multiple large streaming writes.

2007-03-13 Thread Erblichs
To the original poster,

FYI,

Accessing RAID drives at a constant "~70-75%" probably does not
leave enough headroom for degraded mode.

A normal rule of thumb is 50 to 60% constant utilization, to
allow the excess capacity to be absorbed in degraded
mode.

An "old" rule of thumb for determining for estimating
MTBF is if you have 100 drives and the single drive
is estimated at 30,000 hours (> 3years).. Then the 
expected failure will occur  in about 1 day/30 hours.
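
The arithmetic behind that rule of thumb, as a quick sketch (it assumes
independent drive failures, so the population failure rate is simply N
times the per-drive rate):

/* mtbf.c - back-of-the-envelope expected time between drive failures
 * in a population, assuming independent failures: MTBF_pop = MTBF / N. */
#include <stdio.h>

int
main(void)
{
        double drive_mtbf_hours = 30000.0;      /* ~3.4 years per drive */
        int ndrives = 100;
        double pop_mtbf = drive_mtbf_hours / ndrives;

        printf("expect a drive failure roughly every %.0f hours (~%.1f days)\n",
            pop_mtbf, pop_mtbf / 24.0);
        return (0);
}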

Thus, excess capacity always needs to be present to
allow time to reconstruct the RAID, the ability
to reconstruct it within a limited timeframe, and to
minimize any significantly increased latencies for
normal processing.

Mitchell Erblich
-


Richard Elling wrote:
> 
> > I have a setup with a T2000 SAN attached to 90 500GB SATA drives
> > presented as individual luns to the host.  We will be sending mostly
> > large streaming writes to the filesystems over the network (~2GB/file)
> > in 5/6 streams per filesystem.  Data protection is pretty important, but
> > we need to have at most 25% overhead for redundancy.
> >
> > Some options I'm considering are:
> > 10 x 7+2 RAIDZ2 w/ no hotspares
> > 7 x 10+2 RAIDZ2 w/ 6 spares
> >
> > Does any one have advice relating to the performance or reliability to
> > either of these?  We typically would swap out a bad drive in 4-6 hrs and
> > we expect the drives to be fairly full most of the  time ~70-75% fs
> > utilization.
> 
> What drive manufacturer & model?
> What is the SAN configuration?  More nodes on a loop can significantly
> reduce performance as loop arbitration begins to dominate.  This problem
> can be reduced by using multiple loops or switched fabric, assuming the
> drives support fabrics.
> 
> The data availability should be pretty good with raidz2.  Having hot spares
> will be better than not, but with a 4-6 hour (assuming 24x7 operations)
> replacement time there isn't an overwhelming need for hot spares -- double
> parity and fast repair time is a good combination.  We do worry more
> about spares when the operations are not managed 24x7 or if you wish
> to save money by deferring repairs to a regularly scheduled service
> window.  In my blog about this, I used a 24 hour logistical response
> time and see about an order of magnitude  difference in the MTTDL.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
> 
> In general, you will have better performance with more sets, so the
> 10-set config will outperform the 7-set config.
>  -- richard
> 
> 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] writes lost with zfs !

2007-03-11 Thread Erblichs
Ayaz Anjum and others,

I think once you move into NFS over TCP in a client/
server environment, the chance of lost data is significantly
higher than from just disconnecting a cable.

Scenario: before a client generates a delayed write
from its volatile DRAM client cache, the client reboots;

and/or an asynchronous or delayed write is done,
there is no error on the write, and the error is missed on the
close because the programmer didn't perform an fsync
on the fd before the close and/or didn't expect that a
close might fail (see the sketch below);

and/or the TCP connection is lost and the data is
not transferred.

Thus, I know of very few FSs that can guarantee against
data loss. What most modern FSs try to prevent is data
corruption and FS corruption, ...
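
A minimal sketch of the fsync-before-close pattern I mean, with both
error paths checked (the path and data here are hypothetical):

/* safe_write.c - write, fsync, and close with every error path
 * checked, so a failed commit is not silently ignored.  Sketch only;
 * the file name and record are made up. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
        const char *buf = "important record\n";
        int fd = open("/export/data/record", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
                perror("open");
                return (1);
        }
        if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
                perror("write");
                return (1);
        }
        /* force the data to stable storage before trusting it */
        if (fsync(fd) != 0) {
                perror("fsync");
                return (1);
        }
        /* close() can still report an error (e.g. over NFS); check it */
        if (close(fd) != 0) {
                perror("close");
                return (1);
        }
        return (0);
}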

However, I am surprised that you seem to indicate that
no hardware indication is/was present to show that some
form of hardware degradation/failure had occurred.

Mitchell Erblich








On 11-Mar-07, at 11:12 PM, Ayaz Anjum wrote:

>
> HI !
>
> Well as per my actual post, i created a zfs file as part of Sun  
> cluster HAStoragePlus, and then disconned the FC cable, since there  
> was no active IO hence the failure of disk was not detected, then i  
> touched a file in the zfs filesystem, and it went fine, only after  
> that when i did sync then the node panicked and zfs filesystem is  
> failed over to other node. On the othernode the file i touched is  
> not there in the same zfs file system hence i am saying that data  
> is lost. I am planning to deploy zfs in a production NFS  
> environment with above 2TB of Data where users are constantly  
> updating file. Hence my concerns about data integrity.

I believe Robert and Darren have offered sufficient explanations: You  
cannot be assured of committed data unless you've sync'd it. You are  
only risking data loss if your users and/or applications assume data  
is committed without seeing a completed sync, which would be a design  
error. This applies to any filesystem.

--Toby

> Please explain.
>
> thaks
>
> Ayaz Anjum
>
>
>
> Darren Dunham <[EMAIL PROTECTED]>
> Sent by: [EMAIL PROTECTED]
> 03/12/2007 05:45 AM
>
> To
> zfs-discuss@opensolaris.org
> cc
> Subject
> Re: Re[2]: [zfs-discuss] writes lost with zfs !
>
>
>
>
>
> > I have some concerns here,  from my experience in the past,  
> touching a
> > file ( doing some IO ) will cause the ufs filesystem to failover,  
> unlike
> > zfs where it did not ! Why the behaviour of zfs different than ufs ?
>
> UFS always does synchronous metadata updates.  So a 'touch' that  
> creates
> a file is going to require a metadata write.
>
> ZFS writes may not necessarily hit the disk until a transaction group
> flush.
>
> > is not this compromising data integrity ?
>
> It should not.  Is there a scenario that you are worried about?
>
> -- 
> Darren Dunham
> [EMAIL PROTECTED]
> Senior Technical Consultant TAOShttp:// 
> www.taos.com/
> Got some Dr Pepper?   San Francisco, CA bay  
> area
> < This line left intentionally blank to confuse you. >
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
>
>
>
>
>
>
>
> -- 
> 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks

2007-02-28 Thread Erblichs
Toby Thain,

No; "physical location" was for the exact on-disk location, and
"logical" was for the rest of my info.

But what I might not have made clear was the use of
fragments. There are two types of fragments: one
is the partial use of a logical disk block, and the other,
which I was also trying to refer to, is the moving of modified
sections of the file. The first was well used in
the Joy FFS implementation, where a FS and drive tended
to have a high cost-per-byte overhead and were fairly
small.

Now, let's make this perfectly clear. If an FS object is
large and written "somewhat" in sequence as a stream
of bytes, and then random FS logical blocks or physical
blocks are modified, the new FS object will be less
sequentially written, and that CAN decrease read performance.
Sorry, I tend to care less about write performance, due
to the fact that writes tend to be async, without threads
blocking while waiting for their operation to complete.

This will happen MOST as the FS fills and less optimal
locations in the FS are found for the COW blocks.

The same problem happens with memory on OSs that support
multiple page sizes, where a well-used system may not be
able to allocate large page sizes due to fragmentation.
Yes, this is an overloaded term... :)

Thus, FS performance may suffer even if there are just
a lot of 1-byte changes to frequently accessed FS objects.
If this occurs, either keep a larger FS, clean out the
FS more frequently, or back up, clean up, and then restore
to get newly sequential FS objects.

Mitchell Erblich
-


Toby Thain wrote:
> 
> On 28-Feb-07, at 6:43 PM, Erblichs wrote:
> 
> > ZFS Group,
> >
> >   My two cents..
> >
> >   Currently, in  my experience, it is a waste of time to try to
> >   guarantee "exact" location of disk blocks with any FS.
> 
> ? Sounds like you're confusing logical location with physical
> location, throughout this post.
> 
> I'm sure Roch meant logical location.
> 
> --T
> 
> >
> >   A simple reason exception is bad blocks, a neighboring block
> >   will suffice.
> >
> >   Second, current disk controllers have logic that translates
> >   and you can't be sure outside of the firmware where the
> >   disk block actually is. Yes, I wrote code in this area before.
> >
> >   Third, some FSs, do a Read-Modify-Write, where the write is
> >   NOT, NOT, NOT overwriting the original location of the read.
> ...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks

2007-02-28 Thread Erblichs
ZFS Group,

My two cents..

Currently, in my experience, it is a waste of time to try to
guarantee the "exact" location of disk blocks with any FS.

A simple exception is bad blocks, where a neighboring block
will suffice.

Second, current disk controllers have logic that translates,
and you can't be sure outside of the firmware where the
disk block actually is. Yes, I have written code in this area before.

Third, some FSs do a read-modify-write, where the write is
NOT, NOT, NOT overwriting the original location of the read.

Why? For a couple of reasons. One is that the original read
may have existed in a fragment. Some do it for FS consistency,
to allow the write to become a partial write in some
circumstances (e.g., a crash); the second file-block location
then allows for FS consistency and the ability to recover the
original contents. No overwrite.

Another reason is that sometimes we are filling a hole
within an FS object window, from a base address to a new offset.
The ability to concatenate allows us to reduce the number of
future seeks and small reads/writes, versus having a slightly
longer transfer time for the larger theoretical disk block.

Thus, the tradeoff is that we accept that we waste some FS
space, we may not fully optimize the location of the disk
block, and we have larger single-large-block read and write
latencies, but... we seek less, the per-byte overhead is
less, we can order our writes so that we again seek less, our
writes can be delayed (assuming that we might write multiple
times and then commit on close) to minimize the number of
actual write operations, we can prioritize our reads over
our writes to decrease read latency, etc.

The bottom line is that performance may suffer if we do a lot
of random small read-modify-writes within FS objects that
use a very large disk block. Since the actual CHANGE to
the file is small, each small write outside of a delayed-
write window will consume at least 1 disk block. However,
some writes are to FS objects that are write-through, and thus
each small write will consume a new disk block.

Mitchell Erblich
-



Roch - PAE wrote:
> 
> Jeff Davis writes:
>  > > On February 26, 2007 9:05:21 AM -0800 Jeff Davis
>  > > But you have to be aware that logically sequential
>  > > reads do not
>  > > necessarily translate into physically sequential
>  > > reads with zfs.  zfs
>  >
>  > I understand that the COW design can fragment files. I'm still trying to 
> understand how that would affect a database. It seems like that may be bad 
> for performance on single disks due to the seeking, but I would expect that 
> to be less significant when you have many spindles. I've read the following 
> blogs regarding the topic, but didn't find a lot of details:
>  >
>  > http://blogs.sun.com/bonwick/entry/zfs_block_allocation
>  > http://blogs.sun.com/realneel/entry/zfs_and_databases
>  >
>  >
> 
> Here is my take on this:
> 
> DB updates (writes) are mostly  governed by the  synchronous
> write  code  path which for ZFS   means the ZIL performance.
> It's already quite good  in   that it aggregatesmultiple
> updates into few I/Os.  Some further improvements are in the
> works.  COW, in general, simplify greatly write code path.
> 
> DB reads in a transaction  workloads  are mostly random.  If
> the DB  is not cacheable the performance  will  be that of a
> head seek no matter what FS is used (since we can't guess in
> advance where to seek, COW nature does  not help nor hinders
> performance).
> 
> DB reads in a decision workloads can benefit from good
> prefetching (since here we actually know where the next
> seeks will be).
> 
> -r
> 
>  > This message posted from opensolaris.org
>  > ___
>  > zfs-discuss mailing list
>  > zfs-discuss@opensolaris.org
>  > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Erblichs
Jeff Bonwick,

Do you agree that there is a major tradeoff in
"builds up a wad of transactions in memory"?

We lose the changes if we have an unstable
environment.

Thus, I don't quite understand why a 2-phase
approach to commits isn't done. First, take the
transactions as they come and do a minimal amount
of delayed writing. If the number of transactions
builds up, then convert to the delayed-write scheme.

The assumption is that not all ZFS environments are
write-heavy versus write-once, read-many type access.
My assumption is that attribute/metadata reading
outweighs all other accesses.

Wouldn't this approach allow minimal outstanding
transactions and favor read access? Yes, the assumption
is that once the "wad" is started, the amount of writing
could be substantial, and thus the amount of available
bandwidth for reading is reduced. This would then allow
more committed states (N) to be available. Right?

Second, there are multiple uses of "then" (then pushes,
then flushes all disk write caches, then writes the new uberblock,
then flushes the caches again), which seems to have
some level of possible parallelism that should reduce the
latency from the start to the final write. Or did you just
say it that way for simplicity's sake?

Mitchell Erblich
---


Jeff Bonwick wrote:
> 
> Toby Thain wrote:
> > I'm no guru, but would not ZFS already require strict ordering for its
> > transactions ... which property Peter was exploiting to get "fbarrier()"
> > for free?
> 
> Exactly.  Even if you disable the intent log, the transactional nature
> of ZFS ensures preservation of event ordering.  Note that disk caches
> don't come into it: ZFS builds up a wad of transactions in memory,
> then pushes them out as a transaction group.  That entire group will
> either commit or not.  ZFS writes all the new data to new locations,
> then flushes all disk write caches, then writes the new uberblock,
> then flushes the caches again.  Thus you can lose power at any point
> in the middle of committing transaction group N, and you're guaranteed
> that upon reboot, everything will either be at state N or state N-1.
> 
> I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
> thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
> every system call.
> 
> Jeff
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Heavy writes freezing system

2007-01-16 Thread Erblichs
Rainer Heilke,

You have 1/4 of the amount of memory that the E2900 system
is capable of (192GB, I think).

Secondly, output from fsstat(1M) could be helpful.

Run this command periodically and check whether the
values change over time.

Mitchell Erblich
---



Rainer Heilke wrote:
> 
> > What hardware is used?  Sparc? x86 32-bit? x86
> > 64-bit?
> > How much RAM is installed?
> > Which version of the OS?
> 
> Sorry, this is happening on two systems (test and production). They're both 
> Solaris 10, Update 2. Test is a V880 with 8 CPU's and 32GB, production is an 
> E2900 with 12 dual-core CPU's and 48GB.
> 
> > Did you already try to monitor kernel memory usage,
> > while writing to zfs?  Maybe the kernel is running
> > out of
> > free memory?  (I've bugs like 6483887 in mind,
> > "without direct management, arc ghost lists can run
> > amok")
> 
> We haven't seen serious kernel memory usage that I know of (I'll be honest--I 
> came into this problem late).
> 
> > For a live system:
> >
> > echo ::kmastat | mdb -k
> > echo ::memstat | mdb -k
> 
> I can try this if the DBA group is willing to do another test, thanks.
> 
> > In case you've got a crash dump for the hung system,
> > you
> > can try the same ::kmastat and ::memstat commands
> > using the
> > kernel crash dumps saved in directory
> > /var/crash/`hostname`
> >
> > # cd /var/crash/`hostname`
> > # mdb -k unix.1 vmcore.1
> > ::memstat
> > ::kmastat
> 
> The system doesn't actually crash. It also doesn't freeze _completely_. While 
> I call it a freeze (best name for it), it actually just slows down 
> incredibly. It's like the whole system bogs down like molasses in January. 
> Things happen, but very slowly.
> 
> Rainer
> 
> 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Erblichs
Hey guys,

Due to long URL lookups, the DNLC was pushed to variable-
sized entries. The hit rate was dropping because of
"name too long" misses. This was done long ago, while I
was at Sun, under a bug reported by me.

I don't know your usage, but you should attempt to
estimate the amount of memory used with the default size.

Yes, this is after you start tracking your DNLC hit rate
and make sure it doesn't significantly drop if the ncsize
is decreased. You also may wish to increase the size and
again check the hit rate.. Yes, it is posible that your
access is random enough that no changes will effect the
hit rte.

2nd item.. Bonwick's mem allcators I think still have the
ability to limit the size of each slab. The issue is that
some parts of the code expect non mem failures with
SLEEPs. This can result in extended SLEEPs, but can be
done.

If your company generates changes to your local source
and then you rebuild, it is possible to pre-allocate a
fixed number of objects per cache and then use NOLSLEEPs
with returning values that indicate to retry or failure.

3rd.. And could be the most important, the mem cache
allocators are lazy in freeing memory when it is not
needed by anyone else. Thus, unfreed memory is effectively
used as a cache to remove latencies of on-demand
memory allocations. This artificially keeps memory
usage high, but should have minimal latencies to realloc
when necessary.

Also, it is possible to make mods to increase the level
of mem garbage collection after some watermark code
is added to minimize repeated allocs and frees.


Mitchell Erblich


"Jason J. W. Williams" wrote:
> 
> Hi Robert,
> 
> We've got the default ncsize. I didn't see any advantage to increasing
> it outside of NFS serving...which this server is not. For speed the
> X4500 is showing to be a killer MySQL platform. Between the blazing
> fast procs and the sheer number of spindles, its perfromance is
> tremendous. If MySQL cluster had full disk-based support, scale-out
> with X4500s a-la Greenplum would be terrific solution.
> 
> At this point, the ZFS memory gobbling is the main roadblock to being
> a good database platform.
> 
> Regarding the paging activity, we too saw tremendous paging of up to
> 24% of the X4500s CPU being used for that with the default arc_max.
> After changing it to 4GB, we haven't seen anything much over 5-10%.
> 
> Best Regards,
> Jason
> 
> On 1/10/07, Robert Milkowski <[EMAIL PROTECTED]> wrote:
> > Hello Jason,
> >
> > Thursday, January 11, 2007, 12:36:46 AM, you wrote:
> >
> > JJWW> Hi Robert,
> >
> > JJWW> Thank you! Holy mackerel! That's a lot of memory. With that type of a
> > JJWW> calculation my 4GB arc_max setting is still in the danger zone on a
> > JJWW> Thumper. I wonder if any of the ZFS developers could shed some light
> > JJWW> on the calculation?
> >
> > JJWW> That kind of memory loss makes ZFS almost unusable for a database 
> > system.
> >
> >
> > If you leave ncsize with default value then I belive it won't consume
> > that much memory.
> >
> >
> > JJWW> I agree that a page cache similar to UFS would be much better.  Linux
> > JJWW> works similarly to free pages, and it has been effective enough in the
> > JJWW> past. Though I'm equally unhappy about Linux's tendency to grab every
> > JJWW> bit of free RAM available for filesystem caching, and then cause
> > JJWW> massive memory thrashing as it frees it for applications.
> >
> > Page cache won't be better - just better memory control for ZFS caches
> > is strongly desired. Unfortunately from time to time ZFS makes servers
> > to page enormously :(
> >
> >
> > --
> > Best regards,
> >  Robertmailto:[EMAIL PROTECTED]
> >http://milek.blogspot.com
> >
> >
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?

2006-11-09 Thread Erblichs

Bill Sommerfeld, sorry,

However, I am trying to explain what I think is
happening on your system and why I consider this
normal.

Most of the reads for an FS "replace" normally happen
at the block level.

To copy an FS, some level of reading MUST be done
from the orig_dev. At what level, and whether it is
recorded as a normal vnode read / mmap op for the
direct and indirect blocks, is another story.

But it is being done; it is just not being recorded
in the FS stats. Read stats normally cover normal FS
object access requests.

Secondly, maybe starting with the uberblock, the
rest of the metadata is probably being read. And
because of the normal async access of FSs, it would
not surprise me if each znode's access time field is
then updated. Remember that unless you are just
touching a low-level FS (file) object, all writes are
preceded by at least one read.

Mitchell Erblich





Bill Sommerfeld wrote:
> 
> On Thu, 2006-11-09 at 19:18 -0800, Erblichs wrote:
> > Bill Sommerfield,
> 
> Again, that's not how my name is spelled.
> 
> >   With some normal sporadic read failure, accessing
> >   the whole spool may force repeated reads for
> >   the replace.
> 
> please look again at the iostat I posted:
> 
>   capacity operationsbandwidth
> poolused  avail   read  write   read  write
> -  -  -  -  -  -  -
> z   306G   714G  1.43K658  23.5M  1.11M
>   raidz1109G   231G  1.08K392  22.3M   497K
> replacing  -  -  0   1012  0  5.72M
>   c1t4d0   -  -  0753  0  5.73M
>   c1t5d0   -  -  0790  0  5.72M
> c2t12d0-  -339177  9.46M   149K
> c2t13d0-  -317177  9.08M   149K
> c3t12d0-  -330181  9.27M   147K
> c3t13d0-  -352180  9.45M   146K
>   raidz1100G   240G117101   373K   225K
> c1t3d0 -  - 65 33  3.99M  64.1K
> c2t10d0-  - 60 44  3.77M  63.2K
> c2t11d0-  - 62 42  3.87M  63.4K
> c3t10d0-  - 63 42  3.88M  62.3K
> c3t11d0-  - 65 35  4.06M  61.8K
>   raidz1   96.2G   244G234164   768K   415K
> c1t2d0 -  -129 49  7.85M   112K
> c2t8d0 -  -133 54  8.05M   112K
> c2t9d0 -  -132 56  8.08M   113K
> c3t8d0 -  -132 52  8.01M   113K
> c3t9d0 -  -132 49  8.16M   112K
> 
> there were no (zero, none, nada, zilch) reads directed to the failing
> device.  there were a lot of WRITES to the failing device; in fact, the
> the same volume of data was being written to BOTH the failing device and
> the new device.
> 
> >   So, I was thinking that a read access
> >   that could ALSO be updating the znode. This newer
> >   time/date stamp is causing alot of writes.
> 
> that's not going to be significant as a source of traffic; again, look
> at the above iostat, which was representative of the load throughout the
> resilver.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?

2006-11-09 Thread Erblichs
Bill Sommerfield,

Because, first, I have seen a lot of I/O
occur while a snapshot is being aged out
of a system.

I don't think that during the resilvering process
accesses (reads, writes) are completely
stopped to the orig_dev.

I expect at least some metadata reads are
going on.

With some normal sporadic read failures, accessing
the whole pool may force repeated reads for
the replace.

So I was thinking that a read access
could ALSO be updating the znode; the newer
time/date stamp would be causing a lot of writes.

Depending on how the FS metadata and blocks are
being accessed, the orig_dev may also
have some normal writes until it is offlined.

Mitchell Erblich
-

Bill Sommerfeld wrote:
> 
> On Wed, 2006-11-08 at 01:54 -0800, Erblichs wrote:
> >
> > Bill Sommerfield,
> 
> that's not how my name is spelled
> >
> >   Are their any existing snaps?
> no.  why do you think this would matter?
> >
> >   Can you have any scripts that may be
> >   removing aged files?
> no; there was essentially no other activity on the pool other than the
> "replace".
> 
> why do you think this would matter?
> 
> - Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O patterns during a "zpool replace": why write to the disk being replaced?

2006-11-08 Thread Erblichs


Bill Sommerfield,

Are there any existing snaps?

Can you have any scripts that may be 
removing aged files?

Mitchell Erblich
--

Bill Sommerfeld wrote:
> 
> On a v40z running snv_51, I'm doing a "zpool replace z c1t4d0 c1t5d0".
> 
> (so, why am I doing the replace?  The outgoing disk has been reporting
> read errors sporadically but with increasing frequency over time..)
> 
> zpool iostat -v shows writes going to the old (outgoing) disk as well as
> to the replacement disk.  Is this intentional?
> 
> Seems counterintuitive as I'd think you'd want to touch a suspect disk
> as little as possible and as nondestructively as possible...
> 
> A representative snapshot from "zpool iostat -v" :
> 
>   capacity operationsbandwidth
> poolused  avail   read  write   read  write
> -  -  -  -  -  -  -
> z   306G   714G  1.43K658  23.5M  1.11M
>   raidz1109G   231G  1.08K392  22.3M   497K
> replacing  -  -  0   1012  0  5.72M
>   c1t4d0   -  -  0753  0  5.73M
>   c1t5d0   -  -  0790  0  5.72M
> c2t12d0-  -339177  9.46M   149K
> c2t13d0-  -317177  9.08M   149K
> c3t12d0-  -330181  9.27M   147K
> c3t13d0-  -352180  9.45M   146K
>   raidz1100G   240G117101   373K   225K
> c1t3d0 -  - 65 33  3.99M  64.1K
> c2t10d0-  - 60 44  3.77M  63.2K
> c2t11d0-  - 62 42  3.87M  63.4K
> c3t10d0-  - 63 42  3.88M  62.3K
> c3t11d0-  - 65 35  4.06M  61.8K
>   raidz1   96.2G   244G234164   768K   415K
> c1t2d0 -  -129 49  7.85M   112K
> c2t8d0 -  -133 54  8.05M   112K
> c2t9d0 -  -132 56  8.08M   113K
> c3t8d0 -  -132 52  8.01M   113K
> c3t9d0 -  -132 49  8.16M   112K
> 
> - Bill
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] thousands of ZFS file systems

2006-10-30 Thread Erblichs
Hi,

My suggestion is to direct the output of any command
that may print thousands of lines to a file.

I have not tried that number of FSs. So my first
suggestion is to have a lot of physical memory installed.

The second item I would be concerned with is
path translation going through a lot of mount points.
I seem to remember that some old code had a limit of
256 mount points through a path. I don't know if it
still exists.

Mitchell Erblich
-



> Rafael Friedlander wrote:
> 
> Hi,
> 
> An IT organization needs to implement highly available file server,
> using Solaris 10, SunCluster, NFS and Samba. We are talking about
> thousands, even 10s of thousands of ZFS file systems.
> 
> Is this doable? Should I expect any impact on performance or stability
> due to the fact I'll have that many mounted filesystems, with
> everything implied from that fact ('df | wc -l' with thousands of
> lines of result, for instance)?
> 
> Thanks,
> 
> Rafael.
> --
> 
> ---
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] copying a large file..

2006-10-29 Thread Erblichs
Hi,

How much time is a "long time"?

Second, had a snapshot been taken after the file
was created?

Are the src and dst directories in the
 same slice?

What other work was being done at the time of
 the move?

Were there numerous files in the src or dst
 directories?

How much physical memory is in your system?

Does an equivalent move take a drastically shorter
 amount of time if done right after a reboot?

Mitchell Erblich
--



Pavan Reddy wrote:
> 
> 'mv' command took very long time to copy a large file from one ZFS directory 
> to another. The directories share the same pool and file system. I had a 385 
> MB file in one directory and wanted to move that to a different directory.  
> It took long time to move. Any particular reasons? There is no raid involved.
> 
> -Pavan
> 
> 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ENOSPC : No space on file deletion

2006-10-20 Thread Erblichs
Matthew, et al,

You haven't identified a solution / workaround?

There is one "large" file within the FS and
snapshot that has been backed up.

They wish to remove this large file, and the
system is preventing this because of an additional
reference from the snapshot.

For reasons of their own, they do not wish to
remove the entire snapshot.

Then how do you forcibly remove a file whose
removal fails with a no-space error?

Are you telling me that there is no way to access a single
file within the snapshot and remove it?

Mitchell Erblich


Matthew Ahrens wrote:
> 
> Erblichs wrote:
> >   Now the stupid question..
> >   If the snapshot is identical to the FS, I can't
> >   remove files from the FS because of the snapshot
> >   and removing files from the snapshot only removes
> >   a reference to the file and leaves the memory.
> >
> >   So, how do I do a atomic file removes on both the
> >   original and the snapshot(s). Yes, I am assuming that
> >   I have backed up the file offline.
> >
> >   Can I request a possible RFE to be able to force a
> >   file remove from the original FS and if found elsewhere
> >   remove that location too IFF a ENOSPC would fail the
> >   original rm?
> 
> No, you can not remove files from snapshots.  Snapshots can not be
> changed.  If you are out of space because of snapshots, you can always
> 'zfs destroy' the snapshot :-)
> 
> --matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ENOSPC : No space on file deletion

2006-10-19 Thread Erblichs
Hey guys,

I think I know what is going on.

An attempt was made to delete a set of files on a FS
that had almost consumed its reservation.

It failed because one or more snapshots hold
references to these files and the snapshots needed
to allocate FS space. Thus, the no-space error.

Now the stupid question:
if the snapshot is identical to the FS, I can't
remove files from the FS because of the snapshot,
and removing files from the snapshot only removes
a reference to the file and leaves the space consumed.

So how do I do an atomic file removal from both the
original and the snapshot(s)? Yes, I am assuming that
I have backed up the file offline.

Can I request a possible RFE to be able to force a
file removal from the original FS and, if the file is
referenced elsewhere, remove that reference too, IFF an
ENOSPC would fail the original rm?

Thanks,
Mitchell Erblich
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Self-tuning recordsize

2006-10-17 Thread Erblichs
Group, et al, 

I don't understand: if the problem is systemic, based on
the number of continually dirty pages and the stress of
cleaning those pages, then why solve it per FS?

The problem is FS-independent, because any number of
different installed FSs can equally consume pages.
Thus, solving the problem on a per-FS basis seems to me a
band-aid approach.

Then why doesn't the OS determine that a dangerously high
watermark number of pages is continually being paged out
(we have swapped, and a large percentage of available pages
is always dirty, based on recent past history) and thus:

 * force the writes to a set of predetermined pages (limit the
   number of pages used for I/O),
 * schedule I/O for these pages immediately, rather than waiting
   until the pages are needed and found dirty
   (hopefully a percentage of these pages will be cleaned and
   be immediately available if needed in the near future).

 Yes, the OS could redirect the I/O as direct I/O without
 using the page cache, but the assumption is that these
 procs are behaving as multiple readers and need the cached
 page data in the near future. Changing the behaviour so the
 pages are not cached, just because they CAN totally consume
 the cache, removes the multiple readers' reason to cache the
 data in the first place. Thus...

*  guarantee that heartbeats are always regular by preserving
   5 to 20% of pages for exec / text,
*  limit the number of interrupts being generated by the network
   so low-level SCSI interrupts can page and not be starved
   (something the white paper did not mention)
   (yes, this will cause the loss of UDP-based data, but we
   need to generate some form of backpressure / explicit
   congestion event),
* if the files coming in from the network are TCP-based, hopefully
  a segment would be dropped and act as backpressure on
  the originator of the data,
* if the files are being read from the FS, then a max I/O rate
  should be determined based on the number of pages that are
  clean and ready to accept FS data,
*  etc.

Thus, tuning that determines whether the page cache should be
used for writes or reads should allow one set of processes not
to adversely affect the operation of other processes.

And any OS should slow down only the processes dirtying the I/O
pages, with other processes' work being unaffected by the
I/O issues.
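
To make the watermark idea concrete, here is a toy user-space sketch
(the numbers, names, and policy below are invented for illustration;
this is not Solaris code):

#include <stdio.h>

#define TOTAL_PAGES     1000
#define HIGH_WATER       800    /* start throttling writers above this */
#define LOW_WATER        600    /* stop throttling once we fall below this */

static int dirty_pages;
static int throttling;

/* a writer wants to dirty npages; return how many are accepted */
static int
dirty(int npages)
{
        if (dirty_pages >= HIGH_WATER)
                throttling = 1;
        else if (dirty_pages <= LOW_WATER)
                throttling = 0;

        if (throttling)
                npages /= 4;            /* back-pressure: admit writes slowly */
        dirty_pages += npages;
        return (npages);
}

/* the cleaner reports pages written out */
static void
cleaned(int npages)
{
        dirty_pages -= npages;
        if (dirty_pages < 0)
                dirty_pages = 0;
}

int
main(void)
{
        int i;

        for (i = 0; i < 20; i++) {
                int took = dirty(100);  /* greedy writer */
                cleaned(40);            /* slower cleaner */
                printf("iter %2d: accepted %3d, dirty %4d/%d, throttling %d\n",
                    i, took, dirty_pages, TOTAL_PAGES, throttling);
        }
        return (0);
}
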

Mitchell Erblich
-

Richard Elling - PAE wrote:
> 
> Roch wrote:
> > Oracle will typically create it's files with 128K writes
> > not recordsize ones.
> 
> Blast from the past...
> http://www.sun.com/blueprints/0400/ram-vxfs.pdf
> 
>   -- richard
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Self-tuning recordsize

2006-10-14 Thread Erblichs
Nico,

Yes, I agree.

But single large random reads and writes would also
benefit from a large record size, so I didn't try to make that
distinction. However, I "guess" that the best large random
reads and writes would fall within single filesystem record
sizes.

No, I haven't reviewed whether the holes (disk block space)
tend to be multiples of the record size, the page size, or ..
Would a write of recordsize bytes that didn't fall on a
record-size boundary write into 2 filesystem blocks / records?

However, would extremely large record sizes, say 1MB or more
(what is the limit?), open up write atomicity issues
or file corruption issues? Would record sizes like these
be equal to multiple-track writes?

Also, because of the "disk block" allocation strategy, I
wasn't too sure that any form of multiple-disk-block
contiguousness still applies to ZFS with smaller record
sizes. Yes, contiguousness minimizes seek and rotational
latencies and helps with read-ahead and "write behind"...

Oh, but once writes to a file have begun, in the past
this has frozen the recordsize. So "self-tuning" or
adjustments probably NEED to be decided at the creation
of the FS object, OR some type of copy mechanism needs to
be applied to a new file with a different record size at a
later time, when the default or past record size is
determined to be significantly incorrect. Yes, I assume
that many reads/writes will occur in the future that
will amortize the copy cost.

So, yes group... I am still formulating the "best"
algorithm for this. ZFS applies a lot of knowledge gained
from UFS (page list handling, checksum handling,
large-file awareness/support), but adds a new twist to
things.

Mitchell Erblich
--


    

Nicolas Williams wrote:
> 
> On Fri, Oct 13, 2006 at 09:22:53PM -0700, Erblichs wrote:
> >   For extremely large files (25 to 100GBs), that are accessed
> >   sequentially for both read & write, I would expect 64k or 128k.
> 
> Lager files accessed sequentially don't need any special heuristic for
> record size determination: just use the filesystem's record size and be
> done.  The bigger the record size, the better -- a form of read ahead.
> 
> Nico
> --
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Self-tuning recordsize

2006-10-13 Thread Erblichs
Group,

I am not sure I agree with the 8k size.

Since "recordsize" determines the size of filesystem blocks
for large files, my first consideration is what the maximum
size of the file object will be.

For extremely large files (25 to 100GB) that are accessed
sequentially for both read and write, I would expect 64k or 128k.

Putpage functions attempt to grab a number of pages off the
vnode and place their modified contents within disk blocks.
Thus, if disk blocks are larger, fewer of them are needed,
which can result in more efficient operations.

However, any small change to a filesystem block results
in the entire block being accessed, so small accesses
to a large block are very inefficient.

Lastly, access to a larger block will occupy the media
for a longer continuous period, possibly creating more
latency than necessary for another, unrelated op.
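
As a rough back-of-the-envelope illustration of the trade-off (the
file size and record sizes below are made up for the example):

#include <stdio.h>

int
main(void)
{
        long long file_size = 1LL << 30;        /* a 1 GB file, for illustration */
        int small_rs = 8 << 10;                 /* 8k recordsize */
        int large_rs = 128 << 10;               /* 128k recordsize */

        /* sequential rewrite of the whole file */
        printf("full rewrite: %lld blocks at 8k vs %lld blocks at 128k\n",
            file_size / small_rs, file_size / large_rs);

        /* a 1-byte update still touches one whole filesystem block */
        printf("1-byte update touches %d bytes at 8k vs %d bytes at 128k\n",
            small_rs, large_rs);
        return (0);
}
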

Hope this helps...

Mitchell Erblich
---


Nicolas Williams wrote:
> 
> On Fri, Oct 13, 2006 at 08:30:27AM -0700, Matthew Ahrens wrote:
> > Jeremy Teo wrote:
> > >Would it be worthwhile to implement heuristics to auto-tune
> > >'recordsize', or would that not be worth the effort?
> >
> > It would be really great to automatically select the proper recordsize
> > for each file!  How do you suggest doing so?
> 
> I would suggest the following:
> 
>  - on file creation start with record size = 8KB (or some such smallish
>size), but don't record this on-disk yet
> 
>  - keep the record size at 8KB until the file exceeds some size, say,
>.5MB, at which point the most common read size, if there were enough
>reads, or the most common write size otherwise, should be used to
>derive the actual file record size (rounding up if need be)
> 
> - if the selected record size != 8KB then re-write the file with the
>   new record size
> 
> - record the file's selected record size in an extended attribute
> 
>  - on truncation keep the existing file record size
> 
>  - on open of non-empty files without associated file record size stick
>to the original approach (growing the file block size up to the FS
>record size, defaulting to 128KB)
> 
> I think we should create a namespace for Solaris-specific extended
> attributes.
> 
> The file record size attribute should be writable, but changes in record
> size should only be allowed when the file is empty or when the file data
> is in one block.  E.g., writing "8KB" to a file's RS EA when the file's
> larger than 8KB or consists of more than one block should appear to
> succeed, but a subsequent read of the RS EA should show the previous
> record size.
> 
> This approach might lead to the creation of new tunables for controlling
> the heuristic (e.g., which heuristic, initial RS, file size at which RS
> will be determined, default RS when none can be determined).
> 
> Nico
> --
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs_vfsops.c : zfs_vfsinit() : line 1179: Src inspection

2006-10-13 Thread Erblichs
Group,

If there is a bad vfs ops template, why
wouldn't you just return (error) rather than
go on to create the vnode ops template?

My suggestion: after the cmn_err(),
return (error);
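
A minimal user-space analog of the suggested control flow (the
function names and message text are placeholders, not the real
zfs_vfsops.c identifiers):

#include <stdio.h>

/* stand-in for building the vfs ops; pretend the template is bad */
static int
setup_vfs_ops(void)
{
        return (-1);
}

/* stand-in for building the vnode ops */
static int
setup_vnode_ops(void)
{
        return (0);
}

static int
fs_init(void)
{
        int error;

        if ((error = setup_vfs_ops()) != 0) {
                fprintf(stderr, "bad vfs ops template\n");
                return (error);         /* bail out; don't build the vnode ops */
        }
        return (setup_vnode_ops());
}

int
main(void)
{
        return (fs_init() != 0);
}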

Mitchell Erblich
---
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] single memory allocation in the ZFS intent log

2006-10-06 Thread Erblichs
Group,

This example is done with a single-threaded app.
It is NOT NOT NOT intended to show any level of
thread-safe coding.

It is ONLY used to show that it is significantly cheaper
to grab pre-allocated objects than to allocate the
objects on demand.

Thus, grabbing 64-byte chunks off a free list and
placing them back on can be done with this simple
base code even when dealing with 1Gb/sec interfaces.

Under extreme circumstances, the normal on-demand
allocator can sleep if it needs to coalesce memory
or steal it from another's cache.

Mitchell Erblich
-

Frank Hofmann wrote:
> 
> On Thu, 5 Oct 2006, Erblichs wrote:
> 
> > Casper Dik,
> >
> >   After my posting, I assumed that a code question should be
> >   directed to the ZFS code alias, so I apologize to the people
> >   show don't read code. However, since the discussion is here,
> >   I will post a code proof here. Just use "time program" to get
> >   a generic time frame. It is under 0.1 secs for 500k loops
> >   (each loop does removes a obj and puts it back).
> >
> >   It is just to be used as a proof of concept that a simple
> >   pre-alloc'ed set of objects can be accessed so much faster
> >   than allocating and assigning them.
> 
> Ok, could you please explain how is this piece (and all else, for that
> matter):
> 
> /*
>   * Get a node structure from the freelist
>   */
> struct node *
> node_getnode()
> {
> struct node *node;
> 
> if ((node = nodefree) == NULL)  /* "shouldn't happen" */
> printf("out of nodes");
> 
> nodefree = node->node_next;
> node->node_next = NULL;
> 
> return (node);
> }
> 
> is multithread-safe ?
> 
> Best wishes,
> FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] single memory allocation in the ZFS intent log

2006-10-05 Thread Erblichs
Casper Dik,

After my posting, I assumed that a code question should be
directed to the ZFS code alias, so I apologize to the people
who don't read code. However, since the discussion is here,
I will post a code proof here. Just use "time program" to get
a generic time frame. It is under 0.1 secs for 500k loops
(each loop removes an object and puts it back).

It is just to be used as a proof of concept that a simple
pre-alloc'ed set of objects can be accessed much faster
than by allocating and assigning them on demand.

To support the change to the intent log objects, I would suggest
first identifying the number of objects normally allocated and
using that as the working set of objects. A time element is also
needed to identify when objects should be released from the
free list back to the memory cache. Yes, the initial thinking
of having per-CPU allocations is included, which would allow
multiple simultaneous accesses, one freelist per CPU. This
should remove most of the mutex code necessary for scalability.

Objective
---
   What this app code proves is that pre-alloc'ed items
can be removed from and returned to a simple free list.
This is just an initial proof that shows "fast access" to a
working set of objects.

   The time to make one chunk alloc, place all the pieces on a
free list, and then perform 500k removal and insertion ops
is probably somewhere between 50 and 1000x faster than even the
best memory allocators allocating/retrieving 500k items. If a
dynamic list of nodes is required, the chunk alloc should be changed.

  This quick piece of application code runs in less than 0.1 sec for
500k retrieve and store ops. That is fast enough to grab 64-byte
chunks even when dealing with a 1Gb Ethernet link. Even though this
code is simplified, it indicates that kmem_alloc users would see the
same benefit, even without sleeping.

-

The code does a single pre-alloc and then breaks the allocation
up into N node pieces. It takes each piece and places it
on a free list in the init section. The assumption here is
that we have a fixed, reasonable number of items. If the
number of items is dynamic, the init could easily alloc
a number of nodes and then use watermarks to alloc into and
free from the free list as the nodes are used.

If the logic is used with kmem ops, then any free
nodes could be returned to memory whenever excess nodes sit
on the free list.

This type of logic is normally used when a special
program/project requires non-standard interfaces
to guarantee a HIGH level of performance.

The main has a hard-coded 500k loops, each of which allocs one
node and then frees it. Thus, the equivalent of 500k allocs is
done. This ran in 0.02 to 0.35 secs on a 1.3GHz
laptop Linux box.
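
For concreteness, here is a reconstruction of that proof of concept
along the lines described above (a sketch in the same spirit, not the
exact code that was offered; the node size, working-set size, and
loop count are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NNODES  1024                    /* pre-allocated working set */
#define NLOOPS  500000                  /* remove + insert cycles to time */

struct node {
        struct node *node_next;
        char         node_payload[56];  /* pad the object to 64 bytes */
};

static struct node *nodefree;           /* single (non-MT-safe) freelist */

/*
 * One chunk allocation, carved into NNODES pieces and pushed
 * onto the freelist.
 */
static void
node_init(void)
{
        struct node *chunk = malloc(sizeof (struct node) * NNODES);
        int i;

        if (chunk == NULL) {
                perror("malloc");
                exit(1);
        }
        for (i = 0; i < NNODES; i++) {
                chunk[i].node_next = nodefree;
                nodefree = &chunk[i];
        }
}

/* get a node structure from the freelist */
static struct node *
node_getnode(void)
{
        struct node *node = nodefree;

        if (node == NULL)               /* "shouldn't happen" */
                return (NULL);
        nodefree = node->node_next;
        node->node_next = NULL;
        return (node);
}

/* return a node structure to the freelist */
static void
node_freenode(struct node *node)
{
        node->node_next = nodefree;
        nodefree = node;
}

int
main(void)
{
        clock_t start, end;
        int i;

        node_init();
        start = clock();
        for (i = 0; i < NLOOPS; i++)
                node_freenode(node_getnode());
        end = clock();
        printf("%d get/free pairs in %.3f seconds\n", NLOOPS,
            (double)(end - start) / CLOCKS_PER_SEC);
        return (0);
}
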

-

It is my understanding that Bonwick's new allocator was created
to remove fragmentation. And yes, it also allows the OS to reduce
the overhead of dealing with memory objects that processes
free and alloc frequently. When the system gets low on
memory, it steals freed objects that are being cached. However,
even with no SLEEPING, I have yet to see it perform as fast as
simple retrieves and stores.

Years ago, the amount of memory on a system was limited due to
its expense. This is no longer the case. Some/most
processes/threads could see a decent increase in performance
if the amount of work done on a working set of objects is
minimized. Up to that working set, I propose that an almost
guaranteed level of performance could be achieved.

With that comes the comment that any functionality that has merit
should get an API so multiple processes / threads can use it.
Years ago I was in the "process" of doing that
when I left a company with an ARC group. It was to add a layer
of working-set memory objects that would have "fast access"
properties.

I will ONLY GUARANTEE that X working-set objects, once freed
to the FREE LIST, can be re-allocated without waiting for the objects.
Any count beyond that working set has the same underlying properties,
except that if I KNOW the number on my freelist has gone down to a
small value, I could pre-request more objects. The latency of
retrieving these objects could thus be minimized.

This logic then removes on-demand memory allocations, so any WAIT
time MAY not affect the other parts of the process that need more
 

Re: [zfs-discuss] single memory allocation in the ZFS intent log

2006-10-04 Thread Erblichs
Casper Dik,

Yes, I am familiar with Bonwick's slab allocators and tried
them for a wirespeed test of 64-byte pieces, first for 1Gb, then
100Mb, and lastly 10Mb Ethernet. My results were not
encouraging. I assume it has improved over time.

First, let me ask: what happens to the FS if the allocs
in the intent log code are sleeping, waiting for memory?

IMO, the general problem with memory allocators is:

- getting memory from a "cache" of one's own size/type
  costs orders of magnitude more than just getting some
  off one's own freelist,

- there is a built-in latency to recuperate/steal memory
  from other processes,

- this stealing forces a sleep and context switches,

- the amount of time to sleep is indeterminate with a single
  call per struct. How long can you sleep for? 100ms or
  250ms or more?

- no process can guarantee a working set.

When memory was expensive, maybe a global
sharing mechanism made sense, but now that the amount
of memory is somewhat plentiful and cheap:

*** It then makes sense to have a 2-stage implementation:
preallocation of a working set, and then normal allocation
with the added latency.

So, it makes sense to pre-allocate a working set of objects
with a single alloc call, break the allocation up into the needed
sizes, and then alloc from your own free list,

-> and if that freelist empties, then take the extra
overhead of the kmem call. Consider this an expected cost of
exceeding a certain watermark.

Otherwise, I bet that if I give you some code for the pre-alloc,
10 allocs from the freelist can be done in the time of one
kmem_alloc call, and at least 100 to 10k if a sleep occurs on
your side.

Actually, I think it is so bad that you should time 1 kmem_free
against grabbing elements off the freelist.

However, don't trust me; I will drop a snapshot of the code to you
tomorrow if you want, and you can make a single-CPU benchmark
comparison.

Your multiple-CPU issue forces me to ask: is it a common
occurrence that 2 or more CPUs are simultaneously requesting
memory for the intent log? If it is, then there should be a
freelist with a low-watermark set of elements per CPU. However,
one thing at a time.

So, do you want that code? It will do a single alloc of X units
and then place them on a freelist. You then time how long it takes to
remove Y elements from the freelist versus 1 kmem_alloc with
a KM_NOSLEEP arg and report the numbers. Then I would suggest the
same call with the smallest sleep possible. How many allocs can then
be done in that time? 25k, 35k, more?

Oh, the reason we aren't timing the initial kmem_alloc call
for the freelist is that I expect it to occur during init,
which would not proceed until the memory is alloc'ed.


Mitchell Erblich








[EMAIL PROTECTED] wrote:
> 
> >   at least one location:
> >
> >   When adding a new dva node into the tree, a kmem_alloc is done with
> >   a KM_SLEEP argument.
> >
> >   thus, this process thread could block waiting for memory.
> >
> >   I would suggest adding a  pre-allocated pool of dva nodes.
> 
> This is how the Solaris memory allocator works.  It keeps pools of
> "pre-allocated" nodes about until memory conditions are low.
> 
> >   When a new dva node is needed, first check this pre-allocated
> >   pool and allocate from their.
> 
> There are two reasons why this is a really bad idea:
> 
> - the system will run out of memory even sooner if people
>   start building their own free-lists
> 
> - a single freelist does not scale; at two CPUs it becomes
>   the allocation bottleneck (I've measured and removed two
>   such bottlenecks from Solaris 9)
> 
> You might want to learn about how the Solaris memory allocator works;
> it pretty much works like you want, except that it is all part of the
> framework.  And, just as in your case, it does run out some times but
> a private freelist does not help against that.
> 
> >   Why? This would eliminate a possible sleep condition if memory
> >is not immediately available. The pool would add a working
> >set of dva nodes that could be monitored. Per alloc latencies
> >could be amortized over a chunk allocation.
> 
> That's how the Solaris memory allocator already works.
> 
> Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] single memory allocation in the ZFS intent log

2006-10-03 Thread Erblichs
group,

In at least one location:

When adding a new dva node into the tree, a kmem_alloc is done with
a KM_SLEEP argument.

Thus, this process thread could block waiting for memory.

I would suggest adding a pre-allocated pool of dva nodes.

When a new dva node is needed, first check this pre-allocated
pool and allocate from there.

Why? This would eliminate a possible sleep condition if memory
 is not immediately available. The pool would add a working
 set of dva nodes that could be monitored. Per-alloc latencies
 could be amortized over a chunk allocation.

 Lastly, if memory is scarce, a long time may pass before
 the node could be allocated into the tree. If the number
 is monitored, it is possible that restricted operations
 could be performed until the intent log decreases in
 size.

I can supply untested code within 24 hours if wanted.
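
In that spirit, a rough user-space sketch of what such a pool might
look like (untested and illustrative only; "dva_node" is a stand-in
structure, and a kernel version would use kmem_alloc and need locking
or per-CPU lists for MT safety):

#include <stdlib.h>

struct dva_node {
        struct dva_node *dn_next;
        char             dn_data[64];   /* placeholder payload */
};

static struct dva_node *dva_pool;       /* pre-allocated nodes */
static int dva_pool_cnt;                /* monitored working-set count */

/* one chunk allocation, carved up and pushed onto the pool */
static void
dva_pool_init(int nnodes)
{
        struct dva_node *chunk = malloc(sizeof (*chunk) * nnodes);
        int i;

        if (chunk == NULL)
                return;                 /* pool stays empty; fallback still works */
        for (i = 0; i < nnodes; i++) {
                chunk[i].dn_next = dva_pool;
                dva_pool = &chunk[i];
                dva_pool_cnt++;
        }
}

/* try the pool first; fall back to an on-demand allocation */
static struct dva_node *
dva_node_alloc(void)
{
        struct dva_node *dn = dva_pool;

        if (dn != NULL) {
                dva_pool = dn->dn_next;
                dva_pool_cnt--;
                return (dn);
        }
        /* kernel version: kmem_alloc(sizeof (*dn), KM_SLEEP), as today */
        return (malloc(sizeof (*dn)));
}

/* return a node to the pool for reuse */
static void
dva_node_free(struct dva_node *dn)
{
        dn->dn_next = dva_pool;
        dva_pool = dn;
        dva_pool_cnt++;
}

int
main(void)
{
        dva_pool_init(128);
        dva_node_free(dva_node_alloc());
        return (0);
}
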

Mitchell Erblich
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss