Re: [zfs-discuss] Metadata (DDT) Cache Bias

2011-06-03 Thread Daniel Carosone
Edward Ned Harvey writes:
>  > If you consider the extreme bias...  If the system would never give up
>  > metadata in cache until all the cached data were gone...  Then it would be
>  > similar to the current primarycache=metadata, except that the system would
>  > be willing to cache data too, whenever there was available cache otherwise
>  > going to waste.

I like this, and it could be another value for the same property:
metabias, metadata-bias, prefer-metadata, whatever.

On Fri, Jun 03, 2011 at 06:25:45AM -0700, Roch wrote:
> Interesting. Now consider this :
> 
> We have an indirect block in memory (those are 16K
> referencing 128 individual data blocks). We also have an
> unrelated data block, say 16K. Neither is currently being
> referenced, nor has it been for a long time (otherwise it
> would have moved up to the head of the cache lists).  They reach the
> tail of the primary cache together. I have room for one of
> them in the secondary cache. 
> 
> Absent other information, do we think that the indirect
> block is more valuable than the data block ? At first I also
> wanted to say that metadata should be favored. Now I can't come
> up with an argument to favor either one. 

The effectiveness of a cache depends on the likelihood of a hit
against a cached value, vs the cost of keeping it.

Including information that lets us predict this future likelihood
based on past access patterns can improve things immensely. This is what
the ARC algorithm does quite well.

Absent this information, we assume the probability of future access to
all data blocks not currently in ARC is approximately equal.  The
indirect metadata block is therefore 127x as likely to be needed as
the one data block, since if any of its data blocks is needed, the
indirect block will be needed to find it.
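
To make that back-of-envelope argument concrete, here's a trivial
sketch (plain Python, not ZFS code; p is a made-up per-block access
probability, assumed small, equal and independent for every uncached
block):

p = 1e-5        # hypothetical probability any given uncached block gets read
children = 128  # data blocks referenced by one indirect block

# The lone data block earns its cache slot only if it is itself read.
p_data_hit = p

# The indirect block earns its slot if *any* of its 128 children is read.
p_indirect_hit = 1 - (1 - p) ** children

print(p_indirect_hit / p_data_hit)   # ~127.9, i.e. roughly 128:1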

> Therefore I think we need to include more information than just data
> vs metadata in the decision process.

If we have the information to hand, it may help - but we don't. 

The only thing I can think of that we may have is whether either block
was ever on the "frequent" list, or only ever on the "recent" list, to
catch the single-pass sequential access pattern and make it the lower
priority for cache residence.

I don't know how feasible it is to check whether any of the blocks
referenced by the indirect block are themselves in ARC, nor what that
might imply about the future likelihood of further accesses to other
blocks indirectly referenced by this one.
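
If this were pursued, the decision could reduce to a simple scoring
function. A purely hypothetical sketch in Python (the names, weights and
inputs are invented for illustration; nothing like this exists in the
ARC code today):

def keep_score(is_metadata, referenced_children, was_on_frequent_list):
    """Higher score = more worth keeping when there's room for only one."""
    # A block referencing N others is needed whenever any one of them is.
    score = referenced_children if is_metadata else 1
    # A block that only ever sat on the "recent" list looks like a
    # single-pass sequential read: deprioritize it.
    if not was_on_frequent_list:
        score *= 0.25
    # (Whether to also weigh how many of the children are already
    # resident, and in which direction, is the open question above.)
    return score

# Roch's indirect block vs. the unrelated data block, neither recently hot:
print(keep_score(True, 128, False))   # 32.0
print(keep_score(False, 1, False))    # 0.25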

> Instant Poll : Yes/No ?

Yes for this as an RFE, or at least as a q&d implementation to measure
potential benefit.

--
Dan.




[zfs-discuss] Experiences with SAS, SATA SSDs, and Expanders (Re: Should Intel X25-E not be used with a SAS Expander?)

2011-06-03 Thread Jeff Bacon
Let me throw two cents into the mix here. 

Background: I have probably 8 different ZFS boxes, BYO using SMC
chassis. The standard config now looks like such: 

- CSE847-E26-1400LPB main chassis, X8DTH-iF board, dual X5670 CPUs, 96G
RAM (some have 144G)
- Intel X520 dual-10G card
- 2 LSI 9211-8i controllers driving the internal backplanes, dual-attach
- 2 CSE847-E26-RJBOD1 chassis, connected via 4 LSI 9200-8e controllers
(so everything is dual-connected, no daisy-chaining)
- primary disks are 2TB Constellation SAS drives - lots and lots of them
- some Cheetah 450G 15k drives 
- some L2ARC and ZIL, mostly in the main chassis 
- sol10U9
- the 2TBs are all in 7-disk RAIDZ2s, the Cheetahs are mirrors. 

(Before you wonder: all of the ZFS filesystems are set compression=gzip.
The data compresses anywhere from 2:1 to 5:1, so the actual records being
written to disk are invariably of variable size and less than 128k -
which makes matching the # of drives to a 128k boundary a pointless
exercise, from all that I've seen.)
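
To spell that out with made-up but representative numbers (illustrative
arithmetic only, not measurements from these pools):

recordsize = 128 * 1024        # logical record size in bytes
data_drives = 5                # a 7-disk raidz2 = 5 data + 2 parity
for ratio in (2.0, 3.5, 5.0):  # gzip ratios in the observed 2:1 .. 5:1 range
    physical = recordsize / ratio
    print("%.1f:1 -> ~%.0fK on disk, ~%.1fK per data drive"
          % (ratio, physical / 1024, physical / 1024 / data_drives))

The per-drive chunk changes with every record, so there's no fixed 128k
boundary for the drive count to line up against.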

Some of the machines are NFS fileservers, some are running a proprietary
column-oriented time-series database. 

They don't all follow the pattern, as it's evolved over time. I have one
machine that has 100-some Barracuda 1TB drives hung off a bunch of LSI
3081E-R controllers in three zpools. Some use the onboard 82576 intel
NICs, no 10G. Some have a mix of 1068- and 2008-based controllers. CPUs
vary as does total memory. The main prod boxes are of course the most
standard. 

Note that I don't use the X25-Es like everyone else does; I use a mix of
drives:
- Crucial C300 128G drives for L2ARC
- OCZ Vertex 2 Pro 100G drives for ZIL 

I will be testing the new Intel 520s soon.

Personally, I've been noodling with Solaris for going on 20 years - I
cut my teeth on SunOS 3.2. 

-

I spent a fair bit of time mucking with SAS/SATA interposers, and I
agree with Richard that they're toxic.
 
a) they add latency - a not-inconsiderable amount
b) they're a bad idea. 

The thing is, the drive behind the interposer is still a single port. So
you have multipath to the interposer - some small gain - but is it really
worth it? I figure that if we lose a disk controller in some serious way,
we're hosed anyway and looking at a reboot, and we can figure on doing
whatever minor reconfig is necessary so long as all the right pieces are
in place.

Then too, it's a SAS controller on a tiny board. What firmware is it
running? What bugs are in that firmware? Who's maintaining it? Do you
really want another piece of weirdness in the chain? 

Plus, I never found anyone who made a mount for the SMC CSE-846/847
cases that would allow mounting the interposer and the drive in a
consistent way, and I didn't feel like fabricating one myself. So that
was that. I have a drawer full of interposers, if anyone cares.


--

One criterion to look for in choosing a SATA SSD to use: it needs to
report a GUID, so the expander chip and the driver can keep track of the
drive as a unique entity that is connected via a series of
ports/connections, not "some drive that's connected via that port on
that backplane via that port off that controller". I can't cite
scientific evidence of a tie between that and "having issues", but the
pattern holds. 

-

I freely mix the 2TB Constellations and the SATA drives on the same
backplanes and controllers. The main-line prod machines have been in
continuous use for at least 6 months, if not more, and quite literally
have the living crap beaten out of them. The primary db server has a
pair of X5690s and 144G and runs flat out 24x7 and thrashes its disks
like a mofo. We've learned a lot in terms of ARC tuning and other such
fun, in an environment where we have probably 10,000 threads competing -
6 ZFS pools and a pile of Java applications. The L2ARC gets thrashed.
The Cheetahs get pounded. 

So far, I have yet to have a single error pop up with respect to drive
timeouts or other such fun that you might expect from SAS/SATA
conflicts. They are amazingly solid, given they're built out of
commodity parts and running stock Sol10U9. 

--

Note however that the above is using the LSI 2008-based controllers with
the Phase7 firmware. Phase4 and Phase5 were abysmal and put me through
much pain and grief. If you haven't flashed your controllers, do
yourself a favor and do so. 

Note that I use the retail box LSI cards, not the SMC cards. I'm sure
SMC is less than best pleased by that, but I'd rather be able to talk to
LSI directly in case of issues, and so far LSI support's been pretty
decent, to the extent that I've even needed it. (SMC consoles themselves
by selling me piles of chassis, motherboards, and disks.)

-

Note that, if you're using SATA drives on the SAS backplanes, the SATA
drives will always attach only to the primary controller - so buying a
dual-chip backplane does you zip there; your SATA drives WILL NOT
FAIL OVER. That's just how it is. Distribute your SATA drives across the
backplanes and controllers, assume

Re: [zfs-discuss] SATA disk perf question

2011-06-03 Thread Eric D. Mudama

On Thu, Jun  2 at 20:49, Erik Trimble wrote:
> Nope. In terms of actual, obtainable IOPS, a 7200RPM drive isn't
> going to be able to do more than 200 under ideal conditions, and
> should be able to manage 50 under anything other than the
> pedantically worst-case situation. That's only about a 50% deviation,
> not like an order of magnitude or so.


Most cache-enabled 7200RPM drives can do 20K+ sequential IOPS at small
block sizes, up close to their peak transfer rate.  


For random IO, I typically see 80 IOPS for unqueued reads, 120 for
queued reads/writes with cache disabled, and maybe 150-200 for
cache-enabled writes.  The above are all full-stroke, so the average seek is
1/3 stroke (unqueued).  On a smaller data set where the drive dwarfs
the data set, average seek distance is much shorter and the resulting
IOPS can be quite a bit higher.
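
For reference, those numbers fall straight out of seek time plus
half-rotation latency. A back-of-envelope sketch (the seek times are
illustrative, not measured from any particular drive):

rpm = 7200
rot_latency = (60.0 / rpm) / 2      # ~4.17 ms average rotational latency

def iops(avg_seek_s):
    return 1.0 / (avg_seek_s + rot_latency)

print(iops(8.5e-3))   # ~79: typical 1/3-stroke average seek, cf. the ~80 above
print(iops(2.0e-3))   # ~162: short seeks when the data set is small and dense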

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] SATA disk perf question

2011-06-03 Thread Eric Sproul
On Fri, Jun 3, 2011 at 11:22 AM, Paul Kraus  wrote:
> So is there a way to read these real I/Ops numbers?
>
> iostat is reporting 600-800 I/Ops peak (1 second sample) for these
> 7200 RPM SATA drives. If the drives are doing aggregation, then how do I
> tell what is really going on?

I've always assumed that crazy high IOPS numbers on 7.2k drives mean
I'm seeing the individual drive caches absorbing those writes.  That's
the first place those writes will "land" when coming in from the disk
controller.  As other posters have said, after that the drive may
internally reorder and/or aggregate those writes before sending them
to the platter.
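
A rough way to picture the scale (the coalescing factor here is
invented, purely for illustration):

observed_iops = 700    # completions iostat reports at the drive interface
coalesce = 8           # hypothetical adjacent writes merged per platter op
print(observed_iops / coalesce)   # ~88 platter writes/s -- mechanically plausible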

Eric


Re: [zfs-discuss] SATA disk perf question

2011-06-03 Thread Paul Kraus
On Thu, Jun 2, 2011 at 11:49 PM, Erik Trimble  wrote:
> On 6/2/2011 5:12 PM, Jens Elkner wrote:
>>
>> On Wed, Jun 01, 2011 at 06:17:08PM -0700, Erik Trimble wrote:
>>>
>>> On Wed, 2011-06-01 at 12:54 -0400, Paul Kraus wrote:
>>
>>> Here's how you calculate (average) how long a random IOPs takes:
>>> seek time + ((60 / RPMs) / 2)
>>>
>>> A truly sequential IOPs is:
>>> ((60 / RPMs) / 2)
>>>
>>> For that series of drives, seek time averages 8.5ms (per Seagate).
>>> So, you get
>>>
>>> 1 Random IOPs takes [8.5ms + 4.13ms] = 12.6ms, which translates to 78
>>> IOPS
>>> 1 Sequential IOPs takes 4.13ms, which gives 120 IOPS.
>>>
>>> Note that due to averaging, the above numbers may be slightly higher or
>>> lower for any actual workload.
>>
>> Nahh, shouldn't it read "numbers may be _significant_ higher or lower"
>> ...? ;-)
>>
>> Regards,
>> jel.
>
> Nope. In terms of actual, obtainable IOPS, a 7200RPM drive isn't going to be
> able to do more than 200 under ideal conditions, and should be able to
> manage 50 under anything other than the pedantically worst-case situation.
> That's only about a 50% deviation, not like an order of magnitude or so.

So is there a way to read these real I/Ops numbers?

iostat is reporting 600-800 I/Ops peak (1 second sample) for these
7200 RPM SATA drives. If the drives are doing aggregation, then how do I
tell what is really going on?

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players


Re: [zfs-discuss] Metadata (DDT) Cache Bias

2011-06-03 Thread Roch

Edward Ned Harvey writes:
 > Based on observed behavior measuring performance of dedup, I would say, some
 > chunk of data and its associated metadata seem to have approximately the same
 > "warmness" in the cache.  So when the data gets evicted, the associated
 > metadata tends to be evicted too.  So whenever you have a cache miss,
 > instead of needing to fetch 1 thing from disk (the data) you need to fetch N
 > things from disk (data + the metadata.)
 > 
 >  
 > 
 > I would say, simply giving bias to the metadata would be useful.  So the
 > metadata would tend to stay in cache, even when the data itself is evicted.
 > Useful because the metadata is so *darn* small by comparison with the actual
 > data...  It carries a relatively small footprint in RAM, but upon cache
 > miss, it more than doubles the disk fetch penalty.
 > 
 >  
 > 
 > If you consider the extreme bias...  If the system would never give up
 > metadata in cache until all the cached data were gone...  Then it would be
 > similar to the current primarycache=metadata, except that the system would
 > be willing to cache data too, whenever there was available cache otherwise
 > going to waste.
 > 
 >  

Interesting. Now consider this :

We have an indirect block in memory (those are 16K
referencing 128 individual data blocks). We also have an
unrelated data block, say 16K. Neither is currently being
referenced, nor has it been for a long time (otherwise it
would have moved up to the head of the cache lists).  They reach the
tail of the primary cache together. I have room for one of
them in the secondary cache. 

Absent other information, do we think that the indirect
block is more valuable than the data block ? At first I also
wanted to say that metadata should be favored. Now I can't come
up with an argument to favor either one. Therefore I think
we need to include more information than just data vs
metadata in the decision process.

Instant Poll : Yes/No ?

-r
