Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread johansen-osdev
> Basically speaking - there needs to be some sort of strategy for
> bypassing the ARC or even parts of the ARC for applications that
> may need to advise the filesystem of either:
> 1) the delicate nature of imposing additional buffering for their
> data flow
> 2) already well optimized applications that need more adaptive
> cache in the application instead of the underlying filesystem or
> volume manager

This advice can't be sensibly delivered to ZFS via a Direct I/O
mechanism.  Anton's characterization of Direct I/O as, "an optimization
which allows data to be transferred directly between user data buffers
and disk, without a memory-to-memory copy," is concise and accurate.
Trying to intuit advice from this is unlikely to be useful.  It would be
better to develop a separate mechanism for delivering advice about the
application to the filesystem.  (fadvise, perhaps?)

A DIO implementation for ZFS is more complicated than the one in UFS,
and it adversely impacts well-optimized applications.

I looked into this late last year when we had a customer who was
suffering from too much bcopy overhead.  Billm found another workaround
instead of bypassing the ARC.

The challenge for implementing DIO for ZFS is in dealing with access to
the pages mapped by the user application.  Since ZFS has to checksum all
of its data, the user's pages that are involved in the direct I/O cannot
be written to by another thread during the I/O.  If this policy isn't
enforced, it is possible for the data written to or read from disk to
differ from its checksum.

In order to protect the user pages while a DIO is in progress, we want
support from the VM that isn't presently implemented.  To prevent a page
from being accessed by another thread, we have to unmap the TLB/PTE
entries and lock the page.  There's a cost associated with this, as it
may be necessary to cross-call other CPUs.  Any thread that accesses the
locked pages will block.  While it's possible to lock pages in the VM
today, there isn't a neat set of interfaces the filesystem can use to
maintain the integrity of the user's buffers.  Without an experimental
prototype to verify the design, it's impossible to say whether the
overhead of manipulating the page permissions outweighs the cost of the
bcopy that bypassing the cache is supposed to save.

What do you see as potential use cases for ZFS Direct I/O?  I'm having a
hard time imagining a situation in which this would be useful to a
customer.  The application would probably have to be single-threaded,
and if not, it would have to be pretty careful about how its threads
access buffers involved in I/O.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread johansen-osdev
> Note also that for most applications, the size of their IO operations
> would often not match the current page size of the buffer, causing
> additional performance and scalability issues.

Thanks for mentioning this, I forgot about it.

Since ZFS's default block size is configured to be larger than a page,
the application would have to issue page-aligned, block-sized I/Os.
Anyone adjusting the block size would presumably be responsible for
ensuring that the new size is a multiple of the page size (if they want
Direct I/O to work, that is).
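
For example (the dataset name below is just a placeholder), you can
sanity-check the alignment with pagesize(1) and the recordsize property:

  pagesize                        # page size in bytes, e.g. 4096
  zfs get recordsize tank/fs      # should be a multiple of the page size
  zfs set recordsize=8k tank/fs   # 8K is a page multiple on 4K-page systems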

I believe UFS also has a similar requirement, but I've been wrong
before.

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-24 Thread johansen-osdev
> And this feature is independant on whether   or not the data  is
> DMA'ed straight into the user buffer.

I suppose so.  However, it seems like it would make more sense to
configure a dataset property that specifically describes the desired
caching policy.  When directio implies different semantics on different
filesystems, customers are going to get confused.

> The  other  feature,  is to  avoid a   bcopy by  DMAing full
> filesystem block reads straight into user buffer (and verify
> checksum after). The I/O is high latency, bcopy adds a small
> amount. The kernel memory can  be freed/reuse straight after
> the user read  completes. This is  where I ask, how much CPU
> is lost to the bcopy in workloads that benefit from DIO ?

Right, except that if we try to DMA into user buffers with ZFS, there
are a bunch of other things we need the VM to do on our behalf to
protect the integrity of the kernel data that's living in user pages.
Assume you have a high-latency I/O and you've locked some user pages for
it.  In a pathological case, another thread that tries to access the
locked pages also blocks, and it stays blocked for the duration of the
first thread's I/O.  At that point, it seems like it might be easier to
accept the cost of the bcopy instead of blocking another thread.

I'm not even sure how to assess the impact of VM operations required to
change the permissions on the pages before we start the I/O.

> The quickest return on  investement  I see for  the  directio
> hint would be to tell ZFS to not grow the ARC when servicing
> such requests.

Perhaps if we had an option that specifies not to cache data from a
particular dataset, that would suffice.  I think you've filed a CR along
those lines already (6429855)?
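
Purely to illustrate the sort of knob I have in mind (the property name
and value below are hypothetical, not an existing ZFS property in this
release; later ZFS releases added primarycache/secondarycache properties
in roughly this spirit):

  zfs set cachepolicy=none tank/nocache   # hypothetical per-dataset cache policy
  zfs get cachepolicy tank/nocache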

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS multi-threading

2007-02-08 Thread johansen-osdev
> Would the logic behind ZFS take full advantage of a heavily multicored
> system, such as on the Sun Niagara platform? Would it utilize of the
> 32 concurrent threads for generating its checksums? Has anyone
> compared ZFS on a Sun Tx000, to that of a 2-4 thread x64 machine?

Pete and I are working on resolving ZFS scalability issues with Niagara and
StarCat right now.  I'm not sure if any official numbers about ZFS
performance on Niagara have been published.

As far as concurrent threads generating checksums goes, the system
doesn't work quite the way you have postulated.  The checksum is
generated in the ZIO_STAGE_CHECKSUM_GENERATE pipeline stage for writes,
and verified in the ZIO_STAGE_CHECKSUM_VERIFY pipeline stage for reads.
Whichever thread happens to advance the pipeline to the checksum
generate stage is the thread that will actually perform the work.  ZFS
does not
break the work of the checksum into chunks and have multiple CPUs
perform the computation.  However, it is possible to have concurrent
writes simultaneously in the checksum_generate stage.

More details about this can be found in zfs/zio.c and zfs/sys/zio_impl.h
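
If you want to watch this on a live system, here's a rough DTrace sketch
(it assumes the stage handler is a traceable function named
zio_checksum_generate() in your build; adjust the probe name if fbt
doesn't find it):

  # Count checksum-generate stage executions per CPU over 30 seconds
  dtrace -n 'fbt::zio_checksum_generate:entry { @[cpu] = count(); } tick-30s { exit(0); }'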

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] understanding zfs/thunoer "bottlenecks"?

2007-02-27 Thread johansen-osdev
> it seems there isn't an algorithm in ZFS that detects sequential write
> in traditional fs such as ufs, one would trigger directio.

There is no directio for ZFS.  Are you encountering a situation in which
you believe directio support would improve performance?  If so, please
explain.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
This seems a bit strange.  What's the workload, and also, what's the
output for:

> ARC_mru::print size lsize
> ARC_mfu::print size lsize
and
> ARC_anon::print size

For obvious reasons, the ARC can't evict buffers that are in use.
Buffers that are available to be evicted should be on the mru or mfu
list, so this output should be instructive.
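
If it's easier to capture the output in a script, the same commands can
be run non-interactively, e.g.:

  echo "ARC_mru::print size lsize" | mdb -k
  echo "ARC_mfu::print size lsize" | mdb -k
  echo "ARC_anon::print size" | mdb -k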

-j

On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> 
> FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
> 
> 
> > arc::print -tad
> {
> . . .
>c02e29e8 uint64_t size = 0t10527883264
>c02e29f0 uint64_t p = 0t16381819904
>c02e29f8 uint64_t c = 0t1070318720
>c02e2a00 uint64_t c_min = 0t1070318720
>c02e2a08 uint64_t c_max = 0t1070318720
> . . .
> 
> Perhaps c_max does not do what I think it does?
> 
> Thanks,
> /jim
> 
> 
> Jim Mauro wrote:
> >Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> >(update 3). All file IO is mmap(file), read memory segment, unmap, close.
> >
> >Tweaked the arc size down via mdb to 1GB. I used that value because
> >c_min was also 1GB, and I was not sure if c_max could be larger than
> >c_minAnyway, I set c_max to 1GB.
> >
> >After a workload run:
> >> arc::print -tad
> >{
> >. . .
> >  c02e29e8 uint64_t size = 0t3099832832
> >  c02e29f0 uint64_t p = 0t16540761088
> >  c02e29f8 uint64_t c = 0t1070318720
> >  c02e2a00 uint64_t c_min = 0t1070318720
> >  c02e2a08 uint64_t c_max = 0t1070318720
> >. . .
> >
> >"size" is at 3GB, with c_max at 1GB.
> >
> >What gives? I'm looking at the code now, but was under the impression
> >c_max would limit ARC growth. Granted, it's not a factor of 10, and
> >it's certainly much better than the out-of-the-box growth to 24GB
> >(this is a 32GB x4500), so clearly ARC growth is being limited, but it
> >still grew to 3X c_max.
> >
> >Thanks,
> >/jim
> >___
> >zfs-discuss mailing list
> >zfs-discuss@opensolaris.org
> >http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
Gar.  This isn't what I was hoping to see.  Buffers that aren't
available for eviction aren't listed in the lsize count.  It looks like
the MRU has grown to 10GB and most of it could be successfully
evicted.

The calculation for determining if we evict from the MRU is in
arc_adjust() and looks something like:

top_sz = ARC_anon.size + ARC_mru.size

Then, if top_sz > arc.p and ARC_mru.lsize > 0, we evict the smaller of
ARC_mru.lsize and top_sz - arc.p.

In your previous message it looks like arc.p is > (ARC_mru.size +
ARC_anon.size).  It might make sense to double-check these numbers
together, so when you check the size and lsize again, also check arc.p.
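
Something like this here-document grabs all of them in one pass (run as
root):

mdb -k <<EOF
ARC_anon::print -d size
ARC_mru::print -d size lsize
arc::print -d p c
EOF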

How/when did you configure arc_c_max?  arc.p is supposed to be
initialized to half of arc.c.  Also, I assume that there's a reliable
test case for reproducing this problem?

Thanks,

-j

On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
> 
> 
> > ARC_mru::print -d size lsize
> size = 0t10224433152
> lsize = 0t10218960896
> > ARC_mfu::print -d size lsize
> size = 0t303450112
> lsize = 0t289998848
> > ARC_anon::print -d size
> size = 0
> >
> 
> So it looks like the MRU is running at 10GB...
> 
> What does this tell us?
> 
> Thanks,
> /jim
> 
> 
> 
> [EMAIL PROTECTED] wrote:
> >This seems a bit strange.  What's the workload, and also, what's the
> >output for:
> >
> >  
> >>ARC_mru::print size lsize
> >>ARC_mfu::print size lsize
> >>
> >and
> >  
> >>ARC_anon::print size
> >>
> >
> >For obvious reasons, the ARC can't evict buffers that are in use.
> >Buffers that are available to be evicted should be on the mru or mfu
> >list, so this output should be instructive.
> >
> >-j
> >
> >On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> >  
> >>FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
> >>
> >>
> >>
> >>>arc::print -tad
> >>>  
> >>{
> >>. . .
> >>   c02e29e8 uint64_t size = 0t10527883264
> >>   c02e29f0 uint64_t p = 0t16381819904
> >>   c02e29f8 uint64_t c = 0t1070318720
> >>   c02e2a00 uint64_t c_min = 0t1070318720
> >>   c02e2a08 uint64_t c_max = 0t1070318720
> >>. . .
> >>
> >>Perhaps c_max does not do what I think it does?
> >>
> >>Thanks,
> >>/jim
> >>
> >>
> >>Jim Mauro wrote:
> >>
> >>>Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> >>>(update 3). All file IO is mmap(file), read memory segment, unmap, close.
> >>>
> >>>Tweaked the arc size down via mdb to 1GB. I used that value because
> >>>c_min was also 1GB, and I was not sure if c_max could be larger than
> >>>c_minAnyway, I set c_max to 1GB.
> >>>
> >>>After a workload run:
> >>>  
> arc::print -tad
> 
> >>>{
> >>>. . .
> >>> c02e29e8 uint64_t size = 0t3099832832
> >>> c02e29f0 uint64_t p = 0t16540761088
> >>> c02e29f8 uint64_t c = 0t1070318720
> >>> c02e2a00 uint64_t c_min = 0t1070318720
> >>> c02e2a08 uint64_t c_max = 0t1070318720
> >>>. . .
> >>>
> >>>"size" is at 3GB, with c_max at 1GB.
> >>>
> >>>What gives? I'm looking at the code now, but was under the impression
> >>>c_max would limit ARC growth. Granted, it's not a factor of 10, and
> >>>it's certainly much better than the out-of-the-box growth to 24GB
> >>>(this is a 32GB x4500), so clearly ARC growth is being limited, but it
> >>>still grew to 3X c_max.
> >>>
> >>>Thanks,
> >>>/jim
> >>>___
> >>>zfs-discuss mailing list
> >>>zfs-discuss@opensolaris.org
> >>>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >>>  
> >>___
> >>zfs-discuss mailing list
> >>zfs-discuss@opensolaris.org
> >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
Something else to consider: depending upon how you set arc_c_max, you
may just want to set arc_c and arc_p at the same time.  If you set
arc_c_max, then set arc_c to arc_c_max, and then set arc_p to
arc_c / 2, do you still get this problem?
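
For the mdb route, one way to do it is to reuse the member addresses
from your earlier arc::print -tad output (c02e2a08 is c_max, c02e29f8
is c, and c02e29f0 is p in that output; the addresses are specific to
that boot, mdb must be started with -kw for writes to work, and /Z
writes a 64-bit value -- 0x40000000 is 1GB, 0x20000000 is 512MB):

# mdb -kw
> c02e2a08/Z 0x40000000
> c02e29f8/Z 0x40000000
> c02e29f0/Z 0x20000000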

-j

On Thu, Mar 15, 2007 at 05:18:12PM -0700, [EMAIL PROTECTED] wrote:
> Gar.  This isn't what I was hoping to see.  Buffers that aren't
> available for eviction aren't listed in the lsize count.  It looks like
> the MRU has grown to 10Gb and most of this could be successfully
> evicted.
> 
> The calculation for determining if we evict from the MRU is in
> arc_adjust() and looks something like:
> 
> top_sz = ARC_anon.size + ARC_mru.size
> 
> Then if top_sz > arc.p and ARC_mru.lsize > 0 we evict the smaller of
> ARC_mru.lsize and top_size - arc.p
> 
> In your previous message it looks like arc.p is > (ARC_mru.size +
> ARC_anon.size).  It might make sense to double-check these numbers
> together, so when you check the size and lsize again, also check arc.p.
> 
> How/when did you configure arc_c_max?  arc.p is supposed to be
> initialized to half of arc.c.  Also, I assume that there's a reliable
> test case for reproducing this problem?
> 
> Thanks,
> 
> -j
> 
> On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
> > 
> > 
> > > ARC_mru::print -d size lsize
> > size = 0t10224433152
> > lsize = 0t10218960896
> > > ARC_mfu::print -d size lsize
> > size = 0t303450112
> > lsize = 0t289998848
> > > ARC_anon::print -d size
> > size = 0
> > >
> > 
> > So it looks like the MRU is running at 10GB...
> > 
> > What does this tell us?
> > 
> > Thanks,
> > /jim
> > 
> > 
> > 
> > [EMAIL PROTECTED] wrote:
> > >This seems a bit strange.  What's the workload, and also, what's the
> > >output for:
> > >
> > >  
> > >>ARC_mru::print size lsize
> > >>ARC_mfu::print size lsize
> > >>
> > >and
> > >  
> > >>ARC_anon::print size
> > >>
> > >
> > >For obvious reasons, the ARC can't evict buffers that are in use.
> > >Buffers that are available to be evicted should be on the mru or mfu
> > >list, so this output should be instructive.
> > >
> > >-j
> > >
> > >On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> > >  
> > >>FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
> > >>
> > >>
> > >>
> > >>>arc::print -tad
> > >>>  
> > >>{
> > >>. . .
> > >>   c02e29e8 uint64_t size = 0t10527883264
> > >>   c02e29f0 uint64_t p = 0t16381819904
> > >>   c02e29f8 uint64_t c = 0t1070318720
> > >>   c02e2a00 uint64_t c_min = 0t1070318720
> > >>   c02e2a08 uint64_t c_max = 0t1070318720
> > >>. . .
> > >>
> > >>Perhaps c_max does not do what I think it does?
> > >>
> > >>Thanks,
> > >>/jim
> > >>
> > >>
> > >>Jim Mauro wrote:
> > >>
> > >>>Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> > >>>(update 3). All file IO is mmap(file), read memory segment, unmap, close.
> > >>>
> > >>>Tweaked the arc size down via mdb to 1GB. I used that value because
> > >>>c_min was also 1GB, and I was not sure if c_max could be larger than
> > >>>c_minAnyway, I set c_max to 1GB.
> > >>>
> > >>>After a workload run:
> > >>>  
> > arc::print -tad
> > 
> > >>>{
> > >>>. . .
> > >>> c02e29e8 uint64_t size = 0t3099832832
> > >>> c02e29f0 uint64_t p = 0t16540761088
> > >>> c02e29f8 uint64_t c = 0t1070318720
> > >>> c02e2a00 uint64_t c_min = 0t1070318720
> > >>> c02e2a08 uint64_t c_max = 0t1070318720
> > >>>. . .
> > >>>
> > >>>"size" is at 3GB, with c_max at 1GB.
> > >>>
> > >>>What gives? I'm looking at the code now, but was under the impression
> > >>>c_max would limit ARC growth. Granted, it's not a factor of 10, and
> > >>>it's certainly much better than the out-of-the-box growth to 24GB
> > >>>(this is a 32GB x4500), so clearly ARC growth is being limited, but it
> > >>>still grew to 3X c_max.
> > >>>
> > >>>Thanks,
> > >>>/jim
> > >>>___
> > >>>zfs-discuss mailing list
> > >>>zfs-discuss@opensolaris.org
> > >>>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> > >>>  
> > >>___
> > >>zfs-discuss mailing list
> > >>zfs-discuss@opensolaris.org
> > >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> > >>
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
I suppose I should have been more direct about my last point.  If
arc_c_max isn't set in /etc/system, I don't believe that the ARC will
initialize arc.p to the correct value.  I could be wrong about this;
however, the next time you set c_max, also set c to the same value as
c_max and set p to half of c.  Let me know whether this addresses the
problem.
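
For the record, the persistent way to do the cap is an /etc/system entry
(assuming your build has the zfs_arc_max tunable; it takes effect at the
next boot):

* Cap the ARC at 1GB
set zfs:zfs_arc_max=0x40000000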

-j

> >How/when did you configure arc_c_max?  
> Immediately following a reboot, I set arc.c_max using mdb,
> then verified reading the arc structure again.
> >arc.p is supposed to be
> >initialized to half of arc.c.  Also, I assume that there's a reliable
> >test case for reproducing this problem?
> >  
> Yep. I'm using a x4500 in-house to sort out performance of a customer test
> case that uses mmap. We acquired the new DIMMs to bring the
> x4500 to 32GB, since the workload has a 64GB working set size,
> and we were clobbering a 16GB thumper. We wanted to see how doubling
> memory may help.
> 
> I'm trying clamp the ARC size because for mmap-intensive workloads,
> it seems to hurt more than help (although, based on experiments up to this
> point, it's not hurting a lot).
> 
> I'll do another reboot, and run it all down for you serially...
> 
> /jim
> 
> >Thanks,
> >
> >-j
> >
> >On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
> >  
> >>
> >>>ARC_mru::print -d size lsize
> >>>  
> >>size = 0t10224433152
> >>lsize = 0t10218960896
> >>
> >>>ARC_mfu::print -d size lsize
> >>>  
> >>size = 0t303450112
> >>lsize = 0t289998848
> >>
> >>>ARC_anon::print -d size
> >>>  
> >>size = 0
> >>
> >>So it looks like the MRU is running at 10GB...
> >>
> >>What does this tell us?
> >>
> >>Thanks,
> >>/jim
> >>
> >>
> >>
> >>[EMAIL PROTECTED] wrote:
> >>
> >>>This seems a bit strange.  What's the workload, and also, what's the
> >>>output for:
> >>>
> >>> 
> >>>  
> ARC_mru::print size lsize
> ARC_mfu::print size lsize
>    
> 
> >>>and
> >>> 
> >>>  
> ARC_anon::print size
>    
> 
> >>>For obvious reasons, the ARC can't evict buffers that are in use.
> >>>Buffers that are available to be evicted should be on the mru or mfu
> >>>list, so this output should be instructive.
> >>>
> >>>-j
> >>>
> >>>On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> >>> 
> >>>  
> FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
> 
> 
>    
> 
> >arc::print -tad
> > 
> >  
> {
> . . .
>   c02e29e8 uint64_t size = 0t10527883264
>   c02e29f0 uint64_t p = 0t16381819904
>   c02e29f8 uint64_t c = 0t1070318720
>   c02e2a00 uint64_t c_min = 0t1070318720
>   c02e2a08 uint64_t c_max = 0t1070318720
> . . .
> 
> Perhaps c_max does not do what I think it does?
> 
> Thanks,
> /jim
> 
> 
> Jim Mauro wrote:
>    
> 
> >Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> >(update 3). All file IO is mmap(file), read memory segment, unmap, 
> >close.
> >
> >Tweaked the arc size down via mdb to 1GB. I used that value because
> >c_min was also 1GB, and I was not sure if c_max could be larger than
> >c_minAnyway, I set c_max to 1GB.
> >
> >After a workload run:
> > 
> >  
> >>arc::print -tad
> >>   
> >>
> >{
> >. . .
> >c02e29e8 uint64_t size = 0t3099832832
> >c02e29f0 uint64_t p = 0t16540761088
> >c02e29f8 uint64_t c = 0t1070318720
> >c02e2a00 uint64_t c_min = 0t1070318720
> >c02e2a08 uint64_t c_max = 0t1070318720
> >. . .
> >
> >"size" is at 3GB, with c_max at 1GB.
> >
> >What gives? I'm looking at the code now, but was under the impression
> >c_max would limit ARC growth. Granted, it's not a factor of 10, and
> >it's certainly much better than the out-of-the-box growth to 24GB
> >(this is a 32GB x4500), so clearly ARC growth is being limited, but it
> >still grew to 3X c_max.
> >
> >Thanks,
> >/jim
> >___
> >zfs-discuss mailing list
> >zfs-discuss@opensolaris.org
> >http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> > 
> >  
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>    
> 
> >>___
> >>zfs-discuss mailing list
> >>zfs-discuss@opensolaris.org
> >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___

Re: [zfs-discuss] Re: C'mon ARC, stay small...

2007-03-16 Thread johansen-osdev
> I've been seeing this failure to cap on a number of (Solaris 10 update
> 2 and 3) machines since the script came out (arc hogging is a huge
> problem for me, esp on Oracle). This is probably a red herring, but my
> v490 testbed seemed to actually cap on 3 separate tests, but my t2000
> testbed doesn't even pretend to cap - kernel memory (as identified in
> Orca) sails right to the top, leaves me maybe 2GB free on a 32GB
> machine and shoves Oracle data into swap. 

What method are you using to cap this memory?  Jim and I just discussed
the required steps for doing this by hand using MDB.

> This isn't as amusing as one Stage and one Production Oracle machine
> which have 128GB and 96GB respectively. Sending in 92GB core dumps to
> support is an impressive gesture taking 2-3 days to complete.

This is solved by CR 4894692, which is in snv_56 and s10u4.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bottlenecks in building a system

2007-04-18 Thread johansen-osdev
Adam:

> Does anyone have a clue as to where the bottlenecks are going to be with 
> this:
> 
> 16x hot swap SATAII hard drives (plus an internal boot drive)
> Tyan S2895 (K8WE) motherboard
> Dual GigE (integral nVidia ports)
> 2x Areca 8-port PCIe (8-lane) RAID drivers
> 2x AMD Opteron 275 CPUs (2.2GHz, dual core)
> 8 GiB RAM

> The supplier is used to shipping Linux servers in this 3U chassis, but 
> hasn't dealt with Solaris. He originally suggested 2GiB RAM, but I hear 
> things about ZFS getting RAM hungry after a while.

ZFS is opportunistic when it comes to using free memory for caching.
I'm not sure what exactly you've heard.

> I guess my questions are:
> - Does anyone out there have a clue where the potential bottlenecks 
> might be?

What's your workload?  Bart is subscribed to this list, but he has a
famous saying, "One experiment is worth a thousand expert opinions."

Without knowing what you're trying to do with this box, it's going to be
hard to offer any useful advice.  However, you'll learn the most by
getting one of these boxes and running your workload.  If you have
problems, Solaris has a lot of tools that we can use to diagnose the
problem.  Then we can improve the performance and everybody wins.

> - If I focused on simple streaming IO, would giving the server less RAM 
> have an impact on performance?

The more RAM you can give your box, the more of it ZFS will use for
caching.  If your workload doesn't benefit from caching, then the impact
on performance won't be large.  Could you be more specific about what
the filesystem's consumers are doing when they're performing "simple
streaming IO?"

> - I had assumed four cores would be better than the two faster (3.0GHz) 
> single-core processors the vendor originally suggested. Agree?

I suspect that this is correct.  ZFS does many steps in its I/O path
asynchronously and they execute in the context of different threads.
Four cores are probably better than two.  Of course experimentation
could prove me wrong here, too. :)

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bottlenecks in building a system

2007-04-20 Thread johansen-osdev
Adam:

> Hi, hope you don't mind if I make some portions of your email public in 
> a reply--I hadn't seen it come through on the list at all, so it's no 
> duplicate to me.

I don't mind at all.  I had hoped to avoid sending the list a duplicate
e-mail, although it looks like my first post never made it here.

> > I suspect that if you have a bottleneck in your system, it would be due
> > to the available bandwidth on the PCI bus.
> 
> Mm. yeah, it's what I was worried about, too (mostly through ignorance 
> of the issues), which is why I was hoping HyperTransport and PCIe were 
> going to give that data enough room on the bus.
> But after others expressed the opinion that the Areca PCIe cards were 
> overkill, I'm now looking to putting some PCI-X cards on a different 
> (probably slower) motherboard.

I dug up a copy of the S2895 block diagram and asked Bill Moore about
it.  He said that you should be able to get about 700 MB/s off of each
of the PCI-X channels and that you only need about 100 MB/s to saturate
a GigE link.  He also observed that the RAID card you were using was
unnecessary and would probably hamper performance.  He recommended
non-RAID SATA cards based upon the Marvell chipset.

Here's the e-mail trail on this list where he discusses Marvell SATA
cards in a bit more detail:

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

It sounds like if getting disk -> network is the concern, you'll have
plenty of bandwidth, assuming you have a reasonable controller card.

> > Caching isn't going to be a huge help for writes, unless there's another
> > thread reading simultaneoulsy from the same file.
> >
> > Prefetch will definitely use the additional RAM to try to boost the
> > performance of sequential reads.  However, in the interest of full
> > disclosure, there is a pathology that we've seen where the number of
> > sequential readers exceeds the available space in the cache.  In this
> > situation, sometimes the competeing prefetches for the different streams
> > will cause more temporally favorable data to be evicted from the cache
> > and performance will drop.  The workaround right now is just to disable
> > prefetch.  We're looking into more comprehensive solutions.
> 
> Interesting. So noted. I will expect to have to test thoroughly.

If you run across this problem and are willing to let me debug on your
system, shoot me an e-mail.  We've only seen this in a couple of
situations and it was combined with another problem where we were seeing
excessive overhead for kcopyout.  It's unlikely, but possible that you'll
hit this.

-K
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Help me understand ZFS caching

2007-04-20 Thread johansen-osdev
Tony:

> Now to another question related to Anton's post. You mention that
> directIO does not exist in ZFS at this point. Are their plan's to
> support DirectIO; any functionality that will simulate directIO or
> some other non-caching ability suitable for critical systems such as
> databases if the client still wanted to deploy on filesystems.

I would describe DirectIO as the ability to map the application's
buffers directly for disk DMAs.  You need to disable the filesystem's
cache to do this correctly.  Having the cache disabled is an
implementation requirement for this feature.

Based upon this definition, are you seeking the ability to disable the
filesystem's cache or the ability to directly map application buffers
for DMA?

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: gzip compression throttles system?

2007-05-03 Thread johansen-osdev
A couple more questions here.

[mpstat]

> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   00   0 3109  3616  316  1965   17   48   45   2450  85   0  15
>   10   0 3127  3797  592  2174   17   63   46   1760  84   0  15
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   00   0 3051  3529  277  2012   14   25   48   2160  83   0  17
>   10   0 3065  3739  606  1952   14   37   47   1530  82   0  17
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   00   0 3011  3538  316  2423   26   16   52   2020  81   0  19
>   10   0 3019  3698  578  2694   25   23   56   3090  83   0  17
> 
> # lockstat -kIW -D 20 sleep 30
> 
> Profiling interrupt: 6080 events in 31.341 seconds (194 events/sec)
> 
> Count indv cuml rcnt nsec Hottest CPU+PILCaller  
> ---
>  2068  34%  34% 0.00 1767 cpu[0] deflate_slow
>  1506  25%  59% 0.00 1721 cpu[1] longest_match   
>  1017  17%  76% 0.00 1833 cpu[1] mach_cpu_idle   
>   454   7%  83% 0.00 1539 cpu[0] fill_window 
>   215   4%  87% 0.00 1788 cpu[1] pqdownheap  


What do you have zfs compression set to?  The gzip level is tunable,
according to zfs set, anyway:

PROPERTY   EDIT  INHERIT   VALUES
compression YES  YES   on | off | lzjb | gzip | gzip-[1-9]
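
To check or change it (the dataset name here is just a placeholder):

  zfs get compression tank/fs
  zfs set compression=gzip-6 tank/fs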

You still have idle time in this lockstat (and mpstat).

What do you get for a lockstat -A -D 20 sleep 30?

Do you see anyone with long lock hold times, long sleeps, or excessive
spinning?

The largest numbers from mpstat are for interrupts and cross calls.
What does intrstat(1M) show?

Have you run dtrace to determine the most frequent cross-callers?

#!/usr/sbin/dtrace -s

/* Aggregate cross-calls by the kernel stack that issued them. */
sysinfo:::xcalls
{
        @a[stack(30)] = count();
}

/* On exit, keep the 30 most frequent stacks; the aggregation prints automatically. */
END
{
        trunc(@a, 30);
}

is an easy way to do this.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-14 Thread johansen-osdev
This certainly isn't the case on my machine.

$ /usr/bin/time dd if=/test/filebench/largefile2 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real        1.3
user        0.0
sys         1.2

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       22.3
user        0.0
sys         2.2

This looks like 56 MB/s on the /dev/dsk and 961 MB/s on the pool.

My pool is configured into a 46 disk RAID-0 stripe.  I'm going to omit
the zpool status output for the sake of brevity.

> What I am seeing is that ZFS performance for sequential access is
> about 45% of raw disk access, while UFS (as well as ext3 on Linux) is
> around 70%. For workload consisting mostly of reading large files
> sequentially, it would seem then that ZFS is the wrong tool
> performance-wise. But, it could be just my setup, so I would
> appreciate more data points.

This isn't what we've observed in much of our performance testing.
It may be a problem with your config, although I'm not an expert on
storage configurations.  Would you mind providing more details about
your controller, disks, and machine setup?

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-14 Thread johansen-osdev
Marko,

I tried this experiment again using 1 disk and got nearly identical
times:

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       21.4
user        0.0
sys         2.4

$ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       21.0
user        0.0
sys         0.7


> [I]t is not possible for dd to meaningfully access multiple-disk
> configurations without going through the file system. I find it
> curious that there is such a large slowdown by going through file
> system (with single drive configuration), especially compared to UFS
> or ext3.

Comparing a filesystem to raw dd access isn't a completely fair
comparison either.  Few filesystems actually lay out all of their data
and metadata so that every read is a completely sequential read.

> I simply have a small SOHO server and I am trying to evaluate which OS to
> use to keep a redundant disk array. With unreliable consumer-level hardware,
> ZFS and the checksum feature are very interesting and the primary selling
> point compared to a Linux setup, for as long as ZFS can generate enough
> bandwidth from the drive array to saturate single gigabit ethernet.

I would take Bart's recommendation and go with Solaris on something like
a dual-core box with 4 disks.

> My hardware at the moment is the "wrong" choice for Solaris/ZFS - PCI 3114
> SATA controller on a 32-bit AthlonXP, according to many posts I found.

Bill Moore lists some controller recommendations here:

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

> However, since dd over raw disk is capable of extracting 75+MB/s from this
> setup, I keep feeling that surely I must be able to get at least that much
> from reading a pair of striped or mirrored ZFS drives. But I can't - single
> drive or 2-drive stripes or mirrors, I only get around 34MB/s going through
> ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.)

Maybe this is a problem with your controller?  What happens when you
have two simultaneous dd's to different disks running?  This would
simulate the case where you're reading from the two disks at the same
time.
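
Something along these lines would do it; the device names are
placeholders, so substitute two of the disks from your iostat output:

  /usr/bin/time dd if=/dev/dsk/c0d0 of=/dev/null bs=128k count=10000 &
  /usr/bin/time dd if=/dev/dsk/c0d1 of=/dev/null bs=128k count=10000 &
  wait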

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-15 Thread johansen-osdev
> Each drive is freshly formatted with one 2G file copied to it. 

How are you creating each of these files?

Also, would you please include the output from the isalist(1) command?

> These are snapshots of iostat -xnczpm 3 captured somewhere in the
> middle of the operation.

Have you double-checked that this isn't a measurement problem by
measuring ZFS with zpool iostat (see zpool(1M)) and verifying that the
outputs from both iostats match?
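
For example, running the two views side by side during the test (the
pool name is a placeholder) should report roughly the same bandwidth:

  zpool iostat -v tank 3
  iostat -xnz 3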

> single drive, zfs file
>r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>  258.30.0 33066.60.0 33.0  2.0  127.77.7 100 100 c0d1
> 
> Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s /
> r/s gives 256K, as I would imagine it should.

Not sure.  If we can figure out why ZFS is slower than raw disk access
in your case, it may explain why you're seeing these results.

> What if we read a UFS file from the PATA disk and ZFS from SATA:
>r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>  792.80.0 44092.90.0  0.0  1.80.02.2   1  98 c1d0
>  224.00.0 28675.20.0 33.0  2.0  147.38.9 100 100 c0d0
> 
> Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a
> number of times, not a fluke.

This could be cache interference.  ZFS and UFS use different caches.

How much memory is in this box?

> I have no idea what to make of all this, except that it ZFS has a problem
> with this hardware/drivers that UFS and other traditional file systems,
> don't. Is it a bug in the driver that ZFS is inadvertently exposing? A
> specific feature that ZFS assumes the hardware to have, but it doesn't? Who
> knows!

This may be a more complicated interaction than just ZFS and your
hardware.  There are a number of layers of drivers underneath ZFS that
may also be interacting with your hardware in an unfavorable way.

If you'd like to do a little poking with MDB, we can see the features
that your SATA disks claim they support.

As root, type mdb -k, and then at the ">" prompt that appears, enter the
following command (this is one very long line):

*sata_hba_list::list sata_hba_inst_t satahba_next | ::print sata_hba_inst_t 
satahba_dev_port | ::array void* 32 | ::print void* | ::grep ".!=0" | ::print 
sata_cport_info_t cport_devp.cport_sata_drive | ::print -a sata_drive_info_t 
satadrv_features_support satadrv_settings satadrv_features_enabled

This should show satadrv_features_support, satadrv_settings, and
satadrv_features_enabled for each SATA disk on the system.

The values for these variables are defined in:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/sata/impl/sata.h

this is the relevant snippet for interpreting these values:

/*
 * Device feature_support (satadrv_features_support)
 */
#define SATA_DEV_F_DMA                  0x01
#define SATA_DEV_F_LBA28                0x02
#define SATA_DEV_F_LBA48                0x04
#define SATA_DEV_F_NCQ                  0x08
#define SATA_DEV_F_SATA1                0x10
#define SATA_DEV_F_SATA2                0x20
#define SATA_DEV_F_TCQ                  0x40    /* Non NCQ tagged queuing */

/*
 * Device features enabled (satadrv_features_enabled)
 */
#define SATA_DEV_F_E_TAGGED_QING        0x01    /* Tagged queuing enabled */
#define SATA_DEV_F_E_UNTAGGED_QING      0x02    /* Untagged queuing enabled */

/*
 * Drive settings flags (satadrv_settings)
 */
#define SATA_DEV_READ_AHEAD             0x0001  /* Read Ahead enabled */
#define SATA_DEV_WRITE_CACHE            0x0002  /* Write cache ON */
#define SATA_DEV_SERIAL_FEATURES        0x8000  /* Serial ATA feat. enabled */
#define SATA_DEV_ASYNCH_NOTIFY          0x2000  /* Asynch-event enabled */

This may give us more information if this is indeed a problem with
hardware/drivers supporting the right features.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
> >*sata_hba_list::list sata_hba_inst_t satahba_next | ::print 
> >sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | 
> >::grep ".!=0" | ::print sata_cport_info_t cport_devp.cport_sata_drive | 
> >::print -a sata_drive_info_t satadrv_features_support satadrv_settings 
> >satadrv_features_enabled

> This gives me "mdb: failed to dereference symbol: unknown symbol
> name". 

You may not have the SATA module installed.  If you type:

::modinfo !  grep sata

and don't get any output, your sata driver is attached some other way.

My apologies for the confusion.

-K
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
At Matt's request, I did some further experiments and have found that
this appears to be particular to your hardware.  This is not a general
32-bit problem.  I re-ran this experiment on a 1-disk pool using a 32
and 64-bit kernel.  I got identical results:

64-bit
==

$ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       20.1
user        0.0
sys         1.2

62 MB/s

# /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       19.0
user        0.0
sys         2.6

65 MB/s

32-bit
==

$ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       20.1
user        0.0
sys         1.7

62 MB/s

# /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       19.1
user        0.0
sys         4.3

65 MB/s

-j

On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote:
> Marko Milisavljevic wrote:
> >now lets try:
> >set zfs:zfs_prefetch_disable=1
> >
> >bingo!
> >
> >   r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> > 609.00.0 77910.00.0  0.0  0.80.01.4   0  83 c0d0
> >
> >only 1-2 % slower then dd from /dev/dsk. Do you think this is general
> >32-bit problem, or specific to this combination of hardware?
> 
> I suspect that it's fairly generic, but more analysis will be necessary.
> 
> >Finally, should I file a bug somewhere regarding prefetch, or is this
> >a known issue?
> 
> It may be related to 6469558, but yes please do file another bug report. 
>  I'll have someone on the ZFS team take a look at it.
> 
> --matt
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
Marko,
Matt and I discussed this offline some more and he had a couple of ideas
about double-checking your hardware.

It looks like your controller (or disks, maybe?) is having trouble with
multiple simultaneous I/Os to the same disk.  It looks like prefetch
aggravates this problem.

When I asked Matt what we could do to verify that it's the number of
concurrent I/Os that is causing performance to be poor, he had the
following suggestions:

set zfs_vdev_{min,max}_pending=1 and run with prefetch on, then
iostat should show 1 outstanding io and perf should be good.

or turn prefetch off, and have multiple threads reading
concurrently, then iostat should show multiple outstanding ios
and perf should be bad.
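
For the first experiment, the tunables can go in /etc/system (assuming
your build still names them zfs_vdev_min_pending and
zfs_vdev_max_pending; reboot for them to take effect):

* Allow only one outstanding I/O per vdev
set zfs:zfs_vdev_min_pending=1
set zfs:zfs_vdev_max_pending=1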

Let me know if you have any additional questions.

-j

On Wed, May 16, 2007 at 11:38:24AM -0700, [EMAIL PROTECTED] wrote:
> At Matt's request, I did some further experiments and have found that
> this appears to be particular to your hardware.  This is not a general
> 32-bit problem.  I re-ran this experiment on a 1-disk pool using a 32
> and 64-bit kernel.  I got identical results:
> 
> 64-bit
> ==
> 
> $ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k
> count=1
> 1+0 records in
> 1+0 records out
> 
> real   20.1
> user0.0
> sys 1.2
> 
> 62 Mb/s
> 
> # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1
> 1+0 records in
> 1+0 records out
> 
> real   19.0
> user0.0
> sys 2.6
> 
> 65 Mb/s
> 
> 32-bit
> ==
> 
> /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k
> count=1
> 1+0 records in
> 1+0 records out
> 
> real   20.1
> user0.0
> sys 1.7
> 
> 62 Mb/s
> 
> # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1
> 1+0 records in
> 1+0 records out
> 
> real   19.1
> user0.0
> sys 4.3
> 
> 65 Mb/s
> 
> -j
> 
> On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote:
> > Marko Milisavljevic wrote:
> > >now lets try:
> > >set zfs:zfs_prefetch_disable=1
> > >
> > >bingo!
> > >
> > >   r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> > > 609.00.0 77910.00.0  0.0  0.80.01.4   0  83 c0d0
> > >
> > >only 1-2 % slower then dd from /dev/dsk. Do you think this is general
> > >32-bit problem, or specific to this combination of hardware?
> > 
> > I suspect that it's fairly generic, but more analysis will be necessary.
> > 
> > >Finally, should I file a bug somewhere regarding prefetch, or is this
> > >a known issue?
> > 
> > It may be related to 6469558, but yes please do file another bug report. 
> >  I'll have someone on the ZFS team take a look at it.
> > 
> > --matt
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: [storage-discuss] NCQ performance

2007-05-29 Thread johansen-osdev
> When sequential I/O is done to the disk directly there is no performance
> degradation at all.  

All filesystems impose some overhead compared to the rate of raw disk
I/O.  It's going to be hard to store data on a disk unless some kind of
filesystem is used.  All the tests that Eric and I have performed show
regressions for multiple sequential I/O streams.  If you have data that
shows otherwise, please feel free to share.

> [I]t does not take any additional time in ldi_strategy(),
> bdev_strategy(), mv_rw_dma_start().  In some instance it actually
> takes less time.   The only thing that sometimes takes additional time
> is waiting for the disk I/O.

Let's be precise about what was actually observed.  Eric and I saw
increased service times for the I/O on devices with NCQ enabled when
running multiple sequential I/O streams.  Everything that we observed
indicated that it actually took the disk longer to service requests when
many sequential I/Os were queued.

-j


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] si3124 controller problem and fix (fwd)

2007-06-07 Thread johansen-osdev
> it's been assigned CR 6566207 by Linda Bernal.  Basically, if you look 
> at si_intr and read the comments in the code, the bug is pretty 
> obvious.
>
> si3124 driver's interrupt routine is incorrectly coded.  The ddi_put32 
> that clears the interrupts should be enclosed in an "else" block, 
> thereby making it consistent with the comment just below.  Otherwise, 
> you would be double clearing the interrupts, thus losing pending 
> interrupts.
> 
> Since this is a simple fix, there's really no point dealing it as a 
> contributor.

The bug report for 6566207 states that the submitter is an OpenSolaris
contributor who wishes to work on the fix.  If this is not the case, we
should clarify this CR so it doesn't languish.  It's still sitting in
the dispatched state (hasn't been accepted by anyone).

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance and memory consumption

2007-07-06 Thread johansen-osdev
> But now I have another question.
> How 8k blocks will impact on performance ?

When tuning recordsize for things like databases, we try to recommend
that the customer's recordsize match the I/O size of the database
record.

I don't think that's the case in your situation.  ZFS is clever enough
that changes to recordsize only affect new blocks written to the
filesystem.  If you're seeing metaslab fragmentation problems now,
changing your recordsize to 8k is likely to increase your performance.
This is because you're out of 128k metaslabs, so using a smaller size
lets you make better use of the remaining space.  This also means you
won't have to iterate through all of the used 128k metaslabs looking for
a free one.
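
As a concrete sketch (the dataset name is a placeholder), the new size
only applies to blocks written after the property is changed:

  zfs set recordsize=8k tank/db
  zfs get recordsize tank/db   # existing blocks keep their old size until rewritten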

If you're asking, "How does setting the recordsize to 8k affect
performance when I'm not encountering fragmentation," I would guess
that there would be some reduction.  However, you can adjust the
recordsize once you encounter this problem with the default size.

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] si3124 controller problem and fix (fwd)

2007-07-17 Thread johansen-osdev
In an attempt to speed up progress on some of the si3124 bugs that Roger
reported, I've created a workspace with the fixes for:

   6565894 sata drives are not identified by si3124 driver
   6566207 si3124 driver loses interrupts.

I'm attaching a driver which contains these fixes as well as a diff of
the changes I used to produce them.

I don't have access to a si3124 chipset, unfortunately.

Would somebody be able to review these changes and try the new driver on
a si3124 card?

Thanks,

-j

On Tue, Jul 17, 2007 at 02:39:00AM -0700, Nigel Smith wrote:
> You can see the  status of bug here:
> 
> http://bugs.opensolaris.org/view_bug.do?bug_id=6566207
> 
> Unfortunately, it's showing no progress since 20th June.
> 
> This fix really could do to be in place for S10u4 and snv_70.
> Thanks
> Nigel Smith
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


si3124.tar.gz
Description: application/tar-gz

--- usr/src/uts/common/io/sata/adapters/si3124/si3124.c ---

Index: usr/src/uts/common/io/sata/adapters/si3124/si3124.c
--- /ws/onnv-clone/usr/src/uts/common/io/sata/adapters/si3124/si3124.c  Mon Nov 
13 23:20:01 2006
+++ 
/export/johansen/si-fixes/usr/src/uts/common/io/sata/adapters/si3124/si3124.c   
Tue Jul 17 14:37:17 2007
@@ -22,11 +22,11 @@
 /*
  * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
-#pragma ident  "@(#)si3124.c   1.4 06/11/14 SMI"
+#pragma ident  "@(#)si3124.c   1.5 07/07/17 SMI"
 
 
 
 /*
  * SiliconImage 3124/3132 sata controller driver
@@ -381,11 +381,11 @@
 
 extern struct mod_ops mod_driverops;
 
 static  struct modldrv modldrv = {
&mod_driverops, /* driverops */
-   "si3124 driver v1.4",
+   "si3124 driver v1.5",
&sictl_dev_ops, /* driver ops */
 };
 
 static  struct modlinkage modlinkage = {
MODREV_1,
@@ -2808,10 +2808,13 @@
si_portp = si_ctlp->sictl_ports[port];
mutex_enter(&si_portp->siport_mutex);
 
/* Clear Port Reset. */
ddi_put32(si_ctlp->sictl_port_acc_handle,
+   (uint32_t *)PORT_CONTROL_SET(si_ctlp, port),
+   PORT_CONTROL_SET_BITS_PORT_RESET);
+   ddi_put32(si_ctlp->sictl_port_acc_handle,
(uint32_t *)PORT_CONTROL_CLEAR(si_ctlp, port),
PORT_CONTROL_CLEAR_BITS_PORT_RESET);
 
/*
 * Arm the interrupts for: Cmd completion, Cmd error,
@@ -3509,16 +3512,16 @@
port);
 
if (port_intr_status & INTR_COMMAND_COMPLETE) {
(void) si_intr_command_complete(si_ctlp, si_portp,
port);
-   }
-
+   } else {
/* Clear the interrupts */
ddi_put32(si_ctlp->sictl_port_acc_handle,
(uint32_t *)(PORT_INTERRUPT_STATUS(si_ctlp, port)),
port_intr_status & INTR_MASK);
+   }
 
/*
 * Note that we did not clear the interrupt for command
 * completion interrupt. Reading of slot_status takes care
 * of clearing the interrupt for command completion case.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] is send/receive incremental

2007-08-08 Thread johansen-osdev
You can do it either way.  Eric Kustarz has a good explanation of how to
set up incremental send/receive on your laptop.  The description is on
his blog:

http://blogs.sun.com/erickustarz/date/20070612

The technique he uses is applicable to any ZFS filesystem.
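
A minimal sketch of the nightly cycle (pool, filesystem, and host names
are placeholders):

  # One-time full send to seed the remote copy
  zfs snapshot tank/data@mon
  zfs send tank/data@mon | ssh backuphost zfs receive backup/data

  # Each night afterwards: snapshot, then send only the changes
  zfs snapshot tank/data@tue
  zfs send -i tank/data@mon tank/data@tue | ssh backuphost zfs receive backup/data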

-j

On Wed, Aug 08, 2007 at 04:44:16PM -0600, Peter Baumgartner wrote:
> 
>I'd like to send a backup of my filesystem offsite nightly using zfs
>send/receive. Are those done incrementally so only changes move or
>would a full copy get shuttled across everytime?
>--
>Pete

> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely long creat64 latencies on higly utilized zpools

2007-08-15 Thread johansen-osdev
You might also consider taking a look at this thread:

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-July/041760.html

Although I'm not certain, this sounds a lot like the other pool
fragmentation issues.

-j

On Wed, Aug 15, 2007 at 01:11:40AM -0700, Yaniv Aknin wrote:
> Hello friends,
> 
> I've recently seen a strange phenomenon with ZFS on Solaris 10u3, and was 
> wondering if someone may have more information.
> 
> The system uses several zpools, each a bit under 10T, each containing one zfs 
> with lots and lots of small files (way too many, about 100m files and 75m 
> directories).
> 
> I have absolutely no control over the directory structure and believe me I 
> tried to change it.
> 
> Filesystem usage patterns are create and read, never delete and never rewrite.
> 
> When volumes approach 90% usage, and under medium/light load (zpool iostat 
> reports 50mb/s and 750iops reads), some creat64 system calls take over 50 
> seconds to complete (observed with 'truss -D touch'). When doing manual 
> tests, I've seen similar times on unlink() calls (truss -D rm). 
> 
> I'd like to stress this happens on /some/ of the calls, maybe every 100th 
> manual call (I scripted the test), which (along with normal system 
> operations) would probably be every 10,000th or 100,000th call.
> 
> Other system parameters (memory usage, loadavg, process number, etc) appear 
> nominal. The machine is an NFS server, though the crazy latencies were 
> observed both local and remote.
> 
> What would you suggest to further diagnose this? Has anyone seen trouble with 
> high utilization and medium load? (with or without insanely high filecount?)
> 
> Many thanks in advance,
>  - Yaniv
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/WAFL lawsuit

2007-09-06 Thread johansen-osdev
It's Columbia Pictures vs. Bunnell:

http://www.eff.org/legal/cases/torrentspy/columbia_v_bunnell_magistrate_order.pdf

The Register syndicated a Security Focus article that summarizes the
potential impact of the court decision:

http://www.theregister.co.uk/2007/08/08/litigation_data_retention/


-j

On Thu, Sep 06, 2007 at 08:14:56PM +0200, [EMAIL PROTECTED] wrote:
> 
> 
> >It really is a shot in the dark at this point, you really never know what
> >will happen in court (take the example of the recent court decision that
> >all data in RAM be held for discovery ?!WHAT, HEAD HURTS!?).  But at the
> >end of the day,  if you waited for a sure bet on any technology or
> >potential patent disputes you would not implement anything, ever.
> 
> 
> Do you have a reference for "all data in RAM most be held".  I guess we
> need to build COW RAM as well.
> 
> Casper
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: slow reads question...

2006-09-22 Thread johansen-osdev
Harley:

>I had tried other sizes with much the same results, but
> hadnt gone as large as 128K.  With bs=128K, it gets worse:
> 
> | # time dd if=zeros-10g of=/dev/null bs=128k count=102400
> | 81920+0 records in
> | 81920+0 records out
> | 
> | real2m19.023s
> | user0m0.105s
> | sys 0m8.514s

I may have done my math wrong, but if we assume that the real
time is the actual amount of time we spent performing the I/O (which may
be incorrect), haven't you done better here?

In this case you pushed 81920 128k records in ~139 seconds -- approx
75437 k/sec.

Using ZFS with 8k bs, you pushed 102400 8k records in ~68 seconds --
approx 12047 k/sec.

Using the raw device you pushed 102400 8k records in ~23 seconds --
approx 35617 k/sec.

I may have missed something here, but isn't this newest number the
highest performance so far?

What does iostat(1M) say about your disk read performance?
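
Something like the following, run while the dd is in flight, would show
per-device bandwidth and service times (-z suppresses idle devices):

  iostat -xnz 5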

>Is there any other info I can provide which would help?

Are you just trying to measure ZFS's read performance here?

It might be interesting to change your outfile (of) argument and see if
we're actually running into some other performance problem.  If you
change of=/tmp/zeros does performance improve or degrade?  Likewise, if
you write the file out to another disk (UFS, ZFS, whatever), does this
improve performance?

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: slow reads question...

2006-09-22 Thread johansen-osdev
Harley:

> Old 36GB drives:
> 
> | # time mkfile -v 1g zeros-1g
> | zeros-1g 1073741824 bytes
> | 
> | real2m31.991s
> | user0m0.007s
> | sys 0m0.923s
> 
> Newer 300GB drives:
> 
> | # time mkfile -v 1g zeros-1g
> | zeros-1g 1073741824 bytes
> | 
> | real0m8.425s
> | user0m0.010s
> | sys 0m1.809s

This is a pretty dramatic difference.  What type of drives were your old
36GB drives?

>I am wondering if there is something other than capacity
> and seek time which has changed between the drives.  Would a
> different scsi command set or features have this dramatic a
> difference?

I'm hardly the authority on hardware, but there are a couple of
possibilities.  Your newer drives may have a write cache.  It's also
quite likely that the newer drives have a faster rotational speed and
shorter seek times.

If you subtract the usr + sys time from the real time in these
measurements, I suspect the result is the amount of time you were
actually waiting for the I/O to finish.  In the first case that's
(151.991 - 0.930) / 151.991, so you spent about 99% of your total time
waiting for stuff to happen, whereas in the second case
(8.425 - 1.819) / 8.425 works out to only about 78% of your overall
time.
-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and savecore

2006-11-10 Thread johansen-osdev
This is CR 4894692 (caching data in heap inflates crash dump).

I have a fix which I am testing now.  It still needs review from
Matt/Mark before it's eligible for putback, though.

-j

On Fri, Nov 10, 2006 at 02:40:40PM -0800, Thomas Maier-Komor wrote:
> Hi, 
> 
> I'm not sure if this is the right forum, but I guess this topic will
> be bounced into the right direction from here.
> 
> With ZFS using as much physical memory as it can get, dumps and
> livedumps via 'savecore -L' are huge in size. I just tested it on my
> workstation and got a 1.8G vmcore file, when dumping only kernel
> pages. 
> 
> Might it be possible to add an extension that would make it possible,
> to support dumping without the whole ZFS cache? I guess this would
> make kernel live dumps smaller again, as they used to be...
> 
> Any comments?
> 
> Cheers,
> Tom
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-08 Thread johansen-osdev
> > Note that you'd actually have to verify that the blocks were the same;
> > you cannot count on the hash function.  If you didn't do this, anyone
> > discovering a collision could destroy the colliding blocks/files.
> 
> Given that nobody knows how to find sha256 collisions, you'd of course
> need to test this code with a weaker hash algorithm.
> 
> (It would almost be worth it to have the code panic in the event that a
> real sha256 collision was found)

The novel discovery of a sha256 collision will be lost on any
administrator whose system panics.  Imagine how much this will annoy the
first customer who accidentally discovers a reproducible test-case.
Perhaps generating an FMA error report would be more appropriate?

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread johansen-osdev
Robert:

> Better yet would be if memory consumed by ZFS for caching (dnodes,
> vnodes, data, ...) would behave similar to page cache like with UFS so
> applications will be able to get back almost all memory used for ZFS
> caches if needed.

I believe that a better response to memory pressure is a long-term goal
for ZFS.  There's also an effort in progress to improve the caching
algorithms used in the ARC.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss