Frank.Hofmann at Sun.COM wrote:
>
> Hi Moinak,
>
>
> final webrev ready ?
Not yet. I have been testing in all sorts of ways.
I have figured out a simple way to avoid any performance
degradation for pure random reads, by changing when
mutex_exit is called in the non-coalescing case to allow
greater concurrency.
However, after that change I was seeing an odd hang, but
only when booting a BeleniX 0.6.1 LiveCD image from a
DVD (not a CD) on a Ferrari 3400 (not on any other laptop
or desktop) or on MacOS X Parallels!
It turned out to be a starvation problem in the following
loop, which I have just resolved:
while (sema_tryp(&fio_done[count]) == 0) {
        hsched_invoke_strategy(fsp->hqueue);
}
The hang was caused by thread starvation:
Suppose there are two threads that have enqueued their
bufs. Each will call hsched_invoke_strategy repeatedly
till its bufs are processed.
Now it may sometimes happen that the threads process each
other's bufs. So there can be a situation where there are
two bufs in the queue and thread A picks up thread B's buf
while thread B picks up thread A's buf. Thread B issues
thread A's buf first and gets it back. Then thread A issues
thread B's buf and waits for it to come back. So the list
is now empty and thread B keeps on calling
hsched_invoke_strategy, which returns immediately due to
the empty list.
This spin can starve other threads and the I/O may never
complete. I fixed this by adding a return value to
hsched_invoke_strategy so that it returns 1 if the queue
is empty, in which case the caller simply blocks on
sema_p:
while (sema_tryp(&fio_done[count]) == 0) {
        if (hsched_invoke_strategy(fsp->hqueue)) {
                sema_p(&fio_done[count]);
                break;
        }
}
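Roughly, the idea is something like the sketch below (the internals
of hsched_invoke_strategy shown here are illustrative only; hq_mutex,
hq_head and hsched_issue_buf are assumed names, not the actual hsfs
fields):
int
hsched_invoke_strategy(struct hsfs_queue *hqueue)
{
        struct buf *bp;

        mutex_enter(&hqueue->hq_mutex);
        bp = hqueue->hq_head;
        if (bp == NULL) {
                /* Queue is empty: tell the caller to block on sema_p */
                mutex_exit(&hqueue->hq_mutex);
                return (1);
        }
        hqueue->hq_head = bp->av_forw;          /* dequeue the buf */
        mutex_exit(&hqueue->hq_mutex);

        hsched_issue_buf(bp);                   /* issue the I/O */
        return (0);
}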
This had me puzzled for a while. Thanks to Pramod and
Sanjeev for their help.
I will now add in your suggestions and send out the
updated webrevs. I will also run the fs fuzzer test.
See inline ...
> [...]
> SDT probe ?
> (no need to change the code just to get that in, just a thought to
> keep in mind if making such changes)
Okay, good idea, will do this, but see below ...
>
>>
>> Readahead stats will not be 100% accurate in the face of multiple
>> readers on
>> a single file.
>> Readahead is actually called from hsfs_getapage only when a page is
>> satisfied from the cache. This is to prevent unnecessary readahead
>> calls if
>> it is not resulting in actual cache hits. In addition readahead is
>> essentially
>> additional getpage calls being issued ahead of time and so should
>> have the
>> same behavior as when multiple threads are requesting for the same
>> page[s]
>> and they are not in cache. In addition multiple readers on the same
>> file will
>> actually cause the readahead counters to fall back to 0 quite fast.
>
> Ok, was assuming that. If we want reliable statistics, we should hook
> that stuff into a kstat; or just use SDT probes in the relevant points.
Getting better reliability there would mean adding a lock,
and that is bad for performance. The current approach is a
good enough compromise. BTW these stats are internal-use
only and are not exposed to users. It might make sense to
expose the number of pages read ahead, the number of pages
satisfied from the cache and the number of pages fetched
via physical I/O as kstats. That would give the ability to
check readahead effectiveness.
However, I think adding kstats and SDT probes requires an
ARC case?
If yes, then I'd rather do that via another RFE, since it
would cause a big additional delay, plus overhead for me to
keep things in sync as more features go into hsfs down the
line.
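If we do expose them later, something like this rough sketch is what
I have in mind (the struct, variable and string names below are just
for illustration, not existing hsfs code):
#include <sys/kstat.h>

typedef struct hsfs_ra_kstats {
        kstat_named_t pages_readahead;          /* pages read ahead */
        kstat_named_t pages_from_cache;         /* pages hit in cache */
        kstat_named_t pages_phys_io;            /* pages via physical I/O */
} hsfs_ra_kstats_t;

static hsfs_ra_kstats_t hsfs_ra_kstats = {
        { "pages_readahead",  KSTAT_DATA_UINT64 },
        { "pages_from_cache", KSTAT_DATA_UINT64 },
        { "pages_phys_io",    KSTAT_DATA_UINT64 },
};

void
hsfs_ra_kstat_init(void)
{
        kstat_t *ksp;

        ksp = kstat_create("hsfs", 0, "readahead", "misc",
            KSTAT_TYPE_NAMED,
            sizeof (hsfs_ra_kstats) / sizeof (kstat_named_t),
            KSTAT_FLAG_VIRTUAL);
        if (ksp != NULL) {
                ksp->ks_data = &hsfs_ra_kstats;
                kstat_install(ksp);
        }
}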
>
> [ ... ]
>>> - same code, should also zero out the trailing parts of the
>>> non-coalesced
>>> buffers. Or else, kmem_zalloc() 'iodata' to prevent incidental
>>> copyout
>>> of garbage when I/O was only partially successful. Again, in that
>>> sense
>>> _only_ the 'error' case block is correct, since you shouldn't make
>>> assumptions on data within a buf past b_resid on read (or rather,
>>> never
>>> copy such data out to the user). I.e. 'nbuf' contents past b_resid
>>> should be considered invalid.
>>>
>>
>> Actually data is not copied for the errored blocks. It just signals
>> a bioerror. Should
>> the target bufs be zeroed out as well ?
>
> To be sure there's no "leaking" of data into userspace, it might be a
> reasonable thing to do. I think setting b_resid = b_count would
> achieve the same (nothing transferred). Wanted to stress that in error
> case, we shouldn't "leak" stuff that's none of the business of the
> reading app.
Okay.
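For the error path that would roughly amount to the following sketch
('nbuf' and 'error' as in the discussion above; the rest is standard
buf(9S) usage):
if (error != 0) {
        /*
         * Nothing transferred: mark the whole buf as residual so
         * no data past b_resid can ever be copied out to the caller.
         */
        nbuf->b_resid = nbuf->b_bcount;
        bioerror(nbuf, error);
} else if (nbuf->b_resid > 0) {
        /*
         * Partial transfer: zero the tail so stale kernel memory
         * is not leaked if it is ever copied out.
         */
        bzero(nbuf->b_un.b_addr + (nbuf->b_bcount - nbuf->b_resid),
            nbuf->b_resid);
}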
>> [...]
>> Readahead should complete fast enough to be useful, ideally before
>> or a little
>> after the caller comes back to read those pages. So using a slower
>> general
>> purpose queue may not give sufficient throughput, esp. with faster
>> DVD drives.
>> I have seen that in my tests and changed to a dynamic taskq. Since
>> this is a dynamic
>> taskq, 8 threads will be created only if required.
>> That said, I see your point, it should be enough to just use one
>> dynamic taskq
>> for all hsfs mounts on the machine with a max of say 16 threads.
>> Since this is
>> dynamic using a larger max should be okay and desired to support a
>> large no.
>> of sequential threads.
>
> The advantage of the system taskq would be that it's generically
> "fair" and doesn't promote HSFS readahead over other system
> activities. Yes, it's important to have it done.
>
> Done a bit of browsing other code ...
> You're right, we can go by precedence. The LOFI driver does a 4-thread
> taskq per lofi device, and no one has ever complained about that. So go
> with the eight threads per fs instance; if we ever receive a report
> about that causing too high a system load, changing it to an
> N-thread-for-all would be easy enough.
Okay.
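For illustration, the per-mount creation is roughly the following
(HSFS_RA_NTHREADS and fsp->hqueue->ra_task are names I am using here
for the sketch only):
#define	HSFS_RA_NTHREADS	8	/* readahead threads per mount */

/*
 * TASKQ_DYNAMIC means the worker threads are created on demand,
 * so an idle mount does not pin eight kernel threads.
 */
fsp->hqueue->ra_task = taskq_create("hsfs_readahead",
    HSFS_RA_NTHREADS, minclsyspri, 1, HSFS_RA_NTHREADS,
    TASKQ_DYNAMIC);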
>
>>
>> BTW are there Sunray servers with 100 CDROM drives ?
>
> Every SunRay appliance could (theoretically) have one attached, and
> these show up on the SunRay server as USB devices. I'm not sure what
> customers use in actual deployment, though.
>
>>
>>> - Also, hsched_fini() doesn't make attempts to suspend/wait on the
>>> taskq;
>>> I wonder if the situation could ever arise that we manage to
>>> unmount the
>>> fs while there are still pending/unserviced readahead tasks. If
>>> not, I'd
>>> also like to know what makes sure that situation couldn't arise.
>>>
>>
>> Won't taskq_destroy wait for the threads to complete ? But then if
>> we use one
>> per-machine taskq then an explicit check needs to be added.
>
> Checked the semantics, you're right taskq_destroy will wait. Thanks
> for the clarification.
I will add a comment to explain this.
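Something along these lines (the function body is a sketch and
hqueue->ra_task is an assumed field name):
void
hsched_fini(struct hsfs_queue *hqueue)
{
        /*
         * taskq_destroy() blocks until every task already dispatched
         * to the taskq has completed, so an unmount cannot race with
         * readahead I/O that is still in flight.
         */
        taskq_destroy(hqueue->ra_task);
}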
>
>>
>>>
>>> Generically - I like the code. As said above, I'd like to see some
>>> explanation why the code does [ not ] race in various places. And
>>> the 'partial buf copyout' to be modified not to use b_count but
>>> always b_resid when copying. All else are nits.
>>>
>>> Ah, one final thing 'nits': There's quite a lot of non-cstyle-clean
>>> things in the new code. Pls. cleanup.
>>>
>>
>> Yes, cstyle and lint cleanup needs to be done.
>
>
> :)
Thanks for the comments. I will update the webrev in a
couple of days, and I will also update the bug description
with results from the various tests.
Regards,
Moinak.
>
> Best regards,
> FrankH.
>
>>
>> Regards,
>> Moinak.
>>
>>> Thanks,
>>> FrankH.
>>>
>>>
>>>> Regards,
>>>> Moinak.
>>>>