Frank.Hofmann at Sun.COM wrote:
>
> Hi Moinak,
>
>
> final webrev ready ?

   Not yet. I have been testing in all sorts of ways.
   I have figured out a simple way to avoid any performance
   degradation for pure random reads: by changing when
   mutex_exit is called in the non-coalescing case, we get
   greater concurrency.
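
   The change is roughly of the following shape (an
   illustrative sketch only, not the actual diff; the lock
   and helper names are stand-ins):

   /* Before: queue lock held across the I/O submission. */
   mutex_enter(&hqueue->hq_lock);
   nbuf = hsched_dequeue_buf(hqueue);      /* hypothetical helper */
   (void) bdev_strategy(nbuf);             /* issued with lock held */
   mutex_exit(&hqueue->hq_lock);

   /*
    * After: drop the lock before issuing the I/O so other
    * readers can enqueue/dequeue concurrently.
    */
   mutex_enter(&hqueue->hq_lock);
   nbuf = hsched_dequeue_buf(hqueue);      /* hypothetical helper */
   mutex_exit(&hqueue->hq_lock);
   (void) bdev_strategy(nbuf);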

   However, after making that change I was seeing an odd
   hang, but only when booting a BeleniX 0.6.1 LiveCD image
   from a DVD (not a CD) on a Ferrari 3400 (not on any
   other laptop or desktop) or on MacOSX Parallels!

   It turned out to be a starvation problem in the
   following loop, which I have just resolved:

   while (sema_tryp(&fio_done[count]) == 0) {
           hsched_invoke_strategy(fsp->hqueue);
   }

   Here is how the starvation happens:
   Suppose two threads have enqueued their bufs. Each will
   call hsched_invoke_strategy repeatedly until its bufs
   have been processed.
   Now it may sometimes happen that the threads end up
   processing each other's bufs. So there can be a situation
   where there are two bufs in the queue and thread A picks
   up thread B's buf while thread B picks up thread A's buf.
   Thread B issues thread A's buf first and thread A gets it
   back. Then thread A issues thread B's buf, and thread B is
   left waiting for it to come back. The list is now empty,
   so thread B keeps calling hsched_invoke_strategy, which
   returns immediately because the list is empty.

   This spin can starve other threads and the I/O may never
   complete. I fixed this by adding a return value to
   hsched_invoke_strategy so that it returns 1 if the queue
   is empty, in which case the caller will simply block on a
   sema_p:

   while (sema_tryp(&fio_done[count]) == 0) {
           if (hsched_invoke_strategy(fsp->hqueue)) {
                   sema_p(&fio_done[count]);
                   break;
           }
   }
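
   On the hsched_invoke_strategy side the change boils down
   to something like this (a much-simplified sketch to show
   only the new return value; the queue field and helper
   names are illustrative, not the actual code):

   int
   hsched_invoke_strategy(struct hsfs_queue *hqueue)
   {
           struct buf *nbuf;

           mutex_enter(&hqueue->hq_lock);
           if (hqueue->hq_count == 0) {
                   /*
                    * Nothing left to process; tell the caller to
                    * stop spinning and block in sema_p() instead.
                    */
                   mutex_exit(&hqueue->hq_lock);
                   return (1);
           }
           nbuf = hsched_dequeue_buf(hqueue);      /* hypothetical helper */
           mutex_exit(&hqueue->hq_lock);

           (void) bdev_strategy(nbuf);
           return (0);
   }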

   This had me puzzled for a while. Thanks to Pramod and
   Sanjeev for their help.

   I will now incorporate your suggestions and send out the
   updated webrevs. I will also run the fs fuzzer test.
   See inline ...

> [...]
> SDT probe ?
> (no need to change the code just to get that in, just a thought to 
> keep in mind if making such changes)

   Okay, good idea, will do this, but see below ...

>
>>
>>  Readahead stats will not be 100% accurate in the face of multiple 
>> readers on
>>  a single file.
>>  Readahead is actually called from hsfs_getapage only when a page is
>>  satisfied from the cache. This is to prevent unnecessary readahead 
>> calls if
>>  it is not resulting in actual cache hits. In addition readahead is 
>> essentially
>>  additional getpage calls being issued ahead of time and so should 
>> have the
>>  same behavior as when multiple threads are requesting for the same 
>> page[s]
>>  and they are not in cache. In addition multiple readers on the same 
>> file will
>>  actually cause the readahead counters to fall back to 0 quite fast.
>
> Ok, was assuming that. If we want reliable statistics, we should hook 
> that stuff into a kstat; or just use SDT probes in the relevant points.

   Getting better reliability would mean adding a lock, and
   that is bad for performance. The current approach is a
   good enough compromise. BTW these stats are internal-use
   only and not exposed to the user. It might make sense to
   expose the number of pages read ahead, the number of pages
   satisfied from the cache, and the number of pages fetched
   via physical I/O as kstats. That would give the ability to
   check readahead effectiveness.

   However, I think adding kstats and SDT probes requires an
   ARC case?

   If so, then I'd rather do that via another RFE, since it
   would cause a big additional delay, as well as overhead
   for me to keep things in sync as more features go into
   hsfs down the line.
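
   For the record, if that RFE ever materializes, the kstats
   would presumably look something like this (hypothetical
   names, plain named kstats via the standard kstat_create /
   kstat_named_init / kstat_install interfaces):

   typedef struct hsfs_ra_kstats {
           kstat_named_t ra_pages_readahead;       /* pages read ahead */
           kstat_named_t ra_pages_from_cache;      /* pages from page cache */
           kstat_named_t ra_pages_physio;          /* pages via physical I/O */
   } hsfs_ra_kstats_t;

   static hsfs_ra_kstats_t hsfs_ra_kstats;

   static void
   hsfs_ra_kstat_init(void)
   {
           kstat_t *ksp;

           ksp = kstat_create("hsfs", 0, "readahead", "misc",
               KSTAT_TYPE_NAMED,
               sizeof (hsfs_ra_kstats_t) / sizeof (kstat_named_t),
               KSTAT_FLAG_VIRTUAL);
           if (ksp == NULL)
                   return;

           kstat_named_init(&hsfs_ra_kstats.ra_pages_readahead,
               "pages_readahead", KSTAT_DATA_UINT64);
           kstat_named_init(&hsfs_ra_kstats.ra_pages_from_cache,
               "pages_from_cache", KSTAT_DATA_UINT64);
           kstat_named_init(&hsfs_ra_kstats.ra_pages_physio,
               "pages_physio", KSTAT_DATA_UINT64);

           ksp->ks_data = &hsfs_ra_kstats;
           kstat_install(ksp);
   }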

>
> [ ... ]
>>> - same code, should also zero out the trailing parts of the 
>>> non-coalesced
>>>    buffers. Or else, kmem_zalloc() 'iodata' to prevent incidental 
>>> copyout
>>>    of garbage when I/O was only partially successful. Again, in that 
>>> sense
>>>    _only_ the 'error' case block is correct, since you shouldn't make
>>>    assumptions on data within a buf past b_resid on read (or rather, 
>>> never
>>>    copy such data out to the user). I.e. 'nbuf' contents past b_resid
>>>    should be considered invalid.
>>>
>>
>>  Actually data is not copied for the errored blocks. It just signals 
>> a bioerror. Should
>>  the target bufs be zeroed out as well ?
>
> To be sure there's no "leaking" of data into userspace, it might be a 
> reasonable thing to do. I think setting b_resid = b_count would 
> achieve the same (nothing transferred). Wanted to stress that in error 
> case, we shouldn't "leak" stuff that's none of the business of the 
> reading app.

   Okay.
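
   Concretely, the error path for each target buf would then
   be something along these lines (a sketch; note that the
   struct buf field is spelled b_bcount, and setting b_resid
   equal to it marks the transfer as "nothing transferred"):

   /*
    * The coalesced I/O failed: propagate the error and make
    * sure none of the (possibly garbage) data can be copied
    * out to the reading application.
    */
   nbuf->b_resid = nbuf->b_bcount;         /* nothing transferred */
   bioerror(nbuf, error);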

>> [...]
>>  Readahead should complete fast enough to be useful, ideally before 
>> or a little
>>  after the caller comes back to read those pages. So using a slower 
>> general
>>  purpose queue may not give sufficient throughput, esp. with faster 
>> DVD drives.
>>  I have seen that in my tests and changed to a dynamic taskq. Since 
>> this is dynamic
>>  taskq, 8 threads will be created only if required.
>>  That said, I see your point, it should be enough to just use one 
>> dynamic taskq
>>  for all hsfs mounts on the machine with a max of say 16 threads. 
>> Since this is
>>  dynamic using a larger max should be okay and desired to support a 
>> large no.
>>  of sequential threads.
>
> The advantage of the system taskq would be that it's generically 
> "fair" and doesn't promote HSFS readahead over other system 
> activities. Yes, it's important to have it done.
>
> Done a bit of browsing other code ...
> You're right, we can go by precedence. The LOFI driver does a 4-thread 
> taskq per lofi device, and no one has ever complained about that. So go 
> with the eight threads per fs instance; if we ever receive a report 
> about that causing too high a system load, changing it to an 
> N-thread-for-all would be easy enough.

   Okay.
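
   For the archive, the per-fs-instance setup is essentially
   the following (a sketch; the field names, task function
   and maxalloc value are illustrative):

   /*
    * One dynamic taskq per mounted hsfs instance; worker
    * threads are created on demand, up to 8.
    */
   fsp->hqueue->ra_taskq = taskq_create("hsfs_readahead", 8,
       minclsyspri, 1, 64, TASKQ_DYNAMIC);

   /*
    * Dispatch one readahead unit; TQ_SLEEP may block briefly
    * waiting for taskq resources.
    */
   (void) taskq_dispatch(fsp->hqueue->ra_taskq, hsfs_ra_task,
       ra_info, TQ_SLEEP);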

>
>>
>>  BTW are there Sunray servers with 100 CDROM drives ?
>
> Every SunRay appliance could (theoretically) have one attached, and 
> these show up on the SunRay server as USB devices. I'm not sure what 
> customers use in actual deployment, though.
>
>>
>>> - Also, hsched_fini() doesn't make attempts to suspend/wait on the 
>>> taskq;
>>>    I wonder if the situation could ever arise that we manage to 
>>> unmount the
>>>    fs while there are still pending/unserviced readahead tasks. If 
>>> not, I'd
>>>    also like to know what makes sure that situation couldn't arise.
>>>
>>
>>  Won't taskq_destroy wait for the threads to complete ? But then if 
>> we use one
>>  per-machine taskq then an explicit check needs to be added.
>
> Checked the semantics, you're right taskq_destroy will wait. Thanks 
> for the clarification.

   I will add a comment to explain this.
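
   Something like this (sketch; the queue field name is
   illustrative):

   void
   hsched_fini(struct hsfs_queue *hqueue)
   {
           /*
            * taskq_destroy() blocks until all pending and active
            * readahead tasks have completed, so by the time the
            * queue is torn down below no task can still be
            * referencing it.
            */
           taskq_destroy(hqueue->ra_taskq);

           /* ... rest of the queue teardown ... */
   }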

>
>>
>>>
>>> Generically - I like the code. As said above, I'd like to see some 
>>> explanation why the code does [ not ] race in various places. And 
>>> the 'partial buf copyout' to be modified not to use b_count but 
>>> always b_resid when copying. All else are nits.
>>>
>>> Ah, one final thing 'nits': There's quite a lot of non-cstyle-clean 
>>> things in the new code. Pls. cleanup.
>>>
>>
>>  Yes, cstyle and lint cleanup needs to be done.
>
>
> :)

   Thanks for the comments. I will update the webrev in a
   couple of days, and I will also update the bug description
   with results from the various tests.

Regards,
Moinak.

>
> Best regards,
> FrankH.
>
>>
>> Regards,
>> Moinak.
>>
>>> Thanks,
>>> FrankH.
>>>
>>>
>>>> Regards,
>>>> Moinak.
>>>>
>>
>>
>

