STREAMS itself uses locks (syncqs) when crossing module boundaries.  So 
although removing perimeters *should* help performance, it won't (by 
itself) achieve full performance.  (Locks have to be used with STREAMS 
to protect against plumbing changes -- dynamic plumbing, one of the most 
powerful features of STREAMS, is in this case its worst problem.)

If you can find a way to do this *other* than through STREAMS, you'll 
probably get somewhat better performance.

Very high message rates are hard to achieve with true STREAMS, which is 
one of the reasons the networking stack in Solaris 10 moved away from 
STREAMS -- direct function calls are used as much as possible.

    -- Garrett

William Reich wrote:
>  
> I am dealing with a performance issue where I have a simple
> loopback test on a T2000 SPARC platform.
>
> A user space application is sending data to the STREAMS
> driver, which sends the data to our board via DMA.
> Then the data is returned from the board to user space application #2.
>
> ++++++++        ++++++++
> + app1 +        + app2 +
> ++++++++        ++++++++
>   |                 ^
>   v                 |
> ++++++++++++++++++++++++
> +                      +
> +         driver    B  +
> ++++++++++++++++++++++++
>    |             ^
>    V             |
> ++++++++++++++++++++++++
> +   E   board          +
> ++++++++++++++++++++++++
>
> In this simple configuration,
> we have run experiments showing that
> from app1 to app2 we can send/receive 28,000 messages per second.
> ( Each message is about 800 bytes. )
>
> This is not good enough for our purposes. We need to achieve 36,000.
>
> Further testing and experimentation reveals that
> if we send the messages from app1 to point 'E' in the above
> diagram, we can achieve the 36,000 rate.
>
> Additional testing shows that going from app1, through the board
> to point 'B' in the diagram ( inside the driver upon
> DMA completion from the board ), we can make the 36,000 mark.
>
> So, the bottleneck is from point 'B' to app2.
>
> But this is STREAMS -- part of Solaris.
>
> Using lockstat, we can see many locks being taken
> by STREAMS.
> The D_MTOUTEPERIM and D_MTOCEXCL flags have been
> removed from the driver.
> We want maximum concurrency. As far as I know, this implies
> that we want no outer perimeter and no
> inner perimeter in the driver.
>
> Interestingly, on a T2000, the same rates are obtained
> whether 2 or 16 CPUs are used.
>
> Does anybody have any tips/tricks/techniques
> that can be used to remove the bottleneck?
>
> +++++++
> more details...
> From the board to the host driver, an interrupt is
> generated when a DMA of a batch of messages completes.
> A batch is usually 250 messages in this test.
> The interrupt routine schedules the read service routine of the
> STREAMS driver ( via qenable() ).
> The service routine pulls the messages from the DMA chain
> onto the proper queue for sending the messages up the stream
> ( via putq() ).
> In this simple case, it is one-to-one. In the general
> case, the data could go on one of 124 possible queues.
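>
> ( Roughly, the interrupt side of that pattern looks like the sketch
> below. Names like xx_state_t and xx_dma_batch_done() are just
> illustrative placeholders, not our real code. )
>
>     #include <sys/types.h>
>     #include <sys/stream.h>
>     #include <sys/ddi.h>
>     #include <sys/sunddi.h>
>
>     static uint_t
>     xx_intr(caddr_t arg1, caddr_t arg2)
>     {
>             /* Illustrative per-instance soft state. */
>             xx_state_t *sp = (xx_state_t *)arg1;
>
>             /* Hypothetical helper: has a DMA batch completed? */
>             if (!xx_dma_batch_done(sp))
>                     return (DDI_INTR_UNCLAIMED);
>
>             /*
>              * Defer the message processing to the read service
>              * routine; qenable() just schedules it to run.
>              */
>             qenable(sp->xx_rq);
>             return (DDI_INTR_CLAIMED);
>     }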
>
> Those messages go up the stream via the
> getq()/putnext() sequence, observing flow control when appropriate.
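>
> ( Continuing the same sketch, the read service routine that drains a
> queue while honoring flow control looks roughly like this: )
>
>     static int
>     xx_rsrv(queue_t *q)
>     {
>             mblk_t *mp;
>
>             while ((mp = getq(q)) != NULL) {
>                     if (canputnext(q)) {
>                             putnext(q, mp);
>                     } else {
>                             /* Downstream is flow-controlled: requeue and stop. */
>                             (void) putbq(q, mp);
>                             break;
>                     }
>             }
>             return (0);
>     }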
>
> When we hacked the driver to ignore flow control,
> we still could not do better than 28,000.
>
> wr
>
> -----Original Message-----
> From: Peter Memishian [mailto:[EMAIL PROTECTED]
> Sent: Friday, April 25, 2008 5:26 PM
> To: William Reich
> Cc: [email protected]
> Subject: re: [osol-code] Streams flags - D_MTPUTSHARED
>
>
>  > Our driver uses the following flags in its cb_ops structure:
>  > ( D_NEW | D_MP | D_MTOUTEPERIM | D_MTOCEXCL ).
>  > I understand that this set of flags will make the open & close
>  > routines synchronous.
>
> The open and close routines are *always* synchronous.  D_MTOCEXCL makes
> sure there's only one thread in open or close at a time across all the
> instances of the driver in the system (basically, it forces the outer
> perimeter to be entered exclusively for those entrypoints).  BTW, the
> D_NEW flag above does nothing (it expands to 0x0) and should be removed.
>
>  > My question is: do I need to add the D_MTPUTSHARED, D_MTPERQ, and
>  > _D_MTSVCSHARED flags to make sure the read and write queue put & service
>  > routines can run concurrently?
>
> No -- D_MP does that already.  For an inner perimeter, you have four
> basic choices:
>
>       * D_MP: effectively no inner perimeter.
>       * D_MTPERQ: inner perimeter around each queue.
>       * D_MTQPAIR: inner perimeter around each queuepair.
>       * D_MTPERMOD: inner perimeter around all queuepairs.
>
> Since the inner perimeter is entered exclusively by default, D_MP is the
> highest level of concurrency, and flags like D_MTPUTSHARED make no sense
> with it.  You can approximate D_MP to some degree by combining coarser
> perimeters with e.g. D_MTPUTSHARED and the like, but there's no reason
> to do that unless you have a specific reason not to be D_MP.  As an aside:
> _D_MTSVCSHARED is not a public interface (hence the leading underscore).
> Do not use it.
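>
> For reference, a plain-D_MP declaration looks roughly like the sketch
> below (the entry-point and streamtab names are placeholders, and this
> assumes the usual <sys/conf.h> and <sys/devops.h> includes):
>
>       static struct cb_ops xx_cb_ops = {
>               nulldev,        /* cb_open (STREAMS open lives in the qinit) */
>               nulldev,        /* cb_close */
>               nodev,          /* cb_strategy */
>               nodev,          /* cb_print */
>               nodev,          /* cb_dump */
>               nodev,          /* cb_read */
>               nodev,          /* cb_write */
>               nodev,          /* cb_ioctl */
>               nodev,          /* cb_devmap */
>               nodev,          /* cb_mmap */
>               nodev,          /* cb_segmap */
>               nochpoll,       /* cb_chpoll */
>               ddi_prop_op,    /* cb_prop_op */
>               &xx_streamtab,  /* cb_str */
>               D_MP,           /* cb_flag: no outer or inner perimeter */
>               CB_REV,         /* cb_rev */
>               nodev,          /* cb_aread */
>               nodev           /* cb_awrite */
>       };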
>
> Hope this helps,
> --
> meem
