STREAMS itself takes locks (syncqs) when messages cross module boundaries. So
although removing the perimeters *should* help performance, it won't (by
itself) get you to full performance. (The locks have to be used with STREAMS
to protect against plumbing changes -- one of the most powerful features
of STREAMS is, in this case, its worst problem.)

If you can find a way to do this *other* than through STREAMS, you'll
probably see somewhat better performance.
Very high message rates are hard to achieve with true STREAMS, which is
one of the reasons the Solaris 10 networking stack moved away from
STREAMS -- direct function calls are used as much as possible.
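
For illustration only (the xx_* names, soft-state, and helper routines
below are invented), a non-STREAMS path could hand each completed DMA
message straight to the reader through an ordinary read(9E) entrypoint
and uiomove(), with no stream head or syncqs in between:

    /* Sketch only: assumes the usual DDI headers and an invented
     * xx_state_t with a lock, a condvar, and a completed-message list. */
    static int
    xx_read(dev_t dev, struct uio *uiop, cred_t *credp)
    {
        xx_state_t *xp = ddi_get_soft_state(xx_statep, getminor(dev));
        xx_msg_t *msg;
        int err;

        mutex_enter(&xp->xx_lock);
        while ((msg = xx_dequeue_completed(xp)) == NULL) {
            /* Wait for the DMA-complete interrupt to queue a message. */
            if (cv_wait_sig(&xp->xx_cv, &xp->xx_lock) == 0) {
                mutex_exit(&xp->xx_lock);
                return (EINTR);
            }
        }
        mutex_exit(&xp->xx_lock);

        /* Copy the ~800-byte message directly into the user's buffer. */
        err = uiomove(msg->m_data, msg->m_len, UIO_READ, uiop);
        xx_free_msg(xp, msg);
        return (err);
    }
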
-- Garrett
William Reich wrote:
>
> I am dealing with a performance issue where I have a simple
> loopback test on a T2000 SPARC platform.
>
> A user-space application is sending data to the STREAMS
> driver, which sends the data to our board via DMA.
> Then the data is returned from the board to user-space application #2.
>
> ++++++++ ++++++++
> + app1 + + app2 +
> ++++++++ ++++++++
> | ^
> v |
> ++++++++++++++++++++++++
> + +
> + driver B +
> ++++++++++++++++++++++++
> | ^
> V |
> ++++++++++++++++++++++++
> + E board +
> ++++++++++++++++++++++++
>
> In this simple configuration,
> we have run experiments showing that
> from app1 to app2 we can send/receive 28,000 messages per second.
> (Each message is about 800 bytes.)
>
> This is not good enough for our purposes. We need to achieve 36,000.
>
> Further testing and experimentation reveals that
> if we send the messages from app1 to point 'E' in the above
> diagram, we can achieve the 36,000 rate.
>
> Additional testing shows that going from app1, through the board
> to point 'B' in the diagram (inside the driver upon
> DMA completion from the board), we can make the 36,000 mark.
>
> So, the bottleneck is from point 'B' to app2.
>
> But this is STREAMS -- part of Solaris.
>
> Using lockstat, we can see many locks being taken
> by STREAMS.
> The "D_MTOUTEPERIM | D_MTOCEXCL" flags have been
> removed from the driver.
> We want maximum concurrency. As far as I know,
> this means we want no outer perimeter and no
> inner perimeter in the driver.
>
> Interestingly, on a T2000, the same rates are obtained
> whether 2 CPUs or 16 CPUs are used.
>
> Does anybody have any tips/tricks/techniques
> that can be used to remove the bottleneck?
>
> +++++++
> more details...
> from the board to the host driver, an interrupt is
> created when a DMA of a batch of messages is completed.
> A batch is usually 250 messages in this test.
> The interrupt routine schedules the read service routine of the
> STREAMS driver (via qenable()). The service routine pulls the messages
> from the DMA chain into the proper queue for sending the message up
> the stream (via putq()).
> In this simple case, it is one-to-one. In the general
> case, the data could go on one of 124 possible queues.
>
> Those messages go up the stream via the
> getq()/putnext() sequence, observing flow control when appropriate.
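> 
> In rough outline (simplified, with made-up names), the drain loop in
> the read service routine looks like this:
> 
>     static int
>     xx_rsrv(queue_t *q)
>     {
>         mblk_t *mp;
> 
>         /* Messages from the DMA chain were already putq()'d onto q. */
>         while ((mp = getq(q)) != NULL) {
>             if (!canputnext(q)) {
>                 /* Flow controlled: requeue and wait for back-enable. */
>                 (void) putbq(q, mp);
>                 break;
>             }
>             putnext(q, mp);
>         }
>         return (0);
>     }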
>
> When we hacked the driver to ignore flow control,
> we still could not do better than 28,000.
>
> wr
>
> -----Original Message-----
> From: Peter Memishian [mailto:[EMAIL PROTECTED]
> Sent: Friday, April 25, 2008 5:26 PM
> To: William Reich
> Cc: [email protected]
> Subject: re: [osol-code] Streams flags - D_MTPUTSHARED
>
>
> > Our driver uses the following flags in its cb_ops structure:
> > ( D_NEW | D_MP | D_MTOUTEPERIM | D_MTOCEXCL ).
> > I understand that this set of flags will make the open & close
> > routines synchronous.
>
> The open and close routines are *always* synchronous. D_MTOCEXCL makes
> sure there's only one thread in open or close at a time across all the
> instances of the driver in the system (basically, it forces the outer
> perimeter to be entered exclusively for those entrypoints). BTW, the
> D_NEW flag above does nothing (it expands to 0x0) and should be removed.
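> 
> For illustration, those flags live in the cb_flag field of your cb_ops;
> with D_NEW dropped and no perimeters at all, the declaration would look
> something like this (xx_streamtab stands in for your streamtab):
> 
>     static struct cb_ops xx_cb_ops = {
>         nulldev,            /* cb_open (unused for STREAMS) */
>         nulldev,            /* cb_close */
>         nodev,              /* cb_strategy */
>         nodev,              /* cb_print */
>         nodev,              /* cb_dump */
>         nodev,              /* cb_read */
>         nodev,              /* cb_write */
>         nodev,              /* cb_ioctl */
>         nodev,              /* cb_devmap */
>         nodev,              /* cb_mmap */
>         nodev,              /* cb_segmap */
>         nochpoll,           /* cb_chpoll */
>         ddi_prop_op,        /* cb_prop_op */
>         &xx_streamtab,      /* cb_str */
>         D_MP,               /* cb_flag: just D_MP */
>         CB_REV,             /* cb_rev */
>         nodev,              /* cb_aread */
>         nodev               /* cb_awrite */
>     };
> 
> Everything except cb_str and cb_flag is the usual boilerplate.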
>
> > My question is - do I need to add the D_MTPUTSHARED, D_MTPERQ, and
> > _D_MTSVCSHARED flags to make sure read and write queue put & service
> > routines can run concurrently?
>
> No -- D_MP does that already. For an inner perimeter, you have four
> basic choices:
>
> * D_MP: effectively no inner perimeter.
> * D_MTPERQ: inner perimeter around each queue.
> * D_MTQPAIR: inner perimeter around each queuepair.
> * D_MTPERMOD: inner perimeter around all queuepairs.
>
> Since the inner perimeter is always exclusive by default, D_MP is the
> highest level of concurrency, and flags like D_MTPUTSHARED make no sense
> with it. You can approximate D_MP to some degree by combining coarser
> perimeters with e.g. D_MTPUTSHARED and the like, but there's no point
> in doing that unless you have a specific reason not to be D_MP. As an
> aside: _D_MTSVCSHARED is not a public interface (hence the leading
> underscore). Do not use it.
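> 
> One more caution: with plain D_MP nothing serializes your entrypoints
> for you, so any per-instance state shared between put, service, and
> interrupt code needs its own lock. A rough sketch (names invented):
> 
>     static int
>     xx_wput(queue_t *q, mblk_t *mp)
>     {
>         xx_t *xp = q->q_ptr;        /* per-instance state */
> 
>         mutex_enter(&xp->xx_lock);  /* driver-private lock */
>         xp->xx_wput_calls++;        /* shared counter needs the lock */
>         mutex_exit(&xp->xx_lock);
> 
>         (void) putq(q, mp);         /* let the write service drain it */
>         return (0);
>     }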
>
> Hope this helps,
> --
> meem
>
_______________________________________________
opensolaris-code mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code