Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-13 Thread Joel Schopp

Linas Vepstas wrote:
The current driver code performs 512 DMA mappings of a bunch of
32-byte structures. This is silly, as they are all in contiguous
memory. This patch changes the code to DMA map the entire area
with just one call.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Cc: James K Lewis <[EMAIL PROTECTED]>
Cc: Arnd Bergmann <[EMAIL PROTECTED]>


While this patch isn't useful with the current cell implementation of pci_map_single,
it sounds like people are going to be making changes to that at some point. In light of
that new information (new to me, anyway) this should probably go in after all.  Sorry
for causing trouble.


Acked-by: Joel Schopp <[EMAIL PROTECTED]>



Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-11 Thread Benjamin Herrenschmidt

> I started writing the patch thinking it would have some huge effect on
> performance, based on a false assumption about how I/O was done on this
> machine.
> 
> *If* this were another pSeries system, then each call to 
> pci_map_single() chews up an actual hardware "translation 
> control entry" (TCE) that maps PCI bus addresses into 
> system RAM addresses. These are somewhat limited resources,
> and so one shouldn't squander them.  Furthermore, I thought
> TCEs have TLBs associated with them (similar to how virtual
> memory page tables are backed by hardware page TLBs), of which 
> there are even fewer. I was thinking that TLB thrashing would 
> have a big impact on performance. 
> 
> Turns out that there was no difference in performance at all, 
> and a quick look at "cell_map_single()" in arch/powerpc/platforms/cell
> made it clear why: there's no fancy I/O address mapping.
> 
> Thus, the patch has only marginal benefit; I submit it only in the 
> name of "it's the right thing to do anyway".

Well, there is no fancy iommu mapping ... yet.

It's been implemented and is coming after we put together some
workarounds for various other spider hardware issues that trigger when
using it (bogus prefetches and bogus pci ordering).

I think the hypervisor based platforms will be happy with that patch
too.

Ben.




Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-11 Thread Geoff Levand
Linas Vepstas wrote:
> On Tue, Oct 10, 2006 at 06:46:08PM -0700, Geoff Levand wrote:
>> > Linas Vepstas wrote:
>> >> The current driver code performs 512 DMA mappings of a bunch of
>> >> 32-byte structures. This is silly, as they are all in contiguous
>> >> memory. This patch changes the code to DMA map the entire area
>> >> with just one call.
>> 
>> Linas, 
>> 
>> Is the motivation for this change to improve performance by reducing the
>> overhead of the mapping calls?  
> 
> Yes.
> 
>> If so, there may be some benefit for some systems.  Could
>> you please elaborate?
> 
> I started writing the patch thinking it would have some huge effect on
> performance, based on a false assumption about how I/O was done on this
> machine.
> 
> *If* this were another pSeries system, then each call to 
> pci_map_single() chews up an actual hardware "translation 
> control entry" (TCE) that maps PCI bus addresses into 
> system RAM addresses. These are somewhat limited resources,
> and so one shouldn't squander them.  Furthermore, I thought
> TCEs have TLBs associated with them (similar to how virtual
> memory page tables are backed by hardware page TLBs), of which 
> there are even fewer. I was thinking that TLB thrashing would 
> have a big impact on performance. 
> 
> Turns out that there was no difference in performance at all, 
> and a quick look at "cell_map_single()" in arch/powerpc/platforms/cell
> made it clear why: there's no fancy I/O address mapping.

OK, thanks for the explanation.  Actually, the current cell DMA mapping
implementation uses a simple 'linear' mapping, in that all of RAM is
mapped into the bus DMA address space at once, and in fact it is all
just done at system startup.

There is ongoing work to implement 'dynamic' mapping, where DMA pages are
mapped into the bus DMA address space on demand.  I think a key point for
understanding the benefit of this is that the cell processor's I/O controller
maps pages per device, so you can map one DMA page to one device.  I
currently have this working for my platform, but have not released that
work.  There is some overhead to managing the mapped buffers and to
requesting pages be mapped by the hypervisor, etc., so I was thinking that
this work of yours to consolidate the memory buffers prior to requesting
the mapping could be of benefit if it were in an often-executed code path.
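
To make that overhead concrete, here is a minimal sketch of the call pattern
the patch removes (hypothetical helper name, not the actual driver code):
one pci_map_single() per 32-byte descriptor.  Under a dynamic IOMMU each of
those calls would consume an I/O PTE and possibly a hypervisor mapping
request, i.e. up to 512 map operations for what is really one contiguous
region of roughly 16 KB (512 descriptors of 32 bytes each).

#include <linux/pci.h>
#include "spider_net.h"	/* struct spider_net_descr, SPIDER_NET_DESCR_SIZE */

/* Sketch only: map each 32-byte descriptor individually. */
static int sketch_map_each_descr(struct pci_dev *pdev,
				 struct spider_net_descr *descrs, int count)
{
	int i;

	for (i = 0; i < count; i++) {
		/* one mapping call, and one I/O translation entry, per descriptor */
		descrs[i].bus_addr = pci_map_single(pdev, &descrs[i],
						    SPIDER_NET_DESCR_SIZE,
						    PCI_DMA_BIDIRECTIONAL);
		if (pci_dma_mapping_error(descrs[i].bus_addr))
			goto unmap;
	}
	return 0;

unmap:
	/* undo the mappings that did succeed before reporting failure */
	while (--i >= 0)
		pci_unmap_single(pdev, descrs[i].bus_addr,
				 SPIDER_NET_DESCR_SIZE, PCI_DMA_BIDIRECTIONAL);
	return -ENOMEM;
}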

-Geoff



Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-11 Thread Linas Vepstas
On Tue, Oct 10, 2006 at 06:46:08PM -0700, Geoff Levand wrote:
> > Linas Vepstas wrote:
> >> The current driver code performs 512 DMA mappings of a bunch of
> >> 32-byte structures. This is silly, as they are all in contiguous
> >> memory. This patch changes the code to DMA map the entire area
> >> with just one call.
> 
> Linas, 
> 
> Is the motivation for this change to improve performance by reducing the
> overhead of the mapping calls?  

Yes.

> If so, there may be some benefit for some systems.  Could
> you please elaborate?

I started writing the patch thinking it would have some huge effect on
performance, based on a false assumption about how I/O was done on this
machine.

*If* this were another pSeries system, then each call to 
pci_map_single() chews up an actual hardware "translation 
control entry" (TCE) that maps PCI bus addresses into 
system RAM addresses. These are somewhat limited resources,
and so one shouldn't squander them.  Furthermore, I thought
TCEs have TLBs associated with them (similar to how virtual
memory page tables are backed by hardware page TLBs), of which 
there are even fewer. I was thinking that TLB thrashing would 
have a big impact on performance. 

Turns out that there was no difference in performance at all, 
and a quick look at "cell_map_single()" in arch/powerpc/platforms/cell
made it clear why: there's no fancy I/O address mapping.

Thus, the patch has only marginal benefit; I submit it only in the 
name of "it's the right thing to do anyway".

--linas


Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-11 Thread Arnd Bergmann
On Wednesday 11 October 2006 03:46, Geoff Levand wrote:
> > 
> > The others look good, but this one complicates the code and doesn't have
> > any benefit.  20 for 21 isn't bad.
> 
> Is the motivation for this change to improve performance by reducing the
> overhead of the mapping calls?  If so, there may be some benefit for some
> systems.  Could you please elaborate?
> 

From what I understand, this patch drastically reduces the number of
I/O PTEs that are needed in the IOMMU. With the current static IOMMU
mapping, it should only make a difference during initialization, but
any platform that uses a dynamic mapping of IOMMU entries will benefit
a lot from it, because:

- the card can do better prefetching of consecutive memory
- there are more I/O PTEs available for other drivers.

Arnd <><


Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-11 Thread Benjamin Herrenschmidt
On Tue, 2006-10-10 at 18:20 -0500, jschopp wrote:
> Linas Vepstas wrote:
> > The current driver code performs 512 DMA mappings of a bunch of
> > 32-byte structures. This is silly, as they are all in contiguous
> > memory. This patch changes the code to DMA map the entire area
> > with just one call.
> > 
> > Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
> > Cc: James K Lewis <[EMAIL PROTECTED]>
> > Cc: Arnd Bergmann <[EMAIL PROTECTED]>
> 
> The others look good, but this one complicates the code and doesn't have
> any benefit.  20 for 21 isn't bad.

Hi Joel!

I'm not sure what you mean here (especially your "20 for 21" comment).

The patch looks perfectly fine to me, and in fact removes more lines of
code than it adds :)

Cheers,
Ben.




Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-10 Thread Geoff Levand
jschopp wrote:
> Linas Vepstas wrote:
>> The current driver code performs 512 DMA mappings of a bunch of
>> 32-byte structures. This is silly, as they are all in contiguous
>> memory. This patch changes the code to DMA map the entire area
>> with just one call.
>> 
>> Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
>> Cc: James K Lewis <[EMAIL PROTECTED]>
>> Cc: Arnd Bergmann <[EMAIL PROTECTED]>
> 
> The others look good, but this one complicates the code and doesn't have
> any benefit.  20 for 21 isn't bad.

Linas, 

Is the motivation for this change to improve performance by reducing the
overhead of the mapping calls?  If so, there may be some benefit for some
systems.  Could you please elaborate?

-Geoff



Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-10 Thread jschopp

Linas Vepstas wrote:
The current driver code performs 512 DMA mappings of a bunch of
32-byte structures. This is silly, as they are all in contiguous
memory. This patch changes the code to DMA map the entire area
with just one call.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Cc: James K Lewis <[EMAIL PROTECTED]>
Cc: Arnd Bergmann <[EMAIL PROTECTED]>


The others look good, but this one complicates the code and doesn't have any
benefit.  20 for 21 isn't bad.



[PATCH 21/21]: powerpc/cell spidernet DMA coalescing

2006-10-10 Thread Linas Vepstas

The current driver code performs 512 DMA mappings of a bunch of
32-byte structures. This is silly, as they are all in contiguous
memory. This patch changes the code to DMA map the entire area
with just one call.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Cc: James K Lewis <[EMAIL PROTECTED]>
Cc: Arnd Bergmann <[EMAIL PROTECTED]>


 drivers/net/spider_net.c |  107 +++
 drivers/net/spider_net.h |   16 ++-
 2 files changed, 59 insertions(+), 64 deletions(-)

Index: linux-2.6.18-mm2/drivers/net/spider_net.c
===================================================================
--- linux-2.6.18-mm2.orig/drivers/net/spider_net.c	2006-10-10 13:40:56.0 -0500
+++ linux-2.6.18-mm2/drivers/net/spider_net.c	2006-10-10 13:42:12.0 -0500
@@ -267,25 +267,6 @@ spider_net_get_descr_status(struct spide
 }
 
 /**
- * spider_net_free_chain - free descriptor chain
- * @card: card structure
- * @chain: address of chain
- *
- */
-static void
-spider_net_free_chain(struct spider_net_card *card,
- struct spider_net_descr_chain *chain)
-{
-   struct spider_net_descr *descr;
-
-   for (descr = chain->tail; !descr->bus_addr; descr = descr->next) {
-   pci_unmap_single(card->pdev, descr->bus_addr,
-SPIDER_NET_DESCR_SIZE, PCI_DMA_BIDIRECTIONAL);
-   descr->bus_addr = 0;
-   }
-}
-
-/**
  * spider_net_init_chain - links descriptor chain
  * @card: card structure
  * @chain: address of chain
@@ -297,15 +278,15 @@ spider_net_free_chain(struct spider_net_
  *
  * returns 0 on success, <0 on failure
  */
-static int
+static void
 spider_net_init_chain(struct spider_net_card *card,
   struct spider_net_descr_chain *chain,
   struct spider_net_descr *start_descr,
+  dma_addr_t buf,
   int no)
 {
int i;
struct spider_net_descr *descr;
-   dma_addr_t buf;
 
descr = start_descr;
memset(descr, 0, sizeof(*descr) * no);
@@ -314,17 +295,12 @@ spider_net_init_chain(struct spider_net_
	for (i=0; i<no; i++, descr++) {
		descr->dmac_cmd_status = SPIDER_NET_DESCR_NOT_IN_USE;
 
-   buf = pci_map_single(card->pdev, descr,
-SPIDER_NET_DESCR_SIZE,
-PCI_DMA_BIDIRECTIONAL);
-
-   if (pci_dma_mapping_error(buf))
-   goto iommu_error;
-
descr->bus_addr = buf;
+   descr->next_descr_addr = 0;
descr->next = descr + 1;
descr->prev = descr - 1;
 
+   buf += sizeof(struct spider_net_descr);
}
/* do actual circular list */
(descr-1)->next = start_descr;
@@ -333,17 +309,6 @@ spider_net_init_chain(struct spider_net_
spin_lock_init(&chain->lock);
chain->head = start_descr;
chain->tail = start_descr;
-
-   return 0;
-
-iommu_error:
-   descr = start_descr;
-   for (i=0; i < no; i++, descr++)
-   if (descr->bus_addr)
-   pci_unmap_single(card->pdev, descr->bus_addr,
-SPIDER_NET_DESCR_SIZE,
-PCI_DMA_BIDIRECTIONAL);
-   return -ENOMEM;
 }
 
 /**
@@ -1658,24 +1623,32 @@ spider_net_open(struct net_device *netde
 {
struct spider_net_card *card = netdev_priv(netdev);
struct spider_net_descr *descr;
-   int i, result;
+   int result = -ENOMEM;
 
-   result = -ENOMEM;
-   if (spider_net_init_chain(card, &card->tx_chain, card->descr,
- card->num_tx_desc))
-   goto alloc_tx_failed;
+   card->descr_dma_addr = pci_map_single(card->pdev, card->descr,
+   (card->num_tx_desc+card->num_rx_desc)*sizeof(struct spider_net_descr),
+PCI_DMA_BIDIRECTIONAL);
+   if (pci_dma_mapping_error(card->descr_dma_addr))
+   return -ENOMEM;
+
+   spider_net_init_chain(card, &card->tx_chain, card->descr,
+ card->descr_dma_addr,
+ card->num_tx_desc);
 
card->low_watermark = NULL;
 
/* rx_chain is after tx_chain, so offset is descr + tx_count */
-   if (spider_net_init_chain(card, &card->rx_chain,
+   spider_net_init_chain(card, &card->rx_chain,
  card->descr + card->num_tx_desc,
- card->num_rx_desc))
-   goto alloc_rx_failed;
+ card->descr_dma_addr
+   + card->num_tx_desc * sizeof(struct spider_net_descr),
+ card->num_rx_desc);
 
descr = card->rx_chain.head;
-   for (i=0; i < card->num_rx_desc; i++, descr++)
+   do {
descr->next_descr_addr = descr->next->bus_a