[PATCH 0/2] Address issues in dma-debug API

2013-03-18 Thread Alexander Duyck
Christoph Paasch recently reported a "device driver failed to check map
error" warning on igb.  However, after reviewing the code there was no
possibility of that actually occurring.  On closer inspection I found a bug in
the DMA debug API that was causing the false report.  These two patches
address the issues I found.

I found the first issue while trying to implement a workaround.  Specifically,
the problem is a locking bug that is triggered when a multiply mapped buffer
exists and there is no exact match for the unmap.  This results in the CPU
deadlocking, as sketched below.
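
To illustrate, here is a minimal sketch of the failing path, simplified
from check_unmap in lib/dma-debug.c (an illustration, not the actual code):

static void check_unmap_sketch(struct dma_debug_entry *ref)
{
	struct dma_debug_entry *entry;
	struct hash_bucket *bucket;
	unsigned long flags;

	bucket = get_hash_bucket(ref, &flags);	/* bucket lock taken */
	entry = bucket_find_exact(bucket, ref);

	if (!entry) {
		/* dma_mapping_error() re-enters dma-debug through
		 * debug_dma_mapping_error(), which calls get_hash_bucket()
		 * on the very same bucket and spins on the lock this CPU
		 * already holds. */
		dma_mapping_error(ref->dev, ref->dev_addr);
	}

	put_hash_bucket(bucket, &flags);	/* never reached on that path */
}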

The second issue, which was the original problem, is resolved by guaranteeing
that when dma_mapping_error is called we set MAP_ERR_CHECKED on a matching
entry that has not previously been checked.

I'm not sure if these are critical enough to go into one of the upcoming RC
kernels or if they can wait until the merge window, since this is in a
debugging API.  I'm leaving that for the sub-maintainers to decide.

---

Alexander Duyck (2):
  dma-debug: Fix locking bug in check_unmap
  dma-debug: Update DMA debug API to better handle multiple mappings of a buffer


 lib/dma-debug.c |   42 ++++++++++++++++++++++++++++--------------
 1 files changed, 28 insertions(+), 14 deletions(-)



[PATCH 2/2] dma-debug: Update DMA debug API to better handle multiple mappings of a buffer

2013-03-18 Thread Alexander Duyck
There were reports of the igb driver unmapping buffers without calling
dma_mapping_error.  On closer inspection, issues were found in the DMA debug
API and in how it handled multiple mappings of the same buffer.

The issue I found is that debug_dma_mapping_error would only set the
map_err_type to MAP_ERR_CHECKED in the case that there was only one match for
the device and device address.  However, in the non-IOMMU case multiple
mappings of the same address existed, and as a result the field was not being
set once a second mapping was instantiated.  I have resolved this by changing
the search so that it instead will now set MAP_ERR_CHECKED on the first
buffer that matches the device and DMA address and is currently in the state
MAP_ERR_NOT_CHECKED.

A secondary side effect of this patch is that in the case of multiple buffers
using the same address only the last mapping will have a valid map_err_type.
The previous mappings will all end up with map_err_type set to
MAP_ERR_CHECKED because of the dma_mapping_error call in debug_dma_map_page.
However, this behavior may be preferable as it means you will likely only see
one real error per multi-mapped buffer, versus the current behavior of
multiple false errors per multi-mapped buffer.

Signed-off-by: Alexander Duyck 
---
 lib/dma-debug.c |   24 +++++++++++++++++++-----
 1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 724bd4d..aa465d9 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -1082,13 +1082,27 @@ void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
ref.dev = dev;
ref.dev_addr = dma_addr;
bucket = get_hash_bucket(&ref, &flags);
-   entry = bucket_find_exact(bucket, &ref);
 
-   if (!entry)
-   goto out;
+   list_for_each_entry(entry, &bucket->list, list) {
+   if (!exact_match(&ref, entry))
+   continue;
+
+   /*
+* The same physical address can be mapped multiple
+* times. Without a hardware IOMMU this results in the
+* same device addresses being put into the dma-debug
+* hash multiple times too. This can result in false
+* positives being reported. Therefore we implement a
+* best-fit algorithm here which updates the first entry
+* from the hash which fits the reference value and is
+* not currently listed as being checked.
+*/
+   if (entry->map_err_type == MAP_ERR_NOT_CHECKED) {
+   entry->map_err_type = MAP_ERR_CHECKED;
+   break;
+   }
+   }
 
-   entry->map_err_type = MAP_ERR_CHECKED;
-out:
put_hash_bucket(bucket, &flags);
 }
 EXPORT_SYMBOL(debug_dma_mapping_error);



[PATCH 1/2] dma-debug: Fix locking bug in check_unmap

2013-03-18 Thread Alexander Duyck
In check_unmap it is possible to get into a deadlocked state if
dma_mapping_error is called.  The problem is that the bucket is locked in
check_unmap, and locked again by debug_dma_mapping_error, which is called by
dma_mapping_error.  To resolve that we must release the lock on the bucket
before making the call to dma_mapping_error.

Signed-off-by: Alexander Duyck 
---
 lib/dma-debug.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 5e396ac..724bd4d 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -862,17 +862,18 @@ static void check_unmap(struct dma_debug_entry *ref)
entry = bucket_find_exact(bucket, ref);
 
if (!entry) {
+   /* must drop lock before calling dma_mapping_error */
+   put_hash_bucket(bucket, &flags);
+
if (dma_mapping_error(ref->dev, ref->dev_addr)) {
err_printk(ref->dev, NULL,
-  "DMA-API: device driver tries "
-  "to free an invalid DMA memory address\n");
-   return;
+  "DMA-API: device driver tries to free an 
invalid DMA memory address\n");
+   } else {
+   err_printk(ref->dev, NULL,
+  "DMA-API: device driver tries to free DMA 
memory it has not allocated [device address=0x%016llx] [size=%llu bytes]\n",
+  ref->dev_addr, ref->size);
}
-   err_printk(ref->dev, NULL, "DMA-API: device driver tries "
-  "to free DMA memory it has not allocated "
-  "[device address=0x%016llx] [size=%llu bytes]\n",
-  ref->dev_addr, ref->size);
-   goto out;
+   return;
}
 
if (ref->size != entry->size) {
@@ -936,7 +937,6 @@ static void check_unmap(struct dma_debug_entry *ref)
hash_bucket_del(entry);
dma_entry_free(entry);
 
-out:
put_hash_bucket(bucket, &flags);
 }
 



Re: [PATCH 5/5] ixgbe: add driver set_max_vfs support

2012-10-03 Thread Alexander Duyck
On 10/03/2012 10:51 AM, Yinghai Lu wrote:
> Need ixgbe guys to close the loop to use set_max_vfs instead
> kernel parameters.
>
> Signed-off-by: Yinghai Lu 
> Cc: Jeff Kirsher 
> Cc: Jesse Brandeburg 
> Cc: Greg Rose 
> Cc: "David S. Miller" 
> Cc: John Fastabend 
> Cc: e1000-de...@lists.sourceforge.net
> Cc: net...@vger.kernel.org
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h  |2 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 +++-
>  2 files changed, 37 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index b9623e9..d39d975 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -558,6 +558,8 @@ struct ixgbe_adapter {
>   u32 interrupt_event;
>   u32 led_reg;
>  
> + struct ixgbe_info *ixgbe_info;
> +
>  #ifdef CONFIG_IXGBE_PTP
>   struct ptp_clock *ptp_clock;
>   struct ptp_clock_info ptp_caps;
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index ee61819..1c097c7 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -129,13 +129,6 @@ static struct notifier_block dca_notifier = {
>  };
>  #endif
>  
> -#ifdef CONFIG_PCI_IOV
> -static unsigned int max_vfs;
> -module_param(max_vfs, uint, 0);
> -MODULE_PARM_DESC(max_vfs,
> -  "Maximum number of virtual functions to allocate per physical 
> function - default is zero and maximum value is 63");
> -#endif /* CONFIG_PCI_IOV */
> -
>  static unsigned int allow_unsupported_sfp;
>  module_param(allow_unsupported_sfp, uint, 0);
>  MODULE_PARM_DESC(allow_unsupported_sfp,
> @@ -4496,7 +4489,7 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter 
> *adapter)
>  #ifdef CONFIG_PCI_IOV
>   /* assign number of SR-IOV VFs */
>   if (hw->mac.type != ixgbe_mac_82598EB)
> - adapter->num_vfs = (max_vfs > 63) ? 0 : max_vfs;
> + adapter->num_vfs = min_t(int, pdev->max_vfs, 63);
>  
>  #endif
>   /* enable itr by default in dynamic mode */
> @@ -7220,8 +7213,9 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
>  
>  #ifdef CONFIG_PCI_IOV
>   ixgbe_enable_sriov(adapter, ii);
> -
>  #endif
> + adapter->ixgbe_info = ii;
> +
>   netdev->features = NETIF_F_SG |
>  NETIF_F_IP_CSUM |
>  NETIF_F_IPV6_CSUM |
> @@ -7683,11 +7677,43 @@ static const struct pci_error_handlers 
> ixgbe_err_handler = {
>   .resume = ixgbe_io_resume,
>  };
>  
> +static void ixgbe_set_max_vfs(struct pci_dev *pdev)
> +{
> +#ifdef CONFIG_PCI_IOV
> + struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
> + struct ixgbe_hw *hw = &adapter->hw;
> + int num_vfs = 0;
> +
> + /* assign number of SR-IOV VFs */
> + if (hw->mac.type != ixgbe_mac_82598EB)
> + num_vfs = min_t(int, pdev->max_vfs, 63);
> +
> + /* no change */
> + if (adapter->num_vfs == num_vfs)
> + return;
> +
> + if (!num_vfs) {
> + /* disable sriov */
> + ixgbe_disable_sriov(adapter);
> + adapter->num_vfs = 0;
> + } else if (!adapter->num_vfs && num_vfs) {
> + /* enable sriov */
> + adapter->num_vfs = num_vfs;
> + ixgbe_enable_sriov(adapter, adapter->ixgbe_info);
> + } else {
> + /* increase or decrease */
> + }
> +
> + pdev->max_vfs = adapter->num_vfs;
> +#endif
> +}
> +
>  static struct pci_driver ixgbe_driver = {
>   .name = ixgbe_driver_name,
>   .id_table = ixgbe_pci_tbl,
>   .probe= ixgbe_probe,
>   .remove   = __devexit_p(ixgbe_remove),
> + .set_max_vfs = ixgbe_set_max_vfs,
>  #ifdef CONFIG_PM
>   .suspend  = ixgbe_suspend,
>   .resume   = ixgbe_resume,

The ixgbe_set_max_vfs function has several issues.  The two big ones are
that this function assumes it can simply enable/disable SR-IOV without any
other changes being necessary, which is not the case.  I would recommend
looking at ixgbe_setup_tc for how to do this properly.  The second is that
this code will change the PF network device, and as such those sections of
the code should be called with the RTNL lock held.  In addition, I believe
you have to disable SR-IOV before enabling it again with a different number
of VFs.  A rough sketch of what I mean follows.
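
A rough, untested sketch of the direction I am suggesting, reusing the
helpers from the patch above (treat this as illustration, not a tested
implementation):

static void ixgbe_set_max_vfs(struct pci_dev *pdev)
{
#ifdef CONFIG_PCI_IOV
	struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
	struct ixgbe_hw *hw = &adapter->hw;
	int num_vfs = 0;

	/* assign number of SR-IOV VFs */
	if (hw->mac.type != ixgbe_mac_82598EB)
		num_vfs = min_t(int, pdev->max_vfs, 63);

	/* no change */
	if (adapter->num_vfs == num_vfs)
		return;

	/* the PF netdev is reconfigured below, so hold the RTNL lock */
	rtnl_lock();

	/* always tear down first; changing the VF count requires a full
	 * disable/enable cycle, similar to what ixgbe_setup_tc does */
	ixgbe_disable_sriov(adapter);
	adapter->num_vfs = 0;

	if (num_vfs) {
		adapter->num_vfs = num_vfs;
		ixgbe_enable_sriov(adapter, adapter->ixgbe_info);
	}

	rtnl_unlock();

	pdev->max_vfs = adapter->num_vfs;
#endif
}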

Below is a link to one of the early patches for igb from when we were first
introducing SR-IOV and the in-driver sysfs value had been rejected.  I figure
it might be useful as it was also using sysfs to enable/disable VFs.
However, it doesn't have the correct locking for changing the queues, and as
such would likely throw an error if you were to implement it the same way
now:
http://lists.openwall.net/netdev/2009/04/08/34

Thanks,

Alex

[RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

2012-10-03 Thread Alexander Duyck
While working on 10Gb/s routing performance I found a significant amount of
time was being spent in the swiotlb DMA handler.  Further digging found that a
significant amount of this was due to virtual-to-physical address translation
and the cost of calling the function that performed it.  It accounted for
nearly 60% of the total overhead.

This patch set works to resolve that by changing the io_tlb_start address and
io_tlb_overflow_buffer address from virtual addresses to physical addresses.
By doing this, devices that are not making use of bounce buffers can
significantly reduce their overhead.  In addition, I followed through with the
cleanup to the point that the only functions that really require the virtual
address of the DMA buffer are the init, free, and bounce functions.  A
before/after sketch of the central check is below.
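
To give a feel for the change, here is is_swiotlb_buffer before the series
and at the end of it (a sketch assembled from the patches that follow):

/* before: every DMA unmap/sync pays for two virt_to_phys() calls */
static int is_swiotlb_buffer(phys_addr_t paddr)
{
	return paddr >= virt_to_phys(io_tlb_start) &&
	       paddr < virt_to_phys(io_tlb_end);
}

/* after: io_tlb_start is a phys_addr_t, so the check is plain arithmetic
 * and small enough for the compiler to inline */
static int is_swiotlb_buffer(phys_addr_t paddr)
{
	return paddr >= io_tlb_start &&
	       paddr < io_tlb_start + (io_tlb_nslabs << IO_TLB_SHIFT);
}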

When running a routing throughput test using small packets I saw roughly a 5%
increase in packet rates after applying these patches.  This appears to match
up with the CPU overhead reduction I was tracking via perf.

Before:
Results 10.29Mps
# Overhead  Symbol
# ........  ......
#
     1.97%  [k] __phys_addr
            |
            |--24.97%-- swiotlb_sync_single
            |--16.55%-- is_swiotlb_buffer
            |--11.25%-- unmap_single
             --2.71%-- swiotlb_dma_mapping_error
     1.66%  [k] swiotlb_sync_single
     1.45%  [k] is_swiotlb_buffer
     0.53%  [k] unmap_single
     0.52%  [k] swiotlb_map_page
     0.47%  [k] swiotlb_sync_single_for_device
     0.43%  [k] swiotlb_sync_single_for_cpu
     0.42%  [k] swiotlb_dma_mapping_error
     0.34%  [k] swiotlb_unmap_page

After:
Results 10.99Mps
# Overhead  Symbol
# ........  ......
#
     0.50%  [k] swiotlb_map_page
     0.50%  [k] swiotlb_sync_single
     0.36%  [k] swiotlb_sync_single_for_cpu
     0.35%  [k] swiotlb_sync_single_for_device
     0.25%  [k] swiotlb_unmap_page
     0.17%  [k] swiotlb_dma_mapping_error

---

Alexander Duyck (7):
  swiotlb: Do not export swiotlb_bounce since there are no external consumers
  swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single
  swiotlb: Use physical addresses for swiotlb_tbl_unmap_single
  swiotlb: Return physical addresses when calling swiotlb_tbl_map_single
  swiotlb: Make io_tlb_overflow_buffer a physical address
  swiotlb: Make io_tlb_start a physical address instead of a virtual address
  swiotlb: Instead of tracking the end of the swiotlb region just calculate it


 drivers/xen/swiotlb-xen.c |   25 ++---
 include/linux/swiotlb.h   |   20 ++--
 lib/swiotlb.c |  247 +++--
 3 files changed, 150 insertions(+), 142 deletions(-)



[RFC PATCH 1/7] swiotlb: Instead of tracking the end of the swiotlb region just calculate it

2012-10-03 Thread Alexander Duyck
In the case of swiotlb we already have the start of the region and the number
of slabs that give us the region size.  Instead of having to call
virt_to_phys on two pointers, we can take advantage of the fact that the
region is linear and simply compute the end as the start plus the size.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   25 ++++++++++++-------------
 1 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index f114bf6..5cc4d4e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -57,11 +57,11 @@ int swiotlb_force;
 * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
  * API.
  */
-static char *io_tlb_start, *io_tlb_end;
+static char *io_tlb_start;
 
 /*
- * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
- * io_tlb_end.  This is command line adjustable via setup_io_tlb_npages.
+ * The number of IO TLB blocks (in groups of 64).
+ * This is command line adjustable via setup_io_tlb_npages.
  */
 static unsigned long io_tlb_nslabs;
 
@@ -128,11 +128,11 @@ void swiotlb_print_info(void)
phys_addr_t pstart, pend;
 
pstart = virt_to_phys(io_tlb_start);
-   pend = virt_to_phys(io_tlb_end);
+   pend = pstart + bytes;
 
printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
   (unsigned long long)pstart, (unsigned long long)pend - 1,
-  bytes >> 20, io_tlb_start, io_tlb_end - 1);
+  bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
 }
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
@@ -143,12 +143,10 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 
io_tlb_nslabs = nslabs;
io_tlb_start = tlb;
-   io_tlb_end = io_tlb_start + bytes;
 
/*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
-* between io_tlb_start and io_tlb_end.
 */
io_tlb_list = alloc_bootmem_pages(PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
for (i = 0; i < io_tlb_nslabs; i++)
@@ -254,14 +252,12 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 
io_tlb_nslabs = nslabs;
io_tlb_start = tlb;
-   io_tlb_end = io_tlb_start + bytes;
 
memset(io_tlb_start, 0, bytes);
 
/*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
-* between io_tlb_start and io_tlb_end.
 */
io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
  get_order(io_tlb_nslabs * sizeof(int)));
@@ -304,7 +300,6 @@ cleanup3:
 sizeof(int)));
io_tlb_list = NULL;
 cleanup2:
-   io_tlb_end = NULL;
io_tlb_start = NULL;
io_tlb_nslabs = 0;
return -ENOMEM;
@@ -339,8 +334,10 @@ void __init swiotlb_free(void)
 
 static int is_swiotlb_buffer(phys_addr_t paddr)
 {
-   return paddr >= virt_to_phys(io_tlb_start) &&
-   paddr < virt_to_phys(io_tlb_end);
+   phys_addr_t swiotlb_start = virt_to_phys(io_tlb_start);
+
+   return paddr >= swiotlb_start &&
+   paddr < (swiotlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
 }
 
 /*
@@ -938,6 +935,8 @@ EXPORT_SYMBOL(swiotlb_dma_mapping_error);
 int
 swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-   return swiotlb_virt_to_bus(hwdev, io_tlb_end - 1) <= mask;
+   unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
+
+   return swiotlb_virt_to_bus(hwdev, io_tlb_start + bytes - 1) <= mask;
 }
 EXPORT_SYMBOL(swiotlb_dma_supported);



[RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

2012-10-03 Thread Alexander Duyck
This change makes it so that io_tlb_start contains a physical address instead
of a virtual address.  The advantage to this is that we can avoid costly
translations between virtual and physical addresses when comparing the
io_tlb_start against DMA addresses.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   61 +++++++++++++++++++++++++++++++------------------------------
 1 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 5cc4d4e..02abb72 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -57,7 +57,7 @@ int swiotlb_force;
 * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
  * API.
  */
-static char *io_tlb_start;
+phys_addr_t io_tlb_start;
 
 /*
  * The number of IO TLB blocks (in groups of 64).
@@ -125,14 +125,15 @@ static dma_addr_t swiotlb_virt_to_bus(struct device *hwdev,
 void swiotlb_print_info(void)
 {
unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
-   phys_addr_t pstart, pend;
+   unsigned char *vstart, *vend;
 
-   pstart = virt_to_phys(io_tlb_start);
-   pend = pstart + bytes;
+   vstart = phys_to_virt(io_tlb_start);
+   vend = vstart + bytes;
 
printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
-  (unsigned long long)pstart, (unsigned long long)pend - 1,
-  bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
+  (unsigned long long)io_tlb_start,
+  (unsigned long long)io_tlb_start + bytes - 1,
+  bytes >> 20, vstart, vend - 1);
 }
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
@@ -142,7 +143,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
bytes = nslabs << IO_TLB_SHIFT;
 
io_tlb_nslabs = nslabs;
-   io_tlb_start = tlb;
+   io_tlb_start = __pa(tlb);
 
/*
 * Allocate and initialize the free list array.  This array is used
@@ -171,6 +172,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 static void __init
 swiotlb_init_with_default_size(size_t default_size, int verbose)
 {
+   unsigned char *vstart;
unsigned long bytes;
 
if (!io_tlb_nslabs) {
@@ -183,11 +185,11 @@ swiotlb_init_with_default_size(size_t default_size, int verbose)
/*
 * Get IO TLB memory from the low pages
 */
-   io_tlb_start = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
-   if (!io_tlb_start)
+   vstart = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
+   if (!vstart)
panic("Cannot allocate SWIOTLB buffer");
 
-   swiotlb_init_with_tbl(io_tlb_start, io_tlb_nslabs, verbose);
+   swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose);
 }
 
 void __init
@@ -205,6 +207,7 @@ int
 swiotlb_late_init_with_default_size(size_t default_size)
 {
unsigned long bytes, req_nslabs = io_tlb_nslabs;
+   unsigned char *vstart = NULL;
unsigned int order;
int rc = 0;
 
@@ -221,14 +224,14 @@ swiotlb_late_init_with_default_size(size_t default_size)
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
 
while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-   io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
-   order);
-   if (io_tlb_start)
+   vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
+ order);
+   if (vstart)
break;
order--;
}
 
-   if (!io_tlb_start) {
+   if (!vstart) {
io_tlb_nslabs = req_nslabs;
return -ENOMEM;
}
@@ -237,9 +240,9 @@ swiotlb_late_init_with_default_size(size_t default_size)
   "for software IO TLB\n", (PAGE_SIZE << order) >> 20);
io_tlb_nslabs = SLABS_PER_PAGE << order;
}
-   rc = swiotlb_late_init_with_tbl(io_tlb_start, io_tlb_nslabs);
+   rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs);
if (rc)
-   free_pages((unsigned long)io_tlb_start, order);
+   free_pages((unsigned long)vstart, order);
return rc;
 }
 
@@ -251,9 +254,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
bytes = nslabs << IO_TLB_SHIFT;
 
io_tlb_nslabs = nslabs;
-   io_tlb_start = tlb;
+   io_tlb_start = virt_to_phys(tlb);
 
-   memset(io_tlb_start, 0, bytes);
+   memset(tlb, 0, bytes);
 
/*
 * Allocate and initialize the free list array.  This array is used
@@ -300,7 +303,7 @@ cleanup3:
 sizeof(int)));
io_tlb_list = NULL;
 cleanup2:
-   io_tlb_start = NULL;
+   io_tlb_start = 0;
io

[RFC PATCH 4/7] swiotlb: Return physical addresses when calling swiotlb_tbl_map_single

2012-10-03 Thread Alexander Duyck
This change makes it so that swiotlb_tbl_map_single will return a physical
address instead of a virtual address when called.  The advantage to this once
again is that we are avoiding a number of virt_to_phys and phys_to_virt
translations by working with everything as a physical address.

One change I had to make in order to support using physical addresses is that
I could no longer trust 0 to be an invalid physical address on all platforms.
So instead I made it so that ~0 is returned on error.  This should never be a
valid return value as it implies that only one byte would be available for
use.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |   22 +++---
 include/linux/swiotlb.h   |   11 +--
 lib/swiotlb.c |   73 +++--
 3 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 58db6df..8a6035a 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -338,9 +338,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
enum dma_data_direction dir,
struct dma_attrs *attrs)
 {
-   phys_addr_t phys = page_to_phys(page) + offset;
+   phys_addr_t map, phys = page_to_phys(page) + offset;
dma_addr_t dev_addr = xen_phys_to_bus(phys);
-   void *map;
 
BUG_ON(dir == DMA_NONE);
/*
@@ -356,16 +355,16 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
 * Oh well, have to allocate and map a bounce buffer.
 */
map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
-   if (!map)
+   if (map == SWIOTLB_MAP_ERROR)
return DMA_ERROR_CODE;
 
-   dev_addr = xen_virt_to_bus(map);
+   dev_addr = xen_phys_to_bus(map);
 
/*
 * Ensure that the address returned is DMA'ble
 */
if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, map, size, dir);
+   swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
dev_addr = 0;
}
return dev_addr;
@@ -494,11 +493,12 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
if (swiotlb_force ||
!dma_capable(hwdev, dev_addr, sg->length) ||
range_straddles_page_boundary(paddr, sg->length)) {
-   void *map = swiotlb_tbl_map_single(hwdev,
-  start_dma_addr,
-  sg_phys(sg),
-  sg->length, dir);
-   if (!map) {
+   phys_addr_t map = swiotlb_tbl_map_single(hwdev,
+start_dma_addr,
+sg_phys(sg),
+sg->length,
+dir);
+   if (map == SWIOTLB_MAP_ERROR) {
/* Don't panic here, we expect map_sg users
   to do proper error handling. */
xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
@@ -506,7 +506,7 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
sgl[0].dma_length = 0;
return DMA_ERROR_CODE;
}
-   sg->dma_address = xen_virt_to_bus(map);
+   sg->dma_address = xen_phys_to_bus(map);
} else
sg->dma_address = dev_addr;
sg->dma_length = sg->length;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8d08b3e..1995f3e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -34,9 +34,14 @@ enum dma_sync_target {
SYNC_FOR_CPU = 0,
SYNC_FOR_DEVICE = 1,
 };
-extern void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
-   phys_addr_t phys, size_t size,
-   enum dma_data_direction dir);
+
+/* define the last possible byte of physical address space as a mapping error */
+#define SWIOTLB_MAP_ERROR (~(phys_addr_t)0x0)
+
+extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
+ dma_addr_t tbl_dma_addr,
+ phys_addr_t phys, size_t size,
+ enum dma_data_direction dir);
 
 extern void swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr,
 siz

[RFC PATCH 5/7] swiotlb: Use physical addresses for swiotlb_tbl_unmap_single

2012-10-03 Thread Alexander Duyck
This change makes it so that the unmap functionality also uses physical
addresses.  This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |4 ++--
 include/linux/swiotlb.h   |3 ++-
 lib/swiotlb.c |   35 ++++++++++++++++++-----------------
 3 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 8a6035a..4cedc28 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -364,7 +364,7 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
 * Ensure that the address returned is DMA'ble
 */
if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
+   swiotlb_tbl_unmap_single(dev, map, size, dir);
dev_addr = 0;
}
return dev_addr;
@@ -388,7 +388,7 @@ static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr,
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
-   swiotlb_tbl_unmap_single(hwdev, phys_to_virt(paddr), size, dir);
+   swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
return;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 1995f3e..5a5a654 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -43,7 +43,8 @@ extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
  phys_addr_t phys, size_t size,
  enum dma_data_direction dir);
 
-extern void swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr,
+extern void swiotlb_tbl_unmap_single(struct device *hwdev,
+phys_addr_t dma_addr,
 size_t size, enum dma_data_direction dir);
 
 extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 55e052e..41e1d9a 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -510,20 +510,20 @@ phys_addr_t map_single(struct device *hwdev, phys_addr_t phys, size_t size,
 /*
  * dma_addr is the kernel virtual address of the bounce buffer to unmap.
  */
-void
-swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size,
-   enum dma_data_direction dir)
+void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t dma_addr,
+ size_t size, enum dma_data_direction dir)
 {
unsigned long flags;
int i, count, nslots = ALIGN(size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
-   int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;
+   int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
phys_addr_t phys = io_tlb_orig_addr[index];
 
/*
 * First, sync the memory before unmapping the entry
 */
if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-   swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+   swiotlb_bounce(phys, phys_to_virt(dma_addr),
+  size, DMA_FROM_DEVICE);
 
/*
 * Return the buffer to the free list by setting the corresponding
@@ -616,17 +616,18 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
 
ret = phys_to_virt(paddr);
dev_addr = phys_to_dma(hwdev, paddr);
-   }
 
-   /* Confirm address can be DMA'd by device */
-   if (dev_addr + size - 1 > dma_mask) {
-   printk("hwdev DMA mask = 0x%016Lx, dev_addr = 0x%016Lx\n",
-  (unsigned long long)dma_mask,
-  (unsigned long long)dev_addr);
+   /* Confirm address can be DMA'd by device */
+   if (dev_addr + size - 1 > dma_mask) {
+   printk("hwdev DMA mask = 0x%016Lx, dev_addr = 
0x%016Lx\n",
+  (unsigned long long)dma_mask,
+  (unsigned long long)dev_addr);
 
-   /* DMA_TO_DEVICE to avoid memcpy in unmap_single */
-   swiotlb_tbl_unmap_single(hwdev, ret, size, DMA_TO_DEVICE);
-   return NULL;
+   /* DMA_TO_DEVICE to avoid memcpy in unmap_single */
+   swiotlb_tbl_unmap_single(hwdev, paddr,
+size, DMA_TO_DEVICE);
+   return NULL;
+   }
}
 
*dma_handle = dev_addr;
@@ -647,7 +648,7 @@ swiotlb_free_coherent(struct device *hwdev, size_t size, void *vaddr,
free_pages((unsigned long)vaddr, get_order(size));
else
/* DMA_TO_DEVICE to avoid memcpy in s

[RFC PATCH 6/7] swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single

2012-10-03 Thread Alexander Duyck
This change makes it so that the sync functionality also uses physical
addresses.  This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |3 +--
 include/linux/swiotlb.h   |3 ++-
 lib/swiotlb.c |   18 +++++++++---------
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4cedc28..af47e75 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -433,8 +433,7 @@ xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
-   swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
-  target);
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
return;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 5a5a654..ba1bd38 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -47,7 +47,8 @@ extern void swiotlb_tbl_unmap_single(struct device *hwdev,
 phys_addr_t dma_addr,
 size_t size, enum dma_data_direction dir);
 
-extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
+extern void swiotlb_tbl_sync_single(struct device *hwdev,
+   phys_addr_t dma_addr,
size_t size, enum dma_data_direction dir,
enum dma_sync_target target);
 
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 41e1d9a..7cfe850 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -552,12 +552,11 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t dma_addr,
 }
 EXPORT_SYMBOL_GPL(swiotlb_tbl_unmap_single);
 
-void
-swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
-   enum dma_data_direction dir,
-   enum dma_sync_target target)
+void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t dma_addr,
+size_t size, enum dma_data_direction dir,
+enum dma_sync_target target)
 {
-   int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;
+   int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
phys_addr_t phys = io_tlb_orig_addr[index];
 
phys += ((unsigned long)dma_addr & ((1 << IO_TLB_SHIFT) - 1));
@@ -565,13 +564,15 @@ swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+   swiotlb_bounce(phys, phys_to_virt(dma_addr),
+  size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+   swiotlb_bounce(phys, phys_to_virt(dma_addr),
+  size, DMA_TO_DEVICE);
else
BUG_ON(dir != DMA_FROM_DEVICE);
break;
@@ -780,8 +781,7 @@ swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
BUG_ON(dir == DMA_NONE);
 
if (is_swiotlb_buffer(paddr)) {
-   swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
-  target);
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
return;
}
 



[RFC PATCH 7/7] swiotlb: Do not export swiotlb_bounce since there are no external consumers

2012-10-03 Thread Alexander Duyck
Currently swiotlb is the only consumer of swiotlb_bounce.  Since that is the
case it doesn't make much sense to be exporting it, so make it a static
function only.

In addition we can save a few more lines of code by making it so that it
accepts the DMA address as a physical address instead of a virtual one.  This
is the last piece in essentially pushing all of the DMA address values to use
physical addresses in swiotlb.

Signed-off-by: Alexander Duyck 
---

 include/linux/swiotlb.h |3 ---
 lib/swiotlb.c   |   30 +++++++++++++-----------------
 2 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ba1bd38..8e635d1 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -53,9 +53,6 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev,
enum dma_sync_target target);
 
 /* Accessory functions. */
-extern void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
-  enum dma_data_direction dir);
-
 extern void
 *swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_addr_t *dma_handle, gfp_t flags);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 7cfe850..a2ad781 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -351,10 +351,11 @@ static int is_swiotlb_buffer(phys_addr_t paddr)
 /*
  * Bounce: copy the swiotlb buffer back to the original dma location
  */
-void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
-   enum dma_data_direction dir)
+static void swiotlb_bounce(phys_addr_t phys, phys_addr_t dma_addr,
+  size_t size, enum dma_data_direction dir)
 {
unsigned long pfn = PFN_DOWN(phys);
+   unsigned char *vaddr = phys_to_virt(dma_addr);
 
if (PageHighMem(pfn_to_page(pfn))) {
/* The buffer does not have a mapping.  Map it in and copy */
@@ -369,25 +370,23 @@ void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
local_irq_save(flags);
buffer = kmap_atomic(pfn_to_page(pfn));
if (dir == DMA_TO_DEVICE)
-   memcpy(dma_addr, buffer + offset, sz);
+   memcpy(vaddr, buffer + offset, sz);
else
-   memcpy(buffer + offset, dma_addr, sz);
+   memcpy(buffer + offset, vaddr, sz);
kunmap_atomic(buffer);
local_irq_restore(flags);
 
size -= sz;
pfn++;
-   dma_addr += sz;
+   vaddr += sz;
offset = 0;
}
+   } else if (dir == DMA_TO_DEVICE) {
+   memcpy(vaddr, phys_to_virt(phys), size);
} else {
-   if (dir == DMA_TO_DEVICE)
-   memcpy(dma_addr, phys_to_virt(phys), size);
-   else
-   memcpy(phys_to_virt(phys), dma_addr, size);
+   memcpy(phys_to_virt(phys), vaddr, size);
}
 }
-EXPORT_SYMBOL_GPL(swiotlb_bounce);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
   dma_addr_t tbl_dma_addr,
@@ -489,7 +488,7 @@ found:
for (i = 0; i < nslots; i++)
io_tlb_orig_addr[index+i] = phys + (i << IO_TLB_SHIFT);
if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
-   swiotlb_bounce(phys, phys_to_virt(dma_addr), size, DMA_TO_DEVICE);
+   swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
 
return dma_addr;
 }
@@ -522,8 +521,7 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t dma_addr,
 * First, sync the memory before unmapping the entry
 */
if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-   swiotlb_bounce(phys, phys_to_virt(dma_addr),
-  size, DMA_FROM_DEVICE);
+   swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
 
/*
 * Return the buffer to the free list by setting the corresponding
@@ -564,15 +562,13 @@ void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t dma_addr,
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(phys, phys_to_virt(dma_addr),
-  size, DMA_FROM_DEVICE);
+   swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(phys,

[RFC PATCH 3/7] swiotlb: Make io_tlb_overflow_buffer a physical address

2012-10-03 Thread Alexander Duyck
This change makes it so that we can avoid virt_to_phys overhead when using the
io_tlb_overflow_buffer.  My original plan was to completely remove the value
and replace it with a constant, but I had seen that there were recent patches
stating this couldn't be done until all device drivers that depended on that
functionality were updated.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   61 ++++++++++++++++++++++++++++++++++---------------------------
 1 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 02abb72..62848fb 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -70,7 +70,7 @@ static unsigned long io_tlb_nslabs;
  */
 static unsigned long io_tlb_overflow = 32*1024;
 
-static void *io_tlb_overflow_buffer;
+phys_addr_t io_tlb_overflow_buffer;
 
 /*
  * This is a free list describing the number of free entries available from
@@ -138,6 +138,7 @@ void swiotlb_print_info(void)
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 {
+   void *v_overflow_buffer;
unsigned long i, bytes;
 
bytes = nslabs << IO_TLB_SHIFT;
@@ -146,6 +147,15 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
io_tlb_start = __pa(tlb);
 
/*
+* Get the overflow emergency buffer
+*/
+   v_overflow_buffer = alloc_bootmem_low_pages(PAGE_ALIGN(io_tlb_overflow));
+   if (!v_overflow_buffer)
+   panic("Cannot allocate SWIOTLB overflow buffer!\n");
+
+   io_tlb_overflow_buffer = __pa(v_overflow_buffer);
+
+   /*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 */
@@ -155,12 +165,6 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
io_tlb_index = 0;
io_tlb_orig_addr = alloc_bootmem_pages(PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));
 
-   /*
-* Get the overflow emergency buffer
-*/
-   io_tlb_overflow_buffer = alloc_bootmem_low_pages(PAGE_ALIGN(io_tlb_overflow));
-   if (!io_tlb_overflow_buffer)
-   panic("Cannot allocate SWIOTLB overflow buffer!\n");
if (verbose)
swiotlb_print_info();
 }
@@ -250,6 +254,7 @@ int
 swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 {
unsigned long i, bytes;
+   unsigned char *v_overflow_buffer;
 
bytes = nslabs << IO_TLB_SHIFT;
 
@@ -259,13 +264,23 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
memset(tlb, 0, bytes);
 
/*
+* Get the overflow emergency buffer
+*/
+   v_overflow_buffer = (void *)__get_free_pages(GFP_DMA,
+   get_order(io_tlb_overflow));
+   if (!v_overflow_buffer)
+   goto cleanup2;
+
+   io_tlb_overflow_buffer = virt_to_phys(v_overflow_buffer);
+
+   /*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 */
io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
  get_order(io_tlb_nslabs * sizeof(int)));
if (!io_tlb_list)
-   goto cleanup2;
+   goto cleanup3;
 
for (i = 0; i < io_tlb_nslabs; i++)
io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
@@ -276,18 +291,10 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 get_order(io_tlb_nslabs *
   sizeof(phys_addr_t)));
if (!io_tlb_orig_addr)
-   goto cleanup3;
+   goto cleanup4;
 
memset(io_tlb_orig_addr, 0, io_tlb_nslabs * sizeof(phys_addr_t));
 
-   /*
-* Get the overflow emergency buffer
-*/
-   io_tlb_overflow_buffer = (void *)__get_free_pages(GFP_DMA,
- get_order(io_tlb_overflow));
-   if (!io_tlb_overflow_buffer)
-   goto cleanup4;
-
swiotlb_print_info();
 
late_alloc = 1;
@@ -295,13 +302,13 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
return 0;
 
 cleanup4:
-   free_pages((unsigned long)io_tlb_orig_addr,
-  get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
-   io_tlb_orig_addr = NULL;
-cleanup3:
free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
 sizeof(int)));
io_tlb_list = NULL;
+cleanup3:
+   free_pages((unsigned long)v_overflow_buffer,
+  get_order(io_tlb_overflow));
+   io_tlb_overflow_buffer = 0;
 cleanup2:
io_tlb_start = 0;
io_tlb_nslabs = 0;
@@ -310,11 +317,11 @@ cleanup2:
 
 void

Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

2012-10-04 Thread Alexander Duyck
On 10/04/2012 05:55 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 03, 2012 at 05:38:41PM -0700, Alexander Duyck wrote:
>> While working on 10Gb/s routing performance I found a significant amount of
>> time was being spent in the swiotlb DMA handler.  Further digging found that 
>> a
>> significant amount of this was due to the fact that virtual to physical
>> address translation and calling the function that did it.  It accounted for
>> nearly 60% of the total overhead.
>>
>> This patch set works to resolve that by changing the io_tlb_start address and
>> io_tlb_overflow_buffer address from virtual addresses to physical addresses.
>> By doing this, devices that are not making use of bounce buffers can
>> significantly reduce their overhead.  In addition I followed through with the
> .. but are still using SWIOTLB for their DMA operations, right?
>

That is correct.  I tested with the bounce buffers in use as well, but
didn't really see any difference since almost all of the overhead was
due to the locking required in obtaining and releasing the bounce
buffers in map/unmap calls.

Thanks,

Alex


Re: [RFC PATCH 1/7] swiotlb: Instead of tracking the end of the swiotlb region just calculate it

2012-10-04 Thread Alexander Duyck
On 10/04/2012 06:01 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 03, 2012 at 05:38:47PM -0700, Alexander Duyck wrote:
>> In the case of swiotlb we already have the start of the region and the number
>> of slabs that give us the region size.  Instead of having to call
>> virt_to_phys on two pointers we can just take advantage of the fact that the
>> region is linear and just compute the end based on the start plus the size.
> Why not take advantage of 'the fact that the region is linear' and just
> pre-compute the end in swiotlb_init_with_tbl?
>
> That way the logic in is_swiotlb_buffer is even simpler?
>

Using a pre-computed end point based on a virtual address is more
expensive in the x86_64 case.  Each call to __phys_addr requires a
separate function call.  By just using the physical address of the start
and adding the offset I can avoid the second call, and the compiler will
take advantage of the smaller function size.  The result is that
is_swiotlb_buffer will be inlined.

Thanks,

Alex


Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

2012-10-04 Thread Alexander Duyck
On 10/04/2012 06:18 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 03, 2012 at 05:38:53PM -0700, Alexander Duyck wrote:
>> This change makes it so that io_tlb_start contains a physical address instead
>> of a virtual address.  The advantage to this is that we can avoid costly
>> translations between virtual and physical addresses when comparing the
>> io_tlb_start against DMA addresses.
>>
>> Signed-off-by: Alexander Duyck 
>> ---
>>
>>  lib/swiotlb.c |   61 
>> +
>>  1 files changed, 31 insertions(+), 30 deletions(-)
>>
>> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
>> index 5cc4d4e..02abb72 100644
>> --- a/lib/swiotlb.c
>> +++ b/lib/swiotlb.c
>> @@ -57,7 +57,7 @@ int swiotlb_force;
>>   * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by 
>> this
>>   * API.
>>   */
>> -static char *io_tlb_start;
>> +phys_addr_t io_tlb_start;
>>  
>>  /*
>>   * The number of IO TLB blocks (in groups of 64).
>> @@ -125,14 +125,15 @@ static dma_addr_t swiotlb_virt_to_bus(struct device 
>> *hwdev,
>>  void swiotlb_print_info(void)
>>  {
>>  unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>> -phys_addr_t pstart, pend;
>> +unsigned char *vstart, *vend;
>>  
>> -pstart = virt_to_phys(io_tlb_start);
>> -pend = pstart + bytes;
>> +vstart = phys_to_virt(io_tlb_start);
>> +vend = vstart + bytes;
>>  
>>  printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) 
>> mapped at [%p-%p]\n",
>> -   (unsigned long long)pstart, (unsigned long long)pend - 1,
>> -   bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
>> +   (unsigned long long)io_tlb_start,
>> +   (unsigned long long)io_tlb_start + bytes - 1,
>> +   bytes >> 20, vstart, vend - 1);
>>  }
>>  
>>  void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int 
>> verbose)
>> @@ -142,7 +143,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned 
>> long nslabs, int verbose)
>>  bytes = nslabs << IO_TLB_SHIFT;
>>  
>>  io_tlb_nslabs = nslabs;
>> -io_tlb_start = tlb;
>> +io_tlb_start = __pa(tlb);
> Why not 'virt_to_phys' ?

I had originally done it as a virt_to_phys; however, I then noticed in
swiotlb_free that the bootmem was being converted to a physical address
via __pa.  I did a bit of digging and everything seemed to indicate that
the preferred approach in early boot to get a physical address was __pa,
so I decided to switch it from virt_to_phys to __pa for the early init
versions of the calls.  If virt_to_phys is preferred though I can switch
it back.

>>  
>>  /*
>>   * Allocate and initialize the free list array.  This array is used
>> @@ -171,6 +172,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned 
>> long nslabs, int verbose)
>>  static void __init
>>  swiotlb_init_with_default_size(size_t default_size, int verbose)
>>  {
>> +unsigned char *vstart;
>>  unsigned long bytes;
>>  
>>  if (!io_tlb_nslabs) {
>> @@ -183,11 +185,11 @@ swiotlb_init_with_default_size(size_t default_size, 
>> int verbose)
>>  /*
>>   * Get IO TLB memory from the low pages
>>   */
>> -io_tlb_start = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
>> -if (!io_tlb_start)
>> +vstart = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
>> +if (!vstart)
>>  panic("Cannot allocate SWIOTLB buffer");
>>  
>> -swiotlb_init_with_tbl(io_tlb_start, io_tlb_nslabs, verbose);
>> +swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose);
>>  }
>>  
>>  void __init
>> @@ -205,6 +207,7 @@ int
>>  swiotlb_late_init_with_default_size(size_t default_size)
>>  {
>>  unsigned long bytes, req_nslabs = io_tlb_nslabs;
>> +unsigned char *vstart = NULL;
>>  unsigned int order;
>>  int rc = 0;
>>  
>> @@ -221,14 +224,14 @@ swiotlb_late_init_with_default_size(size_t 
>> default_size)
>>  bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>>  
>>  while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
>> -io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
>> -order);
>> -if (io_tlb_start)
>> +vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
>> +  

Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

2012-10-04 Thread Alexander Duyck
On 10/04/2012 06:33 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 03, 2012 at 05:38:41PM -0700, Alexander Duyck wrote:
>> While working on 10Gb/s routing performance I found a significant amount of
>> time was being spent in the swiotlb DMA handler.  Further digging found that 
>> a
>> significant amount of this was due to the fact that virtual to physical
>> address translation and calling the function that did it.  It accounted for
>> nearly 60% of the total overhead.
>>
>> This patch set works to resolve that by changing the io_tlb_start address and
>> io_tlb_overflow_buffer address from virtual addresses to physical addresses.
> The assertion in your patches is that the DMA addresses (bus address)
> are linear is not applicable (unfortunatly). Meaning virt_to_phys() !=
> virt_to_dma().

That isn't my assertion.  My assertion is that virt_to_phys(x + y) ==
(virt_to_phys(x) + y).
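
In code form, the property being relied on is only this (a sketch; buf and
off are hypothetical names, and buf must be a direct-mapped kernel address):

	char *buf = some_direct_mapped_buffer;	/* hypothetical */
	size_t off = some_offset;		/* hypothetical */

	/* translation commutes with offsets inside one contiguous
	 * physical region, even though bus addresses may not */
	BUG_ON(virt_to_phys(buf + off) != virt_to_phys(buf) + off);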

> Now, on x86 and ia64 it is true - but one of the users of swiotlb
> library is the Xen swiotlb - which cannot guarantee that the physical
> address are 1-1 with the bus addresses. Hence the bulk of dealing with
> figuring out the right physical to bus address is done in Xen-SWIOTLB
> and the basics of finding an entry in the bounce buffer (if needed),
> mapping it, unmapping ,etc is carried out by the generic library.
>
> This is sad - b/c your patches are a good move forward.

I think if you take a second look you will find you might be taking
things one logical step too far.  From what I can tell my assertion is
correct.  I believe the bits you are talking about don't apply until you
use the xen_virt_to_bus/xen_phys_to_bus call, and the only difference
between those two calls is a virt_to_phys, which is what I am eliminating.

> Perhaps another way to do this is by having your patches split the
> lookups in "chunks". Wherein we validate in swiotlb_init_*tbl that the
> 'tbl' (so the bounce buffer) is linear - if not, we split it up in
> chunks. Perhaps the various backends can be responsible for this since
> they would know which of their memory regions are linear or not. But
> that sounds complicated and we don't want to complicate this library.
>
> Or another way would be to provide 'phys_to_virt' and 'virt_to_phys'
> functions for the swiotlb_tbl_{map|unmap}_single and the main library
> (lib/swiotlb.c) can decide to use them. If they are NULL, then it
> would do what your patches suggested. If they are defined, then
> carry out those lookups on the 'empty-to-be-used' bounce buffer
> address. Hm, that sounds like a better way of doing it.
>

I don't see any special phys_to_virt or virt_to_phys calls available for
Xen.  The only calls I do see are phys_to_machine and machine_to_phys,
which seem to be translating between physical addresses and those used
for DMA.  If that is the case I should be fine, because I am not going as
far as translating the io_tlb_start into a DMA address; I am only taking
it to a physical one.

I am not asserting that the DMA memory is contiguous.  I am asserting
that from the CPU perspective the physical memory is contiguous, which I
believe you already agreed with.  From what I can tell this should be
fine, since almost all of the virt_to_phys calls out there are just doing
offset manipulation and not breaking the memory up into discrete
chunks.  The sectioning up of the segments in Xen should happen after we
have taken care of the virt_to_phys work, so the bounce buffer case
should work out correctly.

Thanks,

Alex


Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

2012-10-04 Thread Alexander Duyck
On 10/04/2012 10:19 AM, Konrad Rzeszutek Wilk wrote:
>>>> @@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
>>>>	io_tlb_list[i] = 0;
>>>>	for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
>>>>		io_tlb_list[i] = ++count;
>>>> -	dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
>>>> +	dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);
>>> I think this is going to fall flat with the other user of
>>> swiotlb_tbl_map_single - Xen SWIOTLB. When it allocates the io_tlb_start
>>> and does it magic to make sure its under 4GB - the io_tlb_start swath
>>> of memory, ends up consisting of 2MB chunks of contingous spaces. But each
>>> chunk is not linearly in the DMA space (thought it is in the CPU space).
>>>
>>> Meaning the io_tlb_start region 0-2MB can fall within the DMA address space
>>> of 2048MB->2032MB, and io_tlb_start offset 2MB->4MB, can fall within 
>>> 1024MB->1026MB,
>>> and so on (depending on the availability of memory under 4GB).
>>>
>>> There is a clear virt_to_phys(x) != virt_to_dma(x).
>> Just to be sure I understand you are talking about DMA address space,
>> not physical address space correct?  I am fully aware that DMA address
>> space can be all over the place.  When I was writing the patch set the
>> big reason why I decided to stop at physical address space was because
>> DMA address spaces are device specific.
>>
>> I understand that virt_to_phys(x) != virt_to_dma(x) for many platforms,
>> however that is not my assertion.  My assertion is (virt_to_phys(x) + y)
>> == virt_to_phys(x + y).  This should be true for any large block of
>> contiguous memory that is DMA accessible since the CPU and the device
>> should be able to view the memory in the same layout.  If that wasn't
> That is true mostly for x86 but not all platforms do this.
>
>> true I don't think is_swiotlb_buffer would be working correctly since it
>> is essentially operating on the same assumption prior to my patches.
> There are two pieces here - the is_swiotlb_buffer and the 
> swiotlb_tbl_[map|unmap]
> functions.
>
> The is_swiotlb_buffer is operating on that principle (and your change
> to reflect that is OK). The swiotlb_tbl_[*] is not.
>> If you take a look at patches 4 and 5 I do address changes that end up
>> needing to be made to Xen SWIOTLB since it makes use of
>> swiotlb_tbl_map_single.  All that I effectively end up changing is that
>> instead of messing with a void pointer we instead are dealing with a
>> physical address, and instead of calling xen_virt_to_bus we end up
>> calling xen_phys_to_bus and thereby drop one extra virt_to_phys call in
>> the process.
> Sure that is OK. All of those changes when we bypass the bounce
> buffer look OK (thought I should double-check again the patch to make
> sure and also just take it for a little test spin).

I'm interested in finding out what the results of your test spin are.

> The issue is when we do _use_ the bounce buffer. At that point we
> run into the allocation from the bounce buffer where the patches
> assume that the 64MB swath of bounce buffer memory is bus (or DMA)
> memory contingous. And that is not the case sadly.

I think I understand what you are saying now.  However, I don't think
the issue applies to my patches.

If I am not mistaken what you are talking about is the pseudo-physical
memory versus machine memory.  I understand the 64MB block is not
machine-memory contiguous, but it should be pseudo-physical contiguous
memory.  As such using the pseudo-physical addresses instead of virtual
addresses should function the same way as using true physical addresses
to replace virtual addresses.  All of the physical memory translation to
machine memory translation is happening in xen_phys_to_bus and all of
the changes I have made take place before that so the bounce buffers
should still be working correctly.  In addition none of the changes I
have made change the bounce buffer boundary assumptions so we should
have no bounce buffers mapped across the 2MB boundaries.
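
For reference, the translation I am referring to is contained in the Xen
helpers, roughly like this (quoting from memory, so treat it as a sketch):

/* drivers/xen/swiotlb-xen.c, roughly: the pseudo-physical to machine
 * translation lives entirely in the phys-to-bus helper, which runs after
 * all of the virt_to_phys work these patches eliminate */
static dma_addr_t xen_phys_to_bus(phys_addr_t paddr)
{
	return phys_to_machine(XPADDR(paddr)).maddr;
}

static dma_addr_t xen_virt_to_bus(void *address)
{
	return xen_phys_to_bus(virt_to_phys(address));
}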

Thanks,

Alex


Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

2012-10-05 Thread Alexander Duyck
On 10/05/2012 09:55 AM, Andi Kleen wrote:
> Alexander Duyck  writes:
>
>> While working on 10Gb/s routing performance I found a significant amount of
>> time was being spent in the swiotlb DMA handler.  Further digging found that a
>> significant amount of this was due to virtual to physical address translation
>> and the cost of calling the function that did it.  It accounted for nearly
>> 60% of the total overhead.
> Can you find out why that is? Traditionally virt_to_phys was just a
> subtraction. Then later on it was an if and a subtraction.
>
> It cannot really be that expensive. Do you have some debugging enabled?
>
> Really virt_to_phys should be fixed. Such fundamental operations
> shouldn't be slow. I don't think hacking up all the users to work
> around this is the right way.
>
> Looking at the code a bit someone (crazy) made it out of line.
> But that cannot explain that much overhead.
>
>
> -Andi
>

I was thinking the issue was all of the calls to relatively small
functions occurring in quick succession.  The way most of this code is
set up, one small function calls another, which in turn calls another,
and I would imagine the resulting code fragmentation has a significant
negative impact.

For example, just the first patch in the series is enough to see a
significant performance gain, simply because is_swiotlb_buffer became
inlined when I built it on my system.  The basic idea with these patches
was to avoid making multiple calls in quick succession and instead have
all the data right there, so that the swiotlb functions don't need to
make many external calls, at least not until they are actually dealing
with bounce buffers, which are slower due to locking anyway.
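
To give a sense of scale, after the first patch the check boils down to a
simple range compare (sketch; the exact code is in the patch itself):

static int is_swiotlb_buffer(phys_addr_t paddr)
{
	phys_addr_t start = virt_to_phys(io_tlb_start);

	/* small enough for gcc to inline into the DMA fast paths */
	return paddr >= start &&
	       paddr < (start + (io_tlb_nslabs << IO_TLB_SHIFT));
}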

Thanks,

Alex


Re: [PATCH] x86: Improve 64 bit __phys_addr call performance

2012-10-24 Thread Alexander Duyck
On 10/24/2012 03:25 AM, Ingo Molnar wrote:
> * Alexander Duyck  wrote:
>
>> This patch is meant to improve overall system performance when 
>> making use of the __phys_addr call on 64 bit x86 systems.  To 
>> do this I have implemented several changes.
>>
>> First if CONFIG_DEBUG_VIRTUAL is not defined __phys_addr is 
>> made an inline, similar to how this is currently handled in 32 
>> bit.  However in order to do this it is required to export 
>> phys_base so that it is available if __phys_addr is used in 
>> kernel modules.
>>
>> The second change was to streamline the code by making use of 
>> the carry flag on an add operation instead of performing a 
>> compare on a 64 bit value.  The advantage to this is that it 
>> allows us to reduce the overall size of the call. On my Xeon 
>> E5 system the entire __phys_addr inline call consumes 30 bytes 
>> and 5 instructions.  I also applied similar logic to the debug 
>> version of the function.  My testing shows that the debug 
>> version of the function with this patch applied is slightly 
>> faster than the non-debug version without the patch.
>>
>> Finally, when building the kernel with the first two changes 
>> applied I saw build warnings about __START_KERNEL_map and 
>> PAGE_OFFSET constants not fitting in their type.  In order to 
>> resolve the build warning I changed their type from UL to ULL.
>>
>> Signed-off-by: Alexander Duyck 
>> ---
>>
>>  arch/x86/include/asm/page_64_types.h |   16 ++--
>>  arch/x86/kernel/x8664_ksyms_64.c |3 +++
>>  arch/x86/mm/physaddr.c   |   20 ++--
>>  3 files changed, 31 insertions(+), 8 deletions(-)
>> +#ifdef CONFIG_DEBUG_VIRTUAL
>>  extern unsigned long __phys_addr(unsigned long);
>> +#else
>> +static inline unsigned long __phys_addr(unsigned long x)
>> +{
>> +unsigned long y = x - __START_KERNEL_map;
>> +
>> +/* use the carry flag to determine if x was < __START_KERNEL_map */
>> +x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
>> +
>> +return x;
>> +}
> This is a rather frequently used primitive. By how much does 
> this patch increase a 'make defconfig' kernel's vmlinux, as 
> measured via 'size vmlinux'?
>
> Thanks,
>
>   Ingo

Here is the before and after:

Before
    text    data     bss      dec    hex filename
10368528 1047480 1122304 12538312 bf51c8 vmlinux

After
    text    data     bss      dec    hex filename
10372216 1047480 1122304 12542000 bf6030 vmlinux

I also have some patches that are going into swiotlb.  With them in, the
size is reduced a bit, but still doesn't get us back to the original size:

After SWIOTLB
    text    data     bss      dec    hex filename
10371860 1047480 1122304 12541644 bf5ecc vmlinux

The total increase in size amounts to about 3.6K without the SWIOTLB
changes, and about 3.3K with.

Thanks,

Alex


Re: [PATCH v2 0/8] Improve performance of VM translation on x86_64

2012-11-01 Thread Alexander Duyck
On 10/11/2012 03:58 PM, H. Peter Anvin wrote:
> On 10/12/2012 06:40 AM, Andi Kleen wrote:
>> Patch series looks good to me. Thanks for doing this properly.
>> Reviewed-by: Andi Kleen 
>>
> Agreed.
>
> Acked-by: H. Peter Anvin 
>
> I will pick this up after the merge window closes unless Ingo beats me
> to it.  (I'm currently traveling.)
>
>   -hpa
>
>

I was wondering if this ever got picked up?  If so, is there a public
tree I could find the patches in?

Just wondering, since I have some minor changes I would like to make and
wanted to figure out if I should rework the patches or submit the
changes as a follow-on.

Thanks,

Alex


Re: [PATCH v3 0/7] Improve swiotlb performance by using physical addresses

2012-11-02 Thread Alexander Duyck
On 11/02/2012 09:21 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Oct 29, 2012 at 03:05:56PM -0400, Konrad Rzeszutek Wilk wrote:
>> On Mon, Oct 29, 2012 at 11:18:09AM -0700, Alexander Duyck wrote:
>>> On Mon, Oct 15, 2012 at 10:19 AM, Alexander Duyck wrote:
>>>> While working on 10Gb/s routing performance I found a significant amount of
>>>> time was being spent in the swiotlb DMA handler. Further digging found that a
>>>> significant amount of this was due to virtual to physical address translation
>>>> and calling the function that did it. It accounted for nearly 60% of the
>>>> total swiotlb overhead.
>>>>
>>>> This patch set works to resolve that by replacing the io_tlb_start and
>>>> io_tlb_end virtual addresses with physical addresses. In addition it changes
>>>> the io_tlb_overflow_buffer from a virtual to a physical address. I followed
>>>> through with the cleanup to the point that the only functions that really
>>>> require the virtual address for the DMA buffer are the init, free, and
>>>> bounce functions.
>>>>
>>>> In the case of devices that are using the bounce buffers these patches should
>>>> result in only a slight performance gain if any. This is due to the locking
>>>> overhead required to map and unmap the buffers.
>>>>
>>>> In the case of devices that are not making use of bounce buffers these patches
>>>> can significantly reduce their overhead. In the case of an ixgbe routing test
>>>> for example, these changes result in 7 fewer calls to __phys_addr and
>>>> allow is_swiotlb_buffer to become inlined due to a reduction in the number of
>>>> instructions. When running a routing throughput test using small packets I
>>>> saw roughly a 6% increase in packet rates after applying these patches. This
>>>> appears to match up with the CPU overhead reduction I was tracking via perf.
>>>>
>>>> Before:
>>>> Results 10.0Mpps
>>>>
>>>> After:
>>>> Results 10.6Mpps
>>>>
>>>> Finally, I updated the parameter names for several of the core function calls
>>>> as there was some ambiguity in naming. Specifically virtual address pointers
>>>> were named dma_addr. When I changed these pointers to physical I instead used
>>>> the name tlb_addr as this value represented a physical address in the
>>>> io_tlb_start region and is less likely to be confused with a bus address.
>>>>
>>>> v2:
>>>> I reviewed the changes and realized that the first patch that was dropping
>>>> io_tlb_end and calculating the value didn't actually gain me much once I had
>>>> gone through and translated the rest of the addresses to physical addresses.
>>>> As such I have updated the patch so that it instead is converting io_tlb_end
>>>> from a virtual address to a physical address.  This actually helps to reduce
>>>> the overhead for is_swiotlb_buffer and swiotlb_dma_supported by several
>>>> instructions.
>>>>
>>>> v3:
>>>> After reviewing the patches I realized I was causing some namespace pollution
>>>> since a "static char *" was being replaced with "phys_addr_t" when it should
>>>> have been "static phys_addr_t".  As such I have updated the first 3 patches to
>>>> correctly replace static pointers with static physical addresses.
>>>>
>>>> ---
>>>>
>>>> Alexander Duyck (7):
>>>>   swiotlb: Do not export swiotlb_bounce since there are no external consumers
>>>>   swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single
>>>>   swiotlb: Use physical addresses for swiotlb_tbl_unmap_single
>>>>   swiotlb: Return physical addresses when calling swiotlb_tbl_map_single
>>>>   swiotlb: Make io_tlb_overflow_buffer a physical address
>>>>   swiotlb: Make io_tlb_start a physical address instead of a virtual one
>>>>   swiotlb: Make io_tlb_end a physical address instead of a virtual one
>>>>
>>>>
>>>>  drivers/xen/swiotlb-xen.c |   25 ++--
>>>>  include/linux/swiotlb.h   |   20 ++-
>>>>  lib/swiotlb.c             |  269 +++--
>>>>  3 files changed, 163 insertions(+), 151 deletions(-)
>>>>
>>> Is there any ETA on when this patch series might be pulled into a
>>> tree?  I'm just wondering if I need to rebase this patch series and
>>> resubmit it, and if so what tree I need to rebase it off of?
>> No need to rebase it. I did a test on the V2 version with Xen, but I still
>> need to do an IA64/Calgary/AMD-Vi/Intel VT-d/GART test before
>> pushing it out.
> So you should see your patches in linux-next.

I see they are in there.  Thanks.


[PATCH v3 0/8] Improve performance of VM translation on x86_64

2012-11-05 Thread Alexander Duyck
This patch series is meant to address several issues I encountered with VM
translations on x86_64.  In my testing I found that swiotlb was incurring up
to a 5% processing overhead due to calls to __phys_addr.  To address that I
have updated swiotlb to use physical addresses instead of virtual addresses
to reduce the need to call __phys_addr.  However those patches didn't address
the other callers.  With these patches applied I am able to achieve an
additional 1% to 2% performance gain on top of the changes to swiotlb.

The first 2 patches are the performance optimizations that result in the 1% to
2% increase in overall performance.  The remaining patches are various
cleanups for a number of spots where __pa or virt_to_phys was being called
and was not needed or __pa_symbol could have been used.

It doesn't seem like the v2 patch set was accepted so I am submitting an
updated v3 set that is rebased off of linux-next with a few additional
improvements to the existing patches.  Specifically the first patch now also
updates __virt_addr_valid so that it is almost identical in layout to
__phys_addr.  Also I found one additional spot in init_64.c that could use
__pa_symbol instead of virt_to_page calls so I updated the first __pa_symbol
patch for the x86 init calls.

With this patch set applied I am noticing a 1-2% improvement in performance in
my routing tests.  Without my earlier swiotlb changes applied it was getting as
high as 6-7% because that code originally relied heavily on virt_to_phys.

The overall effect on size varies depending on what kernel options are
enabled.  I have noticed that almost all of the network device drivers have
dropped in size by around 100 bytes.  I suspect this is due to the fact that
the virt_to_page call in dma_map_single is now less expensive.  However the
default build for x86_64 increases the vmlinux size by 3.5K with this change
applied.

---

Alexander Duyck (8):
  x86/lguest: Use __pa_symbol instead of __pa on C visible symbols
  x86/acpi: Use __pa_symbol instead of __pa on C visible symbols
  x86/xen: Use __pa_symbol instead of __pa on C visible symbols
  x86/ftrace: Use __pa_symbol instead of __pa on C visible symbols
  x86: Use __pa_symbol instead of __pa on C visible symbols
  x86: Drop 4 unnecessary calls to __pa_symbol
  x86: Make it so that __pa_symbol can only process kernel symbols on x86_64
  x86: Improve __phys_addr performance by making use of carry flags and 
inlining


 arch/x86/include/asm/page.h  |3 +-
 arch/x86/include/asm/page_32.h   |1 +
 arch/x86/include/asm/page_64_types.h |   20 +++-
 arch/x86/kernel/acpi/sleep.c |2 +
 arch/x86/kernel/cpu/intel.c  |2 +
 arch/x86/kernel/ftrace.c |4 +-
 arch/x86/kernel/head32.c |4 +-
 arch/x86/kernel/head64.c |4 +-
 arch/x86/kernel/setup.c  |   16 +-
 arch/x86/kernel/x8664_ksyms_64.c |3 ++
 arch/x86/lguest/boot.c   |3 +-
 arch/x86/mm/init_64.c|   18 +--
 arch/x86/mm/pageattr.c   |8 ++---
 arch/x86/mm/physaddr.c   |   55 +-
 arch/x86/platform/efi/efi.c  |4 +-
 arch/x86/realmode/init.c |8 ++---
 arch/x86/xen/mmu.c   |   21 +++--
 17 files changed, 111 insertions(+), 65 deletions(-)



[PATCH v3 2/8] x86: Make it so that __pa_symbol can only process kernel symbols on x86_64

2012-11-05 Thread Alexander Duyck
I submitted an earlier patch that makes __phys_addr an inline.  This obviously
results in an increase in the code size.  One step I can take to reduce that
is to make it so that the __pa_symbol call does a direct translation for
kernel addresses instead of covering all of virtual memory.

On my system this reduced the size for __pa_symbol from 5 instructions
totalling 30 bytes to 3 instructions totalling 16 bytes.
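
The reason the direct translation is safe (my summary, not from the patch
itself): anything the linker emits lives in the __START_KERNEL_map mapping
by construction, so the PAGE_OFFSET branch of __phys_addr can never apply
to a symbol and the whole call collapses to:

	phys = (unsigned long)sym - __START_KERNEL_map + phys_base;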

Signed-off-by: Alexander Duyck 
---

 arch/x86/include/asm/page.h  |3 ++-
 arch/x86/include/asm/page_32.h   |1 +
 arch/x86/include/asm/page_64_types.h |3 +++
 arch/x86/mm/physaddr.c   |   15 +++
 4 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..3698a6a 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -44,7 +44,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
  * case properly. Once all supported versions of gcc understand it, we can
  * remove this Voodoo magic stuff. (i.e. once gcc3.x is deprecated)
  */
-#define __pa_symbol(x) __pa(__phys_reloc_hide((unsigned long)(x)))
+#define __pa_symbol(x) \
+   __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
 
 #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))
 
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index da4e762..4d550d0 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -15,6 +15,7 @@ extern unsigned long __phys_addr(unsigned long);
 #else
 #define __phys_addr(x) __phys_addr_nodebug(x)
 #endif
+#define __phys_addr_symbol(x)  __phys_addr(x)
 #define __phys_reloc_hide(x)   RELOC_HIDE((x), 0)
 
 #ifdef CONFIG_FLATMEM
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 1ca93d3..a130589 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -69,8 +69,11 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
 }
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern unsigned long __phys_addr(unsigned long);
+extern unsigned long __phys_addr_symbol(unsigned long);
 #else
 #define __phys_addr(x) __phys_addr_nodebug(x)
+#define __phys_addr_symbol(x) \
+   ((unsigned long)(x) - __START_KERNEL_map + phys_base)
 #endif
 #define __phys_reloc_hide(x)   (x)
 
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index fd40d75..8420708 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -28,6 +28,21 @@ unsigned long __phys_addr(unsigned long x)
return x;
 }
 EXPORT_SYMBOL(__phys_addr);
+
+unsigned long __phys_addr_symbol(unsigned long x)
+{
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   VIRTUAL_BUG_ON(x < y);
+
+   x = y + phys_base;
+
+   VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
+
+   return x;
+}
+EXPORT_SYMBOL(__phys_addr_symbol);
 #endif
 
 bool __virt_addr_valid(unsigned long x)



[PATCH v3 1/8] x86: Improve __phys_addr performance by making use of carry flags and inlining

2012-11-05 Thread Alexander Duyck
This patch is meant to improve overall system performance when making use of
the __phys_addr call.  To do this I have implemented several changes.

First if CONFIG_DEBUG_VIRTUAL is not defined __phys_addr is made an inline,
similar to how this is currently handled in 32 bit.  However in order to do
this it is required to export phys_base so that it is available if __phys_addr
is used in kernel modules.

The second change was to streamline the code by making use of the carry flag
on an add operation instead of performing a compare on a 64 bit value.  The
advantage to this is that it allows us to significantly reduce the overall
size of the call.  On my Xeon E5 system the entire __phys_addr inline call
consumes a little less than 32 bytes and 5 instructions.  I also applied
similar logic to the debug version of the function.  My testing shows that the
debug version of the function with this patch applied is slightly faster than
the non-debug version without the patch.

When building the kernel with the first two changes applied I saw build
warnings about __START_KERNEL_map and PAGE_OFFSET constants not fitting in
their type.  In order to resolve the build warning I changed their type from
UL to ULL.

Finally I also applied the same logic changes to __virt_addr_valid since it
used the same general code flow as __phys_addr and could achieve similar gains
through these changes.
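
To illustrate why the carry flag helps (my reasoning, not part of the patch
text): after y = x - __START_KERNEL_map the subtraction wraps modulo 2^64
whenever x < __START_KERNEL_map, which makes y larger than x.  So the single
test (x > y) separates the kernel text mapping from the direct mapping
without a 64 bit compare against a large immediate:

	unsigned long y = x - __START_KERNEL_map;

	if (x > y)	/* no wrap: x was in the text mapping */
		x = y + phys_base;
	else		/* wrapped: x was in the direct mapping */
		x = y + (__START_KERNEL_map - PAGE_OFFSET);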

Signed-off-by: Alexander Duyck 
---

v3:  Added changes to __virt_addr_valid to keep it in sync with __phys_addr

 arch/x86/include/asm/page_64_types.h |   17 +-
 arch/x86/kernel/x8664_ksyms_64.c |3 +++
 arch/x86/mm/physaddr.c   |   40 +-
 3 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 320f7bb..1ca93d3 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -30,14 +30,14 @@
  * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
  * what Xen requires.
  */
-#define __PAGE_OFFSET   _AC(0x8800, UL)
+#define __PAGE_OFFSET   _AC(0x8800, ULL)
 
 #define __PHYSICAL_START   ((CONFIG_PHYSICAL_START +   \
  (CONFIG_PHYSICAL_ALIGN - 1)) &\
 ~(CONFIG_PHYSICAL_ALIGN - 1))
 
 #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
-#define __START_KERNEL_map _AC(0x8000, UL)
+#define __START_KERNEL_map _AC(0x8000, ULL)
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
 #define __PHYSICAL_MASK_SHIFT  46
@@ -58,7 +58,20 @@ void copy_page(void *to, void *from);
 extern unsigned long max_pfn;
 extern unsigned long phys_base;
 
+static inline unsigned long __phys_addr_nodebug(unsigned long x)
+{
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
+
+   return x;
+}
+#ifdef CONFIG_DEBUG_VIRTUAL
 extern unsigned long __phys_addr(unsigned long);
+#else
+#define __phys_addr(x) __phys_addr_nodebug(x)
+#endif
 #define __phys_reloc_hide(x)   (x)
 
 #define vmemmap ((struct page *)VMEMMAP_START)
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index 1330dd1..b014d94 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -59,6 +59,9 @@ EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(memmove);
 
+#ifndef CONFIG_DEBUG_VIRTUAL
+EXPORT_SYMBOL(phys_base);
+#endif
 EXPORT_SYMBOL(empty_zero_page);
 #ifndef CONFIG_PARAVIRT
 EXPORT_SYMBOL(native_load_gs_index);
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index d2e2735..fd40d75 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -8,33 +8,43 @@
 
 #ifdef CONFIG_X86_64
 
+#ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
-   if (x >= __START_KERNEL_map) {
-   x -= __START_KERNEL_map;
-   VIRTUAL_BUG_ON(x >= KERNEL_IMAGE_SIZE);
-   x += phys_base;
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   if (unlikely(x > y)) {
+   x = y + phys_base;
+
+   VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
} else {
-   VIRTUAL_BUG_ON(x < PAGE_OFFSET);
-   x -= PAGE_OFFSET;
-   VIRTUAL_BUG_ON(!phys_addr_valid(x));
+   x = y + (__START_KERNEL_map - PAGE_OFFSET);
+
+   /* carry flag will be set if starting x was >= PAGE_OFFSET */
+   VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));
}
+
return x;
 }
 EXPORT_SYMBOL(__phys_addr);
+#endif
 
 bool __virt_addr_

[PATCH v3 3/8] x86: Drop 4 unnecessary calls to __pa_symbol

2012-11-05 Thread Alexander Duyck
While debugging the __pa_symbol inline patch I found that there were a couple
spots where __pa_symbol was used as follows:
__pa_symbol(x) - __pa_symbol(y)

The compiler had reduced them to:
x - y

Since we also support a debug case where __pa_symbol is a function call it
would probably be useful to just change the two cases I found so that they are
always just treated as "x - y".  As such I am casting the values to
phys_addr_t and then doing simple subtraction so that the correct type and
value is returned.

Signed-off-by: Alexander Duyck 
---

 arch/x86/kernel/head32.c |4 ++--
 arch/x86/kernel/head64.c |4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index c18f59d..f15db0c 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -30,8 +30,8 @@ static void __init i386_default_early_setup(void)
 
 void __init i386_start_kernel(void)
 {
-   memblock_reserve(__pa_symbol(&_text),
-__pa_symbol(&__bss_stop) - __pa_symbol(&_text));
+   memblock_reserve(__pa_symbol(_text),
+(phys_addr_t)__bss_stop - (phys_addr_t)_text);
 
 #ifdef CONFIG_BLK_DEV_INITRD
/* Reserve INITRD */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 037df57..42f5df1 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -97,8 +97,8 @@ void __init x86_64_start_reservations(char *real_mode_data)
 {
copy_bootdata(__va(real_mode_data));
 
-   memblock_reserve(__pa_symbol(&_text),
-__pa_symbol(&__bss_stop) - __pa_symbol(&_text));
+   memblock_reserve(__pa_symbol(_text),
+(phys_addr_t)__bss_stop - (phys_addr_t)_text);
 
 #ifdef CONFIG_BLK_DEV_INITRD
/* Reserve INITRD */



[PATCH v3 4/8] x86: Use __pa_symbol instead of __pa on C visible symbols

2012-11-05 Thread Alexander Duyck
When I made an attempt at separating __pa_symbol and __pa I found that there
were a number of cases where __pa was used on an obvious symbol.

I also caught one non-obvious case as _brk_start and _brk_end are based on the
address of __brk_base which is a C visible symbol.

In mark_rodata_ro I was able to reduce the overhead of kernel symbol to
virtual memory translation by using a combination of __va(__pa_symbol())
instead of page_address(virt_to_page()).

Signed-off-by: Alexander Duyck 
---

v3:  Added changes to init_64.c function mark_rodata_ro to avoid unnecessary
 conversion to and from a page when all that is wanted is a virtual
 address.

 arch/x86/kernel/cpu/intel.c |2 +-
 arch/x86/kernel/setup.c |   16 
 arch/x86/mm/init_64.c   |   18 --
 arch/x86/mm/pageattr.c  |8 
 arch/x86/platform/efi/efi.c |4 ++--
 arch/x86/realmode/init.c|8 
 6 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 198e019..2249e7e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -168,7 +168,7 @@ int __cpuinit ppro_with_ram_bug(void)
 #ifdef CONFIG_X86_F00F_BUG
 static void __cpuinit trap_init_f00f_bug(void)
 {
-   __set_fixmap(FIX_F00F_IDT, __pa(&idt_table), PAGE_KERNEL_RO);
+   __set_fixmap(FIX_F00F_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
 
/*
 * Update the IDT descriptor and reload the IDT so that
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e800bc6..4343570 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -300,8 +300,8 @@ static void __init cleanup_highmap(void)
 static void __init reserve_brk(void)
 {
if (_brk_end > _brk_start)
-   memblock_reserve(__pa(_brk_start),
-__pa(_brk_end) - __pa(_brk_start));
+   memblock_reserve(__pa_symbol(_brk_start),
+_brk_end - _brk_start);
 
/* Mark brk area as locked down and no longer taking any
   new allocations */
@@ -761,12 +761,12 @@ void __init setup_arch(char **cmdline_p)
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = _brk_end;
 
-   code_resource.start = virt_to_phys(_text);
-   code_resource.end = virt_to_phys(_etext)-1;
-   data_resource.start = virt_to_phys(_etext);
-   data_resource.end = virt_to_phys(_edata)-1;
-   bss_resource.start = virt_to_phys(&__bss_start);
-   bss_resource.end = virt_to_phys(&__bss_stop)-1;
+   code_resource.start = __pa_symbol(_text);
+   code_resource.end = __pa_symbol(_etext)-1;
+   data_resource.start = __pa_symbol(_etext);
+   data_resource.end = __pa_symbol(_edata)-1;
+   bss_resource.start = __pa_symbol(__bss_start);
+   bss_resource.end = __pa_symbol(__bss_stop)-1;
 
 #ifdef CONFIG_CMDLINE_BOOL
 #ifdef CONFIG_CMDLINE_OVERRIDE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3baff25..0374a10 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -770,12 +770,10 @@ void set_kernel_text_ro(void)
 void mark_rodata_ro(void)
 {
unsigned long start = PFN_ALIGN(_text);
-   unsigned long rodata_start =
-   ((unsigned long)__start_rodata + PAGE_SIZE - 1) & PAGE_MASK;
+   unsigned long rodata_start = PFN_ALIGN(__start_rodata);
unsigned long end = (unsigned long) &__end_rodata_hpage_align;
-   unsigned long text_end = PAGE_ALIGN((unsigned long) &__stop___ex_table);
-   unsigned long rodata_end = PAGE_ALIGN((unsigned long) &__end_rodata);
-   unsigned long data_start = (unsigned long) &_sdata;
+   unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
+   unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
 
printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
   (end - start) >> 10);
@@ -800,12 +798,12 @@ void mark_rodata_ro(void)
 #endif
 
free_init_pages("unused kernel memory",
-   (unsigned long) page_address(virt_to_page(text_end)),
-   (unsigned long)
-page_address(virt_to_page(rodata_start)));
+   (unsigned long) __va(__pa_symbol(text_end)),
+   (unsigned long) __va(__pa_symbol(rodata_start)));
+
free_init_pages("unused kernel memory",
-   (unsigned long) page_address(virt_to_page(rodata_end)),
-   (unsigned long) page_address(virt_to_page(data_start)));
+   (unsigned long) __va(__pa_symbol(rodata_end)),
+   (unsigned long) __va(__pa_symbol(_sdata)));
 }
 
 #endif
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a718e0d..40f92f3 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/

[PATCH v3 5/8] x86/ftrace: Use __pa_symbol instead of __pa on C visible symbols

2012-11-05 Thread Alexander Duyck
Instead of using __pa which is meant to be a general function for converting
virtual addresses to physical addresses we can use __pa_symbol which is the
preferred way of decoding kernel text virtual addresses to physical addresses.

In this case we are not directly converting C visible symbols, however if we
know that the instruction pointer is somewhere between _text and _etext we
know that we are going to be translating an address from the kernel text
space.

Cc: Steven Rostedt 
Cc: Frederic Weisbecker 
Signed-off-by: Alexander Duyck 
---

 arch/x86/kernel/ftrace.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 1d41402..42a392a 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -89,7 +89,7 @@ do_ftrace_mod_code(unsigned long ip, const void *new_code)
 * kernel identity mapping to modify code.
 */
if (within(ip, (unsigned long)_text, (unsigned long)_etext))
-   ip = (unsigned long)__va(__pa(ip));
+   ip = (unsigned long)__va(__pa_symbol(ip));
 
return probe_kernel_write((void *)ip, new_code, MCOUNT_INSN_SIZE);
 }
@@ -279,7 +279,7 @@ static int ftrace_write(unsigned long ip, const char *val, int size)
 * kernel identity mapping to modify code.
 */
if (within(ip, (unsigned long)_text, (unsigned long)_etext))
-   ip = (unsigned long)__va(__pa(ip));
+   ip = (unsigned long)__va(__pa_symbol(ip));
 
return probe_kernel_write((void *)ip, val, size);
 }



[PATCH v3 6/8] x86/xen: Use __pa_symbol instead of __pa on C visible symbols

2012-11-05 Thread Alexander Duyck
This change updates a few of the functions to use __pa_symbol when
translating C visible symbols instead of __pa.  By using __pa_symbol we are
able to drop a few extra lines of code as we don't have to test whether the
virtual pointer is part of the kernel text or just standard virtual memory.

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---

 arch/x86/xen/mmu.c |   21 +++--
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4a05b39..a63e5f9 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1486,7 +1486,8 @@ static int xen_pgd_alloc(struct mm_struct *mm)
 
if (user_pgd != NULL) {
user_pgd[pgd_index(VSYSCALL_START)] =
-   __pgd(__pa(level3_user_vsyscall) | _PAGE_TABLE);
+   __pgd(__pa_symbol(level3_user_vsyscall) |
+ _PAGE_TABLE);
ret = 0;
}
 
@@ -1958,10 +1959,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 * pgd.
 */
if (xen_feature(XENFEAT_writable_page_tables)) {
-   native_write_cr3(__pa(init_level4_pgt));
+   native_write_cr3(__pa_symbol(init_level4_pgt));
} else {
xen_mc_batch();
-   __xen_write_cr3(true, __pa(init_level4_pgt));
+   __xen_write_cr3(true, __pa_symbol(init_level4_pgt));
xen_mc_issue(PARAVIRT_LAZY_CPU);
}
/* We can't that easily rip out L3 and L2, as the Xen pagetables are
@@ -1984,10 +1985,10 @@ static RESERVE_BRK_ARRAY(pmd_t, swapper_kernel_pmd, PTRS_PER_PMD);
 
 static void __init xen_write_cr3_init(unsigned long cr3)
 {
-   unsigned long pfn = PFN_DOWN(__pa(swapper_pg_dir));
+   unsigned long pfn = PFN_DOWN(__pa_symbol(swapper_pg_dir));
 
-   BUG_ON(read_cr3() != __pa(initial_page_table));
-   BUG_ON(cr3 != __pa(swapper_pg_dir));
+   BUG_ON(read_cr3() != __pa_symbol(initial_page_table));
+   BUG_ON(cr3 != __pa_symbol(swapper_pg_dir));
 
/*
 * We are switching to swapper_pg_dir for the first time (from
@@ -2011,7 +2012,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, pfn);
 
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE,
- PFN_DOWN(__pa(initial_page_table)));
+ PFN_DOWN(__pa_symbol(initial_page_table)));
set_page_prot(initial_page_table, PAGE_KERNEL);
set_page_prot(initial_kernel_pmd, PAGE_KERNEL);
 
@@ -2036,7 +2037,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
copy_page(initial_page_table, pgd);
initial_page_table[KERNEL_PGD_BOUNDARY] =
-   __pgd(__pa(initial_kernel_pmd) | _PAGE_PRESENT);
+   __pgd(__pa_symbol(initial_kernel_pmd) | _PAGE_PRESENT);
 
set_page_prot(initial_kernel_pmd, PAGE_KERNEL_RO);
set_page_prot(initial_page_table, PAGE_KERNEL_RO);
@@ -2045,8 +2046,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
 
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE,
- PFN_DOWN(__pa(initial_page_table)));
-   xen_write_cr3(__pa(initial_page_table));
+ PFN_DOWN(__pa_symbol(initial_page_table)));
+   xen_write_cr3(__pa_symbol(initial_page_table));
 
memblock_reserve(__pa(xen_start_info->pt_base),
 xen_start_info->nr_pt_frames * PAGE_SIZE);



[PATCH v3 7/8] x86/acpi: Use __pa_symbol instead of __pa on C visible symbols

2012-11-05 Thread Alexander Duyck
This change just updates one spot where __pa was being used when __pa_symbol
should have been used.  By using __pa_symbol we are able to drop a few extra
lines of code as we don't have to test to see if the virtual pointer is a
part of the kernel text or just standard virtual memory.

Cc: Len Brown 
Cc: Pavel Machek 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Alexander Duyck 
---

 arch/x86/kernel/acpi/sleep.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index d5e0d71..0532f5d 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -69,7 +69,7 @@ int acpi_suspend_lowlevel(void)
 
 #ifndef CONFIG_64BIT
header->pmode_entry = (u32)&wakeup_pmode_return;
-   header->pmode_cr3 = (u32)__pa(&initial_page_table);
+   header->pmode_cr3 = (u32)__pa_symbol(initial_page_table);
saved_magic = 0x12345678;
 #else /* CONFIG_64BIT */
 #ifdef CONFIG_SMP



[PATCH v3 8/8] x86/lguest: Use __pa_symbol instead of __pa on C visible symbols

2012-11-05 Thread Alexander Duyck
The function lguest_write_cr3 is using __pa to convert swapper_pg_dir and
initial_page_table from virtual addresses to physical.  The correct function
to use for these values is __pa_symbol since they are C visible symbols.

Cc: Rusty Russell 
Signed-off-by: Alexander Duyck 
---

 arch/x86/lguest/boot.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index df4176c..1cbd89c 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -552,7 +552,8 @@ static void lguest_write_cr3(unsigned long cr3)
current_cr3 = cr3;
 
/* These two page tables are simple, linear, and used during boot */
-   if (cr3 != __pa(swapper_pg_dir) && cr3 != __pa(initial_page_table))
+   if (cr3 != __pa_symbol(swapper_pg_dir) &&
+   cr3 != __pa_symbol(initial_page_table))
cr3_changed = true;
 }
 



Re: [PATCH v3 1/8] x86: Improve __phys_addr performance by making use of carry flags and inlining

2012-11-05 Thread Alexander Duyck
On 11/05/2012 12:24 PM, Kirill A. Shutemov wrote:
> On Mon, Nov 05, 2012 at 11:04:06AM -0800, Alexander Duyck wrote:
>> This patch is meant to improve overall system performance when making use of
>> the __phys_addr call.  To do this I have implemented several changes.
>>
>> First if CONFIG_DEBUG_VIRTUAL is not defined __phys_addr is made an inline,
>> similar to how this is currently handled in 32 bit.  However in order to do
>> this it is required to export phys_base so that it is available if
>> __phys_addr is used in kernel modules.
>>
>> The second change was to streamline the code by making use of the carry flag
>> on an add operation instead of performing a compare on a 64 bit value.  The
>> advantage to this is that it allows us to significantly reduce the overall
>> size of the call.  On my Xeon E5 system the entire __phys_addr inline call
>> consumes a little less than 32 bytes and 5 instructions.  I also applied
>> similar logic to the debug version of the function.  My testing shows that
>> the debug version of the function with this patch applied is slightly faster than
>> the non-debug version without the patch.
>>
>> When building the kernel with the first two changes applied I saw build
>> warnings about __START_KERNEL_map and PAGE_OFFSET constants not fitting in
>> their type.  In order to resolve the build warning I changed their type from
>> UL to ULL.
> What kind of warning messages did you see?
> It's strange: sizeof(unsigned long) == sizeof(unsigned long long) on
> x86_64

One of the warnings is included below:

In file included from /usr/src/kernels/linux-next/arch/x86/include/asm/page_types.h:37,
                 from /usr/src/kernels/linux-next/arch/x86/include/asm/pgtable_types.h:5,
                 from /usr/src/kernels/linux-next/arch/x86/include/asm/boot.h:11,
                 from arch/x86/realmode/rm/../../boot/boot.h:26,
                 from arch/x86/realmode/rm/../../boot/regs.c:19,
                 from arch/x86/realmode/rm/regs.c:1:
/usr/src/kernels/linux-next/arch/x86/include/asm/page_64_types.h: In function '__phys_addr_nodebug':
/usr/src/kernels/linux-next/arch/x86/include/asm/page_64_types.h:63: warning: integer constant is too large for 'unsigned long' type
/usr/src/kernels/linux-next/arch/x86/include/asm/page_64_types.h:66: warning: integer constant is too large for 'unsigned long' type
/usr/src/kernels/linux-next/arch/x86/include/asm/page_64_types.h:66: warning: integer constant is too large for 'unsigned long' type

The warnings all seemed to originate from several different spots
throughout the x86 tree.  All of the warning messages include
arch/x86/boot/boot.h:26, and from there up the "included from" list is
always the same.

Thanks,

Alex



Re: [PATCH v3 0/7] Improve swiotlb performance by using physical addresses

2012-10-29 Thread Alexander Duyck
On Mon, Oct 15, 2012 at 10:19 AM, Alexander Duyck wrote:
> While working on 10Gb/s routing performance I found a significant amount of
> time was being spent in the swiotlb DMA handler. Further digging found that a
> significant amount of this was due to virtual to physical address translation
> and calling the function that did it. It accounted for nearly 60% of the
> total swiotlb overhead.
>
> This patch set works to resolve that by replacing the io_tlb_start and
> io_tlb_end virtual addresses with a physical addresses. In addition it changes
> the io_tlb_overflow_buffer from a virtual to a physical address. I followed
> through with the cleanup to the point that the only functions that really
> require the virtual address for the DMA buffer are the init, free, and
> bounce functions.
>
> In the case of devices that are using the bounce buffers these patches should
> result in only a slight performance gain if any. This is due to the locking
> overhead required to map and unmap the buffers.
>
> In the case of devices that are not making use of bounce buffers these patches
> can significantly reduce their overhead. In the case of an ixgbe routing test
> for example, these changes result in 7 fewer calls to __phys_addr and
> allow is_swiotlb_buffer to become inlined due to a reduction in the number of
> instructions. When running a routing throughput test using small packets I
> saw roughly a 6% increase in packets rates after applying these patches. This
> appears to match up with the CPU overhead reduction I was tracking via perf.
>
> Before:
> Results 10.0Mpps
>
> After:
> Results 10.6Mpps
>
> Finally, I updated the parameter names for several of the core function calls
> as there was some ambiguity in naming. Specifically virtual address pointers
> were named dma_addr. When I changed these pointers to physical I instead used
> the name tlb_addr as this value represented a physical address in the
> io_tlb_start region and is less likely to be confused with a bus address.
>
> v2:
> I reviewed the changes and realized that the first patch that was dropping
> io_tlb_end and calculating the value didn't actually gain me much once I had
> gone through and translated the rest of the addresses to physical addresses.
> As such I have updated the patch so that it instead is converting io_tlb_end
> from a virtual address to a physical address.  This actually helps to reduce
> the overhead for is_swiotlb_buffer and swiotlb_dma_supported by several
> instructions.
>
> v3:
> After reviewing the patches I realized I was causing some namespace pollution
> since a "static char *" was being replaced with "phys_addr_t" when it should
> have been "static phys_addr_t".  As such I have updated the first 3 patches to
> correctly replace static pointers with static physical addresses.
>
> ---
>
> Alexander Duyck (7):
>   swiotlb: Do not export swiotlb_bounce since there are no external consumers
>   swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single
>   swiotlb: Use physical addresses for swiotlb_tbl_unmap_single
>   swiotlb: Return physical addresses when calling swiotlb_tbl_map_single
>   swiotlb: Make io_tlb_overflow_buffer a physical address
>   swiotlb: Make io_tlb_start a physical address instead of a virtual one
>   swiotlb: Make io_tlb_end a physical address instead of a virtual one
>
>
>  drivers/xen/swiotlb-xen.c |   25 ++--
>  include/linux/swiotlb.h   |   20 ++-
>  lib/swiotlb.c             |  269 +++--
>  3 files changed, 163 insertions(+), 151 deletions(-)
>

Is there any ETA on when this patch series might be pulled into a
tree?  I'm just wondering if I need to rebase this patch series and
resubmit it, and if so what tree I need to rebase it off of?

Thanks,

Alex


Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

2012-10-05 Thread Alexander Duyck
On 10/05/2012 01:02 PM, Andi Kleen wrote:
>> I was thinking the issue was all of the calls to relatively small
>> functions occurring in quick succession.  The way most of this code is
>> setup it seems like it is one small function call in turn calling
>> another, and then another, and I would imagine the code fragmentation
>> can have a significant negative impact.
> Maybe. Can you just inline everything and see if it's faster then?
>
> This was out of line when the "text cost at all costs" drive was still
> in vogue, but luckily we're not doing that anymore.
>
> -Andi
>

Inlining everything did speed things up a bit, but I still didn't reach
the same speed I achieved using the patch set.  However I did notice the
resulting swiotlb code was considerably larger.

I did a bit more digging and the issue may actually be simple repetition
of the calls.  By my math it would seem we would end up calling
is_swiotlb_buffer 3 times per packet in the routing test case, once in
sync_for_cpu and once for sync_for_device in the Rx cleanup path, and
once in unmap_page in the Tx cleanup path.  Each call to
is_swiotlb_buffer will result in 2 calls to __phys_addr.  In freeing the
skb we end up doing a call to virt_to_head_page which will call
__phys_addr.  In addition we end up mapping the skb using map_single so
we end up using __phys_addr to do a virt_to_page translation in the
xmit_frame_ring path, and then call __phys_addr when we check
dma_mapping_error.  So in total that ends up being 3 calls to
is_swiotlb_buffer, and 9 calls to __phys_addr per packet routed.

With the patches the is_swiotlb_buffer function, which was 25 lines of
assembly, is replaced with 8 lines of assembly and becomes inline.  In
addition we drop the number of calls to __phys_addr from 9 to 2 by
dropping them all from swiotlb.  By my math I am probably saving about
120 instructions per packet.  I suspect cutting the per-packet instruction
count by that much would account for a 5% difference when you consider I am
running at about 1.5Mpps per core on a 2.7GHz processor.
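
As a rough sanity check on that math (my estimates, nothing measured
directly):

	2.7GHz / 1.5Mpps   ~ 1800 cycles of budget per packet
	~120 instructions  ~ 100+ cycles at roughly 1 instruction/cycle
	100+ / 1800        ~ 5% or so of the per-packet budget

which lines up reasonably well with the throughput delta.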

Thanks,

Alex


[PATCH 0/7] Improve swiotlb performance by using physical addresses

2012-10-05 Thread Alexander Duyck
While working on 10Gb/s routing performance I found a significant amount of
time was being spent in the swiotlb DMA handler.  Further digging found that a
significant amount of this was due to virtual to physical address translation
and calling the function that did it.  It accounted for nearly 60% of the
total swiotlb overhead.

This patch set works to resolve that by replacing the io_tlb_start virtual
address with io_tlb_addr which is a physical address.  In addition it changes
the io_tlb_overflow_buffer from a virtual to a physical address.  I followed
through with the cleanup to the point that the only functions that really
require the virtual address for the DMA buffer are the init, free, and
bounce functions.

In the case of devices that are using the bounce buffers these patches should
result in only a slight performance gain if any.  This is due to the locking
overhead required to map and unmap the buffers.

In the case of devices that are not making use of bounce buffers these patches
can significantly reduce their overhead.  In the case of an ixgbe routing test
for example, these changes result in 7 fewer calls to __phys_addr and
allow is_swiotlb_buffer to become inlined due to a reduction in the number of
instructions.  When running a routing throughput test using small packets I
saw roughly a 5% increase in packet rates after applying these patches.  This
appears to match up with the CPU overhead reduction I was tracking via perf.

Before:
Results 10.29Mpps

# Overhead  Symbol
 1.97%  [k] __phys_addr
            |--24.97%-- swiotlb_sync_single
            |--16.55%-- is_swiotlb_buffer
            |--11.25%-- unmap_single
             --2.71%-- swiotlb_dma_mapping_error
 1.66%  [k] swiotlb_sync_single
 1.45%  [k] is_swiotlb_buffer
 0.53%  [k] unmap_single
 0.52%  [k] swiotlb_map_page
 0.47%  [k] swiotlb_sync_single_for_device
 0.43%  [k] swiotlb_sync_single_for_cpu
 0.42%  [k] swiotlb_dma_mapping_error
 0.34%  [k] swiotlb_unmap_page

After:
Results 10.99Mpps

# Overhead  Symbol
 0.50%  [k] swiotlb_map_page
 0.50%  [k] swiotlb_sync_single
 0.36%  [k] swiotlb_sync_single_for_cpu
 0.35%  [k] swiotlb_sync_single_for_device
 0.25%  [k] swiotlb_unmap_page
 0.17%  [k] swiotlb_dma_mapping_error

Finally, I updated the parameter names for several of the core function calls
as there was some ambiguity in naming.  Specifically virtual address pointers
were named dma_addr.  When I changed these pointers to physical I instead used
the name tlb_addr as this value represented a physical address in the
io_tlb_addr region and is less likely to be confused with a bus address.

---

Alexander Duyck (7):
  swiotlb: Do not export swiotlb_bounce since there are no external consumers
  swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single
  swiotlb: Use physical addresses for swiotlb_tbl_unmap_single
  swiotlb: Return physical addresses when calling swiotlb_tbl_map_single
  swiotlb: Make io_tlb_overflow_buffer a physical address
  swiotlb: Replace virtual io_tlb_start with physical io_tlb_addr
  swiotlb: Instead of tracking the end of the swiotlb region just calculate it


 drivers/xen/swiotlb-xen.c |   25 ++--
 include/linux/swiotlb.h   |   20 ++-
 lib/swiotlb.c |  285 +++--
 3 files changed, 170 insertions(+), 160 deletions(-)



[PATCH 1/7] swiotlb: Instead of tracking the end of the swiotlb region just calculate it

2012-10-05 Thread Alexander Duyck
In the case of swiotlb we already have the start of the region and the number
of slabs that give us the region size.  Instead of having to call
virt_to_phys on two pointers we can take advantage of the fact that the
region is linear and simply compute the end as the start plus the size.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   25 -
 1 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index f114bf6..5cc4d4e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -57,11 +57,11 @@ int swiotlb_force;
  * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
  * API.
  */
-static char *io_tlb_start, *io_tlb_end;
+static char *io_tlb_start;
 
 /*
- * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
- * io_tlb_end.  This is command line adjustable via setup_io_tlb_npages.
+ * The number of IO TLB blocks (in groups of 64).
+ * This is command line adjustable via setup_io_tlb_npages.
  */
 static unsigned long io_tlb_nslabs;
 
@@ -128,11 +128,11 @@ void swiotlb_print_info(void)
phys_addr_t pstart, pend;
 
pstart = virt_to_phys(io_tlb_start);
-   pend = virt_to_phys(io_tlb_end);
+   pend = pstart + bytes;
 
	printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
   (unsigned long long)pstart, (unsigned long long)pend - 1,
-  bytes >> 20, io_tlb_start, io_tlb_end - 1);
+  bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
 }
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
@@ -143,12 +143,10 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 
io_tlb_nslabs = nslabs;
io_tlb_start = tlb;
-   io_tlb_end = io_tlb_start + bytes;
 
/*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
-	 * between io_tlb_start and io_tlb_end.
 */
	io_tlb_list = alloc_bootmem_pages(PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
for (i = 0; i < io_tlb_nslabs; i++)
@@ -254,14 +252,12 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 
io_tlb_nslabs = nslabs;
io_tlb_start = tlb;
-   io_tlb_end = io_tlb_start + bytes;
 
memset(io_tlb_start, 0, bytes);
 
/*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
-	 * between io_tlb_start and io_tlb_end.
 */
io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
  get_order(io_tlb_nslabs * sizeof(int)));
@@ -304,7 +300,6 @@ cleanup3:
 sizeof(int)));
io_tlb_list = NULL;
 cleanup2:
-   io_tlb_end = NULL;
io_tlb_start = NULL;
io_tlb_nslabs = 0;
return -ENOMEM;
@@ -339,8 +334,10 @@ void __init swiotlb_free(void)
 
 static int is_swiotlb_buffer(phys_addr_t paddr)
 {
-   return paddr >= virt_to_phys(io_tlb_start) &&
-   paddr < virt_to_phys(io_tlb_end);
+   phys_addr_t swiotlb_start = virt_to_phys(io_tlb_start);
+
+   return paddr >= swiotlb_start &&
+   paddr < (swiotlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
 }
 
 /*
@@ -938,6 +935,8 @@ EXPORT_SYMBOL(swiotlb_dma_mapping_error);
 int
 swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-   return swiotlb_virt_to_bus(hwdev, io_tlb_end - 1) <= mask;
+   unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
+
+   return swiotlb_virt_to_bus(hwdev, io_tlb_start + bytes - 1) <= mask;
 }
 EXPORT_SYMBOL(swiotlb_dma_supported);



[PATCH 2/7] swiotlb: Replace virtual io_tlb_start with physical io_tlb_addr

2012-10-05 Thread Alexander Duyck
This change replaces all references to the virtual address for io_tlb_start
with references to the physical address io_tlb_addr.  The main advantage of
replacing the virtual address with a physical address is that we can avoid
having to do multiple translations from the virtual address to the physical
one needed for testing an existing DMA address.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   67 +
 1 files changed, 34 insertions(+), 33 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 5cc4d4e..3c45f10 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -53,11 +53,11 @@
 int swiotlb_force;
 
 /*
- * Used to do a quick range check in swiotlb_tbl_unmap_single and
- * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
- * API.
+ * Physical address for swiotlb bounce buffer.  Used to do a quick range
+ * check in swiotlb_tbl_unmap_single and swiotlb_tbl_sync_single_*, to see
+ * if the memory was in fact allocated by this API.
  */
-static char *io_tlb_start;
+phys_addr_t io_tlb_addr;
 
 /*
  * The number of IO TLB blocks (in groups of 64).
@@ -125,14 +125,15 @@ static dma_addr_t swiotlb_virt_to_bus(struct device *hwdev,
 void swiotlb_print_info(void)
 {
unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
-   phys_addr_t pstart, pend;
+   unsigned char *vstart, *vend;
 
-   pstart = virt_to_phys(io_tlb_start);
-   pend = pstart + bytes;
+   vstart = phys_to_virt(io_tlb_addr);
+   vend = vstart + bytes;
 
	printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
-  (unsigned long long)pstart, (unsigned long long)pend - 1,
-  bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
+  (unsigned long long)io_tlb_addr,
+  (unsigned long long)io_tlb_addr + bytes - 1,
+  bytes >> 20, vstart, vend - 1);
 }
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
@@ -142,7 +143,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
bytes = nslabs << IO_TLB_SHIFT;
 
io_tlb_nslabs = nslabs;
-   io_tlb_start = tlb;
+   io_tlb_addr = __pa(tlb);
 
/*
 * Allocate and initialize the free list array.  This array is used
@@ -171,6 +172,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 static void __init
 swiotlb_init_with_default_size(size_t default_size, int verbose)
 {
+   unsigned char *vstart;
unsigned long bytes;
 
if (!io_tlb_nslabs) {
@@ -183,11 +185,11 @@ swiotlb_init_with_default_size(size_t default_size, int verbose)
/*
 * Get IO TLB memory from the low pages
 */
-   io_tlb_start = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
-   if (!io_tlb_start)
+   vstart = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
+   if (!vstart)
panic("Cannot allocate SWIOTLB buffer");
 
-   swiotlb_init_with_tbl(io_tlb_start, io_tlb_nslabs, verbose);
+   swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose);
 }
 
 void __init
@@ -205,6 +207,7 @@ int
 swiotlb_late_init_with_default_size(size_t default_size)
 {
unsigned long bytes, req_nslabs = io_tlb_nslabs;
+   unsigned char *vstart = NULL;
unsigned int order;
int rc = 0;
 
@@ -221,14 +224,14 @@ swiotlb_late_init_with_default_size(size_t default_size)
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
 
while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-   io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
-   order);
-   if (io_tlb_start)
+   vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
+ order);
+   if (vstart)
break;
order--;
}
 
-   if (!io_tlb_start) {
+   if (!vstart) {
io_tlb_nslabs = req_nslabs;
return -ENOMEM;
}
@@ -237,9 +240,9 @@ swiotlb_late_init_with_default_size(size_t default_size)
   "for software IO TLB\n", (PAGE_SIZE << order) >> 20);
io_tlb_nslabs = SLABS_PER_PAGE << order;
}
-   rc = swiotlb_late_init_with_tbl(io_tlb_start, io_tlb_nslabs);
+   rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs);
if (rc)
-   free_pages((unsigned long)io_tlb_start, order);
+   free_pages((unsigned long)vstart, order);
return rc;
 }
 
@@ -251,9 +254,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
bytes = nslabs << IO_TLB_SHIFT;
 
io_tlb_nslabs = nslabs;
-   io_tlb_start = tlb;
+   io_tlb_addr = virt_to_phys(tlb);

[PATCH 4/7] swiotlb: Return physical addresses when calling swiotlb_tbl_map_single

2012-10-05 Thread Alexander Duyck
This change makes it so that swiotlb_tbl_map_single will return a physical
address instead of a virtual address when called.  The advantage to this once
again is that we are avoiding a number of virt_to_phys and phys_to_virt
translations by working with everything as a physical address.

One change I had to make in order to support using physical addresses is that
I could no longer trust 0 to be an invalid physical address on all platforms.
So instead I made it so that ~0 is returned on error.  This should never be a
valid return value as it implies that only one byte would be available for
use.
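
For illustration, the caller-side pattern then looks roughly like this (a
sketch mirroring the Xen hunk below; dev, start_dma_addr, phys, size and
dir are assumed to be in scope, and phys_to_dma stands in for the bus
translation):

   phys_addr_t map;

   map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
   if (map == SWIOTLB_MAP_ERROR)
           return DMA_ERROR_CODE;  /* ~0 can never be a valid mapping */

   dev_addr = phys_to_dma(dev, map);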

In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_map_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr.  This way it should be clear that orig_addr is
contained within io_tlb_orig_addr and tlb_addr is an address within the
io_tlb_addr buffer.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |   22 ++---
 include/linux/swiotlb.h   |   11 +-
 lib/swiotlb.c |   78 +++--
 3 files changed, 59 insertions(+), 52 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 58db6df..8a6035a 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -338,9 +338,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
enum dma_data_direction dir,
struct dma_attrs *attrs)
 {
-   phys_addr_t phys = page_to_phys(page) + offset;
+   phys_addr_t map, phys = page_to_phys(page) + offset;
dma_addr_t dev_addr = xen_phys_to_bus(phys);
-   void *map;
 
BUG_ON(dir == DMA_NONE);
/*
@@ -356,16 +355,16 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
 * Oh well, have to allocate and map a bounce buffer.
 */
map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
-   if (!map)
+   if (map == SWIOTLB_MAP_ERROR)
return DMA_ERROR_CODE;
 
-   dev_addr = xen_virt_to_bus(map);
+   dev_addr = xen_phys_to_bus(map);
 
/*
 * Ensure that the address returned is DMA'ble
 */
if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, map, size, dir);
+   swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
dev_addr = 0;
}
return dev_addr;
@@ -494,11 +493,12 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
if (swiotlb_force ||
!dma_capable(hwdev, dev_addr, sg->length) ||
range_straddles_page_boundary(paddr, sg->length)) {
-   void *map = swiotlb_tbl_map_single(hwdev,
-  start_dma_addr,
-  sg_phys(sg),
-  sg->length, dir);
-   if (!map) {
+   phys_addr_t map = swiotlb_tbl_map_single(hwdev,
+start_dma_addr,
+sg_phys(sg),
+sg->length,
+dir);
+   if (map == SWIOTLB_MAP_ERROR) {
/* Don't panic here, we expect map_sg users
   to do proper error handling. */
xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
@@ -506,7 +506,7 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
sgl[0].dma_length = 0;
return DMA_ERROR_CODE;
}
-   sg->dma_address = xen_virt_to_bus(map);
+   sg->dma_address = xen_phys_to_bus(map);
} else
sg->dma_address = dev_addr;
sg->dma_length = sg->length;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8d08b3e..1995f3e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -34,9 +34,14 @@ enum dma_sync_target {
SYNC_FOR_CPU = 0,
SYNC_FOR_DEVICE = 1,
 };
-extern void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
-   phys_addr_t phys, size_t size,
-   enum dma_data_direction dir);
+
+/* define the last possible byte of physical address space as a mapping error */
+#define SWIOTLB_MAP_ERROR (~(phys_addr_t)0x0)
+
+extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
+                                         dma_addr_t tbl_dma_addr,
+                                         phys_addr_t phys, size_t size,
+                                         enum dma_data_direction dir);

[PATCH 3/7] swiotlb: Make io_tlb_overflow_buffer a physical address

2012-10-05 Thread Alexander Duyck
This change makes it so that we can avoid virt_to_phys overhead when using the
io_tlb_overflow_buffer.  My original plan was to completely remove the value
and replace it with a constant but I had seen that there were recent patches
that stated this couldn't be done until all device drivers that depended on
that functionality were updated.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   61 -
 1 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 3c45f10..bbf36d1 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -70,7 +70,7 @@ static unsigned long io_tlb_nslabs;
  */
 static unsigned long io_tlb_overflow = 32*1024;
 
-static void *io_tlb_overflow_buffer;
+phys_addr_t io_tlb_overflow_buffer;
 
 /*
  * This is a free list describing the number of free entries available from
@@ -138,6 +138,7 @@ void swiotlb_print_info(void)
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 {
+   void *v_overflow_buffer;
unsigned long i, bytes;
 
bytes = nslabs << IO_TLB_SHIFT;
@@ -146,6 +147,15 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
io_tlb_addr = __pa(tlb);
 
/*
+* Get the overflow emergency buffer
+*/
+   v_overflow_buffer = alloc_bootmem_low_pages(PAGE_ALIGN(io_tlb_overflow));
+   if (!v_overflow_buffer)
+   panic("Cannot allocate SWIOTLB overflow buffer!\n");
+
+   io_tlb_overflow_buffer = __pa(v_overflow_buffer);
+
+   /*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 */
@@ -155,12 +165,6 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
io_tlb_index = 0;
io_tlb_orig_addr = alloc_bootmem_pages(PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));
 
-   /*
-* Get the overflow emergency buffer
-*/
-   io_tlb_overflow_buffer = alloc_bootmem_low_pages(PAGE_ALIGN(io_tlb_overflow));
-   if (!io_tlb_overflow_buffer)
-   panic("Cannot allocate SWIOTLB overflow buffer!\n");
if (verbose)
swiotlb_print_info();
 }
@@ -250,6 +254,7 @@ int
 swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 {
unsigned long i, bytes;
+   unsigned char *v_overflow_buffer;
 
bytes = nslabs << IO_TLB_SHIFT;
 
@@ -259,13 +264,23 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
memset(tlb, 0, bytes);
 
/*
+* Get the overflow emergency buffer
+*/
+   v_overflow_buffer = (void *)__get_free_pages(GFP_DMA,
+                                                get_order(io_tlb_overflow));
+   if (!v_overflow_buffer)
+   goto cleanup2;
+
+   io_tlb_overflow_buffer = virt_to_phys(v_overflow_buffer);
+
+   /*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 */
io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
  get_order(io_tlb_nslabs * sizeof(int)));
if (!io_tlb_list)
-   goto cleanup2;
+   goto cleanup3;
 
for (i = 0; i < io_tlb_nslabs; i++)
io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
@@ -276,18 +291,10 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 get_order(io_tlb_nslabs *
   sizeof(phys_addr_t)));
if (!io_tlb_orig_addr)
-   goto cleanup3;
+   goto cleanup4;
 
memset(io_tlb_orig_addr, 0, io_tlb_nslabs * sizeof(phys_addr_t));
 
-   /*
-* Get the overflow emergency buffer
-*/
-   io_tlb_overflow_buffer = (void *)__get_free_pages(GFP_DMA,
- get_order(io_tlb_overflow));
-   if (!io_tlb_overflow_buffer)
-   goto cleanup4;
-
swiotlb_print_info();
 
late_alloc = 1;
@@ -295,13 +302,13 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
return 0;
 
 cleanup4:
-   free_pages((unsigned long)io_tlb_orig_addr,
-  get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
-   io_tlb_orig_addr = NULL;
-cleanup3:
free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
 sizeof(int)));
io_tlb_list = NULL;
+cleanup3:
+   free_pages((unsigned long)v_overflow_buffer,
+  get_order(io_tlb_overflow));
+   io_tlb_overflow_buffer = 0;
 cleanup2:
io_tlb_addr = 0;
io_tlb_nslabs = 0;
@@ -310,11 +317,11 @@ cleanup2:
 
 void

[PATCH 6/7] swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single

2012-10-05 Thread Alexander Duyck
This change makes it so that the sync functionality also uses physical
addresses.  This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.

In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_sync_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr.  This way it should be clear that orig_addr is
contained within io_tlb_orig_addr and tlb_addr is an address within the
io_tlb_addr buffer.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |3 +--
 include/linux/swiotlb.h   |3 ++-
 lib/swiotlb.c |   22 +++---
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4cedc28..af47e75 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -433,8 +433,7 @@ xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
-   swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
-  target);
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
return;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 291643c..e0ac98f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -47,7 +47,8 @@ extern void swiotlb_tbl_unmap_single(struct device *hwdev,
 phys_addr_t tlb_addr,
 size_t size, enum dma_data_direction dir);
 
-extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
+extern void swiotlb_tbl_sync_single(struct device *hwdev,
+   phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir,
enum dma_sync_target target);
 
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index f37050f..d6d1ddc 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -553,26 +553,27 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
 }
 EXPORT_SYMBOL_GPL(swiotlb_tbl_unmap_single);
 
-void
-swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
-   enum dma_data_direction dir,
-   enum dma_sync_target target)
+void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t tlb_addr,
+size_t size, enum dma_data_direction dir,
+enum dma_sync_target target)
 {
-   int index = (dma_addr - (char *)phys_to_virt(io_tlb_addr)) >> IO_TLB_SHIFT;
-   phys_addr_t phys = io_tlb_orig_addr[index];
+   int index = (tlb_addr - io_tlb_addr) >> IO_TLB_SHIFT;
+   phys_addr_t orig_addr = io_tlb_orig_addr[index];
 
-   phys += ((unsigned long)dma_addr & ((1 << IO_TLB_SHIFT) - 1));
+   orig_addr += (unsigned long)tlb_addr & ((1 << IO_TLB_SHIFT) - 1);
 
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr),
+  size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr),
+  size, DMA_TO_DEVICE);
else
BUG_ON(dir != DMA_FROM_DEVICE);
break;
@@ -781,8 +782,7 @@ swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
BUG_ON(dir == DMA_NONE);
 
if (is_swiotlb_buffer(paddr)) {
-   swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
-  target);
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
return;
}
 



[PATCH 7/7] swiotlb: Do not export swiotlb_bounce since there are no external consumers

2012-10-05 Thread Alexander Duyck
Currently swiotlb is the only consumer for swiotlb_bounce.  Since that is the
case it doesn't make much sense to be exporting it so make it a static
function only.

In addition we can save a few more lines of code by making it so that it
accepts the DMA address as a physical address instead of a virtual one.  This
is the last piece in essentially pushing all of the DMA address values to use
physical addresses in swiotlb.

In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_bounce I am renaming phys to orig_addr, and dma_addr to
tlb_addr.  This way it should be clear that orig_addr is contained within
io_tlb_orig_addr and tlb_addr is an address within the io_tlb_addr buffer.

Signed-off-by: Alexander Duyck 
---

 include/linux/swiotlb.h |3 ---
 lib/swiotlb.c   |   35 ---
 2 files changed, 16 insertions(+), 22 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e0ac98f..071d62c 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -53,9 +53,6 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev,
enum dma_sync_target target);
 
 /* Accessory functions. */
-extern void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
-  enum dma_data_direction dir);
-
 extern void
 *swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_addr_t *dma_handle, gfp_t flags);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index d6d1ddc..c1c5f19 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -351,14 +351,15 @@ static int is_swiotlb_buffer(phys_addr_t paddr)
 /*
  * Bounce: copy the swiotlb buffer back to the original dma location
  */
-void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
-   enum dma_data_direction dir)
+static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
+  size_t size, enum dma_data_direction dir)
 {
-   unsigned long pfn = PFN_DOWN(phys);
+   unsigned long pfn = PFN_DOWN(orig_addr);
+   unsigned char *vaddr = phys_to_virt(tlb_addr);
 
if (PageHighMem(pfn_to_page(pfn))) {
/* The buffer does not have a mapping.  Map it in and copy */
-   unsigned int offset = phys & ~PAGE_MASK;
+   unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
unsigned int sz = 0;
unsigned long flags;
@@ -369,25 +370,23 @@ void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
local_irq_save(flags);
buffer = kmap_atomic(pfn_to_page(pfn));
if (dir == DMA_TO_DEVICE)
-   memcpy(dma_addr, buffer + offset, sz);
+   memcpy(vaddr, buffer + offset, sz);
else
-   memcpy(buffer + offset, dma_addr, sz);
+   memcpy(buffer + offset, vaddr, sz);
kunmap_atomic(buffer);
local_irq_restore(flags);
 
size -= sz;
pfn++;
-   dma_addr += sz;
+   vaddr += sz;
offset = 0;
}
+   } else if (dir == DMA_TO_DEVICE) {
+   memcpy(vaddr, phys_to_virt(orig_addr), size);
} else {
-   if (dir == DMA_TO_DEVICE)
-   memcpy(dma_addr, phys_to_virt(phys), size);
-   else
-   memcpy(phys_to_virt(phys), dma_addr, size);
+   memcpy(phys_to_virt(orig_addr), vaddr, size);
}
 }
-EXPORT_SYMBOL_GPL(swiotlb_bounce);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
   dma_addr_t tbl_dma_addr,
@@ -489,8 +488,7 @@ found:
for (i = 0; i < nslots; i++)
io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
-   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr), size,
-  DMA_TO_DEVICE);
+   swiotlb_bounce(orig_addr, tlb_addr, size, DMA_TO_DEVICE);
 
return tlb_addr;
 }
@@ -522,9 +520,8 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
/*
 * First, sync the memory before unmapping the entry
 */
-   if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr),
-  size, DMA_FROM_DEVICE);
+   if (orig_addr && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
+   swiotlb_bounce(orig_addr, tlb_addr, size, DMA_FROM_DEVICE);
 
/*
 * Return the buffer to the free list by setting the corresponding

[PATCH 5/7] swiotlb: Use physical addresses for swiotlb_tbl_unmap_single

2012-10-05 Thread Alexander Duyck
This change makes it so that the unmap functionality also uses physical
addresses.  This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.

In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_unmap_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr.  This way it should be clear that orig_addr is
contained within io_tlb_orig_addr and tlb_addr is an address within the
io_tlb_addr buffer.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |4 ++--
 include/linux/swiotlb.h   |3 ++-
 lib/swiotlb.c |   37 +++--
 3 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 8a6035a..4cedc28 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -364,7 +364,7 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
 * Ensure that the address returned is DMA'ble
 */
if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
+   swiotlb_tbl_unmap_single(dev, map, size, dir);
dev_addr = 0;
}
return dev_addr;
@@ -388,7 +388,7 @@ static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr,
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
-   swiotlb_tbl_unmap_single(hwdev, phys_to_virt(paddr), size, dir);
+   swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
return;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 1995f3e..291643c 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -43,7 +43,8 @@ extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
  phys_addr_t phys, size_t size,
  enum dma_data_direction dir);
 
-extern void swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr,
+extern void swiotlb_tbl_unmap_single(struct device *hwdev,
+phys_addr_t tlb_addr,
 size_t size, enum dma_data_direction dir);
 
 extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 98d5733..f37050f 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -511,20 +511,20 @@ phys_addr_t map_single(struct device *hwdev, phys_addr_t phys, size_t size,
 /*
  * dma_addr is the kernel virtual address of the bounce buffer to unmap.
  */
-void
-swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size,
-   enum dma_data_direction dir)
+void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
+ size_t size, enum dma_data_direction dir)
 {
unsigned long flags;
int i, count, nslots = ALIGN(size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
-   int index = (dma_addr - (char *)phys_to_virt(io_tlb_addr)) >> IO_TLB_SHIFT;
-   phys_addr_t phys = io_tlb_orig_addr[index];
+   int index = (tlb_addr - io_tlb_addr) >> IO_TLB_SHIFT;
+   phys_addr_t orig_addr = io_tlb_orig_addr[index];
 
/*
 * First, sync the memory before unmapping the entry
 */
if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-   swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr),
+  size, DMA_FROM_DEVICE);
 
/*
 * Return the buffer to the free list by setting the corresponding
@@ -617,17 +617,18 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
 
ret = phys_to_virt(paddr);
dev_addr = phys_to_dma(hwdev, paddr);
-   }
 
-   /* Confirm address can be DMA'd by device */
-   if (dev_addr + size - 1 > dma_mask) {
-   printk("hwdev DMA mask = 0x%016Lx, dev_addr = 0x%016Lx\n",
-  (unsigned long long)dma_mask,
-  (unsigned long long)dev_addr);
+   /* Confirm address can be DMA'd by device */
+   if (dev_addr + size - 1 > dma_mask) {
+   printk("hwdev DMA mask = 0x%016Lx, dev_addr = 0x%016Lx\n",
+  (unsigned long long)dma_mask,
+  (unsigned long long)dev_addr);
 
-   /* DMA_TO_DEVICE to avoid memcpy in unmap_single */
-   swiotlb_tbl_unmap_single(hwdev, ret, size, DMA_TO_DEVICE);
-   return NULL;
+   /* DMA_TO_DEVICE to avoid memcpy in unmap_single */
+   swiotlb_tbl_unmap_single(hwdev, paddr,
+                                size, DMA_TO_DEVICE);
+           return NULL;

Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

2012-10-08 Thread Alexander Duyck
On 10/06/2012 10:57 AM, Andi Kleen wrote:
>> Inlining everything did speed things up a bit, but I still didn't reach
>> the same speed I achieved using the patch set.  However I did notice the
>> resulting swiotlb code was considerably larger.
> Thanks. So your patch makes sense, but imho should pursue the inlining
> in parallel for other call sites.

I'll try to take a look at getting that done this morning.

>> assembly, is replaced with 8 lines of assembly and becomes inline.  In
>> addition we drop the number of calls to __phys_addr from 9 to 2 by
>> dropping them all from swiotlb.  By my math I am probably saving about
>> 120 instructions per packet.  I suspect all of that would probably be
>> cutting the number of instructions per packet enough to probably account
>> for a 5% difference when you consider I am running at about 1.5Mpps per
>> core on a 2.7Ghz processor.
> Maybe it's just me, but that's somehow sad for one if() and a subtraction

Well there is also all of the setup of the call on the function stack. 
By my count just the portion that is used in the standard case is about
9 lines of assembly.  By inlining it and dropping the if case we can
probably drop it to 1.

> BTW __pa used to be a simple subtraction, the if () was just added to
> handle the few call sites for x86-64 that do __pa(&text_symbol).
> Maybe we should just go back to the old __pa_symbol() for those cases,
> then __pa could be the simple subtraction it used to was again
> and it could be inlined and everyone would be happy.
>
> -Andi

What I am probably looking at doing is splitting the function in two as
you suggest, with a separate function for the text symbol case.
I will probably also take the 32 bit approach and add a debug version
that is still a separate function for uses such as determining if we
have any callers who should be using __pa_symbol instead of __pa.

Thanks,

Alex


[PATCH] x86: Improve 64 bit __phys_addr call performance

2012-10-09 Thread Alexander Duyck
This patch is meant to improve overall system performance when making use of
the __phys_addr call on 64 bit x86 systems.  To do this I have implemented
several changes.

First if CONFIG_DEBUG_VIRTUAL is not defined __phys_addr is made an inline,
similar to how this is currently handled in 32 bit.  However in order to do
this it is required to export phys_base so that it is available if __phys_addr
is used in kernel modules.

The second change was to streamline the code by making use of the carry flag
on an add operation instead of performing a compare on a 64 bit value.  The
advantage to this is that it allows us to reduce the overall size of the call.
On my Xeon E5 system the entire __phys_addr inline call consumes 30 bytes and
5 instructions.  I also applied similar logic to the debug version of the
function.  My testing shows that the debug version of the function with this
patch applied is slightly faster than the non-debug version without the patch.

Finally, when building the kernel with the first two changes applied I saw
build warnings about __START_KERNEL_map and PAGE_OFFSET constants not fitting
in their type.  In order to resolve the build warning I changed their type
from UL to ULL.
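
For reference, the carry-flag trick can be demonstrated stand-alone
(user-space C, LP64 assumed; the addresses are purely illustrative): after
y = x - __START_KERNEL_map, the unsigned subtraction wraps exactly when x
was below the base, so the single test (x > y) separates the kernel text
mapping from the direct mapping.

#include <stdio.h>

#define BASE 0xffffffff80000000UL  /* stand-in for __START_KERNEL_map */

int main(void)
{
        unsigned long x = 0xffffffff80001000UL; /* above BASE: kernel text */
        unsigned long y = x - BASE;

        printf("kernel text: x > y = %d\n", x > y);     /* 1: no wrap */

        x = 0xffff880000001000UL;               /* below BASE: direct map */
        y = x - BASE;
        printf("direct map:  x > y = %d\n", x > y);     /* 0: wrapped */

        return 0;
}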

Signed-off-by: Alexander Duyck 
---

 arch/x86/include/asm/page_64_types.h |   16 ++--
 arch/x86/kernel/x8664_ksyms_64.c |3 +++
 arch/x86/mm/physaddr.c   |   20 ++--
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 320f7bb..a951e4d 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -30,14 +30,14 @@
  * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
  * what Xen requires.
  */
-#define __PAGE_OFFSET   _AC(0xffff880000000000, UL)
+#define __PAGE_OFFSET   _AC(0xffff880000000000, ULL)
 
 #define __PHYSICAL_START   ((CONFIG_PHYSICAL_START +   \
  (CONFIG_PHYSICAL_ALIGN - 1)) &\
 ~(CONFIG_PHYSICAL_ALIGN - 1))
 
 #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
-#define __START_KERNEL_map _AC(0xffffffff80000000, UL)
+#define __START_KERNEL_map _AC(0xffffffff80000000, ULL)
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
 #define __PHYSICAL_MASK_SHIFT  46
@@ -58,7 +58,19 @@ void copy_page(void *to, void *from);
 extern unsigned long max_pfn;
 extern unsigned long phys_base;
 
+#ifdef CONFIG_DEBUG_VIRTUAL
 extern unsigned long __phys_addr(unsigned long);
+#else
+static inline unsigned long __phys_addr(unsigned long x)
+{
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
+
+   return x;
+}
+#endif
 #define __phys_reloc_hide(x)   (x)
 
 #define vmemmap ((struct page *)VMEMMAP_START)
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index 1330dd1..b014d94 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -59,6 +59,9 @@ EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(memmove);
 
+#ifndef CONFIG_DEBUG_VIRTUAL
+EXPORT_SYMBOL(phys_base);
+#endif
 EXPORT_SYMBOL(empty_zero_page);
 #ifndef CONFIG_PARAVIRT
 EXPORT_SYMBOL(native_load_gs_index);
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index d2e2735..f63bec5 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -8,20 +8,28 @@
 
 #ifdef CONFIG_X86_64
 
+#ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
-   if (x >= __START_KERNEL_map) {
-   x -= __START_KERNEL_map;
-   VIRTUAL_BUG_ON(x >= KERNEL_IMAGE_SIZE);
-   x += phys_base;
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   if (unlikely(x > y)) {
+   x = y + phys_base;
+
+   VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
} else {
-   VIRTUAL_BUG_ON(x < PAGE_OFFSET);
-   x -= PAGE_OFFSET;
+   x = y + (__START_KERNEL_map - PAGE_OFFSET);
+
+   /* carry flag will be set if starting x was >= PAGE_OFFSET */
+   VIRTUAL_BUG_ON(x > y);
VIRTUAL_BUG_ON(!phys_addr_valid(x));
}
+
return x;
 }
 EXPORT_SYMBOL(__phys_addr);
+#endif
 
 bool __virt_addr_valid(unsigned long x)
 {



Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

2012-10-09 Thread Alexander Duyck
On 10/08/2012 08:43 AM, Alexander Duyck wrote:
> On 10/06/2012 10:57 AM, Andi Kleen wrote:
>> BTW __pa used to be a simple subtraction, the if () was just added to
>> handle the few call sites for x86-64 that do __pa(&text_symbol).
>> Maybe we should just go back to the old __pa_symbol() for those cases,
>> then __pa could be the simple subtraction it used to was again
>> and it could be inlined and everyone would be happy.
>>
>> -Andi
> What I am probably looking at doing is splitting the function in two as
> you suggest where we have a separate function for the text symbol case. 
> I will probably also take the 32 bit approach and add a debug version
> that is still a separate function for uses such as determining if we
> have any callers who should be using __pa_symbol instead of __pa.
>
> Thanks,
>
> Alex

I gave up on trying to split __pa and __pa_symbol.   Yesterday I
realized there is way too much code that depends on the two resolving to
the same function, and many cases are pretty well hidden.  Instead I
just mailed out a patch that inlines an optimized version of
__phys_addr.  I figure it is probably as good as it is going to get
without having to rip the entire x86 portion of the kernel apart to
separate uses of __pa and __pa_symbol.

Thanks,

Alex


Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

2012-10-09 Thread Alexander Duyck
On 10/09/2012 09:43 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Oct 04, 2012 at 01:22:58PM -0700, Alexander Duyck wrote:
>> On 10/04/2012 10:19 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> @@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, 
>>>>>> dma_addr_t tbl_dma_addr,
>>>>>>  io_tlb_list[i] = 0;
>>>>>>  for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) 
>>>>>> != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
>>>>>>  io_tlb_list[i] = ++count;
>>>>>> -dma_addr = io_tlb_start + (index << 
>>>>>> IO_TLB_SHIFT);
>>>>>> +dma_addr = (char *)phys_to_virt(io_tlb_start) + 
>>>>>> (index << IO_TLB_SHIFT);
>>>>> I think this is going to fall flat with the other user of
>>>>> swiotlb_tbl_map_single - Xen SWIOTLB. When it allocates the io_tlb_start
>>>>> and does it magic to make sure its under 4GB - the io_tlb_start swath
>>>>> of memory, ends up consisting of 2MB chunks of contingous spaces. But each
>>>>> chunk is not linearly in the DMA space (thought it is in the CPU space).
>>>>>
>>>>> Meaning the io_tlb_start region 0-2MB can fall within the DMA address 
>>>>> space
>>>>> of 2048MB->2032MB, and io_tlb_start offset 2MB->4MB, can fall within 
>>>>> 1024MB->1026MB,
>>>>> and so on (depending on the availability of memory under 4GB).
>>>>>
>>>>> There is a clear virt_to_phys(x) != virt_to_dma(x).
>>>> Just to be sure I understand you are talking about DMA address space,
>>>> not physical address space correct?  I am fully aware that DMA address
>>>> space can be all over the place.  When I was writing the patch set the
>>>> big reason why I decided to stop at physical address space was because
>>>> DMA address spaces are device specific.
>>>>
>>>> I understand that virt_to_phys(x) != virt_to_dma(x) for many platforms,
>>>> however that is not my assertion.  My assertion is (virt_to_phys(x) + y)
>>>> == virt_to_phys(x + y).  This should be true for any large block of
>>>> contiguous memory that is DMA accessible since the CPU and the device
>>>> should be able to view the memory in the same layout.  If that wasn't
>>> That is true mostly for x86 but not all platforms do this.
>>>
>>>> true I don't think is_swiotlb_buffer would be working correctly since it
>>>> is essentially operating on the same assumption prior to my patches.
>>> There are two pieces here - the is_swiotlb_buffer and the 
>>> swiotlb_tbl_[map|unmap]
>>> functions.
>>>
>>> The is_swiotlb_buffer is operating on that principle (and your change
>>> to reflect that is OK). The swiotlb_tbl_[*] is not.
>>>> If you take a look at patches 4 and 5 I do address changes that end up
>>>> needing to be made to Xen SWIOTLB since it makes use of
>>>> swiotlb_tbl_map_single.  All that I effectively end up changing is that
>>>> instead of messing with a void pointer we instead are dealing with a
>>>> physical address, and instead of calling xen_virt_to_bus we end up
>>>> calling xen_phys_to_bus and thereby drop one extra virt_to_phys call in
>>>> the process.
>>> Sure that is OK. All of those changes when we bypass the bounce
>>> buffer look OK (thought I should double-check again the patch to make
>>> sure and also just take it for a little test spin).
>> I'm interesting in finding out what the results of your test spin are. 
> Haven't gotten to that yet :-(
>>> The issue is when we do _use_ the bounce buffer. At that point we
>>> run into the allocation from the bounce buffer where the patches
>>> assume that the 64MB swath of bounce buffer memory is bus (or DMA)
>>> memory contingous. And that is not the case sadly.
>> I think I understand what you are saying now.  However, I don't think
>> the issue applies to my patches.
> Great.
>> If I am not mistaken what you are talking about is the pseudo-physical
>> memory versus machine memory.  I understand the 64MB block is not
>> machine-memory contiguous, but it should be pseudo-physical contiguous
>> memory.  As such using the pseudo-physical addresses instead of virtual
>> addresses should function the same way as using virtual addresses did.

[PATCH] x86: Make it so that __pa_symbol can only process kernel symbols on x86_64

2012-10-10 Thread Alexander Duyck
I submitted an earlier patch that makes __phys_addr an inline.  This obviously
results in an increase in the code size.  One step I can take to reduce that
is to make it so that the __pa_symbol call does a direct translation for
kernel addresses instead of covering all of virtual memory.

On my system this reduced the size for __pa_symbol from 5 instructions
totalling 30 bytes to 3 instructions totalling 16 bytes.
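
In effect, for a kernel-text symbol the translation collapses to a fixed
offset.  A minimal sketch of the math the new macro performs
(some_kernel_symbol is a made-up symbol, purely for illustration):

   unsigned long phys = (unsigned long)&some_kernel_symbol
                        - __START_KERNEL_map + phys_base;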

Signed-off-by: Alexander Duyck 
---

 arch/x86/include/asm/page.h  |3 ++-
 arch/x86/include/asm/page_32.h   |1 +
 arch/x86/include/asm/page_64_types.h |3 +++
 arch/x86/mm/physaddr.c   |   15 +++
 4 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..3698a6a 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -44,7 +44,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
  * case properly. Once all supported versions of gcc understand it, we can
  * remove this Voodoo magic stuff. (i.e. once gcc3.x is deprecated)
  */
-#define __pa_symbol(x) __pa(__phys_reloc_hide((unsigned long)(x)))
+#define __pa_symbol(x) \
+   __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
 
 #define __va(x)((void *)((unsigned long)(x)+PAGE_OFFSET))
 
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index da4e762..4d550d0 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -15,6 +15,7 @@ extern unsigned long __phys_addr(unsigned long);
 #else
 #define __phys_addr(x) __phys_addr_nodebug(x)
 #endif
+#define __phys_addr_symbol(x)  __phys_addr(x)
 #define __phys_reloc_hide(x)   RELOC_HIDE((x), 0)
 
 #ifdef CONFIG_FLATMEM
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index a951e4d..217c7d5 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -60,6 +60,7 @@ extern unsigned long phys_base;
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern unsigned long __phys_addr(unsigned long);
+extern unsigned long __phys_addr_symbol(unsigned long);
 #else
 static inline unsigned long __phys_addr(unsigned long x)
 {
@@ -70,6 +71,8 @@ static inline unsigned long __phys_addr(unsigned long x)
 
return x;
 }
+#define __phys_addr_symbol(x) \
+   ((unsigned long)(x) - __START_KERNEL_map + phys_base)
 #endif
 #define __phys_reloc_hide(x)   (x)
 
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index f63bec5..666edbd 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -29,6 +29,21 @@ unsigned long __phys_addr(unsigned long x)
return x;
 }
 EXPORT_SYMBOL(__phys_addr);
+
+unsigned long __phys_addr_symbol(unsigned long x)
+{
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   VIRTUAL_BUG_ON(x < y);
+
+   x = y + phys_base;
+
+   VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
+
+   return x;
+}
+EXPORT_SYMBOL(__phys_addr_symbol);
 #endif
 
 bool __virt_addr_valid(unsigned long x)



Re: [PATCH] x86: Improve 64 bit __phys_addr call performance

2012-10-10 Thread Alexander Duyck
On 10/10/2012 06:58 AM, Andi Kleen wrote:
>> The second change was to streamline the code by making use of the carry flag
>> on an add operation instead of performing a compare on a 64 bit value.  The
>> advantage to this is that it allows us to reduce the overall size of the 
>> call.
>> On my Xeon E5 system the entire __phys_addr inline call consumes 30 bytes and
>> 5 instructions.  I also applied similar logic to the debug version of the
>> function.  My testing shows that the debug version of the function with this
>> patch applied is slightly faster than the non-debug version without the 
>> patch.
> Looks good. Thanks. 
>
> Probably should still split the callers though (or have a pa_symbol_fast
> that does not do the check)
>
> -Andi

I hadn't thought of that.  I couldn't drop support for symbols from
__pa, but I can get away with dropping support for regular addresses
from __pa_symbol.

I just submitted a patch to drop support for standard virtual addresses
from __pa_symbol.  I will also submit some patches tomorrow morning for
cleaning up a number of places I had found where we were calling
__pa/virt_to_phys when we should have been calling __pa_symbol.

Thanks,

Alex


Re: [PATCH v2 1/7] swiotlb: Make io_tlb_end a physical address instead of a virtual one

2012-10-18 Thread Alexander Duyck
On 10/18/2012 05:41 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Oct 15, 2012 at 08:43:28AM -0700, Alexander Duyck wrote:
>> On 10/13/2012 05:52 AM, Hillf Danton wrote:
>>> Hi Alexander,
>>>
>>> On Fri, Oct 12, 2012 at 4:34 AM, Alexander Duyck
>>>  wrote:
>>>> This change replaces all references to the virtual address for io_tlb_end
>>>> with references to the physical address io_tlb_end.  The main advantage of
>>>> replacing the virtual address with a physical address is that we can avoid
>>>> having to do multiple translations from the virtual address to the physical
>>>> one needed for testing an existing DMA address.
>>>>
>>>> Signed-off-by: Alexander Duyck 
>>>> ---
>>>>
>>>>  lib/swiotlb.c |   24 +---
>>>>  1 files changed, 13 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
>>>> index f114bf6..19aac9f 100644
>>>> --- a/lib/swiotlb.c
>>>> +++ b/lib/swiotlb.c
>>>> @@ -57,7 +57,8 @@ int swiotlb_force;
>>>>   * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated 
>>>> by this
>>>>   * API.
>>>>   */
>>>> -static char *io_tlb_start, *io_tlb_end;
>>>> +static char *io_tlb_start;
>>>> +phys_addr_t io_tlb_end;
>>> If add io_tlb_start_phy and io_tlb_end_phy, could we get same results
>>> with less hunks?
>>>
>>> Hillf
>> What do you mean by less hunks?  Are you referring to the memory space? 
> As in less patch movements.
>> If so, then the patches I am submitting do not impact how much space is
>> used for the bounce buffer.  The only real result of these patches is
>> that the total code path is significantly reduced since we don't have to
>> perform any virtual to physical translations in the hot-path.
> No. He is referring that you can keep io_tlb_end still here. Just
> do the computation of the physical address in the init path (of the end).
> Then you don't need to do the shifting in the 'is-this-swiotlb-buffer'
> and can just do a simple:
>   if (dma_addr >= io_tlb_start && dma_addr <= io_tlb_end)
>

That is how the code ends up.  The v2 and v3 versions of these patches
leave the end value there.  As this patch says, I am just changing the
end to be physical instead of virtual.  I reviewed the code and realized
that I wasn't saving anything by removing it, since the overall code was
larger as a result, so I just converted it to a physical address.  There
are no users of io_tlb_end that access it as a virtual value, so all I
did was change it to a physical one and drop the virt_to_phys calls
that were made on it.  If I am not mistaken, by the second patch the
is_swiotlb_buffer call is literally what you have described above.
Here is the snippet from the 2nd patch:

static int is_swiotlb_buffer(phys_addr_t paddr)
 {
-   return paddr >= virt_to_phys(io_tlb_start) && paddr < io_tlb_end;
+   return paddr >= io_tlb_start && paddr < io_tlb_end;
 }


As far as the number of patches goes, I decided to do this incrementally
instead of trying to do it all at once.  That way it is clearer to the
reviewer what I am doing in each step and it can be more easily bisected
in case I messed up somewhere.  If you want fewer patches I can do that
but I don't see the point in combining patches since they are all just
going to result in the same total change anyway.

Thanks,

Alex




Re: [PATCH v2 1/7] swiotlb: Make io_tlb_end a physical address instead of a virtual one

2012-10-19 Thread Alexander Duyck
On 10/19/2012 07:18 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Oct 18, 2012 at 08:53:33AM -0700, Alexander Duyck wrote:
>> end to be physical instead of virtual.  I reviewed the code and realized
>> that I wasn't saving anything by removing it since the overall code was
>> larger as a result so I just converted it to a physical address.  There
>> are no users of io_tlb_end that are accessing it as a virtual value so
>> all I did is just change it to a physical one and drop the virt_to_phys
>> calls that were made on it.  If I am not mistaken by the second patch
>> the is_swiotlb_buffer call is literally what you have described above. 
>> Here is the snippet from the 2nd patch:
>>
>> static int is_swiotlb_buffer(phys_addr_t paddr)
>>  {
>> -return paddr >= virt_to_phys(io_tlb_start) && paddr < io_tlb_end;
>> +return paddr >= io_tlb_start && paddr < io_tlb_end;
>>  }
>>
>>
>> As far as the number of patches I decided to do this incrementally
>> instead of trying to do it all at once.  That way it is clearer to the
>> reviewer what I am doing in each step and it can be more easily bisected
>> in case I messed up somewhere.  If you want fewer patches I can do that
>> but I don't see the point in combining patches since they are all just
>> going to result in the same total change anyway.
> No that is OK. BTW, I did a testing of your V2 patches with Xen-SWIOTLB
> and they worked. But it was on non-debug env. The debug one does such
> evil things as make the initial domain memory start at the end physical
> memory and put the underlaying MFNs (machine frame numbers, the real
> physical frames) in reverse order. So for example pfn 100, ends up being
> mfn 0xf00d, and pfn 101 ends up being oxf00c.

Glad to hear they are working under Xen.  As I said, I am pretty sure
that if it was working before it should still be working after these
changes, since the only difference is that instead of working with
virtual addresses and having to call virt_to_phys to get the
pseudo-physical addresses, we are just working with the pseudo-physical
addresses directly.

Thanks,

Alex






Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation

2012-12-07 Thread Alexander Duyck
On 12/07/2012 02:07 AM, Ben Hutchings wrote:
> On Thu, 2012-12-06 at 17:56 -0800, Joseph Gasparakis wrote:
>> This patch adds support in the kernel for offloading in the NIC Tx and Rx
>> checksumming for encapsulated packets (such as VXLAN and IP GRE).
> [...]
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1063,6 +1063,8 @@ struct net_device {
>>  netdev_features_t   wanted_features;
>>  /* mask of features inheritable by VLAN devices */
>>  netdev_features_t   vlan_features;
>> +/* mask of features inherited by encapsulating devices */
>> +netdev_features_t   hw_enc_features;
> [...]
> 
> How will the networking core know *which* encapsulations this applies
> to?  I notice that your implementation in ixgbe does not set
> NETIF_F_HW_CSUM here, so presumably the hardware will parse headers to
> find which ranges should be checksummed and it won't cover the next
> encapsulation protocol that comes along.
> 
> Ben.
> 

Actually the offload is generic to any encapsulation that does not
compute a checksum on the inner headers.  So as long as you can treat
the outer headers as one giant L2 header, you can pretty much ignore
what is in there, provided the inner network and transport header values
are set.  There are a number of tunnels that fall into that category,
since most just use IP as the L2 and the L3 usually doesn't contain any
checksum.

Thanks,

Alex






Re: [PATCH v4 1/5] net: Add support for hardware-offloaded encapsulation

2012-12-10 Thread Alexander Duyck
On 12/10/2012 02:04 AM, saeed bishara wrote:
>> +static inline struct iphdr *inner_ip_hdr(const struct sk_buff *skb)
>> +{
>> +   return (struct iphdr *)skb_inner_network_header(skb);
>> +}
> Hi,
> I'm a little bit bothered because of those inner_ functions, what
> about the following approach:
> 1. the skb will have a new state, that state can be outer (normal
> mode) and inner.
> 2. when you change the state to inner, all the helper functions such
> as ip_hdr will return the innter header.
>
> that's ofcourse the API side. the implementation may still use the
> fields you added to the skb.
>
> what you think?
> saeed

What you describe isn't too far off from what we are doing.  However, we
need to store both the inner and the outer headers.  All these inner_
functions are meant to do is assist drivers in accessing the inner
headers in the case that skb->encapsulation is set.  We wanted to avoid
abstracting it too much, since it is possible in the future that both
inner and outer network headers may be needed if, for instance, you were
to place a tunnelled frame inside of a VLAN with hardware tag insertion.
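
For what it's worth, the intended driver-side pattern is roughly the
following (a hedged sketch built on the helpers from patch 1; the wrapper
name tx_ip_hdr is made up):

#include <linux/ip.h>
#include <linux/skbuff.h>

/* pick the IP header that offloaded checksumming should apply to */
static struct iphdr *tx_ip_hdr(const struct sk_buff *skb)
{
        if (skb->encapsulation)
                return inner_ip_hdr(skb);       /* tunnelled (inner) frame */

        return ip_hdr(skb);                     /* ordinary frame */
}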

Thanks,

Alex


Re: [PATCH v4 1/5] net: Add support for hardware-offloaded encapsulation

2012-12-11 Thread Alexander Duyck
On 12/11/2012 12:11 AM, saeed bishara wrote:
> On Mon, Dec 10, 2012 at 9:58 PM, Dmitry Kravkov  wrote:
>>> -Original Message-
>>> From: saeed bishara [mailto:saeed.bish...@gmail.com]
>>> Sent: Monday, December 10, 2012 12:04 PM
>>> To: Joseph Gasparakis
>>> Cc: da...@davemloft.net; shemmin...@vyatta.com; chr...@sous-sol.org;
>>> go...@redhat.com; net...@vger.kernel.org; linux-kernel@vger.kernel.org;
>>> Dmitry Kravkov; bhutchi...@solarflare.com; Peter P Waskiewicz Jr; Alexander
>>> Duyck
>>> Subject: Re: [PATCH v4 1/5] net: Add support for hardware-offloaded
>>> encapsulation
>>>
>>>> +static inline struct iphdr *inner_ip_hdr(const struct sk_buff *skb)
>>>> +{
>>>> +   return (struct iphdr *)skb_inner_network_header(skb);
>>>> +}
>>>
>>> Hi,
>>> I'm a little bit bothered because of those inner_ functions, what
>>> about the following approach:
>>> 1. the skb will have a new state, that state can be outer (normal
>>> mode) and inner.
>>> 2. when you change the state to inner, all the helper functions such
>>> as ip_hdr will return the innter header.
>>>
>>> that's ofcourse the API side. the implementation may still use the
>>> fields you added to the skb.
>>>
>>> what you think?
>>> saeed
>>
>> Some drivers will probably need both inner_ and other_ in same flow, 
>> switching between two states will consume cpu cycles.
> from performance perspective, I'm not sure the switching is worse, it
> may be better as it reduces code size. please have a look at patch
> 2/5, with switching you can avoid doing the following change -> less
> code, less if-else.
> -   skb_set_transport_header(skb,
> -   skb_checksum_start_offset(skb));
> +   if (skb->encapsulation)
> +   skb_set_inner_transport_header(skb,
> +   
> skb_checksum_start_offset(skb));
> +   else
> +   skb_set_transport_header(skb,
> +   
> skb_checksum_start_offset(skb));
> if (!(features & NETIF_F_ALL_CSUM) &&
> 
> I think also that from (stack) maintenance perspective, less code is better.

I don't think your argument is making much sense.  With the approach we
took the switching only needs to take place in the offloaded path.  If
we were to put the switching in place generically we would end up with
the code scattered all throughout the stack.  In addition we will need
both the inner and outer headers to be captured in the case of an
encapsulated offload because the stack will need access to the outer
headers for routing.

My advice is if you have an idea then please just code it up, test it,
and submit a patch so that we can see what you are talking about.  My
concern is that you are suggesting we come up with a generic network and
transport offset that I don't believe has been completely thought through.

Thanks,

Alex



Re: [PATCH v3 1/8] x86: Improve __phys_addr performance by making use of carry flags and inlining

2012-11-16 Thread Alexander Duyck
On 11/05/2012 02:08 PM, Kirill A. Shutemov wrote:
> On Mon, Nov 05, 2012 at 01:56:28PM -0800, Alexander Duyck wrote:
>> On 11/05/2012 12:24 PM, Kirill A. Shutemov wrote:
>>> On Mon, Nov 05, 2012 at 11:04:06AM -0800, Alexander Duyck wrote:
>>>> This patch is meant to improve overall system performance when making use 
>>>> of
>>>> the __phys_addr call.  To do this I have implemented several changes.
>>>>
>>>> First if CONFIG_DEBUG_VIRTUAL is not defined __phys_addr is made an inline,
>>>> similar to how this is currently handled in 32 bit.  However in order to do
>>>> this it is required to export phys_base so that it is available if 
>>>> __phys_addr
>>>> is used in kernel modules.
>>>>
>>>> The second change was to streamline the code by making use of the carry 
>>>> flag
>>>> on an add operation instead of performing a compare on a 64 bit value.  The
>>>> advantage to this is that it allows us to significantly reduce the overall
>>>> size of the call.  On my Xeon E5 system the entire __phys_addr inline call
>>>> consumes a little less than 32 bytes and 5 instructions.  I also applied
>>>> similar logic to the debug version of the function.  My testing shows that 
>>>> the
>>>> debug version of the function with this patch applied is slightly faster 
>>>> than
>>>> the non-debug version without the patch.
>>>>
>>>> When building the kernel with the first two changes applied I saw build
>>>> warnings about __START_KERNEL_map and PAGE_OFFSET constants not fitting in
>>>> their type.  In order to resolve the build warning I changed their type 
>>>> from
>>>> UL to ULL.
>>> What kind of warning messages did you see?
>>> It's strange: sizeof(unsinged long) == sizeof(unsinged long long) on
>>> x86_64
>> One of the warnings is included below:
>>
>> In file included from 
>> /usr/src/kernels/linux-next/arch/x86/include/asm/page_types.h:37,
>>  from 
>> /usr/src/kernels/linux-next/arch/x86/include/asm/pgtable_types.h:5,
>>  from 
>> /usr/src/kernels/linux-next/arch/x86/include/asm/boot.h:11,
>>  from arch/x86/realmode/rm/../../boot/boot.h:26,
>>  from arch/x86/realmode/rm/../../boot/regs.c:19,
>>  from arch/x86/realmode/rm/regs.c:1:
>> /usr/src/kernels/linux-next/arch/x86/include/asm/page_64_types.h: In 
>> function '__phys_addr_nodebug':
>> /usr/src/kernels/linux-next/arch/x86/include/asm/page_64_types.h:63: 
>> warning: integer constant is too large for 'unsigned long' type
>> /usr/src/kernels/linux-next/arch/x86/include/asm/page_64_types.h:66: 
>> warning: integer constant is too large for 'unsigned long' type
>> /usr/src/kernels/linux-next/arch/x86/include/asm/page_64_types.h:66: 
>> warning: integer constant is too large for 'unsigned long' type
>>
>> The warnings all seemed to originate from several different spots
>> throughout the x86 tree.  All of the warning messages include
>> arch/x86/boot/boot.h:26 and then from there up the included from list is
>> always the same.
> Realmode code compiles with -m32. I guess it's just wrong that it tries to
> include .
>

I have been reviewing things and I think the problem is due to the fact
that we have content in page_64_types.h that doesn't really belong
there.  I will add a patch that moves the virtual to physical address
translation header contents over to page_64.h.  That will keep it
consistent with where it is in the 32 bit build (page_32.h) and avoids
the build conflicts.

Expect a v4 of the patches with this fixed in the next few hours.

Thanks,

Alex


[PATCH v4] x86/xen: Use __pa_symbol instead of __pa on C visible symbols

2012-11-16 Thread Alexander Duyck
This change updates a few of the functions to use __pa_symbol when
translating C visible symbols instead of __pa.  By using __pa_symbol we are
able to drop a few extra lines of code as we don't have to test to see if the
virtual pointer is a part of the kernel text or just standard virtual memory.

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---

v4:  I have spun this patch off as a separate patch for v4 due to the fact that
 this patch doesn't apply cleanly to Linus's tree.  As such I am
 submitting it based off of the linux-next tree to be accepted in the Xen
 tree since this patch can actually exist on its own without the need
 for the other patches in the original __phys_addr performance series.

 arch/x86/xen/mmu.c |   21 +++--
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4a05b39..a63e5f9 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1486,7 +1486,8 @@ static int xen_pgd_alloc(struct mm_struct *mm)
 
if (user_pgd != NULL) {
user_pgd[pgd_index(VSYSCALL_START)] =
-   __pgd(__pa(level3_user_vsyscall) | _PAGE_TABLE);
+   __pgd(__pa_symbol(level3_user_vsyscall) |
+ _PAGE_TABLE);
ret = 0;
}
 
@@ -1958,10 +1959,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 * pgd.
 */
if (xen_feature(XENFEAT_writable_page_tables)) {
-   native_write_cr3(__pa(init_level4_pgt));
+   native_write_cr3(__pa_symbol(init_level4_pgt));
} else {
xen_mc_batch();
-   __xen_write_cr3(true, __pa(init_level4_pgt));
+   __xen_write_cr3(true, __pa_symbol(init_level4_pgt));
xen_mc_issue(PARAVIRT_LAZY_CPU);
}
/* We can't that easily rip out L3 and L2, as the Xen pagetables are
@@ -1984,10 +1985,10 @@ static RESERVE_BRK_ARRAY(pmd_t, swapper_kernel_pmd, 
PTRS_PER_PMD);
 
 static void __init xen_write_cr3_init(unsigned long cr3)
 {
-   unsigned long pfn = PFN_DOWN(__pa(swapper_pg_dir));
+   unsigned long pfn = PFN_DOWN(__pa_symbol(swapper_pg_dir));
 
-   BUG_ON(read_cr3() != __pa(initial_page_table));
-   BUG_ON(cr3 != __pa(swapper_pg_dir));
+   BUG_ON(read_cr3() != __pa_symbol(initial_page_table));
+   BUG_ON(cr3 != __pa_symbol(swapper_pg_dir));
 
/*
 * We are switching to swapper_pg_dir for the first time (from
@@ -2011,7 +2012,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, pfn);
 
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE,
- PFN_DOWN(__pa(initial_page_table)));
+ PFN_DOWN(__pa_symbol(initial_page_table)));
set_page_prot(initial_page_table, PAGE_KERNEL);
set_page_prot(initial_kernel_pmd, PAGE_KERNEL);
 
@@ -2036,7 +2037,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, 
unsigned long max_pfn)
 
copy_page(initial_page_table, pgd);
initial_page_table[KERNEL_PGD_BOUNDARY] =
-   __pgd(__pa(initial_kernel_pmd) | _PAGE_PRESENT);
+   __pgd(__pa_symbol(initial_kernel_pmd) | _PAGE_PRESENT);
 
set_page_prot(initial_kernel_pmd, PAGE_KERNEL_RO);
set_page_prot(initial_page_table, PAGE_KERNEL_RO);
@@ -2045,8 +2046,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, 
unsigned long max_pfn)
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
 
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE,
- PFN_DOWN(__pa(initial_page_table)));
-   xen_write_cr3(__pa(initial_page_table));
+ PFN_DOWN(__pa_symbol(initial_page_table)));
+   xen_write_cr3(__pa_symbol(initial_page_table));
 
memblock_reserve(__pa(xen_start_info->pt_base),
 xen_start_info->nr_pt_frames * PAGE_SIZE);



[PATCH v4 0/8] Improve performance of VM translation on x86_64

2012-11-16 Thread Alexander Duyck
This patch series is meant to address several issues I encountered with VM
translations on x86_64.  In my testing I found that swiotlb was incurring up
to a 5% processing overhead due to calls to __phys_addr.  To address that I
have updated swiotlb to use physical addresses instead of virtual addresses
to reduce the need to call __phys_addr.  However those patches didn't address
the other callers.  With these patches applied I am able to achieve an
additional 1% to 2% performance gain on top of the changes to swiotlb.

The first 2 patches are the performance optimizations that result in the 1% to
2% increase in overall performance.  The remaining patches are various
cleanups for a number of spots where __pa or virt_to_phys was being called
and was not needed or __pa_symbol could have been used.

It doesn't seem like the v2 patch set was accepted, so I am submitting an
updated v3 set that is rebased off of linux-next with a few additional
improvements to the existing patches.  Specifically, the first patch now also
updates __virt_addr_valid so that it is almost identical in layout to
__phys_addr.  I also found one additional spot in init_64.c that could use
__pa_symbol instead of virt_to_page calls, so I updated the first __pa_symbol
patch for the x86 init calls.

With this patch set applied I am noticing a 1-2% improvement in performance in
my routing tests.  Without my earlier swiotlb changes applied it was getting
as high as 6-7% because that code originally relied heavily on virt_to_phys.

The overall effect on size varies depending on what kernel options are
enabled.  I have noticed that almost all of the network device drivers have
dropped in size by around 100 bytes.  I suspect this is due to the fact that
the virt_to_page call in dma_map_single is now less expensive.  However the
default build for x86_64 increases the vmlinux size by 3.5K with this change
applied.

v2:  Rebased changes onto linux-next due to changes in x86/xen tree.
v3:  Changes to __virt_addr_valid so it was in sync with __phys_addr.
 Changes to init_64.c function mark_rodata_ro to avoid virt_to_page calls.
v4:  Spun x86/xen changes off as a separate patch.
 Added new patch to push address translation into page_64.h.
     Minor change to __phys_addr_symbol to avoid an unnecessary second y > x check.
---

Alexander Duyck (8):
  x86: Move some contents of page_64_types.h into pgtable_64.h and page_64.h
  x86: Improve __phys_addr performance by making use of carry flags and 
inlining
  x86: Make it so that __pa_symbol can only process kernel symbols on x86_64
  x86: Drop 4 unnecessary calls to __pa_symbol
  x86: Use __pa_symbol instead of __pa on C visible symbols
  x86/ftrace: Use __pa_symbol instead of __pa on C visible symbols
  x86/acpi: Use __pa_symbol instead of __pa on C visible symbols
  x86/lguest: Use __pa_symbol instead of __pa on C visible symbols


 arch/x86/include/asm/page.h  |3 +-
 arch/x86/include/asm/page_32.h   |1 +
 arch/x86/include/asm/page_64.h   |   36 
 arch/x86/include/asm/page_64_types.h |   22 ---
 arch/x86/include/asm/pgtable_64.h|5 +++
 arch/x86/kernel/acpi/sleep.c |2 +
 arch/x86/kernel/cpu/intel.c  |2 +
 arch/x86/kernel/ftrace.c |4 +--
 arch/x86/kernel/head32.c |4 +--
 arch/x86/kernel/head64.c |4 +--
 arch/x86/kernel/setup.c  |   16 +--
 arch/x86/kernel/x8664_ksyms_64.c |3 ++
 arch/x86/lguest/boot.c   |3 +-
 arch/x86/mm/init_64.c|   18 +---
 arch/x86/mm/pageattr.c   |8 +++--
 arch/x86/mm/physaddr.c   |   51 --
 arch/x86/platform/efi/efi.c  |4 +--
 arch/x86/realmode/init.c |8 +++--
 18 files changed, 119 insertions(+), 75 deletions(-)



[PATCH v4 1/8] x86: Move some contents of page_64_types.h into pgtable_64.h and page_64.h

2012-11-16 Thread Alexander Duyck
This patch is meant to clean up the fact that we have several functions in
page_64_types.h which really don't belong there.  I found this issue when I
tried to replace __phys_addr with an inline function.  It resulted in the
realmode bits generating compile warnings about types.  In order to resolve
that I am relocating the address translation to page_64.h since this is in
keeping with where these functions are located in the 32 bit build.

In addition I have relocated the declarations of several functions defined
in init_64.c to pgtable_64.h, as this seems to be where most of the
functions related to memory initialization were already declared.

Signed-off-by: Alexander Duyck 
---
 arch/x86/include/asm/page_64.h   |   19 +++
 arch/x86/include/asm/page_64_types.h |   22 --
 arch/x86/include/asm/pgtable_64.h|5 +
 3 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 072694e..4150999 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -3,4 +3,23 @@
 
 #include 
 
+#ifndef __ASSEMBLY__
+
+/* duplicated to the one in bootmem.h */
+extern unsigned long max_pfn;
+extern unsigned long phys_base;
+
+extern unsigned long __phys_addr(unsigned long);
+
+#define __phys_reloc_hide(x)   (x)
+
+#ifdef CONFIG_FLATMEM
+#define pfn_valid(pfn)  ((pfn) < max_pfn)
+#endif
+
+void clear_page(void *page);
+void copy_page(void *to, void *from);
+
+#endif /* !__ASSEMBLY__ */
+
 #endif /* _ASM_X86_PAGE_64_H */
diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 320f7bb..8b491e6 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -50,26 +50,4 @@
 #define KERNEL_IMAGE_SIZE  (512 * 1024 * 1024)
 #define KERNEL_IMAGE_START _AC(0x8000, UL)
 
-#ifndef __ASSEMBLY__
-void clear_page(void *page);
-void copy_page(void *to, void *from);
-
-/* duplicated to the one in bootmem.h */
-extern unsigned long max_pfn;
-extern unsigned long phys_base;
-
-extern unsigned long __phys_addr(unsigned long);
-#define __phys_reloc_hide(x)   (x)
-
-#define vmemmap ((struct page *)VMEMMAP_START)
-
-extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
-extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
-
-#endif /* !__ASSEMBLY__ */
-
-#ifdef CONFIG_FLATMEM
-#define pfn_valid(pfn)  ((pfn) < max_pfn)
-#endif
-
 #endif /* _ASM_X86_PAGE_64_DEFS_H */
diff --git a/arch/x86/include/asm/pgtable_64.h 
b/arch/x86/include/asm/pgtable_64.h
index 47356f9..b5d30ad 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -183,6 +183,11 @@ extern void cleanup_highmap(void);
 
 #define __HAVE_ARCH_PTE_SAME
 
+#define vmemmap ((struct page *)VMEMMAP_START)
+
+extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
+extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */



[PATCH v4 2/8] x86: Improve __phys_addr performance by making use of carry flags and inlining

2012-11-16 Thread Alexander Duyck
This patch is meant to improve overall system performance when making use of
the __phys_addr call.  To do this I have implemented several changes.

First, if CONFIG_DEBUG_VIRTUAL is not defined, __phys_addr is made an inline,
similar to how this is currently handled in 32 bit.  However, in order to do
this it is required to export phys_base so that it is available if __phys_addr
is used in kernel modules.

The second change was to streamline the code by making use of the carry flag
on an add operation instead of performing a compare on a 64 bit value.  The
advantage to this is that it allows us to significantly reduce the overall
size of the call.  On my Xeon E5 system the entire __phys_addr inline call
consumes a little less than 32 bytes and 5 instructions.  I also applied
similar logic to the debug version of the function.  My testing shows that the
debug version of the function with this patch applied is slightly faster than
the non-debug version without the patch.

Finally I also applied the same logic changes to __virt_addr_valid since it
used the same general code flow as __phys_addr and could achieve similar gains
through these changes.
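
To make the carry-flag trick concrete, here is a small stand-alone sketch
of the same logic, with made-up constants standing in for
__START_KERNEL_map, PAGE_OFFSET and phys_base (assumes an LP64 host; the
values are illustrative only):

    #include <stdio.h>

    #define START_MAP 0xffffffff80000000UL /* stand-in __START_KERNEL_map */
    #define PAGE_OFF  0xffff880000000000UL /* stand-in PAGE_OFFSET */
    static unsigned long phys_base = 0x1000000UL; /* stand-in relocation */

    static unsigned long phys_addr_nodebug(unsigned long x)
    {
            unsigned long y = x - START_MAP;

            /* x > y exactly when the subtraction did not wrap, i.e. when
             * x >= START_MAP; the compiler can test this with the carry
             * flag rather than comparing against a 64-bit immediate. */
            return y + ((x > y) ? phys_base : (START_MAP - PAGE_OFF));
    }

    int main(void)
    {
            /* kernel text alias: prints phys_base + 0x2000 = 0x1002000 */
            printf("%#lx\n", phys_addr_nodebug(START_MAP + 0x2000));
            /* direct mapping: prints 0x5000 */
            printf("%#lx\n", phys_addr_nodebug(PAGE_OFF + 0x5000));
            return 0;
    }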

Signed-off-by: Alexander Duyck 
---
v3:  Added changes to __virt_addr_valid to keep it in sync with __phys_addr

 arch/x86/include/asm/page_64.h   |   14 +
 arch/x86/kernel/x8664_ksyms_64.c |3 +++
 arch/x86/mm/physaddr.c   |   40 --
 3 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 4150999..5138174 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -9,7 +9,21 @@
 extern unsigned long max_pfn;
 extern unsigned long phys_base;
 
+static inline unsigned long __phys_addr_nodebug(unsigned long x)
+{
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
+
+   return x;
+}
+
+#ifdef CONFIG_DEBUG_VIRTUAL
 extern unsigned long __phys_addr(unsigned long);
+#else
+#define __phys_addr(x) __phys_addr_nodebug(x)
+#endif
 
 #define __phys_reloc_hide(x)   (x)
 
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index 1330dd1..b014d94 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -59,6 +59,9 @@ EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(memmove);
 
+#ifndef CONFIG_DEBUG_VIRTUAL
+EXPORT_SYMBOL(phys_base);
+#endif
 EXPORT_SYMBOL(empty_zero_page);
 #ifndef CONFIG_PARAVIRT
 EXPORT_SYMBOL(native_load_gs_index);
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index d2e2735..fd40d75 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -8,33 +8,43 @@
 
 #ifdef CONFIG_X86_64
 
+#ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
-   if (x >= __START_KERNEL_map) {
-   x -= __START_KERNEL_map;
-   VIRTUAL_BUG_ON(x >= KERNEL_IMAGE_SIZE);
-   x += phys_base;
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   if (unlikely(x > y)) {
+   x = y + phys_base;
+
+   VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
} else {
-   VIRTUAL_BUG_ON(x < PAGE_OFFSET);
-   x -= PAGE_OFFSET;
-   VIRTUAL_BUG_ON(!phys_addr_valid(x));
+   x = y + (__START_KERNEL_map - PAGE_OFFSET);
+
+   /* carry flag will be set if starting x was >= PAGE_OFFSET */
+   VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));
}
+
return x;
 }
 EXPORT_SYMBOL(__phys_addr);
+#endif
 
 bool __virt_addr_valid(unsigned long x)
 {
-   if (x >= __START_KERNEL_map) {
-   x -= __START_KERNEL_map;
-   if (x >= KERNEL_IMAGE_SIZE)
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   if (unlikely(x > y)) {
+   x = y + phys_base;
+
+   if (y >= KERNEL_IMAGE_SIZE)
return false;
-   x += phys_base;
} else {
-   if (x < PAGE_OFFSET)
-   return false;
-   x -= PAGE_OFFSET;
-   if (!phys_addr_valid(x))
+   x = y + (__START_KERNEL_map - PAGE_OFFSET);
+
+   /* carry flag will be set if starting x was >= PAGE_OFFSET */
+   if ((x > y) || !phys_addr_valid(x))
return false;
}
 



[PATCH v4 3/8] x86: Make it so that __pa_symbol can only process kernel symbols on x86_64

2012-11-16 Thread Alexander Duyck
I submitted an earlier patch that makes __phys_addr an inline.  This obviously
results in an increase in the code size.  One step I can take to reduce that
is to make it so that the __pa_symbol call does a direct translation for
kernel addresses instead of covering all of virtual memory.

On my system this reduced the size for __pa_symbol from 5 instructions
totalling 30 bytes to 3 instructions totalling 16 bytes.
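
In the non-debug case the direct translation is just the linear form from
the page_64.h hunk below:

    __pa_symbol(s) == s - __START_KERNEL_map + phys_base

Since a symbol address is by definition inside the kernel text mapping, the
branch that __phys_addr needs in order to pick between the text mapping and
the direct mapping can be dropped entirely.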

Signed-off-by: Alexander Duyck 
---
v4:  Dropped y>x check in debug version of __phys_addr_symbol since we already
 checked for y >= KERNEL_IMAGE_SIZE.

 arch/x86/include/asm/page.h|3 ++-
 arch/x86/include/asm/page_32.h |1 +
 arch/x86/include/asm/page_64.h |3 +++
 arch/x86/mm/physaddr.c |   11 +++
 4 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..3698a6a 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -44,7 +44,8 @@ static inline void copy_user_page(void *to, void *from, 
unsigned long vaddr,
  * case properly. Once all supported versions of gcc understand it, we can
  * remove this Voodoo magic stuff. (i.e. once gcc3.x is deprecated)
  */
-#define __pa_symbol(x) __pa(__phys_reloc_hide((unsigned long)(x)))
+#define __pa_symbol(x) \
+   __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
 
 #define __va(x)((void *)((unsigned 
long)(x)+PAGE_OFFSET))
 
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index da4e762..4d550d0 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -15,6 +15,7 @@ extern unsigned long __phys_addr(unsigned long);
 #else
 #define __phys_addr(x) __phys_addr_nodebug(x)
 #endif
+#define __phys_addr_symbol(x)  __phys_addr(x)
 #define __phys_reloc_hide(x)   RELOC_HIDE((x), 0)
 
 #ifdef CONFIG_FLATMEM
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 5138174..0f1ddee 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -21,8 +21,11 @@ static inline unsigned long __phys_addr_nodebug(unsigned 
long x)
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern unsigned long __phys_addr(unsigned long);
+extern unsigned long __phys_addr_symbol(unsigned long);
 #else
 #define __phys_addr(x) __phys_addr_nodebug(x)
+#define __phys_addr_symbol(x) \
+   ((unsigned long)(x) - __START_KERNEL_map + phys_base)
 #endif
 
 #define __phys_reloc_hide(x)   (x)
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index fd40d75..c73fedd 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -28,6 +28,17 @@ unsigned long __phys_addr(unsigned long x)
return x;
 }
 EXPORT_SYMBOL(__phys_addr);
+
+unsigned long __phys_addr_symbol(unsigned long x)
+{
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* only check upper bounds since lower bounds will trigger carry */
+   VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
+
+   return y + phys_base;
+}
+EXPORT_SYMBOL(__phys_addr_symbol);
 #endif
 
 bool __virt_addr_valid(unsigned long x)



[PATCH v4 4/8] x86: Drop 4 unnecessary calls to __pa_symbol

2012-11-16 Thread Alexander Duyck
While debugging the __pa_symbol inline patch I found that there were a couple
spots where __pa_symbol was used as follows:
__pa_symbol(x) - __pa_symbol(y)

The compiler had reduced them to:
x - y

Since we also support a debug case where __pa_symbol is a function call, it
would probably be useful to just change the two cases I found so that they are
always just treated as "x - y".  As such I am casting the values to
phys_addr_t and then doing simple subtraction so that the correct type and
value is returned.
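
For what it's worth, the reduction the compiler was already making falls
straight out of the non-debug definition:

    __pa_symbol(x) - __pa_symbol(y)
      == (x - __START_KERNEL_map + phys_base)
       - (y - __START_KERNEL_map + phys_base)
      == x - y

so subtracting the (cast) symbol addresses directly is exact, not an
approximation.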

Signed-off-by: Alexander Duyck 
---
 arch/x86/kernel/head32.c |4 ++--
 arch/x86/kernel/head64.c |4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index c18f59d..f15db0c 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -30,8 +30,8 @@ static void __init i386_default_early_setup(void)
 
 void __init i386_start_kernel(void)
 {
-   memblock_reserve(__pa_symbol(&_text),
-__pa_symbol(&__bss_stop) - __pa_symbol(&_text));
+   memblock_reserve(__pa_symbol(_text),
+(phys_addr_t)__bss_stop - (phys_addr_t)_text);
 
 #ifdef CONFIG_BLK_DEV_INITRD
/* Reserve INITRD */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 037df57..42f5df1 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -97,8 +97,8 @@ void __init x86_64_start_reservations(char *real_mode_data)
 {
copy_bootdata(__va(real_mode_data));
 
-   memblock_reserve(__pa_symbol(&_text),
-__pa_symbol(&__bss_stop) - __pa_symbol(&_text));
+   memblock_reserve(__pa_symbol(_text),
+(phys_addr_t)__bss_stop - (phys_addr_t)_text);
 
 #ifdef CONFIG_BLK_DEV_INITRD
/* Reserve INITRD */



[PATCH v4 5/8] x86: Use __pa_symbol instead of __pa on C visible symbols

2012-11-16 Thread Alexander Duyck
When I made an attempt at separating __pa_symbol and __pa I found that there
were a number of cases where __pa was used on an obvious symbol.

I also caught one non-obvious case, as _brk_start and _brk_end are based on
the address of __brk_base, which is a C visible symbol.

In mark_rodata_ro I was able to reduce the overhead of kernel symbol to
virtual memory translation by using a combination of __va(__pa_symbol())
instead of page_address(virt_to_page()).
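
The difference in mark_rodata_ro is that the symbol-to-direct-map
translation no longer has to detour through struct page; roughly:

    page_address(virt_to_page(x))   /* virt -> pfn -> struct page -> virt */
    __va(__pa_symbol(x))            /* virt -> phys -> virt, two adds */

Both should yield the direct-mapping alias of x, but the second form is
pure arithmetic.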

Signed-off-by: Alexander Duyck 
---
v3:  Added changes to init_64.c function mark_rodata_ro to avoid unnecessary
 conversion to and from a page when all that is wanted is a virtual
 address.

 arch/x86/kernel/cpu/intel.c |2 +-
 arch/x86/kernel/setup.c |   16 
 arch/x86/mm/init_64.c   |   18 --
 arch/x86/mm/pageattr.c  |8 
 arch/x86/platform/efi/efi.c |4 ++--
 arch/x86/realmode/init.c|8 
 6 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 198e019..2249e7e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -168,7 +168,7 @@ int __cpuinit ppro_with_ram_bug(void)
 #ifdef CONFIG_X86_F00F_BUG
 static void __cpuinit trap_init_f00f_bug(void)
 {
-   __set_fixmap(FIX_F00F_IDT, __pa(&idt_table), PAGE_KERNEL_RO);
+   __set_fixmap(FIX_F00F_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
 
/*
 * Update the IDT descriptor and reload the IDT so that
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ca45696..2702c5d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -300,8 +300,8 @@ static void __init cleanup_highmap(void)
 static void __init reserve_brk(void)
 {
if (_brk_end > _brk_start)
-   memblock_reserve(__pa(_brk_start),
-__pa(_brk_end) - __pa(_brk_start));
+   memblock_reserve(__pa_symbol(_brk_start),
+_brk_end - _brk_start);
 
/* Mark brk area as locked down and no longer taking any
   new allocations */
@@ -761,12 +761,12 @@ void __init setup_arch(char **cmdline_p)
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = _brk_end;
 
-   code_resource.start = virt_to_phys(_text);
-   code_resource.end = virt_to_phys(_etext)-1;
-   data_resource.start = virt_to_phys(_etext);
-   data_resource.end = virt_to_phys(_edata)-1;
-   bss_resource.start = virt_to_phys(&__bss_start);
-   bss_resource.end = virt_to_phys(&__bss_stop)-1;
+   code_resource.start = __pa_symbol(_text);
+   code_resource.end = __pa_symbol(_etext)-1;
+   data_resource.start = __pa_symbol(_etext);
+   data_resource.end = __pa_symbol(_edata)-1;
+   bss_resource.start = __pa_symbol(__bss_start);
+   bss_resource.end = __pa_symbol(__bss_stop)-1;
 
 #ifdef CONFIG_CMDLINE_BOOL
 #ifdef CONFIG_CMDLINE_OVERRIDE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3baff25..0374a10 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -770,12 +770,10 @@ void set_kernel_text_ro(void)
 void mark_rodata_ro(void)
 {
unsigned long start = PFN_ALIGN(_text);
-   unsigned long rodata_start =
-   ((unsigned long)__start_rodata + PAGE_SIZE - 1) & PAGE_MASK;
+   unsigned long rodata_start = PFN_ALIGN(__start_rodata);
unsigned long end = (unsigned long) &__end_rodata_hpage_align;
-   unsigned long text_end = PAGE_ALIGN((unsigned long) &__stop___ex_table);
-   unsigned long rodata_end = PAGE_ALIGN((unsigned long) &__end_rodata);
-   unsigned long data_start = (unsigned long) &_sdata;
+   unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
+   unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
 
printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
   (end - start) >> 10);
@@ -800,12 +798,12 @@ void mark_rodata_ro(void)
 #endif
 
free_init_pages("unused kernel memory",
-   (unsigned long) page_address(virt_to_page(text_end)),
-   (unsigned long)
-page_address(virt_to_page(rodata_start)));
+   (unsigned long) __va(__pa_symbol(text_end)),
+   (unsigned long) __va(__pa_symbol(rodata_start)));
+
free_init_pages("unused kernel memory",
-   (unsigned long) page_address(virt_to_page(rodata_end)),
-   (unsigned long) page_address(virt_to_page(data_start)));
+   (unsigned long) __va(__pa_symbol(rodata_end)),
+   (unsigned long) __va(__pa_symbol(_sdata)));
 }
 
 #endif
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a718e0d..40f92f3 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/

[PATCH v4 6/8] x86/ftrace: Use __pa_symbol instead of __pa on C visible symbols

2012-11-16 Thread Alexander Duyck
Instead of using __pa which is meant to be a general function for converting
virtual addresses to physical addresses we can use __pa_symbol which is the
preferred way of decoding kernel text virtual addresses to physical addresses.

In this case we are not directly converting C visible symbols; however, if we
know that the instruction pointer is somewhere between _text and _etext, we
know that we are going to be translating an address from the kernel text
space.

Cc: Steven Rostedt 
Cc: Frederic Weisbecker 
Signed-off-by: Alexander Duyck 
---
 arch/x86/kernel/ftrace.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 1d41402..42a392a 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -89,7 +89,7 @@ do_ftrace_mod_code(unsigned long ip, const void *new_code)
 * kernel identity mapping to modify code.
 */
if (within(ip, (unsigned long)_text, (unsigned long)_etext))
-   ip = (unsigned long)__va(__pa(ip));
+   ip = (unsigned long)__va(__pa_symbol(ip));
 
return probe_kernel_write((void *)ip, new_code, MCOUNT_INSN_SIZE);
 }
@@ -279,7 +279,7 @@ static int ftrace_write(unsigned long ip, const char *val, 
int size)
 * kernel identity mapping to modify code.
 */
if (within(ip, (unsigned long)_text, (unsigned long)_etext))
-   ip = (unsigned long)__va(__pa(ip));
+   ip = (unsigned long)__va(__pa_symbol(ip));
 
return probe_kernel_write((void *)ip, val, size);
 }



[PATCH v4 7/8] x86/acpi: Use __pa_symbol instead of __pa on C visible symbols

2012-11-16 Thread Alexander Duyck
This change just updates one spot where __pa was being used when __pa_symbol
should have been used.  By using __pa_symbol we are able to drop a few extra
lines of code as we don't have to test to see if the virtual pointer is a
part of the kernel text or just standard virtual memory.

Cc: Len Brown 
Cc: Pavel Machek 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Alexander Duyck 
---
 arch/x86/kernel/acpi/sleep.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index 11676cf..f146a3c 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -69,7 +69,7 @@ int acpi_suspend_lowlevel(void)
 
 #ifndef CONFIG_64BIT
header->pmode_entry = (u32)&wakeup_pmode_return;
-   header->pmode_cr3 = (u32)__pa(&initial_page_table);
+   header->pmode_cr3 = (u32)__pa_symbol(initial_page_table);
saved_magic = 0x12345678;
 #else /* CONFIG_64BIT */
 #ifdef CONFIG_SMP



[PATCH v4 8/8] x86/lguest: Use __pa_symbol instead of __pa on C visible symbols

2012-11-16 Thread Alexander Duyck
The function lguest_write_cr3 is using __pa to convert swapper_pg_dir and
initial_page_table from virtual addresses to physical.  The correct function
to use for these values is __pa_symbol since they are C visible symbols.

Cc: Rusty Russell 
Signed-off-by: Alexander Duyck 
---
 arch/x86/lguest/boot.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index 642d880..139dd35 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -552,7 +552,8 @@ static void lguest_write_cr3(unsigned long cr3)
current_cr3 = cr3;
 
/* These two page tables are simple, linear, and used during boot */
-   if (cr3 != __pa(swapper_pg_dir) && cr3 != __pa(initial_page_table))
+   if (cr3 != __pa_symbol(swapper_pg_dir) &&
+   cr3 != __pa_symbol(initial_page_table))
cr3_changed = true;
 }
 



Re: [PATCH v4 6/8] x86/ftrace: Use __pa_symbol instead of __pa on C visible symbols

2012-11-16 Thread Alexander Duyck
On 11/16/2012 03:06 PM, H. Peter Anvin wrote:
> On 11/16/2012 02:45 PM, Steven Rostedt wrote:
>>
>> #define __pa(x)__phys_addr((unsigned long)(x))
>> #define __pa_symbol(x)__pa(__phys_reloc_hide((unsigned long)(x)))
>>
>> I'm confused. __pa_symbol() just calls __pa() with some macro magic to
>> its parameter. How is this a performance improvement?
>>
>
> One of the earlier patches in this series changes __pa_symbol() to
> avoid the conditional hidden inside __phys_addr(), since by definition
> a symbol can only be on one side of that branch.
>
> -hpa
>

In addition to being a bit faster, the code is also a bit smaller since it
can combine the constants from __va() and __pa_symbol(), as the new
__pa_symbol is an inline in the non-debug case.
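
A quick sketch of the constant folding in the non-debug case:

    __va(__pa_symbol(x))
      == (x - __START_KERNEL_map + phys_base) + PAGE_OFFSET
      == x + phys_base + (PAGE_OFFSET - __START_KERNEL_map)

Since everything is inline, the compiler can fold the two build-time
constants into a single immediate, leaving one load of phys_base and a
couple of adds.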

Thanks,

Alex


[PATCH] x86: Fix warning about cast from pointer to integer of different size

2012-11-19 Thread Alexander Duyck
This patch fixes a warning reported by the kbuild test robot where we were
casting a pointer to a physical address (phys_addr_t), an integer type which
can be of a different size than the pointer.  Per the suggestion of Peter
Anvin I am replacing it, and one other spot where I made a similar cast, with
an unsigned long.
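
The warning only fires on configurations where phys_addr_t is wider than a
pointer (e.g. a 32-bit build with PAE, where phys_addr_t is 64-bit).  A
minimal sketch, assuming such a configuration:

    extern char __bss_stop[];

    phys_addr_t bad = (phys_addr_t)__bss_stop;     /* warning: cast from
                                                      pointer to integer of
                                                      different size */
    unsigned long ok = (unsigned long)__bss_stop;  /* pointer-sized, clean */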

Cc: H. Peter Anvin 
Signed-off-by: Alexander Duyck 
---
 arch/x86/kernel/head32.c |2 +-
 arch/x86/kernel/head64.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index f15db0c..e175548 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -31,7 +31,7 @@ static void __init i386_default_early_setup(void)
 void __init i386_start_kernel(void)
 {
memblock_reserve(__pa_symbol(_text),
-(phys_addr_t)__bss_stop - (phys_addr_t)_text);
+(unsigned long)__bss_stop - (unsigned long)_text);
 
 #ifdef CONFIG_BLK_DEV_INITRD
/* Reserve INITRD */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 42f5df1..7b215a5 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -98,7 +98,7 @@ void __init x86_64_start_reservations(char *real_mode_data)
copy_bootdata(__va(real_mode_data));
 
memblock_reserve(__pa_symbol(_text),
-(phys_addr_t)__bss_stop - (phys_addr_t)_text);
+(unsigned long)__bss_stop - (unsigned long)_text);
 
 #ifdef CONFIG_BLK_DEV_INITRD
/* Reserve INITRD */



[RESEND][PATCH] x86: Fix warning about cast from pointer to integer of different size

2012-11-19 Thread Alexander Duyck
This patch fixes a warning reported by the kbuild test robot where we were
casting a pointer to a physical address (phys_addr_t), an integer type which
can be of a different size than the pointer.  Per the suggestion of Peter
Anvin I am replacing it, and one other spot where I made a similar cast, with
an unsigned long.

Cc: H. Peter Anvin 
Signed-off-by: Alexander Duyck 
---

Resending patch as I realized I forgot to add --auto to stgit command line and
as such the Cc was ignored.  Sorry for the extra noise on the list.

 arch/x86/kernel/head32.c |2 +-
 arch/x86/kernel/head64.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index f15db0c..e175548 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -31,7 +31,7 @@ static void __init i386_default_early_setup(void)
 void __init i386_start_kernel(void)
 {
memblock_reserve(__pa_symbol(_text),
-(phys_addr_t)__bss_stop - (phys_addr_t)_text);
+(unsigned long)__bss_stop - (unsigned long)_text);
 
 #ifdef CONFIG_BLK_DEV_INITRD
/* Reserve INITRD */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 42f5df1..7b215a5 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -98,7 +98,7 @@ void __init x86_64_start_reservations(char *real_mode_data)
copy_bootdata(__va(real_mode_data));
 
memblock_reserve(__pa_symbol(_text),
-(phys_addr_t)__bss_stop - (phys_addr_t)_text);
+(unsigned long)__bss_stop - (unsigned long)_text);
 
 #ifdef CONFIG_BLK_DEV_INITRD
/* Reserve INITRD */



Re: [PATCH] pci: Avoid reentrant calls to work_on_cpu

2013-06-12 Thread Alexander Duyck
On 05/14/2013 07:50 PM, Yinghai Lu wrote:
> On Tue, May 14, 2013 at 3:26 PM, Alexander Duyck
>  wrote:
>> This change is meant to fix a deadlock seen when pci_enable_sriov was
>> called from within a driver's probe routine.  The issue was that
>> work_on_cpu calls flush_work which attempts to flush a work queue for a
>> cpu that we are currently working in.  In order to avoid the reentrant
>> path we just skip the call to work_on_cpu in the case that the device
>> node matches our current node.
>>
>> Reported-by: Yinghai Lu 
>> Signed-off-by: Alexander Duyck 
>> ---
>>
>> This patch is meant to address the issue pointed out in an earlier patch
>> sent by Yinghai Lu titled:
>>   [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's
> Yes, that help. my v2 patch will not need to device schecdule and
> device_initicall to wait
> first work_on_cpu is done.
>
> Tested-by: Yinghai Lu 

So whatever happened with this patch?  It doesn't look like it was
applied anywhere.  Was there some objection to it?  If so I can update
and resubmit if necessary.

Thanks,

Alex


>
>>  drivers/pci/pci-driver.c |   14 +-
>>  1 files changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index 79277fb..caeb1c0 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -277,12 +277,16 @@ static int pci_call_probe(struct pci_driver *drv, 
>> struct pci_dev *dev,
>> int error, node;
>> struct drv_dev_and_id ddi = { drv, dev, id };
>>
>> -   /* Execute driver initialization on node where the device's
>> -  bus is attached to.  This way the driver likely allocates
>> -  its local memory on the right node without any need to
>> -  change it. */
>> +   /*
>> +* Execute driver initialization on the node where the device's
>> +* bus is attached.  This way the driver likely allocates
>> +* its local memory on the right node without any need to
>> +* change it.  If the node is the current node just call
>> +* local_pci_probe and avoid the possibility of reentrant
>> +* calls to work_on_cpu.
>> +*/
>> node = dev_to_node(&dev->dev);
>> -   if (node >= 0) {
>> +   if ((node >= 0) && (node != numa_node_id())) {
>> int cpu;
>>
>> get_online_cpus();
>>



[PATCH] pci: Avoid unnecessary calls to work_on_cpu

2013-06-24 Thread Alexander Duyck
This patch is meant to address the fact that we are making unnecessary calls
to work_on_cpu.  To resolve this I have added a check to see if the current
node is the correct node for the device before we decide to assign the probe
task to another CPU.

The advantage of this approach is that we can avoid reentrant calls to
work_on_cpu.  In addition we should not make any calls to set up the work
remotely in the case of a single-node system that has NUMA enabled.

Signed-off-by: Alexander Duyck 
---

This patch is based off of work I submitted in an earlier patch that I never
heard back on.  The change was originally submitted in:
  pci: Avoid reentrant calls to work_on_cpu

I'm not sure whatever happened with that patch; however, after reviewing it
some myself I decided I could do without the change to the comments since it
was unneeded.  As such I am resubmitting this as a much simpler patch that
only adds the line of code needed to avoid calling work_on_cpu for every call
to probe on a NUMA node specific device.  (See the call-chain sketch below
for the reentrant path this avoids.)
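
For reference, the reentrant path that this check short-circuits looks
roughly like the following (reconstructed from the earlier deadlock report,
so treat it as a sketch rather than an exact trace):

    pci_call_probe(PF)
      -> work_on_cpu(cpu_on_node, local_pci_probe, ...)  /* PF probe runs
                                                            in a worker */
        -> driver->probe()
          -> pci_enable_sriov()
            -> VF device added -> pci_call_probe(VF)
              -> work_on_cpu() on the same CPU
                -> flush_work()  /* flushes the workqueue we are currently
                                    running on: deadlock */

When dev_to_node() already matches numa_node_id() we simply call
local_pci_probe() directly, so the nested work_on_cpu() never happens.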

 drivers/pci/pci-driver.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 79277fb..7d81713 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -282,7 +282,7 @@ static int pci_call_probe(struct pci_driver *drv, struct 
pci_dev *dev,
   its local memory on the right node without any need to
   change it. */
node = dev_to_node(&dev->dev);
-   if (node >= 0) {
+   if ((node >= 0) && (node != numa_node_id())) {
int cpu;
 
get_online_cpus();



Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing

2013-08-06 Thread Alexander Duyck
On 08/05/2013 12:37 PM, Alex Williamson wrote:
> The PCI spec indicates that with stable power, reset needs to be
> asserted for a minimum of 1ms (Trst).  Seems like we should be able
> to assume power is stable for a runtime secondary bus reset.  The
> current code has always used 100ms with no explanation where that
> came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
> that seems to be a misinterpretation of the PCIe spec, where hot
> reset is implemented by TS1 ordered sets containing the hot reset
> command.  After a 2ms delay the state machine enters the detect state,
> but to generate a link down, only two consecutive TS1 hot reset
> ordered sets are required.  1ms should be plenty for that.

The reason for doing a 2ms sleep is because they are supposed to be
sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
the documents I have read.  The 1ms number you quote is the minimum time
for a conventional PCI bus.  I'm not completely sure that applies as
well to PCIe, nor does it represent the maximum recommended value.

If we stop early we risk not resetting the full device tree on the
secondary bus, which is the bug I was resolving by adding the 2ms delay.
Previously we saw that some devices were only getting their PCIe link
retrained, without performing a hot reset, when the bit was not held for
long enough.  I would prefer to keep this at 2ms in order to account
for the fact that PCIe has to go through link recovery states before it
can perform the hot reset.
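
For reference, the operation being timed here is the secondary bus reset
bit in the bridge control register.  A simplified sketch (not the exact
patch under discussion):

    u16 ctrl;

    pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &ctrl);
    pci_write_config_word(bridge, PCI_BRIDGE_CONTROL,
                          ctrl | PCI_BRIDGE_CTL_BUS_RESET);
    msleep(2);      /* while set, the bridge transmits Hot Reset TS1
                       Ordered Sets on its downstream link */
    pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, ctrl);

The question in this thread is how long that middle delay needs to be.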

Thanks,

Alex




Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing

2013-08-07 Thread Alexander Duyck
On 08/06/2013 07:56 PM, Alex Williamson wrote:
> On Tue, 2013-08-06 at 16:27 -0700, Alexander Duyck wrote:
>> On 08/05/2013 12:37 PM, Alex Williamson wrote:
>>> The PCI spec indicates that with stable power, reset needs to be
>>> asserted for a minimum of 1ms (Trst).  Seems like we should be able
>>> to assume power is stable for a runtime secondary bus reset.  The
>>> current code has always used 100ms with no explanation where that
>>> came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
>>> that seems to be a misinterpretation of the PCIe spec, where hot
>>> reset is implemented by TS1 ordered sets containing the hot reset
>>> command.  After a 2ms delay the state machine enters the detect state,
>>> but to generate a link down, only two consecutive TS1 hot reset
>>> ordered sets are required.  1ms should be plenty for that.
>> The reason for doing a 2ms sleep is because they are supposed to be
>> sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
>> the documents I have read.
> Could you point to one of those references?  In the PCIe v3 spec I'm
> seeing things like 4.2.6.11 Hot Reset:
>
>   * If two consecutive TS1 Ordered Sets are received on any Lane
> with the Hot Reset bit asserted and configured Link and Lane
> numbers, then:
>   * LinkUp = 0b (False)
>   * If no higher Layer is directing the Physical Layer to
> remain in Hot Reset, the next state is Detect
>   * Otherwise, all Lanes in the configured Link continue to
> transmit TS1 Ordered Sets with the Hot Reset bit
> asserted and the configured Link and Lane numbers.
>   * Otherwise, after a 2 ms timeout next state is Detect.
>
> The next section has something similar for propagation of hot resets.
>
> Nowhere there does it say TS1 Ordered Sets need to be sent continuously
> for 2ms.  A hot reset is initiated only by two consecutive TS1 Ordered
> Sets with the Hot Reset bit asserted.  The 2ms timeout seems to be the
> delay before the link moves to the Detect state after we stop asserting
> hot reset.  1ms seems like more than enough time for two TS1 Ordered
> Sets to propagate down a multi-level hierarchy at 2.5GT/s. 
>

My original implementation is actually based on page 536 of the "PCI
Express System Architecture".  However, based on the PCIe spec itself I
think the point is that the port is supposed to stay in Hot Reset for
2ms after receiving the in-band message.  For a bridge port it means
that it is supposed to be sending the Hot Reset message for those 2ms on
all downstream facing ports.  After the timer expires it stops
sending the Hot Reset TS1 Ordered Sets and transitions to the
Detect state.

My main concern here is that the previous code was not triggering a Hot
Reset on all ports.  What was happening was that some of the
ports would only get as far as Recovery, as the upstream port was only
sending a couple of TS1 frames and not allowing the downstream ports
time to switch to Recovery themselves and discover the Hot Reset.

>> The 1ms number you quote is the minimum time
>> for a conventional PCI bus.  I'm not completely sure that applies as
>> well to PCIe, nor does it represent the maximum recommended value.
> Correct, 1ms comes from conventional PCI.  PCIe is designed to be
> software compatible with conventional PCI so it makes sense that PCIe
> would do something within the timing boundaries of conventional PCI.  I
> didn't see any reference to a maximum recommended value for this
> parameter.

I don't want to implement things to the minimum specification as there are
too many marginal parts where the minimum doesn't work.  I would rather
not have to add a ton of quirks for all of the parts out there that
didn't quite meet the specification.  By using a value of 2ms we are
matching the specified PCIe bridge behavior of sending the Hot Reset TS1
Ordered Sets for 2ms.

>> If we stop early we risk not resetting the full device tree on the
>> secondary bus which is the bug I was resolving by adding the 2ms delay. 
>> Previously we saw that some devices were only getting their PCIe link
>> retrained without performing a hot reset when the bit was not held for
>> long enough.  I would prefer to keep this at 2 ms in order to account
>> for the fact that PCIe has to go though link recovery states before it
>> can perform the hot reset.
> I'm not going to sweat over 1ms or 2ms but I do want to be able to
> document why we're setting it to one or the other.  If it's warm
> fuzzies, so be it, but I'd prefer if we could find actual sp

Re: [PATCH v4 8/9] pci: Tune secondary bus reset timing

2013-08-08 Thread Alexander Duyck
On 08/07/2013 10:23 PM, Alex Williamson wrote:
> On Wed, 2013-08-07 at 11:30 -0700, Alexander Duyck wrote:
>> On 08/06/2013 07:56 PM, Alex Williamson wrote:
>>> On Tue, 2013-08-06 at 16:27 -0700, Alexander Duyck wrote:
>>>> On 08/05/2013 12:37 PM, Alex Williamson wrote:
>>>>> The PCI spec indicates that with stable power, reset needs to be
>>>>> asserted for a minimum of 1ms (Trst).  Seems like we should be able
>>>>> to assume power is stable for a runtime secondary bus reset.  The
>>>>> current code has always used 100ms with no explanation where that
>>>>> came from.  The aer_do_secondary_bus_reset() function uses 2ms, but
>>>>> that seems to be a misinterpretation of the PCIe spec, where hot
>>>>> reset is implemented by TS1 ordered sets containing the hot reset
>>>>> command.  After a 2ms delay the state machine enters the detect state,
>>>>> but to generate a link down, only two consecutive TS1 hot reset
>>>>> ordered sets are required.  1ms should be plenty for that.
>>>> The reason for doing a 2ms sleep is because they are supposed to be
>>>> sending the Hot Reset TS1 Ordered-Sets continuously for 2ms per all of
>>>> the documents I have read.
>>> Could you point to one of those references?  In the PCIe v3 spec I'm
>>> seeing things like 4.2.6.11 Hot Reset:
>>>
>>>   * If two consecutive TS1 Ordered Sets are received on any Lane
>>> with the Hot Reset bit asserted and configured Link and Lane
>>> numbers, then:
>>>   * LinkUp = 0b (False)
>>>   * If no higher Layer is directing the Physical Layer to
>>> remain in Hot Reset, the next state is Detect
>>>   * Otherwise, all Lanes in the configured Link continue to
>>> transmit TS1 Ordered Sets with the Hot Reset bit
>>> asserted and the configured Link and Lane numbers.
>>>   * Otherwise, after a 2 ms timeout next state is Detect.
>>>
>>> The next section has something similar for propagation of hot resets.
>>>
>>> Nowhere there does it say TS1 Ordered Sets need to be sent continuously
>>> for 2ms.  A hot reset is initiated only by two consecutive TS1 Ordered
>>> Sets with the Hot Reset bit asserted.  The 2ms timeout seems to be the
>>> delay before the link moves to the Detect state after we stop asserting
>>> hot reset.  1ms seems like more than enough time for two TS1 Ordered
>>> Sets to propagate down a multi-level hierarchy at 2.5GT/s. 
>>>
>> My original implementation is actually based on page 536 of the "PCI
>> Express System Architecture".  However based on the PCIe spec itself I
>> think the point is that the port is supposed to stay in Hot Reset for
>> 2ms after receiving the in-band message.  For a bridge port it means
>> that it is supposed to be sending the Hot Reset message for those 2ms on
>> all downstream facing ports.  After the timer expires it stops
>> sending the Hot Reset TS1 Ordered Sets and then will transition to the
>> Detect state.
> Conveniently page 536 is available for preview on google :)  What that
> suggests to me is that the minimum "nobody home", unconnected link
> timeout is 2ms.  Downstream ports may exit to the Detect state after
> either a 2ms timeout expires or after two hot-reset-TS1s are received
> from the downstream device.  The other 2ms case is that an upstream port
> in the Hot Reset state will always wait for the 2ms timeout to expire
> after the last pair of hot-reset-TS1s is received before entering the
> Detect state.
>
>> My main concern here is that the previous code was not triggering a Hot
>> Reset on all ports.  What was happening was that some of the
>> ports would only get as far as Recovery as the upstream port was only
>> sending a couple of TS1 frames and not allowing the downstream ports
>> time to switch to Recovery themselves and discover the Hot Reset.
> Was that the original code that had no delay between set and clear of
> the bridge control register?  1ms is pretty long time vs no delay.
>
>>>> The 1ms number you quote is the minimum time
>>>> for a conventional PCI bus.  I'm not completely sure that applies as
>>>> well to PCIe, nor does it represent the maximum recommended value.
>>> Correct, 1ms comes from conventional PCI.  PCIe is designed to be
>>> software compatible with conventional PCI so it makes sense that PCIe
>>> would do somethin

Re: workqueue, pci: INFO: possible recursive locking detected

2013-07-22 Thread Alexander Duyck
On 07/22/2013 02:38 PM, Bjorn Helgaas wrote:
> [+cc Alex, Yinghai, linux-pci]
>
> On Mon, Jul 22, 2013 at 9:37 AM, Srivatsa S. Bhat
>  wrote:
>> On 07/22/2013 05:22 PM, Lai Jiangshan wrote:
>>> On 07/19/2013 04:57 PM, Srivatsa S. Bhat wrote:
 On 07/19/2013 07:17 AM, Lai Jiangshan wrote:
> On 07/19/2013 04:23 AM, Srivatsa S. Bhat wrote:
>> ---
>>
>>  kernel/workqueue.c |6 ++
>>  1 file changed, 6 insertions(+)
>>
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index f02c4a4..07d9a67 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -4754,7 +4754,13 @@ long work_on_cpu(int cpu, long (*fn)(void *), 
>> void *arg)
>>  {
>>struct work_for_cpu wfc = { .fn = fn, .arg = arg };
>>
>> +#ifdef CONFIG_LOCKDEP
>> +  static struct lock_class_key __key;
> Sorry, this "static" should be removed.
>
 That didn't help either :-( Because it makes lockdep unhappy,
 since the key isn't persistent.

 This is the patch I used:

 ---

 diff --git a/kernel/workqueue.c b/kernel/workqueue.c
 index f02c4a4..7967e3b 100644
 --- a/kernel/workqueue.c
 +++ b/kernel/workqueue.c
 @@ -4754,7 +4754,13 @@ long work_on_cpu(int cpu, long (*fn)(void *), void 
 *arg)
  {
  struct work_for_cpu wfc = { .fn = fn, .arg = arg };

 +#ifdef CONFIG_LOCKDEP
 +struct lock_class_key __key;
 +INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
 +lockdep_init_map(&wfc.work.lockdep_map, "&wfc.work", &__key, 0);
 +#else
  INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
 +#endif
  schedule_work_on(cpu, &wfc.work);
  flush_work(&wfc.work);
  return wfc.ret;


 And here are the new warnings:


 Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252)
 io scheduler noop registered
 io scheduler deadline registered
 io scheduler cfq registered (default)
 BUG: key 881039557b98 not in .data!
 [ cut here ]
 WARNING: CPU: 8 PID: 1 at kernel/lockdep.c:2987 
 lockdep_init_map+0x168/0x170()
>>> Sorry again.
>>>
>>> From 0096b9dac2282ec03d59a3f665b92977381a18ad Mon Sep 17 00:00:00 2001
>>> From: Lai Jiangshan 
>>> Date: Mon, 22 Jul 2013 19:08:51 +0800
>>> Subject: [PATCH] [PATCH] workqueue: allow the function of work_on_cpu() can
>>>  call work_on_cpu()
>>>
>>> If @fn calls work_on_cpu() again, lockdep will complain:
>>>
 [ INFO: possible recursive locking detected ]
 3.11.0-rc1-lockdep-fix-a #6 Not tainted
 -
 kworker/0:1/142 is trying to acquire lock:
  ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0xb0

 but task is already holding lock:
  ((&wfc.work)){+.+.+.}, at: [] 
 process_one_work+0x169/0x610

 other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock((&wfc.work));
   lock((&wfc.work));

  *** DEADLOCK ***
>>> It is a false-positive lockdep report.  In this situation, the two
>>> "wfc"s of the two work_on_cpu() calls are different; they are both on
>>> the stack, so flush_work() can't deadlock.
>>>
>>> To fix this, we need to avoid the lockdep checking in this case.  But
>>> we don't want to change flush_work(), so we use a completion instead
>>> of flush_work() in work_on_cpu().
>>>
>>> Reported-by: Srivatsa S. Bhat 
>>> Signed-off-by: Lai Jiangshan 
>>> ---
>> That worked, thanks a lot!
>>
>> Tested-by: Srivatsa S. Bhat 
>>
>> Regards,
>> Srivatsa S. Bhat
>>
>>>  kernel/workqueue.c |5 -
>>>  1 files changed, 4 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>> index f02c4a4..b021a45 100644
>>> --- a/kernel/workqueue.c
>>> +++ b/kernel/workqueue.c
>>> @@ -4731,6 +4731,7 @@ struct work_for_cpu {
>>>   long (*fn)(void *);
>>>   void *arg;
>>>   long ret;
>>> + struct completion done;
>>>  };
>>>
>>>  static void work_for_cpu_fn(struct work_struct *work)
>>> @@ -4738,6 +4739,7 @@ static void work_for_cpu_fn(struct work_struct *work)
>>>   struct work_for_cpu *wfc = container_of(work, struct work_for_cpu, 
>>> work);
>>>
>>>   wfc->ret = wfc->fn(wfc->arg);
>>> + complete(&wfc->done);
>>>  }
>>>
>>>  /**
>>> @@ -4755,8 +4757,9 @@ long work_on_cpu(int cpu, long (*fn)(void *), void 
>>> *arg)
>>>   struct work_for_cpu wfc = { .fn = fn, .arg = arg };
>>>
>>>   INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
>>> + init_completion(&wfc.done);
>>>   schedule_work_on(cpu, &wfc.work);
>>> - flush_work(&wfc.work);
>>> + wait_for_completion(&wfc.done);
>>>   return wfc.ret;
>>>  }
>>>  EXPORT_SYMBOL_GPL(work_on_cpu);
>>>
> Isn't this for the same issue Alex and others have been working on?
>
> It doesn't feel like we have consensus on how this should be fixed.

Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-14 Thread Alexander Duyck
On 05/13/2013 07:28 PM, Yinghai Lu wrote:
> Found that the kernel tries to load the mlx4 driver for VFs before the
> PF's driver is fully loaded when the drivers are built in and the kernel
> command line includes probe_vfs=63, num_vfs=63.
>
> It turns out that this also happens on the hot-add path, even when the
> drivers are compiled as modules and are loaded, especially when VFs
> share the same driver with the PF.
>
> calling path:
>   device driver probe
>   ==> pci_enable_sriov
>   ==> virtfn_add
>   ==> pci_dev_add
>   ==> pci_bus_device_add
> When pci_bus_device_add is called the VF's driver will be attached,
> and at that time the PF's driver has not finished loading yet.
>
> We need to move pci_bus_device_add out of virtfn_add and call it
> later.  Fix the problem for two paths:
> 1. hot-add path: use device_schedule_callback.
> 2. boot path: use an initcall to call it for all VFs.
>
> Signed-off-by: Yinghai Lu 
> Cc: net...@vger.kernel.org
>

I'm sorry, but what is the point of this patch?  With device assignment
it is always possible to have VFs loaded and the PF driver unloaded
since you cannot remove the VFs if they are assigned to a VM.

If there is a driver that has to have the PF driver fully loaded before
it instantiates the VFs then it sounds like a buggy driver to me.  The
VF driver should be able to be loaded when the PF driver is not
present.  We handle it in igb and ixgbe last I checked, and I don't see
any reason why it cannot be handled in all other VF drivers.  I'm not
saying the VF has to be fully functional, but it should be able
to detect the PF becoming enabled and then bring itself to a fully
functional state.  To not handle that case is a bug.

Thanks,

Alex


Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-14 Thread Alexander Duyck
On 05/14/2013 11:44 AM, Yinghai Lu wrote:
> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>  wrote:
>> On 05/13/2013 07:28 PM, Yinghai Lu wrote:
>>> Found that the kernel tries to load the mlx4 driver for VFs before the
>>> PF's driver is fully loaded when the drivers are built in and the kernel
>>> command line includes probe_vfs=63, num_vfs=63.
>>>
>>> It turns out that this also happens on the hot-add path, even when the
>>> drivers are compiled as modules and are loaded, especially when VFs
>>> share the same driver with the PF.
>>>
>>> calling path:
>>>   device driver probe
>>>   ==> pci_enable_sriov
>>>   ==> virtfn_add
>>>   ==> pci_dev_add
>>>   ==> pci_bus_device_add
>>> When pci_bus_device_add is called the VF's driver will be attached,
>>> and at that time the PF's driver has not finished loading yet.
>>>
>>> We need to move pci_bus_device_add out of virtfn_add and call it
>>> later.  Fix the problem for two paths:
>>> 1. hot-add path: use device_schedule_callback.
>>> 2. boot path: use an initcall to call it for all VFs.
>>>
>>> Signed-off-by: Yinghai Lu 
>>> Cc: net...@vger.kernel.org
>>>
>> I'm sorry, but what is the point of this patch?  With device assignment
>> it is always possible to have VFs loaded and the PF driver unloaded
>> since you cannot remove the VFs if they are assigned to a VM.
> Won't unloading the PF driver call pci_disable_sriov?

You cannot call pci_disable_sriov because you will panic all of the
guests that have devices assigned.

>> If there is a driver that has to have the PF driver fully loaded before
>> it instantiates the VFs then it sounds like a buggy driver to me.  The
>> VF driver should be able to be loaded when the PF driver is not
>> present.  We handle it in igb and ixgbe last I checked, and I don't see
>> any reason why it cannot be handled in all other VF drivers.  I'm not
>> saying the VF has to be fully functional, but it should be able
>> to detect the PF becoming enabled and then bring itself to a fully
>> functional state.  To not handle that case is a bug.
> more than that.
>
> there is work_on_cpu nested lock problem. from calling pci_bus_add_device
> in driver pci probe function.
>
> [  181.938110] mlx4_core 0000:02:00.0: Started init_resource_tracker: 80 slaves
> [  181.938759]   alloc irq_desc for 1170 on node 0
> [  181.949104] mlx4_core 0000:02:00.0: irq 1170 for MSI-X
> [  181.949404]   alloc irq_desc for 1171 on node 0
> [  181.949741] mlx4_core 0000:02:00.0: irq 1171 for MSI-X
> [  181.969253]   alloc irq_desc for 1172 on node 0
> [  181.969564] mlx4_core 0000:02:00.0: irq 1172 for MSI-X
> [  181.989137]   alloc irq_desc for 1173 on node 0
> [  181.989485] mlx4_core 0000:02:00.0: irq 1173 for MSI-X
> [  182.033789] mlx4_core 0000:02:00.0: NOP command IRQ test passed
> [  182.035380]
> [  182.035473] =
> [  182.049065] [ INFO: possible recursive locking detected ]
> [  182.049349] 3.10.0-rc1-yh-00114-gf59c98e-dirty #1588 Not tainted
> [  182.069079] -
> [  182.069354] kworker/0:1/2285 is trying to acquire lock:
> [  182.089080]  ((&wfc.work)){+.+.+.}, at: []
> flush_work+0x5/0x280
> [  182.089500]
> [  182.089500] but task is already holding lock:
> [  182.109671]  ((&wfc.work)){+.+.+.}, at: []
> process_one_work+0x202/0x490
> [  182.129097]
> [  182.129097] other info that might help us debug this:
> [  182.129415]  Possible unsafe locking scenario:
> [  182.129415]
> [  182.149275]CPU0
> [  182.149386]
> [  182.149513]   lock((&wfc.work));
> [  182.149705]   lock((&wfc.work));
> [  182.169391]
> [  182.169391]  *** DEADLOCK ***
> [  182.169391]
> [  182.169722]  May be due to missing lock nesting notation
> [  182.169722]
> [  182.189461] 3 locks held by kworker/0:1/2285:
> [  182.189664]  #0:  (events){.+.+.+}, at: []
> process_one_work+0x202/0x490
> [  182.209468]  #1:  ((&wfc.work)){+.+.+.}, at: []
> process_one_work+0x202/0x490
> [  182.229176]  #2:  (&__lockdep_no_validate__){..}, at:
> [] device_attach+0x2a/0xc0
> [  182.249108]
> [  182.249108] stack backtrace:
> [  182.249362] CPU: 0 PID: 2285 Comm: kworker/0:1 Not tainted
> 3.10.0-rc1-yh-00114-gf59c98e-dirty #1588
> [  182.269258] Hardware name: Oracle Corporation  unknown   /
> , BIOS 1101660005/17/2011
> [  182.289141] Workqueue: events work_for_cpu_fn
> [  182.289410]  833

Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-14 Thread Alexander Duyck
On 05/14/2013 12:59 PM, Yinghai Lu wrote:
> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>  wrote:
>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>  wrote:
>>>> I'm sorry, but what is the point of this patch?  With device assignment
>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>> since you cannot remove the VFs if they are assigned to a VM.
>>> unload PF driver will not call pci_disable_sriov?
>> You cannot call pci_disable_sriov because you will panic all of the
>> guests that have devices assigned.
> ixgbe_remove did call pci_disable_sriov...
>
> for guest panic, that is another problem.
> just like you pci passthrough with real pci device and hotremove the
> card in host.
>
> ...

I suggest you take another look.  In ixgbe_disable_sriov, which is the
function that is called we do a check for assigned VFs.  If they are
assigned then we do not call pci_disable_sriov.

>
>> So how does your patch actually fix this problem?  It seems like it is
>> just avoiding it.
> yes, until the first one is done.

Avoiding the issue doesn't fix the underlying problem and instead you
are likely just introducing more bugs as a result.

>> From what I can tell your problem is originating in pci_call_probe.  I
>> believe it is calling work_on_cpu and that doesn't seem correct since
>> the work should be taking place on a CPU already local to the PF. You
>> might want to look there to see why you are trying to schedule work on a
>> CPU which should be perfectly fine for you to already be doing your work on.
> it always try to go with local cpu with same pxm.

The problem is we really shouldn't be calling work_on_cpu in this case
since we are already on the correct CPU.  What probably should be
happening is that pci_call_probe should be doing a check to see if the
current CPU is already contained within the device node map and if so
just call local_pci_probe directly.  That way you can avoid deadlocking
the system by trying to flush the CPU queue of the CPU you are currently on.
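
To illustrate the reentrancy, here is a hypothetical, self-contained
sketch of the nesting that gets flagged; the real chain runs through
pci_call_probe and the driver probe routines, as noted in the comments:

#include <linux/workqueue.h>
#include <linux/smp.h>

static long inner_fn(void *arg)
{
	return 0;	/* stands in for the VF's probe routine */
}

static long outer_fn(void *arg)
{
	/* We are already running as a work item via work_on_cpu().
	 * Calling work_on_cpu() again queues a second item and then
	 * flush_work()s it from inside the first one; analogous to
	 * pci_call_probe -> PF probe -> pci_enable_sriov -> VF probe
	 * -> pci_call_probe again. */
	return work_on_cpu(raw_smp_processor_id(), inner_fn, NULL);
}

static long pf_probe_example(void)
{
	return work_on_cpu(raw_smp_processor_id(), outer_fn, NULL);
}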

Thanks,

Alex


[PATCH] pci: Avoid reentrant calls to work_on_cpu

2013-05-14 Thread Alexander Duyck
This change is meant to fix a deadlock seen when pci_enable_sriov was
called from within a driver's probe routine.  The issue was that
work_on_cpu calls flush_work which attempts to flush a work queue for a
cpu that we are currently working in.  In order to avoid the reentrant
path we just skip the call to work_on_cpu in the case that the device
node matches our current node.

Reported-by: Yinghai Lu 
Signed-off-by: Alexander Duyck 
---

This patch is meant to address the issue pointed out in an earlier patch
sent by Yinghai Lu titled:
  [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

 drivers/pci/pci-driver.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 79277fb..caeb1c0 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -277,12 +277,16 @@ static int pci_call_probe(struct pci_driver *drv, struct 
pci_dev *dev,
int error, node;
struct drv_dev_and_id ddi = { drv, dev, id };
 
-   /* Execute driver initialization on node where the device's
-  bus is attached to.  This way the driver likely allocates
-  its local memory on the right node without any need to
-  change it. */
+   /*
+* Execute driver initialization on the node where the device's
+* bus is attached.  This way the driver likely allocates
+* its local memory on the right node without any need to
+* change it.  If the node is the current node just call
+* local_pci_probe and avoid the possibility of reentrant
+* calls to work_on_cpu.
+*/
node = dev_to_node(&dev->dev);
-   if (node >= 0) {
+   if ((node >= 0) && (node != numa_node_id())) {
int cpu;
 
get_online_cpus();



Re: [PATCH] pci: Avoid reentrant calls to work_on_cpu

2013-05-14 Thread Alexander Duyck
On 05/14/2013 05:32 PM, Or Gerlitz wrote:
> On Tue, May 14, 2013 at 6:26 PM, Alexander Duyck
>  wrote:
>>
>> This change is meant to fix a deadlock seen when pci_enable_sriov was
>> called from within a driver's probe routine.  The issue was that
>> work_on_cpu calls flush_work which attempts to flush a work queue for a
>> cpu that we are currently working in.  In order to avoid the reentrant
>> path we just skip the call to work_on_cpu in the case that the device
>> node matches our current node.
>>
>> Reported-by: Yinghai Lu 
>> Signed-off-by: Alexander Duyck 
>> ---
>>
>> This patch is meant to address the issue pointed out in an earlier patch
>> sent by Yinghai Lu titled:
>>   [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's
>>
>>  drivers/pci/pci-driver.c |   14 +-
>>  1 files changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index 79277fb..caeb1c0 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -277,12 +277,16 @@ static int pci_call_probe(struct pci_driver *drv,
>> struct pci_dev *dev,
>> int error, node;
>> struct drv_dev_and_id ddi = { drv, dev, id };
>>
>> -   /* Execute driver initialization on node where the device's
>> -  bus is attached to.  This way the driver likely allocates
>> -  its local memory on the right node without any need to
>> -  change it. */
>> +   /*
>> +* Execute driver initialization on the node where the device's
>> +* bus is attached.  This way the driver likely allocates
>> +* its local memory on the right node without any need to
>> +* change it.  If the node is the current node just call
>> +* local_pci_probe and avoid the possibility of reentrant
>> +* calls to work_on_cpu.
>> +*/
>> node = dev_to_node(&dev->dev);
>> -   if (node >= 0) {
>> +   if ((node >= 0) && (node != numa_node_id())) {
>> int cpu;
>>
>> get_online_cpus();
> 
> 
> Alex, FWIW a similar patch was posted by Michael during the last rc
> cycles of 3.9 see
> http://marc.info/?l=linux-netdev&m=136569426119644&w=2

Did his patch ever get applied anywhere?  I don't see it in any of the
trees.

The advantage this approach has over the one in the similar patch is
that this covers a broader set of CPUs since anything on the same node
is local versus just the first CPU in a given NUMA node.
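
Roughly, the difference between the two checks looks like this (a
sketch based on my reading of the two patches, not code from either):

#include <linux/device.h>
#include <linux/topology.h>
#include <linux/cpumask.h>

static bool probe_is_local(struct device *dev)
{
	int node = dev_to_node(dev);

	/* the similar patch: local only when running on the first CPU
	 * of the device's node:
	 *	raw_smp_processor_id() == cpumask_first(cpumask_of_node(node))
	 */

	/* this patch: local when running on any CPU of that node, or
	 * when the device has no node affinity at all */
	return (node < 0) || (node == numa_node_id());
}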

Thanks,

Alex





Re: [PATCH v2 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-14 Thread Alexander Duyck
On 05/14/2013 07:48 PM, Yinghai Lu wrote:
> Found that the kernel tries to load the mlx4 driver for VFs before
> the PF's driver is loaded when the drivers are built-in, and the kernel
> command line includes probe_vfs=63, num_vfs=63.
> 
> [  169.581682] calling  mlx4_init+0x0/0x119 @ 1
> [  169.595681] mlx4_core: Mellanox ConnectX core driver v1.1 (Dec, 2011)
> [  169.600194] mlx4_core: Initializing 0000:02:00.0
> [  169.616322] mlx4_core 0000:02:00.0: Enabling SR-IOV with 63 VFs
> [  169.724084] pci 0000:02:00.1: [15b3:1002] type 00 class 0x0c0600
> [  169.732442] mlx4_core: Initializing 0000:02:00.1
> [  169.734345] mlx4_core 0000:02:00.1: enabling device (0000 -> 0002)
> [  169.747060] mlx4_core 0000:02:00.1: enabling bus mastering
> [  169.764283] mlx4_core 0000:02:00.1: Detected virtual function - running in slave mode
> [  169.767409] mlx4_core 0000:02:00.1: with iommu 3 : domain 11
> [  169.785589] mlx4_core 0000:02:00.1: Sending reset
> [  179.790131] mlx4_core 0000:02:00.1: Got slave FLRed from Communication channel (ret:0x1)
> [  181.798661] mlx4_core 0000:02:00.1: slave is currently in the middle of FLR. retrying...(try num:1)
> [  181.803336] mlx4_core 0000:02:00.1: Communication channel is not idle. my toggle is 1 (cmd:0x0)
> ...
> [  182.078710] mlx4_core 0000:02:00.1: slave is currently in the middle of FLR. retrying...(try num:10)
> [  182.096657] mlx4_core 0000:02:00.1: Communication channel is not idle. my toggle is 1 (cmd:0x0)
> [  182.104935] mlx4_core 0000:02:00.1: slave driver version is not supported by the master
> [  182.118570] mlx4_core 0000:02:00.1: Communication channel is not idle. my toggle is 1 (cmd:0x0)
> [  182.138190] mlx4_core 0000:02:00.1: Failed to initialize slave
> [  182.141728] mlx4_core: probe of 0000:02:00.1 failed with error -5
> 
> It turns out that this also happens for the hotadd path even when drivers are
> compiled as modules and are loaded. Especially since some VFs share the
> same driver with the PF.
> 
> calling path:
>   device driver probe
>   ==> pci_enable_sriov
>   ==> virtfn_add
>   ==> pci_dev_add
>   ==> pci_bus_device_add
> when pci_bus_device_add is called, the VF's driver will be attached.
> and at that time the PF's driver has not finished yet.
> 
> Need to move pci_bus_device_add out of virtfn_add and call it
> later.
> 
> bnx2x and qlcnic are ok, because they do not use module command line
> options to enable sriov. They must use sysfs to enable it.
> 
> be2net is ok, according to Sathya Perla,
> he fixed this issue in be2net with the following patch (commit b4c1df93)
>   http://marc.info/?l=linux-netdev&m=136801459808765&w=2
> 
> igb and ixgbe are ok, as Alex Duyck said:
> | The VF driver should be able to be loaded when the PF driver is not
> | present.  We handle it in igb and ixgbe last I checked, and I don't see
> | any reason why it cannot be handled in all other VF drivers.  I'm not
> | saying the VF has to be fully functional, but it should be able
> | to detect the PF becoming enabled and then bring itself to a fully
> | functional state.  To not handle that case is a bug.
> 
> Looks like the patch will help enic, mlx4, efx, vxge and lpfc now.
> 
> -v2: don't use schedule_callback; use an initcall after Alex's patch:
>   pci: Avoid reentrant calls to work_on_cpu
> 
> Signed-off-by: Yinghai Lu 
> Cc: Alexander Duyck 
> Cc: Yan Burman 
> Cc: Sathya Perla 
> Cc: net...@vger.kernel.org
> 

This is a driver bug in mlx4 and possibly a few others, not a bug in the
SR-IOV code.  My concern is your patch may introduce issues in all of
the drivers, especially the ones that don't need this workaround.
Fixing the kernel to make this work is just encouraging a poor design model.

The problem is the mlx4 driver is enabling SR-IOV before it is ready to
support VFs.  The mlx4 driver should probably be fixed by either,
changing over to sysfs, provisioning the resources before enabling
SR-IOV like be2net, or via the igb/ixgbe approach where the VF
gracefully handles the PF not being present.

Thanks,

Alex






Re: [PATCH v2 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-20 Thread Alexander Duyck
On 05/20/2013 05:28 AM, Or Gerlitz wrote:
> On Wed, May 15, 2013 at 7:12 PM, Greg Rose  wrote:
>
>
>> I'm really not a fan of this.  Seems to me the tail is wagging the dog
>> here.  Fix the driver to work without a PF driver being present.
> Greg, Alex,
>
> As I wrote over the V1 thread, currently we can't go and patch mlx4 to
> use the sysfs API nor defer the call from within our probe function to
> enable sriov, since this requires some firmware change to allow
> enabling SRIOV after some resources are initialized/provisioned.
> Hence the patch suggested here, or any other patch we can agree on
> which will make sure that VF probing is done only once the PF is ready,
> is preferred, I think.

I guess I am not understanding.  Are you saying you have to enable
SR-IOV, then allocate some resources, and then wait for firmware to
complete, and then load VFs?  Is it not possible to do whatever it is
you need to do in firmware first, and then enable SR-IOV?

Would it be possible for the VFs to detect this state?  If so you could
probably work around it by either delaying probe as Ben suggested with
EPROBE_DEFER, or by using something such as the igbvf/ixgbevf approach
which will treat the lack of a PF and resources as a link down condition
until the PF and resources become available.
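
For the deferral option that would look something like this in the VF
probe routine (pf_resources_ready() is a hypothetical stand-in for
however the VF detects that state):

#include <linux/pci.h>
#include <linux/errno.h>

static bool pf_resources_ready(struct pci_dev *pdev)
{
	return false;	/* stand-in for the real readiness check */
}

static int vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	if (!pf_resources_ready(pdev))
		return -EPROBE_DEFER;	/* driver core retries the probe later */

	/* ... normal VF initialization ... */
	return 0;
}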

> I wasn't sure I totally followed the argument that things need to
> work when the PF is absent, in the sense that there's no driver instance
> around over which the PF is probed. If you can explain a little better,
> that would help.
>
> Or.

The problem I was referring to was the case where the PF is loaded, the
VFs are then assigned to guests, and then someone attempts to unload the
PF driver.  The problem in that case is that disabling SR-IOV will cause
all of the guests with assigned VFs to crash, so the solution is to leave
the VFs loaded when the PF is unloaded, or we would have to block PF
driver unload.  As such the Intel VFs have to deal with a PF that can be
unloaded while they are present.

If you take a look at the code for the igb/igbvf drivers it is a bit
easier to tell what is going on in terms of how we handle the unloaded
PF state.  Basically what happens is that the mailbox we use goes dead
so we just report link down until we can get the PF to come back on the
other end of the mailbox.

Thanks,

Alex


Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-21 Thread Alexander Duyck
On 05/21/2013 02:31 PM, Don Dutile wrote:
> On 05/21/2013 05:30 PM, Don Dutile wrote:
>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>  wrote:
>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>  wrote:
>>>>>>> I'm sorry, but what is the point of this patch? With device
>>>>>>> assignment
>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>> guests that have devices assigned.
>>>> ixgbe_remove did call pci_disable_sriov...
>>>>
>>>> for guest panic, that is another problem.
>>>> just like you pci passthrough with real pci device and hotremove the
>>>> card in host.
>>>>
>>>> ...
>>>
>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>> function that is called we do a check for assigned VFs. If they are
>>> assigned then we do not call pci_disable_sriov.
>>>
>>>>
>>>>> So how does your patch actually fix this problem? It seems like it is
>>>>> just avoiding it.
>>>> yes, until the first one is done.
>>>
>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>> are likely just introducing more bugs as a result.
>>>
>>>>> From what I can tell your problem is originating in pci_call_probe. I
>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>> might want to look there to see why you are trying to schedule work
>>>>> on a
>>>>> CPU which should be perfectly fine for you to already be doing your
>>>>> work on.
>>>> it always try to go with local cpu with same pxm.
>>>
>>> The problem is we really shouldn't be calling work_on_cpu in this case
>>> since we are already on the correct CPU. What probably should be
>>> happening is that pci_call_probe should be doing a check to see if the
>>> current CPU is already contained within the device node map and if so
>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>> the system by trying to flush the CPU queue of the CPU you are
>>> currently on.
>>>
>> That's the patch that Michael Tsirkin posted for a fix,
>> but it was noted that if you have the case where the _same_ driver is
>> used for vf & pf,
>> other deadlocks may occur.
>> It would work in the case of ixgbe/ixgbevf, but not for something like
>> the Mellanox pf/vf driver (which is the same).
>>
> apologies; here's the thread the discussed the issue:
> https://patchwork.kernel.org/patch/2458681/
> 

I found out about that patch after I submitted one that was similar.
The only real complaint I had about his patch was that it was only
looking at the CPU and he could save himself some trouble by just doing
the work locally if we were on the correct NUMA node.  For example if
the system only has one node in it what is the point in scheduling all
of the work on CPU 0?  My alternative patch can be found at:
https://patchwork.kernel.org/patch/2568881/

As far as the inter-driver locking issues for the same driver, I don't think
that is really any kind of issue.  Most drivers shouldn't be holding any
big locks when they call pci_enable_sriov.  If I am not mistaken, the
follow-on patch I submitted, which was similar to Michael's, was reported
to have resolved the issue.

As far as the Mellanox PF/VF the bigger issue is that when they call
pci_enable_sriov they are not ready to handle VFs.  There have been
several suggestions on how to resolve it including -EPROBE_DEFER or the
igbvf/ixgbevf approach of just bringing up the device in a "link down" state.

Thanks,

Alex



Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-21 Thread Alexander Duyck
On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
> On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>   wrote:
>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>   wrote:
>>>>>>> I'm sorry, but what is the point of this patch?  With device assignment
>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>> guests that have devices assigned.
>>>> ixgbe_remove did call pci_disable_sriov...
>>>>
>>>> for guest panic, that is another problem.
>>>> just like you pci passthrough with real pci device and hotremove the
>>>> card in host.
>>>>
>>>> ...
>>>
>>> I suggest you take another look.  In ixgbe_disable_sriov, which is the
>>> function that is called we do a check for assigned VFs.  If they are
>>> assigned then we do not call pci_disable_sriov.
>>>
>>>>
>>>>> So how does your patch actually fix this problem?  It seems like it is
>>>>> just avoiding it.
>>>> yes, until the first one is done.
>>>
>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>> are likely just introducing more bugs as a result.
>>>
>>>>> From what I can tell your problem is originating in pci_call_probe.  I
>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>> might want to look there to see why you are trying to schedule work on a
>>>>> CPU which should be perfectly fine for you to already be doing your work 
>>>>> on.
>>>> it always try to go with local cpu with same pxm.
>>>
>>> The problem is we really shouldn't be calling work_on_cpu in this case
>>> since we are already on the correct CPU.  What probably should be
>>> happening is that pci_call_probe should be doing a check to see if the
>>> current CPU is already contained within the device node map and if so
>>> just call local_pci_probe directly.  That way you can avoid deadlocking
>>> the system by trying to flush the CPU queue of the CPU you are currently on.
>>>
>> That's the patch that Michael Tsirkin posted for a fix,
>> but it was noted that if you have the case where the _same_ driver is used 
>> for vf & pf,
>> other deadlocks may occur.
>> It would work in the case of ixgbe/ixgbevf, but not for something like
>> the Mellanox pf/vf driver (which is the same).
>>
> 
> I think our conclusion was this is a false positive for Mellanox.
> If not, we need to understand what the deadlock is better.
> 

As I understand the issue, the problem is not a deadlock for Mellanox
(At least with either your patch or mine applied), the issue is that the
PF is not ready to handle VFs when pci_enable_sriov is called due to
some firmware issues.

Thanks,

Alex



Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-21 Thread Alexander Duyck
On 05/21/2013 03:09 PM, Don Dutile wrote:
> On 05/21/2013 05:58 PM, Alexander Duyck wrote:
>> On 05/21/2013 02:31 PM, Don Dutile wrote:
>>> On 05/21/2013 05:30 PM, Don Dutile wrote:
>>>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>>>   wrote:
>>>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>>>   wrote:
>>>>>>>>> I'm sorry, but what is the point of this patch? With device
>>>>>>>>> assignment
>>>>>>>>> it is always possible to have VFs loaded and the PF driver
>>>>>>>>> unloaded
>>>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>>>> guests that have devices assigned.
>>>>>> ixgbe_remove did call pci_disable_sriov...
>>>>>>
>>>>>> for guest panic, that is another problem.
>>>>>> just like you pci passthrough with real pci device and hotremove the
>>>>>> card in host.
>>>>>>
>>>>>> ...
>>>>>
>>>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>>>> function that is called we do a check for assigned VFs. If they are
>>>>> assigned then we do not call pci_disable_sriov.
>>>>>
>>>>>>
>>>>>>> So how does your patch actually fix this problem? It seems like
>>>>>>> it is
>>>>>>> just avoiding it.
>>>>>> yes, until the first one is done.
>>>>>
>>>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>>>> are likely just introducing more bugs as a result.
>>>>>
>>>>>>>  From what I can tell your problem is originating in
>>>>>>> pci_call_probe. I
>>>>>>> believe it is calling work_on_cpu and that doesn't seem correct
>>>>>>> since
>>>>>>> the work should be taking place on a CPU already local to the PF.
>>>>>>> You
>>>>>>> might want to look there to see why you are trying to schedule work
>>>>>>> on a
>>>>>>> CPU which should be perfectly fine for you to already be doing your
>>>>>>> work on.
>>>>>> it always try to go with local cpu with same pxm.
>>>>>
>>>>> The problem is we really shouldn't be calling work_on_cpu in this
>>>>> case
>>>>> since we are already on the correct CPU. What probably should be
>>>>> happening is that pci_call_probe should be doing a check to see if the
>>>>> current CPU is already contained within the device node map and if so
>>>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>>>> the system by trying to flush the CPU queue of the CPU you are
>>>>> currently on.
>>>>>
>>>> That's the patch that Michael Tsirkin posted for a fix,
>>>> but it was noted that if you have the case where the _same_ driver is
>>>> used for vf & pf,
>>>> other deadlocks may occur.
>>>> It would work in the case of ixgbe/ixgbevf, but not for something like
>>>> the Mellanox pf/vf driver (which is the same).
>>>>
>>> apologies; here's the thread the discussed the issue:
>>> https://patchwork.kernel.org/patch/2458681/
>>>
>>
>> I found out about that patch after I submitted one that was similar.
>> The only real complaint I had about his patch was that it was only
>> looking at the CPU and he could save himself some trouble by just doing
>> the work locally if we were on the correct NUMA node.  For example if
>> the system only has one node in it what is the point in scheduling all
>> of the work on CPU 0?  My alternative patch can be found at:
>> https://patchwork.kernel.org/patch/2568881/
>>
>> As far as the inter-driver locking issues for the same driver, I don't think
>> that is really any kind of issue.  Most drivers shouldn't be holding any
>> big locks when they call pci_enable_sriov.  If I am not mistaken, the
>> follow-on patch I submitted, which was similar to Michael's, was reported
>> to have resolved the issue.
>>
> You mean the above patchwork patch, or another one?

Well I know the above patchwork patch resolves it, but I am assuming
Michael's would probably work as well since it resolves the underlying issue.

>> As far as the Mellanox PF/VF the bigger issue is that when they call
>> pci_enable_sriov they are not ready to handle VFs.  There have been
>> several suggestions on how to resolve it including -EPROBE_DEFER or the
>> igbvf/ixgbevf approach of just bringing up the device in a "link down"
>> state.
>>
> thanks for the summary.  i was backlogged on email, and responding as i read
> them;
> I should have read through the whole thread before chiming in.
> 

No problem.  My main concern at this point is that we should probably
get either Michael's patch or mine pulled in since the potential for
deadlock is still there.

Thanks,

Alex



Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

2013-05-21 Thread Alexander Duyck
On 05/21/2013 03:11 PM, Michael S. Tsirkin wrote:
> On Tue, May 21, 2013 at 03:01:08PM -0700, Alexander Duyck wrote:
>> On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
>>> On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
>>>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>>>   wrote:
>>>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>>>   wrote:
>>>>>>>>> I'm sorry, but what is the point of this patch?  With device 
>>>>>>>>> assignment
>>>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>>>> guests that have devices assigned.
>>>>>> ixgbe_remove did call pci_disable_sriov...
>>>>>>
>>>>>> for guest panic, that is another problem.
>>>>>> just like you pci passthrough with real pci device and hotremove the
>>>>>> card in host.
>>>>>>
>>>>>> ...
>>>>>
>>>>> I suggest you take another look.  In ixgbe_disable_sriov, which is the
>>>>> function that is called we do a check for assigned VFs.  If they are
>>>>> assigned then we do not call pci_disable_sriov.
>>>>>
>>>>>>
>>>>>>> So how does your patch actually fix this problem?  It seems like it is
>>>>>>> just avoiding it.
>>>>>> yes, until the first one is done.
>>>>>
>>>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>>>> are likely just introducing more bugs as a result.
>>>>>
>>>>>>> From what I can tell your problem is originating in pci_call_probe.  I
>>>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>>>> might want to look there to see why you are trying to schedule work on a
>>>>>>> CPU which should be perfectly fine for you to already be doing your 
>>>>>>> work on.
>>>>>> it always try to go with local cpu with same pxm.
>>>>>
>>>>> The problem is we really shouldn't be calling work_on_cpu in this case
>>>>> since we are already on the correct CPU.  What probably should be
>>>>> happening is that pci_call_probe should be doing a check to see if the
>>>>> current CPU is already contained within the device node map and if so
>>>>> just call local_pci_probe directly.  That way you can avoid deadlocking
>>>>> the system by trying to flush the CPU queue of the CPU you are currently 
>>>>> on.
>>>>>
>>>> That's the patch that Michael Tsirkin posted for a fix,
>>>> but it was noted that if you have the case where the _same_ driver is used 
>>>> for vf & pf,
>>>> other deadlocks may occur.
>>>> It would work in the case of ixgbe/ixgbevf, but not for something like
>>>> the Mellanox pf/vf driver (which is the same).
>>>>
>>>
>>> I think our conclusion was this is a false positive for Mellanox.
>>> If not, we need to understand what the deadlock is better.
>>>
>>
>> As I understand the issue, the problem is not a deadlock for Mellanox
>> (At least with either your patch or mine applied), the issue is that the
>> PF is not ready to handle VFs when pci_enable_sriov is called due to
>> some firmware issues.
>>
>> Thanks,
>>
>> Alex
> 
> I haven't seen Mellanox guys say anything like this on the list.
> Pointers?
> All I saw is some lockdep warnings and Tejun says they are bogus ...

Actually the patch I submitted is at:
https://patchwork.kernel.org/patch/2568881/

It was in response to:
https://patchwork.kernel.org/patch/2562471/

Basically the patch I was responding to was supposed to address both the
lockdep issue and a problem with mlx4 not being able to support the VFs
when pci_enable_sriov is called.  Yinghai had specifically called out
the work_on_cpu lockdep issue that you also submitted a patch for.

As per the feedback from Yinghai it seems like my patch does resolve the
lockdep issue that was seen.  The other half of the issue was what we
have been discussing with Or in regards to delaying VF driver init via
something like -EPROBE_DEFER instead of trying to split up
pci_enable_sriov and VF probe.

Thanks,

Alex


Re: [RFC PATCH 1/2] dma-debug: allow size to become smaller in dma_unmap

2013-05-27 Thread Alexander Duyck
On 05/27/2013 09:13 AM, Ming Lei wrote:
> This patch loosens the check on DMA buffer size for streaming
> DMA unmap, based on the following fact:
>
> - it is common to see only part of a DMA transfer completed,
> especially in the case of DMA_FROM_DEVICE
>
> So it isn't necessary to unmap the whole DMA buffer when
> unmapping, and unmapping only the completed part should be more
> efficient. Considering that unmapping is often called in hard irq
> context, time spent in irq handling can be saved.
> 
> Cc: Shuah Khan 
> Cc: Joerg Roedel 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Konrad Rzeszutek Wilk 
> Signed-off-by: Ming Lei 

What you are proposing doesn't make much sense.  If you only want
to use part of a buffer then just use the dma_sync primitives.

The idea behind unmapping a buffer is to free any resources associated
with it.  Calling map once and unmapping multiple times per buffer is just
asking for trouble in the form of use-after-free bugs or memory leaks.
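
For reference, a minimal sketch of that usage (generic DMA API calls;
the buffer handling is illustrative): map once, sync only the completed
portion for the CPU, and unmap once with the size used at map time.

#include <linux/dma-mapping.h>

#define BUF_SIZE 2048	/* illustrative mapped size */

static int rx_complete_example(struct device *dev, void *buf, size_t completed)
{
	dma_addr_t dma = dma_map_single(dev, buf, BUF_SIZE, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* the device finished only 'completed' bytes: hand just that
	 * part back to the CPU instead of shrinking the unmap */
	dma_sync_single_for_cpu(dev, dma, completed, DMA_FROM_DEVICE);
	/* ... consume buf[0..completed) here ... */

	/* unmap once, with the same size that was mapped */
	dma_unmap_single(dev, dma, BUF_SIZE, DMA_FROM_DEVICE);
	return 0;
}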

Thanks,

Alex



[PATCH v2 1/7] swiotlb: Make io_tlb_end a physical address instead of a virtual one

2012-10-11 Thread Alexander Duyck
This change replaces all references to the virtual address for io_tlb_end
with references to the physical address io_tlb_end.  The main advantage of
replacing the virtual address with a physical address is that we can avoid
having to do multiple translations from the virtual address to the physical
one needed for testing an existing DMA address.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   24 +---
 1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index f114bf6..19aac9f 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -57,7 +57,8 @@ int swiotlb_force;
  * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by 
this
  * API.
  */
-static char *io_tlb_start, *io_tlb_end;
+static char *io_tlb_start;
+phys_addr_t io_tlb_end;
 
 /*
  * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
@@ -125,14 +126,16 @@ static dma_addr_t swiotlb_virt_to_bus(struct device 
*hwdev,
 void swiotlb_print_info(void)
 {
unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
-   phys_addr_t pstart, pend;
+   phys_addr_t pstart;
+   unsigned char *vend;
 
pstart = virt_to_phys(io_tlb_start);
-   pend = virt_to_phys(io_tlb_end);
+   vend = phys_to_virt(io_tlb_end);
 
printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) 
mapped at [%p-%p]\n",
-  (unsigned long long)pstart, (unsigned long long)pend - 1,
-  bytes >> 20, io_tlb_start, io_tlb_end - 1);
+  (unsigned long long)pstart,
+  (unsigned long long)io_tlb_end,
+  bytes >> 20, io_tlb_start, vend - 1);
 }
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
@@ -143,7 +146,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
 
io_tlb_nslabs = nslabs;
io_tlb_start = tlb;
-   io_tlb_end = io_tlb_start + bytes;
+   io_tlb_end = __pa(io_tlb_start) + bytes;
 
/*
 * Allocate and initialize the free list array.  This array is used
@@ -254,7 +257,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 
io_tlb_nslabs = nslabs;
io_tlb_start = tlb;
-   io_tlb_end = io_tlb_start + bytes;
+   io_tlb_end = virt_to_phys(io_tlb_start) + bytes;
 
memset(io_tlb_start, 0, bytes);
 
@@ -304,7 +307,7 @@ cleanup3:
 sizeof(int)));
io_tlb_list = NULL;
 cleanup2:
-   io_tlb_end = NULL;
+   io_tlb_end = 0;
io_tlb_start = NULL;
io_tlb_nslabs = 0;
return -ENOMEM;
@@ -339,8 +342,7 @@ void __init swiotlb_free(void)
 
 static int is_swiotlb_buffer(phys_addr_t paddr)
 {
-   return paddr >= virt_to_phys(io_tlb_start) &&
-   paddr < virt_to_phys(io_tlb_end);
+   return paddr >= virt_to_phys(io_tlb_start) && paddr < io_tlb_end;
 }
 
 /*
@@ -938,6 +940,6 @@ EXPORT_SYMBOL(swiotlb_dma_mapping_error);
 int
 swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-   return swiotlb_virt_to_bus(hwdev, io_tlb_end - 1) <= mask;
+   return phys_to_dma(hwdev, io_tlb_end - 1) <= mask;
 }
 EXPORT_SYMBOL(swiotlb_dma_supported);



[PATCH v2 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual one

2012-10-11 Thread Alexander Duyck
This change replaces all references to the virtual address for io_tlb_start
with references to the physical address io_tlb_start.  The main advantage of
replacing the virtual address with a physical address is that we can avoid
having to do multiple translations from the virtual address to the physical
one needed for testing an existing DMA address.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   58 +
 1 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 19aac9f..c492b84 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -57,8 +57,7 @@ int swiotlb_force;
  * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by 
this
  * API.
  */
-static char *io_tlb_start;
-phys_addr_t io_tlb_end;
+phys_addr_t io_tlb_start, io_tlb_end;
 
 /*
  * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
@@ -126,16 +125,15 @@ static dma_addr_t swiotlb_virt_to_bus(struct device 
*hwdev,
 void swiotlb_print_info(void)
 {
unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
-   phys_addr_t pstart;
-   unsigned char *vend;
+   unsigned char *vstart, *vend;
 
-   pstart = virt_to_phys(io_tlb_start);
+   vstart = phys_to_virt(io_tlb_start);
vend = phys_to_virt(io_tlb_end);
 
printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) 
mapped at [%p-%p]\n",
-  (unsigned long long)pstart,
+  (unsigned long long)io_tlb_start,
   (unsigned long long)io_tlb_end,
-  bytes >> 20, io_tlb_start, vend - 1);
+  bytes >> 20, vstart, vend - 1);
 }
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
@@ -145,8 +143,8 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
bytes = nslabs << IO_TLB_SHIFT;
 
io_tlb_nslabs = nslabs;
-   io_tlb_start = tlb;
-   io_tlb_end = __pa(io_tlb_start) + bytes;
+   io_tlb_start = __pa(tlb);
+   io_tlb_end = io_tlb_start + bytes;
 
/*
 * Allocate and initialize the free list array.  This array is used
@@ -176,6 +174,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
 static void __init
 swiotlb_init_with_default_size(size_t default_size, int verbose)
 {
+   unsigned char *vstart;
unsigned long bytes;
 
if (!io_tlb_nslabs) {
@@ -188,11 +187,11 @@ swiotlb_init_with_default_size(size_t default_size, int 
verbose)
/*
 * Get IO TLB memory from the low pages
 */
-   io_tlb_start = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
-   if (!io_tlb_start)
+   vstart = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
+   if (!vstart)
panic("Cannot allocate SWIOTLB buffer");
 
-   swiotlb_init_with_tbl(io_tlb_start, io_tlb_nslabs, verbose);
+   swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose);
 }
 
 void __init
@@ -210,6 +209,7 @@ int
 swiotlb_late_init_with_default_size(size_t default_size)
 {
unsigned long bytes, req_nslabs = io_tlb_nslabs;
+   unsigned char *vstart = NULL;
unsigned int order;
int rc = 0;
 
@@ -226,14 +226,14 @@ swiotlb_late_init_with_default_size(size_t default_size)
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
 
while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-   io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
-   order);
-   if (io_tlb_start)
+   vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
+ order);
+   if (vstart)
break;
order--;
}
 
-   if (!io_tlb_start) {
+   if (!vstart) {
io_tlb_nslabs = req_nslabs;
return -ENOMEM;
}
@@ -242,9 +242,9 @@ swiotlb_late_init_with_default_size(size_t default_size)
   "for software IO TLB\n", (PAGE_SIZE << order) >> 20);
io_tlb_nslabs = SLABS_PER_PAGE << order;
}
-   rc = swiotlb_late_init_with_tbl(io_tlb_start, io_tlb_nslabs);
+   rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs);
if (rc)
-   free_pages((unsigned long)io_tlb_start, order);
+   free_pages((unsigned long)vstart, order);
return rc;
 }
 
@@ -256,10 +256,10 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long 
nslabs)
bytes = nslabs << IO_TLB_SHIFT;
 
io_tlb_nslabs = nslabs;
-   io_tlb_start = tlb;
-   io_tlb_end = virt_to_phys(io_tlb_start) + bytes;
+   io_tlb_start = virt_to_phys(tlb);
+   io_tlb_end = io_tlb_start + bytes;
 
-   memset(io_tlb_start, 0, bytes);
+   memset(tlb, 0, bytes);
 
   

[PATCH v2 5/7] swiotlb: Use physical addresses for swiotlb_tbl_unmap_single

2012-10-11 Thread Alexander Duyck
This change makes it so that the unmap functionality also uses physical
addresses.  This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.

In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_unmap_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr.  This way it should be clear that orig_addr is
contained within io_tlb_orig_addr and tlb_addr is an address within the
io_tlb buffer.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |4 ++--
 include/linux/swiotlb.h   |3 ++-
 lib/swiotlb.c |   37 +++--
 3 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 8a6035a..4cedc28 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -364,7 +364,7 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct 
page *page,
 * Ensure that the address returned is DMA'ble
 */
if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
+   swiotlb_tbl_unmap_single(dev, map, size, dir);
dev_addr = 0;
}
return dev_addr;
@@ -388,7 +388,7 @@ static void xen_unmap_single(struct device *hwdev, 
dma_addr_t dev_addr,
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
-   swiotlb_tbl_unmap_single(hwdev, phys_to_virt(paddr), size, dir);
+   swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
return;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 1995f3e..291643c 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -43,7 +43,8 @@ extern phys_addr_t swiotlb_tbl_map_single(struct device 
*hwdev,
  phys_addr_t phys, size_t size,
  enum dma_data_direction dir);
 
-extern void swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr,
+extern void swiotlb_tbl_unmap_single(struct device *hwdev,
+phys_addr_t tlb_addr,
 size_t size, enum dma_data_direction dir);
 
 extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 58d0bbd..e0e66db 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -515,20 +515,20 @@ phys_addr_t map_single(struct device *hwdev, phys_addr_t 
phys, size_t size,
 /*
  * dma_addr is the kernel virtual address of the bounce buffer to unmap.
  */
-void
-swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size,
-   enum dma_data_direction dir)
+void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
+ size_t size, enum dma_data_direction dir)
 {
unsigned long flags;
int i, count, nslots = ALIGN(size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
-   int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> 
IO_TLB_SHIFT;
-   phys_addr_t phys = io_tlb_orig_addr[index];
+   int index = (tlb_addr - io_tlb_start) >> IO_TLB_SHIFT;
+   phys_addr_t orig_addr = io_tlb_orig_addr[index];
 
/*
 * First, sync the memory before unmapping the entry
 */
-   if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-   swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+   if (orig_addr && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
+   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr),
+  size, DMA_FROM_DEVICE);
 
/*
 * Return the buffer to the free list by setting the corresponding
@@ -621,17 +621,18 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
 
ret = phys_to_virt(paddr);
dev_addr = phys_to_dma(hwdev, paddr);
-   }
 
-   /* Confirm address can be DMA'd by device */
-   if (dev_addr + size - 1 > dma_mask) {
-   printk("hwdev DMA mask = 0x%016Lx, dev_addr = 0x%016Lx\n",
-  (unsigned long long)dma_mask,
-  (unsigned long long)dev_addr);
+   /* Confirm address can be DMA'd by device */
+   if (dev_addr + size - 1 > dma_mask) {
+   printk("hwdev DMA mask = 0x%016Lx, dev_addr = 
0x%016Lx\n",
+  (unsigned long long)dma_mask,
+  (unsigned long long)dev_addr);
 
-   /* DMA_TO_DEVICE to avoid memcpy in unmap_single */
-   swiotlb_tbl_unmap_single(hwdev, ret, size, DMA_TO_DEVICE);
-   return NULL;
+   /* DMA_TO_DEVICE to avoid memcpy in unmap_single */
+   swiotlb_tbl_unmap_single(hwdev, paddr,
+   

[PATCH v2 3/7] swiotlb: Make io_tlb_overflow_buffer a physical address

2012-10-11 Thread Alexander Duyck
This change makes it so that we can avoid virt_to_phys overhead when using the
io_tlb_overflow_buffer.  My original plan was to completely remove the value
and replace it with a constant but I had seen that there were recent patches
that stated this couldn't be done until all device drivers that depended on
that functionality were updated.

Signed-off-by: Alexander Duyck 
---

 lib/swiotlb.c |   61 -
 1 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index c492b84..383f780 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -70,7 +70,7 @@ static unsigned long io_tlb_nslabs;
  */
 static unsigned long io_tlb_overflow = 32*1024;
 
-static void *io_tlb_overflow_buffer;
+phys_addr_t io_tlb_overflow_buffer;
 
 /*
  * This is a free list describing the number of free entries available from
@@ -138,6 +138,7 @@ void swiotlb_print_info(void)
 
 void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 {
+   void *v_overflow_buffer;
unsigned long i, bytes;
 
bytes = nslabs << IO_TLB_SHIFT;
@@ -147,6 +148,15 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
io_tlb_end = io_tlb_start + bytes;
 
/*
+* Get the overflow emergency buffer
+*/
+   v_overflow_buffer = 
alloc_bootmem_low_pages(PAGE_ALIGN(io_tlb_overflow));
+   if (!v_overflow_buffer)
+   panic("Cannot allocate SWIOTLB overflow buffer!\n");
+
+   io_tlb_overflow_buffer = __pa(v_overflow_buffer);
+
+   /*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 * between io_tlb_start and io_tlb_end.
@@ -157,12 +167,6 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
io_tlb_index = 0;
io_tlb_orig_addr = alloc_bootmem_pages(PAGE_ALIGN(io_tlb_nslabs * 
sizeof(phys_addr_t)));
 
-   /*
-* Get the overflow emergency buffer
-*/
-   io_tlb_overflow_buffer = 
alloc_bootmem_low_pages(PAGE_ALIGN(io_tlb_overflow));
-   if (!io_tlb_overflow_buffer)
-   panic("Cannot allocate SWIOTLB overflow buffer!\n");
if (verbose)
swiotlb_print_info();
 }
@@ -252,6 +256,7 @@ int
 swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 {
unsigned long i, bytes;
+   unsigned char *v_overflow_buffer;
 
bytes = nslabs << IO_TLB_SHIFT;
 
@@ -262,6 +267,16 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
memset(tlb, 0, bytes);
 
/*
+* Get the overflow emergency buffer
+*/
+   v_overflow_buffer = (void *)__get_free_pages(GFP_DMA,
+
get_order(io_tlb_overflow));
+   if (!v_overflow_buffer)
+   goto cleanup2;
+
+   io_tlb_overflow_buffer = virt_to_phys(v_overflow_buffer);
+
+   /*
 * Allocate and initialize the free list array.  This array is used
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 * between io_tlb_start and io_tlb_end.
@@ -269,7 +284,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
  get_order(io_tlb_nslabs * sizeof(int)));
if (!io_tlb_list)
-   goto cleanup2;
+   goto cleanup3;
 
for (i = 0; i < io_tlb_nslabs; i++)
io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
@@ -280,18 +295,10 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long 
nslabs)
 get_order(io_tlb_nslabs *
   sizeof(phys_addr_t)));
if (!io_tlb_orig_addr)
-   goto cleanup3;
+   goto cleanup4;
 
memset(io_tlb_orig_addr, 0, io_tlb_nslabs * sizeof(phys_addr_t));
 
-   /*
-* Get the overflow emergency buffer
-*/
-   io_tlb_overflow_buffer = (void *)__get_free_pages(GFP_DMA,
- get_order(io_tlb_overflow));
-   if (!io_tlb_overflow_buffer)
-   goto cleanup4;
-
swiotlb_print_info();
 
late_alloc = 1;
@@ -299,13 +306,13 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long 
nslabs)
return 0;
 
 cleanup4:
-   free_pages((unsigned long)io_tlb_orig_addr,
-  get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
-   io_tlb_orig_addr = NULL;
-cleanup3:
free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
 sizeof(int)));
io_tlb_list = NULL;
+cleanup3:
+   free_pages((unsigned long)v_overflow_buffer,
+  get_order(io_tlb_overfl

[PATCH v2 4/7] swiotlb: Return physical addresses when calling swiotlb_tbl_map_single

2012-10-11 Thread Alexander Duyck
This change makes it so that swiotlb_tbl_map_single will return a physical
address instead of a virtual address when called.  The advantage to this once
again is that we are avoiding a number of virt_to_phys and phys_to_virt
translations by working with everything as a physical address.

One change I had to make in order to support using physical addresses is that
I could no longer trust 0 to be an invalid physical address on all platforms.
So instead I made it so that ~0 is returned on error.  This should never be a
valid return value as it implies that only one byte would be available for
use.

In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_map_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr.  This way it should be clear that orig_addr is
contained within io_tlb_orig_addr and tlb_addr is an address within the
io_tlb buffer.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |   22 ++---
 include/linux/swiotlb.h   |   11 +-
 lib/swiotlb.c |   78 +++--
 3 files changed, 59 insertions(+), 52 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 58db6df..8a6035a 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -338,9 +338,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct 
page *page,
enum dma_data_direction dir,
struct dma_attrs *attrs)
 {
-   phys_addr_t phys = page_to_phys(page) + offset;
+   phys_addr_t map, phys = page_to_phys(page) + offset;
dma_addr_t dev_addr = xen_phys_to_bus(phys);
-   void *map;
 
BUG_ON(dir == DMA_NONE);
/*
@@ -356,16 +355,16 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
struct page *page,
 * Oh well, have to allocate and map a bounce buffer.
 */
map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
-   if (!map)
+   if (map == SWIOTLB_MAP_ERROR)
return DMA_ERROR_CODE;
 
-   dev_addr = xen_virt_to_bus(map);
+   dev_addr = xen_phys_to_bus(map);
 
/*
 * Ensure that the address returned is DMA'ble
 */
if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, map, size, dir);
+   swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
dev_addr = 0;
}
return dev_addr;
@@ -494,11 +493,12 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct 
scatterlist *sgl,
if (swiotlb_force ||
!dma_capable(hwdev, dev_addr, sg->length) ||
range_straddles_page_boundary(paddr, sg->length)) {
-   void *map = swiotlb_tbl_map_single(hwdev,
-  start_dma_addr,
-  sg_phys(sg),
-  sg->length, dir);
-   if (!map) {
+   phys_addr_t map = swiotlb_tbl_map_single(hwdev,
+start_dma_addr,
+sg_phys(sg),
+sg->length,
+dir);
+   if (map == SWIOTLB_MAP_ERROR) {
/* Don't panic here, we expect map_sg users
   to do proper error handling. */
xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
@@ -506,7 +506,7 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct 
scatterlist *sgl,
sgl[0].dma_length = 0;
return DMA_ERROR_CODE;
}
-   sg->dma_address = xen_virt_to_bus(map);
+   sg->dma_address = xen_phys_to_bus(map);
} else
sg->dma_address = dev_addr;
sg->dma_length = sg->length;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8d08b3e..1995f3e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -34,9 +34,14 @@ enum dma_sync_target {
SYNC_FOR_CPU = 0,
SYNC_FOR_DEVICE = 1,
 };
-extern void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t 
tbl_dma_addr,
-   phys_addr_t phys, size_t size,
-   enum dma_data_direction dir);
+
+/* define the last possible byte of physical address space as a mapping error 
*/
+#define SWIOTLB_MAP_ERROR (~(phys_addr_t)0x0)
+
+extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
+   

[PATCH v2 6/7] swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single

2012-10-11 Thread Alexander Duyck
This change makes it so that the sync functionality also uses physical
addresses.  This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.

In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_sync_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr.  This way it should be clear that orig_addr is
contained within io_tlb_orig_addr and tlb_addr is an address within the
io_tlb buffer.

Signed-off-by: Alexander Duyck 
---

 drivers/xen/swiotlb-xen.c |3 +--
 include/linux/swiotlb.h   |3 ++-
 lib/swiotlb.c |   22 +++---
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4cedc28..af47e75 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -433,8 +433,7 @@ xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t 
dev_addr,
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
-   swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
-  target);
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
return;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 291643c..e0ac98f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -47,7 +47,8 @@ extern void swiotlb_tbl_unmap_single(struct device *hwdev,
 phys_addr_t tlb_addr,
 size_t size, enum dma_data_direction dir);
 
-extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
+extern void swiotlb_tbl_sync_single(struct device *hwdev,
+   phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir,
enum dma_sync_target target);
 
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index e0e66db..a81138f 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -557,26 +557,27 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
 }
 EXPORT_SYMBOL_GPL(swiotlb_tbl_unmap_single);
 
-void
-swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
-   enum dma_data_direction dir,
-   enum dma_sync_target target)
+void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t tlb_addr,
+size_t size, enum dma_data_direction dir,
+enum dma_sync_target target)
 {
-   int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;
-   phys_addr_t phys = io_tlb_orig_addr[index];
+   int index = (tlb_addr - io_tlb_start) >> IO_TLB_SHIFT;
+   phys_addr_t orig_addr = io_tlb_orig_addr[index];
 
-   phys += ((unsigned long)dma_addr & ((1 << IO_TLB_SHIFT) - 1));
+   orig_addr += (unsigned long)tlb_addr & ((1 << IO_TLB_SHIFT) - 1);
 
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr),
+  size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr),
+  size, DMA_TO_DEVICE);
else
BUG_ON(dir != DMA_FROM_DEVICE);
break;
@@ -785,8 +786,7 @@ swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
BUG_ON(dir == DMA_NONE);
 
if (is_swiotlb_buffer(paddr)) {
-   swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
-  target);
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
return;
}
 



[PATCH v2 7/7] swiotlb: Do not export swiotlb_bounce since there are no external consumers

2012-10-11 Thread Alexander Duyck
Currently swiotlb is the only consumer of swiotlb_bounce.  Since that is the
case, it doesn't make much sense to export it, so make it a static function
instead.

In addition we can save a few more lines of code by making it so that it
accepts the DMA address as a physical address instead of a virtual one.  This
is the last piece needed to move essentially all of the DMA address handling
in swiotlb over to physical addresses.

In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_bounce I am renaming phys to orig_addr, and dma_addr to
tlb_addr.  This way it should be clear that orig_addr comes from the
io_tlb_orig_addr array and tlb_addr is an address within the io_tlb buffer.

Signed-off-by: Alexander Duyck 
---

 include/linux/swiotlb.h |3 ---
 lib/swiotlb.c   |   35 ---
 2 files changed, 16 insertions(+), 22 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e0ac98f..071d62c 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -53,9 +53,6 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev,
enum dma_sync_target target);
 
 /* Accessory functions. */
-extern void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
-  enum dma_data_direction dir);
-
 extern void
 *swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_addr_t *dma_handle, gfp_t flags);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index a81138f..fc31bdf 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -355,14 +355,15 @@ static int is_swiotlb_buffer(phys_addr_t paddr)
 /*
  * Bounce: copy the swiotlb buffer back to the original dma location
  */
-void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
-   enum dma_data_direction dir)
+static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
+  size_t size, enum dma_data_direction dir)
 {
-   unsigned long pfn = PFN_DOWN(phys);
+   unsigned long pfn = PFN_DOWN(orig_addr);
+   unsigned char *vaddr = phys_to_virt(tlb_addr);
 
if (PageHighMem(pfn_to_page(pfn))) {
/* The buffer does not have a mapping.  Map it in and copy */
-   unsigned int offset = phys & ~PAGE_MASK;
+   unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
unsigned int sz = 0;
unsigned long flags;
@@ -373,25 +374,23 @@ void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
local_irq_save(flags);
buffer = kmap_atomic(pfn_to_page(pfn));
if (dir == DMA_TO_DEVICE)
-   memcpy(dma_addr, buffer + offset, sz);
+   memcpy(vaddr, buffer + offset, sz);
else
-   memcpy(buffer + offset, dma_addr, sz);
+   memcpy(buffer + offset, vaddr, sz);
kunmap_atomic(buffer);
local_irq_restore(flags);
 
size -= sz;
pfn++;
-   dma_addr += sz;
+   vaddr += sz;
offset = 0;
}
+   } else if (dir == DMA_TO_DEVICE) {
+   memcpy(vaddr, phys_to_virt(orig_addr), size);
} else {
-   if (dir == DMA_TO_DEVICE)
-   memcpy(dma_addr, phys_to_virt(phys), size);
-   else
-   memcpy(phys_to_virt(phys), dma_addr, size);
+   memcpy(phys_to_virt(orig_addr), vaddr, size);
}
 }
-EXPORT_SYMBOL_GPL(swiotlb_bounce);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
   dma_addr_t tbl_dma_addr,
@@ -493,8 +492,7 @@ found:
for (i = 0; i < nslots; i++)
io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
-   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr), size,
-  DMA_TO_DEVICE);
+   swiotlb_bounce(orig_addr, tlb_addr, size, DMA_TO_DEVICE);
 
return tlb_addr;
 }
@@ -526,9 +524,8 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
/*
 * First, sync the memory before unmapping the entry
 */
-   if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-   swiotlb_bounce(orig_addr, phys_to_virt(tlb_addr),
-  size, DMA_FROM_DEVICE);
+   if (orig_addr && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
+   swiotlb_bounce(orig_addr, tlb_addr, size, DMA_FROM_DEVICE);
 
/*
 * Return the buffer to the free list b

[PATCH v2 0/7] Improve swiotlb performance by using physical addresses

2012-10-11 Thread Alexander Duyck
While working on 10Gb/s routing performance I found a significant amount of
time was being spent in the swiotlb DMA handler. Further digging showed that
much of this was due to virtual to physical address translation and the cost
of calling the function that performs it, which together accounted for nearly
60% of the total swiotlb overhead.

This patch set works to resolve that by replacing the io_tlb_start and
io_tlb_end virtual addresses with physical addresses. In addition it changes
the io_tlb_overflow_buffer from a virtual to a physical address. I followed
through with the cleanup to the point that the only functions that really
require the virtual address for the DMA buffer are the init, free, and
bounce functions.

In the case of devices that are using the bounce buffers these patches should
result in only a slight performance gain if any. This is due to the locking
overhead required to map and unmap the buffers.

In the case of devices that are not making use of bounce buffers these patches
can significantly reduce their overhead. In the case of an ixgbe routing test
for example, these changes result in 7 fewer calls to __phys_addr and
allow is_swiotlb_buffer to become inlined due to a reduction in the number of
instructions. When running a routing throughput test using small packets I
saw roughly a 6% increase in packet rates after applying these patches. This
appears to match up with the CPU overhead reduction I was tracking via perf.
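
As a rough sketch of where the savings come from, assuming the io_tlb_start
and io_tlb_end globals from lib/swiotlb.c are stored as physical addresses,
the buffer check reduces to something like the following (an approximation of
the post-series code, not a verbatim excerpt):

static phys_addr_t io_tlb_start, io_tlb_end;   /* set up at swiotlb init */

/* No virt_to_phys call is left in the hot path, so this collapses to
 * two compares and is small enough for the compiler to inline. */
static int is_swiotlb_buffer(phys_addr_t paddr)
{
        return paddr >= io_tlb_start && paddr < io_tlb_end;
}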

Before:
Results 10.0Mpps

After:
Results 10.6Mpps

Finally, I updated the parameter names for several of the core function calls
as there was some ambiguity in naming. Specifically virtual address pointers
were named dma_addr. When I changed these pointers to physical addresses I
instead used the name tlb_addr, as this value represents a physical address
within the io_tlb buffer and is less likely to be confused with a bus address.
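
To make the naming concrete, here is an illustrative sketch of how the three
kinds of addresses relate on the map_sg path after the series; the glue code
is hypothetical, though the functions named are the real ones:

/* caller's original buffer, as a physical address */
phys_addr_t orig_addr = sg_phys(sg);

/* slot inside the io_tlb bounce buffer, also a physical address */
phys_addr_t tlb_addr = swiotlb_tbl_map_single(hwdev, start_dma_addr,
                                              orig_addr, sg->length, dir);

/* bus address actually handed to the device */
dma_addr_t dev_addr = phys_to_dma(hwdev, tlb_addr);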

v2:
I reviewed the changes and realized that the first patch that was dropping
io_tlb_end and calculating the value didn't actually gain me much once I had
gone through and translated the rest of the addresses to physical addresses.
As such I have updated the patch so that it instead is converting io_tlb_end
from a virtual address to a physical address.  This actually helps to reduce
the overhead for is_swiotlb_buffer and swiotlb_dma_supported by several
instructions.

---

Alexander Duyck (7):
  swiotlb: Do not export swiotlb_bounce since there are no external consumers
  swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single
  swiotlb: Use physical addresses for swiotlb_tbl_unmap_single
  swiotlb: Return physical addresses when calling swiotlb_tbl_map_single
  swiotlb: Make io_tlb_overflow_buffer a physical address
  swiotlb: Make io_tlb_start a physical address instead of a virtual one
  swiotlb: Make io_tlb_end a physical address instead of a virtual one


 drivers/xen/swiotlb-xen.c |   25 ++--
 include/linux/swiotlb.h   |   20 ++-
 lib/swiotlb.c |  269 +++--
 3 files changed, 163 insertions(+), 151 deletions(-)



[PATCH v2 0/8] Improve performance of VM translation on x86_64

2012-10-11 Thread Alexander Duyck
This patch series is meant to address several issues I encountered with VM
translations on x86_64.  In my testing I found that swiotlb was incurring up
to a 5% processing overhead due to calls to __phys_addr.  To address that I
have updated swiotlb to use physical addresses instead of virtual addresses
to reduce the need to call __phys_addr.  However those patches didn't address
the other callers.  With these patches applied I am able to achieve an
additional 1% to 2% performance gain on top of the changes to swiotlb.

The first 2 patches are the performance optimizations that result in the 1% to
2% increase in overall performance.  The remaining patches are various
cleanups for a number of spots where __pa or virt_to_phys was being called
unnecessarily, or where __pa_symbol could have been used instead.
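
As a rough illustration of the rule being applied in the cleanups: linker-
visible symbols go through __pa_symbol, everything else keeps __pa.  The
example function below is hypothetical; _text is a real linker symbol:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>

extern char _text[];

static void pa_symbol_example(void)
{
        /* linker-visible kernel symbol: lives in the kernel text mapping */
        phys_addr_t text_phys = __pa_symbol(_text);
        /* dynamically allocated memory: direct map, so __pa stays */
        void *buf = kmalloc(64, GFP_KERNEL);

        if (buf) {
                phys_addr_t buf_phys = __pa(buf);

                pr_info("text 0x%llx buf 0x%llx\n",
                        (unsigned long long)text_phys,
                        (unsigned long long)buf_phys);
                kfree(buf);
        }
}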

---

Alexander Duyck (8):
  x86/lguest: Use __pa_symbol instead of __pa on C visible symbols
  x86/acpi: Use __pa_symbol instead of __pa on C visible symbols
  x86/xen: Use __pa_symbol instead of __pa on C visible symbols
  x86/ftrace: Use __pa_symbol instead of __pa on C visible symbols
  x86: Use __pa_symbol instead of __pa on C visible symbols
  x86: Drop 4 unnecessary calls to __pa_symbol
  x86: Make it so that __pa_symbol can only process kernel symbols on x86_64
  x86: Improve __phys_addr performance by making use of carry flags and inlining


 arch/x86/include/asm/page.h  |3 ++-
 arch/x86/include/asm/page_32.h   |1 +
 arch/x86/include/asm/page_64_types.h |   20 +--
 arch/x86/kernel/acpi/sleep.c |2 +-
 arch/x86/kernel/cpu/intel.c  |2 +-
 arch/x86/kernel/ftrace.c |4 ++--
 arch/x86/kernel/head32.c |4 ++--
 arch/x86/kernel/head64.c |4 ++--
 arch/x86/kernel/setup.c  |   16 
 arch/x86/kernel/x8664_ksyms_64.c |3 +++
 arch/x86/lguest/boot.c   |3 ++-
 arch/x86/mm/pageattr.c   |8 
 arch/x86/mm/physaddr.c   |   35 --
 arch/x86/platform/efi/efi.c  |4 ++--
 arch/x86/realmode/init.c |8 
 arch/x86/xen/mmu.c   |   19 ++
 16 files changed, 91 insertions(+), 45 deletions(-)



[PATCH v2 2/8] x86: Make it so that __pa_symbol can only process kernel symbols on x86_64

2012-10-11 Thread Alexander Duyck
I submitted an earlier patch that makes __phys_addr an inline.  This obviously
results in an increase in the code size.  One step I can take to reduce that
is to make it so that the __pa_symbol call does a direct translation for
kernel addresses instead of covering all of virtual memory.

On my system this reduced the size for __pa_symbol from 5 instructions
totalling 30 bytes to 3 instructions totalling 16 bytes.

Signed-off-by: Alexander Duyck 
---

 arch/x86/include/asm/page.h  |3 ++-
 arch/x86/include/asm/page_32.h   |1 +
 arch/x86/include/asm/page_64_types.h |3 +++
 arch/x86/mm/physaddr.c   |   15 +++
 4 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..3698a6a 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -44,7 +44,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
  * case properly. Once all supported versions of gcc understand it, we can
  * remove this Voodoo magic stuff. (i.e. once gcc3.x is deprecated)
  */
-#define __pa_symbol(x) __pa(__phys_reloc_hide((unsigned long)(x)))
+#define __pa_symbol(x) \
+   __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
 
 #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
 
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index da4e762..4d550d0 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -15,6 +15,7 @@ extern unsigned long __phys_addr(unsigned long);
 #else
 #define __phys_addr(x) __phys_addr_nodebug(x)
 #endif
+#define __phys_addr_symbol(x)  __phys_addr(x)
 #define __phys_reloc_hide(x)   RELOC_HIDE((x), 0)
 
 #ifdef CONFIG_FLATMEM
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 1ca93d3..a130589 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -69,8 +69,11 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
 }
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern unsigned long __phys_addr(unsigned long);
+extern unsigned long __phys_addr_symbol(unsigned long);
 #else
 #define __phys_addr(x) __phys_addr_nodebug(x)
+#define __phys_addr_symbol(x) \
+   ((unsigned long)(x) - __START_KERNEL_map + phys_base)
 #endif
 #define __phys_reloc_hide(x)   (x)
 
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index f63bec5..666edbd 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -29,6 +29,21 @@ unsigned long __phys_addr(unsigned long x)
return x;
 }
 EXPORT_SYMBOL(__phys_addr);
+
+unsigned long __phys_addr_symbol(unsigned long x)
+{
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   VIRTUAL_BUG_ON(x < y);
+
+   x = y + phys_base;
+
+   VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
+
+   return x;
+}
+EXPORT_SYMBOL(__phys_addr_symbol);
 #endif
 
 bool __virt_addr_valid(unsigned long x)



[PATCH v2 4/8] x86: Use __pa_symbol instead of __pa on C visible symbols

2012-10-11 Thread Alexander Duyck
When I made an attempt at separating __pa_symbol and __pa I found that there
were a number of cases where __pa was used on an obvious symbol.

I also caught one non-obvious case as _brk_start and _brk_end are based on the
address of __brk_base which is a C visible symbol.

Signed-off-by: Alexander Duyck 
---

 arch/x86/kernel/cpu/intel.c |2 +-
 arch/x86/kernel/setup.c |   16 
 arch/x86/mm/pageattr.c  |8 
 arch/x86/platform/efi/efi.c |4 ++--
 arch/x86/realmode/init.c|8 
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 198e019..2249e7e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -168,7 +168,7 @@ int __cpuinit ppro_with_ram_bug(void)
 #ifdef CONFIG_X86_F00F_BUG
 static void __cpuinit trap_init_f00f_bug(void)
 {
-   __set_fixmap(FIX_F00F_IDT, __pa(&idt_table), PAGE_KERNEL_RO);
+   __set_fixmap(FIX_F00F_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
 
/*
 * Update the IDT descriptor and reload the IDT so that
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d609be0..391f5f4 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -299,8 +299,8 @@ static void __init cleanup_highmap(void)
 static void __init reserve_brk(void)
 {
if (_brk_end > _brk_start)
-   memblock_reserve(__pa(_brk_start),
-__pa(_brk_end) - __pa(_brk_start));
+   memblock_reserve(__pa_symbol(_brk_start),
+_brk_end - _brk_start);
 
/* Mark brk area as locked down and no longer taking any
   new allocations */
@@ -760,12 +760,12 @@ void __init setup_arch(char **cmdline_p)
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = _brk_end;
 
-   code_resource.start = virt_to_phys(_text);
-   code_resource.end = virt_to_phys(_etext)-1;
-   data_resource.start = virt_to_phys(_etext);
-   data_resource.end = virt_to_phys(_edata)-1;
-   bss_resource.start = virt_to_phys(&__bss_start);
-   bss_resource.end = virt_to_phys(&__bss_stop)-1;
+   code_resource.start = __pa_symbol(_text);
+   code_resource.end = __pa_symbol(_etext)-1;
+   data_resource.start = __pa_symbol(_etext);
+   data_resource.end = __pa_symbol(_edata)-1;
+   bss_resource.start = __pa_symbol(__bss_start);
+   bss_resource.end = __pa_symbol(__bss_stop)-1;
 
 #ifdef CONFIG_CMDLINE_BOOL
 #ifdef CONFIG_CMDLINE_OVERRIDE
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a718e0d..40f92f3 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -94,12 +94,12 @@ static inline void split_page_count(int level) { }
 
 static inline unsigned long highmap_start_pfn(void)
 {
-   return __pa(_text) >> PAGE_SHIFT;
+   return __pa_symbol(_text) >> PAGE_SHIFT;
 }
 
 static inline unsigned long highmap_end_pfn(void)
 {
-   return __pa(roundup(_brk_end, PMD_SIZE)) >> PAGE_SHIFT;
+   return __pa_symbol(roundup(_brk_end, PMD_SIZE)) >> PAGE_SHIFT;
 }
 
 #endif
@@ -276,8 +276,8 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long address,
 * The .rodata section needs to be read-only. Using the pfn
 * catches all aliases.
 */
-   if (within(pfn, __pa((unsigned long)__start_rodata) >> PAGE_SHIFT,
-  __pa((unsigned long)__end_rodata) >> PAGE_SHIFT))
+   if (within(pfn, __pa_symbol(__start_rodata) >> PAGE_SHIFT,
+  __pa_symbol(__end_rodata) >> PAGE_SHIFT))
pgprot_val(forbidden) |= _PAGE_RW;
 
 #if defined(CONFIG_X86_64) && defined(CONFIG_DEBUG_RODATA)
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index aded2a9..e8d0320 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -406,8 +406,8 @@ void __init efi_reserve_boot_services(void)
 * - Not within any part of the kernel
 * - Not the bios reserved area
*/
-   if ((start+size >= virt_to_phys(_text)
-   && start <= virt_to_phys(_end)) ||
+   if ((start+size >= __pa_symbol(_text)
+   && start <= __pa_symbol(_end)) ||
!e820_all_mapped(start, start+size, E820_RAM) ||
memblock_is_region_reserved(start, size)) {
/* Could not reserve, skip it */
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index cbca565..8045026 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -62,9 +62,9 @@ void __init setup_real_mode(void)
__va(real_mode_header->trampoline_header);
 
 #ifdef CONFIG_X86_32
-   trampoline_header->start = __pa(startup_32_smp);

[PATCH v2 5/8] x86/ftrace: Use __pa_symbol instead of __pa on C visible symbols

2012-10-11 Thread Alexander Duyck
Instead of using __pa, which is meant to be a general function for converting
virtual addresses to physical addresses, we can use __pa_symbol, which is the
preferred way of decoding kernel text virtual addresses to physical addresses.

In this case we are not directly converting C visible symbols; however, if we
know that the instruction pointer is somewhere between _text and _etext we
know that we are going to be translating an address from the kernel text
space.

Cc: Steven Rostedt 
Cc: Frederic Weisbecker 
Signed-off-by: Alexander Duyck 
---

 arch/x86/kernel/ftrace.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 1d41402..42a392a 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -89,7 +89,7 @@ do_ftrace_mod_code(unsigned long ip, const void *new_code)
 * kernel identity mapping to modify code.
 */
if (within(ip, (unsigned long)_text, (unsigned long)_etext))
-   ip = (unsigned long)__va(__pa(ip));
+   ip = (unsigned long)__va(__pa_symbol(ip));
 
return probe_kernel_write((void *)ip, new_code, MCOUNT_INSN_SIZE);
 }
@@ -279,7 +279,7 @@ static int ftrace_write(unsigned long ip, const char *val, int size)
 * kernel identity mapping to modify code.
 */
if (within(ip, (unsigned long)_text, (unsigned long)_etext))
-   ip = (unsigned long)__va(__pa(ip));
+   ip = (unsigned long)__va(__pa_symbol(ip));
 
return probe_kernel_write((void *)ip, val, size);
 }



[PATCH v2 1/8] x86: Improve __phys_addr performance by making use of carry flags and inlining

2012-10-11 Thread Alexander Duyck
This patch is meant to improve overall system performance when making use of
the __phys_addr call.  To do this I have implemented several changes.

First, if CONFIG_DEBUG_VIRTUAL is not defined, __phys_addr is made an inline
function, similar to how this is currently handled on 32-bit.  However, in
order to do this it is required to export phys_base so that it is available
if __phys_addr is used in kernel modules.

The second change was to streamline the code by making use of the carry flag
on an add operation instead of performing a compare on a 64 bit value.  The
advantage to this is that it allows us to significantly reduce the overall
size of the call.  On my Xeon E5 system the entire __phys_addr inline call
consumes a little less than 32 bytes and 5 instructions.  I also applied
similar logic to the debug version of the function.  My testing shows that the
debug version of the function with this patch applied is slightly faster than
the non-debug version without the patch.
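
The carry-flag trick is easier to see in isolation.  Below is a small
standalone userspace illustration, not kernel code: after y = x - BASE,
unsigned wraparound guarantees that x > y holds exactly when x >= BASE, and
the compiler can evaluate that test from the borrow flag of the subtraction
instead of doing a separate compare against a 64 bit immediate:

#include <stdint.h>
#include <stdio.h>

/* stand-in for __START_KERNEL_map */
#define BASE 0xffffffff80000000ULL

int main(void)
{
        uint64_t addrs[] = { 0xffffffff81000000ULL,     /* above BASE */
                             0xffff880000000000ULL };   /* below BASE */
        int i;

        for (i = 0; i < 2; i++) {
                uint64_t x = addrs[i];
                uint64_t y = x - BASE;  /* wraps around when x < BASE */

                /* x > y  <=>  the subtract did not borrow  <=>  x >= BASE */
                printf("%#llx is %s BASE\n", (unsigned long long)x,
                       x > y ? "at or above" : "below");
        }
        return 0;
}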

Finally, when building the kernel with the first two changes applied I saw
build warnings about __START_KERNEL_map and PAGE_OFFSET constants not fitting
in their type.  In order to resolve the build warning I changed their type
from UL to ULL.

Signed-off-by: Alexander Duyck 
---

 arch/x86/include/asm/page_64_types.h |   17 +++--
 arch/x86/kernel/x8664_ksyms_64.c |3 +++
 arch/x86/mm/physaddr.c   |   20 ++--
 3 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 320f7bb..1ca93d3 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -30,14 +30,14 @@
  * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
  * what Xen requires.
  */
-#define __PAGE_OFFSET   _AC(0x8800, UL)
+#define __PAGE_OFFSET   _AC(0x8800, ULL)
 
 #define __PHYSICAL_START   ((CONFIG_PHYSICAL_START +   \
  (CONFIG_PHYSICAL_ALIGN - 1)) &\
 ~(CONFIG_PHYSICAL_ALIGN - 1))
 
 #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
-#define __START_KERNEL_map _AC(0x8000, UL)
+#define __START_KERNEL_map _AC(0x8000, ULL)
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
 #define __PHYSICAL_MASK_SHIFT  46
@@ -58,7 +58,20 @@ void copy_page(void *to, void *from);
 extern unsigned long max_pfn;
 extern unsigned long phys_base;
 
+static inline unsigned long __phys_addr_nodebug(unsigned long x)
+{
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
+
+   return x;
+}
+#ifdef CONFIG_DEBUG_VIRTUAL
 extern unsigned long __phys_addr(unsigned long);
+#else
+#define __phys_addr(x) __phys_addr_nodebug(x)
+#endif
 #define __phys_reloc_hide(x)   (x)
 
 #define vmemmap ((struct page *)VMEMMAP_START)
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index 1330dd1..b014d94 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -59,6 +59,9 @@ EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(memmove);
 
+#ifndef CONFIG_DEBUG_VIRTUAL
+EXPORT_SYMBOL(phys_base);
+#endif
 EXPORT_SYMBOL(empty_zero_page);
 #ifndef CONFIG_PARAVIRT
 EXPORT_SYMBOL(native_load_gs_index);
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index d2e2735..f63bec5 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -8,20 +8,28 @@
 
 #ifdef CONFIG_X86_64
 
+#ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
-   if (x >= __START_KERNEL_map) {
-   x -= __START_KERNEL_map;
-   VIRTUAL_BUG_ON(x >= KERNEL_IMAGE_SIZE);
-   x += phys_base;
+   unsigned long y = x - __START_KERNEL_map;
+
+   /* use the carry flag to determine if x was < __START_KERNEL_map */
+   if (unlikely(x > y)) {
+   x = y + phys_base;
+
+   VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
} else {
-   VIRTUAL_BUG_ON(x < PAGE_OFFSET);
-   x -= PAGE_OFFSET;
+   x = y + (__START_KERNEL_map - PAGE_OFFSET);
+
+   /* carry flag will be set if starting x was >= PAGE_OFFSET */
+   VIRTUAL_BUG_ON(x > y);
VIRTUAL_BUG_ON(!phys_addr_valid(x));
}
+
return x;
 }
 EXPORT_SYMBOL(__phys_addr);
+#endif
 
 bool __virt_addr_valid(unsigned long x)
 {



[PATCH v2 7/8] x86/acpi: Use __pa_symbol instead of __pa on C visible symbols

2012-10-11 Thread Alexander Duyck
This change just updates one spot where __pa was being used when __pa_symbol
should have been used.  By using __pa_symbol we are able to drop a few extra
lines of code as we don't have to test to see if the virtual pointer is a
part of the kernel text or just standard virtual memory.

Cc: Len Brown 
Cc: Pavel Machek 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Alexander Duyck 
---

 arch/x86/kernel/acpi/sleep.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index 11676cf..f146a3c 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -69,7 +69,7 @@ int acpi_suspend_lowlevel(void)
 
 #ifndef CONFIG_64BIT
header->pmode_entry = (u32)&wakeup_pmode_return;
-   header->pmode_cr3 = (u32)__pa(&initial_page_table);
+   header->pmode_cr3 = (u32)__pa_symbol(initial_page_table);
saved_magic = 0x12345678;
 #else /* CONFIG_64BIT */
 #ifdef CONFIG_SMP



[PATCH v2 6/8] x86/xen: Use __pa_symbol instead of __pa on C visible symbols

2012-10-11 Thread Alexander Duyck
This change updates a few of the functions to use __pa_symbol when
translating C visible symbols instead of __pa.  By using __pa_symbol we are
able to drop a few extra lines of code as we don't have to test to see if the
virtual pointer is a part of the kernel text or just standard virtual memory.

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---

 arch/x86/xen/mmu.c |   19 ++-
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index fd28d86..c50a87e 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1449,7 +1449,8 @@ static int xen_pgd_alloc(struct mm_struct *mm)
 
if (user_pgd != NULL) {
user_pgd[pgd_index(VSYSCALL_START)] =
-   __pgd(__pa(level3_user_vsyscall) | _PAGE_TABLE);
+   __pgd(__pa_symbol(level3_user_vsyscall) |
+ _PAGE_TABLE);
ret = 0;
}
 
@@ -1908,7 +1909,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 * pgd.
 */
xen_mc_batch();
-   __xen_write_cr3(true, __pa(init_level4_pgt));
+   __xen_write_cr3(true, __pa_symbol(init_level4_pgt));
xen_mc_issue(PARAVIRT_LAZY_CPU);
 
/* We can't that easily rip out L3 and L2, as the Xen pagetables are
@@ -1931,10 +1932,10 @@ static RESERVE_BRK_ARRAY(pmd_t, swapper_kernel_pmd, PTRS_PER_PMD);
 
 static void __init xen_write_cr3_init(unsigned long cr3)
 {
-   unsigned long pfn = PFN_DOWN(__pa(swapper_pg_dir));
+   unsigned long pfn = PFN_DOWN(__pa_symbol(swapper_pg_dir));
 
-   BUG_ON(read_cr3() != __pa(initial_page_table));
-   BUG_ON(cr3 != __pa(swapper_pg_dir));
+   BUG_ON(read_cr3() != __pa_symbol(initial_page_table));
+   BUG_ON(cr3 != __pa_symbol(swapper_pg_dir));
 
/*
 * We are switching to swapper_pg_dir for the first time (from
@@ -1958,7 +1959,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, pfn);
 
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE,
- PFN_DOWN(__pa(initial_page_table)));
+ PFN_DOWN(__pa_symbol(initial_page_table)));
set_page_prot(initial_page_table, PAGE_KERNEL);
set_page_prot(initial_kernel_pmd, PAGE_KERNEL);
 
@@ -1983,7 +1984,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
copy_page(initial_page_table, pgd);
initial_page_table[KERNEL_PGD_BOUNDARY] =
-   __pgd(__pa(initial_kernel_pmd) | _PAGE_PRESENT);
+   __pgd(__pa_symbol(initial_kernel_pmd) | _PAGE_PRESENT);
 
set_page_prot(initial_kernel_pmd, PAGE_KERNEL_RO);
set_page_prot(initial_page_table, PAGE_KERNEL_RO);
@@ -1992,8 +1993,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
 
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE,
- PFN_DOWN(__pa(initial_page_table)));
-   xen_write_cr3(__pa(initial_page_table));
+ PFN_DOWN(__pa_symbol(initial_page_table)));
+   xen_write_cr3(__pa_symbol(initial_page_table));
 
memblock_reserve(__pa(xen_start_info->pt_base),
 xen_start_info->nr_pt_frames * PAGE_SIZE);



[PATCH v2 8/8] x86/lguest: Use __pa_symbol instead of __pa on C visible symbols

2012-10-11 Thread Alexander Duyck
The function lguest_write_cr3 is using __pa to convert swapper_pg_dir and
initial_page_table from virtual addresses to physical.  The correct function
to use for these values is __pa_symbol since they are C visible symbols.

Cc: Rusty Russell 
Signed-off-by: Alexander Duyck 
---

 arch/x86/lguest/boot.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index 642d880..139dd35 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -552,7 +552,8 @@ static void lguest_write_cr3(unsigned long cr3)
current_cr3 = cr3;
 
/* These two page tables are simple, linear, and used during boot */
-   if (cr3 != __pa(swapper_pg_dir) && cr3 != __pa(initial_page_table))
+   if (cr3 != __pa_symbol(swapper_pg_dir) &&
+   cr3 != __pa_symbol(initial_page_table))
cr3_changed = true;
 }
 


