Re: [Mesa3d-dev] DRI SDK and modularized drivers.
> It may seem e.g. like the DRM interface is the worst because of rather
> large threads caused by certain kernel developer's problems, but that
> doesn't mean problems wouldn't be created by splitting other areas.

This would probably be best solved by merging libdrm into the Linux
kernel tree. Ingo Molnar's rationale for having tools/perf in the kernel
tree applies even more in this case.
Re: [git pull] drm request 3
> So overall, I'd say that we spent about a month of developer time at
> least between jbarnes, ickle, and myself, on extending the execbuf
> interface to add a flag saying "dear kernel, please don't do this bit
> of work on this buffer, because I don't need it and it makes things
> slow."

Perhaps then, we should break ABI compatibility _more_ often to speed up
development, but also have awesome mechanisms to make it painless for
the user, such as:

1. Automatic side-by-side userspace installation, as Linus proposed
2. Kernel "make install" refusing to proceed if it finds that userspace
   is not updated, and giving instructions on how to update userspace
3. Distributions packaging the new-ABI X/Mesa drivers and libdrm even
   for stable distributions
4. Kernel "make install" offering to automatically install said
   distribution packages if it detects a supported distribution
5. The ability to drop a new version of drivers/gpu/drm into an older
   kernel tree and have it compile (within reasonable limits)

In particular, for people with (slightly) old kernels, it should be much
easier to make updated DRM trees still work with older kernels than to
make updated userspace work with older kernel modules. This also
actually gives them the benefits of the new code. And for people with
really old kernels, it's no different from any other hardware device,
which requires a kernel upgrade to get better support.

Then, for instance, Linus would just have seen the following upon
running make install:

  This kernel requires Nouveau userspace version 0.0.16, which you
  don't have installed.
  Fedora 12 has been detected.
  Invoke yum to install the RPMs required for it? [y/n]

_or_

  Ubuntu 9.10 has been detected.
  Invoke apt-get to install the packages required for it? [y/n]

If the user says no, or the distribution is unknown, instructions on how
to download and compile the source would be presented.

Once you set up this system, you can freely break the ABI with no
significant user discomfort just by raising the version number.

This also potentially applies to things other than DRM (e.g. perf, kvm,
iptables, udev, filesystem-specific tools/APIs, various device
configuration systems, and so on).

The really stable APIs/ABIs should not be the (undocumented) kernel
ones, but Xlib and OpenGL, which actually have formal specifications.
Perhaps eventually Gallium could join them as a stable API closer to the
hardware, but that's a long way off.
Re: [git pull] drm request 3
> I think you need to be clearer about that. Your distribution packaging
> may not support that out of the box. There are a variety of ways to do
> almost all of this, including having entire parallel X installs for
> development work.

Sure, but each user must manually find out how to set that up, and
create and test the setup himself (or happen to use a distribution that
somehow eases that, if any exist). I think that Linus has a good point
in saying that this should be provided by default. And ideally not just
by the distribution, but upstream, so that people wanting to test
bleeding-edge DRM drivers (not necessarily just nouveau) can do so more
easily, and be able to go back to their working setup by rebooting
should something go wrong.

> The fundamental problem you can't solve by versioning though is the
> exact one here. A new kernel that requires version X of a driver won't
> help if the newest version you have is X - 1.

Yes, but the fact that you can't have both X - 1 and X installed at the
same time makes it worse, since for instance bisecting won't just work.

> Yeah perhaps Fedora should have pushed an update that was smart enough
> to handle the Nouveau old/new ABI before the patch went upstream.
> Hindsight is an exact science.

Well, yes, but it can still be implemented in time for the next
distribution releases, and the lesson can be learned for similar future
situations.
Re: [git pull] drm request 3
It seems to me that Linus' technical argument is indeed being mostly
ignored.

While breaking the ABI is unfortunate, the real problem Linus complained
about is that you can't install several userspace versions side by side.

This means that if you install your new kernel and userspace, reboot,
and find the new kernel doesn't work for some reason, you can't just go
back to the old kernel and have working X, because you just replaced
userspace with a version that no longer works with the kernel that
worked correctly.

This is even worse for distributions that want to upgrade the kernel,
because each kernel package would need a Depends on the exact userspace
package version. The package manager would then remove the older kernel
when the new one is installed (since they depend on different versions
of the same userspace package). If the new kernel somehow doesn't work,
the user is totally screwed and must reboot from a live CD.

As pointed out, in this case it is relatively easy to solve by adding a
version number to libdrm-nouveau, the X driver and the DRI drivers. X
will then have to load the correct driver and give Mesa the DRI driver
name with the correct version appended.

It may be a good idea to do this before the new nouveau ABI ships in
released distributions, and with a generic mechanism that can be used by
all X/drm drivers.

Workarounds are possible, such as replacing /usr/bin/X with a script
that loads the correct version, or using X over /dev/fb0 (which should
work fine with any DRM KMS driver, and any non-DRI framebuffer), but
they shouldn't be needed.

BTW, the nVidia proprietary driver also ties the kernel and userspace
versions in this way, and is actually worse because it replaces the X
libglx.so, breaking all other drivers.
Re: [RFC] drm/ttm: add pool wc/uc page allocator
> ^^^ Luca, I've never seen this show up high on a profile (yet). Do you
> see that with Nouveau? I used to have an rb-tree implementation of
> drm_mm_xxx lying around, but I didn't use it because I didn't have a
> case where it showed up.

Yes: before I did userspace allocation, in doom3 profiles,
get_unmapped_area and the functions called by it (all coming from
sys_mmap) consumed around 10% of the time.

Note that this is not DRM-specific code, but rather the generic Linux
code for finding free virtual address space, which walks over all the
vmas in the process and, in the default x86 variant, does an rb-tree
lookup for each vma!

It could be fixed either in the kernel, with better data structures and
algorithms, or in userspace, by using MAP_FIXED. However, this will
still leave significant kernel overhead (e.g. just
drm_mm_search_free_in_range takes 1%).
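To illustrate the MAP_FIXED idea: a minimal sketch in which userspace
reserves a large PROT_NONE region once and then places buffer mappings
at addresses it picks itself, so the kernel never runs its free-space
search. The arena names and the bump allocator are illustrative, not
actual libdrm code:

#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve a chunk of virtual address space once, then place GEM
 * mappings at fixed offsets inside it.  With MAP_FIXED the kernel
 * skips the get_unmapped_area() search entirely. */

static uint8_t *reservation;      /* base of our private VA arena */
static size_t   reservation_used; /* simple bump allocator offset */
#define RESERVATION_SIZE ((size_t)1 << 30) /* 1 GiB of address space */

int va_arena_init(void)
{
    /* PROT_NONE: no backing pages, just claims the address range. */
    reservation = mmap(NULL, RESERVATION_SIZE, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return reservation == MAP_FAILED ? -1 : 0;
}

/* Map a buffer object at a fixed address we picked ourselves. */
void *va_arena_map_bo(int drm_fd, uint64_t mmap_offset, size_t size)
{
    void *addr = reservation + reservation_used;
    reservation_used += (size + 4095) & ~(size_t)4095;
    /* MAP_FIXED atomically replaces our PROT_NONE placeholder at this
     * address, so no free-space search happens in the kernel. */
    return mmap(addr, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, drm_fd, mmap_offset);
}

The caveat of MAP_FIXED is that it replaces whatever mapping is in the
range, so userspace must only ever point it at address space it owns,
which the PROT_NONE reservation guarantees here.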
Re: [RFC] drm/ttm: add pool wc/uc page allocator
While this is almost surely a good idea, note that userspace caching and
suballocation substantially improve Mesa performance even on PCIe
systems.

This is mostly due to the unavoidable overhead of kernel calls and
pagetable modifications, the avoidable linear search the kernel
currently does to find an empty range of virtual address space, and the
additional pagefaults.

Userspace caching and suballocation mean that allocating is just
computing a pointer, which no kernel-space solution can beat. This is
also how glibc allocates normal memory with malloc(), for the same
reason.
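As an illustration of why the userspace fast path wins: a minimal bump
suballocator sketch, assuming one kernel-allocated, CPU-mapped slab.
The names are illustrative, not the actual Mesa/libdrm buffer manager:

#include <stddef.h>
#include <stdint.h>

/* One large buffer obtained from the kernel (e.g. a mapped GEM/TTM
 * object); userspace carves it up without further kernel calls. */
struct suballocator {
    uint8_t *base;   /* CPU mapping of the kernel-allocated slab */
    size_t   size;   /* total size of the slab */
    size_t   offset; /* bump pointer */
};

/* The common path is a pointer computation: no ioctl, no pagetable
 * update, no pagefault.  align must be a power of two. */
static void *suballoc(struct suballocator *s, size_t n, size_t align)
{
    size_t off = (s->offset + align - 1) & ~(align - 1);
    if (off + n > s->size)
        return NULL; /* slab full: fall back to a real kernel allocation */
    s->offset = off + n;
    return s->base + off;
}

The fast path costs a handful of arithmetic instructions, which is the
point of the comparison with malloc().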
Re: [PATCH 1/2] drm: introduce drm_gem_object_[handle_]unreference_unlocked
> With the exception of adding a new unused API in the form of the
> gem_free_object_unlocked driver hook, I like this.

Nouveau and Radeon should be able to use it (by setting it to the same
function used for gem_free_object) with little or no modification, since
they rely on TTM locking. I didn't do it for Radeon because I'm not
familiar with the code.

I'll hopefully send a Nouveau patch as part of a patchset removing all
non-init/KMS BKL/struct_mutex usage in it, once that is done.
[PATCH 1/2] drm: introduce drm_gem_object_[handle_]unreference_unlocked
This patch introduces the drm_gem_object_unreference_unlocked and
drm_gem_object_handle_unreference_unlocked functions, which do not
require holding struct_mutex.

drm_gem_object_unreference_unlocked calls the new
->gem_free_object_unlocked entry point if available, and otherwise just
takes struct_mutex and calls ->gem_free_object.

Signed-off-by: Luca Barbieri
---
 drivers/gpu/drm/drm_gem.c | 49
 include/drm/drmP.h        | 28 +++--
 2 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 8bf3770..4018b3b 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -411,8 +411,19 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
         mutex_unlock(&dev->struct_mutex);
 }
 
+static void
+drm_gem_object_free_common(struct drm_gem_object *obj)
+{
+        struct drm_device *dev = obj->dev;
+        fput(obj->filp);
+        atomic_dec(&dev->object_count);
+        atomic_sub(obj->size, &dev->object_memory);
+        kfree(obj);
+}
+
 /**
  * Called after the last reference to the object has been lost.
+ * Must be called holding struct_mutex
  *
  * Frees the object
  */
@@ -427,14 +438,40 @@ drm_gem_object_free(struct kref *kref)
         if (dev->driver->gem_free_object != NULL)
                 dev->driver->gem_free_object(obj);
 
-        fput(obj->filp);
-        atomic_dec(&dev->object_count);
-        atomic_sub(obj->size, &dev->object_memory);
-        kfree(obj);
+        drm_gem_object_free_common(obj);
 }
 EXPORT_SYMBOL(drm_gem_object_free);
 
 /**
+ * Called after the last reference to the object has been lost.
+ * Must be called without holding struct_mutex
+ *
+ * Frees the object
+ */
+void
+drm_gem_object_free_unlocked(struct kref *kref)
+{
+        struct drm_gem_object *obj = (struct drm_gem_object *) kref;
+        struct drm_device *dev = obj->dev;
+
+        if (dev->driver->gem_free_object_unlocked != NULL)
+                dev->driver->gem_free_object_unlocked(obj);
+        else if (dev->driver->gem_free_object != NULL) {
+                mutex_lock(&dev->struct_mutex);
+                dev->driver->gem_free_object(obj);
+                mutex_unlock(&dev->struct_mutex);
+        }
+
+        drm_gem_object_free_common(obj);
+}
+EXPORT_SYMBOL(drm_gem_object_free_unlocked);
+
+static void drm_gem_object_ref_bug(struct kref *list_kref)
+{
+        BUG();
+}
+
+/**
  * Called after the last handle to the object has been closed
  *
  * Removes any name for the object. Note that this must be
@@ -458,8 +495,10 @@ drm_gem_object_handle_free(struct kref *kref)
                 /*
                  * The object name held a reference to this object, drop
                  * that now.
+                 *
+                 * This cannot be the last reference, since the handle holds one too.
                  */
-                drm_gem_object_unreference(obj);
+                kref_put(&obj->refcount, drm_gem_object_ref_bug);
         } else
                 spin_unlock(&dev->object_name_lock);
 
diff --git a/include/drm/drmP.h b/include/drm/drmP.h
index ffac157..4a3c4e4 100644
--- a/include/drm/drmP.h
+++ b/include/drm/drmP.h
@@ -801,6 +801,7 @@ struct drm_driver {
          */
         int (*gem_init_object) (struct drm_gem_object *obj);
         void (*gem_free_object) (struct drm_gem_object *obj);
+        void (*gem_free_object_unlocked) (struct drm_gem_object *obj);
 
         /* vga arb irq handler */
         void (*vgaarb_irq)(struct drm_device *dev, bool state);
@@ -1427,6 +1428,7 @@ extern void drm_sysfs_connector_remove(struct drm_connector *connector);
 int drm_gem_init(struct drm_device *dev);
 void drm_gem_destroy(struct drm_device *dev);
 void drm_gem_object_free(struct kref *kref);
+void drm_gem_object_free_unlocked(struct kref *kref);
 struct drm_gem_object *drm_gem_object_alloc(struct drm_device *dev,
                                             size_t size);
 void drm_gem_object_handle_free(struct kref *kref);
@@ -1443,10 +1445,15 @@ drm_gem_object_reference(struct drm_gem_object *obj)
 static inline void
 drm_gem_object_unreference(struct drm_gem_object *obj)
 {
-        if (obj == NULL)
-                return;
+        if (obj != NULL)
+                kref_put(&obj->refcount, drm_gem_object_free);
+}
 
-        kref_put(&obj->refcount, drm_gem_object_free);
+static inline void
+drm_gem_object_unreference_unlocked(struct drm_gem_object *obj)
+{
+        if (obj != NULL)
+                kref_put(&obj->refcount, drm_gem_object_free_unlocked);
 }
 
 int drm_gem_handle_create(struct drm_file *file_priv,
@@ -1475,6 +1482,21 @@ drm_gem_object_handle_unreference(struct drm_gem_object *obj)
         drm_gem_object_unreference(obj);
 }
 
+static inline void
+drm_gem_object_handle_unreference_unlocked(struct
[PATCH 2/2] Use drm_gem_object_[handle_]unreference_unlocked where possible
Mostly obvious simplifications.

The i915 pread/pwrite ioctls, intel_overlay_put_image and
nouveau_gem_new were incorrectly using the locked versions without
locking: this is also fixed in this patch.

Signed-off-by: Luca Barbieri
---
 drivers/gpu/drm/drm_gem.c                  | 21
 drivers/gpu/drm/i915/i915_gem.c            | 16 ---
 drivers/gpu/drm/i915/i915_gem_tiling.c     |  4 +--
 drivers/gpu/drm/i915/intel_display.c       | 13 +++---
 drivers/gpu/drm/i915/intel_overlay.c       |  2 +-
 drivers/gpu/drm/nouveau/nouveau_display.c  |  7 +
 drivers/gpu/drm/nouveau/nouveau_fbcon.c    |  4 +--
 drivers/gpu/drm/nouveau/nouveau_gem.c      | 26 +--
 drivers/gpu/drm/nouveau/nouveau_notifier.c |  9 ++
 drivers/gpu/drm/nouveau/nv04_crtc.c        |  4 +--
 drivers/gpu/drm/nouveau/nv50_crtc.c        |  4 +--
 drivers/gpu/drm/radeon/radeon_cs.c         |  7 +
 drivers/gpu/drm/radeon/radeon_cursor.c     |  8 +
 drivers/gpu/drm/radeon/radeon_display.c    |  7 +
 drivers/gpu/drm/radeon/radeon_gem.c        | 36 +++-
 15 files changed, 47 insertions(+), 121 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 4018b3b..aa89d4b 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -192,9 +192,7 @@ drm_gem_handle_delete(struct drm_file *filp, u32 handle)
         idr_remove(&filp->object_idr, handle);
         spin_unlock(&filp->table_lock);
 
-        mutex_lock(&dev->struct_mutex);
-        drm_gem_object_handle_unreference(obj);
-        mutex_unlock(&dev->struct_mutex);
+        drm_gem_object_handle_unreference_unlocked(obj);
 
         return 0;
 }
@@ -325,9 +323,7 @@ again:
         }
 
 err:
-        mutex_lock(&dev->struct_mutex);
-        drm_gem_object_unreference(obj);
-        mutex_unlock(&dev->struct_mutex);
+        drm_gem_object_unreference_unlocked(obj);
 
         return ret;
 }
@@ -358,9 +354,7 @@ drm_gem_open_ioctl(struct drm_device *dev, void *data,
                 return -ENOENT;
 
         ret = drm_gem_handle_create(file_priv, obj, &handle);
-        mutex_lock(&dev->struct_mutex);
-        drm_gem_object_unreference(obj);
-        mutex_unlock(&dev->struct_mutex);
+        drm_gem_object_unreference_unlocked(obj);
         if (ret)
                 return ret;
 
@@ -390,7 +384,7 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
 {
         struct drm_gem_object *obj = ptr;
 
-        drm_gem_object_handle_unreference(obj);
+        drm_gem_object_handle_unreference_unlocked(obj);
 
         return 0;
 }
@@ -403,12 +397,10 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
 void
 drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
 {
-        mutex_lock(&dev->struct_mutex);
         idr_for_each(&file_private->object_idr,
                      &drm_gem_object_release_handle, NULL);
 
         idr_destroy(&file_private->object_idr);
-        mutex_unlock(&dev->struct_mutex);
 }
 
 static void
@@ -516,11 +508,8 @@ EXPORT_SYMBOL(drm_gem_vm_open);
 void drm_gem_vm_close(struct vm_area_struct *vma)
 {
         struct drm_gem_object *obj = vma->vm_private_data;
-        struct drm_device *dev = obj->dev;
 
-        mutex_lock(&dev->struct_mutex);
-        drm_gem_object_unreference(obj);
-        mutex_unlock(&dev->struct_mutex);
+        drm_gem_object_unreference_unlocked(obj);
 }
 EXPORT_SYMBOL(drm_gem_vm_close);
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b4c8c02..6844ca4 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -128,9 +128,7 @@ i915_gem_create_ioctl(struct drm_device *dev, void *data,
                 return -ENOMEM;
 
         ret = drm_gem_handle_create(file_priv, obj, &handle);
-        mutex_lock(&dev->struct_mutex);
-        drm_gem_object_handle_unreference(obj);
-        mutex_unlock(&dev->struct_mutex);
+        drm_gem_object_handle_unreference_unlocked(obj);
 
         if (ret)
                 return ret;
@@ -488,7 +486,7 @@ i915_gem_pread_ioctl(struct drm_device *dev, void *data,
          */
         if (args->offset > obj->size || args->size > obj->size ||
             args->offset + args->size > obj->size) {
-                drm_gem_object_unreference(obj);
+                drm_gem_object_unreference_unlocked(obj);
                 return -EINVAL;
         }
 
@@ -501,7 +499,7 @@ i915_gem_pread_ioctl(struct drm_device *dev, void *data,
                         file_priv);
         }
 
-        drm_gem_object_unreference(obj);
+        drm_gem_object_unreference_unlocked(obj);
 
         return ret;
 }
@@ -961,7 +959,7 @@ i915_gem_pwrite_ioctl(struct drm_device *dev, void *data,
          */
Re: [PATCH] drm/ttm: Only try to preserve caching in moves in the same space
> Your patch removes the consistency of caching attributes and makes
> moves between non-fixed areas different from other moves, while the
> driver can already achieve this itself.

It is already different, because TTM performs such a move by changing
the page attributes, as opposed to copying data. Thus, it is useful to
preserve the existing caching there, since that means we won't have to
do any work at all for the move. If instead the move is actually
performed by copying (either memcpy or unaccelerated), we don't care
about the current caching.

The code should actually be:

  cur_flags = ttm_bo_select_caching(man,
          ttm_is_going_to_move_by_setting_page_caching_flags
          ? bo->mem.placement : 0,
          cur_flags);

where I wrote ttm_is_going_to_move_by_setting_page_caching_flags as:

  !(man->flags & TTM_MEMTYPE_FLAG_FIXED) &&
  !(bdev->man[bo->mem.mem_type].flags & TTM_MEMTYPE_FLAG_FIXED)

If there is a better way to express that in TTM, we should use it
instead. This would fix all cases without requiring odd hacks in the
drivers.
Re: [PATCH] drm/ttm: Only try to preserve caching in moves in the same space
> The idea is to mask all moves which involve TT (AGP) with
> agp_caching_mask; it's easy to do in the radeon ttm layer code and I
> believe it's easy to do in nouveau too. On non-AGP you set
> agp_caching_mask to UC|WC|CACHED.

Sure, but isn't that uglier and much more ad hoc than the patch I
proposed?
Re: [PATCH] drm/ttm: Only try to preserve caching in moves in the same space
I see a problem with this.

If you have uncached TT (e.g. AGP), you want uncached for TT->SYSTEM
moves, but cached for VRAM->SYSTEM moves.

If you set SYSTEM to uncached, then VRAM->SYSTEM will be broken, and if
you set SYSTEM to cached, TT->SYSTEM will be broken. If you set it to
uncached | cached, VRAM->SYSTEM will still be broken, because it will
choose uncached.

You would need to make the proposed placement conditional on the current
BO memory type, which is possible but somewhat defeats the purpose of
having all the ttm_bo_mem_space logic in ttm.

Perhaps this doesn't usually happen, because you would do VRAM->TT
instead of VRAM->SYSTEM (is that really always the case? what if we
don't have enough space in GART?), but it seems a design deficiency in
TTM anyway: the current placement caching should not really be a factor
*unless* we can actually do the move just by changing the caching
attributes.

What do you think?

Of course, we could also just drop ttm_bo_mem_space and let the driver
do that itself (perhaps using hardcoded helpers for the AGP and PCIE
cases). That seems quite a radical solution though.
Re: [PATCH] drm/ttm: Only try to preserve caching in moves in the same space
> If you set to uncached | cached, VRAM->SYSTEM will still be broken
> because it will choose uncached.

Assuming VRAM is uncached and not WC, of course. The reasoning still
holds if you replace all instances of "uncached" with WC.
[PATCH] drm/ttm: Only try to preserve caching in moves in the same space
Currently TTM tries to preserve the previous caching even for moves
between unrelated memory spaces. This results, for instance, in moves
from VRAM to PCIE GART needlessly changing system memory to WC state.

This patch changes TTM to only try to preserve caching when moving from
system/TT to system/TT.

In theory, we should also do that when moving between two device memory
spaces that are backed by the same physical memory area. However, I'm
not sure how to do that in the current TTM framework and I suspect no
card/driver uses memory spaces in that way.

Signed-off-by: Luca Barbieri
---
 drivers/gpu/drm/ttm/ttm_bo.c | 18 ++
 1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 2920f9a..27ee1be 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -876,6 +876,7 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
         mem->mm_node = NULL;
         for (i = 0; i < placement->num_placement; ++i) {
+                int is_tt_move;
                 ret = ttm_mem_type_from_flags(placement->placement[i],
                                               &mem_type);
                 if (ret)
@@ -891,8 +892,12 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
                 if (!type_ok)
                         continue;
 
-                cur_flags = ttm_bo_select_caching(man, bo->mem.placement,
-                                                  cur_flags);
+                is_tt_move = !(man->flags & TTM_MEMTYPE_FLAG_FIXED) &&
+                        !(bdev->man[bo->mem.mem_type].flags & TTM_MEMTYPE_FLAG_FIXED);
+
+                /* Only try to keep the current flags if we are in the same memory space */
+                cur_flags = ttm_bo_select_caching(man, is_tt_move ? bo->mem.placement : 0, cur_flags);
+
                 /*
                  * Use the access and other non-mapping-related flag bits from
                  * the memory placement flags to the current flags
@@ -927,6 +932,7 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
                 return -EINVAL;
 
         for (i = 0; i < placement->num_busy_placement; ++i) {
+                int is_tt_move;
                 ret = ttm_mem_type_from_flags(placement->busy_placement[i],
                                               &mem_type);
                 if (ret)
@@ -941,8 +947,12 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
                                                 &cur_flags))
                         continue;
 
-                cur_flags = ttm_bo_select_caching(man, bo->mem.placement,
-                                                  cur_flags);
+                is_tt_move = !(man->flags & TTM_MEMTYPE_FLAG_FIXED) &&
+                        !(bdev->man[bo->mem.mem_type].flags & TTM_MEMTYPE_FLAG_FIXED);
+
+                /* Only try to keep the current flags if we are in the same memory space */
+                cur_flags = ttm_bo_select_caching(man, is_tt_move ? bo->mem.placement : 0, cur_flags);
+
                 /*
                  * Use the access and other non-mapping-related flag bits from
                  * the memory placement flags to the current flags
-- 
1.6.6.1.476.g01ddb
Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
> Doing it without software methods means you need to have a semaphore
> that "exists" in the cpu domain and therefore cannot be used again
> until the cpu is sure it's done. So that will probably require a
> rotating queue of several semaphores and a fallback to cpu waiting.

Why would it need a semaphore in the CPU domain? Couldn't it work this
way:

1. Allocate a BO in the kernel
2. Emit on one channel a "notify" request on that BO
3. Emit on the other channel a "wait" request on that BO
4. Emit a fence on the wait channel
5. Use the existing delayed-destroy mechanism to get rid of the BO once
   the wait-channel fence expires

Then the GPU would presumably switch away from the "waiting" context and
not re-execute it until the notify happens, or something similar.

The problem is, of course, whether the GPU supports the "wait" request.
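For concreteness, here is a rough sketch of those five steps. All the
helper names (bo_new, emit_notify, emit_wait, emit_fence,
queue_delayed_free) are hypothetical, not actual nouveau or TTM API; the
point is only the ordering of operations:

/* Hypothetical helpers: these names stand in for the five steps above
 * and are not real nouveau/TTM functions. */
struct bo;
struct channel;
struct fence;

struct bo *bo_new(unsigned long size);
void emit_notify(struct channel *ch, struct bo *bo);  /* GPU signals bo */
void emit_wait(struct channel *ch, struct bo *bo);    /* GPU blocks on bo */
struct fence *emit_fence(struct channel *ch);
void queue_delayed_free(struct bo *bo, struct fence *f);

int sync_channels(struct channel *producer, struct channel *consumer)
{
    struct bo *scratch = bo_new(4096);  /* 1. allocate a BO in the kernel */
    struct fence *f;

    if (!scratch)
        return -1;

    emit_notify(producer, scratch);     /* 2. "notify" on one channel */
    emit_wait(consumer, scratch);       /* 3. "wait" on the other channel */
    f = emit_fence(consumer);           /* 4. fence on the wait channel */

    /* 5. delayed destroy: the BO is freed only once the wait-channel
     * fence expires, i.e. once the wait (and thus the notify) is done. */
    queue_delayed_free(scratch, f);
    return 0;
}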
Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
> The card also keeps some internal FIFO caches, so it seems to me that
> getting that safe of races and efficient at the same time could be a
> bit non-trivial.

Are they flushable? It seems this *might* do the job:

  if (!pfifo->cache_flush(dev))
          return;
  pfifo->reassign(dev, false);
  pfifo->cache_flush(dev);
  pfifo->cache_pull(dev, false);
  pfifo->cache_pull(dev, true);
  pfifo->reassign(dev, true);
Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
>> If not, it could possibly be hacked around by reading from a DMA
>> object at the address of the fence sequence number and then resizing
>> the DMA object so that addresses from a certain point on would
>> trigger a protection fault interrupt.
>
> I don't think you can safely modify a DMA object without stalling the
> card completely, but i think you could use e.g. PGRAPH NOTIFY
> interrupts and disable them by flipping a bit in PGRAPH when you stop
> caring about them.

The problem is that one needs to disable them *before* the one we care
about. Suppose the card is at fence 0 and we are interested in fence
1000 expiring. If we just enable interrupts, we are going to be
interrupted uselessly 1000 times.

Instead, we would like to tell the card "send me interrupts for fences,
but only for fence number 1000 (or higher)". This way one could, for
instance, render a whole scene and then, wanting to read it back with
the CPU, just ask for an interrupt once rendering is done (i.e. wait for
the framebuffer fence) and get a single interrupt, while we cleanly
sleep undisturbed in the meantime.

Basically, it would just need some way of *conditionally* causing
interrupts.

If there is none, then maybe we could insert a call to a "fence
mini-pushbuffer" filled with NOPs, which could be overwritten with an
interrupt request on demand? Or alternatively, construct such a
pushbuffer with the 2D or 3D engines, starting from our "1000" input and
the fence value generated by the 3D engine? (this is likely to be slow
though). Or some hack like the DMA object resizing? (would it crash the
GPU, or just not work because it caches the previous size?)
Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
I'm not sure I understand your proposal correctly. It seems your
proposal is similar to mine, with the term "fence nodes" replaced by
"ttm transactions", but I may be misreading it.

Here is some pseudocode for an improved, simplified version of my
proposal. It is modified so that there are no longer distinct
alive/destroy lists; buffers are instead destroyed when their ref count
drops to zero.

list_head ram_lru_list;           /* list of bos */
list_head vram_unfenced_lru_list; /* list of bos */
list_head gart_unfenced_lru_list; /* list of bos */
atomic uint64_t next_seq_num;

// if read_list and write_list are empty, the buffer is unfenced and
// MUST be in an unfenced lru list; otherwise, it is fenced and MUST be,
// if not zombie, on some read_list/write_list, or if zombie, on some
// unfenced_list
struct ttm_buffer_object {
    kref_t ref;
    list_head read_list;     /* list of bo_fence_nodes */
    list_head write_list;    /* list of bo_fence_nodes */
    list_head unfenced_list; /* list entry in [ram|[gart|vram]_unfenced]_lru_list */
    [...]
};

// TODO: we could embed just the first two members in the
// ttm_buffer_object, and steal the lowest bit of the fence pointer to
// signify that; this would optimize the very common single-fence case
struct ttm_bo_fence_node {
    list_head bo_list; /* list entry in bo.[read|write]_list */
    struct ttm_fence_node *fence;
    struct ttm_buffer_object *bo;
    list_head fence_list; /* list entry in fence.[vram|gart|destroy]_list */
};

struct ttm_fence {
    void *sync_object; /* may also be turned into an object containing a ttm_fence at the start */
    uint64_t seq_num;  /* unique value in order of kmalloc of this ttm_fence */
    list_head bo_list; /* list of bo_fence_nodes */
};

struct ttm_channel {
    list_head fence_list; /* list of ttm_fence_node */
};

ttm_flush_fences()
{
    list_head vram_list[MAX_CHANNELS];
    list_head gart_list[MAX_CHANNELS];
    foreach channel {
        foreach fence in channel {
            if (not driver->fence_expired(fence->sync_object))
                break;
            foreach bo_fence_node in fence.bo_list {
                remove bo_fence_node
                if bo.read_list and bo.write_list are both empty {
                    if bo.refcount is zero
                        destroy
                    else
                        append to [vram|gart]_list[channel]
                }
            }
        }
    }

    // this is the n-way merge of vram_list[]s into the lru list
    while (vram_list[]s are not all empty) {
        // this can use either a simple scan, or a heap
        find channel such that first_entry(vram_list[channel]).fence.seq_num is smallest
        remove first_entry(vram_list[channel]) and put the bo at the
            recent end of vram_unfenced_lru_list
    }
    same thing for gart;
}

// assume buffer is reserved; use the current mechanism or a mutex for that
// channel may be null for CPU waits
ttm_bo_wait(bo, channel, wait_for_write)
{
    foreach fence_node in bo.write_list {
        if (fence_node.channel != channel)
            driver->wait_fence(fence_node.fence.sync_object);
    }
    if (!wait_for_write)
        return;
    foreach fence_node in bo.read_list {
        if (fence_node.channel != channel)
            driver->wait_fence(fence_node.fence.sync_object);
    }
}

ttm_out_of_memory() takes memory_alloc_mutex
{
retry:
    ttm_flush_fences();
    if (we have enough space now)
        return;
    foreach in [vram|gart]_unfenced_lru_list {
        evict that buffer if it's big enough, no need to check fences
        // this uses the current "ghost" mechanism for accelerated moves:
        // emit evicted_buffer_fence for after emission
    }
    if we didn't find a big enough buffer, evict the biggest buffer
        (also considering the empty space around it in its size)
    if (we have enough space now)
        return;
    if (burn_cpu_time_but_have_lowest_latencies) {
        while (!driver->fence_expired(evicted_bo->sync_object)
               and we don't have enough space) {
            driver->wait_for_any_fence_to_expire();
            ttm_flush_fences();
        }
    } else
        ttm_bo_wait(evicted_bo, 0)
    goto retry;
}

// assume the buffer has already been put in the desired memory space
// also assume we already waited for the buffer fences
ttm_add_buffer_to_fence(fence, bo)
{
    remove bo from unfenced lru list if it's on it
    for the none or single bo_fence_node in bo.read_list or bo.write_list
        with bo_fence_node.fence.channel == fence.channel {
        remove bo_fence_node from bo_fence_node.fence.[gart|vram]_list
        if (bo_fence_node.fence has all lists empty)
            remove from channel and free the fence bo_fence_node.fence
        remove bo_fence_node from bo.[read|write]_list
    }
    create a new bo_fence_node and use it to add the bo to the fence
}

ttm_bo_refcount_drops_
Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
> Nvidia cards have a synchronization primitive that could be used to
> synchronize several FIFOs in hardware (AKA semaphores, see [1] for an
> example).

Does this operate wholly on the GPU on all nVidia cards? It seems that
at least on some GPUs this will trigger "software methods", which are
basically a way for the GPU to trigger an interrupt and stop the FIFO
until the CPU handles the interrupt and restarts it.

Also, is there a way on nVidia cards to get interrupts on fences, but
only where the fence sequence number is higher than a dynamically set
value? (so that one could sleep for fence X without getting an interrupt
for every single fence before it)

If not, it could possibly be hacked around by reading from a DMA object
at the address of the fence sequence number, and then resizing the DMA
object so that addresses from a certain point on would trigger a
protection-fault interrupt.
Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
> At a first glance:
>
> 1) We probably *will* need a delayed destroy workqueue to avoid
> wasting memory that otherwise should be freed to the system. At the
> very least, the delayed delete process should optionally be run by a
> system shrinker.

You are right. For VRAM we don't care, since we are the only user, while
for system-backed memory some delayed destruction will be needed. The
logical extension of the scheme would be for the Linux page
allocator/swapper to check for TTM buffers to destroy when it would
otherwise shrink caches, try to swap, and/or wait on swap to happen. I'm
not sure whether there are existing hooks for this or where exactly to
hook this code.

> 2) Fences in TTM are currently not necessarily strictly ordered, and
> sequence numbers are hidden from the bo code. This means, for a given
> FIFO, fence sequence 3 may expire before fence sequence 2, depending
> on the usage of the buffer.

My definition of "channel" (I sometimes used FIFO incorrectly as a
synonym) is exactly a set of fences that are strictly ordered. If the
card has multiple HW engines, each is considered a different channel (so
that a channel becomes a (fifo, engine) pair).

We may need, however, to add the concept of a "sync domain": a set of
channels that support on-GPU synchronization against each other. This
would model hardware where channels with the same FIFO can be
synchronized together but those with different FIFOs can't, and also
multi-core GPUs where synchronization might be available only inside
each core and not across cores.

To sum it up, a GPU consists of a set of sync domains, each consisting
of a set of channels, each consisting of a sequence of fences, with the
following rules:

1. Fences within the same channel expire in order
2. If channels A and B belong to the same sync domain, it's possible to
   emit a fence on A that is guaranteed to expire after an arbitrary
   fence of B

Whether channels share a FIFO is essentially a driver implementation
detail; what TTM cares about is whether they are in the same sync
domain. [I just made up "sync domain" here: is there a standard term?]

This assumes that the "synchronizability" graph is a disjoint union of
complete graphs. Is there any example where that is not so? Also, does
this actually model Poulsbo correctly, or am I wrong?

Note that we could use CPU mediation more than we currently do. For
instance, to do inter-channel synchronization, Nouveau currently just
waits on the fence with the CPU immediately and synchronously, while it
could instead queue the commands in software and, with an
interrupt/delayed mechanism, submit them to hardware once the fence
being waited for has expired.
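To make the hierarchy concrete, here is a sketch of it as C data
structures. These are illustrative only, not existing TTM types:

/* Sketch of the proposed model: a GPU is a set of sync domains, each a
 * set of channels, each an ordered sequence of fences. */
#include <linux/list.h>
#include <linux/types.h>

struct ttm_fence_sketch {
    u64 seq;                 /* rule 1: expires in order within its channel */
    struct list_head channel_entry;
};

struct ttm_channel_sketch {
    struct list_head fences; /* strictly ordered sequence of fences */
    struct list_head domain_entry;
};

struct ttm_sync_domain_sketch {
    /* rule 2: any two channels here can synchronize on the GPU, i.e.
     * one can emit a fence guaranteed to expire after an arbitrary
     * fence of the other. */
    struct list_head channels;
    struct list_head gpu_entry;
};

struct ttm_gpu_sketch {
    struct list_head sync_domains;
};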
Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
> Also note that the delayed delete list is not in fence order but in
> deletion-time order, which perhaps gives room for more optimizations.

You are right. I think then that ttm_bo_delayed_delete may still need to
be changed, because it stops when ttm_bo_cleanup_refs returns -EBUSY,
which happens when a fence has not been reached. This means that a
buffer will need to wait for all previously deleted buffers to become
unused, even if it is unused itself. Is this acceptable?

What if we get rid of the delayed-destroy list, and instead append
buffers to be deleted to their fence object, deleting them when the
fence is signaled?

This also allows us to do it more naturally, since the fence object can
just keep a normal reference to the buffers it fences and unreference
them on expiration. Then there needs to be no special "delayed
destruction" logic; it works as if the GPU were keeping a reference to
the buffer itself, using fences as a proxy to have the CPU do that work
for the GPU.

The delayed work is then no longer "periodically destroy buffers" but
rather "periodically check if fences are expired", naturally stopping at
the first unexpired one. Drivers that support IRQs on fences could also
do the work in the interrupt handler/tasklet instead, avoiding the
delay-jiffies magic number. This may need a NAPI-like interrupt
mitigation middle layer for optimal results, though.

> But isn't an atomic cmpxchg about as costly as a spinlock?

I think it's cheaper on all architectures; otherwise it would be mostly
pointless to have it, since you can emulate it with a spinlock.
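A minimal sketch of that scheme, with illustrative types rather than the
actual TTM structures: the fence holds plain references to its buffers,
and signalling the fence drops them, so a zombie buffer dies exactly
when its last fence expires, independent of older unsignalled fences.

#include <linux/list.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Illustrative only. */
struct sketch_fence {
    struct list_head bos;          /* buffers kept alive by this fence */
};

struct sketch_bo {
    struct kref ref;
    struct list_head fence_entry;  /* entry in sketch_fence.bos */
};

static void sketch_bo_release(struct kref *ref)
{
    struct sketch_bo *bo = container_of(ref, struct sketch_bo, ref);
    kfree(bo); /* actually free the buffer here */
}

/* Called when the fence is known to have expired: the "GPU reference"
 * held via the fence is dropped, so an otherwise-unreferenced buffer
 * is destroyed immediately. */
static void sketch_fence_signal(struct sketch_fence *fence)
{
    struct sketch_bo *bo, *tmp;

    list_for_each_entry_safe(bo, tmp, &fence->bos, fence_entry) {
        list_del(&bo->fence_entry);
        kref_put(&bo->ref, sketch_bo_release);
    }
}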
Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
When designing this, we should also keep in mind that some drivers (e.g.
nouveau) have multiple FIFO channels, and thus we would like a buffer to
be referenced for reading by multiple channels at once (and be destroyed
only when all fences have expired, obviously). Also, hardware may
support on-GPU inter-channel synchronization, and then multiple
references may be for writing too.

If we use an external, dynamically allocated channel/buffer list node,
we can support this (if the kernel allocators aren't fast enough, which
they should be, we can just keep released nodes linked to the bo to
speed up allocation).

Note that in nouveau there is a small hardware limit on channels (up to
128 on nv50), but future hardware may possibly support unlimited
channels.
Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
> We had to do a similar thing in the Poulsbo driver and it turned out
> that we could save a significant amount of CPU by using a delayed
> workqueue, collecting objects and destroying them periodically.

Yes, indeed: we don't really care about a fence expiring unless we want
to use that buffer or we are out of memory. Processing each fence,
especially if there is an interrupt per fence, seems unnecessary, since
you can just do that work in the cases mentioned above.

So, here is a new proposal that should have the best features of all
previously considered methods.

Assume first that there is a single FIFO channel. We introduce a list of
fence nodes, each containing:

- A list of buffers NOT to be destroyed having the fence as their sync
  object (or possibly one for each memory type)
- A delayed-destroy list of buffers having the fence as their sync
  object

Fence nodes are added at the end upon emission. They are removed and
freed when the buffer count drops to 0 (meaning all buffers have been
moved to later fences). Thus, the number of fence nodes does not grow
without bound, but is bounded by the number of buffers.

The LRU lists are changed to only contain unfenced buffers, and buffers
are initially put on them. When a buffer is fenced, it is removed from
the LRU list or its current fence list, and added to the new fence
(possibly destroying the old fence node in the process).

The reference count of buffer objects is only increased when they are
put in a delayed-destruction list. Upon deletion, they are destroyed
immediately if they are on the LRU list, and moved to the corresponding
delayed-delete list if they are in a fenced list.

ttm_bo_delayed_delete is no longer called by any workqueue, but only
when we run out of memory, before we try eviction. It processes the list
up to the first unsignalled fence, destroying all buffers it finds and
freeing all the fence nodes. All the buffers in the alive lists are put
in the LRU list, in the correct order. We may either keep an alive list
per memory type, or sort the buffers by memory type (so they are put in
the memory-type LRU list) at this point.

Thus, after ttm_bo_delayed_delete, there is the following scenario:

1. Inactive buffers with no references are destroyed
2. Inactive buffers with references are in the LRU list
3. Fenced buffers with no references are in the delayed-destroy list
   attached to their fence
4. Fenced buffers with references are in the alive list attached to
   their fence

This should have about the same CPU and memory usage as the current
code, but the following advantages:

1. Delayed deletion does not run periodically, but exactly when needed
   (at allocation time, before eviction)
2. Delayed deletion always deletes all buffers that can be deleted,
   since it processes them in fence order
3. The LRU list at eviction time contains exactly all evictable buffers
4. Each buffer is always on an LRU _or_ on a delayed-destroy list, so
   only one list node is necessary
5. Once buffer deletion determines a fence is signalled, it can destroy
   all its to-be-destroyed buffers without any further checking
6. Some fence logic (such as signalling all fences on forced channel
   closing) can be moved from drivers to TTM
7. Some channel logic can be moved from drivers to TTM
8. The code should be simpler and more elegant

Note that using a buffer with the CPU doesn't change its position in the
LRU list. This is good, because evicting a buffer used by the CPU is
actually a good thing, since it will move to a region of memory slower
for the GPU but possibly faster for the CPU. However, for buffers in
system memory, if the LRU list is used for swapping, CPU access should
move the buffer to the front of the list (using the LRU list for this is
probably a bad idea though, since system memory swapping should be at
page granularity).

For multiple channels, things are slightly more complex. First, we need
to build the fence data structure for each channel. Also, we need the
fence/buffer nodes to be separate. Delayed destruction continues to
work, but the reference count needs to be set to the number of channels
fencing the buffer on destruction.

Putting buffers in the LRU list can be done by n-way merging between the
channel fence lists, assigning each fence a global sequence number and
using that to merge and put back buffers in the LRU (see the sketch
after this message). The n-way merging can be done with a small heap on
the stack on current hardware, where channels are limited. Furthermore,
we need to keep a reference count, so that buffers are put in the LRU
(or destroyed) only after they are off all the channels.

The LRU list semantics are relaxed. If a buffer both had its fence
emitted before another buffer's, and also signaled before it, then it
will be considered "less recently" used, and the opposite if both are
after. Otherwise, the order is undetermined. This should still guarantee
good eviction behavior, and allows the LRU list to contain only
evictable buffers, while n
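Here is a sketch of the n-way merge step, assuming each per-channel list
is already ordered by the global sequence number. The types and the
linear scan are illustrative (a heap would do the same in O(log n) per
element):

/* Repeatedly pick the channel whose oldest expired-fence buffer has
 * the smallest global sequence number and append it to the LRU.  A
 * linear scan is fine for the ~128 channels current hardware has. */
#include <stdint.h>
#include <stddef.h>

struct merged_bo {
    uint64_t seq;          /* global fence sequence number */
    struct merged_bo *next;
};

/* lists[i] is the per-channel list, already sorted by seq. */
void merge_into_lru(struct merged_bo *lists[], int nchannels,
                    void (*lru_append)(struct merged_bo *))
{
    for (;;) {
        int best = -1;
        for (int i = 0; i < nchannels; i++)
            if (lists[i] && (best < 0 || lists[i]->seq < lists[best]->seq))
                best = i;
        if (best < 0)
            return; /* all channel lists drained */
        struct merged_bo *bo = lists[best];
        lists[best] = bo->next;
        lru_append(bo);
    }
}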
Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
Yes, it's fine. I sent your patch to Dave with an expanded commit
comment for merging.

Here is a possible redesign of the mechanism, inspired by this issue.

It seems that what we are racing against is buffer eviction, due to
delayed-deletion buffers still being kept on the LRU list. I'm wondering
if the delayed-deletion design could be changed as follows:

1. Remove to-be-deleted buffers from the LRU list before putting them on
   the delayed-delete list
2. Change buffer eviction to first do a delayed-deletion pass. This
   should be cheap (and cheaper than the current code) because delayed
   deletion stops at the first unsignaled fence.
3. Add a new "delayed deletion" lock/semaphore. Then have
   ttm_bo_delayed_delete take it for reading and keep it across the
   function.
4. Inside the delayed-deletion lock, grab the LRU lock, copy the
   delayed-delete list head to a local variable, set it to empty, and
   release the lock.
5. Iterate over the privately held list with list_for_each_safe
6. At the end of ttm_bo_delayed_delete, retake the LRU lock and re-add
   the remaining part of our private list at the head of the global list

This would prevent uselessly trying to evict delayed-delete buffers
(which should be processed in fence order and not LRU order), and also
prevent concurrent delayed deletions, which should be more harmful than
useful.

Furthermore, it should be possible to get rid of list locking in the
following way (a sketch of the lockless push/grab follows below):

1. BOs to be delayed-deleted are added to the head of the initial
   delayed-deletion singly linked list, using atomic cmpxchg
2. ttm_bo_delayed_delete takes the delayed-deletion lock and grabs the
   list with an atomic xchg of the head with zero
3. It reverses the list in place, processes the entries, and puts them
   at the end of a second singly linked list, the recurring
   delayed-deletion list
4. It processes the recurring delayed-deletion list, cleaning up the BOs
5. Finally, the delayed-deletion lock is released

This makes adding to the delayed-deletion list lockless. The LRU list
instead inherently needs to be doubly linked, so only RCU could make it
lockless, and it seems that may require using an external list node
structure (so readers don't suddenly jump to the most recent node),
which would not be a win (except with *lots* of CPUs). Besides, most DRM
drivers (except vmware) take the BKL around all ioctls and (except
nouveau) use a single pushbuffer, so this is a bit moot anyway.

What do you think?

Anyway, this, if done, would be for the next merge window or later,
while the current fix ought to be merged now.
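Here is a sketch of the lockless part (steps 1 to 3), written with C11
atomics for clarity rather than the kernel's cmpxchg()/xchg() macros;
the types are illustrative, not the actual TTM code:

#include <stdatomic.h>
#include <stddef.h>

struct ddestroy_node {
    struct ddestroy_node *next;
};

static _Atomic(struct ddestroy_node *) ddestroy_head;

/* Step 1: push at the head; safe against concurrent pushers. */
void ddestroy_push(struct ddestroy_node *node)
{
    struct ddestroy_node *old = atomic_load(&ddestroy_head);
    do {
        node->next = old; /* old is refreshed by the CAS on failure */
    } while (!atomic_compare_exchange_weak(&ddestroy_head, &old, node));
}

/* Steps 2-3: grab the whole list with one exchange, then reverse it in
 * place so entries come out in submission order. */
struct ddestroy_node *ddestroy_grab_all(void)
{
    struct ddestroy_node *list = atomic_exchange(&ddestroy_head, NULL);
    struct ddestroy_node *reversed = NULL;

    while (list) {
        struct ddestroy_node *next = list->next;
        list->next = reversed;
        reversed = list;
        list = next;
    }
    return reversed;
}

Grabbing the whole list with a single exchange also sidesteps the ABA
problem that popping individual nodes from a lock-free stack would have,
and step 5's lock guarantees there is a single consumer.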
[PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v3, final)
Resending this with Thomas Hellstrom's signoff for merging into 2.6.33.

ttm_bo_delayed_delete has a race condition, because after we do:

        kref_put(&nentry->list_kref, ttm_bo_release_list);

we are not holding the list lock and not holding any reference to
objects, and thus every bo in the list can be removed and freed at this
point. However, we then use the next pointer we stored, which is not
guaranteed to be valid.

This was apparently the cause of some Nouveau oopses I experienced.

This patch rewrites the function so that it keeps the reference to
nentry until nentry itself is freed and we already got a reference to
nentry->next.

v2 updated by me according to Thomas Hellstrom's feedback.
v3 proposed by Thomas Hellstrom. Commit comment updated by me.
Both updates fixed minor efficiency/style issues only and all three
versions should be correct.

Signed-off-by: Luca Barbieri
Signed-off-by: Thomas Hellstrom
---
 drivers/gpu/drm/ttm/ttm_bo.c | 58 ++
 1 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index c7733c3..1a3e909 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -524,52 +524,44 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all)
 static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all)
 {
         struct ttm_bo_global *glob = bdev->glob;
-        struct ttm_buffer_object *entry, *nentry;
-        struct list_head *list, *next;
-        int ret;
+        struct ttm_buffer_object *entry = NULL;
+        int ret = 0;
 
         spin_lock(&glob->lru_lock);
-        list_for_each_safe(list, next, &bdev->ddestroy) {
-                entry = list_entry(list, struct ttm_buffer_object, ddestroy);
-                nentry = NULL;
+        if (list_empty(&bdev->ddestroy))
+                goto out_unlock;
 
-                /*
-                 * Protect the next list entry from destruction while we
-                 * unlock the lru_lock.
-                 */
+        entry = list_first_entry(&bdev->ddestroy,
+                struct ttm_buffer_object, ddestroy);
+        kref_get(&entry->list_kref);
+
+        for (;;) {
+                struct ttm_buffer_object *nentry = NULL;
 
-                if (next != &bdev->ddestroy) {
-                        nentry = list_entry(next, struct ttm_buffer_object,
-                                            ddestroy);
+                if (entry->ddestroy.next != &bdev->ddestroy) {
+                        nentry = list_first_entry(&entry->ddestroy,
+                                struct ttm_buffer_object, ddestroy);
                         kref_get(&nentry->list_kref);
                 }
-                kref_get(&entry->list_kref);
 
                 spin_unlock(&glob->lru_lock);
                 ret = ttm_bo_cleanup_refs(entry, remove_all);
                 kref_put(&entry->list_kref, ttm_bo_release_list);
+                entry = nentry;
+
+                if (ret || !entry)
+                        goto out;
 
                 spin_lock(&glob->lru_lock);
-                if (nentry) {
-                        bool next_onlist = !list_empty(next);
-                        spin_unlock(&glob->lru_lock);
-                        kref_put(&nentry->list_kref, ttm_bo_release_list);
-                        spin_lock(&glob->lru_lock);
-                        /*
-                         * Someone might have raced us and removed the
-                         * next entry from the list. We don't bother restarting
-                         * list traversal.
-                         */
-
-                        if (!next_onlist)
-                                break;
-                }
-                if (ret)
+                if (list_empty(&entry->ddestroy))
                         break;
         }
 
-        ret = !list_empty(&bdev->ddestroy);
-        spin_unlock(&glob->lru_lock);
+out_unlock:
+        spin_unlock(&glob->lru_lock);
+out:
+        if (entry)
+                kref_put(&entry->list_kref, ttm_bo_release_list);
 
         return ret;
 }
-- 
1.6.2.5
Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
> Would nentry=list_first_entry(&entry->ddestroy, ) work?

Yes. It seems a bit less intuitive, but if that's the accepted practice,
let's do that instead.

> Here nentry may have been removed from the list by another process,
> which would trigger the unnecessary call, mentioned above.

You are right. I attached a revised patch. It's only compile-tested, but
the changes are small and it should hopefully work.

Note that in principle we could remove the special-case code for the
list head, but that would require pretending the list head is actually
inside a ttm_buffer_object and adding a flag to skip the
unlock/cleanup/put/lock on it, which seems bad.

The whole function seems more complicated than needed, but I can't find
a way to do it with less code. If we could keep glob->lru_lock while
calling ttm_bo_cleanup_refs, things would be much simpler, but that
would require intrusive changes and may not be possible.

From 68972c220abe3ce19eb046d72fa218646168adc7 Mon Sep 17 00:00:00 2001
From: Luca Barbieri
Date: Mon, 18 Jan 2010 19:34:53 +0100
Subject: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2)

ttm_bo_delayed_delete has a race condition, because after we do:

        kref_put(&nentry->list_kref, ttm_bo_release_list);

we are not holding the list lock and not holding any reference to
objects, and thus every bo in the list can be removed and freed at this
point. However, we then use the next pointer we stored, which is not
guaranteed to be valid.

This was apparently the cause of some Nouveau oopses I experienced.

This patch rewrites the function so that it keeps the reference to
nentry until nentry itself is freed and we already got a reference to
nentry->next. It should now be correct and free of races, but please
double-check this.

Updated according to Thomas Hellstrom's feedback.

Signed-off-by: Luca Barbieri
---
 drivers/gpu/drm/ttm/ttm_bo.c | 58 ++---
 1 files changed, 26 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 2920f9a..5dfa41f 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -523,52 +523,46 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all)
 static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all)
 {
         struct ttm_bo_global *glob = bdev->glob;
-        struct ttm_buffer_object *entry, *nentry;
-        struct list_head *list, *next;
-        int ret;
+        struct ttm_buffer_object *entry;
+        int ret = 0;
 
         spin_lock(&glob->lru_lock);
-        list_for_each_safe(list, next, &bdev->ddestroy) {
-                entry = list_entry(list, struct ttm_buffer_object, ddestroy);
-                nentry = NULL;
+        if (list_empty(&bdev->ddestroy)) {
+                spin_unlock(&glob->lru_lock);
+                return 0;
+        }
 
-                /*
-                 * Protect the next list entry from destruction while we
-                 * unlock the lru_lock.
-                 */
+        entry = list_first_entry(&bdev->ddestroy,
+                struct ttm_buffer_object, ddestroy);
+        kref_get(&entry->list_kref);
+
+        for (;;) {
+                struct ttm_buffer_object *nentry = NULL;
 
-                if (next != &bdev->ddestroy) {
-                        nentry = list_entry(next, struct ttm_buffer_object,
-                                            ddestroy);
+                if (entry->ddestroy.next != &bdev->ddestroy) {
+                        nentry = list_first_entry(&entry->ddestroy,
+                                struct ttm_buffer_object, ddestroy);
                         kref_get(&nentry->list_kref);
                 }
-                kref_get(&entry->list_kref);
 
                 spin_unlock(&glob->lru_lock);
                 ret = ttm_bo_cleanup_refs(entry, remove_all);
                 kref_put(&entry->list_kref, ttm_bo_release_list);
+                entry = nentry;
+
+                if (ret || !entry)
+                        break;
 
                 spin_lock(&glob->lru_lock);
-                if (nentry) {
-                        bool next_onlist = !list_empty(next);
+
+                if (list_empty(&entry->ddestroy)) {
                         spin_unlock(&glob->lru_lock);
-                        kref_put(&nentry->list_kref, ttm_bo_release_list);
-                        spin_lock(&glob->lru_lock);
-                        /*
-                         * Someone might have raced us and removed the
-                         * next entry from the list. We don't bother restarting
-                         * list traversal.
-                         */
-
-                        if (!next_onlist)
-                                break;
-                }
-                if (ret)
                         break;
+                }
         }
 
-        ret = !list_empty(&bdev->ddestroy);
-        spin_unlock(&glob->lru_lock);
+        if (entry)
+                kref_put(&entry->list_kref, ttm_bo_release_list);
 
         return ret;
 }
-- 
1.6.3.3
[PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
ttm_bo_delayed_delete has a race condition, because after we do:

        kref_put(&nentry->list_kref, ttm_bo_release_list);

we are not holding the list lock and not holding any reference to
objects, and thus every bo in the list can be removed and freed at this
point. However, we then use the next pointer we stored, which is not
guaranteed to be valid.

This was apparently the cause of some Nouveau oopses I experienced.

This patch rewrites the function so that it keeps the reference to
nentry until nentry itself is freed and we already got a reference to
nentry->next. It should now be correct and free of races, but please
double-check this.

Signed-off-by: Luca Barbieri
---
 drivers/gpu/drm/ttm/ttm_bo.c | 58 +
 1 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 2920f9a..1daa2f1 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -523,52 +523,42 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all)
 static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all)
 {
         struct ttm_bo_global *glob = bdev->glob;
-        struct ttm_buffer_object *entry, *nentry;
-        struct list_head *list, *next;
-        int ret;
+        struct ttm_buffer_object *entry;
+        int ret = 0;
 
         spin_lock(&glob->lru_lock);
-        list_for_each_safe(list, next, &bdev->ddestroy) {
-                entry = list_entry(list, struct ttm_buffer_object, ddestroy);
-                nentry = NULL;
+        if (list_empty(&bdev->ddestroy)) {
+                spin_unlock(&glob->lru_lock);
+                return 0;
+        }
 
-                /*
-                 * Protect the next list entry from destruction while we
-                 * unlock the lru_lock.
-                 */
+        entry = list_first_entry(&bdev->ddestroy,
+                struct ttm_buffer_object, ddestroy);
+        kref_get(&entry->list_kref);
 
-                if (next != &bdev->ddestroy) {
-                        nentry = list_entry(next, struct ttm_buffer_object,
-                                            ddestroy);
+        for (;;) {
+                struct ttm_buffer_object *nentry = NULL;
+
+                if (!list_empty(&entry->ddestroy)
+                    && entry->ddestroy.next != &bdev->ddestroy) {
+                        nentry = list_entry(entry->ddestroy.next,
+                                struct ttm_buffer_object, ddestroy);
                         kref_get(&nentry->list_kref);
                 }
-                kref_get(&entry->list_kref);
 
                 spin_unlock(&glob->lru_lock);
                 ret = ttm_bo_cleanup_refs(entry, remove_all);
                 kref_put(&entry->list_kref, ttm_bo_release_list);
+                entry = nentry;
 
-                spin_lock(&glob->lru_lock);
-                if (nentry) {
-                        bool next_onlist = !list_empty(next);
-                        spin_unlock(&glob->lru_lock);
-                        kref_put(&nentry->list_kref, ttm_bo_release_list);
-                        spin_lock(&glob->lru_lock);
-                        /*
-                         * Someone might have raced us and removed the
-                         * next entry from the list. We don't bother restarting
-                         * list traversal.
-                         */
-
-                        if (!next_onlist)
-                                break;
-                }
-                if (ret)
+                if (ret || !entry)
                         break;
+
+                spin_lock(&glob->lru_lock);
         }
 
-        ret = !list_empty(&bdev->ddestroy);
-        spin_unlock(&glob->lru_lock);
+        if (entry)
+                kref_put(&entry->list_kref, ttm_bo_release_list);
 
         return ret;
 }
-- 
1.6.3.3