[PATCH 00/83] AMD HSA kernel driver

2014-07-24 Thread Daniel Vetter
On Wed, Jul 23, 2014 at 04:57:48PM -0700, Jesse Barnes wrote:
> On Sun, 13 Jul 2014 12:40:32 -0400
> j.glisse at gmail.com (Jerome Glisse) wrote:
> 
> > On Sun, Jul 13, 2014 at 11:42:58AM +0200, Daniel Vetter wrote:
> > > On Sat, Jul 12, 2014 at 6:49 PM, Jerome Glisse  
> > > wrote:
> > > >> Hm, so the hsa part is a completely new driver/subsystem, not just an
> > > >> additional ioctl tacked onto radoen? The history of drm is littered 
> > > >> with
> > > >> "generic" ioctls that turned out to be useful for exactly one driver.
> > > >> Which is why _all_ the command submission is now done with 
> > > >> driver-private
> > > >> ioctls.
> > > >>
> > > >> I'd be quite a bit surprised if that suddenly works differently, so 
> > > >> before
> > > >> we bless a generic hsa interface I really want to see some 
> > > >> implementation
> > > >> from a different vendor (i.e. nvdidia or intel) using the same ioctls.
> > > >> Otherwise we just repeat history and I'm not terribly inclined to keep 
> > > >> on
> > > >> cleanup up cruft forever - one drm legacy is enough ;-)
> > > >>
> > > >> Jesse is the guy from our side to talk to about this.
> > > >> -Daniel
> > > >
> > > > I am not worried about that side, the hsa foundation has pretty strict
> > > > guidelines on what is hsa compliant hardware ie the hw needs to 
> > > > understand
> > > > the pm4 packet format of radeon (well small subset of it). But of course
> > > > this require hsa compliant hardware and from member i am guessing ARM 
> > > > Mali,
> > > > ImgTech, Qualcomm, ... so unless Intel and NVidia joins hsa you will not
> > > > see it for those hardware.
> > > >
> > > > So yes for once same ioctl would apply to different hardware. The only 
> > > > things
> > > > that is different is the shader isa. The hsafoundation site has some pdf
> > > > explaining all that but someone thought that slideshare would be a good 
> > > > idea
> > > > personnaly i would not register to any of the website just to get the 
> > > > pdf.
> > > >
> > > > So to sumup i am ok with having a new device file that present uniform 
> > > > set
> > > > of ioctl. It would actualy be lot easier for userspace, just open this 
> > > > fix
> > > > device file and ask for list of compliant hardware.
> > > >
> > > > Then radeon kernel driver would register itself as a provider. So all 
> > > > ioctl
> > > > decoding marshalling would be share which makes sense.
> > > 
> > > There's also the other side namely that preparing the cp ring in
> > > userspace and submitting the entire pile through a doorbell to the hw
> > > scheduler isn't really hsa exclusive. And for a solid platform with
> > > seamless gpu/cpu integration that means we need standard ways to set
> > > gpu context priorities and get at useful stats like gpu time used by a
> > > given context.
> > > 
> > > To get there I guess intel/nvidia need to reuse the hsa subsystem with
> > > the command submission adjusted a bit. Kinda like drm where kms and
> > > buffer sharing is common and cs driver specific.
> > 
> > HSA module would be for HSA compliant hardware and thus hardware would
> > need to follow HSA specification which again is pretty clear on what
> > the hardware need to provide. So if Intel and NVidia wants to join HSA
> > i am sure they would be welcome, the more the merrier :)
> > 
> > So i would not block HSA kernel ioctl design in order to please non HSA
> > hardware especialy if at this point in time nor Intel or NVidia can
> > share anything concret on the design and how this things should be setup
> > for there hardware.
> > 
> > When Intel or NVidia present their own API they should provide their
> > own set of ioctl through their own platform.
> 
> Yeah things are different enough that a uniform ioctl doesn't make
> sense.  If/when all the vendors decide on a single standard, we can use
> that, but until then I don't see a nice way to share our doorbell &
> submission scheme with HSA, and I assume nvidia is the same.
> 
> Using HSA as a basis for non-HSA systems seems like it would add a lot
> of complexity, since non-HSA hardware would have to intercept the queue
> writes and manage the submission requests etc as bytecodes in the
> kernel driver, or maybe as a shim layer library that wraps that stuff.
> 
> Probably not worth the effort given that the command sets themselves
> are all custom as well, driven by specific user level drivers like GL,
> CL, and libva.

Well I know that - drm also has the split between shared management stuff
like prime and driver private cmd submission. I still think that some
common interfaces would benefit us. I want things like a gputop (and also
perf counters and all that) to work the same way with the same tooling on
all svm/gpgpu stuff. So a shared namespace/create for svm contexts or
something like that.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[PATCH 00/83] AMD HSA kernel driver

2014-07-23 Thread Jesse Barnes
On Sun, 13 Jul 2014 12:40:32 -0400
j.glisse at gmail.com (Jerome Glisse) wrote:

> On Sun, Jul 13, 2014 at 11:42:58AM +0200, Daniel Vetter wrote:
> > On Sat, Jul 12, 2014 at 6:49 PM, Jerome Glisse  
> > wrote:
> > >> Hm, so the hsa part is a completely new driver/subsystem, not just an
> > >> additional ioctl tacked onto radoen? The history of drm is littered with
> > >> "generic" ioctls that turned out to be useful for exactly one driver.
> > >> Which is why _all_ the command submission is now done with driver-private
> > >> ioctls.
> > >>
> > >> I'd be quite a bit surprised if that suddenly works differently, so 
> > >> before
> > >> we bless a generic hsa interface I really want to see some implementation
> > >> from a different vendor (i.e. nvdidia or intel) using the same ioctls.
> > >> Otherwise we just repeat history and I'm not terribly inclined to keep on
> > >> cleanup up cruft forever - one drm legacy is enough ;-)
> > >>
> > >> Jesse is the guy from our side to talk to about this.
> > >> -Daniel
> > >
> > > I am not worried about that side, the hsa foundation has pretty strict
> > > guidelines on what is hsa compliant hardware ie the hw needs to understand
> > > the pm4 packet format of radeon (well small subset of it). But of course
> > > this require hsa compliant hardware and from member i am guessing ARM 
> > > Mali,
> > > ImgTech, Qualcomm, ... so unless Intel and NVidia joins hsa you will not
> > > see it for those hardware.
> > >
> > > So yes for once same ioctl would apply to different hardware. The only 
> > > things
> > > that is different is the shader isa. The hsafoundation site has some pdf
> > > explaining all that but someone thought that slideshare would be a good 
> > > idea
> > > personnaly i would not register to any of the website just to get the pdf.
> > >
> > > So to sumup i am ok with having a new device file that present uniform set
> > > of ioctl. It would actualy be lot easier for userspace, just open this fix
> > > device file and ask for list of compliant hardware.
> > >
> > > Then radeon kernel driver would register itself as a provider. So all 
> > > ioctl
> > > decoding marshalling would be share which makes sense.
> > 
> > There's also the other side namely that preparing the cp ring in
> > userspace and submitting the entire pile through a doorbell to the hw
> > scheduler isn't really hsa exclusive. And for a solid platform with
> > seamless gpu/cpu integration that means we need standard ways to set
> > gpu context priorities and get at useful stats like gpu time used by a
> > given context.
> > 
> > To get there I guess intel/nvidia need to reuse the hsa subsystem with
> > the command submission adjusted a bit. Kinda like drm where kms and
> > buffer sharing is common and cs driver specific.
> 
> HSA module would be for HSA compliant hardware and thus hardware would
> need to follow HSA specification which again is pretty clear on what
> the hardware need to provide. So if Intel and NVidia wants to join HSA
> i am sure they would be welcome, the more the merrier :)
> 
> So i would not block HSA kernel ioctl design in order to please non HSA
> hardware especialy if at this point in time nor Intel or NVidia can
> share anything concret on the design and how this things should be setup
> for there hardware.
> 
> When Intel or NVidia present their own API they should provide their
> own set of ioctl through their own platform.

Yeah things are different enough that a uniform ioctl doesn't make
sense.  If/when all the vendors decide on a single standard, we can use
that, but until then I don't see a nice way to share our doorbell &
submission scheme with HSA, and I assume nvidia is the same.

Using HSA as a basis for non-HSA systems seems like it would add a lot
of complexity, since non-HSA hardware would have to intercept the queue
writes and manage the submission requests etc as bytecodes in the
kernel driver, or maybe as a shim layer library that wraps that stuff.

Probably not worth the effort given that the command sets themselves
are all custom as well, driven by specific user level drivers like GL,
CL, and libva.

-- 
Jesse Barnes, Intel Open Source Technology Center


[PATCH 00/83] AMD HSA kernel driver

2014-07-16 Thread Daniel Vetter
On Wed, Jul 16, 2014 at 10:52:56AM -0400, Jerome Glisse wrote:
> On Wed, Jul 16, 2014 at 10:27:42AM +0200, Daniel Vetter wrote:
> > On Tue, Jul 15, 2014 at 8:04 PM, Jerome Glisse  
> > wrote:
> > >> Yes although it can be skipped on most systems. We figured that topology
> > >> needed to cover everything that would be handled by a single OS image, so
> > >> in a NUMA system it would need to cover all the CPUs. I think that is 
> > >> still
> > >> the right scope, do you agree ?
> > >
> > > I think it is a idea to duplicate cpu. I would rather have each device
> > > give its afinity against each cpu and for cpu just keep the existing
> > > kernel api that expose this through sysfs iirc.
> > 
> > It's all there already if we fix up the hsa dev-node model to expose
> > one dev node per underlying device instead of one for everything:
> > - cpus already expose the full numa topology in sysfs
> > - pci devices have a numa_node file in sysfs to display the link
> > - we can easily add similar stuff for platform devices on arm socs
> > without pci devices.
> > 
> > Then the only thing userspace needs to do is follow the device link in
> > the hsa instance node in sysfs and we have all the information
> > exposed. Iff we expose one hsa driver instance to userspace per
> > physical device (which is the normal linux device driver model
> > anyway).
> > 
> > I don't see a need to add anything hsa specific here at all (well
> > maybe some description of the cache architecture on the hsa device
> > itself, the spec seems to have provisions for that).
> > -Daniel
> 
> What is HSA specific is userspace command queue in form of common ring
> buffer execution queue all sharing common packet format. So yes i see
> a reason for an HSA class that provide common ioctl through one dev file
> per device. Note that i am not a fan of userspace command queue given
> that linux ioctl overhead is small and having kernel do stuff would
> allow for really "infinite" number of userspace context while right
> now limit is DOORBELL_APERTURE_SIZE/PAGE_SIZE.
> 
> No, CPU should not be included, neither should the numa topology of
> device. And yes all numa topology should use existing kernel interface.
> I however understand that a second GPU specific topology might make
> sense ie if you have specialize link btw some discrete GPU.
> 
> So if Intel wants to join the HSA foundation fine, but unless you are
> ready to implement what is needed i do not see the value of forcing
> your wish on another group that is trying to standardize something.

You're mixing up my replies ;-) This was really just a comment on the
proposed hsa interfaces for exposing the topology - we already have all
the stuff exposed in sysfs for cpus and pci devices, so exposing this
again through a hsa specific interface doesn't make much sense imo.

What intel does or not does is completely irrelevant for my comment. I.e.
I've written the above with my drm hacker hat on, not with my intel hat
on.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[PATCH 00/83] AMD HSA kernel driver

2014-07-16 Thread Greg KH
On Wed, Jul 16, 2014 at 10:21:14AM +0200, Daniel Vetter wrote:
> On Tue, Jul 15, 2014 at 7:53 PM, Bridgman, John  
> wrote:
> [snip away the discussion about hsa device discover, I'm hijacking
> this thread just for the event/fence stuff here.]
> 
> > ... There's an event mechanism still to come - mostly for communicating 
> > fences and shader interrupts back to userspace, but also used for "device 
> > change" notifications, so no polling of sysfs.
> 
> That could would be interesting. On i915 my plan is to internally use
> the recently added struct fence from Maarten. For the external
> interface for userspace that wants explicit control over fences I'm
> leaning towards polishing the android syncpt stuff (currently in
> staging). But in any case I _really_ want to avoid that we end up with
> multiple different and incompatible explicit fencing interfaces on
> linux.

I agree, and I'll say it stronger than that, we WILL NOT have different
and incompatible fencing interfaces in the kernel.  That way lies
madness.

John, take a look at what is now in linux-next, it should provide what
you need here, right?

thanks,

greg k-h


[PATCH 00/83] AMD HSA kernel driver

2014-07-16 Thread Jerome Glisse
On Wed, Jul 16, 2014 at 10:27:42AM +0200, Daniel Vetter wrote:
> On Tue, Jul 15, 2014 at 8:04 PM, Jerome Glisse  wrote:
> >> Yes although it can be skipped on most systems. We figured that topology
> >> needed to cover everything that would be handled by a single OS image, so
> >> in a NUMA system it would need to cover all the CPUs. I think that is still
> >> the right scope, do you agree ?
> >
> > I think it is a idea to duplicate cpu. I would rather have each device
> > give its afinity against each cpu and for cpu just keep the existing
> > kernel api that expose this through sysfs iirc.
> 
> It's all there already if we fix up the hsa dev-node model to expose
> one dev node per underlying device instead of one for everything:
> - cpus already expose the full numa topology in sysfs
> - pci devices have a numa_node file in sysfs to display the link
> - we can easily add similar stuff for platform devices on arm socs
> without pci devices.
> 
> Then the only thing userspace needs to do is follow the device link in
> the hsa instance node in sysfs and we have all the information
> exposed. Iff we expose one hsa driver instance to userspace per
> physical device (which is the normal linux device driver model
> anyway).
> 
> I don't see a need to add anything hsa specific here at all (well
> maybe some description of the cache architecture on the hsa device
> itself, the spec seems to have provisions for that).
> -Daniel

What is HSA specific is userspace command queue in form of common ring
buffer execution queue all sharing common packet format. So yes i see
a reason for an HSA class that provide common ioctl through one dev file
per device. Note that i am not a fan of userspace command queue given
that linux ioctl overhead is small and having kernel do stuff would
allow for really "infinite" number of userspace context while right
now limit is DOORBELL_APERTURE_SIZE/PAGE_SIZE.

No, CPU should not be included, neither should the numa topology of
device. And yes all numa topology should use existing kernel interface.
I however understand that a second GPU specific topology might make
sense ie if you have specialize link btw some discrete GPU.

So if Intel wants to join the HSA foundation fine, but unless you are
ready to implement what is needed i do not see the value of forcing
your wish on another group that is trying to standardize something.

Cheers,
J?r?me


[PATCH 00/83] AMD HSA kernel driver

2014-07-16 Thread Daniel Vetter
On Tue, Jul 15, 2014 at 8:04 PM, Jerome Glisse  wrote:
>> Yes although it can be skipped on most systems. We figured that topology
>> needed to cover everything that would be handled by a single OS image, so
>> in a NUMA system it would need to cover all the CPUs. I think that is still
>> the right scope, do you agree ?
>
> I think it is a idea to duplicate cpu. I would rather have each device
> give its afinity against each cpu and for cpu just keep the existing
> kernel api that expose this through sysfs iirc.

It's all there already if we fix up the hsa dev-node model to expose
one dev node per underlying device instead of one for everything:
- cpus already expose the full numa topology in sysfs
- pci devices have a numa_node file in sysfs to display the link
- we can easily add similar stuff for platform devices on arm socs
without pci devices.

Then the only thing userspace needs to do is follow the device link in
the hsa instance node in sysfs and we have all the information
exposed. Iff we expose one hsa driver instance to userspace per
physical device (which is the normal linux device driver model
anyway).

I don't see a need to add anything hsa specific here at all (well
maybe some description of the cache architecture on the hsa device
itself, the spec seems to have provisions for that).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[PATCH 00/83] AMD HSA kernel driver

2014-07-16 Thread Daniel Vetter
On Tue, Jul 15, 2014 at 7:53 PM, Bridgman, John  
wrote:
[snip away the discussion about hsa device discover, I'm hijacking
this thread just for the event/fence stuff here.]

> ... There's an event mechanism still to come - mostly for communicating 
> fences and shader interrupts back to userspace, but also used for "device 
> change" notifications, so no polling of sysfs.

That could would be interesting. On i915 my plan is to internally use
the recently added struct fence from Maarten. For the external
interface for userspace that wants explicit control over fences I'm
leaning towards polishing the android syncpt stuff (currently in
staging). But in any case I _really_ want to avoid that we end up with
multiple different and incompatible explicit fencing interfaces on
linux.

Adding relevant people.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[PATCH 00/83] AMD HSA kernel driver

2014-07-15 Thread Bridgman, John


>-Original Message-
>From: Jerome Glisse [mailto:j.glisse at gmail.com]
>Sent: Tuesday, July 15, 2014 1:37 PM
>To: Bridgman, John
>Cc: Dave Airlie; Christian K?nig; Lewycky, Andrew; linux-
>kernel at vger.kernel.org; dri-devel at lists.freedesktop.org; Deucher,
>Alexander; akpm at linux-foundation.org
>Subject: Re: [PATCH 00/83] AMD HSA kernel driver
>
>On Tue, Jul 15, 2014 at 05:06:56PM +, Bridgman, John wrote:
>> >From: Dave Airlie [mailto:airlied at gmail.com]
>> >Sent: Tuesday, July 15, 2014 12:35 AM
>> >To: Christian K?nig
>> >Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux-
>> >kernel at vger.kernel.org; dri-devel at lists.freedesktop.org; Deucher,
>> >Alexander; akpm at linux-foundation.org
>> >Subject: Re: [PATCH 00/83] AMD HSA kernel driver
>> >
>> >On 14 July 2014 18:37, Christian K?nig  wrote:
>> >>> I vote for HSA module that expose ioctl and is an intermediary
>> >>> with the kernel driver that handle the hardware. This gives a
>> >>> single point for HSA hardware and yes this enforce things for any
>> >>> hardware
>> >manufacturer.
>> >>> I am more than happy to tell them that this is it and nothing else
>> >>> if they want to get upstream.
>> >>
>> >> I think we should still discuss this single point of entry a bit more.
>> >>
>> >> Just to make it clear the plan is to expose all physical HSA
>> >> capable devices through a single /dev/hsa device node to userspace.
>> >
>> >This is why we don't design kernel interfaces in secret foundations,
>> >and expect anyone to like them.
>>
>> Understood and agree. In this case though this isn't a cross-vendor
>> interface designed by a secret committee, it's supposed to be more of
>> an inoffensive little single-vendor interface designed *for* a secret
>> committee. I'm hoping that's better ;)
>>
>> >
>> >So before we go any further, how is this stuff planned to work for
>> >multiple GPUs/accelerators?
>>
>> Three classes of "multiple" :
>>
>> 1. Single CPU with IOMMUv2 and multiple GPUs:
>>
>> - all devices accessible via /dev/kfd
>> - topology information identifies CPU + GPUs, each has "node ID" at
>> top of userspace API, "global ID" at user/kernel interface  (don't
>> think we've implemented CPU part yet though)
>> - userspace builds snapshot from sysfs info & exposes to HSAIL
>> runtime, which in turn exposes the "standard" API
>
>This is why i do not like the sysfs approach, it would be lot nicer to have
>device file per provider and thus hsail can listen on device file event and
>discover if hardware is vanishing or appearing. Periodicaly going over sysfs
>files is not the right way to do that.

Agree that wouldn't be good. There's an event mechanism still to come - mostly 
for communicating fences and shader interrupts back to userspace, but also used 
for "device change" notifications, so no polling of sysfs.

>
>> - kfd sets up ATC aperture so GPUs can access system RAM via IOMMUv2
>> (fast for APU, relatively less so for dGPU over PCIE)
>> - to-be-added memory operations allow allocation & residency control
>> (within existing gfx driver limits) of buffers in VRAM & carved-out
>> system RAM
>> - queue operations specify a node ID to userspace library, which
>> translates to "global ID" before calling kfd
>>
>> 2. Multiple CPUs connected via fabric (eg HyperTransport) each with 0 or
>more GPUs:
>>
>> - topology information exposes CPUs & GPUs, along with affinity info
>> showing what is connected to what
>> - everything else works as in (1) above
>>
>
>This is suppose to be part of HSA ? This is lot broader than i thought.

Yes although it can be skipped on most systems. We figured that topology needed 
to cover everything that would be handled by a single OS image, so in a NUMA 
system it would need to cover all the CPUs. I think that is still the right 
scope, do you agree ?

>
>> 3. Multiple CPUs not connected via fabric (eg a blade server) each
>> with 0 or more GPUs
>>
>> - no attempt to cover this with HSA topology, each CPU and associated
>> GPUs is accessed independently via separate /dev/kfd instances
>>
>> >
>> >Do we have a userspace to exercise this interface so we can see how
>> >such a thing would look?
>>
>> Yes -- initial IP review done, legal stuff done, sanitizing WIP,
>> hoping for final approval this week
>>
>> There's a separate test harness to exercise the userspace lib calls,
>> haven't started IP review or sanitizing for that but legal stuff is
>> done
>>
>> >
>> >Dave.


[PATCH 00/83] AMD HSA kernel driver

2014-07-15 Thread Bridgman, John


>-Original Message-
>From: dri-devel [mailto:dri-devel-bounces at lists.freedesktop.org] On Behalf
>Of Bridgman, John
>Sent: Tuesday, July 15, 2014 1:07 PM
>To: Dave Airlie; Christian K?nig
>Cc: Lewycky, Andrew; linux-kernel at vger.kernel.org; dri-
>devel at lists.freedesktop.org; Deucher, Alexander; akpm at linux-
>foundation.org
>Subject: RE: [PATCH 00/83] AMD HSA kernel driver
>
>
>
>>-Original Message-
>>From: Dave Airlie [mailto:airlied at gmail.com]
>>Sent: Tuesday, July 15, 2014 12:35 AM
>>To: Christian K?nig
>>Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux-
>>kernel at vger.kernel.org; dri-devel at lists.freedesktop.org; Deucher,
>>Alexander; akpm at linux-foundation.org
>>Subject: Re: [PATCH 00/83] AMD HSA kernel driver
>>
>>On 14 July 2014 18:37, Christian K?nig  wrote:
>>>> I vote for HSA module that expose ioctl and is an intermediary with
>>>> the kernel driver that handle the hardware. This gives a single
>>>> point for HSA hardware and yes this enforce things for any hardware
>>manufacturer.
>>>> I am more than happy to tell them that this is it and nothing else
>>>> if they want to get upstream.
>>>
>>> I think we should still discuss this single point of entry a bit more.
>>>
>>> Just to make it clear the plan is to expose all physical HSA capable
>>> devices through a single /dev/hsa device node to userspace.
>>
>>This is why we don't design kernel interfaces in secret foundations,
>>and expect anyone to like them.
>
>Understood and agree. In this case though this isn't a cross-vendor interface
>designed by a secret committee, it's supposed to be more of an inoffensive
>little single-vendor interface designed *for* a secret committee. I'm hoping
>that's better ;)
>
>>
>>So before we go any further, how is this stuff planned to work for
>>multiple GPUs/accelerators?
>
>Three classes of "multiple" :
>
>1. Single CPU with IOMMUv2 and multiple GPUs:
>
>- all devices accessible via /dev/kfd
>- topology information identifies CPU + GPUs, each has "node ID" at top of
>userspace API, "global ID" at user/kernel interface  (don't think we've
>implemented CPU part yet though)
>- userspace builds snapshot from sysfs info & exposes to HSAIL runtime,
>which in turn exposes the "standard" API
>- kfd sets up ATC aperture so GPUs can access system RAM via IOMMUv2 (fast
>for APU, relatively less so for dGPU over PCIE)
>- to-be-added memory operations allow allocation & residency control
>(within existing gfx driver limits) of buffers in VRAM & carved-out system
>RAM
>- queue operations specify a node ID to userspace library, which translates to
>"global ID" before calling kfd
>
>2. Multiple CPUs connected via fabric (eg HyperTransport) each with 0 or
>more GPUs:
>
>- topology information exposes CPUs & GPUs, along with affinity info
>showing what is connected to what
>- everything else works as in (1) above

This is probably a good point to stress that HSA topology is only intended as 
an OS-independent way of communicating system info up to higher levels of the 
HSA stack, not as a new and competing way to *manage* system properties inside 
Linux or any other OS.

>
>3. Multiple CPUs not connected via fabric (eg a blade server) each with 0 or
>more GPUs
>
>- no attempt to cover this with HSA topology, each CPU and associated GPUs
>is accessed independently via separate /dev/kfd instances
>
>>
>>Do we have a userspace to exercise this interface so we can see how
>>such a thing would look?
>
>Yes -- initial IP review done, legal stuff done, sanitizing WIP, hoping for 
>final
>approval this week
>
>There's a separate test harness to exercise the userspace lib calls, haven't
>started IP review or sanitizing for that but legal stuff is done
>
>>
>>Dave.
>___
>dri-devel mailing list
>dri-devel at lists.freedesktop.org
>http://lists.freedesktop.org/mailman/listinfo/dri-devel


[PATCH 00/83] AMD HSA kernel driver

2014-07-15 Thread Bridgman, John


>-Original Message-
>From: Dave Airlie [mailto:airlied at gmail.com]
>Sent: Tuesday, July 15, 2014 12:35 AM
>To: Christian K?nig
>Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux-
>kernel at vger.kernel.org; dri-devel at lists.freedesktop.org; Deucher,
>Alexander; akpm at linux-foundation.org
>Subject: Re: [PATCH 00/83] AMD HSA kernel driver
>
>On 14 July 2014 18:37, Christian K?nig  wrote:
>>> I vote for HSA module that expose ioctl and is an intermediary with
>>> the kernel driver that handle the hardware. This gives a single point
>>> for HSA hardware and yes this enforce things for any hardware
>manufacturer.
>>> I am more than happy to tell them that this is it and nothing else if
>>> they want to get upstream.
>>
>> I think we should still discuss this single point of entry a bit more.
>>
>> Just to make it clear the plan is to expose all physical HSA capable
>> devices through a single /dev/hsa device node to userspace.
>
>This is why we don't design kernel interfaces in secret foundations, and
>expect anyone to like them.

Understood and agree. In this case though this isn't a cross-vendor interface 
designed by a secret committee, it's supposed to be more of an inoffensive 
little single-vendor interface designed *for* a secret committee. I'm hoping 
that's better ;)

>
>So before we go any further, how is this stuff planned to work for multiple
>GPUs/accelerators?

Three classes of "multiple" :

1. Single CPU with IOMMUv2 and multiple GPUs:

- all devices accessible via /dev/kfd
- topology information identifies CPU + GPUs, each has "node ID" at top of 
userspace API, "global ID" at user/kernel interface
 (don't think we've implemented CPU part yet though)
- userspace builds snapshot from sysfs info & exposes to HSAIL runtime, which 
in turn exposes the "standard" API
- kfd sets up ATC aperture so GPUs can access system RAM via IOMMUv2 (fast for 
APU, relatively less so for dGPU over PCIE)
- to-be-added memory operations allow allocation & residency control (within 
existing gfx driver limits) of buffers in VRAM & carved-out system RAM
- queue operations specify a node ID to userspace library, which translates to 
"global ID" before calling kfd

2. Multiple CPUs connected via fabric (eg HyperTransport) each with 0 or more 
GPUs:

- topology information exposes CPUs & GPUs, along with affinity info showing 
what is connected to what
- everything else works as in (1) above

3. Multiple CPUs not connected via fabric (eg a blade server) each with 0 or 
more GPUs

- no attempt to cover this with HSA topology, each CPU and associated GPUs is 
accessed independently via separate /dev/kfd instances

>
>Do we have a userspace to exercise this interface so we can see how such a
>thing would look?

Yes -- initial IP review done, legal stuff done, sanitizing WIP, hoping for 
final approval this week

There's a separate test harness to exercise the userspace lib calls, haven't 
started IP review or sanitizing for that but legal stuff is done

>
>Dave.


[PATCH 00/83] AMD HSA kernel driver

2014-07-15 Thread Dave Airlie
On 14 July 2014 18:37, Christian K?nig  wrote:
>> I vote for HSA module that expose ioctl and is an intermediary with the
>> kernel driver that handle the hardware. This gives a single point for
>> HSA hardware and yes this enforce things for any hardware manufacturer.
>> I am more than happy to tell them that this is it and nothing else if
>> they want to get upstream.
>
> I think we should still discuss this single point of entry a bit more.
>
> Just to make it clear the plan is to expose all physical HSA capable devices
> through a single /dev/hsa device node to userspace.

This is why we don't design kernel interfaces in secret foundations,
and expect anyone to like them.

So before we go any further, how is this stuff planned to work for
multiple GPUs/accelerators?

Do we have a userspace to exercise this interface so we can see how
such a thing would look?

Dave.


[PATCH 00/83] AMD HSA kernel driver

2014-07-15 Thread Jerome Glisse
On Tue, Jul 15, 2014 at 05:53:32PM +, Bridgman, John wrote:
> >From: Jerome Glisse [mailto:j.glisse at gmail.com]
> >Sent: Tuesday, July 15, 2014 1:37 PM
> >To: Bridgman, John
> >Cc: Dave Airlie; Christian K?nig; Lewycky, Andrew; linux-
> >kernel at vger.kernel.org; dri-devel at lists.freedesktop.org; Deucher,
> >Alexander; akpm at linux-foundation.org
> >Subject: Re: [PATCH 00/83] AMD HSA kernel driver
> >
> >On Tue, Jul 15, 2014 at 05:06:56PM +, Bridgman, John wrote:
> >> >From: Dave Airlie [mailto:airlied at gmail.com]
> >> >Sent: Tuesday, July 15, 2014 12:35 AM
> >> >To: Christian K?nig
> >> >Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux-
> >> >kernel at vger.kernel.org; dri-devel at lists.freedesktop.org; Deucher,
> >> >Alexander; akpm at linux-foundation.org
> >> >Subject: Re: [PATCH 00/83] AMD HSA kernel driver
> >> >
> >> >On 14 July 2014 18:37, Christian K?nig  wrote:
> >> >>> I vote for HSA module that expose ioctl and is an intermediary
> >> >>> with the kernel driver that handle the hardware. This gives a
> >> >>> single point for HSA hardware and yes this enforce things for any
> >> >>> hardware
> >> >manufacturer.
> >> >>> I am more than happy to tell them that this is it and nothing else
> >> >>> if they want to get upstream.
> >> >>
> >> >> I think we should still discuss this single point of entry a bit more.
> >> >>
> >> >> Just to make it clear the plan is to expose all physical HSA
> >> >> capable devices through a single /dev/hsa device node to userspace.
> >> >
> >> >This is why we don't design kernel interfaces in secret foundations,
> >> >and expect anyone to like them.
> >>
> >> Understood and agree. In this case though this isn't a cross-vendor
> >> interface designed by a secret committee, it's supposed to be more of
> >> an inoffensive little single-vendor interface designed *for* a secret
> >> committee. I'm hoping that's better ;)
> >>
> >> >
> >> >So before we go any further, how is this stuff planned to work for
> >> >multiple GPUs/accelerators?
> >>
> >> Three classes of "multiple" :
> >>
> >> 1. Single CPU with IOMMUv2 and multiple GPUs:
> >>
> >> - all devices accessible via /dev/kfd
> >> - topology information identifies CPU + GPUs, each has "node ID" at
> >> top of userspace API, "global ID" at user/kernel interface  (don't
> >> think we've implemented CPU part yet though)
> >> - userspace builds snapshot from sysfs info & exposes to HSAIL
> >> runtime, which in turn exposes the "standard" API
> >
> >This is why i do not like the sysfs approach, it would be lot nicer to have
> >device file per provider and thus hsail can listen on device file event and
> >discover if hardware is vanishing or appearing. Periodicaly going over sysfs
> >files is not the right way to do that.
> 
> Agree that wouldn't be good. There's an event mechanism still to come - mostly
> for communicating fences and shader interrupts back to userspace, but also 
> used
> for "device change" notifications, so no polling of sysfs.
> 

My point being, do not use sysfs, use /dev/hsa/device* and have hsail listen on
file event on /dev/hsa/ directory. The hsail would be inform of new device and
of device that are unloaded. It would do a first pass to open each device file
and get there capabilities through standardize ioctl.

Thought maybe sysfs is ok given than cpu numa is expose through sysfs

> >
> >> - kfd sets up ATC aperture so GPUs can access system RAM via IOMMUv2
> >> (fast for APU, relatively less so for dGPU over PCIE)
> >> - to-be-added memory operations allow allocation & residency control
> >> (within existing gfx driver limits) of buffers in VRAM & carved-out
> >> system RAM
> >> - queue operations specify a node ID to userspace library, which
> >> translates to "global ID" before calling kfd
> >>
> >> 2. Multiple CPUs connected via fabric (eg HyperTransport) each with 0 or
> >more GPUs:
> >>
> >> - topology information exposes CPUs & GPUs, along with affinity info
> >> showing what is connected to what
> >> - everything else works as in (1) above
> >>
> >
> >This is suppose to be part of HSA ? This is lot broader than i thoug

[PATCH 00/83] AMD HSA kernel driver

2014-07-15 Thread Jerome Glisse
On Tue, Jul 15, 2014 at 05:06:56PM +, Bridgman, John wrote:
> >From: Dave Airlie [mailto:airlied at gmail.com]
> >Sent: Tuesday, July 15, 2014 12:35 AM
> >To: Christian K?nig
> >Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux-
> >kernel at vger.kernel.org; dri-devel at lists.freedesktop.org; Deucher,
> >Alexander; akpm at linux-foundation.org
> >Subject: Re: [PATCH 00/83] AMD HSA kernel driver
> >
> >On 14 July 2014 18:37, Christian K?nig  wrote:
> >>> I vote for HSA module that expose ioctl and is an intermediary with
> >>> the kernel driver that handle the hardware. This gives a single point
> >>> for HSA hardware and yes this enforce things for any hardware
> >manufacturer.
> >>> I am more than happy to tell them that this is it and nothing else if
> >>> they want to get upstream.
> >>
> >> I think we should still discuss this single point of entry a bit more.
> >>
> >> Just to make it clear the plan is to expose all physical HSA capable
> >> devices through a single /dev/hsa device node to userspace.
> >
> >This is why we don't design kernel interfaces in secret foundations, and
> >expect anyone to like them.
> 
> Understood and agree. In this case though this isn't a cross-vendor interface 
> designed by a secret committee, it's supposed to be more of an inoffensive 
> little single-vendor interface designed *for* a secret committee. I'm hoping 
> that's better ;)
> 
> >
> >So before we go any further, how is this stuff planned to work for multiple
> >GPUs/accelerators?
> 
> Three classes of "multiple" :
> 
> 1. Single CPU with IOMMUv2 and multiple GPUs:
> 
> - all devices accessible via /dev/kfd
> - topology information identifies CPU + GPUs, each has "node ID" at top of 
> userspace API, "global ID" at user/kernel interface
>  (don't think we've implemented CPU part yet though)
> - userspace builds snapshot from sysfs info & exposes to HSAIL runtime, which 
> in turn exposes the "standard" API

This is why i do not like the sysfs approach, it would be lot nicer to have
device file per provider and thus hsail can listen on device file event and
discover if hardware is vanishing or appearing. Periodicaly going over sysfs
files is not the right way to do that.

> - kfd sets up ATC aperture so GPUs can access system RAM via IOMMUv2 (fast 
> for APU, relatively less so for dGPU over PCIE)
> - to-be-added memory operations allow allocation & residency control (within 
> existing gfx driver limits) of buffers in VRAM & carved-out system RAM
> - queue operations specify a node ID to userspace library, which translates 
> to "global ID" before calling kfd
> 
> 2. Multiple CPUs connected via fabric (eg HyperTransport) each with 0 or more 
> GPUs:
> 
> - topology information exposes CPUs & GPUs, along with affinity info showing 
> what is connected to what
> - everything else works as in (1) above
> 

This is suppose to be part of HSA ? This is lot broader than i thought.

> 3. Multiple CPUs not connected via fabric (eg a blade server) each with 0 or 
> more GPUs
> 
> - no attempt to cover this with HSA topology, each CPU and associated GPUs is 
> accessed independently via separate /dev/kfd instances
> 
> >
> >Do we have a userspace to exercise this interface so we can see how such a
> >thing would look?
> 
> Yes -- initial IP review done, legal stuff done, sanitizing WIP, hoping for 
> final approval this week
> 
> There's a separate test harness to exercise the userspace lib calls, haven't 
> started IP review or sanitizing for that but legal stuff is done
> 
> >
> >Dave.


[PATCH 00/83] AMD HSA kernel driver

2014-07-15 Thread Jerome Glisse
On Tue, Jul 15, 2014 at 02:35:19PM +1000, Dave Airlie wrote:
> On 14 July 2014 18:37, Christian K?nig  wrote:
> >> I vote for HSA module that expose ioctl and is an intermediary with the
> >> kernel driver that handle the hardware. This gives a single point for
> >> HSA hardware and yes this enforce things for any hardware manufacturer.
> >> I am more than happy to tell them that this is it and nothing else if
> >> they want to get upstream.
> >
> > I think we should still discuss this single point of entry a bit more.
> >
> > Just to make it clear the plan is to expose all physical HSA capable devices
> > through a single /dev/hsa device node to userspace.
> 
> This is why we don't design kernel interfaces in secret foundations,
> and expect anyone to like them.
> 

I think at this time this is unlikely to get into 3.17. But Christian had
point on having multiple device file. So something like /dev/hsa/*

> So before we go any further, how is this stuff planned to work for
> multiple GPUs/accelerators?

My understanding is that you create queue and each queue is associated
with a device. You can create several queue for same device and have
different priority btw queue.

Btw queue here means a ring buffer that understand a common set of pm4
packet.

> Do we have a userspace to exercise this interface so we can see how
> such a thing would look?

I think we need to wait a bit before freezing and accepting the kernel
api and see enough userspace bits to be confortable. Moreover if AMD
wants common API for HSA i also think that they at very least needs
there HSA partner to make public comment on the kernel API.

Cheers,
J?r?me




[PATCH 00/83] AMD HSA kernel driver

2014-07-14 Thread Christian König
> I vote for HSA module that expose ioctl and is an intermediary with the
> kernel driver that handle the hardware. This gives a single point for
> HSA hardware and yes this enforce things for any hardware manufacturer.
> I am more than happy to tell them that this is it and nothing else if
> they want to get upstream.
I think we should still discuss this single point of entry a bit more.

Just to make it clear the plan is to expose all physical HSA capable 
devices through a single /dev/hsa device node to userspace.

While this obviously makes device enumeration much easier it's still a 
quite hard break with Unix traditions. Essentially we now expose all 
devices of one kind though a single device node instead of creating 
independent nodes for each physical or logical device.

What makes it even worse is that we want to expose different drivers 
though the same device node.

Because of this any effort of a system administrator to limit access to 
HSA is reduced to an on/off decision. It's simply not possible any more 
to apply simple file system access semantics to individual hardware devices.

Just imaging you are an administrator with a bunch of different compute 
cards in a system and you want to restrict access of one off them 
because it's faulty or has a security problem or something like this. Or 
you have several hardware device and want to assign each of them a 
distinct container.

Just some thoughts,
Christian.

Am 13.07.2014 18:49, schrieb Jerome Glisse:
> On Sun, Jul 13, 2014 at 03:34:12PM +, Bridgman, John wrote:
>>> From: Jerome Glisse [mailto:j.glisse at gmail.com]
>>> Sent: Saturday, July 12, 2014 11:56 PM
>>> To: Gabbay, Oded
>>> Cc: linux-kernel at vger.kernel.org; Bridgman, John; Deucher, Alexander;
>>> Lewycky, Andrew; joro at 8bytes.org; akpm at linux-foundation.org; dri-
>>> devel at lists.freedesktop.org; airlied at linux.ie; oded.gabbay at 
>>> gmail.com
>>> Subject: Re: [PATCH 00/83] AMD HSA kernel driver
>>>
>>> On Sat, Jul 12, 2014 at 09:55:49PM +, Gabbay, Oded wrote:
>>>> On Fri, 2014-07-11 at 17:18 -0400, Jerome Glisse wrote:
>>>>> On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
>>>>>>   On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
>>>>>>>   On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
>>>>>>>>This patch set implements a Heterogeneous System
>>>>>>>> Architecture
>>>>>>>>   (HSA) driver
>>>>>>>>for radeon-family GPUs.
>>>>>>>   This is just quick comments on few things. Given size of this,
>>>>>>> people  will need to have time to review things.
>>>>>>>>HSA allows different processor types (CPUs, DSPs, GPUs,
>>>>>>>> etc..) to
>>>>>>>>   share
>>>>>>>>system resources more effectively via HW features including
>>>>>>>> shared pageable
>>>>>>>>memory, userspace-accessible work queues, and platform-level
>>>>>>>> atomics. In
>>>>>>>>addition to the memory protection mechanisms in GPUVM and
>>>>>>>> IOMMUv2, the Sea
>>>>>>>>Islands family of GPUs also performs HW-level validation of
>>>>>>>> commands passed
>>>>>>>>in through the queues (aka rings).
>>>>>>>>The code in this patch set is intended to serve both as a
>>>>>>>> sample  driver for
>>>>>>>>other HSA-compatible hardware devices and as a production
>>>>>>>> driver  for
>>>>>>>>radeon-family processors. The code is architected to support
>>>>>>>> multiple CPUs
>>>>>>>>each with connected GPUs, although the current
>>>>>>>> implementation  focuses on a
>>>>>>>>single Kaveri/Berlin APU, and works alongside the existing
>>>>>>>> radeon  kernel
>>>>>>>>graphics driver (kgd).
>>>>>>>>AMD GPUs designed for use with HSA (Sea Islands and up)
>>>>>>>> share  some hardware
>>>>>>>>functionality between HSA compute and regular gfx/compute
>>>>>>>> (memory,
>>>>>>>>interrupts, registers), while other functionality has been
>>>>>>>> added
>>>>>>>>specifically for HSA compute  (hw sched

[PATCH 00/83] AMD HSA kernel driver

2014-07-13 Thread Bridgman, John


>-Original Message-
>From: Jerome Glisse [mailto:j.glisse at gmail.com]
>Sent: Saturday, July 12, 2014 11:56 PM
>To: Gabbay, Oded
>Cc: linux-kernel at vger.kernel.org; Bridgman, John; Deucher, Alexander;
>Lewycky, Andrew; joro at 8bytes.org; akpm at linux-foundation.org; dri-
>devel at lists.freedesktop.org; airlied at linux.ie; oded.gabbay at gmail.com
>Subject: Re: [PATCH 00/83] AMD HSA kernel driver
>
>On Sat, Jul 12, 2014 at 09:55:49PM +, Gabbay, Oded wrote:
>> On Fri, 2014-07-11 at 17:18 -0400, Jerome Glisse wrote:
>> > On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
>> > >  On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
>> > > >  On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
>> > > > >   This patch set implements a Heterogeneous System
>> > > > > Architecture
>> > > > >  (HSA) driver
>> > > > >   for radeon-family GPUs.
>> > > >  This is just quick comments on few things. Given size of this,
>> > > > people  will need to have time to review things.
>> > > > >   HSA allows different processor types (CPUs, DSPs, GPUs,
>> > > > > etc..) to
>> > > > >  share
>> > > > >   system resources more effectively via HW features including
>> > > > > shared pageable
>> > > > >   memory, userspace-accessible work queues, and platform-level
>> > > > > atomics. In
>> > > > >   addition to the memory protection mechanisms in GPUVM and
>> > > > > IOMMUv2, the Sea
>> > > > >   Islands family of GPUs also performs HW-level validation of
>> > > > > commands passed
>> > > > >   in through the queues (aka rings).
>> > > > >   The code in this patch set is intended to serve both as a
>> > > > > sample  driver for
>> > > > >   other HSA-compatible hardware devices and as a production
>> > > > > driver  for
>> > > > >   radeon-family processors. The code is architected to support
>> > > > > multiple CPUs
>> > > > >   each with connected GPUs, although the current
>> > > > > implementation  focuses on a
>> > > > >   single Kaveri/Berlin APU, and works alongside the existing
>> > > > > radeon  kernel
>> > > > >   graphics driver (kgd).
>> > > > >   AMD GPUs designed for use with HSA (Sea Islands and up)
>> > > > > share  some hardware
>> > > > >   functionality between HSA compute and regular gfx/compute
>> > > > > (memory,
>> > > > >   interrupts, registers), while other functionality has been
>> > > > > added
>> > > > >   specifically for HSA compute  (hw scheduler for virtualized
>> > > > > compute rings).
>> > > > >   All shared hardware is owned by the radeon graphics driver,
>> > > > > and  an interface
>> > > > >   between kfd and kgd allows the kfd to make use of those
>> > > > > shared  resources,
>> > > > >   while HSA-specific functionality is managed directly by kfd
>> > > > > by  submitting
>> > > > >   packets into an HSA-specific command queue (the "HIQ").
>> > > > >   During kfd module initialization a char device node
>> > > > > (/dev/kfd) is
>> > > > >  created
>> > > > >   (surviving until module exit), with ioctls for queue
>> > > > > creation &  management,
>> > > > >   and data structures are initialized for managing HSA device
>> > > > > topology.
>> > > > >   The rest of the initialization is driven by calls from the
>> > > > > radeon  kgd at
>> > > > >   the following points :
>> > > > >   - radeon_init (kfd_init)
>> > > > >   - radeon_exit (kfd_fini)
>> > > > >   - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>> > > > >   - radeon_driver_unload_kms (kfd_device_fini)
>> > > > >   During the probe and init processing per-device data
>> > > > > structures  are
>> > > > >   established which connect to the associated graphics kernel
>> > > > > driver. This
>> > > > >   information is exposed to userspace via sysfs, along with a
>>

[PATCH 00/83] AMD HSA kernel driver

2014-07-13 Thread Jerome Glisse
On Sun, Jul 13, 2014 at 03:34:12PM +, Bridgman, John wrote:
> >From: Jerome Glisse [mailto:j.glisse at gmail.com]
> >Sent: Saturday, July 12, 2014 11:56 PM
> >To: Gabbay, Oded
> >Cc: linux-kernel at vger.kernel.org; Bridgman, John; Deucher, Alexander;
> >Lewycky, Andrew; joro at 8bytes.org; akpm at linux-foundation.org; dri-
> >devel at lists.freedesktop.org; airlied at linux.ie; oded.gabbay at gmail.com
> >Subject: Re: [PATCH 00/83] AMD HSA kernel driver
> >
> >On Sat, Jul 12, 2014 at 09:55:49PM +, Gabbay, Oded wrote:
> >> On Fri, 2014-07-11 at 17:18 -0400, Jerome Glisse wrote:
> >> > On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
> >> > >  On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> >> > > >  On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> >> > > > >   This patch set implements a Heterogeneous System
> >> > > > > Architecture
> >> > > > >  (HSA) driver
> >> > > > >   for radeon-family GPUs.
> >> > > >  This is just quick comments on few things. Given size of this,
> >> > > > people  will need to have time to review things.
> >> > > > >   HSA allows different processor types (CPUs, DSPs, GPUs,
> >> > > > > etc..) to
> >> > > > >  share
> >> > > > >   system resources more effectively via HW features including
> >> > > > > shared pageable
> >> > > > >   memory, userspace-accessible work queues, and platform-level
> >> > > > > atomics. In
> >> > > > >   addition to the memory protection mechanisms in GPUVM and
> >> > > > > IOMMUv2, the Sea
> >> > > > >   Islands family of GPUs also performs HW-level validation of
> >> > > > > commands passed
> >> > > > >   in through the queues (aka rings).
> >> > > > >   The code in this patch set is intended to serve both as a
> >> > > > > sample  driver for
> >> > > > >   other HSA-compatible hardware devices and as a production
> >> > > > > driver  for
> >> > > > >   radeon-family processors. The code is architected to support
> >> > > > > multiple CPUs
> >> > > > >   each with connected GPUs, although the current
> >> > > > > implementation  focuses on a
> >> > > > >   single Kaveri/Berlin APU, and works alongside the existing
> >> > > > > radeon  kernel
> >> > > > >   graphics driver (kgd).
> >> > > > >   AMD GPUs designed for use with HSA (Sea Islands and up)
> >> > > > > share  some hardware
> >> > > > >   functionality between HSA compute and regular gfx/compute
> >> > > > > (memory,
> >> > > > >   interrupts, registers), while other functionality has been
> >> > > > > added
> >> > > > >   specifically for HSA compute  (hw scheduler for virtualized
> >> > > > > compute rings).
> >> > > > >   All shared hardware is owned by the radeon graphics driver,
> >> > > > > and  an interface
> >> > > > >   between kfd and kgd allows the kfd to make use of those
> >> > > > > shared  resources,
> >> > > > >   while HSA-specific functionality is managed directly by kfd
> >> > > > > by  submitting
> >> > > > >   packets into an HSA-specific command queue (the "HIQ").
> >> > > > >   During kfd module initialization a char device node
> >> > > > > (/dev/kfd) is
> >> > > > >  created
> >> > > > >   (surviving until module exit), with ioctls for queue
> >> > > > > creation &  management,
> >> > > > >   and data structures are initialized for managing HSA device
> >> > > > > topology.
> >> > > > >   The rest of the initialization is driven by calls from the
> >> > > > > radeon  kgd at
> >> > > > >   the following points :
> >> > > > >   - radeon_init (kfd_init)
> >> > > > >   - radeon_exit (kfd_fini)
> >> > > > >   - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >> > > > >   - radeon_driver_unload_kms (

[PATCH 00/83] AMD HSA kernel driver

2014-07-13 Thread Jerome Glisse
On Sun, Jul 13, 2014 at 11:42:58AM +0200, Daniel Vetter wrote:
> On Sat, Jul 12, 2014 at 6:49 PM, Jerome Glisse  wrote:
> >> Hm, so the hsa part is a completely new driver/subsystem, not just an
> >> additional ioctl tacked onto radoen? The history of drm is littered with
> >> "generic" ioctls that turned out to be useful for exactly one driver.
> >> Which is why _all_ the command submission is now done with driver-private
> >> ioctls.
> >>
> >> I'd be quite a bit surprised if that suddenly works differently, so before
> >> we bless a generic hsa interface I really want to see some implementation
> >> from a different vendor (i.e. nvdidia or intel) using the same ioctls.
> >> Otherwise we just repeat history and I'm not terribly inclined to keep on
> >> cleanup up cruft forever - one drm legacy is enough ;-)
> >>
> >> Jesse is the guy from our side to talk to about this.
> >> -Daniel
> >
> > I am not worried about that side, the hsa foundation has pretty strict
> > guidelines on what is hsa compliant hardware ie the hw needs to understand
> > the pm4 packet format of radeon (well small subset of it). But of course
> > this require hsa compliant hardware and from member i am guessing ARM Mali,
> > ImgTech, Qualcomm, ... so unless Intel and NVidia joins hsa you will not
> > see it for those hardware.
> >
> > So yes for once same ioctl would apply to different hardware. The only 
> > things
> > that is different is the shader isa. The hsafoundation site has some pdf
> > explaining all that but someone thought that slideshare would be a good idea
> > personnaly i would not register to any of the website just to get the pdf.
> >
> > So to sumup i am ok with having a new device file that present uniform set
> > of ioctl. It would actualy be lot easier for userspace, just open this fix
> > device file and ask for list of compliant hardware.
> >
> > Then radeon kernel driver would register itself as a provider. So all ioctl
> > decoding marshalling would be share which makes sense.
> 
> There's also the other side namely that preparing the cp ring in
> userspace and submitting the entire pile through a doorbell to the hw
> scheduler isn't really hsa exclusive. And for a solid platform with
> seamless gpu/cpu integration that means we need standard ways to set
> gpu context priorities and get at useful stats like gpu time used by a
> given context.
> 
> To get there I guess intel/nvidia need to reuse the hsa subsystem with
> the command submission adjusted a bit. Kinda like drm where kms and
> buffer sharing is common and cs driver specific.

HSA module would be for HSA compliant hardware and thus hardware would
need to follow HSA specification which again is pretty clear on what
the hardware need to provide. So if Intel and NVidia wants to join HSA
i am sure they would be welcome, the more the merrier :)

So i would not block HSA kernel ioctl design in order to please non HSA
hardware especialy if at this point in time nor Intel or NVidia can
share anything concret on the design and how this things should be setup
for there hardware.

When Intel or NVidia present their own API they should provide their
own set of ioctl through their own platform.

Cheers,
J?r?me Glisse


[PATCH 00/83] AMD HSA kernel driver

2014-07-13 Thread Daniel Vetter
On Sat, Jul 12, 2014 at 6:49 PM, Jerome Glisse  wrote:
>> Hm, so the hsa part is a completely new driver/subsystem, not just an
>> additional ioctl tacked onto radoen? The history of drm is littered with
>> "generic" ioctls that turned out to be useful for exactly one driver.
>> Which is why _all_ the command submission is now done with driver-private
>> ioctls.
>>
>> I'd be quite a bit surprised if that suddenly works differently, so before
>> we bless a generic hsa interface I really want to see some implementation
>> from a different vendor (i.e. nvdidia or intel) using the same ioctls.
>> Otherwise we just repeat history and I'm not terribly inclined to keep on
>> cleanup up cruft forever - one drm legacy is enough ;-)
>>
>> Jesse is the guy from our side to talk to about this.
>> -Daniel
>
> I am not worried about that side, the hsa foundation has pretty strict
> guidelines on what is hsa compliant hardware ie the hw needs to understand
> the pm4 packet format of radeon (well small subset of it). But of course
> this require hsa compliant hardware and from member i am guessing ARM Mali,
> ImgTech, Qualcomm, ... so unless Intel and NVidia joins hsa you will not
> see it for those hardware.
>
> So yes for once same ioctl would apply to different hardware. The only things
> that is different is the shader isa. The hsafoundation site has some pdf
> explaining all that but someone thought that slideshare would be a good idea
> personnaly i would not register to any of the website just to get the pdf.
>
> So to sumup i am ok with having a new device file that present uniform set
> of ioctl. It would actualy be lot easier for userspace, just open this fix
> device file and ask for list of compliant hardware.
>
> Then radeon kernel driver would register itself as a provider. So all ioctl
> decoding marshalling would be share which makes sense.

There's also the other side namely that preparing the cp ring in
userspace and submitting the entire pile through a doorbell to the hw
scheduler isn't really hsa exclusive. And for a solid platform with
seamless gpu/cpu integration that means we need standard ways to set
gpu context priorities and get at useful stats like gpu time used by a
given context.

To get there I guess intel/nvidia need to reuse the hsa subsystem with
the command submission adjusted a bit. Kinda like drm where kms and
buffer sharing is common and cs driver specific.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[PATCH 00/83] AMD HSA kernel driver

2014-07-13 Thread Jerome Glisse
On Sat, Jul 12, 2014 at 09:55:49PM +, Gabbay, Oded wrote:
> On Fri, 2014-07-11 at 17:18 -0400, Jerome Glisse wrote:
> > On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
> > >  On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> > > >  On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> > > > >   This patch set implements a Heterogeneous System Architecture
> > > > >  (HSA) driver
> > > > >   for radeon-family GPUs.
> > > >  This is just quick comments on few things. Given size of this, 
> > > > people
> > > >  will need to have time to review things.
> > > > >   HSA allows different processor types (CPUs, DSPs, GPUs, 
> > > > > etc..) to
> > > > >  share
> > > > >   system resources more effectively via HW features including
> > > > >  shared pageable
> > > > >   memory, userspace-accessible work queues, and platform-level
> > > > >  atomics. In
> > > > >   addition to the memory protection mechanisms in GPUVM and
> > > > >  IOMMUv2, the Sea
> > > > >   Islands family of GPUs also performs HW-level validation of
> > > > >  commands passed
> > > > >   in through the queues (aka rings).
> > > > >   The code in this patch set is intended to serve both as a 
> > > > > sample
> > > > >  driver for
> > > > >   other HSA-compatible hardware devices and as a production 
> > > > > driver
> > > > >  for
> > > > >   radeon-family processors. The code is architected to support
> > > > >  multiple CPUs
> > > > >   each with connected GPUs, although the current implementation
> > > > >  focuses on a
> > > > >   single Kaveri/Berlin APU, and works alongside the existing 
> > > > > radeon
> > > > >  kernel
> > > > >   graphics driver (kgd).
> > > > >   AMD GPUs designed for use with HSA (Sea Islands and up) share
> > > > >  some hardware
> > > > >   functionality between HSA compute and regular gfx/compute 
> > > > > (memory,
> > > > >   interrupts, registers), while other functionality has been 
> > > > > added
> > > > >   specifically for HSA compute  (hw scheduler for virtualized
> > > > >  compute rings).
> > > > >   All shared hardware is owned by the radeon graphics driver, 
> > > > > and
> > > > >  an interface
> > > > >   between kfd and kgd allows the kfd to make use of those 
> > > > > shared
> > > > >  resources,
> > > > >   while HSA-specific functionality is managed directly by kfd 
> > > > > by
> > > > >  submitting
> > > > >   packets into an HSA-specific command queue (the "HIQ").
> > > > >   During kfd module initialization a char device node 
> > > > > (/dev/kfd) is
> > > > >  created
> > > > >   (surviving until module exit), with ioctls for queue 
> > > > > creation &
> > > > >  management,
> > > > >   and data structures are initialized for managing HSA device
> > > > >  topology.
> > > > >   The rest of the initialization is driven by calls from the 
> > > > > radeon
> > > > >  kgd at
> > > > >   the following points :
> > > > >   - radeon_init (kfd_init)
> > > > >   - radeon_exit (kfd_fini)
> > > > >   - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> > > > >   - radeon_driver_unload_kms (kfd_device_fini)
> > > > >   During the probe and init processing per-device data 
> > > > > structures
> > > > >  are
> > > > >   established which connect to the associated graphics kernel
> > > > >  driver. This
> > > > >   information is exposed to userspace via sysfs, along with a
> > > > >  version number
> > > > >   allowing userspace to determine if a topology change has 
> > > > > occurred
> > > > >  while it
> > > > >   was reading from sysfs.
> > > > >   The interface between kfd and kgd also allows the kfd to 
> > > > > request
> > > > >  buffer
> > > > >   management services from kgd, and allows kgd to route 
> > > > > interrupt
> > > > >  requests to
> > > > >   kfd code since the interrupt block is shared between regular
> > > > >   graphics/compute and HSA compute subsystems in the GPU.
> > > > >   The kfd code works with an open source usermode library
> > > > >  ("libhsakmt") which
> > > > >   is in the final stages of IP review and should be published 
> > > > > in a
> > > > >  separate
> > > > >   repo over the next few days.
> > > > >   The code operates in one of three modes, selectable via the
> > > > >  sched_policy
> > > > >   module parameter :
> > > > >   - sched_policy=0 uses a hardware scheduler running in the MEC
> > > > >  block within
> > > > >   CP, and allows oversubscription (more queues than HW slots)
> > > > >   - sched_policy=1 also uses HW scheduling but does not allow
> > > > >   oversubscription, so create_queue requests fail when we run 
> > > > > out
> > > > >  of HW slots
> > > > >   - sched_policy=2 does not use HW scheduling, so the driver
> > > > >  manually assigns
> > > > >   queues to HW slots by programming registers
> > > > >   The "no HW scheduling" option is for debug & new hardware 
> > > > > bringup
> > > > >  only, so
> > > > >   has less test coverage than the other options. Default in the
> > > > >  current 

[PATCH 00/83] AMD HSA kernel driver

2014-07-12 Thread Gabbay, Oded
On Fri, 2014-07-11 at 17:18 -0400, Jerome Glisse wrote:
> On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
> >  On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> > >  On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> > > >   This patch set implements a Heterogeneous System Architecture
> > > >  (HSA) driver
> > > >   for radeon-family GPUs.
> > >  This is just quick comments on few things. Given size of this, 
> > > people
> > >  will need to have time to review things.
> > > >   HSA allows different processor types (CPUs, DSPs, GPUs, 
> > > > etc..) to
> > > >  share
> > > >   system resources more effectively via HW features including
> > > >  shared pageable
> > > >   memory, userspace-accessible work queues, and platform-level
> > > >  atomics. In
> > > >   addition to the memory protection mechanisms in GPUVM and
> > > >  IOMMUv2, the Sea
> > > >   Islands family of GPUs also performs HW-level validation of
> > > >  commands passed
> > > >   in through the queues (aka rings).
> > > >   The code in this patch set is intended to serve both as a 
> > > > sample
> > > >  driver for
> > > >   other HSA-compatible hardware devices and as a production 
> > > > driver
> > > >  for
> > > >   radeon-family processors. The code is architected to support
> > > >  multiple CPUs
> > > >   each with connected GPUs, although the current implementation
> > > >  focuses on a
> > > >   single Kaveri/Berlin APU, and works alongside the existing 
> > > > radeon
> > > >  kernel
> > > >   graphics driver (kgd).
> > > >   AMD GPUs designed for use with HSA (Sea Islands and up) share
> > > >  some hardware
> > > >   functionality between HSA compute and regular gfx/compute 
> > > > (memory,
> > > >   interrupts, registers), while other functionality has been 
> > > > added
> > > >   specifically for HSA compute  (hw scheduler for virtualized
> > > >  compute rings).
> > > >   All shared hardware is owned by the radeon graphics driver, 
> > > > and
> > > >  an interface
> > > >   between kfd and kgd allows the kfd to make use of those 
> > > > shared
> > > >  resources,
> > > >   while HSA-specific functionality is managed directly by kfd 
> > > > by
> > > >  submitting
> > > >   packets into an HSA-specific command queue (the "HIQ").
> > > >   During kfd module initialization a char device node 
> > > > (/dev/kfd) is
> > > >  created
> > > >   (surviving until module exit), with ioctls for queue 
> > > > creation &
> > > >  management,
> > > >   and data structures are initialized for managing HSA device
> > > >  topology.
> > > >   The rest of the initialization is driven by calls from the 
> > > > radeon
> > > >  kgd at
> > > >   the following points :
> > > >   - radeon_init (kfd_init)
> > > >   - radeon_exit (kfd_fini)
> > > >   - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> > > >   - radeon_driver_unload_kms (kfd_device_fini)
> > > >   During the probe and init processing per-device data 
> > > > structures
> > > >  are
> > > >   established which connect to the associated graphics kernel
> > > >  driver. This
> > > >   information is exposed to userspace via sysfs, along with a
> > > >  version number
> > > >   allowing userspace to determine if a topology change has 
> > > > occurred
> > > >  while it
> > > >   was reading from sysfs.
> > > >   The interface between kfd and kgd also allows the kfd to 
> > > > request
> > > >  buffer
> > > >   management services from kgd, and allows kgd to route 
> > > > interrupt
> > > >  requests to
> > > >   kfd code since the interrupt block is shared between regular
> > > >   graphics/compute and HSA compute subsystems in the GPU.
> > > >   The kfd code works with an open source usermode library
> > > >  ("libhsakmt") which
> > > >   is in the final stages of IP review and should be published 
> > > > in a
> > > >  separate
> > > >   repo over the next few days.
> > > >   The code operates in one of three modes, selectable via the
> > > >  sched_policy
> > > >   module parameter :
> > > >   - sched_policy=0 uses a hardware scheduler running in the MEC
> > > >  block within
> > > >   CP, and allows oversubscription (more queues than HW slots)
> > > >   - sched_policy=1 also uses HW scheduling but does not allow
> > > >   oversubscription, so create_queue requests fail when we run 
> > > > out
> > > >  of HW slots
> > > >   - sched_policy=2 does not use HW scheduling, so the driver
> > > >  manually assigns
> > > >   queues to HW slots by programming registers
> > > >   The "no HW scheduling" option is for debug & new hardware 
> > > > bringup
> > > >  only, so
> > > >   has less test coverage than the other options. Default in the
> > > >  current code
> > > >   is "HW scheduling without oversubscription" since that is 
> > > > where
> > > >  we have the
> > > >   most test coverage but we expect to change the default to "HW
> > > >  scheduling
> > > >   with oversubscription" after further testing. This 
> > > > effectively
> > > 

[PATCH 00/83] AMD HSA kernel driver

2014-07-12 Thread Daniel Vetter
On Sat, Jul 12, 2014 at 11:24:49AM +0200, Christian K?nig wrote:
> Am 11.07.2014 23:18, schrieb Jerome Glisse:
> >On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
> >>On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> >>>On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
>   This patch set implements a Heterogeneous System Architecture
> (HSA) driver
>   for radeon-family GPUs.
> >>>This is just quick comments on few things. Given size of this, people
> >>>will need to have time to review things.
>   HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to
> share
>   system resources more effectively via HW features including
> shared pageable
>   memory, userspace-accessible work queues, and platform-level
> atomics. In
>   addition to the memory protection mechanisms in GPUVM and
> IOMMUv2, the Sea
>   Islands family of GPUs also performs HW-level validation of
> commands passed
>   in through the queues (aka rings).
>   The code in this patch set is intended to serve both as a sample
> driver for
>   other HSA-compatible hardware devices and as a production driver
> for
>   radeon-family processors. The code is architected to support
> multiple CPUs
>   each with connected GPUs, although the current implementation
> focuses on a
>   single Kaveri/Berlin APU, and works alongside the existing radeon
> kernel
>   graphics driver (kgd).
>   AMD GPUs designed for use with HSA (Sea Islands and up) share
> some hardware
>   functionality between HSA compute and regular gfx/compute (memory,
>   interrupts, registers), while other functionality has been added
>   specifically for HSA compute  (hw scheduler for virtualized
> compute rings).
>   All shared hardware is owned by the radeon graphics driver, and
> an interface
>   between kfd and kgd allows the kfd to make use of those shared
> resources,
>   while HSA-specific functionality is managed directly by kfd by
> submitting
>   packets into an HSA-specific command queue (the "HIQ").
>   During kfd module initialization a char device node (/dev/kfd) is
> created
>   (surviving until module exit), with ioctls for queue creation &
> management,
>   and data structures are initialized for managing HSA device
> topology.
>   The rest of the initialization is driven by calls from the radeon
> kgd at
>   the following points :
>   - radeon_init (kfd_init)
>   - radeon_exit (kfd_fini)
>   - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>   - radeon_driver_unload_kms (kfd_device_fini)
>   During the probe and init processing per-device data structures
> are
>   established which connect to the associated graphics kernel
> driver. This
>   information is exposed to userspace via sysfs, along with a
> version number
>   allowing userspace to determine if a topology change has occurred
> while it
>   was reading from sysfs.
>   The interface between kfd and kgd also allows the kfd to request
> buffer
>   management services from kgd, and allows kgd to route interrupt
> requests to
>   kfd code since the interrupt block is shared between regular
>   graphics/compute and HSA compute subsystems in the GPU.
>   The kfd code works with an open source usermode library
> ("libhsakmt") which
>   is in the final stages of IP review and should be published in a
> separate
>   repo over the next few days.
>   The code operates in one of three modes, selectable via the
> sched_policy
>   module parameter :
>   - sched_policy=0 uses a hardware scheduler running in the MEC
> block within
>   CP, and allows oversubscription (more queues than HW slots)
>   - sched_policy=1 also uses HW scheduling but does not allow
>   oversubscription, so create_queue requests fail when we run out
> of HW slots
>   - sched_policy=2 does not use HW scheduling, so the driver
> manually assigns
>   queues to HW slots by programming registers
>   The "no HW scheduling" option is for debug & new hardware bringup
> only, so
>   has less test coverage than the other options. Default in the
> current code
>   is "HW scheduling without oversubscription" since that is where
> we have the
>   most test coverage but we expect to change the default to "HW
> scheduling
>   with oversubscription" after further testing. This effectively
> removes the
>   HW limit on the number of work queues available to applications.
>   Programs running on the GPU are associated with an address space
> through the
>   VMID field, which is translated to a unique PASID at access time
> via a set
>   of 16 VMID-to-PASID mapping registers. The available VMIDs
> (currently 16)
>   are 

[PATCH 00/83] AMD HSA kernel driver

2014-07-12 Thread Jerome Glisse
On Sat, Jul 12, 2014 at 01:10:32PM +0200, Daniel Vetter wrote:
> On Sat, Jul 12, 2014 at 11:24:49AM +0200, Christian K?nig wrote:
> > Am 11.07.2014 23:18, schrieb Jerome Glisse:
> > >On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
> > >>On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> > >>>On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> >   This patch set implements a Heterogeneous System Architecture
> > (HSA) driver
> >   for radeon-family GPUs.
> > >>>This is just quick comments on few things. Given size of this, people
> > >>>will need to have time to review things.
> >   HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to
> > share
> >   system resources more effectively via HW features including
> > shared pageable
> >   memory, userspace-accessible work queues, and platform-level
> > atomics. In
> >   addition to the memory protection mechanisms in GPUVM and
> > IOMMUv2, the Sea
> >   Islands family of GPUs also performs HW-level validation of
> > commands passed
> >   in through the queues (aka rings).
> >   The code in this patch set is intended to serve both as a sample
> > driver for
> >   other HSA-compatible hardware devices and as a production driver
> > for
> >   radeon-family processors. The code is architected to support
> > multiple CPUs
> >   each with connected GPUs, although the current implementation
> > focuses on a
> >   single Kaveri/Berlin APU, and works alongside the existing radeon
> > kernel
> >   graphics driver (kgd).
> >   AMD GPUs designed for use with HSA (Sea Islands and up) share
> > some hardware
> >   functionality between HSA compute and regular gfx/compute (memory,
> >   interrupts, registers), while other functionality has been added
> >   specifically for HSA compute  (hw scheduler for virtualized
> > compute rings).
> >   All shared hardware is owned by the radeon graphics driver, and
> > an interface
> >   between kfd and kgd allows the kfd to make use of those shared
> > resources,
> >   while HSA-specific functionality is managed directly by kfd by
> > submitting
> >   packets into an HSA-specific command queue (the "HIQ").
> >   During kfd module initialization a char device node (/dev/kfd) is
> > created
> >   (surviving until module exit), with ioctls for queue creation &
> > management,
> >   and data structures are initialized for managing HSA device
> > topology.
> >   The rest of the initialization is driven by calls from the radeon
> > kgd at
> >   the following points :
> >   - radeon_init (kfd_init)
> >   - radeon_exit (kfd_fini)
> >   - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >   - radeon_driver_unload_kms (kfd_device_fini)
> >   During the probe and init processing per-device data structures
> > are
> >   established which connect to the associated graphics kernel
> > driver. This
> >   information is exposed to userspace via sysfs, along with a
> > version number
> >   allowing userspace to determine if a topology change has occurred
> > while it
> >   was reading from sysfs.
> >   The interface between kfd and kgd also allows the kfd to request
> > buffer
> >   management services from kgd, and allows kgd to route interrupt
> > requests to
> >   kfd code since the interrupt block is shared between regular
> >   graphics/compute and HSA compute subsystems in the GPU.
> >   The kfd code works with an open source usermode library
> > ("libhsakmt") which
> >   is in the final stages of IP review and should be published in a
> > separate
> >   repo over the next few days.
> >   The code operates in one of three modes, selectable via the
> > sched_policy
> >   module parameter :
> >   - sched_policy=0 uses a hardware scheduler running in the MEC
> > block within
> >   CP, and allows oversubscription (more queues than HW slots)
> >   - sched_policy=1 also uses HW scheduling but does not allow
> >   oversubscription, so create_queue requests fail when we run out
> > of HW slots
> >   - sched_policy=2 does not use HW scheduling, so the driver
> > manually assigns
> >   queues to HW slots by programming registers
> >   The "no HW scheduling" option is for debug & new hardware bringup
> > only, so
> >   has less test coverage than the other options. Default in the
> > current code
> >   is "HW scheduling without oversubscription" since that is where
> > we have the
> >   most test coverage but we expect to change the default to "HW
> > scheduling
> >   with oversubscription" after further testing. This effectively
> > removes the
> >   HW limit on the number of work queues available to applications.
> >   

[PATCH 00/83] AMD HSA kernel driver

2014-07-12 Thread Christian König
Am 11.07.2014 23:18, schrieb Jerome Glisse:
> On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
>> On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
>>> On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
   This patch set implements a Heterogeneous System Architecture
 (HSA) driver
   for radeon-family GPUs.
>>>   
>>> This is just quick comments on few things. Given size of this, people
>>> will need to have time to review things.
>>>   
   HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to
 share
   system resources more effectively via HW features including
 shared pageable
   memory, userspace-accessible work queues, and platform-level
 atomics. In
   addition to the memory protection mechanisms in GPUVM and
 IOMMUv2, the Sea
   Islands family of GPUs also performs HW-level validation of
 commands passed
   in through the queues (aka rings).
   The code in this patch set is intended to serve both as a sample
 driver for
   other HSA-compatible hardware devices and as a production driver
 for
   radeon-family processors. The code is architected to support
 multiple CPUs
   each with connected GPUs, although the current implementation
 focuses on a
   single Kaveri/Berlin APU, and works alongside the existing radeon
 kernel
   graphics driver (kgd).
   AMD GPUs designed for use with HSA (Sea Islands and up) share
 some hardware
   functionality between HSA compute and regular gfx/compute (memory,
   interrupts, registers), while other functionality has been added
   specifically for HSA compute  (hw scheduler for virtualized
 compute rings).
   All shared hardware is owned by the radeon graphics driver, and
 an interface
   between kfd and kgd allows the kfd to make use of those shared
 resources,
   while HSA-specific functionality is managed directly by kfd by
 submitting
   packets into an HSA-specific command queue (the "HIQ").
   During kfd module initialization a char device node (/dev/kfd) is
 created
   (surviving until module exit), with ioctls for queue creation &
 management,
   and data structures are initialized for managing HSA device
 topology.
   The rest of the initialization is driven by calls from the radeon
 kgd at
   the following points :
   - radeon_init (kfd_init)
   - radeon_exit (kfd_fini)
   - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
   - radeon_driver_unload_kms (kfd_device_fini)
   During the probe and init processing per-device data structures
 are
   established which connect to the associated graphics kernel
 driver. This
   information is exposed to userspace via sysfs, along with a
 version number
   allowing userspace to determine if a topology change has occurred
 while it
   was reading from sysfs.
   The interface between kfd and kgd also allows the kfd to request
 buffer
   management services from kgd, and allows kgd to route interrupt
 requests to
   kfd code since the interrupt block is shared between regular
   graphics/compute and HSA compute subsystems in the GPU.
   The kfd code works with an open source usermode library
 ("libhsakmt") which
   is in the final stages of IP review and should be published in a
 separate
   repo over the next few days.
   The code operates in one of three modes, selectable via the
 sched_policy
   module parameter :
   - sched_policy=0 uses a hardware scheduler running in the MEC
 block within
   CP, and allows oversubscription (more queues than HW slots)
   - sched_policy=1 also uses HW scheduling but does not allow
   oversubscription, so create_queue requests fail when we run out
 of HW slots
   - sched_policy=2 does not use HW scheduling, so the driver
 manually assigns
   queues to HW slots by programming registers
   The "no HW scheduling" option is for debug & new hardware bringup
 only, so
   has less test coverage than the other options. Default in the
 current code
   is "HW scheduling without oversubscription" since that is where
 we have the
   most test coverage but we expect to change the default to "HW
 scheduling
   with oversubscription" after further testing. This effectively
 removes the
   HW limit on the number of work queues available to applications.
   Programs running on the GPU are associated with an address space
 through the
   VMID field, which is translated to a unique PASID at access time
 via a set
   of 16 VMID-to-PASID mapping registers. The available VMIDs
 (currently 16)
   are partitioned (under control of the radeon kgd) between current
   gfx/compute and HSA compute, with each getting 8 in the current
 code. The
   

[PATCH 00/83] AMD HSA kernel driver

2014-07-11 Thread Jerome Glisse
On Thu, Jul 10, 2014 at 10:51:29PM +, Gabbay, Oded wrote:
> On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> > On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> > >  This patch set implements a Heterogeneous System Architecture 
> > > (HSA) driver
> > >  for radeon-family GPUs.
> >  
> > This is just quick comments on few things. Given size of this, people
> > will need to have time to review things.
> >  
> > >  HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to 
> > > share
> > >  system resources more effectively via HW features including 
> > > shared pageable
> > >  memory, userspace-accessible work queues, and platform-level 
> > > atomics. In
> > >  addition to the memory protection mechanisms in GPUVM and 
> > > IOMMUv2, the Sea
> > >  Islands family of GPUs also performs HW-level validation of 
> > > commands passed
> > >  in through the queues (aka rings).
> > >  The code in this patch set is intended to serve both as a sample 
> > > driver for
> > >  other HSA-compatible hardware devices and as a production driver 
> > > for
> > >  radeon-family processors. The code is architected to support 
> > > multiple CPUs
> > >  each with connected GPUs, although the current implementation 
> > > focuses on a
> > >  single Kaveri/Berlin APU, and works alongside the existing radeon 
> > > kernel
> > >  graphics driver (kgd).
> > >  AMD GPUs designed for use with HSA (Sea Islands and up) share 
> > > some hardware
> > >  functionality between HSA compute and regular gfx/compute (memory,
> > >  interrupts, registers), while other functionality has been added
> > >  specifically for HSA compute  (hw scheduler for virtualized 
> > > compute rings).
> > >  All shared hardware is owned by the radeon graphics driver, and 
> > > an interface
> > >  between kfd and kgd allows the kfd to make use of those shared 
> > > resources,
> > >  while HSA-specific functionality is managed directly by kfd by 
> > > submitting
> > >  packets into an HSA-specific command queue (the "HIQ").
> > >  During kfd module initialization a char device node (/dev/kfd) is 
> > > created
> > >  (surviving until module exit), with ioctls for queue creation & 
> > > management,
> > >  and data structures are initialized for managing HSA device 
> > > topology.
> > >  The rest of the initialization is driven by calls from the radeon 
> > > kgd at
> > >  the following points :
> > >  - radeon_init (kfd_init)
> > >  - radeon_exit (kfd_fini)
> > >  - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> > >  - radeon_driver_unload_kms (kfd_device_fini)
> > >  During the probe and init processing per-device data structures 
> > > are
> > >  established which connect to the associated graphics kernel 
> > > driver. This
> > >  information is exposed to userspace via sysfs, along with a 
> > > version number
> > >  allowing userspace to determine if a topology change has occurred 
> > > while it
> > >  was reading from sysfs.
> > >  The interface between kfd and kgd also allows the kfd to request 
> > > buffer
> > >  management services from kgd, and allows kgd to route interrupt 
> > > requests to
> > >  kfd code since the interrupt block is shared between regular
> > >  graphics/compute and HSA compute subsystems in the GPU.
> > >  The kfd code works with an open source usermode library 
> > > ("libhsakmt") which
> > >  is in the final stages of IP review and should be published in a 
> > > separate
> > >  repo over the next few days.
> > >  The code operates in one of three modes, selectable via the 
> > > sched_policy
> > >  module parameter :
> > >  - sched_policy=0 uses a hardware scheduler running in the MEC 
> > > block within
> > >  CP, and allows oversubscription (more queues than HW slots)
> > >  - sched_policy=1 also uses HW scheduling but does not allow
> > >  oversubscription, so create_queue requests fail when we run out 
> > > of HW slots
> > >  - sched_policy=2 does not use HW scheduling, so the driver 
> > > manually assigns
> > >  queues to HW slots by programming registers
> > >  The "no HW scheduling" option is for debug & new hardware bringup 
> > > only, so
> > >  has less test coverage than the other options. Default in the 
> > > current code
> > >  is "HW scheduling without oversubscription" since that is where 
> > > we have the
> > >  most test coverage but we expect to change the default to "HW 
> > > scheduling
> > >  with oversubscription" after further testing. This effectively 
> > > removes the
> > >  HW limit on the number of work queues available to applications.
> > >  Programs running on the GPU are associated with an address space 
> > > through the
> > >  VMID field, which is translated to a unique PASID at access time 
> > > via a set
> > >  of 16 VMID-to-PASID mapping registers. The available VMIDs 
> > > (currently 16)
> > >  are partitioned (under control of the radeon kgd) between current
> > >  gfx/compute and HSA compute, with each getting 8 in the 

[PATCH 00/83] AMD HSA kernel driver

2014-07-11 Thread Oded Gabbay
This patch set implements a Heterogeneous System Architecture (HSA) driver 
for radeon-family GPUs. 

HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share 
system resources more effectively via HW features including shared pageable 
memory, userspace-accessible work queues, and platform-level atomics. In 
addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea 
Islands family of GPUs also performs HW-level validation of commands passed 
in through the queues (aka rings).

The code in this patch set is intended to serve both as a sample driver for 
other HSA-compatible hardware devices and as a production driver for 
radeon-family processors. The code is architected to support multiple CPUs 
each with connected GPUs, although the current implementation focuses on a 
single Kaveri/Berlin APU, and works alongside the existing radeon kernel 
graphics driver (kgd). 

AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware 
functionality between HSA compute and regular gfx/compute (memory, 
interrupts, registers), while other functionality has been added 
specifically for HSA compute  (hw scheduler for virtualized compute rings). 
All shared hardware is owned by the radeon graphics driver, and an interface 
between kfd and kgd allows the kfd to make use of those shared resources, 
while HSA-specific functionality is managed directly by kfd by submitting 
packets into an HSA-specific command queue (the "HIQ").

During kfd module initialization a char device node (/dev/kfd) is created 
(surviving until module exit), with ioctls for queue creation & management, 
and data structures are initialized for managing HSA device topology. 

The rest of the initialization is driven by calls from the radeon kgd at 
the following points :

- radeon_init (kfd_init)
- radeon_exit (kfd_fini)
- radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
- radeon_driver_unload_kms (kfd_device_fini)

During the probe and init processing per-device data structures are 
established which connect to the associated graphics kernel driver. This 
information is exposed to userspace via sysfs, along with a version number 
allowing userspace to determine if a topology change has occurred while it 
was reading from sysfs. 

The interface between kfd and kgd also allows the kfd to request buffer 
management services from kgd, and allows kgd to route interrupt requests to 
kfd code since the interrupt block is shared between regular 
graphics/compute and HSA compute subsystems in the GPU.

The kfd code works with an open source usermode library ("libhsakmt") which 
is in the final stages of IP review and should be published in a separate 
repo over the next few days. 

The code operates in one of three modes, selectable via the sched_policy 
module parameter :

- sched_policy=0 uses a hardware scheduler running in the MEC block within 
CP, and allows oversubscription (more queues than HW slots) 

- sched_policy=1 also uses HW scheduling but does not allow 
oversubscription, so create_queue requests fail when we run out of HW slots 

- sched_policy=2 does not use HW scheduling, so the driver manually assigns 
queues to HW slots by programming registers

The "no HW scheduling" option is for debug & new hardware bringup only, so 
has less test coverage than the other options. Default in the current code 
is "HW scheduling without oversubscription" since that is where we have the 
most test coverage but we expect to change the default to "HW scheduling 
with oversubscription" after further testing. This effectively removes the 
HW limit on the number of work queues available to applications.

Programs running on the GPU are associated with an address space through the 
VMID field, which is translated to a unique PASID at access time via a set 
of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) 
are partitioned (under control of the radeon kgd) between current 
gfx/compute and HSA compute, with each getting 8 in the current code. The 
VMID-to-PASID mapping registers are updated by the HW scheduler when used, 
and by driver code if HW scheduling is not being used.  

The Sea Islands compute queues use a new "doorbell" mechanism instead of the 
earlier kernel-managed write pointer registers. Doorbells use a separate BAR 
dedicated for this purpose, and pages within the doorbell aperture are 
mapped to userspace (each page mapped to only one user address space). 
Writes to the doorbell aperture are intercepted by GPU hardware, allowing 
userspace code to safely manage work queues (rings) without requiring a 
kernel call for every ring update. 

First step for an application process is to open the kfd device. Calls to 
open create a kfd "process" structure only for the first thread of the 
process. Subsequent open calls are checked to see if they are from processes 
using the same mm_struct and, if so, don't do anything. The kfd per-process 
data lives as long as the 

[PATCH 00/83] AMD HSA kernel driver

2014-07-10 Thread Gabbay, Oded
On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> >  This patch set implements a Heterogeneous System Architecture 
> > (HSA) driver
> >  for radeon-family GPUs.
>  
> This is just quick comments on few things. Given size of this, people
> will need to have time to review things.
>  
> >  HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to 
> > share
> >  system resources more effectively via HW features including 
> > shared pageable
> >  memory, userspace-accessible work queues, and platform-level 
> > atomics. In
> >  addition to the memory protection mechanisms in GPUVM and 
> > IOMMUv2, the Sea
> >  Islands family of GPUs also performs HW-level validation of 
> > commands passed
> >  in through the queues (aka rings).
> >  The code in this patch set is intended to serve both as a sample 
> > driver for
> >  other HSA-compatible hardware devices and as a production driver 
> > for
> >  radeon-family processors. The code is architected to support 
> > multiple CPUs
> >  each with connected GPUs, although the current implementation 
> > focuses on a
> >  single Kaveri/Berlin APU, and works alongside the existing radeon 
> > kernel
> >  graphics driver (kgd).
> >  AMD GPUs designed for use with HSA (Sea Islands and up) share 
> > some hardware
> >  functionality between HSA compute and regular gfx/compute (memory,
> >  interrupts, registers), while other functionality has been added
> >  specifically for HSA compute  (hw scheduler for virtualized 
> > compute rings).
> >  All shared hardware is owned by the radeon graphics driver, and 
> > an interface
> >  between kfd and kgd allows the kfd to make use of those shared 
> > resources,
> >  while HSA-specific functionality is managed directly by kfd by 
> > submitting
> >  packets into an HSA-specific command queue (the "HIQ").
> >  During kfd module initialization a char device node (/dev/kfd) is 
> > created
> >  (surviving until module exit), with ioctls for queue creation & 
> > management,
> >  and data structures are initialized for managing HSA device 
> > topology.
> >  The rest of the initialization is driven by calls from the radeon 
> > kgd at
> >  the following points :
> >  - radeon_init (kfd_init)
> >  - radeon_exit (kfd_fini)
> >  - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >  - radeon_driver_unload_kms (kfd_device_fini)
> >  During the probe and init processing per-device data structures 
> > are
> >  established which connect to the associated graphics kernel 
> > driver. This
> >  information is exposed to userspace via sysfs, along with a 
> > version number
> >  allowing userspace to determine if a topology change has occurred 
> > while it
> >  was reading from sysfs.
> >  The interface between kfd and kgd also allows the kfd to request 
> > buffer
> >  management services from kgd, and allows kgd to route interrupt 
> > requests to
> >  kfd code since the interrupt block is shared between regular
> >  graphics/compute and HSA compute subsystems in the GPU.
> >  The kfd code works with an open source usermode library 
> > ("libhsakmt") which
> >  is in the final stages of IP review and should be published in a 
> > separate
> >  repo over the next few days.
> >  The code operates in one of three modes, selectable via the 
> > sched_policy
> >  module parameter :
> >  - sched_policy=0 uses a hardware scheduler running in the MEC 
> > block within
> >  CP, and allows oversubscription (more queues than HW slots)
> >  - sched_policy=1 also uses HW scheduling but does not allow
> >  oversubscription, so create_queue requests fail when we run out 
> > of HW slots
> >  - sched_policy=2 does not use HW scheduling, so the driver 
> > manually assigns
> >  queues to HW slots by programming registers
> >  The "no HW scheduling" option is for debug & new hardware bringup 
> > only, so
> >  has less test coverage than the other options. Default in the 
> > current code
> >  is "HW scheduling without oversubscription" since that is where 
> > we have the
> >  most test coverage but we expect to change the default to "HW 
> > scheduling
> >  with oversubscription" after further testing. This effectively 
> > removes the
> >  HW limit on the number of work queues available to applications.
> >  Programs running on the GPU are associated with an address space 
> > through the
> >  VMID field, which is translated to a unique PASID at access time 
> > via a set
> >  of 16 VMID-to-PASID mapping registers. The available VMIDs 
> > (currently 16)
> >  are partitioned (under control of the radeon kgd) between current
> >  gfx/compute and HSA compute, with each getting 8 in the current 
> > code. The
> >  VMID-to-PASID mapping registers are updated by the HW scheduler 
> > when used,
> >  and by driver code if HW scheduling is not being used.
> >  The Sea Islands compute queues use a new "doorbell" mechanism 
> > instead of the
> >  earlier kernel-managed 

[PATCH 00/83] AMD HSA kernel driver

2014-07-10 Thread Jerome Glisse
On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> This patch set implements a Heterogeneous System Architecture (HSA) driver 
> for radeon-family GPUs. 

This is just quick comments on few things. Given size of this, people
will need to have time to review things.

> 
> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share 
> system resources more effectively via HW features including shared pageable 
> memory, userspace-accessible work queues, and platform-level atomics. In 
> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea 
> Islands family of GPUs also performs HW-level validation of commands passed 
> in through the queues (aka rings).
> 
> The code in this patch set is intended to serve both as a sample driver for 
> other HSA-compatible hardware devices and as a production driver for 
> radeon-family processors. The code is architected to support multiple CPUs 
> each with connected GPUs, although the current implementation focuses on a 
> single Kaveri/Berlin APU, and works alongside the existing radeon kernel 
> graphics driver (kgd). 
> 
> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware 
> functionality between HSA compute and regular gfx/compute (memory, 
> interrupts, registers), while other functionality has been added 
> specifically for HSA compute  (hw scheduler for virtualized compute rings). 
> All shared hardware is owned by the radeon graphics driver, and an interface 
> between kfd and kgd allows the kfd to make use of those shared resources, 
> while HSA-specific functionality is managed directly by kfd by submitting 
> packets into an HSA-specific command queue (the "HIQ").
> 
> During kfd module initialization a char device node (/dev/kfd) is created 
> (surviving until module exit), with ioctls for queue creation & management, 
> and data structures are initialized for managing HSA device topology. 
> 
> The rest of the initialization is driven by calls from the radeon kgd at 
> the following points :
> 
> - radeon_init (kfd_init)
> - radeon_exit (kfd_fini)
> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> - radeon_driver_unload_kms (kfd_device_fini)
> 
> During the probe and init processing per-device data structures are 
> established which connect to the associated graphics kernel driver. This 
> information is exposed to userspace via sysfs, along with a version number 
> allowing userspace to determine if a topology change has occurred while it 
> was reading from sysfs. 
> 
> The interface between kfd and kgd also allows the kfd to request buffer 
> management services from kgd, and allows kgd to route interrupt requests to 
> kfd code since the interrupt block is shared between regular 
> graphics/compute and HSA compute subsystems in the GPU.
> 
> The kfd code works with an open source usermode library ("libhsakmt") which 
> is in the final stages of IP review and should be published in a separate 
> repo over the next few days. 
> 
> The code operates in one of three modes, selectable via the sched_policy 
> module parameter :
> 
> - sched_policy=0 uses a hardware scheduler running in the MEC block within 
> CP, and allows oversubscription (more queues than HW slots) 
> 
> - sched_policy=1 also uses HW scheduling but does not allow 
> oversubscription, so create_queue requests fail when we run out of HW slots 
> 
> - sched_policy=2 does not use HW scheduling, so the driver manually assigns 
> queues to HW slots by programming registers
> 
> The "no HW scheduling" option is for debug & new hardware bringup only, so 
> has less test coverage than the other options. Default in the current code 
> is "HW scheduling without oversubscription" since that is where we have the 
> most test coverage but we expect to change the default to "HW scheduling 
> with oversubscription" after further testing. This effectively removes the 
> HW limit on the number of work queues available to applications.
> 
> Programs running on the GPU are associated with an address space through the 
> VMID field, which is translated to a unique PASID at access time via a set 
> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) 
> are partitioned (under control of the radeon kgd) between current 
> gfx/compute and HSA compute, with each getting 8 in the current code. The 
> VMID-to-PASID mapping registers are updated by the HW scheduler when used, 
> and by driver code if HW scheduling is not being used.  
> 
> The Sea Islands compute queues use a new "doorbell" mechanism instead of the 
> earlier kernel-managed write pointer registers. Doorbells use a separate BAR 
> dedicated for this purpose, and pages within the doorbell aperture are 
> mapped to userspace (each page mapped to only one user address space). 
> Writes to the doorbell aperture are intercepted by GPU hardware, allowing 
> userspace code to safely manage work queues (rings) without requiring a 
> kernel call for every