Investigating where time is spent in the radeon/kms world when doing
rendering led me to question the design of the CS ioctl. As I am among
the people behind it, I think I should give some historical background
on the choices that were made.

The first motivation behind the CS ioctl was to have a common language
between userspace and kernel, and between kernel and device. Of course
in an ideal world commands submitted through the CS ioctl could be
forwarded directly to the GPU without much overhead. Thing is, the
world we live in isn't that good. There are two things the CS ioctl
does before forwarding commands:

1- First it must rewrite any packet which supplies an offset to the
GPU with the address at which the memory manager validated the
buffer object associated with this packet. We can't get rid of this
with the CS ioctl (we might do something very clever like a new
microcode for the CP so that the CP can rewrite packets using a
table of validated buffer offsets, but I am not even sure the CP
would be powerful enough to do that).
2- In order to provide more advanced security than what we had in
the past, I added a CS checker facility which is responsible for
analyzing the command stream and making sure that the GPU won't
read or write outside the supplied buffer object list. DRI1 didn't
offer such advanced checking. This feature was added with GPU
sharing in mind, where sensitive applications might run on the GPU
and we might want to protect their memory. (A rough sketch of both
steps follows after this list.)
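
Just to make these two steps concrete, here is a very rough sketch of
the per-packet work; the structure, field names and packet layout are
simplified for the example and are not the real radeon code:

#include <errno.h>
#include <stdint.h>

/* Simplified sketch of what the CS ioctl does for each packet that
 * carries a GPU offset: bound checking then relocation. */
struct cs_reloc {
        uint32_t gpu_offset;    /* address the memory manager validated
                                 * the buffer object at */
        uint32_t size;          /* size of that buffer object */
};

static int cs_patch_packet(uint32_t *pkt, const struct cs_reloc *reloc)
{
        uint32_t offset = pkt[1];       /* offset dword in the packet */

        /* 2- checking: the GPU must stay inside the bo */
        if (offset >= reloc->size)
                return -EINVAL;

        /* 1- relocation: rewrite the offset with the validated address */
        pkt[1] = reloc->gpu_offset + offset;
        return 0;
}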

We can obviously drop the second item and things would work, but
userspace would be able to abuse the GPU to access memory outside the
GPU objects it owns (this doesn't mean it would be able to access any
system RAM, but rather any RAM that is mapped to the GPU, which should
for the time being only be pixmaps, textures, vbos or things like
that).

Bottom line is that with the CS ioctl we do the same work twice in two
different forms. In userspace we build a command stream understandable
by the GPU, and in kernel space we decode this command stream to check
it. Obviously this sounds wrong.

That being said, the CS ioctl isn't that bad: it doesn't consume much
in the benchmarks I have done, but I expect it might consume more on
older CPUs or when many complex 3D apps run at the same time. So I
am not proposing to throw it away, but rather to discuss a better
interface we could add at a later point to slowly replace CS. CS
brings today the features we needed yesterday, so we should focus
our effort on getting the CS ioctl as smooth and good as possible.


So as a pet project I have been thinking these last few days about
what a better interface between userspace and kernel would be, and I
came up with something in between Gallium state objects and NVIDIA
GPU objects (well, at least as far as I know either of those, my
design sounds close to them).

The idea behind the design is that whenever userspace allocates a bo,
userspace knows the properties of the bo. If it's a texture, userspace
knows the size, the number of mipmap levels, the border, ... of the
texture. If it's a vbo, it knows the layout, the size, the number of
elements, ... Same for a rendering viewport: userspace knows the size
and the associated properties.
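
For instance (this is just an illustration I made up, not the actual
layout of the example header linked further below), the full
description of a texture that userspace could hand to the kernel might
look like:

#include <stdint.h>

/* Illustrative only: everything userspace already knows about a
 * texture bo at creation time. */
struct texture_object_desc {
        uint32_t bo_handle;     /* buffer object backing the texture */
        uint32_t format;        /* asic specific texel format */
        uint32_t width;
        uint32_t height;
        uint32_t pitch;
        uint32_t num_mipmaps;
        uint32_t border_color;
};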

The design has 2 ioctls (a rough sketch of possible argument
structures is given after the next paragraph):
        create_object :
                supply :
                        - object type id, specific to the asic
                        - object structure associated with the type
                        id, fully describing the object
                return :
                        - object id
                processing :
                        - check that the states provided are
                        correct and check that the bo is big
                        enough for the states
                        - translate the states into a packet stream
                        - store the object, the packet stream
                        & the associated object id
        batches :
                supply :
                        - table of batches
                process :
                        - check each batch and schedule them

Each batch is a set of object ids, and userspace needs to provide all
the object ids for the batch to be valid. For instance, if a shader
object needs 5 textures, the batch needs to have 5 texture object ids
supplied.
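
As a very rough sketch (again, all structure and field names here are
my assumptions, nothing final), the ioctl arguments could look like
this:

#include <stdint.h>

/* Hypothetical create_object ioctl argument. */
struct drm_radeon_create_object {
        uint32_t type;          /* asic specific object type id */
        uint32_t object_id;     /* filled in by the kernel */
        uint64_t desc_ptr;      /* userspace pointer to the type
                                 * specific description structure */
        uint32_t desc_size;     /* size of that structure in bytes */
};

/* One batch: the full set of object ids needed for an operation. */
struct drm_radeon_batch {
        uint64_t object_ids_ptr;        /* userspace pointer to an array
                                         * of already created object ids */
        uint32_t num_objects;
};

/* Hypothetical batches ioctl argument: a table of batches. */
struct drm_radeon_batches {
        uint64_t batches_ptr;           /* userspace pointer to an array
                                         * of struct drm_radeon_batch */
        uint32_t num_batches;
};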

Checking that a batch is valid is quick as it's a set of already
checked objects. You create objects just after creating the bo (if
it's a pixmap you can create a texture and a viewport object right
after, and whenever you want to use this pixmap you just use the
proper object id). This means that for objects which are used
multiple times you do the object properties checking once and then
take advantage of quick reuse.
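
Using the structures sketched above, the intended flow from userspace
could look like this (the ioctl request names are invented for the
example, and error handling is left out):

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical usage: the texture state is checked and translated once
 * at object creation, then reused across batches at low cost. */
static void draw_with_texture(int fd, struct texture_object_desc *tex,
                              uint32_t shader_id, uint32_t viewport_id)
{
        struct drm_radeon_create_object co = {
                .type      = 1,         /* assumed texture type id */
                .desc_ptr  = (uint64_t)(uintptr_t)tex,
                .desc_size = sizeof(*tex),
        };
        ioctl(fd, DRM_IOCTL_RADEON_CREATE_OBJECT, &co); /* invented name */

        uint32_t ids[3] = { co.object_id, shader_id, viewport_id };
        struct drm_radeon_batch batch = {
                .object_ids_ptr = (uint64_t)(uintptr_t)ids,
                .num_objects    = 3,
        };
        struct drm_radeon_batches batches = {
                .batches_ptr = (uint64_t)(uintptr_t)&batch,
                .num_batches = 1,
        };
        ioctl(fd, DRM_IOCTL_RADEON_BATCHES, &batches);  /* invented name */
}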

An example of what such objects could look like is at:
http://people.freedesktop.org/~glisse/rv515obj.h

So what we win is fast checking and better knowledge in the kernel of
how a bo is used, which allows adding many optimizations:
        - simple state remission optimization (don't re-emit the
        state of an object if the object's state is already set in
        the GPU); see the sketch after this list
        - clever flushing: if a bo is only associated with a texture
        object then the kernel knows that it's not necessary to ask
        for a GPU flush and can take clever flushing decisions
        - gives more information to the kernel for object placement
        - the kernel can override object placement even for things
        like vbo where endian swapping might need different settings
        depending on the layout of the vbo and where it is in memory
        - hw optimizations like rotating textures between available
        slots to avoid flushing the texture cache
        - faster relocation: relocations can be hardcoded so there is
        no need to parse anything
        - the kernel can break down a batches ioctl; each batch is a
        full description of the state necessary to perform an
        operation, so the only requirement is that each batch of a
        batches ioctl fits into the available memory
        - likely easier to report a memory limit for the maximum
        batch size userspace can supply
        - share state between processes (the object id could be a
        hash of the object's states, making it unique to a given set
        of states)
        - easier to work around some gpu limitations
        - allows fine tuning some of the gpu fifo safely in the
        kernel
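
As an illustration of the first point in the list above, the kernel
side could keep something as simple as this (purely hypothetical, just
to show the idea):

#include <stdint.h>

/* Track which object id is currently programmed in each state slot
 * and only re-emit packets when the id changes. */
#define NUM_STATE_SLOTS 16                      /* assumed slot count */

struct state_tracker {
        uint32_t current_id[NUM_STATE_SLOTS];   /* last emitted object id */
};

static void emit_object(struct state_tracker *t, unsigned int slot,
                        uint32_t object_id)
{
        if (t->current_id[slot] == object_id)
                return;         /* state already set in the GPU, skip */

        /* ... write the stored packet stream for object_id to the ring ... */

        t->current_id[slot] = object_id;
}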

Drawbacks I see (often with a new design you don't see all the
drawbacks, so please add any):
        - the kernel needs to know how to build the command stream
        (mostly byte shifting & masking)
        - might consume more memory (especially if userspace keeps
        a copy of the state)
        - you lose some of the benefit if you cohabit with the cs
        ioctl (need to assume all states are lost after a cs and
        perform all the necessary flushes like texture state flushes)
        - adding a feature means going through the kernel (not sure
        it's a drawback: it gives us a short window for merging new
        features, which likely means more time spent sitting on top
        of new features and testing them)

Well, this mail is already big enough, so what I would like is feedback
on this: does it sound like a good direction? (So far it sounds better
to me, but I can be wrong.)

Cheers,
Jerome Glisse

