amdgpu: fix a typo

Christian König Sat, 24 Jun 2017 11:28:00 -0700

Am 24.06.2017 um 02:27 schrieb Marek Olšák:

On Fri, Jun 23, 2017 at 3:01 PM, Christian König
<deathsim...@vodafone.de> wrote:

The key point here is while optimizing this is nice the much bigger pile is
the locking done for each BO.


In other words even when we optimize all the other locks involved into
atomics or RCU, the BO reservation lock will still dominate everything.

One possible solution to this would be per process resources like I
suggested multiple times now.

Mesa can set a per-process resource flag on all resources except
displayable ones. The question is, would it help if an IB contained
1000 per-process resources and 1-2 inter-process sharable?


Yeah, absolutely.

See the 1000 per-process shareable resources only need to be revalidatedwhen one of them was evicted. This would easily reduce the per BO CSoverhead to 1/1000 of what it is now.

The crux is that you can't share those resources with other process andsince you don't specify them on a per CS basis swapping things in andout on per CS basis like we do for all the Unigine benchmarks won't workany more.


Everything would be statically allocated, that might not be ideal either.

Regards,
Christian.


Marek

Christian.


Am 23.06.2017 um 13:37 schrieb Marek Olšák:

I agree with you about the spinlock. You seem to be good at this.

It's always good to do measurements to validate that a code change
improves something, especially when the code size and code complexity
has to be increased. A CPU profiler such as sysprof can show you
improvements on the order of 1/10000th = 0.01% if you record enough
samples. Sometimes you have to un-inline a function to make it visible
there. If you see a function that takes 0.3% of CPU time and you
optimize it down to 0.1% using the profiler as the measurement tool,
you have evidence that the improvement is there and nobody can reject
the idea anymore. It also proves that the code size increase is worth
it. It's always "added code size and loss of simplicity" vs benefit.
It's a transaction. You trade one for the other. You lose something to
get something else. OK, we know the code complexity. Now, what's the
benefit? Can you do some measurements? The accuracy of 1/10000th
should be enough for anybody.

I know the feeling when you spend many days working on something,
adding 100s or 1000s of lines of code, solving many problems to get
there and increasing code complexity significantly, and then you do
the measurement and it doesn't improve anything. I know the feeling
very well. It sucks. The frustration comes from the investment of time
and getting no return on the investment. Many frustrations in life are
like that.

Marek


On Fri, Jun 23, 2017 at 4:23 AM, axie <a...@amd.com> wrote:

Hi Marek,


So do you agree that spinlock disables CPU preemption, contrary to your
original idea?


If you have new reason that this patch does not improve, please speak
out.


Many patches in GPU driver aim at improving performance and power
efficiency. Does most patches submitted in AMDGPU requires a benchmarking
first?

If all developers are required to always answer your questions when code
review, I am afraid that most open source community developers cannot
meet
that requirement and stop working on AMDGPU.


To improve performance, there are many bottlenecks to clear. When the
last
several bottlenecks are clear, the performance will show faster more
significantly.

My pass profiling experience told me that clearing a lock can improve
performance for some driver like 0.3% to much bigger percentage. It
depends
on many factors, even depends on the application itself.


This is not the first bottleneck fixed. This is surely not the last one.


Thanks,

Alex Bin



On 2017-06-22 07:54 PM, Marek Olšák wrote:

That's all nice, but does it improve performance? Have you been able
to measure some performance difference with that code? Were you
targeting a specific inefficiency you had seen e.g. with a CPU
profiler?

Marek

On Thu, Jun 22, 2017 at 8:19 PM, axie <a...@amd.com> wrote:

To clarify, local IRQ is disabled by calling raw_local_irq_save(flags);

Function __lock_acquire double checks that the local IRQ is really
disabled.



On 2017-06-22 01:34 PM, axie wrote:

Hi Marek,

Spin lock and spin unlock is fast. But it is not so fast compared with
atomic, which is a single CPU instruction in x86.


1. spinlock does NOT allow preemption at local CPU. Let us have a look
at
how spin lock was implemented.

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
       preempt_disable(); <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<--This is
memory barrier operation too.
       spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
       LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

2.  A function  __lock_acquire called by spinlock. The function is so
long
that I would not attach all of it here.

There is atomic operation inside and 12 meta data updates and 14 if
statements and it calls quite some other functions.

Note that it disable IRQ...

static int __lock_acquire(struct lockdep_map *lock, unsigned int
subclass,
                 int trylock, int read, int check, int hardirqs_off,
                 struct lockdep_map *nest_lock, unsigned long ip,
                 int references, int pin_count)
{
       struct task_struct *curr = current;
       struct lock_class *class = NULL;
       struct held_lock *hlock;
       unsigned int depth;
       int chain_head = 0;
       int class_idx;
       u64 chain_key;

       if (unlikely(!debug_locks))
           return 0;

       /*
        * Lockdep should run with IRQs disabled, otherwise we could
        * get an interrupt which would want to take locks, which would
        * end up in lockdep and have you got a head-ache already?
        */
       if (DEBUG_LOCKS_WARN_ON(!irqs_disabled()))
<<<<<<<<<<<<<<<Disable
IRQ
           return 0;

....

3. Another function called by spinlock in a higher level:

void lock_acquire(struct lockdep_map *lock, unsigned int subclass,

                 int trylock, int read, int check,
                 struct lockdep_map *nest_lock, unsigned long ip)
{
       unsigned long flags;

       if (unlikely(current->lockdep_recursion))
           return;

       raw_local_irq_save(flags);
       check_flags(flags);

       current->lockdep_recursion = 1;
       trace_lock_acquire(lock, subclass, trylock, read, check,
nest_lock,
ip);
       __lock_acquire(lock, subclass, trylock, read, check,
                  irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
       current->lockdep_recursion = 0;
       raw_local_irq_restore(flags);
}


Thanks,

Alex Bin


On 2017-06-22 12:27 PM, Marek Olšák wrote:

On Thu, Jun 22, 2017 at 5:33 PM, Xie, AlexBin <alexbin....@amd.com>
wrote:

Hi Christian,


In fact, the change from spinlock to atomic is quite painful. When I
started, I thought it was easy but later I found there might be race
condition here and there. Now I think the change looks more robust.
In
kernel source, there are several other drivers used the same trick.


On the other hand, I think the logic itself might be optimized
considering
the locking. But I had spent quite some effort to maintain original
logic.

It seems quite complicated and I don't know if there is any
performance benefit. Spinlocks are nice because they allow
preemption.

It would be more interesting to merge the CS and BO_LIST ioctls into
one.

Marek


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 1/3] drm/amdgpu: fix a typo

Reply via email to