Re: Performance drop due to alloc_workqueue() misuse and recent change

2023-12-19 Thread Tejun Heo
Hello, again.

On Mon, Dec 04, 2023 at 04:03:47PM +, Naohiro Aota wrote:
...
> In summary, we misuse max_active, assuming it is a global limit. And,
> the recent commit introduced a huge performance drop in some cases.  We
> need to review alloc_workqueue() usage to check if its max_active setting
> is proper or not.

Can you please test the following branch?

 https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git unbound-system-wide-max_active

Thanks.

-- 
tejun


Re: [Intel-gfx] Performance drop due to alloc_workqueue() misuse and recent change

2023-12-04 Thread Tejun Heo
Hello,

On Mon, Dec 04, 2023 at 04:03:47PM +, Naohiro Aota wrote:
> Recently, commit 636b927eba5b ("workqueue: Make unbound workqueues to use
> per-cpu pool_workqueues") changed WQ_UNBOUND workqueue's behavior. It
> changed the meaning of alloc_workqueue()'s max_active from an upper limit
> imposed per NUMA node to a limit per CPU. As a result, massive number of
> workers can be running at the same time, especially if the workqueue user
> thinks the max_active is a global limit.
> 
> Actually, it is already written it is per-CPU limit in the documentation
> before the commit. However, several callers seem to misuse max_active,
> maybe thinking it is a global limit. It is an unexpected behavior change
> for them.

Right, and the behavior has been like that for a very long time and there
was no other way to achieve a reasonable level of concurrency, so the current
situation is expected.

> For example, these callers set max_active = num_online_cpus(), which is a
> suspicious limit when applied per CPU. This config means we can have nr_cpu
> * nr_cpu active tasks working at the same time.

Yeah, that sounds like a good indicator.

> fs/f2fs/data.c: sbi->post_read_wq = alloc_workqueue("f2fs_post_read_wq",
> fs/f2fs/data.c-                                      WQ_UNBOUND | WQ_HIGHPRI,
> fs/f2fs/data.c-                                      num_online_cpus());
> 
> fs/crypto/crypto.c: fscrypt_read_workqueue = alloc_workqueue("fscrypt_read_queue",
> fs/crypto/crypto.c-                                           WQ_UNBOUND | WQ_HIGHPRI,
> fs/crypto/crypto.c-                                           num_online_cpus());
> 
> fs/verity/verify.c: fsverity_read_workqueue = alloc_workqueue("fsverity_read_queue",
> fs/verity/verify.c-                                            WQ_HIGHPRI,
> fs/verity/verify.c-                                            num_online_cpus());
> 
> drivers/crypto/hisilicon/qm.c:  qm->wq = alloc_workqueue("%s", WQ_HIGHPRI | WQ_MEM_RECLAIM |
> drivers/crypto/hisilicon/qm.c-                           WQ_UNBOUND, num_online_cpus(),
> drivers/crypto/hisilicon/qm.c-                           pci_name(qm->pdev));
> 
> block/blk-crypto-fallback.c:    blk_crypto_wq = alloc_workqueue("blk_crypto_wq",
> block/blk-crypto-fallback.c-                                    WQ_UNBOUND | WQ_HIGHPRI |
> block/blk-crypto-fallback.c-                                    WQ_MEM_RECLAIM, num_online_cpus());
> 
> drivers/md/dm-crypt.c:  cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> drivers/md/dm-crypt.c-                                    WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
> drivers/md/dm-crypt.c-                                    num_online_cpus(), devname);

Most of these work items are CPU bound but not completely so. e.g.
kcryptd_crypt_write_continue() does wait_for_completion(), so setting
max_active to 1 likely isn't what they want either. They mostly want some
reasonable system-wide concurrency limit w.r.t. the CPU count while keeping
some level of flexibility in terms of task placement.

The previous max_active wasn't great for this because its meaning changed
depending on the number of nodes. Now, the meaning doesn't change but it's
not really useful for the above purpose. It's only useful for avoiding
melting the system completely.
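
As an illustration only - this is not code from any of the quoted callers -
the pattern in question and its consequence under the new per-CPU
interpretation look roughly like this:

  #include <linux/workqueue.h>

  static struct workqueue_struct *example_wq;

  static int example_setup(void)
  {
          /*
           * Made-up example mirroring the quoted callers.
           * max_active = num_online_cpus() reads like "at most one work
           * item per CPU in the system", but for a WQ_UNBOUND workqueue
           * the limit is now applied per CPU, so the effective system-wide
           * cap becomes num_online_cpus() * num_online_cpus().
           */
          example_wq = alloc_workqueue("example_wq",
                                       WQ_UNBOUND | WQ_HIGHPRI,
                                       num_online_cpus());
          return example_wq ? 0 : -ENOMEM;
  }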

One way to go about it is to declare that concurrency level management for
unbound workqueues is on the users, but that seems less than ideal given many use cases
would want it anyway.

Let me think it over but I think the right way to go about it is going the
other direction - ie. making max_active apply to the whole system regardless
of the number of nodes / ccx's / whatever.

> Furthermore, the change affects performance in a certain case.
> 
> Btrfs creates several WQ_UNBOUND workqueues with a default max_active =
> min(NRCPUS + 2, 8). As my machine has 96 CPUs with NUMA disabled, this
> max_active config allows running over 700 active works. Before the commit,
> it was limited to 8 if NUMA is disabled, or to 16 if there are 2 NUMA nodes.
> 
> I reverted the workqueue code back to before the commit, and I ran the
> following fio command on RAID0 btrfs on 6 SSDs.
> 
> fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
> --rw=write --fallocate=none \
> --direct=1 --ioengine=libaio --iodepth=32 \
> --filesize=100G \
> --blocksize=64k \
> --time_based --runtime=300s \
> --end_fsync=1 \
> --directory=${MNT} \
> --name=writer --numjobs=32
> 
> By changing workqueue's max_active, the result varies.
> 
> - wq max_active=8   (intended limit by btrfs?)
>   WRITE: bw=2495MiB/s (2616MB/s), 2495MiB/s-2495MiB/s (2616MB/s-2616MB/s), 
> io=753GiB (808GB), run=308953-308953msec
> - wq max_active=16  (actual limit on 2 NUMA nodes setup)
>   WRITE: bw=1736MiB/s (1820MB/s), 1736MiB/s-1736MiB/s (1820MB/s-1820MB/s), 
> io=670GiB (720GB), 

Re: [Intel-gfx] [RFC v6 0/8] DRM scheduling cgroup controller

2023-11-12 Thread Tejun Heo
Hello,

From cgroup POV, it generally looks fine to me. As before, I'm really
curious whether this is something other non-intel drivers can get behind.
Just one nit.

On Tue, Oct 24, 2023 at 05:07:19PM +0100, Tvrtko Ursulin wrote:
>  * Allowing per DRM card configuration and queries is deliberately left out but
>it is compatible in principle with the current proposal.
> 
>Where today we have, for drm.weight:
> 
>100
> 
>We can later extend with:
> 
>100
>card0 80
>card1 20
>
>Similarly for drm.stat:
> 
>usage_usec 1000
>card0.usage_usec 500
>card1.usage_usec 500

These don't match what blkcg is doing. Please use nested keyed format
instead.
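
For reference, a nested keyed rendering of the same data - illustrative
formatting only, in the style of io.stat, not a committed interface - would
put the card identifier as the key and the fields as key=value pairs:

  card0 usage_usec=500
  card1 usage_usec=500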

Thanks.

-- 
tejun


Re: [Intel-gfx] WQ_UNBOUND warning since recent workqueue refactoring

2023-08-30 Thread Tejun Heo
Hello,

(cc'ing i915 folks)

On Wed, Aug 30, 2023 at 04:57:42PM +0200, Heiner Kallweit wrote:
> Recently I started to see the following warning on linux-next and presumably
> this may be related to the refactoring of the workqueue core code.
> 
> [   56.900223] workqueue: output_poll_execute [drm_kms_helper] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
> [   56.923226] workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
> [   97.860430] workqueue: output_poll_execute [drm_kms_helper] hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
> [   97.884453] workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
> 
> Adding WQ_UNBOUND to these queues didn't change the behavior.

That should have made them go away as the code path isn't active at all for
WQ_UNBOUND workqueues. Can you please double check?

> Maybe relevant: I run the affected system headless.

i915 folks, workqueue recently added debug warnings which trigger when a
per-cpu work item hogs the CPU for too long - 10ms in this case. This is
problematic because such a work item can stall other per-cpu work items.

* Is it expected for the above two work functions to occupy the CPU for over
  10ms repeatedly?

* If so, can we make them use an unbound workqueue instead?
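
As a rough sketch of the second option - hypothetical names, not the actual
i915 or drm_kms_helper code - moving such a work item onto an unbound
workqueue would look something like:

  #include <linux/workqueue.h>

  /* hypothetical names, for illustration only */
  static struct workqueue_struct *poll_wq;
  static struct delayed_work poll_work;

  static void poll_fn(struct work_struct *work)
  {
          /* long-running polling that may hold a CPU for more than 10ms */
  }

  static int poll_init(void)
  {
          /* max_active == 0 selects the default limit */
          poll_wq = alloc_workqueue("poll_wq", WQ_UNBOUND, 0);
          if (!poll_wq)
                  return -ENOMEM;

          INIT_DELAYED_WORK(&poll_work, poll_fn);
          queue_delayed_work(poll_wq, &poll_work, HZ);
          return 0;
  }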

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 16/17] cgroup/drm: Expose memory stats

2023-07-26 Thread Tejun Heo
Hello,

On Wed, Jul 26, 2023 at 05:44:28PM +0100, Tvrtko Ursulin wrote:
...
> > So, yeah, if you want to add memory controls, we better think through how
> > the fd ownership migration should work.
> 
> It would be quite easy to make the implicit migration fail - just a matter
> of failing the first ioctl, which is what triggers the migration, after the
> file descriptor access from a new owner.

So, it'd be best if there's no migration involved at all as per the
discussion with Maarten.

> But I don't think I can really add that in the RFC given I have no hard
> controls or anything like that.
> 
> With GPU usage throttling it doesn't really apply, at least I don't think it
> does, since even when migrated to a lower budget group it would just get
> immediately de-prioritized.
> 
> I don't think hard GPU time limits are feasible in general, and while soft
> might be, again I don't see that any limiting would necessarily have to run
> immediately on implicit migration.

Yeah, I wouldn't worry about hard allocation of GPU time. CPU RT control
does that but it's barely used.

> Second part of the story are hypothetical/future memory controls.
> 
> I think first thing to say is that implicit migration is important, but it
> is not really established to use the file descriptor from two places or to
> migrate more than once. It is simply a fresh fd which gets sent to clients
> from Xorg, which is one of the legacy ways of doing things.
> 
> So we probably can just ignore that given no significant amount of memory
> ownership would be getting migrated.

So, if this is the case, it'd be better to clarify this. ie. if the summary is:

fd gets assigned to the user with a certain action at which point the fd
doesn't have significant resources attached to it and the fd can't be moved
to some other cgroup afterwards.

then, everything is pretty simple. No need to worry about migration at all.
fd just gets assigned once at the beginning and everything gets accounted
towards that afterwards.

> And for drm.memory.stat I think what I have is good enough - both private
> and shared data get accounted, for any clients that have handles to
> particular buffers.
> 
> Maarten was working on memory controls so maybe he would have more thoughts
> on memory ownership and implicit migration.
> 
> But I don't think there is anything incompatible with that and
> drm.memory.stats as proposed here, given how the categories reported are the
> established ones from the DRM fdinfo spec, and it is fact of the matter that
> we can have multiple memory regions per driver.
> 
> The main thing that would change between this RFC and future memory controls
> in the area of drm.memory.stat is the implementation - it would have to get
> changed under the hood from "collect on query" to "account at
> allocation/free/etc". But that is just implementation details.

I'd much prefer to straighten this out before adding a preliminary stat-only
thing. If the previously described ownership model is sufficient, none of
this is complicated, right? We can just add counters to track the resources
and print them out.
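
A minimal sketch of the "just add counters" idea, assuming the ownership
model above (helper names are hypothetical, loosely following the RFC's
drm_cgroup_state):

  /* hypothetical sketch - charge at allocation, uncharge at free */
  struct drm_cgroup_state {
          struct cgroup_subsys_state css;
          atomic64_t memory_bytes;
  };

  /* called from the driver's allocation path */
  static void drmcs_charge(struct drm_cgroup_state *drmcs, u64 size)
  {
          atomic64_add(size, &drmcs->memory_bytes);
  }

  /* called from the driver's free path */
  static void drmcs_uncharge(struct drm_cgroup_state *drmcs, u64 size)
  {
          atomic64_sub(size, &drmcs->memory_bytes);
  }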

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 16/17] cgroup/drm: Expose memory stats

2023-07-26 Thread Tejun Heo
Hello,

On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote:
> > So, yeah, if you want to add memory controls, we better think through how
> > the fd ownership migration should work.
>
> I've taken a look at the series, since I have been working on cgroup memory
> eviction.
> 
> The scheduling stuff will work for i915, since it has a purely software
> execlist scheduler, but I don't think it will work for GuC (firmware)
> scheduling or other drivers that use the generic drm scheduler.
> 
> For something like this,  you would probably want it to work inside the drm
> scheduler first. Presumably, this can be done by setting a weight on each
> runqueue, and perhaps adding a callback to update one for a running queue.
> Calculating the weights hierarchically might be fun..

I don't have any idea on this front. The basic idea of making high level
distribution decisions in core code and letting individual drivers enforce
that in a way which fits them the best makes sense to me but I don't know
enough to have an opinion here.

> I have taken a look at how the rest of cgroup controllers change ownership
> when moved to a different cgroup, and the answer was: not at all. If we

For persistent resources, that's the general rule. Whoever instantiates a
resource gets to own it until the resource gets freed. There is an exception
with the pid controller and there are discussions around whether we want
some sort of migration behavior with memcg but yes by and large instantiator
being the owner is the general model cgroup follows.

> attempt to create the scheduler controls only on the first time the fd is
> used, you could probably get rid of all the tracking.
> This can be done very easily with the drm scheduler.
>
> WRT memory, I think the consensus is to track system memory like normal
> memory. Stolen memory doesn't need to be tracked. It's kernel only memory,
> used for internal bookkeeping  only.
> 
> The only time userspace can directly manipulate stolen memory, is by mapping
> the pinned initial framebuffer to its own address space. The only allocation
> it can do is when a framebuffer is displayed, and framebuffer compression
> creates some stolen memory. Userspace is not
> aware of this though, and has no way to manipulate those contents.

So, my dumb understanding:

* Ownership of an fd can be established on the first ioctl call and doesn't
  need to be migrated afterwards. There are no persistent resources to
  migrate on the first call.

* Memory then can be tracked in a similar way to memcg. Memory gets charged
  to the initial instantiator and doesn't need to be moved around
  afterwards. There may be some discrepancies around stolen memory but the
  magnitude of inaccuracy introduced that way is limited and bound and can
  be safely ignored.

Is that correct?

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 15/17] cgroup/drm: Expose GPU utilisation

2023-07-25 Thread Tejun Heo
Hello,

On Tue, Jul 25, 2023 at 03:08:40PM +0100, Tvrtko Ursulin wrote:
> > Also, shouldn't this be keyed by the drm device?
>
> It could have that too, or it could come later. The fun with GPUs is that it
> not only could be keyed by the device, but also by the type of the GPU engine.
> (Which are a) vendor specific and b) some are fully independent, some
> partially so, and some not at all - so it could get complicated semantics
> wise really fast.)

I see.

> If for now I'd go with drm.stat/usage_usec containing the total time spent
> how would you suggest adding per device granularity? Files as documented
> are either flat or nested, not both at the same time. So something like:
> 
> usage_usec 10
> card0 usage_usec 5
> card1 usage_usec 5
> 
> Would or would not fly? Have two files along the lines of drm.stat and 
> drm.dev_stat?

Please follow one of the pre-defined formats. If you want to have card
identifier and field key, it should be a nested keyed file like io.stat.

> While on this general topic, you will notice that for memory stats I have
> _sort of_ nested keyed per device format, for example on integrated Intel
> GPU:
> 
>   $ cat drm.memory.stat
>   card0 region=system total=12898304 shared=0 active=0 resident=12111872 
> purgeable=167936
>   card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
> 
> If on a discrete Intel GPU, two more lines would appear with memory
> regions of local and local-system. But then on some server class multi-tile
> GPUs there would be even further regions with more than one device local
> memory region. And users do want to see this granularity for container use
> cases at least.
> 
> Anyway, this may not be compatible with the nested key format as
> documented in cgroup-v2.rst, although it does not explicitly say.
> 
> Should I cheat and create key names based on device and memory region name
> and let userspace parse it? Like:
> 
>   $ cat drm.memory.stat
>   card0.system total=12898304 shared=0 active=0 resident=12111872 
> purgeable=167936
>   card0.stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Yeah, this looks better to me. If they're reporting different values for the
same fields, they're separate keys.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 16/17] cgroup/drm: Expose memory stats

2023-07-21 Thread Tejun Heo
On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>   $ cat drm.memory.stat
>   card0 region=system total=12898304 shared=0 active=0 resident=12111872 
> purgeable=167936
>   card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
> 
> Data is generated on demand for simplicity of implementation ie. no running
> totals are kept or accounted during migrations and such. Various
> optimisations such as cheaper collection of data are possible but
> deliberately left out for now.
> 
> Overall, the feature is deemed to be useful to container orchestration
> software (and manual management).
> 
> Limits, either soft or hard, are not envisaged to be implemented on top of
> this approach due to the on-demand nature of collecting the stats.

So, yeah, if you want to add memory controls, we better think through how
the fd ownership migration should work.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 15/17] cgroup/drm: Expose GPU utilisation

2023-07-21 Thread Tejun Heo
On Fri, Jul 21, 2023 at 12:19:32PM -1000, Tejun Heo wrote:
> On Wed, Jul 12, 2023 at 12:46:03PM +0100, Tvrtko Ursulin wrote:
> > +  drm.active_us
> > +   GPU time used by the group recursively including all child groups.
> 
> Maybe instead add drm.stat and have "usage_usec" inside? That'd be more
> consistent with cpu side.

Also, shouldn't this be keyed by the drm device?

-- 
tejun


Re: [Intel-gfx] [PATCH 15/17] cgroup/drm: Expose GPU utilisation

2023-07-21 Thread Tejun Heo
On Wed, Jul 12, 2023 at 12:46:03PM +0100, Tvrtko Ursulin wrote:
> +  drm.active_us
> + GPU time used by the group recursively including all child groups.

Maybe instead add drm.stat and have "usage_usec" inside? That'd be more
consistent with cpu side.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 12/17] cgroup/drm: Introduce weight based drm cgroup control

2023-07-21 Thread Tejun Heo
On Wed, Jul 12, 2023 at 12:46:00PM +0100, Tvrtko Ursulin wrote:
> +DRM scheduling soft limits
> +~~~~~~~~~~~~~~~~~~~~~~~~~~

Please don't say soft limits for this. It means something different for
memcg, so it gets really confusing. Call it "weight based GPU time control"
and maybe call the triggering points thresholds.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 08/17] drm/cgroup: Track DRM clients per cgroup

2023-07-21 Thread Tejun Heo
Hello,

On Wed, Jul 12, 2023 at 12:45:56PM +0100, Tvrtko Ursulin wrote:
> +void drmcgroup_client_migrate(struct drm_file *file_priv)
> +{
> + struct drm_cgroup_state *src, *dst;
> + struct cgroup_subsys_state *old;
> +
> + mutex_lock(_mutex);
> +
> + old = file_priv->__css;
> + src = css_to_drmcs(old);
> + dst = css_to_drmcs(task_get_css(current, drm_cgrp_id));
> +
> + if (src != dst) {
> + file_priv->__css = &dst->css; /* Keeps the reference. */
> + list_move_tail(&file_priv->clink, &dst->clients);
> + }
> +
> + mutex_unlock(_mutex);
> +
> + css_put(old);
> +}
> +EXPORT_SYMBOL_GPL(drmcgroup_client_migrate);

So, you're implicitly migrating the fd to the latest ioctl user on the first
access. While this may work for stateless control like usage time, it will
likely cause problems if you later want to add stateful control for
memory consumption. e.g. What happens if the new destination doesn't have
enough budget to accommodate the fd's usage? Let's say we allow over-commit
with follow-up reclaim or oom kill actions; if so, can we guarantee that all
memory ownership for the fd can always be updated? If not, what happens after
failure?

If DRM needs to transfer fd ownership with resources attached to it, it
likely would be a good idea to make that an explicit operation so that the
attempt can be verified and failed if necessary.

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.

2023-05-10 Thread Tejun Heo
Hello,

On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
> The misc controller is not granular enough. A single computer may have any 
> number of
> graphics cards, some of them with multiple regions of vram inside a single 
> card.

Extending the misc controller to support dynamic keys shouldn't be that
difficult.

...
> In the next version, I will move all the code for handling the resource limit 
> to
> TTM's eviction layer, because otherwise it cannot handle the resource limit 
> correctly.
> 
> The effect of moving the code to TTM, is that it will make the code even more 
> generic
> for drivers that have vram and use TTM. When using TTM, you only have to 
> describe your
> VRAM, update some fields in the TTM manager and (un)register your device with 
> the
> cgroup handler on (un)load. It's quite trivial to add vram accounting to 
> amdgpu and
> nouveau. [2]
> 
> If you want to add a knob for scheduling weight for a process, it makes sense 
> to
> also add resource usage as a knob, otherwise the effect of that knob is very
> limited. So even for Tvrtko's original proposed usecase, it would make sense.

It does make sense but unlike Tvrtko's scheduling weights what's being
proposed doesn't seem to encapsulate GPU memory resource in a generic enough
manner at least to my untrained eyes. ie. w/ drm.weight, I don't need any
specific knoweldge of how a specific GPU operates to say "this guy should
get 2x processing power over that guy". This more or less holds for other
major resources including CPU, memory and IO. What you're proposing seems a
lot more tied to hardware details and users would have to know a lot more
about how memory is configured on that particular GPU.

Now, if this is inherent to how all, or at least most, GPUs operate, sure,
but otherwise let's start small in terms of interface and not take up space
which should be for something universal. If this turns out to be the way,
expanding to take up the generic interface space isn't difficult.

I don't know GPU space so please educate me where I'm wrong.

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.

2023-05-05 Thread Tejun Heo
Hello,

On Wed, May 03, 2023 at 10:34:56AM +0200, Maarten Lankhorst wrote:
> RFC as I'm looking for comments.
> 
> For long running compute, it can be beneficial to partition the GPU memory
> between cgroups, so each cgroup can use its maximum amount of memory without
> interfering with other scheduled jobs. Done properly, this can alleviate the
> need for eviction, which might result in a job being terminated if the GPU
> doesn't support mid-thread preemption or recoverable page faults.
> 
> This is done by adding a bunch of knobs to cgroup:
> drm.capacity: Shows maximum capacity of each resource region.
> drm.max: Display or limit max amount of memory.
> drm.current: Current amount of memory in use.
> 
> TTM has not been made cgroup aware yet, so instead of evicting from
> the current cgroup to stay within the cgroup limits, it simply returns
> the error -ENOSPC to userspace.
> 
> I've used Tvrtko's cgroup controller series as a base, but it implemented
> scheduling weight, not memory accounting, so I only ended up keeping the
> base patch.
> 
> Xe is not upstream yet, so the driver specific patch will only apply on
> https://gitlab.freedesktop.org/drm/xe/kernel

Some high-level feedbacks.

* There have been multiple attempts at this but the track record is kinda
  poor. People don't seem to agree on what should constitute DRM memory and
  how it should be accounted / controlled.

* I like Tvrtko's scheduling patchset because it exposes a generic interface
  which makes sense regardless of hardware details and then each driver can
  implement the configured control in whatever way they can. However, even
  for that, there doesn't seem much buy-in from other drivers.

* This proposal seems narrowly scoped trying to solve a specific problem
  which may not translate to different hardware configurations. Please let
  me know if I got that wrong, but if that's the case, I think a better and
  easier approach might be just being a part of the misc controller. That
  doesn't require much extra code and should be able to provide everything
  necessary for statically limiting specific resources.
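
For context, the misc controller's static limiting interface is a set of
flat keyed files; a GPU memory resource would be used roughly as follows
("vram0" is made up for illustration, not an existing misc resource):

  $ cat /sys/fs/cgroup/misc.capacity
  vram0 8589934592
  $ echo "vram0 2147483648" > /sys/fs/cgroup/jobs/misc.max
  $ cat /sys/fs/cgroup/jobs/misc.current
  vram0 536870912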

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC v4 00/10] DRM scheduling cgroup controller

2023-03-24 Thread Tejun Heo
Hello, Tvrtko.

On Tue, Mar 14, 2023 at 02:18:54PM +, Tvrtko Ursulin wrote:
> DRM scheduling soft limits
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Because of the heterogenous hardware and driver DRM capabilities, soft limits
> are implemented as a loose co-operative (bi-directional) interface between the
> controller and DRM core.
> 
> The controller configures the GPU time allowed per group and periodically 
> scans
> the belonging tasks to detect the over budget condition, at which point it
> invokes a callback notifying the DRM core of the condition.
> 
> DRM core provides an API to query per process GPU utilization and 2nd API to
> receive notification from the cgroup controller when the group enters or exits
> the over budget condition.
> 
> Individual DRM drivers which implement the interface are expected to act on 
> this
> in the best-effort manner only. There are no guarantees that the soft limits
> will be respected.
> 
> DRM scheduling soft limits interface files
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In general, I'm in favor of your approach but can you please stop using the
term "soft limit". That's a term with a specific historical meaning in
cgroup, so it gets really confusing when you use the term for hierarchical
weighted control. If you need a term to refer to how the weighted control is
implemented by throttling cgroups at target rates, please just come up with
a different term "usage threshold based control", "usage throttling based
control" or whichever you may like.

>   drm.weight
>   Standard cgroup weight based control [1, 10000] used to configure the
>   relative distributing of GPU time between the sibling groups.
> 
> This builds upon the per client GPU utilisation work which landed recently 
> for a
> few drivers. My thinking is that in principle, an intersect of drivers which
> support both that and some sort of scheduling control, like  priorities, could
> also in theory support this controller.
> 
> Another really interesting angle for this controller is that it mimics the 
> same
> control method used by the CPU scheduler. That is the proportional/weight
> based
> GPU time budgeting. Which makes it easy to configure and does not need a new
> mental model.

FWIW, the hierarchical weighted distribution is also implemented by IO
control.

> However, as the introduction mentions, GPUs are much more heterogenous and
> therefore the controller uses very "soft" wording as to what it promises. The
> general statement is that it can define budgets, notify clients when they are
> over them, and let individual drivers implement best effort handling of those
> conditions.

Maybe "best effort" is more suited than "soft"?

...
> Roughly simultaneously we run the following two benchmarks in each session
> respectively:
> 
> 1)
> ./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen 
> /no_scorebox /benchmark /benchmark_duration_ms=6
> 
> 2)
> vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 
> --fullscreen 1 --gfx=glfw -t gl_manhattan
> 
> (The only reason for vsync off here is because I struggled to find an easily
> runnable and demanding enough benchmark, or to run on a screen large enough to
> make even a simpler one demanding.)
> 
> With this test we get 252fps from GpuTest and 96fps from GfxBenchmark.
> 
> Premise here is that one of these GPU intensive benchmarks is intended to be
> run by the user with lower priority. Imagine kicking off some background compute
> processing and continuing to use the UI for other tasks. Hence the user will 
> now
> re-run the test by first lowering the weight control of the first session (DRM
> cgroup):
> 
> 1)
> echo 50 | sudo tee /sys/fs/cgroup/`cut -d':' -f3 /proc/self/cgroup`/drm.weight
> ./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen 
> /no_scorebox /benchmark /benchmark_duration_ms=6
> 
> 2)
> vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 
> --fullscreen 1 --gfx=glfw -t gl_manhattan
> 
> In this case we will see that GpuTest has recorded 208fps (~18% down) and
> GfxBenchmark 114fps (18% up), demonstrating that even a very simple approach 
> of
> wiring up i915 to the DRM cgroup controller can enable external GPU scheduling
> control.

It's really nice to see it working pretty intuitively.

>  * For now (RFC) I haven't implemented the 2nd suggestion from Tejun of having
>a shadow tree which would only contain groups with DRM clients. (Purpose
>being less nodes to traverse in the scanning loop.)
> 
>  * Is the global state passing from can_attach to attach really okay? (I need
>source and destination css.)

Right now, it is and there are other places that depend on it. Obviously,
it's not great and we probably want to add explicit context passed around
instead in the future, but for now, it should be okay.

While not fully polished, from cgroup POV, the series looks pretty good and
I'd be happy to see it 

Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-02-02 Thread Tejun Heo
Hello,

On Thu, Feb 02, 2023 at 02:26:06PM +, Tvrtko Ursulin wrote:
> When you say active/inactive - to what you are referring in the cgroup
> world? Offline/online? For those my understanding was offline was a
> temporary state while css is getting destroyed.

Oh, it's just based on activity. So, for example, iocost puts a cgroup on
its active list, which is scanned periodically, when an IO is issued from an
inactive cgroup. If an active cgroup doesn't have any activity between two
scans, it becomes inactive and dropped from the list. drm can prolly use the
same approach?
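
Roughly, the pattern being described, with hypothetical drm-side names (the
active_node and saw_activity members are assumed additions to the RFC's
drm_cgroup_state):

  #include <linux/list.h>
  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(active_lock);
  static LIST_HEAD(active_list);

  /* called whenever a group reports GPU activity */
  static void drmcs_mark_active(struct drm_cgroup_state *drmcs)
  {
          spin_lock(&active_lock);
          /* active_node / saw_activity are hypothetical additions */
          if (list_empty(&drmcs->active_node))
                  list_add_tail(&drmcs->active_node, &active_list);
          drmcs->saw_activity = true;
          spin_unlock(&active_lock);
  }

  /* periodic scan: walk only active groups, drop the ones that went idle */
  static void drmcs_scan_active(void)
  {
          struct drm_cgroup_state *drmcs, *tmp;

          spin_lock(&active_lock);
          list_for_each_entry_safe(drmcs, tmp, &active_list, active_node) {
                  if (!drmcs->saw_activity)
                          list_del_init(&drmcs->active_node);
                  drmcs->saw_activity = false;
          }
          spin_unlock(&active_lock);
  }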

> Also, I am really postponing implementing those changes until I hear at
> least something from the DRM community.

Yeah, that sounds like a good idea.

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-01-27 Thread Tejun Heo
On Thu, Jan 12, 2023 at 04:56:07PM +, Tvrtko Ursulin wrote:
...
> + /*
> +  * 1st pass - reset working values and update hierarchical weights and
> +  * GPU utilisation.
> +  */
> + if (!__start_scanning(root, period_us))
> + goto out_retry; /*
> +  * Always come back later if scanner races with
> +  * core cgroup management. (Repeated pattern.)
> +  */
> +
> + css_for_each_descendant_pre(node, &root->css) {
> + struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> + struct cgroup_subsys_state *css;
> + unsigned int over_weights = 0;
> + u64 unused_us = 0;
> +
> + if (!css_tryget_online(node))
> + goto out_retry;
> +
> + /*
> +  * 2nd pass - calculate initial budgets, mark over budget
> +  * siblings and add up unused budget for the group.
> +  */
> + css_for_each_child(css, &node->css) {
> + struct drm_cgroup_state *sibling = css_to_drmcs(css);
> +
> + if (!css_tryget_online(css)) {
> + css_put(node);
> + goto out_retry;
> + }
> +
> + sibling->per_s_budget_us  =
> + DIV_ROUND_UP_ULL(drmcs->per_s_budget_us *
> +  sibling->weight,
> +  drmcs->sum_children_weights);
> +
> + sibling->over = sibling->active_us >
> + sibling->per_s_budget_us;
> + if (sibling->over)
> + over_weights += sibling->weight;
> + else
> + unused_us += sibling->per_s_budget_us -
> +  sibling->active_us;
> +
> + css_put(css);
> + }
> +
> + /*
> +  * 3rd pass - spread unused budget according to relative weights
> +  * of over budget siblings.
> +  */
> + css_for_each_child(css, &node->css) {
> + struct drm_cgroup_state *sibling = css_to_drmcs(css);
> +
> + if (!css_tryget_online(css)) {
> + css_put(node);
> + goto out_retry;
> + }
> +
> + if (sibling->over) {
> + u64 budget_us =
> + DIV_ROUND_UP_ULL(unused_us *
> +  sibling->weight,
> +  over_weights);
> + sibling->per_s_budget_us += budget_us;
> + sibling->over = sibling->active_us  >
> + sibling->per_s_budget_us;
> + }
> +
> + css_put(css);
> + }
> +
> + css_put(node);
> + }
> +
> + /*
> +  * 4th pass - send out over/under budget notifications.
> +  */
> + css_for_each_descendant_post(node, &root->css) {
> + struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> +
> + if (!css_tryget_online(node))
> + goto out_retry;
> +
> + if (drmcs->over || drmcs->over_budget)
> + signal_drm_budget(drmcs,
> +   drmcs->active_us,
> +   drmcs->per_s_budget_us);
> + drmcs->over_budget = drmcs->over;
> +
> + css_put(node);
> + }

It keeps bothering me that the distribution logic has no memory. Maybe this
is good enough for coarse control with long cycle durations but it likely
will get in trouble if pushed to finer grained control. State keeping
doesn't require a lot of complexity. The only state that needs tracking is
each cgroup's vtime and then the core should be able to tell specific
drivers how much each cgroup is over or under fairly accurately at any given
time.

That said, this isn't a blocker. What's implemented can work well enough
with coarse enough time grain and that might be enough for the time being
and we can get back to it later. I think Michal already mentioned it but it
might be a good idea to track active and inactive cgroups and build the
weight tree with only active ones. There are machines with a lot of mostly
idle cgroups (> tens of thousands) and tree wide scanning even at low
frequency can become a pretty bad bottleneck.

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC v3 00/12] DRM scheduling cgroup controller

2023-01-26 Thread Tejun Heo
Hello,

On Thu, Jan 26, 2023 at 02:00:50PM +0100, Michal Koutný wrote:
> On Wed, Jan 25, 2023 at 06:11:35PM +, Tvrtko Ursulin 
>  wrote:
> > I don't immediately see how you envisage the half-userspace implementation
> > would look like in terms of what functionality/new APIs would be provided by
> > the kernel?
> 
> Output:
>   drm.stat (with consumed time(s))
> 
> Input:
>   drm.throttle (alternatives)
>   - a) writing 0,1 (in rough analogy to your proposed
>notifications)
>   - b) writing duration (in loose analogy to memory.reclaim)
>- for how long GPU work should be backed off
> 
> A userspace agent sitting between these two would do the measurement
> and calculation depending on given policies (weighting, throttling) and
> apply respective controls.
> 
> (In resemblance of e.g. https://denji.github.io/cpulimit/)

Yeah, things like this can be done from userspace but if we're gonna build
the infrastructure to allow that in gpu drivers and so on, I don't see why
we wouldn't add a generic in-kernel control layer if we can implement a
proper weight based control. We can of course also expose .max style
interface to allow userspace to do whatever they wanna do with it.

> > Problem there is to find a suitable point to charge at. If for a moment we
> > limit the discussion to i915, out of the box we could having charging
> > happening at several thousand times per second to effectively never. This is
> > to illustrate the GPU context execution dynamics which range from many small
> > packets of work to multi-minute, or longer. For the latter to be accounted
> > for we'd still need some periodic scanning, which would then perhaps go per
> > driver. For the former we'd have thousands of needless updates per second.
> > 
> > Hence my thinking was to pay both the cost of accounting and collecting the
> > usage data once per actionable event, where the latter is controlled by some
> > reasonable scanning period/frequency.
> > 
> > In addition to that, a few DRM drivers already support GPU usage querying
> > via fdinfo, so that being externally triggered, it is next to trivial to
> > wire all those DRM drivers into such common DRM cgroup controller framework.
> > All that every driver needs to implement on top is the "over budget"
> > callback.
> 
> I'd also like show comparison with CPU accounting and controller.
> There is tick-based (~sampling) measurement of various components of CPU
> time (task_group_account_field()). But the actual scheduling (weights)
> or throttling is based on precise accounting (update_curr()).
> 
> So, if the goal is to have precise and guaranteed limits, it shouldn't
> (cannot) be based on sampling. OTOH, if it must be sampling based due to
> variability of the device landscape, it could be advisory mechanism with
> the userspace component.

As for the specific control mechanism, yeah, charge based interface would be
more conventional and my suspicion is that transposing the current
implementation that way likely isn't too difficult. It just pushes "am I
over the limit?" decisions to the specific drivers with the core layer
telling them how much under/over budget they are. I'm curious what other gpu
driver folks think about the current RFC tho. Is at least AMD on board with
the approach?

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC 11/13] cgroup/drm: Introduce weight based drm cgroup control

2022-11-28 Thread Tejun Heo
Hello,

On Thu, Nov 24, 2022 at 02:32:25PM +, Tvrtko Ursulin wrote:
> > Soft limits is a bit of misnomer and can be confused with best-effort limits
> > such as memory.high. Prolly best to not use the term.
> 
> Are you suggesting "best effort limits" or "best effort" something else? It
> would sound good to me if we found the right term. Best effort
> budget perhaps?

A more conventional name would be hierarchical weighted distribution.

> Also, when you mention scalability you are concerned about multiple tree
> walks I have per iteration? I wasn't so much worried about that, definitely
> not for the RFC, but even in general due to the relatively low frequency of
> scanning and a good amount of less trivial cost being outside the actual
> tree walks (drm client walks, GPU utilisation calculations, maybe more). But
> perhaps I don't have the right idea on how big cgroup hierarchies can be
> compared to number of drm clients etc.

It's just a better way of doing this kind of weight based scheduling. It's
simpler, more scalable and easier to understand how things are working. The
basic idea is pretty simple - each schedulable entity gets assigned a
timestamp and whenever it consumes the target resource, its time is wound
forward by the consumption amount divided by its absolute share - e.g. if
cgroup A deserves 25% of the entire thing and it ran for 1s, its time is
wound forward by 1s / 0.25 == 4s. There's a rbtree keyed by these timestamps
and anything wanting to consume gets put on that tree and whatever is at the
head of the tree is the next thing to run.
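
A minimal sketch of that vtime idea, with hypothetical names (share is the
entity's absolute hierarchical fraction of the resource in fixed point):

  #include <linux/math64.h>

  #define VT_WEIGHT_ONE   (1U << 16)      /* fixed point "100% share" */

  /* hypothetical types, for illustration only */
  struct vt_entity {
          u64 vtime;      /* rbtree key; the smallest vtime runs next */
          u32 share;      /* absolute hierarchical share, fixed point */
  };

  static void vt_charge(struct vt_entity *e, u64 delta_ns)
  {
          /*
           * With share == VT_WEIGHT_ONE / 4 (a 25% share), running for
           * 1s winds vtime forward by 4s, exactly as in the example above.
           */
          e->vtime += div64_u64(delta_ns * VT_WEIGHT_ONE, e->share);
  }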

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC 11/13] cgroup/drm: Introduce weight based drm cgroup control

2022-11-22 Thread Tejun Heo
On Wed, Nov 09, 2022 at 04:11:39PM +, Tvrtko Ursulin wrote:
> +DRM scheduling soft limits
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Because of the heterogenous hardware and driver DRM capabilities, soft limits
> +are implemented as a loose co-operative (bi-directional) interface between 
> the
> +controller and DRM core.
> +
> +The controller configures the GPU time allowed per group and periodically 
> scans
> +the belonging tasks to detect the over budget condition, at which point it
> +invokes a callback notifying the DRM core of the condition.
> +
> +DRM core provides an API to query per process GPU utilization and 2nd API to
> +receive notification from the cgroup controller when the group enters or 
> exits
> +the over budget condition.
> +
> +Individual DRM drivers which implement the interface are expected to act on 
> this
> +in the best-effort manner only. There are no guarantees that the soft limits
> +will be respected.

Soft limits is a bit of misnomer and can be confused with best-effort limits
such as memory.high. Prolly best to not use the term.

> +static bool
> +__start_scanning(struct drm_cgroup_state *root, unsigned int period_us)
> +{
> + struct cgroup_subsys_state *node;
> + bool ok = false;
> +
> + rcu_read_lock();
> +
> + css_for_each_descendant_post(node, &root->css) {
> + struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> +
> + if (!css_tryget_online(node))
> + goto out;
> +
> + drmcs->active_us = 0;
> + drmcs->sum_children_weights = 0;
> +
> + if (node == &root->css)
> + drmcs->per_s_budget_ns =
> + DIV_ROUND_UP_ULL(NSEC_PER_SEC * period_us,
> +  USEC_PER_SEC);
> + else
> + drmcs->per_s_budget_ns = 0;
> +
> + css_put(node);
> + }
> +
> + css_for_each_descendant_post(node, &root->css) {
> + struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> + struct drm_cgroup_state *parent;
> + u64 active;
> +
> + if (!css_tryget_online(node))
> + goto out;
> + if (!node->parent) {
> + css_put(node);
> + continue;
> + }
> + if (!css_tryget_online(node->parent)) {
> + css_put(node);
> + goto out;
> + }
> + parent = css_to_drmcs(node->parent);
> +
> + active = drmcs_get_active_time_us(drmcs);
> + if (active > drmcs->prev_active_us)
> + drmcs->active_us += active - drmcs->prev_active_us;
> + drmcs->prev_active_us = active;
> +
> + parent->active_us += drmcs->active_us;
> + parent->sum_children_weights += drmcs->weight;
> +
> + css_put(node);
> + css_put(&parent->css);
> + }
> +
> + ok = true;
> +
> +out:
> + rcu_read_unlock();
> +
> + return ok;
> +}

A more conventional and scalable way to go about this would be using an
rbtree keyed by virtual time. Both CFS and blk-iocost are examples of this,
but I think for drm, it can be a lot simpler.

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC 00/17] DRM scheduling cgroup controller

2022-10-31 Thread Tejun Heo
Hello,

On Thu, Oct 27, 2022 at 03:32:00PM +0100, Tvrtko Ursulin wrote:
> Looking at what's available in cgroups world now, I have spotted the
> blkio.prio.class control. For my current use case (lower GPU scheduling of
> background/unfocused windows) that would also work. Even if starting with
> just two possible values - 'no-change' and 'idle' (to follow the IO
> controller naming).

I wouldn't follow that example. That's only meaningful within the context of
bfq and it probably shouldn't have been merged in the first place.

> How would you view that as a proposal? It would be a bit tougher "sell" to
> the DRM community, perhaps, given that many drivers do have scheduling
> priority, but the concept of scheduling class is not really there.
> Nevertheless a concept of normal-vs-background could be plausible in my
> mind. It could be easily implemented by using the priority controls
> available in many drivers.

I don't feel great about that.

* The semantics aren't clearly defined. While not immediately obvious in the
  interface, the task nice levels have definite mappings to weight values
  and thus clear meanings. I don't think it's a good idea to leave the
  interface semantics open to interpretation.

* Maybe GPUs are better but my experience with optional hardware features in
  the storage world has been that vendors diverge wildly and unexpectedly to
  the point many features are mostly useless. There are fewer GPU vendors
  and more software effort behind each, so maybe the situation is better but
  I think it'd be helpful to keep some skepticism.

* Even when per-vendor or per-driver features are consistent enough to be
  useful, they often aren't thought through enough to be truly useful. e.g.
  nvme has priority features but they aren't really that useful because they
  can't do much without congestion control on the issuer side and once you
  have congestion control on the issuer side which is usually a lot more
  complex (e.g. dealing with work-conserving hierarchical weight
  distributions, priority inversions and so on), you can achieve most of
  what you need in terms of resource control from the issuer side anyway.

So, I'd much prefer to have a fuller solution on the kernel side which
integrates per-vendor/driver features where they make sense.

> > >drm.budget_supported
> > >   One of:
> > >1) 'yes' - when all DRM clients in the group support the functionality.
> > >2) 'no' - when at least one of the DRM clients does not support the
> > >  functionality.
> > >3) 'n/a' - when there are no DRM clients in the group.
> > 
> > Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
> > with and I'm not sure 'no' meaning at least one device not supporting is
> > intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
> > drop this.
> 
> The idea actually is for this to be per-cgroup and potentially change
> dynamically. It implements the concept of "observability", as I have,
> perhaps clumsily, named it.
> 
> This is because we can have a mix of DRM file descriptors in a cgroup, not
> all of which support the proposed functionality. So I wanted to have
> something by which the administrator can observe the status of the group.
> 
> For instance seeing some clients do not support the feature could be signal
> that things have been misconfigured, or that appeal needs to be made for
> driver X to start supporting the feature. Seeing a "no" there in other words
> is a signal that budgeting may not really work as expected and needs to be
> investigated.

I still don't see how this is per-cgroup given that it's indicating whether
the driver supports some feature. Also, the eventual goal would be
supporting the same control mechanisms across most (if not all) GPUs, right?

> > Rather than doing it hierarchically on the spot, it's usually a lot cheaper
> > and easier to calculate the flattened hierarchical weight per leaf cgroup
> > and divide the bandwidth according to the eventual portions. For an example,
> > please take a look at block/blk-iocost.c.
> 
> This seems exactly what I had in mind (but haven't implemented it yet). So
> in this RFC I have budget splitting per group where each tree level adds up
> to "100%" and the thing which I have not implemented is "borrowing" or
> yielding (how blk-iocost.c calls it, or donating) unused budgets to
> siblings.
> 
> I am very happy to hear my idea is the right one and someone already
> implemented it. Thanks for this pointer!

The budget donation thing in iocost is necessary only because it wants to
make the hot path local to the cgroup because io control has to support very
high decision rate. For time-slicing GPU, it's likely that following the
current hierarchical weight on the spot is enough.

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC 00/17] DRM scheduling cgroup controller

2022-10-19 Thread Tejun Heo
Hello,

On Wed, Oct 19, 2022 at 06:32:37PM +0100, Tvrtko Ursulin wrote:
...
> DRM static priority interface files
> ~~~
> 
>   drm.priority_levels
>   One of:
>1) An integer representing the minimum number of discrete priority
>   levels for the whole group.
>   Optionally followed by an asterisk ('*') indicating some DRM clients
>   in the group support more than the minimum number.
>2) '0'- indicating one or more DRM clients in the group has no support
>   for static priority control.
>3) 'n/a' - when there are no DRM clients in the configured group.
> 
>   drm.priority
>   A read-write integer between -1 and 1 (inclusive) representing
>   an abstract static priority level.
> 
>   drm.effective_priority
>   Read only integer showing the current effective priority level for the
>   group. Effective meaning taking into account the chain of inherited

From interface POV, this is a lot worse than the second proposal and I'd
really like to avoid this. Even if we go with mapping user priority
configuration to per-driver priorities, I'd much prefer if the interface
presented to user is weight based and let each driver try to match the
resulting hierarchical weight (ie. the absolute proportion a given cgroup
should have at the point in time) as best as they can rather than exposing
opaque priority numbers to userspace whose meaning isn't defined at all.

> DRM scheduling soft limits interface files
> ~~
> 
>   drm.weight
>   Standard cgroup weight based control [1, 10000] used to configure the
>   relative distributing of GPU time between the sibling groups.

Please take a look at io.weight. This can follow the same convention to
express both global and per-device weights.
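
For reference, io.weight expresses a default weight plus per-device
overrides on separate keyed lines; drm.weight could follow the same
convention (the drm lines here are purely illustrative):

  $ cat io.weight
  default 100
  8:16 200

  $ cat drm.weight
  default 100
  card0 80
  card1 20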

>   drm.period_us
>   An integer representing the period with which the controller should look
>   at the GPU usage by the group and potentially send the over/under budget
>   signal.
>   Value of zero (default) disables the soft limit checking.

Can we not do period_us or at least make it a per-driver tuning parameter
exposed as module param? Weight, users can easily understand and configure.
period_us is a lot more of an implementation detail. If we want to express the
trade-off between latency and bandwidth at the interface, we prolly should
encode the latency requirement in a more canonical way but let's leave that
for the future.
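
A sketch of the module-parameter alternative (the parameter name is
hypothetical, not part of the series):

  #include <linux/moduleparam.h>

  /* hypothetical parameter exposing the scan period as a driver tunable */
  static unsigned int drmcg_period_us = 500000;
  module_param(drmcg_period_us, uint, 0644);
  MODULE_PARM_DESC(drmcg_period_us,
                   "Scanning period for over/under budget checks (us)");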

>   drm.budget_supported
>   One of:
>1) 'yes' - when all DRM clients in the group support the functionality.
>2) 'no' - when at least one of the DRM clients does not support the
>  functionality.
>3) 'n/a' - when there are no DRM clients in the group.

Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
with and I'm not sure 'no' meaning at least one device not supporting is
intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
drop this.

Another basic interface question. Is everyone happy with the drm prefix or
should it be something like gpu? Also, in the future, if there's a consensus
around how to control gpu memory, what prefix would that take?

> The second proposal is a little bit more advanced in concept and also a little
> bit less finished. Interesting thing is that it builds upon the per client GPU
> utilisation work which landed recently for a few drivers. So my thinking is 
> that
> in principle, an intersect of drivers which support both that and some sort of
> priority scheduling control, could also in theory support this.
> 
> Another really interesting angle for this controller is that it mimics the 
> same
> control method used by the CPU scheduler. That is the proportional/weight
> based
> GPU time budgeting. Which makes it easy to configure and does not need a new
> mental model.
> 
> However, as the introduction mentions, GPUs are much more heterogenous and
> therefore the controller uses very "soft" wording as to what it promises. The
> general statement is that it can define budgets, notify clients when they are
> over them, and let individual drivers implement best effort handling of those
> conditions.
> 
> Delegation of duties in the implementation goes likes this:
> 
>  * DRM cgroup controller implements the control files and the scanning loop.
>  * DRM core is required to track all DRM clients belonging to processes so it
>can answer when asked how much GPU time is a process using.
>  * DRM core also provides a call back which the controller will call when a
>certain process is over budget.
>  * Individual drivers need to implement two similar hooks, but which work for
>a single DRM client. Over budget callback and GPU utilisation query.
> 
> What I have demonstrated in practice is that when wired to i915, in a really
> primitive way where the over-budget condition simply lowers the scheduling
> priority, 

Re: [Intel-gfx] [BUG] lockdep splat with kernfs lockdep annotations and slab mutex from drm patch??

2019-06-14 Thread Tejun Heo
Hello,

On Fri, Jun 14, 2019 at 04:08:33PM +0100, Chris Wilson wrote:
> #ifdef CONFIG_MEMCG
> if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
> struct kmem_cache *c;
> 
mutex_lock(&slab_mutex);
> 
> so it happens to hit the error + FULL case with the additional slabcaches?
> 
> Anyway, according to lockdep, it is dangerous to use the slab_mutex inside
> slab_attr_store().

Didn't really look into the code but it looks like slab_mutex is held
while trying to remove sysfs files.  sysfs file removal flushes
on-going accesses, so if a file operation then tries to grab a mutex
which is held during removal, it leads to a deadlock.
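
Schematically, the reported cycle looks like this (illustrative only, using
the mutex named in the report):

  CPU0 (cache destruction)             CPU1 (sysfs write)
  ------------------------             ------------------
  mutex_lock(&slab_mutex)              slab_attr_store()
  sysfs/kernfs file removal              mutex_lock(&slab_mutex)   <- blocks
    waits for active file ops  <--------- never completes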

Thanks.

-- 
tejun

Re: [Intel-gfx] [RFC PATCH 0/5] cgroup support for GPU devices

2019-05-09 Thread Tejun Heo
Hello,

On Tue, May 07, 2019 at 12:50:50PM -0700, Welty, Brian wrote:
> There might still be merit in having a 'device mem' cgroup controller.
> The resource model at least is then no longer mixed up with host memory.
> RDMA community seemed to have some interest in a common controller at
> least for device memory aspects.
> Thoughts on this?   I believe we could still reuse the 'struct mem_cgroup' data
> structure.  There should be some opportunity to reuse charging APIs and
> have some nice integration with HMM for charging to device memory, depending
> on backing store.

Library-ish sharing is fine but in terms of interface, I think it'd be
better to keep them separate at least for now.  Down the line maybe
these resources will interact with each other in a more integrated way
but I don't think it's a good idea to try to design and implement
resource models for something like that preemptively.

Thanks.

-- 
tejun

Re: [Intel-gfx] [RFC PATCH 0/5] cgroup support for GPU devices

2019-05-06 Thread Tejun Heo
Hello,

On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
> The patch series enables device drivers to use cgroups to control the
> following resources within a GPU (or other accelerator device):
> *  control allocation of device memory (reuse of memcg)
> and with future work, we could extend to:
> *  track and control share of GPU time (reuse of cpu/cpuacct)
> *  apply mask of allowed execution engines (reuse of cpusets)
> 
> Instead of introducing a new cgroup subsystem for GPU devices, a new
> framework is proposed to allow devices to register with existing cgroup
> controllers, which creates per-device cgroup_subsys_state within the
> cgroup.  This gives device drivers their own private cgroup controls
> (such as memory limits or other parameters) to be applied to device
> resources instead of host system resources.
> Device drivers (GPU or other) are then able to reuse the existing cgroup
> controls, instead of inventing similar ones.

I'm really skeptical about this approach.  When creating resource
controllers, I think what's the most important and challenging is
establishing resource model - what resources are and how they can be
distributed.  This patchset is going the other way around - building
out core infrastructure for boilerplates at a significant risk of
mixing up resource models across different types of resources.

IO controllers already implement per-device controls.  I'd suggest
following the same interface conventions and implementing a dedicated
controller for the subsystem.

Thanks.

-- 
tejun

Re: [Intel-gfx] [PATCH RFC 2/5] cgroup: Add mechanism to register vendor specific DRM devices

2018-11-20 Thread Tejun Heo
Hello,

On Tue, Nov 20, 2018 at 10:21:14PM +, Ho, Kenny wrote:
> By this reply, are you suggesting that vendor specific resources
> will never be acceptable to be managed under cgroup?  Let's say a user

I wouldn't say never but whatever which gets included as a cgroup
controller should have clearly defined resource abstractions and the
control schemes around them including support for delegation.  AFAICS,
gpu side still seems to have a long way to go (and it's not clear
whether that's somewhere it will or needs to end up).

> want to have similar functionality as what cgroup is offering but to
> manage vendor specific resources, what would you suggest as a
> solution?  When you say keeping vendor specific resource regulation
> inside drm or specific drivers, do you mean we should replicate the
> cgroup infrastructure there or do you mean either drm or specific
> driver should query existing hierarchy (such as device or perhaps
> cpu) for the process organization information?
> 
> To put the questions in more concrete terms, let's say a user wants to
> expose certain part of a gpu to a particular cgroup similar to the
> way selective cpu cores are exposed to a cgroup via cpuset, how
> should we go about enabling such functionality?

Do what the intel driver or bpf is doing?  It's not difficult to hook
into cgroup for identification purposes.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH RFC 2/5] cgroup: Add mechanism to register vendor specific DRM devices

2018-11-20 Thread Tejun Heo
Hello,

On Tue, Nov 20, 2018 at 01:58:11PM -0500, Kenny Ho wrote:
> Since many parts of the DRM subsystem has vendor-specific
> implementations, we introduce mechanisms for vendor to register their
> specific resources and control files to the DRM cgroup subsystem.  A
> vendor will register itself with the DRM cgroup subsystem first before
> registering individual DRM devices to the cgroup subsystem.
> 
> In addition to the cgroup_subsys_state that is common to all DRM
> devices, a device-specific state is introduced and it is allocated
> according to the vendor of the device.

So, I'm still pretty negative about adding a drm controller at this
point.  There isn't enough of a common resource model defined yet, and
until that gets sorted out I think it's in the best interest of
everyone involved to keep it inside drm or the specific driver proper.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 02/12] blk: use for_each_if

2018-07-11 Thread Tejun Heo
On Wed, Jul 11, 2018 at 01:31:51PM -0600, Jens Axboe wrote:
> I don't think there's a git easy way of sending it out outside of
> just ensuring that everybody is CC'ed on everything. I don't mind
> that at all. I don't subscribe to lkml, and the patches weren't
> sent to linux-block. Hence all I see is this stand-alone patch,
> and logic would dictate that it's stand-alone (but it isn't).

What I sometimes do is include a short blurb on each patch giving
the overview and action hints (e.g. this is part of a patchset doing XYZ
and should be routed such and such).  It's a bit redundant but has
worked pretty well for patchsets with dependent & sweeping changes.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 03/12] cgroup: use for_each_if

2018-07-11 Thread Tejun Heo
On Mon, Jul 09, 2018 at 10:36:41AM +0200, Daniel Vetter wrote:
> Avoids the need to invert the condition instead of the open-coded
> version.
> 
> Signed-off-by: Daniel Vetter 
> Cc: Tejun Heo 
> Cc: Li Zefan 
> Cc: Johannes Weiner 
> Cc: cgro...@vger.kernel.org

Acked-by: Tejun Heo 

Please feel free to route as you see fit.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 02/12] blk: use for_each_if

2018-07-11 Thread Tejun Heo
On Wed, Jul 11, 2018 at 09:40:58AM -0700, Tejun Heo wrote:
> On Mon, Jul 09, 2018 at 10:36:40AM +0200, Daniel Vetter wrote:
> > Makes the macros resilient against if {} else {} blocks right
> > afterwards.
> > 
> > Signed-off-by: Daniel Vetter 
> > Cc: Tejun Heo 
> > Cc: Jens Axboe 
> > Cc: Shaohua Li 
> > Cc: Kate Stewart 
> > Cc: Greg Kroah-Hartman 
> > Cc: Joseph Qi 
> > Cc: Daniel Vetter 
> > Cc: Arnd Bergmann 
> 
> Acked-by: Tejun Heo 
> 
> Jens, it'd probably be best to route this through block tree.

Oops, this requires an earlier patch to move the for_each_if def to a
common header and should be routed together.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 02/12] blk: use for_each_if

2018-07-11 Thread Tejun Heo
On Mon, Jul 09, 2018 at 10:36:40AM +0200, Daniel Vetter wrote:
> Makes the macros resilient against if {} else {} blocks right
> afterwards.
> 
> Signed-off-by: Daniel Vetter 
> Cc: Tejun Heo 
> Cc: Jens Axboe 
> Cc: Shaohua Li 
> Cc: Kate Stewart 
> Cc: Greg Kroah-Hartman 
> Cc: Joseph Qi 
> Cc: Daniel Vetter 
> Cc: Arnd Bergmann 

Acked-by: Tejun Heo 

Jens, it'd probably be best to route this through block tree.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH v3 1/6] cgroup: Allow registration and lookup of cgroup private data

2018-03-13 Thread Tejun Heo
Hello,

On Tue, Mar 13, 2018 at 02:47:45PM -0700, Alexei Starovoitov wrote:
> it has to be zero lookups. If idr lookup is involved, it's cleaner
> to add idr as new bpf map type and use cgroup ino as an id.

Oh, idr (or rather ida) is just to allocate the key; once the key is
there, it pretty much should boil down to sth like

rcu_read_lock();
table = rcu_dereference(cgrp->table);
if (key < table->len)
        ret = table[key];
else
        ret = NULL;
rcu_read_unlock();

Depending on the requirements, we can get rid of the table->len check
by making key alloc path more expensive (ie. give out key only after
table extension is fully done and propagated).
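
A rough sketch of what that more expensive allocation path could look
like, assuming an ida-backed key space and a made-up
extend_cgroup_tables() helper which grows and RCU-publishes every
cgroup's table before the key is handed out (none of these names exist
in the tree):

  static DEFINE_IDA(cgrp_priv_ida);
  static DEFINE_MUTEX(cgrp_priv_mutex);

  int cgrp_priv_alloc_key(void)
  {
          int key, ret;

          mutex_lock(&cgrp_priv_mutex);
          key = ida_simple_get(&cgrp_priv_ida, 0, 0, GFP_KERNEL);
          if (key < 0) {
                  mutex_unlock(&cgrp_priv_mutex);
                  return key;
          }
          /* grow and rcu_assign_pointer() each cgroup's table to at least
           * key + 1 entries before anyone can use the key, so lookups can
           * skip the table->len check */
          ret = extend_cgroup_tables(key + 1);
          if (ret) {
                  ida_simple_remove(&cgrp_priv_ida, key);
                  key = ret;
          }
          mutex_unlock(&cgrp_priv_mutex);
          return key;
  }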

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH v3 1/6] cgroup: Allow registration and lookup of cgroup private data

2018-03-13 Thread Tejun Heo
Hello, Matt.

cc'ing Roman and Alexei.

On Tue, Mar 06, 2018 at 03:46:55PM -0800, Matt Roper wrote:
> There are cases where other parts of the kernel may wish to store data
> associated with individual cgroups without building a full cgroup
> controller.  Let's add interfaces to allow them to register and lookup
> this private data for individual cgroups.
> 
> A kernel system (e.g., a driver) that wishes to register private data
> for a cgroup will do so by subclassing the 'struct cgroup_priv'
> structure to describe the necessary data to store.  Before registering a
> private data structure to a cgroup, the caller should fill in the 'key'
> and 'free' fields of the base cgroup_priv structure.
> 
>  * 'key' should be a unique void* that will act as a key for future
>privdata lookups/removals.  Note that this allows drivers to store
>per-device private data for a cgroup by using a device pointer as a key.
> 
>  * 'free' should be a function pointer to a function that may be used
>to destroy the private data.  This function will be called
>automatically if the underlying cgroup is destroyed.

This feature turned out to have more users than I originally
anticipated and bpf also wants something like this to track network
states.  The requirements are pretty similar but not quite the same.
The extra requirements are...

* Lookup must be really cheap.  Probably using a pointer hash or walking
  a list isn't great, so maybe an idr based lookup + RCU protected index
  table per cgroup?

* It should support both regular memory and percpu pointers.  Given
  that what cgroup does is pretty much cgroup:key -> pointer lookup,
  it's mostly about getting the interface right so that it's not too
  error-prone.

Sorry about moving the goalpost.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH v3 3/6] cgroup: Introduce cgroup_permission()

2018-03-13 Thread Tejun Heo
On Tue, Mar 06, 2018 at 03:46:57PM -0800, Matt Roper wrote:
> Non-controller kernel subsystems may base access restrictions for
> cgroup-related syscalls/ioctls on a process' access to the cgroup.
> Let's make it easy for other parts of the kernel to check these cgroup
> permissions.

I'm not sure about pulling in cgroup perms this way because cgroup
access perms involve more than just the file and directory permissions.
Can't this be restricted to root or some CAP?

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH v3 2/6] cgroup: Introduce task_get_dfl_cgroup()

2018-03-13 Thread Tejun Heo
(cc'ing Roman)

Hello,

On Tue, Mar 06, 2018 at 03:46:56PM -0800, Matt Roper wrote:
> +static inline struct cgroup *
> +task_get_dfl_cgroup(struct task_struct *task)
> +{
> + struct cgroup *cgrp;
> +
> + mutex_lock(&cgroup_mutex);
> + cgrp = task_dfl_cgroup(task);
> + cgroup_get(cgrp);
> + mutex_unlock(&cgroup_mutex);

Heh, this is super heavy.  Can't we do "rcu, try get, compare &
retry"?

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH 1/5] workqueue: Allow retrieval of current task's work struct

2018-02-12 Thread Tejun Heo
Hello,

On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> Introduce a helper to retrieve the current task's work struct if it is
> a workqueue worker.
> 
> This allows us to fix a long-standing deadlock in several DRM drivers
> wherein the ->runtime_suspend callback waits for a specific worker to
> finish and that worker in turn calls a function which waits for runtime
> suspend to finish.  That function is invoked from multiple call sites
> and waiting for runtime suspend to finish is the correct thing to do
> except if it's executing in the context of the worker.
> 
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Lai Jiangshan <jiangshan...@gmail.com>
> Cc: Dave Airlie <airl...@redhat.com>
> Cc: Ben Skeggs <bske...@redhat.com>
> Cc: Alex Deucher <alexander.deuc...@amd.com>
> Signed-off-by: Lukas Wunner <lu...@wunner.de>

I wonder whether it's too generic a name but there are other functions
named in a similar fashion and AFAICS current_work isn't used by
anyone in the tree, so it seems okay.

Acked-by: Tejun Heo <t...@kernel.org>

Please feel free to route as you see fit.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH RFC v2 3/7] cgroup: Add interface to allow drivers to lookup process cgroup membership

2018-02-07 Thread Tejun Heo
Hello,

On Thu, Feb 01, 2018 at 11:53:11AM -0800, Matt Roper wrote:
> +/**
> + * cgroup_for_driver_process - return the cgroup for a process
> + * @pid: process to lookup cgroup for
> + *
> + * Returns the cgroup from the v2 hierarchy that a process belongs to.
> + * This function is intended to be called from drivers and will obtain
> + * the necessary cgroup locks.
> + *
> + * RETURNS:
> + * Process' cgroup in the default (v2) hierarchy
> + */
> +struct cgroup *
> +cgroup_for_driver_process(struct pid *pid)

I think you probably want to use task_dfl_cgroup().  We can add
task_get_dfl_cgroup() in a similar fashion to task_get_css() if
necessary.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH RFC v2 1/7] cgroup: Allow drivers to store data associated with a cgroup

2018-02-07 Thread Tejun Heo
Hello,

On Thu, Feb 01, 2018 at 11:53:09AM -0800, Matt Roper wrote:
>  * Drivers may be built as modules (and unloaded/reloaded) which is not
>something cgroup controllers support today.

As discussed in the other subthread, this shouldn't be a concern.

>  * Drivers may wish to provide their own interface to allow userspace to
>adjust driver-specific settings (e.g., via a driver ioctl rather than
>via the kernfs filesystem).
>  * A single driver may be managing multiple devices and wish to maintain
>different driver-specific cgroup data for each.

If you look at io and rdma controllers, they already do this.

> Note that technically these interfaces aren't restricted to drivers
> (other non-driver parts of the kernel could make use of them as well).
> I expect drivers to be the primary consumers of this interface and
> couldn't think of a more appropriate generic name (the term "subsystem"
> would probably be more accurate, but that's already used by cgroup
> controllers).

Let's please not do "driver", it's really confusing.  Just coming up
with a made-up word would be fine as long as the connection can be
made and the word is easily identifiable.  e.g. cgroup cdata / pdata for
cgroup custom / priv data.

> +/*
> + * Driver-specific cgroup data.  Drivers should subclass this structure with
> + * their own fields for data that should be stored alongside individual
> + * cgroups.
> + */
> +struct cgroup_driver_data {
> + /* Driver this data structure is associated with */
> + struct cgroup_driver *drv;
> +
> + /* Node in cgroup's data hashtable */
> + struct hlist_node cgroupnode;
> +
> + /* Node in driver's data list; used to cleanup on driver unload */
> + struct list_head drivernode;
> +};
...
> +struct cgroup_driver {
> + /* Functions this driver uses to manage its data */
> + struct cgroup_driver_funcs *funcs;
> +
> + /*
> +  * List of driver-specific data structures that need to be cleaned up
> +  * if driver is unloaded.
> +  */
> + struct list_head datalist;
> +};

It generally looks great but can we do something like the following in
terms of interface?

  struct cgroup_cdata {
          const void *key;
          void (*free)(struct cgroup_cdata *cdata);
          /* whatever other necessary fields */
          char data[];
  };

  int cgroup_cdata_install(struct cgroup *cgrp, struct cgroup_cdata *cdata);
  struct cgroup_cdata *cgroup_cdata_lookup(struct cgroup *cgrp, const void *key);
  int cgroup_cdata_free(struct cgroup *cgrp, const void *key);
  /* free is also automatically called when the cgroup is released */

And please use a separate lock or mutex for managing them.
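
To make the shape concrete, a driver-side usage sketch of that proposed
interface - the struct and device names are illustrative only, and a
real version would want to handle the install/lookup race properly:

  struct mydrv_cgroup_data {
          struct cgroup_cdata cdata;      /* proposed header, embedded first */
          int priority;
  };

  static void mydrv_cdata_free(struct cgroup_cdata *cdata)
  {
          kfree(container_of(cdata, struct mydrv_cgroup_data, cdata));
  }

  static struct mydrv_cgroup_data *
  mydrv_get_cgroup_data(struct cgroup *cgrp, struct mydrv_device *dev)
  {
          struct cgroup_cdata *cdata;
          struct mydrv_cgroup_data *data;

          cdata = cgroup_cdata_lookup(cgrp, dev);         /* device ptr as key */
          if (cdata)
                  return container_of(cdata, struct mydrv_cgroup_data, cdata);

          data = kzalloc(sizeof(*data), GFP_KERNEL);
          if (!data)
                  return ERR_PTR(-ENOMEM);

          data->cdata.key = dev;
          data->cdata.free = mydrv_cdata_free;
          if (cgroup_cdata_install(cgrp, &data->cdata)) {
                  kfree(data);
                  return ERR_PTR(-EBUSY);
          }
          return data;
  }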

Thanks.

-- 
tejun


Re: [Intel-gfx] [IGT PATCH RFC] tools: Introduce intel_cgroup tool

2018-02-07 Thread Tejun Heo
Hello,

Forgot to respond to one point.

On Thu, Feb 01, 2018 at 03:14:38PM -0800, Matt Roper wrote:
>  * The drivers that want to make use of this functionality may be built
>as modules rather than compiled directly into the kernel.  This is
>important because the cgroups subsystem removed the ability to have
>cgroup controllers in modules a few years ago.  While it might be
>possible to resurrect module-based cgroup controller support, my
>impression is that the cgroups community would prefer a separate
>non-controller mechanism for doing what we want to do.  E.g., see
>https://www.spinics.net/lists/cgroups/msg18674.html for some brief
>discussion on this topic.

So, this isn't a concern.  If we need to have modular controllers, we
can resurrect module support, hopefully in a simpler way, or could do
something similar to what the rdma controller did.  We shouldn't make
userland interface decisions based on this sort of implementation
detail.

Thanks.

-- 
tejun


Re: [Intel-gfx] [IGT PATCH RFC] tools: Introduce intel_cgroup tool

2018-02-07 Thread Tejun Heo
Hello, Matt, Chris.

On Thu, Feb 01, 2018 at 03:14:38PM -0800, Matt Roper wrote:
> > Hmm. Could we not avoid drm_ioctl + well known param names and use a
> > more generic tool to set cgroup attributes? Just feels wrong that
> > such a generic interface boils down to a driver specific ioctl.

So, everything which shows up in the cgroup hierarchy should satisfy
the following two conditions.

* The control mechanism should adhere to one of the resource
  distribution models defined in Documentation/cgroup-v2.txt.  This is
  to ensure consistency across different resources which in turn
  allows things like delegation.

* This one is a bit vague and I probably should find a better way to
  describe it, but each controller should encapsulate a generic core
  resource.  Here, I don't think it makes sense to have an intel gfx
  controller when there are a lot of commonalities in the graphics
  hardware across different vendors.  It should be better abstracted.

It's true that it's difficult to figure out the right generic design
without actually trying, and I think that's why it'd be better to
start scoped to the specific driver.  The smaller scope would allow
for less strict expectations, more latitude, and easier
experimentation.

> I think the nicest interface for setting cgroup attributes would be to
> find a way to add our own kernfs entries to the cgroup filesystem, the
> same way real cgroup controllers do.  Then we could do something like
> "echo 123 > /cgroup2/apps/highprio/i915.card0.priority" and not need to
> call any ioctl's at all.  Without creating an actual cgroup controller,
> I think this might be challenging, but I'm investigating it on the side
> right now to see if it's a viable option.

So, I strongly believe that it isn't the right approach to add i915
prefixed interface files to cgroup interface.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH RFC 6/9] drm: Add cgroup helper library

2018-01-22 Thread Tejun Heo
Hello, Matt.

On Fri, Jan 19, 2018 at 05:51:38PM -0800, Matt Roper wrote:
> Most DRM drivers will want to handle the CGROUP_SETPARAM ioctl by looking up a
> driver-specific per-cgroup data structure (or allocating a new one) and storing
> the supplied parameter value into the data structure (possibly after doing some
> checking and sanitization on the provided value).  Let's provide a helper
> library for drivers that will take care of the details of storing per-cgroup
> data structures in a hashtable and destroying those structures if/when the
> cgroup itself is removed.

Would it be possible to make the core of this a built-in part of
cgroup like cgroup-bpf does?  My gut feeling is that that isn't gonna
be much code anyway and likely to be cleaner to implement and use.

Being a full-fledged controller comes with quite a bit of complexity
in terms of sync rules and everything, and we may build a simpler
modular infrastructure if usages like this proliferate, but for now
extending from the core seems the most straightforward.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH V3 04/29] ata: deprecate pci_get_bus_and_slot()

2017-11-27 Thread Tejun Heo
On Mon, Nov 27, 2017 at 11:57:41AM -0500, Sinan Kaya wrote:
> pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
> where a PCI device is present. This restricts the device drivers to be
> reused for other domain numbers.
> 
> Getting ready to remove pci_get_bus_and_slot() function in favor of
> pci_get_domain_bus_and_slot().
> 
> Use pci_get_domain_bus_and_slot() and extract the actual domain number
> from the pdev passed in.
> 
> Signed-off-by: Sinan Kaya <ok...@codeaurora.org>

Acked-by: Tejun Heo <t...@kernel.org>

Please feel free to route with the rest of the series.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH] drm/i915: Try harder to finish the idle-worker

2017-09-05 Thread Tejun Heo
Hello,

On Tue, Sep 05, 2017 at 02:43:14PM +0100, Chris Wilson wrote:
> > Can't you use cancel[_delayed]_work_sync()?
> 
> We then need a loop like:
> 
>   do {
>   if (cancel_delayed_work_sync(wrk))
>   do_work(wrk);
>   else
>   break;
>   } while (1);
> 
> We do want the flush semantics.

I see.  Heh, I don't know.  One thing you can try is putting them on a
separate workqueue and using drain_workqueue() or destroy_workqueue()
on it.  Those functions expect there to be some requeueing but warn if
there is too much (not configurable now but we can add that if
necessary).
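
Roughly, as a sketch - the workqueue name, work item and handler below
are placeholders standing in for whatever the driver already has:

  static struct workqueue_struct *idle_wq;
  static struct delayed_work idle_work;   /* the self-requeueing worker */

  static int mydrv_setup_idle_worker(void)
  {
          /* give the requeueing work its own queue so it can be drained */
          idle_wq = alloc_workqueue("mydrv-idle", 0, 0);
          if (!idle_wq)
                  return -ENOMEM;
          INIT_DELAYED_WORK(&idle_work, mydrv_idle_work_fn);
          queue_delayed_work(idle_wq, &idle_work, round_jiffies_up_relative(HZ));
          return 0;
  }

  static void mydrv_teardown_idle_worker(void)
  {
          /* flushes the work and anything it requeues onto idle_wq, and
           * warns if the requeueing never settles */
          drain_workqueue(idle_wq);
          destroy_workqueue(idle_wq);
  }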

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH] drm/i915: Try harder to finish the idle-worker

2017-09-05 Thread Tejun Heo
On Mon, Sep 04, 2017 at 10:35:49AM +0200, Daniel Vetter wrote:
> On Fri, Sep 01, 2017 at 03:11:23PM +0100, Chris Wilson wrote:
> > If a worker requeues itself, it may switch to a different kworker pool,
> > which flush_work() considers as complete. To be strict, we then need to
> > keep flushing the work until it is no longer pending.
> > 
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=102456
> > Signed-off-by: Chris Wilson 
> > Cc: Mika Kuoppala 
> 
> Shouldn't this be a thing the workqueue subsystem exposes? Adding Tejun et
> al.
> -Daniel

Can't you use cancel[_delayed]_work_sync()?

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH] lib/ida: Document locking requirements a bit better v2

2016-10-27 Thread Tejun Heo
On Thu, Oct 27, 2016 at 09:22:16AM +0200, Daniel Vetter wrote:
> I wanted to wrap a bunch of ida_simple_get calls into their own
> locking, until I dug around and read the original commit message.
> Stuff like this should imo be added to the kernel doc, let's do that.
> 
> v2: Improve the kerneldoc per Tejun's review.
> 
> Cc: Mel Gorman <mgor...@techsingularity.net>
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Vlastimil Babka <vba...@suse.cz>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Signed-off-by: Daniel Vetter <daniel.vet...@intel.com>

Acked-by: Tejun Heo <t...@kernel.org>

Andrew, can you please route this patch?

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH] lib/ida: Document locking requirements a bit better

2016-10-26 Thread Tejun Heo
Hello, Daniel.

On Wed, Oct 26, 2016 at 09:25:25PM +0200, Daniel Vetter wrote:
> > > + * Note that callers must ensure that concurrent access to @ida is not possible.
> > > + * When simplicity trumps concurrency needs look at ida_simple_get() instead.
> > 
> > Maybe we can make it a bit less dramatic?
> 
> What about?
> 
> "Note that callers must ensure that concurrent access to @ida is not possible.
> See ida_simple_get() for a variant which takes care of locking.

Yeah, that reads easier to me.

> > Hmm... so, this isn't necessarily about speed.  For example, id
> > allocation might have to happen inside a spinlock which protects a
> > larger scope.  To guarantee GFP_KERNEL allocation behavior in such
> > cases, the caller would have to call ida_pre_get() outside the said
> > spinlock and then call ida_get_new_above() inside the lock.
> 
> Hm, ida_simple_get does that for you already ...

Here's an example.

spin_lock();
do some stuff;
something->id = ida_simple_get(some gfp flag);
do some stuff;
spin_unlock();

In this scenario, you can't use sleeping GFPs for ida_simple_get()
because it does preloading inside it.  What one has to do is...

ida_pre_get(GFP_KERNEL);
spin_lock();
do some stuff;
something->id = ida_get_new_above(GFP_NOWAIT);
do some stuff;
spin_unlock();

So, I guess it can sometimes be about avoiding the extra locking
overhead, but it's more often about separating out the allocation
context into an earlier call.
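
Spelled out against the API of the time, the split looks roughly like
this (the ida and lock are placeholders):

  static DEFINE_IDA(my_ida);
  static DEFINE_SPINLOCK(my_lock);

  int mydrv_alloc_id(void)
  {
          int id, ret;

          do {
                  /* preload outside the lock so GFP_KERNEL can be used */
                  if (!ida_pre_get(&my_ida, GFP_KERNEL))
                          return -ENOMEM;
                  spin_lock(&my_lock);
                  /* ... do some stuff under the lock ... */
                  ret = ida_get_new_above(&my_ida, 0, &id);
                  spin_unlock(&my_lock);
          } while (ret == -EAGAIN);       /* preloaded memory used up, retry */

          return ret ? ret : id;
  }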

> > I think it'd be better to explain what the simple version does and
> > expects and then say that unless there are specific requirements using
> > the simple version is recommended.
> 
> What about:
> 
> "Compared to ida_get_new_above() this function does its own locking, and
> should be used unless there are special requirements."

Yeah, looks good to me.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH] lib/ida: Document locking requirements a bit better

2016-10-26 Thread Tejun Heo
Hello, Daniel.

On Wed, Oct 26, 2016 at 04:27:39PM +0200, Daniel Vetter wrote:
> I wanted to wrap a bunch of ida_simple_get calls into their own
> locking, until I dug around and read the original commit message.
> Stuff like this should imo be added to the kernel doc, let's do that.

Generally agreed but some nits below.

> @@ -927,6 +927,9 @@ EXPORT_SYMBOL(ida_pre_get);
>   * and go back to the ida_pre_get() call.  If the ida is full, it will
>   * return %-ENOSPC.
>   *
> + * Note that callers must ensure that concurrent access to @ida is not possible.
> + * When simplicity trumps concurrency needs look at ida_simple_get() instead.

Maybe we can make it a bit less dramatic?

> @@ -1073,6 +1076,10 @@ EXPORT_SYMBOL(ida_destroy);
>   * Allocates an id in the range start <= id < end, or returns -ENOSPC.
>   * On memory allocation failure, returns -ENOMEM.
>   *
> + * Compared to ida_get_new_above() this function does its own locking and hence
> + * is recommended everywhere where simplicity is preferred over the last bit of
> + * speed.

Hmm... so, this isn't necessarily about speed.  For example, id
allocation might have to happen inside a spinlock which protects a
larger scope.  To guarantee GFP_KERNEL allocation behavior in such
cases, the caller would have to call ida_pre_get() outside the said
spinlock and then call ida_get_new_above() inside the lock.

I think it'd be better to explain what the simple version does and
expects and then say that unless there are specific requirements using
the simple version is recommended.

Thanks.

-- 
tejun


Re: [Intel-gfx] [PATCH] kernfs: Move faulting copy_user operations outside of the mutex

2016-03-31 Thread Tejun Heo
On Thu, Mar 31, 2016 at 11:45:06AM +0100, Chris Wilson wrote:
> A fault in a user provided buffer may lead anywhere, and lockdep warns
> that we have a potential deadlock between the mm->mmap_sem and the
> kernfs file mutex:
...
> Reported-by: Ville Syrjälä <ville.syrj...@linux.intel.com>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
> Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
> Reviewed-by: Joonas Lahtinen <joonas.lahti...@linux.intel.com>
> Cc: Ville Syrjälä <ville.syrj...@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahti...@linux.intel.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> Cc: NeilBrown <ne...@suse.de>
> Cc: linux-ker...@vger.kernel.org

Acked-by: Tejun Heo <t...@kernel.org>

Thanks!

-- 
tejun


Re: [Intel-gfx] [PATCH 0/1] async: export current_is_async()

2015-11-19 Thread Tejun Heo
Hello, Lukas.

On Thu, Nov 19, 2015 at 04:31:11PM +0100, Lukas Wunner wrote:
> Hi Tejun,
> 
> when you introduced current_is_async() with 84b233adcca3, was it a
> deliberate decision not to export it? All other non-static functions
> in async.c are exported as well.
> 
> I'm asking because I would like to use it in i915.ko.

There just was no module user at the time.  Please feel free to export
it.
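
i.e. presumably just adding the export next to the definition in
kernel/async.c, along the lines of:

  EXPORT_SYMBOL_GPL(current_is_async);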

Thanks.

-- 
tejun


[Intel-gfx] [PATCH driver-core-linus] kernfs: add back missing error check in kernfs_fop_mmap()

2014-04-20 Thread Tejun Heo
While updating how mmap-enabled kernfs files are handled by lockdep,
9b2db6e18945 ("sysfs: bail early from kernfs_file_mmap() to avoid
spurious lockdep warning") inadvertently dropped the error return check
from kernfs_file_mmap().  The intention was just to drop the
if (ops->mmap) check, as control won't reach that point if the mmap
callback isn't implemented, but I mistakenly removed the error return
check together with it.

This led to an Xorg crash on i810 which was reported and bisected to
the commit and then to the specific change by Tobias.

Signed-off-by: Tejun Heo t...@kernel.org
Reported-and-bisected-by: Tobias Powalowski tobias.powalow...@googlemail.com
References: http://lkml.kernel.org/g/533d01bd.1010...@googlemail.com
---
Hello,

Oops, sorry that I didn't see the mistake which seems so obvious now.
Can you please verify that this patch works?

Thanks a lot for the report and bisection!

 fs/kernfs/file.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8034706..e01ea4a 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -484,6 +484,8 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
 
 	ops = kernfs_ops(of->kn);
 	rc = ops->mmap(of, vma);
+	if (rc)
+		goto out_put;
 
 	/*
 	 * PowerPC's pci_mmap of legacy_mem uses shmem_zero_setup()


Re: [Intel-gfx] 3.14 issue with i810 graphic card bisected

2014-04-18 Thread Tejun Heo
Hello,

Sorry about the long delay.

On Thu, Apr 03, 2014 at 08:37:49AM +0200, Tobias Powalowski wrote:
> Hi,
> I bisected a X startup crash due to new 3.14 kernel:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/kernfs/file.c?id=9b2db6e1894577d48f4e290381bac6e573593838
> It's an old intel 810 graphics card which got broken.
> dmesg and xorg.log attached.

Hmm... I'm stumped.  The patch shouldn't cause any visible difference
to the userland.  I went over it a couple more times and still can't
see how this would make any difference.  Can you please do the
followings?

* Repeat the test on v3.14 with only the patch reverted.  If it makes
  the problem go away reliably,

* strace X startup on v3.14 and v3.14 sans the offending patch.

Thanks!

-- 
tejun


Re: [Intel-gfx] kernfs oops with i915+i2c_core in 3.14 merge window

2014-02-04 Thread Tejun Heo
On Thu, Jan 30, 2014 at 02:03:18PM -0500, Josh Boyer wrote:
> Hi All,
> 
> I'm seeing the oops below on my MacBookPro 10,2 machine using i915
> graphics.  It's after the DRM merge for 3.14 (v3.13-10094-g9b0cd30),
> but we seem to have one report[1] of this happening well before that,
> in v3.13-3260-g03d11a0 as well.  Does anyone have a clue what is going
> on here?
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1055105

Should be fixed by the following patch which is already queued.

 http://lkml.kernel.org/g/20140129170403.gj30...@htj.dyndns.org

Thanks.

-- 
tejun