On Tue, Mar 02, 2021 at 03:44:44PM +0300, Dmitry Osipenko wrote:
> Display controller (DC) performs isochronous memory transfers, and thus,
> has a requirement for a minimum memory bandwidth that shall be fulfilled,
> otherwise framebuffer data can't be fetched fast enough and this results
> in a DC's data-FIFO underflow that follows by a visual corruption.
> 
> The Memory Controller drivers provide facility for memory bandwidth
> management via interconnect API. Let's wire up the interconnect API
> support to the DC driver in order to fix the distorted display output
> on T30 Ouya, T124 TK1 and other Tegra devices.

I did a read on the code. I have put some thoughts and nits inbetween the
diff, but here are more general questions about the patch:

Is there a reason why the bandwidth is allocated per plane instead of just
using one peak and average for the whole configuration? Or is there a need
to expose all the planes as interconnect channels and allocate their
bandwidth individually?

>From algorithmic part I see that the plane overlaps are calculated twice
(A vs B and later B vs A). The cursor plane is ignored, but nevertheless
its overlap mask is calculated before being thrown away. The bandwidths
are also calculated twice: once for pre-commit and then again for
post-commit.  Is setting bandwidth for an interconnect expensive when
re-setting a value that was already set? The code seems to avoid this
case, but maybe unnecessarily?

[...cut the big and interesting part...]

[...]
> @@ -65,7 +75,9 @@ struct tegra_dc_soc_info {
>       unsigned int num_overlay_formats;
>       const u64 *modifiers;
>       bool has_win_a_without_filters;
> +     bool has_win_b_vfilter_mem_client;
>       bool has_win_c_without_vert_filter;
> +     unsigned int plane_tiled_memory_bandwidth_x2;

This looks more like bool in the code using it.

[...]
> --- a/drivers/gpu/drm/tegra/plane.c
> +++ b/drivers/gpu/drm/tegra/plane.c
[...]
> +static int tegra_plane_check_memory_bandwidth(struct drm_plane_state *state)

The function's body looks more like 'update' or 'recalculate' rather
than 'check' the memory bandwidth.

> +     struct tegra_plane_state *tegra_state = to_tegra_plane_state(state);
> +     unsigned int i, bpp, bpp_plane, dst_w, src_w, src_h, mul;
> +     const struct tegra_dc_soc_info *soc;
> +     const struct drm_format_info *fmt;
> +     struct drm_crtc_state *crtc_state;
> +     u32 avg_bandwidth, peak_bandwidth;
> +
> +     if (!state->visible)
> +             return 0;
> +
> +     crtc_state = drm_atomic_get_new_crtc_state(state->state, state->crtc);
> +     if (!crtc_state)
> +             return -EINVAL;
> +
> +     src_w = drm_rect_width(&state->src) >> 16;
> +     src_h = drm_rect_height(&state->src) >> 16;
> +     dst_w = drm_rect_width(&state->dst);
> +
> +     fmt = state->fb->format;
> +     soc = to_tegra_dc(state->crtc)->soc;
> +
> +     /*
> +      * Note that real memory bandwidth vary depending on format and
> +      * memory layout, we are not taking that into account because small
> +      * estimation error isn't important since bandwidth is rounded up
> +      * anyway.
> +      */
> +     for (i = 0, bpp = 0; i < fmt->num_planes; i++) {
> +             bpp_plane = fmt->cpp[i] * 8;

Nit: you could declare the bpp_plane here.

> +             /*
> +              * Sub-sampling is relevant for chroma planes only and vertical
> +              * readouts are not cached, hence only horizontal sub-sampling
> +              * matters.
> +              */
> +             if (i > 0)
> +                     bpp_plane /= fmt->hsub;
> +
> +             bpp += bpp_plane;
> +     }
> +
> +     /*
> +      * Horizontal downscale takes extra bandwidth which roughly depends
> +      * on the scaled width.
> +      */
> +     if (src_w > dst_w)
> +             mul = (src_w - dst_w) * bpp / 2048 + 1;
> +     else
> +             mul = 1;

Does it really need more bandwidth to scale down? Does it read the same
data multiple times just to throw it away?

> +     /* average bandwidth in bytes/s */
> +     avg_bandwidth  = src_w * src_h * bpp / 8 * mul;
> +     avg_bandwidth *= drm_mode_vrefresh(&crtc_state->mode);
> +
> +     /* mode.clock in kHz, peak bandwidth in kbit/s */
> +     peak_bandwidth = crtc_state->mode.clock * bpp * mul;

I would guess that (src_w * bpp / 8) needs rounding up unless the plane
is continuous. Or you could just add the max rounding error here and
get a safe overestimated value. Maybe this would be better done when
summing per-plane widths.

> +     /* ICC bandwidth in kbyte/s */
> +     peak_bandwidth = kbps_to_icc(peak_bandwidth);
> +     avg_bandwidth  = Bps_to_icc(avg_bandwidth);

This could be merged with the assignments after the following 'if' block.

> +     /*
> +      * Tegra30/114 Memory Controller can't interleave DC memory requests
> +      * and DC uses 16-bytes atom for the tiled windows, while DDR3 uses 32
> +      * bytes atom. Hence there is x2 memory overfetch for tiled framebuffer
> +      * and DDR3 on older SoCs.
> +      */
> +     if (soc->plane_tiled_memory_bandwidth_x2 &&
> +         tegra_state->tiling.mode == TEGRA_BO_TILING_MODE_TILED) {
> +             peak_bandwidth *= 2;
> +             avg_bandwidth *= 2;
> +     }
> +
> +     tegra_state->peak_memory_bandwidth = peak_bandwidth;
> +     tegra_state->avg_memory_bandwidth = avg_bandwidth;
> +
> +     return 0;
> +}

[...]
> +static const char * const tegra_plane_icc_names[] = {
> +     "wina", "winb", "winc", "", "", "", "cursor",
> +};
> +
> +int tegra_plane_interconnect_init(struct tegra_plane *plane)
> +{
> +     const char *icc_name = tegra_plane_icc_names[plane->index];

Is plane->index guaranteed to be <= 6? I would guess so, but maybe
BUILD_BUG_ON(sizeof(icc_names)==TEGRA_DC_LEGACY_PLANES_NUM) or some
other check could document this?

[...]

Best Regards
Michał Mirosław

Reply via email to