Hi Ilia,

I'm going to push this with your ST_DEBUG suggestion. I did some more micro-benchmarking with the attached test, which confirmed that the current heuristic is decent for radeonsi. One might get slightly better results with a mixed heuristic that includes the number of readpixels calls, but at some point it just becomes bike-shedding.

If you find that it's not a good match for nouveau in the transfer_map case, perhaps we can follow up with a pipe cap.

Cheers,
Nicolai

On 15.06.2016 17:20, Ilia Mirkin wrote:
On Wed, Jun 15, 2016 at 4:38 AM, Nicolai Hähnle <nhaeh...@gmail.com> wrote:
From: Nicolai Hähnle <nicolai.haeh...@amd.com>

---
  src/mesa/state_tracker/st_cb_fbo.h        |   2 +
  src/mesa/state_tracker/st_cb_readpixels.c | 122 ++++++++++++++++++++++++++----
  2 files changed, 110 insertions(+), 14 deletions(-)

diff --git a/src/mesa/state_tracker/st_cb_fbo.h 
b/src/mesa/state_tracker/st_cb_fbo.h
index f3b310b..414a661 100644
--- a/src/mesa/state_tracker/st_cb_fbo.h
+++ b/src/mesa/state_tracker/st_cb_fbo.h
@@ -58,6 +58,8 @@ struct st_renderbuffer
     boolean software;
     void *data;

+   bool use_readpix_cache;
+
     /* Inputs from Driver.RenderTexture, don't use directly. */
     boolean is_rtt; /**< whether Driver.RenderTexture was called */
     unsigned rtt_face, rtt_slice;
diff --git a/src/mesa/state_tracker/st_cb_readpixels.c 
b/src/mesa/state_tracker/st_cb_readpixels.c
index a501b7b..42d147f 100644
--- a/src/mesa/state_tracker/st_cb_readpixels.c
+++ b/src/mesa/state_tracker/st_cb_readpixels.c
@@ -46,6 +46,28 @@
  #include "state_tracker/st_pbo.h"
  #include "state_tracker/st_texture.h"

+/* The readpixels cache caches a blitted staging texture so that back-to-back
+ * calls to glReadPixels with user pointers require less CPU-GPU 
synchronization.
+ *
+ * Assumptions:
+ *
+ * (1) Blits have high synchronization overheads, and it is beneficial to
+ *     use a single blit of the entire framebuffer instead of many smaller
+ *     blits (because the smaller blits cannot be batched, and we have to wait
+ *     for the GPU after each one).
+ *
+ * (2) transfer_map implicitly involves a blit as well (for de-tiling, copy
+ *     from VRAM, etc.), so that it is beneficial to replace the
+ *     _mesa_readpixels path as well when possible.

FWIW this second point isn't the case for nouveau. We map the texture
directly under many conditions. However that's still slow since VRAM
is "far away", and, without having done the tests, this seems like it
would improve the situation for nouveau as well.

BTW, what are some good tests to play with for testing this out?

+ *
+ */
+#define USE_READPIXELS_CACHE true

Maybe add it to ST_DEBUG instead (so it can be disabled)? Your call.

   -ilia

// Compile with c++ -std=c++11 readpixels-bench.cpp -Wall $(pkg-config --cflags --libs sdl2 gl)

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <SDL.h>
#include <SDL_video.h>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <vector>

static void usage(const char *name)
{
	printf("usage: %s read_height read_count iterations\n", name);
	exit(1);
}

int main(int argc, char **argv)
{
	unsigned window_width = 1024;
	unsigned window_height = 512;
	unsigned read_height;
	unsigned read_count;
	unsigned iterations;

	if (argc < 4)
		usage(argv[0]);

	read_height = atoi(argv[1]);
	read_count = atoi(argv[2]);
	iterations = atoi(argv[3]);

	if (read_count * read_height > window_height) {
		std::cerr << "read_height * read_count > window_height\n";
		return 1;
	}

	SDL_Init(SDL_INIT_VIDEO);
	SDL_Window * const window = SDL_CreateWindow(
		"Bench",
		SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED,
		window_width, window_height,
		SDL_WINDOW_OPENGL);
	if (!window) {
		std::cerr << "Could not create window: "
		          << SDL_GetError() << '\n';
		return 1;
	}

	SDL_GL_SetAttribute(SDL_GL_CONTEXT_MAJOR_VERSION, 2);
	SDL_GL_SetAttribute(SDL_GL_CONTEXT_MINOR_VERSION, 1);
	SDL_GLContext gl_context = SDL_GL_CreateContext(window);
	if (!gl_context) {
		std::cerr << "Could not create OpenGL context: "
		          << SDL_GetError() << '\n';
		return 1;
	}

	std::vector<char> pixels;
	pixels.resize(read_height * window_width * 4);

	auto start = std::chrono::high_resolution_clock::now();

	for (unsigned i = 0; i < iterations; ++i) {
		glClear(GL_COLOR_BUFFER_BIT);

		unsigned y = 0;
		for (unsigned j = 0; j < read_count; ++j, y += read_height)
			glReadPixels(
				0, y, window_width, read_height,
				GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
	}

	auto end = std::chrono::high_resolution_clock::now();
	std::chrono::duration<double, std::milli> ms = end - start;

	std::cout << iterations << " iterations, " << (read_count * read_height * window_width) << " pixels each\n";
	std::cout << ms.count() << "ms\n";
	std::cout << ((uint64_t)iterations * 4 * read_count * read_height * window_width / ms.count() / 1000) << "MB/s\n";

	SDL_GL_DeleteContext(gl_context);
	SDL_DestroyWindow(window);
	SDL_Quit();
}

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Reply via email to