From: Roland Scheidegger
This probably isn't all that useful for GL, but there are apis where
sample_mask is a valid output even without msaa.
Just discard the pixel if the sample_mask doesn't include the bit for
sample 0.
---
src/gallium/drivers/llvmpipe/lp_state_fs.c | 26
From: Roland Scheidegger
The operation performed is all the same as LODQ, but with the usual
differences between dx10 and GL texture opcodes, that is separate resource
and sampler indices (plus result swizzling, and setting z/w channels
to zero).
---
From: Roland Scheidegger
This uses all the existing code to calculate lod values for mip linear
filtering. Though we'll have to disable the simplifications (if we know some
parts of the lod calculation won't actually matter for filtering purposes due
to mip clamps etc.). For
From: Roland Scheidegger
This was implemented since forever, but not enabled.
It passes all piglit tests except one, arb_pipeline_statistics_query-frag.
The reason is that the test (for drawing a 10x10 rect) expects between
100 and 150 pixel shader invocations. But since
From: Roland Scheidegger
With GALLIVM_DEBUG=perf set, output the relevant stats for shader cache usage
whenever we have to evict shader variants.
Also add some output when shaders are deleted (but not with the perf setting
to keep this one less noisy).
While here, also don't
From: Roland Scheidegger
gather is defined in terms of bilinear filtering, just without the filtering
part. However, there's actually some subtle differences required in our
implementation, because we use some tricks to simplify coord wrapping for the
two coords per
From: Roland Scheidegger
Trivial. We already support tg4 for legacy tex opcodes, so the actual
texture sampling code already handles it.
(Just like TG4, we don't handle additional capabilities and always sample
red channel.)
---
From: Roland Scheidegger
We're not particularly concerned with memory usage, if the tradeoff is
shader recompiles. And it's common for apps to have a lot of shaders
nowadays (and, since our shaders include a LOT of context state of course
we may create quite a bit more
From: Roland Scheidegger
I think this is what the code was meant to do, albeit as far as I can tell
the redundant initialization some analyzers complain about should work as
well just fine (only the first layer will be used, if the view contains one
or more layers doesn't
From: Roland Scheidegger
Fixes build error when it's not.
---
src/util/u_queue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/util/u_queue.c b/src/util/u_queue.c
index 49361c3..449da7d 100644
--- a/src/util/u_queue.c
+++ b/src/util/u_queue.c
@@
From: Roland Scheidegger
The driver supported this since way before the GL spec for it existed.
Just need to support both the per-stream and for all streams variants
(which are identical due to only supporting 1 stream).
Passes piglit
From: Roland Scheidegger
The driver was supposed to support this since way before the GL spec for it
existed, albeit it was apparently broken, so fix and enable it.
---
docs/features.txt| 2 +-
src/gallium/drivers/softpipe/sp_query.c | 7 ++-
From: Roland Scheidegger
We had some caller using LLVMAddInstrAttributes, which couldn't be
converted to lp_add_function_attr, because attributes were only handled
for functions in this case, so fix this.
For llvm >= 4.0, this already works correctly.
(radeonsi seems to avoid
From: Roland Scheidegger
It could only handle indices 0/1, otherwise what happened was bad (accessing
array out of bounds, no crash but kind of random). This is enough for the gl
state tracker (primary/secondary color) but not enough for some other state
trackers (d3d9 has no
From: Roland Scheidegger
If lp_setup_bind_framebuffer() is never called, then setup fb x1/y1 was not
correctly initialized. This can happen if there's never a fb set - both
cso and llvmpipe would consider setting this with no cbufs and no zsbuf a
redundant change and
From: Roland Scheidegger
We use the bounding box (triangle extents) to figure out if 32bit rasterization
could potentially overflow. However, we used the bounding box which already got
rounded up to 0 for negative coords for this, which is incorrect, leading to
overflows and
From: Roland Scheidegger
This is pretty useful for debugging rasterization issues, so turn it on
based on DEBUG (the actual existence of the fields is also conditionalized
on DEBUG, lines fill it out the same too).
---
src/gallium/drivers/llvmpipe/lp_setup_tri.c | 2 +-
1
From: Roland Scheidegger
The vertex information we compute here is really dependent on the last
stage before FS. It just happened to work most of the time because new
GS tend to come with new VS and/or FS...
(The LP_NEW_GS flag was previously set but never used.)
---
From: Roland Scheidegger
define __STDC_FORMAT_MACROS and include (same as
ir_builder_print_visitor.cpp already does).
Otherwise, some mingw build errors out (since
8e7e1ae0365ddc7edb0d4d98250ab46728e6c14a and
bbce1c538dc0cb8bf3769510283d11847dc07540 presumably) with:
From: Roland Scheidegger
The use of fast rcp instruction is disabled, and will always fall back
to use a division instead (1 / x). Hence, if we get a division opcode,
it doesn't make much sense trying to split that into rcp/mul.
---
From: Roland Scheidegger
we can't use the cpu implementation of fdiv, as this one uses different
lp_build_context, which causes assertion failure.
Just use default fdiv action (there is no fast rcp for doubles which we
could potentially use anyway).
---
From: Roland Scheidegger
softpipe (along with llvmpipe) claims to support arb_gpu_shader_fp64,
so we really need to support that opcode.
---
src/gallium/auxiliary/tgsi/tgsi_exec.c | 14 ++
1 file changed, 14 insertions(+)
diff --git
From: Roland Scheidegger
Doing these operations with blend format means that we have to convert
the destination into blend format, which is entirely pointless if we don't
do blending. For instance, we'd convert half floats to floats, or 10/10/10/2
to unorm16, just to apply
From: Roland Scheidegger
Generally we should do tranpose after conversion, if the format has less than
32 bits per channel (if it has 32 bits, conversion is going to be a no-op
anyway...). This is obviously because there's less vectors to deal with.
Though the advantage for
From: Roland Scheidegger
This special packing path can be easily extended to handle not just
float->unorm8 but also float->snorm8 and uint32->uint8 and int32->int8
(i.e. all interesting cases for llvmpipe fs backend code).
The packing parts all stay the same (only the last
From: Roland Scheidegger
llvm has _huge_ problems trying to load things like <4 x i8> vectors and
stitching such loads together to form 128bit vectors. My understanding
of the problem is that the type legalizer tries to extend that to
really a <4 x i32> vector and not a <16 x
From: Roland Scheidegger
For rgbx formats, there is no point in doing alpha conversion again (and
with different tranpose even, so llvm can't eliminate it).
Albeit it looks like there's some minimal changes needed in the blend code
(found by code inspection, no test seemed to
From: Roland Scheidegger
This code uses a vector shift which has to be emulated on x86 unless
there's AVX2. Luckily in some cases we can actually avoid the shift
altogether, so do that.
Also make sure we hit the fast lp_build_conv() path when applicable,
albeit that's quite
From: Roland Scheidegger
Using bit replication. This path now resembles something which might make
sense. (The logic was mostly copied from llvmpipe fs backend.)
I am not convinced though it is actually faster than SoA sampling (actually
I'm quite certain it's always a loss
From: Roland Scheidegger
simd instruction sets usually have comparisons for equal, not unequal.
So use a different comparison against the mask itself - which also means
we don't need a all-zero as well as a all-one (for the pxor) reg.
Also add code to avoid scalar expansion
From: Roland Scheidegger
If we only feed one source vector at a time, we cannot use pack intrinsics
(as we only have a 64bit destination dst vector). lp_bld_conv_auto is
specifically designed to alter the length and number of destination vectors,
so this works just fine (if
From: Roland Scheidegger
By using a dst_type in the the gather interface, gather has some more
knowledge about how values should be fetched.
E.g. if this is a 3x32bit fetch and dst_type is 4x32bit vector gather
will no longer do a ZExt with a 96bit scalar value to 128bit, but
From: Roland Scheidegger
We should do transpose, not extract/insert, at least with "sufficient" amount
of channels (for 4 channels, extract/insert shuffles generated otherwise look
truly terrifying). Albeit we shouldn't fallback to that so often in any case.
---
From: Roland Scheidegger
This can now handle rgtc (unorm) too - this path no longer handles plain
formats, but that's unnecessary they now all have their proper SoA unpack
(this will still be dog-slow though due to the actual fetch being per-pixel
util fallbacks).
---
From: Roland Scheidegger
This previously always fell back to AoS conversion. Even for 4-float formats
(which is the optimal case by far for that fallback case) this was suboptimal,
since it meant the conversion couldn't be done with 256bit vectors. While this
may still only
From: Roland Scheidegger
Now that there's some SoA fetch which never falls back, we should usually get
results which are better or at least not worse (something like rgba32f will
stay the same). I suppose though it might be worse in some cases where the
format doesn't require
From: Roland Scheidegger
soa fetch so far always assumed that data was aligned. However, we want to
use this for vertex fetch, and data might not be aligned there, so handle
it in this path too (basically just pass through alignment through to other
functions). (It looks like
From: Roland Scheidegger
As per GL 4.5 rules, which fixed a spec mistake in GL_ARB_stencil_texturing.
The extension spec wasn't updated, but just allow it with older GL versions
as well, hoping there aren't any crazy tests which want to see an error
there... (Compile tested
From: Roland Scheidegger
Just like other similar compressed formats.
---
src/gallium/auxiliary/util/u_format.c | 5 +
1 file changed, 5 insertions(+)
diff --git a/src/gallium/auxiliary/util/u_format.c
b/src/gallium/auxiliary/util/u_format.c
index 72dd60f..3d28190
From: Roland Scheidegger
Note that we really want to _never_ reach the bottom of the function, which
resorts to AoS fetch.
Half floats can be handled just like other formats which fit into 32bit
vectors (so, only 1x16 and 2x16 formats, albeit with more channels things
are not
From: Roland Scheidegger
By using a dst_type in the the gather interface, gather has some more
knowledge about how values should be fetched.
E.g. if this is a 3x32bit fetch and dst_type is 4x32bit vector gather
will no longer do a ZExt with a 96bit scalar value to 128bit, but
From: Roland Scheidegger
Trivial, this just resurrects the code which was there once upon a time
(the code can't lower instructions generated in the lowering pass there,
and even if it could it would probably be suboptimal).
This fixes piglit mesa_shader_integer_functions
From: Roland Scheidegger
The code for elts and linear paths was nearly 100% identical by now - with
the elts path simply having some additional gather for the elements in the
main loop (with some additional small differences before the main loop).
Hence nuke the separate
From: Roland Scheidegger
Don't keep the ofbit. This is just a minor simplification, just adjust
the buffer size so that there will always be an overflow if buffers aren't
valid to fetch from.
Also, get rid of control flow from the instanced path too. Not worried about
From: Roland Scheidegger
It turns out that noone actually cares if the address computations overflow,
be it the stride mul or the offset adds.
Wrap around seems to be explicitly permitted even by some other API (which
is a _very_ surprising result, as these overflow
From: Roland Scheidegger
This was kind of strange, since it replaced indices which were only
overflowing due to bias with MAX_UINT. This would cause an overflow later
in the shader, except if stride was 0, however the vertex id would be
essentially random then (-1 + eltBias).
Overflow handling is simplified quite a bit both in jit code and vsplit
paths (basically just let things wrap around everywhere). This seems to
be good enough for all apis.
Also, elts and linear jit code is unified since the differences are minimal
(even more so at the end of the series). The cost
From: Roland Scheidegger
This is a bit simpler. Mostly to make it easier to unify the paths later...
---
src/gallium/auxiliary/draw/draw_llvm.c | 48 ++
src/gallium/auxiliary/draw/draw_llvm.h | 8 ++--
From: Roland Scheidegger
vsplit_get_base_idx explicitly returned idx 0 and set the ofbit
in case of overflow. We'd then check the ofbit and use idx 0 instead of
looking it up. This was necessary because DRAW_GET_IDX used to return
DRAW_MAX_FETCH_IDX and not 0 in case of
From: Roland Scheidegger
Instead of doing all the math with scalars, use vectors. This means the
overflow math needs to be done manually, albeit that's only really
problematic for the stride/index mul, the rest has been pretty much
moved outside the shader loop (albeit the
From: Roland Scheidegger
lp_build_any_true_range is just what we need, though it will only produce
optimal code with sse41 (ptest + set) - but even without it on 64bit x86
the code is still better (1 unpack, 2 movq + or + set), on 32bit x86 it's
going to be roughly the same
From: Roland Scheidegger
Instead of doing all the math with scalars, use vectors. This means the
overflow math needs to be done manually, albeit that's only really
problematic for the stride/index mul, the rest has been pretty much
moved outside the shader loop (albeit the
From: Roland Scheidegger
This is used by shader umul_hi/imul_hi functions (and soon by draw).
It's actually useful separating this out on its own, however the real
reason for doing it is because we're using an optimized sse2 version,
since the code llvm generates is atrocious
From: Roland Scheidegger
Previous fixes were incomplete - some code still iterated through the number
of elements provided by velem layout instead of the number stored in the key
(which is the same as the number defined by the vs). And also actually
accessed the elements from
From: Roland Scheidegger
Instead of doing all the math with scalars, use vectors. This means the
overflow math needs to be done manually, albeit that's only really
problematic for the stride/index mul, the rest has been pretty much
moved outside the shader loop (albeit the
From: Roland Scheidegger
Instead of doing all the math with scalars, use vectors. This means the
overflow math needs to be done manually, albeit that's only really
problematic for the stride/index mul, the rest has been pretty much
moved outside the shader loop (albeit the
From: Roland Scheidegger
For the texturing packs, things looked pretty terrible. For every
lerp, we were repacking the values, and while those look sort of cheap
with 128bit, with 256bit we end up with 2 of them instead of just 1 but
worse, plus 2 extracts too (the unpack,
From: Roland Scheidegger
The per-element fetch has quite some calculations which are constant,
these can be moved outside both the per-element as well as the main
shader loop (llvm can figure out it's constant mostly on its own, however
this can have a significant compile
From: Roland Scheidegger
Compilation to actual machine code can easily take as much time as the
optimization passes on the IR if not more, so print this out too.
---
src/gallium/auxiliary/gallivm/lp_bld_init.c | 11 +++
1 file changed, 11 insertions(+)
diff --git
From: Roland Scheidegger
Previous attempts to zero initialize all inputs were not really optimal
(though no performance impact was measurable). In fact this is not really
necessary, since we know the max number of inputs used.
Instead, just generate fetch for up to max inputs
From: Roland Scheidegger
The per-element fetch has quite some calculations which are constant,
these can be moved outside both the per-element as well as the main
shader loop (llvm can figure out it's constant mostly on its own, however
this can have a significant compile
From: Roland Scheidegger
This should make the code more robust if a shader tries to use inputs which
aren't defined by the vertex element layout (which usually shouldn't happen).
No piglit change.
---
src/gallium/auxiliary/draw/draw_llvm.c | 7 +++
1 file changed, 7
From: Roland Scheidegger
We only did depth clamp when the value was written from the fs.
This is very wrong both for d3d10 and GL, and only passed the
corresponding piglit test due to pure luck (it no longer does
with the enhanced test).
Also, interpolation clamped values to
From: Roland Scheidegger
This wasn't handled before (the result was that no matter what value got
clamped, it always ended up as the near value in this case) (if clamping
actually happened).
Fix this by using the util helper for that (the math is otherwise "mostly"
the same,
From: Roland Scheidegger
Apparently, these are deprecated. There's some AutoUpgrade feature which
is supposed to promote these to cmp/select, which apparently doesn't work
with jit code. It is possible it's not actually even meant to work (see
the bug filed against llvm which
From: Roland Scheidegger
The previous assertions required for texture sizes smaller than block_size
that src_box.x + src_box.width still be block size.
(e.g. for a texture with width 3, and src_box.x = 0, src_box.width would
have to be 4 to not assert.)
This caused some
From: Roland Scheidegger
The gallium contract would be that bind flags must indicate all possible
bindings a resource might get used, but fact is the mesa state tracker does
not set bind flags correctly, and this is more or less unfixable due to GL.
This caused a bug with
From: Roland Scheidegger
The gallium contract would be that bind flags must indicate all possible
bindings a resource might get used, but fact is the mesa state tracker does
not set bind flags correctly, and this is more or less unfixable due to GL.
This caused a bug with
From: Roland Scheidegger
There were complaints from a mingw build:
u_draw.h:134:14: error: invalid conversion from ‘uint {aka unsigned int}’
to ‘pipe_prim_type’ [-fpermissive]
---
src/gallium/auxiliary/util/u_draw.h | 21 -
1 file changed, 16
From: Roland Scheidegger
Instead of doing a add and then mask out the upper bits, we can
simply do a add with a half wide type (this, of course, assumes
the hw can actually do it...), so we'll get the required zero
in the upper bits automatically.
---
From: Roland Scheidegger
Use GALLIVM_DEBUG=dumpbc for dumping of modules as bitcode.
Instead of a fixed llvmpipe.bc name, use ir_.bc so multiple
modules can be dumped (albeit it might still overwrite previous modules,
particularly the modules from draw tend to always have the
From: Roland Scheidegger
Those aren't really interesting, however outputting them is helpful when
trying to feed the IR to llvm llc (or opt) for debugging.
---
src/gallium/auxiliary/gallivm/lp_bld_intr.c | 5 +
1 file changed, 5 insertions(+)
diff --git
From: Roland Scheidegger
We don't target this yet, and some llvm versions incorrectly enable it based
on cpu string, causing crashes.
(Albeit this is a losing battle, it is pretty much guaranteed when the next
new feature comes along llvm will mistakenly enable it on some
From: Roland Scheidegger
At least with MCJIT the disassembler will crash otherwise when trying to
disassemble such functions.
---
src/gallium/auxiliary/gallivm/lp_bld_sample_soa.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git
From: Roland Scheidegger
Some cases (especially these using fract for coord wrapping) did not handle
NaNs (or Infs) correctly - the following code assumed the fract result
could not be outside [0,1], but if the input is a NaN (or +-Inf) the fract
result was NaN - which then
From: Roland Scheidegger
Screwed up since 0753b135f6e83b171d8a1b08aea967374f3542bc.
(Only an issue with different min/mag filters, and then only in some cases,
which is probably why it went unnoticed for quite a while.
The effect should have simply been nearest mip filter
From: Roland Scheidegger
This was part of EXT_gpu_shader4 - as such it should have been supported
by glsl 130.
It was however forgotten, and not added until glsl 430 - with the wrong
syntax no less (glsl 430 mentions it was overlooked).
glsl 440 (but revision 8 only) fixed
From: Roland Scheidegger
This was part of EXT_gpu_shader4 - as such it should have been supported
by glsl 130.
It was however forgotten, and not added until glsl 430 - with the wrong
syntax no less (glsl 430 mentions it was overlooked).
glsl 440 (but revision 8 only) fixed
From: Roland Scheidegger
llvm 3.7 sometimes simply miscompiles vector selects.
See https://bugs.freedesktop.org/show_bug.cgi?id=94972
---
src/gallium/auxiliary/gallivm/lp_bld_logic.c | 8 +---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git
From: Roland Scheidegger
The blend code would do a conditional assignment based on it, causing valgrind
to complain. Since that variable was actually unused in this case, this
doesn't fix anything but the warning.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94955
From: Roland Scheidegger
We used to use sse roundps intrinsic directly, but switched to use the llvm
intrinsics for rounding with e4f01da15d8c6ce3e8c77ff3ff3d2ce2574a3f7b.
However, llvm semantics follows standard math lib round function which is
specced to do
From: Roland Scheidegger
Some rasterization code relies (for sse) on the first and third planes
(but not the second for now) being 128bit aligned, and we didn't get that
on 32bit - I mistakenly thought the 64bit number in the struct would get
the thing aligned to 64bit even
From: Roland Scheidegger
The logic was comparing actual ints, not true/false values.
This meant that it was emitting always multiple line segments instead of just
one even if the stipple test had the same result, which looks inefficient, and
the segments also overlapped thus
From: Roland Scheidegger
All these img filter loops iterate through NUM_CHANNELS, not QUAD_SIZE.
In practice both are of course the same unchangeable value (4), but it
makes the code look a bit confusing. Moreover, some of the functions were
actually given an array of 4
From: Roland Scheidegger
The filt_args->offset wasn't assigned but was always used later leading
to a crash (as far as I can tell, texel offsets don't actually make much
sense with anisotropic filtering, but because there's no explicit setting
if offsets are enabled there the
From: Roland Scheidegger
Just like the rest of the msaa "implementation" it's just fake for now...
---
src/gallium/auxiliary/gallivm/lp_bld_tgsi_soa.c | 7 ++-
src/gallium/auxiliary/tgsi/tgsi_exec.c | 8 +---
2 files changed, 11 insertions(+), 4 deletions(-)
From: Roland Scheidegger
If the tri is fully inside a scissor edge (or rather, we just use the
bounding box of the tri for the comparison), then we can drop these
additional scissor "planes" early. We do not even need to allocate
space for them in the tri.
The math actually
From: Roland Scheidegger
Just slightly simpler assembly.
---
src/gallium/drivers/llvmpipe/lp_setup_tri.c | 11 +--
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/src/gallium/drivers/llvmpipe/lp_setup_tri.c
b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
From: Roland Scheidegger
When we switched to 64bit rasterization, we could no longer use straight
aligned loads for loading the plane data. However, what the code actually
does for loading 3 planes, is 12 scalar loads + 9 unpacks, and then there's
another 8 unpacks for the
From: Roland Scheidegger
Add support for these opcodes, the conversion functions were already
there albeit need some new packing stuff.
Just like the tgsi version, piglit won't like it for all the same
reasons, so it's disabled (UP2H passes piglit arb_shader_language_packing
don't enable the cap bit (so the code is unused).
(Code is from imirkin, comment from sroland)
Signed-off-by: Ilia Mirkin <imir...@alum.mit.edu>
Reviewed-by: Roland Scheidegger <srol...@vmvware.com>
---
src/gallium/auxiliary/tgsi/tgsi_exec.c | 44 --
src
From: Roland Scheidegger
I removed this mistakenly in 2dbc20e45689e09766552517a74e2270e49817b5. I
actually thought it should not be necessary and a piglit run didn't show
any differences, but this shouldn't have been in there.
draw_prepare_shader_outputs() is in fact
From: Roland Scheidegger
Doing that is clearly a bug. We can't quite assert as st/mesa may hit this,
but increase at least visibility of it a bit.
(For the non-refcounted objects it would be illegal too, but we can't detect
that unless we'd store the context ourselves. Plus,
From: Roland Scheidegger
If we have a d24x8 format, there is no stencil. Therefore, we can always
clear these bits too, which means this will be some kind of memset rather
than read-modify-write.
This is good for some 7% increase or so in gears with huge window size -
seems
From: Roland Scheidegger
If the tri is fully inside the scissor (or rather, we just use the
bounding box of the tri for the comparison), then we can drop these
additional scissor "planes" early.
(We could, of course, not even emit the scissor planes in this case
in the first
From: Roland Scheidegger
---
src/gallium/auxiliary/util/u_format_parse.py | 2 +-
src/mesa/main/format_parser.py | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/gallium/auxiliary/util/u_format_parse.py
From: Roland Scheidegger
The existing code used ssse3, and because it isn't compiled in a separate
file compiled with that, it is usually not used (that, of course, could
be fixed...), whereas sse2 is always present at least with 64bit builds.
It is actually trivial to do
From: Roland Scheidegger
Like the previous patch, but this time instead of direct format pack
functions, this handles convert_ubyte if the destination and source
were both ubyte unorm with 4 channels (so this can do things like
bgrx8->rgba8, apart from swizzling filling in
From: Roland Scheidegger
This certainly isn't as generic as it would be ideally, but got to start
somewhere...
Handles just rgba8/rgbx8 formats (so just swizzling). Even when using
cached regions, these functions are definitely quite a bit faster than
the c ones (for larger
From: Roland Scheidegger
The existing code used ssse3, and because it isn't compiled in a separate
file compiled with that, it is usually not used (that, of course, could
be fixed...), whereas sse2 is always present at least with 64bit builds.
It is actually trivial to do
101 - 200 of 624 matches
Mail list logo