Re: [FFmpeg-devel] [PATCH] avfilter/vf_bwdif_cuda: CUDA implementation of bwdif

2020-10-13 Thread Thomas Mundt
On Mon, 12 Oct 2020 at 21:42, Philip Langdale <phil...@overt.org> wrote:

> On Sun, 11 Oct 2020 18:36:42 +0200
> Thomas Mundt  wrote:
>
> > Hi Philip,
> >
> > On Fri, 9 Oct 2020 at 18:33, Philip Langdale wrote:
> >
> > > I've been sitting on this for a couple of years now, and I figured I
> > > should just send it out. This is what I believe is a conceptually
> > > correct port of bwdif to cuda (modulo edge handling which is not
> > > done in the same way because the conditional checks for edges are
> > > expensive in cuda, but that's the same as for yadif_cuda).
> > >
> > > However, I see glitches in some samples where black or white pixels
> > > appear in white or black areas respectively. This seems like some
> > > sort of under/overflow. I've tried to use the largest cuda types
> > > everywhere, and that did appear to improve things but didn't make
> > > it go away. This is what led to me never sending this diff over the
> > > years, but maybe someone else has insights about this.
> > >
> >
> > I am not familiar with CUDA, so here is just one difference from the
> > C code that I noticed. Maybe it is the reason for the glitches.
> >
> > > +
> > > +template<typename T>
> > > +__inline__ __device__ T filter(T A, T B, T C, T D,
> > > +                               T a, T b, T c, T d, T e, T f, T g,
> > > +                               T h, T i, T j, T k, T l, T m, T n,
> > > +                               int clip_max)
> > > +{
> > > +    T final;
> > > +
> > > +    int fc = C;
> > > +    int fd = (c + l) >> 1;
> > > +    int fe = B;
> > >
> >
> > In the following you sometimes use B and C directly and sometimes fc
> > and fe. Is there a reason for this?
>
> Unfortunately, I can't remember. This may have had something to do with
> wanting those calculations to be done with smaller data types, but why
> do that? Switching them did not have any obvious visual effect.
>
> >
> > > +
> > > +    int temporal_diff0 = abs(c - l);
> > > +    int temporal_diff1 = (abs(g - fc) + abs(f - fe)) >> 1;
> > > +    int temporal_diff2 = (abs(i - fc) + abs(h - fe)) >> 1;
> > > +    int diff = max3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2);
> > > +
> > > +    if (!diff) {
> > > +        final = fd;
> > > +    } else {
> > > +        int fb = ((d + m) >> 1) - fc;
> > > +        int ff = ((c + l) >> 1) - fe;
> > >
> >
> > Unless I am missing something, this should be:
> > int ff = ((b + k) >> 1) - fe;
>
> I think you're right. This also doesn't seem to change things
> significantly; the glitches are still there, but that's not surprising.
> This fix would make the non-glitched parts more correct.
>
> Thanks for taking a look. I'll keep banging my head against this one.
>

Could you please point out in the description of the bwdif_cuda filter that
the top and bottom edges and the first and last fields are processed
differently from the bwdif filter? This can lead to glitches at the upper
and lower edges and ghosting in the first and last fields.

Regards,
Thomas
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH] avfilter/vf_bwdif_cuda: CUDA implementation of bwdif

2020-10-12 Thread Philip Langdale
On Sun, 11 Oct 2020 18:36:42 +0200
Thomas Mundt  wrote:

> Hi Philip,
> 
> On Fri, 9 Oct 2020 at 18:33, Philip Langdale wrote:
> 
> > I've been sitting on this for a couple of years now, and I figured I
> > should just send it out. This is what I believe is a conceptually
> > correct port of bwdif to cuda (modulo edge handling which is not
> > done in the same way because the conditional checks for edges are
> > expensive in cuda, but that's the same as for yadif_cuda).
> >
> > However, I see glitches in some samples where black or white pixels
> > appear in white or black areas respectively. This seems like some
> > sort of under/overflow. I've tried to use the largest cuda types
> > everywhere, and that did appear to improve things but didn't make
> > it go away. This is what led to me never sending this diff over the
> > years, but maybe someone else has insights about this.
> >  
> 
> I am not familiar with CUDA, so here is just one difference from the
> C code that I noticed. Maybe it is the reason for the glitches.
> 
> > +
> > +template<typename T>
> > +__inline__ __device__ T filter(T A, T B, T C, T D,
> > +                               T a, T b, T c, T d, T e, T f, T g,
> > +                               T h, T i, T j, T k, T l, T m, T n,
> > +                               int clip_max)
> > +{
> > +    T final;
> > +
> > +    int fc = C;
> > +    int fd = (c + l) >> 1;
> > +    int fe = B;
> >  
> 
> In the following you sometimes use B and C directly and sometimes fc
> and fe. Is there a reason for this?

Unfortunately, I can't remember. This may have had something to do with
wanting those calculations to be done with smaller data types, but why
do that? Switching them did not have any obvious visual effect.
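
That would be expected for the scalar 8- and 16-bit cases at least:
C and C++ promote operands narrower than int to int before doing
arithmetic, so going through fc/fe or using B/C directly should give
the same value. A minimal C sketch of that rule follows. It is not
code from the patch, diff_via_int and diff_direct are made-up names,
and it says nothing about how the uchar2/ushort2 variants behave.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors "int fc = C; ... abs(g - fc)": C is widened to int explicitly. */
static int diff_via_int(uint16_t g, uint16_t C)
{
    int fc = C;
    return abs(g - fc);
}

/* Mirrors using C directly. Assumption: int is wider than 16 bits, so both
 * uint16_t operands are promoted to (signed) int before the subtraction and
 * the difference can go negative without wrapping. */
static int diff_direct(uint16_t g, uint16_t C)
{
    assert(sizeof(int) > sizeof(uint16_t));
    return abs(g - C);
}
```

Both helpers return 65535 for g = 0 and C = 65535, which is consistent
with the swap having no visible effect.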

> 
> > +
> > +    int temporal_diff0 = abs(c - l);
> > +    int temporal_diff1 = (abs(g - fc) + abs(f - fe)) >> 1;
> > +    int temporal_diff2 = (abs(i - fc) + abs(h - fe)) >> 1;
> > +    int diff = max3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2);
> > +
> > +    if (!diff) {
> > +        final = fd;
> > +    } else {
> > +        int fb = ((d + m) >> 1) - fc;
> > +        int ff = ((c + l) >> 1) - fe;
> >  
> 
> Unless I am missing something, this should be:
> int ff = ((b + k) >> 1) - fe;

I think you're right. This also doesn't seem to change things
significantly; the glitches are still there, but that's not surprising.
This fix would make the non-glitched parts more correct.

Thanks for taking a look. I'll keep banging my head against this one.

--phil

Re: [FFmpeg-devel] [PATCH] avfilter/vf_bwdif_cuda: CUDA implementation of bwdif

2020-10-11 Thread Thomas Mundt
Hi Philip,

On Fri, 9 Oct 2020 at 18:33, Philip Langdale wrote:

> I've been sitting on this for a couple of years now, and I figured I
> should just send it out. This is what I believe is a conceptually
> correct port of bwdif to cuda (modulo edge handling which is not done
> in the same way because the conditional checks for edges are expensive
> in cuda, but that's the same as for yadif_cuda).
>
> However, I see glitches in some samples where black or white pixels
> appear in white or black areas respectively. This seems like some
> sort of under/overflow. I've tried to use the largest cuda types
> everywhere, and that did appear to improve things but didn't make
> it go away. This is what led to me never sending this diff over the
> years, but maybe someone else has insights about this.
>

I am not familiar with CUDA, so here is just one difference from the
C code that I noticed. Maybe it is the reason for the glitches.


> ---
>  configure                    |   2 +
>  libavfilter/Makefile         |   2 +
>  libavfilter/allfilters.c     |   1 +
>  libavfilter/vf_bwdif_cuda.c  | 394 +++
>  libavfilter/vf_bwdif_cuda.cu | 290 ++
>  5 files changed, 689 insertions(+)
>  create mode 100644 libavfilter/vf_bwdif_cuda.c
>  create mode 100644 libavfilter/vf_bwdif_cuda.cu
>
> ...

> +
> +template<typename T>
> +__inline__ __device__ T filter(T A, T B, T C, T D,
> +                               T a, T b, T c, T d, T e, T f, T g,
> +                               T h, T i, T j, T k, T l, T m, T n,
> +                               int clip_max)
> +{
> +    T final;
> +
> +    int fc = C;
> +    int fd = (c + l) >> 1;
> +    int fe = B;
>

In the following you sometimes use B and C directly and sometimes fc and
fe. Is there a reason for this?


> +
> +    int temporal_diff0 = abs(c - l);
> +    int temporal_diff1 = (abs(g - fc) + abs(f - fe)) >> 1;
> +    int temporal_diff2 = (abs(i - fc) + abs(h - fe)) >> 1;
> +    int diff = max3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2);
> +
> +    if (!diff) {
> +        final = fd;
> +    } else {
> +        int fb = ((d + m) >> 1) - fc;
> +        int ff = ((c + l) >> 1) - fe;
>

Unless I am missing something, this should be:
int ff = ((b + k) >> 1) - fe;
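
To make the suggestion concrete, here is a small plain-C sketch of this
step with the corrected ff. It is only an illustration, not code from
either filter: refine_diff, min2 and max2 are made-up names, and I am
assuming that k, l and m are the next-frame counterparts of b, c and d
(lines yo + 2, yo and yo - 2).

```c
#include <assert.h>

/* Small helpers, written out so the sketch is self-contained. */
static int min2(int a, int b) { return a < b ? a : b; }
static int max2(int a, int b) { return a > b ? a : b; }
static int min3(int a, int b, int c) { return min2(min2(a, b), c); }
static int max3(int a, int b, int c) { return max2(max2(a, b), c); }

/*
 * Hypothetical helper: widen the temporal diff using the vertical
 * neighbours. B and C are the current-frame lines at yo + 1 and yo - 1;
 * b, c, d are taken from the previous frame and k, l, m from the next one.
 */
static int refine_diff(int diff, int B, int C,
                       int b, int c, int d,
                       int k, int l, int m)
{
    assert(diff >= 0);              /* diff is a maximum of absolute values */

    int fc = C;
    int fd = (c + l) >> 1;
    int fe = B;

    int fb = ((d + m) >> 1) - fc;   /* yo - 2 average minus line yo - 1 */
    int ff = ((b + k) >> 1) - fe;   /* yo + 2 average minus line yo + 1 */
    int dc = fd - fc;
    int de = fd - fe;
    int mmax = max3(de, dc, min2(fb, ff));
    int mmin = min3(de, dc, max2(fb, ff));

    return max3(diff, mmin, -mmax);
}
```

With the (c + l) version, ff is always equal to de, so the lines at
yo + 2 never influence the result.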


> +        int dc = fd - fc;
> +        int de = fd - fe;
> +        int mmax = max3(de, dc, min(fb, ff));
> +        int mmin = min3(de, dc, max(fb, ff));
> +        diff = max3(diff, mmin, -mmax);
> +
> +        int interpol;
> +        if (abs(fc - fe) > temporal_diff0) {
> +            interpol = (((coef_hf[0] * (c + l)
> +                - coef_hf[1] * (d + m + b + k)
> +                + coef_hf[2] * (e + n + a + j)) >> 2)
> +                + coef_lf[0] * (C + B) - coef_lf[1] * (D + A)) >> 13;
> +        } else {
> +            interpol = (coef_sp[0] * (C + B) - coef_sp[1] * (D + A)) >> 13;
> +        }
> +        if (interpol > fd + diff) {
> +            interpol = fd + diff;
> +        } else if (interpol < fd - diff) {
> +            interpol = fd - diff;
> +        }
> +        final = clip(interpol, 0, clip_max);
> +    }
> +
> +    return final;
> +}
> +
> +template<typename T>
> +__inline__ __device__ void bwdif_single(T *dst,
> +                                        cudaTextureObject_t prev,
> +                                        cudaTextureObject_t cur,
> +                                        cudaTextureObject_t next,
> +                                        int dst_width, int dst_height, int dst_pitch,
> +                                        int src_width, int src_height,
> +                                        int parity, int tff, bool skip_spatial_check,
> +                                        int clip_max)
> +{
> +    // Identify location
> +    int xo = blockIdx.x * blockDim.x + threadIdx.x;
> +    int yo = blockIdx.y * blockDim.y + threadIdx.y;
> +
> +    if (xo >= dst_width || yo >= dst_height) {
> +        return;
> +    }
> +
> +    // Don't modify the primary field
> +    if (yo % 2 == parity) {
> +      dst[yo*dst_pitch+xo] = tex2D<T>(cur, xo, yo);
> +      return;
> +    }
> +
> +    T A = tex2D<T>(cur, xo, yo + 3);
> +    T B = tex2D<T>(cur, xo, yo + 1);
> +    T C = tex2D<T>(cur, xo, yo - 1);
> +    T D = tex2D<T>(cur, xo, yo - 3);
> +
> +    // Calculate temporal prediction
> +    int is_second_field = !(parity ^ tff);
> +
> +    cudaTextureObject_t prev2 = prev;
> +    cudaTextureObject_t prev1 = is_second_field ? cur : prev;
> +    cudaTextureObject_t next1 = is_second_field ? next : cur;
> +    cudaTextureObject_t next2 = next;
> +
> +    T a = tex2D<T>(prev2, xo,  yo + 4);
> +    T b = tex2D<T>(prev2, xo,  yo + 2);
> +    T c = tex2D<T>(prev2, xo,  yo + 0);
> +    T d = tex2D<T>(prev2, xo,  yo - 2);
> +    T e = tex2D<T>(prev2, xo,  yo - 4);
> +    T f = tex2D<T>(prev1, xo,  yo + 1);
> +    T g = tex2D<T>(prev1, xo,  yo - 1);
> +    T h = tex2D<T>(next1, xo,  yo + 1);
> +    T i = tex2D<T>(next1, xo,  yo - 1);
> +    T j = tex2D<T>(next2, xo,  yo + 

[FFmpeg-devel] [PATCH] avfilter/vf_bwdif_cuda: CUDA implementation of bwdif

2020-10-09 Thread Philip Langdale
I've been sitting on this for a couple of years now, and I figured I
should just send it out. This is what I believe is a conceptually
correct port of bwdif to cuda (modulo edge handling which is not done
in the same way because the conditional checks for edges are expensive
in cuda, but that's the same as for yadif_cuda).

However, I see glitches in some samples where black or white pixels
appear in white or black areas respectively. This seems like some
sort of under/overflow. I've tried to use the largest cuda types
everywhere, and that did appear to improve things but didn't make
it go away. This is what led to me never sending this diff over the
years, but maybe someone else has insights about this.
---
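
For anyone trying to picture the symptom: below is a minimal C sketch
(not taken from the patch) of how an out-of-range intermediate turns
into the opposite extreme when it is narrowed to an 8-bit pixel without
being clipped first. store_wrapped and store_clipped are made-up helper
names, and this only illustrates the wraparound, it is not a diagnosis
of where the kernel goes wrong.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: narrow an int result to 8 bits WITHOUT clipping.
 * The conversion is modulo 256, so 256 becomes 0 (a black pixel in a white
 * area) and -1 becomes 255 (a white pixel in a black area). */
static uint8_t store_wrapped(int v)
{
    return (uint8_t)v;
}

/* Hypothetical helper: clip to [0, clip_max] before narrowing, which is
 * what the final clip(interpol, 0, clip_max) in filter() is meant to do. */
static uint8_t store_clipped(int v, int clip_max)
{
    assert(clip_max >= 0 && clip_max <= 255);
    if (v < 0)
        return 0;
    if (v > clip_max)
        return (uint8_t)clip_max;
    return (uint8_t)v;
}
```

A value that overshoots white by one (256) is stored as 0 by
store_wrapped but as 255 by store_clipped(256, 255).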
 configure                    |   2 +
 libavfilter/Makefile         |   2 +
 libavfilter/allfilters.c     |   1 +
 libavfilter/vf_bwdif_cuda.c  | 394 +++
 libavfilter/vf_bwdif_cuda.cu | 290 ++
 5 files changed, 689 insertions(+)
 create mode 100644 libavfilter/vf_bwdif_cuda.c
 create mode 100644 libavfilter/vf_bwdif_cuda.cu

diff --git a/configure b/configure
index 75f0a0fcaa..4e7a97b17e 100755
--- a/configure
+++ b/configure
@@ -3511,6 +3511,8 @@ bm3d_filter_select="dct"
 boxblur_filter_deps="gpl"
 boxblur_opencl_filter_deps="opencl gpl"
 bs2b_filter_deps="libbs2b"
+bwdif_cuda_filter_deps="ffnvcodec"
+bwdif_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 chromaber_vulkan_filter_deps="vulkan libglslang"
 colorkey_opencl_filter_deps="opencl"
 colormatrix_filter_deps="gpl"
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index e6d3c283da..db99238fce 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -178,6 +178,8 @@ OBJS-$(CONFIG_BOXBLUR_FILTER)                += vf_boxblur.o boxblur.o
 OBJS-$(CONFIG_BOXBLUR_OPENCL_FILTER)         += vf_avgblur_opencl.o opencl.o \
                                                 opencl/avgblur.o boxblur.o
 OBJS-$(CONFIG_BWDIF_FILTER)                  += vf_bwdif.o yadif_common.o
+OBJS-$(CONFIG_BWDIF_CUDA_FILTER)             += vf_bwdif_cuda.o vf_bwdif_cuda.ptx.o \
+                                                yadif_common.o
 OBJS-$(CONFIG_CAS_FILTER)                    += vf_cas.o
 OBJS-$(CONFIG_CHROMABER_VULKAN_FILTER)       += vf_chromaber_vulkan.o vulkan.o
 OBJS-$(CONFIG_CHROMAHOLD_FILTER)             += vf_chromakey.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index fa91e608e4..2da43166a5 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -169,6 +169,7 @@ extern AVFilter ff_vf_bm3d;
 extern AVFilter ff_vf_boxblur;
 extern AVFilter ff_vf_boxblur_opencl;
 extern AVFilter ff_vf_bwdif;
+extern AVFilter ff_vf_bwdif_cuda;
 extern AVFilter ff_vf_cas;
 extern AVFilter ff_vf_chromahold;
 extern AVFilter ff_vf_chromakey;
diff --git a/libavfilter/vf_bwdif_cuda.c b/libavfilter/vf_bwdif_cuda.c
new file mode 100644
index 0000000000..7651a869d5
--- /dev/null
+++ b/libavfilter/vf_bwdif_cuda.c
@@ -0,0 +1,394 @@
+/*
+ * Copyright (C) 2018 Philip Langdale 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/hwcontext_cuda_internal.h"
+#include "libavutil/cuda_check.h"
+#include "internal.h"
+#include "yadif.h"
+
+extern char vf_bwdif_cuda_ptx[];
+
+typedef struct DeintCUDAContext {
+    YADIFContext yadif;
+
+    AVCUDADeviceContext *hwctx;
+    AVBufferRef         *device_ref;
+    AVBufferRef         *input_frames_ref;
+    AVHWFramesContext   *input_frames;
+
+    CUcontext   cu_ctx;
+    CUstream    stream;
+    CUmodule    cu_module;
+    CUfunction  cu_func_uchar;
+    CUfunction  cu_func_uchar2;
+    CUfunction  cu_func_ushort;
+    CUfunction  cu_func_ushort2;
+} DeintCUDAContext;
+
+#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) )
+#define ALIGN_UP(a, b) (((a) + (b) - 1) & ~((b) - 1))
+#define BLOCKX 32
+#define BLOCKY 16
+
+#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, s->hwctx->internal->cuda_dl, x)
+
+static CUresult call_kernel(AVFilterContext *ctx, CUfunction func,
+                            CUdeviceptr prev, CUdeviceptr cur, CUdeviceptr next,
+                            CUarray_format format, int channels,
+                            int src_width,  // Width is pixels per channel
+