Re: [darktable-dev] OpenCL scheduling profiles

2017-04-30 Thread Ulrich Pegelow

Am 30.04.2017 um 19:43 schrieb Christian Kanzian:

A bit late, but today I had some time to test this more deeply. In general
detection seems to work well, and the "very fast GPU" profile with a single GPU
works nicely as well.


On startup the profile was set to "multiple GPUs", so detection works. Unfortunately
the GT 640 is relatively slow, and the full pipeline often gets processed on
this device:
[pixelpipe_process] [thumbnail] using device 0
[pixelpipe_process] [full] using device 1
[pixelpipe_process] [preview] using device 0

If the full pipe runs on a slow GPU, switching between images is way
slower than before on larger history stacks, especially with denoising active.

So I set opencl_device_priority as written in the manual to:
opencl_device_priority=!GeForce GT 640,*/!GeForce GTX 1060 6GB,*/GeForce GTX
1060 6GB,*/GeForce GTX 1060 6GB,*

Now the full pipe should no longer run on device 1, but it still does run on
device 1 when I switch between images:
[pixelpipe_process] [thumbnail] using device 0
[pixelpipe_process] [full] using device 1
[pixelpipe_process] [preview] using device 0

Zooming after switching works correctly on device 0.

darktable -d opencl reports this:
[opencl_update_scheduling_profile] scheduling profile set to multiple GPUs
[opencl_priorities] these are your device priorities:
[opencl_priorities] image   preview export  thumbnail
[opencl_priorities] 0   0   0   0
[opencl_priorities] 1   1   1   1
[opencl_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_priorities] image   preview export  thumbnail
[opencl_priorities] 0   0   0   0

What does this output mean? Are my opencl_device_priority settings being refused?
Maybe this is a corner case, since leaving a second, slow GPU in a system
does not make much sense.


Yepp, opencl_device_priority is only used if the "default" scheduling
profile has been selected; that mode offers the maximum of configuration
options. I briefly considered making the "multiple GPUs" profile
auto-adapt to the speed of the detected devices, but in the end this is
really a corner case and the effort is probably not justified.
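For the record, a sketch of the darktablerc lines that would make the priority string take effect (assuming the keys from Christian's config listing; edit with darktable closed, priority value shown on one line):

```
opencl_scheduling_profile=default
opencl_device_priority=!GeForce GT 640,*/!GeForce GTX 1060 6GB,*/GeForce GTX 1060 6GB,*/GeForce GTX 1060 6GB,*
```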


Ulrich



___
darktable developer mailing list
to unsubscribe send a mail to darktable-dev+unsubscr...@lists.darktable.org



Re: [darktable-dev] OpenCL scheduling profiles

2017-04-30 Thread Christian Kanzian
Hi,

> I am interested in your experience, both in terms of automatic detection
> of the best suited profile and in terms of overall performance. Please
> note that this is all about system latency and perceived system
> responsiveness in the darkroom view. Calling darktable with '-d perf'
> will only give you limited insights so you need to mostly rely on your
> own judgement.
A bit late, but today I had some time to test this more deeply. In general
detection seems to work well, and the "very fast GPU" profile with a single GPU
works nicely as well.

My hardware:
Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
0   'GeForce GTX 1060 6GB'

On startup the profile was set to "very fast GPU". Switching between images is
faster because the preview pipe is faster than with "default". This is fine so far.

Then I plugged in my old GT 640 as well.
1   'GeForce GT 640'

On startup the profile was set to "multiple GPUs", so detection works. Unfortunately
the GT 640 is relatively slow, and the full pipeline often gets processed on
this device:
[pixelpipe_process] [thumbnail] using device 0
[pixelpipe_process] [full] using device 1
[pixelpipe_process] [preview] using device 0

If the full pipe runs on a slow GPU, switching between images is way
slower than before on larger history stacks, especially with denoising active.

So I set opencl_device_priority as written in the manual to:
opencl_device_priority=!GeForce GT 640,*/!GeForce GTX 1060 6GB,*/GeForce GTX 
1060 6GB,*/GeForce GTX 1060 6GB,*

Now the full pipe should no longer run on device 1, but it still does run on
device 1 when I switch between images:
[pixelpipe_process] [thumbnail] using device 0
[pixelpipe_process] [full] using device 1
[pixelpipe_process] [preview] using device 0

Zooming after switching works correctly on device 0.

darktable -d opencl reports this:
[opencl_update_scheduling_profile] scheduling profile set to multiple GPUs
[opencl_priorities] these are your device priorities:
[opencl_priorities] image   preview export  thumbnail
[opencl_priorities] 0   0   0   0
[opencl_priorities] 1   1   1   1
[opencl_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_priorities] image   preview export  thumbnail
[opencl_priorities] 0   0   0   0

What does this output mean? Are my opencl_device_priority settings being refused?
Maybe this is a corner case, since leaving a second, slow GPU in a system
does not make much sense.
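As a side note on the string's structure: it has four '/'-separated groups, one each for the image (full), preview, export, and thumbnail pipes, where '!' excludes a device and '*' matches any remaining device. A toy illustration of that grammar (this is not darktable's parser, and other prefixes from the manual are omitted):

```python
# Toy illustration of the opencl_device_priority grammar: four '/'-separated
# groups (image/preview/export/thumbnail), each a comma-separated list where
# a leading '!' excludes a device and '*' matches any remaining device.
PIPES = ("image", "preview", "export", "thumbnail")

def parse_priority(setting):
    groups = setting.split("/")
    assert len(groups) == len(PIPES), "expected one group per pixelpipe"
    result = {}
    for pipe, group in zip(PIPES, groups):
        entries = [e for e in group.split(",") if e]
        result[pipe] = {
            "excluded": [e[1:] for e in entries if e.startswith("!")],
            "allowed": [e for e in entries if not e.startswith("!")],
        }
    return result

prio = parse_priority(
    "!GeForce GT 640,*/!GeForce GTX 1060 6GB,*/"
    "GeForce GTX 1060 6GB,*/GeForce GTX 1060 6GB,*"
)
# with this setting the image (full) pipe should never use the GT 640:
print(prio["image"]["excluded"])  # ['GeForce GT 640']
```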

All the best,
Christian

Here are all my OpenCL settings from darktablerc:
opencl=TRUE
opencl_async_pixelpipe=TRUE
opencl_avoid_atomics=FALSE
opencl_checksum=2128616438
opencl_device_priority=!GeForce GT 640,*/!GeForce GTX 1060 6GB,*/GeForce GTX 
1060 6GB,*/GeForce GTX 1060 6GB,*
opencl_enable_markesteijn=true
opencl_library=
opencl_mandatory_timeout=200
opencl_memory_headroom=0
opencl_memory_requirement=768
opencl_micro_nap=0
opencl_number_event_handles=175
opencl_omit_whitebalance=
opencl_runtime=
opencl_scheduling_profile=very fast GPU
opencl_size_roundup=16
opencl_synch_cache=false
opencl_use_cpu_devices=false
opencl_use_events=FALSE
opencl_use_pinned_memory=FALSE

Am Samstag, 8. April 2017, 14:29:18 CEST schrieb Ulrich Pegelow:
> Hi,
> 
> I added a bit more flexibility concerning OpenCL device scheduling into
> master. There is a new selection box in preferences (core options) that
> lets you choose among a few typical presets.
> 
> The main target is modern systems with very fast GPUs. By default and
> "traditionally" darktable distributes work between CPU and GPU in the
> darkroom: the GPU processes the center (full) view and the CPU is
> responsible for the preview (navigation) panel. Now that GPUs get faster
> and faster there are systems where the GPU so strongly outperforms the
> CPU that it makes more sense to process preview and full pixelpipe on
> the GPU sequentially.
> 
> For that reason the "OpenCL scheduling profile" parameter has three options:
> 
> * "default" describes the old behavior: work is split between GPU and
> CPU and works best for systems where CPU and GPU performance are on a
> similar level.
> 
> * "very fast GPU" tackles the case described above: in darkroom view
> both pixelpipes are sequentially processed by the GPU. This is meant for
> GPUs which strongly outperform the CPU on that system.
> 
> * "multiple GPUs" is meant for systems with more than one OpenCL device
> so that the full and the preview pixelpipe get processed by separate GPUs.
> 
> At first startup darktable tries to find the best suited profile based
> on some benchmarking. You may change the profile at any time; it takes
> effect immediately.
> 
> I am interested in your experience, both in terms of automatic detection
> of the best suited profile and in terms of overall performance. Please
> note that this is all about system latency and perceived system
> responsiveness in the darkroom view. Calling darktable with '-d perf'
> 

Re: [darktable-dev] OpenCL scheduling profiles

2017-04-09 Thread Ulrich Pegelow

Am 09.04.2017 um 17:29 schrieb Matthias Andree:

What's your number of background threads (fourth entry in core options)?


It's currently set to 2; if I remove it from the configuration file while
darktable is stopped, it reverts to 2 after darktable is next restarted
and closed.

Note I see this quite often, but I don't see where that time comes from:

[dev] took 4,787 secs (5,388 CPU) to load the image.
[dev] took 4,787 secs (5,388 CPU) to load the image.



You might try higher values like six or eight. The main advantage of many
background threads is hiding I/O latency, and that might be the main
issue here.



Looking at iotop, it appears that the prime concern is that darktable
maxes out the external USB3 HDD reading from NTFS...
Reducing to 1 thread stalled the UI at first, but it came back with some 30
thumbnails all at once.



It might well be that the main issue on your system is stalling I/O (for
whatever reason). Please run some experiments from a very fast storage
medium (SSD, RAM disk) to find out if this is the main cause.
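To take the storage out of the equation, one could time a cold read of the same RAW file from both locations; a rough sketch (paths are placeholders, and repeat runs hit the OS page cache, so only the first read per location is meaningful):

```python
# Quick-and-dirty comparison of raw read time for the same file from two
# storage locations. Paths below are placeholders; copy one of your RAW
# files to an SSD or ramdisk first. Beware: the OS page cache makes every
# read after the first one artificially fast.
import time

def read_seconds(path, chunk=1 << 20):
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):  # stream the file in 1 MiB chunks
            pass
    return time.perf_counter() - start

# example with hypothetical paths:
# print(read_seconds("/mnt/usb-hdd/IMG_0001.CR2"))
# print(read_seconds("/tmp/ramdisk/IMG_0001.CR2"))
```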



I sometimes see modules like highlight reconstruction, CA correction, or
demosaic ("Entrastern") still being dispatched to the CPU, which is very
slow, when they are normally dispatched to the GPU. Statistics below. It
seems the only module that is supposed to be on the CPU is Gamma, and
it's so blazingly fast that we don't need to care. Sorry for the German,
but you get the idea. This is only from launching darktable in
lighttable view:



There are some modules where no OpenCL code is available (AMaZE
demosaic, raw denoise, color input/output profile with LittleCMS2), but I
cannot say if this is the main cause here. At least several of the
modules from the output below have OpenCL support. Please try further to
isolate whether slow CPU processing correlates with specific images and
their history stacks.



$ grep 'on CPU' /tmp/dt-perf-opencl.log  | sort -k7 | uniq -f6 -c | sort -nr
124 [dev_pixelpipe] took 0,000 secs (0,000 CPU) processed `Gamma' on
CPU, blended on CPU [thumbnail]
  6 [dev_pixelpipe] took 0,026 secs (0,076 CPU) processed
`Entrastern' on CPU, blended on CPU [thumbnail]
  5 [dev_pixelpipe] took 0,276 secs (0,832 CPU) processed
`Chromatische Aberration' on CPU, blended on CPU [thumbnail]
  5 [dev_pixelpipe] took 0,019 secs (0,060 CPU) processed
`Spitzlicht-Rekonstruktion' on CPU, blended on CPU [thumbnail]
  2 [dev_pixelpipe] took 0,118 secs (0,348 CPU) processed
`Raw-Schwarz-/Weißpunkt' on CPU, blended on CPU [thumbnail]
  2 [dev_pixelpipe] took 0,052 secs (0,140 CPU) processed
`Weißabgleich' on CPU, blended on CPU [thumbnail]
  2 [dev_pixelpipe] took 0,023 secs (0,036 CPU) processed
`Tonemapping' on CPU, blended on CPU [thumbnail]
  2 [dev_pixelpipe] took 0,008 secs (0,016 CPU) processed
`Objektivkorrektur' on CPU, blended on CPU [thumbnail]
  2 [dev_pixelpipe] took 0,001 secs (0,004 CPU) processed
`Ausgabefarbprofil' on CPU, blended on CPU [thumbnail]
  2 [dev_pixelpipe] took 0,001 secs (0,000 CPU) processed
`Eingabefarbprofil' on CPU, blended on CPU [thumbnail]
  2 [dev_pixelpipe] took 0,000 secs (0,000 CPU) processed `Schärfen'
on CPU, blended on CPU [thumbnail]
  2 [dev_pixelpipe] took 0,000 secs (0,000 CPU) processed
`Basiskurve' on CPU, blended on CPU [thumbnail]
  1 [dev_pixelpipe] took 3,126 secs (9,444 CPU) processed
`Raw-Entrauschen' on CPU, blended on CPU [thumbnail]
  1 [dev_pixelpipe] took 0,000 secs (0,000 CPU) processed `Drehung'
on CPU, blended on CPU [thumbnail]






Re: [darktable-dev] OpenCL scheduling profiles

2017-04-09 Thread Ulrich Pegelow

Am 09.04.2017 um 11:00 schrieb Matthias Andree:

Am 08.04.2017 um 14:29 schrieb Ulrich Pegelow:
2. What bothers me though are the timeouts and their defaults. In
practice, the darkroom works ok-ish, but the lighttable does not. When
a truckload full of small thumbnails (say, lighttable zoomed out to show
10 columns of images) needs to be regenerated for the lighttable, it
*appears* (not yet corroborated with measurements) that bumping up
timeouts considerably helps to avoid latencies, as though things were
deadlocking and waiting for the timer to break the lock. Might be an
internal issue with the synchronization though - how fine-grained is
the retry? Is it sleep-and-retry, or does it use some form of
semaphores and signalling at the system level between threads?



What's your number of background threads (fourth entry in core options)?





Re: [darktable-dev] OpenCL scheduling profiles

2017-04-09 Thread Matthias Andree
Am 08.04.2017 um 14:29 schrieb Ulrich Pegelow:
> Hi,
>
> I added a bit more flexibility concerning OpenCL device scheduling
> into master. There is a new selection box in preferences (core
> options) that lets you choose among a few typical presets.
>
> The main target is modern systems with very fast GPUs. By default and
> "traditionally" darktable distributes work between CPU and GPU in the
> darkroom: the GPU processes the center (full) view and the CPU is
> responsible for the preview (navigation) panel. Now that GPUs get
> faster and faster there are systems where the GPU so strongly
> outperforms the CPU that it makes more sense to process preview and
> full pixelpipe on the GPU sequentially.
>
> For that reason the "OpenCL scheduling profile" parameter has three
> options:
>
> * "default" describes the old behavior: work is split between GPU and
> CPU and works best for systems where CPU and GPU performance are on a
> similar level.
>
> * "very fast GPU" tackles the case described above: in darkroom view
> both pixelpipes are sequentially processed by the GPU. This is meant
> for GPUs which strongly outperform the CPU on that system.
>
> * "multiple GPUs" is meant for systems with more than one OpenCL
> device so that the full and the preview pixelpipe get processed by
> separate GPUs.
>
> At first startup darktable tries to find the best suited profile based
> on some benchmarking. You may change the profile at any time; it
> takes effect immediately.
>
> I am interested in your experience, both in terms of automatic
> detection of the best suited profile and in terms of overall
> performance. Please note that this is all about system latency and
> perceived system responsiveness in the darkroom view. Calling
> darktable with '-d perf' will only give you limited insights so you
> need to mostly rely on your own judgement.
>

Hi Ulrich,

1. gorgeous, thank you very much!

For me, the benchmarking seems to DTRT™ (do the right thing): it picks
the "very fast GPU" profile with a 2016 NVidia GeForce GTX 1060 6 GB and
an old 2009 AMD Phenom II X4 2.5 GHz 65 W quad-core. The code is compiled
with -O2 -march=native, OpenMP and OpenCL enabled, and I get this:

[opencl_init] here are the internal numbers and names of OpenCL devices
available to darktable:
[opencl_init]   0   'GeForce GTX 1060 6GB'
[opencl_init] FINALLY: opencl is AVAILABLE on this system.
[opencl_init] initial status of opencl enabled flag is ON.
[opencl_create_kernel] successfully loaded kernel `zero' (0) for device 0
[...]
[opencl_init] benchmarking results: 0.029428 seconds for fastest GPU
versus 0.382860 seconds for CPU.
[opencl_init] set scheduling profile for very fast GPU.
[opencl_priorities] these are your device priorities:
[opencl_priorities] image   preview export  thumbnail
[opencl_priorities] 0   0   0   0
[opencl_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_priorities] image   preview export  thumbnail
[opencl_priorities] 1   1   1   1
[opencl_synchronization_timeout] synchronization timout set to 0

2. What bothers me though are the timeouts and their defaults. In
practice, the darkroom works ok-ish, but the lighttable does not. When
a truckload full of small thumbnails (say, lighttable zoomed out to show
10 columns of images) needs to be regenerated for the lighttable, it
*appears* (not yet corroborated with measurements) that bumping up
timeouts considerably helps to avoid latencies, as though things were
deadlocking and waiting for the timer to break the lock. Might be an
internal issue with the synchronization though - how fine-grained is
the retry? Is it sleep-and-retry, or does it use some form of
semaphores and signalling at the system level between threads?
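Out of curiosity about that question: the two styles can be sketched like this in Python (purely illustrative; darktable is written in C, and this says nothing about its actual implementation):

```python
# Sleep-and-retry vs. condition-variable signalling, sketched in Python.
import threading
import time

ready = False
cond = threading.Condition()

def poll_wait(timeout, nap=0.005):
    """Sleep-and-retry: worst-case extra latency is one full nap interval."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with cond:
            if ready:
                return True
        time.sleep(nap)  # fixed-grid wakeups, regardless of when 'ready' flips
    return False

def signalled_wait(timeout):
    """Condition variable: the waiter wakes as soon as a producer notifies."""
    with cond:
        return cond.wait_for(lambda: ready, timeout=timeout)

def produce():
    """Flip the flag and wake all waiters immediately."""
    global ready
    with cond:
        ready = True
        cond.notify_all()
```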

I am running with these - possibly ridiculously high - timeout settings
(15 s). This is normally enough to process an entire export including a
few CPU segments (say, raw denoise - I need it on some high-ISO images,
ISO 6400+, to avoid black blotches or green stipples, but I have some
concerns about its quality altogether which don't belong in this thread).

opencl_mandatory_timeout=3000
pixelpipe_synchronization_timeout=3000

3. Would it be sensible to set one of these timeouts considerably higher
than the other?

4. Can we have -d perf log when timeouts occur that change the
scheduling decision (i.e. if a timeout causes a job to be dispatched to
a different device, logging both the original intent and the actual
dispatch target), and
4b. possibly a complete scheduler trace including all dispatch attempts?
That might help debugging in the long run.





[darktable-dev] OpenCL scheduling profiles

2017-04-08 Thread Ulrich Pegelow

Hi,

I added a bit more flexibility concerning OpenCL device scheduling into 
master. There is a new selection box in preferences (core options) that 
lets you choose among a few typical presets.


The main target is modern systems with very fast GPUs. By default and 
"traditionally" darktable distributes work between CPU and GPU in the 
darkroom: the GPU processes the center (full) view and the CPU is 
responsible for the preview (navigation) panel. Now that GPUs get faster 
and faster there are systems where the GPU so strongly outperforms the 
CPU that it makes more sense to process preview and full pixelpipe on 
the GPU sequentially.


For that reason the "OpenCL scheduling profile" parameter has three options:

* "default" describes the old behavior: work is split between GPU and 
CPU and works best for systems where CPU and GPU performance are on a 
similar level.


* "very fast GPU" tackles the case described above: in darkroom view 
both pixelpipes are sequentially processed by the GPU. This is meant for 
GPUs which strongly outperform the CPU on that system.


* "multiple GPUs" is meant for systems with more than one OpenCL device 
so that the full and the preview pixelpipe get processed by separate GPUs.


At first startup darktable tries to find the best-suited profile based 
on some benchmarking. You may change the profile at any time; it takes 
effect immediately.
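For illustration only, a hypothetical sketch of how benchmark timings might map onto a profile; the actual heuristic and threshold in darktable's code may differ:

```python
# Hypothetical mapping from benchmark results to a scheduling profile.
# The threshold and the exact logic are assumptions, not darktable's code.
def choose_profile(gpu_times, cpu_time, speedup_threshold=4.0):
    if not gpu_times:
        return "default"        # no usable OpenCL device at all
    if len(gpu_times) > 1:
        return "multiple GPUs"  # separate GPUs for full and preview pipe
    if cpu_time / min(gpu_times) >= speedup_threshold:
        return "very fast GPU"  # GPU strongly outperforms the CPU
    return "default"            # CPU and GPU on a similar level

# with the numbers Matthias reported (0.029428 s fastest GPU vs.
# 0.382860 s CPU, roughly a 13x speedup) this picks "very fast GPU":
print(choose_profile([0.029428], 0.382860))
```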


I am interested in your experience, both in terms of automatic detection 
of the best suited profile and in terms of overall performance. Please 
note that this is all about system latency and perceived system 
responsiveness in the darkroom view. Calling darktable with '-d perf' 
will only give you limited insights so you need to mostly rely on your 
own judgement.


Best wishes

Ulrich

