On Tuesday, April 19, 2011 6:19:27 PM UTC+2, rew wrote:
>
>
> On Tue, Apr 19, 2011 at 04:49:54AM -0700, Bart van Andel wrote:
>
> > always makes sense, it does not need a speed test. Tuning algorithms
> > (especially memory intensive ones, where not all data will fit in
> > RAM) for optimal performance and memory usage is a nice aim, but I
> > don't think that's what you propose to do yourself. At least your
> > proposal doesn't mention it.
>
> The "does not need a speed test" part is something I strongly disagree
> with.
>
> Way too many people optimize what they perceive as being a bottleneck
> and end up optimizing the wrong thing.
>
Sure, I'm not questioning that. You have to find out what can be optimized. 
Benchmarking may be a tool for that (I'm not saying it's useless), but a 
thorough understanding of the procedures and algorithms in use is probably 
even more important.

> Now, specifically for Hugin: what performance bottlenecks do we expect?
>
> == Finding control points. ==
>
> Finding control points is usually a two-step process. First,
> distinctive points are searched for. Next these are matched against
> the distinctive points in all the other images.
>
Recent updates to Hugin have already helped optimize this by allowing the 
user to specify other search strategies (e.g. only adjacent pairs A-B, B-C 
etc.), but I guess you already knew that.
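
Just to make the difference concrete, here is a rough sketch (nothing 
Hugin-specific; the project size is made up) of how the number of pair 
matching runs compares between the exhaustive strategy and adjacent-only 
matching:

from itertools import combinations

def all_pairs(n_images):
    # every possible image pair: n*(n-1)/2 matching runs
    return list(combinations(range(n_images), 2))

def adjacent_pairs(n_images):
    # only neighbouring images (A-B, B-C, ...): n-1 matching runs
    return [(i, i + 1) for i in range(n_images - 1)]

n = 30                                  # made-up project size
print(len(all_pairs(n)))                # 435 matching runs
print(len(adjacent_pairs(n)))           # 29 matching runs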

> - Memory bottleneck?
>
> This is a process that requires each image (or even a scaled-down
> version of the image) in memory to find distinctive points. After that
> there are just "a few" (in computer terms) points that need attention,
> so very little memory is required after the images are processed.
>
It requires every image to be processed to find feature points (described by 
a descriptor like SIFT, SURF, or what our own cpfind produces). This can be 
done for each image separately. The second step does the actual matching, 
keeping only points which are matched in at least one image pair. The old 
generatekeys/autopano combo did exactly this. More recent approaches combine 
the two steps into one program, but as far as I know, none of them require 
all the images to be present in memory at the same time.
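
To make the matching step concrete: under the usual assumptions (each 
keypoint carries a fixed-length descriptor vector, and two keypoints match 
when they are mutual nearest neighbours), matching one image pair looks 
roughly like the sketch below. This is only an illustration of the idea, not 
cpfind's actual test:

import numpy as np

def match_descriptors(desc_a, desc_b):
    # desc_a: (n, d) array, desc_b: (m, d) array of keypoint descriptors
    # pairwise squared distances between all descriptors of the two images
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(axis=2)
    best_ab = d2.argmin(axis=1)     # nearest neighbour in b for every a
    best_ba = d2.argmin(axis=0)     # nearest neighbour in a for every b
    # keep only mutual nearest neighbours as candidate control points
    return [(i, j) for i, j in enumerate(best_ab) if best_ba[j] == i]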

Note that total memory used != (size of program) + (size of image) + 
(keypoint data structure). While an image is being processed, multiple copies 
of it may be required at different resolutions (an "image pyramid"), each run 
through a few different image processing operators (e.g. derivative, DoG), 
depending on the feature point extraction algorithm.
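
A back-of-the-envelope sketch of what that can mean in numbers (my 
assumptions, not anything measured from cpfind: 4 bytes per pixel and three 
work buffers per pyramid level):

def pyramid_bytes(width, height, bytes_per_pixel=4, buffers_per_level=3):
    # sum the memory for every pyramid level plus its per-level work buffers
    total = 0
    while width >= 1 and height >= 1:
        total += width * height * bytes_per_pixel * buffers_per_level
        width //= 2
        height //= 2
    return total

print(pyramid_bytes(3872, 2592) / 2**20, "MiB")   # ~10 Mpixel image: roughly 150 MiB

So under these (made-up) assumptions a single 10 Mpixel image already turns 
into well over a hundred MB of working set during extraction.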

> - CPU bottleneck?
>
> Possible.
>
> The old "finding matches" step was horribly slow for larger
> stitches. A factor of 10 performance increase can be achieved by
> searching for matches between adjacent images first, then optimizing
> the layout, and then rechecking the matches between images that are now
> seen to (almost) overlap.
>
As I said before, we already have that. It may be a candidate for even more 
optimization, but I think it does quite a nice job already, at least when 
some information is available, e.g. the images are pre-aligned, or the user 
knows that the images were taken in a certain pattern, e.g. left-to-right.
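
The re-check step rew describes boils down to something like the sketch 
below, assuming we have rough panorama-space footprints for each image after 
the first layout pass. The bounding-box test and the function name are my 
own, not what cpfind actually does:

def overlapping_pairs(footprints):
    # footprints: list of (xmin, ymin, xmax, ymax) boxes in panorama space;
    # grow the boxes by a margin beforehand to also catch "almost" overlaps
    pairs = []
    for i in range(len(footprints)):
        for j in range(i + 1, len(footprints)):
            a, b = footprints[i], footprints[j]
            if a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]:
                pairs.append((i, j))
    return pairs

# after the first layout pass, re-run the matcher only on these pairs
print(overlapping_pairs([(0, 0, 10, 10), (8, 0, 18, 10), (30, 0, 40, 10)]))   # [(0, 1)]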

> == optimizing the layout ==
>
> This is currently implemented as a multi-dimensional upgrade of
> Newton-Raphson (it has a name, which I've forgotten). This is very
> efficient at converging on a solution. 
>
> Performing benchmarks should show that this is not worth the trouble
> of optimizing: it already takes little memory and little CPU time.
>
> As the algorithm is already much better than simple strategies like
> hill-climbing, there is probably little to be gained by optimizing
> this step.
>
I can't comment on this one, I wouldn't know.
 

> == displaying intermediate results ==
>
> For a user to see what the result is going to look like and if the
> results are usable, a quick preview of the resulting panorama is
> necessary.
>
> - memory 
>
> In theory numimages * numpixels_per_image of pixels are involved. This
> can be a large number.
>
More precisely: the sum of the pixel counts of all images. Hugin does not 
require all images to have equal dimensions.

> In practice the display is only on the order of 2 Mpixels. This is a
> very small number for a computer. A modern computer can easily handle
> 120Mpixels per second. (60 fps at 2Mpixel).
>
Just curious: what is this figure based on? It all depends on what you mean 
by "handle". Simply outputting 120 Mpixels/s (at 32 bpp, this would be 480 
MB/s) directly to the screen is a bit different from doing computations for 
each pixel and *then* outputting it to the screen. Of course modern GPUs are 
way faster at "handling" most pixel-related operations than CPUs, hence the 
improved usability of the fast preview compared to the old preview.
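
For reference, the raw numbers behind that claim (my arithmetic, nothing 
measured), counting only the bytes that have to be pushed to the screen and 
ignoring any per-pixel remapping or colour work:

preview_pixels = 2_000_000   # ~2 Mpixel preview window
fps = 60
bytes_per_pixel = 4          # 32 bpp RGBA

pixels_per_second = preview_pixels * fps
print(pixels_per_second / 1e6, "Mpixels/s")               # 120.0
print(pixels_per_second * bytes_per_pixel / 1e6, "MB/s")  # 480.0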

> If each of my images is going to show as a 100x100 pixel image
> (10kpixels), it is wasteful to process the 10Mpixel camera image each
> time.
>
> In practice I see huge amounts of memory being used. In theory 
>
> - CPU 
>
> In practice hugin is slow in this step. Benchmarks are needed to see
> where the time goes.
>
I found that as long as you keep the fast preview window open, it's usually 
pretty fast. Then again, my projects are only of moderate size (30 images of 
5 Mpixels each is already a lot for me), so I'm not sure if this makes a 
difference.
 

> I suspect the JPEG decoding, but... Benchmarks needed to
> verify. Thought needs to be put into these things. If the jpeg
> decoding ends up being the slow part, we shouldn't simply try to
> optimize the jpeg routines, but we should also consider not doing the
> jpg decoding over and over again. That's going to pay off much more!
>
You mean caching decoded images on disk? 
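
Something along these lines, perhaps; this is purely a sketch of the 
decode-once-and-keep-a-small-copy idea (using Pillow for brevity, with a 
made-up file name), not how Hugin's fast preview actually manages its images:

from functools import lru_cache
from PIL import Image        # assumes Pillow is available

@lru_cache(maxsize=64)
def preview_image(path, max_edge=256):
    # decode the full-size JPEG once, keep only a small downscaled copy around
    img = Image.open(path)
    img.thumbnail((max_edge, max_edge))   # in-place downscale, keeps aspect ratio
    return img.convert("RGB")

# first call decodes the 10 Mpixel file; later calls for the same path are free
# preview_image("IMG_0001.JPG")           # made-up file name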

> But I'm getting ahead of where I think the benchmarks will lead us.
> I could be wrong. Facts from benchmarks overrule my guesses. 
>
> == remapping ==
>
> Nona. 
>
> - Memory 
>
> Input is 10 Mpixels. Output is 5-20 Mpixels. It should be possible to
> do this with just around (10+20)*4 = 120 Mbytes. In reality I suspect much 
> bigger memory use. Benchmark needed.
>
That's only true if you keep 8 bits per channel *while processing*. Of 
course you need more bits of precision during processing if you want EV 
compensation, vignetting correction and the like to produce nice output.
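
Rough numbers to illustrate, using rew's sizes (10 Mpixel input, 20 Mpixel 
output) and my own assumption of 4 channels per pixel: the 120 MB figure 
only holds at 8 bits per channel, and 32-bit float roughly quadruples it.

input_mpix, output_mpix = 10, 20                  # rew's example sizes

def buffer_mb(mpix, channels=4, bytes_per_channel=1):
    # Mpixels * bytes per pixel gives MB directly
    return mpix * channels * bytes_per_channel

print(buffer_mb(input_mpix) + buffer_mb(output_mpix))      # 120 MB at 8 bit
print(buffer_mb(input_mpix, bytes_per_channel=4)
      + buffer_mb(output_mpix, bytes_per_channel=4))       # 480 MB at 32-bit float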
 

> - CPU
>
> Input and output refer to around 30 Mpixels total. A throughput of
> 120 Mpixels per second was achievable, we said before. I think there is
> room for improvement. (I suspect we see less than 10% of that number.)
> Benchmark needed.
>
OK, agreed.
 

> == blending ==
>
> Blending works on the complete output image, which can be several
> gigapixels. On the other hand it is quite a local operation.
>
> - memory
>
> Enblend works by loading the images in sequence. Thus: enblend needs
> to keep in memory the part of the image that overlaps with the newly
> loaded image and the newly loaded image.
>
> Enblend seems to take MUCH more than this amount of memory. 
> Benchmarks needed.
>
It's a little more complicated than that. Provided you're using a 
(temporary) file format on disk where you can directly access pixels without 
actually reading the whole file, you can indeed load just the next image and 
the part of the output image it overlaps. Swapping out to/from disk for each 
image, however, is far from optimal in terms of speed; the more RAM is 
available, the more data should be kept in RAM. And because Enblend uses 
multi-resolution blending, more than one copy of the same region (at 
different resolutions) has to be present in memory.
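
To illustrate why copies at several resolutions are needed at all, here is 
the textbook multi-resolution blend on 1-D signals, which is only a toy and 
certainly not Enblend's implementation:

import numpy as np

def downsample(x):
    return x[::2]

def upsample(x, n):
    # linear interpolation back up to length n
    return np.interp(np.arange(n), np.arange(len(x)) * 2, x)

def laplacian_pyramid(x, levels):
    pyr = []
    for _ in range(levels):
        small = downsample(x)
        pyr.append(x - upsample(small, len(x)))   # detail kept at this resolution
        x = small
    pyr.append(x)                                 # coarse residual
    return pyr

def blend(a, b, mask, levels=3):
    # blend each detail level with a progressively downsampled mask,
    # then collapse the pyramid back into a single signal
    pa, pb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    out, m = [], mask
    for la, lb in zip(pa, pb):
        out.append(m * la + (1.0 - m) * lb)
        m = downsample(m)
    x = out[-1]
    for detail in reversed(out[:-1]):
        x = upsample(x, len(detail)) + detail
    return x

Note how a, b and the mask each exist at levels+1 resolutions at the same 
time; the 2-D equivalent of that is where the extra memory goes.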
 

> - CPU
>
> Multiple passes are needed over the overlapping pixels. Thus it is
> difficult for me to predict what a reasonable amount of time for this is.
> Benchmarks needed. 
>
>
> I think real-time performance of the interactive hugin is
> possible. (i.e. when you add a control point, the preview window shifts
> within 50 ms after you added it!) It should be possible
> to achieve near disk-throughput speeds for the final high quality
> blend, but benchmarks are needed to identify the bottlenecks before
> sensible optimizations can be done.
>
I think you underestimate the complexity of the underlying algorithms a bit.
 

> Above I've "violated" my own rules by guessing where optimizations can
> be done and suggesting optimizations without doing the
> benchmarks. Everybody thinks he can do this better than many
> others. But please do run benchmarks to check my (and your) guesses
> before you invest time to optimize hugin. I really don't mind being
> proven wrong by a benchmark. It means a useless optimization was
> prevented by running a simple benchmark.... 
>
Discussion is always welcome, right? Why else would this be called a 
discussion group? :)

 
By the way, the topic of this discussion (currently reading "Hugin Speed 
test tab *for automated hardware evaluation*") may be incorrect, but if it 
is correct, we are actually talking off-topic here. Hardware evaluation? 
Shouldn't it be software evaluation? Or is the outcome going to be a guide 
on what hardware to buy to best run Hugin on...

--
Bart
