I still posit that it's possible to avoid many of those inefficiencies by using 
a sufficiently large buffer in libjpeg-turbo and using an in-memory 
source/destination manager. Much of the inefficiency in the code relates to the 
buffering that it does to avoid reading the entire image into memory.
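
To make the buffering point concrete, here is a toy model in C. It is not 
libjpeg's source manager API: the names toy_stats, consume_chunked, and 
consume_in_memory are made up for illustration. It only contrasts a consumer 
that refills a small staging buffer with one that is handed the whole stream.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bookkeeping for the toy model; not a libjpeg type. */
typedef struct {
    unsigned long refills;   /* times the consumer had to ask for more input */
    unsigned long checksum;  /* sum of all bytes, to show both paths see the same data */
} toy_stats;

/* Consume the stream through a small staging buffer, the way a file-backed
 * source would: every `chunk` bytes the consumer stops and refills. */
toy_stats consume_chunked(const unsigned char *stream, size_t size,
                          unsigned char *staging, size_t chunk)
{
    toy_stats st = {0, 0};
    size_t pos = 0;

    if (chunk == 0)
        return st;                          /* avoid looping forever */

    while (pos < size) {
        size_t n = (size - pos < chunk) ? size - pos : chunk;
        memcpy(staging, stream + pos, n);   /* stands in for a read from disk */
        st.refills++;
        for (size_t i = 0; i < n; i++)
            st.checksum += staging[i];
        pos += n;
    }
    return st;
}

/* Consume the stream in place: the whole image is already in memory, so the
 * consumer is handed one buffer and there is no intermediate copy. */
toy_stats consume_in_memory(const unsigned char *stream, size_t size)
{
    toy_stats st = {0, 0};

    if (size > 0)
        st.refills = 1;
    for (size_t i = 0; i < size; i++)
        st.checksum += stream[i];
    return st;
}
```

With a 10000-byte stream and a 4096-byte staging buffer, the chunked path 
refills three times and copies every byte once, while the in-memory path is 
handed the data a single time. How much that matters in a real decoder would 
have to be measured.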

I also hasten to point out that not all of the compute-intensive parts of the 
code are NEON-accelerated. The general speedup we're seeing in NEON vs non-NEON 
is about 1.5-2x rather than the 3-4x we see with x86-64. I'm not sure whether 
the ARM platform in question is 64-bit, but if it is, using 64-bit code should 
improve Huffman en/decoding performance significantly. It may also be the case 
that the hand-tuned code I wrote in the 
Huffman codec is making performance assumptions based on x86 that aren't true 
for ARM. It would be interesting to see what the speedup is with the 
unoptimized Huffman code out of libjpeg. At least on x86, Huffman can account 
for 40% of the compute time, so optimizing it further has a potentially big 
pay-off. However, I've personally spent hundreds of hours getting it where it 
is, and I have a gut feeling that further optimization of it would require 
dropping down to assembly.
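
As a rough way to size that pay-off, Amdahl's law can be applied to the 40% 
figure. The sketch below assumes the x86 share carries over to the platform 
being tuned, which has not been measured here; the function names are 
illustrative only.

```c
/* Overall speedup when a `fraction` of the run time (0..1) is made
 * `factor` times faster and the rest is left unchanged (Amdahl's law). */
double overall_speedup(double fraction, double factor)
{
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}

/* Upper bound on the overall speedup: the accelerated part takes no time. */
double max_speedup(double fraction)
{
    return 1.0 / (1.0 - fraction);
}
```

For example, overall_speedup(0.4, 2.0) is 1.25, so doubling Huffman speed 
would give about a 1.25x overall gain, and max_speedup(0.4) caps the gain at 
roughly 1.67x no matter how fast the Huffman codec becomes.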

On Jun 29, 2011, at 3:03 PM, Måns Rullgård <m...@mansr.com> wrote:

> Vladimir Pantelic <vlado...@gmail.com> writes:
> 
>> Mandeep Kumar wrote:
>>> Hi All,
>>> 
>>> I have done some benchmarking on OMAP4 running Ubuntu for various versions 
>>> of libjpeg. Benchmarks were collected with a modified version of djpeg that 
>>> prints out the time taken for decoding in ms. The sample used for 
>>> benchmarking is a 12MP image downloaded from a photography website. Here 
>>> are the results:
>> 
>> ...
>> 
>>> libjpeg-turbo trunk version that has NEON patches (5 runs). 
>>> http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/
>>>      Decoding Time for Run 1: 1068 ms
>>>      Decoding Time for Run 2: 1065 ms
>>>      Decoding Time for Run 3: 1093 ms
>>>      Decoding Time for Run 4: 1066 ms
>>>      Decoding Time for Run 5: 1067 ms
>>> Median Decoding Time: 1067 ms
>> 
>> One remark:
>> 
>> a 12MP image decoded in 1067ms equals ~12MP/s decoding speed.
>> 
>> decoding a 640x480 MJPEG file on a 1GHz OMAP4 using libavcodec
>> gives me an average decoding time per frame of ~10ms which yields:
>> 
>> 640x480/10ms = ~30MP/s
>> 
>> so roughly 2.5 times faster.
>> 
>> Either I am doing something wrong or this libjpeg-turbo is not so turbo.
> 
> Libjpeg (turbo or regular) is full of inefficiencies.  I guess they all
> add up.
> 
> -- 
> Måns Rullgård
> m...@mansr.com

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
