Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-08 Thread Donald Bruce Stewart
andrewcoppin:
> Andrew Coppin wrote:
> >Dave Bayer wrote:
> >>I was beginning to accept that I might die before clearing my 
> >>pipeline of research projects I want to code up. Haskell has given me 
> >>new hope.
> >
> >Indeed. ;-)
> >
> >Today I have implemented encoders and decoders for RLE, MTF, Fibonacci 
> >codes, and LZW. Next on my list are BWT, Huffman codes and arithmetic 
> >coding. (That last is *very* hard though...)
> 
> In case anybody cares...
> 
>   darcs get http://www.orphi.me.uk/darcs/ToyCompression
>   ghc -O2 --make Encode
>   ghc -O2 --make Decode
> 
> You can now do
> 
>  Encode LZW foo
> 
> This will look for a file named "foo" and generate a file called "foo-LZW" 
> containing the LZW-compressed version of it. Also "foo-LZW.log.txt", 
> which contains some statistics.
> 
> Similarly,
> 
>  Decode LZW foo-LZW
> 
> will make a file "foo-LZW-unLZW", which should *hopefully* be identical 
> to the original one. (!)
> 
> It didn't occur to me, but I should probably go add some way to discover 
> a list of available algorithms! LOL. Anyway, currently working:
> 
>   RLE (Run Length Encoding). Counts runs of symbols. Considers 8 bits 
> to be one "symbol". The maximum run length is 255.
>   MTF (Move To Front). Takes an input file and produces an output file 
> of exactly the same size, but containing mostly *low* numbers. Works 
> well with RLE or Fib. (A minimal sketch of this transform appears just 
> after this list.)
>   BWT (Burrows-Wheeler Transform). Like MTF, does no actual 
> compression, but makes the file more "compressible".
>   Fib (Fibonacci codes). Stores numbers using Fibonacci codes instead 
> of unsigned binary. This makes low numbers smaller and high numbers 
> bigger. (Use it with MTF...)
>   LZW (Lempel-Ziv-Welch). The daddy. Looks for repeated input strings 
> and compresses them.
> 
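
A minimal sketch of the MTF transform described in the list above might
look like this in Haskell (illustrative only - this is not the code in the
ToyCompression repository, and mtfEncode/mtfDecode are made-up names):

  import Data.List (delete, elemIndex)
  import Data.Word (Word8)

  -- Replace each byte by its position in a "recency" list of all 256 byte
  -- values, then move that byte to the front.  Recently seen bytes come
  -- out as small numbers, which is what RLE and Fibonacci codes like.
  mtfEncode :: [Word8] -> [Word8]
  mtfEncode = go [0 .. 255]
    where
      go _     []     = []
      go table (x:xs) =
        case elemIndex x table of
          Just i  -> fromIntegral i : go (x : delete x table) xs
          Nothing -> error "impossible: the table holds every byte value"

  -- The decoder mirrors the encoder: index into the table, then move the
  -- recovered byte to the front.
  mtfDecode :: [Word8] -> [Word8]
  mtfDecode = go [0 .. 255]
    where
      go _     []     = []
      go table (i:is) =
        let x = table !! fromIntegral i
        in  x : go (x : delete x table) is
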
> Caution: Encoding or decoding BWT currently requires absurd amounts of 
> time and space. (Like, > 2 GB RAM and 8 minutes wall time to process 12 
> KB of text.) I hope to fix that soon...
> 
> Currently it's unclear whether LZW or BWT+MTF+Fib gives the best 
> compression. (Mainly because BWT is so hard to run!) I hope to implement 
> a few other algorithms soon.
> 
> (Note: LZW is typically used with a variable number of bits per output 
> symbol. I haven't done this yet. It simply uses 16 bits in all cases. 
> Once I fix this, compression will go up. Also, once the symbol 
> dictionary is full, the encoder just resets itself.)
> 

Good work. Probably worth benchmarking against the other compression
libraries on hackage, just to get a sense for how well your
implementations are optimised (and maybe how to best interact with lazy
bytestrings?). See for example:

zlib   http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.3
bzlib  http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.3

and also:
lzf
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Codec-Compression-LZF-0.2

-- Don


Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-08 Thread Andrew Coppin

Bulat Ziganshin wrote:

Hello Donald,

Good work. Probably worth benchmarking against the other compression
libraries

do you really want to totally discredit Haskell? :)  they should be
hundreds of times slower than any practical compression algorithm
(and btw, zlib/bzlib aren't good performers anyway; my own algorithm is
several times faster with the same compression ratio)

Haskell isn't a good tool for developing compression algorithms, because
it's a very well-studied area where it makes sense to use all sorts of
optimizations. so it falls into the "low-level algorithms" category, where
using Haskell means at least 3x slower development and 3x worse
performance - or faster development with 100x worse performance. Andrew's
code should fall into the latter category


Indeed. I'm more interested in which algorithms produce the best 
compression ratios than how to implement them fast. Currently the code 
uses very general, flexible, polymorphic types all over the place, 
everything is normal lists, I'm using monadic parsing rather than 
bit-twiddling, etc etc etc. It goes alright with 100 KB files on my 
monster PC at home, but toss a 4 MB file over there and it takes a 
minute or two to compress. Hardly a match for a "practical" compression 
solution that's been optimised to within inches of its life in C or even 
assembly. ;-)


I mentioned that my LZW algorithm isn't even as efficient as it could be 
- it uses 16 bits per symbol rather than being variable. Partly that's 
because it's easier to code. But partly that's so that I can write a 
16-bit Huffman compressor and run it over the top. (LZW + Huffman being 
a very common combination.) And that's really what this is - a toolbox 
for throwing algorithms together to see what they do.


OTOH, if the Gods behind GHC want to take a look and figure out whether 
there's any magic that can be added to the optimiser, I wouldn't 
complain... ;-)


(Realistically though. My program takes a [Word8] and turns it into a 
[Bool] before running a parser over it. The GHC optimiser doesn't really 
stand a hope in hell of optimising that into a program that reads a 
machine word into a CPU register and starts playing with bit flips on it...)
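
The [Word8]-to-[Bool] step described here is presumably something along
these lines (a sketch with invented names, not Andrew's actual code):

  import Data.Bits (testBit)
  import Data.Word (Word8)

  -- Explode each byte into its eight bits, most significant bit first.
  toBits :: [Word8] -> [Bool]
  toBits = concatMap (\w -> [testBit w i | i <- [7, 6 .. 0]])

  -- Pack bits back into bytes (assumes the bit count is a multiple of 8).
  fromBits :: [Bool] -> [Word8]
  fromBits [] = []
  fromBits bs = foldl (\acc b -> acc * 2 + (if b then 1 else 0)) 0 byte
                  : fromBits rest
    where (byte, rest) = splitAt 8 bs

Every byte becomes eight separately boxed list cells, which is a big part
of why the naive representation is so slow.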


PS. Are those zlib libraries actually written in Haskell? Or are they a 
thin layer on top of a C library?


PPS. Does GHC make use of MMX, SSE, et al?



Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-08 Thread Stefan O'Rear
On Sun, Jul 08, 2007 at 12:10:04PM +0100, Andrew Coppin wrote:
> (Realistically though. My program takes a [Word8] and turns it into a 
> [Bool] before running a parser over it. The GHC optimiser doesn't really 
> stand a hope in hell of optimising that into a program that reads a machine 
> word into a CPU register and starts playing with bit flips on it...)

Actually, if you're very lucky (fusion is just as hard in Haskell as it
is in real life), it *does*.  It seems to fit nicely into the
stream-fusion framework.

> PS. Are those zlib libraries actually written in Haskell? Or are they a 
> thin layer on top of a C library?

Yup, they wrap C's zlib.

> PPS. Does GHC make use of MMX, SSE, et al?

No (in spirit - the native code generator uses 1-element SSE operations
for floating point because it's easier to optimize than "FPU" code).

Stefan


Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-08 Thread Andrew Coppin

Bulat Ziganshin wrote:

Hello Andrew,

Sunday, July 8, 2007, 3:10:04 PM, you wrote:

Indeed. I'm more interested in which algorithms produce the best
compression ratios than how to implement them fast.

the algorithms you have tried so far (except for bwt+mtf) are the
standard student's set :)  of those, lzw is the most advanced, but even
this algorithm is 20 years old


Yeah, well, naturally I'm implementing the simple ones first. ;-)

Actually, LZW works surprisingly well for such a trivial little 
algorithm... When you compare it to the complexity of an arithmetic 
coder driven by a high-order adaptive PPM model (theoretically the best 
general-purpose algorithm that exists), it's amazing that anything so 
simple as LZW should work at all!



Currently the code
uses very general, flexible, polymorphic types all over the place, 
everything is normal lists, I'm using monadic parsing rather than 
bit-twiddling, etc etc etc. It goes alright with 100 KB files on my



testing today's algorithms on such small datasets doesn't make much
sense, because they need much more input to gather enough statistics
about the data being compressed. i prefer hundreds-of-megabytes tests
and don't know anyone with test files smaller than 1 mb


OTOH, I'd like it to finish this month...

(LZW is faster now I'm using Data.Map. But even so, 15 minutes to 
compress 640 KB of zeros? That's pretty slow!)
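
For reference, the core of an LZW encoder over a Data.Map dictionary fits
in a few lines - the following is only a sketch of the general technique,
not the code from ToyCompression (lzwEncode is an invented name):

  import qualified Data.Map as M
  import Data.Word (Word8)

  -- The dictionary maps byte strings to integer codes.  Codes are
  -- returned as plain Ints; a real encoder would then serialise them at
  -- some fixed or growing bit width.
  lzwEncode :: [Word8] -> [Int]
  lzwEncode []    = []
  lzwEncode input = go initialDict (M.size initialDict) input
    where
      initialDict = M.fromList [([b], fromIntegral b) | b <- [0 .. 255]]

      go _    _    []     = []
      go dict next (x:xs) = emit [x] xs
        where
          -- Extend the current phrase while it is still in the dictionary.
          emit phrase (y:ys)
            | M.member (phrase ++ [y]) dict = emit (phrase ++ [y]) ys
          emit phrase rest =
            let code = dict M.! phrase
            in case rest of
                 []    -> [code]
                 (y:_) -> code : go (M.insert (phrase ++ [y]) next dict)
                                    (next + 1)
                                    rest

Even this version does plenty of list appending and uses byte lists as map
keys, so it is a long way from fast - but then the point of the exercise
is compression ratios, not speed.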



btw, in most cases, a C algorithm with today's C compilers will run
faster than an asm one - because you need very good knowledge of
CPU internals to write an efficient program. i dream about the day when
the same will become true for Haskell


We all do... *sigh*

(Or rather, most of us dream. A tiny few people seem to be trying to 
actually do stuff about it...)



i've never heard of lzw+huf compressors :)  there are two lz families
- lz78 (lzw), which was popular in the 80's and used variable-size words,
and lz77 (lzss), which together with huffman was the most popular in the
90's. compress (.Z) and pkzip 1 were lzw, while zip is lzh (i.e.
lz77+huffman)

there is not much point in using huffman with lzw, because many words
will have only one or two occurrences


Hmm... that's interesting. I was *sure* it said that LHA used 
LZW+Huffman... oh well!


You say that there would be "no meaning" to using a Huffman code to 
encode the output from an LZW stage. But consider this: The standard 
approach is for the LZW encoder to spit out 9 bits/symbol initially, and 
then 10 bits/symbol once the table fills up a bit more, and then 11 
bits/symbol after that, and so on. (Usually until some hard-coded limit 
is reached. Typically the encoder just resets itself at that point or 
something.) So we're allocating more and more bits per symbol, but this 
is based only on how full the table is, *regardless* of whether any of 
these new symbols are even in fact *used*. ;-) Now, by using a Huffman 
code (scanning the actual produced output rather than working 
adaptively) we can avoid assigning bit combinations to unused symbols.
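
The contrast being drawn can be made concrete with two small helpers
(hypothetical names, written only to illustrate the argument): the
conventional width rule looks at nothing but how many dictionary entries
exist so far, while a Huffman pass would look at which codes actually
occur in the output, and how often.

  import qualified Data.Map as M

  -- Conventional growing-width LZW output: enough bits to address every
  -- dictionary entry allocated so far (dictSize >= 256), whether or not
  -- the entry is ever emitted.
  codeWidth :: Int -> Int
  codeWidth dictSize =
    max 9 (ceiling (logBase 2 (fromIntegral dictSize :: Double)))

  -- What a post-hoc Huffman pass would actually look at: how often each
  -- code really appears in the encoder's output.
  codeFrequencies :: [Int] -> M.Map Int Int
  codeFrequencies = M.fromListWith (+) . map (\code -> (code, 1))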


(The downside of course is that now we need a Huffman table in the 
output - and any rare symbols might end up with rather long codewords. 
But remember: Huffman *guarantees* 0% compression or higher on the 
payload itself. A Huffman-compressed payload can *never* be bigger, only 
smaller or same size. So long as you can encode the Huffman table 
efficiently, you should be fine...)



btw, i invite you to http://encode.ru/forums where you will find
many compressor authors. you will be the second person there who uses
haskell, and the first who writes compression algorithms in it :)


.ru = Russia?


OTOH, if the Gods behind GHC want to take a look and figure out whether
there's any magic that can be added to the optimiser, I wouldn't 
complain... ;-)

(Realistically though. My program takes a [Word8] and turns it into a 
[Bool] before running a parser over it. The GHC optimiser doesn't really
stand a hope in hell of optimising that into a program that reads a 
machine word into a CPU register and starts playing with bit flips on it...)



as time goes by, those terrible compilers become smarter and smarter :)


Oh hey, I think GHC is already pretty smart. But no optimiser can ever 
hope to cover *every* possible case. And transforming [Bool] -> [Bool] 
into UArray Word8 -> UArray Word8 just seems a little bit 
optimistic, methinks. ;-)

PPS. Does GHC make use of MMX, SSE, et al?



it can do that with via-C compilation, and there are plans for the native
code generator too.


The way I heard it is that it *does* native code generation, but it's 
"not as good" as the via-C version. (Can't imagine why... You would have 
thought directly implementing functional primitives would be much easier 
than trying to mangle them to fit C's multitude of limitations. Still, 
what do I know?)



but it seems that you don't know asm and believe the
advertising hype. i'm pretty sure that mmx can't help in the algorithms
you are implementing

Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-08 Thread Andrew Coppin

Stefan O'Rear wrote:

On Sun, Jul 08, 2007 at 12:10:04PM +0100, Andrew Coppin wrote:
  
(Realistically though. My program takes a [Word8] and turns it into a 
[Bool] before running a parser over it. The GHC optimiser doesn't really 
stand a hope in hell of optimising that into a program that reads a machine 
word into a CPU register and starts playing with bit flips on it...)



Actually, if you're very lucky (fusion is just as hard in Haskell as it
is in real life), it *does*.  It seems to fit nicely into the
stream-fusion framework.
  


Ooo... really? That's pretty impressive...(!)

Is there a way I can check? ;-) More usefully, can I do stuff to my code 
to make myself more "lucky"?


(Love the comment about RL BTW!)

PS. Are those zlib libraries actually written in Haskell? Or are they a 
thin layer on top of a C library?



Yup, they wrap C's zlib.
  


Thought so. Comparing native Haskell to a heavily optimised C library 
would surely be just like comparing native Haskell to a compiled C binary...



PPS. Does GHC make use of MMX, SSE, et al?



No (in spirit - the native code generator uses 1-element SSE operations
for floating point because it's easier to optimize than "FPU" code).
  


Does GHC actually do anything that could usefully use these primitives? 
(I'm guessing "no"...)




Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-08 Thread Stefan O'Rear
On Sun, Jul 08, 2007 at 04:07:59PM +0100, Andrew Coppin wrote:
> The way I heard it is that it *does* native code generation, but it's "not 
> as good" as the via-C version. (Can't imagine why... You would have thought 
> directly implementing functional primitives would be much easier than 
> trying to mangle them to fit C's multitude of limitations. Still, what do I 
> know?)

This is very true - but -fvia-C isn't really C.  First it generates GCC
C (with global register variables, among other things), then it feeds
the output through gcc - and then it runs a GHC-specific Perl script
called the Evil Mangler over the assembly output, which promptly regexes
your code to death.

If you want to see what a difference exists between true C and the NCG,
try compiling your program with -fvia-C -unreg.  -unreg tells GHC to
generate (something very close to) ANSI C; also, the mangler is not
used.  -unreg is mainly used for porting...

Stefan


Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-08 Thread Andrew Coppin

Bulat Ziganshin wrote:

Hello Andrew,

Ooo... really? That's pretty impressive...(!)

it's our collective tale for bringing in new haskellers. i bet that
Stefan has never seen the asm code generated by ghc :)


LOL!

Once - just once - I did take a look at the C output from GHC. Now, I 
realise that I can't code in C to save my life, but... it didn't even 
*look* like C. Even slightly. Wow.


(OTOH, reading Core isn't too helpful either. It just looks like the 
original Haskell...)




Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-08 Thread Stefan O'Rear
On Sun, Jul 08, 2007 at 10:40:10PM +0400, Bulat Ziganshin wrote:
> Hello Andrew,
> 
> Sunday, July 8, 2007, 7:12:38 PM, you wrote:
> >>> (Realistically though. My program takes a [Word8] and turns it into a
> >>> [Bool] before running a parser over it. The GHC optimiser doesn't really 
> >>> stand a hope in hell of optimising that into a program that reads a 
> >>> machine 
> >>> word into a CPU register and starts playing with bit flips on it...)
> >>> 
> >>
> >> Actually, if you're very lucky (fusion is just as hard in Haskell as it
> >> is in real life), it *does*.  It seems to fit nicely into the
> >> stream-fusion framework.
> >>   
> 
> > Ooo... really? That's pretty impressive...(!)
> 
> it's our collective tale for bringing new haskellers. i bet that
> Stefan never seen asm code generated by ghc :)

If you check the list archives, you'll see that I've been a major
contributor to quite a few threads on the GHC code generator, and I
posted to the JHC list months ago.

Also, I said it would be read into a register; I never said it wouldn't
be spilled two instructions later ;)

Stefan (If I'm taking this totally wrong, make sure I know)


Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-09 Thread Andrew Coppin

Bulat Ziganshin wrote:

Hello Andrew,

Sunday, July 8, 2007, 7:07:59 PM, you wrote:

i don't think that ppm is so complex - it's just the probability of a
symbol in some context. it's just too slow in a naive implementation


Oh, sure, the *idea* is simple enough. Trying to actually *implement* it 
correctly is something else... ;-)


(The same statements go for arithmetic coding, really.)

(The downside of course is that now we need a Huffman table in the 
output - and any rare symbols might end up with rather long codewords.
But remember: Huffman *guarantees* 0% compression or higher on the 
payload itself. A Huffman-compressed payload can *never* be bigger, only
smaller or same size. So long as you can encode the Huffman table 
efficiently, you should be fine...)



the devil is in the details. just imagine the size of a huffman table
with 64k entries :)  huffman encoding is inappropriate for lzw output
simply because most words will have only a few occurrences, and the
saving from encoding them optimally doesn't justify the cost of their
entries in the table


...which is why you need to "encode the Huffman table efficiently", to 
quote myself. ;-)


Using canonical Huffman, you only actually need to know how many bits 
were assigned to each symbol. This information is probably very 
amenable to RLE. (Which, incidentally, is why I started this whole 
"parser on top of a phaser" craziness in the first place.) So, yeah, 
there may be 64k symbols - but if only 1k of them are ever *used*... ;-)
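
As a sketch of what "only the lengths" buys you (illustrative code, not
part of the toolbox; canonicalCodes is an invented name): given each
symbol's code length, the canonical codes can be rebuilt
deterministically, so the lengths really are the whole table.

  import Data.Bits (shiftL)
  import Data.List (sortOn)

  -- Rebuild canonical Huffman codes from (symbol, bit length) pairs.
  -- Symbols are ordered by (length, symbol); each successive code is the
  -- previous one plus one, shifted left whenever the code length grows.
  canonicalCodes :: [(Int, Int)]          -- (symbol, bit length), length > 0
                 -> [(Int, (Int, Int))]   -- (symbol, (code, bit length))
  canonicalCodes lens = zipWith assign codes ordered
    where
      ordered = sortOn (\(sym, len) -> (len, sym)) lens
      lengths = map snd ordered
      codes   = scanl step 0 (zip lengths (tail lengths))
      step c (len, len')  = (c + 1) `shiftL` (len' - len)
      assign c (sym, len) = (sym, (c, len))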



.ru = Russia?



of course
  


My Russian is very rusty. ;-)


Oh hey, I think GHC is already pretty smart. But no optimiser can ever
hope to cover *every* possible case. And transforming [Bool] -> [Bool]
into UArray Word8 -> UArray Word8 just seems a little bit
optimistic, methinks. ;-)



15 years ago i wrote a very smart asm program (btw, it was an ARJ
unpacker) with handmade function inlining, loop unrolling, register
allocation, cpu recognition and so on. now, most of these tricks are
standard for C compilers. times change, and it's hard to imagine which
optimizations will be available 10 years from now


Yes, but there are limits to what an optimiser can hope to accomplish.

For example, you wouldn't implement a bubble sort and seriously expect 
the compiler to be able to "optimise" that into a merge sort, would you? ;-)



ghc's native and via-C modes are a choice between blind and lame. in
native mode, its code generator is comparable with 20-year-old C code
generators. see above how much modern C compilers have changed in those
years. in via-C mode it generates unnatural C code which is hard for any
C compiler to optimize.


I'll take your word for it. ;-)

(I have made cursory attempts to comprehend the inner workings of GHC - 
but this is apparently drastically beyond my powers of comprehension.)



jhc is a very different story


Yes - last I heard, it's an experimental research project rather than a 
production-ready compiler...




Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-09 Thread Stefan O'Rear
On Mon, Jul 09, 2007 at 09:15:07PM +0100, Andrew Coppin wrote:
> Bulat Ziganshin wrote:
>> Hello Andrew,
>>
>> Sunday, July 8, 2007, 7:07:59 PM, you wrote:
>>
>> i don't think that ppm is so complex - it's just probability of
>> symbol in some context. it's just too slow in naive implementation
>>
>>   
>
> Oh, sure, the *idea* is simple enough. Trying to actually *implement* it 
> correctly is something else... ;-)

Took me about an hour and 50 lines of code (about a year ago - this was
one of my first Haskell programs) to implement a PPM (de)compressor that
didn't crash, always generated the same output as input, and achieved
50% ratios on its own source code (not quite as good as gzip, but what
do you expect from a completely untuned compressor?).

Peak throughput: 2 bits / sec. :)

>> the jhc is very different story
>
> Yes - last I heard, it's an experimental research project rather than a 
> production-ready compiler...

Correct.  It requires 5 minutes and 600MB of RAM to compile Hello,
World, and fails with internal pattern match errors on anything
significantly larger.

Stefan




Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-09 Thread Andrew Coppin

Stefan O'Rear wrote:

On Mon, Jul 09, 2007 at 09:15:07PM +0100, Andrew Coppin wrote:
  
Oh, sure, the *idea* is simple enough. Trying to actually *implement* it 
correctly is something else... ;-)



Took me about an hour and 50 lines of code (about a year ago - this was
one of my first Haskell programs) to implement a PPM (de)compressor that
didn't crash, always generated the same output as input, and achieved
50% ratios on its own source code (not quite as good as gzip, but what
do you expect from a completely untuned compressor?).
  


...aren't you one of those insanely intelligent people working on GHC? :-P

Actually, I wrote a working compressor in JavaScript. (You know, as an 
interactive web page. No, I don't still have it...) At least, I presume 
it works... I never had occasion to try to decompress the output! (I 
remember it had no underflow handling though...)


Now, writing one that works in binary rather than decimal, and isn't 
absurdly inefficient? (I.e., it *doesn't* work by converting everything 
to decimal strings, jiggling the characters around, and then parsing 
back to machine integers.) That is something I haven't managed yet...



Peak throughput: 2 bits / sec. :)
  


How fast can you click buttons? That was my peak output. ;-)


the jhc is very different story
  
Yes - last I heard, it's an experimental research project rather than a 
production-ready compiler...



Correct.  It requires 5 minutes and 600MB of RAM to compile Hello,
World, and fails with internal pattern match errors on anything
significantly larger.
  


Ouch. That's pretty advanced... :-D



Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-09 Thread Jonathan Cast
On Monday 09 July 2007, Andrew Coppin wrote:

> Yes, but there are limits to what an optimiser can hope to accomplish.
>
> For example, you wouldn't implement a bubble sort and seriously expect
> the compiler to be able to "optimise" that into a merge sort, would you?
> ;-)

Something like that has been done; compilers can take in a quadratic-time
list reverse and substitute a linear-time version:

http://homepages.inf.ed.ac.uk/wadler/papers/vanish/vanish.ps.gz

(pg. 2-3).  One of the better all-time papers by the major Haskell 
researchers, I'd say.
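
For anyone who hasn't seen the transformation, the before/after pair looks
roughly like this (a sketch with made-up names, not the paper's own
notation):

  -- The quadratic "textbook" reverse: each (++) re-walks the result built
  -- so far, so the whole thing is O(n^2).
  slowReverse :: [a] -> [a]
  slowReverse []     = []
  slowReverse (x:xs) = slowReverse xs ++ [x]

  -- The linear version the transformation produces: thread an accumulator
  -- so every element is consed exactly once.
  fastReverse :: [a] -> [a]
  fastReverse = go []
    where
      go acc []     = acc
      go acc (x:xs) = go (x : acc) xs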

Jonathan Cast
http://sourceforge.net/projects/fid-core
http://sourceforge.net/projects/fid-emacs


Re: [Haskell-cafe] Toy compression algorithms [was: A very edgy language]

2007-07-11 Thread Andrew Coppin

Bulat Ziganshin wrote:

Hello ajb,

Wednesday, July 11, 2007, 7:55:22 AM, you wrote:
  

Not really.  LZW is basically PPM with a static (and flat!) frequency
prediction model.  The contexts are built in a very similar way.



what do you mean by "flat" and "static" as applied to PPM? static PPM
models exist - they carry probabilities in a separate table, very much
like static Huffman encoding. is "flat" the same as order-0?


I think he meant "flat" as in "pr(z) = 0 | 1"...

As for "static", I'm not so sure that's actually correct.
