On Thu, 2005-07-21 at 00:12 +0200, Kornél Pál wrote:
> > From: "Ben Maurer"
> > * There are extremely long runs of the same char in many instances
> > * The file seems to have tons of 0 bytes.
> > * There are some runs of sequences:
> >
> > 0002bfb0: 3c00 3d00 3e00 3f00 4000 4100 4200 4300  <.=.>.?.@.A.B.C.
> > 0002bfc0: 4400 4500 4600 4700 4800 4900 4a00 4b00  D.E.F.G.H.I.J.K.
> > 0002bfd0: 4c00 4d00 4e00 4f00 5000 5100 5200 5300  L.M.N.O.P.Q.R.S.
> > 0002bfe0: 5400 5500 5600 5700 5800 5900 5a00 5b00  T.U.V.W.X.Y.Z.[.
> >
> > though they are somewhat smaller than the runs of the same char.
>
> I see the problem as the following: If the file contains Unicode
> characters, it eats disk space but is fast to read, thus sorting is fast.
> If it is compressed but unbuffered, sorting is slow and eats CPU.
> If it's buffered, either because it is compressed or "just for fun", it
> eats RAM.
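
To make the two kinds of runs in the dump above concrete (the same value
repeated, and consecutive values such as 0x3c, 0x3d, 0x3e, ...), here is a
minimal sketch of a table layout that stores each run as one small record.
This is only an illustration under my own assumptions: the names
encode_runs, decode_runs and lookup are hypothetical, the (start, step,
count) record is not the format the file actually uses, and Python is used
purely for brevity.

```python
from typing import List, Tuple

# One run: (start value, step, count). step is 0 for a repeated value
# and 1 for consecutive values. Hypothetical layout, for illustration only.
Run = Tuple[int, int, int]


def encode_runs(values: List[int]) -> List[Run]:
    """Collapse repeated (step 0) and consecutive (step 1) values into runs."""
    runs: List[Run] = []
    i = 0
    n = len(values)
    while i < n:
        start = values[i]
        step = 0
        count = 1
        # The next element decides whether a run starts here and its step.
        if i + 1 < n and values[i + 1] - start in (0, 1):
            step = values[i + 1] - start
            count = 2
            while i + count < n and values[i + count] == start + step * count:
                count += 1
        runs.append((start, step, count))
        i += count
    return runs


def decode_runs(runs: List[Run]) -> List[int]:
    """Expand the runs back into the original flat list of values."""
    out: List[int] = []
    for start, step, count in runs:
        out.extend(start + step * k for k in range(count))
    return out


def lookup(runs: List[Run], index: int) -> int:
    """Read one entry without expanding the table (linear scan over runs)."""
    if index < 0:
        raise IndexError("index out of range")
    for start, step, count in runs:
        if index < count:
            return start + step * index
        index -= count
    raise IndexError("index out of range")
```

With a table of 1000 zero entries followed by the 32 values 0x3c through
0x5b shown in the dump, encode_runs produces two records instead of 1032
entries, and lookup costs a few additions and comparisons per read. A real
format would probably index the runs rather than scan them linearly.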

Compression does not mean `use bzip' in this context. It means "change
the file format so that we don't need long runs". Compression will quite
possibly make things faster:

* Reading from disk is SLOOOOOOOOOW. In the time it takes to access one
  extra page from the disk, we could have done *tons* of sorts. Please
  see http://rlove.org/talks/rml_guadec_2005.ppt, slide 3.
* Cache misses are slow (but not as slow). So a few extra instructions
  may well be worth avoiding one.

-- Ben

_______________________________________________
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list