Has there been any progress on a (stand-alone) Blosc package for Julia? If not, I might have time to contribute, since I need a fast compressor for a project. If there is any existing code or a start on it, I'd appreciate a pointer.
Cheers,

Robert Feldt

On Tuesday, September 2, 2014 at 21:47:33 UTC+2, Douglas Bates wrote:
>
> Would it be reasonable to create a Blosc package, or is it best to
> incorporate it directly into the HDF5 package? If a separate package is
> reasonable I could start on it, as I was the one who suggested this in the
> first place.
>
> On Tuesday, September 2, 2014 2:43:15 PM UTC-5, Tim Holy wrote:
>>
>> All these testimonials do make it sound promising. Even three-fold
>> compression is a pretty big deal.
>>
>> One disadvantage to compression is that it makes mmap impossible. But,
>> since HDF5 supports hyperslabs, that's not as big a deal as it would
>> have been.
>>
>> --Tim
>>
>> On Tuesday, September 02, 2014 12:11:55 PM Jake Bolewski wrote:
>> > I've used Blosc in the past with great success. Oftentimes it is
>> > faster than the uncompressed version if IO is the bottleneck. The
>> > compression ratios are not great but that is really not the point.
>> >
>> > On Tuesday, September 2, 2014 2:09:20 PM UTC-4, Stefan Karpinski wrote:
>> > > That looks pretty sweet. It seems to avoid a lot of the pitfalls of
>> > > naively compressing data files while still getting the benefits. It
>> > > would be great to support that in JLD, maybe even turned on by
>> > > default.
>> > >
>> > > On Tue, Sep 2, 2014 at 1:35 PM, Kevin Squire <kevin....@gmail.com> wrote:
>> > >> Just to hype blosc a little more, see
>> > >>
>> > >> http://www.blosc.org/blosc-in-depth.html
>> > >>
>> > >> The main feature is that data is chunked so that the compressed
>> > >> chunk size fits into L1 cache, and is then decompressed and used
>> > >> there. There are a few more buzzwords (multithreading, SIMD) in the
>> > >> link above. Worth exploring where this might be useful in Julia.
>> > >>
>> > >> Cheers,
>> > >>
>> > >> Kevin
>> > >>
>> > >> On Tuesday, September 2, 2014, Tim Holy <tim....@gmail.com> wrote:
>> > >>> HDF5/JLD does support compression:
>> > >>>
>> > >>> https://github.com/timholy/HDF5.jl/blob/master/doc/hdf5.md#reading-and-writing-data
>> > >>>
>> > >>> But it's not turned on by default. Matlab uses compression by
>> > >>> default, and I've found it's a huge bottleneck in terms of
>> > >>> performance
>> > >>> (http://www.mathworks.com/matlabcentral/fileexchange/39721-save-mat-files-more-quickly).
>> > >>> But perhaps there's a good middle ground. It would take someone
>> > >>> doing a little experimentation to see what the compromises are.
>> > >>>
>> > >>> --Tim
>> > >>>
>> > >>> On Tuesday, September 02, 2014 08:30:39 AM Douglas Bates wrote:
>> > >>> > Now that the JLD format can handle DataFrame objects I would like
>> > >>> > to switch from storing data sets in .RData format to .jld format.
>> > >>> > Datasets stored in .RData format are compressed after they are
>> > >>> > written. The default compression is gzip. Bzip2 and xz
>> > >>> > compression are also available. The compression can make a
>> > >>> > substantial difference in the file size because the data values
>> > >>> > are often highly repetitive.
>> > >>> >
>> > >>> > JLD is different in scope in that .jld files can be queried using
>> > >>> > external programs like h5ls and the files can have new data added
>> > >>> > or existing data edited or removed. The .RData format is an
>> > >>> > archival format. Once the file is written it cannot be modified
>> > >>> > in place.
>> > >>> >
>> > >>> > Given these differences I can appreciate that JLD files are not
>> > >>> > compressed. Nevertheless I think it would be useful to adopt a
>> > >>> > convention in the JLD module for accessing data from files with a
>> > >>> > .jld.xz or .jld.7z extension. It could be as simple as
>> > >>> > uncompressing the files in a temporary directory, reading them,
>> > >>> > then removing them, or it could be more sophisticated. I notice
>> > >>> > that my versions of libjulia.so on an Ubuntu 64-bit system are
>> > >>> > linked against both libz.so and liblzma.so
>> > >>> >
>> > >>> > $ ldd /usr/lib/x86_64-linux-gnu/julia/libjulia.so
>> > >>> > linux-vdso.so.1 => (0x00007fff5214f000)
>> > >>> > libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f62932ee000)
>> > >>> > libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f62930d5000)
>> > >>> > libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6292dce000)
>> > >>> > librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6292bc6000)
>> > >>> > libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f62929a8000)
>> > >>> > libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f629278c000)
>> > >>> > libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6292488000)
>> > >>> > libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6292272000)
>> > >>> > libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6291eab000)
>> > >>> > /lib64/ld-linux-x86-64.so.2 (0x00007f62944b3000)
>> > >>> > liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f6291c89000)
>> > >>> >
>> > >>> > AFAIK the user-level interface to gzip requires the GZip package.
>> > >>> > Unless I have missed something (always a possibility) there is no
>> > >>> > user-level interface to liblzma in Julia. If the library is going
>> > >>> > to be linked anyway, would it make sense to provide a user-level
>> > >>> > interface in Julia?
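
For reference, the simplest convention suggested in the original post (uncompress a .jld.xz file into a temporary directory, read it, then remove it) might look roughly like the sketch below. It is in Python only because its standard library ships an lzma module; the function name `read_via_temp_copy` and the `reader` callback are made up for illustration and are not part of JLD or HDF5.jl, and the demo uses a stand-in byte payload rather than a real .jld file.

```python
import lzma
import shutil
import tempfile
from pathlib import Path


def read_via_temp_copy(compressed_path, reader):
    """Decompress an .xz file into a temporary directory, hand the
    decompressed path to `reader`, and return whatever `reader` returns.

    `reader` is a hypothetical callback standing in for "open the .jld file".
    The temporary directory is removed on exit, even if `reader` raises.
    """
    compressed_path = Path(compressed_path)
    with tempfile.TemporaryDirectory() as tmpdir:
        # Drop the final suffix: "data.jld.xz" -> "data.jld"
        target = Path(tmpdir) / compressed_path.stem
        with lzma.open(compressed_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams, so large files are fine
        return reader(target)


if __name__ == "__main__":
    # Round-trip demo with repetitive stand-in data.
    with tempfile.TemporaryDirectory() as workdir:
        archive = Path(workdir) / "example.jld.xz"
        payload = b"highly repetitive data " * 1000
        with lzma.open(archive, "wb") as f:
            f.write(payload)
        restored = read_via_temp_copy(archive, lambda p: p.read_bytes())
        print("round trip ok:", restored == payload)
        print("compressed bytes:", archive.stat().st_size, "of", len(payload))
```

The obvious cost of this approach is a full decompressed copy on disk for every read, which is part of why in-library, chunk-wise compression (HDF5 filters, Blosc) came up in the thread as the more attractive route.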