[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-27 Thread Francesc Alted
Hi Qianqian,

Your work on bjdata is very interesting.  Our team (Blosc) has been
working on something along these lines, and I was curious about how the
different approaches compare.  In particular, Blosc2 uses the msgpack
format to store binary data in a flexible way, but in my experience, using
binary JSON or msgpack is not that important; the real thing is to be able
to compress data in chunks that fit in CPU caches, and then trust fast
codecs and filters for speed.
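
To make that point concrete, here is a concept-only sketch of the chunked
approach, using plain numpy and zlib from the standard library as a stand-in
for Blosc2's much faster codecs and filters (the chunk size is just an
assumption for illustration):

# concept-only sketch of "compress in cache-sized chunks, then operate";
# zlib stands in for a fast codec, and the ~4 MB chunk size is assumed
import zlib
import numpy as np

arr = np.eye(10_000)                    # ~763 MB of float64, highly compressible
CHUNK_BYTES = 4 * 2**20                 # roughly cache-sized chunks (assumed)
step = CHUNK_BYTES // (arr.shape[1] * arr.itemsize)

# compress chunk by chunk: each chunk fits in cache while the codec runs
chunks = [zlib.compress(arr[i:i + step].tobytes(), level=1)
          for i in range(0, arr.shape[0], step)]

# operate without ever materializing the whole decompressed array at once
total = sum(np.frombuffer(zlib.decompress(c), dtype=arr.dtype).sum()
            for c in chunks)
print(total)                            # equals arr.sum() == 10000.0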

I have set up a small benchmark (
https://gist.github.com/FrancescAlted/e4d186404f4c87d9620cb6f89a03ba0d)
based on your setup, and here are my numbers (using an AMD 5950X processor
and a fast SSD here):

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ PYTHONPATH=..
python read-binary-data.py save
time for creating big array (and splits): 0.009s (86.5 GB/s)

** Saving data **
time for saving with npy: 0.450s (1.65 GB/s)
time for saving with np.memmap: 0.689s (1.08 GB/s)
time for saving with npz: 1.021s (0.73 GB/s)
time for saving with jdb (zlib): 4.614s (0.161 GB/s)
time for saving with jdb (lzma): 11.294s (0.066 GB/s)
time for saving with blosc2 (blosclz): 0.020s (37.8 GB/s)
time for saving with blosc2 (zstd): 0.153s (4.87 GB/s)

** Load and operate **
time for reducing with plain numpy (memory): 0.016s (47.4 GB/s)
time for reducing with npy (np.load, no mmap): 0.144s (5.18 GB/s)
time for reducing with np.memmap: 0.055s (13.6 GB/s)
time for reducing with npz: 1.808s (0.412 GB/s)
time for reducing with jdb (zlib): 1.624s (0.459 GB/s)
time for reducing with jdb (lzma): 0.255s (2.92 GB/s)
time for reducing with blosc2 (blosclz): 0.042s (17.7 GB/s)
time for reducing with blosc2 (zstd): 0.070s (10.7 GB/s)
Total sum: 1.0

So, it is evident that in this scenario compression can accelerate things a
lot, especially for writing.  Here are the sizes:

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ ll -h eye5*
-rw-rw-r-- 1 faltet2 faltet2 989K ago 27 09:51 eye5_blosc2_blosclz.b2frame
-rw-rw-r-- 1 faltet2 faltet2 188K ago 27 09:51 eye5_blosc2_zstd.b2frame
-rw-rw-r-- 1 faltet2 faltet2 121K ago 27 09:51 eye5chunk_bjd_lzma.jdb
-rw-rw-r-- 1 faltet2 faltet2 795K ago 27 09:51 eye5chunk_bjd_zlib.jdb
-rw-rw-r-- 1 faltet2 faltet2 763M ago 27 09:51 eye5chunk-memmap.npy
-rw-rw-r-- 1 faltet2 faltet2 763M ago 27 09:51 eye5chunk.npy
-rw-rw-r-- 1 faltet2 faltet2 785K ago 27 09:51 eye5chunk.npz

Regarding decompression, I am quite pleased with how jdb+lzma performs
(especially its compression ratio).  But in order to provide a better
idea of the actual read performance, it is better to evict the files from
the OS cache.  Also, the benchmark performs an operation on the data (in this
case a reduction) to make sure that all the data is actually processed.
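
For reference, the read step follows roughly the pattern below; the real
code is in the gist linked above, and this helper is purely illustrative:

# rough sketch of the read-and-reduce measurement described above
import time
import numpy as np

def timed_reduce(label, load):
    t0 = time.time()
    a = load()
    total = float(np.sum(a))          # the reduction forces every element to be read
    dt = time.time() - t0
    print(f"time for reducing with {label}: {dt:.3f}s "
          f"({a.nbytes / dt / 2**30:.3g} GB/s)")
    return total

# e.g. the plain .npy case (file name taken from the listing below)
timed_reduce("npy (np.load, no mmap)", lambda: np.load("eye5chunk.npy"))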

So, let's evict the files:

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ vmtouch -ev
eye5*
Evicting eye5_blosc2_blosclz.b2frame
Evicting eye5_blosc2_zstd.b2frame
Evicting eye5chunk_bjd_lzma.jdb
Evicting eye5chunk_bjd_zlib.jdb
Evicting eye5chunk-memmap.npy
Evicting eye5chunk.npy
Evicting eye5chunk.npz

   Files: 7
 Directories: 0
   Evicted Pages: 391348 (1G)
 Elapsed: 0.084441 seconds

And then re-run the benchmark (this time without re-creating the files):

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ PYTHONPATH=..
python read-binary-data.py
time for creating big array (and splits): 0.009s (80.4 GB/s)

** Load and operate **
time for reducing with plain numpy (memory): 0.065s (11.5 GB/s)
time for reducing with npy (np.load, no mmap): 0.413s (1.81 GB/s)
time for reducing with np.memmap: 0.547s (1.36 GB/s)
time for reducing with npz: 1.881s (0.396 GB/s)
time for reducing with jdb (zlib): 1.845s (0.404 GB/s)
time for reducing with jdb (lzma): 0.204s (3.66 GB/s)
time for reducing with blosc2 (blosclz): 0.043s (17.2 GB/s)
time for reducing with blosc2 (zstd): 0.072s (10.4 GB/s)
Total sum: 1.0

In this case we can see that the combination of blosc2+blosclz achieves
speeds that are faster than using a plain numpy array.  Having disk I/O
go faster than memory is strange enough, but if we take into account
that these arrays compress extremely well (more than 1000x in this case),
then the I/O overhead is really low compared with the cost of computation
(all the decompression takes place in CPU cache, not memory), so in the
end, this is not that surprising.

Cheers!


On Fri, Aug 26, 2022 at 4:26 AM Qianqian Fang  wrote:

> On 8/25/22 18:33, Neal Becker wrote:
>
>
>
>> the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy
>> 1.19.5) for each file is listed below:
>>
>> 0.179s  eye1e4.npy (mmap_mode=None)
>> 0.001s  eye1e4.npy (mmap_mode=r)
>> 0.718s  eye1e4_bjd_raw_ndsyntax.jdb
>> 1.474s  eye1e4_bjd_zlib.jdb
>> 0.635s  eye1e4_bjd_lzma.jdb
>>
>>
>> clearly, mmapped loading is the fastest option without a surprise; it is
>> true that the raw bjdata file is about 5x slower than npy loading, but
>> given the main chunk of the da

[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-27 Thread Qianqian Fang

hi Francesc,

wonderful work on blosc2! congrats! this is exactly the direction that 
I hope more data creators/data users will pay attention to.


clearly blosc2 is well positioned for high performance - msgpack is 
one of the most widely adopted binary JSON formats out there, with many 
extensively optimized libraries; zstd is also a rapidly emerging 
compression codec with well-developed multi-threading support. 
this combination likely offers the best that the current toolchain can 
deliver in terms of performance and robustness. The added SIMD and 
data-chunking features push the performance bar further.


I am aware that msgpack does not currently support a packed ND-array data 
type (see my PR to add this syntax at 
https://github.com/msgpack/msgpack/pull/267), so I suppose blosc2 must have 
been using custom buffers wrapped under an ext32 container - is that 
the case? or did you implement your own unofficial ext64 type?
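
(For context, a minimal sketch of how an ND array can be smuggled through 
msgpack's ext mechanism is below; this is purely illustrative - the ext 
type code and metadata layout are made up, and not necessarily blosc2's 
actual format.)

# minimal sketch: wrapping a numpy buffer in a msgpack ext container,
# since msgpack has no native ND-array type (ext code 42 is hypothetical)
import msgpack
import numpy as np

NDARRAY_EXT = 42                 # hypothetical application-defined ext code

arr = np.eye(100)

# pack dtype/shape metadata together with the raw buffer into one ext payload
payload = msgpack.packb([str(arr.dtype), list(arr.shape), arr.tobytes()])
packed = msgpack.packb(msgpack.ExtType(NDARRAY_EXT, payload))

def ext_hook(code, data):
    # restore the array when our ext code is seen; pass anything else through
    if code == NDARRAY_EXT:
        dtype, shape, buf = msgpack.unpackb(data)
        return np.frombuffer(buf, dtype=dtype).reshape(shape)
    return msgpack.ExtType(code, data)

restored = msgpack.unpackb(packed, ext_hook=ext_hook)
assert np.array_equal(arr, restored)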


I am not surprised to see blosc2 outperform npz/jdb in the compression 
benchmarks, because zstd supports multi-threading and that makes a huge 
difference, as clearly shown in this 2017 benchmark that I found online:


https://community.centminmod.com/threads/compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.12764/ 



using the multi-threaded versions of zlib (pigz) and lzma (pxz, pixz, or 
plzip) would be a more apples-to-apples comparison, but I do believe zstd 
may still hold an edge in speed (perhaps trading away some compression 
ratio). I also noticed that lbzip2 gives relatively good speed and a 
high compression ratio. Nothing beats lzma (lzma/zip/xz) in compression 
ratio, even with the highest setting in zstd.


I absolutely agree with you that the different flavors of binary JSON 
formats (Msgpack vs CBOR vs BSON vs UBJSON vs BJData) matter little 
because they are all JSON-convertible and follow the same design 
principles as JSON - namely simplicity, generality and being lightweight.


I did make some deliberations when deciding whether to use Msgpack or 
UBJSON/BJData as the main binary format for NeuroJSON; two things 
steered my decision:


1. there is *no official packed ND-array support* in either Msgpack or 
UBJSON. ND-array is such a fundamental data structure for scientific 
data storage that it has to be a first-class citizen in data 
serialization formats - storing an ND array as nested 1D lists, as done 
in standard msgpack/ubjson, not only loses the dimensional regularity but 
also adds overhead and breaks the contiguous binary buffer. That was the 
main reason that I had to extend UBJSON as BJData to natively support an 
ND-array syntax (see the short sketch after item 2 below)


2. a key belief of the NeuroJSON project is that "human readability" is 
the single most important factor in deciding the longevity of both code 
and data. The human-readability of code has been well addressed and 
reinforced by open-source/free/libre software licenses (specifically, 
Freedom 1), but not many people have been paying attention to the 
"readability" of data. Admittedly, it is a harder problem: storing data 
in text files results in much larger sizes and slower speed, so storing 
binary data in application-defined binary files, just like npy, is 
extremely common. However, these binary files are in most cases not 
directly readable; they depend on a matching parser, which carries the 
format spec/schema separately from the data themselves, to be read or 
written correctly. Because the data files are not self-contained, and 
usually not self-documenting, their utility heavily depends on the 
parser writers - when a parser phases out an older format, or does not 
implement the format rigorously, the data ultimately can no longer be 
opened and becomes useless.
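
To make the contrast in item 1 concrete, here is a minimal numpy-only 
sketch; the _Array*_ keys loosely mirror the annotated form that shows up 
in the strings dump further below, but the values (numpy dtype names, no 
compression) are simplified for illustration:

# nested lists (standard msgpack/ubjson route) vs. a packed buffer with
# explicit type/shape metadata (what an ND-array syntax preserves)
import numpy as np

arr = np.eye(1000)

# route 1: nested 1D lists - the dimensional regularity and the contiguous
# binary buffer are both lost, every element becomes a Python object
nested = arr.tolist()

# route 2: one contiguous buffer plus explicit metadata
packed = {
    "_ArrayType_": str(arr.dtype),      # "float64"
    "_ArraySize_": list(arr.shape),     # [1000, 1000]
    "_ArrayData_": arr.tobytes(),       # a single binary blob
}
restored = np.frombuffer(packed["_ArrayData_"],
                         dtype=packed["_ArrayType_"]).reshape(packed["_ArraySize_"])
assert np.array_equal(arr, restored)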


One feature that really drew my attention to UBJSON/BJData is that they 
are "quasi-human-readable". This is rather *unique* among binary formats, 
because the "semantic" elements (data type markers, field names and 
strings) in UBJSON/BJData are all human-readable. Essentially one can open 
such a binary file with a text editor and figure out what's inside - if the 
data file is well self-documented (which the format permits), then such 
data can be quickly understood without depending on a parser.


you can try this command on the lzma.jdb file

$ strings -n2 eye5chunk_bjd_lzma.jdb | astyle | sed '/_ArrayZipData_/q'
[ {U
   _ArrayType_SU
   doubleU
   _ArraySize_[U
  ]U
   _ArrayZipType_SU
   lzmaU
   _ArrayZipSize_[U
  m@
 ]U
   _ArrayZipData_[$U#uE

as you can see, the subfields of the data (_Array

[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-27 Thread Stephan Hoyer
On Sat, Aug 27, 2022 at 9:17 AM Qianqian Fang  wrote:

> 2. a key belief of the NeuroJSON project is that "human readability" is
> the single most important factor in deciding the longevity of both code
> and data. The human-readability of code has been well addressed and
> reinforced by open-source/free/libre software licenses (specifically,
> Freedom 1), but not many people have been paying attention to the
> "readability" of data.
>
>

Hi Qianqian,

I think you might be interested in the Zarr storage format, for exactly
this same reason: https://zarr.dev/

Zarr is focused more on "big data" but one of its fundamental strengths is
that the format is extremely simple. All the metadata is in JSON, with
arrays divided up into smaller "chunks" stored as files on disk or in cloud
object stores.
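
As a rough illustration of that layout (zarr v2's directory store; the path
and shapes below are arbitrary):

# plain-JSON metadata sitting next to compressed chunk files on disk
import numpy as np
import zarr

z = zarr.open("example.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f8")
z[:] = np.eye(10_000)

# the array metadata is human-readable JSON, readable by any tool
with open("example.zarr/.zarray") as f:
    print(f.read())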

Cheers,
Stephan