>> For my case, I'd be curious about the time to add one 1T-entries file to
>> another.
> as I mentioned in the previous reply, bjdata is appendable [3], so you can
> simply append another array (or a slice) to the end of the file.
I'm thinking of numerical ops here, e.g. adding an array to itself would
double the values but not the size.
---
--
Phobrain.com
On 2022-08-25 14:41, Qianqian Fang wrote:
> To avoid derailing the other thread [1] on extending .npy files, I am going
> to start a new thread on alternative array storage file formats using binary
> JSON - in case there is such a need and interest among numpy users
>
> specifically, I want to first follow up with Bill's question below regarding
> loading time
>
> On 8/25/22 11:02, Bill Ross wrote:
>
>> Can you give load times for these?
>
> as I mentioned in the earlier reply to Robert, the most memory-efficient
> (i.e. fast-loading, disk-mmap-able) but not necessarily disk-efficient
> (i.e. it may produce the largest data file sizes) way to store an ND array
> in BJData is to use BJData's ND-array container.
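A minimal sketch of that ND-array construct, built by hand for float64 arrays (the marker choices 'D' for float64 and 'l' for int32, and the little-endian byte order, are my reading of the BJData spec draft, not output from the bjdata module):

```python
import struct
import numpy as np

def bjdata_ndarray_record(arr):
    """Sketch of a BJData ND-array record for a float64 array:
    '[' '$' 'D' '#' '[' <dims> ']' <raw little-endian data>.
    The record is self-delimiting (the count header gives the size),
    so no closing ']' follows the data payload."""
    arr = np.ascontiguousarray(arr, dtype="<f8")
    # each dimension encoded as 'l' (int32 marker) + 4 little-endian bytes
    dims = b"".join(b"l" + struct.pack("<i", d) for d in arr.shape)
    return b"[$D#[" + dims + b"]" + arr.tobytes()
```

Note that the array payload is a single contiguous buffer at a fixed offset past the header, which is what makes memory-mapped access feasible.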
>
> I have to admit that both the jdata and bjdata modules have not been
> extensively optimized for speed. with the current implementation, here are
> the loading times for a large identity matrix (eye(10000))
>
> a BJData file storing a single eye(10000) array using the ND-array container
> can be downloaded from here [2] (file size: 1MB zipped; decompressed, it is
> ~800MB, the same as the npy file) - this file was generated with the matlab
> encoder, but can be loaded in Python (see the reply to Robert below).
>
> 800000128 eye1e4.npy
> 800000014 eye1e4_bjd_raw_ndsyntax.jdb
> 813721 eye1e4_bjd_zlib.jdb
> 113067 eye1e4_bjd_lzma.jdb
>
> the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy
> 1.19.5) for each file is listed below:
>
> 0.179s eye1e4.npy (mmap_mode=None)
> 0.001s eye1e4.npy (mmap_mode=r)
> 0.718s eye1e4_bjd_raw_ndsyntax.jdb
> 1.474s eye1e4_bjd_zlib.jdb
> 0.635s eye1e4_bjd_lzma.jdb
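The npy timings above can be reproduced approximately with a sketch like the following (a smaller eye(1000) is used here to keep it quick; timing the .jdb files would additionally require the jdata/bjdata modules):

```python
import os
import tempfile
import time
import numpy as np

# write a small identity matrix to an .npy file
path = os.path.join(tempfile.mkdtemp(), "eye1e3.npy")
np.save(path, np.eye(1000))

t0 = time.perf_counter()
full = np.load(path)                     # reads the whole file into memory
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
lazy = np.load(path, mmap_mode="r")      # maps pages lazily; near-instant
t_mmap = time.perf_counter() - t0
```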
>
> unsurprisingly, mmapped loading is the fastest option; it is true that the
> raw bjdata file is about 4x slower to load than the npy file, but given that
> the main chunk of the data is stored identically (as a contiguous buffer), I
> suppose that with some optimization of the decoder the gap between the two
> can be substantially narrowed. The longer loading times for zlib/lzma (and
> similarly the saving times) reflect a trade-off between smaller file sizes
> and time spent on compression/decompression/disk IO.
>
> by no means am I saying the binary JSON format is ready to deliver better
> speed with its current non-optimized implementation. I just want to bring
> attention to this class of formats, and highlight that their flexibility
> offers abundant mechanisms for fast, disk-mapped IO, while allowing
> additional benefits such as compression, unlimited metadata for future
> extensions, etc.
>
>>> 8000128 eye5chunk.npy
>>> 5004297 eye5chunk_bjd_raw.jdb
>>> 10338 eye5chunk_bjd_zlib.jdb
>>> 2206 eye5chunk_bjd_lzma.jdb
>>
>> For my case, I'd be curious about the time to add one 1T-entries file to
>> another.
>
> as I mentioned in the previous reply, bjdata is appendable [3], so you can
> simply append another array (or a slice) to the end of the file.
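A rough illustration of that appendability (the byte records below are hand-built stand-ins for serialized BJData arrays; in practice one would presumably write them through the bjdata module's encoder):

```python
import os
import tempfile

# two tiny hand-built BJData-style records: '[$U#[U<n>]' + n uint8 values
# (illustrative bytes only; real records come from a BJData encoder)
rec_a = b"[$U#[U\x03]" + bytes([1, 2, 3])
rec_b = b"[$U#[U\x02]" + bytes([4, 5])

path = os.path.join(tempfile.mkdtemp(), "demo.jdb")
with open(path, "wb") as f:      # initial write
    f.write(rec_a)
with open(path, "ab") as f:      # later: append a second array record
    f.write(rec_b)

# because records are self-delimiting, the resulting file is simply the
# concatenation of valid records - no header rewrite is needed
with open(path, "rb") as f:
    data = f.read()
assert data == rec_a + rec_b
```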
>
>> Thanks,
>> Bill
>
> also related, Re @Robert's question below
>
>> Are any of them supported by a Python BJData implementation? I didn't see
>> any option to get that done in the `bjdata` package you recommended, for
>> example.
>> https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200
>
> the bjdata module currently only supports the ND-array syntax in the
> decoder [4] (i.e. it maps such a buffer to a numpy.ndarray) - it should be
> relatively trivial to add it to the encoder as well.
>
> on the other hand, the annotated format is currently supported. one can call
> the jdata module (responsible for annotation-level encoding/decoding) as
> shown in my sample code, then call bjdata internally for data serialization.
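A stdlib-only sketch of what that annotated construct looks like (the `_ArrayType_`/`_ArraySize_`/`_ArrayZipType_`/`_ArrayZipData_` keys follow the JData spec; the exact output of the jdata module may differ in detail):

```python
import base64
import json
import struct
import zlib

# a 2x2 identity matrix, row-major, as the raw little-endian double buffer
values = [1.0, 0.0, 0.0, 1.0]
raw = struct.pack("<4d", *values)

# the JData annotated-array construct: a plain dict, so it serializes to
# either text JSON (json.dumps) or binary JSON (bjdata.dump) unchanged
ann = {
    "_ArrayType_": "double",
    "_ArraySize_": [2, 2],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [1, 4],
    "_ArrayZipData_": base64.b64encode(zlib.compress(raw)).decode(),
}
text = json.dumps(ann)

# decoding reverses the steps: parse, base64-decode, inflate, unpack
back = struct.unpack(
    "<4d", zlib.decompress(base64.b64decode(json.loads(text)["_ArrayZipData_"]))
)
```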
>
>> Okay. Given your wording, it looked like you were claiming that the binary
>> JSON was supported by the whole ecosystem. Rather, it seems like you can
>> either get binary encoding OR the ecosystem support, but not both at the
>> same time.
>
> all in relative terms of course - JSON has ~100 parsers listed on its
> website [5], MessagePack - another flavor of binary JSON - lists ~50-60
> parsers [6], and UBJSON lists ~20 parsers [7]. I am not familiar with npy
> parsers, but googling returns only a few.
>
> also, most binary JSON implementations provide tools to convert to JSON and
> back, so, in that sense, whatever JSON has in its ecosystem can
> "potentially" be used for binary JSON files if one wants to. there are also
> recent publications comparing the differences between various binary JSON
> formats, in case anyone is interested:
>
> https://github.com/ubjson/universal-binary-json/issues/115
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: bross_phobr...@sonic.net
Links:
------
[1] https://mail.python.org/archives/list/numpy-discussion@python.org/thread/A4CJ2DZCAKPMD2MYGVMDV5UI7FN4SBVI/
[2] http://neurojson.org/wiki/upload/eye1e4_bjd_raw_ndsyntax.jdb.zip
[3] https://github.com/NeuroJSON/bjdata/blob/master/images/BJData_Diagram.pdf
[4] https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/decoder.py#L360-L365
[5] https://www.json.org/json-en.html
[6] https://msgpack.org/index.html
[7] https://ubjson.org/libraries/