Can you give load times for these? 

> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb
>   10338  eye5chunk_bjd_zlib.jdb
>    2206  eye5chunk_bjd_lzma.jdb

For my case, I'd also be curious about the time to append one 1T-entry file to
another. 
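
Something like the following is what I have in mind for the timings - just a
sketch, reusing the numpy/jdata calls from your snippet below:

import time
import numpy as np
import jdata as jd

for name, loader in [('eye5chunk.npy', np.load),
                     ('eye5chunk_bjd_raw.jdb', jd.load),
                     ('eye5chunk_bjd_zlib.jdb', jd.load),
                     ('eye5chunk_bjd_lzma.jdb', jd.load)]:
    t0 = time.perf_counter()
    data = loader(name)                       # load/decode the whole file
    print(f'{name}: {time.perf_counter() - t0:.4f} s')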

Thanks, 
Bill 

--

Phobrain.com 

On 2022-08-24 20:02, Qianqian Fang wrote:

> I am curious what you and other developers think about adopting JSON/binary 
> JSON as a similarly simple, reverse-engineering-able but universally parsable 
> array exchange format instead of designing another numpy-specific binary 
> format. 
> 
> I am interested in this topic (as well as thoughts among numpy developers) 
> because I am currently working on a project - NeuroJSON 
> (https://neurojson.org) - funded by the US National Institutes of Health (NIH). The 
> goal of the NeuroJSON project is to create easy-to-adopt, easy-to-extend, and 
> preferably human-readable data formats to help disseminate and exchange 
> neuroimaging data (and scientific data in general). 
> 
> Needless to say, numpy is a key toolkit that is widely used among 
> neuroimaging data analysis pipelines. I've seen discussions of potentially 
> adopting npy as a standardized way to share volumetric data (as ndarrays), 
> such as in this thread 
> 
> https://github.com/bids-standard/bids-specification/issues/197 
> 
> however, several limitations were also discussed, for example 
> 
> 1. npy only supports a single numpy array and does not support additional 
> metadata or more complex data records (multiple arrays can only be achieved 
> via multiple files)
> 2. no internal (i.e. data-level) compression, only file-level compression
> 3. although the format is simple, it still requires a parser to read/write, and 
> such a parser is not widely available in other environments, making it mostly 
> limited to exchanging data among Python programs
> 4. I am not entirely sure, but I suppose it does not support sparse matrices 
> or special matrices (such as diagonal/band/symmetric etc.) - I could be wrong 
> though 
> 
> In the NeuroJSON project, we primarily use JSON and binary JSON 
> (specifically, the UBJSON [1] derived BJData [2] format) as the underlying data 
> exchange files. Through standardized data annotations [3], we are able to 
> address most of the above limitations - the generated files are universally 
> parsable in nearly all programming environments with existing parsers, 
> support complex hierarchical data and compression, and can readily benefit 
> from the large JSON ecosystem (JSON Schema, JSONPath, JSON-LD, jq, numerous 
> parsers, web readiness, NoSQL databases, ...). 
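> 
> For a sense of what the annotation looks like, a zlib-compressed 2D array is 
> stored roughly as the following record (keyword names are from the JData spec 
> [3]; the payload here is shortened and purely illustrative):
> 
> annotated = {
>     "_ArrayType_": "double",          # element type
>     "_ArraySize_": [200, 1000],       # original dimensions
>     "_ArrayZipType_": "zlib",         # internal (data-level) compression codec
>     "_ArrayZipSize_": [1, 200000],    # pre-compression element count
>     "_ArrayZipData_": "eJzs2...",     # compressed, base64-encoded row-major bytes
> }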
> 
> I understand that simplicity is a key design spec here. I want to highlight 
> UBJSON/BJData as a competitive alternative format. It was also designed with 
> simplicity as a primary goal [4], yet it can store hierarchical, strongly 
> typed, complex binary data and is easily extensible. 
> 
> A UBJSON/BJData parser is not necessarily longer than an npy parser; for 
> example, the Python reader for the full spec takes only about 500 lines of 
> code (including comments), and similarly for the JS parser 
> 
> https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
> https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js 
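> 
> To give a flavor of why such a parser stays small, here is a rough sketch (not 
> the actual pybj code) of decoding one strongly typed float64 array from a 
> BJData buffer; markers follow the UBJSON/BJData draft, BJData numbers are 
> little-endian, and only a 'U'/'u'-typed count is handled:
> 
> import struct
> import numpy as np
> 
> def read_float64_array(buf, pos=0):
>     # optimized container header: '[' '$' <element type> '#' <count>
>     assert buf[pos:pos+4] == b'[$D#', "expected an optimized float64 array header"
>     pos += 4
>     count_marker = buf[pos:pos+1]; pos += 1
>     if count_marker == b'U':                       # uint8 count
>         n = buf[pos]; pos += 1
>     elif count_marker == b'u':                     # uint16 count (BJData extension)
>         n = struct.unpack('<H', buf[pos:pos+2])[0]; pos += 2
>     else:
>         raise ValueError("count type not handled in this sketch")
>     arr = np.frombuffer(buf, dtype='<f8', count=n, offset=pos)
>     return arr, pos + 8 * n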
> 
> We actually ran a benchmark [5] a few months back - the test workloads were 
> two large 2D numerical arrays (node and face arrays storing surface mesh 
> data), and we compared the parsing speed of various formats in Python, MATLAB, 
> and JS. Uncompressed BJData (BMSHraw) loaded nearly as fast as reading a raw 
> binary dump, and internally compressed BJData (BMSHz) gave the best balance 
> between small file size and loading speed; see our results here 
> 
> https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large 
> 
> I want to add two quick points to echo the features you desired in npy: 
> 
> 1. it is not common to use mmap in reading JSON/binary JSON files, but it is 
> certainly possible. I recently wrote a JSON-mmap spec [6] and a MATLAB 
> reference implementation [7] 
> 
> 2. UBJSON/BJData natively supports appendable root-level records; JSON has 
> been used extensively for data streaming via appendable newline-delimited 
> (ND-JSON) or concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming) 
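> 
> In text JSON, the streaming pattern can be as simple as appending one record 
> per line; this sketch uses only the standard library (binary BJData records 
> can be concatenated in the same spirit):
> 
> import json
> 
> def append_record(path, record):
>     with open(path, 'a') as f:           # each append adds one self-contained record
>         f.write(json.dumps(record) + '\n')
> 
> def read_records(path):
>     with open(path) as f:
>         for line in f:                   # readers simply iterate line by line
>             yield json.loads(line)
> 
> append_record('log.ndjson', {'frame': 1, 'mean': 0.25})
> append_record('log.ndjson', {'frame': 2, 'mean': 0.31})
> print(list(read_records('log.ndjson')))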
> 
> Just a quick comparison of output file sizes for a 1000x1000 identity 
> matrix: 
> 
> # python3 -m pip install jdata bjdata
> import numpy as np
> import jdata as jd
> x = np.eye(1000)                       # create a large array
> y = np.vsplit(x, 5)                    # split into 5 smaller chunks
> np.save('eye5chunk.npy', y)            # save as npy
> jd.save(y, 'eye5chunk_bjd_raw.jdb')    # save as uncompressed bjd
> jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression': 'zlib'})  # zlib-compressed bjd
> jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression': 'lzma'})  # lzma-compressed bjd
> newy = jd.load('eye5chunk_bjd_zlib.jdb')  # load/decode the zlib-compressed file
> newx = np.concatenate(newy)               # regroup chunks
> newx.dtype 
> 
> here are the output file sizes in bytes: 
> 
> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb
>   10338  eye5chunk_bjd_zlib.jdb
>    2206  eye5chunk_bjd_lzma.jdb
> 
> Qianqian 
> 
> On 8/24/22 15:48, Michael Siebert wrote: 
> Hi Matti, hi all, 
> 
> @Matti: I don't know exactly what you are referring to (the pull request or the 
> GitHub project; links below). Maybe some clarification is needed, which I 
> hereby try to provide ;) 
> 
> A .npy file created by some appending process is a regular .npy file and does 
> not need to be read in chunks. Processing arrays larger than the system's 
> memory can already be done with memory mapping (numpy.load(... 
> mmap_mode=...)), so no third-party support is needed to do so. 
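> 
> For reference, that usage looks like the following (the file name here is just 
> a placeholder):
> 
> import numpy as np
> 
> # memory-map an existing .npy file: data is paged in on demand,
> # so "opening" even a very large array is essentially instant
> big = np.load('images.npy', mmap_mode='r')
> print(big.shape, big.dtype)
> batch = np.asarray(big[1000:1032])   # only this slice is actually read from disk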
> 
> The idea is not necessarily to only write some known-but-fragmented content 
> to a .npy file in chunks or to only handle files larger than the RAM. 
> 
> It is more about the ability to append to a .npy file at any time and between 
> program runs. For example, in our case, we have a large database-like file 
> containing all (preprocessed) images of all videos used to train a neural 
> network. When new video data arrives, it can simply be appended to the 
> existing .npy file. When training the neural net, the data is simply memory 
> mapped, which happens basically instantly and does not use extra space 
> between multiple training processes. We have tried out various fancy, 
> advanced data formats for this task, but most of them don't provide memory 
> mapping, which is very handy for keeping the time required to test a code 
> change comfortably low - instead, they have excessive parse/decompress times. 
> Other libraries can also be difficult to handle; see below. 
> 
> The .npy array format is designed to be limited. There is a NEP for it, which 
> summarizes the .npy features and concepts very well: 
> 
> https://numpy.org/neps/nep-0001-npy-format.html 
> 
> One of my favorite features (besides memory mapping perhaps) is this one: 
> 
> "... Be reverse engineered. Datasets often live longer than the programs that 
> created them. A competent developer should be able to create a solution in 
> his preferred programming language to read most NPY files that he has been 
> given without much documentation. ..." 
> 
> This is a big disadvantage of all the fancy formats out there: they require 
> dedicated libraries. Some of these libraries don't come with nice and free 
> documentation (especially lacking easy-to-use/easy-to-understand code 
> examples for the target language, e.g. C) and/or can be extremely complex, 
> like HDF5. Yes, HDF5 has its users and is totally valid if one operates the 
> world's largest particle accelerator, but we spent weeks finding some 
> C/C++ library for it that does not expose bugs and is somehow documented. We 
> actually failed and reported a bug, which was fixed a year or so later. This can 
> ruin entire projects - fortunately not ours, but it ate up a lot of time we 
> could have spent more meaningfully. On the other hand, I don't see how e.g. 
> zarr provides added value over .npy if one only needs the .npy features and 
> maybe some append-data-along-one-axis feature. Yes, maybe there are some uses 
> for two or three appendable axes, but I think having one axis to append to 
> should cover a lot of use cases. This axis is typically time: video, audio, 
> GPS, signal data in general, binary log data, "binary CSV" (lines in a file) - 
> all of these only need one axis to append to. 
> 
> The .npy format is so simple that it can be read, e.g. in C, in a few lines, 
> or accessed easily through NumPy and ctypes via pointers for high-speed custom 
> logic - not requiring any libraries besides NumPy. 
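> 
> To illustrate that simplicity, a rough pure-Python reader for version 1.0 .npy 
> files (header layout as described in numpy.lib.format; a sketch, not 
> production code) fits in about a dozen lines:
> 
> import ast
> import numpy as np
> 
> def read_npy_v1(path):
>     with open(path, 'rb') as f:
>         assert f.read(6) == b'\x93NUMPY'            # magic string
>         major, minor = f.read(1)[0], f.read(1)[0]   # format version, e.g. 1.0
>         hlen = int.from_bytes(f.read(2), 'little')  # version 1.x: uint16 header length
>         header = ast.literal_eval(f.read(hlen).decode('latin1'))
>         data = np.fromfile(f, dtype=np.dtype(header['descr']))
>     return data.reshape(header['shape'],
>                         order='F' if header['fortran_order'] else 'C')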
> 
> Making .npy appendable is easy to implement. Yes, appending along only one 
> axis is a limitation, as is the .npy format itself, but I consider that a 
> feature rather than an (actual) limitation, as it allows for fast and simple 
> appends. 
> 
> The question is whether there is support in the NumPy community for an 
> append-to-.npy-files-along-one-axis feature, and if so, what the details of 
> the actual implementation should be. I made one suggestion in 
> 
> https://github.com/numpy/numpy/pull/20321/ 
> 
> and I offer to invest time to update/modify/finalize the PR. I've also 
> created a library that can already append to .npy: 
> 
> https://github.com/xor2k/npy-append-array 
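> 
> Typical usage looks roughly like this (a sketch; see the README for the exact 
> current API):
> 
> import numpy as np
> from npy_append_array import NpyAppendArray
> 
> with NpyAppendArray('data.npy') as naa:
>     naa.append(np.zeros((100, 64, 64), dtype=np.uint8))  # first batch creates the file
>     naa.append(np.ones((50, 64, 64), dtype=np.uint8))    # later batches grow axis 0
> 
> # the result is a regular .npy file and can be memory mapped as usual
> frames = np.load('data.npy', mmap_mode='r')              # shape (150, 64, 64)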
> 
> However, due to current limitations of the .npy format, the code is more 
> complex than it needs to be (the library initializes and checks spare space 
> in the header), and it has to rewrite the header on every append. Both could 
> be made unnecessary with a very small addition to the .npy file format: the 
> data would stay contiguous (no fragmentation!), and there would just need to 
> be a way to indicate that the actual shape of the array should be derived 
> from the file size. 
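> 
> As an illustration of the arithmetic only (not an actual format proposal or 
> numpy API), such a reader could recover the leading dimension like this:
> 
> import numpy as np
> 
> def infer_leading_dim(file_size, header_size, dtype, trailing_shape):
>     # bytes per slice along the (hypothetical) growable axis 0
>     itemsize = np.dtype(dtype).itemsize * int(np.prod(trailing_shape))
>     payload = file_size - header_size
>     assert payload % itemsize == 0, "file size inconsistent with dtype/shape"
>     return payload // itemsize
> 
> # e.g. a float32 array with trailing shape (64, 64) and a 128-byte header:
> # infer_leading_dim(128 + 150*64*64*4, 128, 'float32', (64, 64)) -> 150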
> 
> Best, Michael
> 
> On 24. Aug 2022, at 19:16, Matti Picus <matti.pi...@gmail.com> wrote: 
> 
> Sorry for the late reply. Adding a new "*.npy" format feature to allow 
> writing to the file in chunks is nice but seems a bit limited. As I 
> understand the proposal, reading the file back can only be done in the chunks 
> that were originally written. I think other libraries like zarr or h5py have 
> solved this problem in a more flexible way. Is there a reason you cannot use 
> a third-party library to solve this? I would think if you have an array too 
> large to write in one chunk you will need third-party support to process it 
> anyway.
> 
> Matti
> 


Links:
------
[1] https://ubjson.org/
[2] https://json.nlohmann.me/features/binary_formats/bjdata/
[3]
https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords
[4] https://ubjson.org/#why
[5] https://github.com/neurolabusc/MeshFormatsJS
[6]
https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md
[7] https://github.com/NeuroJSON/jsonmmap/tree/main/lib