Can you give load times for these?
> 8000128 eye5chunk.npy
> 5004297 eye5chunk_bjd_raw.jdb
> 10338 eye5chunk_bjd_zlib.jdb
> 2206 eye5chunk_bjd_lzma.jdb
For my case, I'd be curious about the time it takes to append one 1T-entry file
to another.
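
Something like this minimal sketch (using the files from your example below,
and assuming the jdata/bjdata packages are installed) should be enough to
measure the load times:

import time
import numpy as np
import jdata as jd

# time each loader on the corresponding file; jd.load decodes the BJData files
for name, loader in [('eye5chunk.npy', np.load),
                     ('eye5chunk_bjd_raw.jdb', jd.load),
                     ('eye5chunk_bjd_zlib.jdb', jd.load),
                     ('eye5chunk_bjd_lzma.jdb', jd.load)]:
    t0 = time.perf_counter()
    data = loader(name)
    print(name, round(time.perf_counter() - t0, 4), 'seconds')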
Thanks,
Bill
--
Phobrain.com
On 2022-08-24 20:02, Qianqian Fang wrote:
> I am curious what you and other developers think about adopting JSON/binary
> JSON as a similarly simple, reverse-engineering-able but universally parsable
> array exchange format instead of designing another numpy-specific binary
> format.
>
> I am interested in this topic (as well as thoughts among numpy developers)
> because I am currently working on a project - NeuroJSON
> (https://neurojson.org) - funded by the US National Institute of Health. The
> goal of the NeuroJSON project is to create easy-to-adopt, easy-to-extend, and
> preferably human-readable data formats to help disseminate and exchange
> neuroimaging data (and scientific data in general).
>
> Needless to say, numpy is a key toolkit that is widely used among
> neuroimaging data analysis pipelines. I've seen discussions of potentially
> adopting npy as a standardized way to share volumetric data (as ndarrays),
> such as in this thread
>
> https://github.com/bids-standard/bids-specification/issues/197
>
> however, several limitations were also discussed, for example
>
> 1. npy only supports a single numpy array and does not support other metadata
> or more complex data records (multiple arrays can only be achieved via
> multiple files)
> 2. no internal (i.e. data-level) compression, only file-level compression
> 3. although the file is simple, it still requires a parser to read/write, and
> such a parser is not widely available in other environments, making it mostly
> limited to exchanging data among Python programs
> 4. I am not entirely sure, but I suppose it does not support sparse matrices
> or special matrices (such as diagonal/band/symmetric etc.) - I could be wrong
> though
>
> In the NeuroJSON project, we primarily use JSON and binary JSON
> (specifically, the UBJSON [1] derived BJData [2] format) as the underlying
> data exchange files. Through standardized data annotations [3], we are able to
> address most of the above limitations - the generated files are universally
> parsable in nearly all programming environments with existing parsers,
> support complex hierarchical data and compression, and can readily benefit
> from the large ecosystem of JSON (JSON Schema, JSONPath, JSON-LD, jq, numerous
> parsers, web readiness, NoSQL databases ...).
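>
> For illustration, a small 2x2 identity matrix annotated in the JData style
> could look roughly like the Python dict below (a sketch from memory - the
> authoritative keyword list is in the specification [3]):
>
> # rough sketch of a JData-annotated ndarray (keyword spellings per [3])
> annotated = {
>     "_ArrayType_": "double",             # element type
>     "_ArraySize_": [2, 2],               # dimensions
>     "_ArrayData_": [1.0, 0.0, 0.0, 1.0]  # row-major flattened values
> }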
>
> I understand that simplicity is a key design spec here. I want to highlight
> UBJSON/BJData as a competitive alternative format. It was also designed with
> simplicity as a primary consideration [4], yet it allows storing hierarchical,
> strongly-typed, complex binary data and is easily extensible.
>
> A UBJSON/BJData parser is not necessarily longer than an npy parser; for
> example, the Python reader for the full spec takes only about 500 lines of
> code (including comments), and similarly for the JS parser
>
> https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
> https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
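>
> A quick usage sketch (assuming the bjdata package keeps the dumpb/loadb style
> API of the py-ubjson package it is derived from - please check the repo above
> for the exact names):
>
> import bjdata
>
> # encode a dict of arrays to a BJData byte string and decode it back
> buf = bjdata.dumpb({'node': [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
>                     'face': [[1, 2, 3]]})
> obj = bjdata.loadb(buf)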
>
> We actually did a benchmark [5] a few months back - the test workloads were
> two large 2D numerical arrays (node and face arrays storing surface mesh
> data), and we compared the parsing speed of various formats in Python, MATLAB,
> and JS. The uncompressed BJData (BMSHraw) loads nearly as fast as reading a
> raw binary dump, and the internally compressed BJData (BMSHz) gives the best
> balance between small file size and loading speed; see our results here
>
> https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large
>
> I want to add two quick points to echo the features you desired in npy:
>
> 1. it is not common to use mmap in reading JSON/binary JSON files, but it is
> certainly possible. I recently wrote a JSON-mmap spec [6] and a MATLAB
> reference implementation [7]
>
> 2. UBJSON/BJData natively supports appendable root-level records; JSON has
> been used extensively for data streaming with appendable ND-JSON or
> concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming), as the small
> sketch below shows
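>
> A minimal sketch of append-by-concatenation with plain JSON (ND-JSON style);
> the same idea applies to root-level binary JSON records:
>
> import json
>
> # append a new record at any time, one JSON document per line
> with open('stream.ndjson', 'a') as f:
>     f.write(json.dumps({'chunk': [1, 2, 3]}) + '\n')
>
> # read all records back, one per line
> with open('stream.ndjson') as f:
>     records = [json.loads(line) for line in f]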
>
> Just a quick comparison of output file sizes with a 1000x1000 identity
> matrix:
>
> # python3 -m pip install jdata bjdata
> import numpy as np
> import jdata as jd
>
> x = np.eye(1000)                 # create a large array
> y = np.vsplit(x, 5)              # split into smaller chunks
> np.save('eye5chunk.npy', y)      # save as npy
> jd.save(y, 'eye5chunk_bjd_raw.jdb')                            # uncompressed bjd
> jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression': 'zlib'})  # zlib-compressed bjd
> jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression': 'lzma'})  # lzma-compressed bjd
>
> newy = jd.load('eye5chunk_bjd_zlib.jdb')  # loading/decoding
> newx = np.concatenate(newy)               # regroup chunks
> newx.dtype
>
> here are the output file sizes in bytes:
>
> 8000128 eye5chunk.npy
> 5004297 eye5chunk_bjd_raw.jdb
> 10338 eye5chunk_bjd_zlib.jdb
> 2206 eye5chunk_bjd_lzma.jdb
>
> Qianqian
>
> On 8/24/22 15:48, Michael Siebert wrote:
> Hi Matti, hi all,
>
> @Matti: I don't know exactly what you are referring to (the pull request or
> the GitHub project; links below). Maybe some clarification is needed, which I
> hereby try to provide ;)
>
> A .npy file created by some appending process is a regular .npy file and does
> not need to be read in chunks. Processing arrays larger than the system's
> memory can already be done with memory mapping (numpy.load(...,
> mmap_mode=...)), so no third-party support is needed for that.
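>
> A one-line sketch ('images.npy' is just a placeholder name):
>
> import numpy as np
>
> # maps the file instead of reading it; pages are loaded lazily on access
> arr = np.load('images.npy', mmap_mode='r')
> print(arr.shape, arr.dtype)   # header is parsed, but the data is not yet read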
>
> The idea is not necessarily to only write some known-but-fragmented content
> to a .npy file in chunks or to only handle files larger than the RAM.
>
> It is more about the ability to append to a .npy file at any time and between
> program runs. For example, in our case, we have a large database-like file
> containing all (preprocessed) images of all videos used to train a neural
> network. When new video data arrives, it can simply be appended to the
> existing .npy file. When training the neural net, the data is simply memory
> mapped, which happens basically instantly and does not use extra space
> between multiple training processes. We have tried out various fancy,
> advanced data formats for this task, but most of them don't provide the
> memory-mapping feature, which is very handy for keeping the time required to
> test a code change comfortably low - rather, they have excessive
> parse/decompress times. Other libraries can also be difficult to handle; see
> below.
> The .npy array format is designed to be limited. There is a NEP for it, which
> summarizes the .npy features and concepts very well:
>
> https://numpy.org/neps/nep-0001-npy-format.html
>
> One of my favorite features (besides memory mapping perhaps) is this one:
>
> "... Be reverse engineered. Datasets often live longer than the programs that
> created them. A competent developer should be able to create a solution in
> his preferred programming language to read most NPY files that he has been
> given without much documentation. ..."
>
> This is a big disadvantage with all the fancy formats out there: they require
> dedicated libraries. Some of these libraries don't come with nice and free
> documentation (especially lacking easy-to-use/easy-to-understand code
> examples for the target language, e.g. C) and/or can be extremely complex,
> like HDF5. Yes, HDF5 has its users and is totally valid if one operates the
> world's largest particle accelerator, but we have spent weeks finding some
> C/C++ library for it which does not expose bugs and is somewhat documented. We
> actually failed and posted a bug which was fixed a year later or so. This can
> ruin entire projects - fortunately not ours, but it ate up a lot of time we
> could have spent more meaningfully. On the other hand, I don't see how e.g.
> zarr provides added value over .npy if one only needs the .npy features and
> maybe some append-data-along-one-axis feature. Yes, maybe there are some uses
> for two or three appendable axes, but I think having one axis to append to
> should cover a lot of use cases: this axis is typically time: video, audio,
> GPS, signal data in general, binary log data, "binary CSV" (lines in a file) -
> all of those only need one axis to append to.
>
> The .npy format is so simple that it can be read, e.g. in C, in a few lines,
> or accessed easily through NumPy and ctypes via pointers for high-speed custom
> logic - not requiring any libraries besides NumPy.
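>
> To illustrate the point, a minimal reader sketch in Python (version 1.0
> headers only; np.load is of course the real thing):
>
> import ast
> import numpy as np
>
> def read_npy(path):
>     with open(path, 'rb') as f:
>         assert f.read(6) == b'\x93NUMPY'             # magic string
>         major, minor = f.read(1)[0], f.read(1)[0]    # format version
>         hlen = int.from_bytes(f.read(2), 'little')   # header length (v1.x)
>         header = ast.literal_eval(f.read(hlen).decode('latin1'))
>         data = np.frombuffer(f.read(), dtype=np.dtype(header['descr']))
>         return data.reshape(header['shape'],
>                             order='F' if header['fortran_order'] else 'C')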
>
> Making .npy appendable is easy to implement. Yes, appending along only one
> axis is limited, as is the .npy format itself. But I consider that a feature
> rather than an (actual) limitation, as it allows for fast and simple appends.
>
> The question is if there is some support for an
> append-to-.npy-files-along-one-axis feature in the Numpy community and if so,
> about the details of the actual implementation. I made one suggestion in
>
> https://github.com/numpy/numpy/pull/20321/
>
> and I offer to invest time to update/modify/finalize the PR. I've also
> created a library that can already append to .npy:
>
> https://github.com/xor2k/npy-append-array
>
> However, due to current limitations in the .npy format, the code is more
> complex than it could actually be (the library initializes and checks spare
> space in the header) and it needs to rewrite the header every time. Both
> could be made unnecessary with a very small addition to the .npy file format.
> Data would stay contiguous (no fragmentation!); there would just need to be a
> way to indicate that the actual shape of the array should be derived from the
> file size.
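>
> A rough sketch of the idea (not the actual npy-append-array code; the
> 128-byte header size is just an assumption for a small v1.0 header):
>
> import os
> import numpy as np
>
> HEADER_SIZE = 128  # assumed fixed-size (padded) header
>
> def append_rows(path, rows):
>     # rows must match the dtype and trailing dimensions of the stored array;
>     # the header is intentionally not rewritten here
>     with open(path, 'ab') as f:
>         f.write(np.ascontiguousarray(rows).tobytes())
>
> def rows_in_file(path, row_nbytes):
>     # with the proposed format addition, a reader would derive the leading
>     # dimension from the file size instead of the header's shape field
>     return (os.path.getsize(path) - HEADER_SIZE) // row_nbytes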
>
> Best, Michael
>
> On 24. Aug 2022, at 19:16, Matti Picus <matti.pi...@gmail.com> wrote:
>
> Sorry for the late reply. Adding a new "*.npy" format feature to allow
> writing to the file in chunks is nice but seems a bit limited. As I
> understand the proposal, reading the file back can only be done in the chunks
> that were originally written. I think other libraries like zarr or h5py have
> solved this problem in a more flexible way. Is there a reason you cannot use
> a third-party library to solve this? I would think that if you have an array
> too large to write in one chunk, you will need third-party support to process
> it anyway.
>
> Matti
>
Links:
------
[1] https://ubjson.org/
[2] https://json.nlohmann.me/features/binary_formats/bjdata/
[3] https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords
[4] https://ubjson.org/#why
[5] https://github.com/neurolabusc/MeshFormatsJS
[6] https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md
[7] https://github.com/NeuroJSON/jsonmmap/tree/main/lib