>> For my case, I'd be curious about the time to add one 1T-entries file to
>> another.
> as I mentioned in the previous reply, bjdata is appendable [3], so you can
> simply append another array (or a slice) to the end of the file.
I'm thinking of numerical ops here, e.g. adding an array to itself would
double the values but not the size.
---
--
Phobrain.com
On 2022-08-25 14:41, Qianqian Fang wrote:
> To avoid derailing the other thread [1] on extending .npy files, I am going
> to start a new thread on alternative array storage file formats using binary
> JSON - in case there is such a need and interest among numpy users
>
> specifically, I want to first follow up with Bill's question below regarding
> loading time
>
> On 8/25/22 11:02, Bill Ross wrote:
>
>> Can you give load times for these?
>
> as I mentioned in the earlier reply to Robert, the most memory-efficient
> (i.e. fast-loading, disk-mmap-able) but not necessarily disk-efficient
> (i.e. it may produce the largest data file sizes) way to store an ND array
> in BJData is to use BJData's ND-array container.
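A minimal sketch of that ND-array construct, built by hand for float64 arrays (the marker choices 'D' for float64 and 'l' for int32, and the little-endian byte order, are my reading of the BJData spec draft, not output from the bjdata module):

```python
import struct
import numpy as np

def bjdata_ndarray_record(arr):
    """Sketch of a BJData ND-array record for a float64 array:
    '[' '$' 'D' '#' '[' <dims> ']' <raw little-endian data>.
    The record is self-delimiting (the count header gives the size),
    so no closing ']' follows the data payload."""
    arr = np.ascontiguousarray(arr, dtype="<f8")
    # each dimension encoded as 'l' (int32 marker) + 4 little-endian bytes
    dims = b"".join(b"l" + struct.pack("<i", d) for d in arr.shape)
    return b"[$D#[" + dims + b"]" + arr.tobytes()
```

Note that the array payload is a single contiguous buffer at a fixed offset past the header, which is what makes memory-mapped access feasible.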
>
> I have to admit that both the jdata and bjdata modules have not been
> extensively optimized for speed. with the current implementation, here are
> the loading times for a large identity matrix (eye(10000))
>
> a BJData file storing a single eye(10000) array using the ND-array container
> can be downloaded from here [2] (file size: 1MB zipped; decompressed, it is
> ~800MB, the same as the npy file) - this file was generated with the matlab
> encoder, but can be loaded in Python (see the reply to Robert below).
>
> 800000128 eye1e4.npy
> 800000014 eye1e4_bjd_raw_ndsyntax.jdb
> 813721 eye1e4_bjd_zlib.jdb
> 113067 eye1e4_bjd_lzma.jdb
>
> the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy
> 1.19.5) for each file is listed below:
>
> 0.179s eye1e4.npy (mmap_mode=None)
> 0.001s eye1e4.npy (mmap_mode=r)
> 0.718s eye1e4_bjd_raw_ndsyntax.jdb
> 1.474s eye1e4_bjd_zlib.jdb
> 0.635s eye1e4_bjd_lzma.jdb
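The npy timings above can be reproduced approximately with a sketch like the following (a smaller eye(1000) is used here to keep it quick; timing the .jdb files would additionally require the jdata/bjdata modules):

```python
import os
import tempfile
import time
import numpy as np

# write a small identity matrix to an .npy file
path = os.path.join(tempfile.mkdtemp(), "eye1e3.npy")
np.save(path, np.eye(1000))

t0 = time.perf_counter()
full = np.load(path)                     # reads the whole file into memory
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
lazy = np.load(path, mmap_mode="r")      # maps pages lazily; near-instant
t_mmap = time.perf_counter() - t0
```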
>
> unsurprisingly, mmapped loading is the fastest option; it is true that the
> raw bjdata file is about 4x slower to load than the npy file, but given that
> the main chunk of the data is stored identically (as a contiguous buffer), I
> suppose that with some optimization of the decoder the gap between the two
> can be substantially narrowed. The longer loading times for zlib/lzma (and
> similarly the saving times) reflect a trade-off between smaller file sizes
> and time spent on compression/decompression/disk IO.
>
> by no means am I saying the binary JSON format is ready to deliver better
> speed with its current non-optimized implementation. I just want to bring
> attention to this class of formats, and highlight that their flexibility
> offers abundant mechanisms for fast, disk-mapped IO, while allowing
> additional benefits such as compression, unlimited metadata for future
> extensions, etc.
>
>>> 8000128 eye5chunk.npy
>>> 5004297 eye5chunk_bjd_raw.jdb
>>> 10338 eye5chunk_bjd_zlib.jdb
>>> 2206 eye5chunk_bjd_lzma.jdb
>>
>> For my case, I'd be curious about the time to add one 1T-entries file to
>> another.
>
> as I mentioned in the previous reply, bjdata is appendable [3], so you can
> simply append another array (or a slice) to the end of the file.
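A rough illustration of that appendability (the byte records below are hand-built stand-ins for serialized BJData arrays; in practice one would presumably write them through the bjdata module's encoder):

```python
import os
import tempfile

# two tiny hand-built BJData-style records: '[$U#[U<n>]' + n uint8 values
# (illustrative bytes only; real records come from a BJData encoder)
rec_a = b"[$U#[U\x03]" + bytes([1, 2, 3])
rec_b = b"[$U#[U\x02]" + bytes([4, 5])

path = os.path.join(tempfile.mkdtemp(), "demo.jdb")
with open(path, "wb") as f:      # initial write
    f.write(rec_a)
with open(path, "ab") as f:      # later: append a second array record
    f.write(rec_b)

# because records are self-delimiting, the resulting file is simply the
# concatenation of valid records - no header rewrite is needed
with open(path, "rb") as f:
    data = f.read()
assert data == rec_a + rec_b
```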
>
>> Thanks,
>> Bill
>
> also related, Re @Robert's question below
>
>> Are any of them supported by a Python BJData implementation? I didn't see
>> any option to get that done in the `bjdata` package you recommended, for
>> example.
>> https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200
>
> the bjdata module currently only supports the ND-array syntax in the
> decoder [4] (i.e. it maps such a buffer to a numpy.ndarray) - it should be
> relatively trivial to add it to the encoder as well.
>
> on the other hand, the annotated format is currently supported. one can call
> the jdata module (responsible for annotation-level encoding/decoding) as
> shown in my sample code, then call bjdata internally for data serialization.
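A stdlib-only sketch of what that annotated construct looks like (the `_ArrayType_`/`_ArraySize_`/`_ArrayZipType_`/`_ArrayZipData_` keys follow the JData spec; the exact output of the jdata module may differ in detail):

```python
import base64
import json
import struct
import zlib

# a 2x2 identity matrix, row-major, as the raw little-endian double buffer
values = [1.0, 0.0, 0.0, 1.0]
raw = struct.pack("<4d", *values)

# the JData annotated-array construct: a plain dict, so it serializes to
# either text JSON (json.dumps) or binary JSON (bjdata.dump) unchanged
ann = {
    "_ArrayType_": "double",
    "_ArraySize_": [2, 2],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [1, 4],
    "_ArrayZipData_": base64.b64encode(zlib.compress(raw)).decode(),
}
text = json.dumps(ann)

# decoding reverses the steps: parse, base64-decode, inflate, unpack
back = struct.unpack(
    "<4d", zlib.decompress(base64.b64decode(json.loads(text)["_ArrayZipData_"]))
)
```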
>
>> Okay. Given your wording, it looked like you were claiming that the binary
>> JSON was supported by the whole ecosystem. Rather, it seems like you can
>> either get binary encoding OR the ecosystem support, but not both at the
>> same time.
>
> all in relative terms of course - JSON has ~100 parsers listed on its
> website [5], MessagePack - another flavor of binary JSON - lists ~50-60
> parsers [6], and UBJSON lists ~20 parsers [7]. I am not familiar with npy
> parsers, but googling returns only a few.
>
> also, most binary JSON implementations provide tools to convert to JSON and
> back, so, in that sense, whatever JSON has in its ecosystem can
> "potentially" be used for binary JSON files if one wants to. there are also
> recent publications comparing the differences between various binary JSON
> formats, in case anyone is interested:
>
> https://github.com/ubjson/universal-binary-json/issues/115
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: bross_phobr...@sonic.net
Links:
------
[1] https://mail.python.org/archives/list/numpy-discussion@python.org/thread/A4CJ2DZCAKPMD2MYGVMDV5UI7FN4SBVI/
[2] http://neurojson.org/wiki/upload/eye1e4_bjd_raw_ndsyntax.jdb.zip
[3] https://github.com/NeuroJSON/bjdata/blob/master/images/BJData_Diagram.pdf
[4] https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/decoder.py#L360-L365
[5] https://www.json.org/json-en.html
[6] https://msgpack.org/index.html
[7] https://ubjson.org/libraries/