I've worked quite a bit with tensor data recently, and `arrow.Tensor` (or
just the underlying FixedSizeList) has worked well for me as an in-memory
representation.
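
For concreteness, here is a minimal sketch of what I mean (the shapes,
dtypes, and variable names are just placeholders):

```
import numpy as np
import pyarrow as pa

# Hypothetical (N, 4096) block of spectra, contiguous in memory.
spectra = np.random.rand(1_000, 4_096).astype(np.float32)

# Wrap the flat buffer as a FixedSizeList<float32>[4096] array
# (zero-copy for the underlying values).
flat = pa.array(spectra.reshape(-1))
fsl = pa.FixedSizeListArray.from_arrays(flat, 4_096)

# Getting back to numpy is also cheap: flatten() exposes the child values array.
roundtrip = fsl.flatten().to_numpy(zero_copy_only=True).reshape(-1, 4_096)
```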

> If you compress it, you have no means to decompress individual chunks,
from what I can tell from prototyping within Python.

Correct.  This is more of a storage concern and less of an in-memory
concern, correct?  Or are you hoping to have these tensors compressed in
memory?

> You also cannot attach metadata to it.

Can you create a table with however many columns of metadata you want and
one tensor column?
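
Something along these lines should work (a sketch; the column names here
are made up):

```
import numpy as np
import pyarrow as pa

spectra = np.random.rand(1_000, 4_096).astype(np.float32)
spectrum_col = pa.FixedSizeListArray.from_arrays(pa.array(spectra.reshape(-1)), 4_096)

table = pa.table({
    "sample_id": pa.array(np.arange(1_000)),
    "exposure_ms": pa.array(np.full(1_000, 10.0)),  # placeholder metadata column
    "spectrum": spectrum_col,
})

# Key/value metadata can also be attached at the schema level if needed.
table = table.replace_schema_metadata({"instrument": "example"})
```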

> I do have an associated Table as each spectrum has metadata

Yes, this is how I would expect metadata to be represented.

> but if I split up the spectra to one per row, I end up with 10s of
millions of individual `numpy.ndarray` objects which causes a lot of
performance issues.

I'm not sure I understand what you are after here.  Normally I would
process such a table in batches, and each batch row by row.  For each row I
would convert to a `numpy.ndarray`, do whatever calculation I need to do,
and then convert back to a Tensor.  After generating a batch of tensors I
would yield the batch and move on to the next one.  This kind of streaming
execution should avoid having tens of millions of `numpy.ndarray` objects
alive at once.
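
Roughly what I have in mind, as a sketch (assuming a FixedSizeList column
named "spectrum" and a placeholder calculation in place of your real one):

```
import numpy as np
import pyarrow as pa

WIDTH = 4_096  # spectrum length; adjust to your data

def process_stream(reader):
    """Stream record batches, transform each batch's spectra, yield new batches."""
    spectrum_idx = reader.schema.get_field_index("spectrum")
    for batch in reader:
        # One (num_rows, WIDTH) ndarray per batch instead of one ndarray per row.
        values = batch.column(spectrum_idx).flatten().to_numpy(zero_copy_only=False)
        spectra = values.reshape(-1, WIDTH)

        result = spectra * 2.0  # placeholder for the real per-spectrum calculation

        new_col = pa.FixedSizeListArray.from_arrays(pa.array(result.reshape(-1)), WIDTH)
        arrays = [batch.column(i) for i in range(batch.num_columns)]
        arrays[spectrum_idx] = new_col
        yield pa.RecordBatch.from_arrays(arrays, names=batch.schema.names)
```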

> I took a look at breaking up the array into a list of RecordBatch and
`RecordBatchStreamReader` doesn't seem to allow you to read only selected
indices, so no real chunking support.

What does your access pattern look like?  If you are planning on processing
all of the data at once, you can do something like:

```
# my_reader: e.g. a pyarrow.ipc.RecordBatchStreamReader
for batch in my_reader:
    for row in range(batch.num_rows):
        ...  # process the row, e.g. via batch.slice(row, 1)
```

However, if you need to jump in and modify a small subset of the total data
then that isn't going to work so well.


On Tue, Apr 23, 2024 at 10:45 AM Robert McLeod <robbmcl...@gmail.com> wrote:

> Hi everyone,
>
> For a project I'm working on I've picked Arrow as the library and either
> Feather or Parquet as our storage format for our tabular data. However, I
> also have some hyperspectral data to serialize and I'd prefer not to add
> another big dependency if I can avoid it so I've been trying to make
> something in Arrow work for my application. Typically our
> hyperspectral data is [N, 4096]-shaped, where N is in the tens of millions.
>
> Initially I looked at `arrow.Tensor` via the IPC module but it seems a bit
> limited. You can memory-map it, if it's uncompressed. If you compress it,
> you have no means to decompress individual chunks, from what I can tell
> from prototyping within Python. You also cannot attach metadata to it.
>
> I do have an associated Table as each spectrum has metadata, but if I
> split up the spectra to one per row, I end up with 10s of millions of
> individual `numpy.ndarray` objects which causes a lot of performance
> issues. The data is contiguous, but I would have to write some C-extension
> to slice and view the data (which would be a pain to manage the reference
> counting) and there's still no means to partially load the data.
>
> I could create a Table with one column per chunk and one cell per column.
> This is clunky.
>
> I took a look at breaking up the array into a list of RecordBatch and
> `RecordBatchStreamReader` doesn't seem to allow you to read only selected
> indices, so no real chunking support.
>
> Or is there some other lightweight (not HDF5), cloud-friendly solution
> that I should be looking at?
>
> Sincerely,
> Robert
>
> --
> Robert McLeod
> robbmcl...@gmail.com
> robert.mcl...@hitachi-hightech.com
>
>
