I've worked quite a bit with tensor data recently, and `arrow.Tensor` (or just the underlying FixedSizeList) has worked well for me as an in-memory representation.
> If you compress it, you have no means to decompress individual chunks,
> from what I can tell from prototyping within Python.

Correct. This is more of a storage concern and less of an in-memory
concern, correct? Or are you hoping to have these tensors compressed in
memory?

> You also cannot attach metadata to it.

Can you create a table with however many columns of metadata you want and
one tensor column?

> I do have an associated Table as each spectrum has metadata

Yes, this is how I would expect metadata to be represented.

> but if I split up the spectra to one per row, I end up with 10s of
> millions of individual `numpy.ndarray` objects which causes a lot of
> performance issues.

I'm not sure I understand what you are after here. Normally I would
process such a table in batches, and then process each batch row by row.
For each row I would convert to `numpy.ndarray`, do whatever calculation I
need to do, and then convert back to Tensor. After generating a batch of
tensors I yield the batch and move on to the next batch. This kind of
streaming execution should avoid having tens of millions of
`numpy.ndarray` objects alive at once.

> I took a look at breaking up the array into a list of RecordBatch and
> `RecordBatchStreamReader` doesn't seem to allow you to read only selected
> indices, so no real chunking support.

What does your access pattern look like? If you are planning on processing
all of the data at once you can do:

```
for batch in my_reader:
    for row in range(batch.num_rows):
        ...
```

However, if you need to jump in and modify a small subset of the total
data then that isn't going to work so well.

On Tue, Apr 23, 2024 at 10:45 AM Robert McLeod <robbmcl...@gmail.com> wrote:

> Hi everyone,
>
> For a project I'm working on I've picked Arrow as the library and either
> Feather or Parquet as our storage format for our tabular data.
> However, I also have some hyperspectral data to serialize, and I'd
> prefer not to add another big dependency if I can avoid it, so I've been
> trying to make something in Arrow work for my application. Typically our
> hyperspectral data is [N, 4096]-shaped, where N is in the tens of
> millions.
>
> Initially I looked at `arrow.Tensor` via the IPC module, but it seems a
> bit limited. You can memory-map it, if it's uncompressed. If you
> compress it, you have no means to decompress individual chunks, from
> what I can tell from prototyping within Python. You also cannot attach
> metadata to it.
>
> I do have an associated Table as each spectrum has metadata, but if I
> split up the spectra to one per row, I end up with 10s of millions of
> individual `numpy.ndarray` objects, which causes a lot of performance
> issues. The data is contiguous, but I would have to write some
> C extension to slice and view the data (which would make the reference
> counting a pain to manage), and there's still no means to partially load
> the data.
>
> I could create a Table with one column per chunk and one cell per
> column. This is clunky.
>
> I took a look at breaking up the array into a list of RecordBatch, and
> `RecordBatchStreamReader` doesn't seem to allow you to read only
> selected indices, so no real chunking support.
>
> Or is there some other lightweight (not HDF5), cloud-friendly solution
> that I should be looking at?
>
> Sincerely,
> Robert
>
> --
> Robert McLeod
> robbmcl...@gmail.com
> robert.mcl...@hitachi-hightech.com