Hi,
I'm trying to figure out whether pyArrow could efficiently store and slice
large Python dictionaries that contain NumPy arrays of variable length, e.g.
import numpy as np

x = {
    'field1': np.array([0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7]),
    'field2': np.array([0.3, 0.5, 0.1]),
    'field3': np.array([0.9, np.nan, np.nan, 0.1, 0.5])
}
Arrow seems to be designed around Tables, but I was wondering whether there's
a way to do this (probably not with a Table or RecordBatch, because those
require all columns to have the same length).
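To make the question more concrete, here is a minimal sketch of the kind of
layout I have in mind, using the dictionary x from above: one Arrow array per
field, since the lengths differ (this is just my guess; I don't know whether
it's the intended way to use pyArrow):

import pyarrow as pa

# One array per field (just my assumption about the layout), since a
# Table/RecordBatch would need all columns to have the same length
arrays = {name: pa.array(vec, type=pa.float64()) for name, vec in x.items()}
print(arrays['field2'])   # a 3-element DoubleArray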
The vector behind each dictionary key would have on the order of 1e4 - 1e9
elements. There are some NaN gaps in the data (which I guess would map nicely
onto Arrow's null elements), and, above all, many repeated values that make
the data quite compressible.
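As an example of what I mean by the NaN gaps, I assume they could be turned
into Arrow nulls on conversion; if I read the docs correctly, from_pandas=True
treats NaN as null:

import numpy as np
import pyarrow as pa

field3 = np.array([0.9, np.nan, np.nan, 0.1, 0.5])

# from_pandas=True tells pyarrow to treat NaN as null instead of a float value
arr = pa.array(field3, from_pandas=True)
print(arr.null_count)   # 2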
Apart from writing that data to disk quickly and with compression, I also
need to slice it efficiently, e.g. (pseudocode for the kind of access I'd like):
fp = open('file', 'r')
v = fp['field1'][1000:5000]
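For what it's worth, the closest I can imagine with the pyArrow API looks
something like the sketch below. The one-file-per-field layout and the file
name are just my assumptions, and I don't know whether this is actually the
efficient way to do it (in particular, whether memory mapping still avoids a
full read when the file is compressed):

import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

# Stand-in for one of the big vectors; one file per field is just my assumption
field1 = pa.array(np.random.rand(100_000))
feather.write_feather(pa.table({'field1': field1}),
                      'field1.feather', compression='zstd')

# Read back and take rows 1000..4999, ideally without loading the whole column
tbl = feather.read_table('field1.feather', memory_map=True)
v = tbl.column('field1').slice(1000, 4000)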
Is this something that can be done with pyArrow?
Kind regards,
Ramon.