Hi,
I'm trying to figure out whether pyArrow could efficiently store and slice
large Python dictionaries that contain NumPy arrays of variable length, e.g.
import numpy as np

x = {
    'field1': np.array([0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7]),
    'field2': np.array([0.3, 0.5, 0.1]),
    'field3': np.array([0.9, np.nan, np.nan, 0.1, 0.5])
}
Arrow seems to be designed around Tables, but I was wondering whether there's
a way to do this (probably not with a Table or RecordBatch, because those
require all columns to have the same length).
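To make the question more concrete, here is a minimal sketch of the kind of
layout I have in mind, using the dictionary x from above: one Arrow array per
field, since the lengths differ (this is just my guess; I don't know whether
it's the intended way to use pyArrow):

import pyarrow as pa

# One array per field (just my assumption about the layout), since a
# Table/RecordBatch would need all columns to have the same length
arrays = {name: pa.array(vec, type=pa.float64()) for name, vec in x.items()}
print(arrays['field2'])   # a 3-element DoubleArray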
The vector behind each dictionary key would have on the order of 1e4 - 1e9
elements. There are some NaN gaps in the data (which I guess would map nicely
onto Arrow's null elements), and, above all, many repeated values that make
the data quite compressible.
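As an example of what I mean by the NaN gaps, I assume they could be turned
into Arrow nulls on conversion; if I read the docs correctly, from_pandas=True
treats NaN as null:

import numpy as np
import pyarrow as pa

field3 = np.array([0.9, np.nan, np.nan, 0.1, 0.5])

# from_pandas=True tells pyarrow to treat NaN as null instead of a float value
arr = pa.array(field3, from_pandas=True)
print(arr.null_count)   # 2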
Apart from writing that data to disk quickly and with compression, I also
need to slice it efficiently, e.g. (pseudocode for the kind of access I'd like):
fp = open('file', 'r')
v = fp['field1'][1000:5000]
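For what it's worth, the closest I can imagine with the pyArrow API looks
something like the sketch below. The one-file-per-field layout and the file
name are just my assumptions, and I don't know whether this is actually the
efficient way to do it (in particular, whether memory mapping still avoids a
full read when the file is compressed):

import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

# Stand-in for one of the big vectors; one file per field is just my assumption
field1 = pa.array(np.random.rand(100_000))
feather.write_feather(pa.table({'field1': field1}),
                      'field1.feather', compression='zstd')

# Read back and take rows 1000..4999, ideally without loading the whole column
tbl = feather.read_table('field1.feather', memory_map=True)
v = tbl.column('field1').slice(1000, 4000)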
Is this something that can be done with pyArrow?
Kind regards,
Ramon.