Hi Jacek,

Thanks for your reply, but it looks like that would be a complicated workaround. I have been looking some more, and it seems HDF5 would be a good file format for this problem: it natively supports slicing like fp['field1'][1000:5000], provides chunking and compression, and new arrays can be appended to an existing file.
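For example, a minimal sketch with h5py (file name is illustrative, and I haven't run this):

import numpy as np
import h5py

# Write: one chunked, compressed, resizable dataset per field.
with h5py.File('data.h5', 'w') as fp:
    fp.create_dataset('field1', data=np.random.rand(8),
                      chunks=True, compression='gzip', maxshape=(None,))

# Append later by resizing the dataset in place.
with h5py.File('data.h5', 'a') as fp:
    ds = fp['field1']
    extra = np.random.rand(5)
    ds.resize(ds.shape[0] + len(extra), axis=0)
    ds[-len(extra):] = extra

# Read back just a slice, without loading the whole array.
with h5py.File('data.h5', 'r') as fp:
    v = fp['field1'][2:6]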
Maybe Arrow is just not the right tool for this specific problem. (For the record, I have appended a quick sketch of how I understand your concat_tables suggestion would round-trip, after the quoted thread below.)

Kind regards,

Ramon.

On Wed, 23 Nov 2022 at 15:54, Jacek Pliszka <[email protected]> wrote:

> Hi!
>
> I am not sure if this would solve your problem:
>
> pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f', [len(v)*[f]]) for f, v in x.items()])
>
> pyarrow.Table
> v: double
> f: string
> ----
> v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
> f: [["field1","field1","field1","field1","field1","field1","field1","field1"],["field2","field2","field2"],["field3","field3","field3","field3","field3"]]
>
> The f column should compress very well, or you can make it a dictionary from the start.
>
> To get the data back you can do a couple of things: take with pc.equal, to_batches, or groupby.
>
> BR
>
> Jacek
>
> On Wed, 23 Nov 2022 at 13:12, Ramón Casero Cañas <[email protected]> wrote:
> >
> > Hi,
> >
> > I'm trying to figure out whether PyArrow could efficiently store and slice large Python dictionaries that contain numpy arrays of variable length, e.g.
> >
> > x = {
> >     'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
> >     'field2': [0.3, 0.5, 0.1],
> >     'field3': [0.9, NaN, NaN, 0.1, 0.5]
> > }
> >
> > Arrow seems to be designed for tables, but I was wondering whether there's a way to do this (probably not with a Table or RecordBatch, because those require columns of the same length).
> >
> > The vector in each dictionary key would have on the order of 1e4 - 1e9 elements. There are some NaN gaps in the data (which would map well to Arrow's null elements, I guess), but above all there are many repeated values, which makes the data quite compressible.
> >
> > Apart from writing that data to disk quickly and with compression, I need to slice it efficiently, e.g.
> >
> > fp = open('file', 'r')
> > v = fp['field1'][1000:5000]
> >
> > Is this something that can be done with PyArrow?
> >
> > Kind regards,
> >
> > Ramon.
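P.S. Here is the sketch I mentioned above: a minimal, untested round-trip of your concat_tables suggestion, using Parquet as the compressed on-disk form (that part is my own assumption; the file name is illustrative):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

x = {
    'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
    'field2': [0.3, 0.5, 0.1],
    'field3': [0.9, float('nan'), float('nan'), 0.1, 0.5],
}

# One long table: values in 'v', originating field name in 'f'.
t = pa.concat_tables([
    pa.Table.from_pydict({'v': v}).append_column('f', [len(v) * [f]])
    for f, v in x.items()
])

# Write with compression; the repetitive 'f' column should compress well.
pq.write_table(t, 'data.parquet', compression='zstd')

# Read back and slice one field: filter rows on 'f', then slice by position.
t = pq.read_table('data.parquet')
field1 = t.filter(pc.equal(t['f'], 'field1'))
v = field1.slice(2, 4)['v']  # rows 2..5 of field1, as a ChunkedArray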
