Hi!
I recently learned about Apache Arrow, and as a preliminary study I would
like to know whether it could be a good choice for my use case, or whether
I should look for another technology (or craft something specific on my own!).
I could not really find answers to my questions in the FAQ or in articles
and blogs, but I may have missed something, so I apologize in advance if
these questions have already been answered.
Arrow is all about storing columnar data. What can the elements of a
column contain?
In my case, I have scalar values (numbers), 1D arrays, and 2D arrays.
The 2D arrays can be quite big (4000x4000 float32, for example).
So we could imagine long tables, hundreds of thousands of rows, containing
a mix of those data types.
I wonder whether Arrow stays efficient for this kind of data. In
particular, I suspect that rows holding 2D arrays in a column may be hard
to handle with the same level of optimization (just guessing).
Is there any compression in Arrow? I am thinking of Blosc-style
compression (as in the now-dead "bcolz" project; by the way, someone
already wondered about Arrow + Blosc:
https://github.com/Blosc/bcolz/issues/300).
Another use case I have is letting multiple processes on the same
computer access the Arrow in-memory store; it seems to me that Plasma
does this job, but I wonder about the trade-offs.
Thanks in advance for your advice - any help would be highly appreciated!
Cheers,
Matias.