Hi Tyler,

I am not sure the Arrow Java libraries have yet been used for interacting with larger-than-memory datasets, but this would be a good opportunity to try to get that working.
In the C++ libraries, any Arrow data structure can easily reference memory-mapped data on disk; none of the data needs to be in memory (a minimal sketch is appended below the quoted message).

There has been some discussion of adding binary, string, and list types with 64-bit offsets for extremely large values: https://issues.apache.org/jira/browse/ARROW-750. Adding this to the columnar format seems inevitable, so the fact that it isn't there now does not mean it is out of scope.

Thanks,
Wes

On Thu, May 10, 2018 at 4:31 PM, Martin Durant <martin.dur...@utoronto.ca> wrote:
> This is not directly relevant here, but has anyone looked into oamap
> (https://github.com/diana-hep/oamap), which is capable of using numba to
> compile Python functions that traverse nested data structures down to the
> basic leaf nodes, without creating intermediate Python objects? Then the
> person doing the analysis may not need to go to C++ at all. oamap has POC
> loaders for Arrow and Parquet, but its original focus was ROOT, from the
> high-energy physics world.
>
> —
> Martin Durant
> martin.dur...@utoronto.ca
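P.S. For concreteness, here is a minimal sketch of the memory-mapped read path in C++. It assumes a recent Arrow release with the Result-based API (exact signatures have shifted across versions), so treat it as illustrative rather than canonical:

```cpp
// Minimal sketch: read an Arrow IPC file via mmap, so array buffers
// reference the on-disk pages directly instead of heap allocations.
#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <iostream>
#include <memory>
#include <string>

arrow::Status SumRows(const std::string& path) {
  // Map the file instead of reading it; nothing is copied into process memory.
  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::io::MemoryMappedFile> file,
      arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));

  // The IPC reader parses only metadata eagerly; record batch buffers
  // continue to point into the mapped region.
  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::ipc::RecordBatchFileReader> reader,
      arrow::ipc::RecordBatchFileReader::Open(file));

  int64_t rows = 0;
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::RecordBatch> batch,
                          reader->ReadRecordBatch(i));
    // The row count comes from batch metadata; column data stays on disk
    // until something actually dereferences it.
    rows += batch->num_rows();
  }
  std::cout << "total rows: " << rows << std::endl;
  return arrow::Status::OK();
}
```

Because the IPC file format lays buffers out at aligned offsets, no deserialization is needed to reconstitute the arrays; the OS page cache decides which parts of the file are actually resident at any time.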