Hi,
    I'm trying to implement a data management system in Python with
Arrow Flight. The well-designed Dataset and filesystem APIs make the
data management even simpler.
    But I'm facing a problem: reading a range of rows from a dataset.
Consider a dataset stored in Feather format with 1 million rows on a
remote file system (e.g. S3). A client connects to multiple Flight
servers to load the data in parallel (e.g. 2 servers, one does do_get
from the head to the midpoint, the other from the midpoint to the
end), or it simply wants to load the last 500 records.
    In this case, the server needs to skip reading the leading records
entirely, for reasons of network bandwidth and memory limits, rather
than transferring them, or loading them into memory and then
discarding them.
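    For what it's worth, here is a minimal server-side sketch of what
I mean, assuming the ticket carries a record-batch range and the data
lives in an IPC/Feather file; the JSON ticket layout and the file path
are my own assumptions, not an existing convention:

    import json

    import pyarrow as pa
    import pyarrow.flight as flight
    import pyarrow.ipc as ipc

    class RangeFlightServer(flight.FlightServerBase):
        # A do_get that reads only the requested record batches, using
        # the IPC file footer for random access instead of scanning
        # from the top of the file.
        def do_get(self, context, ticket):
            # assumed ticket payload: {"path": ..., "start": 0, "end": 8}
            req = json.loads(ticket.ticket.decode())
            reader = ipc.open_file(req["path"])  # reads only the footer
            batches = [reader.get_batch(i)
                       for i in range(req["start"], req["end"])]
            table = pa.Table.from_batches(batches, schema=reader.schema)
            return flight.RecordBatchStream(table)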
    I think modern storage formats have an advantage over CSV in
locating a specific range of records within a file, since CSV has to
be scanned line by line without any index. I also found the
fragment-related APIs in the dataset module, but not much
documentation on them (maybe they are more related to partitioning?).
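    For instance, loading the last 500 records of a Feather/IPC file
only needs the footer plus the tail batches, something like this (the
file name is just an assumption for illustration):

    import pyarrow as pa
    import pyarrow.ipc as ipc

    reader = ipc.open_file("data.feather")  # footer indexes every batch
    n_last = 500
    batches, got = [], 0
    for i in reversed(range(reader.num_record_batches)):
        batch = reader.get_batch(i)  # only tail batches are actually read
        batches.insert(0, batch)
        got += batch.num_rows
        if got >= n_last:
            break
    # trim any extra leading rows from the earliest batch we pulled in
    table = pa.Table.from_batches(batches).slice(max(0, got - n_last))

The granularity is per record batch, though; a row-level offset inside
a batch still means reading that batch and slicing afterwards.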
    So here is my proposal: add "limit" and "offset" to ScannerOption,
to the available compute functions, and to Acero, since this is a very
common operation in SQL as well.
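    Roughly the usage I have in mind (the "offset" and "limit" scanner
options below are hypothetical, part of the proposal rather than an
existing pyarrow API):

    import pyarrow.dataset as ds

    dataset = ds.dataset("s3://bucket/table", format="feather")
    # "offset" and "limit" are the proposed options, not yet real kwargs
    scanner = ds.Scanner.from_dataset(dataset,
                                      offset=500_000, limit=500_000)
    table = scanner.to_table()  # the scan would skip the first 500k rows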
    But I realize that implementing "limit" and "offset" only as
compute functions would have little effect on my situation, since the
Arrow compute functions accept arrays/scalars as input, by which point
the loading has already happened. "Limit and offset" in the dataset's
ScannerOption may need a dedicated implementation rather than directly
calling compute to filter. Furthermore, Acero may also benefit from
this feature for its scan/sink nodes.
    Or are there any other ideas for this situation?
--
---------------------
Best Regards,
Wenbo Hu,
