Hello Juan,

I don't think there's any support for skipping rows in the scanner.

If you are trying to distribute scans across processes, have you considered
splitting the work by dataset fragment? A FileSystemDataset is split into
one FileFragment per file, and Parquet file fragments can be split further
into row-group fragments. Does that seem like it could fit your use case?

Best,

Will

On Mon, Mar 21, 2022 at 12:19 PM Juan Galvez <[email protected]> wrote:

> Dear Arrow developers,
>
> I was wondering if it's possible to use the scanner API to read batches
> starting from a certain row offset.
>
> Currently I am doing something like this:
>
>      reader = dataset.scanner(filter=expr_filters).to_reader()
>
> to get a record batch reader, but I am reading data in parallel with
> multiple processes and already know the row counts and from what offset
> I want each process to read. The problem with the above code is that every
> process will materialize batches into memory starting from the
> beginning (therefore reading the same data multiple times).
>
> Thanks
>
>
