Hi Will,
I had not considered row group fragments. That might fit the use case; I
will have to explore it.
Thank you!
On 3/21/22 14:41, Will Jones wrote:
Hello Juan,
I don't think there's any support for skipping rows in the scanner.
If you are trying to distribute scans across processes, have you
considered splitting by dataset fragment? A FileSystemDataset is split
into FileFragments, and Parquet files can be further split into
row-group fragments. Does that seem like it could fit your use case?
Best,
Will
On Mon, Mar 21, 2022 at 12:19 PM Juan Galvez <[email protected]> wrote:
Dear Arrow developers,
I was wondering if it's possible to use the scanner API to read batches
starting from a certain row offset.
Currently I am doing something like this:
reader = dataset.scanner(filter=expr_filters).to_reader()
to get a record batch reader. However, I am reading data in parallel with
multiple processes and already know the row counts and the offset from
which I want each process to read. The problem with the code above is
that every process will materialize batches into memory starting from
the beginning, reading the same data multiple times.
Thanks