We do not have an option for this today.  However, it is something
we could do a better job of as long as we aren't reading CSV (CSV has
to be parsed sequentially to find row boundaries, so there is nothing
to skip by).

Aldrin's workaround is pretty solid, especially if you are reading
parquet and have a row_index column.  Parquet statistics filtering
should ensure we are only reading the needed row groups.
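
Concretely, a minimal sketch of skipping to a row window by reading
only the needed row groups (this uses the footer's per-row-group row
counts directly rather than the dataset filter machinery; `ReadWindow`,
`start`, and `count` are illustrative names, not an existing Arrow API):

```
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Read only the row groups of `path` that overlap rows [start, start + count),
// then trim the leading rows of the first group with Table::Slice().
arrow::Result<std::shared_ptr<arrow::Table>> ReadWindow(
    const std::string& path, int64_t start, int64_t count) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  auto metadata = reader->parquet_reader()->metadata();
  std::vector<int> row_groups;
  int64_t seen = 0;  // rows before the current group
  int64_t trim = 0;  // rows to drop from the first selected group
  for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
    int64_t rows = metadata->RowGroup(rg)->num_rows();
    bool overlaps = (seen + rows > start) && (seen < start + count);
    if (overlaps && row_groups.empty()) trim = start - seen;
    if (overlaps) row_groups.push_back(rg);
    seen += rows;
  }

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadRowGroups(row_groups, &table));
  return table->Slice(trim, count);
}
```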

We will need to implement something similar for [1], and it seems we
should have a general JIRA for "paging support (start_index & count)"
for datasets, but I couldn't find one with a quick search.

[1] https://issues.apache.org/jira/browse/ARROW-15589

On Tue, May 17, 2022 at 10:09 AM Aldrin <akmon...@ucsc.edu> wrote:
>
> I think batches are all-or-nothing as far as reading/deserializing goes. However, 
> you can manage a slice of each batch instead of the whole batch in the <deal 
> with the batch> portion. That is, if you have 2 batches with 10 rows each, and you 
> want to skip rows [10, 15) (0-indexed, inclusive of 10, exclusive of 15), 
> then you can keep the first batch as-is (in a vector, or handled directly), and in 
> the 2nd batch you can use `Slice(5)` [1] to keep rows [15, 20).
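>
> A minimal sketch of that slicing pattern (the `SkipRows` name and the
> `reader`/`rows_to_skip` parameters are illustrative, not an Arrow API):
>
> ```
> #include <arrow/api.h>
>
> // Skip the first `rows_to_skip` rows of a stream. Whole batches inside the
> // skipped range are dropped; the batch straddling the boundary is trimmed
> // with Slice(), and everything after it is processed untouched.
> arrow::Status SkipRows(arrow::RecordBatchReader* reader, int64_t rows_to_skip) {
>   int64_t remaining = rows_to_skip;
>   std::shared_ptr<arrow::RecordBatch> batch;
>   while (true) {
>     ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
>     if (batch == nullptr) break;           // end of stream
>     if (remaining >= batch->num_rows()) {  // entire batch is skipped
>       remaining -= batch->num_rows();
>       continue;
>     }
>     auto kept = (remaining > 0) ? batch->Slice(remaining) : batch;
>     remaining = 0;
>     // ... deal with `kept` ...
>   }
>   return arrow::Status::OK();
> }
> ```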
>
> Some other approaches might include using the `Take` compute function on a 
> "super" table or on the particular batch [2] (a sketch of the `Take` route 
> appears further below), or putting a "row index" column in your data and 
> using that as a filter, e.g.:
>
> ```
> #include <arrow/api.h>
> #include <arrow/dataset/api.h>
> #include <arrow/compute/api.h>
>
> // for arrow expressions
> using arrow::compute::or_;
> using arrow::compute::less;
> using arrow::compute::greater_equal;
> using arrow::compute::literal;
> using arrow::compute::field_ref;
> using arrow::compute::Expression;
>
> // skip rows [10, 15) (0-indexed, inclusive of 10, exclusive of 15):
> // keep a row when row_index < 10 OR row_index >= 15
> Expression filter_rowstokeep = or_(
>      less         (field_ref("row_index"), literal(10))
>     ,greater_equal(field_ref("row_index"), literal(15))
> );
>
> // construct the scanner builder as usual
> ...
> scanner_builder->Project({/* column names */});
>
> // bind the filter to the scanner builder
> scanner_builder->Filter(filter_rowstokeep);
>
> // finish and execute as usual
> scanner = scanner_builder->Finish().ValueOrDie();
> ...
> ```
>
> The above code sample is adapted and simplified from what I do in [3], which 
> you can refer to if you'd like.
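>
> For the `Take` route mentioned earlier, a minimal sketch (the `SkipWindow`
> name and the hard-coded [10, 15) window are illustrative):
>
> ```
> #include <arrow/api.h>
> #include <arrow/compute/api.h>
>
> // Gather every row of `table` except the skipped window [10, 15).
> arrow::Result<std::shared_ptr<arrow::Table>> SkipWindow(
>     const std::shared_ptr<arrow::Table>& table) {
>   arrow::Int64Builder builder;
>   for (int64_t i = 0; i < table->num_rows(); ++i) {
>     if (i >= 10 && i < 15) continue;  // rows to skip
>     ARROW_RETURN_NOT_OK(builder.Append(i));
>   }
>   ARROW_ASSIGN_OR_RAISE(auto indices, builder.Finish());
>   ARROW_ASSIGN_OR_RAISE(arrow::Datum taken,
>                         arrow::compute::Take(table, indices));
>   return taken.table();
> }
> ```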
>
> Finally, you can also construct a new table with the row_index column and 
> then filter that instead, which I think could be fairly efficient, but I 
> haven't played with the API enough to know the most efficient way. I also 
> suspect it might be slightly annoying with the existing interface. Either:
> (a) dataset -> table -> table with extra column -> dataset -> scanner builder 
> with filter as above -> scanner -> finish
> (b) table -> table with extra column -> dataset -> scanner builder with 
> filter as above -> scanner -> finish
>
> The difference between (a) and (b) above being how you initially read the 
> data from S3 into memory (either as a dataset, leveraging the dataset 
> framework, or as tables, probably managing the reads a bit more manually).
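>
> A minimal sketch of the "table with extra column -> dataset" hop shared by
> (a) and (b) (the `WithRowIndex` name is illustrative; the row_index column
> is just a 0-based counter built in memory):
>
> ```
> #include <arrow/api.h>
> #include <arrow/dataset/api.h>
>
> // Append a 0-based "row_index" column to `table` and wrap the result in an
> // in-memory dataset so the filter shown earlier can be applied by a scanner.
> arrow::Result<std::shared_ptr<arrow::dataset::Dataset>> WithRowIndex(
>     const std::shared_ptr<arrow::Table>& table) {
>   arrow::Int64Builder builder;
>   for (int64_t i = 0; i < table->num_rows(); ++i) {
>     ARROW_RETURN_NOT_OK(builder.Append(i));
>   }
>   ARROW_ASSIGN_OR_RAISE(auto indices, builder.Finish());
>   ARROW_ASSIGN_OR_RAISE(
>       auto with_index,
>       table->AddColumn(table->num_columns(),
>                        arrow::field("row_index", arrow::int64()),
>                        std::make_shared<arrow::ChunkedArray>(indices)));
>   return std::make_shared<arrow::dataset::InMemoryDataset>(with_index);
> }
> ```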
>
>
> <---- references -->
>
> [1]: 
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch5SliceE7int64_t
> [2]: https://arrow.apache.org/docs/cpp/compute.html#selections
> [3]: 
> https://github.com/drin/cookbooks/blob/mainline/arrow/projection/project_from_dataset.cpp#L137
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Tue, May 17, 2022 at 9:33 AM 1057445597 <1057445...@qq.com> wrote:
>>
>> Can Arrow skip a certain number of rows when reading data? I want to do 
>> distributed training and read the data through Arrow; my code is as follows:
>>
>> dataset = getDatasetFromS3()
>> scanner_builder = dataset->NewScan()
>> scanner_builder->Project(...)
>> scanner = scanner_builder->Finish()
>> batch_reader = scanner->ToRecordBatchReader()
>> current_batch_ = batch_reader->ReadNext()
>> // deal with the batch ...
>>
>> Can I skip a certain number of lines before calling ReadNext()? Or is there 
>> a skip() interface or an offset() interface?
>>
>>
