Re: Can arrow::dataset::Scanner skip a certain number of rows?

2022-05-18 Thread 1057445597
But it would be better to support (start_index, max_rows), so that I don't have to save a row_index column. --Original Message-- From: "user"
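
To make the requested (start_index, max_rows) semantics concrete, here is a minimal client-side sketch. It is not an Arrow API: `Batch` and `SliceRows` are hypothetical names, a batch is modeled as a plain vector of row values, and the sketch assumes batches arrive in a stable order, which a multi-threaded scan may not guarantee.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a record batch: just a vector of row values.
using Batch = std::vector<int>;

// Return up to max_rows rows starting at global row start_index, walking the
// batches in order and slicing the ones that overlap the requested window.
std::vector<int> SliceRows(const std::vector<Batch>& batches,
                           std::size_t start_index, std::size_t max_rows) {
  std::vector<int> out;
  std::size_t batch_begin = 0;  // global index of the current batch's first row
  for (const Batch& batch : batches) {
    if (out.size() == max_rows) break;  // window is already full
    const std::size_t batch_end = batch_begin + batch.size();
    if (batch_end > start_index) {
      // First row of this batch that falls inside the window.
      const std::size_t local_begin =
          start_index > batch_begin ? start_index - batch_begin : 0;
      const std::size_t take =
          std::min(batch.size() - local_begin, max_rows - out.size());
      out.insert(out.end(), batch.begin() + local_begin,
                 batch.begin() + local_begin + take);
    }
    batch_begin = batch_end;
  }
  return out;
}
```

Empty batches contribute nothing and are simply stepped over, so the caller does not need a stored row_index column for this kind of slicing, at the cost of still reading the skipped rows.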

Re: Can arrow::dataset::Scanner skip a certain number of rows?

2022-05-18 Thread 1057445597
It works for me. The problem was that I use multi-threaded reading. When no filter is set, it is fine to read batches sequentially. After setting the filter, the previous batch may contain fewer rows or none at all. For the next batch, I then checked whether it was empty before I had finished reading, which led to the

Re: Can arrow::dataset::Scanner skip a certain number of rows?

2022-05-18 Thread 1057445597
Also, when I added the filter, my program had an unexpected core dump, and I'm now looking into why. I did it based on tfio's code. --Original Message-- From:

Re: Can arrow::dataset::Scanner skip a certain number of rows?

2022-05-18 Thread 1057445597
I tried the method proposed by Aldrin, but when my offset exceeds a batch length, ReadNext() returns a batch with 0 rows. That is, after I set the filter, a call to ReadNext() does not directly return the first batch that matches the filter. I may need to call it n times in a row before
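
The behavior described here (zero-row batches before the first matching one) can be handled by treating an empty batch and the end of the stream as different things. The sketch below uses hypothetical stand-ins (`FakeBatch`, `FakeReader`, `ReadNextNonEmpty`) rather than Arrow types. It mirrors only the convention that Arrow's `RecordBatchReader::ReadNext` signals end of stream with a null batch, while a batch with 0 rows just means "nothing matched in this chunk".

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch.
struct FakeBatch {
  std::vector<int> rows;
  std::size_t num_rows() const { return rows.size(); }
};

// Hypothetical reader: yields the queued batches in order, then nullptr.
class FakeReader {
 public:
  explicit FakeReader(std::vector<std::vector<int>> batches)
      : batches_(std::move(batches)) {}

  std::shared_ptr<FakeBatch> ReadNext() {
    if (next_ >= batches_.size()) return nullptr;  // end of stream
    return std::make_shared<FakeBatch>(FakeBatch{batches_[next_++]});
  }

 private:
  std::vector<std::vector<int>> batches_;
  std::size_t next_ = 0;
};

// Keep reading until a batch has rows or the stream ends.
// A zero-row batch is skipped; only nullptr means there is no more data.
std::shared_ptr<FakeBatch> ReadNextNonEmpty(FakeReader& reader) {
  while (true) {
    std::shared_ptr<FakeBatch> batch = reader.ReadNext();
    if (batch == nullptr) return nullptr;     // finished
    if (batch->num_rows() > 0) return batch;  // first batch with data
  }
}
```

With this shape, a consumer that stops on the first empty batch would instead continue until the reader reports the end of the stream.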

Re: Can arrow::dataset::Scanner skip a certain number of rows?

2022-05-18 Thread Weston Pace
We do not have the option to do this today. However, it is something we could do a better job of as long as we aren't reading CSV. Aldrin's workaround is pretty solid, especially if you are reading Parquet and have a row_index column. Parquet statistics filtering should ensure we are only
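
As a rough illustration of why a row_index column pairs well with Parquet statistics, the sketch below shows min/max pruning in general terms. It is not Arrow's implementation: `RowGroupStats` and `RowGroupsToRead` are hypothetical names, and the sketch assumes each row group records the minimum and maximum row_index it contains. A filter equivalent to `row_index >= start && row_index < start + max_rows` can then rule out whole row groups without reading them.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-row-group statistics for a row_index column.
struct RowGroupStats {
  std::int64_t min_row_index;
  std::int64_t max_row_index;
};

// Return the positions of the row groups whose [min, max] range overlaps the
// half-open window [start, start + max_rows). Other groups can be skipped.
std::vector<std::size_t> RowGroupsToRead(
    const std::vector<RowGroupStats>& groups, std::int64_t start,
    std::int64_t max_rows) {
  std::vector<std::size_t> selected;
  if (max_rows <= 0) return selected;         // empty window, nothing to read
  const std::int64_t end = start + max_rows;  // exclusive upper bound
  for (std::size_t i = 0; i < groups.size(); ++i) {
    const bool overlaps =
        groups[i].max_row_index >= start && groups[i].min_row_index < end;
    if (overlaps) selected.push_back(i);
  }
  return selected;
}
```

Rows inside a selected group that fall outside the window still have to be dropped by the filter itself; the statistics only decide which groups are worth opening.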