n3world commented on pull request #10255: URL: https://github.com/apache/arrow/pull/10255#issuecomment-833185925
> That would be good. Eventually the dataset scanner will probably be getting a skip operation of some kind as well so that'll increase the pressure on [ARROW-8527](https://issues.apache.org/jira/browse/ARROW-8527). [ARROW-12598](https://issues.apache.org/jira/browse/ARROW-12598) is also (admittedly tangentially) related since you seem to be on a roll :smile:

The only tricky part about a count(*) implementation with this is that `skip_rows` is documented as skipping header rows, which shouldn't be counted as part of a data row count. I feel like the row count operation would have to be a little different and maybe give an indicator of the line on which the actual data rows start, so that the header rows before that point could be skipped.

Maybe a simpler solution would be a set of two indexes: column names and first data row. While this doesn't allow arbitrary row skipping in the middle, it would allow for the most common use cases, including skipping over valid rows to the first desired row. Another option or operation could then be used to count the number of data rows starting at the first data row. The defaults would be 0, 1 when column names are part of the CSV, or -1, 0 when they are not.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
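
To make the two-index idea from the comment concrete, here is a rough Python sketch. It is not Arrow's API: `read_with_indexes`, `column_names_row` and `first_data_row` are hypothetical names, and the indexes are assumed to count parsed CSV rows rather than physical lines.

```python
import csv
import io


def read_with_indexes(text, column_names_row=0, first_data_row=1):
    """Hypothetical helper: return (column_names, data_row_count).

    column_names_row: row index holding the column names, or -1 if none.
    first_data_row:   row index of the first row counted as data.
    """
    if first_data_row <= column_names_row:
        raise ValueError("first_data_row must come after column_names_row")
    rows = list(csv.reader(io.StringIO(text)))
    # -1 means the file carries no column names at all.
    names = rows[column_names_row] if column_names_row >= 0 else None
    # Everything before first_data_row is treated as header and not counted.
    return names, len(rows[first_data_row:])


# Defaults (0, 1): names on the first row, data right after.
with_header = read_with_indexes("a,b\n1,2\n3,4\n")

# (-1, 0): no column names, every row is data.
no_header = read_with_indexes("1,2\n3,4\n", column_names_row=-1, first_data_row=0)

# One leading comment row, so names sit on row 1 and data starts on row 2.
shifted = read_with_indexes(
    "# exported file\na,b\n1,2\n3,4\n", column_names_row=1, first_data_row=2
)
```

Setting `first_data_row` past valid rows covers the "skip to the first desired row" case, while skipping rows in the middle stays out of scope, as the comment notes.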