n3world commented on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833140856


   > Thanks for submitting these PRs! I'm still brushing up on my parser 
knowledge but I'll take a look at the other one too.
   > 
   > This one looks pretty straightforward to me. It's a bit unfortunate that 
we have two `skip_rows` but it does seem that there are cases for skipping 
before and after. I do prefer this over the before/after boolean. Another 
option would be to take in a list of row indices. This is what pandas does. So 
[0:1] skips two rows before the header (if any) and [1:2] would skip two rows 
after the header or [0, 2] would skip a row before and after.
   > 
   > @pitrou should probably take a look at this as he's got the most 
experience with the CSV reader. He's out for the rest of the week so it might 
be next week.
   
   I like the idea of replacing the current skip_rows with a list of ranges to 
skip. This would be a more complex but a more robust solution. I assume that 
skip_rows would have to be kept but deprecated. The one difficulty would be if 
`ParseOptions::newlines_in_values` were true then the reader would have to be 
aware of the rows that are parsed to have a correct row index. I can start 
working on this solution ignoring the newlines_in_values issue now if that is 
the consensus.
   
   I might make it so any range starting with 0 is handled by the reader, as 
skip_row does now, and any range not starting with 0 is handled by the parser 
so that potentially quoted new lines are handled correctly.
   
   There might be a way to also tackle 
https://issues.apache.org/jira/browse/ARROW-8527 with this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to