[GitHub] [arrow] angela-li commented on issue #37908: [R] open_dataset() behavior with incorrectly quoted input data

via GitHub Sun, 01 Oct 2023 14:00:52 -0700


angela-li commented on issue #37908:
URL: https://github.com/apache/arrow/issues/37908#issuecomment-1742200461


   > Would you mind telling me a bit more about where you looked and how you 
figured it out?
   
   Sure! Here's what I did to try to debug:
   
   - First I tried to zoom in on the line that was producing the error and 
subset it with the Linux shell (`head`, `cut`, etc.) to find the smallest 
subset of the data that still reproduced the problem. After 
inspecting it and comparing it to the row above, I started to get a hunch that 
it was the extra `"` character.
   - Then, I read that tiny subset of data into `data.table::fread()` to see if 
it handled it better. It did, but fread produced a warning of its own: 
_"Warning: Found and resolved improper quoting in first 100 rows. If the fields 
are not quoted (e.g. field separator does not appear within any field), try 
quote="" to avoid this warning."_
       - I then went to the data.table source code and searched the GitHub repo 
for the wording of that warning, which led me to the NEWS file linked above.
   - I realized then it had something to do with incorrectly quoted data and 
looked on StackOverflow, which led me to this [PyArrow documentation of 
quote_char](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions.quote_char)
 (**probably the most useful in terms of figuring out the argument I'd need to 
tweak - so use this as a reference for doc improvements!**)
   - From that, I thought that maybe there was an equivalent of 
[`pyarrow.csv.ParseOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow-csv-parseoptions)
 in the R arrow package, which led me to the [CsvParseOptions 
page](https://arrow.apache.org/docs/r/reference/CsvReadOptions.html) on the 
pkgdown site. I didn't know what the default for `quote_char` was, so I went and 
looked at the source code for CsvParseOptions and realized it was a `"`. And 
then I slowly started to realize how to fix it! 
       - ...And told myself that would be a useful docs PR which is now merged 
as #37909!!
   - It was difficult to figure out how to use CsvParseOptions$create() because 
there were no examples, but I just tried things out (with my limited knowledge 
of S4 methods in R), paralleling the examples in the Python code, until reading 
my messy data with `open_dataset()` worked!
   - Along the way, I also found [someone else on StackOverflow who ran into 
the same 
issue](https://stackoverflow.com/questions/74057299/how-to-read-csv-with-within-quoted-string-with-read-csv-arrow)
 but did not figure it out (but I felt glad I was not the only one with the 
problem)
   
   Phew!
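   
   For anyone landing here later, this is a minimal sketch of the kind of fix 
described above, not necessarily the exact call I ended up with. The sample 
file and its contents are made up for illustration, and it assumes the `arrow` 
and `dplyr` packages are installed. It turns quote handling off with 
`quoting = FALSE`, so the stray `"` is read as an ordinary character:
   
   ```r
   library(arrow)

   # Hypothetical sample data: row 2 starts a field with an unbalanced quote.
   data_dir <- tempfile("messy_csv_")
   dir.create(data_dir)
   writeLines(
     c("id,name", "1,plain", "2,\"5 inch pipe", "3,last"),
     file.path(data_dir, "part-0.csv")
   )

   # Disable quote handling so `"` is treated as a literal character.
   parse_opts <- CsvParseOptions$create(quoting = FALSE)

   # Pass the parse options through open_dataset() to the CSV reader.
   ds <- open_dataset(data_dir, format = "csv", parse_options = parse_opts)
   result <- dplyr::collect(ds)
   print(result)
   ```
   
   The trade-off is that with quoting disabled, fields that legitimately wrap 
a delimiter in quotes will no longer parse as a single field, so this only 
suits data where the delimiter never appears inside a value.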
   
   > Arrow’s CSV reader is optimized for very fast parsing of valid CSVs 
(rather than other parsers like readr and data.table that offer more flexible 
options for handling invalid data, occasionally at the expense of speed), so it 
might end up being more of a problem that is better solved by multiple 
libraries.
   
   It makes sense that Arrow is optimized for fast parsing of valid CSVs - 
that's what I started to suspect after seeing other examples of what Arrow is 
used for (oftentimes machine-output data, not messy human-collected data). I'll 
think about what to do in this case.


