angela-li commented on issue #37908:
URL: https://github.com/apache/arrow/issues/37908#issuecomment-1742200461
> Would you mind telling me a bit more about where you looked and how you
figured it out?
Sure! Here's what I did to try to debug:
- First I tried to zoom in on the line that was producing the error and
subset the file with Linux shell tools (`head`, `cut`, etc.) to find the
smallest subset of the data that still reproduced the problem. After
inspecting it and comparing it to the row above, I started to get a hunch that
the extra `"` character was the cause.
- Then, I read that tiny subset of data into `data.table::fread()` to see if
it handled it better. It did, but fread produced a warning of its own:
_"Warning: Found and resolved improper quoting in first 100 rows. If the fields
are not quoted (e.g. field separator does not appear within any field), try
quote="" to avoid this warning."_
- I then went to the data.table source code and searched the GitHub repo for
the wording of that warning, which led me to the NEWS file linked above.
- I realized then it had something to do with incorrectly quoted data and
looked on StackOverflow, which led me to this [PyArrow documentation of
quote_char](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions.quote_char)
(**probably the most useful in terms of figuring out the argument I'd need to
tweak - so use this as a reference for doc improvements!**)
- From that, I thought that maybe there was an equivalent for
[pyarrow.csv.ParseOptions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow-csv-parseoptions)
in the R arrow package, which led me to the [CsvParseOptions
page](https://arrow.apache.org/docs/r/reference/CsvReadOptions.html) on the
pkgdown site. I didn't know what the default value of `quote_char` was, so I
went and looked at the source code for CsvParseOptions and found that it was
`"`. And then I slowly started to realize how to fix it!
- ...And told myself that would be a useful docs PR which is now merged
as #37909!!
- It was difficult to figure out how to use `CsvParseOptions$create()` because
there were no examples, but I just tried things out (with my limited knowledge
of S4 methods in R), paralleling the examples in the Python code, until reading
my messy data with `open_dataset()` worked!
- Along the way, I also found [someone else on StackOverflow who ran into
the same
issue](https://stackoverflow.com/questions/74057299/how-to-read-csv-with-within-quoted-string-with-read-csv-arrow)
but did not figure it out (but I felt glad I was not the only one with the
problem)
Phew!
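
For anyone following along, here is a minimal sketch of why a single stray
quote character can derail a CSV parse. It uses Python's standard-library
`csv` module rather than Arrow, so it only illustrates the general idea:
the sample data and the `parse` helper are made up for illustration, and
Arrow's reader may react differently (in the case above it raised an error
instead).

```python
import csv
import io

# Hypothetical sample data: the third line has a stray, unbalanced "
# at the start of the name field.
raw = (
    "id,name,city\n"
    "1,Alice,Chicago\n"
    '2,"Bobby,Boston\n'
    "3,Carol,Denver\n"
)


def parse(text, **reader_kwargs):
    """Parse CSV text with the stdlib reader and return a list of rows."""
    return list(csv.reader(io.StringIO(text), **reader_kwargs))


# Default settings: " opens a quoted field, so the reader keeps consuming
# commas and newlines while it waits for a closing quote that never comes.
default_rows = parse(raw)

# Quote handling switched off: " is treated as an ordinary character,
# so every line splits on commas as intended.
literal_rows = parse(raw, quoting=csv.QUOTE_NONE)

print(len(default_rows), "rows with default quoting")
print(len(literal_rows), "rows with quoting disabled")
print(literal_rows[2])
```

The PyArrow documentation linked above describes the analogous setting there,
the `quote_char` option of `pyarrow.csv.ParseOptions`.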
> Arrow’s CSV reader is optimized for very fast parsing of valid CSVs
(rather than other parsers like readr and data.table that offer more flexible
options for handling invalid data, occasionally at the expense of speed), so it
might end up being more of a problem that is better solved by multiple
libraries.
It makes sense that Arrow is optimized for fast parsing of valid CSVs -
that's what I started to suspect after seeing other examples of what Arrow is
used for (oftentimes machine-output data, not messy human-collected data). I'll
think about what to do in this case.