lidavidm commented on code in PR #13709: URL: https://github.com/apache/arrow/pull/13709#discussion_r936052670
##########
cpp/src/arrow/csv/options.h:
##########
@@ -163,6 +163,9 @@ struct ARROW_EXPORT ReadOptions {
   /// If false, column names will be read from the first CSV row after `skip_rows`.
   bool autogenerate_column_names = false;
+  /// Character encoding used
+  std::string encoding = "UTF-8";

Review Comment:
   ```suggestion
     /// Character encoding used. Only "UTF-8" is supported.
     std::string encoding = "UTF-8";
   ```

##########
python/pyarrow/io.pxi:
##########
@@ -1547,6 +1547,33 @@ class Transcoder:
         return self._encoder.encode(self._decoder.decode(buf, final), final)
+cdef shared_ptr[function[StreamWrapFunc]] make_streamwrap_func(
+        src_encoding, dest_encoding) except *:
+    """
+    Create a function that will add a transcoding transformation to a stream.
+    Data from that stream will be decoded according to ``src_encoding`` and
+    then re-encoded according to ``dest_encoding``.
+    The created function can be used to wrap streams.
+
+    Parameters
+    ----------
+    src_encoding : str
+        The codec to use when reading data.
+    dest_encoding : str
+        The codec to use for emitted data.
+    """
+    cdef:
+        shared_ptr[function[StreamWrapFunc]] empty_func
+        CTransformInputStreamVTable vtable
+
+    vtable.transform = _cb_transform
+    src_codec = codecs.lookup(src_encoding)
+    dest_codec = codecs.lookup(dest_encoding)

Review Comment:
   Don't we want to skip this when src == dest?
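
For illustration, the `src == dest` check asked about above could be sketched in plain Python as follows. This is only a sketch: `needs_transcoding` is a hypothetical helper name and is not part of pyarrow or of this PR. It assumes the comparison should be made on normalized codec names rather than on the raw strings, so that aliases such as "UTF-8" and "utf8" count as equal.

```python
import codecs


def needs_transcoding(src_encoding: str, dest_encoding: str) -> bool:
    """Return True only when the two names resolve to different codecs.

    Hypothetical helper, not part of pyarrow. codecs.lookup() normalizes
    aliases, so "UTF-8", "utf8" and "U8" all resolve to the same codec.
    """
    src_name = codecs.lookup(src_encoding).name
    dest_name = codecs.lookup(dest_encoding).name
    return src_name != dest_name


# Aliases of the same codec: no transcoding wrapper would be needed.
print(needs_transcoding("UTF-8", "utf8"))
# A genuinely different source codec: the wrapper would be needed.
print(needs_transcoding("latin-1", "UTF-8"))
```

A caller such as `make_streamwrap_func` could use a check like this to return early before building the vtable. What it should return in that case (for example its `empty_func`) is left to the PR author.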
##########
cpp/src/arrow/dataset/file_csv.cc:
##########
@@ -183,9 +183,19 @@ static inline Future<std::shared_ptr<csv::StreamingReader>> OpenReaderAsync(
   auto tracer = arrow::internal::tracing::GetTracer();
   auto span = tracer->StartSpan("arrow::dataset::CsvFileFormat::OpenReaderAsync");
 #endif
+  ARROW_ASSIGN_OR_RAISE(
+      auto fragment_scan_options,
+      GetFragmentScanOptions<CsvFragmentScanOptions>(
+          kCsvTypeName, scan_options.get(), format.default_fragment_scan_options));
   ARROW_ASSIGN_OR_RAISE(auto reader_options, GetReadOptions(format, scan_options));
-  ARROW_ASSIGN_OR_RAISE(auto input, source.OpenCompressed());
+  if (reader_options.encoding != "UTF-8") {
+    if (fragment_scan_options->stream_transform_func) {
+      ARROW_ASSIGN_OR_RAISE(input, fragment_scan_options->stream_transform_func(input));
+    } else {
+      return Status::Invalid("File encoding is not UTF-8, but no stream_transform_func has been provided.");

Review Comment:
   Invalid is fine (IOError doesn't quite fit IMO)
   ```suggestion
         return Status::Invalid("File encoding is not UTF-8, but no stream_transform_func has been provided");
   ```

##########
cpp/src/arrow/csv/options.h:
##########
@@ -163,6 +163,9 @@ struct ARROW_EXPORT ReadOptions {
   /// If false, column names will be read from the first CSV row after `skip_rows`.
   bool autogenerate_column_names = false;
+  /// Character encoding used
+  std::string encoding = "UTF-8";

Review Comment:
   nit: maybe a formal constant for the encoding?
##########
cpp/src/arrow/dataset/file_csv.cc:
##########
@@ -183,9 +183,19 @@ static inline Future<std::shared_ptr<csv::StreamingReader>> OpenReaderAsync(
   auto tracer = arrow::internal::tracing::GetTracer();
   auto span = tracer->StartSpan("arrow::dataset::CsvFileFormat::OpenReaderAsync");
 #endif
+  ARROW_ASSIGN_OR_RAISE(
+      auto fragment_scan_options,
+      GetFragmentScanOptions<CsvFragmentScanOptions>(
+          kCsvTypeName, scan_options.get(), format.default_fragment_scan_options));
   ARROW_ASSIGN_OR_RAISE(auto reader_options, GetReadOptions(format, scan_options));
-  ARROW_ASSIGN_OR_RAISE(auto input, source.OpenCompressed());
+  if (reader_options.encoding != "UTF-8") {

Review Comment:
   Somewhere in the CSV reader itself we should also validate the option

##########
python/pyarrow/dataset.py:
##########
@@ -433,6 +433,10 @@ def _filesystem_dataset(source, schema=None, filesystem=None,
     FileSystemDataset
     """
     format = _ensure_format(format or 'parquet')
+    if isinstance(format, CsvFileFormat):

Review Comment:
   hmm, it feels like this needs to be placed "lower" (unless there aren't any other ways of building a filesystem dataset?)

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org