nevi-me commented on a change in pull request #9534:
URL: https://github.com/apache/arrow/pull/9534#discussion_r581965294



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -99,7 +99,27 @@ fn infer_field_schema(string: &str) -> DataType {
 /// If `max_read_records` is not set, the whole file is read to infer its 
schema.
 ///
 /// Return infered schema and number of records used for inference.
-fn infer_file_schema<R: Read + Seek>(
+pub fn infer_file_schema<R: Read + Seek>(
+    reader: &mut R,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<(Schema, usize)> {
+    let (schema, records_count) =
+        infer_schema_from_reader(reader, delimiter, max_read_records, 
has_header)?;
+    // return the reader seek back to the start

Review comment:
       The idea in the initial commit was to return the file to its initial 
position before schema inference. That position was expected to always be the 
start, unless we want to allow someone to hand over a buffer that's in some 
other position.
   
   If there's such a use-case, I think it's acceptable/fine. I was mainly 
avoiding the case where:
   - user runs schema inference on inputs
   - user tries to read the input with the inferred schema, getting no records, 
or missing records used in inferrence
   
   I was still learning Rust when I wrote that code, so it's perhaps the case 
that the behaviour isn't what we'd like.
   
   > Shouldn't we rewind to whatever offset it was at before we read any data 
to infer the schema?
   
   @houqp if that position isn't 0, then I'd agree that we should rewind to it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to