andygrove commented on a change in pull request #7252: URL: https://github.com/apache/arrow/pull/7252#discussion_r430050988
##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
     Ok(Schema::new(fields))
 }
 
+/// Infer schema from a list of CSV files by reading through first n records
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order untill n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut buff = Cursor::new(Vec::new());

Review comment:
I'm a little nervous about the idea of potentially loading all of the CSV files into memory.

I'm also not sure how this will work when the schema varies between files. In the unit test, all files appear to have the same three columns. What if one of the files is missing "c2" and another file has an additional "c4" and "c5"?

What I think we'll ultimately need in DataFusion is a schema merging feature, so each CSV (or Parquet) file can have differences but the final output schema will contain the superset.

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
     Ok(Schema::new(fields))
 }
 
+/// Infer schema from a list of CSV files by reading through first n records
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order untill n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut buff = Cursor::new(Vec::new());

Review comment:
Another option here is that this code returns an error if the files have differing schemas.
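
To make the two options from the review comments concrete, here is a rough sketch of both: merging per-file schemas into a superset, and rejecting files whose schemas differ. This is not Arrow's or DataFusion's API. `DataType`, `Field`, `Schema`, `field`, `widen`, `merge_schemas`, and `require_same_schema` are hypothetical stand-ins that use only the standard library, and the widening rule (Int64 and Float64 become Float64, anything else falls back to Utf8) is an assumption made for illustration.

```rust
use std::collections::HashMap;

/// Simplified stand-ins for Arrow's types (hypothetical, illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Boolean,
    Int64,
    Float64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

pub type Schema = Vec<Field>;

/// Convenience constructor for a field.
pub fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Pick a type that can hold values of both inferred types.
/// Assumed rule: equal types stay as-is, Int64/Float64 widen to Float64,
/// every other combination falls back to Utf8.
fn widen(a: &DataType, b: &DataType) -> DataType {
    use DataType::*;
    match (a, b) {
        (x, y) if x == y => x.clone(),
        (Int64, Float64) | (Float64, Int64) => Float64,
        _ => Utf8,
    }
}

/// Option 1: merge per-file schemas into a superset.
/// Columns keep the order in which they are first seen.
pub fn merge_schemas(schemas: &[Schema]) -> Schema {
    let mut merged: Schema = Vec::new();
    // Maps a column name to its position in `merged`.
    let mut index: HashMap<String, usize> = HashMap::new();
    for schema in schemas {
        for f in schema {
            let existing = index.get(&f.name).copied();
            match existing {
                Some(i) => {
                    // Column already known: reconcile the two types.
                    let widened = widen(&merged[i].data_type, &f.data_type);
                    merged[i].data_type = widened;
                }
                None => {
                    // New column: append it to the merged schema.
                    index.insert(f.name.clone(), merged.len());
                    merged.push(f.clone());
                }
            }
        }
    }
    merged
}

/// Option 2: return an error as soon as a schema differs from the first one.
pub fn require_same_schema(schemas: &[Schema]) -> Result<Schema, String> {
    let first = schemas
        .first()
        .ok_or_else(|| "no schemas given".to_string())?;
    for (i, schema) in schemas.iter().enumerate().skip(1) {
        if schema != first {
            return Err(format!("schema at index {} differs from the first schema", i));
        }
    }
    Ok(first.clone())
}
```

With one file containing c1, c2, c3 and another containing c1, c3, c4, c5, `merge_schemas` would yield c1 through c5, while `require_same_schema` would return an error. A real merge would also need to mark columns that are absent from some files as nullable, which this sketch leaves out.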