andygrove commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r430050988
##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
Ok(Schema::new(fields))
}
+/// Infer schema from a list of CSV files by reading through first n records
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+ files: &Vec<String>,
+ delimiter: u8,
+ max_read_records: Option<usize>,
+ has_header: bool,
+) -> Result<Schema> {
+ let mut buff = Cursor::new(Vec::new());
Review comment:
I'm a little nervous about the idea of potentially loading all of the
CSV files into memory. I'm also not sure how this will work when the schema
varies between files. In the unit test, all files appear to have the same three
columns. What if one of the files is missing "c2" and another file has
additional "c4" and "c5" columns?
What I think we'll ultimately need in DataFusion is a schema merging
feature, so each CSV (or Parquet) file can have differences but the final
output schema will contain the superset.
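To make the superset idea concrete, here is a minimal sketch of what such a merge could look like. It deliberately does not use the Arrow types: `Field`, `DataType`, `widen`, and `merge_schemas` are hypothetical stand-ins, and the widening rule (Int64 with Float64 becomes Float64, any other mismatch falls back to Utf8) is an assumption for illustration, not something this PR or DataFusion defines.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for Arrow's DataType and Field.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

// Assumed widening rule: equal types stay as they are, Int64/Float64
// widen to Float64, and every other mismatch falls back to Utf8.
fn widen(a: DataType, b: DataType) -> DataType {
    match (a, b) {
        (x, y) if x == y => x,
        (DataType::Int64, DataType::Float64) | (DataType::Float64, DataType::Int64) => {
            DataType::Float64
        }
        _ => DataType::Utf8,
    }
}

// Merge per-file schemas into a superset, keeping columns in
// first-seen order and widening the type when a column repeats.
fn merge_schemas(schemas: &[Vec<Field>]) -> Vec<Field> {
    let mut merged: Vec<Field> = Vec::new();
    let mut index: HashMap<String, usize> = HashMap::new();
    for schema in schemas {
        for f in schema {
            let existing = index.get(&f.name).copied();
            match existing {
                Some(i) => {
                    merged[i].data_type = widen(merged[i].data_type, f.data_type);
                }
                None => {
                    index.insert(f.name.clone(), merged.len());
                    merged.push(f.clone());
                }
            }
        }
    }
    merged
}
```

With this sketch, a file with columns c1 and c3 merged with a file with columns c1, c2, and c4 would yield c1, c3, c2, c4. Only the inferred schemas need to be held in memory, not the file contents.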
##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
Ok(Schema::new(fields))
}
+/// Infer schema from a list of CSV files by reading through first n records
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+ files: &Vec<String>,
+ delimiter: u8,
+ max_read_records: Option<usize>,
+ has_header: bool,
+) -> Result<Schema> {
+ let mut buff = Cursor::new(Vec::new());
Review comment:
Another option here is that this code returns an error if the files have
differing schemas.
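A rough sketch of that stricter option, assuming each file's schema has already been inferred separately. `Column` and `require_identical_schemas` are hypothetical names, the data type is kept as a plain string for brevity, and the error is a `String` rather than the crate's own error type.

```rust
// Hypothetical, simplified stand-in for an inferred CSV column.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    data_type: String,
}

// Takes (file path, inferred schema) pairs and returns the common
// schema, or an error naming the first file that deviates from it.
fn require_identical_schemas(
    schemas: &[(String, Vec<Column>)],
) -> Result<Vec<Column>, String> {
    let (first_path, first_schema) = schemas
        .first()
        .ok_or_else(|| "no files given".to_string())?;
    for (path, schema) in &schemas[1..] {
        if schema != first_schema {
            return Err(format!(
                "schema of {} differs from schema of {}",
                path, first_path
            ));
        }
    }
    Ok(first_schema.clone())
}
```

This comparison is order-sensitive: two files with the same columns in a different order would be rejected. Whether that is desirable is a separate design decision.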