[GitHub] [arrow] andygrove commented on a change in pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

GitBox Tue, 26 May 2020 20:40:24 -0700


andygrove commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r430050988




##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
     Ok(Schema::new(fields))
 }
 
+/// Infer schema from a list of CSV files by reading through first n records
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut buff = Cursor::new(Vec::new());

Review comment:
   I'm a little nervous about potentially loading all of the CSV files into memory. I'm also not sure how this will work when the schema varies between files. In the unit test, all files appear to have the same three columns. What if one of the files is missing "c2" and another file has additional "c4" and "c5" columns?

   What I think we'll ultimately need in DataFusion is a schema merging feature, so that each CSV (or Parquet) file can differ but the final output schema contains the superset of their columns.
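   To make the superset idea concrete, here is a minimal sketch of what such a merge could look like. It is not the Arrow or DataFusion API: `DataType`, `Field`, and `merge_schemas` are hypothetical, simplified stand-ins written against the standard library only. The sketch assumes column names are unique within a file, treats a type conflict on the same column name as an error, and marks any column that is absent from at least one file as nullable.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a column data type (not Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Float64,
    Utf8,
}

/// Hypothetical, simplified stand-in for a schema field (not Arrow's `Field`).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

impl Field {
    pub fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field { name: name.to_string(), data_type, nullable }
    }
}

/// Merge per-file schemas into one superset schema.
///
/// Columns keep the order in which they are first seen. Assumes column
/// names are unique within each individual schema.
pub fn merge_schemas(schemas: &[Vec<Field>]) -> Result<Vec<Field>, String> {
    let mut merged: Vec<Field> = Vec::new();
    // Column name -> position in `merged`.
    let mut index: HashMap<String, usize> = HashMap::new();
    // Number of input schemas that contain each merged column.
    let mut seen_count: Vec<usize> = Vec::new();

    for schema in schemas {
        for field in schema {
            let existing = index.get(&field.name).copied();
            match existing {
                Some(i) => {
                    if merged[i].data_type != field.data_type {
                        return Err(format!(
                            "column '{}' has conflicting types {:?} and {:?}",
                            field.name, merged[i].data_type, field.data_type
                        ));
                    }
                    // Nullable in any file means nullable in the result.
                    merged[i].nullable |= field.nullable;
                    seen_count[i] += 1;
                }
                None => {
                    index.insert(field.name.clone(), merged.len());
                    merged.push(field.clone());
                    seen_count.push(1);
                }
            }
        }
    }

    // A column missing from at least one file must be nullable in the output.
    for (field, count) in merged.iter_mut().zip(&seen_count) {
        if *count < schemas.len() {
            field.nullable = true;
        }
    }

    Ok(merged)
}
```

   With the scenario from the comment above (one file missing "c2", another adding "c4" and "c5"), this sketch would yield the columns c1 through c5, with c2, c4, and c5 nullable. A real implementation might prefer to widen compatible types (for example, integer to float) rather than reject every mismatch; that choice is left open here.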

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
     Ok(Schema::new(fields))
 }
 
+/// Infer schema from a list of CSV files by reading through first n records
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut buff = Cursor::new(Vec::new());

Review comment:
   Another option here is for this code to return an error if the files have differing schemas.
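   A minimal sketch of that stricter option follows, assuming each file's schema has already been inferred separately. `Field` and `require_identical_schemas` are hypothetical names used for illustration only and are not part of the Arrow CSV reader; the data type is kept as a plain string to keep the example self-contained.

```rust
/// Hypothetical, simplified stand-in for a schema field (not Arrow's `Field`).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

/// Given `(file path, inferred schema)` pairs, return the shared schema,
/// or an error naming the first file whose schema differs from the first file's.
pub fn require_identical_schemas(
    inferred: &[(String, Vec<Field>)],
) -> Result<Vec<Field>, String> {
    let mut iter = inferred.iter();
    let (first_path, first_schema) = iter
        .next()
        .ok_or_else(|| "no files given".to_string())?;

    for (path, schema) in iter {
        // Field order matters here: the same columns in a different
        // order are treated as a different schema.
        if schema != first_schema {
            return Err(format!(
                "schema of '{}' differs from schema of '{}'",
                path, first_path
            ));
        }
    }

    Ok(first_schema.clone())
}
```

   Failing fast like this is simpler than merging and makes the mismatch visible to the caller, at the cost of rejecting file sets that a merge could have handled.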



