houqp commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r430062159



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
     Ok(Schema::new(fields))
 }
 
+/// Infer schema from a list of CSV files by reading through the first n records,
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut buff = Cursor::new(Vec::new());

Review comment:
       Sounds good, I will change it to infer the schema per file, then merge at the end. If we don't have to seek across multiple files, then we don't need to load all lines into memory.
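       For reference, one way the end-of-loop merge could look — a minimal sketch, not the PR's actual code. `merge_schemas` and its widen-to-Utf8 conflict rule are made up for illustration, and the sketch assumes the `Field`/`Schema` API as it existed in the arrow crate at the time of this PR:

```rust
use arrow::datatypes::{DataType, Field, Schema};

/// Hypothetical helper: merge per-file schemas by field name, widening any
/// column whose inferred types disagree to Utf8 (the most permissive CSV
/// type). Field order follows the first schema in which a name appears.
fn merge_schemas(schemas: &[Schema]) -> Schema {
    let mut fields: Vec<Field> = Vec::new();
    for schema in schemas {
        for field in schema.fields() {
            match fields.iter_mut().find(|f| f.name() == field.name()) {
                Some(existing) => {
                    // conflicting types fall back to a nullable Utf8 column
                    if existing.data_type() != field.data_type() {
                        let name = existing.name().clone();
                        *existing = Field::new(&name, DataType::Utf8, true);
                    }
                }
                // first time we see this column: take it as-is
                None => fields.push(field.clone()),
            }
        }
    }
    Schema::new(fields)
}
```

       With the per-file loop in the hunk below, the call site would then just be `let merged = merge_schemas(&schemas);` once the loop finishes.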

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -172,7 +177,42 @@ pub fn infer_file_schema<R: Read + Seek>(
     // return the reader seek back to the start
     csv_reader.into_inner().seek(SeekFrom::Start(0))?;
 
-    Ok(Schema::new(fields))
+    Ok((Schema::new(fields), records_count))
+}
+
+/// Infer schema from a list of CSV files by reading through the first n records,
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut schemas = vec![];
+    let mut records_to_read = max_read_records.unwrap_or(std::usize::MAX);
+
+    for fname in files.iter() {
+        let (schema, records_read) = infer_file_schema(
+            &mut BufReader::new(File::open(fname)?),
+            delimiter,
+            Some(records_to_read),
+            has_header,
+        )?;
+        if records_read == 0 {
+            continue;
+        }
+        schemas.push(schema.clone());
+        records_to_read -= records_read;

Review comment:
       That's correct, it should always stay positive as long as `infer_file_schema` doesn't read more records than requested. I used `<=` below to exit early if anything unexpected happens. We could also return an error when it would drop below zero.
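
       A minimal sketch of the guard being discussed — hypothetical helper name, not the PR's code. Since `records_to_read` is a `usize` it can never actually go negative, so `saturating_sub` plus a `== 0` check gives the same defensive effect as the `<=` exit:

```rust
/// Hypothetical helper: decrement the remaining record budget after a file
/// has been read and report whether schema inference should stop.
fn consume_budget(records_to_read: &mut usize, records_read: usize) -> bool {
    // saturating_sub keeps the counter at zero instead of underflowing if a
    // reader ever returned more records than requested
    *records_to_read = records_to_read.saturating_sub(records_read);
    // stopping at 0 covers both "budget exactly spent" and any over-read,
    // the same cases the `<=` check above is meant to catch
    *records_to_read == 0
}
```

       Inside the loop, the guard would read `if consume_budget(&mut records_to_read, records_read) { break; }`; a stricter variant could return an error instead of breaking when an over-read is detected.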




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

