alamb commented on code in PR #10842:
URL: https://github.com/apache/datafusion/pull/10842#discussion_r1633401185

##########
datafusion/substrait/tests/testdata/query_1.json:
##########
@@ -0,0 +1,810 @@
+{

Review Comment:
   Can you please:
   1. Move this file into a directory that makes it clearer where it came from, perhaps `datafusion/substrait/tests/testdata/tpch_substrait_plans/query_1.json`.
   2. Add a README.md file in `datafusion/substrait/tests/testdata/tpch_substrait_plans` that explains that the files came from https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans?


##########
datafusion/substrait/tests/cases/tpch.rs:
##########
@@ -0,0 +1,63 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! tests contains in <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans>

Review Comment:
   This is very cool -- thank you.

   I think the context of this PR may be lost after merge, so some more documentation might help.

   Something like:

   ```suggestion
   //! TPCH `substrait_consumer` tests
   //!
   //! This module tests that substrait plans as json encoded protobuf can be
   //! correctly read as DataFusion plans.
   //!
   //! The input data comes from <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans>
   ```


##########
datafusion/substrait/tests/cases/mod.rs:
##########
@@ -19,3 +19,4 @@ mod logical_plans;
 mod roundtrip_logical_plan;
 mod roundtrip_physical_plan;
 mod serialize;
+mod tpch;

Review Comment:
   What do you think about renaming this module to `consumer_integration` to make it clearer that this is an integration test of existing substrait plans?


##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -569,7 +571,80 @@ pub async fn from_substrait_rel(
                 Ok(LogicalPlan::Values(Values { schema, values }))
             }
-            _ => not_impl_err!("Only NamedTable and VirtualTable reads are supported"),
+            Some(ReadType::LocalFiles(lf)) => {
+                fn extract_filename(name: &str) -> Option<String> {
+                    let corrected_url =
+                        if name.starts_with("file://") && !name.starts_with("file:///") {
+                            name.replacen("file://", "file:///", 1)

Review Comment:
   This makes all URLs absolute. Is that intended?

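To make the question above concrete, here is a minimal, std-only sketch of what the prefix correction does to a two-slash `file://` URL. It is not the PR's code: `correct_file_url` is a standalone copy of the `if`/`else` from the diff, while `path_after_scheme` and `file_name` are hypothetical helpers that stand in for `Url::parse(..)` and `url.path()` so the example needs no external crates.

```rust
use std::path::Path;

/// Standalone copy of the prefix correction from the diff:
/// a URL with exactly two slashes after `file:` gains a third one.
fn correct_file_url(name: &str) -> String {
    if name.starts_with("file://") && !name.starts_with("file:///") {
        name.replacen("file://", "file:///", 1)
    } else {
        name.to_string()
    }
}

/// Hypothetical, simplified stand-in for URL parsing: strips the
/// `file://` prefix and treats whatever follows as the path.
/// Returns `None` when the input has no `file://` scheme prefix.
fn path_after_scheme(url: &str) -> Option<&str> {
    url.strip_prefix("file://")
}

/// Hypothetical helper: corrects the URL, takes the path part, and
/// returns its last component as an owned string.
fn file_name(url: &str) -> Option<String> {
    let corrected = correct_file_url(url);
    let path = path_after_scheme(&corrected)?;
    Path::new(path)
        .file_name()
        .map(|f| f.to_string_lossy().to_string())
}

fn main() {
    let inputs = [
        "file://data/lineitem.csv",  // two slashes: gets rewritten
        "file:///data/lineitem.csv", // three slashes: left unchanged
        "data/lineitem.csv",         // no scheme: left unchanged, no path extracted
    ];
    for input in inputs {
        println!(
            "{input} -> {} (file name: {:?})",
            correct_file_url(input),
            file_name(input)
        );
    }
}
```

After the correction, `file://data/lineitem.csv` and `file:///data/lineitem.csv` are the same string, so the consumer can no longer tell whether the producer meant a relative path or an absolute one. As I understand it, a URL parser would otherwise read `data` in the two-slash form as the host rather than as a path segment, which is presumably why the correction was added. Since only the final file name is used afterwards, the lost distinction may not matter here, but it seems worth confirming.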

##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -569,7 +571,80 @@ pub async fn from_substrait_rel(
                 Ok(LogicalPlan::Values(Values { schema, values }))
             }
-            _ => not_impl_err!("Only NamedTable and VirtualTable reads are supported"),
+            Some(ReadType::LocalFiles(lf)) => {
+                fn extract_filename(name: &str) -> Option<String> {
+                    let corrected_url =
+                        if name.starts_with("file://") && !name.starts_with("file:///") {
+                            name.replacen("file://", "file:///", 1)
+                        } else {
+                            name.to_string()
+                        };
+
+                    Url::parse(&corrected_url).ok().and_then(|url| {
+                        let path = url.path();
+                        std::path::Path::new(path)
+                            .file_name()
+                            .map(|filename| filename.to_string_lossy().to_string())
+                    })
+                }
+
+                // we could use the file name to check the original table provider
+                // TODO: currently does not support multiple local files
+                let filename: Option<String> =
+                    lf.items.first().and_then(|x| match x.path_type.as_ref() {
+                        Some(UriFile(name)) => extract_filename(name),
+                        _ => None,
+                    });
+
+                if lf.items.len() > 1 || filename.is_none() {
+                    return not_impl_err!(
+                        "Only NamedTable and VirtualTable reads are supported"
+                    );
+                }
+                let name = filename.unwrap();
+                // directly use unwrap here since we could determine it is a valid one
+                let table_reference = TableReference::Bare { table: name.into() };
+                let t = ctx.table(table_reference).await?;
+                let t = t.into_optimized_plan()?;
+                match &read.projection {
+                    Some(MaskExpression { select, .. }) => match &select.as_ref() {
+                        Some(projection) => {
+                            let column_indices: Vec<usize> = projection
+                                .struct_items
+                                .iter()
+                                .map(|item| item.field as usize)
+                                .collect();
+                            match &t {

Review Comment:
   I think if you matched on `t` instead of `&t`, you could avoid the `scan.clone()` later on.


##########
datafusion/substrait/tests/testdata/tpch/lineitem.csv:
##########
@@ -0,0 +1,2 @@
+l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
+1,1,1,1,17,21168.23,0.04,0.02,'N','O','1996-03-13','1996-02-12','1996-03-22','DELIVER IN PERSON','TRUCK','egular courts above the'

Review Comment:
   I think a single line row is fine 👍


##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -569,7 +571,80 @@ pub async fn from_substrait_rel(
                 Ok(LogicalPlan::Values(Values { schema, values }))
             }
-            _ => not_impl_err!("Only NamedTable and VirtualTable reads are supported"),
+            Some(ReadType::LocalFiles(lf)) => {
+                fn extract_filename(name: &str) -> Option<String> {
+                    let corrected_url =
+                        if name.starts_with("file://") && !name.starts_with("file:///") {
+                            name.replacen("file://", "file:///", 1)
+                        } else {
+                            name.to_string()
+                        };
+
+                    Url::parse(&corrected_url).ok().and_then(|url| {
+                        let path = url.path();
+                        std::path::Path::new(path)
+                            .file_name()
+                            .map(|filename| filename.to_string_lossy().to_string())
+                    })
+                }
+
+                // we could use the file name to check the original table provider
+                // TODO: currently does not support multiple local files

Review Comment:
   Should we file a ticket for this feature?


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org