Re: [PR] feat: Add random row generator in data generator [datafusion-comet]
advancedxy commented on code in PR #451: URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607716091 ## spark/src/test/scala/org/apache/comet/DataGenerator.scala: ## @@ -95,4 +102,55 @@ class DataGenerator(r: Random) { Range(0, n).map(_ => r.nextLong()) } + // Generate a random row according to the schema; the string field in the struct can be + // configured to generate strings by passing a stringGen function. Other types are delegated + // to Spark's RandomDataGenerator. + def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = { +val fields = mutable.ArrayBuffer.empty[Any] +schema.fields.foreach { f => + f.dataType match { +case ArrayType(childType, nullable) => + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +val arr = mutable.ArrayBuffer.empty[Any] +val n = 1 // rand.nextInt(10) +var i = 0 +val generator = RandomDataGenerator.forType(childType, nullable, r) +assert(generator.isDefined, "Unsupported type") +val gen = generator.get +while (i < n) { + arr += gen() + i += 1 +} +arr.toSeq + } + fields += data +case StructType(children) => + fields += generateRow(StructType(children)) +case StringType if stringGen.isDefined => + val gen = stringGen.get + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +gen() + } + fields += data +case _ => + val generator = RandomDataGenerator.forType(f.dataType, f.nullable, r) + assert(generator.isDefined, "Unsupported type") + val gen = generator.get + fields += gen() + } +} +Row.fromSeq(fields.toSeq) + } + + def generateRows( Review Comment: Ah, yeah. It's not used in Comet tests yet. It will be used in https://github.com/apache/datafusion-comet/pull/433/files#diff-75bf612078cdacdd1c8544b3894bc941c0db803f5f904afc21fbe389fb68555bR1485-R1488 if that PR merges first.
Re: [PR] feat: Add random row generator in data generator [datafusion-comet]
advancedxy commented on code in PR #451: URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607709720 ## spark/src/test/scala/org/apache/comet/DataGenerator.scala: ## @@ -95,4 +102,55 @@ class DataGenerator(r: Random) { Range(0, n).map(_ => r.nextLong()) } + // Generate a random row according to the schema; the string field in the struct can be + // configured to generate strings by passing a stringGen function. Other types are delegated + // to Spark's RandomDataGenerator. + def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = { +val fields = mutable.ArrayBuffer.empty[Any] +schema.fields.foreach { f => + f.dataType match { +case ArrayType(childType, nullable) => + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +val arr = mutable.ArrayBuffer.empty[Any] +val n = 1 // rand.nextInt(10) +var i = 0 +val generator = RandomDataGenerator.forType(childType, nullable, r) +assert(generator.isDefined, "Unsupported type") +val gen = generator.get +while (i < n) { + arr += gen() + i += 1 +} +arr.toSeq + } + fields += data +case StructType(children) => + fields += generateRow(StructType(children)) +case StringType if stringGen.isDefined => + val gen = stringGen.get + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +gen() + } + fields += data +case _ => + val generator = RandomDataGenerator.forType(f.dataType, f.nullable, r) + assert(generator.isDefined, "Unsupported type") Review Comment: > Hmmm should we do the same (create nulls in forType) for the rest of types? Let me think about that. Maybe we can remove the array type case. However, I think a user-specified `stringGen` would still have to access `PROBABILITY_OF_NULL` to generate nullable strings. Otherwise, the `stringGen` method would have to handle nulls itself.
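A minimal sketch of that second option (the wrapper name and probability parameter are hypothetical, not part of the PR): wrap a user-supplied generator so it emits nulls itself, keeping `PROBABILITY_OF_NULL` out of `generateRow`'s StringType case.
```scala
import scala.util.Random

// Hypothetical wrapper: lifts a non-null string generator into one that
// returns null with probability pNull, so the caller's stringGen handles
// nullability on its own.
def withNulls(r: Random, pNull: Float)(gen: () => String): () => String =
  () => if (r.nextFloat() <= pNull) null else gen()

// Usage sketch: generateRow(schema, Some(withNulls(r, 0.1f)(() => r.nextString(8))))
```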
Re: [PR] feat: add hex scalar function [datafusion-comet]
advancedxy commented on code in PR #449: URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1607703859 ## core/src/execution/datafusion/expressions/scalar_funcs/hex.rs: ## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::sync::Arc; + +use arrow::array::as_string_array; +use arrow_array::StringArray; +use arrow_schema::DataType; +use datafusion::logical_expr::ColumnarValue; +use datafusion_common::{ +cast::{as_binary_array, as_int64_array}, +exec_err, DataFusionError, ScalarValue, +}; +use std::fmt::Write; + +fn hex_bytes(bytes: &[u8]) -> Vec<u8> { +let length = bytes.len(); +let mut value = vec![0; length * 2]; +let mut i = 0; +while i < length { +value[i * 2] = (bytes[i] & 0xF0) >> 4; +value[i * 2 + 1] = bytes[i] & 0x0F; +i += 1; +} +value +} + +fn hex_int64(num: i64) -> String { +if num >= 0 { +format!("{:X}", num) +} else { +format!("{:016X}", num as u64) +} +} + +fn hex_string(s: &str) -> Vec<u8> { +hex_bytes(s.as_bytes()) +} + +fn hex_bytes_to_string(bytes: &[u8]) -> Result<String, std::fmt::Error> { +let mut hex_string = String::with_capacity(bytes.len() * 2); +for byte in bytes { +write!(&mut hex_string, "{:01X}", byte)?; Review Comment: By the way, there's `hex_encode` in `scalar_func.rs`; maybe we should move that into this hex.rs instead, so related methods are grouped together. ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } } + test("hex") { +val str_table = "string_hex_table" +withTable(str_table) { Review Comment: Spark hex's input types are: ```scala override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(LongType, BinaryType, StringType)) ``` I think we need to test input with these types. By the way, it would be helpful to test with randomized input; I think you can leverage the `generateRows` in #451 and test hex in one SQL query such as: ``` select hex(long_col), hex(binary_col), hex(string_col) from $test_table ``` ## core/src/execution/datafusion/expressions/scalar_funcs/hex.rs: ## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::sync::Arc; + +use arrow::array::as_string_array; +use arrow_array::StringArray; +use arrow_schema::DataType; +use datafusion::logical_expr::ColumnarValue; +use datafusion_common::{ +cast::{as_binary_array, as_int64_array}, +exec_err, DataFusionError, ScalarValue, +}; +use std::fmt::Write; + +fn hex_bytes(bytes: &[u8]) -> Vec<u8> { +let length = bytes.len(); +let mut value = vec![0; length * 2]; +let mut i = 0; +while i < length { +value[i * 2] = (bytes[i] & 0xF0) >> 4; +value[i * 2 + 1] = bytes[i] & 0x0F; +i += 1; +} +value +} + +fn hex_int64(num: i64) -> String { +if num >= 0 { +format!("{:X}", num) +} else { +format!("{:016X}", num as u64) +} +} + +fn hex_string(s: &str) -> Vec<u8> { +hex_bytes(s.as_bytes()) +} + +fn hex_bytes_to_string(bytes: &[u8]) -> Result<String, std::fmt::Error> { +let mut hex_string = String::with_capacity(bytes.len() * 2); +for byte in bytes { +write!(&mut hex_string, "{:01X}", byte)?; Review Comment: hmm, I think it should be `:02X` instead?
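To see why `:02X` matters, a standalone snippet (not the PR's code): `{:01X}` only guarantees a minimum width of one hex digit, so bytes below `0x10` lose their leading zero and corrupt a byte-wise encoding, while `{:02X}` zero-pads every byte to two digits.
```rust
fn main() {
    let byte: u8 = 0x0A;
    // `{:01X}` pads to a minimum width of one digit: 0x0A prints as "A".
    assert_eq!(format!("{:01X}", byte), "A");
    // `{:02X}` zero-pads to two digits: 0x0A prints as "0A", which is what
    // a per-byte hex encoding needs.
    assert_eq!(format!("{:02X}", byte), "0A");
    println!("ok");
}
```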
Re: [PR] feat: Add random row generator in data generator [datafusion-comet]
kazuyukitanimura commented on code in PR #451: URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607690262 ## spark/src/test/scala/org/apache/comet/DataGenerator.scala: ## @@ -95,4 +102,55 @@ class DataGenerator(r: Random) { Range(0, n).map(_ => r.nextLong()) } + // Generate a random row according to the schema; the string field in the struct can be + // configured to generate strings by passing a stringGen function. Other types are delegated + // to Spark's RandomDataGenerator. + def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = { +val fields = mutable.ArrayBuffer.empty[Any] +schema.fields.foreach { f => + f.dataType match { +case ArrayType(childType, nullable) => + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +val arr = mutable.ArrayBuffer.empty[Any] +val n = 1 // rand.nextInt(10) +var i = 0 +val generator = RandomDataGenerator.forType(childType, nullable, r) +assert(generator.isDefined, "Unsupported type") +val gen = generator.get +while (i < n) { + arr += gen() + i += 1 +} +arr.toSeq + } + fields += data +case StructType(children) => + fields += generateRow(StructType(children)) +case StringType if stringGen.isDefined => + val gen = stringGen.get + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +gen() + } + fields += data +case _ => + val generator = RandomDataGenerator.forType(f.dataType, f.nullable, r) + assert(generator.isDefined, "Unsupported type") + val gen = generator.get + fields += gen() + } +} +Row.fromSeq(fields.toSeq) + } + + def generateRows( Review Comment: hmmm, currently it is not testing anything Comet-specific, so it is a bit weird
Re: [PR] feat: Add random row generator in data generator [datafusion-comet]
kazuyukitanimura commented on code in PR #451: URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607689388 ## spark/src/test/scala/org/apache/comet/DataGenerator.scala: ## @@ -95,4 +102,55 @@ class DataGenerator(r: Random) { Range(0, n).map(_ => r.nextLong()) } + // Generate a random row according to the schema; the string field in the struct can be + // configured to generate strings by passing a stringGen function. Other types are delegated + // to Spark's RandomDataGenerator. + def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = { +val fields = mutable.ArrayBuffer.empty[Any] +schema.fields.foreach { f => + f.dataType match { +case ArrayType(childType, nullable) => + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +val arr = mutable.ArrayBuffer.empty[Any] +val n = 1 // rand.nextInt(10) +var i = 0 +val generator = RandomDataGenerator.forType(childType, nullable, r) +assert(generator.isDefined, "Unsupported type") +val gen = generator.get +while (i < n) { + arr += gen() + i += 1 +} +arr.toSeq + } + fields += data +case StructType(children) => + fields += generateRow(StructType(children)) +case StringType if stringGen.isDefined => + val gen = stringGen.get + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +gen() + } + fields += data +case _ => + val generator = RandomDataGenerator.forType(f.dataType, f.nullable, r) + assert(generator.isDefined, "Unsupported type") Review Comment: Hmmm, should we do the same (create nulls in `forType`) for the rest of the types? Then we would not need the duplicated `PROBABILITY_OF_NULL` logic
Re: [PR] fix: Compute murmur3 hash with dictionary input correctly [datafusion-comet]
kazuyukitanimura commented on code in PR #433: URL: https://github.com/apache/datafusion-comet/pull/433#discussion_r1607685866 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1452,17 +1452,55 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { withTable(table) { sql(s"create table $table(col string, a int, b float) using parquet") sql(s""" - |insert into $table values - |('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) - |, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) - |""".stripMargin) - checkSparkAnswerAndOperator(""" - |select - |md5(col), md5(cast(a as string)), md5(cast(b as string)), - |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, a, col), - |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), sha2(col, 512), sha2(col, 128) - |from test + |insert into $table values + |('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) + |, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) |""".stripMargin) + checkSparkAnswerAndOperator(""" + |select + |md5(col), md5(cast(a as string)), md5(cast(b as string)), + |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, a, col), + |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), sha2(col, 512), sha2(col, 128) + |from test + |""".stripMargin) +} + } +} + } + + test("hash functions with random input") { +val dataGen = DataGenerator.DEFAULT +// sufficient number of rows to create dictionary encoded ArrowArray. +val randomNumRows = 1000 Review Comment: Yes, `makeParquetFileAllTypes` or some existing dictionary-related tests may be helpful
Re: [I] Support unnest for struct data type [datafusion]
duongcongtoai commented on issue #10264: URL: https://github.com/apache/datafusion/issues/10264#issuecomment-212133 Please help review this PR, everyone: https://github.com/apache/datafusion/pull/10429
Re: [PR] Support Substrait's VirtualTables [datafusion]
jonahgao commented on code in PR #10531: URL: https://github.com/apache/datafusion/pull/10531#discussion_r1607637374 ## datafusion/substrait/tests/cases/roundtrip_logical_plan.rs: ## @@ -607,6 +608,15 @@ async fn qualified_catalog_schema_table_reference() -> Result<()> { roundtrip("SELECT a,b,c,d,e FROM datafusion.public.data;").await } +#[tokio::test] +async fn roundtrip_local_relation() -> Result<()> { +let ctx = create_context().await?; +let row1 = vec![lit(1), lit("a")]; +let row2 = vec![lit(2), lit("b")]; +let plan = LogicalPlanBuilder::values(vec![row1, row2])?.build()?; Review Comment: I agree. For a Values plan, these aliases should be safe to ignore. I think we could use `assert_expected_plan` during testing to ignore aliases, for example: ```rust #[tokio::test] async fn roundtrip_values() -> Result<()> { assert_expected_plan( "VALUES (1, 'a', [[1,2,3], []], STRUCT(true, 1))", "Values: (Int64(1), Utf8(\"a\"), List([[1, 2, 3], []]), Struct({c0:true,c1:1}))", ) .await } ``` Additionally, aliases should be ignored when handling `LogicalPlan::Values` in `to_substrait_rel`.
Re: [PR] Improve `UserDefinedLogicalNode::from_template` API to return `Result` [datafusion]
lewiszlw commented on PR #10575: URL: https://github.com/apache/datafusion/pull/10575#issuecomment-2121658693 If we want to apply the same change to the `UserDefinedLogicalNodeCore` trait, we should add a `Sized` trait bound to it. Do you need me to change it in this PR? @alamb
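For readers following along, a minimal reproduction of why the bound is needed (illustrative trait names, not the DataFusion definitions): a trait method that returns `Self` by value only compiles when `Self: Sized`, either as a supertrait bound or as a bound on the method.
```rust
// Bound on the whole trait: simplest, but the trait is no longer object-safe.
trait NodeCore: Sized {
    fn with_exprs_and_inputs(&self, exprs: Vec<String>) -> Result<Self, String>;
}

// Bound on the method only: keeps the trait usable as `dyn NodeCoreAlt`.
trait NodeCoreAlt {
    fn with_exprs_and_inputs(&self, exprs: Vec<String>) -> Result<Self, String>
    where
        Self: Sized;
}

fn main() {}
```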
Re: [PR] Improve `UserDefinedLogicalNode::from_template` API to return `Result` [datafusion]
lewiszlw commented on code in PR #10575: URL: https://github.com/apache/datafusion/pull/10575#discussion_r1607537742 ## datafusion/expr/src/logical_plan/extension.rs: ## @@ -76,27 +76,20 @@ pub trait UserDefinedLogicalNode: fmt::Debug + Send + Sync { /// For example: `TopK: k=10` fn fmt_for_explain(&self, f: &mut fmt::Formatter) -> fmt::Result; -/// Create a new `ExtensionPlanNode` with the specified children +/// Create a new `UserDefinedLogicalNode` with the specified children /// and expressions. This function is used during optimization /// when the plan is being rewritten and a new instance of the -/// `ExtensionPlanNode` must be created. +/// `UserDefinedLogicalNode` must be created. /// /// Note that exprs and inputs are in the same order as the result /// of self.inputs and self.exprs. /// -/// So, `self.from_template(exprs, ..).expressions() == exprs -// -// TODO(clippy): This should probably be renamed to use a `with_*` prefix. Something -// like `with_template`, or `with_exprs_and_inputs`. -// -// Also, I think `ExtensionPlanNode` has been renamed to `UserDefinedLogicalNode` -// but the doc comments have not been updated. -#[allow(clippy::wrong_self_convention)] -fn from_template( +/// So, `self.with_exprs_and_inputs(exprs, ..).expressions() == exprs +fn with_exprs_and_inputs( &self, exprs: &[Expr], inputs: &[LogicalPlan], -) -> Arc<dyn UserDefinedLogicalNode>; +) -> Result<Arc<dyn UserDefinedLogicalNode>>; Review Comment: Yes, looks better.
Re: [PR] Improve `UserDefinedLogicalNode::from_template` API to return `Result` [datafusion]
lewiszlw commented on code in PR #10575: URL: https://github.com/apache/datafusion/pull/10575#discussion_r1607537219 ## datafusion/expr/src/logical_plan/extension.rs: ## @@ -76,27 +76,20 @@ pub trait UserDefinedLogicalNode: fmt::Debug + Send + Sync { /// For example: `TopK: k=10` fn fmt_for_explain(&self, f: &mut fmt::Formatter) -> fmt::Result; -/// Create a new `ExtensionPlanNode` with the specified children +/// Create a new `UserDefinedLogicalNode` with the specified children /// and expressions. This function is used during optimization /// when the plan is being rewritten and a new instance of the -/// `ExtensionPlanNode` must be created. +/// `UserDefinedLogicalNode` must be created. /// /// Note that exprs and inputs are in the same order as the result /// of self.inputs and self.exprs. /// -/// So, `self.from_template(exprs, ..).expressions() == exprs -// -// TODO(clippy): This should probably be renamed to use a `with_*` prefix. Something -// like `with_template`, or `with_exprs_and_inputs`. -- Also, I think `ExtensionPlanNode` has been renamed to `UserDefinedLogicalNode` -// but the doc comments have not been updated. -#[allow(clippy::wrong_self_convention)] Review Comment: LGTM
[PR] Fix compilation of datafusion-cli on 32bit targets [datafusion]
nathaniel-daniel opened a new pull request, #10594: URL: https://github.com/apache/datafusion/pull/10594 ## Which issue does this PR close? Closes #10552. ## Rationale for this change This PR fixes compilation of the datafusion-cli crate on 32bit targets. ## What changes are included in this PR? A constant assumed that a `usize` was at least 64 bits wide. I changed the constant's type to `u64` and convert it to a `usize` later on, returning an error if the conversion overflows. ## Are these changes tested? These changes are not tested; I assumed adding CI for 32bit targets was too invasive for this PR. ## Are there any user-facing changes? There should be no user-facing changes or breaking changes.
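A minimal sketch of the pattern described (constant name and limit are illustrative, not datafusion-cli's actual values):
```rust
// Declared as u64 so the value is representable on 32-bit targets too; as a
// usize constant, this 10 GiB literal would overflow a 32-bit usize.
const MAX_BUFFER_BYTES: u64 = 10 * 1024 * 1024 * 1024;

fn buffer_capacity() -> Result<usize, String> {
    // Convert at the point of use and surface an error instead of truncating.
    usize::try_from(MAX_BUFFER_BYTES)
        .map_err(|_| format!("{} bytes exceeds this target's usize range", MAX_BUFFER_BYTES))
}

fn main() {
    println!("{:?}", buffer_capacity());
}
```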
Re: [PR] Migrate testing optimizer rules to use `rewrite` API [datafusion]
lewiszlw commented on PR #10576: URL: https://github.com/apache/datafusion/pull/10576#issuecomment-2121613747 The `CommonSubexprEliminate` rule has not been migrated yet.
[PR] Fixes bug expect `Date32Array` but returns Int32Array [datafusion]
xinlifoobar opened a new pull request, #10593: URL: https://github.com/apache/datafusion/pull/10593 ## Which issue does this PR close? Closes #10587 ## Rationale for this change This fixes a bug: when reading a Date32 or Date64 column from a Parquet file, DataFusion currently returns an Int32 array. ## What changes are included in this PR? Adds conversions in the `get_statistic` macro from Int32 to Date32 and Date64 respectively. ## Are these changes tested? Yes ## Are there any user-facing changes?
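As a rough illustration of the fix's direction (a sketch under assumptions, not the actual `get_statistic` macro code): Parquet physically stores Date32 min/max statistics as `i32`, so the extracted values have to be wrapped in a `Date32Array` rather than an `Int32Array`.
```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Date32Array};

// The i32 statistics are days since the Unix epoch; wrapping them in a
// Date32Array makes the returned array's data type match the column's
// logical type instead of its physical storage type.
fn date32_min_stats(mins: Vec<Option<i32>>) -> ArrayRef {
    Arc::new(Date32Array::from(mins))
}
```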
Re: [PR] fix: Compute murmur3 hash with dictionary input correctly [datafusion-comet]
advancedxy commented on code in PR #433: URL: https://github.com/apache/datafusion-comet/pull/433#discussion_r1607506665 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1452,17 +1452,55 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { withTable(table) { sql(s"create table $table(col string, a int, b float) using parquet") sql(s""" - |insert into $table values - |('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) - |, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) - |""".stripMargin) - checkSparkAnswerAndOperator(""" - |select - |md5(col), md5(cast(a as string)), md5(cast(b as string)), - |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, a, col), - |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), sha2(col, 512), sha2(col, 128) - |from test + |insert into $table values + |('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) + |, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) |""".stripMargin) + checkSparkAnswerAndOperator(""" + |select + |md5(col), md5(cast(a as string)), md5(cast(b as string)), + |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, a, col), + |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), sha2(col, 512), sha2(col, 128) + |from test + |""".stripMargin) +} + } +} + } + + test("hash functions with random input") { +val dataGen = DataGenerator.DEFAULT +// sufficient number of rows to create dictionary encoded ArrowArray. +val randomNumRows = 1000 Review Comment: > E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows So dictionary encoding is only triggered with enough repetition?
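A sketch of the suggested setup (assuming the `DataGenerator` constructor and `generateRow` from #451; schema and counts are illustrative): generate a small pool of rows and repeat it, so each value occurs often enough for the writer to produce a dictionary-encoded column.
```scala
import scala.util.Random

import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("col", StringType, nullable = true)))
val dataGen = new DataGenerator(new Random(42))
// 100 distinct rows repeated 10x => 1000 rows with heavy value repetition,
// the precondition for a dictionary-encoded column chunk.
val pool = (0 until 100).map(_ => dataGen.generateRow(schema))
val rows = Seq.fill(10)(pool).flatten
```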
Re: [I] PR build for Linux Java 11 with Spark 3.4 is not running [datafusion-comet]
advancedxy commented on issue #389: URL: https://github.com/apache/datafusion-comet/issues/389#issuecomment-2121583062 Java 11 is excluded on purpose on the pull request event, as there are already too many combinations. Java 8 and Java 17 with Spark 3.4 on Linux are tested, which should be sufficient? They are all tested in a post-commit event though.
Re: [PR] feat: Add random row generator in data generator [datafusion-comet]
advancedxy commented on code in PR #451: URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607484041 ## spark/src/test/scala/org/apache/comet/DataGenerator.scala: ## @@ -95,4 +102,55 @@ class DataGenerator(r: Random) { Range(0, n).map(_ => r.nextLong()) } + // Generate a random row according to the schema; the string field in the struct can be + // configured to generate strings by passing a stringGen function. Other types are delegated + // to Spark's RandomDataGenerator. + def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = { +val fields = mutable.ArrayBuffer.empty[Any] +schema.fields.foreach { f => + f.dataType match { +case ArrayType(childType, nullable) => + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +val arr = mutable.ArrayBuffer.empty[Any] +val n = 1 // rand.nextInt(10) +var i = 0 +val generator = RandomDataGenerator.forType(childType, nullable, r) +assert(generator.isDefined, "Unsupported type") +val gen = generator.get +while (i < n) { + arr += gen() + i += 1 +} +arr.toSeq + } + fields += data +case StructType(children) => + fields += generateRow(StructType(children)) +case StringType if stringGen.isDefined => + val gen = stringGen.get + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +gen() + } + fields += data +case _ => + val generator = RandomDataGenerator.forType(f.dataType, f.nullable, r) + assert(generator.isDefined, "Unsupported type") + val gen = generator.get + fields += gen() + } +} +Row.fromSeq(fields.toSeq) + } + + def generateRows( Review Comment: Yes. I will use this in the hash function tests. Also, it should be useful for other randomized-input tests, since it will be much easier to generate a data frame with multiple columns. ## spark/src/test/scala/org/apache/comet/DataGenerator.scala: ## @@ -95,4 +102,55 @@ class DataGenerator(r: Random) { Range(0, n).map(_ => r.nextLong()) } + // Generate a random row according to the schema; the string field in the struct can be + // configured to generate strings by passing a stringGen function. Other types are delegated + // to Spark's RandomDataGenerator. + def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = { +val fields = mutable.ArrayBuffer.empty[Any] +schema.fields.foreach { f => + f.dataType match { +case ArrayType(childType, nullable) => + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +val arr = mutable.ArrayBuffer.empty[Any] +val n = 1 // rand.nextInt(10) +var i = 0 +val generator = RandomDataGenerator.forType(childType, nullable, r) +assert(generator.isDefined, "Unsupported type") +val gen = generator.get +while (i < n) { + arr += gen() + i += 1 +} +arr.toSeq + } + fields += data +case StructType(children) => + fields += generateRow(StructType(children)) +case StringType if stringGen.isDefined => + val gen = stringGen.get + val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) { +null + } else { +gen() + } + fields += data +case _ => + val generator = RandomDataGenerator.forType(f.dataType, f.nullable, r) + assert(generator.isDefined, "Unsupported type") Review Comment: The `RandomDataGenerator.forType(f.dataType, f.nullable, r)` call already handles nullability, so I don't think we need to handle it here. See: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala#L380-L392
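To make the intended usage concrete, a hypothetical test sketch (it assumes a `SparkSession` named `spark`, the `DataGenerator` from this PR, and illustrative column names):
```scala
import scala.util.Random

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("long_col", LongType, nullable = true),
  StructField("string_col", StringType, nullable = true)))
val dataGen = new DataGenerator(new Random(42))
// Build 1000 random rows and register them as a temp view for SQL tests.
val rows = (0 until 1000).map(_ => dataGen.generateRow(schema))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.createOrReplaceTempView("test_table")
// e.g. checkSparkAnswerAndOperator("select hash(long_col, string_col) from test_table")
```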
Re: [I] bug: CAST string to integer does not handle all invalid inputs [datafusion-comet]
andygrove closed issue #431: bug: CAST string to integer does not handle all invalid inputs URL: https://github.com/apache/datafusion-comet/issues/431
Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]
andygrove merged PR #453: URL: https://github.com/apache/datafusion-comet/pull/453
Re: [PR] Count distinct support multiple expressions [datafusion]
github-actions[bot] closed pull request #5939: Count distinct support multiple expressions URL: https://github.com/apache/datafusion/pull/5939
Re: [PR] Use stabalized aws-sdk and clap versions in `datafusion-cli` [datafusion]
github-actions[bot] commented on PR #9659: URL: https://github.com/apache/datafusion/pull/9659#issuecomment-2121548399 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
[PR] Minor: Generate the supported Spark builtin expression list into MD file [datafusion-comet]
comphead opened a new pull request, #455: URL: https://github.com/apache/datafusion-comet/pull/455 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes tested?
Re: [PR] Coverage: Add a manual test to show what Spark built in expression the DF can support directly [datafusion-comet]
comphead commented on PR #331: URL: https://github.com/apache/datafusion-comet/pull/331#issuecomment-2121507170 Thanks everyone for the review
Re: [PR] Coverage: Add a manual test to show what Spark built in expression the DF can support directly [datafusion-comet]
comphead merged PR #331: URL: https://github.com/apache/datafusion-comet/pull/331
Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]
vaibhawvipul commented on code in PR #450: URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607440985 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), DataTypes.ShortType) } - ignore("cast StringType to IntegerType") { + test("cast StringType to IntegerType") { // test with hand-picked values castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType) // fuzz test -castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), DataTypes.IntegerType) +castTest( + gen +.generateStrings(dataSize, numericPattern, 8) +.toDF("a") +.withColumn("a", functions.trim($"a")), + DataTypes.IntegerType) Review Comment: +1
Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]
vaibhawvipul closed pull request #450: bug fix: Fix fuzz testcase for cast string to integer URL: https://github.com/apache/datafusion-comet/pull/450
Re: [PR] Fix: Sort Merge Join LeftSemi issues when JoinFilter is set [datafusion]
alamb commented on PR #10304: URL: https://github.com/apache/datafusion/pull/10304#issuecomment-2121484917 🚀
Re: [PR] build(deps): upgrade sqlparser to 0.46.0 [datafusion]
alamb commented on code in PR #10392: URL: https://github.com/apache/datafusion/pull/10392#discussion_r1607425124 ## datafusion/sqllogictest/test_files/array.slt: ## Review Comment: Thanks @tisonkun -- sounds like we should fix that upstream and then I can maybe make a new sqlparser release (or maybe I can make a minor patch release) 🤔
Re: [I] Implement a way to preserve partitioning through `UnionExec` without losing ordering [datafusion]
alamb commented on issue #10314: URL: https://github.com/apache/datafusion/issues/10314#issuecomment-2121468243 > Hi @alamb, I am trying to work on this. > > I am not very familiar with the `InterleaveExec` in the optimizer. As an initial thought, the InterleaveExec is acting as a **Repartition with an equal number of input partitions and output partitions**, and thus a natural idea is to reuse `streaming_merge` with respect to the input size. Wdyt? Hi @xinlifoobar -- this sounds like it is on the right track
Re: [I] Using `Expr::field` panics [datafusion]
jayzhan211 closed issue #10565: Using `Expr::field` panics URL: https://github.com/apache/datafusion/issues/10565
Re: [PR] Remove `Expr::GetIndexedField`, replace `Expr::{field,index,range}` with `FieldAccessor`, `IndexAccessor`, and `SliceAccessor` [datafusion]
jayzhan211 merged PR #10568: URL: https://github.com/apache/datafusion/pull/10568
Re: [I] Remove `Expr::GetIndexedField` and `GetFieldAccess` and always use function `get_field` for indexing [datafusion]
jayzhan211 closed issue #10374: Remove `Expr::GetIndexedField` and `GetFieldAccess` and always use function `get_field` for indexing URL: https://github.com/apache/datafusion/issues/10374
Re: [PR] Remove `Expr::GetIndexedField`, replace `Expr::{field,index,range}` with `FieldAccessor`, `IndexAccessor`, and `SliceAccessor` [datafusion]
jayzhan211 commented on PR #10568: URL: https://github.com/apache/datafusion/pull/10568#issuecomment-2121463057 Thanks, @alamb !
Re: [PR] Tsaucer/prepare tpch examples for ci [datafusion-python]
timsaucer commented on PR #710: URL: https://github.com/apache/datafusion-python/pull/710#issuecomment-2121459794 It looks like CI is running correctly and also caching the data. I’ll rebase in the morning and get the PR ready to merge.
Re: [PR] feat: Supports UUID column [datafusion-comet]
huaxingao commented on code in PR #395: URL: https://github.com/apache/datafusion-comet/pull/395#discussion_r1607409997 ## common/src/main/java/org/apache/comet/parquet/CometParquetToSparkSchemaConverter.scala: ## @@ -0,0 +1,403 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.comet.parquet + +import org.apache.hadoop.conf.Configuration +import org.apache.parquet.io.{ColumnIO, GroupColumnIO, PrimitiveColumnIO} +import org.apache.parquet.schema._ +import org.apache.parquet.schema.LogicalTypeAnnotation._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ +import org.apache.parquet.schema.Type.Repetition._ +import org.apache.spark.sql.catalyst.util.ResolveDefaultColumns.normalizeFieldName +import org.apache.spark.sql.execution.datasources.parquet.{ParquetColumn, ParquetToSparkSchemaConverter} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types._ + +class CometParquetToSparkSchemaConverter( +assumeBinaryIsString: Boolean = SQLConf.PARQUET_BINARY_AS_STRING.defaultValue.get, +assumeInt96IsTimestamp: Boolean = SQLConf.PARQUET_INT96_AS_TIMESTAMP.defaultValue.get, +caseSensitive: Boolean = SQLConf.CASE_SENSITIVE.defaultValue.get, +inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get, +nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) extends ParquetToSparkSchemaConverter { + + def this(conf: Configuration) = this( +assumeBinaryIsString = conf.get(SQLConf.PARQUET_BINARY_AS_STRING.key).toBoolean, +assumeInt96IsTimestamp = conf.get(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key).toBoolean, +caseSensitive = conf.get(SQLConf.CASE_SENSITIVE.key).toBoolean, +inferTimestampNTZ = conf.get(SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.key).toBoolean, +nanosAsLong = conf.get(SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.key).toBoolean) + + override def convertField( + field: ColumnIO, + sparkReadType: Option[DataType] = None): ParquetColumn = { +val targetType = sparkReadType.map { + case udt: UserDefinedType[_] => udt.sqlType + case otherType => otherType +} +field match { + case primitiveColumn: PrimitiveColumnIO => convertPrimitiveField(primitiveColumn, targetType) + case groupColumn: GroupColumnIO => convertGroupField(groupColumn, targetType) +} + } + + private def convertPrimitiveField( + primitiveColumn: PrimitiveColumnIO, + sparkReadType: Option[DataType] = None): ParquetColumn = { +val parquetType = primitiveColumn.getType.asPrimitiveType() +val typeAnnotation = primitiveColumn.getType.getLogicalTypeAnnotation +val typeName = primitiveColumn.getPrimitive + +def typeString = + if (typeAnnotation == null) s"$typeName" else s"$typeName ($typeAnnotation)" + +def typeNotImplemented() = + throw new 
UnsupportedOperationException("unsupported Parquet type: " + typeString) + +def illegalType() = + throw new UnsupportedOperationException("Illegal Parquet type: " + typeString) + +// When maxPrecision = -1, we skip precision range check, and always respect the precision +// specified in field.getDecimalMetadata. This is useful when interpreting decimal types stored +// as binaries with variable lengths. +def makeDecimalType(maxPrecision: Int = -1): DecimalType = { + val decimalLogicalTypeAnnotation = typeAnnotation +.asInstanceOf[DecimalLogicalTypeAnnotation] + val precision = decimalLogicalTypeAnnotation.getPrecision + val scale = decimalLogicalTypeAnnotation.getScale + + CometParquetSchemaConverter.checkConversionRequirement( +maxPrecision == -1 || 1 <= precision && precision <= maxPrecision, +s"Invalid decimal precision: $typeName cannot store $precision digits (max $maxPrecision)") + + DecimalType(precision, scale) +} + +val sparkType = sparkReadType.getOrElse(typeName match { + case BOOLEAN => BooleanType + + case FLOAT => FloatType + + case DOUBLE => DoubleType + + case INT32 => +typeAnnotation match { + case intTypeAnnotation: IntLogicalTypeAnnotation if intTypeA
Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]
andygrove commented on PR #453: URL: https://github.com/apache/datafusion-comet/pull/453#issuecomment-2121447145 > So do you think the perf improvement is because we are no longer trimming? We are still trimming, but we are no longer performing the redundant conditional logic in the main loop to skip leading and trailing whitespace.
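A minimal sketch of the idea (not Comet's actual cast implementation): trim the whitespace once up front, then let the digit loop run over the trimmed slice instead of branching on every byte to detect leading and trailing whitespace.
```rust
// Trimming once moves the whitespace handling out of the hot parsing loop.
fn cast_string_to_i32(s: &str) -> Option<i32> {
    let trimmed = s.trim();
    if trimmed.is_empty() {
        return None; // blank input casts to NULL (as in Spark's non-ANSI cast)
    }
    trimmed.parse::<i32>().ok()
}

fn main() {
    assert_eq!(cast_string_to_i32("  123  "), Some(123));
    assert_eq!(cast_string_to_i32("12a"), None);
}
```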
Re: [I] Cast String to Date ANSI Mode - Spark 3.2 - Mismatch between Spark and Comet Errors [datafusion-comet]
parthchandra commented on issue #440: URL: https://github.com/apache/datafusion-comet/issues/440#issuecomment-2121399387 Is this an issue of just a mismatch between error messages? Or is the cast actually not doing the right thing with Spark 3.2?
[I] chore [datafusion-comet]
parthchandra opened a new issue, #454: URL: https://github.com/apache/datafusion-comet/issues/454 ### Describe the bug Separate out the extended explain info unit test into Spark-dependent and Spark-independent parts ### Steps to reproduce _No response_ ### Expected behavior _No response_ ### Additional context The test `test("explain comet")` has a conditional check which executes only if `supportsExtendedExplainInfo` is true. This can be moved into a separate unit test that is run only for Spark 4.0 and greater, which actually supports extended info.
Re: [PR] test: Fix explain with exteded info comet test [datafusion-comet]
parthchandra commented on code in PR #436: URL: https://github.com/apache/datafusion-comet/pull/436#discussion_r1607380766 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1399,7 +1399,7 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { CometConf.COMET_EXEC_ENABLED.key -> "true", CometConf.COMET_SHUFFLE_ENFORCE_MODE_ENABLED.key -> "true", CometConf.COMET_EXEC_ALL_OPERATOR_ENABLED.key -> "true", - "spark.sql.extendedExplainProvider" -> "org.apache.comet.ExtendedExplainInfo") { Review Comment: Will do that @kazuyukitanimura
Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]
parthchandra commented on PR #453: URL: https://github.com/apache/datafusion-comet/pull/453#issuecomment-2121348910 So do you think the perf improvement is because we are no longer trimming?
Re: [PR] feat: Supports UUID column [datafusion-comet]
parthchandra commented on code in PR #395: URL: https://github.com/apache/datafusion-comet/pull/395#discussion_r1607363360 ## common/src/main/java/org/apache/comet/parquet/CometParquetToSparkSchemaConverter.scala: ## @@ -0,0 +1,403 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.comet.parquet + +import org.apache.hadoop.conf.Configuration +import org.apache.parquet.io.{ColumnIO, GroupColumnIO, PrimitiveColumnIO} +import org.apache.parquet.schema._ +import org.apache.parquet.schema.LogicalTypeAnnotation._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ +import org.apache.parquet.schema.Type.Repetition._ +import org.apache.spark.sql.catalyst.util.ResolveDefaultColumns.normalizeFieldName +import org.apache.spark.sql.execution.datasources.parquet.{ParquetColumn, ParquetToSparkSchemaConverter} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types._ + +class CometParquetToSparkSchemaConverter( +assumeBinaryIsString: Boolean = SQLConf.PARQUET_BINARY_AS_STRING.defaultValue.get, +assumeInt96IsTimestamp: Boolean = SQLConf.PARQUET_INT96_AS_TIMESTAMP.defaultValue.get, +caseSensitive: Boolean = SQLConf.CASE_SENSITIVE.defaultValue.get, +inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get, +nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) extends ParquetToSparkSchemaConverter { + + def this(conf: Configuration) = this( +assumeBinaryIsString = conf.get(SQLConf.PARQUET_BINARY_AS_STRING.key).toBoolean, +assumeInt96IsTimestamp = conf.get(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key).toBoolean, +caseSensitive = conf.get(SQLConf.CASE_SENSITIVE.key).toBoolean, +inferTimestampNTZ = conf.get(SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.key).toBoolean, +nanosAsLong = conf.get(SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.key).toBoolean) + + override def convertField( + field: ColumnIO, + sparkReadType: Option[DataType] = None): ParquetColumn = { +val targetType = sparkReadType.map { + case udt: UserDefinedType[_] => udt.sqlType + case otherType => otherType +} +field match { + case primitiveColumn: PrimitiveColumnIO => convertPrimitiveField(primitiveColumn, targetType) + case groupColumn: GroupColumnIO => convertGroupField(groupColumn, targetType) +} + } + + private def convertPrimitiveField( + primitiveColumn: PrimitiveColumnIO, + sparkReadType: Option[DataType] = None): ParquetColumn = { +val parquetType = primitiveColumn.getType.asPrimitiveType() +val typeAnnotation = primitiveColumn.getType.getLogicalTypeAnnotation +val typeName = primitiveColumn.getPrimitive + +def typeString = + if (typeAnnotation == null) s"$typeName" else s"$typeName ($typeAnnotation)" + +def typeNotImplemented() = + throw new UnsupportedOperationException("unsupported Parquet type: " + typeString) + +def illegalType() = + throw new UnsupportedOperationException("Illegal Parquet type: " + typeString) + +// When maxPrecision = -1, we skip precision range check, and always respect the precision +// specified in field.getDecimalMetadata. This is useful when interpreting decimal types stored +// as binaries with variable lengths. +def makeDecimalType(maxPrecision: Int = -1): DecimalType = { + val decimalLogicalTypeAnnotation = typeAnnotation +.asInstanceOf[DecimalLogicalTypeAnnotation] + val precision = decimalLogicalTypeAnnotation.getPrecision + val scale = decimalLogicalTypeAnnotation.getScale + + CometParquetSchemaConverter.checkConversionRequirement( +maxPrecision == -1 || 1 <= precision && precision <= maxPrecision, +s"Invalid decimal precision: $typeName cannot store $precision digits (max $maxPrecision)") + + DecimalType(precision, scale) +} + +val sparkType = sparkReadType.getOrElse(typeName match { + case BOOLEAN => BooleanType + + case FLOAT => FloatType + + case DOUBLE => DoubleType + + case INT32 => +typeAnnotation match { + case intTypeAnnotation: IntLogicalTypeAnnotation if intTy
Re: [I] Sort Merge Join. LeftSemi issues [datafusion]
comphead closed issue #10379: Sort Merge Join. LeftSemi issues URL: https://github.com/apache/datafusion/issues/10379 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix: Sort Merge Join LeftSemi issues when JoinFilter is set [datafusion]
comphead merged PR #10304: URL: https://github.com/apache/datafusion/pull/10304 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] test: add more tests for statistics reading [datafusion]
NGA-TRAN opened a new pull request, #10592: URL: https://github.com/apache/datafusion/pull/10592 ## Which issue does this PR close? More tests for https://github.com/apache/datafusion/issues/10453 ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Docs: Update PR workflow documentation [datafusion]
alamb commented on PR #10532: URL: https://github.com/apache/datafusion/pull/10532#issuecomment-2121194261 I plan to incorporate the feedback on this PR, I just haven't had a chance yet. I hope to do so over the next few days -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add reference visitor `TreeNode` APIs [datafusion]
alamb commented on PR #10543: URL: https://github.com/apache/datafusion/pull/10543#issuecomment-2121191619 What do we think about merging this PR and filing a follow on ticket to unify the APIs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] DataFusion reads Date32 and Date64 parquet statistics in as Int32Array [datafusion]
alamb commented on issue #10587: URL: https://github.com/apache/datafusion/issues/10587#issuecomment-2121184299 Thanks for pointing that out @edmondop -- yes the min/max seem to be extracted as `Int32Array`s -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
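To make the reported behavior concrete: `Date32` is physically a 32-bit day count, so the extracted statistics can be reinterpreted as dates with the ordinary arrow-rs cast kernel. A minimal sketch with made-up values (this is not the actual extraction code):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> Result<(), arrow::error::ArrowError> {
    // min values as they currently come back from statistics extraction:
    // plain i32 day counts since the Unix epoch (values are illustrative)
    let mins: ArrayRef = Arc::new(Int32Array::from(vec![18628, 18993]));

    // reinterpret them as the logical type declared in the file schema
    let mins_as_dates = cast(&mins, &DataType::Date32)?;
    assert_eq!(mins_as_dates.data_type(), &DataType::Date32);
    Ok(())
}
```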
Re: [PR] Stop copying LogicalPlan and Exprs in `SingleDistinctToGroupBy` [datafusion]
appletreeisyellow commented on PR #10527: URL: https://github.com/apache/datafusion/pull/10527#issuecomment-2121182146 @alamb Thanks for the review! I have updated the code according to your feedback > I found whitespace blind diff easier to review: [#10527 (files)](https://github.com/apache/datafusion/pull/10527/files?w=1) Indeed! It greatly reduced my mental matching effort 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Stop copying LogicalPlan and Exprs in `SingleDistinctToGroupBy` [datafusion]
appletreeisyellow commented on code in PR #10527: URL: https://github.com/apache/datafusion/pull/10527#discussion_r1607264284 ## datafusion/optimizer/src/single_distinct_to_groupby.rs: ## @@ -131,177 +126,190 @@ fn contains_grouping_set(expr: &[Expr]) -> bool { impl OptimizerRule for SingleDistinctToGroupBy { fn try_optimize( &self, -plan: &LogicalPlan, +_plan: &LogicalPlan, _config: &dyn OptimizerConfig, ) -> Result> { +internal_err!("Should have called SingleDistinctToGroupBy::rewrite") +} + +fn name(&self) -> &str { +"single_distinct_aggregation_to_group_by" +} + +fn apply_order(&self) -> Option { +Some(ApplyOrder::TopDown) +} + +fn supports_rewrite(&self) -> bool { +true +} + +fn rewrite( +&self, +plan: LogicalPlan, +_config: &dyn OptimizerConfig, +) -> Result, DataFusionError> { match plan { LogicalPlan::Aggregate(Aggregate { input, aggr_expr, schema, group_expr, .. -}) => { -if is_single_distinct_agg(plan)? && !contains_grouping_set(group_expr) { -// alias all original group_by exprs -let (mut inner_group_exprs, out_group_expr_with_alias): ( -Vec, -Vec<(Expr, Option)>, -) = group_expr -.iter() -.enumerate() -.map(|(i, group_expr)| { -if let Expr::Column(_) = group_expr { -// For Column expressions we can use existing expression as is. -(group_expr.clone(), (group_expr.clone(), None)) -} else { -// For complex expression write is as alias, to be able to refer -// if from parent operators successfully. -// Consider plan below. -// -// Aggregate: groupBy=[[group_alias_0]], aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\ -// --Aggregate: groupBy=[[test.a + Int32(1) AS group_alias_0, test.c AS alias1]], aggr=[[]] [group_alias_0:Int32, alias1:UInt32]\ -// TableScan: test [a:UInt32, b:UInt32, c:UInt32] -// -// First aggregate(from bottom) refers to `test.a` column. -// Second aggregate refers to the `group_alias_0` column, Which is a valid field in the first aggregate. -// If we were to write plan above as below without alias -// -// Aggregate: groupBy=[[test.a + Int32(1)]], aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\ -// --Aggregate: groupBy=[[test.a + Int32(1), test.c AS alias1]], aggr=[[]] [group_alias_0:Int32, alias1:UInt32]\ -// TableScan: test [a:UInt32, b:UInt32, c:UInt32] -// -// Second aggregate refers to the `test.a + Int32(1)` expression However, its input do not have `test.a` expression in it. -let alias_str = format!("group_alias_{i}"); -let alias_expr = group_expr.clone().alias(&alias_str); -let (qualifier, field) = schema.qualified_field(i); +}) if is_single_distinct_agg(&aggr_expr)? +&& !contains_grouping_set(&group_expr) => +{ +let group_size = group_expr.len(); +// alias all original group_by exprs +let (mut inner_group_exprs, out_group_expr_with_alias): ( +Vec, +Vec<(Expr, Option)>, +) = group_expr +.into_iter() +.enumerate() +.map(|(i, group_expr)| { +if let Expr::Column(_) = group_expr { +// For Column expressions we can use existing expression as is. +(group_expr.clone(), (group_expr, None)) +} else { +// For complex expression write is as alias, to be able to refer +// if from parent operators successfully. +// Consider plan below. +// +// Aggregate: groupBy=[[group_alias_0]], aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\ +
Re: [PR] docs: add guide to adding a new expression [datafusion-comet]
andygrove merged PR #422: URL: https://github.com/apache/datafusion-comet/pull/422 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]
andygrove commented on code in PR #450: URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607255755 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), DataTypes.ShortType) } - ignore("cast StringType to IntegerType") { + test("cast StringType to IntegerType") { // test with hand-picked values castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType) // fuzz test -castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), DataTypes.IntegerType) +castTest( + gen +.generateStrings(dataSize, numericPattern, 8) +.toDF("a") +.withColumn("a", functions.trim($"a")), + DataTypes.IntegerType) Review Comment: Here is my PR: https://github.com/apache/datafusion-comet/pull/453 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]
andygrove commented on code in PR #453: URL: https://github.com/apache/datafusion-comet/pull/453#discussion_r1607255075 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -1070,7 +1050,7 @@ fn do_cast_string_to_int< if ch == '.' { if eval_mode == EvalMode::Legacy { // truncate decimal in legacy mode -state = State::ParseFractionalDigits; +parse_sign_and_digits = false; continue; } else { return none_or_err(eval_mode, type_name, str); Review Comment: error messages refer to the original input string, not the trimmed version -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]
andygrove commented on code in PR #453: URL: https://github.com/apache/datafusion-comet/pull/453#discussion_r1607254539 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -1029,34 +1021,22 @@ fn do_cast_string_to_int< type_name: &str, min_value: T, ) -> CometResult> { -let len = str.len(); -if str.is_empty() { +let trimmed_str = str.trim(); +if trimmed_str.is_empty() { return none_or_err(eval_mode, type_name, str); } - +let len = trimmed_str.len(); let mut result: T = T::zero(); let mut negative = false; let radix = T::from(10); let stop_value = min_value / radix; -let mut state = State::SkipLeadingWhiteSpace; -let mut parsed_sign = false; - -for (i, ch) in str.char_indices() { -// skip leading whitespace -if state == State::SkipLeadingWhiteSpace { -if ch.is_whitespace() { -// consume this char -continue; -} -// change state and fall through to next section -state = State::ParseSignAndDigits; -} +let mut parse_sign_and_digits = true; -if state == State::ParseSignAndDigits { -if !parsed_sign { +for (i, ch) in trimmed_str.char_indices() { Review Comment: We process the trimmed string here and we no longer need the logic for skipping leading and trailing whitespace -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]
andygrove commented on code in PR #453: URL: https://github.com/apache/datafusion-comet/pull/453#discussion_r1607253762 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -82,7 +82,7 @@ macro_rules! cast_utf8_to_int { for i in 0..len { if $array.is_null(i) { cast_array.append_null() -} else if let Some(cast_value) = $cast_method($array.value(i).trim(), $eval_mode)? { Review Comment: We originally trimmed the input strings here, so error messages did not have access to the original inputs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
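The shape of the change described in these comments is: trim exactly once at the top, parse the trimmed characters with no whitespace-skipping states, and keep the untrimmed input only for diagnostics. A minimal standalone sketch of that idea (simplified; not the actual `cast.rs` code, and legacy-mode fractional truncation is reduced to a `break`):

```rust
/// Parse a string as i32 Spark-style, quoting the *original*
/// (untrimmed) input in every error message.
fn cast_string_to_i32(original: &str) -> Result<i32, String> {
    let invalid = || format!("invalid input syntax for type INT: '{original}'");
    let overflow = || format!("value out of range: '{original}'");

    let trimmed = original.trim();
    if trimmed.is_empty() {
        return Err(invalid());
    }

    let mut chars = trimmed.chars().peekable();
    let negative = matches!(chars.peek(), Some('-'));
    if negative || matches!(chars.peek(), Some('+')) {
        chars.next(); // consume the sign
    }

    // no whitespace states needed: the input is already trimmed
    let mut result: i64 = 0;
    let mut saw_digit = false;
    for ch in chars {
        if ch == '.' {
            break; // crude stand-in for legacy-mode truncation of fractional digits
        }
        let d = ch.to_digit(10).ok_or_else(invalid)? as i64;
        saw_digit = true;
        result = result * 10 + d;
        if result > i32::MAX as i64 + 1 {
            return Err(overflow());
        }
    }
    if !saw_digit {
        return Err(invalid());
    }

    let signed = if negative { -result } else { result };
    i32::try_from(signed).map_err(|_| overflow())
}

fn main() {
    assert_eq!(cast_string_to_i32("  123  "), Ok(123));
    assert_eq!(cast_string_to_i32("-2147483648"), Ok(i32::MIN));
    // the error quotes the untrimmed input, including its whitespace
    assert!(cast_string_to_i32(" 12a ").unwrap_err().contains("' 12a '"));
}
```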
Re: [PR] Stop copying LogicalPlan and Exprs in `SingleDistinctToGroupBy` [datafusion]
appletreeisyellow commented on code in PR #10527: URL: https://github.com/apache/datafusion/pull/10527#discussion_r1607253589 ## datafusion/optimizer/src/single_distinct_to_groupby.rs: ## @@ -131,177 +126,190 @@ fn contains_grouping_set(expr: &[Expr]) -> bool { impl OptimizerRule for SingleDistinctToGroupBy { fn try_optimize( &self, -plan: &LogicalPlan, +_plan: &LogicalPlan, _config: &dyn OptimizerConfig, ) -> Result> { +internal_err!("Should have called SingleDistinctToGroupBy::rewrite") +} + +fn name(&self) -> &str { +"single_distinct_aggregation_to_group_by" +} + +fn apply_order(&self) -> Option { +Some(ApplyOrder::TopDown) +} + +fn supports_rewrite(&self) -> bool { +true +} + +fn rewrite( +&self, +plan: LogicalPlan, +_config: &dyn OptimizerConfig, +) -> Result, DataFusionError> { match plan { LogicalPlan::Aggregate(Aggregate { input, aggr_expr, schema, group_expr, .. -}) => { -if is_single_distinct_agg(plan)? && !contains_grouping_set(group_expr) { -// alias all original group_by exprs -let (mut inner_group_exprs, out_group_expr_with_alias): ( -Vec, -Vec<(Expr, Option)>, -) = group_expr -.iter() -.enumerate() -.map(|(i, group_expr)| { -if let Expr::Column(_) = group_expr { -// For Column expressions we can use existing expression as is. -(group_expr.clone(), (group_expr.clone(), None)) -} else { -// For complex expression write is as alias, to be able to refer -// if from parent operators successfully. -// Consider plan below. -// -// Aggregate: groupBy=[[group_alias_0]], aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\ -// --Aggregate: groupBy=[[test.a + Int32(1) AS group_alias_0, test.c AS alias1]], aggr=[[]] [group_alias_0:Int32, alias1:UInt32]\ -// TableScan: test [a:UInt32, b:UInt32, c:UInt32] -// -// First aggregate(from bottom) refers to `test.a` column. -// Second aggregate refers to the `group_alias_0` column, Which is a valid field in the first aggregate. -// If we were to write plan above as below without alias -// -// Aggregate: groupBy=[[test.a + Int32(1)]], aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\ -// --Aggregate: groupBy=[[test.a + Int32(1), test.c AS alias1]], aggr=[[]] [group_alias_0:Int32, alias1:UInt32]\ -// TableScan: test [a:UInt32, b:UInt32, c:UInt32] -// -// Second aggregate refers to the `test.a + Int32(1)` expression However, its input do not have `test.a` expression in it. -let alias_str = format!("group_alias_{i}"); -let alias_expr = group_expr.clone().alias(&alias_str); -let (qualifier, field) = schema.qualified_field(i); +}) if is_single_distinct_agg(&aggr_expr)? +&& !contains_grouping_set(&group_expr) => +{ +let group_size = group_expr.len(); +// alias all original group_by exprs +let (mut inner_group_exprs, out_group_expr_with_alias): ( +Vec, +Vec<(Expr, Option)>, +) = group_expr +.into_iter() +.enumerate() +.map(|(i, group_expr)| { +if let Expr::Column(_) = group_expr { +// For Column expressions we can use existing expression as is. +(group_expr.clone(), (group_expr, None)) +} else { +// For complex expression write is as alias, to be able to refer +// if from parent operators successfully. +// Consider plan below. +// +// Aggregate: groupBy=[[group_alias_0]], aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\ +
[PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]
andygrove opened a new pull request, #453: URL: https://github.com/apache/datafusion-comet/pull/453 ## Which issue does this PR close? Closes https://github.com/apache/datafusion-comet/issues/431 ## Rationale for this change Enable cast string to int as a compatible expression. Also, there is a performance improvement with these changes: ``` cast_string_to_int/cast_string_to_i16 time: [17.445 µs 17.699 µs 18.019 µs] change: [-41.405% -40.622% -39.805%] (p = 0.00 < 0.05) Performance has improved. ``` ## What changes are included in this PR? - Enable tests for cast string to int - Include original input value in error message instead of trimmed input value - Simplify code and remove whitespace handling code that was not needed since we already trim the input values ## How are these changes tested? Enabled the tests that were previously disabled -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build: Add spark-4.0 profile and shims [datafusion-comet]
kazuyukitanimura commented on PR #407: URL: https://github.com/apache/datafusion-comet/pull/407#issuecomment-2121165967 @viirya Please take another look cc @andygrove @comphead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build: Add spark-4.0 profile and shims [datafusion-comet]
kazuyukitanimura commented on code in PR #407: URL: https://github.com/apache/datafusion-comet/pull/407#discussion_r1607249542 ## spark/src/main/scala/org/apache/spark/sql/comet/DecimalPrecision.scala: ## @@ -107,11 +108,4 @@ object DecimalPrecision { case e => e } } - - object DecimalExpression { -def unapply(e: Expression): Option[(Int, Int)] = e.dataType match { - case t: DecimalType => Some((t.precision, t.scale)) - case _ => None -} - } Review Comment: Restored the original -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add hex scalar function [datafusion-comet]
kazuyukitanimura commented on code in PR #449: URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1607218060 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } } + test("hex") { +val str_table = "string_hex_table" +withTable(str_table) { Review Comment: probably `makeParquetFileAllTypes` is a better example -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]
kazuyukitanimura commented on code in PR #450: URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607216900 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), DataTypes.ShortType) } - ignore("cast StringType to IntegerType") { + test("cast StringType to IntegerType") { // test with hand-picked values castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType) // fuzz test -castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), DataTypes.IntegerType) +castTest( + gen +.generateStrings(dataSize, numericPattern, 8) +.toDF("a") +.withColumn("a", functions.trim($"a")), + DataTypes.IntegerType) Review Comment: +1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]
andygrove commented on code in PR #450: URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607203214 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), DataTypes.ShortType) } - ignore("cast StringType to IntegerType") { + test("cast StringType to IntegerType") { // test with hand-picked values castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType) // fuzz test -castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), DataTypes.IntegerType) +castTest( + gen +.generateStrings(dataSize, numericPattern, 8) +.toDF("a") +.withColumn("a", functions.trim($"a")), + DataTypes.IntegerType) Review Comment: The main issue is that we are trimming the input before processing, so error messages use the trimmed input instead of the original input. Once I resolved that I saw that we had extra processing for leading and trailing whitespace that is never used (ported from Spark) so I ended up removing that for some performance wins. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]
andygrove commented on code in PR #450: URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607200546 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), DataTypes.ShortType) } - ignore("cast StringType to IntegerType") { + test("cast StringType to IntegerType") { // test with hand-picked values castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType) // fuzz test -castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), DataTypes.IntegerType) +castTest( + gen +.generateStrings(dataSize, numericPattern, 8) +.toDF("a") +.withColumn("a", functions.trim($"a")), + DataTypes.IntegerType) Review Comment: I'd prefer to see us fix the compatibility issue rather than skip testing strings that have leading and/or trailing whitespace. I have been looking into this and was going to make some comments here but it turned out to be a bit more involved than I thought it would be, so will create a PR soon with my proposed fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] feat: eliminate group by constant optimizer rule [datafusion]
korowa opened a new pull request, #10591: URL: https://github.com/apache/datafusion/pull/10591 ## Which issue does this PR close? Closes #. ## Rationale for this change The initial intention was to improve ClickBench q34 -- it contains an aggregation by a constant and URL, which makes the aggregation switch to row-based mode, which is slower than a single-field group by. In general, eliminating constants from `GROUP BY` seems to be a reasonable optimization. ## What changes are included in this PR? A new logical optimizer rule which removes constant expressions (or expressions originating from constants) from `Aggregate::group_expr`, and places an additional projection on top of the aggregation to satisfy the original schema for downstream operators. The optimizer rule is placed between the last `SimplifyExpressions` and `OptimizeProjections`, since it requires constant evaluation, and projections produced by this rule may be merged. ## Are these changes tested? New unit tests and sqllogictests ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
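An illustration of the rewrite this rule performs, in the plan notation used elsewhere in these threads (hypothetical table and query; plan shapes sketched from the description above, not taken from the PR):

```
-- before: the constant participates in grouping
SELECT 1 AS one, url, count(*) FROM hits GROUP BY one, url
Aggregate: groupBy=[[Int64(1) AS one, hits.url]], aggr=[[COUNT(*)]]
  TableScan: hits

-- after: the constant is dropped from the grouping keys and restored by a projection
Projection: Int64(1) AS one, hits.url, COUNT(*)
  Aggregate: groupBy=[[hits.url]], aggr=[[COUNT(*)]]
    TableScan: hits
```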
Re: [PR] feat: add hex scalar function [datafusion-comet]
tshauck commented on code in PR #449: URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1607197818 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } } + test("hex") { +val str_table = "string_hex_table" +withTable(str_table) { Review Comment: Thanks, will have a look at `makeRawTimeParquetFile` and follow up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] docs: add guide to adding a new expression [datafusion-comet]
tshauck commented on PR #422: URL: https://github.com/apache/datafusion-comet/pull/422#issuecomment-2121071344 @andygrove Conflict should actually be fixed now, thanks @kazuyukitanimura -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] docs: add guide to adding a new expression [datafusion-comet]
tshauck commented on code in PR #422: URL: https://github.com/apache/datafusion-comet/pull/422#discussion_r1607185036 ## docs/source/index.rst: ## @@ -58,7 +58,11 @@ as a native runtime to achieve improvement in terms of query efficiency and quer Comet Plugin Overview Development Guide Debugging Guide +<<< HEAD Review Comment: Ah, thanks! [1c37b49](https://github.com/apache/datafusion-comet/pull/422/commits/1c37b490b5e0d80d6de5b3ff6844f489a8cf) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Pass BigQuery options to the ArrowSchema [datafusion]
davisp commented on PR #10590: URL: https://github.com/apache/datafusion/pull/10590#issuecomment-2121051712 Also, for anyone more familiar with datafusion and/or sqlparser, one thing I wasn't 100% sure on was how to represent the metadata value. For now I've just called format on it, but I have the nagging suspicion there might be a way to serialize those so that TableProviderFactory folks can deserialize them more easily, rather than having to resort to things like stripping `'` and `"` blindly when converting the DFSchema to whatever they're working with. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Pass BigQuery options to the ArrowSchema [datafusion]
davisp opened a new pull request, #10590: URL: https://github.com/apache/datafusion/pull/10590 ## Which issue does this PR close? Closes #10589 ## Rationale for this change Provide per-column key/value options in the `CREATE EXTERNAL TABLE` statement. ## What changes are included in this PR? This sets the BigQuery column `OPTIONS` key/values into the Arrow Schema's fields' metadata. ## Are these changes tested? Barely. I assume I'll need to add a couple more, but I figured I'd wait until I see what sort of enthusiasm this proposal receives. ## Are there any user-facing changes? Users can now access any parsed BigQuery options via each Field's metadata. This gives TableProvider/TableProviderFactory implementations the ability to accept per-field options. For a query such as: ```sql CREATE EXTERNAL TABLE test( col1 bigint NULL, col2 bigint NOT NULL OPTIONS(compression='zstd(5)') ) STORED AS parquet LOCATION 'foo.parquet'; ``` This results in the field 'col2' on the Arrow Schema having a metadata value of `{"sql_option.compression", "'zstd(5)'"}`. I think technically this might be able to break downstream users in this scenario: 1. For some reason they are using BigQuery OPTIONS clauses in their `CREATE EXTERNAL TABLE` column definitions. 2. In their TableProviderFactory, they are attempting to access and use Field metadata in such a way that the now-present `sql_option.*` keys break things. The first requirement here seems fairly unlikely to me? I'm making the rough assumption that a "standard" BigQuery table creation statement would fail to parse as a Datafusion `CREATE EXTERNAL TABLE` statement, and thus it'd be limited to people who have a) translated a BigQuery statement to a Datafusion statement while keeping their column `OPTIONS` even though they don't actually get passed to the TableProviderFactory. For the second half, I could see this actually being rather common (that a `MyThing::from(arrow_schema)` blows up with unexpected metadata). Though the intersection of these two sets seems like it'd be tiny? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
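On the Arrow side these options are plain field metadata. A small arrow-rs sketch of what the resulting `Field` would look like (the `sql_option.` prefix mirrors the PR description; everything else here is illustrative):

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

fn main() {
    // col2 bigint NOT NULL OPTIONS(compression='zstd(5)')
    let field = Field::new("col2", DataType::Int64, false).with_metadata(HashMap::from([(
        "sql_option.compression".to_string(),
        "'zstd(5)'".to_string(),
    )]));

    // a TableProviderFactory could read the option back like this
    assert_eq!(
        field.metadata().get("sql_option.compression").map(String::as_str),
        Some("'zstd(5)'"),
    );
}
```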
Re: [PR] docs: add guide to adding a new expression [datafusion-comet]
kazuyukitanimura commented on code in PR #422: URL: https://github.com/apache/datafusion-comet/pull/422#discussion_r1607164180 ## docs/source/index.rst: ## @@ -58,7 +58,11 @@ as a native runtime to achieve improvement in terms of query efficiency and quer Comet Plugin Overview Development Guide Debugging Guide +<<< HEAD Review Comment: Looks like this is a merge/rebase error -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add hex scalar function [datafusion-comet]
kazuyukitanimura commented on code in PR #449: URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1607158744 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } } + test("hex") { +val str_table = "string_hex_table" +withTable(str_table) { Review Comment: Let's add more types and values, something like `makeRawTimeParquetFile` ## core/src/execution/datafusion/expressions/scalar_funcs/hex.rs: ## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::sync::Arc; + +use arrow::array::as_string_array; +use arrow_array::StringArray; +use arrow_schema::DataType; +use datafusion::logical_expr::ColumnarValue; +use datafusion_common::{ +cast::{as_binary_array, as_int64_array}, +exec_err, DataFusionError, ScalarValue, +}; +use std::fmt::Write; + +fn hex_bytes(bytes: &[u8]) -> Vec { +let length = bytes.len(); +let mut value = vec![0; length * 2]; +let mut i = 0; +while i < length { +value[i * 2] = (bytes[i] & 0xF0) >> 4; +value[i * 2 + 1] = bytes[i] & 0x0F; +i += 1; +} +value +} + +fn hex_int64(num: i64) -> String { +if num >= 0 { +format!("{:X}", num) +} else { +format!("{:016X}", num as u64) +} +} + +fn hex_string(s: &str) -> Vec { +hex_bytes(s.as_bytes()) +} + +fn hex_bytes_to_string(bytes: &[u8]) -> Result { +let mut hex_string = String::with_capacity(bytes.len() * 2); +for byte in bytes { +write!(&mut hex_string, "{:01X}", byte)?; +} +Ok(hex_string) +} + +pub(super) fn spark_hex(args: &[ColumnarValue]) -> Result { +if args.len() != 1 { +return Err(DataFusionError::Internal( +"hex expects exactly one argument".to_string(), +)); +} + +match &args[0] { +ColumnarValue::Array(array) => match array.data_type() { +DataType::Int64 => { Review Comment: Wondering if we need to support more types? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
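For anyone checking outputs against Spark: `hex` yields uppercase digits with no leading zeros for non-negative longs, the full 16-digit two's-complement form for negative longs, and two hex digits per byte for strings and binary. A minimal standalone sketch of those semantics (not the PR code above):

```rust
fn spark_hex_i64(num: i64) -> String {
    if num >= 0 {
        format!("{num:X}") // no leading zeros
    } else {
        format!("{:016X}", num as u64) // negative longs occupy all 16 digits
    }
}

fn spark_hex_bytes(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02X}")).collect()
}

fn main() {
    assert_eq!(spark_hex_i64(17), "11");
    assert_eq!(spark_hex_i64(-1), "FFFFFFFFFFFFFFFF");
    assert_eq!(spark_hex_bytes("Spark SQL".as_bytes()), "537061726B2053514C");
}
```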
[I] Pass per-field BigQuery `OPTIONS` values to the LogicalPlan's Arrow Schema [datafusion]
davisp opened a new issue, #10589: URL: https://github.com/apache/datafusion/issues/10589 ### Is your feature request related to a problem or challenge? I've been reading and learning the TableProvider APIs and have finally gotten around to taking a serious look at implementing support for `CREATE EXTERNAL TABLE`. I started down the road of using the statement-level `OPTIONS` for adding field-specific key/value config pairs, similar to how the Parquet provider does with the `OPTION::COLUMN.NESTED.PATH` keys. However, my particular use case has quite a few field-specific possible options that would be decently more ergonomic with field-level options, so I went looking for how I might be able to customize the SQL parser to support something of that nature. After many false starts, I ended up stumbling over the [BigQuery column options support](https://docs.rs/sqlparser/latest/sqlparser/ast/enum.ColumnOption.html#variant.Options) and thought that would be perfect. While I was a bit disappointed it's not directly supported, it turned out to be relatively easy to plumb through, so that's what I've done. ### Describe the solution you'd like I'm not tied to anything specific beyond "allow for column-specific key/value configuration options". The PR I'll open after filing this issue is one possible implementation, but it doesn't have to be *the* implementation by any means. ### Describe alternatives you've considered My original thought was basically that I was going to have to go the route used in the [`sql_dialect.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs) example, where I would basically have to reimplement the `CREATE EXTERNAL TABLE` logic and then somehow plumb my extra options around. And the reason I went that direction was my first (hacky) thought was to re-parse the `CreateExternalTable::definition` field, but it apparently doesn't keep the actual original SQL around (at least via the cli AFAICT). ### Additional context I've got no idea whether the reaction to this will be "Oh, neat!" or "Ewww, gross!". I'm not particularly tied to it because in the end I can always just go to the statement-wide options, though given how easy it was to add it seems like it might be acceptable? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] DataFusion reads Date32 and Date64 parquet statistics in as Int32Array [datafusion]
edmondop commented on issue #10587: URL: https://github.com/apache/datafusion/issues/10587#issuecomment-2121024857 @alamb the title here doesn't make much sense, are you saying that the `min` and `max` are not extracted as Date32/Date64? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]
alamb merged PR #10537: URL: https://github.com/apache/datafusion/pull/10537 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Compute murmur3 hash with dictionary input correctly [datafusion-comet]
kazuyukitanimura commented on code in PR #433: URL: https://github.com/apache/datafusion-comet/pull/433#discussion_r1607151015 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1452,17 +1452,55 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { withTable(table) { sql(s"create table $table(col string, a int, b float) using parquet") sql(s""" - |insert into $table values - |('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) - |, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) - |""".stripMargin) - checkSparkAnswerAndOperator(""" - |select - |md5(col), md5(cast(a as string)), md5(cast(b as string)), - |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, a, col), - |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), sha2(col, 512), sha2(col, 128) - |from test + |insert into $table values + |('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) + |, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.99) |""".stripMargin) + checkSparkAnswerAndOperator(""" + |select + |md5(col), md5(cast(a as string)), md5(cast(b as string)), + |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, a, col), + |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), sha2(col, 512), sha2(col, 128) + |from test + |""".stripMargin) +} + } +} + } + + test("hash functions with random input") { +val dataGen = DataGenerator.DEFAULT +// sufficient number of rows to create dictionary encoded ArrowArray. +val randomNumRows = 1000 Review Comment: Potentially we can add repeated values to force dictionary. E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
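The suggestion above is cheap to implement because dictionary encoding is driven purely by value repetition. A sketch of the repeat-the-pool idea (hypothetical helper, shown in Rust for brevity; in the Comet suite it would live next to the Scala DataGenerator):

```rust
/// Hypothetical helper: tile a small pool of generated values until the
/// column has enough duplicates to be a strong dictionary-encoding candidate.
fn repeat_pool(pool: &[String], copies: usize) -> Vec<String> {
    let mut out = Vec::with_capacity(pool.len() * copies);
    for _ in 0..copies {
        out.extend(pool.iter().cloned());
    }
    out
}

fn main() {
    // 100 distinct values repeated 10 times -> 1000 rows, 100 distinct values
    let pool: Vec<String> = (0..100).map(|i| format!("value_{i}")).collect();
    let rows = repeat_pool(&pool, 10);
    assert_eq!(rows.len(), 1000);
}
```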
Re: [PR] Minor: Consolidate some integration tests into `core_integration` [datafusion]
alamb commented on code in PR #10588: URL: https://github.com/apache/datafusion/pull/10588#discussion_r1607149443 ## datafusion/core/tests/custom_sources.rs: ## @@ -1,308 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one Review Comment: This was moved to custom_sources_cases.mod -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Minor: Consolidate some integration tests into `core_integration` [datafusion]
alamb opened a new pull request, #10588: URL: https://github.com/apache/datafusion/pull/10588 ## Which issue does this PR close? ## Rationale for this change In an effort to make it faster to develop and test datafusion, it would be nice if the resources required to run the tests were smaller. For example, on my machine, simply running the `memory_limit` integration test ```shell cargo test --test memory_limit ``` requires creating / linking a 148 MB binary: ```shell andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ du -s -h target/debug/deps/memory_limit-47b8096a6ecbd81c 148M target/debug/deps/memory_limit-47b8096a6ecbd81c ``` while creating the `core_integration` test requires 151M: ```shell andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ du -s -h target/debug/deps/core_integration-2ccc5be60db1701a 151M target/debug/deps/core_integration-2ccc5be60db1701a ``` Not only does each integration test target require about 140MB of extra space and the time to link, it also slows down the test runs, as they are not run in parallel with other tests. ## What changes are included in this PR? 1. Consolidate the XXX tests into the `core_integration` test After this PR ## Are these changes tested? They are all tests ## Are there any user-facing changes? No, this is a developer productivity change only -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
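The consolidation itself is the standard Cargo layout pattern: every top-level file under `tests/` becomes its own test binary, while files pulled in as modules of one file share a single binary. A layout sketch (module names illustrative, apart from `custom_sources_cases` and `memory_limit`, which are mentioned in the comments above):

```rust
// tests/core_integration.rs -- a single top-level integration test file, so
// cargo compiles and links exactly one ~150 MB test binary for all of these.
mod custom_sources_cases;
mod memory_limit;

// tests/custom_sources_cases/mod.rs, tests/memory_limit/mod.rs, ... hold the
// actual #[test] / #[tokio::test] functions and are no longer built as
// separate binaries of their own.
```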
Re: [PR] Minor: Improve documentation in sql_to_plan example [datafusion]
edmondop commented on code in PR #10582: URL: https://github.com/apache/datafusion/pull/10582#discussion_r1607146553 ## datafusion-examples/examples/plan_to_sql.rs: ## @@ -22,36 +22,45 @@ use datafusion::sql::unparser::expr_to_sql; use datafusion_sql::unparser::dialect::CustomDialect; use datafusion_sql::unparser::{plan_to_sql, Unparser}; -/// This example demonstrates the programmatic construction of -/// SQL using the DataFusion Expr [`Expr`] and LogicalPlan [`LogicalPlan`] API. +/// This example demonstrates the programmatic construction of SQL strings using +/// the DataFusion Expr [`Expr`] and LogicalPlan [`LogicalPlan`] API. /// /// /// The code in this example shows how to: -/// 1. Create SQL from a variety of Expr and LogicalPlan: [`main`]` -/// 2. Create a simple expression [`Exprs`] with fluent API -/// and convert to sql: [`simple_expr_to_sql_demo`] -/// 3. Create a simple expression [`Exprs`] with fluent API -/// and convert to sql without escaping column names: [`simple_expr_to_sql_demo_no_escape`] -/// 4. Create a simple expression [`Exprs`] with fluent API -/// and convert to sql escaping column names a MySQL style: [`simple_expr_to_sql_demo_escape_mysql_style`] +/// +/// 1. [`simple_expr_to_sql_demo`]: Create a simple expression [`Exprs`] with +/// fluent API and convert to sql suitable for passing to another database +/// +/// 2. [`simple_expr_to_sql_demo_no_escape`] Create a simple expression +/// [`Exprs`] with fluent API and convert to sql without escaping column names +/// more suitable for displaying to humans. +/// +/// 3. [`simple_expr_to_sql_demo_escape_mysql_style`]" Create a simple +/// expression [`Exprs`] with fluent API and convert to sql escaping column +/// names in MySQL style. +/// +/// 4. [`simple_plan_to_sql_demo`]: Create a simple logical plan using the +/// DataFrames API and convert to sql string. +/// +/// 5. [`round_trip_plan_to_sql_demo`]: Create a logical plan from a SQL string, modify it using the +/// DataFrames API and convert it back to a sql string. #[tokio::main] async fn main() -> Result<()> { // See how to evaluate expressions simple_expr_to_sql_demo()?; simple_expr_to_sql_demo_no_escape()?; simple_expr_to_sql_demo_escape_mysql_style()?; -simple_plan_to_sql_parquest_dataframe_demo().await?; -round_trip_plan_to_sql_parquest_dataframe_demo().await?; +simple_plan_to_sql_demo().await?; +round_trip_plan_to_sql_demo().await?; Ok(()) } /// DataFusion can convert expressions to SQL, using column name escaping /// PostgreSQL style. fn simple_expr_to_sql_demo() -> Result<()> { let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8))); -let ast = expr_to_sql(&expr)?; -let sql = format!("{}", ast); +let sql = expr_to_sql(&expr)?.to_string(); Review Comment: I don't remember why I had to do it, that totally makes sense -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Row groups are read out of order or with completely different values [datafusion]
twitu closed issue #10572: Row groups are read out of order or with completely different values URL: https://github.com/apache/datafusion/issues/10572 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]
alamb commented on PR #10537: URL: https://github.com/apache/datafusion/pull/10537#issuecomment-2120986025 I have filed the following tickets * https://github.com/apache/datafusion/issues/10585 * https://github.com/apache/datafusion/issues/10586 * #10587 I think this PR is now ready to go. I plan to merge it in when the CI passes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]
alamb commented on code in PR #10537: URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607125587

## datafusion/core/tests/parquet/arrow_statistics.rs: ##
@@ -0,0 +1,654 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This file contains an end to end test of extracting statistics from parquet files.
+//! It writes data into a parquet file, reads statistics and verifies they are correct
+
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow_array::{
+    make_array, Array, ArrayRef, Int16Array, Int32Array, Int64Array, Int8Array,
+    RecordBatch, UInt64Array,
+};
+use arrow_schema::{DataType, Field, Schema};
+use datafusion::datasource::physical_plan::parquet::{
+    RequestedStatistics, StatisticsConverter,
+};
+use parquet::arrow::arrow_reader::{ArrowReaderBuilder, ParquetRecordBatchReaderBuilder};
+use parquet::arrow::ArrowWriter;
+use parquet::file::properties::{EnabledStatistics, WriterProperties};
+
+use crate::parquet::Scenario;
+
+use super::make_test_file_rg;
+
+// TEST HELPERS
+
+/// Return a record batch with i64 with Null values
+fn make_int64_batches_with_null(
+    null_values: usize,
+    no_null_values_start: i64,
+    no_null_values_end: i64,
+) -> RecordBatch {
+    let schema = Arc::new(Schema::new(vec![Field::new("i64", DataType::Int64, true)]));
+
+    let v64: Vec<i64> = (no_null_values_start as _..no_null_values_end as _).collect();
+
+    RecordBatch::try_new(
+        schema,
+        vec![make_array(
+            Int64Array::from_iter(
+                v64.into_iter()
+                    .map(Some)
+                    .chain(std::iter::repeat(None).take(null_values)),
+            )
+            .to_data(),
+        )],
+    )
+    .unwrap()
+}
+
+// Create a parquet file with one column for data type i64
+// Data of the file include
+//   . Number of null rows is the given num_null
+//   . There are non-null values in the range [no_null_values_start, no_null_values_end], one value each row
+//   . The file is divided into row groups of size row_per_group
+pub fn parquet_file_one_column(
+    num_null: usize,
+    no_null_values_start: i64,
+    no_null_values_end: i64,
+    row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+    let mut output_file = tempfile::Builder::new()
+        .prefix("parquert_statistics_test")
+        .suffix(".parquet")
+        .tempfile()
+        .expect("tempfile creation");
+
+    let props = WriterProperties::builder()
+        .set_max_row_group_size(row_per_group)
+        .set_statistics_enabled(EnabledStatistics::Chunk)
+        .build();
+
+    let batches = vec![make_int64_batches_with_null(
+        num_null,
+        no_null_values_start,
+        no_null_values_end,
+    )];
+
+    let schema = batches[0].schema();
+
+    let mut writer = ArrowWriter::try_new(&mut output_file, schema, Some(props)).unwrap();
+
+    for batch in batches {
+        writer.write(&batch).expect("writing batch");
+    }
+
+    // close file
+    let _file_meta = writer.close().unwrap();
+
+    // open the file & get the reader
+    let file = output_file.reopen().unwrap();
+    ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+// Create a parquet file with many columns each has different data type
+// - Data types are specified by the given scenario
+// - Row group sizes are the same or different depending on the provided row_per_group & data created in the scenario
+pub async fn parquet_file_many_columns(
+    scenario: super::Scenario,
+    row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+    let file = make_test_file_rg(scenario, row_per_group).await;
+
+    // open the file & get the reader
+    let file = file.reopen().unwrap();
+    ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+struct Test {
+    reader: ParquetRecordBatchReaderBuilder<File>,
+    expected_min: ArrayRef,
+    expected_max: ArrayRef,
+    expected_null_counts: UInt64Array,
+    expected_row_counts: UInt64Array,
+}
+
+impl Test {
+    fn run(self, col_name: &str) {
+        let Self {
+            reader,
+            expected_min,
+            expected_max,
+            expected_null_counts,
+            expected_row_counts,
+        } = self;
+
+        le
[I] DataFusion reads Date32 and Date64 parquet statistics in as Int32 [datafusion]
alamb opened a new issue, #10587: URL: https://github.com/apache/datafusion/issues/10587

### Describe the bug
When reading statistics for a Date32 or Date64 column from a parquet file, DataFusion currently returns an Int32 array

### To Reproduce
You can see the issue in the test `test_dates_32_diff_rg_sizes` in https://github.com/apache/datafusion/pull/10537

### Expected behavior
I expect:
1. Date32 column to be read as `Date32Array`
2. Date64 column to be read as `Date64Array`

### Additional context
_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
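As background on why an Int32 array can appear here at all: Arrow's Date32 is physically an i32 count of days since the Unix epoch, so the raw values are recoverable by casting to the logical type. A minimal sketch of that relationship, assuming the umbrella `arrow` crate's cast kernel:

```rust
use arrow::array::{Array, Date32Array, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // Date32 is physically an i32: days since 1970-01-01. An Int32Array of
    // day counts is what the statistics reader currently hands back
    let raw = Int32Array::from(vec![19_000, 19_500]);

    // Casting to the logical type yields the expected Date32Array
    let dates = cast(&raw, &DataType::Date32).unwrap();
    let dates = dates.as_any().downcast_ref::<Date32Array>().unwrap();
    println!("{dates:?}"); // prints the dates 2022-01-08 and 2023-05-23
}
```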
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]
alamb commented on code in PR #10537: URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607118413

## datafusion/core/src/datasource/physical_plan/parquet/arrow_statistics.rs: ##
@@ -0,0 +1,43 @@
+use arrow_array::ArrayRef;
+use arrow_schema::DataType;
+use datafusion_common::Result;
+use parquet::file::statistics::Statistics as ParquetStatistics;
+
+/// statistics extracted from `Statistics` as Arrow `ArrayRef`s
+///
+/// # Note:
+/// If the corresponding `Statistics` is not present, or has no information for
+/// a column, a NULL is present in the corresponding array entry
+pub struct ArrowStatistics {
+    /// min values
+    min: ArrayRef,
+    /// max values
+    max: ArrayRef,
+    /// Row counts (UInt64Array)
+    row_count: ArrayRef,
+    /// Null Counts (UInt64Array)
+    null_count: ArrayRef,
+}
+
+/// Extract `ArrowStatistics` from the parquet [`Statistics`]
+pub fn parquet_stats_to_arrow<'a>(
+    arrow_datatype: &DataType,
+    statistics: impl IntoIterator<Item = Option<&'a ParquetStatistics>>,
+) -> Result<ArrowStatistics> {
+    todo!() // MY TODO next
+}

Review Comment: Filed https://github.com/apache/datafusion/issues/10586

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] DataFusion ignores "column order" parquet statistics specification [datafusion]
alamb opened a new issue, #10586: URL: https://github.com/apache/datafusion/issues/10586

### Describe the bug
As @tustvold points out, there is a [`column_order` API](https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order) defined in parquet that is currently entirely ignored by DataFusion. It is not entirely clear to me what the implications of ignoring this field are or what other parquet writers populate it with, but we should probably not ignore it.

### To Reproduce
_No response_

### Expected behavior
_No response_

### Additional context
To emphasise the point I made when this API was originally proposed, you need more than just the ParquetStatistics in order to correctly interpret the data. You need at least the FileMetadata to get the https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order in order to be able to even interpret what the statistics mean for a given column.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
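For anyone who wants to see what a given file actually declares, a small sketch of reading the field with the parquet crate is shown below; the file path is hypothetical and error handling is minimal:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical input path; any parquet file will do
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let file_meta = reader.metadata().file_metadata();

    // One ColumnOrder per leaf column: UNDEFINED means the writer declared
    // nothing, TYPE_DEFINED_ORDER carries the sort order the min/max
    // statistics were computed under
    for (i, column) in file_meta.schema_descr().columns().iter().enumerate() {
        println!("{}: {:?}", column.path(), file_meta.column_order(i));
    }
    Ok(())
}
```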
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]
alamb commented on code in PR #10537: URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607110465 ## datafusion/core/tests/parquet/arrow_statistics.rs: ##
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]
alamb commented on code in PR #10537: URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607107595 ## datafusion/core/tests/parquet/arrow_statistics.rs: ##
Re: [I] Incorrect statistics read for `i8` `i16` columns in parquet [datafusion]
alamb commented on issue #10585: URL: https://github.com/apache/datafusion/issues/10585#issuecomment-2120956336 Possibly related to https://github.com/apache/datafusion/issues/9779 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]
NGA-TRAN commented on code in PR #10537: URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607104991 ## datafusion/core/tests/parquet/arrow_statistics.rs: ##
[I] Incorrect statistics read for `i8` `i16` columns in parquet [datafusion]
alamb opened a new issue, #10585: URL: https://github.com/apache/datafusion/issues/10585

### Describe the bug
As @NGA-TRAN found in https://github.com/apache/datafusion/pull/10537, when i8 and i16 values are written to parquet and then the statistics are extracted, the returned min/max values are incorrect. This could lead to incorrect results when reading parquet files with filters on columns with `i8` and `i16` types.

### To Reproduce
See tests added in https://github.com/apache/datafusion/pull/10537

### Expected behavior
_No response_

### Additional context
_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
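To make the failure mode concrete: parquet has no 8-bit or 16-bit physical type, so `i8`/`i16` data and their min/max statistics are persisted as Int32, and a statistics reader has to map them back to the logical type. A hedged sketch of observing the stored form (the file name and values are arbitrary, and the statistics accessors shown here vary slightly across parquet-rs versions):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{Int8Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::file::statistics::Statistics;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a tiny i8 column; on disk it becomes a physical Int32 column
    let schema = Arc::new(Schema::new(vec![Field::new("i8", DataType::Int8, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int8Array::from(vec![-1i8, 0, 127]))],
    )?;
    let path = "i8_stats.parquet";
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // The row group statistics come back as Int32, not Int8: a converter
    // must map them back to the logical type, which is where the bug lives
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let column = reader.metadata().row_group(0).column(0);
    if let Some(Statistics::Int32(s)) = column.statistics() {
        if s.has_min_max_set() {
            println!("min={} max={}", s.min(), s.max());
        }
    }
    Ok(())
}
```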
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]
alamb commented on code in PR #10537: URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607104678 ## datafusion/core/tests/parquet/arrow_statistics.rs: ##
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file [datafusion]
alamb commented on code in PR #10537: URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607091830 ## datafusion/core/tests/parquet/arrow_statistics.rs: ##
Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file [datafusion]
alamb commented on code in PR #10537: URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607087345 ## datafusion/core/tests/parquet/arrow_statistics.rs: ##
Re: [PR] HashJoin LeftAnti Join should handle nulls correctly [datafusion]
viirya commented on code in PR #10584: URL: https://github.com/apache/datafusion/pull/10584#discussion_r1607085982

## datafusion/sqllogictest/test_files/join.slt: ##
@@ -793,3 +793,19 @@ DROP TABLE companies
 statement ok
 DROP TABLE leads
+
+
+# LeftAnti Join with null
+statement ok
+CREATE TABLE IF NOT EXISTS test_table(c1 INT, c2 INT) AS VALUES
+(1, 1),
+(2, 2),
+(3, 3),
+(4, null),
+(null, 0);
+
+query II
+SELECT * FROM test_table t1 LEFT ANTI JOIN test_table t2 ON t1.c1 = t2.c2
+
+4 NULL
+NULL 0

Review Comment: Currently the LeftAnti join returns incorrect results.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] HashJoin LeftAnti Join should handle nulls correctly [datafusion]
viirya commented on PR #10584: URL: https://github.com/apache/datafusion/pull/10584#issuecomment-2120931328 Added the test case first. I will find some time to work on the fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] HashJoin LeftAnti Join should handle nulls correctly [datafusion]
viirya opened a new pull request, #10584: URL: https://github.com/apache/datafusion/pull/10584

## Which issue does this PR close?

Closes #10583.

## Rationale for this change

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] HashJoin LeftAnti Join handles nulls incorrectly [datafusion]
viirya opened a new issue, #10583: URL: https://github.com/apache/datafusion/issues/10583

### Describe the bug
While working on https://github.com/apache/datafusion-comet/pull/437, a few Spark join tests failed when delegating to DataFusion HashJoin. This is because DataFusion's HashJoin LeftAnti join returns incorrect results when there are nulls on either the left or the right side.

### To Reproduce
Added a test to `join.slt`:
```
statement ok
CREATE TABLE IF NOT EXISTS test_table(c1 INT, c2 INT) AS VALUES
(1, 1),
(2, 2),
(3, 3),
(4, null),
(null, 0);

query II
SELECT * FROM test_table t1 LEFT ANTI JOIN test_table t2 ON t1.c1 = t2.c2

4 NULL
NULL 0
```

### Expected behavior
The above query should return an empty relation.

### Additional context
_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Improve documentation in sql_to_plan example [datafusion]
alamb commented on code in PR #10582: URL: https://github.com/apache/datafusion/pull/10582#discussion_r1607076013

## datafusion-examples/examples/plan_to_sql.rs: ##

-    let ast = expr_to_sql(&expr)?;
-    let sql = format!("{}", ast);
+    let sql = expr_to_sql(&expr)?.to_string();

Review Comment: I think it makes it easier to use this example if we don't have to worry about the intermediate `ast` as much, so I combined it into a single statement

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]
comphead commented on code in PR #10573: URL: https://github.com/apache/datafusion/pull/10573#discussion_r1607067858

## datafusion/sql/src/unparser/expr.rs: ##
@@ -504,6 +508,14 @@ impl Unparser<'_> {
         .collect::<Result<Vec<_>>>()
     }
 
+    pub(super) fn new_ident_quoted_if_needs(&self, ident: String) -> ast::Ident {

Review Comment: Please add a method comment for this pub method

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
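As a rough illustration of what a "quote only if needed" rule looks like, here is a toy sketch under simple assumptions; it is not the PR's actual implementation, which also consults the configured dialect and its keyword list:

```rust
// A toy version of the rule such a method might apply: leave an identifier
// bare only when quoting cannot change how it parses
fn ident_needs_quotes(ident: &str) -> bool {
    let mut chars = ident.chars();
    // must start with a lowercase letter or underscore...
    let ok_start = matches!(chars.next(), Some('a'..='z' | '_'));
    // ...and continue with lowercase alphanumerics or underscores
    let ok_rest = chars.all(|c| matches!(c, 'a'..='z' | '0'..='9' | '_'));
    !(ok_start && ok_rest)
}

fn main() {
    assert!(!ident_needs_quotes("order_id_2"));
    assert!(ident_needs_quotes("Order")); // mixed case round-trips only if quoted
    assert!(ident_needs_quotes("select city")); // contains a space
}
```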
Re: [I] Add an example of how to convert LogicalPlan to/from SQL Strings [datafusion]
alamb closed issue #10550: Add an example of how to convert LogicalPlan to/from SQL Strings URL: https://github.com/apache/datafusion/issues/10550 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add examples of how to convert logical plan to/from sql strings [datafusion]
alamb merged PR #10558: URL: https://github.com/apache/datafusion/pull/10558 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org