Re: [PR] feat: Add random row generator in data generator [datafusion-comet]

2024-05-20 Thread via GitHub


advancedxy commented on code in PR #451:
URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607716091


##
spark/src/test/scala/org/apache/comet/DataGenerator.scala:
##
@@ -95,4 +102,55 @@ class DataGenerator(r: Random) {
   Range(0, n).map(_ => r.nextLong())
   }
 
+  // Generate a random row according to the schema; the string field in the struct could be
+  // configured to generate strings by passing a stringGen function. Other types are delegated
+  // to Spark's RandomDataGenerator.
+  def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = {
+val fields = mutable.ArrayBuffer.empty[Any]
+schema.fields.foreach { f =>
+  f.dataType match {
+case ArrayType(childType, nullable) =>
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+val arr = mutable.ArrayBuffer.empty[Any]
+val n = 1 // rand.nextInt(10)
+var i = 0
+val generator = RandomDataGenerator.forType(childType, nullable, r)
+assert(generator.isDefined, "Unsupported type")
+val gen = generator.get
+while (i < n) {
+  arr += gen()
+  i += 1
+}
+arr.toSeq
+  }
+  fields += data
+case StructType(children) =>
+  fields += generateRow(StructType(children))
+case StringType if stringGen.isDefined =>
+  val gen = stringGen.get
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+gen()
+  }
+  fields += data
+case _ =>
+  val generator = RandomDataGenerator.forType(f.dataType, f.nullable, 
r)
+  assert(generator.isDefined, "Unsupported type")
+  val gen = generator.get
+  fields += gen()
+  }
+}
+Row.fromSeq(fields.toSeq)
+  }
+
+  def generateRows(

Review Comment:
   Ah, yeah. It's not used in comet tests yet. It will be used in 
https://github.com/apache/datafusion-comet/pull/433/files#diff-75bf612078cdacdd1c8544b3894bc941c0db803f5f904afc21fbe389fb68555bR1485-R1488
 if that PR merges first.






Re: [PR] feat: Add random row generator in data generator [datafusion-comet]

2024-05-20 Thread via GitHub


advancedxy commented on code in PR #451:
URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607709720


##
spark/src/test/scala/org/apache/comet/DataGenerator.scala:
##
@@ -95,4 +102,55 @@ class DataGenerator(r: Random) {
   Range(0, n).map(_ => r.nextLong())
   }
 
+  // Generate a random row according to the schema; the string field in the struct could be
+  // configured to generate strings by passing a stringGen function. Other types are delegated
+  // to Spark's RandomDataGenerator.
+  def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = {
+val fields = mutable.ArrayBuffer.empty[Any]
+schema.fields.foreach { f =>
+  f.dataType match {
+case ArrayType(childType, nullable) =>
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+val arr = mutable.ArrayBuffer.empty[Any]
+val n = 1 // rand.nextInt(10)
+var i = 0
+val generator = RandomDataGenerator.forType(childType, nullable, r)
+assert(generator.isDefined, "Unsupported type")
+val gen = generator.get
+while (i < n) {
+  arr += gen()
+  i += 1
+}
+arr.toSeq
+  }
+  fields += data
+case StructType(children) =>
+  fields += generateRow(StructType(children))
+case StringType if stringGen.isDefined =>
+  val gen = stringGen.get
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+gen()
+  }
+  fields += data
+case _ =>
+  val generator = RandomDataGenerator.forType(f.dataType, f.nullable, 
r)
+  assert(generator.isDefined, "Unsupported type")

Review Comment:
   >  Hmmm should we do the same (create nulls in forType) for the rest of 
types? 
   
   Let me think about that. Maybe we can remove the array type case. However, I think the user-specified `stringGen` still has to access `PROBABILITY_OF_NULL` to generate nullable strings; otherwise, the `stringGen` function would have to handle nullability itself.
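
   For the second option, a minimal sketch (hypothetical helper; it takes the `Random` instance and the null probability as parameters rather than reading the generator's `PROBABILITY_OF_NULL`):
   ```scala
   import scala.util.Random

   // Hypothetical: a stringGen that handles nullability itself, so generateRow
   // would not need to consult PROBABILITY_OF_NULL for string fields.
   def nullableStringGen(r: Random, probabilityOfNull: Float): () => String =
     () => if (r.nextFloat() <= probabilityOfNull) null else r.nextString(8)

   // illustrative usage: generateRow(schema, Some(nullableStringGen(new Random(42), 0.1f)))
   ```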






Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-20 Thread via GitHub


advancedxy commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1607703859


##
core/src/execution/datafusion/expressions/scalar_funcs/hex.rs:
##
@@ -0,0 +1,191 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use arrow::array::as_string_array;
+use arrow_array::StringArray;
+use arrow_schema::DataType;
+use datafusion::logical_expr::ColumnarValue;
+use datafusion_common::{
+cast::{as_binary_array, as_int64_array},
+exec_err, DataFusionError, ScalarValue,
+};
+use std::fmt::Write;
+
+fn hex_bytes(bytes: &[u8]) -> Vec<u8> {
+let length = bytes.len();
+let mut value = vec![0; length * 2];
+let mut i = 0;
+while i < length {
+value[i * 2] = (bytes[i] & 0xF0) >> 4;
+value[i * 2 + 1] = bytes[i] & 0x0F;
+i += 1;
+}
+value
+}
+
+fn hex_int64(num: i64) -> String {
+if num >= 0 {
+format!("{:X}", num)
+} else {
+format!("{:016X}", num as u64)
+}
+}
+
+fn hex_string(s: &str) -> Vec<u8> {
+hex_bytes(s.as_bytes())
+}
+
+fn hex_bytes_to_string(bytes: &[u8]) -> Result<String, std::fmt::Error> {
+let mut hex_string = String::with_capacity(bytes.len() * 2);
+for byte in bytes {
+write!(&mut hex_string, "{:01X}", byte)?;

Review Comment:
   By the way, there's `hex_encode` in `scalar_func.rs`; maybe we should move that into this hex.rs instead, so related methods are grouped together.



##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+val str_table = "string_hex_table"
+withTable(str_table) {

Review Comment:
   Spark hex's input types are:
   ```scala
   override def inputTypes: Seq[AbstractDataType] =
     Seq(TypeCollection(LongType, BinaryType, StringType))
   ```
   
   I think we need to test inputs with each of these types. By the way, it would be helpful to test with randomized input; I think you can leverage the `generateRows` in #451 and test hex in one SQL query such as:
   ```
   select hex(long_col), hex(binary_col), hex(string_col) from $test_table
   ```
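
   A rough sketch of such a randomized test (hedged: it assumes a `generateRows(numRows, schema)`-style signature from #451 and the existing `checkSparkAnswerAndOperator` helper; the table name and row count are made up):
   ```scala
   import org.apache.spark.sql.types._

   val dataGen = DataGenerator.DEFAULT
   val schema = StructType(Seq(
     StructField("long_col", LongType, nullable = true),
     StructField("binary_col", BinaryType, nullable = true),
     StructField("string_col", StringType, nullable = true)))
   // assumed signature; see the generateRows added in #451
   val rows = dataGen.generateRows(100, schema)
   val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
   df.createOrReplaceTempView("hex_rand_table")
   checkSparkAnswerAndOperator(
     "select hex(long_col), hex(binary_col), hex(string_col) from hex_rand_table")
   ```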



##
core/src/execution/datafusion/expressions/scalar_funcs/hex.rs:
##
@@ -0,0 +1,191 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use arrow::array::as_string_array;
+use arrow_array::StringArray;
+use arrow_schema::DataType;
+use datafusion::logical_expr::ColumnarValue;
+use datafusion_common::{
+cast::{as_binary_array, as_int64_array},
+exec_err, DataFusionError, ScalarValue,
+};
+use std::fmt::Write;
+
+fn hex_bytes(bytes: &[u8]) -> Vec<u8> {
+let length = bytes.len();
+let mut value = vec![0; length * 2];
+let mut i = 0;
+while i < length {
+value[i * 2] = (bytes[i] & 0xF0) >> 4;
+value[i * 2 + 1] = bytes[i] & 0x0F;
+i += 1;
+}
+value
+}
+
+fn hex_int64(num: i64) -> String {
+if num >= 0 {
+format!("{:X}", num)
+} else {
+format!("{:016X}", num as u64)
+}
+}
+
+fn hex_string(s: &str) -> Vec<u8> {
+hex_bytes(s.as_bytes())
+}
+
+fn hex_bytes_to_string(bytes: &[u8]) -> Result<String, std::fmt::Error> {
+let mut hex_string = String::with_capacity(bytes.len() * 2);
+for byte in bytes {
+write!(&mut hex_string, "{:01X}", byte)?;

Review Comment:
   hmm, I think it should be `:02X` instead?
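
   For illustration (a standalone snippet, not part of the PR): without zero padding, bytes below 0x10 hex-encode to a single digit and the output can no longer be decoded back to the original bytes.
   ```rust
   fn main() {
       let byte: u8 = 0x0A;
       assert_eq!(format!("{:01X}", byte), "A");  // one digit: ambiguous once concatenated
       assert_eq!(format!("{:02X}", byte), "0A"); // zero-padded pair, matching Spark's hex output
   }
   ```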



Re: [PR] feat: Add random row generator in data generator [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #451:
URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607690262


##
spark/src/test/scala/org/apache/comet/DataGenerator.scala:
##
@@ -95,4 +102,55 @@ class DataGenerator(r: Random) {
   Range(0, n).map(_ => r.nextLong())
   }
 
+  // Generate a random row according to the schema; the string field in the struct could be
+  // configured to generate strings by passing a stringGen function. Other types are delegated
+  // to Spark's RandomDataGenerator.
+  def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = {
+val fields = mutable.ArrayBuffer.empty[Any]
+schema.fields.foreach { f =>
+  f.dataType match {
+case ArrayType(childType, nullable) =>
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+val arr = mutable.ArrayBuffer.empty[Any]
+val n = 1 // rand.nextInt(10)
+var i = 0
+val generator = RandomDataGenerator.forType(childType, nullable, r)
+assert(generator.isDefined, "Unsupported type")
+val gen = generator.get
+while (i < n) {
+  arr += gen()
+  i += 1
+}
+arr.toSeq
+  }
+  fields += data
+case StructType(children) =>
+  fields += generateRow(StructType(children))
+case StringType if stringGen.isDefined =>
+  val gen = stringGen.get
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+gen()
+  }
+  fields += data
+case _ =>
+  val generator = RandomDataGenerator.forType(f.dataType, f.nullable, 
r)
+  assert(generator.isDefined, "Unsupported type")
+  val gen = generator.get
+  fields += gen()
+  }
+}
+Row.fromSeq(fields.toSeq)
+  }
+
+  def generateRows(

Review Comment:
   hmmm, currently it is not testing anything Comet-specific, so it is a bit weird






Re: [PR] feat: Add random row generator in data generator [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #451:
URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607689388


##
spark/src/test/scala/org/apache/comet/DataGenerator.scala:
##
@@ -95,4 +102,55 @@ class DataGenerator(r: Random) {
   Range(0, n).map(_ => r.nextLong())
   }
 
+  // Generate a random row according to the schema; the string field in the struct could be
+  // configured to generate strings by passing a stringGen function. Other types are delegated
+  // to Spark's RandomDataGenerator.
+  def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = {
+val fields = mutable.ArrayBuffer.empty[Any]
+schema.fields.foreach { f =>
+  f.dataType match {
+case ArrayType(childType, nullable) =>
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+val arr = mutable.ArrayBuffer.empty[Any]
+val n = 1 // rand.nextInt(10)
+var i = 0
+val generator = RandomDataGenerator.forType(childType, nullable, r)
+assert(generator.isDefined, "Unsupported type")
+val gen = generator.get
+while (i < n) {
+  arr += gen()
+  i += 1
+}
+arr.toSeq
+  }
+  fields += data
+case StructType(children) =>
+  fields += generateRow(StructType(children))
+case StringType if stringGen.isDefined =>
+  val gen = stringGen.get
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+gen()
+  }
+  fields += data
+case _ =>
+  val generator = RandomDataGenerator.forType(f.dataType, f.nullable, 
r)
+  assert(generator.isDefined, "Unsupported type")

Review Comment:
   Hmmm should we do the same (create nulls in `forType`) for the rest of 
types? Then we do not need duplicated `PROBABILITY_OF_NULL`






Re: [PR] fix: Compute murmur3 hash with dictionary input correctly [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #433:
URL: https://github.com/apache/datafusion-comet/pull/433#discussion_r1607685866


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1452,17 +1452,55 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 withTable(table) {
   sql(s"create table $table(col string, a int, b float) using parquet")
   sql(s"""
- |insert into $table values
- |('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
- |, ('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
- |""".stripMargin)
-  checkSparkAnswerAndOperator("""
-   |select
-   |md5(col), md5(cast(a as string)), md5(cast(b as string)),
-   |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), 
hash(b, a, col),
-   |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), 
sha2(col, 512), sha2(col, 128)
-   |from test
+   |insert into $table values
+   |('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
+   |, ('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
|""".stripMargin)
+  checkSparkAnswerAndOperator("""
+  |select
+  |md5(col), md5(cast(a as string)), md5(cast(b as string)),
+  |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, 
a, col),
+  |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), 
sha2(col, 512), sha2(col, 128)
+  |from test
+  |""".stripMargin)
+}
+  }
+}
+  }
+
+  test("hash functions with random input") {
+val dataGen = DataGenerator.DEFAULT
+// sufficient number of rows to create dictionary encoded ArrowArray.
+val randomNumRows = 1000

Review Comment:
   Yes, `makeParquetFileAllTypes` or some existing dictionary related tests may 
be helpful






Re: [I] Support unnest for struct data type [datafusion]

2024-05-20 Thread via GitHub


duongcongtoai commented on issue #10264:
URL: https://github.com/apache/datafusion/issues/10264#issuecomment-212133

   Please help me review this PR everyone 
https://github.com/apache/datafusion/pull/10429
   





Re: [PR] Support Substrait's VirtualTables [datafusion]

2024-05-20 Thread via GitHub


jonahgao commented on code in PR #10531:
URL: https://github.com/apache/datafusion/pull/10531#discussion_r1607637374


##
datafusion/substrait/tests/cases/roundtrip_logical_plan.rs:
##
@@ -607,6 +608,15 @@ async fn qualified_catalog_schema_table_reference() -> 
Result<()> {
 roundtrip("SELECT a,b,c,d,e FROM datafusion.public.data;").await
 }
 
+#[tokio::test]
+async fn roundtrip_local_relation() -> Result<()> {
+let ctx = create_context().await?;
+let row1 = vec![lit(1), lit("a")];
+let row2 = vec![lit(2), lit("b")];
+let plan = LogicalPlanBuilder::values(vec![row1, row2])?.build()?;

Review Comment:
   I agree. For Values plan, these aliases should be able to be ignored.
   I think we could use `assert_expected_plan` during testing to ignore 
aliases, for example:
   ```rust
   #[tokio::test]
   async fn roundtrip_values() -> Result<()> {
   assert_expected_plan(
   "VALUES (1, 'a', [[1,2,3], []], STRUCT(true, 1))",
   "Values: (Int64(1), Utf8(\"a\"), List([[1, 2, 3], []]), 
Struct({c0:true,c1:1}))",
   )
   .await
   }
   ```
   Additionally, aliases should be ignored when handling `LogicalPlan::Values` 
in `to_substrait_rel`.









Re: [PR] Improve `UserDefinedLogicalNode::from_template` API to return `Result` [datafusion]

2024-05-20 Thread via GitHub


lewiszlw commented on PR #10575:
URL: https://github.com/apache/datafusion/pull/10575#issuecomment-2121658693

   If we want to apply the same change to the `UserDefinedLogicalNodeCore` trait, we should add a `Sized` trait bound to it. Do you need me to change it in this PR? @alamb
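
   A minimal sketch of why the bound is needed (simplified signature, not the actual DataFusion trait): `Result<Self>` is only a valid return type when `Self` is `Sized`, so the bound has to live either on the trait itself or on the method.
   ```rust
   #[derive(Debug)]
   struct PlanError(String);

   trait UserDefinedLogicalNodeCore: std::fmt::Debug {
       // Without `Self: Sized` (here on the method; alternatively as a supertrait
       // bound, `trait UserDefinedLogicalNodeCore: Sized`) this would not compile,
       // because `Result<Self, _>` needs a statically sized `Self`.
       fn with_exprs_and_inputs(&self, exprs: Vec<String>) -> Result<Self, PlanError>
       where
           Self: Sized;
   }
   ```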





Re: [PR] Improve `UserDefinedLogicalNode::from_template` API to return `Result` [datafusion]

2024-05-20 Thread via GitHub


lewiszlw commented on code in PR #10575:
URL: https://github.com/apache/datafusion/pull/10575#discussion_r1607537742


##
datafusion/expr/src/logical_plan/extension.rs:
##
@@ -76,27 +76,20 @@ pub trait UserDefinedLogicalNode: fmt::Debug + Send + Sync {
 /// For example: `TopK: k=10`
 fn fmt_for_explain(&self, f: &mut fmt::Formatter) -> fmt::Result;
 
-/// Create a new `ExtensionPlanNode` with the specified children
+/// Create a new `UserDefinedLogicalNode` with the specified children
 /// and expressions. This function is used during optimization
 /// when the plan is being rewritten and a new instance of the
-/// `ExtensionPlanNode` must be created.
+/// `UserDefinedLogicalNode` must be created.
 ///
 /// Note that exprs and inputs are in the same order as the result
 /// of self.inputs and self.exprs.
 ///
-/// So, `self.from_template(exprs, ..).expressions() == exprs
-//
-// TODO(clippy): This should probably be renamed to use a `with_*` prefix. 
Something
-// like `with_template`, or `with_exprs_and_inputs`.
-//
-// Also, I think `ExtensionPlanNode` has been renamed to 
`UserDefinedLogicalNode`
-// but the doc comments have not been updated.
-#[allow(clippy::wrong_self_convention)]
-fn from_template(
+/// So, `self.with_exprs_and_inputs(exprs, ..).expressions() == exprs
+fn with_exprs_and_inputs(
 &self,
 exprs: &[Expr],
 inputs: &[LogicalPlan],
-) -> Arc<dyn UserDefinedLogicalNode>;
+) -> Result<Arc<dyn UserDefinedLogicalNode>>;

Review Comment:
   Yes, looks better.






Re: [PR] Improve `UserDefinedLogicalNode::from_template` API to return `Result` [datafusion]

2024-05-20 Thread via GitHub


lewiszlw commented on code in PR #10575:
URL: https://github.com/apache/datafusion/pull/10575#discussion_r1607537219


##
datafusion/expr/src/logical_plan/extension.rs:
##
@@ -76,27 +76,20 @@ pub trait UserDefinedLogicalNode: fmt::Debug + Send + Sync {
 /// For example: `TopK: k=10`
 fn fmt_for_explain(&self, f: &mut fmt::Formatter) -> fmt::Result;
 
-/// Create a new `ExtensionPlanNode` with the specified children
+/// Create a new `UserDefinedLogicalNode` with the specified children
 /// and expressions. This function is used during optimization
 /// when the plan is being rewritten and a new instance of the
-/// `ExtensionPlanNode` must be created.
+/// `UserDefinedLogicalNode` must be created.
 ///
 /// Note that exprs and inputs are in the same order as the result
 /// of self.inputs and self.exprs.
 ///
-/// So, `self.from_template(exprs, ..).expressions() == exprs
-//
-// TODO(clippy): This should probably be renamed to use a `with_*` prefix. 
Something
-// like `with_template`, or `with_exprs_and_inputs`.
-//
-// Also, I think `ExtensionPlanNode` has been renamed to 
`UserDefinedLogicalNode`
-// but the doc comments have not been updated.
-#[allow(clippy::wrong_self_convention)]

Review Comment:
   LGTM






[PR] Fix compilation of datafusion-cli on 32bit targets [datafusion]

2024-05-20 Thread via GitHub


nathaniel-daniel opened a new pull request, #10594:
URL: https://github.com/apache/datafusion/pull/10594

   ## Which issue does this PR close?
   Closes #10552.
   
   ## Rationale for this change
   This PR fixes compilation of the datafusion-cli crate on 32bit targets. 
   
   ## What changes are included in this PR?
   A constant assumed that a usize was at least 64 bits wide. I changed the constant's type to u64 and convert it to a usize at the point of use, returning an error if the conversion overflows.
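
   A hedged sketch of that approach (the constant name, value, and error handling below are illustrative, not the actual datafusion-cli code):
   ```rust
   // Declared as u64 so the literal itself compiles on 32-bit targets.
   const MAX_BUFFER_BYTES: u64 = 10 * 1024 * 1024 * 1024;

   fn buffer_limit() -> Result<usize, String> {
       // Convert at the point of use; on 32-bit targets this returns an error
       // instead of silently truncating or failing to compile.
       usize::try_from(MAX_BUFFER_BYTES)
           .map_err(|_| format!("{} bytes does not fit in usize on this target", MAX_BUFFER_BYTES))
   }
   ```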
   
   ## Are these changes tested?
   These changes are not tested. I assumed adding CI for 32bit targets was too 
invasive for this PR.
   
   ## Are there any user-facing changes?
   There should be no user-facing changes or breaking changes.





Re: [PR] Migrate testing optimizer rules to use `rewrite` API [datafusion]

2024-05-20 Thread via GitHub


lewiszlw commented on PR #10576:
URL: https://github.com/apache/datafusion/pull/10576#issuecomment-2121613747

   The `CommonSubexprEliminate` rule has not been migrated yet.





[PR] Fixes bug expect `Date32Array` but returns Int32Array [datafusion]

2024-05-20 Thread via GitHub


xinlifoobar opened a new pull request, #10593:
URL: https://github.com/apache/datafusion/pull/10593

   ## Which issue does this PR close?
   
   
   
   Closes #10587 
   
   ## Rationale for this change
   
   
   
   This fixes a bug where, when reading a Date32 or Date64 column from a Parquet file, DataFusion currently returns an Int32 array.
   
   ## What changes are included in this PR?
   
   
   
   Adds conversions in the `get_statistic` macro from Int32 to Date32 and Date64, respectively.
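
   For illustration, the underlying conversion looks roughly like this (a hedged sketch, not the actual `get_statistic` macro; Date32 stores days since the Unix epoch in an i32, so only the logical type changes):
   ```rust
   use arrow_array::{Date32Array, Int32Array};

   fn int32_stats_as_date32(stats: &Int32Array) -> Date32Array {
       // Option<i32> items carry nulls through to the Date32 result unchanged.
       stats.iter().collect()
   }
   ```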
   
   ## Are these changes tested?
   
   
   
   Yes
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] fix: Compute murmur3 hash with dictionary input correctly [datafusion-comet]

2024-05-20 Thread via GitHub


advancedxy commented on code in PR #433:
URL: https://github.com/apache/datafusion-comet/pull/433#discussion_r1607506665


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1452,17 +1452,55 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 withTable(table) {
   sql(s"create table $table(col string, a int, b float) using parquet")
   sql(s"""
- |insert into $table values
- |('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
- |, ('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
- |""".stripMargin)
-  checkSparkAnswerAndOperator("""
-   |select
-   |md5(col), md5(cast(a as string)), md5(cast(b as string)),
-   |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), 
hash(b, a, col),
-   |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), 
sha2(col, 512), sha2(col, 128)
-   |from test
+   |insert into $table values
+   |('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
+   |, ('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
|""".stripMargin)
+  checkSparkAnswerAndOperator("""
+  |select
+  |md5(col), md5(cast(a as string)), md5(cast(b as string)),
+  |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, 
a, col),
+  |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), 
sha2(col, 512), sha2(col, 128)
+  |from test
+  |""".stripMargin)
+}
+  }
+}
+  }
+
+  test("hash functions with random input") {
+val dataGen = DataGenerator.DEFAULT
+// sufficient number of rows to create dictionary encoded ArrowArray.
+val randomNumRows = 1000

Review Comment:
   >  E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows
   
   So dictionary encoding is only triggered with enough repetition?






Re: [I] PR build for Linux Java 11 with Spark 3.4 is not running [datafusion-comet]

2024-05-20 Thread via GitHub


advancedxy commented on issue #389:
URL: 
https://github.com/apache/datafusion-comet/issues/389#issuecomment-2121583062

   Java 11 is excluded on purpose for pull request events, as there are already too many combinations. Java 8 and Java 17 with Spark 3.4 on Linux are tested, which should be sufficient? All combinations are tested on the post-commit event, though.





Re: [PR] feat: Add random row generator in data generator [datafusion-comet]

2024-05-20 Thread via GitHub


advancedxy commented on code in PR #451:
URL: https://github.com/apache/datafusion-comet/pull/451#discussion_r1607484041


##
spark/src/test/scala/org/apache/comet/DataGenerator.scala:
##
@@ -95,4 +102,55 @@ class DataGenerator(r: Random) {
   Range(0, n).map(_ => r.nextLong())
   }
 
+  // Generate a random row according to the schema; the string field in the struct could be
+  // configured to generate strings by passing a stringGen function. Other types are delegated
+  // to Spark's RandomDataGenerator.
+  def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = {
+val fields = mutable.ArrayBuffer.empty[Any]
+schema.fields.foreach { f =>
+  f.dataType match {
+case ArrayType(childType, nullable) =>
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+val arr = mutable.ArrayBuffer.empty[Any]
+val n = 1 // rand.nextInt(10)
+var i = 0
+val generator = RandomDataGenerator.forType(childType, nullable, r)
+assert(generator.isDefined, "Unsupported type")
+val gen = generator.get
+while (i < n) {
+  arr += gen()
+  i += 1
+}
+arr.toSeq
+  }
+  fields += data
+case StructType(children) =>
+  fields += generateRow(StructType(children))
+case StringType if stringGen.isDefined =>
+  val gen = stringGen.get
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+gen()
+  }
+  fields += data
+case _ =>
+  val generator = RandomDataGenerator.forType(f.dataType, f.nullable, 
r)
+  assert(generator.isDefined, "Unsupported type")
+  val gen = generator.get
+  fields += gen()
+  }
+}
+Row.fromSeq(fields.toSeq)
+  }
+
+  def generateRows(

Review Comment:
   Yes. I will use this in the hash function tests. Also, it should be useful for other randomized-input tests, since it makes it much easier to generate a DataFrame with multiple columns.



##
spark/src/test/scala/org/apache/comet/DataGenerator.scala:
##
@@ -95,4 +102,55 @@ class DataGenerator(r: Random) {
   Range(0, n).map(_ => r.nextLong())
   }
 
+  // Generate a random row according to the schema; the string field in the struct could be
+  // configured to generate strings by passing a stringGen function. Other types are delegated
+  // to Spark's RandomDataGenerator.
+  def generateRow(schema: StructType, stringGen: Option[() => String] = None): Row = {
+val fields = mutable.ArrayBuffer.empty[Any]
+schema.fields.foreach { f =>
+  f.dataType match {
+case ArrayType(childType, nullable) =>
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+val arr = mutable.ArrayBuffer.empty[Any]
+val n = 1 // rand.nextInt(10)
+var i = 0
+val generator = RandomDataGenerator.forType(childType, nullable, r)
+assert(generator.isDefined, "Unsupported type")
+val gen = generator.get
+while (i < n) {
+  arr += gen()
+  i += 1
+}
+arr.toSeq
+  }
+  fields += data
+case StructType(children) =>
+  fields += generateRow(StructType(children))
+case StringType if stringGen.isDefined =>
+  val gen = stringGen.get
+  val data = if (f.nullable && r.nextFloat() <= PROBABILITY_OF_NULL) {
+null
+  } else {
+gen()
+  }
+  fields += data
+case _ =>
+  val generator = RandomDataGenerator.forType(f.dataType, f.nullable, 
r)
+  assert(generator.isDefined, "Unsupported type")

Review Comment:
   The `RandomDataGenerator.forType(f.dataType, f.nullable, r)` handles the 
nullable type, so I don't think we need to handle it here.
   
   See: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala#L380-L392






Re: [I] bug: CAST string to integer does not handle all invalid inputs [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove closed issue #431: bug: CAST string to integer does not handle all 
invalid inputs
URL: https://github.com/apache/datafusion-comet/issues/431





Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove merged PR #453:
URL: https://github.com/apache/datafusion-comet/pull/453





Re: [PR] Count distinct support multiple expressions [datafusion]

2024-05-20 Thread via GitHub


github-actions[bot] closed pull request #5939: Count distinct support multiple 
expressions
URL: https://github.com/apache/datafusion/pull/5939





Re: [PR] Use stabalized aws-sdk and clap versions in `datafusion-cli` [datafusion]

2024-05-20 Thread via GitHub


github-actions[bot] commented on PR #9659:
URL: https://github.com/apache/datafusion/pull/9659#issuecomment-2121548399

   Thank you for your contribution. Unfortunately, this pull request is stale 
because it has been open 60 days with no activity. Please remove the stale 
label or comment or this will be closed in 7 days.





[PR] Minor: Generate the supported Spark builtin expression list into MD file [datafusion-comet]

2024-05-20 Thread via GitHub


comphead opened a new pull request, #455:
URL: https://github.com/apache/datafusion-comet/pull/455

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   





Re: [PR] Coverage: Add a manual test to show what Spark built in expression the DF can support directly [datafusion-comet]

2024-05-20 Thread via GitHub


comphead commented on PR #331:
URL: https://github.com/apache/datafusion-comet/pull/331#issuecomment-2121507170

   Thanks everyone for the review





Re: [PR] Coverage: Add a manual test to show what Spark built in expression the DF can support directly [datafusion-comet]

2024-05-20 Thread via GitHub


comphead merged PR #331:
URL: https://github.com/apache/datafusion-comet/pull/331





Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]

2024-05-20 Thread via GitHub


vaibhawvipul commented on code in PR #450:
URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607440985


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), 
DataTypes.ShortType)
   }
 
-  ignore("cast StringType to IntegerType") {
+  test("cast StringType to IntegerType") {
 // test with hand-picked values
 castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType)
 // fuzz test
-castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), 
DataTypes.IntegerType)
+castTest(
+  gen
+.generateStrings(dataSize, numericPattern, 8)
+.toDF("a")
+.withColumn("a", functions.trim($"a")),
+  DataTypes.IntegerType)

Review Comment:
   +1






Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]

2024-05-20 Thread via GitHub


vaibhawvipul closed pull request #450: bug fix: Fix fuzz testcase for cast 
string to integer 
URL: https://github.com/apache/datafusion-comet/pull/450





Re: [PR] Fix: Sort Merge Join LeftSemi issues when JoinFilter is set [datafusion]

2024-05-20 Thread via GitHub


alamb commented on PR #10304:
URL: https://github.com/apache/datafusion/pull/10304#issuecomment-2121484917

   🚀 





Re: [PR] build(deps): upgrade sqlparser to 0.46.0 [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10392:
URL: https://github.com/apache/datafusion/pull/10392#discussion_r1607425124


##
datafusion/sqllogictest/test_files/array.slt:
##


Review Comment:
   Thanks @tisonkun  -- sounds like we should fix that upstream and then I can 
maybe make a new sqlparser release (or maybe I can make a minor patch release) 
🤔 






Re: [I] Implement a way to preserve partitioning through `UnionExec` without losing ordering [datafusion]

2024-05-20 Thread via GitHub


alamb commented on issue #10314:
URL: https://github.com/apache/datafusion/issues/10314#issuecomment-2121468243

   > Hi @alamb, I am trying to work on this.
   > 
   > I am not very familiar with the `InterleaveExec` in the optimizer. As an initial thought, `InterleaveExec` acts as a **repartition with an equal number of input and output partitions**, and thus a natural idea is to reuse `streaming_merge` with respect to the input size. Wdyt?
   
   Hi @xinlifoobar  -- this sounds like it is on the right track





Re: [I] Using `Expr::field` panics [datafusion]

2024-05-20 Thread via GitHub


jayzhan211 closed issue #10565: Using `Expr::field` panics 
URL: https://github.com/apache/datafusion/issues/10565





Re: [PR] Remove `Expr::GetIndexedField`, replace `Expr::{field,index,range}` with `FieldAccessor`, `IndexAccessor`, and `SliceAccessor` [datafusion]

2024-05-20 Thread via GitHub


jayzhan211 merged PR #10568:
URL: https://github.com/apache/datafusion/pull/10568





Re: [I] Remove `Expr::GetIndexedField` and `GetFieldAccess` and always use function `get_field` for indexing [datafusion]

2024-05-20 Thread via GitHub


jayzhan211 closed issue #10374: Remove `Expr::GetIndexedField` and 
`GetFieldAccess` and always use function `get_field` for indexing
URL: https://github.com/apache/datafusion/issues/10374





Re: [PR] Remove `Expr::GetIndexedField`, replace `Expr::{field,index,range}` with `FieldAccessor`, `IndexAccessor`, and `SliceAccessor` [datafusion]

2024-05-20 Thread via GitHub


jayzhan211 commented on PR #10568:
URL: https://github.com/apache/datafusion/pull/10568#issuecomment-2121463057

   Thanks, @alamb !





Re: [PR] Tsaucer/prepare tpch examples for ci [datafusion-python]

2024-05-20 Thread via GitHub


timsaucer commented on PR #710:
URL: 
https://github.com/apache/datafusion-python/pull/710#issuecomment-2121459794

   It looks like CI is running correctly and also caching the data. I’ll rebase 
in the morning and get the PR ready to merge.





Re: [PR] feat: Supports UUID column [datafusion-comet]

2024-05-20 Thread via GitHub


huaxingao commented on code in PR #395:
URL: https://github.com/apache/datafusion-comet/pull/395#discussion_r1607409997


##
common/src/main/java/org/apache/comet/parquet/CometParquetToSparkSchemaConverter.scala:
##
@@ -0,0 +1,403 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.comet.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.parquet.io.{ColumnIO, GroupColumnIO, PrimitiveColumnIO}
+import org.apache.parquet.schema._
+import org.apache.parquet.schema.LogicalTypeAnnotation._
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._
+import org.apache.parquet.schema.Type.Repetition._
+import 
org.apache.spark.sql.catalyst.util.ResolveDefaultColumns.normalizeFieldName
+import org.apache.spark.sql.execution.datasources.parquet.{ParquetColumn, 
ParquetToSparkSchemaConverter}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+class CometParquetToSparkSchemaConverter(
+assumeBinaryIsString: Boolean = 
SQLConf.PARQUET_BINARY_AS_STRING.defaultValue.get,
+assumeInt96IsTimestamp: Boolean = 
SQLConf.PARQUET_INT96_AS_TIMESTAMP.defaultValue.get,
+caseSensitive: Boolean = SQLConf.CASE_SENSITIVE.defaultValue.get,
+inferTimestampNTZ: Boolean = 
SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get,
+nanosAsLong: Boolean = 
SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) extends 
ParquetToSparkSchemaConverter {
+
+  def this(conf: Configuration) = this(
+assumeBinaryIsString = 
conf.get(SQLConf.PARQUET_BINARY_AS_STRING.key).toBoolean,
+assumeInt96IsTimestamp = 
conf.get(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key).toBoolean,
+caseSensitive = conf.get(SQLConf.CASE_SENSITIVE.key).toBoolean,
+inferTimestampNTZ = 
conf.get(SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.key).toBoolean,
+nanosAsLong = conf.get(SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.key).toBoolean)
+
+  override def convertField(
+  field: ColumnIO,
+  sparkReadType: Option[DataType] = None): ParquetColumn = {
+val targetType = sparkReadType.map {
+  case udt: UserDefinedType[_] => udt.sqlType
+  case otherType => otherType
+}
+field match {
+  case primitiveColumn: PrimitiveColumnIO => 
convertPrimitiveField(primitiveColumn, targetType)
+  case groupColumn: GroupColumnIO => convertGroupField(groupColumn, 
targetType)
+}
+  }
+
+  private def convertPrimitiveField(
+  primitiveColumn: PrimitiveColumnIO,
+  sparkReadType: Option[DataType] = None): ParquetColumn = {
+val parquetType = primitiveColumn.getType.asPrimitiveType()
+val typeAnnotation = primitiveColumn.getType.getLogicalTypeAnnotation
+val typeName = primitiveColumn.getPrimitive
+
+def typeString =
+  if (typeAnnotation == null) s"$typeName" else s"$typeName 
($typeAnnotation)"
+
+def typeNotImplemented() =
+  throw new UnsupportedOperationException("unsupported Parquet type: " + 
typeString)
+
+def illegalType() =
+  throw new UnsupportedOperationException("Illegal Parquet type: " + 
typeString)
+
+// When maxPrecision = -1, we skip precision range check, and always 
respect the precision
+// specified in field.getDecimalMetadata.  This is useful when 
interpreting decimal types stored
+// as binaries with variable lengths.
+def makeDecimalType(maxPrecision: Int = -1): DecimalType = {
+  val decimalLogicalTypeAnnotation = typeAnnotation
+.asInstanceOf[DecimalLogicalTypeAnnotation]
+  val precision = decimalLogicalTypeAnnotation.getPrecision
+  val scale = decimalLogicalTypeAnnotation.getScale
+
+  CometParquetSchemaConverter.checkConversionRequirement(
+maxPrecision == -1 || 1 <= precision && precision <= maxPrecision,
+s"Invalid decimal precision: $typeName cannot store $precision digits 
(max $maxPrecision)")
+
+  DecimalType(precision, scale)
+}
+
+val sparkType = sparkReadType.getOrElse(typeName match {
+  case BOOLEAN => BooleanType
+
+  case FLOAT => FloatType
+
+  case DOUBLE => DoubleType
+
+  case INT32 =>
+typeAnnotation match {
+  case intTypeAnnotation: IntLogicalTypeAnnotation if 
intTypeA

Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove commented on PR #453:
URL: https://github.com/apache/datafusion-comet/pull/453#issuecomment-2121447145

   > So do you think the perf improvement is because we are no longer trimming?
   
   We are still trimming, but we are no longer performing the redundant conditional logic in the main loop to skip leading and trailing whitespace.
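
   To illustrate the idea (a minimal sketch, not Comet's actual cast implementation): the input is trimmed once up front, so the digit loop never has to test for whitespace.
   ```rust
   fn cast_str_to_i32(s: &str) -> Option<i32> {
       let trimmed = s.trim(); // whitespace handled once, outside the main loop
       let (negative, digits) = match trimmed.strip_prefix('-') {
           Some(rest) => (true, rest),
           None => (false, trimmed.strip_prefix('+').unwrap_or(trimmed)),
       };
       if digits.is_empty() {
           return None;
       }
       let mut value: i32 = 0;
       for c in digits.chars() {
           let d = c.to_digit(10)? as i32;
           // accumulate as a negative number so i32::MIN parses without overflow
           value = value.checked_mul(10)?.checked_sub(d)?;
       }
       if negative { Some(value) } else { value.checked_neg() }
   }
   ```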





Re: [I] Cast String to Date ANSI Mode - Spark 3.2 - Mismatch between Spark and Comet Errors [datafusion-comet]

2024-05-20 Thread via GitHub


parthchandra commented on issue #440:
URL: 
https://github.com/apache/datafusion-comet/issues/440#issuecomment-2121399387

   Is this an issue of just a mismatch between error messages? Or is the cast 
actually not doing the right thing with Spark 3.2?





[I] chore [datafusion-comet]

2024-05-20 Thread via GitHub


parthchandra opened a new issue, #454:
URL: https://github.com/apache/datafusion-comet/issues/454

   ### Describe the bug
   
   Separate out extended explain info unit test into spark dependent and spark 
independent parts
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   The test `test("explain comet")` has a conditional check which executes only 
if `supportsExtendedExplainInfo` is true. This can be moved into a separate 
unit test that is run only for Spark 4.0 and greater which actually support 
extended info.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] test: Fix explain with exteded info comet test [datafusion-comet]

2024-05-20 Thread via GitHub


parthchandra commented on code in PR #436:
URL: https://github.com/apache/datafusion-comet/pull/436#discussion_r1607380766


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1399,7 +1399,7 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   CometConf.COMET_EXEC_ENABLED.key -> "true",
   CometConf.COMET_SHUFFLE_ENFORCE_MODE_ENABLED.key -> "true",
   CometConf.COMET_EXEC_ALL_OPERATOR_ENABLED.key -> "true",
-  "spark.sql.extendedExplainProvider" -> 
"org.apache.comet.ExtendedExplainInfo") {

Review Comment:
   Will do that @kazuyukitanimura 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]

2024-05-20 Thread via GitHub


parthchandra commented on PR #453:
URL: https://github.com/apache/datafusion-comet/pull/453#issuecomment-2121348910

   So do you think the perf improvement is because we are no longer trimming? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Supports UUID column [datafusion-comet]

2024-05-20 Thread via GitHub


parthchandra commented on code in PR #395:
URL: https://github.com/apache/datafusion-comet/pull/395#discussion_r1607363360


##
common/src/main/java/org/apache/comet/parquet/CometParquetToSparkSchemaConverter.scala:
##
@@ -0,0 +1,403 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.comet.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.parquet.io.{ColumnIO, GroupColumnIO, PrimitiveColumnIO}
+import org.apache.parquet.schema._
+import org.apache.parquet.schema.LogicalTypeAnnotation._
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._
+import org.apache.parquet.schema.Type.Repetition._
+import 
org.apache.spark.sql.catalyst.util.ResolveDefaultColumns.normalizeFieldName
+import org.apache.spark.sql.execution.datasources.parquet.{ParquetColumn, 
ParquetToSparkSchemaConverter}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+class CometParquetToSparkSchemaConverter(
+assumeBinaryIsString: Boolean = 
SQLConf.PARQUET_BINARY_AS_STRING.defaultValue.get,
+assumeInt96IsTimestamp: Boolean = 
SQLConf.PARQUET_INT96_AS_TIMESTAMP.defaultValue.get,
+caseSensitive: Boolean = SQLConf.CASE_SENSITIVE.defaultValue.get,
+inferTimestampNTZ: Boolean = 
SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get,
+nanosAsLong: Boolean = 
SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) extends 
ParquetToSparkSchemaConverter {
+
+  def this(conf: Configuration) = this(
+assumeBinaryIsString = 
conf.get(SQLConf.PARQUET_BINARY_AS_STRING.key).toBoolean,
+assumeInt96IsTimestamp = 
conf.get(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key).toBoolean,
+caseSensitive = conf.get(SQLConf.CASE_SENSITIVE.key).toBoolean,
+inferTimestampNTZ = 
conf.get(SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.key).toBoolean,
+nanosAsLong = conf.get(SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.key).toBoolean)
+
+  override def convertField(
+  field: ColumnIO,
+  sparkReadType: Option[DataType] = None): ParquetColumn = {
+val targetType = sparkReadType.map {
+  case udt: UserDefinedType[_] => udt.sqlType
+  case otherType => otherType
+}
+field match {
+  case primitiveColumn: PrimitiveColumnIO => 
convertPrimitiveField(primitiveColumn, targetType)
+  case groupColumn: GroupColumnIO => convertGroupField(groupColumn, 
targetType)
+}
+  }
+
+  private def convertPrimitiveField(
+  primitiveColumn: PrimitiveColumnIO,
+  sparkReadType: Option[DataType] = None): ParquetColumn = {
+val parquetType = primitiveColumn.getType.asPrimitiveType()
+val typeAnnotation = primitiveColumn.getType.getLogicalTypeAnnotation
+val typeName = primitiveColumn.getPrimitive
+
+def typeString =
+  if (typeAnnotation == null) s"$typeName" else s"$typeName 
($typeAnnotation)"
+
+def typeNotImplemented() =
+  throw new UnsupportedOperationException("unsupported Parquet type: " + 
typeString)
+
+def illegalType() =
+  throw new UnsupportedOperationException("Illegal Parquet type: " + 
typeString)
+
+// When maxPrecision = -1, we skip precision range check, and always 
respect the precision
+// specified in field.getDecimalMetadata.  This is useful when 
interpreting decimal types stored
+// as binaries with variable lengths.
+def makeDecimalType(maxPrecision: Int = -1): DecimalType = {
+  val decimalLogicalTypeAnnotation = typeAnnotation
+.asInstanceOf[DecimalLogicalTypeAnnotation]
+  val precision = decimalLogicalTypeAnnotation.getPrecision
+  val scale = decimalLogicalTypeAnnotation.getScale
+
+  CometParquetSchemaConverter.checkConversionRequirement(
+maxPrecision == -1 || 1 <= precision && precision <= maxPrecision,
+s"Invalid decimal precision: $typeName cannot store $precision digits 
(max $maxPrecision)")
+
+  DecimalType(precision, scale)
+}
+
+val sparkType = sparkReadType.getOrElse(typeName match {
+  case BOOLEAN => BooleanType
+
+  case FLOAT => FloatType
+
+  case DOUBLE => DoubleType
+
+  case INT32 =>
+typeAnnotation match {
+  case intTypeAnnotation: IntLogicalTypeAnnotation if 
intTy

Re: [I] Sort Merge Join. LeftSemi issues [datafusion]

2024-05-20 Thread via GitHub


comphead closed issue #10379: Sort Merge Join. LeftSemi issues
URL: https://github.com/apache/datafusion/issues/10379


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix: Sort Merge Join LeftSemi issues when JoinFilter is set [datafusion]

2024-05-20 Thread via GitHub


comphead merged PR #10304:
URL: https://github.com/apache/datafusion/pull/10304


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] test: add more tests for statistics reading [datafusion]

2024-05-20 Thread via GitHub


NGA-TRAN opened a new pull request, #10592:
URL: https://github.com/apache/datafusion/pull/10592

   ## Which issue does this PR close?
   
   
   More tests for https://github.com/apache/datafusion/issues/10453
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Docs: Update PR workflow documentation [datafusion]

2024-05-20 Thread via GitHub


alamb commented on PR #10532:
URL: https://github.com/apache/datafusion/pull/10532#issuecomment-2121194261

   I plan to incorporate the feedback on this PR; I just haven't had a chance 
yet. I hope to do so over the next few days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add reference visitor `TreeNode` APIs [datafusion]

2024-05-20 Thread via GitHub


alamb commented on PR #10543:
URL: https://github.com/apache/datafusion/pull/10543#issuecomment-2121191619

   What do we think about merging this PR and filing a follow on ticket to 
unify the APIs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] DataFusion reads Date32 and Date64 parquet statistics in as Int32Array [datafusion]

2024-05-20 Thread via GitHub


alamb commented on issue #10587:
URL: https://github.com/apache/datafusion/issues/10587#issuecomment-2121184299

   Thanks for pointing that out @edmondop  -- yes the min/max seem to be 
extracted as `Int32Array`s

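   (Background sketch, not from the issue, assuming the `arrow-array` crate: `Date32` values
   are physically an `i32` count of days since the UNIX epoch, which is why statistics read
   without the logical type surface as an `Int32Array`.)

   ```rust
   use arrow_array::{Array, Date32Array};

   fn main() {
       // 19863 days after 1970-01-01 is 2024-05-20.
       let dates = Date32Array::from(vec![Some(19863), None]);
       assert_eq!(dates.value(0), 19863_i32); // raw i32 storage under the Date32 logical type
       assert_eq!(dates.null_count(), 1);
   }
   ```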

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Stop copying LogicalPlan and Exprs in `SingleDistinctToGroupBy` [datafusion]

2024-05-20 Thread via GitHub


appletreeisyellow commented on PR #10527:
URL: https://github.com/apache/datafusion/pull/10527#issuecomment-2121182146

   @alamb Thanks for the review! I have updated the code according to your 
feedback
   
   > I found whitespace blind diff easier to review: [#10527 
(files)](https://github.com/apache/datafusion/pull/10527/files?w=1)
   
   Indeed! It greatly reduced my mental matching effort 😄


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Stop copying LogicalPlan and Exprs in `SingleDistinctToGroupBy` [datafusion]

2024-05-20 Thread via GitHub


appletreeisyellow commented on code in PR #10527:
URL: https://github.com/apache/datafusion/pull/10527#discussion_r1607264284


##
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##
@@ -131,177 +126,190 @@ fn contains_grouping_set(expr: &[Expr]) -> bool {
 impl OptimizerRule for SingleDistinctToGroupBy {
 fn try_optimize(
 &self,
-plan: &LogicalPlan,
+_plan: &LogicalPlan,
 _config: &dyn OptimizerConfig,
 ) -> Result> {
+internal_err!("Should have called SingleDistinctToGroupBy::rewrite")
+}
+
+fn name(&self) -> &str {
+"single_distinct_aggregation_to_group_by"
+}
+
+fn apply_order(&self) -> Option {
+Some(ApplyOrder::TopDown)
+}
+
+fn supports_rewrite(&self) -> bool {
+true
+}
+
+fn rewrite(
+&self,
+plan: LogicalPlan,
+_config: &dyn OptimizerConfig,
+) -> Result, DataFusionError> {
 match plan {
 LogicalPlan::Aggregate(Aggregate {
 input,
 aggr_expr,
 schema,
 group_expr,
 ..
-}) => {
-if is_single_distinct_agg(plan)? && 
!contains_grouping_set(group_expr) {
-// alias all original group_by exprs
-let (mut inner_group_exprs, out_group_expr_with_alias): (
-Vec,
-Vec<(Expr, Option)>,
-) = group_expr
-.iter()
-.enumerate()
-.map(|(i, group_expr)| {
-if let Expr::Column(_) = group_expr {
-// For Column expressions we can use existing 
expression as is.
-(group_expr.clone(), (group_expr.clone(), 
None))
-} else {
-// For complex expression write is as alias, 
to be able to refer
-// if from parent operators successfully.
-// Consider plan below.
-//
-// Aggregate: groupBy=[[group_alias_0]], 
aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\
-// --Aggregate: groupBy=[[test.a + Int32(1) AS 
group_alias_0, test.c AS alias1]], aggr=[[]] [group_alias_0:Int32, 
alias1:UInt32]\
-// TableScan: test [a:UInt32, b:UInt32, 
c:UInt32]
-//
-// First aggregate(from bottom) refers to 
`test.a` column.
-// Second aggregate refers to the 
`group_alias_0` column, Which is a valid field in the first aggregate.
-// If we were to write plan above as below 
without alias
-//
-// Aggregate: groupBy=[[test.a + Int32(1)]], 
aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\
-// --Aggregate: groupBy=[[test.a + Int32(1), 
test.c AS alias1]], aggr=[[]] [group_alias_0:Int32, alias1:UInt32]\
-// TableScan: test [a:UInt32, b:UInt32, 
c:UInt32]
-//
-// Second aggregate refers to the `test.a + 
Int32(1)` expression However, its input do not have `test.a` expression in it.
-let alias_str = format!("group_alias_{i}");
-let alias_expr = 
group_expr.clone().alias(&alias_str);
-let (qualifier, field) = 
schema.qualified_field(i);
+}) if is_single_distinct_agg(&aggr_expr)?
+&& !contains_grouping_set(&group_expr) =>
+{
+let group_size = group_expr.len();
+// alias all original group_by exprs
+let (mut inner_group_exprs, out_group_expr_with_alias): (
+Vec,
+Vec<(Expr, Option)>,
+) = group_expr
+.into_iter()
+.enumerate()
+.map(|(i, group_expr)| {
+if let Expr::Column(_) = group_expr {
+// For Column expressions we can use existing 
expression as is.
+(group_expr.clone(), (group_expr, None))
+} else {
+// For complex expression write is as alias, to be 
able to refer
+// if from parent operators successfully.
+// Consider plan below.
+//
+// Aggregate: groupBy=[[group_alias_0]], 
aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\
+   

Re: [PR] docs: add guide to adding a new expression [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove merged PR #422:
URL: https://github.com/apache/datafusion-comet/pull/422


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove commented on code in PR #450:
URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607255755


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), 
DataTypes.ShortType)
   }
 
-  ignore("cast StringType to IntegerType") {
+  test("cast StringType to IntegerType") {
 // test with hand-picked values
 castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType)
 // fuzz test
-castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), 
DataTypes.IntegerType)
+castTest(
+  gen
+.generateStrings(dataSize, numericPattern, 8)
+.toDF("a")
+.withColumn("a", functions.trim($"a")),
+  DataTypes.IntegerType)

Review Comment:
   Here is my PR: https://github.com/apache/datafusion-comet/pull/453



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove commented on code in PR #453:
URL: https://github.com/apache/datafusion-comet/pull/453#discussion_r1607255075


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -1070,7 +1050,7 @@ fn do_cast_string_to_int<
 if ch == '.' {
 if eval_mode == EvalMode::Legacy {
 // truncate decimal in legacy mode
-state = State::ParseFractionalDigits;
+parse_sign_and_digits = false;
 continue;
 } else {
 return none_or_err(eval_mode, type_name, str);

Review Comment:
   error messages refer to the original input string, not the trimmed version



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove commented on code in PR #453:
URL: https://github.com/apache/datafusion-comet/pull/453#discussion_r1607254539


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -1029,34 +1021,22 @@ fn do_cast_string_to_int<
 type_name: &str,
 min_value: T,
 ) -> CometResult> {
-let len = str.len();
-if str.is_empty() {
+let trimmed_str = str.trim();
+if trimmed_str.is_empty() {
 return none_or_err(eval_mode, type_name, str);
 }
-
+let len = trimmed_str.len();
 let mut result: T = T::zero();
 let mut negative = false;
 let radix = T::from(10);
 let stop_value = min_value / radix;
-let mut state = State::SkipLeadingWhiteSpace;
-let mut parsed_sign = false;
-
-for (i, ch) in str.char_indices() {
-// skip leading whitespace
-if state == State::SkipLeadingWhiteSpace {
-if ch.is_whitespace() {
-// consume this char
-continue;
-}
-// change state and fall through to next section
-state = State::ParseSignAndDigits;
-}
+let mut parse_sign_and_digits = true;
 
-if state == State::ParseSignAndDigits {
-if !parsed_sign {
+for (i, ch) in trimmed_str.char_indices() {

Review Comment:
   We process the trimmed string here and we no longer need the logic for 
skipping leading and trailing whitespace



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove commented on code in PR #453:
URL: https://github.com/apache/datafusion-comet/pull/453#discussion_r1607253762


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -82,7 +82,7 @@ macro_rules! cast_utf8_to_int {
 for i in 0..len {
 if $array.is_null(i) {
 cast_array.append_null()
-} else if let Some(cast_value) = 
$cast_method($array.value(i).trim(), $eval_mode)? {

Review Comment:
   We originally trimmed the input strings here, so error messages did not have 
access to the original inputs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Stop copying LogicalPlan and Exprs in `SingleDistinctToGroupBy` [datafusion]

2024-05-20 Thread via GitHub


appletreeisyellow commented on code in PR #10527:
URL: https://github.com/apache/datafusion/pull/10527#discussion_r1607253589


##
datafusion/optimizer/src/single_distinct_to_groupby.rs:
##
@@ -131,177 +126,190 @@ fn contains_grouping_set(expr: &[Expr]) -> bool {
 impl OptimizerRule for SingleDistinctToGroupBy {
 fn try_optimize(
 &self,
-plan: &LogicalPlan,
+_plan: &LogicalPlan,
 _config: &dyn OptimizerConfig,
 ) -> Result> {
+internal_err!("Should have called SingleDistinctToGroupBy::rewrite")
+}
+
+fn name(&self) -> &str {
+"single_distinct_aggregation_to_group_by"
+}
+
+fn apply_order(&self) -> Option {
+Some(ApplyOrder::TopDown)
+}
+
+fn supports_rewrite(&self) -> bool {
+true
+}
+
+fn rewrite(
+&self,
+plan: LogicalPlan,
+_config: &dyn OptimizerConfig,
+) -> Result, DataFusionError> {
 match plan {
 LogicalPlan::Aggregate(Aggregate {
 input,
 aggr_expr,
 schema,
 group_expr,
 ..
-}) => {
-if is_single_distinct_agg(plan)? && 
!contains_grouping_set(group_expr) {
-// alias all original group_by exprs
-let (mut inner_group_exprs, out_group_expr_with_alias): (
-Vec,
-Vec<(Expr, Option)>,
-) = group_expr
-.iter()
-.enumerate()
-.map(|(i, group_expr)| {
-if let Expr::Column(_) = group_expr {
-// For Column expressions we can use existing 
expression as is.
-(group_expr.clone(), (group_expr.clone(), 
None))
-} else {
-// For complex expression write is as alias, 
to be able to refer
-// if from parent operators successfully.
-// Consider plan below.
-//
-// Aggregate: groupBy=[[group_alias_0]], 
aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\
-// --Aggregate: groupBy=[[test.a + Int32(1) AS 
group_alias_0, test.c AS alias1]], aggr=[[]] [group_alias_0:Int32, 
alias1:UInt32]\
-// TableScan: test [a:UInt32, b:UInt32, 
c:UInt32]
-//
-// First aggregate(from bottom) refers to 
`test.a` column.
-// Second aggregate refers to the 
`group_alias_0` column, Which is a valid field in the first aggregate.
-// If we were to write plan above as below 
without alias
-//
-// Aggregate: groupBy=[[test.a + Int32(1)]], 
aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\
-// --Aggregate: groupBy=[[test.a + Int32(1), 
test.c AS alias1]], aggr=[[]] [group_alias_0:Int32, alias1:UInt32]\
-// TableScan: test [a:UInt32, b:UInt32, 
c:UInt32]
-//
-// Second aggregate refers to the `test.a + 
Int32(1)` expression However, its input do not have `test.a` expression in it.
-let alias_str = format!("group_alias_{i}");
-let alias_expr = 
group_expr.clone().alias(&alias_str);
-let (qualifier, field) = 
schema.qualified_field(i);
+}) if is_single_distinct_agg(&aggr_expr)?
+&& !contains_grouping_set(&group_expr) =>
+{
+let group_size = group_expr.len();
+// alias all original group_by exprs
+let (mut inner_group_exprs, out_group_expr_with_alias): (
+Vec,
+Vec<(Expr, Option)>,
+) = group_expr
+.into_iter()
+.enumerate()
+.map(|(i, group_expr)| {
+if let Expr::Column(_) = group_expr {
+// For Column expressions we can use existing 
expression as is.
+(group_expr.clone(), (group_expr, None))
+} else {
+// For complex expression write is as alias, to be 
able to refer
+// if from parent operators successfully.
+// Consider plan below.
+//
+// Aggregate: groupBy=[[group_alias_0]], 
aggr=[[COUNT(alias1)]] [group_alias_0:Int32, COUNT(alias1):Int64;N]\
+   

[PR] fix: Enable cast string to int tests and fix compatibility issue [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove opened a new pull request, #453:
URL: https://github.com/apache/datafusion-comet/pull/453

   ## Which issue does this PR close?
   
   
   
   Closes https://github.com/apache/datafusion-comet/issues/431
   
   ## Rationale for this change
   
   
   
   Enable cast string to int as a compatible expression.
   
   Also, there is a performance improvement with these changes:
   
   ```
   cast_string_to_int/cast_string_to_i16
   time:   [17.445 µs 17.699 µs 18.019 µs]
   change: [-41.405% -40.622% -39.805%] (p = 0.00 < 0.05)
   Performance has improved.
   ```
   
   ## What changes are included in this PR?
   
   
   
   - Enable tests for cast string to int
   - Include original input value in error message instead of trimmed input 
value
   - Simplify code and remove whitespace handling code that was not needed 
since we already trim the input values
   
   ## How are these changes tested?
   
   
   
   Enabled the tests that were previously disabled


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Add spark-4.0 profile and shims [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on PR #407:
URL: https://github.com/apache/datafusion-comet/pull/407#issuecomment-2121165967

   @viirya Please take another look
   cc @andygrove @comphead 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Add spark-4.0 profile and shims [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #407:
URL: https://github.com/apache/datafusion-comet/pull/407#discussion_r1607249542


##
spark/src/main/scala/org/apache/spark/sql/comet/DecimalPrecision.scala:
##
@@ -107,11 +108,4 @@ object DecimalPrecision {
   case e => e
 }
   }
-
-  object DecimalExpression {
-def unapply(e: Expression): Option[(Int, Int)] = e.dataType match {
-  case t: DecimalType => Some((t.precision, t.scale))
-  case _ => None
-}
-  }

Review Comment:
   Restored the original



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1607218060


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+val str_table = "string_hex_table"
+withTable(str_table) {

Review Comment:
   probably `makeParquetFileAllTypes` is a better example



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #450:
URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607216900


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), 
DataTypes.ShortType)
   }
 
-  ignore("cast StringType to IntegerType") {
+  test("cast StringType to IntegerType") {
 // test with hand-picked values
 castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType)
 // fuzz test
-castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), 
DataTypes.IntegerType)
+castTest(
+  gen
+.generateStrings(dataSize, numericPattern, 8)
+.toDF("a")
+.withColumn("a", functions.trim($"a")),
+  DataTypes.IntegerType)

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove commented on code in PR #450:
URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607203214


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), 
DataTypes.ShortType)
   }
 
-  ignore("cast StringType to IntegerType") {
+  test("cast StringType to IntegerType") {
 // test with hand-picked values
 castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType)
 // fuzz test
-castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), 
DataTypes.IntegerType)
+castTest(
+  gen
+.generateStrings(dataSize, numericPattern, 8)
+.toDF("a")
+.withColumn("a", functions.trim($"a")),
+  DataTypes.IntegerType)

Review Comment:
   The main issue is that we are trimming the input before processing, so error 
messages use the trimmed input instead of the original input. Once I resolved 
that I saw that we had extra processing for leading and trailing whitespace 
that is never used (ported from Spark) so I ended up removing that for some 
performance wins.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug fix: Fix fuzz testcase for cast string to integer [datafusion-comet]

2024-05-20 Thread via GitHub


andygrove commented on code in PR #450:
URL: https://github.com/apache/datafusion-comet/pull/450#discussion_r1607200546


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -533,11 +533,16 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 castTest(gen.generateStrings(dataSize, numericPattern, 5).toDF("a"), 
DataTypes.ShortType)
   }
 
-  ignore("cast StringType to IntegerType") {
+  test("cast StringType to IntegerType") {
 // test with hand-picked values
 castTest(castStringToIntegralInputs.toDF("a"), DataTypes.IntegerType)
 // fuzz test
-castTest(gen.generateStrings(dataSize, numericPattern, 8).toDF("a"), 
DataTypes.IntegerType)
+castTest(
+  gen
+.generateStrings(dataSize, numericPattern, 8)
+.toDF("a")
+.withColumn("a", functions.trim($"a")),
+  DataTypes.IntegerType)

Review Comment:
   I'd prefer to see us fix the compatibility issue rather than skip testing 
strings that have leading and/or trailing whitespace. I have been looking into 
this and was going to make some comments here, but it turned out to be a bit 
more involved than I thought it would be, so I will create a PR soon with my 
proposed fix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] feat: eliminate group by constant optimizer rule [datafusion]

2024-05-20 Thread via GitHub


korowa opened a new pull request, #10591:
URL: https://github.com/apache/datafusion/pull/10591

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   Initial intention was to improve ClickBench q34 -- it contains aggregation 
by a constant and URL, which makes the aggregation switch to row-based mode, 
which is slower than a single-field group by.
   
   In general, elimination of constants from `GROUP BY` seems to be a reasonable 
optimization.
   
   ## What changes are included in this PR?
   
   
   
   New logical optimizer rule which removes constant or originated from 
constant expressions from `Aggregate::group_expr`, and places additional 
projection on top of aggregation, to satisfy original schema for downstream 
operators.
   
   Optimizer rule is placed between last `SimplifyExpressions` and 
`OptimizeProjections`, since it requires constant evaluation, and projections, 
produced by this rule, may be merged.
   
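   (Illustrative aside, not part of this PR and not DataFusion's actual `Expr`/optimizer
   types: a toy sketch of the rewrite. Constant group expressions are dropped from the
   aggregate, and a projection re-emits them so the output seen by downstream operators
   is unchanged.)

   ```rust
   // Toy expression type standing in for DataFusion's Expr.
   #[derive(Clone, Debug, PartialEq)]
   enum Expr {
       Column(String),
       Literal(i64),
   }

   /// Returns the group expressions the aggregate keeps plus a projection list
   /// that restores the original group-by output (constants re-added as literals).
   fn eliminate_constant_groups(group_expr: &[Expr]) -> (Vec<Expr>, Vec<Expr>) {
       let kept: Vec<Expr> = group_expr
           .iter()
           .filter(|e| !matches!(e, Expr::Literal(_)))
           .cloned()
           .collect();
       let projection = group_expr.to_vec();
       (kept, projection)
   }

   fn main() {
       // GROUP BY 1, url  ->  aggregate groups only by url; the projection adds 1 back.
       let groups = vec![Expr::Literal(1), Expr::Column("url".into())];
       let (kept, projection) = eliminate_constant_groups(&groups);
       assert_eq!(kept, vec![Expr::Column("url".into())]);
       assert_eq!(projection, groups);
   }
   ```
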
   ## Are these changes tested?
   
   
   
   New unit tests and sqllogictests
   
   ## Are there any user-facing changes?
   
   
   
   No
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-20 Thread via GitHub


tshauck commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1607197818


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+val str_table = "string_hex_table"
+withTable(str_table) {

Review Comment:
   Thanks, will have a look at `makeRawTimeParquetFile` and follow up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: add guide to adding a new expression [datafusion-comet]

2024-05-20 Thread via GitHub


tshauck commented on PR #422:
URL: https://github.com/apache/datafusion-comet/pull/422#issuecomment-2121071344

   @andygrove Conflict should actually be fixed now, thanks @kazuyukitanimura 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: add guide to adding a new expression [datafusion-comet]

2024-05-20 Thread via GitHub


tshauck commented on code in PR #422:
URL: https://github.com/apache/datafusion-comet/pull/422#discussion_r1607185036


##
docs/source/index.rst:
##
@@ -58,7 +58,11 @@ as a native runtime to achieve improvement in terms of query 
efficiency and quer
Comet Plugin Overview 
Development Guide 
Debugging Guide 
+<<< HEAD

Review Comment:
   Ah, thanks! 
[1c37b49](https://github.com/apache/datafusion-comet/pull/422/commits/1c37b490b5e0d80d6de5b3ff6844f489a8cf)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Pass BigQuery options to the ArrowSchema [datafusion]

2024-05-20 Thread via GitHub


davisp commented on PR #10590:
URL: https://github.com/apache/datafusion/pull/10590#issuecomment-2121051712

   Also, for anyone more familiar with datafusion and/or sqlparser, one thing I 
wasn't 100% on was how to represent the metadata value. For now I've just 
called format on it, but I have the nagging suspicion there might be a way to 
serialize those so that TableProviderFactory folks can deserialize them more 
easily rather than having to resort to things like stripping `'` and `"` 
blindly when converting the DFSchema to whatever they're working with.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Pass BigQuery options to the ArrowSchema [datafusion]

2024-05-20 Thread via GitHub


davisp opened a new pull request, #10590:
URL: https://github.com/apache/datafusion/pull/10590

   ## Which issue does this PR close?
   
   Closes #10589
   
   ## Rationale for this change
   
   Provide per-column key/value options in the `CREATE EXTERN TABLE` statement.
   
   ## What changes are included in this PR?
   
   This sets the BigQuery column `OPTIONS` key/values into the Arrow Schema's 
fields' metadata.
   
   ## Are these changes tested?
   
   Barely. I assume I'll need to add a couple more, but I figured I'd wait 
until I see what sort of enthusiasm this proposal receives.
   
   ## Are there any user-facing changes?
   
   Users can now access any parsed BigQuery options through each Field's metadata. 
This gives TableProvider/TableProviderFactory implementations the ability 
to accept per-field options. For a query such as:
   
   ```sql
   CREATE EXTERNAL TABLE test(
 col1 bigint NULL,
 col2 bigint NOT NULL OPTIONS(compression='zstd(5)')
   )
   STORED AS parquet
   LOCATION 'foo.parquet';
   ```
   
   This results in the field 'col2' on the Arrow Schema having a metadata value 
of `{"sql_option.compression", "'zstd(5)'"}`.
   
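   (Illustrative aside, not part of this PR: a minimal sketch, assuming a recent 
   `arrow-schema` crate (the same types are re-exported from `arrow::datatypes`), of how a 
   TableProviderFactory could read the per-column options back off the schema. The 
   `sql_option.` prefix is the convention described above.)

   ```rust
   use std::collections::HashMap;

   use arrow_schema::{DataType, Field};

   fn main() {
       // A field as this PR would produce it from `OPTIONS(compression='zstd(5)')`.
       let field = Field::new("col2", DataType::Int64, false).with_metadata(HashMap::from([(
           "sql_option.compression".to_string(),
           "'zstd(5)'".to_string(),
       )]));

       // Collect just the sql_option.* entries, stripping the prefix.
       let options: HashMap<&str, &str> = field
           .metadata()
           .iter()
           .filter_map(|(k, v)| k.strip_prefix("sql_option.").map(|name| (name, v.as_str())))
           .collect();

       assert_eq!(options.get("compression"), Some(&"'zstd(5)'"));
   }
   ```
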
   I think technically this might be able to break downstream users in this 
scenario:
   
   1. For some reason they are using BigQuery OPTIONS clauses in their `CREATE 
EXTERN TABLE` column definitions.
   2. In their TableProviderFactory, they are attempting to access and use 
Field metadata in such a way that the now-present `sql_option.*` keys break 
things.
   
   The first requirement here seems fairly unlikely to me? I'm making the rough 
assumption that a "standard" BigQuery table creation statement would fail to 
parse as a DataFusion `CREATE EXTERN TABLE` statement, and thus it'd be limited to 
people who have a) translated a BigQuery statement to a DataFusion statement 
while keeping their column `OPTIONS` even though they don't actually get passed 
to the TableProviderFactory.
   
   For the second half, I could see this actually being rather common (that a 
`MyThing::from(arrow_schema)` blows up with unexpected metadata). Though the 
intersection of these two sets seems like it'd be tiny?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: add guide to adding a new expression [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #422:
URL: https://github.com/apache/datafusion-comet/pull/422#discussion_r1607164180


##
docs/source/index.rst:
##
@@ -58,7 +58,11 @@ as a native runtime to achieve improvement in terms of query 
efficiency and quer
Comet Plugin Overview 
Development Guide 
Debugging Guide 
+<<< HEAD

Review Comment:
   Looks like this is a merge/rebase error



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1607158744


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+val str_table = "string_hex_table"
+withTable(str_table) {

Review Comment:
   Let's add more types and values, something like `makeRawTimeParquetFile`



##
core/src/execution/datafusion/expressions/scalar_funcs/hex.rs:
##
@@ -0,0 +1,191 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use arrow::array::as_string_array;
+use arrow_array::StringArray;
+use arrow_schema::DataType;
+use datafusion::logical_expr::ColumnarValue;
+use datafusion_common::{
+cast::{as_binary_array, as_int64_array},
+exec_err, DataFusionError, ScalarValue,
+};
+use std::fmt::Write;
+
+fn hex_bytes(bytes: &[u8]) -> Vec {
+let length = bytes.len();
+let mut value = vec![0; length * 2];
+let mut i = 0;
+while i < length {
+value[i * 2] = (bytes[i] & 0xF0) >> 4;
+value[i * 2 + 1] = bytes[i] & 0x0F;
+i += 1;
+}
+value
+}
+
+fn hex_int64(num: i64) -> String {
+if num >= 0 {
+format!("{:X}", num)
+} else {
+format!("{:016X}", num as u64)
+}
+}
+
+fn hex_string(s: &str) -> Vec {
+hex_bytes(s.as_bytes())
+}
+
+fn hex_bytes_to_string(bytes: &[u8]) -> Result {
+let mut hex_string = String::with_capacity(bytes.len() * 2);
+for byte in bytes {
+write!(&mut hex_string, "{:01X}", byte)?;
+}
+Ok(hex_string)
+}
+
+pub(super) fn spark_hex(args: &[ColumnarValue]) -> Result {
+if args.len() != 1 {
+return Err(DataFusionError::Internal(
+"hex expects exactly one argument".to_string(),
+));
+}
+
+match &args[0] {
+ColumnarValue::Array(array) => match array.data_type() {
+DataType::Int64 => {

Review Comment:
   Wondering if we need to support more types?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Pass per-field BigQuery `OPTIONS` values to the LogicalPlan's Arrow Schema [datafusion]

2024-05-20 Thread via GitHub


davisp opened a new issue, #10589:
URL: https://github.com/apache/datafusion/issues/10589

   ### Is your feature request related to a problem or challenge?
   
   I've been reading and learning the TableProvider APIs and have finally 
gotten around to taking a serious look at implementing support for `CREATE 
EXTERNAL TABLE`. I started down the road of using the statement-level `OPTIONS` 
for adding field-specific key/value config pairs, similar to how the Parquet 
provider does with the `OPTION::COLUMN.NESTED.PATH` keys. However, my 
particular use case has quite a few field-specific options that would 
be decently more ergonomic with field-level options, so I went looking for how I 
might be able to customize the SQL parser to support something of that nature.
   
   After many false starts I ended up stumbling over the [BigQuery column 
options 
support](https://docs.rs/sqlparser/latest/sqlparser/ast/enum.ColumnOption.html#variant.Options)
 and thought that would be perfect. While I was a bit disappointed it's not 
directly supported, it turned out to be relatively easy to plumb through, so 
that's what I've done.
   
   ### Describe the solution you'd like
   
   I'm not tied to anything specific beyond "allow for column specific 
key/value configuration options". The PR I'll open after filing this issue is 
one possible implementation, but it doesn't have to be *the* implementation by 
any means.
   
   ### Describe alternatives you've considered
   
   My original thought was basically that I was going to have to go the 
route used in the 
[`sql_dialect.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs)
 where I would basically have to reimplement the `CREATE EXTERNAL TABLE` logic 
and then somehow plumb my extra options around.
   
   And the reason I went that direction was that my first (hacky) thought was to 
re-parse the `CreateExternalTable::definition` field, but it apparently doesn't 
keep the actual original SQL around (at least via the CLI, AFAICT).
   
   ### Additional context
   
   I've got no idea whether the reaction to this will be "Oh, neat!" or "Ewww, 
gross!". I'm not particularly tied to it because in the end I can always just 
go to the statement wide options, though given how easy it was to add it seems 
like it might be acceptable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] DataFusion reads Date32 and Date64 parquet statistics in as [datafusion]

2024-05-20 Thread via GitHub


edmondop commented on issue #10587:
URL: https://github.com/apache/datafusion/issues/10587#issuecomment-2121024857

   @alamb the title here doesn't make much sense, are you saying that the `min` 
and `max` are not extracted as Date32/Date64?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]

2024-05-20 Thread via GitHub


alamb merged PR #10537:
URL: https://github.com/apache/datafusion/pull/10537


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Compute murmur3 hash with dictionary input correctly [datafusion-comet]

2024-05-20 Thread via GitHub


kazuyukitanimura commented on code in PR #433:
URL: https://github.com/apache/datafusion-comet/pull/433#discussion_r1607151015


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1452,17 +1452,55 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 withTable(table) {
   sql(s"create table $table(col string, a int, b float) using parquet")
   sql(s"""
- |insert into $table values
- |('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
- |, ('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
- |""".stripMargin)
-  checkSparkAnswerAndOperator("""
-   |select
-   |md5(col), md5(cast(a as string)), md5(cast(b as string)),
-   |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), 
hash(b, a, col),
-   |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), 
sha2(col, 512), sha2(col, 128)
-   |from test
+   |insert into $table values
+   |('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
+   |, ('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), 
('苹果手机', NULL, 3.99)
|""".stripMargin)
+  checkSparkAnswerAndOperator("""
+  |select
+  |md5(col), md5(cast(a as string)), md5(cast(b as string)),
+  |hash(col), hash(col, 1), hash(col, 0), hash(col, a, b), hash(b, 
a, col),
+  |sha2(col, 0), sha2(col, 256), sha2(col, 224), sha2(col, 384), 
sha2(col, 512), sha2(col, 128)
+  |from test
+  |""".stripMargin)
+}
+  }
+}
+  }
+
+  test("hash functions with random input") {
+val dataGen = DataGenerator.DEFAULT
+// sufficient number of rows to create dictionary encoded ArrowArray.
+val randomNumRows = 1000

Review Comment:
   Potentially we can add repeated values to force dictionary. E.g. randomly 
generate 100 rows and repeat 10 times to make 1000 rows



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Consolidate some integration tests into `core_integration` [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10588:
URL: https://github.com/apache/datafusion/pull/10588#discussion_r1607149443


##
datafusion/core/tests/custom_sources.rs:
##
@@ -1,308 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one

Review Comment:
   This was moved to custom_sources_cases.mod



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Minor: Consolidate some integration tests into `core_integration` [datafusion]

2024-05-20 Thread via GitHub


alamb opened a new pull request, #10588:
URL: https://github.com/apache/datafusion/pull/10588

   ## Which issue does this PR close?
   
   
   ## Rationale for this change
   
   In an effort to make it faster to develop and test DataFusion, it would be 
nice if the resources required to run the tests were smaller.
   
   For example, on my machine, simply running the `memory_limit` integration 
test
   
   ```shell
   cargo test --test memory_limit
   ```
   
   Requires creating / linking a 148 MB binary:
   ```shell
   andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ du -s -h 
target/debug/deps/memory_limit-47b8096a6ecbd81c
   148M target/debug/deps/memory_limit-47b8096a6ecbd81c
   ```
   
   While creating the `core_integration` test requires 151M
   
   ```
   andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ du -s -h 
target/debug/deps/core_integration-2ccc5be60db1701a
   151M target/debug/deps/core_integration-2ccc5be60db1701a
   ```
   
   Not only does each integration test target require about 140MB of extra space and the time to link, it also slows down the test runs, as they are not run in parallel with other tests.
   
   ## What changes are included in this PR?
   1. Consolidate the XXX tests into the `core_integration` test target
   
   After this PR 
   
   
   
   ## Are these changes tested?
   
   They are all tests
   
   
   ## Are there any user-facing changes?
   
   No, this is a developer-productivity change only.
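   
   For reference, a consolidated Cargo integration-test target is typically a single `tests/*.rs` file that pulls the former stand-alone test files in as modules, so `cargo test --test core_integration` runs them all from one linked binary. A minimal sketch (module and file names are illustrative, not the exact set moved in this PR):
   
   ```rust
   // tests/core_integration.rs — the only file Cargo compiles into a test binary.
   //
   // Former stand-alone targets become plain modules, e.g.:
   //   mod custom_sources_cases;   // was tests/custom_sources.rs
   //   mod memory_limit_cases;     // was tests/memory_limit.rs
   // so they share one executable and one link step.
   
   // Inline module standing in for one of the consolidated files, so this
   // sketch compiles on its own.
   mod example_cases {
       #[test]
       fn still_discovered_under_the_single_target() {
           assert_eq!(2 + 2, 4);
       }
   }
   ```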


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Improve documentation in sql_to_plan example [datafusion]

2024-05-20 Thread via GitHub


edmondop commented on code in PR #10582:
URL: https://github.com/apache/datafusion/pull/10582#discussion_r1607146553


##
datafusion-examples/examples/plan_to_sql.rs:
##
@@ -22,36 +22,45 @@ use datafusion::sql::unparser::expr_to_sql;
 use datafusion_sql::unparser::dialect::CustomDialect;
 use datafusion_sql::unparser::{plan_to_sql, Unparser};
 
-/// This example demonstrates the programmatic construction of
-/// SQL using the DataFusion Expr [`Expr`] and LogicalPlan [`LogicalPlan`] API.
+/// This example demonstrates the programmatic construction of SQL strings 
using
+/// the DataFusion Expr [`Expr`] and LogicalPlan [`LogicalPlan`] API.
 ///
 ///
 /// The code in this example shows how to:
-/// 1. Create SQL from a variety of Expr and LogicalPlan: [`main`]`
-/// 2. Create a simple expression [`Exprs`] with fluent API
-/// and convert to sql: [`simple_expr_to_sql_demo`]
-/// 3. Create a simple expression [`Exprs`] with fluent API
-/// and convert to sql without escaping column names: 
[`simple_expr_to_sql_demo_no_escape`]
-/// 4. Create a simple expression [`Exprs`] with fluent API
-/// and convert to sql escaping column names a MySQL style: 
[`simple_expr_to_sql_demo_escape_mysql_style`]
+///
+/// 1. [`simple_expr_to_sql_demo`]: Create a simple expression [`Exprs`] with
+/// fluent API and convert to sql suitable for passing to another database
+///
+/// 2. [`simple_expr_to_sql_demo_no_escape`]  Create a simple expression
+/// [`Exprs`] with fluent API and convert to sql without escaping column names
+/// more suitable for displaying to humans.
+///
+/// 3. [`simple_expr_to_sql_demo_escape_mysql_style`]" Create a simple
+/// expression [`Exprs`] with fluent API and convert to sql escaping column
+/// names in MySQL style.
+///
+/// 4. [`simple_plan_to_sql_demo`]: Create a simple logical plan using the
+/// DataFrames API and convert to sql string.
+///
+/// 5. [`round_trip_plan_to_sql_demo`]: Create a logical plan from a SQL 
string, modify it using the
+/// DataFrames API and convert it back to a  sql string.
 
 #[tokio::main]
 async fn main() -> Result<()> {
 // See how to evaluate expressions
 simple_expr_to_sql_demo()?;
 simple_expr_to_sql_demo_no_escape()?;
 simple_expr_to_sql_demo_escape_mysql_style()?;
-simple_plan_to_sql_parquest_dataframe_demo().await?;
-round_trip_plan_to_sql_parquest_dataframe_demo().await?;
+simple_plan_to_sql_demo().await?;
+round_trip_plan_to_sql_demo().await?;
 Ok(())
 }
 
 /// DataFusion can convert expressions to SQL, using column name escaping
 /// PostgreSQL style.
 fn simple_expr_to_sql_demo() -> Result<()> {
 let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
-let ast = expr_to_sql(&expr)?;
-let sql = format!("{}", ast);
+let sql = expr_to_sql(&expr)?.to_string();

Review Comment:
   I don't remember why I had to do it, that totally makes sense



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Row groups are read out of order or with completely different values [datafusion]

2024-05-20 Thread via GitHub


twitu closed issue #10572: Row groups are read out of order or with completely 
different values
URL: https://github.com/apache/datafusion/issues/10572


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]

2024-05-20 Thread via GitHub


alamb commented on PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#issuecomment-2120986025

   I have filed  the following tickets
   * https://github.com/apache/datafusion/issues/10585
   *  https://github.com/apache/datafusion/issues/10586
   *  #10587
   
   I think this PR is now ready to go. I plan to merge it in when the CI passes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607125587


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -0,0 +1,654 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This file contains an end to end test of extracting statitics from parquet 
files.
+//! It writes data into a parquet file, reads statistics and verifies they are 
correct
+
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow_array::{
+make_array, Array, ArrayRef, Int16Array, Int32Array, Int64Array, Int8Array,
+RecordBatch, UInt64Array,
+};
+use arrow_schema::{DataType, Field, Schema};
+use datafusion::datasource::physical_plan::parquet::{
+RequestedStatistics, StatisticsConverter,
+};
+use parquet::arrow::arrow_reader::{ArrowReaderBuilder, 
ParquetRecordBatchReaderBuilder};
+use parquet::arrow::ArrowWriter;
+use parquet::file::properties::{EnabledStatistics, WriterProperties};
+
+use crate::parquet::Scenario;
+
+use super::make_test_file_rg;
+
+// TEST HELPERS
+
+/// Return a record batch with i64 with Null values
+fn make_int64_batches_with_null(
+null_values: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+) -> RecordBatch {
+let schema = Arc::new(Schema::new(vec![Field::new("i64", DataType::Int64, 
true)]));
+
+let v64: Vec<i64> = (no_null_values_start as _..no_null_values_end as _).collect();
+
+RecordBatch::try_new(
+schema,
+vec![make_array(
+Int64Array::from_iter(
+v64.into_iter()
+.map(Some)
+.chain(std::iter::repeat(None).take(null_values)),
+)
+.to_data(),
+)],
+)
+.unwrap()
+}
+
+// Create a parquet file with one column for data type i64
+// Data of the file include
+//   . Number of null rows is the given num_null
+//   . There are non-null values in the range [no_null_values_start, 
no_null_values_end], one value each row
+//   . The file is divided into row groups of size row_per_group
+pub fn parquet_file_one_column(
+num_null: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let mut output_file = tempfile::Builder::new()
+.prefix("parquert_statistics_test")
+.suffix(".parquet")
+.tempfile()
+.expect("tempfile creation");
+
+let props = WriterProperties::builder()
+.set_max_row_group_size(row_per_group)
+.set_statistics_enabled(EnabledStatistics::Chunk)
+.build();
+
+let batches = vec![make_int64_batches_with_null(
+num_null,
+no_null_values_start,
+no_null_values_end,
+)];
+
+let schema = batches[0].schema();
+
+let mut writer = ArrowWriter::try_new(&mut output_file, schema, 
Some(props)).unwrap();
+
+for batch in batches {
+writer.write(&batch).expect("writing batch");
+}
+
+// close file
+let _file_meta = writer.close().unwrap();
+
+// open the file & get the reader
+let file = output_file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+// Create a parquet file with many columns each has different data type
+//  - Data types are specified by the given scenario
+//  - Row group sizes are withe the same or different depending on the 
provided row_per_group & data created in the scenario
+pub async fn parquet_file_many_columns(
+scenario: super::Scenario,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let file = make_test_file_rg(scenario, row_per_group).await;
+
+// open the file & get the reader
+let file = file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+struct Test {
+reader: ParquetRecordBatchReaderBuilder<File>,
+expected_min: ArrayRef,
+expected_max: ArrayRef,
+expected_null_counts: UInt64Array,
+expected_row_counts: UInt64Array,
+}
+
+impl Test {
+fn run(self, col_name: &str) {
+let Self {
+reader,
+expected_min,
+expected_max,
+expected_null_counts,
+expected_row_counts,
+} = self;
+
+le

[I] DataFusion reads Date32 and Date64 parquet statistics in as [datafusion]

2024-05-20 Thread via GitHub


alamb opened a new issue, #10587:
URL: https://github.com/apache/datafusion/issues/10587

   ### Describe the bug
   
   When reading the statistics for a Date32 or Date64 column from a parquet file, DataFusion currently returns an Int32 array.
   
   
   
   ### To Reproduce
   
   You can see the issue in https://github.com/apache/datafusion/pull/10537 
   test_dates_32_diff_rg_sizes
   
   ### Expected behavior
   
   I expect a
   1.  Date32 column to be read as `Date32Array`
   2. Date64 column to be read as `Date64Array`
   
   ### Additional context
   
   _No response_
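   
   One way to picture the expected behavior: the raw statistics really are i32 "days since epoch", so the converter only needs to re-type them using the column's Arrow type. A rough sketch with the arrow cast kernel, illustrating the desired output type rather than the actual fix (crate layout `arrow-array` / `arrow-cast` / `arrow-schema` assumed):
   
   ```rust
   use std::sync::Arc;
   
   use arrow_array::{Array, ArrayRef, Date32Array, Int32Array};
   use arrow_cast::cast;
   use arrow_schema::DataType;
   
   fn main() {
       // What the statistics extraction currently hands back for a Date32 column:
       // plain i32 values (days since the Unix epoch).
       let raw_min_max: ArrayRef = Arc::new(Int32Array::from(vec![Some(19_000), Some(19_500)]));
   
       // Re-typing them as Date32 keeps the same physical values but lets callers
       // compare them directly against Date32 columns.
       let typed = cast(&raw_min_max, &DataType::Date32).unwrap();
       let typed = typed.as_any().downcast_ref::<Date32Array>().unwrap();
       assert_eq!(typed.len(), 2);
       assert_eq!(typed.value(0), 19_000);
   }
   ```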


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607118413


##
datafusion/core/src/datasource/physical_plan/parquet/arrow_statistics.rs:
##
@@ -0,0 +1,43 @@
+use arrow_array::ArrayRef;
+use arrow_schema::DataType;
+use datafusion_common::Result;
+use parquet::file::statistics::Statistics as ParquetStatistics;
+
+/// statistics extracted from `Statistics` as Arrow `ArrayRef`s
+///
+/// # Note:
+/// If the corresponding `Statistics` is not present, or has no information for
+/// a column, a NULL is present in the  corresponding array entry
+pub struct ArrowStatistics {
+/// min values
+min: ArrayRef,
+/// max values
+max: ArrayRef,
+/// Row counts (UInt64Array)
+row_count: ArrayRef,
+/// Null Counts (UInt64Array)
+null_count: ArrayRef,
+}
+
+/// Extract `ArrowStatistics` from the  parquet [`Statistics`]
+pub fn parquet_stats_to_arrow<'a>(
+arrow_datatype: &DataType,
+statistics: impl IntoIterator<Item = Option<&'a ParquetStatistics>>,
+) -> Result<ArrowStatistics> {
+todo!() // MY TODO next
+}

Review Comment:
   Filed https://github.com/apache/datafusion/issues/10586



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] DataFusion ignores "column order" parquet statistics specification [datafusion]

2024-05-20 Thread via GitHub


alamb opened a new issue, #10586:
URL: https://github.com/apache/datafusion/issues/10586

   ### Describe the bug
   
   As @tustvold points out, there is a [`column_order` 
API](https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order)
 defined in parquet that is currently entirely ignored by DataFusion 
   
   It is not entirely clear to me what the implications of ignoring this field 
are or what other parquet writers populate it with, but we should probably not 
ignore it
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   To emphasise the point I made when this API was originally proposed, you 
need more than just the ParquetStatistics in order to correctly interpret the 
data. You need at least the FileMetadata to get the 
https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order
 in order to be able to even interpret what the statistics mean for a given 
column.
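   
   For anyone curious what the field actually holds, a small sketch that writes an in-memory file and dumps the per-column ordering from the parquet footer (assuming the `parquet`, `arrow-array`, `arrow-schema` and `bytes` crates; the interesting call is `FileMetaData::column_order`):
   
   ```rust
   use std::sync::Arc;
   
   use arrow_array::{Int32Array, RecordBatch};
   use arrow_schema::{DataType, Field, Schema};
   use bytes::Bytes;
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   use parquet::arrow::ArrowWriter;
   
   fn main() {
       // Write a tiny in-memory parquet file so there is a real footer to inspect.
       let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
       )
       .unwrap();
       let mut buf = Vec::new();
       let mut writer = ArrowWriter::try_new(&mut buf, schema, None).unwrap();
       writer.write(&batch).unwrap();
       writer.close().unwrap();
   
       // column_order(i) reports which ordering the writer claims for column i's
       // min/max statistics; UNDEFINED means the statistics cannot be interpreted
       // without further context, which is the situation this issue describes.
       let builder = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf)).unwrap();
       let file_meta = builder.metadata().file_metadata();
       for i in 0..file_meta.schema_descr().num_columns() {
           println!("column {i}: order = {:?}", file_meta.column_order(i));
       }
   }
   ```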
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607110465


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -0,0 +1,652 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This file contains an end to end test of extracting statitics from parquet 
files.
+//! It writes data into a parquet file, reads statistics and verifies they are 
correct
+
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow_array::{
+make_array, Array, ArrayRef, Int16Array, Int32Array, Int64Array, Int8Array,
+RecordBatch, UInt64Array,
+};
+use arrow_schema::{DataType, Field, Schema};
+use datafusion::datasource::physical_plan::parquet::{
+RequestedStatistics, StatisticsConverter,
+};
+use parquet::arrow::arrow_reader::{ArrowReaderBuilder, 
ParquetRecordBatchReaderBuilder};
+use parquet::arrow::ArrowWriter;
+use parquet::file::properties::{EnabledStatistics, WriterProperties};
+
+use crate::parquet::Scenario;
+
+use super::make_test_file_rg;
+
+// TEST HELPERS
+
+/// Return a record batch with i64 with Null values
+fn make_int64_batches_with_null(
+null_values: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+) -> RecordBatch {
+let schema = Arc::new(Schema::new(vec![Field::new("i64", DataType::Int64, 
true)]));
+
+let v64: Vec<i64> = (no_null_values_start as _..no_null_values_end as _).collect();
+
+RecordBatch::try_new(
+schema,
+vec![make_array(
+Int64Array::from_iter(
+v64.into_iter()
+.map(Some)
+.chain(std::iter::repeat(None).take(null_values)),
+)
+.to_data(),
+)],
+)
+.unwrap()
+}
+
+// Create a parquet file with one column for data type i64
+// Data of the file include
+//   . Number of null rows is the given num_null
+//   . There are non-null values in the range [no_null_values_start, 
no_null_values_end], one value each row
+//   . The file is divided into row groups of size row_per_group
+pub fn parquet_file_one_column(
+num_null: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let mut output_file = tempfile::Builder::new()
+.prefix("parquert_statistics_test")
+.suffix(".parquet")
+.tempfile()
+.expect("tempfile creation");
+
+let props = WriterProperties::builder()
+.set_max_row_group_size(row_per_group)
+.set_statistics_enabled(EnabledStatistics::Chunk)
+.build();
+
+let batches = vec![make_int64_batches_with_null(
+num_null,
+no_null_values_start,
+no_null_values_end,
+)];
+
+let schema = batches[0].schema();
+
+let mut writer = ArrowWriter::try_new(&mut output_file, schema, 
Some(props)).unwrap();
+
+for batch in batches {
+writer.write(&batch).expect("writing batch");
+}
+
+// close file
+let _file_meta = writer.close().unwrap();
+
+// open the file & get the reader
+let file = output_file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+// Create a parquet file with many columns each has different data type
+//  - Data types are specified by the given scenario
+//  - Row group sizes are withe the same or different depending on the 
provided row_per_group & data created in the scenario
+pub async fn parquet_file_many_columns(
+scenario: super::Scenario,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let file = make_test_file_rg(scenario, row_per_group).await;
+
+// open the file & get the reader
+let file = file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+struct Test {
+reader: ParquetRecordBatchReaderBuilder<File>,
+expected_min: ArrayRef,
+expected_max: ArrayRef,
+expected_null_counts: UInt64Array,
+expected_row_counts: UInt64Array,
+}
+
+impl Test {
+fn run(self, col_name: &str) {
+let Self {
+reader,
+expected_min,
+expected_max,
+expected_null_counts,
+expected_row_counts,
+} = self;
+
+le

Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607107595


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -0,0 +1,652 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This file contains an end to end test of extracting statitics from parquet 
files.
+//! It writes data into a parquet file, reads statistics and verifies they are 
correct
+
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow_array::{
+make_array, Array, ArrayRef, Int16Array, Int32Array, Int64Array, Int8Array,
+RecordBatch, UInt64Array,
+};
+use arrow_schema::{DataType, Field, Schema};
+use datafusion::datasource::physical_plan::parquet::{
+RequestedStatistics, StatisticsConverter,
+};
+use parquet::arrow::arrow_reader::{ArrowReaderBuilder, 
ParquetRecordBatchReaderBuilder};
+use parquet::arrow::ArrowWriter;
+use parquet::file::properties::{EnabledStatistics, WriterProperties};
+
+use crate::parquet::Scenario;
+
+use super::make_test_file_rg;
+
+// TEST HELPERS
+
+/// Return a record batch with i64 with Null values
+fn make_int64_batches_with_null(
+null_values: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+) -> RecordBatch {
+let schema = Arc::new(Schema::new(vec![Field::new("i64", DataType::Int64, 
true)]));
+
+let v64: Vec<i64> = (no_null_values_start as _..no_null_values_end as _).collect();
+
+RecordBatch::try_new(
+schema,
+vec![make_array(
+Int64Array::from_iter(
+v64.into_iter()
+.map(Some)
+.chain(std::iter::repeat(None).take(null_values)),
+)
+.to_data(),
+)],
+)
+.unwrap()
+}
+
+// Create a parquet file with one column for data type i64
+// Data of the file include
+//   . Number of null rows is the given num_null
+//   . There are non-null values in the range [no_null_values_start, 
no_null_values_end], one value each row
+//   . The file is divided into row groups of size row_per_group
+pub fn parquet_file_one_column(
+num_null: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let mut output_file = tempfile::Builder::new()
+.prefix("parquert_statistics_test")
+.suffix(".parquet")
+.tempfile()
+.expect("tempfile creation");
+
+let props = WriterProperties::builder()
+.set_max_row_group_size(row_per_group)
+.set_statistics_enabled(EnabledStatistics::Chunk)
+.build();
+
+let batches = vec![make_int64_batches_with_null(
+num_null,
+no_null_values_start,
+no_null_values_end,
+)];
+
+let schema = batches[0].schema();
+
+let mut writer = ArrowWriter::try_new(&mut output_file, schema, 
Some(props)).unwrap();
+
+for batch in batches {
+writer.write(&batch).expect("writing batch");
+}
+
+// close file
+let _file_meta = writer.close().unwrap();
+
+// open the file & get the reader
+let file = output_file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+// Create a parquet file with many columns each has different data type
+//  - Data types are specified by the given scenario
+//  - Row group sizes are withe the same or different depending on the 
provided row_per_group & data created in the scenario
+pub async fn parquet_file_many_columns(
+scenario: super::Scenario,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let file = make_test_file_rg(scenario, row_per_group).await;
+
+// open the file & get the reader
+let file = file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+struct Test {
+reader: ParquetRecordBatchReaderBuilder<File>,
+expected_min: ArrayRef,
+expected_max: ArrayRef,
+expected_null_counts: UInt64Array,
+expected_row_counts: UInt64Array,
+}
+
+impl Test {
+fn run(self, col_name: &str) {
+let Self {
+reader,
+expected_min,
+expected_max,
+expected_null_counts,
+expected_row_counts,
+} = self;
+
+le

Re: [I] Incorrect statistics read for `i8` `i16` columns in parquet [datafusion]

2024-05-20 Thread via GitHub


alamb commented on issue #10585:
URL: https://github.com/apache/datafusion/issues/10585#issuecomment-2120956336

   Possibly related to https://github.com/apache/datafusion/issues/9779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]

2024-05-20 Thread via GitHub


NGA-TRAN commented on code in PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607104991


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -0,0 +1,652 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This file contains an end to end test of extracting statitics from parquet 
files.
+//! It writes data into a parquet file, reads statistics and verifies they are 
correct
+
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow_array::{
+make_array, Array, ArrayRef, Int16Array, Int32Array, Int64Array, Int8Array,
+RecordBatch, UInt64Array,
+};
+use arrow_schema::{DataType, Field, Schema};
+use datafusion::datasource::physical_plan::parquet::{
+RequestedStatistics, StatisticsConverter,
+};
+use parquet::arrow::arrow_reader::{ArrowReaderBuilder, 
ParquetRecordBatchReaderBuilder};
+use parquet::arrow::ArrowWriter;
+use parquet::file::properties::{EnabledStatistics, WriterProperties};
+
+use crate::parquet::Scenario;
+
+use super::make_test_file_rg;
+
+// TEST HELPERS
+
+/// Return a record batch with i64 with Null values
+fn make_int64_batches_with_null(
+null_values: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+) -> RecordBatch {
+let schema = Arc::new(Schema::new(vec![Field::new("i64", DataType::Int64, 
true)]));
+
+let v64: Vec<i64> = (no_null_values_start as _..no_null_values_end as _).collect();
+
+RecordBatch::try_new(
+schema,
+vec![make_array(
+Int64Array::from_iter(
+v64.into_iter()
+.map(Some)
+.chain(std::iter::repeat(None).take(null_values)),
+)
+.to_data(),
+)],
+)
+.unwrap()
+}
+
+// Create a parquet file with one column for data type i64
+// Data of the file include
+//   . Number of null rows is the given num_null
+//   . There are non-null values in the range [no_null_values_start, 
no_null_values_end], one value each row
+//   . The file is divided into row groups of size row_per_group
+pub fn parquet_file_one_column(
+num_null: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let mut output_file = tempfile::Builder::new()
+.prefix("parquert_statistics_test")
+.suffix(".parquet")
+.tempfile()
+.expect("tempfile creation");
+
+let props = WriterProperties::builder()
+.set_max_row_group_size(row_per_group)
+.set_statistics_enabled(EnabledStatistics::Chunk)
+.build();
+
+let batches = vec![make_int64_batches_with_null(
+num_null,
+no_null_values_start,
+no_null_values_end,
+)];
+
+let schema = batches[0].schema();
+
+let mut writer = ArrowWriter::try_new(&mut output_file, schema, 
Some(props)).unwrap();
+
+for batch in batches {
+writer.write(&batch).expect("writing batch");
+}
+
+// close file
+let _file_meta = writer.close().unwrap();
+
+// open the file & get the reader
+let file = output_file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+// Create a parquet file with many columns each has different data type
+//  - Data types are specified by the given scenario
+//  - Row group sizes are withe the same or different depending on the 
provided row_per_group & data created in the scenario
+pub async fn parquet_file_many_columns(
+scenario: super::Scenario,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let file = make_test_file_rg(scenario, row_per_group).await;
+
+// open the file & get the reader
+let file = file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+struct Test {
+reader: ParquetRecordBatchReaderBuilder<File>,
+expected_min: ArrayRef,
+expected_max: ArrayRef,
+expected_null_counts: UInt64Array,
+expected_row_counts: UInt64Array,
+}
+
+impl Test {
+fn run(self, col_name: &str) {
+let Self {
+reader,
+expected_min,
+expected_max,
+expected_null_counts,
+expected_row_counts,
+} = self;
+
+   

[I] Incorrect statistics read for `i8` `i16` [datafusion]

2024-05-20 Thread via GitHub


alamb opened a new issue, #10585:
URL: https://github.com/apache/datafusion/issues/10585

   ### Describe the bug
   
   As @NGA-TRAN found in https://github.com/apache/datafusion/pull/10537, when i8 and i16 values are written to parquet and the statistics are then extracted, the returned min/max values are incorrect.
   
   This could lead to incorrect results when reading parquet files with filters 
on columns with `i8` and `i16` types
   
   ### To Reproduce
   
   See tests added in https://github.com/apache/datafusion/pull/10537
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_
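   
   For context (a sketch of the likely mechanics, not a confirmed diagnosis): parquet has no 8- or 16-bit physical type, so Int8/Int16 columns are written as INT32 and their row-group statistics come back as i32 values that must be narrowed carefully to the logical Arrow type. The snippet below just surfaces those raw Int32 statistics for an Int8 column (crate layout `parquet` / `arrow-array` / `arrow-schema` / `bytes` assumed):
   
   ```rust
   use std::sync::Arc;
   
   use arrow_array::{Int8Array, RecordBatch};
   use arrow_schema::{DataType, Field, Schema};
   use bytes::Bytes;
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   use parquet::arrow::ArrowWriter;
   
   fn main() {
       // An Int8 column covering the full range, including negative values where a
       // careless i32 -> i8 conversion is most likely to go wrong.
       let schema = Arc::new(Schema::new(vec![Field::new("i8_col", DataType::Int8, false)]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int8Array::from(vec![-128i8, 0, 127]))],
       )
       .unwrap();
   
       let mut buf = Vec::new();
       let mut writer = ArrowWriter::try_new(&mut buf, schema, None).unwrap();
       writer.write(&batch).unwrap();
       writer.close().unwrap();
   
       // The footer statistics for this column are Int32 statistics, even though the
       // Arrow schema says Int8; mapping them back must preserve sign and range.
       let builder = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf)).unwrap();
       let column_meta = builder.metadata().row_group(0).column(0);
       println!("{:?}", column_meta.statistics());
   }
   ```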


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file + tests [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607104678


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -0,0 +1,652 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This file contains an end to end test of extracting statitics from parquet 
files.
+//! It writes data into a parquet file, reads statistics and verifies they are 
correct
+
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow_array::{
+make_array, Array, ArrayRef, Int16Array, Int32Array, Int64Array, Int8Array,
+RecordBatch, UInt64Array,
+};
+use arrow_schema::{DataType, Field, Schema};
+use datafusion::datasource::physical_plan::parquet::{
+RequestedStatistics, StatisticsConverter,
+};
+use parquet::arrow::arrow_reader::{ArrowReaderBuilder, 
ParquetRecordBatchReaderBuilder};
+use parquet::arrow::ArrowWriter;
+use parquet::file::properties::{EnabledStatistics, WriterProperties};
+
+use crate::parquet::Scenario;
+
+use super::make_test_file_rg;
+
+// TEST HELPERS
+
+/// Return a record batch with i64 with Null values
+fn make_int64_batches_with_null(
+null_values: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+) -> RecordBatch {
+let schema = Arc::new(Schema::new(vec![Field::new("i64", DataType::Int64, 
true)]));
+
+let v64: Vec<i64> = (no_null_values_start as _..no_null_values_end as _).collect();
+
+RecordBatch::try_new(
+schema,
+vec![make_array(
+Int64Array::from_iter(
+v64.into_iter()
+.map(Some)
+.chain(std::iter::repeat(None).take(null_values)),
+)
+.to_data(),
+)],
+)
+.unwrap()
+}
+
+// Create a parquet file with one column for data type i64
+// Data of the file include
+//   . Number of null rows is the given num_null
+//   . There are non-null values in the range [no_null_values_start, 
no_null_values_end], one value each row
+//   . The file is divided into row groups of size row_per_group
+pub fn parquet_file_one_column(
+num_null: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let mut output_file = tempfile::Builder::new()
+.prefix("parquert_statistics_test")
+.suffix(".parquet")
+.tempfile()
+.expect("tempfile creation");
+
+let props = WriterProperties::builder()
+.set_max_row_group_size(row_per_group)
+.set_statistics_enabled(EnabledStatistics::Chunk)
+.build();
+
+let batches = vec![make_int64_batches_with_null(
+num_null,
+no_null_values_start,
+no_null_values_end,
+)];
+
+let schema = batches[0].schema();
+
+let mut writer = ArrowWriter::try_new(&mut output_file, schema, 
Some(props)).unwrap();
+
+for batch in batches {
+writer.write(&batch).expect("writing batch");
+}
+
+// close file
+let _file_meta = writer.close().unwrap();
+
+// open the file & get the reader
+let file = output_file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+// Create a parquet file with many columns each has different data type
+//  - Data types are specified by the given scenario
+//  - Row group sizes are withe the same or different depending on the 
provided row_per_group & data created in the scenario
+pub async fn parquet_file_many_columns(
+scenario: super::Scenario,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let file = make_test_file_rg(scenario, row_per_group).await;
+
+// open the file & get the reader
+let file = file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+struct Test {
+reader: ParquetRecordBatchReaderBuilder<File>,
+expected_min: ArrayRef,
+expected_max: ArrayRef,
+expected_null_counts: UInt64Array,
+expected_row_counts: UInt64Array,
+}
+
+impl Test {
+fn run(self, col_name: &str) {
+let Self {
+reader,
+expected_min,
+expected_max,
+expected_null_counts,
+expected_row_counts,
+} = self;
+
+le

Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607091830


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -0,0 +1,652 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This file contains an end to end test of extracting statitics from parquet 
files.
+//! It writes data into a parquet file, reads statistics and verifies they are 
correct
+
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow_array::{
+make_array, Array, ArrayRef, Int16Array, Int32Array, Int64Array, Int8Array,
+RecordBatch, UInt64Array,
+};
+use arrow_schema::{DataType, Field, Schema};
+use datafusion::datasource::physical_plan::parquet::{
+RequestedStatistics, StatisticsConverter,
+};
+use parquet::arrow::arrow_reader::{ArrowReaderBuilder, 
ParquetRecordBatchReaderBuilder};
+use parquet::arrow::ArrowWriter;
+use parquet::file::properties::{EnabledStatistics, WriterProperties};
+
+use crate::parquet::Scenario;
+
+use super::make_test_file_rg;
+
+// TEST HELPERS
+
+/// Return a record batch with i64 with Null values
+fn make_int64_batches_with_null(
+null_values: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+) -> RecordBatch {
+let schema = Arc::new(Schema::new(vec![Field::new("i64", DataType::Int64, 
true)]));
+
+let v64: Vec<i64> = (no_null_values_start as _..no_null_values_end as _).collect();
+
+RecordBatch::try_new(
+schema,
+vec![make_array(
+Int64Array::from_iter(
+v64.into_iter()
+.map(Some)
+.chain(std::iter::repeat(None).take(null_values)),
+)
+.to_data(),
+)],
+)
+.unwrap()
+}
+
+// Create a parquet file with one column for data type i64
+// Data of the file include
+//   . Number of null rows is the given num_null
+//   . There are non-null values in the range [no_null_values_start, 
no_null_values_end], one value each row
+//   . The file is divided into row groups of size row_per_group
+pub fn parquet_file_one_column(
+num_null: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let mut output_file = tempfile::Builder::new()
+.prefix("parquert_statistics_test")
+.suffix(".parquet")
+.tempfile()
+.expect("tempfile creation");
+
+let props = WriterProperties::builder()
+.set_max_row_group_size(row_per_group)
+.set_statistics_enabled(EnabledStatistics::Chunk)
+.build();
+
+let batches = vec![make_int64_batches_with_null(
+num_null,
+no_null_values_start,
+no_null_values_end,
+)];
+
+let schema = batches[0].schema();
+
+let mut writer = ArrowWriter::try_new(&mut output_file, schema, 
Some(props)).unwrap();
+
+for batch in batches {
+writer.write(&batch).expect("writing batch");
+}
+
+// close file
+let _file_meta = writer.close().unwrap();
+
+// open the file & get the reader
+let file = output_file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+// Create a parquet file with many columns each has different data type
+//  - Data types are specified by the given scenario
+//  - Row group sizes are withe the same or different depending on the 
provided row_per_group & data created in the scenario
+pub async fn parquet_file_many_columns(
+scenario: super::Scenario,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let file = make_test_file_rg(scenario, row_per_group).await;
+
+// open the file & get the reader
+let file = file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+struct Test {
+reader: ParquetRecordBatchReaderBuilder<File>,
+expected_min: ArrayRef,
+expected_max: ArrayRef,
+expected_null_counts: UInt64Array,
+expected_row_counts: UInt64Array,
+}
+
+impl Test {
+fn run(self, col_name: &str) {
+let Self {
+reader,
+expected_min,
+expected_max,
+expected_null_counts,
+expected_row_counts,
+} = self;
+
+le

Re: [PR] feat: API for collecting statistics/index for metadata of a parquet file [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10537:
URL: https://github.com/apache/datafusion/pull/10537#discussion_r1607087345


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -0,0 +1,652 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This file contains an end to end test of extracting statitics from parquet 
files.
+//! It writes data into a parquet file, reads statistics and verifies they are 
correct
+
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow_array::{
+make_array, Array, ArrayRef, Int16Array, Int32Array, Int64Array, Int8Array,
+RecordBatch, UInt64Array,
+};
+use arrow_schema::{DataType, Field, Schema};
+use datafusion::datasource::physical_plan::parquet::{
+RequestedStatistics, StatisticsConverter,
+};
+use parquet::arrow::arrow_reader::{ArrowReaderBuilder, 
ParquetRecordBatchReaderBuilder};
+use parquet::arrow::ArrowWriter;
+use parquet::file::properties::{EnabledStatistics, WriterProperties};
+
+use crate::parquet::Scenario;
+
+use super::make_test_file_rg;
+
+// TEST HELPERS
+
+/// Return a record batch with i64 with Null values
+fn make_int64_batches_with_null(
+null_values: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+) -> RecordBatch {
+let schema = Arc::new(Schema::new(vec![Field::new("i64", DataType::Int64, 
true)]));
+
+let v64: Vec<i64> = (no_null_values_start as _..no_null_values_end as _).collect();
+
+RecordBatch::try_new(
+schema,
+vec![make_array(
+Int64Array::from_iter(
+v64.into_iter()
+.map(Some)
+.chain(std::iter::repeat(None).take(null_values)),
+)
+.to_data(),
+)],
+)
+.unwrap()
+}
+
+// Create a parquet file with one column for data type i64
+// Data of the file include
+//   . Number of null rows is the given num_null
+//   . There are non-null values in the range [no_null_values_start, 
no_null_values_end], one value each row
+//   . The file is divided into row groups of size row_per_group
+pub fn parquet_file_one_column(
+num_null: usize,
+no_null_values_start: i64,
+no_null_values_end: i64,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let mut output_file = tempfile::Builder::new()
+.prefix("parquert_statistics_test")
+.suffix(".parquet")
+.tempfile()
+.expect("tempfile creation");
+
+let props = WriterProperties::builder()
+.set_max_row_group_size(row_per_group)
+.set_statistics_enabled(EnabledStatistics::Chunk)
+.build();
+
+let batches = vec![make_int64_batches_with_null(
+num_null,
+no_null_values_start,
+no_null_values_end,
+)];
+
+let schema = batches[0].schema();
+
+let mut writer = ArrowWriter::try_new(&mut output_file, schema, 
Some(props)).unwrap();
+
+for batch in batches {
+writer.write(&batch).expect("writing batch");
+}
+
+// close file
+let _file_meta = writer.close().unwrap();
+
+// open the file & get the reader
+let file = output_file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+// Create a parquet file with many columns each has different data type
+//  - Data types are specified by the given scenario
+//  - Row group sizes are withe the same or different depending on the 
provided row_per_group & data created in the scenario
+pub async fn parquet_file_many_columns(
+scenario: super::Scenario,
+row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+let file = make_test_file_rg(scenario, row_per_group).await;
+
+// open the file & get the reader
+let file = file.reopen().unwrap();
+ArrowReaderBuilder::try_new(file).unwrap()
+}
+
+struct Test {
+reader: ParquetRecordBatchReaderBuilder<File>,
+expected_min: ArrayRef,
+expected_max: ArrayRef,
+expected_null_counts: UInt64Array,
+expected_row_counts: UInt64Array,
+}
+
+impl Test {
+fn run(self, col_name: &str) {
+let Self {
+reader,
+expected_min,
+expected_max,
+expected_null_counts,
+expected_row_counts,
+} = self;
+
+le

Re: [PR] HashJoin LeftAnti Join should handle nulls correctly [datafusion]

2024-05-20 Thread via GitHub


viirya commented on code in PR #10584:
URL: https://github.com/apache/datafusion/pull/10584#discussion_r1607085982


##
datafusion/sqllogictest/test_files/join.slt:
##
@@ -793,3 +793,19 @@ DROP TABLE companies
 
 statement ok
 DROP TABLE leads
+
+
+# LeftAnti Join with null
+statement ok
+CREATE TABLE IF NOT EXISTS test_table(c1 INT, c2 INT) AS VALUES
+(1, 1),
+(2, 2),
+(3, 3),
+(4, null),
+(null, 0);
+
+query II
+SELECT * FROM test_table t1 LEFT ANTI JOIN test_table t2 ON t1.c1 = t2.c2
+
+4 NULL
+NULL 0

Review Comment:
   Currently LeftAnti join returns incorrect results.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] HashJoin LeftAnti Join should handle nulls correctly [datafusion]

2024-05-20 Thread via GitHub


viirya commented on PR #10584:
URL: https://github.com/apache/datafusion/pull/10584#issuecomment-2120931328

   Added the test case first. I will find some time to work on the fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] HashJoin LeftAnti Join should handle nulls correctly [datafusion]

2024-05-20 Thread via GitHub


viirya opened a new pull request, #10584:
URL: https://github.com/apache/datafusion/pull/10584

   ## Which issue does this PR close?
   
   
   
   Closes #10583.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] HashJoin LeftAnti Join handles nulls incorrectly [datafusion]

2024-05-20 Thread via GitHub


viirya opened a new issue, #10583:
URL: https://github.com/apache/datafusion/issues/10583

   ### Describe the bug
   
   While working on https://github.com/apache/datafusion-comet/pull/437, a few Spark join tests failed when delegating to DataFusion HashJoin.
   
   This is because DataFusion's HashJoin LeftAnti join returns incorrect results when there are nulls on either the left or the right side.
   
   
   ### To Reproduce
   
   Added a test to `join.slt`:
   
   ```
   statement ok
   CREATE TABLE IF NOT EXISTS test_table(c1 INT, c2 INT) AS VALUES
   (1, 1),
   (2, 2),
   (3, 3),
   (4, null),
   (null, 0);
   
   query II
   SELECT * FROM test_table t1 LEFT ANTI JOIN test_table t2 ON t1.c1 = t2.c2
   
   4 NULL
   NULL 0
   ```
   
   ### Expected behavior
   
   The above query should return an empty relation.
   
   ### Additional context
   
   _No response_
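   
   A minimal Rust reproduction of the slt snippet above (a sketch assuming a recent `datafusion` crate with the `tokio` runtime; it simply prints whatever the LEFT ANTI join currently returns instead of asserting the expected empty result):
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::SessionContext;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
       ctx.sql(
           "CREATE TABLE test_table(c1 INT, c2 INT) AS VALUES \
            (1, 1), (2, 2), (3, 3), (4, NULL), (NULL, 0)",
       )
       .await?
       .collect()
       .await?;
   
       // With the bug described above, this prints the two rows containing NULLs
       // instead of an empty result.
       ctx.sql("SELECT * FROM test_table t1 LEFT ANTI JOIN test_table t2 ON t1.c1 = t2.c2")
           .await?
           .show()
           .await?;
   
       Ok(())
   }
   ```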


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Improve documentation in sql_to_plan example [datafusion]

2024-05-20 Thread via GitHub


alamb commented on code in PR #10582:
URL: https://github.com/apache/datafusion/pull/10582#discussion_r1607076013


##
datafusion-examples/examples/plan_to_sql.rs:
##
@@ -22,36 +22,45 @@ use datafusion::sql::unparser::expr_to_sql;
 use datafusion_sql::unparser::dialect::CustomDialect;
 use datafusion_sql::unparser::{plan_to_sql, Unparser};
 
-/// This example demonstrates the programmatic construction of
-/// SQL using the DataFusion Expr [`Expr`] and LogicalPlan [`LogicalPlan`] API.
+/// This example demonstrates the programmatic construction of SQL strings 
using
+/// the DataFusion Expr [`Expr`] and LogicalPlan [`LogicalPlan`] API.
 ///
 ///
 /// The code in this example shows how to:
-/// 1. Create SQL from a variety of Expr and LogicalPlan: [`main`]`
-/// 2. Create a simple expression [`Exprs`] with fluent API
-/// and convert to sql: [`simple_expr_to_sql_demo`]
-/// 3. Create a simple expression [`Exprs`] with fluent API
-/// and convert to sql without escaping column names: 
[`simple_expr_to_sql_demo_no_escape`]
-/// 4. Create a simple expression [`Exprs`] with fluent API
-/// and convert to sql escaping column names a MySQL style: 
[`simple_expr_to_sql_demo_escape_mysql_style`]
+///
+/// 1. [`simple_expr_to_sql_demo`]: Create a simple expression [`Exprs`] with
+/// fluent API and convert to sql suitable for passing to another database
+///
+/// 2. [`simple_expr_to_sql_demo_no_escape`]  Create a simple expression
+/// [`Exprs`] with fluent API and convert to sql without escaping column names
+/// more suitable for displaying to humans.
+///
+/// 3. [`simple_expr_to_sql_demo_escape_mysql_style`]" Create a simple
+/// expression [`Exprs`] with fluent API and convert to sql escaping column
+/// names in MySQL style.
+///
+/// 4. [`simple_plan_to_sql_demo`]: Create a simple logical plan using the
+/// DataFrames API and convert to sql string.
+///
+/// 5. [`round_trip_plan_to_sql_demo`]: Create a logical plan from a SQL 
string, modify it using the
+/// DataFrames API and convert it back to a  sql string.
 
 #[tokio::main]
 async fn main() -> Result<()> {
 // See how to evaluate expressions
 simple_expr_to_sql_demo()?;
 simple_expr_to_sql_demo_no_escape()?;
 simple_expr_to_sql_demo_escape_mysql_style()?;
-simple_plan_to_sql_parquest_dataframe_demo().await?;
-round_trip_plan_to_sql_parquest_dataframe_demo().await?;
+simple_plan_to_sql_demo().await?;
+round_trip_plan_to_sql_demo().await?;
 Ok(())
 }
 
 /// DataFusion can convert expressions to SQL, using column name escaping
 /// PostgreSQL style.
 fn simple_expr_to_sql_demo() -> Result<()> {
 let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
-let ast = expr_to_sql(&expr)?;
-let sql = format!("{}", ast);
+let sql = expr_to_sql(&expr)?.to_string();

Review Comment:
   I think it makes it easier to use this example if we don't have to worry 
about the intermediate `ast` as much, so I combined it into a single statement
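   
   For illustration, both forms produce the same string; the only difference is whether the intermediate AST gets a name. A small self-contained sketch using the same imports the example already relies on:
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::{col, lit};
   use datafusion::sql::unparser::expr_to_sql;
   
   fn main() -> Result<()> {
       let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
   
       // Two-step form: keep the AST around if you want to inspect or modify it.
       let ast = expr_to_sql(&expr)?;
       let sql_via_ast = format!("{ast}");
   
       // One-step form used in the example: same output, no named intermediate.
       let sql = expr_to_sql(&expr)?.to_string();
   
       assert_eq!(sql_via_ast, sql);
       println!("{sql}");
       Ok(())
   }
   ```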



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]

2024-05-20 Thread via GitHub


comphead commented on code in PR #10573:
URL: https://github.com/apache/datafusion/pull/10573#discussion_r1607067858


##
datafusion/sql/src/unparser/expr.rs:
##
@@ -504,6 +508,14 @@ impl Unparser<'_> {
 .collect::<Result<Vec<_>>>()
 }
 
+pub(super) fn new_ident_quoted_if_needs(&self, ident: String) -> 
ast::Ident {

Review Comment:
   Please add a doc comment for this pub method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Add an example of how to convert LogicalPlan to/from SQL Strings [datafusion]

2024-05-20 Thread via GitHub


alamb closed issue #10550: Add an example of how to convert LogicalPlan to/from 
SQL Strings
URL: https://github.com/apache/datafusion/issues/10550


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add examples of how to convert logical plan to/from sql strings [datafusion]

2024-05-20 Thread via GitHub


alamb merged PR #10558:
URL: https://github.com/apache/datafusion/pull/10558


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org


