[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

GitBox Wed, 08 Jun 2022 20:30:49 -0700


AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362



##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
 
     Ok(())
 }
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    let schema = Schema::new(vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+        Field::new("c3", DataType::Int64, true),
+        Field::new("c4", DataType::Date32, true),
+    ]);
+    let data = RecordBatch::try_new(
+        Arc::new(schema),
+        vec![
+            Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+            Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+            Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+            Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+        ],
+    )?;
+    let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+    ctx.register_table("foo", Arc::new(table))?;
+
+    // join on hash unsupported data type (Date32), use cross join instead 
hash join
+    let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+    let msg = format!("Creating logical plan for '{}'", sql);

Review Comment:
   > So I think CrossJoin is almost never what the user would want: as once the 
tables get beyond any trivial size the query will effectively never finish or 
will run out of memory. An error is clearer.
   > 
   > From the issue description [#2145 
(comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060)
 I think @pjmore's idea to cast unsupported types to a supported type is a good 
one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> 
`Int32` (no copies) as the representations are the same
   > 
   > @pjmore what do you think?
   
   I agree. supporting more data types in hash join is the better way to solve 
this issue, and i'm already working on it. 
   And this pr only wants to make hash unsupported  join running in cross join 
instead of error/panic, we can support more data types in hash join 
continuously. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Reply via email to