tobixdev commented on issue #19067: URL: https://github.com/apache/datafusion/issues/19067#issuecomment-3608986130
So I thought this would be a great excuse to look into the `HashJoin` code, as I had a bit of time at the end of my day. First and foremost, thanks @HawaiianSpork for the reproducer! It made finding the issue way easier.

Some findings:

- The issue is not a left join problem. The optimizer swaps the two join sides because the right side is smaller than the left one. The issue disappears if you add two additional rows to the right side (see the adjusted code below).
- I don't think the HashJoin is the problem. When the right join stream processes a batch, it calls `adjust_indices_by_join_type` to add additional indices that are related to the join type. In the right join, this adds indices for each row that is not matched. The left indices of those unmatched rows are correctly set to `NULL`. However, the `take` kernel for the fixed size binary array seems to ignore the validity of the indices array (https://github.com/apache/arrow-rs/issues/8947).

I'll run your reproducer against my fix in https://github.com/apache/arrow-rs/pull/8948 to see whether this is really the root cause.

@HawaiianSpork @Jefffrey Hopefully, you haven't been working on this issue. I just got an itch to look at the HashJoin code for this reason and didn't want to commit to solving the issue if I couldn't locate it.

**Larger right table that does not exercise the bug**

```rust
// Drop-in replacement for the right-side arrays in the reproducer.
// Relies on the reproducer's existing imports (Arc, ArrayRef,
// FixedSizeBinaryArray, Int64Array).

// Right-side join key: four 4-byte values, i.e. including the two additional rows.
let right_join_key = Arc::new(
    FixedSizeBinaryArray::try_from_sparse_iter_with_size(
        vec![
            Some(vec![0xAA, 0xAA, 0xAA, 0xAA]),
            Some(vec![0xBB, 0xBB, 0xBB, 0xBB]),
            Some(vec![0xDD, 0xBB, 0xBB, 0xBB]),
            Some(vec![0xEE, 0xBB, 0xBB, 0xBB]),
        ]
        .into_iter(),
        4,
    )
    .unwrap(),
) as ArrayRef;

// Right-side payload column, one value per key above.
let right_value = Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000])) as ArrayRef;
```
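To make the second finding concrete, here is a minimal plain-Rust model of the behaviour described above. It is not the DataFusion or arrow-rs implementation. The names `Fsb`, `append_unmatched_right`, `take_fsb`, `take_fsb_ignoring_validity`, and `demo` are hypothetical, the sample keys are made up, and the row order produced by the real `adjust_indices_by_join_type` may differ. The sketch assumes a null index slot physically holds `0`, which is one plausible reason an unmatched row could end up with a real left value instead of `NULL`.

```rust
/// A 4-byte fixed-size binary value; `None` models a null slot.
type Fsb = Option<[u8; 4]>;

/// Hypothetical model of the right-join adjustment: every right row that has
/// no match is appended, paired with a null (`None`) left index.
fn append_unmatched_right(
    left: &[Option<usize>],
    right: &[usize],
    right_len: usize,
) -> (Vec<Option<usize>>, Vec<usize>) {
    let mut left_out = left.to_vec();
    let mut right_out = right.to_vec();
    for row in 0..right_len {
        if !right.contains(&row) {
            left_out.push(None); // unmatched: no left row exists
            right_out.push(row);
        }
    }
    (left_out, right_out)
}

/// Expected "take" semantics: a null index yields a null output slot.
fn take_fsb(values: &[Fsb], indices: &[Option<usize>]) -> Vec<Fsb> {
    indices
        .iter()
        .map(|&idx| idx.and_then(|i| values[i]))
        .collect()
}

/// Faulty variant for comparison: the validity of the index is ignored and
/// the slot's raw content (assumed to be 0 here) is used as a real index.
fn take_fsb_ignoring_validity(values: &[Fsb], indices: &[Option<usize>]) -> Vec<Fsb> {
    indices.iter().map(|&idx| values[idx.unwrap_or(0)]).collect()
}

/// Runs both variants on a tiny example: left row 0 matches right row 0,
/// right row 1 has no match.
fn demo() -> (Vec<Fsb>, Vec<Fsb>) {
    let left_keys: Vec<Fsb> = vec![Some([0xAA; 4]), Some([0xCC; 4])];
    let (left_idx, _right_idx) = append_unmatched_right(&[Some(0)], &[0], 2);
    (
        take_fsb(&left_keys, &left_idx),
        take_fsb_ignoring_validity(&left_keys, &left_idx),
    )
}
```

Calling `demo()` returns the two gathered left-key columns side by side. In the first, the unmatched row is null; in the second, it carries the key of left row 0, which is the kind of symptom a `take` that ignores index validity would produce.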
