jayshrivastava commented on code in PR #22451:
URL: https://github.com/apache/datafusion/pull/22451#discussion_r3290529845


##########
datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs:
##########
@@ -289,6 +289,38 @@ impl PartialEq for HashTableLookupExpr {
 
 impl Eq for HashTableLookupExpr {}
 
+impl HashTableLookupExpr {
+    /// Serialize this expression to protobuf.
+    ///
+    /// `HashTableLookupExpr` holds an `Arc<Map>` (a runtime hash table built
+    /// on the build side) which cannot be serialized. We replace it with
+    /// `lit(true)`, which is safe because:
+    ///
+    /// - The filter is a performance optimisation, not a correctness 
requirement.
+    /// - `lit(true)` passes all rows so no valid rows are lost.
+    /// - In distributed execution the remote worker has no access to the
+    ///   build-side hash table anyway.

Review Comment:
   The comment comes from `roundtrip_hash_table_lookup_expr_to_lit()` in 
`/datafusion/datafusion/proto/tests/cases/roundtrip_physical_plan.rs` but I 
don't think it's very true.
   
   IIUC, hash joins might build a `HashTableLookupExpr` **during execution 
after the build side is done**. These expressions get placed in the dynamic 
filter expr.
   
   If you serialize before executing the plan, then there's no code path where 
there would be a `HashTableLookupExpr` in the plan today. If you deserialize 
and execute that plan, then the `HashJoinExec` may create a fresh 
`HashTableLookupExpr` for the dynamic filter. In this case, all the row pruning 
is preserved.
   
   If you serialize after executing, then any potential `HashTableLookupExpr` 
would be replaced with lit(true). I don't think this has any impact. One must 
call `reset_state` to re-execute plans, in which case I would expect the 
`HashTableLookupExpr` to disappear. In this case, all the row pruning is 
preserved as well.
   
   Maybe we can change the comment to explain these two cases? ^
   
   Since `HashTableLookupExpr` is public it might be good to warn users that it 
does not get serialized. We could (a) file a ticket to track that 
`HashTableLookupExpr` are not serialized and (b) add a comment directly on 
`HashTableLookupExpr`.
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to