eejbyfeldt commented on PR #11232: URL: https://github.com/apache/datafusion/pull/11232#issuecomment-2206921868
> > @viirya Good suggestion. I tried added some sqllogictest, they work. But as far as I can tell they are not using HashJoinExec (even if the exact same query joining on only field is). Is there some way to force a HashJoin or help me understand why hashjoin is not used? > > Hmm, what join operator it is using? I think HashJoin is used by default. Just using `=` and an inner join. Here are the same joins in datafusion-cli. The one using struct uses `NestedLoopJoin` while the one using the id direcly uses the HashJoin ``` > CREATE TABLE join_t3(s3 struct<id INT>) AS VALUES (NULL), (struct(1)), (struct(2)); 0 row(s) fetched. Elapsed 0.003 seconds. > CREATE TABLE join_t4(s4 struct<id INT>) AS VALUES (NULL), (struct(2)), (struct(3)); 0 row(s) fetched. Elapsed 0.002 seconds. > explain analyze select join_t3.s3, join_t4.s4 from join_t3 inner join join_t4 on join_t3.s3 = join_t4.s4; +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Plan with Metrics | NestedLoopJoinExec: join_type=Inner, filter=s3@0 = s4@1, metrics=[output_rows=1, output_batches=1, build_input_batches=1, input_rows=3, build_input_rows=3, input_batches=1, build_mem_used=266, join_time=270.817µs, build_time=34.164µs] | | | MemoryExec: partitions=1, partition_sizes=[1], metrics=[] | | | MemoryExec: partitions=1, partition_sizes=[1], metrics=[] | | | | 
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row(s) fetched. Elapsed 0.002 seconds. > explain analyze select join_t3.s3, join_t4.s4 from join_t3 inner join join_t4 on join_t3.s3.id = join_t4.s4.id; +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Plan with Metrics | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=13.975µs] | | | HashJoinExec: mode=Partitioned, join_type=Inner, on=[(join_t3.s3[id]@1, join_t4.s4[id]@1)], projection=[s3@0, s4@2], metrics=[output_rows=1, output_batches=3, build_input_batches=3, input_rows=3, build_input_rows=3, input_batches=3, build_mem_used=2146, join_time=141.757µs, build_time=647.423µs] | | | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=3, elapsed_compute=68.41µs] | | | RepartitionExec: partitioning=Hash([join_t3.s3[id]@1], 16), input_partitions=1, metrics=[fetch_time=50.816µs, repart_time=84.96µs, send_time=16.675µs] | | | ProjectionExec: expr=[s3@0 as s3, get_field(s3@0, id) as join_t3.s3[id]], metrics=[output_rows=3, elapsed_compute=28.443µs] | | | MemoryExec: partitions=1, partition_sizes=[1], metrics=[] | | | CoalesceBatchesExec: 
target_batch_size=8192, metrics=[output_rows=3, elapsed_compute=129.626µs] | | | RepartitionExec: partitioning=Hash([join_t4.s4[id]@1], 16), input_partitions=1, metrics=[fetch_time=9.288µs, repart_time=36.86µs, send_time=11.054µs] | | | ProjectionExec: expr=[s4@0 as s4, get_field(s4@0, id) as join_t4.s4[id]], metrics=[output_rows=3, elapsed_compute=4.569µs] | | | MemoryExec: partitions=1, partition_sizes=[1], metrics=[] | | | | +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row(s) fetched. Elapsed 0.003 seconds. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
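For anyone following along, here is a rough standalone sketch of what the two operators in the plans above do differently. It is not DataFusion code: the `S` type and both functions are hypothetical stand-ins, `None` models a NULL struct, and the sketch assumes a NULL never matches anything in an inner join on `=`. The nested-loop version evaluates the predicate for every pair of rows, while the hash version builds a table on one side and probes it with the other, which is only possible when the join key can be hashed and compared for equality.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a `struct<id INT>` value (not a DataFusion type).
// A NULL struct is modelled as `None` at the call sites below.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct S {
    id: i32,
}

/// Nested-loop inner join: test the equality predicate on every row pair.
fn nested_loop_join(left: &[Option<S>], right: &[Option<S>]) -> Vec<(S, S)> {
    let mut out = Vec::new();
    // `flatten()` skips `None`, i.e. NULL rows never match (assumption).
    for l in left.iter().flatten() {
        for r in right.iter().flatten() {
            if l == r {
                out.push((*l, *r));
            }
        }
    }
    out
}

/// Hash inner join: build a hash table on the left side, probe with the right.
fn hash_join(left: &[Option<S>], right: &[Option<S>]) -> Vec<(S, S)> {
    let mut table: HashMap<S, Vec<S>> = HashMap::new();
    for l in left.iter().flatten() {
        table.entry(*l).or_default().push(*l);
    }
    let mut out = Vec::new();
    for r in right.iter().flatten() {
        if let Some(matches) = table.get(r) {
            out.extend(matches.iter().map(|l| (*l, *r)));
        }
    }
    out
}

fn main() {
    // Same values as join_t3 and join_t4 in the session above.
    let t3 = [None, Some(S { id: 1 }), Some(S { id: 2 })];
    let t4 = [None, Some(S { id: 2 }), Some(S { id: 3 })];

    let nl = nested_loop_join(&t3, &t4);
    let hj = hash_join(&t3, &t4);

    // Both strategies agree; only the amount of work differs.
    assert_eq!(nl, hj);
    println!("{:?}", hj);
}
```

Both functions return the single matching pair for these inputs, consistent with `output_rows=1` in both plans. The nested-loop version does work proportional to the product of the input sizes, which is why it matters which operator the planner picks for the struct comparison.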