I did some basic testing of multi-source queries with the most recent Spark.

The output of "spark.time()" surprised me:

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 1

+---+----+---+------+
| id|name| id| title|
+---+----+---+------+
|  1| Bob|  1|Todo 1|
|  1| Bob|  2|Todo 2|
+---+----+---+------+
Time taken: 168 ms

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 2

+---+-----+---+------+
| id| name| id| title|
+---+-----+---+------+
|  2|Alice|  3|Todo 3|
+---+-----+---+------+
Time taken: 228 ms

Calcite and Teiid manage to do this on the order of 5-50ms for basic queries,
so I'm curious about the technical specifics of why Spark appears to be so
much slower here.
