alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740888007
I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec
vs 12sec)
```sql
> SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY
"EventTime" LIMIT 10;
Elapsed 5.320 seconds.
```
Here is what I think the rewrite is
```sql
> SELECT * from 'hits_partitioned' WHERE "WatchID" IN (
SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER
BY "EventTime" LIMIT 10
);
Elapsed 12.023 seconds.
```
WatchID is a unique key
```sql
> select count(distinct "WatchID"), count(*) from 'hits_partitioned';
+------------------------------------------+----------+
| count(DISTINCT hits_partitioned.WatchID) | count(*) |
+------------------------------------------+----------+
| 99997493 | 99997497 |
+------------------------------------------+----------+
```
I also double checked the output
```shell
## orig
datafusion-cli -c "SELECT * FROM 'hits_partitioned' WHERE \"URL\" LIKE
'%google%' ORDER BY \"EventTime\" LIMIT 10;" > orig.out
## rewrite
datafusion-cli -c "SELECT * from 'hits_partitioned' WHERE \"WatchID\" IN
(SELECT \"WatchID\" FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER
BY \"EventTime\" LIMIT 10);" > rewrite.out
## check
sort orig.out > orig.out.sort
sort rewrite.out > rewrite.out.sort
diff orig.out.sort rewrite.out.sort
7c7
< Elapsed 5.649 seconds.
---
> Elapsed 11.067 seconds.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]