alamb commented on issue #7363:
URL: https://github.com/apache/arrow-rs/issues/7363#issuecomment-2769697926
I did some analysis on the queries that show the largest slowdown:
```
│ QQuery 24 │ 509.28ms │ 684.55ms │ 1.34x slower │
│ QQuery 25 │ 426.36ms │ 553.26ms │ 1.30x slower │
│ QQuery 26 │ 581.56ms │ 802.46ms │ 1.38x slower │
...
│ QQuery 30 │ 970.56ms │ 1286.68ms │ 1.33x slower │
│ QQuery 31 │ 1008.49ms │ 1398.40ms │ 1.39x slower │
```
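As a quick sanity check on the table above, the slowdown factors can be recomputed from the two timing columns. This is only a sketch: it assumes the first column is the baseline time and the second is the time on the branch under test, which the table itself does not label, and the `slowdown` helper is a hypothetical name used for illustration.

```python
# Timings in milliseconds, copied from the table above.
# Assumption: (baseline_ms, changed_ms) column order.
timings = {
    24: (509.28, 684.55),
    25: (426.36, 553.26),
    26: (581.56, 802.46),
    30: (970.56, 1286.68),
    31: (1008.49, 1398.40),
}


def slowdown(baseline_ms: float, changed_ms: float) -> str:
    """Format the ratio changed/baseline the way the table reports it."""
    return f"{changed_ms / baseline_ms:.2f}x slower"


for query, (baseline_ms, changed_ms) in timings.items():
    print(f"Query {query}: {slowdown(baseline_ms, changed_ms)}")
```

Under that assumption the recomputed ratios agree with the reported column, so all five queries regress by roughly a third.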
Queries 24-26
```sql
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY
to_timestamp_seconds("EventTime") LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY
"SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY
to_timestamp_seconds("EventTime"), "SearchPhrase" LIMIT 10;
```
Queries 30-31
```sql
SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"),
AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY
"SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;
SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"),
AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "WatchID",
"ClientIP" ORDER BY c DESC LIMIT 10;
```
All of these queries share the filter `SearchPhrase <> ''`, which removes rows
with an empty search phrase.
### How selective is `SearchPhrase <> ''`?
This predicate selects about 13M of the 99M rows (roughly 13%):
```sql
> select count(*) from 'hits.parquet' WHERE "SearchPhrase" <> '';
+----------+
| count(*) |
+----------+
| 13172392 |
+----------+
1 row(s) fetched.
Elapsed 0.303 seconds.
> select count(*) from 'hits.parquet';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
```
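The selectivity follows directly from the two counts above. A minimal sketch of the arithmetic (the variable names are illustrative, not taken from any benchmark tooling):

```python
# Row counts copied from the two count(*) queries above.
matching_rows = 13_172_392  # rows where "SearchPhrase" <> ''
total_rows = 99_997_497     # all rows in hits.parquet

# Fraction of rows that pass the filter, and the number that are discarded.
selectivity = matching_rows / total_rows
filtered_out = total_rows - matching_rows

print(f"selectivity: {selectivity:.2%}")  # selectivity: 13.17%
print(f"rows filtered out: {filtered_out}")  # rows filtered out: 86825105
```

So the filter discards about 87% of the rows before any sorting or grouping takes place.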
I am now profiling to determine the root cause.