Re: [I] Select DISTINCT with LIMIT 10 is doing a full-scan of the database [arrow-datafusion]

via GitHub Tue, 10 Oct 2023 01:24:07 -0700


shner-elmo commented on issue #7781:
URL: 
https://github.com/apache/arrow-datafusion/issues/7781#issuecomment-1754677897


   > While it happens to be the case for your particular dataset that all 
values are present in the first file, I don't think there is any way for 
datafusion to know that. To answer the query faithfully it needs to check all 
the files (what if there was a new value in the last file?)
   
   Of course it cannot know that. But it should keep scanning the data **only** 
while it has not yet found 10 distinct values, i.e.
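   ```py
   unique_values = set()
   # `values` is any iterator over the column; stop at 10 distinct values
   # or when the input is exhausted, whichever comes first.
   for val in values:
       unique_values.add(val)
       if len(unique_values) >= 10:
           break
   ```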
   This is how it works in the major RDBMSs, so why can't we use the same 
optimization?
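
A slightly fuller sketch of the idea, assuming the scan hands out batches lazily. `distinct_with_limit` and `scan_files` are hypothetical names made up for this example, not DataFusion APIs, and the data is synthetic:

```python
from typing import Hashable, Iterable, Iterator, List


def distinct_with_limit(batches: Iterable[List[Hashable]], limit: int) -> List[Hashable]:
    """Return up to `limit` distinct values, pulling no more batches than needed."""
    seen = set()
    out: List[Hashable] = []
    if limit <= 0:
        return out
    for batch in batches:
        for val in batch:
            if val not in seen:
                seen.add(val)
                out.append(val)
                if len(out) == limit:
                    # Enough rows: return without requesting further batches.
                    return out
    # Input exhausted before reaching the limit.
    return out


def scan_files(n_files: int, read_log: List[int]) -> Iterator[List[int]]:
    """Hypothetical stand-in for a file scan; records which files were read."""
    for i in range(n_files):
        read_log.append(i)
        # Synthetic data: every file holds the values 0..9, repeated.
        yield [v % 10 for v in range(100)]


read_log: List[int] = []
result = distinct_with_limit(scan_files(1000, read_log), limit=10)
print(result, "files read:", len(read_log))
```

Because the scan is a generator, the remaining files are never opened once the limit is reached. If a new value only appeared in the last file, it would simply not be among the 10 rows returned, which is still a valid answer for a query with no ORDER BY.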
   
   I think the real issue is that we are using a GROUP BY behind the scenes to 
get the DISTINCT values. That works, but it is not the appropriate solution 
IMHO: a GROUP BY that computes aggregates has to consume all of its input 
before it can emit final results, whereas a DISTINCT query with a LIMIT can 
stop as soon as it has produced enough rows.
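
To illustrate the difference, here is a small sketch that assumes the GROUP BY carries an aggregate such as COUNT. The rows are made up for the example:

```python
from collections import Counter
from itertools import islice

rows = ["a", "b", "a", "c", "b", "a"]  # made-up column values

# GROUP BY val with COUNT(*): every row contributes, so all input must be read.
full_counts = Counter(rows)

# Stopping after the first 3 rows gives incomplete counts...
partial_counts = Counter(islice(rows, 3))

# ...but the distinct keys seen so far are already valid output rows for
# SELECT DISTINCT val ... LIMIT 2.
first_two_distinct = list(dict.fromkeys(islice(rows, 3)))[:2]

print(full_counts["a"], partial_counts["a"], first_two_distinct)
```

The count for `"a"` changes depending on how much input was read, while the two distinct keys are correct output as soon as they have been seen.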
   
   The fact of the matter is that it is possible to make this query much more 
efficient than it currently is.
   I'm willing to work on a fix if you guys are interested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
