LIMIT n is slower than it needs to be
-------------------------------------

                 Key: HIVE-588
                 URL: https://issues.apache.org/jira/browse/HIVE-588
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: Adam Kramer


SELECT a FROM t LIMIT 10;
...simply prints the output of the first 10 lines of the first file in the 
database. That's good.

However,
SELECT function(a) FROM t LIMIT 10;
appears to send all of t to the mappers, runs the function, and then 
returns the first 10 rows from whatever mapper(s) finish first. This is very 
slow in some cases!

Appropriate behavior for LIMIT would be to use ONE mapper, and to push files 
from the table into that mapper, and then auto-kill the mapper once it has 
output 10 rows...just take the first 10 rows and kill the whole task if 
necessary. On dying, throw some informative error message like, "Dying 
intentionally; LIMIT has been reached." This should be the case even for 
TRANSFORMs in the mapper...the TRANSFORM could spit out 20 rows, but once it 
has spit out 10, the whole task should die and the 10 should be returned 
immediately.
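
As an illustration only (this is plain Python, not Hive code), the early-stop 
behavior requested above can be sketched with lazy iteration. The names 
limited_transform, counting_rows and the doubling transform are hypothetical 
and exist only for this example; the point is that input rows stop being read 
as soon as the LIMIT is satisfied, even when the transform emits several 
output rows per input row.

```python
from itertools import chain, islice

def limited_transform(rows, transform, limit):
    """Lazily apply `transform` (which may emit several output rows per
    input row, like a TRANSFORM script) and stop once `limit` rows exist."""
    outputs = chain.from_iterable(transform(row) for row in rows)
    # islice stops pulling from `outputs` after `limit` items, so no
    # further input rows are read or transformed.
    return list(islice(outputs, limit))

def counting_rows(n, counter):
    """Hypothetical table scan that records how many rows were read."""
    for i in range(n):
        counter["read"] += 1
        yield i

counter = {"read": 0}
# Hypothetical transform: emits two output rows for every input row.
result = limited_transform(
    counting_rows(1_000_000, counter),
    lambda a: [a, a * 2],
    10,
)
print(result)
print(counter["read"])
```

In this sketch only as many input rows are read as are needed to produce 10 
output rows; the rest of the million-row "table" is never touched, which is 
the saving the LIMIT is expected to deliver.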

The purpose of LIMIT is not just to have "only one response," but it's also to 
speed up queries a whole lot. Running the function over the entire table is a 
big waste.

Obviously, when a reduce step is necessary, the whole table will have to be 
pushed through mappers and then copied and then sorted--but in those cases, 
as soon as 10 total rows have been output by the reducer(s), all reduce 
tasks should be killed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.