On Mon, Jul 30, 2012 at 9:12 PM, Namit Jain <nj...@fb.com> wrote:

> The total number of bytes of the input will be used to determine
> whether or not to launch a map-reduce job for this query. That was in
> my original mail.
>
> However, given an arbitrarily complex where condition and the lack of
> column statistics in Hive, we cannot determine the number of bytes
> that would be needed to satisfy the where condition.
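
A minimal sketch of that size-based check; the class name, constant,
and 128 MB cutoff are illustrative only, not Hive's actual
configuration:

    // Illustrative only: a real check would read the threshold from a
    // configuration property rather than a hard-coded constant.
    public class LocalModeDecision {
        // Hypothetical cutoff, e.g. 128 MB of total input.
        private static final long MAX_BYTES_FOR_LOCAL = 128L * 1024 * 1024;

        // Without column statistics we cannot estimate how many rows
        // survive the where condition, so total input bytes is the
        // only reliable signal for the decision.
        public static boolean runLocally(long totalInputBytes) {
            return totalInputBytes <= MAX_BYTES_FOR_LOCAL;
        }
    }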


All of these heuristics are guidelines, clearly. My inclination would
be to use the maximum data volume as the primary metric until we have a
better understanding of the cases where that doesn't work well. If we
are going to try the local solution and fall back to mapreduce, it
seems better to set the cutoff well short of completion so that you
don't waste as much work. Perhaps, if the query isn't 10% done in the
first 5 seconds of running locally, you switch to mapreduce. Would that
work?
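
Roughly what I have in mind, as a sketch; the LocalQuery interface and
the way the query reports progress are assumptions for illustration,
not an existing Hive API:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class LocalWithFallback {
        // Assumed interface: a local execution that can report its
        // progress as a fraction in [0, 1].
        interface LocalQuery extends Runnable {
            double progress();
        }

        static void execute(LocalQuery local, Runnable mapReduceJob)
                throws InterruptedException {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<?> attempt = pool.submit(local);
            try {
                // Give the local attempt a 5-second head start.
                TimeUnit.SECONDS.sleep(5);
                if (!attempt.isDone() && local.progress() < 0.10) {
                    // Cutting over this early wastes at most 5 seconds
                    // of local work instead of letting a doomed local
                    // attempt run on.
                    attempt.cancel(true);
                    mapReduceJob.run();
                }
                // Otherwise let the local attempt run to completion.
            } finally {
                pool.shutdown();
            }
        }
    }

The 10% and 5-second constants are just starting points and would need
tuning once we see real workloads.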

-- Owen
