On 7/31/12 12:01 PM, "Owen O'Malley" <omal...@apache.org> wrote:

>On Mon, Jul 30, 2012 at 9:12 PM, Namit Jain <nj...@fb.com> wrote:
>
>> The total number of bytes of the input will be used to determine whether
>> or not to launch a map-reduce job for this query. That was in my original
>> mail.
>>
>> However, given any complex where condition and the lack of column
>> statistics in Hive, we cannot determine the number of bytes that would
>> be needed to satisfy the where condition.
>
>
>All of these heuristics are guidelines, clearly. My inclination would
>be to use the maximum data volume as the primary metric until we have a
>better understanding of cases where that doesn't work well. If we are
>going

Maximum data volume can be used to dictate the initial behavior. That has
already been documented in the JIRA.
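
For concreteness, a minimal sketch of what such a size-based check could
look like, in Java. Every name here (LocalModeHeuristic, shouldRunLocally,
SIZE_THRESHOLD_BYTES) is hypothetical and for illustration only; this is
not Hive's actual code, and in practice the threshold would come from a
config property rather than a constant:

    import java.io.File;

    public class LocalModeHeuristic {

        // Hypothetical threshold; illustrative only.
        static final long SIZE_THRESHOLD_BYTES = 128L * 1024 * 1024;

        // Decide purely on total input size. Without column statistics,
        // the bytes surviving a WHERE clause cannot be estimated, so the
        // raw input size is the only usable metric.
        static boolean shouldRunLocally(File[] inputFiles) {
            long totalBytes = 0;
            for (File f : inputFiles) {
                totalBytes += f.length();
                if (totalBytes > SIZE_THRESHOLD_BYTES) {
                    return false; // too big: launch a map-reduce job
                }
            }
            return true; // small enough to run locally
        }
    }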


>to try the local solution and fall back to mapreduce, it seems better to
>put a limit well short of being done so that you don't waste as much work.
>Perhaps, if the query isn't 10% done in the first 5 seconds of running
>locally, you switch to mapreduce. Would that work?

That would be difficult. The % done can only be estimated roughly from the
data already read.

It might be simpler to have a check like: if the query isn't done in
the first 5 seconds of running locally, you switch to mapreduce.
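
A sketch of that simpler fallback, again in Java with hypothetical names
(runLocal, runMapReduce, and the 5-second budget are illustrative, not
Hive's implementation): run the query locally under a time budget, and if
it hasn't finished, cancel it and fall back to map-reduce.

    import java.util.concurrent.*;

    public class TimedFallback {

        static final long LOCAL_BUDGET_SECONDS = 5;

        static <T> T execute(Callable<T> runLocal, Callable<T> runMapReduce)
                throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<T> local = pool.submit(runLocal);
            try {
                // Wait up to the budget for the local attempt to finish.
                return local.get(LOCAL_BUDGET_SECONDS, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                local.cancel(true);          // abandon the local work
                return runMapReduce.call();  // fall back to map-reduce
            } finally {
                pool.shutdownNow();
            }
        }
    }

Note that on the fallback path the local work is simply thrown away; that
wasted effort is exactly what Owen's earlier cutoff (10% done within 5
seconds) is trying to bound.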

>
>-- Owen
