On 7/31/12 9:23 PM, "Owen O'Malley" <omal...@apache.org> wrote:

>On Mon, Jul 30, 2012 at 11:38 PM, Namit Jain <nj...@fb.com> wrote:
>
>> That would be difficult. The % done can be estimated from the data
>>already
>> read.
>>
>
>I'm confused. Wouldn't the maximum size of the data remaining over the
>maximum size of the original query give a reasonable approximation of the
>amount of work done?
>

Yes and no: filter selectivity can vary a lot from row to row, so the data
already read is not a perfect proxy for the work remaining.
But yes, that is the best approximation we can have.
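As a concrete sketch of that approximation (hypothetical code, not anything in Hive itself; the names are illustrative), percent done can be taken as bytes already read over total input bytes:

```java
// Hypothetical sketch of the progress estimate discussed above:
// fraction done = bytes already read / total input bytes.
// This is NOT Hive's actual code; class and method names are made up.
public class ProgressEstimate {
    // Returns a fraction in [0, 1]. Clamped because filters and joins
    // can make the bytes-read counter overshoot the planner's estimate.
    static double percentDone(long bytesRead, long totalInputBytes) {
        if (totalInputBytes <= 0) {
            return 0.0; // unknown input size: report no progress
        }
        return Math.min(1.0, (double) bytesRead / totalInputBytes);
    }

    public static void main(String[] args) {
        // e.g. 256 MB read of a 1 GB input = 25% done
        System.out.println(percentDone(256L << 20, 1L << 30)); // prints 0.25
    }
}
```

As the discussion above notes, this is only an approximation: a highly selective filter can read 90% of the input while producing almost none of the output work.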

>
>>
>> It might be simpler to have a check like: if the query isn't done in
>> the first 5 seconds of running locally, you switch to mapreduce.
>>
>
>There are three problems I see:
>  * If the query is 95% done at 5 seconds, it is a shame to kill it and
>start over again at 0% on mapreduce with a much longer latency. (Instead of
>spending an additional 0.25 seconds, you spend an additional 60+.)
>  * You can't print anything until you know whether you are going to kill
>it or not. (The mapreduce results might come back in a different order.)
>With user-facing programs, it is much better to start printing early rather
>than late, since it gives the user faster feedback.


We cannot start printing early in either of the above approaches.
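The timeout-then-fallback idea quoted above could be sketched as follows (a hypothetical illustration, not Hive code; `local` and `mapreduce` are stand-in tasks). It also makes the first objection concrete: whatever the local attempt has computed is simply thrown away.

```java
import java.util.concurrent.*;

// Hypothetical sketch of "run locally; if not done within the budget,
// kill it and start over on MapReduce." Not actual Hive code.
public class LocalFirstRunner {
    static String runWithFallback(Callable<String> local,
                                  Callable<String> mapreduce,
                                  long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> attempt = pool.submit(local);
        try {
            // Wait up to the budget (e.g. 5 seconds) for the local run.
            return attempt.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            attempt.cancel(true);    // discard all local progress...
            return mapreduce.call(); // ...and restart from 0% on MapReduce
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that no result can be surfaced to the user until the deadline passes or the local run finishes, which is exactly the early-printing problem described above.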

>  * How the query will run isn't predictable, which makes it very hard to
>build applications on top of Hive.
>
>Do those make sense?
