On 7/31/12 9:23 PM, "Owen O'Malley" <omal...@apache.org> wrote:
>On Mon, Jul 30, 2012 at 11:38 PM, Namit Jain <nj...@fb.com> wrote:
>
>> That would be difficult. The % done can be estimated from the data
>> already read.
>
>I'm confused. Wouldn't the maximum size of the data remaining over the
>maximum size of the original query give a reasonable approximation of
>the amount of work done?

Yes and no: the filter behavior can vary a lot with the rows. But yes,
that is the best approximation we can have.

>> It might be simpler to have a check like: if the query isn't done in
>> the first 5 seconds of running locally, you switch to mapreduce.
>
>There are three problems I see:
> * If the query is 95% done at 5 seconds, it is a shame to kill it and
>start over again at 0% on mapreduce with a much longer latency.
>(Instead of spending an additional 0.25 seconds, you spend an
>additional 60+.)
> * You can't print anything until you know whether you are going to
>kill it or not. (The mapreduce results might come back in a different
>order....) With user-facing programs, it is much better to start
>printing early since it gives faster feedback to the user.

We cannot do this in either of the above approaches.

> * It isn't predictable how the query will run. That makes it very
>hard to build applications on top of Hive.
>
>Do those make sense?
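The 5-second heuristic being debated can be sketched as below. This is a minimal illustration in Python, not Hive's actual implementation; `run_locally`, `run_on_mapreduce`, and the fake runners are hypothetical names for this sketch. It also makes Owen's first objection concrete: a timed-out local run is discarded entirely and the query restarts from 0% on the cluster.

```python
import time

def run_with_fallback(query, run_locally, run_on_mapreduce, timeout=5.0):
    """Try the query in local mode; if it does not finish within
    `timeout` seconds, abandon it and rerun the query on MapReduce.

    Note the two costs discussed in the thread: any local progress is
    thrown away on timeout, and nothing can be printed to the user
    until we know which execution path will produce the final output.
    """
    start = time.monotonic()
    # The (hypothetical) local runner returns None if it hits the deadline.
    result = run_locally(query, deadline=start + timeout)
    if result is not None:
        return result  # finished locally within the budget
    # Timed out: restart from scratch on MapReduce, paying the full
    # cluster latency on top of the 5 seconds already spent.
    return run_on_mapreduce(query)

# Hypothetical stand-ins for demonstration: a "fast" query completes
# locally; a "slow" one times out (returns None) and falls back.
def fake_local(query, deadline):
    return f"local:{query}" if query == "fast" else None

def fake_mapreduce(query):
    return f"mr:{query}"
```

A quick check with the fakes: `run_with_fallback("fast", fake_local, fake_mapreduce)` returns the local result, while `"slow"` falls through to the MapReduce path.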