On Mon, Jul 30, 2012 at 11:38 PM, Namit Jain <nj...@fb.com> wrote:
> That would be difficult. The % done can be estimated from the data already
> read.
I'm confused. Wouldn't the maximum size of the data remaining, over the
maximum size of the original query's input, give a reasonable approximation
of the fraction of work left (and therefore of the amount done)? A rough
sketch of the estimate I mean is at the end of this message.

> It might be simpler to have a check like: if the query isn't done in
> the first 5 seconds of running locally, you switch to mapreduce.

There are three problems I see:

* If the query is 95% done at 5 seconds, it is a shame to kill it and
  start over again at 0% on MapReduce with a much longer latency.
  (Instead of spending an additional 0.25 seconds, you spend an
  additional 60+.)
* You can't print anything until you know whether you are going to kill
  the query or not. (The MapReduce results might come back in a
  different order.) With user-facing programs, it is much better to
  start printing early rather than late, since it gives the user faster
  feedback.
* It isn't predictable how the query will run, which makes it very hard
  to build applications on top of Hive.

Do those make sense?
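
For concreteness, here is a minimal Java sketch of the kind of estimate
I have in mind. The class and method names are made up for illustration;
they are not existing Hive APIs, and both inputs are assumed to be
upper-bound estimates taken from the local job's input tracking:

  // Sketch: approximate the fraction of work done for a locally
  // running query as bytes already read over the (estimated) total
  // input size. Hypothetical names, not real Hive APIs.
  public final class LocalProgressEstimate {
    private LocalProgressEstimate() {}

    // Returns a fraction in [0.0, 1.0]; 0.0 when the total is unknown.
    public static double fractionDone(long bytesRead, long maxInputBytes) {
      if (maxInputBytes <= 0) {
        return 0.0; // size unknown or nothing to read
      }
      return Math.min(1.0, (double) bytesRead / maxInputBytes);
    }
  }

With an estimate like that, the runtime could decide whether to switch
to MapReduce based on the predicted work remaining rather than a fixed
5-second cutoff, which avoids killing a query that is already 95% done.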