I don't know much about statistics, but in your case, duplicating
splits (x100?) by using a custom InputFormat might be much simpler.
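To make the idea concrete, here is a minimal, self-contained sketch of the split-duplication trick. Plain Strings stand in for Hadoop InputSplit objects; in a real implementation you would subclass the job's InputFormat and override getSplits() so that each underlying split appears K times, causing K copies of the map tasks to run over the same data (class and method names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: in a real Hadoop job this logic would live inside an
// InputFormat subclass's getSplits() override. Strings stand in for
// InputSplit objects to keep the example self-contained.
public class DuplicatingSplits {

    // Return each split repeated `copies` times (e.g. copies = 100
    // to get ~100 bootstrap runs over the same sample).
    static List<String> duplicateSplits(List<String> splits, int copies) {
        List<String> out = new ArrayList<>();
        for (String split : splits) {
            for (int i = 0; i < copies; i++) {
                out.add(split);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("part-00000", "part-00001");
        List<String> duplicated = duplicateSplits(splits, 100);
        // 2 original splits x 100 copies = 200 map tasks
        System.out.println(duplicated.size());
    }
}
```

Each copy of a split would also need a distinct random seed (e.g. encoded in the split itself) so the 100 runs produce different resamples rather than identical ones.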


2013/9/6 Sameer Agarwal <samee...@cs.berkeley.edu>

> Hi All,
>
> In order to support approximate queries in Hive and BlinkDB (
> http://blinkdb.org/), I am working towards implementing the bootstrap
> primitive (http://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in
> Hive
> that can help us quantify the "error" incurred by a query Q when it
> operates on a small sample S of data. This method essentially requires
> launching the query Q simultaneously on a large number of samples of the
> original data (typically >=100).
>
> The downside to this is of course that we have to launch the same query 100
> times, but the upside is that each of these queries would be so small that it
> can be executed on a single machine. So, in order to do this efficiently,
> we would ideally like to execute 100 instances of the query simultaneously
> on the master and all available worker nodes. Furthermore, in order to
> avoid generating the query plan 100 times on the master, we can do either
> of the two things:
>
>    1. Generate the query plan once on the master, serialize it and ship it
>    to the worker nodes.
>    2. Enable the worker nodes to access the Metastore so that they can
>    generate the query plan on their own in parallel.
>
> Given that making the query plan serializable (1) would require a lot of
> refactoring of the current code, is (2) a viable option? Moreover, since
> (2) will increase the load on the existing Metastore by 100x, is there any
> other option?
>
> Thanks,
> Sameer
>
> --
> Sameer Agarwal
> Computer Science | AMP Lab | UC Berkeley
> http://cs.berkeley.edu/~sameerag
>
