I don't know much about statistics, but in your case, duplicating splits (x100?) with a custom InputFormat might be much simpler.
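The split-duplication idea might look roughly like this (a minimal stdlib-only Python sketch, not actual Hadoop InputFormat code; `duplicate_splits`, `base_splits`, and `NUM_RESAMPLES` are illustrative names):

```python
# Hypothetical sketch of the "duplicate each split" idea: instead of
# planning the query 100 times, emit each input split 100 times, each
# copy tagged with a resample id, so a single job covers all resamples.
NUM_RESAMPLES = 100  # illustrative; the bootstrap typically uses >= 100

def duplicate_splits(base_splits):
    """Return each split NUM_RESAMPLES times, tagged with its resample id."""
    return [(split, resample_id)
            for split in base_splits
            for resample_id in range(NUM_RESAMPLES)]

splits = duplicate_splits(["part-00000", "part-00001"])
print(len(splits))  # 2 splits x 100 resamples = 200 tagged splits
```

A real custom InputFormat would do the same thing in `getSplits()`, returning each underlying split once per resample.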
2013/9/6 Sameer Agarwal <samee...@cs.berkeley.edu>

> Hi All,
>
> In order to support approximate queries in Hive and BlinkDB (
> http://blinkdb.org/), I am working towards implementing the bootstrap
> primitive (http://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in Hive
> that can help us quantify the "error" incurred by a query Q when it
> operates on a small sample S of data. This method essentially requires
> launching the query Q simultaneously on a large number of samples of the
> original data (typically >= 100).
>
> The downside to this is of course that we have to launch the same query
> 100 times, but the upside is that each of these queries would be so small
> that it could be executed on a single machine. So, in order to do this
> efficiently, we would ideally like to execute 100 instances of the query
> simultaneously on the master and all available worker nodes. Furthermore,
> in order to avoid generating the query plan 100 times on the master, we
> can do either of two things:
>
> 1. Generate the query plan once on the master, serialize it and ship it
> to the worker nodes.
> 2. Enable the worker nodes to access the Metastore so that they can
> generate the query plan on their own in parallel.
>
> Given that making the query plan serializable (1) would require a lot of
> refactoring of the current code, is (2) a viable option? Moreover, since
> (2) would increase the load on the existing Metastore by 100x, is there
> any other option?
>
> Thanks,
> Sameer
>
> --
> Sameer Agarwal
> Computer Science | AMP Lab | UC Berkeley
> http://cs.berkeley.edu/~sameerag
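For context, the bootstrap primitive described in the quoted mail can be sketched as follows (a stdlib-only illustration; `bootstrap_error` and the mean aggregate are hypothetical stand-ins for a real Hive query Q running over a sample S):

```python
import random

# Minimal bootstrap sketch: estimate the error of an aggregate Q computed
# on a small sample S by re-running Q on many resamples of S drawn with
# replacement, then taking the spread of those estimates.
def bootstrap_error(sample, q, num_resamples=100, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    estimates = [q([rng.choice(sample) for _ in sample])
                 for _ in range(num_resamples)]
    mean = sum(estimates) / len(estimates)
    variance = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return variance ** 0.5  # bootstrap standard error of Q(S)

sample = [random.Random(1).gauss(10, 2) for _ in range(50)]
q = lambda xs: sum(xs) / len(xs)  # the "query": a simple mean
print(bootstrap_error(sample, q))
```

Each of the `num_resamples` evaluations of `q` is independent, which is exactly why the 100 query instances in the proposal parallelize so naturally across workers.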