You can write a custom InputFormat whose #getSplits(...) returns your required InputSplit objects (with randomised offsets + lengths, etc.).
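For example, here is a minimal (untested) sketch that subclasses TextInputFormat, lets the parent compute the usual one-split-per-block list, and then keeps only a random subset of those splits. The "num.random.splits" configuration key is a made-up name for illustration, not a standard Hadoop property:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: keep a random subset of the block-aligned splits the
// parent FileInputFormat would normally produce.
public class RandomSubsetInputFormat extends TextInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Parent computes the usual one-split-per-block list.
        List<InputSplit> all = super.getSplits(context);

        // "num.random.splits" is a hypothetical key; default to 2 splits.
        int wanted = context.getConfiguration().getInt("num.random.splits", 2);
        wanted = Math.min(wanted, all.size());

        // Shuffle and return only the first 'wanted' splits.
        List<InputSplit> shuffled = new ArrayList<>(all);
        Collections.shuffle(shuffled);
        return shuffled.subList(0, wanted);
    }
}

You would then wire it into the job driver with job.setInputFormatClass(RandomSubsetInputFormat.class) and set num.random.splits to the number of blocks you want processed. If you need randomised offsets/lengths rather than whole blocks, construct FileSplit objects yourself inside getSplits(...) instead of delegating to the parent.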
On Fri, Feb 7, 2014 at 9:50 PM, Suresh S <suresh...@gmail.com> wrote:
> Dear Friends,
>
> I have a very large file in HDFS with 3000+ blocks.
>
> I want to run a job with various input sizes, using the same file as
> the input. Usually the number of tasks equals the number of
> blocks/splits. Suppose a job with 2 tasks needs to process any two
> randomly chosen blocks of the given input file.
>
> How do I give a random set of HDFS blocks as the input of a job?
>
> Note: my aim is not to process the input file to produce some output.
> I want to replicate individual blocks based on the load.
>
> *Regards*
> *S.Suresh,*
> *Research Scholar,*
> *Department of Computer Applications,*
> *National Institute of Technology,*
> *Tiruchirappalli - 620015.*
> *+91-9941506562*

--
Harsh J