Hi,

Before raising this question I searched the relevant topics. One common suggestion online is:

"Mappers: Output all qualifying values, each with a random integer key. Single reducer: Output the first N values, throwing away the keys."

However, this scheme seems inefficient when the data set is very large, for example sampling 100 records out of one billion: every record is shuffled to a single reducer. Things are especially bad when the map task is computationally demanding.

I tried to write a program that does the sampling inside the mappers, but I ended up storing everything in memory and performing the final sampling in Mapper.cleanup(). That still doesn't seem like a graceful way to do it, because it requires a lot of memory. Maybe a better way is to control the random sample at the file.split() stage — is there a good existing approach?

Best,
Shi
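For what it's worth, the in-mapper approach can be done in O(N) memory instead of buffering everything, using reservoir sampling (Vitter's Algorithm R): each mapper keeps a reservoir of N items while streaming records, then emits its reservoir in cleanup(); a single reducer then reservoir-samples again over the mappers' outputs to get the final N. Below is a minimal, hedged sketch of the core sampler as a plain Java class (the class and method names are illustrative, not part of any Hadoop API; you would call add() from map() and sample() from cleanup()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling (Algorithm R): maintains a uniform random sample
// of size n from a stream, using O(n) memory regardless of stream length.
public class ReservoirSampler {
    private final int n;
    private final List<String> reservoir;
    private final Random rng;
    private long seen = 0;  // number of records observed so far

    public ReservoirSampler(int n, long seed) {
        this.n = n;
        this.reservoir = new ArrayList<>(n);
        this.rng = new Random(seed);
    }

    // Called once per record (e.g. from Mapper.map()).
    public void add(String value) {
        seen++;
        if (reservoir.size() < n) {
            // Fill the reservoir until it holds n items.
            reservoir.add(value);
        } else {
            // Keep the new record with probability n/seen, replacing
            // a uniformly chosen slot; this preserves uniformity.
            long j = (long) (rng.nextDouble() * seen);
            if (j < n) {
                reservoir.set((int) j, value);
            }
        }
    }

    // Emit the current sample (e.g. from Mapper.cleanup()).
    public List<String> sample() {
        return reservoir;
    }
}
```

With M mappers, the reducer only sees M*N candidate values rather than the whole data set, which should address the shuffle cost you describe. One caveat: to make the second-stage sample exactly uniform when mappers process different numbers of records, each mapper should also emit its record count so the reducer can weight the merge accordingly.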