Hi,

Before raising this question I searched the relevant topics. One
suggestion I found online is:

"Mappers: Output all qualifying values, each with a random 
integer key.

Single reducer: Output the first N values, throwing away the 
keys."

However, this scheme seems inefficient when the data set is very
large, for example when sampling 100 records out of one billion:
every record gets shuffled to the single reducer. Things are
especially bad when the map task is computationally demanding. I
tried to write a program that does the sampling in the mappers, but
I ended up storing everything in memory and doing the final sampling
in Mapper.cleanup(). That still does not seem like a graceful way to
do it, because it requires a lot of memory. Maybe a better way would
be to control the random sample at the stage where the file is split
into input splits; does a good approach for that already exist?
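
For what it is worth, a bounded-memory variant I have been
considering (a sketch only, same assumed names and N = 100,
untested) is to keep in each mapper just the N records with the
smallest random keys, using a fixed-size heap, instead of buffering
everything until cleanup():

import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BoundedSampleMapper
        extends Mapper<LongWritable, Text, DoubleWritable, Text> {

    private static class Candidate {
        final double key;
        final Text value;
        Candidate(double key, Text value) {
            this.key = key;
            this.value = value;
        }
    }

    private static final int N = 100;  // sample size (assumed)
    private final Random rng = new Random();
    // Max-heap on the random key: the root is the easiest to evict.
    private final PriorityQueue<Candidate> best = new PriorityQueue<>(
            Comparator.comparingDouble((Candidate c) -> c.key).reversed());

    @Override
    protected void map(LongWritable offset, Text record, Context context) {
        double k = rng.nextDouble();  // i.i.d. random key per record
        if (best.size() < N) {
            // Copy: Hadoop reuses the Text object between map() calls.
            best.add(new Candidate(k, new Text(record)));
        } else if (k < best.peek().key) {
            best.poll();  // evict the current largest key
            best.add(new Candidate(k, new Text(record)));
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit at most N candidates, keyed by their random number; a
        // single reducer keeps the N globally smallest keys as before.
        for (Candidate c : best) {
            context.write(new DoubleWritable(c.key), c.value);
        }
    }
}

Each mapper then holds and shuffles at most N records, and the same
first-N reducer (with its key type changed to DoubleWritable) still
produces a uniform sample, since the N globally smallest of the
i.i.d. random keys mark a uniform subset. It still reads every
record once, though, which is why I am asking about the split stage.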

Best,

Shi
