This is probably a simple question, but I'm noticing that for queries that run on 1+ TB of data, Hive can take up to 30 minutes to actually start the first map-reduce stage. What is it doing during that time? I imagine it's gathering information about the data somehow, since this 'startup' time is clearly a function of the amount of data I'm trying to process.
Cheers,