When we load data into Hive, we sometimes run into situations where
the load fails and the logs show a heap out-of-memory error. If I load
just a few days (or months) of data, there is no problem. But if I try
to load two years of data, for example, I've seen it fail. This doesn't
happen with every feed, only with certain ones.
Sometimes I've been able to split the data and get it to load. One
example of a feed I'm working on is the Apache web server access logs.
Generally it works, but there are times when I need to load more than
a few months of data, and then I get heap memory errors in the task
logs.
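To make the splitting concrete, here is a rough Python sketch of the
idea: break a long date range into one batch per calendar month and
load each batch separately. The function name month_ranges and the
example dates are made up for illustration, and the actual load call
is left out.

```python
from datetime import date, timedelta


def month_ranges(start, end):
    """Yield (first_day, last_day) pairs covering start..end inclusive,
    one pair per calendar month."""
    current = start
    while current <= end:
        # Work out the first day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # Clip the last batch so it never runs past the requested end date.
        last_day = min(next_month - timedelta(days=1), end)
        yield current, last_day
        current = next_month


# Two years of data become 24 monthly batches (example dates only).
batches = list(month_ranges(date(2010, 1, 1), date(2011, 12, 31)))
```

Each pair in batches would then drive one smaller load instead of a
single two-year load.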
Generally, how do people load their data into Hive? We have a process
where we first copy the data to HDFS and then run a staging process to
get it into Hive. Once that completes, we perform a UNION ALL and then
overwrite the table partition. It's usually during the UNION ALL stage
that these errors appear.
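For reference, the last step looks roughly like the statement built
below. This is only a sketch: the table names, partition column and
column list are placeholders rather than our real schema, and it
assumes the UNION ALL combines the existing partition with the staged
rows for that same partition.

```python
def build_merge_statement(target, staging, partition_col, partition_value, columns):
    """Build a HiveQL statement that unions the existing partition with the
    staged rows and overwrites that partition with the result.

    All names passed in are placeholders for illustration. `columns` should
    list the non-partition columns only, since the partition value is given
    statically in the PARTITION clause.
    """
    cols = ", ".join(columns)
    where = f"{partition_col}='{partition_value}'"
    return (
        f"INSERT OVERWRITE TABLE {target} "
        f"PARTITION ({where}) "
        f"SELECT {cols} FROM ("
        f"SELECT {cols} FROM {target} WHERE {where} "
        f"UNION ALL "
        f"SELECT {cols} FROM {staging} WHERE {where}"
        f") merged"
    )


# Hypothetical table and column names, for illustration only.
statement = build_merge_statement(
    target="access_logs",
    staging="access_logs_staging",
    partition_col="dt",
    partition_value="2011-01",
    columns=["host", "request", "status"],
)
```

Running one such statement per partition, rather than one statement
over the whole date range, is the kind of split I mentioned above.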
Also, is there a log that tells you which input file a task fails on?
I can see which task/job failed, but I can't find which file it's
complaining about. I figure that might help a bit.
Thanks!
- hadoop/hive data loading hadoopman