When we load data into Hive, we sometimes run into situations where the load fails and the task logs show a heap out-of-memory error. If I load just a few days (or even a few months) of data, there's no problem, but if I try to load two years' worth (for example), I've seen it fail. It doesn't happen with every feed, only certain ones.
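
Is raising the task JVM heap even the right knob for this? Something along these lines at the start of the Hive session is what I had in mind (the -Xmx value is just illustrative):

    -- Raise the heap for the child map/reduce task JVMs
    SET mapred.child.java.opts=-Xmx1024m;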

Sometimes I've been able to work around it by splitting the data and loading it in smaller batches. One example of a feed I'm working on is Apache web server access logs. Generally it loads fine, but there are times when I need to load more than a few months of data at once, and then I get the heap errors in the task logs.
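
For context, the staging table for the access logs looks roughly like the standard RegexSerDe example (the column names and regex here are illustrative, not our exact DDL):

    CREATE TABLE access_logs_stage (
      host STRING,
      identity STRING,
      remote_user STRING,
      request_time STRING,
      request STRING,
      status STRING,
      size STRING)
    -- RegexSerDe parses each raw log line with the regex below
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)'
    )
    STORED AS TEXTFILE;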

Generally, how do people load their data into Hive? Our process is to first copy the raw files to HDFS and then run a staging step to get them into a Hive staging table. Once that completes, we UNION ALL the staged data with the existing data and overwrite the target table partition. It's usually during the UNION ALL stage that these errors appear.
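
To make that concrete, the flow looks roughly like this (the table, column, and partition values are made up for illustration, reusing the staging table sketched above):

    -- Pull the raw files (already copied to HDFS) into the staging table
    LOAD DATA INPATH '/staging/access_logs/2012-06-01' INTO TABLE access_logs_stage;

    -- UNION ALL the staged rows with the existing partition, then
    -- overwrite that partition with the combined result
    INSERT OVERWRITE TABLE access_logs PARTITION (ds='2012-06-01')
    SELECT host, identity, remote_user, request_time, request, status, size
    FROM (
      SELECT host, identity, remote_user, request_time, request, status, size
      FROM access_logs WHERE ds='2012-06-01'
      UNION ALL
      SELECT host, identity, remote_user, request_time, request, status, size
      FROM access_logs_stage
    ) u;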

Also, is there a log that tells you which file it fails on? I can see which task/job failed, but I haven't been able to find which file it's complaining about. I figure that might help a bit.

Thanks!
