Thanks - I will try your suggestion. Do you know why there are so many more output records than input records on the main table (39x more)?
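(For context on the 39x figure: 531,850,722 map output records / 13,637,198 map input records is about 39.0. A plausible explanation, offered here as an assumption rather than something confirmed in the thread, is that the CSV bulk-load mapper emits one KeyValue per populated column per input row; a main table with roughly 39 populated columns would account for the blow-up, while an index table emitting one record per input row stays at 1:1, exactly as the index job's counters below show.)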
From: Ravi Kiran [mailto:[email protected]]
Sent: Thursday, April 02, 2015 2:35 PM
To: [email protected]
Subject: Re: bulk loader MR counters

Hi Ralph,

I assume that when you run the MR job for the main table you have a larger number of columns to load than the MR job for an index table does, which is why you see more spilled records. To tune the MR job for the main table, I would do the following first and then measure the counters again to see if there is any improvement:

a) To reduce the spilled records during the MR job for the main table, I would recommend increasing mapreduce.task.io.sort.mb to a value like 500 MB rather than the default 100 MB.
b) Raise mapreduce.task.io.sort.factor so that a higher number of streams are merged at once while sorting map output.

Regards,
Ravi
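For concreteness, a minimal sketch of how these two settings can be passed to the bulk loader on the command line; the jar name, table name, input path, and the sort factor of 64 are placeholders, not values from this thread. Because CsvBulkLoadTool runs through Hadoop's ToolRunner, the -D generic options should carry over to each of the MR jobs it submits:

    hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -Dmapreduce.task.io.sort.mb=500 \
        -Dmapreduce.task.io.sort.factor=64 \
        --table EXAMPLE_TABLE \
        --input /data/example.csv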
From: Perko, Ralph J
Sent: Thursday, April 02, 2015 2:36 PM
To: [email protected]
Subject: RE: bulk loader MR counters

My apologies, the formatting did not come out as planned. Here is another go:

Hi, we recently upgraded our cluster (Phoenix 4.3 - HDP 2.2) and I'm seeing a significant degradation in performance. I am going through the MR counters for a Phoenix CsvBulkLoad job and I am hoping you can help me understand some things. There is a base table with 4 index tables, so a total of 5 MR jobs run - one for each table.

Here are the counters for an index table MR job. Note two things: the input and output record counts are the same, as expected, and there seem to be a lot of spilled records.

===========================================================
Category,Map,Reduce,Total
Combine input records,0,0,0
Combine output records,0,0,0
CPU time spent (ms),1800380,156630,1957010
Failed Shuffles,0,0,0
GC time elapsed (ms),39738,1923,41661
Input split bytes,690,0,690
Map input records,13637198,0,13637198
Map output bytes,2144112474,0,2144112474
Map output materialized bytes,2171387170,0,2171387170
Map output records,13637198,0,13637198
Merged Map outputs,0,50,50
Physical memory (bytes) snapshot,8493744128,10708692992,19202437120
Reduce input groups,0,13637198,13637198
Reduce input records,0,13637198,13637198
Reduce output records,0,13637198,13637198
Reduce shuffle bytes,0,2171387170,2171387170
Shuffled Maps,0,50,50
Spilled Records,13637198,13637198,27274396
Total committed heap usage (bytes),11780751360,26862419968,38643171328
Virtual memory (bytes) snapshot,25903271936,96590065664,122493337600

Here are the counters for the main table MR job. Please note:
- the input records are correct - same as above
- the output records are many times the input
- the output bytes are many times the output from above
- the number of spilled records is many times the number of input records and twice the number of output records

===========================================================
Category,Map,Reduce,Total
Combine input records,0,0,0
Combine output records,0,0,0
CPU time spent (ms),5059340,2035910,7095250
Failed Shuffles,0,0,0
GC time elapsed (ms),38937,13748,52685
Input split bytes,690,0,690
Map input records,13637198,0,13637198
Map output bytes,59638106406,0,59638106406
Map output materialized bytes,60702718624,0,60702718624
Map output records,531850722,0,531850722
Merged Map outputs,0,50,50
Physical memory (bytes) snapshot,8398745600,2756530176,11155275776
Reduce input groups,0,13637198,13637198
Reduce input records,0,531850722,531850722
Reduce output records,0,531850722,531850722
Reduce shuffle bytes,0,60702718624,60702718624
Shuffled Maps,0,50,50
Spilled Records,1063701444,531850722,1595552166
Total committed heap usage (bytes),10136059904,19488309248,29624369152
Virtual memory (bytes) snapshot,25926946816,96562970624,122489917440

Is the large number of output records as opposed to input records normal? Is the large number of spilled records normal?

Thanks for your help,
Ralph
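A quick sanity check on the spill counters, under the assumption that Spilled Records counts each write of a record to local disk during map-side sorting and merging: for the main table job, 1,063,701,444 map-side spilled records is exactly 2 x 531,850,722 map output records, so on average every record was written to disk twice, whereas the index job spilled 13,637,198 records on the map side, exactly once per output record. Double-spilling like this usually points at a sort buffer (the default 100 MB mapreduce.task.io.sort.mb) that is too small for the map output volume, which is what the tuning suggested above targets.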
