Hi Chien,

4. From 50-150k per *second* to 100-150k per *minute*, as stated above, so reads went *DOWN* significantly. I think you must have misread.
I will take into account some of your other suggestions.

Thanks,
Colin

On Tue, Apr 12, 2016 at 8:19 PM, Chien Le <chie...@gmail.com> wrote:
> Some things I would look at:
> 1. Node statistics, on both the mapper and regionserver nodes. Make sure
> they're fully healthy (no disk issues, no half duplex, etc.) and that
> they're not already saturated by other jobs.
> 2. Is there a common regionserver behind the remaining mappers/regions?
> If so, try moving some regions off to spread the load.
> 3. Verify the locality of the region blocks to the regionserver. If you
> don't automate major compactions, or have moved regions recently, mapper
> locality might not help. Major compact if needed, or move regions if you
> can determine the source.
> 4. You mentioned that the requests per second went from 50-150k to
> 100-150k. Was that a typo? Did the read rate really increase?
> 5. You've listed the region sizes, but was that done with a cursory
> hadoop fs -du? Have you tried using the hfile analyzer to verify that the
> row counts and sizes are roughly the same?
> 6. Profile the mappers. If you can share the task counters for a
> completed task and a still-running task to compare, it might help find
> the issue.
> 7. I don't think you should underestimate the perf gains of node-local
> tasks vs. merely rack-local ones, especially if short-circuit reads are
> enabled. Unfortunately this is a big gamble given how long your tasks
> have been running already, so I'd look at this as a last resort.
>
> HTH,
> Chien
>
> On Tue, Apr 12, 2016 at 3:59 PM, Colin Kincaid Williams <disc...@uw.edu>
> wrote:
>
>> I've noticed that I've omitted
>>
>> scan.setCaching(500); // 1 is the default in Scan, which will
>> // be bad for MapReduce jobs
>> scan.setCacheBlocks(false); // don't set to true for MR jobs
>>
>> which appear to be suggestions from examples. Still, I am not sure
>> whether this explains the significant request slowdown on the final
>> 25% of the jobs.
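A minimal sketch of those settings in context, assuming the 0.98-era HBase client API (the value 500 is just the figure from the quoted examples, not a tuned number):

    import org.apache.hadoop.hbase.client.Scan;

    Scan fromScan = new Scan();
    // Ship 500 rows per RPC instead of the default of 1; the default makes
    // a full-table MapReduce scan extremely chatty.
    fromScan.setCaching(500);
    // Don't let a one-off full scan evict hot data from the regionserver
    // block cache.
    fromScan.setCacheBlocks(false);
    // Only needed if every stored version should be migrated: by default a
    // Scan returns just the newest version of each cell.
    fromScan.setMaxVersions();

The setMaxVersions() call also bears on the question inside the map() code further down: without it, each KeyValue in fromResult.raw() is the latest version of one column, not one entry per stored version of the row.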
>>
>> On Tue, Apr 12, 2016 at 10:36 PM, Colin Kincaid Williams <disc...@uw.edu>
>> wrote:
>> > Excuse my double post. I thought I had deleted my draft before
>> > constructing a cleaner, more detailed, more readable mail.
>> >
>> > On Tue, Apr 12, 2016 at 10:26 PM, Colin Kincaid Williams <disc...@uw.edu>
>> > wrote:
>> >> After trying to get help with distcp on the hadoop-user and cdh-user
>> >> mailing lists, I've given up on using distcp and exporttable to
>> >> migrate my hbase from .92.1 on cdh4.1.3 to .98 on cdh5.3.0.
>> >>
>> >> I've been working on an hbase map-reduce job to serialize my entries
>> >> and insert them into kafka. Then I plan to re-import them into
>> >> cdh5.3.0.
>> >>
>> >> Currently I'm having trouble with my map-reduce job. I have 43 maps:
>> >> 33 have finished successfully, and 10 are still running. I had
>> >> previously seen requests of 50-150k per second. Now, for the final 10
>> >> maps, I'm seeing 100-150k per minute.
>> >>
>> >> I might also mention that there were 6 failures near the application
>> >> start. Unfortunately, I cannot read the logs for these 6 failures;
>> >> there is an exception related to the yarn logging for these maps,
>> >> maybe because they failed to start.
>> >>
>> >> I had a look around HDFS. It appears that the regions are all between
>> >> 5-10GB. The longest completed map so far took 7 hours, with the
>> >> majority appearing to take around 3.5 hours.
>> >>
>> >> The remaining 10 maps have each been running between 23-27 hours.
>> >>
>> >> Considering data locality issues: 6 of the remaining jobs are running
>> >> on the same rack, and the other 4 are split between my other two
>> >> racks. There should currently be a replica on each rack, since the
>> >> replication factor appears to be set to 3, so I'm not sure this is
>> >> really the cause of the slowdown.
>> >>
>> >> So I'm looking for advice on how to troubleshoot my job. I'm setting
>> >> up my map job like:
>> >>
>> >> public static void main(String[] args) throws Exception {
>> >> ...
>> >> Scan fromScan = new Scan();
>> >> System.out.println(fromScan);
>> >> TableMapReduceUtil.initTableMapperJob(fromTableName, fromScan,
>> >> Map.class, null, null, job, true, TableInputFormat.class);
>> >>
>> >> // My guess is this controls the output type for the reduce function,
>> >> // based on setOutputKeyClass and setOutputValueClass from p. 27.
>> >> // Since there is no reduce step, this is currently null.
>> >> job.setOutputFormatClass(NullOutputFormat.class);
>> >> job.setNumReduceTasks(0);
>> >> job.submit();
>> >> ...
>> >> }
>> >>
>> >> I'm not performing a reduce step, and I'm traversing row keys like:
>> >>
>> >> public void map(final ImmutableBytesWritable fromRowKey,
>> >> Result fromResult, Context context) throws IOException {
>> >> ...
>> >> // Should I assume that each keyvalue is a version of the stored row?
>> >> for (KeyValue kv : fromResult.raw()) {
>> >> ADTreeMap.get(kv.getQualifier()).fakeLambda(messageBuilder,
>> >> kv.getValue());
>> >> // TODO: add a counter for each qualifier
>> >> }
>> >> ...
>> >> }
>> >>
>> >> I also have a list of simple questions.
>> >>
>> >> Has anybody experienced a significant slowdown in map jobs tied to a
>> >> portion of their hbase regions? If so, what issues did you come
>> >> across?
>> >>
>> >> Can I get a suggestion on how to show which map corresponds to which
>> >> region, so I can troubleshoot from there? Is this already logged
>> >> somewhere by default, or is there a way to set this up with
>> >> TableMapReduceUtil.initTableMapperJob?
>> >>
>> >> Any other suggestions would be appreciated.
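On the last question: one way to see which region each map task is scanning is to log the task's input split. A sketch, assuming the 0.98-era org.apache.hadoop.hbase.mapreduce API (the class name ExportMapper and the NullWritable output types are illustrative; the body of map() is elided):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;

    public class ExportMapper extends TableMapper<NullWritable, NullWritable> {

        @Override
        protected void setup(Context context) {
            // TableInputFormat creates one split per region, so the split
            // handed to this task names the region's row range and the
            // regionserver hosting it.
            TableSplit split = (TableSplit) context.getInputSplit();
            System.out.println("region location: " + split.getRegionLocation()
                + " startRow: " + Bytes.toStringBinary(split.getStartRow())
                + " endRow: " + Bytes.toStringBinary(split.getEndRow()));
        }

        @Override
        protected void map(ImmutableBytesWritable fromRowKey, Result fromResult,
                Context context) throws IOException {
            // ... existing serialization into kafka goes here ...
        }
    }

The line printed from setup() lands in each task attempt's stdout log, so the slow tasks can be matched to their regions and regionservers; that feeds directly into Chien's suggestions 2 and 3 (shared regionserver, block locality).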