Hi Lars,

  we do not have anything like Ganglia up, unfortunately.

  I use regular Puts with autoflush turned off and a write buffer of 4 MB
(could be bigger, right?). We do write to the WAL.

  I flush every 1000 records.
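
  In case it is relevant, the write path in each mapper looks roughly like
this (a simplified sketch; "testtable", "cf" and the helper methods
makeRowKey()/randomValue() are just placeholders for what we actually use):

    // uses org.apache.hadoop.hbase.client.{HTable, Put},
    // org.apache.hadoop.hbase.HBaseConfiguration and
    // org.apache.hadoop.hbase.util.Bytes
    HTable table = new HTable(new HBaseConfiguration(), "testtable");
    table.setAutoFlush(false);                  // cache Puts on the client
    table.setWriteBufferSize(4 * 1024 * 1024);  // 4 MB write buffer

    int count = 0;
    for (long i = start; i < end; i++) {        // key range of this map task
      Put put = new Put(Bytes.toBytes(makeRowKey(i)));   // placeholder helper
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("c1"),
          Bytes.toBytes(randomValue()));        // 5 columns in the real job
      table.put(put);                           // WAL stays on (the default)
      if (++count % 1000 == 0) {
        table.flushCommits();                   // flush every 1000 records
      }
    }
    table.flushCommits();                       // final flush at the end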

  I will try again - maybe over the weekend - and see if I can find out
more.

Thanks,
  Henning

On Friday, 19.11.2010, at 16:16 +0100, Lars George wrote:

> Hi Henning,
> 
> What you have seen is often difficult to explain; what I
> listed are the obvious contenders. But ideally you would do a post
> mortem on the master and slave logs for Hadoop and HBase, since that
> would give you better insight into the sequence of events. For example,
> when did the system start to flush, when did it start compacting, when
> did HDFS start to slow down? And so on. One thing that I would highly
> recommend (if you haven't done so already) is to get graphing going.
> Use the built-in Ganglia support and you should at least be able to
> determine the overall load on the system and various metrics of Hadoop
> and HBase.
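> 
> If it helps: with 0.20.x that is (roughly, from memory, so please double
> check the context class and port against your Hadoop/Ganglia versions) a
> few lines in HBase's conf/hadoop-metrics.properties on every node,
> pointing at your gmond:
> 
>   hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
>   hbase.period=10
>   hbase.servers=your-gmond-host:8649
>   jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
>   jvm.period=10
>   jvm.servers=your-gmond-host:8649
>   rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
>   rpc.period=10
>   rpc.servers=your-gmond-host:8649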
> 
> Did you use plain Puts, or did you set the client to cache Puts and
> write them out in bulk? See HTable.setWriteBufferSize() and
> HTable.setAutoFlush() for details (but please note that you then do
> need to call HTable.flushCommits() in the close() method of your
> mapper class). That helps a lot with speeding up the writes.
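> 
> In code that is roughly the following - just a sketch using the old
> mapred API with placeholder table and column names, so adapt it to your
> job:
> 
>   import java.io.IOException;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.client.HTable;
>   import org.apache.hadoop.hbase.client.Put;
>   import org.apache.hadoop.hbase.util.Bytes;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.NullWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapred.JobConf;
>   import org.apache.hadoop.mapred.MapReduceBase;
>   import org.apache.hadoop.mapred.Mapper;
>   import org.apache.hadoop.mapred.OutputCollector;
>   import org.apache.hadoop.mapred.Reporter;
> 
>   public class BufferedLoadMapper extends MapReduceBase
>       implements Mapper<LongWritable, Text, NullWritable, NullWritable> {
> 
>     private HTable table;
> 
>     public void configure(JobConf job) {
>       try {
>         table = new HTable(new HBaseConfiguration(job), "testtable");
>         table.setAutoFlush(false);                  // cache Puts client side
>         table.setWriteBufferSize(12 * 1024 * 1024); // e.g. 12 MB
>       } catch (IOException e) {
>         throw new RuntimeException(e);
>       }
>     }
> 
>     public void map(LongWritable key, Text value,
>         OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
>         throws IOException {
>       Put put = new Put(Bytes.toBytes(value.toString()));
>       put.add(Bytes.toBytes("cf"), Bytes.toBytes("c1"), Bytes.toBytes("v"));
>       table.put(put);        // only goes into the client-side buffer here
>     }
> 
>     public void close() throws IOException {
>       table.flushCommits();  // without this the last buffered Puts are lost
>     }
>   }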
> 
> Lars
> 
> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <henning.bl...@zfabrik.de> 
> wrote:
> > Hi Lars,
> >
> >  thanks. Yes, this is just the first test setup. Eventually the data
> > load will be significantly higher.
> >
> > At the moment (looking at the master after the run) the regions are
> > well distributed across the region servers (684, 685, and 685 regions).
> > The overall HDFS usage is ~700 GB (replication factor is 3, btw).
> >
> > I will want to upgrade as soon as that makes sense. It seems there is
> > no release after 0.20.6 - that's why we are still on that one.
> >
> > When I do that run again, I will check the master UI and see how things
> > develop there. As for the current run: I do not expect to get stable
> > numbers early in the run. What looked suspicious was that things got
> > gradually worse until well over 30 hours after the start of the run and
> > then actually got better again. That is unexpected load behavior to me
> > (I would have expected changes early on, but then stable behavior up to
> > the end).
> >
> > Thanks,
> >  Henning
> >
> > On Friday, 19.11.2010, at 15:21 +0100, Lars George wrote:
> >
> >> Hi Henning,
> >>
> >> Could you look at the Master UI while doing the import? The issue with
> >> a cold bulk import is that you are hitting one region server
> >> initially, and while it is filling up its in-memory structures all is
> >> nice and dandy. Then you start to tax the server as it has to flush
> >> data out, and it becomes slower at responding to the mappers that are
> >> still hammering it. Only after a while do the regions become large
> >> enough to be split, and the load starts to spread across 2 machines,
> >> then 3. Eventually you have enough regions to handle your data and you
> >> will see an average of the performance you could expect from a loaded
> >> cluster. For that reason we have added a bulk loading feature that
> >> helps build the region files externally and then swap them in.
> >>
> >> When you check the UI you can actually see this behavior, as the
> >> operations per second (ops) are bound to one server initially. Well,
> >> it could be two, as one of them also has to serve META. If that is
> >> the same machine then you are penalized twice.
> >>
> >> In addition you start to run into minor compaction load while HBase
> >> tries to do housekeeping during your load.
> >>
> >> With 0.89 you could pre-split the table into the regions you
> >> eventually see when your job is complete. Please use the UI to check,
> >> and let us know how many regions you end up with in total (mainly out
> >> of interest). If you do that before the import, the load is spread
> >> right from the start.
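> >>
> >> As a sketch (0.89/0.90 client API from memory, so please check your
> >> exact version; the split points below are just zero-padded counters
> >> standing in for whatever your real row key format is):
> >>
> >>   // exception handling omitted
> >>   HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
> >>   HTableDescriptor desc = new HTableDescriptor("testtable");
> >>   desc.addFamily(new HColumnDescriptor("cf"));
> >>   int numRegions = 100;                  // whatever you end up with
> >>   byte[][] splits = new byte[numRegions - 1][];
> >>   for (int i = 1; i < numRegions; i++) {
> >>     long boundary = i * (500000000L / numRegions);
> >>     splits[i - 1] = Bytes.toBytes(String.format("%09d", boundary));
> >>   }
> >>   admin.createTable(desc, splits);       // table starts out pre-split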
> >>
> >> In general 0.89 is much better performance-wise when it comes to bulk
> >> loads, so you may want to try it out as well. The 0.90 RC is up, so a
> >> release is imminent, and going to it now saves you from having to
> >> upgrade again soon. Also, 0.90 is the first release that supports
> >> Hadoop's append fix, so you do not lose any data from wonky server
> >> behavior.
> >>
> >> And to wrap this up, 3 data nodes is not that great. If you ask anyone
> >> with a serious production-type setup you will see 10 machines and
> >> more, I'd say 20-30 machines and up. Some would say "use MySQL for
> >> this little data", but that is not fair given that we do not know what
> >> your targets are. The bottom line is, you will see issues (like
> >> slowness) with 3 nodes that 8 or 10 nodes will never show.
> >>
> >> HTH,
> >> Lars
> >>
> >>
> >> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <henning.bl...@zfabrik.de> 
> >> wrote:
> >> > We have a Hadoop 0.20.2 + HBase 0.20.6 setup with three data nodes
> >> > (12 GB, 1.5 TB each) and one master node (24 GB, 1.5 TB). We store a
> >> > relatively simple table in HBase (1 column family, 5 columns, row key
> >> > about 100 chars).
> >> >
> >> > In order to better understand the load behavior, I wanted to put 5*10^8
> >> > rows into that table. I wrote an M/R job that uses a split input format
> >> > to divide the 5*10^8 logical row keys (essentially just counting from 0
> >> > to 5*10^8-1) into 1000 chunks of 500,000 keys each, and then lets the
> >> > map tasks do the actual job of writing the corresponding rows (with
> >> > some random column values) into HBase.
> >> >
> >> > So there are 1000 map tasks and no reducer. Each task writes 500,000
> >> > rows into HBase. We have 6 mapper slots, i.e. 24 mappers running in
> >> > parallel.
> >> >
> >> > The whole job runs for approx. 48 hours. Initially the map tasks need
> >> > around 30 min. each. After a while things take longer and longer,
> >> > eventually reaching > 2 h. It tops out around the 850th task, after
> >> > which things speed up again, improving to about 48 min. towards the
> >> > end, until the job completes.
> >> >
> >> > These are all dedicated machines and there is nothing else running. The
> >> > map tasks have 200 MB heap, and when checking with vmstat in between I
> >> > cannot observe any swapping.
> >> >
> >> > Also, on the master it seems that heap utilization is not at the limit,
> >> > and there is no swapping either. All Hadoop and HBase processes have
> >> > 1 GB heap.
> >> >
> >> > Any idea what would cause the strong variation (or degradation) of write
> >> > performance?
> >> > Is there a way of finding out where time gets lost?
> >> >
> >> > Thanks,
> >> >  Henning
> >> >
> >> >
> >
