Yeah, turning off the WAL would have been my next suggestion. Apart from that, Ganglia is really easy to set up - you might want to consider getting used to it now :)
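For reference, disabling the WAL is done per Put via Put.setWriteToWAL(false). A minimal sketch against the 0.20.x client API (the table, family and column names below are made up):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NoWalPut {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "testtable");
        Put put = new Put(Bytes.toBytes("row-000000001"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"),
            Bytes.toBytes("some value"));
        // Skip the write-ahead log for this edit: faster bulk writes, but
        // edits still only in the memstore are gone if the region server dies.
        put.setWriteToWAL(false);
        table.put(put);
        table.close();
      }
    }

The obvious trade-off: edits that have not been flushed are lost on a region server crash, so this really only suits loads you can re-run from scratch.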
On Fri, Nov 19, 2010 at 4:29 PM, Henning Blohm <henning.bl...@zfabrik.de> wrote:
> Hi Lars,
>
> we do not have anything like Ganglia up. Unfortunately.
>
> I use regular puts with autoflush turned off, with a buffer of 4 MB
> (could be bigger, right?). We write to the WAL.
>
> I flush every 1000 recs.
>
> I will try again - maybe over the weekend - and see if I can find out
> more.
>
> Thanks,
> Henning
>
> On Friday, 19.11.2010, at 16:16 +0100, Lars George wrote:
>
>> Hi Henning,
>>
>> And yes, what you have seen is often difficult to explain. What I
>> listed are the obvious contenders. But ideally you would do a post
>> mortem on the master and slave logs for Hadoop and HBase, since that
>> would give you better insight into the events. For example, when did
>> the system start to flush, when did it start compacting, when did
>> HDFS start to go slow? And so on. One thing that I would highly
>> recommend (if you haven't done so already) is getting graphing going.
>> Use the built-in Ganglia support and you may at least be able to
>> determine the overall load on the system and various metrics of
>> Hadoop and HBase.
>>
>> Did you use normal Puts or did you set it to cache Puts and write
>> them in bulk? See HTable.setWriteBufferSize() and
>> HTable.setAutoFlush() for details (but please note that you then do
>> need to call HTable.flushCommits() in the close() method of the
>> mapper class). That helps a lot in speeding up writes.
>>
>> Lars
>>
>> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <henning.bl...@zfabrik.de> wrote:
>> > Hi Lars,
>> >
>> > thanks. Yes, this is just the first test setup. Eventually the data
>> > load will be significantly higher.
>> >
>> > At the moment (looking at the master after the run) the number of
>> > regions is well-distributed (684, 685, 685 regions). The overall
>> > HDFS use is ~700 G (replication factor is 3, btw).
>> >
>> > I will want to upgrade as soon as that makes sense. It seems there
>> > is no "release" after 0.20.6 - that's why we are still with that
>> > one.
>> >
>> > When I do that run again, I will check the master UI and see how
>> > things develop there. As for the current run: I do not expect to
>> > get stable numbers early in the run. What looked suspicious was
>> > that things got gradually worse until well into 30 hours after the
>> > start of the run and then even got better. An unexpected load
>> > behavior for me (I would have expected early changes but then some
>> > stable behavior up to the end).
>> >
>> > Thanks,
>> > Henning
>> >
>> > On Friday, 19.11.2010, at 15:21 +0100, Lars George wrote:
>> >
>> >> Hi Henning,
>> >>
>> >> Could you look at the master UI while doing the import? The issue
>> >> with a cold bulk import is that you are hitting one region server
>> >> initially, and while it is filling up its in-memory structures all
>> >> is nice and dandy. Then you start to tax the server as it has to
>> >> flush data out, and it becomes slower at responding to the mappers
>> >> still hammering it. Only after a while do the regions become large
>> >> enough that they get split and load starts to spread across 2
>> >> machines, then 3. Eventually you have enough regions to handle
>> >> your data and you will see an average of the performance you could
>> >> expect from a loaded cluster. For that reason we have added a bulk
>> >> loading feature that helps build the region files externally and
>> >> then swaps them in.
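The buffered-write pattern Lars describes above - setAutoFlush(false), a write buffer, and flushCommits() in close() - would look roughly like this in an old-style (mapred API) mapper. A sketch only, with made-up table, family and column names:

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class BufferedPutMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private HTable table;

      public void configure(JobConf job) {
        try {
          table = new HTable(new HBaseConfiguration(job), "testtable");
          table.setAutoFlush(false);                 // buffer Puts client-side
          table.setWriteBufferSize(4 * 1024 * 1024); // 4 MB, as in Henning's setup
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public void map(LongWritable key, Text row,
          OutputCollector<NullWritable, NullWritable> out, Reporter rep)
          throws IOException {
        Put put = new Put(Bytes.toBytes(row.toString()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"),
            Bytes.toBytes("value"));
        table.put(put); // queued in the write buffer, not sent per call
      }

      public void close() throws IOException {
        table.flushCommits(); // push whatever is still buffered
        table.close();
      }
    }

Without the flushCommits() in close(), rows still sitting in the client-side buffer when the task ends may never make it to the server.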
>> >>
>> >> When you check the UI you can actually see this behavior, as the
>> >> operations per second (ops) are bound to one server initially.
>> >> Well, it could be two, as one of them also has to serve META. If
>> >> that is the same machine then you are penalized twice.
>> >>
>> >> In addition you start to run into minor compaction load while
>> >> HBase tries to do housekeeping during your load.
>> >>
>> >> With 0.89 you could pre-split the regions into what you see
>> >> eventually when your job is complete. Please use the UI to check
>> >> and let us know how many regions you end up with in total (out of
>> >> interest mainly). If you had that done before the import then the
>> >> load is split right from the start.
>> >>
>> >> In general 0.89 is much better performance-wise when it comes to
>> >> bulk loads, so you may want to try it out as well. The 0.90 RC is
>> >> up, so a release is imminent and saves you from having to upgrade
>> >> again soon. Also, 0.90 is the first release with Hadoop's append
>> >> fix, so you do not lose any data from wonky server behavior.
>> >>
>> >> And to wrap this up, 3 data nodes is not too great. If you ask
>> >> anyone with a serious production-type setup you will see 10
>> >> machines and more, I'd say 20-30 machines and up. Some would say
>> >> "use MySQL for this little data", but that is not fair given that
>> >> we do not know what your targets are. Bottom line is, you will see
>> >> issues (like slowness) with 3 nodes that 8 or 10 nodes will never
>> >> show.
>> >>
>> >> HTH,
>> >> Lars
>> >>
>> >> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <henning.bl...@zfabrik.de> wrote:
>> >> > We have a Hadoop 0.20.2 + HBase 0.20.6 setup with three data
>> >> > nodes (12 GB, 1.5 TB each) and one master node (24 GB, 1.5 TB).
>> >> > We store a relatively simple table in HBase (1 column family,
>> >> > 5 columns, row key about 100 chars).
>> >> >
>> >> > In order to better understand the load behavior, I wanted to put
>> >> > 5*10^8 rows into that table. I wrote an M/R job that uses a split
>> >> > InputFormat to split the 5*10^8 logical row keys (essentially
>> >> > just counting from 0 to 5*10^8-1) into 1000 chunks of 500000 keys
>> >> > each and then lets the map do the actual job of writing the
>> >> > corresponding rows (with some random column values) into HBase.
>> >> >
>> >> > So there are 1000 map tasks and no reducer. Each task writes
>> >> > 500000 rows into HBase. We have 6 mapper slots, i.e. 24 mappers
>> >> > running in parallel.
>> >> >
>> >> > The whole job runs for approx. 48 hours. Initially the map tasks
>> >> > need around 30 min. each. After a while things take longer and
>> >> > longer, eventually reaching > 2 h. It peaks around the 850th
>> >> > task, after which things speed up again, improving to about
>> >> > 48 min. in the end, until completed.
>> >> >
>> >> > These are all dedicated machines and there is nothing else
>> >> > running. The map tasks have 200 MB of heap, and when checking
>> >> > with vmstat in between I cannot observe any swapping.
>> >> >
>> >> > Also, on the master it seems that heap utilization is not at the
>> >> > limit, and there is no swapping either. All Hadoop and HBase
>> >> > processes have 1 GB of heap.
>> >> >
>> >> > Any idea what would cause the strong variation (or degradation)
>> >> > of write performance? Is there a way of finding out where the
>> >> > time gets lost?
>> >> >
>> >> > Thanks,
>> >> > Henning
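And since pre-splitting came up above: with the 0.89/0.90 client API you can create the table with explicit split keys so the load spreads across all region servers from the first Put. A rough sketch - the zero-padded numeric split points are an assumption about the row key format, not something stated in the thread:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create(); // 0.89/0.90 style
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("testtable");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Pre-create, say, 100 regions so splits do not have to happen
        // one server at a time during the import.
        int numRegions = 100;
        long totalRows = 500000000L; // 5*10^8 rows, as in the test
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
          // Evenly spaced, zero-padded row counters as split points
          // (assumes keys sort like zero-padded decimal numbers).
          splits[i - 1] = Bytes.toBytes(
              String.format("%09d", i * (totalRows / numRegions)));
        }
        admin.createTable(desc, splits);
      }
    }

A sensible number of regions to pre-create is whatever a full load ends up with - around 2000 in Henning's case (684 + 685 + 685), though even a coarser pre-split avoids the worst of the single-server hot-spotting.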