I just started YCSB on our real cluster:

    ./bin/ycsb load hbase -P workloads/load20m -p columnfamily=family -s -threads 100 -target 15000
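For context, since workloads/load20m itself isn't shown anywhere in this thread: a minimal sketch of what a YCSB load-phase workload file like it typically contains. The property names are standard YCSB CoreWorkload settings; the values are assumptions (recordcount guessed from the "20m" in the name), not the actual file:

    # Hypothetical reconstruction of workloads/load20m -- not the real file.
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    recordcount=20000000
    fieldcount=10
    fieldlength=100
    insertorder=hashed

With -target 15000 on the command line, YCSB throttles the 100 client threads to roughly 15K inserts/sec in aggregate, which is why the steady intervals below sit right at ~15000 ops/sec.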
This is the beginning of the YCSB log from the real grid run (it looks even worse than on AWS):

    10 sec: 150231 operations; 14984.14 current ops/sec; [INSERT AverageLatency(us)=448.53]
    20 sec: 300299 operations; 14999.3 current ops/sec; [INSERT AverageLatency(us)=16.21]
    30 sec: 450345 operations; 14998.6 current ops/sec; [INSERT AverageLatency(us)=21.65]
    40 sec: 468100 operations; 1775.14 current ops/sec; [INSERT AverageLatency(us)=12.99]
    50 sec: 750441 operations; 28225.63 current ops/sec; [INSERT AverageLatency(us)=3747.47]
    60 sec: 900464 operations; 14997.8 current ops/sec; [INSERT AverageLatency(us)=11.4]
    70 sec: 937443 operations; 3696.79 current ops/sec; [INSERT AverageLatency(us)=183.94]
    80 sec: 1200550 operations; 26305.44 current ops/sec; [INSERT AverageLatency(us)=3185.25]
    90 sec: 1350585 operations; 14999 current ops/sec; [INSERT AverageLatency(us)=12.8]
    100 sec: 1408345 operations; 5773.69 current ops/sec; [INSERT AverageLatency(us)=286.38]
    110 sec: 1613792 operations; 20536.49 current ops/sec; [INSERT AverageLatency(us)=4152.39]
    120 sec: 1800751 operations; 18690.29 current ops/sec; [INSERT AverageLatency(us)=1481.83]
    130 sec: 1874264 operations; 7349.83 current ops/sec; [INSERT AverageLatency(us)=46.15]
    140 sec: 2100841 operations; 22650.9 current ops/sec; [INSERT AverageLatency(us)=3326.26]
    150 sec: 2250867 operations; 14998.1 current ops/sec; [INSERT AverageLatency(us)=12.78]
    160 sec: 2342700 operations; 9180.55 current ops/sec; [INSERT AverageLatency(us)=64.83]
    170 sec: 2550953 operations; 20819.05 current ops/sec; [INSERT AverageLatency(us)=3858.96]
    180 sec: 2701021 operations; 15002.3 current ops/sec; [INSERT AverageLatency(us)=12.92]
    190 sec: 2810780 operations; 10971.51 current ops/sec; [INSERT AverageLatency(us)=39.69]
    200 sec: 3001125 operations; 19028.79 current ops/sec; [INSERT AverageLatency(us)=3625.68]
    210 sec: 3151151 operations; 14998.1 current ops/sec; [INSERT AverageLatency(us)=12.56]
    220 sec: 3277541 operations; 12636.47 current ops/sec; [INSERT AverageLatency(us)=23.76]
    230 sec: 3451248 operations; 17365.49 current ops/sec; [INSERT AverageLatency(us)=4603.66]
    240 sec: 3601277 operations; 14998.4 current ops/sec; [INSERT AverageLatency(us)=12.44]
    250 sec: 3745500 operations; 14193.78 current ops/sec; [INSERT AverageLatency(us)=12.02]
    260 sec: 3745500 operations; 0 current ops/sec;
    270 sec: 4053803 operations; 30817.97 current ops/sec; [INSERT AverageLatency(us)=3932.86]
    280 sec: 4203853 operations; 15000.5 current ops/sec; [INSERT AverageLatency(us)=12.68]
    290 sec: 4233326 operations; 2946.42 current ops/sec; [INSERT AverageLatency(us)=3528]
    300 sec: 4503944 operations; 27053.68 current ops/sec; [INSERT AverageLatency(us)=3233.98]
    310 sec: 4653964 operations; 14999 current ops/sec; [INSERT AverageLatency(us)=13.58]
    320 sec: 4692140 operations; 3802.77 current ops/sec; [INSERT AverageLatency(us)=3139.73]
    330 sec: 4954605 operations; 26238.63 current ops/sec; [INSERT AverageLatency(us)=2750.06]
    340 sec: 5104656 operations; 14999.1 current ops/sec; [INSERT AverageLatency(us)=12.92]
    350 sec: 5152194 operations; 4751.9 current ops/sec; [INSERT AverageLatency(us)=52.98]
    360 sec: 5404751 operations; 25250.65 current ops/sec; [INSERT AverageLatency(us)=3347.95]
    370 sec: 5554789 operations; 14999.3 current ops/sec; [INSERT AverageLatency(us)=12.68]

Unfortunately, I do not have time right now, but I will get back to this problem ASAP. I do not see any performance consistency, even under moderate load.
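The sawtooth is easier to see when plotted. A quick way to pull the per-interval throughput out of the log, assuming it was captured to a file (ycsb-load.log is a hypothetical name) and that every status line follows the "N sec: ... current ops/sec" format above:

    # Emits "elapsed_sec ops_per_sec" pairs, ready for gnuplot or a spreadsheet.
    awk '/current ops\/sec/ { print $1, $5 }' ycsb-load.log

The swing from 0 ops/sec (at 260 sec) back up to ~30K against a 15K target is a stall-then-catch-up pattern, which is what the blocking-updates question further down in the thread is probing.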
On Fri, Jan 17, 2014 at 10:21 AM, Andrew Purtell <apurt...@apache.org> wrote:

> I generally agree. However, the "High I/O" and "Cluster Compute" types are
> HVM single tenant on the server, the IO stack uses SR-IOV so MMIO and
> interrupts go directly to the VM, and the 10GbE network paths carry no
> traffic but your own. The locally attached storage is SSD. This is pretty
> close to what you'll have in your own data center or a colo. And damn
> expensive, but good if you can afford it.
>
>
> On Fri, Jan 17, 2014 at 7:03 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> > I need to apologize and clarify this statement…
> >
> > First, running benchmarks on AWS is OK if you’re attempting to get a
> > rough idea of how HBase will perform on a certain class of machines and
> > you’re comparing m1.large to m1.xlarge or m3.xlarge, so that you can
> > get a rough scale on sizing.
> >
> > However, in this thread, you’re talking about trying to figure out why
> > a certain mechanism isn’t working.
> >
> > You’re trying to track down why writes stall while working in a
> > virtualized environment where you have control over neither the
> > machines, nor the network, nor your storage.
> >
> > Also, when you run the OS on a virtual machine, there are going to be
> > ‘anomalies’ that you can’t explain, because the guest OS can only
> > report what it sees, not what could be happening underneath it in the
> > hypervisor.
> >
> > So you may see a problem but never be able to find the cause.
> >
> >
> > On Jan 17, 2014, at 5:55 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
> >
> > > Guys,
> > >
> > > Trying to benchmark on AWS is a waste of time. You end up chasing ghosts.
> > > If you want to benchmark, you need to isolate your systems to reduce
> > > extraneous factors.
> > >
> > > You need real hardware and a real network in a controlled environment.
> > >
> > >
> > > Sent from a remote device. Please excuse any typos...
> > >
> > > Mike Segel
> > >
> > >> On Jan 16, 2014, at 12:34 PM, "Bryan Beaudreault" <bbeaudrea...@hubspot.com> wrote:
> > >>
> > >> This might be better on the user list? Anyway...
> > >>
> > >> How many IPC handlers are you giving it? m1.xlarge is very low on CPU.
> > >> Not only does it have just 4 cores (more cores allow more concurrent
> > >> threads with less context switching), but those cores are severely
> > >> underpowered. I would recommend at least c1.xlarge, which is only a
> > >> bit more expensive. If you happen to be doing heavy GC, with 1-2
> > >> compactions running and many writes incoming, you quickly use up
> > >> quite a bit of CPU. What are the load and CPU usage on
> > >> 10.38.106.234:50010?
> > >>
> > >> Did you see anything about blocking updates in the HBase logs? How
> > >> much memstore are you giving it?
> > >>
> > >>
> > >>> On Thu, Jan 16, 2014 at 1:17 PM, Andrew Purtell <apurt...@apache.org> wrote:
> > >>>
> > >>> On Wed, Jan 15, 2014 at 5:32 PM,
> > >>> Vladimir Rodionov <vladrodio...@gmail.com> wrote:
> > >>>
> > >>>> Yes, I am using ephemeral (local) storage. I found that iostat is
> > >>>> idle most of the time under the 3K load, with periodic bursts up
> > >>>> to 10% iowait.
> > >>>
> > >>> OK, sounds like the problem is higher up the stack.
> > >>>
> > >>> I see in later emails on this thread a log snippet that shows an
> > >>> issue with the WAL writer pipeline: one of the datanodes is slow,
> > >>> sick, or partially unreachable. If you have uneven point-to-point
> > >>> ping times among your cluster instances, or periodic loss, it might
> > >>> still be AWS's fault; otherwise I wonder why the DFSClient says a
> > >>> datanode is sick.
> > >>>
> > >>> --
> > >>> Best regards,
> > >>>
> > >>> - Andy
> > >>>
> > >>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > >>> (via Tom White)
>
> --
> Best regards,
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
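One note against Bryan's questions above (IPC handlers, memstore sizing, blocking updates): those knobs live in hbase-site.xml. A hedged sketch of the relevant properties for this era of HBase follows; the values are illustrative only, not what our cluster is actually running:

    <!-- Hypothetical hbase-site.xml excerpt; values are illustrative,
         not this cluster's actual settings. -->
    <property>
      <!-- RPC handler threads per regionserver (the "IPC handlers") -->
      <name>hbase.regionserver.handler.count</name>
      <value>30</value>
    </property>
    <property>
      <!-- Memstore size at which a region flushes (bytes) -->
      <name>hbase.hregion.memstore.flush.size</name>
      <value>134217728</value>
    </property>
    <property>
      <!-- Updates block once a memstore hits multiplier * flush.size -->
      <name>hbase.hregion.memstore.block.multiplier</name>
      <value>2</value>
    </property>
    <property>
      <!-- Global cap on all memstores, as a fraction of regionserver heap -->
      <name>hbase.regionserver.global.memstore.upperLimit</name>
      <value>0.4</value>
    </property>

If writes are blocking on either the per-region multiplier or the global limit, the regionserver logs should say so, which would line up with the 0 ops/sec interval at 260 sec above.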