I just started YCSB on our real cluster:

    ./bin/ycsb load hbase -P workloads/load20m -p columnfamily=family -s -threads 100 -target 15000
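For context, since workloads/load20m itself isn't shown anywhere in this thread: a minimal sketch of what a YCSB load-phase workload file like it typically contains. The property names are standard YCSB CoreWorkload settings; the values are assumptions (recordcount guessed from the "20m" in the name), not the actual file:

    # Hypothetical reconstruction of workloads/load20m -- not the real file.
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    recordcount=20000000
    fieldcount=10
    fieldlength=100
    insertorder=hashed

With -target 15000 on the command line, YCSB throttles the 100 client threads to roughly 15K inserts/sec in aggregate, which is why the steady intervals below sit right at ~15000 ops/sec.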
This is the beginning of the YCSB log from the real grid run (it looks even worse than on AWS):

    10 sec: 150231 operations; 14984.14 current ops/sec; [INSERT AverageLatency(us)=448.53]
    20 sec: 300299 operations; 14999.3 current ops/sec; [INSERT AverageLatency(us)=16.21]
    30 sec: 450345 operations; 14998.6 current ops/sec; [INSERT AverageLatency(us)=21.65]
    40 sec: 468100 operations; 1775.14 current ops/sec; [INSERT AverageLatency(us)=12.99]
    50 sec: 750441 operations; 28225.63 current ops/sec; [INSERT AverageLatency(us)=3747.47]
    60 sec: 900464 operations; 14997.8 current ops/sec; [INSERT AverageLatency(us)=11.4]
    70 sec: 937443 operations; 3696.79 current ops/sec; [INSERT AverageLatency(us)=183.94]
    80 sec: 1200550 operations; 26305.44 current ops/sec; [INSERT AverageLatency(us)=3185.25]
    90 sec: 1350585 operations; 14999 current ops/sec; [INSERT AverageLatency(us)=12.8]
    100 sec: 1408345 operations; 5773.69 current ops/sec; [INSERT AverageLatency(us)=286.38]
    110 sec: 1613792 operations; 20536.49 current ops/sec; [INSERT AverageLatency(us)=4152.39]
    120 sec: 1800751 operations; 18690.29 current ops/sec; [INSERT AverageLatency(us)=1481.83]
    130 sec: 1874264 operations; 7349.83 current ops/sec; [INSERT AverageLatency(us)=46.15]
    140 sec: 2100841 operations; 22650.9 current ops/sec; [INSERT AverageLatency(us)=3326.26]
    150 sec: 2250867 operations; 14998.1 current ops/sec; [INSERT AverageLatency(us)=12.78]
    160 sec: 2342700 operations; 9180.55 current ops/sec; [INSERT AverageLatency(us)=64.83]
    170 sec: 2550953 operations; 20819.05 current ops/sec; [INSERT AverageLatency(us)=3858.96]
    180 sec: 2701021 operations; 15002.3 current ops/sec; [INSERT AverageLatency(us)=12.92]
    190 sec: 2810780 operations; 10971.51 current ops/sec; [INSERT AverageLatency(us)=39.69]
    200 sec: 3001125 operations; 19028.79 current ops/sec; [INSERT AverageLatency(us)=3625.68]
    210 sec: 3151151 operations; 14998.1 current ops/sec; [INSERT AverageLatency(us)=12.56]
    220 sec: 3277541 operations; 12636.47 current ops/sec; [INSERT AverageLatency(us)=23.76]
    230 sec: 3451248 operations; 17365.49 current ops/sec; [INSERT AverageLatency(us)=4603.66]
    240 sec: 3601277 operations; 14998.4 current ops/sec; [INSERT AverageLatency(us)=12.44]
    250 sec: 3745500 operations; 14193.78 current ops/sec; [INSERT AverageLatency(us)=12.02]
    260 sec: 3745500 operations; 0 current ops/sec;
    270 sec: 4053803 operations; 30817.97 current ops/sec; [INSERT AverageLatency(us)=3932.86]
    280 sec: 4203853 operations; 15000.5 current ops/sec; [INSERT AverageLatency(us)=12.68]
    290 sec: 4233326 operations; 2946.42 current ops/sec; [INSERT AverageLatency(us)=3528]
    300 sec: 4503944 operations; 27053.68 current ops/sec; [INSERT AverageLatency(us)=3233.98]
    310 sec: 4653964 operations; 14999 current ops/sec; [INSERT AverageLatency(us)=13.58]
    320 sec: 4692140 operations; 3802.77 current ops/sec; [INSERT AverageLatency(us)=3139.73]
    330 sec: 4954605 operations; 26238.63 current ops/sec; [INSERT AverageLatency(us)=2750.06]
    340 sec: 5104656 operations; 14999.1 current ops/sec; [INSERT AverageLatency(us)=12.92]
    350 sec: 5152194 operations; 4751.9 current ops/sec; [INSERT AverageLatency(us)=52.98]
    360 sec: 5404751 operations; 25250.65 current ops/sec; [INSERT AverageLatency(us)=3347.95]
    370 sec: 5554789 operations; 14999.3 current ops/sec; [INSERT AverageLatency(us)=12.68]

Unfortunately, I do not have time right now, but I will get back to this problem ASAP. I do not see any performance consistency, even under moderate load.
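The sawtooth is easier to see when plotted. A quick way to pull the per-interval throughput out of the log, assuming it was captured to a file (ycsb-load.log is a hypothetical name) and that every status line follows the "N sec: ... current ops/sec" format above:

    # Emits "elapsed_sec ops_per_sec" pairs, ready for gnuplot or a spreadsheet.
    awk '/current ops\/sec/ { print $1, $5 }' ycsb-load.log

The swing from 0 ops/sec (at 260 sec) back up to ~30K against a 15K target is a stall-then-catch-up pattern, which is what the blocking-updates question further down in the thread is probing.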
On Fri, Jan 17, 2014 at 10:21 AM, Andrew Purtell <apurt...@apache.org> wrote:

> I generally agree. However, the "High I/O" and "Cluster Compute" types are
> HVM single tenant on the server, the IO stack uses SR-IOV so MMIO and
> interrupts go directly to the VM, and the 10GbE network paths carry no
> traffic but your own. The locally attached storage is SSD. This is pretty
> close to what you'll have in your own data center or a colo. And damn
> expensive, but good if you can afford it.
>
>
> On Fri, Jan 17, 2014 at 7:03 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> > I need to apologize and clarify this statement…
> >
> > First, running benchmarks on AWS is OK if you’re attempting to get a
> > rough idea of how HBase will perform on a certain class of machines and
> > you’re comparing m1.large to m1.xlarge or m3.xlarge, so that you can
> > get a rough scale on sizing.
> >
> > However, in this thread, you’re talking about trying to figure out why
> > a certain mechanism isn’t working.
> >
> > You’re trying to track down why writes stall while working in a
> > virtualized environment where you have control over neither the
> > machines, nor the network, nor your storage.
> >
> > Also, when you run the OS on a virtual machine, there are going to be
> > ‘anomalies’ that you can’t explain, because the guest OS can only
> > report what it sees, not what could be happening underneath it in the
> > hypervisor.
> >
> > So you may see a problem but never be able to find the cause.
> >
> >
> > On Jan 17, 2014, at 5:55 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
> >
> > > Guys,
> > >
> > > Trying to benchmark on AWS is a waste of time. You end up chasing ghosts.
> > > If you want to benchmark, you need to isolate your systems to reduce
> > > extraneous factors.
> > >
> > > You need real hardware and a real network in a controlled environment.
> > >
> > >
> > > Sent from a remote device. Please excuse any typos...
> > >
> > > Mike Segel
> > >
> > >> On Jan 16, 2014, at 12:34 PM, "Bryan Beaudreault" <bbeaudrea...@hubspot.com> wrote:
> > >>
> > >> This might be better on the user list? Anyway...
> > >>
> > >> How many IPC handlers are you giving it? m1.xlarge is very low on CPU.
> > >> Not only does it have just 4 cores (more cores allow more concurrent
> > >> threads with less context switching), but those cores are severely
> > >> underpowered. I would recommend at least c1.xlarge, which is only a
> > >> bit more expensive. If you happen to be doing heavy GC, with 1-2
> > >> compactions running and many writes incoming, you quickly use up
> > >> quite a bit of CPU. What are the load and CPU usage on
> > >> 10.38.106.234:50010?
> > >>
> > >> Did you see anything about blocking updates in the HBase logs? How
> > >> much memstore are you giving it?
> > >>
> > >>
> > >>> On Thu, Jan 16, 2014 at 1:17 PM, Andrew Purtell <apurt...@apache.org> wrote:
> > >>>
> > >>> On Wed, Jan 15, 2014 at 5:32 PM,
> > >>> Vladimir Rodionov <vladrodio...@gmail.com> wrote:
> > >>>
> > >>>> Yes, I am using ephemeral (local) storage. I found that iostat is
> > >>>> idle most of the time under the 3K load, with periodic bursts up
> > >>>> to 10% iowait.
> > >>>
> > >>> OK, sounds like the problem is higher up the stack.
> > >>>
> > >>> I see in later emails on this thread a log snippet that shows an
> > >>> issue with the WAL writer pipeline: one of the datanodes is slow,
> > >>> sick, or partially unreachable. If you have uneven point-to-point
> > >>> ping times among your cluster instances, or periodic loss, it might
> > >>> still be AWS's fault; otherwise I wonder why the DFSClient says a
> > >>> datanode is sick.
> > >>>
> > >>> --
> > >>> Best regards,
> > >>>
> > >>> - Andy
> > >>>
> > >>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > >>> (via Tom White)
>
> --
> Best regards,
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
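One note against Bryan's questions above (IPC handlers, memstore sizing, blocking updates): those knobs live in hbase-site.xml. A hedged sketch of the relevant properties for this era of HBase follows; the values are illustrative only, not what our cluster is actually running:

    <!-- Hypothetical hbase-site.xml excerpt; values are illustrative,
         not this cluster's actual settings. -->
    <property>
      <!-- RPC handler threads per regionserver (the "IPC handlers") -->
      <name>hbase.regionserver.handler.count</name>
      <value>30</value>
    </property>
    <property>
      <!-- Memstore size at which a region flushes (bytes) -->
      <name>hbase.hregion.memstore.flush.size</name>
      <value>134217728</value>
    </property>
    <property>
      <!-- Updates block once a memstore hits multiplier * flush.size -->
      <name>hbase.hregion.memstore.block.multiplier</name>
      <value>2</value>
    </property>
    <property>
      <!-- Global cap on all memstores, as a fraction of regionserver heap -->
      <name>hbase.regionserver.global.memstore.upperLimit</name>
      <value>0.4</value>
    </property>

If writes are blocking on either the per-region multiplier or the global limit, the regionserver logs should say so, which would line up with the 0 ops/sec interval at 260 sec above.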