On Wed, Dec 30, 2009 at 5:59 AM, Dmitriy Lyfar <[email protected]> wrote:
> > Exception in thread "Thread-9" java.util.ConcurrentModificationException
> >         at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
> >         at java.util.AbstractList$Itr.next(AbstractList.java:343)
> >         at org.apache.hadoop.conf.Configuration.loadResources(Configuration

Thanks.  This is an hbase bug.  Happens when lots of concurrent clients
in the one process.  Fix should be easy.  I made an issue.  Meantime,
please follow my previous suggestion of having all your threads share
the one HBaseConfiguration.

> > > As for timings:
> > > For 5Kb rows we have about 35-40K records per second.
> > > For 25Kb rows -- about 1-2K records per second.
> > >
> > > So I have different throughput on different row size, looks illogical.
> >
> > Is it?  Is same amount of data being carried?
> > Or it could be that while the 25k is being sent, all other access to a
> > particular node is blocked (Thats how hadoop RPC works -- one connection
> > per process per server with request/response exclusive on the channel).
> > Thread dump a few times or add some logging to see if you can figure if
> > this is the case.
>
> I mean that if I have 80-100 mb/sec throughput for 5Kb rows it should stay
> the same for 25Kb rows.

Yes.  For sure.

> Of course I will insert less rows per second in
> case of 25Kb, but throughput should stay the same. Now I'm trying to run
> several instances of client each of them inserts 100K records (each record
> is 25Kb). Time of execution grows for each client.
>
> > In general, our client ain't to good at multiplexing because of such as
> > the above noted limitation (our client does not yet do nio).  If you want
> > to test cluster performance, run multiple concurrent clients each to its
> > own process.  MapReduce is good for doing this.  See the
> > PerformanceEvaluation code for a sample MR job that floats many clients
> > doing different loading types.
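The shared-configuration advice above can be sketched like this.  It is a
minimal stdlib-only sketch (a plain Map stands in for HBaseConfiguration,
since the real class needs the HBase jars on the classpath); the point is
only the pattern: build ONE configuration object in the launching thread and
hand the same reference to every worker, rather than constructing one per
thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SharedConfSketch {
    public static void main(String[] args) throws InterruptedException {
        // Build ONE configuration object up front.
        // With the real client this would be:
        //   HBaseConfiguration conf = new HBaseConfiguration();
        final Map<String, String> conf = new HashMap<String, String>();
        conf.put("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // hypothetical value

        // Record which conf instance each worker actually saw.
        final Set<Integer> instancesSeen =
            Collections.synchronizedSet(new HashSet<Integer>());

        List<Thread> workers = new ArrayList<Thread>();
        for (int i = 0; i < 4; i++) {
            workers.add(new Thread(new Runnable() {
                public void run() {
                    // Each worker receives the SAME conf reference; with HBase
                    // this is where you would open the table, e.g.:
                    //   HTable table = new HTable(conf, "mytable");
                    // Do NOT build a new HBaseConfiguration per thread --
                    // concurrent construction is what trips the
                    // ConcurrentModificationException in the trace above.
                    instancesSeen.add(System.identityHashCode(conf));
                }
            }));
        }
        for (Thread t : workers) t.start();
        for (Thread t : workers) t.join();

        System.out.println("distinct conf instances across 4 workers: "
            + instancesSeen.size());
    }
}
```

Running it prints that all four workers saw a single configuration instance,
which is the invariant the fix in the issue relies on.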
> MapReduce is good idea, but actually we don't have data which is located
> in hadoop, we processes data in realtime and insert it into hbase. So I
> think it will be inefficient to write our data in hadoop and then run
> MapReduce work which will insert that data into the tables.

Agreed.  Was just suggesting it as a way of parallelizing clients.  I
presume that the source of the data feed is multiple, that you can run
multiple instances of your upload process?

> Time with several clients is growing. For example when I'm running four
> processes, each of them have one inserter thread I got following results:
> 1) Thread-1 have finished its work in 189 sec
> 2) Thread-1 have finished its work in 198 sec
> 3) Thread-1 have finished its work in 206 sec
> 4) Thread-1 have finished its work in 208 sec
> I.e. each next process works longer than previous. It was timings for test
> where each process inserts 100K 25Kb rows with WAL on. Btw WAL have great
> impact on performance when I increase size of row. I have about 80 sec for
> this test with WAL off. Also when running several clients nodes seems still
> almost idle.

Oh, how many regions in your cluster?  At the start, all clients will be
hitting a single region (and thus a single server).  Check your master
console at port 60010.  You could rerun a second upload just after a first
upload.  See what the numbers are like uploading into a table that is
pre-split?

St.Ack

> --
> Regards, Lyfar Dmitriy
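Pre-splitting means creating the table with region boundaries up front, so
the four clients are not all writing into the one initial region.  Below is
a hedged stdlib-only sketch of computing evenly spaced split keys over a
single-byte keyspace; it assumes row keys begin with a well-distributed byte
(e.g. a hashed prefix), which may not match your real key layout.  The
resulting byte[][] is what you would feed to whatever table-creation path
your client version offers for supplying split points (table name and admin
call are hypothetical here).

```java
public class PreSplitSketch {
    // Return (numRegions - 1) split keys evenly spaced over the one-byte
    // keyspace 0x00..0xFF.  Only sensible when row keys start with a
    // well-distributed byte; adjust for your actual key layout.
    static byte[][] evenSplits(int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = evenSplits(4); // 4 regions -> 3 split points
        // Hypothetical real usage, if your admin API takes split keys:
        //   admin.createTable(tableDescriptor, splits);
        for (byte[] s : splits) {
            System.out.println(String.format("split at 0x%02X", s[0] & 0xFF));
        }
    }
}
```

For four uploading processes, four (or more) initial regions lets the master
spread the regions over the cluster so the writers stop serializing on one
regionserver from the first insert onward.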
