On Wed, Dec 30, 2009 at 5:59 AM, Dmitriy Lyfar <[email protected]> wrote:

>
> Exception in thread "Thread-9" java.util.ConcurrentModificationException
>     at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
>     at java.util.AbstractList$Itr.next(AbstractList.java:343)
>     at org.apache.hadoop.conf.Configuration.loadResources(Configuration


Thanks.  This is an HBase bug.  It happens when there are many concurrent
clients in a single process.  The fix should be easy; I have filed an issue.
In the meantime, please follow my previous suggestion of having all your
threads share the one HBaseConfiguration.
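To illustrate the sharing pattern being suggested, here is a minimal sketch.  The `Config` class below is a hypothetical stand-in for `HBaseConfiguration` (the real class needs HBase on the classpath); the point is that every worker thread receives the same instance rather than constructing its own.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SharedConfigSketch {
    // Hypothetical stand-in for HBaseConfiguration; only the sharing
    // pattern matters here, not the class itself.
    static class Config {}

    // Run nThreads workers that all use one shared Config, and report
    // how many distinct Config instances the workers actually saw.
    static int distinctConfigsSeen(int nThreads) {
        final Config shared = new Config();  // ONE config for the whole process
        final Set<Config> seen =
                Collections.newSetFromMap(new ConcurrentHashMap<>());
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                // Real code would do e.g. new HTable(shared, "mytable") here.
                seen.add(shared);
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // All threads share the single config, so exactly one instance is seen.
        System.out.println(distinctConfigsSeen(8)); // prints 1
    }
}
```

Constructing a separate configuration per thread is what triggers the concurrent iteration over the resource list shown in the stack trace above.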



>
> > > As for timings:
> > > For 5Kb rows we have about 35-40K records per second.
> > > For 25Kb rows -- about 1-2K records per second.
> > >
> > > So I get different throughput at different row sizes, which looks illogical.
> > >
> > >
> > Is it?  Is same amount of data being carried?
> >
> > Or it could be that while the 25k is being sent, all other access to a
> > particular node is blocked (that's how Hadoop RPC works -- one connection
> > per process per server, with request/response exclusive on the channel).
> > Thread dump a few times, or add some logging, to see if you can figure
> > out if this is the case.
> >
>
> I mean that if I get 80-100 MB/sec throughput with 5Kb rows, it should
> stay the same for 25Kb rows.


Yes.  For sure.
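For the record, a rough back-of-envelope check using the figures quoted earlier in the thread (5Kb rows at ~35K rows/sec, 25Kb rows at ~2K rows/sec) shows that the reported byte throughput is in fact far from constant across row sizes:

```java
public class ThroughputCheck {
    // Aggregate bytes per second for a given row size (KB) and insert rate.
    static long bytesPerSec(int rowKb, long rowsPerSec) {
        return rowKb * 1024L * rowsPerSec;
    }

    public static void main(String[] args) {
        // Figures taken from the thread above.
        long smallMb = bytesPerSec(5, 35_000) / (1024 * 1024);  // 5Kb rows
        long bigMb   = bytesPerSec(25, 2_000) / (1024 * 1024);  // 25Kb rows
        System.out.println(smallMb + " MB/s vs " + bigMb + " MB/s");
        // prints "170 MB/s vs 48 MB/s" -- the 25Kb case moves far fewer
        // bytes per second, so something other than raw data volume is
        // the bottleneck.
    }
}
```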




> Of course I will insert fewer rows per second in the
> 25Kb case, but the byte throughput should stay the same. Now I'm trying to
> run several instances of the client, each of which inserts 100K records
> (each record is 25Kb). Execution time grows for each client.
>
>
> >
> > In general, our client ain't too good at multiplexing because of
> > limitations such as the one noted above (our client does not yet do
> > nio).  If you want to test cluster performance, run multiple concurrent
> > clients, each in its own process.  MapReduce is good for doing this.
> > See the PerformanceEvaluation code for a sample MR job that floats many
> > clients doing different loading types.
> >
>
> MapReduce is a good idea, but actually we don't have data located in
> Hadoop; we process data in real time and insert it into HBase. So I think
> it would be inefficient to write our data into Hadoop and then run a
> MapReduce job that inserts that data into the tables.
>
>
Agreed.  I was just suggesting it as a way of parallelizing clients.  I
presume the data feed has multiple sources, so that you can run multiple
instances of your upload process?
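As a sketch of what parallelizing the upload might look like, here is a minimal example driving several inserter workers from one driver.  The `uploadWith` body is hypothetical: a real client would open its own `HTable` and call `put()` where the counter increment stands in below.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelUploadSketch {
    // Run `clients` concurrent inserter workers, each loading
    // rowsPerClient rows, and return the total number inserted.
    static long uploadWith(int clients, int rowsPerClient) {
        final AtomicLong inserted = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        for (int c = 0; c < clients; c++) {
            pool.submit(() -> {
                // A real worker would open its own HTable here and loop
                // over table.put(...); the counter is a stand-in.
                for (int i = 0; i < rowsPerClient; i++) {
                    inserted.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return inserted.get();
    }

    public static void main(String[] args) {
        // e.g. one worker per data-feed source, 100K rows each.
        System.out.println(uploadWith(4, 100_000)); // prints 400000
    }
}
```

Note that, per the multiplexing limitation discussed above, separate processes (rather than threads in one JVM) may parallelize better with this era's client.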



> >
> Time with several clients is growing. For example, when I run four
> processes, each with one inserter thread, I get the following results:
> 1) Thread-1 finished its work in 189 sec
> 2) Thread-1 finished its work in 198 sec
> 3) Thread-1 finished its work in 206 sec
> 4) Thread-1 finished its work in 208 sec
> I.e., each successive process takes longer than the previous one. These
> are timings for a test where each process inserts 100K 25Kb rows with the
> WAL on. By the way, the WAL has a great impact on performance as I
> increase the row size: I get about 80 sec for this test with the WAL off.
> Also, when running several clients, the nodes still seem almost idle.
>

Oh, how many regions are in your cluster?  At the start, all clients will be
hitting a single region (and thus a single server).  Check your master
console at port 60010.

You could rerun a second upload just after a first upload.  See what the
numbers are like uploading into a table that is pre-split?
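To sketch the pre-splitting idea: the snippet below computes evenly spaced split keys over the single-byte key-prefix space.  This is only an illustration assuming row keys whose first byte is roughly uniformly distributed; you would adjust the keys to your actual key distribution, and hand the resulting `byte[][]` to a `createTable` overload that accepts split keys (availability of that overload depends on your HBase version).

```java
public class SplitKeys {
    // Compute numRegions-1 evenly spaced single-byte split keys over the
    // full 0x00..0xFF prefix space, yielding numRegions initial regions.
    // Assumes row keys are roughly uniform in their first byte.
    static byte[][] evenSplits(int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. pass these to an HBaseAdmin createTable call that accepts
        // split keys, so clients spread load from the first insert.
        for (byte[] k : evenSplits(4)) {
            System.out.println(k[0] & 0xFF); // prints 64, 128, 192
        }
    }
}
```

With a pre-split table, concurrent uploaders hit several regions (and thus several servers) from the start, instead of all piling onto the single initial region.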

St.Ack



>
>
> --
> Regards, Lyfar Dmitriy
>
