Re: Client receives SocketTimeoutException (CallerDisconnected on RS)

2012-08-28 Thread N Keywal
>
>
> Totally random (even on keys that do not exist).
>

It's worth checking whether that matches your real use cases. I expect that
reads by row key are, most of the time, on existing rows (as with a
traditional db relationship, or UI- or workflow-driven stuff), even if I'm
sure it's possible to have something totally different.

It's not going to have an impact all the time. But I can easily imagine
scenarios with better performance when the row exists vs. when it does not.
For example, you have to read more files to check that a row key is
really not there. This will be even more true if you're inserting a lot of
data simultaneously (i.e. the files won't be major compacted). On the
other hand, bloom filters may be more efficient in this case. But again,
I'm not sure they're going to be efficient on random data. It's like
compression algorithms: on truly random data, they will all have similarly
bad results. That does not mean they are equivalent, or useless.
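
If you want to quantify it, here is a minimal single-threaded sketch that
times gets on keys you know were loaded vs. keys that almost surely don't
exist (the table name, the "existing" key range and the loop sizes below
are placeholders, to be adapted to your setup):

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class ExistingVsMissingGets {
  // Assumes keys "0" .. "999999" were loaded beforehand; adapt to your data set.
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test");
    System.out.println("existing keys: " + timeGets(table, true) + " ms");
    System.out.println("missing keys:  " + timeGets(table, false) + " ms");
    table.close();
  }

  private static long timeGets(HTable table, boolean existing) throws Exception {
    Random rnd = new Random();
    long start = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
      String key = existing
          ? String.valueOf(rnd.nextInt(1000000))   // within the loaded range
          : "missing-" + rnd.nextLong();           // almost surely not in the table
      table.get(new Get(Bytes.toBytes(key)));
    }
    return System.currentTimeMillis() - start;
  }
}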


> I'm working on it! Thanks,
>

If you can reproduce a 'bad behavior' or a performance issue, we will try
to fix it for sure.

Have a nice day,

N.


Re: Client receives SocketTimeoutException (CallerDisconnected on RS)

2012-08-27 Thread Adrien Mogenet
On Fri, Aug 24, 2012 at 6:52 PM, N Keywal  wrote:
> Hi Adrien,
>
>>  What do you think about that hypothesis?
>
> Yes, there is something fishy to look at here. Difficult to say
> without more logs as well.
> Are your gets totally random, or are you doing gets on rows that do
> exist? That would explain the number of requests vs. empty/full
> regions.

Totally random (even on keys that do not exist). But the number of
writeRequests also looks roughly evenly distributed.

>
> It does not explain everything you're seeing, however. So if you're not
> exhausting the system resources, there may be a bug somewhere. If you
> can reproduce the behaviour on a pseudo-distributed cluster it could
> be interesting; as I understand from your previous mail, you have a
> single client, and maybe a single working server at the end...

I'm working on it! Thanks,
>
> Nicolas

-- 
AM.


Re: Client receives SocketTimeoutException (CallerDisconnected on RS)

2012-08-24 Thread N Keywal
Hi Adrien,

>  What do you think about that hypothesis?

Yes, there is something fishy to look at here. Difficult to say
without more logs as well.
Are your gets totally random, or are you doing gets on rows that do
exist? That would explain the number of requests vs. empty/full
regions.

It does not explain everything you're seeing, however. So if you're not
exhausting the system resources, there may be a bug somewhere. If you
can reproduce the behaviour on a pseudo-distributed cluster it could
be interesting; as I understand from your previous mail, you have a
single client, and maybe a single working server at the end...
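
If it helps getting started, here is a rough sketch of a self-contained
reproduction using HBaseTestingUtility (shipped in the hbase tests jar);
the table and column family names are placeholders:

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class MiniClusterRepro {
  public static void main(String[] args) throws Exception {
    HBaseTestingUtility util = new HBaseTestingUtility();
    util.startMiniCluster(1);   // one local master + one region server
    HTable table = util.createTable(Bytes.toBytes("test"), Bytes.toBytes("cf"));
    try {
      // Replay the same multi-get loop as the real benchmark here,
      // e.g. table.get(getsList) with 1/10/100/1000 random keys.
      table.get(new Get(Bytes.toBytes("some-random-key")));
    } finally {
      table.close();
      util.shutdownMiniCluster();
    }
  }
}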

Nicolas


Re: Client receives SocketTimeoutException (CallerDisconnected on RS)

2012-08-24 Thread Adrien Mogenet
G'd evening everyone,

Here are the logs from the server side: http://pastebin.com/yC5dGChh
And from the client side: http://pastebin.com/tR7wdkxG

I followed your advice and I noticed a few things:
First, I thought the "bad" RSs were the ones with the highest number of
sockets, but rebooting each of them did not change anything. Then, I
checked the GC logs again. Nothing special there.
Finally, I noticed something strange: according to the web UI, my
data is roughly evenly distributed among regions/RS (pre-split table
with 1 region per RS). I mean that the displayed "number of requests"
is about the same everywhere. But when I look at the "blocking RS", the
region it's trying to access has 178 StoreFiles in it, for a total of
160 GB, whereas the other servers are handling a very small amount of data.
My rowkey is a simple MD5, presplit from "000... (x32)" to "FFF...
(x32)". Maybe it's another topic, but I feel it could result in
slow response times, even if the index size is low enough to fit in memory
(~120 MB / 12 GB of allocated heap).

What do you think about that hypothesis?
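
For context, here is a minimal sketch of how such a pre-split table over
the 32-character hex key space can be created (table/CF names and the
number of split points are illustrative, not the exact ones I used):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("test");
    desc.addFamily(new HColumnDescriptor("cf"));

    // 15 split points evenly spaced over 32-char hex keys ("1000...0" .. "F000...0"),
    // which yields 16 regions covering the whole MD5 key space.
    byte[][] splits = new byte[15][];
    for (int i = 1; i <= 15; i++) {
      StringBuilder key = new StringBuilder(Integer.toHexString(i).toUpperCase());
      while (key.length() < 32) {
        key.append('0');
      }
      splits[i - 1] = Bytes.toBytes(key.toString());
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}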

On Thu, Aug 23, 2012 at 8:02 PM, Adrien Mogenet
 wrote:
> Hi guys,
>
> 1/ I quickly checked the GC logs and saw nothing. Since I need very
> fast lookups, I set the zookeeper.session.timeout parameter to 10s so
> that an RS is considered dead after very short pauses, and that did not
> occur.
>
> 2/ I did not check, but I don't think I ran out of sockets since the
> ulimit has been set very high. I'll check!
>
> 3/ The benchmark can launch several R/W threads, but even the simplest
> program leads to my issue:
>
> Configuration config = HBaseConfiguration.create();
> HTable table = new HTable(config, "test");
> List<Get> getsList = new ArrayList<Get>();
> for (int i = 0; i < N; i++)  // N = 1, 10, 100 or 1000 gets per batch
>   getsList.add(new Get(randomRowKey()));  // randomRowKey(): hypothetical helper returning a random MD5 key as byte[]
> table.get(getsList);
> table.close();
>
> 4/ I will share more logs tomorrow to dig deeper; I personally need a
> long STW pause :-)
>
> Cheers,
>
> On Thu, Aug 23, 2012 at 7:49 PM, N Keywal  wrote:
>> Hi Adrien,
>>
>> As well, it would help if you could share the client code (number of
>> threads, regions, whether it's a set of single gets or multi gets, this
>> kind of stuff).
>>
>> Cheers,
>>
>> N.
>>
>>
>> On Thu, Aug 23, 2012 at 7:40 PM, Jean-Daniel Cryans  
>> wrote:
>>> Hi Adrien,
>>>
>>> I would love to see the region server side of the logs while those
>>> socket timeouts happen, also check the GC log, but one thing people
>>> often hit while doing pure random read workloads with tons of clients
>>> is running out of sockets because they are all stuck in CLOSE_WAIT.
>>> You can check that by using lsof. There are other discussions on this
>>> mailing list about it.
>>>
>>> J-D
>>>
>>> On Thu, Aug 23, 2012 at 10:24 AM, Adrien Mogenet
>>>  wrote:
>>>> Hi there,
>>>>
>>>> While I'm performing read-intensive benchmarks, I'm seeing a storm of
>>>> "CallerDisconnectedException" in certain RegionServers. As the
>>>> documentation says, my client received a SocketTimeoutException
>>>> (6ms etc...) at the same time.
>>>> It's always happening and I get very poor read performance (from 10
>>>> to 5000 reads/sec) in a 10-node cluster.
>>>>
>>>> My benchmark consists of several iterations launching 10, 100 and 1000
>>>> Get requests on a given random rowkey with a single CF/qualifier.
>>>> I'm using HBase 0.94.1 (a few commits before the official stable
>>>> release) with Hadoop 1.0.3.
>>>> Bloom filters have been enabled (at the rowkey level).
>>>>
>>>> I do not find very clear information about these exceptions. From the
>>>> reference guide:
>>>>   (...) you should consider digging in a bit more if you aren't doing
>>>> something to trigger them.
>>>>
>>>> Well... could you help me digging? :-)
>
> --
> AM



-- 
AM.


Re: Client receives SocketTimeoutException (CallerDisconnected on RS)

2012-08-23 Thread Adrien Mogenet
Hi guys,

1/ I quickly checked the GC logs and saw nothing. Since I need very
fast lookups, I set the zookeeper.session.timeout parameter to 10s so
that an RS is considered dead after very short pauses, and that did not
occur (see the note on timeouts below, after point 4/).

2/ I did not check, but I don't think I ran out of sockets since the
ulimit has been set very high. I'll check!

3/ The benchmark can launch several R/W threads, but even the simplest
program leads to my issue:

Configuration config = HBaseConfiguration.create();
HTable table = new HTable(config, "test");
List<Get> getsList = new ArrayList<Get>();
for (int i = 0; i < N; i++)  // N = 1, 10, 100 or 1000 gets per batch
  getsList.add(new Get(randomRowKey()));  // randomRowKey(): hypothetical helper returning a random MD5 key as byte[]
table.get(getsList);
table.close();

4/ I will share more logs tomorrow to dig deeper; I personally need a
long STW pause :-)
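
A side note on the timeout itself: the SocketTimeoutException on the client
corresponds to the RPC timeout (hbase.rpc.timeout, 60000 ms by default),
which is separate from zookeeper.session.timeout (a cluster-side setting in
hbase-site.xml). A minimal sketch of bumping it in the benchmark client,
just to see whether the gets eventually return or are really stuck (the
value is only an example):

Configuration config = HBaseConfiguration.create();
// Default is 60000 ms; raising it only delays when the client gives up,
// it does not fix whatever makes the region server slow to answer.
config.setInt("hbase.rpc.timeout", 120000);
HTable table = new HTable(config, "test");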

Cheers,

On Thu, Aug 23, 2012 at 7:49 PM, N Keywal  wrote:
> Hi Adrien,
>
> As well, it would help if you could share the client code (number of
> threads, regions, whether it's a set of single gets or multi gets, this
> kind of stuff).
>
> Cheers,
>
> N.
>
>
> On Thu, Aug 23, 2012 at 7:40 PM, Jean-Daniel Cryans  
> wrote:
>> Hi Adrien,
>>
>> I would love to see the region server side of the logs while those
>> socket timeouts happen, also check the GC log, but one thing people
>> often hit while doing pure random read workloads with tons of clients
>> is running out of sockets because they are all stuck in CLOSE_WAIT.
>> You can check that by using lsof. There are other discussions on this
>> mailing list about it.
>>
>> J-D
>>
>> On Thu, Aug 23, 2012 at 10:24 AM, Adrien Mogenet
>>  wrote:
>>> Hi there,
>>>
>>> While I'm performing read-intensive benchmarks, I'm seeing a storm of
>>> "CallerDisconnectedException" in certain RegionServers. As the
>>> documentation says, my client received a SocketTimeoutException
>>> (6ms etc...) at the same time.
>>> It's always happening and I get very poor read performance (from 10
>>> to 5000 reads/sec) in a 10-node cluster.
>>>
>>> My benchmark consists of several iterations launching 10, 100 and 1000
>>> Get requests on a given random rowkey with a single CF/qualifier.
>>> I'm using HBase 0.94.1 (a few commits before the official stable
>>> release) with Hadoop 1.0.3.
>>> Bloom filters have been enabled (at the rowkey level).
>>>
>>> I do not find very clear information about these exceptions. From the
>>> reference guide:
>>>   (...) you should consider digging in a bit more if you aren't doing
>>> something to trigger them.
>>>
>>> Well... could you help me digging? :-)

-- 
AM


Re: Client receives SocketTimeoutException (CallerDisconnected on RS)

2012-08-23 Thread N Keywal
Hi Adrien,

As well, it would help if you could share the client code (number of
threads, regions, whether it's a set of single gets or multi gets, this
kind of stuff).

Cheers,

N.


On Thu, Aug 23, 2012 at 7:40 PM, Jean-Daniel Cryans  wrote:
> Hi Adrien,
>
> I would love to see the region server side of the logs while those
> socket timeouts happen, also check the GC log, but one thing people
> often hit while doing pure random read workloads with tons of clients
> is running out of sockets because they are all stuck in CLOSE_WAIT.
> You can check that by using lsof. There are other discussions on this
> mailing list about it.
>
> J-D
>
> On Thu, Aug 23, 2012 at 10:24 AM, Adrien Mogenet
>  wrote:
>> Hi there,
>>
>> While I'm performing read-intensive benchmarks, I'm seeing a storm of
>> "CallerDisconnectedException" in certain RegionServers. As the
>> documentation says, my client received a SocketTimeoutException
>> (6ms etc...) at the same time.
>> It's always happening and I get very poor read performance (from 10
>> to 5000 reads/sec) in a 10-node cluster.
>>
>> My benchmark consists of several iterations launching 10, 100 and 1000
>> Get requests on a given random rowkey with a single CF/qualifier.
>> I'm using HBase 0.94.1 (a few commits before the official stable
>> release) with Hadoop 1.0.3.
>> Bloom filters have been enabled (at the rowkey level).
>>
>> I do not find very clear information about these exceptions. From the
>> reference guide:
>>   (...) you should consider digging in a bit more if you aren't doing
>> something to trigger them.
>>
>> Well... could you help me digging? :-)
>>
>> --
>> AM.


Re: Client receives SocketTimeoutException (CallerDisconnected on RS)

2012-08-23 Thread Jean-Daniel Cryans
Hi Adrien,

I would love to see the region server side of the logs while those
socket timeouts happen, also check the GC log, but one thing people
often hit while doing pure random read workloads with tons of clients
is running out of sockets because they are all stuck in CLOSE_WAIT.
You can check that by using lsof. There are other discussions on this
mailing list about it.

J-D

On Thu, Aug 23, 2012 at 10:24 AM, Adrien Mogenet
 wrote:
> Hi there,
>
> While I'm performing read-intensive benchmarks, I'm seeing a storm of
> "CallerDisconnectedException" in certain RegionServers. As the
> documentation says, my client received a SocketTimeoutException
> (6ms etc...) at the same time.
> It's always happening and I get very poor read performance (from 10
> to 5000 reads/sec) in a 10-node cluster.
>
> My benchmark consists of several iterations launching 10, 100 and 1000
> Get requests on a given random rowkey with a single CF/qualifier.
> I'm using HBase 0.94.1 (a few commits before the official stable
> release) with Hadoop 1.0.3.
> Bloom filters have been enabled (at the rowkey level).
>
> I do not find very clear information about these exceptions. From the
> reference guide:
>   (...) you should consider digging in a bit more if you aren't doing
> something to trigger them.
>
> Well... could you help me digging? :-)
>
> --
> AM.