I have pasted most of the RS's logs up to and including their FATAL.
I would be very thankful if someone could take a look:
http://pastebin.com/qFzycXNS . Interestingly, some RS's experience an
IOException for not finding an .oldlogs/ file. The rest get
KeeperException$NoNodeException
w/o the I
t by 'online' service requests and
copyTable would hit its resources quite badly. I'll be glad to update.
Thanks again,
Amit
Original message ----
From: Varun Sharma
Date:
To: user@hbase.apache.org
Subject: Re: RS crash upon replication
But wouldn't a copy table between timestamps bring you back? Since the
mutations are all timestamp based, we should be okay. Basically, doing a
copy table over a time range that covers the downtime interval?
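
A minimal sketch of the copy-table idea above, assuming the downtime window is known: it pads the window with a safety margin and builds an argument list for the stock CopyTable job using its --starttime, --endtime and --peer.adr options. The class name, the timestamps, the peer address and the table name are all hypothetical placeholders, not values from this thread.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

final class CopyTableWindow {

    /**
     * Builds CopyTable arguments for a time range that covers the
     * replication downtime plus a margin on both sides.
     */
    static List<String> copyTableArgs(Instant downFrom, Instant downUntil,
                                      Duration margin, String peerAdr,
                                      String table) {
        // CopyTable takes its time range as epoch milliseconds.
        long start = downFrom.minus(margin).toEpochMilli();
        long end = downUntil.plus(margin).toEpochMilli();
        return Arrays.asList(
            "org.apache.hadoop.hbase.mapreduce.CopyTable",
            "--starttime=" + start,
            "--endtime=" + end,
            "--peer.adr=" + peerAdr,
            table);
    }

    public static void main(String[] args) {
        // Placeholder values only; substitute the real outage window,
        // slave cluster address (quorum:port:znode parent) and table.
        List<String> cmd = copyTableArgs(
            Instant.parse("2013-05-22T19:00:00Z"),
            Instant.parse("2013-05-23T16:00:00Z"),
            Duration.ofMinutes(10),
            "zk-slave.example.com:2181:/hbase",
            "my_table");
        System.out.println("hbase " + String.join(" ", cmd));
    }
}
```

The margin is there because the exact moment replication stopped is usually not known precisely. Re-copying a few extra minutes should be harmless, since cells written with the same timestamps simply overwrite themselves on the slave.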
On Thu, May 23, 2013 at 9:48 AM, Jean-Daniel Cryans wrote:
fwiw stop_replication is a kill switch, not a general way to start and
stop replicating, and start_replication may put you in an inconsistent
state:
hbase(main):001:0> help 'stop_replication'
Stops all the replication features. The state in which each
stream stops in is undetermined.
WARNING:
star
No, the servers came out fine just because after the crash (RS's - the
masters were still running), I immediately pulled the brakes with
stop_replication. Then I started the RS's and they came back fine (not
replicating). Once I hit 'start_replication' again they crashed for the
second time. Eventu
Actually, it seems like something else was wrong here - the servers came up
just fine on trying again - so could not really reproduce the issue.
Amit: Did you try patching 8207 ?
Varun
On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha wrote:
That sounds like a bug for sure. Could you create a jira with logs/znode
dump/steps to reproduce it?
Thanks,
himanshu
On Wed, May 22, 2013 at 5:01 PM, Varun Sharma wrote:
It seems I can reproduce this - I did a few rolling restarts and got
screwed with NoNode exceptions - I am running 0.94.7 which has the fix but
my nodes don't contain hyphens - nodes are no longer coming back up...
Thanks
Varun
On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha wrote:
I'd suggest patching the code with 8207; cdh4.2.1 doesn't have it.
With hyphens in the name, ReplicationSource gets confused and tries to set
data in a znode which doesn't exist.
Thanks,
Himanshu
On Wed, May 22, 2013 at 2:42 PM, Amit Mor wrote:
Yes, I have checked the source files of the 0.94.2-cdh4.2.1 jar and
the HBASE-8207 issue is present in the source code, namely:
String[] parts = peerClusterZnode.split("-");
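
To illustrate why that split is a problem with hyphenated host names, here is a small self-contained sketch. It is not the HBASE-8207 patch; the class and method names are hypothetical. It assumes the queue znode name has the form peerId-server-server..., where each server is host,port,startcode and the peer id itself contains no hyphen, as in the znode names pasted elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueueZnodeParsing {

    // One server name: an optional separating '-', then host,port,startcode.
    // The host part may itself contain hyphens.
    private static final Pattern SERVER_NAME =
        Pattern.compile("-?([^,]+),(\\d+),(\\d+)");

    /** The naive approach: every hyphen is treated as a separator. */
    static String[] naiveSplit(String znode) {
        return znode.split("-");
    }

    /** Peer id is everything before the first hyphen (assumed hyphen-free). */
    static String peerId(String znode) {
        int dash = znode.indexOf('-');
        return dash < 0 ? znode : znode.substring(0, dash);
    }

    /** Extracts the chained server names without splitting inside host names. */
    static List<String> deadServers(String znode) {
        List<String> servers = new ArrayList<>();
        int dash = znode.indexOf('-');
        if (dash < 0) {
            return servers; // plain peer id, no recovered queues
        }
        Matcher m = SERVER_NAME.matcher(znode.substring(dash + 1));
        while (m.find()) {
            servers.add(m.group(1) + "," + m.group(2) + "," + m.group(3));
        }
        return servers;
    }

    public static void main(String[] args) {
        String znode = "1-va-p-hbase-02-e,60020,1369042377129"
            + "-va-p-hbase-02-c,60020,1369042377731";
        System.out.println("naive parts: " + naiveSplit(znode).length);
        System.out.println("peer id:     " + peerId(znode));
        System.out.println("servers:     " + deadServers(znode));
    }
}
```

With a host such as va-p-hbase-02-e, the naive split cuts the host name itself into several pieces, so any code that rebuilds a znode path from those pieces can end up pointing at a node that does not exist.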
On Thu, May 23, 2013 at 12:42 AM, Amit Mor wrote:
yes, indeed - hyphens are part of the host name (annoying legacy stuff in
my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was
backported by Cloudera into their flavor of 0.94.2, but
the mysterious occurrence of the percent sign in zkcli (ls
/hbase/replication/rs/va-p-hbase-02-d,60
I believe there were cascading failures which got these deep nodes
containing still to be replicated WAL(s) - I suspect there is either some
parsing bug or something which is causing the replication source to not
work - also which version are you using - does it have
https://issues.apache.org/jira/
va-p-hbase-02-d,60020,1369249862401
On Thu, May 23, 2013 at 12:20 AM, Varun Sharma wrote:
Basically
ls /hbase/rs and what do you see for va-p-02-d ?
On Wed, May 22, 2013 at 2:19 PM, Varun Sharma wrote:
Can you do ls /hbase/rs and see what you get for 02-d - instead of looking
in /replication/, could you look in /hbase/replication/rs - I want to see
if the timestamps are matching or not ?
Varun
On Wed, May 22, 2013 at 2:17 PM, Varun Sharma wrote:
I see - so looks okay - there's just a lot of deep nesting in there - if
you look into these nodes by doing ls - you should see a bunch of
WAL(s) which still need to be replicated...
Varun
On Wed, May 22, 2013 at 2:16 PM, Varun Sharma wrote:
2013-05-22 15:31:25,929 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for *
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-0
I found this:
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 17] ls
/hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401
[1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-d,60020,1369042382584-va-p-hbase-02-c,60020,136904
empty return:
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]
On Thu, May 23, 2013 at 12:05 AM, Varun Sharma wrote:
Do an "ls" not a get here and give the output ?
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
On Wed, May 22, 2013 at 1:53 PM, amit.mor.m...@gmail.com <
amit.mor.m...@gmail.com> wrote:
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
cZxid = 0x60281c1de
ctime = Wed May 22 15:11:17 EDT 2013
mZxid = 0x60281c1de
mtime = Wed May 22 15:11:17 EDT 2013
pZxid = 0x60281c1de
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwn
What does this command show you ?
get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
Cheers
On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com <
amit.mor.m...@gmail.com> wrote:
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
[1]
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]
I'm on hbase-0.94.2-cdh4.2.1
Thanks
On Wed, May 22, 2013 at 11:40 PM, Varun Sharma wrote:
Also what version of HBase are you running ?
On Wed, May 22, 2013 at 1:38 PM, Varun Sharma wrote:
Basically,
You had va-p-hbase-02 crash - that caused all the replication related data
in zookeeper to be moved to va-p-hbase-01 and have it take over for
replicating 02's logs. Now each region server also maintains an in-memory
state of what's in ZK, it seems like when you start up 01, it's trying t