[jira] [Created] (HBASE-23376) NPE happens while replica region is moving

2019-12-05 Thread Sun Xin (Jira)
Sun Xin created HBASE-23376:
---

 Summary: NPE happens while replica region is moving
 Key: HBASE-23376
 URL: https://issues.apache.org/jira/browse/HBASE-23376
 Project: HBase
  Issue Type: Bug
  Components: read replicas
Reporter: Sun Xin
Assignee: Sun Xin


The following code is from AsyncNonMetaRegionLocator#addToCache

 
{code:java}
private RegionLocations addToCache(TableCache tableCache, RegionLocations locs) {
  LOG.trace("Try adding {} to cache", locs);
  byte[] startKey = locs.getDefaultRegionLocation().getRegion().getStartKey();
  ...
}{code}
 

We will get an NPE if locs does not contain the default region location.

 

The following code is from AsyncRegionLocatorHelper#updateCachedLocationOnError

 
{code:java}
...
if (cause instanceof RegionMovedException) {
  RegionMovedException rme = (RegionMovedException) cause;
  HRegionLocation newLoc =
    new HRegionLocation(loc.getRegion(), rme.getServerName(), rme.getLocationSeqNum());
  LOG.debug("Try updating {} with the new location {} constructed by {}", loc, newLoc,
    rme.toString());
  addToCache.accept(newLoc);
...{code}
If the replica region is moving, we will get a RegionMovedException and add the 
HRegionLocation of the replica region to the cache, and then the NPE below follows.

 

 
{code:java}
java.lang.NullPointerException at 
org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addToCache(AsyncNonMetaRegionLocator.java:240)
 at 
org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addLocationToCache(AsyncNonMetaRegionLocator.java:596)
 at 
org.apache.hadoop.hbase.client.AsyncRegionLocatorHelper.updateCachedLocationOnError(AsyncRegionLocatorHelper.java:80)
 at 
org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.updateCachedLocationOnError(AsyncNonMetaRegionLocator.java:610)
 at 
org.apache.hadoop.hbase.client.AsyncRegionLocator.updateCachedLocationOnError(AsyncRegionLocator.java:153)
{code}
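
A possible direction (a sketch only, not the committed fix) is to bail out of 
addToCache when the RegionLocations carries no default replica location, so the 
default location is never dereferenced while null:

{code:java}
// Sketch of a possible guard, not the committed fix: skip caching when the
// RegionLocations has no default replica location.
private RegionLocations addToCache(TableCache tableCache, RegionLocations locs) {
  HRegionLocation defaultLoc = locs.getDefaultRegionLocation();
  if (defaultLoc == null) {
    LOG.trace("{} has no default replica location, skip caching", locs);
    return locs;
  }
  byte[] startKey = defaultLoc.getRegion().getStartKey();
  // ... rest unchanged ...
}
{code}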



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23375) NPE during opening a daughter region in cacheBlock

2019-12-05 Thread Baiqiang Zhao (Jira)
Baiqiang Zhao created HBASE-23375:
-

 Summary: NPE during opening a daughter region in cacheBlock 
 Key: HBASE-23375
 URL: https://issues.apache.org/jira/browse/HBASE-23375
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.11, 1.6.0
Reporter: Baiqiang Zhao


The RegionServer log is:
{code:java}
2019-12-04 11:32:37,238 INFO  [regionserver/localhost/0.0.0.0:16020-splits-0] 
regionserver.SplitRequest: Running rollback/cleanup of failed split of 
ONLINE:testTable,\x009\x0014aa9,1575406565984.48f462e65b7961420737797c2ccf76c9.;
 Failed 
localhost,16020,1574999150042-daughterOpener=aad203e7b1aa26a26b50c84f70397456
java.io.IOException: Failed 
localhost,16020,1574999150042-daughterOpener=aad203e7b1aa26a26b50c84f70397456
at 
org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.openDaughters(SplitTransactionImpl.java:504)
at 
org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.stepsAfterPONR(SplitTransactionImpl.java:598)
at 
org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.execute(SplitTransactionImpl.java:581)
at 
org.apache.hadoop.hbase.regionserver.SplitRequest.doSplitting(SplitRequest.java:82)
at 
org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:153)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.io.IOException: 
java.lang.NullPointerException
at 
org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1041)
at 
org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:916)
at 
org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:884)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7098)
at 
org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.openDaughterRegion(SplitTransactionImpl.java:732)
at 
org.apache.hadoop.hbase.regionserver.SplitTransactionImpl$DaughterOpener.run(SplitTransactionImpl.java:712)
   ... 1 more
Caused by: java.io.IOException: java.lang.NullPointerException
at 
org.apache.hadoop.hbase.regionserver.HStore.openStoreFiles(HStore.java:577)
at 
org.apache.hadoop.hbase.regionserver.HStore.loadStoreFiles(HStore.java:532)
at org.apache.hadoop.hbase.regionserver.HStore.(HStore.java:281)
at 
org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5469)
at 
org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1015)
at 
org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1012)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.compareCacheBlock(BlockCacheUtil.java:185)
at 
org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.validateBlockAddition(BlockCacheUtil.java:204)
at 
org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.shouldReplaceExistingCacheBlock(BlockCacheUtil.java:233)
at 
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlockWithWait(BucketCache.java:433)
at 
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlock(BucketCache.java:419)
at 
org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:68)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:462)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:269)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:601)
at 
org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekTo(HalfStoreFileReader.java:190)
at 
org.apache.hadoop.hbase.io.HalfStoreFileReader.getFirstKey(HalfStoreFileReader.java:365)
at 
org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:546)
at 
org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:563)
at 
org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:553)
at 
org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:707)
   

Re: [DISCUSS] EOM branch-1.3

2019-12-05 Thread Andrew Purtell
Hey Francis,

What is preventing an upgrade to 1.4? Are there specific concerns
remaining? It's been out there for a long time now, and bug fixed through
12 releases. Feel free to contact me offlist if you prefer, no problem.

Committers are voting to EOM 1.3 with their git clients already. It is
spotty if changes make it that far back even when they should. See extra
work the RM had to do for the last 1.3 release to compare history and port
back stuff. Nothing prevents you or someone else from continuing to make
1.3 releases but the writing is on the wall here.



On Thu, Dec 5, 2019 at 6:52 PM Francis Christopher Liu 
wrote:

> Hi Guys,
>
> We are still on 1.3 so it would be in our interest if I can continue to
> roll out 1.3.z releases. Having said that, it is the oldest release branch
> and I understand the effort it takes to maintain another branch, hence I
> didn't push for it unless there are other reasons than our own for keeping
> it going. If it works for you guys, I can send an email to the user list to
> see if that criterion is met?
>
> Also, I was wondering: if the branch is retired, does that prevent us from
> rolling out releases with critical/needed fixes?
>
> Thanks,
> Francis
>
>
>
> On Tue, Dec 3, 2019 at 3:27 AM Sean Busbey  wrote:
>
> > If it would change anyone's willingness to maintain the branch, then I
> > encourage them to go ask about the need on user@hbase.
> >
> > AFAIK in the year since we started talking about shutting down branch-1.3
> > no committer or PMC has expressed that their interest would change if
> > someone on user@hbase felt stuck on 1.3.z.
> >
> > Also worth noting that in the month since the 1.3.6 announcement went out
> > no one has shown up to say they can't move off of the release line.
> >
> >
> > On Mon, Dec 2, 2019, 22:53 Andrew Purtell 
> > wrote:
> >
> > > And if a non dev says they won’t move off 1.3? Will it change any
> > > committer or PMC minds on actually continuing to do 1.3 releases? If
> not
> > I
> > > think we have to call it for lack of interest and bandwidth.
> > >
> > > 1.4 is a functional superset of 1.3 and the current stable line anyway.
> > > Seems little reason not to upgrade save inertia or risk aversion.
> > >
> > >
> > > > On Dec 2, 2019, at 5:43 PM, Sean Busbey  wrote:
> > > >
> > > > Anyone who wants branch-1.3 to keep having releases has to be
> willing
> > > > to volunteer to maintain it. If the note in the 1.3.6 release wasn't
> > > > sufficient motivation to get them to show up on dev@hbase to do so,
> I
> > > > could put a more explicit mention of it in the EOM message. We'd need
> > > > to come up with some phrasing that didn't leave the status of the
> > > > release line ambiguous though.
> > > >
> > > > For reference, these are the last two EOM announcements we did:
> > > >
> > > > * 2.0.z in Sep 2019: https://s.apache.org/slgsa
> > > > * 1.2.z in Jun 2019:  https://s.apache.org/g8lnu
> > > >
> > > > 2.0 and 1.3 were never a release line with the "stable" marker on it.
> > > > 1.2 was the stable release line prior to 1.4.
> > > >
> > > >> On Mon, Dec 2, 2019 at 1:58 PM Misty Linville 
> > wrote:
> > > >>
> > > >> Whether any non-dev users are unable to move off 1.3, I suppose.
> > > >>
> > > >>> On Mon, Dec 2, 2019 at 11:04 AM Sean Busbey 
> > wrote:
> > > >>>
> > > >>> On what, specifically?
> > > >>>
> > >  On Mon, Dec 2, 2019, 11:24 Misty Linville 
> wrote:
> > > >>>
> > >  Should the user list be allowed to weigh in?
> > > 
> > >  On Mon, Dec 2, 2019 at 7:33 AM Andrew Purtell <
> > > andrew.purt...@gmail.com>
> > >  wrote:
> > > 
> > > > I think there is a consensus on moving the stable pointer, based
> on
> > > > earlier discussion. What I would suggest is a separate thread to
> > > >>> propose
> > > > it, and if nobody objects, do it.
> > > >
> > > >> On Dec 2, 2019, at 5:14 AM, 张铎(Duo Zhang) <
> palomino...@gmail.com>
> > >  wrote:
> > > >>
> > > >> +1.
> > > >>
> > > >> And I think it is time to move the stable pointer to 2.2.x? I
> know
> > > >>> that
> > > >> 2.2.x still has some bugs, especially on the procedure store,
> but
> > >  anyway,
> > > >> we have HBCK2 to fix them.
> > > >>
> > > >> And for the current stable release line, 1.4.x, the assignment
> > > >>> manager
> > > > also
> > > >> has bugs, as it is the reason why we introduced AMv2.
> > > >>
> > > >> So I do not think bug free is the 'must have' for a stable
> release
> > >  line.
> > > >>
> > > >> Jan Hentschel  于2019年12月2日周一
> > >  下午4:57写道:
> > > >>
> > > >>> +1
> > > >>>
> > > >>> From: Sakthi 
> > > >>> Reply-To: "dev@hbase.apache.org" 
> > > >>> Date: Monday, December 2, 2019 at 3:32 AM
> > > >>> To: "dev@hbase.apache.org" 
> > > >>> Subject: Re: [DISCUSS] EOM branch-1.3
> > > >>>
> > > >>> +1
> > > >>>
> > > >>> On Sun, Dec 1, 2019 at 6:28 PM Andrew Purtell <
> > >  andrew.purt...@gmail.com
> >

[jira] [Created] (HBASE-23374) ExclusiveMemHFileBlock’s allocator should not be hardcoded as ByteBuffAllocator.HEAP

2019-12-05 Thread chenxu (Jira)
chenxu created HBASE-23374:
--

 Summary: ExclusiveMemHFileBlock’s allocator should not be 
hardcoded as ByteBuffAllocator.HEAP
 Key: HBASE-23374
 URL: https://issues.apache.org/jira/browse/HBASE-23374
 Project: HBase
  Issue Type: Improvement
Reporter: chenxu
Assignee: chenxu


After HBASE-22802, ExclusiveMemHFileBlock’s data may be allocated through the 
BB pool, so its allocator should not be hardcoded as ByteBuffAllocator.HEAP.
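
Roughly, the direction would be something like the following simplified sketch 
(hypothetical names, not the real ExclusiveMemHFileBlock signatures): keep 
whichever allocator produced the block's buffer instead of always answering 
ByteBuffAllocator.HEAP.

{code:java}
import org.apache.hadoop.hbase.io.ByteBuffAllocator;

// Simplified, hypothetical sketch -- not the actual HBase class layout.
class ExclusiveMemHFileBlockSketch {
  private final ByteBuffAllocator allocator;

  ExclusiveMemHFileBlockSketch(ByteBuffAllocator allocator) {
    this.allocator = allocator; // previously hardcoded: ByteBuffAllocator.HEAP
  }

  ByteBuffAllocator getByteBuffAllocator() {
    return allocator;
  }
}
{code}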



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] EOM branch-1.3

2019-12-05 Thread Francis Christopher Liu
Hi Guys,

We are still on 1.3 so it would be in our interest if I can continue to
roll out 1.3.z releases. Having said that, it is the oldest release branch
and I understand the effort it takes to maintain another branch, hence I
didn't push for it unless there are other reasons than our own for keeping
it going. If it works for you guys, I can send an email to the user list to
see if that criterion is met?

Also, I was wondering: if the branch is retired, does that prevent us from
rolling out releases with critical/needed fixes?

Thanks,
Francis



On Tue, Dec 3, 2019 at 3:27 AM Sean Busbey  wrote:

> If it would change anyone's willingness to maintain the branch, then I
> encourage them to go ask about the need on user@hbase.
>
> AFAIK in the year since we started talking about shutting down branch-1.3
> no committer or PMC has expressed that their interest would change if
> someone on user@hbase felt stuck on 1.3.z.
>
> Also worth noting that in the month since the 1.3.6 announcement went out
> no one has shown up to say they can't move off of the release line.
>
>
> On Mon, Dec 2, 2019, 22:53 Andrew Purtell 
> wrote:
>
> > And if a non dev says they won’t move off 1.3? Will it change any
> > committer or PMC minds on actually continuing to do 1.3 releases? If not
> I
> > think we have to call it for lack of interest and bandwidth.
> >
> > 1.4 is a functional superset of 1.3 and the current stable line anyway.
> > Seems little reason not to upgrade save inertia or risk aversion.
> >
> >
> > > On Dec 2, 2019, at 5:43 PM, Sean Busbey  wrote:
> > >
> > > Anyone who wants branch-1.3 to keep having releases has to be willing
> > > to volunteer to maintain it. If the note in the 1.3.6 release wasn't
> > > sufficient motivation to get them to show up on dev@hbase to do so, I
> > > could put a more explicit mention of it in the EOM message. We'd need
> > > to come up with some phrasing that didn't leave the status of the
> > > release line ambiguous though.
> > >
> > > For reference, these are the last two EOM announcements we did:
> > >
> > > * 2.0.z in Sep 2019: https://s.apache.org/slgsa
> > > * 1.2.z in Jun 2019:  https://s.apache.org/g8lnu
> > >
> > > 2.0 and 1.3 were never a release line with the "stable" marker on it.
> > > 1.2 was the stable release line prior to 1.4.
> > >
> > >> On Mon, Dec 2, 2019 at 1:58 PM Misty Linville 
> wrote:
> > >>
> > >> Whether any non-dev users are unable to move off 1.3, I suppose.
> > >>
> > >>> On Mon, Dec 2, 2019 at 11:04 AM Sean Busbey 
> wrote:
> > >>>
> > >>> On what, specifically?
> > >>>
> >  On Mon, Dec 2, 2019, 11:24 Misty Linville  wrote:
> > >>>
> >  Should the user list be allowed to weigh in?
> > 
> >  On Mon, Dec 2, 2019 at 7:33 AM Andrew Purtell <
> > andrew.purt...@gmail.com>
> >  wrote:
> > 
> > > I think there is a consensus on moving the stable pointer, based on
> > > earlier discussion. What I would suggest is a separate thread to
> > >>> propose
> > > it, and if nobody objects, do it.
> > >
> > >> On Dec 2, 2019, at 5:14 AM, 张铎(Duo Zhang) 
> >  wrote:
> > >>
> > >> +1.
> > >>
> > >> And I think it is time to move the stable pointer to 2.2.x? I know
> > >>> that
> > >> 2.2.x still has some bugs, especially on the procedure store, but
> >  anyway,
> > >> we have HBCK2 to fix them.
> > >>
> > >> And for the current stable release line, 1.4.x, the assignment
> > >>> manager
> > > also
> > >> has bugs, as it is the reason why we introduced AMv2.
> > >>
> > >> So I do not think bug free is the 'must have' for a stable release
> >  line.
> > >>
> > >> Jan Hentschel  于2019年12月2日周一
> >  下午4:57写道:
> > >>
> > >>> +1
> > >>>
> > >>> From: Sakthi 
> > >>> Reply-To: "dev@hbase.apache.org" 
> > >>> Date: Monday, December 2, 2019 at 3:32 AM
> > >>> To: "dev@hbase.apache.org" 
> > >>> Subject: Re: [DISCUSS] EOM branch-1.3
> > >>>
> > >>> +1
> > >>>
> > >>> On Sun, Dec 1, 2019 at 6:28 PM Andrew Purtell <
> >  andrew.purt...@gmail.com
> > >>> >
> > >>> wrote:
> > >>>
> > >>> +1 for EOL of 1.3.
> > >>>
> > >>> Onward to 1.6!
> > >>>
> > >>>
> >  On Dec 1, 2019, at 5:38 PM, Sean Busbey  >  > >>> bus...@apache.org>> wrote:
> > 
> >  Hi folks!
> > 
> >  It's been about a month since the last 1.3.z release came out.
> > >>> We've
> >  been talking about EOM for branch-1.3 for about a year. Most
> >  recently,
> >  we had a growing consensus[1] to EOM after getting the 1.3.6
> > >>> release
> >  out with the fixes for Jackson in HBASE-22728 out.
> > 
> >  Looking at the things that have since landed in branch-1.3 and
> >  nothing
> >  looks critical (these are all Major or Minor)[2]:
> > 
> >  - HBASE-23149 hbase shouldPerformMajorCompaction logic is n

[jira] [Resolved] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-05 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-23364.
---
Fix Version/s: 1.6.0
   2.3.0
   3.0.0
   Resolution: Fixed

Committed to branch-1, branch-2, and master.

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: 23364-branch-1.txt
>
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I only noticed this recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix's 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack show this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other reference of the locked objects in the stack dump, 
> but didn't find any.
>  
>  
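
The usual cure for this class of leak, for reference, is to hand the pool a daemon 
ThreadFactory (or shut the pool down explicitly). A generic, self-contained sketch, 
not tied to whichever pool is at fault here:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class DaemonPoolExample {
  public static void main(String[] args) {
    ThreadFactory daemonFactory = new ThreadFactory() {
      private final AtomicInteger count = new AtomicInteger();
      @Override
      public Thread newThread(Runnable r) {
        Thread t = new Thread(r, "pool-worker-" + count.incrementAndGet());
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        return t;
      }
    };
    ExecutorService pool = Executors.newFixedThreadPool(1, daemonFactory);
    pool.submit(() -> System.out.println("work done"));
    // Even without shutdown(), daemon workers will not block JVM exit,
    // though an explicit shutdown on close remains the cleaner fix.
    pool.shutdown();
  }
}
{code}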



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-23358) Set version as 2.1.9-SNAPSHOT in branch-2.1

2019-12-05 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-23358.
---
Hadoop Flags: Reviewed
  Resolution: Fixed

Pushed to branch-2.1.

Thanks [~zghao] for reviewing.

> Set version as 2.1.9-SNAPSHOT in branch-2.1
> ---
>
> Key: HBASE-23358
> URL: https://issues.apache.org/jira/browse/HBASE-23358
> Project: HBase
>  Issue Type: Sub-task
>  Components: build, pom
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.1.9
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23373) Log `RetriesExhaustedException` context with full time precision

2019-12-05 Thread Nick Dimiduk (Jira)
Nick Dimiduk created HBASE-23373:


 Summary: Log `RetriesExhaustedException` context with full time 
precision
 Key: HBASE-23373
 URL: https://issues.apache.org/jira/browse/HBASE-23373
 Project: HBase
  Issue Type: Improvement
  Components: asyncclient, Client
Reporter: Nick Dimiduk


{{RetriesExhaustedException}} will print out a list of the exceptions it has 
accumulated while making its attempts. These exceptions are prefixed with a 
timestamp indicating when the exception was collected. We currently print this 
out using {{Date.toString()}}, which only has second granularity and whose 
format varies by locale. Normalize the locale and include full precision.
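
For illustration, a java.time based rendering gives millisecond precision and a 
locale-independent format (a sketch of the idea only, not the committed change; 
the exact pattern is an assumption):

{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.Locale;

public class TimestampFormatExample {
  public static void main(String[] args) {
    long ts = System.currentTimeMillis();
    // Current behaviour: second granularity, locale- and zone-dependent rendering.
    System.out.println(new Date(ts).toString());
    // Sketch: millisecond precision, Locale.ROOT, ISO-8601-style output.
    DateTimeFormatter fmt = DateTimeFormatter
        .ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", Locale.ROOT)
        .withZone(ZoneId.systemDefault());
    System.out.println(fmt.format(Instant.ofEpochMilli(ts)));
  }
}
{code}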



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23372) ZooKeeper Assignment can result in stale znodes in region-in-transition after table is dropped and hbck run

2019-12-05 Thread David Manning (Jira)
David Manning created HBASE-23372:
-

 Summary: ZooKeeper Assignment can result in stale znodes in 
region-in-transition after table is dropped and hbck run
 Key: HBASE-23372
 URL: https://issues.apache.org/jira/browse/HBASE-23372
 Project: HBase
  Issue Type: Bug
  Components: hbck, master, Region Assignment, Zookeeper
Affects Versions: 1.3.2
Reporter: David Manning


It is possible for znodes under /hbase/region-in-transition to remain long 
after a table is deleted. There does not appear to be any cleanup logic for 
these.

The details are a little fuzzy, but it seems to be fallout from HBASE-22617. 
Incidents related to that bug involved regions stuck in transition, and use of 
hbck to fix clusters. There was a temporary table created and deleted once per 
day, but somehow it led to receiving 
{{FSLimitException$MaxDirectoryItemsExceededException}} and regions stuck in 
transition. Even weeks after fixing the bug and upgrading the cluster, the 
znodes remain under /hbase/region-in-transition. In the most impacted cluster, 
{{hbase zkcli ls /hbase/region-in-transition | wc -w}} returns almost 100,000 
entries. This causes very slow region transition times (often 80 seconds), 
likely due to enumerating all these entries when the zk watch on this node is 
triggered.

Log lines for slow region transitions:
{code:java}
2019-12-05 07:02:14,714 DEBUG [K.Worker-pool3-t7344] master.AssignmentManager - 
Handling RS_ZK_REGION_CLOSED, server=<>, region=<>, 
which is more than 15 seconds late, current_state={<> 
state=PENDING_CLOSE, ts=1575529254635, server=<>}
{code}
Even during hmaster failover, entries are not cleaned, but the following log 
lines can be seen:
{code:java}
2019-11-27 00:26:27,044 WARN  [.activeMasterManager] master.AssignmentManager - 
Couldn't find the region in recovering region=<>, 
state=RS_ZK_REGION_FAILED_OPEN, servername=<>, 
createTime=1565603905404, payload.length=0
{code}
Possible solutions:
 # Logic during master failover that parses the RIT znodes and checks whether the 
table exists; clean up entries for nonexistent tables.
 # New mode for hbck to do cleanup of nonexistent regions under the znode.
 # Others?
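
As a rough illustration of the second option, a sweep with the plain ZooKeeper 
client could look like the sketch below; the set of valid encoded region names is 
assumed to be supplied by the caller (e.g. built from hbase:meta), and none of 
this is an existing hbck command:

{code:java}
import java.util.List;
import java.util.Set;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: report (and optionally delete) region-in-transition children that
// are not in a caller-supplied set of valid encoded region names.
public class StaleRitSweep {
  public static void sweep(String quorum, Set<String> validEncodedNames) throws Exception {
    ZooKeeper zk = new ZooKeeper(quorum, 30000, event -> { });
    try {
      String ritPath = "/hbase/region-in-transition";
      List<String> children = zk.getChildren(ritPath, false);
      for (String encodedName : children) {
        if (!validEncodedNames.contains(encodedName)) {
          System.out.println("Would delete stale znode " + ritPath + "/" + encodedName);
          // zk.delete(ritPath + "/" + encodedName, -1); // uncomment after verifying
        }
      }
    } finally {
      zk.close();
    }
  }
}
{code}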



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Error Handling and Logging

2019-12-05 Thread Nick Dimiduk
I think these are good points all around. Can any of these anti-patterns be
flagged by a checkstyle rule? Static analysis would make the infractions
easier to track down.

One more point of my own: I’m of the opinion that we log too much in
general. Info level should not describe the details of operations as
normal. I’m also not a fan of logging data structure “status”messages, as
we do, for example, from the block cache. It’s enough to expose these as
metrics.

Thanks for speaking up! If you’re feeling ambitious, please ping me on any
PRs and we’ll get things cleaned up.

Thanks,
Nick

On Tue, Dec 3, 2019 at 21:02 Stack  wrote:

> Thanks for the helpful note David. Appreciated.
> S
>
> On Tue, Nov 26, 2019 at 1:44 PM David Mollitor  wrote:
>
> > Hello Team,
> >
> > I am one of many people responsible for supporting the Hadoop products
> out
> > in the field.  Error handling and logging are crucial to my success.
> I've
> > been reading over the code and I see many of the same mistakes again and
> > again.  I just wanted to bring some of these things to your attention so
> > that moving forward, we can make these products better.
> >
> > The general best-practice is:
> >
> > public class TestExceptionLogging
> > {
> >   private static final Logger LOG =
> > LoggerFactory.getLogger(TestExceptionLogging.class);
> >
> >   public static void main(String[] args) {
> > try {
> >   processData();
> > } catch (Exception e) {
> >   LOG.error("Application failed", e);
> > }
> >   }
> >
> >   public static void processData() throws Exception {
> > try {
> >   readData();
> > } catch (Exception e) {
> >   throw new Exception("Failed to process data", e);
> > }
> >   }
> >
> >   public static byte[] readData() throws Exception {
> > throw new IOException("Failed to read device");
> >   }
> > }
> >
> > Produces:
> >
> > [main] ERROR TestExceptionLogging - Application failed
> > java.lang.Exception: Failed to process data
> > at TestExceptionLogging.processData(TestExceptionLogging.java:22)
> > at TestExceptionLogging.main(TestExceptionLogging.java:12)
> > Caused by: java.io.IOException: Failed to read device
> > at TestExceptionLogging.readData(TestExceptionLogging.java:27)
> > at TestExceptionLogging.processData(TestExceptionLogging.java:20)
> > ... 1 more
> >
> >
> >
> > Please notice that when an exception is thrown, and caught, it is wrapped
> > at each level and each level adds some more context to describe what was
> > happening when the error occurred.  It also produces a complete stack
> trace
> > that pinpoints the issue.  For Hive folks, it is rarely the case that a
> > method consuming a HMS API call should itself throw a MetaException.  The
> > MetaException has no way of wrapping an underlying Exception and helpful
> > data is often loss.  A method may chooses to wrap a MetaException, but it
> > should not be throwing them around as the default behavior.
> >
> > Also important to note is that there is exactly one place that is doing
> the
> > logging.  There does not need to be any logging at the lower levels.  A
> > catch block should throw or log, not both.  This is an anti-pattern and
> > annoying as the end user: having to deal with multiple stack traces at
> > different log levels for the same error condition.  The log message
> should
> > be at the highest level only.
> >
> > https://community.oracle.com/docs/DOC-983543#logAndThrow
> >
> > Both projects use SLF4J as the logging framework (facade anyway).  Please
> > familiarize yourself with how to correctly log an Exception.  There is no
> > need to log a thread name, a time stamp, a class name, or a stack trace.
> > The logging framework will do that all for you.
> >
> > http://www.slf4j.org/faq.html#paramException
> >
> > Again, there is no need to 'stringify' an exception. For example, do not
> > use this:
> >
> >
> >
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringUtils.java#L86
> >
> >
> > If you do however want to dump a stack trace for debugging (or trace)
> > purposes, consider performing the following:
> >
> > if (LOG.isDebugEnabled()) {
> >   LOG.debug("Dump Thread Stack", new Exception("Thread Stack Trace (Not
> an
> > Error)"));
> > }
> >
> > Finally, I've seen it a couple of times in Apache projects that enabling
> > debug-level logging causes the application to emit logs at other levels,
> > for example:
> >
> > LOG.warn("Some error occurred: {}", e.getMessage());
> > if (LOG.isDebugEnabled()) {
> >   LOG.warn("Dump Warning Thread Stack", e);
> > }
> >
> > Please refrain from doing this.  The inner log statement should be at
> DEBUG
> > level to match the check.  Otherwise, when I enable DEBUG logging in the
> > application, the expectation that I have is that I will have the exact
> > logging as the INFO level, but I will also have additional DEBUG details
> as
> > well.  I am going to be using 'grep' to find DEBUG a
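
For contrast, a minimal corrected form of that last snippet, following the advice 
above (a sketch, not code from either project):

{code:java}
// Corrected: keep the WARN summary, and make the guarded statement actually DEBUG,
// with the exception passed as the last argument so SLF4J prints the stack trace.
LOG.warn("Some error occurred: {}", e.getMessage());
if (LOG.isDebugEnabled()) {
  LOG.debug("Dump Warning Thread Stack", e);
}
{code}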

[jira] [Created] (HBASE-23371) [HBCK2] Provide client side method for removing "ghost" regions in meta.

2019-12-05 Thread Wellington Chevreuil (Jira)
Wellington Chevreuil created HBASE-23371:


 Summary: [HBCK2] Provide client side method for removing "ghost" 
regions in meta. 
 Key: HBASE-23371
 URL: https://issues.apache.org/jira/browse/HBASE-23371
 Project: HBase
  Issue Type: New Feature
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil


We found some customers facing problems where region entries are present in 
meta, but no related region dir is available on the underlying file system. 
The actual cause couldn't be confirmed with certainty, and this was happening 
on a cluster version that doesn't support the server side *fixMeta* command. 
Ultimately, the solution was to manually delete the extra regions from meta. 

This ticket proposes two new client side commands for HBCK2, similar to the 
already existing *addFsRegionsMissingInMeta* and *reportMissingRegionsInMeta*, 
for finding such entries and cleaning them out of meta when no *fixMeta* is 
available (which is the case for any version prior to "2.0.6", "2.1.6", 
"2.2.1", "2.3.0", "3.0.0").



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-23367) Remove unused constructor from WALPrettyPrinter

2019-12-05 Thread Wellington Chevreuil (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wellington Chevreuil resolved HBASE-23367.
--
Resolution: Fixed

Merged PR into master branch.

> Remove unused constructor from WALPrettyPrinter
> ---
>
> Key: HBASE-23367
> URL: https://issues.apache.org/jira/browse/HBASE-23367
> Project: HBase
>  Issue Type: Task
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Trivial
> Fix For: 3.0.0
>
>
> WALPrettyPrinter is marked as limited private and defines a currently unused 
> constructor that takes several parameters. It's not used on any of the branch-2 
> branches either. Since it's supposed to be used mainly as a CLI tool and 
> invoked via its main method, I believe this can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)