[jira] [Updated] (HBASE-2730) Expose RS work queue contents on web UI

2012-07-09 Thread Jie Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Huang updated HBASE-2730:
-

Attachment: (was: hbase-2730-0_94_0_v3.patch)

> Expose RS work queue contents on web UI
> ---
>
> Key: HBASE-2730
> URL: https://issues.apache.org/jira/browse/HBASE-2730
> Project: HBase
>  Issue Type: New Feature
>  Components: monitoring, regionserver
>Reporter: Todd Lipcon
>Priority: Critical
> Fix For: 0.96.0
>
> Attachments: dump.png, hbase-2730-0_94_0.patch, 
> hbase-2730-0_94_0.patch, hbase-2730-0_94_0_v2.patch
>
>
> Would be nice to be able to see the contents of the various work queues - eg 
> to know what regions are pending compaction/split/flush/etc. This is handy 
> for debugging why a region might be blocked, etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-2730) Expose RS work queue contents on web UI

2012-07-09 Thread Jie Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409241#comment-13409241
 ] 

Jie Huang commented on HBASE-2730:
--

Please ignore the previous patch and the comments about regions in transition on the RS. I found 
that we don't need to dump the regions that are opening or closing in the 
queue dump, since that work is already done in the Master status page. I wonder if the 
queue-dump patch is OK as it stands.
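
For illustration only, here is a tiny self-contained sketch of the kind of per-queue listing such a 
dump could produce; the queue names and region names below are made up for the example, and this is 
not the attached patch.
{code}
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Toy model: pending work per queue, printed the way a queue-dump page might show it.
public class QueueDumpSketch {
  public static void main(String[] args) {
    Map<String, Queue<String>> queues = new LinkedHashMap<>();
    queues.put("compaction", new ArrayDeque<>(Arrays.asList(
        "usertable,row-0001,1341838000000.aaa.", "usertable,row-5000,1341838000000.bbb.")));
    queues.put("flush", new ArrayDeque<>(Arrays.asList(
        "usertable,row-9000,1341838000000.ccc.")));
    queues.put("split", new ArrayDeque<>());

    for (Map.Entry<String, Queue<String>> e : queues.entrySet()) {
      System.out.println(e.getKey() + " queue: " + e.getValue().size() + " pending");
      for (String region : e.getValue()) {
        System.out.println("  " + region);
      }
    }
  }
}
{code}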

> Expose RS work queue contents on web UI
> ---
>
> Key: HBASE-2730
> URL: https://issues.apache.org/jira/browse/HBASE-2730
> Project: HBase
>  Issue Type: New Feature
>  Components: monitoring, regionserver
>Reporter: Todd Lipcon
>Priority: Critical
> Fix For: 0.96.0
>
> Attachments: dump.png, hbase-2730-0_94_0.patch, 
> hbase-2730-0_94_0.patch, hbase-2730-0_94_0_v2.patch
>
>
> Would be nice to be able to see the contents of the various work queues - eg 
> to know what regions are pending compaction/split/flush/etc. This is handy 
> for debugging why a region might be blocked, etc.





[jira] [Updated] (HBASE-4379) [hbck] Does not complain about tables with no end region [Z,]

2012-07-09 Thread Jonathan Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hsieh updated HBASE-4379:
--

Attachment: hbase-4379-90.patch
hbase-4379-92.patch

Minor tweaks for the 0.90 and 0.92 versions.

> [hbck] Does not complain about tables with no end region [Z,]
> -
>
> Key: HBASE-4379
> URL: https://issues.apache.org/jira/browse/HBASE-4379
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Affects Versions: 0.90.5, 0.92.0, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>Assignee: Anoop Sam John
> Fix For: 0.96.0, 0.94.1
>
> Attachments: 
> 0001-HBASE-4379-hbck-does-not-complain-about-tables-with-.patch, 
> HBASE-4379_94.patch, HBASE-4379_94_V2.patch, HBASE-4379_Trunk.patch, 
> TestcaseForDisabledTableIssue.patch, hbase-4379-90.patch, 
> hbase-4379-92.patch, hbase-4379.v2.patch
>
>
> hbck does not detect or have an error condition when the last region of a 
> table is missing (end key != '').





[jira] [Commented] (HBASE-4379) [hbck] Does not complain about tables with no end region [Z,]

2012-07-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409246#comment-13409246
 ] 

Hadoop QA commented on HBASE-4379:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12535634/hbase-4379-90.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2353//console

This message is automatically generated.

> [hbck] Does not complain about tables with no end region [Z,]
> -
>
> Key: HBASE-4379
> URL: https://issues.apache.org/jira/browse/HBASE-4379
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Affects Versions: 0.90.5, 0.92.0, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>Assignee: Anoop Sam John
> Fix For: 0.96.0, 0.94.1
>
> Attachments: 
> 0001-HBASE-4379-hbck-does-not-complain-about-tables-with-.patch, 
> HBASE-4379_94.patch, HBASE-4379_94_V2.patch, HBASE-4379_Trunk.patch, 
> TestcaseForDisabledTableIssue.patch, hbase-4379-90.patch, 
> hbase-4379-92.patch, hbase-4379.v2.patch
>
>
> hbck does not detect or have an error condition when the last region of a 
> table is missing (end key != '').





[jira] [Commented] (HBASE-4379) [hbck] Does not complain about tables with no end region [Z,]

2012-07-09 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409253#comment-13409253
 ] 

Jonathan Hsieh commented on HBASE-4379:
---

Anoop, sorry for the delay in committing this patch.  It looks good and tests 
fine.  I've added and committed a tweaked version for 0.90/0.92 as well.

Let's create a new issue for the new "disabling" test case with the 
offline/disabled regions, and discuss it there?


> [hbck] Does not complain about tables with no end region [Z,]
> -
>
> Key: HBASE-4379
> URL: https://issues.apache.org/jira/browse/HBASE-4379
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Affects Versions: 0.90.5, 0.92.0, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>Assignee: Anoop Sam John
> Fix For: 0.96.0, 0.94.1
>
> Attachments: 
> 0001-HBASE-4379-hbck-does-not-complain-about-tables-with-.patch, 
> HBASE-4379_94.patch, HBASE-4379_94_V2.patch, HBASE-4379_Trunk.patch, 
> TestcaseForDisabledTableIssue.patch, hbase-4379-90.patch, 
> hbase-4379-92.patch, hbase-4379.v2.patch
>
>
> hbck does not detect or have an error condition when the last region of a 
> table is missing (end key != '').





[jira] [Updated] (HBASE-4379) [hbck] Does not complain about tables with no end region [Z,]

2012-07-09 Thread Jonathan Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hsieh updated HBASE-4379:
--

   Resolution: Fixed
Fix Version/s: 0.92.2
   0.90.7
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

> [hbck] Does not complain about tables with no end region [Z,]
> -
>
> Key: HBASE-4379
> URL: https://issues.apache.org/jira/browse/HBASE-4379
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Affects Versions: 0.90.5, 0.92.0, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>Assignee: Anoop Sam John
> Fix For: 0.90.7, 0.92.2, 0.96.0, 0.94.1
>
> Attachments: 
> 0001-HBASE-4379-hbck-does-not-complain-about-tables-with-.patch, 
> HBASE-4379_94.patch, HBASE-4379_94_V2.patch, HBASE-4379_Trunk.patch, 
> TestcaseForDisabledTableIssue.patch, hbase-4379-90.patch, 
> hbase-4379-92.patch, hbase-4379.v2.patch
>
>
> hbck does not detect or have an error condition when the last region of a 
> table is missing (end key != '').





[jira] [Commented] (HBASE-6233) [brainstorm] snapshots: hardlink alternatives

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409257#comment-13409257
 ] 

stack commented on HBASE-6233:
--

It'd be a radical change, Matteo.  It feels like an HBase 2.0 kind of thing rather 
than a 0.96-type change (but this is a 'brainstorm' issue, so we have license to 
talk hypotheticals).

I think we could auto-migrate from the old format to the new: new hfiles would 
be written into the new location in HDFS while we'd still read old, unmigrated hfiles 
from the old layout.  (The "policy" for compatibility so far is that versions go 
forward, perhaps with a "migration step" but preferably not, and we do not have to 
support reverting an upgrade.)

Would we need cross-row transactions for updating file entries in .META.?  I don't think so.  
Read/write locks might be enough.

We might need to let .META. split now that it could grow large fast.

We've had "interesting" issues updating .META. in the past: e.g. a socket timeout 
on the client side while the edit went through anyway, that kind of thing.  Now the 
repercussions of a failure or a false-positive failure would be larger?

Yeah, instead of looking inside HDFS, hbck will have to read .META.  In HDFS 
we'd still have tables and regions, or not?





> [brainstorm] snapshots: hardlink alternatives
> -
>
> Key: HBASE-6233
> URL: https://issues.apache.org/jira/browse/HBASE-6233
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Attachments: Restore-Snapshot-Hardlink-alternatives.pdf
>
>
> Discussion ticket around snapshots and hardlink alternatives.
> (See the HDFS-3370 discussion about hardlink and implementation problems)
> (taking for a moment WAL out of the discussion and focusing on hfiles)
> With hardlinks available taking snapshot will be fairly easy:
> * (hfiles are immutable)
> * hardlink to .snapshot/name to take snapshot
> * hardlink from .snapshot/name to restore the snapshot
> * No code change needed (on fs.delete() only one reference is deleted)
> but we don't have hardlinks, what are the alternatives?





[jira] [Commented] (HBASE-6350) Some logging improvements for RegionServer bulk loading

2012-07-09 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409281#comment-13409281
 ] 

Zhihong Ted Yu commented on HBASE-6350:
---

Patch integrated to trunk.

Thanks for the patch, Harsh.

> Some logging improvements for RegionServer bulk loading
> ---
>
> Key: HBASE-6350
> URL: https://issues.apache.org/jira/browse/HBASE-6350
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Affects Versions: 0.94.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HBASE-6350.patch
>
>
> The current logging in the bulk loading RPC call to a RegionServer lacks some 
> info in certain cases. For instance, I recently noticed that it is possible 
> that IOException may be caused during bulk load file transfer (copy) off of 
> another FS and that during the same time the client already times the socket 
> out and thereby does not receive a thrown Exception back remotely (HBase 
> prints a ClosedChannelException for the IPC when it attempts to send the real 
> message, and hence the real cause is lost).
> Improvements around this kind of issue, wherein we could first log the 
> IOException at the RS before sending, and a few other wording improvements 
> are present in my patch.





[jira] [Commented] (HBASE-5145) HMasterCommandLine's -minServers seems to be useless.

2012-07-09 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409286#comment-13409286
 ] 

Zhihong Ted Yu commented on HBASE-5145:
---

Here is javadoc for waitForRegionServers():
{code}
   * Wait for the region servers to report in.
   * We will wait until one of this condition is met:
   *  - the master is stopped
   *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
   *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
   *region servers is reached
   *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
   *   there have been no new region server in for
   *  'hbase.master.wait.on.regionservers.interval' time
{code}
If we expose 'hbase.master.wait.on.regionservers.mintostart', should we expose 
the other parameters as well?

Also, the following is difficult to follow:
{code}
opt.addOption("minServers", true, "Minimum RegionServers needed to host 
user tables");
{code}
We should change the wording.

What do you think?
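
For context, here is a minimal sketch of the wait loop those four properties drive, following the 
conditions in the javadoc quoted above; the class, the helper interface, and the default values are 
assumptions for the example, not the actual HMaster code.
{code}
import org.apache.hadoop.conf.Configuration;

public class WaitForRegionServersSketch {

  /** Minimal stand-in for the master's view of region servers (assumed for the sketch). */
  public interface ServerTracker {
    boolean isStopped();
    int getOnlineServersCount();
  }

  public static void waitForRegionServers(Configuration conf, ServerTracker tracker)
      throws InterruptedException {
    long timeout    = conf.getLong("hbase.master.wait.on.regionservers.timeout", 4500);
    int  minToStart = conf.getInt("hbase.master.wait.on.regionservers.mintostart", 1);
    int  maxToStart = conf.getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
    long interval   = conf.getLong("hbase.master.wait.on.regionservers.interval", 1500);

    long start = System.currentTimeMillis();
    long lastCheckIn = start;   // last time a new region server reported in
    int  lastCount = 0;

    // Keep waiting until: the master stops, the timeout expires, maxtostart is reached,
    // or mintostart is reached and no new server has checked in for 'interval' ms.
    while (!tracker.isStopped()
        && System.currentTimeMillis() - start < timeout
        && tracker.getOnlineServersCount() < maxToStart
        && (tracker.getOnlineServersCount() < minToStart
            || System.currentTimeMillis() - lastCheckIn < interval)) {
      if (tracker.getOnlineServersCount() > lastCount) {
        lastCount = tracker.getOnlineServersCount();
        lastCheckIn = System.currentTimeMillis();
      }
      Thread.sleep(100);
    }
  }
}
{code}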

> HMasterCommandLine's -minServers seems to be useless.
> -
>
> Key: HBASE-5145
> URL: https://issues.apache.org/jira/browse/HBASE-5145
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: 0.94.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: HBASE-5145.patch
>
>
> HMasterCommandLine gets a number via -minServers opt. and sets it to a config 
> param "hbase.regions.server.count.min".
> This config is not used anywhere else.
> Perhaps it wants to use "hbase.master.wait.on.regionservers.mintostart" 
> instead?





[jira] [Commented] (HBASE-5145) HMasterCommandLine's -minServers seems to be useless.

2012-07-09 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409289#comment-13409289
 ] 

Harsh J commented on HBASE-5145:


bq. If we expose 'hbase.master.wait.on.regionservers.mintostart', should we 
expose the other parameters as well?

We already do expose those properties (they are read from the configs).

The plan is to make them use proper constants via HBASE-3274, which I've been 
slacking on lately but plan to rebase and resume very soon.

So for this change, which in reality only removes the useless property that 
minServers was setting and uses the right one instead (a bug fix), I thought I'd 
also constantize them as I go.

bq. We should change the wording.

Agreed, and will do.
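
As a small illustration of the "constantize" idea (the constant names below are assumed, not 
necessarily the ones HBASE-3274 introduces):
{code}
// Sketch only: string keys gathered into constants so callers stop repeating literals.
public final class WaitOnRegionServersKeys {
  public static final String TIMEOUT_KEY      = "hbase.master.wait.on.regionservers.timeout";
  public static final String MIN_TO_START_KEY = "hbase.master.wait.on.regionservers.mintostart";
  public static final String MAX_TO_START_KEY = "hbase.master.wait.on.regionservers.maxtostart";
  public static final String INTERVAL_KEY     = "hbase.master.wait.on.regionservers.interval";

  private WaitOnRegionServersKeys() {
    // no instances
  }
}
{code}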



> HMasterCommandLine's -minServers seems to be useless.
> -
>
> Key: HBASE-5145
> URL: https://issues.apache.org/jira/browse/HBASE-5145
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: 0.94.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: HBASE-5145.patch
>
>
> HMasterCommandLine gets a number via -minServers opt. and sets it to a config 
> param "hbase.regions.server.count.min".
> This config is not used anywhere else.
> Perhaps it wants to use "hbase.master.wait.on.regionservers.mintostart" 
> instead?





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409359#comment-13409359
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
Some comments on RB.

> Regions's in OPENING state from failed regionservers takes a long time to 
> recover
> -
>
> Key: HBASE-6060
> URL: https://issues.apache.org/jira/browse/HBASE-6060
> Project: HBase
>  Issue Type: Bug
>  Components: master, regionserver
>Reporter: Enis Soztutar
>Assignee: rajeshbabu
> Fix For: 0.96.0, 0.94.1, 0.92.3
>
> Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 
> 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, 
> 6060-trunk_3.patch, 6060_alternative_suggestion.txt, 
> 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, 
> 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, 
> HBASE-6060-92.patch, HBASE-6060-94.patch, HBASE-6060-trunk_4.patch, 
> HBASE-6060_trunk_5.patch
>
>
> we have seen a pattern in tests, that the regions are stuck in OPENING state 
> for a very long time when the region server who is opening the region fails. 
> My understanding of the process: 
>  
>  - master calls rs to open the region. If rs is offline, a new plan is 
> generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in 
> master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), 
> HMaster.assign()
>  - RegionServer, starts opening a region, changes the state in znode. But 
> that znode is not ephemeral. (see ZkAssign)
>  - Rs transitions zk node from OFFLINE to OPENING. See 
> OpenRegionHandler.process()
>  - rs then opens the region, and changes znode from OPENING to OPENED
>  - when rs is killed between OPENING and OPENED states, then zk shows OPENING 
> state, and the master just waits for rs to change the region state, but since 
> rs is down, that wont happen. 
>  - There is a AssignmentManager.TimeoutMonitor, which does exactly guard 
> against these kind of conditions. It periodically checks (every 10 sec by 
> default) the regions in transition to see whether they timedout 
> (hbase.master.assignment.timeoutmonitor.timeout). Default timeout is 30 min, 
> which explains what you and I are seeing. 
>  - ServerShutdownHandler in Master does not reassign regions in OPENING 
> state, although it handles other states. 
> Lowering that threshold from the configuration is one option, but still I 
> think we can do better. 
> Will investigate more. 





[jira] [Updated] (HBASE-6317) Master clean start up and Partially enabled tables make region assignment inconsistent.

2012-07-09 Thread rajeshbabu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rajeshbabu updated HBASE-6317:
--

Attachment: HBASE-6317_94.patch

> Master clean start up and Partially enabled tables make region assignment 
> inconsistent.
> ---
>
> Key: HBASE-6317
> URL: https://issues.apache.org/jira/browse/HBASE-6317
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.2, 0.96.0, 0.94.1
>
> Attachments: HBASE-6317_94.patch
>
>
> If we have a  table in partially enabled state (ENABLING) then on HMaster 
> restart we treat it as a clean cluster start up and do a bulk assign.  
> Currently in 0.94 bulk assign will not handle ALREADY_OPENED scenarios and it 
> leads to region assignment problems.  Analysing more on this we found that we 
> have better way to handle these scenarios.
> {code}
> if (false == checkIfRegionBelongsToDisabled(regionInfo)
> && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
>   synchronized (this.regions) {
> regions.put(regionInfo, regionLocation);
> addToServers(regionLocation, regionInfo);
>   }
> {code}
> We dont add to regions map so that enable table handler can handle it.  But 
> as nothing is added to regions map we think it as a clean cluster start up.
> Will come up with a patch tomorrow.





[jira] [Assigned] (HBASE-6317) Master clean start up and Partially enabled tables make region assignment inconsistent.

2012-07-09 Thread rajeshbabu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rajeshbabu reassigned HBASE-6317:
-

Assignee: rajeshbabu  (was: ramkrishna.s.vasudevan)

> Master clean start up and Partially enabled tables make region assignment 
> inconsistent.
> ---
>
> Key: HBASE-6317
> URL: https://issues.apache.org/jira/browse/HBASE-6317
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
>Assignee: rajeshbabu
> Fix For: 0.92.2, 0.96.0, 0.94.1
>
> Attachments: HBASE-6317_94.patch
>
>
> If we have a  table in partially enabled state (ENABLING) then on HMaster 
> restart we treat it as a clean cluster start up and do a bulk assign.  
> Currently in 0.94 bulk assign will not handle ALREADY_OPENED scenarios and it 
> leads to region assignment problems.  Analysing more on this we found that we 
> have better way to handle these scenarios.
> {code}
> if (false == checkIfRegionBelongsToDisabled(regionInfo)
> && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
>   synchronized (this.regions) {
> regions.put(regionInfo, regionLocation);
> addToServers(regionLocation, regionInfo);
>   }
> {code}
> We dont add to regions map so that enable table handler can handle it.  But 
> as nothing is added to regions map we think it as a clean cluster start up.
> Will come up with a patch tomorrow.





[jira] [Commented] (HBASE-6317) Master clean start up and Partially enabled tables make region assignment inconsistent.

2012-07-09 Thread rajeshbabu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409503#comment-13409503
 ] 

rajeshbabu commented on HBASE-6317:
---

Patch for 0.94.
As per the current code, two scenarios may make assignment inconsistent.

1) In EnableTableHandler we don't assign regions if they are already present in the regions map.
{code}
final List<HRegionInfo> onlineRegions =
  this.assignmentManager.getRegionsOfTable(tableName);
regionsInMeta.removeAll(onlineRegions);
{code}
But in the case of an enabling table's regions during master start up, we do not add 
them to the regions map in rebuildUserRegions, even when the regions are in transition 
to, or already open on, online servers.
{code}
if (false == checkIfRegionBelongsToDisabled(regionInfo)
    && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
  synchronized (this.regions) {
    regions.put(regionInfo, regionLocation);
    addToServers(regionLocation, regionInfo);
  }
}
{code}

So we will call assign for all the regions, even those that are in transition or already 
assigned to online servers, which may cause double assignment.

2) If all the tables are in ENABLING we may treat it as a clean cluster 
startup (because the regions map is empty) and again call assignment for all the 
regions, which may again cause double assignment.

This patch solves these problems. Please review and provide comments or 
suggestions.
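
To make the double-assignment risk concrete, here is a small self-contained model of the scenario 
described above; it is not HBase code, and every name in it is invented for the illustration.
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the startup scenario: regions of an ENABLING table that are already
// open on region servers are left out of the rebuilt regions map, so the startup
// looks "clean" and the same regions get assigned again.
public class DoubleAssignmentModel {
  public static void main(String[] args) {
    Set<String> openOnServers = new HashSet<>();
    openOnServers.add("enabling-table,region-1");      // already open on some RS

    Map<String, String> regionsMap = new HashMap<>();  // rebuild skipped the ENABLING table
    boolean looksLikeCleanStartup = regionsMap.isEmpty();

    Set<String> regionsInMeta = new HashSet<>();
    regionsInMeta.add("enabling-table,region-1");
    regionsInMeta.removeAll(regionsMap.keySet());      // nothing is filtered out

    for (String region : regionsInMeta) {
      if (openOnServers.contains(region)) {
        System.out.println("Would assign " + region + " again (clean startup? "
            + looksLikeCleanStartup + ") -> double assignment");
      }
    }
  }
}
{code}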

> Master clean start up and Partially enabled tables make region assignment 
> inconsistent.
> ---
>
> Key: HBASE-6317
> URL: https://issues.apache.org/jira/browse/HBASE-6317
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.2, 0.96.0, 0.94.1
>
> Attachments: HBASE-6317_94.patch
>
>
> If we have a  table in partially enabled state (ENABLING) then on HMaster 
> restart we treat it as a clean cluster start up and do a bulk assign.  
> Currently in 0.94 bulk assign will not handle ALREADY_OPENED scenarios and it 
> leads to region assignment problems.  Analysing more on this we found that we 
> have better way to handle these scenarios.
> {code}
> if (false == checkIfRegionBelongsToDisabled(regionInfo)
> && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
>   synchronized (this.regions) {
> regions.put(regionInfo, regionLocation);
> addToServers(regionLocation, regionInfo);
>   }
> {code}
> We dont add to regions map so that enable table handler can handle it.  But 
> as nothing is added to regions map we think it as a clean cluster start up.
> Will come up with a patch tomorrow.





[jira] [Created] (HBASE-6356) printStackTrace in FSUtils

2012-07-09 Thread nkeywal (JIRA)
nkeywal created HBASE-6356:
--

 Summary: printStackTrace in FSUtils
 Key: HBASE-6356
 URL: https://issues.apache.org/jira/browse/HBASE-6356
 Project: HBase
  Issue Type: Bug
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Priority: Trivial


This is bad...
{format}
public boolean accept(Path p) {
  boolean isValid = false;
  try {
if (HConstants.HBASE_NON_USER_TABLE_DIRS.contains(p.toString())) {
  isValid = false;
} else {
isValid = this.fs.getFileStatus(p).isDir();
}
  } catch (IOException e) {
e.printStackTrace();  < 
  }
  return isValid;
}
  }
{format}





[jira] [Updated] (HBASE-6356) printStackTrace in FSUtils

2012-07-09 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6356:
---

Description: 
This is bad...
{noformat}
public boolean accept(Path p) {
  boolean isValid = false;
  try {
if (HConstants.HBASE_NON_USER_TABLE_DIRS.contains(p.toString())) {
  isValid = false;
} else {
isValid = this.fs.getFileStatus(p).isDir();
}
  } catch (IOException e) {
e.printStackTrace();  < 
  }
  return isValid;
}
  }
{noformat}

  was:
This is bad...
{format}
public boolean accept(Path p) {
  boolean isValid = false;
  try {
if (HConstants.HBASE_NON_USER_TABLE_DIRS.contains(p.toString())) {
  isValid = false;
} else {
isValid = this.fs.getFileStatus(p).isDir();
}
  } catch (IOException e) {
e.printStackTrace();  < 
  }
  return isValid;
}
  }
{format}


> printStackTrace in FSUtils
> --
>
> Key: HBASE-6356
> URL: https://issues.apache.org/jira/browse/HBASE-6356
> Project: HBase
>  Issue Type: Bug
>  Components: client, master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Priority: Trivial
>
> This is bad...
> {noformat}
> public boolean accept(Path p) {
>   boolean isValid = false;
>   try {
> if (HConstants.HBASE_NON_USER_TABLE_DIRS.contains(p.toString())) {
>   isValid = false;
> } else {
> isValid = this.fs.getFileStatus(p).isDir();
> }
>   } catch (IOException e) {
> e.printStackTrace();  < 
>   }
>   return isValid;
> }
>   }
> {noformat}





[jira] [Updated] (HBASE-6356) printStackTrace in FSUtils

2012-07-09 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6356:
-

  Tags: noob
Labels: noob  (was: )

> printStackTrace in FSUtils
> --
>
> Key: HBASE-6356
> URL: https://issues.apache.org/jira/browse/HBASE-6356
> Project: HBase
>  Issue Type: Bug
>  Components: client, master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Priority: Trivial
>  Labels: noob
>
> This is bad...
> {noformat}
> public boolean accept(Path p) {
>   boolean isValid = false;
>   try {
> if (HConstants.HBASE_NON_USER_TABLE_DIRS.contains(p.toString())) {
>   isValid = false;
> } else {
> isValid = this.fs.getFileStatus(p).isDir();
> }
>   } catch (IOException e) {
> e.printStackTrace();  < 
>   }
>   return isValid;
> }
>   }
> {noformat}





[jira] [Commented] (HBASE-6356) printStackTrace in FSUtils

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409543#comment-13409543
 ] 

stack commented on HBASE-6356:
--

I suppose we're not allowed to throw in here?  Should we throw a RuntimeException?

> printStackTrace in FSUtils
> --
>
> Key: HBASE-6356
> URL: https://issues.apache.org/jira/browse/HBASE-6356
> Project: HBase
>  Issue Type: Bug
>  Components: client, master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Priority: Trivial
>  Labels: noob
>
> This is bad...
> {noformat}
> public boolean accept(Path p) {
>   boolean isValid = false;
>   try {
> if (HConstants.HBASE_NON_USER_TABLE_DIRS.contains(p.toString())) {
>   isValid = false;
> } else {
> isValid = this.fs.getFileStatus(p).isDir();
> }
>   } catch (IOException e) {
> e.printStackTrace();  < 
>   }
>   return isValid;
> }
>   }
> {noformat}





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409546#comment-13409546
 ] 

stack commented on HBASE-6060:
--

Responded Ram.

> Regions's in OPENING state from failed regionservers takes a long time to 
> recover
> -
>
> Key: HBASE-6060
> URL: https://issues.apache.org/jira/browse/HBASE-6060
> Project: HBase
>  Issue Type: Bug
>  Components: master, regionserver
>Reporter: Enis Soztutar
>Assignee: rajeshbabu
> Fix For: 0.96.0, 0.94.1, 0.92.3
>
> Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 
> 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, 
> 6060-trunk_3.patch, 6060_alternative_suggestion.txt, 
> 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, 
> 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, 
> HBASE-6060-92.patch, HBASE-6060-94.patch, HBASE-6060-trunk_4.patch, 
> HBASE-6060_trunk_5.patch
>
>
> we have seen a pattern in tests, that the regions are stuck in OPENING state 
> for a very long time when the region server who is opening the region fails. 
> My understanding of the process: 
>  
>  - master calls rs to open the region. If rs is offline, a new plan is 
> generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in 
> master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), 
> HMaster.assign()
>  - RegionServer, starts opening a region, changes the state in znode. But 
> that znode is not ephemeral. (see ZkAssign)
>  - Rs transitions zk node from OFFLINE to OPENING. See 
> OpenRegionHandler.process()
>  - rs then opens the region, and changes znode from OPENING to OPENED
>  - when rs is killed between OPENING and OPENED states, then zk shows OPENING 
> state, and the master just waits for rs to change the region state, but since 
> rs is down, that wont happen. 
>  - There is a AssignmentManager.TimeoutMonitor, which does exactly guard 
> against these kind of conditions. It periodically checks (every 10 sec by 
> default) the regions in transition to see whether they timedout 
> (hbase.master.assignment.timeoutmonitor.timeout). Default timeout is 30 min, 
> which explains what you and I are seeing. 
>  - ServerShutdownHandler in Master does not reassign regions in OPENING 
> state, although it handles other states. 
> Lowering that threshold from the configuration is one option, but still I 
> think we can do better. 
> Will investigate more. 





[jira] [Commented] (HBASE-5151) Rename "hbase.skip.errors" in HRegion as it is too general-sounding.

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409551#comment-13409551
 ] 

stack commented on HBASE-5151:
--

@Harsh Patch looks good.  Does the above test fail for you?

> Rename "hbase.skip.errors" in HRegion as it is too general-sounding.
> 
>
> Key: HBASE-5151
> URL: https://issues.apache.org/jira/browse/HBASE-5151
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: 0.94.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: HBASE-5151.patch
>
>
> We should rename "hbase.skip.errors", used in HRegion.java for skipping 
> errors when replaying edits. It should probably be something more like 
> "hbase.hregion.edits.replay.skip.errors" or so.





[jira] [Commented] (HBASE-6356) printStackTrace in FSUtils

2012-07-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409552#comment-13409552
 ] 

nkeywal commented on HBASE-6356:


Yes, we're not allowed to throw. I wonder if it's better to log.warn or to stop 
here. Today when it happens it returns false, so the easy option is to just log 
and say that we keep backward compatibility this way...
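
A minimal sketch of that log-and-return-false option (assuming commons-logging and the Hadoop 
FileSystem API; this is just one way to do it, not the committed fix):
{code}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.hbase.HConstants;

// Same filter as quoted above, but the IOException is logged instead of being
// printed to stderr; the filter still answers false on error to keep behaviour.
class DirFilter implements PathFilter {
  private static final Log LOG = LogFactory.getLog(DirFilter.class);
  private final FileSystem fs;

  DirFilter(FileSystem fs) {
    this.fs = fs;
  }

  public boolean accept(Path p) {
    boolean isValid = false;
    try {
      if (HConstants.HBASE_NON_USER_TABLE_DIRS.contains(p.toString())) {
        isValid = false;
      } else {
        isValid = this.fs.getFileStatus(p).isDir();
      }
    } catch (IOException e) {
      LOG.warn("Could not check whether [" + p + "] is a valid directory; "
          + "treating it as not valid and continuing.", e);
    }
    return isValid;
  }
}
{code}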

> printStackTrace in FSUtils
> --
>
> Key: HBASE-6356
> URL: https://issues.apache.org/jira/browse/HBASE-6356
> Project: HBase
>  Issue Type: Bug
>  Components: client, master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Priority: Trivial
>  Labels: noob
>
> This is bad...
> {noformat}
> public boolean accept(Path p) {
>   boolean isValid = false;
>   try {
> if (HConstants.HBASE_NON_USER_TABLE_DIRS.contains(p.toString())) {
>   isValid = false;
> } else {
> isValid = this.fs.getFileStatus(p).isDir();
> }
>   } catch (IOException e) {
> e.printStackTrace();  < 
>   }
>   return isValid;
> }
>   }
> {noformat}





[jira] [Commented] (HBASE-6337) [MTTR] Remove renaming tmp log file in SplitLogManager

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409557#comment-13409557
 ] 

stack commented on HBASE-6337:
--

@Lars I'm not sure why it was done originally.  Prakash was probably being 
cautious.  As long as a region does not open before its log split completes, as 
Chunhui says, I think we should be fine.  The code as is already overwrites 
recovered.edits files with the same sequence number (which could happen if a first 
split failed and we then rerun it before the region opens).  Maybe there are failure 
scenarios we've not yet encountered or imagined -- have you tried to conjure up any 
new ones, Chunhui?  And don't we have tests for some common failures already? 
-- but this approach should be fine, I believe.

> [MTTR] Remove renaming tmp log file in SplitLogManager 
> ---
>
> Key: HBASE-6337
> URL: https://issues.apache.org/jira/browse/HBASE-6337
> Project: HBase
>  Issue Type: Bug
>Reporter: chunhui shen
>Assignee: chunhui shen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6337v1.patch, HBASE-6337v2.patch, 
> HBASE-6337v3.patch
>
>
> As HBASE-6309 mentioned, we also encounter problem of 
> distributed-log-splitting take much more time than matser-local-log-splitting 
> because lots of SplitLogManager 's renaming operations when finishing task.
> Could we try to remove renaming tmp log file in SplitLogManager through 
> splitting log to regions' recover.edits directory directly as the same as the 
> master-local-log-splitting.





[jira] [Commented] (HBASE-6337) [MTTR] Remove renaming tmp log file in SplitLogManager

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409564#comment-13409564
 ] 

stack commented on HBASE-6337:
--

@Chunhui The patch looks good.  Thanks for all the cleanup.  moveSplitLogFile 
could be named better since there is no actual moving being done any more.  Call 
it finishSplitLog or something?  Have you checked the region-open side of 
the affair to see whether there are any conditions under which we might bungle the 
replay of recovered.edits files?  I don't think it's possible as long as the split 
finishes successfully before the region opens (or we crash out the cluster if we 
can't split).

> [MTTR] Remove renaming tmp log file in SplitLogManager 
> ---
>
> Key: HBASE-6337
> URL: https://issues.apache.org/jira/browse/HBASE-6337
> Project: HBase
>  Issue Type: Bug
>Reporter: chunhui shen
>Assignee: chunhui shen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6337v1.patch, HBASE-6337v2.patch, 
> HBASE-6337v3.patch
>
>
> As HBASE-6309 mentioned, we also encounter problem of 
> distributed-log-splitting take much more time than matser-local-log-splitting 
> because lots of SplitLogManager 's renaming operations when finishing task.
> Could we try to remove renaming tmp log file in SplitLogManager through 
> splitting log to regions' recover.edits directory directly as the same as the 
> master-local-log-splitting.





[jira] [Updated] (HBASE-6350) Some logging improvements for RegionServer bulk loading

2012-07-09 Thread Zhihong Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Ted Yu updated HBASE-6350:
--

Fix Version/s: 0.96.0
 Hadoop Flags: Reviewed

> Some logging improvements for RegionServer bulk loading
> ---
>
> Key: HBASE-6350
> URL: https://issues.apache.org/jira/browse/HBASE-6350
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Affects Versions: 0.94.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Fix For: 0.96.0
>
> Attachments: HBASE-6350.patch
>
>
> The current logging in the bulk loading RPC call to a RegionServer lacks some 
> info in certain cases. For instance, I recently noticed that it is possible 
> that IOException may be caused during bulk load file transfer (copy) off of 
> another FS and that during the same time the client already times the socket 
> out and thereby does not receive a thrown Exception back remotely (HBase 
> prints a ClosedChannelException for the IPC when it attempts to send the real 
> message, and hence the real cause is lost).
> Improvements around this kind of issue, wherein we could first log the 
> IOException at the RS before sending, and a few other wording improvements 
> are present in my patch.





[jira] [Commented] (HBASE-6317) Master clean start up and Partially enabled tables make region assignment inconsistent.

2012-07-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409634#comment-13409634
 ] 

ramkrishna.s.vasudevan commented on HBASE-6317:
---

Just to add on to Rajesh's comment:
If we are in a master-restart scenario, even if round-robin is set to true we 
still go with single assignment, just to avoid the problems of bulk assignment 
that may lead to region assignment inconsistency.

> Master clean start up and Partially enabled tables make region assignment 
> inconsistent.
> ---
>
> Key: HBASE-6317
> URL: https://issues.apache.org/jira/browse/HBASE-6317
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
>Assignee: rajeshbabu
> Fix For: 0.92.2, 0.96.0, 0.94.1
>
> Attachments: HBASE-6317_94.patch
>
>
> If we have a  table in partially enabled state (ENABLING) then on HMaster 
> restart we treat it as a clean cluster start up and do a bulk assign.  
> Currently in 0.94 bulk assign will not handle ALREADY_OPENED scenarios and it 
> leads to region assignment problems.  Analysing more on this we found that we 
> have better way to handle these scenarios.
> {code}
> if (false == checkIfRegionBelongsToDisabled(regionInfo)
> && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
>   synchronized (this.regions) {
> regions.put(regionInfo, regionLocation);
> addToServers(regionLocation, regionInfo);
>   }
> {code}
> We dont add to regions map so that enable table handler can handle it.  But 
> as nothing is added to regions map we think it as a clean cluster start up.
> Will come up with a patch tomorrow.





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409639#comment-13409639
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

@Stack
OK with the approach, Stack. A few cases may still be hiding out there :)
Nice of you, Stack.

> Regions's in OPENING state from failed regionservers takes a long time to 
> recover
> -
>
> Key: HBASE-6060
> URL: https://issues.apache.org/jira/browse/HBASE-6060
> Project: HBase
>  Issue Type: Bug
>  Components: master, regionserver
>Reporter: Enis Soztutar
>Assignee: rajeshbabu
> Fix For: 0.96.0, 0.94.1, 0.92.3
>
> Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 
> 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, 
> 6060-trunk_3.patch, 6060_alternative_suggestion.txt, 
> 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, 
> 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, 
> HBASE-6060-92.patch, HBASE-6060-94.patch, HBASE-6060-trunk_4.patch, 
> HBASE-6060_trunk_5.patch
>
>
> we have seen a pattern in tests, that the regions are stuck in OPENING state 
> for a very long time when the region server who is opening the region fails. 
> My understanding of the process: 
>  
>  - master calls rs to open the region. If rs is offline, a new plan is 
> generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in 
> master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), 
> HMaster.assign()
>  - RegionServer, starts opening a region, changes the state in znode. But 
> that znode is not ephemeral. (see ZkAssign)
>  - Rs transitions zk node from OFFLINE to OPENING. See 
> OpenRegionHandler.process()
>  - rs then opens the region, and changes znode from OPENING to OPENED
>  - when rs is killed between OPENING and OPENED states, then zk shows OPENING 
> state, and the master just waits for rs to change the region state, but since 
> rs is down, that wont happen. 
>  - There is a AssignmentManager.TimeoutMonitor, which does exactly guard 
> against these kind of conditions. It periodically checks (every 10 sec by 
> default) the regions in transition to see whether they timedout 
> (hbase.master.assignment.timeoutmonitor.timeout). Default timeout is 30 min, 
> which explains what you and I are seeing. 
>  - ServerShutdownHandler in Master does not reassign regions in OPENING 
> state, although it handles other states. 
> Lowering that threshold from the configuration is one option, but still I 
> think we can do better. 
> Will investigate more. 





[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover

2012-07-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409640#comment-13409640
 ] 

ramkrishna.s.vasudevan commented on HBASE-6060:
---

Do we need to test the performance, since the transition from OFFLINE to OPENING 
is now done in the main thread?

> Regions's in OPENING state from failed regionservers takes a long time to 
> recover
> -
>
> Key: HBASE-6060
> URL: https://issues.apache.org/jira/browse/HBASE-6060
> Project: HBase
>  Issue Type: Bug
>  Components: master, regionserver
>Reporter: Enis Soztutar
>Assignee: rajeshbabu
> Fix For: 0.96.0, 0.94.1, 0.92.3
>
> Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 
> 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, 
> 6060-trunk_3.patch, 6060_alternative_suggestion.txt, 
> 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, 
> 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, 
> HBASE-6060-92.patch, HBASE-6060-94.patch, HBASE-6060-trunk_4.patch, 
> HBASE-6060_trunk_5.patch
>
>
> we have seen a pattern in tests, that the regions are stuck in OPENING state 
> for a very long time when the region server who is opening the region fails. 
> My understanding of the process: 
>  
>  - master calls rs to open the region. If rs is offline, a new plan is 
> generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in 
> master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), 
> HMaster.assign()
>  - RegionServer, starts opening a region, changes the state in znode. But 
> that znode is not ephemeral. (see ZkAssign)
>  - Rs transitions zk node from OFFLINE to OPENING. See 
> OpenRegionHandler.process()
>  - rs then opens the region, and changes znode from OPENING to OPENED
>  - when rs is killed between OPENING and OPENED states, then zk shows OPENING 
> state, and the master just waits for rs to change the region state, but since 
> rs is down, that wont happen. 
>  - There is a AssignmentManager.TimeoutMonitor, which does exactly guard 
> against these kind of conditions. It periodically checks (every 10 sec by 
> default) the regions in transition to see whether they timedout 
> (hbase.master.assignment.timeoutmonitor.timeout). Default timeout is 30 min, 
> which explains what you and I are seeing. 
>  - ServerShutdownHandler in Master does not reassign regions in OPENING 
> state, although it handles other states. 
> Lowering that threshold from the configuration is one option, but still I 
> think we can do better. 
> Will investigate more. 





[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

2012-07-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409663#comment-13409663
 ] 

nkeywal commented on HBASE-5843:


Some test results:

I tested the following scenarios on a local machine: a pseudo-distributed 
cluster with ZooKeeper and HBase writing to a RAM drive, no datanode or 
namenode, with 2 region servers and one empty table with 5K regions on each RS. 
Versions were taken Monday the 2nd.

1) Clean stop of one RS; wait for all regions to become online again:
0.92: ~800 seconds
0.96: ~13 seconds

=> Huge improvement, hopefully from stuff like HBASE-5970 and HBASE-6109.

1.1) As above with 2Mb memory per server
Results as 1)

=> Results don't depend on any GC stuff (memory reported is around 200 Mb)


2) Kill -9 of a RS; wait for all regions to become online again:
0.92: 980s
0.96: ~13s

=> The 180s gap comes from HBASE-5844. For master, HBASE-5926 is not tested but 
should bring similar results.



3) Start of the cluster after a clean stop; wait for all regions to
become online.
0.92: ~1020s
0.94: ~1023s (tested once only)
0.96: ~31s

=> The benefit is visible at startup
=> This does not come from something implemented for 0.94



4) As 3) But with HBase on a local HD
0.92: ~1044s (tested once only)
0.96: ~28s (tested once only)

=> Similar results. Seems that HBase i/o was not and is not becoming the 
bottleneck.


5) As 1) With 4RS instead of 2
0.92) 406s
0.96) 6s

=> Twice faster in both cases. Scales with the number of RS with both versions 
on this minimalistic test.



6) As 3), but with ZK on a local HD
Impossible to get something consistent here; it is machine- and test-dependent.
The most credible result was similar to 2).
From the ZK mailing list and ZOOKEEPER-866, it seems that is what we should expect.



7) With 2 RS, insert 20M simple puts, then kill -9 the second one. See how long 
it takes to have all the regions available.
0.92) 180s detection time, then hangs (twice out of 2 tests).
0.96) 14s (hangs once out of 3 tests)

=> There's a bug ;-)
=> The test needs to be changed to show a real difference when we need to replay the WAL.


> Improve HBase MTTR - Mean Time To Recover
> -
>
> Key: HBASE-5843
> URL: https://issues.apache.org/jira/browse/HBASE-5843
> Project: HBase
>  Issue Type: Umbrella
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
>
> A part of the approach is described here: 
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failure impact client applications only by an added delay to execute a 
> query, whatever the failure.
> - this delay is always inferior to 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks as stop/start of a cluster.





[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

2012-07-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409666#comment-13409666
 ] 

nkeywal commented on HBASE-5843:


@andrew I had a look at HBASE-5844 and HBASE-5926; they have a small dependency 
on protobuf stuff that I had forgotten (they read the server name from ZK), so 
it's not a pure git port.

> Improve HBase MTTR - Mean Time To Recover
> -
>
> Key: HBASE-5843
> URL: https://issues.apache.org/jira/browse/HBASE-5843
> Project: HBase
>  Issue Type: Umbrella
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
>
> A part of the approach is described here: 
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failure impact client applications only by an added delay to execute a 
> query, whatever the failure.
> - this delay is always inferior to 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks as stop/start of a cluster.





[jira] [Updated] (HBASE-6233) [brainstorm] snapshots: hardlink alternatives

2012-07-09 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-6233:
---

Attachment: Restore-Snapshot-Hardlink-alternatives-v2.pdf

Updated the doc to cover the different hbase.root file-system layout idea.
Removed the extra symlink for snapshot in the "Move & Symlink" approach.
And added some notes about why we can't just rely on .META. refcount with the 
current layout.

> [brainstorm] snapshots: hardlink alternatives
> -
>
> Key: HBASE-6233
> URL: https://issues.apache.org/jira/browse/HBASE-6233
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Attachments: Restore-Snapshot-Hardlink-alternatives-v2.pdf, 
> Restore-Snapshot-Hardlink-alternatives.pdf
>
>
> Discussion ticket around snapshots and hardlink alternatives.
> (See the HDFS-3370 discussion about hardlink and implementation problems)
> (taking for a moment WAL out of the discussion and focusing on hfiles)
> With hardlinks available, taking a snapshot would be fairly easy:
> * (hfiles are immutable)
> * hardlink to .snapshot/name to take a snapshot
> * hardlink from .snapshot/name to restore the snapshot
> * No code change needed (on fs.delete() only one reference is deleted)
> But we don't have hardlinks, so what are the alternatives?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6290) Add a function to mark a server as dead and start the recovery process

2012-07-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409675#comment-13409675
 ] 

nkeywal commented on HBASE-6290:


@stack We should not try to connect to the RS; someone we trust told us it was dead. 
Or if we do try, it should be with a minimal timeout (if not, our socket timeout 
will be longer than the ZooKeeper timeout). So the shell command should just 
clean the znodes associated with an IP.

It could also be in ZK, or very strongly linked to ZK if we can: if the API 
allows it, get the sessions associated with this IP and expire them. We know it's 
easy to expire a session :-).
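
For illustration only, a minimal sketch of what such a cleanup could look like using the plain ZooKeeper client; the znode layout parameter and the considerAsDead name are assumptions, not the actual shell command or HBase API:

{code}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class ConsiderAsDeadSketch {
  // Hypothetical helper: delete the znode of every region server whose name
  // starts with the given host, so the master starts recovery without waiting
  // for the ZooKeeper session timeout to expire on its own.
  public static void considerAsDead(ZooKeeper zk, String rsParentZNode, String host)
      throws KeeperException, InterruptedException {
    List<String> servers = zk.getChildren(rsParentZNode, false);
    for (String server : servers) {
      // Region server znodes are typically named host,port,startcode.
      if (server.startsWith(host + ",")) {
        // -1 means "any version"; deleting the znode mimics what ZooKeeper
        // itself would do once the server's session expired.
        zk.delete(rsParentZNode + "/" + server, -1);
      }
    }
  }
}
{code}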

> Add a function to mark a server as dead and start the recovery process
> -
>
> Key: HBASE-6290
> URL: https://issues.apache.org/jira/browse/HBASE-6290
> Project: HBase
>  Issue Type: Improvement
>  Components: monitoring
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Minor
>
> ZooKeeper is used as a monitoring tool: we use znodes, and we start the recovery 
> process when a znode is deleted by ZK because it got a timeout. This timeout 
> is defaulted to 90 seconds, and often set to 30s.
> However, some HW issues could be detected by specialized hw monitoring tools 
> before the ZK timeout. For this reason, it makes sense to offer a very simple 
> function to mark a RS as dead. This should not take in
> It could be an hbase shell function such as
> considerAsDead ipAddress|serverName
> This would delete all the znodes of the server running on this box, starting 
> the recovery process.
> Such a function would be easily callable (at the caller's risk) by any fault 
> detection tool... We could have issues identifying the right master & region 
> servers around IPv4 vs IPv6 and multi-networked boxes, however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HBASE-6302) Document how to run integration tests

2012-07-09 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar reassigned HBASE-6302:


Assignee: Enis Soztutar

> Document how to run integration tests
> -
>
> Key: HBASE-6302
> URL: https://issues.apache.org/jira/browse/HBASE-6302
> Project: HBase
>  Issue Type: Bug
>  Components: documentation
>Reporter: stack
>Assignee: Enis Soztutar
>Priority: Blocker
> Fix For: 0.96.0
>
>
> HBASE-6203 has attached the old IT doc with some mods.  When we figure out how 
> ITs are to be run, update it and apply the documentation under this issue.  
> Making it a blocker against 0.96.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6312) Make BlockCache eviction thresholds configurable

2012-07-09 Thread Zhihong Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Ted Yu updated HBASE-6312:
--

Fix Version/s: 0.96.0
 Assignee: Jie Huang

> Make BlockCache eviction thresholds configurable
> 
>
> Key: HBASE-6312
> URL: https://issues.apache.org/jira/browse/HBASE-6312
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jie Huang
>Assignee: Jie Huang
>Priority: Minor
> Fix For: 0.96.0
>
> Attachments: hbase-6312.patch, hbase-6312_v2.patch
>
>
> Some of our customers found that tuning the BlockCache eviction thresholds 
> made test results different in their test environment. However, those 
> thresholds are not configurable in the current implementation. The only way 
> to change those values is to re-compile the HBase source code. We wonder if 
> it is possible to make them configurable.
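
A minimal sketch of what making the thresholds configurable could look like; the property names and defaults below are assumptions for illustration, not necessarily the keys introduced by the attached patch:

{code}
import org.apache.hadoop.conf.Configuration;

public class BlockCacheThresholdsSketch {
  // Hypothetical property names; the actual patch may use different keys.
  static final String ACCEPTABLE_FACTOR_KEY = "hbase.lru.blockcache.acceptable.factor";
  static final String MIN_FACTOR_KEY = "hbase.lru.blockcache.min.factor";

  final float acceptableFactor;  // eviction starts once usage exceeds this fraction of capacity
  final float minFactor;         // eviction stops once usage drops below this fraction

  BlockCacheThresholdsSketch(Configuration conf) {
    // Read the thresholds from the configuration instead of compile-time constants,
    // so tuning no longer requires recompiling HBase.
    this.acceptableFactor = conf.getFloat(ACCEPTABLE_FACTOR_KEY, 0.85f);
    this.minFactor = conf.getFloat(MIN_FACTOR_KEY, 0.75f);
  }
}
{code}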

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5786) Implement histogram metrics for flush and compaction latencies and sizes.

2012-07-09 Thread Jonathan Creasy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409725#comment-13409725
 ] 

Jonathan Creasy commented on HBASE-5786:


I'm interested in working on a patch for this; it seems like a pretty good 
starter task for getting involved in HBase development.

> Implement histogram metrics for flush and compaction latencies and sizes.
> -
>
> Key: HBASE-5786
> URL: https://issues.apache.org/jira/browse/HBASE-5786
> Project: HBase
>  Issue Type: New Feature
>  Components: metrics, regionserver
>Affects Versions: 0.92.2, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>
> The average time for region operations doesn't really tell a useful story 
> that would help diagnose anomalous conditions.
> It would be extremely useful to add histogramming metrics similar to 
> HBASE-5533 for region operations like flush, compaction and splitting.  They 
> probably should be forward biased at a much coarser granularity however 
> (maybe decay every day?) 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5136) Redundant MonitoredTask instances in case of distributed log splitting retry

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-5136:
---

Attachment: 5136-trunk.patch

> Redundant MonitoredTask instances in case of distributed log splitting retry
> 
>
> Key: HBASE-5136
> URL: https://issues.apache.org/jira/browse/HBASE-5136
> Project: HBase
>  Issue Type: Task
>Reporter: Zhihong Ted Yu
>Assignee: Zhihong Ted Yu
> Attachments: 5136-trunk.patch, 5136.txt
>
>
> In case of a log splitting retry, the following code would be executed multiple 
> times:
> {code}
>   public long splitLogDistributed(final List<Path> logDirs) throws 
> IOException {
> MonitoredTask status = TaskMonitor.get().createStatus(
>   "Doing distributed log split in " + logDirs);
> {code}
> leading to multiple MonitoredTask instances.
> Users may get confused by multiple distributed log splitting entries for the 
> same region server on the master UI.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5136) Redundant MonitoredTask instances in case of distributed log splitting retry

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-5136:
---

Status: Open  (was: Patch Available)

> Redundant MonitoredTask instances in case of distributed log splitting retry
> 
>
> Key: HBASE-5136
> URL: https://issues.apache.org/jira/browse/HBASE-5136
> Project: HBase
>  Issue Type: Task
>Reporter: Zhihong Ted Yu
>Assignee: Zhihong Ted Yu
> Attachments: 5136-trunk.patch, 5136.txt
>
>
> In case of a log splitting retry, the following code would be executed multiple 
> times:
> {code}
>   public long splitLogDistributed(final List<Path> logDirs) throws 
> IOException {
> MonitoredTask status = TaskMonitor.get().createStatus(
>   "Doing distributed log split in " + logDirs);
> {code}
> leading to multiple MonitoredTask instances.
> Users may get confused by multiple distributed log splitting entries for the 
> same region server on the master UI.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5136) Redundant MonitoredTask instances in case of distributed log splitting retry

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-5136:
---

Status: Patch Available  (was: Open)

Posted a patch to abort the task status if log splitting failed.

Let HBASE-5174 handle the duplicated status entry.
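
A rough sketch of that idea; the markComplete/abort calls are assumed from the MonitoredTask interface, and doSplitLogDistributed is a hypothetical stand-in for the existing splitting logic, not the actual patch:

{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.monitoring.MonitoredTask;
import org.apache.hadoop.hbase.monitoring.TaskMonitor;

abstract class SplitLogAbortSketch {
  abstract long doSplitLogDistributed(List<Path> logDirs, MonitoredTask status)
      throws IOException;

  public long splitLogDistributed(final List<Path> logDirs) throws IOException {
    MonitoredTask status = TaskMonitor.get().createStatus(
        "Doing distributed log split in " + logDirs);
    try {
      long bytes = doSplitLogDistributed(logDirs, status);
      status.markComplete("Finished splitting " + logDirs);
      return bytes;
    } catch (IOException e) {
      // Abort the task so a failed split does not linger as a stuck
      // "Doing distributed log split" entry on the master web UI.
      status.abort("Distributed log split failed: " + e.getMessage());
      throw e;
    }
  }
}
{code}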

> Redundant MonitoredTask instances in case of distributed log splitting retry
> 
>
> Key: HBASE-5136
> URL: https://issues.apache.org/jira/browse/HBASE-5136
> Project: HBase
>  Issue Type: Task
>Reporter: Zhihong Ted Yu
>Assignee: Zhihong Ted Yu
> Attachments: 5136-trunk.patch, 5136.txt
>
>
> In case of a log splitting retry, the following code would be executed multiple 
> times:
> {code}
>   public long splitLogDistributed(final List<Path> logDirs) throws 
> IOException {
> MonitoredTask status = TaskMonitor.get().createStatus(
>   "Doing distributed log split in " + logDirs);
> {code}
> leading to multiple MonitoredTask instances.
> Users may get confused by multiple distributed log splitting entries for the 
> same region server on the master UI.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6348) Region assignments should only be allowed to edit META hosted on the same cluster.

2012-07-09 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409743#comment-13409743
 ] 

Jean-Daniel Cryans commented on HBASE-6348:
---

Wouldn't it be just easier to make sure that .META. is assigned correctly? IIUC 
this is where the problem happened (HMaster.assignRootAndMeta):

{code}
if (!this.catalogTracker.verifyMetaRegionLocation(timeout)) {
  ServerName currentMetaServer =
this.catalogTracker.getMetaLocationOrReadLocationFromRoot();
  if (currentMetaServer != null
  && !currentMetaServer.equals(currentRootServer)) {
splitLogAndExpireIfOnline(currentMetaServer);
  }
  assignmentManager.assignMeta();
  this.catalogTracker.waitForMeta();
  // Above check waits for general meta availability but this does not
  // guarantee that the transition has completed
  
this.assignmentManager.waitForAssignment(HRegionInfo.FIRST_META_REGIONINFO);
  assigned++;
} else {
  // Region already assigned.  We didnt' assign it.  Add to in-memory state.
  this.assignmentManager.regionOnline(HRegionInfo.FIRST_META_REGIONINFO,
this.catalogTracker.getMetaLocation());
}
{code}

When the location was verified, it was able to read the old .META. location 
from ROOT and since the region was still there it was assumed that .META. was 
correctly assigned. Now what's interesting is this from AM.regionOnline:

{code}
  if (isServerOnline(sn)) {
this.regions.put(regionInfo, sn);
addToServers(sn, regionInfo);
this.regions.notifyAll();
  } else {
LOG.info("The server is not in online servers, ServerName=" + 
  sn.getServerName() + ", region=" + regionInfo.getEncodedName());
  }
{code}

I assume that if you went over the master's log you would find the log message 
about the server not being online? It seems to me that we should either check 
that the server belongs to us or backtrack when we fail to set the region 
online.
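
For illustration only, a sketch of the "check if the server belongs to us" option; it reuses the field names from the snippets above, but the overall flow is a simplified stand-in, not the actual master code path:

{code}
// Sketch: only accept the cached .META. location read from ROOT if the named
// server is one of this cluster's online region servers; otherwise fall back
// to a fresh assignment instead of recording a stale or foreign owner.
ServerName currentMetaServer =
    this.catalogTracker.getMetaLocationOrReadLocationFromRoot();
if (currentMetaServer != null
    && this.serverManager.isServerOnline(currentMetaServer)) {
  // The location points at one of our own live servers: safe to mark online.
  this.assignmentManager.regionOnline(HRegionInfo.FIRST_META_REGIONINFO,
      currentMetaServer);
} else {
  // Stale or foreign location (e.g. copied from another cluster): reassign.
  this.assignmentManager.assignMeta();
  this.catalogTracker.waitForMeta();
}
{code}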

> Region assignments should only be allowed to edit META hosted on the same 
> cluster.
> ---
>
> Key: HBASE-6348
> URL: https://issues.apache.org/jira/browse/HBASE-6348
> Project: HBase
>  Issue Type: Task
>Reporter: Jonathan Hsieh
>
> We copied hbase file data (root/meta/tables) from one hdfs cluster to 
> another, scrubbed it, and then attempted to start the new cluster.  We 
> noticed that META on the original cluster was being modified with server 
> entries from the new cluster.  
> It's contrived, but here is how it happened.
> First we copied all the data.  Then we "scrubbed" META -- we removed all 
> region serverinfo cols that pointed to nodes on the original cluster.  When 
> we started the new cluster, it picked a RS to serve ROOT.  Since we had 
> scrubbed meta, the new cluster's master attempted to assign regions to 
> other region servers on the new cluster.  From the code's point of view this 
> all succeeded -- zk went through transitions; according to the master they 
> were assigned.  However, we started seeing NotServingRegionExceptions on the 
> original cluster.
> The root cause is that ROOT was not scrubbed.  The new cluster assigned the 
> copy of ROOT to a new cluster RS.  Now, when the new cluster attempted to 
> modify META, it would read the old ROOT's serverinfo pointer and go to the *old 
> cluster's regionserver*.  The old cluster's regionserver just so happened to 
> be still serving META, so the old cluster's META server gladly accepted the 
> assignments that included the new cluster's regionserver names.
> At this point we brought down the new cluster (it was getting killed).  
> Clients on the old cluster would now go to zk, root, meta, and get pointers to 
> the new cluster.  NSREs happened.  Unhappiness.
> Long story short, we should have some mechanism to make sure that region 
> assignments are only allowed to edit META hosted on the same cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6060) Regions in OPENING state from failed regionservers take a long time to recover

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409749#comment-13409749
 ] 

stack commented on HBASE-6060:
--

bq. Do we need to test the perf because now the transition from OFFLINE to 
OPENING is done in the main thread?

Yeah, it's going to take longer for the rpc to complete though we are doing the 
same amount of 'work' opening a region.  I'm not worried about the single 
region open case.  I'd think we will add a few ticks.  But bulk opens will be 
interesting.  I've not made the change there, yet.  These try to be async so 
maybe it'll be ok there... will see.

What do you think?



> Regions in OPENING state from failed regionservers take a long time to 
> recover
> -
>
> Key: HBASE-6060
> URL: https://issues.apache.org/jira/browse/HBASE-6060
> Project: HBase
>  Issue Type: Bug
>  Components: master, regionserver
>Reporter: Enis Soztutar
>Assignee: rajeshbabu
> Fix For: 0.96.0, 0.94.1, 0.92.3
>
> Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 
> 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, 
> 6060-trunk_3.patch, 6060_alternative_suggestion.txt, 
> 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, 
> 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, 
> HBASE-6060-92.patch, HBASE-6060-94.patch, HBASE-6060-trunk_4.patch, 
> HBASE-6060_trunk_5.patch
>
>
> We have seen a pattern in tests where regions are stuck in the OPENING state 
> for a very long time when the region server that is opening the region fails. 
> My understanding of the process: 
>  
>  - master calls rs to open the region. If rs is offline, a new plan is 
> generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in 
> master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), 
> HMaster.assign()
>  - RegionServer starts opening a region and changes the state in the znode. But 
> that znode is not ephemeral. (see ZkAssign)
>  - Rs transitions the zk node from OFFLINE to OPENING. See 
> OpenRegionHandler.process()
>  - rs then opens the region, and changes the znode from OPENING to OPENED
>  - when rs is killed between the OPENING and OPENED states, then zk shows OPENING 
> state, and the master just waits for rs to change the region state, but since 
> rs is down, that won't happen. 
>  - There is an AssignmentManager.TimeoutMonitor, which guards against exactly 
> these kinds of conditions. It periodically checks (every 10 sec by 
> default) the regions in transition to see whether they timed out 
> (hbase.master.assignment.timeoutmonitor.timeout). The default timeout is 30 min, 
> which explains what you and I are seeing. 
>  - ServerShutdownHandler in the Master does not reassign regions in the OPENING 
> state, although it handles other states. 
> Lowering that threshold from the configuration is one option, but still I 
> think we can do better. 
> Will investigate more. 
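
As a stop-gap along the "lower the threshold" line mentioned above, the timeout is just a configuration value; a hedged example follows, where the property name is the one quoted in the description and the chosen value is arbitrary:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LowerAssignmentTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Default is 30 minutes; drop it to 3 minutes so regions left in OPENING by
    // a dead region server are retried much sooner. Equivalent to setting the
    // same property in hbase-site.xml.
    conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 180000);
  }
}
{code}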

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6060) Regions in OPENING state from failed regionservers take a long time to recover

2012-07-09 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6060:
-

Fix Version/s: (was: 0.94.1)
   0.94.2

This won't be done for 0.94.1 is my guess. Moving it out.

> Regions in OPENING state from failed regionservers take a long time to 
> recover
> -
>
> Key: HBASE-6060
> URL: https://issues.apache.org/jira/browse/HBASE-6060
> Project: HBase
>  Issue Type: Bug
>  Components: master, regionserver
>Reporter: Enis Soztutar
>Assignee: rajeshbabu
> Fix For: 0.96.0, 0.92.3, 0.94.2
>
> Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 
> 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, 
> 6060-trunk_3.patch, 6060_alternative_suggestion.txt, 
> 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, 
> 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, 
> HBASE-6060-92.patch, HBASE-6060-94.patch, HBASE-6060-trunk_4.patch, 
> HBASE-6060_trunk_5.patch
>
>
> We have seen a pattern in tests where regions are stuck in the OPENING state 
> for a very long time when the region server that is opening the region fails. 
> My understanding of the process: 
>  
>  - master calls rs to open the region. If rs is offline, a new plan is 
> generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in 
> master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), 
> HMaster.assign()
>  - RegionServer starts opening a region and changes the state in the znode. But 
> that znode is not ephemeral. (see ZkAssign)
>  - Rs transitions the zk node from OFFLINE to OPENING. See 
> OpenRegionHandler.process()
>  - rs then opens the region, and changes the znode from OPENING to OPENED
>  - when rs is killed between the OPENING and OPENED states, then zk shows OPENING 
> state, and the master just waits for rs to change the region state, but since 
> rs is down, that won't happen. 
>  - There is an AssignmentManager.TimeoutMonitor, which guards against exactly 
> these kinds of conditions. It periodically checks (every 10 sec by 
> default) the regions in transition to see whether they timed out 
> (hbase.master.assignment.timeoutmonitor.timeout). The default timeout is 30 min, 
> which explains what you and I are seeing. 
>  - ServerShutdownHandler in the Master does not reassign regions in the OPENING 
> state, although it handles other states. 
> Lowering that threshold from the configuration is one option, but still I 
> think we can do better. 
> Will investigate more. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5136) Redundant MonitoredTask instances in case of distributed log splitting retry

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-5136:
---

Status: Open  (was: Patch Available)

Let me create a different related issue.

> Redundant MonitoredTask instances in case of distributed log splitting retry
> 
>
> Key: HBASE-5136
> URL: https://issues.apache.org/jira/browse/HBASE-5136
> Project: HBase
>  Issue Type: Task
>Reporter: Zhihong Ted Yu
>Assignee: Zhihong Ted Yu
> Attachments: 5136-trunk.patch, 5136.txt
>
>
> In case of a log splitting retry, the following code would be executed multiple 
> times:
> {code}
>   public long splitLogDistributed(final List<Path> logDirs) throws 
> IOException {
> MonitoredTask status = TaskMonitor.get().createStatus(
>   "Doing distributed log split in " + logDirs);
> {code}
> leading to multiple MonitoredTask instances.
> Users may get confused by multiple distributed log splitting entries for the 
> same region server on the master UI.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reassigned HBASE-6357:
--

Assignee: Jimmy Xiang

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Jimmy Xiang (JIRA)
Jimmy Xiang created HBASE-6357:
--

 Summary: Failed distributed log splitting stuck on master web UI
 Key: HBASE-6357
 URL: https://issues.apache.org/jira/browse/HBASE-6357
 Project: HBase
  Issue Type: Bug
Reporter: Jimmy Xiang


Failed distributed log splitting MonitoredTask is stuck on the master web UI 
since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-6357:
---

Attachment: 6357-trunk.patch

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: 6357-trunk.patch
>
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-6357:
---

Status: Patch Available  (was: Open)

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: 6357-trunk.patch
>
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409784#comment-13409784
 ] 

Zhihong Ted Yu commented on HBASE-6357:
---

+1 if Hadoop QA tests pass.

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: 6357-trunk.patch
>
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5786) Implement histogram metrics for flush and compaction latencies and sizes.

2012-07-09 Thread Elliott Clark (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409791#comment-13409791
 ] 

Elliott Clark commented on HBASE-5786:
--

@jonathan
This would be a good project to get you started.  I would ignore the 
discussions about the accuracy of our histograms and just use the 
MetricsHistogram for now.
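
For a first pass, something along the following lines could work; the import path, constructor, and update(long) signature of MetricsHistogram are assumed here, and the field and metric names are purely illustrative:

{code}
import org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram;
import org.apache.hadoop.metrics.util.MetricsRegistry;

// Sketch of per-operation histograms for flush latency and size; a histogram
// captures the distribution (percentiles), not just the average.
class RegionServerFlushMetricsSketch {
  private final MetricsRegistry registry = new MetricsRegistry();
  final MetricsHistogram flushTime = new MetricsHistogram("flushTime", registry);
  final MetricsHistogram flushSize = new MetricsHistogram("flushSize", registry);

  void onFlushCompleted(long durationMillis, long flushedBytes) {
    flushTime.update(durationMillis);
    flushSize.update(flushedBytes);
  }
}
{code}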

> Implement histogram metrics for flush and compaction latencies and sizes.
> -
>
> Key: HBASE-5786
> URL: https://issues.apache.org/jira/browse/HBASE-5786
> Project: HBase
>  Issue Type: New Feature
>  Components: metrics, regionserver
>Affects Versions: 0.92.2, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>
> The average time for region operations doesn't really tell a useful story 
> that would help diagnose anomalous conditions.
> It would be extremely useful to add histogramming metrics similar to 
> HBASE-5533 for region operations like flush, compaction and splitting.  They 
> probably should be forward biased at a much coarser granularity however 
> (maybe decay every day?) 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5997) Fix concerns raised in HBASE-5922 related to HalfStoreFileReader

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409794#comment-13409794
 ] 

stack commented on HBASE-5997:
--

@Anoop Great work.  Agree on the seekBefore finding.  Let's fix that in another 
issue.

On the patch, minor comment only (I'm good w/ commit):

{code}
if (firstKey == null) {
{code}

Is it possible, if the file is empty say, that we'll seek on every invocation of 
getFirstKey?

This patch does not do your compare of row only rather than compare of the full 
key.  Is it supposed to?

> Fix concerns raised in HBASE-5922 related to HalfStoreFileReader
> 
>
> Key: HBASE-5997
> URL: https://issues.apache.org/jira/browse/HBASE-5997
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.6, 0.92.1, 0.94.0, 0.96.0
>Reporter: ramkrishna.s.vasudevan
>Assignee: Anoop Sam John
> Fix For: 0.94.1
>
> Attachments: HBASE-5997_0.94.patch, HBASE-5997_94 V2.patch, 
> Testcase.patch.txt
>
>
> Pls refer to the comment
> https://issues.apache.org/jira/browse/HBASE-5922?focusedCommentId=13269346&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13269346.
> Raised this issue to address that comment, just so we don't forget it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5786) Implement histogram metrics for flush and compaction latencies and sizes.

2012-07-09 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409801#comment-13409801
 ] 

Zhihong Ted Yu commented on HBASE-5786:
---

MetricsHistogram depends on the following:
{code}
import org.apache.hadoop.metrics.MetricsRecord;
import org.apache.hadoop.metrics.util.MetricsBase;
import org.apache.hadoop.metrics.util.MetricsRegistry;
{code}
which are deprecated in Hadoop.

See the discussion 'deprecating (old) metrics in favor of metrics2 framework' on 
the dev@ list.

> Implement histogram metrics for flush and compaction latencies and sizes.
> -
>
> Key: HBASE-5786
> URL: https://issues.apache.org/jira/browse/HBASE-5786
> Project: HBase
>  Issue Type: New Feature
>  Components: metrics, regionserver
>Affects Versions: 0.92.2, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>
> The average time for region operations doesn't really tell a useful story 
> that would help diagnose anomalous conditions.
> It would be extremely useful to add histogramming metrics similar to 
> HBASE-5533 for region operations like flush, compaction and splitting.  They 
> probably should be forward biased at a much coarser granularity however 
> (maybe decay every day?) 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5786) Implement histogram metrics for flush and compaction latencies and sizes.

2012-07-09 Thread Elliott Clark (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409812#comment-13409812
 ] 

Elliott Clark commented on HBASE-5786:
--

Yes Ted, it is deprecated.  However, right now it's the best that we have. I have 
other JIRAs that have metrics2 implementations.  However, I don't think that 
it's appropriate to expect a first-time contributor to make all of those 
changes before adding a smaller fix.  When we move all of our implementations 
over to metrics2, MetricsHistogram will have to be addressed there too.

> Implement histogram metrics for flush and compaction latencies and sizes.
> -
>
> Key: HBASE-5786
> URL: https://issues.apache.org/jira/browse/HBASE-5786
> Project: HBase
>  Issue Type: New Feature
>  Components: metrics, regionserver
>Affects Versions: 0.92.2, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>
> The average time for region operations doesn't really tell a useful story 
> that would help diagnose anomalous conditions.
> It would be extremely useful to add histogramming metrics similar to 
> HBASE-5533 for region operations like flush, compaction and splitting.  They 
> probably should be forward biased at a much coarser granularity however 
> (maybe decay every day?) 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6346) Observed excessive CPU on "quiescent" cluster

2012-07-09 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409820#comment-13409820
 ] 

Todd Lipcon commented on HBASE-6346:


RH6.1+ THP defragmentation perhaps?

> Observed excessive CPU on "quiescent" cluster
> -
>
> Key: HBASE-6346
> URL: https://issues.apache.org/jira/browse/HBASE-6346
> Project: HBase
>  Issue Type: Bug
> Environment: Sun JRE 6u33, Hadoop 2.0.1-alpha, HBase 0.94.1-SNAPSHOT
>Reporter: Andrew Purtell
> Attachments: graph.gif, jstack.txt, runnable_cpu_profile.png
>
>
> I've only started looking at this but became concerned when a completely 
> quiescent cluster loads up 20% CPU (system+user) once HBase comes up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6317) Master clean start up and Partially enabled tables make region assignment inconsistent.

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409822#comment-13409822
 ] 

stack commented on HBASE-6317:
--



Does this need to be public?  Can it be package private?

{code}
+  public List<HRegionInfo> getEnablingTableRegions(String tableName) {
{code}

What does this protect against?  Concurrent assignment by another thread?

{code}
+  public boolean addPlanIfNotPresent(HRegionInfo hri, RegionPlan plan) {
{code}

Can we not ask if regionInTransition rather than add this new method?

How can we get this better noticed?  The table will be offline right?

{code}
+} catch (InterruptedException e) {
+  LOG.error("Error trying to enable the table " + this.tableNameStr, e);
 }
{code}

Why is this:

{code}
+if(this.failover == false && this.enablingTables.size() > 0){
+  this.failover = true;
+}
{code}

We are testing if failover is necessary, and if there are tables in an enabling 
state when the new master comes up, then that is a good enough reason to run the 
failover code?  A comment would be good here I'd say (especially if I have it 
wrong).

Should we at least look at the RegionState before we remove the item from the 
list below (or do you think it would never be other than OPENING or something?)

{code}
+List<HRegionInfo> hris = 
this.enablingTables.get(regionInfo.getTableNameAsString());
+if(hris != null && !hris.isEmpty()){
+  hris.remove(regionInfo);
+}
{code}

Why do we do tests like if (false == 
checkIfRegionBelongsToDisabled(regionInfo)... instead of if 
(!checkIfRegionBelongsToDisabled(regionInfo)...?  (It's not you, but it looks 
weird that you're replicating this oddity.)

I'm not sure I follow all that is going on in here but it looks right... I can try a 
review again later.  Too hard to write a test I suppose, lads?  It's difficult 
state to reproduce, this master startup stuff?

> Master clean start up and Partially enabled tables make region assignment 
> inconsistent.
> ---
>
> Key: HBASE-6317
> URL: https://issues.apache.org/jira/browse/HBASE-6317
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
>Assignee: rajeshbabu
> Fix For: 0.92.2, 0.96.0, 0.94.1
>
> Attachments: HBASE-6317_94.patch
>
>
> If we have a table in a partially enabled state (ENABLING), then on HMaster 
> restart we treat it as a clean cluster start up and do a bulk assign.  
> Currently in 0.94 bulk assign will not handle ALREADY_OPENED scenarios, and it 
> leads to region assignment problems.  Analysing this further, we found that we 
> have a better way to handle these scenarios.
> {code}
> if (false == checkIfRegionBelongsToDisabled(regionInfo)
> && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
>   synchronized (this.regions) {
> regions.put(regionInfo, regionLocation);
> addToServers(regionLocation, regionInfo);
>   }
> {code}
> We don't add to the regions map so that the enable table handler can handle it.  But 
> as nothing is added to the regions map, we treat it as a clean cluster start up.
> Will come up with a patch tomorrow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6345) Utilize fault injection in testing using AspectJ

2012-07-09 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409823#comment-13409823
 ] 

Todd Lipcon commented on HBASE-6345:


Yea, we haven't been using the AspectJ FI stuff since we mavenized quite some 
time ago. It was a pain to maintain, and only one person ever really knew how 
to write FI tests with this framework.

All of our new fault injection tests are just using Mockito. Where necessary 
we're adding FaultInjector classes which are easy to hook, for example: 
https://github.com/toddlipcon/hadoop-common/blob/auto-failover-and-qjournal/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/CheckpointFaultInjector.java
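
As a rough illustration of that pattern (a small injectable hook class that tests can override, or replace with a Mockito spy), here is a sketch in which every class and method name is hypothetical:

{code}
// Sketch of the FaultInjector pattern: production code calls a no-op hook at
// interesting points; a test installs a subclass (or a Mockito mock/spy) that
// throws, sleeps, or counts invocations.
class FlushFaultInjector {
  private static FlushFaultInjector instance = new FlushFaultInjector();

  static FlushFaultInjector get() { return instance; }
  static void set(FlushFaultInjector injector) { instance = injector; } // test hook

  // No-op in production; a test override can throw here to simulate a failure
  // between snapshotting the memstore and writing the store file.
  void beforeFlushFileWrite() throws java.io.IOException { }
}

class FlushFaultInjectorUsage {
  void flush() throws java.io.IOException {
    // ... snapshot memstore ...
    FlushFaultInjector.get().beforeFlushFileWrite();
    // ... write the store file ...
  }
}
{code}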

> Utilize fault injection in testing using AspectJ
> 
>
> Key: HBASE-6345
> URL: https://issues.apache.org/jira/browse/HBASE-6345
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Ted Yu
>
> HDFS uses fault injection to test pipeline failure in addition to mock, spy. 
> HBase uses mock, spy. But there are cases where mock, spy aren't convenient.
> Some example from DFSClientAspects.aj :
> {code}
>   pointcut pipelineInitNonAppend(DataStreamer datastreamer):
> callCreateBlockOutputStream(datastreamer)
> && cflow(execution(* nextBlockOutputStream(..)))
> && within(DataStreamer);
>   after(DataStreamer datastreamer) returning : 
> pipelineInitNonAppend(datastreamer) {
> LOG.info("FI: after pipelineInitNonAppend: hasError="
> + datastreamer.hasError + " errorIndex=" + datastreamer.errorIndex);
> if (datastreamer.hasError) {
>   DataTransferTest dtTest = DataTransferTestUtil.getDataTransferTest();
>   if (dtTest != null)
> dtTest.fiPipelineInitErrorNonAppend.run(datastreamer.errorIndex);
> }
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5786) Implement histogram metrics for flush and compaction latencies and sizes.

2012-07-09 Thread Zhihong Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Ted Yu updated HBASE-5786:
--

Fix Version/s: 0.96.0

> Implement histogram metrics for flush and compaction latencies and sizes.
> -
>
> Key: HBASE-5786
> URL: https://issues.apache.org/jira/browse/HBASE-5786
> Project: HBase
>  Issue Type: New Feature
>  Components: metrics, regionserver
>Affects Versions: 0.92.2, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
> Fix For: 0.96.0
>
>
> The average time for region operations doesn't really tell a useful story 
> that would help diagnose anomalous conditions.
> It would be extremely useful to add histogramming metrics similar to 
> HBASE-5533 for region operations like flush, compaction and splitting.  They 
> probably should be forward biased at a much coarser granularity however 
> (maybe decay every day?) 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6323) [replication] most of the source metrics are wrong when there's multiple slaves

2012-07-09 Thread Elliott Clark (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elliott Clark updated HBASE-6323:
-

Attachment: HBASE-6323-1.patch

Patch with documentation
This adds the hadoop-metrics2.properties config file.

> [replication] most of the source metrics are wrong when there's multiple 
> slaves
> ---
>
> Key: HBASE-6323
> URL: https://issues.apache.org/jira/browse/HBASE-6323
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.1, 0.94.0
>Reporter: Jean-Daniel Cryans
>Assignee: Elliott Clark
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6323-0.patch, HBASE-6323-1.patch
>
>
> Most of the metrics in replication were written with 1 slave in mind but with 
> multiple slaves the issue really shows. Most of the metrics are set directly:
> {code}
> public void enqueueLog(Path log) {
>   this.queue.put(log);
>   this.metrics.sizeOfLogQueue.set(queue.size());
> }
> {code}
> So {{sizeOfLogQueue}} is always showing the size of the queue that updated 
> the metric last.
> I'm not sure what's the right way to fix this since we can't have dynamic 
> metrics. Merging them would work here but it wouldn't work so well with 
> {{ageOfLastShippedOp}} since the age can be different and it definitely 
> cannot be summed.
> Assigning to Elliott since he seems to dig metrics these days. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6323) [replication] most of the source metrics are wrong when there's multiple slaves

2012-07-09 Thread Elliott Clark (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elliott Clark updated HBASE-6323:
-

Status: Patch Available  (was: Open)

0.96 version first.

> [replication] most of the source metrics are wrong when there's multiple 
> slaves
> ---
>
> Key: HBASE-6323
> URL: https://issues.apache.org/jira/browse/HBASE-6323
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.0, 0.92.1
>Reporter: Jean-Daniel Cryans
>Assignee: Elliott Clark
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6323-0.patch, HBASE-6323-1.patch
>
>
> Most of the metrics in replication were written with 1 slave in mind but with 
> multiple slaves the issue really shows. Most of the metrics are set directly:
> {code}
> public void enqueueLog(Path log) {
>   this.queue.put(log);
>   this.metrics.sizeOfLogQueue.set(queue.size());
> }
> {code}
> So {{sizeOfLogQueue}} is always showing the size of the queue that updated 
> the metric last.
> I'm not sure what's the right way to fix this since we can't have dynamic 
> metrics. Merging them would work here but it wouldn't work so well with 
> {{ageOfLastShippedOp}} since the age can be different and it definitely 
> cannot be summed.
> Assigning to Elliott since he seems to dig metrics these days. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6345) Utilize fault injection in testing using AspectJ

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409865#comment-13409865
 ] 

stack commented on HBASE-6345:
--

So we should close this issue or redo as 'Fault Injection in testing'?  (Is 
that too vague?)

> Utilize fault injection in testing using AspectJ
> 
>
> Key: HBASE-6345
> URL: https://issues.apache.org/jira/browse/HBASE-6345
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Ted Yu
>
> HDFS uses fault injection to test pipeline failure in addition to mock, spy. 
> HBase uses mock, spy. But there are cases where mock, spy aren't convenient.
> Some example from DFSClientAspects.aj :
> {code}
>   pointcut pipelineInitNonAppend(DataStreamer datastreamer):
> callCreateBlockOutputStream(datastreamer)
> && cflow(execution(* nextBlockOutputStream(..)))
> && within(DataStreamer);
>   after(DataStreamer datastreamer) returning : 
> pipelineInitNonAppend(datastreamer) {
> LOG.info("FI: after pipelineInitNonAppend: hasError="
> + datastreamer.hasError + " errorIndex=" + datastreamer.errorIndex);
> if (datastreamer.hasError) {
>   DataTransferTest dtTest = DataTransferTestUtil.getDataTransferTest();
>   if (dtTest != null)
> dtTest.fiPipelineInitErrorNonAppend.run(datastreamer.errorIndex);
> }
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6323) [replication] most of the source metrics are wrong when there's multiple slaves

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409868#comment-13409868
 ] 

stack commented on HBASE-6323:
--

Looks good Mr. Elliott.  Will give J-D the honor of blessing it.  I like your 
moving the classes under the metrics package.  Where is the bit where we deal w/ 
multiple slaves?  That's in one of the moved classes?  Good stuff.

> [replication] most of the source metrics are wrong when there's multiple 
> slaves
> ---
>
> Key: HBASE-6323
> URL: https://issues.apache.org/jira/browse/HBASE-6323
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.1, 0.94.0
>Reporter: Jean-Daniel Cryans
>Assignee: Elliott Clark
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6323-0.patch, HBASE-6323-1.patch
>
>
> Most of the metrics in replication were written with 1 slave in mind but with 
> multiple slaves the issue really shows. Most of the metrics are set directly:
> {code}
> public void enqueueLog(Path log) {
>   this.queue.put(log);
>   this.metrics.sizeOfLogQueue.set(queue.size());
> }
> {code}
> So {{sizeOfLogQueue}} is always showing the size of the queue that updated 
> the metric last.
> I'm not sure what's the right way to fix this since we can't have dynamic 
> metrics. Merging them would work here but it wouldn't work so well with 
> {{ageOfLastShippedOp}} since the age can be different and it definitely 
> cannot be summed.
> Assigning to Elliott since he seems to dig metrics these days. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409872#comment-13409872
 ] 

stack commented on HBASE-6357:
--

+1 and what Ted says (commit if hadoopqa looks reasonable)

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: 6357-trunk.patch
>
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6323) [replication] most of the source metrics are wrong when there's multiple slaves

2012-07-09 Thread Elliott Clark (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409875#comment-13409875
 ] 

Elliott Clark commented on HBASE-6323:
--

Yes.  The ReplicationSourceMetric class adds values to a global gauge and to a 
gauge for just that source.  When a source is stopped, the values are subtracted 
from the global gauge and the gauge for just that source is removed.  Counters 
are not subtracted since they represent things like the total number of log 
edits shipped out, and removing them doesn't really make sense.

Since there's only one sink, ReplicationSinkMetrics doesn't really have that.
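
A simplified sketch of that bookkeeping; the class and field names below are illustrative stand-ins, not the patch's actual classes:

{code}
import java.util.concurrent.atomic.AtomicLong;

// Sketch: each source keeps its own gauge and mirrors every change into a
// global gauge shared by all sources; when a source stops, its contribution is
// subtracted from the global value and the per-source gauge is dropped.
class ReplicationSourceGaugeSketch {
  private static final AtomicLong globalLogQueueSize = new AtomicLong();
  private final AtomicLong sourceLogQueueSize = new AtomicLong();

  void setLogQueueSize(long newSize) {
    long old = sourceLogQueueSize.getAndSet(newSize);
    globalLogQueueSize.addAndGet(newSize - old);
  }

  void stop() {
    // Remove this source's contribution; counters (e.g. total edits shipped)
    // would intentionally not be rolled back.
    globalLogQueueSize.addAndGet(-sourceLogQueueSize.getAndSet(0));
  }
}
{code}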

> [replication] most of the source metrics are wrong when there's multiple 
> slaves
> ---
>
> Key: HBASE-6323
> URL: https://issues.apache.org/jira/browse/HBASE-6323
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.1, 0.94.0
>Reporter: Jean-Daniel Cryans
>Assignee: Elliott Clark
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6323-0.patch, HBASE-6323-1.patch
>
>
> Most of the metrics in replication were written with 1 slave in mind but with 
> multiple slaves the issue really shows. Most of the metrics are set directly:
> {code}
> public void enqueueLog(Path log) {
>   this.queue.put(log);
>   this.metrics.sizeOfLogQueue.set(queue.size());
> }
> {code}
> So {{sizeOfLogQueue}} is always showing the size of the queue that updated 
> the metric last.
> I'm not sure what's the right way to fix this since we can't have dynamic 
> metrics. Merging them would work here but it wouldn't work so well with 
> {{ageOfLastShippedOp}} since the age can be different and it definitely 
> cannot be summed.
> Assigning to Elliott since he seems to dig metrics these days. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6323) [replication] most of the source metrics are wrong when there's multiple slaves

2012-07-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409881#comment-13409881
 ] 

stack commented on HBASE-6323:
--

I'm +1 then.

> [replication] most of the source metrics are wrong when there's multiple 
> slaves
> ---
>
> Key: HBASE-6323
> URL: https://issues.apache.org/jira/browse/HBASE-6323
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.1, 0.94.0
>Reporter: Jean-Daniel Cryans
>Assignee: Elliott Clark
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6323-0.patch, HBASE-6323-1.patch
>
>
> Most of the metrics in replication were written with 1 slave in mind but with 
> multiple slaves the issue really shows. Most of the metrics are set directly:
> {code}
> public void enqueueLog(Path log) {
>   this.queue.put(log);
>   this.metrics.sizeOfLogQueue.set(queue.size());
> }
> {code}
> So {{sizeOfLogQueue}} is always showing the size of the queue that updated 
> the metric last.
> I'm not sure what's the right way to fix this since we can't have dynamic 
> metrics. Merging them would work here but it wouldn't work so well with 
> {{ageOfLastShippedOp}} since the age can be different and it definitely 
> cannot be summed.
> Assigning to Elliott since he seems to dig metrics these days. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6358) Bulkloading from remote filesystem is problematic

2012-07-09 Thread Dave Revell (JIRA)
Dave Revell created HBASE-6358:
--

 Summary: Bulkloading from remote filesystem is problematic
 Key: HBASE-6358
 URL: https://issues.apache.org/jira/browse/HBASE-6358
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.0
Reporter: Dave Revell
Assignee: Dave Revell


Bulk loading hfiles that don't live on the same filesystem as HBase can cause 
problems for subtle reasons.

In Store.bulkLoadHFile(), the regionserver will copy the source hfile to its 
own filesystem if it's not already there. Since this can take a long time for 
large hfiles, it's likely that the client will timeout and retry. When the 
client retries repeatedly, there may be several bulkload operations in flight 
for the same hfile, causing lots of unnecessary IO and tying up handler 
threads. This can seriously impact performance. In my case, the cluster became 
unusable and the regionservers had to be kill -9'ed.

Possible solutions:
 # Require that hfiles already be on the same filesystem as HBase in order for 
bulkloading to succeed. The copy could be handled by LoadIncrementalHFiles 
before the regionserver is called.
 # Others? I'm not familiar with Hadoop IPC so there may be tricks to extend 
the timeout or something else.

I'm willing to write a patch but I'd appreciate recommendations on how to 
proceed.
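
For what it's worth, a minimal sketch of option 1, copying the hfiles onto HBase's filesystem before invoking the bulk load; the class, method, and path names are made up for illustration:

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyThenBulkLoadSketch {
  // Copy the prepared hfiles from the remote filesystem onto the filesystem
  // HBase runs on, so the regionserver never performs the copy inside the
  // bulk-load RPC (and the client RPC cannot time out on a long copy).
  public static Path stageLocally(Configuration conf, Path remoteHFileDir,
      Path stagingDir) throws IOException {
    FileSystem srcFs = remoteHFileDir.getFileSystem(conf);
    FileSystem dstFs = stagingDir.getFileSystem(conf);
    FileUtil.copy(srcFs, remoteHFileDir, dstFs, stagingDir,
        false /* don't delete source */, conf);
    // LoadIncrementalHFiles (or completebulkload) would then be pointed at
    // stagingDir instead of the remote directory.
    return stagingDir;
  }
}
{code}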

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6358) Bulkloading from remote filesystem is problematic

2012-07-09 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409896#comment-13409896
 ] 

Harsh J commented on HBASE-6358:


I sort of agree, except it is also more of a best-practice thing. If you bulk 
load remotely with only a single or very few requests per source at a time, and 
with a high RPC timeout at the client (such that it does not retry too often), 
then it should be more tolerable.

But in any case, having the RS do FS copies will indeed make it slow.

I ran into a very similar issue and the tweak I had to suggest was to indeed 
distcp/cp the data first and bulk load next. HBASE-6350 (Logging improvements 
for ops) and HBASE-6339 (Possible optimization, negative in the end) came out 
of it.

> Bulkloading from remote filesystem is problematic
> -
>
> Key: HBASE-6358
> URL: https://issues.apache.org/jira/browse/HBASE-6358
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.94.0
>Reporter: Dave Revell
>Assignee: Dave Revell
>
> Bulk loading hfiles that don't live on the same filesystem as HBase can cause 
> problems for subtle reasons.
> In Store.bulkLoadHFile(), the regionserver will copy the source hfile to its 
> own filesystem if it's not already there. Since this can take a long time for 
> large hfiles, it's likely that the client will timeout and retry. When the 
> client retries repeatedly, there may be several bulkload operations in flight 
> for the same hfile, causing lots of unnecessary IO and tying up handler 
> threads. This can seriously impact performance. In my case, the cluster 
> became unusable and the regionservers had to be kill -9'ed.
> Possible solutions:
>  # Require that hfiles already be on the same filesystem as HBase in order 
> for bulkloading to succeed. The copy could be handled by 
> LoadIncrementalHFiles before the regionserver is called.
>  # Others? I'm not familiar with Hadoop IPC so there may be tricks to extend 
> the timeout or something else.
> I'm willing to write a patch but I'd appreciate recommendations on how to 
> proceed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6350) Some logging improvements for RegionServer bulk loading

2012-07-09 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HBASE-6350:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks Ted!

(Marking as resolved, as I think it may have been accidentally left open)

> Some logging improvements for RegionServer bulk loading
> ---
>
> Key: HBASE-6350
> URL: https://issues.apache.org/jira/browse/HBASE-6350
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Affects Versions: 0.94.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Fix For: 0.96.0
>
> Attachments: HBASE-6350.patch
>
>
> The current logging in the bulk loading RPC call to a RegionServer lacks some 
> info in certain cases. For instance, I recently noticed that it is possible 
> that IOException may be caused during bulk load file transfer (copy) off of 
> another FS and that during the same time the client already times the socket 
> out and thereby does not receive a thrown Exception back remotely (HBase 
> prints a ClosedChannelException for the IPC when it attempts to send the real 
> message, and hence the real cause is lost).
> Improvements around this kind of issue, wherein we could first log the 
> IOException at the RS before sending, and a few other wording improvements 
> are present in my patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6358) Bulkloading from remote filesystem is problematic

2012-07-09 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409898#comment-13409898
 ] 

Todd Lipcon commented on HBASE-6358:


Yea, I originally wrote the code that did the copy, but in hindsight I think it 
was a mistake. I think we should remove that capability and have the code fail 
if the filesystem doesn't match.

> Bulkloading from remote filesystem is problematic
> -
>
> Key: HBASE-6358
> URL: https://issues.apache.org/jira/browse/HBASE-6358
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.94.0
>Reporter: Dave Revell
>Assignee: Dave Revell
>
> Bulk loading hfiles that don't live on the same filesystem as HBase can cause 
> problems for subtle reasons.
> In Store.bulkLoadHFile(), the regionserver will copy the source hfile to its 
> own filesystem if it's not already there. Since this can take a long time for 
> large hfiles, it's likely that the client will timeout and retry. When the 
> client retries repeatedly, there may be several bulkload operations in flight 
> for the same hfile, causing lots of unnecessary IO and tying up handler 
> threads. This can seriously impact performance. In my case, the cluster 
> became unusable and the regionservers had to be kill -9'ed.
> Possible solutions:
>  # Require that hfiles already be on the same filesystem as HBase in order 
> for bulkloading to succeed. The copy could be handled by 
> LoadIncrementalHFiles before the regionserver is called.
>  # Others? I'm not familiar with Hadoop IPC so there may be tricks to extend 
> the timeout or something else.
> I'm willing to write a patch but I'd appreciate recommendations on how to 
> proceed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6345) Utilize fault injection in testing using AspectJ

2012-07-09 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409902#comment-13409902
 ] 

Ming Ma commented on HBASE-6345:


A couple of comments for Todd and others. I did some investigation of my own on 
this topic last week and found that testing the data streaming pipeline in HDFS 
requires fi. Are there any other reasons not to use fi besides "hard to maintain"?

1. I have just gotten maven + aspectJ working for HBase. It does require some 
learning to write an fi test, but it doesn't seem to be that hard.
2. It looks like the CheckpointFaultInjector example above requires a code change 
in Hadoop core; fi doesn't require that. I find it useful to not need a core code 
change to inject a failure at an arbitrary place. On a frequently called function, 
adding a call to a fault injector in the core code might have a perf impact.

> Utilize fault injection in testing using AspectJ
> 
>
> Key: HBASE-6345
> URL: https://issues.apache.org/jira/browse/HBASE-6345
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Ted Yu
>
> HDFS uses fault injection to test pipeline failure in addition to mock, spy. 
> HBase uses mock, spy. But there are cases where mock, spy aren't convenient.
> Some example from DFSClientAspects.aj :
> {code}
>   pointcut pipelineInitNonAppend(DataStreamer datastreamer):
> callCreateBlockOutputStream(datastreamer)
> && cflow(execution(* nextBlockOutputStream(..)))
> && within(DataStreamer);
>   after(DataStreamer datastreamer) returning : 
> pipelineInitNonAppend(datastreamer) {
> LOG.info("FI: after pipelineInitNonAppend: hasError="
> + datastreamer.hasError + " errorIndex=" + datastreamer.errorIndex);
> if (datastreamer.hasError) {
>   DataTransferTest dtTest = DataTransferTestUtil.getDataTransferTest();
>   if (dtTest != null)
> dtTest.fiPipelineInitErrorNonAppend.run(datastreamer.errorIndex);
> }
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6359) KeyValue may return incorrect values after readFields()

2012-07-09 Thread Dave Revell (JIRA)
Dave Revell created HBASE-6359:
--

 Summary: KeyValue may return incorrect values after readFields()
 Key: HBASE-6359
 URL: https://issues.apache.org/jira/browse/HBASE-6359
 Project: HBase
  Issue Type: Bug
Reporter: Dave Revell
Assignee: Dave Revell


When the same KeyValue object is used multiple times for deserialization using 
readFields, some methods may return incorrect values. Here is a sequence of 
operations that will reproduce the problem:

 # A KeyValue is created whose key has length 10. The private field keyLength 
is initialized to 0.
 # KeyValue.getKeyLength() is called. This reads the key length 10 from the 
backing array and caches it in keyLength.
 # KeyValue.readFields() is called to deserialize a new value. The keyLength 
field is not cleared and keeps its value of 10, even though this value is 
probably incorrect.
 # If getKeyLength() is called, the value 10 will be returned.

For example, in a reducer iterating over its Iterable of values, all KeyValues 
after the first one are likely to return incorrect values from getKeyLength().

The solution is to clear all memoized values in KeyValue.readFields(). I'll 
write a patch for this soon.
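Roughly, the fix inside KeyValue.readFields() would look like the sketch below. 
The keyLength reset matches the bug described above; the other cached fields shown 
are illustrative placeholders, since the real class may memoize more or fewer 
values:

{code}
public void readFields(final DataInput in) throws IOException {
  int len = in.readInt();
  byte[] buf = new byte[len];
  in.readFully(buf);
  this.bytes = buf;
  this.offset = 0;
  this.length = len;
  // Clear every memoized value so getters recompute from the new backing array.
  this.keyLength = 0;
  // e.g. also reset any other cached fields the class keeps:
  // this.rowCache = null;
  // this.timestampCache = -1L;
}
{code}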

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3484) Replace memstore's ConcurrentSkipListMap with our own implementation

2012-07-09 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409901#comment-13409901
 ] 

Otis Gospodnetic commented on HBASE-3484:
-

@JD - what would/should the ideal graph look like, roughly?


> Replace memstore's ConcurrentSkipListMap with our own implementation
> 
>
> Key: HBASE-3484
> URL: https://issues.apache.org/jira/browse/HBASE-3484
> Project: HBase
>  Issue Type: Improvement
>  Components: performance
>Affects Versions: 0.92.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: hierarchical-map.txt, memstore_drag.png
>
>
> By copy-pasting ConcurrentSkipListMap into HBase we can make two improvements 
> to it for our use case in MemStore:
> - add an iterator.replace() method which should allow us to do upsert much 
> more cheaply
> - implement a Set directly without having to do Map to 
> save one reference per entry
> It turns out CSLM is in public domain from its development as part of JSR 
> 166, so we should be OK with licenses.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3725) HBase increments from old value after delete and write to disk

2012-07-09 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409910#comment-13409910
 ] 

Lars Hofhansl commented on HBASE-3725:
--

I am not disagreeing, just a note of caution. We waited to fix this in 0.94 until 
we had lazy seek.


> HBase increments from old value after delete and write to disk
> --
>
> Key: HBASE-3725
> URL: https://issues.apache.org/jira/browse/HBASE-3725
> Project: HBase
>  Issue Type: Bug
>  Components: io, regionserver
>Affects Versions: 0.90.1
>Reporter: Nathaniel Cook
>Assignee: Jonathan Gray
> Attachments: HBASE-3725-0.92-V1.patch, HBASE-3725-0.92-V2.patch, 
> HBASE-3725-0.92-V3.patch, HBASE-3725-0.92-V4.patch, HBASE-3725-Test-v1.patch, 
> HBASE-3725-v3.patch, HBASE-3725.patch
>
>
> Deleted row values are sometimes used for starting points on new increments.
> To reproduce:
> Create a row "r". Set column "x" to some default value.
> Force hbase to write that value to the file system (such as restarting the 
> cluster).
> Delete the row.
> Call table.incrementColumnValue with "some_value"
> Get the row.
> The returned value in the column was incremented from the old value before 
> the row was deleted instead of being initialized to "some_value".
> Code to reproduce:
> {code}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.HColumnDescriptor;
> import org.apache.hadoop.hbase.HTableDescriptor;
> import org.apache.hadoop.hbase.client.Delete;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HBaseAdmin;
> import org.apache.hadoop.hbase.client.HTableInterface;
> import org.apache.hadoop.hbase.client.HTablePool;
> import org.apache.hadoop.hbase.client.Increment;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.util.Bytes;
> public class HBaseTestIncrement
> {
>   static String tableName  = "testIncrement";
>   static byte[] infoCF = Bytes.toBytes("info");
>   static byte[] rowKey = Bytes.toBytes("test-rowKey");
>   static byte[] newInc = Bytes.toBytes("new");
>   static byte[] oldInc = Bytes.toBytes("old");
>   /**
>* This code reproduces a bug with increment column values in hbase
>* Usage: First run part one by passing '1' as the first arg
>*Then restart the hbase cluster so it writes everything to disk
>*Run part two by passing '2' as the first arg
>*
>* This will result in the old deleted data being found and used for 
> the increment calls
>*
>* @param args
>* @throws IOException
>*/
>   public static void main(String[] args) throws IOException
>   {
>   if("1".equals(args[0]))
>   partOne();
>   if("2".equals(args[0]))
>   partTwo();
>   if ("both".equals(args[0]))
>   {
>   partOne();
>   partTwo();
>   }
>   }
>   /**
>* Creates a table and increments a column value 10 times by 10 each 
> time.
>* Results in a value of 100 for the column
>*
>* @throws IOException
>*/
>   static void partOne()throws IOException
>   {
>   Configuration conf = HBaseConfiguration.create();
>   HBaseAdmin admin = new HBaseAdmin(conf);
>   HTableDescriptor tableDesc = new HTableDescriptor(tableName);
>   tableDesc.addFamily(new HColumnDescriptor(infoCF));
>   if(admin.tableExists(tableName))
>   {
>   admin.disableTable(tableName);
>   admin.deleteTable(tableName);
>   }
>   admin.createTable(tableDesc);
>   HTablePool pool = new HTablePool(conf, Integer.MAX_VALUE);
>   HTableInterface table = pool.getTable(Bytes.toBytes(tableName));
>   //Increment unitialized column
>   for (int j = 0; j < 10; j++)
>   {
>   table.incrementColumnValue(rowKey, infoCF, oldInc, 
> (long)10);
>   Increment inc = new Increment(rowKey);
>   inc.addColumn(infoCF, newInc, (long)10);
>   table.increment(inc);
>   }
>   Get get = new Get(rowKey);
>   Result r = table.get(get);
>   System.out.println("initial values: new " + 
> Bytes.toLong(r.getValue(infoCF, newInc)) + " old " + 
> Bytes.toLong(r.getValue(infoCF, oldInc)));
>   }
>   /**
>* First deletes the data then increments the column 10 times by

[jira] [Commented] (HBASE-6345) Utilize fault injection in testing using AspectJ

2012-07-09 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409915#comment-13409915
 ] 

Todd Lipcon commented on HBASE-6345:


bq. A couple of comments for Todd and others. I did some investigation of my own 
on this topic last week and found that testing the data streaming pipeline in HDFS 
requires fi. Are there any other reasons not to use fi besides "hard to maintain"?

Well, it used to, but when we mavenized, we lost the ability to actually run 
those tests. So if they're still in the code base, they're not getting compiled 
or run, and I doubt they work anymore.

bq. I have just gotten maven + aspectJ working for HBase. It does require some 
learning to write an fi test, but it doesn't seem to be that hard.

It's not that it's super hard, but it does take many hours of learning to be 
able to write even the simplest test. I found this to be a big pain when 
refactoring or otherwise changing the Hadoop code. I wanted to restructure a 
bit of that code at one point, and it basically took me a day to learn enough 
AspectJ to figure out how to do it - there are lots of new concepts like 
"advices" and "pointcuts" which need to be understood before you can get very 
far. In contrast, the simple Java approach is immediately obvious to anyone who 
looks at the code, and also "plays nice" with IDE features - for example, you 
can right click on the fault point you're interested in, search for references, 
and see all of the unit tests which touch this fault injection point.


bq. It looks like the CheckpointFaultInjector example above requires a code change 
in Hadoop core; fi doesn't require that. I find it useful to not need a core code 
change to inject a failure at an arbitrary place. On a frequently called function, 
adding a call to a fault injector in the core code might have a perf impact.

In a JITted language, the perf impact of this should be zero. The reason is that 
the JVM can figure out that there's only one implementation of the 
method being called (since the actual fault injection implementation hasn't 
been classloaded). So, it directly inlines the empty method, and thus 
disappears entirely.

I've heard the argument before that the fault points clutter the production 
code, and people find that ugly. I personally take the opposite opinion here: 
the fact that there are fault hooks in the non-test code means that anyone else 
coming alone will better understand what kind of faults are important to 
consider for the code in question. And, when they change the code, they'll be 
much more aware of the potential faults they need to take into account.
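For reference, the plain-Java pattern being described looks roughly like this; the 
class and method names are made up for illustration and are not actual HBase/HDFS 
code:

{code}
// Empty hooks live in production code; tests swap in a subclass that throws or
// sleeps. With no subclass loaded, the JIT inlines the empty methods away.
class FaultInjector {
  private static FaultInjector instance = new FaultInjector();

  static FaultInjector get() { return instance; }

  static void set(FaultInjector fi) { instance = fi; } // visible for testing

  void beforeFlush() {}   // production code calls FaultInjector.get().beforeFlush()
  void afterWalSync() {}  // ...at whatever fault points matter
}

// A unit test then overrides a hook to inject the fault it wants to exercise:
//   FaultInjector.set(new FaultInjector() {
//     @Override void beforeFlush() { throw new RuntimeException("injected"); }
//   });
{code}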

> Utilize fault injection in testing using AspectJ
> 
>
> Key: HBASE-6345
> URL: https://issues.apache.org/jira/browse/HBASE-6345
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Ted Yu
>
> HDFS uses fault injection to test pipeline failure in addition to mock, spy. 
> HBase uses mock, spy. But there are cases where mock, spy aren't convenient.
> Some example from DFSClientAspects.aj :
> {code}
>   pointcut pipelineInitNonAppend(DataStreamer datastreamer):
> callCreateBlockOutputStream(datastreamer)
> && cflow(execution(* nextBlockOutputStream(..)))
> && within(DataStreamer);
>   after(DataStreamer datastreamer) returning : 
> pipelineInitNonAppend(datastreamer) {
> LOG.info("FI: after pipelineInitNonAppend: hasError="
> + datastreamer.hasError + " errorIndex=" + datastreamer.errorIndex);
> if (datastreamer.hasError) {
>   DataTransferTest dtTest = DataTransferTestUtil.getDataTransferTest();
>   if (dtTest != null)
> dtTest.fiPipelineInitErrorNonAppend.run(datastreamer.errorIndex);
> }
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5705) Introduce Protocol Buffer RPC engine

2012-07-09 Thread Zhihong Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Ted Yu updated HBASE-5705:
--

Fix Version/s: 0.96.0
   Status: Patch Available  (was: Open)

> Introduce Protocol Buffer RPC engine
> 
>
> Key: HBASE-5705
> URL: https://issues.apache.org/jira/browse/HBASE-5705
> Project: HBase
>  Issue Type: Sub-task
>  Components: ipc, master, migration, regionserver
>Reporter: Devaraj Das
>Assignee: Devaraj Das
> Fix For: 0.96.0
>
> Attachments: 5705-1.patch, 5705-2.1.patch
>
>
> Introduce Protocol Buffer RPC engine in the RPC core. Protocols that are PB 
> aware can be made to go through this RPC engine. The approach, in my current 
> thinking, would be similar to HADOOP-7773.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5705) Introduce Protocol Buffer RPC engine

2012-07-09 Thread Zhihong Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Ted Yu updated HBASE-5705:
--

Attachment: 5705-2.1.patch

Patch from Devaraj.
I verified that it compiles against latest trunk.

> Introduce Protocol Buffer RPC engine
> 
>
> Key: HBASE-5705
> URL: https://issues.apache.org/jira/browse/HBASE-5705
> Project: HBase
>  Issue Type: Sub-task
>  Components: ipc, master, migration, regionserver
>Reporter: Devaraj Das
>Assignee: Devaraj Das
> Fix For: 0.96.0
>
> Attachments: 5705-1.patch, 5705-2.1.patch
>
>
> Introduce Protocol Buffer RPC engine in the RPC core. Protocols that are PB 
> aware can be made to go through this RPC engine. The approach, in my current 
> thinking, would be similar to HADOOP-7773.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6337) [MTTR] Remove renaming tmp log file in SplitLogManager

2012-07-09 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409929#comment-13409929
 ] 

Lars Hofhansl commented on HBASE-6337:
--

The change itself looks good to me. If writing directly into recovered.edits is 
not a problem: +1


> [MTTR] Remove renaming tmp log file in SplitLogManager 
> ---
>
> Key: HBASE-6337
> URL: https://issues.apache.org/jira/browse/HBASE-6337
> Project: HBase
>  Issue Type: Bug
>Reporter: chunhui shen
>Assignee: chunhui shen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6337v1.patch, HBASE-6337v2.patch, 
> HBASE-6337v3.patch
>
>
> As HBASE-6309 mentioned, we also encounter problem of 
> distributed-log-splitting take much more time than matser-local-log-splitting 
> because lots of SplitLogManager 's renaming operations when finishing task.
> Could we try to remove renaming tmp log file in SplitLogManager through 
> splitting log to regions' recover.edits directory directly as the same as the 
> master-local-log-splitting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6346) Observed excessive CPU on "quiescent" cluster

2012-07-09 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409946#comment-13409946
 ] 

Jean-Daniel Cryans commented on HBASE-6346:
---

Yeah what's that config Andy? I may have encountered the same problem.

> Observed excessive CPU on "quiescent" cluster
> -
>
> Key: HBASE-6346
> URL: https://issues.apache.org/jira/browse/HBASE-6346
> Project: HBase
>  Issue Type: Bug
> Environment: Sun JRE 6u33, Hadoop 2.0.1-alpha, HBase 0.94.1-SNAPSHOT
>Reporter: Andrew Purtell
> Attachments: graph.gif, jstack.txt, runnable_cpu_profile.png
>
>
> I've only started looking at this but became concerned when a completely 
> quiescent cluster loads up 20% CPU (system+user) once HBase comes up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5997) Fix concerns raised in HBASE-5922 related to HalfStoreFileReader

2012-07-09 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409952#comment-13409952
 ] 

Lars Hofhansl commented on HBASE-5997:
--

BTW. HTableInterface.getRowOrBefore is deprecated. Nobody should be using that.
Are the issues you address here seen during our particular internal meta 
lookups (which are special cases)?

(It's time to fix HBASE-2600)


> Fix concerns raised in HBASE-5922 related to HalfStoreFileReader
> 
>
> Key: HBASE-5997
> URL: https://issues.apache.org/jira/browse/HBASE-5997
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.6, 0.92.1, 0.94.0, 0.96.0
>Reporter: ramkrishna.s.vasudevan
>Assignee: Anoop Sam John
> Fix For: 0.94.1
>
> Attachments: HBASE-5997_0.94.patch, HBASE-5997_94 V2.patch, 
> Testcase.patch.txt
>
>
> Pls refer to the comment
> https://issues.apache.org/jira/browse/HBASE-5922?focusedCommentId=13269346&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13269346.
> Raised this issue to solve that comment. Just incase we don't forget it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6323) [replication] most of the source metrics are wrong when there's multiple slaves

2012-07-09 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409953#comment-13409953
 ] 

Jean-Daniel Cryans commented on HBASE-6323:
---

I did a review with Elliott; basically there are three "Copyright 2010" headers 
to remove, and ReplicationMetricsSource needs some refactoring because right now 
there are three methods that are almost identical.

> [replication] most of the source metrics are wrong when there's multiple 
> slaves
> ---
>
> Key: HBASE-6323
> URL: https://issues.apache.org/jira/browse/HBASE-6323
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.1, 0.94.0
>Reporter: Jean-Daniel Cryans
>Assignee: Elliott Clark
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6323-0.patch, HBASE-6323-1.patch
>
>
> Most of the metrics in replication were written with 1 slave in mind but with 
> multiple slaves the issue really shows. Most of the metrics are set directly:
> {code}
> public void enqueueLog(Path log) {
>   this.queue.put(log);
>   this.metrics.sizeOfLogQueue.set(queue.size());
> }
> {code}
> So {{sizeOfLogQueue}} is always showing the size of the queue that updated 
> the metric last.
> I'm not sure what's the right way to fix this since we can't have dynamic 
> metrics. Merging them would work here but it wouldn't work so well with 
> {{ageOfLastShippedOp}} since the age can be different and it definitely 
> cannot be summed.
> Assigning to Elliott since he seems to dig metrics these days. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6323) [replication] most of the source metrics are wrong when there's multiple slaves

2012-07-09 Thread Elliott Clark (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elliott Clark updated HBASE-6323:
-

Attachment: HBASE-6323-2.patch

Removed the copyright headers and refactored the similar code.

> [replication] most of the source metrics are wrong when there's multiple 
> slaves
> ---
>
> Key: HBASE-6323
> URL: https://issues.apache.org/jira/browse/HBASE-6323
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.1, 0.94.0
>Reporter: Jean-Daniel Cryans
>Assignee: Elliott Clark
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6323-0.patch, HBASE-6323-1.patch, 
> HBASE-6323-2.patch
>
>
> Most of the metrics in replication were written with 1 slave in mind but with 
> multiple slaves the issue really shows. Most of the metrics are set directly:
> {code}
> public void enqueueLog(Path log) {
>   this.queue.put(log);
>   this.metrics.sizeOfLogQueue.set(queue.size());
> }
> {code}
> So {{sizeOfLogQueue}} is always showing the size of the queue that updated 
> the metric last.
> I'm not sure what's the right way to fix this since we can't have dynamic 
> metrics. Merging them would work here but it wouldn't work so well with 
> {{ageOfLastShippedOp}} since the age can be different and it definitely 
> cannot be summed.
> Assigning to Elliott since he seems to dig metrics these days. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6360) Thrift proxy does not emit runtime metrics

2012-07-09 Thread Karthik Ranganathan (JIRA)
Karthik Ranganathan created HBASE-6360:
--

 Summary: Thrift proxy does not emit runtime metrics
 Key: HBASE-6360
 URL: https://issues.apache.org/jira/browse/HBASE-6360
 Project: HBase
  Issue Type: Bug
  Components: thrift
Reporter: Karthik Ranganathan
Assignee: Michal Gregorczyk


Open jconsole against a Thrift proxy, and you will not find the runtime stats 
that it should be exporting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5705) Introduce Protocol Buffer RPC engine

2012-07-09 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409983#comment-13409983
 ] 

Devaraj Das commented on HBASE-5705:


I updated RB with the new patch. Yes, the PB stuff still goes via Writables. Some 
of it could be put behind if/else logic for PB vs. Writable, but when I tried that 
it seemed to make the code complex. So I am thinking of doing the work of making 
RPC use as much of PB as possible as a follow-up (and, at that time, removing the 
WritableRpcEngine/Invocation classes), once all the application protocols are 
converted to PB. Open to feedback on this aspect. Once the above is done, the PB 
RPC engine should make things more efficient by avoiding copies in the RPC layer. 
For example, if the RPC layer needs to write a {message-length, message} pair, the 
current RPC has to serialize the Writable object into a buffer and then take the 
length of that buffer; in the PB world, every message has a generated 
getSerializedSize method.
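To make that last point concrete, a minimal sketch of the length-prefixed write 
(PbRpcWriter is just an illustrative name, not a class in the patch):

{code}
import java.io.DataOutputStream;
import java.io.IOException;

import com.google.protobuf.Message;

class PbRpcWriter {
  static void writeDelimited(DataOutputStream out, Message msg) throws IOException {
    out.writeInt(msg.getSerializedSize()); // length known up front, no staging buffer
    msg.writeTo(out);                      // serialize straight onto the stream
  }
}
{code}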

> Introduce Protocol Buffer RPC engine
> 
>
> Key: HBASE-5705
> URL: https://issues.apache.org/jira/browse/HBASE-5705
> Project: HBase
>  Issue Type: Sub-task
>  Components: ipc, master, migration, regionserver
>Reporter: Devaraj Das
>Assignee: Devaraj Das
> Fix For: 0.96.0
>
> Attachments: 5705-1.patch, 5705-2.1.patch
>
>
> Introduce Protocol Buffer RPC engine in the RPC core. Protocols that are PB 
> aware can be made to go through this RPC engine. The approach, in my current 
> thinking, would be similar to HADOOP-7773.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6361) Change the compaction queue to a round robin scheduler

2012-07-09 Thread Akashnil (JIRA)
Akashnil created HBASE-6361:
---

 Summary: Change the compaction queue to a round robin scheduler
 Key: HBASE-6361
 URL: https://issues.apache.org/jira/browse/HBASE-6361
 Project: HBase
  Issue Type: Improvement
Reporter: Akashnil


Currently the compaction requests are submitted to the minor/major compaction 
queue of a region server from every column family/region belonging to it. The 
queue processes those requests in FIFO order (first in, first out). We want to put 
a lazy scheduler in place of the current one. The idea of lazy scheduling is that 
it is better to make a decision (compaction selection) later if the decision only 
matters later. Currently, if the queue grows large, newly generated requests are 
not processed until all the preceding requests have executed. Instead, we can 
postpone compaction selection until a column family is actually visited for 
compaction, when we will have more information (new flush files will have affected 
the state) to make a better decision.

In place of the queue, we propose to implement a round-robin scheduler. All the 
column families in all regions will be visited in sequence periodically. On each 
visit, if the column family generates a valid compaction request, the request is 
executed before moving on to the next one. We do not plan to change the current 
compaction algorithm for now. We expect that it will automatically make better 
decisions simply by doing just-in-time selection. How do we know that? Let us 
consider an example.

Note that the existing compaction queue is only relevant as a buffer: when flushes 
out-pace compactions for a period of time, or a relatively large compaction takes 
a long time to complete, the queue accumulates requests. Suppose such a scenario 
has occurred, and suppose min-files for compaction = 4. For an active column 
family, new compaction requests, each of size 4, will be added to the queue 
continuously until the queue starts processing them.

Now consider a round-robin scheduler. A bottleneck caused by the IO rate of 
compaction results in a longer latency before the same column family is visited 
again. Suppose that by that time there are 16 new flush files in this column 
family. The compaction selection algorithm will then select one compaction request 
of size 16, as opposed to the 4 compaction requests of size 4 that would have been 
generated in the previous case.

A compaction request over 16 flush files is more IOPs-efficient than the same set 
of files being compacted 4 at a time. Both consume the same total reads, total 
writes, and IOPs/sec, but the former produces one file of size 16 instead of 4 
files of size 4. So we effectively get the 4*4 -> 16 compaction for free. With the 
queue, those smaller files would have consumed more IOPs to become bigger later.

Under uniform steady-state load this change should not make a difference, because 
the compaction queue would have been empty anyway. However, under bursty load it 
automatically adapts to consume fewer IOPs in times of high flush rate. This 
negative feedback should mainly improve the failure resistance of the system. In 
case something goes wrong, monitoring should still give feedback, not in the form 
of queue size but as the number of files in each compaction, which will go up when 
the bottleneck occurs. If there are no important downsides, this should be a very 
good change, since it should apply to all use cases.
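To sketch the proposed visit order (the interfaces below are simplified 
placeholders, not the real Store/compaction code):

{code}
import java.util.List;

interface CompactableStore {
  boolean needsCompaction();    // e.g. eligible flush-file count >= min files
  void compactSelectedFiles();  // run the selection made at visit time
}

class RoundRobinCompactionScheduler implements Runnable {
  private final List<CompactableStore> stores; // all column families on this RS
  private volatile boolean running = true;

  RoundRobinCompactionScheduler(List<CompactableStore> stores) {
    this.stores = stores;
  }

  public void stop() { running = false; }

  @Override
  public void run() {
    int i = 0;
    while (running && !stores.isEmpty()) {
      CompactableStore store = stores.get(i);
      // Selection happens lazily at visit time, so it sees every flush that
      // accumulated since the last visit (e.g. one request over 16 files
      // instead of four requests over 4 files each).
      if (store.needsCompaction()) {
        store.compactSelectedFiles();
      }
      i = (i + 1) % stores.size();
      // A real scheduler would throttle or sleep between full passes;
      // omitted here for brevity.
    }
  }
}
{code}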

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6362) Enhance test-patch.sh script to recognize images / non-trunk patches

2012-07-09 Thread Zhihong Ted Yu (JIRA)
Zhihong Ted Yu created HBASE-6362:
-

 Summary: Enhance test-patch.sh script to recognize images / 
non-trunk patches
 Key: HBASE-6362
 URL: https://issues.apache.org/jira/browse/HBASE-6362
 Project: HBase
  Issue Type: Bug
Reporter: Zhihong Ted Yu


When a user uploads logs / images / non-trunk patches, Hadoop QA would complain 
that the file couldn't be applied as a patch (for trunk).

We should make this script smarter by recognizing image files and non-trunk 
patches.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6337) [MTTR] Remove renaming tmp log file in SplitLogManager

2012-07-09 Thread chunhui shen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410011#comment-13410011
 ] 

chunhui shen commented on HBASE-6337:
-

bq.Have you checked out the open region side of the affair to see if any 
conditions under which we might bungle the replay of recovered.edits files? 
I think it won't happen: when splitting the log, the edits file name ends with 
".temp", and replaying edits will skip those files.

> [MTTR] Remove renaming tmp log file in SplitLogManager 
> ---
>
> Key: HBASE-6337
> URL: https://issues.apache.org/jira/browse/HBASE-6337
> Project: HBase
>  Issue Type: Bug
>Reporter: chunhui shen
>Assignee: chunhui shen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6337v1.patch, HBASE-6337v2.patch, 
> HBASE-6337v3.patch
>
>
> As HBASE-6309 mentioned, we also encounter problem of 
> distributed-log-splitting take much more time than matser-local-log-splitting 
> because lots of SplitLogManager 's renaming operations when finishing task.
> Could we try to remove renaming tmp log file in SplitLogManager through 
> splitting log to regions' recover.edits directory directly as the same as the 
> master-local-log-splitting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6337) [MTTR] Remove renaming tmp log file in SplitLogManager

2012-07-09 Thread chunhui shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-6337:


Attachment: HBASE-6337v4.patch

Uploading patch v4: renamed method moveSplitLogFile to finishSplitLogFile.

> [MTTR] Remove renaming tmp log file in SplitLogManager 
> ---
>
> Key: HBASE-6337
> URL: https://issues.apache.org/jira/browse/HBASE-6337
> Project: HBase
>  Issue Type: Bug
>Reporter: chunhui shen
>Assignee: chunhui shen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: HBASE-6337v1.patch, HBASE-6337v2.patch, 
> HBASE-6337v3.patch, HBASE-6337v4.patch
>
>
> As HBASE-6309 mentioned, we also encounter problem of 
> distributed-log-splitting take much more time than matser-local-log-splitting 
> because lots of SplitLogManager 's renaming operations when finishing task.
> Could we try to remove renaming tmp log file in SplitLogManager through 
> splitting log to regions' recover.edits directory directly as the same as the 
> master-local-log-splitting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6362) Enhance test-patch.sh script to recognize images / non-trunk patches

2012-07-09 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410021#comment-13410021
 ] 

Zhihong Ted Yu commented on HBASE-6362:
---

The filenames of attached patches can vary widely. My first attempt is to use a 
simple heuristic for determining patches for trunk.

A patch for trunk should have a filename ending in .txt or .patch.
The filename can use '-' as a delimiter between components.
It should include 'trunk' or 'TRUNK' as the last or second-to-last component (if 
versioning is involved).

The following are examples of accepted filenames:

hbase-6311-trunk.patch
HBASE-6311-trunk-v1.txt
6311-trunk.txt
6311-trunk-v9.patch
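
One way to pin the heuristic down, written as a Java regex purely for illustration 
(the actual test-patch.sh change would be in shell):

{code}
import java.util.regex.Pattern;

class TrunkPatchNameCheck {
  // '-'-separated components, 'trunk'/'TRUNK' as the last or second-to-last
  // component, an optional version suffix like v9, ending in .patch or .txt.
  private static final Pattern TRUNK_PATCH =
      Pattern.compile(".*-(trunk|TRUNK)(-v?\\d+)?\\.(patch|txt)$");

  static boolean looksLikeTrunkPatch(String filename) {
    return TRUNK_PATCH.matcher(filename).matches();
  }
}
{code}

All four example filenames above match this pattern.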

Please comment.

> Enhance test-patch.sh script to recognize images / non-trunk patches
> 
>
> Key: HBASE-6362
> URL: https://issues.apache.org/jira/browse/HBASE-6362
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Ted Yu
>
> When user uploads logs / images / non-trunk patches, Hadoop QA would complain 
> that the file couldn't be applied as a patch (for trunk).
> We should make this script smarter by recognizing image files and non-trunk 
> patches.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-6357:
---

Status: Open  (was: Patch Available)

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: 6357-trunk.patch
>
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-6357:
---

Status: Patch Available  (was: Open)

try hadoopqa again

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: 6357-trunk.patch
>
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-6357:
---

Component/s: master

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: 6357-trunk.patch
>
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6331) Problem with HBCK mergeOverlaps

2012-07-09 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410030#comment-13410030
 ] 

Anoop Sam John commented on HBASE-6331:
---

@Jon
Sure, I will add a test case. I have a doubt whether there are some other problems 
in the related area; in any case, the tests will reveal that.

> Problem with HBCK mergeOverlaps
> ---
>
> Key: HBASE-6331
> URL: https://issues.apache.org/jira/browse/HBASE-6331
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Reporter: Anoop Sam John
>Assignee: Anoop Sam John
> Fix For: 0.96.0, 0.94.1
>
> Attachments: HBASE-6331_94.patch, HBASE-6331_Trunk.patch
>
>
> In HDFSIntegrityFixer#mergeOverlaps(), there is a logic to create the final 
> range of the region after the overlap.
> I can see one issue with this code
> {code}
> if (RegionSplitCalculator.BYTES_COMPARATOR
> .compare(hi.getEndKey(), range.getSecond()) > 0) {
>   range.setSecond(hi.getEndKey());
> }
> {code}
> Here suppose the regions include the end region for which the endKey will be 
> empty, we need to get finally the range with endkey as empty byte[]
> But as per the above logic it will see that any other key greater than the 
> empty byte[] and will set it.
> Finally the new region created will not get endkey as empty byte[]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6357) Failed distributed log splitting stuck on master web UI

2012-07-09 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410033#comment-13410033
 ] 

Zhihong Ted Yu commented on HBASE-6357:
---

@Jimmy:
Looks like Hadoop QA has been dormant for 20 hours.

> Failed distributed log splitting stuck on master web UI
> ---
>
> Key: HBASE-6357
> URL: https://issues.apache.org/jira/browse/HBASE-6357
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: 6357-trunk.patch
>
>
> Failed distributed log splitting MonitoredTask is stuck on the master web UI 
> since it is not aborted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4379) [hbck] Does not complain about tables with no end region [Z,]

2012-07-09 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410086#comment-13410086
 ] 

Anoop Sam John commented on HBASE-4379:
---

Thanks Jon
bq.Let's create a new issue for the new "disabling" test case with the 
offline/disable regions, and discuss there?
Fine

> [hbck] Does not complain about tables with no end region [Z,]
> -
>
> Key: HBASE-4379
> URL: https://issues.apache.org/jira/browse/HBASE-4379
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Affects Versions: 0.90.5, 0.92.0, 0.94.0, 0.96.0
>Reporter: Jonathan Hsieh
>Assignee: Anoop Sam John
> Fix For: 0.90.7, 0.92.2, 0.96.0, 0.94.1
>
> Attachments: 
> 0001-HBASE-4379-hbck-does-not-complain-about-tables-with-.patch, 
> HBASE-4379_94.patch, HBASE-4379_94_V2.patch, HBASE-4379_Trunk.patch, 
> TestcaseForDisabledTableIssue.patch, hbase-4379-90.patch, 
> hbase-4379-92.patch, hbase-4379.v2.patch
>
>
> hbck does not detect or have an error condition when the last region of a 
> table is missing (end key != '').

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira