[jira] [Commented] (HBASE-8362) Possible MultiGet optimization
[ https://issues.apache.org/jira/browse/HBASE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634870#comment-13634870 ]

Anoop Sam John commented on HBASE-8362:
---
Not using row blooms was the thing that came to my mind as well when I saw this issue. We need to see how we can (can we?) make use of the row bloom with seeks. I will explore options in that area, as this will help in some of our use cases as well (not multi get).

Possible MultiGet optimization
--
Key: HBASE-8362
URL: https://issues.apache.org/jira/browse/HBASE-8362
Project: HBase
Issue Type: Bug
Reporter: Lars Hofhansl

Currently MultiGets are executed on a RegionServer in a single thread, in a loop that handles each Get separately (opening a scanner, seeking, etc.). It seems we could optimize this (per region at least) by opening a single scanner and issuing a reseek for each Get that was requested. I have not tested this yet and have no patch, but I would like to solicit feedback on this idea.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
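The optimization proposed above -- one scanner issuing a reseek per Get instead of opening a fresh scanner per Get -- can be sketched in plain Java over a sorted map standing in for a region's store. All names here are hypothetical illustrations, not the actual HBase scanner API:

```java
import java.util.*;

// Sketch: serve a batch of Get keys with ONE forward-only scanner that is
// "reseeked" from key to key, instead of a new scanner per key.
// Results come back in sorted-key order, not request order.
public class MultiGetSketch {
    static List<String> multiGet(NavigableMap<String, String> region, List<String> keys) {
        List<String> results = new ArrayList<>();
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted); // a reseek can only move forward
        Iterator<Map.Entry<String, String>> scanner = region.entrySet().iterator();
        Map.Entry<String, String> cur = scanner.hasNext() ? scanner.next() : null;
        for (String key : sorted) {
            // "reseek": advance the single scanner until we reach the key
            while (cur != null && cur.getKey().compareTo(key) < 0) {
                cur = scanner.hasNext() ? scanner.next() : null;
            }
            results.add(cur != null && cur.getKey().equals(key) ? cur.getValue() : null);
        }
        return results;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> region = new TreeMap<>();
        region.put("row1", "a"); region.put("row3", "b"); region.put("row5", "c");
        System.out.println(multiGet(region, Arrays.asList("row3", "row1", "row4"))); // [a, b, null]
    }
}
```

The win is that seek state (block index position, loaded blocks) is reused across Gets in the same region, at the cost of having to sort the requested keys first.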
[jira] [Commented] (HBASE-8319) hbase-it tests are run when you ask to run all tests -- it should be that you have to ask explicitly to run them
[ https://issues.apache.org/jira/browse/HBASE-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634874#comment-13634874 ]

Nicolas Liochon commented on HBASE-8319:

bq. Can this be moved into a script in dev-tools?

IMHO, the fewer moving parts in Jenkins, the better. It adds a level of indirection, so it makes things more complex and more fragile. See, for example, the Jenkins configuration that said 1.7 while the script was pretending to use OpenJDK 1.6, before finally running with Oracle 1.6.

bq. Who controls the jenkins config screen?

It's an access right per project. I have it, Ted has it, and you could probably have it if you ask Stack (if you don't have it already, which is possible as well).

bq. I wonder how it didn't happen before. Must be something in our pom that changed.

It's unlikely, as it's hardcoded in Maven (that's the advantage of Maven). More probably, it was changed a while ago, and as we were failing 90% of the time, if not more, in the hbase-server part, no one saw the impact on hbase-it.

hbase-it tests are run when you ask to run all tests -- it should be that you have to ask explicitly to run them
--
Key: HBASE-8319
URL: https://issues.apache.org/jira/browse/HBASE-8319
Project: HBase
Issue Type: Task
Reporter: stack
Assignee: Sergey Shelukhin
Priority: Critical
Fix For: 0.95.1

Up on trunk and on the 0.95 Apache builds, Sergey noticed that hbase-it tests are running. Up to this point, the convention was that you had to explicitly ask for them to run, but that changed somehow recently. These tests are heavyweight, take a long time to complete, and are very likely to fail on the Apache infra (which is what we want of them, but not as part of the general proofing build). For now the tests have been artificially disabled on builds.apache.org, but their inclusion likely frustrates joe blow trying to do a local hbase packaging.
[jira] [Commented] (HBASE-8362) Possible MultiGet optimization
[ https://issues.apache.org/jira/browse/HBASE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634877#comment-13634877 ]

Nicolas Liochon commented on HBASE-8362:

bq. It is not really that common to provide a filter for Gets and even timeranges are not used that often, so we could just only do this for Get with either of those.

We could change the API to support a single timerange and filter for a multiget. I bet it would cover 99% of the use cases.
[jira] [Updated] (HBASE-7239) Verify protobuf serialization is correctly chunking upon read to avoid direct memory OOMs
[ https://issues.apache.org/jira/browse/HBASE-7239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-7239:
---
Attachment: 7239-1.patch

Patch that mimics what was done in the 0.94 codebase for the Result class.

Verify protobuf serialization is correctly chunking upon read to avoid direct memory OOMs
-
Key: HBASE-7239
URL: https://issues.apache.org/jira/browse/HBASE-7239
Project: HBase
Issue Type: Sub-task
Reporter: Lars Hofhansl
Priority: Critical
Fix For: 0.95.1
Attachments: 7239-1.patch

Result.readFields() used to read from the input stream in 8k chunks to avoid OOM issues with direct memory. (Reading variable-sized chunks into direct memory prevents the JVM from reusing the allocated direct memory, and direct memory is only collected during full GCs.) This is just to verify that protobuf's parseFrom-type methods do the right thing as well, so that we do not reintroduce this problem.
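The fixed-chunk read pattern the description refers to can be sketched generically (this is an illustration of the technique, not the actual Result.readFields() code): read a payload of known length in fixed 8 KB requests, so any NIO layer underneath only ever needs one reusable direct-buffer size instead of a new allocation per payload size.

```java
import java.io.*;

// Sketch: read `len` bytes from a stream in fixed 8 KB chunks, so a
// direct-buffer-backed layer underneath can reuse one buffer size
// instead of allocating variable-sized direct memory per request.
public class ChunkedRead {
    static final int CHUNK = 8 * 1024;

    static byte[] readFully(InputStream in, int len) throws IOException {
        byte[] out = new byte[len];
        int off = 0;
        while (off < len) {
            int want = Math.min(CHUNK, len - off); // never request more than 8 KB at once
            int got = in.read(out, off, want);
            if (got < 0) throw new EOFException("stream ended at " + off + " of " + len);
            off += got;
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20_000]; // larger than two chunks
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        byte[] copy = readFully(new ByteArrayInputStream(data), data.length);
        System.out.println(java.util.Arrays.equals(data, copy)); // true
    }
}
```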
[jira] [Commented] (HBASE-8362) Possible MultiGet optimization
[ https://issues.apache.org/jira/browse/HBASE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634884#comment-13634884 ]

Varun Sharma commented on HBASE-8362:
-
Or add a new API and retain the older API for the exotic 1%?
[jira] [Comment Edited] (HBASE-7239) Verify protobuf serialization is correctly chunking upon read to avoid direct memory OOMs
[ https://issues.apache.org/jira/browse/HBASE-7239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634883#comment-13634883 ]

Devaraj Das edited comment on HBASE-7239 at 4/18/13 6:19 AM:
-
Patch that mimics what was done in the 0.94 codebase for the Result class. [~stack] [~lhofhansl], please have a look.

was (Author: devaraj): Patch that mimics what was done in the 0.94 codebase for the Result class.
[jira] [Updated] (HBASE-8214) Remove proxy and engine, rely directly on pb generated Service
[ https://issues.apache.org/jira/browse/HBASE-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-8214:
-
Attachment: 8214v2.txt

{code}
Removes engine and proxy. Everywhere we now use the pb Interface explicitly with no shim or decoration (the BlockingInterface to be more specific). Compiles. Most tests pass. Need to fix the remainder.

Pluses:
+ Regularizes our rpc. No more voodoo.
+ Removes at least two layers. We could remove more if we go mess with protoc generation (as per Gary, as per pb recommendation).

Cons:
+ Looks a bit ugly. It is ugly the way we now do security information. Previously, the kerberos and token info was an interface that the rpc interface implemented, but now, because pb does the server and stub implementation creation and they cannot be altered -- not w/o protoc messing -- we have to pass the rpc interface and its security info separately (you cannot take the BlockingService or BlockingStub class and get the token and kerberos Interfaces from them, not w/o a bunch of ugly delegations).

Adds a class per rpc interface named *SecurityInfo. For example, ClientServiceSecurityInfo and AdminServiceSecurityInfo. Their only purpose is carrying the Kerberos and Token info. Gets passed when we set up an rpcserver and when we make an rpcclient stub.

Removes the now useless IpcProtocol, and ditto for interfaces that extended the pb blockinginterfaces such as MasterAdminProtocol. Instead, we now pass the raw BlockingInterface, and the kerberos and token interfaces are moved to the new *SecurityInfo classes.

Bulk of changes are using BlockingInterface instead, plus class removals such as HBaseClientRPC and the support for caching of method invocations. For example, changing AdminProtocol to instead refer to AdminService.BlockingInterface (if you looked at the old AdminProtocol, it implemented BlockingInterface).

The new rpc classes are named RpcClient and RpcServerImplementation (there is a silly RpcServer Interface that is in the way and needs at least some cleanup and probably just removal, but can do that later). These classes have facility to help make the protobuf stub on the client side.

Got rid of MasterService, which only had isMasterRunning in it, and added this method to MasterMonitor and to MasterAdmin -- they both have it now; as said above, we were trying to do some kinda inheritance where both MasterMonitor and MasterAdmin had the isMasterRunning method. Dodgy.

TODO:
+ See if I can make this cleaner still. Would appreciate suggestions on the *SecurityInfo stuff.
+ Fix outstanding tests.
+ There is a bit of a mess still in HConnectionManager around isMasterRunning. It was working because we had faked pb service inheritance, something pb does not support and something the pb fellows are against in principle. Did some fixup but it is typeless. Need to spend more time on it.
{code}

Remove proxy and engine, rely directly on pb generated Service
--
Key: HBASE-8214
URL: https://issues.apache.org/jira/browse/HBASE-8214
Project: HBase
Issue Type: Bug
Components: IPC/RPC
Reporter: stack
Assignee: stack
Attachments: 8124.txt, 8214v2.txt

The attached patch is not done. It removes two to three layers -- depending on how you count -- between the client and the rpc client, and similar on the server side (between rpc and the server implementation). Strips ProtobufRpcServer/Client and HBaseClientRpc/HBaseServerRpc. Also gets rid of the proxy.
[jira] [Commented] (HBASE-8362) Possible MultiGet optimization
[ https://issues.apache.org/jira/browse/HBASE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634930#comment-13634930 ]

Nicolas Liochon commented on HBASE-8362:

bq. Or add a new API and retain the older API for the exotic 1 % ?

+1 (that's what I wanted to say, actually)
[jira] [Commented] (HBASE-8366) HBaseServer logs the full query.
[ https://issues.apache.org/jira/browse/HBASE-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634936#comment-13634936 ]

Nicolas Liochon commented on HBASE-8366:

After some thinking, I think the best option is to remove both of them (mine + the one mentioned by Himanshu). That's the only viable long-term option. I will do this if nobody objects.

HBaseServer logs the full query.
--
Key: HBASE-8366
URL: https://issues.apache.org/jira/browse/HBASE-8366
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 0.95.0
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Fix For: 0.98.0, 0.95.1
Attachments: 8366.v1.patch

We log the query when we have an error. As a result, the logs are not readable when using stuff like multi. As a side note, this is also a security issue (no need to encrypt the network and the storage if the logs contain everything). I'm not removing the full log line here; but just ask and I'll do it :-).
[jira] [Commented] (HBASE-8360) In HBaseClient#cancelConnections we should close fully the connection
[ https://issues.apache.org/jira/browse/HBASE-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634938#comment-13634938 ]

Nicolas Liochon commented on HBASE-8360:

Unrelated error. I will commit tomorrow my time if nobody objects.

In HBaseClient#cancelConnections we should close fully the connection
-
Key: HBASE-8360
URL: https://issues.apache.org/jira/browse/HBASE-8360
Project: HBase
Issue Type: Bug
Components: Client
Affects Versions: 0.95.0
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Fix For: 0.98.0, 0.95.1
Attachments: 8860.v1.patch

Otherwise the clients are not disconnected and hence still depend on the timeout...
[jira] [Commented] (HBASE-8214) Remove proxy and engine, rely directly on pb generated Service
[ https://issues.apache.org/jira/browse/HBASE-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634940#comment-13634940 ]

stack commented on HBASE-8214:
--
I updated rb: https://reviews.apache.org/r/10174/
[jira] [Commented] (HBASE-8359) Too much logs on HConnectionManager
[ https://issues.apache.org/jira/browse/HBASE-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634941#comment-13634941 ]

Nicolas Liochon commented on HBASE-8359:

Committed to 0.95 and trunk. Thanks for the review, Andrew and Sergey!

Too much logs on HConnectionManager
---
Key: HBASE-8359
URL: https://issues.apache.org/jira/browse/HBASE-8359
Project: HBase
Issue Type: Bug
Components: Client
Affects Versions: 0.95.0
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Attachments: 8359.v1.patch

Under a YCSB load test (for HBASE-6295), we can have sporadic bulks of logs because of this:
{code}
final RegionMovedException rme = RegionMovedException.find(exception);
if (rme != null) {
  LOG.info("Region " + regionInfo.getRegionNameAsString() + " moved to " +
    rme.getHostname() + ":" + rme.getPort() + " according to " + source.getHostnamePort());
  updateCachedLocation(regionInfo, source, rme.getServerName(), rme.getLocationSeqNum());
} else if (RegionOpeningException.find(exception) != null) {
  LOG.info("Region " + regionInfo.getRegionNameAsString() + " is being opened on " +
    source.getHostnamePort() + "; not deleting the cache entry");
} else {
  deleteCachedLocation(regionInfo, source);
}
{code}
They should just be removed.
[jira] [Updated] (HBASE-8359) Too much logs on HConnectionManager
[ https://issues.apache.org/jira/browse/HBASE-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Liochon updated HBASE-8359:
---
Resolution: Fixed
Fix Version/s: 0.95.1, 0.98.0
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)
[jira] [Commented] (HBASE-8366) HBaseServer logs the full query.
[ https://issues.apache.org/jira/browse/HBASE-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634954#comment-13634954 ]

stack commented on HBASE-8366:
--
Sorry for this, lads. I am going to come back and fix this (thanks for filing the issue). TextFormat was useful for debugging the ipc but yeah, too verbose. On the other hand, because we are all pb now, we can log a TextFormat shorthand and print out stuff like region and row, which will help when a query is tooSlow or tooBig. TextFormat is not subclassable, so it would be an hbase form of TextFormat. I can assign this to myself, since I have a notion of how it should be, if that is ok w/ you lot.
[jira] [Commented] (HBASE-8366) HBaseServer logs the full query.
[ https://issues.apache.org/jira/browse/HBASE-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634967#comment-13634967 ]

Nicolas Liochon commented on HBASE-8366:

I'm fine if you do it :-).
[jira] [Commented] (HBASE-8369) MapReduce over snapshot files
[ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634972#comment-13634972 ]

Matteo Bertozzi commented on HBASE-8369:

In general I'm against having another way to directly access the data, since it means that you're giving up on optimizing the main one. But if the final implementation will be like this one, using the HRegion object, I'll be +1.

MapReduce over snapshot files
-
Key: HBASE-8369
URL: https://issues.apache.org/jira/browse/HBASE-8369
Project: HBase
Issue Type: New Feature
Components: mapreduce, snapshots
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 0.98.0, 0.95.2
Attachments: hbase-8369_v0.patch

The idea is to add an InputFormat which can run a mapreduce job over snapshot files directly, bypassing the hbase server layer. The IF is similar in usage to TableInputFormat, taking a Scan object from the user, but instead of running from an online table, it runs from a table snapshot. We do one split per region in the snapshot, and open an HRegion inside the RecordReader. A RegionScanner is used internally for doing the scan without any HRegionServer bits. Users have been asking and searching for ways to run MR jobs by reading directly from hfiles, so this allows new use cases if reading from stale data is ok:
- Take snapshots periodically, and run MR jobs only on snapshots.
- Export snapshots to a remote hdfs cluster, and run the MR jobs at that cluster without an HBase cluster.
- (Future use case) Combine snapshot data with online hbase data: scan from yesterday's snapshot, but read today's data from the online hbase cluster.
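The one-split-per-region shape described above can be illustrated with a plain-Java sketch (hypothetical types; the real InputFormat would produce Hadoop InputSplits and open an HRegion inside each RecordReader):

```java
import java.util.*;

// Sketch: a snapshot modeled as region name -> rows; planning a job means
// emitting one "split" per region, each split scanned independently the way
// a RecordReader would scan one HRegion -- no region server involved.
public class SnapshotSplitsSketch {
    static List<List<String>> splitsFor(Map<String, List<String>> snapshotRegions) {
        List<List<String>> splits = new ArrayList<>();
        for (Map.Entry<String, List<String>> region : snapshotRegions.entrySet()) {
            splits.add(region.getValue()); // one split per region
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, List<String>> snapshot = new LinkedHashMap<>();
        snapshot.put("region-a", Arrays.asList("r1", "r2"));
        snapshot.put("region-b", Arrays.asList("r3"));
        System.out.println(splitsFor(snapshot).size()); // 2
    }
}
```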
[jira] [Commented] (HBASE-8365) Duplicated ZK notifications cause Master abort (or other unknown issues)
[ https://issues.apache.org/jira/browse/HBASE-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635031#comment-13635031 ]

ramkrishna.s.vasudevan commented on HBASE-8365:
---
I went through the logs once again. One thing that is surprising is how two nodeDataChanged events occurred for FAILED_OPEN. Is it that when the znode got changed to OPENING twice and then to FAILED_OPEN, we got one event per change, and by the time it tried to process them, the first two times the data it got was FAILED_OPEN, but the third time it was OPENING due to some other, later assign operation? Thanks.

Duplicated ZK notifications cause Master abort (or other unknown issues)
--
Key: HBASE-8365
URL: https://issues.apache.org/jira/browse/HBASE-8365
Project: HBase
Issue Type: Bug
Affects Versions: 0.94.6
Reporter: Jeffrey Zhong
Attachments: TestResult.txt

The duplicated ZK notifications should happen in trunk as well. Since the way we handle ZK notifications is different in trunk, we don't see the issue there; I'll explain later. The issue is causing TestMetaReaderEditor.testRetrying to be flaky, with error message {code}reader: count=2, t=null{code}. A related link is at https://builds.apache.org/job/HBase-0.94/941/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/. The test case failure is due to an IllegalStateException, and the master is aborted, so the remaining test cases also failed after testRetrying.

Below are the steps showing why the issue is happening (region fa0e7a5590feb69bd065fbc99c228b36 is the one of interest):

1) Got the first notification event, RS_ZK_REGION_FAILED_OPEN, at 2013-04-04 17:39:01,197:
{code}
DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): Handling transition=RS_ZK_REGION_FAILED_OPEN, server=janus.apache.org,42093,1365097126155, region=fa0e7a5590feb69bd065fbc99c228b36
{code}
In this step, the AM tries to open the region on another RS in a separate thread.

2) Got a second notification event, RS_ZK_REGION_FAILED_OPEN, at 2013-04-04 17:39:01,200:
{code}
DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): Handling transition=RS_ZK_REGION_FAILED_OPEN, server=janus.apache.org,42093,1365097126155, region=fa0e7a5590feb69bd065fbc99c228b36
{code}

3) Later got the opening notification event resulting from step 1, at 2013-04-04 17:39:01,288:
{code}
DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): Handling transition=RS_ZK_REGION_OPENING, server=janus.apache.org,54833,1365097126175, region=fa0e7a5590feb69bd065fbc99c228b36
{code}

Step 2's ClosedRegionHandler throws an IllegalStateException because it cannot transit the region to OFFLINE (the state is OPENING from notification 3), and it aborts the Master. This could happen in 0.94 because we handle notifications using an executorService, which opens the door to handling events out of order even though we receive them in the order of the updates. I've confirmed that we don't have duplicated AM listeners, and that both events were triggered by the same ZK data of the exact same version. The issue can be reproduced by running the testRetrying test case 20 times in a loop.

There are several issues behind the failure:

1) Duplicated ZK notifications. Since a ZK watcher is a one-time trigger, duplicated notifications should not happen from the same data of the same version in the first place.

2) ZooKeeper watcher handling is wrong in both 0.94 and trunk, as follows:
a) 0.94 handles notifications in an async way, which may lead to handling notifications out of the order in which the events happened.
b) In trunk, we handle ZK notifications synchronously, which slows down other components such as SSH, log splitting, etc., because we have a single notification queue.
c) In both trunk and 0.94, we could use stale event data because we have a long listener list. The ZK node state could have changed by the time the event is handled. If a listener needs to act upon the latest state, it should re-fetch the data to verify that the data which triggered the handler hasn't changed.

Suggestions: For 0.94, we can band-aid the CloseRegionHandler to pass in the expected ZK data version, to skip event handling on stale data with minimal impact. For trunk, I'll open an improvement JIRA on ZK notification handling to provide more parallelism for handling unrelated notifications. For the duplicated ZK notifications, we need to bring in some ZK experts to take a look. Please let me know what you think, or any better idea. Thanks!
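The 0.94 suggestion above -- pass the expected ZK data version into the handler and skip stale events -- amounts to a compare-before-act guard. A minimal sketch (hypothetical names, not the actual CloseRegionHandler code):

```java
// Sketch: a handler remembers the data version it was created for and
// re-checks the znode's current version before acting, so a duplicated
// or superseded notification is dropped instead of handled.
public class VersionedHandlerSketch {
    interface Znode { int getVersion(); }

    static boolean handleIfCurrent(Znode znode, int expectedVersion, Runnable action) {
        if (znode.getVersion() != expectedVersion) {
            return false; // stale event: the znode changed since the watch fired
        }
        action.run();
        return true;
    }

    public static void main(String[] args) {
        Znode node = () -> 5; // current data version is 5
        System.out.println(handleIfCurrent(node, 4, () -> {})); // false: stale
        System.out.println(handleIfCurrent(node, 5, () -> {})); // true: current
    }
}
```

Note this only narrows the race (the version can still change between the check and the action); in real ZooKeeper usage the version would be compared via the Stat returned by getData, or enforced with a versioned setData.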
[jira] [Commented] (HBASE-8365) Duplicated ZK notifications cause Master abort (or other unknown issues)
[ https://issues.apache.org/jira/browse/HBASE-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635041#comment-13635041 ]

ramkrishna.s.vasudevan commented on HBASE-8365:
---
If I recollect from what I have debugged, the nodeDataChanged event will only give the latest data, because it will not be able to read the old data. I may be wrong here.
[jira] [Updated] (HBASE-8329) Limit compaction speed
[ https://issues.apache.org/jira/browse/HBASE-8329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] binlijin updated HBASE-8329: Attachment: HBASE-8329-trunk.patch Limit compaction speed -- Key: HBASE-8329 URL: https://issues.apache.org/jira/browse/HBASE-8329 Project: HBase Issue Type: Improvement Components: Compaction Reporter: binlijin Attachments: HBASE-8329-trunk.patch There is no speed or resource limit for compaction; I think we should add this feature, especially for when requests burst. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
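One simple way to bound compaction throughput is to pause the writer after each chunk so the average rate stays under a configured limit. The sketch below illustrates that idea only; it is not the attached patch, and the names are hypothetical.

```java
// Illustrative throughput limiter: after writing n bytes, the caller pauses
// for the number of milliseconds returned, keeping the average write rate
// at or below bytesPerSecond. Hypothetical names, not the HBASE-8329 patch.
public class RateLimiter {
    private final long bytesPerSecond;
    private long bytesSinceSleep = 0;

    public RateLimiter(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
    }

    /** Call after writing n bytes; returns the pause in ms this chunk earned. */
    public long onBytesWritten(long n) {
        bytesSinceSleep += n;
        long pauseMs = bytesSinceSleep * 1000 / bytesPerSecond;
        if (pauseMs > 0) {
            bytesSinceSleep = 0; // the caller would Thread.sleep(pauseMs) here
        }
        return pauseMs;
    }
}
```

A compaction loop would call onBytesWritten after each flushed block and sleep for the returned duration, trading compaction latency for steadier request-serving capacity.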
[jira] [Created] (HBASE-8372) Provide mutability to CompoundConfiguration
Ted Yu created HBASE-8372: - Summary: Provide mutability to CompoundConfiguration Key: HBASE-8372 URL: https://issues.apache.org/jira/browse/HBASE-8372 Project: HBase Issue Type: New Feature Reporter: Ted Yu In the discussion of HBASE-8347, it was proposed that CompoundConfiguration should support mutability. This can be done by consolidating the ImmutableConfigMaps on the first modification to CompoundConfiguration. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
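The consolidation idea can be sketched as follows: reads walk a list of immutable maps in precedence order, and the first write flattens them into a single mutable map that takes over. This is a toy model of the proposal, not the actual CompoundConfiguration code.

```java
import java.util.*;

// Toy model of the proposed copy-on-first-write consolidation.
// Reads search the immutable maps in precedence order; the first set()
// flattens them into one mutable map, preserving precedence.
public class CompoundConfigSketch {
    private final List<Map<String, String>> immutableMaps = new ArrayList<>();
    private Map<String, String> mutable; // null until first modification

    public void addImmutable(Map<String, String> m) {
        immutableMaps.add(0, Collections.unmodifiableMap(m)); // later maps win
    }

    public String get(String key) {
        if (mutable != null) return mutable.get(key);
        for (Map<String, String> m : immutableMaps) {
            String v = m.get(key);
            if (v != null) return v;
        }
        return null;
    }

    public void set(String key, String value) {
        if (mutable == null) {
            // Consolidate: flatten lowest precedence first so that
            // higher-precedence entries overwrite them.
            mutable = new HashMap<>();
            for (int i = immutableMaps.size() - 1; i >= 0; i--) {
                mutable.putAll(immutableMaps.get(i));
            }
        }
        mutable.put(key, value);
    }
}
```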
[jira] [Commented] (HBASE-8359) Too much logs on HConnectionManager
[ https://issues.apache.org/jira/browse/HBASE-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635163#comment-13635163 ] Hudson commented on HBASE-8359: --- Integrated in hbase-0.95-on-hadoop2 #73 (See [https://builds.apache.org/job/hbase-0.95-on-hadoop2/73/]) HBASE-8359 Too much logs on HConnectionManager (Revision 1469204) Result = FAILURE nkeywal : Files : * /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java Too much logs on HConnectionManager --- Key: HBASE-8359 URL: https://issues.apache.org/jira/browse/HBASE-8359 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.95.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.0, 0.95.1 Attachments: 8359.v1.patch Under a YCSB load test (for HBASE-6295), we can get sporadic bursts of logs because of this: {code} final RegionMovedException rme = RegionMovedException.find(exception); if (rme != null) { LOG.info("Region " + regionInfo.getRegionNameAsString() + " moved to " + rme.getHostname() + ":" + rme.getPort() + " according to " + source.getHostnamePort()); updateCachedLocation( regionInfo, source, rme.getServerName(), rme.getLocationSeqNum()); } else if (RegionOpeningException.find(exception) != null) { LOG.info("Region " + regionInfo.getRegionNameAsString() + " is being opened on " + source.getHostnamePort() + "; not deleting the cache entry"); } else { deleteCachedLocation(regionInfo, source); } {code} They should just be removed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8369) MapReduce over snapshot files
[ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635167#comment-13635167 ] Jean-Marc Spaggiari commented on HBASE-8369: I really like the idea. Are the initCredentials modifications in TableMapReduceUtil required for the scope of this patch? Or are they coming from another scope? MapReduce over snapshot files - Key: HBASE-8369 URL: https://issues.apache.org/jira/browse/HBASE-8369 Project: HBase Issue Type: New Feature Components: mapreduce, snapshots Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.0, 0.95.2 Attachments: hbase-8369_v0.patch The idea is to add an InputFormat, which can run the mapreduce job over snapshot files directly, bypassing the hbase server layer. The IF is similar in usage to TableInputFormat, taking a Scan object from the user, but instead of running from an online table, it runs from a table snapshot. We do one split per region in the snapshot, and open an HRegion inside the RecordReader. A RegionScanner is used internally for doing the scan without any HRegionServer bits. Users have been asking and searching for ways to run MR jobs by reading directly from hfiles, so this allows new use cases if reading from stale data is ok: - Take snapshots periodically, and run MR jobs only on snapshots. - Export snapshots to a remote hdfs cluster, and run the MR jobs at that cluster without an HBase cluster. - (Future use case) Combine snapshot data with online hbase data: Scan from yesterday's snapshot, but read today's data from the online hbase cluster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8357) current region server failover mechanism for replication can lead to stale region server whose left hlogs can't be replicated by other region server
[ https://issues.apache.org/jira/browse/HBASE-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635201#comment-13635201 ] Himanshu Vashishtha commented on HBASE-8357: [~fenghh] Also, consider using 0.94.6 when using HBase-2611. Please refer to the release notes on HBase-2611. current region server failover mechanism for replication can lead to stale region server whose left hlogs can't be replicated by other region server Key: HBASE-8357 URL: https://issues.apache.org/jira/browse/HBASE-8357 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.3 Reporter: Feng Honghua consider this scenario: rs A/B/C, A dies, B and C race to lock A's znode to help replicate the hlogs A left unreplicated, B wins and successfully creates the lock under A's znode, but before B copies A's hlog queues to its own znode, B also dies, and C successfully creates the lock under B's znode and helps replicate B's own leftover hlogs. But A's leftover hlogs can't be replicated by any other rs, since B left a lock behind under A's znode and didn't transfer A's hlog queues to its own znode before it died. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Himanshu Vashishtha updated HBASE-2611: --- Release Note: The fix for this issue uses Zookeeper multi functionality (hbase.zookeeper.useMulti). Please refer to hbase-default.xml about this property. There is an addendum fix at HBase-8099 (fixed in 0.94.6). In case you are running on branch 0.94.6, please patch it with HBase-8099, OR make sure hbase.zookeeper.useMulti is set to false. Handle RS that fails while processing the failure of another one Key: HBASE-2611 URL: https://issues.apache.org/jira/browse/HBASE-2611 Project: HBase Issue Type: Sub-task Components: Replication Reporter: Jean-Daniel Cryans Assignee: Himanshu Vashishtha Priority: Critical Fix For: 0.94.5, 0.95.0 Attachments: 2611-0.94.txt, 2611-trunk-v3.patch, 2611-trunk-v4.patch, 2611-v3.patch, HBASE-2611-trunk-v2.patch, HBASE-2611-trunk-v3.patch, HBase-2611-upstream-v1.patch, HBASE-2611-v2.patch HBASE-2223 doesn't manage region servers that fail while doing the transfer of HLogs queues from other region servers that failed. Devise a reliable way to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
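As the release note says, the fix relies on ZooKeeper multi-update support, gated by a configuration property. An hbase-site.xml fragment enabling it would look like this (a sketch; hbase-default.xml remains the authoritative description of the property):

```xml
<!-- hbase-site.xml: enable the atomic ZooKeeper multi operations used by
     the HBASE-2611 fix; requires a ZooKeeper version that supports multi. -->
<property>
  <name>hbase.zookeeper.useMulti</name>
  <value>true</value>
</property>
```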
[jira] [Created] (HBASE-8373) Update Rolling Restart documentation
Jean-Marc Spaggiari created HBASE-8373: -- Summary: Update Rolling Restart documentation Key: HBASE-8373 URL: https://issues.apache.org/jira/browse/HBASE-8373 Project: HBase Issue Type: Bug Components: documentation Reporter: Jean-Marc Spaggiari Assignee: Jean-Marc Spaggiari Priority: Minor Rolling Restart documentation specifies that we need to stop the balancer before proceeding. However, bin/graceful_stop.sh is already taking care of that: {code} echo Disabling balancer! echo 'balance_switch false' | $bin/hbase --config ${HBASE_CONF_DIR} shell {code} So the documentation needs to be updated to remove this recommendation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8373) Update Rolling Restart documentation
[ https://issues.apache.org/jira/browse/HBASE-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Marc Spaggiari updated HBASE-8373: --- Attachment: HBASE-8373-v0-trunk.patch Update Rolling Restart documentation Key: HBASE-8373 URL: https://issues.apache.org/jira/browse/HBASE-8373 Project: HBase Issue Type: Bug Components: documentation Reporter: Jean-Marc Spaggiari Assignee: Jean-Marc Spaggiari Priority: Minor Attachments: HBASE-8373-v0-trunk.patch Rolling Restart documentation specifies that we need to stop the balancer before proceeding. However, bin/graceful_stop.sh is already taking care of that: {code} echo Disabling balancer! echo 'balance_switch false' | $bin/hbase --config ${HBASE_CONF_DIR} shell {code} So the documentation needs to be updated to remove this recommendation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8373) Update Rolling Restart documentation
[ https://issues.apache.org/jira/browse/HBASE-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Marc Spaggiari updated HBASE-8373: --- Status: Patch Available (was: Open) Documentation update attached. Update Rolling Restart documentation Key: HBASE-8373 URL: https://issues.apache.org/jira/browse/HBASE-8373 Project: HBase Issue Type: Bug Components: documentation Reporter: Jean-Marc Spaggiari Assignee: Jean-Marc Spaggiari Priority: Minor Attachments: HBASE-8373-v0-trunk.patch Rolling Restart documentation specifies that we need to stop the balancer before proceeding. However, bin/graceful_stop.sh is already taking care of that: {code} echo Disabling balancer! echo 'balance_switch false' | $bin/hbase --config ${HBASE_CONF_DIR} shell {code} So the documentation needs to be updated to remove this recommendation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8373) Update Rolling Restart documentation
[ https://issues.apache.org/jira/browse/HBASE-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635224#comment-13635224 ] Hadoop QA commented on HBASE-8373: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12579330/HBASE-8373-v0-trunk.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5343//console This message is automatically generated. Update Rolling Restart documentation Key: HBASE-8373 URL: https://issues.apache.org/jira/browse/HBASE-8373 Project: HBase Issue Type: Bug Components: documentation Reporter: Jean-Marc Spaggiari Assignee: Jean-Marc Spaggiari Priority: Minor Attachments: HBASE-8373-v0-trunk.patch Rolling Restart documentation specifies that we need to stop the balancer before proceeding. However, bin/graceful_stop.sh is already taking care of that: {code} echo Disabling balancer! echo 'balance_switch false' | $bin/hbase --config ${HBASE_CONF_DIR} shell {code} So the documentation needs to be updated to remove this recommendation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6295) Possible performance improvement in client batch operations: presplit and send in background
[ https://issues.apache.org/jira/browse/HBASE-6295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635225#comment-13635225 ] Nicolas Liochon commented on HBASE-6295: Load test with YCSB on EC2. Lots of problems. The server seems sensitive to the workload, and being asynchronous adds some workload. Here is a stack with a moderate setting. I don't get the UnknownHostException: ip-10-4-226-168; maybe there are too many calls to AWS... {noformat} 2013-04-18 10:41:33,377 INFO [regionserver60020-smallCompactions-1366296026287] org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family Bloom f 2013-04-18 10:41:37,849 FATAL [regionserver60020.logRoller] org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ip-10-4-229-217 java.io.IOException: cannot get log writer at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createWriter(HLogFactory.java:162) at org.apache.hadoop.hbase.regionserver.wal.FSHLog.createWriterInstance(FSHLog.java:591) at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:533) at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:96) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:169) at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createWriter(HLogFactory.java:159) ... 
4 more Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:123) at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:149) at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:234) at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:342) at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:339) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441) at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:339) at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:453) at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:469) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:150) ... 5 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:121) ... 
20 more Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: ip-10-4-226-168 at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:417) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:415) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:382) at org.apache.hadoop.fs.Hdfs.init(Hdfs.java:85) ... 25 more Caused by: java.net.UnknownHostException: ip-10-4-226-168 ... 31 more 2013-04-18 10:41:37,851 FATAL [regionserver60020.logRoller] org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessor 2013-04-18 10:41:37,863 INFO [regionserver60020.logRoller] org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: IOE in log roller {noformat} Possible performance improvement in client batch operations: presplit and send in background Key: HBASE-6295 URL: https://issues.apache.org/jira/browse/HBASE-6295 Project: HBase Issue Type: Improvement Components: Client, Performance
[jira] [Created] (HBASE-8374) NPE when launching the balance
Nicolas Liochon created HBASE-8374: -- Summary: NPE when launching the balance Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. When it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think it should be enough. Now, looking at the code, the NPE is strange. {noformat} 2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception java.lang.NullPointerException at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.init(BaseLoadBalancer.java:145) at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194) at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295) at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48) at org.apache.hadoop.hbase.Chore.run(Chore.java:81) at java.lang.Thread.run(Thread.java:662) 2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key '' {noformat} {code} if (regionFinder != null) { //region location List<ServerName> loc = regionFinder.getTopBlockLocations(region); regionLocations[regionIndex] = new int[loc.size()]; for (int i = 0; i < loc.size(); i++) { regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); // <= NPE here } } {code} pinging [~enis], just in case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6295) Possible performance improvement in client batch operations: presplit and send in background
[ https://issues.apache.org/jira/browse/HBASE-6295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635227#comment-13635227 ] Nicolas Liochon commented on HBASE-6295: On the good news part, it seems it does what we expect: performance is 25% better, even with a dead regionserver. Possible performance improvement in client batch operations: presplit and send in background Key: HBASE-6295 URL: https://issues.apache.org/jira/browse/HBASE-6295 Project: HBase Issue Type: Improvement Components: Client, Performance Affects Versions: 0.95.2 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Labels: noob Attachments: 6295.v1.patch, 6295.v2.patch, 6295.v3.patch today the batch algo is: {noformat} for Operation o: List<Op> { add o to todolist if todolist > maxsize or o last in list split todolist per location send split lists to region servers clear todolist wait } {noformat} We could: - create immediately the final object instead of an intermediate array - split per location immediately - instead of sending when the list as a whole is full, send it when there is enough data for a single location It would be: {noformat} for Operation o: List<Op> { get location add o to location.todolist if (location.todolist > maxLocationSize) send location.todolist to region server clear location.todolist // don't wait, continue the loop } send remaining wait {noformat} It's not trivial to write if you add error management: the retried list must be shared with the operations added in the todolist. But it's doable. It's interesting mainly for 'big' writes -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
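The proposed loop can be sketched in plain Java: operations are buffered per location, and each buffer is flushed to its server as soon as it reaches its own threshold, without waiting for the global list to fill. All names here (submit, flush, the string stand-ins for operations and locations) are illustrative, not the HBase client API.

```java
import java.util.*;

// Sketch of the per-location batching idea from the comment above.
// Operations are grouped by location on arrival; a location's buffer is
// sent as soon as it is full, and remainders are sent at the end.
public class PerLocationBatcher {
    static final int MAX_LOCATION_SIZE = 2; // tiny threshold for the demo

    final Map<String, List<String>> buffers = new HashMap<>();
    final List<String> sent = new ArrayList<>(); // records flushes for the demo

    void submit(String location, String op) {
        List<String> buf = buffers.computeIfAbsent(location, k -> new ArrayList<>());
        buf.add(op);
        if (buf.size() >= MAX_LOCATION_SIZE) {
            flush(location, buf); // don't wait, the loop continues
        }
    }

    void flush(String location, List<String> buf) {
        sent.add(location + ":" + buf); // stand-in for the RPC to the server
        buf.clear();
    }

    void finish() { // send the remainders; only now would the caller wait
        for (Map.Entry<String, List<String>> e : buffers.entrySet()) {
            if (!e.getValue().isEmpty()) flush(e.getKey(), e.getValue());
        }
    }
}
```

The error-management caveat from the comment applies: a retried operation must rejoin the right location buffer, so the retry list has to be shared with the buffers above.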
[jira] [Commented] (HBASE-8214) Remove proxy and engine, rely directly on pb generated Service
[ https://issues.apache.org/jira/browse/HBASE-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635248#comment-13635248 ] Nick Dimiduk commented on HBASE-8214: - bq. It is ugly the way we now do security information. Dunno if it's interesting, but you might have a look at https://code.google.com/p/protobuf-rpc-pro/wiki/SSLGuide and the relevant implementation. Remove proxy and engine, rely directly on pb generated Service -- Key: HBASE-8214 URL: https://issues.apache.org/jira/browse/HBASE-8214 Project: HBase Issue Type: Bug Components: IPC/RPC Reporter: stack Assignee: stack Attachments: 8124.txt, 8214v2.txt Attached patch is not done. Removes two to three layers -- depending on how you count -- between client and rpc client and similar on server side (between rpc and server implementation). Strips ProtobufRpcServer/Client and HBaseClientRpc/HBaseServerRpc. Also gets rid of proxy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8373) Update Rolling Restart documentation
[ https://issues.apache.org/jira/browse/HBASE-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635253#comment-13635253 ] Ted Yu commented on HBASE-8373: --- Can you regenerate the patch? The correct path should be src/main/docbkx/ops_mgt.xml Update Rolling Restart documentation Key: HBASE-8373 URL: https://issues.apache.org/jira/browse/HBASE-8373 Project: HBase Issue Type: Bug Components: documentation Reporter: Jean-Marc Spaggiari Assignee: Jean-Marc Spaggiari Priority: Minor Attachments: HBASE-8373-v0-trunk.patch Rolling Restart documentation specifies that we need to stop the balancer before proceeding. However, bin/graceful_stop.sh is already taking care of that: {code} echo Disabling balancer! echo 'balance_switch false' | $bin/hbase --config ${HBASE_CONF_DIR} shell {code} So the documentation needs to be updated to remove this recommendation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8359) Too much logs on HConnectionManager
[ https://issues.apache.org/jira/browse/HBASE-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635259#comment-13635259 ] Hudson commented on HBASE-8359: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #503 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/503/]) HBASE-8359 Too much logs on HConnectionManager (Revision 1469203) Result = FAILURE nkeywal : Files : * /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java Too much logs on HConnectionManager --- Key: HBASE-8359 URL: https://issues.apache.org/jira/browse/HBASE-8359 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.95.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.0, 0.95.1 Attachments: 8359.v1.patch Under a YCSB load test (for HBASE-6295), we can get sporadic bursts of logs because of this: {code} final RegionMovedException rme = RegionMovedException.find(exception); if (rme != null) { LOG.info("Region " + regionInfo.getRegionNameAsString() + " moved to " + rme.getHostname() + ":" + rme.getPort() + " according to " + source.getHostnamePort()); updateCachedLocation( regionInfo, source, rme.getServerName(), rme.getLocationSeqNum()); } else if (RegionOpeningException.find(exception) != null) { LOG.info("Region " + regionInfo.getRegionNameAsString() + " is being opened on " + source.getHostnamePort() + "; not deleting the cache entry"); } else { deleteCachedLocation(regionInfo, source); } {code} They should just be removed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-8374: -- Attachment: 8374-trunk.txt I think loc was empty, leading to the NPE. Here is a patch. NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Attachments: 8374-trunk.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. When it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think it should be enough. Now, looking at the code, the NPE is strange. {noformat} 2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception java.lang.NullPointerException at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.init(BaseLoadBalancer.java:145) at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194) at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295) at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48) at org.apache.hadoop.hbase.Chore.run(Chore.java:81) at java.lang.Thread.run(Thread.java:662) 2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. 
starting at key '' {noformat} {code} if (regionFinder != null) { //region location List<ServerName> loc = regionFinder.getTopBlockLocations(region); regionLocations[regionIndex] = new int[loc.size()]; for (int i = 0; i < loc.size(); i++) { regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); // <= NPE here } } {code} pinging [~enis], just in case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reassigned HBASE-8374: - Assignee: Ted Yu NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Attachments: 8374-trunk.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. When it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think it should be enough. Now, looking at the code, the NPE is strange. {noformat} 2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception java.lang.NullPointerException at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.init(BaseLoadBalancer.java:145) at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194) at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295) at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48) at org.apache.hadoop.hbase.Chore.run(Chore.java:81) at java.lang.Thread.run(Thread.java:662) 2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key '' {noformat} {code} if (regionFinder != null) { //region location List<ServerName> loc = regionFinder.getTopBlockLocations(region); regionLocations[regionIndex] = new int[loc.size()]; for (int i = 0; i < loc.size(); i++) { regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); // <= NPE here } } {code} pinging [~enis], just in case. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-8374: -- Fix Version/s: 0.95.1 0.98.0 Status: Patch Available (was: Open) NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. When it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think it should be enough. Now, looking at the code, the NPE is strange. {noformat} 2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception java.lang.NullPointerException at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.init(BaseLoadBalancer.java:145) at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194) at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295) at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48) at org.apache.hadoop.hbase.Chore.run(Chore.java:81) at java.lang.Thread.run(Thread.java:662) 2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. 
starting at key '' {noformat} {code} if (regionFinder != null) { //region location List<ServerName> loc = regionFinder.getTopBlockLocations(region); regionLocations[regionIndex] = new int[loc.size()]; for (int i = 0; i < loc.size(); i++) { regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); // <= NPE here } } {code} pinging [~enis], just in case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635271#comment-13635271 ] Jean-Marc Spaggiari commented on HBASE-8374: Hi Ted, I'm not sure about your patch. {code} +if (loc.size() > 0) { + regionLocations[regionIndex] = new int[loc.size()]; + for (int i = 0; i < loc.size(); i++) { + regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); + } } {code} If loc.size() == 0, then the for loop will never run and loc.get(i) will never be called. No? And we know that loc can't be null, else the NPE would have been on loc.size(). So there are two options: 1) loc.get(i) returns null and serversToIndex.get(null) gives the NPE; 2) serversToIndex is null. I think we are facing #1 here.
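Jean-Marc's two observations can be checked with a standalone sketch (hypothetical names, not the HBase code): with an empty list the loop body never runs, so no NPE can come from it, while a HashMap lookup for an absent (or null) key returns null, which only fails when auto-unboxed to int.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LoopSkipDemo {
    // Same shape as the balancer loop: fills an int slot per location.
    static int filledSlots(List<String> loc, Map<String, Integer> serversToIndex) {
        int[] slots = new int[loc.size()];
        for (int i = 0; i < loc.size(); i++) {
            slots[i] = serversToIndex.get(loc.get(i)); // NPE if the key is absent or null
        }
        return slots.length;
    }

    public static void main(String[] args) {
        Map<String, Integer> serversToIndex = new HashMap<>();
        serversToIndex.put("rs1", 0);

        // Empty list: the loop body is skipped entirely, so no NPE is possible here.
        System.out.println(filledSlots(new ArrayList<>(), serversToIndex)); // prints 0

        // Option 1 from the comment: a key the map doesn't know (a null element
        // behaves the same way, since HashMap.get(null) also returns null).
        try {
            filledSlots(List.of("unknown"), serversToIndex);
        } catch (NullPointerException e) {
            System.out.println("NPE from serversToIndex.get() returning null");
        }
    }
}
```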
[jira] [Updated] (HBASE-8373) Update Rolling Restart documentation
[ https://issues.apache.org/jira/browse/HBASE-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Marc Spaggiari updated HBASE-8373: --- Status: Open (was: Patch Available) Update Rolling Restart documentation Key: HBASE-8373 URL: https://issues.apache.org/jira/browse/HBASE-8373 Project: HBase Issue Type: Bug Components: documentation Reporter: Jean-Marc Spaggiari Assignee: Jean-Marc Spaggiari Priority: Minor Attachments: HBASE-8373-v0-trunk.patch The Rolling Restart documentation specifies that we need to stop the balancer before proceeding. However, bin/graceful_stop.sh already takes care of that: {code} echo "Disabling balancer!" echo 'balance_switch false' | $bin/hbase --config ${HBASE_CONF_DIR} shell {code} So the documentation needs to be updated to remove this recommendation.
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635286#comment-13635286 ] Ted Yu commented on HBASE-8374: --- If loc.size() == 0, there is no information to fill in regionLocations. The serversToIndex.get() call would be skipped, avoiding the NPE.
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635300#comment-13635300 ] Jean-Marc Spaggiari commented on HBASE-8374: My point is that serversToIndex.get() was already skipped with loc.size() == 0, because of the for loop. So I agree that adding the test will avoid the creation of the regionLocations array, but it will not fix the NPE, since it occurred when loc.size() was > 0.
[jira] [Updated] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-8374: -- Attachment: 8374-trunk-v2.txt Patch v2 addresses the case where (some) ServerName couldn't be determined by regionFinder.getTopBlockLocations().
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635307#comment-13635307 ] Nicolas Liochon commented on HBASE-8374: I think it's serversToIndex.get(loc.get(i)) that returns null. As it's an Integer, it's then cast to int, hence the NPE. So v2 does not remove the NPE, imho. I added some tests locally, but I've not yet reproduced it, so I can't say for sure. If so, the question would be: why does it happen?
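Nicolas's diagnosis — Map.get() returning a null Integer that only fails at the int assignment — is easy to reproduce in isolation (a minimal sketch, not the HBase code):

```java
import java.util.HashMap;
import java.util.Map;

public class AutoUnboxNpe {
    public static void main(String[] args) {
        Map<String, Integer> serversToIndex = new HashMap<>();

        // get() on an absent key returns null without any exception...
        Integer boxed = serversToIndex.get("absent-server");
        System.out.println(boxed == null); // prints true

        // ...but assigning the same result to an int auto-unboxes it via
        // Integer.intValue(), and that call on null is where the NPE fires.
        try {
            int idx = serversToIndex.get("absent-server");
            System.out.println(idx); // never reached
        } catch (NullPointerException e) {
            System.out.println("NPE at the unboxing assignment");
        }
    }
}
```

This is why the stack trace points at the assignment line in BaseLoadBalancer rather than inside the map.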
[jira] [Commented] (HBASE-8353) -ROOT-/.META. regions are hanging if master restarted while closing -ROOT-/.META. regions on dead RS
[ https://issues.apache.org/jira/browse/HBASE-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635317#comment-13635317 ] ramkrishna.s.vasudevan commented on HBASE-8353: --- I feel the patch is ok. @Rajesh: if we go and check ROOT/META during startup, we again have to see how long we should wait if ROOT/META goes down at that time. What do you think? -ROOT-/.META. regions are hanging if master restarted while closing -ROOT-/.META. regions on dead RS Key: HBASE-8353 URL: https://issues.apache.org/jira/browse/HBASE-8353 Project: HBase Issue Type: Bug Components: Region Assignment Affects Versions: 0.94.6 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.94.8 Attachments: HBASE-8353_94.patch ROOT/META are not getting assigned if the master is restarted while closing ROOT/META. Let's suppose the catalog table regions are in M_ZK_REGION_CLOSING state during master initialization; then we just add them to RIT and wait for the TM. {code} if (isOnDeadServer(regionInfo, deadServers) && (data.getOrigin() == null || !serverManager.isServerOnline(data.getOrigin()))) { // If was on dead server, its closed now. Force to OFFLINE and this // will get it reassigned if appropriate forceOffline(regionInfo, data); } else { // Just insert region into RIT. // If this never updates the timeout will trigger new assignment regionsInTransition.put(encodedRegionName, new RegionState( regionInfo, RegionState.State.CLOSING, data.getStamp(), data.getOrigin())); } {code} isOnDeadServer always returns false for ROOT/META because deadServers is null. Even the TM cannot close them properly, because they are not available in the online regions, since they are not yet assigned.
{code} synchronized (this.regions) { // Check if this region is currently assigned if (!regions.containsKey(region)) { LOG.debug("Attempted to unassign region " + region.getRegionNameAsString() + " but it is not " + "currently assigned anywhere"); return; } } {code}
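The deadServers == null behavior described above can be sketched as follows (hypothetical, simplified signatures — the real method takes region info and a map of dead servers):

```java
import java.util.Set;

public class DeadServerCheckSketch {
    // Simplified version of the check: when deadServers is null (as it is for
    // -ROOT-/.META. on this code path), the method can only answer false, so
    // the region is never forced OFFLINE and stays stuck in RIT.
    static boolean isOnDeadServer(String regionServer, Set<String> deadServers) {
        if (deadServers == null) {
            return false; // the problematic early-out
        }
        return deadServers.contains(regionServer);
    }

    public static void main(String[] args) {
        System.out.println(isOnDeadServer("rs1", null));          // false, even if rs1 is dead
        System.out.println(isOnDeadServer("rs1", Set.of("rs1"))); // true
    }
}
```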
[jira] [Updated] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-8374: -- Attachment: 8374-trunk-v3.txt See if patch v3 is better.
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635328#comment-13635328 ] Jean-Marc Spaggiari commented on HBASE-8374: Hum, not sure. regionLocations[regionIndex][i] is an int. If you assign null to it you will still get the NPE. Also, as Nicolas says, the question remains: why does it happen?
[jira] [Updated] (HBASE-8279) Performance Evaluation does not consider the args passed in case of more than one client
[ https://issues.apache.org/jira/browse/HBASE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-8279: -- Status: Open (was: Patch Available) Performance Evaluation does not consider the args passed in case of more than one client Key: HBASE-8279 URL: https://issues.apache.org/jira/browse/HBASE-8279 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Priority: Minor Fix For: 0.98.0, 0.94.8, 0.95.1 Attachments: HBASE-8279_1.patch, HBASE-8279_2.patch, HBASE-8279.patch Performance evaluation provides an option to pass the table name. The table name is honored when we first initialize the table - the disabling and creation of the table happen with the name that we pass. But the write and read tests then use only the default table, so the perf evaluation fails. I think the problem is like this: {code} ./hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --table=MyTable2 --presplit=70 randomRead 2 {code} {code} 13/04/04 21:42:07 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'MyTable2,0002067171,1365126124904.bc9e936f4f8ca8ee55eb90091d4a13b6.', STARTKEY => '0002067171', ENDKEY => '', ENCODED => bc9e936f4f8ca8ee55eb90091d4a13b6,} 13/04/04 21:42:07 INFO hbase.PerformanceEvaluation: Table created with 70 splits {code} You can see that the specified table is created with the splits.
But when the read starts: {code} Caused by: org.apache.hadoop.hbase.exceptions.TableNotFoundException: TestTable at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1157) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1034) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:984) at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:246) at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:187) at org.apache.hadoop.hbase.PerformanceEvaluation$Test.testSetup(PerformanceEvaluation.java:851) at org.apache.hadoop.hbase.PerformanceEvaluation$Test.test(PerformanceEvaluation.java:869) at org.apache.hadoop.hbase.PerformanceEvaluation.runOneClient(PerformanceEvaluation.java:1495) at org.apache.hadoop.hbase.PerformanceEvaluation$1.run(PerformanceEvaluation.java:590) {code} It says TestTable, the default table, was not found.
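The mismatch can be illustrated with a toy sketch (hypothetical method names; the real PerformanceEvaluation code is more involved): the setup path resolves the --table option, while the test path hard-codes the default table name.

```java
public class TableNameOptionSketch {
    static final String DEFAULT_TABLE = "TestTable";

    // Setup honors the parsed option, falling back to the default.
    static String tableForSetup(String parsedTableName) {
        return parsedTableName != null ? parsedTableName : DEFAULT_TABLE;
    }

    // Buggy shape: the read/write test path ignores the option entirely.
    static String tableForTestBuggy(String parsedTableName) {
        return DEFAULT_TABLE; // bug: always the default, regardless of --table
    }

    // Fix: reuse the same resolution as setup.
    static String tableForTestFixed(String parsedTableName) {
        return tableForSetup(parsedTableName);
    }

    public static void main(String[] args) {
        String parsed = "MyTable2"; // as if --table=MyTable2 was passed
        System.out.println(tableForSetup(parsed));     // MyTable2 (table gets created)
        System.out.println(tableForTestBuggy(parsed)); // TestTable (which was never created)
        System.out.println(tableForTestFixed(parsed)); // MyTable2
    }
}
```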
[jira] [Updated] (HBASE-8279) Performance Evaluation does not consider the args passed in case of more than one client
[ https://issues.apache.org/jira/browse/HBASE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-8279: -- Attachment: HBASE-8279_2.patch Latest patch for trunk. I can commit this if nobody objects.
[jira] [Updated] (HBASE-8279) Performance Evaluation does not consider the args passed in case of more than one client
[ https://issues.apache.org/jira/browse/HBASE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-8279: -- Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-8374: -- Attachment: 8374-trunk-v4.txt w.r.t. Nicolas' question, I think what happened was that the serversToIndex map wasn't fully populated, because one loop was used to iterate through clusterState.entrySet(). In patch v4, I introduced another loop to populate the serversToIndex map. I kept the check from patch v3 in case regionFinder.getTopBlockLocations() returns some ServerName which is not in the serversToIndex map.
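The two-pass idea behind patch v4 can be sketched with simplified types (strings standing in for ServerName and HRegionInfo; not the actual patch code): index every server first, then resolve locations with a null guard for servers the index doesn't know.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TwoPassIndexSketch {
    // Pass 1: index every server before touching region locations, so lookups
    // in pass 2 cannot return null for any server present in clusterState.
    static Map<String, Integer> buildIndex(Map<String, List<String>> clusterState) {
        Map<String, Integer> serversToIndex = new HashMap<>();
        int next = 0;
        for (String server : clusterState.keySet()) {
            serversToIndex.put(server, next++);
        }
        return serversToIndex;
    }

    // Pass 2: resolve top block locations, skipping any server the index does
    // not know (e.g. one reported by block-location lookup but not in the map).
    static List<Integer> resolve(List<String> topLocations, Map<String, Integer> serversToIndex) {
        List<Integer> out = new ArrayList<>();
        for (String server : topLocations) {
            Integer idx = serversToIndex.get(server);
            if (idx != null) { // guard: avoids the auto-unboxing NPE
                out.add(idx);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> clusterState = new LinkedHashMap<>();
        clusterState.put("rs1", List.of("region-a"));
        clusterState.put("rs2", List.of("region-b"));
        Map<String, Integer> idx = buildIndex(clusterState);
        System.out.println(resolve(List.of("rs2", "unknown-rs"), idx)); // [1]
    }
}
```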
[jira] [Commented] (HBASE-5583) Master restart on create table with splitkeys does not recreate table with all the splitkey regions
[ https://issues.apache.org/jira/browse/HBASE-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635346#comment-13635346 ] ramkrishna.s.vasudevan commented on HBASE-5583: --- If HBASE-5487 takes some more time, can we check this? Master restart on create table with splitkeys does not recreate table with all the splitkey regions --- Key: HBASE-5583 URL: https://issues.apache.org/jira/browse/HBASE-5583 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.95.1 Attachments: HBASE-5583_new_1.patch, HBASE-5583_new_1_review.patch, HBASE-5583_new_2.patch, HBASE-5583_new_4_WIP.patch, HBASE-5583_new_5_WIP_using_tableznode.patch - Create table using splitkeys - Master goes down before all regions are added to meta - On master restart the table is again enabled, but with fewer regions than specified in the splitkeys. Anyway, the client will get an exception if it had called sync create table. But a table-exists check will say the table exists. Is this scenario to be handled by the client only, or can we have some mechanism on the master side for this? Pls suggest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5583) Master restart on create table with splitkeys does not recreate table with all the splitkey regions
[ https://issues.apache.org/jira/browse/HBASE-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635349#comment-13635349 ] Ted Yu commented on HBASE-5583: --- HBASE-5487 has no assignee and no Fix Version. I think we can revive discussion for this JIRA. Master restart on create table with splitkeys does not recreate table with all the splitkey regions --- Key: HBASE-5583 URL: https://issues.apache.org/jira/browse/HBASE-5583 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.95.1 Attachments: HBASE-5583_new_1.patch, HBASE-5583_new_1_review.patch, HBASE-5583_new_2.patch, HBASE-5583_new_4_WIP.patch, HBASE-5583_new_5_WIP_using_tableznode.patch - Create table using splitkeys - Master goes down before all regions are added to meta - On master restart the table is again enabled, but with fewer regions than specified in the splitkeys. Anyway, the client will get an exception if it had called sync create table. But a table-exists check will say the table exists. Is this scenario to be handled by the client only, or can we have some mechanism on the master side for this? Pls suggest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8353) -ROOT-/.META. regions are hanging if master restarted while closing -ROOT-/.META. regions on dead RS
[ https://issues.apache.org/jira/browse/HBASE-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635350#comment-13635350 ] Jimmy Xiang commented on HBASE-8353: [~rajesh23], if we don't change the origin, does this mean {quote} +if (regionInfo.isMetaTable() && !serverManager.isServerOnline(data.getOrigin())) { + forceOffline(regionInfo, data); {quote} we always re-assign the meta table when the master restarts, if it's closing? I am ok with not changing the origin since it could be a compatibility issue. -ROOT-/.META. regions are hanging if master restarted while closing -ROOT-/.META. regions on dead RS Key: HBASE-8353 URL: https://issues.apache.org/jira/browse/HBASE-8353 Project: HBase Issue Type: Bug Components: Region Assignment Affects Versions: 0.94.6 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.94.8 Attachments: HBASE-8353_94.patch ROOT/META are not getting assigned if the master restarted while closing ROOT/META. Let's suppose catalog table regions are in M_ZK_REGION_CLOSING state during master initialization; then we just add them to RIT and wait for TM. {code} if (isOnDeadServer(regionInfo, deadServers) && (data.getOrigin() == null || !serverManager.isServerOnline(data.getOrigin()))) { // If was on dead server, its closed now. Force to OFFLINE and this // will get it reassigned if appropriate forceOffline(regionInfo, data); } else { // Just insert region into RIT. // If this never updates the timeout will trigger new assignment regionsInTransition.put(encodedRegionName, new RegionState( regionInfo, RegionState.State.CLOSING, data.getStamp(), data.getOrigin())); } {code} isOnDeadServer always returns false for ROOT/META because deadServers is null. Even TM cannot close them properly because they are not in the online regions, since they are not yet assigned.
{code} synchronized (this.regions) { // Check if this region is currently assigned if (!regions.containsKey(region)) { LOG.debug("Attempted to unassign region " + region.getRegionNameAsString() + " but it is not " + "currently assigned anywhere"); return; } } {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
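The condition under discussion can be distilled into a small truth-table sketch (assumed names, not the actual patch): because deadServers is null for -ROOT-/.META. during startup, isOnDeadServer(...) returns false, so a closing meta region needs an explicit check that its origin server is still online before being forced offline.

```java
// Illustrative sketch of the guard being discussed (hypothetical helper,
// not HBase source). Inputs mirror the three facts the startup code knows.
public class MetaOfflineGuard {
    public static boolean shouldForceOffline(boolean onDeadServer,
                                             boolean isMetaRegion,
                                             boolean originServerOnline) {
        // Original condition: region was on a known-dead server => force offline.
        // Proposed addition: a meta region whose origin RS is gone => force
        // offline too, even though deadServers (and thus onDeadServer) is
        // useless for meta at this point.
        return onDeadServer || (isMetaRegion && !originServerOnline);
    }

    public static void main(String[] args) {
        // Meta region, origin RS dead, deadServers null => must still go offline.
        System.out.println(shouldForceOffline(false, true, false)); // true
        // User region whose origin RS is alive stays in RIT.
        System.out.println(shouldForceOffline(false, false, true)); // false
    }
}
```

The compatibility concern in the comment above is orthogonal: this guard changes only when forceOffline fires, not what is written as the origin.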
[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog
[ https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635351#comment-13635351 ] Himanshu Vashishtha commented on HBASE-6774: [~nkeywal] [~devaraj]: I am interested to know whether there is any progress on this issue (making regions available which do not have a WAL entry, i.e., not waiting for log splitting to finish). I faced this when working on a read-intensive workload. As Nkeywal commented earlier, it is quite useful for some use-cases. There is already a separate WAL for .META., thanks to Devaraj. If you guys are OK, I would like to work on this. Immediate assignment of regions that don't have entries in HLog --- Key: HBASE-6774 URL: https://issues.apache.org/jira/browse/HBASE-6774 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.95.2 Reporter: Nicolas Liochon The algo today, after a failure detection, is: - split the logs - when all the logs are split, assign the regions. But some regions can have no entries at all in the HLog. There are many reasons for this: - reference or historical tables: bulk written sometimes, then read only. - sequential rowkeys: in this case most of the regions will be read only, but they can be on a regionserver with a lot of writes. - tables flushed often for safety reasons: I'm thinking about meta here. For meta we can imagine flushing very often. Hence the recovery for meta, in many cases, will be the failure detection time. There are different possible algos: Option 1) A new task is added, in parallel with the split. This task reads all the HLogs. If there is no entry for a region, this region is assigned. Pro: simple. Cons: we will need to read all the files; it adds a read. Option 2) The master writes in ZK the number of log files, per region. When the regionserver starts the split, it reads the full block (64M) and decreases the log file counter of the region. If it reaches 0, the assign starts.
At the end of its split, the region server decreases the counter as well. This allows starting the assign even if not all the HLogs are finished, and would make some regions available even if we have an issue in one of the log files. Pro: parallel. Cons: adds work for the region server; requires reading the whole file before starting to write. Option 3) Add some metadata at the end of the log file. The last log file won't have metadata (if we are recovering, it's because the server crashed), but the others will, and the last log file should be smaller (half a block on average). Option 4) Still some metadata, but in a different file. Cons: writes are increased (but not by much; we just need to write the region once). Pros: if we lose the HLog files (major failure, no replica available) we can still continue with the regions that were not written at this stage. I think it should be done, even if none of the algorithms above is totally convincing yet. It's linked as well to locality and short-circuit reads: with these two points, reading the file twice becomes much less of an issue, for example. My current preference would be to open the file twice in the region server: once for splitting as of today, once for a quick read looking for unused regions. Who knows, maybe it would even be faster this way; the quick-read thread would warm up the different caches for the splitting thread. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
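The counter scheme of Option 2 can be sketched as follows (ZK replaced by an in-memory map, all names assumed, purely illustrative): the master records how many log files may contain edits for each region; splitters decrement as they rule files out, and a region whose counter reaches zero can be assigned before the whole split finishes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Rough sketch of Option 2 above. In the real proposal the counters would
// live in ZK; here a ConcurrentHashMap stands in for illustration.
public class EarlyAssignSketch {
    private final Map<String, AtomicInteger> pendingLogs = new ConcurrentHashMap<>();

    // Master side: region may have edits in up to logFileCount HLog files.
    public void recordRegion(String region, int logFileCount) {
        pendingLogs.put(region, new AtomicInteger(logFileCount));
    }

    // Splitter side: called once a log file is known to hold no edits for
    // (or has been fully replayed for) this region. Returns true when no
    // remaining log file can touch the region, i.e. it is safe to assign.
    public boolean logCleared(String region) {
        return pendingLogs.get(region).decrementAndGet() == 0;
    }

    public static void main(String[] args) {
        EarlyAssignSketch s = new EarlyAssignSketch();
        s.recordRegion("regionA", 2);
        System.out.println(s.logCleared("regionA")); // false: one log file left
        System.out.println(s.logCleared("regionA")); // true: safe to assign early
    }
}
```

This captures the "pro: parallel" property: assignment of untouched regions overlaps with the remaining split work instead of waiting for all of it.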
[jira] [Commented] (HBASE-8329) Limit compaction speed
[ https://issues.apache.org/jira/browse/HBASE-8329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635359#comment-13635359 ] Nick Dimiduk commented on HBASE-8329: - If I understand the tickets correctly, all three of them are addressing the same symptoms via slightly different approaches. HBASE-5867 increases the threshold on the number of HFiles that triggers a major compaction. HBASE-3743 introduces the idea of a major compaction manager that orchestrates a rolling compaction across the cluster. This ticket introduces a local compaction rate-limiter, configured on both time of day and IO throughput. [~aoxiang] does that summary sound about right? I personally like the idea of both the macro monitor (HBASE-3743) and the localized throttle (this ticket). This patch would be improved by extracting out the throttling policy interface, with this peak+rate scheme as one implementation. Based on my novice understanding of the storage system, tweaking the threshold on HFile count (HBASE-5867) is a bandaid specific to the current default storage implementation. It will become less applicable as [~sershe]'s work on modularization continues. [~stack], [~nspiegelberg], [~larsgeorge], [~sershe] what's the right approach here? Limit compaction speed -- Key: HBASE-8329 URL: https://issues.apache.org/jira/browse/HBASE-8329 Project: HBase Issue Type: Improvement Components: Compaction Reporter: binlijin Attachments: HBASE-8329-trunk.patch There is no speed or resource limit for compaction; I think we should add this feature, especially when requests burst. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
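The "localized throttle" being proposed is, in essence, a rate limiter on compaction IO. A minimal token-bucket sketch (assumed names, not the patch's actual code) shows the shape such a throttling-policy implementation could take, checked per written chunk rather than per KeyValue:

```java
// Hypothetical sketch of a local compaction throttle: a token bucket capping
// compaction write throughput in bytes/sec. Illustrative only, not HBase code.
public class CompactionThrottle {
    private final long bytesPerSecond; // configured limit, also the burst size
    private double tokens;             // currently available byte budget
    private long lastRefillNanos;

    public CompactionThrottle(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
        this.tokens = bytesPerSecond; // start with one second of burst
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns how many milliseconds the compactor should sleep before writing
    // `bytes` more. Refills the bucket from elapsed wall time first.
    public synchronized long throttleMillis(long bytes) {
        long now = System.nanoTime();
        tokens = Math.min(bytesPerSecond,
                tokens + (now - lastRefillNanos) / 1e9 * bytesPerSecond);
        lastRefillNanos = now;
        tokens -= bytes;
        if (tokens >= 0) {
            return 0; // within budget, no delay
        }
        return (long) (-tokens / bytesPerSecond * 1000); // wait out the deficit
    }

    public static void main(String[] args) {
        CompactionThrottle t = new CompactionThrottle(10_000_000L); // 10 MB/s
        System.out.println(t.throttleMillis(5_000_000L));  // within budget: 0
        System.out.println(t.throttleMillis(30_000_000L) > 0); // over budget: true
    }
}
```

Extracting an interface around throttleMillis, as suggested above, would let a peak/off-peak schedule and this rate cap coexist as separate policies.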
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635360#comment-13635360 ] Jean-Marc Spaggiari commented on HBASE-8374: Makes sense! I'm +4 with patch v4. NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt, 8374-trunk-v2.txt, 8374-trunk-v3.txt, 8374-trunk-v4.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. When it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think it should be enough. Now, looking at the code, the NPE is strange. {noformat} 2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception java.lang.NullPointerException at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.<init>(BaseLoadBalancer.java:145) at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194) at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295) at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48) at org.apache.hadoop.hbase.Chore.run(Chore.java:81) at java.lang.Thread.run(Thread.java:662) 2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META.
starting at key '' {noformat} {code} if (regionFinder != null) { //region location List<ServerName> loc = regionFinder.getTopBlockLocations(region); regionLocations[regionIndex] = new int[loc.size()]; for (int i = 0; i < loc.size(); i++) { regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); // <= NPE here } } {code} pinging [~enis], just in case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8317) Seek returns wrong result with PREFIX_TREE Encoding
[ https://issues.apache.org/jira/browse/HBASE-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635375#comment-13635375 ] ramkrishna.s.vasudevan commented on HBASE-8317: --- I think this can be committed. Seek returns wrong result with PREFIX_TREE Encoding --- Key: HBASE-8317 URL: https://issues.apache.org/jira/browse/HBASE-8317 Project: HBase Issue Type: Bug Affects Versions: 0.95.0 Reporter: chunhui shen Assignee: chunhui shen Attachments: HBASE-8317-v1.patch, hbase-trunk-8317.patch, hbase-trunk-8317v3.patch TestPrefixTreeEncoding#testSeekWithFixedData from the patch could reproduce the bug. An example of the bug case: Suppose the following rows: 1. row3/c1:q1/ 2. row3/c1:q2/ 3. row3/c1:q3/ 4. row4/c1:q1/ 5. row4/c1:q2/ After seeking the row 'row30', the expected peek KV is row4/c1:q1/, but the actual is row3/c1:q1/. I just fixed this bug case in the patch. Maybe we can do more for other potential problems if anyone is familiar with the PREFIX_TREE code. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
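The expected seek semantics from that bug report can be illustrated with a sorted map standing in for the prefix-tree encoder (a sketch, not the encoder's code): seeking a key must position the scanner at the first cell greater than or equal to that key, so seeking "row30" must land on row4, not fall back to row3.

```java
import java.util.TreeMap;

// Sketch of the seek contract the PREFIX_TREE encoder violated, using
// TreeMap.ceilingKey as the reference "seek >= key" behavior.
public class SeekSemantics {
    public static void main(String[] args) {
        TreeMap<String, String> cells = new TreeMap<>();
        cells.put("row3/c1:q1", "v");
        cells.put("row3/c1:q2", "v");
        cells.put("row3/c1:q3", "v");
        cells.put("row4/c1:q1", "v");
        cells.put("row4/c1:q2", "v");
        // '/' sorts before '0', so every row3 cell is < "row30" and every
        // row4 cell is > "row30". The first key >= "row30" is therefore
        // row4/c1:q1; the buggy encoder peeked row3/c1:q1 instead.
        System.out.println(cells.ceilingKey("row30")); // row4/c1:q1
    }
}
```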
[jira] [Commented] (HBASE-7437) Improve CompactSelection
[ https://issues.apache.org/jira/browse/HBASE-7437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635390#comment-13635390 ] Sergey Shelukhin commented on HBASE-7437: - Good catch w/ Calendar; I looked at the sources, and it appears to be the case indeed. Why do you pass the peak expiration time as the current time? Would it be good to pass a time close to the end of the current hour, for example, based on calendar minutes? It could be imprecise in some freak cases, but with much less object creation. Improve CompactSelection Key: HBASE-7437 URL: https://issues.apache.org/jira/browse/HBASE-7437 Project: HBase Issue Type: Improvement Components: Compaction Reporter: Hiroshi Ikeda Assignee: Hiroshi Ikeda Priority: Minor Attachments: HBASE-7437.patch, HBASE-7437-V2.patch, HBASE-7437-V3.patch, HBASE-7437-V4.patch 1. Using AtomicLong makes CompactSelection simple and improves its performance. 2. There are unused fields and methods. 3. The fields should be private. 4. Assertion in the method finishRequest seems wrong: {code} public void finishRequest() { if (isOffPeakCompaction) { long newValueToLog = -1; synchronized(compactionCountLock) { assert !isOffPeakCompaction : "Double-counting off-peak count for compaction"; {code} The above assertion seems almost always false. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8329) Limit compaction speed
[ https://issues.apache.org/jira/browse/HBASE-8329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635397#comment-13635397 ] Sergey Shelukhin commented on HBASE-8329: - 1) Very similar to the off-peak hours tracking; should they be unified? See also HBASE-7437. 2) Calling isPeakHour after every KV seems excessive; can it be improved? Esp. in light of 3. 3) In HBASE-7437 it was mentioned that the calendar object does not auto-update, so getting the time of day actually gets you the time of day when the calendar was first created. That way it wouldn't work, and updating the calendar after every KV would seem to be expensive. 4) Do you have some numbers for throttling? Limit compaction speed -- Key: HBASE-8329 URL: https://issues.apache.org/jira/browse/HBASE-8329 Project: HBase Issue Type: Improvement Components: Compaction Reporter: binlijin Attachments: HBASE-8329-trunk.patch There is no speed or resource limit for compaction; I think we should add this feature, especially when requests burst. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
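The Calendar pitfall raised in points 2 and 3 is easy to demonstrate in isolation (a sketch, unrelated to the patch's code): a `java.util.Calendar` captures the wall clock at `getInstance()` time, so a cached instance keeps reporting the hour at which it was created.

```java
import java.util.Calendar;

// Sketch of the Calendar snapshot behavior discussed above: Calendar does not
// track the clock after creation, so caching one instance gives stale answers.
public class CalendarSnapshot {
    private final Calendar cached = Calendar.getInstance(); // frozen at creation

    // Stale if this object is kept around across hour boundaries.
    public int cachedHour() {
        return cached.get(Calendar.HOUR_OF_DAY);
    }

    // Correct but allocates: take a fresh snapshot per query. A cheap middle
    // ground is to recompute only every few seconds of System.currentTimeMillis.
    public static int currentHour() {
        return Calendar.getInstance().get(Calendar.HOUR_OF_DAY);
    }

    public static void main(String[] args) {
        System.out.println("current hour of day: " + currentHour());
        // A long-lived CalendarSnapshot would keep printing its creation hour,
        // which is exactly why a per-KV isPeakHour check on a cached Calendar
        // would be wrong, and a fresh Calendar per KV would be expensive.
        System.out.println("cached hour at creation: " + new CalendarSnapshot().cachedHour());
    }
}
```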
[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog
[ https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635395#comment-13635395 ] Devaraj Das commented on HBASE-6774: I am fine with that, [~v.himanshu].. I guess we should start with a proposal and agree on one (this jira had multiple proposals). Immediate assignment of regions that don't have entries in HLog --- Key: HBASE-6774 URL: https://issues.apache.org/jira/browse/HBASE-6774 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.95.2 Reporter: Nicolas Liochon Assignee: Himanshu Vashishtha The algo today, after a failure detection, is: - split the logs - when all the logs are split, assign the regions. But some regions can have no entries at all in the HLog. There are many reasons for this: - reference or historical tables: bulk written sometimes, then read only. - sequential rowkeys: in this case most of the regions will be read only, but they can be on a regionserver with a lot of writes. - tables flushed often for safety reasons: I'm thinking about meta here. For meta we can imagine flushing very often. Hence the recovery for meta, in many cases, will be the failure detection time. There are different possible algos: Option 1) A new task is added, in parallel with the split. This task reads all the HLogs. If there is no entry for a region, this region is assigned. Pro: simple. Cons: we will need to read all the files; it adds a read. Option 2) The master writes in ZK the number of log files, per region. When the regionserver starts the split, it reads the full block (64M) and decreases the log file counter of the region. If it reaches 0, the assign starts. At the end of its split, the region server decreases the counter as well. This allows starting the assign even if not all the HLogs are finished, and would make some regions available even if we have an issue in one of the log files.
Pro: parallel. Cons: adds work for the region server; requires reading the whole file before starting to write. Option 3) Add some metadata at the end of the log file. The last log file won't have metadata (if we are recovering, it's because the server crashed), but the others will, and the last log file should be smaller (half a block on average). Option 4) Still some metadata, but in a different file. Cons: writes are increased (but not by much; we just need to write the region once). Pros: if we lose the HLog files (major failure, no replica available) we can still continue with the regions that were not written at this stage. I think it should be done, even if none of the algorithms above is totally convincing yet. It's linked as well to locality and short-circuit reads: with these two points, reading the file twice becomes much less of an issue, for example. My current preference would be to open the file twice in the region server: once for splitting as of today, once for a quick read looking for unused regions. Who knows, maybe it would even be faster this way; the quick-read thread would warm up the different caches for the splitting thread. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7437) Improve CompactSelection
[ https://issues.apache.org/jira/browse/HBASE-7437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635392#comment-13635392 ] Sergey Shelukhin commented on HBASE-7437: - Sorry for the long response time; it fell off my radar. Improve CompactSelection Key: HBASE-7437 URL: https://issues.apache.org/jira/browse/HBASE-7437 Project: HBase Issue Type: Improvement Components: Compaction Reporter: Hiroshi Ikeda Assignee: Hiroshi Ikeda Priority: Minor Attachments: HBASE-7437.patch, HBASE-7437-V2.patch, HBASE-7437-V3.patch, HBASE-7437-V4.patch 1. Using AtomicLong makes CompactSelection simple and improves its performance. 2. There are unused fields and methods. 3. The fields should be private. 4. Assertion in the method finishRequest seems wrong: {code} public void finishRequest() { if (isOffPeakCompaction) { long newValueToLog = -1; synchronized(compactionCountLock) { assert !isOffPeakCompaction : "Double-counting off-peak count for compaction"; {code} The above assertion seems almost always false. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog
[ https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635384#comment-13635384 ] Nicolas Liochon commented on HBASE-6774: Ok for me of course :-). Thanks for this. I don't have an ideal solution in mind; I guess there is some design work to do here, but maybe Devaraj is more advanced than me. I assigned the jira to you in case you don't have the rights for this. Immediate assignment of regions that don't have entries in HLog --- Key: HBASE-6774 URL: https://issues.apache.org/jira/browse/HBASE-6774 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.95.2 Reporter: Nicolas Liochon The algo today, after a failure detection, is: - split the logs - when all the logs are split, assign the regions. But some regions can have no entries at all in the HLog. There are many reasons for this: - reference or historical tables: bulk written sometimes, then read only. - sequential rowkeys: in this case most of the regions will be read only, but they can be on a regionserver with a lot of writes. - tables flushed often for safety reasons: I'm thinking about meta here. For meta we can imagine flushing very often. Hence the recovery for meta, in many cases, will be the failure detection time. There are different possible algos: Option 1) A new task is added, in parallel with the split. This task reads all the HLogs. If there is no entry for a region, this region is assigned. Pro: simple. Cons: we will need to read all the files; it adds a read. Option 2) The master writes in ZK the number of log files, per region. When the regionserver starts the split, it reads the full block (64M) and decreases the log file counter of the region. If it reaches 0, the assign starts. At the end of its split, the region server decreases the counter as well. This allows starting the assign even if not all the HLogs are finished,
and would make some regions available even if we have an issue in one of the log files. Pro: parallel. Cons: adds work for the region server; requires reading the whole file before starting to write. Option 3) Add some metadata at the end of the log file. The last log file won't have metadata (if we are recovering, it's because the server crashed), but the others will, and the last log file should be smaller (half a block on average). Option 4) Still some metadata, but in a different file. Cons: writes are increased (but not by much; we just need to write the region once). Pros: if we lose the HLog files (major failure, no replica available) we can still continue with the regions that were not written at this stage. I think it should be done, even if none of the algorithms above is totally convincing yet. It's linked as well to locality and short-circuit reads: with these two points, reading the file twice becomes much less of an issue, for example. My current preference would be to open the file twice in the region server: once for splitting as of today, once for a quick read looking for unused regions. Who knows, maybe it would even be faster this way; the quick-read thread would warm up the different caches for the splitting thread. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog
[ https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-6774: --- Assignee: Himanshu Vashishtha Immediate assignment of regions that don't have entries in HLog --- Key: HBASE-6774 URL: https://issues.apache.org/jira/browse/HBASE-6774 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.95.2 Reporter: Nicolas Liochon Assignee: Himanshu Vashishtha The algo today, after a failure detection, is: - split the logs - when all the logs are split, assign the regions. But some regions can have no entries at all in the HLog. There are many reasons for this: - reference or historical tables: bulk written sometimes, then read only. - sequential rowkeys: in this case most of the regions will be read only, but they can be on a regionserver with a lot of writes. - tables flushed often for safety reasons: I'm thinking about meta here. For meta we can imagine flushing very often. Hence the recovery for meta, in many cases, will be the failure detection time. There are different possible algos: Option 1) A new task is added, in parallel with the split. This task reads all the HLogs. If there is no entry for a region, this region is assigned. Pro: simple. Cons: we will need to read all the files; it adds a read. Option 2) The master writes in ZK the number of log files, per region. When the regionserver starts the split, it reads the full block (64M) and decreases the log file counter of the region. If it reaches 0, the assign starts. At the end of its split, the region server decreases the counter as well. This allows starting the assign even if not all the HLogs are finished, and would make some regions available even if we have an issue in one of the log files. Pro: parallel. Cons: adds work for the region server; requires reading the whole file before starting to write. Option 3) Add some metadata at the end of the log file.
The last log file won't have metadata (if we are recovering, it's because the server crashed), but the others will, and the last log file should be smaller (half a block on average). Option 4) Still some metadata, but in a different file. Cons: writes are increased (but not by much; we just need to write the region once). Pros: if we lose the HLog files (major failure, no replica available) we can still continue with the regions that were not written at this stage. I think it should be done, even if none of the algorithms above is totally convincing yet. It's linked as well to locality and short-circuit reads: with these two points, reading the file twice becomes much less of an issue, for example. My current preference would be to open the file twice in the region server: once for splitting as of today, once for a quick read looking for unused regions. Who knows, maybe it would even be faster this way; the quick-read thread would warm up the different caches for the splitting thread. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-7239) Verify protobuf serialization is correctly chunking upon read to avoid direct memory OOMs
[ https://issues.apache.org/jira/browse/HBASE-7239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj Das updated HBASE-7239: --- Status: Patch Available (was: Open) Let's see what hadoopqa says. Verify protobuf serialization is correctly chunking upon read to avoid direct memory OOMs - Key: HBASE-7239 URL: https://issues.apache.org/jira/browse/HBASE-7239 Project: HBase Issue Type: Sub-task Reporter: Lars Hofhansl Priority: Critical Fix For: 0.95.1 Attachments: 7239-1.patch Result.readFields() used to read from the input stream in 8k chunks to avoid OOM issues with direct memory. (Reading variable-sized chunks into direct memory prevents the JVM from reusing the allocated direct memory, and direct memory is only collected during full GCs.) This is just to verify that protobuf's parseFrom-type methods do the right thing as well, so that we do not reintroduce this problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8279) Performance Evaluation does not consider the args passed in case of more than one client
[ https://issues.apache.org/jira/browse/HBASE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635409#comment-13635409 ] Lars Hofhansl commented on HBASE-8279: -- +1 Performance Evaluation does not consider the args passed in case of more than one client Key: HBASE-8279 URL: https://issues.apache.org/jira/browse/HBASE-8279 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Priority: Minor Fix For: 0.98.0, 0.94.8, 0.95.1 Attachments: HBASE-8279_1.patch, HBASE-8279_2.patch, HBASE-8279.patch Performance evaluation gives a provision to pass the table name. The table name is considered when we first initialize the table: the disabling and creation of tables happens with the name that we pass. But the write and read tests again use only the default table, and so the perf evaluation fails. I think the problem is like this {code} ./hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --table=MyTable2 --presplit=70 randomRead 2 {code} {code} 13/04/04 21:42:07 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'MyTable2,0002067171,1365126124904.bc9e936f4f8ca8ee55eb90091d4a13b6.', STARTKEY => '0002067171', ENDKEY => '', ENCODED => bc9e936f4f8ca8ee55eb90091d4a13b6,} 13/04/04 21:42:07 INFO hbase.PerformanceEvaluation: Table created with 70 splits {code} You can see that the specified table is created with the splits.
But when the read starts: {code} Caused by: org.apache.hadoop.hbase.exceptions.TableNotFoundException: TestTable
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1157)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1034)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:984)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:246)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:187)
at org.apache.hadoop.hbase.PerformanceEvaluation$Test.testSetup(PerformanceEvaluation.java:851)
at org.apache.hadoop.hbase.PerformanceEvaluation$Test.test(PerformanceEvaluation.java:869)
at org.apache.hadoop.hbase.PerformanceEvaluation.runOneClient(PerformanceEvaluation.java:1495)
at org.apache.hadoop.hbase.PerformanceEvaluation$1.run(PerformanceEvaluation.java:590) {code} It says TestTable, the default table, was not found.
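The shape of the fix the report calls for can be sketched very simply (hypothetical names — resolveTableName, cmdTableName, and DEFAULT_TABLE are illustrative, not PerformanceEvaluation's actual fields): resolve the table name once from the parsed options and hand that same name to every client test, instead of letting the read/write tests fall back to the hardcoded default.

```java
public class TableNameResolution {
    // Hypothetical default, mirroring the "TestTable" in the stack trace above.
    static final String DEFAULT_TABLE = "TestTable";

    // Resolve once from the parsed --table option; every client test should
    // receive this value rather than re-deriving the default on its own.
    static String resolveTableName(String cmdTableName) {
        return (cmdTableName == null || cmdTableName.isEmpty()) ? DEFAULT_TABLE : cmdTableName;
    }

    public static void main(String[] args) {
        System.out.println(resolveTableName("MyTable2")); // the --table=MyTable2 case
        System.out.println(resolveTableName(null));       // no --table given
    }
}
```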
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635412#comment-13635412 ] Hadoop QA commented on HBASE-8374: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12579361/8374-trunk-v3.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5346//console This message is automatically generated. NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt, 8374-trunk-v2.txt, 8374-trunk-v3.txt, 8374-trunk-v4.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. 
When it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think that should be enough. Now, looking at the code, the NPE is strange. {noformat} 2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception java.lang.NullPointerException
at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.<init>(BaseLoadBalancer.java:145)
at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194)
at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295)
at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48)
at org.apache.hadoop.hbase.Chore.run(Chore.java:81)
at java.lang.Thread.run(Thread.java:662)
2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key '' {noformat} {code} if (regionFinder != null) { // region location
List<ServerName> loc = regionFinder.getTopBlockLocations(region);
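The failure pattern suggested by the stack trace can be sketched defensively as follows — an illustration only, under the assumption that the location lookup can return null for a region with no known locations; the surrounding class and method names are hypothetical stand-ins, not the balancer's real API:

```java
import java.util.Collections;
import java.util.List;

public class TopLocations {
    // Stand-in for the balancer's region-location lookup: a null result must be
    // normalized to an empty list before the caller indexes into it, otherwise
    // the Cluster constructor hits an NPE like the one in the trace above.
    static List<String> topBlockLocations(List<String> found) {
        return (found == null) ? Collections.emptyList() : found;
    }

    public static void main(String[] args) {
        System.out.println(topBlockLocations(null).size()); // prints 0 instead of throwing
    }
}
```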
[jira] [Created] (HBASE-8375) Streamline Table durability settings
Lars Hofhansl created HBASE-8375: Summary: Streamline Table durability settings Key: HBASE-8375 URL: https://issues.apache.org/jira/browse/HBASE-8375 Project: HBase Issue Type: Sub-task Reporter: Lars Hofhansl HBASE-7801 introduces the notion of per-mutation fine-grained durability settings. This issue is to consider and discuss the same for the per-table settings (i.e. what would be used if the mutation indicates USE_DEFAULT). I propose the following settings per table: * SKIP_WAL (i.e. an unlogged table) * ASYNC_WAL (the current deferred log flush) * SYNC_WAL (the current default) * FSYNC_WAL (for future uses of HDFS' hsync())
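The four proposed per-table settings map naturally onto an ordered enum; a sketch (the class name and comments are illustrative — this is not claiming to be the type HBase actually ships):

```java
public class TableDurability {
    // The four proposed per-table durability levels, ordered from weakest to
    // strongest guarantee, as listed in the proposal above.
    enum Durability {
        SKIP_WAL,   // unlogged table: mutations never hit the WAL
        ASYNC_WAL,  // deferred log flush: WAL written, synced in the background
        SYNC_WAL,   // current default: WAL synced before the call returns
        FSYNC_WAL   // for future HDFS hsync(): force to disk, not just to the pipeline
    }

    public static void main(String[] args) {
        // Declaring them weakest-to-strongest lets code compare guarantees by ordinal.
        System.out.println(Durability.SYNC_WAL.compareTo(Durability.SKIP_WAL) > 0); // prints true
    }
}
```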
[jira] [Updated] (HBASE-8353) -ROOT-/.META. regions are hanging if master restarted while closing -ROOT-/.META. regions on dead RS
[ https://issues.apache.org/jira/browse/HBASE-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rajeshbabu updated HBASE-8353: -- Attachment: HBASE-8353_94_2.patch -ROOT-/.META. regions are hanging if master restarted while closing -ROOT-/.META. regions on dead RS Key: HBASE-8353 URL: https://issues.apache.org/jira/browse/HBASE-8353 Project: HBase Issue Type: Bug Components: Region Assignment Affects Versions: 0.94.6 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.94.8 Attachments: HBASE-8353_94_2.patch, HBASE-8353_94.patch ROOT/META are not getting assigned if the master restarted while closing ROOT/META. Let's suppose the catalog table regions are in M_ZK_REGION_CLOSING state during master initialization; then we just add them to RIT and wait for the TimeoutMonitor (TM). {code} if (isOnDeadServer(regionInfo, deadServers)
    && (data.getOrigin() == null || !serverManager.isServerOnline(data.getOrigin()))) {
  // If was on dead server, its closed now. Force to OFFLINE and this
  // will get it reassigned if appropriate
  forceOffline(regionInfo, data);
} else {
  // Just insert region into RIT.
  // If this never updates the timeout will trigger new assignment
  regionsInTransition.put(encodedRegionName, new RegionState(
      regionInfo, RegionState.State.CLOSING, data.getStamp(), data.getOrigin()));
} {code} isOnDeadServer always returns false for ROOT/META because deadServers is null. Even the TM cannot close them properly, because they are not in the online regions list since they are not yet assigned. {code} synchronized (this.regions) {
  // Check if this region is currently assigned
  if (!regions.containsKey(region)) {
    LOG.debug("Attempted to unassign region " + region.getRegionNameAsString()
        + " but it is not currently assigned anywhere");
    return;
  }
} {code}
[jira] [Commented] (HBASE-6970) hbase-deamon.sh creates/updates pid file even when that start failed.
[ https://issues.apache.org/jira/browse/HBASE-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635421#comment-13635421 ] Himanshu Vashishtha commented on HBASE-6970: [~nkeywal] Can you please tell why we are not deleting the pid file after removing the znode? hbase-deamon.sh creates/updates pid file even when that start failed. - Key: HBASE-6970 URL: https://issues.apache.org/jira/browse/HBASE-6970 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Nicolas Liochon We just ran into a strange issue where we could neither start nor stop services with hbase-daemon.sh. The problem is this: {code} nohup nice -n $HBASE_NICENESS $HBASE_HOME/bin/hbase \
    --config ${HBASE_CONF_DIR} \
    $command "$@" $startStop > "$logout" 2>&1 < /dev/null &
echo $! > $pid {code} So the pid file is created or updated even when the start of the service failed. The next stop command will then fail, because the pid file has the wrong pid in it. Edit: Spelling and more spelling errors.
[jira] [Commented] (HBASE-8279) Performance Evaluation does not consider the args passed in case of more than one client
[ https://issues.apache.org/jira/browse/HBASE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635426#comment-13635426 ] Hadoop QA commented on HBASE-8279: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12579362/HBASE-8279_2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. 
The patch failed these unit tests: org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5347//console This message is automatically generated. Performance Evaluation does not consider the args passed in case of more than one client Key: HBASE-8279 URL: https://issues.apache.org/jira/browse/HBASE-8279 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Priority: Minor Fix For: 0.98.0, 0.94.8, 0.95.1 Attachments: HBASE-8279_1.patch, HBASE-8279_2.patch, HBASE-8279.patch Performance evaluation gives a provision to pass the table name. 
The table name is considered when we first initialize the table - the disabling and creation of tables happens with the name that we pass. But the write and read tests again use only the default table, so the perf evaluation fails. I think the problem is like this {code} ./hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --table=MyTable2 --presplit=70 randomRead 2 {code} {code} 13/04/04 21:42:07 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'MyTable2,0002067171,1365126124904.bc9e936f4f8ca8ee55eb90091d4a13b6.', STARTKEY => '0002067171', ENDKEY => '', ENCODED => bc9e936f4f8ca8ee55eb90091d4a13b6,} 13/04/04 21:42:07 INFO hbase.PerformanceEvaluation: Table created with 70 splits {code} You can see that the specified table is created with the splits. But when the read starts {code} Caused by: org.apache.hadoop.hbase.exceptions.TableNotFoundException: TestTable
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1157)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1034)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:984) at
[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog
[ https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635427#comment-13635427 ] Lars Hofhansl commented on HBASE-6774: -- This mingles (somewhat at least) with HBASE-8375, which I just opened. One of the options proposed there is unlogged tables (tables that never write WAL entries). All regions of those tables could be assigned immediately. Immediate assignment of regions that don't have entries in HLog --- Key: HBASE-6774 URL: https://issues.apache.org/jira/browse/HBASE-6774 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.95.2 Reporter: Nicolas Liochon Assignee: Himanshu Vashishtha Today the algorithm, after a failure detection, is: - split the logs - when all the logs are split, assign the regions. But some regions can have no entries at all in the HLog. There are many reasons for this: - reference or historical tables: bulk-written sometimes, then read-only. - sequential rowkeys: in this case, most of the regions will be read-only, but they can be on a regionserver with a lot of writes. - tables flushed often for safety reasons. I'm thinking about meta here. For meta, we can imagine flushing very often. Hence the recovery for meta, in many cases, will be just the failure detection time. There are different possible algorithms: Option 1) A new task is added, in parallel to the split. This task reads all the HLogs. If there is no entry for a region, this region is assigned. Pro: simple. Cons: we will need to read all the files; adds a read. Option 2) The master writes in ZK the number of log files, per region. When the regionserver starts the split, it reads the full block (64M) and decreases the log file counter of the region. If it reaches 0, the assign starts. At the end of its split, the region server decreases the counter as well. This allows starting the assign even if not all the HLogs are finished. 
It would allow making some regions available even if we have an issue in one of the log files. Pro: parallel. Cons: adds work for the region server; requires reading the whole file before starting to write. Option 3) Add some metadata at the end of the log file. The last log file won't have metadata — if we are recovering, it's because the server crashed — but the others will, and the last log file should be smaller (half a block on average). Option 4) Still some metadata, but in a different file. Cons: writes are increased (but not by much; we just need to write the region once). Pros: if we lose the HLog files (major failure, no replica available) we can still continue with the regions that were not written at this stage. I think it should be done, even if none of the algorithms above is totally convincing yet. It's linked as well to locality and short-circuit reads: with these two points, reading the file twice becomes much less of an issue, for example. My current preference would be to open the file twice in the region server: once for splitting as of today, once for a quick read looking for unused regions. Who knows, maybe it would even be faster this way; the quick-read thread would warm up the different caches for the splitting thread.
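The bookkeeping in Option 2 can be sketched with per-region counters — an illustration only, with ZooKeeper replaced by an in-memory map and all names hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class RegionLogCounter {
    // region name -> number of log files that may still contain entries for it.
    private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    // Master side: record how many log files could reference this region.
    void expect(String region, int logFiles) {
        pending.put(region, new AtomicInteger(logFiles));
    }

    // Splitter side: called when one log file has been fully scanned for `region`.
    // Returns true when no log file can still hold entries for it, i.e. the
    // region can be assigned before the rest of the split finishes.
    boolean logDone(String region) {
        AtomicInteger c = pending.get(region);
        return c != null && c.decrementAndGet() == 0;
    }

    public static void main(String[] args) {
        RegionLogCounter counter = new RegionLogCounter();
        counter.expect("region-A", 2);
        System.out.println(counter.logDone("region-A")); // prints false (one file left)
        System.out.println(counter.logDone("region-A")); // prints true (safe to assign)
    }
}
```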
[jira] [Commented] (HBASE-8369) MapReduce over snapshot files
[ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635428#comment-13635428 ] Enis Soztutar commented on HBASE-8369: -- bq. in general I'm against having another way to direct access the data, since it means that you're giving up on optimizing the main one. Conceptually, this is similar to the short circuit reads for HDFS. I agree that we should not need these kinds of optimizations, since in the long term, it will be impossible to implement QoS for IO if you give direct access to local files (for SSR) / hdfs files (for snapshot). bq. if the final implementation will be like this one using the HRegion object, I'll be +1. Yes, that is the plan. bq. Are the initCredentials modifications in TableMapReduceUtil required for the scope of this patch? Yes, we do not need to initCredentials, since we are not talking to any hbase server. MapReduce over snapshot files - Key: HBASE-8369 URL: https://issues.apache.org/jira/browse/HBASE-8369 Project: HBase Issue Type: New Feature Components: mapreduce, snapshots Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.0, 0.95.2 Attachments: hbase-8369_v0.patch The idea is to add an InputFormat, which can run the mapreduce job over snapshot files directly bypassing hbase server layer. The IF is similar in usage to TableInputFormat, taking a Scan object from the user, but instead of running from an online table, it runs from a table snapshot. We do one split per region in the snapshot, and open an HRegion inside the RecordReader. A RegionScanner is used internally for doing the scan without any HRegionServer bits. Users have been asking and searching for ways to run MR jobs by reading directly from hfiles, so this allows new use cases if reading from stale data is ok: - Take snapshots periodically, and run MR jobs only on snapshots. - Export snapshots to remote hdfs cluster, run the MR jobs at that cluster without HBase cluster. 
- (Future use case) Combine snapshot data with online hbase data: Scan from yesterday's snapshot, but read today's data from online hbase cluster.
[jira] [Commented] (HBASE-6970) hbase-deamon.sh creates/updates pid file even when that start failed.
[ https://issues.apache.org/jira/browse/HBASE-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635431#comment-13635431 ] Nicolas Liochon commented on HBASE-6970: In the java code, you mean? Is it in the context of this jira? hbase-deamon.sh creates/updates pid file even when that start failed. - Key: HBASE-6970 URL: https://issues.apache.org/jira/browse/HBASE-6970 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Nicolas Liochon We just ran into a strange issue where we could neither start nor stop services with hbase-daemon.sh. The problem is this: {code} nohup nice -n $HBASE_NICENESS $HBASE_HOME/bin/hbase \
    --config ${HBASE_CONF_DIR} \
    $command "$@" $startStop > "$logout" 2>&1 < /dev/null &
echo $! > $pid {code} So the pid file is created or updated even when the start of the service failed. The next stop command will then fail, because the pid file has the wrong pid in it. Edit: Spelling and more spelling errors.
[jira] [Commented] (HBASE-8353) -ROOT-/.META. regions are hanging if master restarted while closing -ROOT-/.META. regions on dead RS
[ https://issues.apache.org/jira/browse/HBASE-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635438#comment-13635438 ] rajeshbabu commented on HBASE-8353: --- [~jxiang] bq. We always re-assign the meta table when the master restarts, if it's closing? Yes, we will re-assign even if the RS holding the catalog region is not dead. @Ram, with the first patch we are not able to identify whether the origin is dead or not, which may cause double assignments. In the latest patch I tried to handle it without changing the origin. What do you say about the latest patch? -ROOT-/.META. regions are hanging if master restarted while closing -ROOT-/.META. regions on dead RS Key: HBASE-8353 URL: https://issues.apache.org/jira/browse/HBASE-8353 Project: HBase Issue Type: Bug Components: Region Assignment Affects Versions: 0.94.6 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.94.8 Attachments: HBASE-8353_94_2.patch, HBASE-8353_94.patch ROOT/META are not getting assigned if the master restarted while closing ROOT/META. Let's suppose the catalog table regions are in M_ZK_REGION_CLOSING state during master initialization; then we just add them to RIT and wait for the TimeoutMonitor (TM). {code} if (isOnDeadServer(regionInfo, deadServers)
    && (data.getOrigin() == null || !serverManager.isServerOnline(data.getOrigin()))) {
  // If was on dead server, its closed now. Force to OFFLINE and this
  // will get it reassigned if appropriate
  forceOffline(regionInfo, data);
} else {
  // Just insert region into RIT.
  // If this never updates the timeout will trigger new assignment
  regionsInTransition.put(encodedRegionName, new RegionState(
      regionInfo, RegionState.State.CLOSING, data.getStamp(), data.getOrigin()));
} {code} isOnDeadServer always returns false for ROOT/META because deadServers is null. Even the TM cannot close them properly, because they are not in the online regions list since they are not yet assigned. 
{code} synchronized (this.regions) {
  // Check if this region is currently assigned
  if (!regions.containsKey(region)) {
    LOG.debug("Attempted to unassign region " + region.getRegionNameAsString()
        + " but it is not currently assigned anywhere");
    return;
  }
} {code}
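The root cause here — a check that silently degrades when its input is null — can be sketched as follows. This is a simplified illustration with hypothetical types (the real isOnDeadServer works on HRegionInfo and a map of dead servers), contrasting the failing shape with a null-aware one; it is not the actual patch:

```java
import java.util.Set;

public class DeadServerCheck {
    // Failing shape: a null deadServers set makes the check vacuously false,
    // so a catalog region on a dead server is treated as if its server were alive
    // and just sits in RIT waiting for a close that can never happen.
    static boolean isOnDeadServerBroken(String host, Set<String> deadServers) {
        return deadServers != null && deadServers.contains(host);
    }

    // Null-aware shape for catalog regions: when the dead-server set is unknown,
    // err toward treating the hosting server as dead so the region gets re-assigned.
    static boolean isOnDeadServerSafe(String host, Set<String> deadServers) {
        return deadServers == null || deadServers.contains(host);
    }

    public static void main(String[] args) {
        System.out.println(isOnDeadServerBroken("rs1", null)); // prints false: the bug
        System.out.println(isOnDeadServerSafe("rs1", null));   // prints true: forces re-assignment
    }
}
```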
[jira] [Commented] (HBASE-6739) Single put should avoid batch overhead when autoflush is on
[ https://issues.apache.org/jira/browse/HBASE-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635439#comment-13635439 ] Nick Dimiduk commented on HBASE-6739: - Can this be closed as a dupe of HBASE-5824? Single put should avoid batch overhead when autoflush is on Key: HBASE-6739 URL: https://issues.apache.org/jira/browse/HBASE-6739 Project: HBase Issue Type: Improvement Reporter: Jimmy Xiang Priority: Minor Currently, even when autoflush is on, a single put is handled the same way as if autoflush is off: convert the put to multi-action, create a callable, hand it to an executor to process, wait for it to complete. We can avoid this overhead for single put if autoflush is on.
[jira] [Commented] (HBASE-8369) MapReduce over snapshot files
[ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635453#comment-13635453 ] Gary Helmling commented on HBASE-8369: -- bq. Yes, we do not need to initCredentials, since we are not talking to any hbase server. So would this completely bypass security? I also want this functionality for certain use cases, we should just be clear on this caveat. MapReduce over snapshot files - Key: HBASE-8369 URL: https://issues.apache.org/jira/browse/HBASE-8369 Project: HBase Issue Type: New Feature Components: mapreduce, snapshots Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.0, 0.95.2 Attachments: hbase-8369_v0.patch The idea is to add an InputFormat, which can run the mapreduce job over snapshot files directly bypassing hbase server layer. The IF is similar in usage to TableInputFormat, taking a Scan object from the user, but instead of running from an online table, it runs from a table snapshot. We do one split per region in the snapshot, and open an HRegion inside the RecordReader. A RegionScanner is used internally for doing the scan without any HRegionServer bits. Users have been asking and searching for ways to run MR jobs by reading directly from hfiles, so this allows new use cases if reading from stale data is ok: - Take snapshots periodically, and run MR jobs only on snapshots. - Export snapshots to remote hdfs cluster, run the MR jobs at that cluster without HBase cluster. - (Future use case) Combine snapshot data with online hbase data: Scan from yesterday's snapshot, but read today's data from online hbase cluster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8375) Streamline Table durability settings
[ https://issues.apache.org/jira/browse/HBASE-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635456#comment-13635456 ] Devaraj Das commented on HBASE-8375: I think we should seriously consider HBASE-5930 in the context of these "don't write to WAL" changes. That jira would have done at least some damage control in the event of node failures. Streamline Table durability settings Key: HBASE-8375 URL: https://issues.apache.org/jira/browse/HBASE-8375 Project: HBase Issue Type: Sub-task Reporter: Lars Hofhansl HBASE-7801 introduces the notion of per-mutation fine-grained durability settings. This issue is to consider and discuss the same for the per-table settings (i.e. what would be used if the mutation indicates USE_DEFAULT). I propose the following settings per table: * SKIP_WAL (i.e. an unlogged table) * ASYNC_WAL (the current deferred log flush) * SYNC_WAL (the current default) * FSYNC_WAL (for future uses of HDFS' hsync())
[jira] [Commented] (HBASE-6970) hbase-deamon.sh creates/updates pid file even when that start failed.
[ https://issues.apache.org/jira/browse/HBASE-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635460#comment-13635460 ] Himanshu Vashishtha commented on HBASE-6970: No, in the hbase-daemon.sh script, there is a clearZNode() method which deletes the rs znode, but the pid file is kept intact. I wonder why this is so. hbase-deamon.sh creates/updates pid file even when that start failed. - Key: HBASE-6970 URL: https://issues.apache.org/jira/browse/HBASE-6970 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Nicolas Liochon We just ran into a strange issue where we could neither start nor stop services with hbase-daemon.sh. The problem is this: {code} nohup nice -n $HBASE_NICENESS $HBASE_HOME/bin/hbase \
    --config ${HBASE_CONF_DIR} \
    $command "$@" $startStop > "$logout" 2>&1 < /dev/null &
echo $! > $pid {code} So the pid file is created or updated even when the start of the service failed. The next stop command will then fail, because the pid file has the wrong pid in it. Edit: Spelling and more spelling errors.
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635464#comment-13635464 ] Hadoop QA commented on HBASE-8374: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12579365/8374-trunk-v4.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5348//console This message is automatically generated. NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt, 8374-trunk-v2.txt, 8374-trunk-v3.txt, 8374-trunk-v4.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. 
When it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think it should be enough. Now, looking at the code, the NPE is strange. {noformat} 2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception java.lang.NullPointerException at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.<init>(BaseLoadBalancer.java:145) at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194) at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295) at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48) at org.apache.hadoop.hbase.Chore.run(Chore.java:81) at java.lang.Thread.run(Thread.java:662) 2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key '' {noformat} {code} if (regionFinder != null) { // region location List<ServerName> loc = regionFinder.getTopBlockLocations(region);
[jira] [Created] (HBASE-8376) MiniHBaseCluster#waitFor{Master|RegionServer}ToStop should implement timeout.
rajeshbabu created HBASE-8376: - Summary: MiniHBaseCluster#waitFor{Master|RegionServer}ToStop should implement timeout. Key: HBASE-8376 URL: https://issues.apache.org/jira/browse/HBASE-8376 Project: HBase Issue Type: Improvement Components: test Reporter: rajeshbabu Assignee: rajeshbabu Priority: Minor -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8376) MiniHBaseCluster#waitFor{Master|RegionServer}ToStop should implement timeout.
[ https://issues.apache.org/jira/browse/HBASE-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rajeshbabu updated HBASE-8376: -- Description: Presently we are ignoring the timeout in the waitForMasterToStop and waitForRegionServerToStop methods in MiniHBaseCluster {code} @Override public void waitForRegionServerToStop(ServerName serverName, long timeout) throws IOException { // ignore timeout for now waitOnRegionServer(getRegionServerIndex(serverName)); } {code} {code} @Override public void waitForMasterToStop(ServerName serverName, long timeout) throws IOException { // ignore timeout for now waitOnMaster(getMasterIndex(serverName)); } {code} We can implement the timeout in these methods. MiniHBaseCluster#waitFor{Master|RegionServer}ToStop should implement timeout. - Key: HBASE-8376 URL: https://issues.apache.org/jira/browse/HBASE-8376 Project: HBase Issue Type: Improvement Components: test Reporter: rajeshbabu Assignee: rajeshbabu Priority: Minor Presently we are ignoring the timeout in the waitForMasterToStop and waitForRegionServerToStop methods in MiniHBaseCluster {code} @Override public void waitForRegionServerToStop(ServerName serverName, long timeout) throws IOException { // ignore timeout for now waitOnRegionServer(getRegionServerIndex(serverName)); } {code} {code} @Override public void waitForMasterToStop(ServerName serverName, long timeout) throws IOException { // ignore timeout for now waitOnMaster(getMasterIndex(serverName)); } {code} We can implement the timeout in these methods. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
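The timeout-aware wait proposed above can be sketched as a simple deadline loop. This is illustrative only, not the eventual patch: the `BooleanSupplier` stands in for the real liveness check (`waitOnRegionServer` / `waitOnMaster` internals are not modeled), and the class and method names are made up.

```java
import java.util.function.BooleanSupplier;

// Illustrative deadline loop for waitFor{Master|RegionServer}ToStop.
// 'stillRunning' is a hypothetical stand-in for the real liveness check.
class TimeoutWait {
  static boolean waitUntilStopped(BooleanSupplier stillRunning, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (stillRunning.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        return false;                           // timed out: caller decides how to fail
      }
      Thread.sleep(Math.min(100, timeoutMs));   // poll at a small interval
    }
    return true;                                // observed as stopped before the deadline
  }
}
```

A caller could then throw an IOException (or fail the test) when `false` comes back instead of blocking forever.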
[jira] [Commented] (HBASE-8375) Streamline Table durability settings
[ https://issues.apache.org/jira/browse/HBASE-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635482#comment-13635482 ] Lars Hofhansl commented on HBASE-8375: -- Agreed. Streamline Table durability settings Key: HBASE-8375 URL: https://issues.apache.org/jira/browse/HBASE-8375 Project: HBase Issue Type: Sub-task Reporter: Lars Hofhansl HBASE-7801 introduces the notion of per-mutation fine-grained durability settings. This issue is to consider and discuss the same for the per-table setting (i.e. what would be used if the mutation indicates USE_DEFAULT). I propose the following settings per table: * SKIP_WAL (i.e. an unlogged table) * ASYNC_WAL (the current deferred log flush) * SYNC_WAL (the current default) * FSYNC_WAL (for future uses of HDFS' hsync()) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
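The USE_DEFAULT fallback described in the proposal can be sketched as a plain resolution step. The enum values mirror the four proposed table settings; the `resolve()` helper and class name are illustrative, not HBase API.

```java
// Enum values mirror the proposed per-table settings; USE_DEFAULT only makes
// sense on a mutation. resolve() is an illustrative helper, not HBase API.
class DurabilityExample {
  enum Durability { USE_DEFAULT, SKIP_WAL, ASYNC_WAL, SYNC_WAL, FSYNC_WAL }

  // A mutation that says USE_DEFAULT falls back to the table-level setting.
  static Durability resolve(Durability mutationSetting, Durability tableDefault) {
    return mutationSetting == Durability.USE_DEFAULT ? tableDefault : mutationSetting;
  }
}
```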
[jira] [Commented] (HBASE-8365) Duplicated ZK notifications cause Master abort (or other unknown issues)
[ https://issues.apache.org/jira/browse/HBASE-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635488#comment-13635488 ] Jeffrey Zhong commented on HBASE-8365: -- {quote} nodeDataChangeEvent only will give the latest data because it will not be able to read the old data {quote} ZooKeeper intentionally only sends out notifications without passing the original state which triggered the notification. It relies on clients to fetch the latest state. In addition, a ZooKeeper watcher is a one-time trigger, which means it only fires once and the client needs to re-set the watcher on the same znode to get the next notification. In our case, from the log, the related updates with a watcher set on the region are: 1) opening -> opening 2) opening -> failed_open 3) failed_open -> offline 4) offline -> opening The first notification (when we got FAILED_OPEN) is triggered by the update opening -> opening. When the Master got the notification, the znode had already changed to failed_open; that's the first nodeDataChange trace. The thing that puzzles me is that the ZooKeeper watcher is re-set on the failed_open state after receiving the first failed_open, so it should only get more notifications when the failed_open state changes. Yet we still get one more failed_open later from the same znode, and the data has the same version as when we received the first notification. I guess the ZK client may read stale cached data when the node state changes from failed_open -> offline, or race conditions on the ZK side cause the dup notifications. Duplicated ZK notifications cause Master abort (or other unknown issues) Key: HBASE-8365 URL: https://issues.apache.org/jira/browse/HBASE-8365 Project: HBase Issue Type: Bug Affects Versions: 0.94.6 Reporter: Jeffrey Zhong Attachments: TestResult.txt The duplicated ZK notifications should happen in trunk as well. Since the way we handle ZK notifications is different in trunk, we don't see the issue there. I'll explain later.
The issue is causing TestMetaReaderEditor.testRetrying to be flaky with the error message {code}reader: count=2, t=null{code} A related link is at https://builds.apache.org/job/HBase-0.94/941/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/ The test case failure is due to an IllegalStateException; the master is aborted, so the remaining test cases also fail after testRetrying. Below are the steps showing why the issue happens (region fa0e7a5590feb69bd065fbc99c228b36 is the one of interest): 1) Got the first notification event RS_ZK_REGION_FAILED_OPEN at 2013-04-04 17:39:01,197 {code} DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): Handling transition=RS_ZK_REGION_FAILED_OPEN, server=janus.apache.org,42093,1365097126155, region=fa0e7a5590feb69bd065fbc99c228b36{code} In this step, AM tries to open the region on another RS in a separate thread 2) Got a second notification event RS_ZK_REGION_FAILED_OPEN at 2013-04-04 17:39:01,200 {code}DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): Handling transition=RS_ZK_REGION_FAILED_OPEN, server=janus.apache.org,42093,1365097126155, region=fa0e7a5590feb69bd065fbc99c228b36{code} 3) Later got the opening notification event resulting from step 1 at 2013-04-04 17:39:01,288 {code} DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(744): Handling transition=RS_ZK_REGION_OPENING, server=janus.apache.org,54833,1365097126175, region=fa0e7a5590feb69bd065fbc99c228b36{code} In step 2, ClosedRegionHandler throws an IllegalStateException because it cannot transition the region to OFFLINE (the state is OPENING from notification 3) and aborts the Master. This could happen in 0.94 because we handle notifications using an executorService, which opens the door to handling events out of order even though we receive them in the order of the updates. I've confirmed that we don't have duplicated AM listeners and that both events were triggered by the same ZK data of the exact same version. The issue can be reproduced about once when running the testRetrying test case 20 times in a loop.
There are several issues behind the failure: 1) Duplicated ZK notifications. Since a ZK watcher is a one-time trigger, duplicated notifications should not happen from the same data of the same version in the first place 2) ZooKeeper watcher handling is wrong in both 0.94 and trunk, as follows: a) 0.94 handles notifications asynchronously, which may lead to handling them out of the order in which the events happened b) In trunk, we handle ZK notifications synchronously, which slows down other components such as SSH, LogSplitting etc. because we have a single notification queue c) In trunk and 0.94, we could use stale event data because we have a long listener list. The ZK node state could have changed at the time when
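Since both duplicated events above carried the same znode data version, one defensive option is for a listener to remember the last version it handled per znode and drop notifications that do not advance it. A minimal sketch under that assumption (this is not the AssignmentManager's actual logic, and the class name is made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative guard against duplicated ZK notifications: remember the last
// data version handled per znode and skip events that do not advance it.
class VersionDedup {
  private final Map<String, Integer> lastSeen = new ConcurrentHashMap<>();

  /** Returns true if this (znode, version) notification should be processed. */
  boolean shouldHandle(String znode, int dataVersion) {
    Integer prev = lastSeen.get(znode);
    if (prev != null && dataVersion <= prev) {
      return false;            // duplicate or stale notification: drop it
    }
    lastSeen.put(znode, dataVersion);
    return true;
  }
}
```

Note the get/put pair is not atomic; a real implementation on a multi-threaded event path would need a compare-and-set loop or per-znode locking.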
[jira] [Updated] (HBASE-8365) Duplicated ZK notifications cause Master abort (or other unknown issues)
[ https://issues.apache.org/jira/browse/HBASE-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rajeshbabu updated HBASE-8365: -- Attachment: TestZookeeper.txt Recently I have also observed duplicated zk notifications. FYI, attaching logs; these may also be useful for analysis. I have tried to debug but was not able to reproduce it. {code} 2013-04-07 12:14:50,735 INFO [hbase-am-zkevent-worker-pool-20-thread-7] master.RegionStates(264): Region {NAME => 'testLogSplittingAfterMasterRecoveryDueToZKExpiry,,1365336889784.fb4182aef4ce07f011871ae0a083aee0.', STARTKEY => '', ENDKEY => '1', ENCODED => fb4182aef4ce07f011871ae0a083aee0,} transitioned from {testLogSplittingAfterMasterRecoveryDueToZKExpiry,,1365336889784.fb4182aef4ce07f011871ae0a083aee0. state=OPENING, ts=1365336890719, server=asf001.sp2.ygridcore.net,60884,1365336878389} to {testLogSplittingAfterMasterRecoveryDueToZKExpiry,,1365336889784.fb4182aef4ce07f011871ae0a083aee0. state=OPEN, ts=1365336890735, server=asf001.sp2.ygridcore.net,60884,1365336878389} 2013-04-07 12:14:50,735 DEBUG [hbase-am-zkevent-worker-pool-2-thread-20] master.AssignmentManager(740): Handling transition=RS_ZK_REGION_OPENED, server=asf001.sp2.ygridcore.net,60884,1365336878389, region=fb4182aef4ce07f011871ae0a083aee0, current state from region state map = {testLogSplittingAfterMasterRecoveryDueToZKExpiry,,1365336889784.fb4182aef4ce07f011871ae0a083aee0. state=OPEN, ts=1365336890727, server=asf001.sp2.ygridcore.net,60884,1365336878389} 2013-04-07 12:14:50,736 WARN [hbase-am-zkevent-worker-pool-2-thread-20] master.AssignmentManager(934): Received OPENED for region fb4182aef4ce07f011871ae0a083aee0 from server asf001.sp2.ygridcore.net,60884,1365336878389 but region was in the state {testLogSplittingAfterMasterRecoveryDueToZKExpiry,,1365336889784.fb4182aef4ce07f011871ae0a083aee0.
state=OPEN, ts=1365336890727, server=asf001.sp2.ygridcore.net,60884,1365336878389} and not in expected PENDING_OPEN or OPENING states, or not on the expected server {code} Duplicated ZK notifications cause Master abort (or other unknown issues) Key: HBASE-8365 URL: https://issues.apache.org/jira/browse/HBASE-8365 Project: HBase Issue Type: Bug Affects Versions: 0.94.6 Reporter: Jeffrey Zhong Attachments: TestResult.txt, TestZookeeper.txt
[jira] [Commented] (HBASE-7239) Verify protobuf serialization is correctly chunking upon read to avoid direct memory OOMs
[ https://issues.apache.org/jira/browse/HBASE-7239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635531#comment-13635531 ] Hadoop QA commented on HBASE-7239: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12579277/7239-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5349//console This message is automatically generated. Verify protobuf serialization is correctly chunking upon read to avoid direct memory OOMs - Key: HBASE-7239 URL: https://issues.apache.org/jira/browse/HBASE-7239 Project: HBase Issue Type: Sub-task Reporter: Lars Hofhansl Priority: Critical Fix For: 0.95.1 Attachments: 7239-1.patch Result.readFields() used to read from the input stream in 8k chunks to avoid OOM issues with direct memory. 
(Reading variable-sized chunks into direct memory prevents the JVM from reusing the allocated direct memory, and direct memory is only collected during full GCs.) This is just to verify that protobuf's parseFrom-type methods do the right thing as well, so that we do not reintroduce this problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
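The fixed-size chunking that Result.readFields used can be sketched as follows: copying through one small, constant-size buffer lets the JVM reuse a single direct buffer instead of allocating a new variable-sized one per message. The `readFully` helper and class name are illustrative; only the 8 KB chunk size comes from the description above.

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch of fixed-size chunked reads. The 8 KB chunk mirrors the old
// Result.readFields behavior described above; everything else is illustrative.
class ChunkedRead {
  static final int CHUNK = 8 * 1024;

  static byte[] readFully(InputStream in, int len) throws IOException {
    byte[] out = new byte[len];
    byte[] chunk = new byte[CHUNK];        // fixed-size buffer, reused each pass
    int copied = 0;
    while (copied < len) {
      int n = in.read(chunk, 0, Math.min(CHUNK, len - copied));
      if (n < 0) throw new IOException("EOF after " + copied + " bytes");
      System.arraycopy(chunk, 0, out, copied, n);
      copied += n;
    }
    return out;
  }
}
```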
[jira] [Commented] (HBASE-6970) hbase-deamon.sh creates/updates pid file even when that start failed.
[ https://issues.apache.org/jira/browse/HBASE-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635536#comment-13635536 ] Himanshu Vashishtha commented on HBASE-6970: I see, the pid file is deleted in case the normal stop command is used. I was using kill -9 rs_proc; in that case, only cleanZnode is called by the script. I wonder, shouldn't we delete the pid in either case (just like deleting the znode), irrespective of how the rs process died? I don't see any benefit of keeping that file. Did I miss anything? Thanks. hbase-deamon.sh creates/updates pid file even when that start failed. - Key: HBASE-6970 URL: https://issues.apache.org/jira/browse/HBASE-6970 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Nicolas Liochon We just ran into a strange issue where we could neither start nor stop services with hbase-deamon.sh. The problem is this: {code} nohup nice -n $HBASE_NICENESS $HBASE_HOME/bin/hbase \ --config ${HBASE_CONF_DIR} \ $command $@ $startStop > $logout 2>&1 < /dev/null & echo $! > $pid {code} So the pid file is created or updated even when the start of the service failed. The next stop command will then fail, because the pid file has the wrong pid in it. Edit: Spelling and more spelling errors. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
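A guarded variant of the start sequence from the hbase-daemon.sh excerpt could probe the child with `kill -0` before recording its pid. This is a sketch only: `sleep 5` stands in for the real `$HBASE_HOME/bin/hbase ... $startStop` command, and the pid-file path is made up.

```shell
# Sketch: write the pid file only after a liveness check, instead of
# unconditionally. 'sleep 5' stands in for the real hbase start command.
pid="${TMPDIR:-/tmp}/hbase-example.pid"    # hypothetical pid file location
nohup nice -n "${HBASE_NICENESS:-0}" sleep 5 > /dev/null 2>&1 < /dev/null &
child=$!
sleep 1                                    # let an immediately-failing start die first
if kill -0 "$child" 2>/dev/null; then
  echo "$child" > "$pid"                   # start survived: record the pid
else
  echo "start failed; not writing $pid" >&2
fi
```

With this shape, a start that dies right away never overwrites the pid file, so a later stop does not chase the wrong pid.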
[jira] [Commented] (HBASE-6970) hbase-deamon.sh creates/updates pid file even when that start failed.
[ https://issues.apache.org/jira/browse/HBASE-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635562#comment-13635562 ] Nicolas Liochon commented on HBASE-6970: the pid file is supposed to be there if the process is still there, so deleting the znode is not enough: we need to be sure that the process died. hbase-deamon.sh creates/updates pid file even when that start failed. - Key: HBASE-6970 URL: https://issues.apache.org/jira/browse/HBASE-6970 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Nicolas Liochon -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8366) HBaseServer logs the full query.
[ https://issues.apache.org/jira/browse/HBASE-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635584#comment-13635584 ] Andrew Purtell commented on HBASE-8366: --- +1! HBaseServer logs the full query. Key: HBASE-8366 URL: https://issues.apache.org/jira/browse/HBASE-8366 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.95.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.0, 0.95.1 Attachments: 8366.v1.patch We log the query when we have an error. As a result, the logs are not readable when using stuff like multi. As a side note, this is also a security issue (no need to encrypt the network and the storage if the logs contain everything). I'm not removing the full log line here; but just ask and I'll do it :-). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8366) HBaseServer logs the full query.
[ https://issues.apache.org/jira/browse/HBASE-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635595#comment-13635595 ] Himanshu Vashishtha commented on HBASE-8366: +1. Thanks Stack. HBaseServer logs the full query. Key: HBASE-8366 URL: https://issues.apache.org/jira/browse/HBASE-8366 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.95.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.0, 0.95.1 Attachments: 8366.v1.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8352) Rename '.snapshot' directory
[ https://issues.apache.org/jira/browse/HBASE-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635675#comment-13635675 ] Tsz Wo (Nicholas), SZE commented on HBASE-8352: --- Thanks, everyone. You guys have done a great job! Rename '.snapshot' directory Key: HBASE-8352 URL: https://issues.apache.org/jira/browse/HBASE-8352 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Blocker Fix For: 0.98.0, 0.94.7, 0.95.1 Attachments: 8352-0.94-v1.txt, 8352-0.94-v2.txt, 8352-0.94-v3.txt, 8352-0.94-v4.txt, 8352-trunk.txt, 8352-trunk-v2.txt, 8352-trunk-v3.txt, 8352-trunk-v4.txt, 8352-trunk-v5.txt, 8352-trunk-v6.txt Testing HBase Snapshot on top of Hadoop's Snapshot branch (http://svn.apache.org/viewvc/hadoop/common/branches/HDFS-2802/), we found that both features used a '.snapshot' directory to store metadata. HDFS (built from the HDFS-2802 branch) doesn't allow paths with .snapshot as a component. From the discussion on d...@hbase.apache.org (see http://search-hadoop.com/m/kY6C3cXMs51), the consensus was to rename the '.snapshot' directory in HBase so that both features can co-exist smoothly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635693#comment-13635693 ] Enis Soztutar commented on HBASE-8374: -- Nice bug. Agree that serversToIndex is not populated first. Also, RegionLocationFinder might return region locations that we do not know about (the RS might have died, and we could be caching the data, etc). We should still guard against serversToIndex.get(loc.get(i)) returning null. For the patch, we should not use boxed primitives (for regionLocations = new int[numRegions][];). We can use -1 to indicate a null value. NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt, 8374-trunk-v2.txt, 8374-trunk-v3.txt, 8374-trunk-v4.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. When it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think it should be enough. Now, looking at the code, the NPE is strange.
{noformat} 2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception java.lang.NullPointerException at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.<init>(BaseLoadBalancer.java:145) at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194) at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295) at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48) at org.apache.hadoop.hbase.Chore.run(Chore.java:81) at java.lang.Thread.run(Thread.java:662) 2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key '' {noformat} {code} if (regionFinder != null) { // region location List<ServerName> loc = regionFinder.getTopBlockLocations(region); regionLocations[regionIndex] = new int[loc.size()]; for (int i = 0; i < loc.size(); i++) { regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); // <= NPE here } } {code} pinging [~enis], just in case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
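Enis's suggestion of storing -1 instead of unboxing a possibly-null Integer can be sketched like this. The class and method names are illustrative (not the actual patch), and String stands in for ServerName to keep the sketch self-contained.

```java
import java.util.List;
import java.util.Map;

// Illustrative null-guard for the NPE above: when a location's server is not
// in serversToIndex (dead RS, stale cached locations), record -1 instead of
// unboxing a null Integer.
class LocationIndexExample {
  static int[] toIndexes(List<String> loc, Map<String, Integer> serversToIndex) {
    int[] indexes = new int[loc.size()];
    for (int i = 0; i < loc.size(); i++) {
      Integer idx = serversToIndex.get(loc.get(i));
      indexes[i] = (idx == null) ? -1 : idx;   // -1 marks an unknown server
    }
    return indexes;
  }
}
```

Downstream code then checks for the -1 sentinel rather than relying on every location being a known, live server.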
[jira] [Updated] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-8374: -- Attachment: 8374-trunk-v5.txt Patch v5 changes regionLocations back to int array. NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt, 8374-trunk-v2.txt, 8374-trunk-v3.txt, 8374-trunk-v4.txt, 8374-trunk-v5.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8374) NPE when launching the balance
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-8374: -- Attachment: (was: 8374-trunk-v5.txt) NPE when launching the balance -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt, 8374-trunk-v2.txt, 8374-trunk-v3.txt, 8374-trunk-v4.txt, 8374-trunk-v5.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8374) NPE when launching the balancer
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-8374: -- Attachment: 8374-trunk-v5.txt NPE when launching the balancer -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt, 8374-trunk-v2.txt, 8374-trunk-v3.txt, 8374-trunk-v4.txt, 8374-trunk-v5.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. Once it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think that should be enough. Now, looking at the code, the NPE is strange. {noformat}
2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception
java.lang.NullPointerException
	at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.<init>(BaseLoadBalancer.java:145)
	at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194)
	at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295)
	at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48)
	at org.apache.hadoop.hbase.Chore.run(Chore.java:81)
	at java.lang.Thread.run(Thread.java:662)
2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key ''
{noformat} {code}
if (regionFinder != null) {
  // region location
  List<ServerName> loc = regionFinder.getTopBlockLocations(region);
  regionLocations[regionIndex] = new int[loc.size()];
  for (int i = 0; i < loc.size(); i++) {
    regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); // <= NPE here
  }
}
{code} pinging [~enis], just in case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8369) MapReduce over snapshot files
[ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635708#comment-13635708 ] Enis Soztutar commented on HBASE-8369: -- bq. So would this completely bypass security? Underlying hFiles are owned by the hbase user. For reading the files from MR tasks, a couple of options come to my mind: (1) open the files directly from hdfs, in which case the user has to be in the same group and have group permissions to read the files, or the user has to be the hbase user. Similar to current SSR. (2) have HBase servers open the files, and pass the file descriptors to the MR job, similar to the approach in HDFS-347. This is obviously more involved and requires a live HBase cluster. (3) Copy snapshot files as a different user. This will only be applicable to exported snapshots. Copying data for in-place snapshots would be costly. Any other ideas? MapReduce over snapshot files - Key: HBASE-8369 URL: https://issues.apache.org/jira/browse/HBASE-8369 Project: HBase Issue Type: New Feature Components: mapreduce, snapshots Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.0, 0.95.2 Attachments: hbase-8369_v0.patch The idea is to add an InputFormat, which can run the mapreduce job over snapshot files directly, bypassing the hbase server layer. The IF is similar in usage to TableInputFormat, taking a Scan object from the user, but instead of running from an online table, it runs from a table snapshot. We do one split per region in the snapshot, and open an HRegion inside the RecordReader. A RegionScanner is used internally for doing the scan without any HRegionServer bits. Users have been asking and searching for ways to run MR jobs by reading directly from hfiles, so this allows new use cases if reading from stale data is ok: - Take snapshots periodically, and run MR jobs only on snapshots. - Export snapshots to a remote hdfs cluster, run the MR jobs at that cluster without an HBase cluster. 
- (Future use case) Combine snapshot data with online hbase data: Scan from yesterday's snapshot, but read today's data from online hbase cluster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
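The one-split-per-region design described above can be sketched generically. Everything here is an illustrative stand-in (the class and method names are hypothetical, not the HBase API): the point is only that the split count is determined by the snapshot's region list, so each RecordReader can open one region's files and scan it without a region server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "one input split per region in the snapshot".
public class SnapshotSplitSketch {
    // Stand-in for a region's [startKey, endKey) range from the snapshot manifest.
    static class RegionSlice {
        final String startKey;
        final String endKey;
        RegionSlice(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Mirrors an InputFormat's getSplits(): emit exactly one split per region,
    // so split count == region count and each reader scans one region locally.
    static List<String> splitsForSnapshot(List<RegionSlice> regions) {
        List<String> splits = new ArrayList<>();
        for (RegionSlice r : regions) {
            splits.add("split[" + r.startKey + "," + r.endKey + ")");
        }
        return splits;
    }

    public static void main(String[] args) {
        List<RegionSlice> regions = new ArrayList<>();
        regions.add(new RegionSlice("", "bbb"));
        regions.add(new RegionSlice("bbb", "mmm"));
        regions.add(new RegionSlice("mmm", ""));
        System.out.println(splitsForSnapshot(regions));
    }
}
```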
[jira] [Commented] (HBASE-8374) NPE when launching the balancer
[ https://issues.apache.org/jira/browse/HBASE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635709#comment-13635709 ] Enis Soztutar commented on HBASE-8374: -- lgtm. Thanks Ted for taking this on. Nicolas, any chance you can try this with the cluster? NPE when launching the balancer -- Key: HBASE-8374 URL: https://issues.apache.org/jira/browse/HBASE-8374 Project: HBase Issue Type: Bug Components: Balancer Affects Versions: 0.95.0 Environment: AWS / real cluster with 3 nodes + master Reporter: Nicolas Liochon Assignee: Ted Yu Fix For: 0.98.0, 0.95.1 Attachments: 8374-trunk.txt, 8374-trunk-v2.txt, 8374-trunk-v3.txt, 8374-trunk-v4.txt, 8374-trunk-v5.txt I don't reproduce this all the time, but I had it on a fairly clean env. It occurs every 5 minutes (i.e. the balancer period). Impact is severe: the balancer does not run. Once it starts to occur, it occurs all the time. I haven't tried to restart the master, but I think that should be enough. Now, looking at the code, the NPE is strange. {noformat}
2013-04-18 08:09:52,079 ERROR [box,6,1366281581983-BalancerChore] org.apache.hadoop.hbase.master.balancer.BalancerChore: Caught exception
java.lang.NullPointerException
	at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.<init>(BaseLoadBalancer.java:145)
	at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:194)
	at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1295)
	at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48)
	at org.apache.hadoop.hbase.Chore.run(Chore.java:81)
	at java.lang.Thread.run(Thread.java:662)
2013-04-18 08:09:52,103 DEBUG [box,6,1366281581983-CatalogJanitor] org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key ''
{noformat} {code}
if (regionFinder != null) {
  // region location
  List<ServerName> loc = regionFinder.getTopBlockLocations(region);
  regionLocations[regionIndex] = new int[loc.size()];
  for (int i = 0; i < loc.size(); i++) {
    regionLocations[regionIndex][i] = serversToIndex.get(loc.get(i)); // <= NPE here
  }
}
{code} pinging [~enis], just in case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-8377) IntegrationTestBigLinkedList calculates wrap for linked list size incorrectly
Enis Soztutar created HBASE-8377: Summary: IntegrationTestBigLinkedList calculates wrap for linked list size incorrectly Key: HBASE-8377 URL: https://issues.apache.org/jira/browse/HBASE-8377 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.0, 0.94.8, 0.95.1 There is a bug in IntegrationTestBigLinkedList: it reads the wrong config key to calculate the wrap size for the linked list. It uses num mappers instead of num records per mapper. This has not been caught before, because it causes the test to fail only if 1M is not divisible by num mappers. So launching the job with num mappers 1, 2, 4, or 5 would succeed, while 6 will fail, etc. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
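The masking effect described above is simple divisibility: the list wraps cleanly only when the node count is a multiple of the (mis-read) wrap width, so mapper counts that happen to divide 1M hide the bug. A quick stand-alone check (the numbers mirror the examples in the report; the method name is illustrative, not from the test code):

```java
public class WrapDivisibilityDemo {
    // The linked list wraps cleanly only when totalNodes is a multiple of wrapWidth.
    static boolean wrapsCleanly(long totalNodes, long wrapWidth) {
        return totalNodes % wrapWidth == 0;
    }

    public static void main(String[] args) {
        long total = 1_000_000L; // the 1M figure from the report
        // 1, 2, 4, 5 divide 1M and mask the bug; 6 does not and exposes it.
        for (long mappers : new long[] {1, 2, 4, 5, 6}) {
            System.out.println("num mappers " + mappers + " -> clean wrap: "
                + wrapsCleanly(total, mappers));
        }
    }
}
```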
[jira] [Updated] (HBASE-8377) IntegrationTestBigLinkedList calculates wrap for linked list size incorrectly
[ https://issues.apache.org/jira/browse/HBASE-8377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated HBASE-8377: - Attachment: hbase-8377_v1.patch Simple patch. IntegrationTestBigLinkedList calculates wrap for linked list size incorrectly - Key: HBASE-8377 URL: https://issues.apache.org/jira/browse/HBASE-8377 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.0, 0.94.8, 0.95.1 Attachments: hbase-8377_v1.patch There is a bug in IntegrationTestBigLinkedList: it reads the wrong config key to calculate the wrap size for the linked list. It uses num mappers instead of num records per mapper. This has not been caught before, because it causes the test to fail only if 1M is not divisible by num mappers. So launching the job with num mappers 1, 2, 4, or 5 would succeed, while 6 will fail, etc. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8377) IntegrationTestBigLinkedList calculates wrap for linked list size incorrectly
[ https://issues.apache.org/jira/browse/HBASE-8377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated HBASE-8377: - Status: Patch Available (was: Open) IntegrationTestBigLinkedList calculates wrap for linked list size incorrectly - Key: HBASE-8377 URL: https://issues.apache.org/jira/browse/HBASE-8377 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.0, 0.94.8, 0.95.1 Attachments: hbase-8377_v1.patch There is a bug in IntegrationTestBigLinkedList: it reads the wrong config key to calculate the wrap size for the linked list. It uses num mappers instead of num records per mapper. This has not been caught before, because it causes the test to fail only if 1M is not divisible by num mappers. So launching the job with num mappers 1, 2, 4, or 5 would succeed, while 6 will fail, etc. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira