[ https://issues.apache.org/jira/browse/PHOENIX-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069927#comment-16069927 ]
Hudson commented on PHOENIX-3964: --------------------------------- FAILURE: Integrated in Jenkins build Phoenix-master #1671 (See [https://builds.apache.org/job/Phoenix-master/1671/]) PHOENIX-3964 Index.preWALRestore should handle index write failure (chenglei: rev 5c43b347a4cd2c3c38b28bee2983e716460e50db) * (edit) phoenix-core/src/main/java/org/apache/phoenix/hbase/index/Indexer.java > Index.preWALRestore should handle index write failure > ----------------------------------------------------- > > Key: PHOENIX-3964 > URL: https://issues.apache.org/jira/browse/PHOENIX-3964 > Project: Phoenix > Issue Type: Bug > Affects Versions: 4.10.0 > Reporter: chenglei > Assignee: chenglei > Fix For: 4.12.0, 4.11.1 > > Attachments: PHOENIX-3964_v1.patch > > > When I restarted my hbase cluster a certain time, I noticed some regions are > in FAILED_OPEN state and the RegionServers have some error logs as following: > {code:java} > 2017-06-20 12:31:30,493 ERROR [RS_OPEN_REGION-rsync:60020-3] > handler.OpenRegionHandler: Failed open of > region=BIZARCH_NS_PRODUCT.BIZTRACER_SPAN,0100134e-7ddf-4d13-ab77-6f0d683e5fad_0,1487594358223.57a7be72f9beaeb4285529ac6236f4e5., > starting to roll back the global memstore size. > org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException: > Failed to write to multiple index tables > at > org.apache.phoenix.hbase.index.write.recovery.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:221) > at > org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:185) > at > org.apache.phoenix.hbase.index.write.RecoveryIndexWriter.write(RecoveryIndexWriter.java:75) > at > org.apache.phoenix.hbase.index.Indexer.preWALRestore(Indexer.java:554) > at > org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$58.call(RegionCoprocessorHost.java:1312) > at > org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1517) > at > org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1592) > at > org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1549) > at > org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preWALRestore(RegionCoprocessorHost.java:1308) > at > org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:3338) > at > org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3220) > at > org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:823) > at > org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:716) > at > org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:687) > at > org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4596) > at > org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4566) > at > org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4538) > at > org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4494) > at > org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4445) > at > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:465) > at > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:139) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > When I look the code of Index.preWALRestore method, I find > RecoveryIndexWriter.write method is used to write the indexUpdates in > following line 543: > > {code:java} > > 526 public void preWALRestore(ObserverContext<RegionCoprocessorEnvironment> > env, HRegionInfo info, > 527 HLogKey logKey, WALEdit logEdit) throws IOException { > 528 if (this.disabled) { > 529 super.preWALRestore(env, info, logKey, logEdit); > 530 return; > 531 } > 532 // TODO check the regions in transition. If the server on which the > region lives is this one, > 533 // then we should rety that write later in postOpen. > 534 // we might be able to get even smarter here and pre-split the edits > that are server-local > 535 // into their own recovered.edits file. This then lets us do a > straightforward recovery of each > 536 // region (and more efficiently as we aren't writing quite as > hectically from this one place). > 537 > 538 /* > 539 * Basically, we let the index regions recover for a little while long > before retrying in the > 540 * hopes they come up before the primary table finishes. > 541 */ > 542 Collection<Pair<Mutation, byte[]>> indexUpdates = > extractIndexUpdate(logEdit); > 543 recoveryWriter.write(indexUpdates, true); > 544 } > {code} > but the RecoveryIndexWriter.write method is as following, it directly throws > Exception except non-existing tables, so RecoveryIndexWriter's > failurePolicy(which is StoreFailuresInCachePolicy by default) even has no > opportunity to be used, and it leads to Index.failedIndexEdits which is > filled by the StoreFailuresInCachePolicy is always empty. > {code:java} > public void write(Collection<Pair<Mutation, byte[]>> toWrite, boolean > allowLocalUpdates) throws IOException { > try { > write(resolveTableReferences(toWrite), allowLocalUpdates); > } catch (MultiIndexWriteFailureException e) { > for (HTableInterfaceReference table : e.getFailedTables()) { > if (!admin.tableExists(table.getTableName())) { > LOG.warn("Failure due to non existing table: " + > table.getTableName()); > nonExistingTablesList.add(table); > } else { > throw e; > } > } > } > } > {code} > So the Index.postOpen method seems useless, because the updates Multimap in > following 500 line which is got from Index.failedIndexEdits is always empty. > {code:java} > 499 public void postOpen(final ObserverContext<RegionCoprocessorEnvironment> > c) { > 500 Multimap<HTableInterfaceReference, Mutation> updates = > failedIndexEdits.getEdits(c.getEnvironment().getRegion()); > 501 > 502 if (this.disabled) { > 503 super.postOpen(c); > 504 return; > 505 } > 506 > 507 //if we have no pending edits to complete, then we are done > 508 if (updates == null || updates.size() == 0) { > 509 return; > 510 } > 511 > 512 LOG.info("Found some outstanding index updates that didn't succeed > during" > 513 + " WAL replay - attempting to replay now."); > 514 > 515 // do the usual writer stuff, killing the server again, if we can't > manage to make the index > 516 // writes succeed again > 517 try { > 518 writer.writeAndKillYourselfOnFailure(updates, true); > 519 } catch (IOException e) { > 520 LOG.error("During WAL replay of outstanding index updates, " > 521 + "Exception is thrown instead of killing server > during index writing", e); > 522 } > 523 } > {code} > So I think in Index.preWALRestore method, we should use > RecoveryWriter.writeAndKillYourselfOnFailure method to write the indexUpdates > and handle index write failure, not just use the RecoveryIndexWriter.write > method. > -- This message was sent by Atlassian JIRA (v6.4.14#64029)