[ https://issues.apache.org/jira/browse/KYLIN-4153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925703#comment-16925703 ]
ASF subversion and git services commented on KYLIN-4153: -------------------------------------------------------- Commit 7e117e27764dc94cd627b0bd3dc4f4bbbf7f4a3e in kylin's branch refs/heads/master from XiaoxiangYu [ https://gitbox.apache.org/repos/asf?p=kylin.git;h=7e117e2 ] KYLIN-4153 Delete marker if real file not exists > Failed to read big resource /dict/xxxx at "Build Dimension Dictionary" Step > ---------------------------------------------------------------------------- > > Key: KYLIN-4153 > URL: https://issues.apache.org/jira/browse/KYLIN-4153 > Project: Kylin > Issue Type: Bug > Components: Metadata > Affects Versions: v2.6.0 > Reporter: Xiaoxiang Yu > Assignee: Xiaoxiang Yu > Priority: Major > > At the version of *Kylin 2.6.0*, kylin team has introduce an important > refactor of Kylin's Metadata Store, which add a lot of enhancement such as > upload/download metadata concurrently, store metadata with JDBC etc. Please > refer to https://issues.apache.org/jira/browse/KYLIN-3671 for detail. > > When kylin want to save a *big resource*(such as dict or snapshot) into > metadata store, it won't store it into metadata store(HBase or RDBMS) > directly. Instead, kylin will first {color:red}save it into HDFS(Step > 1){color}, and then {color:red}write a empty byte array as marker into > metadata store(Step 2) {color}. If first action succeed and second action > failed, a rollback method will be called to revert modification for HDFS > files. We could regard it as a complete and atomic transaction. > > {color:#0747A6}Here is part of the source code added in KYLIN-3671.{color} > Check it at > https://github.com/apache/kylin/blob/8737bc1f555a2789a67462c8f8420b6ab3be97ce/core-common/src/main/java/org/apache/kylin/common/persistence/PushdownResourceStore.java#L58 > . > {code:java} > final void putBigResource(String resPath, ContentWriter content, long newTS) > throws IOException { > // pushdown the big resource to DFS file > RollbackablePushdown pushdown = writePushdown(resPath, content); // Step > 1: write big resource into HDFS > try { > // write a marker in resource store, to indicate the resource is now > available > logger.debug("Writing marker for big resource {}", resPath); > putResourceWithRetry(resPath, > ContentWriter.create(BytesUtil.EMPTY_BYTE_ARRAY), newTS); // Step 2: write > marker into HBase/RDBMS > } catch (Throwable ex) { > pushdown.rollback(); > throw ex; > } finally { > pushdown.close(); > } > } > {code} > > > > But in some case, both step 1 and step 2 succeed but an exception still > throwed in step 2,{color:red} the rollback won't clear marker written in Step > 2{color}, which break the atomicity of this put action, thus cause the > FileNotFoundException when Kylin want to read that dict later. > > > > {color:#0747A6}Here is part of reporter's kylin.log of incomplete rollback > action.{color} > > > {noformat} > 2019-08-29 05:13:51,237 INFO [Scheduler 169045403 Job > ca4a4a08-54e2-b922-70bb-2aa2bf58709f-492] dict.DictionaryManager:388 : Saving > dictionary at > /dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict > 2019-08-29 05:13:51,238 DEBUG [Scheduler 169045403 Job > ca4a4a08-54e2-b922-70bb-2aa2bf58709f-492] persistence.HDFSResourceStore:98 : > Writing pushdown file > /kylin/kylin_metadata/resources/dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict.temp.-1798610090 > 2019-08-29 05:13:51,256 DEBUG [Scheduler 169045403 Job > ca4a4a08-54e2-b922-70bb-2aa2bf58709f-492] persistence.HDFSResourceStore:117 : > Move > /kylin/kylin_metadata/resources/dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict.temp.-1798610090 > to > /kylin/kylin_metadata/resources/dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict > 2019-08-29 05:13:51,258 DEBUG [Scheduler 169045403 Job > ca4a4a08-54e2-b922-70bb-2aa2bf58709f-492] persistence.HDFSResourceStore:65 : > Writing marker for big resource > /dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict > 2019-08-29 05:13:56,263 WARN > [hconnection-0x56f3258e-shared--pool10944-t54867] client.AsyncProcess:1263 : > #10545, table=kylin_metadata, attempt=1/1 failed=1ops, last exception: > java.io.IOException: Call to tx-dn41.data/10.14.243.51:60020 failed on local > exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=2662317, > waitTime=5001, operationTimeout=5000 expired. on > tx-dn41.data,60020,1565943919204, tracking started Thu Aug 29 05:13:51 > GMT+08:00 2019; not retrying 1 - final failure > 2019-08-29 05:13:56,266 ERROR [Scheduler 169045403 Job > ca4a4a08-54e2-b922-70bb-2aa2bf58709f-492] persistence.HDFSResourceStore:134 : > Rollback > /kylin/kylin_metadata/resources/dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict > from <empty> > 2019-08-29 05:13:56,274 ERROR [Scheduler 169045403 Job > ca4a4a08-54e2-b922-70bb-2aa2bf58709f-492] common.HadoopShellExecutable:65 : > error execute > HadoopShellExecutable{id=ca4a4a08-54e2-b922-70bb-2aa2bf58709f-03, name=Build > Dimension Dictionary, state=RUNNING} > 2019-08-29 05:13:56,274 INFO [Scheduler 169045403 Job > ca4a4a08-54e2-b922-70bb-2aa2bf58709f-492] execution.AbstractExecutable:162 : > Retry 1 > {noformat} > > > > > {color:#0747A6}Here is part of reporter's kylin.log of reading a non-exist > dict in HDFS in "Build Dimension Dictionary" Step. > {color} > > {noformat} > 2019-08-29 14:54:59,602 INFO [Scheduler 343338459 Job > af4b847d-afa6-3729-4c19-03a5db08447b-498] steps.CreateDictionaryJob:110 : > DictionaryProvider read dict from file: > hdfs://CDH-cluster-main/kylin/kylin_metadata/kylin-af4b847d-afa6-3729-4c19-03a5db08447b/209_new_device/fact_distinct_columns/USER_SECRET_TABLE.COUNTRY/COUNTRY.rldict-r-00004 > 2019-08-29 14:54:59,602 DEBUG [Scheduler 343338459 Job > af4b847d-afa6-3729-4c19-03a5db08447b-498] cli.DictionaryGeneratorCLI:73 : > Dict for 'COUNTRY' has already been built, save it > 2019-08-29 14:54:59,720 ERROR [Scheduler 343338459 Job > af4b847d-afa6-3729-4c19-03a5db08447b-498] persistence.ResourceStore:233 : > Error reading resource > /dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict > java.io.IOException: Failed to read big resource > /dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict > at > org.apache.kylin.common.persistence.PushdownResourceStore.openPushdown(PushdownResourceStore.java:176) > at > org.apache.kylin.storage.hbase.HBaseResourceStore.getInputStream(HBaseResourceStore.java:256) > at > org.apache.kylin.storage.hbase.HBaseResourceStore.rawResource(HBaseResourceStore.java:226) > at > org.apache.kylin.storage.hbase.HBaseResourceStore.access$000(HBaseResourceStore.java:64) > at > org.apache.kylin.storage.hbase.HBaseResourceStore$1.visit(HBaseResourceStore.java:159) > at > org.apache.kylin.storage.hbase.HBaseResourceStore.visitFolder(HBaseResourceStore.java:204) > at > org.apache.kylin.storage.hbase.HBaseResourceStore.visitFolderImpl(HBaseResourceStore.java:152) > at > org.apache.kylin.common.persistence.ResourceStore.visitFolderInner(ResourceStore.java:689) > at > org.apache.kylin.common.persistence.ResourceStore.visitFolderAndContent(ResourceStore.java:675) > at > org.apache.kylin.common.persistence.ResourceStore$2.call(ResourceStore.java:224) > at > org.apache.kylin.common.persistence.ResourceStore$2.call(ResourceStore.java:220) > at > org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52) > at > org.apache.kylin.common.persistence.ResourceStore.getAllResources(ResourceStore.java:220) > at > org.apache.kylin.common.persistence.ResourceStore.getAllResources(ResourceStore.java:209) > at > org.apache.kylin.dict.DictionaryManager.checkDupByInfo(DictionaryManager.java:334) > at > org.apache.kylin.dict.DictionaryManager.saveDictionary(DictionaryManager.java:314) > at > org.apache.kylin.cube.CubeManager$DictionaryAssist.saveDictionary(CubeManager.java:1127) > at > org.apache.kylin.cube.CubeManager.saveDictionary(CubeManager.java:1089) > at > org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:74) > at > org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:55) > at > org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(CreateDictionaryJob.java:73) > at org.apache.kylin.engine.mr.MRUtil.runMRJob(MRUtil.java:93) > at > org.apache.kylin.engine.mr.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:63) > at > org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167) > at > org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71) > at > org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167) > at > org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.FileNotFoundException: > /kylin/kylin_metadata/resources/dict/KYLIN_VIEW.USER_SECRET_TABLE/COUNTRY/66292068-e8eb-975a-3e44-b56c933c14cc.dict > (FS: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-784809092_27, > ugi=kylin (auth:SIMPLE)]]) > at > org.apache.kylin.common.persistence.PushdownResourceStore.openPushdown(PushdownResourceStore.java:173) > ... 29 more > {noformat} > > This often happen in Build Step 4: {color:#0747A6}Build Dimension > Dictionary{color}. And this incomplete metadata entry will cause same > failure(*_FileNotFoundException_*) of {color:#DE350B}*ALL*{color} following > cube rebuild job. > > As far as I can see, my *{color:#0747A6}workaround{color}* should be delete > that marker. Since this is a broken metadata entry, deletion won't make > damage. After the deletion, following rebuilt job will succeed. > > This is some related report mail : > 1. > http://apache-kylin.74782.x6.nabble.com/How-to-repair-the-cube-that-it-lost-someone-dictionary-td12989.html > 2. > http://mail-archives.apache.org/mod_mbox/kylin-user/201908.mbox/%3c4bcca64e.4af8.16cdb473a62.coremail.itzhangqi...@163.com%3e > -- This message was sent by Atlassian Jira (v8.3.2#803003)