[ https://issues.apache.org/jira/browse/KUDU-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
HeLifu updated KUDU-2892:
-------------------------
    Description:
On one of our production clusters, a tserver crashed yesterday morning while dropping a range partition. The error messages are below:
{code:java}
Log file created at: 2019/07/11 01:51:30
Running on machine: kudu31.jd.163.org
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0711 01:51:30.331185 11840 env_posix.cc:316] I/O error, context: /mnt/dfs/0/kudu/tserver/data/data/9305dce18e6f4100b486b605617122b3.data
E0711 01:51:30.337604 11840 data_dirs.cc:1120] Directory /mnt/dfs/0/kudu/tserver/data/data marked as failed
F0711 04:00:51.835958 68948 ts_tablet_manager.cc:940] Failed to delete tablet data for 2278f736bf6548e2b773003c1ba7ed66: Invalid argument: Unable to delete on-disk data from tablet 2278f736bf6548e2b773003c1ba7ed66: The metadata for tablet 2278f736bf6548e2b773003c1ba7ed66 still references orphaned blocks. Call DeleteTabletData() first
{code}
It seems that new orphaned blocks that were not deleted caused this problem after a disk was marked as failed. I attached an INFO log file covering tablet '2278f736bf6548e2b773003c1ba7ed66'.

For brevity, here is a quick summary of the timeline:
# 01:51:30.331185: bad disk /mnt/dfs/0 was detected
# 01:51:30.344581: failing tablet
# 01:51:30.870059: Initiating tablet copy
# 04:00:51.820354: Processing DeleteTablet
# 04:00:51.835958: Crashed.
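The timeline above can be reconstructed from the log by parsing the glog prefix quoted in the excerpt ("[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg"). The following is a minimal sketch in Python, not part of Kudu: the names GLOG_RE and parse_glog_line are hypothetical, and the year is taken from the "Log file created at" header because glog lines do not carry one.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of Kudu. Matches the glog prefix:
# severity letter, mmdd, hh:mm:ss.uuuuuu, thread id, file:line] msg
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^:\s]+:\d+)\] (?P<msg>.*)$"
)


def parse_glog_line(line, year):
    """Return a dict for one glog line, or None if the line does not match."""
    m = GLOG_RE.match(line)
    if m is None:
        return None
    # glog omits the year, so the caller supplies it.
    ts = datetime.strptime(f"{year}{m['mmdd']} {m['time']}", "%Y%m%d %H:%M:%S.%f")
    return {
        "severity": m["sev"],
        "time": ts,
        "thread": int(m["tid"]),
        "source": m["src"],
        "message": m["msg"],
    }


# Two lines copied from the report: the disk failure and the fatal crash.
lines = [
    "E0711 01:51:30.337604 11840 data_dirs.cc:1120] Directory "
    "/mnt/dfs/0/kudu/tserver/data/data marked as failed",
    "F0711 04:00:51.835958 68948 ts_tablet_manager.cc:940] Failed to delete "
    "tablet data for 2278f736bf6548e2b773003c1ba7ed66: Invalid argument: "
    "Unable to delete on-disk data from tablet 2278f736bf6548e2b773003c1ba7ed66: "
    "The metadata for tablet 2278f736bf6548e2b773003c1ba7ed66 still references "
    "orphaned blocks. Call DeleteTabletData() first",
]

records = [parse_glog_line(line, year=2019) for line in lines]
gap = records[1]["time"] - records[0]["time"]
print(f"{records[0]['source']} -> {records[1]['source']}: {gap}")
```

Run against the two lines above, this shows that a little over two hours passed between the directory being marked as failed and the fatal DeleteTablet error.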
> tserver crashed while dropping range partition
> ----------------------------------------------
>
>                 Key: KUDU-2892
>                 URL: https://issues.apache.org/jira/browse/KUDU-2892
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.9.0
>            Reporter: HeLifu
>            Priority: Major
>         Attachments: tserver-INFO.log