[ https://issues.apache.org/jira/browse/KUDU-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

HeLifu updated KUDU-2892:
-------------------------
    Description: 
On one of our production clusters, a tserver crashed yesterday morning while 
dropping a range partition. The error messages are below:
{code:java}
Log file created at: 2019/07/11 01:51:30
Running on machine: kudu31.jd.163.org
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0711 01:51:30.331185 11840 env_posix.cc:316] I/O error, context: 
/mnt/dfs/0/kudu/tserver/data/data/9305dce18e6f4100b486b605617122b3.data
E0711 01:51:30.337604 11840 data_dirs.cc:1120] Directory 
/mnt/dfs/0/kudu/tserver/data/data marked as failed
F0711 04:00:51.835958 68948 ts_tablet_manager.cc:940] Failed to delete tablet 
data for 2278f736bf6548e2b773003c1ba7ed66: Invalid argument: Unable to delete 
on-disk data from tablet 2278f736bf6548e2b773003c1ba7ed66: The metadata for 
tablet 2278f736bf6548e2b773003c1ba7ed66 still references orphaned blocks. Call 
DeleteTabletData() first
{code}
It seems that, after a disk was marked as failed, new orphaned blocks that 
were never deleted caused this problem. I have attached an INFO log file 
covering tablet '2278f736bf6548e2b773003c1ba7ed66'.

For brevity, here is a quick summary of the timeline:
 # 01:51:30.331185: bad disk /mnt/dfs/0 was detected
 # 01:51:30.344581: failing tablet
 # 01:51:30.870059: Initiating tablet copy
 # 04:00:51.820354: Processing DeleteTablet
 # 04:00:51.835958: Crashed.
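
The timeline above can be reproduced from the log excerpt with a small script. The following is only a minimal sketch: the function name parse_glog_line is hypothetical, the regular expression is written against the "Log line format" header shown in the excerpt, and the year is passed in explicitly (taken here from the "Log file created at" line) because the log prefix itself carries no year. Log lines that the email wrapped are joined back into single lines, and the messages are shortened.

```python
import re
from datetime import datetime

# Prefix layout taken from the excerpt's header:
#   [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^:\s]+:\d+)\] (?P<msg>.*)$"
)


def parse_glog_line(line, year):
    """Parse one log line; return a dict, or None if the line does not match.

    The year is an assumption supplied by the caller, since the prefix
    only records month and day.
    """
    m = GLOG_RE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(
        f"{year}{m['mmdd']} {m['time']}", "%Y%m%d %H:%M:%S.%f"
    )
    return {
        "severity": m["sev"],
        "time": stamp,
        "thread": int(m["tid"]),
        "source": m["src"],
        "message": m["msg"],
    }


# Two lines from the excerpt, unwrapped and with shortened messages.
disk_error = parse_glog_line(
    "E0711 01:51:30.331185 11840 env_posix.cc:316] I/O error, context: "
    "/mnt/dfs/0/kudu/tserver/data/data/9305dce18e6f4100b486b605617122b3.data",
    2019,
)
crash = parse_glog_line(
    "F0711 04:00:51.835958 68948 ts_tablet_manager.cc:940] Failed to delete "
    "tablet data for 2278f736bf6548e2b773003c1ba7ed66",
    2019,
)

# Time between the first I/O error and the fatal message.
elapsed = crash["time"] - disk_error["time"]
print(elapsed)
```

Run against the two lines above, this gives the gap between the first I/O error and the fatal message, which makes it easier to see that the crash happened long after the disk was marked as failed.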

 

  was:
On one of our production clusters, a tserver crashed yesterday morning while 
dropping a range partition, and below is error-msg:

 
{code:java}
Log file created at: 2019/07/11 01:51:30
Running on machine: kudu31.jd.163.org
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0711 01:51:30.331185 11840 env_posix.cc:316] I/O error, context: 
/mnt/dfs/0/kudu/tserver/data/data/9305dce18e6f4100b486b605617122b3.data
E0711 01:51:30.337604 11840 data_dirs.cc:1120] Directory 
/mnt/dfs/0/kudu/tserver/data/data marked as failed
F0711 04:00:51.835958 68948 ts_tablet_manager.cc:940] Failed to delete tablet 
data for 2278f736bf6548e2b773003c1ba7ed66: Invalid argument: Unable to delete 
on-disk data from tablet 2278f736bf6548e2b773003c1ba7ed66: The metadata for 
tablet 2278f736bf6548e2b773003c1ba7ed66 still references orphaned blocks. Call 
DeleteTabletData() first
{code}
It seems the new orphan blocks that were not deleted caused this problem after 
a disk was marked as bad. I have attached an info-msg file about tablet 
'2278f736bf6548e2b773003c1ba7ed66'.  For brevity, let me make a quick 
generalization:
 # 01:51:30.331185: bad disk /mnt/dfs/0 was detected
 # 01:51:30.344581: failing tablet
 # 01:51:30.870059: Initiating tablet copy
 # 04:00:51.820354: Processing DeleteTablet
 # 04:00:51.835958: Crashed.

 


> tserver crashed while dropping range partition
> ----------------------------------------------
>
>                 Key: KUDU-2892
>                 URL: https://issues.apache.org/jira/browse/KUDU-2892
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.9.0
>            Reporter: HeLifu
>            Priority: Major
>         Attachments: tserver-INFO.log
>
>
> On one of our production clusters, a tserver crashed yesterday morning while 
> dropping a range partition. The error messages are below:
> {code:java}
> Log file created at: 2019/07/11 01:51:30
> Running on machine: kudu31.jd.163.org
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> E0711 01:51:30.331185 11840 env_posix.cc:316] I/O error, context: 
> /mnt/dfs/0/kudu/tserver/data/data/9305dce18e6f4100b486b605617122b3.data
> E0711 01:51:30.337604 11840 data_dirs.cc:1120] Directory 
> /mnt/dfs/0/kudu/tserver/data/data marked as failed
> F0711 04:00:51.835958 68948 ts_tablet_manager.cc:940] Failed to delete tablet 
> data for 2278f736bf6548e2b773003c1ba7ed66: Invalid argument: Unable to delete 
> on-disk data from tablet 2278f736bf6548e2b773003c1ba7ed66: The metadata for 
> tablet 2278f736bf6548e2b773003c1ba7ed66 still references orphaned blocks. 
> Call DeleteTabletData() first
> {code}
> It seems that, after a disk was marked as failed, new orphaned blocks that 
> were never deleted caused this problem. I have attached an INFO log file 
> covering tablet '2278f736bf6548e2b773003c1ba7ed66'.
> For brevity, here is a quick summary of the timeline:
>  # 01:51:30.331185: bad disk /mnt/dfs/0 was detected
>  # 01:51:30.344581: failing tablet
>  # 01:51:30.870059: Initiating tablet copy
>  # 04:00:51.820354: Processing DeleteTablet
>  # 04:00:51.835958: Crashed.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
