[ 
https://issues.apache.org/jira/browse/ACCUMULO-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Brassard updated ACCUMULO-2333:
------------------------------------

    Description: 
While running the agitator during a client ingest test, we encountered a "File 
does not exist" error that remained stuck in the Table Problems section of the 
monitor page.

Confirmed that the file in question had been compacted away previously.

While it appears that no data was lost, it is strange that the error surfaced 
and then seemed to resolve itself shortly thereafter, without the Table 
Problems section ever being updated.

Here is the stacktrace from the Monitor:
{code}
File does not exist: /accumulo/tables/2/t-00000dj/F0000dwj.rf
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1540)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1483)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1463)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1437)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
{code}

_UPDATE (from comments below):_
On a cluster with 15 slaves, two of the participating tablet servers had logs 
referencing the file.

slave05 was killed by the agitator at 20:33 and restarted at 20:43, at which 
point it immediately compacted F0000dwj.rf. That file had been created by 
slave03 at 20:34, while slave05 was offline. slave03, which appears to have 
previously been responsible for the file, then attempted a MajC at 21:10, 
which caused the exceptions to appear in the monitor. The master was also 
killed at 21:02 and revived at 21:05. It appears that the "missing" extent was 
never unloaded and re-assigned before the failure.

slave03 also reported RuntimeExceptions at about 20:34, so there is a chance 
that its actions at that time did not complete cleanly.
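The sequence above amounts to a read-after-delete race: one server holds a stale reference to an RFile path, a second server's compaction deletes that file, and the first server's later MajC fails when it tries to open it. The following is a minimal, stdlib-only Python sketch of that race shape, not Accumulo code; the path and comments are illustrative only:

```python
import os
import tempfile

# slave03's view: it recorded the RFile path when the file was created
tmpdir = tempfile.mkdtemp()
rfile = os.path.join(tmpdir, "F0000dwj.rf")
with open(rfile, "w") as f:
    f.write("sorted key/value data")

# slave05 restarts and major-compacts the tablet, removing the old file
os.remove(rfile)  # stands in for the compaction deleting F0000dwj.rf

# slave03 later starts its own MajC against the stale reference and fails
try:
    with open(rfile) as f:
        f.read()
except FileNotFoundError:
    print("File does not exist:", rfile)
```

In the real incident the delete is a distributed compaction and the failing open is an HDFS getBlockLocations call, but the ordering hazard is the same: the extent was never unloaded and re-assigned, so slave03's stale view was never invalidated.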

I'm attaching logs covering the relevant time window for the pertinent servers.



> "File does not exist" error during client ingest with agitation
> ---------------------------------------------------------------
>
>                 Key: ACCUMULO-2333
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2333
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Luke Brassard
>         Attachments: master.log, tserver_slave03.log, tserver_slave05.log



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
