Marton Elek created HDDS-4667:
---------------------------------

             Summary: XCeiverClientGrpc should give up if unexpected exception is thrown from read path
                 Key: HDDS-4667
                 URL: https://issues.apache.org/jira/browse/HDDS-4667
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
          Components: Ozone Client
            Reporter: Marton Elek


I found this while using a data generator.

1. I accidentally uploaded keys without checksum data.

2. When reading such a key, the client enters an endless loop instead of giving up after the first unexpected exception:

{code}
2021-01-11 13:01:50,031 INFO  storage.BlockInputStream (BlockInputStream.java:refreshPipeline(166)) - Unable to read information for block conID: 2 locID: 185 bcsId: 0 from pipeline PipelineID=206da15d-62f6-4e24-93d1-e2e805fc1376: Unexpected OzoneException: org.apache.hadoop.ozone.common.OzoneChecksumException: Original checksumData has no checksums
2021-01-11 13:01:50,047 ERROR scm.XceiverClientGrpc (XceiverClientGrpc.java:sendCommandWithRetry(408)) - Failed to execute command cmdType: ReadChunk
traceID: ""
containerID: 2
datanodeUuid: "2c124e08-e8a5-4493-a41e-84797984e6a6"
readChunk {
  blockID {
    containerID: 2
    localID: 185
    blockCommitSequenceId: 0
  }
  chunkData {
    chunkName: "chunk0"
    offset: 0
    len: 4194304
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
    }
  }
}
 on the pipeline Pipeline[ Id: 7d5ed2da-7453-4113-b766-4100458dcc16, Nodes: 2c124e08-e8a5-4493-a41e-84797984e6a6{ip: 127.0.0.1, host: localhost, networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:STAND_ALONE, Factor:THREE, State:OPEN, leaderId:, CreationTimestamp2021-01-11T12:01:50.032Z].
2021-01-11 13:01:50,047 INFO  storage.BlockInputStream (BlockInputStream.java:refreshPipeline(166)) - Unable to read information for block conID: 2 locID: 185 bcsId: 0 from pipeline PipelineID=7d5ed2da-7453-4113-b766-4100458dcc16: Unexpected OzoneException: org.apache.hadoop.ozone.common.OzoneChecksumException: Original checksumData has no checksums
2021-01-11 13:01:50,062 ERROR scm.XceiverClientGrpc (XceiverClientGrpc.java:sendCommandWithRetry(408)) - Failed to execute command cmdType: ReadChunk
traceID: ""
containerID: 2
datanodeUuid: "2c124e08-e8a5-4493-a41e-84797984e6a6"
readChunk {
  blockID {
    containerID: 2
    localID: 185
    blockCommitSequenceId: 0
  }
  chunkData {
    chunkName: "chunk0"
    offset: 0
    len: 4194304
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
    }
  }
}
 on the pipeline Pipeline[ Id: 3a4b5032-6b2f-4297-8c4b-89d715175bb1, Nodes: 2c124e08-e8a5-4493-a41e-84797984e6a6{ip: 127.0.0.1, host: localhost, networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:STAND_ALONE, Factor:THREE, State:OPEN, leaderId:, CreationTimestamp2021-01-11T12:01:50.048Z].
{code}

Please note that the attempts follow each other within the same few milliseconds, without any delay between them.
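The ReadChunk request in the log carries checksumData with type CRC32 and bytesPerChecksum 1048576 but no checksum values. With len 4194304, a reader would expect 4194304 / 1048576 = 4 CRC32 values, one per 1 MiB slice, so verification has nothing to compare against. The minimal sketch below illustrates that check; ChecksumData, expectedChecksumCount and verify are hypothetical simplifications written for this example, not Ozone's actual checksum classes.

{code:java}
import java.io.IOException;
import java.util.List;
import java.util.zip.CRC32;

/** Sketch of per-slice CRC32 verification. All names are hypothetical. */
public class ChecksumCheckSketch {

  /** Hypothetical, simplified stand-in for the checksumData message in the log. */
  static final class ChecksumData {
    final int bytesPerChecksum;
    final List<Long> checksums;

    ChecksumData(int bytesPerChecksum, List<Long> checksums) {
      this.bytesPerChecksum = bytesPerChecksum;
      this.checksums = checksums;
    }
  }

  /** Number of checksums expected for a chunk of chunkLen bytes (rounded up). */
  static long expectedChecksumCount(long chunkLen, int bytesPerChecksum) {
    return (chunkLen + bytesPerChecksum - 1) / bytesPerChecksum;
  }

  /** Verifies data against stored CRC32 values, one per bytesPerChecksum slice. */
  static void verify(byte[] data, ChecksumData stored) throws IOException {
    if (stored.checksums.isEmpty()) {
      // The situation from the log: there is nothing to compare against.
      throw new IOException("Original checksumData has no checksums");
    }
    long expected = expectedChecksumCount(data.length, stored.bytesPerChecksum);
    if (stored.checksums.size() != expected) {
      throw new IOException("Expected " + expected + " checksums, found "
          + stored.checksums.size());
    }
    for (int i = 0; i < expected; i++) {
      int off = i * stored.bytesPerChecksum;
      int len = Math.min(stored.bytesPerChecksum, data.length - off);
      CRC32 crc = new CRC32();
      crc.update(data, off, len);
      if (crc.getValue() != stored.checksums.get(i)) {
        throw new IOException("Checksum mismatch in slice " + i);
      }
    }
  }

  public static void main(String[] args) {
    // Chunk length and bytesPerChecksum copied from the ReadChunk request above.
    System.out.println(expectedChecksumCount(4194304L, 1048576)); // 4
    try {
      verify(new byte[4194304], new ChecksumData(1048576, List.of()));
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}

Because the stored data itself is invalid, this check fails identically on every replica and on every attempt.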

The problematic part seems to be this loop in BlockInputStream:

{code}
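      try {
        numBytesRead = current.read(b, off, numBytesToRead);
      } catch (IOException e) {
        // Every IOException, retriable or not, is handled the same way:
        // handleReadError(e) is called and the loop tries to read again.
        handleReadError(e);
        continue;
      }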
{code}

In case of unexpected system exceptions like this one, we should "break" out of the loop instead of using "continue": retrying cannot help when the stored data itself is invalid.
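To make the proposed behaviour concrete, here is a minimal sketch of a read loop that retries only errors it classifies as retriable and lets every other IOException propagate. Propagating the exception leaves the loop after the first failure (the effect of "break") while still reporting the error to the caller. RetriableReadException, ChunkReader and this handleReadError are hypothetical stand-ins invented for the example; how Ozone should actually classify exceptions is not settled by this report.

{code:java}
import java.io.IOException;

/** Sketch of a read loop that gives up on non-retriable errors. Names are hypothetical. */
public class ReadLoopSketch {

  /** Hypothetical marker type for errors worth retrying after a pipeline refresh. */
  static class RetriableReadException extends IOException {
    RetriableReadException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for the chunk stream that `current` refers to. */
  interface ChunkReader {
    int read(byte[] b, int off, int len) throws IOException;
  }

  private int refreshes = 0;

  /** Hypothetical stand-in for handleReadError(e); here it only counts calls. */
  void handleReadError(IOException e) {
    refreshes++;
  }

  int getRefreshes() {
    return refreshes;
  }

  /** Reads up to len bytes from current into b, starting at off. */
  int read(ChunkReader current, byte[] b, int off, int len) throws IOException {
    int total = 0;
    while (len > 0) {
      int numBytesRead;
      try {
        numBytesRead = current.read(b, off, len);
      } catch (RetriableReadException e) {
        // Only retriable errors lead to another iteration (unbounded in this sketch).
        handleReadError(e);
        continue;
      }
      // Any other IOException (e.g. a checksum error) is not caught above,
      // so it propagates and ends the loop after the first failure.
      if (numBytesRead <= 0) {
        break; // end of stream, or no progress
      }
      total += numBytesRead;
      off += numBytesRead;
      len -= numBytesRead;
    }
    return total;
  }

  public static void main(String[] args) {
    ReadLoopSketch sketch = new ReadLoopSketch();
    ChunkReader corrupt = (b, off, len) -> {
      throw new IOException("Original checksumData has no checksums");
    };
    try {
      sketch.read(corrupt, new byte[16], 0, 16);
    } catch (IOException e) {
      System.out.println("gave up: " + e.getMessage()
          + " (refreshes: " + sketch.getRefreshes() + ")");
    }
  }
}
{code}

With a reader that always fails with the checksum error, read() throws on the first attempt and handleReadError is never called, whereas a transient RetriableReadException still triggers a refresh and another attempt.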

(Normally this cannot happen in a production cluster, as such data can only be created by a bad client. But it has a security implication: a malicious user can create similar keys to mount a DoS attack, since every client reading them will retry without sleeping...)
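The DoS concern comes from retries that are both unbounded and immediate. One common mitigation, shown here only as a sketch and not as Ozone's actual retry policy, is to cap the number of retries and back off exponentially between attempts. Attempt, callWithRetries, backoffMillis and their parameters are hypothetical names chosen for this example.

{code:java}
import java.io.IOException;

/** Sketch of a retry helper with a fixed budget and exponential backoff. Names are hypothetical. */
public class BoundedRetrySketch {

  /** Hypothetical single attempt, e.g. one try at reading a chunk. */
  interface Attempt<T> {
    T run() throws IOException;
  }

  /** Delay before retry number `retry` (0-based): base * 2^retry, capped at maxMillis. */
  static long backoffMillis(int retry, long baseMillis, long maxMillis) {
    long delay = baseMillis << Math.min(retry, 20); // limit the shift to avoid overflow
    return Math.min(delay, maxMillis);
  }

  /** Runs the attempt at most maxRetries + 1 times, sleeping between failures. */
  static <T> T callWithRetries(Attempt<T> attempt, int maxRetries,
      long baseMillis, long maxMillis) throws IOException, InterruptedException {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    IOException last = null;
    for (int retry = 0; retry <= maxRetries; retry++) {
      try {
        return attempt.run();
      } catch (IOException e) {
        last = e;
        if (retry < maxRetries) {
          Thread.sleep(backoffMillis(retry, baseMillis, maxMillis));
        }
      }
    }
    // Retry budget exhausted: give up and report the last failure.
    throw last;
  }

  public static void main(String[] args) throws InterruptedException {
    int[] attempts = {0};
    try {
      callWithRetries(() -> {
        attempts[0]++;
        throw new IOException("Original checksumData has no checksums");
      }, 3, 10, 1000);
    } catch (IOException e) {
      System.out.println("gave up after " + attempts[0] + " attempts: " + e.getMessage());
    }
  }
}
{code}

Even with backoff, a persistent error such as the missing checksums above would still cost the full retry budget, so this complements giving up immediately on unexpected exceptions rather than replacing it.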



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
