[ https://issues.apache.org/jira/browse/HDFS-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Clampffer updated HDFS-10247:
-----------------------------------
    Attachment: HDFS-10247.HDFS-8707.001.patch

Update to make read/write continuations take DataNodeConnectionImpl by shared pointer only. They already exist only as shared_ptrs in the calling code; this makes sure the connection isn't yanked out from under the continuation.

> libhdfs++: Datanode protocol version mismatch
> ---------------------------------------------
>
>                 Key: HDFS-10247
>                 URL: https://issues.apache.org/jira/browse/HDFS-10247
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-10247.HDFS-8707.000.patch,
> HDFS-10247.HDFS-8707.001.patch
>
>
> Occasionally "Version Mismatch (Expected: 28, Received: 22794 )" shows up in
> the logs. This rarely happens with fewer than 500 concurrent reads and
> starts happening often enough to be an issue at 1000 concurrent reads.
> I've seen 3 distinct numbers: 23050 (most common), 22538, and 22794. If you
> break these shorts into bytes you get
> {code}
> 23050 -> [90,10]
> 22794 -> [89,10]
> 22538 -> [88,10]
> {code}
> Interestingly enough, if we dump buffers holding protobuf messages just before
> they hit the wire, we see things like the following, with the first two bytes
> as 90,10
> {code}
> buffer = {90,10,82,10,64,10,52,10,37,66,80,45,49,51,56,49,48,51,51,57,57,49,45,49,50,55,46,48,46,48,46,49,45,49,52,53,57,53,50,53,54,49,53,55,50,53,16,-127,-128,-128,-128,4,24,-23,7,32,-128,-128,64,18,8,10,0,18,0,26,0,34,0,18,14,108,105,98,104,100,102,115,43,43,95,75,67,43,49,16,0,24,23,32,1}
> {code}
> The first 3 bytes the DN is expecting for an unsecured read block request:
> {code}
> {0,28,81} //[0, 28]->a short for protocol, 81 is read block opcode
> {code}
> This seems like either connections are getting swapped between readers or
> the header isn't being sent for some reason but the protobuf message is.
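The byte breakdown above can be checked with a few lines of C++. This is only a sketch: it assumes the 2-byte version field is decoded as a big-endian (network order) unsigned short, which is what the arithmetic in the report implies, and DecodeBigEndian16 is a hypothetical helper name, not something from libhdfs++.

```cpp
#include <cstdint>

// Hypothetical helper: interpret two bytes as a big-endian (network order)
// unsigned 16-bit value, the way a reader would decode a 2-byte version field.
constexpr std::uint16_t DecodeBigEndian16(std::uint8_t hi, std::uint8_t lo) {
  return static_cast<std::uint16_t>((hi << 8) | lo);
}

// Expected start of an unsecured read block request: {0, 28, 81}.
// The first two bytes decode to protocol version 28.
static_assert(DecodeBigEndian16(0, 28) == 28, "expected protocol version");

// If the 3-byte header is missing, the first two bytes of the protobuf
// message are decoded as the version instead. These are the three byte
// pairs listed in the report.
static_assert(DecodeBigEndian16(90, 10) == 23050, "most common mismatch");
static_assert(DecodeBigEndian16(89, 10) == 22794, "second mismatch value");
static_assert(DecodeBigEndian16(88, 10) == 22538, "third mismatch value");
```

All three observed mismatch values decode from a first byte of 88 to 90 followed by 10, which is consistent with the dumped buffer starting with 90,10 being read where the header should have been.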
> I've ruled out memory stomps on the header data (see HDFS-10241) by sticking
> the 3 byte header in its own static buffer that all requests use.
> Some notes:
> -The mismatched number will stay the same for the duration of a stress test.
> -The mismatch is distributed fairly evenly throughout the logs



--
This message was sent by Atlassian JIRA (v6.3.4#6332)
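
Regarding the 001 patch note at the top of this message, the lifetime guarantee it describes can be sketched as follows. This is not the patch itself: FakeConnection, MakeReadContinuation and ConnectionOutlivesCaller are hypothetical names standing in for DataNodeConnectionImpl and the real read/write continuations. The sketch only shows that a continuation which holds the connection by shared_ptr keeps it alive after the caller drops its own reference.

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for DataNodeConnectionImpl; not the real class.
struct FakeConnection {
  int id;
};

// Hypothetical continuation factory. Taking the connection by shared_ptr and
// capturing it by value makes the continuation a co-owner of the connection.
std::function<int()> MakeReadContinuation(std::shared_ptr<FakeConnection> conn) {
  return [conn]() { return conn->id; };
}

// Returns true if the connection stays alive while the continuation exists,
// even after the caller has released its own shared_ptr, and is destroyed
// once the continuation itself is released.
bool ConnectionOutlivesCaller() {
  auto conn = std::make_shared<FakeConnection>(FakeConnection{7});
  std::weak_ptr<FakeConnection> watcher = conn;  // observes without owning

  std::function<int()> continuation = MakeReadContinuation(conn);
  conn.reset();  // the caller drops its reference

  const bool alive_while_pending = !watcher.expired();
  const int id = continuation();  // safe: the continuation still owns it

  continuation = nullptr;  // releasing the continuation frees the connection
  return alive_while_pending && id == 7 && watcher.expired();
}
```

Had the continuation held a raw pointer or reference instead, the call after conn.reset() would have touched a destroyed object.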