danielsason112 commented on issue #114: URL: https://github.com/apache/solr-sandbox/issues/114#issuecomment-2601638032
Hey, I came across this issue as well. After triggering encryption in a distributed SolrCloud, for a collection with a replication factor greater than 1, all follower replicas gradually fail to sync index files from their leaders and stay stuck in recovery indefinitely. Newly created replicas also cannot recover and fail to sync files from the leader. When a leader replica tries to read any of the encrypted index files into the buffer, a `java.io.EOFException: Read beyond EOF` is thrown from [DecryptingIndexInput.readBytes](https://github.com/apache/solr-sandbox/blob/4e1819ff8a3758becca19bf337ecd1b352dba805/encryption/src/main/java/org/apache/solr/encryption/crypto/DecryptingIndexInput.java#L222) and the following log entry appears on the leader replica's node:

```
[WARN] org.apache.solr.handler.ReplicationHandler Exception while writing response for params: generation=8&qt=/replication&file=_9.cfs&checksum=true&wt=filestream&command=filecontent
java.io.EOFException: Read beyond EOF (position=0, arrayLength=46333, fileLength=46309) in Decrypting MemorySegmentIndexInput(path="/data/test_shard2_replica_n6/data/index/_9.cfs")
	at org.apache.solr.encryption.crypto.DecryptingIndexInput.readBytes(DecryptingIndexInput.java:223) ~[solr-encryption-plugin-1.0.0.jar:?]
	at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1635) ~[solr-core-9.6.0.jar:9.6.0 f8e5a93c11267e13b7b43005a428bfb910ac6e57 - gus - 2024-04-22 23:20:52]
	at org.apache.solr.core.SolrCore$3.write(SolrCore.java:3056) ~[solr-core-9.6.0.jar:9.6.0 f8e5a93c11267e13b7b43005a428bfb910ac6e57 - gus - 2024-04-22 23:20:52]
	...
```

Follower replica logs show that the download has failed.

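
As a rough illustration of what the numbers in that log entry suggest: the requested read (`arrayLength=46333`) is 24 bytes larger than the length the decrypting input reports (`fileLength=46309`). Below is a minimal, self-contained sketch of that kind of mismatch. It does not use any Solr or Lucene classes; `LogicalInput` and `canRead` are hypothetical stand-ins, and treating 46333 as the on-disk size of the file is an assumption, not something the log states.

```java
import java.io.EOFException;

public class LengthMismatchSketch {

    /** Hypothetical stand-in for a decrypting input: it only exposes the logical bytes. */
    static final class LogicalInput {
        private final byte[] plain;
        private int position = 0;

        LogicalInput(int logicalLength) {
            this.plain = new byte[logicalLength];
        }

        long length() {
            return plain.length;
        }

        void readBytes(byte[] dst, int offset, int len) throws EOFException {
            // Reject any read that would run past the logical end of the data.
            if (position + len > plain.length) {
                throw new EOFException("Read beyond EOF (position=" + position
                        + ", arrayLength=" + len + ", fileLength=" + plain.length + ")");
            }
            System.arraycopy(plain, position, dst, offset, len);
            position += len;
        }
    }

    /** Returns true if reading `requested` bytes from an input of `logicalLength` bytes succeeds. */
    public static boolean canRead(int logicalLength, int requested) {
        LogicalInput in = new LogicalInput(logicalLength);
        try {
            in.readBytes(new byte[requested], 0, requested);
            return true;
        } catch (EOFException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int logicalLength = 46309; // fileLength taken from the log entry
        int requested = 46333;     // arrayLength taken from the log entry (assumed on-disk size)

        System.out.println(canRead(logicalLength, requested));     // prints "false"
        System.out.println(canRead(logicalLength, logicalLength)); // prints "true"
    }
}
```

In this sketch, a caller that sizes its read from the larger length fails on the very first read (position 0), while a caller that uses the length reported by the input itself succeeds.
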
My investigation led me to believe that the root cause is the ReplicationHandler trying to read files from the EncryptionDirectory using the "full" length of the file (including the encryption header, footer, etc.), while the DecryptingIndexInput actually expects to read only up to the "logical" length of the file, resulting in the read beyond EOF exception. While digging into it, I noticed that ReplicationHandler uses the `fileLength` method of the EncryptionDirectory superclass to [get the actual file size](https://github.com/apache/solr/blob/0937bb504618cbf741862fb60833072982fcf895/solr/core/src/java/org/apache/solr/handler/admin/api/ReplicationAPIBase.java#L385). To try to read the "logical" length of the file during replication file sync, I have overridden that method in EncryptionDirectory with the following:

```
@Override
public long fileLength(String name) throws IOException {
  IndexInput indexInput = null;
  try {
    indexInput = this.openInput(name, IOContext.READONCE);
    return indexInput.length();
  } finally {
    if (indexInput != null) {
      indexInput.close();
    }
  }
}
```

The above patch seems to get rid of the EOF error: replicas that were stuck in recovery became active, and creation of new replicas is also successful. I was just following a hunch, and I do not deeply understand the issue or the appropriate fix for it. @bruno-roustant I will be glad to get your input on this one.

BTW, I am working on a test to reproduce the issue.

Many thanks!
