danielsason112 commented on issue #114:
URL: https://github.com/apache/solr-sandbox/issues/114#issuecomment-2601638032

   Hey,
   
   I came across this issue as well.
   After triggering encryption in a distributed SolrCloud cluster, for a 
collection with a replication factor greater than 1, all follower replicas 
gradually fail to sync index files from their leaders and remain stuck in 
recovery indefinitely. Newly created replicas also cannot recover and fail to 
sync files from the leader.
   
   When a leader replica tries to read any of the encrypted index files into 
the buffer, a `java.io.EOFException: Read beyond EOF` is thrown from 
[DecryptingIndexInput.readBytes](https://github.com/apache/solr-sandbox/blob/4e1819ff8a3758becca19bf337ecd1b352dba805/encryption/src/main/java/org/apache/solr/encryption/crypto/DecryptingIndexInput.java#L222), 
and the following log entry appears on the leader replica's node:
   ```
   [WARN]  org.apache.solr.handler.ReplicationHandler Exception while writing response for params: generation=8&qt=/replication&file=_9.cfs&checksum=true&wt=filestream&command=filecontent
   java.io.EOFException: Read beyond EOF (position=0, arrayLength=46333, fileLength=46309) in Decrypting MemorySegmentIndexInput(path="/data/test_shard2_replica_n6/data/index/_9.cfs")
           at org.apache.solr.encryption.crypto.DecryptingIndexInput.readBytes(DecryptingIndexInput.java:223) ~[solr-encryption-plugin-1.0.0.jar:?]
           at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1635) ~[solr-core-9.6.0.jar:9.6.0 f8e5a93c11267e13b7b43005a428bfb910ac6e57 - gus - 2024-04-22 23:20:52]
           at org.apache.solr.core.SolrCore$3.write(SolrCore.java:3056) ~[solr-core-9.6.0.jar:9.6.0 f8e5a93c11267e13b7b43005a428bfb910ac6e57 - gus - 2024-04-22 23:20:52]
   ...
   ```
   
   Follower replica logs show that the download has failed.
   
   My investigation led me to believe that the root cause is the 
ReplicationHandler trying to read files from the EncryptionDirectory using the 
"full" length of the file (including the encryption header, footer, etc.), 
while the DecryptingIndexInput actually expects to read only up to the 
"logical" length of the file, resulting in the read beyond EOF exception.
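   
   To illustrate the suspected mismatch, here is a minimal standalone sketch. 
It does not use any Solr or Lucene classes; `LogicalLengthDemo`, 
`DecryptingInput`, `physicalLength` and `copy` are hypothetical names, and the 
24-byte overhead is only an assumption chosen to match the gap between 
`arrayLength=46333` and `fileLength=46309` in the log above.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

/** Standalone illustration only; all names here are hypothetical, not Solr/Lucene APIs. */
public class LogicalLengthDemo {

  // Assumed encryption overhead per file, picked to match the log (46333 - 46309).
  static final int OVERHEAD = 24;

  /** Stand-in for a decrypting input: it exposes only the logical (decrypted) bytes. */
  static final class DecryptingInput {
    private final byte[] plain;
    private int position;

    DecryptingInput(byte[] plain) {
      this.plain = plain;
    }

    /** Logical length, i.e. without the assumed encryption overhead. */
    long length() {
      return plain.length;
    }

    void readBytes(byte[] dst, int offset, int len) throws IOException {
      if (len > plain.length - position) {
        throw new EOFException("Read beyond EOF (position=" + position
            + ", arrayLength=" + len + ", fileLength=" + plain.length + ")");
      }
      System.arraycopy(plain, position, dst, offset, len);
      position += len;
    }
  }

  /** Stand-in for the on-disk size that the underlying directory would report. */
  static long physicalLength(byte[] plain) {
    return plain.length + OVERHEAD;
  }

  /** Reads exactly declaredLength bytes, the way a file-streaming caller might. */
  static byte[] copy(DecryptingInput in, long declaredLength) throws IOException {
    byte[] buf = new byte[(int) declaredLength];
    in.readBytes(buf, 0, buf.length);
    return buf;
  }

  public static void main(String[] args) throws IOException {
    byte[] plain = new byte[46309];
    Arrays.fill(plain, (byte) 7);

    // Sizing the read by the logical length succeeds.
    DecryptingInput first = new DecryptingInput(plain);
    System.out.println("copied " + copy(first, first.length()).length + " bytes");

    // Sizing the read by the physical length requests bytes the input cannot serve.
    try {
      copy(new DecryptingInput(plain), physicalLength(plain));
    } catch (EOFException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

   In this sketch the first copy succeeds, while the second one fails with an 
`EOFException` because the caller asks for more bytes than the decrypting input 
can logically provide. That is the same shape of failure as in the log, 
assuming my reading of the root cause is right.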
   
   While digging into it, I noticed that ReplicationHandler uses the 
EncryptionDirectory superclass `fileLength` method to [get the actual file 
size](https://github.com/apache/solr/blob/0937bb504618cbf741862fb60833072982fcf895/solr/core/src/java/org/apache/solr/handler/admin/api/ReplicationAPIBase.java#L385).
   To make replication use the "logical" length of the file during file sync, 
I overrode that method in EncryptionDirectory as follows:
   
   ```
   @Override
   public long fileLength(String name) throws IOException {
     // Open the file through this directory so that length() is the length
     // seen by readers (intended to be the "logical" length), then close it.
     try (IndexInput indexInput = openInput(name, IOContext.READONCE)) {
       return indexInput.length();
     }
   }
   ```
   
   The above patch seems to get rid of the EOF error: replicas that were stuck 
in recovery became active, and creating new replicas also succeeds.
   
   I was just following a hunch, and I do not deeply understand the issue or 
the appropriate fix for it.
   @bruno-roustant I would be glad to get your input on this one.
   
   BTW, I am working on a test to reproduce the issue.
   
   Many thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

