ggershinsky commented on code in PR #9752:
URL: https://github.com/apache/iceberg/pull/9752#discussion_r1497429004
##########
core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java:
##########
@@ -20,39 +20,33 @@
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.SeekableInputStream;
-import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
public class AesGcmInputFile implements InputFile {
private final InputFile sourceFile;
private final byte[] dataKey;
private final byte[] fileAADPrefix;
private long plaintextLength;
+ /**
+ * Important: sourceFile.getLength() must return the verified plaintext content length, not the
+ * physical file size after encryption. This protects against tampering with the file size in
+ * untrusted storage systems.
+ */
public AesGcmInputFile(InputFile sourceFile, byte[] dataKey, byte[] fileAADPrefix) {
this.sourceFile = sourceFile;
this.dataKey = dataKey;
this.fileAADPrefix = fileAADPrefix;
- this.plaintextLength = -1;
+ this.plaintextLength = sourceFile.getLength();
Review Comment:
A couple of observations:
1. AesGcmInputFile objects are produced by the
`EncryptionManager.decrypt(EncryptedInputFile encrypted)` function. This
function is given only an EncryptedInputFile object, composed of the source
InputFile and its KeyMetadata. So the only way for the decrypt function to get
the file length is to call the source `InputFile.getLength()`. We
can't pass the file length to the decrypt function through any other channel.
2. But that's not a problem in the current Iceberg code, because the source
InputFile length is fully controlled and comes from a trusted place (not from
the underlying object store). More specifically, the sequence is:
a) **writing**:
_public ManifestFile ManifestWriter.toManifestFile()
return new GenericManifestFile(file.location(),_
[writer.length()](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L194),
->
_public long org.apache.iceberg.avro.AvroFileAppender.length()_
[ return
stream.getPos();](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java#L83)
So the length recorded in the ManifestFile object for each encrypted
manifest is the plaintext length (not the physical file length), since
`AesGcmOutputStream.getPos()` returns the plaintext position.
b) **reading**:
_InputFile ManifestFiles.newInputFile(FileIO io, **ManifestFile manifest**)
return InputFile input =_
[io.newInputFile(manifest)](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFiles.java#L349);
->
_public InputFile EncryptingFileIO.newInputFile(ManifestFile manifest)
return newDecryptingInputFile(manifest.path(), **manifest.length()**,
manifest.keyMetadata());_
->
_public InputFile EncryptingFileIO.newDecryptingInputFile(String path, long
plainLength, ByteBuffer buffer) {
InputFile inputFile = **io.newInputFile(path, plainLength);**
return em.decrypt(wrap(inputFile, buffer));_
So the length in the `InputFile sourceFile` given to the
AesGcmInputFile constructor comes from a trusted ManifestFile object (since
manifest lists are encrypted/signed), and not from untrusted storage.
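
To make the write/read sequence above concrete, here is a minimal, self-contained sketch of the length flow. It is a toy model, not the Iceberg code: `SketchWriter`, `SketchManifestEntry`, `SketchInputFile` and the `nonce || ciphertext || tag` block layout are all hypothetical names and assumptions made for illustration. It only shows that the writer's plaintext position differs from the physical size, and that the reader takes its length from the metadata entry rather than from storage.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Toy model of the length flow; none of these types are Iceberg classes. */
class TrustedLengthSketch {
  static final int NONCE_LENGTH = 12; // bytes, assumed per-block nonce size
  static final int TAG_LENGTH = 16; // bytes, GCM authentication tag size

  /** Toy writer: tracks the plaintext position separately from the physical size. */
  static final class SketchWriter {
    private final SecretKeySpec key;
    private final SecureRandom random = new SecureRandom();
    private long plaintextPos = 0;
    private long physicalSize = 0;

    SketchWriter(byte[] dataKey) {
      this.key = new SecretKeySpec(dataKey, "AES");
    }

    /** Encrypts one block as nonce || ciphertext || tag and counts the bytes produced. */
    void writeBlock(byte[] plaintext) {
      try {
        byte[] nonce = new byte[NONCE_LENGTH];
        random.nextBytes(nonce);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_LENGTH * 8, nonce));
        byte[] ciphertextAndTag = cipher.doFinal(plaintext);
        // The sketch only counts bytes; a real stream would also write them out.
        physicalSize += nonce.length + ciphertextAndTag.length;
        plaintextPos += plaintext.length;
      } catch (GeneralSecurityException e) {
        throw new IllegalStateException("AES-GCM encryption failed", e);
      }
    }

    /** Plaintext position: the value that ends up recorded in metadata. */
    long getPos() {
      return plaintextPos;
    }

    /** Size an object store would report for the encrypted file. */
    long physicalSize() {
      return physicalSize;
    }
  }

  /** Toy stand-in for a manifest-list entry, assumed to be encrypted/signed elsewhere. */
  static final class SketchManifestEntry {
    final String path;
    final long length;

    SketchManifestEntry(String path, long length) {
      this.path = path;
      this.length = length;
    }
  }

  /** Toy input file whose length is supplied by the caller, never read from storage. */
  static final class SketchInputFile {
    final String path;
    private final long knownLength;

    SketchInputFile(String path, long knownLength) {
      this.path = path;
      this.knownLength = knownLength;
    }

    long getLength() {
      return knownLength;
    }
  }

  /** Read side: the length travels from the metadata entry into the input file. */
  static SketchInputFile newInputFile(SketchManifestEntry entry) {
    return new SketchInputFile(entry.path, entry.length);
  }

  public static void main(String[] args) {
    SketchWriter writer = new SketchWriter(new byte[16]);
    writer.writeBlock(new byte[100]);
    writer.writeBlock(new byte[50]);

    SketchManifestEntry entry = new SketchManifestEntry("manifest.avro", writer.getPos());
    SketchInputFile source = newInputFile(entry);

    System.out.println("plaintext length from metadata: " + source.getLength()); // 150
    System.out.println("physical size in storage: " + writer.physicalSize()); // 206
  }
}
```

In this toy layout each block adds 28 bytes (nonce plus tag), so the two blocks above give a plaintext length of 150 and a physical size of 206. The decrypting side only ever sees the 150 that was carried in the metadata entry.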
What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]