Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar merged PR #9953: URL: https://github.com/apache/iceberg/pull/9953 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar commented on PR #9953: URL: https://github.com/apache/iceberg/pull/9953#issuecomment-2038104819 Thanks for the reviews @ajantha-bhat and @Fokko ! Merging -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar commented on code in PR #9953: URL: https://github.com/apache/iceberg/pull/9953#discussion_r1552280909 ## aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIO.java: ## @@ -377,6 +384,50 @@ public void testResolvingFileIOLoad() { Assertions.assertThat(result).isInstanceOf(S3FileIO.class); } + @Test + public void testInputFileWithDataFile() throws IOException { +String location = "s3://bucket/path/to/data-file.parquet"; +DataFile dataFile = +DataFiles.builder(PartitionSpec.unpartitioned()) +.withPath(location) +.withFileSizeInBytes(123L) +.withFormat(FileFormat.PARQUET) +.withRecordCount(123L) +.build(); +OutputStream outputStream = s3FileIO.newOutputFile(location).create(); +byte[] data = "testing".getBytes(); +outputStream.write(data); +outputStream.close(); + +InputFile inputFile = s3FileIO.newInputFile(dataFile); +Assertions.assertThat(inputFile.getLength()) +.as("Data file length should be determined from the file size stats") +.isEqualTo(123L); + } + + @Test + public void testInputFileWithManifest() throws IOException { +String dataFileLocation = "s3://bucket/path/to/data-file-2.parquet"; +DataFile dataFile = +DataFiles.builder(PartitionSpec.unpartitioned()) +.withPath(dataFileLocation) +.withFileSizeInBytes(123L) +.withFormat(FileFormat.PARQUET) +.withRecordCount(123L) +.build(); +String manifestLocation = "s3://bucket/path/to/manifest.avro"; +OutputFile outputFile = s3FileIO.newOutputFile(manifestLocation); +ManifestWriter writer = +ManifestFiles.write(PartitionSpec.unpartitioned(), outputFile); +writer.add(dataFile); +writer.close(); +ManifestFile manifest = writer.toManifestFile(); +InputFile inputFile = s3FileIO.newInputFile(manifest); +Assertions.assertThat(inputFile.getLength()) +.as("Manifest file length should be determined from the file size stats") +.isEqualTo(manifest.length()); Review Comment: I just added a verification that the s3Mock.headObject is never called when determining the length. It fails before this change, and passes after the fix so I think it's a better test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar commented on PR #9953: URL: https://github.com/apache/iceberg/pull/9953#issuecomment-2007499715 @ajantha-bhat Replied my thoughts on the Trino PR and I'll summarize what I think here: To be clear, there is no extra I/O being done in practice for `newInputfile(DataFile)` since that code path does not look to be exercised yet in the latest release. What's remaining is the `newInputFile(ManifestFile)` where there would be an extra I/O during engine planning. Since the I/O is an extra head request to object stores which typically is low double digit milliseconds + the fact that only a subset of manifess are read during planning I think that it's not significant enough to warrant a patch release, but I can be convinced otherwise if folks hit a latency regression this in their workloads. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
ajantha-bhat commented on PR #9953: URL: https://github.com/apache/iceberg/pull/9953#issuecomment-2002774794 ping @danielcweeks, @rdblue, @nastra -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar commented on code in PR #9953: URL: https://github.com/apache/iceberg/pull/9953#discussion_r1524996255 ## aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIO.java: ## @@ -377,6 +384,50 @@ public void testResolvingFileIOLoad() { Assertions.assertThat(result).isInstanceOf(S3FileIO.class); } + @Test + public void testInputFileWithDataFile() throws IOException { +String location = "s3://bucket/path/to/data-file.parquet"; +DataFile dataFile = +DataFiles.builder(PartitionSpec.unpartitioned()) +.withPath(location) +.withFileSizeInBytes(123L) +.withFormat(FileFormat.PARQUET) +.withRecordCount(123L) +.build(); +OutputStream outputStream = s3FileIO.newOutputFile(location).create(); +byte[] data = "testing".getBytes(); +outputStream.write(data); +outputStream.close(); + +InputFile inputFile = s3FileIO.newInputFile(dataFile); +Assertions.assertThat(inputFile.getLength()) +.as("Data file length should be determined from the file size stats") +.isEqualTo(123L); + } + + @Test + public void testInputFileWithManifest() throws IOException { +String dataFileLocation = "s3://bucket/path/to/data-file-2.parquet"; +DataFile dataFile = +DataFiles.builder(PartitionSpec.unpartitioned()) +.withPath(dataFileLocation) +.withFileSizeInBytes(123L) +.withFormat(FileFormat.PARQUET) +.withRecordCount(123L) +.build(); +String manifestLocation = "s3://bucket/path/to/manifest.avro"; +OutputFile outputFile = s3FileIO.newOutputFile(manifestLocation); +ManifestWriter writer = +ManifestFiles.write(PartitionSpec.unpartitioned(), outputFile); +writer.add(dataFile); +writer.close(); +ManifestFile manifest = writer.toManifestFile(); +InputFile inputFile = s3FileIO.newInputFile(manifest); +Assertions.assertThat(inputFile.getLength()) +.as("Manifest file length should be determined from the file size stats") +.isEqualTo(manifest.length()); Review Comment: Ah yeah good point, let me see how to get a more useful test for the manifest case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
ajantha-bhat commented on code in PR #9953: URL: https://github.com/apache/iceberg/pull/9953#discussion_r1524376106 ## aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIO.java: ## @@ -377,6 +384,50 @@ public void testResolvingFileIOLoad() { Assertions.assertThat(result).isInstanceOf(S3FileIO.class); } + @Test + public void testInputFileWithDataFile() throws IOException { +String location = "s3://bucket/path/to/data-file.parquet"; +DataFile dataFile = +DataFiles.builder(PartitionSpec.unpartitioned()) +.withPath(location) +.withFileSizeInBytes(123L) +.withFormat(FileFormat.PARQUET) +.withRecordCount(123L) +.build(); +OutputStream outputStream = s3FileIO.newOutputFile(location).create(); +byte[] data = "testing".getBytes(); +outputStream.write(data); +outputStream.close(); + +InputFile inputFile = s3FileIO.newInputFile(dataFile); +Assertions.assertThat(inputFile.getLength()) +.as("Data file length should be determined from the file size stats") +.isEqualTo(123L); + } + + @Test + public void testInputFileWithManifest() throws IOException { +String dataFileLocation = "s3://bucket/path/to/data-file-2.parquet"; +DataFile dataFile = +DataFiles.builder(PartitionSpec.unpartitioned()) +.withPath(dataFileLocation) +.withFileSizeInBytes(123L) +.withFormat(FileFormat.PARQUET) +.withRecordCount(123L) +.build(); +String manifestLocation = "s3://bucket/path/to/manifest.avro"; +OutputFile outputFile = s3FileIO.newOutputFile(manifestLocation); +ManifestWriter writer = +ManifestFiles.write(PartitionSpec.unpartitioned(), outputFile); +writer.add(dataFile); +writer.close(); +ManifestFile manifest = writer.toManifestFile(); +InputFile inputFile = s3FileIO.newInputFile(manifest); +Assertions.assertThat(inputFile.getLength()) +.as("Manifest file length should be determined from the file size stats") +.isEqualTo(manifest.length()); Review Comment: minor: This will not confirm whether `manifest.length` was used or file size is computed by reading the file. As the both values can be same. The above `dataFileTestcase` validates right. Maybe we have mock the `manifest.length` to confirm that value is used. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar commented on code in PR #9953: URL: https://github.com/apache/iceberg/pull/9953#discussion_r1523786976 ## api/src/main/java/org/apache/iceberg/io/FileIO.java: ## @@ -49,25 +49,25 @@ default InputFile newInputFile(String path, long length) { default InputFile newInputFile(DataFile file) { Preconditions.checkArgument( file.keyMetadata() == null, -"Cannot decrypt data file: {} (use EncryptingFileIO)", +"Cannot decrypt data file: %s (use EncryptingFileIO)", file.path()); -return newInputFile(file.path().toString()); +return newInputFile(file.path().toString(), file.fileSizeInBytes()); Review Comment: no I confused myself, actually the default is fine. It's up to FileIO implementations to override newInputfile with a size and keep track of the size. That's what S3FileIO and others do. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar commented on code in PR #9953: URL: https://github.com/apache/iceberg/pull/9953#discussion_r1523777029 ## api/src/main/java/org/apache/iceberg/io/FileIO.java: ## @@ -49,25 +49,25 @@ default InputFile newInputFile(String path, long length) { default InputFile newInputFile(DataFile file) { Preconditions.checkArgument( file.keyMetadata() == null, -"Cannot decrypt data file: {} (use EncryptingFileIO)", +"Cannot decrypt data file: %s (use EncryptingFileIO)", file.path()); -return newInputFile(file.path().toString()); +return newInputFile(file.path().toString(), file.fileSizeInBytes()); Review Comment: actually not quite right, since the default implementation of newInputfile with a size actually ignores the size. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar commented on code in PR #9953: URL: https://github.com/apache/iceberg/pull/9953#discussion_r1523777029 ## api/src/main/java/org/apache/iceberg/io/FileIO.java: ## @@ -49,25 +49,25 @@ default InputFile newInputFile(String path, long length) { default InputFile newInputFile(DataFile file) { Preconditions.checkArgument( file.keyMetadata() == null, -"Cannot decrypt data file: {} (use EncryptingFileIO)", +"Cannot decrypt data file: %s (use EncryptingFileIO)", file.path()); -return newInputFile(file.path().toString()); +return newInputFile(file.path().toString(), file.fileSizeInBytes()); Review Comment: actually not quite right, since the default implementation of newInputfile with a path actually ignores the path. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar commented on code in PR #9953: URL: https://github.com/apache/iceberg/pull/9953#discussion_r1523734039 ## api/src/main/java/org/apache/iceberg/io/FileIO.java: ## @@ -51,23 +51,23 @@ default InputFile newInputFile(DataFile file) { file.keyMetadata() == null, "Cannot decrypt data file: {} (use EncryptingFileIO)", file.path()); -return newInputFile(file.path().toString()); +return newInputFile(file.path().toString(), file.fileSizeInBytes()); } default InputFile newInputFile(DeleteFile file) { Preconditions.checkArgument( file.keyMetadata() == null, "Cannot decrypt delete file: {} (use EncryptingFileIO)", file.path()); -return newInputFile(file.path().toString()); +return newInputFile(file.path().toString(), file.fileSizeInBytes()); } default InputFile newInputFile(ManifestFile manifest) { Preconditions.checkArgument( manifest.keyMetadata() == null, "Cannot decrypt manifest: {} (use EncryptingFileIO)", manifest.path()); -return newInputFile(manifest.path()); +return newInputFile(manifest.path(), manifest.length()); Review Comment: Let me see about adding tests for this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]
amogh-jahagirdar opened a new pull request, #9953: URL: https://github.com/apache/iceberg/pull/9953 As part of adding encryption support, in https://github.com/apache/iceberg/pull/9592 we added some new FileIO APIs, namely ``` newInputFile(ManifestFile) newInputFile(DataFile) newInputFile(DeleteFile) ``` The overriden implementaiton in EncryptedFileIO is correct but the default implementation in `FileIO` for these new APIs should pass in a length since it's always known from the Iceberg metadata. Without this, FileIO implementations which end up calling these default implementations will make extra requests to the object store/file system to determine the length which we can avoid. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org