Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-04-04 Thread via GitHub


amogh-jahagirdar merged PR #9953:
URL: https://github.com/apache/iceberg/pull/9953


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-04-04 Thread via GitHub


amogh-jahagirdar commented on PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#issuecomment-2038104819

   Thanks for the reviews @ajantha-bhat and @Fokko ! Merging
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-04-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#discussion_r1552280909


##
aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIO.java:
##
@@ -377,6 +384,50 @@ public void testResolvingFileIOLoad() {
 Assertions.assertThat(result).isInstanceOf(S3FileIO.class);
   }
 
+  @Test
+  public void testInputFileWithDataFile() throws IOException {
+String location = "s3://bucket/path/to/data-file.parquet";
+DataFile dataFile =
+DataFiles.builder(PartitionSpec.unpartitioned())
+.withPath(location)
+.withFileSizeInBytes(123L)
+.withFormat(FileFormat.PARQUET)
+.withRecordCount(123L)
+.build();
+OutputStream outputStream = s3FileIO.newOutputFile(location).create();
+byte[] data = "testing".getBytes();
+outputStream.write(data);
+outputStream.close();
+
+InputFile inputFile = s3FileIO.newInputFile(dataFile);
+Assertions.assertThat(inputFile.getLength())
+.as("Data file length should be determined from the file size stats")
+.isEqualTo(123L);
+  }
+
+  @Test
+  public void testInputFileWithManifest() throws IOException {
+String dataFileLocation = "s3://bucket/path/to/data-file-2.parquet";
+DataFile dataFile =
+DataFiles.builder(PartitionSpec.unpartitioned())
+.withPath(dataFileLocation)
+.withFileSizeInBytes(123L)
+.withFormat(FileFormat.PARQUET)
+.withRecordCount(123L)
+.build();
+String manifestLocation = "s3://bucket/path/to/manifest.avro";
+OutputFile outputFile = s3FileIO.newOutputFile(manifestLocation);
+ManifestWriter writer =
+ManifestFiles.write(PartitionSpec.unpartitioned(), outputFile);
+writer.add(dataFile);
+writer.close();
+ManifestFile manifest = writer.toManifestFile();
+InputFile inputFile = s3FileIO.newInputFile(manifest);
+Assertions.assertThat(inputFile.getLength())
+.as("Manifest file length should be determined from the file size 
stats")
+.isEqualTo(manifest.length());

Review Comment:
   I just added a verification that the s3Mock.headObject is never called when 
determining the length. It fails before this change, and passes after the fix 
so I think it's a better test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-19 Thread via GitHub


amogh-jahagirdar commented on PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#issuecomment-2007499715

   @ajantha-bhat Replied my thoughts on the Trino PR and I'll summarize what I 
think here: 
   
   To be clear, there is no extra I/O being done in practice for 
`newInputfile(DataFile)` since that code path does not look to be exercised yet 
in the latest release.
   
What's remaining is the `newInputFile(ManifestFile)` where there would be 
an extra I/O during engine planning. Since the I/O is an extra head request to 
object stores which typically is low double digit milliseconds + the fact that 
only a subset of manifess are read during planning I think that it's not 
significant enough to warrant a patch release, but I can be convinced otherwise 
if folks hit a latency regression this in their workloads. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-17 Thread via GitHub


ajantha-bhat commented on PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#issuecomment-2002774794

   ping @danielcweeks, @rdblue, @nastra   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-14 Thread via GitHub


amogh-jahagirdar commented on code in PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#discussion_r1524996255


##
aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIO.java:
##
@@ -377,6 +384,50 @@ public void testResolvingFileIOLoad() {
 Assertions.assertThat(result).isInstanceOf(S3FileIO.class);
   }
 
+  @Test
+  public void testInputFileWithDataFile() throws IOException {
+String location = "s3://bucket/path/to/data-file.parquet";
+DataFile dataFile =
+DataFiles.builder(PartitionSpec.unpartitioned())
+.withPath(location)
+.withFileSizeInBytes(123L)
+.withFormat(FileFormat.PARQUET)
+.withRecordCount(123L)
+.build();
+OutputStream outputStream = s3FileIO.newOutputFile(location).create();
+byte[] data = "testing".getBytes();
+outputStream.write(data);
+outputStream.close();
+
+InputFile inputFile = s3FileIO.newInputFile(dataFile);
+Assertions.assertThat(inputFile.getLength())
+.as("Data file length should be determined from the file size stats")
+.isEqualTo(123L);
+  }
+
+  @Test
+  public void testInputFileWithManifest() throws IOException {
+String dataFileLocation = "s3://bucket/path/to/data-file-2.parquet";
+DataFile dataFile =
+DataFiles.builder(PartitionSpec.unpartitioned())
+.withPath(dataFileLocation)
+.withFileSizeInBytes(123L)
+.withFormat(FileFormat.PARQUET)
+.withRecordCount(123L)
+.build();
+String manifestLocation = "s3://bucket/path/to/manifest.avro";
+OutputFile outputFile = s3FileIO.newOutputFile(manifestLocation);
+ManifestWriter writer =
+ManifestFiles.write(PartitionSpec.unpartitioned(), outputFile);
+writer.add(dataFile);
+writer.close();
+ManifestFile manifest = writer.toManifestFile();
+InputFile inputFile = s3FileIO.newInputFile(manifest);
+Assertions.assertThat(inputFile.getLength())
+.as("Manifest file length should be determined from the file size 
stats")
+.isEqualTo(manifest.length());

Review Comment:
   Ah yeah good point, let me see how to get a more useful test for the 
manifest case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-14 Thread via GitHub


ajantha-bhat commented on code in PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#discussion_r1524376106


##
aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIO.java:
##
@@ -377,6 +384,50 @@ public void testResolvingFileIOLoad() {
 Assertions.assertThat(result).isInstanceOf(S3FileIO.class);
   }
 
+  @Test
+  public void testInputFileWithDataFile() throws IOException {
+String location = "s3://bucket/path/to/data-file.parquet";
+DataFile dataFile =
+DataFiles.builder(PartitionSpec.unpartitioned())
+.withPath(location)
+.withFileSizeInBytes(123L)
+.withFormat(FileFormat.PARQUET)
+.withRecordCount(123L)
+.build();
+OutputStream outputStream = s3FileIO.newOutputFile(location).create();
+byte[] data = "testing".getBytes();
+outputStream.write(data);
+outputStream.close();
+
+InputFile inputFile = s3FileIO.newInputFile(dataFile);
+Assertions.assertThat(inputFile.getLength())
+.as("Data file length should be determined from the file size stats")
+.isEqualTo(123L);
+  }
+
+  @Test
+  public void testInputFileWithManifest() throws IOException {
+String dataFileLocation = "s3://bucket/path/to/data-file-2.parquet";
+DataFile dataFile =
+DataFiles.builder(PartitionSpec.unpartitioned())
+.withPath(dataFileLocation)
+.withFileSizeInBytes(123L)
+.withFormat(FileFormat.PARQUET)
+.withRecordCount(123L)
+.build();
+String manifestLocation = "s3://bucket/path/to/manifest.avro";
+OutputFile outputFile = s3FileIO.newOutputFile(manifestLocation);
+ManifestWriter writer =
+ManifestFiles.write(PartitionSpec.unpartitioned(), outputFile);
+writer.add(dataFile);
+writer.close();
+ManifestFile manifest = writer.toManifestFile();
+InputFile inputFile = s3FileIO.newInputFile(manifest);
+Assertions.assertThat(inputFile.getLength())
+.as("Manifest file length should be determined from the file size 
stats")
+.isEqualTo(manifest.length());

Review Comment:
   minor: This will not confirm whether `manifest.length` was used or file size 
is computed by reading the file. As the both values can be same.
   The above `dataFileTestcase` validates right. Maybe we have mock the 
`manifest.length` to confirm that value is used. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-13 Thread via GitHub


amogh-jahagirdar commented on code in PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#discussion_r1523786976


##
api/src/main/java/org/apache/iceberg/io/FileIO.java:
##
@@ -49,25 +49,25 @@ default InputFile newInputFile(String path, long length) {
   default InputFile newInputFile(DataFile file) {
 Preconditions.checkArgument(
 file.keyMetadata() == null,
-"Cannot decrypt data file: {} (use EncryptingFileIO)",
+"Cannot decrypt data file: %s (use EncryptingFileIO)",
 file.path());
-return newInputFile(file.path().toString());
+return newInputFile(file.path().toString(), file.fileSizeInBytes());

Review Comment:
   no I confused myself, actually the default is fine. It's up to FileIO 
implementations to override newInputfile with a size and keep track of the 
size. That's what S3FileIO and others do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-13 Thread via GitHub


amogh-jahagirdar commented on code in PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#discussion_r1523777029


##
api/src/main/java/org/apache/iceberg/io/FileIO.java:
##
@@ -49,25 +49,25 @@ default InputFile newInputFile(String path, long length) {
   default InputFile newInputFile(DataFile file) {
 Preconditions.checkArgument(
 file.keyMetadata() == null,
-"Cannot decrypt data file: {} (use EncryptingFileIO)",
+"Cannot decrypt data file: %s (use EncryptingFileIO)",
 file.path());
-return newInputFile(file.path().toString());
+return newInputFile(file.path().toString(), file.fileSizeInBytes());

Review Comment:
   actually not quite right, since the default implementation of newInputfile 
with a size actually ignores the size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-13 Thread via GitHub


amogh-jahagirdar commented on code in PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#discussion_r1523777029


##
api/src/main/java/org/apache/iceberg/io/FileIO.java:
##
@@ -49,25 +49,25 @@ default InputFile newInputFile(String path, long length) {
   default InputFile newInputFile(DataFile file) {
 Preconditions.checkArgument(
 file.keyMetadata() == null,
-"Cannot decrypt data file: {} (use EncryptingFileIO)",
+"Cannot decrypt data file: %s (use EncryptingFileIO)",
 file.path());
-return newInputFile(file.path().toString());
+return newInputFile(file.path().toString(), file.fileSizeInBytes());

Review Comment:
   actually not quite right, since the default implementation of newInputfile 
with a path actually ignores the path.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-13 Thread via GitHub


amogh-jahagirdar commented on code in PR #9953:
URL: https://github.com/apache/iceberg/pull/9953#discussion_r1523734039


##
api/src/main/java/org/apache/iceberg/io/FileIO.java:
##
@@ -51,23 +51,23 @@ default InputFile newInputFile(DataFile file) {
 file.keyMetadata() == null,
 "Cannot decrypt data file: {} (use EncryptingFileIO)",
 file.path());
-return newInputFile(file.path().toString());
+return newInputFile(file.path().toString(), file.fileSizeInBytes());
   }
 
   default InputFile newInputFile(DeleteFile file) {
 Preconditions.checkArgument(
 file.keyMetadata() == null,
 "Cannot decrypt delete file: {} (use EncryptingFileIO)",
 file.path());
-return newInputFile(file.path().toString());
+return newInputFile(file.path().toString(), file.fileSizeInBytes());
   }
 
   default InputFile newInputFile(ManifestFile manifest) {
 Preconditions.checkArgument(
 manifest.keyMetadata() == null,
 "Cannot decrypt manifest: {} (use EncryptingFileIO)",
 manifest.path());
-return newInputFile(manifest.path());
+return newInputFile(manifest.path(), manifest.length());

Review Comment:
   Let me see about adding tests for this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementation to pass lengths [iceberg]

2024-03-13 Thread via GitHub


amogh-jahagirdar opened a new pull request, #9953:
URL: https://github.com/apache/iceberg/pull/9953

   As part of adding encryption support, in 
https://github.com/apache/iceberg/pull/9592 we added some new FileIO APIs, 
namely 
   
   ```
   newInputFile(ManifestFile)
   newInputFile(DataFile)
   newInputFile(DeleteFile)
   ```
   The overriden implementaiton in EncryptedFileIO is correct but the default 
implementation in `FileIO` for these new APIs should pass in a length since 
it's always known from the Iceberg metadata. 
   
   Without this, FileIO implementations which end up calling these default 
implementations will make extra requests to the object store/file system to 
determine the length which we can avoid. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org