FabricioZGalvani opened a new issue, #8427: URL: https://github.com/apache/iceberg/issues/8427
### Apache Iceberg version

1.3.0

### Query engine

Athena

### Please describe the bug 🐞

## ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split s3

Recurring Issue with AWS Athena When Running Queries on Iceberg Table.

## Description

I am trying to run queries on an Iceberg table using AWS Athena. The data is stored in S3, and I am using EMR 6.12.0, Iceberg 1.3.0-amzn-0, and Spark 3.4.0.

The ingestion process runs on EMR: it consumes data from a Kafka topic and writes it to my Iceberg table in S3.

The failure is intermittent. Sometimes the query runs successfully, but other times I get the following error in Athena:

> ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet (offset=4, length=16038): Incorrect file size (16042) for file (end of stream not reached): my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet

The error occurs only in Athena; querying the same table with Spark works fine.

## Steps to Reproduce

1. Configured EMR with version 6.12.0 and Spark 3.4.0.
2. Set up an ingestion process on EMR to consume data from a Kafka topic and insert it into an Iceberg table on S3.
3. Created an Iceberg table on S3 using Iceberg version 1.3.0-amzn-0 and the following properties:

   ```sql
   OPTIONS (
       'format-version'='2',
       'write.target-file-size-bytes'='124217728',
       'history.expire.max-snapshot-age-ms'='172800000'
   )
   PARTITIONED BY (bucket(10, my_pk), months(created_at))
   ```

4. Data write process executed in Spark:

   ```python
   query = (
       df.writeStream.format("iceberg")
       .outputMode("append")
       .trigger(once=True)
       .option("path", iceberg_table)
       .option("fanout-enabled", "true")
       .option(
           "checkpointLocation",
           checkpoint_location,
       )
   )
   query.toTable(iceberg_table).awaitTermination()
   ```

5. Tried running a query in AWS Athena:

   `SELECT * FROM "db"."table" limit 10;`

## Expected Result

I expected the query in AWS Athena to run without any issues.
## Actual Result

I am receiving a recurring error, ICEBERG_CANNOT_OPEN_SPLIT, which appears to indicate a problem with the file size or with the data streaming from S3.

## Additional Information

- EMR Version: 6.12.0
- Iceberg Version: 1.3.0-amzn-0
- Spark Version: 3.4.0
- We are using Glue as the catalog

I am open to providing more information as needed. Thank you!
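## Possible diagnostic (sketch)

The numbers in the error are internally consistent: offset 4 plus length 16038 equals the reported file size of 16042, so the split ends exactly where the reader expected the file to end. One reading of "end of stream not reached" is that the object in S3 is longer than the size the reader was given, but that is an assumption, not something confirmed here. A way to check it would be to compare the size Iceberg recorded for each data file (the `file_path` and `file_size_in_bytes` columns of the table's `files` metadata table) with the size S3 reports (for example `ContentLength` from a HeadObject call). The sketch below only does the comparison step; fetching both mappings is left out, and the helper name `find_size_mismatches` and the sample "actual" size are hypothetical.

```python
from typing import Dict, List, NamedTuple


class SizeMismatch(NamedTuple):
    path: str
    recorded_bytes: int  # size recorded in table metadata
    actual_bytes: int    # size reported by the object store


def find_size_mismatches(
    recorded: Dict[str, int], actual: Dict[str, int]
) -> List[SizeMismatch]:
    """Return files whose recorded size differs from the stored object's size.

    Paths present in `recorded` but absent from `actual` are skipped,
    not reported, so a partial listing does not produce false alarms.
    """
    mismatches = []
    for path, recorded_bytes in sorted(recorded.items()):
        actual_bytes = actual.get(path)
        if actual_bytes is not None and actual_bytes != recorded_bytes:
            mismatches.append(SizeMismatch(path, recorded_bytes, actual_bytes))
    return mismatches


if __name__ == "__main__":
    # The path and the recorded size come from the error message above.
    # The "actual" size is invented purely to illustrate a mismatch.
    path = "my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet"
    recorded = {path: 16042}
    actual = {path: 20000}  # hypothetical value
    for m in find_size_mismatches(recorded, actual):
        print(f"{m.path}: recorded={m.recorded_bytes} actual={m.actual_bytes}")
```

If this turns up mismatches for files that Athena fails on, that would point at the metadata or the write path rather than at Athena itself. An empty result would suggest looking elsewhere.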