FabricioZGalvani opened a new issue, #8427:
URL: https://github.com/apache/iceberg/issues/8427

   ### Apache Iceberg version
   
   1.3.0
   
   ### Query engine
   
   Athena
   
   ### Please describe the bug 🐞
   
   ## ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split s3
   This error recurs intermittently when querying an Iceberg table from AWS Athena.
   
   ## Description
   
   I am running queries against an Iceberg table from AWS Athena. The data is
   stored in S3, and I am using EMR 6.12.0, Iceberg 1.3.0-amzn-0, and Spark 3.4.0.
   An ingestion job on EMR consumes data from a Kafka topic and writes it to the
   Iceberg table in S3. The query sometimes succeeds, but at other times it fails
   in Athena with the following error:
   
   > ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split 
my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet 
(offset=4, length=16038): Incorrect file size (16042) for file (end of stream 
not reached): 
my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet
   
   The error occurs only in Athena; running a query on the table using Spark 
works fine.
   
   ## Steps to Reproduce
   
   1. Configured EMR with version 6.12.0 and Spark 3.4.0.
   2. Set up an ingestion process on EMR to consume data from a Kafka topic and 
insert it into an Iceberg table on S3.
   3. Created an Iceberg table on S3 using Iceberg version 1.3.0-amzn-0 and the 
following properties:
   
       ```sql
       OPTIONS (
           'format-version'='2',
           'write.target-file-size-bytes'='124217728',
           'history.expire.max-snapshot-age-ms'='172800000'
       )
       PARTITIONED BY (bucket(10, my_pk), months(created_at))
       ```
     
   4. Data write process executed in Spark:
   ```python
   # Append the available data to the Iceberg table in one run, then stop.
   query = (
       df.writeStream.format("iceberg")
       .outputMode("append")
       .trigger(once=True)
       .option("path", iceberg_table)
       .option("fanout-enabled", "true")
       .option("checkpointLocation", checkpoint_location)
   )

   query.toTable(iceberg_table).awaitTermination()
   ```
   
   5. Tried running a query in AWS Athena.
   `SELECT * FROM "db"."table" limit 10;`
   
   ## Expected Result
   I expected the query in AWS Athena to run without any issues.
   
   ## Actual Result
   
   I receive a recurring ICEBERG_CANNOT_OPEN_SPLIT error, which appears to
   indicate a problem with the recorded file size or with reading the data
   from S3.
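
One possible way to narrow this down, as a rough sketch only: the split in the error covers offset 4 plus length 16038, which adds up to the reported file size of 16042 bytes. Assuming that this size comes from the Iceberg metadata (an assumption, not something confirmed here), it may help to compare the sizes recorded in the table's `files` metadata table (`file_path`, `file_size_in_bytes`) with the object sizes S3 actually reports. The helper name `find_size_mismatches` and the sample paths and sizes below are hypothetical; how the two dictionaries get populated is left out.

```python
from typing import Dict, List, Tuple


def find_size_mismatches(
    recorded: Dict[str, int], actual: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (path, recorded_size, actual_size) for files whose sizes differ.

    A file that is missing from `actual` is reported with an actual size of -1.
    """
    mismatches = []
    for path, recorded_size in sorted(recorded.items()):
        actual_size = actual.get(path, -1)
        if actual_size != recorded_size:
            mismatches.append((path, recorded_size, actual_size))
    return mismatches


# Hypothetical sample data, not taken from the real table.
# `recorded_sizes` would come from the Iceberg `files` metadata table,
# `actual_sizes` from the object sizes reported by S3.
recorded_sizes = {"data/a.parquet": 16042, "data/b.parquet": 20480}
actual_sizes = {"data/a.parquet": 17000, "data/b.parquet": 20480}

for path, rec, act in find_size_mismatches(recorded_sizes, actual_sizes):
    print(f"{path}: metadata says {rec} bytes, S3 reports {act} bytes")
```

Any path that shows up in the result would be a candidate for the file Athena fails to open; an empty result would point away from a metadata and object size mismatch.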
   
   ## Additional Information
   
   - EMR Version: 6.12.0
   - Iceberg Version: 1.3.0-amzn-0
   - Spark Version: 3.4.0
   - We are using Glue as the catalog
   
   I am open to providing more information as needed. Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

