umathivagit commented on issue #7396: URL: https://github.com/apache/iceberg/issues/7396#issuecomment-2248469430
Dear all, I faced a similar issue and found that the samples in both the Iceberg and MinIO official documentation do not work as given; I also could not find a working example anywhere on the internet for using MinIO as the storage location for Iceberg metadata/data files. Below is a working version of managing an Iceberg catalog on MinIO. The main problems were a version mismatch between the Iceberg Spark runtime and the AWS bundle, plus a few additional options I had to set to get this working. After a successful run you should see the table created inside MinIO as below.

Most importantly, create access keys in the MinIO console (as highlighted in the second screenshot: click the icon to bring up the Access Keys page, then create a key pair). Those keys are the ones you need to pass in the following configuration while creating the Spark session:

```python
.config('spark.hadoop.fs.s3a.access.key', "<<accesskey>>") \
.config('spark.hadoop.fs.s3a.secret.key', "<<secretkey>>") \
```

Please also make sure to add the following user environment variables:

```
AWS_ACCESS_KEY_ID=accesskey
AWS_SECRET_ACCESS_KEY=secretkey
AWS_REGION=us-east-1
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
MINIO_REGION=us-east-1
```

This code base stores the catalog in Postgres and the metadata/data files in MinIO object storage.

**Here is the working code:**

```python
from pyspark.sql import SparkSession

# Initialize Spark session with Iceberg JDBC catalog configuration.
# Note: the iceberg-spark-runtime and iceberg-aws-bundle versions must match.
spark = SparkSession.builder \
    .appName("udaydemo-app") \
    .config('spark.jars.packages',
            'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,'
            'org.postgresql:postgresql:42.2.23,'
            'org.apache.iceberg:iceberg-aws-bundle:1.5.2') \
    .config('spark.sql.catalog.uday_minio_catalog', 'org.apache.iceberg.spark.SparkCatalog') \
    .config("spark.sql.catalog.uday_minio_catalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \
    .config("spark.sql.catalog.uday_minio_catalog.uri",
            "jdbc:postgresql://<<replace your postgres host address>>/<<database name>>") \
    .config("spark.sql.catalog.uday_minio_catalog.verifyServerCertificate", "true") \
    .config("spark.sql.catalog.uday_minio_catalog.useSSL", "true") \
    .config("spark.sql.catalog.uday_minio_catalog.jdbc.user", "<<replace your username>>") \
    .config("spark.sql.catalog.uday_minio_catalog.jdbc.password", "<<replace your password>>") \
    .config("spark.sql.catalog.uday_minio_catalog.jdbc.driver", "org.postgresql.Driver") \
    .config("spark.sql.catalog.uday_minio_catalog.warehouse", "s3a://demo-icare") \
    .config("spark.sql.catalog.uday_minio_catalog.s3.endpoint", "http://127.0.0.1:9000") \
    .config("spark.sql.catalog.uday_minio_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.uday_minio_catalog.s3.path-style-access", "true") \
    .config('spark.hadoop.fs.s3a.access.key', "<<minio access key>>") \
    .config("spark.hadoop.fs.s3a.secret.key", "<<minio secret key>>") \
    .config('spark.hadoop.fs.s3a.endpoint.region', 'us-east-1') \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .config("spark.executor.heartbeatInterval", "300000") \
    .config("spark.network.timeout", "400000") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.attempts.maximum", "1") \
    .config("spark.hadoop.fs.s3a.connection.establish.timeout", "5000") \
    .config("spark.hadoop.fs.s3a.connection.timeout", "10000") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

# Create an Iceberg table
spark.sql("""
CREATE TABLE IF NOT EXISTS uday_minio_catalog.product (
    id INT,
    name STRING,
    price INT
) USING iceberg""")

spark.sql("""
INSERT INTO uday_minio_catalog.product VALUES
    (1, 'laptop', 50000),
    (2, 'workstation', 100000),
    (3, 'server', 250000)
""")

spark.sql("SELECT * FROM uday_minio_catalog.product").show(truncate=False)

spark.stop()
```
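The environment variables above can also be set from inside the script, as long as it happens before the Spark session is created. A minimal sketch, using the same placeholder values as above (substitute the access keys you created in the MinIO console):

```python
import os

# Placeholder credentials -- replace with the access keys created in the
# MinIO console and your actual MinIO root credentials.
env = {
    "AWS_ACCESS_KEY_ID": "accesskey",
    "AWS_SECRET_ACCESS_KEY": "secretkey",
    "AWS_REGION": "us-east-1",
    "MINIO_ROOT_USER": "minioadmin",
    "MINIO_ROOT_PASSWORD": "minioadmin",
    "MINIO_REGION": "us-east-1",
}
# Must run before SparkSession.builder...getOrCreate(), since the AWS SDK
# reads these at client-initialization time.
os.environ.update(env)
```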
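On why the path-style-access settings matter: by default the AWS SDK uses virtual-hosted-style addressing, which puts the bucket name into the hostname; that cannot work against a bare IP endpoint like `127.0.0.1:9000`, which is why MinIO setups force path-style requests. A small illustration of the two URL shapes (`make_url` is a hypothetical helper for demonstration, not part of any library):

```python
def make_url(endpoint: str, bucket: str, key: str, path_style: bool) -> str:
    """Build the request URL for an object under the two S3 addressing styles."""
    scheme, host = endpoint.split("://", 1)
    if path_style:
        # Path-style: bucket appears in the URL path -- works with an IP endpoint.
        return f"{scheme}://{host}/{bucket}/{key}"
    # Virtual-hosted style: bucket becomes a hostname label, which cannot
    # resolve when the endpoint is a bare IP:port like 127.0.0.1:9000.
    return f"{scheme}://{bucket}.{host}/{key}"

print(make_url("http://127.0.0.1:9000", "demo-icare", "product/metadata/v1.json", True))
# http://127.0.0.1:9000/demo-icare/product/metadata/v1.json
```

This is why both `s3.path-style-access` (for Iceberg's S3FileIO) and `spark.hadoop.fs.s3a.path.style.access` (for the s3a filesystem) are set to `true` in the session config above.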
