rychu151 opened a new issue, #10222:
URL: https://github.com/apache/iceberg/issues/10222

   ### Query engine
   
   Spark
   
   ### Question
   
   I'm trying to set up a local development environment for testing purposes using Docker.
   
   **The target is to save a dataframe in Iceberg format with Hive metadata.**
   
   Here is my current docker-compose:
   
   ```
   version: "3"
   
   services:
   
     #Jupyter Notebook with PySpark & iceberg Server
     spark-iceberg:
       image: tabulario/spark-iceberg
       container_name: spark-iceberg
       build: spark/
       networks:
         iceberg_net:
       depends_on:
         #- rest
         - minio
       volumes:
         - ./warehouse:/home/iceberg/warehouse
         - ./notebooks:/home/iceberg/notebooks/notebooks
         - ./spark-iceberg/spark/jars/nessie-spark-extensions-3.5_2.12-0.80.0.jar:/opt/spark/jars/nessie-spark-extensions-3.5_2.12-0.80.0.jar
         - ./spark-iceberg/spark/conf/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf
       environment:
         - AWS_ACCESS_KEY_ID=admin
         - AWS_SECRET_ACCESS_KEY=password
         - AWS_REGION=us-east-1
         - USE_STREAM_CAPABLE_STATE_STORE=true
         - CATALOG_WAREHOUSE=s3://warehouse/
       ports:
         - "8888:8888"
         - "8080:8080"
         - "10000:10000"
         - "10001:10001"
         
     # Minio Storage Server
     minio:
       image: bitnami/minio:latest # not minio/minio because of reported issues with the image
       container_name: minio
       environment:
         - MINIO_ROOT_USER=admin
         - MINIO_ROOT_PASSWORD=password
         - MINIO_REGION=us-east-1
         - MINIO_REGION_NAME=us-east-1
       networks:
         iceberg_net:
           aliases:
             - warehouse.minio
       ports:
         - "9001:9001"
         - "9000:9000"
         
     #hive metastore
     hive-metastore:
       image: apache/hive:4.0.0
       container_name: hive-metastore
       networks:
         iceberg_net:
       ports:
         - "9083:9083"
       environment:
           - SERVICE_NAME=metastore
       depends_on:
         - zookeeper
         - postgres
       volumes:
           - ./hive_metastore/conf/hive-site.xml:/opt/hive/conf/hive-site.xml
   ```
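   
   To rule out basic networking, I also run a quick reachability check from inside the spark-iceberg container; a minimal sketch (it assumes it is executed in that container, where the compose service names resolve):
   
   ```
   import socket
   
   # Check that the metastore thrift port and the MinIO S3 port are reachable
   # from the Spark container; service names come from the compose file above.
   for host, port in [("hive-metastore", 9083), ("minio", 9000)]:
       with socket.create_connection((host, port), timeout=5):
           print(f"{host}:{port} is reachable")
   ```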
   
   
   spark-defaults.conf:
   
   ```
   spark.sql.extensions                         org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
   spark.sql.catalog.hive_prod            org.apache.iceberg.spark.SparkCatalog
   spark.sql.catalog.hive_prod.type       hive
   spark.sql.catalog.hive_prod.uri        thrift://hive-metastore:9083
   
   spark.sql.catalog.hive_prod.io-impl          org.apache.iceberg.aws.s3.S3FileIO
   spark.sql.catalog.hive_prod.s3.endpoint      http://minio:9000
   spark.sql.catalog.hive_prod.warehouse        s3://warehouse/
   hive.metastore.uris                    thrift://hive-metastore:9083
   ```
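   
   For completeness, the same catalog settings can also be set programmatically instead of via spark-defaults.conf; a minimal sketch, mirroring the values from the file above:
   
   ```
   from pyspark.sql import SparkSession
   
   # Same hive_prod catalog configuration as spark-defaults.conf, applied at
   # session build time; all values mirror the config file above.
   spark = (
       SparkSession.builder
       .appName("hive_prod_catalog")
       .config("spark.sql.extensions",
               "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
       .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
       .config("spark.sql.catalog.hive_prod.type", "hive")
       .config("spark.sql.catalog.hive_prod.uri", "thrift://hive-metastore:9083")
       .config("spark.sql.catalog.hive_prod.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
       .config("spark.sql.catalog.hive_prod.s3.endpoint", "http://minio:9000")
       .config("spark.sql.catalog.hive_prod.warehouse", "s3://warehouse/")
       .getOrCreate()
   )
   ```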
   
   and hive-site.xml:
   
   ```
   <configuration>
       <property>
           <name>hive.server2.enable.doAs</name>
           <value>false</value>
       </property>
       <property>
           <name>hive.tez.exec.inplace.progress</name>
           <value>false</value>
       </property>
       <property>
           <name>hive.exec.scratchdir</name>
           <value>/opt/hive/scratch_dir</value>
       </property>
       <property>
           <name>hive.user.install.directory</name>
           <value>/opt/hive/install_dir</value>
       </property>
       <property>
           <name>tez.runtime.optimize.local.fetch</name>
           <value>true</value>
       </property>
       <property>
           <name>hive.exec.submit.local.task.via.child</name>
           <value>false</value>
       </property>
       <property>
           <name>mapreduce.framework.name</name>
           <value>local</value>
       </property>
       <property>
           <name>tez.local.mode</name>
           <value>true</value>
       </property>
       <property>
           <name>hive.execution.engine</name>
           <value>tez</value>
       </property>
       <property>
           <name>metastore.metastore.event.db.notification.api.auth</name>
           <value>false</value>
       </property>
       <property>
           <name>hive.metastore.warehouse.dir</name>
           <value>s3a://warehouse/</value>
       </property>
       <property>
           <name>fs.s3a.endpoint</name>
           <value>http://localhost:9000</value>
       </property>
       <property>
           <name>fs.s3a.access.key</name>
           <value>admin</value>
       </property>
       <property>
           <name>fs.s3a.secret.key</name>
           <value>password</value>
       </property>
       <property>
           <name>fs.s3a.path.style.access</name>
           <value>true</value>
       </property>
       <property>
           <name>fs.s3a.impl</name>
           <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
       </property>
       <property>
           <name>fs.s3a.connection.ssl.enabled</name>
           <value>false</value>
       </property>
       <property>
           <name>hive.metastore.authorization.storage.checks</name>
           <value>false</value>
           <description>Disables storage-based authorization checks to allow Hive better compatibility with MinIO.</description>
       </property>
   
   </configuration>
   ```
   
   Using the MinIO UI, I have created a bucket called `warehouse` and set it to public access.
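   
   For reproducibility, the bucket can also be created from code rather than the UI; a minimal sketch using boto3 (the localhost endpoint is an assumption about running this from the host machine):
   
   ```
   import boto3
   
   # Point the client at the local MinIO instead of AWS; credentials match the
   # compose file. The endpoint assumes this runs on the host, not in a container.
   s3 = boto3.client(
       "s3",
       endpoint_url="http://localhost:9000",
       aws_access_key_id="admin",
       aws_secret_access_key="password",
       region_name="us-east-1",
   )
   s3.create_bucket(Bucket="warehouse")
   ```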
   
   
   **The target is to save a dataframe in Iceberg format with Hive metadata**, so that I can browse this data later using Apache Druid.
   
   
   In order to create a table, I use PySpark:
   
   ```
   from pyspark.sql import SparkSession
   from pyspark.sql.types import StructType, StructField, LongType, StringType
   
   col_name = "col_name"
   label_name = "label"
   data_name = "upload_date"
   
   schema = StructType([
       StructField(data_name, LongType(), False),
       StructField(col_name, StringType(), False),
       StructField(label_name, StringType(), False)
   ])
   
   spark = SparkSession.builder.appName("schema_example").enableHiveSupport().getOrCreate()
   spark.conf.set("spark.sql.iceberg.catalog.hive_prod", "DEBUG")
   spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
   
   data = []
   
   df = spark.createDataFrame(data, schema)
   ```
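   
   Once the namespace exists, the write I am ultimately after would look roughly like this (the table name `my_table` is just a placeholder, nothing I have created yet):
   
   ```
   # DataFrameWriterV2: create or replace an Iceberg table in the hive_prod catalog.
   # "testing.my_table" is a hypothetical target table name.
   df.writeTo("hive_prod.testing.my_table").using("iceberg").createOrReplace()
   ```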
   
   spark.sql("SHOW DATABASES ").show() 
   prints only `default` database
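   
   I assume the Iceberg catalog has to be addressed explicitly to inspect it, e.g.:
   
   ```
   # List namespaces through the hive_prod catalog rather than the session catalog.
   spark.sql("SHOW NAMESPACES IN hive_prod").show()
   ```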
   
   
   When I try to create a database like below:
   `spark.sql('CREATE DATABASE IF NOT EXISTS hive_prod.testing')`
   
   I get the following error:
   
   ```
   Py4JJavaError: An error occurred while calling o34.sql.
   : java.lang.RuntimeException: Failed to create namespace testing in Hive Metastore
   	at org.apache.iceberg.hive.HiveCatalog.createNamespace(HiveCatalog.java:299)
   Caused by: MetaException(message:Failed to create external path s3://warehouse/testing.db for database testing. This may result in access not being allowed if the StorageBasedAuthorizationProvider is enabled: null)
   ```
   
   Does anyone understand why this fails?

