Hi there,

For several days I have been trying to find the right configuration for my 
pipeline, which roughly follows this schema: 
RabbitMQ -> PyFlink -> Nessie/Iceberg/S3.

For everything I am going to describe, I have tried both locally and through 
the official Flink Docker images.

I have tried several different Flink versions, but for simplicity let's say I 
am using apache-flink==1.18.0. So far I have been able to use the jar in 
org/apache/iceberg/iceberg-flink-runtime-1.18 to connect to RabbitMQ and 
obtain the data from some streams, so I'd say the source side is working.

After that, I have been trying to find a way to send the data in those streams 
to Iceberg on S3 through the Nessie catalog, which is the one I have working. 
I have been using this catalog with both Spark and Trino for some time now, so 
I know it works. What I am "simply" trying to do now is use my already 
set up Nessie catalog through Flink.

I have tried to connect both directly through sql-client.sh in the bin 
directory of the pyflink installation and through Python, as follows:
    table_env.execute_sql(f"""
        CREATE CATALOG nessie WITH (
        'type'='iceberg',
        'catalog-impl'='org.apache.iceberg.nessie.NessieCatalog',
        'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
        'uri'='http://mynessieip:mynessieport/api/v1',
        'ref'='main',
        'warehouse'='s3a://mys3warehouse',
        's3-access-key-id'='{USER_AWS}',
        's3-secret-access-key'='{KEY_AWS}',
        's3-path-style-access'='true',
        'client-region'='eu-west-1')""")
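For context, once the CREATE CATALOG statement goes through, the rest of what I intend to run is roughly the sketch below (the db namespace and the events table with its columns are just placeholders for my real schema, and in my actual script the catalog DDL above has already been executed on the same TableEnvironment):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# In my real script this is the same TableEnvironment on which the
# CREATE CATALOG nessie statement above was already executed.
table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

table_env.execute_sql("USE CATALOG nessie")
# Placeholder namespace/table/column names, not my real schema.
table_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS db.events (
        id BIGINT,
        payload STRING
    )""")
table_env.execute_sql("INSERT INTO db.events VALUES (1, 'test')").wait()
```

This fragment needs the live Nessie/S3 services to actually do anything; I only include it to show what I am aiming for.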

The jars I have included in my pyflink/lib dir (one of the many combinations 
I've tried with no result; I also tried adding them with env.add_jars or 
--jarfile) are:

  *   hadoop-aws-3.4.0.jar
  *   iceberg-flink-runtime-1.18-1.5.0.jar
  *   hadoop-client-3.4.0.jar
  *   hadoop-common-3.4.0.jar
  *   hadoop-hdfs-3.4.0.jar
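For reference, when I add them from Python instead of dropping them in pyflink/lib, I build the semicolon-separated list of file:// URLs that the pipeline.jars option expects, along these lines (the /opt/flink/lib path is just where my jars happen to live):

```python
# Build the value expected by:
#   table_env.get_config().set("pipeline.jars", pipeline_jars)
jar_dir = "/opt/flink/lib"  # hypothetical location of the jars
jar_names = [
    "iceberg-flink-runtime-1.18-1.5.0.jar",
    "hadoop-aws-3.4.0.jar",
]
# Flink expects absolute file:// URLs, joined with semicolons.
pipeline_jars = ";".join(f"file://{jar_dir}/{n}" for n in jar_names)
print(pipeline_jars)
```

The same URLs also work with StreamExecutionEnvironment's env.add_jars, as far as I can tell.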

Right now I am getting the following error message:

 py4j.protocol.Py4JJavaError: An error occurred while calling o56.executeSql.
: java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/HdfsConfiguration (...)
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hdfs.HdfsConfiguration...
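To figure out which of my jars (if any) actually bundles the missing class, I have been scanning them with a small stdlib helper like this (again, /opt/flink/lib is just where my jars live):

```python
import zipfile
from pathlib import Path

def jars_containing(lib_dir, class_path):
    """Return the names of jars under lib_dir that bundle the given class file."""
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if class_path in zf.namelist():
                hits.append(jar.name)
    return hits

# Which jar, if any, ships the class from the stack trace?
print(jars_containing("/opt/flink/lib",
                      "org/apache/hadoop/hdfs/HdfsConfiguration.class"))
```

In my case this is how I confirm whether a NoClassDefFoundError is really a missing jar or a clash between jars.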


But I have gotten several different errors across all the jar combinations I 
have tried. So my question is: does anybody know whether my problem is 
jar-related, or whether I am doing something else wrong? I would be immensely 
grateful if someone could guide me through the right steps to implement this 
pipeline.

Thanks :)
