luoyuxia commented on code in PR #2555: URL: https://github.com/apache/fluss/pull/2555#discussion_r2758804785
##########
website/docs/quickstart/lakehouse.md:
##########
Review Comment:
I also hit a Parquet-related class conflict. Shading Parquet resolves it:
```
<relocation>
    <pattern>org.apache.parquet</pattern>
    <shadedPattern>org.apache.iceberg.shaded.org.apache.parquet</shadedPattern>
</relocation>
```

##########
website/docs/quickstart/lakehouse.md:
##########
Review Comment:
We can remove this line since it's already in `streaming` mode.

##########
website/docs/quickstart/lakehouse.md:
##########
Review Comment:
Maybe in the next release we can consider not bundling Iceberg-related classes in `fluss-lake-fluss`, just like what we did for Paimon in #2531.

##########
website/docs/quickstart/lakehouse.md:
##########
@@ -155,37 +155,60 @@ mkdir fluss-quickstart-iceberg
 cd fluss-quickstart-iceberg
 ```
-2. Create a `lib` directory and download the required Hadoop jar file:
+2. Create directories and download required jars:
 ```shell
-mkdir lib
-wget -O lib/hadoop-apache-3.3.5-2.jar https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar
-```
+mkdir -p lib opt
-This jar file provides Hadoop 3.3.5 dependencies required for Iceberg's Hadoop catalog integration.
+# Flink connectors
+wget -O lib/flink-faker.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar

Review Comment:
Can we keep the version in the file name so that it's easy for users to track?

##########
website/docs/quickstart/lakehouse.md:
##########
@@ -199,9 +222,8 @@ services:
       datalake.iceberg.warehouse: /tmp/iceberg
     volumes:
       - shared-tmpfs:/tmp/iceberg

Review Comment:
From Fluss 0.9, we will also need to mount `/tmp/fluss`:
```
shared-tmpfs:/tmp/fluss
```

##########
website/docs/quickstart/lakehouse.md:
##########
@@ -220,9 +242,11 @@ services:
       datalake.iceberg.warehouse: /tmp/iceberg
     volumes:
       - shared-tmpfs:/tmp/iceberg

Review Comment:
From Fluss 0.9, we will also need to mount `/tmp/fluss`:
```
shared-tmpfs:/tmp/fluss
```

##########
website/docs/quickstart/lakehouse.md:
##########
Review Comment:
I encountered
```
inherit an implementation of the resolved method 'abstract void generate(com.fasterxml.jackson.core.JsonGenerator)' of interface org.apache.iceberg.util.JsonUtil$ToJson.
```
when running this SQL.

The reason is that `iceberg-flink` introduces a `JsonUtil` that already shades `com.fasterxml.jackson` to `org.apache.iceberg.shaded.com.fasterxml.jackson`. `fluss-lake-iceberg` also introduces a `JsonUtil`, but it doesn't shade `com.fasterxml.jackson`, so the class conflict happens.
To solve it, we need to shade `com.fasterxml.jackson` in our `fluss-lake-iceberg` module:
```
<configuration>
    <artifactSet>
        <includes>
            <include>*:*</include>
        </includes>
    </artifactSet>
    <relocations>
        <relocation>
            <pattern>com.fasterxml.jackson</pattern>
            <shadedPattern>org.apache.iceberg.shaded.com.fasterxml.jackson</shadedPattern>
        </relocation>
        <relocation>
            <pattern>org.apache.parquet</pattern>
            <shadedPattern>org.apache.iceberg.shaded.org.apache.parquet</shadedPattern>
        </relocation>
    </relocations>
    <filters>
        <filter>
            <artifact>*</artifact>
            <excludes>
                <exclude>LICENSE</exclude>
                <exclude>NOTICE</exclude>
                <exclude>META-INF/versions/21/**</exclude>
                <exclude>META-INF/versions/17/**</exclude>
            </excludes>
        </filter>
    </filters>
</configuration>
```

##########
website/docs/quickstart/lakehouse.md:
##########
@@ -155,37 +155,60 @@ mkdir fluss-quickstart-iceberg
 cd fluss-quickstart-iceberg
 ```
-2. Create a `lib` directory and download the required Hadoop jar file:
+2. Create directories and download required jars:
 ```shell
-mkdir lib
-wget -O lib/hadoop-apache-3.3.5-2.jar https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar
-```
+mkdir -p lib opt
-This jar file provides Hadoop 3.3.5 dependencies required for Iceberg's Hadoop catalog integration.
+# Flink connectors
+wget -O lib/flink-faker.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar
+wget -O lib/fluss-flink.jar "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_DOCKER_VERSION$/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar"
+wget -O lib/iceberg-flink-runtime.jar "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.20/1.10.1/iceberg-flink-runtime-1.20-1.10.1.jar"
-:::info
-The `lib` directory serves as a staging area for additional jars needed by the Fluss coordinator server.
The docker-compose configuration (see step 3) mounts this directory and copies all jars to `/opt/fluss/plugins/iceberg/` inside the coordinator container at startup.
+# Fluss lake plugin
+wget -O lib/fluss-lake-iceberg.jar https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-iceberg/$FLUSS_DOCKER_VERSION$/fluss-lake-iceberg-$FLUSS_DOCKER_VERSION$.jar
+
+# Hadoop filesystem support
+wget -O lib/hadoop-apache.jar https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar
+wget -O lib/failsafe.jar https://repo1.maven.org/maven2/dev/failsafe/failsafe/3.3.2/failsafe-3.3.2.jar
+# Tiering service
+wget -O opt/fluss-flink-tiering.jar https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-tiering/$FLUSS_DOCKER_VERSION$/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar
+```
+
+:::info
 You can add more jars to this `lib` directory based on your requirements:
 - **Cloud storage support**: For AWS S3 integration with Iceberg, add the corresponding Iceberg bundle jars (e.g., `iceberg-aws-bundle`)
 - **Custom Hadoop configurations**: Add jars for specific HDFS distributions or custom authentication mechanisms
 - **Other catalog backends**: Add jars needed for alternative Iceberg catalog implementations (e.g., Rest, Hive, Glue)
-
-Any jar placed in the `lib` directory will be automatically loaded by the Fluss coordinator server, making it available for Iceberg integration.
 :::
-3. Create a `docker-compose.yml` file with the following content:
+3. Create a `Dockerfile` for the custom Flink image:

Review Comment:
It will be hard for users to build the Flink image themselves. IIUC, the problem is that the Flink image runs as user `flink`, which causes the permission issue.
I solved it with the following content:
```
  jobmanager:
    image: flink:1.20-scala_2.12-java17
    ports:
      - "8083:8081"
    entrypoint: ["/bin/bash", "-c"]
    command: >
      "sed -i 's/exec $$(drop_privs_cmd)//g' /docker-entrypoint.sh &&
      cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true;
      /docker-entrypoint.sh jobmanager"
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
    volumes:
      - shared-tmpfs:/tmp/iceberg
      - ./lib:/tmp/jars  # Mount the JARs directory

  taskmanager:
    image: flink:1.20-scala_2.12-java17
    depends_on:
      - jobmanager
    entrypoint: ["/bin/bash", "-c"]
    command: >
      "sed -i 's/exec $$(drop_privs_cmd)//g' /docker-entrypoint.sh &&
      cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true;
      /docker-entrypoint.sh taskmanager"
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 10
        taskmanager.memory.process.size: 2048m
        taskmanager.memory.framework.off-heap.size: 256m
    volumes:
      - shared-tmpfs:/tmp/iceberg
      - ./lib:/tmp/jars  # Mount the JARs directory
```

##########
website/docs/quickstart/lakehouse.md:
##########
Review Comment:
Since we use the standard Flink image, there is no `tree` command in the `taskmanager` container. I think we can remove this part.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
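
The Jackson and Parquet relocations suggested in the review comments above could be sanity-checked by listing which classes in a built jar still live under the original package. The following is only a minimal sketch: the helper names `unrelocated_classes` and `build_demo_jar` are hypothetical, and the demo jar is built in memory rather than taken from a real `fluss-lake-iceberg` build.

```python
import io
import zipfile


def unrelocated_classes(jar, package):
    """Return the .class entries in `jar` that still live under `package`.

    `jar` is a path or file-like object; `package` is a dotted Java package
    name such as "com.fasterxml.jackson". (Hypothetical helper.)
    """
    prefix = package.replace(".", "/") + "/"
    with zipfile.ZipFile(jar) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if name.startswith(prefix) and name.endswith(".class")
        )


def build_demo_jar(entries):
    """Build an in-memory jar (a zip of empty files), for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in entries:
            zf.writestr(name, b"")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Assumed contents: one class left unshaded, one already relocated.
    demo = build_demo_jar([
        "com/fasterxml/jackson/core/JsonGenerator.class",
        "org/apache/iceberg/shaded/com/fasterxml/jackson/core/JsonGenerator.class",
    ])
    for name in unrelocated_classes(demo, "com.fasterxml.jackson"):
        print("not relocated:", name)
```

Passing the path of an actual shaded jar instead of the demo buffer should return an empty list for `com.fasterxml.jackson` and `org.apache.parquet` if the relocations took effect.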
