Copilot commented on code in PR #2462:
URL: https://github.com/apache/tika/pull/2462#discussion_r2628822570
##########
tika-grpc/docker-build/README.md:
##########
@@ -0,0 +1,191 @@
+# Tika gRPC Docker Build
+
+This directory contains the Docker build configuration for Apache Tika gRPC
server.
+
+## Overview
+
+The Docker image includes:
+- Tika gRPC server JAR
+- All Tika Pipes plugins (fetchers, emitters, iterators)
+- Parser packages (standard, extended, ML)
+- OCR support (Tesseract with multiple languages)
+- GDAL for geospatial formats
+- Common fonts
+
+## Building the Docker Image
+
+### Prerequisites
+
+1. Build Tika from the project root (this builds all modules including
plugins):
+```bash
+cd <tika-root>
+mvn clean install -DskipTests
+```
+
+### Build Activation
+
+The Docker build can be activated in two ways:
+
+**Option 1: Using environment variables (recommended)**
+- Set `DOCKER_ID`, `AWS_ACCOUNT_ID`, or `AZURE_REGISTRY_NAME`
+- Maven profiles automatically detect these and enable the build
+- No need for `-Dskip.docker.build=false`
+
+**Option 2: Using Maven property**
+- Add `-Dskip.docker.build=false` to your Maven command
+- Use when you want explicit control or testing
+
+### Building from Tika Root
+
+**Build tika-grpc and dependencies only:**
+```bash
+DOCKER_ID=myusername \
+ mvn clean install -DskipTests -pl :tika-grpc -am
+```
+
+**Build entire project:**
+```bash
+DOCKER_ID=myusername \
+ mvn clean install -DskipTests
+```
+
+### Building from tika-grpc Directory
+
+#### Controlling Docker Build with Environment Variables
+
+All docker-build.sh environment variables are passed through from your shell.
When these variables are set, the Maven profiles automatically activate the
Docker build.
+
+**Build and push to Docker Hub:**
+```bash
+DOCKER_ID=myusername \
+ mvn package
+```
+
+**Build multi-arch and push to Docker Hub:**
+```bash
+MULTI_ARCH=true DOCKER_ID=myusername \
+ mvn package
+```
+
+**Build and push to AWS ECR:**
+```bash
+AWS_ACCOUNT_ID=123456789012 AWS_REGION=us-east-1 \
+ mvn package
+```
+
+**Build and push to Azure Container Registry:**
+```bash
+AZURE_REGISTRY_NAME=myregistry \
+ mvn package
+```
+
+**Note:** When environment variables are set, you don't need
`-Dskip.docker.build=false`. The Maven profiles detect the variables and
automatically enable the build.
+
+### Option 2: Run the Docker Build Script Manually
+
+Set the required environment variable and run the script:
+
+```bash
+export TIKA_VERSION=4.0.0-SNAPSHOT
+./tika-grpc/docker-build/docker-build.sh
+```
+
+### Optional Environment Variables
+
+- `TIKA_VERSION`: Maven project version (required)
+- `RELEASE_IMAGE_TAG`: Override the default tag (defaults to TIKA_VERSION
without -SNAPSHOT)
+- `DOCKER_ID`: Docker Hub username to push to Docker Hub
+- `AWS_ACCOUNT_ID`: AWS account ID to push to ECR
+- `AWS_REGION`: AWS region for ECR (default: us-west-2)
+- `AZURE_REGISTRY_NAME`: Azure Container Registry name
+- `MULTI_ARCH`: Build for multiple architectures (default: false)
+- `PROJECT_NAME`: Docker image name (default: tika-grpc)
+
+### Examples
+
+**Build with Docker Hub using environment variable:**
+```bash
+DOCKER_ID=myusername \
+ mvn package
+```
+
+**Build multi-arch with Docker Hub:**
+```bash
+MULTI_ARCH=true DOCKER_ID=myusername \
+ mvn package
+```
+
+**Build with AWS ECR:**
+```bash
+AWS_ACCOUNT_ID=123456789012 AWS_REGION=us-east-1 \
+ mvn package
+```
+
+**Build with explicit property (for testing/development):**
+```bash
+mvn package -Dskip.docker.build=false -DDOCKER_ID=myusername
Review Comment:
The command example is incorrect. '-DDOCKER_ID=myusername' sets a Maven
property, not an environment variable, and the profiles are activated by
environment variables (env.DOCKER_ID). Either set the variable in the shell, as
in 'DOCKER_ID=myusername mvn package -Dskip.docker.build=false', or rely on the
environment variable alone and drop the property override.
```suggestion
DOCKER_ID=myusername mvn package -Dskip.docker.build=false
```
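
As a rough illustration of the difference (not taken from the PR), the sketch below uses a child `bash -c` process as a stand-in for Maven. It assumes only that the child reads `DOCKER_ID` from its environment, which is what `env.DOCKER_ID` profile activation relies on.

```bash
#!/bin/bash
# Stand-in for a process that reads DOCKER_ID from its environment.
# The child command is an illustration only; it is not Maven.
unset DOCKER_ID
child='echo "DOCKER_ID=${DOCKER_ID:-unset}"'

# A -D style argument is just an argument; the child's environment is untouched.
with_property=$(bash -c "$child" _ -DDOCKER_ID=myusername)

# An assignment prefixed to the command is exported to that one child process.
with_env=$(DOCKER_ID=myusername bash -c "$child")

echo "$with_property"
echo "$with_env"
```

Only the second form makes the value visible to the child process, so it is the one that can trigger an environment-based profile.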
##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package
phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+ echo "ERROR: Docker is not installed or not in PATH. Please install Docker
first."
+ exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+ echo "Environment variable TIKA_VERSION is required, and should match the
maven project version of Tika"
+ exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+ RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+ ## Remove '-SNAPSHOT' from the version string
+ RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+ plugin_name=$(basename "$dir")
+ zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+ if [ -f "$zip_file" ]; then
+ cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+ else
+ echo "WARNING: Plugin file $zip_file does not exist, skipping."
+ fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+ "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+ "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+ package_name=$(basename "$parser_package")
+ jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+ if [ -f "$jar_file" ]; then
+ cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+ else
+ echo "Parser package file $jar_file does not exist, skipping."
+ fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
+
+cp -v "tika-grpc/docker-build/Dockerfile" "${OUT_DIR}/Dockerfile"
+
+cd "${OUT_DIR}" || exit
+
+echo "Running docker build from directory: $(pwd)"
+
+IMAGE_TAGS=()
+if [[ -n "${AWS_ACCOUNT_ID}" ]]; then
+ if ! aws ecr get-login-password --region "${AWS_REGION}" | docker login
--username AWS --password-stdin
"${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"; then
+ echo "ERROR: Failed to authenticate with AWS ECR"
+ exit 1
+ fi
+ IMAGE_TAGS+=("-t
${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${AZURE_REGISTRY_NAME}" ]]; then
+ if ! az acr login --name "${AZURE_REGISTRY_NAME}"; then
+ echo "ERROR: Failed to authenticate with Azure Container Registry"
+ exit 1
+ fi
+ IMAGE_TAGS+=("-t
${AZURE_REGISTRY_NAME}.azurecr.io/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${DOCKER_ID}" ]]; then
+ IMAGE_TAGS+=("-t ${DOCKER_ID}/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [ ${#IMAGE_TAGS[@]} -eq 0 ]; then
+ echo "No image tags specified, skipping Docker build step. To enable
build, set AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, and/or DOCKER_ID environment
variables."
+ exit 0
+fi
+
+tag="${IMAGE_TAGS[*]}"
+if [ "${MULTI_ARCH}" == "true" ]; then
+ echo "Building multi arch image"
+ docker buildx create --name tikabuilder
+ # Pin binfmt to a specific digest for security
+ # see
https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
+ docker run --rm --privileged
tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2
--install amd64
+ docker run --rm --privileged
tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2
--install arm64
+ docker buildx build \
+ --builder=tikabuilder . \
+ ${tag} \
+ --platform linux/amd64,linux/arm64 \
+ --push
+ docker buildx stop tikabuilder
+ docker buildx rm tikabuilder
Review Comment:
The script should check whether the buildx builder already exists before
attempting to create it. A 'tikabuilder' builder left over from a previous run
(for example, one that aborted before cleanup) makes 'docker buildx create'
fail on the next run. Consider removing the existing builder or reusing it.
```suggestion
BUILDX_BUILDER_NAME="tikabuilder"
BUILDX_CREATED_BY_SCRIPT=false
if [ "${MULTI_ARCH}" == "true" ]; then
echo "Building multi arch image"
if docker buildx inspect "${BUILDX_BUILDER_NAME}" >/dev/null 2>&1; then
echo "Using existing Docker buildx builder '${BUILDX_BUILDER_NAME}'"
else
echo "Creating Docker buildx builder '${BUILDX_BUILDER_NAME}'"
docker buildx create --name "${BUILDX_BUILDER_NAME}"
BUILDX_CREATED_BY_SCRIPT=true
fi
# Pin binfmt to a specific digest for security
# see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
docker run --rm --privileged tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2 --install amd64
docker run --rm --privileged tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2 --install arm64
docker buildx build \
--builder="${BUILDX_BUILDER_NAME}" . \
${tag} \
--platform linux/amd64,linux/arm64 \
--push
if [ "${BUILDX_CREATED_BY_SCRIPT}" = "true" ]; then
docker buildx stop "${BUILDX_BUILDER_NAME}"
docker buildx rm "${BUILDX_BUILDER_NAME}"
fi
```
##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package
phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+ echo "ERROR: Docker is not installed or not in PATH. Please install Docker
first."
+ exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+ echo "Environment variable TIKA_VERSION is required, and should match the
maven project version of Tika"
+ exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+ RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+ ## Remove '-SNAPSHOT' from the version string
+ RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+ plugin_name=$(basename "$dir")
+ zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+ if [ -f "$zip_file" ]; then
+ cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+ else
+ echo "WARNING: Plugin file $zip_file does not exist, skipping."
+ fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+ "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+ "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+ package_name=$(basename "$parser_package")
+ jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+ if [ -f "$jar_file" ]; then
+ cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+ else
+ echo "Parser package file $jar_file does not exist, skipping."
+ fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
+
+cp -v "tika-grpc/docker-build/Dockerfile" "${OUT_DIR}/Dockerfile"
Review Comment:
Missing error handling for the 'cp' commands. If the source JAR file doesn't
exist, the script will continue and produce a broken Docker image. The 'cp'
command should be followed by error checking or use 'set -e' at the beginning
of the script to exit on any error.
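
A minimal sketch of the per-copy check described above. `copy_required` is a hypothetical helper name, not something in docker-build.sh, and the demonstration runs against a throwaway directory rather than the real build tree.

```bash
#!/bin/bash
# Hypothetical helper: copy one required file, returning non-zero if the
# source is missing or the copy fails, so the caller can exit.
copy_required() {
  local src="$1" dest_dir="$2"
  if [ ! -f "$src" ]; then
    echo "ERROR: required file $src does not exist" >&2
    return 1
  fi
  cp -v "$src" "$dest_dir"
}

# Demonstration in a temporary directory (paths are placeholders).
work_dir=$(mktemp -d)
mkdir -p "${work_dir}/libs"
touch "${work_dir}/present.jar"

copy_required "${work_dir}/present.jar" "${work_dir}/libs" \
  && present_status=0 || present_status=$?
copy_required "${work_dir}/missing.jar" "${work_dir}/libs" 2>/dev/null \
  && missing_status=0 || missing_status=$?

echo "present=${present_status} missing=${missing_status}"
```

In the real script the call would likely be written as `copy_required "$src" "$dest" || exit 1`. Adding 'set -e' near the top is the shorter alternative, at the cost of a less specific error message.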
##########
tika-grpc/docker-build/Dockerfile:
##########
@@ -0,0 +1,39 @@
+FROM ubuntu:22.04
+COPY libs/ /tika/libs/
+COPY plugins/ /tika/plugins/
+COPY config/ /tika/config/
+COPY bin/ /tika/bin
Review Comment:
The Dockerfile copies the application files before installing system
dependencies. Any change to the copied JARs, plugins, or scripts therefore
invalidates the build cache for the apt-get layer, and the packages are
reinstalled on every build. Consider moving the COPY commands after the RUN
command to improve Docker build cache efficiency.
##########
tika-grpc/pom.xml:
##########
@@ -372,20 +409,65 @@
</execution>
</executions>
</plugin>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-antrun-plugin</artifactId>
+ <version>3.1.0</version>
+ <executions>
+ <execution>
+ <id>set-chmod-on-docker-build-sh</id>
+ <phase>validate</phase>
+ <goals>
+ <goal>run</goal>
+ </goals>
+ <configuration>
+ <target>
+ <chmod file="${project.basedir}/docker-build/docker-build.sh"
perm="755" failonerror="false"/>
+ </target>
+ <skip>${skip.docker.build}</skip>
+ </configuration>
+ </execution>
Review Comment:
The Maven antrun plugin execution runs in the 'validate' phase, well before
the 'package' phase where the docker-build.sh script runs. 'validate' is
already the first phase of the default lifecycle, so there is no earlier phase
to move this to. Because failonerror is set to false, a missing script will not
fail the build here, but it will also go unreported. Consider documenting why
the validate phase was chosen, or rely on git permissions for the script.
##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package
phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+ echo "ERROR: Docker is not installed or not in PATH. Please install Docker
first."
+ exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+ echo "Environment variable TIKA_VERSION is required, and should match the
maven project version of Tika"
+ exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+ RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+ ## Remove '-SNAPSHOT' from the version string
+ RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+ plugin_name=$(basename "$dir")
+ zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+ if [ -f "$zip_file" ]; then
+ cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+ else
+ echo "WARNING: Plugin file $zip_file does not exist, skipping."
+ fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+ "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+ "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+ package_name=$(basename "$parser_package")
+ jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+ if [ -f "$jar_file" ]; then
+ cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+ else
+ echo "Parser package file $jar_file does not exist, skipping."
+ fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
+
+cp -v "tika-grpc/docker-build/Dockerfile" "${OUT_DIR}/Dockerfile"
+
+cd "${OUT_DIR}" || exit
+
+echo "Running docker build from directory: $(pwd)"
+
+IMAGE_TAGS=()
+if [[ -n "${AWS_ACCOUNT_ID}" ]]; then
+ if ! aws ecr get-login-password --region "${AWS_REGION}" | docker login
--username AWS --password-stdin
"${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"; then
+ echo "ERROR: Failed to authenticate with AWS ECR"
+ exit 1
+ fi
+ IMAGE_TAGS+=("-t
${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${AZURE_REGISTRY_NAME}" ]]; then
+ if ! az acr login --name "${AZURE_REGISTRY_NAME}"; then
+ echo "ERROR: Failed to authenticate with Azure Container Registry"
+ exit 1
+ fi
+ IMAGE_TAGS+=("-t
${AZURE_REGISTRY_NAME}.azurecr.io/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${DOCKER_ID}" ]]; then
+ IMAGE_TAGS+=("-t ${DOCKER_ID}/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [ ${#IMAGE_TAGS[@]} -eq 0 ]; then
+ echo "No image tags specified, skipping Docker build step. To enable
build, set AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, and/or DOCKER_ID environment
variables."
+ exit 0
+fi
+
+tag="${IMAGE_TAGS[*]}"
+if [ "${MULTI_ARCH}" == "true" ]; then
+ echo "Building multi arch image"
+ docker buildx create --name tikabuilder
+ # Pin binfmt to a specific digest for security
+ # see
https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
Review Comment:
The hardcoded SHA256 digest is pinned to a specific version of the binfmt
image. This digest may become outdated over time. While pinning provides
security benefits, there should be a comment indicating when this digest was
last verified or consider using a tagged version that's more maintainable.
```suggestion
# see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
# Digest last verified on 2025-01-10; update periodically if tonistiigi/binfmt:latest is refreshed.
```
##########
tika-grpc/docker-build/start-tika-grpc.sh:
##########
@@ -0,0 +1,29 @@
+#!/bin/bash
+echo "Tika Version:"
+echo "${TIKA_VERSION}"
+echo "Tika Plugins:"
+ls "/tika/plugins"
+echo "Tika gRPC Max Inbound Message Size:"
+echo "${TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE}"
+echo "Tika gRPC Max Outbound Message Size:"
+echo "${TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE}"
+echo "Tika gRPC Num Threads:"
+echo "${TIKA_GRPC_NUM_THREADS}"
+exec java \
+ -Dgrpc.server.port=9090 \
+
"-Dgrpc.server.max-inbound-message-size=${TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE}" \
+
"-Dgrpc.server.max-outbound-message-size=${TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE}"
\
+ "-Dgrpc.server.numThreads=${TIKA_GRPC_NUM_THREADS}" \
+ --add-opens=jdk.management/com.sun.management.internal=ALL-UNNAMED \
+ --add-opens=java.base/jdk.internal.misc=ALL-UNNAMED \
+ --add-opens=java.base/sun.nio.ch=ALL-UNNAMED \
+ --add-opens=java.management/com.sun.jmx.mbeanserver=ALL-UNNAMED \
+ --add-opens=jdk.internal.jvmstat/sun.jvmstat.monitor=ALL-UNNAMED \
+ --add-opens=java.base/sun.reflect.generics.reflectiveObjects=ALL-UNNAMED \
+ --add-opens=java.base/java.io=ALL-UNNAMED \
+ --add-opens=java.base/java.nio=ALL-UNNAMED \
+ --add-opens=java.base/java.util=ALL-UNNAMED \
+ --add-opens=java.base/java.lang=ALL-UNNAMED \
+ -Djava.net.preferIPv4Stack=true \
+ "-Dplugins.pluginDirs=/tika/plugins" \
+ -jar "/tika/libs/tika-grpc-${TIKA_VERSION}.jar"
Review Comment:
The script doesn't validate that TIKA_VERSION is set before using it in the
jar path. If TIKA_VERSION is empty (which it will be since the Dockerfile ARG
VERSION is never passed), the java command will fail with a confusing error.
Add validation at the start of the script to check that TIKA_VERSION is set and
not empty.
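
One way the validation could look, as a sketch only. `require_tika_version` is a hypothetical function name. The jar path pattern is the one used in start-tika-grpc.sh, and the version value is a placeholder.

```bash
#!/bin/bash
# Hypothetical guard for the top of start-tika-grpc.sh: fail early with a
# clear message instead of letting java report a missing jar.
require_tika_version() {
  if [ -z "${TIKA_VERSION:-}" ]; then
    echo "ERROR: TIKA_VERSION is not set; cannot locate the tika-grpc jar" >&2
    return 1
  fi
  echo "/tika/libs/tika-grpc-${TIKA_VERSION}.jar"
}

# Empty value: the guard returns non-zero and prints nothing on stdout.
TIKA_VERSION=""
empty_result=$(require_tika_version 2>/dev/null) && empty_status=0 || empty_status=$?

# Placeholder version: the guard prints the jar path that would be launched.
TIKA_VERSION="4.0.0-SNAPSHOT"
jar_path=$(require_tika_version) && set_status=0 || set_status=$?

echo "empty_status=${empty_status} set_status=${set_status} jar_path=${jar_path}"
```

In the start script itself, the failing branch would exit with a non-zero status before `exec java` is reached.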
##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package
phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+ echo "ERROR: Docker is not installed or not in PATH. Please install Docker
first."
+ exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+ echo "Environment variable TIKA_VERSION is required, and should match the
maven project version of Tika"
+ exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+ RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+ ## Remove '-SNAPSHOT' from the version string
+ RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+ plugin_name=$(basename "$dir")
+ zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+ if [ -f "$zip_file" ]; then
+ cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+ else
+ echo "WARNING: Plugin file $zip_file does not exist, skipping."
+ fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+ "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+ "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+ package_name=$(basename "$parser_package")
+ jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+ if [ -f "$jar_file" ]; then
+ cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+ else
+ echo "Parser package file $jar_file does not exist, skipping."
+ fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
+
+cp -v "tika-grpc/docker-build/Dockerfile" "${OUT_DIR}/Dockerfile"
+
+cd "${OUT_DIR}" || exit
+
+echo "Running docker build from directory: $(pwd)"
+
+IMAGE_TAGS=()
+if [[ -n "${AWS_ACCOUNT_ID}" ]]; then
+ if ! aws ecr get-login-password --region "${AWS_REGION}" | docker login
--username AWS --password-stdin
"${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"; then
+ echo "ERROR: Failed to authenticate with AWS ECR"
+ exit 1
+ fi
+ IMAGE_TAGS+=("-t
${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${AZURE_REGISTRY_NAME}" ]]; then
+ if ! az acr login --name "${AZURE_REGISTRY_NAME}"; then
+ echo "ERROR: Failed to authenticate with Azure Container Registry"
+ exit 1
+ fi
+ IMAGE_TAGS+=("-t
${AZURE_REGISTRY_NAME}.azurecr.io/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${DOCKER_ID}" ]]; then
+ IMAGE_TAGS+=("-t ${DOCKER_ID}/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [ ${#IMAGE_TAGS[@]} -eq 0 ]; then
+ echo "No image tags specified, skipping Docker build step. To enable
build, set AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, and/or DOCKER_ID environment
variables."
+ exit 0
+fi
+
+tag="${IMAGE_TAGS[*]}"
+if [ "${MULTI_ARCH}" == "true" ]; then
+ echo "Building multi arch image"
+ docker buildx create --name tikabuilder
+ # Pin binfmt to a specific digest for security
+ # see
https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
+ docker run --rm --privileged
tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2
--install amd64
+ docker run --rm --privileged
tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2
--install arm64
+ docker buildx build \
+ --builder=tikabuilder . \
+ ${tag} \
+ --platform linux/amd64,linux/arm64 \
+ --push
+ docker buildx stop tikabuilder
+ docker buildx rm tikabuilder
+else
+ echo "Building single arch image"
+ # build single arch
+ docker build . ${tag}
Review Comment:
The docker build commands don't include error checking. The script ignores the
exit status of the multi-arch 'docker buildx build --push', so a failed build
or a failed push goes unnoticed, and the single-arch build is not verified
either. Add error checking after the docker build commands so the script exits
with a non-zero status on failure.
```suggestion
if ! docker buildx build \
--builder=tikabuilder . \
${tag} \
--platform linux/amd64,linux/arm64 \
--push; then
echo "ERROR: Docker multi-arch build or push failed"
# Attempt to clean up the buildx builder even on failure
docker buildx stop tikabuilder >/dev/null 2>&1 || true
docker buildx rm tikabuilder >/dev/null 2>&1 || true
exit 1
fi
docker buildx stop tikabuilder
docker buildx rm tikabuilder
else
echo "Building single arch image"
# build single arch
if ! docker build . ${tag}; then
echo "ERROR: Docker single-arch build failed"
exit 1
fi
```
##########
tika-grpc/docker-build/Dockerfile:
##########
@@ -0,0 +1,39 @@
+FROM ubuntu:22.04
+COPY libs/ /tika/libs/
+COPY plugins/ /tika/plugins/
+COPY config/ /tika/config/
+COPY bin/ /tika/bin
+ARG JRE='openjdk-17-jre-headless'
+ARG VERSION
+ARG TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE=104857600
+ARG TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE=104857600
+ARG TIKA_GRPC_NUM_THREADS=4
+RUN set -eux \
+ && apt-get update \
+ && apt-get install --yes --no-install-recommends gnupg2
software-properties-common \
+ && DEBIAN_FRONTEND=noninteractive apt-get install --yes
--no-install-recommends $JRE \
+ gdal-bin \
+ tesseract-ocr \
+ tesseract-ocr-eng \
+ tesseract-ocr-ita \
+ tesseract-ocr-fra \
+ tesseract-ocr-spa \
+ tesseract-ocr-deu \
+ && echo ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula
select true | debconf-set-selections \
+ && DEBIAN_FRONTEND=noninteractive apt-get install --yes
--no-install-recommends \
+ xfonts-utils \
+ fonts-freefont-ttf \
+ fonts-liberation \
+ ttf-mscorefonts-installer \
+ wget \
+ cabextract \
+ && apt-get clean -y \
+ && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+
+EXPOSE 9090
+ENV TIKA_VERSION=$VERSION
+ENV TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE=$TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE
+ENV TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE=$TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE
+ENV TIKA_GRPC_NUM_THREADS=$TIKA_GRPC_NUM_THREADS
+RUN chmod +x "/tika/bin/start-tika-grpc.sh"
Review Comment:
The Dockerfile doesn't specify a user, so the container will run as root by
default. This is a security risk. Consider adding a non-root user to run the
Java application, similar to best practices for production Docker images.
```suggestion
RUN chmod +x "/tika/bin/start-tika-grpc.sh" \
&& groupadd --system tika \
 && useradd --system --no-create-home --gid tika --shell /usr/sbin/nologin tika \
&& chown -R tika:tika /tika
USER tika
```
##########
tika-grpc/docker-build/Dockerfile:
##########
@@ -0,0 +1,39 @@
+FROM ubuntu:22.04
+COPY libs/ /tika/libs/
+COPY plugins/ /tika/plugins/
+COPY config/ /tika/config/
+COPY bin/ /tika/bin
+ARG JRE='openjdk-17-jre-headless'
+ARG VERSION
+ARG TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE=104857600
+ARG TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE=104857600
+ARG TIKA_GRPC_NUM_THREADS=4
+RUN set -eux \
+ && apt-get update \
+ && apt-get install --yes --no-install-recommends gnupg2
software-properties-common \
+ && DEBIAN_FRONTEND=noninteractive apt-get install --yes
--no-install-recommends $JRE \
+ gdal-bin \
+ tesseract-ocr \
+ tesseract-ocr-eng \
+ tesseract-ocr-ita \
+ tesseract-ocr-fra \
+ tesseract-ocr-spa \
+ tesseract-ocr-deu \
+ && echo ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula
select true | debconf-set-selections \
+ && DEBIAN_FRONTEND=noninteractive apt-get install --yes
--no-install-recommends \
+ xfonts-utils \
+ fonts-freefont-ttf \
+ fonts-liberation \
+ ttf-mscorefonts-installer \
+ wget \
+ cabextract \
+ && apt-get clean -y \
+ && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+
+EXPOSE 9090
+ENV TIKA_VERSION=$VERSION
Review Comment:
The Dockerfile uses an ARG VERSION that is defined but never set during the
build process. The docker-build.sh script doesn't pass this build argument,
which means TIKA_VERSION environment variable will be empty in the resulting
image. Add --build-arg VERSION="${TIKA_VERSION}" to the docker build and docker
buildx build commands.
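
A sketch of how the argument list might be assembled so that the version is always passed. The `IMAGE_NAMES` array and the image name are placeholders invented for illustration; docker-build.sh currently builds its tag list differently. The command is printed rather than run, so the sketch works without a Docker daemon.

```bash
#!/bin/bash
# Placeholder values for illustration only.
TIKA_VERSION="4.0.0-SNAPSHOT"
IMAGE_NAMES=("myusername/tika-grpc:4.0.0")

# Start with the build argument that the Dockerfile's ARG VERSION expects,
# then add one -t flag per image name as separate array elements.
build_args=(--build-arg "VERSION=${TIKA_VERSION}")
for image in "${IMAGE_NAMES[@]}"; do
  build_args+=(-t "${image}")
done

# Print the command instead of executing it.
printf '%s\n' "docker build . ${build_args[*]}"
```

Keeping each flag and value as its own array element means the real invocation could expand it as `"${build_args[@]}"` without relying on word splitting.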
##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package
phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+ echo "ERROR: Docker is not installed or not in PATH. Please install Docker
first."
+ exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+ echo "Environment variable TIKA_VERSION is required, and should match the
maven project version of Tika"
+ exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+ RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+ ## Remove '-SNAPSHOT' from the version string
+ RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+ plugin_name=$(basename "$dir")
+ zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+ if [ -f "$zip_file" ]; then
+ cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+ else
+ echo "WARNING: Plugin file $zip_file does not exist, skipping."
+ fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+ "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+ "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+ "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+ package_name=$(basename "$parser_package")
+ jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+ if [ -f "$jar_file" ]; then
+ cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+ else
+ echo "Parser package file $jar_file does not exist, skipping."
+ fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
Review Comment:
The script uses 'cp -v -r' on files (not directories) which doesn't require
the '-r' flag. The '-r' flag is only necessary for recursive directory copying.
This is used on lines 39, 46, 64, and 70. Remove '-r' for single file
operations to avoid confusion.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.