tuhaihe commented on code in PR #336: URL: https://github.com/apache/cloudberry-site/pull/336#discussion_r2660620725
########## docs/developer/functions-and-procedures/overview.md: ########## @@ -0,0 +1,33 @@ +--- +title: Stored Procedures and User-Defined Functions +--- + +# Stored Procedures and User-Defined Functions + +Apache Cloudberry provides powerful capabilities for extending the database functionality through Stored Procedures (SPs) and User-Defined Functions (UDFs). + +## User-Defined functions (UDFs) + +User-Defined Functions (UDFs) return values and can be used in queries. They allow you to bundle complex logic and calculations into reusable components. + +Apache Cloudberry supports several procedural languages for writing UDFs: + +- **PL/Python**: Write functions using Python 3. With the `plpython3u` untrusted language, you can access system calls and external libraries. +- **PL/Java**: Write functions using Java. Suitable for complex computations and integration with existing Java libraries. +- **PL/Perl**: Write functions using Perl, leveraging its strong string manipulation capabilities. +- **PL/Container**: Run Python and R functions securely inside Docker containers. This provides isolation and security for running untrusted code. Review Comment: We can add the link to it: ```suggestion - **[PL/Container](https://github.com/cloudberry-contrib/plcontainer)**: Run Python and R functions securely inside Docker containers. This provides isolation and security for running untrusted code. ``` ########## docs/developer/functions-and-procedures/overview.md: ########## @@ -0,0 +1,33 @@ +--- +title: Stored Procedures and User-Defined Functions +--- + +# Stored Procedures and User-Defined Functions + +Apache Cloudberry provides powerful capabilities for extending the database functionality through Stored Procedures (SPs) and User-Defined Functions (UDFs). + +## User-Defined functions (UDFs) + +User-Defined Functions (UDFs) return values and can be used in queries. They allow you to bundle complex logic and calculations into reusable components. + +Apache Cloudberry supports several procedural languages for writing UDFs: + +- **PL/Python**: Write functions using Python 3. With the `plpython3u` untrusted language, you can access system calls and external libraries. +- **PL/Java**: Write functions using Java. Suitable for complex computations and integration with existing Java libraries. +- **PL/Perl**: Write functions using Perl, leveraging its strong string manipulation capabilities. +- **PL/Container**: Run Python and R functions securely inside Docker containers. This provides isolation and security for running untrusted code. + +## Stored procedures + +Stored Procedures are similar to functions but do not return a value. They are invoked using the `CALL` command and can handle transaction control (e.g., `COMMIT`, `ROLLBACK`) within the procedure body, which is not allowed in functions. + +## Supported languages + +| Language | Extension | Trusted or not | Description | +| :--- | :--- | :--- | :--- | +| **SQL** | Built-in | Yes | Write functions using standard SQL queries. | +| **PL/pgSQL** | Built-in | Yes | The procedural language for the PostgreSQL database system. | +| **PL/Python** | `plpython3u` | No | Python 3 procedural language. | +| **PL/Java** | `pljava` | Yes | Java procedural language. | +| **PL/Perl** | `plperl` | Yes | Perl procedural language. | +| **PL/Container** | `plcontainer` | Yes (Safe) | Run Python/R in Docker containers. | Review Comment: Do we need to add `[PL/R](https://github.com/cloudberry-contrib/plr)` to the list? ########## docs/developer/functions-and-procedures/overview.md: ########## @@ -0,0 +1,33 @@ +--- +title: Stored Procedures and User-Defined Functions +--- + +# Stored Procedures and User-Defined Functions + +Apache Cloudberry provides powerful capabilities for extending the database functionality through Stored Procedures (SPs) and User-Defined Functions (UDFs). + +## User-Defined functions (UDFs) + +User-Defined Functions (UDFs) return values and can be used in queries. They allow you to bundle complex logic and calculations into reusable components. + +Apache Cloudberry supports several procedural languages for writing UDFs: + +- **PL/Python**: Write functions using Python 3. With the `plpython3u` untrusted language, you can access system calls and external libraries. +- **PL/Java**: Write functions using Java. Suitable for complex computations and integration with existing Java libraries. Review Comment: ```suggestion - **[PL/Java](https://github.com/cloudberry-contrib/pljava)**: Write functions using Java. Suitable for complex computations and integration with existing Java libraries. ``` ########## docs/developer/functions-and-procedures/use-pl-container.md: ########## @@ -0,0 +1,612 @@ +# Use PL/Container + +PL/Container allows you to run procedural language functions inside Docker containers, mitigating the security risks associated with running Python or R code directly on Apache Cloudberry segment hosts. This document introduces the architecture, configuration, usage, and advanced topics of PL/Container. + +## About the PL/Container language extension + +The PL/Container language extension allows you to securely create and run PL/Python or PL/R user-defined functions (UDFs) inside Docker containers. Docker provides the ability to package and run an application in a loosely isolated environment called a container. + +Running UDFs inside a Docker container ensures that: + +- The function execution occurs in an isolated environment, decoupling data processing. +- The user code cannot access the host's operating system or file system. +- The user code does not introduce any security risks. +- If the container is started with limited or no network access, the function cannot connect back to the database. + +### PL/Container architecture + +Example of the process flow: + +1. When a query calls a PL/Container function, the Query Executor (QE) on a segment host starts a container and communicates with it to get results. The container might call back to the database to execute SQL queries via the Server Programming Interface (SPI) and then returns the final results to the QE. + +2. A container in standby mode waits for a socket connection without consuming any CPU resources. When the Apache Cloudberry database session that started it is closed, the container connection is also closed, and the container shuts down. + +## Configure and use PL/Container + +You do not need to install an extension package to use PL/Container in Apache Cloudberry. You only need to configure the Docker environment and enable the extension in the database. + +### Prerequisites + +- Docker is installed on all Apache Cloudberry database hosts (coordinator and all segments). +- The minimum Linux kernel version is 3.10. +- The minimum Docker version is 19.03. + +### Step 1: Install Docker + +You need to install Docker on all Apache Cloudberry database hosts. The following is an installation example on CentOS 7: + +1. Ensure the current user has `sudo` privileges or is the `root` user. + +2. Install the dependencies required for Docker: + + ```bash + sudo yum install -y yum-utils device-mapper-persistent-data lvm2 + ``` + +3. Add the Docker software repository: + + ```bash + sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo + ``` + +4. Install Docker: + + ```bash + sudo yum -y install docker-ce + ``` + +5. Start the Docker service: + + ```bash + sudo systemctl start docker + ``` + +6. Add the Apache Cloudberry database administrator user (usually `gpadmin`) to the `docker` group so that it can manage Docker images and containers: + + ```bash + sudo usermod -aG docker gpadmin + ``` + +7. Log out of the current session and log back in for the permissions to take effect. + +8. Configure Docker to start on boot: + + ```bash + sudo systemctl enable docker.service + ``` + +9. Test whether Docker is installed successfully. This command lists the currently running containers (which should be empty at this point): + + ```bash + docker ps + ``` + +10. After installing Docker on all hosts, restart the Apache Cloudberry database for the changes to take effect: + + ```bash + gpstop -ra + ``` + +### Step 2: Enable PL/Container + +PL/Container is a built-in extension in Apache Cloudberry. You only need to run one command in the database where you want to use it to enable it. + +1. Connect to the target database using `psql`: + + ```bash + psql -d your_database_name + ``` + +2. Run the `CREATE EXTENSION` command: + + ```sql + CREATE EXTENSION plcontainer; + ``` + + After successful execution, PL/Container is activated in that database. + +### Step 3: Install and configure Docker images + +PL/Container requires specific Docker images to create containers for running UDFs. + +1. Obtain the PL/Container Docker image file (for example, `plcontainer-python3-image-*.tar.gz`) from trusted sources. + +2. Use the `plcontainer image-add` command to install the image on all Apache Cloudberry hosts. Use the `-f` option to specify the path to the image file. + + :::note + Make sure that your system has the `rsync` dependency; otherwise, the following commands will fail. + ::: + + ```bash + # Installs a Python 3 based Docker image. + plcontainer image-add -f /path/to/your/plcontainer-python3-image.tar.gz + + # Installs an R based Docker image. + plcontainer image-add -f /path/to/your/plcontainer-r-image.tar.gz + ``` + +3. Use the `plcontainer image-list` command to view the installed images: + + ```bash + plcontainer image-list + ``` + +4. Use the `plcontainer runtime-add` command to register the installed image as a "runtime" so that it can be called in functions. + + - `-r`: Specify the name of the runtime (custom). + - `-i`: Specify the associated Docker image. + - `-l`: Specify the language supported by this runtime (`python`, `python3`, or `r`). + + ```bash + # Adds a Python 3 runtime. + plcontainer runtime-add -r plc_python3_shared -i pivotaldata/plcontainer_python3_shared:devel -l python3 + + # Adds an R runtime. + plcontainer runtime-add -r plc_r_shared -i pivotaldata/plcontainer_r_shared:devel -l r + ``` + +### Step 4: Test the installation + +1. Connect to the database where PL/Container is enabled using `psql`. + +2. Create a simple function to test the Python runtime. Note the `# container:` comment, which specifies the runtime ID used by the function. + + ```sql + CREATE FUNCTION dummy_python() RETURNS text AS $$ + # container: plc_python3_shared + return 'Hello from Python in a container' + $$ LANGUAGE plcontainer; + ``` + +3. Call the function to test it: + + ```sql + SELECT dummy_python(); + ``` + +4. Similarly, you can create a function to test the R runtime: + + ```sql + CREATE FUNCTION dummy_r() RETURNS text AS $$ + # container: plc_r_shared + return ('Hello from R in a container') + $$ LANGUAGE plcontainer; + ``` + +5. Call the function to perform a test: + + ```sql + SELECT dummy_r(); + ``` + +## Write PL/Container functions + +When you enable the PL/Container extension in a database, the `plcontainer` language is registered. You can write functions in PL/Container-supported languages, such as Python or R, by specifying `LANGUAGE plcontainer` in a user-defined function (UDF). + +The PL/Container configuration file is read only when the first PL/Container function is called in each database session. If you modify the configuration externally, you can force a reload by running `SELECT * FROM plcontainer_refresh_config;` within the session. + +### Function definition + +To create a PL/Container function, the UDF you define must include the following elements: + +- **Container declaration**: The first line of the function body must be `# container: ID`, where `ID` is the runtime ID you defined with `plcontainer runtime-add`. +- **Language declaration**: The `LANGUAGE` attribute of the function must be `plcontainer`. + +If the `# container` line in a UDF specifies an ID that does not exist in the PL/Container configuration file, the database returns an error when trying to run the UDF. + +### Function examples + +Here are some basic function examples. Ensure that the `# container` ID in the examples matches the runtime ID you have configured. + +**PL/Python example:** + +```sql +CREATE OR REPLACE FUNCTION pylog100() RETURNS double precision AS $$ + # container: plc_python3_shared + import math + return math.log10(100) +$$ LANGUAGE plcontainer; + +SELECT pylog100(); +``` + +**PL/R example:** + +```sql +CREATE OR REPLACE FUNCTION rlog100() RETURNS text AS $$ +# container: plc_r_shared +return(log10(100)) +$$ LANGUAGE plcontainer; + +SELECT rlog100(); +``` + +### About PL/Python functions + +#### PL/Python 2 vs. PL/Python 3 + +- PL/Container supports Python 2.7 and Python 3.6+. +- If you want to use PL/Container to run functions for both Python 2 and Python 3, you need to create two different user-defined functions for them. +- Note that UDFs written for Python 2 might not run directly in Python 3. + +#### Database interaction (`plpy` module) + +The `plpy` module provides a set of methods for interacting with the database: + +- `plpy.execute(stmt)`: Executes a SQL query string and returns the result as a list of dictionaries. +- `plpy.prepare(stmt[, argtypes])`: Prepares a query plan. +- `plpy.execute(plan[, argtypes])`: Executes a prepared query plan. +- `plpy.debug(msg)`, `plpy.log(msg)`, `plpy.info(msg)`, `plpy.notice(msg)`, `plpy.warning(msg)`, `plpy.error(msg)`, `plpy.fatal(msg)`: Sends log messages of different levels to the database logging system. An `ERROR` or `FATAL` level error aborts the transaction. +- `plpy.subtransaction()`: Manages explicit subtransactions. + +#### String quoting + +The `plpy` module supports several useful string quoting functions for building dynamic queries: + +- `plpy.quote_literal(string)`: Quotes a string as a literal in a SQL statement, correctly handling single quotes and backslashes. +- `plpy.quote_nullable(string)`: Same as above, but returns `NULL` if the input is `null`. +- `plpy.quote_ident(string)`: Quotes a string as an identifier in a SQL statement, adding quotes only when necessary. + +#### Global dictionaries + +The `plpy` module provides two special global dictionaries for persisting data between function calls: + +- **GD (Global Dictionary)**: This dictionary is shared among all function calls within the same container. The data persists as long as the container instance is alive. +- **SD (Static Dictionary)**: This dictionary shares data only between multiple calls to the same function. + +:::note +Note that the lifecycle of a container is associated with a session. When an idle session is terminated by the database, the related containers are destroyed, and the data in GD and SD will be lost. +::: + +### About PL/R functions + +The `pg.spi` module in PL/Container provides methods for the R language to interact with the database: + +- `pg.spi.exec(stmt)`: Executes a SQL query and returns an R `data.frame`. +- `pg.spi.prepare(stmt[, argtypes])`: Prepares a query plan. +- `pg.spi.execp(plan[, argtypes])`: Executes a prepared query plan. +- `pg.spi.debug(msg)`, `pg.spi.log(msg)`, `pg.spi.info(msg)`, `pg.spi.notice(msg)`, `pg.spi.warning(msg)`, `pg.spi.error(msg)`, `pg.spi.fatal(msg)`: Sends log messages of different levels to the database logging system. + +### Function limitations + +Note the following limitations when using PL/Container: + +- Database domains are not supported. +- Multidimensional arrays are not supported. +- Call stack information for Python and R is not displayed when debugging UDFs. +- The `nrows()` and `status()` methods of `plpy.execute()` are not supported. +- The `plpy.SPIError()` method of PL/Python is not supported. +- Running the `SAVEPOINT` command in `plpy.execute()` is not supported. +- Container flow control is not supported. +- Triggers are not supported. +- `OUT` parameters are not supported. +- PL/Python functions cannot directly return a Python `dict` type, but it can be converted to a database user-defined type (UDT) before returning. + +## Manage PL/Container + +### Manage configurations and containers in a session + +PL/Container provides some built-in views and functions to help you view and manage configurations within a database session. + +- **Refresh configuration**: If you modify the `plcontainer_configuration.xml` file externally, you can run the following command in a session to force a configuration reload without restarting the database. + + ```sql + SELECT * FROM plcontainer_refresh_config; + ``` + + This command returns the status of the configuration refresh on the coordinator and all segment instances. + +- **View current configuration**: + + ```sql + SELECT * FROM plcontainer_show_config; + ``` + + This command executes a PL/Container function to display configuration information from the coordinator and all segment instances. + +- **View running containers**: + + ```sql + SELECT * FROM plcontainer_containers_summary(); + ``` + + If run by a regular user, this function only displays the containers created by that user. If run by a superuser, it displays containers created by all users. The output includes segment ID, container ID, runtime, owner, and memory usage. + +## Advanced topics + +### Resource management + +Docker containers share CPU and memory resources with the Apache Cloudberry database service on the same host. By default, the database is unaware of the resources consumed by PL/Container instances. You can use the database's resource group feature to control the total CPU resource usage of PL/Container instances. + +PL/Container manages resources at two levels: container level and runtime level. You can control container-level CPU and memory resources by configuring `memory_mb` and `cpu_share` settings for a runtime. `memory_mb` controls the memory resources available to each container instance, while `cpu_share` defines the CPU usage weight of a container relative to others. + +:::note +If you do not explicitly configure a resource group for a PL/Container runtime, its container instances will only be limited by system resources. This can lead to containers consuming excessive resources, thus affecting the performance of the database server. +::: + +#### Configuration process + +To use resource groups to manage PL/Container's CPU resources, you need to explicitly configure both resource groups and PL/Container. + +1. **Plan resource allocation**: + + - Analyze the resource usage of your deployment environment to determine the percentage of CPU resources to allocate to PL/Container Docker containers. + - Decide how to distribute these resources among different PL/Container runtimes. Clarify the required number of resource groups, the CPU percentage for each group, and the mapping between resource groups and runtimes. + +2. **Create resource groups**: Create resource groups according to your plan. For example, assume you decide to allocate 25% of CPU resources to PL/Container and distribute it between two different resource groups in a 60/40 ratio: + + ```sql + -- Creates a resource group for the R runtime, allocating 15% CPU. + CREATE RESOURCE GROUP plr_run1_rg WITH (CONCURRENCY=0, CPU_MAX_PERCENT=15); + + -- Creates a resource group for the Python runtime, allocating 10% CPU. + CREATE RESOURCE GROUP plpy_run1_rg WITH (CONCURRENCY=0, CPU_MAX_PERCENT=10); + ``` + +3. **Get resource group IDs**: Query the `gp_toolkit.gp_resgroup_config` view to get the `groupid` of the resource groups you created. + + ```sql + SELECT groupname, groupid FROM gp_toolkit.gp_resgroup_config + WHERE groupname IN ('plpy_run1_rg', 'plr_run1_rg'); + + -- Example output: + -- groupname | groupid + -- --------------+---------- + -- plpy_run1_rg | 16391 + -- plr_run1_rg | 16393 + -- (2 rows) + ``` + +4. **Assign resource groups to runtimes**: Use the `plcontainer runtime-add` (for a new runtime) or `plcontainer runtime-replace` (for an existing runtime) command to assign a resource group via the `-s resource_group_id=<groupid>` parameter. + + ```bash + # Assigns a resource group to the new python_run1 runtime. + plcontainer runtime-add -r python_run1 -i pivotaldata/plcontainer_python_shared:devel -l python -s resource_group_id=16391 + + # Replaces and assigns a resource group to the existing r_run1 runtime. + plcontainer runtime-replace -r r_run1 -i pivotaldata/plcontainer_r_shared:devel -l r -s resource_group_id=16393 + ``` + + You can also use the `plcontainer runtime-edit` command to manually edit the configuration file to assign a resource group. + +After assigning a resource group to a runtime, all container instances that share this runtime configuration will be subject to the CPU limits of that group's configuration. If you delete a PL/Container resource group that is in use, the database will terminate the running containers. + +### Logging + +When the logging feature of PL/Container is enabled, you can set the log level (default is `warning`) through the database's `log_min_messages` parameter. This parameter controls the log level for both the database and PL/Container. + +- **Enable logging**: PL/Container logging is enabled per runtime ID and controlled by the `use_container_logging` setting, which defaults to no logging. +- **Log content and location**: PL/Container log messages originate from UDFs running in Docker containers. On Red Hat 8 systems, log messages are sent to the `journald` service by default. Database log messages are sent to the log files on the coordinator node. +- **Dynamically adjust log level**: When testing or troubleshooting, you can use the `SET` command in a session to temporarily change the log level, for example, to `debug1`. + + ```sql + SET log_min_messages='debug1'; + ``` + +:::note +The `log_min_messages` parameter affects logging for both the database and PL/Container. Increasing the log level might affect database performance, even if no PL/Container functions are running. +::: + +### Use CUDA for GPU acceleration + +PL/Container supports using NVIDIA GPUs for computational acceleration. This requires you to prepare a custom Docker image containing the CUDA Toolkit and corresponding Python libraries (such as PyCUDA) and configure it accordingly. + +#### Prerequisites + +- Docker engine version is not lower than v19.03. +- PL/Container version is not lower than 2.2.0. +- At least one NVIDIA GPU is on the host, and the corresponding GPU driver is installed. +- NVIDIA Container Toolkit is installed, and it is verified that the `nvidia-docker` image can successfully use the GPU. + +#### Install and customize the PL/Container image + +1. **Load the base image**: Obtain the PL/Container Python 3 image from official channels and load it into Docker. + + ```bash + docker image load < plcontainer-python3-image-*.tar.gz + ``` + +2. **Customize the image**: Create a Dockerfile to add the CUDA runtime and the `pycuda` library to the base image. The following is an example Dockerfile content for adding CUDA 11.7 and `pycuda` 2021.1: + + ```dockerfile + FROM pivotaldata/plcontainer_python3_shared:devel + + ENV XKBLAYOUT=en + ENV DEBIAN_FRONTEND=noninteractive + + # Install CUDA from https://developer.nvidia.com/cuda-downloads + # By downloading and using the software, you agree to fully comply with the terms and conditions of the CUDA EULA. + RUN true &&\ + wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin && \ Review Comment: It would be better to have a Ubuntu22.04 package here, FYI. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
