From Guoxing in another thread:

Following MySQL's BLOB field design, can Paimon also support streaming
write capabilities for BLOB fields?

MySQL Large Object Storage

1. BINARY vs BLOB

*Note: MySQL supports both BINARY and BLOB types, whereas Paimon currently
only supports BINARY.*

  BINARY: Fixed-length binary string type, similar to CHAR, but it stores
          bytes instead of characters.
  BLOB:   Variable-length binary large object type, used to store large
          amounts of binary data (e.g., images, audio, files).
------------------------------
2. Operation Interfaces

Input Streams (Writing Data): methods on PreparedStatement (a short
streaming-write sketch follows the list)

  - setBinaryStream(int index, InputStream x, int length): Writes binary
    stream data into a BLOB field; used for inserting or updating BLOB
    data. Recommended for streaming writes.
  - setBlob(int index, InputStream inputStream): Writes BLOB data from an
    input stream (JDBC 4.0+). A more modern approach that does not require
    specifying the length.
  - setBytes(int index, byte[] bytes): Directly writes a byte array.
    Suitable only for small files (<1MB); be cautious about memory usage.
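
To make the write path concrete, here is a minimal sketch of a streaming
insert. The connection URL and the files(name, content) table (content as
a LONGBLOB) are illustrative assumptions, not from the thread:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BlobStreamWrite {
        public static void main(String[] args) throws Exception {
            Path file = Path.of("video.mp4");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/test", "user", "password");
                 InputStream in = Files.newInputStream(file);
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO files (name, content) VALUES (?, ?)")) {
                ps.setString(1, file.getFileName().toString());
                // The driver pulls from the stream in chunks, so the whole
                // file is never held in memory at once.
                ps.setBinaryStream(2, in, Files.size(file));
                ps.executeUpdate();
            }
        }
    }

The point of interest for Paimon is exactly this shape of API: the caller
hands over a stream plus an optional length, never a materialized byte[].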
Output Streams (Reading Data): methods on ResultSet (a streaming-read
sketch follows the list)

  - getBinaryStream(String columnName): Reads BLOB data as an input
    stream. Recommended for streaming large files to avoid OOM.
  - getBinaryStream(int index): Same as above, but accesses the column by
    index; useful when the column order is known.
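
The read side mirrors the write side. A minimal sketch, again assuming the
hypothetical files(name, content) table from the write example:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class BlobStreamRead {
        // Copies one BLOB into a local file chunk by chunk, so memory
        // usage stays constant no matter how large the stored object is.
        static void copyToFile(Connection conn, String name, Path target)
                throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                         "SELECT content FROM files WHERE name = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        try (InputStream in = rs.getBinaryStream("content");
                             OutputStream out = Files.newOutputStream(target)) {
                            in.transferTo(out); // Java 9+, fixed-size buffer copy
                        }
                    }
                }
            }
        }
    }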
Large Object Handling (java.sql.Blob); a sketch follows the list

  - ResultSet.getBlob(String columnName): Retrieves a java.sql.Blob
    object, which provides additional methods for manipulation.
  - Blob.getBinaryStream(): Gets an input stream from the Blob object;
    used in conjunction with ResultSet.getBlob().
  - Blob.length(): Returns the size (length) of the BLOB data. Useful for
    determining the file size or allocating buffers.
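
A sketch of the Blob-handle variant, useful when the size must be known
before the bytes are consumed (same hypothetical files table as above):

    import java.io.InputStream;
    import java.sql.Blob;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class BlobHandleRead {
        // Uses the Blob handle to learn the size before streaming bytes.
        static void readWithSize(Connection conn, String name) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                         "SELECT content FROM files WHERE name = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        Blob blob = rs.getBlob("content");
                        System.out.println("BLOB size in bytes: " + blob.length());
                        try (InputStream in = blob.getBinaryStream()) {
                            byte[] buf = new byte[8192];
                            while (in.read(buf) != -1) { /* process chunk */ }
                        }
                        blob.free(); // release driver-side resources (JDBC 4.0+)
                    }
                }
            }
        }
    }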
Byte Array Access

  - ResultSet.getBytes(String columnName): Reads the entire BLOB directly
    into a byte array. Only suitable for small files; large files may
    cause an OutOfMemoryError (OOM).
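
For completeness, the byte-array path, safe only when the object is known
to be small (same hypothetical table as the sketches above):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class BlobBytesRead {
        // Loads the whole value onto the heap; fine for thumbnails or
        // short documents, dangerous for multi-GB objects.
        static byte[] readSmall(Connection conn, String name) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                         "SELECT content FROM files WHERE name = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getBytes("content") : null;
                }
            }
        }
    }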
------------------------------

This comparison highlights that MySQL provides robust streaming I/O support
for BLOBs, enabling efficient handling of large binary objects without
loading them entirely into memory. A similar capability could be valuable
in Paimon for better multimodal data management.

On Tue, Sep 16, 2025 at 3:08 PM Jingsong Li <[email protected]> wrote:
>
> Hi everyone,
>
> I want to start a discussion about blob files.
>
> Multimodal data storage needs to support multimedia files, including
> text, images, audio, video, embedding vectors, etc. Paimon needs to
> meet the demand for multimodal data entering the lake, and achieve
> unified storage and efficient management of multimodal data and
> structured data.
>
> Most multimodal files are actually not large, around 1MB or even
> smaller, but there are also relatively large multimodal files, such as
> 10GB+ files, which pose storage challenges for us.
>
> Consider two ways:
>
> 1. Multimodal data can be directly stored in column files, such as
> Parquet or Lance files. The biggest problem with this approach is that
> it puts pressure on the file format: to avoid OOM during reads and
> writes, the format needs a streaming API so the entire multimodal
> object never has to be loaded into memory. In addition, the extra
> fields attached to multimodal data may be frequently changed, added,
> or even deleted. If every such change forces the multimodal files to
> be read and rewritten together with it, the cost is very high.
>
> 2. Multimodal data is stored on object storage, and Parquet references
> these files through pointers. The downside is that Paimon cannot
> directly manage the multimodal data, and this may produce a large
> number of small files, causing heavy file IO during use and leading to
> decreased performance and increased costs.
>
> We should consider new ways to satisfy this requirement: a
> high-performance architecture designed specifically for mixed
> workloads of massive numbers of small and large multimodal files,
> achieving high-throughput writes and low-latency reads to meet the
> stringent performance requirements of AI, big data, and other
> businesses.
>
> A more intuitive solution is to separate multimodal storage from
> structured storage: manage the multimodal data independently,
> introduce a bin-file mechanism that packs multiple multimodal objects
> into one file, and let Parquet still reference the multimodal data
> through pointers.
>
> What do you think?
>
> [1] 
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-35%3A+Introduce+Blob+to+store+multimodal+data
>
> Best,
> Jingsong
