Re: [DISCUSS] PIP-35: Introduce Blob to store multimodal data

Jingsong Li Tue, 16 Sep 2025 20:50:10 -0700

Hi Guoxing,

Sure, I will add these utils to the PIP.
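
For illustration only, here is a minimal sketch of what such utils might look like. The `Blobs` factory and its method names follow Guoxing's proposal and are hypothetical, not the final PIP API. The `Blob` interface below is a local stand-in that returns a plain `InputStream` instead of Paimon's `SeekableInputStream`, so the example has no dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class BlobsSketch {

    /** Local stand-in for the PIP's Blob interface (assumption: plain InputStream). */
    interface Blob {
        byte[] toBytes();

        InputStream newInputStream() throws IOException;
    }

    /** Hypothetical factory; names mirror the proposal, not a released API. */
    static final class Blobs {
        private Blobs() {}

        /** Wraps a small in-memory payload. */
        static Blob fromByteArray(byte[] bytes) {
            return new Blob() {
                @Override
                public byte[] toBytes() {
                    return bytes.clone(); // defensive copy
                }

                @Override
                public InputStream newInputStream() {
                    return new ByteArrayInputStream(bytes);
                }
            };
        }

        /** Wraps a local file; data is only read when the blob is consumed. */
        static Blob fromPath(Path path) {
            return new Blob() {
                @Override
                public byte[] toBytes() {
                    try {
                        return Files.readAllBytes(path); // loads the whole file
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }

                @Override
                public InputStream newInputStream() throws IOException {
                    return Files.newInputStream(path); // streaming access
                }
            };
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("blob-sketch", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3});
        Blob fileBlob = Blobs.fromPath(tmp);
        Blob inlineBlob = Blobs.fromByteArray(new byte[] {4, 5});
        try (InputStream in = fileBlob.newInputStream()) {
            System.out.println("first byte of file blob: " + in.read());
        }
        System.out.println("inline blob size: " + inlineBlob.toBytes().length);
        Files.delete(tmp);
    }
}
```

The file-backed variant defers all I/O until the blob is read, which is the property that matters for large objects.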


Best,
Jingsong

On Wed, Sep 17, 2025 at 11:01 AM guoxing wgx <[email protected]> wrote:
>
> Subject: [Proposal] Introduce built-in Blob implementations (e.g.,
> FileBlob, HttpBlob) for common use cases
>
> Hi all,
>
> I've been reviewing the Blob design in PIP-35 ("Introduce Blob to store
> multimodal data") and think it's a solid foundation for supporting
> multimodal workloads in Paimon.
>
> One area I'd like to propose for improvement is **developer experience and
> ease of use**. Currently, users need to implement the `Blob` interface
> themselves for custom data sources (e.g., files, HTTP URLs), which leads to
> duplicated efforts and potential inconsistencies.
>
> Could we consider introducing **built-in Blob implementations** for common
> scenarios? For example:
>
> - `FileBlob`: for reading from local or mounted file systems
> - `HttpBlob` / `UrlBlob`: for streaming data from HTTP/HTTPS endpoints
> - `ByteArrayBlob`: for small in-memory binary objects (<1MB)
>
> These could be exposed through a simple factory API, such as:
>
> ```java
> Blob fileBlob = Blobs.fromPath("pangu|oss|file://data/image.png");  // file
> Blob urlBlob = Blobs.fromUrl("https://example.com/audio.mp3");      // remote URL
> Blob inlineBlob = Blobs.fromByteArray(embeddingBytes);              // inline data
> ```
>
> Jingsong Li <[email protected]> wrote on Tue, Sep 16, 2025 at 22:13:
>
> > Hi everyone,
> >
> > Blob type and data POC is in https://github.com/apache/paimon/pull/6268
> >
> > Best,
> > Jingsong
> >
> > On Tue, Sep 16, 2025 at 10:08 PM Jingsong Li <[email protected]>
> > wrote:
> > >
> > > Thanks Guoxing for your suggestion.
> > >
> > > Now I have introduced the Blob interface:
> > >
> > > /**
> > >  * Blob interface, provide bytes and input stream methods.
> > >  *
> > >  * @since 1.4.0
> > >  */
> > > @Public
> > > public interface Blob {
> > >
> > >     byte[] toBytes();
> > >
> > >     SeekableInputStream newInputStream() throws IOException;
> > > }
> > >
> > > You can see the read and write examples in the PIP.
> > >
> > > Best,
> > > Jingsong
> > >
> > > ---------- Forwarded message ---------
> > > From: guoxing wgx <[email protected]>
> > > Date: Tue, Sep 16, 2025 at 7:47 PM
> > > Subject: Re: [DISCUSS] PIP-35: Introduce Blob to store multimodal data
> > > To: Jingsong Li <[email protected]>
> > >
> > >
> > > Following MySQL's BLOB Field Design, Can Paimon Also Support Streaming
> > > Write Capabilities for BLOB Fields?
> > >
> > > MySQL Large Object Storage
> > >
> > > 1. BINARY vs BLOB
> > >
> > > MySQL supports two binary data types: BINARY and BLOB.
> > >
> > > BINARY is a fixed-length binary string type, similar to CHAR, but it
> > > stores raw bytes instead of characters. It is suitable for small,
> > > fixed-size binary data.
> > > BLOB (Binary Large Object) is a variable-length type designed to store
> > > large amounts of binary data such as images, audio, video, documents,
> > > and other file types.
> > >
> > > Note: Currently, Apache Paimon only supports the Binary type and does
> > > not have a dedicated BLOB type with streaming I/O capabilities.
> > >
> > > 2. Operation Interfaces
> > >
> > > Input Streams (Writing Data)
> > >
> > > When inserting or updating BLOB data, MySQL provides several methods
> > > through the JDBC API:
> > >
> > > setBinaryStream(int index, InputStream x, int length)
> > > Writes binary data from an input stream into a BLOB field. This method
> > > is recommended for streaming large files, as it avoids loading the
> > > entire data into memory.
> > >
> > > setBlob(int index, InputStream inputStream) (available since JDBC 4.0)
> > > A more modern approach that writes BLOB data using an input stream
> > > without requiring the length to be specified upfront. This simplifies
> > > handling dynamically sized data.
> > >
> > > setBytes(int index, byte[] bytes)
> > > Directly writes a byte array to the BLOB field. This is appropriate
> > > only for small files (e.g., less than 1MB), as it can lead to high
> > > memory consumption and potential OutOfMemoryError (OOM) for larger
> > > data.
> > >
> > > Output Streams (Reading Data)
> > >
> > > When retrieving BLOB data from a result set, streaming access is
> > > supported to prevent memory issues:
> > >
> > > getBinaryStream(String columnName)
> > > Reads the BLOB value as an input stream, enabling chunked reading of
> > > large files. This is the recommended way to handle large binary
> > > objects and avoid OOM.
> > >
> > > getBinaryStream(int index)
> > > Similar to the above method, but accesses the column by its numeric
> > > index instead of name. It is useful when the column order is known and
> > > stable.
> > >
> > > Large Object Handling (Blob)
> > >
> > > In addition to direct stream access, MySQL allows working with the
> > > java.sql.Blob interface for more advanced operations:
> > >
> > > ResultSet.getBlob(String columnName)
> > > Retrieves a java.sql.Blob object from the result set, which provides
> > > additional methods for manipulation.
> > >
> > > Blob.getBinaryStream()
> > > Returns an input stream from the Blob object, typically used in
> > > conjunction with ResultSet.getBlob() to enable lazy or on-demand
> > > reading.
> > >
> > > Blob.length()
> > > Returns the size (in bytes) of the BLOB data. This is useful for
> > > allocating buffers, determining file size, or managing partial reads.
> > >
> > > Byte Array Access
> > >
> > > ResultSet.getBytes(String columnName)
> > > Reads the entire BLOB content directly into a byte array. While
> > > convenient for small data, this method should be avoided for large
> > > files, as it may cause OutOfMemoryError due to excessive memory usage.
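
As a rough sketch of the read-side calls listed above, the snippet below consumes a `java.sql.Blob` through `getBinaryStream()` in fixed-size chunks and compares the result with `length()`. It assumes the JDK's `javax.sql.rowset.serial.SerialBlob` as an in-memory stand-in for a value that would normally come from `ResultSet.getBlob(...)`; the class and method names of the sketch itself are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialBlob;

class JdbcBlobStreamingSketch {

    /** Reads the blob in chunks of chunkSize bytes and returns the total bytes seen. */
    static long countBytesStreaming(Blob blob, int chunkSize)
            throws SQLException, IOException {
        long total = 0;
        byte[] buffer = new byte[chunkSize]; // only this buffer is held in memory
        try (InputStream in = blob.getBinaryStream()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n; // a real consumer would process buffer[0..n) here
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for ResultSet.getBlob("column"); no database is needed.
        Blob blob = new SerialBlob(new byte[10_000]);
        long streamed = countBytesStreaming(blob, 4096);
        System.out.println("streamed == length(): " + (streamed == blob.length()));
    }
}
```

With a real driver, whether the data is actually fetched lazily depends on the driver and its configuration; the loop shape stays the same.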
> > >
> > > ________________________________
> > >
> > > This completes the description of MySQL’s BLOB handling mechanisms,
> > > focusing solely on factual presentation without additional analysis or
> > > recommendations.
> > >
> > >
> > > Jingsong Li <[email protected]> wrote on Tue, Sep 16, 2025 at 19:30:
> > > >
> > > > From Guoxing in another thread:
> > > >
> > > > Following MySQL's BLOB field design, can Paimon also support streaming
> > > > write capabilities for BLOB fields?
> > > > MySQL Large Object Storage
> > > >
> > > > 1. BINARY vs BLOB
> > > >
> > > > *Note: MySQL supports both BINARY and BLOB types, whereas Paimon
> > > > currently only supports Binary*
> > > >
> > > > - BINARY: Fixed-length binary string type, similar to CHAR, but stores
> > > >   bytes instead of characters.
> > > > - BLOB: Variable-length binary large object type, used to store large
> > > >   amounts of binary data (e.g., images, audio, files).
> > > > ------------------------------
> > > > 2. Operation Interfaces
> > > >
> > > > Input Streams (Writing Data)
> > > > Category: Statement
> > > > - setBinaryStream(int index, InputStream x, int length): Writes binary
> > > >   stream data into a BLOB field; used for inserting or updating BLOB
> > > >   data. Recommended for streaming writes.
> > > > - setBlob(int index, InputStream inputStream): Writes BLOB data using an
> > > >   input stream (JDBC 4.0+). A more modern approach that does not require
> > > >   specifying the length.
> > > > - setBytes(int index, byte[] bytes): Directly writes a byte array.
> > > >   Suitable only for small files (<1MB); be cautious about memory usage.
> > > >
> > > > Output Streams (Reading Data)
> > > > Category: ResultSet
> > > > - getBinaryStream(String columnName): Reads BLOB data as an input
> > > >   stream. Recommended for streaming large files to avoid OOM.
> > > > - getBinaryStream(int index): Same as above, but accesses by column
> > > >   index. Equivalent to using column name, useful when column order is
> > > >   known.
> > > >
> > > > Large Object Handling (Blob)
> > > > Category: Blob
> > > > - ResultSet.getBlob(String columnName): Retrieves a java.sql.Blob
> > > >   object, which provides additional methods for manipulation.
> > > > - Blob.getBinaryStream(): Gets an input stream from the Blob object.
> > > >   Used in conjunction with ResultSet.getBlob().
> > > > - Blob.length(): Returns the size (length) of the BLOB data. Useful for
> > > >   determining file size or allocating buffers.
> > > >
> > > > Byte Array Access
> > > > Category: Bytes
> > > > - ResultSet.getBytes(String columnName): Reads the entire BLOB directly
> > > >   into a byte array. Only suitable for small files, as large files may
> > > >   cause OutOfMemoryError (OOM).
> > > > ------------------------------
> > > >
> > > > This comparison highlights that MySQL provides robust streaming I/O
> > > > support for BLOBs, enabling efficient handling of large binary objects
> > > > without loading them entirely into memory, a capability that could be
> > > > valuable to implement in Paimon for better multimodal data management.
> > > >
> > > > On Tue, Sep 16, 2025 at 3:08 PM Jingsong Li <[email protected]> wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > I want to start a discussion about blob files.
> > > > >
> > > > > Multimodal data storage needs to support multimedia files, including
> > > > > text, images, audio, video, embedding vectors, etc. Paimon needs to
> > > > > meet the demand for multimodal data entering the lake, and achieve
> > > > > unified storage and efficient management of multimodal data and
> > > > > structured data.
> > > > >
> > > > > > Most multimodal files are actually not large, around 1MB or even
> > > > > > below 1MB, but there are also relatively large multimodal files,
> > > > > > such as files of 10GB or more, which pose storage challenges for us.
> > > > >
> > > > > Consider two ways:
> > > > >
> > > > > > 1. Multimodal data can be stored directly in columnar files, such as
> > > > > > Parquet or Lance files. The biggest problem with this solution is
> > > > > > that it brings challenges to the file format: avoiding OOM on read
> > > > > > and write requires a streaming API in the file format, so that an
> > > > > > entire multimodal object is never loaded at once. In addition, the
> > > > > > other fields stored alongside the multimodal data may undergo
> > > > > > frequent changes, additions, or even deletions. If these changes
> > > > > > require the multimodal files to be read and rewritten as well, the
> > > > > > cost is very high.
> > > > >
> > > > > > 2. Multimodal data is stored on object storage, and Parquet
> > > > > > references these files through pointers. The downside is that this
> > > > > > approach cannot directly manage the multimodal data and may result
> > > > > > in a large number of small files, which can cause a significant
> > > > > > amount of file IO during use, leading to decreased performance and
> > > > > > increased costs.
> > > > >
> > > > > > We should consider new ways to satisfy this requirement: a
> > > > > > high-performance architecture specifically designed for mixed
> > > > > > scenarios of massive numbers of small and large multimodal files,
> > > > > > achieving high-throughput writes and low-latency reads, and meeting
> > > > > > the stringent performance requirements of AI, big data, and other
> > > > > > workloads.
> > > > >
> > > > > > A more intuitive solution is to separate multimodal storage from
> > > > > > structured storage: multimodal storage is managed separately, a bin
> > > > > > file mechanism is introduced to store multiple multimodal objects
> > > > > > together, and Parquet still references the multimodal data through
> > > > > > pointers.
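
To make the bin file idea concrete, here is a minimal sketch under stated assumptions: several blobs are appended to one shared file, and each append returns an (offset, length) pointer that a structured row could store in place of the bytes. The class names, the pointer layout, and the use of a local `RandomAccessFile` are illustrative assumptions only and do not describe the actual file format in the PIP.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class BinFileSketch {

    /** Hypothetical pointer a structured (e.g. Parquet) row would hold. */
    static final class BlobPointer {
        final long offset;
        final int length;

        BlobPointer(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Appends one blob to the end of the shared bin file and returns its pointer. */
    static BlobPointer append(Path binFile, byte[] blob) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(binFile.toFile(), "rw")) {
            long offset = file.length();
            file.seek(offset);
            file.write(blob);
            return new BlobPointer(offset, blob.length);
        }
    }

    /** Reads exactly the bytes a pointer refers to, without scanning other blobs. */
    static byte[] read(Path binFile, BlobPointer pointer) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(binFile.toFile(), "r")) {
            file.seek(pointer.offset);
            byte[] out = new byte[pointer.length];
            file.readFully(out);
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        Path bin = Files.createTempFile("bin-sketch", ".bin");
        BlobPointer image = append(bin, new byte[] {1, 2, 3});
        BlobPointer audio = append(bin, new byte[] {4, 5});
        System.out.println("second blob: " + Arrays.toString(read(bin, audio)));
        System.out.println("first blob offset: " + image.offset);
        Files.delete(bin);
    }
}
```

Packing many small objects into one file keeps the file count low, while the pointer still allows a single seek to reach any object.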
> > > > >
> > > > > What do you think?
> > > > >
> > > > > [1]
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-35%3A+Introduce+Blob+to+store+multimodal+data
> > > > >
> > > > > Best,
> > > > > Jingsong
> >
