Subject: [Proposal] PyPaimon Blob Type Support with Arrow

Hi everyone,

I've been reviewing the Blob design in PIP-35, and I agree it provides a
solid foundation for supporting multimodal workloads in Paimon.

I've been thinking about how PyPaimon could handle the Blob type with
Arrow. Since Arrow doesn't currently have a native Blob type, I suggest
we leverage the existing `BINARY` type.

My proposed approach involves two key steps:

1. **On writing**: Serialize the Blob's metadata (`BlobMeta`, including
   URI, length, and offset) into a binary format (e.g., Protobuf or
   FlatBuffers). This serialized data could then be stored in an Arrow
   `BINARY` array and passed into Paimon (see the write sketch after this
   list).

2. **On reading**: Add a new option, `read_blobs_as_meta(enabled: bool)`,
   to the `ReadBuilder`. When enabled, the Paimon reader returns the
   `BlobMeta` as a `BINARY` value. This allows users to reconstruct a
   `Blob` object using the `from_meta` class method, enabling streaming
   reads without loading the full binary data into memory (see the read
   sketch below). This provides clear semantics: users always know whether
   they're working with raw data or metadata.
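
To make the write path concrete, here is a rough sketch in Python. To be
clear, everything below is illustrative: the `BlobMeta` dataclass, its field
layout, and the `serialize`/`deserialize` helpers are placeholders (a real
implementation would presumably use Protobuf or FlatBuffers as mentioned
above), and the final PyPaimon writer API is out of scope here:

```python
import struct
from dataclasses import dataclass

import pyarrow as pa


@dataclass
class BlobMeta:
    """Hypothetical descriptor: where the blob bytes live and which slice."""
    uri: str
    offset: int
    length: int

    def serialize(self) -> bytes:
        # Length-prefixed packing, for illustration only; Protobuf or
        # FlatBuffers would be the realistic choice for schema evolution.
        uri_bytes = self.uri.encode("utf-8")
        return struct.pack(
            f"<I{len(uri_bytes)}sqq",
            len(uri_bytes), uri_bytes, self.offset, self.length)

    @classmethod
    def deserialize(cls, data: bytes) -> "BlobMeta":
        (uri_len,) = struct.unpack_from("<I", data, 0)
        uri = data[4:4 + uri_len].decode("utf-8")
        offset, length = struct.unpack_from("<qq", data, 4 + uri_len)
        return cls(uri=uri, offset=offset, length=length)


# On writing: each cell of the Arrow BINARY column holds serialized metadata.
metas = [BlobMeta("oss://bucket/data/image.png", offset=0, length=1 << 20)]
blob_column = pa.array([m.serialize() for m in metas], type=pa.binary())
table = pa.table({"blob": blob_column})  # then hand the table to the writer
```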
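
And a matching sketch of the read path, continuing from the snippet above.
The `read_blobs_as_meta` option and the `from_meta` method are the proposed
pieces (they don't exist yet); opening the underlying object via fsspec is
just one possible way to get a lazy, seekable stream:

```python
import fsspec  # assumption: object/file access via fsspec (e.g. ossfs, s3fs)


class Blob:
    """Hypothetical PyPaimon-side Blob reconstructed from metadata."""

    def __init__(self, meta: BlobMeta):
        self.meta = meta

    @classmethod
    def from_meta(cls, data: bytes) -> "Blob":
        return cls(BlobMeta.deserialize(data))

    def new_input_stream(self):
        # Open the object lazily and seek to the blob's slice, so callers
        # can stream from [offset, offset + length) without materializing
        # the whole object in memory.
        f = fsspec.open(self.meta.uri, "rb").open()
        f.seek(self.meta.offset)
        return f


# read_builder.read_blobs_as_meta(True)   # proposed ReadBuilder option
for cell in table.column("blob"):
    blob = Blob.from_meta(cell.as_py())
    with blob.new_input_stream() as stream:
        chunk = stream.read(64 * 1024)  # chunked, streaming read
```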

I'm hoping this approach could offer a practical way to support the Blob
type within the Arrow ecosystem.

Looking forward to your thoughts.

Best regards,
Yonghao Fang

Jingsong Li <[email protected]> 于2025年9月17日周三 11:13写道:

> Hi Guoxing,
>
> Sure, I will add these utils to the PIP.
>
> Best,
> Jingsong
>
> On Wed, Sep 17, 2025 at 11:01 AM guoxing wgx <[email protected]>
> wrote:
> >
> > Subject: [Proposal] Introduce built-in Blob implementations (e.g.,
> > FileBlob, HttpBlob) for common use cases
> >
> > Hi all,
> >
> > I've been reviewing the Blob design in PIP-35 ("Introduce Blob to store
> > multimodal data") and think it's a solid foundation for supporting
> > multimodal workloads in Paimon.
> >
> > One area I'd like to propose for improvement is **developer experience
> > and ease of use**. Currently, users need to implement the `Blob`
> > interface themselves for custom data sources (e.g., files, HTTP URLs),
> > which leads to duplicated efforts and potential inconsistencies.
> >
> > Could we consider introducing **built-in Blob implementations** for
> > common scenarios? For example:
> >
> > - `FileBlob`: for reading from local or mounted file systems
> > - `HttpBlob` / `UrlBlob`: for streaming data from HTTP/HTTPS endpoints
> > - `ByteArrayBlob`: for small in-memory binary objects (<1MB)
> >
> > These could be exposed through a simple factory API, such as:
> >
> > ```java
> > Blob blob = Blobs.fromPath("pangu|oss|file://data/image.png"); // file
> > Blob blob = Blobs.fromUrl("https://example.com/audio.mp3");    // remote URL
> > Blob blob = Blobs.fromByteArray(embeddingBytes);               // inline data
> > ```
> >
> > Jingsong Li <[email protected]> 于2025年9月16日周二 22:13写道:
> >
> > > Hi everyone,
> > >
> > > Blob type and data POC is in https://github.com/apache/paimon/pull/6268
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Sep 16, 2025 at 10:08 PM Jingsong Li <[email protected]>
> > > wrote:
> > > >
> > > > Thanks Guoxing for your suggestion.
> > > >
> > > > Now I have introduced the Blob interface:
> > > >
> > > > /**
> > > >  * Blob interface, provide bytes and input stream methods.
> > > >  *
> > > >  * @since 1.4.0
> > > >  */
> > > > @Public
> > > > public interface Blob {
> > > >
> > > >     byte[] toBytes();
> > > >
> > > >     SeekableInputStream newInputStream() throws IOException;
> > > > }
> > > >
> > > > And you can see the read and write example in PIP.
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > ---------- Forwarded message ---------
> > > > From: guoxing wgx <[email protected]>
> > > > Date: Tue, Sep 16, 2025 at 7:47 PM
> > > > Subject: Re: [DISCUSS] PIP-35: Introduce Blob to store multimodal data
> > > > To: Jingsong Li <[email protected]>
> > > >
> > > >
> > > > Following MySQL's BLOB Field Design, Can Paimon Also Support
> > > > Streaming Write Capabilities for BLOB Fields?
> > > >
> > > > MySQL Large Object Storage
> > > >
> > > > 1. BINARY vs BLOB
> > > >
> > > > MySQL supports two binary data types: BINARY and BLOB.
> > > >
> > > > BINARY is a fixed-length binary string type, similar to CHAR, but it
> > > > stores raw bytes instead of characters. It is suitable for small,
> > > > fixed-size binary data.
> > > > BLOB (Binary Large Object) is a variable-length type designed to
> > > > store large amounts of binary data such as images, audio, video,
> > > > documents, and other file types.
> > > >
> > > > Note: Currently, Apache Paimon only supports the Binary type and does
> > > > not have a dedicated BLOB type with streaming I/O capabilities.
> > > >
> > > > 2. Operation Interfaces
> > > >
> > > > Input Streams (Writing Data)
> > > >
> > > > When inserting or updating BLOB data, MySQL provides several methods
> > > > through the JDBC API:
> > > >
> > > > setBinaryStream(int index, InputStream x, int length)
> > > > Writes binary data from an input stream into a BLOB field. This
> > > > method is recommended for streaming large files, as it avoids
> > > > loading the entire data into memory.
> > > >
> > > > setBlob(int index, InputStream inputStream) (available since JDBC 4.0)
> > > > A more modern approach that writes BLOB data using an input stream
> > > > without requiring the length to be specified upfront. This simplifies
> > > > handling dynamically sized data.
> > > >
> > > > setBytes(int index, byte[] bytes)
> > > > Directly writes a byte array to the BLOB field. This is appropriate
> > > > only for small files (e.g., less than 1MB), as it can lead to high
> > > > memory consumption and potential OutOfMemoryError (OOM) for larger
> > > > data.
> > > >
> > > > Output Streams (Reading Data)
> > > >
> > > > When retrieving BLOB data from a result set, streaming access is
> > > > supported to prevent memory issues:
> > > >
> > > > getBinaryStream(String columnName)
> > > > Reads the BLOB value as an input stream, enabling chunked reading of
> > > > large files. This is the recommended way to handle large binary
> > > > objects and avoid OOM.
> > > >
> > > > getBinaryStream(int index)
> > > > Similar to the above method, but accesses the column by its numeric
> > > > index instead of name. It is useful when the column order is known
> > > > and stable.
> > > >
> > > > Large Object Handling (Blob)
> > > >
> > > > In addition to direct stream access, MySQL allows working with the
> > > > java.sql.Blob interface for more advanced operations:
> > > >
> > > > ResultSet.getBlob(String columnName)
> > > > Retrieves a java.sql.Blob object from the result set, which provides
> > > > additional methods for manipulation.
> > > >
> > > > Blob.getBinaryStream()
> > > > Returns an input stream from the Blob object, typically used in
> > > > conjunction with ResultSet.getBlob() to enable lazy or on-demand
> > > > reading.
> > > >
> > > > Blob.length()
> > > > Returns the size (in bytes) of the BLOB data. This is useful for
> > > > allocating buffers, determining file size, or managing partial reads.
> > > >
> > > > Byte Array Access
> > > >
> > > > ResultSet.getBytes(String columnName)
> > > > Reads the entire BLOB content directly into a byte array. While
> > > > convenient for small data, this method should be avoided for large
> > > > files, as it may cause OutOfMemoryError due to excessive memory usage.
> > > >
> > > > ________________________________
> > > >
> > > > This completes the description of MySQL’s BLOB handling mechanisms,
> > > > focusing solely on factual presentation without additional analysis
> > > > or recommendations.
> > > >
> > > >
> > > > On Tue, Sep 16, 2025 at 19:30, Jingsong Li <[email protected]> wrote:
> > > > >
> > > > > From Guoxing in another thread:
> > > > >
> > > > > Following MySQL's BLOB field design, can Paimon also support
> > > > > streaming write capabilities for BLOB fields?
> > > > > MySQL Large Object Storage
> > > > >
> > > > > 1. BINARY vs BLOB
> > > > >
> > > > > *Note: MySQL supports both BINARY and BLOB types, whereas Paimon
> > > > > currently only supports Binary.*
> > > > >
> > > > > | Type | Description |
> > > > > | ------ | ----------- |
> > > > > | BINARY | Fixed-length binary string type, similar to CHAR, but stores bytes instead of characters. |
> > > > > | BLOB | Variable-length binary large object type, used to store large amounts of binary data (e.g., images, audio, files). |
> > > > > ------------------------------
> > > > > 2. Operation Interfaces
> > > > >
> > > > > Input Streams (Writing Data)
> > > > >
> > > > > | Category | Method | Purpose |
> > > > > | -------- | ------ | ------- |
> > > > > | Statement | setBinaryStream(int index, InputStream x, int length) | Writes binary stream data into a BLOB field; used for inserting or updating BLOB data. Recommended for streaming writes. |
> > > > > | Statement | setBlob(int index, InputStream inputStream) | Writes BLOB data using an input stream (JDBC 4.0+). A more modern approach that does not require specifying the length. |
> > > > > | Statement | setBytes(int index, byte[] bytes) | Directly writes a byte array. Suitable only for small files (<1MB); be cautious about memory usage. |
> > > > >
> > > > > Output Streams (Reading Data)
> > > > >
> > > > > | Category | Method | Purpose |
> > > > > | -------- | ------ | ------- |
> > > > > | ResultSet | getBinaryStream(String columnName) | Reads BLOB data as an input stream. Recommended for streaming large files to avoid OOM. |
> > > > > | ResultSet | getBinaryStream(int index) | Same as above, but accesses by column index. Equivalent to using column name, useful when column order is known. |
> > > > >
> > > > > Large Object Handling (Blob)
> > > > >
> > > > > | Category | Method | Purpose |
> > > > > | -------- | ------ | ------- |
> > > > > | Blob | ResultSet.getBlob(String columnName) | Retrieves a java.sql.Blob object, which provides additional methods for manipulation. |
> > > > > | Blob | Blob.getBinaryStream() | Gets an input stream from the Blob object. Used in conjunction with ResultSet.getBlob(). |
> > > > > | Blob | Blob.length() | Returns the size (length) of the BLOB data. Useful for determining file size or allocating buffers. |
> > > > >
> > > > > Byte Array Access
> > > > >
> > > > > | Category | Method | Purpose |
> > > > > | -------- | ------ | ------- |
> > > > > | Bytes | ResultSet.getBytes(String columnName) | Reads the entire BLOB directly into a byte array. Only suitable for small files, as large files may cause OutOfMemoryError (OOM). |
> > > > > ------------------------------
> > > > >
> > > > > This comparison highlights that MySQL provides robust streaming I/O
> > > > > support for BLOBs, enabling efficient handling of large binary
> > > > > objects without loading them entirely into memory — a capability
> > > > > that could be valuable to implement in Paimon for better multimodal
> > > > > data management.
> > > > >
> > > > > On Tue, Sep 16, 2025 at 3:08 PM Jingsong Li <[email protected]> wrote:
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I want to start a discussion about blob files.
> > > > > >
> > > > > > Multimodal data storage needs to support multimedia files,
> > > > > > including text, images, audio, video, embedding vectors, etc.
> > > > > > Paimon needs to meet the demand for multimodal data entering the
> > > > > > lake, and achieve unified storage and efficient management of
> > > > > > multimodal data and structured data.
> > > > > >
> > > > > > Most multimodal files are actually not large, around 1MB or even
> > > > > > below 1MB, but there are also relatively large multimodal files,
> > > > > > such as 10GB+ files, which pose storage challenges for us.
> > > > > >
> > > > > > Consider two ways:
> > > > > >
> > > > > > 1. Multimodal data can be directly stored in column files, such
> > > > > > as Parquet or Lance files. The biggest problem with this solution
> > > > > > is that it brings challenges to the file format, such as solving
> > > > > > the read and write problems of OOM, which requires a streaming
> > > > > > API to the file format to avoid loading the entire multimodal
> > > > > > data. In addition, the additional fields of multimodal data may
> > > > > > undergo frequent changes, additions, or even deletions. If these
> > > > > > changes require multimodal files to participate in reading and
> > > > > > writing together, the cost is very high.
> > > > > >
> > > > > > 2. Multimodal data is stored on object storage, and Parquet
> > > > > > references these files through pointers. The downside of doing so
> > > > > > is that it cannot directly manage multimodal data and may result
> > > > > > in a large number of small files, which can cause a significant
> > > > > > amount of file IO during use, leading to decreased performance
> > > > > > and increased costs.
> > > > > >
> > > > > > We should consider new ways to satisfy this requirement. Create a
> > > > > > high-performance architecture specifically designed for mixed
> > > > > > scenarios of massive small and large multimodal files, achieving
> > > > > > high throughput writing and low latency reading, meeting the
> > > > > > stringent performance requirements of AI, big data, and other
> > > > > > businesses.
> > > > > >
> > > > > > A more intuitive solution is: independent multimodal storage and
> > > > > > structured storage, separate management of multimodal storage,
> > > > > > introduction of bin file mechanism to store multiple multimodal
> > > > > > data, Parquet still references multimodal data through pointers.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > [1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-35%3A+Introduce+Blob+to+store+multimodal+data
> > > > > >
> > > > > > Best,
> > > > > > Jingsong
