Hi Guoxing,

Sure, I will add these utils to the PIP.

Best,
Jingsong

On Wed, Sep 17, 2025 at 11:01 AM guoxing wgx <[email protected]> wrote:
>
> Subject: [Proposal] Introduce built-in Blob implementations (e.g.,
> FileBlob, HttpBlob) for common use cases
>
> Hi all,
>
> I've been reviewing the Blob design in PIP-35 ("Introduce Blob to store
> multimodal data") and think it's a solid foundation for supporting
> multimodal workloads in Paimon.
>
> One area I'd like to propose for improvement is **developer experience and
> ease of use**. Currently, users need to implement the `Blob` interface
> themselves for custom data sources (e.g., files, HTTP URLs), which leads to
> duplicated effort and potential inconsistencies.
>
> Could we consider introducing **built-in Blob implementations** for common
> scenarios? For example:
>
> - `FileBlob`: for reading from local or mounted file systems
> - `HttpBlob` / `UrlBlob`: for streaming data from HTTP/HTTPS endpoints
> - `ByteArrayBlob`: for small in-memory binary objects (<1MB)
>
> These could be exposed through a simple factory API, such as:
>
> ```java
> Blob blob = Blobs.fromPath("pangu|oss|file://data/image.png"); // file
> Blob blob = Blobs.fromUrl("https://example.com/audio.mp3"); // remote URL
> Blob blob = Blobs.fromByteArray(embeddingBytes); // inline data
> ```
>
> Jingsong Li <[email protected]> wrote on Tue, Sep 16, 2025 at 22:13:
>
> > Hi everyone,
> >
> > Blob type and data POC is in https://github.com/apache/paimon/pull/6268
> >
> > Best,
> > Jingsong
> >
> > On Tue, Sep 16, 2025 at 10:08 PM Jingsong Li <[email protected]>
> > wrote:
> > >
> > > Thanks Guoxing for your suggestion.
> > >
> > > Now I have introduced the Blob interface:
> > >
> > > /**
> > >  * Blob interface, provide bytes and input stream methods.
> > >  *
> > >  * @since 1.4.0
> > >  */
> > > @Public
> > > public interface Blob {
> > >
> > >     byte[] toBytes();
> > >
> > >     SeekableInputStream newInputStream() throws IOException;
> > > }
> > >
> > > And you can see the read and write example in PIP.
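
For illustration, the built-in implementations proposed in this thread could be sketched against a simplified version of the interface quoted above. This is a minimal sketch, not the PIP-35 code: `SimpleBlob` is a stand-in that returns a plain `java.io.InputStream` instead of Paimon's `SeekableInputStream`, and while the names `ByteArrayBlob`, `FileBlob`, and `Blobs` come from the proposal, every signature here is an assumption. In particular, `fromPath` takes a local `java.nio.file.Path` rather than the scheme-prefixed string shown in the proposal.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Stand-in for the quoted Blob interface (assumption: plain InputStream). */
interface SimpleBlob {
    byte[] toBytes();

    InputStream newInputStream() throws IOException;
}

/** Hypothetical in-memory blob, intended for small payloads only. */
final class ByteArrayBlob implements SimpleBlob {
    private final byte[] data;

    ByteArrayBlob(byte[] data) {
        this.data = data.clone(); // defensive copy
    }

    @Override
    public byte[] toBytes() {
        return data.clone();
    }

    @Override
    public InputStream newInputStream() {
        return new ByteArrayInputStream(data);
    }
}

/** Hypothetical file-backed blob; bytes are read only when requested. */
final class FileBlob implements SimpleBlob {
    private final Path path;

    FileBlob(Path path) {
        this.path = path;
    }

    @Override
    public byte[] toBytes() {
        try {
            // Loads the whole file into memory; large files should use the stream.
            return Files.readAllBytes(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public InputStream newInputStream() throws IOException {
        return Files.newInputStream(path);
    }
}

/** Hypothetical factory mirroring the API shape suggested in the proposal. */
final class Blobs {
    private Blobs() {}

    static SimpleBlob fromByteArray(byte[] bytes) {
        return new ByteArrayBlob(bytes);
    }

    static SimpleBlob fromPath(Path path) {
        return new FileBlob(path);
    }
}
```

With this shape, a caller would pick `toBytes()` for small objects and `newInputStream()` for large ones, which is the same split the MySQL comparison in this thread draws between `getBytes` and `getBinaryStream`.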
> > >
> > > Best,
> > > Jingsong
> > >
> > > ---------- Forwarded message ---------
> > > From: guoxing wgx <[email protected]>
> > > Date: Tue, Sep 16, 2025 at 7:47 PM
> > > Subject: Re: [DISCUSS] PIP-35: Introduce Blob to store multimodal data
> > > To: Jingsong Li <[email protected]>
> > >
> > >
> > > Following MySQL's BLOB Field Design, Can Paimon Also Support Streaming
> > > Write Capabilities for BLOB Fields?
> > >
> > > MySQL Large Object Storage
> > >
> > > 1. BINARY vs BLOB
> > >
> > > MySQL supports two binary data types: BINARY and BLOB.
> > >
> > > BINARY is a fixed-length binary string type, similar to CHAR, but it
> > > stores raw bytes instead of characters. It is suitable for small,
> > > fixed-size binary data.
> > > BLOB (Binary Large Object) is a variable-length type designed to store
> > > large amounts of binary data such as images, audio, video, documents,
> > > and other file types.
> > >
> > > Note: Currently, Apache Paimon only supports the Binary type and does
> > > not have a dedicated BLOB type with streaming I/O capabilities.
> > >
> > > 2. Operation Interfaces
> > >
> > > Input Streams (Writing Data)
> > >
> > > When inserting or updating BLOB data, MySQL provides several methods
> > > through the JDBC API:
> > >
> > > setBinaryStream(int index, InputStream x, int length)
> > > Writes binary data from an input stream into a BLOB field. This method
> > > is recommended for streaming large files, as it avoids loading the
> > > entire data into memory.
> > >
> > > setBlob(int index, InputStream inputStream) (available since JDBC 4.0)
> > > A more modern approach that writes BLOB data using an input stream
> > > without requiring the length to be specified upfront. This simplifies
> > > handling dynamically sized data.
> > >
> > > setBytes(int index, byte[] bytes)
> > > Directly writes a byte array to the BLOB field.
> > > This is appropriate
> > > only for small files (e.g., less than 1MB), as it can lead to high
> > > memory consumption and potential OutOfMemoryError (OOM) for larger
> > > data.
> > >
> > > Output Streams (Reading Data)
> > >
> > > When retrieving BLOB data from a result set, streaming access is
> > > supported to prevent memory issues:
> > >
> > > getBinaryStream(String columnName)
> > > Reads the BLOB value as an input stream, enabling chunked reading of
> > > large files. This is the recommended way to handle large binary
> > > objects and avoid OOM.
> > >
> > > getBinaryStream(int index)
> > > Similar to the above method, but accesses the column by its numeric
> > > index instead of name. It is useful when the column order is known and
> > > stable.
> > >
> > > Large Object Handling (Blob)
> > >
> > > In addition to direct stream access, MySQL allows working with the
> > > java.sql.Blob interface for more advanced operations:
> > >
> > > ResultSet.getBlob(String columnName)
> > > Retrieves a java.sql.Blob object from the result set, which provides
> > > additional methods for manipulation.
> > >
> > > Blob.getBinaryStream()
> > > Returns an input stream from the Blob object, typically used in
> > > conjunction with ResultSet.getBlob() to enable lazy or on-demand
> > > reading.
> > >
> > > Blob.length()
> > > Returns the size (in bytes) of the BLOB data. This is useful for
> > > allocating buffers, determining file size, or managing partial reads.
> > >
> > > Byte Array Access
> > >
> > > ResultSet.getBytes(String columnName)
> > > Reads the entire BLOB content directly into a byte array. While
> > > convenient for small data, this method should be avoided for large
> > > files, as it may cause OutOfMemoryError due to excessive memory usage.
> > >
> > > ________________________________
> > >
> > > This completes the description of MySQL’s BLOB handling mechanisms,
> > > focusing solely on factual presentation without additional analysis or
> > > recommendations.
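
The streaming write and read paths described above can be sketched with standard JDBC calls. This is a minimal sketch under stated assumptions: the table `media(id, content)` and the class name `JdbcBlobStreaming` are hypothetical, and whether bytes are truly streamed end to end, rather than buffered by the driver, depends on the JDBC driver and its configuration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class JdbcBlobStreaming {
    private JdbcBlobStreaming() {}

    /** Copies a stream in fixed-size chunks and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    /** Writes `data` into the BLOB column of the hypothetical table `media`. */
    static void write(Connection conn, long id, InputStream data) throws SQLException {
        String sql = "INSERT INTO media (id, content) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.setBinaryStream(2, data); // JDBC 4.0 overload: no length required
            ps.executeUpdate();
        }
    }

    /** Streams the BLOB column to `out`; returns bytes copied, or -1 if no row matched. */
    static long read(Connection conn, long id, OutputStream out)
            throws SQLException, IOException {
        String sql = "SELECT content FROM media WHERE id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return -1;
                }
                try (InputStream in = rs.getBinaryStream(1)) {
                    // getBinaryStream returns null when the column value is SQL NULL.
                    return in == null ? 0 : copy(in, out);
                }
            }
        }
    }
}
```

The chunked `copy` helper is the part that keeps memory use bounded: at most one 8 KB buffer is held on the application side regardless of the object size.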
> > >
> > >
> > > Jingsong Li <[email protected]> wrote on Tue, Sep 16, 2025 at 19:30:
> > > >
> > > > From Guoxing in another thread:
> > > >
> > > > Following MySQL's BLOB field design, can Paimon also support streaming
> > > > write capabilities for BLOB fields?
> > > > MySQL Large Object Storage
> > > >
> > > > 1. BINARY vs BLOB
> > > >
> > > > *Note: MySQL supports both BINARY and BLOB types, whereas Paimon
> > > > currently only supports Binary*
> > > >
> > > > | Type   | Description |
> > > > |--------|-------------|
> > > > | BINARY | Fixed-length binary string type, similar to CHAR, but stores bytes instead of characters. |
> > > > | BLOB   | Variable-length binary large object type, used to store large amounts of binary data (e.g., images, audio, files). |
> > > >
> > > > ------------------------------
> > > > 2. Operation Interfaces
> > > >
> > > > Input Streams (Writing Data)
> > > >
> > > > | Category  | Method | Purpose |
> > > > |-----------|--------|---------|
> > > > | Statement | setBinaryStream(int index, InputStream x, int length) | Writes binary stream data into a BLOB field; used for inserting or updating BLOB data. Recommended for streaming writes. |
> > > > |           | setBlob(int index, InputStream inputStream) | Writes BLOB data using an input stream (JDBC 4.0+). A more modern approach that does not require specifying the length. |
> > > > |           | setBytes(int index, byte[] bytes) | Directly writes a byte array. Suitable only for small files (<1MB); be cautious about memory usage. |
> > > >
> > > > Output Streams (Reading Data)
> > > >
> > > > | Category  | Method | Purpose |
> > > > |-----------|--------|---------|
> > > > | ResultSet | getBinaryStream(String columnName) | Reads BLOB data as an input stream. Recommended for streaming large files to avoid OOM. |
> > > > |           | getBinaryStream(int index) | Same as above, but accesses by column index. Equivalent to using column name, useful when column order is known. |
> > > >
> > > > Large Object Handling (Blob)
> > > >
> > > > | Category | Method | Purpose |
> > > > |----------|--------|---------|
> > > > | Blob     | ResultSet.getBlob(String columnName) | Retrieves a java.sql.Blob object, which provides additional methods for manipulation. |
> > > > |          | Blob.getBinaryStream() | Gets an input stream from the Blob object. Used in conjunction with ResultSet.getBlob(). |
> > > > |          | Blob.length() | Returns the size (length) of the BLOB data. Useful for determining file size or allocating buffers. |
> > > >
> > > > Byte Array Access
> > > >
> > > > | Category | Method | Purpose |
> > > > |----------|--------|---------|
> > > > | Bytes    | ResultSet.getBytes(String columnName) | Reads the entire BLOB directly into a byte array. Only suitable for small files, as large files may cause OutOfMemoryError (OOM). |
> > > >
> > > > ------------------------------
> > > >
> > > > This comparison highlights that MySQL provides robust streaming I/O
> > > > support for BLOBs, enabling efficient handling of large binary objects
> > > > without loading them entirely into memory, a capability that could be
> > > > valuable to implement in Paimon for better multimodal data management.
> > > >
> > > > On Tue, Sep 16, 2025 at 3:08 PM Jingsong Li <[email protected]>
> > > > wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > I want to start a discussion about blob files.
> > > > >
> > > > > Multimodal data storage needs to support multimedia files, including
> > > > > text, images, audio, video, embedding vectors, etc. Paimon needs to
> > > > > meet the demand for multimodal data entering the lake, and achieve
> > > > > unified storage and efficient management of multimodal data and
> > > > > structured data.
> > > > >
> > > > > Most multimodal files are actually not large, around 1MB or even
> > > > > below 1MB, but there are also relatively large multimodal files, such
> > > > > as 10GB+ files, which pose storage challenges for us.
> > > > >
> > > > > Consider two ways:
> > > > >
> > > > > 1.
> > > > > Multimodal data can be directly stored in column files, such as
> > > > > Parquet or Lance files. The biggest problem with this solution is
> > > > > that it brings challenges to the file format, such as solving the
> > > > > read and write problems of OOM, which requires a streaming API to the
> > > > > file format to avoid loading the entire multimodal data. In addition,
> > > > > the additional fields of multimodal data may undergo frequent
> > > > > changes, additions, or even deletions. If these changes require
> > > > > multimodal files to participate in reading and writing together, the
> > > > > cost is very high.
> > > > >
> > > > > 2. Multimodal data is stored on object storage, and Parquet
> > > > > references these files through pointers. The downside of doing so is
> > > > > that it cannot directly manage multimodal data and may result in a
> > > > > large number of small files, which can cause a significant amount of
> > > > > file IO during use, leading to decreased performance and increased
> > > > > costs.
> > > > >
> > > > > We should consider new ways to satisfy this requirement. Create a
> > > > > high-performance architecture specifically designed for mixed
> > > > > scenarios of massive small and large multimodal files, achieving high
> > > > > throughput writing and low latency reading, meeting the stringent
> > > > > performance requirements of AI, big data, and other businesses.
> > > > >
> > > > > A more intuitive solution is: independent multimodal storage and
> > > > > structured storage, separate management of multimodal storage,
> > > > > introduction of bin file mechanism to store multiple multimodal data,
> > > > > Parquet still references multimodal data through pointers.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-35%3A+Introduce+Blob+to+store+multimodal+data
> > > > >
> > > > > Best,
> > > > > Jingsong
> >
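
The bin file idea in the original message above, where several multimodal objects share one file and the structured side keeps only pointers, can be sketched as follows. This is a minimal sketch and not the PIP-35 file format: `BlobRef`, `BinFileWriter`, and `BinFileReader` are hypothetical names, and the layout (raw bytes appended back to back, addressed by offset and length) is an assumption chosen for brevity.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical pointer that a structured row could store instead of the bytes. */
final class BlobRef {
    final long offset;
    final int length;

    BlobRef(long offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

/** Appends many blobs to a single file and returns one pointer per blob. */
final class BinFileWriter implements AutoCloseable {
    private final FileChannel channel;
    private long position;

    BinFileWriter(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
    }

    BlobRef append(byte[] blob) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        long start = position;
        while (buf.hasRemaining()) {
            position += channel.write(buf, position);
        }
        return new BlobRef(start, blob.length);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}

/** Reads one blob back by seeking to its offset, without touching its neighbors. */
final class BinFileReader {
    private BinFileReader() {}

    static byte[] read(Path file, BlobRef ref) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(ref.length);
            long pos = ref.offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    throw new IOException("bin file is shorter than the pointer claims");
                }
                pos += n;
            }
            return buf.array();
        }
    }
}
```

Packing small objects this way addresses the small-files concern raised for option 2, since many objects map to one file, while the pointer indirection keeps schema changes on the structured side from rewriting the blob bytes, which is the cost noted for option 1.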
