Hello Ryan,
Looks like it's limited by both the Parquet implementation and the Thrift
message methods. Am I missing anything?
From cpp/src/parquet/types.h
struct ByteArray {
ByteArray() : len(0), ptr(NULLPTR) {}
ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
uint32_t len;
const uint8_t* ptr;
};
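To make the limit concrete, here is a minimal sketch, not taken from the Parquet sources. It copies the struct above (using nullptr instead of the NULLPTR macro) and adds two hypothetical names, kMaxByteArrayLen and FitsInByteArray, purely for illustration. It assumes the only constraint of interest is the width of the len field.

```cpp
#include <cstdint>
#include <limits>

// Same layout as the ByteArray quoted above, with nullptr in place of NULLPTR.
struct ByteArray {
  ByteArray() : len(0), ptr(nullptr) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

// Hypothetical constant: the largest length, in bytes, that ByteArray::len
// can represent (2^32 - 1, just under 4 GiB).
constexpr uint64_t kMaxByteArrayLen = std::numeric_limits<uint32_t>::max();

// Hypothetical helper: true if a value of `size` bytes can be described by
// a ByteArray without truncating the length.
inline bool FitsInByteArray(uint64_t size) { return size <= kMaxByteArrayLen; }
```

So a 4 GiB blob (2^32 bytes) already fails FitsInByteArray, well short of the 2^64 maximum asked for.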
From cpp/src/parquet/thrift.h
template <class T>
inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg);
template <class T>
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out);
-Brian
On 4/5/19, 1:32 PM, "Ryan Blue" <[email protected]> wrote:
Hi Brian,
This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?
On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <[email protected]> wrote:
> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet. Currently, the ByteArray len field is
> a uint32_t. It looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask. Thanks for considering it.
>
> -Brian
>
--
Ryan Blue
Software Engineer
Netflix