Re: [DISCUSS] KIP-1008: ParKa - the Marriage of Parquet and Kafka

2024-03-21 Thread Andrew Schofield
Hi Xinli, Thanks for the KIP. I see that the discussion thread has died down, which is often a tricky situation with a KIP. I’ve been thinking about this KIP for a while and it was really good to be able to attend the Kafka Summit London session to get a proper understanding of it. I think it’s

Re: [DISCUSS] KIP-1008: ParKa - the Marriage of Parquet and Kafka

2023-12-02 Thread Xinli Shang
Hi Steven, Thank you for your question! Firstly, statistics such as min/max and null count exist inside the file (in the page and column index), or you can consider them as being inside the Parquet segment. These statistics will be generated at the Kafka producer in our proposal when the Parquet format is

Re: [DISCUSS] KIP-1008: ParKa - the Marriage of Parquet and Kafka

2023-11-26 Thread Steven Wu
> if we can produce the segment with Parquet, which is the native format in a data lake, the consumer application (e.g., Spark jobs for ingestion) can directly dump the segments as raw byte buffer into the data lake without unwrapping each record individually and then writing to the Parquet file

[DISCUSS] KIP-1008: ParKa - the Marriage of Parquet and Kafka

2023-11-21 Thread Xinli Shang
Hi all, Can I ask for a discussion on the newly created KIP-1008: ParKa - the Marriage of Parquet and Kafka? -- Xinli Shang