Hi,

Looking at the problem broadly, file size is directly tied to how often
you commit. No matter which system you use, this trade-off will always be
there. If you commit frequently, you will be close to real time, but you
will have numerous small files. If you commit at long intervals, you will
have larger files, but that is effectively a "batch world". We solved this
problem at my company by running two systems: one that commits files at
short intervals, reliably bringing data into durable storage, and one that
rolls up these small files. It's actually really simple to implement if you
don't try to do it in a single job.
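
To make the roll-up side concrete, here is a rough sketch of what such a
compaction step can look like, using plain parquet-avro rather than any
Flink API. The class and method names, the Snappy codec and the idea of
compacting one "closed" bucket at a time are placeholders of my own, not
anything prescribed:

import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class BucketCompactor {

    /** Rewrites the small files of one closed bucket into a single large file. */
    static void compactBucket(List<Path> smallFiles, Path compactedFile, Schema schema)
            throws IOException {
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(compactedFile)
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            for (Path file : smallFiles) {
                try (ParquetReader<GenericRecord> reader =
                         AvroParquetReader.<GenericRecord>builder(file).build()) {
                    GenericRecord record;
                    while ((record = reader.read()) != null) {
                        writer.write(record);
                    }
                }
            }
        }
        // Afterwards, swap the compacted file into the table location and drop
        // the small files; that part is storage- and catalog-specific.
    }
}

The frequent-commit job stays a plain StreamingFileSink job; the compactor
only ever touches buckets that the streaming job has finished writing to.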

Best
Ayush

On Fri, Sep 11, 2020 at 2:22 PM Robert Metzger <rmetz...@apache.org> wrote:

> Hi Marek,
>
> what you are describing is a known problem in Flink. There are some
> thoughts on how to address this in
> https://issues.apache.org/jira/browse/FLINK-11499 and
> https://issues.apache.org/jira/browse/FLINK-17505
> Maybe some of the ideas there already help with your current problem (for
> example, using long checkpoint intervals).
>
> A related idea to your option (2) is to write your data in the Avro format
> and then regularly run a batch job that transforms it from Avro to Parquet.
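>
> As a rough illustration only (the method and file names are placeholders,
> and error handling and bucketing are omitted), the conversion step of such
> a batch job could look like this with plain avro and parquet-avro:
>
> import java.io.File;
> import java.io.IOException;
> import org.apache.avro.file.DataFileReader;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.avro.AvroParquetWriter;
> import org.apache.parquet.hadoop.ParquetWriter;
>
> static void avroFileToParquet(File avroFile, Path parquetFile) throws IOException {
>     try (DataFileReader<GenericRecord> reader =
>              new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>());
>          ParquetWriter<GenericRecord> writer = AvroParquetWriter
>              .<GenericRecord>builder(parquetFile)
>              .withSchema(reader.getSchema())   // schema comes from the Avro file
>              .build()) {
>         for (GenericRecord record : reader) {
>             writer.write(record);
>         }
>     }
> }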
>
> I hope these are some helpful pointers. I don't have a good overview of
> other potential open source solutions.
>
> Best,
> Robert
>
>
> On Thu, Sep 10, 2020 at 5:10 PM Marek Maj <marekm...@gmail.com> wrote:
>
>> Hello Flink Community,
>>
>> When designing our data pipelines, we very often encounter the
>> requirement to stream traffic (usually from Kafka) to an external
>> distributed file system (usually HDFS or S3). This data is typically meant
>> to be queried from Hive/Presto or similar tools. Preferably, the data sits
>> in a columnar format like Parquet.
>>
>> Currently, using Flink, it is possible to leverage StreamingFileSink to
>> achieve what we want to some extent. It satisfies our requirements to
>> partition data by event time and to ensure exactly-once semantics and
>> fault tolerance with checkpointing. Unfortunately, using a bulk writer
>> like ParquetWriter comes at the price of producing a large number of
>> files, which degrades query performance. Our sink setup looks roughly like
>> the sketch below.
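>>
>> (The path, the "events" stream and "schema" below are placeholders; the
>> point is only that bulk formats such as Parquet roll on every checkpoint,
>> which is exactly where the many small files come from.)
>>
>> StreamingFileSink<GenericRecord> sink = StreamingFileSink
>>     .forBulkFormat(
>>         new Path("hdfs:///warehouse/events"),          // placeholder path
>>         ParquetAvroWriters.forGenericRecord(schema))   // bulk Parquet writer
>>     .withBucketAssigner(
>>         new DateTimeBucketAssigner<>("yyyy-MM-dd--HH")) // placeholder bucketing
>>     .build();
>> events.addSink(sink);
>> // With a bulk-encoded format the sink can only roll on checkpoint, so each
>> // checkpoint interval produces new files per active bucket and subtask.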
>>
>> I believe that many companies struggle with similar use cases, and I know
>> that some of them have already approached the problem. Solutions like
>> Alibaba Hologres or Netflix's solution with Iceberg described during FF
>> 2019 have emerged. Given that a full transition to a real-time data
>> warehouse may take a significant amount of time and effort, I would like
>> to focus primarily on solutions for tools like Hive/Presto backed by a
>> distributed file system. Usually those are the systems we are integrating
>> with.
>>
>> So what options do we have? Maybe I missed some existing open source
>> tool?
>>
>> Currently, I can come up with two approaches using Flink exclusively:
>> 1. Cache incoming traffic in Flink state until a trigger fires according
>> to a rolling strategy, probably with some special handling for late
>> events, and then output the data with StreamingFileSink (see the sketch
>> after this list). The solution is not perfect, as it may introduce
>> additional latency, and queries will still be less performant compared to
>> fully compacted files (the late events problem). The biggest issue I am
>> afraid of is actually the performance drop while releasing data from Flink
>> state, and its peaky character.
>> 2. Focus on implementing a batch rewrite job that compacts data offline.
>> The source for the job could be either Kafka or the small files produced
>> by another job that uses plain StreamingFileSink. The drawback is that the
>> whole system gets more complex, additional maintenance is needed and,
>> maybe more troubling, we enter the batch world again (how can we know that
>> no more late data will come and that we can safely run the job?).
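>>
>> For option 1, what I have in mind is roughly the following sketch (class
>> name, key choice, flush interval and the use of GenericRecord are all
>> placeholders; storing GenericRecord in state would also need an
>> appropriate serializer setup):
>>
>> import org.apache.avro.generic.GenericRecord;
>> import org.apache.flink.api.common.state.ListState;
>> import org.apache.flink.api.common.state.ListStateDescriptor;
>> import org.apache.flink.configuration.Configuration;
>> import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
>> import org.apache.flink.util.Collector;
>>
>> // Buffers records in keyed state and releases them on a processing-time
>> // timer, so the downstream StreamingFileSink sees data only periodically.
>> public class BufferedEmitter
>>         extends KeyedProcessFunction<String, GenericRecord, GenericRecord> {
>>
>>     private static final long FLUSH_INTERVAL_MS = 15 * 60 * 1000L; // placeholder
>>     private transient ListState<GenericRecord> buffer;
>>
>>     @Override
>>     public void open(Configuration parameters) {
>>         buffer = getRuntimeContext().getListState(
>>             new ListStateDescriptor<>("buffer", GenericRecord.class));
>>     }
>>
>>     @Override
>>     public void processElement(GenericRecord value, Context ctx,
>>             Collector<GenericRecord> out) throws Exception {
>>         buffer.add(value);
>>         // Align timers to interval boundaries: timers with identical
>>         // timestamps are deduplicated per key, so each key keeps at most
>>         // one pending timer per interval.
>>         long now = ctx.timerService().currentProcessingTime();
>>         long flushAt = (now / FLUSH_INTERVAL_MS + 1) * FLUSH_INTERVAL_MS;
>>         ctx.timerService().registerProcessingTimeTimer(flushAt);
>>     }
>>
>>     @Override
>>     public void onTimer(long timestamp, OnTimerContext ctx,
>>             Collector<GenericRecord> out) throws Exception {
>>         // This burst of output is exactly the peak release I am worried about.
>>         for (GenericRecord record : buffer.get()) {
>>             out.collect(record);
>>         }
>>         buffer.clear();
>>     }
>> }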
>>
>> I would really love to hear the community's thoughts on this.
>>
>> Kind regards
>> Marek
>>
>
