Hi Enrico,
Nice to hear from you and thanks for checking it out!
This can be helpful for people using the BucketingSink, but I would
recommend switching to the StreamingFileSink, which is the "new
version" of the BucketingSink. In fact, the BucketingSink is going to
be removed in one of the
I finally found the time to dig a little deeper into this and found the real
problem.
The culprit of the slowdown is this piece of code:
Hi Enrico,
Thanks for opening the discussion!
One thing to note that may help is that the Hadoop S3 FS tries to
imitate a filesystem on top of S3:
- before writing a key it checks if the "parent directory" exists by
checking for a key with the prefix up to the last "/"
- it creates empty marker
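
To make the "parent directory" check above concrete, here is a toy sketch in Python. It is not the Hadoop S3 FS code: ToyStore, has_prefix, put and parent_prefix are hypothetical names made up for illustration, and the bucket is just an in-memory set. It only models the first point in the list (one extra lookup for the prefix up to the last "/" before each write) and leaves the marker handling out.

```python
def parent_prefix(key: str) -> str:
    """Return the part of `key` up to and including the last "/" ("" if none)."""
    idx = key.rfind("/")
    return key[: idx + 1] if idx >= 0 else ""


class ToyStore:
    """Hypothetical in-memory stand-in for a bucket; counts prefix lookups."""

    def __init__(self):
        self.keys = set()
        self.lookups = 0

    def has_prefix(self, prefix: str) -> bool:
        # Each call models one extra request made before the actual write.
        self.lookups += 1
        return any(k.startswith(prefix) for k in self.keys)

    def put(self, key: str) -> bool:
        """Check the "parent directory" first, then store the key.

        Returns whether some key under the parent prefix already existed.
        """
        parent = parent_prefix(key)
        existed = self.has_prefix(parent) if parent else True
        self.keys.add(key)
        return existed
```

In this toy model every write to a key containing a "/" costs one additional lookup, which is the kind of overhead that adds up when a sink writes to many buckets.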
Starting the discussion here after an initial conversation with the Ververica and AWS
teams during Flink Forward.
I'm investigating the performance of a Flink job that transports data from
Kafka to an S3 sink.
We are using a BucketingSink to write Parquet files. The bucketing logic
divides the