Re: performances of S3 writing with many buckets in parallel

2020-02-07 Thread Kostas Kloudas
Hi Enrico, Nice to hear from you and thanks for checking it out! This can be helpful for people using the BucketingSink, but I would recommend that you switch to the StreamingFileSink, which is the "new version" of the BucketingSink. In fact, the BucketingSink is going to be removed in one of the

Re: performances of S3 writing with many buckets in parallel

2020-02-07 Thread Enrico Agnoli
I finally found the time to dig a little more into this and found the real problem. The culprit behind the slowdown is this piece of code:

Re: performances of S3 writing with many buckets in parallel

2019-10-16 Thread Kostas Kloudas
Hi Enrico, Thanks for opening the discussion! One thing to note that may help is that the Hadoop S3 FS tries to imitate a filesystem on top of S3: - before writing a key it checks if the "parent directory" exists by checking for a key with the prefix up to the last "/" - it creates empty marker
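
A minimal sketch of the parent-directory check described in this message, assuming a toy in-memory key store rather than the real Hadoop S3 FS. The names ToyStore, parent_prefix and write_with_parent_check are hypothetical and only illustrate why each written key costs an extra existence lookup; the marker behaviour is left out because the preview is cut off.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for an object store; counts lookups."""

    def __init__(self):
        self.keys = set()
        self.lookups = 0

    def exists(self, key):
        # Every existence check is counted, mimicking one extra request.
        self.lookups += 1
        return key in self.keys

    def put(self, key):
        self.keys.add(key)


def parent_prefix(key):
    """Return the prefix up to and including the last "/", or "" if none."""
    idx = key.rfind("/")
    return key[: idx + 1] if idx >= 0 else ""


def write_with_parent_check(store, key):
    """Check for the "parent directory" key, then write the key itself."""
    parent = parent_prefix(key)
    parent_exists = store.exists(parent) if parent else True
    store.put(key)
    return parent_exists


store = ToyStore()
for bucket in ("a", "b", "c"):
    write_with_parent_check(store, f"data/{bucket}/part-0.parquet")
print(store.lookups)  # one lookup per written key in this toy model
```

In this toy model the number of lookups grows with the number of keys written, which is the kind of per-bucket overhead the thread is concerned with.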

performances of S3 writing with many buckets in parallel

2019-10-15 Thread Enrico Agnoli
Starting the discussion here after an initial conversation with the Ververica and AWS teams during FlinkForward. I'm investigating the performance of a Flink job that transports data from Kafka to an S3 sink. We are using a BucketingSink to write Parquet files. The bucketing logic divides the