+1 to remove the Bucketing Sink. Thanks for the effort on ORC and `HadoopPathBasedBulkFormatBuilder`. With them in place, I think it's safe to get rid of the old Bucketing API.
Best,
Jingsong

On Thu, Oct 29, 2020 at 3:06 AM Kostas Kloudas <kklou...@gmail.com> wrote:
> Thanks for the discussion!
>
> From this thread I do not see any objection to moving forward with
> removing the sink.
> Given this, I will open a voting thread tomorrow.
>
> Cheers,
> Kostas
>
> On Wed, Oct 28, 2020 at 6:50 PM Stephan Ewen <se...@apache.org> wrote:
> >
> > +1 to remove the Bucketing Sink.
> >
> > It has been very common in the past to remove code that was deprecated
> > for multiple releases in favor of reducing baggage.
> > Also in cases that had no perfect drop-in replacement, but needed users
> > to forward fit the code.
> > I am not sure I understand why this case is so different.
> >
> > Why the Bucketing Sink should be thrown out, in my opinion:
> >
> > The Bucketing sink makes it easier for users to add general Hadoop
> > writes.
> > But the price is that it easily leads to data loss, because it assumes
> > flush()/sync() work reliably on Hadoop, which they don't (HDFS
> > works somewhat, S3 works not at all).
> > I think the Bucketing sink is a trap for users, that's why it was
> > deprecated long ago.
> >
> > The StreamingFileSink covers the majority of cases from the Bucketing
> > Sink.
> > It does have some friction when adding/wrapping some general Hadoop
> > writers. Parts will be solved with the transactional sink work.
> > If something is missing and blocking users, we can prioritize adding it
> > to the Streaming File Sink. Also, that is something we did before and it
> > helped to be pragmatic about moving forward, rather than being held back by
> > "maybe there is something we don't know".
> >
> > On Wed, Oct 28, 2020 at 12:36 PM Chesnay Schepler <ches...@apache.org> wrote:
> >>
> >> Then we can't remove it, because there is no way for us to ascertain
> >> whether anyone is still using it.
> >>
> >> Sure, the user ML is the best we've got, but you can't argue that we don't
> >> want any users to be affected and then use an imperfect means to find users.
> >> If you are fine with relying on the user ML, then you _are_ fine with
> >> removing it at the cost of friction for some users.
> >>
> >> To be clear, I, personally, don't have a problem with removing it (we
> >> have removed other connectors in the past that did not have a migration
> >> plan), I just reject the argumentation.
> >>
> >> On 10/28/2020 12:21 PM, Kostas Kloudas wrote:
> >> > No, I do not think that "we are fine with removing it at the cost of
> >> > friction for some users".
> >> >
> >> > I believe that this can be another discussion that we should have as
> >> > soon as we establish that someone is actually using it. The point I am
> >> > trying to make is that if no user is using it, we should remove it and
> >> > not leave unmaintained code around.
> >> >
> >> > On Wed, Oct 28, 2020 at 12:11 PM Chesnay Schepler <ches...@apache.org> wrote:
> >> >> The alternative could also be to use a different argument than "no one
> >> >> uses it", e.g., we are fine with removing it at the cost of friction for
> >> >> some users because there are better alternatives.
> >> >>
> >> >> On 10/28/2020 10:46 AM, Kostas Kloudas wrote:
> >> >>> I think that the mailing lists are the best we can do and I would say
> >> >>> that they seem to be working pretty well (e.g. the recent Mesos
> >> >>> discussion).
> >> >>> Of course they are not perfect but the alternative would be to never
> >> >>> remove anything user facing until the next major release, which I find
> >> >>> pretty strict.
> >> >>>
> >> >>> On Wed, Oct 28, 2020 at 10:04 AM Chesnay Schepler <ches...@apache.org> wrote:
> >> >>>> If the conclusion is that we shouldn't remove it if _anyone_ is using
> >> >>>> it, then we cannot remove it because the user ML obviously does not
> >> >>>> reach all users.
> >> >>>>
> >> >>>> On 10/28/2020 9:28 AM, Kostas Kloudas wrote:
> >> >>>>> Hi all,
> >> >>>>>
> >> >>>>> I am bringing this up again to see if there are any users actively
> >> >>>>> using the BucketingSink.
> >> >>>>> So far, if I am not mistaken (and really sorry if I forgot anything),
> >> >>>>> it is only a discussion between devs about the potential problems of
> >> >>>>> removing it. I totally understand Chesnay's concern about not
> >> >>>>> providing compatibility with the StreamingFileSink (SFS) and if there
> >> >>>>> are any users, then we should not remove it without trying to find a
> >> >>>>> solution for them.
> >> >>>>>
> >> >>>>> But if there are no users then I would still propose to remove the
> >> >>>>> module, given that I am not aware of any efforts to provide
> >> >>>>> compatibility with the SFS any time soon.
> >> >>>>> The reasons for removing it also include the facts that we do not
> >> >>>>> actively maintain it and we do not add new features. As for potential
> >> >>>>> missing features in the SFS compared to the BucketingSink that were
> >> >>>>> mentioned before, I am not aware of any fundamental limitations and
> >> >>>>> even if there are, I would assume that the solution is not to direct
> >> >>>>> the users to a deprecated sink but rather try to increase the
> >> >>>>> functionality of the actively maintained one.
> >> >>>>>
> >> >>>>> Please keep in mind that the BucketingSink has been deprecated since
> >> >>>>> Flink 1.9 and there is a new File Sink that is coming as part of
> >> >>>>> FLIP-143 [1].
> >> >>>>> Again, if there are any active users who cannot migrate easily, then
> >> >>>>> we cannot remove it before trying to provide a smooth migration path.
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Kostas
> >> >>>>>
> >> >>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-143%3A+Unified+Sink+API
> >> >>>>>
> >> >>>>> On Fri, Oct 16, 2020 at 4:36 PM Chesnay Schepler <ches...@apache.org> wrote:
> >> >>>>>> @Seth: Earlier in this discussion it was said that the BucketingSink
> >> >>>>>> would not be usable in 1.12.
> >> >>>>>>
> >> >>>>>> On 10/16/2020 4:25 PM, Seth Wiesman wrote:
> >> >>>>>>> +1 It has been deprecated for some time and the StreamingFileSink has
> >> >>>>>>> stabilized with a large number of formats and features.
> >> >>>>>>>
> >> >>>>>>> Plus, the bucketing sink only implements a small number of stable
> >> >>>>>>> interfaces[1]. I would expect users to continue to use the bucketing sink
> >> >>>>>>> from the 1.11 release with future versions for some time.
> >> >>>>>>>
> >> >>>>>>> Seth
> >> >>>>>>>
> >> >>>>>>> [1] https://github.com/apache/flink/blob/2ff3b771cbb091e1f43686dd8e176cea6d435501/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L170-L172
> >> >>>>>>>
> >> >>>>>>> On Thu, Oct 15, 2020 at 2:57 PM Kostas Kloudas <kklou...@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>>> @Arvid Heise I also do not remember exactly what all the
> >> >>>>>>>> problems were. The fact that we added some more bulk formats to the
> >> >>>>>>>> streaming file sink definitely reduced the non-supported features. In
> >> >>>>>>>> addition, the latest discussion I found on the topic was [1] and the
> >> >>>>>>>> conclusion of that discussion seems to be to remove it.
> >> >>>>>>>>
> >> >>>>>>>> Currently, I cannot find any obvious reason for keeping the
> >> >>>>>>>> BucketingSink, apart from the fact that we unfortunately do not have
> >> >>>>>>>> a migration plan. This is why I posted this to dev@ and user@.
> >> >>>>>>>>
> >> >>>>>>>> Cheers,
> >> >>>>>>>> Kostas
> >> >>>>>>>>
> >> >>>>>>>> [1] https://lists.apache.org/thread.html/r799be74658bc7e169238cc8c1e479e961a9e85ccea19089290940ff0%40%3Cdev.flink.apache.org%3E
> >> >>>>>>>>
> >> >>>>>>>> On Wed, Oct 14, 2020 at 8:03 AM Arvid Heise <ar...@ververica.com> wrote:
> >> >>>>>>>>> I remember this conversation popping up a few times already and I'm in
> >> >>>>>>>>> general a big fan of removing BucketingSink.
> >> >>>>>>>>>
> >> >>>>>>>>> However, until now there were a few features lacking in StreamingFileSink
> >> >>>>>>>>> that are present in BucketingSink and that are being actively used (I can't
> >> >>>>>>>>> exactly remember them now, but I can look it up if everyone else is also
> >> >>>>>>>>> suffering from bad memory). Did we manage to add them in the meantime? If
> >> >>>>>>>>> not, then it feels rushed to remove it at this point.
> >> >>>>>>>>>
> >> >>>>>>>>> On Tue, Oct 13, 2020 at 2:33 PM Kostas Kloudas <kklou...@gmail.com> wrote:
> >> >>>>>>>>>> @Chesnay Schepler Off the top of my head, I cannot find an easy way
> >> >>>>>>>>>> to migrate from the BucketingSink to the StreamingFileSink. It may be
> >> >>>>>>>>>> possible but it will require some effort because the logic would be
> >> >>>>>>>>>> "read the old state, commit it, and start fresh with the
> >> >>>>>>>>>> StreamingFileSink."
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Tue, Oct 13, 2020 at 2:09 PM Aljoscha Krettek <aljos...@apache.org> wrote:
> >> >>>>>>>>>>> On 13.10.20 14:01, David Anderson wrote:
> >> >>>>>>>>>>>> I thought this was waiting on FLIP-46 -- Graceful Shutdown Handling -- and
> >> >>>>>>>>>>>> in fact, the StreamingFileSink is mentioned in that FLIP as a motivating
> >> >>>>>>>>>>>> use case.
> >> >>>>>>>>>>> Ah yes, I see FLIP-147 as a more general replacement for FLIP-46.
> >> >>>>>>>>>>> Thanks for the reminder, we should close FLIP-46 now with an explanatory
> >> >>>>>>>>>>> message to avoid confusion.
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Arvid Heise | Senior Java Developer
> >> >>>>>>>>> <https://www.ververica.com/>

--
Best, Jingsong Lee