Hi Kostas, I didn’t see a follow-up to this, and have also run into this same issue of winding up with a bunch of .inprogress files when a bounded input stream ends and the job terminates.
When StreamingFileSink.close() is called, shouldn't all buckets get auto-rolled, so that the .inprogress files become part-xxx files?

Thanks,

— Ken

> On Dec 9, 2019, at 6:56 PM, Jingsong Li <jingsongl...@gmail.com> wrote:
>
> Hi Kostas,
>
> I took a look at StreamingFileSink.close(): it just deletes all temporary
> files. I know this is for failover — when a job fails, it should delete the
> temp files before the next restart.
> But for testing purposes, we just want to run a bounded streaming job. If
> no checkpoint is triggered, nothing moves the final temp files to the
> output path, so the result of the job is wrong.
> Do you have any ideas about this? Can we distinguish a "fail close" from
> a "successful finish close" in StreamingFileSink?
>
> Best,
> Jingsong Lee
>
> On Mon, Dec 9, 2019 at 10:32 PM Kostas Kloudas <kklou...@gmail.com> wrote:
> Hi Li,
>
> This is the expected behavior. All the "exactly-once" sinks in Flink
> require checkpointing to be enabled.
> We will update the documentation to be clearer in the upcoming release.
>
> Thanks a lot,
> Kostas
>
> On Sat, Dec 7, 2019 at 3:47 AM Li Peng <li.p...@doordash.com> wrote:
> >
> > Ok, I seem to have solved the issue by enabling checkpointing. Based on the
> > docs (I'm using 1.9.0), it seemed like only
> > StreamingFileSink.forBulkFormat() should have required checkpointing, but
> > based on this experience, StreamingFileSink.forRowFormat() requires it too!
> > Is this the intended behavior? If so, the docs should probably be updated.
> >
> > Thanks,
> > Li
> >
> > On Fri, Dec 6, 2019 at 2:01 PM Li Peng <li.p...@doordash.com> wrote:
> >>
> >> Hey folks, I'm trying to get StreamingFileSink to write to S3 every
> >> minute with flink-s3-fs-hadoop, and based on the default rolling policy,
> >> which is configured to "roll" every 60 seconds, I thought that would be
> >> automatic (I interpreted rolling to mean actually closing a multipart
> >> upload to S3).
> >>
> >> But I'm not actually seeing any files written to S3 at all; instead I see
> >> a bunch of open multipart uploads when I check the AWS S3 console, for
> >> example:
> >>
> >> "Uploads": [
> >>     {
> >>         "Initiated": "2019-12-06T20:57:47.000Z",
> >>         "Key": "2019-12-06--20/part-0-0"
> >>     },
> >>     {
> >>         "Initiated": "2019-12-06T20:57:47.000Z",
> >>         "Key": "2019-12-06--20/part-1-0"
> >>     },
> >>     {
> >>         "Initiated": "2019-12-06T21:03:12.000Z",
> >>         "Key": "2019-12-06--21/part-0-1"
> >>     },
> >>     {
> >>         "Initiated": "2019-12-06T21:04:15.000Z",
> >>         "Key": "2019-12-06--21/part-0-2"
> >>     },
> >>     {
> >>         "Initiated": "2019-12-06T21:22:23.000Z",
> >>         "Key": "2019-12-06--21/part-0-3"
> >>     }
> >> ]
> >>
> >> And these uploads stay open for a long time — so far, after an hour,
> >> none of them have been closed. Is this the expected behavior? If I
> >> wanted these uploads to actually complete on S3 quickly, do I need to
> >> configure the Hadoop layer to get that done, e.g. by setting a smaller
> >> buffer/partition size to force an upload?
> >>
> >> Thanks,
> >> Li
>
> --
> Best, Jingsong Lee

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
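For anyone landing on this thread later: the fix Li describes — enabling checkpointing for a row-format StreamingFileSink — plus the default rolling policy settings being discussed can be sketched roughly as below. This is a minimal sketch against the Flink 1.9-era API; the S3 path, the intervals, and the socket source are placeholders, not anything from the thread.

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class RowFormatSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Part files (and the underlying S3 multipart uploads) are only
        // finalized on checkpoints, so without this line the sink leaves
        // .inprogress files / open multipart uploads behind.
        env.enableCheckpointing(60_000L); // checkpoint every 60 s

        StreamingFileSink<String> sink = StreamingFileSink
            .forRowFormat(new Path("s3://my-bucket/output"), // placeholder path
                          new SimpleStringEncoder<String>("UTF-8"))
            .withRollingPolicy(
                DefaultRollingPolicy.create()
                    .withRolloverInterval(60_000L)      // roll a part file every 60 s
                    .withInactivityInterval(60_000L)    // ...or after 60 s of inactivity
                    .withMaxPartSize(128L * 1024 * 1024) // ...or at 128 MB
                    .build())
            .build();

        env.socketTextStream("localhost", 9999) // placeholder source
           .addSink(sink);

        env.execute("row-format-sink-example");
    }
}
```

Note that rolling here only marks a part file as finished on the next checkpoint; it does not by itself complete the multipart upload, which is why the uploads in Li's console listing stayed open until checkpointing was enabled.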