+dev <d...@beam.apache.org>

On Mon, May 4, 2020 at 3:56 AM Jose Manuel <kiuby88....@gmail.com> wrote:
> Hi guys,
>
> I think I have found something interesting about windowing.
>
> I have a pipeline that gets data from Kafka and writes to HDFS by means of
> TextIO. Once written, the generated file names are combined to apply some
> custom operations. However, the combine does not receive any data. Below is
> the highlight of my pipeline:
>
> //STREAM + processing time.
> pipeline.apply(KafkaIO.read())
>     .apply(...) //mappers, a window and a combine
>     .apply("WriteFiles", TextIO.<String>writeCustomType()
>         .to(policy).withShards(4).withWindowedWrites())
>     .getPerDestinationOutputFilenames()
>     .apply("CombineFileNames", Combine.perKey(...))
>
> Running the pipeline with Flink, I have found a log trace saying that data
> are discarded as late in the CombineFileNames combine. As a workaround, I
> have added allowed lateness to the pipeline's window strategy. It works for
> now, but this raises several questions for me.
>
> I think the problem is that getPerDestinationOutputFilenames generates the
> file names but does not maintain the windowing information. Then,
> CombineFileNames compares the file names against the watermark of the
> pipeline and discards them as late.
>
> Is there an issue with getPerDestinationOutputFilenames? Or maybe I am
> doing something wrong, and using getPerDestinationOutputFilenames +
> combine does not make sense. What do you think?
>
> Please note I am using Beam 2.17.0 with Flink 1.9.1.
>
> Many thanks,
> Jose
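For reference, a minimal sketch of the allowed-lateness workaround described in the quoted message. The window size, lateness duration, topic name, and broker address are all hypothetical placeholders, not taken from Jose's pipeline; only the overall shape (Kafka source, window with allowed lateness feeding the windowed TextIO write) mirrors the message:

```java
// Hypothetical sketch: fixed 5-minute windows with 10 minutes of allowed
// lateness, so elements arriving behind the watermark (such as the filenames
// emitted by getPerDestinationOutputFilenames()) are not dropped as late by
// a downstream Combine. Broker, topic, and durations are illustrative only.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class WindowedReadSketch {
  static PCollection<String> windowedLines(Pipeline pipeline) {
    return pipeline
        .apply(KafkaIO.<Long, String>read()
            .withBootstrapServers("kafka:9092")   // hypothetical broker
            .withTopic("events")                  // hypothetical topic
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.create())
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.standardMinutes(10))
            .discardingFiredPanes());
  }
}
```

Note this only suppresses the symptom: elements behind the watermark survive for the extra lateness budget, but it does not explain why the filename elements fall behind the watermark in the first place.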