Hi,

We have an in-house platform that we want to integrate with external
clients via HDFS. They have lots of existing files and they continuously
write more data to HDFS. Ideally, we would like to have a Flink job that
takes care of ingesting the data, as one of the requirements is to execute
SQL on top of these files. We looked at the existing FileSource
implementation, but we believe it is not well suited for this use case:
- firstly, we'd have to ingest all files initially present on HDFS before
completing the first checkpoint - this is unacceptable for us, as we would
have to reprocess all the files in case of an early job failure. Not to
mention the state blowing up for aggregations.
- secondly, we see no way to establish a valid watermark strategy. This is
a major pain point that we can't find the right answer for. We don't want
to assume too much about the data itself. In general, the only solutions
we see require some sort of synchronization across subtasks. On the other
hand, the simplest strategy is to delay the watermark, but in that case we
are afraid of accidentally dropping events.
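To make that last trade-off concrete, here is a toy model of what we mean
by delaying the watermark (plain Java, not Flink's WatermarkStrategy API;
the class name and semantics are our own illustration): the watermark
trails the maximum seen timestamp by a fixed delay, and any event at or
behind the current watermark would be treated as late and dropped.

```java
// Toy model of a bounded-delay watermark (illustration only, not Flink API):
// watermark = maxSeenTimestamp - delay. An event whose timestamp is at or
// behind the current watermark counts as late and would be dropped.
class DelayedWatermark {
    private final long delay;
    private long maxTs = Long.MIN_VALUE;

    DelayedWatermark(long delay) {
        this.delay = delay;
    }

    // Current watermark, or Long.MIN_VALUE before any event is seen.
    long currentWatermark() {
        return maxTs == Long.MIN_VALUE ? Long.MIN_VALUE : maxTs - delay;
    }

    // Processes one event timestamp; returns false if the event is late
    // (i.e. it would be dropped by a downstream event-time operator).
    boolean onEvent(long ts) {
        boolean late = ts <= currentWatermark();
        if (ts > maxTs) {
            maxTs = ts;
        }
        return !late;
    }
}
```

The point is the tension: a larger delay tolerates more out-of-orderness
between files, but increases latency and state; a smaller delay risks
dropping events from files that arrive behind others.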

Given this, we are thinking about implementing our own file source. Has
someone in the community already tried solving a similar problem? If not,
any suggestions about the concerns we raised would be valuable.
