> The fact that Bounded vs Unbounded JOIN is performed by considering
> Bounded PCollection as a Sideinput means that the Bounded PCollection
> should fit into Memory. Am I right? In that case bounded PCollection of
> Hive (or) HDFS, where data might not fit into Memory cannot be JOINED with
> Kafka?
>

My discussion above didn't cover a slowly-changing data source that is
too large to fit into memory, since the [*Pattern: Slowly-changing
lookup cache*] does not address data size.

I don't have insight into the case where the source is HDFS or Hive
containing a very large volume of slowly-changing data.
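To make the memory constraint you asked about concrete, here is a plain-Python sketch (not the Beam API; the function and variable names are illustrative) of what a side-input join conceptually does: the bounded PCollection is materialized as an in-memory map that each worker consults while processing the unbounded stream, which is why the bounded side must fit in memory.

```python
def side_input_join(stream_records, bounded_lookup):
    """Enrich each streaming record using an in-memory lookup table.

    stream_records: iterable of (key, value) pairs, e.g. read from Kafka.
    bounded_lookup: dict built from the bounded PCollection; this is the
    part that must fit into each worker's memory.
    """
    for key, value in stream_records:
        # Missing keys yield None, mirroring an outer join on the stream side.
        yield key, value, bounded_lookup.get(key)

# Usage: join a small "Hive-like" table against a stream of events.
lookup = {"user1": "US", "user2": "DE"}
stream = [("user1", 10), ("user3", 7)]
joined = list(side_input_join(stream, lookup))
# joined == [("user1", 10, "US"), ("user3", 7, None)]
```

If the lookup data were larger than worker memory, this broadcast-style approach breaks down, which is exactly the limitation raised in the question above.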


>
> Does this approach have something to do with Watermark? As the computation
> might take time depending on the size of the Bounded Data, and the window
> might get expired before the result for the window is emitted.
>

Indeed, to understand the details of how watermarks work with the core
Beam primitives, you would have to check the implementation. There are
high-level explanations in [1] and [2] for your reference.

I think in the Beam model you do not need to reason about the watermark
when processing data. If data is not too late (i.e., not later than the
GC watermark), your pipeline is guaranteed to process it; otherwise the
data will be dropped.
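The process-or-drop rule above can be sketched in plain Python (not the Beam API; the window size and allowed-lateness values are hypothetical). The only question a pipeline author has to answer is whether an element's window has been garbage-collected, i.e. whether the watermark has passed the window end plus the allowed lateness:

```python
WINDOW_SIZE = 60       # hypothetical 60-second fixed windows
ALLOWED_LATENESS = 30  # hypothetical allowed lateness, in seconds

def is_droppable(element_ts, watermark):
    """An element is dropped once the watermark passes its window's
    GC time (window end + allowed lateness)."""
    window_end = (element_ts // WINDOW_SIZE + 1) * WINDOW_SIZE
    gc_time = window_end + ALLOWED_LATENESS
    return watermark >= gc_time

# An element with timestamp 50 falls in window [0, 60); its GC time is 90.
print(is_droppable(50, watermark=80))  # late but within allowed lateness: processed
print(is_droppable(50, watermark=95))  # past the GC watermark: dropped
```

So even if computing the bounded side input takes a while, data is only lost when it arrives after its window's GC watermark, not merely because processing was slow.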



[1]: https://www.youtube.com/watch?v=TWxSLmkWPm4
[2]:
https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#heading=h.7a03n7d5mf6g
