I have a use-case where I'm extracting embedded items from
archive file formats which themselves have embedded items. For example a
zip file with emails with attachments. The goal in this example would be to
create a PCollection where each email is an element as well as each
attachment being an element. (No need to create a tree structure here.)
There are certain criteria which would prevent continuing embedded item
extraction, such as an item SHA being present in a "rejection" list. The
pipeline will perform a series of transformations on the items and then
continue to extract embedded items. This type of problem lends itself to an
iterative algorithm. My understanding is that Beam does not support
iterative algorithms to the extent Spark does: in Beam I would have to
persist the results of each iteration and instantiate a new pipeline for
each one. This _works_, but it isn't ideal. The "rejection" list is a
PCollection of a few million elements, and re-reading it on every
iteration isn't ideal either.
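For clarity, here is the extraction logic I mean, sketched as a plain
single-process worklist loop (the `Item` class, SHA-256 hashing, and the
in-memory rejection set are all hypothetical stand-ins for the real archive
formats and PCollections):

```python
import hashlib

class Item:
    """Hypothetical extracted item: raw bytes plus embedded children."""
    def __init__(self, content, children=()):
        self.content = content
        self.children = list(children)

    @property
    def sha(self):
        return hashlib.sha256(self.content).hexdigest()

def extract_all(roots, rejected_shas):
    """Emit every item reachable from the roots, queueing each item's
    embedded children, but stop descending at any item whose SHA is on
    the rejection list."""
    results = []
    work = list(roots)
    while work:
        item = work.pop()
        if item.sha in rejected_shas:
            continue  # rejection criterion: do not emit or descend
        results.append(item)
        work.extend(item.children)  # extract embedded items
    return results

# Example: a zip containing an email with an attachment, plus a rejected item.
attachment = Item(b"attachment")
email = Item(b"email", [attachment])
rejected = Item(b"rejected")
zip_file = Item(b"zip", [email, rejected])
items = extract_all([zip_file], {rejected.sha})
```

In the Beam version each pass of this `while` loop becomes its own pipeline
run, with the emitted items persisted between runs, which is exactly the
overhead I'd like to avoid.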

Is there a way to write such an iterative algorithm in Beam without having
to create a new pipeline for each iteration? Further, would something like
SplittableDoFn be a potential solution? Looking through the user's guide
and some existing SplittableDoFn implementations I'm thinking not, but I'm
still trying to understand SplittableDoFns.
