I have written a feature extraction pipeline in which I extract two
features using ParDos and combine the results with a CoGroupByKey.

with beam.Pipeline() as p:

    input = p | 'read input' >> beam.io.ReadFromText(input_path)

    first_feature = input | 'extract first feature' >>
beam.ParDo(ExtractFirstFeatureFn())

    second_feature = input | 'extract second feature' >>
beam.ParDo(ExtractSecondFeatureFn())

    combined = {'first feature': first_feature, 'second feature':
second_feature} | 'combine' >> beam.CoGroupByKey()



I'd like to extend the pipeline to extract an arbitrary number of features
while still aggregating them at the end with a CoGroupByKey. I'd also like
to be able to decide at runtime (via command line arguments or a
configuration file) which features will be extracted (e.g., extract
features 1 and 3, but not feature 2). How could I write such a pipeline?


Thanks in advance,

- Xander

Reply via email to