I came across an interesting user report at
https://github.com/apache/beam/issues/32866 which made me realize that,
while providing metadata about a bad element in the "bad records" output
is useful, we don't make it easy to extract the output into a
PCollection of the original elements. The output schema contains the
original element as well as metadata about what error occurred, and in
an ordinary Beam pipeline one could easily apply a Map(lambda error_row:
error_row.element), but YAML doesn't have Map, just MapToFields
(primarily to be more schema friendly).
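To make the shape of the problem concrete, here is a rough, Beam-free
sketch of what that Map does. A plain list stands in for the
PCollection, and the metadata field names (msg, stack) and the Original
type are assumptions for illustration only; the only field the
discussion relies on is element.

```python
from typing import Any, NamedTuple


class Original(NamedTuple):
    """Hypothetical schema of the elements that failed."""
    fld1: int
    fld2: str


class ErrorRow(NamedTuple):
    """Stand-in for a "bad records" row: the original element plus metadata.

    The metadata field names here are assumed, not taken from the report.
    """
    element: Any
    msg: str
    stack: str


# A list standing in for the error-output PCollection.
error_rows = [
    ErrorRow(Original(1, "a"), "division by zero", "Traceback ..."),
    ErrorRow(Original(2, "b"), "bad value", "Traceback ..."),
]

# Equivalent of Map(lambda error_row: error_row.element):
# drop the metadata and keep only the original element.
originals = [error_row.element for error_row in error_rows]
```

The result is a collection with the schema of the original input
(fld1, fld2), which is what one would want to feed back into a retry
or dead-letter branch.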
There are a couple of options:
(0) Leave things as they are. One can write

    type: MapToFields
    config:
      fields:
        fld1: element.fld1
        fld2: element.fld2
        ...

This is of course a bit ugly as one needs to enumerate (and know) the
set of original fields.
(1a) Provide a special operation "Unnest" that takes a single field
and emits it as the top-level element. This can of course result in
unschema'd PCollections (which are supported, but generally don't play
as well with the other operations, including xlang ones).
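As a rough illustration of the semantics (not a proposed
implementation), an Unnest could behave like the sketch below. Rows are
modelled as plain dicts, and the name unnest and its field argument are
hypothetical. The second input row shows how the result becomes
unschema'd when the chosen field is not itself a row.

```python
from collections.abc import Iterable, Iterator, Mapping
from typing import Any


def unnest(rows: Iterable[Mapping[str, Any]], field: str) -> Iterator[Any]:
    """Emit the value of `field` from each row as the top-level element."""
    for row in rows:
        yield row[field]


rows = [
    # Nested row: the unnested result still has fields (a schema).
    {"element": {"fld1": 1, "fld2": "a"}, "msg": "boom"},
    # Primitive value: the unnested result has no schema at all.
    {"element": 42, "msg": "boom"},
]

unnested = list(unnest(rows, "element"))

# True where the emitted element is still row-like, False otherwise.
has_schema = [isinstance(value, Mapping) for value in unnested]
```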
(1b) Just provide a Map. This is a generalization of 1a, but on the
other hand would be more prone to abuse.
(1c) We could spell this as

    type: MapToFields
    config:
      fields:
        "*": element

(the key has to be quoted, since a bare * starts an alias in YAML).
IIRC, we already have the special case of "*" in our join syntax, and
we could re-use a bunch of the MapToFields infrastructure. But maybe
it's too obscure?
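One way to think about the "*" spelling is as sugar for option (0): at
pipeline construction time the star entry would be expanded into one
mapping per nested field. A minimal sketch of that expansion, assuming
the nested field names are known from the input schema (expand_star and
its arguments are hypothetical names, not existing Beam YAML code):

```python
def expand_star(
    fields: dict[str, str],
    nested_field_names: dict[str, list[str]],
) -> dict[str, str]:
    """Expand a "*" entry into explicit per-field mappings.

    `fields` is the MapToFields `fields` config; `nested_field_names`
    maps a row-valued field to the names of its sub-fields.
    """
    expanded: dict[str, str] = {}
    for name, expr in fields.items():
        if name == "*":
            # One output field per sub-field of the referenced row.
            for sub in nested_field_names[expr]:
                expanded[sub] = f"{expr}.{sub}"
        else:
            expanded[name] = expr
    return expanded


expanded = expand_star({"*": "element"}, {"element": ["fld1", "fld2"]})
```

This only works when the nested field is itself a schema'd row, which
is the case for the error output but not for arbitrary fields.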
(2) Add an optional argument to error_handling to omit the metadata.
This would require a bit of a hack to support ubiquitously, and
wouldn't solve the more general problem.
Maybe there are some other ideas as well?