[
https://issues.apache.org/jira/browse/NIFI-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pierre Villard updated NIFI-15869:
----------------------------------
Affects Version/s: (was: 2.0.0)
(was: 2.1.0)
(was: 2.2.0)
(was: 2.3.0)
(was: 2.4.0)
(was: 2.5.0)
(was: 2.6.0)
(was: 2.8.0)
(was: 2.7.2)
(was: 2.9.0)
> PutBigQuery - Add Unknown Field Behavior property to handle fields absent
> from the BigQuery table schema
> --------------------------------------------------------------------------------------------------------
>
> Key: NIFI-15869
> URL: https://issues.apache.org/jira/browse/NIFI-15869
> Project: Apache NiFi
> Issue Type: Improvement
> Environment: Linux
> Reporter: Youngjun Kim
> Assignee: Pierre Villard
> Priority: Major
> Labels: GCP
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> The PutBigQuery (nifi-gcp-nar) processor currently uses the BigQuery table
> schema to generate a Protobuf descriptor via BQTableSchemaToProtoDescriptor.
> When converting records to Protobuf messages, ProtoUtils.createMessage()
> iterates over the descriptor fields (derived from the BigQuery schema) and
> silently discards any record fields that do not exist in the schema. No
> warning is logged and the FlowFile is routed to success.
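The silent drop described above can be illustrated with a minimal sketch. It uses plain Java collections as a stand-in for the Protobuf descriptor and message types; the class and method names are hypothetical and this is not the actual ProtoUtils.createMessage() code. The only assumption carried over from the description is that the loop is driven by the schema fields rather than by the record fields.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for schema-driven encoding; not the NiFi implementation.
final class SchemaDrivenEncodingSketch {

    // Builds the outgoing "message" by walking the schema fields only.
    static Map<String, Object> encode(List<String> schemaFields, Map<String, Object> record) {
        Map<String, Object> message = new LinkedHashMap<>();
        for (String field : schemaFields) {
            if (record.containsKey(field)) {
                message.put(field, record.get(field));
            }
        }
        // Record keys that are not in schemaFields are never visited,
        // so they disappear without any exception or log entry.
        return message;
    }
}
```

With a schema of `id` and `name` and a record that also carries `new_col`, the returned map holds only `id` and `name`; nothing signals that `new_col` was dropped.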
> This behavior can result in undetected data loss in CDC pipelines. When a
> column is added to the source database, the corresponding field appears in
> NiFi records but is silently dropped before the Protobuf message is sent to
> BigQuery. The BigQuery Storage Write API would reject such fields with
> SCHEMA_MISMATCH_EXTRA_FIELD if they were included in the request, but since
> NiFi removes them client-side, no error is raised.
> This is distinct from the existing "Skip Invalid Rows" property. "Skip
> Invalid Rows" controls whether rows that fail during Protobuf serialization
> (e.g., type mismatches) are skipped or cause the entire FlowFile to fail. It
> takes effect only after an exception has been raised. In contrast, fields
> absent from the BigQuery schema are silently removed during encoding, before
> any error can occur, so "Skip Invalid Rows" has no effect on this data loss
> scenario.
> This improvement adds an "Unmatched Field Behavior" property with three
> allowable values:
> - "Ignore Unmatched Fields" (default) — current behavior, no logging
> - "Warn on Unmatched Fields" — logs a warning per affected record, continues
> writing
> - "Fail on Unmatched Fields" — routes the FlowFile to the failure
> relationship, or drops the affected record when "Skip Invalid Rows" is set to
> true
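The three allowable values and their interaction with "Skip Invalid Rows" could be modeled roughly as follows. This is a sketch under stated assumptions: the enum names, the Outcome type, and the decide() helper are all hypothetical and chosen for illustration; the actual property wiring in PutBigQuery may look different.

```java
// Hypothetical names throughout; not the actual NiFi property or enum definitions.
final class UnmatchedFieldPolicySketch {

    enum UnmatchedFieldBehavior { IGNORE, WARN, FAIL }

    enum Outcome { WRITE_RECORD, WRITE_RECORD_WITH_WARNING, SKIP_RECORD, FAIL_FLOWFILE }

    // Decides what happens to a single record given the configured behavior.
    static Outcome decide(UnmatchedFieldBehavior behavior,
                          boolean hasUnmatchedFields,
                          boolean skipInvalidRows) {
        if (!hasUnmatchedFields) {
            return Outcome.WRITE_RECORD;
        }
        switch (behavior) {
            case WARN:
                // Log a warning for this record, then keep writing.
                return Outcome.WRITE_RECORD_WITH_WARNING;
            case FAIL:
                // With "Skip Invalid Rows" enabled only the record is dropped;
                // otherwise the whole FlowFile goes to the failure relationship.
                return skipInvalidRows ? Outcome.SKIP_RECORD : Outcome.FAIL_FLOWFILE;
            case IGNORE:
            default:
                // Current behavior: unmatched fields are dropped without logging.
                return Outcome.WRITE_RECORD;
        }
    }
}
```

Keeping the decision in one pure function makes the combination of the two properties easy to cover with unit tests.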
> The detection point is recordToProtoMessage(), where the field names in rawMap
> are compared against the BigQuery table schema before Protobuf encoding. The
> property name and allowable values follow the convention of PutDatabaseRecord's
> "Unmatched Field Behavior" and "Unmatched Column Behavior" properties, which
> handle the same kind of mismatch in database writes.
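The comparison at the detection point might look something like the sketch below. It assumes rawMap is a map keyed by record field name and that the table schema can be reduced to a set of column names; the class and method names are hypothetical, and the exact-match comparison is an assumption (a real implementation may need to normalize case or handle nested records).

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of the NiFi code base.
final class UnmatchedFieldDetectorSketch {

    // Returns the record field names that have no counterpart in the table schema.
    // Assumption: names are compared exactly, top-level fields only.
    static Set<String> findUnmatched(Map<String, Object> rawMap, Set<String> schemaFieldNames) {
        Set<String> unmatched = new LinkedHashSet<>(rawMap.keySet());
        unmatched.removeAll(schemaFieldNames);
        return unmatched;
    }
}
```

An empty result means the record can be encoded as before; a non-empty result is what the Warn and Fail behaviors would act on.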
> Reference: https://cloud.google.com/bigquery/docs/write-api-best-practices
--
This message was sent by Atlassian Jira
(v8.20.10#820010)