[ 
https://issues.apache.org/jira/browse/NIFI-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Youngjun Kim updated NIFI-15869:
--------------------------------
    Environment: linux  (was: linux (k8s))

> PutBigQuery - Add Unknown Field Behavior property to handle fields absent 
> from the BigQuery table schema
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-15869
>                 URL: https://issues.apache.org/jira/browse/NIFI-15869
>             Project: Apache NiFi
>          Issue Type: Improvement
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.8.0, 
> 2.7.2, 2.9.0
>         Environment: linux
>            Reporter: Youngjun Kim
>            Priority: Major
>              Labels: GCP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The PutBigQuery processor (nifi-gcp-nar) generates a Protobuf descriptor 
> from the BigQuery table schema via BQTableSchemaToProtoDescriptor. When 
> converting records to Protobuf messages, ProtoUtils.createMessage() 
> iterates over the descriptor fields only, so any record field that does not 
> exist in the table schema is discarded. No warning is logged and the 
> FlowFile is routed to success.
> This behavior can cause undetected data loss in change data capture (CDC) 
> pipelines. When a column is added to the source database, the corresponding 
> field appears in NiFi records but is dropped before the Protobuf message is 
> sent to BigQuery. The BigQuery Storage Write API would reject such fields 
> with SCHEMA_MISMATCH_EXTRA_FIELD if they were included in the request, but 
> because NiFi removes them client-side, no error is raised.
> This is distinct from the existing "Skip Invalid Rows" property. "Skip 
> Invalid Rows" controls whether rows that fail during Protobuf serialization 
> (e.g., type mismatches) are skipped or cause the entire FlowFile to fail. It 
> operates at the point where an exception is already raised. In contrast, 
> fields absent from the BigQuery schema are silently removed during encoding 
> before any error can occur, so "Skip Invalid Rows" has no effect on this data 
> loss scenario.
> This improvement adds an "Unmatched Field Behavior" property with three 
> allowable values:
> - "Ignore Unmatched Fields" (default): current behavior, no logging
> - "Warn on Unmatched Fields": logs a warning for each affected record and 
> continues writing
> - "Fail on Unmatched Fields": routes the FlowFile to the failure 
> relationship, or drops the affected record when "Skip Invalid Rows" is set 
> to true
> The detection point is recordToProtoMessage(), where the rawMap fields are 
> compared against the BigQuery table schema before Protobuf encoding. The 
> property name and allowable values follow the convention of 
> PutDatabaseRecord's "Unmatched Field Behavior" and "Unmatched Column 
> Behavior" properties, which handle the analogous mismatch in database 
> writes.
> Reference: https://cloud.google.com/bigquery/docs/write-api-best-practices
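
The comparison described for recordToProtoMessage() could look roughly like
the sketch below. It is an illustration under stated assumptions, not code
from the NiFi codebase: the class, enum, exception, and method names are all
hypothetical, the record is modeled as a flat Map (nested records are not
handled), and field names are matched exactly, without the case-insensitive
comparison a real implementation might need.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names are taken from NiFi. */
final class UnmatchedFieldCheck {

    /** Mirrors the three allowable values proposed in the issue. */
    enum UnmatchedFieldBehavior { IGNORE, WARN, FAIL }

    /** Raised for FAIL; a caller could route to failure or skip the record. */
    static final class UnmatchedFieldException extends RuntimeException {
        UnmatchedFieldException(String message) {
            super(message);
        }
    }

    /** Returns the record field names absent from the schema, in record order. */
    static Set<String> findUnmatched(Map<String, Object> record, Set<String> schemaFields) {
        Set<String> unmatched = new LinkedHashSet<>(record.keySet());
        unmatched.removeAll(schemaFields);
        return unmatched;
    }

    /**
     * Applies the configured behavior, then returns a copy of the record that
     * contains only fields present in the schema (what would be encoded).
     * Warnings are collected in a list here instead of going to a logger.
     */
    static Map<String, Object> apply(Map<String, Object> record,
                                     Set<String> schemaFields,
                                     UnmatchedFieldBehavior behavior,
                                     List<String> warnings) {
        Set<String> unmatched = findUnmatched(record, schemaFields);
        if (!unmatched.isEmpty()) {
            switch (behavior) {
                case FAIL:
                    throw new UnmatchedFieldException(
                            "Record fields not in table schema: " + unmatched);
                case WARN:
                    warnings.add("Dropping fields not in table schema: " + unmatched);
                    break;
                case IGNORE:
                default:
                    break; // current behavior: drop without logging
            }
        }
        Map<String, Object> kept = new LinkedHashMap<>(record);
        kept.keySet().retainAll(schemaFields);
        return kept;
    }
}
```

With a schema of {id, name} and a record that also carries new_col, IGNORE
and WARN both return the two matching fields (WARN additionally records one
message), while FAIL throws before anything is encoded. How the exception
interacts with "Skip Invalid Rows" is left to the caller in this sketch.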



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
