[
https://issues.apache.org/jira/browse/BEAM-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17549285#comment-17549285
]
Danny McCormick commented on BEAM-12139:
----------------------------------------
This issue has been migrated to https://github.com/apache/beam/issues/20891
> Suspected data loss (and/or duplicates) bug in BigQueryServicesImpl
> -------------------------------------------------------------------
>
> Key: BEAM-12139
> URL: https://issues.apache.org/jira/browse/BEAM-12139
> Project: Beam
> Issue Type: Test
> Components: io-java-gcp
> Reporter: Alex Amato
> Priority: P3
>
> When the BigQuery insertAll API returns errors for specific failed rows,
> rows are selected [here for
> retrying|https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L967]
> using the errorIndex returned with each error:
> retryRows.add(rowsToPublish.get(errorIndex));
> However, this errorIndex is not a valid index into rowsToPublish, so it
> looks like the wrong rows are being selected for retry.
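> For context, a simplified sketch of the pattern being described (the loop and
> variable names are illustrative, not the exact Beam source): errorIndex comes
> from the InsertErrors entries in the insertAll response and is applied
> directly to the full rowsToPublish list.
> for (TableDataInsertAllResponse.InsertErrors error : response.getInsertErrors()) {
>   int errorIndex = error.getIndex().intValue(); // index within the request that was sent
>   retryRows.add(rowsToPublish.get(errorIndex)); // but used against the full input list
> }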
> *Why can't you use errorIndex to index rowsToPublish?*
> Because rowsToPublish contains all of the rows that were passed into
> insertAll.
> These are then batched into smaller lists of
> ["rows"|https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L875],
> and multiple API calls are made to BigQuery, one per batch, to insert the rows.
> The errors returned by each call refer to the list of rows passed into that
> call, so they are only valid indices into "rows". Thus, they are not valid
> indices into "rowsToPublish".
> Note: the two lists have different sizes: rowsToPublish.size() > rows.size()
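> A possible direction, sketched below with hypothetical names (batches, the
> per-batch responses list, and the simplified TableRow element type are
> assumptions, not the actual Beam code): track the offset at which each batch
> starts within rowsToPublish, and translate each batch-local error index back
> to a global index before selecting rows to retry.
> import java.util.ArrayList;
> import java.util.List;
> import com.google.api.services.bigquery.model.TableDataInsertAllResponse;
> import com.google.api.services.bigquery.model.TableRow;
>
> // Hypothetical sketch, not the Beam implementation: map batch-local error
> // indices back to positions in the full rowsToPublish list.
> class RetryIndexSketch {
>   static List<TableRow> selectRowsToRetry(
>       List<TableRow> rowsToPublish,                 // all rows passed to insertAll
>       List<List<TableRow>> batches,                 // the smaller "rows" batches, in order
>       List<TableDataInsertAllResponse> responses) { // one response per batch
>     List<TableRow> retryRows = new ArrayList<>();
>     int batchStart = 0;                             // offset of the current batch
>     for (int b = 0; b < batches.size(); b++) {
>       TableDataInsertAllResponse response = responses.get(b);
>       if (response.getInsertErrors() != null) {
>         for (TableDataInsertAllResponse.InsertErrors error : response.getInsertErrors()) {
>           int localIndex = error.getIndex().intValue(); // valid only within batches.get(b)
>           int globalIndex = batchStart + localIndex;    // valid within rowsToPublish
>           retryRows.add(rowsToPublish.get(globalIndex));
>         }
>       }
>       batchStart += batches.get(b).size();          // advance past this batch
>     }
>     return retryRows;
>   }
> }
> The exact shape of the fix depends on how the batching loop in
> BigQueryServicesImpl is structured; the point of the sketch is only that the
> batch's starting offset must be added to the error index before indexing
> rowsToPublish.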