[
https://issues.apache.org/jira/browse/FLINK-39225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18065254#comment-18065254
]
陈哲凯 commented on FLINK-39225:
-----------------------------
Hi, I'd like to take this one. Will start by looking at the parent issue
FLINK-38857 to get familiar with the Triton module design first.
> Add retry with default value fallback for triton inference failures
> -------------------------------------------------------------------
>
> Key: FLINK-39225
> URL: https://issues.apache.org/jira/browse/FLINK-39225
> Project: Flink
> Issue Type: Sub-task
> Components: Table SQL / Runtime
> Reporter: featzhang
> Priority: Major
>
> Adds retry mechanism with default value fallback for Triton model inference
> failures, enabling robust error handling and downstream filtering.
> h2. Brief change log
> h3. 1. New Configuration Options (TritonOptions.java)
> * {{max-retries}}: Maximum number of retry attempts (default: 0)
> * {{retry-backoff}}: Initial backoff duration, doubled on each retry
> (default: 100ms)
> * {{default-value}}: Fallback value returned when all retries fail
> h3. 2. Retry Logic (TritonInferenceModelFunction.java)
> * Implements exponential backoff retry strategy
> * Retries on network errors and 5xx server errors (503, 504)
> * Fails immediately on 4xx client errors (configuration issues)
> * Detailed logging for each retry attempt
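
The retry rules listed above could be sketched roughly as follows. This is a minimal illustration, not the code in TritonInferenceModelFunction.java: the class RetrySketch, the exception type HttpStatusException, and the method names are hypothetical, and it assumes that every 5xx status is retryable (the issue only names 503 and 504) and that network errors surface as IOException.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class RetrySketch {

    /** Hypothetical exception carrying the HTTP status of a failed inference call. */
    static class HttpStatusException extends Exception {
        final int status;

        HttpStatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** Assumption: network errors and any 5xx are retried; everything else (e.g. 4xx) is not. */
    static boolean isRetryable(Exception e) {
        if (e instanceof HttpStatusException) {
            int status = ((HttpStatusException) e).status;
            return status >= 500 && status <= 599;
        }
        return e instanceof IOException;
    }

    /** Delay before retry number retry (1-based): initial, 2x initial, 4x initial, ... */
    static long backoffMillis(long initialMillis, int retry) {
        return initialMillis << (retry - 1);
    }

    /** Runs call, retrying up to maxRetries times on retryable failures. */
    static <T> T callWithRetry(Callable<T> call, int maxRetries, long initialBackoffMillis)
            throws Exception {
        int retry = 0;
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!isRetryable(e) || retry >= maxRetries) {
                    throw e; // non-retryable error, or retries exhausted
                }
                retry++;
                long delay = backoffMillis(initialBackoffMillis, retry);
                System.err.printf(
                        "Retry %d/%d in %d ms after: %s%n",
                        retry, maxRetries, delay, e.getMessage());
                Thread.sleep(delay);
            }
        }
    }

    public static void main(String[] args) {
        // Backoff schedule for retry-backoff = 100ms and max-retries = 3.
        for (int retry = 1; retry <= 3; retry++) {
            System.out.println(backoffMillis(100, retry)); // 100, 200, 400
        }
    }
}
```

With max-retries left at its default of 0, callWithRetry makes a single attempt and rethrows the first failure, which matches the behavior before this change.
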
> h3. 3. Default Value Fallback
> * Returns configured default value after exhausting all retries
> * Supports all output types: STRING, numeric, ARRAY
> * Enables downstream view-based routing for success/failure cases
> * Backward compatible: throws exceptions if no default value configured
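
The fallback behavior described above might look something like the sketch below. FallbackSketch and inferOrDefault are hypothetical names, and the Supplier stands in for an inference call that is assumed to have already gone through its retry attempts. How a configured string is converted to numeric or ARRAY output types is not specified in the issue, so that part is left out.

```java
import java.util.Optional;
import java.util.function.Supplier;

class FallbackSketch {

    /**
     * Returns the inference result, or the configured default if inference fails.
     * With no default configured, the original exception propagates, which keeps
     * the behavior backward compatible.
     */
    static <T> T inferOrDefault(Supplier<T> inference, Optional<T> defaultValue) {
        try {
            return inference.get(); // assumed to already include the retry attempts
        } catch (RuntimeException e) {
            if (defaultValue.isPresent()) {
                return defaultValue.get();
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        Supplier<String> alwaysFails = () -> {
            throw new IllegalStateException("Triton unavailable");
        };
        System.out.println(inferOrDefault(alwaysFails, Optional.of("FAILED"))); // prints FAILED
        System.out.println(inferOrDefault(() -> "cat", Optional.of("FAILED"))); // prints cat
    }
}
```

One consequence of this design is that the default value has to be distinguishable from any legitimate model output, otherwise downstream filters cannot tell a fallback from a real prediction.
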
> h3. 4. AbstractTritonModelFunction.java
> * Added fields and getters for retry configuration
> h2. Use Cases
> *Scenario*: After all retries for a record have failed, return a default
> value that downstream queries can use to route records to success or failure
> paths.
> *Example Configuration*:
> CREATE MODEL my_triton_model
> WITH (
>   'provider' = 'triton',
>   'endpoint' = 'http://triton:8000/v2/models',
>   'model-name' = 'my-model',
>   'max-retries' = '3',          -- Retry up to 3 times
>   'retry-backoff' = '100ms',    -- 100ms, 200ms, 400ms backoff
>   'default-value' = 'FAILED'    -- Return 'FAILED' on all failures
> );
>
> *Downstream Processing*:
> -- Route based on prediction result
> INSERT INTO success_table
> SELECT * FROM predictions WHERE result != 'FAILED';
> INSERT INTO failure_table
> SELECT * FROM predictions WHERE result = 'FAILED';
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)