Stephan Weinwurm created FLINK-28747:
----------------------------------------

    Summary: "target_id can not be missing" in HTTP statefun request
    Key: FLINK-28747
    URL: https://issues.apache.org/jira/browse/FLINK-28747
    Project: Flink
    Issue Type: Bug
    Components: Stateful Functions
    Affects Versions: statefun-3.2.0
    Reporter: Stephan Weinwurm

Hi all,

We've suddenly started to see the following exception in our HTTP statefun function endpoints:

```
Traceback (most recent call last):
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 37, in __call__
    await span_processor.execute()
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 61, in execute
    raise e
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 57, in execute
    await self.app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 75, in __call__
    raise exc
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 64, in __call__
    await self.app(scope, receive, sender)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 680, in __call__
    await route.handle(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 275, in handle
    await self.app(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 65, in app
    response = await func(request)
  File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", line 25, in statefun_handler
    result = await handler.handle_async(request_body)
  File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", line 262, in handle_async
    msg = Message(target_typename=sdk_address.typename, target_id=sdk_address.id,
  File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 42, in __init__
    raise ValueError("target_id can not be missing")
```

Interestingly, this started happening in three separate Flink deployments at the very same time. The only thing the three deployments have in common is that they consume the same Kafka topics. No deployments took place around the time the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously ever since.

We were also able to extract the request that Flink sends to the HTTP statefun endpoint:

```
{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 'dummy'}, 'invocations': [{'argument': {'typename': 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': '-redicated-'}}]}}
```

As you can see, the `invocation.target` object has no `id` field, which means the `target_id` was either missing or an empty string.
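
For anyone who wants to check captured requests for the same condition, here is a minimal sketch. It assumes the request has already been decoded into a plain dict shaped like the dump above; `target_id_missing` is a hypothetical helper of ours and not part of the statefun SDK.

```python
from typing import Any, Mapping


def target_id_missing(request: Mapping[str, Any]) -> bool:
    """Return True when the decoded request carries no usable target id.

    Hypothetical helper: an absent `id` key and an empty-string `id` are
    treated the same, since we cannot tell them apart from the dump.
    """
    target = request.get("invocation", {}).get("target", {})
    return not target.get("id")


# The request as captured in this report (value left as the placeholder).
captured_request = {
    "invocation": {
        "target": {"namespace": "com.x.dummy", "type": "dummy"},
        "invocations": [
            {
                "argument": {
                    "typename": "type.googleapis.com/v2_event.Event",
                    "has_value": True,
                    "value": "-redicated-",
                }
            }
        ],
    }
}

if __name__ == "__main__":
    print(target_id_missing(captured_request))
```

Running this against the captured request reports the id as missing, which matches the `ValueError` raised in `statefun/messages.py`.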

This is our module.yaml from one of the Flink deployments:

```
version: "3.0"
module:
  meta:
    type: remote
  spec:
    endpoints:
      - endpoint:
          meta:
            kind: io.statefun.endpoints.v1/http
          spec:
            functions: com.x.dummy/dummy
            urlPathTemplate: [http://x-worker-dummy.x-functions:9090/statefun]
            timeouts:
              call: 2 min
              read: 2 min
              write: 2 min
            maxNumBatchRequests: 100
    ingresses:
      - ingress:
          meta:
            type: io.statefun.kafka/ingress
            id: com.x/ingress
          spec:
            address: x-kafka-0.x.ue1.x.net:9092
            consumerGroupId: x-worker-dummy
            topics:
              - topic: v2_post_events
                valueType: type.googleapis.com/v2_event.Event
                targets:
                  - com.x.dummy/dummy
            startupPosition:
              type: group-offsets
            autoOffsetResetPosition: earliest
```

Can you please help us investigate as this is critically impacting our prod setup?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)