Hi Jan,

I haven't stumbled upon this but I will try to reconstruct that scenario
with a stress test and report back.

Can you share a little bit about your environment. For example do you use
gunicorn, ngnix, aiohttp/or flask perhaps?

I'd suggest maybe checking for request size limits parameters on that stack
(ngnix / gunicorn, python app server)
On the worker side, there should be a detailed log message that also prints
the request size, perhaps it would be useful to correlate that number
with any limits.

Thanks,
Igal.

לCa


On Thu, May 20, 2021 at 4:49 PM Stephan Ewen <se...@apache.org> wrote:

> Thanks for reporting this, it looks indeed like a potential bug.
>
> I filed this Jira for it:
> https://issues.apache.org/jira/browse/FLINK-22729
>
> Could you share (here ot in Jira) what the stack on the Python Worker side
> is (for example which HTTP server)? Do you know if the message truncation
> happens reliably at a certain message size?
>
>
> On Wed, May 19, 2021 at 2:12 PM Jan Brusch <jan.bru...@neuland-bfi.de>
> wrote:
>
>> Hi,
>>
>> recently we started seeing the following faulty behaviour in the Flink
>> Stateful Functions HTTP communication towards external Python workers.
>> This is only occuring when the system is under heavy load.
>>
>> The Java Application will send HTTP Messages to an external Python
>> Function but the external Function fails to parse the message with a
>> "Truncated Message Error". Printouts show that the truncated message
>> looks as follows:
>>
>> ------------------------------
>>
>> <Start of Message>
>>
>> my.protobuf.MyClass: <Protobuf Content>
>>
>> my.protobuf.MyClass: <Protobuf Content>
>>
>> my.protobuf.MyClass: <Protobuf Content>
>>
>> my.protobuf.MyClass: <Protob
>>
>> ------------------------------
>>
>> Which leads to the following Error in the Python worker:
>>
>> ------------------------------
>>
>> Error Parsing Message: Truncated Message
>>
>> ------------------------------
>>
>> Either the sender or the receiver (or something in between) seems to be
>> truncacting some (not all) messages at some random point in the payload.
>> The source code in both Flink SDKs looks to be correct. We temporarily
>> solved this by setting the "maxNumBatchRequests" parameter in the
>> external function definition really low. But this is not an ideal
>> solution as we believe this adds considerable communication overhead
>> between the Java and the Python Functions.
>>
>> The Stateful Function version is 2.2.2, java8. The Java App as well as
>> the external Python workers are deployed in the same kubernetes cluster.
>>
>>
>> Has anyone ever seen this problem before?
>>
>> Best regards
>>
>> Jan
>>
>> --
>> neuland  – Büro für Informatik GmbH
>> Konsul-Smidt-Str. 8g, 28217 Bremen
>>
>> Telefon (0421) 380107 57
>> Fax (0421) 380107 99
>> https://www.neuland-bfi.de
>>
>> https://twitter.com/neuland
>> https://facebook.com/neulandbfi
>> https://xing.com/company/neulandbfi
>>
>>
>> Geschäftsführer: Thomas Gebauer, Jan Zander
>> Registergericht: Amtsgericht Bremen, HRB 23395 HB
>> USt-ID. DE 246585501
>>
>>

Reply via email to