Re: Help needed to increase throughput of simple flink app

2020-11-12 Thread ashwinkonale
Hi Arvid,
Thank you so much for your detailed reply. I think I will go with one schema
per topic, using GenericRecordAvroTypeInfo for the GenericRecords, and not do
any custom magic.
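
For reference, this is roughly what I mean (a sketch only; the schema JSON,
topic name, deserializer and Kafka properties are placeholders):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;

    // One schema per topic, parsed once at job startup.
    Schema schema = new Schema.Parser().parse(schemaJson);

    DataStream<GenericRecord> records = env
        .addSource(new FlinkKafkaConsumer<>("my-topic", myDeserializer, kafkaProps))
        // Hand Flink the concrete Avro type info so GenericRecord is
        // serialized with the Avro serializer instead of Kryo.
        .returns(new GenericRecordAvroTypeInfo(schema));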

The approach of sending records as byte arrays also seems quite interesting.
Right now I am deserializing the Avro records so that I can pass them to
StreamingFileSink's AvroWriters (which accept only Avro objects), which merges
a bunch of Avro records before dumping them to the sink. That seems
unnecessary to me, since a bulk-writer implementation could do the merging at
the byte level itself. Do you know of any such implementations?
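
Something like this is what I have in mind, in case it helps to illustrate
(an untested sketch: it assumes the incoming byte[] is the plain Avro binary
encoding of one record, i.e. with the Confluent 5-byte wire-format header
already stripped, and that all records share the file's writer schema):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.api.common.serialization.BulkWriter;

    // Appends pre-encoded Avro datums into an Avro container file,
    // skipping the decode/re-encode round trip of AvroWriters.
    class RawAvroBulkWriter implements BulkWriter<byte[]> {
        private final DataFileWriter<GenericRecord> fileWriter;

        RawAvroBulkWriter(Schema schema, OutputStream out) throws IOException {
            fileWriter = new DataFileWriter<>(new GenericDatumWriter<>(schema));
            fileWriter.create(schema, out);
        }

        @Override
        public void addElement(byte[] encodedDatum) throws IOException {
            // Expert API: appends the binary encoding of a single datum.
            fileWriter.appendEncoded(ByteBuffer.wrap(encodedDatum));
        }

        @Override
        public void flush() throws IOException {
            fileWriter.flush();
        }

        @Override
        public void finish() throws IOException {
            // Flink closes the underlying stream itself, so only flush here.
            fileWriter.flush();
        }
    }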





Re: Help needed to increase throughput of simple flink app

2020-11-12 Thread ashwinkonale
So in this case, Flink will fall back to the default Kryo serializer, right?
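
A quick way I found to verify this: disable generic types, so the job fails
fast whenever a type would otherwise go through Kryo:

    // Throws at job-graph build time instead of silently using Kryo.
    env.getConfig().disableGenericTypes();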





Re: Help needed to increase throughput of simple flink app

2020-11-12 Thread ashwinkonale
Hi Arvid,
Thanks a lot for your reply. And yes, we do use the Confluent schema registry
extensively. But `ConfluentRegistryAvroDeserializationSchema` expects a reader
schema to be provided; that means it reads each message using the writer
schema and converts it to the reader schema. That is not always what I want:
if I have messages with different schemas in the same topic, I cannot apply
`ConfluentRegistryAvroDeserializationSchema`, correct? I also came across a
similar question on this list. I am doing the same thing in my pipeline,
providing a custom deserializer that uses the Confluent schema registry
client. So as far as I understood, in this use case there is no way to tell
Flink the `GenericRecordAvroTypeInfo` of the GenericRecord that comes out of
the source function. Please tell me if my understanding is correct.
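
For reference, my custom deserializer is roughly this (a simplified, untested
sketch; the class name, registry URL and cache capacity are placeholders):

    import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
    import io.confluent.kafka.serializers.KafkaAvroDeserializer;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    class RegistryDeserializer implements KafkaDeserializationSchema<GenericRecord> {
        private transient KafkaAvroDeserializer inner;

        @Override
        public GenericRecord deserialize(ConsumerRecord<byte[], byte[]> record) {
            if (inner == null) { // lazy init: the registry client is not serializable
                inner = new KafkaAvroDeserializer(
                    new CachedSchemaRegistryClient("http://registry:8081", 1000));
            }
            return (GenericRecord) inner.deserialize(record.topic(), record.value());
        }

        @Override
        public boolean isEndOfStream(GenericRecord next) { return false; }

        @Override
        public TypeInformation<GenericRecord> getProducedType() {
            // No single schema to build a GenericRecordAvroTypeInfo from,
            // which is exactly the problem: this falls back to Kryo.
            return TypeInformation.of(GenericRecord.class);
        }
    }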





Re: Help needed to increase throughput of simple flink app

2020-11-12 Thread ashwinkonale
Hi,
Thanks a lot for the reply. And you are both right: serializing GenericRecord
without specifying the schema was indeed a HUGE bottleneck in my app. I got to
know it through JFR analysis and then read the blog post you mentioned. Now I
am able to pump in a lot more data per second (in my test setup at least). I
am going to try this with Kafka.
But this now poses a problem: my app cannot handle schema changes
automatically, since Flink needs to know the schema at startup. If there is a
backward-compatible change upstream, new messages will not be read properly.
Do you know any workarounds for this?
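
One idea I am toying with (a sketch only, not tested): resolve the latest
schema from the registry at job startup, so at least a restart picks up
backward-compatible changes. The subject name assumes the default
TopicNameStrategy, and checked exceptions from the registry call are omitted:

    import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
    import org.apache.avro.Schema;

    CachedSchemaRegistryClient client =
        new CachedSchemaRegistryClient("http://registry:8081", 100);
    // "<topic>-value" is the default subject name for value schemas.
    Schema latest = new Schema.Parser().parse(
        client.getLatestSchemaMetadata("my-topic-value").getSchema());
    // Use `latest` both as the reader schema and for GenericRecordAvroTypeInfo.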





Re: Help needed to increase throughput of simple flink app

2020-11-10 Thread ashwinkonale
Hey,
I am reading messages with a schema id and using the Confluent schema registry
to deserialize them to GenericRecord. After this point the pipeline has these
objects moving through it. Can you give me some examples of the `special
handling of avro messages` you mentioned?





Re: Help needed to increase throughput of simple flink app

2020-11-09 Thread ashwinkonale
Hi,
Thanks a lot for the reply. I added some more metrics to the pipeline to
understand the bottleneck. It seems Avro deserialization introduces some
delay: using a histogram, I found that processing a single message takes
~300us (p99) and ~180us (p50). That means a single slot can output at most
~3000 messages per second, so to support a QPS of 3 million/s I will need a
parallelism of about 1000. Is my understanding correct? Can I do anything
else apart from having so many slots in my job cluster? Also, do you have any
guides or pointers on how to do such setups, e.g. a large number of
taskmanagers with fewer slots vs. bigger TMs with many slots and bigger
JVMs, larger network buffers, etc.?
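
(Spelling out the arithmetic, in case I have it wrong: 1 s / 300 us is about
3,300 messages/s per slot at the p99 latency, so 3,000,000 / 3,300 comes to
roughly 900-1000 slots, assuming one message at a time per slot.)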





Re: Help needed to increase throughput of simple flink app

2020-11-09 Thread ashwinkonale
Hi Till,
Thanks a lot for the reply. The problem I am facing is that as soon as I add
a network shuffle (i.e. remove chaining) before the discarding sink, I have a
huge problem with throughput. Do you have any pointers on how I can go about
debugging this?
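
For context, the two variants I am comparing look roughly like this (a
sketch; `source` stands for my Kafka source stream, and DiscardingSink is
Flink's org.apache.flink.streaming.api.functions.sink.DiscardingSink):

    // Chained baseline: source and sink run in the same thread.
    source.addSink(new DiscardingSink<>());

    // With a network shuffle in between, throughput collapses.
    source.rebalance().addSink(new DiscardingSink<>());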

- Ashwin


