Document an example pattern that makes sources and sinks pluggable in the production code for testing
Hi guys,

I found that the testing documentation mentions "make sources and sinks pluggable in your production code and inject special test sources and test sinks in your tests": https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/testing.html#testing-flink-jobs

I think it would be useful to have a documented example of this, just as the section *Testing stateful operators* demonstrates with WindowOperatorTest: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/testing.html#unit-testing-stateful-or-timely-udfs--custom-operators

Or is there perhaps already a test that plugs in sources and sinks?
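To make the question concrete, the pattern I have in mind looks roughly like this (an untested sketch; PluggablePipeline, buildPipeline and CollectSink are names I made up, not from the docs):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.util.ArrayList;
import java.util.List;

public class PluggablePipeline {

    // Production code only wires the transformations between a given source and sink.
    public static void buildPipeline(DataStream<String> source, SinkFunction<String> sink) {
        source.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return value.toUpperCase();
            }
        }).addSink(sink);
    }

    // A test sink collecting results into a static list (static, because Flink
    // serializes and ships sink instances to the workers).
    public static class CollectSink implements SinkFunction<String> {
        public static final List<String> values = new ArrayList<>();

        @Override
        public void invoke(String value) {
            synchronized (values) {
                values.add(value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // In a test: a bounded source plus the collecting sink. In production
        // you would pass e.g. a Kafka consumer and a real sink instead.
        buildPipeline(env.fromElements("a", "b"), new CollectSink());
        env.execute("pluggable sources and sinks");
        // Afterwards, assert on CollectSink.values ("A", "B").
    }
}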
Re: Flink Standalone cluster - production settings
/ Each job has 3 asynch operators with Executors with thread counts of 20, 20, 100 /

Flink handles parallelism for you. If you want a higher parallelism for a particular operator, you can call setParallelism(), for example:

flatMap(new Mapper1()).setParallelism(20)
flatMap(new Mapper2()).setParallelism(20)
flatMap(new Mapper3()).setParallelism(100)

You can check the official documentation here: https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/parallel.html#setting-the-parallelism

/ Currently we are using parallelism = 1 /

I guess you set the job-level parallelism. I would suggest replacing the Executors with Flink's own parallelism. It is more efficient, because you don't create a second thread pool next to the one Flink already provides (I may not be describing this concept precisely).

Cheers,
Sendoh
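Putting those lines together, a runnable sketch of the idea (untested; PassThrough is just a trivial stand-in for your three asynchronous mappers):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class OperatorParallelism {

    // Trivial stand-in for your asynchronous operators.
    static class PassThrough implements FlatMapFunction<String, String> {
        @Override
        public void flatMap(String value, Collector<String> out) {
            out.collect(value);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // job-level default

        env.fromElements("a", "b", "c")
           .flatMap(new PassThrough()).setParallelism(20)   // Mapper1
           .flatMap(new PassThrough()).setParallelism(20)   // Mapper2
           .flatMap(new PassThrough()).setParallelism(100)  // Mapper3
           .print();

        env.execute("operator-level parallelism");
    }
}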
Re: Apache Flink Examples
In my case I usually check the tests written for each function I want to use. Take CountTrigger as an example: if I want to customize my own way of counting, I have a look at the test they wrote for it: https://github.com/apache/flink/blob/8dfb9d00653271ea4adbeb752da8f62d7647b6d8/flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/operators/windowing/CountTriggerTest.java

From that I understand how the function is expected to work, and then I write my own test with my expected result. Tests are the best documentation, I would say.

There is also an examples folder on GitHub: https://github.com/apache/flink/tree/master/flink-examples

Best,
Sendoh
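For example, after reading CountTriggerTest, plugging CountTrigger into a global window looks roughly like this (an untested sketch):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

public class CountTriggerExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "a", "a", "b", "b", "b")
           .keyBy(new KeySelector<String, String>() {
               @Override
               public String getKey(String value) {
                   return value;
               }
           })
           .window(GlobalWindows.create())
           .trigger(CountTrigger.of(3))   // fire once 3 elements have arrived for a key
           .reduce((a, b) -> a + b)       // expect "aaa" and "bbb"
           .print();

        env.execute("count trigger example");
    }
}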
Re: End-to-end exactly once from kafka source to S3 sink
"Flink will only commit the kafka offsets when the data has been saved to S3" -> no, you can check the BucketingSink code, and it would mean BucketingSink depends on Kafka which is not reasonable. Flink stores checkpoint in disk of each worker, not Kafka. (KafkaStream, the other streaming API provided by Kafka, stores checkpoint back to Kafka) So, bucket size doesn't affect the commit frequency. Best, Sendoh -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: Advice or best practices on adding metadata to stream events
So there are three ways:

1. make your model a stream source
2. let the master read the model once and distribute it via the constructor
3. let each worker read the model and update it periodically (what you mentioned)

Option 3 would be problematic if you scale a lot and use many parallel instances, because there would be too many connections.

Option 2 is the best if you don't have to update your model; otherwise you have to restart your Flink job to pick up a new model, or implement the update logic yourself.

Option 1 is, for me, the best if you need to update the model, because you can control how often it is read. A sketch is below.

Best,
Sendoh
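A rough sketch of option 1 (untested; the two socket sources are only placeholders for your event stream and for a custom source that periodically re-reads the model):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class ModelAsSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> events = env.socketTextStream("localhost", 9000);
        DataStream<String> model  = env.socketTextStream("localhost", 9001);

        events.connect(model.broadcast())
              .flatMap(new CoFlatMapFunction<String, String, String>() {
                  private String currentModel;

                  @Override
                  public void flatMap1(String event, Collector<String> out) {
                      if (currentModel != null) {
                          out.collect(event + " enriched with " + currentModel);
                      }
                  }

                  @Override
                  public void flatMap2(String newModel, Collector<String> out) {
                      currentModel = newModel; // keep only the latest model
                  }
              })
              .print();

        env.execute("model as stream source");
    }
}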
Re: [Window] Per key window
After you keyBy(), each of your windows holds its own group of events per key. Or is what you want a global window?

Best,
Sendoh
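For example, a minimal sketch where each key gets its own windows containing only that key's events:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PerKeyWindow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3))
           .keyBy(0)                        // one logical window per key
           .timeWindow(Time.seconds(10))    // each holds only that key's events
           .sum(1)
           .print();

        env.execute("per-key window");
    }
}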
Re: How does BucketingSink generate a SUCCESS file when a directory is finished
It depends on how you partition your files. In my case I write one file per hour, so I am sure the file is ready after that hour has passed, in processing time. Here, ready to read means the file contains all the data of that hour.

If the downstream consumes the files in a batch fashion, you may want to guarantee that a file is ready. In that case, ready to read can mean that all the data before the watermark has arrived. You could take the BucketingSink and implement this logic there, e.g. wait until the watermark reaches the end of the hour.

Best,
Sendoh
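A sketch of the hourly partitioning I mean (untested; the path and batch size are placeholders):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class HourlyBuckets {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        BucketingSink<String> sink = new BucketingSink<>("hdfs:///data/events");
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HH")); // one bucket per hour
        sink.setBatchSize(1024 * 1024 * 128);                       // also roll at 128 MB

        env.socketTextStream("localhost", 9000).addSink(sink);
        env.execute("hourly bucketing");
    }
}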
Re: Scaling Flink
What about scaling up based on the number of task slots left? You can obtain this information from Flink's REST endpoint.

Cheers,
Sendoh
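If I remember correctly, the JobManager's /overview endpoint reports "slots-total" and "slots-available"; a rough sketch of polling it (host and port are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SlotsLeft {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8081/overview");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // parse the JSON and scale on "slots-available"
            }
        }
    }
}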
Is it possible to restart only the function that fails instead of entire job?
After reading the documentation and configuring the restart strategy to test the failure behavior, it seems to me that Flink restarts the whole job once any failure (e.g. a thrown exception) occurs: https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html

My question: is it possible to configure Flink to recover only the function that fails, instead of restarting the entire job (like Erlang's one-for-one supervision)? For instance, within a job the parallelism is configured to 100, so at runtime 100 map instances are executed. Now one of the map functions fails, and we want to recover only that failed map instance, because the other map functions are working normally. Is it possible to achieve such an effect?

Thanks
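For context, this is roughly the restart configuration I tested with (a minimal sketch; the failing map is artificial):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.concurrent.TimeUnit;

public class RestartStrategyTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Restart the job up to 3 times, waiting 10 seconds between attempts.
        // As far as I can tell, this always restarts the whole job.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                3, Time.of(10, TimeUnit.SECONDS)));

        env.fromElements(1, 2, 3)
           .map(x -> x / (x - 2))   // fails on x = 2 and triggers a restart
           .print();

        env.execute("restart strategy test");
    }
}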