Re: Optimize exact deduplication for tens of billions data per day

2024-03-29 Thread Lei Wang
Perhaps I can keyBy(Hash(originalKey) % 10), then in the KeyedProcessFunction use a MapState instead of a ValueState. There would be about 10 original keys per MapState. Hope this will help.
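Lei's bucketing idea can be sketched in plain Python — this is only an illustration of the state layout, not actual Flink code; the bucket count and the `is_duplicate` helper are made up for the sketch, with a plain dict standing in for each bucket's MapState:

```python
# Sketch: route each record by hash(original_key) % N, and inside each
# bucket keep the original keys in a map (standing in for Flink MapState).

N_BUCKETS = 10  # from the email's keyBy(Hash(originalKey) % 10)

# one dict per bucket, playing the role of per-key MapState
buckets = [dict() for _ in range(N_BUCKETS)]

def is_duplicate(original_key: str) -> bool:
    """Return True if original_key was seen before, else record it."""
    state = buckets[hash(original_key) % N_BUCKETS]
    if original_key in state:
        return True
    state[original_key] = True
    return False

events = ["a", "b", "a", "c", "b"]
deduped = [e for e in events if not is_duplicate(e)]
print(deduped)  # -> ['a', 'b', 'c']
```

The point of the bucketing is that the operator key space stays small (N buckets) while the MapState inside each bucket still holds the full original keys, so the deduplication remains exact.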

Re: flink version stable

2024-03-29 Thread Fokou Toukam, Thierry
I’m asking because I am seeing that the latest version doesn’t have all libraries, such as the Kafka connector. Thierry FOKOU | IT M.A.Sc student, Département de génie logiciel et TI, École de technologie supérieure | Université du Québec, 1100, rue Notre-Dame Ouest, Montréal (Québec) H3C 1K3

Re: Optimize exact deduplication for tens of billions data per day

2024-03-29 Thread Péter Váry
Hi Lei, Have you tried to make the key smaller, and store a list of found keys as a value? Let's make the operator key a hash of your original key, and store a list of the full keys in the state. You can play with your hash length to achieve the optimal number of keys. I hope this helps, Peter
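Péter's suggestion can be modeled in plain Python as well — a minimal sketch of the proposed state layout, where `HASH_BITS` and `seen_before` are hypothetical names and the dict stands in for Flink keyed state. The operator key is a short hash; the value is the list of full keys that collide on it, so the check stays exact:

```python
# Sketch: key by a short hash of the original key, store the list of
# full keys sharing that hash as the value. Tuning the hash length
# trades operator key count against per-key list size.

HASH_BITS = 20  # hypothetical tuning knob for the hash length

# short-hash -> list of full original keys seen under that hash
state: dict = {}

def seen_before(original_key: str) -> bool:
    """Exact-dedup check: True if this full key was already recorded."""
    short = hash(original_key) & ((1 << HASH_BITS) - 1)
    full_keys = state.setdefault(short, [])
    if original_key in full_keys:
        return True
    full_keys.append(original_key)
    return False

print(seen_before("order-1"))  # first sighting -> False
print(seen_before("order-1"))  # duplicate -> True
```

Because the full keys are kept in the value, hash collisions never cause false positives; they only make the collision list slightly longer.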

Re: flink version stable

2024-03-29 Thread Junrui Lee
Hi, the latest stable version of Flink is 1.19.0. > Fokou Toukam, Thierry wrote on Fri, Mar 29, 2024 at 16:25: >> Hi, just want to know which version of Flink is stable?

Re: Batch Job with Adaptive Batch Scheduler failing with JobInitializationException: Could not start the JobMaster

2024-03-29 Thread Junrui Lee
Hi Dipak, regarding question 1: I noticed from the logs that the method createBatchExecutionEnvironment from Beam is being used in your job. IIUC, this method uses Flink's DataSet API. If the DataSet API is indeed being used, the configuration option execution.batch-shuffle-mode will not take effect …
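For reference, the option under discussion lives in the cluster configuration. A minimal sketch of the relevant flink-conf.yaml entries for an Adaptive Batch Scheduler job (values shown are illustrative; per the reply above, they are honored by DataStream/Table batch jobs but not by the legacy DataSet API that Beam's createBatchExecutionEnvironment uses):

```yaml
# flink-conf.yaml (sketch) -- Adaptive Batch Scheduler needs blocking shuffles
jobmanager.scheduler: AdaptiveBatch
execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING
```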

Row to tuple conversion in PyFlink when switching to 'thread' execution mode

2024-03-29 Thread Wouter Zorgdrager
Dear readers, I'm running into some unexpected behaviour in PyFlink when switching execution mode from process to thread. In thread mode, my `Row` gets converted to a tuple whenever I use a UDF in a map operation. Through this conversion to tuples, we lose critical information such as column names. Bel…
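The symptom Wouter describes can be mimicked without PyFlink at all — here a namedtuple stands in for pyflink.common.Row (illustration only, not the real class), and a map-style function that returns a plain tuple discards the field names:

```python
# Plain-Python illustration of the reported symptom (no PyFlink needed):
# a Row-like record with named fields degrades to a bare tuple after a
# map, leaving only positional access.
from collections import namedtuple

Row = namedtuple("Row", ["id", "name"])  # stand-in for pyflink.common.Row

def udf(row):
    # In 'thread' mode the UDF reportedly receives/returns a plain tuple,
    # so field-name access is no longer available downstream.
    return tuple(row)

r = Row(id=1, name="alice")
out = udf(r)
print(r.name)             # named access works on the Row
print(out)                # -> (1, 'alice'); column names are gone
```

This is why downstream code that relies on `row.column_name` breaks in thread mode while positional indexing (`row[0]`) keeps working.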

Batch Job with Adaptive Batch Scheduler failing with JobInitializationException: Could not start the JobMaster

2024-03-29 Thread Dipak Tandel
Hi everyone, I am facing some issues while running a batch job on a Flink cluster using the Adaptive Batch Scheduler. I have deployed a Flink cluster on Kubernetes using the Flink Kubernetes operator and submitted a job to the cluster using the Apache Beam FlinkRunner. I am using Flink version 1.16. I…