Re: [Discussion] - Release major Flink version to support JDK 17 (LTS)

2023-03-30 Thread Piotr Nowojski
Hey, > 1. The Flink community agrees that we upgrade Kryo to a later version, which means breaking all checkpoint/savepoint compatibility and releasing a Flink 2.0 with Java 17 support added and Java 8 and Flink Scala API support dropped. This is probably the quickest way, but would still mean

Re: Kafka transactions drastically limit usability of Flink savepoints

2022-11-17 Thread Piotr Nowojski
:", however this is not what I observe, instead the job fails. I > am attaching the relevant part of the log. This error happens upon > trying to recover from a one month old savepoint. > > Regards, > Yordan > > On Tue, 15 Nov 2022 at 18:53, Piotr Nowojski wrote: > > >

Re: Kafka transactions drastically limit usability of Flink savepoints

2022-11-15 Thread Piotr Nowojski
Hi Yordan, I don't understand where the problem is. Why do you think savepoints are unusable? If you recover with `ignoreFailuresAfterTransactionTimeout` enabled, the current Flink behaviour shouldn't cause any problems (except for maybe some logged errors). Best, Piotrek On Tue, 15 Nov 2022 at

Re: Modify savepoints in Flink

2022-10-21 Thread Piotr Nowojski
ent connector can have different > serialisation and deserialisation techniques, right? Wouldn't that have an impact? > If I use the StateProcessor API, would that be agnostic to all the sources and > sinks? > > On Fri, Oct 21, 2022, 18:00 Piotr Nowojski wrote: > >> ops >> >>

Re: Modify savepoints in Flink

2022-10-21 Thread Piotr Nowojski
to use StateProcessor API. Best, Piotrek On Fri, 21 Oct 2022 at 10:54, Sriram Ganesh wrote: > Thanks! Will try this. > > On Fri, Oct 21, 2022 at 2:22 PM Piotr Nowojski > wrote: > >> Hi Sriram, >> >> You can read and modify savepoints using StateProcessor API [1]. >

Re: Modify savepoints in Flink

2022-10-21 Thread Piotr Nowojski
Hi Sriram, You can read and modify savepoints using the StateProcessor API [1]. Alternatively, you can modify the code of the function/operator whose state you want to change. For example, in the `org.apache.flink.streaming.api.checkpoint.CheckpointedFunction#initializeState` method you could

Re: [Discuss] Creating an Apache Flink slack workspace

2022-05-06 Thread Piotr Nowojski
Hi Xintong, I'm not sure if Slack is the right tool for the job. IMO it works great as an ad hoc tool for discussion between developers, but it's not searchable and it's not persistent. Between devs, it works fine, as long as the result of the ad hoc discussions is backported to JIRA/mailing

Re: Avro deserialization issue

2022-04-13 Thread Piotr Nowojski
Hey, Could you be more specific about how it is not working? A compiler error that there is no such class as RuntimeContextInitializationContextAdapters? This class has been introduced in Flink 1.12 in FLINK-18363 [1]. I don't know this code and I also don't know where it's documented, but: a)

Re: Low Watermark

2022-02-25 Thread Piotr Nowojski
Hi, It's the minimal watermark among all 10 parallel instances of that Task. Using the `currentInputWatermark` metric [1], you can access the watermark of each of those 10 subtasks individually. Best, Piotrek [1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/ On Fri, 25 Feb
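
The relationship described in this reply can be sketched in a few lines. This is a plain Python illustration, not Flink code: the ten watermark values are invented, and `subtask_watermarks` is a hypothetical name standing in for the per-subtask `currentInputWatermark` readings.

```python
# Hypothetical currentInputWatermark readings (epoch millis) for the
# 10 parallel subtasks of one Task; the values are invented for illustration.
subtask_watermarks = [
    1645790400000, 1645790401500, 1645790399000, 1645790402000, 1645790400700,
    1645790403000, 1645790398500, 1645790401000, 1645790400200, 1645790402500,
]

# The "Low Watermark" shown for the Task is the minimum across its subtasks.
low_watermark = min(subtask_watermarks)

# The subtask holding the Task's watermark back is the one with that minimum.
slowest_subtask = subtask_watermarks.index(low_watermark)

print(low_watermark, slowest_subtask)
```

Looking at the individual values is what tells you which subtask is lagging; the Task-level number only tells you that at least one of them is.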

Re: Basic questions about resuming stateful Flink jobs

2022-02-17 Thread Piotr Nowojski
>> >> >> From what I’ve read there is state, checkpoints and save points – all of >> them hold state - and currently I can’t get any of these to restore when >> developing in an IDE and the program builds up all state from scratch. So >> what else do I need to do in

Re: Basic questions about resuming stateful Flink jobs

2022-02-16 Thread Piotr Nowojski
Hi James, Sure! The basic idea of checkpoints is that they are fully owned by the running job and used for failure recovery. Thus, by default, if you stop the job, its checkpoints are removed. If you want to stop a job and then later resume working from the same point that it has previously

Re: getting "original" ingestion timestamp after using a TimestampAssigner

2022-02-16 Thread Piotr Nowojski
Hi Frank, I'm not sure exactly what you are trying to accomplish, but yes. In the TimestampAssigner you can only return what should be the new timestamp for the given record. If you want to use "ingestion time" - "true event time" as some kind of delay metric, you will indeed need to have both

Re: Python Function for Datastream Transformation in Flink Java Job

2022-02-16 Thread Piotr Nowojski
Hi, As far as I can tell the answer is unfortunately no. With Table API (SQL) things are much simpler, as you have a restricted number of types of columns that you need to support and you don't need to support arbitrary Java classes as the records. I'm shooting blindly here, but maybe you can

Re: Job manager slots are in bad state.

2022-02-16 Thread Piotr Nowojski
Hi Josson, Would you be able to reproduce this issue on a more recent version of Flink? I'm afraid that we won't be able to help with this issue, as it affects a Flink version that has not been supported for quite some time, and moreover `SlotSharingManager` was completely removed in Flink 1.13.

Re: Performance Issues in Source Operator while migrating to Flink-1.14 from 1.9

2022-02-16 Thread Piotr Nowojski
Hi, Unfortunately the new KafkaSource was contributed without good benchmarks, and so far you are the first one that noticed and reported this issue. Without more direct comparison (as Martijn suggested), it's hard for us to help right away. It would be a tremendous help for us if you could for

Re: Buffering when connecting streams

2022-01-18 Thread Piotr Nowojski
in a (Process)JoinFunction? > The join needs keys, but I don’t know if the resulting stream counts as > keyed from the state’s point of view. > > > > Regards, > > Alexis. > > > > *From:* Piotr Nowojski > *Sent:* Monday, 6 December 2021 08:43 > *To:* David Mo

Re: unaligned checkpoint for job with large start delay

2022-01-11 Thread Piotr Nowojski
> > > > > > > [1] > https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/config/#state-backend-rocksdb-metrics-estimate-num-keys > > > > *From:* Mason Chen > *Sent:* Tuesday, 11 January 2022 19:20 > *To:* Piotr Nowojski

Re: RichMapFunction to convert tuple of strings to DataStream[(String,String)]

2022-01-10 Thread Piotr Nowojski
n's posts where he mentioned that this way of creating > DS is not "encouraged and tested". So, I figured out an alternate way of > using side output and now I can do what I was aiming for. > > Thanks, > Sid. > > On Mon, Jan 10, 2022 at 5:29 PM Piotr Now

Re: Custom Kafka Keystore on Amazon Kinesis Analytics

2022-01-10 Thread Piotr Nowojski
Ah, I see. Pity. You could always use reflection if you really had to, but that's of course not a long-term solution. I will raise this issue with the KafkaSource/AWS contributors. Best, Piotr Nowojski On Mon, 10 Jan 2022 at 16:55, Clayton Wohl wrote: > Custom code can create subclas

Re: Is there a way to know how long a Flink app takes to finish resuming from Savepoint?

2022-01-10 Thread Piotr Nowojski
Hi, Unfortunately there is no such metric. Regarding the logs, I'm not sure what Flink version you are using, but since Flink 1.13.0 [1][2], you could rely on the task/subtask switching from `INITIALIZING` to `RUNNING` to check when the task/subtask has finished recovering its state. Best,

Re: Uploading jar to s3 for persistence

2022-01-10 Thread Piotr Nowojski
Hi Puneet, Have you seen this thread before? [1]. It looks like the same issue and especially this part might be the key: > Be aware that the filesystem used by the FileUploadHandler > is java.nio.file.FileSystem and not > Flink's org.apache.flink.core.fs.FileSystem for which we provide

Re: Custom Kafka Keystore on Amazon Kinesis Analytics

2022-01-10 Thread Piotr Nowojski
Hi Clayton, I think in principle this example should still be valid, however instead of providing a `CustomFlinkKafkaConsumer` and overriding its `open` method, you would probably need to override `org.apache.flink.connector.kafka.source.reader.KafkaSourceReader#start`. So you would most likely

Re: RichMapFunction to convert tuple of strings to DataStream[(String,String)]

2022-01-10 Thread Piotr Nowojski
Hi Sid, I don't see an explanation on Stack Overflow of what you are trying to do here (no mentions of MapFunction or a tuple). If you want to create a `DataStream` from a pre-existing/static Tuple of Strings, the easiest thing would be to convert the tuple to a collection/iterator and use

Re: Job stuck in savePoint - entire topic replayed on restart.

2022-01-10 Thread Piotr Nowojski
Hi Basil, 1. What do you mean by: > The only way we could stop these stuck jobs was to patch the finalizers. ? 2. Do you mean that your job is stuck when doing stop-with-savepoint? 3. What Flink version are you using? Have you tried upgrading to the most recent version, or at least the most

Re: unaligned checkpoint for job with large start delay

2022-01-10 Thread Piotr Nowojski
r to see the huge jumps in window fires. I can see this > benefiting other users who troubleshoot the problem of a large number of > window firings. > > Best, > Mason > > On Dec 29, 2021, at 2:56 AM, Piotr Nowojski wrote: > > Hi Mason, > > > and it has to finish proce

Re: unaligned checkpoint for job with large start delay

2021-12-20 Thread Piotr Nowojski
d (retain on cancellation) > Tolerable Failed Checkpoints 10 > ``` > > Are there other metrics should I look at—why else should tasks fail > acknowledgement in unaligned mode? Is it something about the implementation > details of window function that I am not considering? My main hunch

Re: Prometheus labels in metrics / counters

2021-12-17 Thread Piotr Nowojski
Hi, In principle you can register metrics/metric groups dynamically, and it should work just fine. However, your code probably won't work, because for every record you are creating a new group and a new counter, which will most likely collide with an old one. So every time you are defining a new
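
A minimal sketch of one way to avoid that collision, assuming the fix is to register each counter once per label value and reuse it for later records (the rest of the reply is cut off above, so this is an assumption, not the author's stated solution). It is plain Python rather than Flink's metrics API; `Counter`, `CounterCache` and `register` are hypothetical names.

```python
class Counter:
    """Stand-in for a metrics counter; not Flink's Counter interface."""

    def __init__(self):
        self.count = 0

    def inc(self):
        self.count += 1


class CounterCache:
    """Creates one counter per label value and reuses it afterwards."""

    def __init__(self, register):
        self._register = register  # called only for label values not seen before
        self._counters = {}

    def get(self, label):
        counter = self._counters.get(label)
        if counter is None:
            counter = self._register(label)
            self._counters[label] = counter
        return counter


registrations = []


def register(label):
    # Hypothetical registration hook; records which labels were registered.
    registrations.append(label)
    return Counter()


cache = CounterCache(register)
for record_label in ["a", "b", "a", "a", "b"]:
    cache.get(record_label).inc()

print(registrations, cache.get("a").count, cache.get("b").count)
```

The point of the cache is that registration happens once per distinct label, no matter how many records carry that label.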

Re: fraud detection example fails

2021-12-17 Thread Piotr Nowojski
Hi, It might simply be because the binary artifacts are not yet published/visible. The blog post [1] mentions that they should be visible within 24h (counting from yesterday), so please try again later/tomorrow. This is also mentioned in the dev mailing list thread [2]. Best, Piotrek [1]

Re: Read parquet data from S3 with Flink 1.12

2021-12-17 Thread Piotr Nowojski
Hi, Reading in the DataStream API (that's what I'm assuming you are doing) from Parquet files is officially supported and documented only since 1.14 [1]. Before that it was only supported for the Table API. As far as I can tell, the basic classes (`FileSource` and `ParquetColumnarRowInputFormat`)

Re: Confusion about rebalance bytes sent metric in Flink UI

2021-12-16 Thread Piotr Nowojski
Hi Tao, Could you prepare a minimalistic example that would reproduce this issue? Also what Flink version are you using? Best, Piotrek On Thu, 16 Dec 2021 at 09:44, tao xiao wrote: > >Your upstream is not inflating the record size? > No, this is simply a dedup function > > On Thu, Dec 16,

Re: unaligned checkpoint for job with large start delay

2021-12-16 Thread Piotr Nowojski
Hi Mason, In Flink 1.14 we have also changed the timeout behavior from checking against the alignment duration, to simply checking how old the checkpoint barrier is (so it would also account for the start delay) [1]. It was done in order to solve problems like the one you are describing. Unfortunately we
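
The difference between the two rules can be illustrated with a small plain Python sketch. This is not Flink's implementation: the function names, the strict `>` comparison and all the numbers are assumptions made up for the example.

```python
def timed_out_by_alignment(alignment_start_ms, now_ms, timeout_ms):
    """Older rule as described above: only the time spent aligning counts."""
    return now_ms - alignment_start_ms > timeout_ms


def timed_out_by_barrier_age(barrier_created_ms, now_ms, timeout_ms):
    """Flink 1.14 rule as described above: the barrier's age counts,
    so the start delay is included."""
    return now_ms - barrier_created_ms > timeout_ms


# Invented numbers: the barrier is created at t=0, reaches the subtask
# after a 50 s start delay, and alignment has been running for 5 s.
barrier_created_ms = 0
alignment_start_ms = 50_000
now_ms = 55_000
timeout_ms = 30_000

old_rule = timed_out_by_alignment(alignment_start_ms, now_ms, timeout_ms)
new_rule = timed_out_by_barrier_age(barrier_created_ms, now_ms, timeout_ms)
print(old_rule, new_rule)
```

With a large start delay the older rule does not trigger, because alignment itself has only just begun, while the barrier-age rule does.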

Re: Buffering when connecting streams

2021-12-05 Thread Piotr Nowojski
Hi Alexis and David, This actually cannot happen. There are mechanisms in the code to make sure none of the inputs is starved IF there is some data to be read. The only time when an input can be blocked is during the alignment phase of aligned checkpoints under back pressure. If there was a back

Re: Input Selectable & Checkpointing

2021-11-25 Thread Piotr Nowojski
> > Best wishes, > > Shazia > > > - Original message - > From: "Piotr Nowojski" > To: "Shazia Kayani" > Cc: mart...@ververica.com, "user" > Subject: [EXTERNAL] Re: Input Selectable & Checkpointing > Date: Wed, Nov 24, 2

Re: Input Selectable & Checkpointing

2021-11-24 Thread Piotr Nowojski
Hi Shazia, FLIP-182 [1] might be a thing that will let you address issues like this in the future. With it, maybe you could do some magic with assigning watermarks to make sure that one stream doesn't run too much into the future which would effectively prioritise the other stream. But that's

Re: Providing files while application mode deployment

2021-11-09 Thread Piotr Nowojski
Hi Vasily, Unfortunately no, I don't think there is such an option in your case. With per job mode, you could try to use the Distributed Cache, it should be working in streaming as well [1], but this doesn't work in the application mode, as in that case no code is executed on the JobMaster [2]

Re: how to expose the current in-flight async i/o requests as metrics?

2021-11-09 Thread Piotr Nowojski
Hi All, to me it looks like something deadlocked, maybe due to this OOM error from Kafka, preventing a Task from making any progress. To confirm, Dongwan, you could collect stack traces while the job is in such a blocked state. Deadlocked Kafka could easily explain those symptoms and it would be

Re: A savepoint was created but the corresponding job didn't terminate successfully.

2021-11-09 Thread Piotr Nowojski
Hi Dongwon, Thanks for reporting the issue. I've created a ticket for it [1] and we will analyse and try to fix it soon. In the meantime it should be safe for you to ignore this problem. If this failure happens only rarely, you can always retry the stop-with-savepoint command and there should be no

Re: Beginner: guidance on long term event stream persistence and replaying

2021-11-09 Thread Piotr Nowojski
Hi Simon, Off the top of my head, I do not see a reason why this shouldn't work in Flink. I'm not sure what your question is here. For reading both from the FileSource and Kafka at the same time you might want to take a look at the Hybrid Source [1]. Apart from that there are

Re: Troubleshooting checkpoint timeout

2021-10-26 Thread Piotr Nowojski
r downstream, it > just broadcasts it to all output channels”. > > > > I’ll see what I can do about upgrading the Flink version and do some more > tests with unaligned checkpoints. Thanks again for all the info. > > > > Regards, > > Alexis. > > > > *Fro

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
say, the last 5 minutes of data. Then it > fails again because the checkpoint times out and, after restarting, would > it try to read, for example, 15 minutes of data? If there was no > backpressure in the source, it could be that the new checkpoint barriers > created after the first restar

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
arrier for n-1 and “last” the one for n? >3. Start delay also refers to the “first checkpoint barrier to reach >this subtask”. As before, what is “first” in this context? >4. Maybe this will be answered by the previous questions, but what >happens to barriers if a down

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
Hi Alexis, You can read about those metrics in the documentation [1]. Long alignment duration and start delay almost always come together. High values indicate long checkpoint barrier propagation times through the job graph, that's always (at least so far I haven't seen a different reason) caused

Re: Empty Kafka topics and watermarks

2021-10-11 Thread Piotr Nowojski
Great, thanks! On Mon, 11 Oct 2021 at 17:24, James Sandys-Lumsdaine wrote: > Ah thanks for the feedback. I can work around for now but will upgrade as > soon as I can to the latest version. > > Thanks very much, > > James. > -- > *From:* Piot

Re: Empty Kafka topics and watermarks

2021-10-08 Thread Piotr Nowojski
Hi James, I believe you have encountered a bug that we have already fixed [1]. The small problem is that in order to fix this bug, we had to change some `@PublicEvolving` interfaces and thus we were not able to backport this fix to earlier minor releases. As such, this is only fixed starting from

Re: How to ugrade JobManagerCommunicationUtils from FLink 1.4 to Flink 1.5?

2021-10-08 Thread Piotr Nowojski
Hi, `JobManagerCommunicationUtils` was never part of Flink's API. It was an internal class, for our internal unit tests. Note that Flink's public API is annotated with `@Public`, `@PublicEvolving` or `@Experimental`. Anything else by default is internal (sometimes to avoid confusion we are

Re: Event is taking a lot of time between the operators

2021-09-29 Thread Piotr Nowojski
improving the approach. > > > > Thanks, > > Sanket Agrawal > > > > *From:* Ragini Manjaiah > *Sent:* Wednesday, September 29, 2021 11:17 AM > *To:* Sanket Agrawal > *Cc:* Piotr Nowojski ; user@flink.apache.org > *Subject:* Re: Event is taking a lot of t

Re: Event is taking a lot of time between the operators

2021-09-28 Thread Piotr Nowojski
Hi, With Flink 1.8.0 I'm not sure how reliable the backpressure status is in the WebUI when it comes to the Async operators. If I remember correctly, until around Flink 1.10 (+/- 2 versions), backpressure monitoring worked by checking thread dumps for threads stuck requesting Flink's network memory buffers. If

Re: State processor API very slow reading a keyed state with RocksDB

2021-09-09 Thread Piotr Nowojski
Hi David, I can confirm that I'm able to reproduce this behaviour. I've tried profiling/flame graphs and I was not able to make much sense out of those results. There are no IO/Memory bottlenecks that I could notice, it looks indeed like the Job is stuck inside RocksDB itself. This might be an

Re: Duplicate copies of job in Flink UI/API

2021-09-08 Thread Piotr Nowojski
Hi Peter, Can you provide the relevant JobManager logs? And can you write down what steps you took before the failure happened? Did this failure occur while upgrading Flink, after the upgrade, etc.? Best, Piotrek On Wed, 8 Sep 2021 at 16:11, Peter Westermann wrote: > We recently upgraded

Re: Re: [ANNOUNCE] RocksDB Version Upgrade and Performance

2021-08-14 Thread Piotr Nowojski
Hi, FYI, the performance regression after upgrading RocksDB was clearly visible in all of our RocksDB related benchmarks, like for example: http://codespeed.dak8s.net:8000/timeline/?ben=stateBackends.ROCKS=2 http://codespeed.dak8s.net:8000/timeline/?ben=stateBackends.ROCKS_INC=2 (and many more

Re: [ANNOUNCE] RocksDB Version Upgrade and Performance

2021-08-04 Thread Piotr Nowojski
Thanks for the detailed explanation Yun Tang and clearly all of the effort you have put into it. Based on what was described here I would also vote for going forward with the upgrade. It's a pity that this regression wasn't caught in the RocksDB community. I would have two questions/ideas: 1. Can

Re: TaskManager crash after cancelling a job

2021-07-29 Thread Piotr Nowojski
Hi Ivan, It sounds to me like a bug in FlinkKinesisConsumer that it's not cancelling properly. The change in the behaviour could have been introduced as a bug fix [1], where we had to stop interrupting the source thread. This also might be related or at least relevant for fixing [2]. Ivan, the

Re: Kafka Consumer Retries Failing

2021-07-19 Thread Piotr Nowojski
d the High Heap Usage is because of a Memory Leak in a > library we are using. > > Thanks for your help. > > On Mon, Jul 19, 2021 at 8:31 PM Piotr Nowojski > wrote: > >> Thanks for the update. >> >> > Could the backpressure timeout and heartbeat ti

Re: Kafka Consumer Retries Failing

2021-07-19 Thread Piotr Nowojski
apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1193) > .. > > Because of heartbeat timeout, there was an internal restart of Flink and > the Kafka consumption rate recovered after the restart. > > Could the backpressure timeout and

Re: Process finite stream and notify upon completion

2021-07-14 Thread Piotr Nowojski
ly called windows), or when you do event processing where the > > time when an event occurred is important. > > ci.apache.org > > > > > > [2] > https://ci.apache.org/projects/flink/flink-docs-release-1.13/api/java/org/apache/flink/streaming/api/watermark/Watermar

Re: Kafka Consumer Retries Failing

2021-07-14 Thread Piotr Nowojski
2)\n\tat > org.apache.flink.streaming.connectors.kafka.internal.Handover.pollNext(Handover.java:74)\n\tat > org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:133)\n\tat > org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(Fl

Re: Process finite stream and notify upon completion

2021-07-14 Thread Piotr Nowojski
hat the entire flow is the > same(operators and sink); is there any good practice to achieve a single > job for that? > > Tamir. > > [1] > https://stackoverflow.com/questions/54687372/flink-append-an-event-to-the-end-of-finite-datastream#answer-54697302 > -

Re: Kafka Consumer Retries Failing

2021-07-14 Thread Piotr Nowojski
il.concurrent.CompletableFuture.get(CompletableFuture.java:1908)\n\tat > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:293)\n\tat > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(Local

Re: Process finite stream and notify upon completion

2021-07-13 Thread Piotr Nowojski
Hi, Sources emit `org.apache.flink.streaming.api.watermark.Watermark#MAX_WATERMARK` when they finish, so I think the best approach is to register an event time timer for `Watermark#MAX_WATERMARK` or maybe `Watermark#MAX_WATERMARK - 1`. If your function registers such a timer, it would
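
The idea can be illustrated with a toy model in plain Python. This is not Flink code: `ToyOperator` and its methods are hypothetical, and the model only mimics the rule that an event time timer fires once the watermark reaches its timestamp. The constant mirrors the fact that `Watermark.MAX_WATERMARK` carries the timestamp `Long.MAX_VALUE`.

```python
# Timestamp carried by Watermark.MAX_WATERMARK (Java's Long.MAX_VALUE).
MAX_WATERMARK = 2**63 - 1


class ToyOperator:
    """Toy stand-in for an operator with event time timers (not a Flink class)."""

    def __init__(self):
        self.timers = set()
        self.fired = []

    def register_event_time_timer(self, timestamp):
        self.timers.add(timestamp)

    def process_watermark(self, watermark):
        # A timer fires once the watermark has reached its timestamp.
        due = sorted(t for t in self.timers if t <= watermark)
        for timestamp in due:
            self.timers.remove(timestamp)
            self.fired.append(timestamp)


op = ToyOperator()
op.register_event_time_timer(MAX_WATERMARK)

op.process_watermark(1_000)          # ordinary watermark: nothing fires
fired_before_end = list(op.fired)

op.process_watermark(MAX_WATERMARK)  # emitted when the source finishes
print(fired_before_end, op.fired)
```

In this model the timer stays pending for every ordinary watermark and fires exactly once, when the final watermark arrives, which is what makes it usable as an end-of-input notification.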

Re: Kafka Consumer Retries Failing

2021-07-13 Thread Piotr Nowojski
Hi, I'm not sure, maybe someone will be able to help you, but it sounds like it would be better for you to: - google search something like "Kafka Error sending fetch request TimeoutException" (I see there are quite a lot of results, some of them might be related) - ask this question on the Kafka

Re: How to register custormize serializer for flink kafka format type

2021-07-13 Thread Piotr Nowojski
Hi, It's mentioned in the docs [1], but unfortunately this is not very well documented in 1.10. In short you have to provide a custom implementation of a `DeserializationSchemaFactory`. Please look at the built-in factories for examples of how it can be done. In newer versions it's both easier

Re: How to read large amount of data from hive and write to redis, in a batch manner?

2021-07-08 Thread Piotr Nowojski
Great, thanks for coming back and I'm glad that it works for you! Piotrek On Thu, 8 Jul 2021 at 13:34, Yik San Chan wrote: > Hi Piotr, > > Thanks! I ended up doing option 1, and that works great. > > Best, > Yik San > > On Tue, May 25, 2021 at 11:43 PM Piotr No

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-07-05 Thread Piotr Nowojski
ob id. >> >> A bit of a mystery. Is there a way to at least catch it in the future? >> Any additional telemetry (logs, metrics) we can extract to better >> understand what is happening. >> >> Alex >> >> On Tue, Jun 8, 2021 at 12:01 AM Piotr Nowojski >

Re: Yarn Application Crashed?

2021-06-30 Thread Piotr Nowojski
You are welcome :) Piotrek On Wed, 30 Jun 2021 at 08:34, Thomas Wang wrote: > Thanks Piotr. This is helpful. > > Thomas > > On Mon, Jun 28, 2021 at 8:29 AM Piotr Nowojski > wrote: > >> Hi, >> >> You should still be able to get the Flink log

Re: Yarn Application Crashed?

2021-06-28 Thread Piotr Nowojski
Hi, You should still be able to get the Flink logs via: > yarn logs -applicationId application_1623861596410_0010 And it should give you more answers about what has happened. About the Flink and YARN behaviour, have you seen the documentation? [1] Especially this part: > Failed containers

Re: Looking for example code

2021-06-28 Thread Piotr Nowojski
Thomas J. Raef > Founder, WeWatchYourWebsite.com > http://wewatchyourwebsite.com > tr...@wewatchyourwebsite.com > LinkedIn <https://www.linkedin.com/in/thomas-raef-74b93a14/> > Facebook <https://www.facebook.com/WeWatchYourWebsite> > > > > On Mon, Jun 28, 2021

Re: Cancel job error ! Interrupted while waiting for buffer

2021-06-28 Thread Piotr Nowojski
Hi, It's hard to say from the log fragment, but I presume this task has correctly switched to "CANCELLED" state and this error should not have been logged as an ERROR, right? How did you get this stack trace? Maybe it was logged as a DEBUG message? If not, that would be probably a minor bug in

Re: Looking for example code

2021-06-28 Thread Piotr Nowojski
Hi, We are glad that you want to try out Flink, but if you would like to get help, you need to be a bit more specific. For someone to help you, it is necessary to know what exactly you are doing, what is not working, at which step, and how (including logs and/or error messages). In terms of how to

Re: How to make onTimer() trigger on a CoProcessFunction after a failure?

2021-06-25 Thread Piotr Nowojski
TERMARKS. >> WHY? >> processing watermark: 0 >> processing watermark: 0 >> processing watermark: 0 >> Attempts restart: 1 >> processing watermark: 1 >> processing watermark: 1 >> processing watermark: 1 >> processing watermark: 1 >> A

Re: Task is always created state after submit a example job

2021-06-21 Thread Piotr Nowojski
17 Lei Wang wrote: > There are enough slots on the jobmanager UI, but the slots are not available. > > After I add taskmanager.host: localhost to flink-conf.yaml, it works. > > But I don't know why. > > Thanks, > Lei > > > On Fri, Jun 18, 2021 at 6:07 PM

Re: How to make onTimer() trigger on a CoProcessFunction after a failure?

2021-06-18 Thread Piotr Nowojski
I'm glad I could help, I hope it will solve your problem :) Best, Piotrek On Fri, 18 Jun 2021 at 14:38, Felipe Gutierrez wrote: > On Fri, Jun 18, 2021 at 1:41 PM Piotr Nowojski > wrote: > >> Hi, >> >> Keep in mind that this is a quite low level approach to this p

Re: Handling Large Broadcast States

2021-06-18 Thread Piotr Nowojski
Hi, As far as I know there are no plans to support other state backends with BroadcastState. I don't know about any particular technical limitation, it probably just hasn't been done. Also I don't know how much effort that would be. Probably it wouldn't be easy. Timo, can you chip in how for

Re: The memory usage of the job is very different between Flink1.9 and Flink1.12

2021-06-18 Thread Piotr Nowojski
Hi, When upgrading, I would always suggest checking the release notes first [1]. Best, Piotrek [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html#memory-management On Fri, 18 Jun 2021 at 12:24, Haihang Jing wrote: > Ask a question, the same business

Re: How to make onTimer() trigger on a CoProcessFunction after a failure?

2021-06-18 Thread Piotr Nowojski
ctStreamOperator` or `OneInputStreamOperator`. Best, Piotrek On Fri, 18 Jun 2021 at 12:49, Felipe Gutierrez wrote: > Hello Piotrek, > > On Fri, Jun 18, 2021 at 11:48 AM Piotr Nowojski > wrote: > >> Hi, >> >> As far as I can tell timers should be che

Re: Task is always created state after submit a example job

2021-06-18 Thread Piotr Nowojski
Hi, I would start by looking at the Job Manager and Task Manager logs. Check whether the Task Managers connected to/registered with the Job Manager and, if so, whether there were any problems when submitting the job. It seems like either there are not enough slots, or the slots are not actually available. Best,

Re: Flow of events when Flink Iterations are used in DataStream API

2021-06-18 Thread Piotr Nowojski
Hi, In old Flink versions (prior to 1.9) that would be the case. If Operator D emitted a record to Operator B, but Operator B hadn't yet processed it when the checkpoint happened, this record would be lost during recovery. Operator D would be recovered with its state as it was after emitting this

Re: How to make onTimer() trigger on a CoProcessFunction after a failure?

2021-06-18 Thread Piotr Nowojski
Hi, As far as I can tell timers should be checkpointed and recovered. What may be happening is that the state of the last seen watermarks by operators on different inputs and different channels inside an input is not persisted. Flink is assuming that after the restart, watermark assigners will

Re: Discard checkpoint files through a single recursive call

2021-06-18 Thread Piotr Nowojski
Hi, Unfortunately at the moment I think there are no plans to push for this. I would suggest that you bump/cast a vote on https://issues.apache.org/jira/browse/FLINK-13856 in order to allow us to prioritise efforts more accurately. Best, Piotrek On Wed, 16 Jun 2021 at 05:46, Jiahui Jiang wrote: >

Re: Web UI shows my AssignTImestamp is in high back pressure but in/outPoolUsage are both 0.

2021-06-18 Thread Piotr Nowojski
Hi Haocheng, Regarding the first part, yes. For a very long time there was a trivial bug that was displaying the maximum "backpressure status" ("HIGH" in your case) from all of the subtasks, for every subtask, instead of showing the subtask's individual status. [1] It is/will be fixed in Flink
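
The display bug described in this reply can be modelled in plain Python. This is only an illustration of the described behaviour, not the Web UI's code: the function names and the sample statuses are invented.

```python
# Ordering of the backpressure statuses, from least to most severe.
SEVERITY = {"OK": 0, "LOW": 1, "HIGH": 2}


def displayed_with_bug(statuses):
    """Every subtask is shown with the maximum status across all subtasks."""
    worst = max(statuses, key=lambda status: SEVERITY[status])
    return [worst] * len(statuses)


def displayed_per_subtask(statuses):
    """Each subtask is shown with its own status."""
    return list(statuses)


# Invented per-subtask statuses for illustration.
statuses = ["OK", "HIGH", "OK", "LOW"]

print(displayed_with_bug(statuses))
print(displayed_per_subtask(statuses))
```

Under the buggy display, a single subtask in "HIGH" is enough to make every subtask look backpressured.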

Re: Re: Re: Re: Failed to cancel a job using the STOP rest API

2021-06-17 Thread Piotr Nowojski
Hi Thomas. The bug https://issues.apache.org/jira/browse/FLINK-21028 is still present in 1.12.1. You would need to upgrade to at least 1.13.0, 1.12.2 or 1.11.4. However as I mentioned before, 1.11.4 hasn't yet been released. On the other hand both 1.12.2 and 1.13.0 have already been superseded by

Re: Re: Re: Re: Failed to cancel a job using the STOP rest API

2021-06-09 Thread Piotr Nowojski
Yes, good catch Kezhu, IllegalStateException sounds very much like FLINK-21028. Thomas, could you try upgrading to Flink 1.13.1 or 1.12.4 (1.11.4 hasn't been released yet)? Piotrek On Tue, 8 Jun 2021 at 17:18, Kezhu Wang wrote: > Could it be same as FLINK-21028[1] (titled as “Streaming

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-08 Thread Piotr Nowojski
nts. > > Alex > > On Mon, Jun 7, 2021 at 3:12 AM Piotr Nowojski > wrote: > >> Hi Alex, >> >> A quick question. Are you using incremental checkpoints? >> >> Best, Piotrek >> >> sob., 5 cze 2021 o 21:23 napisał(a): >> >>

Re: recover from svaepoint

2021-06-07 Thread Piotr Nowojski
>>>> Unexpected error in InitProducerIdResponse; Producer attempted an >>>> operation with an old epoch. Either there is a newer producer with the same >>>> transactionalId, or the producer's transaction has been expired by the >>>> broker. >>>

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-07 Thread Piotr Nowojski
Hi Alex, A quick question. Are you using incremental checkpoints? Best, Piotrek On Sat, 5 Jun 2021 at 21:23, wrote: > Small correction, in T4 and T5 I mean Job2, not Job 1 (as job 1 was save > pointed). > > Thank you, > Alex > > On Jun 4, 2021, at 3:07 PM, Alexander Filipchik > wrote: > >

Re: recover from svaepoint

2021-06-02 Thread Piotr Nowojski
Hi, I think there is no generic way. If this error indeed happened after starting a second job from the same savepoint, or something like that, the user can change the Sink operator's UID. If this is an issue of intentional recovery from an earlier checkpoint/savepoint, maybe

Re: Customer operator in BATCH execution mode

2021-05-27 Thread Piotr Nowojski
>> >> Yes, it should be possible to register a timer for Long.MAX_WATERMARK if >> you want to apply a transformation at the end of each key. You could >> also use the reduce operation (DataStream#keyBy#reduce) in BATCH mode. > > According to [0], timer time is irrelevant since timer will be

Re: Managing Jobs entirely with Flink Monitoring API

2021-05-26 Thread Piotr Nowojski
Glad to hear it! Thanks for confirming that it works. Piotrek On Wed, 26 May 2021 at 12:59, Barak Ben Nathan wrote: > > > Hi Piotrek, > > > > This is exactly what I was searching for. Thanks! > > > > Barak > > > > *From:* Piotr Nowojski > *Sent

Re: Flink 1.11.3 NoClassDefFoundError: Could not initialize class

2021-05-26 Thread Piotr Nowojski
k.beforeInvoke(StreamTask.java:475) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:526) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) > at java

Re: Managing Jobs entirely with Flink Monitoring API

2021-05-26 Thread Piotr Nowojski
Hi Barak, Before starting the JobManager I don't think there is any API running at all. If you want to be able to submit/stop multiple jobs to the same cluster, session mode is indeed the way to go. But first you need to start the cluster ( start-cluster.sh ) [1] Piotrek [1]

Re: How to read large amount of data from hive and write to redis, in a batch manner?

2021-05-25 Thread Piotr Nowojski
Hi, You could always buffer records in your sink function/operator, until a large enough batch is accumulated and upload the whole batch at once. Note that if you want to have at-least-once or exactly-once semantics, you would need to take care of those buffered records in one way or another. For
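
A minimal sketch of the buffering idea above, in plain Java with no Flink or Redis dependency. `BatchingBuffer`, `batchSize` and `uploader` are hypothetical names chosen for illustration; they are not part of any Flink API. The assumption is that a sink would call `add` for every incoming record and `flush` when a checkpoint is taken and when the sink closes, which is one common way to avoid losing buffered records under at-least-once semantics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not a Flink class): collects records and uploads them in batches. */
public class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> uploader; // assumed to write one whole batch to the external store
    private final List<T> buffer = new ArrayList<>();

    public BatchingBuffer(int batchSize, Consumer<List<T>> uploader) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.uploader = uploader;
    }

    /** Buffers one record and uploads the batch once it is large enough. */
    public void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Uploads whatever is buffered; intended to be called on checkpoint and on close. */
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Pass a copy so the uploader never sees the list being cleared underneath it.
        uploader.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Number of records currently waiting for the next upload. */
    public int pending() {
        return buffer.size();
    }
}
```

In a real sink the `uploader` would issue a single pipelined or bulk write, and the choice of `batchSize` trades latency against the number of round trips.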

Re: Customer operator in BATCH execution mode

2021-05-25 Thread Piotr Nowojski
Hi, 1. I don't know if there is a built-in way of doing it. You can always pass this information anyway on your own when you are starting the job (via operator/function's constructors). 2. Yes, I think this should work. Best, Piotrek wt., 25 maj 2021 o 17:05 ChangZhuo Chen (陳昌倬) napisał(a): >

Re: Flink 1.11.3 NoClassDefFoundError: Could not initialize class

2021-05-25 Thread Piotr Nowojski
Hi Georgi, I don't think it's a bug in Flink. It sounds like some problem with dependencies or jars in your job. Can you explain a bit more what you mean by: > that some of them are constantly restarting with the following exception. After restart, everything is working as expected

Re: Time needed to read from Kafka source

2021-05-25 Thread Piotr Nowojski
Hi, That's a throughput of 700 records/second, which should be well below theoretical limits of any deserializer (from hundreds of thousands up to tens of millions of records/second per single operator), unless your records are huge or very complex. Long story short, I don't know of a magic bullet to

Re: yarn ship from s3

2021-05-25 Thread Piotr Nowojski
Hi Vijay, I'm not sure if I understand your question correctly. You have jar and configs (1, 2, 3 and 4) on S3 and you want to start a Flink job using those? Can you simply download those things (whole directory containing those) to the machine that will be starting the Flink job? Best, Piotrek

Re: DataStream API in Batch Mode job is timing out, please advise on how to adjust the parameters.

2021-05-25 Thread Piotr Nowojski
Hi Marco, How are you starting the job? For example, are you using Yarn as the resource manager? It looks like there is just enough resources in the cluster to run this job. Assuming the cluster is correctly configured and Task Managers are able to connect with the Job Manager (can you share full

Re: [ANNOUNCE] Apache Flink 1.12.3 released

2021-05-04 Thread Piotr Nowojski
Yes, thanks a lot for driving this release Arvid :) Piotrek czw., 29 kwi 2021 o 19:04 Till Rohrmann napisał(a): > Great to hear. Thanks a lot for being our release manager Arvid and to > everyone who has contributed to this release! > > Cheers, > Till > > On Thu, Apr 29, 2021 at 4:11 PM Arvid

Re: Nondeterministic results with SQL job when parallelism is > 1

2021-04-14 Thread Piotr Nowojski
d that count is about 20-25% short (and not > consistent) of what comes out consistently when parallelism is set to 1. > > > > Dylan > > > > *From: *Dylan Forciea > *Date: *Wednesday, April 14, 2021 at 9:08 AM > *To: *Piotr Nowojski > *Cc: *"user@flink.a

Re: flink1.12.2 "Failed to execute job"

2021-04-14 Thread Piotr Nowojski
think the only thing you can do is to either speed up the split enumeration (probably difficult) or increase the timeouts that are failing. I don't know if there is some other workaround at the moment (Becket?). Piotrek śr., 14 kwi 2021 o 15:57 Piotr Nowojski napisał(a): > Hey, > >

Re: Nondeterministic results with SQL job when parallelism is > 1

2021-04-14 Thread Piotr Nowojski
Hi, Yes, it looks like your query is non deterministic because of `FIRST_VALUE` used inside `GROUP BY`. If you have many different parallel sources, each time you run your query your first value might be different. If that's the case, you could try to confirm it with even smaller query:

Re: flink1.12.2 "Failed to execute job"

2021-04-14 Thread Piotr Nowojski
Hey, could you provide full logs from both task managers and job managers? Piotrek śr., 14 kwi 2021 o 15:43 太平洋 <495635...@qq.com> napisał(a): > After submit job, I received 'Failed to execute job' error. And the time > between initialization and scheduling last 214s. What has happened during
