Re: Yarn Application Crashed?

2021-06-30 Thread Piotr Nowojski
You are welcome :) Piotrek śr., 30 cze 2021 o 08:34 Thomas Wang napisał(a): > Thanks Piotr. This is helpful. > > Thomas > > On Mon, Jun 28, 2021 at 8:29 AM Piotr Nowojski > wrote: > >> Hi, >> >> You should still be able to get the Flink log

Re: Cancel job error ! Interrupted while waiting for buffer

2021-06-28 Thread Piotr Nowojski
Hi, It's hard to say from the log fragment, but I presume this task has correctly switched to "CANCELLED" state and this error should not have been logged as an ERROR, right? How did you get this stack trace? Maybe it was logged as a DEBUG message? If not, that would be probably a minor bug in

Re: Yarn Application Crashed?

2021-06-28 Thread Piotr Nowojski
Hi, You should still be able to get the Flink logs via: > yarn logs -applicationId application_1623861596410_0010 And it should give you more answers about what has happened. About the Flink and YARN behaviour, have you seen the documentation? [1] Especially this part: > Failed containers

Re: Looking for example code

2021-06-28 Thread Piotr Nowojski
Thomas J. Raef > Founder, WeWatchYourWebsite.com > http://wewatchyourwebsite.com > tr...@wewatchyourwebsite.com > LinkedIn <https://www.linkedin.com/in/thomas-raef-74b93a14/> > Facebook <https://www.facebook.com/WeWatchYourWebsite> > > > > On Mon, Jun 28, 2021

Re: Looking for example code

2021-06-28 Thread Piotr Nowojski
Hi, We are glad that you want to try out Flink, but if you would like to get help you need to be a bit more specific. What are you exactly doing, and what, on which step exactly and how is not working (including logs and/or error messages) is necessary for someone to help you. In terms of how to

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-07-05 Thread Piotr Nowojski
ob id. >> >> A bit of a mystery. Is there a way to at least catch it in the future? >> Any additional telemetry (logs, metrics) we can extract to better >> understand what is happening. >> >> Alex >> >> On Tue, Jun 8, 2021 at 12:01 AM Piotr Nowojski &g

Re: [ANNOUNCE] Apache Flink 1.12.3 released

2021-05-04 Thread Piotr Nowojski
Yes, thanks a lot for driving this release Arvid :) Piotrek czw., 29 kwi 2021 o 19:04 Till Rohrmann napisał(a): > Great to hear. Thanks a lot for being our release manager Arvid and to > everyone who has contributed to this release! > > Cheers, > Till > > On Thu, Apr 29, 2021 at 4:11 PM Arvid

Re: Re: flink kryo exception

2021-02-10 Thread Piotr Nowojski
Hi, As Kezhu Wang pointed out, this MIGHT BE caused by the https://issues.apache.org/jira/browse/FLINK-21028 issue. During stop with savepoint procedure, source thread might be interrupted, leaving the whole application in an invalid and inconsistent state. In FLINK-1.12.x one potential symptom

Re: Checkpoint fail due to timeout

2021-03-23 Thread Piotr Nowojski
Hi Alexey, You should definitely investigate why the job is stuck. 1. First of all, is it completely stuck, or is something moving? - Use Flink metrics [1] (number bytes/records processed), and go through all of the operators/tasks to check this. 2. The stack traces like the one you quoted: >

Re: PyFlink Table API: Interpret datetime field from Kafka as event time

2021-03-29 Thread Piotr Nowojski
Hi, I hope someone else might have a better answer, but one thing that would most likely work is to convert this field and define even time during DataStream to table conversion [1]. You could always pre process this field in the DataStream API. Piotrek [1]

Re: Flink failing to restore from checkpoint

2021-03-29 Thread Piotr Nowojski
Hi, What Flink version are you using and what is the scenario that's happening? It can be a number of things, most likely an issue that your filed mounted under: > /mnt/checkpoints/5dde50b6e70608c63708cbf979bce4aa/shared/47993871-c7eb-4fec-ae23-207d307c384a disappeared or stopped being

Re: Restore from Checkpoint from local Standalone Job

2021-03-29 Thread Piotr Nowojski
Hi Sandeep, I think it should work fine with `StandaloneCompletedCheckpointStore`. Have you checked if your directory /Users/test/savepoint is being populated in the first place? And if so, if the restarted job is not throwing some exceptions like it can not access those files? Also note, that

Re: Zigzag shape in TM JVM used memory

2021-04-06 Thread Piotr Nowojski
Hi, this should be posted on the user mailing list not the dev. Apart from that, this looks like normal/standard behaviour of JVM, and has very little to do with Flink. Garbage Collector (GC) is kicking in when memory usage is approaching some threshold:

Re: inputFloatingBuffersUsage=0?

2021-03-22 Thread Piotr Nowojski
dated blog post > > Thanks, > Alexey > > ------ > *From:* Piotr Nowojski > *Sent:* Friday, March 19, 2021 5:01 AM > *To:* Alexey Trenikhun > *Cc:* Flink User Mail List > *Subject:* Re: inputFloatingBuffersUsage=0? > > Hi, > >

Re: inputFloatingBuffersUsage=0?

2021-03-19 Thread Piotr Nowojski
` and `busyTimeMsPerSecond`. We are planning to release a new updated blog post about analysing backpressure in the following weeks. Best, Piotrek pt., 19 mar 2021 o 11:57 Piotr Nowojski napisał(a): > Hi Alexey, > > Have you looked at the documentation [1]? > > > inPoolUsage An estimate of

Re: inputFloatingBuffersUsage=0?

2021-03-19 Thread Piotr Nowojski
> > Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure > Thing. 23 Jul 2019 Nico Kruber & Piotr Nowojski . In a previous blog post, > we presented how Flink’s network stack works from the high-level > abstractions to the low-level details.This second blog

Re: Allocating tasks to specific TaskManagers

2021-03-03 Thread Piotr Nowojski
Hi Hyejo, I don't think it's possible. May I ask why do you want to do this? Best, Piotrek pon., 1 mar 2021 o 21:02 황혜조 napisał(a): > Hi, > > I am looking for a way to allocate each created subTask to a specific > TaskManager. > Is there any way to force assigning tasks to specific

Re: Independence of task parallelism

2021-03-03 Thread Piotr Nowojski
Hi Jan, As far as I remember, Flink doesn't handle very well cases like (1-2-1-1-1) and two Task Managers. There are no guarantees how the operators/subtasks are going to be scheduled, but most likely it will be as you mentioned/observed. First task manager will be handling all of the operators,

Re: Flink Metrics

2021-03-03 Thread Piotr Nowojski
Hi, 1) Do you want to output those metrics as Flink metrics? Or output those "metrics"/counters as values to some external system (like Kafka)? The problem discussed in [1], was that the metrics (Counters) were not fitting in memory, so David suggested to hold them on Flink's state and treat the

Re: how to propagate watermarks across multiple jobs

2021-03-03 Thread Piotr Nowojski
Hi, Can not you write the watermark as a special event to the "mid-topic"? In the "new job2" you would parse this event and use it to assign watermark before `xxxWindow2`? I believe this is what FlinkKafkaShuffle is doing [1], you could look at its code for inspiration. Piotrek [1]

Re: Job downgrade

2021-03-03 Thread Piotr Nowojski
Hi, I'm not sure what's the reason behind this. Probably classes are somehow attached to the state and this would explain why you are experiencing this issue. I've asked someone else from the community to chip in, but in the meantime, can not you just prepare a new "version 1" of the job, with

Re: Flink KafkaProducer flushing on savepoints

2021-03-03 Thread Piotr Nowojski
Hi, What Flink version and which FlinkKafkaProducer version are you using? `FlinkKafkaProducerBase` is no longer used in the latest version. I would guess some older versions, and FlinkKafkaProducer010 or later (no longer supported). I would suggest either to use the universal FlinkKafkaProducer

Re: Running Apache Flink on Android

2021-03-03 Thread Piotr Nowojski
Hi, The question would be, why do you want to do it? I think it might be possible, but probably nobody has ever tested it. Flink is a distributed system, so running it on an Android phone doesn't make much sense. I would suggest you first make your app/example work outside of Android. To make

Re: Allocating tasks to specific TaskManagers

2021-03-04 Thread Piotr Nowojski
er. However when I > checked the registeredTaskManager variable, only one or two TaskManagers > are registered even I started 9 TaskManagers. I would like to know how I > can register every started TaskManager. > > Best regards, > > Hyejo > > 2021년 3월 3일 (수) 오후 7:37, Piotr Nowoj

Re: Scaling Higher than 10k Nodes

2021-03-04 Thread Piotr Nowojski
re trying to contribute to the community. See FLINK-21110 [1] for the > details. > > Thank you~ > > Xintong Song > > > [1] https://issues.apache.org/jira/browse/FLINK-21110 > > On Thu, Mar 4, 2021 at 3:39 PM Piotr Nowojski > wrote: > >> Hi Joey, >> >&

Re: how to propagate watermarks across multiple jobs

2021-03-03 Thread Piotr Nowojski
Great :) Just one more note. Currently FlinkKafkaShuffle has a critical bug [1] that probably will prevent you from using it directly. I hope it will be fixed in some next release. In the meantime you can just inspire your solution with the source code. Best, Piotrek [1]

Re: Timeout Exception When Producing/Consuming Messages to Hundreds of Topics

2021-03-04 Thread Piotr Nowojski
Hi, Sorry, I don't know. I've heard that this kind of pattern is discouraged by Confluent. At least it used to be. Maybe someone else from the community will be able to help from his experience, however keep in mind that under the hood Flink is just simply using KafkaConsumer and KafkaProducer

Re: Scaling Higher than 10k Nodes

2021-03-03 Thread Piotr Nowojski
Hi Joey, Sorry for not responding to your question sooner. As you can imagine there are not many users running Flink at such scale. As far as I know, Alibaba is running the largest/one of the largest clusters, I'm asking for someone who is familiar with those deployments to take a look at this

Re: Stateful functions 2.2 and stop with savepoint

2021-03-04 Thread Piotr Nowojski
Hi Meissner, Can you clarify, are you talking about stateful functions? [1] Or the stream iterations [2]? The first e-mail suggests stateful functions, but the ticket that Kezhu created is talking about the latter. Piotrek [1] https://flink.apache.org/news/2020/04/07/release-statefun-2.0.0.html

Re: Stateful functions 2.2 and stop with savepoint

2021-03-04 Thread Piotr Nowojski
st, > Kezhu Wang > > > On March 4, 2021 at 22:13:48, Piotr Nowojski (piotr.nowoj...@gmail.com) > wrote: > > Hi Meissner, > > Can you clarify, are you talking about stateful functions? [1] Or the > stream iterations [2]? The first e-mail suggests stateful functions, but >

Re: Watermark doesn't progress after job restore from savepoint

2021-03-04 Thread Piotr Nowojski
Hi Hemant, State of the latest seen watermarks is not persisted in the operators. Currently DataStream API assumes that after recovery watermarks are going to be re-emitted sooner or later. What probably happens is that one of your sources has emitted watermarks (maybe some very high one or even

Re: Re: Independence of task parallelism

2021-03-05 Thread Piotr Nowojski
> Jan > > > *Gesendet:* Mittwoch, 03. März 2021 um 19:53 Uhr > *Von:* "Piotr Nowojski" > *An:* "Jan Nitschke" > *Cc:* "user" > *Betreff:* Re: Independence of task parallelism > Hi Jan, > > As far as I remember, Flink does

Re: Flink KafkaProducer flushing on savepoints

2021-03-05 Thread Piotr Nowojski
sal Flink producer, because of an older Kafka > version I am reading from. So unfortunately for now I will have to go with > the hack. > Thanks > -- > *From:* Piotr Nowojski > *Sent:* 03 March 2021 21:10 > *To:* Witzany, Tomas > *Cc:* user@flink.a

Re: [ANNOUNCE] Apache Flink 1.12.2 released

2021-03-05 Thread Piotr Nowojski
ailable in Jira: > > > > > > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12349502=12315522 > > > > > > We would like to thank all contributors of the Apache Flink community > who > > > made this release possible! > > > > > > Special thanks to Yuan Mei for managing the release and PMC members > > Robert > > > Metzger, Chesnay Schepler and Piotr Nowojski. > > > > > > Regards, > > > Roman > > > > > >

Re: Zigzag shape in TM JVM used memory

2021-04-08 Thread Piotr Nowojski
OOM: java heap space. Where to move next? simply bump up > taskmananger.memory? or just increase heap? > 3. What’s the final state? Job running fine and ensuring XYZ headroom in > each memory component? > > Best > Lu > > On Tue, Apr 6, 2021 at 12:26 AM Piotr Nowojski > wrote: > > &g

Re: Nondeterministic results with SQL job when parallelism is > 1

2021-04-14 Thread Piotr Nowojski
Hi, Yes, it looks like your query is non deterministic because of `FIRST_VALUE` used inside `GROUP BY`. If you have many different parallel sources, each time you run your query your first value might be different. If that's the case, you could try to confirm it with even smaller query:

Re: Nondeterministic results with SQL job when parallelism is > 1

2021-04-14 Thread Piotr Nowojski
d that count is about 20-25% short (and not > consistent) of what comes out consistently when parallelism is set to 1. > > > > Dylan > > > > *From: *Dylan Forciea > *Date: *Wednesday, April 14, 2021 at 9:08 AM > *To: *Piotr Nowojski > *Cc: *"user@flink.a

Re: Get consumed Kafka offsets from Flink kafka source

2021-04-14 Thread Piotr Nowojski
Hi, Depending how you configured your FlinkKafkaSource, but you can make the source to commit consumed offsets back to Kafka. So one way to examine them, would be to check those offsets in Kafka (I don't know how, but I'm pretty sure there is a way to do it). Secondly, if you want to examine

Re: Call run() of another SourceFunction inside run()?

2021-04-14 Thread Piotr Nowojski
Hi, I think it should be working. At least from the top of my head I do not see any reason why it shouldn't be working. Just make sure that you are proxying all relevant methods, not only those defined in `SourceFunction`. For example `FlinkKafkaConsumer` is implementing/extending:

Re: flink1.12.2 "Failed to execute job"

2021-04-14 Thread Piotr Nowojski
Hey, could you provide full logs from both task managers and job managers? Piotrek śr., 14 kwi 2021 o 15:43 太平洋 <495635...@qq.com> napisał(a): > After submit job, I received 'Failed to execute job' error. And the time > between initialization and scheduling last 214s. What has happened during

Re: flink1.12.2 "Failed to execute job"

2021-04-14 Thread Piotr Nowojski
think the only thing you can do is to either speed up the split enumeration (probably difficult) or increase the timeouts that are failing. I don't know if there is some other workaround at the moment (Becket?). Piotrek śr., 14 kwi 2021 o 15:57 Piotr Nowojski napisał(a): > Hey, > >

Re: Extract/Interpret embedded byte data from a record

2021-04-14 Thread Piotr Nowojski
Hi, One thing that you can do is to read this record using Avro keeping `Result` as `bytes` and in a subsequent mapping function, you could change the record type and deserialize the result. In Data Stream API: source.map(new MapFunction { ...} ) Best, Piotrek śr., 14 kwi 2021 o 03:17 Sumeet

Re: How to report metric based on keyed state piece

2021-02-17 Thread Piotr Nowojski
Hi Salva, I'm not sure, but I think you can not access the state (especially the keyed state) from within the metric, as metrics are being evaluated outside of the keyed context, and also from another thread. Also things like `ValueState`/`MapState` are not exposing any size. So probably you

Re: ConnectedStreams paused until control stream “ready”

2021-02-17 Thread Piotr Nowojski
Note^2: InputSelectable is `@PublicEvolving` API, so it can be used. However as Timo pointed out, it would block the checkpointing. If I remember correctly there is a checkState that will not allow to use `InputSelectable` with enabled checkpointing. Piotrek śr., 17 lut 2021 o 16:46 Kezhu Wang

Re: ConnectedStreams paused until control stream “ready”

2021-02-17 Thread Piotr Nowojski
16:58 Kezhu Wang napisał(a): > Piotr is right. So just ignore my words. It is the price of going deep > down the rabbit hole:-). > > > Best, > Kezhu Wang > > > On February 17, 2021 at 23:48:30, Piotr Nowojski (pnowoj...@apache.org) > wrote: > > Note^2: InputSele

Re: Re: [ANNOUNCE] RocksDB Version Upgrade and Performance

2021-08-14 Thread Piotr Nowojski
Hi, FYI, the performance regression after upgrading RocksDB was clearly visible in all of our RocksDB related benchmarks, like for example: http://codespeed.dak8s.net:8000/timeline/?ben=stateBackends.ROCKS=2 http://codespeed.dak8s.net:8000/timeline/?ben=stateBackends.ROCKS_INC=2 (and many more

Re: State processor API very slow reading a keyed state with RocksDB

2021-09-09 Thread Piotr Nowojski
Hi David, I can confirm that I'm able to reproduce this behaviour. I've tried profiling/flame graphs and I was not able to make much sense out of those results. There are no IO/Memory bottlenecks that I could notice, it looks indeed like the Job is stuck inside RocksDB itself. This might be an

Re: Duplicate copies of job in Flink UI/API

2021-09-08 Thread Piotr Nowojski
Hi Peter, Can you provide relevant JobManager logs? And can you write down what steps have you taken before the failure happened? Did this failure occur during upgrading Flink, or after the upgrade etc. Best, Piotrek śr., 8 wrz 2021 o 16:11 Peter Westermann napisał(a): > We recently upgraded

Re: How to register custormize serializer for flink kafka format type

2021-07-13 Thread Piotr Nowojski
Hi, It's mentioned in the docs [1], but unfortunately this is not very well documented in 1.10. In short you have to provide a custom implementation of a `DeserializationSchemaFactory`. Please look at the built-in factories for examples of how it can be done. In newer versions it's both easier

Re: Kafka Consumer Retries Failing

2021-07-13 Thread Piotr Nowojski
Hi, I'm not sure, maybe someone will be able to help you, but it sounds like it would be better for you to: - google search something like "Kafka Error sending fetch request TimeoutException" (I see there are quite a lot of results, some of them might be related) - ask this question on the Kafka

Re: Process finite stream and notify upon completion

2021-07-13 Thread Piotr Nowojski
Hi, Sources when finishing are emitting {{org.apache.flink.streaming.api.watermark.Watermark#MAX_WATERMARK}}, so I think the best approach is to register an even time timer for {{Watermark#MAX_WATERMARK}} or maybe {{Watermark#MAX_WATERMARK - 1}}. If your function registers such a timer, it would

Re: TaskManager crash after cancelling a job

2021-07-29 Thread Piotr Nowojski
Hi Ivan, It sounds to me like a bug in FlinkKinesisConsumer that it's not cancelling properly. The change in the behaviour could have been introduced as a bug fix [1], where we had to stop interrupting the source thread. This also might be related or at least relevant for fixing [2]. Ivan, the

Re: [ANNOUNCE] RocksDB Version Upgrade and Performance

2021-08-04 Thread Piotr Nowojski
Thanks for the detailed explanation Yun Tang and clearly all of the effort you have put into it. Based on what was described here I would also vote for going forward with the upgrade. It's a pity that this regression wasn't caught in the RocksDB community. I would have two questions/ideas: 1. Can

Re: Empty Kafka topics and watermarks

2021-10-11 Thread Piotr Nowojski
Great, thanks! pon., 11 paź 2021 o 17:24 James Sandys-Lumsdaine napisał(a): > Ah thanks for the feedback. I can work around for now but will upgrade as > soon as I can to the latest version. > > Thanks very much, > > James. > -- > *From:* Piot

Re: Empty Kafka topics and watermarks

2021-10-08 Thread Piotr Nowojski
Hi James, I believe you have encountered a bug that we have already fixed [1]. The small problem is that in order to fix this bug, we had to change some `@PublicEvolving` interfaces and thus we were not able to backport this fix to earlier minor releases. As such, this is only fixed starting from

Re: How to ugrade JobManagerCommunicationUtils from FLink 1.4 to Flink 1.5?

2021-10-08 Thread Piotr Nowojski
Hi, `JobManagerCommunicationUtils` was never part of Flink's API. It was an internal class, for our internal unit tests. Note that Flink's public API is annotated with `@Public`, `@PublicEvolving` or `@Experimental`. Anything else by default is internal (sometimes to avoid confusion we are

Re: Event is taking a lot of time between the operators

2021-09-28 Thread Piotr Nowojski
Hi, With Flink 1.8.0 I'm not sure how reliable the backpressure status is in the WebUI when it comes to the Async operators. If I remember correctly until around Flink 1.10 (+/- 2 version) backpressure monitoring was checking for thread dumps stuck in requesting Flink's network memory buffers. If

Re: Event is taking a lot of time between the operators

2021-09-29 Thread Piotr Nowojski
improving the approach. > > > > Thanks, > > Sanket Agrawal > > > > *From:* Ragini Manjaiah > *Sent:* Wednesday, September 29, 2021 11:17 AM > *To:* Sanket Agrawal > *Cc:* Piotr Nowojski ; user@flink.apache.org > *Subject:* Re: Event is taking a lot of t

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
Hi Alexis, You can read about those metrics in the documentation [1]. Long alignment duration and start delay almost always come together. High values indicate long checkpoint barrier propagation times through the job graph, that's always (at least so far I haven't seen a different reason) caused

Re: Prometheus labels in metrics / counters

2021-12-17 Thread Piotr Nowojski
Hi, In principle you can register metric/metric groups dynamically it should be working just fine. However your code probably won't work, because per every record you are creating a new group and new counter, that most likely will be colliding with an old one. So every time you are defining a new

Re: fraud detection example fails

2021-12-17 Thread Piotr Nowojski
Hi, It might be simply because the binary artifacts are not yet published/visible. The blog post [1] mentions that it should be visible within 24h from yesterday), so please try again later/tomorrow. This is also mentioned in the dev mailing list thread [2] Best, Piotrek [1]

Re: Read parquet data from S3 with Flink 1.12

2021-12-17 Thread Piotr Nowojski
Hi, Reading in the DataStream API (that's what I'm using you are doing) from Parquet files is officially supported and documented only since 1.14 [1]. Before that it was only supported for the Table API. As far as I can tell, the basic classes (`FileSource` and `ParquetColumnarRowInputFormat`)

Re: unaligned checkpoint for job with large start delay

2021-12-20 Thread Piotr Nowojski
d (retain on cancellation) > Tolerable Failed Checkpoints 10 > ``` > > Are there other metrics should I look at—why else should tasks fail > acknowledgement in unaligned mode? Is it something about the implementation > details of window function that I am not considering? My main hunch

Re: Input Selectable & Checkpointing

2021-11-24 Thread Piotr Nowojski
Hi Shazia, FLIP-182 [1] might be a thing that will let you address issues like this in the future. With it, maybe you could do some magic with assigning watermarks to make sure that one stream doesn't run too much into the future which would effectively prioritise the other stream. But that's

Re: Input Selectable & Checkpointing

2021-11-25 Thread Piotr Nowojski
> > Best wishes, > > Shazia > > > - Original message - > From: "Piotr Nowojski" > To: "Shazia Kayani" > Cc: mart...@ververica.com, "user" > Subject: [EXTERNAL] Re: Input Selectable & Checkpointing > Date: Wed, Nov 24, 2

Re: Troubleshooting checkpoint timeout

2021-10-26 Thread Piotr Nowojski
r downstream, it > just broadcasts it to all output channels”. > > > > I’ll see what I can do about upgrading the Flink version and do some more > tests with unaligned checkpoints. Thanks again for all the info. > > > > Regards, > > Alexis. > > > > *Fro

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
arrier for n-1 and “last” the one for n? >3. Start delay also refers to the “first checkpoint barrier to reach >this subtask”. As before, what is “first” in this context? >4. Maybe this will be answered by the previous questions, but what >happens to barriers if a down

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
say, the last 5 minutes of data. Then it > fails again because the checkpoint times out and, after restarting, would > it try to read, for example, 15 minutes of data? If there was no > backpressure in the source, it could be that the new checkpoint barriers > created after the first restar

Re: A savepoint was created but the corresponding job didn't terminate successfully.

2021-11-09 Thread Piotr Nowojski
Hi Dongwon, Thanks for reporting the issue, I've created a ticket for it [1] and we will analyse and try to fix it soon. In the meantime it should be safe for you to ignore this problem. If this failure happens only rarely, you can always retry stop-with-savepoint command and there should be no

Re: Beginner: guidance on long term event stream persistence and replaying

2021-11-09 Thread Piotr Nowojski
Hi Simon, >From the top of my head I do not see a reason why this shouldn't work in Flink. I'm not sure what your question is here. For reading both from the FileSource and Kafka at the same time you might want to take a look at the Hybrid Source [1]. Apart from that there are

Re: how to expose the current in-flight async i/o requests as metrics?

2021-11-09 Thread Piotr Nowojski
Hi All, to me it looks like something deadlocked, maybe due to this OOM error from Kafka, preventing a Task from making any progress. To confirm Dongwan you could collecte stack traces while the job is in such a blocked state. Deadlocked Kafka could easily explain those symptoms and it would be

Re: Providing files while application mode deployment

2021-11-09 Thread Piotr Nowojski
Hi Vasily, Unfortunately no, I don't think there is such an option in your case. With per job mode, you could try to use the Distributed Cache, it should be working in streaming as well [1], but this doesn't work in the application mode, as in that case no code is executed on the JobMaster [2]

Re: unaligned checkpoint for job with large start delay

2021-12-16 Thread Piotr Nowojski
Hi Mason, In Flink 1.14 we have also changed the timeout behavior from checking against the alignment duration, to simply checking how old is the checkpoint barrier (so it would also account for the start delay) [1]. It was done in order to solve problems as you are describing. Unfortunately we

Re: Confusion about rebalance bytes sent metric in Flink UI

2021-12-16 Thread Piotr Nowojski
Hi Tao, Could you prepare a minimalistic example that would reproduce this issue? Also what Flink version are you using? Best, Piotrek czw., 16 gru 2021 o 09:44 tao xiao napisał(a): > >Your upstream is not inflating the record size? > No, this is a simply dedup function > > On Thu, Dec 16,

Re: Buffering when connecting streams

2021-12-05 Thread Piotr Nowojski
Hi Alexis and David, This actually can not happen. There are mechanisms in the code to make sure none of the input is starved IF there is some data to be read. The only time when input can be blocked is during the alignment phase of aligned checkpoints under back pressure. If there was a back

Re: Kafka Consumer Retries Failing

2021-07-19 Thread Piotr Nowojski
apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1193) > .. > > Because of heartbeat timeout, there was an internal restart of Flink and > the Kafka consumption rate recovered after the restart. > > Could the backpressure timeout and

Re: Kafka Consumer Retries Failing

2021-07-19 Thread Piotr Nowojski
d the High Heap Usage is because of a Memory Leak in a > library we are using. > > Thanks for your help. > > On Mon, Jul 19, 2021 at 8:31 PM Piotr Nowojski > wrote: > >> Thanks for the update. >> >> > Could the backpressure timeout and heartbeat ti

Re: Kafka Consumer Retries Failing

2021-07-14 Thread Piotr Nowojski
il.concurrent.CompletableFuture.get(CompletableFuture.java:1908)\n\tat > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:293)\n\tat > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(Local

Re: Process finite stream and notify upon completion

2021-07-14 Thread Piotr Nowojski
hat the entire flow is the > same(operators and sink); is there any good practice to achieve a single > job for that? > > Tamir. > > [1] > https://stackoverflow.com/questions/54687372/flink-append-an-event-to-the-end-of-finite-datastream#answer-54697302 > -

Re: Kafka Consumer Retries Failing

2021-07-14 Thread Piotr Nowojski
2)\n\tat > org.apache.flink.streaming.connectors.kafka.internal.Handover.pollNext(Handover.java:74)\n\tat > org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:133)\n\tat > org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(Fl

Re: Process finite stream and notify upon completion

2021-07-14 Thread Piotr Nowojski
ly called windows), or when you do event processing where the > > time when an event occurred is important. > > ci.apache.org > > > > > > [2] > https://ci.apache.org/projects/flink/flink-docs-release-1.13/api/java/org/apache/flink/streaming/api/watermark/Watermar

Re: unaligned checkpoint for job with large start delay

2022-01-10 Thread Piotr Nowojski
r to see the huge jumps in window fires. I can this > benefiting other users who troubleshoot the problem of large number of > window firing. > > Best, > Mason > > On Dec 29, 2021, at 2:56 AM, Piotr Nowojski wrote: > > Hi Mason, > > > and it has to finish proce

Re: RichMapFunction to convert tuple of strings to DataStream[(String,String)]

2022-01-10 Thread Piotr Nowojski
Hi Sid, I don't see on the stackoverflow explanation of what are you trying to do here (no mentions of MapFunction or a tuple). If you want to create a `DataStream` from some a pre existing/static Tuple of Strings, the easiest thing would be to convert the tuple to a collection/iterator and use

Re: Uploading jar to s3 for persistence

2022-01-10 Thread Piotr Nowojski
Hi Puneet, Have you seen this thread before? [1]. It looks like the same issue and especially this part might be the key: > Be aware that the filesystem used by the FileUploadHandler > is java.nio.file.FileSystem and not > Flink's org.apache.flink.core.fs.FileSystem for which we provide

Re: Job stuck in savePoint - entire topic replayed on restart.

2022-01-10 Thread Piotr Nowojski
Hi Basil, 1. What do you mean by: > The only way we could stop these stuck jobs was to patch the finalizers. ? 2. Do you mean that your job is stuck when doing stop-with-savepoint? 3. What Flink version are you using? Have you tried upgrading to the most recent version, or at least the most

Re: Custom Kafka Keystore on Amazon Kinesis Analytics

2022-01-10 Thread Piotr Nowojski
Hi Clayton, I think in principle this example should be still valid, however instead of providing a `CustomFlinkKafkaConsumer` and overriding it's `open` method, you would probably need to override `org.apache.flink.connector.kafka.source.reader.KafkaSourceReader#start`. So you would most likely

Re: unaligned checkpoint for job with large start delay

2022-01-11 Thread Piotr Nowojski
gt; > > > > > > [1] > https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/config/#state-backend-rocksdb-metrics-estimate-num-keys > > > > *From:* Mason Chen > *Sent:* Dienstag, 11. Januar 2022 19:20 > *To:* Piotr Nowojski

Re: RichMapFunction to convert tuple of strings to DataStream[(String,String)]

2022-01-10 Thread Piotr Nowojski
n's posts where he mentioned that this way of creating > DS is not "encouraged and tested". So, I figured out an alternate way of > using side output and now I can do what I was aiming for. > > Thanks, > Sid. > > On Mon, Jan 10, 2022 at 5:29 PM Piotr Now

Re: Is there a way to know how long a Flink app takes to finish resuming from Savepoint?

2022-01-10 Thread Piotr Nowojski
Hi, Unfortunately there is no such metric. Regarding the logs, I'm not sure what Flink version you are using, but since Flink 1.13.0 [1][2], you could relay on the tasks/subtasks switch from `INITIALIZING` to `RUNNING` to check when the task/subtask has finished recovering it's state. Best,

Re: Custom Kafka Keystore on Amazon Kinesis Analytics

2022-01-10 Thread Piotr Nowojski
Ah, I see. Pitty. You could always use reflection if you really had to, but that's of course not a long term solution. I will raise this issue to the KafkaSource/AWS contributors. Best, Piotr Nowojski pon., 10 sty 2022 o 16:55 Clayton Wohl napisał(a): > Custom code can create subclas

Re: Basic questions about resuming stateful Flink jobs

2022-02-16 Thread Piotr Nowojski
Hi James, Sure! The basic idea of checkpoints is that they are fully owned by the running job and used for failure recovery. Thus by default if you stopped the job, checkpoints are being removed. If you want to stop a job and then later resume working from the same point that it has previously

Re: getting "original" ingestion timestamp after using a TimestampAssigner

2022-02-16 Thread Piotr Nowojski
Hi Frank, I'm not sure exactly what you are trying to accomplish, but yes. In the TimestampAssigner you can only return what should be the new timestamp for the given record. If you want to use "ingestion time" - "true even time" as some kind of delay metric, you will indeed need to have both

Re: Job manager slots are in bad state.

2022-02-16 Thread Piotr Nowojski
Hi Josson, Would you be able to reproduce this issue on a more recent version of Flink? I'm afraid that we won't be able to help with this issue as this affects a Flink version that is not supported for quite some time and moreover `SlotSharingManager` has been completed removed in Flink 1.13.

Re: Performance Issues in Source Operator while migrating to Flink-1.14 from 1.9

2022-02-16 Thread Piotr Nowojski
Hi, Unfortunately the new KafkaSource was contributed without good benchmarks, and so far you are the first one that noticed and reported this issue. Without more direct comparison (as Martijn suggested), it's hard for us to help right away. It would be a tremendous help for us if you could for

Re: Python Function for Datastream Transformation in Flink Java Job

2022-02-16 Thread Piotr Nowojski
Hi, As far as I can tell the answer is unfortunately no. With Table API (SQL) things are much simpler, as you have a restricted number of types of columns that you need to support and you don't need to support arbitrary Java classes as the records. I'm shooting blindly here, but maybe you can

Re: Low Watermark

2022-02-25 Thread Piotr Nowojski
Hi, It's the minimal watermark among all 10 parallel instances of that Task. Using metric (currentInputWatermark) [1] you can access the watermark of each of those 10 sub tasks individually. Best, Piotrek [1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/ pt., 25 lut

Re: Basic questions about resuming stateful Flink jobs

2022-02-17 Thread Piotr Nowojski
gt;> >> >> From what I’ve read there is state, checkpoints and save points – all of >> them hold state - and currently I can’t get any of these to restore when >> developing in an IDE and the program builds up all state from scratch. So >> what else do I need to do in

Re: Buffering when connecting streams

2022-01-18 Thread Piotr Nowojski
in a (Process)JoinFunction? > The join needs keys, but I don’t know if the resulting stream counts as > keyed from the state’s point of view. > > > > Regards, > > Alexis. > > > > *From:* Piotr Nowojski > *Sent:* Montag, 6. Dezember 2021 08:43 > *To:* David Mo

Re: Avro deserialization issue

2022-04-13 Thread Piotr Nowojski
Hey, Could you be more specific about how it is not working? A compiler error that there is no such class as RuntimeContextInitializationContextAdapters? This class has been introduced in Flink 1.12 in FLINK-18363 [1]. I don't know this code and I also don't know where it's documented, but: a)

Re: [Discuss] Creating an Apache Flink slack workspace

2022-05-06 Thread Piotr Nowojski
Hi Xintong, I'm not sure if slack is the right tool for the job. IMO it works great as an adhoc tool for discussion between developers, but it's not searchable and it's not persistent. Between devs, it works fine, as long as the result of the ad hoc discussions is backported to JIRA/mailing

<    1   2   3   4   5   6   7   >