Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-19 Thread Yik San Chan
Hi Dian, I simplify the question in https://stackoverflow.com/questions/66687797/pyflink-java-io-eofexception-at-java-io-datainputstream-readfully. You can also find the updated question below: I have a PyFlink job that reads from a file, filter based on a condition, and print. This is a `tree` v

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-19 Thread Yik San Chan
I got why regarding the simplified question - the dummy parser should return key(string)-value(string), otherwise it fails the result_type spec On Fri, Mar 19, 2021 at 3:37 PM Yik San Chan wrote: > Hi Dian, > > I simplify the question in > https://stackoverflow.com/questions/66687797/pyflink-jav

Re: The Role of TimerService in ProcessFunction

2021-03-19 Thread Dawid Wysakowicz
Hi Chirag, I agree it might be a little bit confusing. Let me try to explain the reasoning. To do that I'll first try to rephrase the reasoning from FLINK-8560 for introducing the KeyedProcessFunction. It was introduced so that users have a typed access to the current key via Context and OnTimerC

Re: Flink minimum resource recommendation on k8s cluster

2021-03-19 Thread Dawid Wysakowicz
I'd say no. It depends on your job. You can refer to a very good presentation from Robert on how to calculate resource requirements[1]. [1] https://www.youtube.com/watch?v=8l8dCKMMWkw On 18/03/2021 11:37, Amit Bhatia wrote: > Hi, > > Is there any minimum resource ( CPU & Memory) recommendation to

Re: Parameter to config read frequency in Kafka SQL connector

2021-03-19 Thread Dawid Wysakowicz
Hi, Unfortunately I have no experience with this. You can pass any Kafka client parameters through the properties.* option[1] and see if the setting works for you. Best, Dawid [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/kafka.html#properties On 18/03/2

Re: Understanding Max Parallelism

2021-03-19 Thread Dawid Wysakowicz
Hi Aeden, The maxParallelism option defines the number of key groups that will be created within the keyed state and thus define the maximum parallelism that a Flink keyed job can scale up to as each key group must be atomically assigned to a single task. You can read more on how the rescaling wor

Re: Eliminating Shuffling Under FlinkSQL

2021-03-19 Thread Dawid Wysakowicz
Your understanding of a group by is correct. It is equivalent to a key by. I agree it would be a great feature to keep the Source's partitioning but unfortunately as of now it is not yet supported. Best, Dawid On 18/03/2021 18:28, Aeden Jameson wrote: > It's my understanding that a group by is a

Re: Flink History server ( running jobs )

2021-03-19 Thread Matthias Pohl
Hi Vishal, yes, as the documentation explains [1]: Only jobs that reached a globally terminal state are archived into Flink's history server. State information about running jobs can be retrieved through Flink's REST API. Best, Matthias [1] https://ci.apache.org/projects/flink/flink-docs-release-

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-19 Thread Xingbo Huang
Yes, you need to ensure that the key and value types of the Map are determined Best, Xingbo Yik San Chan 于2021年3月19日周五 下午3:41写道: > I got why regarding the simplified question - the dummy parser should > return key(string)-value(string), otherwise it fails the result_type spec > > On Fri, Mar 19

Re: inputFloatingBuffersUsage=0?

2021-03-19 Thread Piotr Nowojski
Hi Alexey, Have you looked at the documentation [1]? > inPoolUsage An estimate of the input buffers usage. (ignores LocalInputChannels) Gauge > inputFloatingBuffersUsage An estimate of the floating input buffers usage. (ignores LocalInputChannels) Gauge > inputExclusiveBuffersUsage An estimate of

Re: Flink SQL : Interval Outer/Left Join not working as expected

2021-03-19 Thread Benchao Li
Hi Aneesha, For the interval join operator will output the data with NULL when it confirms that there will no data coming before the watermark. And there is an optimization for reducing state access, which may add more time to trigger the output of these data. For your case, it's almost 30 + 30 =

Re: inputFloatingBuffersUsage=0?

2021-03-19 Thread Piotr Nowojski
Hi, clarification about the 2nd part. Required memory is one single exclusive buffer per channel, so if you are running low on memory, floating buffers are one of the first to go, hence you could observe some `0` for floating buffers. For analysing backpressure I would recommend to use a differen

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-19 Thread Dian Fu
Good finding! I think we should handle this case more friendly as I guess this issue should be very common for Python users since Python is dynamic language. I have created https://issues.apache.org/jira/browse/FLINK-21876 to follow up with t

Re: Flink History server ( running jobs )

2021-03-19 Thread Vishal Santoshi
Thank you for the confirmation. On Fri, Mar 19, 2021 at 5:37 AM Matthias Pohl wrote: > Hi Vishal, > yes, as the documentation explains [1]: Only jobs that reached a globally > terminal state are archived into Flink's history server. State information > about running jobs can be retrieved through

Re: Saved State in FSstate Backend

2021-03-19 Thread Abdullah bin Omar
Hi, I use the checkpoint configuration 1 ms. env.getCheckpointConfig().setCheckpointTimeout(1); used FSstatebackend to save the state env.setStateBackend(*new* FsStateBackend( "file:///Users/username/Documents/checkpoint-save-java/checkpointing.txt", *false*)); //here I tried both t

Re: Python API + Unit Testing

2021-03-19 Thread Kevin Lam
Hi Dian Fu, I meant testing in application development. When I'm developing a Pyflink Pipeline, are there any recommended approaches to testing the Flink application? For instance, how should we test applications end-to-end? Individual operators? I'm interested in the Datastream API. One approach

Re: Flink SQL : Interval Outer/Left Join not working as expected

2021-03-19 Thread Aneesha Kaushal
Hello Benchao, Thanks for clarifying! The issue was I send very few records so I could not test how the watermark is progressing. Today after trying on continuous stream I was able to get the results. On Fri, Mar 19, 2021 at 5:24 PM Benchao Li wrote: > Hi Aneesha, > > For the interval join ope

OOM issues with Python Objects

2021-03-19 Thread Kevin Lam
Hi all, I've encountered an interesting issue where I observe an OOM issue in my Flink Application when I use a DataStream of Python Objects, but when I make that Python Object a Subclass of pyflink.common.types.Row and provide TypeInformation, there is no issue. For the OOM scenario, no elements

RocksDB StateBuilder unexpected exception

2021-03-19 Thread dhanesh arole
Hello Hivemind, We are running a stateful streaming job. Each task manager instance hosts around ~100GB of data. During restart of task managers we encountered following errors, because of which the job is not able to restart. Initially we thought it might be due to failing status checks of attach

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-19 Thread Yik San Chan
Hi Dian, Thank you for your help! Best, Yik San On Fri, Mar 19, 2021 at 9:33 PM Dian Fu wrote: > Good finding! > > I think we should handle this case more friendly as I guess this issue > should be very common for Python users since Python is dynamic language. I > have created https://issues.a

Re: inputFloatingBuffersUsage=0?

2021-03-19 Thread Alexey Trenikhun
Hi Piotrek, Thank for information, looks like isBackPressured ( and in future backPressuredTimeMsPerSecond) is more useful for our simple monitoring purposes. Looking forward for updated blog post Thanks, Alexey From: Piotr Nowojski Sent: Friday, March 19, 202

Flink on Minikube

2021-03-19 Thread Sandeep khanzode
Hello, I have a fat JAR compiled using the Man Shade plugin and everything works correctly when I deploy it on a standalone local cluster i.e. one job and one task manager node. But I installed Minikube and the same JAR file packaged into a docker image fails with weird serialization errors:

Capturing Statement Execution / Results within JdbcSink

2021-03-19 Thread Rion Williams
Hey all, I've been working with JdbcSink and it's really made my life much easier, but I had two questions about it that folks might be able to answer or provide some clarity around. *Accessing Statement Execution / Results* Is there any mechanism in place (or out of the box) to support reading

Re: Flink application has slightly data loss using Processing Time

2021-03-19 Thread Rainie Li
Hi Arvid, After increasing producer.kafka.request.timeout.ms from 9 to 12. The job has been running fine for almost 5 days, but one of the tasks failed again recently for the same timeout error. (attached stack trace below) Should I keep increasing producer.kafka.request.timeout.ms value?

Histogram

2021-03-19 Thread Alexey Trenikhun
Hello, Is any way to expose from Flink Histogram metric, not Summary ? Thanks, Alexey

Re: Flink SQL : Interval Outer/Left Join not working as expected

2021-03-19 Thread Benchao Li
Glad to hear that you resolved the issue. Aneesha Kaushal 于2021年3月19日周五 下午10:49写道: > Hello Benchao, > > Thanks for clarifying! > The issue was I send very few records so I could not test how the > watermark is progressing. Today after trying on continuous stream I was > able to get the results.