Flink SQL Client support for k8s application mode?

2022-03-15 Thread JianWen Huang
Does the Flink SQL Client support k8s application mode?
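
For context, a sketch of how an application-mode deployment on Kubernetes is typically launched for a packaged jar via the flink CLI (cluster id, image name, and jar path are illustrative):

```
./bin/flink run-application \
    -t kubernetes-application \
    -Dkubernetes.cluster-id=my-flink-app \
    -Dkubernetes.container.image=my-flink-image:latest \
    local:///opt/flink/usrlib/my-job.jar
```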


Setting S3 as State Backend in SQL Client

2022-03-15 Thread dz902
Hi,

I'm using Flink 1.14 and was unable to set S3 as the state backend. I tried
combinations of:

SET state.backend='filesystem';
SET state.checkpoints.dir='s3://xxx/checkpoints/';
SET state.backend.fs.checkpointdir='s3://xxx/checkpoints/';
SET state.checkpoint-storage='filesystem'

As well as:

SET state.backend='hashmap';

This covers both the legacy 1.13 way and the new 1.14 way of doing it.

None worked. In the Web UI I see checkpoints continuously being made to the
Job Manager. The configuration reads:

- Checkpoint Storage: JobManagerCheckpointStorage
- State Backend: HashMapStateBackend

Is this a bug? Is there a way to set state backend to S3 using SQL Client?

Thanks,
Dai
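
For reference, a minimal flink-conf.yaml sketch that would normally route checkpoints to S3 (bucket path illustrative; this assumes an S3 filesystem plugin such as flink-s3-fs-hadoop is installed under plugins/). For a session cluster these options are read at cluster startup:

```
state.backend: hashmap
state.checkpoint-storage: filesystem
state.checkpoints.dir: s3://my-bucket/checkpoints/
```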


Re: Setting S3 as State Backend in SQL Client

2022-03-15 Thread dz902
I just tried editing flink-conf.yaml, and it seems the SQL Client does not
respect that either. Is this intended behavior?



RocksDB metrics for effective memory consumption

2022-03-15 Thread Donatien Schmitz
Hi,

I am working on an analysis of the memory consumption of the RocksDB state
backend for simple DAGs. I would like to check the fine-grained memory
utilization of RocksDB with the native metrics (reported via
Prometheus+Grafana). RocksDB uses the managed memory allocated to each
TaskManager, but this value peaks at the beginning of the job. Is the managed
memory always fully allocated, even when it is not needed?

For my experiments I am using a simple DAG consisting of Source (FS) -> Map ->
DiscardSink. The Map does not process anything but stores the latest value of
each key of the KeyedStream (with a known number of keys in the dataset and a
constant value size of 1024 bytes).

If anyone has more insights on the memory utilization of RocksDB at Flink's
level, I would appreciate it.

Best,

Donatien Schmitz
PhD Student
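
For reference, a sketch of the kind of native-metric options referred to above, as flink-conf.yaml entries (a non-exhaustive selection; option names as documented for Flink 1.14, and each enabled metric adds some overhead):

```
state.backend: rocksdb
# expose selected RocksDB native metrics
state.backend.rocksdb.metrics.block-cache-usage: true
state.backend.rocksdb.metrics.block-cache-capacity: true
state.backend.rocksdb.metrics.cur-size-all-mem-tables: true
state.backend.rocksdb.metrics.estimate-table-readers-mem: true
# RocksDB draws its memory from the managed memory budget (the default)
state.backend.rocksdb.memory.managed: true
```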


[ANNOUNCE] Apache Flink 1.14.4 released

2022-03-15 Thread Konstantin Knauf
The Apache Flink community is very happy to announce the release of Apache
Flink 1.14.4, which is the third bugfix release for the Apache Flink 1.14
series.

Apache Flink® is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data streaming
applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the improvements
for this bugfix release:

https://flink.apache.org/news/2022/03/11/release-1.14.4.html

The full release notes are available in Jira:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351231

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Regards,

Konstantin

--

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk


Re: [ANNOUNCE] Apache Flink 1.14.4 released

2022-03-15 Thread Martijn Visser
Thank you Konstantin and everyone who contributed!



Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

2022-03-15 Thread 罗宇侠(莫辞)
Hi, thanks for the inputs.
I wrote a Google doc, "Plan for decoupling Hive connector with Flink planner"
[1], that shows how to decouple the Hive connector from the planner.

[1]
https://docs.google.com/document/d/1LMQ_mWfB_mkYkEBCUa2DgCO2YdtiZV7YRs2mpXyjdP4/edit?usp=sharing

Best,
Yuxia.
------------------------------------------------------------------
From: Jark Wu
Date: 2022-03-10 19:59:30
To: Francesco Guardiani; dev
Cc: Martijn Visser; User; 罗宇侠(莫辞)
Subject: Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Hi Francesco,

Yes. The Hive syntax is a syntax plugin provided by Hive connector.

> But right now I don't think it's a good idea to add new features on top,
as it will only create more maintenance burden both for Hive developers and
for table developers.

We are not adding new Hive features, but fixing compatibility or behavior bugs,
and almost all of them relate only to the Hive connector code, with nothing to
do with the table planner.

I agree we should investigate ASAP how to decouple the Hive connector from the
planner and how much work that requires.
We will come up with a Google doc soon. But AFAIK, this should not be a huge
amount of work and will not conflict with the bugfix work.

Best,
Jark
On Thu, 10 Mar 2022 at 17:03, Francesco Guardiani  
wrote:

> We still need some work to make the Hive dialect rely purely on public APIs,
> and the Hive connector should be decoupled from the table planner.

From the table perspective, I think this is the big pain point at the moment.
First of all, when we talk about the Hive syntax, we're really talking about
the Hive connector, as my understanding is that without the Hive connector on
the classpath you can't use the Hive syntax [1].

The Hive connector relies heavily on internals [2], and this is an important
struggle for the table project, as it sometimes impedes and slows down
development of new features and creates a huge maintenance burden for table
developers [3]. The planner itself has some classes specific to Hive [4],
making the codebase of the planner more complex than it already is. Some of
these are just legacy; others exist because some abstractions are missing on
the table planner side, but those just need some work.

So I agree with Jark: when the two Hive modules (connector-hive and
sql-parser-hive) reach a point where they don't depend at all on
flink-table-planner, like every other connector (except for testing, of
course), we should be good to move them to a separate repo and continue
committing to them. But right now I don't think it's a good idea to add new
features on top, as it will only create more maintenance burden both for Hive
developers and for table developers.

My concern with this plan is: how realistic is it to fix all the planner
internal leaks in the existing Hive connector/parser? To me this seems like a
huge task, including a non-trivial amount of work to stabilize and design new
entry points in the Table API.

[1] HiveParser
[2] HiveParserCalcitePlanner
[3] Just talking about code coupling, not even mentioning problems like 
dependencies and security updates
[4] HiveAggSqlFunction
On Thu, Mar 10, 2022 at 9:05 AM Martijn Visser  wrote:

Thank you, Yuxia, for volunteering; that's much appreciated. It would be
great if you can create an umbrella ticket for that.

It would be great to get some insights from current Flink and Hive users
about which versions are being used.
@Jark I would indeed deprecate the old Hive versions in Flink 1.15 and then
drop them in Flink 1.16. That would also remove some tech debt and make
externalizing the connectors less work.

Best regards,

Martijn
On Thu, 10 Mar 2022 at 07:39, Jark Wu  wrote:

Thanks Martijn for the reply and summary.

I totally agree with your plan, and I thank Yuxia for volunteering to take on
the Hive tech debt issue.
I think we can create an umbrella issue for this and target version 1.16. We
can discuss details and create subtasks there.

Regarding dropping old Hive versions, I'm also fine with that. But I would
like to survey some Hive users first to see whether it's acceptable at this
point. My first thought was that we could deprecate the old Hive versions in
1.15 and discuss dropping them in 1.16 or 1.17.

Best,
Jark


On Thu, 10 Mar 2022 at 14:19, 罗宇侠(莫辞)  wrote:

Thanks Martijn for your insights.

About the tech debt/maintenance with regards to the Hive query syntax, I
would like to chip in; I expect it can be resolved for Flink 1.16.

Best regards,

Yuxia


-- Original Message --
From: Martijn Visser
Sent: Thu Mar 10 04:03:34 2022
To: User
Subject: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

(Forwarding this also to the User mailing list as I made a typo when replying 
to this email thread)

-- Forwarded message -
From: Martijn Visser 
Date: Wed, 9 Mar 2022 at 20:57
Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
To: dev , Francesco Guardiani , 
Timo Walther , 


Hi everyone,

Re: [ANNOUNCE] Apache Flink 1.14.4 released

2022-03-15 Thread Leonard Xu
Thanks a lot, Konstantin, for being our release manager, and thanks to
everyone who was involved!

Best,
Leonard

> On Mar 15, 2022, at 9:34 PM, Martijn Visser wrote:
> 
> Thank you Konstantin and everyone who contributed! 



Re: Interval join operator is not forwarding watermarks correctly

2022-03-15 Thread Alexis Sarda-Espinosa
For completeness, this still happens with Flink 1.14.4.

Regards,
Alexis.


From: Alexis Sarda-Espinosa 
Sent: Friday, March 11, 2022 12:21 AM
To: user@flink.apache.org 
Cc: pnowoj...@apache.org 
Subject: Re: Interval join operator is not forwarding watermarks correctly

I managed to create a reproducible example [1]; I think it's due to the
combination of window + join + window. When I run the test, I never see the
print output, but if I uncomment the part of the code in the watermark
generator that marks it as idle more quickly, it starts working after a while.

[1] https://github.com/asardaes/flink-interval-join-test


Regards,
Alexis.


From: Alexis Sarda-Espinosa 
Sent: Thursday, March 10, 2022 7:47 PM
To: user@flink.apache.org 
Cc: pnowoj...@apache.org 
Subject: RE: Interval join operator is not forwarding watermarks correctly


I found [1] and [2], which are closed, but could be related?



[1] https://issues.apache.org/jira/browse/FLINK-23698

[2] https://issues.apache.org/jira/browse/FLINK-18934



Regards,

Alexis.



From: Alexis Sarda-Espinosa 
Sent: Thursday, 10 March 2022 19:27
To: user@flink.apache.org
Subject: Interval join operator is not forwarding watermarks correctly



Hello,



I’m in the process of updating from Flink 1.11.3 to 1.14.3, and it seems the 
interval join in my pipeline is no longer working. More specifically, I have a 
sliding window after the interval join, and the window isn’t firing. After many 
tests, I ended up creating a custom operator that extends IntervalJoinOperator 
and I overrode processWatermark1() and processWatermark2() to add logs and 
check when they are called. I can see that processWatermark1() isn’t called.



For completeness, this is how I use my custom operator:



joinOperator = new CustomIntervalJoinOperator(…);

stream1.connect(stream2)
    .keyBy(selector1, selector2)
    .transform("Interval Join", TypeInformation.of(Pojo.class), joinOperator);



---



Some more information in case it’s relevant:



- stream2 is obtained from a side output.
- both stream1 and stream2 have watermarks assigned by custom strategies. I
also log watermark creation, and I can see that watermarks are indeed emitted
as expected in both streams.



Strangely, my watermark strategies mark themselves idle if they don’t receive 
new events after 10 minutes, and if I send some events and wait 10 minutes, 
processWatermark1() is called! On the other hand, if I continuously send 
events, it is never called.



Is this a known issue?



Regards,

Alexis.
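
For diagnosing cases like this, one lighter-weight option (a sketch, not from the original thread) is a pass-through ProcessFunction that logs the current watermark, avoiding any subclassing of Flink's internal IntervalJoinOperator:

```
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

// Pass-through that logs the watermark seen at each element. Insert it
// between the interval join and the sliding window to check whether
// watermarks actually advance past the join.
class WatermarkLogger[T] extends ProcessFunction[T, T] {
  override def processElement(
      value: T,
      ctx: ProcessFunction[T, T]#Context,
      out: Collector[T]): Unit = {
    println(s"current watermark: ${ctx.timerService().currentWatermark()}")
    out.collect(value)
  }
}
```

It would be attached with something like joinedStream.process(new WatermarkLogger[Pojo]), where joinedStream is hypothetical here.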




Re: [ANNOUNCE] Apache Flink 1.14.4 released

2022-03-15 Thread Xingbo Huang
Thanks a lot, Konstantin, for being our release manager, and thanks to
everyone who contributed. I have a question about PyFlink: I see that there
are no corresponding wheel packages uploaded to PyPI, only the source package.
Is there something wrong with building the wheel packages?

Best,
Xingbo



Rescaling REST API not working

2022-03-15 Thread Aarsh Shah
Hello,
I tried to call the rescaling API with PATCH to rescale a job automatically,
but it returns 503. Is it deprecated? It is still present in the docs. If it
is deprecated, is there another API through which I can rescale directly?
Since we are not using reactive mode, we need to either come up with some
kind of script or, if there is an API, that would save us time.
Please help.
Thank you.
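
For reference, the documented shape of that call is sketched below (host, port, and job id are placeholders). The rescaling endpoint has been temporarily disabled in some Flink releases, which may be where the 503 comes from:

```
# sketch: trigger a rescale of a running job via the REST API
curl -X PATCH "http://localhost:8081/jobs/<job-id>/rescaling?parallelism=4"
```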


How to convert a Table to DataStream / DataSet via TableEnvironment in batch mode?

2022-03-15 Thread vtygoss
Hi, community!


When dealing with retractable streams, I met a problem converting a Table to
a DataSet / DataStream in batch mode in Flink 1.13.5.


Scenario and process:
- 1. Database CDC to Kafka
- 2. Sync data into Hive with HoodieTableFormat (Apache Hudi)
- 3. Incremental processing of the hoodie table in streaming mode, or full
processing in batch mode.


Because it's more difficult to implement UDAFs / UDTFs on a retractable table
than on a retractable stream, we chose to convert the Table to a DataStream
for processing. But we found that:


- 1. Only BatchTableEnvironment can convert a Table to a DataSet; other
implementations of TableEnvironment throw an exception, as in the code below.
- 2. BatchTableEnvironment will be dropped in 1.14 because it only supports
the old planner.
- 3. The TableEnvironment#create API returns a TableEnvironmentImpl instance;
TableEnvironmentImpl works exclusively with the Table API and can't convert a
Table to a DataSet.


(see org.apache.flink.table.api.bridge.scala.TableConversions.scala)

So, how can a Table be converted to a DataStream in batch mode? Thanks for any
replies or suggestions.




In streaming mode:


```
val tenv: StreamTableEnvironment = StreamTableEnvironment.create(senv, setting)
val table = tenv.sqlQuery("select ...")
val dStream: DataStream[Row] = tenv.toChangelogStream(table)
dStream.map(..)...
```


In batch mode:


```
// TableEnvironment.create actually returns a TableEnvironmentImpl
val tenv: TableEnvironment = TableEnvironment.create(setting)
val table = tenv.sqlQuery("select ...")

// how to convert the Table to a DataSet / DataStream?
val dStream: DataSet[Row] = ?
```
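
One possible route (a sketch, assuming an upgrade to Flink 1.14+, where the DataStream API integration gained batch execution support; not verified against 1.13.5) is to create a StreamTableEnvironment in batch mode, since batch results are insert-only:

```
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.types.Row

val senv = StreamExecutionEnvironment.getExecutionEnvironment
val tenv = StreamTableEnvironment.create(senv, EnvironmentSettings.inBatchMode())

val table = tenv.sqlQuery("select ...")
// batch results are insert-only, so a plain toDataStream suffices
val dStream: DataStream[Row] = tenv.toDataStream(table)
```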


Best Regards!
