Re: Flink Web UI

2024-08-31 Thread Yu Chen
Hi Kartik,

The time after which a completed job expires (and is removed from the web UI) in a
Session Cluster is controlled by the config option `jobstore.expiration-time`[1].
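For reference, a minimal flink-conf.yaml sketch (the value below is only an example; as far as I remember the option is given in seconds and defaults to 3600, i.e. one hour):

```
# flink-conf.yaml -- keep completed jobs visible in the web UI for 24 hours
jobstore.expiration-time: 86400
# optionally raise how many completed jobs the store keeps
jobstore.max-capacity: 100
```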

Best,
Yu Chen

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/deployment/config/#jobstore-expiration-time


From: Kartik Kushwaha 
Sent: August 31, 2024 2:40
To: user@flink.apache.org
Subject: Flink Web UI


I've noticed that completed job details (single completed job) are
being removed from the Flink web UI after a few hours. I haven't
configured an archive directory or history server. Could you please
explain how Flink clears these job details and after what timeframe? I
couldn't find information on the default settings for this behavior.
Anything on this will be helpful.

Thanks.


Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Yu Chen
Congratulations!
Thanks to release managers and everyone involved!

Best,
Yu Chen
 

> On March 19, 2024 at 01:01, Jeyhun Karimov wrote:
> 
> Congrats!
> Thanks to release managers and everyone involved.
> 
> Regards,
> Jeyhun
> 
> On Mon, Mar 18, 2024 at 9:25 AM Lincoln Lee  wrote:
> 
>> The Apache Flink community is very happy to announce the release of Apache
>> Flink 1.19.0, which is the first release for the Apache Flink 1.19 series.
>> 
>> Apache Flink® is an open-source stream processing framework for
>> distributed, high-performing, always-available, and accurate data streaming
>> applications.
>> 
>> The release is available for download at:
>> https://flink.apache.org/downloads.html
>> 
>> Please check out the release blog post for an overview of the improvements
>> for this release:
>> 
>> https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19/
>> 
>> The full release notes are available in Jira:
>> 
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
>> 
>> We would like to thank all contributors of the Apache Flink community who
>> made this release possible!
>> 
>> 
>> Best,
>> Yun, Jing, Martijn and Lincoln
>> 



Re: Flink autoscaler scaling report

2024-01-18 Thread Yu Chen
Hi Yang,

You can run `StandaloneAutoscalerEntrypoint`[1]; the scaling report is then printed 
to the log (at INFO level) by `LoggingEventHandler`[2].
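If the report does not show up, make sure the autoscaler package logs at INFO level; a small sketch, assuming the standard log4j2 properties setup shipped with the operator / standalone distribution:

```
# log4j2 configuration -- ensure autoscaler events, incl. scaling reports, are logged
logger.autoscaler.name = org.apache.flink.autoscaler
logger.autoscaler.level = INFO
```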

[1] 
flink-kubernetes-operator/flink-autoscaler-standalone/src/main/java/org/apache/flink/autoscaler/standalone/StandaloneAutoscalerEntrypoint.java
 at main · apache/flink-kubernetes-operator (github.com) 
<https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler-standalone/src/main/java/org/apache/flink/autoscaler/standalone/StandaloneAutoscalerEntrypoint.java>
[2] 
https://github.com/apache/flink-kubernetes-operator/blob/48df9d35ed55ae8bb513d9153e9f6f668da9e1c3/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/event/LoggingEventHandler.java#L43C18-L43C18

Best,
Yu Chen


> On January 18, 2024 at 18:20, Yang LI wrote:
> 
> Hello dear flink community,
> 
> I noticed that there's a scaling report feature (specifically, the strings
> defined in AutoscalerEventHandler) in the Flink operator autoscaler.
> However, I'm unable to find this information in the Flink operator logs.
> Could anyone guide me on how to access or visualize this scaling report?
> 
> Thanks,
> Yang



Re: How to monitor changes in the existing files using flink 1.17.2

2024-01-09 Thread Yu Chen
Hi Nitin,

In the Flink file system connector, a collection of already-processed file paths is 
kept in state[1] to decide whether a file has been processed or not. 
So if the content of a file is updated but its path stays the same, the file will 
not be reprocessed.

That raises another question: when a file is updated, should all of its content be 
reprocessed?
If the answer is yes, then the easiest way is to have the upstream system write the 
data to a new file.
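To illustrate the reading side of that pattern, here is a minimal FileSource sketch for 1.17 (bucket, path and poll interval are placeholders): the enumerator remembers processed paths in state, so only files that appear under new paths are picked up.

```
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.Duration;

public class ContinuousFileReadJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Scan the directory every 30s; a path already recorded in the enumerator
        // state is never read again, so updates must arrive as new files.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://my-bucket/input/"))
                .monitorContinuously(Duration.ofSeconds(30))
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source")
                .print();

        env.execute("continuous-file-read");
    }
}
```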

If the answer is no, then you need a data source that supports incremental reading; 
consider a data-lake format such as Apache Paimon[2] as an alternative to plain 
files, which gives you much more ability to manipulate the data.

[1] 
https://github.com/apache/flink/blob/cad090aaed770c90facb6edbcce57dd341449a02/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/impl/ContinuousFileSplitEnumerator.java#L62C33-L62C54
[2] Overview | Apache Paimon 
<https://paimon.apache.org/docs/master/concepts/overview/>

Best,
Yu Chen


> On January 10, 2024 at 04:31, Nitin Saini wrote:
> 
> Hi Flink Community,
> 
> I was using Flink 1.12.7's readFile to read files from S3; it was able to detect 
> both newly added files and updates to existing files.
> 
> But now I have migrated to Flink 1.17.2 and am using FileSource to read files 
> from S3; it detects new files being added, but not changes to existing files.
> 
> Is there any way in Flink 1.17.2 to achieve that functionality as well, i.e. to 
> also monitor changes in the existing files, by overriding some classes or by 
> doing something else?
> 
> Thanks & Regards,
> Nitin Saini



Re: oomkill issue

2023-12-04 Thread Yu Chen
Hi Prashant,
Can you describe the steps you use to run `jeprof` in detail?

In my case, I logged into the TaskManager's shell and ran it directly from the 
command line. I am therefore confused to see a curl operation in the error log you 
provided.

Also, it is true that RocksDB's memory control is not perfect and it can exceed the 
managed memory limit; you can refer to the document[1] for more details.

[1] Write buffer manager internals - Google Docs 
<https://docs.google.com/document/d/1_4Brwy2Axzzqu7SJ4hLLl92hVCpeRlVEG-fj8KsTMUo/edit#heading=h.f5wfmsmpemd0>

Best,
Yu Chen

> On December 5, 2023 at 04:42, prashant parbhane wrote:
> 
> Hi Yu,
> 
> Thanks for your reply.
> 
> When I run the script below
> 
> ```
> jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap  > 
> 1009.svg
> ```
> I am getting the error below
> 
> ```
> Gathering CPU profile from http:///pprof/profile?seconds=30 for 30 seconds to
>   /root/jeprof/java.1701718686.
> Be patient...
> Failed to get profile: curl -s --fail --max-time 90 
> 'http:///pprof/profile?seconds=30' > /root/jeprof/.tmp.java.1701718686.: No 
> such file or directory
> ```
> Any input on this?
> 
> However, the OOMKill was resolved with the RocksDB configurations below:
> • "state.backend.rocksdb.memory.managed": "false", 
> "state.backend.rocksdb.block.cache-size": "10m", 
> "state.backend.rocksdb.writebuffer.size": "128m",
> "state.backend.rocksdb.writebuffer.count": "134217728"
> "state.backend.rocksdb.ttl.compaction.filter.enabled":"true"
> 
> 
> Thanks,
> Prashant
> 
> On Mon, Nov 27, 2023 at 7:11 PM Xuyang  wrote:
> Hi, Prashant.
> I think Yu Chen has given a professional troubleshooting approach. Another thing 
> I want to ask is whether you use a user-defined function that stores objects. 
> You could first dump the memory and look at the details to check for memory leaks.
> 
> --
> Best!
> Xuyang
> On 2023-11-28 09:12:01, "Yu Chen" wrote:
> Hi Prashant,
> 
> OOMKill is mostly caused by the working-set memory exceeding the pod limit. 
> First, increase the JVM overhead memory with the following parameters and 
> observe whether that solves the problem.
> ```
> taskmanager.memory.jvm-overhead.max=1536m
> taskmanager.memory.jvm-overhead.min=1536m
> ```
> 
> And if the OOMKill still occurs, we should suspect an off-heap memory leak in 
> the task.
> One of the most popular tools, jemalloc, is recommended. You have to install 
> jemalloc in the image according to the document[1].
> After that, you can enable jemalloc profiling by setting an environment variable 
> for the taskmanager:
> ```
> containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out
>   ```
> After the job has run for a while, you can log into the Taskmanager and generate 
> SVG files to inspect the off-heap memory distribution.
> ```
> jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap  > 
> 1009.svg
> ```
> 
> Otherwise, if the OOMKill no longer occurs but you hit 'GC overhead limit 
> exceeded', then you should dump the heap to find out which objects are taking up 
> so much of the memory.
> Here is the command for you.
> ```
> jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
> ```
> 
> [1] Using jemalloc to Optimize Memory Allocation — Sentieon Appnotes 
> 202308.01 documentation
> 
> Best,
> Yu Chen
> From: prashant parbhane 
> Sent: November 28, 2023 1:42
> To: user@flink.apache.org 
> Subject: oomkill issue
> 
> Hello,
> 
> We have been facing this OOMKill issue, where task managers are getting 
> restarted with this error.
> I am seeing memory consumption increasing in a linear manner; I have given 
> memory and CPU as high as possible but am still facing the same issue.
> 
> We are using RocksDB for the state backend. Is there a way to find which 
> operator is causing this issue, or which operator uses the most memory? Any 
> good practices we can follow? We are using broadcast state.
> 
> Thanks,
> Prashant



Re: oomkill issue

2023-11-27 Thread Yu Chen
Hi Prashant,

OOMKill is mostly caused by the working-set memory exceeding the pod limit.
First, increase the JVM overhead memory with the following parameters and observe 
whether that solves the problem.
```
taskmanager.memory.jvm-overhead.max=1536m
taskmanager.memory.jvm-overhead.min=1536m
```

And if the OOMKill still occurs, we should suspect an off-heap memory leak in the 
task.
One of the most popular tools, jemalloc, is recommended. You have to install 
jemalloc in the image according to the document[1].
After that, you can enable jemalloc profiling by setting an environment variable 
for the taskmanager:
```

containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out

```
After the job has run for a while, you can log into the Taskmanager and generate 
SVG files to inspect the off-heap memory distribution.
```
jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap  > 
1009.svg
```

Otherwise, if the OOMKill no longer occurs but you hit 'GC overhead limit exceeded', 
then you should dump the heap to find out which objects are taking up so much of 
the memory.
Here is the command for you.
```
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
```
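If you are unsure of the TaskManager's JVM PID inside the container, a quick sketch (assuming `jps` is available in the image; the TaskManager main class is `TaskManagerRunner`):
```
# find the TaskManager JVM process id, then dump its live heap
jps -l | grep TaskManagerRunner
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
```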

[1] Using jemalloc to Optimize Memory Allocation ― Sentieon Appnotes 202308.01 
documentation<https://support.sentieon.com/appnotes/jemalloc/>

Best,
Yu Chen

From: prashant parbhane 
Sent: November 28, 2023 1:42
To: user@flink.apache.org 
Subject: oomkill issue

Hello,

We have been facing this OOMKill issue, where task managers are getting 
restarted with this error.
I am seeing memory consumption increasing in a linear manner; I have given 
memory and CPU as high as possible but am still facing the same issue.

We are using RocksDB for the state backend. Is there a way to find which 
operator is causing this issue, or which operator uses the most memory? Any good 
practices we can follow? We are using broadcast state.

Thanks,
Prashant


Re: Operator ids

2023-11-26 Thread Yu Chen
Hi rania,

Through the following REST API you can get vertex-level metrics (a vertex 
corresponds to a chain of operators).

GET http://localhost:8081/jobs//vertices/
Note that the vertex ids can be obtained from GET http://localhost:8081/jobs/
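A minimal sketch of pulling per-vertex record counts with curl (the job id below is hypothetical, and `jq` is only used for readability; field names follow the job-details response as far as I remember):
```
JOB_ID=ea632d67b7d595e5b851708ae9ad79d6   # hypothetical job id

# list the vertices (operator chains) of the job with their ids, names and aggregated I/O metrics
curl -s "http://localhost:8081/jobs/$JOB_ID" | jq '.vertices[] | {id, name, metrics}'

# per-subtask details (incl. read/write record counts) for one vertex
curl -s "http://localhost:8081/jobs/$JOB_ID/vertices/<vertex-id>" | jq '.subtasks[].metrics'
```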

However, there is no interface for getting operator-level metrics yet.
I was planning to add such an interface; you can follow the ticket 
FLINK-33230[1].

[1] [FLINK-33230] Support Expanding ExecutionGraph to StreamGraph in Web UI - 
ASF JIRA (apache.org)<https://issues.apache.org/jira/browse/FLINK-33230>

Best,
Yu Chen

From: rania duni 
Sent: November 27, 2023 0:25
To: Yu Chen 
Subject: Re: Operator ids

Thank you for answering! I want the operator ids so I can get the "records out" 
metric when data is split within a task. For my thesis I am developing a scaling 
algorithm, so I need this metric.

On Nov 26, 2023, at 3:01 PM, Yu Chen wrote:


Hi rania,

If you mean the job vertex IDs of the JobGraph, you can try this:
http://localhost:8081/jobs/

Best,
Yu Chen

From: Zhanghao Chen 
Sent: November 26, 2023 11:02
To: rania duni ; user@flink.apache.org
Subject: Re: Operator ids

It is not supported yet. Out of curiosity, why do you need the operator IDs? They 
are usually only used internally.

Best,
Zhanghao Chen

From: rania duni 
Sent: Saturday, November 25, 2023 20:44
To: user@flink.apache.org 
Subject: Operator ids

Hello!

I would like to know how I can get the operator ids of a running job. I know how 
to get the task ids, but I want the operator ids! I couldn't find anything about 
this in the REST API docs.
Thank you.


Re: Doubts about state and table API

2023-11-26 Thread Yu Chen
Hi Oscar,

The operator IDs of a SQL job are generated by `StreamingJobGraphGenerator` and 
depend on the topology of the stream graph.
If you would like to confirm whether the problem was caused by changed operator 
ids, please remove --allowNonRestoredState; the restore will then fail with an 
exception that names the operator id which could not be mapped.
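A rough sketch of what that looks like with the CLI (savepoint path and jar name are placeholders); without -n / --allowNonRestoredState the submission fails fast and the exception names the unmatched operator id:
```
# restore WITHOUT tolerating dropped state, so unmapped operator state surfaces as an error
flink run -s s3://my-bucket/savepoints/savepoint-123abc ./enriched-activity-job.jar
```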

However, losing operator state would normally only produce some erroneous 
results; it would not by itself explain being `not able to return any row`. It 
would be better to provide the logs from after the restore to locate the problem 
more precisely.

Best,
Yu Chen

From: Oscar Perez via user 
Sent: November 25, 2023 0:08
To: Oscar Perez via user 
Subject: Doubts about state and table API

Hi,

We are having a job in production where we use table API to join multiple 
topics. The query looks like this:


SELECT *
FROM topic1 AS t1
JOIN topic2 AS t2 ON t1.userId = t2.userId
JOIN topic3 AS t3 ON t1.userId = t3.accountUserId


This works and produces an EnrichedActivity any time any of the topics receives 
a new event, which is what we expect. This SQL query is linked to a processor 
function and the processElement gets triggered whenever a new EnrichedActivity 
occurs

We have experienced an issue a couple of times in production where we have 
deployed a new version from savepoint and then suddenly we stopped receiving 
EnrichedActivities in the process function.

Our assumption is that this is related to the table API state and that some 
operators are lost from going from one savepoint to new deployment.

Let me illustrate with one example:

version A of the job is deployed
version B of the job is deployed

in version B the UID of some Table API operators changes, and this operator's 
state is dropped when deploying version B as it cannot be mapped (we have 
--allowNonRestoredState enabled)

The state for the Table API stores both the committed offsets and the contents of 
the topics, but only the contents are lost while the committed offsets are 
retained.

Therefore, when performing the join, the query is not able to return any row 
because it cannot get data from topic2 or topic3.

Can this be the case?
We are having a hard time trying to understand how the table api and state 
works internally so any help in this regard would be truly helpful!

Thanks,
Oscar




Re: Operator ids

2023-11-26 Thread Yu Chen
Hi rania,

If you mean the job vertex IDs of the JobGraph, you can try this:
http://localhost:8081/jobs/

Best,
Yu Chen

From: Zhanghao Chen 
Sent: November 26, 2023 11:02
To: rania duni ; user@flink.apache.org
Subject: Re: Operator ids

It is not supported yet. Out of curiosity, why do you need the operator IDs? They 
are usually only used internally.

Best,
Zhanghao Chen

From: rania duni 
Sent: Saturday, November 25, 2023 20:44
To: user@flink.apache.org 
Subject: Operator ids

Hello!

I would like to know how I can get the operator ids of a running job. I know how 
to get the task ids, but I want the operator ids! I couldn't find anything about 
this in the REST API docs.
Thank you.


Re: state.checkpoints.num-retained is set to 1, but the number of checkpoints on the TaskManager keeps growing and using up space; how can this be resolved?

2023-11-07 Thread Yu Chen
Hi Jiaxian,

That is not the expected behavior. Was the job manually cancelled at some point? In 
that case, Flink's default behavior is RETAIN_ON_CANCELLATION, and the retained 
checkpoints have to be cleaned up manually.
If you want checkpoints to be cleaned up after the job is cancelled, consider 
adjusting the option execution.checkpointing.externalized-checkpoint-retention[1].
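A minimal flink-conf.yaml sketch of the two relevant options (values are illustrative):
```
# keep only the most recent completed checkpoint in checkpoint storage
state.checkpoints.num-retained: 1
# delete externalized checkpoints when the job is cancelled
execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
```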


[1] 
http://stream.devops.sit.xiaohongshu.com/docs/red/docs/deployment/config/#execution-checkpointing-externalized-checkpoint-retention

Best,
Yu Chen


> On November 8, 2023 at 13:08, Liang Jiaxian wrote:
> 
> Hi, let me correct my question: it is the number of checkpoints on the 
> taskmanager that keeps growing and using up disk space. Some additional 
> information:
> I mounted the task manager's checkpoint path locally and inspected it with 
> `du -h`; the job keeps adding new chk directories, so disk usage keeps growing, 
> as shown in the screenshot below.
> My question is: how can I delete these historical chk files?
> [screenshot attachment]
> 
> 
> 
> 
> 
> Liang Jiaxian
> Shenzhen Urban Transport Planning Center Co., Ltd. / Data Model Center
>  
>  
>  
> -- Original --
> From: "Yu Chen"
> Date: Wed, Nov 8, 2023 12:59 PM
> To: "Liang Jiaxian"
> Cc: "user"
> Subject: Re: state.checkpoints.num-retained is set to 1, but the number of 
> checkpoints on the TaskManager keeps growing and using up space; how can this 
> be resolved?
>  
> Hi Jiaxian,
> 
> Regarding the checkpoint history shown in the Flink web UI: the 
> state.checkpoints.num-retained option controls how many checkpoints are kept in 
> checkpoint storage, and Flink deletes older checkpoint files from checkpoint 
> storage on a rolling basis, but the history entries in the web UI are not removed 
> in this process (you can verify this via the path recorded for each checkpoint 
> entry).
> Also, if you use the heap state backend, the state lives in memory and checkpoints 
> are flushed to files. Growing memory usage is most likely caused by the job itself 
> rather than by historical checkpoints (for example, a global window aggregation 
> without a state TTL); locating the cause of the memory growth would require more 
> information about the job.
> In addition, if you want to confirm whether the option has taken effect, you can 
> check it in the Configuration tab of the JobManager.
> 
> Best,
> Yu Chen
> 
> > On November 8, 2023 at 11:56, Liang Jiaxian wrote:
> > 
> > Hello,
> > We are on Flink 1.14, with the jobmanager and taskmanager running as two 
> > separate Docker containers; the docker-compose.yml is shown in figure 1 below.
> > In the configuration, state.checkpoints.num-retained is set to 1, but the web 
> > UI shows the number of checkpoints continuously growing (figure 2 below), and 
> > the number of checkpoints in the taskmanager container also keeps growing. How 
> > can these historical checkpoints be cleaned up?
> > [screenshot attachments: figure 1 (docker-compose.yml), figure 2 (checkpoints)]
> 



Re: state.checkpoints.num-retained is set to 1, but the number of checkpoints on the TaskManager keeps growing and using up space; how can this be resolved?

2023-11-07 Thread Yu Chen
Hi Jiaxian,

Regarding the checkpoint history shown in the Flink web UI: the 
state.checkpoints.num-retained option controls how many checkpoints are kept in 
checkpoint storage, and Flink deletes older checkpoint files from checkpoint 
storage on a rolling basis, but the history entries in the web UI are not removed 
in this process (you can verify this via the path recorded for each checkpoint 
entry).
Also, if you use the heap state backend, the state lives in memory and checkpoints 
are flushed to files. Growing memory usage is most likely caused by the job itself 
rather than by historical checkpoints (for example, a global window aggregation 
without a state TTL); locating the cause of the memory growth would require more 
information about the job.
In addition, if you want to confirm whether the option has taken effect, you can 
check it in the Configuration tab of the JobManager.

Best,
Yu Chen

> On November 8, 2023 at 11:56, Liang Jiaxian wrote:
> 
> Hello,
> We are on Flink 1.14, with the jobmanager and taskmanager running as two 
> separate Docker containers; the docker-compose.yml is shown in figure 1 below.
> In the configuration, state.checkpoints.num-retained is set to 1, but the web 
> UI shows the number of checkpoints continuously growing (figure 2 below), and 
> the number of checkpoints in the taskmanager container also keeps growing. How 
> can these historical checkpoints be cleaned up?
> [screenshot attachments: figure 1 (docker-compose.yml), figure 2 (checkpoints)]



Re: Error in /jars/upload curl request

2023-11-07 Thread Yu Chen
Hi Tauseef,

That really depends on the environment you're actually running in. But I'm 
guessing you're using an Ingress to route your requests to the JobManager pod. 
If so, I'd suggest you adjust the value of 
nginx.ingress.kubernetes.io/proxy-body-size.

Following is an example.
```
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-body-size: 250m # change this
  name: xxx
  namespace: xxx
spec:
  rules:
  - host: flink-nyquist.hvreaning.com
    http:
      paths:
      - backend:
          serviceName: xxx
          servicePort: 8081
```
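Alternatively, if you prefer patching the existing Ingress over re-applying the manifest, something like this should work (namespace and ingress name are placeholders):
```
kubectl -n <namespace> annotate ingress <ingress-name> \
  nginx.ingress.kubernetes.io/proxy-body-size=250m --overwrite
```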

Please let me know if there are any other problems.

Best,
Yu Chen

> On November 7, 2023 at 18:40, Tauseef Janvekar wrote:
> 
> Hi Chen,
> 
> We are not using nginx anywhere on the server (Kubernetes cluster) or on my 
> client (my local machine).
> Not sure how to proceed on this.
> 
> Thanks,
> Tauseef
> 
> On Tue, 7 Nov 2023 at 13:36, Yu Chen  wrote:
> Hi Tauseef,
> 
> The error was caused by the nginx configuration and was not a flink problem.
> 
> You can find many related solutions on the web [1].
> 
> Best,
> Yu Chen
> 
> [1] 
> https://stackoverflow.com/questions/24306335/413-request-entity-too-large-file-upload-issue
> 
>> On November 7, 2023 at 15:14, Tauseef Janvekar wrote:
>> 
>> Hi Chen,
>> 
>> Now I get a different error message.
>> root@R914SK4W:~/learn-building-flink-applications-in-java-exercises/exercises# curl -X POST -H "Expect:" -F "jarfile=@./target/travel-itinerary-0.1.jar" https://flink-nyquist.hvreaning.com/jars/upload
>> 
>> 413 Request Entity Too Large
>> nginx
>> 
>> 
>> 
>> Thanks
>> Tauseef
>> 
>> On Tue, 7 Nov 2023 at 06:19, Chen Yu  wrote:
>>  Hi Tauseef,
>> 
>> Adding an @ sign before the path will resolve your problem. 
>> And I verified that both web and postman upload the jar file properly on the 
>> master branch code. 
>> If you are still having problems then you can provide some more detailed 
>> information.
>> 
>> Here is the relevant excerpt from `man curl`:
>> 
>>    -F, --form
>>           (HTTP SMTP IMAP) For HTTP protocol family, this lets curl emulate a filled-in form in which a user has
>>           pressed the submit button. This causes curl to POST data using the Content-Type multipart/form-data
>>           according to RFC 2388.
>> 
>>           For SMTP and IMAP protocols, this is the means to compose a multipart mail message to transmit.
>> 
>>           This enables uploading of binary files etc. To force the 'content' part to be a file, prefix the file
>>           name with an @ sign. To just get the content part from a file, prefix the file name with the symbol <.
>>           The difference between @ and < is then that @ makes a file get attached in the post as a file upload,
>>           while the < makes a text field and just get the contents for that text field from a file.
>> 
>> 
>> Best,
>> Yu Chen
>> From: Tauseef Janvekar 
>> Sent: November 6, 2023 22:27
>> To: user@flink.apache.org 
>> Subject: Error in /jars/upload curl request
>> 
>> I am using a curl request to upload a jar but it throws the error below:
>> 
>> 
>> Received unknown attribute jarfile.
>> 
>> Not sure what is wrong here. I am following the standard documentation
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/
>> 
>> Please let me know if I have to use some other command to upload a jar using 
>> the "/jars/upload" endpoint.
>> 
>> I also tried to upload using the web UI, but it hangs continuously and only calls 
>> the GET API with a 200 success: https://flink-nyquist.hvreaning.com/jars
>> 
>> Thanks,
>> Tauseef
> 



Re: Handling Schema Variability and Applying Regex Patterns in Flink Job Configuration

2023-11-07 Thread Yu Chen
Hi Arjun,

As stated in the document, 'This regex pattern should be matched with the 
absolute file path.'
Therefore, you should adjust your regular expression to match absolute paths.

Please let me know if there are any other problems.

Best,
Yu Chen

> On November 7, 2023 at 18:11, arjun s wrote:
> 
> Hi Chen,
> I attempted to configure the 'source.path.regex-pattern' property in the 
> table settings as '^customer.*' to ensure that the Flink job only processes 
> file names starting with "customer" in the specified directory. However, it 
> appears that this configuration is not producing the expected results. Are 
> there any additional configurations or adjustments that need to be made? The 
> table script I used is as follows:
> CREATE TABLE sample (
>   col1 STRING,
>   col2 STRING,
>   col3 STRING,
>   col4 STRING,
>   `file.path` STRING NOT NULL METADATA
> ) WITH (
>   'connector' = 'filesystem',
>   'path' = 'file:///home/techuser/inputdata',
>   'format' = 'csv',
>   'source.path.regex-pattern' = '^customer.*',
>   'source.monitor-interval' = '1'
> )
> Thanks in advance,
> Arjun
> 
> On Mon, 6 Nov 2023 at 20:56, Chen Yu  wrote:
> Hi Arjun,
> 
> If you can filter files by a regex pattern, I think the config 
> `source.path.regex-pattern`[1] may be what you want.
> 
>   'source.path.regex-pattern' = '...',  -- optional: regex pattern to filter files to read under the
>                                         -- directory of the `path` option. This regex pattern should be
>                                         -- matched with the absolute file path. If this option is set,
>                                         -- the connector will recursively read all files under the
>                                         -- directory of the `path` option
> 
> Best,
> Yu Chen
> 
> 
> [1] 
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/connectors/table/filesystem/
> 
> From: arjun s 
> Sent: November 6, 2023 20:50
> To: user@flink.apache.org 
> Subject: Handling Schema Variability and Applying Regex Patterns in Flink Job Configuration
> 
> Hi team,
> I'm currently utilizing the Table API function within my Flink job, with the 
> objective of reading records from CSV files located in a source directory. To 
> obtain the file names, I'm creating a table and specifying the schema using 
> the Table API in Flink. Consequently, when the schema matches, my Flink job 
> successfully submits and executes as intended. However, in cases where the 
> schema does not match, the job fails to submit. Given that the schema of the 
> files in the source directory is unpredictable, I'm seeking a method to 
> handle this situation.
> Create table query
> =
> CREATE TABLE sample (col1 STRING, col2 STRING, col3 STRING, col4 STRING, `file.path` STRING NOT NULL METADATA) 
> WITH ('connector' = 'filesystem', 'path' = 'file:///home/techuser/inputdata', 'format' = 'csv', 'source.monitor-interval' = '1')
> =
> 
> Furthermore, I have a question about whether there's a way to read files from 
> the source directory based on a specific regex pattern. This is relevant in 
> our situation because only file names that match a particular pattern need to 
> be processed by the Flink job.
> 
> Thanks and Regards,
> Arjun



Re: Error in /jars/upload curl request

2023-11-07 Thread Yu Chen
Hi Tauseef,

The error is caused by the nginx configuration; it is not a Flink problem.

You can find many related solutions on the web [1].

Best,
Yu Chen

[1] 
https://stackoverflow.com/questions/24306335/413-request-entity-too-large-file-upload-issue

> On November 7, 2023 at 15:14, Tauseef Janvekar wrote:
> 
> Hi Chen,
> 
> Now I get a different error message.
> root@R914SK4W:~/learn-building-flink-applications-in-java-exercises/exercises#
>  curl -X POST -H "Expect:" -F "jarfile=@./target/travel-i
> tinerary-0.1.jar" https://flink-nyquist.hvreaning.com/jars/upload
> 
> 413 Request Entity Too Large
> 
> 413 Request Entity Too Large
> nginx
> 
> 
> 
> Thanks
> Tauseef
> 
> On Tue, 7 Nov 2023 at 06:19, Chen Yu wrote:
>>  Hi Tauseef,
>> 
>> Adding an @ sign before the path will resolve your problem. 
>> And I verified that both web and postman upload the jar file properly on the 
>> master branch code. 
>> If you are still having problems then you can provide some more detailed 
>> information.
>> 
>> Here is the relevant excerpt from `man curl`:
>> 
>>    -F, --form
>>           (HTTP SMTP IMAP) For HTTP protocol family, this lets curl emulate a filled-in form in which a user has
>>           pressed the submit button. This causes curl to POST data using the Content-Type multipart/form-data
>>           according to RFC 2388.
>> 
>>           For SMTP and IMAP protocols, this is the means to compose a multipart mail message to transmit.
>> 
>>           This enables uploading of binary files etc. To force the 'content' part to be a file, prefix the file
>>           name with an @ sign. To just get the content part from a file, prefix the file name with the symbol <.
>>           The difference between @ and < is then that @ makes a file get attached in the post as a file upload,
>>           while the < makes a text field and just get the contents for that text field from a file.
>> 
>> 
>> Best,
>> Yu Chen
>> From: Tauseef Janvekar 
>> Sent: November 6, 2023 22:27
>> To: user@flink.apache.org 
>> Subject: Error in /jars/upload curl request
>>  
>> I am using a curl request to upload a jar but it throws the error below:
>> 
>> 
>> Received unknown attribute jarfile.
>> 
>> Not sure what is wrong here. I am following the standard documentation
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/
>> 
>> Please let me know if I have to use some other command to upload a jar using 
>> the "/jars/upload" endpoint.
>> 
>> I also tried to upload using the web UI, but it hangs continuously and only calls 
>> the GET API with a 200 success: https://flink-nyquist.hvreaning.com/jars
>> 
>> Thanks,
>> Tauseef



Re: Inquiry about ActiveResourceManager and StandaloneResourceManager in Flink

2023-11-03 Thread Yu Chen
Hi Steven,

As stated in the `StandaloneResourceManager` comments, that manager does not
acquire new resources; the user has to start the TaskManagers manually
themselves.
`ActiveResourceManager`, on the other hand, requests and releases resources on
demand (that is what "active" means) on top of a resource framework such as
YARN or Kubernetes.
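As a concrete illustration (the example job ships with the standard distribution; cluster details are placeholders): with a standalone cluster you add TaskManagers by hand, whereas on YARN the ActiveResourceManager asks the framework for them on demand.

```
# Standalone (StandaloneResourceManager): TaskManagers are started manually
./bin/start-cluster.sh
./bin/taskmanager.sh start        # run again for every additional TaskManager you want

# YARN application mode (ActiveResourceManager): TaskManager containers are
# requested from YARN on demand as the job needs slots
./bin/flink run-application -t yarn-application ./examples/streaming/TopSpeedWindowing.jar
```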

As we know, different users have different environments in production, and
not all of them want to run on YARN or Kubernetes (especially for local
debugging, a standalone cluster is very convenient).
Therefore, Flink provides these two different resource managers to cover the
different usage scenarios.

Please feel free to correct me if there are any misunderstandings.

Best regards,
Yu Chen

Steven Chen wrote on Friday, November 3, 2023 at 13:28:

> Dear Flink Community,
>
>
> I am currently using Flink for my project and have a question regarding
> ActiveResourceManager and StandaloneResourceManager.
>
> What does "active" mean in ActiveResourceManager and why is
> StandaloneResourceManager not considered an active resource manager?
>
>
> Thank you for your time and assistance.
>
>
> Best regards,
> Steven Chen
>