[jira] [Created] (FLINK-21854) WebUI shows incorrect value
junbiao chen created FLINK-21854: Summary: WebUI shows incorrect value Key: FLINK-21854 URL: https://issues.apache.org/jira/browse/FLINK-21854 Project: Flink Issue Type: Bug Components: Project Website Affects Versions: 1.12.1 Reporter: junbiao chen Attachments: image-2021-03-18-14-50-49-546.png !image-2021-03-18-14-50-49-546.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21853) Running HA per-job cluster (rocks, non-incremental) end-to-end test could not finish in 900 seconds
Guowei Ma created FLINK-21853: Summary: Running HA per-job cluster (rocks, non-incremental) end-to-end test could not finish in 900 seconds Key: FLINK-21853 URL: https://issues.apache.org/jira/browse/FLINK-21853 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Runtime / State Backends Affects Versions: 1.11.3 Reporter: Guowei Ma

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14921&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873&l=7318

{code:java}
Waiting for text Completed checkpoint [1-9]* for job to appear 2 of times in logs...
grep: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT/log/*standalonejob-2*.log: No such file or directory
grep: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT/log/*standalonejob-2*.log: No such file or directory
grep: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT/log/*standalonejob-2*.log: No such file or directory
Starting standalonejob daemon on host fv-az232-135.
grep: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT/log/*standalonejob-2*.log: No such file or directory
Killed TM @ 15744
Killed TM @ 19625
Test (pid: 9232) did not finish after 900 seconds.
{code}
[jira] [Created] (FLINK-21852) Introduce SupportsAnyNull to BinaryRowData
Jingsong Lee created FLINK-21852: Summary: Introduce SupportsAnyNull to BinaryRowData Key: FLINK-21852 URL: https://issues.apache.org/jira/browse/FLINK-21852 Project: Flink Issue Type: Improvement Reporter: Jingsong Lee Fix For: 1.13.0 We should avoid force-casting to the implementation; it is better to rely on an interface.
[jira] [Created] (FLINK-21851) Refactor BinaryRowDataKeySelector in testing
Jingsong Lee created FLINK-21851: Summary: Refactor BinaryRowDataKeySelector in testing Key: FLINK-21851 URL: https://issues.apache.org/jira/browse/FLINK-21851 Project: Flink Issue Type: Improvement Reporter: Jingsong Lee Assignee: Jingsong Lee Fix For: 1.13.0 We can introduce a {{HandwrittenKeySelectorUtil}} to replace BinaryRowDataKeySelector newInstance to reduce misunderstanding.
[jira] [Created] (FLINK-21850) Improve document and config description of sort-merge blocking shuffle
Yingjie Cao created FLINK-21850: Summary: Improve document and config description of sort-merge blocking shuffle Key: FLINK-21850 URL: https://issues.apache.org/jira/browse/FLINK-21850 Project: Flink Issue Type: Sub-task Components: Documentation Reporter: Yingjie Cao Fix For: 1.13.0 After the improvement of FLINK-19614, some of the previous documentation for sort-merge blocking shuffle is no longer accurate; we need to improve the corresponding documentation.
[jira] [Created] (FLINK-21849) 'SHOW MODULES' tests for CliClientITCase lack the default module test case
Nicholas Jiang created FLINK-21849: Summary: 'SHOW MODULES' tests for CliClientITCase lack the default module test case Key: FLINK-21849 URL: https://issues.apache.org/jira/browse/FLINK-21849 Project: Flink Issue Type: Improvement Components: Table SQL / Client Affects Versions: 1.13.0 Reporter: Nicholas Jiang Fix For: 1.13.0 Currently the 'SHOW MODULES' command tests for CliClientITCase lack a test case for the default module, which is the core module.
[jira] [Created] (FLINK-21848) Table API: window opening for event time throws NullPointerException
wangbin created FLINK-21848: Summary: Table API: window opening for event time throws NullPointerException Key: FLINK-21848 URL: https://issues.apache.org/jira/browse/FLINK-21848 Project: Flink Issue Type: Bug Components: Table SQL / API Reporter: wangbin Attachments: image-2021-03-18-09-14-21-726.png, image-2021-03-18-09-14-38-073.png, image-2021-03-18-09-14-52-536.png * Following the code on the official website, I wrote an example of opening a window for event time, but it throws a NullPointerException. * _I don't know why._ * _!image-2021-03-18-09-14-21-726.png!_ _!image-2021-03-18-09-14-38-073.png!_ _!image-2021-03-18-09-14-52-536.png!_
Re: Flink job cannot find recover path after using entropy injection for s3 file systems
Thanks for checking, Till. I have a follow up question for #2, do you know why the same job cannot show up at the entropy checkpoint in Version 1.9. For example: *When it's running in v1.11, checkpoint path is: * s3a://{bucket name}/dev/checkpoints/_entropy_/{job_id}/chk-1537 *When it's running in v1.9, checkpoint path is: * s3a://{bucket name}/dev/checkpoints/{job_id}/chk-2230 Not sure which caused this inconsistency issue. Thanks Best regards Rainie On Wed, Mar 17, 2021 at 6:38 AM Till Rohrmann wrote: > Hi Rainie, > > 1. I think what you need to do is to look for the {job_id} in all the > possible sub folders of the dev/checkpoints/ folder or you extract the > entropy from the logs. > > 2. According to [1] entropy should only be used for the data files and not > for the metadata files. The idea was to keep the metadata path entropy free > in order to make it more easily discoverable. I can imagine that this > changed with FLINK-5763 [2] which was added in Flink 1.11. This effectively > means that in order to make checkpoints/savepoints self contained we needed > to add the entropy also to the metadata file paths. Moreover, this also > means that the entropy injection works for 1.9 and 1.11. I think it was > introduced with Flink 1.6.2, 1.7.0 [3]. > > [1] > > https://ci.apache.org/projects/flink/flink-docs-stable/deployment/filesystems/s3.html#entropy-injection-for-s3-file-systems > [2] https://issues.apache.org/jira/browse/FLINK-5763 > [3] https://issues.apache.org/jira/browse/FLINK-9061 > > Cheers, > Till > > On Tue, Mar 16, 2021 at 7:03 PM Rainie Li > wrote: > > > Hi Flink Developers. > > > > We enabled entropy injection for s3, here is our setting on Yarn Cluster. > > s3.entropy.key: _entropy_ > > s3.entropy.length: 1 > > state.checkpoints.dir: 's3a://{bucket name}/dev/checkpoints/_entropy_' > > > > I have two questions: > > 1. 
After enabling entropy, job's checkpoint path changed to: > > *s3://{bucket name}/dev/checkpoints/_entropy_/{job_id}chk-607* > > Since we don't know which key is mapped to _entropy_ > > It cannot be used to relaunch flink jobs by running > > *flink run -s **s3://{bucket > > name}/dev/checkpoints/_entropy_/{job_id}chk-607* > > If you also enabled entropy injection for s3, any suggestion how to > recover > > failed jobs using entropy checkpoints? > > > > 2. We added entropy settings on the Yarn cluster. > > But we can only see flink jobs in version 1.11 shows the entropy > checkpoint > > path. > > For flink jobs version 1.9, they are still using checkpoint paths without > > entropy like: > > *s3://{bucket name}/dev/checkpoints/{job_id}/chk-607* > > Is this path equal to s3://*{bucket name}* > > */dev/checkpoints/_entropy_/{job_id}**chk-607?* > > Does entropy work for v1.9? If so, why does v1.9 job show checkpoint > paths > > *without* entropy? > > > > Appreciated any suggestions. > > Thanks > > Best regards > > Rainie > > >
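One workaround for question 1, until the entropy key for a run is known: with `s3.entropy.length: 1` the `_entropy_` placeholder is replaced by a single random character, so the job's checkpoints sit exactly one directory level below the configured base path, and one can scan those subdirectories for the job id. The sketch below does this against a local mirror of the bucket layout; the helper name is made up, and the local-filesystem stand-in is an assumption (against real S3 you would list key prefixes with an S3 client such as boto3 instead of using `glob`):

```python
import glob
import os

def find_checkpoint_path(base_dir, job_id, chk):
    """Search every entropy subdirectory of base_dir for the given job id
    and checkpoint number, returning the first match or None.

    With 's3.entropy.length: 1' the '_entropy_' placeholder becomes a
    single character, so checkpoints live one level below the base path.
    """
    pattern = os.path.join(base_dir, "*", job_id, f"chk-{chk}")
    matches = glob.glob(pattern)
    return matches[0] if matches else None
```

The returned path can then be passed to `flink run -s` to restore from the checkpoint.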
Re: flink 1.11 class loading question
Hi Chen, You are right that Flink changed its memory model with Flink 1.10. Now the memory model is better defined and stricter. You can find information about it here [1]. For some pointers towards potential problems please take a look at [2]. What you need to do is to figure out where the non-heap memory is allocated. Maybe you are using a library which leaks some memory. Maybe your code requires more non-heap memory than you have configured the system with. If you are using the per-job mode on Yarn without yarn.per-job-cluster.include-user-jar: disabled, then you should not have any classloader leak problems because the user code should be part of the system classpath. If you set yarn.per-job-cluster.include-user-jar: disabled, then the TaskExecutor will create a user code class loader and keep it as long as the TaskExecutor still has some slots allocated for the job. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup_tm.html [2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_trouble.html Cheers, Till On Sun, Mar 14, 2021 at 12:22 AM Chen Qin wrote: > Hi there, > > We were using flink 1.11.2 in production with a large setting. The job > runs fine for a couple of days and ends up with a restart loop caused by > YARN container memory kill. This is not observed while running against > 1.9.1 with the same setting. 
> Here is JVM environment passed to 1.11 as well as 1.9.1 job > > > env.java.opts.taskmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500 >> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 >> -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1 >> -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails >> -XX:+PrintGCApplicationStoppedTime -Xloggc:/gc.log' >> env.java.opts.jobmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500 >> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 >> -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1 >> -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails >> -XX:+PrintGCApplicationStoppedTime -Xloggc:/gc.log' >> > > After primitive investigation, we found this might not be related to jvm > heap space usage nor gc issue. Meanwhile, we observed jvm non heap usage on > some containers keep rising while job fails into restart loop as stated > below. > [image: image.png] > > From a configuration perspective, we would like to learn how the task > manager handles classloading and (unloading?) when we set include-user-jar > to first. Is there suggestions how we can have a better understanding of > how the new memory model introduced in 1.10 affects this issue? > > > cluster.evenly-spread-out-slots: true > zookeeper.sasl.disable: true > yarn.per-job-cluster.include-user-jar: first > yarn.properties-file.location: /usr/local/hadoop/etc/hadoop/ > > > Thanks, > Chen > >
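For reference, the 1.10+ memory model splits the container budget into explicit components, so a non-heap leak (e.g. metaspace held by lingering user-code classloaders) eventually exhausts its component budget and triggers the YARN container kill. A sketch of the flink-conf.yaml knobs worth watching in this situation (the option names are the real 1.10+ ones; the values are illustrative placeholders, not recommendations):

```yaml
# Total container size requested from YARN.
taskmanager.memory.process.size: 4g
# Metaspace is where leaked classloaders show up first.
taskmanager.memory.jvm-metaspace.size: 256m
# Safety margin for other JVM non-heap usage (thread stacks, GC, JIT).
taskmanager.memory.jvm-overhead.fraction: 0.1
# Direct/native memory budget for user code and libraries.
taskmanager.memory.task.off-heap.size: 128m
```

Comparing the metaspace graph of a failing container against `taskmanager.memory.jvm-metaspace.size` is often enough to tell a classloader leak apart from an off-heap leak in user code.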
Re: Flink job cannot find recover path after using entropy injection for s3 file systems
Hi Rainie, 1. I think what you need to do is to look for the {job_id} in all the possible sub folders of the dev/checkpoints/ folder or you extract the entropy from the logs. 2. According to [1] entropy should only be used for the data files and not for the metadata files. The idea was to keep the metadata path entropy free in order to make it more easily discoverable. I can imagine that this changed with FLINK-5763 [2] which was added in Flink 1.11. This effectively means that in order to make checkpoints/savepoints self contained we needed to add the entropy also to the metadata file paths. Moreover, this also means that the entropy injection works for 1.9 and 1.11. I think it was introduced with Flink 1.6.2, 1.7.0 [3]. [1] https://ci.apache.org/projects/flink/flink-docs-stable/deployment/filesystems/s3.html#entropy-injection-for-s3-file-systems [2] https://issues.apache.org/jira/browse/FLINK-5763 [3] https://issues.apache.org/jira/browse/FLINK-9061 Cheers, Till On Tue, Mar 16, 2021 at 7:03 PM Rainie Li wrote: > Hi Flink Developers. > > We enabled entropy injection for s3, here is our setting on Yarn Cluster. > s3.entropy.key: _entropy_ > s3.entropy.length: 1 > state.checkpoints.dir: 's3a://{bucket name}/dev/checkpoints/_entropy_' > > I have two questions: > 1. After enabling entropy, job's checkpoint path changed to: > *s3://{bucket name}/dev/checkpoints/_entropy_/{job_id}chk-607* > Since we don't know which key is mapped to _entropy_ > It cannot be used to relaunch flink jobs by running > *flink run -s **s3://{bucket > name}/dev/checkpoints/_entropy_/{job_id}chk-607* > If you also enabled entropy injection for s3, any suggestion how to recover > failed jobs using entropy checkpoints? > > 2. We added entropy settings on the Yarn cluster. > But we can only see flink jobs in version 1.11 shows the entropy checkpoint > path. 
> For flink jobs version 1.9, they are still using checkpoint paths without > entropy like: > *s3://{bucket name}/dev/checkpoints/{job_id}/chk-607* > Is this path equal to s3://*{bucket name}* > */dev/checkpoints/_entropy_/{job_id}**chk-607?* > Does entropy work for v1.9? If so, why does v1.9 job show checkpoint paths > *without* entropy? > > Appreciated any suggestions. > Thanks > Best regards > Rainie >
Re: [DISCUSS] Split PyFlink packages into two packages: apache-flink and apache-flink-libraries
How do other projects solve this problem?

Cheers,
Till

On Wed, Mar 17, 2021 at 3:45 AM Xingbo Huang wrote:

> Hi Chesnay,
>
> Yes, in most cases, we can indeed download the required jars in `setup.py`,
> which is also the solution I originally thought of for reducing the size of
> the wheel packages. However, I'm afraid that it will not work in scenarios
> where accessing the external network is not possible, which is very common
> in production clusters.
>
> Best,
> Xingbo
>
> On Tue, Mar 16, 2021 at 8:32 PM, Chesnay Schepler wrote:
> > This proposed apache-flink-libraries package would just contain the
> > binary, right? And effectively be unusable to the python audience on
> > its own.
> >
> > Essentially we are just abusing PyPI for shipping a Java binary. Is
> > there no way for us to download the jars when the python package is
> > being installed? (e.g., in setup.py)
> >
> > On 3/16/2021 1:23 PM, Dian Fu wrote:
> > > Yes, the size of the .whl file in PyFlink will also be about 3MB if we
> > > split the package. Currently the package is big because we bundled the
> > > jar files in it.
> > >
> > >> On Mar 16, 2021 at 8:13 PM, Chesnay Schepler wrote:
> > >> key difference being that the beam .whl files are 3MB large, aka 60x
> > >> smaller.
> > >>
> > >> On 3/16/2021 1:06 PM, Dian Fu wrote:
> > >>> Hi Chesnay,
> > >>>
> > >>> We will publish binary packages separately for:
> > >>> 1) Python 3.5 / 3.6 / 3.7 / 3.8 (since 1.12)
> > >>> 2) Linux / Mac
> > >>>
> > >>> Besides, there is also a source package which is used when none of
> > >>> the above binary packages is usable, e.g. for Windows users.
> > >>>
> > >>> PS: publishing multiple binary packages is very common in the Python
> > >>> world, e.g. Beam published 22 packages in 2.28 [1], Pandas published
> > >>> 16 packages in 1.2.3 [2]. We could also publish more packages if we
> > >>> split the package, as the cost of adding another package will be
> > >>> very small.
> > >>>
> > >>> Regards,
> > >>> Dian
> > >>>
> > >>> [1] https://pypi.org/project/apache-beam/#files
> > >>> [2] https://pypi.org/project/pandas/#files
> > >>>
> > >>> Hi Xintong,
> > >>>
> > >>> Yes, you are right that there are 9 packages in 1.12, as we added
> > >>> Python 3.8 support in 1.12.
> > >>>
> > >>> Regards,
> > >>> Dian
> > >>>
> > >>>> On Mar 16, 2021 at 7:45 PM, Xintong Song wrote:
> > >>>>
> > >>>> And it's not only uploaded to PyPI, but the ASF mirrors as well.
> > >>>> https://dist.apache.org/repos/dist/release/flink/flink-1.12.2/python/
> > >>>>
> > >>>> Thank you~
> > >>>> Xintong Song
> > >>>>
> > >>>> On Tue, Mar 16, 2021 at 7:41 PM Xintong Song wrote:
> > >>>>
> > >>>>> Actually, I think it's 9 packages, not 7.
> > >>>>>
> > >>>>> Check here for the 1.12.2 packages.
> > >>>>> https://pypi.org/project/apache-flink/#files
> > >>>>>
> > >>>>> Thank you~
> > >>>>> Xintong Song
> > >>>>>
> > >>>>> On Tue, Mar 16, 2021 at 7:08 PM Chesnay Schepler < ches...@apache.org > wrote:
> > >>>>>
> > >>>>>> Am I reading this correctly that we publish 7 different artifacts
> > >>>>>> just for python?
> > >>>>>> What does the release matrix look like?
> > >>>>>>
> > >>>>>> On 3/16/2021 3:45 AM, Dian Fu wrote:
> > >>>>>>> Hi Xingbo,
> > >>>>>>>
> > >>>>>>> Thanks a lot for bringing up this discussion. Actually the size
> > >>>>>>> limit already became an issue during the 1.11.3 and 1.12.1
> > >>>>>>> releases. It blocked us from publishing PyFlink packages to PyPI
> > >>>>>>> during the release as there was not enough space left (PS: the
> > >>>>>>> packages were published after increasing the size limit).
> > >>>>>>>
> > >>>>>>> Considering that the total package size is about 1.5GB (220MB *
> > >>>>>>> 7) for each release, it makes sense to split the PyFlink
> > >>>>>>> package. It could reduce the total package size to about 250MB
> > >>>>>>> (3MB * 7 + 220MB) for each release. We wouldn't need to increase
> > >>>>>>> the size limit any more in the next few years, as currently we
> > >>>>>>> still have about 7.5 GB of space left.
> > >>>>>>>
> > >>>>>>> So +1 from my side.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Dian
> > >>>>>>>
> > >>>>>>>> On Mar 12, 2021 at 2:30 PM, Xingbo Huang wrote:
> > >>>>>>>>
> > >>>>>>>> Hi everyone,
> > >>>>>>>>
> > >>>>>>>> Since release-1.11, pyflink has introduced cython support and
> > >>>>>>>> we will release 7 packages (for different platforms and Python
> > >>>>>>>> versions) to PyPI for each release, and the size of each
> > >>>>>>>> package is more than 200MB as we need to bundle the jar files
> > >>>>>>>> into the package. The entire project space in PyPI grows very
> > >>>>>>>> fast, and we need to apply to PyPI for more project space
> > >>>>>>>> frequently. Please refer to [ https://github.com/
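Downloading the jars at install time, as Chesnay suggests, would rely on Maven Central's predictable repository layout. A rough sketch of the URL construction a `setup.py` could use (the artifact coordinates and helper name here are assumptions for illustration; and as Xingbo notes, this fails on machines without external network access, so an offline fallback such as the proposed apache-flink-libraries package would still be needed):

```python
def flink_dist_jar_url(flink_version: str, scala_version: str = "2.11") -> str:
    """Build the Maven Central download URL for the flink-dist jar.

    Follows the standard Maven repository layout:
    <repo>/<groupId as path>/<artifactId>/<version>/<artifactId>-<version>.jar
    """
    artifact = f"flink-dist_{scala_version}"
    return (
        "https://repo.maven.apache.org/maven2/org/apache/flink/"
        f"{artifact}/{flink_version}/{artifact}-{flink_version}.jar"
    )
```

In `setup.py` such a URL would then be fetched (e.g. with `urllib.request.urlretrieve`) into the package's lib/ directory before the wheel is built.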
[jira] [Created] (FLINK-21847) Introduce DESC grammar in sql parser
Shengkai Fang created FLINK-21847: Summary: Introduce DESC grammar in sql parser Key: FLINK-21847 URL: https://issues.apache.org/jira/browse/FLINK-21847 Project: Flink Issue Type: Sub-task Components: Table SQL / Planner Affects Versions: 1.13.0 Reporter: Shengkai Fang DESC is an abbreviation of the DESCRIBE statement. Currently, DESC is only supported in the SQL client.
[jira] [Created] (FLINK-21846) Rethink whether failure of ExecutionGraph creation should directly fail the job
Till Rohrmann created FLINK-21846: Summary: Rethink whether failure of ExecutionGraph creation should directly fail the job Key: FLINK-21846 URL: https://issues.apache.org/jira/browse/FLINK-21846 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Till Rohrmann Fix For: 1.13.0 Currently, the {{AdaptiveScheduler}} fails a job execution if the {{ExecutionGraph}} creation fails. This can be problematic because the failure could result from a transient problem (e.g. the filesystem is currently not available). In the case of a transient problem, a job rescaling could lead to a job failure, which might be a bit surprising for users. Instead, I would expect that Flink would retry the {{ExecutionGraph}} creation. One idea could be to ask the restart policy how to treat the failure and whether to retry the {{ExecutionGraph}} creation or not. One thing to keep in mind, though, is that some failures might be permanent (e.g. a wrongly specified savepoint path). In such a case we would ideally fail immediately.
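The classify-then-retry policy proposed above can be sketched roughly as follows. None of these names exist in Flink; this is only an illustration of the shape: transient failures are retried up to a limit, while permanent failures (like a wrongly specified savepoint path) fail the job immediately:

```python
class PermanentError(Exception):
    """A failure that retrying cannot fix, e.g. a bad savepoint path."""

def create_with_retry(create, max_attempts=3):
    """Retry a creation step on transient failures; fail fast on permanent
    ones. `create` stands in for the ExecutionGraph creation step."""
    last = None
    for _ in range(max_attempts):
        try:
            return create()
        except PermanentError:
            raise                    # fail the job immediately
        except Exception as e:       # treat everything else as transient
            last = e
    raise last                       # retries exhausted: surface the failure
```

The open design question is exactly the classification step: which exceptions count as permanent would have to be decided by the restart policy rather than hard-coded as here.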
[jira] [Created] (FLINK-21845) Not able to read config in Flink migration
Yogesh Kumbhare created FLINK-21845: Summary: Not able to read config in Flink migration Key: FLINK-21845 URL: https://issues.apache.org/jira/browse/FLINK-21845 Project: Flink Issue Type: Bug Affects Versions: 1.9.0 Reporter: Yogesh Kumbhare I am facing an issue migrating an application from Samza to Flink: I am not able to read the config file. At class-load time all the config values are loaded, but when the DoFn is called the config values come back null. Please suggest whether I need to add some configuration to flink-conf.yaml.
[jira] [Created] (FLINK-21844) Do not auto-configure maxParallelism when setting "scheduler-mode: reactive"
Konstantin Knauf created FLINK-21844: Summary: Do not auto-configure maxParallelism when setting "scheduler-mode: reactive" Key: FLINK-21844 URL: https://issues.apache.org/jira/browse/FLINK-21844 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Konstantin Knauf I believe we should not automatically change the maxParallelism when the "scheduler-mode" is set to "reactive", because: * it magically breaks savepoint compatibility when you switch between the default and reactive scheduler mode * the maximum parallelism is an orthogonal concern that in my opinion should not be mixed with the scheduler mode. The reactive scheduler should respect the maxParallelism, but it should not change its default value.
[jira] [Created] (FLINK-21843) Support StreamExecGroupWindowAggregate json ser/de
Wenlong Lyu created FLINK-21843: Summary: Support StreamExecGroupWindowAggregate json ser/de Key: FLINK-21843 URL: https://issues.apache.org/jira/browse/FLINK-21843 Project: Flink Issue Type: Sub-task Reporter: Wenlong Lyu
[jira] [Created] (FLINK-21842) Support user defined WindowAssigner, Trigger and ProcessWindowFunction on Python DataStream API
Wei Zhong created FLINK-21842: Summary: Support user defined WindowAssigner, Trigger and ProcessWindowFunction on Python DataStream API Key: FLINK-21842 URL: https://issues.apache.org/jira/browse/FLINK-21842 Project: Flink Issue Type: Improvement Components: API / Python Reporter: Wei Zhong Assignee: Wei Zhong Currently the Python DataStream API does not yet support user-defined windows. We need to support that first.
[jira] [Created] (FLINK-21841) Cannot find kafka connector with sql-kafka-connector
JieFang.He created FLINK-21841: Summary: Cannot find kafka connector with sql-kafka-connector Key: FLINK-21841 URL: https://issues.apache.org/jira/browse/FLINK-21841 Project: Flink Issue Type: Bug Reporter: JieFang.He

When using sql-kafka with a fat jar (bundling flink-sql-connector-kafka_2.11 in the user jar) on Flink 1.11.1, like

{code:java}
CREATE TABLE user_behavior (
  user_id INT,
  action STRING,
  province INT,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'intopic',
  'properties.bootstrap.servers' = 'kafkaserver:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'csv',
  'scan.startup.mode' = 'earliest-offset'
)
{code}

I get an exception

{code:java}
Caused by: org.apache.flink.table.api.ValidationException: Cannot discover a connector using option ''connector'='kafka''.
    at org.apache.flink.table.factories.FactoryUtil.getDynamicTableFactory(FactoryUtil.java:329)
    at org.apache.flink.table.factories.FactoryUtil.createTableSource(FactoryUtil.java:118)
    ... 35 more
Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kafka' that implements 'org.apache.flink.table.factories.DynamicTableSourceFactory' in the classpath.
Available factory identifiers are:
datagen
{code}

It looks like the issue [FLINK-18076|https://issues.apache.org/jira/browse/FLINK-18076] does not deal with all exceptions
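For what it's worth, "Could not find any factory for identifier 'kafka' ... Available factory identifiers are: datagen" is the classic symptom of the fat jar's META-INF/services entries being overwritten during shading, so the Kafka factory is never registered with the ServiceLoader. Assuming the fat jar is built with the maven-shade-plugin (an assumption; the reporter doesn't say how the jar is built), merging the service files usually restores discovery:

```xml
<!-- maven-shade-plugin configuration for the fat-jar build: merge
     META-INF/services files from all jars instead of keeping only one
     copy, so the Kafka table factory remains discoverable via SPI. -->
<configuration>
  <transformers>
    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
  </transformers>
</configuration>
```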