Hi all, I have a job that has been running for a long time, but recently some dirty data from the source system caused it to fail. The YARN logs show a NullPointerException. I would now like to capture that particular dirty record — is there any way to do this? Thanks in advance for your guidance.
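One common way to capture the offending record is to wrap the parsing/mapping logic in a try/catch and route failures aside (in a real Flink job this would typically be a `ProcessFunction` writing to a side output via an `OutputTag`, or an error topic) instead of letting the NullPointerException kill the job. Since the actual job code is not shown, here is a plain-Java sketch of the wrapping idea; `parseRecord` is a hypothetical stand-in for the real per-record logic:

```java
import java.util.ArrayList;
import java.util.List;

public class DirtyDataCapture {
    // Hypothetical parser standing in for the real job logic; fails on dirty input.
    static int parseRecord(String raw) {
        if (raw == null || raw.isEmpty()) {
            throw new NullPointerException("missing field");
        }
        return raw.length();
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        input.add("good-record");
        input.add("");            // the "dirty" record
        List<String> dirty = new ArrayList<>();

        for (String raw : input) {
            try {
                parseRecord(raw);
            } catch (RuntimeException e) {
                // Capture the offending record instead of failing the job;
                // in Flink this would go to a side output or a dead-letter topic.
                dirty.add(raw);
                System.err.println("dirty record: [" + raw + "] cause: " + e);
            }
        }
        System.out.println("captured " + dirty.size() + " dirty record(s)");
    }
}
```

The captured records can then be logged or emitted downstream for inspection, while the healthy records continue through the pipeline.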
小昌同学
ccc0606fight...@163.com
Hi Jing,
Thank you for your interest in the release process. It has to be said that
the entire process went smoothly. We have very comprehensive
documentation[1] to guide my work, thanks to the contribution of previous
release managers and the community.
Regarding the obstacles, I actually only
Hi, Shammon FY.
In theory, increasing the parallelism can alleviate this, but raising it too far may be costly. Writing the results does not actually take many resources; the problem is that the volume of data emitted when a window fires is too large (proportional to the user cardinality), and the merge cost per record is too high (each record's window aggregation has to be merged 24*60/5 = 288 times).
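The merge count above follows directly from the window parameters: with a sliding window, each record belongs to (window size / slide) panes. A quick check of the arithmetic:

```java
public class PaneCount {
    // Number of sliding-window panes each record belongs to:
    // window size divided by the slide.
    static long panesPerRecord(long windowMinutes, long slideMinutes) {
        return windowMinutes / slideMinutes;
    }

    public static void main(String[] args) {
        // 24-hour window, 5-minute slide, as in the job described above.
        System.out.println(panesPerRecord(24 * 60, 5)); // prints 288
    }
}
```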
At the moment I can only think of the following solutions:
1. Increase the window slide. This solves the problem at the root (though a longer slide means the window fires later).
2. Keep increasing the slide and use the early-fire mechanism to emit windows before they are complete (the emitted data may not match the "last 24 hours" business semantics, and the downstream system must support upsert).
3.
The previous email was sent by mistake, so I am resending it. The project writes to Kafka with exactly-once semantics; the code and configuration are as follows.
The sink code:
Properties producerProperties = MyKafkaUtil.getProducerProperties();
KafkaSink kafkaSink = KafkaSink.builder()
        .setBootstrapServers(Event2Kafka.parameterTool.get("bootstrap.server"))
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
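For exactly-once delivery, Flink's `KafkaSink` runs the producer in transactional mode (the builder additionally needs `setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)` and `setTransactionalIdPrefix(...)`), and the producer's `transaction.timeout.ms` must not exceed the broker's `transaction.max.timeout.ms` (the broker default is 15 minutes, while Flink's producer default is larger). Since the project's `MyKafkaUtil` is not shown, here is a sketch of the producer properties commonly set for this mode; the value is an illustrative assumption:

```java
import java.util.Properties;

public class ExactlyOnceProducerProps {
    // Producer properties commonly required when the Flink KafkaSink uses
    // DeliveryGuarantee.EXACTLY_ONCE. The value here is illustrative.
    static Properties build() {
        Properties props = new Properties();
        // Must be <= the broker's transaction.max.timeout.ms
        // (broker default: 15 minutes = 900000 ms).
        props.setProperty("transaction.timeout.ms", String.valueOf(15 * 60 * 1000));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("transaction.timeout.ms"));
    }
}
```

These properties would be passed to the builder via `setKafkaProducerConfig(producerProperties)`.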
Hi Kamal,
The network buffers of a specific `FileSource` will fill up when the job is
under back pressure, which blocks the source subtask. You can refer to the
network stack deep dive [1] for more information.
[1]
https://flink.apache.org/2019/06/05/a-deep-dive-into-flinks-network-stack/
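The blocking behavior is analogous to a bounded queue: once the downstream stops draining, the upstream's blocking write stalls as soon as the buffer is full. A self-contained sketch of that mechanism (an analogy only, not Flink's actual network stack):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class BoundedBufferDemo {
    public static void main(String[] args) throws InterruptedException {
        // A tiny "network buffer pool" with capacity 2.
        ArrayBlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(2);
        buffer.put(1);
        buffer.put(2);
        // put() would now block until a slot frees up (the back-pressured case);
        // offer() instead returns false immediately, showing the buffer is full.
        boolean accepted = buffer.offer(3);
        System.out.println("buffer full, record accepted: " + accepted); // prints false
    }
}
```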
Best,
Shammon FY
On
Hi,
Is the problem that the volume of data emitted after the window fires is too large? Would adding resources and increasing the parallelism of the window computation alleviate it?
Best,
Shammon FY
On Fri, May 26, 2023 at 2:03 PM tanjialiang wrote:
> Hi, all.
> I ran into some problems using Flink SQL's window TVF sliding windows.
> With a 24-hour window sliding every 5 minutes, grouped by user_id,
> checkpoints rarely succeed after the job fails over or when it consumes from Kafka's earliest-offset.
>