Re: Let BucketingSink roll file on each checkpoint

2018-07-09 Thread zhangminglei
Hi Xilang, You can watch the JIRA issue you referred to. I will work on this in the next couple of days. Cheers, Minglei > On Jul 9, 2018, at 9:50 AM, XilangYan wrote: > > Hi Fabian, > > With a watermark, I understand it could only write those records that are smaller than > the received watermark, but could I know

Re: Let BucketingSink roll file on each checkpoint

2018-07-08 Thread XilangYan
Hi Fabian, With a watermark, I understand it could only write those records that are smaller than the received watermark, but could I know why it would also need to maintain a write-ahead log of all received rows? When an event is received, it just compares its time with the current watermark and writes it to the correct buck

Re: Let BucketingSink roll file on each checkpoint

2018-07-04 Thread Fabian Hueske
Hi Xilang, I thought about this again. The bucketing sink would need to roll on event-time intervals (similar to the current processing-time rolling), triggered by watermarks, in order to support consistency. However, it would also need to maintain a write-ahead log of all received rows an
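
Fabian's proposal above can be modeled outside Flink. The sketch below is a toy Python model (not the Flink API): every row first lands in a write-ahead log, and a bucket only becomes visible to readers once a watermark passes the bucket's end time. The 60-second bucket size and all names are illustrative assumptions.

```python
BUCKET_SIZE = 60  # seconds per event-time bucket (assumed for illustration)

class EventTimeRollingSink:
    """Toy model of watermark-triggered rolling with a write-ahead log."""

    def __init__(self):
        self.wal = []        # write-ahead log of (timestamp, row), not yet visible
        self.finalized = {}  # bucket_start -> rows, closed and safe to read

    def write(self, timestamp, row):
        # Rows are only logged on arrival; nothing is exposed to readers yet.
        self.wal.append((timestamp, row))

    def on_watermark(self, watermark):
        # Finalize every bucket whose end time is covered by the watermark;
        # rows in incomplete buckets stay in the write-ahead log.
        still_pending = []
        for ts, row in self.wal:
            bucket_start = ts - ts % BUCKET_SIZE
            if bucket_start + BUCKET_SIZE <= watermark:
                self.finalized.setdefault(bucket_start, []).append(row)
            else:
                still_pending.append((ts, row))
        self.wal = still_pending
```

This illustrates why the WAL is needed: a row for bucket [60, 120) that arrives before the watermark reaches 120 cannot be made visible yet without breaking the prefix-consistency property discussed in this thread.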

Re: Let BucketingSink roll file on each checkpoint

2018-07-04 Thread XilangYan
Hi Fabian, We do need a consistent view of the data: we need the Counter and the HDFS files to be consistent. For example, when the Counter indicates that 1000 messages were written to HDFS, there must be exactly 1000 messages in HDFS ready for reading. The data we write to HDFS is collected by an Agent (whic

Re: Let BucketingSink roll file on each checkpoint

2018-07-03 Thread Fabian Hueske
Hi Xilang, Let me try to summarize your requirements. If I understood you correctly, you are not only concerned about the exactly-once guarantees but also need a consistent view of the data. The data in all finalized files needs to originate from a prefix of the stream, i.e., all records w

Re: Let BucketingSink roll file on each checkpoint

2018-07-01 Thread XilangYan
Thank you, Minglei. I should describe my current flow and my requirements more clearly. 1. Any data we collect has a send-time. 2. When we collect data, we also send another counter message, say, that we have collected 45 messages whose send-time is 2018-07-02 10:02:00. 3. Data is sent to Kafka (or other message

Re: Let BucketingSink roll file on each checkpoint

2018-06-29 Thread zhangminglei
By the way, I do not think the approach below is correct, as @Fabian said. The BucketingSink closes files once they reach a certain size (BatchSize) or have not been written to for a certain amount of time (InactiveBucketThreshold). > … If we can close a > file during a checkpoint, then the result is a
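
The two roll triggers Minglei names (BatchSize and InactiveBucketThreshold are real BucketingSink settings) can be sketched as a small decision model. This is a toy Python illustration, not Flink code; timestamps are passed in explicitly to keep it deterministic, whereas the real sink uses processing time.

```python
class BucketState:
    """Toy model of BucketingSink's roll conditions: a part file rolls when
    it reaches batch_size bytes, or when it has been idle for at least
    inactive_threshold seconds."""

    def __init__(self, batch_size, inactive_threshold):
        self.batch_size = batch_size                # bytes before a size-based roll
        self.inactive_threshold = inactive_threshold  # idle seconds before a roll
        self.size = 0
        self.last_write = 0.0

    def write(self, nbytes, now):
        # Record the bytes appended and the time of the last write.
        self.size += nbytes
        self.last_write = now

    def should_roll(self, now):
        # Roll on size OR inactivity, whichever fires first.
        return (self.size >= self.batch_size
                or now - self.last_write >= self.inactive_threshold)
```

Note that neither trigger is tied to checkpoints, which is exactly the gap this thread is about: a file can stay in-progress across many checkpoints.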

Re: Let BucketingSink roll file on each checkpoint

2018-06-29 Thread zhangminglei
Hi Xilang, I think you are working together with the offline team. Also, you mentioned ETL; the ETL team wants to use the data in HDFS. I would like to confirm one question with you: what is their scheduling interval for each job? 5 mins or 10 mins? > My use case is that we read data from a message que

Re: Let BucketingSink roll file on each checkpoint

2018-06-28 Thread XilangYan
Hi Fabian, Finally I had time to read the code, and it is brilliant; it does provide an exactly-once guarantee. However, I still suggest adding the ability to close a file when a checkpoint is made. I noticed that there is an enhancement, https://issues.apache.org/jira/browse/FLINK-9138, which can clo
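
The "roll on checkpoint" behavior requested here (and tracked in FLINK-9138; a similar idea later shipped in Flink's StreamingFileSink as OnCheckpointRollingPolicy) is a two-phase handoff. The toy Python model below is an assumption-laden sketch, not Flink code: taking a snapshot closes the in-progress part and marks it pending, and the pending part only becomes visible once the checkpoint is acknowledged.

```python
class RollOnCheckpointSink:
    """Toy model of rolling a part file on every checkpoint."""

    def __init__(self):
        self.in_progress = []  # records not yet covered by any checkpoint
        self.pending = {}      # checkpoint_id -> closed but unconfirmed part
        self.finished = []     # parts safe for downstream readers

    def write(self, record):
        self.in_progress.append(record)

    def snapshot(self, checkpoint_id):
        # Roll: close the current part file and park it as pending.
        if self.in_progress:
            self.pending[checkpoint_id] = self.in_progress
            self.in_progress = []

    def notify_complete(self, checkpoint_id):
        # Checkpoint confirmed: pending parts up to this id become final.
        for cid in sorted(self.pending):
            if cid <= checkpoint_id:
                self.finished.append(self.pending.pop(cid))
```

The key point for this thread: finalization happens only in notify_complete, so every finished file corresponds exactly to a completed checkpoint, which gives the consistent prefix view Xilang asks for.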

Re: Let BucketingSink roll file on each checkpoint

2018-03-22 Thread XilangYan
OK, then maybe I have a misunderstanding about checkpoints. I thought Flink used the checkpoint to store the offset, but when the Kafka connector makes a checkpoint, it doesn't know whether the data is in an in-progress file or in a pending file, so a whole offset is saved in the checkpoint. I used to guess that the data in the in-p

Re: Let BucketingSink roll file on each checkpoint

2018-03-22 Thread Fabian Hueske
Hi, Flink maintains its own Kafka offsets in its checkpoints and does not rely on Kafka's offset management. That way, Flink guarantees that read offsets and checkpointed operator state are always aligned. The BucketingSink is designed not to lose any data, and its mode of operation is described in
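
The alignment Fabian describes can be sketched as follows (a toy Python model, not the Flink API): a checkpoint records the source offset together with the valid length of the in-progress file, and recovery truncates the file back to that length before re-reading from the saved offset, so replayed records never duplicate bytes already on disk. (The real BucketingSink records a valid-length marker rather than literally truncating; that detail is simplified here.)

```python
class CheckpointedSink:
    """Toy model of keeping a Kafka offset and an in-progress file aligned
    through checkpoint/restore."""

    def __init__(self):
        self.file = bytearray()  # stands in for the in-progress HDFS file
        self.offset = 0          # stands in for the Kafka read position
        self.checkpoint = None

    def consume(self, record: bytes):
        self.file.extend(record)
        self.offset += 1

    def snapshot(self):
        # Offset and file length are captured atomically, so they agree.
        self.checkpoint = (self.offset, len(self.file))

    def restore(self):
        offset, valid_length = self.checkpoint
        del self.file[valid_length:]  # drop bytes written after the checkpoint
        self.offset = offset          # re-read from the checkpointed offset
```

After a restore, the source replays exactly the records whose bytes were discarded, which is why no data is lost or duplicated even though the file was still in progress at failure time.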

Re: Let BucketingSink roll file on each checkpoint

2018-03-20 Thread XilangYan
Thank you, Fabian! The HDFS small-file problem can be avoided with a big checkpoint interval. Meanwhile, there is a potential data-loss problem in the current BucketingSink. Say we consume data from Kafka: when a checkpoint is requested, the Kafka offset is updated, but the in-progress file in the BucketingSink remains. If

Re: Let BucketingSink roll file on each checkpoint

2018-03-20 Thread Fabian Hueske
> file is persistent, because we met some bugs in the flush/append HDFS file use > case. > > Is there any way to let BucketingSink roll the file on each checkpoint? Thanks in > advance. > > > > > -- > Sent from: http://apache-flink-user-mailing-list-archive.2336050. > n4.nabble.com/ >

Let BucketingSink roll file on each checkpoint

2018-03-19 Thread XilangYan
any way to let BucketingSink roll the file on each checkpoint? Thanks in advance.
