Hi guys,

Can someone please help me understand this?
Regards,
Vinay Patil

On Thu, Apr 27, 2017 at 12:36 PM, Vinay Patil <vinay18.pa...@gmail.com> wrote:
> Hi Guys,
>
> For historical reprocessing, I am reading the Avro data from S3 and
> passing these records to the same pipeline for processing.
>
> I have the following queries:
>
> 1. I am running this pipeline as a stream application with checkpointing
> enabled. The records are successfully written to S3, but they remain in
> the pending state because checkpointing is not triggered while I am doing
> re-processing. Why does this happen? (I kept the checkpointing interval
> at 1 minute; the pipeline ran for 10 minutes.)
> This is the code I am using for reading the Avro data from S3:
>
> AvroInputFormat<SomeAvroClass> avroInputFormat = new AvroInputFormat<>(
>         new org.apache.flink.core.fs.Path(s3Path),
>         SomeAvroClass.class);
> sourceStream = env.createInput(avroInputFormat).map(...);
>
> 2. For the source stream Flink sets the parallelism to 1, while for the
> rest of the operators the user-specified parallelism is used. How does
> Flink read the data? Does it bring the entire file from S3 one at a time
> and then split it according to parallelism?
>
> 3. I am reading from two different S3 folders and treating them as
> separate source streams. How does Flink read the data in this case? Does
> it pick one file from each S3 folder, split the data, and pass it
> downstream? Does Flink read the data sequentially? I am confused here,
> as only one Task Manager is reading the data from S3, and then all TMs
> are getting the data.
>
> 4. Although I am running this as a stream application, the operators go
> into the FINISHED state after processing. Is this because Flink treats
> the S3 source as finite data? What will happen if data is continuously
> written to S3 from one pipeline while I am doing historical
> re-processing from a second pipeline?
>
> Regards,
> Vinay Patil
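For context on question 4: a sketch of one way the source could be kept from finishing, using Flink's `StreamExecutionEnvironment.readFile` with `FileProcessingMode.PROCESS_CONTINUOUSLY` instead of `env.createInput` (which reads the files once). This is only an assumption-laden sketch, not a confirmed fix for the checkpointing behaviour above; `SomeAvroClass` and `s3Path` are the placeholders from the original mail, and the import paths may differ between Flink versions.

```java
import org.apache.flink.api.java.io.AvroInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class ContinuousAvroSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        String s3Path = "s3://some-bucket/some-prefix/"; // placeholder path
        AvroInputFormat<SomeAvroClass> avroInputFormat =
                new AvroInputFormat<>(new Path(s3Path), SomeAvroClass.class);

        // PROCESS_CONTINUOUSLY re-scans the path on the given interval (ms),
        // so the source operator never reaches FINISHED and periodic
        // checkpoints can keep firing; with a one-shot read the source
        // finishes as soon as the existing files are consumed.
        DataStream<SomeAvroClass> sourceStream = env.readFile(
                avroInputFormat,
                s3Path,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                60_000L);

        sourceStream.map(record -> record).print(); // stand-in for the real pipeline

        env.execute("historical-reprocessing-sketch");
    }
}
```

Note that with `PROCESS_CONTINUOUSLY`, files are re-read in full when they are modified, so this suits append-by-new-file layouts (as is typical on S3) rather than in-place appends.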