Hi Mahak,

To quickly answer your question.

Scenario 1: A feed instance runs at 17:30 for replication but a file ending in 1730 isn't available yet. So, the instance is rescheduled for a later time and this keeps on happening until the file is found or the late arrival cut off time (an hour in this case) is reached.

- Assuming it's a feed with *frequency minutes(10)*, this scenario has nothing to do with late data. When the availability flag is ready, the replication kicks off; otherwise the 17:30 replication instance stays in the "Waiting" state. Once the availability flag is found, the instance goes to the "Running" state and replicates the data to the target cluster, and the 17:30 instance is considered a "Success".

Scenario 2: A feed instance runs at 17:30 for replication and finds that a file ending in 1720 is now available which wasn't available when the last replication instance ran (at 17:20). So, now it copies both the files (the one ending in 1730 and the one ending in 1720).

- No, it won't copy data for both instances. Since 17:20 is available for the first time, the 17:20 instance simply copies 17:20's data alone, and the feed instance for 17:30 checks for data under the 17:30 directory alone. Both are independent instances.

Late arrival works for both Feed and Process, and the details of the functionality are available in the Falcon documentation. Please check the "Late Data" section of
http://falcon.apache.org/0.6-incubating/EntitySpecification.html#Feed_Specification

Since your question is related to Feed replication (late-data), I will try to answer here:

1. From the Feed definition, let's say we have

   <frequency>hours(1)</frequency>
   <late-arrival cut-off="hours(6)"/>

2. From Falcon runtime.properties

   A feed cut-off policy is required for late-data handling for Feeds.
   Allowed policies: periodic, exp-backoff (exponential backoff) and final.
   Ex: periodic with delay=hours(2)

Here, Falcon would replicate the feed once every hour: 17:00, 18:00 and so on. late-arrival specifies *how long this feed should be checked for late data changes in the source cluster*, in this case 6 hours. So the instance 17:00 is honoured until 23:00 (17 + 6), the instance 18:00 until 00:00 (next day), and so on.

*When to check?* This is specified by the cut-off policy. Here it says periodic, hours(2), so Falcon checks the source cluster input for changes every 2 hours. So Falcon would check the instance 17:00 for data in the source cluster at 19:00, then at 21:00 and finally at 23:00.

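To make the arithmetic above concrete, here is a minimal Python sketch of the check times for a periodic policy. It is not Falcon code: the function name periodic_late_checks and its parameters are hypothetical, and it assumes only what is described above (first check one delay after the nominal instance time, last check no later than the late-arrival cut-off).

```python
from datetime import datetime, timedelta

def periodic_late_checks(instance_time, cut_off, delay):
    """Return the times at which a periodic policy would re-check an instance.

    Hypothetical helper, not part of Falcon. Checks start one `delay` after
    the nominal instance time and stop once the cut-off window has passed.
    """
    deadline = instance_time + cut_off   # e.g. 17:00 + hours(6) = 23:00
    checks = []
    next_check = instance_time + delay   # first late-data check
    while next_check <= deadline:
        checks.append(next_check)
        next_check += delay
    return checks

# Feed instance 17:00, late-arrival cut-off hours(6), periodic delay hours(2)
instance = datetime(2015, 6, 25, 17, 0)
checks = periodic_late_checks(instance, timedelta(hours=6), timedelta(hours=2))
print([t.strftime("%H:%M") for t in checks])  # ['19:00', '21:00', '23:00']
```

With the same settings, the 18:00 instance would get its last check at 00:00 the next day, which matches the cut-off window described above.
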
*How are changes detected?* Falcon maintains the data size for every instance run. It records the size of the data at the first run (17:00); if it detects a different size in the source input at the next periodic check (19:00), it simply reruns the entire replication, *overwriting* the previously replicated data.

Hope it answers your question.

Thanks,
-Idris

On Thu, Jun 25, 2015 at 10:02 PM, Mahak Mukhi <[email protected]> wrote:

> Hi,
> I wanted to get a clearer picture on how does falcon handle late arrivals?
> Does it wait for the specific feed instance for cut off time before failing
> or would it look for all files in the time interval (current - cut off) to
> (current). Consider the following 2 scenarios, I'd like to know which one
> corresponds with falcon:
> There's a feed set up for replication with a frequency of 10 minutes and
> the late arrival cut off time is set to be an hour.
> Scenario 1 : A feed instance runs at 17:30 for replication but a file
> ending in 1730 isn't available yet. So, the instance is rescheduled for a
> later time and this keeps on happening until the file is found or the late
> arrival cut off time (an hour in this case) is reached. In latter case, the
> replication job fails.
> Scenario 2: A feed instance runs at 17:30 for replication and finds that a
> file ending in 1720 is now available which wasn't available when the last
> replication instance ran(at 17:20). So, now it copies both the files (the
> one ending in 1730 and the one ending in 1720).
>
> I'm inclined to believe that scenario 1 corresponds with Falcon, however I
> want to confirm that I'm not missing anything. In case, it is Scenario 2,
> how does falcon keep track of what files have been copied?
> Your help is much appreciated. Thanks.
> Regards,
> Mahak Mukhi
>
