Adding to Sowmya's reply: Falcon does not look into the data records inside those files to enforce retention. It works at the file level, relying on the naming scheme followed in the HDFS file paths.
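
As a rough illustration of the steps Sowmya lists in the quoted message below, here is a minimal Python sketch of path-based eviction. It is not Falcon's actual code: the function names are hypothetical, a plain list of path strings stands in for the result of FileSystem.globStatus, and only the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} variables from the example feed are handled.

```python
import re
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase

# Hypothetical mapping from feed path variables to regex groups.
VARS = {
    "YEAR": r"(?P<year>\d{4})",
    "MONTH": r"(?P<month>\d{2})",
    "DAY": r"(?P<day>\d{2})",
    "HOUR": r"(?P<hour>\d{2})",
}


def template_to_glob(template):
    """Replace every ${VAR} with '*', e.g. .../clicks/*-*-*-*."""
    return re.sub(r"\$\{[A-Z]+\}", "*", template)


def template_to_regex(template):
    """Build a regex that captures the date fields from a concrete path."""
    # re.split with a capturing group alternates: literal, VAR, literal, ...
    parts = re.split(r"\$\{([A-Z]+)\}", template)
    pieces = [VARS[p] if i % 2 else re.escape(p) for i, p in enumerate(parts)]
    return re.compile("^" + "".join(pieces) + "$")


def path_date(template, path):
    """Return the UTC date encoded in the path, or None if it does not fit."""
    match = template_to_regex(template).match(path)
    if match is None:
        return None
    g = match.groupdict()
    return datetime(int(g.get("year", 1970)), int(g.get("month", 1)),
                    int(g.get("day", 1)), int(g.get("hour", 0)),
                    tzinfo=timezone.utc)


def evictable(template, paths, limit, now):
    """List the paths whose encoded date is older than now - limit."""
    pattern = template_to_glob(template)
    cutoff = now - limit
    result = []
    for path in paths:
        if not fnmatchcase(path, pattern):
            continue  # would not have been returned by the glob
        date = path_date(template, path)
        if date is not None and date < cutoff:
            result.append(path)
    return result
```

Note that the file contents are never opened; only the path string decides eligibility, which is why no per-record scan is involved.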

On Friday, January 22, 2016, Sowmya Ramesh <[email protected]> wrote:

> Hi John,
>
> Retention policy determines how long the data will remain on the cluster.
>
> Falcon kicks off the retention policy on the basis of the time value you
> specify in the retention limit:
>
> * Less than 24 hours: Falcon kicks off the retention policy job every 6
>   hours
> * More than 24 hours: Falcon kicks off the retention policy job every 24
>   hours
>
> When a feed is scheduled, Falcon kicks off the retention policy
> immediately. When the job runs, it deletes everything that's eligible for
> eviction. The eligibility criterion is the date pattern on the partition
> and NOT the creation date. For example, if the retention limit is 90 days,
> then the retention job consistently deletes files older than 90 days.
>
> I don't understand what you mean by records inside the file. I am
> assuming you mean files within a directory.
>
> For retention, Falcon expects data to be in dated partitions. I will try
> to explain the retention policy logic with an example.
> Let's say your feed location is defined as below:
>
> <locations>
>   <location type="data"
>             path="/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
>   <location type="stats" path="/none"/>
>   <location type="meta" path="/none"/>
> </locations>
>
> When the retention job is kicked off, it finds all the files that need to
> be evicted based on the retention policy. For the feed example mentioned
> above:
>
> * It gets the location from the feed, which is
>   "/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
> * Then it uses pattern matching to find the file pattern to get the list
>   of files for the feed: "/falcon/demo/primary/clicks/*-*-*-*"
> * Calls FileSystem.globStatus with the file pattern
>   "/falcon/demo/primary/clicks/*-*-*-*" to get the list of files
> * Gets the date from the file path. For example, if the file path is
>   /falcon/demo/primary/clicks/2016-01-11-02, the mapped date is
>   2016-01-11-02T00:00Z
> * If the file path date is beyond the retention limit, it's deleted
>
> As this uses pattern matching, it is not time consuming.
> You can set retention policies on a per-cluster basis and not on a
> per-field basis.
>
> Hope this helps. Let us know if you have any further queries.
>
> Thanks!
>
> On 1/22/16, 9:55 AM, "John Smith" <[email protected]> wrote:
>
> >Hello,
> >
> >I found that Falcon supports retention policy as part of the Lifecycle. I
> >am wondering how it works, because it's not clear to me from reading the
> >documentation.
> >
> >Assume I store one file (with thousands/millions of records) into HDFS and
> >I set the retention period to 1 year.
> >
> >How is that retention period enforced on the records inside the file? Does
> >it mean that the scheduler executes some "flow" that reads the stored file
> >record by record every day and checks the current date against the
> >retention date?
> >In case the current date >= retention date, the record is removed. Is it
> >CPU/time consuming? Does each check require a full file scan?
> >
> >What will happen in a scenario where I define different retention dates
> >per field?
> >
> >Thank you!
> >
> >Best,
> >John
