Sowmya, awesome and detailed! Thank you and you should encourage others to do this too.
On 1/22/16, 12:20 PM, "Sowmya Ramesh" <[email protected]> wrote:

>Hi John,
>
>Retention policy determines how long the data will remain on the cluster.
>
>Falcon kicks off the retention policy on the basis of the time value you
>specify in the retention limit:
>
>* Less than 24 hours: Falcon kicks off the retention policy job every 6
>hours
>* More than 24 hours: Falcon kicks off the retention policy job every 24
>hours
>
>When a feed is scheduled, Falcon kicks off the retention policy
>immediately. When the job runs, it deletes everything that's eligible for
>eviction. The eligibility criterion is the date pattern on the partition,
>NOT the creation date. For example, if the retention limit is 90 days,
>the retention job consistently deletes files older than 90 days.
>
>I don't understand what you mean by records inside the file. I am
>assuming you mean files within a directory.
>
>For retention, Falcon expects data to be in dated partitions. I will try
>to explain the retention policy logic with an example.
>Let's say your feed location is defined as below:
>
><locations>
>    <location type="data" path="/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
>    <location type="stats" path="/none"/>
>    <location type="meta" path="/none"/>
></locations>
>
>When the retention job is kicked off, it finds all the files that need to
>be evicted based on the retention policy. For the feed example above:
>* It gets the location from the feed, which is
>"/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
>* It uses pattern matching to derive the file pattern for the feed:
>"/falcon/demo/primary/clicks/*-*-*-*"
>* It calls FileSystem.globStatus with the file pattern
>"/falcon/demo/primary/clicks/*-*-*-*" to get the list of files
>* It gets the date from each file path. For example, if the file path is
>/falcon/demo/primary/clicks/2016-01-11-02, the mapped date is
>2016-01-11-02T00:00Z
>* If the file path date is beyond the retention limit, the file is deleted
>
>As this uses pattern matching, it is not time consuming.
>You can set retention policies on a per-cluster basis, not on a per-field
>basis.
>
>Hope this helps. Let us know if you have any further queries.
>
>Thanks!
>
>On 1/22/16, 9:55 AM, "John Smith" <[email protected]> wrote:
>
>>Hello,
>>
>>I found that Falcon supports retention policy as part of the Lifecycle. I
>>am wondering how it works, because it's not clear to me from reading the
>>documentation.
>>
>>Assume I store one file (with thousands/millions of records) in HDFS and
>>I set a retention period of 1 year.
>>
>>How is that retention period enforced on the records inside the file?
>>Does it mean that the scheduler executes some "flow" that reads the
>>stored file record by record every day and checks the current date
>>against the retention date? In case the current date >= retention date,
>>the record is removed. Is it CPU/time consuming? Does each check require
>>a full file scan?
>>
>>What will happen in a scenario where I define different retention dates
>>per field?
>>
>>Thank you!
>>
>>Best,
>>John
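
For reference, the eviction steps Sowmya lists in the thread above can be sketched in Python. This is only an illustration under stated assumptions, not Falcon's code: every function and constant name here is hypothetical, a plain list of path strings stands in for the result of FileSystem.globStatus, the ${HOUR} field is read as the hour of day, and a retention limit of exactly 24 hours (which the thread does not cover) is treated as the 24-hour case.

```python
import re
from datetime import datetime, timedelta, timezone

# Digit widths for the template variables used in the thread's example.
# All names in this sketch are hypothetical, not Falcon identifiers.
FIELD_WIDTHS = {"YEAR": 4, "MONTH": 2, "DAY": 2, "HOUR": 2}


def retention_job_interval(retention):
    """Limits under 24 hours run every 6 hours, otherwise every 24 hours.

    Assumption: a limit of exactly 24 hours falls in the 24-hour case.
    """
    if retention < timedelta(hours=24):
        return timedelta(hours=6)
    return timedelta(hours=24)


def to_glob(template):
    """Replace every ${VAR} in the feed location with '*'."""
    return re.sub(r"\$\{\w+\}", "*", template)


def partition_date(template, path):
    """Return the UTC date encoded in path, or None if it does not fit.

    Assumes the template contains at least ${YEAR}.
    """
    # re.split with a capturing group alternates literal text and names.
    pieces = re.split(r"\$\{(\w+)\}", template)
    regex = ""
    for i, piece in enumerate(pieces):
        if i % 2 == 0:
            regex += re.escape(piece)
        else:
            regex += "(?P<%s>\\d{%d})" % (piece, FIELD_WIDTHS[piece])
    match = re.fullmatch(regex, path)
    if match is None:
        return None
    f = {name: int(value) for name, value in match.groupdict().items()}
    try:
        return datetime(f["YEAR"], f.get("MONTH", 1), f.get("DAY", 1),
                        f.get("HOUR", 0), tzinfo=timezone.utc)
    except ValueError:
        # Digits matched but do not form a real date (e.g. month 13).
        return None


def evictable(template, paths, retention, now):
    """Return the paths whose partition date is older than now - retention."""
    cutoff = now - retention
    result = []
    for path in paths:
        date = partition_date(template, path)
        if date is not None and date < cutoff:
            result.append(path)
    return result


if __name__ == "__main__":
    feed = "/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
    listing = [
        "/falcon/demo/primary/clicks/2016-01-11-02",
        "/falcon/demo/primary/clicks/2016-03-01-00",
    ]
    now = datetime(2016, 4, 20, tzinfo=timezone.utc)
    print(to_glob(feed))
    print(evictable(feed, listing, timedelta(days=90), now))
```

Only path names are inspected, never file contents, which is why the check stays cheap regardless of how many records a file holds.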
