Adding to Sowmya's reply:

Falcon does not look into the data records inside those files to enforce
retention. Basically, it works at the file level, taking into account a
naming scheme followed in the HDFS file paths.

On Friday, January 22, 2016, Sowmya Ramesh <[email protected]> wrote:

> Hi John,
>
> Retention policy determines how long the data will remain on the cluster.
>
> Falcon kicks off the retention policy on the basis of the time value you
> specify in the retention limit:
>
> * Less than 24 hours: Falcon kicks off the retention policy job every 6
> hours
> * More than 24 hours: Falcon kicks off the retention policy job every 24
> hours
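> The rule above can be sketched in a few lines (a minimal illustration in
> Python; the function name is hypothetical, not Falcon's actual internals):

```python
from datetime import timedelta

def eviction_frequency(retention_limit: timedelta) -> timedelta:
    """How often the retention job runs, per the rule above."""
    if retention_limit < timedelta(hours=24):
        # Retention limits under a day are checked every 6 hours.
        return timedelta(hours=6)
    # Anything longer is checked once a day.
    return timedelta(hours=24)
```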
>
> When a feed is scheduled, Falcon kicks off the retention policy
> immediately. When the job runs, it deletes everything that's eligible for
> eviction - the eligibility criterion is the date pattern on the partition
> and NOT the creation date. For example, if the retention limit is 90 days,
> then the retention job consistently deletes files older than 90 days.
>
> I don't understand what you mean by records inside the file. I am
> assuming you mean files within a directory.
>
> For retention, Falcon expects data to be in dated partitions. I will try
> to explain the retention policy logic with an example.
> Let's say your feed location is defined as below:
>
> <locations>
>         <location type="data"
> path="/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
>         <location type="stats" path="/none"/>
>         <location type="meta" path="/none"/>
> </locations>
>
> When the retention job is kicked off, it finds all the files that need to
> be evicted based on the retention policy. For the feed example mentioned
> above:
> * It gets the location from the feed, which is
> "/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
> * Then it uses pattern matching to derive the file pattern for the feed:
> "/falcon/demo/primary/clicks/*-*-*-*"
> * Calls FileSystem.globStatus with the file pattern
> "/falcon/demo/primary/clicks/*-*-*-*" to get the list of files
> * Gets the date from the file path. For example, if the file path is
> /falcon/demo/primary/clicks/2016-01-11-02, the mapped date is
> 2016-01-11-02T00:00Z
> * If the file path date is beyond the retention limit, it is deleted
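> The steps above can be sketched roughly like this (a simplified Python
> illustration, not Falcon's actual Java implementation; the template and
> retention values are taken from the example above):

```python
import re
from datetime import datetime, timedelta

# Feed location template and retention limit from the example above.
TEMPLATE = "/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
RETENTION = timedelta(days=90)

def to_glob(template: str) -> str:
    # Replace each ${...} variable with a wildcard, analogous to the
    # pattern Falcon hands to FileSystem.globStatus.
    return re.sub(r"\$\{[A-Z]+\}", "*", template)

def path_date(path: str) -> datetime:
    # Recover the date from the dated partition at the end of the path,
    # e.g. .../2016-01-11-02 maps to 2016-01-11 hour 02.
    m = re.search(r"(\d{4})-(\d{2})-(\d{2})-(\d{2})$", path)
    year, month, day, hour = map(int, m.groups())
    return datetime(year, month, day, hour)

def eligible_for_eviction(path: str, now: datetime) -> bool:
    # Eligibility is decided purely from the date in the path;
    # file creation time is never consulted.
    return path_date(path) < now - RETENTION
```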
>
> As this uses pattern matching, it is not time consuming.
> You can set retention policies on a per-cluster basis, not a per-field
> basis.
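> For reference, the retention limit sits inside the cluster element of the
> feed definition, something like the fragment below (the cluster name and
> validity dates here are illustrative):

```xml
<clusters>
    <cluster name="primaryCluster" type="source">
        <validity start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"/>
        <retention limit="days(90)" action="delete"/>
    </cluster>
</clusters>
```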
>
> Hope this helps. Let us know if you have any further queries.
>
> Thanks!
>
> On 1/22/16, 9:55 AM, "John Smith" <[email protected]> wrote:
>
> >Hello,
> >
> >I found that Falcon supports a retention policy as part of the Lifecycle. I
> >am wondering how it works, because it's not clear to me from reading the
> >documentation.
> >
> >Assume I store one file (with thousands/millions of records) into HDFS and
> >I set the retention period to 1 year.
> >
> >How is that retention period enforced on the records inside the file? Does
> >it mean that the scheduler executes some "flow" that reads the stored file
> >record by record every day and checks the current date against the
> >retention date?
> >In case the current date >= retention date, the record is removed. Is it
> >cpu/time consuming? Does each check require a full file scan?
> >
> >What will happen in scenario when I define different retention dates per
> >field?
> >
> >
> >
> >Thank you!
> >
> >Best,
> >John
>
>
