[
https://issues.apache.org/jira/browse/FALCON-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095846#comment-14095846
]
Sowmya Ramesh commented on FALCON-594:
--------------------------------------
Multiple approaches have been identified for adding lineage information for
the eviction policy.
*Approach 1:*
On execution of the eviction policy, delete the identified feed instance
vertices from the graph. For completeness, the associated entity vertices
should also be deleted, i.e. a cascade delete.
Pros:
- As the identified feed instance vertices are deleted, the graph DB won't keep
growing, so there are no storage space issues.
Cons:
- Since eviction history is not preserved, this information cannot be retrieved
at a later point in time.
*Approach 2:*
- On execution of the eviction policy, delete the identified feed instance
vertices [cascade delete].
- Create a common Evicted vertex and, for each identified feed entity vertex,
add an edge to it with label "evicted". Add a property that records the evicted
feed instance vertex [fi], the timestamp of eviction [ti], and the workflow
(WF) id [wi]. Instead of creating a new common vertex, a self loop can be added
on the feed entity vertex.
Pros:
- As the identified feed instance vertices are deleted, the graph DB won't keep
growing, so there are no storage space issues.
- Some details about each eviction are stored in the graph DB, so they can be
retrieved later.
Cons:
- Compared to Approach 1, it requires more storage, as some details related to
eviction are stored.
- For each evicted instance a property [fi, ti, wi] is added. To get the
eviction details this property has to be parsed, leading to performance issues.
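
A minimal sketch of Approach 2, to make the trade-off concrete. It uses a tiny
in-memory graph rather than the real graph DB API; the Graph class, the evict
and eviction_history helpers, and the choice to store [fi, ti, wi] as
properties on the "evicted" edge are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny in-memory property graph; a hypothetical stand-in for the graph DB."""
    vertices: dict = field(default_factory=dict)  # vertex id -> properties
    edges: list = field(default_factory=list)     # (src, label, dst, properties)

    def add_vertex(self, vid, **props):
        self.vertices.setdefault(vid, {}).update(props)

    def add_edge(self, src, label, dst, **props):
        self.edges.append((src, label, dst, props))

    def remove_vertex(self, vid):
        # Removing a vertex also drops every edge that touches it.
        del self.vertices[vid]
        self.edges = [e for e in self.edges if vid not in (e[0], e[2])]


def evict(graph, feed, instances, timestamp, workflow_id):
    """Delete evicted instance vertices, keeping one "evicted" edge per instance."""
    graph.add_vertex("Evicted")  # common vertex, shared by all evictions
    for instance in instances:
        graph.remove_vertex(instance)
        # Assumption: fi, ti and wi are stored as edge properties.
        graph.add_edge(feed, "evicted", "Evicted",
                       fi=instance, ti=timestamp, wi=workflow_id)


def eviction_history(graph, feed):
    """Return the [fi, ti, wi] records stored for a feed entity vertex."""
    return [props for src, label, _, props in graph.edges
            if src == feed and label == "evicted"]
```

Note that eviction_history scans one record per evicted instance, which is the
per-instance storage and lookup cost listed under Cons above.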
*Approach 3:*
Create a common Evicted vertex and, on execution of the eviction policy, add an
edge labeled "evicted" from each identified feed instance vertex to it.
Pros:
- Simple to implement
- Retains all the details of evicted feed instances for historical queries
Cons:
- Storage and performance issues, as the graph DB keeps growing
*Approach 4:*
On execution of the retention policy, add an "evicted" property to each
identified feed instance vertex. To keep the graph DB from growing and causing
storage/performance issues, clean up the vertices based on a time limit (see
[FALCON-335|https://issues.apache.org/jira/browse/FALCON-335]).
Pros:
- Retains all the details of evicted feed instances for historical queries
Cons:
- Storage and performance issues, as the graph DB keeps growing
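
A minimal sketch of Approach 4, assuming vertices are held as a plain dict of
property maps instead of the real graph DB. The helper names mark_evicted and
purge_evicted, the keep_for parameter, and the choice to store the eviction
time as the value of the "evicted" property are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def mark_evicted(vertices, instance_ids, now):
    """Flag evicted feed instance vertices instead of deleting them."""
    for vid in instance_ids:
        # Assumption: the property value is the eviction time.
        vertices[vid]["evicted"] = now


def purge_evicted(vertices, now, keep_for):
    """Delete vertices evicted more than keep_for ago; return their ids."""
    expired = [vid for vid, props in vertices.items()
               if "evicted" in props and now - props["evicted"] > keep_for]
    for vid in expired:
        del vertices[vid]
    return expired


# Example: one instance is evicted, then purged once the time limit passes.
t0 = datetime(2014, 8, 13, tzinfo=timezone.utc)
vertices = {
    "clicks-feed/2014-08-01": {"type": "feed-instance"},
    "clicks-feed/2014-08-02": {"type": "feed-instance"},
}
mark_evicted(vertices, ["clicks-feed/2014-08-01"], now=t0)
too_early = purge_evicted(vertices, t0 + timedelta(days=1), timedelta(days=7))
purged = purge_evicted(vertices, t0 + timedelta(days=8), timedelta(days=7))
```

Until the purge runs, the evicted vertex stays in the graph and remains
available for historical queries; vertices without the property are never
touched by the cleanup.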
In addition, the decision to purge the vertices can be based on user input on
whether to preserve the history or not. In that case, multiple approaches have
to be implemented.
Instead of deleting vertices right away, there can be a time limit after which
the DB cleanup is done.
Approach 4 has been identified as a feasible solution. Please comment if you
have any concerns or inputs.
Thanks!
> Process lineage information for Retention policies
> --------------------------------------------------
>
> Key: FALCON-594
> URL: https://issues.apache.org/jira/browse/FALCON-594
> Project: Falcon
> Issue Type: Sub-task
> Reporter: Sowmya Ramesh
> Assignee: Sowmya Ramesh
>
> Falcon currently addresses process executions and not data lifecycle
> policies. This task should address adding this information.