[jira] [Commented] (FALCON-288) Persist lineage information into a persistent store

Srikanth Sundarrajan (JIRA) Sat, 01 Mar 2014 20:07:14 -0800

    [ 
https://issues.apache.org/jira/browse/FALCON-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917279#comment-13917279
 ]


Srikanth Sundarrajan commented on FALCON-288:
---------------------------------------------

Why do we need a user node attached to the cluster vertex? That relation isn't 
very useful and is likely to be misleading as well.
{code}
public void addClusterEntity(Cluster clusterEntity) {
...
+        addUser(clusterVertex);
{code}

addVertex() checks whether the vertex already exists, but no similar check is 
done for edges. On every restart you might create redundant edges between 
vertex pairs, at least for all elements of the entity graph.
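
A minimal sketch of such an existence check for edges, using a tiny in-memory 
stand-in rather than the actual graph API. The Vertex and Edge classes and the 
findEdge() and addEdgeIfAbsent() helpers are hypothetical names introduced only 
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in; not the Falcon or Blueprints API.
final class EdgeDedupSketch {
    static final class Vertex {
        final String name;
        final List<Edge> outEdges = new ArrayList<>();

        Vertex(String name) {
            this.name = name;
        }
    }

    static final class Edge {
        final String label;
        final Vertex to;

        Edge(String label, Vertex to) {
            this.label = label;
            this.to = to;
        }
    }

    /** Returns the existing edge with this label between the two vertices, or null. */
    static Edge findEdge(Vertex from, String label, Vertex to) {
        for (Edge e : from.outEdges) {
            if (e.label.equals(label) && e.to == to) {
                return e;
            }
        }
        return null;
    }

    /** Creates the edge only when an identical one is not already present. */
    static Edge addEdgeIfAbsent(Vertex from, String label, Vertex to) {
        Edge existing = findEdge(from, label, to);
        if (existing != null) {
            return existing; // e.g. the entity was re-registered after a restart
        }
        Edge created = new Edge(label, to);
        from.outEdges.add(created);
        return created;
    }
}
```

Calling addEdgeIfAbsent() twice with the same arguments leaves a single edge 
in place, which is the behaviour addVertex() already provides for vertices.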

In code snippets similar to this, it might be useful not to assume that the 
default edge label is "output", but to actually check for it and throw an 
assertion error otherwise. It generally gets very hard to debug issues related 
to graph sanity as the graph gets larger, since most graph implementations 
offer no relational, unique or property value constraints.
{code}
+    public void addProcessFeedEdge(Vertex processVertex, Vertex feedVertex, String edgeLabel) {
+        if (edgeLabel.equals(FEED_PROCESS_EDGE_LABEL)) {
+            feedVertex.addEdge(edgeLabel, processVertex);
+        } else {
+            processVertex.addEdge(edgeLabel, feedVertex);
+        }
+    }
{code}
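
A sketch of the explicit label check suggested above, reduced to plain strings 
so that it stands alone. The constant values and the resolveDirection() helper 
are assumptions made for illustration; only the "output" default comes from 
the patch.

```java
// Hypothetical helper; label values other than "output" are assumed for illustration.
final class EdgeLabelCheckSketch {
    static final String FEED_PROCESS_EDGE_LABEL = "input";
    static final String PROCESS_FEED_EDGE_LABEL = "output";

    /** Returns {from, to} for the edge, or fails loudly on an unknown label. */
    static String[] resolveDirection(String processName, String feedName, String edgeLabel) {
        if (FEED_PROCESS_EDGE_LABEL.equals(edgeLabel)) {
            return new String[] {feedName, processName};
        }
        if (PROCESS_FEED_EDGE_LABEL.equals(edgeLabel)) {
            return new String[] {processName, feedName};
        }
        // Do not silently treat every other label as "output".
        throw new AssertionError("Unexpected process/feed edge label: " + edgeLabel);
    }
}
```

With this shape, a mistyped or newly introduced label fails at the point where 
the edge is created instead of producing an edge that points the wrong way.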

This is going to be a little tricky. If you leave vertices behind even after 
all their incident edges are removed, the database is going to grow 
monotonically in size and cause performance issues down the line. One technique 
that I have used in the past with graph databases is, upon edge removal, to 
check whether the vertex is left with no edges and, if so, delete the vertex as 
well. A few gotchas with respect to this particular graph are:
1. Entity elements aren't to be removed.
2. Convenience relations may be added to instance vertices, which aren't to be 
considered when counting remaining edges.
These can be achieved by tagging the vertices and edges with appropriate 
properties.
{code}
+    public void removeEdge(Vertex fromVertex, Object toVertexName, String edgeLabel) {
...
+                // remove the edge and not the vertex since instances are pointing to this vertex
...
{code}
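
The cleanup-on-edge-removal technique and the two gotchas could be sketched 
roughly as follows. This is an in-memory stand-in, not the actual graph API: 
the "entity" and "convenience" flags play the role of the tagging properties, 
and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in; not the Falcon or Blueprints API.
final class OrphanCleanupSketch {
    static final class Vertex {
        final String name;
        final boolean entity; // tag: entity vertices are never removed automatically

        Vertex(String name, boolean entity) {
            this.name = name;
            this.entity = entity;
        }
    }

    static final class Edge {
        final Vertex from;
        final Vertex to;
        final String label;
        final boolean convenience; // tag: ignored when counting remaining edges

        Edge(Vertex from, String label, Vertex to, boolean convenience) {
            this.from = from;
            this.label = label;
            this.to = to;
            this.convenience = convenience;
        }
    }

    final Set<Vertex> vertices = new HashSet<>();
    final List<Edge> edges = new ArrayList<>();

    Vertex addVertex(String name, boolean entity) {
        Vertex v = new Vertex(name, entity);
        vertices.add(v);
        return v;
    }

    Edge addEdge(Vertex from, String label, Vertex to, boolean convenience) {
        Edge e = new Edge(from, label, to, convenience);
        edges.add(e);
        return e;
    }

    /** Removes the edge, then drops either endpoint if it has become an orphan. */
    void removeEdge(Edge edge) {
        if (!edges.remove(edge)) {
            return; // unknown edge, nothing to clean up
        }
        removeIfOrphaned(edge.from);
        removeIfOrphaned(edge.to);
    }

    private void removeIfOrphaned(Vertex v) {
        if (v.entity || !vertices.contains(v)) {
            return; // gotcha 1: entity elements stay
        }
        for (Edge e : edges) {
            if (!e.convenience && (e.from == v || e.to == v)) {
                return; // still referenced by a real relation
            }
        }
        // gotcha 2: only convenience edges remain, so drop them with the vertex
        edges.removeIf(e -> e.from == v || e.to == v);
        vertices.remove(v);
    }
}
```

The sketch does not cascade: a vertex that loses its last edge because a 
neighbour was cleaned up is not itself re-examined, which a real 
implementation would need to decide on.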

It is reasonable to leave graph elements behind after an entity is deleted to 
allow historical queries. However, some cleanup based on a time limit ought to 
be available. This is required even for active ones. It might also be worth 
considering an option that lets the user delete an entity along with its 
historical data.
{code}
+    @Override
+    public void onRemove(Entity entity) throws FalconException {
+        // do nothing, we'd leave the deleted entities as-is for historical purposes
+        // should we mark 'em as deleted?
+    }
{code}
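
One possible shape for the time-based cleanup mentioned above, assuming each 
vertex carries a creation timestamp as a property. The class, the map-based 
storage and the purgeOlderThan() name are all hypothetical; a real version 
would query the graph store instead.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical retention sketch; the map stands in for vertices with a timestamp property.
final class RetentionSketch {
    // vertex name -> creation time in epoch milliseconds
    private final Map<String, Long> createdAt = new LinkedHashMap<>();

    void record(String vertexName, long timestampMillis) {
        createdAt.put(vertexName, timestampMillis);
    }

    boolean contains(String vertexName) {
        return createdAt.containsKey(vertexName);
    }

    /** Removes every vertex older than the retention window and returns the count. */
    int purgeOlderThan(long nowMillis, long retentionMillis) {
        long cutoff = nowMillis - retentionMillis;
        int purged = 0;
        Iterator<Map.Entry<String, Long>> it = createdAt.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < cutoff) {
                it.remove();
                purged++;
            }
        }
        return purged;
    }
}
```

The same routine could back a user-facing "delete with history" option by 
running it with a retention of zero for the elements of a single entity.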

Is the motivation for adding classification & groups relationships to every 
instance to provide a "WHAT-WAS" view of the feed instance? Is that a more 
standard ask? The current model is generic enough to provide both "WHAT-WAS" 
and "WHAT-IS", but at a higher cost. If that is a required feature, we can 
leave it as is.
{code}
+    public void addFeedInstances(String[] feedNames, String[] feedInstancePaths,
...
+            addDataClassification(feed.getTags(), feedInstance);
+            addGroups(feed.getGroups(), feedInstance);
{code}

Why is workflowInstance a separate node in the graph and not a set of 
properties on the process instance? I can imagine this being useful in re-run 
scenarios, but I don't see that run relationship being captured.

So many relationships are being created that it might be very useful to test 
each one of these functions independently.


> Persist lineage information into a persistent store
> ---------------------------------------------------
>
>                 Key: FALCON-288
>                 URL: https://issues.apache.org/jira/browse/FALCON-288
>             Project: Falcon
>          Issue Type: Sub-task
>    Affects Versions: 0.5
>            Reporter: Venkatesh Seetharam
>            Assignee: Venkatesh Seetharam
>              Labels: lineage
>         Attachments: Dependency Graph.png, FALCON-288-Hive-Review.patch, 
> FALCON-288-review-v1.patch, FALCON-288-review.patch, FALCON-288-v1.patch, 
> Lineage Over Dependency.png
>
>
> Need to evaluate the store - rdbms vs graph db. Leaning towards latter since 
> the data is hierarchical.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
