[
https://issues.apache.org/jira/browse/FALCON-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917279#comment-13917279
]
Srikanth Sundarrajan commented on FALCON-288:
---------------------------------------------
Why do we need a user node attached to the cluster vertex? That relation isn't
very useful and is likely to be misleading as well.
{code}
public void addClusterEntity(Cluster clusterEntity) {
...
+ addUser(clusterVertex);
{code}
addVertex() checks whether the vertex already exists, but no similar check is
done for edges. On every restart you may end up creating redundant edges
between vertex pairs, at least for all elements of the entity graph.
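A minimal sketch of the edge-side existence check suggested above. It uses a
tiny in-memory stand-in rather than the graph API the patch uses; the names
EdgeDedupSketch, Vertex, Edge and addEdgeIfAbsent are hypothetical and only
illustrate the idea of looking for a matching edge before adding one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; not the Falcon or graph-library API.
final class EdgeDedupSketch {

    static final class Vertex {
        final String name;
        final List<Edge> outEdges = new ArrayList<>();

        Vertex(String name) {
            this.name = name;
        }
    }

    static final class Edge {
        final String label;
        final Vertex from;
        final Vertex to;

        Edge(String label, Vertex from, Vertex to) {
            this.label = label;
            this.from = from;
            this.to = to;
        }
    }

    // Returns the existing edge with the same label and endpoints,
    // or creates one if no such edge is present yet.
    static Edge addEdgeIfAbsent(Vertex from, Vertex to, String label) {
        for (Edge edge : from.outEdges) {
            if (edge.label.equals(label) && edge.to == to) {
                return edge; // already present, e.g. after a restart
            }
        }
        Edge edge = new Edge(label, from, to);
        from.outEdges.add(edge);
        return edge;
    }
}
```

Replaying the same entity definitions through addEdgeIfAbsent leaves exactly
one edge per (from, label, to) triple. The linear scan is fine for a sketch; a
real store would presumably use an index or a label-filtered edge lookup.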
In code snippets similar to this one, it would be useful not to assume that the
default edge label is "output", but to actually check for it and throw an
assertion error otherwise. Issues related to graph sanity generally get very
hard to debug as the graph grows, because most graph implementations offer no
relational, unique, or property-value constraints.
{code}
+ public void addProcessFeedEdge(Vertex processVertex, Vertex feedVertex, String edgeLabel) {
+ if (edgeLabel.equals(FEED_PROCESS_EDGE_LABEL)) {
+ feedVertex.addEdge(edgeLabel, processVertex);
+ } else {
+ processVertex.addEdge(edgeLabel, feedVertex);
+ }
+ }
{code}
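One way the explicit label check could look, as a sketch only. The Vertex
class here is a hypothetical in-memory stand-in, PROCESS_FEED_EDGE_LABEL is an
assumed constant name, and both label values ("input" and "output") are
assumptions; the patch's real constants may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; only the branching logic is the point.
final class EdgeLabelCheckSketch {

    // Assumed label values, not taken from the patch.
    static final String FEED_PROCESS_EDGE_LABEL = "input";
    static final String PROCESS_FEED_EDGE_LABEL = "output";

    static final class Vertex {
        final String name;
        // Each entry records one outgoing edge as "label->targetName".
        final List<String> outEdges = new ArrayList<>();

        Vertex(String name) {
            this.name = name;
        }

        void addEdge(String label, Vertex target) {
            outEdges.add(label + "->" + target.name);
        }
    }

    static void addProcessFeedEdge(Vertex processVertex, Vertex feedVertex,
                                   String edgeLabel) {
        if (FEED_PROCESS_EDGE_LABEL.equals(edgeLabel)) {
            feedVertex.addEdge(edgeLabel, processVertex);
        } else if (PROCESS_FEED_EDGE_LABEL.equals(edgeLabel)) {
            processVertex.addEdge(edgeLabel, feedVertex);
        } else {
            // Fail loudly instead of silently treating the label as "output".
            throw new AssertionError("Unexpected edge label: " + edgeLabel);
        }
    }
}
```

Throwing explicitly, rather than using the assert keyword, keeps the check
active even when JVM assertions are disabled.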
This is going to be a little tricky. If you leave vertices behind even after
all their incident edges are removed, the database will grow monotonically in
size and cause performance issues down the line. One technique that I have used
in the past with graph databases is, upon edge removal, to check whether the
vertex is left with no edges and, if so, delete the vertex as well. A few
gotchas with respect to this particular graph:
1. Entity elements aren't to be removed.
2. Convenience relations may be added to instance vertices, and these aren't to
be considered when counting remaining edges.
Both can be handled by tagging the vertices and edges with appropriate
properties.
{code}
+ public void removeEdge(Vertex fromVertex, Object toVertexName, String edgeLabel) {
...
+ // remove the edge and not the vertex since instances are pointing to this vertex
...
{code}
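The orphan-vertex cleanup described above might be sketched roughly as
follows. Everything here is a hypothetical in-memory model: the "entity" and
"convenience" flags stand in for whatever tagging properties would be set on
real vertices and edges, and Graph, Vertex and Edge are not the graph
library's types.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of "delete a vertex once its last real edge is gone".
final class OrphanCleanupSketch {

    static final class Vertex {
        final String name;
        final boolean entity; // tag: entity vertices are never removed

        Vertex(String name, boolean entity) {
            this.name = name;
            this.entity = entity;
        }
    }

    static final class Edge {
        final Vertex from;
        final Vertex to;
        final String label;
        final boolean convenience; // tag: ignored when counting remaining edges

        Edge(Vertex from, Vertex to, String label, boolean convenience) {
            this.from = from;
            this.to = to;
            this.label = label;
            this.convenience = convenience;
        }
    }

    static final class Graph {
        final Set<Vertex> vertices = new HashSet<>();
        final List<Edge> edges = new ArrayList<>();

        Vertex addVertex(String name, boolean entity) {
            Vertex vertex = new Vertex(name, entity);
            vertices.add(vertex);
            return vertex;
        }

        Edge addEdge(Vertex from, Vertex to, String label, boolean convenience) {
            Edge edge = new Edge(from, to, label, convenience);
            edges.add(edge);
            return edge;
        }

        void removeEdge(Edge edge) {
            edges.remove(edge);
            removeIfOrphaned(edge.from);
            removeIfOrphaned(edge.to);
        }

        private void removeIfOrphaned(Vertex vertex) {
            if (vertex.entity || !vertices.contains(vertex)) {
                return;
            }
            for (Edge edge : edges) {
                if (!edge.convenience && (edge.from == vertex || edge.to == vertex)) {
                    return; // still referenced by a non-convenience edge
                }
            }
            // Only convenience edges (if any) remain: drop them with the vertex.
            edges.removeIf(edge -> edge.from == vertex || edge.to == vertex);
            vertices.remove(vertex);
        }
    }
}
```

The sketch does not cascade: vertices at the far end of the dropped
convenience edges are not re-checked, which a real implementation may or may
not want.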
It is reasonable to leave graph elements behind after an entity is deleted, to
allow historical queries. However, some time-limit-based cleanup ought to be
available. This is required even for active entities. It might also be worth
considering an option that lets the user delete an entity along with its
historical data.
{code}
+ @Override
+ public void onRemove(Entity entity) throws FalconException {
+ // do nothing, we'd leave the deleted entities as-is for historical purposes
+ // should we mark 'em as deleted?
+ }
{code}
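A rough sketch of what a time-limit-based cleanup could look like, assuming
each historical element carries a timestamp property. RetentionSweepSketch and
its methods are hypothetical names, and the map is only a stand-in for the
graph store.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical retention sweep over timestamped graph elements.
final class RetentionSweepSketch {

    // Element name -> timestamp recorded when the element was added.
    private final Map<String, Instant> createdAt = new HashMap<>();

    void addInstance(String name, Instant timestamp) {
        createdAt.put(name, timestamp);
    }

    boolean contains(String name) {
        return createdAt.containsKey(name);
    }

    // Removes every element older than the retention period, whether or not
    // the owning entity is still active, and returns how many were removed.
    int purgeOlderThan(Duration retention, Instant now) {
        Instant cutoff = now.minus(retention);
        int before = createdAt.size();
        createdAt.values().removeIf(timestamp -> timestamp.isBefore(cutoff));
        return before - createdAt.size();
    }
}
```

Passing Duration.ZERO with a "now" later than every timestamp would remove
everything, which is one way the "delete an entity along with its historical
data" option could reuse the same path.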
Is the motivation for adding classification and groups relationships to every
instance to provide a "WHAT-WAS" view of the feed instance? Is that a standard
ask? The current model is generic enough to provide both "WHAT-WAS" and
"WHAT-IS", but at a higher cost. If it is a required feature, we can leave it
as is.
{code}
+ public void addFeedInstances(String[] feedNames, String[] feedInstancePaths,
...
+ addDataClassification(feed.getTags(), feedInstance);
+ addGroups(feed.getGroups(), feedInstance);
{code}
Why is workflowInstance a separate node in the graph and not a set of
properties on the process instance? I can imagine this being useful in re-run
scenarios, but I don't see that run relationship being captured.
So many relationships are being created that it would be very useful to test
each of these functions independently.
> Persist lineage information into a persistent store
> ---------------------------------------------------
>
> Key: FALCON-288
> URL: https://issues.apache.org/jira/browse/FALCON-288
> Project: Falcon
> Issue Type: Sub-task
> Affects Versions: 0.5
> Reporter: Venkatesh Seetharam
> Assignee: Venkatesh Seetharam
> Labels: lineage
> Attachments: Dependency Graph.png, FALCON-288-Hive-Review.patch,
> FALCON-288-review-v1.patch, FALCON-288-review.patch, FALCON-288-v1.patch,
> Lineage Over Dependency.png
>
>
> Need to evaluate the store - rdbms vs graph db. Leaning towards latter since
> the data is hierarchical.