[
https://issues.apache.org/jira/browse/TINKERPOP-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114757#comment-15114757
]
Dylan Bethune-Waddell commented on TINKERPOP-1099:
--------------------------------------------------
So I sat down to do this and realized that the problem probably exists for
edges between two distinct vertices as well: if there is more than one edge of
the same label between two vertices, the first one found is assumed to be a
duplicate of the one we are about to load into the graph. That makes this a
more general problem. The only solution I've been able to come up with is to
make the behaviour explicit in the configuration of the bulk loader: default to
"createEdge" with no "getOr", because that is the least confusing, and add a
"gremlin.bulkLoader.edgeEquivalenceLevel" setting (or something similar). That
seems better to me than a custom bulkLoader for each use case. The possible
values would be:
*# =LABEL, overwrite edges of a given label between v1 and v2 if found (and the
only one found, otherwise add, don't overwrite?)
*# =PROPERTYKEY, overwrite edges of a given label between v1 and v2 if found
and having the same property keys of the edge to be added (and the only one
found, otherwise add or overwrite?)
*# =PROPERTY, overwrite edges of a given label between v1 and v2 if they have
the same property key/value pairs as the edge to be added (and the only one
found, otherwise add or overwrite?)
*# =META, probably the best solution, albeit the hardest to enforce or expect:
the user adds metadata to the edges that are already in the graph and to the
edges being added, and that metadata dictates whether a load is a valid
overwrite or a brand new edge to integrate into the graph, even if it has the
same label, endpoints, and key/value pairs... isn't this something doable in
the RDF world?
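To make the first three levels concrete, here is a minimal plain-Java sketch of how such a setting might decide between "get" and "create". Nothing in it is TinkerPop API: `EdgeEquivalence`, `Level`, `SimpleEdge`, `matches` and `findUnique` are hypothetical names, the candidate list is assumed to already hold only the edges between v1 and v2, and the open question above (what to do when several edges match) is resolved here by arbitrarily choosing "add". META is left out because it depends on user-defined metadata.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch only: none of these names exist in TinkerPop.
final class EdgeEquivalence {

    /** Candidate values for the proposed gremlin.bulkLoader.edgeEquivalenceLevel setting. */
    enum Level { LABEL, PROPERTYKEY, PROPERTY }

    /** Stand-in for an edge between v1 and v2: a label plus its property key/value pairs. */
    static final class SimpleEdge {
        private final String label;
        private final Map<String, Object> properties;

        SimpleEdge(String label, Map<String, Object> properties) {
            this.label = label;
            this.properties = properties;
        }

        String label() { return label; }
        Map<String, Object> properties() { return properties; }
    }

    /** True if 'existing' counts as the same edge as 'incoming' at the given level. */
    static boolean matches(Level level, SimpleEdge existing, SimpleEdge incoming) {
        if (!existing.label().equals(incoming.label())) {
            return false;
        }
        switch (level) {
            case LABEL:
                return true;
            case PROPERTYKEY:
                return existing.properties().keySet().equals(incoming.properties().keySet());
            case PROPERTY:
                return existing.properties().equals(incoming.properties());
            default:
                throw new IllegalArgumentException("unknown level: " + level);
        }
    }

    /**
     * Returns the edge to overwrite only when exactly one candidate matches.
     * Zero or several matches yield empty, meaning "create a new edge"
     * (an assumption; the comment above leaves this case open).
     */
    static Optional<SimpleEdge> findUnique(Level level, List<SimpleEdge> candidates,
                                           SimpleEdge incoming) {
        List<SimpleEdge> hits = candidates.stream()
                .filter(c -> matches(level, c, incoming))
                .collect(Collectors.toList());
        return hits.size() == 1 ? Optional.of(hits.get(0)) : Optional.empty();
    }
}
```

With two existing "knows" edges that differ only in a property value, LABEL and PROPERTYKEY find two matches and fall back to creating a new edge, while PROPERTY picks out the single edge whose key/value pairs are identical.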
Just some thoughts - I'm rolling with createEdge for now, and will wait for you
guys to comment about this before going ahead with a PR for these changes (I
still think I can do something at this level, unless you guys want to handle it
to be sure it's done right).
> IncrementalBulkLoader's getOrCreateEdge overwrites previously added self-loops
> ------------------------------------------------------------------------------
>
> Key: TINKERPOP-1099
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1099
> Project: TinkerPop
> Issue Type: Bug
> Components: hadoop, io
> Affects Versions: 3.1.1-incubating
> Environment: Linux, CentOS 6.4
> Reporter: Dylan Bethune-Waddell
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> The traversal in this function assumes that only one edge will be returned
> and if it is returned then we should "get" and not "create" that edge, but
> any number of self-edges on a vertex will be picked up by this traversal.
> Further, they are allowed to have different properties than the edge we are
> about to load, causing the first self-edge returned to be repeatedly
> overwritten or have properties appended to it. On a fresh bulk load, only the
> very last self-edge of a given label out of all self-edges of that label that
> were going to be added appears in the graph.
> {code:title=IncrementalBulkLoader.java (lines 51-77)}
> @Override
> public Edge getOrCreateEdge(final Edge edge, final Vertex outVertex,
>         final Vertex inVertex, final Graph graph, final GraphTraversalSource g) {
>     final Edge e;
>     final Traversal<Vertex, Edge> t =
>             g.V(outVertex).outE(edge.label()).filter(__.inV().is(inVertex));
>     if (t.hasNext()) {
>         e = t.next();
>         edge.properties().forEachRemaining(property -> {
>             final Property<?> existing = e.property(property.key());
>             if (!existing.isPresent() ||
>                     !existing.value().equals(property.value())) {
>                 e.property(property.key(), property.value());
>             }
>         });
>     } else {
>         e = createEdge(edge, outVertex, inVertex, graph, g);
>     }
>     return e;
> }
> {code}
> It would seem that the values of any properties on the edge must be compared
> to try a "get" instead of just creating the edge, but if there are no
> properties on the (weird) self-edge, I have no idea what reasonable behaviour
> would be.
> I may be able to submit a PR for this later today so I'll assign myself for
> now but feel free to put it in better hands - this seems like something
> relatively minor to provide a decent interim fix for, and it does overwrite
> user data in the graph so I vote that a fix gets pushed into 3.1.1 and I have
> flagged it as "important".
> *Tested on*:
> - Linux/CentOS 6.4
> - Titan 1.1 (6 nodes) and TinkerGraph
> - TinkerPop-3.1.1-SNAPSHOT
> - Spark 1.5.2
> - Hadoop 2.7.1
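The overwrite described in the issue can be illustrated without a graph database. The following is a plain-Java simulation, not TinkerPop code: `SelfLoopOverwriteDemo`, `E`, `create` and `getOrCreate` are hypothetical names, and the matching rule (same label and same endpoints, first hit wins) is only an approximation of the quoted traversal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation, not TinkerPop code: every name here is hypothetical.
final class SelfLoopOverwriteDemo {

    /** Minimal edge stand-in: endpoints, label and mutable properties. */
    static final class E {
        final String out;
        final String label;
        final String in;
        final Map<String, Object> props = new HashMap<>();

        E(String out, String label, String in) {
            this.out = out;
            this.label = label;
            this.in = in;
        }
    }

    final List<E> edges = new ArrayList<>();

    /** Always adds a new edge, which is what a plain "create" strategy does. */
    E create(String out, String label, String in, Map<String, Object> props) {
        E e = new E(out, label, in);
        e.props.putAll(props);
        edges.add(e);
        return e;
    }

    /** Approximates get-or-create: the first edge with the same label and endpoints wins. */
    E getOrCreate(String out, String label, String in, Map<String, Object> props) {
        for (E e : edges) {
            if (e.out.equals(out) && e.label.equals(label) && e.in.equals(in)) {
                e.props.putAll(props); // overwrites differing values, appends new keys
                return e;
            }
        }
        return create(out, label, in, props);
    }
}
```

Loading three self-loops on the same vertex with a property `n` set to 1, 2 and 3 through `getOrCreate` leaves a single edge carrying `n = 3`, while loading them through `create` leaves three distinct edges.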