Daniel Kuppitz created TINKERPOP3-904:
-----------------------------------------
Summary: BulkLoaderVertexProgram optimizations
Key: TINKERPOP3-904
URL: https://issues.apache.org/jira/browse/TINKERPOP3-904
Project: TinkerPop 3
Issue Type: Improvement
Components: process
Affects Versions: 3.1.0-incubating
Reporter: Daniel Kuppitz
Assignee: Daniel Kuppitz
Fix For: 3.1.0-incubating
This is the continuation of
https://issues.apache.org/jira/browse/TINKERPOP3-319. A few suggestions were
made by [~mbroecheler] on how to optimize the current BLVP implementation.
Since they require breaking changes, they were not implemented for
3.0.2.
{quote}
The following optimizations should be implemented to improve the performance of
BLVP:
* In line 212, BLVP should get the information whether the vertex was created
or retrieved. If it was created (i.e. it did not exist before) then we are
guaranteed that it cannot have any vertex properties. As such, the BLVP should
then just create the vertex properties without checking for their existence
first - this will be significantly faster.
* Similarly, when loading edges in the second iteration, it should first
compute this boolean variable {{requiresIncremental =
sourceVertex.edges(OUT).hasNext() && outV.edges(OUT).hasNext()}} and then only
do incremental loading on edges if this variable is true. If it is not true,
incremental loading (i.e. checking for edge existence) isn't necessary.
Both improvements together should dramatically improve the performance of BLVP
since it will require a read per edge/vertex property only in those cases where
a previous job failed. Under "normal" operational conditions it only requires
one read per vertex per iteration. That is, the reads scale in O(|V|) and not
O(|E|).
In addition, there should be an option for IncrementalBulkLoader so that it
does not attempt to update edges and vertex properties when those already
exist. In most cases, the edge will be identical when it has been loaded in a
previous job (since edge and property mutations are atomic in most graph
databases) and hence this check is unnecessary and being able to make it
optional can save time.
Note that these are important optimizations for large-scale graph databases
where bulk loading is necessary to get started.
{quote}
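
The checks described in the quote can be sketched with a toy in-memory graph. This is only an illustration under assumptions, not the BLVP code or the TinkerPop API: the class BulkLoadSketch and all of its members are hypothetical, edges are reduced to a target vertex id, the updateExisting flag stands in for the proposed IncrementalBulkLoader option, and requiresIncremental is read here as "the source vertex has out-edges to load and the vertex in the target graph already has out-edges".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy in-memory target graph. Hypothetical names throughout; not the TinkerPop API.
class BulkLoadSketch {
    final Map<String, Map<String, Object>> properties = new HashMap<>();
    final Map<String, Set<String>> outEdges = new HashMap<>();
    // Assumed option: when false, properties that already exist are left untouched.
    boolean updateExisting = false;
    // Counts per-property and per-edge existence reads against the target graph.
    int existenceChecks = 0;

    // Returns true if the vertex had to be created, false if it was retrieved.
    boolean getOrCreateVertex(String id) {
        boolean created = !properties.containsKey(id);
        if (created) {
            properties.put(id, new HashMap<>());
            outEdges.put(id, new HashSet<>());
        }
        return created;
    }

    void loadProperties(String id, Map<String, Object> source) {
        boolean created = getOrCreateVertex(id);
        Map<String, Object> target = properties.get(id);
        for (Map.Entry<String, Object> e : source.entrySet()) {
            if (created) {
                // A new vertex cannot have properties yet, so write without checking.
                target.put(e.getKey(), e.getValue());
                continue;
            }
            existenceChecks++;
            if (!target.containsKey(e.getKey()) || updateExisting) {
                target.put(e.getKey(), e.getValue());
            }
        }
    }

    void loadOutEdges(String fromId, Set<String> sourceTargets) {
        getOrCreateVertex(fromId);
        Set<String> existing = outEdges.get(fromId);
        // One read per vertex decides whether per-edge checks are needed at all.
        boolean requiresIncremental = !sourceTargets.isEmpty() && !existing.isEmpty();
        for (String toId : sourceTargets) {
            if (requiresIncremental) {
                existenceChecks++;
                if (existing.contains(toId)) {
                    continue; // edge was loaded by a previous (failed) job
                }
            }
            existing.add(toId);
        }
    }

    public static void main(String[] args) {
        BulkLoadSketch g = new BulkLoadSketch();
        Map<String, Object> props = new HashMap<>();
        props.put("name", "marko");
        Set<String> targets = new HashSet<>();
        targets.add("v2");
        targets.add("v3");

        g.loadProperties("v1", props);
        g.loadOutEdges("v1", targets);
        System.out.println("checks after first run: " + g.existenceChecks);

        g.loadProperties("v1", props);
        g.loadOutEdges("v1", targets);
        System.out.println("checks after rerun: " + g.existenceChecks);
    }
}
```

In this toy model, a load into an empty target graph performs no per-property or per-edge existence checks. Only a rerun over already-loaded data pays one check per property and per edge, which is the behaviour the quoted suggestions aim for.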