[ 
https://issues.apache.org/jira/browse/TINKERPOP3-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876857#comment-14876857
 ] 

Matthias Broecheler commented on TINKERPOP3-319:
------------------------------------------------

The following optimizations should be implemented to improve the performance of 
BLVP:

* In line 212, BLVP should get the information whether the vertex was created 
or retrieved. If it was created (i.e. it did not exist before) then we are 
guaranteed that it cannot have any vertex properties. As such, the BLVP should 
then just create the vertex properties without checking for their existence 
first - this will be significantly faster.
* Similarly, when loading edges in the second iteration, it should first 
compute the boolean variable {{requiresIncremental = 
sourceVertex.edges(OUT).hasNext() && outV.edges(OUT).hasNext()}} and then only 
do incremental loading on edges if this variable is true. If it is false, 
incremental loading (i.e. checking for edge existence) isn't necessary.
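
A rough sketch of the first point, assuming nothing about the actual BLVP code: the class below is a hypothetical in-memory stand-in for the target graph (plain Java maps, not the TinkerPop API), and names such as {{Lookup}}, {{getOrCreate}} and {{existenceChecks}} are made up for illustration. It only shows how a created/retrieved flag lets the loader skip the per-property existence check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical in-memory stand-in for the target graph; not the TinkerPop API.
final class VertexPropertyLoadSketch {

    /** Result of a get-or-create lookup: the property map plus a "was created" flag. */
    static final class Lookup {
        final Map<String, Object> properties;
        final boolean created;

        Lookup(Map<String, Object> properties, boolean created) {
            this.properties = properties;
            this.created = created;
        }
    }

    private final Map<Object, Map<String, Object>> vertices = new HashMap<>();
    int existenceChecks = 0; // number of per-property reads against the target

    Lookup getOrCreate(Object id) {
        Map<String, Object> existing = vertices.get(id);
        if (existing != null) {
            return new Lookup(existing, false);
        }
        Map<String, Object> fresh = new HashMap<>();
        vertices.put(id, fresh);
        return new Lookup(fresh, true);
    }

    void loadProperties(Object id, Map<String, Object> sourceProperties) {
        Lookup target = getOrCreate(id);
        for (Map.Entry<String, Object> e : sourceProperties.entrySet()) {
            if (target.created) {
                // A vertex that did not exist before cannot have properties: write blindly.
                target.properties.put(e.getKey(), e.getValue());
            } else {
                // Retrieved vertex: fall back to the incremental check-then-write path.
                existenceChecks++;
                if (!Objects.equals(e.getValue(), target.properties.get(e.getKey()))) {
                    target.properties.put(e.getKey(), e.getValue());
                }
            }
        }
    }

    public static void main(String[] args) {
        VertexPropertyLoadSketch target = new VertexPropertyLoadSketch();
        Map<String, Object> props = new HashMap<>();
        props.put("name", "marko");
        target.loadProperties(1, props); // first run: vertex is created, no checks
        target.loadProperties(1, props); // rerun: vertex exists, checks are needed
        System.out.println("existence checks: " + target.existenceChecks);
    }
}
```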

Both improvements together should dramatically improve the performance of BLVP 
since it will require a read per edge/vertex property only in those cases where 
a previous job failed. Under "normal" operational conditions it only requires 
one read per vertex per iteration. That is, the reads scale in O(|V|) and not 
O(|E|).
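
To make the read counts concrete, here is a small sketch of the edge side, again against a hypothetical in-memory target (plain Java collections, not the TinkerPop API). Edges are reduced to string keys, and {{vertexReads}} / {{edgeReads}} are illustrative counters. The {{requiresIncremental}} flag follows the expression given above: it is true only if both the source vertex and its counterpart in the target have outgoing edges.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the target graph; not the TinkerPop API.
final class EdgeLoadSketch {
    // out-vertex id -> keys of its outgoing edges (e.g. "knows->b")
    private final Map<Object, List<String>> outEdges = new HashMap<>();
    int vertexReads = 0; // one per vertex per iteration
    int edgeReads = 0;   // per-edge existence checks, only on the incremental path

    void loadOutEdges(Object vertexId, List<String> sourceEdges) {
        vertexReads++;
        List<String> existing = outEdges.computeIfAbsent(vertexId, k -> new ArrayList<>());
        // Incremental loading is needed only if both sides already have out-edges.
        boolean requiresIncremental = !sourceEdges.isEmpty() && !existing.isEmpty();
        for (String edge : sourceEdges) {
            if (requiresIncremental) {
                edgeReads++;
                if (existing.contains(edge)) {
                    continue; // already loaded by a previous (failed) job
                }
            }
            existing.add(edge);
        }
    }

    int outEdgeCount(Object vertexId) {
        List<String> edges = outEdges.get(vertexId);
        return edges == null ? 0 : edges.size();
    }

    public static void main(String[] args) {
        EdgeLoadSketch target = new EdgeLoadSketch();
        List<String> edges = new ArrayList<>();
        edges.add("knows->b");
        edges.add("knows->c");
        target.loadOutEdges("a", edges); // clean run: no per-edge reads
        target.loadOutEdges("a", edges); // rerun after a failure: per-edge reads
        System.out.println("vertex reads: " + target.vertexReads
                + ", edge reads: " + target.edgeReads);
    }
}
```

On a clean first load the per-edge counter stays at zero and only the per-vertex read remains. The per-edge checks appear only when the target vertex already carries edges from an earlier run.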

In addition, there should be an option for IncrementalBulkLoader so that it 
does not attempt to update edges and vertex properties when those already 
exist. In most cases, the edge will be identical when it has been loaded in a 
previous job (since edge and property mutations are atomic in most graph 
databases), so this check is unnecessary. Making it optional can save time.
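
One possible shape for such an option, as a sketch only: the flag name {{updateExisting}} and the class below are hypothetical and are not taken from IncrementalBulkLoader. The target is again a plain in-memory map rather than a real graph.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical loader with a switch for updating already-existing edges.
final class IncrementalLoaderOptionSketch {
    private final boolean updateExisting; // assumed option name, for illustration
    private final Map<String, Map<String, Object>> edges = new HashMap<>();
    int propertyComparisons = 0; // work spent comparing existing edge properties

    IncrementalLoaderOptionSketch(boolean updateExisting) {
        this.updateExisting = updateExisting;
    }

    void getOrCreateEdge(String edgeKey, Map<String, Object> sourceProperties) {
        Map<String, Object> existing = edges.get(edgeKey);
        if (existing == null) {
            edges.put(edgeKey, new HashMap<>(sourceProperties));
            return;
        }
        if (!updateExisting) {
            return; // trust the edge written by the previous job
        }
        for (Map.Entry<String, Object> e : sourceProperties.entrySet()) {
            propertyComparisons++;
            if (!Objects.equals(e.getValue(), existing.get(e.getKey()))) {
                existing.put(e.getKey(), e.getValue());
            }
        }
    }

    Object property(String edgeKey, String name) {
        Map<String, Object> existing = edges.get(edgeKey);
        return existing == null ? null : existing.get(name);
    }

    public static void main(String[] args) {
        IncrementalLoaderOptionSketch loader = new IncrementalLoaderOptionSketch(false);
        Map<String, Object> props = new HashMap<>();
        props.put("weight", 0.5);
        loader.getOrCreateEdge("a-knows->b", props);
        loader.getOrCreateEdge("a-knows->b", props); // existing edge is left untouched
        System.out.println("comparisons: " + loader.propertyComparisons);
    }
}
```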

> BulkLoaderVertexProgram for generalized batch loading across graphs
> -------------------------------------------------------------------
>
>                 Key: TINKERPOP3-319
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP3-319
>             Project: TinkerPop 3
>          Issue Type: Improvement
>          Components: process
>    Affects Versions: 3.0.1-incubating
>            Reporter: Marko A. Rodriguez
>            Assignee: Daniel Kuppitz
>             Fix For: 3.1.0-incubating, 3.0.2-incubating
>
>
> After working on {{BulkLoaderVertexProgram}} for Titan, it is trivial to add 
> this generally to TinkerPop -- equivalent to BlueprintsOutputFormat (or 
> whatever the Blueprints-specific bulk loader was called). However, 
> given that Titan and TinkerPop have the same data model, Titan having its own 
> {{BulkLoaderVertexProgram}} isn't necessary as there is no longer a data 
> model alignment issue. The difference would be that instead of:
> {code:groovy}
> g.V.compute().program(BulkLoaderVertexProgram.build().titan(propertiesFile).create()).submit()
> {code}
> It would simply be:
> {code:groovy}
> g.V.compute().program(BulkLoaderVertexProgram.build().factory(propertiesFile).create()).submit()
> {code}
> ...and {{BulkLoaderVertexProgram}} would use {{GraphFactory.open()}} to 
> instantiate the connection to the graph. Moreover, (and [~spmallette] will 
> need to clear my head here), if the factory opened up a Gremlin Server 
> connection, then we get parallel writing to embedded graph databases like 
> Neo4j.
> {{BulkLoaderVertexProgram}} is simply a vertex program that parallel loads a 
> graph (with a graph computer) to any other graph that can be accessed via 
> {{GraphFactory}} (which is every TP3 graph).
> [~dalaro] @mbroecheler [~dkuppitz] 
> EXTENDED NOTES:
> * {{SchemaInference}} would be a MapReduce job executed prior to 
> {{BulkLoaderVertexProgram}}
> * Titan and Neo4j can each have their own {{SchemaInference}} implementations.
> * Incremental loading .... I forget how this worked.
> * Bulk mutations ... this is possible at the TP3 level with hidden properties 
> and smart add/remove/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
