Right, that's the problem with an RDF model that only uses relationships to represent properties: you won't get the performance that you would get with a real property-graph model.
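To make the difference concrete, here is a minimal Cypher sketch; the Signal label and beatVariance property are hypothetical, borrowed from the predicate URI that appears in the import error further down the thread, and '...#signal' stands in for the full resource URI:

    // RDF 1:1: every literal is its own node, so reading one value costs a traversal
    MATCH (s {uri: '...#signal'})-[:`http://purl.org/ontology/echonest/beatVariance`]->(v)
    RETURN v.value;

    // Clean property graph: the value sits directly on the node
    MATCH (s:Signal {uri: '...#signal'})
    RETURN s.beatVariance;

On top of the extra hop per literal, the RDF-style model also inflates the node store and the caches with one extra node per value.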
I'll share the version separately.

Cheers, Michael

On Fri, Dec 12, 2014 at 12:07 PM, mohsen <[email protected]> wrote:

> I'd appreciate it if you could get me the newer version; I am already
> using 2.2.0-M01.
>
> I want to run some graph queries over my RDF data. First, I loaded the
> data into the Virtuoso triple store (which took 2-3 hours), but could not
> get results for my SPARQL queries in a reasonable time. That is why I
> decided to load the data into Neo4j, to be able to run my queries.
>
> I am importing RDF into Neo4j only for a specific research problem. I
> need to extract some patterns from the RDF data, and I have to write
> queries that require some sort of graph traversal. I don't want to do
> reasoning over the RDF data. The graph structure looks simple: nodes only
> have a Label (Uri or Literal) and a Value, and relationships don't have
> any properties.
>
> On Friday, December 12, 2014 2:41:36 AM UTC-8, Michael Hunger wrote:
>>
>> Your ids are UUIDs, right? So that's 36 chars (2 bytes each) -> 72 bytes
>> per id, and Neo4j ids are longs of 8 bytes, so ~80 bytes per entry for
>> 90M entries. You should allocate about 6G of heap.
>>
>> Btw., importing RDF 1:1 into Neo4j is not a good idea in the first place.
>>
>> You should model a clean property graph and import INTO that model.
>>
>> Re the batch-import: it's a bug that has been fixed after the milestone;
>> I'll try to get you a newer version to try.
>>
>> Cheers, Michael
>>
>> On Fri, Dec 12, 2014 at 11:26 AM, mohsen <[email protected]> wrote:
>>
>>> Thanks, Michael, for following up on my problem. With the Groovy
>>> script, the output was still at the nodes stage. It is not feasible to
>>> use an enum for the relationship types: the types are URIs of ontology
>>> predicates coming from the CSV file, and there are many of them.
>>> However, I think the problem is that this script requires more than
>>> 10GB of heap, because it needs to keep the nodes in memory (in a map)
>>> to use them later for creating the relationships. So I guess even
>>> reducing the mmio mapping size won't solve the problem; I will try it
>>> tomorrow, though.
>>>
>>> Regarding the batch-import command, do you have any idea why I am
>>> getting that error?
>>>
>>> On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:
>>>>
>>>> It would have been good if you had taken a thread dump of the Groovy
>>>> script.
>>>>
>>>> But if you look at the memory:
>>>>
>>>> off-heap (mmio) = 2+2+1+1 => 6
>>>> heap = 10
>>>>
>>>> That leaves nothing for the OS.
>>>>
>>>> Probably the heap is GC'ing heavily as well.
>>>>
>>>> So you have to reduce the mmio mapping size.
>>>>
>>>> Was the output still at nodes or already at rels?
>>>>
>>>> Perhaps also replace DynamicRelationshipType.withName(line.Type) with
>>>> an enum.
>>>>
>>>> You can also extend the trace to output the number of nodes and rels.
>>>>
>>>> Would you be able to share your csv files?
>>>>
>>>> Michael
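Spelled out against the 16G machine mentioned below, that budget works out roughly as follows (a back-of-the-envelope sum, not a measurement):

    memory-mapped store files (off-heap):  2 + 2 + 1 + 1  =   6G
    JVM heap (set via JAVA_OPTS):                              10G
                                                      total:  16G
    left for the OS, page cache, and everything else:         ~0G

With nothing left over, the OS starts paging and the import appears to hang, which matches the symptoms reported below.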
>>>> On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:
>>>>
>>>>> I could not load the data using Groovy either. I increased the Groovy
>>>>> heap size to 10G before running the script (using JAVA_OPTS). My
>>>>> machine has 16G of RAM. It halts after loading 41M rows from
>>>>> nodes.csv:
>>>>>
>>>>> log:
>>>>> ....
>>>>> 41200000 rows 38431 ms
>>>>> 41300000 rows 50988 ms
>>>>> 41400000 rows 63747 ms
>>>>> 41500000 rows 112758 ms
>>>>> 41600000 rows 326497 ms
>>>>>
>>>>> After logging 41,600,000 rows, nothing happened. I waited 2 hours and
>>>>> there was no progress. The process was still using CPU, but there was
>>>>> no free memory left at that point; I guess that's the reason. I have
>>>>> attached my Groovy script, where you can find the memory
>>>>> configuration. I suspect something goes wrong with memory, since it
>>>>> stopped exactly when all of my system's memory was used up.
>>>>>
>>>>> I then switched back to the batch-import tool with --stacktraces. I
>>>>> think the error I got last time was due to a too-small heap, because I
>>>>> did not get that error this time (after allocating a 10GB heap).
>>>>> Anyway, I have exactly 86983375 nodes, and it could load the nodes
>>>>> this time, but I got another error:
>>>>>
>>>>> Nodes
>>>>> [INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
>>>>>
>>>>> Calculate dense nodes
>>>>>> Import error: InputRelationship:
>>>>>>    properties: []
>>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>    type: http://purl.org/ontology/echonest/beatVariance
>>>>>> specified start node that hasn't been imported
>>>>>> java.lang.RuntimeException: InputRelationship:
>>>>>>    properties: []
>>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>    type: http://purl.org/ontology/echonest/beatVariance
>>>>>> specified start node that hasn't been imported
>>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>>>>>>     at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>>>>>>     at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>>>>>>     at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>>>>>> Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>>>>>>    properties: []
>>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>    type: http://purl.org/ontology/echonest/beatVariance
>>>>>> specified start node that hasn't been imported
>>>>>>     at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>>>>>>     at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>>>>>>     at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>     at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>>>>>
>>>>> It seems that it cannot find the start and end nodes of a
>>>>> relationship. However, both nodes exist in nodes.csv (I did a grep to
>>>>> be sure), so I don't know what goes wrong. Do you have any idea? Can
>>>>> it be related to the id of the start node
>>>>> "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?
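That "specified start node that hasn't been imported" error means the importer found no node whose id matches the relationship's startNode string exactly, byte for byte; the stack trace points at org.neo4j.tooling.ImportTool, which matches input purely on the raw id strings in the CSV files. A minimal sketch of the expected shape (the property column names are hypothetical; only :ID, :START_ID, :END_ID, and :TYPE are the tool's reserved headers):

    nodes.csv
    :ID,label,value
    file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal,Uri,
    82A4CB6E-7250-1634-DBB8-0297C5259BB1,Uri,

    rels.csv
    :START_ID,:END_ID,:TYPE
    file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal,82A4CB6E-7250-1634-DBB8-0297C5259BB1,http://purl.org/ontology/echonest/beatVariance

Differences a grep can hide are worth ruling out: surrounding quotes in one file but not the other, trailing whitespace or a stray \r on the id column, or one file URL-encoded (Music%20RDF) while the other contains the raw space.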
>>>>> On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger wrote:
>>>>>>
>>>>>> The Groovy one should work fine too. I wanted to augment the post
>>>>>> with a version that has @CompileStatic so that it's faster.
>>>>>>
>>>>>> I'd also be interested in the --stacktraces output of the
>>>>>> batch-import tool of Neo4j 2.2; perhaps you can let it run overnight
>>>>>> or in the background.
>>>>>>
>>>>>> Cheers, Michael
>>>>>>
>>>>>> On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> wrote:
>>>>>>
>>>>>>> I guess the core code for both batch-import and LOAD CSV is the
>>>>>>> same, so why do you think running it from Cypher (rather than
>>>>>>> through batch-import) helps? I am trying the Groovy batch-inserter
>>>>>>> <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy>
>>>>>>> now and will post how it goes.
>>>>>>>
>>>>>>> On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin wrote:
>>>>>>>>
>>>>>>>> I'd suggest you take a look at the last 5-7 posts in this recent
>>>>>>>> thread <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>.
>>>>>>>> Basically, you don't need any "batch import" command - I'd suggest
>>>>>>>> you just use the plain LOAD CSV functionality from Cypher and fill
>>>>>>>> your database step by step.
>>>>>>>>
>>>>>>>> WBR,
>>>>>>>> Andrii
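If you do go the LOAD CSV route, here is a minimal sketch of the node pass (the file path and the Id/Label/Value column names are assumptions based on the structure described earlier in the thread). The unique constraint gives you an index, so the relationship pass can look nodes up in the store instead of holding a ~90M-entry map on the heap:

    CREATE CONSTRAINT ON (r:Resource) ASSERT r.uri IS UNIQUE;

    USING PERIODIC COMMIT 10000
    LOAD CSV WITH HEADERS FROM 'file:///path/to/nodes.csv' AS row
    CREATE (:Resource {uri: row.Id, label: row.Label, value: row.Value});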
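The relationship pass could then look like the sketch below (same caveats about the column names). Note that plain Cypher cannot take a relationship type from a column value - that is what DynamicRelationshipType.withName(line.Type) does in the Groovy script - so this sketch stores the predicate URI as a property on a generic relationship type, and queries filter on that property instead:

    USING PERIODIC COMMIT 10000
    LOAD CSV WITH HEADERS FROM 'file:///path/to/rels.csv' AS row
    MATCH (a:Resource {uri: row.Start})
    MATCH (b:Resource {uri: row.End})
    CREATE (a)-[:PREDICATE {uri: row.Type}]->(b);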
