Thanks, Michael, for following up on my problem. In the Groovy script, the output was still on nodes (it never reached relationships). It is not feasible to use an enum for the relationship types: the types are URIs of ontology predicates coming from the CSV file, and there are many of them. However, I think the real problem is that this script requires more than 10GB of heap, because it needs to keep the nodes in memory (in a map) in order to create the relationships later. So I guess even reducing the mmio mapping size won't solve the problem, though I will try it tomorrow.
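Since an enum is out, one thing I can still do is intern the type objects in a map keyed by URI, so only one object exists per distinct predicate instead of one per CSV row. A minimal sketch of the pattern (in Python for illustration only; the real script is Groovy, and `RelType` here is just a stand-in for caching the result of `DynamicRelationshipType.withName(uri)`):

```python
# Illustrative only: RelType stands in for Neo4j's DynamicRelationshipType.
# In the Groovy script the same pattern would cache
# DynamicRelationshipType.withName(line.Type) per distinct URI.
class RelType:
    def __init__(self, name):
        self.name = name

_type_cache = {}

def rel_type(uri):
    """Return one shared RelType per distinct predicate URI."""
    t = _type_cache.get(uri)
    if t is None:
        t = RelType(uri)
        _type_cache[uri] = t
    return t

a = rel_type("http://purl.org/ontology/echonest/beatVariance")
b = rel_type("http://purl.org/ontology/echonest/beatVariance")
print(a is b)  # → True: one object per type, however many rows use it
```

This gives the same "one object per type" benefit as an enum without having to know the URIs in advance.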
Regarding the batch-import command, do you have any idea why I am getting that error?

On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:
>
> It would have been good if you had taken a thread dump from the groovy script.
>
> But if you look at the memory:
>
> off heap = 2+2+1+1 => 6
> heap = 10
> leaves nothing for the OS
>
> Probably the heap GCs as well.
>
> So you have to reduce the mmio mapping size.
>
> Was the output still with nodes or already rels?
>
> Perhaps also replace DynamicRelationshipType.withName(line.Type) with an enum.
>
> You can also extend trace to output the number of nodes and rels.
>
> Would you be able to share your csv files?
>
> Michael
>
> On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:
>
>> I could not load the data using Groovy either. I increased the Groovy heap size to 10G before running the script (using JAVA_OPTS). My machine has 16G of RAM. It halts after loading 41M rows from nodes.csv:
>>
>> log:
>> ....
>> 41200000 rows 38431 ms
>> 41300000 rows 50988 ms
>> 41400000 rows 63747 ms
>> 41500000 rows 112758 ms
>> 41600000 rows 326497 ms
>>
>> After logging 41,600,000 rows, nothing happened. I waited 2 hours and there was no progress. The process was still using CPU, but there was no free memory left at that point, which I guess is the reason. I have attached my Groovy script, where you can find the memory configuration. I suspect something goes wrong with memory, since it stopped when all of my system's memory was in use.
>>
>> I then switched back to the batch-import tool with --stacktraces. I think the error I got last time was due to a small heap size, because I did not get that error this time (after allocating a 10GB heap).
>> Anyway, I have exactly 86983375 nodes, and it could load the nodes this time, but I got another error:
>>
>> Nodes
>> [INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
>>
>> Calculate dense nodes
>>
>>> Import error: InputRelationship:
>>>    properties: []
>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>    type: http://purl.org/ontology/echonest/beatVariance
>>> specified start node that hasn't been imported
>>> java.lang.RuntimeException: InputRelationship:
>>>    properties: []
>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>    type: http://purl.org/ontology/echonest/beatVariance
>>> specified start node that hasn't been imported
>>>    at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>>>    at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>>>    at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>>>    at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>>> Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>>>    properties: []
>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>    type: http://purl.org/ontology/echonest/beatVariance
>>> specified start node that hasn't been imported
>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>>>    at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>>>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>    at java.lang.Thread.run(Thread.java:745)
>>>    at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>>
>> It seems that it cannot find the start or end node of a relationship. However, both nodes exist in nodes.csv (I did a grep to be sure), so I don't know what goes wrong. Do you have any idea? Could it be related to the id of the start node "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?
>>
>> On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger wrote:
>>>
>>> The Groovy one should work fine too. I wanted to augment the post with one that has @CompileStatic so that it's faster.
>>>
>>> I'd also be interested in the --stacktraces output of the batch-import tool of Neo4j 2.2; perhaps you can let it run overnight or in the background.
>>>
>>> Cheers, Michael
>>>
>>> On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> wrote:
>>>
>>>> I guess the core code for both batch-import and LOAD CSV is the same; why do you think running it from Cypher (rather than through batch-import) helps?
>>>> I am trying the Groovy batch-inserter <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy> now, and will post how it goes.
>>>>
>>>> On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin wrote:
>>>>>
>>>>> I'd suggest you take a look at the last 5-7 posts in this recent thread <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>. You basically don't need any "batch import" command - I'd suggest you use just the plain LOAD CSV functionality from Cypher, and fill your database step by step.
>>>>>
>>>>> WBR,
>>>>> Andrii

-- 
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
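P.S. Since a plain grep on files this size can be misleading (substring matches, trailing whitespace, quoting differences), a small script that streams both CSVs and reports any relationship endpoint that never appears as a node id is a more reliable check. A sketch in Python; the column names (`Id`, `Start`, `End`) are assumptions and would need adjusting to the real headers:

```python
import csv
import io

def missing_endpoints(nodes_csv, rels_csv,
                      id_col="Id", start_col="Start", end_col="End"):
    """Return (row number, column, id) for every relationship endpoint
    that never appears as a node id. The column names are assumptions;
    adjust them to the actual CSV headers."""
    node_ids = {row[id_col] for row in csv.DictReader(nodes_csv)}
    missing = []
    for n, row in enumerate(csv.DictReader(rels_csv), start=1):
        for col in (start_col, end_col):
            if row[col] not in node_ids:
                missing.append((n, col, row[col]))
    return missing

# Tiny self-contained demo; for the real files use open("nodes.csv") etc.
nodes = io.StringIO("Id\nA\nB\n")
rels = io.StringIO("Start,End,Type\nA,B,knows\nA,C,knows\n")
print(missing_endpoints(nodes, rels))  # → [(2, 'End', 'C')]
```

Holding all node ids in a Python set for ~87M rows will itself need several GB, but it is a one-off check, and exact set membership avoids the false positives grep can give.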
