I don't have any problem with sharing my CSV files, but I don't know how and where I can share such large files.
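If upload size is the only obstacle, gzip usually shrinks CSV text dramatically before sharing. A minimal sketch (the file names here are placeholders, not the actual files from the thread):

```python
import gzip
import shutil

def gzip_file(src, dst):
    """Compress src to dst with gzip; repetitive CSV text compresses well."""
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# Tiny demo file; substitute the real nodes.csv / rels.csv paths.
with open("demo.csv", "w") as f:
    f.write("id,name\n1,a\n2,b\n")
gzip_file("demo.csv", "demo.csv.gz")
```

`shutil.copyfileobj` streams in chunks, so this works for multi-gigabyte files without loading them into memory.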
On Friday, December 12, 2014 2:26:17 AM UTC-8, mohsen wrote:
>
> Thanks Michael for following my problem. In the Groovy script, the output
> still showed nodes. It is not feasible to use an enum for relationship
> types: the types are URIs of ontology predicates coming from the CSV file,
> and there are many of them. However, I think the problem is that this
> script requires more than 10GB of heap, because it needs to store the
> nodes in memory (in a map) to use them later when creating relationships.
> So I guess even reducing the mmio mapping size won't solve the problem; I
> will try it tomorrow, though.
>
> Regarding the batch-import command, do you have any idea why I am getting
> that error?
>
> On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:
>>
>> It would have been good if you had taken a thread dump from the Groovy
>> script.
>>
>> But if you look at the memory:
>>
>> off-heap = 2 + 2 + 1 + 1 => 6 GB
>> heap = 10 GB
>> leaves nothing for the OS
>>
>> Probably the heap GCs as well.
>>
>> So you have to reduce the mmio mapping size.
>>
>> Was the output still nodes, or already rels?
>>
>> Perhaps also replace DynamicRelationshipType.withName(line.Type) with an
>> enum.
>>
>> You can also extend the trace to output the number of nodes and rels.
>>
>> Would you be able to share your CSV files?
>>
>> Michael
>>
>> On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:
>>
>>> I could not load the data using Groovy either. I increased the Groovy
>>> heap size to 10G before running the script (using JAVA_OPTS). My
>>> machine has 16G of RAM. It halts after loading 41M rows from nodes.csv:
>>>
>>> log:
>>> ....
>>> 41200000 rows 38431 ms
>>> 41300000 rows 50988 ms
>>> 41400000 rows 63747 ms
>>> 41500000 rows 112758 ms
>>> 41600000 rows 326497 ms
>>>
>>> After logging 41,600,000 rows, nothing happened. I waited 2 hours and
>>> there was no progress. The process was still using CPU, but there was
>>> no free memory at that point. I guess that's the reason.
>>> I have attached my Groovy script, where you can find the memory
>>> configurations. I guess something goes wrong with memory, since it
>>> stopped when all of my system's memory was used.
>>>
>>> I then switched back to the batch-import tool with --stacktraces. I
>>> think the error I got last time was due to the small heap size, because
>>> I did not get that error this time (after allocating a 10GB heap).
>>> Anyway, I have exactly 86983375 nodes and it could load the nodes this
>>> time, but I got another error:
>>>
>>> Nodes
>>> [INPUT-------------|ENCODER-----------------------------------------|WRITER]
>>> 86M
>>>
>>> Calculate dense nodes
>>>> Import error: InputRelationship:
>>>>    properties: []
>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>    type: http://purl.org/ontology/echonest/beatVariance
>>>> specified start node that hasn't been imported
>>>> java.lang.RuntimeException: InputRelationship:
>>>>    properties: []
>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>    type: http://purl.org/ontology/echonest/beatVariance
>>>> specified start node that hasn't been imported
>>>>    at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>>>>    at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>>>>    at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>>>>    at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>>>> Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>>>>    properties: []
>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>    type: http://purl.org/ontology/echonest/beatVariance
>>>> specified start node that hasn't been imported
>>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>>>>    at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>>>>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>    at java.lang.Thread.run(Thread.java:745)
>>>>    at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>>>
>>> It seems that it cannot find the start and end nodes of a relationship.
>>> However, both nodes exist in nodes.csv (I did a grep to be sure), so I
>>> don't know what goes wrong. Do you have any idea? Can it be related to
>>> the id of the start node
>>> "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?
>>>
>>> On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger wrote:
>>>>
>>>> The Groovy one should work fine too.
>>>> I wanted to augment the post with one that has @CompileStatic so that
>>>> it's faster.
>>>>
>>>> I'd also be interested in the --stacktraces output of the batch-import
>>>> tool of Neo4j 2.2; perhaps you can let it run overnight or in the
>>>> background.
>>>>
>>>> Cheers, Michael
>>>>
>>>> On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> wrote:
>>>>
>>>>> I guess the core code for both batch-import and LOAD CSV is the same,
>>>>> so why do you think running it from Cypher (rather than through
>>>>> batch-import) helps? I am trying the Groovy batch-inserter
>>>>> <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy>
>>>>> now and will post how it goes.
>>>>>
>>>>> On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin wrote:
>>>>>>
>>>>>> I'd suggest you take a look at the last 5-7 posts in this recent
>>>>>> thread <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>.
>>>>>> You basically don't need any "batch import" command - I'd suggest
>>>>>> you use just the plain LOAD CSV functionality from Cypher, and you
>>>>>> will fill your database step by step.
>>>>>>
>>>>>> WBR,
>>>>>> Andrii
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Neo4j" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.
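Since grep found both ids in nodes.csv yet the importer reports them as missing, the mismatch is likely something grep won't reveal: trailing whitespace, quoting differences, or a BOM. A hedged pre-import check that compares relationship endpoints against node ids byte-for-byte; the column positions and file names are assumptions, not taken from the thread:

```python
import csv

def find_dangling_endpoints(nodes_path, rels_path,
                            node_id_col=0, start_col=0, end_col=1):
    """Report relationship endpoints whose id is absent from the node file.

    Ids are compared exactly (no trimming or unquoting beyond csv parsing),
    so an id that "looks" present to grep but carries trailing whitespace
    or different quoting is flagged here.
    """
    with open(nodes_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header row
        node_ids = {row[node_id_col] for row in reader}

    dangling = []
    with open(rels_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header row
        for lineno, row in enumerate(reader, start=2):
            for col in (start_col, end_col):
                if row[col] not in node_ids:
                    dangling.append((lineno, row[col]))
    return dangling
```

Running this over nodes.csv and rels.csv before the import, and printing `repr()` of each flagged id, should make an invisible difference (such as `'...#signal '` vs `'...#signal'`) obvious.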
