I'd appreciate it if you could get me the newer version; I am currently using 2.2.0-M01.

I want to run some graph queries over my RDF data. First, I loaded the data into 
the Virtuoso triple store (loading took 2-3 hours), but I could not get results 
for my SPARQL queries in a reasonable time. That is why I decided to load my 
data into Neo4j instead, to be able to run my queries.

I am importing RDF into Neo4j only for a specific research problem. I need to 
extract some patterns from the RDF data, and I have to write queries that 
require some sort of graph traversal. I don't want to do reasoning over my RDF 
data. The graph structure looks simple: nodes only have a Label (Uri or 
Literal) and a Value, and relationships don't have any properties.
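Concretely, the conversion I have in mind looks roughly like this (a minimal sketch with a deliberately naive N-Triples splitter; the file names and column headers are just what my scripts assume, not anything the import tool mandates):

```python
import csv

def split_triple(line):
    """Naive splitter for '<s> <p> <o> .' / '<s> <p> "literal" .' lines.
    Real RDF needs a proper parser; this only illustrates the target layout."""
    subj, pred, rest = line.split(" ", 2)
    obj = rest.rstrip(" .\n")
    return subj.strip("<>"), pred.strip("<>"), obj

def rdf_to_csv(triples, nodes_path="nodes.csv", rels_path="rels.csv"):
    nodes = {}   # id -> (Label, Value): subjects/URI objects are Uri, quoted objects Literal
    rels = []    # (start, end, Type): relationship type is the predicate URI, no properties
    for line in triples:
        s, p, o = split_triple(line)
        nodes.setdefault(s, ("Uri", s))
        if o.startswith('"'):
            o = o.strip('"')
            nodes.setdefault(o, ("Literal", o))
        else:
            o = o.strip("<>")
            nodes.setdefault(o, ("Uri", o))
        rels.append((s, o, p))
    with open(nodes_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id", "Label", "Value"])
        for nid, (label, value) in nodes.items():
            w.writerow([nid, label, value])
    with open(rels_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["start", "end", "Type"])
        w.writerows(rels)
    return len(nodes), len(rels)
```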

On Friday, December 12, 2014 2:41:36 AM UTC-8, Michael Hunger wrote:
>
> Your ids are UUIDs, right? So each entry is 36 chars * 2 bytes = 72 bytes, 
> plus an 8-byte Neo4j long id: about 80 bytes per entry, for 90M entries.
> You should allocate about 6G of heap.
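That estimate can be sanity-checked; the only assumption here is Java's 2 bytes per String char, and it counts only the raw id data (real map overhead would push it higher):

```python
UUID_CHARS = 36
BYTES_PER_CHAR = 2          # Java String chars are UTF-16 code units
NEO4J_ID_BYTES = 8          # a long
ENTRIES = 90_000_000

per_entry = UUID_CHARS * BYTES_PER_CHAR + NEO4J_ID_BYTES   # 80 bytes
total_gb = per_entry * ENTRIES / 1024**3                   # ~6.7 GB

print(per_entry, round(total_gb, 1))
```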
>
> Btw, importing RDF 1:1 into Neo4j is not a good idea in the first place.
>
> You should model a clean property graph and import INTO that model.
>
> As for the batch-import, it's a bug that has been fixed after the milestone; 
> I'll try to get you a newer version to try.
>
> Cheers, Michael
>
>
>
> On Fri, Dec 12, 2014 at 11:26 AM, mohsen <[email protected]> wrote:
>
>> Thanks Michael for following up on my problem. In the Groovy script, the 
>> output was still at the node stage. It is not feasible to use an enum for 
>> relationship types: the types are URIs of ontology predicates coming from 
>> the CSV file, and there are many of them. However, I think the problem is 
>> that this script requires more than 10GB of heap, because it needs to keep 
>> the nodes in an in-memory map to use them later when creating relationships. 
>> So I guess even reducing the mmio mapping size won't solve the problem; I 
>> will try it tomorrow, though.
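A rough back-of-the-envelope for that node map supports this. The overhead constants below are typical HotSpot ballpark figures, not measurements, and the URI keys are longer than the UUID-sized keys assumed here, so the real total is likely even higher:

```python
ENTRIES = 87_000_000              # ~86,983,375 nodes
KEY_STRING_BYTES = 36 * 2 + 40    # UUID-length chars + String/char[] object headers (assumed)
ENTRY_OVERHEAD = 48               # HashMap.Entry + boxed long value, roughly (assumed)

per_entry = KEY_STRING_BYTES + ENTRY_OVERHEAD   # ~160 bytes
total_gb = per_entry * ENTRIES / 1024**3        # ~13 GB, already more than a 10GB heap

print(per_entry, round(total_gb, 1))
```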
>>
>> Regarding the batch-import command, do you have any idea why I am getting 
>> that error? 
>>
>> On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:
>>>
>>> It would have been good if you had taken a thread dump of the Groovy 
>>> script.
>>>
>>> But if you look at the memory:
>>>
>>> off heap = 2+2+1+1 => 6
>>> heap = 10
>>> that leaves nothing for the OS
>>>
>>> The heap is probably GC'ing heavily as well.
>>>
>>> So you have to reduce the mmio mapping size.
>>>
>>> Was the output still at nodes, or already at rels?
>>>
>>> Perhaps also replace DynamicRelationshipType.withName(line.Type) with an 
>>> enum.
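As I understand the enum suggestion, the point is to avoid constructing a fresh relationship-type object for every CSV row. When the types aren't known up front (as with predicate URIs from the data), a cache gives the same effect; sketched in Python here only because the actual script is Groovy:

```python
from functools import lru_cache

class RelType:
    """Stand-in for a relationship type object: one instance per distinct name."""
    def __init__(self, name):
        self.name = name

@lru_cache(maxsize=None)
def with_name(name):
    # Each distinct predicate URI yields exactly one RelType instance,
    # instead of a new object being allocated per CSV row.
    return RelType(name)

a = with_name("http://purl.org/ontology/echonest/beatVariance")
b = with_name("http://purl.org/ontology/echonest/beatVariance")
assert a is b   # cached: the same instance is reused across rows
```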
>>>
>>> You can also extend the trace output to include the number of nodes and 
>>> rels.
>>>
>>> Would you be able to share your csv files?
>>>
>>> Michael
>>>
>>>
>>>
>>> On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:
>>>
>>>> I could not load the data using Groovy either. I increased the Groovy heap 
>>>> size to 10G before running the script (via JAVA_OPTS). My machine has 16G 
>>>> of RAM. It halted after loading ~41M rows from nodes.csv:
>>>>
>>>>
>>>> log: 
>>>> ....
>>>> 41200000 rows 38431 ms
>>>> 41300000 rows 50988 ms 
>>>> 41400000 rows 63747 ms 
>>>> 41500000 rows 112758 ms 
>>>> 41600000 rows 326497 ms
>>>>
>>>> After logging 41,600,000 rows, nothing happened. I waited 2 hours and 
>>>> there was no progress. The process was still using CPU, but there was no 
>>>> free memory left at that point, which I suspect is the reason. I have 
>>>> attached my Groovy script, where you can find the memory configuration. I 
>>>> guess something goes wrong with memory, since it stopped exactly when all 
>>>> my system's memory was used up.
>>>>
>>>> I then switched back to the batch-import tool with --stacktrace. I think 
>>>> the error I got last time was due to the small heap size, because I did 
>>>> not get that error this time (after allocating a 10GB heap). Anyway, I 
>>>> have exactly 86,983,375 nodes, and it could load the nodes this time, but 
>>>> I got another error:
>>>>
>>>> Nodes
>>>> [INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
>>>>
>>>> Calculate dense nodes
>>>>> Import error: InputRelationship:
>>>>>    properties: []
>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>    type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>> java.lang.RuntimeException: InputRelationship:
>>>>>    properties: []
>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>    type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>    at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>>>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>>>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>>>>>    at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>>>>>    at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>>>>>    at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>>>>>    at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>>>>> Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>>>>>    properties: []
>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>    type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>>>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>>>>>    at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>>>>>    at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>>>>>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>    at java.lang.Thread.run(Thread.java:745)
>>>>>    at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>>>>
>>>>
>>>> It seems that it cannot find the start node of the relationship. However, 
>>>> both nodes exist in nodes.csv (I did a grep to be sure), so I don't know 
>>>> what goes wrong. Do you have any idea? Could it be related to the id of 
>>>> the start node, "file:///Users/mohsen/Desktop/
>>>> Music%20RDF/echonest/analyze-example.rdf#signal"?
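One thing worth checking (my own guess, nothing confirmed): if that id were percent-encoded in one file and decoded in the other, an exact-string id match during import would fail, even though a grep for either form would still find a node. For example:

```python
from urllib.parse import quote, unquote

encoded = "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"
decoded = unquote(encoded)   # the %20 becomes a literal space

# Presumably the importer matches node ids by exact string equality,
# so these count as two different ids despite naming the same resource.
assert encoded != decoded
assert quote(decoded, safe=":/#") == encoded   # re-encoding recovers the original
```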
>>>> On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger wrote:
>>>>>
>>>>> The Groovy one should work fine too. I wanted to augment the post with a 
>>>>> version that has @CompileStatic so that it's faster. 
>>>>>
>>>>> I'd also be interested in the --stacktraces output of the batch-import 
>>>>> tool of Neo4j 2.2; perhaps you can let it run overnight or in the 
>>>>> background.
>>>>>
>>>>> Cheers, Michael
>>>>>
>>>>> On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> wrote:
>>>>>
>>>>>> I guess the core code for both batch-import and LOAD CSV is the same, so 
>>>>>> why do you think running it from Cypher (rather than through 
>>>>>> batch-import) would help? I am trying the groovy and batch-inserter 
>>>>>> <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy>
>>>>>> approach now, and will post how it goes.
>>>>>>
>>>>>>
>>>>>> On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin wrote:
>>>>>>>
>>>>>>> I'd suggest you take a look at the last 5-7 posts in this recent thread 
>>>>>>> <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>. You 
>>>>>>> basically don't need any "batch import" command; I'd suggest just using 
>>>>>>> the plain LOAD CSV functionality from Cypher, and you can fill your 
>>>>>>> database step by step.
>>>>>>>
>>>>>>> WBR,
>>>>>>> Andrii
>>>>>>>
>>>>>>  -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "Neo4j" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>
>
>
