Re: [Neo4j] Re: Load very large CSV into Neo4j

mohsen Fri, 12 Dec 2014 14:33:32 -0800

Michael, I sent you a separate email with credentials to access the csv 
files. Thanks.


On Friday, December 12, 2014 12:26:51 PM UTC-8, mohsen wrote:
>
> Thanks for sharing the new version. Here is my memory info before running
> batch-import:
>
> Mem:  18404972k total,   549848k used, 17855124k free,    12524k buffers
>> Swap:  4063224k total,        0k used,  4063224k free,   211284k cached
>
>
> I assigned 11G for heap:  export JAVA_OPTS="$JAVA_OPTS -Xmx11G"
> I ran the batch-import at 11:13am, now it is 12:20pm and it seems that it 
> is stuck. Here is the log: 
>
> Nodes
>> [INPUT-------------------|NODE-------------------------------------------------|PROP|WRITER:
>>  
>> W:] 86M
>> Done in 15m 21s 150ms
>> Calculate dense nodes
>> [INPUT---------|PREPARE(2)====================================================================|]
>>  
>>   0
>
>
> And this is my memory info right now:
>
>> top - 12:22:43 up  1:34,  3 users,  load average: 0.00, 0.00, 0.00
>> Tasks: 134 total,   1 running, 133 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.3%us,  0.5%sy,  0.0%ni, 99.2%id,  0.0%wa,  0.0%hi,  0.0%si,  
>> 0.0%st
>> Mem:  18404972k total, 18244612k used,   160360k free,     6132k buffers
>> Swap:  4063224k total,        0k used,  4063224k free, 14089236k cached
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND      
>>                                                                             
>>                  
>>  4496 root      20   0 7598m 3.4g  15m S  3.3 19.4  20:35.88 java        
>
> It has been stuck in "Calculate dense nodes" for more than 40 minutes.
> Should I wait, or do I need to kill the process?
>
>  
>
> On Friday, December 12, 2014 3:13:15 AM UTC-8, Michael Hunger wrote:
>
>> Right, that's the problem with an RDF model that only uses relationships
>> to represent properties: you won't get the performance that you would get
>> with a real property-graph model.
>>
>> I'll share the version separately.
>>
>> Cheers, Michael
>>
>> On Fri, Dec 12, 2014 at 12:07 PM, mohsen <[email protected]> wrote:
>>
>>> I would appreciate it if you could get me the newer version; I am already
>>> using 2.2.0-M01.
>>>
>>> I want to run some graph queries over my rdf. First, I loaded my data 
>>> into Virtuoso triple store (took 2-3 hours), but could not get results for 
>>> my SPARQL queries in a reasonable time. That is the reason I decided to 
>>> load my data into Neo4j to be able to run my queries.
>>>
>>> I am importing RDF into Neo4j only for a specific research problem. I
>>> need to extract some patterns from the RDF data and I have to write queries
>>> that require some sort of graph traversal. I don't want to do reasoning
>>> over my RDF data. The graph structure looks simple: nodes only have a Label
>>> (Uri or Literal) and a Value, and relationships don't have any properties.
>>>
>>> On Friday, December 12, 2014 2:41:36 AM UTC-8, Michael Hunger wrote:
>>>>
>>>> Your ids are UUIDs or similar? So 36 chars -> 72 bytes each, and Neo ids
>>>> are longs with 8 bytes, so 80 bytes per entry, times 90M entries.
>>>> That should allocate about 6G of heap.
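
The estimate above can be worked through as a rough sketch. It assumes 2 bytes per character (as in a Java UTF-16 char), ignores per-object and hash-map overhead, and uses the round 90M figure from the message. The variable names are illustrative only.

```python
# Back-of-the-envelope size of an in-memory id map, using the figures quoted above.
# Assumption: 2 bytes per char; object headers and map overhead are ignored.
CHARS_PER_ID = 36          # length of a UUID string
BYTES_PER_CHAR = 2         # assumed UTF-16 storage
NEO_ID_BYTES = 8           # a Neo4j node id is a long
ENTRIES = 90_000_000       # roughly 90M nodes

bytes_per_entry = CHARS_PER_ID * BYTES_PER_CHAR + NEO_ID_BYTES
total_bytes = bytes_per_entry * ENTRIES
total_gib = total_bytes / 2**30

print(bytes_per_entry, "bytes per entry")
print(f"{total_bytes / 1e9:.1f} GB ({total_gib:.1f} GiB) in total")
```

Under these assumptions the raw data comes to about 7 GB, the same order of magnitude as the 6G figure quoted above. Real heap usage would be higher once object overhead is counted.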
>>>>
>>>> Btw. importing RDF 1:1 into Neo4j is not a good idea in the first place.
>>>>
>>>> You should model a clean property graph model and import INTO that 
>>>> model.
>>>>
>>>> As for the batch-import, it's a bug that was fixed after the
>>>> milestone; I'll try to get you a newer version to try.
>>>>
>>>> Cheers, Michael
>>>>
>>>>
>>>>
>>>> On Fri, Dec 12, 2014 at 11:26 AM, mohsen <[email protected]> wrote:
>>>>
>>>>> Thanks, Michael, for following my problem. In the Groovy script, the output
>>>>> was still at the nodes stage. It is not feasible to use an enum for the
>>>>> relationship types: the types are URIs of ontology predicates coming from
>>>>> the CSV file, and there are many of them. However, I think the problem is
>>>>> that this script requires more than 10GB of heap, because it needs to keep
>>>>> the nodes in memory (in a map) to use them later for creating relationships.
>>>>> So I guess even reducing the mmio mapping size won't solve the problem, but
>>>>> I will try it tomorrow.
>>>>>
>>>>> Regarding the batch-import command, do you have any idea why I am 
>>>>> getting that error? 
>>>>>
>>>>> On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:
>>>>>>
>>>>>> It would have been good if you had taken a thread dump from the 
>>>>>> groovy script.
>>>>>>
>>>>>> but if you look at the memory:
>>>>>>
>>>>>> off heap = 2+2+1+1 => 6
>>>>>> heap = 10
>>>>>> leaves nothing for OS
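
The budget above can be checked with a short sketch. It assumes the four off-heap figures are the memory-mapped store sizes from the attached Groovy script (which is not shown in this thread) and takes the 16G of RAM reported by the poster. The names are illustrative.

```python
# Memory budget for the Groovy run, using the numbers from this thread.
# Assumption: the four off-heap values are memory-mapped store sizes in GB.
TOTAL_RAM_GB = 16                  # machine RAM as reported by the poster
HEAP_GB = 10                       # JVM heap set through JAVA_OPTS
MMIO_GB = [2, 2, 1, 1]             # memory-mapped (off-heap) store files

off_heap_gb = sum(MMIO_GB)
left_for_os_gb = TOTAL_RAM_GB - HEAP_GB - off_heap_gb
print(f"off heap: {off_heap_gb} GB, heap: {HEAP_GB} GB, left for OS: {left_for_os_gb} GB")
```

With nothing left over, the OS and the page cache have to compete with the JVM for memory, which is consistent with the stall described in the thread.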
>>>>>>
>>>>>> probably the heap gc's as well.
>>>>>>
>>>>>> So you have to reduce the mmio mapping size
>>>>>>
>>>>>> Was the output still with nodes or already rels?
>>>>>>
>>>>>> Perhaps also replace DynamicRelationshipType.withName(line.Type) 
>>>>>> with an enum
>>>>>>
>>>>>> you can also extend trace to output number of nodes and rels
>>>>>>
>>>>>> Would you be able to share your csv files?
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:
>>>>>>
>>>>>>> I could not load the data using Groovy either. I increased the Groovy heap
>>>>>>> size to 10G before running the script (using JAVA_OPTS). My machine has 16G
>>>>>>> of RAM. It halts after loading about 41M rows from nodes.csv:
>>>>>>>
>>>>>>>
>>>>>>> log: 
>>>>>>> ....
>>>>>>> 41200000 rows 38431 ms
>>>>>>> 41300000 rows 50988 ms 
>>>>>>> 41400000 rows 63747 ms 
>>>>>>> 41500000 rows 112758 ms 
>>>>>>> 41600000 rows 326497 ms
>>>>>>>
>>>>>>> After logging 41,600,000 rows, nothing happened. I waited 2 hours and
>>>>>>> there was no progress. The process was still using CPU, but there was
>>>>>>> no free memory at that time, which I guess is the cause. I have attached
>>>>>>> my Groovy script, where you can find the memory configuration. I suspect
>>>>>>> a memory problem, since it stopped when all of my system's memory was used.
>>>>>>>
>>>>>>> I then switched back to the batch-import tool with stack traces enabled. I
>>>>>>> think the error I got last time was due to the small heap size, because I
>>>>>>> did not get that error this time (after allocating 10GB of heap). Anyway, I
>>>>>>> have exactly 86983375 nodes and it could load the nodes this time, but I
>>>>>>> got another error:
>>>>>>>
>>>>>>>  Nodes
>>>>>>>
>>>>>>> [INPUT-------------|ENCODER-----------------------------------------|WRITER]
>>>>>>>  
>>>>>>>> 86M
>>>>>>>
>>>>>>> Calculate dense nodes
>>>>>>>> Import error: InputRelationship:
>>>>>>>>    properties: []
>>>>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>>    type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> java.lang.RuntimeException: InputRelationship:
>>>>>>>>    properties: []
>>>>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>>    type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>>>>>>>> at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>>>>>>>> Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>>>>>>>>    properties: []
>>>>>>>>    startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>>    endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>>    type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>>>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>> at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>>>>>>>
>>>>>>>
>>>>>>> It seems that it cannot find the start and end nodes of a relationship.
>>>>>>> However, both nodes exist in nodes.csv (I did a grep to be sure), so I
>>>>>>> don't know what goes wrong. Do you have any idea? Can it be related to the
>>>>>>> id of the start node,
>>>>>>> "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?
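
One way to check this beyond grep is to compare ids exactly, since grep matches substrings and would not reveal differences in quoting, whitespace, or percent-encoding. The following is a minimal sketch, not the import tool's own logic. The column names `id`, `start`, and `end` are hypothetical and the real CSV headers may differ. For 86M nodes the id set would need several GB of RAM.

```python
import csv
import io

def missing_endpoints(nodes_file, rels_file, id_col="id", start_col="start", end_col="end"):
    """Return (column, value) pairs for relationship endpoints that are not exact node ids."""
    node_ids = {row[id_col] for row in csv.DictReader(nodes_file)}
    missing = []
    for row in csv.DictReader(rels_file):
        for col in (start_col, end_col):
            if row[col] not in node_ids:
                missing.append((col, row[col]))
    return missing

# Tiny in-memory sample instead of the real files (made-up data).
nodes = io.StringIO("id,label\nA,Uri\nB,Literal\n")
rels = io.StringIO("start,end,type\nA,B,knows\nA, B,knows\nC,A,knows\n")
result = missing_endpoints(nodes, rels)
print(result)
```

In the sample, the id " B" with a leading space is reported as missing even though a grep for "B" would find it. With real files, the two arguments would be file objects opened with `open(path, newline="")`.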
>>>>>>> On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> The groovy one should work fine too. I wanted to augment the post 
>>>>>>>> with one that has @CompileStatic so that it's faster. 
>>>>>>>>
>>>>>>>> I'd also be interested in the --stacktraces output of the
>>>>>>>> batch-import tool of Neo4j 2.2; perhaps you can let it run overnight or
>>>>>>>> in the background.
>>>>>>>>
>>>>>>>> Cheers, Michael
>>>>>>>>
>>>>>>>> On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I guess the core code for both batch-import and LOAD CSV is the
>>>>>>>>> same. Why do you think running it from Cypher (rather than through
>>>>>>>>> batch-import) helps? I am trying Groovy and batch-inserter
>>>>>>>>> <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy>
>>>>>>>>> now and will post how it goes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin 
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I'd suggest you take a look at the last 5-7 posts in this recent thread
>>>>>>>>>> <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>. You
>>>>>>>>>> basically don't need any "batch import" command. I'd suggest you just
>>>>>>>>>> use the plain LOAD CSV functionality from Cypher, and you will
>>>>>>>>>> fill your database step by step.
>>>>>>>>>>
>>>>>>>>>> WBR,
>>>>>>>>>> Andrii
>>>>>>>>>>
>>>>>>>>>  -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "Neo4j" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to [email protected].
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>

