With Michael's help, I could finally load the data using the batch-import command in Neo4j 2.2.0-M02. It took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was a stray " in some values, which was interpreted as a quotation character to be included in the field value, and this messed up everything from that point forward.
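One way to work around such stray quotes before importing is to re-quote the CSV so that embedded " characters are doubled per RFC 4180 and the importer reads them as data. A minimal Python sketch; the sample content is illustrative, not the actual files from this thread:

```python
import csv
import io

# Illustrative input: a field value containing a stray double quote,
# the kind of character that derailed the batch import.
raw = 'id,value\nn1,a "quoted" literal\nn2,plain\n'

# Read with quoting disabled so stray quotes stay literal characters,
# then write back with full quoting so embedded quotes are doubled
# (escaped per RFC 4180).
src = io.StringIO(raw)
dst = io.StringIO()
reader = csv.reader(src, quoting=csv.QUOTE_NONE)
writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
for row in reader:
    writer.writerow(row)

cleaned = dst.getvalue()
```

For the real files you would stream from `open(...)` to `open(...)` instead of `io.StringIO`; the round trip leaves every field value intact while producing quoting the importer can parse.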
On Friday, December 12, 2014 2:33:14 PM UTC-8, mohsen wrote:

> Michael, I sent you a separate email with credentials to access the csv files. Thanks.
>
> On Friday, December 12, 2014 12:26:51 PM UTC-8, mohsen wrote:
>>
>> Thanks for sharing the new version. Here is my memory info before running batch-import:
>>
>>> Mem: 18404972k total, 549848k used, 17855124k free, 12524k buffers
>>> Swap: 4063224k total, 0k used, 4063224k free, 211284k cached
>>
>> I assigned 11G for heap: export JAVA_OPTS="$JAVA_OPTS -Xmx11G"
>> I ran the batch-import at 11:13am; now it is 12:20pm and it seems to be stuck. Here is the log:
>>
>>> Nodes
>>> [INPUT-------------------|NODE-------------------------------------------------|PROP|WRITER: W:] 86M
>>> Done in 15m 21s 150ms
>>> Calculate dense nodes
>>> [INPUT---------|PREPARE(2)====================================================================|] 0
>>
>> And this is my memory info right now:
>>
>>> top - 12:22:43 up 1:34, 3 users, load average: 0.00, 0.00, 0.00
>>> Tasks: 134 total, 1 running, 133 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 0.3%us, 0.5%sy, 0.0%ni, 99.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>> Mem: 18404972k total, 18244612k used, 160360k free, 6132k buffers
>>> Swap: 4063224k total, 0k used, 4063224k free, 14089236k cached
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 4496 root 20 0 7598m 3.4g 15m S 3.3 19.4 20:35.88 java
>>
>> It has been stuck in Calculate Dense Nodes for more than 40 minutes now. Should I wait, or do I need to kill the process?
>>
>> On Friday, December 12, 2014 3:13:15 AM UTC-8, Michael Hunger wrote:
>>
>>> Right, that's the problem with an RDF model that only uses relationships to represent properties: you won't get the performance that you would get with a real property-graph model.
>>>
>>> I'll share the version separately.
>>> Cheers, Michael
>>>
>>> On Fri, Dec 12, 2014 at 12:07 PM, mohsen <[email protected]> wrote:
>>>
>>>> I would appreciate it if you could get me the newer version; I am already using 2.2.0-M01.
>>>>
>>>> I want to run some graph queries over my RDF data. First, I loaded my data into the Virtuoso triple store (it took 2-3 hours), but I could not get results for my SPARQL queries in a reasonable time. That is why I decided to load my data into Neo4j, to be able to run my queries there.
>>>>
>>>> I am importing RDF into Neo4j only for a specific research problem. I need to extract some patterns from the RDF data, and I have to write queries that require some sort of graph traversal. I don't want to do reasoning over my RDF data. The graph structure looks simple: nodes only have a Label (Uri or Literal) and a Value, and relationships don't have any properties.
>>>>
>>>> On Friday, December 12, 2014 2:41:36 AM UTC-8, Michael Hunger wrote:
>>>>>
>>>>> Your ids are UUIDs, right? So 36 chars * 90M -> 72 bytes each, and Neo-ids are longs w/ 8 bytes, so 80 bytes per entry. Should allocate about 6G heap.
>>>>>
>>>>> Btw. importing RDF 1:1 into Neo4j is not a good idea in the first place. You should model a clean property-graph model and import INTO that model.
>>>>>
>>>>> As for the batch-import, it's a bug that has been fixed after the milestone; I'll try to get you a newer version to try.
>>>>>
>>>>> Cheers, Michael
>>>>>
>>>>> On Fri, Dec 12, 2014 at 11:26 AM, mohsen <[email protected]> wrote:
>>>>>
>>>>>> Thanks, Michael, for following up on my problem. In the groovy script, the output was still with nodes. It is not feasible to use an enum for relationship types: the types are URIs of ontology predicates coming from the CSV file, and there are many of them. However, I think the problem is that this script requires more than 10GB heap, because it needs to store the nodes in memory (in a map) to use them later for creating relationships. So I guess even reducing the mmio mapping size won't solve the problem; I will try it tomorrow, though.
>>>>>>
>>>>>> Regarding the batch-import command, do you have any idea why I am getting that error?
>>>>>>
>>>>>> On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:
>>>>>>>
>>>>>>> It would have been good if you had taken a thread dump from the groovy script.
>>>>>>>
>>>>>>> But if you look at the memory:
>>>>>>>
>>>>>>> off heap = 2+2+1+1 => 6
>>>>>>> heap = 10
>>>>>>> leaves nothing for the OS
>>>>>>>
>>>>>>> Probably the heap GCs as well. So you have to reduce the mmio mapping size.
>>>>>>>
>>>>>>> Was the output still with nodes or already rels?
>>>>>>>
>>>>>>> Perhaps also replace DynamicRelationshipType.withName(line.Type) with an enum.
>>>>>>>
>>>>>>> You can also extend trace to output the number of nodes and rels.
>>>>>>>
>>>>>>> Would you be able to share your csv files?
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:
>>>>>>>
>>>>>>>> I could not load the data using Groovy either. I increased the groovy heap size to 10G before running the script (using JAVA_OPTS). My machine has 16G of RAM. It halts after loading 41M rows from nodes.csv:
>>>>>>>>
>>>>>>>> log:
>>>>>>>> ....
>>>>>>>> 41200000 rows 38431 ms
>>>>>>>> 41300000 rows 50988 ms
>>>>>>>> 41400000 rows 63747 ms
>>>>>>>> 41500000 rows 112758 ms
>>>>>>>> 41600000 rows 326497 ms
>>>>>>>>
>>>>>>>> After logging 41,600,000 rows, nothing happened. I waited 2 hours and there was no progress. The process was still taking CPU, but there was NOT any free memory at that time. I guess that's the reason for it.
>>>>>>>> I have attached my groovy script, where you can find the memory configurations. I guess something goes wrong with memory, since it stopped when all my system's memory was used.
>>>>>>>>
>>>>>>>> I then switched back to the batch-import tool with stacktraces enabled. I think the error I got last time was due to a small heap size, because I did not get that error this time (after allocating 10GB heap). Anyway, I have exactly 86983375 nodes and it could load the nodes this time, but I got another error:
>>>>>>>>
>>>>>>>> Nodes
>>>>>>>> [INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
>>>>>>>>
>>>>>>>> Calculate dense nodes
>>>>>>>> Import error: InputRelationship:
>>>>>>>> properties: []
>>>>>>>> startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>> endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>> type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> java.lang.RuntimeException: InputRelationship:
>>>>>>>> properties: []
>>>>>>>> startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>> endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>> type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>>>>>>>> at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>>>>>>>> Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>>>>>>>> properties: []
>>>>>>>> startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>> endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>> type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>>>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>> at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>>>>>>>>
>>>>>>>> It seems that it cannot find the start and end node of a relationship. However, both nodes exist in nodes.csv (I did a grep to be sure), so I don't know what goes wrong. Do you have any idea? Can it be related to the id of the start node "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?
>>>>>>>>
>>>>>>>> On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger wrote:
>>>>>>>>>
>>>>>>>>> The groovy one should work fine too. I wanted to augment the post with one that has @CompileStatic so that it's faster.
>>>>>>>>>
>>>>>>>>> I'd also be interested in the --stacktraces output of the batch-import tool of Neo4j 2.2; perhaps you can let it run overnight or in the background.
>>>>>>>>>
>>>>>>>>> Cheers, Michael
>>>>>>>>>
>>>>>>>>> On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I guess the core code for both batch-import and LOAD CSV is the same, so why do you think running it from Cypher (rather than through batch-import) helps? I am trying groovy and the batch-inserter <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy> now; I will post how it goes.
>>>>>>>>>>
>>>>>>>>>> On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin wrote:
>>>>>>>>>>>
>>>>>>>>>>> I'd suggest you take a look at the last 5-7 posts in this recent thread <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>.
>>>>>>>>>>> You basically don't need any "batch import" command - I'd suggest you just use the plain LOAD CSV functionality from Cypher, and you will fill your database step by step.
>>>>>>>>>>>
>>>>>>>>>>> WBR,
>>>>>>>>>>> Andrii

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
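The "specified start node that hasn't been imported" error in the thread can be reproduced outside the importer with a quick consistency check: collect every node id, then report relationship endpoints that never appear as a node. A minimal sketch with hypothetical miniature stand-ins for nodes.csv and rels.csv (the real files and header names may differ):

```python
import csv
import io

# Hypothetical miniature inputs; the real CSV files from the thread
# use URI/UUID ids and different headers.
nodes_csv = "id\nn1\nn2\n"
rels_csv = "start,end,type\nn1,n2,KNOWS\nn3,n2,KNOWS\n"

def missing_endpoints(nodes_text, rels_text):
    """Return (start, end) pairs whose start or end id never appears
    as a node id -- the condition batch-import rejects with
    'specified start node that hasn't been imported'."""
    nodes = csv.reader(io.StringIO(nodes_text))
    next(nodes)  # skip header
    node_ids = {row[0] for row in nodes}
    rels = csv.reader(io.StringIO(rels_text))
    next(rels)  # skip header
    return [(s, e) for s, e, _ in rels
            if s not in node_ids or e not in node_ids]
```

For the real 86M-node files you would stream from `open(...)` instead; a set of 90M string ids fits in a few GB of RAM. Note that an exact-match check like this is stricter than the grep mentioned above: grep confirms only a substring match, so a trailing space, a differently encoded character, or a quoting artifact in the id would pass grep yet still fail the importer's lookup, which is one plausible explanation for the error.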
