With Michael's help, I could finally load the data using the batch-import command in Neo4j 2.2.0-M02. It took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was a stray " in some values, which was interpreted as a quotation character to be included in the field value, and this messed up everything from that point forward.
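One way to work around such stray quotes before importing is to re-quote the CSV so that embedded " characters are doubled per RFC 4180 and the importer reads them as data. A minimal Python sketch; the sample content is illustrative, not the actual files from this thread:

```python
import csv
import io

# Illustrative input: a field value containing a stray double quote,
# the kind of character that derailed the batch import.
raw = 'id,value\nn1,a "quoted" literal\nn2,plain\n'

# Read with quoting disabled so stray quotes stay literal characters,
# then write back with full quoting so embedded quotes are doubled
# (escaped per RFC 4180).
src = io.StringIO(raw)
dst = io.StringIO()
reader = csv.reader(src, quoting=csv.QUOTE_NONE)
writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
for row in reader:
    writer.writerow(row)

cleaned = dst.getvalue()
```

For the real files you would stream from `open(...)` to `open(...)` instead of `io.StringIO`; the round trip leaves every field value intact while producing quoting the importer can parse.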
On Friday, December 12, 2014 2:33:14 PM UTC-8, mohsen wrote:

> Michael, I sent you a separate email with credentials to access the csv files. Thanks.
>
> On Friday, December 12, 2014 12:26:51 PM UTC-8, mohsen wrote:
>>
>> Thanks for sharing the new version. Here is my memory info before running batch-import:
>>
>>> Mem: 18404972k total, 549848k used, 17855124k free, 12524k buffers
>>> Swap: 4063224k total, 0k used, 4063224k free, 211284k cached
>>
>> I assigned 11G for heap: export JAVA_OPTS="$JAVA_OPTS -Xmx11G"
>> I ran the batch-import at 11:13am; now it is 12:20pm and it seems to be stuck. Here is the log:
>>
>>> Nodes
>>> [INPUT-------------------|NODE-------------------------------------------------|PROP|WRITER: W:] 86M
>>> Done in 15m 21s 150ms
>>> Calculate dense nodes
>>> [INPUT---------|PREPARE(2)====================================================================|] 0
>>
>> And this is my memory info right now:
>>
>>> top - 12:22:43 up 1:34, 3 users, load average: 0.00, 0.00, 0.00
>>> Tasks: 134 total, 1 running, 133 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 0.3%us, 0.5%sy, 0.0%ni, 99.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>> Mem: 18404972k total, 18244612k used, 160360k free, 6132k buffers
>>> Swap: 4063224k total, 0k used, 4063224k free, 14089236k cached
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 4496 root 20 0 7598m 3.4g 15m S 3.3 19.4 20:35.88 java
>>
>> It has been stuck in Calculate Dense Nodes for more than 40 minutes now. Should I wait, or do I need to kill the process?
>>
>> On Friday, December 12, 2014 3:13:15 AM UTC-8, Michael Hunger wrote:
>>
>>> Right, that's the problem with an RDF model that only uses relationships to represent properties: you won't get the performance that you would get with a real property-graph model.
>>>
>>> I'll share the version separately.
>>> Cheers, Michael
>>>
>>> On Fri, Dec 12, 2014 at 12:07 PM, mohsen <[email protected]> wrote:
>>>
>>>> I would appreciate it if you could get me the newer version; I am already using 2.2.0-M01.
>>>>
>>>> I want to run some graph queries over my RDF data. First, I loaded my data into the Virtuoso triple store (it took 2-3 hours), but I could not get results for my SPARQL queries in a reasonable time. That is why I decided to load my data into Neo4j, to be able to run my queries there.
>>>>
>>>> I am importing RDF into Neo4j only for a specific research problem. I need to extract some patterns from the RDF data, and I have to write queries that require some sort of graph traversal. I don't want to do reasoning over my RDF data. The graph structure looks simple: nodes only have a Label (Uri or Literal) and a Value, and relationships don't have any properties.
>>>>
>>>> On Friday, December 12, 2014 2:41:36 AM UTC-8, Michael Hunger wrote:
>>>>>
>>>>> Your ids are UUIDs, right? So 36 chars * 90M -> 72 bytes each, and Neo-ids are longs w/ 8 bytes, so 80 bytes per entry. Should allocate about 6G heap.
>>>>>
>>>>> Btw. importing RDF 1:1 into Neo4j is not a good idea in the first place. You should model a clean property-graph model and import INTO that model.
>>>>>
>>>>> As for the batch-import, it's a bug that has been fixed after the milestone; I'll try to get you a newer version to try.
>>>>>
>>>>> Cheers, Michael
>>>>>
>>>>> On Fri, Dec 12, 2014 at 11:26 AM, mohsen <[email protected]> wrote:
>>>>>
>>>>>> Thanks, Michael, for following up on my problem. In the groovy script, the output was still with nodes. It is not feasible to use an enum for relationship types: the types are URIs of ontology predicates coming from the CSV file, and there are many of them. However, I think the problem is that this script requires more than 10GB heap, because it needs to store the nodes in memory (in a map) to use them later for creating relationships. So I guess even reducing the mmio mapping size won't solve the problem; I will try it tomorrow, though.
>>>>>>
>>>>>> Regarding the batch-import command, do you have any idea why I am getting that error?
>>>>>>
>>>>>> On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:
>>>>>>>
>>>>>>> It would have been good if you had taken a thread dump from the groovy script.
>>>>>>>
>>>>>>> But if you look at the memory:
>>>>>>>
>>>>>>> off heap = 2+2+1+1 => 6
>>>>>>> heap = 10
>>>>>>> leaves nothing for the OS
>>>>>>>
>>>>>>> Probably the heap GCs as well. So you have to reduce the mmio mapping size.
>>>>>>>
>>>>>>> Was the output still with nodes or already rels?
>>>>>>>
>>>>>>> Perhaps also replace DynamicRelationshipType.withName(line.Type) with an enum.
>>>>>>>
>>>>>>> You can also extend trace to output the number of nodes and rels.
>>>>>>>
>>>>>>> Would you be able to share your csv files?
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:
>>>>>>>
>>>>>>>> I could not load the data using Groovy either. I increased the groovy heap size to 10G before running the script (using JAVA_OPTS). My machine has 16G of RAM. It halts after loading 41M rows from nodes.csv:
>>>>>>>>
>>>>>>>> log:
>>>>>>>> ....
>>>>>>>> 41200000 rows 38431 ms
>>>>>>>> 41300000 rows 50988 ms
>>>>>>>> 41400000 rows 63747 ms
>>>>>>>> 41500000 rows 112758 ms
>>>>>>>> 41600000 rows 326497 ms
>>>>>>>>
>>>>>>>> After logging 41,600,000 rows, nothing happened. I waited 2 hours and there was no progress. The process was still taking CPU, but there was NOT any free memory at that time. I guess that's the reason for it.
>>>>>>>> I have attached my groovy script, where you can find the memory configurations. I guess something goes wrong with memory, since it stopped when all my system's memory was used.
>>>>>>>>
>>>>>>>> I then switched back to the batch-import tool with stacktraces enabled. I think the error I got last time was due to a small heap size, because I did not get that error this time (after allocating 10GB heap). Anyway, I have exactly 86983375 nodes and it could load the nodes this time, but I got another error:
>>>>>>>>
>>>>>>>> Nodes
>>>>>>>> [INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
>>>>>>>>
>>>>>>>> Calculate dense nodes
>>>>>>>> Import error: InputRelationship:
>>>>>>>> properties: []
>>>>>>>> startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>> endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>> type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> java.lang.RuntimeException: InputRelationship:
>>>>>>>> properties: []
>>>>>>>> startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>> endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>> type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>>>>>>>> at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>>>>>>>> Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>>>>>>>> properties: []
>>>>>>>> startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>>>>>>>> endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>>>>>>>> type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>>>>>>>> at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>>>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>> at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>>>>>>>>
>>>>>>>> It seems that it cannot find the start and end node of a relationship. However, both nodes exist in nodes.csv (I did a grep to be sure), so I don't know what goes wrong. Do you have any idea? Can it be related to the id of the start node "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?
>>>>>>>>
>>>>>>>> On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger wrote:
>>>>>>>>>
>>>>>>>>> The groovy one should work fine too. I wanted to augment the post with one that has @CompileStatic so that it's faster.
>>>>>>>>>
>>>>>>>>> I'd also be interested in the --stacktraces output of the batch-import tool of Neo4j 2.2; perhaps you can let it run overnight or in the background.
>>>>>>>>>
>>>>>>>>> Cheers, Michael
>>>>>>>>>
>>>>>>>>> On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I guess the core code for both batch-import and LOAD CSV is the same, so why do you think running it from Cypher (rather than through batch-import) helps? I am trying groovy and the batch-inserter <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy> now; I will post how it goes.
>>>>>>>>>>
>>>>>>>>>> On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin wrote:
>>>>>>>>>>>
>>>>>>>>>>> I'd suggest you take a look at the last 5-7 posts in this recent thread <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>.
>>>>>>>>>>> You basically don't need any "batch import" command - I'd suggest you just use the plain LOAD CSV functionality from Cypher, and you will fill your database step by step.
>>>>>>>>>>>
>>>>>>>>>>> WBR,
>>>>>>>>>>> Andrii

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
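The "specified start node that hasn't been imported" error in the thread can be reproduced outside the importer with a quick consistency check: collect every node id, then report relationship endpoints that never appear as a node. A minimal sketch with hypothetical miniature stand-ins for nodes.csv and rels.csv (the real files and header names may differ):

```python
import csv
import io

# Hypothetical miniature inputs; the real CSV files from the thread
# use URI/UUID ids and different headers.
nodes_csv = "id\nn1\nn2\n"
rels_csv = "start,end,type\nn1,n2,KNOWS\nn3,n2,KNOWS\n"

def missing_endpoints(nodes_text, rels_text):
    """Return (start, end) pairs whose start or end id never appears
    as a node id -- the condition batch-import rejects with
    'specified start node that hasn't been imported'."""
    nodes = csv.reader(io.StringIO(nodes_text))
    next(nodes)  # skip header
    node_ids = {row[0] for row in nodes}
    rels = csv.reader(io.StringIO(rels_text))
    next(rels)  # skip header
    return [(s, e) for s, e, _ in rels
            if s not in node_ids or e not in node_ids]
```

For the real 86M-node files you would stream from `open(...)` instead; a set of 90M string ids fits in a few GB of RAM. Note that an exact-match check like this is stricter than the grep mentioned above: grep confirms only a substring match, so a trailing space, a differently encoded character, or a quoting artifact in the id would pass grep yet still fail the importer's lookup, which is one plausible explanation for the error.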
