What takes so long? The loading? Or figuring it out?

Michael
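For reference, here is a minimal sketch of the per-label import that Andrii outlines in steps 1-4 of the quoted thread below. It is only an illustration under assumptions: the file paths, the `P1` property, the `LabelA` label and the CSV header names (`my_node_id`, `source_my_node_id`, `dest_my_node_id`, `rel_prop_01`) are placeholders, not José's actual files, and the syntax is the Neo4j 2.1-era Cypher used elsewhere in this thread. If the ids are strings such as CLT_1 rather than integers, the toInt() calls should be dropped.

```cypher
// One-time setup: the uniqueness constraint also creates the index on my_node_id.
CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;

// Nodes: run once per label file (hypothetical path), changing LabelA each time.
// MERGE looks the node up by :Skewer + my_node_id only, so a node that appears
// in several label files is reused and simply gains one more label.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
SET n:LabelA, n.P1 = csvline.P1;

// Relationships: both endpoints are found through the :Skewer index.
// Assumed headers: source_my_node_id, dest_my_node_id, rel_prop_01.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/rels.csv" AS csvline
MATCH (source_node:Skewer {my_node_id: toInt(csvline.source_my_node_id)}),
      (dest_node:Skewer {my_node_id: toInt(csvline.dest_my_node_id)})
CREATE (source_node)-[:MY_REL_TYPE {rel_prop_01: csvline.rel_prop_01}]->(dest_node);
```

With WITH HEADERS, each csvline is a map keyed by column name, so columns are read as csvline.column_name; without headers it is a list and columns are read by position, as in csvline[0].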
On Sat, Nov 29, 2014 at 1:18 AM, José F. Morales <josef...@gmail.com> wrote:
> Hey Michael,
>
> I'll check it out. Trouble is knowing what over-complicating is. Thanks
> for the heads up!
>
> I am trying to figure out inductively how to use LOAD CSV from various
> examples. Thanks for another one.
>
> It's killing me that it's taking so long.
>
> Jose
>
> On Friday, November 28, 2014 6:16:49 PM UTC-5, Michael Hunger wrote:
>>
>> José
>>
>> if you watch Nicole's webinar many things will become clear.
>> https://vimeo.com/112447027
>> You don't have to overcomplicate things.
>>
>> The Skewer(id) thing is not really needed if each of your entities has a
>> label and a primary key of some sort.
>> It is just an optimization to not have to think about separate entities.
>>
>> Cheers, Michael
>>
>> On Sat, Nov 29, 2014 at 12:12 AM, José F. Morales <jose...@gmail.com>
>> wrote:
>>
>>> Hey Andrii,
>>>
>>> I've been thinking a lot about your recommendations. I have some
>>> questions, some of which show how ignorant I am. Apologies for basics
>>> if necessary.
>>>
>>> On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:
>>>>
>>>> Before you start.
>>>>
>>>> 1. On nodes and their labels. First of all, I strongly suggest that you
>>>> separate your nodes into different .csv files by label. So you won't have a
>>>> column *`label`* in your .csv but rather a set of files:
>>>>
>>>> nodes_LabelA.csv
>>>> ...
>>>> nodes_LabelZ.csv
>>>>
>>>> whatever your labels are. (Consider a label to be a kind of synonym for
>>>> `class` in object-oriented programming or `table` in an RDBMS.) That's due to
>>>> the fact that labels in Cypher are somewhat specific entities and you probably
>>>> won't be allowed to parameterize them as variables inside your LOAD
>>>> CSV statement.
>>>>
>>> OK, so you have modified your original idea of putting the db into two
>>> files: 1 for nodes, 1 for relationships. Now here you say, put all the nodes
>>> into 1 file per label.
>>> The way I have worked with it, I created 1 file for a
>>> class of nodes I'll call CLT_SOURCE and another file for a class of nodes
>>> called CLT_TARGET. Then I have a file for the relationships. Perhaps
>>> foolishly, I originally would create 1 file that would combine all of this
>>> info and try to paste it in the browser or in the shell. Neither worked,
>>> even though with a smaller amount of data it did.
>>>
>>> You are recommending that with the nodes, I take two steps...
>>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
>>> 2) then split that file into files that correspond to the node:
>>> *my_node_id*, 1 label, and then properties P1...Pn. Since I have 10
>>> labels/node, I should have 10 files named Nodes_LabelA...
>>> Nodes_LabelJ. Thus...
>>>
>>> File: CLT_Nodes-LabelA columns: *my_node_id*, label A, property
>>> P1..., property P4
>>> ...
>>> File: CLT_Nodes-LabelJ columns: *my_node_id*, label J, property
>>> P1..., property P4
>>>
>>> Q1: What are the rules about what can be used for *my_node_id*? I have
>>> usually seen them as a letter-integer combination. Is that the convention?
>>> Sometimes I've seen a letter being used with a specific class of nodes:
>>> a1..a100 for one class and b1..b100 for another. I learned the hard way
>>> that you have to give each node a unique ID. I used CLT_1...CLT_n for my
>>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes.
>>> It worked with the smaller db I made. Anything wrong with using the
>>> convention n1...n100?
>>>
>>>> 2. Then consider one additional "technological" label, let's name it
>>>> `:Skewer` because it will "penetrate" all your nodes of every different
>>>> label (class) like a kebab skewer.
>>>>
>>>> Before you start (or at least before you start importing relationships)
>>>> do
>>>>
>>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id
>>>> IS UNIQUE;
>>>>
>>> Q2: Should I do scenario 1 or 2?
>>>
>>> Scenario 1: add two labels to each file? One from my original nodes
>>> and one as "Skewer"
>>>
>>> File 1: CLT_Nodes-LabelA columns: *my_node_id*, label A, *Skewer*,
>>> property P1..., property P4
>>> ...
>>> File 2: CLT_Nodes-LabelJ columns: *my_node_id*, label J, *Skewer*,
>>> property P1..., property P4
>>>
>>> OR
>>>
>>> Scenario 2: Include an eleventh file thus....
>>>
>>> File 11: CLT_Nodes-LabelK columns: *my_node_id*, *Skewer*,
>>> property P1..., property P4
>>>
>>> From below, I think you mean Scenario 1.
>>>
>>> Q3: “Skewer” is just an integer, right? It corresponds in a way to
>>> my_node_id
>>>
>>>> 3. When doing LOAD CSV with nodes, make sure each node will get 2 (two)
>>>> labels, one of them is `:Skewer`. This will create an index on the
>>>> `my_node_id` attribute (makes relationship creation some orders of
>>>> magnitude faster) and you'll be sure you don't have occasional duplicate
>>>> nodes, as a bonus.
>>>>
>>> Here is some sort of cypher….
>>>
>>> //Creating the nodes
>>>
>>> USING PERIODIC COMMIT 1000
>>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
>>> MERGE (my_node_id:Skewer: LabelA {property1: csvline.property1})
>>> ON CREATE SET
>>> n.Property2 = csvline.Property2,
>>> n.Property3 = csvline.Property3,
>>> n.Property4 = csvline.Property4;
>>>
>>> ….
>>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>>> MERGE (my_node_id:Skewer: LabelJ {property1: csvline.property1})
>>> ON CREATE SET
>>> n.Property2 = csvline.Property2,
>>> n.Property3 = csvline.Property3,
>>> n.Property4 = csvline.Property4;
>>>
>>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J
>>> combine the various labels and their respective values with their
>>> corresponding nodes?
>>>
>>> Q5: Since I think of my data in terms of the two classes of nodes in my
>>> data model …[CLT_SOURCE -> CLT_TARGET ; CLT_TARGET -> CLT_SOURCE], after
>>> loading the nodes, how then do I get two classes of nodes?
>>>
>>> Q6: Is there a step missing that explains how the code below got to have
>>> a “source_node” and a “dest_node” that appear to correspond to my
>>> CLT_SOURCE and CLT_TARGET nodes?
>>>
>>>> 4. Now when you are done with nodes and start doing LOAD CSV for
>>>> relationships, you may give the MATCH statement, which looks up your pair
>>>> of nodes, a hint for fast lookup, like
>>>>
>>>> LOAD CSV ...from somewhere... AS csvline
>>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}), (dest_node:
>>>> Skewer {my_node_id: ToInt(csvline[1])})
>>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ...,
>>>> rel_prop_NN: csvline[ZZ]}]->(dest_node);
>>>>
>>> Q6: This LOAD CSV command (line 1) looks into the separate REL.csv file
>>> you mentioned first, right?
>>>
>>> Q7: csvline is some sort of temp file that is a series of lines of the
>>> csv file?
>>>
>>> Q8: Do you imply in line 2 that the REL.csv file has headers that
>>> include source_node, dest_node?
>>>
>>> Q9: While I see how Skewer is a label, how is my_node_id a property
>>> (line 2)?
>>>
>>> Q10: How does my_node_id relate to either ToInt(csvline[0]) or
>>> ToInt(csvline[1]) (line 2)?
>>>
>>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?
>>>
>>> Does csvline[0] refer to a column in REL.csv, as do csvline[2] and
>>> csvline[ZZ] (line 3)?
>>>
>>>> Adding the *`:Skewer`* label in MATCH will tell Cypher to (implicitly) use
>>>> your index on *my_node_id* which was created when you created your
>>>> constraint. Or you may try to explicitly give it a hint to use the index,
>>>> with a USING INDEX... clause after MATCH before CREATE.
>>>> Btw some earlier
>>>> versions of Neo4j refused to use the index in LOAD CSV for some reason, I
>>>> hope this problem is gone with 2.1.5.
>>>>
>>> OK
>>>
>>>> 5. While importing, be careful to *explicitly specify type conversions
>>>> for each property which is not a string*. I have seen numerous
>>>> occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and
>>>> Cypher silently stored their (supposed) numerics as strings. It's Ok, dude,
>>>> you say it :) This led to confusion afterwards when, say, numerical
>>>> comparisons don't match and so on (though it's easy to correct with a
>>>> single Cypher command, but anyway).
>>>>
>>> Think I did that re. type conversion. Only applies to properties for
>>> my data.
>>>
>>> Sorry for so many questions. I am really interested in figuring this
>>> out!
>>>
>>> Thanks loads,
>>> Jose
>>>
>>>> WBR,
>>>> Andrii
>>>>
>>>> On Wednesday, November 19, 2014 9:36:50 PM UTC+2, José F. Morales wrote:
>>>>>
>>>>> 3. CSV approach
>>>>> a. “Dump the base into 2 .csv files:”
>>>>> b. CSV1: “Describe nodes (enumerate them via some my_node_id integer
>>>>> attribute), columns: my_node_id,label,node_prop_01,node_prop_ZZ”
>>>>> c. CSV2: “Describe relations, columns: source_my_node_id,
>>>>> dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
>>>>> d. Indexes / constraints: before starting import -> have appropriate
>>>>> indexes / constraints
>>>>> e. via LOAD CSV, import CSV1, then CSV2.
>>>>> f. Import no more than 10,000-30,000 lines in a single LOAD CSV
>>>>> statement
>>>>>
>>>>> This seems to be a very well elaborated method and the easiest for me
>>>>> to do. I have files such that I can create these without too much
>>>>> trouble. I figure I’ll split the nodes into three files of 20k rows each. I
>>>>> can do the same with the rels. I have not used indexes or constraints yet
>>>>> in the db’s that I already created and, as I said above, I’ll have to see
>>>>> how to use them.
>>>>>
>>>>> I am assuming column headers that fit with my data are consistent with
>>>>> what you explained below (like, I can put my own meaningful text into
>>>>> Label 1 - 10 and node_prop_01 - 05)....
>>>>> my_node_id, label1, label2, label3, label4,
>>>>> label5, label6, label7, label8, label9,
>>>>> label10, node_prop_01, node_prop_02, node_prop_03,
>>>>> node_prop_04, node_prop_ZZ
>>>>>
>>>>> Thanks again Fellas!!
>>>>>
>>>>> Jose
>>>>>
>>>>> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
>>>>>>
>>>>>> José,
>>>>>>
>>>>>> Let's continue the discussion on the google group
>>>>>>
>>>>>> With larger I meant amount of data, not size of statements
>>>>>>
>>>>>> As I also point out in various places, we recommend creating only
>>>>>> small subgraphs with a single statement, separated by semicolons.
>>>>>> E.g. up to 100 nodes and rels
>>>>>>
>>>>>> Gigantic statements just make the parser explode
>>>>>>
>>>>>> I recommend splitting them up into statements creating subgraphs
>>>>>> Or create nodes and later match them by label & property to connect
>>>>>> them
>>>>>> Make sure to have appropriate indexes / constraints
>>>>>>
>>>>>> You should also surround blocks of statements with begin and commit
>>>>>> commands
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On 19.11.2014 at 04:18, José F. Morales Ph.D. <jm3...@columbia.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Hey Michael and Kenny
>>>>>>
>>>>>> Thank you guys a bunch for the help.
>>>>>>
>>>>>> Let me give you a little background. I am charged to make a
>>>>>> prototype of a tool (“LabCards”) that we hope to use in the hospital and
>>>>>> beyond at some point. In preparation for making the main prototype, I
>>>>>> made two prior Neo4j databases that worked exactly as I wanted them to.
>>>>>> The first database was built with NIH data and had 183 nodes and around
>>>>>> 7500 relationships.
>>>>>> The second database was the pre-prototype and it had
>>>>>> 1080 nodes and around 2000 relationships. I created these in the form of
>>>>>> cypher statements and either pasted them in the Neo4j browser or used the
>>>>>> neo4j shell and loaded them as text files. Before doing that I checked the
>>>>>> cypher code with Sublime Text 2, which highlights the code. Both databases
>>>>>> loaded fine with both methods and did what I wanted them to do.
>>>>>>
>>>>>> As you might imagine, the prototype is an expansion of the
>>>>>> mini-prototype. It has almost the same data model and I built it as a
>>>>>> series of cypher statements as well. My first version of the prototype had
>>>>>> ~60k nodes and 160k relationships.
>>>>>>
>>>>>> I should say that a feature of this model is that all the source and
>>>>>> target nodes have relationships that point to each other. No node points
>>>>>> to itself as far as I know. This file was 41 MB of cypher code that I tried
>>>>>> to load via the neo4j shell.
>>>>>>
>>>>>> In fact, I was following your advice on loading big data files...
>>>>>> “Use the Neo4j-Shell for larger Imports”
>>>>>> (http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/).
>>>>>> This first time out, Java maxed out its allocated memory
>>>>>> at 4 GB 2x and did not complete loading in 24 hours. I killed it.
>>>>>>
>>>>>> I then contacted Kenny, and he generously gave me some advice
>>>>>> regarding the properties file (below) and again the same deal (4 GB memory
>>>>>> 2x) with Java and no success in about 24 hours. I killed that one too.
>>>>>>
>>>>>> Given my loading problems, I have subsequently eliminated a bunch of
>>>>>> relationships (100k) so that the file is now 21 MB. A lot of these were
>>>>>> duplicates that I didn’t pick up before, and I am trying it again. So far,
>>>>>> 15 min into it, similar situation.
>>>>>> The difference is that Java is using 1.7
>>>>>> and 0.5 GB of memory
>>>>>>
>>>>>> Here is the cypher for a typical node…
>>>>>>
>>>>>> CREATE ( CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory
>>>>>> Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ{NAME:'Acetoacetate
>>>>>> (ketone body)',SYNONYM:'',Sample:'SERUM,
>>>>>> URINE',MEDCODE:10010,CUI:'NA'})
>>>>>>
>>>>>> Here is the cypher for a typical relationship...
>>>>>>
>>>>>> CREATE(CLT_1)-[:MEASUREMENT_OF{Phylum:'TZ',CAT:'TEST.NAME',
>>>>>> Ui_Rl:'T157',RESULT:'',Type:'',Semantic_Distance_Score:'NA',
>>>>>> Path_Length:'NA',Path_Steps:'NA'}]->(CLT_TARGET_3617),
>>>>>>
>>>>>> I will let you know how this one turns out. I hope this is helpful.
>>>>>>
>>>>>> Many, many thanks fellas!!!
>>>>>>
>>>>>> Jose
>>>>>>
>>>>>> On Nov 18, 2014, at 8:33 PM, Michael Hunger
>>>>>> <michael...@neotechnology.com> wrote:
>>>>>>
>>>>>> Hi José,
>>>>>>
>>>>>> can you perhaps provide more detail about your dataset (e.g. a sample
>>>>>> of the csv, size, etc.; perhaps an output of csvstat (of csvkit) would be
>>>>>> helpful) and your cypher queries to load it?
>>>>>>
>>>>>> Have you seen my other blog post, which explains two big caveats that
>>>>>> people run into when trying this?
>>>>>> jexp.de/blog/2014/10/load-cvs-with-success/
>>>>>>
>>>>>> Cheers, Michael
>>>>>>
>>>>>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <k...@socialmoon.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Jose,
>>>>>>>
>>>>>>> There is definitely an answer. Let me put you in touch with the
>>>>>>> data import master: Michael Hunger.
>>>>>>>
>>>>>>> Michael, I think the answers here will be pretty straightforward
>>>>>>> for you. You met Jose at GraphConnect NY last year, so I'll spare any
>>>>>>> introductions. The memory map configurations I provided need to be
>>>>>>> calculated and customized for the data import volume.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Kenny
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Nov 18, 2014, at 11:37 AM, José F.
>>>>>>> Morales Ph.D. <jm3...@columbia.edu> wrote:
>>>>>>>
>>>>>>> Kenny,
>>>>>>>
>>>>>>> In 3 hours it’ll have been trying to load for 24 hours, so this is not
>>>>>>> working. I’m catching shit from my crew too, so I got to fix this like
>>>>>>> soon.
>>>>>>>
>>>>>>> I haven’t done this before, but can I break up the data and load
>>>>>>> it in pieces?
>>>>>>>
>>>>>>> Jose
>>>>>>>
>>>>>>> On Nov 17, 2014, at 3:35 PM, Kenny Bastani <k...@socialmoon.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hey Jose,
>>>>>>>
>>>>>>> Try turning off the object cache. Add this line to the
>>>>>>> neo4j.properties configuration file:
>>>>>>>
>>>>>>> cache_type=none
>>>>>>>
>>>>>>> Then retry your import. Also, enable memory mapped files by adding
>>>>>>> these lines to the neo4j.properties file:
>>>>>>>
>>>>>>> neostore.nodestore.db.mapped_memory=2048M
>>>>>>> neostore.relationshipstore.db.mapped_memory=4096M
>>>>>>> neostore.propertystore.db.mapped_memory=200M
>>>>>>> neostore.propertystore.db.strings.mapped_memory=500M
>>>>>>> neostore.propertystore.db.arrays.mapped_memory=500M
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Kenny
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* José F. Morales Ph.D. <jm3...@columbia.edu>
>>>>>>> *Sent:* Monday, November 17, 2014 12:32 PM
>>>>>>> *To:* Kenny Bastani
>>>>>>> *Subject:* latest
>>>>>>>
>>>>>>> Hey Kenny,
>>>>>>>
>>>>>>> Here’s the deal. As I think I said, I loaded the 41 MB file of
>>>>>>> cypher code via the neo4j shell. Before I tried the LabCards file, I tried
>>>>>>> the movies file and a UMLS database I made (8k relationships). They worked
>>>>>>> fine.
>>>>>>>
>>>>>>> The LabCards file is taking a LONG time to load, since I started at
>>>>>>> about 9:30 - 10 PM last night and it's 3 PM now.
>>>>>>>
>>>>>>> I’ve wondered if it's hung up, and the activity monitor’s memory
>>>>>>> usage is constant at two rows of Java at 4 GB w/ the kernel at 1 GB. The
>>>>>>> CPU panel changes a lot, so it looks like it's doing its thing.
>>>>>>>
>>>>>>> So is this how things are to be expected? Do you think the
>>>>>>> loading is gonna take a day or two?
>>>>>>>
>>>>>>> Jose
>>>>>>>
>>>>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>>>> José F. Morales Ph.D.
>>>>>>> Instructor
>>>>>>> Cell Biology and Pathology
>>>>>>> Columbia University Medical Center
>>>>>>> jm3...@columbia.edu
>>>>>>> 212-452-3351
>>>>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to neo4j+un...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.