Both. With what I did before, the loading either never finished or failed. I’m trying not to repeat that with the figuring out!! :)
> On Nov 28, 2014, at 7:50 PM, Michael Hunger
> <michael.hun...@neotechnology.com> wrote:
>
> What takes so long? The loading? Or figuring it out?
>
> Michael
>
> On Sat, Nov 29, 2014 at 1:18 AM, José F. Morales <josef...@gmail.com> wrote:
> Hey Michael,
>
> I'll check it out. Trouble is knowing what over-complicating is. Thanks
> for the heads up!
>
> I am trying to figure out inductively how to use LOAD CSV from various
> examples. Thanks for another one.
>
> It's killing me that it's taking so long.
>
> Jose
>
> On Friday, November 28, 2014 6:16:49 PM UTC-5, Michael Hunger wrote:
> José
>
> if you watch Nicole's webinar many things will become clear.
> https://vimeo.com/112447027
> You don't have to overcomplicate things.
>
> The Skewer(id) thing is not really needed if each of your entities has a
> label and a primary key of some sort.
> It is just an optimization to not have to think about separate entities.
>
> Cheers, Michael
>
> On Sat, Nov 29, 2014 at 12:12 AM, José F. Morales <jose...@gmail.com> wrote:
> Hey Andrii,
>
> I've been thinking a lot about your recommendations. I have some questions,
> some of which show how ignorant I am. Apologies for basics if necessary.
>
> On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:
> Before you start.
>
> 1. On nodes and their labels. First of all, I strongly suggest you
> separate your nodes into different .csv files by label. So you won't have a
> column `label` in your .csv but rather a set of files:
>
> nodes_LabelA.csv
> ...
> nodes_LabelZ.csv
>
> whatever your labels are. (Consider a label to be a kind of synonym for `class`
> in object-oriented programming or `table` in RDBMS.) That's due to the fact that
> labels in Cypher are somewhat specific entities and you probably won't be
> allowed to parameterize them as variables inside your LOAD CSV
> statement.
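A minimal sketch of the one-file-per-label idea above, assuming Neo4j 2.1-era Cypher. The file paths and the `name` column are hypothetical placeholders, not taken from the thread. Because the label is written literally into each statement rather than read from the CSV, each per-label file gets its own LOAD CSV statement.

```cypher
// Hypothetical paths and column names; adjust to your own files.
// The label (:LabelA) is a literal in the statement, not a value from the CSV.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
CREATE (n:LabelA {my_node_id: toInt(csvline.my_node_id), name: csvline.name});

// Same statement shape for the next file; only the path and the label change.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelZ.csv" AS csvline
CREATE (n:LabelZ {my_node_id: toInt(csvline.my_node_id), name: csvline.name});
```

`toInt` is the Neo4j 2.x spelling used in this thread; later releases call the same conversion `toInteger`.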
>
> OK, so you have modified your original idea of putting the db into two files:
> 1 nodes, 1 relationships. Now here you say, put all the nodes into 1 file/label.
> The way I have worked with it, I created 1 file for a class of nodes
> I'll call CLT_SOURCE and another file for a class of nodes called CLT_TARGET.
> Then I have a file for the relationships. Perhaps foolishly, I originally
> would create 1 file that would combine all of this info and try to paste it
> in the browser or in the shell. Neither worked, even though with a smaller
> amount of data it did.
>
> You are recommending that with the nodes, I take two steps...
> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
> 2) then split that file into files that correspond to the node: my_node_id,
> 1 label, and then properties P1...Pn. Since I have 10 Labels/node, I should
> have 10 files named Nodes_LabelA...Nodes_LabelJ. Thus...
>
> File: CLT_Nodes-LabelA columns: my_node_id, label A, property P1..., property P4
> ...
> File: CLT_Nodes-LabelJ columns: my_node_id, label J, property P1..., property P4
>
> Q1: What are the rules about what can be used for my_node_id? I have usually
> seen them as a letter-integer combination. Is that the convention?
> Sometimes I've seen a letter being used with a specific class of nodes:
> a1..a100 for one class and b1..b100 for another. I learned the hard way that
> you have to give each node a unique ID. I used CLT_1...CLT_n for my
> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It
> worked with the smaller db I made. Anything wrong with using the convention
> n1...n100?
>
> 2. Then consider one additional "technological" label, let's name it
> `:Skewer` because it will "penetrate" all your nodes of every different label
> (class) like a kebab skewer.
>
> Before you start (or at least before you start importing relationships) do
>
> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;
>
> Q2: Should I do scenario 1 or 2?
>
> Scenario 1: add two labels to each file? One from my original nodes and one
> as "Skewer"
>
> File 1: CLT_Nodes-LabelA columns: my_node_id, label A, Skewer, property P1..., property P4
> ...
> File 2: CLT_Nodes-LabelJ columns: my_node_id, label J, Skewer, property P1..., property P4
>
> OR
>
> Scenario 2: Include an eleventh file thus....
>
> File 11: CLT_Nodes-LabelK columns: my_node_id, Skewer, property P1..., property P4
>
> From below, I think you mean Scenario 1.
>
> Q3: “Skewer” is just an integer, right? It corresponds in a way to my_node_id
>
> 3. When doing LOAD CSV with nodes, make sure each node will get 2 (two)
> labels, one of them being `:Skewer`. This will create an index on the `my_node_id`
> attribute (makes relationship creation some orders of magnitude faster) and
> you'll be sure you don't have occasional duplicate nodes, as a bonus.
>
> Here is some sort of cypher….
>
> //Creating the nodes
>
> USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
> MERGE (my_node_id:Skewer: LabelA {property1: csvline.property1})
> ON CREATE SET
> n.Property2 = csvline.Property2,
> n.Property3 = csvline.Property3,
> n.Property4 = csvline.Property4;
>
> ….
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
> MERGE (my_node_id:Skewer: LabelJ {property1: csvline.property1})
> ON CREATE SET
> n.Property2 = csvline.Property2,
> n.Property3 = csvline.Property3,
> n.Property4 = csvline.Property4;
>
> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine
> the various labels and their respective values with their corresponding
> nodes?
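One way the node import above might be written so that it runs, as a sketch under stated assumptions rather than the method anyone in the thread prescribed: Neo4j 2.1-era Cypher, a hypothetical file path, a `my_node_id` column in every file, and the uniqueness constraint on `:Skewer(my_node_id)` already created. In the MERGE pattern, `n` is only a variable name for the node, while `my_node_id` is a property stored on it.

```cypher
// Assumes the :Skewer(my_node_id) uniqueness constraint already exists.
// Hypothetical path; Property2..Property4 follow the column names used above.
// MERGE looks the node up by its key; SET n:LabelA then adds the label
// whether the node was just created or already existed from another file.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/CLT_NODES_LabelA.csv" AS csvline
MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
ON CREATE SET
  n.Property2 = csvline.Property2,
  n.Property3 = csvline.Property3,
  n.Property4 = csvline.Property4
SET n:LabelA;
```

Under these assumptions, repeating the statement for each per-label file (changing only the path and the label in the final SET) should attach every label to the same node whenever the `my_node_id` values match, instead of creating duplicates.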
> Q5: Since I think of my data in terms of the two classes of nodes in my data
> model [CLT_SOURCE —> CLT_TARGET ; CLT_TARGET —> CLT_SOURCE], after
> loading the nodes, how then do I get two classes of nodes?
> Q6: Is there a step missing that explains how the code below got to have a
> “source_node” and a “dest_node” that appear to correspond to my CLT_SOURCE
> and CLT_TARGET nodes?
>
> 4. Now when you are done with nodes and start doing LOAD CSV for
> relationships, you may give the MATCH statement, which looks up your pair of
> nodes, a hint for fast lookup, like
>
> LOAD CSV ...from somewhere... AS csvline
> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}), (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);
>
> Q6: This LOAD CSV command (line 1) looks into the separate REL.csv file you
> mentioned first, right?
> Q7: csvline is some sort of temp file that is a series of lines of the CSV
> file?
> Q8: Do you imply in line 2 that the REL.csv file has headers that include
> source_node, dest_node?
> Q9: While I see how Skewer is a label, how is my_node_id a property (line 2)?
> Q10: How does my_node_id relate to either ToInt(csvline[0]) or
> ToInt(csvline[1]) (line 2)?
> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?
> Does csvline[0] refer to a column in REL.csv as do csvline[2] and csvline[ZZ]
> (line 3)?
>
> Adding the `:Skewer` label in MATCH will tell Cypher to (implicitly) use your
> index on my_node_id, which was created when you created your constraint. Or
> you may try to explicitly give it a hint to use the index, with a USING
> INDEX... clause after MATCH and before CREATE. Btw some earlier versions of Neo4j
> refused to use the index in LOAD CSV for some reason; I hope this problem is gone
> with 2.1.5.
>
> OK
>
> 5.
While importing, be careful to explicitly specify type conversions for
> each property which is not a string. I have seen numerous occasions when
> people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and Cypher silently
> stored their (supposed) numerics as strings. It's Ok, dude, you say it :)
> This led to confusion afterwards when, say, numerical comparisons don't match
> and so on (though it's easy to correct with a single Cypher command, but
> anyway).
>
> Think I did that re. type conversion. Only applies to properties for my data.
>
> Sorry for so many questions. I am really interested in figuring this out!
>
> Thanks loads,
> Jose
>
> WBR,
> Andrii
>
> On Wednesday, November 19, 2014 9:36:50 PM UTC+2, José F. Morales wrote:
>
> 3. CSV approach
> a. “Dump the base into 2 .csv files:”
> b. CSV1: “Describe nodes (enumerate them via some my_node_id integer
> attribute), columns: my_node_id,label,node_prop_01,node_prop_ZZ”
> c. CSV2: “Describe relations,
> columns: source_my_node_id,dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
> d. Indexes / constraints: before starting the import, have
> appropriate indexes / constraints
> e. Via LOAD CSV, import CSV1, then CSV2.
> f. Import no more than 10,000-30,000 lines in a single LOAD CSV
> statement
>
> This seems to be a very well elaborated method and the easiest for me to do.
> I have files such that I can create these without too much problem. I figure
> I’ll split the nodes into three files of 20k rows each. I can do the same with
> the rels. I have not used indexes or constraints yet in the db’s that I
> already created and, as I said above, I’ll have to see how to use them.
>
> I am assuming column headers that fit with my data are consistent with what
> you explained below (like, I can put my own meaningful text into label 1-10
> and node_prop_01 - 05)....
> my_node_id, label1, label2, label3, label4,
> label5, label6, label7, label8, label9,
> label10, node_prop_01, node_prop_02, node_prop_03,
> node_prop_04, node_prop_ZZ
>
> Thanks again Fellas!!
>
> Jose
>
> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
> José,
>
> Let's continue the discussion on the google group
>
> With larger I meant amount of data, not size of statements
>
> As I also point out in various places, we recommend creating only small
> subgraphs with a single statement, separated by semicolons.
> E.g. up to 100 nodes and rels
>
> Gigantic statements just let the parser explode
>
> I recommend splitting them up into statements creating subgraphs
> Or create nodes and later match them by label & property to connect them
> Make sure to have appropriate indexes / constraints
>
> You should also surround blocks of statements with begin and commit commands
>
> Sent from my iPhone
>
> On 19.11.2014 at 04:18, José F. Morales Ph.D. <jm3...@columbia.edu> wrote:
>
>> Hey Michael and Kenny
>>
>> Thank you guys a bunch for the help.
>>
>> Let me give you a little background. I am charged with making a prototype of a
>> tool (“LabCards”) that we hope to use in the hospital and beyond at some
>> point. In preparation for making the main prototype, I made two prior
>> Neo4j databases that worked exactly as I wanted them to. The first database
>> was built with NIH data and had 183 nodes and around 7500 relationships.
>> The second database was the Pre-prototype and it had 1080 nodes and around
>> 2000 relationships. I created these in the form of cypher statements and
>> either pasted them in the Neo4j browser or used the neo4j shell and loaded
>> them as text files. Before doing that I checked the cypher code with Sublime
>> Text 2, which highlights the code. Both databases loaded fine with both methods
>> and did what I wanted them to do.
>>
>> As you might imagine, the prototype is an expansion of the mini-prototype.
>> It has almost the same data model and I built it as a series of cypher
>> statements as well. My first version of the prototype had ~60k nodes and
>> 160k relationships.
>>
>> I should say that a feature of this model is that all the source and target
>> nodes have relationships that point to each other. No node points to itself
>> as far as I know. This file was 41 MB of cypher code that I tried to load
>> via the neo4j shell.
>>
>> In fact, I was following your advice on loading big data files... “Use the
>> Neo4j-Shell for larger Imports”
>> (http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/).
>> This first time out, Java maxed out its allocated memory (4 GB, 2x) and
>> did not complete loading in 24 hours. I killed it.
>>
>> I then contacted Kenny, and he generously gave me some advice regarding the
>> properties file (below), and again the same deal (4 GB memory, 2x) with Java
>> and no success in about 24 hours. I killed that one too.
>>
>> Given my loading problems, I have subsequently eliminated a bunch of
>> relationships (100k) so that the file is now 21 MB. A lot of these were
>> duplicates that I didn’t pick up before, and I am trying it again. So far, 15
>> min into it, similar situation. The difference is that Java is using 1.7
>> and 0.5 GB of memory
>>
>> Here is the cypher for a typical node…
>>
>> CREATE ( CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ{NAME:'Acetoacetate (ketone body)',SYNONYM:'',Sample:'SERUM, URINE',MEDCODE:10010,CUI:'NA'})
>>
>> Here is the cypher for a typical relationship...
>>
>> CREATE(CLT_1)-[:MEASUREMENT_OF{Phylum:'TZ',CAT:'TEST.NAME',Ui_Rl:'T157',RESULT:'',Type:'',Semantic_Distance_Score:'NA',Path_Length:'NA',Path_Steps:'NA'}]->(CLT_TARGET_3617),
>>
>> I will let you know how this one turns out. I hope this is helpful.
>>
>> Many, many thanks fellas!!!
>>
>> Jose
>>
>>> On Nov 18, 2014, at 8:33 PM, Michael Hunger <michael...@neotechnology.com> wrote:
>>>
>>> Hi José,
>>>
>>> can you provide perhaps more detail about your dataset (e.g. sample of the
>>> csv, size, etc. perhaps an output of csvstat (of csvkit) would be helpful),
>>> your cypher queries to load it
>>>
>>> Have you seen my other blog post, which explains two big caveats that
>>> people run into when trying this?
>>> http://jexp.de/blog/2014/10/load-cvs-with-success/
>>>
>>> Cheers, Michael
>>>
>>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <k...@socialmoon.com> wrote:
>>> Hey Jose,
>>>
>>> There is definitely an answer. Let me put you in touch with the data import
>>> master: Michael Hunger.
>>>
>>> Michael, I think the answers here will be pretty straight forward for you.
>>> You met Jose at GraphConnect NY last year, so I'll spare any introductions.
>>> The memory map configurations I provided need to be calculated and
>>> customized for the data import volume.
>>>
>>> Thanks,
>>>
>>> Kenny
>>>
>>> Sent from my iPhone
>>>
>>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <jm3...@columbia.edu> wrote:
>>>
>>>> Kenny,
>>>>
>>>> In 3 hours it’ll be trying to load for 24 hours so this is not working.
>>>> I’m catching shit from my crew too, so I got to fix this like soon.
>>>>
>>>> I haven’t done this before, but can I break up the data and load it in
>>>> pieces?
>>>>
>>>> Jose
>>>>
>>>>> On Nov 17, 2014, at 3:35 PM, Kenny Bastani <k...@socialmoon.com> wrote:
>>>>>
>>>>> Hey Jose,
>>>>>
>>>>> Try turning off the object cache. Add this line to the neo4j.properties
>>>>> configuration file:
>>>>>
>>>>> cache_type=none
>>>>>
>>>>> Then retry your import.
Also, enable memory mapped files by adding these
>>>>> lines to the neo4j.properties file:
>>>>>
>>>>> neostore.nodestore.db.mapped_memory=2048M
>>>>> neostore.relationshipstore.db.mapped_memory=4096M
>>>>> neostore.propertystore.db.mapped_memory=200M
>>>>> neostore.propertystore.db.strings.mapped_memory=500M
>>>>> neostore.propertystore.db.arrays.mapped_memory=500M
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Kenny
>>>>>
>>>>> From: José F. Morales Ph.D. <jm3...@columbia.edu>
>>>>> Sent: Monday, November 17, 2014 12:32 PM
>>>>> To: Kenny Bastani
>>>>> Subject: latest
>>>>>
>>>>> Hey Kenny,
>>>>>
>>>>> Here’s the deal. As I think I said, I loaded the 41 MB file of cypher
>>>>> code via the neo4j shell. Before I tried the LabCards file, I tried the
>>>>> movies file and a UMLS database I made (8k relationships). They worked
>>>>> fine.
>>>>>
>>>>> The LabCards file is taking a LONG time to load, since I started at about
>>>>> 9:30 - 10 PM last night and it's 3 PM now.
>>>>>
>>>>> I’ve wondered if it's hung up, and the activity monitor’s memory usage is
>>>>> constant at two rows of Java at 4 GB w/ the kernel at 1 GB. The CPU panel
>>>>> changes a lot so it looks like it's doing its thing.
>>>>>
>>>>> So is this how things are to be expected? Do you think the loading is
>>>>> gonna take a day or two?
>>>>>
>>>>> Jose
>>>>>
>>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>> José F. Morales Ph.D.
>>>>> Instructor
>>>>> Cell Biology and Pathology
>>>>> Columbia University Medical Center
>>>>> jm3...@columbia.edu
>>>>> 212-452-3351
>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>> José F. Morales Ph.D.
>>>> Instructor
>>>> Cell Biology and Pathology
>>>> Columbia University Medical Center
>>>> jm3...@columbia.edu
>>>> 212-452-3351
>>>
>>
>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>> José F. Morales Ph.D.
>> Instructor
>> Cell Biology and Pathology
>> Columbia University Medical Center
>> jm3...@columbia.edu
>> 212-452-3351
>
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to neo4j+un...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

José F. Morales Ph.D.
josef...@gmail.com