Thanks Dude! I'll let you know!

> On Nov 28, 2014, at 8:03 PM, Michael Hunger <michael.hun...@neotechnology.com> wrote:
>
> If you look at the video it is pretty obvious, she outlines all the major steps and pitfalls.
>
> Except for one: create nodes and rels separately if you need more than one MERGE.
>
> GOOD  merge|match|create node  merge|match|create node  create rel
> GOOD  match node  match node  MERGE rel
>
> BAD   match|create node  merge node  merge rel
> BAD   match node  set node.prop
>
> On Sat, Nov 29, 2014 at 1:53 AM, FANC2 <josef...@gmail.com> wrote:
> Both. Using what I did before, the loading either never finished or failed. I'm trying to not follow that example with the figuring it out!! :)
>
>> On Nov 28, 2014, at 7:50 PM, Michael Hunger <michael.hun...@neotechnology.com> wrote:
>>
>> What takes so long? The loading? Or figuring it out?
>>
>> Michael
>>
>> On Sat, Nov 29, 2014 at 1:18 AM, José F. Morales <josef...@gmail.com> wrote:
>> Hey Michael,
>>
>> I'll check it out. Trouble is knowing what over-complicating is. Thanks for the heads up!
>>
>> I am trying to figure out inductively how to use LOAD CSV from various examples. Thanks for another one.
>>
>> It's killing me that it's taking so long.
>>
>> Jose
>>
>> On Friday, November 28, 2014 6:16:49 PM UTC-5, Michael Hunger wrote:
>> José
>>
>> if you watch Nicole's webinar many things will become clear.
>> https://vimeo.com/112447027
>> You don't have to overcomplicate things.
>>
>> The Skewer(id) thing is not really needed if each of your entities has a label and a primary key of some sort.
>> It is just an optimization to not have to think about separate entities.
>>
>> Cheers, Michael
>>
>> On Sat, Nov 29, 2014 at 12:12 AM, José F. Morales <jose...@gmail.com> wrote:
>> Hey Andrii,
>>
>> I've been thinking a lot about your recommendations. I have some questions, some of which show how ignorant I am. Apologies for basics if necessary.
>>
>> On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:
>> Before you start.
>>
>> 1. On nodes and their labels. First of all, I strongly suggest you separate your nodes into different .csv files by label. So you won't have a column `label` in your .csv but rather a set of files:
>>
>>     nodes_LabelA.csv
>>     ...
>>     nodes_LabelZ.csv
>>
>> whatever your labels are. (Consider a label to be kind of a synonym for `class` in object-oriented programming or `table` in RDBMS.) That's due to the fact that labels in Cypher are somewhat specific entities and you probably won't be allowed to make them parameterized into variables inside your LOAD CSV statement.
>>
>> OK, so you have modified your original idea of putting the db into two files: 1 nodes, 1 relationships. Now here you say, put all the nodes into 1 file/label. The way I have worked with it, I created 1 file for a class of nodes I'll call CLT_SOURCE and another file for a class of nodes called CLT_TARGET. Then I have a file for the relationships. Perhaps foolishly, I originally would create 1 file that would combine all of this info and try to paste it in the browser or in the shell. Neither worked, even though with a smaller amount of data it did.
>>
>> You are recommending that with the nodes, I take two steps...
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
>> 2) then I split that file into files that correspond to the node: my_node_id, 1 label, and then properties P1...Pn. Since I have 10 Labels/node, I should have 10 files named..... Nodes_LabelA... Nodes_LabelJ. Thus...
>>
>> File: CLT_Nodes-LabelA columns: my_node_id, label A, property P1..., property P4
>> ...
>> File: CLT_Nodes-LabelJ columns: my_node_id, label J, property P1..., property P4
>>
>> Q1: What are the rules about what can be used for my_node_id? I have usually seen them as a letter-integer combination. Is that the convention? Sometimes I've seen a letter being used with a specific class of nodes: a1..a100 for one class and b1..b100 for another. I learned the hard way that you have to give each node a unique ID. I used CLT_1...CLT_n for my CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It worked with the smaller db I made. Anything wrong with using the convention n1...n100?
>>
>> 2. Then consider one additional "technological" label, let's name it `:Skewer` because it will "penetrate" all your nodes of every different label (class) like a kebab skewer.
>>
>> Before you start (or at least before you start importing relationships) do
>>
>>     CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;
>>
>> Q2: Should I do scenario 1 or 2?
>>
>> Scenario 1: add two labels to each file? One from my original nodes and one as "Skewer"
>>
>> File 1: CLT_Nodes-LabelA columns: my_node_id, label A, Skewer, property P1..., property P4
>> ...
>> File 2: CLT_Nodes-LabelJ columns: my_node_id, label J, Skewer, property P1..., property P4
>>
>> OR
>>
>> Scenario 2: Include an eleventh file thus....
>>
>> File 11: CLT_Nodes-LabelK columns: my_node_id, Skewer, property P1..., property P4
>>
>> From below, I think you mean Scenario 1.
>>
>> Q3: "Skewer" is just an integer, right? It corresponds in a way to my_node_id
>>
>> 3. When doing LOAD CSV with nodes, make sure each node will get 2 (two) labels, one of them being `:Skewer`. This will create an index on the `my_node_id` attribute (which makes relationship creation some orders of magnitude faster) and you'll be sure you don't have occasional duplicate nodes, as a bonus.
>>
>> Here is some sort of cypher….
>>
>>     // Creating the nodes
>>
>>     USING PERIODIC COMMIT 1000
>>     LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
>>     MERGE (n:Skewer:LabelA {property1: csvline.property1})
>>     ON CREATE SET
>>       n.Property2 = csvline.Property2,
>>       n.Property3 = csvline.Property3,
>>       n.Property4 = csvline.Property4;
>>
>>     ….
>>     LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>>     MERGE (n:Skewer:LabelJ {property1: csvline.property1})
>>     ON CREATE SET
>>       n.Property2 = csvline.Property2,
>>       n.Property3 = csvline.Property3,
>>       n.Property4 = csvline.Property4;
>>
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine the various labels and their respective values with their corresponding nodes?
>> Q5: Since I think of my data in terms of the two classes of nodes in my data model …[CLT_SOURCE -> CLT_TARGET ; CLT_TARGET -> CLT_SOURCE], after loading the nodes, how then do I get two classes of nodes?
>> Q6: Is there a step missing that explains how the code below got to have a "source_node" and a "dest_node" that appear to correspond to my CLT_SOURCE and CLT_TARGET nodes?
>>
>> 4. Now when you are done with nodes and start doing LOAD CSV for relationships, you may give the MATCH statement, which looks up your pair of nodes, a hint for fast lookup, like
>>
>>     LOAD CSV ...from somewhere... AS csvline
>>     MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}),
>>           (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>     CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);
>>
>> Q6: This LOAD CSV command (line 1) looks into the separate REL.csv file you mentioned first, right?
>> Q7: csvline is some sort of temp file that is a series of lines of the csv file?
>> Q8: Do you imply in line 2 that the REL.csv file has headers that include source_node, dest_node?
>> Q9: While I see how Skewer is a label, how is my_node_id a property (line 2)?
>> Q10: How does my_node_id relate to either ToInt(csvline[0]) or ToInt(csvline[1]) (line 2)?
>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?
>> Does csvline[0] refer to a column in REL.csv, as do csvline[2] and csvline[ZZ] (line 3)?
>>
>> Adding the `:Skewer` label in MATCH will tell Cypher to (implicitly) use your index on my_node_id, which was created when you created your constraint. Or you may try to explicitly give it a hint to use the index, with a USING INDEX... clause after MATCH and before CREATE. Btw, some earlier versions of Neo4j refused to use the index in LOAD CSV for some reason; I hope this problem is gone with 2.1.5.
>>
>> OK
>>
>> 5. While importing, be careful to explicitly specify type conversions for each property which is not a string. I have seen numerous occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]), and Cypher silently stored their (supposed) numerics as strings. It's Ok, dude, you say it :) This led to confusion afterwards when, say, numerical comparisons don't MATCH and so on (though it's easy to correct with a single Cypher command, but anyway).
>>
>> Think I did that re. type conversion. Only applies to properties for my data.
>>
>> Sorry for so many questions. I am really interested in figuring this out!
>>
>> Thanks loads,
>> Jose
>>
>> WBR,
>> Andrii
>>
>> On Wednesday, November 19, 2014 9:36:50 PM UTC+2, José F. Morales wrote:
>>
>> 3. CSV approach
>>    a. "Dump the base into 2 .csv files:"
>>    b. CSV1: "Describe nodes (enumerate them via some my_node_id integer attribute), columns: my_node_id,label,node_prop_01,node_prop_ZZ"
>>    c. CSV2: "Describe relations, columns: source_my_node_id,dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN"
>>    d. Indexes / constraints: before starting import -> have appropriate indexes / constraints
>>    e. via LOAD CSV, import CSV1, then CSV2.
>>    f. Import no more than 10,000-30,000 lines in a single LOAD CSV statement
>>
>> This seems to be a very well elaborated method and the easiest for me to do. I have files such that I can create these without too much of a problem. I figure I'll split the nodes into three files of 20k rows each. I can do the same with the rels. I have not used indexes or constraints yet in the dbs that I already created and, as I said above, I'll have to see how to use them.
>>
>> I am assuming column headers that fit with my data are consistent with what you explained below (like, I can put my own meaningful text into label 1-10 and node_prop_01-05)....
>> "my_node_id, label1, label2, label3, label4, label5, label6, label7, label8, label9, label10, node_prop_01, node_prop_02, node_prop_03, node_prop_04, node_prop_ZZ"
>>
>> Thanks again Fellas!!
>>
>> Jose
>>
>> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
>> José,
>>
>> Let's continue the discussion on the google group
>>
>> With larger I meant amount of data, not size of statements
>>
>> As I also point out in various places, we recommend creating only small subgraphs with a single statement, separated by semicolons.
>> E.g. up to 100 nodes and rels
>>
>> Gigantic statements just let the parser explode
>>
>> I recommend splitting them up into statements creating subgraphs
>> Or create nodes and later match them by label & property to connect them
>> Make sure to have appropriate indexes / constraints
>>
>> You should also surround blocks of statements with begin and commit commands
>>
>> Sent from my iPhone
>>
>> On 19.11.2014 at 04:18, José F. Morales Ph.D. <jm3...@columbia.edu> wrote:
>>
>>> Hey Michael and Kenny
>>>
>>> Thank you guys a bunch for the help.
>>>
>>> Let me give you a little background.
>>> I am charged to make a prototype of a tool ("LabCards") that we hope to use in the hospital and beyond at some point. In preparation for making the main prototype, I made two prior Neo4j databases that worked exactly as I wanted them to. The first database was built with NIH data and had 183 nodes and around 7500 relationships. The second database was the pre-prototype and it had 1080 nodes and around 2000 relationships. I created these in the form of cypher statements and either pasted them in the Neo4j browser or used the neo4j shell and loaded them as text files. Before doing that I checked the cypher code with Sublime Text 2, which highlights the code. Both databases loaded fine with both methods and did what I wanted them to do.
>>>
>>> As you might imagine, the prototype is an expansion of the mini-prototype. It has almost the same data model and I built it as a series of cypher statements as well. My first version of the prototype had ~60k nodes and 160k relationships.
>>>
>>> I should say that a feature of this model is that all the source and target nodes have relationships that point to each other. No node points to itself as far as I know. This file was 41 Mb of cypher code that I tried to load via the neo4j shell.
>>>
>>> In fact, I was following your advice on loading big data files... "Use the Neo4j-Shell for larger Imports" (http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/). This first time out, Java maxed out its memory allocated at 4Gb 2x and did not complete loading in 24 hours. I killed it.
>>>
>>> I then contacted Kenny, and he generously gave me some advice regarding the properties file (below), and again the same deal (4 Gb memory 2x) with Java and no success in about 24 hours. I killed that one too.
>>>
>>> Given my loading problems, I have subsequently eliminated a bunch of relationships (100k) so that the file is now 21 Mb. A lot of these were duplicates that I didn't pick up before, and I am trying it again. So far, 15 min into it, similar situation. The difference is that Java is using 1.7 and 0.5 GB of memory
>>>
>>> Here is the cypher for a typical node…
>>>
>>>     CREATE (CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ
>>>       {NAME:'Acetoacetate (ketone body)', SYNONYM:'', Sample:'SERUM, URINE', MEDCODE:10010, CUI:'NA'})
>>>
>>> Here is the cypher for a typical relationship...
>>>
>>>     CREATE (CLT_1)-[:MEASUREMENT_OF {Phylum:'TZ', CAT:'TEST.NAME', Ui_Rl:'T157', RESULT:'', Type:'', Semantic_Distance_Score:'NA', Path_Length:'NA', Path_Steps:'NA'}]->(CLT_TARGET_3617),
>>>
>>> I will let you know how this one turns out. I hope this is helpful.
>>>
>>> Many, many thanks fellas!!!
>>>
>>> Jose
>>>
>>>> On Nov 18, 2014, at 8:33 PM, Michael Hunger <michael...@neotechnology.com> wrote:
>>>>
>>>> Hi José,
>>>>
>>>> can you provide perhaps more detail about your dataset (e.g. a sample of the csv, size, etc.; perhaps an output of csvstat (of csvkit) would be helpful) and your cypher queries to load it?
>>>>
>>>> Have you seen my other blog post, which explains two big caveats that people run into when trying this?
>>>> http://jexp.de/blog/2014/10/load-cvs-with-success/
>>>>
>>>> Cheers, Michael
>>>>
>>>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <k...@socialmoon.com> wrote:
>>>> Hey Jose,
>>>>
>>>> There is definitely an answer. Let me put you in touch with the data import master: Michael Hunger.
>>>>
>>>> Michael, I think the answers here will be pretty straightforward for you. You met Jose at GraphConnect NY last year, so I'll spare any introductions.
>>>> The memory map configurations I provided need to be calculated and customized for the data import volume.
>>>>
>>>> Thanks,
>>>>
>>>> Kenny
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <jm3...@columbia.edu> wrote:
>>>>
>>>>> Kenny,
>>>>>
>>>>> In 3 hours it'll have been trying to load for 24 hours, so this is not working. I'm catching shit from my crew too, so I got to fix this like soon.
>>>>>
>>>>> I haven't done this before, but can I break up the data and load it in pieces?
>>>>>
>>>>> Jose
>>>>>
>>>>>> On Nov 17, 2014, at 3:35 PM, Kenny Bastani <k...@socialmoon.com> wrote:
>>>>>>
>>>>>> Hey Jose,
>>>>>>
>>>>>> Try turning off the object cache. Add this line to the neo4j.properties configuration file:
>>>>>>
>>>>>>     cache_type=none
>>>>>>
>>>>>> Then retry your import. Also, enable memory mapped files by adding these lines to the neo4j.properties file:
>>>>>>
>>>>>>     neostore.nodestore.db.mapped_memory=2048M
>>>>>>     neostore.relationshipstore.db.mapped_memory=4096M
>>>>>>     neostore.propertystore.db.mapped_memory=200M
>>>>>>     neostore.propertystore.db.strings.mapped_memory=500M
>>>>>>     neostore.propertystore.db.arrays.mapped_memory=500M
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Kenny
>>>>>>
>>>>>> From: José F. Morales Ph.D. <jm3...@columbia.edu>
>>>>>> Sent: Monday, November 17, 2014 12:32 PM
>>>>>> To: Kenny Bastani
>>>>>> Subject: latest
>>>>>>
>>>>>> Hey Kenny,
>>>>>>
>>>>>> Here's the deal. As I think I said, I loaded the 41 Mb file of cypher code via the neo4j shell. Before I tried the LabCards file, I tried the movies file and a UMLS database I made (8k relationships). They worked fine.
>>>>>>
>>>>>> The LabCards file is taking a LONG time to load since I started at about 9:30 - 10 PM last night and it's 3PM now.
>>>>>>
>>>>>> I've wondered if it's hung up, and the activity monitor's memory usage is constant at two rows of Java at 4GB w/ the kernel at 1 GB. The CPU panel changes a lot so it looks like it's doing its thing.
>>>>>>
>>>>>> So is this how things are to be expected? Do you think the loading is gonna take a day or two?
>>>>>>
>>>>>> Jose
>>>>>>
>>>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>>> José F. Morales Ph.D.
>>>>>> Instructor
>>>>>> Cell Biology and Pathology
>>>>>> Columbia University Medical Center
>>>>>> jm3...@columbia.edu
>>>>>> 212-452-3351
>>
>> --
>> You received this message because you are subscribed to the Google Groups "Neo4j" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.

José F. Morales Ph.D.
josef...@gmail.com
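
For anyone skimming the thread, the import pattern discussed above (unique constraint first, one node file per label, relationships in a separate pass with two MATCHes and a CREATE) might look roughly like the sketch below. It is only an illustration under stated assumptions: the file URLs are placeholders, the column names (my_node_id, property1, property2, source_my_node_id, dest_my_node_id, rel_prop_01) are borrowed from the CSV layouts described in the thread, and the syntax is the Neo4j 2.1-era Cypher used in the thread, which newer releases spell differently in places. If the IDs are strings such as CLT_1 rather than integers, the toInt() conversion would presumably need to be dropped.

```cypher
// Sketch only: file URLs are placeholders; column names follow the thread.

// 1. Unique constraint on the shared :Skewer label.
//    This also creates the index used by the lookups below.
CREATE CONSTRAINT ON (n:Skewer) ASSERT n.my_node_id IS UNIQUE;

// 2. One LOAD CSV per label file: a single MERGE on the key,
//    then the remaining properties are set only when the node is created.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
MERGE (n:Skewer:LabelA {my_node_id: toInt(csvline.my_node_id)})
ON CREATE SET
  n.property1 = csvline.property1,
  n.property2 = csvline.property2;

// 3. Relationships in a separate pass: match both end nodes
//    through the indexed key, then create the relationship.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/rels.csv" AS csvline
MATCH (source_node:Skewer {my_node_id: toInt(csvline.source_my_node_id)})
MATCH (dest_node:Skewer {my_node_id: toInt(csvline.dest_my_node_id)})
CREATE (source_node)-[:MY_REL_TYPE {rel_prop_01: csvline.rel_prop_01}]->(dest_node);
```

Keeping node creation and relationship creation in separate statements follows the "GOOD" patterns listed at the top of the thread, where each statement contains at most one MERGE.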