Just one thing: never use more than one label and one property in MERGE, otherwise it won't use indexes.
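A minimal sketch of what that could look like for one of the node files discussed in the quoted thread. The file URL and the CSV column names are placeholders (assumptions, not taken from the thread), and toInt() is the 2.1-era spelling used elsewhere in this discussion:

```cypher
// Placeholder file URL and column names; adjust to your own data.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
// Lookup key: exactly one label and one property, so the index on
// :Skewer(my_node_id) can be used to find an existing node.
MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
// Additional labels and properties are assigned after the lookup,
// not inside the MERGE pattern.
SET n:LabelA,
    n.property1 = csvline.property1;
```

The point of the split is that everything inside the MERGE pattern becomes part of the match condition, so keeping it down to the indexed label and property keeps the lookup cheap.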
And use ... ON CREATE SET ...

Sent from my iPhone

> On 29.11.2014 at 12:35, Andrii Stesin <ste...@gmail.com> wrote:
>
> Hi Jose,
>
> On Saturday, November 29, 2014 1:12:52 AM UTC+2, José F. Morales wrote:
>>
>>>
>>> 1. On nodes and their labels. First of all, I strongly suggest you
>>> separate your nodes into different .csv files by label. So you won't have a
>>> column `label` in your .csv but rather a set of files:
>>>
>>> nodes_LabelA.csv
>>> ...
>>> nodes_LabelZ.csv
>>>
>>> whatever your labels are. (Consider a label to be a kind of synonym for
>>> `class` in object-oriented programming or `table` in an RDBMS.) That's due to
>>> the fact that labels in Cypher are somewhat specific entities and you probably
>>> won't be allowed to parameterize them as variables inside your LOAD
>>> CSV statement.
>>
>> OK, so you have modified your original idea of putting the db into two files:
>> 1 for nodes, 1 for relationships. Now here you say, put all the nodes into
>> 1 file per label. The way I have worked with it, I created 1 file for a class
>> of nodes I'll call CLT_SOURCE and another file for a class of nodes called
>> CLT_TARGET.
>
> Ok, but how many valid distinct combinations of your 10 node labels may
> exist? I was speaking about a simple case where you have some limited number
> of possible node labels (or their combinations), say less than 10.
>
>> You are recommending that with the nodes, I take two steps...
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
>
> Not necessarily "combine", but just give each node a unique (temporary)
> my_node_id; see my "10+M tree" example below.
>
>> 2) then I split that file into files that correspond to the node:
>> my_node_id, 1 label, and then properties P1...Pn. Since I have 10
>> Labels/node, I should have 10 files named..... Nodes_LabelA... Nodes_LabelJ.
>> Thus...
>
> You may have as many labels per node as you wish, but it is all about how many
> valid distinct combinations of labels you have.
> (One single label is a combination itself, obviously.)
>
> If you have a limited number of valid label combinations, that's one story.
> But if we are talking about the order of 10! possible valid combinations, the
> story is somewhat more interesting :) Which setup is yours?
>
>> File: CLT_Nodes-LabelA columns: my_node_id, label A, property P1...,
>> property P4
>> ...
>> File: CLT_Nodes-LabelJ columns: my_node_id, label J, property P1...,
>> property P4
>>
>>
>> Q1: What are the rules about what can be used for my_node_id? I have
>> usually seen them as a letter integer combination. Is that the convention?
>> Sometimes I've seen a letter being used with a specific class of nodes
>> a1..a100 for one class and b1..b100 for another. I learned the hard way
>> that you have to give each node a unique ID. I used CLT_1...CLT_n for my
>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It
>> worked with the smaller db I made. Anything wrong using the convention
>> n1...n100?
>
> I'm not aware of any conventions here; the only thing I know for sure is that
> a schema index works much(!) faster on plain integers than on Unicode strings.
> That's the only difference which I consider significant. So my personal
> preference is to have my_node_id be a unique integer. Once, when importing
> 10+ million nodes into a tree with variable height [1..7] where each level
> of nodes was in a separate file (because of the level's own unique label and
> unique set of properties), I just selected a schema for numbering them like
>
> :Skewer:Level1 my_node_id = 10000000 + file1.csv line number
> :Skewer:Level2 my_node_id = 20000000 + file2.csv line number
> ...
> :Skewer:Level7 my_node_id = 70000000 + file7.csv line number
>
> so the relationship file (all relationships were of the same single type)
> became a simple 2 column .csv like this, with 10+ million lines
>
> 10000017,20000362
> 10000017,20000547
> 10000017,40083215
> 10000018,30000397
> ...
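As a rough sketch, a two-column file of this shape could be loaded as below. This assumes a unique constraint on :Skewer(my_node_id) is already in place; the file URL and the relationship type MY_REL_TYPE are placeholders, and toInt() is the 2.1-era function name (later Neo4j releases call it toInteger()):

```cypher
// Placeholder file URL and relationship type; no header row is assumed,
// so columns are addressed by index.
USING PERIODIC COMMIT 10000
LOAD CSV FROM "file:///path/to/relationships.csv" AS csvline
// Column 0 holds the my_node_id of the source node, column 1 that of the target.
MATCH (source_node:Skewer {my_node_id: toInt(csvline[0])}),
      (dest_node:Skewer {my_node_id: toInt(csvline[1])})
CREATE (source_node)-[:MY_REL_TYPE]->(dest_node);
```

Because both ids are plain integers under the same label, each row resolves to two index lookups followed by one relationship creation.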
>
> After successfully importing the 7 node files (with the nodes ready in the db
> and indexed on their unique my_node_id under the label :Skewer) I split
> relationships.csv into 1000+ files with 10000 lines each and wrote a dumb
> shell script which loaded them with `neo4j-shell -c` file by file, doing
> `sleep 60` between files (to give neo4j a minute to complete each batch
> transaction), then started it Friday evening and got my tree ready on Monday
> morning :)
>
> If you prefer alphanumerics for my_node_id it's completely up to you :)
> Anyway, after a successful import you may prefer to remove those temporary ids
> completely from the database, just to conserve the space where properties are
> stored.
>
>>> 2. Then consider one additional "technological" label, let's name it
>>> `:Skewer` because it will "penetrate" all your nodes of every different
>>> label (class) like a kebab skewer.
>>>
>>> Before you start (or at least before you start importing relationships) do
>>>
>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;
>>
>> Q2: Should I do scenario 1 or 2?
>>
>> Scenario 1: add two labels to each file? One from my original nodes and
>> one as "Skewer"
>>
>> File 1: CLT_Nodes-LabelA columns: my_node_id, label A, Skewer,
>> property P1..., property P4
>> ...
>> File 2: CLT_Nodes-LabelJ columns: my_node_id, label J, Skewer,
>> property P1..., property P4
>>
>> OR
>>
>> Scenario 2: Include an eleventh file thus....
>>
>> File 11: CLT_Nodes-LabelK columns: my_node_id, Skewer, property P1...,
>> property P4
>>
>> From below, I think you mean Scenario 1.
>
> Yes, and you don't need to add a column for the :Skewer label to a file; the
> LOAD CSV statement should assign it.
>
>> Q3: “Skewer” is just an integer right? It corresponds in a way to
>> my_node_id
>
> No, it's a label!
> So in Cypher your node (suppose it has 2 labels, :LabelA and :LabelJ) is
> described like
>
> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something', p2: 'something else', p3: 'etc.'})
>
>> Here is some sort of cypher….
>>
>> //Creating the nodes
>>
>> USING PERIODIC COMMIT 1000
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
>> MERGE (my_node_id:Skewer: LabelA {property1: csvline.property1})
>> ON CREATE SET
>> n.Property2 = csvline.Property2,
>> n.Property3 = csvline.Property3,
>> n.Property4 = csvline.Property4;
>>
>> ….
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>> MERGE (my_node_id:Skewer: LabelJ {property1: csvline.property1})
>> ON CREATE SET
>> n.Property2 = csvline.Property2,
>> n.Property3 = csvline.Property3,
>> n.Property4 = csvline.Property4;
>>
>>
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine
>> the various labels and their respective values with their corresponding
>> nodes?
>
> A label is not a variable, it does not have a value. It's just a label,
> consider it a "tag".
> Also my_node_id IS a variable so it does have a value.
>
> Looking at your 2 code snippets: in case you hope that the first one will
> create a node with LabelA and the second one will assign LabelJ to a node
> which was created earlier, you are wrong. But... if you remove the labels from
> MERGE, it will work, but look here with attention:
>
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
> MERGE (new_node_A:Skewer {my_node_id: csvline.node_unique_number, property1: csvline.property1})
> // only my_node_id and property1 values will be taken into account! no
> // labels, no other properties are taken care of
> // AFAIR we do not need `ON CREATE SET` here, do you really care whether it is
> // a new node or it was created earlier?
> SET
> new_node_A:LabelA,
> new_node_A.Property2 = csvline.Property2,
> new_node_A.Property3 = csvline.Property3,
> new_node_A.Property4 = csvline.Property4;
>
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
> MERGE (new_node_J:Skewer {my_node_id: csvline.node_unique_number, property1: csvline.property1})
> // only my_node_id and property1 values will be taken into account! no
> // labels, no other properties are taken care of
> // AFAIR we do not need `ON CREATE SET` here, do you really care whether it is
> // a new node or it was created earlier?
> SET
> new_node_J:LabelJ,
> new_node_J.Property2 = csvline.Property2,
> new_node_J.Property3 = csvline.Property3,
> new_node_J.Property4 = csvline.Property4;
>
>
> What you get if doing things this way:
>
> When doing the LabelA .csv you will create whatever uniquely numbered nodes
> were not already in the database, fill their properties (or maybe overwrite
> them?) and label the node (be it a new or an existing one) with LabelA, no
> matter what other labels the node (possibly) had,
>
> When doing the LabelJ .csv you again will create whatever uniquely numbered
> nodes were not already in the database, again either fill or overwrite
> properties, and again label the node (be it a new or an existing one) with
> LabelJ, no matter what other labels the node (possibly) had,
>
> so if you created some node with the first file and labeled it LabelA, and the
> same unique my_node_id occurs in both the first and second files, your node
> will get 2 labels, LabelA and LabelJ.
>
>> Q5: Since I think of my data in terms of the two classes of nodes in my Data
>> model …[CLT_SOURCE -> CLT_TARGET ; CLT_TARGET -> CLT_SOURCE], after
>> loading the nodes, how then do I get two classes of nodes?
>
> Make them 2 labels: CLTSource and CLTTarget respectively.
>
>> Q6: Is there a step missing that explains how the code below got to have a
>> “source_node” and a “dest_node” that appear to correspond to my CLT_SOURCE
>> and CLT_TARGET nodes?
>
> // suppose we coded relationships as 2 my_node_id's of nodes
> LOAD CSV FROM "...somewhere..." AS csvline
> MATCH (s:CLTSource:Skewer {my_node_id: TOINT(csvline[0])})
> USING INDEX s:Skewer(my_node_id)
> // csvline must be carried through WITH, otherwise it is out of scope below
> WITH s, csvline
> MATCH (t:CLTTarget:Skewer {my_node_id: TOINT(csvline[1])})
> USING INDEX t:Skewer(my_node_id)
> MERGE (s)-[r:MY_RELATIONSHIP_TYPE]->(t)
> SET
> r.prop1 = 'smth';
>
>
>
>>> 4. Now when you are done with nodes and start doing LOAD CSV for
>>> relationships, you may give the MATCH statement, which looks up your pair
>>> of nodes, a hint for fast lookup, like
>>>
>>> LOAD CSV ...from somewhere... AS csvline
>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}),
>>> (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ...,
>>> rel_prop_NN: csvline[ZZ]}]->(dest_node);
>>
>> Q6: This LOAD CSV command (line 1) looks into the separate REL.csv file you
>> mentioned first, right?
>
> Yep
>
>> Q7: csvline is some sort of temp file that is a series of lines of the csv
>> file?
>
> It is a variable: a collection which is filled with the column values of the
> .csv, line by line. You can use it either as an array, referring to fields by
> their index (my preferred way), or, if you use `WITH HEADERS` mode, you can use
> it as a keyed map. See
> http://neo4j.com/docs/2.1.6/cypherdoc-importing-csv-files-with-cypher.html
>
>> Q8: Do you imply in line 2 that the REL.csv file has headers that include
>> source_node, dest_node ?
>
> No, I don't use headers, so I refer to csvline fields by their index
> ("collection mode")
>
>> Q9: While I see how Skewer is a label, how is my_node_id a property (line
>> 2) ?
>
> Because it IS a property of a node, and you build the constraint & index on
> this exact property inside the scope of the label :Skewer
>
>> Q10: How does my_node_id relate to either ToInt(csvline[0]) or
>> ToInt(csvline[1]) (line 2) ?
>
> For the .csv with relationships, csvline[0] is the value of the my_node_id
> property of the source node, csvline[1] is the value of the my_node_id property
> of the target node, and the TOINT() type conversion is used because my personal
> preference is to use integers for ids.
>
>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?
>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and
>> csvline[ZZ] (line 3) ?
>
> I think you can combine import of multiple .CSV files in a single LOAD CSV
> statement but I never tried this mode.
>
> WBR,
> Andrii
>
>>> Adding the `:Skewer` label in MATCH will tell Cypher to (implicitly) use your
>>> index on my_node_id, which was created when you created your constraint. Or
>>> you may try to explicitly give it a hint to use the index, with a USING
>>> INDEX... clause after MATCH and before CREATE. Btw some earlier versions of
>>> Neo4j refused to use the index in LOAD CSV for some reason; I hope this
>>> problem is gone with 2.1.5.
>> OK
>>
>>> 5. While importing, be careful to explicitly specify type conversions for
>>> each property which is not a string. I have seen numerous occasions when
>>> people missed ToInt(csvline[i]) or ToFloat(csvline[j]) and Cypher
>>> silently stored their (supposed) numerics as strings. It's Ok, dude, you
>>> say it :) This led to confusion afterwards when, say, numerical comparisons
>>> don't MATCH and so on (though it's easy to correct with a single Cypher
>>> command, but anyway).
>> Think I did that re. type conversion. Only applies to properties for my
>> data.
>>
>> Sorry for so many questions. I am really interested in figuring this out!
>>
>> Thanks loads,
>> Jose
>
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to neo4j+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.