On Saturday, November 29, 2014 6:35:33 AM UTC-5, Andrii Stesin wrote:
>
> Hi Jose,
>
> On Saturday, November 29, 2014 1:12:52 AM UTC+2, José F. Morales wrote:
>>
>>> 1. On nodes and their labels. First of all, I strongly suggest that you
>>> separate your nodes into different .csv files by label. So you won't have a
>>> column *`label`* in your .csv but rather a set of files:
>>>
>>> nodes_LabelA.csv
>>> ...
>>> nodes_LabelZ.csv
>>>
>>> whatever your labels are. (Consider a label to be a kind of synonym for
>>> `class` in object-oriented programming or `table` in RDBMS). That's due to the
>>> fact that labels in Cypher are somewhat specific entities and you probably
>>> won't be allowed to make them parameterized into variables inside your LOAD
>>> CSV statement.
>>>
>> OK, so you have modified your original idea of putting the db into two
>> files: 1 nodes, 1 relationships. Now here you say, put all the nodes into
>> 1 file/label. The way I have worked with it, I created 1 file for a
>> class of nodes I'll call CLT_SOURCE and another file for a class of nodes
>> called CLT_TARGET.
>>
>
> Ok, but how many valid distinct combinations of your 10 node labels may
> exist?
>
JFM: 264

> I was speaking about a simple case where you have some limited number of
> possible node labels (or their combinations), say less than 10.
>

JFM: Lot more than that.

> > You are recommending that with the nodes, I take two steps...
>
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
>>
>
> Not necessarily "combine" but just give each node a unique (temporary)
> *my_node_id*; see my "10+M tree" example below.
>
>
>> 2) then I split that file into files that correspond to the node:
>> *my_node_id*, 1 label, and then properties P1...Pn. Since I have 10
>> Labels/node, I should have 10 files named..... Nodes_LabelA...
>> Nodes_LabelJ. Thus...
>>
>
> You may have as many labels per node as you wish, but it is all about how
> many valid distinct combinations of labels you have. (One single label is a
> combination itself, obviously).
>
> If you have some limited quantity of valid label combinations it's one
> story. But if we are talking about the order of 10! possible valid
> combinations, the story is somewhat more interesting :) Which setup is
> yours?
>

JFM: Like I said, there are 264 unique combinations in all my nodes. Some are redundant: the full spelling of a term/phrase and an abbreviation. Some are a code for a term/phrase. Some were created in anticipation of other values I would create later. I am trying to anticipate queries I'll make later.

>
>> File: CLT_Nodes-LabelA columns: *my_node_id*, label A, property
>> P1..., property P4
>> ...
>> File: CLT_Nodes-LabelJ columns: *my_node_id*, label J, property
>> P1..., property P4
>>
>>
>> Q1: What are the rules about what can be used for *my_node_id*? I have
>> usually seen them as a letter integer combination. Is that the convention?
>> Sometimes I've seen a letter being used with a specific class of nodes
>> a1..a100 for one class and b1..b100 for another. I learned the hard way
>> that you have to give each node a unique ID.
>> I used CLT_1...CLT_n for my
>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes.
>> It worked with the smaller db I made. Anything wrong using the convention
>> n1...n100?
>>
>
> I'm not aware of any conventions here, the only thing I know for sure is
> that *schema index works much(!) faster on plain integers than on Unicode
> strings*. That's the only difference which I consider significant. So my
> personal preference is to have *my_node_id* be a unique integer. Once,
> when importing 10+ million nodes into a tree with variable height [1..7]
> where each level of nodes was in a separate file (because of the level's own
> unique label and unique set of properties), I just selected a scheme for
> numbering them like
>

JFM: Makes sense for speed. I guess it depends upon the size of one's data.

> :Skewer:Level1 my_node_id = 10000000 + file1.csv line number
> :Skewer:Level2 my_node_id = 20000000 + file2.csv line number
> ...
> :Skewer:Level7 my_node_id = 70000000 + file7.csv line number
>
> so the relationship file (all relationships were of the same single type)
> became a simple 2 column .csv like this with 10+ million lines
>
> 10000017,20000362
> 10000017,20000547
> 10000017,40083215
> 10000018,30000397
> ...
>
> After successfully importing the 7 node files (and having the nodes ready in
> the db and indexed on their unique *my_node_id* under the label :Skewer) I split
> relationships.csv into 1000+ files with 10000 lines each and wrote a dumb
> shell script which loaded them with `neo4j-shell -c` file by file, doing
> `sleep 60` between files (to give neo4j a minute to complete each batch
> transaction), then started it Friday evening and got my tree ready on Monday
> morning :)
>
> If you prefer alphanumerics for my_node_id it's completely up to you :)
> Anyway, after a successful import you may prefer to remove those temporary
> ids completely from the database, just to conserve space where properties
> are stored.
>

JFM: OK. Sounds good.

> 2. 
>>> Then consider one additional "technological" label, let's name it
>>> `:Skewer` because it will "penetrate" all your nodes of every different
>>> label (class) like a kebab skewer.
>>>
>>> Before you start (or at least before you start importing relationships)
>>> do
>>>
>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;
>>>
>>>
>> Q2: Should I do scenario 1 or 2?
>>
>> Scenario 1: add two labels to each file? One from my original nodes and
>> one as "Skewer"
>>
>> File 1: CLT_Nodes-LabelA columns: *my_node_id*, label A, *Skewer*,
>> property P1..., property P4
>> ...
>> File 2: CLT_Nodes-LabelJ columns: *my_node_id*, label J, *Skewer*,
>> property P1..., property P4
>>
>> OR
>>
>> Scenario 2: Include an eleventh file thus....
>>
>> File 11: CLT_Nodes-LabelK columns: *my_node_id*, *Skewer*,
>> property P1..., property P4
>>
>> From below, I think you mean Scenario 1.
>>
>
> Yes, and you don't need to add a column for the :Skewer label into a file, the
> LOAD CSV statement should assign it.
>

JFM: OK. Sounds good.

>> Q3: "Skewer" is just an integer right? It corresponds in a way to
>> my_node_id
>>
>
> No, it's a label! So in Cypher your node (suppose it has 2 labels :LabelA
> and :LabelJ) is described like
>
> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something', p2: 'something else', p3: 'etc.'})
>

JFM: Got that!

JFM: ok basic question... MATCH (n: <---What is "n"? Does it just indicate that it's a node of a particular class? What letter it is is arbitrary, right? Is there a name for what "n" is? For a while there, I thought it was *my_node_id*.

>> Here is some sort of cypher….
>>
>> //Creating the nodes
>>
>> USING PERIODIC COMMIT 1000
>> LOAD CSV WITH HEADERS FROM "…/././…. 
CLT_NODES_LabelA.csv" AS csvline
>> MERGE (my_node_id:Skewer:LabelA {property1: csvline.property1})
>> ON CREATE SET
>> my_node_id.Property2 = csvline.Property2,
>> my_node_id.Property3 = csvline.Property3,
>> my_node_id.Property4 = csvline.Property4;
>>
>> ….
>>
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>> MERGE (my_node_id:Skewer:LabelJ {property1: csvline.property1})
>> ON CREATE SET
>> my_node_id.Property2 = csvline.Property2,
>> my_node_id.Property3 = csvline.Property3,
>> my_node_id.Property4 = csvline.Property4;
>>
>>
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J
>> combine the various labels and their respective values with their
>> corresponding nodes?
>>
>
> A label is not a variable, it does not have a value. It's just a label,
> think of it as a "tag".
> Also, *my_node_id* IS a variable so it does have a value.
>

JFM: OK, I am not understanding this. I understood a "Label" as a general category for a node. This was as opposed to a "Property" that was specific to a particular node. As I understood it, a "Label" has different values. So that Label could be "Category" and there could be two categories, for example... CLT_SOURCE and CLT_TARGET. I thought that makes it like a variable. If not, the label is all the same on a given set of nodes and what's the point in that?

JFM: OK, I get that *my_node_id* is a variable.

> Looking at your 2 code snippets - in case you hope that the first one will
> create a node with LabelA and the second one will assign LabelJ to a node
> which was created earlier, you are wrong.
>
> But... if you remove labels from MERGE, it will work, but look at this
> carefully:
>
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
> MERGE (new_node_A:Skewer {my_node_id: csvline.node_unique_number,
> property1: csvline.property1})
> // only my_node_id and property1 values will be taken into account!
> // no labels, no other properties are taken care of
> // AFAIR we do not need `ON CREATE SET` here, do you really care whether it is a
> // new node or it was created earlier?
> SET
> new_node_A:LabelA,
> new_node_A.Property2 = csvline.Property2,
> new_node_A.Property3 = csvline.Property3,
> new_node_A.Property4 = csvline.Property4;
>
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
> MERGE (new_node_J:Skewer {my_node_id: csvline.node_unique_number,
> property1: csvline.property1})
> // only my_node_id and property1 values will be taken into account!
> // no labels, no other properties are taken care of
> // AFAIR we do not need `ON CREATE SET` here, do you really care whether it is a
> // new node or it was created earlier?
> SET
> new_node_J:LabelJ,
> new_node_J.Property2 = csvline.Property2,
> new_node_J.Property3 = csvline.Property3,
> new_node_J.Property4 = csvline.Property4;
>
>
> What you get if doing things this way:
>
>
> 1. When doing LabelA .csv you will create whatever uniquely numbered
> nodes were not already in the database, fill their properties (or maybe
> overwrite them?) and label the node (be it a new or an existing one) with
> LabelA - no matter what other labels the node (possibly) had,
>

JFM: OK. I get it.

>
> 2. When doing LabelJ .csv you *again* will create whatever uniquely
> numbered nodes were not already in the database, *again* either fill
> or overwrite properties, and *again* label the node (be it a new or an
> existing one) with LabelJ - no matter what other labels the node
> (possibly) had,
>

JFM: OK. I get it.

>
> 3. so if you created some node with the first file and labeled it LabelA,
> and the same unique *my_node_id* occurs in both the first and second files,
> your node will get 2 labels, LabelA and LabelJ.
>

JFM: That's what I want!!
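
The label-accumulation behaviour in the three points above can be imitated with a small in-memory sketch in Python. This is only a toy model, not Neo4j: the `merge_and_label` helper and the sample rows are made up for illustration, nodes are matched on `my_node_id` alone, and properties are assumed to be overwritten by the later file (which is what a plain `SET` does).

```python
# Toy in-memory model of "MERGE on my_node_id, then SET label and properties".
# It only imitates the behaviour described above; it does not talk to Neo4j.

def merge_and_label(db, rows, label):
    """For each row: find or create the node keyed by my_node_id, then tag it."""
    for row in rows:
        node_id = int(row["my_node_id"])  # explicit type conversion
        node = db.setdefault(node_id, {"labels": {"Skewer"}, "props": {}})
        node["labels"].add(label)  # labels already on the node are kept
        props = {k: v for k, v in row.items() if k != "my_node_id"}
        node["props"].update(props)  # assumption: later files overwrite values
    return db


# Hypothetical contents of the LabelA and LabelJ files.
label_a_rows = [
    {"my_node_id": "1", "Property2": "a"},
    {"my_node_id": "2", "Property2": "b"},
]
label_j_rows = [
    {"my_node_id": "2", "Property2": "B"},
    {"my_node_id": "3", "Property2": "c"},
]

db = {}
merge_and_label(db, label_a_rows, "LabelA")
merge_and_label(db, label_j_rows, "LabelJ")

for node_id, node in sorted(db.items()):
    print(node_id, sorted(node["labels"]), node["props"])
```

Node 2 appears in both files, so it ends up carrying `LabelA`, `LabelJ` and `Skewer`, while nodes 1 and 3 each carry one class label plus `Skewer`.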
>
>
>> Q5: Since I think of my data in terms of the two classes of nodes in my
>> Data model …[CLT_SOURCE --> CLT_TARGET ; CLT_TARGET --> CLT_SOURCE], after
>> loading the nodes, how then do I get two classes of nodes?
>>
>
> Make them 2 labels: CLTSource and CLTTarget respectively.
>

JFM: OK. Regarding the labels...my csv file has a column called DESC that has two values, CLT_SOURCE and CLT_TARGET. You are saying that my Source csv should have a CLT_SOURCE column and my target csv should have a CLT_TARGET column? My csv files should NOT have a configuration as I described?

JFM: Since my csv file has its A thru J columns A (2) values, B (1), C (4) D (83), E (83), F (11) G (11) H (83) J (83), K (2), I should have A LOT of csv files instead of just two for nodes!

>> Q6: Is there a step missing that explains how the code below got to have a
>> "source_node" and a "dest_node" that appears to correspond to my CLT_SOURCE
>> and CLT_TARGET nodes?
>>
>
> // suppose we coded relationships as 2 my_node_id's of nodes
> LOAD CSV FROM "...somewhere..." AS csvline
> MATCH (s:CLTSource:Skewer {my_node_id: TOINT(csvline[0])})
> USING INDEX s:Skewer(my_node_id)
> // csvline must be carried through WITH, otherwise it is out of scope below
> WITH s, csvline
> MATCH (t:CLTTarget:Skewer {my_node_id: TOINT(csvline[1])})
> USING INDEX t:Skewer(my_node_id)
> MERGE (s)-[r:MY_RELATIONSHIP_TYPE]->(t)
> SET
> r.prop1 = 'smth';
>
>

JFM: What I am not getting from this is that there is one csv file that has the CLTSOURCE and CLTTARGET labels in it. That contradicts what I said above because that would make only 1 csv file. I assume there is one LOAD CSV statement and the my_node_id: TOINT(csvline[0]) and my_node_id: TOINT(csvline[1]) refer presumably to two lines in that file.

>
>>> 4. Now when you are done with nodes and start doing LOAD CSV for
>>> relationships, you may give the MATCH statement, which looks up your pair
>>> of nodes, a hint for fast lookup, like
>>>
>>> LOAD CSV ...from somewhere... 
>>> AS csvline
>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}),
>>>       (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ...,
>>>         rel_prop_NN: csvline[ZZ]}]->(dest_node);
>>>
>>>
>> Q6: This LOAD CSV command (line 1) looks into the separate REL.csv file
>> you mentioned first, right?
>>
>
> Yep
>
>
>> Q7: csvline is some sort of temp file that is a series of lines of the
>> csv file?
>>
>
> This is a variable - a collection which is filled with the column values of
> the .csv, line by line. You can use it either as an array, referring to fields
> by their index (my preferred way) - or, if you use `WITH HEADERS` mode, you can
> use it as a keyed map. See
> http://neo4j.com/docs/2.1.6/cypherdoc-importing-csv-files-with-cypher.html
>
>> Q8: Do you imply in line 2 that the REL.csv file has headers that include
>> source_node, dest_node ?
>>
>
> No, I don't use headers so I refer to csvline fields by their index
> ("collection mode")
>
>
>> Q9: While I see how Skewer is a label, how is my_node_id a property
>> (line 2) ?
>>
>
> Because it IS a property of a node, and you build the constraint & index *on
> this exact property* inside the scope of the label :Skewer
>

JFM: OK.

>> Q10: How does my_node_id relate to either ToInt(csvline[0]) or
>> ToInt(csvline[1]) (line 2) ?
>>
>
> For the .csv with relationships, csvline[0] is the value of the *my_node_id*
> property of the *source* node, csvline[1] is the value of the *my_node_id*
> property of the *target* node, and TOINT() type conversion is used because my
> personal preference is to use integers for ids.
>
>
>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?
>>
>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and
>> csvline[ZZ] (line 3) ?
>>
>

JFM: OK, I think I get it.

> I think you can combine the import of multiple .CSV files in a single LOAD CSV
> statement but I never tried this mode.
>
> WBR,
> Andrii
>

JFM: Thanks!
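
As a rough analogy for the two ways of reading `csvline` discussed in Q7 to Q10, Python's standard `csv` module offers the same choice: `csv.reader` yields each row as a list addressed by position, and `csv.DictReader` yields each row as a mapping keyed by header. The column names and the third `weight` column below are invented for illustration; the ids are taken from the sample relationship lines earlier in the thread.

```python
import csv
import io

# Hypothetical relationship data; the header names are made up for this sketch.
raw = "source_id,dest_id,weight\n10000017,20000362,0.5\n10000018,30000397,1.5\n"

# Without headers: each row is a list, fields are addressed by position
# (comparable to csvline[0], csvline[1] in "collection mode").
by_index = list(csv.reader(io.StringIO(raw)))[1:]  # skip the header row by hand
first = by_index[0]
source_id, dest_id = int(first[0]), int(first[1])

# With headers: each row is a mapping, fields are addressed by column name
# (comparable to csvline.some_column with WITH HEADERS).
by_name = list(csv.DictReader(io.StringIO(raw)))
weight = float(by_name[0]["weight"])

print(source_id, dest_id, weight)
print(type(first[0]).__name__)  # raw fields arrive as strings
```

Each row describes one relationship: position 0 (or `source_id`) identifies the source node and position 1 (or `dest_id`) the target node, so the two ids come from two columns of the same line rather than from two lines.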
>>> Adding the *`:Skewer`* label in MATCH will tell Cypher to (implicitly) use
>>> your index on *my_node_id* which was created when you created your
>>> constraint. Or you may try to explicitly give it a hint to use the index,
>>> with a USING INDEX... clause after MATCH before CREATE. Btw some earlier
>>> versions of Neo4j refused to use the index in LOAD CSV for some reason, I hope
>>> this problem is gone with 2.1.5.
>>>
>> OK
>>
>>
>>> 5. While importing, be careful to *explicitly specify type conversions
>>> for each property which is not a string*. I have seen numerous
>>> occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and
>>> Cypher silently stored their (supposed) numerics as strings. It's Ok, dude,
>>> you say it :) This led to confusion afterwards when, say, numerical
>>> comparisons don't match and so on (though it's easy to correct with a
>>> single Cypher command, but anyway).
>>>
>> Think I did that re. type conversion. Only applies to properties for my
>> data.
>>
>> Sorry for so many questions. I am really interested in figuring this out!
>>
>> Thanks loads,
>> Jose
>>
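
The confusion described in point 5 is not specific to Cypher. As a minimal illustration in Python (used here only as a stand-in, with made-up values), a number kept as a string neither equals the corresponding integer nor sorts in numeric order:

```python
# Values read from a CSV are strings until they are converted explicitly.
stored_as_string = "10000017"
stored_as_int = int(stored_as_string)

# Equality: the string never matches the number it spells out.
string_matches = stored_as_string == 10000017
int_matches = stored_as_int == 10000017

# Ordering: strings compare character by character, not by magnitude.
sorted_as_strings = sorted(["100", "20", "3"])
sorted_as_ints = sorted([100, 20, 3])

print(string_matches, int_matches)
print(sorted_as_strings)
print(sorted_as_ints)
```

This is why a lookup by integer `my_node_id` finds nothing when the id was loaded without `ToInt()`, and why range comparisons on unconverted values give surprising results.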