Just one thing:

Never use more than one label and one property in MERGE, otherwise it won't 
use indexes.

And use ... ON CREATE SET ... for the other properties.
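
A minimal sketch of what I mean, using the :Skewer label and my_node_id 
property from the quoted thread. The file path and the column names 
(my_node_id, property1, property2) are placeholders, and it assumes each 
my_node_id appears in only one node file:

```cypher
// Assumed to exist already, so the MERGE lookup can use its index.
CREATE CONSTRAINT ON (n:Skewer) ASSERT n.my_node_id IS UNIQUE;

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
// MERGE on exactly one label and one property ...
MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
// ... and set the extra label and all other properties only on creation.
ON CREATE SET
  n:LabelA,
  n.property1 = csvline.property1,
  n.property2 = csvline.property2;
```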

> On 29.11.2014 at 12:35, Andrii Stesin <ste...@gmail.com> wrote:
> 
> Hi Jose,
> 
> On Saturday, November 29, 2014 1:12:52 AM UTC+2, José F. Morales wrote:
>> 
>>> 
>>> 1. On nodes and their labels. First of all, I strongly suggest that you 
>>> separate your nodes into different .csv files by label. So you won't have a 
>>> column `label` in your .csv but rather a set of files:
>>> 
>>> nodes_LabelA.csv
>>> ...
>>> nodes_LabelZ.csv
>>> 
>>> whatever your labels are. (Consider a label to be a kind of synonym for 
>>> `class` in object-oriented programming or `table` in an RDBMS.) That's due 
>>> to the fact that labels in Cypher are somewhat specific entities and you 
>>> probably won't be allowed to parameterize them as variables inside your LOAD 
>>> CSV statement.
>> 
>> OK, so you have modified your original idea of putting the db into two files: 
>> 1 for nodes, 1 for relationships. Now you say to put the nodes into 1 file 
>> per label. The way I have worked with it, I created 1 file for a class of 
>> nodes I'll call CLT_SOURCE and another file for a class of nodes called 
>> CLT_TARGET.
> 
> Ok, but how many valid distinct combinations of your 10 node labels may 
> exist? I was speaking about a simple case where you have some limited number 
> of possible node labels (or their combinations), say less than 10.
> 
>> You are recommending that with the nodes, I take two steps...
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes, 
> 
> Not necessarily "combine", but just give each node a unique (temporary) 
> my_node_id; see my "10+M tree" example below.
>  
>> 2) then I split that file into files that correspond to the node: 
>> my_node_id,  1 label, and then properties P1...Pn.  Since I have 10 
>> Labels/node, I should have 10 files named..... Nodes_LabelA... Nodes_LabelJ. 
>>  Thus...
> 
> You may have as many labels per node as you wish, but it is all about how 
> many valid distinct combinations of labels you have. (One single label is a 
> combination itself, obviously.)
> 
> If you have a limited number of valid label combinations, that's one story. 
> But if we are talking about the order of 10! possible valid combinations, 
> the story is somewhat more interesting :) Which setup is yours?
>  
>> File:  CLT_Nodes-LabelA     columns:  my_node_id, label A, property P1..., 
>> property P4
>> ...
>> File:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, property P1..., 
>> property P4
>> 
>> 
>> Q1: What are the rules about what can be used for my_node_id?  I have 
>> usually seen them as a letter integer combination. Is that the convention?   
>> Sometimes I've seen a letter being used with a specific class of nodes  
>> a1..a100 for one class and b1..b100 for another.  I learned the hard way 
>> that you have to give each node a unique ID.  I used CLT_1...CLT_n for my 
>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It 
>> worked with the smaller db I made.  Anything wrong using the convention 
>> n1...n100?
> 
> I'm not aware of any conventions here; the only thing I know for sure is that 
> a schema index works much(!) faster on plain integers than on Unicode strings. 
> That's the only difference which I consider significant. So my personal 
> preference is for my_node_id to be a unique integer. Once, when importing 
> 10+ million nodes into a tree with variable height [1..7], where each level 
> of nodes was in a separate file (because of the level's own unique label and 
> unique set of properties), I just chose a numbering scheme like
> 
> :Skewer:Level1 my_node_id = 10000000 + file1.csv line number
> :Skewer:Level2 my_node_id = 20000000 + file2.csv line number
> ...
> :Skewer:Level7 my_node_id = 70000000 + file7.csv line number
> 
> so the relationship file (all relationships were of the same single type) 
> became a simple 2-column .csv like this, with 10+ million lines
> 
> 10000017,20000362
> 10000017,20000547
> 10000017,40083215
> 10000018,30000397
> ...
> 
> After successfully importing the 7 node files (so the nodes were ready in the 
> db and indexed on their unique my_node_id under the label :Skewer), I split 
> relationships.csv into 1000+ files of 10000 lines each and wrote a dumb 
> shell script which loaded them with `neo4j-shell -c` file by file, doing 
> `sleep 60` between files (to give neo4j a minute to complete each batch 
> transaction), then started it Friday evening and got my tree ready on Monday 
> morning :)
> 
> If you prefer alphanumerics for my_node_id it's completely up to you :) 
> Anyway, after successful import you may prefer to remove those temporary ids 
> completely from the database, just to conserve space where properties are 
> stored.
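> 
> A minimal sketch of that cleanup, assuming the unique constraint on 
> :Skewer(my_node_id) discussed in this thread was created and is no longer 
> needed. On a large graph the REMOVE may have to be done in batches:
> 
> ```cypher
> // Drop the uniqueness constraint (and its backing index) first.
> DROP CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;
> 
> // Then remove the temporary id property from every node that carries it.
> MATCH (n:Skewer)
> REMOVE n.my_node_id;
> ```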
>  
>>> 2. Then consider one additional "technological" label, let's name it 
>>> `:Skewer` because it will "penetrate" all your nodes of every different 
>>> label (class) like a kebab skewer.
>>> 
>>> Before you start (or at least before you start importing relationships) do
>>> 
>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS 
>>> UNIQUE;
>> 
>> Q2:  Should I do scenario 1 or 2?
>> 
>> Scenario 1:  add two labels to each file?  One from my original nodes and 
>> one as "Skewer"
>> 
>> File 1:  CLT_Nodes-LabelA     columns:  my_node_id, label A, Skewer, 
>> property P1..., property P4
>> ...
>> File 2:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, Skewer, 
>> property P1..., property P4
>>  
>> OR 
>> 
>> Scenario 2:  Include an eleventh file thus....
>> 
>> File 11:  CLT_Nodes-LabelK     columns:  my_node_id, Skewer, property P1..., 
>> property P4 
>> 
>> From below, I think you mean Scenario 1.
> 
> Yes and you don't need to add a column for :Skewer label into a file, the 
> LOAD CSV statement should assign it.
>  
>> Q3: “Skewer” is just an integer right?  It corresponds in a way to 
>> my_node_id 
> 
> No, it's a label! so in Cypher your node (suppose it has 2 labels :LabelA and 
> :LabelJ ) is described like
> 
> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something', p2: 
> 'something else', p3: 'etc.'})
> 
>> Here is some sort of cypher….
>>  
>> //Creating the nodes
>> 
>> USING PERIODIC COMMIT 1000 
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline 
>> MERGE (my_node_id:Skewer:LabelA {property1: csvline.property1}) 
>> ON CREATE SET  
>> n.Property2 = csvline.Property2,  
>> n.Property3 = csvline.Property3,  
>> n.Property4 = csvline.Property4;
>> 
>> ….
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>> MERGE (my_node_id:Skewer:LabelJ {property1: csvline.property1}) 
>> ON CREATE SET  
>> n.Property2 = csvline.Property2,  
>> n.Property3 = csvline.Property3,  
>> n.Property4 = csvline.Property4;
>> 
>>  
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine 
>> the various labels and their respective values with their corresponding 
>> nodes? 
> 
> A label is not a variable; it does not have a value. It's just a label, 
> think of it as a "tag".
> Also, my_node_id IS a variable, so it does have a value.
> 
> Looking at your 2 code snippets: if you hope that the first one will 
> create a node with LabelA and the second one will assign LabelJ to a node 
> which was created earlier, you are wrong. But if you remove the labels from 
> MERGE, it will work. Look at this carefully:
> 
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
> MERGE (new_node_A:Skewer {my_node_id: csvline.node_unique_number, property1: 
> csvline.property1})
> // only my_node_id and property1 values will be taken into account! no 
> // other labels, no other properties are taken care of
> // AFAIR we do not need `ON CREATE SET` here, do you really care whether it 
> // is a new node or it was created earlier?
> SET
> new_node_A:LabelA,
> new_node_A.Property2 = csvline.Property2,
> new_node_A.Property3 = csvline.Property3,
> new_node_A.Property4 = csvline.Property4;
> 
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
> MERGE (new_node_J:Skewer {my_node_id: csvline.node_unique_number, property1: 
> csvline.property1})
> // only my_node_id and property1 values will be taken into account! no 
> // other labels, no other properties are taken care of
> // AFAIR we do not need `ON CREATE SET` here, do you really care whether it 
> // is a new node or it was created earlier?
> SET
> new_node_J:LabelJ,
> new_node_J.Property2 = csvline.Property2,
> new_node_J.Property3 = csvline.Property3,
> new_node_J.Property4 = csvline.Property4;
> 
> 
> What you get if you do things this way:
> 
> When loading the LabelA .csv you will create whatever uniquely numbered nodes 
> were not already in the database, fill their properties (or maybe overwrite 
> them?) and label the node (be it a new or an existing one) with LabelA, no 
> matter what other labels the node (possibly) had.
> When loading the LabelJ .csv you will again create whatever uniquely numbered 
> nodes were not already in the database, again either fill or overwrite 
> properties, and again label the node (be it a new or an existing one) with 
> LabelJ, no matter what other labels the node (possibly) had.
> So if you created some node with the first file and labeled it LabelA, and 
> the same unique my_node_id occurs in both the first and second files, your 
> node will get 2 labels, LabelA and LabelJ.
>  
>> Q5: Since I think of my data in terms of the two classes of nodes in my Data 
>> model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after 
>> loading the nodes, how do I then get two classes of nodes?
> 
> Make them 2 labels: CLTSource and CLTTarget respectively.
>  
>> Q6: Is there a step missing that explains how the code below got to have a 
>> “source_node” and a “dest_node” that appears to correspond to my CLT_SOURCE 
>> and CLT_TARGET nodes?
> 
> // suppose we coded relationships as 2 my_node_id's of nodes
> LOAD CSV FROM "...somewhere..." AS csvline
> MATCH (s:CLTSource:Skewer {my_node_id: TOINT(csvline[0])})
> USING INDEX s:Skewer(my_node_id)
> // keep csvline in scope, it is still needed for the second lookup
> WITH s, csvline
> MATCH (t:CLTTarget:Skewer {my_node_id: TOINT(csvline[1])})
> USING INDEX t:Skewer(my_node_id)
> MERGE (s)-[r:MY_RELATIONSHIP_TYPE]->(t)
> SET
> r.prop1 = 'smth';
> 
> 
> 
>>> 4. Now when you are done with nodes and start doing LOAD CSV for 
>>> relationships, you may give the MATCH statement, which looks up your pair 
>>> of nodes, a hint for fast lookup, like
>>> 
>>> LOAD CSV ...from somewhere... AS csvline
>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}), 
>>> (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., 
>>> rel_prop_NN: csvline[ZZ]}]->(dest_node);
>> 
>> Q6: This LOAD CSV  command (line 1) looks into the separate REL.csv file you 
>> mentioned first right?  
> 
> Yep
>  
>> Q7: csvline is some sort of temp file that is a series of lines of the csv 
>> file? 
> 
> It is a variable: a collection which is filled with the column values of the 
> .csv, line by line. You can use it either as an array, referring to fields by 
> their index (my preferred way), or, if you use `WITH HEADERS` mode, as a 
> keyed map. See 
> http://neo4j.com/docs/2.1.6/cypherdoc-importing-csv-files-with-cypher.html
> 
>> Q8: Do you imply in line 2 that the REL.csv file has headers that include  
>> source_node, dest_node ?
> 
> No, I don't use headers, so I refer to csvline fields by their index 
> ("collection mode").
>  
>> Q9: While I see how Skewer is a label,  how is my_node_id a  property (line 
>> 2) ? 
> 
> Because it IS a property of a node, and you build the constraint & index on 
> this exact property inside the scope of the label :Skewer.
>  
>> Q10: How does my_node_id relate to either ToInt(csvline[0]) or 
>> ToInt(csvline[1])  (line 2) ?
> 
> For .csv with relationships, csvline[0] is a value of my_node_id property of 
> the source node, csvline[1] is a value of my_node_id property of the target 
> node, and TOINT() type conversion is used because my personal preference is 
> to use integers for ids.
>  
>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?  
>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and 
>> csvline[ZZ] (line 3) ?
> 
> I think you can combine import of multiple .CSV files in a single LOAD CSV 
> statement but I didn't ever try this mode.
> 
> WBR,
> Andrii
>  
>>> Adding `:Skewer` label in MATCH will tell Cypher to (implicitly) use your 
>>> index on my_node_id which was created when you created your constraint. Or 
>>> you may try to explicitly give it a hint to use the index, with USING 
>>> INDEX... clause after MATCH before CREATE. Btw some earlier versions of 
>>> Neo4j refused to use index in LOAD CSV for some reason, I hope this problem 
>>> is gone with 2.1.5.
>> OK
>>  
>>> 5. While importing, be careful to explicitly specify type conversions for 
>>> each property which is not a string. I have seen numerous occasions when 
>>> people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and Cypher 
>>> silently stored their (supposed) numerics as strings. It's Ok, dude, you 
>>> say it :) This led to confusion afterwards when, say, numerical comparisons 
>>> didn't match and so on (though it's easy to correct with a single Cypher 
>>> command, but anyway).
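>>> 
>>> A sketch of such a correction, assuming a hypothetical numeric property 
>>> Property2 on :LabelA nodes that was imported as a string. Values that do 
>>> not parse as integers may come back as null, so check a sample first:
>>> 
>>> ```cypher
>>> // Rewrite the stored string value in place as an integer.
>>> MATCH (n:LabelA)
>>> SET n.Property2 = toInt(n.Property2);
>>> ```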
>> Think I did that re. type conversion.  Only applies to properties for my 
>> data.
>>   
>> Sorry for so many questions.  I am really interested in figuring this out!
>> 
>> Thanks loads,  
>> Jose
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to neo4j+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
