Re: [Neo4j] Re: large cypher statements

Michael Hunger Sat, 29 Nov 2014 04:45:39 -0800

Just one thing:

Never use more than one label and one property in MERGE, otherwise it won't use
indexes.

And use ... ON CREATE SET ...

Sent from my iPhone

> On 29.11.2014 at 12:35, Andrii Stesin <ste...@gmail.com> wrote:
> 
> Hi Jose,
> 
> On Saturday, November 29, 2014 1:12:52 AM UTC+2, José F. Morales wrote:
>> 
>>> 
>>> 1. On nodes and their labels. First of all, I strongly suggest that you
>>> separate your nodes into different .csv files by label. So you won't have a
>>> column `label` in your .csv but rather a set of files:
>>> 
>>> nodes_LabelA.csv
>>> ...
>>> nodes_LabelZ.csv
>>> 
>>> whatever your labels are. (Consider a label to be a kind of synonym for
>>> `class` in object-oriented programming or `table` in an RDBMS.) That's due to
>>> the fact that labels in Cypher are somewhat special entities, and you probably
>>> won't be allowed to parameterize them as variables inside your LOAD CSV
>>> statement.
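
A rough sketch of that per-label split in Python, assuming you start from one combined file that has a `label` column. The function name, the `label` column name, and the `nodes_<Label>.csv` output pattern are illustrative assumptions, not anything Neo4j requires.

```python
import csv
import os
from collections import defaultdict


def split_nodes_by_label(src_path, out_dir, label_column="label"):
    """Write one nodes_<Label>.csv per distinct label; return {label: path}."""
    rows_by_label = defaultdict(list)
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        # Every column except the label column is kept in the output files.
        fieldnames = [f for f in reader.fieldnames if f != label_column]
        for row in reader:
            label = row.pop(label_column)
            rows_by_label[label].append(row)

    paths = {}
    for label, rows in rows_by_label.items():
        path = os.path.join(out_dir, "nodes_%s.csv" % label)
        with open(path, "w", newline="", encoding="utf-8") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths[label] = path
    return paths
```

This version holds all rows in memory before writing, which is fine for a sketch but may need a streaming variant for very large files.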
>> 
>> OK, so you have modified your original idea of putting the db into two files 
>> 1 nodes , 1 relationships.  Now here you say, put all the nodes into 1 file/ 
>> label.   The way I have worked with it, I created 1 file for a class of 
>> nodes I'll call CLT_SOURCE and another file for a class of nodes called 
>> CLT_TARGET.
> 
> Ok, but how many valid distinct combinations of your 10 node labels may 
> exist? I was speaking about a simple case where you have some limited number 
> of possible node labels (or their combinations), say less than 10.
> 
>> You are recommending that with the nodes, I take two steps...
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes, 
> 
> Not necessarily "combine", just give each node a unique (temporary)
> my_node_id; see my "10+M tree" example below.
>  
>> 2) then I split that file into files that correspond to the node: 
>> my_node_id,  1 label, and then properties P1...Pn.  Since I have 10 
>> Labels/node, I should have 10 files named..... Nodes_LabelA... Nodes_LabelJ. 
>>  Thus...
> 
> You may have as many labels per node as you wish, but it is all about how many
> valid distinct combinations of labels you have. (One single label is a
> combination itself, obviously.)
> 
> If you have a limited number of valid label combinations, that's one story.
> But if we are talking about something on the order of 10! possible valid
> combinations, the story is somewhat more interesting :) Which setup is yours?
>  
>> File:  CLT_Nodes-LabelA     columns:  my_node_id, label A, property P1..., 
>> property P4
>> ...
>> File:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, property P1..., 
>> property P4
>> 
>> 
>> Q1: What are the rules about what can be used for my_node_id?  I have 
>> usually seen them as a letter integer combination. Is that the convention?   
>> Sometimes I've seen a letter being used with a specific class of nodes  
>> a1..a100 for one class and b1..b100 for another.  I learned the hard way 
>> that you have to give each node a unique ID.  I used CLT_1...CLT_n for my 
>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It 
>> worked with the smaller db I made.  Anything wrong using the convention 
>> n1...n100?
> 
> I'm not aware of any conventions here. The only thing I know for sure is that
> a schema index works much(!) faster on plain integers than on Unicode strings.
> That's the only difference I consider significant, so my personal preference
> is for my_node_id to be a unique integer. Once, when importing 10+ million
> nodes into a tree of variable height [1..7], where each level of nodes was in
> a separate file (because each level had its own unique label and its own set
> of properties), I just chose a numbering scheme like
> 
> :Skewer:Level1 my_node_id = 10000000 + file1.csv line number
> :Skewer:Level2 my_node_id = 20000000 + file2.csv line number
> ...
> :Skewer:Level7 my_node_id = 70000000 + file7.csv line number
> 
> so the relationship file (all relationships were of the same single type)
> became a simple 2-column .csv like this, with 10+ million lines:
> 
> 10000017,20000362
> 10000017,20000547
> 10000017,40083215
> 10000018,30000397
> ...
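
The numbering scheme above amounts to simple arithmetic. A minimal sketch in Python follows; the helper names `node_id` and `level_of` are made up for illustration, and the range check is an added safeguard that the original description does not mention.

```python
LEVEL_OFFSET = 10000000  # one block of ids per tree level, as in the scheme above


def node_id(level, line_number):
    """Return level * 10000000 + line_number for a node found on that line."""
    if not 0 < line_number < LEVEL_OFFSET:
        # Added safeguard (assumption): keep ids inside their level's block.
        raise ValueError("line number does not fit into one level's id range")
    return level * LEVEL_OFFSET + line_number


def level_of(my_node_id):
    """Recover the tree level from an id built by node_id()."""
    return my_node_id // LEVEL_OFFSET
```

For example, `node_id(2, 362)` gives 20000362, the second id in the first sample line, and `level_of(40083215)` gives 4.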
> 
> After successfully importing the 7 node files (so the nodes were in the db and
> indexed on their unique my_node_id under the label :Skewer), I split
> relationships.csv into 1000+ files of 10000 lines each and wrote a dumb
> shell script which loaded them with `neo4j-shell -c`, file by file, doing
> `sleep 60` between files (to give Neo4j a minute to complete each batch
> transaction). Then I started it on Friday evening and had my tree ready on
> Monday morning :)
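
The splitting step could look roughly like this in Python. It is only a sketch: the function name and the `relationships_NNNNN.csv` naming are assumptions, and the loading loop (calling `neo4j-shell` and sleeping between files) is deliberately left out.

```python
import itertools
import os


def split_into_chunks(src_path, out_dir, lines_per_chunk=10000):
    """Copy src_path into numbered files of at most lines_per_chunk lines each."""
    paths = []
    with open(src_path, encoding="utf-8") as src:
        for index in itertools.count():
            # islice pulls the next batch of lines without reading the whole file.
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            path = os.path.join(out_dir, "relationships_%05d.csv" % index)
            with open(path, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

The returned list preserves file order, so a loader can walk it one file at a time. This assumes a headerless file like the sample above; with headers, each chunk would need the header line repeated.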
> 
> If you prefer alphanumerics for my_node_id it's completely up to you :) 
> Anyway, after successful import you may prefer to remove those temporary ids 
> completely from the database, just to conserve space where properties are 
> stored.
>  
>>> 2. Then consider one additional "technological" label, let's name it 
>>> `:Skewer` because it will "penetrate" all your nodes of every different 
>>> label (class) like a kebab skewer.
>>> 
>>> Before you start (or at least before you start importing relationships) do
>>> 
>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS 
>>> UNIQUE;
>> 
>> Q2:  Should I do scenario 1 or 2?
>> 
>> Scenario 1:  add two labels to each file?  One from my original nodes and 
>> one as "Skewer"
>> 
>> File 1:  CLT_Nodes-LabelA     columns:  my_node_id, label A, Skewer, 
>> property P1..., property P4
>> ...
>> File 2:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, Skewer, 
>> property P1..., property P4
>>  
>> OR 
>> 
>> Scenario 2:  Include an eleventh file thus....
>> 
>> File 11:  CLT_Nodes-LabelK     columns:  my_node_id, Skewer, property P1..., 
>> property P4 
>> 
>> From below, I think you mean Scenario 1.
> 
> Yes and you don't need to add a column for :Skewer label into a file, the 
> LOAD CSV statement should assign it.
>  
>> Q3: “Skewer” is just an integer right?  It corresponds in a way to 
>> my_node_id 
> 
> No, it's a label! So in Cypher your node (suppose it has 2 labels, :LabelA and
> :LabelJ) is described like
> 
> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something', p2: 
> 'something else', p3: 'etc.'})
> 
>> Here is some sort of Cypher…
>>  
>> //Creating the nodes
>> 
>>  
>> 
>> USING PERIODIC COMMIT 1000 
>> 
>> LOAD CSV WITH HEADERS FROM “…/././…. CLT_NODES_LabelA.csv" AS csvline 
>> 
>> MERGE (my_node_id:Skewer: LabelA {property1: csvline.property1}) 
>> 
>> ON CREATE SET  
>> 
>> n.Property2 = csvline.Property2,  
>> 
>> n.Property3 = csvline.Property3,  
>> 
>> n.Property4 = csvline.Property4;
>> 
>> ….
>> LOAD CSV WITH HEADERS FROM “…/././…. CLT_NODES_LabelJ.csv" AS csvline
>> 
>>  
>> 
>> MERGE (my_node_id:Skewer: LabelJ {property1: csvline.property1}) 
>> 
>> ON CREATE SET  
>> 
>> n.Property2 = csvline.Property2,  
>> 
>> n.Property3 = csvline.Property3,  
>> 
>> n.Property4 = csvline.Property4;
>> 
>>  
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine 
>> the various labels and their respective values with their corresponding 
>> nodes? 
> 
> A label is not a variable; it does not have a value. It's just a label,
> think of it as a "tag".
> Also, my_node_id IS a variable, so it does have a value.
> 
> Looking at your 2 code snippets: if you hope that the first one will
> create a node with LabelA and the second one will assign LabelJ to a node
> that was created earlier, you are wrong. But if you remove the per-file labels
> from MERGE, it will work. Look at this carefully:
> 
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
> MERGE (new_node_A:Skewer {my_node_id: csvline.node_unique_number, property1: csvline.property1})
> // only the my_node_id and property1 values are matched on; LabelA and the
> // other properties are not taken into account by MERGE
> // AFAIR we do not need `ON CREATE SET` here: do you really care whether it is
> // a new node or one that was created earlier?
> SET
> new_node_A:LabelA,
> new_node_A.Property2 = csvline.Property2,
> new_node_A.Property3 = csvline.Property3,
> new_node_A.Property4 = csvline.Property4;
> 
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
> MERGE (new_node_J:Skewer {my_node_id: csvline.node_unique_number, property1: csvline.property1})
> // only the my_node_id and property1 values are matched on; LabelJ and the
> // other properties are not taken into account by MERGE
> // AFAIR we do not need `ON CREATE SET` here: do you really care whether it is
> // a new node or one that was created earlier?
> SET
> new_node_J:LabelJ,
> new_node_J.Property2 = csvline.Property2,
> new_node_J.Property3 = csvline.Property3,
> new_node_J.Property4 = csvline.Property4;
> 
> 
> What you get by doing things this way:
> 
> When loading the LabelA .csv you will create whatever uniquely numbered nodes
> were not already in the database, fill their properties (or maybe overwrite
> them?) and label the node (be it new or existing) with LabelA, no matter what
> other labels the node may already have.
> When loading the LabelJ .csv you will again create whatever uniquely numbered
> nodes were not already in the database, again either fill or overwrite
> properties, and again label the node (be it new or existing) with LabelJ, no
> matter what other labels the node may already have.
> So if you created some node from the first file and labeled it LabelA, and the
> same unique my_node_id occurs in both the first and second files, your node
> will get 2 labels, LabelA and LabelJ.
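
To make the label accumulation concrete, here is a toy model in plain Python. It is not Neo4j and does not run any Cypher: it simply treats the database as a dict keyed by my_node_id, which is a simplifying assumption (the statements above also match on property1). All names are hypothetical.

```python
def merge_and_label(store, rows, label):
    """Toy model: find-or-create each node by my_node_id, then add a label and set properties."""
    for row in rows:
        # Find-or-create, standing in for MERGE on the :Skewer node.
        node = store.setdefault(
            row["my_node_id"], {"labels": {"Skewer"}, "props": {}}
        )
        # Adding a label never removes the labels that are already there.
        node["labels"].add(label)
        # A plain SET overwrites property values that already exist.
        node["props"].update(
            {key: value for key, value in row.items() if key != "my_node_id"}
        )
    return store


store = {}
merge_and_label(store, [{"my_node_id": 1, "Property2": "a"}], "LabelA")
merge_and_label(
    store,
    [{"my_node_id": 1, "Property2": "b"}, {"my_node_id": 2, "Property2": "c"}],
    "LabelJ",
)
```

After both calls, node 1 carries Skewer, LabelA and LabelJ and its Property2 holds the value from the second file, while node 2 carries only Skewer and LabelJ.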
>  
>> Q5: Since I think of my data in terms of the two classes of nodes in my Data 
>> model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after 
>> loading the nodes, how then I get two classes of nodes?
> 
> Make them 2 labels: CLTSource and CLTTarget respectively.
>  
>> Q6: Is there a step missing that explains how the code below got to have a 
>> “source_node” and a “dest_node” that appears to correspond to my CLT_SOURCE 
>> and CLT_TARGET nodes?
> 
> // suppose we coded relationships as the 2 my_node_id's of the nodes
> LOAD CSV FROM "...somewhere..." AS csvline
> MATCH (s:CLTSource:Skewer {my_node_id: TOINT(csvline[0])})
> USING INDEX s:Skewer(my_node_id)
> // csvline must be carried through WITH, or it is out of scope below
> WITH s, csvline
> MATCH (t:CLTTarget:Skewer {my_node_id: TOINT(csvline[1])})
> USING INDEX t:Skewer(my_node_id)
> MERGE (s)-[r:MY_RELATIONSHIP_TYPE]->(t)
> SET
> r.prop1 = 'smth';
> 
> 
> 
>>> 4. Now when you are done with nodes and start doing LOAD CSV for 
>>> relationships, you may give the MATCH statement, which looks up your pair 
>>> of nodes, a hint for fast lookup, like
>>> 
>>> LOAD CSV ...from somewhere... AS csvline
>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}),
>>> (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., 
>>> rel_prop_NN: csvline[ZZ]}]->(dest_node);
>> 
>> Q6: This LOAD CSV  command (line 1) looks into the separate REL.csv file you 
>> mentioned first right?  
> 
> Yep
>  
>> Q7: csvline is some sort of temp file that is a series of lines of the csv 
>> file? 
> 
> It is a variable: a collection which is filled with the column values of the
> .csv, line by line. You can use it either as an array, referring to fields by
> their index (my preferred way), or, if you use `WITH HEADERS` mode, as a
> keyed map. See
> http://neo4j.com/docs/2.1.6/cypherdoc-importing-csv-files-with-cypher.html
> 
>> Q8: Do you imply in line 2 that the REL.csv file has headers that include  
>> source_node, dest_node ?
> 
> No, I don't use headers, so I refer to csvline fields by their index
> ("collection mode").
>  
>> Q9: While I see how Skewer is a label,  how is my_node_id a  property (line 
>> 2) ? 
> 
> Because it IS a property of a node, and you build the constraint and index on
> exactly this property within the scope of the label :Skewer.
>  
>> Q10: How does my_node_id relate to either ToInt(csvline[0]) or 
>> ToInt(csvline[1])  (line 2) ?
> 
> For the .csv with relationships, csvline[0] is the value of the my_node_id
> property of the source node, csvline[1] is the value of the my_node_id
> property of the target node, and the TOINT() type conversion is used because
> my personal preference is to use integers for ids.
>  
>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?  
>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and 
>> csvline[ZZ] (line 3) ?
> 
> I think you can combine the import of multiple .csv files in a single LOAD CSV
> statement, but I have never tried this mode.
> 
> WBR,
> Andrii
>  
>>> Adding `:Skewer` label in MATCH will tell Cypher to (implicitly) use your 
>>> index on my_node_id which was created when you created your constraint. Or 
>>> you may try to explicitly give it a hint to use the index, with USING 
>>> INDEX... clause after MATCH before CREATE. Btw some earlier versions of 
>>> Neo4j refused to use index in LOAD CSV for some reason, I hope this problem 
>>> is gone with 2.1.5.
>> OK
>>  
>>> 5. While importing, be careful to explicitly specify type conversions for
>>> each property which is not a string. I have seen numerous occasions when
>>> people missed ToInt(csvline[i]) or ToFloat(csvline[j]), and Cypher
>>> silently stored their (supposed) numerics as strings. It's OK, dude, you
>>> said so :) This leads to confusion afterwards when, say, numerical
>>> comparisons don't match and so on (though it's easy to correct with a
>>> single Cypher command, but anyway).
>> Think I did that re. type conversion.  Only applies to properties for my 
>> data.
>>   
>> Sorry for so many questions.  I am really interested in figuring this out!
>> 
>> Thanks loads,  
>> Jose
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to neo4j+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
