Thanks Dude!  I’ll let you know!

> On Nov 28, 2014, at 8:03 PM, Michael Hunger 
> <michael.hun...@neotechnology.com> wrote:
> 
> If you look at the video it is pretty obvious; she outlines all the major steps and pitfalls.
> 
> Except for one: create nodes and rels separately if you need more than one 
> MERGE.
> 
> GOOD: merge|match|create node, merge|match|create node, create rel
> GOOD: match node, match node, MERGE rel
> 
> BAD: match|create node, merge node, merge rel
> BAD: match node, set node.prop
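> 
> For instance, a minimal sketch of the GOOD pattern (hypothetical :Person/:Company labels and a made-up rels.csv path; the nodes are assumed to have been merged or created in earlier statements):
> 
> LOAD CSV WITH HEADERS FROM "file:///path/to/rels.csv" AS line
> MATCH (p:Person {id: line.person_id})
> MATCH (c:Company {id: line.company_id})
> MERGE (p)-[:WORKS_AT]->(c);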
> 
> 
> 
> On Sat, Nov 29, 2014 at 1:53 AM, FANC2 <josef...@gmail.com> wrote:
> Both. Using what I did before, the loading either never finished or failed.  
> I’m trying not to follow that example while figuring this out!!  :)
> 
>> On Nov 28, 2014, at 7:50 PM, Michael Hunger 
>> <michael.hun...@neotechnology.com> wrote:
>> 
>> What takes so long? The loading? Or figuring it out?
>> 
>> Michael
>> 
>> 
>> On Sat, Nov 29, 2014 at 1:18 AM, José F. Morales <josef...@gmail.com> wrote:
>> Hey Michael,
>> 
>> I'll check it out.   Trouble is knowing what over-complicating is.  Thanks 
>> for the heads up!
>> 
>> I am trying to figure out inductively how to use LOAD CSV from various 
>> examples.  Thanks for another one.  
>> 
>> It’s killing me that it’s taking so long.  
>> 
>> Jose
>> 
>> 
>> 
>> On Friday, November 28, 2014 6:16:49 PM UTC-5, Michael Hunger wrote:
>> José 
>> 
>> If you watch Nicole's webinar, many things will become clear: 
>> https://vimeo.com/112447027
>> You don't have to overcomplicate things.
>> 
>> The Skewer(id) thing is not really needed if each of your entities has a 
>> label and a primary key of some sort.
>> It is just an optimization so you don't have to think about separate entities.
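>> 
>> In other words, a sketch only (assuming CLT_SOURCE and CLT_TARGET each carry their own unique my_node_id), one uniqueness constraint per label would do the same job:
>> 
>> CREATE CONSTRAINT ON (n:CLT_SOURCE) ASSERT n.my_node_id IS UNIQUE;
>> CREATE CONSTRAINT ON (n:CLT_TARGET) ASSERT n.my_node_id IS UNIQUE;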
>> 
>> Cheers, Michael
>> 
>> On Sat, Nov 29, 2014 at 12:12 AM, José F. Morales <jose...@gmail.com> wrote:
>> Hey Andrii,
>> 
>> I've been thinking a lot about your recommendations. I have some questions, 
>> some of which show how ignorant I am. Apologies for the basics if necessary.
>> 
>> On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:
>> Before you start.
>> 
>> 1. On nodes and their labels. First of all, I strongly suggest you separate 
>> your nodes into different .csv files by label. So you won't have a column 
>> `label` in your .csv but rather a set of files:
>> 
>> nodes_LabelA.csv
>> ...
>> nodes_LabelZ.csv
>> 
>> whatever your labels are. (Consider a label to be kind of a synonym for `class` 
>> in object-oriented programming or `table` in an RDBMS.) That's due to the fact 
>> that labels in Cypher are somewhat specific entities, and you probably won't 
>> be allowed to parameterize them into variables inside your LOAD CSV 
>> statement.
>> 
>> 
>> OK, so you have modified your original idea of putting the db into two files: 
>> 1 for nodes, 1 for relationships. Now here you say to put all the nodes into 
>> 1 file per label. The way I have worked with it, I created 1 file for a class of 
>> nodes I'll call CLT_SOURCE and another file for a class of nodes called 
>> CLT_TARGET. Then I have a file for the relationships. Perhaps foolishly, I 
>> originally would create 1 file that combined all of this info and try 
>> to paste it in the browser or in the shell. Neither worked, even though it did 
>> with a smaller amount of data.
>> 
>> You are recommending that with the nodes, I take two steps...
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes, 
>> 2) then split that file into files that correspond to the node: 
>> my_node_id, 1 label, and then properties P1...Pn. Since I have 10 
>> labels/node, I should have 10 files named Nodes_LabelA ... Nodes_LabelJ. 
>> Thus...
>> 
>> File:  CLT_Nodes-LabelA     columns:  my_node_id, label A, property P1..., property P4
>> ...
>> File:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, property P1..., property P4
>> 
>> 
>> Q1: What are the rules about what can be used for my_node_id? I have 
>> usually seen them as a letter-integer combination. Is that the convention?   
>> Sometimes I've seen a letter being used with a specific class of nodes: 
>> a1..a100 for one class and b1..b100 for another. I learned the hard way 
>> that you have to give each node a unique ID. I used CLT_1...CLT_n for my 
>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It 
>> worked with the smaller db I made. Anything wrong with using the convention 
>> n1...n100?
>>  
>>  
>> 2. Then consider one additional "technological" label, let's name it 
>> `:Skewer` because it will "penetrate" all your nodes of every different 
>> label (class) like a kebab skewer.
>> 
>> Before you start (or at least before you start importing relationships) do
>> 
>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;
>> 
>> 
>> Q2:  Should I do scenario 1 or 2?
>> 
>> Scenario 1:  add two labels to each file?  One from my original nodes and 
>> one as "Skewer"
>> 
>> File 1:  CLT_Nodes-LabelA     columns:  my_node_id, label A, Skewer, 
>> property P1..., property P4
>> ...
>> File 2:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, Skewer, 
>> property P1..., property P4
>>  
>> OR 
>> 
>> Scenario 2:  Include an eleventh file thus....
>> 
>> File 11:  CLT_Nodes-LabelK     columns:  my_node_id, Skewer, property P1..., 
>> property P4 
>> 
>> From below, I think you mean Scenario 1.
>> 
>> Q3: "Skewer" is just an integer, right? It corresponds in a way to 
>> my_node_id.
>> 
>> 3. When doing LOAD CSV with nodes, make sure each node gets 2 (two) 
>> labels, one of them being `:Skewer`. This will create an index on the `my_node_id` 
>> attribute (which makes relationship creation some orders of magnitude faster), and 
>> as a bonus you'll be sure you don't have accidental duplicate nodes.
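>> 
>> A minimal sketch of that two-label load (assuming each node CSV carries a my_node_id column; the file path and property names are placeholders):
>> 
>> USING PERIODIC COMMIT 1000
>> LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
>> MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
>> SET n:LabelA,
>>     n.property1 = csvline.property1;
>> 
>> i.e. the MERGE keys on the constrained :Skewer(my_node_id) combination, and the second label plus the remaining properties are SET on the same node.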
>> 
>> 
>> Here is some sort of Cypher….
>> 
>> //Creating the nodes
>> 
>> USING PERIODIC COMMIT 1000 
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline 
>> MERGE (n:Skewer:LabelA {property1: csvline.property1}) 
>> ON CREATE SET  
>>   n.Property2 = csvline.Property2,  
>>   n.Property3 = csvline.Property3,  
>>   n.Property4 = csvline.Property4; 
>> 
>> ….
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>> MERGE (n:Skewer:LabelJ {property1: csvline.property1}) 
>> ON CREATE SET  
>>   n.Property2 = csvline.Property2,  
>>   n.Property3 = csvline.Property3,  
>>   n.Property4 = csvline.Property4;
>> 
>>  
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine 
>> the various labels and their respective values with their corresponding 
>> nodes? 
>> Q5: Since I think of my data in terms of the two classes of nodes in my data 
>> model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE], after 
>> loading the nodes, how do I then get the two classes of nodes? 
>> Q6: Is there a step missing that explains how the code below got to have a 
>> "source_node" and a "dest_node" that appear to correspond to my CLT_SOURCE 
>> and CLT_TARGET nodes?
>>  
>> 
>>  
>> 4. Now when you are done with nodes and start doing LOAD CSV for 
>> relationships, you may give the MATCH statement, which looks up your pair of 
>> nodes, a hint for a fast lookup, like:
>> 
>> LOAD CSV ...from somewhere... AS csvline
>> MATCH (source_node:Skewer {my_node_id: toInt(csvline[0])}), (dest_node:Skewer {my_node_id: toInt(csvline[1])})
>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);
>> 
>> 
>> Q7: This LOAD CSV command (line 1) looks into the separate REL.csv file you 
>> mentioned first, right?  
>> Q8: csvline is some sort of temp file that is a series of lines of the CSV 
>> file? 
>> Q9: Do you imply in line 2 that the REL.csv file has headers that include 
>> source_node and dest_node?
>> Q10: While I see how Skewer is a label, how is my_node_id a property (line 
>> 2)? 
>> Q11: How does my_node_id relate to either toInt(csvline[0]) or 
>> toInt(csvline[1]) (line 2)?
>> Is it that toInt(csvline[0]) refers to a line of the REL.csv file?  
>> Does csvline[0] refer to a column in REL.csv, as do csvline[2] and 
>> csvline[ZZ] (line 3)?
>>  
>> Adding the `:Skewer` label in MATCH will tell Cypher to (implicitly) use your 
>> index on my_node_id, which was created when you created your constraint. Or 
>> you may try to explicitly give it a hint to use the index, with a USING 
>> INDEX... clause after MATCH and before CREATE. Btw, some earlier versions of 
>> Neo4j refused to use the index in LOAD CSV for some reason; I hope this problem 
>> is gone with 2.1.5.
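>> 
>> A rough sketch of the explicit-hint variant (placeholder file path; the property is spelled out in a WHERE equality so the hint can apply):
>> 
>> LOAD CSV FROM "file:///path/to/rels.csv" AS csvline
>> MATCH (source_node:Skewer)
>> USING INDEX source_node:Skewer(my_node_id)
>> WHERE source_node.my_node_id = toInt(csvline[0])
>> MATCH (dest_node:Skewer)
>> USING INDEX dest_node:Skewer(my_node_id)
>> WHERE dest_node.my_node_id = toInt(csvline[1])
>> CREATE (source_node)-[:MY_REL_TYPE]->(dest_node);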
>> 
>> OK
>>  
>> 5. While importing, be careful to explicitly specify type conversions for 
>> each property which is not a string. I have seen numerous occasions when 
>> people missed toInt(csvline[i]) or toFloat(csvline[j]) - and Cypher silently 
>> stored their (supposed) numerics as strings. "It's OK, dude," you say :) 
>> But this leads to confusion afterwards when, say, numerical comparisons don't 
>> MATCH and so on (though it's easy to correct with a single Cypher command, 
>> but anyway).
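>> 
>> A small sketch of what that looks like in practice (made-up file path and column names; only the non-string columns get converted):
>> 
>> LOAD CSV WITH HEADERS FROM "file:///path/to/nodes.csv" AS csvline
>> MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
>> SET n.score = toFloat(csvline.score),   // stored as a number, not "12.5"
>>     n.count = toInt(csvline.count),     // stored as an integer, not "42"
>>     n.name  = csvline.name;             // strings can stay as they are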
>> 
>> I think I did that re: type conversion. It only applies to properties for my 
>> data.
>>   
>> Sorry for so many questions.  I am really interested in figuring this out!
>> 
>> Thanks loads,  
>> Jose
>> 
>>  
>> WBR,
>> Andrii
>> 
>> On Wednesday, November 19, 2014 9:36:50 PM UTC+2, José F. Morales wrote:
>> 
>> 3. CSV approach 
>>      a. “Dump the base into 2 .csv files:”
>>      b. CSV1:  “Describe nodes (enumerate them via some my_node_id integer attribute),  columns: my_node_id,label,node_prop_01,node_prop_ZZ”
>>      c. CSV2:  “Describe relations,  columns: source_my_node_id,dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
>>      d. Indexes / constraints:  before starting the import —> have appropriate indexes / constraints
>>      e. Via LOAD CSV, import CSV1, then CSV2. 
>>      f. Import no more than 10,000-30,000 lines in a single LOAD CSV statement 
>> 
>> This seems to be a very well elaborated method and the easiest for me to do. 
>> I have files such that I can create these without too much problem. I 
>> figure I'll split the nodes into three files of 20k rows each. I can do the 
>> same with the rels. I have not used indexes or constraints yet in the db's 
>> that I already created, and as I said above, I'll have to see how to use them.
>> 
>> I am assuming column headers that fit with my data are consistent with what 
>> you explained below (like, I can put my own meaningful text into label1-10 
>> and node_prop_01-05)....
>> my_node_id, label1, label2, label3, label4, label5, label6, label7, label8, label9, label10, node_prop_01, node_prop_02, node_prop_03, node_prop_04, node_prop_ZZ
>> 
>> Thanks again Fellas!!
>> 
>> Jose
>> 
>> 
>> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
>> José,
>> 
>> Let's continue the discussion on the google group
>> 
>> With larger I meant the amount of data, not the size of statements.
>> 
>> As I also point out in various places, we recommend creating only small 
>> subgraphs with a single statement, separated by semicolons.
>> E.g. up to 100 nodes and rels.
>> 
>> Gigantic statements just make the parser explode.
>> 
>> I recommend splitting them up into statements creating subgraphs,
>> or create nodes and later match them by label & property to connect them.
>> Make sure to have appropriate indexes / constraints.
>> 
>> You should also surround blocks of statements with begin and commit commands.
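>> 
>> A rough sketch of such a shell script (borrowing the CLT naming from this thread; the properties and values are made up), with each statement creating a small self-contained subgraph and the whole block wrapped in one transaction:
>> 
>> begin
>> CREATE (:CLT_SOURCE {my_node_id: 'CLT_1'})-[:MEASUREMENT_OF]->(:CLT_TARGET {my_node_id: 'CLT_TARGET_1'});
>> CREATE (:CLT_SOURCE {my_node_id: 'CLT_2'})-[:MEASUREMENT_OF]->(:CLT_TARGET {my_node_id: 'CLT_TARGET_2'});
>> commit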
>> 
>> Sent from my iPhone
>> 
>> On 19.11.2014, at 04:18, José F. Morales Ph.D. <jm3...@columbia.edu> wrote:
>> 
>>> Hey Michael and Kenny
>>> 
>>> Thank you guys a bunch for the help.
>>> 
>>> Let me give you a little background. I am charged with making a prototype of a 
>>> tool (“LabCards”) that we hope to use in the hospital and beyond at some 
>>> point. In preparation for making the main prototype, I made two prior 
>>> Neo4j databases that worked exactly as I wanted them to. The first 
>>> database was built with NIH data and had 183 nodes and around 7500 
>>> relationships. The second database was the pre-prototype and it had 1080 
>>> nodes and around 2000 relationships. I created these in the form of Cypher 
>>> statements and either pasted them into the Neo4j browser or used the neo4j 
>>> shell and loaded them as text files. Before doing that, I checked the Cypher 
>>> code with Sublime Text 2, which highlights the code. Both databases loaded 
>>> fine with both methods and did what I wanted them to do.  
>>> 
>>> As you might imagine, the prototype is an expansion of the mini-prototype. 
>>> It has almost the same data model, and I built it as a series of Cypher 
>>> statements as well. My first version of the prototype had ~60k nodes and 
>>> 160k relationships.  
>>> 
>>> I should say that a feature of this model is that all the source and target 
>>> nodes have relationships that point to each other. No node points to 
>>> itself, as far as I know. This file was 41 MB of Cypher code that I tried to 
>>> load via the neo4j shell.  
>>> 
>>> In fact, I was following your advice on loading big data files... “Use the 
>>> Neo4j-Shell for larger Imports”  
>>> (http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/).
>>> This first time out, Java maxed out its allocated memory at 4 GB (x2) and 
>>> did not complete loading in 24 hours. I killed it. 
>>> 
>>> I then contacted Kenny, and he generously gave me some advice regarding the 
>>> properties file (below), and again the same deal (4 GB memory x2) with Java 
>>> and no success in about 24 hours. I killed that one too.
>>> 
>>> Given my loading problems, I have subsequently eliminated a bunch of 
>>> relationships (100k) so that the file is now 21 MB. A lot of these were 
>>> duplicates that I didn’t pick up before, and I am trying it again. So far, 15 
>>> min into it, similar situation. The difference is that Java is using 1.7 
>>> and 0.5 GB of memory.
>>> 
>>> Here is the Cypher for a typical node…
>>> 
>>> CREATE (CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory 
>>> Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ {NAME:'Acetoacetate 
>>> (ketone body)',SYNONYM:'',Sample:'SERUM, URINE',MEDCODE:10010,CUI:'NA'})
>>> 
>>> Here is the Cypher for a typical relationship...
>>> 
>>> CREATE (CLT_1)-[:MEASUREMENT_OF {Phylum:'TZ',CAT:'TEST.NAME',Ui_Rl:'T157',RESULT:'',Type:'',Semantic_Distance_Score:'NA',Path_Length:'NA',Path_Steps:'NA'}]->(CLT_TARGET_3617),
>>> 
>>> I will let you know how this one turns out.  I hope this is helpful.
>>> 
>>> Many, many thanks fellas!!!
>>> 
>>> Jose
>>> 
>>>> On Nov 18, 2014, at 8:33 PM, Michael Hunger <michael...@neotechnology.com> wrote:
>>>> 
>>>> Hi José,
>>>> 
>>>> can you perhaps provide more detail about your dataset (e.g. a sample of the 
>>>> CSV, size, etc.; perhaps an output of csvstat (from csvkit) would be 
>>>> helpful), and your Cypher queries to load it?
>>>> 
>>>> Have you seen my other blog post, which explains two big caveats that 
>>>> people run into when trying this? 
>>>> http://jexp.de/blog/2014/10/load-cvs-with-success/
>>>> 
>>>> Cheers, Michael
>>>> 
>>>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <k...@socialmoon.com> wrote:
>>>> Hey Jose,
>>>> 
>>>> There is definitely an answer. Let me put you in touch with the data 
>>>> import master: Michael Hunger.
>>>> 
>>>> Michael, I think the answers here will be pretty straightforward for you. 
>>>> You met Jose at GraphConnect NY last year, so I'll spare any 
>>>> introductions. The memory map configurations I provided need to be 
>>>> calculated and customized for the data import volume.
>>>> 
>>>> Thanks,
>>>> 
>>>> Kenny
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <jm3...@columbia.edu> wrote:
>>>> 
>>>>> Kenny,  
>>>>> 
>>>>> In 3 hours it will have been trying to load for 24 hours, so this is not working.  
>>>>> I’m catching shit from my crew too, so I’ve got to fix this, like, soon.
>>>>> 
>>>>> I haven’t done this before, but can I break up the data and load it in 
>>>>> pieces?
>>>>> 
>>>>> Jose
>>>>> 
>>>>>> On Nov 17, 2014, at 3:35 PM, Kenny Bastani <k...@socialmoon.com> wrote:
>>>>>> 
>>>>>> Hey Jose,
>>>>>> 
>>>>>> Try turning off the object cache. Add this line to the neo4j.properties 
>>>>>> configuration file:
>>>>>> 
>>>>>> cache_type=none
>>>>>> 
>>>>>> Then retry your import. Also, enable memory mapped files by adding these 
>>>>>> lines to the neo4j.properties file:
>>>>>> 
>>>>>> neostore.nodestore.db.mapped_memory=2048M
>>>>>> neostore.relationshipstore.db.mapped_memory=4096M
>>>>>> neostore.propertystore.db.mapped_memory=200M
>>>>>> neostore.propertystore.db.strings.mapped_memory=500M
>>>>>> neostore.propertystore.db.arrays.mapped_memory=500M
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Kenny
>>>>>> 
>>>>>> From: José F. Morales Ph.D. <jm3...@columbia.edu>
>>>>>> Sent: Monday, November 17, 2014 12:32 PM
>>>>>> To: Kenny Bastani
>>>>>> Subject: latest
>>>>>>  
>>>>>> Hey Kenny,
>>>>>> 
>>>>>> Here’s the deal. As I think I said, I loaded the 41 MB file of Cypher 
>>>>>> code via the neo4j shell. Before I tried the LabCards file, I tried the 
>>>>>> movies file and a UMLS database I made (8k relationships). They worked 
>>>>>> fine. 
>>>>>> 
>>>>>> The LabCards file is taking a LONG time to load; I started at about 
>>>>>> 9:30-10 PM last night and it’s 3 PM now.  
>>>>>> 
>>>>>> I’ve wondered if it’s hung up; the Activity Monitor’s memory usage is 
>>>>>> constant at two rows of Java at 4 GB, with the kernel at 1 GB. The CPU 
>>>>>> panel changes a lot, so it looks like it’s doing its thing. 
>>>>>> 
>>>>>> So is this how things are expected to be? Do you think the loading is 
>>>>>> gonna take a day or two?  
>>>>>> 
>>>>>> Jose
>>>>>> 
>>>>>> 
>>>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>>> José F. Morales Ph.D.
>>>>>> Instructor
>>>>>> Cell Biology and Pathology
>>>>>> Columbia University Medical Center
>>>>>> jm3...@columbia.edu
>>>>>> 212-452-3351
>>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> José F. Morales Ph.D.
> josef...@gmail.com
> 
> 
> 
> 
> 
> 

José F. Morales Ph.D.
josef...@gmail.com



