Hey Michael,

I'll check it out.  The trouble is knowing what counts as over-complicating.  Thanks 
for the heads up!

I am trying to figure out inductively, from various examples, how to use 
LOAD CSV.  Thanks for another one.  

It's killing me that it's taking so long.  

Jose



On Friday, November 28, 2014 6:16:49 PM UTC-5, Michael Hunger wrote:
>
> José 
>
> if you watch Nicole's webinar many things will become clear. 
> https://vimeo.com/112447027
> You don't have to overcomplicate things.
>
> The Skewer(id) thing is not really needed if each of your entities has a 
> label and a primary key of some sort.
> It is just an optimization to not have to think about separate entities.
>
> Cheers, Michael
>
> On Sat, Nov 29, 2014 at 12:12 AM, José F. Morales <jose...@gmail.com 
> <javascript:>> wrote:
>
>> Hey Andrii,
>>
>> I've been thinking a lot about your recommendations.  I have some 
>> questions, some of which show how ignorant I am.  Apologies for the basics 
>> where necessary.
>>
>> On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:
>>>
>>> Before you start.
>>>
>>> 1. On nodes and their labels. First of all, I strongly suggest you 
>>> separate your nodes into different .csv files by label. So you won't have a 
>>> column *`label`* in your .csv but rather a set of files:
>>>
>>> nodes_LabelA.csv
>>> ...
>>> nodes_LabelZ.csv
>>>
>>> whatever your labels are. (Consider a label to be a kind of synonym for 
>>> `class` in object-oriented programming or `table` in an RDBMS.) That's due 
>>> to the fact that labels in Cypher are somewhat special entities, and you 
>>> probably won't be allowed to parameterize them as variables inside your 
>>> LOAD CSV statement.
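>>>
>>> For illustration (the header names and values here are invented), a 
>>> minimal nodes_LabelA.csv might look like:
>>>
>>> my_node_id,node_prop_01,node_prop_02
>>> 1,Acetoacetate,SERUM
>>> 2,Creatinine,URINE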
>>>
>>>
>> OK, so you have modified your original idea of putting the db into two 
>> files: 1 for nodes, 1 for relationships.  Now here you say, put all the 
>> nodes into 1 file per label.  The way I have worked with it, I created 1 
>> file for a class of nodes I'll call CLT_SOURCE and another file for a class 
>> of nodes called CLT_TARGET.  Then I have a file for the relationships. 
>> Perhaps foolishly, I originally would create 1 file that combined all of 
>> this info and try to paste it in the browser or in the shell.  Neither 
>> worked, even though both had worked with a smaller amount of data.
>>
>> You are recommending that, with the nodes, I take two steps...
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes, 
>> 2) then split that file into files that correspond to the node: 
>> *my_node_id,* 1 label, and then properties P1...Pn.  Since I have 10 
>> labels/node, I should have 10 files named Nodes_LabelA...Nodes_LabelJ.  
>> Thus...
>>
>> File:  CLT_Nodes-LabelA     columns:  *my_node_id,* label A, property 
>> P1..., property P4
>> ...
>> File:  CLT_Nodes-LabelJ     columns:  *my_node_id,* label J, property 
>> P1..., property P4
>>
>>
>> Q1: What are the rules about what can be used for *my_node_id?*  I have 
>> usually seen them as letter-integer combinations. Is that the convention? 
>> Sometimes I've seen a letter being used with a specific class of nodes: 
>> a1..a100 for one class and b1..b100 for another.  I learned the hard way 
>> that you have to give each node a unique ID.  I used CLT_1...CLT_n for my 
>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. 
>> That worked with the smaller db I made.  Is anything wrong with using the 
>> convention n1...n100?
>>  
>>  
>>
>>> 2. Then consider one additional "technological" label, let's name it 
>>> `:Skewer` because it will "penetrate" all your nodes of every different 
>>> label (class) like a kebab skewer.
>>>
>>> Before you start (or at least before you start importing relationships) 
>>> do
>>>
>>> CREATE CONSTRAINT ON (every_node:Skewer)
>>> ASSERT every_node.my_node_id IS UNIQUE;
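>>>
>>> Once that constraint is in place, a lookup by my_node_id such as the 
>>> following (a sketch; the id value is invented) goes through the index 
>>> that backs the constraint:
>>>
>>> MATCH (n:Skewer {my_node_id: 42}) RETURN n;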
>>>
>>>
>> Q2:  Should I do scenario 1 or 2?
>>
>> Scenario 1:  add two labels to each file?  One from my original nodes and 
>> one as "Skewer"
>>
>> File 1:  CLT_Nodes-LabelA     columns:  *my_node_id,* label A, *Skewer*, 
>> property P1..., property P4
>> ...
>> File 2:  CLT_Nodes-LabelJ     columns:  *my_node_id,* label J, *Skewer*, 
>> property P1..., property P4
>>  
>> OR 
>>
>> Scenario 2:  Include an eleventh file thus....
>>
>> File 11:  CLT_Nodes-LabelK     columns:  *my_node_id,* *Skewer*, 
>> property P1..., property P4 
>>
>> From below, I think you mean Scenario 1.
>>
>> Q3: “Skewer” is just an integer, right?  It corresponds in a way to 
>> my_node_id.
>>
>>> 3. When doing LOAD CSV with nodes, make sure each node gets 2 (two) 
>>> labels, one of them being `:Skewer`. This way the index on the `my_node_id` 
>>> attribute applies (it makes relationship creation some orders of magnitude 
>>> faster), and as a bonus you'll be sure you don't have accidental duplicate 
>>> nodes.
>>>
>>
>>
>> Here is some sort of cypher….
>>
>>  
>> //Creating the nodes
>>
>>  
>>
>> USING PERIODIC COMMIT 1000
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
>> MERGE (n:Skewer:LabelA {my_node_id: ToInt(csvline.my_node_id)})
>> ON CREATE SET
>>   n.Property1 = csvline.Property1,
>>   n.Property2 = csvline.Property2,
>>   n.Property3 = csvline.Property3,
>>   n.Property4 = csvline.Property4;
>>
>>
>> ….
>> USING PERIODIC COMMIT 1000
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>> MERGE (n:Skewer:LabelJ {my_node_id: ToInt(csvline.my_node_id)})
>> ON CREATE SET
>>   n.Property1 = csvline.Property1,
>>   n.Property2 = csvline.Property2,
>>   n.Property3 = csvline.Property3,
>>   n.Property4 = csvline.Property4;
>>
>>
>>  
>>
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J 
>> combine the various labels and their respective values with their 
>> corresponding nodes? 
>>
>> Q5: Since I think of my data in terms of the two classes of nodes in my 
>> data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after 
>> loading the nodes, how do I then get the two classes of nodes? 
>>
>> Q6: Is there a step missing that explains how the code below got to have 
>> a “source_node” and a “dest_node” that appear to correspond to my 
>> CLT_SOURCE and CLT_TARGET nodes?
>>
>>  
>>
>>  
>>
>>> 4. Now when you are done with nodes and start doing LOAD CSV for 
>>> relationships, you may give the MATCH statement, which looks up your pair 
>>> of nodes, a hint for fast lookup, like
>>>
>>> LOAD CSV ...from somewhere... AS csvline
>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}), (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);
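>>>
>>> A sketch of the same idea with a headered relationships file; the file 
>>> path, header names and relationship type here are assumptions, matching 
>>> the CSV2 columns (source_my_node_id, dest_my_node_id) suggested earlier:
>>>
>>> USING PERIODIC COMMIT 1000
>>> LOAD CSV WITH HEADERS FROM "file:/path/to/RELS.csv" AS csvline
>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline.source_my_node_id)})
>>> MATCH (dest_node:Skewer {my_node_id: ToInt(csvline.dest_my_node_id)})
>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_01: csvline.rel_prop_01}]->(dest_node);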
>>>
>>>
>> Q7: This LOAD CSV command (line 1) looks into the separate REL.csv file 
>> you mentioned first, right?  
>>
>> Q8: csvline is some sort of temp variable that holds the lines of the 
>> csv file? 
>>
>> Q9: Do you imply in line 2 that the REL.csv file has headers that 
>> include source_node, dest_node?
>>
>> Q10: While I see how Skewer is a label, how is my_node_id a property 
>> (line 2)? 
>>
>> Q11: How does my_node_id relate to either ToInt(csvline[0]) or 
>> ToInt(csvline[1]) (line 2)?
>>
>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?  
>>
>> Does csvline[0] refer to a column in REL.csv, as do csvline[2] and 
>> csvline[ZZ] (line 3)?
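>>
>> (My guess, with a made-up row: if a line of REL.csv reads 12,99,0.87, 
>> then csvline is the list ["12", "99", "0.87"], so csvline[0] is the 
>> string "12", which is why the ToInt() is needed.)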
>>  
>>
>>> Adding the *`:Skewer`* label in MATCH will tell Cypher to (implicitly) use 
>>> your index on *my_node_id*, which was created when you created your 
>>> constraint. Or you may try to explicitly give it a hint to use the index, 
>>> with a USING INDEX... clause after MATCH, before CREATE. Btw, some earlier 
>>> versions of Neo4j refused to use the index in LOAD CSV for some reason; I 
>>> hope this problem is gone with 2.1.5.
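>>>
>>> The explicit form of the hint would look something like this (a sketch; 
>>> the id value is invented):
>>>
>>> MATCH (source_node:Skewer)
>>> USING INDEX source_node:Skewer(my_node_id)
>>> WHERE source_node.my_node_id = 42
>>> RETURN source_node;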
>>>
>>
>> OK
>>
>>
>>> 5. While importing, be careful to *explicitly specify type conversions 
>>> for each property which is not a string*. I have seen numerous 
>>> occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]) and 
>>> Cypher silently stored their (supposed) numerics as strings ("it's OK, 
>>> dude" :)). This leads to confusion afterwards when, say, numerical 
>>> comparisons don't MATCH, and so on (though it's easy to correct with a 
>>> single Cypher command, but anyway).
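>>>
>>> That single corrective command could look like this (a sketch; it 
>>> assumes the value was loaded as a numeric string on every :Skewer node):
>>>
>>> MATCH (n:Skewer)
>>> SET n.my_node_id = ToInt(n.my_node_id);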
>>>
>>
>> I think I did that re: type conversion.  It only applies to properties in 
>> my data.
>>   
>> Sorry for so many questions.  I am really interested in figuring this out!
>>
>> Thanks loads,  
>> Jose
>>
>>  
>>
>>> WBR,
>>> Andrii
>>>
>>> On Wednesday, November 19, 2014 9:36:50 PM UTC+2, José F. Morales wrote:
>>>>
>>>>
>>>> 3. CSV approach 
>>>> a. “Dump the base into 2 .csv files:”
>>>> b. CSV1:  “Describe nodes (enumerate them via some my_node_id integer 
>>>> attribute),  columns: my_node_id,label,node_prop_01,node_prop_ZZ”
>>>> c. CSV2:  “Describe relations,  columns: source_my_node_id, 
>>>> dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
>>>> d. Indexes / constraints: before starting import —> have appropriate 
>>>> indexes / constraints
>>>> e. via LOAD CSV, import CSV1, then CSV2. 
>>>> f. Import no more than 10,000-30,000 lines in a single LOAD CSV 
>>>> statement 
>>>>
>>>> This seems to be a very well elaborated method and the easiest for me 
>>>> to do.  I have files such that I can create these without too much 
>>>> trouble.  I figure I'll split the nodes into three files of 20k rows each. 
>>>> I can do the same with the Rels.  I have not used indexes or constraints 
>>>> yet in the db's that I already created, and as I said above, I'll have to 
>>>> see how to use them.
>>>>
>>>> I am assuming column headers that fit with my data are consistent with 
>>>> what you explained below (like, I can put my own meaningful text into 
>>>> label1 - label10 and node_prop_01 - 05).... 
>>>>
>>>> my_node_id, label1, label2, label3, label4, label5, label6, label7, 
>>>> label8, label9, label10, node_prop_01, node_prop_02, node_prop_03, 
>>>> node_prop_04, node_prop_ZZ
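>>>>
>>>> So a sample row under those headers (values invented, echoing my data) 
>>>> might be:
>>>>
>>>> 1,CLT SOURCE,BIOMEDICAL,TEST_NAME,Laboratory Procedure,lbpr,Procedures,PROC,T059,B1.3.1.1,TZ,Acetoacetate,SERUM,10010,NA,NA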
>>>>
>>>> Thanks again Fellas!!
>>>>
>>>> Jose
>>>>
>>>>
>>>> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
>>>>>
>>>>> José,
>>>>>
>>>>> Let's continue the discussion on the google group
>>>>>
>>>>> By larger I meant the amount of data, not the size of statements.
>>>>>
>>>>> As I also point out in various places, we recommend creating only small 
>>>>> subgraphs with a single statement, separated by semicolons.
>>>>> E.g. up to 100 nodes and rels.
>>>>>
>>>>> Gigantic statements just make the parser explode.
>>>>>
>>>>> I recommend splitting them up into statements creating subgraphs, 
>>>>> or creating nodes and later matching them by label & property to connect 
>>>>> them.
>>>>> Make sure to have appropriate indexes / constraints.
>>>>>
>>>>> You should also surround blocks of statements with begin and commit 
>>>>> commands.
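>>>>>
>>>>> For example, in the neo4j-shell (a sketch; the node details are 
>>>>> invented):
>>>>>
>>>>> begin
>>>>> CREATE (a:CLT_SOURCE {NAME: 'test A'});
>>>>> CREATE (b:CLT_TARGET {NAME: 'test B'});
>>>>> commit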
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On 19.11.2014, at 04:18, José F. Morales Ph.D. <
>>>>> jm3...@columbia.edu> wrote:
>>>>>
>>>>> Hey Michael and Kenny
>>>>>
>>>>> Thank you guys a bunch for the help.
>>>>>
>>>>> Let me give you a little background.  I am charged with making a 
>>>>> prototype of a tool (“LabCards”) that we hope to use in the hospital and 
>>>>> beyond at some point.  In preparation for making the main prototype, I 
>>>>> made two 
>>>>> prior Neo4j databases that worked exactly as I wanted them to.  The first 
>>>>> database was built with NIH data and had 183 nodes and around 7500 
>>>>> relationships.  The second database was the Pre-prototype and it had 1080 
>>>>> nodes and around 2000 relationships.  I created these in the form of 
>>>>> cypher 
>>>>> statements and either pasted them in the Neo4j browser or used the neo4j 
>>>>> shell and loaded them as text files. Before doing that, I checked the 
>>>>> cypher code with Sublime Text 2, which highlights the code. Both 
>>>>> databases loaded fine with both methods and did what I wanted them to do.  
>>>>>
>>>>> As you might imagine, the prototype is an expansion of the 
>>>>> mini-prototype.  It has almost the same data model and I built it as a 
>>>>> series of cypher statements as well.  My first version of the prototype 
>>>>> had 
>>>>> ~60k nodes and 160k relationships.  
>>>>>
>>>>> I should say that a feature of this model is that all the source and 
>>>>> target nodes have relationships that point to each other.  No node points 
>>>>> to itself as far as I know. This file was 41 Mb of cypher code that I 
>>>>> tried 
>>>>> to load via the neo4j shell.  
>>>>>
>>>>> In fact, I was following your advice on loading big data files... “Use 
>>>>> the Neo4j-Shell for larger Imports”  (http://jexp.de/blog/2014/06/
>>>>> load-csv-into-neo4j-quickly-and-successfully/).   This first time 
>>>>> out, Java maxed out its allocated memory at 4 Gb (2x) and did not 
>>>>> complete loading in 24 hours.  I killed it. 
>>>>>
>>>>> I then contacted Kenny, and he generously gave me some advice 
>>>>> regarding the properties file (below), and again it was the same deal (4 
>>>>> Gb memory, 2x) with Java and no success in about 24 hours. I killed that 
>>>>> one too.
>>>>>
>>>>> Given my loading problems, I have subsequently eliminated a bunch of 
>>>>> relationships (100k), so that the file is now 21 Mb. A lot of these were 
>>>>> duplicates that I didn't pick up before, and I am trying it again.  So 
>>>>> far, 15 min into it, it's a similar situation.  The difference is that 
>>>>> Java is using 1.7 and 0.5 GB of memory.
>>>>>
>>>>> Here is the cypher for a typical node…
>>>>>
>>>>> CREATE (CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory 
>>>>> Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ {NAME:'Acetoacetate 
>>>>> (ketone body)', SYNONYM:'', Sample:'SERUM, URINE', MEDCODE:10010, 
>>>>> CUI:'NA'})
>>>>>
>>>>> Here is the cypher for a typical relationship...
>>>>>
>>>>> CREATE (CLT_1)-[:MEASUREMENT_OF {Phylum:'TZ', CAT:'TEST.NAME', 
>>>>> Ui_Rl:'T157', RESULT:'', Type:'', Semantic_Distance_Score:'NA', 
>>>>> Path_Length:'NA', Path_Steps:'NA'}]->(CLT_TARGET_3617),
>>>>>
>>>>> I will let you know how this one turns out.  I hope this is helpful.
>>>>>
>>>>> Many, many thanks fellas!!!
>>>>>
>>>>> Jose
>>>>>
>>>>> On Nov 18, 2014, at 8:33 PM, Michael Hunger <
>>>>> michael...@neotechnology.com> wrote:
>>>>>
>>>>> Hi José,
>>>>>
>>>>> can you perhaps provide more detail about your dataset (e.g. a sample of 
>>>>> the csv, size, etc.; perhaps an output of csvstat (from csvkit) would be 
>>>>> helpful) and your cypher queries to load it?
>>>>>
>>>>> Have you seen my other blog post, which explains two big caveats that 
>>>>> people run into when trying this? jexp.de/blog/2014/10/
>>>>> load-cvs-with-success/
>>>>>
>>>>> Cheers, Michael
>>>>>
>>>>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <k...@socialmoon.com> 
>>>>> wrote:
>>>>>
>>>>>>  Hey Jose,
>>>>>>
>>>>>>  There is definitely an answer. Let me put you in touch with the 
>>>>>> data import master: Michael Hunger.
>>>>>>
>>>>>>  Michael, I think the answers here will be pretty straightforward 
>>>>>> for you. You met Jose at GraphConnect NY last year, so I'll spare any 
>>>>>> introductions. The memory map configurations I provided need to be 
>>>>>> calculated and customized for the data import volume.
>>>>>>
>>>>>>  Thanks,
>>>>>>
>>>>>>  Kenny
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <
>>>>>> jm3...@columbia.edu> wrote:
>>>>>>
>>>>>>   Kenny,  
>>>>>>
>>>>>>  In 3 hours it'll have been trying to load for 24 hours, so this is 
>>>>>> not working.  I'm catching shit from my crew too, so I've got to fix 
>>>>>> this, like, soon.
>>>>>>
>>>>>>  I haven’t done this before, but can I break up the data and load it 
>>>>>> in pieces?
>>>>>>
>>>>>>  Jose
>>>>>>
>>>>>>  On Nov 17, 2014, at 3:35 PM, Kenny Bastani <k...@socialmoon.com> 
>>>>>> wrote:
>>>>>>
>>>>>>  Hey Jose,
>>>>>>
>>>>>>  Try turning off the object cache. Add this line to the 
>>>>>> neo4j.properties configuration file:
>>>>>>
>>>>>>  cache_type=none
>>>>>>
>>>>>> Then retry your import. Also, enable memory mapped files by adding 
>>>>>> these lines to the neo4j.properties file:
>>>>>>
>>>>>>  neostore.nodestore.db.mapped_memory=2048M
>>>>>> neostore.relationshipstore.db.mapped_memory=4096M
>>>>>> neostore.propertystore.db.mapped_memory=200M
>>>>>> neostore.propertystore.db.strings.mapped_memory=500M
>>>>>> neostore.propertystore.db.arrays.mapped_memory=500M
>>>>>>  
>>>>>>  Thanks,
>>>>>>
>>>>>>  Kenny
>>>>>>  
>>>>>>  ------------------------------
>>>>>> *From:* José F. Morales Ph.D. <jm3...@columbia.edu>
>>>>>> *Sent:* Monday, November 17, 2014 12:32 PM
>>>>>> *To:* Kenny Bastani
>>>>>> *Subject:* latest 
>>>>>>  
>>>>>>   Hey Kenny,
>>>>>>
>>>>>>  Here’s the deal. As I think I said, I loaded the 41 Mb file of 
>>>>>> cypher code via the neo4j shell. Before I tried the LabCards file, I 
>>>>>> tried 
>>>>>> the movies file and a UMLS database I made (8k relationships).  They 
>>>>>> worked 
>>>>>> fine. 
>>>>>>
>>>>>>  The LabCards file is taking a LONG time to load, since I started at 
>>>>>> about 9:30 - 10 PM last night and it's 3 PM now.  
>>>>>>
>>>>>>  I’ve wondered if its hung up and the activity monitor’s memory 
>>>>>> usage is constant at two rows of Java at 4GB w/ the kernel at 1 GB.  The 
>>>>>> CPU panel changes alot so it looks like its doing its thing. 
>>>>>>
>>>>>>  So is this how things are expected to be?  Do you think the loading 
>>>>>> is gonna take a day or two?  
>>>>>>
>>>>>>  Jose
>>>>>>  
>>>>>>  
>>>>>>    |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>>> José F. Morales Ph.D.
>>>>>>  Instructor
>>>>>>  Cell Biology and Pathology
>>>>>> Columbia University Medical Center
>>>>>>  jm3...@columbia.edu
>>>>>>  212-452-3351
>>>>>>     
>>>>>>   
>>>>>
>>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>> José F. Morales Ph.D.
>>>>> Instructor
>>>>> Cell Biology and Pathology
>>>>> Columbia University Medical Center
>>>>> jm3...@columbia.edu
>>>>> 212-452-3351
>>>>>  
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to neo4j+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
