Re: [Neo4j] Re: large cypher statements

Michael Hunger Fri, 28 Nov 2014 16:50:42 -0800

What takes so long? The loading? Or figuring it out?

Michael



On Sat, Nov 29, 2014 at 1:18 AM, José F. Morales <josef...@gmail.com> wrote:

> Hey Michael,
>
> I'll check it out.   Trouble is knowing what over-complicating is.  Thanks
> for the heads up!
>
> I am trying to figure out inductively how to use LOAD CSV from various
> examples.  Thanks for another one.
>
> It's killing me that it's taking so long.
>
> Jose
>
>
>
> On Friday, November 28, 2014 6:16:49 PM UTC-5, Michael Hunger wrote:
>>
>> José
>>
>> if you watch Nicole's webinar many things will become clear.
>> https://vimeo.com/112447027
>> You don't have to overcomplicate things.
>>
>> The Skewer(id) thing is not really needed if each of your entities has a
>> label and a primary key of some sorts.
>> It is just an optimization to not have to think about separate entities.
>>
>> Cheers, Michael
>>
>> On Sat, Nov 29, 2014 at 12:12 AM, José F. Morales <jose...@gmail.com>
>> wrote:
>>
>>> Hey Andrii,
>>>
>>> I've been thinking a lot about your recommendations.   I have some
>>> questions, some of which show how ignorant I am.  Apologies for basics
>>> if necessary.
>>>
>>> On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:
>>>>
>>>> Before you start.
>>>>
>>>> 1. On nodes and their labels. First of all, I strongly suggest that you
>>>> separate your nodes into different .csv files by label. So you won't have a
>>>> column *`label`* in your .csv but rather a set of files:
>>>>
>>>> nodes_LabelA.csv
>>>> ...
>>>> nodes_LabelZ.csv
>>>>
>>>> whatever your labels are. (Consider a label to be a kind of synonym for
>>>> `class` in object-oriented programming or `table` in an RDBMS.) That's due
>>>> to the fact that labels in Cypher are somewhat specific entities and you
>>>> probably won't be allowed to parameterize them as variables inside your LOAD
>>>> CSV statement.
>>>>
>>>>
>>> OK, so you have modified your original idea of putting the db into two
>>> files 1 nodes , 1 relationships.  Now here you say, put all the nodes into
>>> 1 file/ label.   The way I have worked with it, I created 1 file for a
>>> class of nodes I'll call CLT_SOURCE and another file for a class of nodes
>>> called CLT_TARGET.  Then I have a file for the relationships. Perhaps
>>> foolishly I originally would create 1 file that would combine all of this
>>> info and try to paste it in the browser or in the shell.  Neither worked
>>> even though with smaller amount of data it did.
>>>
>>> You are recommending that with the nodes, I take two steps...
>>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
>>> 2) then I split that file into files that correspond to the node:
>>> *my_node_id, * 1 label, and then properties P1...Pn.  Since I have 10
>>> Labels/node, I should have 10 files named..... Nodes_LabelA...
>>> Nodes_LabelJ.  Thus...
>>>
>>> File:  CLT_Nodes-LabelA     columns:  *my_node_id,* label A, property
>>> P1..., property P4
>>> ...
>>> File:  CLT_Nodes-LabelJ     columns:  *my_node_id,* label J, property
>>> P1..., property P4
>>>
>>>
>>> Q1: What are the rules about what can be used for *my_node_id?  *I have
>>> usually seen them as a letter integer combination. Is that the convention?
>>>   Sometimes I've seen a letter being used with a specific class of nodes
>>>  a1..a100 for one class and b1..b100 for another.  I learned the hard way
>>> that you have to give each node a unique ID.  I used CLT_1...CLT_n for my
>>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes.
>>> It worked with the smaller db I made.  Anything wrong using the convention
>>> n1...n100?
>>>
>>>
>>>
>>>> 2. Then consider one additional "technological" label, let's name it
>>>> `:Skewer` because it will "penetrate" all your nodes of every different
>>>> label (class) like a kebab skewer.
>>>>
>>>> Before you start (or at least before you start importing relationships)
>>>> do
>>>>
>>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id
>>>> IS UNIQUE;
>>>>
>>>>
>>> Q2:  Should I do scenario 1 or 2?
>>>
>>> Scenario 1:  add two labels to each file?  One from my original nodes
>>> and one as "Skewer"
>>>
>>> File 1:  CLT_Nodes-LabelA     columns:  *my_node_id,* label A, *Skewer*,
>>> property P1..., property P4
>>> ...
>>> File 2:  CLT_Nodes-LabelJ     columns:  *my_node_id,* label J, *Skewer*,
>>> property P1..., property P4
>>>
>>> OR
>>>
>>> Scenario 2:  Include an eleventh file thus....
>>>
>>> File 11:  CLT_Nodes-LabelK     columns:  *my_node_id,* *Skewer*,
>>> property P1..., property P4
>>>
>>> From below, I think you mean Scenario 1.
>>>
>>> Q3: “Skewer” is just an integer right?  It corresponds in a way to
>>> my_node_id
>>>
>>>> 3. When doing LOAD CSV with nodes, make sure each node will get 2 (two)
>>>> labels, one of them is `:Skewer`. This will create index on `my_node_id`
>>>> attribute (makes relationships creation some orders of magnitude faster)
>>>> and you'll be sure you don't have occasional duplicate nodes, as a bonus.
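>>>>
>>>> A minimal sketch of such a node import for one per-label file, assuming a
>>>> hypothetical file URL and header names (my_node_id, node_prop_01). The
>>>> label is written literally, since it cannot be taken from a CSV column:
>>>>
>>>> ```cypher
>>>> // Hypothetical path and column names; adjust to your own files.
>>>> USING PERIODIC COMMIT 1000
>>>> LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
>>>> // Merge on the unique key only, so the lookup can use the constraint.
>>>> MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
>>>> // Add the file's own label and its properties to the merged node.
>>>> SET n:LabelA, n.node_prop_01 = csvline.node_prop_01;
>>>> ```
>>>>
>>>> Running the same statement against the next file with its own label
>>>> would add that label to nodes that already exist rather than duplicating
>>>> them.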
>>>>
>>>
>>>
>>> Here is some sort of cypher….
>>>
>>>
>>> //Creating the nodes
>>>
>>>
>>>
>>> USING PERIODIC COMMIT 1000
>>>
>>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
>>>
>>> MERGE (n:Skewer:LabelA {property1: csvline.property1})
>>>
>>> ON CREATE SET
>>>
>>> n.Property2 = csvline.Property2,
>>>
>>> n.Property3 = csvline.Property3,
>>>
>>> n.Property4 = csvline.Property4;
>>>
>>>
>>> ….
>>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>>>
>>>
>>>
>>> MERGE (n:Skewer:LabelJ {property1: csvline.property1})
>>>
>>> ON CREATE SET
>>>
>>> n.Property2 = csvline.Property2,
>>>
>>> n.Property3 = csvline.Property3,
>>>
>>> n.Property4 = csvline.Property4;
>>>
>>>
>>>
>>>
>>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J
>>> combine the various labels and their respective values with their
>>> corresponding nodes?
>>>
>>> Q5: Since I think of my data in terms of the two classes of nodes in my
>>> Data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after
>>> loading the nodes, how then I get two classes of nodes?
>>>
>>> Q6: Is there a step missing that explains how the code below got to have
>>> a “source_node” and a “dest_node” that appears to correspond to my
>>> CLT_SOURCE and CLT_TARGET nodes?
>>>
>>>
>>>
>>>
>>>
>>>> 4. Now when you are done with nodes and start doing LOAD CSV for
>>>> relationships, you may give the MATCH statement, which looks up your pair
>>>> of nodes, a hint for fast lookup, like
>>>>
>>>> LOAD CSV ...from somewhere... AS csvline
>>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}),
>>>>       (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ...,
>>>> rel_prop_NN: csvline[ZZ]}]->(dest_node);
>>>>
>>>>
>>> Q6: This LOAD CSV  command (line 1) looks into the separate REL.csv file
>>> you mentioned first right?
>>>
>>> Q7: csvline is some sort of temp file that is a series of lines of the
>>> csv file?
>>>
>>> Q8: Do you imply in line 2 that the REL.csv file has headers that
>>> include  source_node, dest_node ?
>>>
>>> Q9: While I see how Skewer is a label,  how is my_node_id a  property
>>> (line 2) ?
>>>
>>> Q10: How does my_node_id relate to either ToInt(csvline[0]) or
>>> ToInt(csvline[1])  (line 2) ?
>>>
>>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?
>>>
>>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and
>>> csvline[ZZ] (line 3) ?
>>>
>>>
>>>> Adding *`:Skewer` *label in MATCH will tell Cypher to (implicitly) use
>>>> your index on *my_node_id* which was created when you created your
>>>> constraint. Or you may try to explicitly give it a hint to use the index,
>>>> with USING INDEX... clause after MATCH before CREATE. Btw some earlier
>>>> versions of Neo4j refused to use index in LOAD CSV for some reason, I hope
>>>> this problem is gone with 2.1.5.
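>>>>
>>>> A minimal sketch of the explicit hint, assuming a hypothetical file URL
>>>> and the same column order as in the statement above (whether the planner
>>>> actually needs the hint may depend on your Neo4j version):
>>>>
>>>> ```cypher
>>>> // Hypothetical path; columns: source id, destination id, one property.
>>>> LOAD CSV FROM "file:///path/to/rels.csv" AS csvline
>>>> // Look up each end node separately, each with its own index hint.
>>>> MATCH (source_node:Skewer)
>>>> USING INDEX source_node:Skewer(my_node_id)
>>>> WHERE source_node.my_node_id = toInt(csvline[0])
>>>> MATCH (dest_node:Skewer)
>>>> USING INDEX dest_node:Skewer(my_node_id)
>>>> WHERE dest_node.my_node_id = toInt(csvline[1])
>>>> CREATE (source_node)-[:MY_REL_TYPE {rel_prop_00: csvline[2]}]->(dest_node);
>>>> ```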
>>>>
>>> OK
>>>
>>>
>>>> 5. While importing, be careful to *explicitly specify type conversions
>>>> for each property which is not a string*. I have seen numerous
>>>> occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and
>>>> Cypher silently stored their (supposed) numerics as strings. It's Ok, dude,
>>>> you say it :) This led to confusion afterwards when, say, numerical
>>>> comparisons don't match and so on (though it's easy to correct with a
>>>> single Cypher command, but anyway).
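>>>>
>>>> A minimal sketch of such a correction, assuming the label LabelA and a
>>>> property node_prop_01 that was meant to be an integer (both names are
>>>> placeholders):
>>>>
>>>> ```cypher
>>>> // Rewrite the stored string as an integer on every LabelA node.
>>>> // toInt() returns null for a value it cannot parse, and setting a
>>>> // property to null removes it, so inspect the data first.
>>>> MATCH (n:LabelA)
>>>> SET n.node_prop_01 = toInt(n.node_prop_01);
>>>> ```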
>>>>
>>> Think I did that re. type conversion.  Only applies to properties for
>>> my data.
>>>
>>> Sorry for so many questions.  I am really interested in figuring this
>>> out!
>>>
>>> Thanks loads,
>>> Jose
>>>
>>>
>>>
>>>> WBR,
>>>> Andrii
>>>>
>>>> On Wednesday, November 19, 2014 9:36:50 PM UTC+2, José F. Morales wrote:
>>>>>
>>>>>
>>>>> 3. CSV approach
>>>>> a. “Dump the base into 2 .csv files:”
>>>>> b. CSV1:  “Describe nodes (enumerate them via some my_node_id integer
>>>>> attribute),  columns: my_node_id,label,node_prop_01,node_prop_ZZ”
>>>>> c. CSV2:  “Describe relations,  columns: source_my_node_id,
>>>>> dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
>>>>> d. Indexes constraints: before starting import  —> have appropriate
>>>>> indexes / constraints
>>>>> e. via LOAD CSV, import CSV1, then CSV2.
>>>>> f. Import no more than 10,000-30,000 lines in a single LOAD CSV
>>>>> statement
>>>>>
>>>>> This seems to be a very well elaborated method and the easiest for me
>>>>> to do.  I have files such that I can create these without too much
>>>>> problem.  I figure I’ll split the nodes into three files 20k rows each.  I
>>>>> can do the same with the Rels.  I have not used indexes or constraints yet
>>>>> in the db’s that I already created and as I said above, I’ll have to see
>>>>> how to use them.
>>>>>
>>>>> I am assuming column headers that fit with my data are consistent with
>>>>> what you explained below (like, I can put my own meaningful text into
>>>>> Label 1-10 and node_prop_01-05)....
>>>>> my_node_id, label1, label2, label3, label4, label5, label6, label7,
>>>>> label8, label9, label10, node_prop_01, node_prop_02, node_prop_03,
>>>>> node_prop_04, node_prop_ZZ
>>>>>
>>>>> Thanks again Fellas!!
>>>>>
>>>>> Jose
>>>>>
>>>>>
>>>>> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
>>>>>>
>>>>>> José,
>>>>>>
>>>>>> Let's continue the discussion on the google group
>>>>>>
>>>>>> With larger I meant amount of data, not size of statements
>>>>>>
>>>>>> As I also point out in various places we recommend creating only
>>>>>> small subgraphs with a single statement separated by semicolons.
>>>>>> Eg up to 100 nodes and rels
>>>>>>
>>>>>> Gigantic statements just let the parser explode
>>>>>>
>>>>>> I recommend splitting them up into statements creating subgraphs
>>>>>> Or create nodes and later match them by label & property to connect
>>>>>> them
>>>>>> Make sure to have appropriate indexes / constraints
>>>>>>
>>>>>> You should also surround blocks of statements with begin and commit
>>>>>> commands
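>>>>>>
>>>>>> A minimal sketch of such a script for the neo4j shell, assuming
>>>>>> placeholder labels, keys and values: small statements, each ended by a
>>>>>> semicolon, grouped into begin/commit blocks, with the nodes created
>>>>>> first and connected afterwards by label and property:
>>>>>>
>>>>>> ```cypher
>>>>>> // Block 1: create the nodes.
>>>>>> begin
>>>>>> CREATE (:LabelA {my_node_id: 1, NAME: 'first'});
>>>>>> CREATE (:LabelB {my_node_id: 2, NAME: 'second'});
>>>>>> commit
>>>>>> // Block 2: look the nodes up again and connect them.
>>>>>> begin
>>>>>> MATCH (a:LabelA {my_node_id: 1}), (b:LabelB {my_node_id: 2})
>>>>>> CREATE (a)-[:MY_REL_TYPE]->(b);
>>>>>> commit
>>>>>> ```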
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On 19.11.2014 at 04:18, José F. Morales Ph.D. <
>>>>>> jm3...@columbia.edu> wrote:
>>>>>>
>>>>>> Hey Michael and Kenny
>>>>>>
>>>>>> Thanks you guys a bunch for the help.
>>>>>>
>>>>>> Let me give you a little background.  I am charged to make a
>>>>>> prototype of a tool (“LabCards”) that we hope to use in the hospital and
>>>>>> beyond at some point .  In preparation for making the main prototype, I
>>>>>> made two prior Neo4j databases that worked exactly as I wanted them to.
>>>>>> The first database was built with NIH data and had 183 nodes and around
>>>>>> 7500 relationships.  The second database was the Pre-prototype and it had
>>>>>> 1080 nodes and around 2000 relationships.  I created these in the form of
>>>>>> cypher statements and either pasted them in the Neo4j browser or used the
>>>>>> neo4j shell and loaded them as text files. Before doing that I checked
>>>>>> the cypher code with Sublime Text 2 that highlights the code. Both
>>>>>> databases loaded fine in both methods and did what I wanted them to do.
>>>>>>
>>>>>> As you might imagine, the prototype is an expansion of the
>>>>>> mini-prototype.  It has almost the same data model and I built it as a
>>>>>> series of cypher statements as well.  My first version of the prototype
>>>>>> had ~60k nodes and 160k relationships.
>>>>>>
>>>>>> I should say that a feature of this model is that all the source and
>>>>>> target nodes have relationships that point to each other.  No node points
>>>>>> to itself as far as I know. This file was 41 Mb of cypher code that I
>>>>>> tried to load via the neo4j shell.
>>>>>>
>>>>>> In fact, I was following your advice on loading big data files...
>>>>>> “Use the Neo4j-Shell for larger Imports”  (
>>>>>> http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/).
>>>>>> This first time out, Java maxed out its memory
>>>>>> allocated at 4Gb 2x and did not complete loading in 24 hours.  I killed
>>>>>> it.
>>>>>>
>>>>>> I then contacted Kenny, and he generously gave me some advice
>>>>>> regarding the properties file (below) and again the same deal (4 Gb
>>>>>> Memory 2x) with Java and no success in about 24 hours. I killed that
>>>>>> one too.
>>>>>>
>>>>>> Given my loading problems, I have subsequently eliminated a bunch of
>>>>>> relationships (100k) so that the file is now 21 Mb. A lot of these were
>>>>>> duplicates that I didn’t pick up before, and I am trying it again.  So
>>>>>> far, 15 min into it, similar situation.  The difference is that Java is
>>>>>> using 1.7 and 0.5 GB of memory.
>>>>>>
>>>>>> Here is the cypher for a typical node…
>>>>>>
>>>>>> CREATE ( CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory
>>>>>> Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ{NAME:'Acetoacetate
>>>>>> (ketone body)',SYNONYM:'',Sample:'SERUM,
>>>>>> URINE',MEDCODE:10010,CUI:'NA'})
>>>>>>
>>>>>> Here is the cypher for a typical relationship...
>>>>>>
>>>>>> CREATE(CLT_1)-[:MEASUREMENT_OF{Phylum:'TZ',CAT:'TEST.NAME',
>>>>>> Ui_Rl:'T157',RESULT:'',Type:'',Semantic_Distance_Score:'NA',
>>>>>> Path_Length:'NA',Path_Steps:'NA'}]->(CLT_TARGET_3617),
>>>>>>
>>>>>> I will let you know how this one turns out.  I hope this is helpful.
>>>>>>
>>>>>> Many, many thanks fellas!!!
>>>>>>
>>>>>> Jose
>>>>>>
>>>>>> On Nov 18, 2014, at 8:33 PM, Michael Hunger <
>>>>>> michael...@neotechnology.com> wrote:
>>>>>>
>>>>>> Hi José,
>>>>>>
>>>>>> can you provide perhaps more detail about your dataset (e.g. sample
>>>>>> of the csv, size, etc.; perhaps an output of csvstat (of csvkit) would be
>>>>>> helpful) and your cypher queries to load it?
>>>>>>
>>>>>> Have you seen my other blog post, which explains two big caveats that
>>>>>> people run into when trying this?
>>>>>> jexp.de/blog/2014/10/load-cvs-with-success/
>>>>>>
>>>>>> Cheers, Michael
>>>>>>
>>>>>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <k...@socialmoon.com>
>>>>>> wrote:
>>>>>>
>>>>>>>  Hey Jose,
>>>>>>>
>>>>>>>  There is definitely an answer. Let me put you in touch with the
>>>>>>> data import master: Michael Hunger.
>>>>>>>
>>>>>>>  Michael, I think the answers here will be pretty straightforward
>>>>>>> for you. You met Jose at GraphConnect NY last year, so I'll spare any
>>>>>>> introductions. The memory map configurations I provided need to be
>>>>>>> calculated and customized for the data import volume.
>>>>>>>
>>>>>>>  Thanks,
>>>>>>>
>>>>>>>  Kenny
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <
>>>>>>> jm3...@columbia.edu> wrote:
>>>>>>>
>>>>>>>   Kenny,
>>>>>>>
>>>>>>>  In 3 hours it’ll be trying to load for 24 hours so this is not
>>>>>>> working.  I’m catching shit from my crew too, so I got to fix this like
>>>>>>> soon.
>>>>>>>
>>>>>>>  I haven’t done this before, but can I break up the data and load
>>>>>>> it in pieces?
>>>>>>>
>>>>>>>  Jose
>>>>>>>
>>>>>>>  On Nov 17, 2014, at 3:35 PM, Kenny Bastani <k...@socialmoon.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>  Hey Jose,
>>>>>>>
>>>>>>>  Try turning off the object cache. Add this line to the
>>>>>>> neo4j.properties configuration file:
>>>>>>>
>>>>>>>  cache_type=none
>>>>>>>
>>>>>>> Then retry your import. Also, enable memory mapped files by adding
>>>>>>> these lines to the neo4j.properties file:
>>>>>>>
>>>>>>>  neostore.nodestore.db.mapped_memory=2048M
>>>>>>> neostore.relationshipstore.db.mapped_memory=4096M
>>>>>>> neostore.propertystore.db.mapped_memory=200M
>>>>>>> neostore.propertystore.db.strings.mapped_memory=500M
>>>>>>> neostore.propertystore.db.arrays.mapped_memory=500M
>>>>>>>
>>>>>>>  Thanks,
>>>>>>>
>>>>>>>  Kenny
>>>>>>>
>>>>>>>  ------------------------------
>>>>>>> *From:* José F. Morales Ph.D. <jm3...@columbia.edu>
>>>>>>> *Sent:* Monday, November 17, 2014 12:32 PM
>>>>>>> *To:* Kenny Bastani
>>>>>>> *Subject:* latest
>>>>>>>
>>>>>>>   Hey Kenny,
>>>>>>>
>>>>>>>  Here’s the deal. As I think I said, I loaded the 41 Mb file of
>>>>>>> cypher code via the neo4j shell. Before I tried the LabCards file, I
>>>>>>> tried the movies file and a UMLS database I made (8k relationships).
>>>>>>> They worked fine.
>>>>>>>
>>>>>>>  The LabCards file is taking a LONG time to load since I started at
>>>>>>> about 9:30 - 10 PM last night and it’s 3 PM now.
>>>>>>>
>>>>>>>  I’ve wondered if it’s hung up, and the activity monitor’s memory
>>>>>>> usage is constant at two rows of Java at 4GB w/ the kernel at 1 GB.  The
>>>>>>> CPU panel changes a lot so it looks like it’s doing its thing.
>>>>>>>
>>>>>>>  So is this how things are to be expected?  Do you think the
>>>>>>> loading is gonna take a day or two?
>>>>>>>
>>>>>>>  Jose
>>>>>>>
>>>>>>>
>>>>>>>    |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>>>> José F. Morales Ph.D.
>>>>>>>  Instructor
>>>>>>>  Cell Biology and Pathology
>>>>>>> Columbia University Medical Center
>>>>>>>  jm3...@columbia.edu
>>>>>>>  212-452-3351
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>>> José F. Morales Ph.D.
>>>>>> Instructor
>>>>>> Cell Biology and Pathology
>>>>>> Columbia University Medical Center
>>>>>> jm3...@columbia.edu
>>>>>> 212-452-3351
>>>>>>
>>>>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to neo4j+un...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>

