Re: [Neo4j] Re: large cypher statements

Michael Hunger Fri, 28 Nov 2014 15:17:24 -0800

José

If you watch Nicole's webinar, many things will become clear.
https://vimeo.com/112447027
You don't have to overcomplicate things.


The Skewer(id) thing is not really needed if each of your entities has a
label and a primary key of some sort.
It is just an optimization so that you don't have to think about separate
entities.
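
A minimal sketch of the label-plus-key variant, in the Neo4j 2.x syntax used
elsewhere in this thread. The labels CLT_SOURCE and CLT_TARGET and the
relationship type MEASUREMENT_OF come from José's description; the file paths
and the column names my_node_id, NAME, source_id and target_id are
placeholders, not taken from his data.

```cypher
// One uniqueness constraint (and therefore one index) per label,
// instead of a single shared :Skewer label.
CREATE CONSTRAINT ON (s:CLT_SOURCE) ASSERT s.my_node_id IS UNIQUE;
CREATE CONSTRAINT ON (t:CLT_TARGET) ASSERT t.my_node_id IS UNIQUE;

// Load one node file per label; the label is written into the statement.
// The CLT_TARGET file would be loaded the same way.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_CLT_SOURCE.csv" AS csvline
MERGE (s:CLT_SOURCE {my_node_id: csvline.my_node_id})
ON CREATE SET s.NAME = csvline.NAME;

// Relationships look up each end by its own label and key.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/rels.csv" AS csvline
MATCH (s:CLT_SOURCE {my_node_id: csvline.source_id}),
      (t:CLT_TARGET {my_node_id: csvline.target_id})
CREATE (s)-[:MEASUREMENT_OF]->(t);
```

The price of dropping :Skewer is that the relationship statement has to know
which label each end carries, so you need one statement per combination of
source and target label.
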

Cheers, Michael

On Sat, Nov 29, 2014 at 12:12 AM, José F. Morales <josef...@gmail.com>
wrote:

> Hey Andrii,
>
> I've been thinking a lot about your recommendations.   I have some
> questions, some of which show how ignorant I am.  Apologies for basics if
> necessary.
>
> On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:
>>
>> Before you start.
>>
>> 1. On nodes and their labels. First of all, I strongly suggest you
>> separate your nodes into different .csv files by label. So you won't have a
>> column *`label`* in your .csv but rather a set of files:
>>
>> nodes_LabelA.csv
>> ...
>> nodes_LabelZ.csv
>>
>> whatever your labels are. (Consider a label to be a kind of synonym for
>> `class` in object-oriented programming or `table` in an RDBMS.) That's due
>> to the fact that labels in Cypher are somewhat specific entities and you
>> probably won't be allowed to parameterize them as variables inside your
>> LOAD CSV statement.
>>
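A small sketch of what that means in practice, assuming integer ids as in
Andrii's proposal. The file paths are placeholders and LabelA / LabelZ stand
for whatever your labels are.

```cypher
// Not accepted by the parser: a label taken from a CSV column.
// CREATE (n:csvline.label {my_node_id: toInt(csvline.my_node_id)})

// Accepted: one statement per file, with the label spelled out.
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
CREATE (n:LabelA {my_node_id: toInt(csvline.my_node_id)});

LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelZ.csv" AS csvline
CREATE (n:LabelZ {my_node_id: toInt(csvline.my_node_id)});
```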
>>
> OK, so you have modified your original idea of putting the db into two
> files 1 nodes , 1 relationships.  Now here you say, put all the nodes into
> 1 file/ label.   The way I have worked with it, I created 1 file for a
> class of nodes I'll call CLT_SOURCE and another file for a class of nodes
> called CLT_TARGET.  Then I have a file for the relationships. Perhaps
> foolishly I originally would create 1 file that would combine all of this
> info and try to paste it in the browser or in the shell.  Neither worked
> even though it did with a smaller amount of data.
>
> You are recommending that with the nodes, I take two steps...
> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
> 2) then I split that file into files that correspond to the node:
> *my_node_id, * 1 label, and then properties P1...Pn.  Since I have 10
> Labels/node, I should have 10 files named..... Nodes_LabelA...
> Nodes_LabelJ.  Thus...
>
> File:  CLT_Nodes-LabelA     columns:  *my_node_id,* label A, property
> P1..., property P4
> ...
> File:  CLT_Nodes-LabelJ     columns:  *my_node_id,* label J, property
> P1..., property P4
>
>
> Q1: What are the rules about what can be used for *my_node_id?  *I have
> usually seen them as a letter integer combination. Is that the convention?
>   Sometimes I've seen a letter being used with a specific class of nodes
>  a1..a100 for one class and b1..b100 for another.  I learned the hard way
> that you have to give each node a unique ID.  I used CLT_1...CLT_n for my
> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes.
> It worked with the smaller db I made.  Anything wrong using the convention
> n1...n100?
>
>
>
>> 2. Then consider one additional "technological" label, let's name it
>> `:Skewer` because it will "penetrate" all your nodes of every different
>> label (class) like a kebab skewer.
>>
>> Before you start (or at least before you start importing relationships) do
>>
>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS
>> UNIQUE;
>>
>>
> Q2:  Should I do scenario 1 or 2?
>
> Scenario 1:  add two labels to each file?  One from my original nodes and
> one as "Skewer"
>
> File 1:  CLT_Nodes-LabelA     columns:  *my_node_id,* label A, *Skewer*,
> property P1..., property P4
> ...
> File 2:  CLT_Nodes-LabelJ     columns:  *my_node_id,* label J, *Skewer*,
> property P1..., property P4
>
> OR
>
> Scenario 2:  Include an eleventh file thus....
>
> File 11:  CLT_Nodes-LabelK     columns:  *my_node_id,* *Skewer*, property
> P1..., property P4
>
> From below, I think you mean Scenario 1.
>
> Q3: “Skewer” is just an integer right?  It corresponds in a way to
> my_node_id
>
>> 3. When doing LOAD CSV with nodes, make sure each node gets 2 (two)
>> labels, one of them `:Skewer`. Together with the constraint above, this
>> gives you an index on the `my_node_id` attribute (which makes relationship
>> creation some orders of magnitude faster) and, as a bonus, you'll be sure
>> you don't have accidental duplicate nodes.
>>
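For nodes that carry several labels (José mentions ten per node), one way to
combine the per-label files is to MERGE on the :Skewer key alone and attach
each file's label with SET. This is only a sketch: the file names follow
José's CLT_NODES_LabelA...J pattern, while the paths and the columns
my_node_id and property1 are placeholders.

```cypher
CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;

// First file: find-or-create each node by its unique key, then add :LabelA.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/CLT_NODES_LabelA.csv" AS csvline
MERGE (n:Skewer {my_node_id: csvline.my_node_id})
ON CREATE SET n.property1 = csvline.property1
SET n:LabelA;

// Later file: the same key finds the existing node, which gains :LabelJ.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/CLT_NODES_LabelJ.csv" AS csvline
MERGE (n:Skewer {my_node_id: csvline.my_node_id})
ON CREATE SET n.property1 = csvline.property1
SET n:LabelJ;
```

Merging on the key alone matters here. A pattern like (n:Skewer:LabelJ {...})
does not match a node that has only :Skewer and :LabelA so far, so MERGE would
try to create a second node with the same id and the constraint would reject
it.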
>
>
> Here is some sort of cypher….
>
>
> //Creating the nodes
>
>
>
> USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
> MERGE (n:Skewer:LabelA {property1: csvline.property1})
> ON CREATE SET
>   n.Property2 = csvline.Property2,
>   n.Property3 = csvline.Property3,
>   n.Property4 = csvline.Property4;
>
>
> ….
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
> MERGE (n:Skewer:LabelJ {property1: csvline.property1})
> ON CREATE SET
>   n.Property2 = csvline.Property2,
>   n.Property3 = csvline.Property3,
>   n.Property4 = csvline.Property4;
>
>
>
>
> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J
> combine the various labels and their respective values with their
> corresponding nodes?
>
> Q5: Since I think of my data in terms of the two classes of nodes in my
> Data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after
> loading the nodes, how do I then get two classes of nodes?
>
> Q6: Is there a step missing that explains how the code below got to have a
> “source_node” and a “dest_node” that appears to correspond to my CLT_SOURCE
> and CLT_TARGET nodes?
>
>
>
>
>
>> 4. Now when you are done with nodes and start doing LOAD CSV for
>> relationships, you may give the MATCH statement, which looks up your pair
>> of nodes, a hint for fast lookup, like
>>
>> LOAD CSV ...from somewhere... AS csvline
>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}),
>>       (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ...,
>> rel_prop_NN: csvline[ZZ]}]->(dest_node);
>>
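The same statement written against a relationship file with a header row may
make the moving parts easier to see. The column names source_my_node_id,
dest_my_node_id and rel_prop_01 are the ones proposed for CSV2 earlier in the
thread; the file path is a placeholder.

```cypher
// csvline is a variable bound to one row of the file at a time.
// With WITH HEADERS it is a map addressed by column name;
// without it, it is a list addressed by position (csvline[0], csvline[1], ...).
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/rels.csv" AS csvline
MATCH (source_node:Skewer {my_node_id: toInt(csvline.source_my_node_id)}),
      (dest_node:Skewer {my_node_id: toInt(csvline.dest_my_node_id)})
CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_01: csvline.rel_prop_01}]->(dest_node);
```

source_node and dest_node are just variable names chosen inside the statement,
not headers of the file. my_node_id is the property stored on each node when it
was loaded, and the two columns carry the ids of the start and end node. Like a
label, the relationship type cannot come from a column, so a file with several
rel_type values needs one statement (or one file) per type.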
>>
> Q6: This LOAD CSV  command (line 1) looks into the separate REL.csv file
> you mentioned first right?
>
> Q7: csvline is some sort of temp file that is a series of lines of the csv
> file?
>
> Q8: Do you imply in line 2 that the REL.csv file has headers that include
> source_node, dest_node ?
>
> Q9: While I see how Skewer is a label,  how is my_node_id a  property
> (line 2) ?
>
> Q10: How does my_node_id relate to either ToInt(csvline[0]) or
> ToInt(csvline[1])  (line 2) ?
>
> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?
>
> Does csvline[0] refer to a column in REL.csv as do csvline[2] and
> csvline[ZZ] (line 3) ?
>
>
>> Adding *`:Skewer` *label in MATCH will tell Cypher to (implicitly) use
>> your index on *my_node_id* which was created when you created your
>> constraint. Or you may try to explicitly give it a hint to use the index,
>> with USING INDEX... clause after MATCH before CREATE. Btw some earlier
>> versions of Neo4j refused to use index in LOAD CSV for some reason, I hope
>> this problem is gone with 2.1.5.
>>
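For reference, an explicit hint looks roughly like this. It is a sketch that
assumes the :Skewer constraint from step 2 already exists; the id 42 is
arbitrary.

```cypher
// Ask the planner to find the node through the index on :Skewer(my_node_id).
MATCH (n:Skewer)
USING INDEX n:Skewer(my_node_id)
WHERE n.my_node_id = 42
RETURN n;
```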
> OK
>
>
>> 5. While importing, be careful to *explicitly specify type conversions
>> for each property which is not a string*. I have seen numerous occasions
>> when people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and Cypher
>> silently stored their (supposed) numerics as strings. It's Ok, dude, you
>> say it :) This led to confusion afterwards when, say, numerical comparisons
>> don't MATCH and so on (though it's easy to correct with a single Cypher
>> command, but anyway).
>>
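A short sketch of both halves of that advice, converting during the load and
repairing afterwards. LabelA, the file path and the score column are
hypothetical.

```cypher
// Convert while loading: every CSV field arrives as a string.
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
CREATE (n:LabelA {my_node_id: toInt(csvline.my_node_id),
                  score: toFloat(csvline.score)});

// Repair after the fact: rewrite a property that was stored as a string.
MATCH (n:LabelA)
WHERE n.score IS NOT NULL
SET n.score = toFloat(n.score);
```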
> Think I did that re. type conversion.  Only applies to properties for my
> data.
>
> Sorry for so many questions.  I am really interested in figuring this out!
>
> Thanks loads,
> Jose
>
>
>
>> WBR,
>> Andrii
>>
>> On Wednesday, November 19, 2014 9:36:50 PM UTC+2, José F. Morales wrote:
>>>
>>>
>>> 3. CSV approach
>>> a. “Dump the base into 2 .csv files:”
>>> b. CSV1:  “Describe nodes (enumerate them via some my_node_id integer
>>> attribute),  columns: my_node_id,label,node_prop_01,node_prop_ZZ”
>>> c. CSV2:  “Describe relations,  columns: source_my_node_id,
>>> dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
>>> d. Indexes constraints: before starting import  —> have appropriate
>>> indexes / constraints
>>> e. via LOAD CSV, import CSV1, then CSV2.
>>> f. Import no more than 10,000-30,000 lines in a single LOAD CSV
>>> statement
>>>
>>> This seems to be a very well elaborated method and the easiest for me to
>>> do.  I have files such that I can create these without too much problem.  I
>>> figure I’ll split the nodes into three files 20k rows each.  I can do the
>>> same with the Rels.  I have not used indexes or constraints yet in the db’s
>>> that I already created and as I said above, I’ll have to see how to use
>>> them.
>>>
>>> I am assuming column headers that fit with my data are consistent with
>>> what you explained below (Like, I can put my own meaningful text into Label
>>> 1 -10 and node_prop_01 - 05)....
>>> my_node_id,    label1,       label2,       label3,   label4,
>>>  label5,         label6,             label7,          label8,   label9,
>>>        label10,           node_prop_01,    node_prop_02,  node_prop_03,
>>>  node_prop_04,       node_prop_ZZ”
>>>
>>> Thanks again Fellas!!
>>>
>>> Jose
>>>
>>>
>>> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
>>>>
>>>> José,
>>>>
>>>> Let's continue the discussion on the google group
>>>>
>>>> With larger I meant amount of data, not size of statements
>>>>
>>>> As I also point out in various places we recommend creating only small
>>>> subgraphs with a single statement, separated by semicolons.
>>>> Eg up to 100 nodes and rels
>>>>
>>>> Gigantic statements just make the parser explode
>>>>
>>>> I recommend splitting them up into statements creating subgraphs
>>>> Or create nodes and later match them by label & property to connect them
>>>> Make sure to have appropriate indexes / constraints
>>>>
>>>> You should also surround blocks of statements with begin and commit
>>>> commands
>>>>
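As a rough sketch, a script for neo4j-shell split up along those lines could
look like the following. It assumes the shell's begin and commit commands
mentioned above. The my_node_id property and the plain CLT_SOURCE / CLT_TARGET
labels are illustrative choices; the values are taken from José's sample node
and relationship below.

```cypher
// Block 1: create nodes in small statements, each ending in a semicolon.
begin
CREATE (:CLT_SOURCE {my_node_id: 'CLT_1', NAME: 'Acetoacetate (ketone body)'});
CREATE (:CLT_TARGET {my_node_id: 'CLT_TARGET_3617'});
commit

// Block 2: connect the nodes by looking them up via label and property.
begin
MATCH (s:CLT_SOURCE {my_node_id: 'CLT_1'}),
      (t:CLT_TARGET {my_node_id: 'CLT_TARGET_3617'})
CREATE (s)-[:MEASUREMENT_OF {Phylum: 'TZ'}]->(t);
commit
```

Identifiers such as CLT_1 are only visible inside the statement that defines
them. Once the import is split into separate statements, the relationship
statements therefore have to find their nodes by a stored property instead.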
>>>> Sent from my iPhone
>>>>
>>>> On 19.11.2014 at 04:18, José F. Morales Ph.D. <
>>>> jm3...@columbia.edu> wrote:
>>>>
>>>> Hey Michael and Kenny
>>>>
>>>> Thanks you guys a bunch for the help.
>>>>
>>>> Let me give you a little background.  I am charged to make a prototype
>>>> of a tool (“LabCards”) that we hope to use in the hospital and beyond at
>>>> some point .  In preparation for making the main prototype, I made two
>>>> prior Neo4j databases that worked exactly as I wanted them to.  The first
>>>> database was built with NIH data and had 183 nodes and around 7500
>>>> relationships.  The second database was the Pre-prototype and it had 1080
>>>> nodes and around 2000 relationships.  I created these in the form of cypher
>>>> statements and either pasted them in the Neo4j browser or used the neo4j
>>>> shell and loaded them as text files. Before doing that I checked the cypher
>>>> code with Sublime Text 2 that highlights the code. Both databases loaded
>>>> fine in both methods and did what I wanted them to do.
>>>>
>>>> As you might imagine, the prototype is an expansion of the
>>>> mini-prototype.  It has almost the same data model and I built it as a
>>>> series of cypher statements as well.  My first version of the prototype had
>>>> ~60k nodes and 160k relationships.
>>>>
>>>> I should say that a feature of this model is that all the source and
>>>> target nodes have relationships that point to each other.  No node points
>>>> to itself as far as I know. This file was 41 Mb of cypher code that I tried
>>>> to load via the neo4j shell.
>>>>
>>>> In fact, I was following your advice on loading big data files... “Use
>>>> the Neo4j-Shell for larger Imports”
>>>> (http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/).
>>>> This first time out, Java maxed out its allocated memory at 4Gb 2x and
>>>> did not complete loading in 24 hours.  I killed it.
>>>>
>>>> I then contacted Kenny, and he generously gave me some advice regarding
>>>> the properties file (below) and again the same deal (4 Gb Memory 2x) with
>>>> Java and no success in about 24 hours. I killed that one too.
>>>>
>>>> Given my loading problems, I have subsequently eliminated a bunch of
>>>> relationships (100k) so that the file is now 21 Mb. A lot of these were
>>>> duplicates that I didn’t pick up before. I am trying it again.  So far,
>>>> 15 min into it, similar situation.  The difference is that Java is using
>>>> 1.7 and 0.5 GB of memory
>>>>
>>>> Here is the cypher for a typical node…
>>>>
>>>> CREATE (CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ
>>>>   {NAME:'Acetoacetate (ketone body)', SYNONYM:'', Sample:'SERUM, URINE',
>>>>    MEDCODE:10010, CUI:'NA'})
>>>>
>>>> Here is the cypher for a typical relationship...
>>>>
>>>> CREATE(CLT_1)-[:MEASUREMENT_OF{Phylum:'TZ',CAT:'TEST.NAME'
>>>> ,Ui_Rl:'T157',RESULT:'',Type:'',Semantic_Distance_Score:'NA'
>>>> ,Path_Length:'NA',Path_Steps:'NA'}]->(CLT_TARGET_3617),
>>>>
>>>> I will let you know how this one turns out.  I hope this is helpful.
>>>>
>>>> Many, many thanks fellas!!!
>>>>
>>>> Jose
>>>>
>>>> On Nov 18, 2014, at 8:33 PM, Michael Hunger <
>>>> michael...@neotechnology.com> wrote:
>>>>
>>>> Hi José,
>>>>
>>>> can you provide perhaps more detail about your dataset (e.g. sample of
>>>> the csv, size, etc. perhaps an output of csvstat (of csvkit) would be
>>>> helpful), your cypher queries to load it
>>>>
>>>> Have you seen my other blog post, which explains two big caveats that
>>>> people run into when trying this?
>>>> jexp.de/blog/2014/10/load-cvs-with-success/
>>>>
>>>> Cheers, Michael
>>>>
>>>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <k...@socialmoon.com>
>>>> wrote:
>>>>
>>>>>  Hey Jose,
>>>>>
>>>>>  There is definitely an answer. Let me put you in touch with the data
>>>>> import master: Michael Hunger.
>>>>>
>>>>>  Michael, I think the answers here will be pretty straightforward
>>>>> for you. You met Jose at GraphConnect NY last year, so I'll spare any
>>>>> introductions. The memory map configurations I provided need to be
>>>>> calculated and customized for the data import volume.
>>>>>
>>>>>  Thanks,
>>>>>
>>>>>  Kenny
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <
>>>>> jm3...@columbia.edu> wrote:
>>>>>
>>>>>   Kenny,
>>>>>
>>>>>  In 3 hours it’ll be trying to load for 24 hours so this is not
>>>>> working.  I’m catching shit from my crew too, so I got to fix this like
>>>>> soon.
>>>>>
>>>>>  I haven’t done this before, but can I break up the data and load it
>>>>> in pieces?
>>>>>
>>>>>  Jose
>>>>>
>>>>>  On Nov 17, 2014, at 3:35 PM, Kenny Bastani <k...@socialmoon.com>
>>>>> wrote:
>>>>>
>>>>>  Hey Jose,
>>>>>
>>>>>  Try turning off the object cache. Add this line to the
>>>>> neo4j.properties configuration file:
>>>>>
>>>>>  cache_type=none
>>>>>
>>>>> Then retry your import. Also, enable memory mapped files by adding
>>>>> these lines to the neo4j.properties file:
>>>>>
>>>>>  neostore.nodestore.db.mapped_memory=2048M
>>>>> neostore.relationshipstore.db.mapped_memory=4096M
>>>>> neostore.propertystore.db.mapped_memory=200M
>>>>> neostore.propertystore.db.strings.mapped_memory=500M
>>>>> neostore.propertystore.db.arrays.mapped_memory=500M
>>>>>
>>>>>  Thanks,
>>>>>
>>>>>  Kenny
>>>>>
>>>>>  ------------------------------
>>>>> *From:* José F. Morales Ph.D. <jm3...@columbia.edu>
>>>>> *Sent:* Monday, November 17, 2014 12:32 PM
>>>>> *To:* Kenny Bastani
>>>>> *Subject:* latest
>>>>>
>>>>>   Hey Kenny,
>>>>>
>>>>>  Here’s the deal. As I think I said, I loaded the 41 Mb file of
>>>>> cypher code via the neo4j shell. Before I tried the LabCards file, I tried
>>>>> the movies file and a UMLS database I made (8k relationships).  They
>>>>> worked fine.
>>>>>
>>>>>  The LabCards file is taking a LONG time to load since I started at
>>>>> about 9:30 - 10 PM last night and it’s 3 PM now.
>>>>>
>>>>>  I’ve wondered if it’s hung up, and the activity monitor’s memory usage
>>>>> is constant at two rows of Java at 4GB w/ the kernel at 1 GB.  The CPU
>>>>> panel changes a lot so it looks like it’s doing its thing.
>>>>>
>>>>>  So is this how things are to be expected?  Do you think the loading
>>>>> is gonna take a day or two?
>>>>>
>>>>>  Jose
>>>>>
>>>>>
>>>>>    |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>> José F. Morales Ph.D.
>>>>>  Instructor
>>>>>  Cell Biology and Pathology
>>>>> Columbia University Medical Center
>>>>>  jm3...@columbia.edu
>>>>>  212-452-3351
>>>>>
>>>>>
>>>>
>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>> José F. Morales Ph.D.
>>>> Instructor
>>>> Cell Biology and Pathology
>>>> Columbia University Medical Center
>>>> jm3...@columbia.edu
>>>> 212-452-3351
>>>>
>>>>  --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to neo4j+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

