Re: [Neo4j] LOAD CSV takes over an hour

Mark Needham Tue, 04 Mar 2014 08:22:58 -0800

Hi Aram,

* Do you have any more information about the spec of the machine you're
running this on, e.g. how much RAM it has?
* Have you tried raising the value passed to PERIODIC COMMIT? Perhaps try it
out on a smaller subset of the data to measure the impact, e.g. with values
of 1,000 and 10,000.
* I think it would be interesting to pull out some other things as nodes as
well - might lead to more interesting queries e.g. CEO, Location,
Registered Agent, DOS Process, Jurisdiction could all be nodes that link
back to a DOS.
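
To make the last two points a bit more concrete, here is a rough sketch
rather than a tested recipe. It reuses the file path and column positions
from your query and assumes the 2.x constraint syntax. The Jurisdiction
label, its name property and the HAS_JURISDICTION relationship type are
names I made up for illustration, and the 100,000-row LIMIT and commit size
of 10,000 are only starting points for timing.

```cypher
// Assumption: run this once first so MERGE can look up existing
// Jurisdiction nodes through an index instead of scanning.
CREATE CONSTRAINT ON (j:Jurisdiction) ASSERT j.name IS UNIQUE;

// Time a subset with a larger commit size before loading the whole file.
USING PERIODIC COMMIT 10000
LOAD CSV
  FROM "file://path/to/csv/Active_Corporations___Beginning_1800__without_header__wonky_characters_fixed.csv"
  AS company
WITH company LIMIT 100000
// Only two properties are shown here; the rest would follow your original query.
CREATE (c:DataActiveCorporations
{
DOS_ID:company[0],
Current_Entity_Name:company[1]
})
// Rows without a jurisdiction still get a company node but no link,
// since MERGE cannot match on a null property.
WITH c, company
WHERE company[4] IS NOT NULL AND company[4] <> ""
// Hypothetical model: one shared node per distinct jurisdiction value.
MERGE (j:Jurisdiction {name: company[4]})
CREATE (c)-[:HAS_JURISDICTION]->(j);
```

If the subset timings look sensible, the same MERGE pattern should carry
over to CEO, Location, Registered Agent and DOS Process.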


Let me know if any of that doesn't make sense.
Mark


On 4 March 2014 15:54, Aram Chung <aramol...@gmail.com> wrote:

> Hi,
>
> I was asked to post this here by Mark Needham (@markhneedham) who thought
> my query took longer than it should.
>
> I'm trying to see how graph databases could be used in investigative
> journalism: I was loading in New York State's Active Corporations:
> Beginning 1800 data from
> https://data.ny.gov/Economic-Development/Active-Corporations-Beginning-1800/n9v6-gdp6
> as a 1964486-row csv (and deleted all U+F8FF characters, because I was
> getting "[null] is not a supported property value"). The Cypher query I
> used was
>
> USING PERIODIC COMMIT 500
> LOAD CSV
>   FROM
> "file://path/to/csv/Active_Corporations___Beginning_1800__without_header__wonky_characters_fixed.csv"
>   AS company
> CREATE (:DataActiveCorporations
> {
> DOS_ID:company[0],
> Current_Entity_Name:company[1],
> Initial_DOS_Filing_Date:company[2],
> County:company[3],
> Jurisdiction:company[4],
> Entity_Type:company[5],
>
> DOS_Process_Name:company[6],
> DOS_Process_Address_1:company[7],
> DOS_Process_Address_2:company[8],
> DOS_Process_City:company[9],
> DOS_Process_State:company[10],
> DOS_Process_Zip:company[11],
>
> CEO_Name:company[12],
> CEO_Address_1:company[13],
> CEO_Address_2:company[14],
> CEO_City:company[15],
> CEO_State:company[16],
> CEO_Zip:company[17],
>
> Registered_Agent_Name:company[18],
> Registered_Agent_Address_1:company[19],
> Registered_Agent_Address_2:company[20],
> Registered_Agent_City:company[21],
> Registered_Agent_State:company[22],
> Registered_Agent_Zip:company[23],
>
> Location_Name:company[24],
> Location_Address_1:company[25],
> Location_Address_2:company[26],
> Location_City:company[27],
> Location_State:company[28],
> Location_Zip:company[29]
> }
> );
>
> Each row is one node so it's as close to the raw data as possible. The
> idea is loosely that these nodes will be linked with new nodes representing
> people and addresses verified by reporters.
>
> This is what I got:
>
> +-------------------+
> | No data returned. |
> +-------------------+
> Nodes created: 1964486
> Properties set: 58934580
> Labels added: 1964486
> 4550855 ms
>
> Some context information:
> Neo4j Milestone Release 2.1.0-M01
> Windows 7
> java version "1.7.0_03"
>
> Best,
> Aram
>
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to neo4j+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
