On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>
> On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> >Hi,
> >
> >> The one piece of information I'm missing here is at least a very rough
> >> quantification of the individual steps of CSV processing - for example
> >> if parsing takes only 10% of the time, it's pretty pointless to start by
> >> parallelising this part and we should focus on the rest. If it's 50% it
> >> might be a different story. Has anyone done any measurements?
> >
> >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> >way more than 10%.
> >
>
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first.
>
Agreed.

> I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.
>

Right, I don't know what the best way to define that is. I can think of
the below tests.

1. A table with 10 columns (with datatypes such as integer, date, and
text). It has one index (unique/primary). Load it with 1 million rows
(the data should probably be 5-10 GB).

2. A table with 10 columns (with datatypes such as integer, date, and
text). It has three indexes, one of which can be unique/primary. Load it
with 1 million rows (the data should probably be 5-10 GB).

3. A table with 10 columns (with datatypes such as integer, date, and
text). It has three indexes, one of which can be unique/primary. It has
before and after triggers. Load it with 1 million rows (the data should
probably be 5-10 GB).

4. A table with 10 columns (with datatypes such as integer, date, and
text). It has five or six indexes, one of which can be unique/primary.
Load it with 1 million rows (the data should probably be 5-10 GB).

For all of these tests, we can check how much time we spend reading and
parsing the CSV file vs. the rest of the execution. A rough sketch of
what the setup for test 1 could look like is below.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
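
As a starting point, test 1 might look something like the following
(the table name, exact column mix, and CSV path are only illustrative,
not something we've agreed on; timing is just psql's \timing, the finer
read/parse split would still need profiling):

-- Illustrative schema for test 1: 10 columns, one unique/primary index
CREATE TABLE copy_test (
    id      bigint PRIMARY KEY,
    created date,
    c3      integer,
    c4      integer,
    c5      integer,
    c6      text,
    c7      text,
    c8      text,
    c9      text,
    c10     text
);

-- Load 1 million pre-generated rows from a CSV file and time the COPY
\timing on
COPY copy_test FROM '/tmp/copy_test.csv' WITH (FORMAT csv);
\timing off

Tests 2-4 would only differ in the extra indexes and triggers created
before the COPY. For the read/parse vs. rest-of-execution split, I
think we'd profile the backend during the COPY (e.g. with perf) and
attribute the time spent in the input/parsing paths of copy.c (the
CopyReadLine/NextCopyFrom side, as I understand it) vs. heap insertion,
index maintenance, and trigger execution.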