On Mon, Feb 15, 2016 at 3:45 PM, Greg Stark <st...@mit.edu> wrote: > I was thinking about this over the past couple weeks. I'm starting to > think the quicksort runs gives at least the beginnings of a way > forward on this front.
As I've already pointed out several times, I wrote a tool that makes it easy to load sortbenchmark.org data into a PostgreSQL table: https://github.com/petergeoghegan/gensort (You should use the Python script that invokes the "gensort" utility -- see its "--help" display for details). This seems useful as a standard benchmark, since it's perfectly deterministic, allowing the user to create arbitrarily large tables to use for sort benchmarks. Still, it doesn't produce data that is any way organic; sort data is uniformly distributed. Also, it produces a table that really only has one attribute to sort on, a text attribute. I suggest looking at real world data, too. I have downloaded UK land registry data, which is a freely available dataset about property sales in the UK since the 1990s, of which there have apparently been about 20 million (I started with a 20 million line CSV file). I've used COPY to load the data into one PostgreSQL table. I attach instructions on how to recreate this, and some suggested CREATE INDEX statements that seemed representative to me. There are a variety of Postgres data types in use, including UUID, numeric, and text. The final Postgres table is just under 3GB. I will privately make available a URL that those CC'd here can use to download a custom format dump of the table, which comes in at 1.1GB (ask me off-list if you'd like to get that URL, but weren't CC'd here). This URL is provided as a convenience for reviewers, who can skip my detailed instructions. An expensive rollup() query on the "land_registry_price_paid_uk" table is interesting. Example: select date_trunc('year', transfer_date), county, district, city, sum(price) from land_registry_price_paid_uk group by rollup (1, county, district, city); Performance is within ~5% of an *internal* sort with the patch series applied, even though ~80% of time is spent copying and sorting SortTuples overall in the internal sort case (the internal case cannot overlap sorting and aggregate processing, since it has no final merge step). This is a nice demonstration of how this work has significantly blurred the line between internal and external sorts. -- Peter Geoghegan
Instructions ============ CSV File -------- The land registry file from http://data.gov.uk is 3.2GB. A CSV file that can be loaded into PostgreSQL that has organic data. No registration required. See https://theodi.org/blog/the-status-of-csvs-on-datagovuk for details on downloaded the file pp-complete.csv. SQL --- begin; create table land_registry_price_paid_uk( transaction uuid, price numeric, transfer_date date, postcode text, property_type char(1), newly_built boolean, duration char(1), paon text, saon text, street text, locality text, city text, district text, county text, ppd_category_type char(1)); copy land_registry_price_paid_uk FROM '/home/pg/Downloads/pp-complete.csv' with (format csv, freeze true, encoding 'win1252', header false, null '', quote '"', force_null (postcode, saon, paon, street, locality, city, district)); commit; Resulting table --------------- postgres=# \dt+ List of relations Schema â Name â Type â Owner â Size â Description âââââââââ¼ââââââââââââââââââââââââââââââ¼ââââââââ¼ââââââââ¼ââââââââââ¼âââââââââââââ public â land_registry_price_paid_uk â table â pg â 2779 MB â (1 row) Interesting Indexes =================== Many attribute index (Low cardinality leading attributes): postgres=# create index on land_registry_price_paid_uk_suffix(county, district, city, locality, street); UUID pk index (UUID type, high cardinality): postgres=# create index on land_registry_price_paid_uk (transaction); Price index (numeric, moderate cardinality): postgres=# create index on land_registry_price_paid_uk (price); Preview ======= pg@hamster:~$ head ~/Downloads/pp-complete.csv "{0C7ADEF5-878D-4066-B785-0000003ED74A}","163000","2003-02-21 00:00","UB5 4PJ","T","N","F","106","","READING ROAD","NORTHOLT","NORTHOLT","EALING","GREATER LONDON","A" "{35F67271-ABD4-40DA-AB09-00000085B9D3}","247500","2005-07-15 00:00","TA19 9DD","D","N","F","58","","ADAMS MEADOW","ILMINSTER","ILMINSTER","SOUTH SOMERSET","SOMERSET","A" "{B20B1C74-E8E1-4137-AB3E-0000011DF342}","320000","2010-09-10 00:00","W4 1DZ","F","N","L","58","","WHELLOCK ROAD","","LONDON","EALING","GREATER LONDON","A" "{7D6B0915-C56B-4275-AF9B-00000156BCE7}","104000","1997-08-27 00:00","NE61 2BH","D","N","F","17","","WESTGATE","MORPETH","MORPETH","CASTLE MORPETH","NORTHUMBERLAND","A" "{47B60101-B64C-413D-8F60-000002F1692D}","147995","2003-05-02 00:00","PE33 0RU","D","N","F","4","","MASON GARDENS","WEST WINCH","KING'S LYNN","KING'S LYNN AND WEST NORFOLK","NORFOLK","A" "{51F797CA-7BEB-4958-821F-000003E464AE}","110000","2013-03-22 00:00","NR35 2SF","T","N","F","5","","WILD FLOWER WAY","DITCHINGHAM","BUNGAY","SOUTH NORFOLK","NORFOLK","A" "{AB8CE5AE-DFED-46CE-A682-000004533355}","180000","2003-12-23 00:00","SA8 4DG","D","Y","F","17","","OAKWOOD DRIVE","CLYDACH","SWANSEA","NEATH PORT TALBOT","NEATH PORT TALBOT","A" "{1FB4E294-102B-475C-876A-000005E78B1E}","177000","2012-05-29 00:00","BS4 5AW","S","N","F","11","","WATER LANE","BRISLINGTON","BRISTOL","CITY OF BRISTOL","CITY OF BRISTOL","A" "{3B0DA29C-C89A-4FAA-918A-0000074FA0E0}","190000","2013-10-04 00:00","SN14 8LU","T","N","F","148","","HIGH STREET","MARSHFIELD","CHIPPENHAM","SOUTH GLOUCESTERSHIRE","SOUTH GLOUCESTERSHIRE","A" "{9F669E50-AF1E-4E8A-9A66-0000075006C7}","68000","1999-05-21 00:00","CR0 2JJ","S","N","F","83","","WHITEHORSE ROAD","","CROYDON","CROYDON","GREATER LONDON","A" Notes from UK Land Registry authority: ====================================== Transaction unique identifier A reference number which is generated automatically recording each published sale. The number is unique and will change each time a sale is recorded. Price Sale price stated on the transfer deed. Date of Transfer Date when the sale was completed, as stated on the transfer deed. Postcode This is the postcode used at the time of the original transaction. Note that postcodes can be reallocated and these changes are not reflected in the Price Paid Dataset. Property Type D = Detached, S = Semi-Detached, T = Terraced, F = Flats/Maisonettes, O = Other Note that: - we only record the above categories to describe property type, we do not separately identify bungalows. - end-of-terrace properties are included in the Terraced category above. - âOtherâ is only valid where the transaction relates to a property type that is not covered by existing values. Old/New Indicates the age of the property and applies to all price paid transactions, residential and non-residential. Y = a newly built property, N = an established residential building Duration Relates to the tenure: F = Freehold, L= Leasehold etc. Note that Land Registry does not record leases of 7 years or less in the Price Paid Dataset. PAON Primary Addressable Object Name. If there is a sub-building for example the building is divided into flats, see Secondary Addressable Object Name (SAON). SAON Secondary Addressable Object Name. If there is a sub-building, for example the building is divided into flats, there will be a SAON. Street Locality Town/City District County PPD Category Type Indicates the type of Price Paid transaction. A = Standard Price Paid entry, includes single residential property sold for full market value. B = Additional Price Paid entry including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage) and transfers to non-private individuals. Note that category B does not separately identify the transaction types stated. Land Registry has been collecting information on Category A transactions from January 1995. Category B transactions were identified from October 2013. Record Status - monthly file only Indicates additions, changes and deletions to the records.(see guide below). A = Addition C = Change D = Delete. Note that where a transaction changes category type due to misallocation (as above) it will be deleted from the original category type and added to the correct category with a new transaction unique identifier.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers