Re: Fixing backslash dot for COPY FROM...CSV
Hi, In <4ffb7317-7e92-4cbd-a542-1e478af6a...@manitou-mail.org> "Re: Fixing backslash dot for COPY FROM...CSV" on Wed, 31 Jul 2024 15:42:41 +0200, "Daniel Verite" wrote: >> BTW, here is a diff after pgindent: > > PFA a v5 with the cosmetic changes applied. Thanks for updating the patch. >> I also confirmed that the updated server and non-updated >> psql compatibility problem (the end-of-data "\." is >> inserted). It seems that it's difficult to solve without >> introducing incompatibility. > > To clarify the compatibility issue, the attached bash script > compares pre-patch and post-patch client/server combinations with > different cases, submitted with different copy variants. > > case A: quoted backslash-dot sequence in CSV > case B: unquoted backslash-dot sequence in CSV > case C: CSV without backslash-dot > case D: text with backslash-dot at the end > case E: text without backslash-dot at the end > > The different ways to submit the data: > copy from file > \copy from file > \copy from pstdin > copy from stdin with embedded data after the command > > Also attaching the tables of results with the patch as it stands. > "Failed" is when psql reports an error and "Data mismatch" > is when it succeeds but with copied data differing from what was > expected. > > Does your test has an equivalent in these results or is it a different > case? Sorry for not clarify my try. My try was: Case C +--++ | method | patched-server+| | | unpatched-psql | +--++ | embedded | Data mismatch | +--++ I confirmed that this case is "Data mismatch". Thanks, -- kou
Re: Separate HEAP WAL replay logic into its own file
Hi, In <599e67d2-2929-4858-b8bc-f9c4ae889...@ebay.com> "Re: Separate HEAP WAL replay logic into its own file" on Fri, 26 Jul 2024 07:56:12 +, "Li, Yong" wrote: >> 1. Could you create your patch by "git format-patch -vN master" >> or something? If you create your patch by "git format-patch", >> we can apply your patch by "git am XXX.patch". >> > > Thanks for your review. I’ve updated the patch to follow your > suggested format. Thanks. I could apply your patch by "git am v2-0001-heapam_refactor.patch". Could you use the following format for the commit message next time? ${TITLE} ${DESCRIPTION} For example: Separate HEAP WAL replay logic into its own file Most access methods (i.e. nbtree and hash) use a separate file with "xlog" in its name for its WAL replay logic. Heap is one exception of this convention. To make it easier for newcomers to find the WAL replay logic for the heap access method, this patch isolates heap's replay logic in a new heapam_xlog.c file. This patch is a pure refactoring with no change to the logic. This is a commonly used Git's commit message format. See also other commit messages by "git log". >> 5. There are still WAL related codes in heapam.c: >> >> 4.1. log_heap_update() >> 4.2. log_heap_new_cid() >> 4.3. if (RelationNeedsWAL()) {...} in heap_insert() >> 4.4. if (needwal) {...} in heap_multi_insert() >> 4.5. if (RelationNeedsWAL()) {...} in heap_delete() >> 4.6. if (RelationNeedsWAL()) {...}s in heap_update() >> 4.7. if (RelationNeedsWAL()) {...} in heap_lock_tuple() >> 4.8. if (RelationNeedsWAL()) {...} in heap_lock_updated_tuple_rec() >> 4.9. if (RelationNeedsWAL()) {...} in heap_finish_speculative() >> 4.10. if (RelationNeedsWAL()) {...} in heap_abort_speculative() >> 4.11. if (RelationNeedsWAL()) {...} in heap_inplace_update() >> 4.12. log_heap_visible() >> >> Should we move them to head_xlog.c too? >> >> If we should do it, separated commits will be easy to >> review. For example, the 0001 patch moves existing codes >> to head_xlog.c as-is. The 0002 patch extracts WAL related >> codes in heap_insert() to heap_xlog.c and so on. > > I followed the convention of most access methods. The “xlog” > file includes the WAL replay logic only. The logic that generates > the log records themselves stays with the code that performs > the changes. Take nbtree as an example, you can also find > WAL generating code in several _bt_insertxxx() functions inside > the nbtinsert.c file. You're right. Sorry. I think that this proposal is reasonable but we need to get attention from a committer to move forward this proposal. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Sun, 28 Jul 2024 22:49:47 +0800, Junwang Zhao wrote: > Thanks for updating the patches, I applied them and test > in my local machine, I did not use tmpfs in my test, I guess > if I run the tests enough rounds, the OS will cache the > pages, below is my numbers(I run each test 30 times, I > count for the last 10 ones): > > HEADPATCHED > > COPY to_tab_30 TO '/dev/null' WITH (FORMAT text); > > 5628.280 ms 5679.860 ms > 5583.144 ms 5588.078 ms > 5604.444 ms 5628.029 ms > 5617.133 ms 5613.926 ms > 5575.570 ms 5601.045 ms > 5634.828 ms 5616.409 ms > 5693.489 ms 5637.434 ms > 5585.857 ms 5609.531 ms > 5613.948 ms 5643.629 ms > 5610.394 ms 5580.206 ms > > COPY from_tab_30 FROM '/tmp/to_tab_30.txt' WITH (FORMAT text); > > 3929.955 ms 4050.895 ms > 3909.061 ms 3890.156 ms > 3940.272 ms 3927.614 ms > 3907.535 ms 3925.560 ms > 3952.719 ms 3942.141 ms > 3933.751 ms 3904.250 ms > 3958.274 ms 4025.581 ms > 3937.065 ms 3894.149 ms > 3949.896 ms 3933.878 ms > 3925.399 ms 3936.170 ms > > I did not see obvious performance degradation, maybe it's > because I did not use tmpfs, but I think this OTH means > that the *function call* and *if branch* added for each row > is not the bottleneck of the whole execution path. Thanks for sharing your numbers. I agree with there are no obvious performance degradation. > In 0001, > > +typedef struct CopyFromRoutine > +{ > + /* > + * Called when COPY FROM is started to set up the input functions > + * associated to the relation's attributes writing to. `finfo` can be > + * optionally filled to provide the catalog information of the input > + * function. `typioparam` can be optionally filled to define the OID of > + * the type to pass to the input function. `atttypid` is the OID of data > + * type used by the relation's attribute. > > +typedef struct CopyToRoutine > +{ > + /* > + * Called when COPY TO is started to set up the output functions > + * associated to the relation's attributes reading from. `finfo` can be > + * optionally filled. `atttypid` is the OID of data type used by the > + * relation's attribute. > > The second comment has a simplified description for `finfo`, I think it > should match the first by: > > `finfo` can be optionally filled to provide the catalog information of the > output function. Good catch. I'll update it as suggested in the next patch set. > After I post the patch diffs, the gmail grammer shows some hints that > it should be *associated with* rather than *associated to*, but I'm > not sure about this one. Thanks. I'll use "associated with". > I think the patches are in good shape, I can help to do some > further tests if needed, thanks for working on this. Thanks! -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, THREAD SUMMARY: Proposal: How about making COPY format extendable? Background: Currently, COPY TO/FROM supports only "text", "csv" and "binary" formats. There are some requests to support more COPY formats. For example: * 2023-11: JSON and JSON lines [1] * 2022-04: Apache Arrow [2] * 2018-02: Apache Avro, Apache Parquet and Apache ORC [3] There were discussions how to add support for more formats. [3][4] In these discussions, we got a consensus about making COPY format extendable. [1]: https://www.postgresql.org/message-id/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5 [2]: https://www.postgresql.org/message-id/flat/CAGrfaBVyfm0wPzXVqm0%3Dh5uArYh9N_ij%2BsVpUtDHqkB%3DVyB3jw%40mail.gmail.com [3]: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com [4]: https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d Concerns: * Performance: If we make COPY format extendable, it will introduce some overheads. We don't want to loss our optimization efforts for the current implementations by this. [5] * Extendability: We don't know which API set is enough for custom COPY format implementations yet. We don't want to provide too much APIs to reduce maintenance cost. [5]: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us Implementation: The v18 patch set is the latest patch set. [6] It includes the following patches: 0001: This adds a basic feature (Copy{From,To}Routine) (This isn't enough for extending COPY format. This just extracts minimal procedure sets to be extendable as callback sets.) 0002: This uses Copy{From,To}Rountine for the existing formats (text, csv and binary) (This may not be committed because there is a profiling related concern. See the following section for details) 0003: This adds support for specifying custom format by "COPY ... WITH (format 'my-format')" (This also adds a test for this feature.) 0004: This exports Copy{From,To}StateData (But this isn't enough to implement custom COPY FROM/TO handlers as an extension.) 0005: This adds opaque member to Copy{From,To}StateData and export some functions to read the next data and flush the buffer (We can implement a PoC Apache Arrow COPY FROM/TO handler as an extension with this. [7]) [6]: https://www.postgresql.org/message-id/flat/20240724.173059.909782980111496972.kou%40clear-code.com [7]: https://github.com/kou/pg-copy-arrow Implementation notes: * 0002: We use "static inline" and "constant argument" for optimization. * 0002: This hides NextCopyFromRawFields() in a public header because it's not used in PostgreSQL and we want to use "static inline" for it. If it's a problem, we can keep it and create an internal function for "static inline". * 0003: We use "CREATE FUNCTION" to register a custom COPY FROM/TO handler. It's the same approach as tablesample. * 0004 and 0005: We can mix them but this patch set split them for easy to review. 0004 just moves the existing codes. It doesn't change the existing codes. * PoC: I provide it as a separated repository instead of a patch because an extension exists as a separated project in general. If it's a problem, I can provide it as a patch for contrib/. * This patch set still has minimal Copy{From,To}Routine. For example, custom COPY FROM/TO handlers can't process their own options with this patch set. We may add more callbacks to Copy{From,To}Routine later based on real world use-cases. Performance concern: We have a benchmark result and a profile for the change that uses Copy{From,To}Routine for the existing formats. [8] They are based on the v15 patch but there are no significant difference between the v15 patch and v18 patch set. These results show the followings: * Runtime: The patched version is faster than HEAD. * The patched version: 6232ms in average * HEAD: 6550ms in average * Profile: The patched version spends more percents than HEAD in a core function. * The patched version: 85.61% in CopyOneRowTo() * HEAD: 80.35% in CopyOneRowTo() [8]: https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz Here are related information for this benchmark/profile: * Use -O2 for optimization build flag ("meson setup --buildtype=release" may be used) * Use tmpfs for PGDATA * Disable fsync * Run on scissors (what is "scissors" in this context...?) [9] * Unlogged table may be used * Use a table that has 30 integer columns (*1) * Use 5M rows (*2) * Use '/dev/null' for COPY TO (*3) * Use blackhole_am for COPY FROM (*4) https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am * perf is used but used options are unknown (sorry) (*1) This SQL may be used to create the table: CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int) RETURNS VOID AS $fu
Re: Make COPY format extendable: Extract COPY TO format implementations
7; int default 1'; IF i != num_cols THEN query := query || ', '; END IF; END LOOP; query := query || ')'; EXECUTE format(query); END $func$ LANGUAGE plpgsql; SELECT create_table_cols ('to_tab_30', 30); SELECT create_table_cols ('from_tab_30', 30); (*2) This SQL may be used to insert 5M rows: INSERT INTO to_tab_30 SELECT FROM generate_series(1, 500); (*3) This SQL may be used for COPY TO: COPY to_tab_30 TO '/dev/null' WITH (FORMAT text); (*4) This SQL may be used for COPY FROM: CREATE EXTENSION blackhole_am; ALTER TABLE from_tab_30 SET ACCESS METHOD blackhole_am; COPY to_tab_30 TO '/tmp/to_tab_30.txt' WITH (FORMAT text); COPY from_tab_30 FROM '/tmp/to_tab_30.txt' WITH (FORMAT text); If there is enough information, could you try? Thanks, -- kou >From 22daacbd77c6dd0e13fe11e30fba90f7595ff6c1 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Mon, 4 Mar 2024 13:52:34 +0900 Subject: [PATCH v18 1/5] Add CopyFromRoutine/CopyToRountine They are for implementing custom COPY FROM/TO format. But this is not enough to implement custom COPY FROM/TO format yet. We'll export some APIs to receive/send data and add "format" option to COPY FROM/TO later. Existing text/csv/binary format implementations don't use CopyFromRoutine/CopyToRoutine for now. We have a patch for it but we defer it. Because there are some mysterious profile results in spite of we get faster runtimes. See [1] for details. [1] https://www.postgresql.org/message-id/ZdbtQJ-p5H1_EDwE%40paquier.xyz Note that this doesn't change existing text/csv/binary format implementations. --- src/backend/commands/copyfrom.c | 24 +- src/backend/commands/copyfromparse.c | 5 ++ src/backend/commands/copyto.c| 31 ++- src/include/commands/copyapi.h | 100 +++ src/include/commands/copyfrom_internal.h | 4 + src/tools/pgindent/typedefs.list | 2 + 6 files changed, 158 insertions(+), 8 deletions(-) create mode 100644 src/include/commands/copyapi.h diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index ce4d62e707c..ff13b3e3592 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -1618,12 +1618,22 @@ BeginCopyFrom(ParseState *pstate, /* Fetch the input function and typioparam info */ if (cstate->opts.binary) + { getTypeBinaryInputInfo(att->atttypid, &in_func_oid, &typioparams[attnum - 1]); + fmgr_info(in_func_oid, &in_functions[attnum - 1]); + } + else if (cstate->routine) + cstate->routine->CopyFromInFunc(cstate, att->atttypid, + &in_functions[attnum - 1], + &typioparams[attnum - 1]); + else + { getTypeInputInfo(att->atttypid, &in_func_oid, &typioparams[attnum - 1]); - fmgr_info(in_func_oid, &in_functions[attnum - 1]); + fmgr_info(in_func_oid, &in_functions[attnum - 1]); + } /* Get default info if available */ defexprs[attnum - 1] = NULL; @@ -1763,10 +1773,13 @@ BeginCopyFrom(ParseState *pstate, /* Read and verify binary header */ ReceiveCopyBinaryHeader(cstate); } - - /* create workspace for CopyReadAttributes results */ - if (!cstate->opts.binary) + else if (cstate->routine) { + cstate->routine->CopyFromStart(cstate, tupDesc); + } + else + { + /* create workspace for CopyReadAttributes results */ AttrNumber attr_count = list_length(cstate->attnumlist); cstate->max_fields = attr_count; @@ -1784,6 +1797,9 @@ BeginCopyFrom(ParseState *pstate, void EndCopyFrom(CopyFromState cstate) { + if (cstate->routine) + cstate->routine->CopyFromEnd(cstate); + /* No COPY FROM related resources except memory. */ if (cstate->is_program) { diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 7efcb891598..92b8d5e72d5 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -1012,6 +1012,11 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, Assert(fieldno == attr_count); } + else if (cstate->routine) + { + if (!cstate->routine->CopyFromOneRow(cstate, econtext, values, nulls)) + return false; + } else { /* binary */ diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index ae8b2e36d72..ff19c457abf 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -20,6 +20,7 @@ #include "access/tableam.h" #include "commands/copy.h" +#include "commands/copyapi.h" #include "commands/progress.h" #include "executor/execdesc.h" #include "executor/executor.h" @@ -64,6 +65,9 @@ typedef enum CopyDest */ typedef struct CopyToStateData { + /* format routine */ + const CopyToRoutine *routine; + /* low-leve
Re: Separate HEAP WAL replay logic into its own file
Hi, I'm reviewing patches in commitfest 2024-07: https://commitfest.postgresql.org/48/ This is the 5th patch: https://commitfest.postgresql.org/48/5054/ FYI: https://commitfest.postgresql.org/48/4681/ is my patch. In "Re: Separate HEAP WAL replay logic into its own file" on Tue, 18 Jun 2024 01:12:42 +, "Li, Yong" wrote: > As a newcomer, when I was walking through the code looking > for WAL replay related code, it was relatively easy for me > to find them for the B-Tree access method because of the > “xlog” hint in the file names. It took me a while to > find the same for the heap access method. When I finally > found them (via text search), it was a small > surprise. Having different file organizations for > different access methods gives me this urge to make > everything consistent. I think it will make it easier for > newcomers, and it will reduce the mental load for everyone > to remember that heap replay is inside the heapam.c not > some “???xlog.c”. It makes sense. Here are my comments for your patch: 1. Could you create your patch by "git format-patch -vN master" or something? If you create your patch by "git format-patch", we can apply your patch by "git am XXX.patch". 2. I confirmed that all heapam.c -> heapam_xlog.c/heapam.h moves don't change implementations. I re-moved moved codes to heapam.c and there is no diff in heapam.c. 3. Could you remove '#include "access/heapam_xlog.h"' from heapam.c because it's needless now. BTW, it seems that we can remove more includes from heapam.c: diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index bc6d4868975..f1671072576 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -31,42 +31,24 @@ */ #include "postgres.h" -#include "access/bufmask.h" #include "access/heapam.h" -#include "access/heapam_xlog.h" #include "access/heaptoast.h" #include "access/hio.h" #include "access/multixact.h" -#include "access/parallel.h" -#include "access/relscan.h" #include "access/subtrans.h" #include "access/syncscan.h" -#include "access/sysattr.h" -#include "access/tableam.h" -#include "access/transam.h" #include "access/valid.h" #include "access/visibilitymap.h" -#include "access/xact.h" -#include "access/xlog.h" #include "access/xloginsert.h" -#include "access/xlogutils.h" -#include "catalog/catalog.h" #include "commands/vacuum.h" -#include "miscadmin.h" #include "pgstat.h" -#include "port/atomics.h" #include "port/pg_bitutils.h" -#include "storage/bufmgr.h" -#include "storage/freespace.h" #include "storage/lmgr.h" #include "storage/predicate.h" #include "storage/procarray.h" -#include "storage/standby.h" #include "utils/datum.h" #include "utils/injection_point.h" #include "utils/inval.h" -#include "utils/relcache.h" -#include "utils/snapmgr.h" #include "utils/spccache.h" --- We may want to work on removing needless includes as a separated cleanup task. 4. Could you remove needless includes from heapam_xlog.c? It seems that we can remove the following includes: diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c index b372f2b4afc..af4976f382d 100644 --- a/src/backend/access/heap/heapam_xlog.c +++ b/src/backend/access/heap/heapam_xlog.c @@ -16,16 +16,11 @@ #include "access/bufmask.h" #include "access/heapam.h" -#include "access/heapam_xlog.h" -#include "access/transam.h" #include "access/visibilitymap.h" #include "access/xlog.h" #include "access/xlogutils.h" -#include "port/atomics.h" -#include "storage/bufmgr.h" #include "storage/freespace.h" #include "storage/standby.h" -#include "utils/relcache.h" 5. There are still WAL related codes in heapam.c: 4.1. log_heap_update() 4.2. log_heap_new_cid() 4.3. if (RelationNeedsWAL()) {...} in heap_insert() 4.4. if (needwal) {...} in heap_multi_insert() 4.5. if (RelationNeedsWAL()) {...} in heap_delete() 4.6. if (RelationNeedsWAL()) {...}s in heap_update() 4.7. if (RelationNeedsWAL()) {...} in heap_lock_tuple() 4.8. if (RelationNeedsWAL()) {...} in heap_lock_updated_tuple_rec() 4.9. if (RelationNeedsWAL()) {...} in heap_finish_speculative() 4.10. if (RelationNeedsWAL()) {...} in heap_abort_speculative() 4.11. if (RelationNeedsWAL()) {...} in heap_inplace_update() 4.12. log_heap_visible() Should we move them to head_xlog.c too? If we should do it, separated commits will be easy to review. For example, the 0001 patch moves existing codes to head_xlog.c as-is. The 0002 patch extracts WAL related codes in heap_insert() to heap_xlog.c and so on. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi Yong, Thanks for joining this thread! In <453d52d4-2ac5-49f6-928d-79f8a4c08...@ebay.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 22 Jul 2024 07:11:15 +, "Li, Yong" wrote: > My understanding is that the provided v17 patch aims to achieve the > followings: > - Retain existing format implementations as built-in formats, and do not go > through the new interface for them. > - Make sure that there is no sign of performance degradation. > - Refactoring the existing code to make it easier and possible to make copy > handlers extensible. However, some of the infrastructure work that are > required to make copy handler extensible are intentionally delayed for future > patches. Some of the work were proposed as patches in earlier messages, but > they were not explicitly referenced in recent messages. Right. Sorry for bothering you. As Tomas suggested, I should have prepared the current summary. My last e-mail summarized the current information: https://www.postgresql.org/message-id/flat/20240722.164540.889091645042390373.kou%40clear-code.com#0be14c4eeb041e70438ab7a423b728da It also shows that your understanding is right. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi Tomas, Thanks for joining this thread! In <257d5573-07da-48c3-ac07-e047e7a65...@enterprisedb.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 19 Jul 2024 14:40:05 +0200, Tomas Vondra wrote: > I think it'd be helpful if you could post a patch status, i.e. a message > re-explaininig what it aims to achieve, summary of the discussion so > far, and what you think are the open questions. Otherwise every reviewer > has to read the whole thread to learn this. It makes sense. It seems your questions covers all important points in this thread. So my answers of your questions summarize the latest information. > FWIW I realize there are other related patches, and maybe some of the > discussion is happening on those threads. But that's just another reason > to post the summary here - as a reviewer I'm not going to read random > other patches that "might" have relevant info. It makes sense too. To clarify it, other threads are unrelated. We can focus on only this thread for this propose. > The way I understand it, the ultimate goal is to allow extensions to > define formats using CREATE XYZ. Right. > But the proposed patch does not do that, right? It > only does some basic things at the C level, there's no DDL etc. Right. The latest patch set includes only the basic things for the first implementation. > Per the commit message, none of the existing formats (text/csv/binary) > is implemented as "copy routine". Right. > IMHO that's a bit strange, because > that's exactly what I'd expect this patch to do - to define all the > infrastructure (catalogs, ...) and switch the existing formats to it. We did it in the v1-v15 patch sets. But the v16/v17 patch sets remove it because of a profiling result. (It's described later.) In general, we don't want to decrease the current performance of the existing formats: https://www.postgresql.org/message-id/flat/10025bac-158c-ffe7-fbec-32b42629121f%40dunslane.net#81cf82c219f2f2d77a616bbf5e511a5c > We've spent quite a lot of blood sweat and tears over the years to make > COPY fast, and we should not sacrifice any of that lightly. The v15 patch set is faster than HEAD but there is a mysterious profiling result: https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz#6439e6ad574f2d47cd7220e9bfed3889 > The increase in CopyOneRowTo from 80% to 85% worries me ... > I am getting faster > runtimes with v15 (6232ms in average) vs HEAD (6550ms). I think that it's not a blocker because the v15 patch set approach is faster. But someone may think that it's a blocker. So the v16 or later patch sets don't include codes to use this extension mechanism for the existing formats. We can work on it after we introduce the basic features if it's valuable. > How would you even know the new code is correct, when there's nothing > using using the "copy routine" branch? We can't test it only with the v16/v17 patch set changes. But we can do it by adding more changes we did in the v6 patch set. https://www.postgresql.org/message-id/flat/20240124.144936.67229716500876806.kou%40clear-code.com#f1ad092fc5e81fe38d3c376559efd52c If we should commit the basic changes with tests, I can adjust the test mechanism in v6 patch set and add it to the latest patch set. But it needs CREATE XYZ mechanism and so on too. Is it OK? > In fact, doesn't this mean that the benchmarks presented earlier are not > very useful? We still use the old code, except there are a couple "if" > branches that are never taken? I don't think this measures the new > approach would not be slower once everything gets to be copy routine. Here is a benchmark result with the v17 and HEAD: https://www.postgresql.org/message-id/flat/ZelfYatRdVZN3FbE%40paquier.xyz#eccfd1a0131af93c48026d691cc247f4 It shows that no performance difference for the existing formats. The added mechanism may be slower than the existing formats mechanism but it's not a blocker. Because it's never performance regression. (Because this is a new feature.) We can improve it later if it's needed. > Also, how do we know this API is suitable for the alternative formats? The v6 patch set has more APIs built on this API. These APIs are for implementing the alternative formats. https://www.postgresql.org/message-id/flat/20240124.144936.67229716500876806.kou%40clear-code.com#f1ad092fc5e81fe38d3c376559efd52c This is an Apache Arrow format implementation based on the v6 patch set: https://github.com/kou/pg-copy-arrow > For example you mentioned Arrow, and I suppose people will want to add > support for other column-oriented formats. I assume that will require > stashing a batch of rows (or some other internal state) somewhere, but > does the proposed API plan for that? > > My guess would be we'll need to add a "private_data" pointer to the > CopyFromStateData/CopyToStateData structs, but maybe I'm wr
Re: Fixing backslash dot for COPY FROM...CSV
Hi, I'm reviewing patches in Commitfest 2024-07 from top to bottom: https://commitfest.postgresql.org/48/ This is the 4th patch: https://commitfest.postgresql.org/48/4710/ FYI: https://commitfest.postgresql.org/48/4681/ is my patch. In <2726138.1712462...@sss.pgh.pa.us> "Re: Fixing backslash dot for COPY FROM...CSV" on Sun, 07 Apr 2024 00:07:13 -0400, Tom Lane wrote: > Here's the problem: if some client-side code thinks it's okay to > quote "." as "\.", what is likely to happen when it's presented > a data value consisting only of "."? It could very easily fall > into the trap of producing an end-of-data marker. > > If we get rid of the whole concept of end-of-data markers, then > it'd be totally reasonable to accept "\." as ".". But as long > as we still have end-of-data markers, I think it's unwise to allow > "\." to appear as anything but an end-of-data marker. It'd just > add camouflage to the booby trap. I read through this thread. It seems that this thread discuss 2 things: 1. \. in CSV mode 2. \. in non-CSV mode Recent messages discussed mainly 2. but how about create a separated thread for 2.? Because the original mail focused on 1. and it seems that we can handle them separately. Here are comments for the latest v4 patch: diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c index 961ae32694..a39818b193 100644 --- a/src/bin/psql/copy.c +++ b/src/bin/psql/copy.c @@ -620,20 +620,32 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) ... + if (copystream == pset.cur_cmd_source) + { + *fgresult='\0'; + buflen -= linelen; Spaces around "=" are missing for the fgresult line. BTW, here is a diff after pgindent: diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c index a39818b1933..4ee8481998a 100644 --- a/src/bin/psql/copy.c +++ b/src/bin/psql/copy.c @@ -621,13 +621,13 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) if (buf[buflen - 1] == '\n') { /* -* When at the beginning of the line, check for EOF marker. -* If the marker is found and the data is inlined, +* When at the beginning of the line, check for EOF +* marker. If the marker is found and the data is inlined, * we must stop at this point. If not, the \. line can be -* sent to the server, and we let it decide whether -* it's an EOF or not depending on the format: in -* basic TEXT, \. is going to be interpreted as an EOF, in -* CSV, it will not. +* sent to the server, and we let it decide whether it's +* an EOF or not depending on the format: in basic TEXT, +* \. is going to be interpreted as an EOF, in CSV, it +* will not. */ if (at_line_begin && copystream == pset.cur_cmd_source) { @@ -635,15 +635,16 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) (linelen == 4 && memcmp(fgresult, "\\.\r\n", 4) == 0)) { copydone = true; + /* -* Remove the EOF marker from the data sent. -* In the case of CSV, the EOF marker must be +* Remove the EOF marker from the data sent. In +* the case of CSV, the EOF marker must be * removed, otherwise it would be interpreted by * the server as valid data. */ if (copystream == pset.cur_cmd_source) { - *fgresult='\0'; +
Re: July Commitfest: Entries Needing Review
Hi, In "July Commitfest: Entries Needing Review" on Thu, 18 Jul 2024 14:17:38 -0400, Corey Huinker wrote: > If you know your patch isn't going to get reviewed in this commitfest, > please consider moving it to the next commitfest or withdrawing it. I hope my patch https://commitfest.postgresql.org/48/4681/ gets reviewed in this commitfest but it's not done yet. I'm reviewing other patches in this commitfest because I heard that my patch will be got reviewed if I review other patches in this commitfest. But it seems that it's not related. Should I move my patch to the next commitfest or withdraw my patch? Thanks, -- kou
Re: Wrong results with grouping sets
Hi, I'm reviewing patches in Commitfest 2024-07 from top to bottom: https://commitfest.postgresql.org/48/ This is the 3rd patch: https://commitfest.postgresql.org/48/4583/ FYI: https://commitfest.postgresql.org/48/4681/ is my patch. In "Re: Wrong results with grouping sets" on Wed, 10 Jul 2024 09:22:54 +0800, Richard Guo wrote: > Here is an updated version of this patchset. I've added some comments > according to the review feedback, and also tweaked the commit messages > a bit more. I'm not familiar with related codes but here are my comments: 0001: --- diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 85a62b538e5..8055f4b2b9e 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -1242,6 +1245,12 @@ typedef struct RangeTblEntry /* estimated or actual from caller */ Cardinality enrtuples pg_node_attr(query_jumble_ignore); + /* +* Fields valid for a GROUP RTE (else NULL/zero): +*/ + /* list of expressions grouped on */ + List *groupexprs pg_node_attr(query_jumble_ignore); + /* * Fields valid in all RTEs: */ +* Fields valid for a GROUP RTE (else NULL/zero): There is only one field and it's LIST. So how about using the following? * A field valid for a GROUP RTE (else NIL): diff --git a/src/backend/optimizer/util/var.c b/src/backend/optimizer/util/var.c index 844fc30978b..0982f873a42 100644 --- a/src/backend/optimizer/util/var.c +++ b/src/backend/optimizer/util/var.c @@ -902,6 +915,141 @@ flatten_join_alias_vars_mutator(Node *node, ... +Node * +flatten_group_exprs(PlannerInfo *root, Query *query, Node *node) +{ + flatten_join_alias_vars_context context; ... --- If we want to reuse flatten_join_alias_vars_context for flatten_group_exprs(), how about renaming it? flatten_join_alias_vars() only uses flatten_join_alias_vars_context for now. So the name of flatten_join_alias_vars_context is meaningful. But if we want to flatten_join_alias_vars_context for flatten_group_exprs() too. The name of flatten_join_alias_vars_context is strange. diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c index 2f64eaf0e37..69476384252 100644 --- a/src/backend/parser/parse_relation.c +++ b/src/backend/parser/parse_relation.c @@ -2557,6 +2557,79 @@ addRangeTableEntryForENR(ParseState *pstate, ... + char *colname = te->resname ? pstrdup(te->resname) : "unamed_col"; ... Can the "te->resname == NULL" case be happen? If so, how about adding a new test for the case? (BTW, is "unamed_col" intentional name? Is it a typo of "unnamed_col"?) Thanks, -- kou
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Hi, I'm reviewing patches in Commitfest 2024-07 from top to bottom: https://commitfest.postgresql.org/48/ This is the 2st patch: https://commitfest.postgresql.org/48/4573/ FYI: https://commitfest.postgresql.org/48/4681/ is my patch. In "Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows" on Tue, 4 Jun 2024 08:30:19 +0900, Michael Paquier wrote: > During a live review of this patch last week, as part of the Advanced > Patch Workshop of pgconf.dev, it has been mentioned by Tom that we may > be able to simplify the check on pmstart if the detection of the > postmaster PID started by pg_ctl is more stable using the WIN32 > internals that this patch relies on. I am not sure that this > suggestion is right, though, because we should still care about the > clock skew case as written in the surrounding comments? Even if > that's OK, I would assume that this should be an independent patch, > written on top of the proposed v6-0001. I reviewed the latest patch set and I felt different impression. start_postmaster() on Windows uses cmd.exe for redirection based on the comment in the function: > /* > * As with the Unix case, it's easiest to use the shell (CMD.EXE) to > * handle redirection etc. Unfortunately CMD.EXE lacks any equivalent of > * "exec", so we don't get to find out the postmaster's PID immediately. > */ It seems that we can use redirection by CreateProcess() family functions without cmd.exe based on the following documentation: https://learn.microsoft.com/en-us/windows/win32/procthread/creating-a-child-process-with-redirected-input-and-output How about changing start_postmaster() for Windows to start postgres.exe directly so that it returns the postgres.exe's PID not cmd.exe's PID? If we can do it, we don't need pgwin32_find_postmaster_pid() in the patch set. Thanks, -- kou
Re: pg_rewind WAL segments deletion pitfall
Hi, I'm reviewing patches in Commitfest 2024-07 from top to bottom: https://commitfest.postgresql.org/48/ This is the 1st patch: https://commitfest.postgresql.org/48/3874/ The latest patch can't be applied on master: https://www.postgresql.org/message-id/CAFh8B=nnjtm9ke4_1mhpwgz2pv9yoyf6hmnyh5xact0aa4v...@mail.gmail.com I've rebased on master. See the attached patch. Here are changes for it: * Resolve conflict * Update copyright year to 2024 from 2023 * Add an added test to meson.build * Run pgindent Here are my review comments: @@ -217,6 +221,26 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex, + charxlogfname[MAXFNAMELEN]; + + tli = xlogreader->seg.ws_tli; + segno = xlogreader->seg.ws_segno; + + snprintf(xlogfname, MAXPGPATH, XLOGDIR "/"); + XLogFileName(xlogfname + strlen(xlogfname), +xlogreader->seg.ws_tli, +xlogreader->seg.ws_segno, WalSegSz); + + /* +* Make sure pg_rewind doesn't remove this file, because it is +* required for postgres to start after rewind. +*/ + insert_keepwalhash_entry(xlogfname); MAXFNAMELEN is 64 and MAXPGPATH is 1024. strlen(XLOGDIR "/") is 7 because XLOGDIR is "pg_wal". So xlogfname has enough size but snprintf(xlogfname, MAXPGPATH) is wrong usage. (And XLogFileName() uses snprintf(xlogfname, MAXFNAMELEN) internally.) How about using one more buffer? charxlogpath[MAXPGPATH]; charxlogfname[MAXFNAMELEN]; tli = xlogreader->seg.ws_tli; segno = xlogreader->seg.ws_segno; XLogFileName(xlogfname, xlogreader->seg.ws_tli, xlogreader->seg.ws_segno, WalSegSz); snprintf(xlogpath, MAXPGPATH, "%s/%s", XLOGDIR, xlogfname); /* * Make sure pg_rewind doesn't remove this file, because it is * required for postgres to start after rewind. */ insert_keepwalhash_entry(xlogpath); Thanks, -- kou In "Re: pg_rewind WAL segments deletion pitfall" on Tue, 23 Jan 2024 09:23:29 +0100, Alexander Kukushkin wrote: > Hi Peter, > > On Mon, 22 Jan 2024 at 00:38, Peter Smith wrote: > >> 2024-01 Commitfest. >> >> Hi, This patch has a CF status of "Ready for Committer", but it is >> currently failing some CFbot tests [1]. Please have a look and post an >> updated version.. >> >> == >> [1] >> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/3874 >> >> > From what I can see all failures are not related to this patch: > 1. Windows build failed with > [10:52:49.679] 126/281 postgresql:recovery / recovery/019_replslot_limit > ERROR 185.84s (exit status 255 or signal 127 SIGinvalid) > 2. FreeBSD build failed with > [09:11:57.656] 190/285 postgresql:psql / psql/010_tab_completion ERROR > 0.46s exit status 2 > [09:11:57.656] 220/285 postgresql:authentication / > authentication/001_password ERROR 0.57s exit status 2 > > In fact, I don't even see this patch being applied for these builds and the > introduced TAP test being executed. > > Regards, > -- > Alexander Kukushkin >From 38edc07c49970ad0ac1818284f2e59e26591b3a4 Mon Sep 17 00:00:00 2001 From: Alexander Kukushkin Date: Sun, 6 Aug 2023 15:56:39 +0100 Subject: [PATCH v9] Be more picky with WAL segment deletion in pg_rewind Make pg_rewind to be a bit wiser in terms of creating filemap: preserve on the target all WAL segments that contain records between the last common checkpoint and the point of divergence. Co-authored-by: Polina Bungina --- src/bin/pg_rewind/filemap.c | 62 +- src/bin/pg_rewind/filemap.h | 3 + src/bin/pg_rewind/meson.build | 1 + src/bin/pg_rewind/parsexlog.c | 24 +++ src/bin/pg_rewind/pg_rewind.c | 3 + src/bin/pg_rewind/t/010_keep_recycled_wals.pl | 65 +++ 6 files changed, 157 insertions(+), 1 deletion(-) create mode 100644 src/bin/pg_rewind/t/010_keep_recycled_wals.pl diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c index 4458324c9d8..b357c28338a 100644 --- a/src/bin/pg_rewind/filemap.c +++ b/src/bin/pg_rewind/filemap.c @@ -63,6 +63,28 @@ static file_entry_t *lookup_filehash_entry(const char *path); static int final_filemap_cmp(const void *a, const void *b); static bool check_file_excluded(const char *path, bool is_source); +typedef struct skipwal_t +{ + const char *path; + uint32 status; +} skipwal_t; + +#define SH_PREFIX keepwalhash +#define SH_ELEMENT_TYPE skipwal_t +#define SH_KEY_TYPE const char * +#define SH_KEY path +#define SH_HASH_KEY(tb, key) hash_string(key) +#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0) +#define SH_SCOPE static inline +#define SH_RAW_ALLOCATOR pg_malloc0 +#defin
Re: Columnar format export in Postgres
Hi, In "Re: Columnar format export in Postgres" on Thu, 13 Jun 2024 22:30:24 +0530, Sushrut Shivaswamy wrote: > - To facilitate efficient querying it would help to export multiple > parquet files for the table instead of a single file. >Having multiple files allows queries to skip chunks if the key range in > the chunk does not match query filter criteria. >Even within a chunk it would help to be able to configure the size of a > row group. > - I'm not sure how these parameters will be exposed within `COPY TO`. > Or maybe the extension implementing the `COPY TO` handler will > allow this configuration? Yes. But adding support for custom COPY TO options is out-of-scope in the first version. We will focus on only the minimal features in the first version. We can improve it later based on use-cases. See also: https://www.postgresql.org/message-id/20240131.141122.279551156957581322.kou%40clear-code.com > - Regarding using file_fdw to read Apache Arrow and Apache Parquet file > because file_fdw is based on COPY FROM: > - I'm not too clear on this. file_fdw seems to allow creating a table > from data on disk exported using COPY TO. Correct. >But is the newly created table still using the data on disk(maybe in > columnar format or csv) or is it just reading that data to create a row > based table. The former. >I'm not aware of any capability in the postgres planner to read > columnar files currently without using an extension like parquet_fdw. Correct. We still need another approach such as parquet_fdw with the COPY format extensible feature to optimize query against Apache Parquet data. file_fdw can just read Apache Parquet data by SELECT. Sorry for confusing you. Thanks, -- kou
Re: Columnar format export in Postgres
Hi, In "Columnar format export in Postgres" on Wed, 12 Jun 2024 22:26:30 +0530, Sushrut Shivaswamy wrote: > I have been working on adding support for columnar format export to > Postgres to speed up analytics queries. FYI: I'm proposing making COPY format extendable: * https://www.postgresql.org/message-id/flat/20231204.153548.2126325458835528809@clear-code.com * https://commitfest.postgresql.org/48/4681/ If it's accepted, we can implement extensions for COPY FORMAT arrow and COPY FORMAT parquet. With these extensions, we can use file_fdw to read Apache Arrow and Apache Parquet file because file_fdw is based on COPY FROM: https://www.postgresql.org/docs/current/file-fdw.html If you're interested in this proposal, you can review the latest proposed patch set to proceed this proposal. > - Reduce memory consumption when exporting table data to columnar format The above COPY support will help this. Thanks, -- kou
Re: RFC: adding pytest as a supported test framework
Hi, (I don't have an opinion which language should be selected here.) In "Re: RFC: adding pytest as a supported test framework" on Wed, 12 Jun 2024 12:31:23 -0700, Jacob Champion wrote: > - I like Ruby as a language but have no experience using it for > testing. (RSpec did come up during the unconference session and > subsequent hallway conversations.) If we want to select Ruby, I can help. (I'm a Ruby committer and a maintainer of a testing framework bundled in Ruby.) I'm using Ruby for PGroonga's tests that can't be covered by pg_regress. For example, streaming replication related tests. PGroonga has a small utility for it: https://github.com/pgroonga/pgroonga/blob/main/test/helpers/sandbox.rb Here is a streaming replication test with it: https://github.com/pgroonga/pgroonga/blob/main/test/test-streaming-replication.rb I'm using test-unit as testing framework that is bundled in Ruby: https://github.com/test-unit/test-unit/ I don't recommend that we use RSpec as testing framework even if we select Ruby. RSpec may change API. (RSpec did it several times in the past.) If testing framework changes API, we need to rewrite our tests to adapt the change. I'll never change test-unit API because I don't want to rewrite existing tests. Thanks, -- kou
Re: meson: Specify -Wformat as a common warning flag for extensions
Hi, In <4707d4ed-f268-43c0-b4dd-cdbc7520f...@eisentraut.org> "Re: meson: Specify -Wformat as a common warning flag for extensions" on Tue, 28 May 2024 23:31:05 -0700, Peter Eisentraut wrote: > On 07.04.24 18:01, Sutou Kouhei wrote: >> +# We don't have "warning_level == 3" and "warning_level == >> +# 'everything'" here because we don't use these warning levels. >> +if warning_level == '1' >> + common_builtin_flags += ['-Wall'] >> +elif warning_level == '2' >> + common_builtin_flags += ['-Wall', '-Wextra'] >> +endif > > I would trim this even further and always export just '-Wall'. The > other options aren't really something we support. OK. How about the v6 patch? It always uses '-Wall'. Thanks, -- kou >From 8238adba3f3fc96d4a9e50af611b1cb3566abc0e Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 15 Mar 2024 18:27:30 +0900 Subject: [PATCH v6] meson: Restore implicit warning/debug/optimize flags for extensions Meson specifies warning/debug/optimize flags such as "-Wall", "-g" and "-O2" automatically based on "--warnlevel" and "--buildtype" options. And we use "--warning_level=1" and "--buildtype=debugoptimized" by default. We don't specify warning/debug/optimize flags explicitly to build PostgreSQL with Meson. Because Meson does it automatically as we said. But Meson doesn't care about flags in Makefile.global and pg_config. So we need to care about them manually. This changes do it. They detect debug/optimize flags based on debug/optimization option values because Meson doesn't tell us flags Meson guessed. We always use -Wall for warning flags. --- meson.build | 27 +++ src/include/meson.build | 4 ++-- 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/meson.build b/meson.build index d6401fb8e30..d7239dbb114 100644 --- a/meson.build +++ b/meson.build @@ -1851,6 +1851,33 @@ endif vectorize_cflags = cc.get_supported_arguments(['-ftree-vectorize']) unroll_loops_cflags = cc.get_supported_arguments(['-funroll-loops']) +# They aren't used for building PostgreSQL itself because Meson does +# everything internally. They are used by extensions via pg_config or +# Makefile.global. +common_builtin_flags = ['-Wall'] + +if get_option('debug') + common_builtin_flags += ['-g'] +endif + +optimization = get_option('optimization') +if optimization == '0' + common_builtin_flags += ['-O0'] +elif optimization == '1' + common_builtin_flags += ['-O1'] +elif optimization == '2' + common_builtin_flags += ['-O2'] +elif optimization == '3' + common_builtin_flags += ['-O3'] +elif optimization == 's' + common_builtin_flags += ['-Os'] +endif + +cflags_builtin = cc.get_supported_arguments(common_builtin_flags) +if llvm.found() + cxxflags_builtin = cpp.get_supported_arguments(common_builtin_flags) +endif + common_warning_flags = [ '-Wmissing-prototypes', '-Wpointer-arith', diff --git a/src/include/meson.build b/src/include/meson.build index a28f115d867..58b7a9c1e7e 100644 --- a/src/include/meson.build +++ b/src/include/meson.build @@ -44,9 +44,9 @@ config_paths_data.set_quoted('MANDIR', dir_prefix / dir_man) var_cc = ' '.join(cc.cmd_array()) var_cpp = ' '.join(cc.cmd_array() + ['-E']) -var_cflags = ' '.join(cflags + cflags_warn + get_option('c_args')) +var_cflags = ' '.join(cflags + cflags_builtin + cflags_warn + get_option('c_args')) if llvm.found() - var_cxxflags = ' '.join(cxxflags + cxxflags_warn + get_option('cpp_args')) + var_cxxflags = ' '.join(cxxflags + cxxflags_builtin + cxxflags_warn + get_option('cpp_args')) else var_cxxflags = '' endif -- 2.43.0
Re: Is it acceptable making COPY format extendable?
Hi, In <7a978dd2-371b-4d10-b58e-cb32ef728...@eisentraut.org> "Re: Is it acceptable making COPY format extendable?" on Thu, 25 Apr 2024 09:59:18 +0200, Peter Eisentraut wrote: > PostgreSQL development is currently in feature freeze for PostgreSQL > 17 (since April 8). So most participants will not be looking very > closely at development projects for releases beyond that. The next > commitfest for patches targeting PostgreSQL 18 is in July, and I see > your patch registered there. It's possible that your thread is not > going to get much attention until then. > > So, in summary, you are doing everything right, you just have to be a > bit more patient right now. :) I see. I'll wait for the next commitfest. Thanks, -- kou
Re: 2024-05-09 release announcement draft,2024-05-09 release announcement draft
Hi, In <779790c7-45d7-4010-9305-c3f9e6a60...@postgresql.org>,<779790c7-45d7-4010-9305-c3f9e6a60...@postgresql.org> "2024-05-09 release announcement draft,2024-05-09 release announcement draft" on Mon, 6 May 2024 13:44:05 -0400, "Jonathan S. Katz" wrote: > * Added optimization for certain operations where an installation has > thousands > of roles. Added -> Add > * [Follow @postgresql on Twitter](https://twitter.com/postgresql) Twitter -> X Thanks, -- kou
Re: Is it acceptable making COPY format extendable?
Hi, Thanks for replying this. In <02cccd36-3083-4a50-aae4-f957e96fb...@eisentraut.org> "Re: Is it acceptable making COPY format extendable?" on Wed, 24 Apr 2024 09:57:38 +0200, Peter Eisentraut wrote: >> I'm proposing a patch that making COPY format extendable: >> https://www.postgresql.org/message-id/20231204.153548.2126325458835528809.kou%40clear-code.com >> https://commitfest.postgresql.org/48/4681/ >> >> But my proposal stalled. It it acceptable the feature >> request that making COPY format extendable? If it's not >> acceptable, what are blockers? Performance? > > I think that thread is getting an adequate amount of attention. Of > course, you can always wish for more, but "stalled" looks very > different. Sorry for "stalled" misuse. I haven't got any reply since 2024-03-15: https://www.postgresql.org/message-id/flat/20240315.173754.2049843193122381085.kou%40clear-code.com#07aefc636d8165204ddfba971dc9a490 (I sent some pings.) So I called this status as "stalled". I'm not familiar with the PostgreSQL's development style. What should I do for this case? Should I just wait for a reply from others without doing anything? > Let's not start a new thread here to discuss the merits of the other > thread; that discussion belongs there. I wanted to discuss possibility for this feature request in this thread. I wanted to use my existing thread is for how to implement this feature request. Should I reuse https://www.postgresql.org/message-id/flat/CAJ7c6TM6Bz1c3F04Cy6%2BSzuWfKmr0kU8c_3Stnvh_8BR0D6k8Q%40mail.gmail.com for it instead of this thread? Thanks, -- kou
Is it acceptable making COPY format extendable?
Hi, I'm proposing a patch that making COPY format extendable: https://www.postgresql.org/message-id/20231204.153548.2126325458835528809.kou%40clear-code.com https://commitfest.postgresql.org/48/4681/ It's based on the discussion at: https://www.postgresql.org/message-id/flat/3741749.1655952...@sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d > > > IIUC, you want extensibility in FORMAT argument to COPY command > > > https://www.postgresql.org/docs/current/sql-copy.html. Where the > > > format is pluggable. That seems useful. > > > Agreed. > > Ditto. But my proposal stalled. It it acceptable the feature request that making COPY format extendable? If it's not acceptable, what are blockers? Performance? Thanks, -- kou
Re: pg_stat_statements and "IN" conditions
Hi, In <20240404143514.a26f7ttxrbdfc73a@erthalion.local> "Re: pg_stat_statements and "IN" conditions" on Thu, 4 Apr 2024 16:35:14 +0200, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Here is the rebased version. Thanks. I'm not familiar with this code base but I've reviewed these patches because I'm interested in this feature too. 0001: > diff --git a/src/backend/nodes/queryjumblefuncs.c > b/src/backend/nodes/queryjumblefuncs.c > index be823a7f8fa..e9473def361 100644 > --- a/src/backend/nodes/queryjumblefuncs.c > +++ b/src/backend/nodes/queryjumblefuncs.c > > @@ -212,15 +233,67 @@ RecordConstLocation(JumbleState *jstate, int location) > ... > +static bool > +IsMergeableConstList(List *elements, Const **firstConst, Const **lastConst) > +{ > + ListCell *temp; > + Node *firstExpr = NULL; > + > + if (elements == NULL) "elements == NIL" will be better for List. > +static void > +_jumbleElements(JumbleState *jstate, List *elements) > +{ > + Const *first, *last; > + if(IsMergeableConstList(elements, &first, &last)) A space is missing between "if" and "(". > diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h > index aa727e722cc..cf4f900d4ed 100644 > --- a/src/include/nodes/primnodes.h > +++ b/src/include/nodes/primnodes.h > @@ -1333,7 +1333,7 @@ typedef struct ArrayExpr > /* common type of array elements */ > Oid element_typeid > pg_node_attr(query_jumble_ignore); > /* the array elements or sub-arrays */ > - List *elements; > + List *elements pg_node_attr(query_jumble_merge); Should we also update the pg_node_attr() comment for query_jumble_merge in nodes.h? 0003: > diff --git a/contrib/pg_stat_statements/pg_stat_statements.c > b/contrib/pg_stat_statements/pg_stat_statements.c > index d7841b51cc9..00eec30feb1 100644 > --- a/contrib/pg_stat_statements/pg_stat_statements.c > +++ b/contrib/pg_stat_statements/pg_stat_statements.c > ... > @@ -2883,12 +2886,22 @@ generate_normalized_query(JumbleState *jstate, const > char *query, > /* The firsts merged constant */ > else if (!skip) > { > + static const uint32 powers_of_ten[] = { > + 1, 10, 100, > + 1000, 1, 10, > + 100, 1000, 1, > + 10 > + }; > + int lower_merged = powers_of_ten[magnitude - 1]; > + int upper_merged = powers_of_ten[magnitude]; How about adding a reverse function of decimalLength32() to numutils.h and use it here? > - n_quer_loc += sprintf(norm_query + n_quer_loc, "..."); > + n_quer_loc += sprintf(norm_query + n_quer_loc, "... > [%d-%d entries]", > + lower_merged, > upper_merged - 1); Do we still have enough space in norm_query for this change? It seems that norm_query expects up to 10 additional bytes per jstate->clocations[i]. It seems that we can merge 0001, 0003 and 0004 to one patch. (Sorry. I haven't read all discussions yet. If we already discuss this, sorry for this noise.) Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi Andres, Could you take a look at this? I think that you don't want to touch the current text/csv/binary implementations. The v17 patch approach doesn't touch the current text/csv/binary implementations. What do you think about this approach? Thanks, -- kou In <20240320.232732.488684985873786799@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 20 Mar 2024 23:27:32 +0900 (JST), Sutou Kouhei wrote: > Hi, > > Could someone review the v17 patch to proceed this? > > The v17 patch: > https://www.postgresql.org/message-id/flat/20240305.171808.667980402249336456.kou%40clear-code.com#d2ee079b75ebcf00c410300ecc4a357a > > Some profiles by Michael: > https://www.postgresql.org/message-id/flat/ZelfYatRdVZN3FbE%40paquier.xyz#eccfd1a0131af93c48026d691cc247f4 > > Thanks, > -- > kou > > In <20240308.092254.359611633589181574@clear-code.com> > "Re: Make COPY format extendable: Extract COPY TO format implementations" > on Fri, 08 Mar 2024 09:22:54 +0900 (JST), > Sutou Kouhei wrote: > >> Hi, >> >> In >> "Re: Make COPY format extendable: Extract COPY TO format implementations" >> on Thu, 7 Mar 2024 15:32:01 +0900, >> Michael Paquier wrote: >> >>> While on it, here are some profiles based on HEAD and v17 with the >>> previous tests (COPY TO /dev/null, COPY FROM data sent to the void). >>> >> ... >>> >>> So, in short, and that's not really a surprise, there is no effect >>> once we use the dispatching with the routines only when a format would >>> want to plug-in with the APIs, but a custom format would still have a >>> penalty of a few percents for both if bottlenecked on CPU. >> >> Thanks for sharing these profiles! >> I agree with you. >> >> This shows that the v17 approach doesn't affect the current >> text/csv/binary implementations. (The v17 approach just adds >> 2 new structs, Copy{From,To}Rountine, without changing the >> current text/csv/binary implementations.) >> >> Can we push the v17 patch and proceed following >> implementations? Could someone (especially a PostgreSQL >> committer) take a look at this for double-check? >> >> >> Thanks, >> -- >> kou >> >> > >
Re: meson: Specify -Wformat as a common warning flag for extensions
Hi Andres, Thanks for reviewing this! In <20240407232635.fq4kc5556laha...@awork3.anarazel.de> "Re: meson: Specify -Wformat as a common warning flag for extensions" on Sun, 7 Apr 2024 16:26:35 -0700, Andres Freund wrote: > This seems like a fair amount of extra configure tests. Particularly because > /W* isn't ever interesting for Makefile.global - they're msvc flags - because > you can't use that with msvc. > > I'm also doubtful that it's worth supporting warning_level=3/everything, you > end up with a completely flood of warnings that way. OK. I've removed "/W*" flags and warning_level==3/everything cases. How about the attached v5 patch? Thanks, -- kou >From 205ef88c66cf1050eedfc1e72d951de93a02e53a Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 15 Mar 2024 18:27:30 +0900 Subject: [PATCH v5] meson: Restore implicit warning/debug/optimize flags for extensions Meson specifies warning/debug/optimize flags such as "-Wall", "-g" and "-O2" automatically based on "--warnlevel" and "--buildtype" options. And we use "--warning_level=1" and "--buildtype=debugoptimized" by default. We don't specify warning/debug/optimize flags explicitly to build PostgreSQL with Meson. Because Meson does it automatically as we said. But Meson doesn't care about flags in Makefile.global and pg_config. So we need to care about them manually. This changes do it. They detect warning/debug/optimize flags based on warning_level/debug/optimization option values because Meson doesn't tell us flags Meson guessed. --- meson.build | 41 + src/include/meson.build | 4 ++-- 2 files changed, 43 insertions(+), 2 deletions(-) diff --git a/meson.build b/meson.build index 5acf083ce3c..11bd56f79a7 100644 --- a/meson.build +++ b/meson.build @@ -1848,6 +1848,47 @@ endif vectorize_cflags = cc.get_supported_arguments(['-ftree-vectorize']) unroll_loops_cflags = cc.get_supported_arguments(['-funroll-loops']) +# They aren't used for building PostgreSQL itself because Meson does +# everything internally. They are used by extensions via pg_config or +# Makefile.global. +common_builtin_flags = [] + +warning_level = get_option('warning_level') +# See https://mesonbuild.com/Builtin-options.html#details-for-warning_level for +# warning_level values. +# +# We don't use "/W*" flags here because we don't need to care about MSVC here. +# +# We don't have "warning_level == 3" and "warning_level == +# 'everything'" here because we don't use these warning levels. +if warning_level == '1' + common_builtin_flags += ['-Wall'] +elif warning_level == '2' + common_builtin_flags += ['-Wall', '-Wextra'] +endif + +if get_option('debug') + common_builtin_flags += ['-g'] +endif + +optimization = get_option('optimization') +if optimization == '0' + common_builtin_flags += ['-O0'] +elif optimization == '1' + common_builtin_flags += ['-O1'] +elif optimization == '2' + common_builtin_flags += ['-O2'] +elif optimization == '3' + common_builtin_flags += ['-O3'] +elif optimization == 's' + common_builtin_flags += ['-Os'] +endif + +cflags_builtin = cc.get_supported_arguments(common_builtin_flags) +if llvm.found() + cxxflags_builtin = cpp.get_supported_arguments(common_builtin_flags) +endif + common_warning_flags = [ '-Wmissing-prototypes', '-Wpointer-arith', diff --git a/src/include/meson.build b/src/include/meson.build index a28f115d867..58b7a9c1e7e 100644 --- a/src/include/meson.build +++ b/src/include/meson.build @@ -44,9 +44,9 @@ config_paths_data.set_quoted('MANDIR', dir_prefix / dir_man) var_cc = ' '.join(cc.cmd_array()) var_cpp = ' '.join(cc.cmd_array() + ['-E']) -var_cflags = ' '.join(cflags + cflags_warn + get_option('c_args')) +var_cflags = ' '.join(cflags + cflags_builtin + cflags_warn + get_option('c_args')) if llvm.found() - var_cxxflags = ' '.join(cxxflags + cxxflags_warn + get_option('cpp_args')) + var_cxxflags = ' '.join(cxxflags + cxxflags_builtin + cxxflags_warn + get_option('cpp_args')) else var_cxxflags = '' endif -- 2.43.0
Re: Commitfest Manager for March
Hi, In "Re: Commitfest Manager for March" on Mon, 1 Apr 2024 11:57:27 +0300, Heikki Linnakangas wrote: > On 01/04/2024 09:05, Andrey M. Borodin wrote: >> I think it makes sense to close this commitfest only after Feature >> Freeze on April 8, 2024 at 0:00 AoE. >> What do you think? > > +1. IIRC that's how it's been done in last commitfest in previous > years too. Thanks for extending the period. Could someone review my patches for this commitfest? 1. https://commitfest.postgresql.org/47/4681/ Make COPY format extendable: Extract COPY TO format implementations Status: Needs review 2. https://commitfest.postgresql.org/47/4791/ meson: Specify -Wformat as a common warning flag for extensions Status: Ready for Committer Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, Could someone review the v17 patch to proceed this? The v17 patch: https://www.postgresql.org/message-id/flat/20240305.171808.667980402249336456.kou%40clear-code.com#d2ee079b75ebcf00c410300ecc4a357a Some profiles by Michael: https://www.postgresql.org/message-id/flat/ZelfYatRdVZN3FbE%40paquier.xyz#eccfd1a0131af93c48026d691cc247f4 Thanks, -- kou In <20240308.092254.359611633589181574@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 08 Mar 2024 09:22:54 +0900 (JST), Sutou Kouhei wrote: > Hi, > > In > "Re: Make COPY format extendable: Extract COPY TO format implementations" > on Thu, 7 Mar 2024 15:32:01 +0900, > Michael Paquier wrote: > >> While on it, here are some profiles based on HEAD and v17 with the >> previous tests (COPY TO /dev/null, COPY FROM data sent to the void). >> > ... >> >> So, in short, and that's not really a surprise, there is no effect >> once we use the dispatching with the routines only when a format would >> want to plug-in with the APIs, but a custom format would still have a >> penalty of a few percents for both if bottlenecked on CPU. > > Thanks for sharing these profiles! > I agree with you. > > This shows that the v17 approach doesn't affect the current > text/csv/binary implementations. (The v17 approach just adds > 2 new structs, Copy{From,To}Rountine, without changing the > current text/csv/binary implementations.) > > Can we push the v17 patch and proceed following > implementations? Could someone (especially a PostgreSQL > committer) take a look at this for double-check? > > > Thanks, > -- > kou > >
Re: meson: Specify -Wformat as a common warning flag for extensions
Hi, In <49e97fd0-c17e-4cbc-aeee-80ac51400...@eisentraut.org> "Re: meson: Specify -Wformat as a common warning flag for extensions" on Wed, 13 Mar 2024 08:38:28 +0100, Peter Eisentraut wrote: > I think the fix then is to put -Wall into CFLAGS in > Makefile.global. Looking at a diff of Makefile.global between an > autoconf and a meson build, I also see that under meson, CFLAGS > doesn't get -O2 -g (or similar, depending on settings). This > presumably has the same underlying issue that meson handles those > flags internally. > > For someone who wants to write a fix for this, the relevant variable > is var_cflags in our meson scripts. And var_cxxflags as well. How about the attached v4 patch? Thanks, -- kou >From a515720338dc49e764f598021200316c6d01ddf9 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 15 Mar 2024 18:27:30 +0900 Subject: [PATCH v4] meson: Restore implicit warning/debug/optimize flags for extensions Meson specifies warning/debug/optimize flags such as "-Wall", "-g" and "-O2" automatically based on "--warnlevel" and "--buildtype" options. And we use "--warning_level=1" and "--buildtype=debugoptimized" by default. We don't specify warning/debug/optimize flags explicitly to build PostgreSQL with Meson. Because Meson does it automatically as we said. But Meson doesn't care about flags in Makefile.global and pg_config. So we need to care about them manually. This changes do it. They detect warning/debug/optimize flags based on warning_level/debug/optimization option values because Meson doesn't tell us flags Meson guessed. --- meson.build | 40 src/include/meson.build | 4 ++-- 2 files changed, 42 insertions(+), 2 deletions(-) diff --git a/meson.build b/meson.build index c8fdfeb0ec..3c5c449a0a 100644 --- a/meson.build +++ b/meson.build @@ -1824,6 +1824,46 @@ endif vectorize_cflags = cc.get_supported_arguments(['-ftree-vectorize']) unroll_loops_cflags = cc.get_supported_arguments(['-funroll-loops']) +# They aren't used for building PostgreSQL itself because Meson does +# everything internally. They are used by extensions via pg_config or +# Makefile.global. +common_builtin_flags = [] + +warning_level = get_option('warning_level') +# See https://mesonbuild.com/Builtin-options.html#details-for-warning_level for +# warning_level values. +if warning_level == '1' + common_builtin_flags += ['-Wall', '/W2'] +elif warning_level == '2' + common_builtin_flags += ['-Wall', '-Wextra', '/W3'] +elif warning_level == '3' + common_builtin_flags += ['-Wall', '-Wextra', '-Wpedantic', '/W4'] +elif warning_level == 'everything' + common_builtin_flags += ['-Weverything', '/Wall'] +endif + +if get_option('debug') + common_builtin_flags += ['-g'] +endif + +optimization = get_option('optimization') +if optimization == '0' + common_builtin_flags += ['-O0'] +elif optimization == '1' + common_builtin_flags += ['-O1'] +elif optimization == '2' + common_builtin_flags += ['-O2'] +elif optimization == '3' + common_builtin_flags += ['-O3'] +elif optimization == 's' + common_builtin_flags += ['-Os'] +endif + +cflags_builtin = cc.get_supported_arguments(common_builtin_flags) +if llvm.found() + cxxflags_builtin = cpp.get_supported_arguments(common_builtin_flags) +endif + common_warning_flags = [ '-Wmissing-prototypes', '-Wpointer-arith', diff --git a/src/include/meson.build b/src/include/meson.build index a28f115d86..58b7a9c1e7 100644 --- a/src/include/meson.build +++ b/src/include/meson.build @@ -44,9 +44,9 @@ config_paths_data.set_quoted('MANDIR', dir_prefix / dir_man) var_cc = ' '.join(cc.cmd_array()) var_cpp = ' '.join(cc.cmd_array() + ['-E']) -var_cflags = ' '.join(cflags + cflags_warn + get_option('c_args')) +var_cflags = ' '.join(cflags + cflags_builtin + cflags_warn + get_option('c_args')) if llvm.found() - var_cxxflags = ' '.join(cxxflags + cxxflags_warn + get_option('cpp_args')) + var_cxxflags = ' '.join(cxxflags + cxxflags_builtin + cxxflags_warn + get_option('cpp_args')) else var_cxxflags = '' endif -- 2.43.0
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 13 Mar 2024 16:00:46 +0800, jian he wrote: >> More use cases will help us which callbacks are needed. We >> will be able to collect more use cases by providing basic >> callbacks. > Let's say the customized format is "csv1", but it is just analogous to > the csv format. > people should be able to create an extension, with serval C functions, > then they can do `copy (select 1 ) to stdout (format 'csv1');` > but the output will be exact same as `copy (select 1 ) to stdout > (format 'csv');` Thanks for sharing one use case but I think that we need real-world use cases to consider our APIs. For example, JSON support that is currently discussing in another thread is a real-world use case. My Apache Arrow support is also another real-world use case. Thanks, -- kou
Re: meson: Specify -Wformat as a common warning flag for extensions
Hi, In "Re: meson: Specify -Wformat as a common warning flag for extensions" on Wed, 13 Mar 2024 00:43:11 -0500, "Tristan Partin" wrote: > Perhaps adding some more clarification in the comments that I wrote. > > - # -Wformat-security requires -Wformat, so check for it > + # -Wformat-secuirty requires -Wformat. We compile with -Wall in + # > Postgres, which includes -Wformat=1. -Wformat is shorthand for + # > -Wformat=1. The set of flags which includes -Wformat-security is + # > persisted into pg_config --cflags, which is commonly used by + # > PGXS-based extensions. The lack of -Wformat in the persisted flags > + # will produce a warning on many GCC versions, so even though adding > + # -Wformat here is a no-op for Postgres, it silences other use > cases. > > That might be too long-winded though :). Thanks for the wording! I used it for the v3 patch. Thanks, -- kou >From 0ba2e6dd55b00ee8a57313a629a1e5fa8c9e40cc Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Wed, 13 Mar 2024 16:10:34 +0900 Subject: [PATCH v3] Add -Wformat to common warning flags We specify -Wformat-security as a common warning flag explicitly. GCC requires -Wformat to be added for -Wformat-security to take effect. If -Wformat-security is used without -Wformat, GCC shows the following warning: cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] Note that this is not needed for PostgreSQL itself because PostgreSQL uses -Wall, which includes -Wformat=1. -Wformat is shorthand for -Wformat=1. These flags without -Wall are persisted into "pg_config --cflags", which is commonly used by PGXS-based extensions. So PGXS-based extensions will get the warning without -Wformat. Co-authored-by: Tristan Partin --- configure| 99 configure.ac | 10 ++ meson.build | 9 + 3 files changed, 118 insertions(+) diff --git a/configure b/configure index 70a1968003..7b0fda3825 100755 --- a/configure +++ b/configure @@ -6013,6 +6013,105 @@ if test x"$pgac_cv_prog_CXX_cxxflags__Wshadow_compatible_local" = x"yes"; then fi + # -Wformat-security requires -Wformat. We compile with -Wall in + # PostgreSQL, which includes -Wformat=1. -Wformat is shorthand for + # -Wformat=1. The set of flags which includes -Wformat-security is + # persisted into pg_config --cflags, which is commonly used by + # PGXS-based extensions. The lack of -Wformat in the persisted flags + # will produce a warning on many GCC versions, so even though adding + # -Wformat here is a no-op for PostgreSQL, it silences other use + # cases., so check for it. + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -Wformat, for CFLAGS" >&5 +$as_echo_n "checking whether ${CC} supports -Wformat, for CFLAGS... " >&6; } +if ${pgac_cv_prog_CC_cflags__Wformat+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +pgac_save_CC=$CC +CC=${CC} +CFLAGS="${CFLAGS} -Wformat" +ac_save_c_werror_flag=$ac_c_werror_flag +ac_c_werror_flag=yes +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_compile "$LINENO"; then : + pgac_cv_prog_CC_cflags__Wformat=yes +else + pgac_cv_prog_CC_cflags__Wformat=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +ac_c_werror_flag=$ac_save_c_werror_flag +CFLAGS="$pgac_save_CFLAGS" +CC="$pgac_save_CC" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__Wformat" >&5 +$as_echo "$pgac_cv_prog_CC_cflags__Wformat" >&6; } +if test x"$pgac_cv_prog_CC_cflags__Wformat" = x"yes"; then + CFLAGS="${CFLAGS} -Wformat" +fi + + + { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports -Wformat, for CXXFLAGS" >&5 +$as_echo_n "checking whether ${CXX} supports -Wformat, for CXXFLAGS... " >&6; } +if ${pgac_cv_prog_CXX_cxxflags__Wformat+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CXXFLAGS=$CXXFLAGS +pgac_save_CXX=$CXX +CXX=${CXX} +CXXFLAGS="${CXXFLAGS} -Wformat" +ac_save_cxx_werror_flag=$ac_cxx_werror_flag +ac_cxx_werror_flag=yes +ac_ext=cpp +ac_cpp='$CXXCPP $CPPFLAGS' +ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_cxx_compiler_gnu + +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + pgac_cv_prog_CXX_cxxflags__Wforma
Re: meson: Specify -Wformat as a common warning flag for extensions
Hi, In "Re: meson: Specify -Wformat as a common warning flag for extensions" on Fri, 08 Mar 2024 10:05:27 -0600, "Tristan Partin" wrote: > Ok, I figured this out. -Wall implies -Wformat=1. We set warning_level > to 1 in the Meson project() call, which implies -Wall, and set -Wall > in CFLAGS for autoconf. That's the reason we don't get issues building > Postgres. A user making use of the pg_config --cflags option, as Sutou > is, *will* run into the aforementioned issues, since we don't > propogate -Wall into pg_config. > > $ gcc $(pg_config --cflags) -E - < /dev/null > /dev/null > cc1: warning: ‘-Wformat-security’ ignored without ‘-Wformat’ > [-Wformat-security] > $ gcc -Wall $(pg_config --cflags) -E - < /dev/null > /dev/null > (nothing printed) Thanks for explaining this. You're right. This is the reason why we don't need this for PostgreSQL itself but we need this for PostgreSQL extensions. Sorry. I should have explained this in the first e-mail... What should we do to proceed this patch? Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 11 Mar 2024 08:00:00 +0800, jian he wrote: > Hi, here are my cents: > Currently in v17, we have 3 extra functions within DoCopyTo > CopyToStart, one time, start, doing some preliminary work. > CopyToOneRow, doing the repetitive work, called many times, row by row. > CopyToEnd, one time doing the closing work. > > seems to need a function pointer for processing the format and other options. > or maybe the reason is we need a one time function call before doing DoCopyTo, > like one time initialization. I know that JSON format wants it but can we defer it? We can add more options later. I want to proceed this improvement step by step. More use cases will help us which callbacks are needed. We will be able to collect more use cases by providing basic callbacks. > generally in v17, the code pattern looks like this. > if (cstate->opts.binary) > { > /* handle binary format */ > } > else if (cstate->routine) > { > /* custom code, make the copy format extensible */ > } > else > { > /* handle non-binary, (csv or text) format */ > } > maybe we need another bool flag like `bool buildin_format`. > if the copy format is {csv|text|binary} then buildin_format is true else > false. > > so the code pattern would be: > if (cstate->opts.binary) > { > /* handle binary format */ > } > else if (cstate->routine && !buildin_format) > { > /* custom code, make the copy format extensible */ > } > else > { > /* handle non-binary, (csv or text) format */ > } > > otherwise the {CopyToRoutine| CopyFromRoutine} needs a function pointer > to distinguish native copy format and extensible supported format, > like I mentioned above? Hmm. I may miss something but I think that we don't need the bool flag. Because we don't set cstate->routine for native copy formats. So we can distinguish native copy format and extensible supported format by checking only cstate->routine. Thanks, -- kou
Re: meson: Specify -Wformat as a common warning flag for extensions
Hi, In "Re: meson: Specify -Wformat as a common warning flag for extensions" on Fri, 8 Mar 2024 15:32:22 +0900, Michael Paquier wrote: > Are there version and/or > environment requirements to be aware of? I'm using Debian GNU/Linux sid and I can reproduce with gcc 8-13: $ for x in {8..13}; do; echo gcc-${x}; gcc-${x} -Wformat-security -E - < /dev/null > /dev/null; done gcc-8 cc1: warning: -Wformat-security ignored without -Wformat [-Wformat-security] gcc-9 cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] gcc-10 cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] gcc-11 cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] gcc-12 cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] gcc-13 cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] $ I tried this on Ubuntu 22.04 too but this isn't reproduced: $ gcc-11 -Wformat-security -E - < /dev/null > /dev/null $ It seems that Ubuntu enables -Wformat by default: $ gcc-11 -Wno-format -Wformat-security -E - < /dev/null > /dev/null cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] I tried this on AlmaLinux 9 too and this is reproduced: $ gcc --version gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2) Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. $ gcc -Wformat-security -E - < /dev/null > /dev/null cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] > Forcing -Wformat implies more stuff that can be disabled with > -Wno-format-contains-nul, -Wno-format-extra-args, and > -Wno-format-zero-length, but the thing is that we're usually very > conservative with such additions in the scripts. See also > 8b6f5f25102f, done, I guess, as an answer to this thread: > https://www.postgresql.org/message-id/4D431505.9010002%40dunslane.net I think that this is not a problem. Because the comment added by 8b6f5f25102f ("This was included in -Wall/-Wformat in older GCC versions") implies that we want to always use -Wformat-security. -Wformat-security isn't worked without -Wformat: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wformat-security > If -Wformat is specified, also warn about uses of format > functions that represent possible security problems. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 7 Mar 2024 15:32:01 +0900, Michael Paquier wrote: > While on it, here are some profiles based on HEAD and v17 with the > previous tests (COPY TO /dev/null, COPY FROM data sent to the void). > ... > > So, in short, and that's not really a surprise, there is no effect > once we use the dispatching with the routines only when a format would > want to plug-in with the APIs, but a custom format would still have a > penalty of a few percents for both if bottlenecked on CPU. Thanks for sharing these profiles! I agree with you. This shows that the v17 approach doesn't affect the current text/csv/binary implementations. (The v17 approach just adds 2 new structs, Copy{From,To}Rountine, without changing the current text/csv/binary implementations.) Can we push the v17 patch and proceed following implementations? Could someone (especially a PostgreSQL committer) take a look at this for double-check? Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 5 Mar 2024 15:16:33 +0900, Michael Paquier wrote: > CopyOneRowTo() could do something like that to avoid the extra > indentation: > if (cstate->routine) > { > cstate->routine->CopyToOneRow(cstate, slot); > MemoryContextSwitchTo(oldcontext); > return; > } OK. The v17 patch uses this style. Others are same as the v16. > I didn't know this trick. That's indeed nice.. I may use that for > other stuff to make patches more presentable to the eyes. And that's > available as well with `git diff`. :-) > If we basically agree about this part, how would the rest work out > with this set of APIs and the possibility to plug in a custom value > for FORMAT to do a pg_proc lookup, including an example of how these > APIs can be used? I'll send the following patches after this patch is merged. They are based on the v6 patch[1]: 1. Add copy_handler * This also adds a pg_proc lookup for custom FORMAT * This also adds a test for copy_handler 2. Export CopyToStateData * We need it to implement custom copy TO handler 3. Add needed APIs to implement custom copy TO handler * Add CopyToStateData::opaque * Export CopySendEndOfRow() 4. Export CopyFromStateData * We need it to implement custom copy FROM handler 5. Add needed APIs to implement custom copy FROM handler * Add CopyFromStateData::opaque * Export CopyReadBinaryData() [1] https://www.postgresql.org/message-id/flat/20240124.144936.67229716500876806.kou%40clear-code.com#f1ad092fc5e81fe38d3c376559efd52c Thanks, -- kou >From a78b8ee88575e2c2873afc3acf3c8c4e535becf0 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Mon, 4 Mar 2024 13:52:34 +0900 Subject: [PATCH v17] Add CopyFromRoutine/CopyToRountine They are for implementing custom COPY FROM/TO format. But this is not enough to implement custom COPY FROM/TO format yet. We'll export some APIs to receive/send data and add "format" option to COPY FROM/TO later. Existing text/csv/binary format implementations don't use CopyFromRoutine/CopyToRoutine for now. We have a patch for it but we defer it. Because there are some mysterious profile results in spite of we get faster runtimes. See [1] for details. [1] https://www.postgresql.org/message-id/ZdbtQJ-p5H1_EDwE%40paquier.xyz Note that this doesn't change existing text/csv/binary format implementations. --- src/backend/commands/copyfrom.c | 24 +- src/backend/commands/copyfromparse.c | 5 ++ src/backend/commands/copyto.c| 25 +- src/include/commands/copyapi.h | 100 +++ src/include/commands/copyfrom_internal.h | 4 + src/tools/pgindent/typedefs.list | 2 + 6 files changed, 155 insertions(+), 5 deletions(-) create mode 100644 src/include/commands/copyapi.h diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index c3bc897028..9bf2f6497e 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -1623,12 +1623,22 @@ BeginCopyFrom(ParseState *pstate, /* Fetch the input function and typioparam info */ if (cstate->opts.binary) + { getTypeBinaryInputInfo(att->atttypid, &in_func_oid, &typioparams[attnum - 1]); + fmgr_info(in_func_oid, &in_functions[attnum - 1]); + } + else if (cstate->routine) + cstate->routine->CopyFromInFunc(cstate, att->atttypid, + &in_functions[attnum - 1], + &typioparams[attnum - 1]); + else + { getTypeInputInfo(att->atttypid, &in_func_oid, &typioparams[attnum - 1]); - fmgr_info(in_func_oid, &in_functions[attnum - 1]); + fmgr_info(in_func_oid, &in_functions[attnum - 1]); + } /* Get default info if available */ defexprs[attnum - 1] = NULL; @@ -1768,10 +1778,13 @@ BeginCopyFrom(ParseState *pstate, /* Read and verify binary header */ ReceiveCopyBinaryHeader(cstate); } - - /* create workspace for CopyReadAttributes results */ - if (!cstate->opts.binary) + else if (cstate->routine) { + cstate->routine->CopyFromStart(cstate, tupDesc); + } + else + { + /* create workspace for CopyReadAttributes results */ AttrNumber attr_count = list_length(cstate->attnumlist); cstate->max_fields = attr_count; @@ -1789,6 +1802,9 @@ BeginCopyFrom(ParseState *pstate, void EndCopyFrom(CopyFromState cstate) { + if (cstate->routine) + cstate->routine->CopyFromEnd(cstate); + /* No COPY FROM related resources except memory. */ if (cstate->is_program) { diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 7cacd0b752..8b15080585 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -978,6 +978,11 @@ NextCopyFrom(CopyFromState cstate
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In <20240301.154443.618034282613922707@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 01 Mar 2024 15:44:43 +0900 (JST), Sutou Kouhei wrote: >> I guess so. It does not make much of a difference, though. The thing >> is that the dispatch caused by the custom callbacks called for each >> row is noticeable in any profiles I'm taking (not that much in the >> worst-case scenarios, still a few percents), meaning that this impacts >> the performance for all the in-core formats (text, csv, binary) as >> long as we refactor text/csv/binary to use the routines of copyapi.h. >> I don't really see a way forward, except if we don't dispatch the >> in-core formats to not impact the default cases. That makes the code >> a bit less elegant, but equally efficient for the existing formats. > > It's an option based on your profile result but your > execution result also shows that v15 is faster than HEAD [1]: > >> I am getting faster runtimes with v15 (6232ms in average) >> vs HEAD (6550ms) at 5M rows with COPY TO > > [1] > https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz#6439e6ad574f2d47cd7220e9bfed3889 > > I think that faster runtime is beneficial than mysterious > profile for users. So I think that we can merge v15 to > master. If this is a blocker of making COPY format extendable, can we defer moving the existing text/csv/binary format implementations to Copy{From,To}Routine for now as Michael suggested to proceed making COPY format extendable? (Can we add Copy{From,To}Routine without changing the existing text/csv/binary format implementations?) I attach a patch for it. There is a large hunk for CopyOneRowTo() that is caused by indent change. I also attach "...-w.patch" that uses "git -w" to remove space only changes. "...-w.patch" is only for review. We should use .patch without -w for push. Thanks, -- kou >From 6a5bfc8e104f0a339b421028e9fec69a4d092671 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Mon, 4 Mar 2024 13:52:34 +0900 Subject: [PATCH v16] Add CopyFromRoutine/CopyToRountine They are for implementing custom COPY FROM/TO format. But this is not enough to implement custom COPY FROM/TO format yet. We'll export some APIs to receive/send data and add "format" option to COPY FROM/TO later. Existing text/csv/binary format implementations don't use CopyFromRoutine/CopyToRoutine for now. We have a patch for it but we defer it. Because there are some mysterious profile results in spite of we get faster runtimes. See [1] for details. [1] https://www.postgresql.org/message-id/ZdbtQJ-p5H1_EDwE%40paquier.xyz Note that this doesn't change existing text/csv/binary format implementations. There are many diffs for CopyOneRowTo() but they're caused by indentation. They don't change implementations. --- src/backend/commands/copyfrom.c | 24 +- src/backend/commands/copyfromparse.c | 5 ++ src/backend/commands/copyto.c| 103 ++- src/include/commands/copyapi.h | 100 ++ src/include/commands/copyfrom_internal.h | 4 + src/tools/pgindent/typedefs.list | 2 + 6 files changed, 193 insertions(+), 45 deletions(-) create mode 100644 src/include/commands/copyapi.h diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index c3bc897028..9bf2f6497e 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -1623,12 +1623,22 @@ BeginCopyFrom(ParseState *pstate, /* Fetch the input function and typioparam info */ if (cstate->opts.binary) + { getTypeBinaryInputInfo(att->atttypid, &in_func_oid, &typioparams[attnum - 1]); + fmgr_info(in_func_oid, &in_functions[attnum - 1]); + } + else if (cstate->routine) + cstate->routine->CopyFromInFunc(cstate, att->atttypid, + &in_functions[attnum - 1], + &typioparams[attnum - 1]); + else + { getTypeInputInfo(att->atttypid, &in_func_oid, &typioparams[attnum - 1]); - fmgr_info(in_func_oid, &in_functions[attnum - 1]); + fmgr_info(in_func_oid, &in_functions[attnum - 1]); + } /* Get default info if available */ defexprs[attnum - 1] = NULL; @@ -1768,10 +1778,13 @@ BeginCopyFrom(ParseState *pstate, /* Read and verify binary header */ ReceiveCopyBinaryHeader(cstate); } - - /* create workspace for CopyReadAttributes results */ - if (!cstate->opts.binary) + else if (cstate->routine) { + cstate->routine->CopyFromStart(cstate, tupDesc); + } + else + { + /* create workspace for CopyReadAttributes results */ AttrNumber attr_count = list_length(cstate->attnumlist); cstate->max_fields = attr_cou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 1 Mar 2024 14:31:38 +0900, Michael Paquier wrote: > I guess so. It does not make much of a difference, though. The thing > is that the dispatch caused by the custom callbacks called for each > row is noticeable in any profiles I'm taking (not that much in the > worst-case scenarios, still a few percents), meaning that this impacts > the performance for all the in-core formats (text, csv, binary) as > long as we refactor text/csv/binary to use the routines of copyapi.h. > I don't really see a way forward, except if we don't dispatch the > in-core formats to not impact the default cases. That makes the code > a bit less elegant, but equally efficient for the existing formats. It's an option based on your profile result but your execution result also shows that v15 is faster than HEAD [1]: > I am getting faster runtimes with v15 (6232ms in average) > vs HEAD (6550ms) at 5M rows with COPY TO [1] https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz#6439e6ad574f2d47cd7220e9bfed3889 I think that faster runtime is beneficial than mysterious profile for users. So I think that we can merge v15 to master. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In <20240222.183948.518018047578925034@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 22 Feb 2024 18:39:48 +0900 (JST), Sutou Kouhei wrote: > How about adding "is_csv" to CopyReadline() and > CopyReadLineText() too? I tried this on my environment. This is a change for COPY FROM not COPY TO but this decreases COPY TO performance with [1]... Hmm... master: 697.693 msec (the best case) v15: 576.374 msec (the best case) v15+this: 593.559 msec (the best case) [1] COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 100::int4)) TO '/dev/null' \watch c=15 So I think that v15 is good. perf result of master: # Children Self Command Shared Object Symbol # . . # 31.39%14.54% postgres postgres [.] CopyOneRowTo |--17.00%--CopyOneRowTo | |--10.61%--FunctionCall1Coll | | --8.40%--int2out | | |--2.58%--pg_ltoa | | | --0.68%--pg_ultoa_n | | |--1.11%--pg_ultoa_n | | |--0.83%--AllocSetAlloc | | |--0.69%--__memcpy_avx_unaligned_erms (inlined) | | |--0.58%--FunctionCall1Coll | | --0.55%--memcpy@plt | |--3.25%--appendBinaryStringInfo | | --0.56%--pg_ultoa_n | --0.69%--CopyAttributeOutText perf result of v15: # Children Self Command Shared Object Symbol # . . # 25.60%10.47% postgres postgres [.] CopyToTextOneRow |--15.39%--CopyToTextOneRow | |--10.44%--FunctionCall1Coll | | |--7.25%--int2out | | | |--2.60%--pg_ltoa | | | | --0.71%--pg_ultoa_n | | | |--0.90%--FunctionCall1Coll | | | |--0.84%--pg_ultoa_n | | | --0.66%--AllocSetAlloc | | |--0.79%--ExecProjectSet | | --0.68%--int4out | |--2.50%--appendBinaryStringInfo | --0.53%--CopyAttributeOutText The profiles on Michael's environment [2] showed that CopyOneRow() % was increased by v15. But it (CopyToTextOneRow() % not CopyOneRow() %) wasn't increased by v15. It's decreased instead. [2] https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz#6439e6ad574f2d47cd7220e9bfed3889 So I think that v15 doesn't have performance regression but my environment isn't suitable for benchmark... Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 22 Feb 2024 15:44:16 +0900, Michael Paquier wrote: > I was comparing what you have here, and what's been attached by Andres > at [1] and the top of the changes on my development branch at [2] > (v3-0008, mostly). And, it strikes me that there is no need to do any > major changes in any of the callbacks proposed up to v13 and v14 in > this thread, as all the changes proposed want to plug in more data > into each StateData for COPY FROM and COPY TO, the best part being > that v3-0008 can just reuse the proposed callbacks as-is. v1-0001 > from Sutou-san would need one slight tweak in the per-row callback, > still that's minor. I think so too. But I thought that some minor conflicts will be happen with this and the v15. So I worked on this before the v15. We agreed that this optimization doesn't block v15: [1] So we can work on the v15 without this optimization for now. [1] https://www.postgresql.org/message-id/flat/20240219195351.5vy7cdl3wxia66kg%40awork3.anarazel.de#20f9677e074fb0f8c5bb3994ef059a15 > I have been spending more time on the patch to introduce the COPY > APIs, leading me to the v15 attached, where I have replaced the > previous attribute callbacks for the output representation and the > reads with hardcoded routines that should be optimized by compilers, > and I have done more profiling with -O2. Thanks! I wanted to work on it but I didn't have enough time for it in a few days... I've reviewed the v15. > @@ -751,8 +751,9 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int > nbytes) > * > * NOTE: force_not_null option are not applied to the returned fields. > */ > -bool > -NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) > +static bool "inline" is missing here. > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, > + bool is_csv) > { > int fldct; How about adding "is_csv" to CopyReadline() and CopyReadLineText() too? diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 25b8d4bc52..79fabecc69 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -150,8 +150,8 @@ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0"; /* non-export function prototypes */ -static bool CopyReadLine(CopyFromState cstate); -static bool CopyReadLineText(CopyFromState cstate); +static inline bool CopyReadLine(CopyFromState cstate, bool is_csv); +static inline bool CopyReadLineText(CopyFromState cstate, bool is_csv); static inline int CopyReadAttributesText(CopyFromState cstate); static inline int CopyReadAttributesCSV(CopyFromState cstate); static Datum CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, @@ -770,7 +770,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, tupDesc = RelationGetDescr(cstate->rel); cstate->cur_lineno++; - done = CopyReadLine(cstate); + done = CopyReadLine(cstate, is_csv); if (cstate->opts.header_line == COPY_HEADER_MATCH) { @@ -823,7 +823,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, cstate->cur_lineno++; /* Actually read the line into memory here */ - done = CopyReadLine(cstate); + done = CopyReadLine(cstate, is_csv); /* * EOF at start of line means we're done. If we see EOF after some @@ -1133,8 +1133,8 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, * by newline. The terminating newline or EOF marker is not included * in the final value of line_buf. */ -static bool -CopyReadLine(CopyFromState cstate) +static inline bool +CopyReadLine(CopyFromState cstate, bool is_csv) { boolresult; @@ -1142,7 +1142,7 @@ CopyReadLine(CopyFromState cstate) cstate->line_buf_valid = false; /* Parse data and transfer into line_buf */ - result = CopyReadLineText(cstate); + result = CopyReadLineText(cstate, is_csv); if (result) { @@ -1209,8 +1209,8 @@ CopyReadLine(CopyFromState cstate) /* * CopyReadLineText - inner loop of CopyReadLine for text mode */ -static bool -CopyReadLineText(CopyFromState cstate) +static inline bool +CopyReadLineText(CopyFromState cstate, bool is_csv) { char *copy_input_buf; int input_buf_ptr; @@ -1226,7 +1226,7 @@ CopyReadLineText(CopyFromState cstate) charquotec = '\0'; charescapec = '\0'; - if (cstate->opts.csv_mode) + if (is_csv) { quotec = cstate->opts.quote[0]; escapec = cstate->opts.escape[0]; @@ -1306,7 +1306,7 @@ CopyReadLineText(CopyFromState cstate) prev_raw_ptr = inpu
Re: meson: Specify -Wformat as a common warning flag for extensions
Hi, Could someone take a look at this? Patch is attached in the original e-mail: https://www.postgresql.org/message-id/20240122.141139.931086145628347157.kou%40clear-code.com Thanks, -- kou In <20240122.141139.931086145628347157@clear-code.com> "meson: Specify -Wformat as a common warning flag for extensions" on Mon, 22 Jan 2024 14:11:39 +0900 (JST), Sutou Kouhei wrote: > Hi, > > I'm an extension developer. If I use PostgreSQL built with > Meson, I get the following warning: > > cc1: warning: '-Wformat-security' ignored without '-Wformat' > [-Wformat-security] > > Because "pg_config --cflags" includes -Wformat-security but > doesn't include -Wformat. > > Can we specify -Wformat as a common warning flag too? If we > do it, "pg_config --cflags" includes both of > -Wformat-security and -Wformat. So I don't get the warning. > > > Thanks, > -- > kou
Re: Why is pq_begintypsend so slow?
Hi, In <20240219195351.5vy7cdl3wxia6...@awork3.anarazel.de> "Re: Why is pq_begintypsend so slow?" on Mon, 19 Feb 2024 11:53:51 -0800, Andres Freund wrote: >> I don't have a strong opinion how to implement this >> optimization as I said above. It seems that you like your >> approach. So I withdraw [1]. Could you complete this >> optimization? Can we proceed making COPY format extensible >> without this optimization? It seems that it'll take a little >> time to complete this optimization because your patch is >> still WIP. And it seems that you can work on it after making >> COPY format extensible. > > I don't think optimizing this aspect needs to block making copy extensible. OK. I'll work on making copy extensible without this optimization. > I don't know how much time/energy I'll have to focus on this in the near > term. I really just reposted this because the earlier patches were relevant > for the discussion in another thread. If you want to pick the COPY part up, > feel free. OK. I may work on this after I complete making copy extensible if you haven't completed this yet. Thanks, -- kou
Re: Why is pq_begintypsend so slow?
Hi, In <20240218200906.zvihkrs46yl2j...@awork3.anarazel.de> "Re: Why is pq_begintypsend so slow?" on Sun, 18 Feb 2024 12:09:06 -0800, Andres Freund wrote: >> [1] >> https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995 >> Do we need one FunctionCallInfoBaseData for each attribute? >> How about sharing one FunctionCallInfoBaseData by all >> attributes like [1]? > > That makes no sense to me. You're throwing away most of the possible gains by > having to update the FunctionCallInfo fields on every call. You're saving > neglegible amounts of memory at a substantial runtime cost. The number of updated fields of your approach and [1] are same: Your approach: 6 (I think that "fcinfo->isnull = false" is needed though.) + fcinfo->args[0].value = CStringGetDatum(string); + fcinfo->args[0].isnull = false; + fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam); + fcinfo->args[1].isnull = false; + fcinfo->args[2].value = Int32GetDatum(attr->typmod); + fcinfo->args[2].isnull = false; [1]: 6 (including "fcinfo->isnull = false") + fcinfo->flinfo = flinfo; + fcinfo->context = escontext; + fcinfo->isnull = false; + fcinfo->args[0].value = CStringGetDatum(str); + fcinfo->args[1].value = ObjectIdGetDatum(typioparam); + fcinfo->args[2].value = Int32GetDatum(typmod); >> Inlining InputFuncallCallSafe() here to use pre-initialized >> fcinfo will decrease maintainability. Because we abstract >> InputFunctionCall family in fmgr.c. If we define a >> InputFunctionCall variant here, we need to change both of >> fmgr.c and here when InputFunctionCall family is changed. >> How about defining details in fmgr.c and call it here >> instead like [1]? > > I'm not sure I buy that that is a problem. It's not like my approach was > actually bypassing fmgr abstractions alltogether - instead it just used the > lower level APIs, because it's a performance sensitive area. [1] provides some optimized abstractions, which are implemented with lower level APIs, without breaking the abstractions. Note that I don't have a strong opinion how to implement this optimization. If other developers think this approach makes sense for this optimization, I don't object it. >> @@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo >> *flinfo, >> if (fld_size == -1) >> { >> *isnull = true; >> -return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod); >> +return ReceiveFunctionCall(fcinfo->flinfo, NULL, >> attr->typioparam, attr->typmod); >> >> Why pre-initialized fcinfo isn't used here? > > Because it's a prototype and because I don't think it's a common path. How about adding a comment why we don't need to optimize this case? I don't have a strong opinion how to implement this optimization as I said above. It seems that you like your approach. So I withdraw [1]. Could you complete this optimization? Can we proceed making COPY format extensible without this optimization? It seems that it'll take a little time to complete this optimization because your patch is still WIP. And it seems that you can work on it after making COPY format extensible. Thanks, -- kou
Re: Why is pq_begintypsend so slow?
Hi, In <20240218015955.rmw5mcmobt5hb...@awork3.anarazel.de> "Re: Why is pq_begintypsend so slow?" on Sat, 17 Feb 2024 17:59:55 -0800, Andres Freund wrote: > v3-0008-wip-make-in-out-send-recv-calls-in-copy.c-cheaper.patch It seems that this is an alternative approach of [1]. [1] https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995 +typedef struct CopyInAttributeInfo +{ + AttrNumber num; + const char *name; + + Oid typioparam; + int32 typmod; + + FmgrInfoin_finfo; + union + { + FunctionCallInfoBaseData fcinfo; + charfcinfo_data[SizeForFunctionCallInfo(3)]; + } in_fcinfo; Do we need one FunctionCallInfoBaseData for each attribute? How about sharing one FunctionCallInfoBaseData by all attributes like [1]? @@ -956,20 +956,47 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]); } - - /* -* If ON_ERROR is specified with IGNORE, skip rows with soft -* errors -*/ - else if (!InputFunctionCallSafe(&in_functions[m], - string, - typioparams[m], - att->atttypmod, - (Node *) cstate->escontext, - &values[m])) Inlining InputFuncallCallSafe() here to use pre-initialized fcinfo will decrease maintainability. Because we abstract InputFunctionCall family in fmgr.c. If we define a InputFunctionCall variant here, we need to change both of fmgr.c and here when InputFunctionCall family is changed. How about defining details in fmgr.c and call it here instead like [1]? + fcinfo->args[0].value = CStringGetDatum(string); + fcinfo->args[0].isnull = false; + fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam); + fcinfo->args[1].isnull = false; + fcinfo->args[2].value = Int32GetDatum(attr->typmod); + fcinfo->args[2].isnull = false; I think that "fcinfo->isnull = false;" is also needed like [1]. @@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, if (fld_size == -1) { *isnull = true; - return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod); + return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod); Why pre-initialized fcinfo isn't used here? Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 15 Feb 2024 17:09:20 +0800, jian he wrote: > My environment is slow (around 10x) but consistent. > I see around 2-3 percent increase consistently. > (with patch 7369.068 ms, without patch 7574.802 ms) Thanks for sharing your numbers! It will help us to determine whether these changes improve performance or not. > the patchset looks good in my eyes, i can understand it. > however I cannot apply it cleanly against the HEAD. Hmm, I used 9bc1eee988c31e66a27e007d41020664df490214 as the base version. But both patches based on the same revision. So we may not be able to apply both patches at once cleanly. > +/* > + * Prepare callinfo for InputFunctionCallSafeWithInfo to reuse one callinfo > + * instead of initializing it for each call. This is for performance. > + */ > +FunctionCallInfoBaseData * > +PrepareInputFunctionCallInfo(void) > +{ > + FunctionCallInfoBaseData *fcinfo; > + > + fcinfo = (FunctionCallInfoBaseData *) palloc(SizeForFunctionCallInfo(3)); > > just wondering, I saw other similar places using palloc0, > do we need to use palloc0? I think that we don't need to use palloc0() here because the following InitFunctionCallInfoData() call initializes all members explicitly. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In <20240213.173340.1518143507526518973@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 13 Feb 2024 17:33:40 +0900 (JST), Sutou Kouhei wrote: > I'll reply other comments later... I've read other comments and my answers for them are same as Michael's one. I'll prepare the v15 patch with static inline functions and fixed arguments after the fcinfo cache patches are merged. I think that the v15 patch will be conflicted with fcinfo cache patches. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 14 Feb 2024 15:52:36 +0900, Michael Paquier wrote: >> How about InputFunctionCallSafeWithInfo(), >> InputFunctionCallSafeInfo() or >> InputFunctionCallInfoCallSafe()? > > WithInfo() would not be a new thing. There are a couple of APIs named > like this when manipulating catalogs, so that sounds kind of a good > choice from here. Thanks for the info. Let's use InputFunctionCallSafeWithInfo(). See that attached patch: v2-0001-Reuse-fcinfo-used-in-COPY-FROM.patch I also attach a patch for COPY TO: v1-0001-Reuse-fcinfo-used-in-COPY-TO.patch I measured the COPY TO patch on my environment with: COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 100::int4)) TO '/dev/null' \watch c=5 master: 740.066ms 734.884ms 738.579ms 734.170ms 727.953ms patched: 730.714ms 741.483ms 714.149ms 715.436ms 713.578ms It seems that it improves performance a bit but my environment isn't suitable for benchmark. So they may not be valid numbers. Thanks, -- kou >From b677732f46f735a5601b889f79671e91be41 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Thu, 15 Feb 2024 15:01:08 +0900 Subject: [PATCH v2] Reuse fcinfo used in COPY FROM Each NextCopyFrom() calls input functions or binary-input functions. We can reuse fcinfo for them instead of creating a local fcinfo for each call. This will improve performance. --- src/backend/commands/copyfrom.c | 4 + src/backend/commands/copyfromparse.c | 20 ++-- src/backend/utils/fmgr/fmgr.c| 126 +++ src/include/commands/copyfrom_internal.h | 1 + src/include/fmgr.h | 12 +++ 5 files changed, 154 insertions(+), 9 deletions(-) diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index 1fe70b9133..ed375c012e 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -1691,6 +1691,10 @@ BeginCopyFrom(ParseState *pstate, /* We keep those variables in cstate. */ cstate->in_functions = in_functions; cstate->typioparams = typioparams; + if (cstate->opts.binary) + cstate->fcinfo = PrepareReceiveFunctionCallInfo(); + else + cstate->fcinfo = PrepareInputFunctionCallInfo(); cstate->defmap = defmap; cstate->defexprs = defexprs; cstate->volatile_defexprs = volatile_defexprs; diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 7cacd0b752..7907e16ea8 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -861,6 +861,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, num_defaults = cstate->num_defaults; FmgrInfo *in_functions = cstate->in_functions; Oid *typioparams = cstate->typioparams; + FunctionCallInfoBaseData *fcinfo = cstate->fcinfo; int i; int *defmap = cstate->defmap; ExprState **defexprs = cstate->defexprs; @@ -961,12 +962,13 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, * If ON_ERROR is specified with IGNORE, skip rows with soft * errors */ - else if (!InputFunctionCallSafe(&in_functions[m], - string, - typioparams[m], - att->atttypmod, - (Node *) cstate->escontext, - &values[m])) + else if (!InputFunctionCallSafeWithInfo(fcinfo, + &in_functions[m], + string, + typioparams[m], + att->atttypmod, + (Node *) cstate->escontext, + &values[m])) { cstate->num_errors++; return true; @@ -1966,7 +1968,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, if (fld_size == -1) { *isnull = true; - return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod); + return ReceiveFunctionCallWithInfo(cstate->fcinfo, flinfo, NULL, typioparam, typmod); } if (fld_size < 0) ereport(ERROR, @@ -1987,8 +1989,8 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, cstate->attribute_buf.data[fld_size] = '\0'; /* Call the column type's binary input converter */ - result = ReceiveFunctionCall(flinfo, &cstate->attribute_buf, - typioparam, typmod); + result = ReceiveFunctionCallWithInfo(cstate->fcinfo, flinfo, &cstate->attribute_buf, + typioparam, typmod); /* Trouble if it didn't eat the whole buffer */ if (cstate->attribute_buf.cursor != cstate->attribute_buf.len) diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index e48a86be54..14c3ed2bdb 100644 --- a/src/backend/utils/fmgr/fmgr.c +++ b/src/backend/utils/fmgr/fmgr.c @@ -1672,6 +1672,73 @@ DirectInputFu
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 14 Feb 2024 12:28:38 +0900, Michael Paquier wrote: >> How about the attached patch approach? If it's a desired >> approach, I can also write a separated patch for COPY TO. > > Hmm, I have not studied that much, but my first impression was that we > would not require any new facility in fmgr.c, but perhaps you're right > and it's more elegant to pass a InitFunctionCallInfoData this way. I'm not familiar with the fmgr.c related code base but it seems that we abstract {,binary-}input function call by fmgr.c. So I think that it's better that we follow the design. (If there is a person who knows the fmgr.c related code base, please help us.) > PrepareInputFunctionCallInfo() looks OK as a name, but I'm less a fan > of PreparedInputFunctionCallSafe() and its "Prepared" part. How about > something like ExecuteInputFunctionCallSafe()? I understand the feeling. SQL uses "prepared" for "prepared statement". There are similar function names such as InputFunctionCall()/InputFunctionCallSafe()/DirectInputFunctionCallSafe(). They execute (call) an input function but they use "call" not "execute" for it... So "Execute...Call..." may be redundant... How about InputFunctionCallSafeWithInfo(), InputFunctionCallSafeInfo() or InputFunctionCallInfoCallSafe()? > I may be able to look more at that next week, and I would surely check > the impact of that with a simple COPY query throttled by CPU (more > rows and more attributes the better). Thanks! >Note that I've reverted 06bd311bce24 for > the moment, as this is just getting in the way of the main patch, and > that was non-optimal once there is a per-row callback. Thanks for sharing the information. I'll rebase on master when I create the v15 patch. >> diff --git a/src/backend/commands/copyfrom.c >> b/src/backend/commands/copyfrom.c >> index 41f6bc43e4..a43c853e99 100644 >> --- a/src/backend/commands/copyfrom.c >> +++ b/src/backend/commands/copyfrom.c >> @@ -1691,6 +1691,10 @@ BeginCopyFrom(ParseState *pstate, >> /* We keep those variables in cstate. */ >> cstate->in_functions = in_functions; >> cstate->typioparams = typioparams; >> +if (cstate->opts.binary) >> +cstate->fcinfo = PrepareInputFunctionCallInfo(); >> +else >> +cstate->fcinfo = PrepareReceiveFunctionCallInfo(); > > Perhaps we'd better avoid more callbacks like that, for now. I'll not use a callback for this. I'll not change this part after we introduce Copy{To,From}Routine. cstate->fcinfo isn't used some custom COPY format handlers such as Apache Arrow handler like cstate->in_functions and cstate->typioparams. But they will be always allocated. It's a bit wasteful for those handlers but we may not care about it. So we can always use "if (state->opts.binary)" condition here. BTW... This part was wrong... Sorry... It should be: if (cstate->opts.binary) cstate->fcinfo = PrepareReceiveFunctionCallInfo(); else cstate->fcinfo = PrepareInputFunctionCallInfo(); Thanks, -- kou
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900, Michael Paquier wrote: > We have a couple of non-ASCII characters in the tests, but I suspect > that this one will not be digested correctly everywhere, even if > EUC_JP should be OK to use for the check. How about writing an > arbitrary sequence of bytes into a temporary file that gets used for > the COPY FROM instead? See for example how we do that with > abs_builddir in copy.sql. It makes sense. How about the attached patch? Thanks, -- kou >From 6eb9669f97c54f8b85fac63db40ad80664692d12 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v2] Add a test for invalid encoding for COPY FROM The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 13 + src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 15 +++ 3 files changed, 29 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00..32a9d918fa --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,13 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # -- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # -- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00..89e2124996 --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,15 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR + +CREATE TABLE test (t text); + +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); + +DROP TABLE test; -- 2.43.0
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In <20240209192705.5qdilvviq3py2...@awork3.anarazel.de> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 9 Feb 2024 11:27:05 -0800, Andres Freund wrote: >> +static void >> +CopyFromTextInFunc(CopyFromState cstate, Oid atttypid, >> + FmgrInfo *finfo, Oid *typioparam) >> +{ >> +Oid func_oid; >> + >> +getTypeInputInfo(atttypid, &func_oid, typioparam); >> +fmgr_info(func_oid, finfo); >> +} > > FWIW, we should really change the copy code to initialize FunctionCallInfoData > instead of re-initializing that on every call, realy makes a difference > performance wise. How about the attached patch approach? If it's a desired approach, I can also write a separated patch for COPY TO. >> +cstate->raw_fields = (char **) palloc(attr_count * sizeof(char *)); >> +/* Set read attribute callback */ >> +if (cstate->opts.csv_mode) >> +cstate->copy_read_attributes = CopyReadAttributesCSV; >> +else >> +cstate->copy_read_attributes = CopyReadAttributesText; >> +} > > Isn't this precisely repeating the mistake of 2889fd23be56? What do you think about the approach in my previous mail's attachments? https://www.postgresql.org/message-id/flat/20240209.163205.704848659612151781.kou%40clear-code.com#dbb1f8d7f2f0e8fe3c7e37a757fcfc54 If it's a desired approach, I can prepare a v15 patch set based on the v14 patch set and the approach. I'll reply other comments later... Thanks, -- kou diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index 41f6bc43e4..a43c853e99 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -1691,6 +1691,10 @@ BeginCopyFrom(ParseState *pstate, /* We keep those variables in cstate. */ cstate->in_functions = in_functions; cstate->typioparams = typioparams; + if (cstate->opts.binary) + cstate->fcinfo = PrepareInputFunctionCallInfo(); + else + cstate->fcinfo = PrepareReceiveFunctionCallInfo(); cstate->defmap = defmap; cstate->defexprs = defexprs; cstate->volatile_defexprs = volatile_defexprs; diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 906756362e..e372e5efb8 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -853,6 +853,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, num_defaults = cstate->num_defaults; FmgrInfo *in_functions = cstate->in_functions; Oid *typioparams = cstate->typioparams; + FunctionCallInfoBaseData *fcinfo = cstate->fcinfo; int i; int *defmap = cstate->defmap; ExprState **defexprs = cstate->defexprs; @@ -953,12 +954,13 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, * If ON_ERROR is specified with IGNORE, skip rows with soft * errors */ - else if (!InputFunctionCallSafe(&in_functions[m], - string, - typioparams[m], - att->atttypmod, - (Node *) cstate->escontext, - &values[m])) + else if (!PreparedInputFunctionCallSafe(fcinfo, + &in_functions[m], + string, + typioparams[m], + att->atttypmod, + (Node *) cstate->escontext, + &values[m])) { cstate->num_errors++; return true; @@ -1958,7 +1960,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, if (fld_size == -1) { *isnull = true; - return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod); + return PreparedReceiveFunctionCall(cstate->fcinfo, flinfo, NULL, typioparam, typmod); } if (fld_size < 0) ereport(ERROR, @@ -1979,8 +1981,8 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, cstate->attribute_buf.data[fld_size] = '\0'; /* Call the column type's binary input converter */ - result = ReceiveFunctionCall(flinfo, &cstate->attribute_buf, - typioparam, typmod); + result = PreparedReceiveFunctionCall(cstate->fcinfo, flinfo, &cstate->attribute_buf, + typioparam, typmod); /* Trouble if it didn't eat the whole buffer */ if (cstate->attribute_buf.cursor != cstate->attribute_buf.len) diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index e48a86be54..b0b5310219 100644 --- a/src/backend/utils/fmgr/fmgr.c +++ b/src/backend/utils/fmgr/fmgr.c @@ -1672,6 +1672,73 @@ DirectInputFunctionCallSafe(PGFunction func, char *str, return true; } +/* + * Prepare callinfo for PreparedInputFunctionCallSafe to reuse one callinfo + * instead of initializing it for each call. This is for performance. + */ +FunctionCallInfoBaseData * +PrepareInputFunctionCallInfo(void) +{ + FunctionCallInfoBaseData *fcinfo; + + fcinfo = (FunctionCallInfoBaseData *) palloc(SizeForFunctionCallInfo(3)); + InitFunctionCallInfoData(*fcinfo, NULL, 3, InvalidOid, NULL, NULL); + fcinfo->args[0].isnull = false; + fcinfo->args[1].isnull = false; + fcinfo->args[2].isnull = f
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 9 Feb 2024 13:21:34 +0900, Michael Paquier wrote: >> - Revisit what we have here, looking at more profiles to see how HEAD >> an v13 compare. It looks like we are on a good path, but let's tackle >> things one step at a time. > > And attached is a v14 that's rebased on HEAD. Thanks! > A next step I think we could take is to split the binary-only and the > text/csv-only data in each cstate into their own structure to make the > structure, with an opaque pointer that custom formats could use, but a > lot of fields are shared as well. It'll make COPY code base cleaner but it may decrease performance. How about just adding an opaque pointer to each cstate as the next step and then try the split? My suggestion: 1. Introduce Copy{To,From}Routine (We can do it based on the v14 patch.) 2. Add an opaque pointer to Copy{To,From}Routine (This must not have performance impact.) 3.a. Split format specific data to the opaque space 3.b. Add support for registering custom format handler by creating a function 4. ... >This patch is already complicated > enough IMO, so I'm OK to leave it out for the moment, and focus on > making this infra pluggable as a next step. I agree with you. > Are there any comments about this v14? Sutou-san? Here are my comments: + /* Set read attribute callback */ + if (cstate->opts.csv_mode) + cstate->copy_read_attributes = CopyReadAttributesCSV; + else + cstate->copy_read_attributes = CopyReadAttributesText; I think that we should not use this approach for performance. We need to use "static inline" and constant argument instead something like the attached remove-copy-read-attributes.diff. We have similar codes for CopyReadLine()/CopyReadLineText(). The attached remove-copy-read-attributes-and-optimize-copy-read-line.diff also applies the same optimization to CopyReadLine()/CopyReadLineText(). I hope that this improved performance of COPY FROM. +/* + * Routines assigned to each format. ++ Garbage "+" + * CSV and text share the same implementation, at the exception of the + * copy_read_attributes callback. + */ +/* + * CopyToTextOneRow + * + * Process one row for text/CSV format. + */ +static void +CopyToTextOneRow(CopyToState cstate, +TupleTableSlot *slot) +{ ... + if (cstate->opts.csv_mode) + CopyAttributeOutCSV(cstate, string, + cstate->opts.force_quote_flags[attnum - 1]); + else + CopyAttributeOutText(cstate, string); ... How about use "static inline" and constant argument approach here too? static inline void CopyToTextBasedOneRow(CopyToState cstate, TupleTableSlot *slot, bool csv_mode) { ... if (cstate->opts.csv_mode) CopyAttributeOutCSV(cstate, string, cstate->opts.force_quote_flags[attnum - 1]); else CopyAttributeOutText(cstate, string); ... } static void CopyToTextOneRow(CopyToState cstate, TupleTableSlot *slot, bool csv_mode) { CopyToTextBasedOneRow(cstate, slot, false); } static void CopyToCSVOneRow(CopyToState cstate, TupleTableSlot *slot, bool csv_mode) { CopyToTextBasedOneRow(cstate, slot, true); } static const CopyToRoutine CopyCSVRoutineText = { ... .CopyToOneRow = CopyToCSVOneRow, ... }; Thanks, -- kou diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index a90b7189b5..6e244fb443 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -158,12 +158,6 @@ CopyFromTextStart(CopyFromState cstate, TupleDesc tupDesc) attr_count = list_length(cstate->attnumlist); cstate->max_fields = attr_count; cstate->raw_fields = (char **) palloc(attr_count * sizeof(char *)); - - /* Set read attribute callback */ - if (cstate->opts.csv_mode) - cstate->copy_read_attributes = CopyReadAttributesCSV; - else - cstate->copy_read_attributes = CopyReadAttributesText; } /* @@ -221,9 +215,8 @@ CopyFromBinaryEnd(CopyFromState cstate) /* * Routines assigned to each format. -+ * CSV and text share the same implementation, at the exception of the - * copy_read_attributes callback. + * CopyFromOneRow callback. */ static const CopyFromRoutine CopyFromRoutineText = { .CopyFromInFunc = CopyFromTextInFunc, @@ -235,7 +228,7 @@ static const CopyFromRoutine CopyFromRoutineText = { static con
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 7 Feb 2024 13:33:18 +0900, Michael Paquier wrote: > Hmm. That explains why I was not seeing any differences with this > callback then. It seems to me that the order of actions to take is > clear, like: > - Revert 2889fd23be56 to keep a clean state of the tree, now done with > 1aa8324b81fa. Done. > - Dive into the strlen() issue, as it really looks like this can > create more simplifications for the patch discussed on this thread > with COPY TO. Done: b619852086ed2b5df76631f5678f60d3bebd3745 > - Revisit what we have here, looking at more profiles to see how HEAD > an v13 compare. It looks like we are on a good path, but let's tackle > things one step at a time. Are you already working on this? Do you want me to write the next patch based on the current master? Thanks, -- kou
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In <20240208.172501.2177371292839763981@clear-code.com> "Re: confusing / inefficient "need_transcoding" handling in copy" on Thu, 08 Feb 2024 17:25:01 +0900 (JST), Sutou Kouhei wrote: > How about the following to avoid needless transcoding? Oh, sorry. I missed the Michael's patch: https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a I withdraw my change. Thanks, -- kou
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In <20240206222445.hzq22pb2nye7r...@awork3.anarazel.de> "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800, Andres Freund wrote: > One unfortunate issue: We don't have any tests verifying that COPY FROM > catches encoding issues. How about the attached patch for it? How about the following to avoid needless transcoding? diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index bd4710a79b..80ec26cafe 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -612,13 +612,14 @@ BeginCopyTo(ParseState *pstate, cstate->file_encoding = cstate->opts.file_encoding; /* -* Set up encoding conversion info. Even if the file and server encodings -* are the same, we must apply pg_any_to_server() to validate data in -* multibyte encodings. +* Set up encoding conversion info. We use pg_server_to_any() for the +* conversion. pg_server_to_any() does nothing with an encoding that +* equals to GetDatabaseEncoding() and PG_SQL_ASCII. If +* cstate->file_encoding equals to GetDatabaseEncoding() and PG_SQL_ASCII, +* we don't need to transcode. */ - cstate->need_transcoding = - (cstate->file_encoding != GetDatabaseEncoding() || -pg_database_encoding_max_length() > 1); + cstate->need_transcoding = !(cstate->file_encoding == GetDatabaseEncoding() || + cstate->file_encoding == PG_SQL_ASCII); /* See Multibyte encoding comment above */ cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding); Numbers on my environment: master: 861.646ms patched: 697.547ms SQL: COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 100::int4)) TO '/dev/null' \watch c=5 Thanks, -- kou From 387f11bde4eb928af23f6a66101aa9d63b3dcfdf Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Thu, 8 Feb 2024 17:17:25 +0900 Subject: [PATCH v1] Add a test for invalid encoding for COPY FROM The test data uses UTF-8 "hello" in Japanese but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 8 src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 10 ++ 3 files changed, 19 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00..aa4ecea09e --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,8 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # -- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # -- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00..07c85e988b --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,10 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +こんにちは +\. + +DROP TABLE test; -- 2.43.0
Re: meson: catalog/syscache_ids.h isn't installed
Hi, In <4b60e9bd-426c-4d4b-bbbd-1abd06fa0...@eisentraut.org> "Re: meson: catalog/syscache_ids.h isn't installed" on Mon, 5 Feb 2024 17:53:52 +0100, Peter Eisentraut wrote: > On 05.02.24 02:29, Sutou Kouhei wrote: >> catalog/syscache_ids.h is referred by utils/syscache.h but >> it's not installed with Meson. > > This has been fixed. (Somebody else reported it in a different thread > also.) Thanks! Commit: 1ae5ace7558ea949d2f94af2fd5eb145d5558659 Thread: https://www.postgresql.org/message-id/CAJ7c6TMDGmAiozDjJQ3%3DP3cd-1BidC_GpitcAuU0aqq-r1eSoQ%40mail.gmail.com -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 5 Feb 2024 16:14:08 +0900, Michael Paquier wrote: > So, I've looked at all that today, and finished by applying two > patches as of 2889fd23be56 and 95fb5b49024a to get some of the > weirdness with the workhorse routines out of the way. Thanks! > As this is called within the OneRow routine, I can live with that. If > there is an opposition to that, we could just attach it within the > routines. I don't object the approach. > I am attaching a v12 which is close to what I want it to be, with > much more documentation and comments. There are two things that I've > changed compared to the previous versions though: > 1) I have added a callback to set up the input and output functions > rather than attach that in the Start callback. I'm OK with this. I just don't use them in Apache Arrow COPY FORMAT extension. > - No need for plugins to think about how to allocate this data. v11 > and other versions were doing things the wrong way by allocating this > stuff in the wrong memory context as we switch to the COPY context > when we are in the Start routines. Oh, sorry. I missed it when I moved them. > 2) I have backpedaled on the postpare callback, which did not bring > much in clarity IMO while being a CSV-only callback. Note that we > have in copyfromparse.c more paths that are only for CSV but the past > versions of the patch never cared about that. This makes the text and > CSV implementations much closer to each other, as a result. Ah, sorry. I forgot to eliminate cstate->opts.csv_mode in CopyReadLineText(). The postpare callback is for optimization. If it doesn't improve performance, we don't need to introduce it. We may want to try eliminating cstate->opts.csv_mode in CopyReadLineText() for performance. But we don't need to do this in introducing CopyFromRoutine. We can defer it. So I don't object removing the postpare callback. > Let me know if you have > comments about all that. Here are some comments for the patch: + /* +* Called when COPY FROM is started to set up the input functions +* associated to the relation's attributes writing to. `fmgr_info` can be fmgr_info -> finfo +* optionally filled to provide the catalog information of the input +* function. `typioparam` can be optinally filled to define the OID of optinally -> optionally +* the type to pass to the input function. `atttypid` is the OID of data +* type used by the relation's attribute. +*/ + void(*CopyFromInFunc) (Oid atttypid, FmgrInfo *finfo, + Oid *typioparam); How about passing CopyFromState cstate too like other callbacks for consistency? + /* +* Copy one row to a set of `values` and `nulls` of size tupDesc->natts. +* +* 'econtext' is used to evaluate default expression for each column that +* is either not read from the file or is using the DEFAULT option of COPY or is -> or (I'm not sure...) +* FROM. It is NULL if no default values are used. +* +* Returns false if there are no more tuples to copy. +*/ + bool(*CopyFromOneRow) (CopyFromState cstate, ExprContext *econtext, + Datum *values, bool *nulls); +typedef struct CopyToRoutine +{ + /* +* Called when COPY TO is started to set up the output functions +* associated to the relation's attributes reading from. `fmgr_info` can fmgr_info -> finfo +* be optionally filled. `atttypid` is the OID of data type used by the +* relation's attribute. +*/ + void(*CopyToOutFunc) (Oid atttypid, FmgrInfo *finfo); How about passing CopyToState cstate too like other callbacks for consistency? @@ -200,4 +204,10 @@ extern void ReceiveCopyBinaryHeader(CopyFromState cstate); extern int CopyReadAttributesCSV(CopyFromState cstate); extern int CopyReadAttributesText(CopyFromState cstate); +/* Callbacks for CopyFromRoutine->OneRow */ CopyFromRoutine->OneRow -> CopyFromRoutine->CopyFromOneRow +extern bool CopyFromTextOneRow(CopyFromState cstate, ExprContext *econtext, + Datum *values, bool *nulls); +extern bool CopyFromBinaryOneRow(CopyFromState cstate, ExprContext *econtext, +Datum *values, bool *nulls); + #endif /* COPYFROM_INTERNAL_H */ +/* + * CopyFromTextStart CopyFromTextStart -> CopyFromBinaryStart + * + * Start of COPY FROM for binary format. + */ +static void +CopyFromBinaryStart(CopyFromState cstate, TupleDesc tupDesc) +{ + /* Read and verify binary header */ + Rece
meson: catalog/syscache_ids.h isn't installed
Hi, catalog/syscache_ids.h is referred by utils/syscache.h but it's not installed with Meson. FYI: * 9b1a6f50b91dca6610932650c8c81a3c924259f9 It uses catalog/syscache_ids.h in utils/syscache.h but catalog/syscache_ids.h isn't installed. * 6eb6086faa3842c2a38a1ee2f97bf9a42ce27610 It changes a Makefile to install catalog/syscache_ids.h but it doesn't change meson.build. diff --git a/src/include/catalog/meson.build b/src/include/catalog/meson.build index 6be76dca1d..0bf6e112d5 100644 --- a/src/include/catalog/meson.build +++ b/src/include/catalog/meson.build @@ -114,7 +114,7 @@ output_install = [ dir_data, dir_data, dir_include_server / 'catalog', - false, + dir_include_server / 'catalog', dir_include_server / 'catalog', dir_include_server / 'catalog', ] Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 2 Feb 2024 17:04:28 +0900, Michael Paquier wrote: > One idea I was considering is whether we should use a special value in > the "format" DefElem, say "custom:$my_custom_format" where it would be > possible to bypass the formay check when processing options and find > the routines after processing all the options. I'm not wedded to > that, but attaching the routines to the state data is IMO the correct > thing, because this has nothing to do with CopyFormatOptions. Thanks for sharing your idea. Let's discuss how to support custom options after we complete the current performance changes. >> I'm OK with the approach. But how about adding the extra >> callbacks to Copy{From,To}StateData not >> Copy{From,To}Routines like CopyToStateData::data_dest_cb and >> CopyFromStateData::data_source_cb? They are only needed for >> "text" and "csv". So we don't need to add them to >> Copy{From,To}Routines to keep required callback minimum. > > And set them in cstate while we are in the Start routine, right? I imagined that it's done around the following part: @@ -1418,6 +1579,9 @@ BeginCopyFrom(ParseState *pstate, /* Extract options from the statement node tree */ ProcessCopyOptions(pstate, &cstate->opts, true /* is_from */ , options); + /* Set format routine */ + cstate->routine = CopyFromGetRoutine(cstate->opts); + /* Process the target relation */ cstate->rel = rel; Example1: /* Set format routine */ cstate->routine = CopyFromGetRoutine(cstate->opts); if (!cstate->opts.binary) if (cstate->opts.csv_mode) cstate->copy_read_attributes = CopyReadAttributesCSV; else cstate->copy_read_attributes = CopyReadAttributesText; Example2: static void CopyFromSetRoutine(CopyFromState cstate) { if (cstate->opts.csv_mode) { cstate->routine = &CopyFromRoutineCSV; cstate->copy_read_attributes = CopyReadAttributesCSV; } else if (cstate.binary) cstate->routine = &CopyFromRoutineBinary; else { cstate->routine = &CopyFromRoutineText; cstate->copy_read_attributes = CopyReadAttributesText; } } BeginCopyFrom() { /* Set format routine */ CopyFromSetRoutine(cstate); } But I don't object your original approach. If we have the extra callbacks in Copy{From,To}Routines, I just don't use them for my custom format extension. >> What is the better next action for us? Do you want to >> complete the WIP v11 patch set by yourself (and commit it)? >> Or should I take over it? > > I was planning to work on that, but wanted to be sure how you felt > about the problem with text and csv first. OK. My opinion is the above. I have an idea how to implement it but it's not a strong idea. You can choose whichever you like. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 2 Feb 2024 15:27:15 +0800, Junwang Zhao wrote: > I agree CopyToRoutine should be placed into CopyToStateData, but > why set it after ProcessCopyOptions, the implementation of > CopyToGetRoutine doesn't make sense if we want to support custom > format in the future. > > Seems the refactor of v11 only considered performance but not > *extendable copy format*. Right. We focus on performance for now. And then we will focus on extendability. [1] [1] https://www.postgresql.org/message-id/flat/20240130.171511.2014195814665030502.kou%40clear-code.com#757a48c273f140081656ec8eb69f502b > If V7 and V10 have no performance reduction, then I think V6 is also > good with performance, since most of the time goes to CopyToOneRow > and CopyFromOneRow. Don't worry. I'll re-submit changes in the v6 patch set again after the current patch set that focuses on performance is merged. > I just think we should take the *extendable* into consideration at > the beginning. Introducing Copy{To,From}Routine is also valuable for extendability. We can improve extendability later. Let's focus on only performance for now to introduce Copy{To,From}Routine. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 2 Feb 2024 15:21:31 +0900, Michael Paquier wrote: > I have done a review of v10, see v11 attached which is still WIP, with > the patches for COPY TO and COPY FROM merged together. Note that I'm > thinking to merge them into a single commit. OK. I don't have a strong opinion for commit unit. > @@ -74,11 +75,11 @@ typedef struct CopyFormatOptions > boolconvert_selectively;/* do selective binary conversion? */ > CopyOnErrorChoice on_error; /* what to do when error happened */ > List *convert_select; /* list of column names (can be NIL) */ > +constCopyToRoutine *to_routine;/* callback routines for COPY > TO */ > } CopyFormatOptions; > > Adding the routines to the structure for the format options is in my > opinion incorrect. The elements of this structure are first processed > in the option deparsing path, and then we need to use the options to > guess which routines we need. This was discussed with Sawada-san a bit before. [1][2] [1] https://www.postgresql.org/message-id/flat/CAD21AoBmNiWwrspuedgAPgbAqsn7e7NoZYF6gNnYBf%2BgXEk9Mg%40mail.gmail.com#bfd19262d261c67058fdb8d64e6a723c [2] https://www.postgresql.org/message-id/flat/20240130.144531.1257430878438173740.kou%40clear-code.com#fc55392d77f400fc74e42686fe7e348a I kept the routines in CopyFormatOptions for custom option processing. But I should have not cared about it in this patch set because this patch set doesn't include custom option processing. So I'm OK that we move the routines to Copy{From,To}StateData. > This also led to a strange separation with > ProcessCopyOptionFormatFrom() and ProcessCopyOptionFormatTo() to fit > the hole in-between. They also for custom option processing. We don't need to care about them in this patch set too. > copyapi.h needs more documentation, like what is expected for > extension developers when using these, what are the arguments, etc. I > have added what I had in mind for now. Thanks! I'm not good at writing documentation in English... > +typedef char *(*PostpareColumnValue) (CopyFromState cstate, char *string, > int m); > > CopyReadAttributes and PostpareColumnValue are also callbacks specific > to text and csv, except that they are used within the per-row > callbacks. The same can be said about CopyAttributeOutHeaderFunction. > It seems to me that it would be less confusing to store pointers to > them in the routine structures, where the final picture involves not > having multiple layers of APIs like CopyToCSVStart, > CopyAttributeOutTextValue, etc. These *have* to be documented > properly in copyapi.h, and this is much easier now that cstate stores > the routine pointers. That would also make simpler function stacks. > Note that I have not changed that in the v11 attached. > > This business with the extra callbacks required for csv and text is my > main point of contention, but I'd be OK once the model of the APIs is > more linear, with everything in Copy{From,To}State. The changes would > be rather simple, and I'd be OK to put my hands on it. Just, > Sutou-san, would you agree with my last point about these extra > callbacks? I'm OK with the approach. But how about adding the extra callbacks to Copy{From,To}StateData not Copy{From,To}Routines like CopyToStateData::data_dest_cb and CopyFromStateData::data_source_cb? They are only needed for "text" and "csv". So we don't need to add them to Copy{From,To}Routines to keep required callback minimum. What is the better next action for us? Do you want to complete the WIP v11 patch set by yourself (and commit it)? Or should I take over it? Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 2 Feb 2024 06:51:02 +0900, Michael Paquier wrote: > On Fri, Feb 02, 2024 at 12:19:51AM +0900, Sutou Kouhei wrote: >> Here are some numbers on my local machine (Note that my >> local machine isn't suitable for benchmark as I said >> before. Each number is median of "\watch 15" results): >>> >> I'll measure again on my local machine later. I'll stop >> other processes such as Web browser, editor and so on as >> much as possible when I do. > > Thanks for compiling some numbers. This is showing a lot of variance. > Expecially, these two lines in table 2 are showing surprising results > for v7: > direction format n_columns master v7v10 >fromcsv 1917.973 1695.401871.991 >from binary 1841.104 1422.012820.786 Here are more numbers: 1: direction format n_columns master v7v10 to text 1 1053.844978.998956.575 tocsv 1 1091.316 1020.584 1098.314 to binary 1 1034.685969.224980.458 to text 10 4216.264 3886.515 4111.417 tocsv 10 4649.228 4530.882 4682.988 to binary 10 4219.2284189.99 4211.942 from text 1851.697896.968890.458 fromcsv 1890.229936.231 887.15 from binary 1784.407 817.07938.736 from text 10 2549.056 2233.899 2630.892 fromcsv 10 2809.441 2868.411 2895.196 from binary 10 2985.674 3027.522 3397.5 2: direction format n_columns master v7v10 to text 1 1013.764 1011.968940.855 tocsv 1 1060.431 1065.4681040.68 to binary 1 1013.652 1009.956965.675 to text 10 4411.484 4031.571 3896.836 tocsv 10 4739.6254715.81 4631.002 to binary 10 4374.077 4357.942 4227.215 from text 1955.078922.346866.222 fromcsv 1 1040.717986.524905.657 from binary 1849.316864.859833.152 from text 10 2703.209 2361.651 2533.992 fromcsv 102990.35 3059.167 2930.632 from binary 10 3008.375 3368.714 3055.723 3: direction format n_columns master v7v10 to text 1 1084.756 1003.822994.409 tocsv 1 1092.4 1062.536 1079.027 to binary 1 1046.774994.168993.633 to text 104363.51 3978.205 4124.359 tocsv 10 4866.762 4616.001 4715.052 to binary 10 4382.412 4363.269 4213.456 from text 1852.976907.315860.749 fromcsv 1925.187962.632897.833 from binary 1824.997897.046828.231 from text 102591.07 2358.541 2540.431 fromcsv 10 2907.033 3018.486 2915.997 from binary 10 3069.0273209.21 3119.128 Other processes are stopped while I measure them. But I'm not sure these numbers are more reliable than before... > I am going to try to plug in some rusage() calls in the backend for > the COPY paths. I hope that gives more precision about the backend > activity. I'll post that with more numbers. Thanks. It'll help us. -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, Thanks for preparing benchmark. In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 1 Feb 2024 12:49:59 +0900, Michael Paquier wrote: > On Thu, Feb 01, 2024 at 10:57:58AM +0900, Michael Paquier wrote: >> And here are the results I get for text and binary (ms, average of 15 >> queries after discarding the three highest and three lowest values): >> test | master | v7 | v10 >> -++--+-- >> from_bin_1col | 1575 | 1546 | 1575 >> from_bin_10col | 5364 | 5208 | 5230 >> from_text_1col | 1690 | 1715 | 1684 >> from_text_10col | 4875 | 4793 | 4757 >> to_bin_1col | 1717 | 1730 | 1731 >> to_bin_10col| 7728 | 7707 | 7513 >> to_text_1col| 1710 | 1730 | 1698 >> to_text_10col | 5998 | 5960 | 5987 > > Here are some numbers from a second local machine: > test | master | v7 | v10 > -++--+-- > from_bin_1col | 508| 467 | 461 > from_bin_10col | 2192 | 2083 | 2098 > from_text_1col | 510| 499 | 517 > from_text_10col | 1970 | 1678 | 1654 > to_bin_1col | 575| 577 | 573 > to_bin_10col| 2680 | 2678 | 2722 > to_text_1col| 516| 506 | 527 > to_text_10col | 2250 | 2245 | 2235 > > This is confirming a speedup with COPY FROM for both text and binary, > with more impact with a larger number of attributes. That's harder to > conclude about COPY TO in both cases, but at least I'm not seeing any > regression even with some variance caused by what looks like noise. > We need more numbers from more people. Sutou-san or Sawada-san, or > any volunteers? Here are some numbers on my local machine (Note that my local machine isn't suitable for benchmark as I said before. Each number is median of "\watch 15" results): 1: direction format n_columns master v7v10 to text 1 1077.254 1016.953 1028.434 tocsv 11079.88 1055.5451053.95 to binary 1 1051.2471033.931003.44 to text 10 4373.168 3980.4423955.94 tocsv 10 4753.842 4719.2 4677.643 to binary 10 4598.374 4431.238 4285.757 from text 1875.729916.526869.283 fromcsv 1909.355 1001.277918.655 from binary 1872.943907.778859.433 from text 10 2594.429 2345.292 2587.603 fromcsv 10 2968.972 3039.544 2964.468 from binary 103072.01 3109.267 3093.983 2: direction format n_columns master v7v10 to text 1 1061.908988.768978.291 tocsv 1 1095.109 1037.015 1041.613 to binary 1 1076.992 1000.212983.318 to text 10 4336.517 3901.833 3841.789 tocsv 10 4679.411 4640.975 4570.774 to binary 104465.04 4508.063 4261.749 from text 1866.689 917.54830.417 fromcsv 1917.973 1695.401871.991 from binary 1841.104 1422.012820.786 from text 10 2523.607 3147.738 2517.505 fromcsv 10 2917.018 3042.685 2950.338 from binary 10 2998.051 3128.542 3018.954 3: direction format n_columns master v7v10 to text 1 1021.168 1031.183962.945 tocsv 1 1076.549 1069.661 1060.258 to binary 1 1024.611 1022.143975.768 to text 104327.24 3936.703 4049.893 tocsv 10 4620.436 4531.676 4685.672 to binary 10 4457.165 4390.992 4301.463 from text 1887.532907.365888.892 fromcsv 1945.1671012.29895.921 from binary 1 853.06854.652849.661 from text 10 2660.509 2304.256 2527.071 fromcsv 10 2913.644 2968.204 2935.081 from binary 10 3020.812 3081.162 3090.803 I'll measure again on my local machine later. I'll stop other processes such as Web browser, editor and so on as much as possible when I do. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 30 Jan 2024 17:37:35 +0900, Michael Paquier wrote: >> We use defel->location for an error message. (We don't need >> to set location for the default "text" DefElem.) > > Yeah, but you should not need to have this error in the paths that set > the callback routines in opts_out if the same validation happens a few > lines before, in copy.c. Ah, yes. defel->location is used in later patches. For example, it's used when a COPY handler for the specified FORMAT isn't found. > I am really worried about the complexities > this thread is getting into because we are trying to shape the > callbacks in the most generic way possible based on *two* use cases. > This is going to be a never-ending discussion. I'd rather get some > simple basics, and then we can discuss if tweaking the callbacks is > really necessary or not. Even after introducing the pg_proc lookups > to get custom callbacks. I understand your concern. Let's introduce minimal callbacks as the first step. I think that we completed our design discussion for this feature. We can choose minimal callbacks based on the discussion. > The custom options don't seem like an absolute strong > requirement for the first shot with the callbacks or even the > possibility to retrieve the callbacks from a function call. I mean, > you could provide some control with SET commands and a few GUCs, at > least, even if that would be strange. Manipulations with a list of > DefElems is the intuitive way to have custom options at query level, > but we also have to guess the set of callbacks from this list of > DefElems coming from the query. You see my point, I am not sure > if it would be the best thing to process twice the options, especially > when it comes to decide if a DefElem should be valid or not depending > on the callbacks used. Or we could use a kind of "special" DefElem > where we could store a set of key:value fed to a callback :) Interesting. Let's remove custom options support from the initial minimal callbacks. >> Can I prepare the v10 patch set as "the v7 patch set" + "the >> removal of the "if" checks in NextCopyFromRawFields()"? >> (+ reverting opts_out->{csv_mode,binary} changes in >> ProcessCopyOptions().) > > Yep, if I got it that would make sense to me. If you can do that, > that would help quite a bit. :) I've prepared the v10 patch set. Could you try this? Changes since the v7 patch set: 0001: * Remove CopyToProcessOption() callback * Remove CopyToGetFormat() callback * Revert passing CopyToState to ProcessCopyOptions() * Revert moving "opts_out->{csv_mode,binary} = true" to ProcessCopyOptionFormatTo() * Change to receive "const char *format" instead "DefElem *defel" by ProcessCopyOptionFormatTo() 0002: * Remove CopyFromProcessOption() callback * Remove CopyFromGetFormat() callback * Change to receive "const char *format" instead "DefElem *defel" by ProcessCopyOptionFormatFrom() * Remove "if (cstate->opts.csv_mode)" branches from NextCopyFromRawFields() FYI: Here are Copy{From,To}Routine in the v10 patch set. I think that only Copy{From,To}OneRow are minimal callbacks for the performance gain. But can we keep Copy{From,To}Start and Copy{From,To}End for consistency? We can remove a few {csv_mode,binary} conditions by Copy{From,To}{Start,End}. It doesn't depend on the number of COPY target tuples. So they will not affect performance. /* Routines for a COPY FROM format implementation. */ typedef struct CopyFromRoutine { /* * Called when COPY FROM is started. This will initialize something and * receive a header. */ void(*CopyFromStart) (CopyFromState cstate, TupleDesc tupDesc); /* Copy one row. It returns false if no more tuples. */ bool(*CopyFromOneRow) (CopyFromState cstate, ExprContext *econtext, Datum *values, bool *nulls); /* Called when COPY FROM is ended. This will finalize something. */ void(*CopyFromEnd) (CopyFromState cstate); } CopyFromRoutine; /* Routines for a COPY TO format implementation. */ typedef struct CopyToRoutine { /* Called when COPY TO is started. This will send a header. */ void(*CopyToStart) (CopyToState cstate, TupleDesc tupDesc); /* Copy one row for COPY TO. */ void(*CopyToOneRow) (CopyToState cstate, TupleTableSlot *slot); /* Called when COPY TO is ended. This will send a trailer. */ void(*CopyToEnd) (CopyToState cstate); }
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 30 Jan 2024 16:20:54 +0900, Michael Paquier wrote: >>> +if (!format_specified) >>> +/* Set the default format. */ >>> +ProcessCopyOptionFormatTo(pstate, opts_out, cstate, NULL); >>> + >>> >>> I think we can pass "text" in this case instead of NULL. That way, >>> ProcessCopyOptionFormatTo doesn't need to handle NULL case. >> >> Yes, we can do it. But it needs a DefElem allocation. Is it >> acceptable? > > I don't think that there is a need for a DelElem at all here? We use defel->location for an error message. (We don't need to set location for the default "text" DefElem.) >While I > am OK with the choice of calling CopyToInit() in the > ProcessCopyOption*() routines that exist to keep the set of callbacks > local to copyto.c and copyfrom.c, I think that this should not bother > about setting opts_out->csv_mode or opts_out->csv_mode but just set > the opts_out->{to,from}_routine callbacks. OK. I'll keep opts_out->{csv_mode,binary} in copy.c. > Now, all the Init() callbacks are empty for the > in-core callbacks, so I think that we should just remove it entirely > for now. Let's keep the core patch a maximum simple. It is always > possible to build on top of it depending on what people need. It's > been mentioned that JSON would want that, but this also proves that we > just don't care about that for all the in-core callbacks, as well. I > would choose a minimalistic design for now. OK. Let's remove Init() callbacks from the first patch set. > I would be really tempted to put my hands on this patch to put into > shape a minimal set of changes because I'm caring quite a lot about > the performance gains reported with the removal of the "if" checks in > the per-row callbacks, and that's one goal of this thread quite > independent on the extensibility. Sutou-san, would you be OK with > that? Yes, sure. (We want to focus on the performance gains in the first patch set and then focus on extensibility again, right?) For the purpose, I think that the v7 patch set is more suitable than the v9 patch set. The v7 patch set doesn't include Init() callbacks, custom options validation support or extra Copy{In,Out}Response support. But the v7 patch set misses the removal of the "if" checks in NextCopyFromRawFields() that exists in the v9 patch set. I'm not sure how much performance will improve by this but it may be worth a try. Can I prepare the v10 patch set as "the v7 patch set" + "the removal of the "if" checks in NextCopyFromRawFields()"? (+ reverting opts_out->{csv_mode,binary} changes in ProcessCopyOptions().) Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 30 Jan 2024 11:11:59 +0900, Masahiko Sawada wrote: > --- > +if (!format_specified) > +/* Set the default format. */ > +ProcessCopyOptionFormatTo(pstate, opts_out, cstate, NULL); > + > > I think we can pass "text" in this case instead of NULL. That way, > ProcessCopyOptionFormatTo doesn't need to handle NULL case. Yes, we can do it. But it needs a DefElem allocation. Is it acceptable? > We need curly brackets for this "if branch" as follows: > > if (!format_specifed) > { > /* Set the default format. */ > ProcessCopyOptionFormatTo(pstate, opts_out, cstate, NULL); > } Oh, sorry. I assumed that pgindent adjusts the style too. > --- > +/* Process not built-in options. */ > +foreach(option, unknown_options) > +{ > +DefElem*defel = lfirst_node(DefElem, option); > +bool processed = false; > + > +if (!is_from) > +processed = > opts_out->to_routine->CopyToProcessOption(cstate, defel); > +if (!processed) > +ereport(ERROR, > +(errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("option \"%s\" not > recognized", > +defel->defname), > + parser_errposition(pstate, > defel->location))); > +} > +list_free(unknown_options); > > I think we can check the duplicated options in the core as we discussed. Oh, sorry. I missed the part. I'll implement it. > --- > +static void > +CopyToTextBasedInit(CopyToState cstate) > +{ > +} > > and > > +static void > +CopyToBinaryInit(CopyToState cstate) > +{ > +} > > Do we really need separate callbacks for the same behavior? I think we > can have a common init function say CopyToBuitinInit() that does > nothing. Or we can make the init callback optional. > > The same is true for process-option callback. OK. I'll make them optional. > --- > List *convert_select; /* list of column names (can be NIL) */ > +const CopyToRoutine *to_routine; /* callback > routines for COPY TO */ > } CopyFormatOptions; > > I think CopyToStateData is a better place to have CopyToRoutine. > copy_data_dest_cb is also there. We can do it but ProcessCopyOptions() accepts NULL CopyToState for file_fdw. Can we create an empty CopyToStateData internally like we did for opts_out in ProcessCopyOptions()? (But it requires exporting CopyToStateData. We'll export it in a later patch but it's not yet in 0001.) > The 0002 patch replaces the options checks with > ProcessCopyOptionFormatFrom(). However, both > ProcessCopyOptionFormatTo() and ProcessCOpyOptionFormatFrom() would > set format-related options such as opts_out->csv_mode etc, which seems > not elegant. IIUC the reason why we process only the "format" option > first is to set the callback functions and call the init callback. So > I think we don't necessarily need to do both setting callbacks and > setting format-related options together. Probably we can do only the > callback stuff first and then set format-related options in the > original place we used to do? If we do it, we need to write the (strcmp(format, "csv") == 0) condition in copyto.c and copy.c. I wanted to avoid it. I think that the duplication (setting opts_out->csv_mode in copyto.c and copyfrom.c) is not a problem. But it's not a strong opinion. If (strcmp(format, "csv") == 0) duplication is better than opts_out->csv_mode = true duplication, I'll do it. BTW, if we want to make the CSV format implementation more modularized, we will remove opts_out->csv_mode, move CSV related options to CopyToCSVProcessOption() and keep CSV related options in its opaque space. For example, opts_out->force_quote exists in COPY TO opaque space but doesn't exist in COPY FROM opaque space because it's not used in COPY FROM. > +static void > +CopyToTextBasedFillCopyOutResponse(CopyToState cstate, StringInfoData *buf) > +{ > +int16 format = 0; > +intnatts = list_length(cstate->attnumlist); > +inti; > + > +pq_sendbyte(buf, format); /* overall format */ > +pq_sendint16(buf, natts); > +for (i = 0; i < natts; i++) > +pq_sendint16(buf, format); /* per-column formats */ > +} > > This function and CopyToBinaryFillCopyOutResponse() fill three things: > overall format, the number of columns, and per-column formats. While > this approach is flexible, extensions will have to understand the > format of CopyOutResponse message. An alternative is to have one or > more callbacks that return these three things. Yes, we can choose the approach. I don't have a strong opinion on which approach to choo
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 29 Jan 2024 11:37:07 +0800, Junwang Zhao wrote: >> > > Does it make sense to pass only non-builtin options to the custom >> > > format callback after parsing and evaluating the builtin options? That >> > > is, we parse and evaluate only the builtin options and populate >> > > opts_out first, then pass each rest option to the custom format >> > > handler callback. The callback can refer to the builtin option values. >> >> What I imagined is that while parsing the all specified options, we >> evaluate builtin options and we add non-builtin options to another >> list. Then when parsing a non-builtin option, we check if this option >> already exists in the list. If there is, we raise the "option %s not >> recognized" error.". Once we complete checking all options, we pass >> each option in the list to the callback. I implemented this idea and the following ideas: 1. Add init callback for initialization 2. Change GetFormat() to FillCopyXXXResponse() because JSON format always use 1 column 3. FROM only: Eliminate more cstate->opts.csv_mode branches (This is for performance.) See the attached v9 patch set for details. Changes since v7: 0001: * Move CopyToProcessOption() calls to the end of ProcessCopyOptions() for easy to option validation * Add CopyToState::CopyToInit() and call it in ProcessCopyOptionFormatTo() * Change CopyToState::CopyToGetFormat() to CopyToState::CopyToFillCopyOutResponse() and use it in SendCopyBegin() 0002: * Move CopyFromProcessOption() calls to the end of ProcessCopyOptions() for easy to option validation * Add CopyFromState::CopyFromInit() and call it in ProcessCopyOptionFormatFrom() * Change CopyFromState::CopyFromGetFormat() to CopyFromState::CopyFromFillCopyOutResponse() and use it in ReceiveCopyBegin() * Rename NextCopyFromRawFields() to NextCopyFromRawFieldsInternal() and pass the read attributes callback explicitly to eliminate more cstate->opts.csv_mode branches Thanks, -- kou >From c136833f4a385574474b246a381014abeb631377 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 26 Jan 2024 16:46:51 +0900 Subject: [PATCH v9 1/2] Extract COPY TO format implementations This is a part of making COPY format extendable. See also these past discussions: * New Copy Formats - avro/orc/parquet: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com * Make COPY extendable in order to support Parquet and other formats: https://www.postgresql.org/message-id/flat/CAJ7c6TM6Bz1c3F04Cy6%2BSzuWfKmr0kU8c_3Stnvh_8BR0D6k8Q%40mail.gmail.com This doesn't change the current behavior. This just introduces CopyToRoutine, which just has function pointers of format implementation like TupleTableSlotOps, and use it for existing "text", "csv" and "binary" format implementations. Note that CopyToRoutine can't be used from extensions yet because CopySend*() aren't exported yet. Extensions can't send formatted data to a destination without CopySend*(). They will be exported by subsequent patches. Here is a benchmark result with/without this change because there was a discussion that we should care about performance regression: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us > I think that step 1 ought to be to convert the existing formats into > plug-ins, and demonstrate that there's no significant loss of > performance. You can see that there is no significant loss of performance: Data: Random 32 bit integers: CREATE TABLE data (int32 integer); SELECT setseed(0.29); INSERT INTO data SELECT random() * 1 FROM generate_series(1, ${n_records}); The number of records: 100K, 1M and 10M 100K without this change: format,elapsed time (ms) text,10.561 csv,10.868 binary,10.287 100K with this change: format,elapsed time (ms) text,9.962 csv,10.453 binary,9.473 1M without this change: format,elapsed time (ms) text,103.265 csv,109.789 binary,104.078 1M with this change: format,elapsed time (ms) text,98.612 csv,101.908 binary,94.456 10M without this change: format,elapsed time (ms) text,1060.614 csv,1065.272 binary,1025.875 10M with this change: format,elapsed time (ms) text,1020.050 csv,1031.279 binary,954.792 --- contrib/file_fdw/file_fdw.c | 2 +- src/backend/commands/copy.c | 82 - src/backend/commands/copyfrom.c | 2 +- src/backend/commands/copyto.c | 587 +++- src/include/commands/copy.h | 8 +- src/include/commands/copyapi.h | 62 6 files changed, 560 insertions(+), 183 deletions(-) create mode 100644 src/include/comm
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Sat, 27 Jan 2024 14:15:02 +0800, Junwang Zhao wrote: > I have been working on a *COPY TO JSON* extension since yesterday, > which is based on your V6 patch set, I'd like to give you more input > so you can make better decisions about the implementation(with only > pg-copy-arrow you might not get everything considered). Thanks! > 0009 is some changes made by me, I changed CopyToGetFormat to > CopyToSendCopyBegin because pg_copy_json need to send different bytes > in SendCopyBegin, get the format code along is not enough Oh, I haven't cared about the case. How about the following API instead? static void SendCopyBegin(CopyToState cstate) { StringInfoData buf; pq_beginmessage(&buf, PqMsg_CopyOutResponse); cstate->opts.to_routine->CopyToFillCopyOutResponse(cstate, &buf); pq_endmessage(&buf); cstate->copy_dest = COPY_FRONTEND; } static void CopyToJsonFillCopyOutResponse(CopyToState cstate, StringInfoData &buf) { int16 format = 0; pq_sendbyte(&buf, format); /* overall format */ /* * JSON mode is always one non-binary column */ pq_sendint16(&buf, 1); pq_sendint16(&buf, format); } > 00010 is the pg_copy_json extension, I think this should be a good > case which can utilize the *extendable copy format* feature It seems that it's convenient that we have one more callback for initializing CopyToState::opaque. It's called only once when Copy{To,From}Routine is chosen: typedef struct CopyToRoutine { void(*CopyToInit) (CopyToState cstate); ... }; void ProcessCopyOptions(ParseState *pstate, CopyFormatOptions *opts_out, bool is_from, void *cstate, List *options) { ... foreach(option, options) { DefElem*defel = lfirst_node(DefElem, option); if (strcmp(defel->defname, "format") == 0) { ... opts_out->to_routine = &CopyToRoutineXXX; opts_out->to_routine->CopyToInit(cstate); ... } } ... } > maybe we > should delete copy_test_format if we have this extension as an > example? I haven't read the COPY TO format json thread[1] carefully (sorry), but we may add the JSON format as a built-in format. If we do it, copy_test_format is useful to test the extension API. [1] https://www.postgresql.org/message-id/flat/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 26 Jan 2024 16:41:50 +0800, Junwang Zhao wrote: > CopyToProcessOption()/CopyFromProcessOption() can only handle > single option, and store the options in the opaque field, but it can not > check the relation of two options, for example, considering json format, > the `header` option can not be handled by these two functions. > > I want to find a way when the user specifies the header option, customer > handler can error out. Ah, you want to use a built-in option (such as "header") value from a custom handler, right? Hmm, it may be better that we call CopyToProcessOption()/CopyFromProcessOption() for all options including built-in options. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 26 Jan 2024 08:35:19 +0900, Michael Paquier wrote: >> OK. How about the following for the fetch function >> signature? >> >> extern CopyToRoutine *GetBuiltinCopyToRoutine(const char *format); > > Or CopyToRoutineGet()? I am not wedded to my suggestion, got a bad > history with naming things around here. Thanks for the suggestion. I rethink about this and use the following: +extern void ProcessCopyOptionFormatTo(ParseState *pstate, CopyFormatOptions *opts_out, DefElem *defel); It's not a fetch function. It sets CopyToRoutine opts_out instead. But it hides CopyToRoutine* to copyto.c. Is it acceptable? >> OK. I'll focus on 0001 and 0005 for now. I'll restart >> 0002-0004/0006-0008 after 0001 and 0005 are accepted. > > Once you get these, I'd be interested in re-doing an evaluation of > COPY TO and more tests with COPY FROM while running Postgres on > scissors. One thing I was thinking to use here is my blackhole_am for > COPY FROM: > https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am Thanks! Could you evaluate the attached patch set with COPY FROM? I attach v7 patch set. It includes only the 0001 and 0005 parts in v6 patch set because we focus on them for now. 0001: This is based on 0001 in v6. Changes since v6: * Fix comment style * Hide CopyToRoutine{Text,CSV,Binary} * Add more comments * Eliminate "if (cstate->opts.csv_mode)" branches from "text" and "csv" callbacks * Remove CopyTo*_function typedefs * Update benchmark results in commit message but the results are measured on my environment that isn't suitable for accurate benchmark 0002: This is based on 0005 in v6. Changes since v6: * Fix comment style * Hide CopyFromRoutine{Text,CSV,Binary} * Add more comments * Eliminate a "if (cstate->opts.csv_mode)" branch from "text" and "csv" callbacks * NOTE: We can eliminate more "if (cstate->opts.csv_mode)" branches such as one in NextCopyFromRawFields(). Should we do it in this feature improvement (make COPY format extendable)? Can we defer this as a separated improvement? * Remove CopyFrom*_function typedefs Thanks, -- kou >From 3e75129c2e9d9d34eebb6ef31b4fe6579f9eb02d Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 26 Jan 2024 16:46:51 +0900 Subject: [PATCH v7 1/2] Extract COPY TO format implementations This is a part of making COPY format extendable. See also these past discussions: * New Copy Formats - avro/orc/parquet: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com * Make COPY extendable in order to support Parquet and other formats: https://www.postgresql.org/message-id/flat/CAJ7c6TM6Bz1c3F04Cy6%2BSzuWfKmr0kU8c_3Stnvh_8BR0D6k8Q%40mail.gmail.com This doesn't change the current behavior. This just introduces CopyToRoutine, which just has function pointers of format implementation like TupleTableSlotOps, and use it for existing "text", "csv" and "binary" format implementations. Note that CopyToRoutine can't be used from extensions yet because CopySend*() aren't exported yet. Extensions can't send formatted data to a destination without CopySend*(). They will be exported by subsequent patches. Here is a benchmark result with/without this change because there was a discussion that we should care about performance regression: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us > I think that step 1 ought to be to convert the existing formats into > plug-ins, and demonstrate that there's no significant loss of > performance. You can see that there is no significant loss of performance: Data: Random 32 bit integers: CREATE TABLE data (int32 integer); SELECT setseed(0.29); INSERT INTO data SELECT random() * 1 FROM generate_series(1, ${n_records}); The number of records: 100K, 1M and 10M 100K without this change: format,elapsed time (ms) text,10.561 csv,10.868 binary,10.287 100K with this change: format,elapsed time (ms) text,9.962 csv,10.453 binary,9.473 1M without this change: format,elapsed time (ms) text,103.265 csv,109.789 binary,104.078 1M with this change: format,elapsed time (ms) text,98.612 csv,101.908 binary,94.456 10M without this change: format,elapsed time (ms) text,1060.614 csv,1065.272 binary,1025.875 10M with this change: format,elapsed time (ms) text,1020.050 csv,1031.279 binary,954.792 --- contrib/file_fdw/file_fdw.c | 2 +- src/backend/commands/copy.c | 71 ++-- src/backend/commands/copyfrom.c | 2 +- src/backend/commands/copyto.c | 551 +++---
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 26 Jan 2024 16:18:14 +0800, Junwang Zhao wrote: > In the current implementation, there is no way that one can check > incompatibility > options in ProcessCopyOptions, we can postpone the check in CopyFromStart > or CopyToStart, but I think it is a little bit late. Do you think > adding an extra > check for incompatible options hook is acceptable (PFA)? Thanks for the suggestion! But I think that a custom handler can do it in CopyToProcessOption()/CopyFromProcessOption(). What do you think about this? Or could you share a sample COPY TO/FROM WITH() SQL you think? Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 25 Jan 2024 13:36:03 +0900, Masahiko Sawada wrote: > I've experimented with a similar optimization for csv > and text format; have different callbacks for text and csv format and > remove "if (cstate->opts.csv_mode)" branches. I've attached a patch > for that. Here are results: > > HEAD w/ 0001 patch + remove branches: > binary 2824.502 ms > text 2715.264 ms > csv 2803.381 ms > > The numbers look better now. I'm not sure these are within a noise > range but it might be worth considering having different callbacks for > text and csv formats. Wow! Interesting. I tried the approach before but I didn't see any difference by the approach. But it may depend on my environment. I'll import the approach to the next patch set so that others can try the approach easily. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 25 Jan 2024 12:17:55 +0900, Michael Paquier wrote: > +typedef bool (*CopyToProcessOption_function) (CopyToState cstate, DefElem > *defel); > +typedef int16 (*CopyToGetFormat_function) (CopyToState cstate); > +typedef void (*CopyToStart_function) (CopyToState cstate, TupleDesc tupDesc); > +typedef void (*CopyToOneRow_function) (CopyToState cstate, TupleTableSlot > *slot); > +typedef void (*CopyToEnd_function) (CopyToState cstate); > > We don't really need a set of typedefs here, let's put the definitions > in the CopyToRoutine struct instead. OK. I'll do it. > +extern CopyToRoutine CopyToRoutineText; > +extern CopyToRoutine CopyToRoutineCSV; > +extern CopyToRoutine CopyToRoutineBinary; > > All that should IMO remain in copyto.c and copyfrom.c in the initial > patch doing the refactoring. Why not using a fetch function instead > that uses a string in input? Then you can call that once after > parsing the List of options in ProcessCopyOptions(). OK. How about the following for the fetch function signature? extern CopyToRoutine *GetBuiltinCopyToRoutine(const char *format); We may introduce an enum and use it: typedef enum CopyBuiltinFormat { COPY_BUILTIN_FORMAT_TEXT = 0, COPY_BUILTIN_FORMAT_CSV, COPY_BUILTIN_FORMAT_BINARY, } CopyBuiltinFormat; extern CopyToRoutine *GetBuiltinCopyToRoutine(CopyBuiltinFormat format); > +/* All "text" and "csv" options are parsed in ProcessCopyOptions(). We may > + * move the code to here later. */ > Some areas, like this comment, are written in an incorrect format. Oh, sorry. I assumed that the comment style was adjusted by pgindent. I'll use the following style: /* * ... */ > +getTypeBinaryOutputInfo(attr->atttypid, &out_func_oid, &isvarlena); > +fmgr_info(out_func_oid, &cstate->out_functions[attnum - 1]); > > Actually, this split is interesting. It is possible for a custom > format to plug in a custom set of out functions. Did you make use of > something custom for your own stuff? I didn't. My PoC custom COPY format handler for Apache Arrow just handles integer and text for now. It doesn't use cstate->out_functions because cstate->out_functions may not return a valid binary format value for Apache Arrow. So it formats each value by itself. I'll chose one of them for a custom type (that isn't supported by Apache Arrow, e.g. PostGIS types): 1. Report an unsupported error 2. Call output function for Apache Arrow provided by the custom type > Actually, could it make sense to > split the assignment of cstate->out_functions into its own callback? Yes. Because we need to use getTypeBinaryOutputInfo() for "binary" and use getTypeOutputInfo() for "text" and "csv". > Sure, that's part of the start phase, but at least it would make clear > that a custom method *has* to assign these OIDs to work. The patch > implies that as a rule, without a comment that CopyToStart *must* set > up these OIDs. CopyToStart doesn't need to set up them if the handler doesn't use cstate->out_functions. > I think that 0001 and 0005 should be handled first, as pieces > independent of the rest. Then we could move on with 0002~0004 and > 0006~0008. OK. I'll focus on 0001 and 0005 for now. I'll restart 0002-0004/0006-0008 after 0001 and 0005 are accepted. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, Thanks for trying these patches! In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 25 Jan 2024 10:53:58 +0800, jian he wrote: > COPY data TO '/dev/null' WITH (FORMAT csv) \watch count=5 Wow! I didn't know the "\watch count="! I'll use it. > Time: 668.996 ms > Time: 596.254 ms > Time: 592.723 ms > Time: 591.663 ms > Time: 590.803 ms It seems that 5 times isn't enough for this case as Michael said. But thanks for trying! Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In <20240124.144936.67229716500876806@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 24 Jan 2024 14:49:36 +0900 (JST), Sutou Kouhei wrote: > I've implemented custom COPY format feature based on the > current design discussion. See the attached patches for > details. I forgot to mention one note. Documentation isn't included in these patches. I'll write it after all (or some) patches are merged. Is it OK? Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In <10025bac-158c-ffe7-fbec-32b426291...@dunslane.net> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 24 Jan 2024 07:15:55 -0500, Andrew Dunstan wrote: > > On 2024-01-24 We 03:11, Michael Paquier wrote: >> On Wed, Jan 24, 2024 at 02:49:36PM +0900, Sutou Kouhei wrote: >>> For COPY TO: >>> >>> 0001: This adds CopyToRoutine and use it for text/csv/binary >>> formats. No implementation change. This just move codes. >> 10M without this change: >> >> format,elapsed time (ms) >> text,1090.763 >> csv,1136.103 >> binary,1137.141 >> >> 10M with this change: >> >> format,elapsed time (ms) >> text,1082.654 >> csv,1196.991 >> binary,1069.697 >> >> These numbers point out that binary is faster by 6%, csv is slower by >> 5%, while text stays around what looks like noise range. That's not >> negligible. Are these numbers reproducible? If they are, that could >> be a problem for anybody doing bulk-loading of large data sets. I am >> not sure to understand where the improvement for binary comes from by >> reading the patch, but perhaps perf would tell more for each format? >> The loss with csv could be blamed on the extra manipulations of the >> function pointers, likely. > > > I don't think that's at all acceptable. > > We've spent quite a lot of blood sweat and tears over the years to make COPY > fast, and we should not sacrifice any of that lightly. These numbers aren't reproducible. Because these benchmarks executed on my normal machine not a machine only for benchmarking. The machine runs another processes such as editor and Web browser. For example, here are some results with master (94edfe250c6a200d2067b0debfe00b4122e9b11e): Format,N records,Elapsed time (ms) csv,1000,1073.715 csv,1000,1022.830 csv,1000,1073.584 csv,1000,1090.651 csv,1000,1052.259 Here are some results with master + the 0001 patch: Format,N records,Elapsed time (ms) csv,1000,1025.356 csv,1000,1067.202 csv,1000,1014.563 csv,1000,1032.088 csv,1000,1058.110 I uploaded my benchmark script so that you can run the same benchmark on your machine: https://gist.github.com/kou/be02e02e5072c91969469dbf137b5de5 Could anyone try the benchmark with master and master+0001? Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, I've implemented custom COPY format feature based on the current design discussion. See the attached patches for details. I also implemented a PoC COPY format handler for Apache Arrow with this implementation and it worked. https://github.com/kou/pg-copy-arrow The patches implement not only custom COPY TO format feature but also custom COPY FROM format feature. 0001-0004 is for COPY TO and 0005-0008 is for COPY FROM. For COPY TO: 0001: This adds CopyToRoutine and use it for text/csv/binary formats. No implementation change. This just move codes. 0002: This adds support for adding custom COPY TO format by "CREATE FUNCTION ${FORMAT_NAME}". This uses the same approach provided by Sawada-san[1] but this doesn't introduce a wrapper CopyRoutine struct for Copy{To,From}Routine. Because I noticed that a wrapper CopyRoutine struct is needless. Copy handler can just return CopyToRoutine or CopyFromRtouine because both of them have NodeTag. We can distinct a returned struct by the NodeTag. [1] https://www.postgresql.org/message-id/cad21aocunywhird3gapzwe6s9jg1wzxj3cr6vgn36ddhegj...@mail.gmail.com 0003: This exports CopyToStateData. No implementation change except CopyDest enum values. I changed COPY_ prefix to COPY_DEST_ to avoid name conflict with CopySource enum values. This just moves codes. 0004: This adds CopyToState::opaque and exports CopySendEndOfRow(). CopySendEndOfRow() is renamed to CopyToStateFlush(). For COPY FROM: 0005: Same as 0001 but for COPY FROM. This adds CopyFromRoutine and use it for text/csv/binary formats. No implementation change. This just move codes. 0006: Same as 0002 but for COPY FROM. This adds support for adding custom COPY FROM format by "CREATE FUNCTION ${FORMAT_NAME}". 0007: Same as 0003 but for COPY FROM. This exports CopyFromStateData. No implementation change except CopySource enum values. I changed COPY_ prefix to COPY_SOURCE_ to align CopyDest changes in 0003. This just moves codes. 0008: Same as 0004 but for COPY FROM. This adds CopyFromState::opaque and exports CopyReadBinaryData(). CopyReadBinaryData() is renamed to CopyFromStateRead(). Thanks, -- kou In <20240115.152702.2011620917962812379@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 15 Jan 2024 15:27:02 +0900 (JST), Sutou Kouhei wrote: > Hi, > > If there are no more comments for the current design, I'll > start implementing this feature with the following > approaches for "Discussing" items: > >> 3.1 Should we use one function(internal) for COPY TO/FROM >> handlers or two function(internal)s (one is for COPY TO >> handler and another is for COPY FROM handler)? >> [4] > > I'll choose "one function(internal) for COPY TO/FROM handlers". > >> 3.4 Should we export Copy{To,From}State? Or should we just >> provide getters/setters to access Copy{To,From}State >> internal? >> [10] > > I'll export Copy{To,From}State. > > > Thanks, > -- > kou > > In <20240112.144615.157925223373344229@clear-code.com> > "Re: Make COPY format extendable: Extract COPY TO format implementations" > on Fri, 12 Jan 2024 14:46:15 +0900 (JST), > Sutou Kouhei wrote: > >> Hi, >> >> Here is the current summary for a this discussion to make >> COPY format extendable. It's for reaching consensus and >> starting implementing the feature. (I'll start implementing >> the feature once we reach consensus.) If you have any >> opinion, please share it. >> >> Confirmed: >> >> 1.1 Making COPY format extendable will not reduce performance. >> [1] >> >> Decisions: >> >> 2.1 Use separated handler for COPY TO and COPY FROM because >> our COPY TO implementation (copyto.c) and COPY FROM >> implementation (coypfrom.c) are separated. >> [2] >> >> 2.2 Don't use system catalog for COPY TO/FROM handlers. We can >> just use a function(internal) that returns a handler instead. >> [3] >> >> 2.3 The implementation must include documentation. >> [5] >> >> 2.4 The implementation must include test. >> [6] >> >> 2.5 The implementation should be consist of small patches >> for easy to review. >> [6] >> >> 2.7 Copy{To,From}State must have a opaque space for >> handlers. >> [8] >> >> 2.8 Export CopySendData() and CopySendEndOfRow() for COPY TO >> handlers. >> [8] >> >> 2.9 Make "format" in PgMsg_CopyOutResponse message >> extendable. >> [9] >> >> 2.10 Make makeNode() call avoida
meson: Specify -Wformat as a common warning flag for extensions
Hi, I'm an extension developer. If I use PostgreSQL built with Meson, I get the following warning: cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] Because "pg_config --cflags" includes -Wformat-security but doesn't include -Wformat. Can we specify -Wformat as a common warning flag too? If we do it, "pg_config --cflags" includes both of -Wformat-security and -Wformat. So I don't get the warning. Thanks, -- kou >From 0913033512c9b75ee3d2941c89ff8696f3c5f53b Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Mon, 22 Jan 2024 13:51:58 +0900 Subject: [PATCH v1] meson: Specify -Wformat explicitly for extensions We specify -Wformat-security as a common warning flag explicitly. Our common warning flags are used by extensions via pgxs/src/Makefile.global or "pg_config --cflags". If -Wformat-security is used without -Wall/-Wformat, GCC shows the following warning: cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security] We can't assume that all extensions use -Wall/-Wformat. So specifying only -Wformat-security may cause the warning. If we specify -Wformat explicitly, the warning isn't shown. --- meson.build | 8 1 file changed, 8 insertions(+) diff --git a/meson.build b/meson.build index 55184db248..de6ce778fc 100644 --- a/meson.build +++ b/meson.build @@ -1822,6 +1822,14 @@ common_warning_flags = [ '-Wimplicit-fallthrough=3', '-Wcast-function-type', '-Wshadow=compatible-local', + # This is for preventing the "cc1: warning: '-Wformat-security' + # ignored without '-Wformat' [-Wformat-security]" warning. We don't + # need this for PostgreSQL itself. This is just for + # extensions. Extensions use "pg_config --cflags" to build + # themselves. If extensions use only -Wformat-security, the warning + # is shown. If we have this here, extensions use both of -Wformat + # and -Wformat-security. So the warning isn't shown. + '-Wformat', # This was included in -Wall/-Wformat in older GCC versions '-Wformat-security', ] -- 2.43.0
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 15 Jan 2024 16:03:41 +0900, Masahiko Sawada wrote: >> Defining one more static const struct instead of providing a >> convenient (but a bit tricky) macro may be straightforward: >> >> static const CopyToFormatRoutine testfmt_copyto_routine = { >> .type = T_CopyToFormatRoutine, >> .start_fn = testfmt_copyto_start, >> .onerow_fn = testfmt_copyto_onerow, >> .end_fn = testfmt_copyto_end >> }; >> >> static const CopyFormatRoutine testfmt_copyto_handler = { >> .type = T_CopyFormatRoutine, >> .is_from = false, >> .routine = (Node *) &testfmt_copyto_routine >> }; > > Yeah, IIUC this is the option 2 you mentioned[1]. I think we can go > with this idea as it's the simplest. > > [1] > https://www.postgresql.org/message-id/20240110.120034.501385498034538233.kou%40clear-code.com Ah, you're right. I forgot it... > That is CopyFormatRoutine will be like: > > typedef struct CopyFormatRoutine > { > NodeTag type; > > /* either CopyToFormatRoutine or CopyFromFormatRoutine */ > Node *routine; > } CopyFormatRoutine; > > And the core can check the node type of the 'routine7 in the > CopyFormatRoutine returned by extensions. It makes sense. If no more comments about the current design, I'll start implementing this feature based on the current design. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, If there are no more comments for the current design, I'll start implementing this feature with the following approaches for "Discussing" items: > 3.1 Should we use one function(internal) for COPY TO/FROM > handlers or two function(internal)s (one is for COPY TO > handler and another is for COPY FROM handler)? > [4] I'll choose "one function(internal) for COPY TO/FROM handlers". > 3.4 Should we export Copy{To,From}State? Or should we just > provide getters/setters to access Copy{To,From}State > internal? > [10] I'll export Copy{To,From}State. Thanks, -- kou In <20240112.144615.157925223373344229@clear-code.com> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 12 Jan 2024 14:46:15 +0900 (JST), Sutou Kouhei wrote: > Hi, > > Here is the current summary for a this discussion to make > COPY format extendable. It's for reaching consensus and > starting implementing the feature. (I'll start implementing > the feature once we reach consensus.) If you have any > opinion, please share it. > > Confirmed: > > 1.1 Making COPY format extendable will not reduce performance. > [1] > > Decisions: > > 2.1 Use separated handler for COPY TO and COPY FROM because > our COPY TO implementation (copyto.c) and COPY FROM > implementation (coypfrom.c) are separated. > [2] > > 2.2 Don't use system catalog for COPY TO/FROM handlers. We can > just use a function(internal) that returns a handler instead. > [3] > > 2.3 The implementation must include documentation. > [5] > > 2.4 The implementation must include test. > [6] > > 2.5 The implementation should be consist of small patches > for easy to review. > [6] > > 2.7 Copy{To,From}State must have a opaque space for > handlers. > [8] > > 2.8 Export CopySendData() and CopySendEndOfRow() for COPY TO > handlers. > [8] > > 2.9 Make "format" in PgMsg_CopyOutResponse message > extendable. > [9] > > 2.10 Make makeNode() call avoidable in function(internal) > that returns COPY TO/FROM handler. > [9] > > 2.11 Custom COPY TO/FROM handlers must be able to parse > their options. > [11] > > Discussing: > > 3.1 Should we use one function(internal) for COPY TO/FROM > handlers or two function(internal)s (one is for COPY TO > handler and another is for COPY FROM handler)? > [4] > > 3.2 If we use separated function(internal) for COPY TO/FROM > handlers, we need to define naming rule. For example, > _to(internal) for COPY TO handler and > _from(internal) for COPY FROM handler. > [4] > > 3.3 Should we use prefix or suffix for function(internal) > name to avoid name conflict with other handlers such as > tablesample handlers? > [7] > > 3.4 Should we export Copy{To,From}State? Or should we just > provide getters/setters to access Copy{To,From}State > internal? > [10] > > > [1] > https://www.postgresql.org/message-id/flat/20231204.153548.2126325458835528809.kou%40clear-code.com > [2] https://www.postgresql.org/message-id/flat/ZXEUIy6wl4jHy6Nm%40paquier.xyz > [3] > https://www.postgresql.org/message-id/flat/CAD21AoAhcZkAp_WDJ4sSv_%2Bg2iCGjfyMFgeu7MxjnjX_FutZAg%40mail.gmail.com > [4] > https://www.postgresql.org/message-id/flat/CAD21AoDkoGL6yJ_HjNOg9cU%3DaAdW8uQ3rSQOeRS0SX85LPPNwQ%40mail.gmail.com > [5] > https://www.postgresql.org/message-id/flat/TY3PR01MB9889C9234CD220A3A7075F0DF589A%40TY3PR01MB9889.jpnprd01.prod.outlook.com > [6] https://www.postgresql.org/message-id/flat/ZXbiPNriHHyUrcTF%40paquier.xyz > [7] > https://www.postgresql.org/message-id/flat/20231214.184414.2179134502876898942.kou%40clear-code.com > [8] > https://www.postgresql.org/message-id/flat/20231221.183504.1240642084042888377.kou%40clear-code.com > [9] https://www.postgresql.org/message-id/flat/ZYTfqGppMc9e_w2k%40paquier.xyz > [10] > https://www.postgresql.org/message-id/flat/CAD21AoD%3DUapH4Wh06G6H5XAzPJ0iJg9YcW8r7E2UEJkZ8QsosA%40mail.gmail.com > [11] > https://www.postgresql.org/message-id/flat/20240110.152023.1920937326588672387.kou%40clear-code.com > > > Thanks, > -- > kou > >
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 12 Jan 2024 14:40:41 +0800, Junwang Zhao wrote: >> Could you clarify what should we discuss? We should require >> that COPY TO/FROM handlers should use PostgreSQL's memory >> context for all internal memory allocations? > > Yes, handlers should use PostgreSQL's memory context, and I think > creating other memory context under CopyToStateData.copycontext > should be suggested for handler creators, so I proposed exporting > CopyToStateData to public header. I see. We can provide a getter for CopyToStateData::copycontext if we don't want to export CopyToStateData. Note that I don't have a strong opinion whether we should export CopyToStateData or not. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, Here is the current summary for a this discussion to make COPY format extendable. It's for reaching consensus and starting implementing the feature. (I'll start implementing the feature once we reach consensus.) If you have any opinion, please share it. Confirmed: 1.1 Making COPY format extendable will not reduce performance. [1] Decisions: 2.1 Use separated handler for COPY TO and COPY FROM because our COPY TO implementation (copyto.c) and COPY FROM implementation (coypfrom.c) are separated. [2] 2.2 Don't use system catalog for COPY TO/FROM handlers. We can just use a function(internal) that returns a handler instead. [3] 2.3 The implementation must include documentation. [5] 2.4 The implementation must include test. [6] 2.5 The implementation should be consist of small patches for easy to review. [6] 2.7 Copy{To,From}State must have a opaque space for handlers. [8] 2.8 Export CopySendData() and CopySendEndOfRow() for COPY TO handlers. [8] 2.9 Make "format" in PgMsg_CopyOutResponse message extendable. [9] 2.10 Make makeNode() call avoidable in function(internal) that returns COPY TO/FROM handler. [9] 2.11 Custom COPY TO/FROM handlers must be able to parse their options. [11] Discussing: 3.1 Should we use one function(internal) for COPY TO/FROM handlers or two function(internal)s (one is for COPY TO handler and another is for COPY FROM handler)? [4] 3.2 If we use separated function(internal) for COPY TO/FROM handlers, we need to define naming rule. For example, _to(internal) for COPY TO handler and _from(internal) for COPY FROM handler. [4] 3.3 Should we use prefix or suffix for function(internal) name to avoid name conflict with other handlers such as tablesample handlers? [7] 3.4 Should we export Copy{To,From}State? Or should we just provide getters/setters to access Copy{To,From}State internal? [10] [1] https://www.postgresql.org/message-id/flat/20231204.153548.2126325458835528809.kou%40clear-code.com [2] https://www.postgresql.org/message-id/flat/ZXEUIy6wl4jHy6Nm%40paquier.xyz [3] https://www.postgresql.org/message-id/flat/CAD21AoAhcZkAp_WDJ4sSv_%2Bg2iCGjfyMFgeu7MxjnjX_FutZAg%40mail.gmail.com [4] https://www.postgresql.org/message-id/flat/CAD21AoDkoGL6yJ_HjNOg9cU%3DaAdW8uQ3rSQOeRS0SX85LPPNwQ%40mail.gmail.com [5] https://www.postgresql.org/message-id/flat/TY3PR01MB9889C9234CD220A3A7075F0DF589A%40TY3PR01MB9889.jpnprd01.prod.outlook.com [6] https://www.postgresql.org/message-id/flat/ZXbiPNriHHyUrcTF%40paquier.xyz [7] https://www.postgresql.org/message-id/flat/20231214.184414.2179134502876898942.kou%40clear-code.com [8] https://www.postgresql.org/message-id/flat/20231221.183504.1240642084042888377.kou%40clear-code.com [9] https://www.postgresql.org/message-id/flat/ZYTfqGppMc9e_w2k%40paquier.xyz [10] https://www.postgresql.org/message-id/flat/CAD21AoD%3DUapH4Wh06G6H5XAzPJ0iJg9YcW8r7E2UEJkZ8QsosA%40mail.gmail.com [11] https://www.postgresql.org/message-id/flat/20240110.152023.1920937326588672387.kou%40clear-code.com Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 10 Jan 2024 16:53:48 +0900, Masahiko Sawada wrote: >> Interesting. But I feel that it introduces another (a bit) >> tricky mechanism... > > Right. On the other hand, I don't think the idea 3 is good for the > same reason Michael-san pointed out before[1][2]. > > [1] https://www.postgresql.org/message-id/ZXEUIy6wl4jHy6Nm%40paquier.xyz > [2] https://www.postgresql.org/message-id/ZXKm9tmnSPIVrqZz%40paquier.xyz I think that the important part of the Michael-san's opinion is "keep COPY TO implementation and COPY FROM implementation separated for maintainability". The patch focused in [1][2] uses one routine for both of COPY TO and COPY FROM. If we use the approach, we need to change one common routine from copyto.c and copyfrom.c (or export callbacks from copyto.c and copyfrom.c and use them in copyto.c to construct one common routine). It's the problem. The idea 3 still has separated routines for COPY TO and COPY FROM. So I think that it still keeps COPY TO implementation and COPY FROM implementation separated. >> BTW, we also need to set .type: >> >> .routine = COPYTO_ROUTINE ( >> .type = T_CopyToFormatRoutine, >> .start_fn = testfmt_copyto_start, >> .onerow_fn = testfmt_copyto_onerow, >> .end_fn = testfmt_copyto_end >> ) > > I think it's fine as the same is true for table AM. Ah, sorry. I should have said explicitly. I don't this that it's not a problem. I just wanted to say that it's missing. Defining one more static const struct instead of providing a convenient (but a bit tricky) macro may be straightforward: static const CopyToFormatRoutine testfmt_copyto_routine = { .type = T_CopyToFormatRoutine, .start_fn = testfmt_copyto_start, .onerow_fn = testfmt_copyto_onerow, .end_fn = testfmt_copyto_end }; static const CopyFormatRoutine testfmt_copyto_handler = { .type = T_CopyFormatRoutine, .is_from = false, .routine = (Node *) &testfmt_copyto_routine }; Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 10 Jan 2024 15:33:22 +0900, Masahiko Sawada wrote: >> We can use the satic const struct approach by choosing one >> of the followings: >> >> ... >> >> 4. ... other idea? > > It's a just idea but the fourth idea is to provide a convenient macro > to make it easy to construct the CopyFormatRoutine. For example, > > #define COPYTO_ROUTINE(...) (Node *) &(CopyToFormatRoutine) {__VA_ARGS__} > > static const CopyFormatRoutine testfmt_copyto_handler = { > .type = T_CopyFormatRoutine, > .is_from = true, > .routine = COPYTO_ROUTINE ( > .start_fn = testfmt_copyto_start, > .onerow_fn = testfmt_copyto_onerow, > .end_fn = testfmt_copyto_end > ) > }; > > Datum > copy_testfmt_handler(PG_FUNCTION_ARGS) > { > PG_RETURN_POINTER(& testfmt_copyto_handler); > } Interesting. But I feel that it introduces another (a bit) tricky mechanism... BTW, we also need to set .type: .routine = COPYTO_ROUTINE ( .type = T_CopyToFormatRoutine, .start_fn = testfmt_copyto_start, .onerow_fn = testfmt_copyto_onerow, .end_fn = testfmt_copyto_end ) Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 22 Dec 2023 10:58:05 +0800, Junwang Zhao wrote: >> 1. Add an opaque space for custom COPY TO handler >>* Add CopyToState{Get,Set}Opaque() >> >> https://github.com/kou/postgres/commit/5a610b6a066243f971e029432db67152cfe5e944 >> >> 2. Export CopyToState::attnumlist >>* Add CopyToStateGetAttNumList() >> >> https://github.com/kou/postgres/commit/15fcba8b4e95afa86edb3f677a7bdb1acb1e7688 >> >> 3. Export CopySend*() >>* Rename CopySend*() to CopyToStateSend*() and export them >>* Exception: CopySendEndOfRow() to CopyToStateFlush() because >> it just flushes the internal buffer now. >> >> https://github.com/kou/postgres/commit/289a5640135bde6733a1b8e2c412221ad522901e >> > I guess the purpose of these helpers is to avoid expose CopyToState to > copy.h, Yes. > but I > think expose CopyToState to user might make life easier, users might want to > use > the memory contexts of the structure (though I agree not all the > fields are necessary > for extension handers). OK. I don't object it as I said in another e-mail: https://www.postgresql.org/message-id/flat/20240110.120644.1876591646729327180.kou%40clear-code.com#d923173e9625c20319750155083cbd72 >> 2. Need an opaque space like IndexScanDesc::opaque does >> >>* A custom COPY TO handler needs to keep its data > > I once thought users might want to parse their own options, maybe this > is a use case for this opaque space. Good catch! I forgot to suggest a callback for custom format options. How about the following API? ... typedef bool (*CopyToProcessOption_function) (CopyToState cstate, DefElem *defel); ... typedef bool (*CopyFromProcessOption_function) (CopyFromState cstate, DefElem *defel); typedef struct CopyToFormatRoutine { ... CopyToProcessOption_function process_option_fn; } CopyToFormatRoutine; typedef struct CopyFromFormatRoutine { ... CopyFromProcessOption_function process_option_fn; } CopyFromFormatRoutine; diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index e7597894bf..1aa8b62551 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -416,6 +416,7 @@ void ProcessCopyOptions(ParseState *pstate, CopyFormatOptions *opts_out, bool is_from, + void *cstate, /* CopyToState* for !is_from, CopyFromState* for is_from */ List *options) { boolformat_specified = false; @@ -593,11 +594,19 @@ ProcessCopyOptions(ParseState *pstate, parser_errposition(pstate, defel->location))); } else - ereport(ERROR, - (errcode(ERRCODE_SYNTAX_ERROR), -errmsg("option \"%s\" not recognized", - defel->defname), -parser_errposition(pstate, defel->location))); + { + bool processed; + if (is_from) + processed = opts_out->from_ops->process_option_fn(cstate, defel); + else + processed = opts_out->to_ops->process_option_fn(cstate, defel); + if (!processed) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), +errmsg("option \"%s\" not recognized", + defel->defname), +parser_errposition(pstate, defel->location))); + } } /* > For the name, I thought private_data might be a better candidate than > opaque, but I do not insist. I don't have a strong opinion for this. Here are the number of headers that use "private_data" and "opaque": $ grep -r private_data --files-with-matches src/include | wc -l 6 $ grep -r opaque --files-with-matches src/include | wc -l 38 It seems that we use "opaque" than "private_data" in general. but it seems that we use "opaque" than "private_data" in our code. > Do you use the arrow library to control the memory? Yes. > Is there a way that > we can let the arrow use postgres' memory context? Yes. Apache Arrow C++ provides a memory pool feature and we can implement PostgreSQL's memory context based memory pool for this. (But this is a custom COPY TO/FROM handler's implementation details.) >I'm not sure this > is necessary, just raise the question for discussion.
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 22 Dec 2023 10:48:18 +0900, Masahiko Sawada wrote: >> I needed to extend the patch: >> >> 1. Add an opaque space for custom COPY TO handler >>* Add CopyToState{Get,Set}Opaque() >> >> https://github.com/kou/postgres/commit/5a610b6a066243f971e029432db67152cfe5e944 >> >> 2. Export CopyToState::attnumlist >>* Add CopyToStateGetAttNumList() >> >> https://github.com/kou/postgres/commit/15fcba8b4e95afa86edb3f677a7bdb1acb1e7688 > > I think we can move CopyToState to copy.h and we don't need to have > set/get functions for its fields. I don't object the idea if other PostgreSQL developers prefer the approach. Is there any PostgreSQL developer who objects that we export Copy{To,From}StateData as public API? Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 22 Dec 2023 10:00:24 +0900, Michael Paquier wrote: >> 3. Export CopySend*() >> >>* If we like minimum API, we just need to export >> CopySendData() and CopySendEndOfRow(). But >> CopySend{String,Char,Int32,Int16}() will be convenient >> custom COPY TO handlers. (A custom COPY TO handler for >> Apache Arrow doesn't need them.) > > Hmm. Not sure on this one. This may come down to externalize the > manipulation of fe_msgbuf. Particularly, could it be possible that > some custom formats don't care at all about the network order? It means that all custom formats should control byte order by themselves instead of using CopySendInt*() that always use network byte order, right? It makes sense. Let's export only CopySendData() and CopySendEndOfRow(). >> 1. What value should be used for "format" in >>PgMsg_CopyOutResponse message? >> >>It's 1 for binary format and 0 for text/csv format. >> >>Should we make it customizable by custom COPY TO handler? >>If so, what value should be used for this? > > Interesting point. It looks very tempting to give more flexibility to > people who'd like to use their own code as we have one byte in the > protocol but just use 0/1. Hence it feels natural to have a callback > for that. OK. Let's add a callback something like: typedef int16 (*CopyToGetFormat_function) (CopyToState cstate); > It also means that we may want to think harder about copy_is_binary in > libpq in the future step. Now, having a backend implementation does > not need any libpq bits, either, because a client stack may just want > to speak the Postgres protocol directly. Perhaps a custom COPY > implementation would be OK with how things are in libpq, as well, > tweaking its way through with just text or binary. Can we defer this discussion after we commit a basic custom COPY format handler mechanism? >> 2. Do we need more tries for design discussion for the first >>implementation? If we need, what should we try? > > A makeNode() is used with an allocation in the current memory context > in the function returning the handler. I would have assume that this > stuff returns a handler as a const struct like table AMs. If we use this approach, we can't use the Sawada-san's idea[1] that provides a convenient API to hide CopyFormatRoutine internal. The idea provides MakeCopy{To,From}FormatRoutine(). They return a new CopyFormatRoutine* with suitable is_from member. They can't use static const CopyFormatRoutine because they may be called multiple times in the same process. We can use the satic const struct approach by choosing one of the followings: 1. Use separated function for COPY {TO,FROM} format handlers as I suggested. 2. Don't provide convenient API. Developers construct CopyFormatRoutine by themselves. But it may be a bit tricky. 3. Similar to 2. but don't use a bit tricky approach (don't embed Copy{To,From}FormatRoutine nodes into CopyFormatRoutine). Use unified function for COPY {TO,FROM} format handlers but CopyFormatRoutine always have both of COPY {TO,FROM} format routines and these routines aren't nodes: typedef struct CopyToFormatRoutine { CopyToStart_function start_fn; CopyToOneRow_function onerow_fn; CopyToEnd_function end_fn; } CopyToFormatRoutine; /* XXX: just copied from COPY TO routines */ typedef struct CopyFromFormatRoutine { CopyFromStart_function start_fn; CopyFromOneRow_function onerow_fn; CopyFromEnd_function end_fn; } CopyFromFormatRoutine; typedef struct CopyFormatRoutine { NodeTag type; CopyToFormatRoutine to_routine; CopyFromFormatRoutine from_routine; } CopyFormatRoutine; static const CopyFormatRoutine testfmt_handler = { .type = T_CopyFormatRoutine, .to_routine = { .start_fn = testfmt_copyto_start, .onerow_fn = testfmt_copyto_onerow, .end_fn = testfmt_copyto_end, }, .from_routine = { .start_fn = testfmt_copyfrom_start, .onerow_fn = testfmt_copyfrom_onerow, .end_fn = testfmt_copyfrom_end, }, }; PG_FUNCTION_INFO_V1(copy_testfmt_handler); Datum copy_testfmt_handler(PG_FUNCTION_ARGS) { PG_RETURN_POINTER(&testfmt_handler); } 4. ... other idea? [1] https://www.postgresql.org/message-id/flat/CAD21AoDs9cOjuVbA_krGizAdc50KE%2BFjAuEXWF0NZwbMnc7F3Q%40mail.gmail.com#71bb03d9237252382b245dd33e705a3a Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 11 Dec 2023 23:31:29 +0900, Masahiko Sawada wrote: > I've sketched the above idea including a test module in > src/test/module/test_copy_format, based on v2 patch. It's not splitted > and is dirty so just for discussion. I implemented a sample COPY TO handler for Apache Arrow that supports only integer and text. I needed to extend the patch: 1. Add an opaque space for custom COPY TO handler * Add CopyToState{Get,Set}Opaque() https://github.com/kou/postgres/commit/5a610b6a066243f971e029432db67152cfe5e944 2. Export CopyToState::attnumlist * Add CopyToStateGetAttNumList() https://github.com/kou/postgres/commit/15fcba8b4e95afa86edb3f677a7bdb1acb1e7688 3. Export CopySend*() * Rename CopySend*() to CopyToStateSend*() and export them * Exception: CopySendEndOfRow() to CopyToStateFlush() because it just flushes the internal buffer now. https://github.com/kou/postgres/commit/289a5640135bde6733a1b8e2c412221ad522901e The attached patch is based on the Sawada-san's patch and includes the above changes. Note that this patch is also dirty so just for discussion. My suggestions from this experience: 1. Split COPY handler to COPY TO handler and COPY FROM handler * CopyFormatRoutine is a bit tricky. An extension needs to create a CopyFormatRoutine node and a CopyToFormatRoutine node. * If we just require "copy_to_${FORMAT}(internal)" function and "copy_from_${FORMAT}(internal)" function, we can remove the tricky approach. And it also avoid name collisions with other handler such as tablesample handler. See also: https://www.postgresql.org/message-id/flat/20231214.184414.2179134502876898942.kou%40clear-code.com#af71f364d0a9f5c144e45b447e5c16c9 2. Need an opaque space like IndexScanDesc::opaque does * A custom COPY TO handler needs to keep its data 3. Export CopySend*() * If we like minimum API, we just need to export CopySendData() and CopySendEndOfRow(). But CopySend{String,Char,Int32,Int16}() will be convenient custom COPY TO handlers. (A custom COPY TO handler for Apache Arrow doesn't need them.) Questions: 1. What value should be used for "format" in PgMsg_CopyOutResponse message? https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copyto.c;h=c66a047c4a79cc614784610f385f1cd0935350f3;hb=9ca6e7b9411e36488ef539a2c1f6846ac92a7072#l144 It's 1 for binary format and 0 for text/csv format. Should we make it customizable by custom COPY TO handler? If so, what value should be used for this? 2. Do we need more tries for design discussion for the first implementation? If we need, what should we try? Thanks, -- kou diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfad47b562..e7597894bf 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -23,6 +23,7 @@ #include "access/xact.h" #include "catalog/pg_authid.h" #include "commands/copy.h" +#include "commands/copyapi.h" #include "commands/defrem.h" #include "executor/executor.h" #include "mb/pg_wchar.h" @@ -32,6 +33,7 @@ #include "parser/parse_coerce.h" #include "parser/parse_collate.h" #include "parser/parse_expr.h" +#include "parser/parse_func.h" #include "parser/parse_relation.h" #include "rewrite/rewriteHandler.h" #include "utils/acl.h" @@ -427,6 +429,8 @@ ProcessCopyOptions(ParseState *pstate, opts_out->file_encoding = -1; + /* Text is the default format. */ + opts_out->to_ops = &CopyToTextFormatRoutine; /* Extract options from the statement node tree */ foreach(option, options) { @@ -442,9 +446,26 @@ ProcessCopyOptions(ParseState *pstate, if (strcmp(fmt, "text") == 0) /* default format */ ; else if (strcmp(fmt, "csv") == 0) + { opts_out->csv_mode = true; +opts_out->to_ops = &CopyToCSVFormatRoutine; + } else if (strcmp(fmt, "binary") == 0) + { opts_out->binary = true; +opts_out->to_ops = &CopyToBinaryFormatRoutine; + } + else if (!is_from) + { +/* + * XXX: Currently we support custom COPY format only for COPY + * TO. + * + * XXX: need to check the combination of the existing options + * and a custom format (e.g., FREEZE)? + */ +opts_out->to_ops = GetCopyToFormatRoutine(fmt); + } else ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), @@ -864,3 +885,62 @@ CopyGetAttnums(TupleDesc tupDesc, Relation rel, List *attnamelist) return attnums; } + +static CopyFormatRoutine * +GetCopyFormatRoutine(char *format_name, bool is_from) +{ + Oid handlerOid; + Oid funcargtypes[1]; + CopyFormatRoutine *cp; + Datum datum; + + funcargtypes[0] = INTERNALOID; + handlerOid = LookupFuncName(list_make1(makeString(format_name)), 1, +funcargtypes, true); + + if (!OidIsValid(handlerOid)) + ereport(ERROR, +(errcode(ERRCODE_UNDEFINED_OBJECT), +
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 15 Dec 2023 11:27:30 +0800, Junwang Zhao wrote: >> > Adding a prefix or suffix would be one option but to give extensions >> > more flexibility, another option would be to support format = 'custom' >> > and add the "handler" option to specify a copy handler function name >> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom', >> > HANDLER = 'arrow_copy_handler'). >> > I like the prefix/suffix idea, easy to implement. *custom* is not a FORMAT, > and user has to know the name of the specific handler names, not > intuitive. Ah! I misunderstood this idea. "custom" is the special format to use "HANDLER". I thought that we can use it like (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl1') and (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl2') . >> Interesting. If we use this option, users can choose an COPY >> FORMAT implementation they like from multiple >> implementations. For example, a developer may implement a >> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON >> related API and another developer may implement a handler >> with simdjson[1] which is a fast JSON parser. Users can >> choose whichever they like. > Not sure about this, why not move Json copy handler to contrib > as an example for others, any extensions share the same format > function name and just install one? No bound would implement > another CSV or TEXT copy handler IMHO. I should have used a different format not JSON as an example for easy to understand. I just wanted to say that extension developers can implement another implementation without conflicting another implementation. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "RE: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 12 Dec 2023 02:31:53 +, "Hayato Kuroda (Fujitsu)" wrote: >> Can we discuss how to proceed this improvement? >> >> There are 2 approaches for it: >> >> 1. Do the followings concurrently: >>a. Implementing small changes that got a consensus and >> merge them step-by-step >> (e.g. We got a consensus that we need to extract the >> current format related routines.) >>b. Discuss design >> >>(v1-v3 patches use this approach.) >> >> 2. Implement one (large) complete patch set with design >>discussion and merge it >> >>(v4- patches use this approach.) >> >> Which approach is preferred? (Or should we choose another >> approach?) >> >> I thought that 1. is preferred because it will reduce review >> cost. So I chose 1. > > I'm ok to use approach 1, but could you please divide a large patch? E.g., > > 0001. defines an infrastructure for copy-API > 0002. adjusts current codes to use APIs > 0003. adds a test module in src/test/modules or contrib. > ... > > This approach helps reviewers to see patches deeper. Separated patches can be > combined when they are close to committable. It seems that I should have chosen another approach based on comments so far: 3. Do the followings in order: a. Implement a workable (but maybe dirty and/or incomplete) implementation to discuss design like [1], discuss design with it and get a consensus on design b. Implement small patches based on the design [1]: https://www.postgresql.org/message-id/CAD21AoCunywHird3GaPzWe6s9JG1wzxj3Cr6vGN36DDheGjOjA%40mail.gmail.com I'll implement a custom COPY FORMAT handler with [1] and provide a feedback with the experience. (It's for a.) Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 15 Dec 2023 05:19:43 +0900, Masahiko Sawada wrote: > To avoid collisions, extensions can be created in a > different schema than public. Thanks. I didn't notice it. > And note that built-in format copy handler doesn't need to > declare its handler function. Right. I know it. > Adding a prefix or suffix would be one option but to give extensions > more flexibility, another option would be to support format = 'custom' > and add the "handler" option to specify a copy handler function name > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom', > HANDLER = 'arrow_copy_handler'). Interesting. If we use this option, users can choose an COPY FORMAT implementation they like from multiple implementations. For example, a developer may implement a COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON related API and another developer may implement a handler with simdjson[1] which is a fast JSON parser. Users can choose whichever they like. But specifying HANDLER = '...' explicitly is a bit inconvenient. Because only one handler will be installed in most use cases. In the case, users don't need to choose one handler. If we choose this option, it may be better that we also provide a mechanism that can work without HANDLER. Searching a function by name like tablesample method does is an option. [1]: https://github.com/simdjson/simdjson Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 11 Dec 2023 10:57:15 +0900, Masahiko Sawada wrote: > IIUC we cannot create two same name functions with the same arguments > but a different return value type in the first place. It seems to me > to be an overkill to change such a design. Oh, sorry. I didn't notice it. > Another idea is to encapsulate copy_to/from_handler by a super class > like copy_handler. The handler function is called with an argument, > say copyto, and returns copy_handler encapsulating either > copy_to/from_handler depending on the argument. It's for using "${copy_format_name}" such as "json" and "parquet" as a function name, right? If we use the "${copy_format_name}" approach, we can't use function names that are already used by tablesample method handler such as "system" and "bernoulli" for COPY FORMAT name. Because both of tablesample method handler function and COPY FORMAT handler function use "(internal)" as arguments. I think that tablesample method names and COPY FORMAT names will not be conflicted but the limitation (using the same namespace for tablesample method and COPY FORMAT) is unnecessary limitation. How about using prefix ("copy_to_${copy_format_name}" or something) or suffix ("${copy_format_name}_copy_to" or something) for function names? For example, "copy_to_json"/"copy_from_json" for "json" COPY FORMAT. ("copy_${copy_format_name}" that returns copy_handler encapsulating either copy_to/from_handler depending on the argument may be an option.) Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Sat, 9 Dec 2023 12:38:46 +0100, Hannu Krosing wrote: > Please also see my presentation slides from last years PostgreSQL > Conference in Berlin (attached) Thanks for sharing your idea here. > The main Idea is to make not just "format", but also "transport" and > "stream processing" extendable via virtual function tables. "Transport" and "stream processing" are out of scope in this thread. How about starting new threads for them and discuss them there? > Btw, will any of you here be in Prague next week ? Sorry. No. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 8 Dec 2023 15:42:06 +0900, Masahiko Sawada wrote: > So a custom COPY extension would not be able to define SQL functions > just like arrow(internal) for example. We might need to define a rule > like the function returning copy_in/out_handler must be defined as > _to(internal) and _from(internal). We may not need to add "_to"/"_from" suffix by checking both of argument type and return type. Because we use different return type for copy_in/out_handler. But the current LookupFuncName() family doesn't check return type. If we use this approach, we need to improve the current LookupFuncName() family too. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, Thanks for reviewing our latest patch! In "RE: Make COPY format extendable: Extract COPY TO format implementations" on Sat, 9 Dec 2023 02:43:49 +, "Hayato Kuroda (Fujitsu)" wrote: > (I remember that this theme was talked at Japan PostgreSQL conference) Yes. I should have talked to you more at the conference... I will do it next time! Can we discuss how to proceed this improvement? There are 2 approaches for it: 1. Do the followings concurrently: a. Implementing small changes that got a consensus and merge them step-by-step (e.g. We got a consensus that we need to extract the current format related routines.) b. Discuss design (v1-v3 patches use this approach.) 2. Implement one (large) complete patch set with design discussion and merge it (v4- patches use this approach.) Which approach is preferred? (Or should we choose another approach?) I thought that 1. is preferred because it will reduce review cost. So I chose 1. If 2. is preferred, I'll use 2. (I'll add more changes to the latest patch.) Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 22:07:51 +0800, Junwang Zhao wrote: > Should we extract both *copy to* and *copy from* for the first step, in that > case we can add the pg_copy_handler catalog smoothly later. I don't object it (mixing TO/FROM changes to one patch) but it may make review difficult. Is it acceptable? FYI: I planed that I implement TO part, and then FROM part, and then unify TO/FROM parts if needed. [1] > Attached V4 adds 'extract copy from' and it passed the cirrus ci, > please take a look. Thanks. Here are my comments: > + /* > + * Error is relevant to a particular line. > + * > + * If line_buf still contains the correct line, print it. > + */ > + if (cstate->line_buf_valid) We need to fix the indentation. > +CopyFromFormatBinaryStart(CopyFromState cstate, TupleDesc tupDesc) > +{ > + FmgrInfo *in_functions; > + Oid*typioparams; > + Oid in_func_oid; > + AttrNumber num_phys_attrs; > + > + /* > + * Pick up the required catalog information for each attribute in the > + * relation, including the input function, the element type (to pass to > + * the input function), and info about defaults and constraints. (Which > + * input function we use depends on text/binary format choice.) > + */ > + num_phys_attrs = tupDesc->natts; > + in_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo)); > + typioparams = (Oid *) palloc(num_phys_attrs * sizeof(Oid)); We need to update the comment because defaults and constraints aren't picked up here. > +CopyFromFormatTextStart(CopyFromState cstate, TupleDesc tupDesc) ... > + /* > + * Pick up the required catalog information for each attribute in the > + * relation, including the input function, the element type (to pass to > + * the input function), and info about defaults and constraints. (Which > + * input function we use depends on text/binary format choice.) > + */ > + in_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo)); > + typioparams = (Oid *) palloc(num_phys_attrs * sizeof(Oid)); ditto. > @@ -1716,15 +1776,6 @@ BeginCopyFrom(ParseState *pstate, > ReceiveCopyBinaryHeader(cstate); > } I think that this block should be moved to CopyFromFormatBinaryStart() too. But we need to run it after we setup inputs such as data_source_cb, pipe and filename... +/* Routines for a COPY HANDLER implementation. */ +typedef struct CopyHandlerOps +{ + /* Called when COPY TO is started. This will send a header. */ + void(*copy_to_start) (CopyToState cstate, TupleDesc tupDesc); + + /* Copy one row for COPY TO. */ + void(*copy_to_one_row) (CopyToState cstate, TupleTableSlot *slot); + + /* Called when COPY TO is ended. This will send a trailer. */ + void(*copy_to_end) (CopyToState cstate); + + void(*copy_from_start) (CopyFromState cstate, TupleDesc tupDesc); + bool(*copy_from_next) (CopyFromState cstate, ExprContext *econtext, + Datum *values, bool *nulls); + void(*copy_from_error_callback) (CopyFromState cstate); + void(*copy_from_end) (CopyFromState cstate); +} CopyHandlerOps; It seems that "copy_" prefix is redundant. Should we use "to_start" instead of "copy_to_start" and so on? BTW, it seems that "COPY FROM (FORMAT json)" may not be implemented. [2] We may need to care about NULL copy_from_* cases. > I added a hook *copy_from_end* but this might be removed later if not used. It may be useful to clean up resources for COPY FROM but the patch doesn't call the copy_from_end. How about removing it for now? We can add it and call it from EndCopyFrom() later? Because it's not needed for now. I think that we should focus on refactoring instead of adding a new feature in this patch. [1]: https://www.postgresql.org/message-id/20231204.153548.2126325458835528809.kou%40clear-code.com [2]: https://www.postgresql.org/message-id/flat/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 15:11:34 +0800, Junwang Zhao wrote: > For the extra curly braces, I mean the following code block in > CopyToFormatBinaryStart: > > + {<-- I thought this is useless? > + /* Generate header for a binary copy */ > + int32 tmp; > + > + /* Signature */ > + CopySendData(cstate, BinarySignature, 11); > + /* Flags field */ > + tmp = 0; > + CopySendInt32(cstate, tmp); > + /* No header extension */ > + tmp = 0; > + CopySendInt32(cstate, tmp); > + } Oh, I see. I've removed and attach the v3 patch. In general, I don't change variable name and so on in this patch. I just move codes in this patch. But I also removed the "tmp" variable for this case because I think that the name isn't suitable for larger scope. (I think that "tmp" is acceptable in a small scope like the above code.) New code: /* Generate header for a binary copy */ /* Signature */ CopySendData(cstate, BinarySignature, 11); /* Flags field */ CopySendInt32(cstate, 0); /* No header extension */ CopySendInt32(cstate, 0); Thanks, -- kou >From 9fe0087d9a6a79a7d1a7d0af63eb16abadbf0d4a Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Mon, 4 Dec 2023 12:32:54 +0900 Subject: [PATCH v3] Extract COPY TO format implementations This is a part of making COPY format extendable. See also these past discussions: * New Copy Formats - avro/orc/parquet: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com * Make COPY extendable in order to support Parquet and other formats: https://www.postgresql.org/message-id/flat/CAJ7c6TM6Bz1c3F04Cy6%2BSzuWfKmr0kU8c_3Stnvh_8BR0D6k8Q%40mail.gmail.com This doesn't change the current behavior. This just introduces CopyToFormatOps, which just has function pointers of format implementation like TupleTableSlotOps, and use it for existing "text", "csv" and "binary" format implementations. Note that CopyToFormatOps can't be used from extensions yet because CopySend*() aren't exported yet. Extensions can't send formatted data to a destination without CopySend*(). They will be exported by subsequent patches. Here is a benchmark result with/without this change because there was a discussion that we should care about performance regression: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us > I think that step 1 ought to be to convert the existing formats into > plug-ins, and demonstrate that there's no significant loss of > performance. You can see that there is no significant loss of performance: Data: Random 32 bit integers: CREATE TABLE data (int32 integer); INSERT INTO data SELECT random() * 1 FROM generate_series(1, ${n_records}); The number of records: 100K, 1M and 10M 100K without this change: format,elapsed time (ms) text,22.527 csv,23.822 binary,24.806 100K with this change: format,elapsed time (ms) text,22.919 csv,24.643 binary,24.705 1M without this change: format,elapsed time (ms) text,223.457 csv,233.583 binary,242.687 1M with this change: format,elapsed time (ms) text,224.591 csv,233.964 binary,247.164 10M without this change: format,elapsed time (ms) text,2330.383 csv,2411.394 binary,2590.817 10M with this change: format,elapsed time (ms) text,2231.307 csv,2408.067 binary,2473.617 --- src/backend/commands/copy.c | 8 + src/backend/commands/copyto.c | 377 -- src/include/commands/copy.h | 27 ++- 3 files changed, 256 insertions(+), 156 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfad47b562..27a1add456 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -427,6 +427,8 @@ ProcessCopyOptions(ParseState *pstate, opts_out->file_encoding = -1; + /* Text is the default format. */ + opts_out->to_ops = CopyToFormatOpsText; /* Extract options from the statement node tree */ foreach(option, options) { @@ -442,9 +444,15 @@ ProcessCopyOptions(ParseState *pstate, if (strcmp(fmt, "text") == 0) /* default format */ ; else if (strcmp(fmt, "csv") == 0) + { opts_out->csv_mode = true; +opts_out->to_ops = CopyToFormatOpsCSV; + } else if (strcmp(fmt, "binary") == 0) + { opts_out->binary = true; +opts_out->to_ops = CopyToFormatOpsBinary; + } else ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index c66a047c4a..8f51090a03 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -131,6 +131,228 @@ static void CopySendEndOfRow(C
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 11:18:35 +0800, Junwang Zhao wrote: > For the modern formats(parquet, orc, avro, etc.), will they be > implemented as extensions or in core? I think that they should be implemented as extensions because they will depend of external libraries and may not use C. For example, C++ will be used for Apache Parquet because the official Apache Parquet C++ implementation exists but the C implementation doesn't. (I can implement an extension for Apache Parquet after we complete this feature. I'll implement an extension for Apache Arrow with the official Apache Arrow C++ implementation. And it's easy that we convert Apache Arrow data to Apache Parquet with the official Apache Parquet implementation.) > The patch looks good except for a pair of extra curly braces. Thanks for the review! I attach the v2 patch that removes extra curly braces for "if (isnull)". Thanks, -- kou >From 2cd0d344d68667db71b621a8c94f376ddf1707c3 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Mon, 4 Dec 2023 12:32:54 +0900 Subject: [PATCH v2] Extract COPY TO format implementations This is a part of making COPY format extendable. See also these past discussions: * New Copy Formats - avro/orc/parquet: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com * Make COPY extendable in order to support Parquet and other formats: https://www.postgresql.org/message-id/flat/CAJ7c6TM6Bz1c3F04Cy6%2BSzuWfKmr0kU8c_3Stnvh_8BR0D6k8Q%40mail.gmail.com This doesn't change the current behavior. This just introduces CopyToFormatOps, which just has function pointers of format implementation like TupleTableSlotOps, and use it for existing "text", "csv" and "binary" format implementations. Note that CopyToFormatOps can't be used from extensions yet because CopySend*() aren't exported yet. Extensions can't send formatted data to a destination without CopySend*(). They will be exported by subsequent patches. Here is a benchmark result with/without this change because there was a discussion that we should care about performance regression: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us > I think that step 1 ought to be to convert the existing formats into > plug-ins, and demonstrate that there's no significant loss of > performance. You can see that there is no significant loss of performance: Data: Random 32 bit integers: CREATE TABLE data (int32 integer); INSERT INTO data SELECT random() * 1 FROM generate_series(1, ${n_records}); The number of records: 100K, 1M and 10M 100K without this change: format,elapsed time (ms) text,22.527 csv,23.822 binary,24.806 100K with this change: format,elapsed time (ms) text,22.919 csv,24.643 binary,24.705 1M without this change: format,elapsed time (ms) text,223.457 csv,233.583 binary,242.687 1M with this change: format,elapsed time (ms) text,224.591 csv,233.964 binary,247.164 10M without this change: format,elapsed time (ms) text,2330.383 csv,2411.394 binary,2590.817 10M with this change: format,elapsed time (ms) text,2231.307 csv,2408.067 binary,2473.617 --- src/backend/commands/copy.c | 8 + src/backend/commands/copyto.c | 383 -- src/include/commands/copy.h | 27 ++- 3 files changed, 262 insertions(+), 156 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfad47b562..27a1add456 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -427,6 +427,8 @@ ProcessCopyOptions(ParseState *pstate, opts_out->file_encoding = -1; + /* Text is the default format. */ + opts_out->to_ops = CopyToFormatOpsText; /* Extract options from the statement node tree */ foreach(option, options) { @@ -442,9 +444,15 @@ ProcessCopyOptions(ParseState *pstate, if (strcmp(fmt, "text") == 0) /* default format */ ; else if (strcmp(fmt, "csv") == 0) + { opts_out->csv_mode = true; +opts_out->to_ops = CopyToFormatOpsCSV; + } else if (strcmp(fmt, "binary") == 0) + { opts_out->binary = true; +opts_out->to_ops = CopyToFormatOpsBinary; + } else ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index c66a047c4a..79806b9a1b 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -131,6 +131,234 @@ static void CopySendEndOfRow(CopyToState cstate); static void CopySendInt32(CopyToState cstate, int32 val); static void CopySendInt16(CopyToState cstate, int16 val); +/* + * CopyToFormatOps implementations. + */
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, Thanks for replying to this proposal! In <20231205182458.GC2757816@nathanxps13> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 5 Dec 2023 12:24:58 -0600, Nathan Bossart wrote: > I think it makes sense to do this part independently, but we should be > careful to design this with the follow-up tasks in mind. OK. I'll keep updating the "TODOs" section in the original e-mail. It also includes design in the follow-up tasks. We can discuss the design separately from the patches submitting. (The current submitted patch just focuses on refactoring but we can discuss the final design.) > I assume the performance concerns stem from the use of > function pointers. Or was there something else? I think so too. The original e-mail that mentioned the performance concern [1] didn't say about the reason but the use of function pointers might be concerned. If the currently supported formats ("text", "csv" and "binary") are implemented as an extension, it may have more concerns but we will keep them as built-in formats for compatibility. So I think that no more concerns exist for these formats. [1]: https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d Thanks, -- kou
Make COPY format extendable: Extract COPY TO format implementations
Hi, I want to work on making COPY format extendable. I attach the first patch for it. I'll send more patches after this is merged. Background: Currently, COPY TO/FROM supports only "text", "csv" and "binary" formats. There are some requests to support more COPY formats. For example: * 2023-11: JSON and JSON lines [1] * 2022-04: Apache Arrow [2] * 2018-02: Apache Avro, Apache Parquet and Apache ORC [3] (FYI: I want to add support for Apache Arrow.) There were discussions how to add support for more formats. [3][4] In these discussions, we got a consensus about making COPY format extendable. But it seems that nobody works on this yet. So I want to work on this. (If there is anyone who wants to work on this together, I'm happy.) Summary: The attached patch introduces CopyToFormatOps struct that is similar to TupleTableSlotOps for TupleTableSlot but CopyToFormatOps is for COPY TO format. CopyToFormatOps has routines to implement a COPY TO format. The attached patch doesn't change: * the current behavior (all existing tests are still passed without changing them) * the existing "text", "csv" and "binary" format output implementations including local variable names (the attached patch just move them and adjust indent) * performance (no significant loss of performance) In other words, this is just a refactoring for further changes to make COPY format extendable. If I use "complete the task and then request reviews for it" approach, it will be difficult to review because changes for it will be large. So I want to work on this step by step. Is it acceptable? TODOs that should be done in subsequent patches: * Add some CopyToState readers such as CopyToStateGetDest(), CopyToStateGetAttnums() and CopyToStateGetOpts() (We will need to consider which APIs should be exported.) (This is for implemeing COPY TO format by extension.) * Export CopySend*() in src/backend/commands/copyto.c (This is for implemeing COPY TO format by extension.) * Add API to register a new COPY TO format implementation * Add "CREATE XXX" to register a new COPY TO format (or COPY TO/FROM format) implementation ("CREATE COPY HANDLER" was suggested in [5].) * Same for COPY FROM Performance: We got a consensus about making COPY format extendable but we should care about performance. [6] > I think that step 1 ought to be to convert the existing > formats into plug-ins, and demonstrate that there's no > significant loss of performance. So I measured COPY TO time with/without this change. You can see there is no significant loss of performance. Data: Random 32 bit integers: CREATE TABLE data (int32 integer); INSERT INTO data SELECT random() * 1 FROM generate_series(1, ${n_records}); The number of records: 100K, 1M and 10M 100K without this change: format,elapsed time (ms) text,22.527 csv,23.822 binary,24.806 100K with this change: format,elapsed time (ms) text,22.919 csv,24.643 binary,24.705 1M without this change: format,elapsed time (ms) text,223.457 csv,233.583 binary,242.687 1M with this change: format,elapsed time (ms) text,224.591 csv,233.964 binary,247.164 10M without this change: format,elapsed time (ms) text,2330.383 csv,2411.394 binary,2590.817 10M with this change: format,elapsed time (ms) text,2231.307 csv,2408.067 binary,2473.617 [1]: https://www.postgresql.org/message-id/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5 [2]: https://www.postgresql.org/message-id/flat/CAGrfaBVyfm0wPzXVqm0%3Dh5uArYh9N_ij%2BsVpUtDHqkB%3DVyB3jw%40mail.gmail.com [3]: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com [4]: https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d [5]: https://www.postgresql.org/message-id/20180211211235.5x3jywe5z3lkgcsr%40alap3.anarazel.de [6]: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us Thanks, -- kou >From 7f00b2b0fb878ae1c687c151dd751512d02ed83e Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Mon, 4 Dec 2023 12:32:54 +0900 Subject: [PATCH v1] Extract COPY TO format implementations This is a part of making COPY format extendable. See also these past discussions: * New Copy Formats - avro/orc/parquet: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com * Make COPY extendable in order to support Parquet and other formats: https://www.postgresql.org/message-id/flat/CAJ7c6TM6Bz1c3F04Cy6%2BSzuWfKmr0kU8c_3Stnvh_8BR0D6k8Q%40mail.gmail.com This doesn't change the current behavior. This just introduces CopyToFormatOps, which just has function pointers of format implementation like TupleTableSlotOps, and use it for existing &