Re: question regarding policy for patches to out-of-support branches

2024-06-06 Thread Joe Conway

On 6/6/24 14:12, Tom Lane wrote:

Robert Haas  writes:

On Thu, Jun 6, 2024 at 4:25 AM Hannu Krosing  wrote:

Not absolutely sure, but would at least adding a page to PostgreSQL
Wiki about this make sense ?



I feel like we need to do something. Tom says this is a policy, and
he's made that comment before about other things, but the fact that
they're not memorialized anywhere is a huge problem, IMHO.


I didn't say it wasn't ;-)

ISTM we have two basic choices: wiki page, or new SGML docs section.
In the short term I'd lean to a wiki page.  It'd be reasonable for

https://wiki.postgresql.org/wiki/Committing_checklist

to link to it (and maybe the existing section there about release
freezes would be more apropos on a "Project policies" page?  Not
sure.)

To get a sense of how much of a problem we have, I grepped the git
history for comments mentioning project policies.  Ignoring ones
that are really talking about very localized issues, what I found
is attached.  It seems like it's little enough that a single wiki
page with subsections ought to be enough.  I'm not super handy with
editing the wiki, plus I'm only firing on one cylinder today (seem
to have acquired a head cold at pgconf), so maybe somebody else
would like to draft something?


I added them here with minimal copy editing and no attempt to organize or 
sort into groups:


https://wiki.postgresql.org/wiki/Committing_checklist#Policies

If someone has thoughts on how to improve it, I am happy to make more changes.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





question regarding policy for patches to out-of-support branches

2024-06-05 Thread Joe Conway
I was having a discussion regarding out-of-support branches and effort 
to keep them building, but could not for the life of me find any actual 
documented policy (although I distinctly remember that we do something...).


I did find fleeting references, for example:

8<---
commit c705646b751e08d584f6eeb098f1ed002aa7b11c
Author: Tom Lane 
Date:   2022-09-21 13:52:38 -0400



Per project policy, this is a candidate for back-patching into
out-of-support branches: it suppresses annoying compiler warnings
but changes no behavior.  Hence, back-patch all the way to 9.2.
8<---

and on its related thread:

8<---
However, I think that that would *not* be fit material for
back-patching into out-of-support branches, since our policy
for them is "no behavioral changes".
8<---

Is the policy written down somewhere, or is it only project lore? In 
either case, what is the actual policy?


Thanks,

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Optimizing COPY with SIMD

2024-06-03 Thread Joe Conway

On 6/2/24 15:17, Neil Conway wrote:

Inspired by David Rowley's work [1]



Welcome back!

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-24 Thread Joe Conway

On 5/24/24 15:45, Tom Lane wrote:

I was *not* proposing doing a regular review, unless of course
somebody really wants to do that.  What I am thinking about is
suggesting how to make progress on patches that are stuck, or in some
cases delivering the bad news that this patch seems unlikely to ever
get accepted and it's time to cut our losses.  (Patches that seem to
be moving along in good order probably don't need any attention in
this process, beyond determining that that's the case.)  That's why
I think we need some senior people doing this, as their opinions are
more likely to be taken seriously.


Maybe doing a FOSDEM-style dev meeting with triage review at PG.EU would at 
least move us forward? Granted, it would be later and perhaps less often 
than the thread seems to indicate, but it has been tossed around before and 
seems doable.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-19 Thread Joe Conway

On 5/19/24 05:37, Dmitry Dolgov wrote:

* How to deal with review scalability bottleneck.

   An evergreen question. PostgreSQL is getting more popular and, as stated in
   diverse research whitepapers, the amount of contribution grows as a power
   law, where the number of reviewers grows at best sub-linearly (limited by the
   velocity of knowledge sharing). I agree with Andrey on this, the only
   way I see to handle this is to scale CF management efforts.



The number of items tracked is surely growing, but I am not sure I would 
call it exponential -- see attached


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-17 Thread Joe Conway

On 5/17/24 11:58, Robert Haas wrote:

I realize that many people here are (rightly!) concerned with
burdening patch authors with more steps that they have to follow. But
the current system is serving new patch authors very poorly. If they
get attention, it's much more likely to be because somebody saw their
email and wrote back than it is to be because somebody went through
the CommitFest and found their entry and was like "oh, I should review
this". Honestly, if we get to a situation where a patch author is sad
because they have to click a link every 2 months to say "yeah, I'm
still here, please review my patch," we've already lost the game. That
person isn't sad because we asked them to click a link. They're sad
it's already been N * 2 months and nothing has happened.


+many

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-17 Thread Joe Conway

On 5/17/24 09:08, Peter Eisentraut wrote:

On 17.05.24 14:42, Joe Conway wrote:
Namely, the week before commitfest I don't actually know if I will 
have the time during that month, but I will make sure my patch is in 
the commitfest just in case I get a few clear days to work on it. 
Because if it isn't there, I can't take advantage of those "found" hours.


A solution to both of these issues (yours and mine) would be to allow 
things to be added *during* the CF month. What is the point of having a 
"freeze" before every CF anyway? Especially if they start out clean. If 
something is ready for review on day 8 of the CF, why not let it be 
added for review?


Maybe this all indicates that the idea of bimonthly commitfests has run
its course.  The original idea might have been, if we have like 50
patches, we can process all of them within a month.  We're clearly not
doing that anymore.  What would the development process look like if we
just had one commitfest per dev cycle that runs from July 1st to March 31st?


What's old is new again? ;-)

I agree with you. Before commitfests were a thing, we had no structure at 
all as I recall. The dates for the dev cycle were more fluid, and we had no 
CF app to track things. Maybe retaining the structure but going back to 
continuous development would be just the thing, with the CF app tracking 
just the currently (as of this week/month/???) active stuff.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-17 Thread Joe Conway

On 5/17/24 08:31, Jelte Fennema-Nio wrote:

On Fri, 17 May 2024 at 14:19, Joe Conway  wrote:


On 5/16/24 22:26, Robert Haas wrote:
> For example, imagine that the CommitFest is FORCIBLY empty
> until a week before it starts. You can still register patches in the
> system generally, but that just means they get CI runs, not that
> they're scheduled to be reviewed. A week before the CommitFest,
> everyone who has a patch registered in the system that still applies
> gets an email saying "click here if you think this patch should be
> reviewed in the upcoming CommitFest -- if you don't care about the
> patch any more or it needs more work before other people review it,
> don't click here". Then, the CommitFest ends up containing only the
> things where the patch author clicked there during that week.

100% agree. This is in line with what I suggested on an adjacent part of
the thread.


Such a proposal would basically mean that no-one that cares about
their patches getting reviews can go on holiday and leave work behind
during the week before a commit fest. That seems quite undesirable to
me.


Well, I'm sure I'll get flamed for this suggestion, but here goes anyway...

I wrote:
Namely, the week before commitfest I don't actually know if I will have 
the time during that month, but I will make sure my patch is in the 
commitfest just in case I get a few clear days to work on it. Because if 
it isn't there, I can't take advantage of those "found" hours.


A solution to both of these issues (yours and mine) would be to allow 
things to be added *during* the CF month. What is the point of having a 
"freeze" before every CF anyway? Especially if they start out clean. If 
something is ready for review on day 8 of the CF, why not let it be 
added for review?



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-17 Thread Joe Conway

On 5/16/24 22:26, Robert Haas wrote:

For example, imagine that the CommitFest is FORCIBLY empty
until a week before it starts. You can still register patches in the
system generally, but that just means they get CI runs, not that
they're scheduled to be reviewed. A week before the CommitFest,
everyone who has a patch registered in the system that still applies
gets an email saying "click here if you think this patch should be
reviewed in the upcoming CommitFest -- if you don't care about the
patch any more or it needs more work before other people review it,
don't click here". Then, the CommitFest ends up containing only the
things where the patch author clicked there during that week.


100% agree. This is in line with what I suggested on an adjacent part of 
the thread.



The point is - we need a much better signal to noise ratio here. I bet
the number of patches in the CommitFest that actually need review is
something like 25% of the total. The rest are things that are just
parked there by a committer, or that the author doesn't care about
right now, or that are already being actively discussed, or where
there's not a clear way forward.


I think there is another case that no one talks about, but I'm sure 
exists, and that I am not the only one guilty of thinking this way.


Namely, the week before commitfest I don't actually know if I will have 
the time during that month, but I will make sure my patch is in the 
commitfest just in case I get a few clear days to work on it. Because if 
it isn't there, I can't take advantage of those "found" hours.


We could create new statuses for all of those states - "Parked", "In 
Hibernation," "Under Discussion," and "Unclear" - but I think that's 
missing the point. What we really want is to not see that stuff in 
the first place. It's a CommitFest, not

once-upon-a-time-I-wrote-a-patch-Fest.


+1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-16 Thread Joe Conway

On 5/16/24 17:36, Jacob Champion wrote:

On Thu, May 16, 2024 at 2:29 PM Joe Conway  wrote:

If no one, including the author (new or otherwise) is interested in
shepherding a particular patch, what chance does it have of ever getting
committed?


That's a very different thing from what I think will actually happen, which is

- new author posts patch
- community member says "use commitfest!"


Here is where we should point them at something that explains the care 
and feeding requirements to successfully grow a patch into a commit.



- new author registers patch
- no one reviews it
- patch gets automatically booted


Part of the care and feeding instructions should be a warning regarding 
what happens if you are unsuccessful in the first CF and still want to 
see it through.



- community member says "register it again!"
- new author says ಠ_ಠ


As long as this is not a surprise ending, I don't see the issue.


Like Tom said upthread, the issue isn't really that new authors are
somehow uninterested in their own patches.


First, some of them objectively are uninterested in doing more than 
dropping a patch over the wall and never looking back. But admittedly that 
does not happen too often.


Second, I don't think a once every two months effort in order to 
register continuing interest is too much to ask.


And third, if we did something like Magnus' suggestion about a CF 
parking lot, the process would be even more simple.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-16 Thread Joe Conway

On 5/16/24 17:24, Jacob Champion wrote:

On Thu, May 16, 2024 at 2:06 PM Joe Conway  wrote:

Maybe the word "care" was a poor choice, but forcing authors to think
about and decide if they have the "time to shepherd a patch" for the
*next CF* is exactly the point. If they don't, why clutter the CF with it.


Because the community regularly encourages new patch contributors to
park their stuff in it, without first asking them to sign on the
dotted line and commit to the next X months of their free time. If
that's not appropriate, then I think we should decide what those
contributors need to do instead, rather than making a new bar for them
to clear.


If no one, including the author (new or otherwise) is interested in 
shepherding a particular patch, what chance does it have of ever getting 
committed?


IMHO the probability is indistinguishable from zero anyway.

Perhaps we should be more explicit to new contributors that they need to 
either own their patch through the process, or convince someone to do it 
for them.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-16 Thread Joe Conway

On 5/16/24 16:48, Magnus Hagander wrote:
On Thu, May 16, 2024 at 10:46 PM Melanie Plageman 
I was reflecting on why a general purpose patch tracker sounded

appealing to me, and I realized that, at least at this time of year, I
have a few patches that really count as "waiting on author" that I
know I need to do additional work on before they need more review but
which aren't currently my top priority. I should probably simply
withdraw and re-register them. My justification was that I'll lose
them if I don't keep them in the commitfest app. But, I could just,
you know, save them somewhere myself instead of polluting the
commitfest app with them. I don't know if others are in this
situation. Anyway, I'm definitely currently guilty of parking.


One thing I think we've talked about before (but not done) is to 
basically have a CF called "parking lot", where you can park patches 
that aren't active in a commitfest but that you also don't want to be dead. 
It would probably also be doable to have the cf bot run patches in that 
commitfest as well as the current one, if that's what people are using 
it for there.


+1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-16 Thread Joe Conway

On 5/16/24 16:57, Jacob Champion wrote:

On Thu, May 16, 2024 at 1:31 PM Joe Conway  wrote:

Maybe we should just make it a policy that *nothing* gets moved forward
from commitfest-to-commitfest and therefore the author needs to care
enough to register for the next one?


I think that's going to severely disadvantage anyone who doesn't do
this as their day job. Maybe I'm bristling a bit too much at the
wording, but not having time to shepherd a patch is not the same as
not caring.


Maybe the word "care" was a poor choice, but forcing authors to think 
about and decide if they have the "time to shepherd a patch" for the 
*next CF* is exactly the point. If they don't, why clutter the CF with it.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-16 Thread Joe Conway

On 5/16/24 15:47, Tom Lane wrote:

Daniel Gustafsson  writes:

On 16 May 2024, at 20:30, Robert Haas  wrote:
The original intent of CommitFests, and of commitfest.postgresql.org
by extension, was to provide a place where patches could be registered
to indicate that they needed to be reviewed, thus enabling patch
authors and patch reviewers to find each other in a reasonably
efficient way. I don't think it's working any more.



But which part is broken though, the app, our commitfest process and workflow
and its intent, or our assumption that we follow said process and workflow
which may or may not be backed by evidence?  IMHO, from being CFM many times,
there is a fair bit of the latter, which exacerbates the problem.  This is
harder to fix with more or better software though.


Yeah.  I think that Robert put his finger on a big part of the
problem, which is that punting a patch to the next CF is a lot
easier than rejecting it, particularly for less-senior CFMs
who may not feel they have the authority to say no (or at
least doubt that the patch author would accept it).


Maybe we should just make it a policy that *nothing* gets moved forward 
from commitfest-to-commitfest and therefore the author needs to care 
enough to register for the next one?



I spent a good deal of time going through the CommitFest this week



And you deserve a big Thank You for that.


+ many


+1 agreed

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-16 Thread Joe Conway

On 5/16/24 08:05, Joe Conway wrote:

On 5/15/24 21:45, Jonathan S. Katz wrote:

Please provide feedback no later than Wed 2024-05-22 18:00 UTC. As the
beta release takes some extra effort, I want to ensure all changes are
in with time to spare before release day.


"`EXPLAIN` can now show how much time is spent for I/O block reads and 
writes"


Is that really EXPLAIN, or rather EXPLAIN ANALYZE?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: First draft of PG 17 release notes

2024-05-16 Thread Joe Conway

On 5/15/24 23:48, Andres Freund wrote:

On 2024-05-15 10:38:20 +0200, Alvaro Herrera wrote:

I disagree with this.  IMO the impact of the Sawada/Naylor change is
likely to be enormous for people with large tables and large numbers of
tuples to clean up (I know we've had a number of customers in this
situation, I can't imagine any Postgres service provider that doesn't).
The fact that maintenance_work_mem is no longer capped at 1GB is very
important and I think we should mention that explicitly in the release
notes, as setting it higher could make a big difference in vacuum run
times.


+many.

We're having this debate every release. I think the ongoing reticence to note
performance improvements in the release notes is hurting Postgres.

For one, performance improvements are one of the prime reason users
upgrade. Without them being noted anywhere more dense than the commit log,
it's very hard to figure out what improved for users. A halfway widely
applicable performance improvement is far more impactful than many of the
feature changes we do list in the release notes.


many++


For another, it's also very frustrating for developers that focus on
performance. The reticence to note their work, while noting other, far
smaller, things in the release notes, pretty much tells us that our work isn't
valued.


agreed

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-16 Thread Joe Conway

On 5/15/24 21:45, Jonathan S. Katz wrote:

Please provide feedback no later than Wed 2024-05-22 18:00 UTC. As the
beta release takes some extra effort, I want to ensure all changes are
in with time to spare before release day.


"You can find information about all of the features and changes found in
PostgreSQL 17"

Sounds repetitive; maybe:

"Information about all of the features and changes in PostgreSQL 17 can 
be found in the [release notes]"



"more explicitly SIMD instructions" I think ought to be "more explicit 
SIMD instructions"



"PostgreSQL 17 includes a built-in collation provider that provides 
similar semantics to the `C` collation provided by libc."


I think that needs to mention UTF-8 encoding somehow, and "provided by 
libc" is not really true; maybe:


"PostgreSQL 17 includes a built-in collation provider that provides 
similar sorting semantics to the `C` collation except with UTF-8 
encoding rather than SQL_ASCII."



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: First draft of PG 17 release notes

2024-05-11 Thread Joe Conway

On 5/11/24 09:57, Jelte Fennema-Nio wrote:

On Fri, 10 May 2024 at 23:31, Tom Lane  wrote:


Bruce Momjian  writes:
> I looked at both of these.   In both cases I didn't see why the user
> would need to know these changes were made:

I agree that the buffering change is not likely interesting, but
the fact that you can now control-C out of a psql "\c" command
is user-visible.  People might have internalized the fact that
it didn't work, or created complicated workarounds.


The buffering change improved performance up to ~40% in some of the
benchmarks. The case it improves mostly is COPY of large rows and
streaming a base backup. That sounds user-visible enough to me to
warrant an entry imho.


+1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: 'trusted'/'untrusted' PL in DoD/DISA PostgreSQL STIGs

2024-05-05 Thread Joe Conway

On 5/5/24 13:53, Chapman Flack wrote:

The four STIGs suggest the same email address [5] for comments or
proposed revisions. I could send these comments there myself, but
I thought it likely that others in the community have already been
involved in the development of those documents and might have better
connections.


Those docs were developed by the respective companies (Crunchy and EDB) 
in cooperation with DISA. The community has nothing to do with them. I 
suggest you contact the two companies with corrections and suggestions.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-05-03 Thread Joe Conway

On 5/3/24 11:44, Peter Eisentraut wrote:

On 03.05.24 16:13, Tom Lane wrote:

Peter Eisentraut  writes:

On 30.04.24 19:29, Tom Lane wrote:

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm".  I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness.  That being the case, I don't want to invest a
lot of effort in the signedness issue.  Option (1) is clearly
a small change with little if any risk of future breakage.



But note that option 1 would prevent some replication that is currently
working.


The point of this thread though is that it's working only for small
values of "work".  People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.

Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.

If we do not do that, it seems to me we have to buy into making
it work.  That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.


Yes, that is understood.  But anecdotally, replicating between x86-64 and arm64 is 
occasionally used for upgrades or migrations.  In practice, this appears to have 
mostly worked.  If we now discover that it won't work with certain index 
extension modules, it's still usable for most users. Even if we say you have to 
reindex everything afterwards, it's probably still useful for these scenarios.


+1

I have heard similar anecdotes, and the reported experience goes even further -- 
many such upgrade/migration uses, with exceedingly rare reported failures.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: New GUC autovacuum_max_threshold ?

2024-04-26 Thread Joe Conway

On 4/26/24 09:31, Robert Haas wrote:

On Fri, Apr 26, 2024 at 9:22 AM Joe Conway  wrote:

Although I don't think 50 is necessarily too small. In my view,
having autovac run very quickly, even if more frequently, provides an
overall better user experience.


Can you elaborate on why you think that? I mean, to me, that's almost
equivalent to removing autovacuum_vacuum_scale_factor entirely,
because only for very small tables will that calculation produce a
value lower than 500k.


If I understood Nathan's proposed calculation, for small tables you would still 
get (thresh + sf * numtuples). Once that number exceeds the new limit 
parameter, the latter would kick in. So small tables would retain the 
current behavior and large enough tables would be clamped.
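
To spell out how I read that, here is a tiny illustrative sketch in C (the 
function and parameter names are mine, not from the actual patch):

8<---
/*
 * Illustrative only: how I read the proposed clamp interacting with
 * the existing threshold formula.
 */
static double
autovac_trigger_point(double reltuples,
                      double base_thresh,   /* autovacuum_vacuum_threshold */
                      double scale_factor,  /* autovacuum_vacuum_scale_factor */
                      double max_thresh)    /* the proposed new limit */
{
    /* current behavior: thresh + sf * numtuples */
    double      thresh = base_thresh + scale_factor * reltuples;

    /* proposed: once that exceeds the new limit, the limit wins */
    if (max_thresh >= 0 && thresh > max_thresh)
        thresh = max_thresh;

    return thresh;
}
8<---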



We might need to try to figure out some test cases here. My intuition
is that this is going to vacuum large tables insanely aggressively.


It depends on workload to be sure. Just because a table is large, it 
doesn't mean that dead rows are generated that fast.


Admittedly it has been quite a while since I looked at all this that 
closely, but if A/V runs on some large busy table for a few milliseconds 
once every few minutes, that is far less disruptive than A/V running for 
tens of seconds once every few hours or for minutes ones every few days 
-- or whatever. The key thing to me is the "few milliseconds" runtime. 
The short duration means that no one notices an impact, and the longer 
duration almost guarantees that an impact will be felt.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: New GUC autovacuum_max_threshold ?

2024-04-26 Thread Joe Conway

On 4/26/24 04:43, Michael Banck wrote:

So this proposal (probably along with a higher default threshold than
50, but IMO less than what Robert and Nathan suggested) sounds like
a step forward to me. DBAs can set the threshold lower if they want, or
maybe we can just turn it off by default if we cannot agree on a sane
default, but I think this (using the simplified formula from Nathan) is
a good approach that takes some pain away from autovacuum tuning and
reserves that for the really difficult cases.


+1 to the above

Although I don't think 50 is necessarily too small. In my view, 
having autovac run very quickly, even if more frequently, provides an 
overall better user experience.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Table AM Interface Enhancements

2024-04-10 Thread Joe Conway

On 4/10/24 09:19, Robert Haas wrote:

When you commit a patch and another committer writes a post-commit
review saying that your patch has so many serious problems that he
gave up on reviewing before enumerating all of them, that's a really
bad sign. That should be a strong signal to you to step back and take
a close look at whether you really understand the area of the code
that you're touching well enough to be doing whatever it is that
you're doing. If I got a review like that, I would have reverted the
patch instantly, given up for the release cycle, possibly given up on
the patch permanently, and most definitely not tried again to commit
unless I was absolutely certain that I'd learned a lot in the meantime
*and* had the agreement of the committer who wrote that review (or
maybe some other committer who was acknowledged as an expert in that
area of the code).





It's not Andres's job to make sure my patches are not broken. It's my
job. That applies to the patches I write, and the patches written by
other people that I commit. If I commit something and it turns out
that it is broken, that's my bad. If I commit something and it turns
out that it does not have consensus, that is also my bad. It is not
the fault of the other people for not helping me get my patches to a
state where they are up to project standard. It is my fault, and my
fault alone, for committing something that was not ready. Now that
does not mean that it isn't frustrating when I can't get the help I
need. It is extremely frustrating. But the solution is not to commit
anyway and then blame the other people for not providing feedback.


+many

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Security lessons from liblzma

2024-04-09 Thread Joe Conway

On 4/8/24 22:57, Bruce Momjian wrote:

On Sat, Mar 30, 2024 at 04:50:26PM -0400, Robert Haas wrote:

An awful lot of what we do operates on the principle that we know the
people who are involved and trust them, and I'm glad we do trust them,
but the world is full of people who trusted somebody too much and
regretted it afterwards. The fact that we have many committers rather
than a single maintainer probably reduces risk at least as far as the
source repository is concerned, because there are more people paying
attention to potentially notice something that isn't as it should be.


One unwritten requirement for committers is that we are able to
communicate with them securely.  If we cannot do that, they potentially
could be forced by others, e.g., governments, to add code to our
repositories.

Unfortunately, there is no good way for them to communicate with us
securely once they are unable to communicate with us securely.  I
suppose some special word could be used to communicate that status ---
that is how it was done in non-electronic communication in the past.


I don't know how that really helps. If one of our committers is under 
duress, they probably cannot risk outing themselves anyway.


The best defense, IMHO, is the fact that our source code is open and can 
be reviewed freely.


The trick is to get folks to do the review.

I know, for example, at $past_employer we had a requirement to get 
someone on our staff to review every single commit in order to maintain 
certain certifications. Of course there is no guarantee that such 
reviews would catch everything, but maybe we could establish post commit 
reviews by contributors in a more rigorous way? Granted, getting more 
qualified volunteers is not a trivial problem...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Joe Conway

On 4/8/24 11:05, Tom Lane wrote:

Pavel Borisov  writes:

IMO the fact that people struggle to work on patches, and make them better,
etc. is an immense blessing for the Postgres community. Is the peak of
commits really a big problem provided we have 6 months before actual
release? I doubt March patches tend to be worse than the November ones.


Yes, it's a problem, and yes the average quality of last-minute
patches is visibly worse than that of patches committed in a less
hasty fashion.  We have been through this every year for the last
couple decades, seems like, and we keep re-learning that lesson
the hard way.  I'm just distressed at our utter failure to learn
from experience.



I don't dispute that we could do better, and this is just a simplistic 
look based on "number of commits per day", but the attached does put it 
in perspective to some extent.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Re: Security lessons from liblzma

2024-03-31 Thread Joe Conway

On 3/31/24 11:49, Tom Lane wrote:

Joe Conway  writes:
I am saying maybe those patches should be eliminated in favor of our 
tree including build options that would produce the same result.


I don't really see how that can be expected to work sanely.
It turns the responsibility for platform-specific build issues
on its head, and it doesn't work at all for issues discovered
after we make a release.  The normal understanding of how you
can vet a distro's package is that you look at the package
contents (the SRPM in Red Hat world and whatever the equivalent
concept is elsewhere), check that the contained tarball
matches upstream and that the patches and build instructions
look sane, and then build it locally and check for a match to
the distro's binary package.  Even if we could overcome the
obstacles to putting the patch files into the upstream tarball,
we're surely not going to include the build instructions, so
we'd not have moved the needle very far in terms of whether the
packager could do something malicious.


True enough I guess.

But it has always bothered me how many patches get applied to the 
upstream tarballs by the OS package builders. Some of them, e.g. glibc 
on RHEL 7, include more than 1000 patches that you would have to 
manually vet if you cared enough and had the skills. Last time I looked 
at the openssl package sources it was similar in volume and complexity. 
They might as well be called forks if everyone were being honest about it...


I know our PGDG packages are no big deal compared to that, fortunately.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Security lessons from liblzma

2024-03-31 Thread Joe Conway

On 3/30/24 21:52, Bruce Momjian wrote:

On Sat, Mar 30, 2024 at 07:54:00PM -0400, Joe Conway wrote:

Virtually every RPM source, including ours, contains out of tree patches
that get applied on top of the release tarball. At least for the PGDG
packages, it would be nice to integrate them into our git repo as build
options or whatever so that the packages could be built without any patches
applied to it. Add a tarball that is signed and traceable back to the git
tag, and we would be in a much better place than we are now.


How would someone access the out-of-tree patches?  I think Debian
includes the patches in its source tarball.


I am saying maybe those patches should be eliminated in favor of our 
tree including build options that would produce the same result.


For example, these patches are applied to our release tarball files when 
the RPM is being built for pg16 on RHEL 9:


-
https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-rpm-pgsql.patch;h=d9b6d12b7517407ac81352fa325ec91b05587641;hb=HEAD

https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-var-run-socket.patch;h=f2528efaf8f4681754b20283463eff3e14eedd39;hb=HEAD

https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-conf.patch;h=da28ed793232316dd81fdcbbe59a6479b054a364;hb=HEAD

https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-perl-rpath.patch;h=748c42f0ec2c9730af3143e90e5b205c136f40d9;hb=HEAD
-

Nothing too crazy, but wouldn't it be better if no patches were required 
at all?


Ideally we should have reproducible builds so that starting with our 
tarball (which is traceable back to the git release tag) one can easily 
obtain the same binary as what the RPMs/DEBs deliver.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Security lessons from liblzma

2024-03-30 Thread Joe Conway

On 3/30/24 19:54, Joe Conway wrote:

On 2024-03-30 16:50:26 -0400, Robert Haas wrote:

or what Tom does when he builds the release tarballs.


Tom follows this, at least last time I checked:

https://wiki.postgresql.org/wiki/Release_process


Reading through that, I wonder if this part is true anymore:

  In principle this could be done anywhere, but again there's a concern
  about reproducibility, since the results may vary depending on
  installed bison, flex, docbook, etc versions. Current practice is to
  always do this as pgsql on borka.postgresql.org, so it can only be
  done by people who have a login there. In detail:

Maybe if we split out the docs from the release tarball, we could also 
add the script (mk-release) to our git repo?


Some other aspects of that wiki page look out of date too. Perhaps it 
needs an overall update? Maybe Tom and/or Magnus could weigh in here.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Security lessons from liblzma

2024-03-30 Thread Joe Conway

On 3/30/24 17:12, Andres Freund wrote:

Hi,

On 2024-03-30 16:50:26 -0400, Robert Haas wrote:

We might also want to move toward signing commits and tags. One of the
meson maintainers was recommending that on-list not long ago.


I don't know how valuable the commit signing really is, but I strongly agree
that we should sign both tags and tarballs.


+1



We should think about weaknesses that might occur during the packaging
process, too. If someone who alleges that they're packaging PG is really
packaging PG w/badstuff123.patch, how would we catch that?


I don't think we realistically can. The environment, configure and compiler
options will influence things too much to do any sort of automatic
verification.

But that shouldn't stop us from ensuring that at least the packages
distributed via *.postgresql.org are reproducibly built.

Another good avenue for introducing an attack would be to propose some distro
specific changes to the packaging teams - there's a lot fewer eyes there.  I
think it might be worth working with some of our packagers to integrate more
of their changes into our tree.


Huge +1 to that. I have thought many times, and even said to Devrim, that a 
huge number of people trust him to not be evil.


Virtually every RPM source, including ours, contains out of tree patches 
that get applied on top of the release tarball. At least for the PGDG 
packages, it would be nice to integrate them into our git repo as build 
options or whatever so that the packages could be built without any 
patches applied to it. Add a tarball that is signed and traceable back 
to the git tag, and we would be in a much better place than we are now.



I can't for example verify what the infrastructure team is doing,


Not sure what you feel like you should be able to follow -- anything 
specific?



or what Tom does when he builds the release tarballs.


Tom follows this, at least last time I checked:

https://wiki.postgresql.org/wiki/Release_process


This one however, I think we could improve upon. Making sure the tarball
generation is completely binary reproducible and providing means of checking
that would surely help. This will be a lot easier if we, as dicussed
elsewhere, I believe, split out the generated docs into a separately
downloadable archive. We already stopped including other generated files
recently.


again, big +1


We shouldn't make the mistake of assuming that bad things can't happen to
us.


+1


and again with the +1 ;-)

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Popcount optimization using AVX512

2024-03-25 Thread Joe Conway

On 3/25/24 11:12, Tom Lane wrote:

"Amonson, Paul D"  writes:

I am re-posting the patches as CI for Mac failed (CI error not code/test 
error). The patches are the same as last time.


Just for a note --- the cfbot will re-test existing patches every
so often without needing a bump.  The current cycle period seems to
be about two days.



Just an FYI -- there seems to be an issue with all three of the macos 
cfbot runners (mine included). I spent time over the weekend working 
with Thomas Munro (added to CC list) trying different fixes to no avail. 
Help from macos CI wizards would be gratefully accepted...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Joe Conway

On 3/19/24 07:49, Andrew Dunstan wrote:



On Tue, Mar 19, 2024 at 5:26 AM Heikki Linnakangas wrote:


I want to remind everyone of this from Gabriele's first message that
started this thread:

 > At the moment, a possible workaround is that `ALTER SYSTEM` can be blocked
 > by making the postgresql.auto.conf read only, but the returned message is
 > misleading and that’s certainly bad user experience (which is very
 > important in a cloud native environment):
 >
 >
 > ```
 > postgres=# ALTER SYSTEM SET wal_level TO minimal;
 > ERROR:  could not open file "postgresql.auto.conf": Permission denied
 > ```

I think making the config file read-only is a fine solution. If you
don't want postgres to mess with the config files, forbid it with the
permission system.

Problems with pg_rewind, pg_basebackup were mentioned with that
approach. I think if you want the config files to be managed outside
PostgreSQL, by kubernetes, patroni or whatever, it would be good for
them to be read-only to the postgres user anyway, even if we had a
mechanism to disable ALTER SYSTEM. So it would be good to fix the
problems with those tools anyway.

The error message is not great, I agree with that. Can we improve it?
Maybe just add a HINT like this:

postgres=# ALTER SYSTEM SET wal_level TO minimal;
ERROR:  could not open file "postgresql.auto.conf" for writing:
Permission denied
HINT:  Configuration might be managed outside PostgreSQL


Perhaps we could make that even better with a GUC though. I propose a
GUC called 'configuration_managed_externally = true / false". If you set
it to true, we prevent ALTER SYSTEM and make the error message more
definitive:

postgres=# ALTER SYSTEM SET wal_level TO minimal;
ERROR:  configuration is managed externally

As a bonus, if that GUC is set, we could even check at server startup
that all the configuration files are not writable by the postgres user,
and print a warning or refuse to start up if they are.

(Another way to read this proposal is to rename the GUC that's been
discussed in this thread to 'configuration_managed_externally'. That
makes it look less like a security feature, and describes the intended
use case.)




I agree with pretty much all of this.



+1 me too.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Improving contrib/tablefunc's error reporting

2024-03-09 Thread Joe Conway

On 3/9/24 15:39, Tom Lane wrote:

Joe Conway  writes:

On 3/9/24 13:07, Tom Lane wrote:

BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.


Something like the attached what you had in mind? (applies on top of 
your two patches)


Yeah, exactly.

As far as the comment change goes:

-   * - attribute [1] of the sql tuple is the category; no need to check it -
-   * attribute [2] of the sql tuple should match attributes [1] to [natts]
+   * attribute [1] of the sql tuple is the category; no need to check it
+   * attribute [2] of the sql tuple should match attributes [1] to [natts - 1]
 * of the return tuple

I suspect that this block looked better when originally committed but
didn't survive contact with pgindent.  You should check whether your
version will; if not, some dashes on the /* line will help.


Thanks for the review and heads up. I fiddled with it a bit to make it 
pgindent clean. I saw you commit your patches so I just pushed mine.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Improving contrib/tablefunc's error reporting

2024-03-09 Thread Joe Conway

On 3/9/24 13:07, Tom Lane wrote:

Joe Conway  writes:

On 3/5/24 17:04, Tom Lane wrote:

After reading the thread at [1], I could not escape the feeling
that contrib/tablefunc's error reporting is very confusing.



The changes all look good to me and indeed more consistent with the docs.
Do you want me to push these? If so, development tip  only, or backpatch?


I can push that.  I was just thinking HEAD, we aren't big on changing
error reporting in back branches.


BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.



I can take a look at this. Presumably this would not be for backpatching.


I'm not sure whether that could produce results bad enough to be
called a bug or not.  But I too would lean towards not back-patching,
in view of the lack of field complaints.



Something like the attached what you had in mind? (applies on top of 
your two patches)


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Make tablefunc crosstab() check typmod too

tablefunc connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID. Fix that by making
the crosstab() check look more like the connectby() check.

--- tablefunc.c.v0002	2024-03-09 14:38:29.285393890 -0500
+++ tablefunc.c	2024-03-09 14:43:47.021399855 -0500
@@ -1527,10 +1527,10 @@
 compatCrosstabTupleDescs(TupleDesc ret_tupdesc, TupleDesc sql_tupdesc)
 {
 	int			i;
-	Form_pg_attribute ret_attr;
 	Oid			ret_atttypid;
-	Form_pg_attribute sql_attr;
 	Oid			sql_atttypid;
+	int32		ret_atttypmod;
+	int32		sql_atttypmod;
 
 	if (ret_tupdesc->natts < 2)
 		ereport(ERROR,
@@ -1539,34 +1539,39 @@
  errdetail("Return row must have at least two columns.")));
 	Assert(sql_tupdesc->natts == 3);	/* already checked by caller */
 
-	/* check the rowid types match */
+	/* check the row_name types match */
 	ret_atttypid = TupleDescAttr(ret_tupdesc, 0)->atttypid;
 	sql_atttypid = TupleDescAttr(sql_tupdesc, 0)->atttypid;
-	if (ret_atttypid != sql_atttypid)
+	ret_atttypmod = TupleDescAttr(ret_tupdesc, 0)->atttypmod;
+	sql_atttypmod = TupleDescAttr(sql_tupdesc, 0)->atttypmod;
+	if (ret_atttypid != sql_atttypid ||
+		(ret_atttypmod >= 0 && ret_atttypmod != sql_atttypmod))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATATYPE_MISMATCH),
 				 errmsg("invalid crosstab return type"),
 				 errdetail("Source row_name datatype %s does not match return row_name datatype %s.",
-						   format_type_be(sql_atttypid),
-						   format_type_be(ret_atttypid))));
+						   format_type_with_typemod(sql_atttypid, sql_atttypmod),
+						   format_type_with_typemod(ret_atttypid, ret_atttypmod))));
 
 	/*
-	 * - attribute [1] of the sql tuple is the category; no need to check it -
-	 * attribute [2] of the sql tuple should match attributes [1] to [natts]
+	 * attribute [1] of the sql tuple is the category; no need to check it
+	 * attribute [2] of the sql tuple should match attributes [1] to [natts - 1]
 	 * of the return tuple
 	 */
-	sql_attr = TupleDescAttr(sql_tupdesc, 2);
+	sql_atttypid = TupleDescAttr(sql_tupdesc, 2)->atttypid;
+	sql_atttypmod = TupleDescAttr(sql_tupdesc, 2)->atttypmod;
 	for (i = 1; i < ret_tupdesc->natts; i++)
 	{
-		ret_attr = TupleDescAttr(ret_tupdesc, i);
-
-		if (ret_attr->atttypid != sql_attr->atttypid)
+		ret_atttypid = TupleDescAttr(ret_tupdesc, i)->atttypid;
+		ret_atttypmod = TupleDescAttr(ret_tupdesc, i)->atttypmod;
+		if (ret_atttypid != sql_atttypid ||
+		(ret_atttypmod >= 0 && ret_atttypmod != sql_atttypmod))
 			ereport(ERROR,
 	(errcode(ERRCODE_DATATYPE_MISMATCH),
 	 errmsg("invalid crosstab return type"),
 	 errdetail("Source value datatype %s does not match return value datatype %s in column %d.",
-			   format_type_be(sql_attr->atttypid),
-			   format_type_be(ret_attr->atttypid),
+			   format_type_with_typemod(sql_atttypid, sql_atttypmod),
+			   format_type_with_typemod(ret_atttypid, ret_atttypmod),
 			   i + 1)));
 	}
 


Re: Improving contrib/tablefunc's error reporting

2024-03-09 Thread Joe Conway

On 3/5/24 17:04, Tom Lane wrote:

After reading the thread at [1], I could not escape the feeling
that contrib/tablefunc's error reporting is very confusing.
Looking into the source code, I soon found that it is also
very inconsistent, with similar error reports being phrased
quite differently.  The terminology for column names doesn't
match the SGML docs either.  And there are some places that are
so confused about whether they are complaining about the calling
query or the called query that the output is flat-out backwards.
So at that point my nascent OCD wouldn't let me rest without
fixing it.  Here's a quick patch series to do that.

For review purposes, I split this into two patches.  0001 simply
adds some more test cases to reach currently-unexercised error
reports.  Then 0002 makes my proposed code changes and shows
how the existing error messages change.

I'm not necessarily wedded to the phrasings I used here,
in case anyone has better ideas.


The changes all look good to me and indeed more consistent with the docs.

Do you want me to push these? If so, development tip  only, or backpatch?


BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.


I can take a look at this. Presumably this would not be for backpatching.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2024-03-08 Thread Joe Conway

On 3/8/24 12:28, Andrey M. Borodin wrote:

Hello everyone!

Thanks for working on this, really nice feature!


On 9 Jan 2024, at 01:40, Joe Conway  wrote:

Thanks -- will have a look


Joe, recently folks proposed a lot of patches in this thread that seem to have 
diverged from the original implementation approach.
As the author of CF entry [0], can you please comment on which patch version 
needs review?



I don't know if I agree with the proposed changes, but I have also been 
waiting to see how the parallel discussion regarding COPY extensibility 
shakes out.


And there were a couple of issues found that need to be tracked down.

Additionally I have had time/availability challenges recently.

Overall, chances seem slim that this will make it into 17, but I have 
not quite given up hope yet either.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Improving contrib/tablefunc's error reporting

2024-03-05 Thread Joe Conway

On 3/5/24 17:04, Tom Lane wrote:

After reading the thread at [1], I could not escape the feeling
that contrib/tablefunc's error reporting is very confusing.
Looking into the source code, I soon found that it is also
very inconsistent, with similar error reports being phrased
quite differently.  The terminology for column names doesn't
match the SGML docs either.  And there are some places that are
so confused about whether they are complaining about the calling
query or the called query that the output is flat-out backwards.
So at that point my nascent OCD wouldn't let me rest without
fixing it.  Here's a quick patch series to do that.

For review purposes, I split this into two patches.  0001 simply
adds some more test cases to reach currently-unexercised error
reports.  Then 0002 makes my proposed code changes and shows
how the existing error messages change.

I'm not necessarily wedded to the phrasings I used here,
in case anyone has better ideas.

BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.


Been a long time since I gave contrib/tablefunc any love I guess ;-)

I will have a look at your patches, and the other issue you mention, but 
it might be a day or three before I can give it some quality time.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





PostgreSQL Contributors Updates

2024-03-03 Thread Joe Conway

All,

The PostgreSQL Contributor Page 
(https://www.postgresql.org/community/contributors/) includes people who 
have made substantial, long-term contributions of time and effort to the 
PostgreSQL project. The PostgreSQL Contributors Team recognizes the 
following people for their contributions.


New PostgreSQL Contributors:

* Bertrand Drouvot
* Gabriele Bartolini
* Richard Guo

New PostgreSQL Major Contributors:

* Alexander Lakhin
* Daniel Gustafsson
* Dean Rasheed
* John Naylor
* Melanie Plageman
* Nathan Bossart

Thank you and congratulations to all!

Thanks,
On behalf of the PostgreSQL Contributors Team

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: An improved README experience for PostgreSQL

2024-02-28 Thread Joe Conway

On 2/28/24 14:36, Tom Lane wrote:

Nathan Bossart  writes:

On Wed, Feb 28, 2024 at 01:08:03PM -0600, Nathan Bossart wrote:

-PostgreSQL Database Management System
-=
+# PostgreSQL Database Management System



This change can be omitted, which makes the conversion even simpler.


That's a pretty convincing proof-of-concept.  Let's just do this,
and then make sure to keep the file legible as plain text.


+1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: An improved README experience for PostgreSQL

2024-02-28 Thread Joe Conway

On 2/28/24 12:25, Daniel Gustafsson wrote:

On 28 Feb 2024, at 18:02, Nathan Bossart  wrote:

On Wed, Feb 28, 2024 at 02:46:11PM +0100, Daniel Gustafsson wrote:

On 26 Feb 2024, at 21:30, Tom Lane  wrote:
Nathan Bossart  writes:

I think this would be nice.  If the Markdown version is reasonably readable
as plain-text, maybe we could avoid maintaining two READMEs files, too.
But overall, +1 to modernizing the README a bit.


Per past track record, we change the top-level README only once every
three years or so, so I doubt it'd be too painful to maintain two
versions of it.


It wont be, and we kind of already have two since there is another similar
README displayed at https://www.postgresql.org/ftp/.  That being said, a
majority of those reading the README will likely be new developers accustomed
to Markdown (or doing so via interfaces such as Github) so going to Markdown
might not be a bad idea.  We can also render a plain text version with pandoc
for release builds should we want to.


Sorry, my suggestion wasn't meant to imply that I have any strong concerns
about maintaining two README files.  If we can automate generating one or
the other, that'd be great, but I don't see that as a prerequisite to
adding a Markdown version.


Agreed, and I didn't say we should do it but rather that we can do it based on
the toolchain we already have.  Personally I think just having a Markdown
version is enough, it's become the de facto standard for such documentation for
good reasons.


+1

Markdown is pretty readable as text; I'm not sure why we need both.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: PGC_SIGHUP shared_buffers?

2024-02-19 Thread Joe Conway

On 2/19/24 13:13, Andres Freund wrote:

On 2024-02-19 09:19:16 -0500, Joe Conway wrote:

Couldn't we scale the rounding, e.g. allow small allocations as we do now,
but above some number always round? E.g. maybe >= 2GB round to the nearest
256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB,
etc?


That'd make the translation considerably more expensive. Which is important,
given how common an operation this is.



Perhaps it is not practical, doesn't help, or maybe I misunderstand, but 
my intent was that the rounding be done/enforced when setting the GUC 
value which surely cannot be that often.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: PGC_SIGHUP shared_buffers?

2024-02-19 Thread Joe Conway

On 2/18/24 15:35, Andres Freund wrote:

On 2024-02-18 17:06:09 +0530, Robert Haas wrote:

How many people set shared_buffers to something that's not a whole
number of GB these days?


I'd say the vast majority of postgres instances in production run with less
than 1GB of s_b. Just because numbers wise the majority of instances are
running on small VMs and/or many PG instances are running on one larger
machine.  There are a lot of instances where the total available memory is
less than 2GB.


I mean I bet it happens, but in practice if you rounded to the nearest GB,
or even the nearest 2GB, I bet almost nobody would really care. I think it's
fine to be opinionated here and hold the line at a relatively large granule,
even though in theory people could want something else.


I don't believe that at all unfortunately.


Couldn't we scale the rounding, e.g. allow small allocations as we do 
now, but above some number always round? E.g. maybe >= 2GB round to the 
nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the 
nearest 1GB, etc?
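
Something like this toy sketch is what I had in mind (standalone C just to
show the shape of the idea; the thresholds are only the examples above, and
none of this is meant as actual GUC-check code):

8<-------------------
#include <stdint.h>

/* toy sketch of scaled rounding -- illustration only, not backend code */
static int64_t
round_shared_buffers(int64_t request_bytes)
{
    const int64_t gb = INT64_C(1024) * 1024 * 1024;
    int64_t       granule;

    if (request_bytes >= 8 * gb)
        granule = gb;                           /* >= 8GB: 1GB steps */
    else if (request_bytes >= 4 * gb)
        granule = 512 * INT64_C(1024) * 1024;   /* >= 4GB: 512MB steps */
    else if (request_bytes >= 2 * gb)
        granule = 256 * INT64_C(1024) * 1024;   /* >= 2GB: 256MB steps */
    else
        return request_bytes;                   /* small settings unchanged */

    /* round to the nearest granule */
    return ((request_bytes + granule / 2) / granule) * granule;
}
8<-------------------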


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Joe Conway

On 2/16/24 04:16, Daniel Gustafsson wrote:

On 15 Feb 2024, at 16:49, Peter Eisentraut  wrote:



1. All the block ciphers currently supported by crypt() and gen_salt() are not 
FIPS-compliant.

2. The crypt() and gen_salt() methods built on top of them (modes of operation, 
kind of) are not FIPS-compliant.


I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant
ciphers when the compiled against OpenSSL is running with FIPS mode enabled, or
raise a WARNING when used?  It seems rather unlikely that someone running
OpenSSL with FIPS=yes want to use our DES cipher without there being an error
or misconfiguration somewhere.

Something like the below untested pseudocode.

diff --git a/contrib/pgcrypto/pgcrypto.c b/contrib/pgcrypto/pgcrypto.c
index 96447c5757..3d4391ebe1 100644
--- a/contrib/pgcrypto/pgcrypto.c
+++ b/contrib/pgcrypto/pgcrypto.c
@@ -187,6 +187,14 @@ pg_crypt(PG_FUNCTION_ARGS)
   *resbuf;
text   *res;
  
+#if defined FIPS_mode

+   if (FIPS_mode())
+#else
+   if (EVP_default_properties_is_fips_enabled(OSSL_LIB_CTX_get0_global_default()))
+#endif
+   ereport(ERROR,
+   (errmsg("not available when using OpenSSL in FIPS mode")));


Makes sense +1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: pgjdbc is not working with PKCS8 certificates with password

2024-02-07 Thread Joe Conway

On 2/7/24 06:42, just madhu wrote:

On further investigation,
With a certificate generated as below, the JDBC connection is successful:
openssl pkcs8 -topk8 -inform PEM -in client.key -outform DER -out client.pk8 -passout pass:foobar -v1 PBE-MD5-DES


But a connection from pgAdmin (connection failed: 
\SSLCerts\pk8_pass\client_pass_PBE.pk8": no start line) and psql(psql: 
error: could not load private key file "client_pass_PBE.pk8": 
unsupported) is failing


Is there a common way in which certificate with passwords can be 
created  for both libpq and jdbc ?



You may want to check with the pgjdbc project on github rather than (or 
in addition to?) here; see:


  https://github.com/pgjdbc/pgjdbc/issues

Joe

On Wed, Feb 7, 2024 at 3:17 PM just madhu <justvma...@gmail.com> wrote:


Hi ,

postgresql-42.7.1.jar

Trying to establish a connection using a PKCS8 certificate created
with a password.

openssl pkcs8 -topk8 -inform PEM -in client.key -outform DER -out client.pk8 -passout pass:foobar

I set the properties as below:
...
sslProperties.setProperty("sslkey", "client.pk8");
sslProperties.setProperty("sslpassword","foobar");
...
Connection connection = DriverManager.getConnection(jdbcUrl, sslProperties);

This is failing with the error:
org.postgresql.util.PSQLException: SSL error: Connection reset
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:43)
at org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:584)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:168)
...

Regards,
Madhu



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-01-15 Thread Joe Conway

On 1/12/24 20:17, Jeff Davis wrote:

On Fri, 2024-01-05 at 16:11 +0530, Ashutosh Bapat wrote:

I don't think we need to add a test for every FDW. E.g. adding a test
in file_fdw would be pointless. But postgres_fdw is special. The test
could further create a foreign table ftab_foo on subscriber
referencing foo on publisher and then compare the data from foo and
ftab_foo to make sure that the replication is happening. This will
serve as a good starting point for replicated tables setup in a
sharded cluster.



Attached updated patch set with added TAP test for postgres_fdw, which
uses a postgres_fdw server as the source for subscription connection
information.

I think 0004 needs a bit more work, so I'm leaving it off for now, but
I'll bring it back in the next patch set.


I took a quick scan through the patch. The only thing that jumped out at 
me was that it seems like it might make sense to use 
quote_literal_cstr() rather than defining your own appendEscapedValue() 
function?
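
Roughly what I was thinking, as a sketch (the function and argument names
here are made up for illustration, not taken from your patch):

8<-------------------
#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/builtins.h"

/*
 * quote_literal_cstr() hands back a palloc'd, single-quoted and escaped
 * copy of its input, which is what appendEscapedValue() otherwise has
 * to do by hand.
 */
static void
append_conninfo_option(StringInfo buf, const char *keyword, const char *value)
{
    appendStringInfo(buf, " %s = %s", keyword, quote_literal_cstr(value));
}
8<-------------------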


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: System username in pg_stat_activity

2024-01-10 Thread Joe Conway

On 1/10/24 08:59, Magnus Hagander wrote:

On Wed, Jan 10, 2024 at 2:56 PM Bertrand Drouvot wrote:

I think it depends what we want the new field to reflect. If it is the exact
same thing as the SYSTEM_USER then I think it has to be text (as the SYSTEM_USER
is made of "auth_method:identity"). Now if we want it to be "only" the identity
part of it, then the `name` datatype would be fine. I'd vote for the exact same
thing as the SYSTEM_USER (means auth_method:identity).


I definitely think it should be the same. If it's not exactly the
same, then it should be *two* columns, one with auth method and one
with the name.

And thinking more about it maybe that's cleaner, because that makes it
easier to do things like filter based on auth method?


+1 for the overall feature and +1 for two columns


> + 
> +  
> +   authname name
> +  
> +  
> +   The authentication method and identity (if any) that the user
> +   used to log in. It contains the same value as
> +returns in the backend.
> +  
> + 

I'm fine with auth_method:identity.

> +S.authname,

What about using system_user as the field name? (because if we keep
auth_method:identity it's not really the authname anyway).


I was worried system_user or sysuser would both be confusing with the
fact that we have usesysid -- which would reference a *different*
sys...



I think if it is exactly "system_user" it would be pretty clearly a 
match for SYSTEM_USER



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2024-01-08 Thread Joe Conway

On 1/8/24 14:36, Dean Rasheed wrote:

On Thu, 7 Dec 2023 at 01:10, Joe Conway  wrote:


The attached should fix the CopyOut response to say one column.



Playing around with this, I found a couple of cases that generate an error:

COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);

COPY (VALUES (1), (2)) TO stdout WITH (format json);

both of those generate the following:

ERROR:  record type has not been registered



Thanks -- will have a look

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2024-01-07 Thread Joe Conway

On 1/6/24 15:10, Tom Lane wrote:

Joe Conway  writes:
The only code specific comments were Tom's above, which have been 
addressed. If there are no serious objections I plan to commit this 
relatively soon.


I had not actually read this patchset before, but now I have, and
I have a few minor suggestions:


Many thanks!


* The API comment for PQchangePassword should specify that encryption
is done according to the server's password_encryption setting, and
probably the SGML docs should too.  You shouldn't have to read the
code to discover that.


Check


* I don't especially care for the useless initializations of
encrypted_password, fmtpw, and fmtuser.  In all those cases the
initial NULL is immediately replaced by a valid value.  Probably
the compiler will figure out that the initializations are useless,
but human readers have to do so as well.  Moreover, I think this
style is more bug-prone not less so, because if you ever change
the logic in a way that causes some code paths to fail to set
the variables, you won't get use-of-possibly-uninitialized-value
warnings from the compiler.

* Perhaps move the declaration of "buf" to the inner block where
it's actually used?


Makes sense -- fixed


* This could be shortened to just "return res":

+   if (!res)
+   return NULL;
+   else
+   return res;


Heh, apparently I needed more coffee at this point :-)


* I'd make the SGML documentation a bit more explicit about the
return value, say

+   Returns a PGresult pointer representing
+   the result of the ALTER USER command, or
+   a null pointer if the routine failed before issuing any command.


Fixed.

I also ran pgindent. I was kind of surprised/unhappy when it made me 
change this (which aligned the two var names):

8<
PQExpBufferData buf;
PGresult       *res;
8<

to this (which leaves the var names unaligned):
8<
PQExpBufferData buf;
PGresult   *res;
8<

Anyway, the resulting adjustments attached.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared   188
*** 191,193 
--- 191,194 
  PQclosePortal 189
  PQsendClosePrepared   190
  PQsendClosePortal 191
+ PQchangePassword  192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..dcb7c9f 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 
--- 1372,1455 
  
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. The password encryption is performed by PQencryptPasswordConn(),
+  * which is passed a NULL for the algorithm argument. Hence encryption
+  * is done according to the server's password_encryption
+  * setting. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the SQL name of the target user,
+  * and the cleartext password.
+  *
+  * Return value is the PGresult of the executed ALTER USER statement
+  * or NULL if we never get there. The caller is responsible to PQclear()
+  * the returned PGresult.
+  *
+  * PQresultStatus() should be called to check the return value for errors,
+  * and PQerrorMessage() used to get more information about such errors.
+  */
+ PGresult *
+ PQchangePassword(PGconn *conn, const char *user, const char *passwd)
+ {
+ 	char	   *encrypted_password = PQencryptPasswordConn(conn, passwd,
+ 		   user, NULL);
+ 
+ 	if (!encrypted_password)
+ 	{
+ 		/* PQencryptPasswordConn() already registered the error */
+ 		return NULL;
+ 	}
+ 	else
+ 	{
+ 		char	   *fmtpw = PQescapeLiteral(conn, encrypted_password,
+ 			strlen(encrypted_password));
+ 
+ 		/* no longer needed, so clean up now */
+ 		PQfreemem(encrypted_password);
+ 
+ 		if (!fmtpw)
+ 		{
+ 			/* PQescapeLiteral() already registered the error */
+ 			return NULL;
+ 		}
+ 		else
+ 		{
+ 			char	   *fmtuser = PQescapeIdentifier(conn, user, strlen(user));
+ 
+ 			if (!fmtuser)
+ 			{
+ /* PQescapeIdentifier() already registered th

Re: Password leakage avoidance

2024-01-06 Thread Joe Conway

On 1/6/24 13:16, Sehrope Sarkuni wrote:
On Sat, Jan 6, 2024 at 12:39 PM Joe Conway <m...@joeconway.com> wrote:


The only code specific comments were Tom's above, which have been
addressed. If there are no serious objections I plan to commit this
relatively soon.


One more thing that we do in pgjdbc is to zero out the input password 
args so that they don't remain in memory even after being freed. It's 
kind of odd in Java as it makes the input interface a char[] and we have 
to convert them to garbage collected Strings internally (which kind of 
defeats the purpose of the exercise).


But in libpq could be done via something like:

memset(pw1, 0, strlen(pw1));
memset(pw2, 0, strlen(pw2));



That part is in psql not libpq

There was some debate on our end of where to do that and we settled on 
doing it inside the encoding functions to ensure it always happens. So 
the input password char[] always gets wiped regardless of how the 
encoding functions are invoked.


Even if it's not added to the password encoding functions (as that kind 
of changes the after effects if anything was relying on the password 
still having the password), I think it'd be good to add it to the 
command.c stuff that has the two copies of the password prior to freeing 
them.


While that change might or might not be worthwhile, I see it as 
independent of this patch.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2024-01-06 Thread Joe Conway

On 12/24/23 10:14, Joe Conway wrote:

On 12/23/23 11:00, Tom Lane wrote:

Joe Conway  writes:
The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).


Haven't really read the patch, just looked at the docs, but here's
a bit of bikeshedding:


Thanks!


* This seems way too eager to promote the use of md5.  Surely the
default ought to be SCRAM, full stop.  I question whether we even
need an algorithm parameter.  Perhaps it's a good idea for
future-proofing, but we could also plan that the function would
make its own decisions based on noting the server's version.
(libpq is far more likely to be au courant about what to do than
the calling application, IMO.)



* Parameter order seems a bit odd: to me it'd be natural to write
user before password.



* Docs claim return type is char *, but then say bool (which is
also what the code says).  We do not use bool in libpq's API;
the admittedly-hoary convention is to use int with 1/0 values.
Rather than quibble about that, though, I wonder if we should
make the result be the PGresult from the command, which the
caller would be responsible to free.  That would document error
conditions much more flexibly.  On the downside, client-side
errors would be harder to report, particularly OOM, but I think
we have solutions for that in existing functions.



* The whole business of nonblock mode seems fairly messy here,
and I wonder whether it's worth the trouble to support.  If we
do want to claim it works then it ought to be tested.


All of these (except for the doc "char *" cut-n-pasteo) were due to
trying to stay close to the same interface as PQencryptPasswordConn().

But I agree with your assessment and the attached patch series addresses
all of them.

The original use of PQencryptPasswordConn() in psql passed a NULL for
the algorithm, so I dropped that argument entirely. I also swapped user
and password arguments because as you pointed out that is more natural.

This version returns PGresult. As far as special handling for
client-side errors like OOM, I don't see anything beyond returning a
NULL to signify fatal error, e,g,:

8<--
PGresult *
PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
{
PGresult   *result;

result = (PGresult *) malloc(sizeof(PGresult));
if (!result)
return NULL;
8<--

That is the approach I took.

One thing I have not done but, considered, is adding an additional 
optional parameter to allow "VALID UNTIL" to be set. Seems like it would 
be useful to be able to set an expiration when setting a new password.


No strong opinion about that.


Thanks -- hopefully others will weigh in on that.



I just read through the thread and my conclusion is that, specifically 
related to this patch set (i.e. notwithstanding suggestions for related 
features), there is consensus in favor of adding this feature.


The only code specific comments were Tom's above, which have been 
addressed. If there are no serious objections I plan to commit this 
relatively soon.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared   188
*** 191,193 
--- 191,194 
  PQclosePortal 189
  PQsendClosePrepared   190
  PQsendClosePortal 191
+ PQchangePassword  192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..a6897f8 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 
--- 1372,1459 
  
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the SQL name of the target user,
+  * and the cleartext password.
+  *
+  * Return value is the PGresult of the executed ALTER USER statement
+  * or NULL if we never get there. The caller is responsible to PQclear()
+  * the returned PGresult.
+  * 
+  * PQresultStatus() should be called to check the return value for errors,
+  * and PQerrorMessage() used to get more information about such errors.
+  */
+ PGr

Re: Password leakage avoidance

2024-01-06 Thread Joe Conway

On 1/2/24 07:23, Sehrope Sarkuni wrote:
1. There's two sets of defaults, the client program's default and the 
server's default. Need to pick one for each implemented function. They 
don't need to be the same across the board.


Is there a concrete recommendation here?

2. Password encoding should be split out and made available as its own 
functions. Not just as part of a wider "create _or_ alter a user's 
password" function attached to a connection.


It already is split out in libpq[1], unless I don't understand you 
correctly.


[1] 
https://www.postgresql.org/docs/current/libpq-misc.html#LIBPQ-PQENCRYPTPASSWORDCONN
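
For reference, a minimal sketch of calling that existing entry point (the
role name and password here are made up; passing NULL for the algorithm
makes libpq ask the server for its password_encryption setting):

8<-------------------
#include "libpq-fe.h"

static char *
encode_password(PGconn *conn)
{
    /* caller must check for NULL and eventually PQfreemem() the result */
    return PQencryptPasswordConn(conn, "new-secret", "app_user", NULL);
}
8<-------------------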



We went a step further and  added an intermediate function that
generates the "ALTER USER ... PASSWORD" SQL.


I don't think we want that in libpq, but in any case separate 
patch/discussion IMHO.


3. We only added an "ALTER USER ... PASSWORD" function, not anything to 
create a user. There's way too many options for that and keeping this 
targeted at just assigning passwords makes it much easier to test.


+1

Also separate patch/discussion, but I don't think the CREATE version is 
needed.


4. RE:defaults, the PGJDBC approach is that the encoding-only function 
uses the driver's default (SCRAM). The "alterUserPassword(...)" uses the 
server's default (again usually SCRAM for modern installs but could be 
something else). It's kind of odd that they're different but the use 
cases are different as well.


Since PQencryptPasswordConn() already exists, and psql's "\password" 
used it with its defaults, I don't think we want to change the behavior. 
The patch as written behaves in a backward compatible way.


5. Our SCRAM specific function allows for customizing the algo iteration 
and salt parameters. That topic came up on hackers previously[1]. Our 
high level "alterUserPassword(...)" function does not have those options 
but it is available as part of our PasswordUtil SCRAM API for advanced 
users who want to leverage it. The higher level functions have defaults 
for iteration counts (4096) and salt size (16-bytes).


Again separate patch/discussion, IMHO.

6. Getting the verbiage right for the PGJDBC version was kind of 
annoying as we wanted to match the server's wording, e.g. 
"password_encryption", but it's clearly hashing, not encryption. We 
settled on "password encoding" for describing the overall task with the 
comments referencing the server's usage of the term 
"password_encryption". Found a similar topic[2] on changing that 
recently as well but looks like that's not going anywhere.


Really this is irrelevant to this discussion, because the new function 
is called PQchangePassword().


The function PQencryptPasswordConn() has been around for a while and the 
horse is out of the gate on that one.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG

2024-01-05 Thread Joe Conway

On 1/5/24 12:56, Robert Haas wrote:

On Sun, Aug 27, 2023 at 4:25 PM Heikki Linnakangas  wrote:

I think you got that it backwards. 'perl_locale_obj' is set to the perl
interpreter's locale, whenever we are *outside* the interpreter.


This thread has had no update for more than 4 months, so I'm marking
the CF entry RwF for now.

It can always be reopened, if Joe or Tristan or Heikki or someone else
picks it up again.



It is definitely a bug, so I do plan to get back to it at some point, 
hopefully sooner rather than later...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: SET ROLE x NO RESET

2023-12-31 Thread Joe Conway

On 12/30/23 17:19, Michał Kłeczek wrote:



On 30 Dec 2023, at 17:16, Eric Hanson  wrote:

What do you think of adding a NO RESET option to the SET ROLE command?


What I proposed some time ago is SET ROLE … GUARDED BY ‘password’, so 
that you could later: RESET ROLE WITH ‘password’


I like that too, but see it as a separate feature. FWIW that is also 
supported by the set_user extension referenced elsewhere on this thread.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Build versionless .so for Android

2023-12-31 Thread Joe Conway

On 12/31/23 01:24, Michael Paquier wrote:

On Sun, Dec 31, 2023 at 06:08:04AM +0100, Matthias Kuhn wrote:

I was wondering if there are a) any comments on the approach and if I
should be handed in for a commitfest (currently waiting for the cooldown
period after account activation, I am not sure how long that is)


Such discussions are adapted in a commit fest entry.  No idea if there
is a cooldown period after account creation before being able to touch
the CF app contents, though.



FWIW I have expedited the cooldown period, so Matthias can log in right 
away.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: SET ROLE x NO RESET

2023-12-30 Thread Joe Conway

On 12/30/23 11:16, Eric Hanson wrote:

Hi,

What do you think of adding a NO RESET option to the SET ROLE command?

Right now Postgres can enforce data security with roles and RLS, but 
role-per-end-user doesn't really scale:  Db connections are per-role, so 
a connection pooler can't share connections across users.  We can work 
around this with policies that use session variables and checks against 
current_user, but it seems like role-per end user would be more 
beautiful.  If SET ROLE had a NO RESET option, you could connect through 
a connection pooler as a privileged user, but downgrade to the user's 
role for the duration of the session.


+1

I agree this would be useful.

In the meantime, in case it helps, see

  https://github.com/pgaudit/set_user

Specifically set_session_auth(text):
-
When set_session_auth(text) is called, the effective session and current 
user is switched to the rolename supplied, irrevocably. Unlike 
set_user() or set_user_u(), it does not affect logging nor allowed 
statements. If set_user.exit_on_error is "on" (the default), and any 
error occurs during execution, a FATAL error is thrown and the backend 
session exits.

-----

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2023-12-27 Thread Joe Conway

On 12/27/23 16:09, Tom Lane wrote:

Joe Conway  writes:

On 12/27/23 15:39, Peter Eisentraut wrote:

On 23.12.23 16:13, Joe Conway wrote:
The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).



I don't follow how you get from the problem statement to this solution.
This proposal doesn't avoid password leakage, does it?
It just provides a different way to phrase the existing solution.


Yes, a fully built one that is convenient to use, and does not ask 
everyone to roll their own.


It's convenient for users of libpq, I guess, but it doesn't help
anyone not writing C code directly atop libpq.  If this is the
way forward then we need to also press JDBC and other client
libraries to implement comparable functionality.  That's within
the realm of sanity surely, and having a well-thought-through
reference implementation in libpq would help those authors.
So I don't think this is a strike against the patch; but the answer
to Peter's question has to be that this is just part of the solution.


As mentioned downthread by Dave Cramer, JDBC is already onboard.

And as Jonathan said in an adjacent part of the thread:

This should be generally helpful, as it allows libpq-based drivers to
adopt this method and provide a safer mechanism for setting/changing
passwords! (which we should also promote once available).


Which is definitely something I have had in mind all along.
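
For a libpq-based client, the call sequence would look roughly like this
minimal sketch (written against the API as it stands in the latest patch;
the connection string, role name, and password are of course made up):

8<-------------------
#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* password is encrypted client side; the server never sees cleartext */
    res = PQchangePassword(conn, "app_user", "new-secret");
    if (res == NULL || PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "password change failed: %s", PQerrorMessage(conn));

    PQclear(res);               /* PQclear() is a no-op on NULL */
    PQfinish(conn);
    return 0;
}
8<-------------------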

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2023-12-27 Thread Joe Conway

On 12/27/23 15:39, Peter Eisentraut wrote:

On 23.12.23 16:13, Joe Conway wrote:
I have recently, once again for the umpteenth time, been involved in 
discussions around (paraphrasing) "why does Postgres leak the passwords 
into the logs when they are changed". I know well that the canonical 
advice is something like "use psql with \password if you care about that".


And while that works, it is a deeply unsatisfying answer for me to give 
and for the OP to receive.


The alternative is something like "...well if you don't like that, use 
PQencryptPasswordConn() to roll your own solution that meets your 
security needs".


Again, not a spectacular answer IMHO. It amounts to "here is a 
do-it-yourself kit, go put it together". It occurred to me that we can, 
and really should, do better.


The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).


The usage in psql serves as a ready built-in test for the libpq function 
(patch 0002). Docs included too (patch 0003).


I don't follow how you get from the problem statement to this solution.
This proposal doesn't avoid password leakage, does it?


Yes, it most certainly does. The plaintext password would never be seen 
by the server and therefore never logged. This is exactly why the 
feature already existed in psql.



 It just provides a different way to phrase the existing solution.


Yes, a fully built one that is convenient to use, and does not ask 
everyone to roll their own.


Who is a potential user of this solution? 


Literally every company that has complained that Postgres pollutes their 
logs with plaintext passwords. I have heard the request to provide a 
better solution many times, over many years, while working for three 
different companies.



Right now it just saves a dozen lines in psql, but it's not clear how
it improves anything else.


It is to me, and so far no one else has complained about that. More 
opinions would be welcomed of course.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2023-12-24 Thread Joe Conway

On 12/24/23 12:22, Tom Lane wrote:

Joe Conway  writes:

Completely unrelated process bikeshedding:
I changed the naming scheme I used for the split patch-set this time. I 
don't know if we have a settled/documented pattern for such naming, but 
the original pattern which I borrowed from someone else's patches was 
"vX--description.patch".


As far as that goes, that filename pattern is what is generated by
"git format-patch".  I agree that the digit-count choices are a tad
odd, but they're not so awful as to be worth trying to override.



Ah, knew it was something like that. I am still a curmudgeon doing 
things the old way ¯\_(ツ)_/¯



The new pattern I picked is "description-vXXX-NN.patch" which fixes all 
of those issues.


Only if you use the same "description" for all patches of a series,
which seems kind of not the point.  In any case, "git format-patch"
is considered best practice for a multi-patch series AFAIK, so we
have to cope with its ideas about how to name the files.
Even if I wanted some differentiating name for the individual patches in 
a set, I still like them to be grouped because it is one unit of work 
from my perspective.


Oh well, I guess I will get with the program and put every patch-set 
into its own directory.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2023-12-24 Thread Joe Conway

On 12/23/23 11:00, Tom Lane wrote:

Joe Conway  writes:
The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).


Haven't really read the patch, just looked at the docs, but here's
a bit of bikeshedding:


Thanks!


* This seems way too eager to promote the use of md5.  Surely the
default ought to be SCRAM, full stop.  I question whether we even
need an algorithm parameter.  Perhaps it's a good idea for
future-proofing, but we could also plan that the function would
make its own decisions based on noting the server's version.
(libpq is far more likely to be au courant about what to do than
the calling application, IMO.)



* Parameter order seems a bit odd: to me it'd be natural to write
user before password.



* Docs claim return type is char *, but then say bool (which is
also what the code says).  We do not use bool in libpq's API;
the admittedly-hoary convention is to use int with 1/0 values.
Rather than quibble about that, though, I wonder if we should
make the result be the PGresult from the command, which the
caller would be responsible to free.  That would document error
conditions much more flexibly.  On the downside, client-side
errors would be harder to report, particularly OOM, but I think
we have solutions for that in existing functions.



* The whole business of nonblock mode seems fairly messy here,
and I wonder whether it's worth the trouble to support.  If we
do want to claim it works then it ought to be tested.


All of these (except for the doc "char *" cut-n-pasteo) were due to 
trying to stay close to the same interface as PQencryptPasswordConn().


But I agree with your assessment and the attached patch series addresses 
all of them.


The original use of PQencryptPasswordConn() in psql passed a NULL for 
the algorithm, so I dropped that argument entirely. I also swapped user 
and password arguments because as you pointed out that is more natural.


This version returns PGresult. As far as special handling for 
client-side errors like OOM, I don't see anything beyond returning a 
NULL to signify fatal error, e,g,:


8<--
PGresult *
PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
{
PGresult   *result;

result = (PGresult *) malloc(sizeof(PGresult));
if (!result)
return NULL;
8<--

That is the approach I took.

One thing I have not done but, considered, is adding an additional 
optional parameter to allow "VALID UNTIL" to be set. Seems like it would 
be useful to be able to set an expiration when setting a new password.


No strong opinion about that.


Thanks -- hopefully others will weigh in on that.

Completely unrelated process bikeshedding:
I changed the naming scheme I used for the split patch-set this time. I 
don't know if we have a settled/documented pattern for such naming, but 
the original pattern which I borrowed from someone else's patches was 
"vX--description.patch".


The problems I have with that are 1/ there may well be more than 10 
versions of a patch-set, 2/ there are probably never going to be more 
than 2 digits worth of files in a patch-set, and 3/ the description 
coming after the version and file identifiers causes the patches in my 
local directory to sort poorly, intermingling several unrelated patch-sets.


The new pattern I picked is "description-vXXX-NN.patch" which fixes all 
of those issues. Does that bother anyone? *Should* we try to agree on a 
desired pattern (assuming there is not one already)?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared   188
*** 191,193 
--- 191,194 
  PQclosePortal 189
  PQsendClosePrepared   190
  PQsendClosePortal 191
+ PQchangePassword  192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..a6897f8 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 
--- 1372,1459 
  
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  *

Password leakage avoidance

2023-12-23 Thread Joe Conway
I have recently, once again for the umpteenth time, been involved in 
discussions around (paraphrasing) "why does Postgres leak the passwords 
into the logs when they are changed". I know well that the canonical 
advice is something like "use psql with \password if you care about that".


And while that works, it is a deeply unsatisfying answer for me to give 
and for the OP to receive.


The alternative is something like "...well if you don't like that, use 
PQencryptPasswordConn() to roll your own solution that meets your 
security needs".


Again, not a spectacular answer IMHO. It amounts to "here is a 
do-it-yourself kit, go put it together". It occurred to me that we can, 
and really should, do better.


The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).


The usage in psql serves as a ready built-in test for the libpq function 
(patch 0002). Docs included too (patch 0003).


One thing I have not done but, considered, is adding an additional 
optional parameter to allow "VALID UNTIL" to be set. Seems like it would 
be useful to be able to set an expiration when setting a new password.


I will register this in the upcoming commitfest, but meantime 
thought/comments/etc. would be gratefully received.


Thanks,

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.comdiff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared   188
*** 191,193 
--- 191,194 
  PQclosePortal 189
  PQsendClosePrepared   190
  PQsendClosePortal 191
+ PQchangePassword  192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..3ee67ba 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 
--- 1372,1463 
  
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user.  The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the cleartext password, the SQL
+  * name of the user it is for, and a string indicating the algorithm to
+  * use for encrypting the password.  If algorithm is NULL,
+  * PQencryptPasswordConn() queries the server for the current
+  * 'password_encryption' value. If you wish to avoid that, e.g. to avoid
+  * blocking, you can execute 'show password_encryption' yourself before
+  * calling this function, and pass it as the algorithm.
+  *
+  * Return value is a boolean, true for success. On error, an error message
+  * is stored in the connection object, and returns false.
+  */
+ bool
+ PQchangePassword(PGconn *conn, const char *passwd, const char *user,
+  const char *algorithm)
+ {
+ 	char		   *encrypted_password = NULL;
+ 	PQExpBufferData buf;
+ 	bool			success = true;
+ 
+ 	encrypted_password = PQencryptPasswordConn(conn, passwd, user, algorithm);
+ 
+ 	if (!encrypted_password)
+ 	{
+ 		/* PQencryptPasswordConn() already registered the error */
+ 		success = false;
+ 	}
+ 	else
+ 	{
+ 		char	   *fmtpw = NULL;
+ 
+ 		fmtpw = PQescapeLiteral(conn, encrypted_password,
+ strlen(encrypted_password));
+ 
+ 		/* no longer needed, so clean up now */
+ 		PQfreemem(encrypted_password);
+ 
+ 		if (!fmtpw)
+ 		{
+ 			/* PQescapeLiteral() already registered the error */
+ 			success = false;
+ 		}
+ 		else
+ 		{
+ 			char	   *fmtuser = NULL;
+ 
+ 			fmtuser = PQescapeIdentifier(conn, user, strlen(user));
+ 			if (!fmtuser)
+ 			{
+ /* PQescapeIdentifier() already registered the error */
+ success = false;
+ PQfreemem(fmtpw);
+ 			}
+ 			else
+ 			{
+ PGresult   *res;
+ 
+ initPQExpBuffer(&buf);
+ printfPQExpBuffer(&buf, "ALTER USER %s PASSWORD %s",
+   fmtuser, fmtpw);
+ 
+ res = PQexec(conn, buf.data);
+ if (!res)
+ 	success = false;
+ else
+ 	PQclear(res);
+ 
+ /* clean up */
+ termPQExpBuffer(&buf);
+ PQfreemem(fmtuser);
+ PQfreemem(fmtpw);
+ 			}
+ 		}
+ 	}
+ 
+ 	return success;
+ }
diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h
index 97762d5..f6e7207 100644
*** a/src/interfaces/libpq/libpq-fe.h
--- b/src/interfaces/libpq/libpq-fe.h
*** e

Re: Eager page freeze criteria clarification

2023-12-21 Thread Joe Conway

On 12/21/23 10:56, Melanie Plageman wrote:

On Sat, Dec 9, 2023 at 9:24 AM Joe Conway  wrote:

However, even if we assume a more-or-less normal distribution, we should
consider using subgroups in a way similar to Statistical Process
Control[1]. The reasoning is explained in this quote:

 The Math Behind Subgroup Size

 The Central Limit Theorem (CLT) plays a pivotal role here. According
 to CLT, as the subgroup size (n) increases, the distribution of the
 sample means will approximate a normal distribution, regardless of
 the shape of the population distribution. Therefore, as your
 subgroup size increases, your control chart limits will narrow,
 making the chart more sensitive to special cause variation and more
 prone to false alarms.


I haven't read anything about statistical process control until you
mentioned this. I read the link you sent and also googled around a
bit. I was under the impression that the more samples we have, the
better. But, it seems like this may not be the assumption in
statistical process control?

It may help us to get more specific. I'm not sure what the
relationship between "unsets" in my code and subgroup members would
be.  The article you linked suggests that each subgroup should be of
size 5 or smaller. Translating that to my code, were you imagining
subgroups of "unsets" (each time we modify a page that was previously
all-visible)?


Basically, yes.

It might not make sense, but I think we could test the theory by 
plotting a histogram of the raw data, and then also plot a histogram 
based on sub-grouping every 5 sequential values in your accumulator.
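
Concretely, the subgrouping step is cheap -- something like this toy
sketch (illustration only, not tied to your actual accumulator code):

8<-------------------
#include <stddef.h>

#define SUBGROUP_SIZE 5

/* collapse raw samples into means of 5 consecutive values */
static size_t
subgroup_means(const double *raw, size_t nraw,
               double *means, size_t maxmeans)
{
    size_t      nmeans = 0;

    for (size_t i = 0; i + SUBGROUP_SIZE <= nraw && nmeans < maxmeans;
         i += SUBGROUP_SIZE)
    {
        double      sum = 0.0;

        for (size_t j = 0; j < SUBGROUP_SIZE; j++)
            sum += raw[i + j];
        means[nmeans++] = sum / SUBGROUP_SIZE;
    }
    return nmeans;
}
8<-------------------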


If the former does not look very normal (I would guess most workloads it 
will be skewed with a long tail) and the latter looks to be more normal, 
then it would say we were on the right track.


There are statistical tests for "normalness" that could be applied too 
( e.g. 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6350423/#sec2-13title ) 
which be a more rigorous approach, but the quick look at histograms 
might be sufficiently convincing.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Eager page freeze criteria clarification

2023-12-09 Thread Joe Conway
er
table vacuum, especially in case of long-running vacuums.

To benchmark this new heuristic (I'm calling it algo 6 since it is the
6th I've proposed on this thread), I've used a modified subset of my
original workloads:

Workloads

C. Shifting hot set
32 clients inserting multiple rows and then updating an indexed
column of a row on a different page containing data they formerly
inserted. Only recent data is updated.

H. Append only table
32 clients, each inserting a single row at a time

I. Work queue
32 clients, each inserting a row, then updating an indexed column in
2 different rows in nearby pages, then deleting 1 of the updated rows


Workload C:

Algo | Table| AVs | Page Freezes | Pages Frozen | % Frozen
-|--|-|--|--|-
6 | hot_tail |  14 |2,111,824 |1,643,217 |53.4%
M | hot_tail |  16 |  241,605 |3,921 | 0.1%


Algo | WAL GB | Cptr Bgwr Writes | Other r/w  | AV IO time |TPS
-||--|||
6 |193 |5,473,949 | 12,793,574 | 14,870 | 28,397
M |207 |5,213,763 | 20,362,315 | 46,190 | 28,461

Algorithm 6 freezes all of the cold data and doesn't freeze the current
working set. The notable thing is how much this reduces overall system
I/O. On master, autovacuum is doing more than 3x the I/O and the rest of
the system is doing more than 1.5x the I/O. I suspect freezing data when
it is initially vacuumed is saving future vacuums from having to evict
pages of the working set and read in cold data.


Workload H:

Algo | Table | AVs | Page Freezes | Pages Frozen | % frozen
-|---|-|--|--|-
6 | hthistory |  22 |  668,448 |  668,003 |  87%
M | hthistory |  22 |0 |0 |   0%


Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO time |TPS
-||--|---||
6 | 14 |  725,103 |   725,575 |  1 | 43,535
M | 13 |  693,945 |   694,417 |  1 | 43,522

The insert-only table is mostly frozen at the end. There is more I/O
done but not at the expense of TPS. This is exactly what we want.


Workload I:

Algo | Table | AVs | Page Freezes | Pages Frozen | % Frozen
-|---|-|--|--|-
6 | workqueue | 234 |0 |4,416 | 78%
M | workqueue | 234 |0 |4,799 | 87%


Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO Time |TPS
-||--|---||
6 | 74 |   64,345 |64,813 |  1 | 36,366
M | 73 |   63,468 |63,936 |  1 | 36,145

What we want is for the work queue table to freeze as little as
possible, because we will be constantly modifying the data. Both on
master and with algorithm 6 we do not freeze tuples on any pages. You
will notice, however, that much of the table is set frozen in the VM at
the end. This is because we set pages all frozen in the VM if they are
technically all frozen even if we do not freeze tuples on the page. This
is inexpensive and not under the control of the freeze heuristic.

Overall, the behavior of this new adaptive freezing method seems to be
exactly what we want. The next step is to decide how many values to
remove from the accumulator and benchmark cases where old data is
deleted.

I'd be delighted to receive any feedback, ideas, questions, or review.



This is well thought out, well described, and a fantastic improvement in 
my view -- well done!


I do think we will need to consider distributions other than normal, but 
I don't know offhand what they will be.


However, even if we assume a more-or-less normal distribution, we should 
consider using subgroups in a way similar to Statistical Process 
Control[1]. The reasoning is explained in this quote:


The Math Behind Subgroup Size

The Central Limit Theorem (CLT) plays a pivotal role here. According
to CLT, as the subgroup size (n) increases, the distribution of the
sample means will approximate a normal distribution, regardless of
the shape of the population distribution. Therefore, as your
subgroup size increases, your control chart limits will narrow,
making the chart more sensitive to special cause variation and more
prone to false alarms.


[1] 
https://www.qualitygurus.com/rational-subgrouping-in-statistical-process-control/


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Joe Conway

On 12/8/23 14:45, Daniel Verite wrote:

Joe Conway wrote:


copyto_json.007.diff


When the source has json fields with non-significant line feeds, the COPY
output has these line feeds too, which makes the output incompatible
with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
Value").

create table j(f json);

insert into j values('{"a":1,
"b":2
}');

copy j to stdout (format json);

Result:
{"f":{"a":1,
"b":2
}}

Is that expected? copy.sgml in 007 doesn't describe the output
in terms of lines so it's hard to tell from the doc.


The patch as-is just does the equivalent of row_to_json():
8<
select row_to_json(j) from j;
 row_to_json
--
 {"f":{"a":1,+
 "b":2   +
 }}
(1 row)
8<

So yeah, that is how it works today. I will take a look at what it would 
take to fix it.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 09:11, David G. Johnston wrote:
Those are all the same breakage though - if truly interpreted as data 
rows the protocol is basically written such that the array format is not 
supportable and only the lines format can be used.  Hence my “format 0 
doesn’t work” comment for array output and we should explicitly add 
format 2 where we explicitly decouple lines of output from rows of 
data.  That said, it would seem in practice format 0 already decouples 
them and so the current choice of the brackets on their own lines is 
acceptable.


I’d prefer to keep them on their own line.


WFM ¯\_(ツ)_/¯

I am merely responding with options to the many people opining on the 
thread.


I also don’t know why you introduced another level of object nesting 
here.  That seems quite undesirable.


I didn't add anything. It is an artifact of the particular query I wrote 
in the copy to statement (I did "select ss from ss" instead of "select * 
from ss"), mea culpa.


This is what the latest patch, as written today, outputs:
8<--
copy
(select 1, g.i from generate_series(1, 3) g(i))
to stdout (format json, force_array);
[
 {"?column?":1,"i":1}
,{"?column?":1,"i":2}
,{"?column?":1,"i":3}
]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 08:52, Joe Conway wrote:

Or maybe this is preferred?
8<--
[{"ss":{"f1":1,"f2":1}},
   {"ss":{"f1":1,"f2":2}},
   {"ss":{"f1":1,"f2":3}}]
8<--


I don't know why my mail client keeps adding extra spaces, but the 
intention here is a single space in front of row 2 and 3 in order to 
line the json objects up at column 2.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 08:35, Daniel Verite wrote:

Joe Conway wrote:

The attached should fix the CopyOut response to say one column. I.e. it 
ought to look something like:


Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...

this number must be the number of columns in the source.
That is for COPY table(a,b,c)   the number is 3, independently
on whether the result is formatted in text, cvs, json or binary.

I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how much columns there are in the original
data might be useful by itself and we don't want to break that.


Ok, that sounds reasonable to me -- I will revert that change.


The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.


If we really think that is a problem I can see about changing it to this 
format for json array:


8<--
copy
(
  with ss(f1, f2) as
  (
select 1, g.i from generate_series(1, 3) g(i)
  )
  select ss from ss
) to stdout (format json, force_array);
[{"ss":{"f1":1,"f2":1}}
,{"ss":{"f1":1,"f2":2}}
,{"ss":{"f1":1,"f2":3}}]
8<--

Is this acceptable to everyone?

Or maybe this is preferred?
8<--
[{"ss":{"f1":1,"f2":1}},
 {"ss":{"f1":1,"f2":2}},
 {"ss":{"f1":1,"f2":3}}]
8<--

Or as long as we are painting the shed, maybe this?
8<--
[{"ss":{"f1":1,"f2":1}},
{"ss":{"f1":1,"f2":2}},
{"ss":{"f1":1,"f2":3}}]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/6/23 21:56, Nathan Bossart wrote:

On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote:

If Nathan's perf results hold up elsewhere, it seems like some
micro-optimization around the text-pushing (appendStringInfoString)
might be more useful than caching.  The 7% spent in cache lookups
could be worth going after later, but it's not the top of the list.


Hah, it turns out my benchmark of 110M integers really stresses the
JSONTYPE_NUMERIC path in datum_to_json_internal().  That particular path
calls strlen() twice: once for IsValidJsonNumber(), and once in
appendStringInfoString().  If I save the result from IsValidJsonNumber()
and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster.
It's probably worth giving datum_to_json_internal() a closer look in a new
thread.


Yep, after looking through that code I was going to make the point that 
your 110M integer test was over-indexing on that one type. I am sure there 
are other micro-optimizations to be made here, but I also think that it 
is outside the scope of the COPY TO JSON patch.
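
For anyone following along, the idea in question amounts to roughly this
(a sketch only -- not the actual datum_to_json_internal() code, and the
header that declares IsValidJsonNumber() may differ across branches):

8<-------------------
#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/json.h"

static void
append_json_number(StringInfo result, const char *outputstr)
{
    int         len = (int) strlen(outputstr);

    if (IsValidJsonNumber(outputstr, len))
        appendBinaryStringInfo(result, outputstr, len); /* reuse the length */
    else
        escape_json(result, outputstr); /* not JSON-safe, emit as a string */
}
8<-------------------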


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 20:09, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 5:57 PM Joe Conway <m...@joeconway.com> wrote:


On 12/6/23 19:39, David G. Johnston wrote:
 > On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <m...@joeconway.com> wrote:

 > But I still cannot shake the belief that using a format code of 1 -
 > which really could be interpreted as meaning "textual csv" in
practice -
 > for this JSON output is unwise and we should introduce a new integer
 > value for the new fundamental output format.

No, I am pretty sure you still have that wrong. The "1" means binary
mode


Ok.  I made the same typo twice, I did mean to write 0 instead of 1.


Fair enough.

But the point that we should introduce a 2 still stands.  The new code 
would mean: use text output functions but that there is no inherent 
tabular structure in the underlying contents.  Instead the copy format 
was JSON and the output layout is dependent upon the json options in the 
copy command and that there really shouldn't be any attempt to turn the 
contents directly into a tabular data structure like you presently do 
with the CSV data under format 0.  Ignore the column count and column 
formats as they are fixed or non-existent.


I think that amounts to a protocol change, which we tend to avoid at all 
costs.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:09, Joe Conway wrote:

On 12/6/23 14:47, Joe Conway wrote:

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to
assemble the data correctly despite the extra CopyData messages.


Hmm, maybe the real problem here is that Columns do not equal "3" for
the json mode case -- that should really say "1" I think, because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more
"rows" that actual output rows (the "[" and the "]"), but maybe those
are less likely to cause some hazard?



The attached should fix the CopyOut response to say one column. I.e. it 
ought to look something like:


PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 1
 Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data: [...]


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..8915fb3 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,49 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 207,220 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 378,396 
   
  
 
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of square brackets as array decorations at the beginning
+   and end of output, and commas between the rows. It is allowed only in
+   COPY TO, and only when using
+   JSON format. The default is
+   false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..23b570f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,425 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 444,451 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 543,555 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (force_array_specified)
+ errorConflictingDefElem(defel, pstate);
+ 			force_array_specified = true;
+ 			opts_out->force_array = defGetBoolean(defel);
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 
--- 608,625 
  (errcode(ERRCODE_SYNTAX_

Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 19:39, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <mailto:m...@joeconway.com>> wrote:



" The backend sends a CopyOutResponse message to the frontend, followed
     by zero or more CopyData messages (always one per row), followed by
     CopyDone"

probably "always one per row" would be changed to note that json array
format outputs two extra rows for the start/end bracket.


Fair, I was ascribing much more semantic meaning to this than it wants.

I don't see any real requirement, given the lack of semantics, to 
mention JSON at all.  It is one CopyData per row, regardless of the 
contents.  We don't delineate between the header and non-header data in 
CSV.  It isn't a protocol concern.


good point

But I still cannot shake the belief that using a format code of 1 - 
which really could be interpreted as meaning "textual csv" in practice - 
for this JSON output is unwise and we should introduce a new integer 
value for the new fundamental output format.


No, I am pretty sure you still have that wrong. The "1" means binary 
mode. As in

8<--
FORMAT

Selects the data format to be read or written: text, csv (Comma 
Separated Values), or binary. The default is text.

8<--

That is completely separate from text and csv. It literally means to use 
the binary output functions instead of the usual ones:


8<--
if (cstate->opts.binary)
getTypeBinaryOutputInfo(attr->atttypid,
_func_oid,
);
else
getTypeOutputInfo(attr->atttypid,
  _func_oid,
  );
8<--

Both "text" and "csv" mode use are non-binary output formats. I believe 
the JSON output format is also non-binary.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:38, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:28 PM David G. Johnston 
mailto:david.g.johns...@gmail.com>> wrote:


On Wed, Dec 6, 2023 at 4:09 PM Joe Conway mailto:m...@joeconway.com>> wrote:

On 12/6/23 14:47, Joe Conway wrote:
 > On 12/6/23 13:59, Daniel Verite wrote:
 >>      Andrew Dunstan wrote:
 >>
 >>> IMNSHO, we should produce either a single JSON
 >>> document (the ARRAY case) or a series of JSON documents,
one per row
 >>> (the LINES case).
 >>
 >> "COPY Operations" in the doc says:
 >>
 >> " The backend sends a CopyOutResponse message to the
frontend, followed
 >>     by zero or more CopyData messages (always one per row),
followed by
 >>     CopyDone".
 >>
 >> In the ARRAY case, the first messages with the copyjsontest
 >> regression test look like this (tshark output):
 >>
 >> PostgreSQL
 >>      Type: CopyOut response
 >>      Length: 13
 >>      Format: Text (0)
 >>      Columns: 3
 >>      Format: Text (0)

 > Anything receiving this and looking for a json array should
know how to
 > assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal
"3" for
the json mode case -- that should really say "1" I think,
because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more
"rows" that actual output rows (the "[" and the "]"), but maybe
those
are less likely to cause some hazard?


What is the limitation, if any, of introducing new type codes for
these.  n = 2..N for the different variants?  Or even -1 for "raw
text"?  And document that columns and structural rows need to be
determined out-of-band.  Continuing to use 1 (text) for this non-csv
data seems like a hack even if we can technically make it function. 
The semantics, especially for the array case, are completely discarded or wrong.

Also, it seems like this answer would be easier to make if we implement 
COPY FROM now since how is the server supposed to deal with decomposing 
this data into tables without accurate type information?  I don't see 
implementing only half of the feature being a good idea.  I've had much 
more desire for FROM compared to TO personally.


Several people have weighed in on the side of getting COPY TO done by 
itself first. Given how long this discussion has already become for a 
relatively small and simple feature, I am a big fan of not expanding the 
scope now.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:28, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mailto:m...@joeconway.com>> wrote:


On 12/6/23 14:47, Joe Conway wrote:
 > On 12/6/23 13:59, Daniel Verite wrote:
 >>      Andrew Dunstan wrote:
 >>
 >>> IMNSHO, we should produce either a single JSON
 >>> document (the ARRAY case) or a series of JSON documents, one
per row
 >>> (the LINES case).
 >>
 >> "COPY Operations" in the doc says:
 >>
 >> " The backend sends a CopyOutResponse message to the frontend,
followed
 >>     by zero or more CopyData messages (always one per row),
followed by
 >>     CopyDone".
 >>
 >> In the ARRAY case, the first messages with the copyjsontest
 >> regression test look like this (tshark output):
 >>
 >> PostgreSQL
 >>      Type: CopyOut response
 >>      Length: 13
 >>      Format: Text (0)
 >>      Columns: 3
 >>      Format: Text (0)

 > Anything receiving this and looking for a json array should know
how to
 > assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal "3" for
the json mode case -- that should really say "1" I think, because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more
"rows" that actual output rows (the "[" and the "]"), but maybe those
are less likely to cause some hazard?


What is the limitation, if any, of introducing new type codes for 
these.  n = 2..N for the different variants?  Or even -1 for "raw 
text"?  And document that columns and structural rows need to be 
determined out-of-band.  Continuing to use 1 (text) for this non-csv 
data seems like a hack even if we can technically make it function.  The 
semantics, especially for the array case, are completely discarded or wrong.


I am not following you here. SendCopyBegin looks like this currently:

8<
SendCopyBegin(CopyToState cstate)
{
StringInfoData buf;
int natts = list_length(cstate->attnumlist);
int16   format = (cstate->opts.binary ? 1 : 0);
int i;

pq_beginmessage(, PqMsg_CopyOutResponse);
pq_sendbyte(, format);  /* overall format */
pq_sendint16(, natts);
for (i = 0; i < natts; i++)
pq_sendint16(, format); /* per-column formats */
pq_endmessage();
cstate->copy_dest = COPY_FRONTEND;
}
8<

The "1" is saying are we binary mode or not. JSON mode will never be 
sending in binary in the current implementation at least. And it always 
aggregates all the columns as one json object. So the correct answer is 
(I think):

8<
*** SendCopyBegin(CopyToState cstate)
*** 146,154 

pq_beginmessage(, PqMsg_CopyOutResponse);
pq_sendbyte(, format);  /* overall format */
!   pq_sendint16(, natts);
!   for (i = 0; i < natts; i++)
!   pq_sendint16(, format); /* per-column formats */
pq_endmessage();
cstate->copy_dest = COPY_FRONTEND;
  }
--- 150,169 

pq_beginmessage(, PqMsg_CopyOutResponse);
pq_sendbyte(, format);  /* overall format */
!   if (!cstate->opts.json_mode)
!   {
!   pq_sendint16(, natts);
!   for (i = 0; i < natts; i++)
!   pq_sendint16(, format); /* per-column formats */
!   }
!   else
!   {
!   /*
!* JSON mode is always one non-binary column
!*/
!   pq_sendint16(, 1);
!   pq_sendint16(, 0);
!   }
pq_endmessage();
cstate->copy_dest = COPY_FRONTEND;
  }
8<

That still leaves the need to fix the documentation:

" The backend sends a CopyOutResponse message to the frontend, followed
   by zero or more CopyData messages (always one per row), followed by
   CopyDone"

probably "always one per row" would be changed to note that json array 
format outputs two extra rows for the start/end bracket.


In fact, as written the patch does this:
8<
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i) WHERE false)
TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
]
8<

Not sure if that is a problem or not.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 14:47, Joe Conway wrote:

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to
assemble the data correctly despite the extra CopyData messages.


Hmm, maybe the real problem here is that Columns do not equal "3" for 
the json mode case -- that should really say "1" I think, because the 
row is not represented as 3 columns but rather 1 json object.


Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more 
"rows" that actual output rows (the "[" and the "]"), but maybe those 
are less likely to cause some hazard?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 16:42, Sehrope Sarkuni wrote:
On Wed, Dec 6, 2023 at 4:29 PM Joe Conway <mailto:m...@joeconway.com>> wrote:


 > 1. Outputting a top level JSON object without the additional column
 > keys. IIUC, the top level keys are always the column names. A
common use
 > case would be a single json/jsonb column that is already formatted
 > exactly as the user would like for output. Rather than enveloping
it in
 > an object with a dedicated key, it would be nice to be able to
output it
 > directly. This would allow non-object results to be outputted as
well
 > (e.g., lines of JSON arrays, numbers, or strings). Due to how
JSON is
 > structured, I think this would play nice with the JSON lines v.s.
array
 > concept.
 >
 > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
 > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
 > SOME_OPTION_TO_NOT_ENVELOPE)
 > {"foo":1}
 > {"foo":2}
 > {"foo":3}

Your example does not match what you describe, or do I misunderstand? I
thought your goal was to eliminate the repeated "foo" from each row...


The "foo" in this case is explicit as I'm adding it when building the 
object. What I was trying to show was not adding an additional object 
wrapper / envelope.


So each row is:

{"foo":1}

Rather than:

"{"json_build_object":{"foo":1}}


I am still getting confused ;-)

Let's focus on the current proposed patch with a "minimum required 
feature set".


Right now the default behavior is "JSON lines":
8<---
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i))
TO STDOUT WITH (FORMAT JSON);
{"i":1,"v":"val1"}
{"i":2,"v":"val2"}
{"i":3,"v":"val3"}
8<---

and the other, non-default option is "JSON array":
8<---
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i))
TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
 {"i":1,"v":"val1"}
,{"i":2,"v":"val2"}
,{"i":3,"v":"val3"}
]
8<---

So the questions are:
1. Do those two formats work for the initial implementation?
2. Is the default correct or should it be switched
   e.g. rather than specifying FORCE_ARRAY to get an
   array, something like FORCE_NO_ARRAY to get JSON lines
   and the JSON array is default?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 11:28, Sehrope Sarkuni wrote:

Big +1 to this overall feature.


cool!

Regarding the defaults for the output, I think JSON lines (rather than a 
JSON array of objects) would be preferred. It's more natural to combine 
them and generate that type of data on the fly rather than forcing 
aggregation into a single object.


So that is +2 (Sehrope and me) for the status quo (JSON lines), and +2 
(Andrew and Davin) for defaulting to json arrays. Anyone else want to 
weigh in on that issue?


Couple more features / use cases come to mind as well. Even if they're 
not part of a first round of this feature I think it'd be helpful to 
document them now as it might give some ideas for what does make that 
first cut:


1. Outputting a top level JSON object without the additional column 
keys. IIUC, the top level keys are always the column names. A common use 
case would be a single json/jsonb column that is already formatted 
exactly as the user would like for output. Rather than enveloping it in 
an object with a dedicated key, it would be nice to be able to output it 
directly. This would allow non-object results to be outputted as well 
(e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is 
structured, I think this would play nice with the JSON lines v.s. array 
concept.


COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM 
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, 
SOME_OPTION_TO_NOT_ENVELOPE)

{"foo":1}
{"foo":2}
{"foo":3}


Your example does not match what you describe, or do I misunderstand? I 
thought your goal was to eliminate the repeated "foo" from each row...


2. An option to ignore null fields so they are excluded from the output. 
This would not be a default but would allow shrinking the total size of 
the output data in many situations. This would be recursive to allow 
nested objects to be shrunk down (not just the top level). This might be 
worthwhile as a standalone JSON function though handling it during 
output would be more efficient as it'd only be read once.


COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM 
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, 
SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)

{}
{"foo":2}
{"foo":3}


clear enough I think

3. Reverse of #2 when copying data in to allow defaulting missing fields 
to NULL.


good to record the ask, but applies to a different feature (COPY FROM 
instead of COPY TO).


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 11:44, Nathan Bossart wrote:

On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote:

(format csv)
Time: 12295.480 ms (00:12.295)
Time: 12311.059 ms (00:12.311)
Time: 12305.469 ms (00:12.305)

(format json)
Time: 24568.621 ms (00:24.569)
Time: 23756.234 ms (00:23.756)
Time: 24265.730 ms (00:24.266)


I should also note that the json output is 85% larger than the csv output.


I'll see if I can add some caching to composite_to_json(), but based on 
the relative data size it does not sound like there is much performance 
left on the table to go after, no?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to 
assemble the data correctly despite the extra CopyData messages.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 10:44, Tom Lane wrote:

Joe Conway  writes:
I believe this is ready to commit unless there are further comments or 
objections.


I thought we were still mostly at proof-of-concept stage?


The concept is narrowly scoped enough that I think we are homing in on 
the final patch.



In particular, has anyone done any performance testing?
I'm concerned about that because composite_to_json() has
zero capability to cache any metadata across calls, meaning
there is going to be a large amount of duplicated work
per row.


I will devise some kind of test and report back. I suppose something 
with many rows and many narrow columns comparing time to COPY 
text/csv/json modes would do the trick?
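
Something like this, perhaps (just a sketch -- table name, row count, and 
column mix are arbitrary):

8<---
-- many rows, several narrow columns
create table copyperf as
  select g.i as c1, g.i + 1 as c2, g.i + 2 as c3, 'val' || g.i as c4
  from generate_series(1, 10000000) as g(i);

\timing on
copy copyperf to '/tmp/buf' (format text);
copy copyperf to '/tmp/buf' (format csv);
copy copyperf to '/tmp/buf' (format json);
8<---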


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 10:32, Andrew Dunstan wrote:


On 2023-12-06 We 08:49, Joe Conway wrote:

On 12/6/23 07:36, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:46, Joe Conway wrote:

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of 
MSSQL server I think [1] which had the non-array with commas 
import and export style. It was not that tough to support and the 
code as written already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some 
absurdity in
the CSV code (much of it down to me), so maybe we need to be 
liberal in
what we accept here too. IMNSHO, we should produce either a single 
JSON

document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).


So your preference would be to not allow the non-array-with-commas 
case but if/when we implement COPY FROM we would accept that format? 
As in Postel's law ("be conservative in what you do, be liberal in 
what you accept from others")?



Yes, I think so.


Awesome. The attached does it that way. I also ran pgindent.

I believe this is ready to commit unless there are further comments or 
objections.


Sorry to bikeshed a little more, I'm a bit late looking at this.

I suspect that most users will actually want the table as a single JSON
document, so it should probably be the default. In any case FORCE_ARRAY
as an option has a slightly wrong feel to it.


Sure, I can make that happen, although I figured that for the 
many-rows-scenario the single array size might be an issue for whatever 
you are importing into.



I'm having trouble coming up with a good name for the reverse of
that, off the top of my head.


Will think about it and propose something with the next patch revision.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 07:36, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:46, Joe Conway wrote:

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as 
written already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).


So your preference would be to not allow the non-array-with-commas 
case but if/when we implement COPY FROM we would accept that format? 
As in Postel's law ("be conservative in what you do, be liberal in 
what you accept from others")?



Yes, I think so.


Awesome. The attached does it that way. I also ran pgindent.

I believe this is ready to commit unless there are further comments or 
objections.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Add json format mode to COPY TO

Add json format mode support to COPY TO, which includes two output
variations: 1) "json lines" which is each row as a json object delimited
by newlines (the default); and 2) "json array" which is the same as #1,
but with the addition of a leading "[", trailing "]", and comma row
delimiters, to form a valid json array.

Early versions: helpful hints/reviews provided by Nathan Bossart,
Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan
and Davin Shearer.

Requested-by: Davin Shearer
Author: Joe Conway
Reviewed-by: Andrew Dunstan, Davin Shearer 
Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..8915fb3 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,49 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 207,220 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 378,396 
   
  
 
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of square brackets as array decorations at the beginning
+   and end of output, and commas between the rows. It is allowed only in
+   COPY TO, and only when using
+   JSON format. The default is
+   false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..23b570f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,425 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 444,451 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 543,555 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (force_array_specified)
+ errorConflictingDefElem(defel, pstate);
+ 			force_array_specified = true;
+ 			opts_out->force_array = defGetBoolean(defel);
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 
--- 608,625 
  (errcode(ERRCODE_SYNTAX_ERROR),
   errmsg("cannot specify DEFAULT in BINARY mode")));
  
+ 	if (opts_out->

Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as written 
already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).



So your preference would be to not allow the non-array-with-commas case 
but if/when we implement COPY FROM we would accept that format? As in 
Postel's law ("be conservative in what you do, be liberal in what you 
accept from others")?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:12, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like 
maybe including into a larger array.  I definitely appreciate these 
options and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, 
false), (false, true), (true, true)}? Technically you've already 
shown me the (false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array 
case. It says so here:

8<---
+    
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and 
only

+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+ if (opts_out->force_array &&
+ force_row_delimiter_specified &&
+ !opts_out->force_row_delimiter)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("cannot specify FORCE_ROW_DELIMITER 
false with FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?



But what's the point of having it if you're not using FORCE_ARRAY?



See the follow up email -- other databases support it so why not? It 
seems to be a thing...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and export 
style. It was not that tough to support and the code as written already 
does it, so why not?


[1] 
https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe 
including into a larger array.  I definitely appreciate these options 
and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), 
(false, true), (true, true)}?  Technically you've already shown me the 
(false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array case. 
It says so here:

8<---
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+   if (opts_out->force_array &&
+   force_row_delimiter_specified &&
+   !opts_out->force_row_delimiter)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 	 errmsg("cannot specify FORCE_ROW_DELIMITER false with 
FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 12:43, Davin Shearer wrote:

Joe, those test cases look great and the outputs are the same as `jq`.




Forward slash escaping is optional, so not escaping them in Postgres is 
okay. The important thing is that the software _reading_ JSON 
interprets both '\/' and '/' as '/'.


Thanks for the review and info. I modified the existing regression test 
thus:


8<--
create temp table copyjsontest (
id bigserial,
f1 text,
f2 timestamptz);

insert into copyjsontest
  select g.i,
 CASE WHEN g.i % 2 = 0 THEN
   'line with '' in it: ' || g.i::text
 ELSE
   'line with " in it: ' || g.i::text
 END,
 'Mon Feb 10 17:32:01 1997 PST'
  from generate_series(1,5) as g(i);

insert into copyjsontest (f1) values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text);
copy copyjsontest to stdout json;
{"id":1,"f1":"line with \" in it: 1","f2":"1997-02-10T20:32:01-05:00"}
{"id":2,"f1":"line with ' in it: 2","f2":"1997-02-10T20:32:01-05:00"}
{"id":3,"f1":"line with \" in it: 3","f2":"1997-02-10T20:32:01-05:00"}
{"id":4,"f1":"line with ' in it: 4","f2":"1997-02-10T20:32:01-05:00"}
{"id":5,"f1":"line with \" in it: 5","f2":"1997-02-10T20:32:01-05:00"}
{"id":1,"f1":"aaa\"bbb","f2":null}
{"id":2,"f1":"aaa\\bbb","f2":null}
{"id":3,"f1":"aaa/bbb","f2":null}
{"id":4,"f1":"aaa\","f2":null}
{"id":5,"f1":"aaa\fbbb","f2":null}
{"id":6,"f1":"aaa\nbbb","f2":null}
{"id":7,"f1":"aaa\rbbb","f2":null}
{"id":8,"f1":"aaa\tbbb","f2":null}
8<--

I think the code, documentation, and tests are in pretty good shape at 
this point. Latest version attached.


Any other comments or complaints out there?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Add json format mode to COPY TO

Add json format mode support to COPY TO, which includes three output
variations: 1) "json lines" which is each row as a json object delimited
by newlines (the default); 2) "json lines", except include comma delimiters
between json objects; and 3) "json array" which is the same as #2, but with
the addition of a leading "[" and trailing "]" to form a valid json array.

Early versions: helpful hints/reviews provided by Nathan Bossart,
Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan
and Davin Shearer.

Requested-by: Davin Shearer
Author: Joe Conway
Reviewed-by: Andrew Dunstan, Davin Shearer 
Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5


diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c

Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/4/23 21:54, Joe Conway wrote:

On 12/4/23 17:55, Davin Shearer wrote:

There are however a few characters that need to be escaped



 1. |"|(double quote)
 2. |\|(backslash)
 3. |/|(forward slash)
 4. |\b|(backspace)
 5. |\f|(form feed)
 6. |\n|(new line)
 7. |\r|(carriage return)
 8. |\t|(horizontal tab)

These characters should be represented in the test cases to see how the 
escaping behaves and to ensure that the escaping is done properly per 
JSON requirements.


I can look at adding these as test cases.

So I did a quick check:
8<--
with t(f1) as
(
  values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text)
)
select
  length(t.f1),
  t.f1,
  row_to_json(t)
from t;
 length | f1  |row_to_json
+-+---
  7 | aaa"bbb | {"f1":"aaa\"bbb"}
  7 | aaa\bbb | {"f1":"aaa\\bbb"}
  7 | aaa/bbb | {"f1":"aaa/bbb"}
  7 | aaa\x08bbb  | {"f1":"aaa\bbbb"}
  7 | aaa\x0Cbbb  | {"f1":"aaa\fbbb"}
  7 | aaa+| {"f1":"aaa\nbbb"}
| bbb |
  7 | aaa\rbbb| {"f1":"aaa\rbbb"}
  7 | aaa bbb | {"f1":"aaa\tbbb"}
(8 rows)

8<--

This is all independent of my patch for COPY TO. If I am reading that 
correctly, everything matches Davin's table *except* the forward slash 
("/"). I defer to the experts on the thread to debate that...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: introduce dynamic shared memory registry

2023-12-05 Thread Joe Conway

On 12/4/23 22:46, Nathan Bossart wrote:

Every once in a while, I find myself wanting to use shared memory in a
loadable module without requiring it to be loaded at server start via
shared_preload_libraries.  The DSM API offers a nice way to create and
manage dynamic shared memory segments, so creating a segment after server
start is easy enough.  However, AFAICT there's no easy way to teach other
backends about the segment without storing the handles in shared memory,
which puts us right back at square one.

The attached 0001 introduces a "DSM registry" to solve this problem.  The
API provides an easy way to allocate/initialize a segment or to attach to
an existing one.  The registry itself is just a dshash table that stores
the handles keyed by a module-specified string.  0002 adds a test for the
registry that demonstrates basic usage.

I don't presently have any concrete plans to use this for anything, but I
thought it might be useful for extensions for caching, etc. and wanted to
see whether there was any interest in the feature.


Notwithstanding any dragons there may be, and not having actually looked 
at the patches, I love the concept! +
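
Skimming the description, I imagine extension code would end up looking 
roughly like the sketch below -- purely hypothetical, with the function name 
and signature assumed from the description rather than read from 0001:

8<---
/* hypothetical usage sketch; API details assumed, not verified */
typedef struct MyExtState
{
    int         counter;
} MyExtState;

static void
myext_init_state(void *ptr)
{
    /* runs once, when the segment is first created */
    ((MyExtState *) ptr)->counter = 0;
}

...
    bool        found;
    MyExtState *state;

    state = GetNamedDSMSegment("my_extension", sizeof(MyExtState),
                               myext_init_state, &found);
8<---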


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 17:55, Davin Shearer wrote:
Sorry about the top posting / top quoting... the link you sent me gives 
me a 404.  I'm not exactly sure what top quoting / posting means and 
Googling those terms wasn't helpful for me, but I've removed the quoting 
that my mail client is automatically "helpfully" adding to my emails.  I 
mean no offense.


No offense taken. But it is worthwhile to conform to the very long 
established norms of the mailing lists on which you participate. See:


  https://en.wikipedia.org/wiki/Posting_style

I would describe the Postgres list style (based on that link) as

   "inline replying, in which the different parts of the reply follow
the relevant parts of the original post...[with]...trimming of the
original text"


There are however a few characters that need to be escaped



 1. |"|(double quote)
 2. |\|(backslash)
 3. |/|(forward slash)
 4. |\b|(backspace)
 5. |\f|(form feed)
 6. |\n|(new line)
 7. |\r|(carriage return)
 8. |\t|(horizontal tab)

These characters should be represented in the test cases to see how the 
escaping behaves and to ensure that the escaping is done properly per 
JSON requirements.


I can look at adding these as test cases. The latest version of the 
patch (attached) includes some of that already. For reference, the tests 
so far include this:


8<---
test=# select * from copytest;
  style  |   test   | filler
-+--+
 DOS | abc\r   +|  1
 | def  |
 Unix| abc +|  2
 | def  |
 Mac | abc\rdef |  3
 esc\ape | a\r\\r\ +|  4
 | \nb  |
(4 rows)

test=# copy copytest to stdout (format json);
{"style":"DOS","test":"abc\r\ndef","filler":1}
{"style":"Unix","test":"abc\ndef","filler":2}
{"style":"Mac","test":"abc\rdef","filler":3}
{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
8<---

At this point "COPY TO" should be sending exactly the unaltered output 
of the postgres JSON processing functions.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..0236a9e 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,426 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_row_delimiter_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 445,452 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 544,563 
  defel->de

Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 09:25, Andrew Dunstan wrote:


On 2023-12-04 Mo 08:37, Joe Conway wrote:

On 12/4/23 07:41, Andrew Dunstan wrote:


On 2023-12-03 Su 20:14, Joe Conway wrote:

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has 
been my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct 
quoting for special characters. I use `jq` on the command line to 
validate and format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), 
that would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?


I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to
escape characters such as \, which produces the effect complained of
here, or else we need to change its setup so we have a way to inhibit
that escaping.



Interesting.

I am surprised this has never been raised as a problem with COPY TO 
before.


Should the JSON output, as produced by composite_to_json(), be sent 
as-is with no escaping at all? If yes, is JSON somehow unique in this 
regard?



Text mode output is in such a form that it can be read back in using
text mode input. There's nothing special about JSON in this respect -
any text field will be escaped too. But output suitable for text mode
input is not what you're trying to produce here; you're trying to
produce valid JSON.

So, yes, the result of composite_to_json, which is already suitably
escaped, should not be further escaped in this case.


Gotcha.

This patch version uses CopySendData() instead and includes 
documentation changes. Still lacks regression tests.
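
Roughly speaking, the per-row emission now boils down to something like this 
sketch (variable names are illustrative; the patch itself differs in detail):

8<---
/* send the already-escaped JSON text verbatim -- no further escaping */
composite_to_json(rowdatum, result, false);
CopySendData(cstate, result->data, result->len);
CopySendEndOfRow(cstate);
8<---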


Hopefully this looks better. Any other particular strings I ought to 
test with?


8<--
test=# copy (select * from foo limit 4) to stdout (format json, 
force_array true);

[
 {"id":1,"f1":"line with \" in it: 
1","f2":"2023-12-03T12:26:41.596053-05:00"}
,{"id":2,"f1":"line with ' in it: 
2","f2":"2023-12-03T12:26:41.596173-05:00"}
,{"id":3,"f1":"line with \" in it: 
3","f2":"2023-12-03T12:26:41.596179-05:00"}
,{"id":4,"f1":"line with ' in it: 
4","f2":"2023-12-03T12:26:41.596182-05:00"}

]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..46ec34f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 443,450 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
***

Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 07:41, Andrew Dunstan wrote:


On 2023-12-03 Su 20:14, Joe Conway wrote:

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has 
been my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct 
quoting for special characters. I use `jq` on the command line to 
validate and format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), that 
would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?


I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to
escape characters such as \, which produces the effect complained of
here, or else we need to change its setup so we have a way to inhibit
that escaping.



Interesting.

I am surprised this has never been raised as a problem with COPY TO before.

Should the JSON output, as produced by composite_to_json(), be sent 
as-is with no escaping at all? If yes, is JSON somehow unique in this 
regard?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has been 
my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct quoting 
for special characters. I use `jq` on the command line to validate and 
format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), that 
would be a missing feature (or possibly a bug?).
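
FWIW, a jq check like the one you describe is easy enough to wire up -- a 
sketch, assuming the COPY output landed in a file:

8<---
# each line of "json lines" output should parse as its own JSON document
jq -c . < /tmp/buf > /dev/null && echo valid
8<---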


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?

Joe


On Sun, Dec 3, 2023, 10:51 Joe Conway <mailto:m...@joeconway.com>> wrote:


On 12/3/23 10:31, Davin Shearer wrote:
 > Please be sure to include single and double quotes in the test
values
 > since that was the original problem (double quoting in COPY TO
breaking
 > the JSON syntax).

test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it:
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it:
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it:
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it:
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}

-- 
Joe Conway

PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com <https://aws.amazon.com>



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 14:52, Andrew Dunstan wrote:


On 2023-12-03 Su 14:24, Joe Conway wrote:

On 12/3/23 11:03, Joe Conway wrote:

On 12/3/23 10:10, Andrew Dunstan wrote:
I  realize this is just a POC, but I'd prefer to see 
composite_to_json()

not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


Just FYI, this change does lose some performance in my not massively 
scientific A/B/A test:


8<---



8<---

That is somewhere in the 3% range.


I assume it's because datum_to_json() constructs a text value from which
you then need to extract the cstring, whereas composite_to_json() just
gives you back the stringinfo. I guess that's a good enough reason to go
with exposing composite_to_json().


Yeah, that was why I went that route in the first place. If you are good 
with it I will go back to that. The code is a bit simpler too.
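
For reference, the difference between the two call paths looks roughly like 
this (a sketch with illustrative variable names, not the actual patch text):

8<---
/* datum_to_json(): builds a text datum, then we copy out a cstring */
js  = datum_to_json(rowdatum, JSONTYPE_COMPOSITE, F_RECORD_OUT);
str = TextDatumGetCString(js);
CopySendString(cstate, str);

/* composite_to_json(): appends straight into the caller's StringInfo */
composite_to_json(rowdatum, buf, false);
CopySendData(cstate, buf->data, buf->len);
8<---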


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 11:03, Joe Conway wrote:

On 12/3/23 10:10, Andrew Dunstan wrote:

I  realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


Just FYI, this change does lose some performance in my not massively 
scientific A/B/A test:


8<---
-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37196.898 ms (00:37.197)
Time: 37408.161 ms (00:37.408)
Time: 38393.309 ms (00:38.393)
Time: 36855.438 ms (00:36.855)
Time: 37806.280 ms (00:37.806)

Avg = 37532

-- original patch
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37426.207 ms (00:37.426)
Time: 36068.187 ms (00:36.068)
Time: 38285.252 ms (00:38.285)
Time: 36971.042 ms (00:36.971)
Time: 35690.822 ms (00:35.691)

Avg = 36888

-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 39083.467 ms (00:39.083)
Time: 37249.326 ms (00:37.249)
Time: 38529.721 ms (00:38.530)
Time: 38704.920 ms (00:38.705)
Time: 39001.326 ms (00:39.001)

Avg = 38513
8<---

That is somewhere in the 3% range.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 11:03, Joe Conway wrote:

  From your earlier post, regarding constructing the aggregate -- not
extensive testing but one data point:
8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--


Also if the table is large enough, the aggregate method is not even 
feasible whereas the COPY TO method works:

8<--
test=# select count(*) from foo;
  count
--
 2000
(1 row)

test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1 
more bytes.


test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 2000
8<----------
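
For what it's worth, the failure above is simply the 1GB limit on a single 
value: json_agg() has to materialize the entire document in one buffer, 
whereas COPY TO emits a row at a time. The obvious aggregate-free variant 
avoids that limit but runs straight into the original escaping problem -- 
a sketch, purely for illustration (not part of the patch):

8<--
-- one JSON object per row, no giant buffer, but plain text-format COPY
-- applies its own backslash escaping to the emitted JSON, corrupting the
-- \" escapes for any value containing a double quote
copy (select row_to_json(foo) from foo) to '/tmp/buf';
8<--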

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 10:10, Andrew Dunstan wrote:


On 2023-12-01 Fr 14:28, Joe Conway wrote:

On 11/29/23 10:32, Davin Shearer wrote:

Thanks for the responses everyone.

I worked around the issue using the `psql -tc` method as Filip 
described.


I think it would be great to support writing JSON using COPY TO at 
some point so I can emit JSON to files using a PostgreSQL function 
directly.


-Davin

On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák  wrote:


    This would be a very special case for COPY. It applies only to a single
    column of JSON values. The original problem can be solved with psql
    --tuples-only as David wrote earlier.


    $ psql -tc 'select json_agg(row_to_json(t))
               from (select * from public.tbl_json_test) t;'

   [{"id":1,"t_test":"here's a \"string\""}]


    Special-casing any encoding/escaping scheme leads to bugs and harder
    parsing.


(moved to hackers)

I did a quick PoC patch (attached) -- if there is interest and no hard 
objections I would like to get it up to speed for the January commitfest.


Currently the patch lacks documentation and regression test support.

Questions:
--
1. Is supporting JSON array format sufficient, or does it need to 
support some other options? How flexible does the support scheme need 
to be?


2. This only supports COPY TO and we would undoubtedly want to support 
COPY FROM for JSON as well, but is that required from the start?


Thanks for any feedback.


I realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


I think JSON array format is sufficient.


The other formats make sense from a completeness standpoint (versus 
other databases) and the latest patch already includes them, so I still 
lean toward supporting all three formats.
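
For reference, the three output shapes in the latest patch look roughly 
like this (abbreviated from the examples elsewhere in this thread):

8<--
copy foo to stdout (format json);
-- one object per line (effectively JSON Lines)
copy foo to stdout (format json, force_row_delimiter);
-- objects separated by leading commas
copy foo to stdout (format json, force_array);
-- the comma-delimited objects wrapped in [ ... ]
8<--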



I can see both sides of the COPY FROM argument, but I think insisting on
that makes this less doable for release 17. On balance I would stick to
COPY TO for now.


WFM.

From your earlier post, regarding constructing the aggregate -- not 
extensive testing but one data point:

8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 10:31, Davin Shearer wrote:
Please be sure to include single and double quotes in the test values 
since that was the original problem (double quoting in COPY TO breaking 
the JSON syntax).


test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it: 
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it: 
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it: 
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it: 
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/2/23 17:37, Joe Conway wrote:

On 12/2/23 16:53, Nathan Bossart wrote:

On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:

So if you are writing a production that might need to match
FORMAT followed by JSON, you need to match FORMAT_LA too.


Thanks for the pointer.  That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
 				{
 					$$ = makeDefElem($1, $2, @1);
 				}
+			| FORMAT_LA copy_generic_opt_arg
+				{
+					$$ = makeDefElem("format", $2, @1);
+				}
 		;
 
 copy_generic_opt_arg:



Yep -- I concluded the same. Thanks Tom!


The attached implements the above repair, as well as adding support for 
array decoration (or not) and/or comma row delimiters when not an array.


This covers the three variations of json import/export formats that I 
have found after light searching (SQL Server and DuckDB).


Still lacks documentation, tests, and COPY FROM support, but here is 
what it looks like in a nutshell:


8<---
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo
  select g.i,
 'line: ' || g.i::text,
 clock_timestamp()
  from generate_series(1,4) as g(i);

copy foo to stdout (format json);
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (format json, force_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

copy foo to stdout (format json, force_row_delimiter);
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (force_array);
ERROR:  COPY FORCE_ARRAY requires JSON mode

copy foo to stdout (force_row_delimiter);
ERROR:  COPY FORCE_ROW_DELIMITER requires JSON mode
8<---


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..1f9ac31 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 443,450 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 542,559 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_row_delimiter") == 0)
+ 		{
+ 			if (opts_out->force_row_delimiter)
+ errorConflictingDefElem(defel, pstate);
+ 			opts_out->force_row_delimiter = true;
+ 		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (opts_out->force_array)
+ errorConflictingDefElem(defel, pstate);
+ 			opts_out->force_array = true;
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 
--- 612,631 
  (errcode(ERRCODE_SYNTAX_ERROR),
   errmsg("cannot specify DEFAULT in BINARY mode")));
  
+ 	if (opts_out->json_mode)
+ 	{
+ 		if (opts_out->force_array)
+ 			opts_out->force_row_delimiter = true;
+ 	}
+ 	else if (opts_out->force_array)
+ 		ereport(ERROR,
+ (errcode(ERR

Re: Emitting JSON to file using COPY TO

2023-12-02 Thread Joe Conway

On 12/2/23 13:50, Maciek Sakrejda wrote:

On Fri, Dec 1, 2023 at 11:32 AM Joe Conway  wrote:

1. Is supporting JSON array format sufficient, or does it need to
support some other options? How flexible does the support scheme need to be?


"JSON Lines" is a semi-standard format [1] that's basically just
newline-separated JSON values. (In fact, this is what
log_destination=jsonlog gives you for Postgres logs, no?) It might be
worthwhile to support that, too.

[1]: https://jsonlines.org/



Yes, I have seen examples of that associated with other databases (MSSQL 
and Duckdb at least) as well. It probably makes sense to support that 
format too.
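
(For what it's worth, the follow-up patch that adds FORCE_ARRAY makes 
exactly this shape the default: without FORCE_ARRAY, each row comes out as 
one JSON object per line. A quick sketch, abbreviated:)

8<--
copy foo to stdout (format json);
-- {"id":1,"f1":"line: 1","f2":"..."}
-- {"id":2,"f1":"line: 2","f2":"..."}
8<--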


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-02 Thread Joe Conway

On 12/2/23 16:53, Nathan Bossart wrote:

On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:

So if you are writing a production that might need to match
FORMAT followed by JSON, you need to match FORMAT_LA too.


Thanks for the pointer.  That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
 				{
 					$$ = makeDefElem($1, $2, @1);
 				}
+			| FORMAT_LA copy_generic_opt_arg
+				{
+					$$ = makeDefElem("format", $2, @1);
+				}
 		;
 
 copy_generic_opt_arg:



Yep -- I concluded the same. Thanks Tom!
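
To make the symptom concrete: FORMAT immediately followed by JSON is turned 
into FORMAT_LA by the lookahead filter, so the generic option production 
never matched the unquoted spelling. Roughly:

8<--
copy foo to stdout (format 'json');  -- parsed with the PoC patch
copy foo to stdout (format json);    -- syntax error before the gram.y fix
-- with the FORMAT_LA production added, both spellings parse
8<--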

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-02 Thread Joe Conway

On 12/1/23 18:09, Nathan Bossart wrote:

On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:

I did a quick PoC patch (attached) -- if there is interest and no hard
objections I would like to get it up to speed for the January commitfest.


Cool.  I would expect there to be interest, given all the other JSON
support that has been added thus far.


Thanks for the review


I noticed that, with the PoC patch, "json" is the only format that must be
quoted.  Without quotes, I see a syntax error.  I'm assuming there's a
conflict with another json-related rule somewhere in gram.y, but I haven't
tracked down exactly which one is causing it.


It seems to be because 'json' is also a type name ($$ = 
SystemTypeName("json")).


What do you think about using 'json_array' instead? It is more specific 
and accurate, and avoids the need to quote.


test=# copy foo to stdout (format json_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]


1. Is supporting JSON array format sufficient, or does it need to support
some other options? How flexible does the support scheme need to be?


I don't presently have a strong opinion on this one.  My instinct would be to
start with something simple, though.  I don't think we offer any special
options for log_destination...


WFM


2. This only supports COPY TO and we would undoubtedly want to support COPY
FROM for JSON as well, but is that required from the start?


I would vote for including COPY FROM support from the start.


Check. My thought is to only accept the same format we emit -- i.e. only 
take a json array.
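
Something like this hypothetical round trip, once COPY FROM learns the same 
format (syntax illustrative only -- none of this is in the current patch):

8<--
copy foo to '/tmp/foo.json' (format json);
-- and later, reading back only the same array-of-objects shape:
copy foo from '/tmp/foo.json' (format json);
8<--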



!   if (!cstate->opts.json_mode)


I think it's unfortunate that this further complicates the branching in
CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
worth refactoring a bunch of stuff to make this nicer.


Yeah that was my conclusion.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-01 Thread Joe Conway

On 12/1/23 22:00, Davin Shearer wrote:
I'm really glad to see this taken up as a possible new feature and will 
definitely use it if it gets released.  I'm impressed with how clean, 
understandable, and approachable the postgres codebase is in general and 
how easy it is to read and understand this patch.


I reviewed the patch (though I didn't build and test the code) and have 
a concern with adding the '[' at the beginning and ']' at the end of the 
json output.  Those are already added by `json_agg` 
(https://www.postgresql.org/docs/current/functions-aggregate.html) as 
you can see in my initial email.  Adding them in the COPY TO may be 
redundant (e.g., [[{"key":"value"...}]]).


With this patch in place you don't use json_agg() at all. See the 
example output (this is real output with the patch applied):


(oops -- I meant to send this with the same email as the patch)

8<-
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo
  select g.i,
 'line: ' || g.i::text,
 clock_timestamp()
  from generate_series(1,4) as g(i);

copy foo to stdout (format 'json');
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]
8<-


I think COPY TO makes good sense to support, though COPY FROM maybe not 
so much as JSON isn't necessarily flat and rectangular like CSV.


Yeah -- definitely not as straightforward but possibly we just support 
the array-of-jsonobj-rows as input as well?


For my use-case, I'm emitting JSON files to Apache NiFi for processing, 
and NiFi has superior handling of JSON (via JOLT parsers) versus CSV 
where parsing is generally done with regex.  I want to be able to emit 
JSON using a postgres function and thus COPY TO.


Definitely +1 for COPY TO.

I don't think COPY FROM will work out well unless the JSON is required 
to be flat and rectangular.  I would vote -1 to leave it out due to the 
necessary restrictions making it not generally useful.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




