Re: Timestamp with vs without time zone.

2021-09-22 Thread Tim Cross


Tim Uckun  writes:

> I'll add another layer of complication.
>
> You have a database server hosted in Australia, and that's also where
> your web server and api server is. You have customers all over the
> world though so you set up additional API servers in Europe, USA,
> Japan etc.
>
> A korean user will fetch you single page app as static HTML from S3
> with cloudfront. It will hit your japanese API server,  which will
> fetch the data from your japanese read only replica with the master
> being in Australia.
>
> The master DB  writes the records has to know your end user is in
> Korea somehow so you have to carry that time zone all the way across
> those tiers.
>
> To me the ideal solution would be to have a compound object which has
> the time zone in it. This object gets passed through the tiers and end
> up at the database where it's stored.
>

I wouldn't do it that way. I would have all timestamps in UTC. If the
location of the client is necessary for some business process, then I
would explicitly record that information rather than rely on time zone
data to derive location.

I've also found having all timestamps in UTC makes any analysis and bug
tracing much easier. Nothing is worse than needing to look at timestamps
from different components of your architecture spread over different
time zones, where you first have to convert them all to a common,
consistent time zone for comparison and sequencing purposes.
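
For example, a minimal sketch of that approach (table and column names
are hypothetical): store the event time as timestamptz, which PostgreSQL
normalises to UTC, and record the client's zone or location in its own
column.

CREATE TABLE user_event (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    occurred_at timestamptz NOT NULL DEFAULT now(),  -- stored normalised to UTC
    client_tz   text NOT NULL                        -- recorded explicitly, e.g. 'Asia/Seoul'
);

-- render in the client's own zone only when needed for display
SELECT occurred_at AT TIME ZONE client_tz AS local_time
FROM user_event;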




Re: Timestamp with vs without time zone.

2021-09-21 Thread Tim Cross


FWS Neil  writes:

>  On Sep 21, 2021, at 12:34 PM, Dave Cramer  wrote:
>  On Tue, 21 Sept 2021 at 13:20, Peter J. Holzer  wrote:
>
>  On 2021-09-21 20:50:44 +1200, Tim Uckun wrote:
>  > That's all true and I won't argue about the madness that is timezones
>  > in the world. I am simply thinking it would be some sort of a struct
>  > like thing which would store the numerical value of the time stamp and
>  > also the time zone that time was recorded in.  Presumably everything
>  > else is an insane calculation from there. What was the offset on that
>  > day? I guess it depends on the daylight savings time. What would the
>  > conversion to another time zone be? That would depend on the DST
>  > settings on that day in both places.
>
>  Yes, but HOW IS THAT TIME ZONE STORED? 
>
>  As a user you can say "I don't care, just make it work somehow".
>
>  But as a developer you have to decide on a specific way. And as a
>  database developer in particular you would have to choose a way which
>  works for almost everybody.
>
>  And that's the problem because ...
>
>  > Mankind can't agree on what side of the road to drive on, what the
>  > electrical voltage should be at the wall, what those plugs should be,
>  > how you should charge your phone or anything else for that matter
>
>  ... people have different needs and it would be difficult to satisfy
>  them all.
>
>  Simply storing an offset from UTC is simple, fast, doesn't take much
>  space - but it would be almost as misleading as the current state. A
>  simple offset is not a time zone.
>
>  Storing the IANA timezone names (e.g. 'Europe/Vienna') would store an
>  identifier for what most people think of as a time zone - but that takes
>  a lot of space, it needs a lookup for almost any operation and worst of
>  all, you couldn't index such a column (at least not with a btree index)
>  because the comparison functions aren't stable.
>
>  You could use a numeric indentifier instead of the name, that would take
>  less space but wouldn't solve the other problems (and add the problem
>  that now you have just added another mapping which you need to maintain).
>
>  There are other ways, but I'm sure they all have some pros and some
>  cons. None will be perfect.
>
>  So I don't think there is an obvious (or even non-obvious, but clearly
>  good) way for the PostgreSQL developers to add a real "timestamp with
>  timezone" type.
>
>  As an application developer however, you can define a compound type (or
>  just use two or three columns together) which satisfies the needs of
>  your specific application.
>
>  > It's just that the phrase "timestamp with time zone" would seem to
>  > indicate the time zone is stored somewhere in there.
>
>  I absolutely agree. Calling a type which doesn't include a timezone
>  "timestamp with timezone" is - how do I put this? - more than just
>  weird. "timestamp without timezone" should be called "local timestamp
>  with unspecified timezone" and "timestamp with timezone" should be
>  called "global timestamp without timezone". However, those aren't SQL
>  names.
>
>  I would say this is a perspective thing. It's a timestamp with a time zone 
> from the client's perspective.
>
> A timestamp cannot have a time zone and be a valid timestamp.
>
> Let me explain.
>
> A timestamp is a single time that exists in the world.  For example March 1, 
> 2021, 4:15 am is a timestamp.
>
> If you add a time zone (other than UTC) then a time stamp is not always a 
> single time that exists in the world.
>
> For example in the spring using time zone American/Chicago, on April 14, 2021 
> the time zone time changes at 2am to become
> 3am.  The time April 14, 2021, 2:30 am simply does not exists.  And therefore 
> cannot be a timestamp.  Apple’s APIs will by
> default automatically change 2:30am to 3:00am.  Is that correct?  Or should 
> it change to 3:30am?  Apple has the option for the
> latter, but the APIs don’t work.
>
> In the fall it is even worse.  Using time zone America/Chicago, on November 
> 7, 2021, 1:30 am occurs twice.  That does not work
> as a timestamp.  Which one do you use, the early one or the late one.  
> Apple’s APIs give you a choice.
>
> The point being that people do expect to see times in local time, but the 
> only real timestamp is UTC and I can’t ever imagine a
> need to store time zone information related to a timestamp.  If you need to 
> store the location that data originated from, then
> store the location or the Time Zone, but it should not be connected to the 
> timestamp.  Location data is completely different than
> time data.
>

+1. This is the key point that is often overlooked. To make matters even
more complex, the daylight saving switch-over dates can change, often at
the whim of politicians. For example, daylight saving time has been
changed in Australia because of major events (such as the Olympics). As
soon as you bring time zones into the mix, any calculation based on
differences between dates must take those changes into account.

Re: Storing state machine

2021-04-19 Thread Tim Cross


Mohan Radhakrishnan  writes:

> Hello,
>We have a workflow when we receive events into the service. But we 
> don't have a way to choreograph or orchestrate the workflow. The
> services are all independent and receive and respond to events.
>
> Since there is no order imposed by the event queues I was thinking of storing 
> a simple state machine in the table.
>
> 1. Order PENDING
> 2. Order line 1 PENDING 
> 3. Order line 2 PENDING 
> 4. Order line 1 PROCESSED
> 5. Order line 2 PROCESSED 
> 6. Order PROCESSED
>
> Order and Order lines can be PROCESSED in any order. But at the end all lines 
> should be  PROCESSED and then the order is also  PROCESSED.
> I won't be able to use any PostgreSql functions because we don't depend on 
> those features.
>
> Are there any PostgreSql features that could support this pattern ? Is it 
> just like any other eventually consistent pattern ?
>   

What you appear to have here is two entities - orders and order items.
An order entity has a 'state' (pending/processed) and is linked to 
one or more order items which in turn have a state.

The information about order state could be derived rather than actually
stored, i.e. an order is pending if any of its order items are pending
and is processed once all of its order items are processed. At a
minimum, storing just the order item state would be sufficient, and a
basic SQL statement can tell you what the state of an order is.
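
As a rough sketch, assuming hypothetical tables orders(id, ...) and
order_item(order_id, state, ...), something like this derives the order
state:

SELECT o.id,
       CASE WHEN bool_and(i.state = 'PROCESSED') THEN 'PROCESSED'
            ELSE 'PENDING'
       END AS order_state
FROM orders o
JOIN order_item i ON i.order_id = o.id
GROUP BY o.id;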

In general, you don't want to store duplicate or redundant information,
as it can be a source of anomalies (e.g. the order state is not updated
to 'processed' when all items are processed, or it is updated to
processed but then another item is added and the state is not switched
back to pending).

In general, it is usually a mistake or poor design to use one table to
represent different 'entities'. That is a 'bit bucket' approach which
really degrades the ability of the database to do what it is good at -
managing entities and their relationships.

From the description you have provided, everything you need can be
easily managed with basic SQL statements - no need for functions or
stored procedures. All you would need is an SQL statement to insert a
new order item, an SQL statement to update the state of an item and an
SQL statement to report on the state of an order.

Your requirement statement is extremely simple and I suspect you have
glossed over some of the constraints/requirements, but based on what you
have written, what you need seems trivial and easily satisfied with
basic database facilities.

--
Tim Cross




Re: The Amazon CloudFront distribution is configured to block access from your country.

2021-04-11 Thread Tim Cross


Jan Wieck  writes:

> On 4/11/21 11:16 AM, Ron wrote:
>> If my software requires zlib, and I link to https://zlib.net/ on my page, 
>> then
>> your logic absolutely means that I'm responsible for supporting zlib.
>
> Let's clarify "support" in this context. If you link to zlib.net, you are not
> responsible for zlib itself. But if for whatever reason that link breaks, you
> are to some degree responsible for giving a user (who tried to follow that 
> link)
> a helping hand where to find what you linked to now.
>
> The OP did precisely that. Asking for help because a link didn't work. And I
> think it was perfectly appropriate to ask here.

I agree, especially given the postgres pages are not explicit/clear that
the download link is a 3rd party link. The OPs question was quite
reasonable and all that was required was to point them to EDB for
assistance. Thankfully, a number of community members did politely do
this and I'm sure the OP was appreciative for the pointers. It has also
highlighted an area for possible clarification/improvement on the
postgres.org site.

-- 
Tim Cross




Re: About CVE-2018-1058

2021-03-26 Thread Tim Cross


"Mahongwei (March, ICSL)"  writes:

> Hi
>
>  
>
> From this blog: 
> https://wiki.postgresql.org/wiki/A_Guide_to_CVE-2018-1058%3A_Protect_Your_Search_Path
>
>  
>
> Do you think that this is just a mitigation, not a real fix, is there any 
> plan to fix it?
>

I would argue that many possible vulnerabilities are 'mitigated' rather
than 'fixed' and that the guidelines outlined in that blog post are
effective mitigation of this vulnerability and therefore sufficient.
Some would argue passwords are only a mitigation for unauthorised
account access, not a fix. 

The CVE-2018-1058 vulnerability is really an example of a more general issue
associated with environments which use some sort of search path to
determine the location for executable code. Operating systems have a
similar vulnerability with the PATH variable used to determine the
search path for executable files. Basically, if someone can get a user
to add a directory which they have write access to, they can create a
trojan which could be exploited to compromise user security. It is for
this reason that the general advice on Linux systems is to add a
personal 'bin' or other non-standard directory to the end of your PATH
variable rather than the beginning. For example, consider

export PATH=~/bin:$PATH

compared to

export PATH=$PATH:~/bin

and running the command

sudo some_command

to run "some_command" with increased privileges. If some 'bad actor' managed
to compromise this user (a user who might have lower security controls
than a "high risk" account or be on a host in a more 'permissive'
security zone), they could drop a trojan into the user's bin directory.
Under the first PATH example, the trojan sudo command will be executed
when the user runs sudo; under the second, the system (and secure) sudo
command will be found before the trojan and executed safely.

Is the advice to add personal bin directories (or, more generally, to
add additional 'non-system' directories at the end of the PATH) a
mitigation of this risk? Yes. Is it sufficient protection? Yes. Does it
need to be 'fixed'? Probably not - it is hard to find ways of preventing
the issue which don't either significantly limit the utility of a search
path or add a level of inconvenience which would be difficult to
justify.

In many respects, the search_path variable is not really the issue. The
issue is having a default 'public' schema which can be used by any user
to add functions/procedures that could 'shadow' system functions or
procedures or have a higher precedence in the search path than the
function/procedure the user intends to run.

In general, I will remove the 'public' schema from the default
search_path. If users want to access objects in that schema, they
either have to explicitly specify the schema or they need to alter their
search_path and add public. I also discourage users from using the
public schema. Generally, I don't prevent users from being able to add
public to their search path or from creating objects in the public
schema (I assume if they go to the length of setting their search_path,
they should know what they are doing). When risk assessments warrant it,
I will remove the ability to create objects in public. 
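
For what it's worth, that sort of hardening boils down to a few
statements along these lines (the database, schema and role names here
are just placeholders):

-- stop ordinary users creating objects in the public schema
REVOKE CREATE ON SCHEMA public FROM PUBLIC;

-- take public out of the default search path
ALTER DATABASE mydb SET search_path = "$user";

-- give the application its own schema and make it the default for its role
CREATE SCHEMA myapp AUTHORIZATION myapp_owner;
ALTER ROLE myapp_user SET search_path = myapp;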

Applications should put their database objects inside an application
specific schema. Schemas are really just namespaces for databases and
provide similar benefits to namespaces in programming languages. Failing
to take advantage of this facility is like writing BASIC with lots of
GOSUB and GOTOs - full of unexpected side effects, difficult to follow
and hard to manage.

-- 
Tim Cross




Re: CSV From Oracle with timestamp column getting errors

2021-03-22 Thread Tim Cross


"Saha, Sushanta K"  writes:

> \COPY table1 FROM '/tmp/Oracle_2020_06.csv' DELIMITER ',' CSV HEADER;
> ERROR:  invalid input syntax for type timestamp: "01-JUN-20 06.04.20.634000 
> AM"
> CONTEXT:  COPY table1, line 2, column last_update_timestamp: "01-JUN-20 
> 06.04.20.634000 AM"
>

The problem is that PostgreSQL doesn't recognise the timestamp format
being used in the CSV dump from Oracle.

Modify the SQL used to extract the data from Oracle so that it formats
the timestamp as a string which PostgreSQL can parse into a timestamp
type - for example ISO 8601 or any of the other accepted date/time input
formats (see the manual).
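
One way to do that on the Oracle side (assuming the column is the
last_update_timestamp shown in the error message and is a TIMESTAMP) is
to format it as ISO 8601 in the extract query, for example:

SELECT TO_CHAR(last_update_timestamp, 'YYYY-MM-DD HH24:MI:SS.FF6') AS last_update_timestamp
FROM table1;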

-- 
Tim Cross




Re: Compare with default value?

2021-03-14 Thread Tim Cross


Ulrich Goebel  writes:

> Hi,
>
> Am 13.03.21 um 22:21 schrieb Tim Cross:
>> Ulrich Goebel  writes:
>>
>>> Hi,
>>>
>>> o.k. I have to give some more information...
>>>
>>> Am 13.03.21 um 19:39 schrieb David G. Johnston:
>>>> On Saturday, March 13, 2021, Ulrich Goebel >>> <mailto:m...@fam-goebel.de>> wrote:
>>>>
>>>>  I would like to get the rows, where a column has the default value,
>>>>  similar to:
>>>>  select id fromt tbl where col = default
>>>>
>>>> If the default is a simple constant then why go through the trouble 
>>>> instead of
>>>> just writing col = ‘constant’ ?
>>>
>> I think you may need to re-think your design or at least come at it from
>> a different perspective. As shown by another post in the thread, at some
>> level, this is 'sort of' possible, but it will be ugly and fragile.
>
> Yes, I am re-thinking allredy...
>
>> Possibly one of the issues you have is on one hand, you want the
>> solution to be as generic as possible, but on the other, you require it
>> to have specific knowledge about the definition of your table. i.e.
>> which column(s) have a default value. Furthermore, your 'completion'
>> table will need to be defined based on this information i.e. potentially
>> multiple columns with multiple different data types etc.
>> While the additional information you provided does give some increased
>> understanding of what your trying to do, it is still describing your
>> desired solution implementation. It might be better if you provide more
>> high level details of what your attempting to do so that we can better
>> understand how you arrived at the proposed solution and whether there
>> may be better alternatives available.
>
> o.k.: For a conference I have a tbl_person which holds all peoble which are
> involved: participants and people which provide private lodgins for other
> participants. (There are much more roles, but for illustration these two 
> should
> be enough.) Of course each person can have one or more roles at the 
> conference,
> a n-n-relation models that. Now our workflow allows that one person find it 
> way
> in the tbl_person twice (ore even more often): for example the conference 
> office
> generates a row for Tom, because he provides private lodgin. Later on Tom
> decides to participate an fills the online registration formular. These data
> generate the second row for Tom. Both rows hold significant information which
> the other doesn't hold. Let's say the online register gave the birthday, the
> other holds information about the lodgin (bed with or without breakfast). The
> next step then is that the conference office get notice of the doubled person 
> an
> should make one row out of the existing two rows. The office decide which of 
> the
> two rows should be completed with data from the other row. Therefore I would
> like to pick the columns in the first row where we have default values and
> replace it by the value from the second row.
>
> There are more ways to end with two or even more rows per person. May be it
> would have been much better to avoid these possibilities. But for the moment I
> have a given database structure running in productive mode, so it is not easy 
> to
> re-structure the structure or even the workflows...
>


I really hate to say this, but I think you have a major fundamental flaw
in your database design. As soon as I read

> Now our workflow allows that one person find it way
> in the tbl_person twice (ore even more often): for example the conference 
> office
> generates a row for Tom, because he provides private lodgin. Later on Tom
> decides to participate an fills the online registration formular. These data
> generate the second row for Tom. Both rows hold significant
> information

I thought - "oh dear, this is going to be a problem".

Based on this and your previous post, I suspect you're now beginning to
run into maintenance problems with your application database and are
trying to find ways to deal with the issues you're encountering. The
good news is it can be fixed. The bad news is it will take a fair amount
of work. The really, really bad news is that while you may be able to
work around some of the issues, things are just going to snowball and
get worse, requiring increasing amounts of maintenance and increasing
amounts of effort to fix.

Although you have said there is an existing structure in production and
changing it will not be easy, the bottom line is that it really is your
only sustainable course of action. Anything else you do will at best
delay the inevitable and at worst could result in even greater problems
down the track.

Re: Compare with default value?

2021-03-13 Thread Tim Cross


Ulrich Goebel  writes:

> Hi,
>
> o.k. I have to give some more information...
>
> Am 13.03.21 um 19:39 schrieb David G. Johnston:
>> On Saturday, March 13, 2021, Ulrich Goebel > > wrote:
>>
>> I would like to get the rows, where a column has the default value,
>> similar to:
>> select id fromt tbl where col = default
>>
>> If the default is a simple constant then why go through the trouble instead 
>> of
>> just writing col = ‘constant’ ?
>

I think you may need to re-think your design or at least come at it from
a different perspective. As shown by another post in the thread, at some
level, this is 'sort of' possible, but it will be ugly and fragile.

Possibly one of the issues is that, on the one hand, you want the
solution to be as generic as possible, but on the other, you require it
to have specific knowledge about the definition of your table, i.e.
which column(s) have a default value. Furthermore, your 'completion'
table will need to be defined based on this information, i.e.
potentially multiple columns with multiple different data types etc.

While the additional information you provided does give some increased
understanding of what you're trying to do, it is still describing your
desired solution implementation. It might be better if you provide more
high-level details of what you're attempting to do so that we can better
understand how you arrived at the proposed solution and whether there
may be better alternatives available.




Re: Script checking to see what database it's connected to

2021-02-21 Thread Tim Cross


Rob Sargent  writes:

>>>
>>> Take it up a notch?  Write a script which takes the dbname and the
>>> script name:
>>>
>>> /pcode/
>>>
>>> #!/bin/bash -e
>>> if [[ $# -ne 2 ]]; then echo "missing arg(s)"; exit 2; fi
>>> dbn=$1; shift;
>>> sql=$1; shift;
>>> psql --dbname $dbn --file $sql
>>>
>>> /pcode/
>>
>> I thought of that, yet so earnestly want avoid Yet Another Tiny Script.
>>
> Isn't it a toss-up with putting the check in every sql script?
> Or make it /really/ fancy: use proper arg parsing; check for existence
> of the sql script;  add a usage function; split stdout/stderr...  No end
> of fun.

That would be my approach: a general-purpose 'launcher' script that does
argument checking, logging and reporting. Stick it in your bin directory
and then call that instead of psql directly. Now all your SQL scripts
are just DDL/DML statements. The nice thing is you can write it in
whatever scripting language you're most comfortable with - bash, perl,
ruby, python, whatever - and it is available for whatever project you're
working on.

--
Tim Cross




Re: Syntax checking DO blocks and ALTER TABLE statements?

2021-02-16 Thread Tim Cross


Ron  writes:

> On 2/16/21 5:44 PM, Tim Cross wrote:
>> Given the number, I think I would do the same. A good example of why
>> being 'lazy' can be a virtue. Faster and easier to write a procedure to
>> generate dynamic SQL than write out all those alter statements manually
>> or even write it using a scripting language and ODBC if there is
>> sufficient variation in the statements to make writing it in plsql
>> 'messy'.
>
> In my case, the statements are generated by Ora2Pg, and the DO blocks are
> generated by a bash script I wrote.  Input data can be messy, so want to
> verify things before running.
>
> Sure, vim is great at highlighting some problems, but certainly not all.
>
> What I'm really looking for the PostgreSQL version of SQL Server's Ctrl-F5:
> it runs just the parser and then stops,

Sounds like exactly what LSP aims to provide. Don't know if the LSP SQL
servers available are mature enough yet, but that is definitely the
objective. Real benefit is that it is editor agnostic. Once your editor
has LSP support, all you need to do is configure the server details and
you get parsing, completion, re-factoring, definition lookup etc.

--
Tim Cross




Re: Syntax checking DO blocks and ALTER TABLE statements?

2021-02-16 Thread Tim Cross


David G. Johnston  writes:

> On Tue, Feb 16, 2021 at 4:28 PM Tim Cross  wrote:
>
>>
>> David G. Johnston  writes:
>>
>> > On Tue, Feb 16, 2021 at 3:43 PM Ron  wrote:
>> >
>> >>
>> >> How does one go about syntax checking this?
>> >>
>> >> (There are 222 ALTER TABLE ADD FOREIGN KEY statements that I'm wrapping
>> in
>> >> similar DO blocks, and want to make sure the statements are clean.)
>> >>
>> >>
>> > Begin a transaction, execute the DO, capture an error if there is one,
>> > rollback the transaction.
>> >
>>
>> As David points out, wrapping the whole thing in a transaction will at
>> least guarantee it all succeeds or it is all rollled back. This can be
>> frustrating if the statements are slow and there are a lot of them as it
>> can result in a very tedious do-run-fix cycle.
>>
>>
> I do presume that someone wanting to test their code in this manner would
> be doing so in a test environment and an empty database.  Which makes the
> execution time very small.
>

True. However, it never ceases to amaze me how many places don't have
such environments. Far too often, my first task when commencing a new
engagement is to sort out environments and procedures to manage change.

> I personally would also solve the "lot of them" problem by using dynamic
> SQL, so one pretty much only has to test the code generator instead of all
> the actual executions - which can simply be confirmed fairly quickly once
> on a test database without the need for transactions.
>

Given the number, I think I would do the same. A good example of why
being 'lazy' can be a virtue. It is faster and easier to write a
procedure to generate dynamic SQL than to write out all those ALTER
statements manually, or even to generate them with a scripting language
and ODBC if there is sufficient variation in the statements to make
writing it in PL/pgSQL 'messy'.
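
As a rough illustration of the idea (the table and column names below
are invented, not from the actual migration):

DO $$
DECLARE
    fk record;
BEGIN
    FOR fk IN
        SELECT * FROM (VALUES
            ('order_item', 'order_id',    'orders',    'id'),
            ('payment',    'customer_id', 'customers', 'id')
        ) AS t(child_table, child_column, parent_table, parent_column)
    LOOP
        EXECUTE format('ALTER TABLE %I ADD FOREIGN KEY (%I) REFERENCES %I (%I)',
                       fk.child_table, fk.child_column,
                       fk.parent_table, fk.parent_column);
    END LOOP;
END
$$;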

--
Tim Cross




Re: Syntax checking DO blocks and ALTER TABLE statements?

2021-02-16 Thread Tim Cross


David G. Johnston  writes:

> On Tue, Feb 16, 2021 at 3:43 PM Ron  wrote:
>
>>
>> How does one go about syntax checking this?
>>
>> (There are 222 ALTER TABLE ADD FOREIGN KEY statements that I'm wrapping in
>> similar DO blocks, and want to make sure the statements are clean.)
>>
>>
> Begin a transaction, execute the DO, capture an error if there is one,
> rollback the transaction.
>

As David points out, wrapping the whole thing in a transaction will at
least guarantee it all succeeds or it is all rolled back. This can be
frustrating if the statements are slow and there are a lot of them, as
it can result in a very tedious do-run-fix cycle.
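
A minimal version of that check, with a made-up example statement in
place of the real ones:

BEGIN;

DO $$
BEGIN
    -- the generated ALTER statements go here; any error aborts the transaction
    ALTER TABLE child ADD FOREIGN KEY (parent_id) REFERENCES parent (id);
END
$$;

ROLLBACK;   -- nothing is kept either way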

Something which can help is using an editor with good syntax
highlighting and parsing support. One interesting area I've not yet looked at is the
development of LSP (Language Server Protocol) servers for SQL. I've used
LSP for other languages with great success. The challenge with databases
is that there is enough variation between different vendor
implementations to make accurate parsing and validation tedious to
implement, so most solutions only focus on ANSI compliance. Still, that
can be very useful.

See https://github.com/lighttiger2505/sqls for one example of an LSP
server for SQL and https://microsoft.github.io/language-server-protocol/
for more background on LSP and what it can provide. Many editors,
including VSCode, VI, Emacs, TextMate etc now have some support for LSP.


--
Tim Cross




Re: checkpointer and other server processes crashing

2021-02-15 Thread Tim Cross


Joe Abbate  writes:

> Hello,
>
> We've been experiencing PG server process crashes about every other week
> on a mostly read only website (except for a single insert/update on page
> access).  Typical log entries look like
>
> LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
> LOG:  terminating any other active server processes
>
> Other than the checkpointer, the server process that was terminated was
> either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific
> SELECT.
>
> The database is always recovered within a second and everything else
> appears to resume normally.  We're not certain about what triggers this,
> but in several instances the web logs show an external bot issuing
> multiple HEAD requests on what is logically a single page.  The web
> server logs show "broken pipe" and EOF errors, and PG logs sometimes
> shows a number of "incomplete startup packet" messages before the
> termination message.
>
> This started roughly when the site was migrated to Go, whose web
> "processes" run as "goroutines", scheduled by Go's runtime (previously
> the site used Python and Gunicorn to serve the pages, which probably
> isolated the PG processes from a barrage of nearly simultaneous requests).
>
> As I understand it, the PG server processes doing a SELECT are spawned
> as children of the Go process, so presumably if a "goroutine" dies, the
> associated PG process would die too, but I'm not sure I grasp why that
> would cause a recovery/restart.  I also don't understand where the
> checkpointer process fits in the picture (and what would cause it to die).
>

A signal 9 typically means something is explicitly killing processes. I
would check your system logs in case something such as the kernel's OOM
killer is terminating processes after running out of some resource (like
memory). If it is a fairly recent Debian system, journalctl might be
useful for checking.

--
Tim Cross




Re: Insert into on conflict, data size upto 3 billion records

2021-02-15 Thread Tim Cross


Karthik K  writes:

> exactly, for now, what I did was, as the table is already partitioned, I
> created 50 different connections and tried updating the target table by
> directly querying from the source partition tables. Are there any other
> techniques that I can use to speed this up? also when we use on conflict
> statement for both insert and update does Postgres uses batching internally
> (committing for every 1 records etc) or will it update all records at
> once, in that case, does it create a version for each record and do swap
> all at once? I'm wondering how atomicity is guaranteed, also if I have to
> do batching other than selecting from individual partitions does doing it
> batches of 1 records help?
>

I have had pretty good success with the following strategy. However, you
will need to profile/test each assumption as things vary greatly
depending on data and table structure. A bit of trial and error is
usually required.

1. Use the \COPY command to upload the batch data into a temporary or
unlogged staging table.

2. Run a stored procedure which first does updates for existing rows,
then one which does inserts for non-existing rows into your final table
(see the sketch after this list).

3. If your table is partitioned, pre-process your batches into separate
batches that are divided by the partition key, so instead of one big
batch, multiple smaller batches. If this is not possible, break your
upload up into multiple batches rather than one huge batch.

4. Optimise the update/insert statement to suit your data and table
structure, dropping any unnecessary indexes and re-building them once
finished (the whole upload). Note that this will need profiling as
depending on the index and index structure, dropping and re-creating can
be overall slower than leaving the index in place.

5. Determine best times to run analyze to update table stats. Probably
want to do it after each update and insert run, but sometimes, may be
overall faster to just do it after each 'job' (update + insert).

6. Don't forget to check the logs and watch for WAL writes being too
frequent etc. Often settings are tuned for 'normal' operation (outside
bulk uploads) and perform very poorly for bulk uploads, so make sure you
strike the right balance.
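
A rough sketch of steps 1 and 2 (table and column names are illustrative
only; adjust the key columns to match your actual data):

-- 1. staging table with logging turned off, loaded with \COPY from psql
CREATE UNLOGGED TABLE staging_batch (id bigint PRIMARY KEY, value text);
\COPY staging_batch FROM '/data/batch.csv' WITH (FORMAT csv, HEADER)

-- 2a. update rows which already exist in the target table
UPDATE target t
SET    value = s.value
FROM   staging_batch s
WHERE  t.id = s.id;

-- 2b. insert rows which do not exist yet
INSERT INTO target (id, value)
SELECT s.id, s.value
FROM   staging_batch s
WHERE  NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id);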




Re: Encryption of Data Specific to a Tenant in PostgreSQL database | General Idea

2021-02-10 Thread Tim Cross


Jagmohan Kaintura  writes:

> HI All,
>
> For POstgreSQL database to store data for multiple tenants, the approach
> decided was to have
> Shared Database (Holding data for all tenants)
>   => Data would be segregated on basis of some additional column
> (tennatid,different tenants having different tenantId)
>=> Data would be accessed through Views on the basis of tenantId
> value.
>
> This is the basic process of most of the customers who are trying to
> implement multiple tenants in PostgreSQL, rather than choosing
> separate databases for each tenant.
>
> Now we need to encrypt the data related to a tenantId, so that now one
> knows this data belongs to which tenant even from Operations group.
> Is there a method in POstgreSQL for encrypting data with different keys
> with respect to different values in a single column.  Moreover pg_crypto
> will impose a single key on the column.
>
> Please share your thoughts in which direction i can start analysing this
> area for encryption of data specific to a tenant.
>

The decision to have all tenants in a single database seems rather
unusual to me. Isolating one tenant from adversely impacting another
would seem complicated and I'm not sure how you would implement a clear
security model. Your model has effectively bypassed all the provided PG
facilities for isolation of data. Disaster recovery and business
continuity planning under this model must be a nightmare!

I doubt you can adopt a solution which is solely within the database.
How would the database know which key to use for which rows of data? How
would you select the data for your tenant views if all that data is
encrypted with different keys? How would you manage these keys in a
secure manner?

With the model you have adopted, I would be looking at performing
encryption/decryption at the client level. However, depending on your
data types, this could be challenging. This is really a requirement
which should have been factored into the initial architecture design.
Anything you try to bolt on now is likely to be complex and have a
significant performance impact - and that is assuming you can
re-interpret the requirement to make the objective feasible.

--
Tim Cross




Re: ransomware

2021-02-02 Thread Tim Cross


Marc Millas  writes:

> Hi,
>
> I know its quite general. It is as I dont know what approaches may exist.
>
> Requirement is extremely simple: Is there anyway, from a running postgres
> standpoint, to be aware that a ransomware is currently crypting your data ?
>
> answer can be as simple as: when postgres do crash.
>
> something else ?
>
> Marc MILLAS
> Senior Architect
> +33607850334
> www.mokadb.com
>
>
>

Ransomware tends to work at the disk level rather than the application
level. There is too much effort required to target ransomware at the
application level, given the amount of variation in applications and
versions, for it to be profitable.

This means any form of detection you may try to implement really needs
to be at the disk level, not the application level. While it could be
possible to add some sort of monitoring for encryption/modification to
underlying data files, by the time this occurs, it will likely be too
late (and unless your monitoring is running on a different system, the
monitoring binaries/scripts are likely also encrypted and won't run
either).

The best protection from ransomware is a reliable, regular and TESTED
backup and restoration solution which runs frequently enough that any
lost data is acceptable from a business continuity position, and which
keeps multiple backup versions in case the ransomware infection occurs
some time before it is actually triggered, i.e. in case your most recent
backups are already infected. Backups should be stored in multiple
locations. For large data sets, this can often mean having the ability
to take fast filesystem snapshots, as more traditional 'copy' approaches
are often too slow to perform backups frequently enough to meet business
continuity requirements.

By far the most common failure in backup solutions is failing to test
the restoration component. I've seen way too many places where they
thought they had adequate backups, only to find when they needed to
perform a restoration that key data was missing. This can greatly
increase the time it takes to perform a restoration and in extreme cases
can mean restoration is not possible. Regular testing of restoration
processes is critical to any reliable backup solution.

As it is also a good idea to have some sort of testing/staging
environment for testing code/configuration changes, new versions etc, it
can make sense to use your backups as part of your staging/testing
environment 'refresh' process. A regular refresh of your staging/testing
environment from backups then provides you with assurances your backups
are working and that your testing etc is being performed on systems with
data most similar to your production systems.

Tim




Re: Do we need a way to moderate mailing lists?

2021-01-17 Thread Tim Cross


Paul Förster  writes:

> Hi raf,
>
>> On 17. Jan, 2021, at 02:59, raf  wrote:
>>
>> I once wrote a program to do that very thing:
>>
>>  http://raf.org/textmail/
>>  https://github.com/raforg/textmail/
>
> thanks very much for the nice offer but I mostly read my Mails on a Mac, 
> sometimes Windows, but never Linux. I have no mail access on Linux. At home I 
> use Macs and at work I (have to :-() use Windows as desktops. So textmail is 
> not an option for me.
>

There is nothing stopping you from using a text mail program, like mutt,
on macOS.

--
Tim Cross




Re: Do we need a way to moderate mailing lists?

2021-01-17 Thread Tim Cross


Paul Förster  writes:

> Hi Adrian,
>
> I'm on a Mac w/ Big Sur (macOS 11.1) and use Apple Mail. I've been on Macs 
> since 2003. Apple Mail is simple to use and I love it for exactly that. But 
> Apple Mail has everything I expect a mail client to have, it does not allow a 
> couple of things which other mail clients might have. Some people hate it for 
> exactly that. I set Mail to always compose in plain text but there is no way 
> of manipulating incoming mails other than that automatic displaying remote 
> content (HTML links, etc.) can (and should) be turned off.
>
> So I sometimes resort to either hit cmd-opt-u to see the mail text raw 
> source, or better yet, just hit reply and then drop the reply after reading. 
> As I set composing to plain text, it will convert any quoted parts. 
> Sometimes, I just copy/paste the whole mail for reading over to TextMate, 
> which is also sub-optimal but obviously also gives me non-proportional font 
> reading.
>
> Still, this is somewhat cumbersome as I have to do that for each mail 
> individually. Thank god, this doesn't happen too often. Yet, it's still 
> annoying enough.
>

I've used a number of GUI mail clients, including Apple Mail. However, I
find still the fastest, most feature rich and powerful client is the
text based client mutt. My other favourite is mu4e (Emacs client). While
Apple Mail has reasonable keyboard shortcuts, mutt and mu4e can be fully
keyboard driven and both have great (but different) abilities for
customisation and dealing with large amounts of mail. The thing I hate
most (and there is a lot to hate) with Outlook is the dependence on
using the mouse for many operations. Being able to preview, sort, move
and delete messages and threads using just the keyboard makes mail much
easier to deal with. Having a client which can do sophisticated sorting,
flagging and searching of messages/threads is essential, and being able
to easily automate where possible really helps.

Highly recommend a mutt and IMAP combination. You're not locked into any
particular mail folder format, can still access things via mobile
devices and can process messages fast and efficiently.

>> That is a learning curve thing. Many people don't know that copy and paste 
>> exists for terminals/GUI's/etc. Most people, once they are pointed in the 
>> right direction, will change that habit. That is why I would not advocate 
>> dropping non plain text attachments. Take this as a teaching moment and 
>> explain the reason why text is a benefit.
>
> I guess, they only change their behavior because copying/pasting some text is 
> easier to do than creating a windowshot with aligning the frame manually, 
> etc. But whatever the reason, thank god, some people are willing to learn 
> that if being told.
>

It is easy to forget the different experience levels and sophistication
of users. I once had to help resolve a problem a developer was having
with a database. I asked him to send me the exact error message. He
moaned and said that was a real hassle. I couldn't understand why he
found that so difficult to do. I decided to get him to show me his
workflow.

When the error occurred, he would take a screen shot of his window, send
it to the printer, wait for the printer to send back a PDF and then send
the issue with the PDF attached.

He was amazed when I showed him all he needed to do was highlight the
error message, copy it and paste it into the message. This guy was one
of the senior developers on the team.

I switched employers a few weeks later.

--
Tim Cross




Re: Do we need a way to moderate mailing lists?

2021-01-16 Thread Tim Cross


Michael Nolan  writes:

> There's so much garbage in a Google search any more that they're becoming
> nearly useless.  Between 'sponsored' hits and ones that have little or no
> relevance but throw in words to get included, I find as often as not that
> IF Google finds what I'm looking for, it'll be several pages in.
>

There is certainly a bit of 'art' or 'black magic' involved in doing a
Google search to find relevant information, and the amount of noise in
the results has certainly gotten worse. I find putting the key terms early in
your search string can help. However, when dealing with an unfamiliar
topic, knowing what those key terms are can be challenging. This is one
reason I rarely tell people to 'just google for the answer' or assume
they haven't tried when the answer seems quite obvious and easily found
for me.

The change I've noticed over the last decade or so is the amount of
completely wrong or misleading information that is easily found. I
rarely use stack overflow sites these days because too often, the
accepted or most popular answer is wrong or gets a result, but in a poor
manner that is likely to introduce other issues.

The one thing I wish people did was provide clear and concise meta data
with the information they post. Often, I find it difficult to know, for
example, how old the information is or which version of the software it
applies to.

When it comes to PG, I think we are very lucky. In general, I find the
official documentation to be of the highest quality. Sometimes, I can be
a little dense and a few more examples would be useful, but I understand
how hard getting the balance between enough examples and concise
information can be.

It is often in this forum where I find some of the most useful
information and ideas. I really appreciate those contributors who not
only provide an answer to a question, but also include URLs to other
sources which frequently contain more background or details. Pointers to
such valuable resources from those more knowledgeable can save hours of
googling and wading through ill informed and misguided advice.


--
Tim Cross




Re: Do we need a way to moderate mailing lists?

2021-01-15 Thread Tim Cross


Hemil Ruparel  writes:

> Exactly my point. We need to raise the bar of the behavior we tolerate.
> This should not be tolerated. We need to set an example. The person in
> question clearly understood english and I have never seen a person who
> could use mailing lists but not google. So that's out of the question.
>
> We are not free consultants. And you are not entitled to shit. You are
> probably being paid to work on that project. We are not. Your problem. Fix
> it yourself. Or at least have to courtesy to google it.
>

While I can understand your frustration, I disagree with your position.

It is too subjective and difficult to apply/adopt such a strong position
and could too easily backfire, resulting in a perception of an elitist,
unwelcoming and unfriendly community.

Banning should be reserved for the most serious and abusive cases.
Banning because someone appears to be acting entitled or lazy is hard to
assess in an unbiased manner and likely has too much cultural variation
to be applied consistently. Someone you feel is being entitled or lazy
might be someone I feel is frustrated, may lack good communication
and/or social skills or might simply be immature and in need of some
guidance and education. My response may also differ depending on my own
state of mind and mood at the time when I read the message.

I've been on the postgres lists for some years now and to be honest,
have not noticed this type of issue very often. There are occasionally
rude and lazy individuals who may appear to be acting entitled, but they
soon go away. In some respects, the list is self-moderating because
people who do act poorly soon get ignored and their messages die out
with no responses.

The great benefit of lists like these is that you can just ignore anyone
you think is rude, entitled or failing to put in the effort you believe
is warranted before their question/issue deserves attention. Many mail clients
will even allow you to 'block' specific senders. I have done this once
with someone from a different list. I don't know if they are still
behaving badly as now I never see their messages.

My advice would be to just delete and move on, a luxury you don't have
when you are employed and paid to deal with such messages, which is one
reason I don't like, or have the temperament to fulfil, the difficult
service/support desk roles which are too often maligned and fail to get
the recognition they deserve.

Tim




Re: How to keep format of views source code as entered?

2021-01-09 Thread Tim Cross
or.

Having the code in the database can be useful. I've used this in oracle
to provide enhanced debugging support when developing stored procedures
and packages and enhanced reporting in unit testing. However, you really
don't want this to be the definitive master for your source code. Your
source code benefits from being in a location which makes backup,
restoration and migration easy/fast, where changes can be tracked,
analysed and rolled back, where re-factoring can use advanced tools and
can work across
projects, not just on a single script, where code sharing is easy and
where people can use their preferred tool rather than being forced to
use something which understands the database.

--
Tim Cross




Re: How to keep format of views source code as entered?

2021-01-08 Thread Tim Cross


Adrian Klaver  writes:

> On 1/8/21 12:38 AM, Markhof, Ingolf wrote:
>> Thanks for your comments and thoughts.
>>
>> I am really surprised that PostgreSQL is unable to keep the source text
>> of a view. Honestly, for me the looks like an implementation gap.
>> Consider software development. You are writing code in C++ maybe on a
>> UNIX host. And whenever you feed you source code into the compiler, it
>> will delete it, keeping the resulting executable, only. And you could
>> not even store your source code on the UNIX system. Instead, you'd be
>> forced to do so in a separate system, like GitHub. Stupid, isn't it?
>> Right. There are good reasons to store the source code on GitHub or
>> alike anyhow. Especially when working on larger project and when
>> collaborating with many people. But in case of rather small project with
>> a few people only, this might be an overkill.
>
> The projects I work on are my own personal ones and I find an
> independent version control solution the way to go for the following
> reasons:
>
> 1) It is easy.
>   a) mkdir project_src
>   b) cd project_src
>   c) git init
> Now you are set.
>
> 2) Even my simple projects generally have multiple layers.
>   a) Database
>   b) Middleware
>   c) UI
> And also multiple languages. It makes sense to me to keep all
> that information in one repo then having each layer operate independently.
>
> 3) It allows me to work on test and production code without stepping on
> each other.
>
> 4) It serves as an aid to memory. Answers the question; What was I
> thinking when I did that? More important it helps anyone else that might
> have to deal with the code.
>
> FYI, the program I use to manage database changes is
> Sqitch(https://sqitch.org/).
>
>>

This is essentially my workflow as well. I have even used sqitch too.

While this has worked well for my projects, attempts to introduce the
discipline necessary to use such a workflow in a team has largely
failed. This seems to be due to 2 main reasons -

1. Lack of SCCM support built into common tools. There are very few
tools which have version control support built in (I believe the
JetBrains products do). In particular, pgAdmin would benefit here (maybe
pgAdmin 4 does, I've not tried it in a few years).

2. Poor SCCM and DBMS Understanding. Despite it being 2021 and both
version control and databases being two very common technologies you
need to interact with as a developer, I'm still surprised at how poorly
many developers understand these tools. I still frequently come across
really bad workflows and practices with version control, and code which
uses the database as little more than a bit bucket and re-implements
searching and sorting at the client level (and then often moans about
poor performance issues).

My editor has good support for psql and psql has always been my goto
tool for PG. As my editor also has good git support, my workflow works
well. However, most people I've worked with prefer things like pgadmin.
Tom Lane responded in this thread to point out some of the complexities
which make it difficult to maintain current code source within the
database itself. This is definitely something which should be kept in
version control. The problem is, if your tool does not support the
version control system, it is too easy to forget/bypass that stage. When
you use something like pgadmin, it is far too easy to modify the source
definitions in the database without ever updating the sources on disk in
the version control working directory and the changes get lost.

The other big challenge is dependency management. Keeping track of what
is affected by a change to a table definition can be a challenge within
a complex system. I've yet to find a good solution to that issue. It is
probably something which needs to be built into a tool. In the past,
I've used a modified sqitch approach that also maintains a small 'dbadm'
schema containing metadata to track dependencies. Although this worked
OK, especially if you understood how all the bits fit together, it still
had many corner cases and to some extent highlighted the complexities involved.

--
Tim Cross




Re: How bad is using queries with thousands of values for operators IN or ANY?

2020-08-31 Thread Tim Cross


Thorsten Schöning  writes:

> Hi all,
>
> I have lots of queries in which I need to restrict access to rows
> using some decimal row-ID and am mostly doing so with using the
> operator IN in WHERE-clauses. Additionally I'm mostly embedding the
> IDs as ","-seperated list into the query directly, e.g. because I
> already hit a limitation of ~32k parameters of the JDBC-driver[1] for
> Postgres.
>
> I really thought that in most cases simply sending a large amount of
> IDs embedded into the query is better than looping, because it safes
> roundtrips to access the DB, the planner of the DB has all pieces of
> information it needs to decide best strategies etc. OTOH, with recent
> tests and an increased number of IDs of about factor 100, I have
> additional load in Tomcat before actually sending the query to the DB
> already and in the DB itself as well of course. I've attached an
> example query and plan.
>
>> ->  Hash  (cost=242592.66..242592.66 rows=6825 width=39) (actual 
>> time=91.117..91.117 rows=40044 loops=3)
>>   Buckets: 65536 (originally 8192)  Batches: 1 (originally 1)  Memory 
>> Usage: 3016kB
>>   ->  Hash Join  (cost=137.57..242592.66 rows=6825 width=39) (actual 
>> time=10.194..82.412 rows=40044 loops=3)
>> Hash Cond: (meter.meter_bcd = meter_bcd.id)
>> ->  Index Scan using pk_meter on meter  (cost=0.42..242237.10 
>> rows=40044 width=25) (actual time=9.350..71.276 rows=40044 loops=3)
>>   Index Cond: (id = ANY ('{[...]}'::integer[]))
>> ->  Hash  (cost=135.73..135.73 rows=113 width=22) (actual 
>> time=0.830..0.830 rows=113 loops=3)
>
> Do you know of any obvious limitations of the JDBC-driver of handling
> such large queries? In the end, the query is mostly large text with
> only very few bind parameters.
>
> Do you know of any obvious problem in Postgres itself with that query,
> when parsing it or alike? Do things simply take how long they take and
> are mostly comparable to looping or is there some additional overhead
> the larger the query itself gets? From my naive expectation, comparing
> IDs shouldn't care if things get looped or transmitted at once.
>
> I'm just trying to collect some input for where to look at to optimize
> things in the future. Thanks!
>
> [1]: https://github.com/pgjdbc/pgjdbc/issues/90
>
> Mit freundlichen Grüßen,
>
> Thorsten Schöning

It would help to see the query as well as the plan.

Where are these hundreds of IDs coming from? I sometimes find this is a
sign that you could re-structure your query as a join between two
tables, where one table contains the IDs of interest, rather than trying
to embed them into the query as part of a WHERE clause.
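
For example (a sketch, with a made-up staging table), loading the IDs
once and joining is often kinder to the planner and the JDBC driver than
a huge embedded IN list:

CREATE TEMP TABLE wanted_id (id integer PRIMARY KEY);
-- populate via COPY or batched inserts rather than one giant statement
INSERT INTO wanted_id VALUES (1), (2), (3);
ANALYZE wanted_id;

SELECT m.*
FROM   meter m
JOIN   wanted_id w ON w.id = m.id;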

-- 
Tim Cross




Re: Issue with timestamp without time zone datatype with default value as now()/localtimestamp

2020-07-10 Thread Tim Cross


Deepika S Gowda  writes:

> Hi,
>
> On postgres 11.7 Master/Slave node, there is column named "createddate"
> with datatype "timestamp without time zone" with default value as "now()";
>
> Column Name | Date Type | Default value
> createddate |timestamp without time zone|Now()
>
>
> Issue: From the java application , data is getting loaded into this table
> where we expect column value should be today's date with timestamp(
> "2020-07-10 10:56:43.21"). But, out of 3K records, 100 records are loaded
> as  "2019-07-10 10:56:43.21" (change in Year).
>
> What could be the issue? we tried changing the default value to
> "localtimestamp".
>

My bet would be you have some SQL statements which include a value for
'createddate', so the default is not being used.
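
That is easy to demonstrate with a throwaway table - the default is only
applied when the column is omitted from the INSERT:

CREATE TABLE t (
    id          serial PRIMARY KEY,
    createddate timestamp without time zone DEFAULT now()
);

INSERT INTO t DEFAULT VALUES;                                  -- gets today's timestamp
INSERT INTO t (createddate) VALUES ('2019-07-10 10:56:43.21'); -- explicit value wins over the default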


-- 
Tim Cross




Re: Efficiently advancing a sequence without risking it going backwards.

2020-07-09 Thread Tim Cross


Christopher Browne  writes:

> On Thu, 9 Jul 2020 at 12:59, Jeremy Schneider 
> wrote:
>
>>
>> > On Jul 6, 2020, at 19:06, Paul McGarry  wrote:
>> >
>> > I don't think I can use setval(), because it risks making sequences go
>> backwards, eg:
>> >
>> > 1) Check values
>> > DB1sequence: 1234
>> > DB2sequence: 1233 (1 behind)
>> > 2) setval('DB2sequence',1234);
>> >
>> > but if between (1) and (2) there are 2 nextval(DB2sequence) calls on
>> another process,  (2) would take the sequence back from 1235 to 1234 and I
>> would end up trying to create a duplicate key ID from the sequence.
>>
>> An ability to “lock” the sequence momentarily would give you the tool you
>> need, but I don’t think it’s there.
>>
>> Total hack, but if your application or users can retry when the rare error
>> is encountered then one idea is to rename the sequence momentarily while
>> you do the setval() then rename it back. Do an initial check without
>> renaming, then re-check after renaming and before the setval() call.
>>
>> If you put retry logic into your application then make sure to include
>> back-off logic so you don’t get an outage induced by thundering herd.
>>
>
> This is increasingly looking like a set of attempts to intentionally abuse
> what sequences were designed for.
>
> The use-case where you need a lock on the value so that there can't
> possibly be a hole in the sequence points at the notion of having some
> other kind of a function that takes out a lock on a table, and serially
> gives out "MAX+1" as the next value.
>
> That isn't a very difficult function to write; the problem with it is that
> that sort of function will forcibly serialize all inserts through the
> function+table lock that is giving out "MAX+1" values.  That's going to be
> WAY slower than using a sequence object, and about 98% of the time, people
> will prefer the sequence object, particularly because it's about 98% faster.
>
> I'm not quite sure if anyone has put out there a standard-ish idiom for
> this; that seems like a not TOO difficult "exercise for the user."
>
> There will definitely be more failure cases, and *wildly* more fighting, in
> a concurrent environment, over tuple locks.
>
> - An obvious failure is that if one connection asks for the new MAX+1, gets
> it, and then the transaction fails, for some later, out-of-relevant-scope,
> reason, you'll still potentially get some "holes" in the series of values.
>
> - If there are 10 connections trying to get MAX+1 concurrently, only one
> can get it at a time, and that connection can't relinquish the lock until
> its transaction has completed, and the 9 must wait, regardless of how much
> work the "winner" still has to do.
>
> These are amongst the reasons why people conclude they *don't* want that
> kind of functionality.
>
> It makes me think that the problem needs to be taken back to that initial
> point of "I think I need some somewhat coordinated sequences", and poke at
> what the *real* requirement is there, and why someone thinks that the
> values should be "somewhat coordinated."  Something seems off there.

I agree and was going to write something similar. All the 'solutions'
are problematic in one way or another and seem to stem from a
misconception about the role of sequences, or from some requirement which
needs to be re-examined.
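
For completeness, a minimal sketch of the sort of "MAX+1 under a table lock"
function Christopher describes (table and column names are made up, and this
is illustrative only, not a recommendation):

  CREATE OR REPLACE FUNCTION next_id_serialised()
  RETURNS bigint AS $$
  DECLARE
      v bigint;
  BEGIN
      -- concurrent callers queue up behind this lock until we commit
      LOCK TABLE my_table IN EXCLUSIVE MODE;
      SELECT COALESCE(MAX(id), 0) + 1 INTO v FROM my_table;
      RETURN v;
  END;
  $$ LANGUAGE plpgsql;

The lock is only released at commit, which is exactly why this approach is so
much slower than a sequence and why the waiting connections in Christopher's
example sit idle until the winner finishes.
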
-- 
Tim Cross




Re: Persistent Connections

2020-06-24 Thread Tim Cross


Peter J. Holzer  writes:

> On 2020-06-24 13:55:00 -0400, Bee.Lists wrote:
>> On Jun 24, 2020, at 6:47 AM, Peter J. Holzer  wrote:
> Does "I have 37 queries" mean you have seen 37 queries of this type in
> some time window (e.g. the last day or hour) or does it mean you are
> currently seeing 37 connections where the last query was of this type?
>
> If it's the latter, you very obviously have at least 37 (more likely
> 37 + 5 = 42) connections. So you web app is configured to open dozens of
> connections concurrently. You might want to look into that.
>

I just had another thought. Based on a VERY brief scan of the Sequel
API, I suspect it uses a connection pool by default. So it is quite
likely that the expectations about when the connections are closed are
incorrect. It could easily be that the web app creates a connection pool
as soon as it is started and keeps that pool open until either the web
server is closed down or a pool timeout kicks in (some connection pools
use a max lifetime setting for connections and will close a connection
after that period, replacing it with a new connection).

It is also quite likely that the Sequel GEM creates a connection pool
with a default number of connections if not explicitly defined by the
developer. This default could be close to or even exceed the number set
for max connections within PG (especially as the OP has indicated it is
very small).
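
One way to see what the pool is actually doing is to count connections by
application and state on the server (a minimal sketch using the standard
pg_stat_activity view):

  SELECT application_name, state, count(*)
  FROM pg_stat_activity
  GROUP BY application_name, state
  ORDER BY count(*) DESC;

If the web app's pool is holding connections open, they will show up here as
'idle' rows that persist between requests.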

I have to say, I do hate ORMs. They always reduce the flexibility and
power offered by SQL, tend to result in code where more processing is
done in the client which would have been faster and more efficiently
done by the SQL engine, and hide details which make troubleshooting even
harder. However, the worst issue is that they also result in developers
not understanding the power of the underlying RDBMS and encourage poor
DB schema design. Those who support such technologies typically point to
the benefits of database neutrality such systems can provide. In over 30
years of DB work, I have yet to see such neutrality actually work. It is
a pipe dream. 
-- 
Tim Cross




Re: Persistent Connections

2020-06-24 Thread Tim Cross


Bee.Lists  writes:

>> On Jun 23, 2020, at 8:09 PM, Tim Cross  wrote:
>> 
>> Sounds like your web app may not be closing connections once it has
>> finished with them? The fact your seeing idle connections would seem to
>> support this. I would be verifying there isn't a code path in your
>> client application which is failing to close a connection correctly. Seeing
>> connections go up and down in the gem may not be telling you the full story 
>> - could
>> be that your client connection objects are being destroyed in your app,
>> but are not closing the connection correctly so PG is unaware the client
>> has finished. 
>
> Hi Tim.  I can’t speak for the gem.  I’m assuming its garbage collection is 
> working.  But yes, it does look that way.  I found someone else who was 
> having similar issues as myself:
>
> https://stackoverflow.com/questions/60843123/djangopostgres-fatal-sorry-too-many-clients-already
>
> I’m also seeing the connection count rise overnight from crontabs.  
>
> For some clarity, the gem is Sequel, which is on top of the PG gem (Ruby).  
> I’ve spoken to the Sequel author and he says everything is fine.  I have some 
> evidence it’s a connection issue and the gem is doing its job, as I’m seeing 
> it’s happening elsewhere with crontabs and other clients.  
>
Rather than a problem with the libraries, I would be looking for a
problem with the code which uses those libraries. If it was a problem
with either the Sequel or PG gems (or with Postgres for that matter), it
would be a lot more widespread and you would be seeing a lot more
reports.

I'm not familiar with Sequel and haven't used Ruby for nearly 20 years,
but have used plenty of other PG libraries. You mention garbage
collection and I'm sure that is working fine in Ruby. However, you
cannot rely on that to correctly cleanup your PG connections. Somewhere
in your code, there has to be code the developer writes which tells the
library you are finished with the connection. For example, the JS PG
package has the command 'disconnect'. Essentially, your code needs to
tell the remote PG server you have finished with the connection so that
it knows it can clean up things on its end. If your code is not issuing
explicit disconnect commands, what is happening is that the connections
on your client side are being cleaned up when the connection object goes
out of scope and the garbage collector kicks in. However, on the PG
side, the connections hang around until PG notices the client has gone
(e.g. via TCP keepalives), which takes much longer and could easily cause
you to hit the limit, especially as you have such a small limit. I think
you need to verify that in
all your client code, somewhere there is explicit code being called
which is telling PG you are disconnecting the connection. Don't assume
this is automagically happening as part of GC. 
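
If you want to see how long those connections have been sitting idle on the
server side, a minimal sketch using pg_stat_activity:

  SELECT pid, usename, application_name,
         now() - state_change AS idle_for
  FROM pg_stat_activity
  WHERE state = 'idle'
  ORDER BY idle_for DESC;

Connections which stay in that list with an ever-growing idle_for are almost
certainly ones nobody ever closed.
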
>
>> Typically, due to the overhead of making a connection, you don't want
>> your client app/web app to create a new connection for every query.
>> Instead, you would use some type of connection pool. Many development
>> languages support some form of pooling as part of their db connection
>> library (don't know about Ruby, but JS, Python, PHP, Java, Perl all do)
>> and there are external 3rd party solutions like pgbouncer which sit
>> between your client and the database and work a bit like a connection
>> broker which will manage a pool of connections.
>
> That’s why I’m thinking installing a connection pooler would solve all of 
> this.  pgbouncer is what I’m looking at now. 
>

I doubt this will solve your problem. It might hide the problem or it
might lengthen the time between failures, but it is very unlikely to
solve the problem. It may help identify the source of the problem.

Have you verified the PG gem doesn't support pooling? If it does, that
would be far easier to configure and use than installing
pgbouncer. Based on your description of the app and the small number of
connections you have PG configured for, adding pgbouncer is like putting
a fighter jet engine in a family SUV. 

>> From the description of what your doing, I would first look to see what
>> level of connection pooling your development language supports. This
>> will likely be easier to configure and use than setting up a whole
>> additional bit of infrastructure which is probably far more powerful
>> than you need.
>
> I will do that.  This is all new.  
>
>> I would also go through your code and make sure that every time you
>> create a database connection, there is some code which is closing that
>> connection once your finished with it. This is a frequent source of
>> problems, especially during development when your code might throw an
>> error and the code you have to disconnect does not get executed or you
>&

Re: UPDATE on 20 Million Records Transaction or not?

2020-06-23 Thread Tim Cross


Jason Ralph  writes:

> Hello List,
> PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 
> (R
> ed Hat 4.4.7-23), 64-bit
>
> I am planning an update on a table with 20Million records, I have been 
> researching the best practices.  I will remove all indexes and foreign keys 
> prior to the update, however I am not sure if I should use a transaction or 
> not.
> My thought process is that a transaction would be easier to recover if 
> something fails, however it would take more time to write to the WAL log in a 
> transaction.
>
> Would it make sense to make a back up of the table then execute update 
> without a transaction? How would you guys do it?
>

This is really the sort of thing you need to test in a development
environment. There are just too many unknowns to provide a definitive
answer. You need to run this with minimal 'adjustments' in a dev
scenario to get a baseline and then decide what, if anything, you need
to do.

Just because you have 20M records, it doesn't mean that you have to do
anything 'special'. A lot will depend on how many of that 20M need to be
updated, the size of the records and fields being updated, how the
updated records are spread through the table, what other processes are
updating or querying the table during the operation, which indexes are
used and which are unnecessary or just update overhead etc, etc.

Trying to optimise this operation without knowing where the bottlenecks
are is extremely difficult. Unless you know that every one of the 20M
records will be updated, I wouldn't rush into removing all indexes. I
would definitely look at breaking up the update into smaller
transactions.
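
If you do break it up, a minimal sketch of the usual batching idiom (table,
column and key names are made up) is to update in keyed chunks and repeat
until no rows are affected:

  UPDATE big_table
  SET    some_col = 'new value'
  WHERE  id IN (
      SELECT id
      FROM   big_table
      WHERE  some_col IS DISTINCT FROM 'new value'
      ORDER  BY id
      LIMIT  10000
  );

Run that in a loop from psql or a small script, committing between batches,
so each transaction stays short and vacuum can keep up.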

As to the question about whether to use a transaction or not, it really
depends on your data and how easily you can recover from a partial
transaction. Assume the update is going to be interrupted before
completion. If you can recover from that situation easily, then not
using a full transaction with commit/rollback may be acceptable.
otherwise, I would avoid it.

At any rate, baseline first as you may be surprised. I was recently
working on an application using PG 9.6 which had numerous web clients
and a backend process which ran multiple times a day and which updated
millions of rows each run in a table with over 100M records. The full
update would take around 20 - 30 min, but consisted of multiple
transactions. The data was part of a weather information system used for
predictive modelling. The data consisted of rainfall, max/min temp,
humidity and vapour pressure for 9am and 3pm and solar radiation for
grid points 5km apart covering all of Australia. The table had records
for every day for the last 4+ years and as yet, does not use
partitioning. This is a lot of records, but each record is quite small.
While at some point, we will need to look at additional optimisations
like partitioning, the performance is currently within acceptable
limits. The only optimisation we have done is basic db tuning. When I
did this tuning, performance saw a significant improvement, but none of
it was specific to this large table or the update process.

The key point is that large numbers of records doesn't necessarily mean
that an update will be slow (or more specifically too slow for
acceptable performance, whatever that is). Assuming you will need to
take lots of special action may be premature - you may need to do none
or only a couple of things. Get a baseline first and you will know how
big of an issue you have. You will also be able to determine if what you
try has any benefit. 
-- 
Tim Cross




Re: Persistent Connections

2020-06-23 Thread Tim Cross


Bee.Lists  writes:

> I have an issue with a server (v10) that’s seeing increasing connections 
> until it’s maxxed-out.  
>
> max_connections for my 4-core server is set to 12.  
>
> I’ve installed pg_stat_activity and pg_stat_statements.
>
> I access this server through a web app, and another client on two machines.  
> I also send queries through SSH on the LAN.  psql queries indicate there are 
> too many clients already.  I’ve been getting errors from my workstation 
> through a Ruby gem that says dropped connections happen.  Essentially, the 
> server isn’t giving up connections and clients from all sides are showing 
> some issues.
>
> pg_stat_activity has shown a handful of idle queries that are quite simple.  
> I’ve tested those queries and they seem fine on a workstation client.  I’m 
> assuming these queries somehow don’t finish and leave the connection open, 
> but I could be wrong.  All of this is new to me.  
>
> It was suggested on Slack that it sounds like my workstation had some TCP 
> issues with these connections, and that it was a firewall.  I disabled the 
> firewall and the same happened.  The firewall has been restarted.  
>
> I am running no connection pool tool of any sort, and I’ve never used one.  
>
> At this point I don’t know what to even be watching for.  Connections 
> increase and connections “drop”, according to the gem I’m using.  I have 
> simple queries that are idle and won’t disappear for some reason.  
>
> How can I figure this out so the connections remain within the 
> max_connections limit, and connections are not dropped?  
>
> Any insight appreciated.  
>

Sounds like your web app may not be closing connections once it has
finished with them? The fact you're seeing idle connections would seem to
support this. I would be verifying there isn't a code path in your
client application which is failing to close a connection correctly. Seeing
connections go up and down in the gem may not be telling you the full story -
could be that your client connection objects are being destroyed in your app,
but are not closing the connection correctly so PG is unaware the client
has finished. 

Typically, due to the overhead of making a connection, you don't want
your client app/web app to create a new connection for every query.
Instead, you would use some type of connection pool. Many development
languages support some form of pooling as part of their db connection
library (don't know about Ruby, but JS, Python, PHP, Java, Perl all do)
and there are external 3rd party solutions like pgbouncer which sit
between your client and the database and work a bit like a connection
broker which will manage a pool of connections.

From the description of what you're doing, I would first look to see what
level of connection pooling your development language supports. This
will likely be easier to configure and use than setting up a whole
additional bit of infrastructure which is probably far more powerful
than you need.

I would also go through your code and make sure that every time you
create a database connection, there is some code which is closing that
connection once you're finished with it. This is a frequent source of
problems, especially during development when your code might throw an
error and the code you have to disconnect does not get executed, or you
simply forget to issue a disconnect when you're finished.

A connection pool can help in tracking down such issues as well. Most
pooling solutions will allow you to set a max pool size. In addition to
enabling you to 'reserve' a set number of connections for a client, you
will know which client seems to be running out of connections, helping
to identify the culprit. 

-- 
Tim Cross




Re: Table partitioning with sequence field in postgresql12

2020-06-18 Thread Tim Cross


Srinivasa T N  writes:

> Hi,
>I have a parent table with one of the field as ""gid" int4 DEFAULT
> nextval('"ami_smart_new".aoi_boundary_gid_seq'::regclass)".
>
>I create child tables which inherit parent and use hash partition.  When
> I directly insert into child tables, will there be any race condition
> causing two child tables getting the same sequence value for gid?
>
Assuming all inserts use the default e.g. nextval from the same
sequence, you won't get duplicates. You could get a duplicate if an
insert sets an explicit value for gid of course or if rows in any table
were inserted with a gid which was not obtained from the same sequence
using nextval i.e. parent and children use same sequence. The sequence
is just a counter with the property that no two calls to nextval from
that sequence will have the same value. You cannot make any additional
assumptions, e.g. you cannot assume gid values will be inserted in order or
that there won't be 'gaps', etc.
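
A minimal sketch of the sort of setup this implies (names adapted from your
question, details assumed):

  CREATE SEQUENCE aoi_boundary_gid_seq;

  CREATE TABLE aoi_boundary (
      gid  int4 NOT NULL DEFAULT nextval('aoi_boundary_gid_seq'),
      name text
  ) PARTITION BY HASH (gid);

  CREATE TABLE aoi_boundary_p0 PARTITION OF aoi_boundary
      FOR VALUES WITH (MODULUS 4, REMAINDER 0);

Whether you insert via the parent or directly into a partition, every row
that uses the default calls nextval on the same sequence, so each gets a
distinct gid.
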
-- 
Tim Cross




Re: Table partitioning with sequence field in postgresql12

2020-06-18 Thread Tim Cross


Srinivasa T N  writes:

> Hi,
>Partitioning of a table with sequence id as one of its fields is
> supported in postgresql12?
>
> Regards,
> Seenu.

A sequence is really just an 'atomic' number generator: you get the next
value, which is guaranteed to be larger than the last 'nextval' (up
until maxvalue). It is unaware of the use, i.e. whether it will be used
in an insert or what table that insert is against. So I'm not sure what
your concern with a partitioned table is - can you elaborate?

-- 
Tim Cross




Re: Should I enforce ssl/local socket use?

2020-06-06 Thread Tim Cross


Michel Pelletier  writes:

> Hello,
>
> I'm the author of the pgsodium cryptography library.  I have a question
> about a best practice I'm thinking of enforcing.  Several functions in
> pgsodium generate secrets, I want to check the Proc info to enforce that
> those functions can only be called using a local domain socket or an ssl
> connection.  If the connection isn't secure by that definition, secret
> generating functions will fail.
>
> If someone really wants to point the gun at their foot, they can connect
> with an unsecured proxy.  My goal would be to make bypassing the check
> annoying.
>
> Any thoughts?  Is this an insufferably rude attitude?  Are there scenarios
> where one can foresee needing to generate secrets not over ssl or a domain
> socket?
>

I'm never very fond of enforcing a particular behaviour as it assumes we
understand all environments and use cases. Far better to make this the
default behaviour, but allow users to disable it if they want and
clearly document that option as insecure. I also suspect that without
the ability to somehow disable the checks, people will find elaborate
ways to work around them which are almost certainly going to be even
worse from a security perspective. 
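
A minimal sketch of the kind of check being discussed, with the escape hatch
as an ordinary setting rather than hard-coded (the function and setting names
here are made up, not part of pgsodium):

  CREATE OR REPLACE FUNCTION connection_is_secure()
  RETURNS boolean AS $$
    SELECT inet_client_addr() IS NULL          -- unix domain socket
           OR EXISTS (SELECT 1 FROM pg_stat_ssl
                      WHERE pid = pg_backend_pid() AND ssl)
           OR current_setting('myext.allow_insecure', true) = 'on';
  $$ LANGUAGE sql STABLE;

Secret-generating functions could refuse to run when this returns false,
while a site that really wants to point the gun at its foot can set
myext.allow_insecure and take responsibility for it.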




-- 
Tim Cross




Re: Oracle vs. PostgreSQL - a comment

2020-05-31 Thread Tim Cross


Paul Förster  writes:

> and then, some day, a developer approaches a DBA with a query which is 
> generated and, if printed out in a 11pt. sized font, can fill a billboard on 
> a street, to optimize it or search for what's wrong with it, or why it 
> performs so slow... That's usually when I play BOFH because I'm not willing 
> to debug 10 pages which its creator hasn't even cared to take a look at 
> first. :-P :-)
>
> Same goes for the app guys sending me 10 MB of Java stack trace by email 
> containing one single line of ORA-x. They should send only that line 
> along with a (approximate) time when it occurred. If I get the full stack 
> trace, I send it back to them telling them they should come back when they 
> find the line containing the ORA message. They usually don't come back 
> because they don't know how to grep. :-) Some do, though, and those are the 
> ones I try to help.
>
>> If I'm in an environment where someone else is responsible for all the DBA 
>> stuff, Oracle is nice to work with.
>
> yes, leave the cardiac arrest to us DBAs. :-P
>

Yes, even after a long time doing Oracle, I still never felt as
comfortable or across things as much as I do with PG. I started with
Oracle 7 and stayed until 11g, and each year it got worse rather than better.

After working as a DBA, I know exactly what you mean. Sometimes, DBA has
to equal "Don't Bother Asking". 

As a developer, I have to admit being somewhat embarrassed by the
frequently poor understanding amongst many developers regarding the
technology they are using. I've never understood this. I come across
developers all the time who are completely clueless once outside their
IDE or editor. Too often, they have little understanding of the hosting
environment, the base protocols they are using, the RDBMS or even basic
SQL. I don't understand how you can develop anything of quality if you
don't have a thorough understanding of all the technology involved.

I'm probably just a dinosaur - I also prefer Vi and Emacs as my primary
development environments and will use psql and sqlplus before Toad,
pgAdmin, SQL Developer etc.

Tim

P.S. For moving Oracle databases, we used to just use sed to change the
paths in the control file. It worked remarkably well. We often used this
technique to 'refresh' our dev or testing systems to current prod data.



-- 
Tim Cross




Re: Oracle vs. PostgreSQL - a comment

2020-05-30 Thread Tim Cross
s you seem to have unless I
wandered off into their 'add-ons'. Sticking with the base RDBMS, I found
it to be pretty solid and reliable. However, I prefer Postgres. The main
reason is that with Oracle, you really need to choose one road or the
other - either be a developer or be a DBA. This is especially true with
the introduction of things like Data Guard, GoldenGate etc. Oracle takes
a lot more administration than Postgres and there is a lot of knowledge
to stay on top of.

The oracle installation process is horrible. In addition to all the
library crap, you also commonly run into bugs in their scripts. Worse
still, some of those bugs have been there for 20 years and are just
'known' issues experienced DBAs deal with. Their documentation site is
also horrible.

As a developer, provided you stick with basic database functionality,
i.e. SQL and PL/SQL and avoid their extensions/add-ons, like their
various pub/sub, rules engine, PSP or anything they have obtained by
purchasing a company and 'integrating' it, it is pretty good. I think
they have one of the best locking models out there. The way they handle
indexes and updates is also much faster than postgres and you have to
worry less about the structure of your queries with respect to
performance.

Still, I prefer Postgres. The main reason is that although I may need to
think about how I structure queries, updates and indexes a bit more, on
the whole, it gets out of my way and does what I want - provide a
reliable data store that I can use and get the job done without having
to spend hours caught up in DBA tasks. Updates are easy and the basic
architecture is easy.

The biggest challenge when migrating from oracle to postgres is
recognising they are completely different and while they may both
provide a compliant SQL implementation, the similarities stop there. If
I'm in an environment where someone else is responsible for all the DBA
stuff, Oracle is nice to work with. However, you tend to only be in that
situation when you're working in a large, usually bureaucratic,
environment, which tends to detract from the whole experience in other
ways. If you're unlucky enough to also be using any of the Oracle 'value
add' extensions, development frameworks, application layers etc, it is
really horrible and mind numbing.

Apart from this, Oracle licensing is an absolute nightmare. Beyond
the expense, the complexity is unbelievable and it is almost impossible
to know with any certainty what you will be paying in 12, 24 or more
months. 
-- 
Tim Cross




Re: AW: Linux Update Experience

2020-05-28 Thread Tim Cross


Zwettler Markus (OIZ)  writes:

> Hi Marco,
>
>  
>
> How do you handle these conflicts? No longer updating that regularly or not 
> at all anymore?
>

Not doing the updates is a poor option due to the potential security
vulnerabilities this may lead to. Likewise, delaying the application of
updates is going to increase risks as well. In fact, we have found such
approaches can make the situation worse. Delaying updates tends to
result in more updates being applied at once, which makes it harder to
identify problems when they do occur.

I think the best approach is to apply updates as soon as possible. Apply
the updates to a test or uat environment (or your dev environment if you
don't have a test/uat/staging one). If there are issues, resolve them
before applying the updates in prod.

We have found it rare for updates to be an issue if you're running the
packages from the distribution. Problems seem to be more common when
you're running packages sourced from an external repository which may lag
behind changes made by the distribution. When issues do occur, we look
at the type of update e.g. security, bug fix, bump in dependency
versions, new version etc and make a call as to the priority and respond
accordingly. This may mean delaying applying the update, actively
working to resolve the issue with investigations and debugging, raising
an issue/bug with the package maintainers etc.

We also classify all our systems, services, databases etc according to
whether they are core business processes or supporting processes. We
apply updates to supporting systems before core systems. This also
affects when we apply updates. For example, we would not apply updates
to a core system on Friday afternoon. In fact, we may apply updates to
core systems outside main business hours. If issues are encountered when
applying updates to core systems, resolution of those issues are highest
priority. For secondary systems, we are more likely to do the updates
during business hours, will accept longer outages/down times and may not
make resolution of issues the highest priority. 




Re: GPG signing

2020-05-26 Thread Tim Cross


Marc Munro  writes:

> I need to be able to cryptographically sign objects in my database
> using a public key scheme.
>
> Is my only option to install plpython or some such thing?   Python
> generally makes me unhappy as I'm never convinced of its stability or
> the quality of its APIs, and it is not obvious to me which of the many
> different gpg-ish packages I should choose.
>
> Any other options?  Am I missing something?
>

This is something you would normally implement at the application level,
using the database as just the store for the data and signatures or
signed digests.

Main reason for this is to allow for secure key management. It is very
difficult to implement a secure key management solution at the database
level unless it is designed into the fundamental architecture of the
rdbms. It is the same age-old problem - how can you encrypt data AND
have the keys for the encrypted data in the same place? The main reason
for encryption is so that if your store gets compromised, the data
cannot be read. However, if your key is also in the store, then when
you're compromised, your key is compromised and your encryption becomes a
moot point.

If on the other hand you handle the encryption/signing at the application level,
you can separate your key store and data store so that compromise of one
doesn't also compromise the other. This of course does create other
issues - most notably being that now you have an additional mandatory
layer between you and your data (for example, you can use psql to query
your database, but all you can see is encrypted objects).

In your case, this may not be as big an issue because you state you want
to sign rather than encrypt. You could, for example, design your
application so that the data is in one column and the signature is in
the other (or use json or other 'object' types that allow
attributes/properties). This would allow simple querying of the data and
verification of data integrity to be performed as separate operations.
All you then need to ensure is that every time data is modified, a new
signature is generated.

I would also verify you really do need full cryptographic signing rather
than just some less rigid integrity verification, like a basic checksum
hash. Crypto signing is most useful when you want to both verify the
integrity of something and its source, i.e. this data has not been
changed and was signed by X. In this case, each source is
encrypted/signed with a different private/secret key. If on the other
hand you just want to know that the data has not been modified, you can
generate a checksum/hash of the data when it is inserted/updated and
store that in a separate column. This data may or may not be encrypted
depending on your use case. In this situation, you only need one key,
the key used to encrypt the column or no keys if you don't actually need
to encrypt it.
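
A minimal sketch of the plain checksum variant using pgcrypto's digest()
(table and column names are made up, and the application is assumed to set
the hash when it writes the row):

  CREATE EXTENSION IF NOT EXISTS pgcrypto;

  CREATE TABLE signed_doc (
      id       bigserial PRIMARY KEY,
      body     text  NOT NULL,
      body_sha bytea NOT NULL   -- set by the application on insert/update
  );

  INSERT INTO signed_doc (body, body_sha)
  VALUES ('some document text', digest('some document text', 'sha256'));

  -- integrity check: any row returned here was modified without the hash
  -- being regenerated
  SELECT id FROM signed_doc
  WHERE body_sha IS DISTINCT FROM digest(body, 'sha256');

For real signatures the application would instead sign the digest with its
private key and store the signature alongside the row.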

While non-encrypted checksums are not as secure, not all applications
need that level of security. In some cases, having a separate checksum
is sufficient. If someone wants to modify the data 'unofficially', in
addition to compromising the db and modifying the data, they have to
also generate a new hash of the modified data and store that in the
appropriate column. If you just want to protect against accidental
modification of the data or have reasonable confidence (for some measure
of reasonable), just having a checksum hash may be sufficient. The big
advantage of the simpler case, with no actual data encryption, is that
other clients can access/use the data without requiring access to the
encryption key.

-- 
Tim Cross




Re: Should I use JSON?

2020-05-21 Thread Tim Cross


stan  writes:

> Worming on a small project, and have been doing a lot of Perl scripting to
> parse various types of files to populate the database. Now I need to get
> data from a cloud services provider (time-keeping). They have a REST API
> that returns data in a JSOSN format.
>
> So here is the question, should I just manually parse this data, as I have
> been doing to insert into appropriate entities into the database? Or should I
> insert the JSON data, and use some queries in the database to populate my
> tables from the JSON tables?

Given you plan to store your data in 'normal' tables and you are already
using a scripting language to get the data from the remote API
and your already processing data in various forms using Perl, I
would not bother. All you will really do is add another layer of
complexity and skill requirement (i.e. JSON in the database and writing
JSON queries using PG's SQL JSON support). 


-- 
Tim Cross




Re: Ubuntu 20.04: apt.postgresql.org/pub/repos/apt focal-pgdg InRelease' doesn't support architecture 'i386'

2020-05-16 Thread Tim Cross


Hugh  writes:

> Hi,
>
> While this doesn't appear to be a bug that causes problems of any kind, I do 
> have a question about its cause.
>
> The "error" listed in the Subject: line is basically what I'm seeing. The 
> entire message is below, particularly the 'N:' at the end. Is there a repo 
> setting I should change to prevent the request for '386' architecture? Thank 
> you in advance for your assistance.
>
> user@ubuntu:~$ sudo apt update
>
> Hit:1 http://us.archive.ubuntu.com/ubuntu focal InRelease
> Get:2 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease [107 kB]
>  
> Get:3 http://security.ubuntu.com/ubuntu focal-security InRelease [107 kB] 
>  
> Get:4 http://us.archive.ubuntu.com/ubuntu focal-backports InRelease [98.3 kB] 
>  
> Get:5 http://us.archive.ubuntu.com/ubuntu focal-updates/main amd64 DEP-11 
> Metadata [90.4 kB]
> Hit:6 http://apt.postgresql.org/pub/repos/apt focal-pgdg InRelease
>  
> Get:7 http://us.archive.ubuntu.com/ubuntu focal-updates/universe amd64 DEP-11 
> Metadata [21.4 kB]
> Get:8 http://us.archive.ubuntu.com/ubuntu focal-backports/universe amd64 
> DEP-11 Metadata [532 B]
> Get:9 http://security.ubuntu.com/ubuntu focal-security/main amd64 DEP-11 
> Metadata [16.6 kB]
> Get:10 http://security.ubuntu.com/ubuntu focal-security/universe amd64 DEP-11 
> Metadata [208 B]
> Fetched 441 kB in 1s (367 kB/s)
> Reading package lists... Done
> Building dependency tree   
> Reading state information... Done
> All packages are up to date.
> N: Skipping acquire of configured file 'main/binary-i386/Packages' as 
> repository 'http://apt.postgresql.org/pub/repos/apt focal-pgdg InRelease' 
> doesn't support architecture 'i386'
>

This error is because, by default, the Debian package manager tries to
download package details for all architectures it supports, but fails to
find an i386 index in this repository. You can add an architecture tag to
the repository definition in the sources list file, i.e. /etc/apt/sources.list
or /etc/apt/sources.list.d/postgres.list (or whatever you have called
it). Try adding the arch option as


deb [ arch=amd64 ] http://.
deb-src [arch=amd64 ] ...

This should tell apt to only look for the amd64 packages. 

-- 
Tim Cross




Re: Enforcing uniqueness on [real estate/postal] addresses

2020-05-11 Thread Tim Cross


Peter Devoy  writes:

> Hi list
>
> I need to store addresses for properties (as in real estate) so in my
> naivety I created a unique constraint like this:
>
> ALTER TABLE properties
> ADD CONSTRAINT is_unique_address
> UNIQUE (
> description, --e.g. Land north of Foo Cottage
> address_identifier_general,
> street,
> postcode
> );
>
> Of course, if any of the fields are NULL (which they often are) I end
> up with duplicates.
>
> One solution may be to add NOT NULL constraints and use empty strings
> instead of NULL values but, until asking around today, I thought this was
> generally considered bad practice.
>
> Please can anyone recommend a way of approaching this? Perhaps empty strings
> are pragmatic in this situation?
>
> Kind regards
>
>

Hi Peter,

Personally, I don't like the idea of using empty strings just to avoid
having nulls. This is probably a personal preference, but for me null
and '' are quite different. A null indicates an unknown - we don't know
what the value is. An empty string i.e. '' means there is no value (i.e.
we know it has no value). The difference is quite subtle and may not
seem relevant. It may not be or it may be or it may become relevant in
the future. General rule of thumb for me is that my model should reflect
the known information and should always avoid any data transformation or
mapping which reduces the known information.

I would step back a bit and think about why/what constraint you really
need and what needs to be unique. The first field which jumps out for me
is description. Is this really a unique value? Would it be possible to
have two properties with the same description? Does it matter if two
properties have the same description? Does the description really affect
property uniqueness. If two records have the same street, postcode and
general_property_identifier, but different descriptions, are they really
two different records? Will description change over time? As description
is a fairly subjective value, I would be tempted to not include it in
your unique constraint at all. In fact, I would probably keep
description in a separate table as it may be reasonable to have multiple
descriptions for a property. If you want just a single description, then
you can leave it in this table. I would not put a unique or not null
constraint on it.

This would leave you with address_identifier_general, street and
postcode. None of those will be unique by themselves. You will only get
uniqueness when you combine all 3. Can any of them be null? I would
suspect not, so I would define them with not null constraints. I would
then probably add a composite unique index using all 3 values to enforce
uniqueness. Depending on your application needs, I would probably add a
unique property_id field to the table as well (which would be the value
I would use to link records in other tables, such as a
property_description table).
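
Putting that together, a minimal sketch (column types assumed, names adapted
from your example):

  CREATE TABLE property (
      property_id bigserial PRIMARY KEY,
      address_identifier_general text NOT NULL,
      street      text NOT NULL,
      postcode    text NOT NULL,
      UNIQUE (address_identifier_general, street, postcode)
  );

  CREATE TABLE property_description (
      property_id bigint NOT NULL REFERENCES property (property_id),
      description text   NOT NULL
  );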

Of course, there are other constraints you could consider - post code
probably has a set format which you might want to enforce or perhaps you
can access a complete listing of valid postcodes and import that into
your system as a postcode table. In that case, your postcode field might
be better defined as a foreign key constraint into the postcode table. 

When defining your constraints, it is important to consider what
information is known at the point of initial data entry. Is there a need
to enter partial data (for example, you might know the street and
postcode, but not the general_property_identifier. Is it expected or
reasonable to allow entry of this sort of partial data? If so, how will
that work with your uniqueness constraints? (it may be quite reasonable
to require all 3 fields be known). The point is, you need to know how
the system will be used and what the expectations of the users are.
Maybe there is a legitimate business case to allow partial data entry,
in which case, you may need a different approach or a way to identify
partial/incomplete records etc. 

-- 
Tim Cross




Re: Best way to use trigger to email a report ?

2020-05-08 Thread Tim Cross


David G. Johnston  writes:

> On Fri, May 8, 2020 at 9:26 AM David Gauthier 
> wrote:
>
>> psql (9.6.0, server 11.3) on linux
>>
>> Looking for ideas.  I want a trigger to...
>> 1) compose an html report based on DB content
>> 2) email the report to a dist list (dl = value of a table column)
>>
>> If this will involve hybrid coding, I prefer PL/perl.  The linux env has
>> both "mail" and "mutt" (if this is of any help).
>>
>> The idea is to send a report to the list when all the data has been
>> collected for a particular job and the final status of the job is updated
>> as a col of a rec of a certain table.  Probably a post update trigger.
>>
>
> I'd probably limit the trigger to checking for the completion of the data
> collection and inserting a record into a "email job" table.  Then I'd have
> cron on a linux machine periodically run a script that queries the "email
> job" table for work, perform the work, and then either flag the work as
> done or remove the job record.
>

Yep, exactly how I would do it as well. Personally, triggers are my last
choice. In some situations, they are the right choice and when you do
need them, keep them as small and simple as possible.

Others have mentioned the issues of using external calls inside a
trigger. In addition to the problem with waiting on external processes
to complete, you also have all the messy external-world things to take
care of (like network down, server unavailable, etc.).

The solution I've used in the past is to have procedures in the database
which generate the email report and insert that data into an email
table. I then have a shell level script (could be perl, javascript,
java, whatever) which looks in this table periodically and if it finds a
report which has not been sent, extracts it, optionally formats it and
sends it. On successful completion, it sets a 'sent' flag on the report
record in the DB (or just delete it - I prefer to set a flag so that if
something failed unexpectedly, you still have the report).
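
A minimal sketch of that pattern (all table, column and function names here
are made up, and build_job_report_html is an assumed helper that produces the
HTML):

  CREATE TABLE email_job (
      id          bigserial   PRIMARY KEY,
      job_id      bigint      NOT NULL,
      report_html text        NOT NULL,
      sent        boolean     NOT NULL DEFAULT false,
      created_at  timestamptz NOT NULL DEFAULT now()
  );

  CREATE OR REPLACE FUNCTION queue_job_report() RETURNS trigger AS $$
  BEGIN
      IF NEW.status = 'complete' AND NEW.status IS DISTINCT FROM OLD.status THEN
          INSERT INTO email_job (job_id, report_html)
          VALUES (NEW.id, build_job_report_html(NEW.id));
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER job_complete_email
  AFTER UPDATE ON job
  FOR EACH ROW EXECUTE PROCEDURE queue_job_report();

The cron-driven script then just selects the rows WHERE NOT sent, mails them
and flips the flag.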

There are ways you can trigger periodic activity in the database, but to
be honest, CRON is easy and reliable and avoids needing to add
additional extensions etc to the DB. Turning off the report, changing
the time etc, is a simple crontab edit. 

-- 
Tim Cross




Re: Lock Postgres account after X number of failed logins?

2020-05-06 Thread Tim Cross


Geoff Winkless  writes:

> On Wed, 6 May 2020 at 00:05, Tim Cross  wrote:
>> Where Tom's solution fails is with smaller companies that cannot afford
>> this level of infrastructure.
>
> Is there an objection to openldap? It's lightweight (so could
> reasonably be run on the same hardware without significant impact),
> BSD-ish and mature, and (with the password policy overlay) should
> provide exactly the functionality the OP requested.
>

OpenLDAP is certainly the way I would go. However, for a number of
reasons, smaller companies seem somewhat resistant to that level of
integration. I suspect it is primarily because LDAP skills are less
prevalent amongst admins in these areas. Often, these companies don't
really have a planned architecture - things have grown organically and
got to the point where existing resources are fully allocated just
trying to keep all the bits running. It can be hard to sell the idea,
especially as those making the decisions are not across the issues and
from where they sit, it all looks to be working and you're asking for more
resources when it doesn't seem to be broken. The IT guys often fail to
sell the benefits because they focus on the technical aspects rather
than on the business aspects.

One client I helped had admins who had been trying to move everything
over to a centralised LDAP solution for ages and failing. They had
presented great justification for why it was needed, but it focused on
the technical benefits rather than the business continuity, process
improvement and security benefits. Once we put together a new business
case which focused on improved processes for managing access, reduced
security audit costs and improved security controls, they were sold and
made the project a priority. 

Based on additional info I saw from the OP and plans to roll out
many databases, I think a centralised directory service approach is
really their only scalable and maintainable solution. In fact, they
probably need to look at their overall identity management architecture.
Failure to get that basic service correct will result in major support
issue blow out as they increase their customer base.

-- 
Tim Cross




Re: Lock Postgres account after X number of failed logins?

2020-05-05 Thread Tim Cross


Wolff, Ken L  writes:

> As Stephen states, even some basic functionality in this regard would go a 
> long way.  Perhaps something could be built into the postgresql-contrib RPM?  
> Right now the only way I see is to write a hook, which involves changing 
> source code, which then puts us into the situation of (1) maintaining our own 
> code tree and (2) figuring out how to produce a new set of RPMs.  
>
> I realize Postgres is a community project and that there are a great number 
> of other valuable feature/enhancement requests in the queue.  Just adding my 
> $.02 here.
>

The problem here is that everyone has valid points.

Tom is quite correct that this sort of security policy really needs to
be implemented in a single central location, such as LDAP, AD or some
other IAM middleware. Having security policies implemented separately in
different systems is where failures creep in  and why maintenance
becomes a problem.

Where Tom's solution fails is with smaller companies that cannot afford
this level of infrastructure. They can still fall victim to the same
level of regulatory bureaucracy, but without the necessary level of
technical resources of larger organisations. For these organisations,
basic facilities, like the ability to lock an account after a certain
number of failed login attempts for a period of time is a very useful
feature. 

My suggestion would be to develop the basic requirements and contribute
the result to Postgres. This would give back to the community and
eliminate the need to maintain separate code in the long-term. The cost
of paying for extra resources to do this development and maintenance is
still going to be less than the licensing costs for that commercial
competitor. Just requesting the facility is unlikely to result in any
acceptable outcome within any reasonable time frame. 

If your security people are really on top of their game, they will be
providing you with a security architecture which fulfils the enterprise
architecture requirements and which centralises IAM management. This is
really the only truly secure solution which guarantees access is removed
from all system in a timely manner, enables effective logging and
auditing of access, ensures consistent application of security policy
and allows consistent response to security incidents and events. While
requiring additional resources to establish, it does tend to result in
reduced maintenance costs in the longer term.

-- 
Tim Cross




Re: Unable to connect to the database: TypeError: net.Socket is not a constructor

2020-04-20 Thread Tim Cross
in the browser, so that is why trying to
> use `net.Socket` in the browser (via webpack, browserify, etc.) won't work"
>
> Environment Info:
>
>   System:
> OS: Linux 5.3 Ubuntu 18.04.4 LTS (Bionic Beaver)
> CPU: (8) x64 Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
>   Binaries:
> Node: 12.15.0 - ~/.nvm/versions/node/v12.15.0/bin/node
> Yarn: 1.22.4 - ~/.nvm/versions/node/v12.15.0/bin/yarn
> npm: 6.14.4 - ~/.nvm/versions/node/v12.15.0/bin/npm
>   Browsers:
> Chrome: 81.0.4044.92
> Firefox: 75.0
>   npmGlobalPackages:
> @vue/cli: 4.2.3
>
> So... how to solve the problem?
> Looking forward to your kind help.
> Marco

This has nothing to do with postgres. This is a purely Javascript issue
combined with differences between the browser Javascript API and node
javascript API.

Bottom line - if you're trying to connect directly to a PG database from
within Javascript code running inside the browser, it won't work. The
browser does not have the 'net' library. You can only do this from
within node.js. Likewise, references to express are also red herrings -
you don't have a web server inside the browser either. If you're not
trying to develop inside the browser but are in fact developing at the
node.js level, then you don't need webpack.

I think what you really need to do is step back and look closer at your
architecture. Typically, you would put all your database interaction
stuff in the web server using node.js. A common design pattern would be
to use one of the node.js web servers, like express (but there are
others) and have something like nginx as a proxy server in front of it.
You would then wrap your database interaction in a node.js API which you
would then call from your client browser using http/https or web
sockets. 

-- 
Tim Cross




Re: Using unlogged tables for web sessions

2020-04-16 Thread Tim Cross


Stephen Carboni  writes:

> Hello.
>
> I was wondering if anyone was using unlogged tables for website
> sessions in production. I'm interested if it breaks the prevailing
> opinion that you don't put sessions in PG.

This really depends on what you define as website session data and what
benefit you would see compared to the additional overhead of maintaining
this session information remotely (from the client). Depending on your
application, there is often some session information which makes more
sense stored on the back end server rather than in the client - notably,
data used by your server API to modify responses or possibly encrypted
data to handle 'remember me' type functionality. However, you probably
don't want to store session data used by the client API e.g. browser
Javascript as this would introduce additional network overheads,
latency, load on web and db server, increased web and db server API
complexity and possibly additional data privacy/security concerns you
will need to manage. This can be hard to justify when you have good
client data storage facilities available.

I have not come across a use case where it made sense to store ALL
session data remotely in the database. I have seen situations with a
very specialised application where having a more full featured LOCAL (to
the client) database server to record session information can be useful,
but this is rare.


--
Tim Cross




Re: timestamp and timestamptz

2020-04-15 Thread Tim Cross


Niels Jespersen  writes:

> Hello all
>
>  
>
> We have some data that have entered a timestamp column from a csv. The data 
> in the csv are in utc. We want to access the data in
> our native timezone (CET). 
>
>  
>
> I am considering a few alternatives: 
>
>  
>
> 1.   Early in the process, convert to timestamptz and keep this datatype. 
>
> 2.   Early in the process, convert to timestamp as understood in CET.  
> This will imply by convention that the data in the timestamp
> column represents CET. Users will need to be told that data represents CET, 
> even if data is somwhere in the future kept in another
> country in another timezone. 
>
>  
>
> I probably should choose 1 over 2. But I am a bit hesitant, probably because 
> we almost never have used timestamptz. 
>
>  
>
> Can we agree that the below query is selecting both the original utc 
> timestamps and 2 and 1 (as decribed above)?
>
>  
>
> set timezone to 'cet';
>
> select read_time read_time_utc, (read_time at time zone 'utc')::timestamp 
> read_time_cet, (read_time at time zone 'utc')::timestamptz
> read_time_tz from t limit 10;
>
>  
>
> We are on Postgres 12. 
>
>  
>

Keep your life simple - just go with option 1. Keep all timestamps in
UTC and let clients deal with it in whatever way they need to. This will
also help deal with issues associated with daylight savings time (DST
can be a real pain as different locations have it and others don't and
the 'switchover' date is subject to political whims and can change).
Your option 2 will cause all sorts of issues and keep in mind that most
larger countries have multiple timezones, so even if your data is all
associated with a single country, you can potentially have multiple
conversion routines required. On most *nix systems, clock time is UTC as
well, so having everything in UTC really helps when you want to do
diagnosis across your database and system log files etc. 
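
A minimal sketch of option 1 for data already loaded as plain timestamp
(assuming, as you say, the stored values are UTC):

  ALTER TABLE t
      ALTER COLUMN read_time TYPE timestamptz
      USING read_time AT TIME ZONE 'utc';

  -- clients then just set their own zone and read it
  SET timezone TO 'cet';
  SELECT read_time FROM t LIMIT 10;

The value is stored once (internally in UTC) and every client sees it
rendered in whatever timezone their session uses.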

-- 
Tim Cross




Re: Estimated resources for a 500 connections instance (VM)

2020-04-07 Thread Tim Cross


David Gauthier  writes:

> After looking at some of the factors that can affect this, I think it may
> be important to know that most of the connections will be almost idle (in
> terms of interacting with the DB).  The "users" are perl/dbi scripts which
> connect to the DB and spend the vast majority of the time doing things
> other than interacting with the DB.  So a connection is consumed, but it's
> not really working very hard with the DB per-se.  I am cleaning up some of
> that code by strategically connecting/disconnecting only when a DB
> interaction is required.  But for my edification, is it roughly true that 2
> connections working with the DB 100% of the time is equivalent to 20
> connections @ 10% = 200 connections @ 1 % (if you know what I mean) ?

Based on that additional info, I would definitely follow Laurenz's
suggestion. It's a long time since I used Perl DBI, but I'm pretty sure there
is support for connection pools, or you can use one of the PG
connection pooling solutions.

There is a fixed memory allocation per connection, so 2 connections at
100% is not the same as 20 connections @ 10%.

Using a connection pool is usually the first thing I will setup. If
additional connections are still required, then I would increase the
limit in small jumps - definitely would not go from 100 to 500.

BTW, running PG on a virtual machine is not an issue in itself - this is very
common these days. However, I would ensure you are up to date with the latest
minor release for that version and would use clients with the same
version as the master. 

-- 
Tim Cross




Re: Improve COPY performance into table with indexes.

2020-04-03 Thread Tim Cross
mance impacts can be more acceptable than one long one).

- Make sure you have good logging enabled and check for things like
  overly frequent writing of WALs. This can have significant impact on
  performance. If your rows are large, you may be adversely impacting
  performance writing the WAL cache etc. 
-- 
Tim Cross




Re: Detecting which columns a query will modify in a function called by a trigger

2020-03-02 Thread Tim Cross


stan  writes:

> On Mon, Mar 02, 2020 at 11:02:54AM -0800, Adrian Klaver wrote:
>> On 3/2/20 10:59 AM, stan wrote:
>> > I need to implement a fairly fine grained security model. Probably a bit
>> > finer that I can do with the standard ownership functionality.
>> >
>> > My thinking on this is to create a table that contains the users, and a
>> > "permission bit" for each function that they may want to do, vis a vi
>> > altering an existing row,or rows, or inserting new rows.
>> >
>> > Looks relatively straight forward, if fairly time consuming to do. But I
>> > would need to know which column(s) a given query would add..alter from the
>> > function to implement this via a trigger. looks like I see most of what I
>> > need t do this in the docs, but I can't quite figure out if I can get this
>> > down to what column(s) a given trigger will modify. Is this possible?
>>
>> Before you get too far into this I would look at RLS:
>>
>> https://www.postgresql.org/docs/12/ddl-rowsecurity.html
>>
> Thanks for pointing that out.
>
> Using that functionality was my original plan, but let me describe why I do 
> not think it
> can do what I need. This may be an indication of my weakness in design
> though.
>
> Envision a table with a good many columns. This table represents the "life
> history" of a part on a project. Some of the columns need to be
> created/modified by the engineer. Some need to be created/modified by the
> purchasing agent, some of the columns need to be created by the receiving
> department, some of the columns need to be created/modified by the accounts
> payable department.
>
> Make sense?

When you speak of columns needing to be created/modified, do you really
mean columns or rows? It would be a very unusual approach to allow
multiple different 'agencies' to create/modify underlying table design.
If this is the case, then you are in an impossible position and have no
hope of implementing anything that will be maintainable and you will
never be able to manage security.

I'm hoping you mean different agencies which need to add/modify rows
within the tables?

--
Tim Cross




Re: psql UPDATE field [tab] expands to DEFAULT?

2019-06-17 Thread Tim Cross
On Tue, 18 Jun 2019 at 10:39, Tom Lane  wrote:

> Tim Cross  writes:
> > On Tue, 18 Jun 2019 at 09:34, Ken Tanzer  wrote:
> >> Thanks Adrian, though I wasn't really seeking tips for column names.  I
> >> was instead trying to understand whether this particular tab expansion
> was
> >> intentional and considered useful, and if so what that usefulness was,
> >> because it's rather escaping me!
>
> > Have to say, I fid that behaviour unusual as well.
>
> I don't think it's intentional.  A look into tab-complete.c shows that it
> makes no attempt to offer completions beyond the "=" part of the syntax;
> so there's room for improvement there.  But then what is producing the
> "DEFAULT" completion?  After looking around a bit, I think it's
> accidentally matching the pattern for a GUC "set" command:
>
> else if (TailMatches("SET", MatchAny, "TO|="))
> {
> /* special cased code for individual GUCs */
> ...
> else
> COMPLETE_WITH("DEFAULT");
> }
>
> So perhaps that needs to look more like this other place where somebody
> already noticed the conflict against UPDATE:
>
> else if (TailMatches("SET|RESET") && !TailMatches("UPDATE", MatchAny,
> "SET"))
> COMPLETE_WITH_QUERY(Query_for_list_of_set_vars);
>
> More generally, though, I'm inclined to think that offering DEFAULT
> and nothing else, which is what this code does if it doesn't recognize
> the "GUC name", is just ridiculous.  If the word after SET is not a known
> GUC name then we probably have misconstrued the context, as indeed is
> happening in your example; and in any case DEFAULT is about the least
> likely thing for somebody to be trying to enter here.  (They'd probably
> have selected RESET not SET if they were trying to do that.)
>
> regards, tom lane
>


Given that identifying legitimate candidates following the '=' in an UPDATE
statement would require a full-blown SQL parser, my suggestion would be to
refine the rules so that no completion is attempted after the '='. I would
rather have tab do nothing than have it replace what I've already typed
with 'default'.

-- 
regards,

Tim

--
Tim Cross


Re: psql UPDATE field [tab] expands to DEFAULT?

2019-06-17 Thread Tim Cross
On Tue, 18 Jun 2019 at 09:34, Ken Tanzer  wrote:

> On Mon, Jun 17, 2019 at 4:24 PM Adrian Klaver 
> wrote:
>
>> On 6/17/19 3:03 PM, Ken Tanzer wrote:
>> >
>> > So I'm curious if this is intended behavior, if it's considered useful,
>> > and/or if it's a placeholder for something in the future that will be
>> > useful.  Also, is this new, as I've never noticed it before?
>>
>> Not sure how long that has been around.
>>
>> My cheat for dealing with many/long column names is:
>>
>>
> Thanks Adrian, though I wasn't really seeking tips for column names.  I
> was instead trying to understand whether this particular tab expansion was
> intentional and considered useful, and if so what that usefulness was,
> because it's rather escaping me!
>
> Cheers,
> Ken
>
>
>
Have to say, I find that behaviour unusual as well. I would expect that once
I've typed some characters, the completion mechanism would attempt to
complete based on the characters I've typed and if it cannot, to do
nothing. Instead, what happens is that what I have typed is replaced by
'default'.  For example, if I type

update my_table set my_col = other_t

and hit tab, 'other_t is replaced by 'default', which is of no use. What I
would expect is for tab to either complete (possibly only partially if
there is multiple candidates) what it could for candidates which start with
'other_t' e.g. 'other_table' or it would do nothing i.e. no completion
candidates found, telling me there is no match based on the prefix I've
typed.


-- 
regards,

Tim

--
Tim Cross


Re: Copy Bulk Ignore Duplicated

2019-06-14 Thread Tim Cross


Leandro Guimarães  writes:

> Hi,
>
>I have a scenario with a large table and I'm trying to insert it via a
> COPY command with a csv file.
>
>Everything works, but sometimes my source .csv file has duplicated data
> in the previously fulfilled table. If I add a check constraint and try to
> run the COPY command I have an error that stops the whole insertion.
>
>   I've tried to put the data in a tmp table and fill the main using
> distinct this way (the fields and names are just examples):
>
> INSERT INTO final_table values (name, document)
>SELECT DISTINCT name, document
>FROM tmp_TABLE t1
>WHERE NOT EXISTS (
>SELECT 1 FROM final_table t2
>WHERE (t2.name, t2.document)
>IS NOT DISTINCT FROM (t1.name, t1.document))
>
> The problem is that my final_table is a large (and partitioned) table and
> this query is taking a long time to execute.
>
> Someone have any idea (really guys anything would be great) how to solve
> this situation? I need to ignore duplicates instead to have some error.
>
> I'm using* PostgreSQL 9.4* so I can't use "ON CONFLICT" and upgrade is not
> an option.
>

An explain plan would probably shed some light, but I suspect your
performance is being heavily hit by the sub query. DISTINCT is an
expensive operation, and the correlated sub query is being evaluated
for every row in your temp table.

It isn't clear what the primary key is for your final table - name +
document seems suspicious given these seem to be the only two columns
you're inserting as well. You don't indicate what the data types are
either - if document is something like 'text' then using it in a
distinct clause is likely to have a huge performance impact.

The first thing I'd do is to eliminate duplicates from your temp table
as a separate statement or by pre-filtering the CSV before import. I
would then try something like an outer join to identify rows in your
temp table which don't exist in your final table and select from there
to insert into the final table. You don't really need the distinct in
the sub query as all you really need to know is if (name, document)
exists - it doesn't matter if more than one exists (for this test). 
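
Untested, but roughly what I have in mind (using the table/column names
from your example, and ctid as a cheap way to pick one row per duplicate
group and to detect 'no match' in the outer join):

-- 1. remove duplicates within the staging table, keeping one row per pair
DELETE FROM tmp_table a
 USING tmp_table b
 WHERE a.ctid < b.ctid
   AND (a.name, a.document) IS NOT DISTINCT FROM (b.name, b.document);

-- 2. insert only the rows which have no match in the final table
--    (anti-join, no DISTINCT required)
INSERT INTO final_table (name, document)
SELECT t1.name, t1.document
  FROM tmp_table t1
  LEFT JOIN final_table t2
    ON (t2.name, t2.document) IS NOT DISTINCT FROM (t1.name, t1.document)
 WHERE t2.ctid IS NULL;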

If you really don't have something more specific for a primary key,
depending on what data type 'document' is and how large it is, you may
find adding a column which is a checksum of your 'document' field a
useful addition. I have done this in the past where I had an application
where name was not unique and we only wanted distinct instances of
'document' (document was a fairly large XML document in this case). 

--
Tim Cross




Re: General question about OS

2019-06-09 Thread Tim Cross


Drexl Spivey  writes:

> Hello all,
>
> Don't want to start one of those endless internet tug of wars without end 
> threads, but would like some other people's opinions.
>
> First off, I use all Operating systems without problems, personally 
> defaulting to linux at home, but mostly mac at work. I use windows, when 
> necessary, not my favorite.
>
> It seems in my little database development experience that this is one area 
> where windows might actually offer the best, most mature/developed choices. 
> If this is the case, I should acclimate myself to it more.
>
> I have found many applications have been ported to other systems, but they 
> don't seem as "good", and some programs like Power Designer are windows only.
>
> Is database work heavily windows leaning??

Not my experience. Much of my work has been with Oracle and Postgres and
all of it has been on Linux (well, Unix originally Tru64, Solaris). In
medium to larger enterprises, it was often Linux based database backend and
Windows client frontend or Linux backend and 'client' delivered via web
based frontend, typically served from Linux Tomcat,Apache,*ngnix
etc. Smaller organisations tended to have more Windows servers with more
instances of MS SQL Server.

In recent years, I have seen a growth in MS SQL Server, especially with
growth in things like Sharepoint, Skype, MIM/MDM and the push towards
Azure. At the same time, we also see MS embracing Linux more with bash
and linux subsystem, availability of linux VMs in Azure etc.

For your larger databases, I see more Linux than MS. This could be
related to storage and file systems more than anything else.

-- 
Tim Cross




Re: Loading table with indexed jsonb field is stalling

2019-05-17 Thread Tim Cross


Will Hartung  writes:

> I am trying to load data in to a table with a jsonb field that is indexed
> as gin (field jsonb_path_ops).
>
> It's a large table, and I'm loading it 100K rows at a time. Eventually, it
> would be 30M+ rows in the table.
>
> Originally I simply loaded the table and then tried to create the index,
> but it never finished.
>
> So, I'm trying to load it incrementally.
>
> I have 2.8M rows in the table so far, the jsonb field size is, on average,
> 1600 bytes, with the largest (of the 2.8M loaded) 1930. Simply, these are
> not large structures.
>
> The first batches to load took various times for each file. Most of them <
> 1m, some took 1/2 hr.
>
> The current file is "stuck", pushing past 20hrs so far.
>
> The VM only has 4G of RAM, it is certainly "busy", but it is not swapping
> (not at the OS level).
>
> Here is a recent top:
>
> top - 11:34:01 up 1 day,  1:49,  2 users,  load average: 5.84, 4.94, 4.52
> Tasks: 103 total,   1 running,  59 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us,  1.0 sy,  0.0 ni,  0.0 id, 95.3 wa,  0.0 hi,  3.7 si,
> 0.0 st
> KiB Mem :  4040212 total,   152336 free,   181792 used,  3706084 buff/cache
> KiB Swap:  4194300 total,  4189948 free, 4352 used.  3443628 avail Mem
>
> Postgres is pretty much default configurations, I have not tweaked any of
> the memory settings (such as work memory).
>
> My Mac OS host isn’t that busy either, but the VM adds some load, and it's
> not thrashing.
>
> While I was loading the file in 100K row chunks, here are the times of each
> respective chunk to actually load:
>
> 0:46s
> 3:17s
> 8:12s
> 9:54s
> 14:09s
> 12:07s
> 18:50s
> 9:01s
> 25:28s
> 38:49s
> 25:24s
> 1:21s
> 0:47s
> 0:32s
> 0:39s
> 0:31s
> 0:31s
> 0:28s
> 0:29s
> 0:28s
> 0:21s
> 0:27s
> 0:36s
> 0:22s
> 0:27s
> 0:20s
> 0:21s
> 0:19s
> 2:16:21s  <— the last to date, but this was yesterday, now it's past 20hrs
>
> It stalled early, but then rocketed to the stalling cliff staring climax
> that it's at now.
>
> The only reason this is on a Linux VM is that I saw similar behavior
> running native Postgres 9.6 on Mac OS (the host). It didn’t make any sense
> that Mac OS would be causing this, but, who knows. Try it and see.
>
> Since the start of the load of the stalled piece, something has consumed
> over 800M of storage, I can’t say what, I did not check with any higher
> fidelity as to where the storage was going.
>
> I do not understand why this suddenly falls over a cliff. The JSONs are not
> large, so I don’t see how any individual one could crush the memory
> subsystem. I have to assume that PG is somehow internally thrashing or
> paging or something. I appreciate that the db is not tuned, but I would not
> expect that it would struggle so to create this index, with these values,
> and such small JSON payloads.
>
> Also, it’s not unique to the 29th piece. I’ve tried in the past to skip
> those, and it still failed. This has been happening for some time (months),
> but I keep putting it away.
>
> Any insight is helpful. My biggest fear is that for whatever reason we will
> not be able to reload this table during any particular crisis in the future
> should it come to that.
>
> Thanks.

Which version of postgres?

How are you loading the data? (application, psql, pg_restore) using
(insert, copy)?

-- 
Tim Cross




Re: PG version recommendation

2019-05-07 Thread Tim Cross


I would find out if the IT team who will maintain the system are running
a specific Linux distribution, such as RHEL and just go with the PG
version that is on that distribution.

Large corps rarely have sufficient IT resources. Unless you specifically
need a particular PG version (which does not seem to be the case based
on your info), you are better off sticking with the version provided by
whatever distro they use. This will ensure reasonable updates and
patches. In corp environments, where IT resources are thin on the
ground, any custom install often results in poor patching and update
cycles because it falls outside 'standard procedures'.

With respect to hardware specifications, it really depends a lot on what
the infrastructure is. Typically, you would be better off specifying the
performance and storage (size) you require and leave it to the IT team
to work out how to best satisfy those requirements e.g. support x
concurrent connections, y Tb/Gb of storage, backup requirements i.e. how
frequent, how many versions and retention requirements. Include details
of any additional PG packages you may need/want and how/where the
database will need to be accessed from.

As you indicate the host will be a VM, it should be relatively easy to
scale up/down cpus or memory as required, unless you have special
requirements (very complex queries, very large data sets, complex data
models involving GIS, XML, etc that may exceed resources available in
their VM infrastructure).

From your description, your database sounds fairly standard with no
unusual requirements. The number of concurrent users is low and it
sounds like it may be a new application where you probably don't yet
know where performance or resource bottlenecks will be. A standard Linux
server with 16+Gb memory and a couple of Gb for storage running PG 9.6
or higher is likely to be a reasonable starting point.

It would also be a good idea to speak to the IT team and see if they
have any procedures/policies for requesting resources. Ask them what
info they need to know and then gather that. It is unlikely to help if
you specify hardware requirements they cannot easily support,
especially if those requirements are really just arbitrary and based on
external recommendations from people who don't know what the
infrastructure is. Nothing frustrates IT teams more than being required
to provide over-specified systems which consume valuable resources that
are never used or demand custom configurations which are unnecessary and
just add to their maintenance overheads. 

Tim

David Gauthier  writes:

> Hi:
>
> I'm going to be requesting a PG instance supported by an IT team in a large
> corp.  They will be creating the server as a VM.  We will be loading the DB
> using scripts (perl/dbi) on linux, possibly using bulk loading techniques
> if that's required.  Queries will come from both linux and the web, but
> typically the number of concurrent users will be on the order of 10 reads,
> maybe a couple writers.  < 1T total disk, no partitioning.  I will be
> requesting PITR.
>
> I need to pick a PG version in my request.  I want something that will be
> stable and reliable while, of course, being able to perform well.  What
> would be a good choice for PG version ?
>
> Also, since the server will be a VM, are there any special
> recommendations/suggestions might I forward in the request (install
> options, tuning options, other) ?
>
> Thanks !


--
Tim Cross




Re: AW: Forks of pgadmin3?

2019-03-25 Thread Tim Cross


kpi6...@gmail.com writes:

> Thanks for the link but we're very reluctant to use Java based programs. 
> The main reason is that we need to do some works on servers whose admins 
> simply do not allow to install Java. 
> The screenshots look very promises, however. 
>
> Regards
> Klaus
>
>> -Ursprüngliche Nachricht-
>> Von: Thomas Kellerer 
>> Gesendet: Montag, 25. März 2019 12:06
>> An: pgsql-general@lists.postgresql.org
>> Betreff: Re: Forks of pgadmin3?
>> 
>> kpi6...@gmail.com schrieb am 22.03.2019 um 17:25:
>> > 95% of my time I use pgadminIII just to type select and update
>> > statements and review the output rows.
>> >
>> > I know that I can do this in psql but it’s not handy with many
>> > columns.
>> 
>> An alternative you might want to try is SQL Workbench/J: https://www.sql-
>> workbench.eu/
>> 
>> Full disclosure: I am the author of that tool.
>> 
>> It's a cross DBMS tool, but my primary focus is Postgres.
>> 
>> It focuses on running SQL queries rather then being a DBA tool.
>> 
>> Regards
>> Thomas

You may not need to install anything on the server. GUI based tools
like dbeaver (also java) and I suspect this one, just run on your
desktop/laptop. You connect to the remote DB as normal i.e. port 5432.

If your network environment is locked down to only allow connections to
port 5432 from specific servers and localhost (i.e. the server), then
SSH can work. You use an SSH tunnel to tunnel traffic for port 5432 to
a local port and then configure the connection as a local connection
using that port e.g.

in one terminal ssh -L 3330:localhost:5432 db.server

in your local software tool, configure the connection with host localhost and
port 3330. It may also be necessary to set up proxy connections if the
host you're allowed to connect to is not the db host, but many tools
support this as well as it is a common restriction. You can also use ssh
here; it is a little more complicated, but the same principles apply.

BTW, the blanket restriction on java runtime is IMO misguided. There is
nothing inherently more risky about the Java runtime than anything else
(python, perl, ruby, node, etc). In fact, the JVM is a pretty decent bit
of kit. The Java language is horrible to work with, but that is a
different issue.

There are some bloody awful Java applications out there, but this really
means, assess on a per app basis, not a blanket ban on all of
them. There are insecure and poorly written apps in every language.

Tim

-- 
Tim Cross



Re: Forks of pgadmin3?

2019-03-22 Thread Tim Cross


Jeff Janes  writes:

> On Fri, Mar 22, 2019 at 8:04 AM Steve Atkins  wrote:
>
>>
>>
>> > On Mar 22, 2019, at 10:56 AM, Christian Henz 
>> wrote:
>> >
>>
>> There's the BigSQL fork, which had at least some minimal support
>> for 10. I've no idea whether it's had / needs anything for 11
>
>
> I just installed BigSQL's v11 of the database to get the pgAdmin3 that
> comes with it (I couldn't get the Windows installer to install just
> pgAdmin, I had to take the entire server installation along with it) .
> Even though it comes with v11, when you start it says it only supports up
> to v10, and then gives a series of warnings about catalogs and system admin
> functions not being as expected.  Once you are past the warnings, it does
> work at least on the surface, but I have to think some features aren't
> going to work.
>
> Cheers,
>
> Jeff

I think you have little choice other than to give up on pgAdmin3 and any
of the forks. The old pgAdmin3 had become difficult to maintain and I
doubt any fork will be able to avoid this.

I completely understand your frustration with pgAdmin4, though I have
observed significant improvement over the last 12 months. I'm in the
position where I have been prevented from upgrading our databases
because nobody on our team likes pgAdmin4 and they don't want to give up
on pgAdmin3. The proverbial tail wagging the dog if you ask me.

I have looked at some alternatives and have found that

1. dbeaver https://dbeaver.io/download/ is not too bad and is free
2. DataGrip from JetBrains is pretty good, but has a paid license
3. Most of our developers use Visual Code as their editor and it has
some pretty reasonable extensions which makes doing basic database
queries and display of results pretty reasonable and provides OK code
completion support.

DataGrip and Visual Studio Code also have git integration, which is good if
you're keen on DDL being tracked and versioned in git.

Based on the improvements I've seen in pgAdmin4, I suspect it will get
to a usable and stable state eventually and will likely be a pretty good
replacement for pgAdmin3. However, currently, I find it still a little
too unstable.

Personally, I'm pleased I spent the time to get my Emacs and psql
integration working to the point that I do 90% of what I need in psql.

-- 
Tim Cross



Re: Where to store Blobs?

2019-03-13 Thread Tim Cross


I don't think there is a suitable 'one size fits all' answer to this
question. A lot will depend on how you intend to use the blobs and what
sort of hardware architecture, especially storage systems, you have.

At first glance, sticking everything in the DB seems like an easy
choice. However, that can result in very large databases, which in turn
can lead to issues with respect to backup, replication etc. If all your
after is storage, then sometimes your better off using the file system
for the blobs and keeping the metadata in the db. It can potentially be
faster and easier to serve up blobs from the file system compared to the
db if that is the main use case, but if the blobs are more dynamic or
you use collections of blobs to build a master blob etc, the db has some
advantages.

If you really need database-like functionality, then given the relative
cheapness of storage and the wealth of options available, storing the
blobs in the database can have advantages. However, it will be important
to select the most appropriate datatype. What some people think of as a
'blob' is just an array of bytes to many DBs, so as usual you need to
decide on the best storage representation for your requirements. The
more general 'blob' style storage types offer flexibility with respect
to what can be inserted, but at the cost of functionality; more precise
data types are more restrictive about what can be inserted, but offer
more functionality regarding what you can do with the data (at the db
level).
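
As a rough illustration of the 'metadata in the DB, blob wherever it fits
best' idea (table and column names here are just made up for the example):

-- option 1: keep only metadata plus a reference to the file system object
CREATE TABLE document (
    id           bigserial PRIMARY KEY,
    file_name    text NOT NULL,
    content_type text NOT NULL,
    size_bytes   bigint NOT NULL,
    storage_path text NOT NULL,              -- path/URL on the file store
    uploaded_at  timestamptz NOT NULL DEFAULT now()
);

-- option 2: store the content itself as bytea alongside the metadata
CREATE TABLE document_inline (
    id           bigserial PRIMARY KEY,
    file_name    text NOT NULL,
    content_type text NOT NULL,
    content      bytea NOT NULL,             -- the blob itself, TOASTed by PG
    uploaded_at  timestamptz NOT NULL DEFAULT now()
);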

Tim


Thomas Güttler  writes:

> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
>
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgresSQL.
>
> Where do you store Blobs?
>
> (In my case Blobs are PDF/image files with size up to 20 MByte.
> I do not talk about very big blobs which are several hundret MByte)


--
Tim Cross



Re: Revoke SQL doesn't take effect

2019-01-29 Thread Tim Cross
On Wed, 30 Jan 2019 at 07:49, Jason W  wrote:

> I have two postgresql accounts created by someone else who I do not know
> (So I do not know setting for those accounts and tables created). One is
> read only account e.g. read_only_user (This can perform select operations
> only). The other is admin account e.g. admin_user (This can perform grant,
> revoke, CRUD,  and so on operations).
>
> The read only account can query (select  sql) a table (suppose it's called
> table1) under a specific schema (suppose it's schema1). For instance select
> * from schema1.table1. Now I received a request to revoke select for that
> read only account on table1. So I execute
>
> revoke select on schema1.table1 from read_only_user
>
> psql returns REVOKE string (or something similar showing the sql execution
> was successful) on console. However, when check with read_only_user
> account. I am still able to query table1. Searching the internet, [1] looks
> like the closest to my problem. But I do not find solution in that thread.
>
> So my question:
> What steps do I need to perform in order to exactly revoke select from
> read only user account for a particular table? So the read only user
> account wont' be able query that specific table with select permission
> revoke (psql should returns info like permission denied).
>
> Thanks
>
> [1].
> https://www.postgresql.org/message-id/00ce01cbef04%24910606b0%24b3121410%24%40yahoo.com
>
>
It is likely that permissions for the user are being granted via a role
rather than granted directly to the user (think of a role as a user account
which does not have the login permission). First thing to check would be to
look at what roles have been granted to the read_only user and if one of
those grants select on schema1.table1, revoke/remove it from the role.
There may be other complications, such as roles which do a grant select on
all tables in a schema, so getting the order of things correct is
important. First step, understanding how permissions are granted, then you
should be able to revoke them effectively.
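
The sort of checks I mean, from psql (the role name in the final REVOKE is
just a placeholder for whatever you find):

-- which roles has the read only account been granted?
\du read_only_user

-- who has which privileges on the table in question?
\dp schema1.table1

-- or query the catalog directly
SELECT grantee, privilege_type
  FROM information_schema.table_privileges
 WHERE table_schema = 'schema1'
   AND table_name = 'table1';

-- if the SELECT turns out to come via a role, revoke it from that role
REVOKE SELECT ON schema1.table1 FROM some_read_only_role;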

Tim

-- 
regards,

Tim

--
Tim Cross


Re: Manage PostgreSQL Database for GITLAB Application?

2019-01-21 Thread Tim Cross
>>> Thanks again for your help.
>>>
>>> Regards,
>>>
>>> Karin
>>>
>>> 
>>> *From:* Stephen Frost 
>>> *Sent:* Monday, January 21, 2019 1:53:00 PM
>>> *To:* Hilbert, Karin
>>> *Cc:* pgsql-general@lists.postgresql.org
>>> *Subject:* Re: Manage PostgreSQL Database for GITLAB Application?
>>> Greetings,
>>>
>>> * Hilbert, Karin (i...@psu.edu) wrote:
>>>> Does anyone manage a PostgreSQL database for a GITLAB application?
>>>
>>> Yes.
>>>
>>>> I have PostgreSQL v9.6 installed on my server & we are trying to migrate a 
>>>> GITLAB database there.
>>>>
>>>> The developer says that we need to use the public schema instead of the 
>>>> schema of the same name as the application user.
>>>
>>> Not sure this is really required but it also shouldn't hurt anything
>>> really- I'd definitely have the database be dedicated to gitlab.
>>>
>>>> The schema that he provided me to restore also is revoking all privileges 
>>>> from the database owner & instead granting all privileges to PUBLIC.
>>>
>>> That's terrible.
>>>
>>>> Has anyone else run across this?  I always thought that granting 
>>>> privileges to PUBLIC is a bad security thing to do?
>>>
>>> Yes, that's bad from a security perspective and shouldn't be necessary.
>>> GRANT rights to the user(s) the application logs into, don't just grant
>>> them to PUBLIC- that would allow anyone on the system to have access.
>>>
>>>> If anyone can offer any thoughts regarding this, it would be greatly 
>>>> appreciated.
>>>
>>> Is this developer the only one who is going to be using this gitlab
>>> instance..?  Sounds like maybe they want direct database access which
>>> would only make sense if they're the one running it and should have full
>>> access- but even then, I'd create a role and grant access to that role
>>> and then grant them that role, if that's the requirement.  GRANT'ing
>>> things to public isn't a good idea if you're at all concerned about
>>> security.
>>>
>>> Thanks!
>>>
>>> Stephen
>>
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com

Joining discussion late, so apologise in advance if I repeat information
already provided.

Just wanted to mention that unfortunately, there are some libraries out
there which provide a layer of abstraction for working with databases
and postgres in particular, but which do not handle database schemas at
all well. I'm not defending or criticising such libraries, but want to
point out that sometimes, a developer, who is required to use specific
libraries or modules, may not have the freedom to fully use the power of
database schemas and that sometimes, limitations/restrictions are not
necessarily at the DB level. As DBA's we need to recognise such
restrictions exist, even if they seem misguided. As an example, there
have been issues with at least one of the commonly used db interface
libs/modules used by the Javascript SAILS framework (which I believe was
addressed in later versions) that made working with different schemas
very difficult.

I would agree that the description provided regarding changes to
permissions does raise concerns and hints of a developer under pressure
to make something work with insufficient understanding of the
underlying DB security and access control model. It is likely the
developer needs guidance in this area.

I also would argue that the PUBLIC schema is not in itself a security
risk. The problem is with inappropriate use of that schema. It depends
heavily on how the database is used. A database used for a single
application has a completely different security and risk profile from a
database used by multiple users for different applications. Arbitrary
rules such as 'you won't use PUBLIC' are almost always wrong and often
just make both developer and dba lives more complicated and harder to
maintain. Complexity is where things go wrong and where security tends
to break down.

Rather than requiring the developer to use a specific schema, I would

1. Ask them why they believe they have to use the PUBLIC schema
2. If the need to use the PUBLIC schema is confirmed, then work with the
developer to understand what the access requirements are and develop an
appropriate model.
3. If there is no dependency on using the PUBLIC schema, work with the
developer to assist them to resolve their access issues.

Depending on the size of the organisation and complexity of the
environment, choice of libraries and modules is not always as
straight-forward. It may not be easy to switch to another library/module
with better support for schemas etc or even to upgrade to a new
version. Often, such changes will need to be managed in stages and over
time. Work with the developers as complex environments will frequently
require a united voice in order to get changes approved or prioritised.

Tim

--
Tim Cross



Re: Add columns to table; insert values based on row

2018-11-01 Thread Tim Cross


Rich Shepard  writes:

> On Thu, 1 Nov 2018, Adrian Klaver wrote:
>
>>> alter table stations add column start_date date;
>>> alter table stations add column end_date date;
>>> alter table stations add column howmany integer;
>>> alter table stations add column bin_col char(8);
>>> 
>>> insert into stations (start_date, end_date, howmany, bin_col) values ( )
>>>   select site_nbr from stations
>>>   where site_nbr = ' ';
>
>> Are you trying to add new data to existing records?
>
> Adrian,
>
>I am adding four columns to an existing table that already contains four
> columns.
>
>> If so where is the new data coming from?
>
>I have a text file and will fill each insert statement by hand if there's
> not a more efficient way to do this.
>
> Regards,
>
> Rich

Like others, I'm not clear on exactly what you're after here, but did want
to point out

1. If you're doing it by hand, you don't have to do a separate 'full'
insert statement for every row i.e.

insert into blah (x, y, z) values
(...),
(...),
(...),
...
(...);

is valid syntax. You don't need to do a full "insert into blah (...)
values (...)" for each insert.

2. If it really is an insert you want to do and you already have the
data in a file e.g. CSV or similar, then you can use the \copy command
to process the file, which is very fast.

3. Is it really insert or update you need?
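
If it does turn out to be an update of existing rows (i.e. filling the new
columns from your text file), a rough sketch would be to load the file into
a staging table with \copy and update from that - the file name and column
types below are only guesses:

create temp table station_updates (
    site_nbr   text,
    start_date date,
    end_date   date,
    howmany    integer,
    bin_col    char(8)
);

\copy station_updates from 'stations_extra.txt' with (format csv)

update stations s
   set start_date = u.start_date,
       end_date   = u.end_date,
       howmany    = u.howmany,
       bin_col    = u.bin_col
  from station_updates u
 where s.site_nbr = u.site_nbr;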


-- 
Tim Cross



Re: Oracle vs PG

2018-10-23 Thread Tim Cross


Ravi Krishna  writes:

>> Again, pretty much content-free. For all you know some application was 
>> creating savepoints, needlessly:
>
>> https://www.postgresql.org/docs/10/static/sql-savepoint.html
>
> I have hardly used savepoints in any application, but if I understand it 
> correctly, isn't it something which is typically used
> in a persistent connection.  I wonder how it is applicable in a web based 
> stateless application like Amazon.com, unless
> even web based application have database level state.

No, savepoints and persistent connections are not necessarily related.

Savepoints are really just a way of limiting how much work is at risk
within a single transaction (in Oracle terms, how much undo/rollback you
accumulate). For example, if you were doing a large number of
inserts/updates in one transaction, you can set savepoints along the way
so that an error part way through only forces you to roll back to the
last savepoint, rather than abandoning all of the work done so far in
the transaction.
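
For anyone not familiar with the syntax, a savepoint is just a named point
inside a transaction you can roll back to without abandoning the whole
transaction (the table below is purely illustrative):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- work to keep
SAVEPOINT leg_two;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- suppose this errors
ROLLBACK TO SAVEPOINT leg_two;  -- undo only the part after the savepoint
COMMIT;                         -- the first update is still committed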

I think that article is really just about headline click bait and lacks
any real details. I'm not even convinced that comparison of Oracle and
PG really makes sense anyway - both databases have their pros and cons.

IMO Oracle is a very good database (though most of the 'add ons' are
less beneficial). However, it is extremely expensive, both to license
and to administer. For certain applications, it would be my first choice
(assuming available budget). However, I prefer PG for the majority of
what I need, partially due to the cost, but mainly because it is rock
solid and much, much easier to administer and sufficient for what I
need. As usual, it is more about requirements than brand and choosing
the right tool for the right job.

Tim

-- 
Tim Cross



Re: Weird procedure question

2018-09-25 Thread Tim Cross


digimer  writes:

> Hi all,
>
>   I've got an interesting use case that I am stuck on. It's a bit of a 
> complicated environment, but I'll try to keep it simple.
>
>   In short; I have a history schema that has tables that match the 
> public schema, plus one 'history_id' column that has a simple sequential 
> bigserial value. Then I have a procedure and trigger that, on UPDATE or 
> INSERT, copies the data to history. Example use case is that I can 
> UPDATE a sensor value in the public table and it's also INSERTs the data 
> into history. So public shows just the most recent values, but I can see 
> changes over time in the history schema.
>
>   I have built my system to support writing to one or more DBs. I keep 
> a list of connected DBs and send INSERT/UPDATE calls to a method that 
> then runs the UPDATE/INSERT against all connected databases, as a form 
> of redundancy. This all works fine.
>
>   The problem I've hit is that the 'history_id' differs between the 
> various databases. So I want to switch this to 'history_uuid' and use 
> UUIDs instead of bigserial.
>
>   Now the question;
>
>   Can I tell a produce to use a specific UUID?
>
>   The idea is to generate a UUID for 'history_uuid' so that I have 
> consistency across databases. Of course, if an UPDATE will change 
> multiple rows, then I'll need to predefine multiple UUIDs. This is where 
> things start to get really complicated I think... Maybe I could pass an 
> array of UUIDs? I don't care if I find out which UUID was used for which 
> record, just that the same UUID was used for the same record when the 
> procedure is (re)run on other DBs.
>
>   The databases are not clustered, on purpose. I've been trying to 
> handle all the HA stuff in my application for various reasons.
>
> If it helps, here is an example pair of tables, the procedure and the 
> trigger I currently use;
>
> 
> CREATE TABLE host_variable (
>   host_variable_uuid uuid not null primary key,
>   host_variable_host_uuid uuid not null,
>   host_variable_name text not null,
>   host_variable_value text not null,
>   modified_date timestamp with time zone not null
> );
> ALTER TABLE host_variable OWNER TO admin;
>
> CREATE TABLE history.host_variable (
>   history_id bigserial,
>   host_variable_uuid uuid,
>   host_variable_host_uuid uuid,
>   host_variable_name text,
>   host_variable_value text,
>   modified_date timestamp with time zone not null
> );
> ALTER TABLE history.host_variable OWNER TO admin;
>
> CREATE FUNCTION history_host_variable() RETURNS trigger
> AS $$
> DECLARE
>   history_host_variable RECORD;
> BEGIN
>   SELECT INTO history_host_variable * FROM host_variable WHERE 
> host_uuid = new.host_uuid;
>   INSERT INTO history.host_variable
>   (host_variable_uuid,
>   host_variable_host_uuid,
>   host_variable_name,
>   host_variable_value,
>   modified_date)
>   VALUES
>   (history_host_variable.host_variable_uuid,
>   history_host_variable.host_variable_host_uuid,
>   history_host_variable.host_variable_name,
>   history_host_variable.host_variable_value,
>   history_host_variable.modified_date);
>   RETURN NULL;
> END;
> $$
> LANGUAGE plpgsql;
> ALTER FUNCTION history_host_variable() OWNER TO admin;
>
> CREATE TRIGGER trigger_host_variable
>   AFTER INSERT OR UPDATE ON host_variable
>   FOR EACH ROW EXECUTE PROCEDURE history_host_variable();
> 
>
>   I know this might sound odd, but I didn't want to complicate things 
> with how my system works. However, if it would help solve the problem, 
> I'm happy to dig into more detail.
>
>   Thanks!

I think James has probably given you the input you need - basically,
don't allow the system to automatically set the modified time - make
that a parameter to your function or set that value before the copy to
the history tables. The content would then be the same, so a v3
(content-derived) UUID should work.
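
For what it's worth, a rough sketch of the content-derived UUID idea using
the uuid-ossp extension (the column concatenation is just illustrative -
pick whatever combination uniquely describes the row content):

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

SELECT uuid_generate_v3(
         uuid_ns_url(),          -- any fixed namespace, same on every DB
         host_variable_uuid::text
           || host_variable_name
           || host_variable_value
           || modified_date::text
       ) AS history_uuid
  FROM host_variable;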

However, I do think you have another big problem lurking in the
shadows. What happens if any of your connected databases are unavailable
or unreachable for a period of time? I suspect you're going to run into
update anomalies and, depending on your setup/environment, possibly even
partitioning problems (depending on number of clients and topology
etc). These are well-known problems in distributed or replication
systems.

You appear to be implementing a 'poor man's' replication system. There
are lots of complex issues to deal with and I wonder why you want to
take them on when PG has already got well tested and robust solutions
for this that would simplify your architecture and avoid the need to
re-implement functionality which already exists?

regards,

Tim

-- 
Tim Cross



Re: Converting to number with given format

2018-09-19 Thread Tim Cross


Gabriel Furstenheim Milerud  writes:

> Sorry,
> So basically what I'm trying to achieve is the following. There is an input
> file from the user and a configuration describing what is being inserted.
> For example, he might have Last activity which is '-MM-DD HH:mi:ss' and
> Join date which is only '-MM-DD' because there is no associated timing.
> For dates this works perfectly and it is possible to configure what the
> input from the user will be. Think it is as a dropdown where the user says,
> this is the kind of data that I have.
>
> Maybe that is not possible with numbers? To say in a format something like
> "my numbers have comma as decimal separator and no thousands separators" or
> "my numbers are point separated and have comma as thousands separator"
>
> Nice thing of having a string for the format is that I can use it as a
> parameter for a prepared statement.
>

I think this is normally something much better dealt with at the client
level. Things like comma separators/grouping in numbers are really just a
'human' thing and are very locale dependent. The values 9,999 and 9999
are the same value. Things can quickly become complicated as you can
have locale information at both the server and client end and they may
not be the same.

As you should always be sanitising your data before inserting into the
database anyway, you may as well just add this as another check at the
client end.
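
For reference, to_number() does accept a format string, but the G (group)
and D (decimal) placeholders are interpreted according to the server's
lc_numeric locale, which is exactly the server/client coupling I mean:

SELECT to_number('9,999.50', '9G999D99');  -- 9999.50 under an en_US style locale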

Tim

-- 
Tim Cross



Re: WARNING: could not flush dirty data: Function not implemented

2018-09-03 Thread Tim Cross


Tom Lane  writes:

> Ravi Krishna  writes:
>>> Whee ... so you get to cope with all the bugs/idiosyncrasies of three
>>> operating system layers, not just one.  I concur that running Postgres
>>> in the underlying Windows O/S is probably a much better idea.
>
>> Me too, but this is purely for learning and I am much more use to Linux 
>> stack then ... gasp Windows :-)
>
> Hmm, so maybe you should install Ubuntu as the native O/S, and when
> you need Windows, run it inside a VM?
>
>   regards, tom lane

This is what I do and it works well, with one exception.

If you don't run the windows VM very often (like me), when you do, start
it before lunch or that next long meeting. The updates will grind things
to a crawl. When you run every day or fairly frequently, you don't
notice them, but if you only run once every 4+ weeks, it can have a big
impact and take ages.

Linux as the host and using the VM is still better than the weird
idiosyncrasies of Windows as the main workstation client (IMO). Where
office policy has insisted on Windows as the workstation, I have had
reasonable success with running virtualbox with Linux, though these
days, the Windows environment is often too locked down to allow
this. I've not yet experimented with the virtual linux layer in w10. 

-- 
Tim Cross



Re: WARNING: could not flush dirty data: Function not implemented

2018-09-03 Thread Tim Cross


Tom Lane  writes:

> Andres Freund  writes:
>> On 2018-09-02 19:29:49 -0400, Tom Lane wrote:
>>> If this is on Ubuntu, I don't understand why you're talking
>>> about Windows.
>
>> The OP said "Ubuntu 18.04 as Windows bash" - so I assume this is
>> postgres compiled as a linux binary is running on MS's new-ish linux
>> emulation.
>
> Whee ... so you get to cope with all the bugs/idiosyncrasies of three
> operating system layers, not just one.

That comment has made my day - thanks Tom!


-- 
Tim Cross



Re: Ways to deal with large amount of columns;

2018-08-30 Thread Tim Cross
On Fri, 31 Aug 2018 at 10:47, a <372660...@qq.com> wrote:

> Thank you very much. Creating a function seems to be a good idea :)
>
>
> -- Original message --
> *From:* "David G. Johnston";
> *Sendtime:* Thursday, Aug 30, 2018 8:31 PM
> *To:* "a"<372660...@qq.com>;
> *Cc:* "pgsql-general";
> *Subject:* Re: Ways to deal with large amount of columns;
>
> On Thursday, August 30, 2018, a <372660...@qq.com> wrote:
>
>> Hi all:
>>
>> I need to make a table contains projected monthly cashflow for multiple
>> agents (10,000 around).
>>
>> Therefore, the column number would be 1000+.
>>
>> I would need to perform simple aggregate function such as count, sum or
>> average on each cashflow projected.
>>
>> So if there is anyway of doing this? Will there be anything like define a
>> macro in C that I can manipulate multiple columns by simple word that
>> representing them.
>>
>
> Better to design a data model that doesn't have so many columns. Otherwise
> generating dynamic SQL via the for,at function and loops and such is your
> best bet.  Can be down in pl/pgsql or in your preferred programming
> language.  Psql variables can maybe be useful too.
>
> David J.
>
>

Perhaps post your proposed table design/definition. There is nothing in
what you have described so far which would indicate a necessity to have
more columns as you increase the number of agents.  It would be normal to
have something like

| agent_id | year | cash_on_hand | bank | creditors | debtors | 

and queries like

select sum(cash_on_hand)
from table
where agent_id = 'agent1'
and year = 2018;

to get the sum of cash on hand for agent1 in 2018.

instead of something like

| agent1_cash2018 | agent2_cash2017 | 

which will not work well.

Tim


-- 
regards,

Tim

--
Tim Cross


Re: Ways to deal with large amount of columns;

2018-08-30 Thread Tim Cross


a <372660...@qq.com> writes:

> Hi all:
>
>
> I need to make a table contains projected monthly cashflow for multiple 
> agents (10,000 around).
>
>
> Therefore, the column number would be 1000+. 
>
>

Not sure your data model is correct. Typically, with something like
this, increasing the number of agents would result in tables with more
rows rather than more columns. Tables with large numbers of columns is
often a 'code smell' and indicates the underlying data model needs to be
reviewed. Designs which result in new columns being required because you
are adding new data sources is almost certainly an indication of the
need to review the data model.

Postgres (and many other databases) have lots of functionality to help
deal with tables that have large numbers of rows, but working with
tables that have large numbers of columns has less functionality and
options.

While it is very tempting to create a table and then start coding, you
will almost always get a much better result and simpler code if you
spend some initial time to really analyse your domain, understand the
data elements and how they relate to each other, map them out into a
data model and then start development. Have a look at
https://en.wikipedia.org/wiki/Database_normalization for some background
on the normal forms and why they are useful.
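
As a rough sketch of the shape this usually takes - one row per agent per
month rather than one column per month (names are just made up):

CREATE TABLE agent_cashflow (
    agent_id integer       NOT NULL,
    month    date          NOT NULL,   -- e.g. first day of the month
    amount   numeric(14,2) NOT NULL,
    PRIMARY KEY (agent_id, month)
);

-- aggregates then become ordinary queries rather than column gymnastics
SELECT agent_id, count(*), sum(amount), avg(amount)
  FROM agent_cashflow
 GROUP BY agent_id;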

HTH

Tim



-- 
Tim Cross



Re: using a plpgsql function argument as a table column.

2018-08-28 Thread Tim Cross
Off the top of my head, I think you could do this using dynamic (execute)
SQL in a function. However, it is going to be messy, possibly slow and
likely fragile. You would need to query the catalogue to get the column
names in the table and then build the SQL dynamically 'on the fly'.
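
To show the shape of it only (untested, greatly simplified, and assuming the
year columns are numeric - the table name t2 is just your placeholder):

create or replace function values_for_year(year_col text)
returns setof numeric
language plpgsql
as $$
begin
    -- format() with %I quotes the identifier, so the argument is treated
    -- as a column name rather than arbitrary SQL
    return query execute format(
        'select %I from t2 where %I <> 0', year_col, year_col);
end;
$$;

select * from values_for_year('y2016');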

Without having more detail, my spider sense tells me you have the wrong
table/relationship design. While you may be able to get it to work, it is
likely you will run into constant problems and additional complexity that
could be avoided with a different design. You really want a design where
your queries are driven by the data in your tables and not by the names of
columns. I would seriously consider re-examining your schema design, look
at how your design fits in with the normal forms and adapt as necessary.

Tim

On Wed, 29 Aug 2018 at 15:10, ss  wrote:

> I have a table with many years as columns. y1976, y2077, .. ,
> y2019,y2020 I want to dynamically return a column from a function.
>
>
> select * from FUNCTION('y2016') .
>
> select t1.cola t1.colb, t1.colc, t2.y2016 from . Where t2.y2016 != 0;
>
> or if I select year y2012 I want FUNCTION('y2012')
>
> select t1.cola t1.colb, t1.colc, t2.y2012 from . Where t2.y2012 != 0;
>
>
> to generalize
>
> select * from FUNCTION( year_column )
>
> select t1.cola t1.colb, t1.colc, t2.year_column from . Where
> t2.year_column != 0;
>
> is it possible? if so how?
>
>
>

-- 
regards,

Tim

--
Tim Cross


Re: unorthodox use of PG for a customer

2018-08-24 Thread Tim Cross


David Gauthier  writes:

> Hi Everyone:
>
> I'm going to throw this internal customer request out for ideas, even
> though I think it's a bit crazy.  I'm on the brink of telling him it's
> impractical and/or inadvisable.  But maybe someone has a solution.
>
> He's writing a script/program that runs on a workstation and needs to write
> data to a DB.  This process also sends work to a batch system on a server
> farm external to the workstation that will create multiple, parallel
> jobs/processes that also have to write to the DB as well. The workstation
> may have many of these jobs running at the same time.  And there are 58
> workstation which all have/use locally mounted disks for this work.
>
> At first blush, this is easy.  Just create a DB on a server and have all
> those clients work with it.  But he's also adamant about having the DB on
> the same server(s) that ran the script AND on the locally mounted disk.  He
> said he doesn't want the overhead, dependencies and worries of anything
> like an external DB with a DBA, etc... . He also wants this to be fast.
>

I would agree the customer's proposed architecture has problems. It is
likely to be fragile and difficult to maintain. At some point, there
will likely be a need to consolidate the data in all these separate
databases, which could lead to other problems.

It sounds like there is some previous experience which has caused
problems for your customer and they are trying to avoid a repeat by
defining the technical solution rather than asking for a solution. The
first step is to spend more time talking to your customer and getting to
understand the underlying reasons why he is proposing those
technical/architecture constraints. I think once you have full details
regarding his requirements and the risks as he perceives them, you will
likely be able to come up with a much more workable solution which will
both address his/her concerns and be an architecture which is solid and
maintainable. There is a good chance all the reasons will not be purely
technical. 

My wild guess is that previously, there have been problems with central IT
services - probably bureaucracy, poor response times, poor communication or
misalignment with regard to expectations. I
often encounter this type of problem working with researchers who are
very smart and informed in their local area of expertise, but less so
when it comes to understanding the complexities, maintenance overheads
and other activities associated with providing reliable services
(backups, upgrades, tuning etc). The two areas (IT and customer)
frequently speak different languages even when using the same words. It
can be extremely challenging to get clear, consistent and agreed
requirements. For example, what does 'fast' mean?

The 'fast' requirement and the desire to have things run locally could
indicate a concern regarding network performance. I find performance is
often blamed on the network, but this is less often the case in modern
networks. More often than not it is poor algorithm design, badly tuned
databases or badly designed database schemas and CPU or memory
limits.

Pointing out the maintenance overheads and possible failure points in
his proposed architecture may help. Being able to collect real metrics
to demonstrate where bottlenecks and performance issues reside will also
help going forward - good metrics and hard numbers can often circumvent
circular arguments regarding problem causes.

Also consider your internal business processes. I've seen too many good
architecture solutions becoming perceived as failures because the other
non-technical stuff failed - poor communications, failure to align
technical maintenance with business requirements, unnecessary delays or
poor responses to enquiries and questions and inability to adapt to
changing business requirements in a timely manner. This is often the
most frustrating part - you can be an excellent technical person able to
define and implement really good technical solutions, but if the
customer is unable to use the solution effectively, it will be seen as a
technical failure.

Tim
--
Tim Cross



Re: Can Pg somehow recognize/honor linux groups to control user access ?

2018-08-22 Thread Tim Cross


Dimitri Maziuk  writes:

> On 08/22/2018 11:38 AM, Ravi Krishna wrote:
>>
>> In fact DBAs don't even need to get involved when a new user needs DB 
>> access.  Sysadmin had to just add that user in a group and we are done.
>
> How is that different from giving your grants to a database role and
> just telling the new user the name and password of that role to connect as?

I hope your not suggesting that multiple users use the same login
credentials to access the database? This would totally destroy the
important security principals of attribution and repudiation.

I have not looked at PAM for PG, but that is certainly something worth
checking out. I have used it in other contexts and it is very
powerful. Depending on the implementation, theoretically, it should be
possible to have PAM provide the information to determine what database
roles to give to a login. What you will probably need is some additional
component to act as the 'meta-directory' so that you can have consistent
usernames, uid/gid across servers/systems (i.e. ldap/ad)

Something else to look at is openLDAP. Again, not yet looked at it in
the PG context, but have used it successfully to manage access to other
systems (including databases in other RDBMSs) before. Currently, we use
openLDAP and PAM to manage access on Linux servers. One of the things on
my 'todo' list is to look at it in a PG context, just haven't got there
yet.

Avoid any solution which requires a 'generic' account with shared
passwords. Apart from the security implications, you will almost
certainly run into problems with auditors and many regulatory
standards. 

--
Tim Cross



Re: AW: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread Tim Cross


kpi6...@gmail.com writes:

>> -Ursprüngliche Nachricht-
>> Von: Ravi Krishna 
>> Gesendet: Samstag, 18. August 2018 18:25
>> 
>> > What can I do to improve the performance of the regular query without
>> using a CTE?
>> 
>> Why do you care ?  When I find that I can write a SQL 3 different ways, I 
>> will
>> go for the most efficient one.  So why not accept the CTE version of this 
>> SQL.
>> Just curious.
>
> We're using object mapping / entity frameworks (e.g. XPO, Entity Framework 
> Core). These frameworks support regular queries out-of-the box; a CTEs 
> require additional effort and are more difficult to maintain. 
>

Ah, another reason to avoid object mapping/entity frameworks! I guess
really the same reason - loss of flexibility and expressive power.

Sorry, having a similar battle with some developers who are insisting on
using a particular framework because it makes maintenance easier as it
'automates' creation of controllers (MVC). However, they are frustrated
by performance and I'm frustrated as the framework also fails to pass
additional information, such as PGAPPNAME, which would make some
analysis easier. Part of the reason for the performance issues is
that the developers are doing things with result sets within the
client that would be far more efficiently performed within the database.

One way I have resolved this in the past is to create database
procedures or views which present a 'mapped' interface back to the
framework layer and hide the SQL from the framework. This works well,
with the main downside being that you now have SQL in another place,
which can make some people uncomfortable and can be a maintenance issue
if all your developers are front-end devs who treat a database as just a
key/value repository.
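
The sort of thing I mean, purely illustrative (names invented): push the CTE
into a view and let the framework map to the view like any other 'table':

CREATE VIEW order_summary AS
WITH line_totals AS (
    SELECT order_id, sum(amount) AS order_total
      FROM order_line
     GROUP BY order_id
)
SELECT o.order_id, o.customer_id, t.order_total
  FROM orders o
  JOIN line_totals t ON t.order_id = o.order_id;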

Tim
-- 
Tim Cross



Re: vPgSql

2018-08-17 Thread Tim Cross


Vlad ABC  writes:

> On Fri, 2018-08-17 at 15:45 +0300, Dmitry Igrishin wrote:
>> 
>> Looking nice! Thank you. But I unable to start it on Ubuntu, because
>> there is no bash(1) in /usr/bin.
>
> Thank you, i'll fix it.
>

I think pretty much all *nix systems put core shells like sh, bash, zsh
etc in /bin (as it is guaranteed to be available immediately at boot, while 
/usr is
not - it could be a separate partition which isn't available until later
in the boot process).

A way to avoid platform differences is to use /usr/bin/env e.g.

#!/usr/bin/env bash

>> Also, it is open source?
>
> No, it is freeware

oh well, too bad.

-- 
Tim Cross



Re: Safe operations?

2018-08-12 Thread Tim Cross
On Mon, 13 Aug 2018 at 12:23, Olivier Gautherot 
wrote:

> On Sun, Aug 12, 2018 at 11:06 PM, Tim Cross  wrote:
>
>>
>> On Mon, 13 Aug 2018 at 11:24, Adrian Klaver 
>> wrote:
>>
>>> On 08/12/2018 05:41 PM, Samuel Williams wrote:
>>> > I wish the documentation would include performance details, i.e. this
>>> > operation is O(N) or O(1) relative to the number of rows.
>>> >
>>> > I found renaming a table was okay.
>>> >
>>> > How about renaming a column? Is it O(1) or proportional to the amount
>>> of
>>> > data?
>>> >
>>> > Is there any documentation about this?
>>>
>>> https://www.postgresql.org/docs/10/static/sql-altertable.html
>>>
>>> "RENAME
>>>
>>>  The RENAME forms change the name of a table (or an index, sequence,
>>> view, materialized view, or foreign table), the name of an individual
>>> column in a table, or the name of a constraint of the table. There is no
>>> effect on the stored data.
>>> "
>>>
>>> Just wondering - what about the case when the column being renamed is
>>> also referenced in an index or check constraint? (I would guess you cannot
>>> rename a column used in a check constraint without first removing it, but
>>> for an index, would this result in the index being rebuilt (or do you have
>>> to take care of that manually or are such references abstracted such that
>>> the column name "text" is irrelevant to the actual structure of the
>>> index?).
>>
>>
> Tim, as far as I know, names are only an attribute tagged to an OID.
> Internal relations are though these OIDs, not names, so renaming a column
> is really one-shot. Names are mainly a more convenient way of referring to
> objects.
>
> Olivier
>

Thanks Olivier, that is what I suspected and your explanation fits with my
mental model. I had assumed table/column names are a convenience for humans
and that the system would use OIDs etc. for internal references.
-- 
regards,

Tim

--
Tim Cross


Re: Safe operations?

2018-08-12 Thread Tim Cross
On Mon, 13 Aug 2018 at 11:24, Adrian Klaver 
wrote:

> On 08/12/2018 05:41 PM, Samuel Williams wrote:
> > I wish the documentation would include performance details, i.e. this
> > operation is O(N) or O(1) relative to the number of rows.
> >
> > I found renaming a table was okay.
> >
> > How about renaming a column? Is it O(1) or proportional to the amount of
> > data?
> >
> > Is there any documentation about this?
>
> https://www.postgresql.org/docs/10/static/sql-altertable.html
>
> "RENAME
>
>  The RENAME forms change the name of a table (or an index, sequence,
> view, materialized view, or foreign table), the name of an individual
> column in a table, or the name of a constraint of the table. There is no
> effect on the stored data.
> "
>
> Just wondering - what about the case when the column being renamed is also
> referenced in an index or check constraint? (I would guess you cannot
> rename a column used in a check constraint without first removing it, but
> for an index, would this result in the index being rebuilt (or do you have
> to take care of that manually or are such references abstracted such that
> the column name "text" is irrelevant to the actual structure of the
> index?).



>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>

-- 
regards,

Tim

--
Tim Cross


Re: How to revoke privileged from PostgreSQL's superuser

2018-08-06 Thread Tim Cross


bejita0...@yahoo.co.jp writes:

> Hello,
>
> I am a newbie DBA.
>
> I have a request for revoking the access to user's data from DBA-user.
> I think the request is right because users should be the only ones can access 
> their data.
> But DBA-user also need full access to the other data? It means that DBA-user 
> also needs to be a superuser.
>
> So I conclude the request that how to revoke privileged from superuser in 
> postgres.
>
> As my knowledge, the superuser in PostgreSQL bypasses all the permission 
> check.
> So that, there is no way to do it in PostgreSQL, is that right?
>
> Is there some DBAs are faced with this before?
>

There are certainly DBAs who have had very similar requests. Often, they
are generated by non-technical people who don't really understand how
the technology works and have concerns over who has access to the data.
A common one is people who are concerned about who has access to their
email - we had a similar challenge from our Chief Legal Officer, who was
paranoid that sys admins were reading his highly sensitive email. This
was despite the fact that, one week previously, I was on a flight
sitting in the seat behind him while he read his email on his iPad,
which I could (but didn't) easily read over his shoulder!

The key to handling this sort of request is to dig deeper to understand
what the real risk is that they want addressed and work out how you can
do that within the constraints of the technology and what makes sense
within your context. I'm sure someone will respond to this thread with
all sorts of highly restrictive and powerful controls that will restrict
access to the data, but if they are not appropriate for your business
context, will likely cripple the very process you are trying to
protect. All controls/restrictions cause some level of inconvenience -
the challenge is in getting the balance right so that the identified
risk is mitigated with the least level of inconvenience to normal
business operations.

The reality is that at various times, humans will need the ability to
access the data in ways which will limit, if not completely prevent,
your ability to restrict access. This is particularly relevant for
system and database administrators. It is pretty much 100% impossible to
guarantee that a sys admin or DBA cannot access data. However, what you
can do is approach the problem slightly differently and look at ways to
make this access harder and more importantly, make sure that all access
is logged appropriately and can be audited, ensuring the
logging/auditing system is also protected from deletion or modification.

Other posts in the thread include some good pointers on what you can do
to help with this. The principals are pretty straight forward. Possibly
the most important thing to do is ensure there is no 'anonymous' access
e.g. you cannot login to the database as 'postgres' or some other
generic account which multiple people have access to. Instead, ensure
that everyone with any level of administrator privilege has to login
using an account which is specific to them and not shared. The second
thing to do is ensure the logging level is appropriate and that all
logging is also stored/recorded on a system which the administrator does
not have access to and ensure the level of privileges every
individual has is at the minimum they require to get the job done. It is
also important that logs and audit trails are regularly reviewed to
ensure nobody is abusing the system and all controls are still
appropriate (things change, new systems come on line, old ones are
retired, business processes change etc).
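
As a rough sketch of what that can look like in practice (the role names,
schema and settings below are purely illustrative, not specific to any
particular site):

CREATE ROLE dba_phil LOGIN PASSWORD 'change-me' CREATEDB CREATEROLE;
CREATE ROLE app_reporting LOGIN PASSWORD 'change-me';
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO app_reporting;

-- postgresql.conf settings which help make access auditable, ideally
-- shipped to a log host the administrators cannot modify:
--   log_connections = on
--   log_disconnections = on
--   log_statement = 'ddl'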

If necessary, consider controls which restrict access to accounts with
extended privileges to certain hosts e.g. DBA 'Phil' can only log into
the database from server xxx.xxx.xxx.xxx and he can only log into that
server between 9am and 5pm Mon - Fri etc. Maybe he has to use a hardware
token etc. 

In most cases, provided you can give a strong guarantee that unauthorised
data access can be identified, you will satisfy the security
requirements and this is often far more feasible than outright blocking
of access. 

Finally, it is also important that all staff are aware of the
organisation's policies, procedures and controls regarding data
access. They need to know what is expected of them and what is
unacceptable.  


--
Tim Cross



Re: User documentation vs Official Docs

2018-07-19 Thread Tim Cross
Our University provides access to a Linux server for any student (not just
those in data science etc)  or staff member and that computer has Postgres
available for anyone who wants to use it. The server is also accessible
remotely (80% of our student base is remote/on-line). You also get a shell
account and can install any software which can be installed and run from
that account.  At another University I do some work for, they have moved to
a virtual environment, where students are able to spin up a virtual
computer on demand and have full access to install whatever software they
like (though there are some constraints on what can be set up to 'persist'
across instances. You could install PG, but I'm not sure if it would be
restored next time you spin up your virtual server).

From personal experience, I can say that when I was a student, a $60 book
was very difficult to justify/afford and I greatly valued on-line resources
at that time.  I made extensive use of the library, but obtaining specific
books was not as easy as asking for them - the library has limited space
and can only maintain collections on a demand basis, so you were unlikely
to get a book just based on request.

A further aspect about on-line resources not yet mentioned is the
accessibility aspect. As a blind programmer, I know the huge benefits of
electronic resources compared to dead trees!

Tim


On Fri, 20 Jul 2018 at 11:03, Melvin Davidson  wrote:

>
>
> On Thu, Jul 19, 2018 at 8:54 PM, Adrian Klaver 
> wrote:
>
>> On 07/19/2018 05:43 PM, Melvin Davidson wrote:
>>
>>>
>>>
>>>
>>
>>>
>>>  > Then again people might use shared, university or library computers
>>> Would you please be so kind as to inform us which university or library
>>> allows users to install software on a _shared_ computer.
>>>
>>
>> Pretty sure Ken was referring to looking up documentation, not running
>> Postgres.
>>
>>
>>> BTW, since you mention library, that is an excellent way to have the
>>> books ordered and shared.>FOR FREE<.  AFAIK, all that is required is for
>>> someone to request the library purchase the book, to be used for shared
>>> learning.
>>>
>>>
>>> --
>>> *Melvin Davidson**
>>> Maj. Database & Exploration Specialist**
>>> Universe Exploration Command – UXC***
>>> Employment by invitation only!
>>>
>>
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com
>>
>
> > Pretty sure Ken was referring to looking up documentation, not running
> Postgres.
> That does not correlate. To have the need to look up documentation implies
> that the user has a computer running PostgreSQL.
> As universities DO NOT ALLOW software to be installed on shared computers,
> and this is the case especially in a library, it implies
> the user has their own computer. As libraries allow users/citizens to
> request books be purchased >at no cost to the user/citizen, the
> argument that someone cannot afford a book is now a moot point.
>
> --
> *Melvin Davidson*
> *Maj. Database & Exploration Specialist*
> *Universe Exploration Command – UXC*
> Employment by invitation only!
>


-- 
regards,

Tim

--
Tim Cross


Re: User documentation vs Official Docs

2018-07-19 Thread Tim Cross


Peter J. Holzer  writes:

> On 2018-07-18 08:09:35 +1000, Tim Cross wrote:
>> If using web widgets to author content on the wiki is the main
>> impediment for contributing content, maybe we should see if the wiki
>> provides alternative access methods. I've used wikis in the past which
>> allowed users to upload content via xmlrpc, api etc. Perhaps something
>> similar could be made available for those making significant
>> contributions or to a select few 'curators' who could accept content
>> from others.
>
> There are also browser plugins like It's all text, textern, wasavi, etc.
> which allow the user to use a real text editor instead of a text area.
>
> hp

+1. Should have remembered that option given I have such a plugin myself
and use it often!

-- 
Tim Cross



Re: Do we need yet another IDE (SQL development assistant) for PostgreSQL?

2018-07-16 Thread Tim Cross


Dmitry Igrishin  writes:

> пн, 16 июл. 2018 г. в 1:14, Tim Cross :
>
>>
>> Your idea to make it integrate with user's preferred editor is a good
>> idea as editors are like opinions and certain anatomical parts -
>> everyone has one! Finding an appropriate API to do this will be a
>> challenge.
>>
> I see two options here: the core of the tool acts as a long-lived server or
> as a short-lived
> console application which communicates with the editor's plugin via
> stdin/stdout.
> Btw, what the text editor do you prefer? :-)
>

Most of the time, I use Emacs on either Linux or macOS. With the support
it has for running a psql process, it works pretty well for most
things. There are pretty reasonable packages for writing SQL and
'static' completion. Getting things set up can take a bit of effort, but
once it is working, it tends to work pretty well.

The two areas where it lacks are dynamic completion i.e. completing on
objects the user has created such as table names and column
names/function names etc. and decent result formatting. 

>>
>> I seem to remember reading somewhere that Oracle was going to remove
>> swing from the core java library. I've always been a little disappointed
>> with Java UIs and found they don't give the cross-platform support that
>> Java originally promised, plus OSX/macOS has not made Java as welcome as
>> it use to be. If you do choose Java, it will need to work under openJDK
>> as this is what most Linux users will have installed.
>>
> For now, the possible options for the GUI part are Qt, wxWidgets or FLTK,
> or even Electron.

I would look at either Qt or even Electron (I believe visual code is
written using Electron, which is the other editor I use from time to
time).

There was an Emacs project called Eclaim (I think) which interfaced with
Eclipse services in order to provide dynamic completion when doing
Java. That could be worth checking out for ideas to borrow.

Tim

-- 
Tim Cross



Re: User documentation vs Official Docs

2018-07-16 Thread Tim Cross


Joshua D. Drake  writes:

> -general.
>
> Over the last year as I have visited many meetups and interacted with 
> people at conferences etc... There are three prevailing issues that 
> continue to come up in contributing to the community. This email is 
> about one of them. Where is the "user" documentation? The official 
> documentation is awesome, if you know what you are doing. It is not 
> particularly useful for HOWTO style docs. There is some user 
> documentation in the wiki but let's be honest, writing a 
> blog/article/howto in a wiki is a pain in the butt.
>
> What does the community think about a community run, community 
> organized, sub project for USER documentation? This type of 
> documentation would be things like, "10 steps to configure replication", 
> "Dumb simple Postgres backups", "5 things to NEVER do with Postgres". I 
> imagine we would sort it by version (9.6/10.0 etc...) as well as break 
> it down via type (Administration, Tuning, Gotchas) etc...
>
> What do we think?
>

I think encouraging user developed docs is a great idea.

However, I'm not sure how your proposal really addresses the issue. How
would your proposal deal with the "but let's be honest, writing a 
blog/article/howto in a wiki is a pain in the butt" issue? Writing
decent documentation or clear examples is hard and the only thing worse
than no documentation is misleading or confusing documentation. 

My only real concern would be to further fracture the PG user base. If
there are barriers preventing users from adding documentation to the
existing documents or wiki, perhaps it would be better to try and
address those first?

Tim

-- 
Tim Cross



Re: Do we need yet another IDE (SQL development assistant) for PostgreSQL?

2018-07-15 Thread Tim Cross


Dmitry Igrishin  writes:

> вс, 15 июл. 2018 г. в 22:42, Chuck Davis :
>
>> If you decide to proceed on this project there's no need to reinvent the
>> wheel.
>>
>> I use Netbeans for my development.  it has quite a good facility for
>> working with databases and I use it regularly with Postgres.  Since
>> Netbeans is now licensed under Apache 2 you might find useful code
>> there.  Be assured it uses JDBC for access but JDBC is universally
>> available and the folks at Postgresql have done quite a nice job with
>> JDBC drivers.  Of course, this already works on all platforms.  The
>> implementation is basic but very useful:  i.e. a good starting point.
>>
> Thank you for the point. I'm the C++ programmer and I'm author of the
> C++ client library for PostgreSQL - Pgfe and I'm going to use it in this
> project. But I'm not sure about the cross-platform GUI toolkit.

The cross-platform GUI toolkit will be the challenge.

Your idea to make it integrate with user's preferred editor is a good
idea as editors are like opinions and certain anatomical parts -
everyone has one! Finding an appropriate API to do this will be a
challenge.

I seem to remember reading somewhere that Oracle was going to remove
swing from the core java library. I've always been a little disappointed
with Java UIs and found they don't give the cross-platform support that
Java originally promised, plus OSX/macOS has not made Java as welcome as
it used to be. If you do choose Java, it will need to work under openJDK
as this is what most Linux users will have installed.

Tim





-- 
Tim Cross



Re: Open Source tool to deploy/promote PostgreSQL DDL

2018-07-11 Thread Tim Cross
UI.

Given the high level of variability in environments, you are probably
best off developing the process and scripts rather than trying to find
an existing tool. Putting a web front end is likely easier than finding
a tool flexible enough to fit with the environment which avoids
situations where the tool begins to dictate how you operate (tail
wagging the dog).

Tim



-- 
Tim Cross



Re: Disabling/Enabling index before bulk loading

2018-07-10 Thread Tim Cross


Ravi Krishna  writes:

> We recently did a test on COPY and found that on large tables (47 million 
> rows , 20GB of raw data) the 
> difference in COPY with 16 indexes and COPY without any index is 1:14. That 
> is, COPY is 14 times slower 
> when data is ingested with all indexes as opposed to COPY first without index 
> and then create all index.
>

This is very system dependent. On our system, when we tested a similar
approach, we found that the time saved through dropping the indexes
before copy was lost when rebuilding the indexes afterwards. In fact, it
ended up being slightly slower.

I suspect a lot depends on the number and types of indexes you
have. Your example had a lot more indexes than ours. We were loading 22
batches with 2.5M records per batch. While copy was significantly faster
than transaction based inserts (minutes vs hours), the differences between
indexes and no indexes was measured in minutes. We only had 3 or 4
indexes.   

> I googled for earlier posting on this and it looks like this has been asked 
> before too.  
>
> This is what I am thinking to do:
>
> 1 - Extract index definition and save it as a SQL somewhere, either a file or 
> a table.
> 2 - Drop all indexes.
> 3 - Ingest data via COPY
> 4 - Recreate all indexes saved in (1).
>
> Is there a generic sql or script or tool to accomplish (1).
>

We are loading data via JavaScript using the pg and pg-copy-streams modules. It is
pretty straightforward to drop the indexes and recreate them afterwards
via SQL, so we didn't look for a tool as such.

As data is only inserted into this table and only by this process, we
also turned off autovacuum for this table, performing vacuum and analyze
manually after load. 
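
For the specific question in (1), a minimal sketch of pulling the index
definitions out of the catalog before dropping them (the table name is
made up):

SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename = 'big_table';

Each indexdef returned is a complete CREATE INDEX statement which can be
saved and replayed once the COPY has finished.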

Tim

-- 
Tim Cross



Re: FK v.s unique indexes

2018-07-03 Thread Tim Cross


Rafal Pietrak  writes:

>
> In particular, contrary to what the ERROR says, the target table *does
> have* a "unique constraint matching given keys", admittedly only
> partial. Yet, why should that matter at all? A unique index, partial or
> not, always yield a single row, and that's all what matters for FK. Right?
>
Is that correct? I would have thought that if you have a multi-key
unique index and you only provide values for some of the keys in the
index, you would have no guarantee of a single row being returned. If
this was true, then the additional keys are superfluous.

Have you tried doing the same thing where the fk keys and remote unique
index keys are equal in number?
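
For comparison, the arrangement Postgres accepts without complaint is the
one where the foreign key lists, in full, the columns of a non-partial
unique constraint on the target table - a minimal sketch with made-up
names:

CREATE TABLE accounts (
  org_id  integer NOT NULL,
  acct_no integer NOT NULL,
  UNIQUE (org_id, acct_no)
);

CREATE TABLE payments (
  id      bigserial PRIMARY KEY,
  org_id  integer NOT NULL,
  acct_no integer NOT NULL,
  FOREIGN KEY (org_id, acct_no) REFERENCES accounts (org_id, acct_no)
);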

-- 
Tim Cross



Re: Using COPY to import large xml file

2018-06-25 Thread Tim Cross


Anto Aravinth  writes:

> Thanks a lot. But I do got lot of challenges! Looks like SO data contains
> lot of tabs within itself.. So tabs delimiter didn't work for me. I thought
> I can give a special demiliter but looks like Postrgesql copy allow only
> one character as delimiter :(
>
> Sad, I guess only way is to insert or do a through serialization of my data
> into something that COPY can understand.
>

The COPY command has a number of options, including setting what is used
as the delimiter - it doesn't have to be tab. You need to also look at
the logs/output to see exactly why the copy fails.

I'd recommend first pre-processing your input data to make sure it is
'clean' and all the fields actually match with whatever DDL you have
used to define your db tables etc. I'd then select a small subset and
try different parameters to the copy command until you get the right
combination of data format and copy definition.
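
As a rough illustration of the sort of thing to experiment with (the
table, columns and file path are made up), CSV format with quoting
side-steps the embedded-tab problem because a tab inside a quoted field
is just data:

COPY so_posts (id, title, body)
  FROM '/path/to/posts.csv'
  WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER true);
-- \copy in psql takes the same options and reads the file client-side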

It may take some effort to get the right combination, but the result is
probably worth it given your data set size i.e. difference between hours
and days. 

--
Tim Cross



Re: Using COPY to import large xml file

2018-06-24 Thread Tim Cross
On Mon, 25 Jun 2018 at 11:38, Anto Aravinth 
wrote:

>
>
> On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross  wrote:
>
>>
>> Anto Aravinth  writes:
>>
>> > Thanks for the response. I'm not sure, how long does this tool takes for
>> > the 70GB data.
>> >
>> > I used node to stream the xml files into inserts.. which was very slow..
>> > Actually the xml contains 40 million records, out of which 10Million
>> took
>> > around 2 hrs using nodejs. Hence, I thought will use COPY command, as
>> > suggested on the internet.
>> >
>> > Definitely, will try the code and let you know.. But looks like it uses
>> the
>> > same INSERT, not copy.. interesting if it runs quick on my machine.
>> >
>> > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <
>> adrien.nay...@anayrat.info>
>> > wrote:
>> >
>> >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> >> > Hello Everyone,
>> >> >
>> >> > I have downloaded the Stackoverflow posts xml (contains all SO
>> questions
>> >> till
>> >> > date).. the file is around 70GB.. I wanna import the data in those
>> xml
>> >> to my
>> >> > table.. is there a way to do so in postgres?
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Anto.
>> >>
>> >> Hello Anto,
>> >>
>> >> I used this tool :
>> >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>> >>
>>
>> If you are using nodejs, then you can easily use the pg-copy-streams
>> module to insert the records into your database. I've been using this
>> for inserting large numbers of records from NetCDF files. Takes between
>> 40 to 50 minutes to insert 60 Million+ records and we are doing
>> additional calculations on the values, not just inserting them,
>> plus we are inserting into a database over the network and into a
>> database which is
>> also performing other processing.
>>
>> We found a significant speed improvement with COPY over blocks of insert
>> transactions, which was faster than just individual inserts. The only
>> downside with using COPY is that it either completely works or
>> completely fails and when it fails, it can be tricky to work out which
>> record is causing the failure. A benefit of using blocks of transactions
>> is that you have more fine grained control, allowing you to recover from
>> some errors or providing more specific detail regarding the cause of the
>> error.
>>
>
> Sure, let me try that.. I have a question here, COPY usually works when
> you move data from files to your postgres instance, right? Now in node.js,
> processing the whole file, can I use COPY
> programmatically like COPY Stackoverflow ?
> Because from doc:
>
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> I don't see its possible. May be I need to convert the files to copy
> understandable first?
>
> Anto.
>
>>
>>
>
Yes. Essentially what you do is create a stream and feed whatever
information you want to copy into that stream. PG sees the data as if it
was seeing each line in a file, so you push data onto the stream where
each item is separated by a tab (or whatever). Here is the basic low-level
function I use (don't know how the formatting will go!):

// Requires assumed from the identifiers used below; getClient(),
// releaseClient() and moduleName are helpers defined elsewhere in the
// module and not shown here.
const assert = require("assert");
const { Readable } = require("stream");
const { from: copyFrom } = require("pg-copy-streams");
const VError = require("verror");

async function copyInsert(sql, stringifyFN, records) {
  const logName = `${moduleName}.copyInsert`;
  var client;

  assert.ok(Array.isArray(records), "The records arg must be an array");
  assert.ok(typeof(stringifyFN) === "function",
            "The stringifyFN arg must be a function");

  return getClient()
    .then(c => {
      client = c;
      return new Promise(function(resolve, reject) {
        var stream, rs;
        var idx = 0;

        // release the client back to the pool and resolve the promise
        function done() {
          releaseClient(client);
          client = undefined;
          resolve(idx + 1);
        }

        function onError(err) {
          if (client !== undefined) {
            releaseClient(client);
          }
          reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
        }

        // Readable._read implementation: push one stringified record per
        // call, then signal end-of-stream with null when we run out
        function arrayRead() {
          if (idx === records.length) {
            rs.push(null);
          } else {
            let rec = records[idx];
            rs.push(stringifyFN(rec));
            idx += 1;
          }
        }

        rs = new Readable();
        rs._read = arrayRead;
        rs.on("error", onError);
        stream = client.query(copyFrom(sql));
        stream.on("error", onError);
        stream.on("end", done);
        rs.pipe(stream);
      });
    });
}

Re: Using COPY to import large xml file

2018-06-24 Thread Tim Cross


Anto Aravinth  writes:

> Thanks for the response. I'm not sure, how long does this tool takes for
> the 70GB data.
>
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took
> around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> suggested on the internet.
>
> Definitely, will try the code and let you know.. But looks like it uses the
> same INSERT, not copy.. interesting if it runs quick on my machine.
>
> On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat 
> wrote:
>
>> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> > Hello Everyone,
>> >
>> > I have downloaded the Stackoverflow posts xml (contains all SO questions
>> till
>> > date).. the file is around 70GB.. I wanna import the data in those xml
>> to my
>> > table.. is there a way to do so in postgres?
>> >
>> >
>> > Thanks,
>> > Anto.
>>
>> Hello Anto,
>>
>> I used this tool :
>> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>

If you are using nodejs, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. Takes between
40 to 50 minutes to insert 60 Million+ records and we are doing
additional calculations on the values, not just inserting them,
plus we are inserting into a database over the network and into a
database which is also performing other processing.

We found a significant speed improvement with COPY over blocks of insert
transactions, which was faster than just individual inserts. The only
downside with using COPY is that it either completely works or
completely fails and when it fails, it can be tricky to work out which
record is causing the failure. A benefit of using blocks of transactions
is that you have more fine grained control, allowing you to recover from
some errors or providing more specific detail regarding the cause of the
error.

Be wary of what indexes you're defining on your table. Depending on the
type and number, these can have significant impact on insert times as
well.


-- 
Tim Cross



Re: Load data from a csv file without using COPY

2018-06-19 Thread Tim Cross


Ravi Krishna  writes:

> In order to test a real life scenario (and use it for benchmarking) I want to 
> load large number of data from csv files.  
> The requirement is that the load should happen like an application writing to 
> the database ( that is, no COPY command). 
> Is there a tool which can do the job.  Basically parse the csv file and 
> insert it to the database row by row.
>

Not clear what you mean by 'real world scenario', but you could possibly
use PG's foreign table support and define a csv file as a foreign table
and then have scripts which read from there and do whatever
insert/update etc you need. However, this has a high level of 'fakery'
going on and probably not testing what you really want.
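
For what it's worth, a minimal sketch of that foreign table approach using
the file_fdw extension (the file path and columns are made up):

CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE people_csv (
  id   integer,
  name text,
  city text
) SERVER csv_files
  OPTIONS (filename '/tmp/people.csv', format 'csv', header 'true');

-- scripts can then read the rows and insert/update them however they like
INSERT INTO people (id, name, city)
SELECT id, name, city FROM people_csv;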

There are lots of ways that applications write to the database -
different drivers (e.g. jdbc, odbc, pg etc), different commit
and transaction strategies and even different ways to handle things like
an update or insert process. You can even use streams and copy from an
application.

To get 'real world' equivalence, you really need to use the same
interface as the application you are comparing. Most languages have
support for processing CSV files, so you may be better off writing a
small 'wrapper' app which uses the same drivers and, assuming your
database connectivity has been abstracted into some sort of
module/library/class, uses the same interface to write to the database
that the application uses.

Tim
-- 
Tim Cross



Re: Microsoft buys GitHub, is this a threat to open-source

2018-06-04 Thread Tim Cross


Joshua D. Drake  writes:

> On 06/04/2018 10:31 AM, Rich Shepard wrote:
>> On Mon, 4 Jun 2018, Joshua D. Drake wrote:
>>
>>> No but it does show why using non open source platforms for open source
>>> projects is an inherently bad idea.
>>
>> Joshua,
>>
>>   Sourceforge seems to be out of favor, too, so are there any open source
>> platforms that provide services that sourceforge and github do?
>
> Gitlab which can also be self hosted, the one GNU does (I don't recall
> the name).
>

I find GitLab to be a pretty good alternative. However, I don't think
there is any need to panic. While it is possible (likely?) that MS will
change the terms and conditions which work in favour of maintaining
their profitability, which may cause some problems for particularly
large open source projects, nothing is going to happen over night or so
quickly that projects won't have an opportunity to find an alternative.

There is an alternative perspective to seeing MS purchase of Github
which is a little more positive.

The challenge for open source is that at some point, there is a cost
associated with storage, collaboration and sharing of source code. This
cost has to be paid for by someone. While we can hope for philanthropic
donations and gifts to pay this cost, it probably isn't a sustainable
solution. If on the other hand, there is a profitable business which can
maintain profitability while incorporating open source support as part
of core business, then we may have a more sustainable and reliable
solution.

I am no MS fan and have little experience in the MS suite of products,
but I think most would have to acknowledge that MS has started to
embrace open source far more than it did in the past. Consider for
example their VS Code editor or the fact Windows now comes with a Bash
shell and more integrated support for Linux. I suspect that we will see
little change in Github in the short term and provided MS can maintain
long term profitability, we may see little long-term change as well.

Of course, this only holds for the very relaxed definition of open
source. RMS would/will be using this as a clear example of MS destroying
open source and the weakness of the general open source movement when it
fails to emphasise freedom. For a strict open source definition which
emphasises freedom rather than just 'openness', Github would likely
already be ruled out due to their restrictive terms and conditions
regarding ownership and licenses. However, the subtleties of RMS's
concerns are often misunderstood and incompatible with our tendency to
focus on short term, low friction solutions.

For now, I'll just take a leaf out of 'the Guide', grab my towel and not
panic!

Tim

--
Tim Cross



Re: Whither 1:1?

2018-06-01 Thread Tim Cross


Olivier Gautherot  writes:

> On Fri, Jun 1, 2018 at 12:52 PM, Guyren Howe  wrote:
>
>> It’s come to my attention that what seems an obvious and useful database
>> design pattern — 1:1 relations between tables by having a shared primary
>> key — is hardly discussed or used.
>>
>> It would seem to be a very simple pattern, and useful to avoid storing
>> nulls or for groups of fields that tend to be used together.
>>
>> Thoughts? Is there some downside I can’t see?
>>
>
> You will get a benefit in terms of space only if the optional fields in the
> second table exist in a reduced number of instances - and the second table
> is significantly wider. This can make a difference on big tables but this
> gain may be offset by the cost of the join. In this perspective, I don't
> think that there is a clear benefit or drawback: it should be evaluated on
> a case-by-case basis.
>

Well said. Like many database design decisions, there are more
guidelines than rules. Nearly all decisions have pros and
cons. Therefore, you need to assess on a per project basis. In addition
to the structure of data (size, null frequency etc), you also need to
consider how the data is used. It is also important to consider who will
be using the data, how they will access it and what level of
sophistication/understanding they have. The clarity of your data model
is also important as future decisions may be made by others and the
better they understand the design/model, the higher the likelihood the
system will evolve in a sane and maintainable manner.

There other point to remember is that all designs often have an element
of surprise - you make a decision under an assumption which turns out
not to hold due to variables you didn't consider or don't have control
over. Sometimes there are alternatives which may actually perform better
because they are optimised within the system - for example, some of the
benefits for this approach could be realised using partitions. 

I have used this technique, but from memory, this was done as the system
evolved and we found there was a benefit from having a smaller 'main'
table. It isn't a design decision I recall adopting during an initial
modelling of the system, but it may be something to consider once you
find a performance problem (along with other options) which needs to be
addressed. 
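
For anyone unfamiliar with the pattern being discussed, it is roughly the
following (a minimal sketch with made-up names - the detail table borrows
the parent's primary key, so the relationship is strictly 1:1):

CREATE TABLE person (
  person_id bigint PRIMARY KEY,
  name      text NOT NULL
);

CREATE TABLE person_details (
  person_id bigint PRIMARY KEY REFERENCES person (person_id),
  biography text,
  photo     bytea
);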

I'm not aware of any guideline or school of thought which rules out this
as an option. Possibly the reason it appears to be used infrequently is
because it doesn't realise the benefits you might expect or is simply
not an problem in a majority of use cases. 

Tim



-- 
Tim Cross



Re: LDAP authentication slow

2018-05-30 Thread Tim Cross
ystems are managed by MS Forefront. Our environment has a large
mix of technologies - servers are roughly evenly split between Linux and
MS - still probably more Linux, though MS has been increasing in recent
years. Databases are a mix of Oracle and Postgres plus a smattering of
MySQL. Staff numbers are around 3k with about 60% on MS, 35% OSX and 5%
Linux. Client base is about 80k.

The reason we use both openLDAP and AD is because there are differences
between the two which are important for some of our applications (for
example, attributes which are single valued under LDAP standards but can
be multi-valued under AD) and because we need additional schemas which
are easy to implement in standards compliant LDAP, but difficult in
AD. We also found that when just requiring LDAP functionality, openLDAP
out performed AD.

How easily this can be done in your environment will depend on your
identity management solution. Depending on what that is, it may be as
easy as just adding another downstream target and a few mapping rules.
In this case, it would probably be an overall win. However, if your IAM
system cannot manage things easily, this is probably not practical.

There has been another thread regarding LDAP performance where the issue
looks like it could be a DNS related problem. It seems establishing
connections are slow when the LDAP address is specified using a name and
faster when just an IP address is used. This could be something else you
should look at. We had an issue a while back where our central IT
provider made changes to DNS to improve security and enabling better
handling of misbehaving clients - essentially, it was a DNS throttling
configuration which would temporarily block requests from an IP if
the number of requests being made was  above a specified threshold. This
caused some initial problems for us as we found some application
libraries did not perform proper DNS caching and would regularly exceed
the threshold. It also took some trial and error to get the right
watermark for the throttling. A simple test like using IP address rather than 
name
would likely help to identify if DNS related issues could be the cause
or whether it is just an AD specific issue.

Definitely check AD logs as well - the issue could be simply that adding
a new system has increased demand sufficiently to degrade performance of
AD (though I would expect there would be complaints from others outside
the DB area if this was the case). 

The GSSAPI approach is not as complicated as it sounds, but it can be
affected by environment/infrastructure architecture and it will be
critical to ensure you have good time synchronisation. This can be
somewhat challenging in hybrid environments where you have a mix of
local and remote services. When it all works, it is great, but when you
do have a problem, diagnosis can be challenging.

The overall approach of having one identity with one password per entity
is IMO the right approach and your only hope for good password policy
application. However, getting to that point can be very challenging. 

--
Tim Cross



Re: pgdg-keyring (or apt-key) failure on fresh 9.6 install

2018-05-29 Thread Tim Cross


Moreno Andreo  writes:

> Hi Tim,
>
> Il 29/05/2018 00:06, Tim Cross ha scritto:
>> Moreno Andreo  writes:
>>
>>> Hi folks,
>>> I'm trying to install Postgresql 9.6 on a test machine in Google Cloud
>>> Platform
>>> After a fresh install with Debian 9 (just after the instance has been
>>> created) I follow steps from here
>>>
>>> https://wiki.postgresql.org/wiki/Apt
>>>
>>> (instead of pg 10 I install pg 9.6)
>>>
>>> During the installation process i encounter the following strange
>>> warnings that, even if that's a test machine, make me think twice before
>>> going ahead.
>>>
>>> [...]
>>> Processing triggers for man-db (2.7.6.1-2) ...
>>> Setting up pgdg-keyring (2017.3) ...
>>> Removing apt.postgresql.org key from trusted.gpg: Warning: The postinst
>>> maintainerscript of the package pgdg-keyring
>>> Warning: seems to use apt-key (provided by apt) without depending on
>>> gnupg or gnupg2.
>>> Warning: This will BREAK in the future and should be fixed by the
>>> package maintainer(s).
>>> Note: Check first if apt-key functionality is needed at all - it
>>> probably isn't!
>>> OK
>>> Setting up xml-core (0.17) ...
>>> [...]
>>>
>>> I have to say that installation is successfully and database server goes
>>> up and apparently with no problems at all.
>>>
>> This looks like a warning for the package maintainers regarding ensuring
>> the package depends on either gnupg or gnupg2 and nothing you need to
>> worry about unless you are building/maintaining deb packages for postgres.
> Brilliant. That's what I needed to know. Just to avoid bitter surprises 
> in the future... :-)
>>
>> The Debian package manager, apt, uses gpg keys to verify the
>> authenticity of packages it downloads. My guess is that previously, you
>> only needed to ensure the package had a dependency on apt-key and now
>> apt has/is changing such that you need to have an explicit dependency on
>> either gnupg or gnupg2.
>>
> ... so if I update/upgrade this instance in the future it will be 
> automatically fixed (and there shouldn't be issues), right?
> Thanks a lot!
>

Right. In fact, there are no issues now. That warning is from the Debian
package management system and about the package management system, so
nothing to do with Postgres.

When you upgrade in the future, provided the new Postgres packages have
been created with the dependency for gnupg/gnupg2, there will be no
warnings.

Personally, I tend to prefer using the packages which come with the
particular flavour of Linux you're installing as they are often more
in-line with the current version of the package management system being
used. I only grab packages from the specific Postgres repo if the
package is not in the current version of the distribution I'm
installing. 

-- 
Tim Cross



Re: pgdg-keyring (or apt-key) failure on fresh 9.6 install

2018-05-28 Thread Tim Cross


Moreno Andreo  writes:

> Hi folks,
> I'm trying to install Postgresql 9.6 on a test machine in Google Cloud 
> Platform
> After a fresh install with Debian 9 (just after the instance has been 
> created) I follow steps from here
>
> https://wiki.postgresql.org/wiki/Apt
>
> (instead of pg 10 I install pg 9.6)
>
> During the installation process i encounter the following strange 
> warnings that, even if that's a test machine, make me think twice before 
> going ahead.
>
> [...]
> Processing triggers for man-db (2.7.6.1-2) ...
> Setting up pgdg-keyring (2017.3) ...
> Removing apt.postgresql.org key from trusted.gpg: Warning: The postinst 
> maintainerscript of the package pgdg-keyring
> Warning: seems to use apt-key (provided by apt) without depending on 
> gnupg or gnupg2.
> Warning: This will BREAK in the future and should be fixed by the 
> package maintainer(s).
> Note: Check first if apt-key functionality is needed at all - it 
> probably isn't!
> OK
> Setting up xml-core (0.17) ...
> [...]
>
> I have to say that installation is successfully and database server goes 
> up and apparently with no problems at all.
>

This looks like a warning for the package maintainers regarding ensuring
the package depends on either gnupg or gnupg2 and nothing you need to
worry about unless you are building/maintaining deb packages for postgres.

The Debian package manager, apt, uses gpg keys to verify the
authenticity of packages it downloads. My guess is that previously, you
only needed to ensure the package had a dependency on apt-key and now
apt has/is changing such that you need to have an explicit dependency on
either gnupg or gnupg2. 

-- 
Tim Cross



Re: When use triggers?

2018-05-17 Thread Tim Cross

hmidi slim <hmidi.sl...@gmail.com> writes:

> HI,
>
> I'm working on a microservice application and I avoid using triggers
> because they will not be easy to maintain and need an experimented person
> in database administration to manage them. So I prefer to manage the work
> in the application using ORM and javascript.
> However I want to get some opinions and advices about using triggers: when
> should I use them? How to manage them when there are some problems?

I think triggers are generally best avoided. They do have a purpose, but
like regular expressions and Lisp style macros, they are abused more
often than used appropriately. When used correctly, they can help to
ensure your code is robust, clear and easy to maintain. 

The big issue with triggers is that they are really a side effect of
some other action. As such, they are obscure, easily missed, difficult
to debug and often frustrating to maintain.

In nearly 30 years of working with different databases, I've rarely
found triggers necessary. As mentioned by others in this thread, they
can be useful when you need low level auditing and like all guidelines,
there are always exceptions, but in general, they should usually be the
last choice, not the first.

Database functions on the other hand are extremely useful and probably
something more developers should take advantage of. There are far too
many applications out there which are doing things within external
application code which could be handled far more efficiently and
consistently as a database function. The challenge is in getting the
right balance.

My rule of thumb is to develop under the assumption that someone else
will have this dumped on them to maintain. I want the code to be as easy
to understand and follow as possible and I want to make it as easy to
make changes and test those changes as possible. Therefore I prefer my
code to consist of simple units of functionality which can be tested in
isolation and have a purpose which can be understood without requiring a
knowledge of hidden actions or unexpected side effects. If a function
cannot be viewed in a single screen, it is probably too big and trying
to do too many different things which should be broken up into smaller
functions. 

regards,

Tim


-- 
Tim Cross



Re: issues when installing postgres

2018-05-09 Thread Tim Cross
On 10 May 2018 at 09:45, Adrian Klaver <adrian.kla...@aklaver.com> wrote:

> On 05/09/2018 02:47 PM, Antonio Silva wrote:
>
>> Hello Adrian
>>
>> Are you using the Ubuntu or Postgres repos?
>>  > I'm using the Ubuntu repos
>>
>> Can you connect to Postgres using psql?
>>  > No I cannot
>>
>
> What does ps ax | grep post show?
>
> My guess is you are going to have to reinstall Postgres.
>
>
>> Thanks
>>
>> Antonio
>>
>>
>> 2018-05-09 10:36 GMT-03:00 Adrian Klaver <adrian.kla...@aklaver.com
>> <mailto:adrian.kla...@aklaver.com>>:
>>
>>
>> On 05/08/2018 05:54 PM, Antonio Silva wrote:
>>
>> Hello!
>>
>>
>> Comments inline.
>>
>> I bought a new computer and I installed Ubuntu 18.04 and after
>> PostgreSQL.
>> sudo apt install postgresql postgresql-contrib pgadmin3
>>
>>
>> Are you using the Ubuntu or Postgres repos?
>>
>>
>> Nevertheless I had some issues with configuration files and
>> decided to
>> uninstall it completely
>> sudo apt purge postgresql postgresql-contrib pgadmin3
>>
>> When I installed it again I notice that postgresql.conf and
>> pg_hba.conf
>> were the oldies files. Then I uninstall Postgres once more and
>> removed the
>> directory /etc/postgresql/
>>
>> After a new install I noticed that the directory
>> /etc/postgresql/ was
>> completely empty - it was not created again. There are no more
>> postgresql.conf and pg_hba.conf files. I could find only
>> postgresql.conf.sample and pg_hba.conf.sample at
>> /usr/share/postgresql/10
>>
>> /etc/init.d/postgresql status says that Postgres is running fine
>>
>>
>> Can you connect to Postgres using psql?
>>
>>
>> ● postgresql.service - PostgreSQL RDBMS
>> Loaded: loaded (/lib/systemd/system/postgresql.service; enabled;
>> vendor
>> preset: enabled)
>> Active: active (exited) since Tue 2018-05-08 10:43:23 -03; 1h
>> 55min ago
>> Process: 6451 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
>> Main PID: 6451 (code=exited, status=0/SUCCESS)
>>
>>
>> To be running the server would need its conf files somewhere.
>>
>>
>> What should I have to do to heve the folder /etc/postgresql/10/
>> ... and all
>> its files agais?
>>
>>
>> If you are going to use the packages then yes.
>>
>>
>>
>>
>> I really appreciate any help. Thanks in advance.
>>
>> All the best
>>
>> -- Antônio Olinto Ávila da Silva
>>
>>
>>
>> -- Adrian Klaver
>> adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>
>>
>>
>>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>
Probably not relevant, but I noticed your also installing pgadmin3. I don't
believe pgadmin3 will work with Postgres 10. You need pgadmin4, which isn't
available as a package on ubuntu 18.04.

Also, be aware that Ubuntu has also been pushing 'snaps', so make sure that
Postgres hasn't been installed as a snap package (I think the command is
snap list). I recall when I tried to install postgres in 17.10, which I did
from the 'software centre' on Ubuntu, it initially installed it as a snap.
I had to remove the snap and then use apt to get the normal deb package
install. The snap package system does not use the normal locations for
config files.

This could also be a ubuntu 18.04 issue. This version was only released a
couple of weeks ago and it is the first version which has Postgres 10 as
the default.  I would try the following

1. use systemctl to stop postgresql service
2. remove all postgres packages making sure all config files are also
removed
3. Use synaptic to make sure all postgres package and associated config
files have been removed.
4. Reboot
5. run apt update and then apt upgrade
6. Re-install using apt (not the software centre).



-- 
regards,

Tim

--
Tim Cross


Re: Index/trigger implementation for accessing latest records

2018-05-02 Thread Tim Cross

Alastair McKinley <a.mckin...@analyticsengines.com> writes:

> Hi,
>
>
> I have a table that stores a location identifier per person which will be 
> appended to many times.
>
> However, for many queries in this system we only need to know the most recent 
> location per person, which is limited to about 1000 records.
>
>
> Is the following trigger/index strategy a reasonable and safe approach to 
> fast access to the latest location records per person?
>
>
>   1.  A boolean column (latest_record default true) to identify the latest 
> record per person
>   2.  A before insert trigger that updates all other records for that person 
> to latest_record = false
>   3.  A partial index on the latest_record column where latest_record is true
>
>
> Aside from performance, is it safe to update other records in the table from 
> the insert trigger in this way?
>
>
> Minimal example is shown below:
>
>
> create table location_records
>
> (
>
> id bigserial,
>
> person_id bigint,
>
> location_id bigint,
>
> latest_record boolean not null default true
>
> );
>
>
> create function latest_record_update() returns trigger as
>
> $$
>
> BEGIN
>
> update location_records set latest_record = false where person_id = 
> new.person_id and latest_record is true and id != new.id;
>
> return new;
>
> END;
>
> $$ language plpgsql;
>
>
> create trigger latest_record_trigger before insert on location_records
>
> for each row execute procedure latest_record_update();
>
>
> create index latest_record_index on location_records(latest_record) where 
> latest_record is true;
>
>
> insert into location_records(person_id,location_id) values (1,1);
>
> insert into location_records(person_id,location_id) values (1,2);
>
> insert into location_records(person_id,location_id) values (1,3);
>
>
> insert into location_records(person_id,location_id) values (2,3);
>
> insert into location_records(person_id,location_id) values (2,4);
>
>
> select * from location_records;
>

My personal bias will come out here 

I don't think using a trigger is a good solution here. Although very
powerful, the problem with triggers is that they are a 'hidden' side
effect which is easily overlooked and often adds an additional
maintenance burden which could be avoided using alternative approaches.

Consider a few months down the road and your on holidays. One of your
colleagues is asked to add a new feature which involves inserting
records into this table. During testing, they observe an odd result - a
field changing which according to the SQL they wrote should not. The
simple new feature now takes twice as long to develop as your colleague
works out there is a trigger on the table. Worse yet, they don't notice
and put their changes into production and then issues start getting
raised about communications going to the wrong location for customers
etc.

Triggers often become a lot more complicated than they will initially
appear. In your example, what happens for updates as opposed to inserts?
What happens if the 'new' location is actually the same as a previously
recorded location etc. 

In your case, I would try to make what your doing more explicit and
avoid the trigger. There are a number of ways to do this such as

- A function to insert the record. The function could check to see if
  that customer has any previous records and if so, set the boolean flag
  to false for all existing records and true for the new one. You might
  even want to break it up into two functions so that you have one which
  just sets the flag based on a unique key parameter - this would
  provide a way of resetting the current location without having to do
  an insert. 

- Use a timestamp instead of a boolean and change your logic to select
  the current location by selecting the record with the latest
  timestamp (a minimal sketch of this is shown after this list).

- Keep the two actions as separate SQL - one to insert a record and one
  to set the current location. This has the advantage of making actions
  clear and easier to maintain and can be useful in domains where people
  move between locations (for example, I've done this for a University
  where the data represented the students current address, which would
  change in and out of semester periods, but often cycle between two
  addresses, their college and their parental home). The downside of
  this approach is that applications which insert this information must
  remember to execute both SQL statements. If you have multiple
  interfaces, this might become a maintenance burden (one of the
  advantages of using a DB function). 
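
A minimal sketch of the timestamp option, reusing the table from your
example (recorded_at is a made-up column name):

CREATE TABLE location_records (
  id          bigserial PRIMARY KEY,
  person_id   bigint NOT NULL,
  location_id bigint NOT NULL,
  recorded_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX location_records_latest_idx
  ON location_records (person_id, recorded_at DESC);

-- latest location per person, no trigger required
SELECT DISTINCT ON (person_id) person_id, location_id, recorded_at
FROM location_records
ORDER BY person_id, recorded_at DESC;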


Tim


--
Tim Cross



Re: Long running INSERT+SELECT query

2018-04-27 Thread Tim Cross

Steven Lembark <lemb...@wrkhors.com> writes:

> On Fri, 27 Apr 2018 19:38:15 +0300
> Vitaliy Garnashevich <vgarnashev...@gmail.com> wrote:
>
>> We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of 
>> the tables, which are referenced by results, before running the big 
>> query. That should be up to a million of rows in total. It will probably 
>> not cover the case when a record is INSERT'ed and then DELETE'd after 
>> the calculation has begun, but such cases should be even more rare than 
>> the DELETE's we're currently facing.
>
> Thing about using a couple of Materialized Views for the worst 
> part of it.

+1 re: materialised views - I have found them to be extremely useful for
situations where you want a snapshot of data and need to present it in a
way which is easier to process, especially when the underlying data is
changing faster than your reporting process can generate the report.  
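
A minimal sketch of that approach (all names are made up; CONCURRENTLY
needs a unique index on the view, but lets readers keep querying while
the refresh runs):

CREATE MATERIALIZED VIEW monthly_totals AS
  SELECT account_id,
         date_trunc('month', created_at) AS month,
         sum(amount) AS total
  FROM transactions
  GROUP BY account_id, date_trunc('month', created_at);

CREATE UNIQUE INDEX ON monthly_totals (account_id, month);

REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_totals;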

-- 
Tim Cross



Re: Postgres and fsync

2018-04-23 Thread Tim Cross

Andres Freund <and...@anarazel.de> writes:

> Hi,
>
> On 2018-04-23 08:30:25 +1000, Tim Cross wrote:
>> the recent article in LWN regarding issues with fsync and error
>> reporting in the Linux kernel and the potential for lost data has
>> prompted me to ask 2 questions.
>
> Note that you need to have *storage* failures for this to
> happen. I.e. your disk needs to die, and there's no raid or such to fix
> the issue.
>
>
>> 1. Is this issue low level enough that it affects all potentially
>> supported sync methods on Linux? For example, if you were concerned
>> about this issue and you had a filesystem which supports open_sync or
>> open_datasync etc, is switching to one of these options something which
>> should be considered or is this issue low level enough that all sync
>> methods are impacted?
>
> No, the issue is largely about datafiles whereas the setting you refer
> to is about the WAL.
>
> Greetings,
>
> Andres Freund

OK, thanks.

-- 
Tim Cross



Re: Postgresql database encryption

2018-04-20 Thread Tim Cross

Ron <ronljohnso...@gmail.com> writes:

> On 04/20/2018 03:55 PM, Vick Khera wrote:
>> On Fri, Apr 20, 2018 at 11:24 AM, Vikas Sharma <shavi...@gmail.com 
>> <mailto:shavi...@gmail.com>> wrote:
>>
>
> Someone really needs to explain that to me. My company-issued laptop has 
> WDE, and that's great for when the machine is shut down and I'm carrying it 
> from place to place, but when it's running, all the data is transparently 
> decrypted for every process that wants to read the data, including malware, 
> industrial spies,
>

It really depends on the architecture. In many server environments these
days, some sort of network storage is used. Having the 'disk' associated
with a specific server encrypted can provide some level of protection
against another machine which also has access to the underlying
infrastructure being able to read that data.

The other level of protection is for when disks are disposed of. There
have been many cases where data has been retrieved off disks which have
been sent for disposal.

Finally, the basic physical protection. Someone cannot just access your
data centre, remove a disk from the SAN and then access the data. 

Then of course there is the bureaucratic protection - "Yes boss, all our
data is encrypted on disk."

Tim
" 
-- 
Tim Cross



Re: Postgresql database encryption

2018-04-20 Thread Tim Cross

Vikas Sharma <shavi...@gmail.com> writes:

> Hello Guys,
>
> Could someone throw light on the postgresql instance wide or database wide
> encryption please? Is this possible in postgresql and been in use in
> production?.
>
> This is a requirement in our production implementation.
>

This sounds like a lazy management requirement specified for 'security'
purposes by people with little understanding of either technology or
security. I suspect it comes form a conversation that went along the
lines of 

"There has been lots in the news about cyber threats"

"Yes, we need our system to be secure"

"I know, lets make one of the requirements that everything must be
encrypted, that will stop them"

"Great idea, I'll add it as requirement 14".

This is a very poor requirement because it is not adequately specified,
but more critically, because it is specifying a 'solution' rather than
articulating the requirement in a way which would allow those with the
necessary expertise to derive an appropriate solution - one which may or
may not involve encryption or hashing of data and which may or may not
be at the database level.

What you really need to do is go back to your stakeholders and ask them
a lot of questions to extract what the real requirement is. Try to find
out what risk they are trying to mitigate with encryption. Once this is
understood, then look at what the technology can do and work out the
design/implementation from there.

It is extremely unlikely you just want all the data in the database
encrypted. When you think about it, such an approach really doesn't make
sense. In basic terms, if the data is encrypted, the database engine
will need to be able to decrypt it in order to operate (consider how a
where clause needs to be able to interpret actions etc). If the db can
read the data, the keys must be in the database. If the keys are in the
database and your database is compromised, then your keys are
compromised. So provided you protect your database from compromise, you
achieve the same level of security as you do with full data encryption
EXCEPT for access to the underlying data files outside of the database
system. For this, you will tend to use some sort of file system
encryption, which is typically managed at the operating system
level. Again, for the operating system to be able to read the file
system, the OS must have access to the decryption keys, so if your OS is
compromised, then that level of protection is lost as well (well, that
is over simplified, but you get the idea). What this level of protection
does give you is data at rest protection - if someone is able to access
your disks through some other means, they cannot read the data. This is
the same principal most people should be using with their
laptops. Protect the OS with a password and have the data on disk
encrypted. Provided nobody can login to your laptop, they cannot read
your data. Without this encryption, you can just take the disk out of
the laptop, mount it on another system and you have full access. With
disk encryption, you cannot do that. Same basic principal with the
server.

At the database level, a more typical approach is to use one way hashing
for some sensitive data (i.e. passwords) and possibly column level
encryption on a specific column (much rarer) or just well structured
security policies and user roles that restrict who has access to various
tables/columns. To implement this successfully, you need details
regarding the domain, sensitivity of various data elements and the
threats you need to protect against. If you cannot get this information
because your stakeholders don't really know what their risks are and
have not done a proper assessment, and what you are really dealing with
is bureaucracy which just has a dumb "data must be encrypted" policy,
then just use full disk encryption, state that "all data is encrypted on
disk", and you're done.

Tim 


-- 
Tim Cross



Re: Commands linked to pg_wrapper not working with non-root users

2018-04-11 Thread Tim Cross

Jonathan Leroy - Inikup <jonat...@inikup.com> writes:

> Hi,
>
> I'm using multiples versions of PostgreSQL from the postgresql.org
> repository (http://apt.postgresql.org/) on Debian Jessie, on multiples
> servers.
> Each postgresql-client-XX package depends on postgresql-client-common,
> which provides the pg_wrapper script
> (/usr/share/postgresql-common/pg_wrapper).
>
> A lot of PostgreSQL commands are linked to pg_wrapper. E.g.:
> /usr/bin/psql -> ../share/postgresql-common/pg_wrapper
>
> Here's my issue : when I'm logged as an user which is not root or
> postgresql, I can't use any of the commands linked to pg_wrapper:
>
> user1@server1:~ $ /usr/bin/psql --version
> Error: Invalid data directory
>
>
> However, everything works with postgres or root user:
>
> root@server1:~ # /usr/bin/psql --version
> psql (PostgreSQL) 9.4.16
>
>
> Also, everything works fine if I bypass pg_wrapper:
>
> user1@server1:~ $ /usr/lib/postgresql/9.4/bin/psql --version
> psql (PostgreSQL) 9.4.16
>
> I am missing something ?
>
> Thanks,

Check your settings in /etc/postgresql-common/user_clusters. The wrapper
script uses that file to determine what databases to connect to or what
is the user default database cluster. It can also be overridden with a
local ~/.postgresqlrc, so check there are no old settings there as well.

Tim

-- 
Tim Cross


