Re: [PATCH] minor bug fix for pg_dump --clean

2022-10-23 Thread Виктория Шепард
Hi,

Good catch! Here are several points for improvement:
1. pg_dump.c:17380
Maybe better to write simpler


appendPQExpBuffer(delcmd, "CREATE SCHEMA IF NOT EXISTS %s;\n",
tbinfo->dobj.namespace->dobj.name);

because there is a schema name inside the `tbinfo->dobj.namespace->dobj.name
`

2. pg_backup_archiver.c:588
Here are no necessary spaces before and after braces, and no spaces around
the '+' sign.

( strncmp(dropStmt, "CREATE SCHEMA IF NOT EXISTS", 27) == 0 &&
  strstr(dropStmt+29, "CREATE OR REPLACE VIEW") ))


Best regards,
Viktoria Shepard

чт, 1 сент. 2022 г. в 12:13, Frédéric Yhuel :

> Hello,
>
> When using pg_dump (or pg_restore) with option "--clean", there is some
> SQL code to drop every objects at the beginning.
>
> The DROP statement for a view involving circular dependencies is :
>
> CREATE OR REPLACE VIEW [...]
>
> (see commit message of d8c05aff for a much better explanation)
>
> If the view is not in the "public" schema, and the target database is
> empty, this statement fails, because the schema hasn't been created yet.
>
> The attached patches are a TAP test which can be used to reproduce the
> bug, and a proposed fix. They apply to the master branch.
>
> Best regards,
> Frédéric


Re: Re[2]: Possible solution for masking chosen columns when using pg_dump

2022-10-23 Thread Виктория Шепард
Hi,

Thank you, Oleg Tselebrovskiy, for your valuable review, here are the fixes

Best regards,
Viktoria Shepard

ср, 12 окт. 2022 г. в 12:19, Виктория Шепард :

> Hi,
>
> Here is an idea of how to read masking options from a file. Please, take a
> look.
>
> пн, 10 окт. 2022 г. в 14:54, Олег Целебровский  >:
>
>> Hi,
>>
>> I applied most of suggestions: used separate files for most of added
>> code, fixed typos/mistakes, got rid of that pesky TODO that was already
>> implemented, just not deleted.
>>
>> Added tests (and functionality) for cases when you need to mask columns
>> in tables with the same name in different schemas. If schema is not
>> specified, then columns in all tables with specified name are masked
>> (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will
>> mask all ids in all tables with names ‘t0’ in all existing schemas).
>>
>> Wrote comments for all ‘magic numbers’
>>
>> About that
>>
>> >- Also it can be hard to use a lot of different functions for different
>> fields, maybe it would be better to set up functions in a file.
>>
>> I agree with that, but I know about at least 2 other patches (both are
>> WIP, but still) that are interacting with reading command-line options from
>> file. And if everyone will write their own version of reading command-line
>> options from file, it will quickly get confusing.
>>
>> A solution to that problem is another patch that will put all options
>> from file (one file for any possible options, from existing to future ones)
>> into **argv in main, so that pg_dump can process them as if they came form
>> command line.
>>
>>
>> Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <
>> we.vikt...@gmail.com>:
>>
>> Hi,
>> I took a look, here are several suggestions for improvement:
>>
>> - Masking is not a main functionality of pg_dump and it is better to
>> write most of the connected things in a separate file like parallel.c or
>> dumputils.c. This will help slow down the growth of an already huge pg_dump
>> file.
>>
>> - Also it can be hard to use a lot of different functions for different
>> fields, maybe it would be better to set up functions in a file.
>>
>> - How will it work for the same field and tables in the different
>> schemas? Can we set up the exact schema for the field?
>>
>> - misspelling in a word
>> >/*
>> >* Add all columns and funcions to list of MaskColumnInfo structures,
>> >*/
>>
>> - Why did you use 256 here?
>> > char* table = (char*) pg_malloc(256 * sizeof(char));
>> Also for malloc you need malloc on 1 symbol more because you have to
>> store '\0' symbol.
>>
>> - Instead of addFuncToDatabase you can run your query using something
>> already defined from fe_utils/query_utils.c. And It will be better to set
>> up a connection only once and create all functions. Establishing a
>> connection is a resource-intensive procedure. There are a lot of magic
>> numbers, better to leave some comments explaining why there are 64 or 512.
>>
>> - It seems that you are not using temp_string
>> > char   *temp_string = (char*)malloc(256 * sizeof(char));
>>
>> - Grammar issues
>> >/*
>> >* mask_column_info_list contains info about every to-be-masked column:
>> >* its name, a name its table (if nothing is specified - mask all columns
>> with this name),
>> >* name of masking function and name of schema containing this function
>> (public if not specified)
>> >*/
>> the name of its table
>>
>>
>> пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud > >:
>>
>> Hi,
>>
>> On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>> >
>> > Hello, here's my take on masking data when using pg_dump
>> >
>> > The main idea is using PostgreSQL functions to replace data during a
>> SELECT.
>> > When table data is dumped SELECT a,b,c,d ... from ... query is
>> generated, the columns that are marked for masking are replaced with result
>> of functions on those columns
>> > Example: columns name, count are to be masked, so the query will look
>> as such: SELECT id, mask_text(name), mask_int(count), date from ...
>> >
>> > So about the interface: I added 2 more command-line options:
>> >
>> > --mask-columns, which specifies what columns from what tables will be
>> masked
>> > usage example:
>> > --mask-columns "t1.name, t2.descriptio

Re: Re[2]: Possible solution for masking chosen columns when using pg_dump

2022-10-12 Thread Виктория Шепард
Hi,

Here is an idea of how to read masking options from a file. Please, take a
look.

пн, 10 окт. 2022 г. в 14:54, Олег Целебровский :

> Hi,
>
> I applied most of suggestions: used separate files for most of added code,
> fixed typos/mistakes, got rid of that pesky TODO that was already
> implemented, just not deleted.
>
> Added tests (and functionality) for cases when you need to mask columns in
> tables with the same name in different schemas. If schema is not specified,
> then columns in all tables with specified name are masked (Example -
> pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all
> ids in all tables with names ‘t0’ in all existing schemas).
>
> Wrote comments for all ‘magic numbers’
>
> About that
>
> >- Also it can be hard to use a lot of different functions for different
> fields, maybe it would be better to set up functions in a file.
>
> I agree with that, but I know about at least 2 other patches (both are
> WIP, but still) that are interacting with reading command-line options from
> file. And if everyone will write their own version of reading command-line
> options from file, it will quickly get confusing.
>
> A solution to that problem is another patch that will put all options from
> file (one file for any possible options, from existing to future ones) into
> **argv in main, so that pg_dump can process them as if they came form
> command line.
>
>
> Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <
> we.vikt...@gmail.com>:
>
> Hi,
> I took a look, here are several suggestions for improvement:
>
> - Masking is not a main functionality of pg_dump and it is better to write
> most of the connected things in a separate file like parallel.c or
> dumputils.c. This will help slow down the growth of an already huge pg_dump
> file.
>
> - Also it can be hard to use a lot of different functions for different
> fields, maybe it would be better to set up functions in a file.
>
> - How will it work for the same field and tables in the different schemas?
> Can we set up the exact schema for the field?
>
> - misspelling in a word
> >/*
> >* Add all columns and funcions to list of MaskColumnInfo structures,
> >*/
>
> - Why did you use 256 here?
> > char* table = (char*) pg_malloc(256 * sizeof(char));
> Also for malloc you need malloc on 1 symbol more because you have to store
> '\0' symbol.
>
> - Instead of addFuncToDatabase you can run your query using something
> already defined from fe_utils/query_utils.c. And It will be better to set
> up a connection only once and create all functions. Establishing a
> connection is a resource-intensive procedure. There are a lot of magic
> numbers, better to leave some comments explaining why there are 64 or 512.
>
> - It seems that you are not using temp_string
> > char   *temp_string = (char*)malloc(256 * sizeof(char));
>
> - Grammar issues
> >/*
> >* mask_column_info_list contains info about every to-be-masked column:
> >* its name, a name its table (if nothing is specified - mask all columns
> with this name),
> >* name of masking function and name of schema containing this function
> (public if not specified)
> >*/
> the name of its table
>
>
> пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud  >:
>
> Hi,
>
> On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
> >
> > Hello, here's my take on masking data when using pg_dump
> >
> > The main idea is using PostgreSQL functions to replace data during a
> SELECT.
> > When table data is dumped SELECT a,b,c,d ... from ... query is
> generated, the columns that are marked for masking are replaced with result
> of functions on those columns
> > Example: columns name, count are to be masked, so the query will look as
> such: SELECT id, mask_text(name), mask_int(count), date from ...
> >
> > So about the interface: I added 2 more command-line options:
> >
> > --mask-columns, which specifies what columns from what tables will be
> masked
> > usage example:
> > --mask-columns "t1.name, t2.description" - both columns
> will be masked with the same corresponding function
> > or --mask-columns name - ALL columns with name "name" from
> all dumped tables will be masked with correspoding function
> >
> > --mask-function, which specifies what functions will mask data
> > usage example:
> > --mask-function mask_int - corresponding columns will be
> masked with function named "mask_int" from default schema (public)
> > or --mask-function my_schema.mask_varchar - same as above
>

Re: Possible solution for masking chosen columns when using pg_dump

2022-10-06 Thread Виктория Шепард
Hi,
I took a look, here are several suggestions for improvement:

- Masking is not a main functionality of pg_dump and it is better to write
most of the connected things in a separate file like parallel.c or
dumputils.c. This will help slow down the growth of an already huge pg_dump
file.

- Also it can be hard to use a lot of different functions for different
fields, maybe it would be better to set up functions in a file.

- How will it work for the same field and tables in the different schemas?
Can we set up the exact schema for the field?

- misspelling in a word
>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/

- Why did you use 256 here?
> char* table = (char*) pg_malloc(256 * sizeof(char));
Also for malloc you need malloc on 1 symbol more because you have to store
'\0' symbol.

- Instead of addFuncToDatabase you can run your query using something
already defined from fe_utils/query_utils.c. And It will be better to set
up a connection only once and create all functions. Establishing a
connection is a resource-intensive procedure. There are a lot of magic
numbers, better to leave some comments explaining why there are 64 or 512.

- It seems that you are not using temp_string
> char   *temp_string = (char*)malloc(256 * sizeof(char));

- Grammar issues
>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns
with this name),
>* name of masking function and name of schema containing this function
(public if not specified)
>*/
the name of its table


пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud :

> Hi,
>
> On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
> >
> > Hello, here's my take on masking data when using pg_dump
> >
> > The main idea is using PostgreSQL functions to replace data during a
> SELECT.
> > When table data is dumped SELECT a,b,c,d ... from ... query is
> generated, the columns that are marked for masking are replaced with result
> of functions on those columns
> > Example: columns name, count are to be masked, so the query will look as
> such: SELECT id, mask_text(name), mask_int(count), date from ...
> >
> > So about the interface: I added 2 more command-line options:
> >
> > --mask-columns, which specifies what columns from what tables will be
> masked
> > usage example:
> > --mask-columns "t1.name, t2.description" - both columns
> will be masked with the same corresponding function
> > or --mask-columns name - ALL columns with name "name" from
> all dumped tables will be masked with correspoding function
> >
> > --mask-function, which specifies what functions will mask data
> > usage example:
> > --mask-function mask_int - corresponding columns will be
> masked with function named "mask_int" from default schema (public)
> > or --mask-function my_schema.mask_varchar - same as above
> but with specified schema where the function is stored
> > or --mask-function somedir/filename - the function is
> "defined" here - more on the structure below
>
> FTR I wrote an extension POC [1] last weekend that does that but on the
> backend
> side.  The main advantage is that it's working with any existing versions
> of
> pg_dump (or any client relying on COPY or even plain interactive SQL
> statements), and that the DBA can force a dedicated role to only get a
> masked
> dump, even if they forgot to ask for it.
>
> I only had a quick look at your patch but it seems that you left some todo
> in
> russian, which isn't helpful at least to me.
>
> [1] https://github.com/rjuju/pg_anonymize
>
>
>